Segments count increased to around 7200, index remains unoptimized

2013-07-12 Thread A Geek
Hi All, I'm running Solr 4.0 on a Linux machine with around 30GB RAM. I have 2
cores running under Solr, as below:

Core AA: around 30 GB data, segments count = 30
Core BB: around 216 GB data, segments count = 300

Solr is running through Jetty, and I've allocated a max of 12GB heap memory
to the JVM, i.e. java -Xms4g -Xmx12g.
Note that another Java-based application of mine keeps feeding new data to the
Solr index, even while the optimizes below were running. I noticed that Solr
queries were running slowly and thought of running an optimize. From the Solr
web admin I clicked the optimize button for core AA, and after some time [30-40
mins] I saw that the segments count was reduced to 1, indicating it got
optimized.

Next I ran the same thing for the other core, BB. Its segments count kept
increasing, so I thought it would be good to shut down the application feeding
data to this index; by the time I closed that application, the segment count
had reached a very high value, ~7200. I think the optimize has stopped, because
I see 2 red circles next to "Optimized" and "Current", and clicking the
Optimize button does nothing. [Generally the small circle in the optimize
button keeps moving until the index gets fully optimized, which I didn't see
while optimizing core BB.] One thing I noticed is that the total index size
stayed at around 202GB after reaching this high segment count, unlike core AA,
where during optimization the index size grew to around 59GB and then shrank
back to 30GB, after which I saw 2 tick marks next to "Optimized" and "Current".

Since this is a production/live machine, I am a bit concerned and don't want to
lose any data or end up with a corrupt index. Should I just restart Solr [it's
running through Jetty]? Or is there some other step? Please advise on the
right/optimal step that ensures core BB gets optimized without data loss or
index corruption. I am a bit worried; please help. Thanks in advance.
Find attached the screenshot from SOLR admin pages for both Core AA and core 
BB, showing the segments count & index size.
Thanks again, DK
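
[For reference: the admin UI's optimize button issues a standard update
request, so the same optimize can be triggered directly over HTTP. A minimal
sketch, assuming a default local install (host, port, and core name are
illustrative):

    curl 'http://localhost:8983/solr/BB/update?optimize=true&maxSegments=1'

The optional maxSegments parameter is what a "lighter" partial optimize, like
the numSegments=2 variant discussed elsewhere in this digest, maps to.]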

Re: zero-valued retrieval scores

2013-07-12 Thread Joe Zhang
Thanks, Jack!


On Fri, Jul 12, 2013 at 9:37 PM, Jack Krupansky wrote:

> For the calculation of norm, see note number 6:
>
> http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
>
> You would need to talk to the Nutch guys to see why THEY are setting
> document boost to 0.0.
>
>
> -- Jack Krupansky
>
> -Original Message- From: Joe Zhang
> Sent: Friday, July 12, 2013 11:57 PM
> To: solr-user@lucene.apache.org
> Subject: Re: zero-valued retrieval scores
>
>
> Yes, you are right, the boost on these documents are 0. I didn't provide
> them, though.
>
> I suppose the boost scores come from Nutch (yes, my solr indexes crawled
> web docs). What could be wrong?
>
> again, what exactly is the formula for fieldNorm?
>
>
> On Fri, Jul 12, 2013 at 8:46 PM, Jack Krupansky wrote:
>
>> Did you put a boost of 0.0 on the documents, as opposed to the default of
>> 1.0?
>>
>> x * 0.0 = 0.0
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Joe Zhang
>> Sent: Friday, July 12, 2013 10:31 PM
>> To: solr-user@lucene.apache.org
>> Subject: zero-valued retrieval scores
>>
>>
>> when I search a keyword (such as "apple"), most of the docs carry 0.0 as
>> score. Here is an example from explain:
>>
>> str name="
>> http://www.bloomberg.com/slideshow/2013-07-12/world-at-**
>> **work-india.html
>> > 07-12/world-at-work-india.html
>> **>
>>
>> ">
>> 0.0 = (MATCH) fieldWeight(content:appl in 51), product of:
>>  1.0 = tf(termFreq(content:appl)=1)
>>  2.096877 = idf(docFreq=5190, maxDocs=15546)
>>  0.0 = fieldNorm(field=content, doc=51)
>> Can somebody help me understand why fieldNorm is 0? What exactly is the
>> formula for computing fieldNorm?
>>
>> Thanks!
>>
>>
>


Re: zero-valued retrieval scores

2013-07-12 Thread Jack Krupansky

For the calculation of norm, see note number 6:

http://lucene.apache.org/core/4_3_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html

You would need to talk to the Nutch guys to see why THEY are setting 
document boost to 0.0.
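
[For reference, the formula behind that note is, roughly:

    norm(t,d) = docBoost x fieldBoost x lengthNorm(field)
    lengthNorm(field) = 1 / sqrt(number of terms in the field)   (DefaultSimilarity)

The product is encoded into a single byte and stored as the fieldNorm that
explain prints, so a document boost of 0.0 forces fieldNorm, and every score
built on it, to 0.]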


-- Jack Krupansky

-Original Message- 
From: Joe Zhang

Sent: Friday, July 12, 2013 11:57 PM
To: solr-user@lucene.apache.org
Subject: Re: zero-valued retrieval scores

Yes, you are right, the boost on these documents are 0. I didn't provide
them, though.

I suppose the boost scores come from Nutch (yes, my solr indexes crawled
web docs). What could be wrong?

again, what exactly is the formula for fieldNorm?


On Fri, Jul 12, 2013 at 8:46 PM, Jack Krupansky 
wrote:



Did you put a boost of 0.0 on the documents, as opposed to the default of
1.0?

x * 0.0 = 0.0

-- Jack Krupansky

-Original Message- From: Joe Zhang
Sent: Friday, July 12, 2013 10:31 PM
To: solr-user@lucene.apache.org
Subject: zero-valued retrieval scores


when I search a keyword (such as "apple"), most of the docs carry 0.0 as
score. Here is an example from explain:

str name="
http://www.bloomberg.com/**slideshow/2013-07-12/world-at-**work-india.html
">
0.0 = (MATCH) fieldWeight(content:appl in 51), product of:
 1.0 = tf(termFreq(content:appl)=1)
 2.096877 = idf(docFreq=5190, maxDocs=15546)
 0.0 = fieldNorm(field=content, doc=51)
Can somebody help me understand why fieldNorm is 0? What exactly is the
formula for computing fieldNorm?

Thanks!





Re: zero-valued retrieval scores

2013-07-12 Thread Joe Zhang
Yes, you are right, the boost on these documents are 0. I didn't provide
them, though.

I suppose the boost scores come from Nutch (yes, my solr indexes crawled
web docs). What could be wrong?

again, what exactly is the formula for fieldNorm?


On Fri, Jul 12, 2013 at 8:46 PM, Jack Krupansky wrote:

> Did you put a boost of 0.0 on the documents, as opposed to the default of
> 1.0?
>
> x * 0.0 = 0.0
>
> -- Jack Krupansky
>
> -Original Message- From: Joe Zhang
> Sent: Friday, July 12, 2013 10:31 PM
> To: solr-user@lucene.apache.org
> Subject: zero-valued retrieval scores
>
>
> when I search a keyword (such as "apple"), most of the docs carry 0.0 as
> score. Here is an example from explain:
>
> str name="
> http://www.bloomberg.com/**slideshow/2013-07-12/world-at-**work-india.html
> ">
> 0.0 = (MATCH) fieldWeight(content:appl in 51), product of:
>  1.0 = tf(termFreq(content:appl)=1)
>  2.096877 = idf(docFreq=5190, maxDocs=15546)
>  0.0 = fieldNorm(field=content, doc=51)
> Can somebody help me understand why fieldNorm is 0? What exactly is the
> formula for computing fieldNorm?
>
> Thanks!
>


Re: zero-valued retrieval scores

2013-07-12 Thread Jack Krupansky
Did you put a boost of 0.0 on the documents, as opposed to the default of 
1.0?


x * 0.0 = 0.0

-- Jack Krupansky

-Original Message- 
From: Joe Zhang

Sent: Friday, July 12, 2013 10:31 PM
To: solr-user@lucene.apache.org
Subject: zero-valued retrieval scores

when I search a keyword (such as "apple"), most of the docs carry 0.0 as
score. Here is an example from explain:

str name="
http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html";>
0.0 = (MATCH) fieldWeight(content:appl in 51), product of:
 1.0 = tf(termFreq(content:appl)=1)
 2.096877 = idf(docFreq=5190, maxDocs=15546)
 0.0 = fieldNorm(field=content, doc=51)
Can somebody help me understand why fieldNorm is 0? What exactly is the
formula for computing fieldNorm?

Thanks! 



zero-valued retrieval scores

2013-07-12 Thread Joe Zhang
when I search a keyword (such as "apple"), most of the docs carry 0.0 as
score. Here is an example from explain:

str name="
http://www.bloomberg.com/slideshow/2013-07-12/world-at-work-india.html";>
0.0 = (MATCH) fieldWeight(content:appl in 51), product of:
  1.0 = tf(termFreq(content:appl)=1)
  2.096877 = idf(docFreq=5190, maxDocs=15546)
  0.0 = fieldNorm(field=content, doc=51)
Can somebody help me understand why fieldNorm is 0? What exactly is the
formula for computing fieldNorm?

Thanks!


java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2013-07-12 Thread Ali, Saqib
I am getting a java.lang.OutOfMemoryError: Requested array size exceeds VM
limit on certain queries.

Please advise:

19:25:02,632 INFO  [org.apache.solr.core.SolrCore]
(http-oktst1509.company.tld/12.5.105.96:8180-9) [collection1] webapp=/solr
path=/select
params={sort=sent_date+asc&distrib=false&wt=javabin&version=2&rows=2147483647&df=text&fl=id&shard.url=
12.5.105.96:8180/solr/collection1/&NOW=1373675102627&start=0&q=thread_id:1439513570014188310&isShard=true&fq=domain:company.tld+AND+owner:11782344&fsv=true}
hits=1 status=0 QTime=1
19:25:02,637 ERROR [org.apache.solr.servlet.SolrDispatchFilter]
(http-oktst1509.company.tld/12.5.105.96:8180-2)
null:java.lang.RuntimeException: java.lang.OutOfMemoryError: Requested
array size exceeds VM limit
at
org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:280)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:248)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:275)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:161)
at
org.jboss.as.web.security.SecurityContextAssociationValve.invoke(SecurityContextAssociationValve.java:169)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:372)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:877)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:679)
at
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:931)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
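
[The rows=2147483647 in the logged request asks for Integer.MAX_VALUE rows,
which is what drives Lucene to request an oversized array. A minimal SolrJ
sketch of the usual alternative, fetching in bounded pages; the endpoint,
query strings, and page size below are illustrative, not taken from the
original post:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class PagedFetch {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server =
                new HttpSolrServer("http://localhost:8180/solr/collection1");
            SolrQuery q = new SolrQuery("thread_id:1439513570014188310");
            q.addFilterQuery("domain:company.tld AND owner:11782344");
            q.setSort("sent_date", SolrQuery.ORDER.asc);
            q.setFields("id");
            final int pageSize = 1000; // bounded page, not rows=2147483647
            long numFound = Long.MAX_VALUE;
            for (int start = 0; start < numFound; start += pageSize) {
                q.setStart(start);
                q.setRows(pageSize);
                QueryResponse rsp = server.query(q);
                numFound = rsp.getResults().getNumFound(); // total hits
                // process rsp.getResults() here ...
            }
            server.shutdown();
        }
    }

Deep paging with start still walks the hit queue page by page, but the
allocation stays bounded by start+rows rather than Integer.MAX_VALUE.]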


Re: Norms

2013-07-12 Thread Lance Norskog
Norms stay in the index even if you delete all of the data. If you just 
changed the schema, emptied the index, and tested again, you've still 
got norms in there.


You can examine the index with Luke to verify this.

On 07/09/2013 08:57 PM, William Bell wrote:

I have a field that has omitNorms=true, but when I look at debugQuery I see
that
the field is being normalized for the score.

What can I do to turn off normalization in the score?

I want a simple way to do 2 things:

boost geodist() highest at 1 mile and lowest at 100 miles.
plus add a boost for a query=edgefield^5.

I only want tf() and no queryNorm. I am not even sure I want idf() but I
can probably live with rare names being boosted.



The results are being normalized. See below. I tried dismax and edismax -
bf, bq and boost.



none
edismax
0.01

display_name,city_state,prov_url,pwid,city_state_alternative


sum(recip(geodist(store_geohash), .5, 6, 6), 0.1)
5
*:*
name_edgy^.9 name_edge^.9 name_word
true
pwid
true

score desc, last_name asc
100
39.740112,-104.984856
store_geohash
false
name_edgy
2<-1 4<-2 6<-3



0.058555886 = queryNorm

product of: 10.854807 = (MATCH) sum of: 1.8391232 = (MATCH) max plus 0.01
times others of: 1.8214592 = (MATCH) weight(name_edge:paul^0.9 in 231378),
product of: 0.30982485 = queryWeight(name_edge:paul^0.9), product of: 0.9 =
boost 5.8789964 = idf(docFreq=26567, maxDocs=3493655)* 0.058555886 =
queryNorm* 5.8789964 = (MATCH) fieldWeight(name_edge:paul in 231378),
product of: 1.0 = tf(termFreq(name_edge:paul)=1) 5.8789964 =
idf(docFreq=26567, maxDocs=3493655) 1.0 = fieldNorm(field=name_edge,
doc=231378) 1.7664119 = (MATCH) weight(name_edgy:paul^0.9 in 231378),
product of: 0.30510724 = queryWeight(name_edgy:paul^0.9), product of: 0.9 =
boost 5.789479 = idf(docFreq=29055, maxDocs=3493655)* 0.058555886 =
queryNorm* 5.789479 = (MATCH) fieldWeight(name_edgy:paul in 231378),
product of: 1.0 = tf(termFreq(name_edgy:paul)=1) 5.789479 =
idf(docFreq=29055, maxDocs=3493655) 1.0 = fieldNorm(field=name_edgy,
doc=231378) 9.015684 = (MATCH) max plus 0.01 times others of: 8.9352665 =
(MATCH) weight(name_word:nutting in 231378), product of: 0.72333425 =
queryWeight(name_word:nutting), product of: 12.352887 = idf(docFreq=40,
maxDocs=3493655) 0.058555886 = queryNorm 12.352887 = (MATCH)
fieldWeight(name_word:nutting in 231378), product of: 1.0 =
tf(termFreq(name_word:nutting)=1) 12.352887 = idf(docFreq=40,
maxDocs=3493655) 1.0 = fieldNorm(field=name_word, doc=231378) 8.04174 =
(MATCH) weight(name_edgy:nutting^0.9 in 231378), product of: 0.65100086 =
queryWeight(name_edgy:nutting^0.9), product of: 0.9 = boost 12.352887 =
idf(docFreq=40, maxDocs=3493655)* 0.058555886 = queryNorm* 12.352887 =
(MATCH) fieldWeight(name_edgy:nutting in 231378), product of: 1.0 =
tf(termFreq(name_edgy:nutting)=1) 12.352887 = idf(docFreq=40,
maxDocs=3493655) 1.0 = fieldNorm(field=name_edgy, doc=231378) 1.0855998 =
sum(6.0/(0.5*float(geodist(39.74168747663498,-104.9849385023117,39.740112,-104.984856))+6.0),const(0.1))







Re: add to ContributorsGroup - Instructions for setting up SolrCloud on jboss

2013-07-12 Thread Ali, Saqib
username: saqib


On Fri, Jul 12, 2013 at 2:35 PM, Ali, Saqib  wrote:

> Hello,
>
> Can you please add me to the ContributorsGroup? I would like to add
> instructions for setting up SolrCloud using Jboss.
>
> thanks.
>
>


add to ContributorsGroup - Instructions for setting up SolrCloud on jboss

2013-07-12 Thread Ali, Saqib
Hello,

Can you please add me to the ContributorsGroup? I would like to add
instructions for setting up SolrCloud using Jboss.

thanks.


Re: add to ContributorsGroup

2013-07-12 Thread Erick Erickson
Done, Thanks for helping!

Erick

On Fri, Jul 12, 2013 at 4:30 PM, Ken Geis  wrote:
> Hi. Could you add me (KenGeis) to the Solr Wiki ContributorsGroup? I'd like
> to fix some typos.
>
>
> Thanks,
>
> Ken Geis


add to ContributorsGroup

2013-07-12 Thread Ken Geis
Hi. Could you add me (KenGeis) to the Solr Wiki ContributorsGroup? I'd 
like to fix some typos.



Thanks,

Ken Geis


solr autodetectparser tikaconfig dataimporter error

2013-07-12 Thread Andreas Owen
I am using Solr 3.5, tika-app-1.4 and tagsoup 1.2.1. When I try to import a
file via XML I get this error; it doesn't matter what file format I try to
index (txt, cfm, pdf), it's all the same error:

SEVERE: Exception while processing: rec document :
SolrInputDocument[{id=id(1.0)={myTest.txt},
title=title(1.0)={Beratungsseminar kundenbrief},
contents=contents(1.0)={wie kommuniziert man},
author=author(1.0)={Peter Z.},
path=path(1.0)={download/online}}]:org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.NoSuchMethodError:
org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: java.lang.NoSuchMethodError:
org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
        at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
        ... 6 more

Jul 11, 2013 5:23:36 PM org.apache.solr.common.SolrException log
SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.NoSuchMethodError:
org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:669)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: java.lang.NoSuchMethodError:
org.apache.tika.parser.AutoDetectParser.setConfig(Lorg/apache/tika/config/TikaConfig;)V
        at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:122)
        at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
        ... 6 more

Jul 11, 2013 5:23:36 PM org.apache.solr.update.DirectUpdateHandler2 rollback

data-config.xml (the archive stripped most of the XML tags; only this fragment
of the dataSource declaration survives):

<dataSource ... baseUrl="http://127.0.0.1/tkb/internet/" name="main"/>
The libs are included and declared in the logs. I have also tried tika-app
1.0 and tagsoup 1.2, with the same result. Can someone please help? I don't
know where to start looking for the error.

Re: Problem using Term Component in solr

2013-07-12 Thread Erick Erickson
Is the vocabulary known? That is, do you know the abbreviations that
will be used? If so, you could consider synonyms, in which case you'd
go to tokenized titles and use phrase queries to get your matches...

Regexes often don't scale extremely well, although the 4.x FST
implementations are much faster than they used to be.

It seems to me that regularizing the titles is a better idea than
trying to fake it with regexes, but you know your problem space better
than me...

Best
Erick
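
[A sketch of the synonym approach, with the mappings invented for this use
case; in practice they would come from the known journal-title vocabulary:

    # synonyms.txt (illustrative)
    med => medical
    phys => physics

wired into the title field type's analyzer with something like:

    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>

so that 'med. phys.' and 'medical physics' analyze to the same tokens.]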

On Fri, Jul 12, 2013 at 1:32 PM, Parul Gupta(Knimbus)
 wrote:
> Hi,
> Ok I will not use Bold text in my queries
>
> I guess my question is not clear to you
>
> What I am doing is: I have a live source, say 'A', and a stored database,
> say 'B'. A and B both have title fields in them. Consider A as a
> non-persistent Solr and B as a persistent Solr.
>
> I have to match titles coming from A against the database B.
>
> Some titles from live source A come in short form, e.g. 'med. phys.' and
> 'phys. fluids', but for these titles my database B has the titles 'medical
> physics' and 'physics of fluids'.
> Because of these differences, A is not able to find the corresponding
> titles in B using the tokenized 'title' field with wildcards, so I used the
> Terms component first, which gives me the corresponding matched title in B.
> When I get the full title, like 'medical physics', I fetch it from the HTML
> and then search it again in a tokenized copy of 'title', say 'titlenew'
> (a copyField of title), which brings me the result 'medical physics'. But I
> am failing to match 'phys. fluids' with 'physics of fluids', as it has a
> stop word in it, using [a-z0-9]*.
>
> I hope you get my issue now, and can help.
> Thanks.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Problem-using-Term-Component-in-solr-tp4077200p4077628.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Save Solr index in database

2013-07-12 Thread Upayavira
If they ask, tell them that Solr *is* a database. Databases store their
stuff on a file system, so your data is gonna end up there in the end.
Putting Solr indexes inside a database is like storing mysql tables in
Oracle.

Upayavira

On Fri, Jul 12, 2013, at 08:18 PM, Sagar Jadhav wrote:
> I think that makes a lot of sense as I was reading the Solr Cloud
> technique.
> Thanks a lot Shawn for the validation. 
> Thanks a lot everyone for helping me out to go in the right direction. I
> really appreciate all the inputs. I will now go back and get the
> exception
> for getting access to the filesystem.
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Save-Solr-index-in-database-tp4077649p4077673.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Save Solr index in database

2013-07-12 Thread Sagar Jadhav
I think that makes a lot of sense as I was reading the Solr Cloud technique.
Thanks a lot Shawn for the validation. 
Thanks a lot everyone for helping me out to go in the right direction. I
really appreciate all the inputs. I will now go back and get the exception
for getting access to the filesystem.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Save-Solr-index-in-database-tp4077649p4077673.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Save Solr index in database

2013-07-12 Thread Shawn Heisey

On 7/12/2013 12:51 PM, Sagar Jadhav wrote:

The reason for going that route is because our application is clustered and
if the indexing information is on the filesystem, I am not sure whether that
would be replicated. At the same time since its a product it needs to be
packaged with the product and also from a proprietary reason we are not
allowed to use the filesystem.


Solr can do replication from a master server to slaves.  If you 
implement as SolrCloud, then you would have a clustered solution with no 
master/slave designations.  SolrCloud requires a three server minimum 
for a robust deployment.  The third server can be a wimpy thing that 
only runs zookeeper.


Putting your index in a DB is just a bad idea.  It would be hard to find 
help with it, and performance would not be good.


Thanks,
Shawn



Re: Save Solr index in database

2013-07-12 Thread Gora Mohanty
On 13 July 2013 00:19, Shawn Heisey  wrote:
> On 7/12/2013 12:30 PM, sagarmj76 wrote:
>>
>> hi I wanted to understand if it is possible to store/save Solr indexes to
>> the
>> database instead of the filesystem. I checked out some articles where
>> lucene
>> can do it. Hence I assume Solr can too but its not clear to me how to
>> configure Solr to save the indexes in the database instead in the /index
>> directory.
>> Any help is really appreciated as I think I have hit a wall with this.
[...]

As others have noted, think twice about why you
would want to do this. Lucene does it through
JdbcDirectory but as far as I know this is only an
interface without a concrete implementation, though
apparently third-party libraries are available that
implement JdbcDirectory. The Lucene FAQ notes
that this is slow:
http://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_store_the_Lucene_index_in_a_relational_database.3F

Regards,
Gora


Re: Save Solr index in database

2013-07-12 Thread Sagar Jadhav
The reason for going that route is because our application is clustered and
if the indexing information is on the filesystem, I am not sure whether that
would be replicated. At the same time since its a product it needs to be
packaged with the product and also from a proprietary reason we are not
allowed to use the filesystem.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Save-Solr-index-in-database-tp4077649p4077662.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Save Solr index in database

2013-07-12 Thread Shawn Heisey

On 7/12/2013 12:30 PM, sagarmj76 wrote:

hi I wanted to understand if it is possible to store/save Solr indexes to the
database instead of the filesystem. I checked out some articles where lucene
can do it. Hence I assume Solr can too but its not clear to me how to
configure Solr to save the indexes in the database instead in the /index
directory.
Any help is really appreciated as I think I have hit a wall with this.


If Lucene can do it, then theoretically Solr can do so as well.  You 
could very likely add jars to your classpath (to add a Directory and 
DirectoryFactory implementation that uses a database) and reference that 
class in the Solr config, but unless the class provided a way to 
configure itself, you probably wouldn't be able to specify its config 
within Solr's config without custom plugin code.
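
[The wiring Shawn describes would look roughly like this in solrconfig.xml;
the class name is hypothetical, standing in for whatever JDBC-backed
Directory implementation one found or wrote:

    <directoryFactory name="DirectoryFactory"
                      class="com.example.JdbcDirectoryFactory"/>
]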


A burning question ... why would you want to do this?  Lucene and Solr 
are highly optimized to work well with a local filesystem.  That is the 
path that will give you the best performance.


Thanks,
Shawn



Re: Save Solr index in database

2013-07-12 Thread Alexandre Rafalovitch
And why would you want to do that? Seems rather wrong direction to march in.

I am assuming relational database. There is a commercial solution that
integrates Solr into Cassandra, if I understood it correctly:
http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-solr
Even then, there might be some stuff on the filesystem.

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Fri, Jul 12, 2013 at 2:30 PM, sagarmj76  wrote:

> hi I wanted to understand if it is possible to store/save Solr indexes to
> the
> database instead of the filesystem. I checked out some articles where
> lucene
> can do it. Hence I assume Solr can too but its not clear to me how to
> configure Solr to save the indexes in the database instead in the /index
> directory.
> Any help is really appreciated as I think I have hit a wall with this.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Save-Solr-index-in-database-tp4077649.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Performance of cross join vs block join

2013-07-12 Thread Mikhail Khludnev
Hello Roman,

Thanks for your interest. I briefly looked at your approach, and I'm really
interested in your numbers.

Here is the trivial code; I'd prefer to rely on your testing framework, and
can provide you a version of Solr 4.2 with SOLR-3076 applied. Do you need it?
https://github.com/m-khl/join-tester

What you are saying about benchmark representativeness definitely makes
sense. I didn't try to establish a completely representative benchmark; I
just wanted rough numbers, related to my use case, certainly. I'm from
eCommerce, and that volume was enough for me.

What I didn't get is 'not the block joins, because these cannot be used for
citation data - we cannot reasonably index them into one segment'. Usually
there is no problem with blocks in a multi-segment index; a block definitely
can't span across segments. Anyway, please elaborate.
One of block join's benefits is the ability to hit only the first matched
child in a group and jump over the following ones. It isn't applicable in
general, but sometimes gives a huge gain.

On Fri, Jul 12, 2013 at 8:29 PM, Roman Chyla  wrote:

> Hi Mikhail,
> I have commented on your blog, but it seems I have done st wrong, as the
> comment is not there. Would it be possible to share the test setup
> (script)?
>
> I have found out that the crucial thing with joins is the number of 'joins'
> [hits returned] and it seems that the experiments I have seen so far were
> geared towards small collection - even if Erick's index was 26M, the number
> of hits was probably small - you can see a very different story if you face
> some [other] real data. Here is a citation network and I was comparing
> lucene join's [ie not the block joins, because these cannot be used for
> citation data - we cannot reasonably index them into one segment])
>
>
> https://github.com/romanchyla/r-ranking-fun/blob/master/plots/raw/comparison-join-2nd.png
>
> Notice, the y axis is sqrt, so the running time for lucene join is growing
> and growing very fast! It takes lucene 30s to do the search that selects 1M
> hits.
>
> The comparison is against our own implementation of a similar search - but
> the main point I am making is that the join benchmarks should be showing
> the number of hits selected by the join operation. Otherwise, a very
> important detail is hidden.
>
> Best,
>
>   roman
>
>
> On Fri, Jul 12, 2013 at 4:57 AM, Mikhail Khludnev <
> mkhlud...@griddynamics.com> wrote:
>
> > On Fri, Jul 12, 2013 at 12:19 PM, mihaela olteanu wrote:
> >
> > > Hi Mikhail,
> > >
> > > I have used wrong the term block join. When I said block join I was
> > > referring to a join performed on a single core versus cross join which
> > was
> > > performed on multiple cores.
> > > But I saw your benchmark (from cache) and it seems that block join has
> > > better performance. Is this functionality available on Solr 4.3.1?
> >
> > nope SOLR-3076 awaits for ages.
> >
> >
> > > I did not find such examples on Solr's wiki page.
> > > Does this functionality require a special schema, or a special
> indexing?
> >
> > Special indexing - yes.
> >
> >
> > > How would I need to index the data from my tables? In my case anyway
> all
> > > the indices have a common schema since I am using dynamic fields, thus
> I
> > > can easily add all documents from all tables in one Solr core, but for
> > each
> > > document to add a discriminator field.
> > >
> > correct. but notion of ' discriminator field' is a little bit different
> for
> > blockjoin.
> >
> >
> > >
> > > Could you point me to some more documentation?
> > >
> >
> > I can recommend only those
> >
> >
> http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
> > http://www.youtube.com/watch?v=-OiIlIijWH0
> >
> >
> > > Thanks in advance,
> > > Mihaela
> > >
> > >
> > > 
> > >  From: Mikhail Khludnev 
> > > To: solr-user ; mihaela olteanu <
> > > mihaela...@yahoo.com>
> > > Sent: Thursday, July 11, 2013 2:25 PM
> > > Subject: Re: Performance of cross join vs block join
> > >
> > >
> > > Mihaela,
> > >
> > For me it's reasonable that a single-core join takes the same time as a
> > cross-core one; I just can't see what gain could be obtained in the
> > former case.
> > I'm hardly able to comment on the join code; I looked into it, and it's
> > not trivial, at least. With block join it doesn't need to obtain parentId
> > term values/numbers and look up parents by them. Both of these actions
> > are expensive. Also, block join works as an iterator, but join needs to
> > allocate memory for the parents bitset and populate it out of order,
> > which impacts scalability.
> > Also, in None scoring mode BJQ doesn't need to walk through all children,
> > but only hits the first. Another nice feature is 'both-side leapfrog': if
> > you have a highly restrictive filter/query intersecting with BJQ, it
> > allows skipping many parents and children as well, which is not possible
> > in Join, which has a fairly 'full-scan' nature.
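
[The block layout described above comes from Lucene's join module: children
are indexed contiguously with, and immediately before, their parent. A rough
Lucene 4.x sketch; field names and values are invented for illustration:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.CachingWrapperFilter;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.join.ScoreMode;
    import org.apache.lucene.search.join.ToParentBlockJoinQuery;

    class BlockJoinSketch {
        // Children first, parent last; addDocuments keeps the block
        // contiguous within one segment, which is why a block cannot
        // span segments.
        static void indexBlock(IndexWriter writer) throws Exception {
            List<Document> block = new ArrayList<Document>();
            Document child = new Document();
            child.add(new StringField("type", "child", Field.Store.NO));
            Document parent = new Document();
            parent.add(new StringField("type", "parent", Field.Store.NO));
            block.add(child);
            block.add(parent);
            writer.addDocuments(block);
        }

        // Match children, return their parents. ScoreMode.None lets BJQ
        // stop at the first matching child per block - the early-out
        // described above.
        static Query childrenToParents() {
            Filter parents = new CachingWrapperFilter(
                new QueryWrapperFilter(
                    new TermQuery(new Term("type", "parent"))));
            return new ToParentBlockJoinQuery(
                new TermQuery(new Term("type", "child")),
                parents, ScoreMode.None);
        }
    }
]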

Save Solr index in database

2013-07-12 Thread sagarmj76
hi I wanted to understand if it is possible to store/save Solr indexes to the
database instead of the filesystem. I checked out some articles where lucene
can do it. Hence I assume Solr can too but its not clear to me how to
configure Solr to save the indexes in the database instead in the /index
directory.
Any help is really appreciated as I think I have hit a wall with this.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Save-Solr-index-in-database-tp4077649.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to set a condition on the number of docs found

2013-07-12 Thread Matt Lieber
Thanks William, I'll do that.

Matt


On 7/12/13 7:38 AM, "William Bell"  wrote:

>Hmmm. One way is:
>
>http://localhost:8983/solr/core/select/?q=*%3A*&facet=true&facet.field=id&facet.offset=10&rows=0&facet.limit=1
>
>If you have a result, you have results > 10.
>
>Another way is to just look at it with a facet.query and have your app deal
>with it:
>
>http://localhost:8983/solr/core/select/?q=*%3A*&facet=true&facet.query={!lucene%20key=numberofresults}state:CO&rows=0
>
>
>
>
>On Thu, Jul 11, 2013 at 11:45 PM, Matt Lieber  wrote:
>
>> Hello there,
>>
>> I would like to be able to know whether I got over a certain threshold
>>of
>> doc results.
>>
>> I.e. Test (Result.numFound > 10 ) -> true.
>>
>> Is there a way to do this ? I can't seem to find how to do this; (other
>> than have to do this test on the client app, which is not great).
>>
>> Thanks,
>> Matt
>>
>>
>>
>
>
>
>--
>Bill Bell
>billnb...@gmail.com
>cell 720-256-8076











RE: expunging deletes

2013-07-12 Thread Petersen, Robert
OK Thanks Shawn,

I went with the settings below because 10 wasn't working for us, and it looks
like my index is staying under 20 GB now, with numDocs: 16897524 and maxDoc:
19048053:


  5
  5
  15
  6144.0
  6.0
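
[The archive stripped the XML element names from the block above; only the
values survive. Assuming TieredMergePolicy, which Shawn's mergeFactor
discussion below implies, they would sit in elements along these lines,
though the exact mapping is a guess:

    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <int name="maxMergeAtOnce">5</int>
      <int name="segmentsPerTier">5</int>
      <int name="maxMergeAtOnceExplicit">15</int>
      <double name="maxMergedSegmentMB">6144.0</double>
      <double name="reclaimDeletesWeight">6.0</double>
    </mergePolicy>
]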




-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Wednesday, July 10, 2013 5:34 PM
To: solr-user@lucene.apache.org
Subject: Re: expunging deletes

On 7/10/2013 5:58 PM, Petersen, Robert wrote:
> Using solr 3.6.1 and the following settings, I am trying to run without 
> optimizes.  I used to optimize nightly, but sometimes the optimize took a 
> very long time to complete and slowed down our indexing.  We are continuously 
> indexing our new or changed data all day and night.  After a few days running 
> without an optimize, the index size has nearly doubled and maxdocs is nearly 
> twice the size of numdocs.  I understand deletes should be expunged on 
> merges, but even after trying lots of different settings for our merge policy 
> it seems this growth is somewhat unbounded.  I have tried sending an optimize 
> with numSegments = 2 which is a lot lighter weight then a regular optimize 
> and that does bring the number down but not by too much.  Does anyone have 
> any ideas for better settings for my merge policy that would help?  Here is 
> my current index snapshot too:

Your merge settings are the equivalent of the old mergeFactor set to 35, and 
based on the fact that you have the Explicit set to 105, I'm guessing your 
settings originally came from something I posted - these are the numbers that I 
use.  These settings can result in a very large number of segments on your disk.

Because you index a lot (and probably reindex existing documents often), I can 
understand why you have high merge settings, but if you want to eliminate 
optimizes, you'll need to go lower.  The default merge setting of 10 (with an 
Explicit value of 30) is probably a good starting point, but you might need to 
go even smaller.

On Solr 3.6, an optimize probably cannot take place at the same time as index 
updates -- the optimize would probably delay updates until after it's finished. 
 I remember running into problems on Solr 3.x, so I set up my indexing program 
to stop updates while the index was optimizing.

Solr 4.x should lift any restriction where optimizes and updates can't happen 
at the same time.

With an index size of 25GB, a six-drive RAID10 should be able to optimize in 
10-15 minutes, but if your I/O system is single disk, RAID1, RAID5, or RAID6, 
the write performance may cause this to take longer.
If you went with SSD, optimizes would happen VERY fast.

Thanks,
Shawn





Re: Problem using Term Component in solr

2013-07-12 Thread Parul Gupta(Knimbus)
Hi,
Ok, I will not use bold text in my queries.

I guess my question was not clear to you.

What I am doing is: I have a live source, say 'A', and a stored database,
say 'B'. A and B both have title fields in them. Consider A as a
non-persistent Solr and B as a persistent Solr.

I have to match titles coming from A against the database B.

Some titles from live source A come in short form, e.g. 'med. phys.' and
'phys. fluids', but for these titles my database B has the titles 'medical
physics' and 'physics of fluids'.
Because of these differences, A is not able to find the corresponding
titles in B using the tokenized 'title' field with wildcards, so I used the
Terms component first, which gives me the corresponding matched title in B.
When I get the full title, like 'medical physics', I fetch it from the HTML
and then search it again in a tokenized copy of 'title', say 'titlenew'
(a copyField of title), which brings me the result 'medical physics'. But I
am failing to match 'phys. fluids' with 'physics of fluids', as it has a
stop word in it, using [a-z0-9]*.

I hope you get my issue now, and can help.
Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Problem-using-Term-Component-in-solr-tp4077200p4077628.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Performance of cross join vs block join

2013-07-12 Thread Roman Chyla
Hi Mikhail,
I have commented on your blog, but it seems I have done st wrong, as the
comment is not there. Would it be possible to share the test setup (script)?

I have found out that the crucial thing with joins is the number of 'joins'
[hits returned] and it seems that the experiments I have seen so far were
geared towards small collection - even if Erick's index was 26M, the number
of hits was probably small - you can see a very different story if you face
some [other] real data. Here is a citation network and I was comparing
lucene join's [ie not the block joins, because these cannot be used for
citation data - we cannot reasonably index them into one segment])

https://github.com/romanchyla/r-ranking-fun/blob/master/plots/raw/comparison-join-2nd.png

Notice, the y axis is sqrt, so the running time for lucene join is growing
and growing very fast! It takes lucene 30s to do the search that selects 1M
hits.

The comparison is against our own implementation of a similar search - but
the main point I am making is that the join benchmarks should be showing
the number of hits selected by the join operation. Otherwise, a very
important detail is hidden.

Best,

  roman


On Fri, Jul 12, 2013 at 4:57 AM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> On Fri, Jul 12, 2013 at 12:19 PM, mihaela olteanu wrote:
>
> > Hi Mikhail,
> >
> > I have used wrong the term block join. When I said block join I was
> > referring to a join performed on a single core versus cross join which
> was
> > performed on multiple cores.
> > But I saw your benchmark (from cache) and it seems that block join has
> > better performance. Is this functionality available on Solr 4.3.1?
>
> nope SOLR-3076 awaits for ages.
>
>
> > I did not find such examples on Solr's wiki page.
> > Does this functionality require a special schema, or a special indexing?
>
> Special indexing - yes.
>
>
> > How would I need to index the data from my tables? In my case anyway all
> > the indices have a common schema since I am using dynamic fields, thus I
> > can easily add all documents from all tables in one Solr core, but for
> each
> > document to add a discriminator field.
> >
> correct. but notion of ' discriminator field' is a little bit different for
> blockjoin.
>
>
> >
> > Could you point me to some more documentation?
> >
>
> I can recommend only those
>
> http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
> http://www.youtube.com/watch?v=-OiIlIijWH0
>
>
> > Thanks in advance,
> > Mihaela
> >
> >
> > 
> >  From: Mikhail Khludnev 
> > To: solr-user ; mihaela olteanu <
> > mihaela...@yahoo.com>
> > Sent: Thursday, July 11, 2013 2:25 PM
> > Subject: Re: Performance of cross join vs block join
> >
> >
> > Mihaela,
> >
> > For me it's reasonable that a single-core join takes the same time as a
> > cross-core one; I just can't see what gain could be obtained in the
> > former case.
> > I'm hardly able to comment on the join code; I looked into it, and it's
> > not trivial, at least. With block join it doesn't need to obtain parentId
> > term values/numbers and look up parents by them. Both of these actions
> > are expensive. Also, block join works as an iterator, but join needs to
> > allocate memory for the parents bitset and populate it out of order,
> > which impacts scalability.
> > Also, in None scoring mode BJQ doesn't need to walk through all children,
> > but only hits the first. Another nice feature is 'both-side leapfrog': if
> > you have a highly restrictive filter/query intersecting with BJQ, it
> > allows skipping many parents and children as well, which is not possible
> > in Join, which has a fairly 'full-scan' nature.
> > The main performance factor for Join is the number of child docs.
> > I'm not sure I got all your questions; please specify them in more
> > detail if something is still unclear.
> > Have you seen my benchmark
> > http://blog.griddynamics.com/2012/08/block-join-query-performs.html ?
> >
> >
> >
> > > On Thu, Jul 11, 2013 at 1:52 PM, mihaela olteanu wrote:
> >
> > > Hello,
> > >
> > > Does anyone know about some measurements in terms of performance for
> > cross
> > > joins compared to joins inside a single index?
> > >
> > > Is it faster the join inside a single index that stores all documents
> of
> > > various types (from parent table or from children tables)with a
> > > discriminator field compared to the cross join (basically in this case
> > each
> > > document type resides in its own index)?
> > >
> > > I have performed some tests but to me it seems that having a join in a
> > > single index (bigger index) does not add too much speed improvements
> > > compared to cross joins.
> > >
> > > Why a block join would be faster than a cross join if this is the case?
> > > What are the variables that count when trying to improve the query
> > > execution time?
> > >
> > > Thanks!
> > > Mihaela
> >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Principal Engineer,
> > Grid Dynam

Re: Patch review request: SOLR-5001 (adding book links to the website)

2013-07-12 Thread Steve Rowe
Hi Alexandre,

I'll work on this today.

Steve

On Jul 12, 2013, at 8:26 AM, Alexandre Rafalovitch  wrote:

> Hello,
> 
> As per earlier email thread, I have created a patch for Solr website to
> incorporate links to my new book.
> 
> It would be nice if somebody with commit rights for the (markdown) website
> could look at it before the book's Solr version (4.3.1) stops being the
> latest :-)
> 
> I promise to help with the new Wiki/Guide later in return.
> 
> https://issues.apache.org/jira/browse/SOLR-5001
> 
> Regards,
>   Alex.
> 
> Personal website: http://www.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)



RE: solr 4.3.0 cloud in Tomcat, link many collections to Zookeeper

2013-07-12 Thread Zhang, Lisheng
Thanks very much for all the helps!

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Friday, July 12, 2013 7:31 AM
To: solr-user@lucene.apache.org
Subject: Re: solr 4.3.0 cloud in Tomcat, link many collections to
Zookeeper


On 7/12/2013 7:29 AM, Zhang, Lisheng wrote:
> Sorry I might not have asked clearly, our issue is that we have 
> a few thousand collections (can be much more), so running that 
> command is rather tedious; is there a simpler way (all collections
> share same schema/config)?

When you create each collection with the Collections API (http calls),
you tell it the name of a config set stored in zookeeper.  You can give
all your collections the same config set if you like.

If you manually create collections with the CoreAdmin API instead, you
must use the zkcli script included in Solr to link the collection to the
config set, which can be done either before or after the collection is
created.  The zkcli script provides some automation for the java command
that you were given by Furkan.
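
[A sketch of that zkcli step; collection name, config name, and ZooKeeper
address are illustrative, and the script path varies by install:

    cloud-scripts/zkcli.sh -cmd linkconfig -collection coll1 \
        -confname sharedconf -zkhost zk1:2181

Run once per collection, it can be scripted over the few thousand collection
names rather than typed by hand.]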

Thanks,
Shawn



Re: What does "too many merges...stalling" in indexwriter log mean?

2013-07-12 Thread Shawn Heisey
On 7/12/2013 9:23 AM, Tom Burton-West wrote:
> Do you have any feeling for what gets traded off if we increase the
> maxMergeCount?
> 
> This is completely new for us because we are experimenting with indexing
> pages instead of whole documents.  Since our average document is about 370
> pages, this means that we have increased the number of documents we are
> asking Solr to index by a couple of orders of magnitude. (on the other hand
> the size of the document decreases by a couple of orders of magnitude).
> I'm not sure why increasing the number of documents (and reducing their
> size) is causing more merges.  I'll have to investigate.

I'm not sure that you lose anything, really.  If everything is
proceeding normally before the "stalling" message is logged, I would not
expect it to cause ANY problems.

The reason that I increased this value was because when I did a
full-import of millions of documents from mysql, I would reach the point
where there were three different levels of merges going on at once.
Because the default thread count is one, only the largest merge was
actually occurring, the others were queued and waiting.

With three merges stacked up at once, I had passed the maxMergeCount
threshold, so *indexing* stopped.  It can take several minutes for a
very large merge to finish, so indexing stopped long enough that the
MySQL server would drop the connection established by the JDBC driver.
Once the merge finished and DIH tried to resume indexing, the connection
was gone and it would fail the entire import.

I have never seen more than three merge levels happening at once, so a
value of 6 is probably overkill, but shouldn't be a problem.  The true
goal is to make sure that indexing never stops, not to push the system
limits.  The maxThreadCount parameter should prevent I/O from becoming a
problem.

Thanks,
Shawn



Re: How to set a condition over stats result

2013-07-12 Thread Jack Krupansky
sum(x, y, z) = x + y + z (sums those specific field values for the current
document)


sum(x, y) = x + y (sum of those two specific field values for the current 
document)


sum(x) = field(x) = x (the specific field value for the current document)

The "sum" function in function queries is not an aggregate function. Ditto 
for min and max.
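
[Concretely, with field names invented for illustration:

    fl=id,total:sum(price,tax)

returns price+tax computed per returned document; nothing is aggregated
across the result set.]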


-- Jack Krupansky

-Original Message- 
From: mihaela olteanu

Sent: Friday, July 12, 2013 1:44 AM
To: solr-user@lucene.apache.org
Subject: Re: How to set a condition over stats result

What if you perform sub(sum(myfieldvalue),100) > 0 using frange?



From: Jack Krupansky 
To: solr-user@lucene.apache.org
Sent: Friday, July 12, 2013 7:44 AM
Subject: Re: How to set a condition over stats result


None that I know of, short of writing a custom search component. Seriously, 
you could hack up a copy of the stats component with your own logic.


Actually... this may be a case for the new, proposed Script Request Handler, 
which would let you execute a query and then you could do any custom 
JavaScript logic you wanted.


When we get that feature, it might be interesting to implement a variation 
of the standard stats component as a JavaScript script, and then people 
could easily hack it such as in your request. Fascinating.


-- Jack Krupansky

-Original Message- From: Matt Lieber
Sent: Thursday, July 11, 2013 6:08 PM
To: solr-user@lucene.apache.org
Subject: How to set a condition over stats result




Hello,

I am trying to see how I can test the sum of values of an attribute across
docs.
I.e. Whether sum(myfieldvalue)>100 .

I know I can use the stats module which compiles the sum of my attributes
on a certain facet , but how can I perform a test this result (i.e. Is
sum>100) within my stats query? From what I read, it's not supported yet
to perform a function on the stats module..
Any other way to do this ?

Cheers,
Matt















Multiple queries or Filtering Queries in Solr

2013-07-12 Thread dcode


My problem is I have n fields (say around 10) in Solr that are searchable,
they all are indexed and stored. I would like to run a query first on my
whole index of say 5000 docs which will hit around an average of 500 docs.
Next I would like to query using a different set of keywords on these 500
docs and NOT on the whole index.

So the first time I send a query a score will be generated, the second time
I run a query the new score generated should be based on the 500 documents
of the previous query, or in other words Solr should consider only these 500
docs as the whole index.

To summarise: an index of 5000 will be filtered to 500 and then to 50
(5000 > 500 > 50). It's basically filtering, but I would like to do this in
Solr.

I have reasonable basic knowledge and am still learning.

Update: If represented mathematically it would look like this:
results1=f(query1)
results2=f(query2, results1)
final_results=f(query3, results2)

I would like this to be accomplished programmatically; the end user will only
see 50 results, so faceting is not an option.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multiple-queries-or-Filtering-Queries-in-Solr-tp4077574.html
Sent from the Solr - User mailing list archive at Nabble.com.
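
[The closest built-in analogue is Solr's filter query parameter: each fq
restricts the candidate set without contributing to the score, so the chain
above collapses into one request (parameter values illustrative):

    q=query3&fq=query1&fq=query2&rows=50

Scoring then comes only from q, evaluated over documents matching both
filters; note, though, that idf statistics are still computed against the
whole index, not the filtered subset.]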


Re: What does "too many merges...stalling" in indexwriter log mean?

2013-07-12 Thread Tom Burton-West
Thanks Shawn,

Do you have any feeling for what gets traded off if we increase the
maxMergeCount?

This is completely new for us because we are experimenting with indexing
pages instead of whole documents.  Since our average document is about 370
pages, this means that we have increased the number of documents we are
asking Solr to index by a couple of orders of magnitude. (on the other hand
the size of the document decreases by a couple of orders of magnitude).
I'm not sure why increasing the number of documents (and reducing their
size) is causing more merges.  I'll have to investigate.

Tom


On Thu, Jul 11, 2013 at 5:29 PM, Shawn Heisey  wrote:

> On 7/11/2013 1:47 PM, Tom Burton-West wrote:
>
>> We are seeing the message "too many merges...stalling"  in our indexwriter
>> log.   Is this something to be concerned about?  Does it mean we need to
>> tune something in our indexing configuration?
>>
>
> It sounds like you've run into the maximum number of simultaneous merges,
> which I believe defaults to two, or maybe three.  The following config
> section in <indexConfig> will likely take care of the issue. This assumes
> 3.6 or later; I believe that on older versions, this goes in
> <indexDefaults>.
>
>   <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
>     <int name="maxThreadCount">1</int>
>     <int name="maxMergeCount">6</int>
>   </mergeScheduler>
>
> Looking through the source code to confirm, this definitely seems like the
> case.  Increasing maxMergeCount is likely going to speed up your indexing,
> at least by a little bit.  A value of 6 is probably high enough for mere
> mortals, buy you guys don't do anything small, so I won't begin to
> speculate what you'll need.
>
> If you are using spinning disks, you'll want maxThreadCount at 1.  If
> you're using SSD, then you can likely increase that value.
>
> Thanks,
> Shawn
>
>


Re: Is it possible to find a leader from a list of cores in solr via java code

2013-07-12 Thread vicky desai
Hi,

As per the suggestions above I shifted my focus to using CloudSolrServer. In
terms of sending updates to the leaders and reducing network traffic it
works great. But I hit one problem using CloudSolrServer: it opens too many
connections, as many as five thousand. My code is as follows:

// Cap the HttpClient pool: 3 connections total, 2 per host
ModifiableSolrParams params = new ModifiableSolrParams();
params.set(HttpClientUtil.PROP_MAX_CONNECTIONS, 3);
params.set(HttpClientUtil.PROP_MAX_CONNECTIONS_PER_HOST, 2);
HttpClient client = HttpClientUtil.createClient(params);
// Route requests through a load balancer backed by the capped client
LBHttpSolrServer lbServer = new LBHttpSolrServer(client);
server = new CloudSolrServer(zkHost, lbServer);
server.setDefaultCollection(defaultCollection);


If there is only one instance of Solr up then this works great, but in a
1-shard, 1-replica system it opens up too many connections in the waiting
state. Am I doing something incorrect? Any help would be highly appreciated.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Is-it-possible-to-find-a-leader-from-a-list-of-cores-in-solr-via-java-code-tp4074994p4077587.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Norms

2013-07-12 Thread William Bell
Thanks.

Yeah I don't really want the queryNorm on


On Wed, Jul 10, 2013 at 2:39 AM, Daniel Collins wrote:

> I don't know the full answer to your question, but here's what I can offer.
>
> Solr offers 2 types of normalisation, FieldNorm and QueryNorm.  FieldNorm
> is as the name suggests field level normalisation, based on length of the
> field, and can be controlled by the omitNorms parameter on the field.  In
> your example, fieldNorm is always 1.0, see below, so that suggests you have
> correctly turned off field normalisation on the name_edgy field.
>
> 1.0 = fieldNorm(field=name_edgy, doc=231378)
>
> QueryNorm is what I'm still trying to get to the bottom of exactly :)  But
> it's something that tries to normalise the results of different term queries
> so they are broadly comparable. You haven't supplied the query you've run,
> but based on the qf, bf, I'm assuming it breaks down into a DisMax query on
> 3 fields (name_edgy, name_edge, name_word) so queryNorm is trying to ensure
> that the results of those 3 queries can be compared.  The exact details of
> it I'm still trying to get to the bottom of (any volunteers with more info
> chip in!)
>
> From earlier answers to the list, queryNorm is calculated in the Similarity
> object, I need to dig further, but that's probably a good place to start.
>
>
>
> On 10 July 2013 04:57, William Bell  wrote:
>
> > I have a field that has omitNorms=true, but when I look at debugQuery I
> see
> > that
> > the field is being normalized for the score.
> >
> > What can I do to turn off normalization in the score?
> >
> > I want a simple way to do 2 things:
> >
> > boost geodist() highest at 1 mile and lowest at 100 miles.
> > plus add a boost for a query=edgefield^5.
> >
> > I only want tf() and no queryNorm. I am not even sure I want idf() but I
> > can probably live with rare names being boosted.
> >
> >
> >
> > The results are being normalized. See below. I tried dismax and edismax -
> > bf, bq and boost.
> >
> > 
> > 
> > none
> > edismax
> > 0.01
> > 
> > display_name,city_state,prov_url,pwid,city_state_alternative
> > 
> > 
> > sum(recip(geodist(store_geohash), .5, 6, 6), 0.1)
> > 5
> > *:*
> > name_edgy^.9 name_edge^.9 name_word
> > true
> > pwid
> > true
> > 
> > score desc, last_name asc
> > 100
> > 39.740112,-104.984856
> > store_geohash
> > false
> > name_edgy
> > 2<-1 4<-2 6<-3
> > 
> > 
> >
> > 0.058555886 = queryNorm
> >
> > product of: 10.854807 = (MATCH) sum of: 1.8391232 = (MATCH) max plus 0.01
> > times others of: 1.8214592 = (MATCH) weight(name_edge:paul^0.9 in
> 231378),
> > product of: 0.30982485 = queryWeight(name_edge:paul^0.9), product of:
> 0.9 =
> > boost 5.8789964 = idf(docFreq=26567, maxDocs=3493655)* 0.058555886 =
> > queryNorm* 5.8789964 = (MATCH) fieldWeight(name_edge:paul in 231378),
> > product of: 1.0 = tf(termFreq(name_edge:paul)=1) 5.8789964 =
> > idf(docFreq=26567, maxDocs=3493655) 1.0 = fieldNorm(field=name_edge,
> > doc=231378) 1.7664119 = (MATCH) weight(name_edgy:paul^0.9 in 231378),
> > product of: 0.30510724 = queryWeight(name_edgy:paul^0.9), product of:
> 0.9 =
> > boost 5.789479 = idf(docFreq=29055, maxDocs=3493655)* 0.058555886 =
> > queryNorm* 5.789479 = (MATCH) fieldWeight(name_edgy:paul in 231378),
> > product of: 1.0 = tf(termFreq(name_edgy:paul)=1) 5.789479 =
> > idf(docFreq=29055, maxDocs=3493655) 1.0 = fieldNorm(field=name_edgy,
> > doc=231378) 9.015684 = (MATCH) max plus 0.01 times others of: 8.9352665 =
> > (MATCH) weight(name_word:nutting in 231378), product of: 0.72333425 =
> > queryWeight(name_word:nutting), product of: 12.352887 = idf(docFreq=40,
> > maxDocs=3493655) 0.058555886 = queryNorm 12.352887 = (MATCH)
> > fieldWeight(name_word:nutting in 231378), product of: 1.0 =
> > tf(termFreq(name_word:nutting)=1) 12.352887 = idf(docFreq=40,
> > maxDocs=3493655) 1.0 = fieldNorm(field=name_word, doc=231378) 8.04174 =
> > (MATCH) weight(name_edgy:nutting^0.9 in 231378), product of: 0.65100086 =
> > queryWeight(name_edgy:nutting^0.9), product of: 0.9 = boost 12.352887 =
> > idf(docFreq=40, maxDocs=3493655)* 0.058555886 = queryNorm* 12.352887 =
> > (MATCH) fieldWeight(name_edgy:nutting in 231378), product of: 1.0 =
> > tf(termFreq(name_edgy:nutting)=1) 12.352887 = idf(docFreq=40,
> > maxDocs=3493655) 1.0 = fieldNorm(field=name_edgy, doc=231378) 1.0855998 =
> >
> >
> sum(6.0/(0.5*float(geodist(39.74168747663498,-104.9849385023117,39.740112,-104.984856))+6.0),const(0.1))
> >
> >
> >
> > --
> > Bill Bell
> > billnb...@gmail.com
> > cell 720-256-8076
> >
>



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076
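
Since the goal above is tf() with no queryNorm, one workable route in Solr 4.x
is a custom Similarity that neutralizes query normalization. A minimal sketch,
assuming it is registered globally in schema.xml via
<similarity class="com.example.NoQueryNormSimilarity"/> (class and package
name invented for illustration):

import org.apache.lucene.search.similarities.DefaultSimilarity;

public class NoQueryNormSimilarity extends DefaultSimilarity {
    @Override
    public float queryNorm(float sumOfSquaredWeights) {
        // A constant here removes query normalization from the final score.
        return 1.0f;
    }
}

idf() could be flattened the same way by overriding idf(long docFreq, long
numDocs), at the cost of losing the rare-name boost mentioned above.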


Re: How to set a condition on the number of docs found

2013-07-12 Thread Jack Krupansky
Test where? I mean, "numFound" is right there at the top of the query 
results, right?


Unfortunately there is no function query value source equivalent to 
"numFound". There is "numdocs", but that is the total documents in the 
index. There is also docfreq(term), which could be used in a function query 
(including the "fl" parameter) if you know a term that has a 1-to-1 
relationship to your query results.


It is worth filing a Jira to add "numfound()" as a function query value 
source.
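
Until then, the check is cheap to do client-side with SolrJ, even though that
is what the original poster hoped to avoid; a minimal sketch ("server" is
whatever SolrServer instance you already use, and the threshold is arbitrary):

SolrQuery query = new SolrQuery("*:*");
query.setRows(0);  // only the count is needed, skip fetching documents
QueryResponse rsp = server.query(query);
boolean overThreshold = rsp.getResults().getNumFound() > 10;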


-- Jack Krupansky

-Original Message- 
From: Matt Lieber

Sent: Friday, July 12, 2013 1:45 AM
To: solr-user@lucene.apache.org
Subject: How to set a condition on the number of docs found

Hello there,

I would like to be able to know whether I got over a certain threshold of
doc results.

I.e. Test (Result.numFound > 10 ) -> true.

Is there a way to do this? I can't seem to find how to do it (other
than doing this test in the client app, which is not great).

Thanks,
Matt












Re: How to set a condition on the number of docs found

2013-07-12 Thread William Bell
Hmmm. One way is:

http://localhost:8983/solr/core/select/?q=*%3A*&facet=true&facet.field=id&facet.offset=10&rows=0&facet.limit=1

If you get a result back, you have more than 10 results.

Another way is to just look at it with a facet.query and have your app deal
with it.

http://localhost:8983/solr/core/select/?q=*%3A*&facet=true&facet.query={!lucene%20key=numberofresults}state:CO&rows=0




On Thu, Jul 11, 2013 at 11:45 PM, Matt Lieber  wrote:

> Hello there,
>
> I would like to be able to know whether I got over a certain threshold of
> doc results.
>
> I.e. Test (Result.numFound > 10 ) -> true.
>
> Is there a way to do this ? I can't seem to find how to do this; (other
> than have to do this test on the client app, which is not great).
>
> Thanks,
> Matt
>
>
> 
>
>
>
>
>
>
>



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: solr 4.3.0 cloud in Tomcat, link many collections to Zookeeper

2013-07-12 Thread Shawn Heisey
On 7/12/2013 7:29 AM, Zhang, Lisheng wrote:
> Sorry I might not have asked clearly, our issue is that we have 
> a few thousand collections (can be much more), so running that 
> command is rather tedious, is there a simpler way (all collections 
> share same schema/config)?

When you create each collection with the Collections API (http calls),
you tell it the name of a config set stored in zookeeper.  You can give
all your collections the same config set if you like.
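
For example, a single Collections API call can create a collection already
bound to the shared config set (host, names and counts below are
placeholders):

http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=1&replicationFactor=1&collection.configName=myconf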

If you manually create collections with the CoreAdmin API instead, you
must use the zkcli script included in Solr to link the collection to the
config set, which can be done either before or after the collection is
created.  The zkcli script provides some automation for the java command
that you were given by Furkan.

Thanks,
Shawn



Re: Solr Live Nodes not updating immediately

2013-07-12 Thread Shawn Heisey
On 7/11/2013 11:11 PM, Ranjith Venkatesan wrote:
> tickTime in zookeeper was high. When I reduced it to 2000ms the solr node
> status gets updated in <20s. That resolved my issue. Thanks for helping me.
> 
> I have one more question.
> 
> 1. Is it advisable to reduce the tickTime further?
> 
> 2. Or what's the most appropriate tickTime that gives maximum performance
> and also gets the solr node status updated in less time?
> 
> I hereby included my zoo.cfg configuration
> 
> tickTime=2000
> dataDir=/home/local/ranjith-1785/sources/solrcloud/zookeeper-3.4.5_Server1/zoodata
> clientPort = 2181
> initLimit=5
> syncLimit=2
> maxClientCnxns=180
> server.1=localhost:2888:3888
> server.2=localhost:3000:4000
> server.3=localhost:2500:3500

Here's mine, comments removed.  Except for dataDir, these are all
default values found in the zookeeper download and on the zookeeper website:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=zoodata
clientPort=2181
server.1=zoo1.REDACTED.com:2888:3888
server.2=zoo2.REDACTED.com:2888:3888
server.3=zoo3.REDACTED.com:2888:3888

http://zookeeper.apache.org/doc/r3.4.5/zookeeperStarted.html#sc_RunningReplicatedZooKeeper

I hope your config is a dev install, because if all your zookeepers are
running on the same server, you have no redundancy in the face of a
server failure.  Servers do fail, even if they have all the redundancy
features you can buy.

Thanks,
Shawn



Re: Leader Election, when?

2013-07-12 Thread Erick Erickson
This is probably not all that important to worry about. The additional
duties of a leader are pretty minimal. And the leaders will shift around
anyway as you restart servers etc. Really feels like a premature
optimization.

Best
Erick

On Thu, Jul 11, 2013 at 3:53 PM, aabreur  wrote:
> I have a working Zookeeper ensemble running with 3 instances and also a
> solrcloud cluster with some solr instances. I've created a collection with
> settings to 2 shards. Then i:
>
> create 1 core on instance1
> create 1 core on instance2
> create 1 core on instance1
> create 1 core on instance2
>
> Just to have this configuration:
>
> instance1: shard1_leader, shard2_replica
> instance2: shard1_replica, shard2_leader
>
> If i add 2 cores to instance1 then 2 cores to instance2, both leaders will
> be on instance1 and no re-election is done.
>
> instance1: shard1_leader, shard2_leader
> instance2: shard1_replica, shard2_replica
>
> Back to my ideal scenario (detached leaders), also when i add a third
> instance with 2 replicas and kill one of my instances running a leader, the
> election picks the instance that already has a leader.
>
> My question is why Zookeeper takes this behavior. Shouldn't it distribute
> leaders? If i deliver some stress to a double-leader instance, is Zookeeper
> going to run an election?
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Leader-Election-when-tp4077381.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Request to be added to the ContributorsGroup

2013-07-12 Thread Erick Erickson
Done, at least to the Solr contributor's group, if you want
Lucene, let me know.

Added exactly as "KumarLImbu", don't know whether
1> both the L and I should be capitalized
2> whether the rights-checking cares.

Thanks!
Erick

On Fri, Jul 12, 2013 at 2:51 AM, Kumar Limbu  wrote:
> Hi,
>
> My username is KumarLImbu and I would like to be added to the Contributors
> Group.
>
> Could somebody please help me?
>
> Best Regards,
> Kumar


RE: solr 4.3.0 cloud in Tomcat, link many collections to Zookeeper

2013-07-12 Thread Zhang, Lisheng
Sorry I might not have asked clearly, our issue is that we have 
a few thousand collections (can be much more), so running that 
command is rather tedious, is there a simpler way (all collections 
share same schema/config)?

Thanks very much for helps, Lisheng

-Original Message-
From: Furkan KAMACI [mailto:furkankam...@gmail.com]
Sent: Friday, July 12, 2013 1:17 AM
To: solr-user@lucene.apache.org
Subject: Re: solr 4.3.0 cloud in Tomcat, link many collections to
Zookeeper


If you have one collection you just need to define the hostnames of the
Zookeeper ensemble and run that command once.


2013/7/11 Zhang, Lisheng 

> Hi,
>
> We are testing solr 4.3.0 in Tomcat (considering upgrading solr 3.6.1 to
> 4.3.0), in WIKI page
> for solrCloud in Tomcat:
>
> http://wiki.apache.org/solr/SolrCloudTomcat
>
> we need to link each collection explicitly:
>
> ///
> 8) Link uploaded config with target collection
> java -classpath .:/home/myuser/solr-war-lib/* org.apache.solr.cloud.ZkCLI
> -cmd linkconfig -collection mycollection -confname ...
> ///
>
> But our application has many cores (a few thousand which all share the same
> schema/config),
> is there a more convenient way?
>
> Thanks very much for helps, Lisheng
>


Re: How to boost relevance based on distance and age..

2013-07-12 Thread Erick Erickson
the first thing I'd try would be FunctionQueries, see:
http://wiki.apache.org/solr/FunctionQuery.

Be a little careful. You have disjoint conditions, i.e.
one or the other should be used so you'll have two
function queries, basically expressing
if (age < 20 years)
if (age >= 20 years)

The one that _doesn't_ apply should return 1, not 0
since it'll be multiplied by the score.

Best
Erick

On Thu, Jul 11, 2013 at 11:03 AM, Vineel  wrote:
>
>
> Here is the structure of the solr document
>
>
>  52.401790,4.936660
>  1993-12-09T00:00:00Z
>
>
> would like to search for documents based on the following weighted
> criteria..
>
> - distance 0-10miles weight 40
> - distance 10miles and above weight 20
> - Age 0-20years weight 20
> - Age 20years and above weight 10
>
> wondering what are the recommended approaches to build SOLR queries for
> this?
>
> Thanks
> -Vineel
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-to-boost-relevance-based-on-distance-and-age-tp4077330.html
> Sent from the Solr - User mailing list archive at Nabble.com.
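
To make the suggestion above concrete: a rough sketch using standard function
queries with edismax and the additive bf parameter. geodist(), map(), div()
and ms() are standard functions, but the field names (position, birthdate)
and the constants are assumptions based on the post; note that geodist()
returns kilometers, so the 10-mile bound becomes roughly 16.1:

bf=sum(map(geodist(position,52.401790,4.936660),0,16.1,40,20),
       map(div(ms(NOW,birthdate),31557600000),0,20,20,10))

map(x,min,max,target,default) yields target while x lies in [min,max] and
default otherwise, which encodes the two weight buckets directly (40/20 for
distance, 20/10 for age).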


RE: What happens in indexing request in solr cloud if Zookeepers are all dead?

2013-07-12 Thread Zhang, Lisheng
Thanks very much for your clear explanation!

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Thursday, July 11, 2013 1:55 PM
To: solr-user@lucene.apache.org
Subject: Re: What happens in indexing request in solr cloud if
Zookeepers are all dead?


Sorry, no updates if no Zookeepers. There would be no way to assure that any 
node knows the proper configuration. Queries are a little safer using most 
recent configuration without zookeeper, but update consistency requires 
accurate configuration information.

-- Jack Krupansky

-Original Message- 
From: Zhang, Lisheng
Sent: Thursday, July 11, 2013 2:59 PM
To: solr-user@lucene.apache.org
Subject: RE: What happens in indexing request in solr cloud if Zookeepers 
are all dead?

Yes, I should not have used the word master/slave for solr cloud!

So if all Zookeepers are dead, could indexing requests be
handled properly (could solr remember the setting for indexing)?

Thanks very much for helps, Lisheng

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Thursday, July 11, 2013 10:46 AM
To: solr-user@lucene.apache.org
Subject: Re: What happens in indexing request in solr cloud if
Zookeepers are all dead?


There are no masters or slaves in SolrCloud - it is fully distributed and
"master-free". Leaders are temporary and can vary over time.

The basic idea for quorum is to prevent "split brain" - two (or more)
distinct sets of nodes (zookeeper nodes, that is) each thinking they
constitute the authoritative source for access to configuration information.
The trick is to require (N/2)+1 nodes for quorum. For n=3, quorum would be
(3/2)+1 = 1+1 = 2, so one node can be down. For n=1, quorum = (1/2)+1 = 0 +
1 = 1. For n=2, quorum would be (2/2)+1 = 1 + 1 = 2, so no nodes can be
down. IOW, for n=2 no nodes can be down for the cluster to do updates.

-- Jack Krupansky

-Original Message- 
From: Zhang, Lisheng
Sent: Thursday, July 11, 2013 9:28 AM
To: solr-user@lucene.apache.org
Subject: What happens in indexing request in solr cloud if Zookeepers are
all dead?

Hi,

In the latest solr cloud doc, it is mentioned that if all Zookeepers are dead,
a distributed
query still works because solr remembers the cluster state.

How about indexing request handling if all Zookeepers are dead: does
solr
need Zookeeper to know which box is master and which is slave for indexing
to
work? Could solr remember master/slave relations without Zookeeper?

Also, the doc said the Zookeeper quorum needs a majority rule, so we must
have 3 Zookeepers to handle the case where one instance has crashed. What
would happen if we have two instances in the quorum and one instance crashes
(or the quorum has 3 instances but two of them have crashed)? I felt the last
one should
take over?

Thanks very much for helps, Lisheng
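
The quorum arithmetic quoted above reduces to a one-liner; a sketch:

// Smallest majority of an n-node ZooKeeper ensemble
static int quorum(int n) { return n / 2 + 1; }
// quorum(3) == 2: one node may fail
// quorum(2) == 2: no node may fail, so n=2 buys nothing for updates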




Re: Solr caching clarifications

2013-07-12 Thread Erick Erickson
Inline

On Thu, Jul 11, 2013 at 8:36 AM, Manuel Le Normand
 wrote:
> Hello,
> As a result of frequent java OOM exceptions, I am trying to investigate
> the solr jvm memory heap usage.
> Please correct me if I am mistaken, this is my understanding of usages for
> the heap (per replica on a solr instance):
> 1. Buffers for indexing - bounded by ramBufferSize
> 2. Solr caches
> 3. Segment merge
> 4. Miscellaneous- buffers for Tlogs, servlet overhead etc.
>
> Particularly I'm concerned by Solr caches and segment merges.
> 1. How much memory consuming (bytes per doc) are FilterCaches (bitDocSet)
> and queryResultCaches (DocList)? I understand it is related to the skip
> spaces between doc id's that match (so it's not saved as a bitmap). But
> basically, is every id saved as a java int?

Different beasts. filterCache consumes, essentially, maxDoc/8 bytes (you
can get the maxDoc number from your Solr admin page). Plus some overhead
for storing the fq text, but that's usually not much. This is for each
entry up to "Size".
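
Worked through with example numbers (a sketch; plug in the maxDoc your admin
page reports):

// One bit per document per cached filter entry
long maxDoc = 10000000L;            // example: 10M docs
long bytesPerEntry = maxDoc / 8;    // = 1,250,000 bytes, about 1.25 MB
// A filterCache of size=512, fully populated, is then about 640 MB of bitsets.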

queryResultCache is usually trivial unless you've configured it extravagantly.
It's the query string length + queryResultWindowSize integers per entry
(queryResultWindowSize is from solrconfig.xml).

> 2. QueryResultMaxDocsCached - (for example = 100) means that any query
> resulting in more than 100 docs will not be cached (at all) in the
> queryResultCache? Or does it have to do with the documentCache?
It's just a limit on the queryResultCache entry size as far as I can
tell. But again
this cache is relatively small, I'd be surprised if it used
significant resources.

> 3. DocumentCache - written on the wiki it should be greater than
> max_results*concurrent_queries. Max result is just the num of rows
> displayed (rows-start) param, right? Not the queryResultWindow.

Yes. This is a cache (I think) for the _contents_ of the documents you'll
be returning to be manipulated by various components during the life
of the query.

> 4. LazyFieldLoading=true - when quering for id's only (fl=id) will this
> cache be used? (on the expense of eviction of docs that were already loaded
> with stored fields)

Not sure, but I don't think this will contribute much to memory pressure. This
is about how many fields are loaded to get a single value from a doc in the
results list, and since one is usually working with 20 or so docs, this
is usually
a small amount of memory.

> 5. How large is the heap used by mergings? Assuming we have a merge of 10
> segments of 500MB each (half inverted files - *.pos *.doc etc, half non
> inverted files - *.fdt, *.tvd), how much heap should be left unused for
> this merge?

Again, I don't think this is much of a memory consumer, although I
confess I don't
know the internals. Merging is mostly about I/O.

>
> Thanks in advance,
> Manu

But take a look at the admin page, you can see how much memory various
caches are using by looking at the plugins/stats section.

Best
Erick


Re: Problem using Term Component in solr

2013-07-12 Thread Erick Erickson
bq: Note:Term Component works only on string dataType field.  :(

Not true. Term Component will work on any indexed field. It'll bring
back the _tokens_ that have been indexed though, which are often
individual words so your examples "medical physics" would be two
separate tokens so it may be puzzling.

A general request: please don't use bold text. I know it's an attempt
to help direct attention to the important bits, but (at least in gmail in
my browser) bolds are replaced by "*" before and after, which,
especially when looking at wildcard questions, is really confusing.

But I have to ask you to back up a bit. _Why_ are you using
TermsComponent to search titles? Why not use Solr for what
it's good for and just search a _tokenized_ title field? This feels
like an XY problem.

Best
Erick

On Thu, Jul 11, 2013 at 2:55 AM, Parul Gupta(Knimbus)
 wrote:
> Hi All
>
> I am using the *Term component* in Solr for searching titles in short form
> using wildcard characters (.*) and [a-z0-9]*.
>
> I am using the *Term Component* specifically as wildcard characters are not
> working in a *select?q=* query search.
>
> Examples of some *title *are:
>
> 1)Medicine, Health Care and Philosophy
> 2)Medical Physics
> 3)Physics of fluids
> 4)Medical Engineering and Physics
>
> ***When i do *solr query*:
> localhost:8080/solr3.6/OA/terms?terms.fl=title&terms.regex=phy.*
> fluids&terms.regex.flag=case_insensitive&terms.limit=10
>
> *Output* is 3rd title:
> *Physics of fluids*
>
> This is relevant output.
>
> ***But when i do *solr query*:
>
> localhost:8080/solr3.6/OA/terms?terms.fl=title&terms.regex=med.*
> phy.*&terms.regex.flag=case_insensitive&terms.limit=10
>
> *Output* are 2nd and 4th title:
>
> *Medical Engineering and Physics*
> *Medical Physics*
>
> This is irrelevant. I want only one result for this query, i.e. *Medical
> Physics*
>
> *Although i have changed my wild card
> characters to *[a-z0-9]** instead of *.**, but then the first query doesn't
> work as '*of*' is included in '*Physics of fluids*'. However, the second
> query works fine.
>
> example of query is:
>
> localhost:8080/solr3.6/OA/terms?terms.fl=title&terms.regex=med[a-z0-9]*
> phy[a-z0-9]*&terms.regex.flag=case_insensitive&terms.limit=10
>
> This works fine,gives one output *Medical Physics*.
>
>
> If there is another way of searching, using the *Term Component* or without
> it, please suggest how to ignore such stop words.
>
> Note:Term Component works only on string dataType field.  :(
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Problem-using-Term-Component-in-solr-tp4077200.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Does Solrj Batch Processing Querying May Confuse?

2013-07-12 Thread Furkan KAMACI
I've crawled some webpages and indexed them in Solr. I've queried the data in
Solr via SolrJ. url is my unique field and I've defined my query like
this:

ModifiableSolrParams params = new ModifiableSolrParams();
params.set("q", "lang:tr");
params.set("fl", "url");
params.set("sort", "url desc");

I've run my program to fetch 1000 rows per query and wrote them to a
file. However, I realized that there are some documents that are indexed in
Solr (I can query them from the admin page, but not via the SolrJ 1000-row
batch process) and are not in my file. What may be the problem?
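
One usual suspect with start/rows paging is that the index changes between
pages, so documents shift across page boundaries and get skipped. A minimal
sketch of such a loop in SolrJ 4.x ("server" is the existing SolrServer; field
names come from the post):

SolrQuery q = new SolrQuery("lang:tr");
q.setFields("url");
q.setSort("url", SolrQuery.ORDER.desc);
q.setRows(1000);
int start = 0;
while (true) {
    q.setStart(start);
    QueryResponse rsp = server.query(q);
    SolrDocumentList page = rsp.getResults();
    for (SolrDocument doc : page) {
        // write doc.getFieldValue("url") to the output file
    }
    start += page.size();
    if (page.size() == 0 || start >= page.getNumFound()) break;
}

If commits happen while this loop runs, each page is cut against a different
index view; pausing indexing during the export, or filtering on a fixed
timestamp range, makes the walk consistent.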


Patch review request: SOLR-5001 (adding book links to the website)

2013-07-12 Thread Alexandre Rafalovitch
Hello,

As per earlier email thread, I have created a patch for Solr website to
incorporate links to my new book.

It would be nice if somebody with commit rights for the (markdown) website
could look at it before the book's Solr version (4.3.1) stops being the
latest :-)

I promise to help with the new Wiki/Guide later in return.

https://issues.apache.org/jira/browse/SOLR-5001

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


Re: How to optimize a search?

2013-07-12 Thread Erick Erickson
_Why_ should Rocket Banana (Single) come first?
Essentially you have some ordering in mind and
unless you can express it clearly you'll _never_ get
"ideal" ranking. Really.

But your particular issue can probably be solved by adding
a clause like OR "rocket banana"^5

And I suspect you haven't given us the entire query, or
you're running through edismax or whatever. In future,
please paste the result of adding &debug=all to the
e-mail.

Best
Erick

On Fri, Jul 12, 2013 at 7:32 AM, padcoe  wrote:
> Hello folks,
>
> I'm doing a search for a specific word ("Rocket Banana") in a specific field
> and the document with the result "Rocket Banana (Single)" never comes
> first, and this is the result that should appear in first position. I've
> tried many ways to perform this search:
>
> title:"Rocket Banana"
> title:("Rocket" AND "Banana")
> title:("Rocket" OR "Banana")
> title:("Rocket"^0.175 AND "Banana"^0.175)
> title:("Rocket"^0.175 OR"Banana"^0.175)
>
> The order returned is basically like:
>
> 12.106901  Rocket Rocket
> 12.007204  Rocket
> 12.007203  Banana Banana Banana
> [... a lot of results ...]
> 10.398543  Rocket Banana (Single)
>
>
> How can I optimize my search so the document that contains the "full"
> phrase I searched for gets a higher score than the others?
>
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-to-optimize-a-search-tp4077531.html
> Sent from the Solr - User mailing list archive at Nabble.com.


How to optimize a search?

2013-07-12 Thread padcoe
Hello folks,

I'm doing a search for a specific word ("Rocket Banana") in a specific field
and the document with the result "Rocket Banana (Single)" never comes
first, and this is the result that should appear in first position. I've
tried many ways to perform this search:

title:"Rocket Banana"
title:("Rocket" AND "Banana")
title:("Rocket" OR "Banana")
title:("Rocket"^0.175 AND "Banana"^0.175)
title:("Rocket"^0.175 OR"Banana"^0.175)

The order returned is basically like:

12.106901  Rocket Rocket
12.007204  Rocket
12.007203  Banana Banana Banana
[... a lot of results ...]
10.398543  Rocket Banana (Single)


How can I optimize my search so the document that contains the "full"
phrase I searched for gets a higher score than the others?





--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-optimize-a-search-tp4077531.html
Sent from the Solr - User mailing list archive at Nabble.com.


SolrCloud group.query error "shard X did not set sort field values" or how i can set fillFields=true on IndexSearcher.search

2013-07-12 Thread Evgeny Salnikov
Hi!

To reproduce the problem, do the following:

1. Start a node1 of SolrCloud (4.3.1 default configs) (java
-Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf
-DzkRun -jar start.jar)
2. Import to collection1 -> shard1 some data
3. Try group.query e.g.
http://node1:8983/solr/collection1/select?q=*:*&group=true&group.query=someFiled:someValue.
It is important that the query hits indexed data.
4. The result is, there is no error
5. Start a node2 of SolrCloud (java -Djetty.port=7574
-DzkHost=localhost:9983 -jar start.jar)
6. On node2, add a new core for collection1 -> shard2 and unload the default
core "collection1". We now have one collection over two shards: shard1 has
data, shard2 has no data.
7. Again try group.query
http://node1:8983/solr/collection1/select?q=*:*&group=true&group.query=someFiled:someValue
.
8. Error: shard 0 did not set sort field values (FieldDoc.fields is null);
you must pass fillFields=true to IndexSearcher.search on each shard

How i can set "fillFields=true to IndexSearcher.search" ?

Thanks in advance,
Evgeny


Custom processing in Solr Request Handler plugin and its debugging ?

2013-07-12 Thread Tony Mullins
Hi,

I have defined my new Solr RequestHandler plugin like this in SolrConfig.xml:

[requestHandler definition stripped by the mailing-list archive]

And it's working fine.

Now I want to do some custom processing from this plugin by making a
search query to the regular '/select' handler.

And then receive the results back from the '/select' handler, perform some
custom processing on those results, and send the response back from my custom
"/myendpoint" handler.

For this I need help on how to call the '/select' handler from
within the MyRequestPlugin class and perform some calculation on the
results.

I also need some help on how to debug my plugin. Since its .jar has been
deployed to solr_home/lib, how can I attach my plugin's code in Eclipse
to the Solr process so I can debug it when a user sends a request to my
plugin?

Thanks,
Tony
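
For the first part, a handler can invoke another handler on the same core
in-process. A minimal sketch, assuming MyRequestPlugin extends
RequestHandlerBase (error handling and the getDescription()/getSource()
boilerplate omitted):

public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
        throws Exception {
    SolrCore core = req.getCore();
    SolrRequestHandler select = core.getRequestHandler("/select");

    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("q", req.getParams().get("q", "*:*"));
    params.set("rows", 10);

    SolrQueryResponse selectRsp = new SolrQueryResponse();
    LocalSolrQueryRequest selectReq = new LocalSolrQueryRequest(core, params);
    try {
        core.execute(select, selectReq, selectRsp);
        // Post-process selectRsp.getValues() here, then copy what you
        // need into the outer response.
        rsp.add("results", selectRsp.getValues().get("response"));
    } finally {
        selectReq.close();
    }
}

For the second part, remote debugging works the usual JVM way: start Solr with
-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=8000 and attach
Eclipse via "Remote Java Application" on port 8000, with the plugin project on
the source path.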


Re: Solr Live Nodes not updating immediately

2013-07-12 Thread Ranjith Venkatesan
Hi,

The tickTime in zookeeper was high. When I reduced it to 2000ms the solr
node status gets updated in <20s. That resolved my issue. Thanks for helping me.

I have one more question.

1. Is it advisable to reduce the tickTime further?

2. Or what's the most appropriate tickTime that gives maximum performance
and also gets the solr node status updated in less time?

I hereby included my zoo.cfg configuration

tickTime=2000
dataDir=/home/local/ranjith-1785/sources/solrcloud/zookeeper-3.4.5_Server1/zoodata
clientPort = 2181
initLimit=5
syncLimit=2
maxClientCnxns=180
server.1=localhost:2888:3888
server.2=localhost:3000:4000
server.3=localhost:2500:3500




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Live-Nodes-not-updating-immediately-tp4076560p4077467.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to set a condition on the number of docs found

2013-07-12 Thread Furkan KAMACI
Do you want to modify the Solr source code? Did you check this line in
XMLWriter.java:

writeAttr("numFound", Long.toString(numFound));




2013/7/12 Matt Lieber 

> Hello there,
>
> I would like to be able to know whether I got over a certain threshold of
> doc results.
>
> I.e. Test (Result.numFound > 10 ) -> true.
>
> Is there a way to do this ? I can't seem to find how to do this; (other
> than have to do this test on the client app, which is not great).
>
> Thanks,
> Matt
>
>
> 
>
>
>
>
>
>
>


Search with punctuations

2013-07-12 Thread kobe.free.wo...@gmail.com
Hi,

Scenario: 

Users who perform a search may forget to include a punctuation mark
(apostrophe). For example, when a user wants to search for a value like INT'L,
they just key in INTL (with no punctuation). In this scenario, I wish to
return both values, INTL and INT'L, that are currently indexed on the SOLR
instance. Currently, if I search for INTL it won't return the row having the
value INT'L.

Schema Configuration entry for the field type:

[fieldType definition and analyzer chain stripped by the mailing-list archive]

Please suggest what mechanism I should use to fetch both values,
INTL and INT'L, when the search is performed for INTL. Also, does the
reg-ex look correct for the analyzers? What different filters/tokenizers
can be used to overcome this issue?

Thanks!





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-with-punctuations-tp4077510.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Performance of cross join vs block join

2013-07-12 Thread Mikhail Khludnev
On Fri, Jul 12, 2013 at 12:19 PM, mihaela olteanu wrote:

> Hi Mikhail,
>
> I have used the term block join incorrectly. When I said block join I was
> referring to a join performed on a single core, versus a cross join
> performed on multiple cores.
> But I saw your benchmark (from cache) and it seems that block join has
> better performance. Is this functionality available in Solr 4.3.1?

Nope, SOLR-3076 has been awaiting commit for ages.


> I did not find such examples on Solr's wiki page.
> Does this functionality require a special schema, or a special indexing?

Special indexing - yes.


> How would I need to index the data from my tables? In my case anyway all
> the indices have a common schema since I am using dynamic fields, thus I
> can easily add all documents from all tables in one Solr core, but for each
> document to add a discriminator field.
>
Correct, but the notion of a 'discriminator field' is a little bit different
for block join.


>
> Could you point me to some more documentation?
>

I can recommend only those
http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
http://www.youtube.com/watch?v=-OiIlIijWH0
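
At the Lucene level (where block join lives until SOLR-3076 lands), the shape
is roughly the following; a sketch with invented field names, assuming the
children of each parent were indexed in the same block, immediately before
the parent:

// Parents are identified by a filter; children are queried normally.
Filter parents = new CachingWrapperFilter(
    new QueryWrapperFilter(new TermQuery(new Term("type", "parent"))));
Query childQuery = new TermQuery(new Term("color", "red"));
Query join = new ToParentBlockJoinQuery(childQuery, parents, ScoreMode.None);
// ScoreMode.None is the mode discussed above: stop at the first child hit.

The indexing side uses IndexWriter.addDocuments(...) so that each child block
plus its parent is written contiguously; that adjacency is what makes the
iterator cheap.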


> Thanks in advance,
> Mihaela
>
>
> 
>  From: Mikhail Khludnev 
> To: solr-user ; mihaela olteanu <
> mihaela...@yahoo.com>
> Sent: Thursday, July 11, 2013 2:25 PM
> Subject: Re: Performance of cross join vs block join
>
>
> Mihaela,
>
> For me it's reasonable that a single-core join takes the same time as a
> cross-core one; I just can't see what gain could be obtained in the former
> case.
> I'm hardly able to comment on the join code; I looked into it, and it's not
> trivial, at least. With block join there is no need to obtain parentId term
> values/numbers and look up parents by them. Both of these actions are
> expensive. Also, block join works as an iterator, but join needs to allocate
> memory for the parents bitset and populate it out of order, which impacts
> scalability.
> Also, in None scoring mode BJQ doesn't need to walk through all children,
> but only hits the first. Another nice feature is 'both-side leapfrog': if
> you have a highly restrictive filter/query that intersects with BJQ, it can
> skip many parents and children as well, which is not possible in Join,
> which has a fairly 'full-scan' nature.
> The main performance factor for Join is the number of child docs.
> I'm not sure I got all your questions; please specify them in more detail
> if something is still unclear.
> Have you seen my benchmark
> http://blog.griddynamics.com/2012/08/block-join-query-performs.html ?
>
>
>
> On Thu, Jul 11, 2013 at 1:52 PM, mihaela olteanu  >wrote:
>
> > Hello,
> >
> > Does anyone know about some measurements in terms of performance for
> cross
> > joins compared to joins inside a single index?
> >
> > Is it faster the join inside a single index that stores all documents of
> > various types (from parent table or from children tables) with a
> > discriminator field compared to the cross join (basically in this case
> > each
> > document type resides in its own index)?
> >
> > I have performed some tests but to me it seems that having a join in a
> > single index (bigger index) does not add too much speed improvements
> > compared to cross joins.
> >
> > Why a block join would be faster than a cross join if this is the case?
> > What are the variables that count when trying to improve the query
> > execution time?
> >
> > Thanks!
> > Mihaela
>
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> 




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

 



About Suggestions

2013-07-12 Thread Lochschmied, Alexander
Hi Solr people!

We need to suggest part numbers in alphabetical order, adding up to four 
characters to the already entered part number prefix. That works quite well 
with terms component acting on a multivalued field with keyword tokenizer and 
edge nGram filter. I am mentioning "part numbers" to indicate that each item in 
the multivalued field is a string without whitespace and where special 
characters like dashes cannot be seen as separators.

Is there a way to know if the term (the suggestion) represents such a complete 
part number (without doing another query for each suggestion)?

Since we are using SolrJ, what we would need is something like
boolean Term.isRepresentingCompleteFieldValue()

Thanks,
Alexander


Solr 4.3 Shard distributed request check probably incorrect?

2013-07-12 Thread Johann Höchtl
Hi,

we are using Solr 4.3 with regular sharding without ZooKeeper.

I see the following errors inside our logs:
14995742 [qtp427093680-2249] INFO  org.apache.solr.core.SolrCore  - [DE1]
webapp=/solr path=/select
params={mm=2<66%25&tie=0.1&ids=1060691781&qf=Title^1.2+Description^0.01+Keywords^0.4+ArtikelNumber^0.1&distrib=false&q.alt=*:*&wt=javabin&version=2&rows=10&defType=edismax&pf=%0aTitle^1.5+Description^0.3%0a+&NOW=1373459092416&shard.url=
172.31.4.63:8080/solr/DE1&fl=%0aPID,updated,score%0a+&start=0&q=9783426647240&bf=%0a%0a+&partialResults=true&timeAllowed=5000&isShard=true&fq=Price:[*+TO+9]&fq=ShopId1+8+10+12+2975)&ps=100}
status=0 QTime=2
14995742 [qtp427093680-2255] ERROR
org.apache.solr.servlet.SolrDispatchFilter  -
null:java.lang.NullPointerException
at
org.apache.solr.handler.component.QueryComponent.createMainQuery(QueryComponent.java:727)
at
org.apache.solr.handler.component.QueryComponent.regularDistributedProcess(QueryComponent.java:588)
at
org.apache.solr.handler.component.QueryComponent.distributedProcess(QueryComponent.java:541)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:244)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)

Because this request obviously carries isShard=true&distrib=false,
shouldn't it evaluate to a non-distributed request? It seems that
the ResponseBuilder is marked as isDistrib == true, because otherwise it
wouldn't execute distributedProcess, or am I wrong?


Best regards,
Hans


Re: Performance of cross join vs block join

2013-07-12 Thread mihaela olteanu
Hi Mikhail,

I have used the term block join incorrectly. When I said block join I was
referring to a join performed on a single core, versus a cross join performed
on multiple cores.
But I saw your benchmark (from cache) and it seems that block join has better
performance. Is this functionality available in Solr 4.3.1? I did not find such
examples on Solr's wiki page.
Does this functionality require a special schema, or special indexing? How
would I need to index the data from my tables? In my case all the indices have
a common schema anyway, since I am using dynamic fields, so I can easily add
all documents from all tables to one Solr core, adding a discriminator field
to each document.

Could you point me to some more documentation?

Thanks in advance,
Mihaela



 From: Mikhail Khludnev 
To: solr-user ; mihaela olteanu 
 
Sent: Thursday, July 11, 2013 2:25 PM
Subject: Re: Performance of cross join vs block join
 

Mihaela,

For me it's reasonable that a single-core join takes the same time as a
cross-core one; I just can't see what gain could be obtained in the former
case.
I'm hardly able to comment on the join code; I looked into it, and it's not
trivial, at least. With block join there is no need to obtain parentId term
values/numbers and look up parents by them. Both of these actions are
expensive. Also, block join works as an iterator, but join needs to allocate
memory for the parents bitset and populate it out of order, which impacts
scalability.
Also, in None scoring mode BJQ doesn't need to walk through all children, but
only hits the first. Another nice feature is 'both-side leapfrog': if you have
a highly restrictive filter/query that intersects with BJQ, it can skip many
parents and children as well, which is not possible in Join, which has a
fairly 'full-scan' nature.
The main performance factor for Join is the number of child docs.
I'm not sure I got all your questions; please specify them in more detail
if something is still unclear.
Have you seen my benchmark
http://blog.griddynamics.com/2012/08/block-join-query-performs.html ?



On Thu, Jul 11, 2013 at 1:52 PM, mihaela olteanu wrote:

> Hello,
>
> Does anyone know about some measurements in terms of performance for cross
> joins compared to joins inside a single index?
>
> Is it faster the join inside a single index that stores all documents of
> various types (from parent table or from children tables) with a
> discriminator field compared to the cross join (basically in this case each
> document type resides in its own index)?
>
> I have performed some tests but to me it seems that having a join in a
> single index (bigger index) does not add too much speed improvements
> compared to cross joins.
>
> Why a block join would be faster than a cross join if this is the case?
> What are the variables that count when trying to improve the query
> execution time?
>
> Thanks!
> Mihaela




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics




Re: solr 4.3.0 cloud in Tomcat, link many collections to Zookeeper

2013-07-12 Thread Furkan KAMACI
If you have one collection you just need to define the hostnames of the
Zookeeper ensemble and run that command once.


2013/7/11 Zhang, Lisheng 

> Hi,
>
> We are testing solr 4.3.0 in Tomcat (considering upgrading solr 3.6.1 to
> 4.3.0), in WIKI page
> for solrCloud in Tomcat:
>
> http://wiki.apache.org/solr/SolrCloudTomcat
>
> we need to link each collection explicitly:
>
> ///
> 8) Link uploaded config with target collection
> java -classpath .:/home/myuser/solr-war-lib/* org.apache.solr.cloud.ZkCLI
> -cmd linkconfig -collection mycollection -confname ...
> ///
>
> But our application has many cores (a few thousand which all share the same
> schema/config),
> is there a more convenient way?
>
> Thanks very much for helps, Lisheng
>


Re: Leader Election, when?

2013-07-12 Thread Furkan KAMACI
If you plan to have 2 shards and you start up the first node, it will be the
leader of the first shard. When you start up the second node, it will be the
leader of the second shard. The third node will be a replica of the first
shard, the fourth a replica of the second shard, the fifth a replica of the
first shard ... and this continues round robin.


2013/7/11 aabreur 

> I have a working Zookeeper ensemble running with 3 instances and also a
> solrcloud cluster with some solr instances. I've created a collection with
> settings to 2 shards. Then i:
>
> create 1 core on instance1
> create 1 core on instance2
> create 1 core on instance1
> create 1 core on instance2
>
> Just to have this configuration:
>
> instance1: shard1_leader, shard2_replica
> instance2: shard1_replica, shard2_leader
>
> If i add 2 cores to instance1 then 2 cores to instance2, both leaders will
> be on instance1 and no re-election is done.
>
> instance1: shard1_leader, shard2_leader
> instance2: shard1_replica, shard2_replica
>
> Back to my ideal scenario (detached leaders), also when i add a third
> instance with 2 replicas and kill one of my instances running a leader, the
> election picks the instance that already has a leader.
>
> My question is why Zookeeper takes this behavior. Shouldn't it distribute
> leaders? If i deliver some stress to a double-leader instance, is Zookeeper
> going to run an election?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Leader-Election-when-tp4077381.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Usage of CloudSolrServer?

2013-07-12 Thread Furkan KAMACI
CloudSolrServer uses LBHttpSolrServer by default. CloudSolrServer connects
to Zookeeper and passes the live nodes to LBHttpSolrServer, which then queries
each node in round-robin fashion. By the way, do you mean "leader" instead of
"master"?

2013/7/12 sathish_ix 

> Hi ,
>
> I am using CloudSolrServer to connect to SolrCloud; I'm indexing the
> documents
> via the SolrJ API using a CloudSolrServer object. Indexing is triggered on
> the master node of a collection, whereas if I need to find the status of the
> loading, it returns the message from a replica where the status is null. How
> do I find which instance CloudSolrServer is connecting to?
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Usage-of-CloudSolrServer-tp4056052p4077471.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>