Re: Unsubscribe me

2015-06-08 Thread François Schiettecatte
Please follow instructions here: http://lucene.apache.org/solr/resources.html

F.


> On Jun 8, 2015, at 1:06 AM, Dylan  wrote:
> 
> On 30 May 2015 12:08, "Lalit Kumar 4"  wrote:
> 
>> Please unsubscribe me as well
>> 
>> On May 30, 2015 15:23, Neha Jatav  wrote:
>> Unsubscribe me
>> 



Re: Unsubscribe me

2015-05-30 Thread François Schiettecatte
Quoting Erik from two days ago:

Please follow the instructions here:

http://lucene.apache.org/solr/resources.html. Be sure to use the exact same 
e-mail you used to subscribe.


> On May 30, 2015, at 6:07 AM, Lalit Kumar 4  wrote:
> 
> Please unsubscribe me as well
> 
> On May 30, 2015 15:23, Neha Jatav  wrote:
> Unsubscribe me



Re: YAJar

2015-05-26 Thread François Schiettecatte
What I am suggesting is that you set up a stand-alone version of Solr with 
Guava 14.0.1 and run some sort of test suite similar to what you would normally 
use Solr for in your app. Then replace the Guava jar and re-run the tests. If 
all works well, and I suspect it will because it did for me, then you can use 
18.0. Simple really.
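
A sketch of the swap step, assuming the Solr 5 layout (the jar directory is an 
assumption; adjust to your install):

   cd server/solr-webapp/webapp/WEB-INF/lib
   rm guava-14.0.1.jar
   cp /path/to/guava-18.0.jar .
   # restart Solr, then re-run the test suite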

François

> On May 26, 2015, at 10:30 AM, Robust Links  wrote:
> 
> i can't run 14.0.1. that is the problem. 14 does not have the interfaces i
> need
> 
> On Tue, May 26, 2015 at 10:28 AM, François Schiettecatte <
> fschietteca...@gmail.com> wrote:
> 
>> Run whatever tests you want with 14.0.1, replace it with 18.0, rerun the
>> tests and compare.
>> 
>> François
>> 
>>> On May 26, 2015, at 10:25 AM, Robust Links 
>> wrote:
>>> 
>>> by "dumping" you mean recompiling solr with guava 18?
>>> 
>>> On Tue, May 26, 2015 at 10:22 AM, François Schiettecatte <
>>> fschietteca...@gmail.com> wrote:
>>> 
>>>> Have you tried dumping guava 14.0.1 and using 18.0 with Solr? I did a
>>>> while ago and it worked fine for me.
>>>> 
>>>> François
>>>> 
>>>>> On May 26, 2015, at 10:11 AM, Robust Links 
>>>> wrote:
>>>>> 
>>>>> I have minhash logic that uses a guava 18.0 method that is not in guava
>>>>> 14.0.1. This minhash logic is a separate maven project; I'm including it
>>>>> in my project via maven. The code is being used as a search component on
>>>>> the set of results. The logic goes through the search results and deletes
>>>>> duplicates. Here is the solrconfig.xml:
>>>>> 
>>>>> <requestHandler name="..." class="solr.SearchHandler" default="true">
>>>>>   <arr name="last-components">
>>>>>     <str>tvComponent</str>
>>>>>     <str>terms</str>
>>>>>     <str>minHashDedup</str>
>>>>>   </arr>
>>>>> </requestHandler>
>>>>> 
>>>>> <searchComponent name="minHashDedup" class="com.xyz.DedupSearchHits">
>>>>>   <int name="MAX_COMPARISONS">5</int>
>>>>> </searchComponent>
>>>>> 
>>>>> DedupSearchHits class is the one implementing the minhash (hence using
>>>>> guava 18). I start solr via the solr.in.sh script. The error I am
>>>> getting
>>>>> is:
>>>>> 
>>>>> 
>>>>> Caused by: java.lang.NoSuchMethodError:
>>>>> 
>>>> 
>> com.google.common.hash.HashFunction.hashUnencodedChars(Ljava/lang/CharSequence;)Lcom/google/common/hash/HashCode;
>>>>> 
>>>>> at com.xyz.incrementToken(MinHashTokenFilter.java:54)
>>>>> 
>>>>> at com.xyz.MinHash.calculate(MinHash.java:131)
>>>>> 
>>>>> at com.xyz.Algorithms.minhash.MinHasher.compare(MinHasher.java:89)
>>>>> 
>>>>> at
>>>> com.xyz.Algorithms.minhash.DedupSearchHits.init(DedupSearchHits.java:74)
>>>>> 
>>>>> at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:619)
>>>>> 
>>>>> at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2311)
>>>>> 
>>>>> at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2305)
>>>>> 
>>>>> at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2338)
>>>>> 
>>>>> at
>> org.apache.solr.core.SolrCore.loadSearchComponents(SolrCore.java:1297)
>>>>> 
>>>>> at org.apache.solr.core.SolrCore.<init>(SolrCore.java:813)
>>>>> 
>>>>> 
>>>>> What is the best design to solve this problem? I understand the point of
>>>>> modularity but how can i include logic into solr that does result
>>>>> processing without loading that jar into solr?
>>>>> 
>>>>> thank you
>>>>> 
>>>>> 
>>>>> On Tue, May 26, 2015 at 8:00 AM, Daniel Collins >> 
>>>>> wrote:
>>>>> 
>>>>>> I guess this is one reason why the whole WAR approach is being
>> removed!
>>>>>> Solr should be a black-box that you talk to, and get responses from.
>>>> What
>>>>>> it depends on and how it is deployed, should be irrelevant to you.
>>>>>> 
>>>>>> If you are wanting to override the version of gua

Re: YAJar

2015-05-26 Thread François Schiettecatte
Run whatever tests you want with 14.0.1, replace it with 18.0, rerun the tests 
and compare.

François

> On May 26, 2015, at 10:25 AM, Robust Links  wrote:
> 
> by "dumping" you mean recompiling solr with guava 18?
> 
> On Tue, May 26, 2015 at 10:22 AM, François Schiettecatte <
> fschietteca...@gmail.com> wrote:
> 
>> Have you tried dumping guava 14.0.1 and using 18.0 with Solr? I did a
>> while ago and it worked fine for me.
>> 
>> François
>> 
>>> On May 26, 2015, at 10:11 AM, Robust Links 
>> wrote:
>>> 
>>> I have minhash logic that uses a guava 18.0 method that is not in guava
>>> 14.0.1. This minhash logic is a separate maven project; I'm including it
>>> in my project via maven. The code is being used as a search component on
>>> the set of results. The logic goes through the search results and deletes
>>> duplicates. Here is the solrconfig.xml:
>>> 
>>> <requestHandler name="..." class="solr.SearchHandler" default="true">
>>>   <arr name="last-components">
>>>     <str>tvComponent</str>
>>>     <str>terms</str>
>>>     <str>minHashDedup</str>
>>>   </arr>
>>> </requestHandler>
>>> 
>>> <searchComponent name="minHashDedup" class="com.xyz.DedupSearchHits">
>>>   <int name="MAX_COMPARISONS">5</int>
>>> </searchComponent>
>>> 
>>> DedupSearchHits class is the one implementing the minhash (hence using
>>> guava 18). I start solr via the solr.in.sh script. The error I am
>> getting
>>> is:
>>> 
>>> 
>>> Caused by: java.lang.NoSuchMethodError:
>>> 
>> com.google.common.hash.HashFunction.hashUnencodedChars(Ljava/lang/CharSequence;)Lcom/google/common/hash/HashCode;
>>> 
>>> at com.xyz.incrementToken(MinHashTokenFilter.java:54)
>>> 
>>> at com.xyz.MinHash.calculate(MinHash.java:131)
>>> 
>>> at com.xyz.Algorithms.minhash.MinHasher.compare(MinHasher.java:89)
>>> 
>>> at
>> com.xyz.Algorithms.minhash.DedupSearchHits.init(DedupSearchHits.java:74)
>>> 
>>> at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:619)
>>> 
>>> at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2311)
>>> 
>>> at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2305)
>>> 
>>> at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2338)
>>> 
>>> at org.apache.solr.core.SolrCore.loadSearchComponents(SolrCore.java:1297)
>>> 
>>> at org.apache.solr.core.SolrCore.<init>(SolrCore.java:813)
>>> 
>>> 
>>> What is the best design to solve this problem? I understand the point of
>>> modularity but how can i include logic into solr that does result
>>> processing without loading that jar into solr?
>>> 
>>> thank you
>>> 
>>> 
>>> On Tue, May 26, 2015 at 8:00 AM, Daniel Collins 
>>> wrote:
>>> 
>>>> I guess this is one reason why the whole WAR approach is being removed!
>>>> Solr should be a black-box that you talk to, and get responses from.
>> What
>>>> it depends on and how it is deployed, should be irrelevant to you.
>>>> 
>>>> If you are wanting to override the version of guava that Solr uses, then
>>>> you'd have to rebuild Solr (can be done with maven) and manually update
>> the
>>>> pom.xml to use guava 18.0, but why would you? You need to test Solr
>>>> completely (in case any guava bugs affect Solr), deal with any build
>> issues
>>>> that arise (if guava changes any APIs), and cause yourself a world of
>> pain,
>>>> for what gain?
>>>> 
>>>> 
>>>> On 26 May 2015 at 11:29, Robust Links  wrote:
>>>> 
>>>>> i have custom search components.
>>>>> 
>>>>> On Tue, May 26, 2015 at 4:34 AM, Upayavira  wrote:
>>>>> 
>>>>>> Why is your app tied that closely to Solr? I can understand if you are
>>>>>> talking about SolrJ, but normal usage you use a different application
>>>> in
>>>>>> a different JVM from Solr.
>>>>>> 
>>>>>> Upayavira
>>>>>> 
>>>>>> On Tue, May 26, 2015, at 05:14 AM, Robust Links wrote:
>>>>>>> I am stuck in Yet Another Jarmagedon of SOLR. this is a basic
>>>>> question. i
>>>>>>> noticed solr 5.0 is using guava 14.0.1. My app needs guava 18.0. What
>>>>> is
>>>>>>> the pattern to override a jar version uploaded into jetty?
>>>>>>> 
>>>>>>> I am using maven, and solr is being started the old way
>>>>>>> 
>>>>>>> java -jar start.jar
>>>>>>> -Dsolr.solr.home=...
>>>>>>> -Djetty.home=...
>>>>>>> 
>>>>>>> I tried to edit jetty's start.config (then run java
>>>>>>> -DSTART=/my/dir/start.config
>>>>>>> -jar start.jar) but got no where...
>>>>>>> 
>>>>>>> any help would be much appreciated
>>>>>>> 
>>>>>>> Peyman
>>>>>> 
>>>>> 
>>>> 
>> 
>> 



Re: YAJar

2015-05-26 Thread François Schiettecatte
Have you tried dumping guava 14.0.1 and using 18.0 with Solr? I did a while ago 
and it worked fine for me.

François

> On May 26, 2015, at 10:11 AM, Robust Links  wrote:
> 
> I have minhash logic that uses a guava 18.0 method that is not in guava
> 14.0.1. This minhash logic is a separate maven project; I'm including it in
> my project via maven. The code is being used as a search component on the
> set of results. The logic goes through the search results and deletes
> duplicates. Here is the solrconfig.xml:
> 
> <requestHandler name="..." class="solr.SearchHandler" default="true">
>   <arr name="last-components">
>     <str>tvComponent</str>
>     <str>terms</str>
>     <str>minHashDedup</str>
>   </arr>
> </requestHandler>
> 
> <searchComponent name="minHashDedup" class="com.xyz.DedupSearchHits">
>   <int name="MAX_COMPARISONS">5</int>
> </searchComponent>
> 
> DedupSearchHits class is the one implementing the minhash (hence using
> guava 18). I start solr via the solr.in.sh script. The error I am getting
> is:
> 
> 
> Caused by: java.lang.NoSuchMethodError:
> com.google.common.hash.HashFunction.hashUnencodedChars(Ljava/lang/CharSequence;)Lcom/google/common/hash/HashCode;
> 
> at com.xyz.incrementToken(MinHashTokenFilter.java:54)
> 
> at com.xyz.MinHash.calculate(MinHash.java:131)
> 
> at com.xyz.Algorithms.minhash.MinHasher.compare(MinHasher.java:89)
> 
> at com.xyz.Algorithms.minhash.DedupSearchHits.init(DedupSearchHits.java:74)
> 
> at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:619)
> 
> at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2311)
> 
> at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2305)
> 
> at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:2338)
> 
> at org.apache.solr.core.SolrCore.loadSearchComponents(SolrCore.java:1297)
> 
> at org.apache.solr.core.SolrCore.<init>(SolrCore.java:813)
> 
> 
> What is the best design to solve this problem? I understand the point of
> modularity but how can i include logic into solr that does result
> processing without loading that jar into solr?
> 
> thank you
> 
> 
> On Tue, May 26, 2015 at 8:00 AM, Daniel Collins 
> wrote:
> 
>> I guess this is one reason why the whole WAR approach is being removed!
>> Solr should be a black-box that you talk to, and get responses from.  What
>> it depends on and how it is deployed, should be irrelevant to you.
>> 
>> If you are wanting to override the version of guava that Solr uses, then
>> you'd have to rebuild Solr (can be done with maven) and manually update the
>> pom.xml to use guava 18.0, but why would you? You need to test Solr
>> completely (in case any guava bugs affect Solr), deal with any build issues
>> that arise (if guava changes any APIs), and cause yourself a world of pain,
>> for what gain?
>> 
>> 
>> On 26 May 2015 at 11:29, Robust Links  wrote:
>> 
>>> i have custom search components.
>>> 
>>> On Tue, May 26, 2015 at 4:34 AM, Upayavira  wrote:
>>> 
 Why is your app tied that closely to Solr? I can understand if you are
 talking about SolrJ, but normal usage you use a different application
>> in
 a different JVM from Solr.
 
 Upayavira
 
 On Tue, May 26, 2015, at 05:14 AM, Robust Links wrote:
> I am stuck in Yet Another Jarmagedon of SOLR. this is a basic
>>> question. i
> noticed solr 5.0 is using guava 14.0.1. My app needs guava 18.0. What
>>> is
> the pattern to override a jar version uploaded into jetty?
> 
> I am using maven, and solr is being started the old way
> 
> java -jar start.jar
> -Dsolr.solr.home=...
> -Djetty.home=...
> 
> I tried to edit jetty's start.config (then run java
> -DSTART=/my/dir/start.config
> -jar start.jar) but got no where...
> 
> any help would be much appreciated
> 
> Peyman
 
>>> 
>> 



Re: how to debug solr performance degradation

2015-02-24 Thread François Schiettecatte
Rebecca

You don’t want to give all the memory to the JVM. You want to give it just 
enough for it to work optimally and leave the rest of the memory for the OS to 
use for caching data. Giving the JVM too much memory can result in worse 
performance because of GC. There is no magic formula for figuring out the 
memory allocation for the JVM; it is very dependent on the workload. In your 
case I would start with 5GB, and increment by 5GB with each run.

I also use these settings for the JVM:

-XX:+UseG1GC -Xms1G -Xmx1G

-XX:+AggressiveOpts -XX:+OptimizeStringConcat -XX:+ParallelRefProcEnabled 
-XX:MaxGCPauseMillis=200

I got them from this list so I can’t take credit for them, but they work for me.
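
Combined with the suggested starting heap, the full invocation might look 
something like this (a sketch; the start.jar launch and exact flags are 
assumptions about your setup):

   java -Xms5g -Xmx5g -XX:+UseG1GC -XX:MaxGCPauseMillis=200 \
        -XX:+ParallelRefProcEnabled -jar start.jar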


Cheers

François


> On Feb 24, 2015, at 7:45 PM, Tang, Rebecca  wrote:
> 
> We gave the machine 180G mem to see if it improves performance.  However,
> after we increased the memory, Solr started using only 5% of the physical
> memory.  It has always used 90-something%.
> 
> What could be causing solr to not grab all the physical memory (grabbing
> so little of the physical memory)?
> 
> 
> Rebecca Tang
> Applications Developer, UCSF CKM
> Industry Documents Digital Libraries
> E: rebecca.t...@ucsf.edu
> 
> 
> 
> 
> 
> On 2/24/15 12:44 PM, "Shawn Heisey"  wrote:
> 
>> On 2/24/2015 1:09 PM, Tang, Rebecca wrote:
>>> Our solr index used to perform OK on our beta production box (anywhere
>>> between 0-3 seconds to complete any query), but today I noticed that the
>>> performance is very bad (queries take between 12 - 15 seconds).
>>> 
>>> I haven't updated the solr index configuration
>>> (schema.xml/solrconfig.xml) lately.  All that's changed is the data -
>>> every month, I rebuild the solr index from scratch and deploy it to the
>>> box.  We will eventually go to incremental builds. But for now, all
>>> indexes are built from scratch.
>>> 
>>> Here are the stats:
>>> Solr index size 183G
>>> Documents in index 14364201
>>> We just have single solr box
>>> It has 100G memory
>>> 500G Harddrive
>>> 16 cpus
>> 
>> The bottom line on this problem, and I'm sure it's not something you're
>> going to want to hear:  You don't have enough memory available to cache
>> your index.  I'd plan on at least 192GB of RAM for an index this size,
>> and 256GB would be better.
>> 
>> Depending on the exact index schema, the nature of your queries, and how
>> large your Java heap for Solr is, 100GB of RAM could be enough for good
>> performance on an index that size ... or it might be nowhere near
>> enough.  I would imagine that one of two things is true here, possibly
>> both:  1) Your queries are very complex and involve accessing a very
>> large percentage of the index data.  2) Your Java heap is enormous,
>> leaving very little RAM for the OS to automatically cache the index.
>> 
>> Adding more memory to the machine, if that's possible, might fix some of
>> the problems.  You can find a discussion of the problem here:
>> 
>> http://wiki.apache.org/solr/SolrPerformanceProblems
>> 
>> If you have any questions after reading that wiki article, feel free to
>> ask them.
>> 
>> Thanks,
>> Shawn
>> 
> 



Re: American British Dictionary for Solr

2015-02-12 Thread François Schiettecatte
Dinesh


See this:

http://wordlist.aspell.net/varcon/

You will need to do some work to convert to a SOLR friendly format though.
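
For example, the usual target is a synonyms.txt consumed by 
SynonymFilterFactory, one variant group per line (the entries and filter line 
below are illustrative, not taken from VarCon):

   color, colour
   analyze, analyse
   center, centre

   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
           ignoreCase="true" expand="true"/>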

Cheers

François

> On Feb 12, 2015, at 12:22 AM, dinesh naik  wrote:
> 
> Hi ,
> We are looking for a dictionary to support American/British English synonym.
> Could you please let us know what all dictionaries are available ?
> -- 
> Best Regards,
> Dinesh Naik



Re: Solr: How to delete a document

2014-09-13 Thread François Schiettecatte
How about adding 'expungeDeletes=true' as well as 'commit=true'?
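
For example, using the field and value from your select URL (the core name 
matches your query; adjust as needed):

   curl "http://localhost:8983/solr/lexikos/update?commit=true&expungeDeletes=true" \
        -H "Content-Type: text/xml" \
        -d "<delete><query>phrase:qwerty</query></delete>"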

François

On Sep 13, 2014, at 4:09 PM, FiMka  wrote:

> Hi guys, could you say how to delete a document in Solr? After I delete a
> document it still persists in the search results. For example there is the
> following document saved in Solr:
> After I POST the following data to localhost:8983/solr/update/?commit=true:
> Solr each time says 200 OK and responds the following:
> If I try to search
> localhost:8983/solr/lexikos/select?q=phrase%3A+%22qwerty%22&wt=json&indent=true
> for the document once again, it still shown in the results. So how to remove
> the document from Solr index as well or what else to do? Thanks in advance
> for any assistance!
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-How-to-delete-a-document-tp4158649.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Date field related query

2014-09-02 Thread François Schiettecatte
How about:

datefield:[NOW-1DAY/DAY TO *]
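
For reference, the /DAY rounds down to midnight, so (using the same field 
name):

   datefield:[NOW/DAY TO *]        everything indexed since midnight today
   datefield:[NOW-1DAY/DAY TO *]   everything since midnight yesterday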

François

On Sep 2, 2014, at 6:54 AM, Aman Tandon  wrote:

> Hi,
> 
> I did it using this, fq=datefield:[2014-09-01T23:59:59Z TO
> 2014-09-02T23:59:59Z].
> Correct me if i am wrong.
> 
> Is there any way to find this using the NOW?
> 
> 
> With Regards
> Aman Tandon
> 
> 
> On Tue, Sep 2, 2014 at 4:08 PM, Aman Tandon  wrote:
> 
>> Hi,
>> 
>> I am working on date and i want to find all those records which are
>> indexed today.
>> 
>> With Regards
>> Aman Tandon
>> 



Re: Random OOM Exceptions

2014-08-14 Thread François Schiettecatte
I would also get some metrics when SOLR is doing nothing; the JVM does do work 
in the background, and looking at the memory graph in VisualVM will show a nice 
sawtooth.
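
If VisualVM isn't handy, jstat gives the same picture from the command line 
(the PID placeholder is whatever your Solr process is):

   jstat -gcutil <solr-pid> 5s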

François


On Aug 14, 2014, at 1:16 PM, Erick Erickson  wrote:

> bq: I just don’t know why Solr is suddenly going nuts.
> 
> Hmmm, as Shawn says, hard to say at this remove. But
> I've personally doubled the memory requirements for Solr
> on the _same_ index by altering the query to a pathological
> one. Something like
> q=*:*&facet.field=whatever
> where the field "whatever" contains a billion unique strings is
> an example of a pathological query.
> 
> So you may have to do the ugly work of correlating memory spikes
> with the queries just prior to the spike. Which you should be able
> to do from the Solr logs.
> 
> Sorry I can't be more help...
> Erick
> 
> On Thu, Aug 14, 2014 at 9:45 AM, Shawn Heisey  wrote:
>> On 8/14/2014 10:06 AM, Scott Rankin wrote:
>>> My question was actually more about what in Solr might cause the
>>> server to suddenly go from a very consistent heap size of 300-400 MB
>>> to over 2 GB in a matter of minutes with no changes in traffic. I get
>>> why the VM is crashing, I just don’t know why Solr is suddenly going nuts.
>> 
>> That's nearly impossible to answer.  Chances are that something has
>> changed about the requests that Solr is receiving and now it's required
>> to do something that it wasn't before, something that uses a lot of heap
>> memory.
>> 
>> The other likely possibilities are:
>> 
>> * There's a bug in your solr version or in some software component that
>> you are using with Solr.  That can include the Java virtual machine, the
>> servlet container, and/or any third-party Solr components.
>> 
>> * You were running on the hairy edge of heap usage already, and
>> something (a traffic increase, a slight change to your requests) pushed
>> you over the edge into OutOfMemory.
>> 
>> Thanks,
>> Shawn
>> 



Re: Character encoding problems

2014-07-29 Thread François Schiettecatte
Hi

If you are seeing "appelé au téléphone" in the browser, I would guess that 
the data is being rendered in UTF-8 by your server and the content type of the 
html is set to iso-8859-1 or not being set and your browser is defaulting to 
iso-8859-1. 

You can force the encoding to utf-8 in the browser, usually this is a menu item 
(in Chrome/Safari/Firefox).

FWIW having messed around with this kind of stuff in the past, I always 
generate utf-8 and always set the HTML content type to utf-8 with:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Cheers

François


On Jul 29, 2014, at 3:59 PM, Gulliver Smith  wrote:

> Thanks for the information about URIEncoding="UTF-8" in the tomcat
> conf file, but that doesn't answer my main concerns:
> - what is the character encoding of the text in the title_fr field?
> - is there any way to force it to be UTF-8?
> 
> On Tue, Jul 29, 2014 at 8:35 AM,   wrote:
>> Hi,
>> 
>> If you use solr 4.8.1, you don't have to add URIEncoding="UTF-8" in the
>> tomcat conf file anymore :
>> https://wiki.apache.org/solr/SolrTomcat
>> 
>> 
>> Regards,
>> 
>> Aurélien MAZOYER
>> 
>> 
>> On 29.07.2014 14:22, Gulliver Smith wrote:
>>> 
>>> I have solr 4.8.1 under Tomcat 7 on Debian Linux. The connector in
>>> Tomcat's server.xml has been changed to include character encoding
>>> UTF-8:
>>> 
>>> <Connector ... URIEncoding="UTF-8"
>>>   connectionTimeout="2"
>>>   redirectPort="8443" />
>>> 
>>> 
>>> I am posting to the server from PHP 5.5 curl. The extract POST was
>>> intercepted and confirmed that everything is being encode in UTF-8.
>>> 
>>> However, the responses to query commands, whether XML or JSON are
>>> returning field values such as title_fr in something that looks like
>>> latin1 or iso-8859-1 when displayed in a browser or editor.
>>> 
>>> E.g.: "title_fr":[" appelé au téléphone"]
>>> 
>>> The highlights in the query response do have correctly displaying
>>> character codes.
>>> 
>>> E.g. "text_fr":[" \n \n  \n  \n  \n  \n  \n  \n  \n \n \nappelé au
>>> téléphone\nappelé au téléphone\n
>>> 
>>> PHP's utf8_decode doesn't make sense of the title_fr.
>>> 
>>> Is there something to configure to fix this and get proper UTF8
>>> results for everything?
>>> 
>>> Thanks
>>> Gulliver



Re: Java heap space error

2014-07-24 Thread François Schiettecatte
A default garbage collector will be chosen for you by the VM; it might help to 
get the stack trace so we can take a look.
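
To see which collector the VM picked, this prints the ergonomics-selected 
flags:

   java -XX:+PrintCommandLineFlags -version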

François

On Jul 24, 2014, at 10:06 AM, Ameya Aware  wrote:

> ooh ok.
> 
> So you want to say that since i am using large heap but didnt set my
> garbage collection, thats why i why getting java heap space error?
> 
> 
> 
> 
> 
> On Thu, Jul 24, 2014 at 9:58 AM, Marcello Lorenzi 
> wrote:
> 
>> I think that on large heap is suggested to monitor the garbage collection
>> behavior and try to add a strategy adapted to your performance.  On my
>> production environment with a heap of 6 GB I set this parameter (server
>> with 8 cores):
>> 
>> -server -Xms6144m -Xmx6144m -XX:MaxPermSize=512m
>> -Dcom.sun.management.jmxremote -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
>> -XX:+CMSIncrementalMode -XX:+CMSParallelRemarkEnabled
>> -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70
>> -XX:ConcGCThreads=6 -XX:ParallelGCThreads=6
>> 
>> Marcello
>> 
>> 
>> On 07/24/2014 03:53 PM, Ameya Aware wrote:
>> 
>> I did not make any other change than this.. rest of the settings are
>> default.
>> 
>> Do i need to set garbage collection strategy?
>> 
>> 
>> On Thu, Jul 24, 2014 at 9:49 AM, Marcello Lorenzi 
>> wrote:
>> 
>>> Hi,
>>> Did you set a Garbage collection strategy on your JVM ?
>>> 
>>> Marcello
>>> 
>>> 
>>> On 07/24/2014 03:32 PM, Ameya Aware wrote:
>>> 
 Hi
 
 I am in process of indexing around 2,00,000 documents.
 
 I have increase java jeap space to 4 GB using below command :
 
 java -Xmx4096M -Xms4096M -jar start.jar
 
 Still after indexing around 15000 documents it gives java heap space
 error
 again.
 
 
 Any fix for this?
 
 Thanks,
 Ameya
 
 
>>> 
>> 
>> 



Re: Garbage collection issue and RELOADing cores

2014-07-01 Thread François Schiettecatte
Hi

Just following up on my previous post about a memory leak when RELOADing cores, 
I narrowed it down to the SuggestComponent, specifically '...' in 
solrconfig.xml. Comment that out and the leak goes away.

The leak occurs in 4.7, 4.8 and 4.9. It occurs when a core is RELOADed, but not 
if it is UNLOADed and then LOADed. It occurs whether G1, CMS or ParallelGC is 
used for garbage collection.
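
In the meantime, the UNLOAD/CREATE cycle can be scripted against the Core Admin 
API, along these lines (core name and instanceDir are placeholders):

   curl "http://localhost:8983/solr/admin/cores?action=UNLOAD&core=core1"
   curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=core1&instanceDir=core1"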

I used JDK 1.7.0_60 and Tomcat 7.0.54 for the underlying layers.

Not sure where to take it from here?

Cheers

François


On Jun 16, 2014, at 4:50 PM, François Schiettecatte  
wrote:

> Hi
> 
> I am running into an interesting garbage collection issue and am looking for 
> suggestions/thoughts. 
> 
> Because some word lists such as synonyms, plurals, protected words need to be 
> updated on a regular basis I have to RELOAD a number of cores in order to 
> 'pick up' the new lists. 
> 
> What I have found is that I get a memory leak when I do a RELOAD rather than 
> an UNLOAD/CREATE with core admin. This is most pronounced with the G1 GC and 
> much less so with the CMS GC. The former will cause the VM to run out of 
> memory after 5/6 RELOADs, while the latter does so after 30/35 RELOADs. We 
> are not talking about large indices here, the files footprint totals 470MB.
> 
> I am using SOLR 4.8.1, Tomcat 7.0.53, jdk1.7.0_60, on Fedora Core 20. I am 
> not using any fancy GC parameters, I cut everything back to basics, just:
> 
>   -Xmx1G -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
> 
> and 
> 
>   -Xmx1G -XX:+UseG1GC
> 
> I was curious if anyone else had run into this issue and managed to fix it?
> 
> Thanks
> 
> François
> 
> 
> 



Garbage collection issue and RELOADing cores

2014-06-16 Thread François Schiettecatte
Hi

I am running into an interesting garbage collection issue and am looking for 
suggestions/thoughts. 

Because some word lists such as synonyms, plurals, and protected words need to 
be updated on a regular basis, I have to RELOAD a number of cores in order to 
'pick up' the new lists. 

What I have found is that I get a memory leak when I do a RELOAD rather than an 
UNLOAD/CREATE with core admin. This is most pronounced with the G1 GC and much 
less so with the CMS GC. The former will cause the VM to run out of memory 
after 5/6 RELOADs, while the latter does so after 30/35 RELOADs. We are not 
talking about large indices here, the files footprint totals 470MB.

I am using SOLR 4.8.1, Tomcat 7.0.53, jdk1.7.0_60, on Fedora Core 20. I am not 
using any fancy GC parameters, I cut everything back to basics, just:

-Xmx1G -XX:+UseConcMarkSweepGC -XX:+UseParNewGC

and 

-Xmx1G -XX:+UseG1GC

I was curious if anyone else had run into this issue and managed to fix it?

Thanks

François





Re: Any way to view lucene files

2014-06-09 Thread François Schiettecatte
Just click the 'Releases' link:

https://github.com/DmitryKey/luke/releases

François

On Jun 9, 2014, at 10:43 AM, Aman Tandon  wrote:

> No, Anyways thanks Alex, but where is the luke jar?
> 
> With Regards
> Aman Tandon
> 
> 
> On Mon, Jun 9, 2014 at 6:54 AM, Alexandre Rafalovitch 
> wrote:
> 
>> Have you looked at:
>> https://github.com/DmitryKey/luke
>> 
>> Regards,
>>   Alex.
>> Personal website: http://www.outerthoughts.com/
>> Current project: http://www.solr-start.com/ - Accelerating your Solr
>> proficiency
>> 
>> 
>> On Mon, Jun 9, 2014 at 8:12 AM, Aman Tandon 
>> wrote:
>>> I guess this is not available now. I am trying to download from the
>> google,
>>> please take a look https://code.google.com/p/luke/downloads/list
>>> 
>>> If you have any link please share
>>> 
>>> With Regards
>>> Aman Tandon
>>> 
>>> 
>>> On Sat, Jun 7, 2014 at 10:32 PM, Summer Shire 
>> wrote:
>>> 
 
 Did u try  luke 47
 
 
 
> On Jun 6, 2014, at 11:59 PM, Aman Tandon 
 wrote:
> 
> I also tried with solr 4.2 and with luke version Luke 4.0.0-ALPHA
> 
> but got this error:
> java.lang.IllegalArgumentException: A SPI class of type
> org.apache.lucene.codecs.Codec with name 'Lucene42' does not exist.
>> You
> need to add the corresponding JAR file supporting this SPI to your
> classpath.The current classpath supports the following names:
>> [Lucene40,
> Lucene3x, SimpleText, Appending]
> 
> With Regards
> Aman Tandon
> 
> 
> On Sat, Jun 7, 2014 at 12:22 PM, Aman Tandon >> 
> wrote:
> 
>> My solr version is 4.8.1 and luke is 3.5
>> 
>> With Regards
>> Aman Tandon
>> 
>> 
>> On Sat, Jun 7, 2014 at 12:21 PM, Chris Collins >> 
>> wrote:
>> 
>>> What version of Solr / Lucene are you using?  You have to match the
 Luke
>>> version to the same version of Lucene.
>>> 
>>> C
 On Jun 6, 2014, at 11:42 PM, Aman Tandon 
 wrote:
 
 Yes  tried, but it not working at all every time i choose my index
 directory it shows me EOF past
 
 With Regards
 Aman Tandon
 
 
> On Sat, Jun 7, 2014 at 12:01 PM, Chris Collins <
>> ch...@geekychris.com
> 
 wrote:
 
> Have you tried:
> 
> https://code.google.com/p/luke/
> 
> Best
> 
> Chris
> On Jun 6, 2014, at 11:24 PM, Aman Tandon >> 
>>> wrote:
> 
>> Hi,
>> 
>> Is there any way so that i can view what information and which is
>>> there
> in
>> my _e.fnm, etc files. may be with the help of any application or
>> any
> viewer
>> tool.
>> 
>> With Regards
>> Aman Tandon
>> 
 
>> 



Re: OutOfMemoryError while merging large indexes

2014-04-08 Thread François Schiettecatte
Have you tried using:

-XX:-UseGCOverheadLimit 
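
For example, added to your existing heap settings (assuming Solr is started 
with start.jar):

   java -Xmx4096M -Xms512M -XX:-UseGCOverheadLimit -jar start.jar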

François

On Apr 8, 2014, at 6:06 PM, Haiying Wang  wrote:

> Hi,
> 
> We were trying to merge a large index (9GB, 21 million docs) into current 
> index (only 13MB), using mergeindexes command ofCoreAdminHandler, but always 
> run into OOM error. We currently set the max heap size to 4GB for the Solr 
> server. We are using 4.6.0, and did not change the original solrconfig.xml. 
> 
> Is there any setting/configure that could help to complete the mergeindexes 
> process without running into OOM error? I can increase the max jvm heap size, 
> but am afraid that may not scale in case larger index need to be merged in 
> the future, and hoping the index merge can be performed with limited memory 
> foorprint. Please help. Thanks!
> 
> The jvm heap setting:   -Xmx4096M -Xms512M
> 
> Command used:
> 
> 
> curl 
> "http://dev101:8983/solr/admin/cores?action=mergeindexes&core=collection1&indexDir=/solr/tmp/data/snapshot.20140407194442777";
> 
> OOM error stack trace:
> 
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
> at
> java.lang.StringCoding$StringDecoder.decode(StringCoding.java:133)
> at java.lang.StringCoding.decode(StringCoding.java:179)
> at java.lang.String.<init>(String.java:483)
> at java.lang.String.<init>(String.java:539)
> at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.readField(CompressingStoredFieldsReader.java:187)
> at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:351)
> at 
> org.apache.lucene.index.SegmentReader.document(SegmentReader.java:276)
> at
> org.apache.lucene.index.IndexReader.document(IndexReader.java:436)
> at 
> org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.merge(CompressingStoredFieldsWriter.java:345)
> at 
> org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:316)
> at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:94)
> at 
> org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:2555)
> at 
> org.apache.solr.update.DirectUpdateHandler2.mergeIndexes(DirectUpdateHandler2.java:449)
> at 
> org.apache.solr.update.processor.RunUpdateProcessor.processMergeIndexes(RunUpdateProcessorFactory.java:88)
> at
> org.apache.solr.update.processor.UpdateRequestProcessor.processMergeIndexes(UpdateRequestProcessor.java:59)
> at 
> org.apache.solr.update.processor.LogUpdateProcessor.processMergeIndexes(LogUpdateProcessorFactory.java:149)
> at 
> org.apache.solr.handler.admin.CoreAdminHandler.handleMergeAction(CoreAdminHandler.java:384)
> at 
> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:188)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:662)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
> at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
> at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
> at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
> at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
> at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
> at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> 
> Regards,
> 
> Haiying





Re: Reading Solr index

2014-04-07 Thread François Schiettecatte
Maybe you should try a more recent release of Luke:

https://github.com/DmitryKey/luke/releases

François

On Apr 7, 2014, at 12:27 PM, azhar2007  wrote:

> Hi All,
> 
> I have a solr index which is indexed ins Solr.4.7.0.
> 
> Ive attempted to open the index with Luke4.0.0 and also other verisons with
> no luck.
> Gives me an error message.
> 
> Is there a way of reading the data?
> 
> I would like to convert the file to a readable format where i can see the
> terms it holds from the documents etc. 
> 
> Please Help!!
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Reading-Solr-index-tp4129662.html
> Sent from the Solr - User mailing list archive at Nabble.com.





Re: The word "no" in a query

2014-04-02 Thread François Schiettecatte
Have you looked at the debugging output?

http://wiki.apache.org/solr/CommonQueryParameters#Debugging
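
For example, appending it to the failing query (the URL is illustrative):

   http://localhost:8983/solr/select?q=No+AND+Sign&debugQuery=true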

François

On Apr 2, 2014, at 1:37 AM, Bob Laferriere  wrote:

> 
> I have built an commerce search engine. I am struggling with the word “no” in 
> queries. We have products that are “No Smoking Sign.” When the query is 
> “Smoking AND Sign” the product is found. If I query as “No AND Sign” I get no 
> results? I do not have no as a stop word. Any ideas why I would get zero 
> results back?
> 
> Regards,
> 
> Bob





Re: AND not as a boolean operator in Phrase

2014-03-25 Thread François Schiettecatte
Better to use '+A +B' rather than AND/OR; see:

http://searchhub.org/2011/12/28/why-not-and-or-and-not/

François

On Mar 25, 2014, at 10:21 PM, Koji Sekiguchi  wrote:

> (2014/03/26 2:29), abhishek jain wrote:
>> hi friends,
>> 
>> when i search for "A and B" it gives me result for A , B , i am not sure
>> why?
>> 
>> Please guide how can i exact match when it is within phrase/quotes.
> 
> Generally speaking (w/ LuceneQParser), if you want phrase match results,
> use quotes, i.e. q="A B". If you want results which contain both terms A
> and B, do not use quotes but boolean operator AND, i.e. q=A AND B.
> 
> koji
> -- 
> http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html





Re: Solr cores across multiple machines

2013-12-17 Thread François Schiettecatte
Hi

Why not copy the core directory instead of the data directory? The conf 
directory is very small and that would ensure that you don't get schema 
mismatch issues.

If you are stuck with copying the data directory, then I would replace the data 
directory in the target core and reload that core, though I would guess that 
YMMV given that this is probably not supported.
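
A RELOAD after swapping the directory would look like this (the core name is a 
placeholder):

   curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1"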

François

On Dec 17, 2013, at 1:35 AM, sivaprasad  wrote:

> Hi,
> 
> In my project, we are doing full index on dedicated machine and the index
> will be copied to other search serving machine. For this, we are copying the
> data folder from indexing machine to serving machine manually. Now, we
> wanted to use Solr's SWAP configuration to do this job. Looks like the SWAP
> will work between the cores. Based on our setup, any one has any idea how to
> move the data from indexing machine to serving machine? Is there any other
> alternatives?
> 
> Regards,
> Siva
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-cores-across-multiple-machines-tp4107035.html
> Sent from the Solr - User mailing list archive at Nabble.com.



signature.asc
Description: Message signed with OpenPGP using GPGMail


Re: Stop/Restart Solr

2013-10-22 Thread François Schiettecatte
Yago has the right command to search for the process; that will get you the 
process ID, specifically the first number on the output line. Then do 'kill 
###', and if that fails, 'kill -9 ###'.
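
For example, assuming Solr was started with start.jar:

   ps aux | grep start.jar
   kill <pid>
   kill -9 <pid>    # only if the plain kill fails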

François

On Oct 22, 2013, at 12:56 PM, Raheel Hasan  wrote:

> its CentOS...
> 
> and using jetty with solr here..
> 
> 
> On Tue, Oct 22, 2013 at 9:54 PM, François Schiettecatte <
> fschietteca...@gmail.com> wrote:
> 
>> A few more specifics about the environment would help, Windows/Linux/...?
>> Jetty/Tomcat/...?
>> 
>> François
>> 
>> On Oct 22, 2013, at 12:50 PM, Yago Riveiro  wrote:
>> 
>>> If you are asking about if solr has a way to restart himself, I think
>> that the answer is no.
>>> 
>>> If you lost control of the remote machine someone will need to go and
>> restart the machine ...
>>> 
>>> You can try use a kvm or other remote control system
>>> 
>>> --
>>> Yago Riveiro
>>> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
>>> 
>>> 
>>> On Tuesday, October 22, 2013 at 5:46 PM, François Schiettecatte wrote:
>>> 
>>>> If you are on linux/unix, use the kill command.
>>>> 
>>>> François
>>>> 
>>>> On Oct 22, 2013, at 12:42 PM, Raheel Hasan 
>>>> > raheelhasan@gmail.com)> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> is there a way to stop/restart java? I lost control over it via SSH and
>>>>> connection was closed. But the Solr (start.jar) is still running.
>>>>> 
>>>>> thanks.
>>>>> 
>>>>> --
>>>>> Regards,
>>>>> Raheel Hasan
>>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 
> -- 
> Regards,
> Raheel Hasan



Re: Stop/Restart Solr

2013-10-22 Thread François Schiettecatte
A few more specifics about the environment would help, Windows/Linux/...? 
Jetty/Tomcat/...?

François

On Oct 22, 2013, at 12:50 PM, Yago Riveiro  wrote:

> If you are asking about if solr has a way to restart himself, I think that 
> the answer is no.
> 
> If you lost control of the remote machine someone will need to go and restart 
> the machine ...
> 
> You can try use a kvm or other remote control system
> 
> --  
> Yago Riveiro
> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> 
> 
> On Tuesday, October 22, 2013 at 5:46 PM, François Schiettecatte wrote:
> 
>> If you are on linux/unix, use the kill command.
>> 
>> François
>> 
>> On Oct 22, 2013, at 12:42 PM, Raheel Hasan > (mailto:raheelhasan@gmail.com)> wrote:
>> 
>>> Hi,
>>> 
>>> is there a way to stop/restart java? I lost control over it via SSH and
>>> connection was closed. But the Solr (start.jar) is still running.
>>> 
>>> thanks.
>>> 
>>> --  
>>> Regards,
>>> Raheel Hasan
>>> 
>> 
>> 
>> 
> 
> 



Re: Stop/Restart Solr

2013-10-22 Thread François Schiettecatte
If you are on linux/unix, use the kill command.

François

On Oct 22, 2013, at 12:42 PM, Raheel Hasan  wrote:

> Hi,
> 
> is there a way to stop/restart java? I lost control over it via SSH and
> connection was closed. But the Solr (start.jar) is still running.
> 
> thanks.
> 
> -- 
> Regards,
> Raheel Hasan



Re: Solr timeout after reboot

2013-10-21 Thread François Schiettecatte
Well no, the OS is smarter than that; it manages the file system cache along 
with other memory requirements. If applications need more memory then the file 
system cache will likely be reduced. 

The command is a cheap trick to get the OS to fill the file system cache as 
quickly as possible; not sure how much it will help, though, with a 100GB index 
on a 15GB machine. This might work if you 'cat' the index files other than the 
'.fdx' and '.fdt' files.
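
A sketch of that selective warm-up (the index path is a placeholder):

   find /path/to/solr/index -type f ! -name '*.fdx' ! -name '*.fdt' \
        -exec cat {} + > /dev/null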

François

On Oct 21, 2013, at 10:03 AM, michael.boom  wrote:

> I'm using the m3.xlarge server with 15G RAM, but my index size is over 100G,
> so I guess putting running the above command would bite all available
> memory.
> 
> 
> 
> -
> Thanks,
> Michael
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-timeout-after-reboot-tp4096408p4096827.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Exact Match Results

2013-10-21 Thread François Schiettecatte
Kumar

You might want to look into the 'pf' parameter:


https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser
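
For example, with edismax something like this boosts documents where the whole 
query appears as a phrase (the field names are assumptions):

   q=Okkadu telugu movie&defType=edismax&qf=title&pf=title^50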

François

On Oct 21, 2013, at 9:24 AM, kumar  wrote:

> I am querying solr for exact match results. But it is showing some other
> results also.
> 
> Examle :
> 
> User Query String : 
> 
> Okkadu telugu movie
> 
> Results :
> 
> 1.Okkadu telugu movie
> 2.Okkadunnadu telugu movie
> 3.YuganikiOkkadu telugu movie
> 4.Okkadu telugu movie stills
> 
> 
> how can we order these results that 4th result has to come second.
> 
> 
> Please anyone can you give me any idea?
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Exact-Match-Results-tp4096816.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr timeout after reboot

2013-10-21 Thread François Schiettecatte
To put the file data into the file system cache, which would make for faster access.

François


On Oct 21, 2013, at 8:33 AM, michael.boom  wrote:

> Hmm, no, I haven't...
> 
> What would be the effect of this ?
> 
> 
> 
> -
> Thanks,
> Michael
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-timeout-after-reboot-tp4096408p4096809.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Can I use app specific document id as the document id that Solr uses for internal purposes?

2013-10-06 Thread François Schiettecatte
Hi

The approach I take is to store enough data in the SOLR index to render the 
results page, and go to the database if the user wants to view a document. 

Cheers

François

On Oct 6, 2013, at 9:45 AM, user 01  wrote:

> @Gora:
> you understood the schema correctly, but I can't believe it's strange but i
> think it is actually the recommended way.. you index your data but don't
> store in a Search engine, you store your actual data in DB, which is the
> right place for it. Data in SE should be just used for indexing. Isn't it ?
> 
> @maephisto: ok, thanks!
> 
> 
> On Sun, Oct 6, 2013 at 6:07 PM, Gora Mohanty  wrote:
> 
>> On 6 October 2013 16:36, Ertio Lew  wrote:
>>> I meant that solr should not be thinking that it has to retrieve any
>> thing
>>> further (as in any stored document data) after once it gets the doc id,
>> so
>>> that one further look up for doc data is prevented.
>> [...]
>> 
>> If I understood your setup correctly, the doc ID is the only field
>> in the Solr schema, and the only data stored in the Solr index.
>> So there is no question of recovering any other data.
>> 
>> Having said that, this is a strange setup and seems to defeat the
>> whole purpose of a search engine. Maybe you could explain further
>> as to what you are trying to achieve: What does storing only doc
>> IDs in Solr gain you? You could as well get these from a database
>> lookup  which it seems that you would be doing anyway.
>> 
>> Regards,
>> Gora
>> 



Re: setQuery in SolrJ

2013-09-02 Thread François Schiettecatte
Shouldn't the search be more like this if you are searching in the 
'descricaoRoteiro' field:

descricaoRoteiro:(BPS 8D BEACH*)

or in your example only the first term binds to the field and the rest ('8D', 'BEACH*') fall through to the default field, which is what triggers the 'df' error:

descricaoRoteiro:BPS 8D BEACH*
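
In SolrJ that would be something like this (a sketch based on your snippet; 
choose the phrase form for an exact match, or the grouped form if you need the 
trailing wildcard):

   SolrQuery query = new SolrQuery();
   // exact phrase match on the field (wildcards don't work inside quotes):
   query.setQuery("descricaoRoteiro:\"BPS 8D BEACH\"");
   // or bind all terms to the field, keeping the prefix match:
   // query.setQuery("descricaoRoteiro:(BPS 8D BEACH*)");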

François


On Sep 2, 2013, at 8:08 AM, Dmitry Kan  wrote:

> Hi,
> 
> What's your default query field in solrconfig.xml?
> 
> 
> [WHAT IS IN HERE?]
> 
> I think what's happening is that the query:
> 
> (descricaoRoteiro: BPS 8D BEACH*)
> 
> gets interpreted as:
> 
> descricaoRoteiro:BPS (8D BEACH*)
> 
> then on the (8D BEACH*) a default field name is applied.
> 
> You can use debugQuery parameter to see how the query was parsed.
> 
> HTH,
> Dmitry
> 
> 
> On Mon, Sep 2, 2013 at 2:53 PM, Sergio Stateri  wrote:
> 
>> hi,
>> 
>> How can I looking for an exact phrase in query.setQuery method (SolrJ)?
>> 
>> Like this:
>> 
>> SolrQuery query = new SolrQuery();
>> query.setQuery( "(descricaoRoteiro: BPS 8D BEACH*)" );
>> query.set("start", "200");
>> query.set("rows", "10");
>> query.addField("descricaoRoteiro");
>> QueryResponse rsp = server.query( query );
>> 
>> 
>> When I run this code, the following exception is thrown:
>> 
>> Exception in thread "main"
>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: no
>> field name specified in query and no default specified via 'df' param
>> at
>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:424)
>> at
>> 
>> org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
>> at
>> 
>> org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:90)
>> at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
>> at
>> 
>> com.teste.SearchRoteirosFromCollection.extrairEApresentarResultados(SearchRoteirosFromCollection.java:65)
>> ...
>> 
>> 
>> But If I search one a word od put * between two words, the search works
>> fine.
>> 
>> 
>> Thanks in advance,
>> 
>> 
>> --
>> Sergio Stateri Jr.
>> stat...@gmail.com
>> 





Re: Mandatory words search in SOLR

2013-05-13 Thread François Schiettecatte
Kamal

You could also use the 'mm' parameter to require a minimum match, or you could 
prepend '+' to each required term.
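
For example, either of these requires both terms (mm assumes dismax/edismax):

   q=java mysql&defType=edismax&mm=100%
   q=+java +mysql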

Cheers

François


On May 13, 2013, at 7:57 AM, Kamal Palei  wrote:

> Hi Rafał Kuć
> I added q.op=AND as per you suggested. I see though some initial record
> document contains both keywords (*java* and *mysql*), towards end I see
> still there are number of
> documents, they have only one key word either *java* or *mysql*.
> 
> Is it the SOLR behaviour or can I ask for a *strict search only if all my
> keywords are present, then only* *fetch record* else not.
> 
> BR,
> Kamal
> 
> 
> 
> On Mon, May 13, 2013 at 4:02 PM, Rafał Kuć  wrote:
> 
>> Hello!
>> 
>> Change  the  default  query  operator. For example add the q.op=AND to
>> your query.
>> 
>> --
>> Regards,
>> Rafał Kuć
>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch
>> 
>>> Hi SOLR Experts
>>> When I search documents with keyword as *java, mysql* then I get the
>>> documents containing either *java* or *mysql* or both.
>> 
>>> Is it possible to get the documents those contains both *java* and
>> *mysql*.
>> 
>>> In that case, how the query would look like.
>> 
>>> Thanks a lot
>>> Kamal
>> 
>> 



Bug in query parser?

2012-12-28 Thread François Schiettecatte
Hi

Just ran into this bug while playing around with 3.6. Using edismax and 
entering a search like this "(text:foobar)" causes the query parser to mangle 
the query, as shown by the results below. Adding a space after the first paren 
solves this. I checked 3.6.1 and get the same issue. I recall an issue like 
this in 3.6.0 but thought it was quashed in 3.6.1?




[request params elided by the list archive; they included q=(text:foobar), 
defType=edismax, fl=*,score, rows=10 and debug enabled, with the boosts 
number^5 title^3 text and text^2; the response header showed status 0, QTime 2]

rawquerystring: (text:foobar)
querystring: (text:foobar)

parsedquery: +DisjunctionMaxQuery((text:textfoobar | title:textfoobar^3.0 | 
number:text:foobar^5.0)) ()

parsedquery_toString: +(text:textfoobar | title:textfoobar^3.0 | 
number:text:foobar^5.0) ()

QParser: ExtendedDismaxQParser


Cheers

François

Re: Indexing only on change

2012-11-24 Thread François Schiettecatte
I would create a hash of the document content and store that in SOLR along with 
any document info you wish to store. When a document is presented for indexing, 
hash it and compare against the stored document's hash; index if they differ 
and skip if they do not.
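
A minimal sketch of the hashing step in Java (the field you store the digest in 
and the comparison against it are up to you):

   import java.nio.charset.StandardCharsets;
   import java.security.MessageDigest;

   // Returns a hex SHA-256 digest of the document content; store it in a
   // Solr field and compare before deciding whether to re-index.
   static String contentHash(String content) throws Exception {
       MessageDigest md = MessageDigest.getInstance("SHA-256");
       StringBuilder sb = new StringBuilder();
       for (byte b : md.digest(content.getBytes(StandardCharsets.UTF_8))) {
           sb.append(String.format("%02x", b));
       }
       return sb.toString();
   }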

François
 

On Nov 24, 2012, at 3:30 PM, Pratyul Kapoor  wrote:

> Hi,
> 
> I just discovered that solr while editing a particular field of a document,
> removes the entire document and recreates.
> 
> I have a list of 1000s of documents to be indexed. But I am aware that only
> some of those documents would be changed and rest all would already be
> there. Is there any way, I can check whether the incoming and already
> existing document is same, and there is no need of indexing it again.
> 
> Pratyul



Re: Is leading wildcard search turned on by default in Solr 3.6.1?

2012-11-12 Thread François Schiettecatte
I suspect it is just part of the wildcard handling; maybe someone can chime in 
here. You may need to catch this before it gets to SOLR.

François

On Nov 12, 2012, at 5:44 PM, johnmu...@aol.com wrote:

> Thanks for the quick response.
> 
> 
> So, I do not want to use ReversedWildcardFilterFactory, but leading wildcard 
> is working and thus is ON by default.  How do I disable it to prevent the use 
> of it and the issues that come with it?
> 
> 
> -- MJ
> 
> 
> 
> -Original Message-
> From: François Schiettecatte 
> To: solr-user 
> Sent: Mon, Nov 12, 2012 5:39 pm
> Subject: Re: Is leading wildcard search turned on by default in Solr 3.6.1?
> 
> 
> John
> 
> You can still use leading wildcards even if you dont have the 
> ReversedWildcardFilterFactory in your analysis but it means you will be 
> scanning 
> the entire dictionary when the search is run which can be a performance 
> issue. 
> If you do use ReversedWildcardFilterFactory you wont have that performance 
> issue 
> but you will increase the overall size of your index. Its a tradeoff. 
> 
> When I looked into it for a site I built I decided that the tradeoff was not 
> worth it (after benchmarking) given how few leading wildcards searches it was 
> getting.
> 
> Best regards
> 
> François
> 
> 
> On Nov 12, 2012, at 5:33 PM, johnmu...@aol.com wrote:
> 
>> 
>> 
>> Hi,
>> 
>> 
>> I'm migrating from Solr 1.2 to 3.6.1.  I used the same analyzer as I was, 
>> and 
> re-indexed my data.  I did not add 
>> solr.ReversedWildcardFilterFactory to my index analyzer, but yet leading 
>> wild 
> cards are working!!  Does this mean it's turned on by default?  If so, how do 
> I 
> turn it off, and what are the implication of leaving ON?  Won't my searches 
> be 
> slower and consume more memory?
>> 
>> 
>> Thanks,
>> 
>> 
>> --MJ
>> 
> 
> 
> 
> 



Re: Is leading wildcard search turned on by default in Solr 3.6.1?

2012-11-12 Thread François Schiettecatte
John

You can still use leading wildcards even if you don't have the 
ReversedWildcardFilterFactory in your analysis, but it means you will be 
scanning the entire dictionary when the search is run, which can be a 
performance issue. If you do use ReversedWildcardFilterFactory you won't have 
that performance issue, but you will increase the overall size of your index. 
It's a tradeoff. 

When I looked into it for a site I built, I decided that the tradeoff was not 
worth it (after benchmarking) given how few leading wildcard searches it was 
getting.
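
For reference, enabling it is an index-time analyzer entry along these lines 
(the attribute values shown are common defaults, not from this thread):

   <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
           maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>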

Best regards

François


On Nov 12, 2012, at 5:33 PM, johnmu...@aol.com wrote:

> 
> 
> Hi,
> 
> 
> I'm migrating from Solr 1.2 to 3.6.1.  I used the same analyzer as I was, and 
> re-indexed my data.  I did not add 
> solr.ReversedWildcardFilterFactory to my index analyzer, but yet leading wild 
> cards are working!!  Does this mean it's turned on by default?  If so, how do 
> I turn it off, and what are the implication of leaving ON?  Won't my searches 
> be slower and consume more memory?
> 
> 
> Thanks,
> 
> 
> --MJ
> 



Re: MMapDirectory, demand paging, lazy evaluation, ramfs and the much maligned RAMDirectory (oh my!)

2012-10-24 Thread François Schiettecatte
Aaron

The best way to make sure the index is cached by the OS is to just cat it on 
startup:

cat `find /path/to/solr/index` > /dev/null

Just make sure your index is smaller than RAM, otherwise data will be rotated 
out.

Memory mapping is built on the virtual memory system, and I suspect that ramfs 
is too, so I doubt very much that copying your index to ramfs will help at all. 
Sidebar - a while ago I did a bunch of testing copying indices to shared memory 
(/dev/shm in this case) and there was no advantage compared to just accessing 
indices on disc when using memory mapping once the system got to a steady state.

There has been a lot written about this topic on the list. Basically it comes 
down to using MMapDirectory (which is the default), making sure your index is 
smaller than your RAM, and allocating just enough memory to the Java VM. That 
last part requires some benchmarking because it is so workload dependent.
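
To verify how much of the index is actually resident in the file system cache, 
a tool like vmtouch can help (a suggestion, not something from this thread):

   vmtouch /path/to/solr/index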

Best regards

François

On Oct 24, 2012, at 8:29 PM, Aaron Daubman  wrote:

> Greetings,
> 
> Most times I've seen the topic of storing one's index in memory, it
> seems the asker was referring (or understood to be referring) to the
> (in)famous "not intended to work with huge indexes" Solr RAMDirectory.
> 
> Let me be clear that that I am not interested in RAMDirectory.
> However, I would like to better understand the oft-recommended and
> currently-default MMapDirectory, and what the tradeoffs would be, when
> using a 64-bit linux server dedicated to this single solr instance,
> with plenty (more than 2x index size) of RAM, of storing the index
> files on SSDs versus on a ramfs mount.
> 
> I understand that using the default MMapDirectory will allow caching
> of the index in-memory, however, my understanding is that mmaped files
> are demand-paged (lazy evaluated), meaning that only after a block is
> read from disk will it be paged into memory - is this correct? is it
> actually block-by-block (page size by page size?) - any pointers to
> decent documentation on this regardless of the effectiveness of the
> approach would be appreciated...
> 
> My concern with using MMapDirectory for an index stored on disk (even
> SSDs), if my understanding is correct, is that there is still a large
> startup cost to MMapDirectory, as it may take many queries before even
> most of a 20G index has been loaded into memory, and there may yet
> still be "dark corners" that only come up in edge-case queries that
> cause QTime spikes should these queries ever occur.
> 
> I would like to ensure that, at startup, no query will incur
> disk-seek/read penalties.
> 
> Is the "right" way to achieve this to copy the index to a ramfs (NOT
> ramdisk) mount and then continue to use MMapDirectory in Solr to read
> the index? I am under the impression that when using ramfs (rather
> than ramdisk, for which this would not work) a file mmaped on a ramfs
> mount will actually share the same address space, and so would not
> incur the typical double-ram overhead of mmaping a file in memory just
> o have yet another copy of the file created in a second memory
> location. Is this correct? If not, would you please point me to
> documentation stating otherwise (I haven't found much documentation
> either way).
> 
> Finally, given the desire to be quick at startup with a large index
> that will still easily fit within a system's memory, am I thinking
> about this wrong or are there other better approaches?
> 
> Thanks, as always,
> Aaron



Re: Solr and Tomcat - problem with unicode characters

2012-08-28 Thread François Schiettecatte
What is probably going on is that the response is not being interpreted as 
UTF-8 but as some other encoding.

What are you using to display the response?

François


On Aug 28, 2012, at 8:08 AM, zehoss  wrote:

> Hi,
> at the beginning I would like to sorry for my english. I hope my message
> will be communicative.
> 
> I would like to ask you for help with Solr running on Tomcat 6.
> 
> I have configured Tomcat's Connector like this:
> <Connector ... connectionTimeout="2"
>   URIEncoding="UTF-8"
>   redirectPort="8443" />
> 
> But still I have problem with unicode chars.
> 
> My application sends query string to server:
> 
> q=title:%22r%C3%B3%C5%BCnica%22~20^100 OR
> contents:%22r%C3%B3%C5%BCnica%22~20&debugQuery=on&rows=2000&timeAllowed=3000&f1=score,title,contents&wt=json
>  
> 
> 
> but in response I get:
> 
> 'q' => 'title:"rÃ³Å¼nica"~20^100 OR contents:"rÃ³Å¼nica"~20'
> 
> and no results found.
> 
> For tests I tried configure Solr on Jetty to be sure that everything on the
> application site is ok.
> And when I start Solr on Jetty I get correct results and in response I get
> correct characters.
> Only when I start Tomcat there are problems.
> 
> There are no exceptions nor warnings in tomcat logs.
> 
> Could anyone suggest me where should I search a problem?
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-and-Tomcat-problem-with-unicode-characters-tp4003692.html
> Sent from the Solr - User mailing list archive at Nabble.com.
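
For illustration, a minimal Java sketch that decodes the response bytes
explicitly as UTF-8 instead of relying on a platform default (URL and query
are illustrative only):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class Utf8ResponseDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical query; Solr sends UTF-8 bytes, so decode them as UTF-8
        URL url = new URL("http://localhost:8983/solr/select?q=%22r%C3%B3%C5%BCnica%22&wt=json");
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
    }
}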



Re: recommended SSD

2012-08-23 Thread François Schiettecatte
You should check this at pcper.com:

http://pcper.com/ssd-decoder

http://pcper.com/content/SSD-Decoder-popup

Specs for a wide range of SSDs.

Best regards

François


On Aug 23, 2012, at 5:35 PM, Peyman Faratin  wrote:

> Hi
> 
> Is there a SSD brand and spec that the community recommends for an index of 
> size 56G with mostly reads? We are evaluating this one
> 
> http://www.newegg.com/Product/Product.aspx?Item=N82E16820227706
> 
> thank you
> 
> Peyman
> 
> 



Re: The way to customize ranking?

2012-08-23 Thread François Schiettecatte
I would create two indices, one with your content and one with your ads. This 
approach would allow you to precisely control how many ads you pull back and 
how you merge them into the results, and you would be able to control schemas, 
boosting, default fields, etc for each index independently. 

Best regards

François

On Aug 23, 2012, at 11:45 AM, Nicholas Ding  wrote:

> Thank you, but I don't want to filter those ads.
> 
> For example, when user make a search like q=Car
> Result list:
> 1. Ford Automobile (score 10)
> 2. Honda Civic (score 9)
> ...
> ...
> ...
> 99. Paid Ads (score 1, Ad has own field to identify it's an Ad)
> 
> What I want to find is a way to make the score of "Paid Ads" higher than
> "Ford Automobile". Basically, the result structure will look like
> 
> - [Paid Ads Section]
>[Most valuable Ads 1]
>[Most valuable Ads 2]
>[Less valuable Ads 1]
>[Less valuable Ads 2]
> - [Relevant Results Section]
> 
> 
> On Thu, Aug 23, 2012 at 11:33 AM, Karthick Duraisamy Soundararaj <
> karthick.soundara...@gmail.com> wrote:
> 
>> Hi
>> You might add an int  field "Search Rule" that identifies the type of
>> search.
>> example
>>Search Rule  Description
>> 0  Unpaid Search
>> 1  Paid Search - Rule 1
>> 2  Paid Search - Rule 2
>> 
>> You can use filterqueries (
>> http://wiki.apache.org/solr/CommonQueryParameters)
>> like fq:  Search Rule :[1 TO *]
>> 
>> Alternatively, you can even use a boolean field to identify whether or not
>> a search is paid and then an additional field that identifies the type of
>> paid search.
>> 
>> --
>> karthick
>> 
>> On Thu, Aug 23, 2012 at 11:16 AM, Nicholas Ding >> wrote:
>> 
>>> Hi
>>> 
>>> I'm working on Solr to build a local business search in China. We have a
>>> special requirement from advertisers. When a user makes a search, if the
>>> results contain paid advertisements, those ads need to be moved to the top
>>> of results. For different ads, they have detailed rules about which comes
>>> first.
>>> 
>>> Could anyone offer me some suggestions on how I can customize the ranking based on
>>> my requirement?
>>> 
>>> Thanks
>>> Nicholas
>>> 
>> 
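
A minimal SolrJ sketch of the two-index approach described above (core names
and the 'title' field are hypothetical):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class AdsFirstSearch {
    public static void main(String[] args) throws Exception {
        // One core for paid ads, one for organic content
        CommonsHttpSolrServer ads = new CommonsHttpSolrServer("http://localhost:8983/solr/ads");
        CommonsHttpSolrServer content = new CommonsHttpSolrServer("http://localhost:8983/solr/content");

        SolrQuery query = new SolrQuery("car");
        query.setRows(3); // pull back a fixed number of ads
        SolrDocumentList adHits = ads.query(query).getResults();

        query.setRows(20); // then the organic results
        SolrDocumentList organicHits = content.query(query).getResults();

        // Merge with the ads on top, which is exactly the control the
        // two-index approach buys you
        for (SolrDocument d : adHits) {
            System.out.println("[ad] " + d.getFieldValue("title"));
        }
        for (SolrDocument d : organicHits) {
            System.out.println(d.getFieldValue("title"));
        }
    }
}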



Re: Can't find solr.xml

2012-07-11 Thread François Schiettecatte
On Jul 11, 2012, at 2:52 PM, Shawn Heisey wrote:

> On 7/2/2012 2:33 AM, Nabeel Sulieman wrote:
>> Argh! (and hooray!)
>> 
>> I started from scratch again, following the wiki instructions. I did only
>> one thing differently; put my data directory in /opt instead of /home/dev.
>> And now it works!
>> 
>> I'm glad it's working now. I just wish I knew exactly what the difference
>> is. The directory in /opt has exactly the same permissions as the one in
>> /home/dev (chown -R tomcat solr).
> 
> This could be selinux.  I tend to disable it, as configuring it for proper 
> operation with custom software can be tricky.  If this is the problem, there 
> will hopefully be a record of the denial in one of the files in /var/log.  
> CentOS has selinux enabled by default.
> 
> In case you don't know how to turn it off: in /etc/selinux/config, set 
> SELINUX=disabled and reboot.  There may be a way to disable it without 
> rebooting, but I've found that to be the path of least resistance.
> 
> Thanks,
> Shawn
> 


You can temporarily disable selinux until the next reboot with this:

echo 0 > /selinux/enforce

Cheers

François




Re: difference between stored="false" and stored="true" ?

2012-06-30 Thread François Schiettecatte
Giovanni

stored="true" means the data is stored in the index and can be returned with 
the search results (see the 'fl' parameter). This is independent of 
indexed="true", which makes a field searchable.

Which means that you can store but not index a field:

<field name="..." type="..." indexed="false" stored="true"/>

Best regards

François

On Jun 30, 2012, at 9:57 AM, Giovanni Gherdovich wrote:

> Hi all,
> 
> when declaring a field in the schema.xml file you can
> set the attributes 'indexed' and 'stored' to "true" or "false".
> 
> What is the difference between a <field ... stored="false"/> 
> and a <field ... stored="true"/>?
> 
> I guess understanding this would require me to have
> a closer look at lucene's index data structures;
> what's the pointer to some doc I can read?
> 
> Cheers,
> GGhh
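
A small SolrJ sketch of the distinction (field names hypothetical): querying
a field requires indexed="true", while returning it through 'fl' requires
stored="true":

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class StoredVsIndexedDemo {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // 'body' must be indexed for this query to match anything...
        SolrQuery query = new SolrQuery("body:lucene");
        // ...while 'id' and 'title' must be stored to come back in the results
        query.setFields("id", "title");
        System.out.println(server.query(query).getResults());
    }
}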



Re: Indexation Speed?

2012-06-19 Thread François Schiettecatte
There is a lot of good information about that on the web, just google for 
'ubuntu performance monitor'

Also the ubuntu website has a pretty good help section:

https://help.ubuntu.com/

and a community wiki:

https://help.ubuntu.com/community

Cheers

François

On Jun 19, 2012, at 9:03 AM, Bruno Mannina wrote:

> Linux Ubuntu :) for 2 months now! So I'm new to this world :)
> 
> On 19/06/2012 15:01, François Schiettecatte wrote:
>> Well that depends on the platform you are on, you did not mention that.
>> 
>> If you are using linux, you could use atop ( http://www.atoptool.nl/ ), or 
>> top, or  iostat or stat, or all four.
>> 
>> Cheers
>> 
>> François
>> 
>> On Jun 19, 2012, at 8:55 AM, Bruno Mannina wrote:
>> 
>>> CPU is not used, just 50-60% sometimes during the process but How can I 
>>> check IO HDD ?
>>> 
>>> On 19/06/2012 14:13, François Schiettecatte wrote:
>>>> Just a suggestion, you might want to monitor CPU usage and disk I/O, there 
>>>> might be a bottleneck.
>>>> 
>>>> Cheers
>>>> 
>>>> François
>>>> 
>>>> On Jun 19, 2012, at 7:07 AM, Bruno Mannina wrote:
>>>> 
>>>>> Actually -Xmx512m and no effect
>>>>> 
>>>>> Concerning  maxFieldLength, no problem it's commented
>>>>> 
>>>>> On 19/06/2012 13:02, Erick Erickson wrote:
>>>>>> Then try -Xmx600M
>>>>>> next try -Xmx900M
>>>>>> 
>>>>>> 
>>>>>> etc. The idea is to bump things on separate runs.
>>>>>> 
>>>>>> But be a little cautious here. Look in your solrconfig.xml file, you'll 
>>>>>> see
>>>>>> a commented-out line
>>>>>> <maxFieldLength>10000</maxFieldLength>
>>>>>> 
>>>>>> The default behavior for Solr/Lucene is to index the first 10,000 tokens
>>>>>> (not characters, think of tokens as words for now) in each
>>>>>> document and throw the rest on the floor. At the sizes you're talking 
>>>>>> about,
>>>>>> that's probably not a problem, but do be aware of it.
>>>>>> 
>>>>>> Best
>>>>>> Erick
>>>>>> 
>>>>>> On Tue, Jun 19, 2012 at 5:44 AM, Bruno Mannina
>>>>>> wrote:
>>>>>>> Like that?
>>>>>>> 
>>>>>>> java -Xmx300m -jar post.jar myfile.xml
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 19/06/2012 11:11, Lance Norskog wrote:
>>>>>>> 
>>>>>>>> Ah! Java memory size is a java command line option:
>>>>>>>> 
>>>>>>>> http://javahowto.blogspot.com/2006/06/6-common-errors-in-setting-java-heap.html
>>>>>>>> 
>>>>>>>> You would try increasing the memory size in stages up to maybe 300m.
>>>>>>>> 
>>>>>>>> On Tue, Jun 19, 2012 at 2:04 AM, Bruno Mannina  
>>>>>>>> wrote:
>>>>>>>>> On 19/06/2012 10:51, Lance Norskog wrote:
>>>>>>>>> 
>>>>>>>>>> 675 doc/s is respectable for that server. You might move the memory
>>>>>>>>>> allocated to Java up and down- there is a balance between amount of
>>>>>>>>>> memory in Java v.s. the OS disk buffer.
>>>>>>>>> How can I do that ? is there an option during my command line or in a
>>>>>>>>> config
>>>>>>>>> file?
>>>>>>>>> sorry for this newbie question :(
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> And, of course, use the latest trunk.
>>>>>>>>> Solr 3.6
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Tue, Jun 19, 2012 at 12:10 AM, Bruno Mannina
>>>>>>>>>>  wrote:
>>>>>>>>>>> Correction: file size is 40 MB !!!
>>>>>>>>>>> 
>>>>>>>>>>> On 19/06/2012 09:09, Bruno Mannina wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Dear All,
>>>>>>>>>>>> 
>>>>>>>>>>>> I would like to know if the indexation speed is right.
>>>>>>>>>>>> 
>>>>>>>>>>>> I have a 40 GB file with around 27,000 docs inside.
>>>>>>>>>>>> I index around 20 fields,
>>>>>>>>>>>> 
>>>>>>>>>>>> My (old) test server is a DualCore 3.06GHz Intel Xeon with only 1Go
>>>>>>>>>>>> Ram
>>>>>>>>>>>> 
>>>>>>>>>>>> The file takes 40 seconds with the command line:
>>>>>>>>>>>> java -jar post.jar myfile.xml
>>>>>>>>>>>> 
>>>>>>>>>>>> Could I increase this speed or reduce this time?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks a lot,
>>>>>>>>>>>> PS: Newbie user
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>> 
>> 
>> 
> 



Re: Indexation Speed?

2012-06-19 Thread François Schiettecatte
Well that depends on the platform you are on, you did not mention that.

If you are using linux, you could use atop ( http://www.atoptool.nl/ ), or top, 
or  iostat or stat, or all four.

Cheers

François

On Jun 19, 2012, at 8:55 AM, Bruno Mannina wrote:

> CPU is not used, just 50-60% sometimes during the process but How can I check 
> IO HDD ?
> 
> On 19/06/2012 14:13, François Schiettecatte wrote:
>> Just a suggestion, you might want to monitor CPU usage and disk I/O, there 
>> might be a bottleneck.
>> 
>> Cheers
>> 
>> François
>> 
>> On Jun 19, 2012, at 7:07 AM, Bruno Mannina wrote:
>> 
>>> Actually -Xmx512m and no effect
>>> 
>>> Concerning  maxFieldLength, no problem it's commented
>>> 
>>> On 19/06/2012 13:02, Erick Erickson wrote:
>>>> Then try -Xmx600M
>>>> next try -Xmx900M
>>>> 
>>>> 
>>>> etc. The idea is to bump things on separate runs.
>>>> 
>>>> But be a little cautious here. Look in your solrconfig.xml file, you'll see
>>>> a commented-out line
>>>> <maxFieldLength>10000</maxFieldLength>
>>>> 
>>>> The default behavior for Solr/Lucene is to index the first 10,000 tokens
>>>> (not characters, think of tokens as words for now) in each
>>>> document and throw the rest on the floor. At the sizes you're talking 
>>>> about,
>>>> that's probably not a problem, but do be aware of it.
>>>> 
>>>> Best
>>>> Erick
>>>> 
>>>> On Tue, Jun 19, 2012 at 5:44 AM, Bruno Mannina   wrote:
>>>>> Like that?
>>>>> 
>>>>> java -Xmx300m -jar post.jar myfile.xml
>>>>> 
>>>>> 
>>>>> 
>>>>> On 19/06/2012 11:11, Lance Norskog wrote:
>>>>> 
>>>>>> Ah! Java memory size is a java command line option:
>>>>>> 
>>>>>> http://javahowto.blogspot.com/2006/06/6-common-errors-in-setting-java-heap.html
>>>>>> 
>>>>>> You would try increasing the memory size in stages up to maybe 300m.
>>>>>> 
>>>>>> On Tue, Jun 19, 2012 at 2:04 AM, Bruno Mannina 
>>>>>> wrote:
>>>>>>> On 19/06/2012 10:51, Lance Norskog wrote:
>>>>>>> 
>>>>>>>> 675 doc/s is respectable for that server. You might move the memory
>>>>>>>> allocated to Java up and down- there is a balance between amount of
>>>>>>>> memory in Java v.s. the OS disk buffer.
>>>>>>> How can I do that ? is there an option during my command line or in a
>>>>>>> config
>>>>>>> file?
>>>>>>> sorry for this newbie question :(
>>>>>>> 
>>>>>>> 
>>>>>>>> And, of course, use the latest trunk.
>>>>>>> Solr 3.6
>>>>>>> 
>>>>>>> 
>>>>>>>> On Tue, Jun 19, 2012 at 12:10 AM, Bruno Mannina
>>>>>>>>  wrote:
>>>>>>>>> Correction: file size is 40 MB !!!
>>>>>>>>> 
>>>>>>>>> On 19/06/2012 09:09, Bruno Mannina wrote:
>>>>>>>>> 
>>>>>>>>>> Dear All,
>>>>>>>>>> 
>>>>>>>>>> I would like to know if the indexation speed is right.
>>>>>>>>>> 
>>>>>>>>>> I have a 40 GB file with around 27,000 docs inside.
>>>>>>>>>> I index around 20 fields,
>>>>>>>>>> 
>>>>>>>>>> My (old) test server is a DualCore 3.06GHz Intel Xeon with only 1Go
>>>>>>>>>> Ram
>>>>>>>>>> 
>>>>>>>>>> The file takes 40 seconds with the command line:
>>>>>>>>>> java -jar post.jar myfile.xml
>>>>>>>>>> 
>>>>>>>>>> Could I increase this speed or reduce this time?
>>>>>>>>>> 
>>>>>>>>>> Thanks a lot,
>>>>>>>>>> PS: Newbie user
>>>>>>>>>> 
>>>>>>>>>> 
>> 
>> 
> 



Re: Indexation Speed?

2012-06-19 Thread François Schiettecatte
Just a suggestion, you might want to monitor CPU usage and disk I/O, there 
might be a bottleneck.

Cheers

François

On Jun 19, 2012, at 7:07 AM, Bruno Mannina wrote:

> Actually -Xmx512m and no effect
> 
> Concerning  maxFieldLength, no problem it's commented
> 
> On 19/06/2012 13:02, Erick Erickson wrote:
>> Then try -Xmx600M
>> next try -Xmx900M
>> 
>> 
>> etc. The idea is to bump things on separate runs.
>> 
>> But be a little cautious here. Look in your solrconfig.xml file, you'll see
>> a commented-out line
>> <maxFieldLength>10000</maxFieldLength>
>> 
>> The default behavior for Solr/Lucene is to index the first 10,000 tokens
>> (not characters, think of tokens as words for now) in each
>> document and throw the rest on the floor. At the sizes you're talking about,
>> that's probably not a problem, but do be aware of it.
>> 
>> Best
>> Erick
>> 
>> On Tue, Jun 19, 2012 at 5:44 AM, Bruno Mannina  wrote:
>>> Like that?
>>> 
>>> java -Xmx300m -jar post.jar myfile.xml
>>> 
>>> 
>>> 
>>> On 19/06/2012 11:11, Lance Norskog wrote:
>>> 
 Ah! Java memory size is a java command line option:
 
 http://javahowto.blogspot.com/2006/06/6-common-errors-in-setting-java-heap.html
 
 You would try increasing the memory size in stages up to maybe 300m.
 
 On Tue, Jun 19, 2012 at 2:04 AM, Bruno Mannina wrote:
> 
> On 19/06/2012 10:51, Lance Norskog wrote:
> 
>> 675 doc/s is respectable for that server. You might move the memory
>> allocated to Java up and down- there is a balance between amount of
>> memory in Java v.s. the OS disk buffer.
> 
> How can I do that ? is there an option during my command line or in a
> config
> file?
> sorry for this newbie question :(
> 
> 
>> And, of course, use the latest trunk.
> Solr 3.6
> 
> 
>> On Tue, Jun 19, 2012 at 12:10 AM, Bruno Mannina
>>  wrote:
>>> Correction: file size is 40 MB !!!
>>> 
>>> On 19/06/2012 09:09, Bruno Mannina wrote:
>>> 
 Dear All,
 
 I would like to know if the indexation speed is right.
 
 I have a 40 GB file with around 27,000 docs inside.
 I index around 20 fields,
 
 My (old) test server is a DualCore 3.06GHz Intel Xeon with only 1Go
 Ram
 
 The file takes 40 seconds with the command line:
 java -jar post.jar myfile.xml
 
 Could I increase this speed or reduce this time?
 
 Thanks a lot,
 PS: Newbie user
 
 
 
>> 
> 



Re: Solr out of memory exception

2012-03-15 Thread François Schiettecatte
FWIW it looks like this feature has been enabled by default since JDK 6 Update 
23:


http://blog.juma.me.uk/2008/10/14/32-bit-or-64-bit-jvm-how-about-a-hybrid/

François

On Mar 15, 2012, at 6:39 AM, Husain, Yavar wrote:

> Thanks a ton.
> 
> From: Li Li [fancye...@gmail.com]
> Sent: Thursday, March 15, 2012 12:11 PM
> To: Husain, Yavar
> Cc: solr-user@lucene.apache.org
> Subject: Re: Solr out of memory exception
> 
> it seems you are using 64bit jvm(32bit jvm can only allocate about 1.5GB). 
> you should enable pointer compression by -XX:+UseCompressedOops
> 
> On Thu, Mar 15, 2012 at 1:58 PM, Husain, Yavar 
> mailto:yhus...@firstam.com>> wrote:
> Thanks for helping me out.
> 
> I have allocated Xms-2.0GB Xmx-2.0GB
> 
> However i see Tomcat is still using pretty less memory and not 2.0G
> 
> Total Memory on my Windows Machine = 4GB.
> 
> With smaller index size it is working perfectly fine. I was thinking of 
> increasing the system RAM & tomcat heap space allocated but then how come on 
> a different server with exactly same system and solr configuration & memory 
> it is working fine?
> 
> 
> -Original Message-
> From: Li Li [mailto:fancye...@gmail.com]
> Sent: Thursday, March 15, 2012 11:11 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr out of memory exception
> 
> how many memory are allocated to JVM?
> 
> On Thu, Mar 15, 2012 at 1:27 PM, Husain, Yavar 
> mailto:yhus...@firstam.com>> wrote:
> 
>> Solr is giving out of memory exception. Full Indexing was completed fine.
>> Later while searching maybe when it tries to load the results in memory it
>> starts giving this exception. Though with the same memory allocated to
>> Tomcat and exactly same solr replica on another server it is working
>> perfectly fine. I am working on 64 bit software's including Java & Tomcat
>> on Windows.
>> Any help would be appreciated.
>> 
>> Here are the logs:
>> 
>> The server encountered an internal error (Severe errors in solr
>> configuration. Check your log files for more detailed information on what
>> may be wrong. If you want solr to continue after configuration errors,
>> change: <abortOnConfigurationError>false</abortOnConfigurationError> in
>> null -
>> java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space at
>> org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1068) at
>> org.apache.solr.core.SolrCore.<init>(SolrCore.java:579) at
>> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
>> at
>> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
>> at
>> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
>> at
>> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
>> at
>> org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:115)
>> at
>> org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4072)
>> at
>> org.apache.catalina.core.StandardContext.start(StandardContext.java:4726)
>> at
>> org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:799)
>> at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:779)
>> at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:601) at
>> org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:943) at
>> org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:778) at
>> org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:504) at
>> org.apache.catalina.startup.HostConfig.start(HostConfig.java:1317) at
>> org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:324)
>> at
>> org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:142)
>> at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1065) at
>> org.apache.catalina.core.StandardHost.start(StandardHost.java:840) at
>> org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1057) at
>> org.apache.catalina.core.StandardEngine.start(StandardEngine.java:463) at
>> org.apache.catalina.core.StandardService.start(StandardService.java:525) at
>> org.apache.catalina.core.StandardServer.start(StandardServer.java:754) at
>> org.apache.catalina.startup.Catalina.start(Catalina.java:595) at
>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at
>> sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at
>> java.lang.reflect.Method.invoke(Unknown Source) at
>> org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289) at
>> org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414) Caused by:
>> java.lang.OutOfMemoryError: Java heap space at
>> org.apache.lucene.index.SegmentTermEnum.termInfo(SegmentTermEnum.java:180)
>> at org.apache.lucene.index.TermInfosReader.<init>(TermIn
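
As an aside, a quick way to confirm the pointer size of the JVM actually being
used (the first property is HotSpot-specific and may be absent on other
vendors' JVMs):

public class JvmBitsDemo {
    public static void main(String[] args) {
        // Prints "32" or "64" on Sun/Oracle HotSpot JVMs
        System.out.println("data model: " + System.getProperty("sun.arch.data.model"));
        System.out.println("os.arch:    " + System.getProperty("os.arch"));
    }
}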

Re: Solr logging

2012-02-20 Thread François Schiettecatte
Ola

Here is what I have for this:


##
#
# Log4J configuration for SOLR
#
#   http://wiki.apache.org/solr/SolrLogging
#
#
# 1) Download LOG4J:
#   http://logging.apache.org/log4j/1.2/
#   http://logging.apache.org/log4j/1.2/download.html
#   
http://www.apache.org/dyn/closer.cgi/logging/log4j/1.2.16/apache-log4j-1.2.16.tar.gz
#   
http://newverhost.com/pub//logging/log4j/1.2.16/apache-log4j-1.2.16.tar.gz
#
# 2) Download SLF4J:
#   http://www.slf4j.org/
#   http://www.slf4j.org/download.html
#   http://www.slf4j.org/dist/slf4j-1.6.4.tar.gz
#
# 3) Unpack Solr:
#   jar xvf apache-solr-3.5.0.war
#
# 4) Delete:
#   WEB-INF/lib/log4j-over-slf4j-1.6.4.jar
#   WEB-INF/lib/slf4j-jdk14-1.6.4.jar
#
# 5) Copy:
#   apache-log4j-1.2.16/log4j-1.2.16.jar->  WEB-INF/lib
#   slf4j-1.6.4/slf4j-log4j12-1.6.4.jar ->  WEB-INF/lib
#   log4j.properties (this file)->  WEB-INF/classes/ (needs to be created)
#
# 6) Pack Solr:
#   jar cvf apache-solr-3.4.0-omim.war admin favicon.ico index.jsp META-INF 
WEB-INF
#
#
#   Author: Francois Schiettecatte
#   Version:1.0
#
##



##
#
# Logging levels (helpful reminder)
#
# DEBUG < INFO < WARN < ERROR < FATAL
#



##
#
# Logging setup
#

log4j.rootLogger=WARN, SOLR


# Daily Rolling File Appender (SOLR)
log4j.appender.SOLR=org.apache.log4j.DailyRollingFileAppender
log4j.appender.SOLR.File=${catalina.base}/logs/solr.log
log4j.appender.SOLR.Append=true
log4j.appender.SOLR.Encoding=UTF-8
log4j.appender.SOLR.DatePattern='-'yyyy-MM-dd
log4j.appender.SOLR.layout=org.apache.log4j.PatternLayout
log4j.appender.SOLR.layout.ConversionPattern=%d [%t] %-5p %c - %m%n



##
#
# Logging levels for SOLR
#

# Default logging level
log4j.logger.org.apache.solr=WARN



##



On Feb 20, 2012, at 5:15 AM, ola nowak wrote:

> Yep. I suppose it is. But I have several applications installed on
> glassfish and I want each one of them to write into separate file. And Your
> solution with this jvm option was redirecting all messages from all apps to
> one file. Does anyone knows how to accomplish that?
> 
> 
> On Mon, Feb 20, 2012 at 11:09 AM, darul  wrote:
> 
>> Hmm, I did not try to achieve this but interested if you find a way...
>> 
>> After I believe than having log4j config file outside war archive is a
>> better solution, if you may need to update its content for example.
>> 
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Solr-logging-tp3760171p3760322.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>> 



Re: Development inside or outside of Solr?

2012-02-20 Thread François Schiettecatte
You could take a look at this:

http://www.let.rug.nl/vannoord/TextCat/

Will probably require some work to integrate/implement though.

François

On Feb 20, 2012, at 3:37 AM, bing wrote:

> I have looked into the TikaCLI with -language option, and learned that Tika
> can output only the language metadata. It cannot help me to solve my problem
> though, as my main concern is whether to change Solr or not.  Thank you all
> the same. 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Development-inside-or-outside-of-Solr-tp3759680p3760131.html
> Sent from the Solr - User mailing list archive at Nabble.com.
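
For reference, a minimal sketch of standalone language detection using Tika's
LanguageIdentifier class (Tika 1.x API; the sample text is illustrative):

import org.apache.tika.language.LanguageIdentifier;

public class LangDetectDemo {
    public static void main(String[] args) {
        LanguageIdentifier identifier =
                new LanguageIdentifier("Ceci est un petit texte en français.");
        // Returns an ISO 639 code such as "fr", plus a confidence flag
        System.out.println(identifier.getLanguage()
                + " certain=" + identifier.isReasonablyCertain());
    }
}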



Re: Help:Solr can't put all pdf files into index

2012-02-09 Thread François Schiettecatte
Have you tried checking any logs?

Have you tried identifying a file which did not make it in and submitting just 
that one and seeing what happens?

François

On Feb 9, 2012, at 10:37 AM, Rong Kang wrote:

> 
> Yes, I put all files in one directory and I have tested the file names using 
> code.  
> 
> 
> 
> 
> At 2012-02-09 20:45:49,"Jan Høydahl"  wrote:
>> Hi,
>> 
>> Are you 100% sure that the filename is globally unique, since you use it as 
>> the uniqueKey?
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Solr Training - www.solrtraining.com
>> 
>> On 9. feb. 2012, at 08:30, 荣康 wrote:
>> 
>>> Hey ,
>>> I am using solr as my search engine to search my pdf files. I have 18219 
>>> files (different file names) and all the files are in the same directory. But 
>>> when I use solr to import the files into the index using the Dataimport method, 
>>> solr reports only 17233 files imported. It's very strange. This problem has 
>>> stopped our project for a few days. I can't handle it.
>>> 
>>> 
>>> please help me!
>>> 
>>> 
>>> Schema.xml
>>> 
>>> <field name="..." type="..." indexed="true" stored="true"
>>> termVectors="true" termPositions="true" termOffsets="true"/>
>>> <field name="..." type="..." indexed="true" stored="true"
>>> termVectors="true" termPositions="true" termOffsets="true"/>
>>> 
>>> <uniqueKey>id</uniqueKey>
>>> 
>>> 
>>> and the data-config.xml:
>>> 
>>> <dataConfig>
>>> <dataSource type="BinFileDataSource" name="bin"/>
>>> <document>
>>> <entity name="f" processor="FileListEntityProcessor" rootEntity="false"
>>> dataSource="null" baseDir="H:/pdf/cls_1_16800_OCRed/1"
>>> fileName=".*\.(PDF)|(pdf)|(Pdf)|(pDf)|(pdF)|(PDf)|(PdF)|(pDF)"
>>> onError="skip">
>>> <entity name="..." processor="TikaEntityProcessor"
>>> url="${f.fileAbsolutePath}" format="text" dataSource="bin" onError="skip">
>>> ...
>>> </entity>
>>> </entity>
>>> </document>
>>> </dataConfig>
>>> 
>>> 
>>> 
>>> sincerely
>>> Rong Kang
>>> 
>>> 
>>> 
>> 



Re: Using UUID for uniqueId

2012-02-08 Thread François Schiettecatte
Anderson

I would say that this is highly unlikely, but you would need to pay attention 
to how they are generated, this would be a good place to start:

http://en.wikipedia.org/wiki/Universally_unique_identifier

Cheers

François

On Feb 8, 2012, at 1:31 PM, Anderson vasconcelos wrote:

> HI all
> 
> If I use the UUID as a uniqueId, will I have problems in the future if I break
> my index into shards? Could the UUID generation generate the same
> UUID on different machines?
> 
> Thanks
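
For reference, Java's built-in generator produces version 4 (random) UUIDs
from a cryptographically strong random source, so ids generated independently
on different machines have a vanishingly small collision probability:

import java.util.UUID;

public class UuidDemo {
    public static void main(String[] args) {
        // Each call yields a fresh version 4 UUID, e.g. for use as the uniqueKey
        System.out.println(UUID.randomUUID().toString());
    }
}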



Re: Question on Reverse Indexing

2012-01-17 Thread François Schiettecatte
Using ReversedWildcardFilterFactory will double the size of your dictionary 
(more or less), maybe the drop in performance that you are seeing is a result 
of that?

François

On Jan 17, 2012, at 9:01 PM, Shyam Bhaskaran wrote:

> Hi,
> 
> For reverse indexing we are using the ReversedWildcardFilterFactory on Solr 
> 4.0
> 
> 
> <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
> maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
> 
> 
> ReversedWildcardFilterFactory was helping us to perform leading wild card 
> searches like *lock.
> 
> But it was observed that the performance of the searches was not good after 
> introducing ReversedWildcardFilterFactory filter.
> 
> Hence we disabled ReversedWildcardFilterFactory filter and re-created the 
> indexes and this time we found the performance of Solr query to be faster.
> 
> But surprisingly it is observed that leading wild card searches were still 
> working in spite of disabling the ReversedWildcardFilterFactory filter.
> 
> 
> This behavior is puzzling everyone and wanted to know how this behavior of 
> reverse indexing works?
> 
> Can anyone share with me on this Solr behavior.
> 
> -Shyam
> 



Re: best query for one-box search string over multiple types & fields?

2012-01-15 Thread François Schiettecatte
Johnny 

What you are going to want to do is boost the artist field with respect to the 
others, for example using edismax my 'qf' parameter is:

number^5 title^3 default

so hits in the number field get a five-fold boost and hits in the title field 
get a three-fold boost. In your case you might want to start with:

artist^5 album^3 song

Getting these parameters right will take a little work, and I would suggest you 
build a set of searches with known results so you can quickly check the effect 
of any tweaks you do.

Useful reading would include:

http://wiki.apache.org/solr/SolrRelevancyFAQ

http://wiki.apache.org/solr/SolrRelevancyCookbook


http://www.lucidimagination.com/blog/2011/12/14/options-to-tune-document’s-relevance-in-solr/


http://www.lucidimagination.com/blog/2011/03/10/solr-relevancy-function-queries/

Cheers

François


On Jan 15, 2012, at 1:19 AM, Johnny Marnell wrote:

> hi all,
> 
> short of it: i want "queen bohemian rhapsody" to return that song named
> "Bohemian Rhapsody" by the artist named "Queen", rather than songs with
> titles like "Bohemian Rhapsody (Queen Cover)".
> 
> i'm indexing a catalog of music with these types of docs and their fields:
> 
> artist (artistName), album (albumName, artistName), and song (songName,
> albumName, artistName).
> 
> the client is one search box, and i'm having trouble handling searching
> over multiple multifields and weighting their exactness.  when a user types
> "queen", i want the artist Queen to be the first hit, and then albums &
> songs titled "queen".
> 
> if "queen bohemian rhapsody" is searched, i want to return that song, but
> instead i'm getting songs like "Bohemian Rhapsody (Queen Cover)" by "Stupid
> Queen Tribute Band" because all three terms are in the songName, i'm
> guessing.  what kind of query do i need?
> 
> i'm indexing all of these fields as multi-fields with ngram, shingle (i
> think this might be really useful for my use case?), keyword, and standard.
> that appears to be working, but i'm not sure how to combine all of this
> together over multiple multi-fields.
> 
> if anyone has good links to broadly summarized use cases of Indexing and
> Querying, that would be great - i would think this would be a common
> situation but i can't find any good resources on the web.  and i'm having
> trouble understanding scoring and boosting.
> 
> this was my first post, hope i did it right, thanks so much!
> 
> -j
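
A minimal SolrJ sketch of the boosting described above, assuming the edismax
parser and the field names from the question:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class BoostedFieldsDemo {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery("queen bohemian rhapsody");
        query.set("defType", "edismax");
        // Hits on artistName weigh five times as much as hits on songName
        query.set("qf", "artistName^5 albumName^3 songName");
        System.out.println(server.query(query).getResults().getNumFound());
    }
}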



Re: Doing url search in solr is slow

2012-01-09 Thread François Schiettecatte
About the search 'referal_url:*www.someurl.com*', having a wildcard at the 
start will cause a dictionary scan for every term you search on unless you use 
ReversedWildcardFilterFactory. That could be the cause of your slowdown if you 
are I/O bound, and even if you are CPU bound for that matter.

François


On Jan 8, 2012, at 8:44 PM, yu shen wrote:

> Hi,
> 
> My solr document has up to 20 fields, containing data from product name,
> date, url etc.
> 
> The volume of documents is around 1.5m.
> 
> My symptom is when doing url search like [ url:*www.someurl.com*
> referal_url:*www.someurl.com* page_url:*www.someurl.com*] will get a
> extraordinary long response time, while search against all other fields,
> the response time will be normal.
> 
> Can anyone share any insights on this?
> 
> Spark



Re: Shutdown hook issue

2011-12-14 Thread François Schiettecatte
I am not an expert on this but the oom-killer will kill off the process 
consuming the greatest amount of memory if the machine runs out of memory, and 
you should see something to that effect in the system log, /var/log/messages I 
think.

François

On Dec 14, 2011, at 2:54 PM, Adolfo Castro Menna wrote:

> I think I found the issue. The ubuntu server is running OOM-Killer which
> might be sending a SIGINT to the java process, probably because of memory
> consumption.
> 
> Thanks,
> Adolfo.
> 
> On Wed, Dec 14, 2011 at 12:44 PM, Otis Gospodnetic <
> otis_gospodne...@yahoo.com> wrote:
> 
>> Hi,
>> 
>> Solr won't shut down by itself just because it's idle. :)
>> You could run it with debugger attached and breakpoint set in the shutdown
>> hook you are talking about and see what calls it.
>> 
>> Otis
>> 
>> 
>> Performance Monitoring SaaS for Solr -
>> http://sematext.com/spm/solr-performance-monitoring/index.html
>> 
>> 
>> 
>> 
>>> 
>>> From: Adolfo Castro Menna 
>>> To: solr-user@lucene.apache.org
>>> Sent: Wednesday, December 14, 2011 8:17 AM
>>> Subject: Shutdown hook issue
>>> 
>>> Hi All,
>>> 
>>> I'm experiencing some issues with solr. From time to time solr goes down.
>>> After checking the logs, I see that it's due to the shutdown hook being
>>> triggered.
>>> I still don't know why it happens but it seems to be related to solr being
>>> idle. Does anyone have any insights?
>>> 
>>> I'm using Ubuntu 10.04.2 LTS and solr 3.1.0 running on Jetty (default
>>> configuration). Solr runs in background, so it doesn't seem to be related
>>> to a SIGINT unless ubuntu is sending it for some odd reason.
>>> 
>>> Thanks,
>>> Adolfo.
>>> 
>>> 
>>> 
>> 



Re: how index words with their perfix in solr?

2011-11-29 Thread François Schiettecatte
You might try the snowball stemmer too, I am not sure how closely that will fit 
your requirements though.

Alternatively you could use synonyms.

François

On Nov 29, 2011, at 1:08 AM, mina wrote:

> thank you for your answer. I read it and I use this filter in my schema.xml in
> solr:
> 
> 
> 
> but this filter doesn't understand all words with their suffix and prefix.
> this means when I search 'rain' solr doesn't show me any document that has
> 'rainy'.
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/how-index-words-with-their-perfix-in-solr-tp3542300p3544319.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Don't snowball depending on terms

2011-11-29 Thread François Schiettecatte
It won't and depending on how your analyzer is set up the terms are most likely 
stemmed at index time.

You could create a separate field for unstemmed terms though, or use a less 
aggressive stemmer such as EnglishMinimalStemFilterFactory.

François

On Nov 29, 2011, at 12:33 PM, Robert Brown wrote:

> Is it possible to search a field but not be affected by the snowball filter?
> 
> ie, searching for "manage" is matching "management", but a user may want to 
> restrict results to only containing "manage".
> 
> I was hoping that simply quoting the term would do this, but it doesn't 
> appear to make any difference.
> 
> 
> 
> 
> --
> 
> IntelCompute
> Web Design & Local Online Marketing
> 
> http://www.intelcompute.com
> 



Re: how index words with their perfix in solr?

2011-11-28 Thread François Schiettecatte
It looks like you are using the plural stemmer, you might want to look into 
using the Porter stemmer instead:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Stemming

François

On Nov 28, 2011, at 9:14 AM, mina wrote:

> I use solr 3.3, and I want solr to index words with their suffixes. When I index
> 'book' and 'books' and search 'book', solr shows any document that has 'book'
> or 'books', but when I index 'rain' and 'rainy' and search 'rain', solr shows
> only documents that have 'rain'. I want solr to show any document that
> has 'rain' or 'rainy'. Help me.
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/how-index-words-with-their-perfix-in-solr-tp3542300p3542300.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: query within search results

2011-11-08 Thread François Schiettecatte
Wouldn't 'diseases AND water' or '+diseases +water' return you that result? Or 
you could search on 'water' while filtering on 'diseases'.

Or am I missing something here?

François

On Nov 8, 2011, at 4:19 PM, sharnel pereira wrote:

> Hi,
> 
> I have 10k records indexed using solr 1.4
> 
> We have a requirement to search within search results.
> 
> example: query for 'water' returns 2000 results. I need the second query
> for 'diseases' to search within those 2000 results.(I cant add a facet as
> the second search should also check non faceted fields)
> 
> Is there a way to get this working.
> 
> Thanks
> Sharnel
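
A small SolrJ sketch of the filter-query approach, assuming the default search
field covers the fields you need; the filter restricts the result set without
changing the relevance scores:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class SearchWithinResults {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery("water");
        // The second search is applied as a filter over the 'water' results
        query.addFilterQuery("diseases");
        System.out.println(server.query(query).getResults().getNumFound());
    }
}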



Re: Is SQL Like operator feature available in Apache Solr query

2011-11-01 Thread François Schiettecatte
Kuli

Good point about just tokenizing the fields :)

I ran a couple of tests to double-check my understanding and you can have a 
wildcard operator at either or both ends of a term. Adding 
ReversedWildcardFilterFactory to your field analyzer will make leading wildcard 
searches a lot faster of course but at the expense of index size.

Cheers

François


On Nov 1, 2011, at 9:07 AM, Michael Kuhlmann wrote:

> Hi,
> 
> this is not exactly true. In Solr, you can't have the wildcard operator on 
> both sides of the term.
> 
> However, you can tokenize your fields and simply query for "Solr". This is 
> what's Solr made for. :)
> 
> -Kuli
> 
> Am 01.11.2011 13:24, schrieb François Schiettecatte:
>> Arshad
>> 
>> Actually it is available, you need to use the ReversedWildcardFilterFactory 
>> which I am sure you can Google for.
>> 
>> Solr and SQL address different problem sets with some overlaps but there are 
>> significant differences between the two technologies. Actually '%Solr%' is a 
>> worst case for SQL but handled quite elegantly in Solr.
>> 
>> Hope this helps!
>> 
>> Cheers
>> 
>> François
>> 
>> 
>> On Nov 1, 2011, at 7:46 AM, arshad ansari wrote:
>> 
>>> Hi,
>>> 
>>> Is SQL Like operator feature available in Apache Solr Just like we have it
>>> in SQL.
>>> 
>>> SQL example below -
>>> 
>>> *Select * from Employee where employee_name like '%Solr%'*
>>> 
>>> If not is it a Bug with Solr. If this feature available, please tell the
>>> examples available.
>>> 
>>> Thanks!
>>> 
>>> --
>>> Best Regards,
>>> Arshad
>> 
> 



Re: Is SQL Like operator feature available in Apache Solr query

2011-11-01 Thread François Schiettecatte
Arshad

Actually it is available, you need to use the ReversedWildcardFilterFactory 
which I am sure you can Google for.

Solr and SQL address different problem sets with some overlaps but there are 
significant differences between the two technologies. Actually '%Solr%' is a 
worst case for SQL but handled quite elegantly in Solr.

Hope this helps!

Cheers

François


On Nov 1, 2011, at 7:46 AM, arshad ansari wrote:

> Hi,
> 
> Is SQL Like operator feature available in Apache Solr Just like we have it
> in SQL.
> 
> SQL example below -
> 
> *Select * from Employee where employee_name like '%Solr%'*
> 
> If not is it a Bug with Solr. If this feature available, please tell the
> examples available.
> 
> Thanks!
> 
> -- 
> Best Regards,
> Arshad



Re: Uncomplete date expressions

2011-10-29 Thread François Schiettecatte
Erik

I would complement the date with default values as you suggest and store a 
boolean flag indicating whether the date was complete or not, or store the 
original date if it is not complete which would probably be better because the 
presence of that data would tell you that the original date was not complete 
and you would also have it too.

Cheers

François

On Oct 29, 2011, at 9:12 AM, Erik Fäßler wrote:

> Hi all,
> 
> I want to index MEDLINE documents which not always contain complete dates of 
> publication. The year is known always. Now the Solr documentation states, 
> dates must have the format "1995-12-31T23:59:59Z" for which month, day and 
> even the time of the day must be known.
> I could, of course, just complement uncomplete dates with default values, 
> 01-01 for example. But then I won't be able to distinguish between complete 
> and uncomplete dates afterwards which is of importance when displaying the 
> documents.
> 
> I could just store the known information, e.g. the year, into an 
> integer-typed field, but then I won't have date math.
> 
> Is there a good solution to my problem? Probably I'm just missing the 
> obvious, perhaps you can help me :-)
> 
> Best regards,
> 
>   Erik
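
A sketch of the complement-plus-original approach (field names hypothetical):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PartialDateDemo {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "medline-1");
        // Complement the partial date with defaults so date math keeps working...
        doc.addField("pub_date", "1995-01-01T00:00:00Z");
        // ...and keep the original value so completeness can be detected later
        doc.addField("pub_date_raw", "1995");
        server.add(doc);
        server.commit();
    }
}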



Re: drastic performance decrease with 20 cores

2011-09-26 Thread François Schiettecatte
You have not said how big your index is but I suspect that allocating 13GB for 
your 20 cores is starving the OS of memory for caching file data. Have you 
tried 6GB with 20 cores? I suspect you will see the same performance as 6GB & 
10 cores.

Generally it is better to allocate just enough memory to SOLR to run optimally 
rather than as much as possible. 'Just enough' depends as well. You will need 
to try out different allocations and see where the sweet spot is.

Cheers

François


On Sep 26, 2011, at 9:53 AM, Bictor Man wrote:

> Hi everyone,
> 
> Sorry if this issue has been discussed before, but I'm new to the list.
> 
> I have a solr (3.4) instance running with 20 cores (around 4 million docs
> each).
> The instance has allocated 13GB in a 16GB RAM server. If I run several sets
> of queries sequentially in each of the cores, the I/O access goes very high,
> so does the system load, while the CPU percentage remains always low.
> It takes almost 1 hour to complete the set of queries.
> 
> If I stop solr and restart it with 6GB allocated and 10 cores, after a bit
> the I/O access goes down and the CPU goes up, taking only around 5 minutes
> to complete all sets of queries.
> 
> Meaning that for me is MUCH more performant having 2 solr instances running
> with half the data and half the memory than a single instance will all the
> data and memory.
> 
> It would be even way faster to have 1 instance with half the cores/memory,
> run the queries, shut it down, start a new instance and repeat the process
> than having a big instance running everything.
> 
> Furthermore, if I take the 20cores/13GB instance, unload 10 of the cores,
> trigger the garbage collector and run the sets of queries again, the
> behavior still remains slow taking like 30 minutes.
> 
> am I missing something here? does solr change its caching policy depending
> on the number of cores at startup or something similar?
> 
> Any hints will be very appreciated.
> 
> Thanks,
> Victor



Re: synonyms.txt: different results on admin and on site..

2011-09-08 Thread François Schiettecatte
Wildcard terms are not analyzed, so your synonyms.txt may come into play here; 
have you checked the analysis for deniz* ?

François

On Sep 7, 2011, at 10:08 PM, deniz wrote:

> well yea you are right... i realised that lack of detail issue here... so
> here it comes... 
> 
> 
> This is from my schema.xml and basically i have a synonyms.txt file which
> contains
> 
> deniz,denis,denise
> 
> 
> After posting here, I have checked some stuff that I have faced before,
> while trying to add accented letters to the system... so it seems like same
> or similar stuff... so...
> 
> As i want to support partial matches, the search string is modified on php
> side. if user enters deniz, it is sent to solr as deniz*
> 
> when i check on solr admin, i was able to make searches with 
> deniz,denise,denis and they all return correct results, but when i put the
> wildcard, i get nothing...
> 
> so with the above settings;
> 
> deniz
> denise
> denis
> works smoothly
> 
> deniz*
> denise*
> denis*
> returns nothing...
> 
> 
> should i implement some kinda analyzer or tokenizer or any kinda component
> to overcome this thing? 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> Rob Casson wrote:
>> 
>> you should probably post your schema.xml and some parts of your
>> synonyms.txt.  it could be differences between your index and query
>> analysis chains, synonym expansion errors, etc, but folks will likely
>> need more details to help you out.
>> 
>> cheers,
>> rob
>> 
>> On Wed, Sep 7, 2011 at 9:46 PM, deniz 
>> wrote:
>>> could it be related with analysis issue about synonyms once again?
>>> 
>>> 
>>> 
>>> -
>>> Zeki ama calismiyor... Calissa yapar...
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/synonyms-txt-different-results-on-admin-and-on-site-tp3318338p3318464.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>> 
>> 
> 
> 
> -
> Zeki ama calismiyor... Calissa yapar...
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/synonyms-txt-different-results-on-admin-and-on-site-tp3318338p3318503.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: MMapDirectory failed to map a 23G compound index segment

2011-09-07 Thread François Schiettecatte
My memory of this is a little rusty but isn't mmap also limited by mem + swap 
on the box? What does 'free -g' report?

François

On Sep 7, 2011, at 12:25 PM, Rich Cariens wrote:

> Ahoy ahoy!
> 
> I've run into the dreaded OOM error with MMapDirectory on a 23G cfs compound
> index segment file. The stack trace looks pretty much like every other trace
> I've found when searching for OOM & "map failed"[1]. My configuration
> follows:
> 
> Solr 1.4.1/Lucene 2.9.3 (plus
> SOLR-1969
> )
> CentOS 4.9 (Final)
> Linux 2.6.9-100.ELsmp x86_64 yada yada yada
> Java SE (build 1.6.0_21-b06)
> Hotspot 64-bit Server VM (build 17.0-b16, mixed mode)
> ulimits:
>core file size (blocks, -c) 0
>data seg size(kbytes, -d) unlimited
>file size (blocks, -f) unlimited
>pending signals(-i) 1024
>max locked memory (kbytes, -l) 32
>max memory size (kbytes, -m) unlimited
>open files(-n) 256000
>pipe size (512 bytes, -p) 8
>POSIX message queues (bytes, -q) 819200
>stack size(kbytes, -s) 10240
>cpu time(seconds, -t) unlimited
>max user processes (-u) 1064959
>virtual memory(kbytes, -v) unlimited
>file locks(-x) unlimited
> 
> Any suggestions?
> 
> Thanks in advance,
> Rich
> 
> [1]
> ...
> java.io.IOException: Map failed
> at sun.nio.ch.FileChannelImpl.map(Unknown Source)
> at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(Unknown
> Source)
> at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(Unknown
> Source)
> at org.apache.lucene.store.MMapDirectory.openInput(Unknown Source)
> at org.apache.lucene.index.SegmentReader$CoreReaders.<init>(Unknown Source)
> 
> at org.apache.lucene.index.SegmentReader.get(Unknown Source)
> at org.apache.lucene.index.SegmentReader.get(Unknown Source)
> at org.apache.lucene.index.DirectoryReader.<init>(Unknown Source)
> at org.apache.lucene.index.ReadOnlyDirectoryReader.<init>(Unknown Source)
> at org.apache.lucene.index.DirectoryReader$1.doBody(Unknown Source)
> at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(Unknown
> Source)
> at org.apache.lucene.index.DirectoryReader.open(Unknown Source)
> at org.apache.lucene.index.IndexReader.open(Unknown Source)
> ...
> Caused by: java.lang.OutOfMemoryError: Map failed
> at sun.nio.ch.FileChannelImpl.map0(Native Method)
> ...



Re: Solr and wikipedia for schools

2011-09-04 Thread François Schiettecatte
I note that there is a full download option available, might be easier than 
crawling.

François

On Sep 4, 2011, at 9:56 AM, Markus Jelsma wrote:

> Hi,
> 
> Solr is a search engine, not a crawler. You can use Apache Nutch to crawl 
> your 
> site and have it indexed in Solr.
> 
> Cheers,
> 
>> Hi,
>> 
>> I am new to Solr/Lucene, and have some problems trying to figure out the
>> best way to perform indexing. I think I understand the general principles,
>> but have some trouble translating this to my specific goal, which is the
>> following:
>> 
>> I want to use SolR as a search engine based on general (English) keywords,
>> that has indexed Wikipedia for Schools
>> (http://www.soschildrensvillages.org.uk/charity-news/archive/2008/10/2008-
>> wikipedia-for-schools).
>> 
>> I initially thought that it would be sufficient to add the root document
>> (index.html) to Solr, after which everything would be automagically
>> indexed, but this does not seem to work. I have also tried to use
>> urldatasource in data-config.xml, but there I get a bit confused by the
>> settings.
>> 
>> Could anyone help me understand how I can achieve my goal?
>> 
>> Thanks
>> 
>> Kees



Re: shareSchema="true" - location of schema.xml?

2011-08-31 Thread François Schiettecatte
Satish

You don't say which platform you are on but have you tried links (with ln on 
linux/unix) ?

François

On Aug 31, 2011, at 12:25 AM, Satish Talim wrote:

> I have 1000's of cores and to reduce the cost of loading unloading
> schema.xml, I have my solr.xml as mentioned here -
> http://wiki.apache.org/solr/CoreAdmin
> namely:
> 
> <solr persistent="true">
>   <cores adminPath="/admin/cores" shareSchema="true">
>   ...
>   </cores>
> </solr>
> 
> However, I am not sure where to keep the common schema.xml file? In which
> case, do I need the schema.xml in the conf folder of each and every core?
> 
> My folder structure is:
> 
> multicore (contains solr.xml)
>|_ core0
> |_ conf
> ||_ schema.xml
> ||_ solrconfig.xml
> ||_ other files
>   core1
> |_ conf
> ||_ schema.xml
> ||_ solrconfig.xml
> ||_ other files
> |
>   exampledocs (contains 1000's of .csv files and post.jar)
> 
> Satish



Re: Error while decoding %DC (Ü) from URL - results in ?

2011-08-29 Thread François Schiettecatte
Merlin

Just to make sure I understand what is going on here, you are getting searches 
from external crawlers. These are coming in the form of an HTTP request I 
assume?

Have you checked the encoding specified in these requests (in the content type 
header). If the encoding is not specified then iso-8859-1 is usually assumed. 
Also have you checked the default encoding of your container? If you are using 
tomcat that is set using URIEncoding, for example:

<Connector ... URIEncoding="UTF-8" ... />

François

On Aug 28, 2011, at 3:10 PM, Merlin Morgenstern wrote:

> I double checked all code on that page and it looks like everything is in
> utf-8 and works just perfectly. The problematic URLs are always called by bots
> like google bot. Looks like they are operating with a different encoding.
> The page itself has an utf-8 meta tag.
> 
> So it looks like I have to find a way that checks for the encoding and
> encodes appropriately. This should be a common solr problem if all search
> engines treat utf-8 that way, right?
> 
> Any ideas how to fix that? Is there maybe a special solr functionality for
> this?
> 
> 2011/8/27 François Schiettecatte 
> 
>> Merlin
>> 
>> Ü encodes to two characters in utf-8 (C39C), and one in iso-8859-1 (%DC) so
>> it looks like there is a charset mismatch somewhere.
>> 
>> 
>> Cheers
>> 
>> François
>> 
>> 
>> 
>> On Aug 27, 2011, at 6:34 AM, Merlin Morgenstern wrote:
>> 
>>> Hello,
>>> 
>>> I am having problems with searches that are issued from spiders that
>> contain
>>> the ASCII encoded character "ü"
>>> 
>>> For example in : "Übersetzung"
>>> 
>>> The solr log shows following query request: /suche/%DCbersetzung
>>> which has been translated into solr query: q=?ersetzung
>>> 
>>> If you enter the search term directly as a user into the search box it
>> will
>>> result into:
>>> /suche/Übersetzung which returns perfect results.
>>> 
>>> I am decoding the URL within PHP: $term = trim(urldecode($q));
>>> 
>>> Somehow urldecode() translates the character Ü (%DC) into a ? which is an
>>> illegal first character in Solr.
>>> 
>>> I tried it without urldecode(), with rawurldecode() and with utf8_decode()
>>> but all of those did not help.
>>> 
>>> Thank you for any help or hint on how to solve that problem.
>>> 
>>> Regards, Merlin
>> 
>> 



Re: Error while decoding %DC (Ü) from URL - results in ?

2011-08-27 Thread François Schiettecatte
Merlin

Ü encodes to two characters in utf-8 (C39C), and one in iso-8859-1 (%DC) so it 
looks like there is a charset mismatch somewhere.


Cheers

François



On Aug 27, 2011, at 6:34 AM, Merlin Morgenstern wrote:

> Hello,
> 
> I am having problems with searches that are issued from spiders that contain
> the ASCII encoded character "ü"
> 
> For example in : "Übersetzung"
> 
> The solr log shows following query request: /suche/%DCbersetzung
> which has been translated into solr query: q=?ersetzung
> 
> If you enter the search term directly as a user into the search box it will
> result into:
> /suche/Übersetzung which returns perfect results.
> 
> I am decoding the URL within PHP: $term = trim(urldecode($q));
> 
> Somehow urldecode() translates the character Ü (%DC) into a ? which is an
> illegal first character in Solr.
> 
> I tried it without urldecode(), with rawurldecode() and with utf8_decode()
> but all of those did not help.
> 
> Thank you for any help or hint on how to solve that problem.
> 
> Regards, Merlin
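
The charset mismatch is easy to reproduce in Java; %DC is Ü in iso-8859-1 but
is not a valid UTF-8 byte sequence on its own:

import java.net.URLDecoder;

public class DecodeDemo {
    public static void main(String[] args) throws Exception {
        // Decoded as iso-8859-1, %DC is Ü
        System.out.println(URLDecoder.decode("%DCbersetzung", "ISO-8859-1"));
        // Decoded as UTF-8, %DC yields the replacement character (shown as ?)
        System.out.println(URLDecoder.decode("%DCbersetzung", "UTF-8"));
        // A UTF-8 client would have percent-encoded Ü as %C3%9C
        System.out.println(URLDecoder.decode("%C3%9Cbersetzung", "UTF-8"));
    }
}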



Re: SolrServer instances

2011-08-26 Thread François Schiettecatte
Sounds to me that you are looking for HTTP Persistent Connections (connection 
keep-alive as opposed to close), and a singleton object. This would be outside 
SOLR per se.

A few caveats though, I am not sure if tomcat supports keep-alive, and I am not 
sure how SOLR deals with multiple requests coming down the pipe, and you will 
need to deal with concurrency, and I am not sure what you are looking to gain 
from this, opening an http connection is pretty cheap.

François

On Aug 26, 2011, at 2:09 AM, Jonty Rhods wrote:

> do I also required to close the connection from solr server
> (CommonHttpSolrServer).
> 
> regards
> 
> On Fri, Aug 26, 2011 at 9:45 AM, Jonty Rhods  wrote:
> 
>> Dear all, please help; I am stuck here as I don't have much experience..
>> 
>> thanks
>> 
>> On Thu, Aug 25, 2011 at 6:51 PM, Jonty Rhods wrote:
>> 
>>> Hi All,
>>> 
>>> I am using SolrJ (3.1) and Tomcat 6.x. I want to open the solr server once (20
>>> concurrent connections) and reuse it across the whole site, or something like a
>>> connection pool like we are using for the DB (i.e. Apache DBCP). Using a
>>> static method is one way, but I want a better solution from you people.
>>> 
>>> 
>>> 
>>> I read one threade where Ahmet suggest to use something like that
>>> 
>>> String serverPath = "http://localhost:8983/solr";
>>> HttpClient client = new HttpClient(new
>>> MultiThreadedHttpConnectionManager());
>>> URL url = new URL(serverPath);
>>> CommonsHttpSolrServer solrServer = new CommonsHttpSolrServer(url, client);
>>> 
>>> But how to use instance of this across all class.
>>> 
>>> Please suggest.
>>> 
>>> regards
>>> Jonty
>>> 
>> 
>> 
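
A minimal sketch of the shared-instance approach (URL and pool size are
illustrative); CommonsHttpSolrServer is thread-safe, so one instance can serve
the whole webapp and nothing needs to be closed per request:

import java.net.URL;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public final class SolrServerHolder {
    private static CommonsHttpSolrServer server;

    private SolrServerHolder() {}

    public static synchronized CommonsHttpSolrServer get() throws Exception {
        if (server == null) {
            MultiThreadedHttpConnectionManager manager = new MultiThreadedHttpConnectionManager();
            // Allow the 20 concurrent connections asked for above
            manager.getParams().setMaxTotalConnections(20);
            manager.getParams().setDefaultMaxConnectionsPerHost(20);
            HttpClient client = new HttpClient(manager);
            server = new CommonsHttpSolrServer(new URL("http://localhost:8983/solr"), client);
        }
        return server;
    }
}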



Re: Solr 3.3 crashes after ~18 hours?

2011-08-02 Thread François Schiettecatte
Assuming you are running on Linux, you might want to check /var/log/messages 
too (the location might vary), I think the kernel logs forced process 
termination there. I recall that the kernel will usually pick the process 
consuming the most memory, there may be other factors involved too.

François

On Aug 2, 2011, at 9:04 AM, wakemaster 39 wrote:

> Monitor your memory usage.  I used to encounter a problem like this before
> where nothing was in the logs and the process was just gone.
> 
> Turned out my system was out of memory and swap got used up because of
> another process which then forced the kernel to start killing off processes.
> Google OOM linux and you will find plenty of other programs and people with
> a similar problem.
> 
> Cameron
> On Aug 2, 2011 6:02 AM, "alexander sulz"  wrote:
>> Hello folks,
>> 
>> I'm using the latest stable Solr release -> 3.3 and I encounter strange
>> phenomena with it.
>> After about 19 hours it just crashes, but I can't find anything in the
>> logs, no exceptions, no warnings,
>> no suspicious info entries..
>> 
>> I have an index-job running from 6am to 8pm every 10 minutes. After each
>> job there is a commit.
>> An optimize-job is done twice a day at 12:15pm and 9:15pm.
>> 
>> Does anyone have an idea what could possibly be wrong or where to look
>> for further debug info?
>> 
>> regards and thank you
>> alex



Re: Solr can not index "F**K"!

2011-07-31 Thread François Schiettecatte
Indeed, the analysis will show if the term is a stop word; the term gets 
removed by the stop filter, and turning on verbose output shows that.

François

On Jul 31, 2011, at 6:27 PM, Shashi Kant wrote:

> Check your Stop words list
> On Jul 31, 2011 6:25 PM, "François Schiettecatte" 
> wrote:
>> That seems a little far fetched, have you checked your analysis?
>> 
>> François
>> 
>> On Jul 31, 2011, at 4:58 PM, randohi wrote:
>> 
>>> One of our clients (a hot girl!) brought this to our attention:
>>> In this document there are many f* words:
>>> 
>>> http://sec.gov/Archives/edgar/data/1474227/00014742271032/d424b3.htm
>>> 
>>> and we have indexed it with the latest version of Solr (ver 3.3). But, if we
>>> search F**K, it does not return the document back!
>>> 
>>> We have tried to index it with different text types, but still not
> working.
>>> 
>>> Any idea why F* can not be indexed - being censored by the government? :D
>>> 
>>> 
>>> --
>>> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-can-not-index-F-K-tp3214246p3214246.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>> 



Re: Solr can not index "F**K"!

2011-07-31 Thread François Schiettecatte
That seems a little far fetched, have you checked your analysis?

François

On Jul 31, 2011, at 4:58 PM, randohi wrote:

> One of our clients (a hot girl!) brought this to our attention: 
> In this document there are many f* words:
> 
> http://sec.gov/Archives/edgar/data/1474227/00014742271032/d424b3.htm
> 
> and we have indexed it with the latest version of Solr (ver 3.3). But, if we
> search F**K, it does not return the document back!
> 
> We have tried to index it with different text types, but still not working.
> 
> Any idea why F* can not be indexed - being censored by the government? :D
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-can-not-index-F-K-tp3214246p3214246.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: schema.xml changes, need re-indexing ?

2011-07-27 Thread François Schiettecatte
I have not seen this mentioned anywhere, but I found a useful 'trick' to 
restart solr without having to restart tomcat. All you need to do is 'touch' 
the solr.xml in the solr.home directory. It can take a few seconds but solr 
will restart and reload any config.

Cheers

François 

On Jul 27, 2011, at 2:56 PM, Alexei Martchenko wrote:

> I believe you're fine with that. You don't need to reindex the whole solr database.
> 
> 2011/7/27 Charles-Andre Martin 
> 
>> Hi,
>> 
>> 
>> 
>> We currently have a big index in production. We would like to add 2
>> non-required fields to our schema.xml :
>> 
>> 
>> 
>> <field name="..." type="..." indexed="true" stored="true"
>> required="false"/>
>> 
>> <field name="..." type="..." indexed="true" stored="true"
>> required="false" multiValued="true"/>
>> 
>> 
>> 
>> I made some tests:
>> 
>> 
>> 
>> -  I stopped tomcat
>> 
>> -  I changed the schema.xml
>> 
>> -  I started tomcat
>> 
>> 
>> 
>> The data was still there and I was able to add new documents with these 2
>> fields.
>> 
>> 
>> 
>> So far, it looks like I won't need to re-index all my data. Am I right? Do I
>> need to re-index all my data or in that case I'm fine ?
>> 
>> 
>> 
>> Thank you !
>> 
>> 
>> 
>> Charles-André Martin
>> 
>> 
> 
> 
> -- 
> 
> *Alexei Martchenko* | *CEO* | Superdownloads
> ale...@superdownloads.com.br | ale...@martchenko.com.br | (11)
> 5083.1018/5080.3535/5080.3533



Re: performance variation with respect to the index size

2011-07-26 Thread François Schiettecatte
Finally got to running these tests.

Here are the basics...

Core i7 - 960
24GB RAM
Solr index on its own drive

Solr 3.3.0  running under tomcat 7.0.19, jdk1.6.0_26, java opts are:

JAVA_OPTS="-Xmx4096M -XX:-UseGCOverheadLimit" 
 
Raw data is 80GB in Solr markup for adding; sample below:

5
en
202
2008-07-31T23:29:40Z
http://tomfoolery4.wordpress.com/2008/07/31/finally-a-buffalo-webmedia-site-that-doesnt-sit-on-the-fence/
Finally! A Buffalo Web/Media Site That Doesn’t Sit On The 
Fence!
The Buffalo News has got my back on this one. A lot of area 
musicians, artists, writers and photographers have got my back on this one. And 
now, I'm pleased to say, so does WNYMedia.net, another new voice in a 
small 
sea of journalistic endeavors afoot in Buffalo. What I like about this site 
[...]


icwsm does not include content - 52GB

icwsm2 includes content - 117GB

I used 1,000 searches from a 162,000-search set I saved from Feedster days; 
here are some sample searches:

belize
st louis cardinals
offshoring
2010 olympic games
nanotubes
"beamed power"
"space elevator"
"power beaming"
world news
dogster
vancouver-centre
news


I ran six tests, two on icwsm getting the key and the score (10 rows and 100 
rows), two on icwsm2 getting the key and the score (10 rows and 100 rows), and 
two on icwsm2 getting all the fields and the scores (10 rows and 100 rows). 
Each test was run 10 times consecutively, nothing was running on the machine.

This table shows the time elapsed, the index name, the rows requested and the 
fields requested:

 182  icwsm  10  key,score
 184  icwsm  10  key,score
 182  icwsm  10  key,score
 182  icwsm  10  key,score
 184  icwsm  10  key,score
 183  icwsm  10  key,score
 183  icwsm  10  key,score
 183  icwsm  10  key,score
 184  icwsm  10  key,score
 183  icwsm  10  key,score

 190  icwsm  100  key,score
 183  icwsm  100  key,score
 184  icwsm  100  key,score
 184  icwsm  100  key,score
 183  icwsm  100  key,score
 183  icwsm  100  key,score
 182  icwsm  100  key,score
 183  icwsm  100  key,score
 185  icwsm  100  key,score
 184  icwsm  100  key,score

 204  icwsm2  10  key,score
 183  icwsm2  10  key,score
 184  icwsm2  10  key,score
 184  icwsm2  10  key,score
 185  icwsm2  10  key,score
 184  icwsm2  10  key,score
 183  icwsm2  10  key,score
 185  icwsm2  10  key,score
 184  icwsm2  10  key,score
 184  icwsm2  10  key,score

 288  icwsm2  100  key,score
 184  icwsm2  100  key,score
 186  icwsm2  100  key,score
 184  icwsm2  100  key,score
 186  icwsm2  100  key,score
 186  icwsm2  100  key,score
 186  icwsm2  100  key,score
 186  icwsm2  100  key,score
 189  icwsm2  100  key,score
 188  icwsm2  100  key,score

 185  icwsm2  10  *,score
 184  icwsm2  10  *,score
 183  icwsm2  10  *,score
 184  icwsm2  10  *,score
 184  icwsm2  10  *,score
 184  icwsm2  10  *,score
 185  icwsm2  10  *,score
 184  icwsm2  10  *,score
 184  icwsm2  10  *,score
 184  icwsm2  10  *,score

 206  icwsm2  100  *,score
 185  icwsm2  100  *,score
 186  icwsm2  100  *,score
 190  icwsm2  100  *,score
 195  icwsm2  100  *,score
 191  icwsm2  100  *,score
 193  icwsm2  100  *,score
 190  icwsm2  100  *,score
 186  icwsm2  100  *,score
 186  icwsm2  100  *,score

Basically, storing the data in the index has virtually no impact on search speed 
from what I can see, which is what I would expect.


Cheers

François






On Jul 8, 2011, at 12:18 PM, Erick Erickson wrote:

> Well, it depends (tm). Raw search time should be unaffected (or very
> close to that). The stored data is in a completely separate file in
> the index directory and is not referenced during searches.
> 
> That said, assembling the response may take longer since you're
> potentially reading more data from the disk to create each document.
> 
> Insure that lazy field loading is turned on, and when you're comparing
> times it would probably be best to return the same fields (perhaps just ID).
> 
> Note that the Qtime in the response packet is the search, exclusive of
> assembling the response so that's probably a good number to measure.
> 
> Best
> Erick
> 
> On Fri, Jul 8, 2011 at 8:01 AM, jame vaalet  wrote:
>> i would prefer every setting to be in its default state and compare the
>> result with stored = true and false.
>> 
>> 2011/7/8 François Schiettecatte 
>> 
>>> Hi
>>> 
>>> I don't think that anyone has run such benchmarks, in fact this topic came
>>> up two weeks ago and I volunteered some time to do that because I have some
>>> spare time this week, so I am going to run some benchmarks this weekend and
>>> report back.
>>> 
>>> The machin

Re: Spellcheck compounded words

2011-07-26 Thread François Schiettecatte
I get slf4j-log4j12-1.6.1.jar from 
http://www.slf4j.org/dist/slf4j-1.6.1.tar.gz; it is what interfaces slf4j to 
log4j. You will also need to add log4j-1.2.16.jar to WEB-INF/lib.


François 


On Jul 26, 2011, at 3:40 PM, O. Klein wrote:

> 
> François Schiettecatte wrote:
>> 
>> #
>> # 4) Copy:
>> #slf4j-1.6.1/slf4j-log4j12-1.6.1.jar ->  
>> WEB-INF/lib
>> #log4j.properties (this file)->  
>> WEB-INF/classes/ (needs to be
>> created)
>> #
>> 
> 
> Don't you mean log4j-1.2.16/slf4j-log4j12-1.6.1.jar ?
> 
> Anyways. I was testing on 3.3 and found that when I added
> &spellcheck.maxCollations=2&spellcheck.maxCollationTries=2 as parameters to
> the URL there was no problem at all.
> 
> Adding
> 
>   <str name="spellcheck.maxCollations">2</str>
>   <str name="spellcheck.maxCollationTries">2</str>
> 
> to the default requestHandler in solrconfig.xml caused request to hang.
> 
> Can someone verify if this is a bug?
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Spellcheck-compounded-words-tp3192748p3201332.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Spellcheck compounded words

2011-07-26 Thread François Schiettecatte
FWIW, here is the process I follow to create a log4j aware version of the 
apache solr war file and the corresponding lo4j.properties files.

Have fun :)

François


##
#
# Log4J configuration for SOLR
#
#   http://wiki.apache.org/solr/SolrLogging
#
#
# 1) Download SLF4J:
#   http://www.slf4j.org/
#   http://www.slf4j.org/download.html
#   http://www.slf4j.org/dist/slf4j-1.6.1.tar.gz
#
# 2) Unpack Solr:
#   jar xvf apache-solr-3.3.0.war
#
# 3) Delete:
#   WEB-INF/lib/log4j-over-slf4j-1.6.1.jar
#   WEB-INF/lib/slf4j-jdk14-1.6.1.jar
#
# 4) Copy:
#   slf4j-1.6.1/slf4j-log4j12-1.6.1.jar  ->  WEB-INF/lib
#   log4j.properties (this file)         ->  WEB-INF/classes/ (needs to be created)
#
# 5) Pack Solr:
#   jar cvf apache-solr-3.3.0.war admin favicon.ico index.jsp META-INF WEB-INF
#
#
#   Author: Francois Schiettecatte
#   Version:1.0
#
##



##
#
# Logging levels (helpful reminder)
#
# DEBUG < INFO < WARN < ERROR < FATAL
#



##
#
# Logging setup
#

log4j.rootLogger=ERROR, SOLR


# Daily Rolling File Appender (SOLR)
log4j.appender.SOLR=org.apache.log4j.DailyRollingFileAppender
log4j.appender.SOLR.File=${catalina.base}/logs/solr.log
log4j.appender.SOLR.Append=true
log4j.appender.SOLR.Encoding=UTF-8
log4j.appender.SOLR.DatePattern='.'yyyy-MM-dd
log4j.appender.SOLR.layout=org.apache.log4j.PatternLayout
log4j.appender.SOLR.layout.ConversionPattern=%d [%t] %-5p %c - %m%n



##
#
# Logging levels for SOLR
#

# Default logging level
log4j.logger.org.apache.solr=ERROR



##




On Jul 26, 2011, at 2:49 PM, O. Klein wrote:

> Adding log4j-1.2.16.jar and deleting slf4j-jdk14-1.6.1.jar does not fix
> logging for 4.0 for me.
> 
> Anyways, tried it on 3.3 and Solr just hangs here also. No logging, no
> exceptions.
> 
> I'll let you know if I manage to find source of problem.
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Spellcheck-compounded-words-tp3192748p3201202.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: problem searching on non standard characters

2011-07-22 Thread François Schiettecatte
Adding to my previous reply, I just did a quick check on the 'text_en' and 
'text_en_splitting' field types and they both strip leading '#'.

Cheers

François

On Jul 22, 2011, at 10:49 AM, Shawn Heisey wrote:

> On 7/22/2011 8:34 AM, Jason Toy wrote:
>> How does one search for words with characters like # and +.   I have tried
>> searching solr with "#test" and "\#test" but all my results always come up
>> with "test" and not "#test". Is this some kind of configuration option I
>> need to set in solr?
> 
> I would guess that your analysis chain (in schema.xml) includes something 
> that removes and/or splits terms at non-alphanumeric characters.  There are 
> several components that do this, but WordDelimiterFilter is the one that 
> comes to mind most readily.  I've never used the StandardTokenizer, but I 
> believe it might do something similar.
> 
> Thanks,
> Shawn
> 



Re: problem searching on non standard characters

2011-07-22 Thread François Schiettecatte
Check your analyzers to make sure that these characters are not getting 
stripped out in the tokenization process; the URL for 3.3 is something along 
the lines of:

http://localhost/solr/admin/analysis.jsp?highlight=on

And you should indeed be searching on "\#test".
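For example, something along these lines (default host/port assumed); note that 
when testing from the command line the backslash and the '#' both have to be 
URL-encoded, as %5C and %23 respectively:

curl 'http://localhost:8983/solr/select?q=%5C%23test&debugQuery=on'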

François

On Jul 22, 2011, at 10:34 AM, Jason Toy wrote:

> How does one search for words with characters like # and +.   I have tried
> searching solr with "#test" and "\#test" but all my results always come up
> with "test" and not "#test". Is this some kind of configuration option I
> need to set in solr?
> 
> -- 
> - sent from my mobile
> 6176064373



Re: POST VS GET and NON English Characters

2011-07-20 Thread François Schiettecatte
You need to do something like this in the ./conf/tomcat server.xml file:



See 'URIEncoding' in http://tomcat.apache.org/tomcat-7.0-doc/config/http.html

Note that this will assume that the encoding of the data is in utf-8 if (and 
ONLY if) the charset parameter is not set in the HTTP request content type 
header, the header looks like this:

Content-Type: text/plain; charset=UTF-8

Also note that most browsers encode data in ISO-8859-1 unless overridden in the 
browser settings, or by the content type and charset set in the HTML if you 
are using a form. You can do the latter either by setting it in the HTTP response 
content type header (like above), or as a meta tag like this:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Hope this helps.

Cheers

François



On Jul 20, 2011, at 7:20 AM, Sujatha Arun wrote:

> Paul ,
> 
> I added the following line to catalina.sh and restarted the server, but this
> does not seem to help.
> 
> 
> JAVA_OPTS="-Djavax.servlet.request.encoding=UTF-8 -Dfile.encoding=UTF-8"
> Regards
> Sujatha
> 
> On Sun, Jul 17, 2011 at 3:51 AM, Paul Libbrecht  wrote:
> 
>> If you have the option, try setting the default charset of the
>> servlet-container to utf-8.
>> Typically this is done by setting a system property on startup.
>> 
>> My experience has been that the default used to be utf-8 but it is less and
>> less and sometimes in a surprising way!
>> 
>> paul
>> 
>> 
>> Le 16 juil. 2011 à 05:34, Sujatha Arun a écrit :
>> 
>>> It works fine with GET method ,but I am wondering why it does not with
>> POST
>>> method.
>>> 
>>> 2011/7/15 pankaj bhatt 
>>> 
 Hi Arun,
This looks like an Encoding issue to me.
  Can you change your browser settings to UTF-8 and hit the search url
 via the GET method?
 
  We faced a similar problem with Chinese and Korean languages; this
 solved the problem.
 
 / Pankaj Bhatt.
 
 2011/7/15 Sujatha Arun 
 
> Hello,
> 
> We have implemented solr search in several languages. Initially we used the
> "GET" method for querying, but later moved to the "POST" method to
> accommodate lengthy queries.
> 
> When we moved from the GET to the POST method, the German characters could no
> longer be searched and I had to use the function utf8_decode in my
> application for the search to work for German characters.
> 
> Currently I am doing this while querying using the POST method; we are
> using the standard Request Handler:
> 
> 
> $this->_queryterm=iconv("UTF-8", "ISO-8859-1//TRANSLIT//IGNORE",
> $this->_queryterm);
> 
> 
> This makes the query work for German characters and other languages, but
> does not work for certain characters in Lithuanian and Spanish. Example:
> 
> Not working:
> 
> - Iš
> - Estremadūros
> - sNaująjį
> - MEDŽIAGOTYRA
> - MEDŽIAGOS
> - taškuose
> 
> Working:
> 
> - garbę
> - ieškoti
> - ispanų
> 
> Any ideas /input  ?
> 
> Regards
> Sujatha
> 
 
>> 
>> 



Re: How to find whether solr server is running or not

2011-07-19 Thread François Schiettecatte
I think anything but a 200 OK means it is dead like the proverbial parrot :)
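A quick way to check from the command line is to look at the HTTP status code 
rather than the body (host and port taken from your example; adjust to your setup):

curl -s -o /dev/null -w '%{http_code}' http://192.168.1.9:8983/solr/admin/ping

Anything other than 200, including curl failing to connect at all, means the 
server is not available.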

François

On Jul 19, 2011, at 7:42 AM, Romi wrote:

> But the problem is when the solr server is not running,
> "http://host:port/solr/admin/ping"
> 
> will not give me any json response,
> so how will i get the status :(
> 
> when i run this url browser gives me following error
> *Unable to connect
> Firefox can't establish a connection to the server at 192.168.1.9:8983.*
> 
> -
> Thanks & Regards
> Romi
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-to-find-whether-solr-server-is-running-or-not-tp3181870p3182202.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: - character in search query

2011-07-14 Thread François Schiettecatte
Easy, the hyphen is out on its own (with spaces on either side) and is probably 
getting removed from the search by the tokenizer. Check your analysis.

François

On Jul 14, 2011, at 6:05 AM, roySolr wrote:

> It looks like it's still not working.
> 
> I send this to SOLR: q=arsenal \- london
> 
> I get no results. When i look at the debugQuery i see this:
> 
> (name: arsenal | city:arsenal)~1.0 (name: \ | city:\)~1.0 (name: london |
> city: london)~1.0
> 
> 
> my requesthandler:
> 
>     <lst name="defaults">
>       <str name="defType">dismax</str>
>       <str name="qf">name city</str>
>     </lst>
> 
> What is going wrong?
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/character-in-search-query-tp3168604p3168666.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Wildcard

2011-07-13 Thread François Schiettecatte
http://lucene.apache.org/java/2_9_1/queryparsersyntax.html

http://wiki.apache.org/solr/SolrQuerySyntax

François

On Jul 13, 2011, at 1:29 PM, GAURAV PAREEK wrote:

> Hello,
> 
> What are wildcards we can use with the SOLR ?
> 
> Regards,
> Gaurav



Re: Result list order in case of ties

2011-07-12 Thread François Schiettecatte
You just need to provide a second sort field along the lines of:

sort=score desc, author desc

François

On Jul 12, 2011, at 6:13 AM, Lox wrote:

> Hi,
> 
> In the case where two or more documents are returned with the same score, is
> there a way to tell Solr to sort them alphabetically?
> 
> I have already tried to use the tie-breaker, but I have just one field to
> search.
> 
> Thank you.
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Result-list-order-in-case-of-ties-tp3162001p3162001.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: performance variation with respect to the index size

2011-07-08 Thread François Schiettecatte
Hi

I don't think that anyone has run such benchmarks, in fact this topic came up 
two weeks ago and I volunteered some time to do that because I have some spare 
time this week, so I am going to run some benchmarks this weekend and report 
back.

The machine I have to do this is a Core i7 960, 24GB, 4TB of disk. I am going to 
run SOLR 3.3 under Tomcat 7.0.16. I have three databases I can use for this, 
icwsm-2009 (38.5GB compressed), cdip (24GB compressed), trec vlc2 (31GB 
compressed). I could also use a copy of wikipedia. I have lots of user searches 
I can use (saved from Feedster days).

I would like some input on a couple of things to make this test as real-world 
as possible. One is any optimizations I should set in solrconfig.xml, and the 
other are the heap/GC settings I should set for tomcat. Anything else?
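For what it is worth, my starting point for the heap/GC settings will be something 
along these lines (just a baseline to test against, not a recommendation):

JAVA_OPTS="-Xms4096M -Xmx4096M -XX:+UseConcMarkSweepGC"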

Cheers

François

On Jul 8, 2011, at 4:08 AM, jame vaalet wrote:

> hi,
> 
> is there any performance degradation (response time etc ) if the index has
> document content text stored in it  (stored=true)?
> 
> -JAME



Re: Wildcard search not working if full word is queried

2011-07-01 Thread François Schiettecatte
Celso

You are very welcome, and yes, I should have mentioned that wildcard searches are 
not analyzed (which is a recurring theme). This also means that they are not 
downcased, so the search TEST* will probably not find anything either in your 
setup.

Cheers

François

On Jul 1, 2011, at 5:16 AM, Celso Pinto wrote:

> Hi again,
> 
> read (past tense) TFM :-) and:
> 
> "On wildcard and fuzzy searches, no text analysis is performed on the
> search word."
> 
> Thanks a lot François!
> 
> Regards,
> Celso
> 
> On Fri, Jul 1, 2011 at 10:02 AM, Celso Pinto  wrote:
>> Hi François,
>> 
>> it is indeed being stemmed, thanks a lot for the heads up. It appears
>> that stemming is also configured for the query so it should work just
>> the same, no?
>> 
>> Thanks again.
>> 
>> Regards,
>> Celso
>> 
>> 
>> 2011/6/30 François Schiettecatte :
>>> I would run that word through the analyzer, I suspect that the word 'teste' 
>>> is being stemmed to 'test' in the index, at least that is the first place I 
>>> would check.
>>> 
>>> François
>>> 
>>> On Jun 30, 2011, at 2:21 PM, Celso Pinto wrote:
>>> 
>>>> Hi everyone,
>>>> 
>>>> I'm having some trouble figuring out why a query with an exact word
>>>> followed by the * wildcard, eg. teste*, returns no results while a
>>>> query for test* returns results that have the word "teste" in them.
>>>> 
>>>> I've created a couple of pasties:
>>>> 
>>>> Exact word with wildcard : http://pastebin.com/n9SMNsH0
>>>> Similar word: http://pastebin.com/jQ56Ww6b
>>>> 
>>>> Parameters other than title, description and content have no effect
>>>> other than filtering out unwanted results. In a two of the four
>>>> results, the title has the complete word "teste". On the other two,
>>>> the word appears in the other fields.
>>>> 
>>>> Does anyone have any insights about what I'm doing wrong?
>>>> 
>>>> Thanks in advance.
>>>> 
>>>> Regards,
>>>> Celso
>>> 
>>> 
>> 



Re: Wildcard search not working if full word is queried

2011-06-30 Thread François Schiettecatte
I would run that word through the analyzer; I suspect that the word 'teste' is 
being stemmed to 'test' in the index. At least, that is the first place I would 
check.

François

On Jun 30, 2011, at 2:21 PM, Celso Pinto wrote:

> Hi everyone,
> 
> I'm having some trouble figuring out why a query with an exact word
> followed by the * wildcard, eg. teste*, returns no results while a
> query for test* returns results that have the word "teste" in them.
> 
> I've created a couple of pasties:
> 
> Exact word with wildcard : http://pastebin.com/n9SMNsH0
> Similar word: http://pastebin.com/jQ56Ww6b
> 
> Parameters other than title, description and content have no effect
> other than filtering out unwanted results. In a two of the four
> results, the title has the complete word "teste". On the other two,
> the word appears in the other fields.
> 
> Does anyone have any insights about what I'm doing wrong?
> 
> Thanks in advance.
> 
> Regards,
> Celso



Re: filters effect on search results

2011-06-29 Thread François Schiettecatte
Indeed, I find the Porter stemmer to be too 'aggressive' for my taste; I prefer 
the EnglishMinimalStemFilterFactory, with the caveat that it depends on your 
data set.
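Switching is just a matter of swapping the stemmer line in the relevant field type 
in schema.xml, something like this (a sketch; where exactly it goes depends on 
your analyzer chain):

<filter class="solr.EnglishMinimalStemFilterFactory"/>

in place of the Porter filter.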

Cheers

François

On Jun 29, 2011, at 6:21 AM, Ahmet Arslan wrote:

>> Hi, when i query for "elegant" in
>> solr i get results for "elegance" too. 
>> 
>> *I used these filters for index analyze*
>> WhitespaceTokenizerFactory 
>> StopFilterFactory 
>> WordDelimiterFilterFactory
>> LowerCaseFilterFactory 
>> SynonymFilterFactory
>> EnglishPorterFilterFactory
>> RemoveDuplicatesTokenFilterFactory
>> ReversedWildcardFilterFactory 
>> 
>> *
>> and for query analyze:*
>> 
>> .WhitespaceTokenizerFactory
>> SynonymFilterFactory
>> StopFilterFactory
>> WordDelimiterFilterFactory 
>> LowerCaseFilterFactory 
>> EnglishPorterFilterFactory 
>> RemoveDuplicatesTokenFilterFactory 
>> 
>> I want to know which filter affecting my search result.
>> 
> 
> It is EnglishPorterFilterFactory, you can verify it from admin/analysis.jsp 
> page.



Re: Include synonys in solr

2011-06-28 Thread François Schiettecatte
Well no, you need to see which files (if any) will suit your needs; they are 
not all synonym files. I only needed the UK/US English file, and I needed to 
process it into a format suitable for the synonyms file.

There may well be other word lists on the net suitable for your needs. I would 
not recommend the use of synonyms unless you have a specific need for them. I 
needed them because we have documents which mix UK/US English, and we need to 
be able to search on medical terms e.g. hemoglobin/haemoglobin and get the same 
results.
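The synonyms file format is just comma-separated equivalents, one group per line, 
so the medical additions look something like this (the second line is purely 
illustrative, not from our actual list):

hemoglobin, haemoglobin
anemia, anaemia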

Cheers 

François

On Jun 28, 2011, at 9:21 AM, Romi wrote:

> Thanks François Schiettecatte, the information you provided is very helpful.
> I need to know one more thing: I downloaded one of the given dictionaries but
> it contains many files; do I need to add all these files' data into
> synonyms.txt?
> 
> -
> Thanks & Regards
> Romi
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Include-synonyms-in-solr-tp3116836p3117733.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Removing duplicate documents from search results

2011-06-28 Thread François Schiettecatte
Yeah, I read the overview, which suggests that duplicates can be prevented from 
entering the index, and scanned the rest; it does not look like you can actually 
drop the document entirely. Maybe I am missing something here.

François

On Jun 28, 2011, at 9:14 AM, Mohammad Shariq wrote:

> Hey François,
> thanks for your suggestion, I followed the same link (
> http://wiki.apache.org/solr/Deduplication)
> 
> they have the solution: either make the hash the uniqueKey OR overwrite on
> duplicate;
> I don't need either.
> 
> I need discard-on-duplicate.
> 
>> 
>> 
>> I have not used it but it looks like it will do the trick.
>> 
>> François
>> 
>> On Jun 28, 2011, at 8:44 AM, Pranav Prakash wrote:
>> 
>>> I found the deduplication thing really useful. Although I have not yet
>>> started to work on it, as there are some other low hanging fruits I've to
>>> capture. Will share my thoughts soon.
>>> 
>>> 
>>> *Pranav Prakash*
>>> 
>>> "temet nosce"
>>> 
>>> Twitter <http://twitter.com/pranavprakash> | Blog <
>> http://blog.myblive.com> |
>>> Google <http://www.google.com/profiles/pranny>
>>> 
>>> 
>>> 2011/6/28 François Schiettecatte 
>>> 
>>>> Maybe there is a way to get Solr to reject documents that already exist
>> in
>>>> the index but I doubt it, maybe someone else with can chime here here.
>> You
>>>> could do a search for each document prior to indexing it so see if it is
>>>> already in the index, that is probably non-optimal, maybe it is easiest
>> to
>>>> check if the document exists in your Riak repository, it no add it and
>> index
>>>> it, and drop if it already exists.
>>>> 
>>>> François
>>>> 
>>>> On Jun 28, 2011, at 8:24 AM, Mohammad Shariq wrote:
>>>> 
>>>>> I am making the Hash from URL, but I can't use this as UniqueKey
>> because
>>>> I
>>>>> am using UUID as UniqueKey,
>>>>> Since I am using SOLR as  index engine Only and using Riak(key-value
>>>>> storage) as storage engine, I dont want to do the overwrite on
>> duplicate.
>>>>> I just need to discard the duplicates.
>>>>> 
>>>>> 
>>>>> 
>>>>> 2011/6/28 François Schiettecatte 
>>>>> 
>>>>>> Create a hash from the url and use that as the unique key, md5 or sha1
>>>>>> would probably be good enough.
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> François
>>>>>> 
>>>>>> On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote:
>>>>>> 
>>>>>>> I also have the problem of duplicate docs.
>>>>>>> I am indexing news articles, Every news article will have the source
>>>> URL,
>>>>>>> If two news-article has the same URL, only one need to index,
>>>>>>> removal of duplicate at index time.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 23 June 2011 21:24, simon  wrote:
>>>>>>> 
>>>>>>>> have you checked out the deduplication process that's available at
>>>>>>>> indexing time ? This includes a fuzzy hash algorithm .
>>>>>>>> 
>>>>>>>> http://wiki.apache.org/solr/Deduplication
>>>>>>>> 
>>>>>>>> -Simon
>>>>>>>> 
>>>>>>>> On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash 
>>>>>> wrote:
>>>>>>>>> This approach would definitely work if the two documents are
>>>> *Exactly*
>>>>>>>> the
>>>>>>>>> same. But this is very fragile. Even if one extra space has been
>>>> added,
>>>>>>>> the
>>>>>>>>> whole hash would change. What I am really looking for is some %age
>>>>>>>>> similarity between documents, and remove those documents which are
>>>> more
>>>>>>>> than
>>>>>>>>> 95% similar.
>>>>>>>>> 
>>>>>>>>> *Pranav Prakash*
>>>>>>>>> 
>>>>>>>>> "temet nosce"
>>>>>>>

Re: Removing duplicate documents from search results

2011-06-28 Thread François Schiettecatte
Indeed, take a look at this:

http://wiki.apache.org/solr/Deduplication

I have not used it but it looks like it will do the trick.
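The configuration on the wiki is a processor chain along these lines (untested on 
my side, and the field list is just the wiki's example, so adjust to your schema):

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>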

François

On Jun 28, 2011, at 8:44 AM, Pranav Prakash wrote:

> I found the deduplication thing really useful. Although I have not yet
> started to work on it, as there are some other low hanging fruits I've to
> capture. Will share my thoughts soon.
> 
> 
> *Pranav Prakash*
> 
> "temet nosce"
> 
> Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> |
> Google <http://www.google.com/profiles/pranny>
> 
> 
> 2011/6/28 François Schiettecatte 
> 
>> Maybe there is a way to get Solr to reject documents that already exist in
>> the index but I doubt it; maybe someone else can chime in here. You
>> could do a search for each document prior to indexing it to see if it is
>> already in the index, but that is probably non-optimal; maybe it is easiest to
>> check if the document exists in your Riak repository: if not, add it and index
>> it, and drop it if it already exists.
>> 
>> François
>> 
>> On Jun 28, 2011, at 8:24 AM, Mohammad Shariq wrote:
>> 
>>> I am making the Hash from URL, but I can't use this as UniqueKey because
>> I
>>> am using UUID as UniqueKey,
>>> Since I am using SOLR as  index engine Only and using Riak(key-value
>>> storage) as storage engine, I dont want to do the overwrite on duplicate.
>>> I just need to discard the duplicates.
>>> 
>>> 
>>> 
>>> 2011/6/28 François Schiettecatte 
>>> 
>>>> Create a hash from the url and use that as the unique key, md5 or sha1
>>>> would probably be good enough.
>>>> 
>>>> Cheers
>>>> 
>>>> François
>>>> 
>>>> On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote:
>>>> 
>>>>> I also have the problem of duplicate docs.
>>>>> I am indexing news articles, Every news article will have the source
>> URL,
>>>>> If two news-article has the same URL, only one need to index,
>>>>> removal of duplicate at index time.
>>>>> 
>>>>> 
>>>>> 
>>>>> On 23 June 2011 21:24, simon  wrote:
>>>>> 
>>>>>> have you checked out the deduplication process that's available at
>>>>>> indexing time ? This includes a fuzzy hash algorithm .
>>>>>> 
>>>>>> http://wiki.apache.org/solr/Deduplication
>>>>>> 
>>>>>> -Simon
>>>>>> 
>>>>>> On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash 
>>>> wrote:
>>>>>>> This approach would definitely work if the two documents are
>> *Exactly*
>>>>>> the
>>>>>>> same. But this is very fragile. Even if one extra space has been
>> added,
>>>>>> the
>>>>>>> whole hash would change. What I am really looking for is some %age
>>>>>>> similarity between documents, and remove those documents which are
>> more
>>>>>> than
>>>>>>> 95% similar.
>>>>>>> 
>>>>>>> *Pranav Prakash*
>>>>>>> 
>>>>>>> "temet nosce"
>>>>>>> 
>>>>>>> Twitter <http://twitter.com/pranavprakash> | Blog <
>>>>>> http://blog.myblive.com> |
>>>>>>> Google <http://www.google.com/profiles/pranny>
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Jun 23, 2011 at 15:16, Omri Cohen  wrote:
>>>>>>> 
>>>>>>>> What you need to do, is to calculate some HASH (using any message
>>>> digest
>>>>>>>> algorithm you want, md5, sha-1 and so on), then do some reading on
>>>> solr
>>>>>>>> field collapse capabilities. Should not be too complicated..
>>>>>>>> 
>>>>>>>> *Omri Cohen*
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 |
>>>>>> +972-3-6036295
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> My profiles: [image: LinkedIn] <http://www.linkedin.com/i

Re: Removing duplicate documents from search results

2011-06-28 Thread François Schiettecatte
Maybe there is a way to get Solr to reject documents that already exist in the 
index but I doubt it; maybe someone else can chime in here. You could do 
a search for each document prior to indexing it to see if it is already in the 
index, but that is probably non-optimal; maybe it is easiest to check if the 
document exists in your Riak repository: if not, add it and index it, and drop it if 
it already exists.

François

On Jun 28, 2011, at 8:24 AM, Mohammad Shariq wrote:

> I am making the Hash from URL, but I can't use this as UniqueKey because I
> am using UUID as UniqueKey,
> Since I am using SOLR as  index engine Only and using Riak(key-value
> storage) as storage engine, I dont want to do the overwrite on duplicate.
> I just need to discard the duplicates.
> 
> 
> 
> 2011/6/28 François Schiettecatte 
> 
>> Create a hash from the url and use that as the unique key, md5 or sha1
>> would probably be good enough.
>> 
>> Cheers
>> 
>> François
>> 
>> On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote:
>> 
>>> I also have the problem of duplicate docs.
>>> I am indexing news articles, Every news article will have the source URL,
>>> If two news-article has the same URL, only one need to index,
>>> removal of duplicate at index time.
>>> 
>>> 
>>> 
>>> On 23 June 2011 21:24, simon  wrote:
>>> 
>>>> have you checked out the deduplication process that's available at
>>>> indexing time ? This includes a fuzzy hash algorithm .
>>>> 
>>>> http://wiki.apache.org/solr/Deduplication
>>>> 
>>>> -Simon
>>>> 
>>>> On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash 
>> wrote:
>>>>> This approach would definitely work if the two documents are *Exactly*
>>>> the
>>>>> same. But this is very fragile. Even if one extra space has been added,
>>>> the
>>>>> whole hash would change. What I am really looking for is some %age
>>>>> similarity between documents, and remove those documents which are more
>>>> than
>>>>> 95% similar.
>>>>> 
>>>>> *Pranav Prakash*
>>>>> 
>>>>> "temet nosce"
>>>>> 
>>>>> Twitter <http://twitter.com/pranavprakash> | Blog <
>>>> http://blog.myblive.com> |
>>>>> Google <http://www.google.com/profiles/pranny>
>>>>> 
>>>>> 
>>>>> On Thu, Jun 23, 2011 at 15:16, Omri Cohen  wrote:
>>>>> 
>>>>>> What you need to do, is to calculate some HASH (using any message
>> digest
>>>>>> algorithm you want, md5, sha-1 and so on), then do some reading on
>> solr
>>>>>> field collapse capabilities. Should not be too complicated..
>>>>>> 
>>>>>> *Omri Cohen*
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 |
>>>> +972-3-6036295
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> My profiles: [image: LinkedIn] <http://www.linkedin.com/in/omric>
>>>> [image:
>>>>>> Twitter] <http://www.twitter.com/omricohe> [image:
>>>>>> WordPress]<http://omricohen.me>
>>>>>> Please consider your environmental responsibility. Before printing
>> this
>>>>>> e-mail message, ask yourself whether you really need a hard copy.
>>>>>> IMPORTANT: The contents of this email and any attachments are
>>>> confidential.
>>>>>> They are intended for the named recipient(s) only. If you have
>> received
>>>>>> this
>>>>>> email by mistake, please notify the sender immediately and do not
>>>> disclose
>>>>>> the contents to anyone or make copies thereof.
>>>>>> Signature powered by
>>>>>> <
>>>>>> 
>>>> 
>> http://www.wisestamp.com/email-install?utm_source=extension&utm_medium=email&utm_campaign=footer
>>>>>>> 
>>>>>> WiseStamp<
>>>>>> 
>>>> 
>> http://www.wisestamp.com/email-install?utm_source=extension&utm_medium=email&utm_campaign=footer
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -- Forwar

Re: Removing duplicate documents from search results

2011-06-28 Thread François Schiettecatte
Create a hash from the url and use that as the unique key, md5 or sha1 would 
probably be good enough.
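From the command line that is a one-liner (md5sum is the GNU coreutils tool; it is 
'md5' on OS X, and the URL here is just an example):

echo -n 'http://example.com/some/article' | md5sum

You would then store the resulting hex digest in your uniqueKey field at index time.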

Cheers

François

On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote:

> I also have the problem of duplicate docs.
> I am indexing news articles; every news article will have the source URL.
> If two news articles have the same URL, only one needs to be indexed:
> removal of duplicates at index time.
> 
> 
> 
> On 23 June 2011 21:24, simon  wrote:
> 
>> have you checked out the deduplication process that's available at
>> indexing time ? This includes a fuzzy hash algorithm .
>> 
>> http://wiki.apache.org/solr/Deduplication
>> 
>> -Simon
>> 
>> On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash  wrote:
>>> This approach would definitely work if the two documents are *Exactly*
>> the
>>> same. But this is very fragile. Even if one extra space has been added,
>> the
>>> whole hash would change. What I am really looking for is some %age
>>> similarity between documents, and remove those documents which are more
>> than
>>> 95% similar.
>>> 
>>> *Pranav Prakash*
>>> 
>>> "temet nosce"
>>> 
>>> Twitter  | Blog <
>> http://blog.myblive.com> |
>>> Google 
>>> 
>>> 
>>> On Thu, Jun 23, 2011 at 15:16, Omri Cohen  wrote:
>>> 
 What you need to do, is to calculate some HASH (using any message digest
 algorithm you want, md5, sha-1 and so on), then do some reading on solr
 field collapse capabilities. Should not be too complicated..
 
 *Omri Cohen*
 
 
 
 Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 |
>> +972-3-6036295
 
 
 
 
 My profiles: [image: LinkedIn] 
>> [image:
 Twitter]  [image:
 WordPress]
 Please consider your environmental responsibility. Before printing this
 e-mail message, ask yourself whether you really need a hard copy.
 IMPORTANT: The contents of this email and any attachments are
>> confidential.
 They are intended for the named recipient(s) only. If you have received
 this
 email by mistake, please notify the sender immediately and do not
>> disclose
 the contents to anyone or make copies thereof.
 Signature powered by
 <
 
>> http://www.wisestamp.com/email-install?utm_source=extension&utm_medium=email&utm_campaign=footer
> 
 WiseStamp<
 
>> http://www.wisestamp.com/email-install?utm_source=extension&utm_medium=email&utm_campaign=footer
> 
 
 
 
 -- Forwarded message --
 From: Pranav Prakash 
 Date: Thu, Jun 23, 2011 at 12:26 PM
 Subject: Removing duplicate documents from search results
 To: solr-user@lucene.apache.org
 
 
 How can I remove very similar documents from search results?
 
 My scenario is that there are documents in the index which are almost
 similar (people submitting same stuff multiple times, sometimes
>> different
 people submitting same stuff). Now when a search is performed for
 "keyword",
 in the top N results, quite frequently, same document comes up multiple
 times. I want to remove those duplicate (or possible duplicate)
>> documents.
 Very similar to what Google does when they say "In order to show you
>> most
 relevant result, duplicates have been removed". How can I achieve this
 functionality using Solr? Does Solr have a built-in feature or plugin which could
 help me with it?
 
 
 *Pranav Prakash*
 
 "temet nosce"
 
 Twitter  | Blog <
>> http://blog.myblive.com
> 
 |
 Google 
 
>>> 
>> 
> 
> 
> 
> -- 
> Thanks and Regards
> Mohammad Shariq



Re: Include synonys in solr

2011-06-28 Thread François Schiettecatte
Well, you need to find word lists and/or a thesaurus.

This is one place to start:

http://wordlist.sourceforge.net/

I used the US/UK English word list for my synonyms for an index I have because 
it contains both US and UK English terms; the list lacks some medical terms 
though, so we just added them.

Cheers

François

On Jun 28, 2011, at 6:55 AM, Romi wrote:

> Please see
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
> 
> No offence, but a simple Google search, or a search of the Wiki
> would have turned this up. Please try such simpler avenues before
> dashing off a message to the list.
> 
> 
> Gora, I have already read the document and also included synonyms in my
> search results :)
> 
> My question is, when I use this <filter class="solr.SynonymFilterFactory"
> synonyms="syn.txt" ignoreCase="true" expand="false"/>
> I need to enter synonyms manually in synonyms.txt, which is really tough
> if you have many words for synonyms. I wanted to ask: is there any other
> option so that I need not enter synonyms manually? I hope you got my
> point :)
> 
> 
> -
> Thanks & Regards
> Romi
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Include-synonyms-in-solr-tp3116836p3117365.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Searching in Traditional / Simplified Chinese Record

2011-06-20 Thread François Schiettecatte
Wayne

I am not sure what you mean by 'changing the record'.

One option would be to implement something like the synonyms filter to generate 
the TC for SC when you index the document, which would index both the TC and 
the SC in the same location. That way your users would be able to search with 
either TC or SC.

Another option would be to use the same synonyms filter but do the expansion at 
search time.
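For the index-time option the filter line would look something like this; tc-sc.txt 
is a hypothetical mapping file you would have to build, with each TC term and its 
SC equivalent on one line, comma separated:

<filter class="solr.SynonymFilterFactory" synonyms="tc-sc.txt" ignoreCase="false" expand="true"/>

With expand="true" both forms get indexed at the same position, so either one 
matches at search time.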

Cheers

François


On Jun 20, 2011, at 5:41 AM, waynelam wrote:

> Hi,
> 
> I've recently made a change to my schema.xml to support import of Chinese 
> records.
> What I want to do is to search both Traditional Chinese (TC) (e.g. ??) and 
> Simplified Chinese (SC) (e.g. ??) records
> in the same query. I know I can do that by converting all SC records to TC. 
> I want to change the way I index
> rather than change the records.
> 
> Anyone who could show me the way, it would be much appreciated.
> 
> 
> Thanks
> 
> Wayne
> 
> 
> -- 
> -
> Wayne Lam
> Assistant Library Officer I
> Systems Development&  Support
> Fong Sum Wood Library
> Lingnan University
> 8 Castle Peak Road
> Tuen Mun, New Territories
> Hong Kong SAR
> China
> Phone:   +852 26168585
> Email:   wayne...@ln.edu.hk
> Website: http://www.library.ln.edu.hk
> 



Re: Extending Solr Highlighter to pull information from external source

2011-06-20 Thread François Schiettecatte
Mike

I would be very interested in the answer to that question too. My hunch is that 
the answer is no too. I have a few text databases that range from 200MB to 
about 60GB with which I could run some tests. I will have some downtime in 
early July and will post results.

From what I can tell the Guardian newspaper is doing just that:


http://www.guardian.co.uk/open-platform/blog/what-is-powering-the-content-api

http://www.lucidimagination.com/blog/2010/04/29/for-the-guardian-solr-is-the-new-database/

Cheers

François


On Jun 20, 2011, at 9:05 AM, Mike Sokolov wrote:

> I'd be very interested in this, as well, if you do it before me and are 
> willing to share...
> 
> A related question I have tried to ask on this list, and have never really 
> gotten a good answer to, is whether it makes sense to just chuck the external 
> storage and treat the lucene index as the primary storage for documents.  I 
> have a feeling the answer is no; perhaps because of increased I/O costs for 
> lucene and solr, but I don't really know.  I've been considering doing some 
> experimentation, but would really love an expert opinion...
> 
> -Mike
> 
> On 06/20/2011 08:41 AM, Jamie Johnson wrote:
>> I am trying to index data where I'm concerned that storing the contents of a
>> specific field will be a bit of a hog so we are planning to retrieve this
>> information as needed for highlighting from an external source.  I am
>> looking to extend the default solr highlighting capability to work with
>> information pulled from this external source and it looks like this is
>> possible by extending DefaultSolrHighlighter (line 418 to pull a particular
>> field from external source) for standard highlighting and
>> BaseFragmentsBuilder (line 99) for FastVectorHighlighter.  I could just hard
>> code this to say if the field name is a specific value look into the
>> external source, is this the best way to accomplish this?  Are there any
>> other extension points to do what I'm suggesting?
>> 
>>   



Re: Is it true that I cannot delete stored content from the index?

2011-06-19 Thread François Schiettecatte
That is correct, but you only need to commit; optimize is not a requirement 
here.
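For example, using the XML update handler from the command line (host/port and 
the id value are assumptions, substitute your own):

curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '<delete><id>123</id></delete>'
curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' --data-binary '<commit/>'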

François

On Jun 18, 2011, at 11:54 PM, Mohammad Shariq wrote:

> I have defined <uniqueKey> in my solr and am deleting the docs from solr using
> this uniqueKey,
> and then doing optimization once a day.
> Is this the right way to delete?
> 
> On 19 June 2011 05:14, Erick Erickson  wrote:
> 
>> Yep, you've got to delete and re-add. Although if you have a
>>  defined you
>> can just re-add that document and Solr will automatically delete the
>> underlying
>> document.
>> 
>> You might have to optimize the index afterwards to get the data to really
>> disappear since the deletion process just marks the document as
>> deleted.
>> 
>> Best
>> Erick
>> 
>> On Sat, Jun 18, 2011 at 1:20 PM, Gabriele Kahlout
>>  wrote:
>>> Hello,
>>> 
>>> I've been indexing with the content field stored. Now I'd like to delete all
>>> stored content; is there a way to do that without re-indexing?
>>> 
>>> It seems not, from the Lucene FAQ:
>>> http://wiki.apache.org/lucene-java/LuceneFAQ#How_do_I_update_a_document_or_a_set_of_documents_that_are_already_indexed.3F
>>> How do I update a document or a set of documents that are already
>>> indexed? There
>>> is no direct update procedure in Lucene. To update an index incrementally
>>> you must first *delete* the documents that were updated, and *then
>>> re-add*them to the index.
>>> 
>>> --
>>> Regards,
>>> K. Gabriele
>>> 
>>> --- unchanged since 20/9/10 ---
>>> P.S. If the subject contains "[LON]" or the addressee acknowledges the
>>> receipt within 48 hours then I don't resend the email.
>>> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
>> time(x)
>>> < Now + 48h) ⇒ ¬resend(I, this).
>>> 
>>> If an email is sent by a sender that is not a trusted contact or the
>> email
>>> does not contain a valid code then the email is not received. A valid
>> code
>>> starts with a hyphen and ends with "X".
>>> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
>>> L(-[a-z]+[0-9]X)).
>>> 
>> 
> 
> 
> 
> -- 
> Thanks and Regards
> Mohammad Shariq



Re: Multiple indexes

2011-06-18 Thread François Schiettecatte
You would need to run two independent searches and then 'join' the results.

It is best not to apply a 'sql' mindset to SOLR when it comes to 
(de)normalization: whereas you strive for normalization in SQL, that is usually 
counter-productive in SOLR. For example, I am working on a project with 30+ 
normalized tables, but only 4 cores.

Perhaps describing what you are trying to achieve would give us greater insight 
and thus let us make more concrete recommendations?

Cheers

François 

On Jun 18, 2011, at 2:36 PM, shacky wrote:

> On 18 June 2011 at 20:27, François Schiettecatte
>  wrote:
>> Sure.
> 
> So I can have some searches similar to JOIN on MySQL?
> The problem is that I need at least two tables in which to search data.



Re: Multiple indexes

2011-06-18 Thread François Schiettecatte
Sure.

François

On Jun 18, 2011, at 2:25 PM, shacky wrote:

> 2011/6/15 Edoardo Tosca :
>> Try to use multiple cores:
>> http://wiki.apache.org/solr/CoreAdmin
> 
> Can I do concurrent searches on multiple cores?



Re: Why does paste get parsed into past?

2011-06-18 Thread François Schiettecatte
What I meant was what stemmer are you using? Maybe it is the stemmer that is 
cutting the 'e'. You can check that on the field analysis solr web page.

François

On Jun 18, 2011, at 11:42 AM, Gabriele Kahlout wrote:

> I'm !sure where those are set, but on reflection I'd keep the default
> settings. My real issue is why are not query keywords treated as a
> set?<http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201106.mbox/%3CBANLkTikHunhyWc2WVTofRYU4ZW=c8oe...@mail.gmail.com%3E>
> 2011/6/18 François Schiettecatte 
> 
>> What do you have set up for stemming?
>> 
>> François
>> 
>> On Jun 18, 2011, at 8:00 AM, Gabriele Kahlout wrote:
>> 
>>> Hello,
>>> 
>>> Debugging query results I find that:
>>> paste
>>> content:past
>>> 
>>> Now paste and past are two different words. Why does Solr not consider
>>> that? How do I make it?
>>> 
>>> --
>>> Regards,
>>> K. Gabriele
>>> 
>>> --- unchanged since 20/9/10 ---
>>> P.S. If the subject contains "[LON]" or the addressee acknowledges the
>>> receipt within 48 hours then I don't resend the email.
>>> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
>>> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>>> 
>>> If an email is sent by a sender that is not a trusted contact or the
>>> email does not contain a valid code then the email is not received. A
>>> valid code starts with a hyphen and ends with "X".
>>> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y
>>> ∈ L(-[a-z]+[0-9]X)).
>> 
>> 
> 
> 
> -- 
> Regards,
> K. Gabriele
> 
> --- unchanged since 20/9/10 ---
> P.S. If the subject contains "[LON]" or the addressee acknowledges the
> receipt within 48 hours then I don't resend the email.
> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
> < Now + 48h) ⇒ ¬resend(I, this).
> 
> If an email is sent by a sender that is not a trusted contact or the email
> does not contain a valid code then the email is not received. A valid code
> starts with a hyphen and ends with "X".
> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
> L(-[a-z]+[0-9]X)).



Re: Why does paste get parsed into past?

2011-06-18 Thread François Schiettecatte
What do you have set up for stemming?

François

On Jun 18, 2011, at 8:00 AM, Gabriele Kahlout wrote:

> Hello,
> 
> Debugging query results I find that:
> paste
>  content:past
> 
> Now paste and past are two different words. Why does Solr not consider
> that? How do I make it?
> 
> --
> Regards,
> K. Gabriele
> 
> --- unchanged since 20/9/10 ---
> P.S. If the subject contains "[LON]" or the addressee acknowledges the
> receipt within 48 hours then I don't resend the email.
> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
> time(x) < Now + 48h) ⇒ ¬resend(I, this).
> 
> If an email is sent by a sender that is not a trusted contact or the
> email does not contain a valid code then the email is not received. A
> valid code starts with a hyphen and ends with "X".
> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y
> ∈ L(-[a-z]+[0-9]X)).



Re: Performance loss - querying more than 64 cores (randomly)

2011-06-16 Thread François Schiettecatte
I am assuming that you are running on Linux here; I have found atop to be very 
useful to see what is going on.

http://freshmeat.net/projects/atop/

dstat is also very useful but needs a little more work to 'decode'.

Obviously there is contention going on; you just need to figure out where it 
is. Most likely it is disk I/O, but it could also be the number of cores you 
have. Also, I would not say that performance is decreasing rapidly; it is probably 
more of a gentle slope down if you plot it (you double the number of cores 
every time).

I would be very interested in hearing about what you find.

Cheers

François

On Jun 16, 2011, at 10:00 AM, Andrzej Bialecki wrote:

> On 6/16/11 3:22 PM, Mark Schoy wrote:
>> Hi,
>> 
>> I set up a Solr instance with 512 cores. Each core has 100k documents and 15
>> fields. Solr is running on a CPU with 4 cores (2.7Ghz) and 16GB RAM.
>> 
>> Now I've done some benchmarks with JMeter. On each thread iteration JMeter
>> queriing another Core by random. Here are the results (Duration:  each with
>> 180 second):
>> 
>> Randomly queried cores | queries per second
>> 1| 2016
>> 2 | 2001
>> 4 | 1978
>> 8 | 1958
>> 16 | 2047
>> 32 | 1959
>> 64 | 1879
>> 128 | 1446
>> 256 | 1009
>> 512 | 428
>> 
>> Why are the queries per second constant up to 64 cores, and then the performance
>> decreases rapidly?
>> 
>> Solr only uses 10GB of the 16GB memory so I think it is not a memory issue.
>> 
> 
> This may be an OS-level disk buffer issue. With a limited disk buffer space 
> the more random IO occurs from different files, the higher is the churn rate, 
> and if the buffers are full then the churn rate may increase dramatically 
> (and the performance will drop then). Modern OS-es try to keep as much data 
> in memory as possible, so the memory usage itself is not that informative - 
> but check what are the pagein/pageout rates when you start hitting the 32 vs 
> 64 cores.
> 
> -- 
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _   __
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 



Re: Strange behavior

2011-06-14 Thread François Schiettecatte
I think you will need to provide more information than this; no-one on this 
list is omniscient AFAIK.

François

On Jun 14, 2011, at 10:44 AM, Denis Kuzmenok wrote:

> Hi.
> 
> I've debugged search on the test machine; after copying the entire directory
> (the entire solr directory) to the production server, I've noticed that one
> query (SDR S70EE K) does match on the test server, and does not on
> production.
> How can that be?
> 



Re: Solr Field name restrictions

2011-06-04 Thread François Schiettecatte
Underscores and dashes are fine, but I would think that colons (:) are verboten.

François

On Jun 4, 2011, at 9:49 PM, Jamie Johnson wrote:

> Is there a list anywhere detailing field name restrictions.  I imagine
> fields containing periods (.) are problematic if you try to use that field
> when doing faceted queries, but are there any others?  Are underscores (_)
> or dashes (-) ok?


