RE: java GC overhead limit exceeded

2010-07-27 Thread Bastian Spitzer
Hi, which version do you use? 1.4.1 is highly recommended since previous versions contained some bugs related to memory usage that could lead to memory leaks. I had this GC overhead limit in my setup as well. The only workaround that helped was a daily restart of all instances. With 1.4.1 this

Re: Design questions/Schema Help

2010-07-27 Thread Chantal Ackermann
Hi, IMHO you can do this with date range queries and (date) facets. The DateMathParser will allow you to normalize dates on min/hours/days. If you hit a limit there, then just add a field with an integer for either min/hour/day. This way you'll lose the month information - which is sometimes
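
The rounding idea in the message above can be sketched in Python (this is not Solr's own date math, only an illustration). The names round_down and day_bucket are hypothetical. The sketch truncates a timestamp to the start of its minute, hour or day, and derives an integer field that deliberately drops the month.

```python
from datetime import datetime

def round_down(ts: datetime, unit: str) -> datetime:
    """Truncate a timestamp to the start of its minute, hour, or day."""
    if unit == "minute":
        return ts.replace(second=0, microsecond=0)
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if unit == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {unit}")

def day_bucket(ts: datetime) -> int:
    """Integer day-of-month field; the month information is lost on purpose."""
    return ts.day
```

With such an integer field, 27 July and 27 August land in the same bucket, which is the loss of month information the message refers to.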

Re: How to Combine Drupal solrconfig.xml with Nutch solrconfig.xml?

2010-07-27 Thread David Stuart
I would use the string version, as Drupal will probably populate it with a URL-like value that may not validate as type url. On 27 Jul 2010, at 04:00, Savannah Beckett wrote: I am trying to merge the schema.xml that is the solr/nutch setup with the one from drupal apache solr

Any tips/guidelines for tuning Solr/Lucene performance in a master/slave/sharding environment

2010-07-27 Thread Chengyang
How can I reduce the index file size, decrease the sync time between nodes, and decrease the index create/update time? Thanks.

Russian stemmer

2010-07-27 Thread Oleg Burlaca
Hello, I'm using SnowballPorterFilterFactory with language=Russian. The stemming works ok except people names, geographical places. Here are some examples: searching for Ковров should also find Коврова, Коврову, Ковровом, Коврове. Are there other stemming plugins for the russian language that

Re: Russian stemmer

2010-07-27 Thread Robert Muir
All of your examples stem to ковров: assertAnalyzesTo(a, "Коврова Коврову Ковровом Коврове", new String[] { "ковров", "ковров", "ковров", "ковров" }); } Are you sure you enabled this at *both* index and query time? 2010/7/27 Oleg Burlaca o...@burlaca.com Hello, I'm using

Spellchecking and frequency

2010-07-27 Thread dan sutton
Hi, I've recently been looking into spellchecking in Solr, and was struck by how limited the usefulness of the tool was. Like most corpora, ours contains lots of different spelling mistakes for the same word, so the 'spellcheck.onlyMorePopular' is not really that useful unless you click on it

Re: Russian stemmer

2010-07-27 Thread Robert Muir
On another look, your problem is ковров itself... it's mapped to ковр. A workaround might be to use the protected words functionality to keep ковров and any other problematic people/geo names as-is. Separately, in trunk there is an alternative Russian stemmer (RussianLightStemFilterFactory), which

Re: Russian stemmer

2010-07-27 Thread Oleg Burlaca
Yes, I'm sure I've enabled SnowballPorterFilterFactory both at Index and Query time, because the search works ok, except names and geo locations. I've noticed that searching by Коврова also shows documents that contain Коврову, Коврове Search by Ковров, 7 results:

Re: Russian stemmer

2010-07-27 Thread Oleg Burlaca
A similar word is Немцов. The strange thing is that searching for Немцова will not find documents containing Немцов Немцова: 14 articles http://www.sova-center.ru/search/?lg=1q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0 Немцов: 74 articles

Re: Russian stemmer

2010-07-27 Thread Oleg Burlaca
Actually the situation with Немцов is ok, I've just checked how Yandex works with Немцов and Немцова: http://nano.yandex.ru/project/inflect/ I think there are two solutions: a) manually search for both Немцов and then Немцова b) use wildcard query: Немцов* Robert, thanks for the

Re: Russian stemmer

2010-07-27 Thread Robert Muir
2010/7/27 Oleg Burlaca o...@burlaca.com Actually the situation with Немцов is ok, I've just checked how Yandex works with Немцов and Немцова: http://nano.yandex.ru/project/inflect/ I think there are two solutions: a) manually search for both Немцов and then Немцова b) use wildcard query:

clustering component

2010-07-27 Thread Matt Mitchell
Hi, I'm attempting to get the carrot based clustering component (in trunk) to work. I see that the clustering contrib has been disabled for the time being. Does anyone know if this will be re-enabled soon, or even better, know how I could get it working as it is? Thanks, Matt

Re: clustering component

2010-07-27 Thread Stanislaw Osinski
Hi Matt, I'm attempting to get the carrot based clustering component (in trunk) to work. I see that the clustering contrib has been disabled for the time being. Does anyone know if this will be re-enabled soon, or even better, know how I could get it working as it is? I've recently created a

Re: slave index is bigger than master index

2010-07-27 Thread Muneeb Ali
We have three dedicated servers for Solr, two for slaves and one for master, all with linux/debian packages installed. I understand that replication always copies over the index in an exact form as in the master index directory (or it is supposed to do that at least), and if the master index

Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

2010-07-27 Thread Chantal Ackermann
Hi Mitch, thanks for that suggestion. I wasn't aware of that. I've already added a temporary field in my ScriptTransformer that does basically the same. However, with this approach indexing time went up from 20min to more than 5 hours. The new approach is to query the solr index for that other

LucidWorks 1.4 compilation

2010-07-27 Thread Eric Grobler
Good Morning, afternoon or evening... If someone installed Solr using the LucidWorks.jar (1.4) installation, how can one make a small change and recompile? Is there a LucidWorks (tomcat) build somewhere? Regards ericz

Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

2010-07-27 Thread Alessandro Benedetti
Hi Jon, Over the last few days we faced the same problem. Using Solr 1.4.1 classic (Tika 0.4), we can't extract content from some PDF files, and for others Solr throws an exception during the indexing process. You must: Update the Tika libraries (in /contrib/extraction/lib) with tika-core.0.8

DIH $deleteDocByQuery

2010-07-27 Thread Maddy.Jsh
Hi, I have been using DIH to index documents from a database. I am hoping to use DIH to delete documents from the index. I searched the wiki and found the special commands in DIH to do so. http://wiki.apache.org/solr/DataImportHandler#Special_Commands But there is no example on how to use them. I

Re: NullPointerException with CURL, but not in browser

2010-07-27 Thread Rene Rath
Ouch! Absolutely correct - quoting the URL fixed it. Thanks for saving me a sleepless night! cheers - rene 2010/7/26 Chris Hostetter hossman_luc...@fucit.org : However, when I'm trying this very URL with curl within my (perl) script, I : receive a NullPointerException: : CURL-COMMAND: curl

Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

2010-07-27 Thread MitchK
Hi Chantal, However, with this approach indexing time went up from 20min to more than 5 hours. This is 15x slower than the initial solution... wow. From MySQL I know that IN ()-clauses are the embodiment of endlessness - they perform very, very badly. New idea: Create a method which

Re: LucidWorks 1.4 compilation

2010-07-27 Thread Eric Grobler
I did not realize the LucidWorks.jar comes with an option to install the sources :-) On Tue, Jul 27, 2010 at 10:59 AM, Eric Grobler impalah...@googlemail.com wrote: Good Morning, afternoon or evening... If someone installed Solr using the LucidWorks.jar (1.4) installation how can one make a

Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

2010-07-27 Thread Chantal Ackermann
Hi Mitch, New idea: Create a method which returns the query-string: returnString(theVIP) { if ( theVIP != null && theVIP != "" ) { return a query-string to find the vip } else { return "SELECT 1" // you need to modify this,
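
The quoted pseudocode can be sketched as runnable Python, purely as an illustration. The function name build_vip_query and the table and column names (vip_lookup, value, vip) are hypothetical and do not come from the thread. Only the "SELECT 1" fallback is taken from the message, and as the message says it still needs adapting.

```python
def build_vip_query(vip):
    """Return (sql, params) for the sub-entity lookup, or a harmless fallback.

    Both checks must hold, so they are joined with "and"; joined with "or"
    the condition would be true for every input, including None and "".
    """
    if vip is not None and vip != "":
        # Hypothetical table and column names, used only for illustration.
        return ("SELECT value FROM vip_lookup WHERE vip = ?", (vip,))
    # Fallback taken from the thread; it still needs to be modified so that
    # it returns the columns the parent entity expects.
    return ("SELECT 1", ())
```

Returning the value as a bound parameter instead of pasting it into the SQL string is a design choice of this sketch, not something the thread prescribes.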

Re: slave index is bigger than master index

2010-07-27 Thread Peter Karich
We have three dedicated servers for Solr, two for slaves and one for master, all with linux/debian packages installed. I understand that replication always copies over the index in an exact form as in the master index directory (or it is supposed to do that at least), and if the master

Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

2010-07-27 Thread MitchK
Hi Chantal, instead of: entity name=prog ... field name=vip ... /* multivalued, not required */ entity name=ssc_entry dataSource=ssc onError=continue query=select SSC_VALUE from SSC_VALUE where SSC_ATTRIBUTE_ID=1

question: solrCloud with multiple cores on each machine

2010-07-27 Thread Yatir Ben Shlomo
Hi, I am using solrCloud. Suppose I have a total of 4 machines dedicated for Solr. I want to have 2 machines as replication (slaves) and 2 masters. But I want to work with 8 logical cores rather than 2, i.e. each master (and each slave) will have 4 cores on it. The reason is that I can optimize the cores

Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

2010-07-27 Thread Chantal Ackermann
Hi Mitch, thanks for the code. Currently, I've got a different solution running but it's always good to have examples. I realized that I have to throw an exception and add the onError attribute to the entity to make that work. I am curious: Can you show how to make a method

RE: Spellcheck help

2010-07-27 Thread Marc Ghorayeb
Thanks for the input, i'll check it out! Marc Subject: RE: Spellcheck help Date: Fri, 23 Jul 2010 13:12:04 -0500 From: james.d...@ingrambook.com To: solr-user@lucene.apache.org In org.apache.solr.spelling.SpellingQueryConverter, find the line (#84): final static String PATTERN =

Re: Russian stemmer

2010-07-27 Thread Oleg Burlaca
Thanks Robert for all your help, The idea of ы[A-Z].* stopwords is ideal for the english language, although in russian nouns are inflected: Борис, Борису, Бориса, Борисом I'll try the RussianLightStemFilterFactory (the article in the PDF mentioned it's more accurate). Once again thanks, Oleg

Re: Russian stemmer

2010-07-27 Thread Robert Muir
Right, but your problem is that this is the current output: Ковров -> Ковр, Коврову -> Ковров, Ковровом -> Ковров, Коврове -> Ковров. So, if Ковров was simply left alone, all your forms would match... 2010/7/27 Oleg Burlaca o...@burlaca.com Thanks Robert for all your help, The idea of ы[A-Z].* stopwords
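
The effect of protecting a word from stemming can be sketched with a toy stemmer in Python. This is not the Snowball algorithm and not Solr's protected-words implementation. The suffix list is an assumption chosen only so that the toy reproduces the outputs quoted in this thread, and toy_stem is a hypothetical name.

```python
# Toy suffix list, NOT the real Snowball rules for Russian.
SUFFIXES = ("ом", "ов", "а", "у", "е")

def toy_stem(word, protected=frozenset()):
    """Lowercase the word and strip one suffix, unless the word is protected."""
    token = word.lower()
    if token in protected:
        return token  # protected words are kept as-is
    for suffix in SUFFIXES:
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token

forms = ["Ковров", "Коврова", "Коврову", "Ковровом", "Коврове"]

# Without protection the base form is over-stemmed and no longer matches.
stems_plain = {toy_stem(w) for w in forms}

# With the base form protected, every inflected form maps to the same token.
stems_protected = {toy_stem(w, protected={"ковров"}) for w in forms}
```

In the toy, stems_plain holds two different tokens while stems_protected holds one, which is the matching behaviour the workaround is after.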

Highlighting parameters wiki

2010-07-27 Thread Stephen Green
The wiki entry for hl.highlightMultiTerm: http://wiki.apache.org/solr/HighlightingParameters#hl.highlightMultiTerm doesn't appear to be correct. It says: If the SpanScorer is also being used, enables highlighting for range/wildcard/fuzzy/prefix queries. Default is false. But the code in

RE: Spellcheck help

2010-07-27 Thread Dyer, James
If you could, let me know how your testing goes with this change. I too am interested in having the Collate work as well as it can. It looks like the code would be better with this change but then again I don't know what the original author was thinking when this was put in. James Dyer

RE: Querying throws java.util.ArrayList.RangeCheck

2010-07-27 Thread Manepalli, Kalyan
Hi Yonik, I am using Solr 1.4 release dated Feb-9 2010. There is no custom code. I am using regular out of box dismax requesthandler. The query is a simple one with 4 filter queries (fq's) and one sort query. During the index generation, I delete a set of rows based on date filter, then add new

Is it possible to get keyword/match's position?

2010-07-27 Thread Ryan Chan
According to SO: http://stackoverflow.com/questions/1557616/retrieving-per-keyword-field-match-position-in-lucene-solr-possible it is not possible, but that answer is a year old. Is it still true now? Thanks.

Re: java GC overhead limit exceeded

2010-07-27 Thread Text Analysis
Look into -XX:-GCUseOverheadLimit On 7/26/10, Jonathan Rochkind rochk...@jhu.edu wrote: I am now occasionally getting a Java GC overhead limit exceeded error in my Solr. This may or may not be related to recently adding much better (and more) warming queries. I can get it when trying a

RE: Total number of terms in an index?

2010-07-27 Thread Burton-West, Tom
Hi Jason, Are you looking for the total number of unique terms or total number of term occurrences? Checkindex reports both, but does a bunch of other work so is probably not the fastest. If you are looking for total number of term occurrences, you might look at

SpatialSearch: sorting by distance

2010-07-27 Thread Pavel Minchenkov
Hi, I'm trying to sort by distance like this: sort=dist(2,lat,lon,55.755786,37.617633) asc In general results are sorted, but some documents are not in the right order. I'm using DistanceUtils.getDistanceMi(...) from lucene spatial to calculate the real distance after reading documents from Solr. Solr
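
One possible cause of such ordering differences can be sketched in Python. The assumption here, not confirmed by the truncated message, is that dist(2, ...) computes a plain Euclidean distance on raw degree values, while the reference calculation is a great-circle distance. The haversine function below stands in for that calculation and is not the Lucene spatial code. The two sample points are made up for illustration.

```python
import math

def euclid_degrees(lat1, lon1, lat2, lon2):
    """Euclidean (2-norm) distance on raw latitude/longitude degrees."""
    return math.hypot(lat1 - lat2, lon1 - lon2)

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres on a spherical Earth."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

CENTER = (55.755786, 37.617633)   # query point from the message
EAST = (55.755786, 38.617633)     # made-up point, 1.0 degree east
NORTH = (56.555786, 37.617633)    # made-up point, 0.8 degrees north

# In degree space NORTH looks closer, but on the sphere EAST is closer,
# because a degree of longitude is much shorter at this latitude.
euclid_order = sorted([EAST, NORTH], key=lambda p: euclid_degrees(*CENTER, *p))
sphere_order = sorted([EAST, NORTH], key=lambda p: haversine_km(*CENTER, *p))
```

If the assumption holds, the two orderings disagree for exactly this kind of pair, which would explain why only some documents appear out of order.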

does this indicate a commit happened for every add?

2010-07-27 Thread Robert Petersen
I'm adding lots of small docs with several threads to solr and the adds start fast but then slow down. I didn't do any explicit commits and autocommit is turned off but the logs show lots of commit activity on this core and restarting this solr core logged the below. Where did all these commits

Re: Spellchecking and frequency

2010-07-27 Thread Mark Holland
Hi, I found the suggestions returned from the standard solr spellcheck not to be that relevant. By contrast, aspell, given the same dictionary and misspelled words, gives much more accurate suggestions. I therefore wrote an implementation of SolrSpellChecker that wraps jazzy, the java aspell

RE: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

2010-07-27 Thread David Thibault
Alessandro all, I was having the same issue with Tika crashing on certain PDFs. I also noticed the bug where no content was extracted after upgrading Tika. When I went to the SOLR issue you link to below, I applied all the patches, downloaded the Tika 0.8 jars, restarted tomcat, posted a

Re: Total number of terms in an index?

2010-07-27 Thread Michael McCandless
In trunk (flex) you can ask each segment for its unique term count. But to compute the unique term count across all segments is necessarily costly (requires merging them, to de-dup), as Hoss described. Mike On Tue, Jul 27, 2010 at 12:27 PM, Burton-West, Tom tburt...@umich.edu wrote: Hi Jason,
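
Why per-segment counts cannot simply be added up can be sketched in Python. Segments are modelled here as plain sets of terms, which is a simplification for illustration and not how Lucene stores them. The term values are made up.

```python
# Each segment holds its own set of unique terms (made-up data).
segments = [
    {"solr", "lucene", "index"},
    {"lucene", "stemmer"},
    {"index", "facet", "solr"},
]

# Cheap: each segment can report its own unique term count.
per_segment_counts = [len(terms) for terms in segments]

# Wrong as a global answer: terms shared between segments are counted twice.
naive_total = sum(per_segment_counts)

# Correct but costly: the terms must be merged across segments to de-duplicate.
unique_total = len(set().union(*segments))
```

The gap between naive_total and unique_total is the de-duplication work that makes the cross-segment count expensive.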

RE: Spellchecking and frequency

2010-07-27 Thread Dyer, James
Mark, I'd like to see your code if you open a JIRA for this. I recently opened SOLR-2010 with a patch that does something similar to the second part only of what you describe (find combinations that actually return a match). But I'm not sure if my approach is the best one so I would like to see

Re: Timeout in distributed search

2010-07-27 Thread Chris Hostetter
: Is there anyway to have time out support in distributed search. I : searched https://issues.apache.org/jira/browse/SOLR-502 but looks it is : not in main release of solr1.4 note that issue is marked Fix Version/s: 1.3 ... that means it was fixed in Solr 1.3, well before 1.4 came out. You

Re: SolrCore has a large number of SolrIndexSearchers retained in infoRegistry

2010-07-27 Thread Chris Hostetter
: : I was wondering if anyone has found any resolution to this email thread? As Grant asked in his reply when this thread was first started (December 2009)... It sounds like you are either using embedded mode or you have some custom code. Are you sure you are releasing your resources

Re: help finding illegal chars in XML doc

2010-07-27 Thread Chris Hostetter
: Thanks for your reply. I could not find in the log files any mention to : that. By the way I only have _MM_DD.request.log files in my directory. : : Do I have to enable any specific log or level to catch those errors? if you are using that java -jar start.jar command for the example

Difficulties with Highlighting

2010-07-27 Thread Nathaniel Grove
I'm a relative beginner at SOLR, indexing and searching Unicode Tibetan texts. I am trying to use the highlighter but it just returns empty elements, such as: <lst name="highlighting"> <lst name="kt-d-0103-text-v4p262a"/> </lst> What am I doing wrong? The query that generated that is:

Re: SolrCore has a large number of SolrIndexSearchers retained in infoRegistry

2010-07-27 Thread Ken Krugler
On Jul 27, 2010, at 12:21pm, Chris Hostetter wrote: : : I was wondering if anyone has found any resolution to this email thread? As Grant asked in his reply when this thread was first started (December 2009)... It sounds like you are either using embedded mode or you have some custom

Re: Difficulties with Highlighting

2010-07-27 Thread Erik Hatcher
Than - Looks like maybe your text_bo field type isn't analyzing how you'd like? Though that's just a hunch. I pasted the value of that field returned in the link you provided into your analysis.jsp page and it chunked tokens by whitespace. Though I could be experiencing a copy/

Re: Querying throws java.util.ArrayList.RangeCheck

2010-07-27 Thread Jason Ronallo
I am getting a similar error with today's nightly build: HTTP Status 500 - Index: 54, Size: 24 java.lang.IndexOutOfBoundsException: Index: 54, Size: 24 at java.util.ArrayList.RangeCheck(ArrayList.java:547) at java.util.ArrayList.get(ArrayList.java:322) at

min/max, StatsComponent, performance

2010-07-27 Thread Jonathan Rochkind
I thought I asked a variation of this before, but I don't see it on the list, apologies if this is a duplicate, but I have new questions. So I need to find the min and max value of a result set. Which can be several million documents. One way to do this is the StatsComponent. One problem is

Indexing Problem: Where's my data?

2010-07-27 Thread Michael Griffiths
Hi, (The first version of this was rejected for spam). I'm setting up a test instance of Solr, and keep running into the problem of having Solr not work the way I think it should work. Specifically, the data I want to go into the index isn't there after indexing. I'm extracting the data from

RE: Querying throws java.util.ArrayList.RangeCheck

2010-07-27 Thread Manepalli, Kalyan
Yonik, One more update on this. I used the filter query that was throwing error and used it to delete a subset of results. After that the queries started working correctly. Which indicates that the particular docId was present in the index somewhere, but lucene was not able to find it.

Re: Indexing Problem: Where's my data?

2010-07-27 Thread kenf_nc
for STRING_VALUE, I assume there is a property in the 'select *' results called string_value? if so I'm not sure why it wouldn't work. If not, then that's why, it doesn't have anything to put there. For ATTRIBUTE_NAME, is it possibly a case issue? you called it 'Attribute_Name' in your query,

Re: Highlighting parameters wiki

2010-07-27 Thread Koji Sekiguchi
(10/07/27 23:16), Stephen Green wrote: The wiki entry for hl.highlightMultiTerm: http://wiki.apache.org/solr/HighlightingParameters#hl.highlightMultiTerm doesn't appear to be correct. It says: If the SpanScorer is also being used, enables highlighting for range/wildcard/fuzzy/prefix queries.

How to 'filter' facet results

2010-07-27 Thread David Thompson
Is there a way to tell Solr to only return a specific set of facet values? I feel like the facet query must be able to do this, but I'm not really understanding the facet query. In my specific case, I'd like to only see facet values for the same values I pass in as query filters, i.e. if I

RE: How to 'filter' facet results

2010-07-27 Thread Jonathan Rochkind
Is there a way to tell Solr to only return a specific set of facet values? I feel like the facet query must be able to do this, but I'm not really understanding the facet query. In my specific case, I'd like to only see facet values for the same values I pass in as query filters, i.e. if I

Re: Tika, Solr running under Tomcat 6 on Debian

2010-07-27 Thread Lance Norskog
I would start over from the Solr 1.4.1 binary distribution and follow the instructions on the wiki: http://wiki.apache.org/solr/ExtractingRequestHandler (Java classpath stuff is notoriously difficult, especially when dynamically configured and loaded. I often cannot tell if Java cannot load the

Re: Spellchecking and frequency

2010-07-27 Thread Erick Erickson
Yonik's Law of Patches reads: A half-baked patch in Jira, with no documentation, no tests and no backwards compatibility is better than no patch at all. It'd be perfectly appropriate, IMO, for you to post an outline of what your enhancements do over on the SOLR dev list and get a reaction from the

Re: Solr 3.1 and ExtractingRequestHandler resulting in blank content

2010-07-27 Thread Lance Norskog
There are two different datasets that Solr (Lucene really) saves from a document: raw storage and the indexed terms. I don't think the ExtractingRequestHandler ever automatically stored the raw data; in fact Lucene works in Strings internally, not raw byte arrays (this is changing). It should be

Re: slave index is bigger than master index

2010-07-27 Thread Lance Norskog
Ah! You have junk files piling up in the slave index directory. When this happens, you may have to remove data/index entirely. I'm not sure if Solr replication will handle that, or if you have to copy the whole index to reset it. You said the slaves time out- maybe the files are so large that the

Re: Indexing Problem: Where's my data?

2010-07-27 Thread Lance Norskog
Solr respects case for field names. Database fields are supplied in lower-case, so it should be 'attribute_name' and 'string_value'. Also 'product_id', etc. It is easier if you carefully emulate every detail in the examples, for example lower-case names. On Tue, Jul 27, 2010 at 2:59 PM, kenf_nc

Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

2010-07-27 Thread Lance Norskog
Should this go into the trunk, or does it only solve problems unique to your use case? On Tue, Jul 27, 2010 at 5:49 AM, Chantal Ackermann chantal.ackerm...@btelligent.de wrote: Hi Mitch, thanks for the code. Currently, I've got a different solution running but it's always good to have

Re: Russian stemmer

2010-07-27 Thread Dennis Gearon
I have studied some Russian. I kind of got the picture from the texts that all the exceptions had already been 'found', and were listed in the book. I do know that languages are living, changing organisms, but Russian has got to be more regular than English I would think, even WITH all six