logic required for newbie

2010-07-27 Thread Jonty Rhods
Hi All,

I am very new and learning solr.

I have 10 column like following in table

1. id
2. name
3. user_id
4. location
5. country
6. landmark1
7. landmark2
8. landmark3
9. landmark4
10. landmark5

when a user searches for a landmark, I want to return only the landmark that
matches. The rest of the landmarks should be ignored.
The expected result would look like the following if the user searches for "landmark2":

1. id
2. name
3. user_id
4. location
5. country
7. landmark2

or if searched for "landmark9":

1. id
2. name
3. user_id
4. location
5. country
9. landmark9


please help me to design the schema for this kind of requirement...

thanks
with regards


facet total score instead of total count

2010-07-27 Thread Bharat Jain
Hi,
   I have a requirement where I want to sum up the scores of the faceted
fields. This will decide the relevancy for us. Is there a way to do it on
a facet field? Basically, instead of giving the count of records for a facet
field, I would like to have the total sum of scores for those records.

Any help is greatly appreciated.

Thanks
Bharat Jain


Re: Russian stemmer

2010-07-27 Thread Dennis Gearon
I have studied some Russian. I kind of got the picture from the texts that all 
the exceptions had already been 'found', and were listed in the book. 

I do know that languages are living, changing organisms, but Russian has got to 
be more regular than English I would think, even WITH all six cases and 3 
genders.

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Tue, 7/27/10, Robert Muir  wrote:

> From: Robert Muir 
> Subject: Re: Russian stemmer
> To: solr-user@lucene.apache.org
> Date: Tuesday, July 27, 2010, 7:12 AM
> right, but your problem is this is
> the current output:
> 
> Ковров -> Ковр
> Коврову -> Ковров
> Ковровом -> Ковров
> Коврове -> Ковров
> 
> so, if Ковров was simply left alone, all your forms
> would match...
> 
> 2010/7/27 Oleg Burlaca 
> 
> > Thanks Robert for all your help,
> >
> > The idea of [A-Z].* stopwords is ideal for the English language,
> > although in Russian nouns are inflected: Борис, Борису, Бориса, Борисом
> >
> > I'll try the RussianLightStemFilterFactory (the
> article in the PDF
> > mentioned
> > it's more accurate).
> >
> > Once again thanks,
> > Oleg Burlaca
> >
> > On Tue, Jul 27, 2010 at 12:07 PM, Robert Muir 
> wrote:
> >
> > > 2010/7/27 Oleg Burlaca 
> > >
> > > > Actually the situation with Немцов is ok,
> > > > I've just checked how Yandex works with
> Немцов and Немцова:
> > > > http://nano.yandex.ru/project/inflect/
> > > >
> > > > I think there are two solutions:
> > > > a) manually search for both Немцов and
> then Немцова
> > > > b) use wildcard query: Немцов*
> > > >
> > >
> > > Well, here is one idea of a more general
> solution.
> > > The problem with "protected words" is you must
> have a complete list.
> > >
> > > One idea would be to add a filter that protects
> any words from stemming
> > > that
> > > match a regular expression:
> > > In english maybe someone wants to avoid any
> capitalized words to reduce
> > > trouble: [A-Z].*
> > > in your case then some pattern like [A-Я].*ов
> might prevent problems.
> > >
> > >
> > > > Robert, thanks for the
> RussianLightStemFilterFactory info,
> > > > I've found this page
> > > >
> > http://www.mail-archive.com/solr-comm...@lucene.apache.org/msg06857.html
> > > > that somehow describes it. Where can I read
> more about
> > > > RussianLightStemFilterFactory ?
> > > >
> > > >
> > > Here is the link:
> > >
> > >
> > http://doc.rero.ch/lm.php?url=1000,43,4,20091209094227-CA/Dolamic_Ljiljana_-_Indexing_and_Searching_Strategies_for_the_Russian_20091209.pdf
> > >
> > >
> > > > Regards,
> > > > Oleg
> > > >
> > > > 2010/7/27 Oleg Burlaca 
> > > >
> > > > > A similar word is Немцов.
> > > > > The strange thing is that searching for
> "Немцова" will not find
> > > documents
> > > > > containing "Немцов"
> > > > >
> > > > > Немцова: 14 articles
> > > > >
> > > > >
> > > >
> > >
> > http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0
> > > > >
> > > > > Немцов: 74 articles
> > > > >
> > > > >
> > > >
> > >
> > http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Robert Muir
> > > rcm...@gmail.com
> > >
> >
> 
> 
> 
> -- 
> Robert Muir
> rcm...@gmail.com
>
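
As a sketch of the regex-protection idea Robert describes above - not code
from this thread, and assuming a later Lucene where stemmers honor
KeywordAttribute (the KeywordMarkerFilter mechanism; recent Lucene versions
ship essentially this as PatternKeywordMarkerFilter) - such a filter could
look like:

import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

/** Marks tokens matching a pattern as keywords so downstream stemmers skip them. */
public final class PatternProtectFilter extends TokenFilter {
  private final Pattern pattern;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final KeywordAttribute keywordAtt = addAttribute(KeywordAttribute.class);

  public PatternProtectFilter(TokenStream in, Pattern pattern) {
    super(in);
    this.pattern = pattern;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) return false;
    // CharTermAttribute is a CharSequence, so it can be matched directly.
    if (pattern.matcher(termAtt).matches()) {
      keywordAtt.setKeyword(true);  // e.g. Pattern.compile("[А-Я].*ов") leaves Ковров alone
    }
    return true;
  }
}

Placed before the stemmer in the analysis chain, any token matching the
pattern passes through unstemmed, so Ковров would be left intact and match
the inflected forms that stem back to it.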


Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

2010-07-27 Thread Lance Norskog
Should this go into the trunk, or does it only solve problems unique
to your use case?

On Tue, Jul 27, 2010 at 5:49 AM, Chantal Ackermann
 wrote:
> Hi Mitch,
>
> thanks for the code. Currently, I've got a different solution running
> but it's always good to have examples.
>
>> > I realized
>> > that I have to throw an exception and add the onError attribute to the
>> > entity to make that work.
>> >
>> I am curious:
>> Can you show how to make a method throwing an exception that is accepted by
>> the onError-attribute?
>
> the catch clause looks for "Exception" so it's actually easy. :-D
>
> Anyway, I've found a "cleaner" way. It is better to subclass the
> XPathEntityProcessor and put it in a state that prevents it from calling
> "initQuery" which triggers the dataSource.getData() call.
> I have overridden the initContext() method setting a go/no go flag that
> I am using in the overridden nextRow() to find out whether to delegate
> to the superclass or not.
>
> This way I can also avoid the code that fills the tmp field with an
> empty value if there is no value to query on.
>
> Cheers,
> Chantal
>
>



-- 
Lance Norskog
goks...@gmail.com
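
For readers of the archive, a minimal sketch of the subclassing approach
Chantal describes. The class name and the attribute carrying the variable are
assumptions (she mentions overriding initContext(), whose exact hook varies by
Solr version; init(Context) is used here), so treat this as an illustration,
not her actual code:

import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.XPathEntityProcessor;

/** Skips the sub-entity entirely when its query variable did not resolve. */
public class GuardedXPathEntityProcessor extends XPathEntityProcessor {
  private boolean active;

  @Override
  public void init(Context context) {
    super.init(context);
    // Assumption: the variable is interpolated into the entity's "url" attribute.
    String url = context.getResolvedEntityAttribute("url");
    // An unresolved ${...} placeholder survives as a literal, so treat it as "no value".
    active = url != null && !url.contains("${");
  }

  @Override
  public Map<String, Object> nextRow() {
    // Returning null ends this entity's rows, so dataSource.getData() is never triggered.
    return active ? super.nextRow() : null;
  }
}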


Re: Indexing Problem: Where's my data?

2010-07-27 Thread Lance Norskog
Solr respects case for field names.  Database fields are supplied in
lower-case, so it should be 'attribute_name' and 'string_value'. Also
'product_id', etc.

It is easier if you carefully emulate every detail in the examples,
for example lower-case names.

On Tue, Jul 27, 2010 at 2:59 PM, kenf_nc  wrote:
>
> for STRING_VALUE, I assume there is a property in the 'select *' results
> called string_value? if so I'm not sure why it wouldn't work. If not, then
> that's why, it doesn't have anything to put there.
>
> For ATTRIBUTE_NAME, is it possibly a case issue? you called it
> 'Attribute_Name' in your query, but ATTRIBUTE_NAME in your schema...just
> something to check I guess.
>
> Also, not sure why you are using name= in your fields, for example,
> 
> I thought 'column' was the source field name and 'name' was supposed to be
> the schema field name and if not there it would assume 'column' name. You
> don't have a schema field called "Parent Family" so it looks like it's
> defaulting to column name too which is lucky for you I suppose. But you may
> want to either remove 'name=' or make it match the schema. (and I may be
> completely wrong on this, it's been a while since I got DIH going).
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Indexing-Problem-Where-s-my-data-tp1000660p1000843.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


Re: slave index is bigger than master index

2010-07-27 Thread Lance Norskog
Ah! You have junk files piling up in the slave index directory. When
this happens, you may have to remove data/index entirely. I'm not sure
if Solr replication will handle that, or if you have to copy the whole
index to reset it.

You said the slaves time out- maybe the files are so large that the
master & slave need socket timeouts changed? In solrconfig.xml, these
two lines control that. Maybe they need to be increased.

5000
1
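
(The XML tags around these two values were stripped by the archive; in the
stock Solr 1.4 replication config they appear to correspond to the slave-side
httpConnTimeout and httpReadTimeout settings, in milliseconds - names assumed
from the ReplicationHandler documentation; the second value was truncated.)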


On Tue, Jul 27, 2010 at 3:59 AM, Peter Karich  wrote:
>
>> We have three dedicated servers for solr, two for slaves and one for master,
>> all with linux/debian packages installed.
>>
>> I understand that replication does always copies over the index in an exact
>> form as in master index directory (or it is supposed to do that at least),
>> and if the master index was optimized after indexing, one doesn't need to
>> run an optimize call again on master to optimize the slave's index. But in
>> our case thats what fixed it and I agree it is even more confusing now :s
>>
>
> That's why I said: try it on the slaves too ;-)
> In our case it also helped to shrink 2*index down to 1*index.
> I think the data necessary for the replication won't be cleaned up
> before the next replication or before an optimize.
> For us it was crucial to shrink the size because of limited
> disc resources and to make sure that the next
> replication does not increase the index to >3 times the initial size.
>
> @muneeb so I think optimization is not necessary - or do you have disc
> limitations too?
> @Hoss or others: does this explanation sound logical?
>
>> Another problem is, we are serving live services using slave nodes, so I
>> dont want to effect the live search, while playing with slave nodes'
>> indices.
>>
>
> What do you mean here? Optimizing is too CPU expensive?
>
>> We will be running the indexing on master node today over the night. Lets
>> see if it does it again.
>>
>
> Do you mean increase to double size?
>



-- 
Lance Norskog
goks...@gmail.com


Re: Solr 3.1 and ExtractingRequestHandler resulting in blank content

2010-07-27 Thread Lance Norskog
There are two different datasets that Solr (Lucene really) saves from
a document: raw storage and the indexed terms. I don't think the
ExtractingRequestHandler ever automatically stored the raw data; in
fact Lucene works in Strings internally, not raw byte arrays (this is
changing).

It should be indexed- that means if you search 'text' with a word from
the document, it will find those documents and bring back the file
name. Your app has to then use the file name.  Solr/Lucene is not
intended as a general-purpose content store, only an index.

The ERH wiki page doesn't quite say this. It describes what the ERH
does rather than what it does not do :)
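
As an aside, when the extracted text does need to come back stored, the usual
route is to map Tika's content into a stored field at extract time - a sketch
with assumed field names:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=content_stored&commit=true" -F "file=@some.pdf"

where content_stored is declared with stored="true" in schema.xml.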

On Mon, Jul 26, 2010 at 12:00 PM, David Thibault  wrote:
> Hello all,
>
> I’m working on a project with Solr.  I had 1.4.1 working OK using 
> ExtractingRequestHandler except that it was crashing on some PDFs.  I noticed 
> that Tika bundled with 1.4.1 was 0.4, which was kind of old.  I decided to 
> try updating to 0.7 as per the directions here: 
> http://wiki.apache.org/solr/ExtractingRequestHandler  but it was giving me 
> errors (I forget what they were specifically).
>
> Then I tried downloading Solr 3.1 from the source repository, which I noticed 
> came with Tika 0.7.  I figured this would be an easier route to get working.  
> Now I’m testing with 3.1 and 0.7 and I’m noticing my documents are going into 
> Solr OK, but they all have blank content (no document text stored in Solr).  
> I did see that the default “text” field is not stored. Changing that to 
> stored=true didn’t help.  Changing to 
> fmap.content=attr_content&uprefix=attr_content didn’t help either.  I have 
> attached all relevant info here.  Please let me know if someone sees 
> something I don’t (it’s entirely possible as I’m relatively new to Solr).
>
> Schema.xml:
>
> [The schema.xml excerpt that followed was mangled by the list archive: the
> XML tags were stripped, leaving only attribute fragments (field types with
> omitNorms/positionIncrementGap, plus stopword, synonym, and
> WordDelimiterFilter chains), and the message is truncated mid-attribute
> here.]

Re: Spellchecking and frequency

2010-07-27 Thread Erick Erickson
"Yonik's Law of Patches" reads: "A half-baked patch in Jira, with no
documentation, no tests and no backwards compatibilty is better than no
patch at all."

It'd be perfectly appropriate, IMO, for you to post an outline of what your
enhancements do over on the SOLR dev list and get a reaction from the folks
over there as to whether it should be a Jira or not... see
solr-...@lucene.apache.org

Best
Erick

On Tue, Jul 27, 2010 at 2:04 PM, Mark Holland wrote:

> Hi,
>
> I found the suggestions returned from the standard solr spellcheck not to
> be
> that relevant. By contrast, aspell, given the same dictionary and misspelled
> words, gives much more accurate suggestions.
>
> I therefore wrote an implementation of SolrSpellChecker that wraps jazzy,
> the java aspell library. I also extended the SpellCheckComponent to take
> the
> matrix of suggested words and query the corpus to find the first
> combination
> of suggestions which returned a match. This works well for my use case,
> where term frequency is irrelevant to spelling or scoring.
>
> I'd like to publish the code in case someone finds it useful (although it's
> a bit crude at the moment and will need a decent tidy up). Would it be
> appropriate to open up a Jira issue for this?
>
> Cheers,
> ~mark
>
> On 27 July 2010 09:33, dan sutton  wrote:
>
> > Hi,
> >
> > I've recently been looking into Spellchecking in solr, and was struck by
> > how
> > limited the usefulness of the tool was.
> >
> > Like most corpora, ours contains lots of different spelling mistakes for
> > the same word, so the 'spellcheck.onlyMorePopular' is not really that
> > useful
> > unless you click on it numerous times.
> >
> > I was thinking that since most of the time people spell words correctly
> why
> > was there no other frequency parameter that could enter into the score?
> > i.e.
> > something like:
> >
> > spell_score ~ edit_dist * freq
> >
> > I'm sure others have come across this issue and was wondering what
> > steps/algorithms they have used to overcome these limitations?
> >
> > Cheers,
> > Dan
> >
>
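
As an aside on dan's spell_score ~ edit_dist * freq idea quoted above, a toy
sketch of one way to blend the two signals (the exact weighting is an
assumption, not anything Solr ships):

/** Rank candidate corrections: closer edits and more frequent corpus terms score higher. */
static double spellScore(int editDistance, int corpusFrequency) {
  // Inverse distance so smaller edits win; log damps the raw frequency.
  return (1.0 / (1 + editDistance)) * Math.log(1 + corpusFrequency);
}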


Re: Tika, Solr running under Tomcat 6 on Debian

2010-07-27 Thread Lance Norskog
I would start over from the Solr 1.4.1 binary distribution and follow
the instructions on the wiki:

http://wiki.apache.org/solr/ExtractingRequestHandler

(Java classpath stuff is notoriously difficult, especially when
dynamically configured and loaded. I often cannot tell if Java cannot
load the class it prints, or if that class requires others.)

On Sat, Jul 24, 2010 at 11:21 PM, Tim AtLee  wrote:
> Hello
>
> I desperately hope someone can help me here...  I'm a bit out of my league
> here.
>
> I am trying to implement content extraction using Tika and Solr as part of a
> search package for a product I am using.  I have been successful in getting
> Solr to work so far as indexing text, and returning search results, however
> I am hitting a wall when I try to use Tika for content extraction.
>
> I add the following configuration to solrconfig.xml:
>   [XML stripped by the archive - a requestHandler with
>   class="org.apache.solr.handler.extraction.ExtractingRequestHandler" and a
>   defaults section whose only surviving value is "true".]
>
> During a test, I receive the following error:
> org.apache.solr.common.SolrException: Error loading class
> 'org.apache.solr.handler.extraction.ExtractingRequestHandler'
>
> The full text of this error is listed below.
>
> So, as I indicated in the subject line, I am using Debian linux Squeeze
> (testing).  Tomcat is at version 6.0.26 and is installed by apt.
>
> Solr is also installed from apt, and is at version:
> 1.4.0.2010.04.24.07.20.22.
>
> Java -version looks like this:
> java version "1.6.0_20"
> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
>
> The JDK is also at the same version, and also from apt.
>
> I have built Tika from source (nightly build) using mvn2, and placed
> the compiled jars in /lib.  /lib is located at /var/solr/site/lib, along
> with /var/solr/site/conf and /var/solr/site/data.  Hopefully this is the
> right place to put the jars.
>
> I also tried building solr from source (also the nightly build), and was
> able to get solr sort of working (not Tika).  I could run a single instance,
> but getting multiple instances running didn't seem to be in the cards.  I
> didn't pursue this any further.  If this is the route I should go down, if
> anyone can direct me on how to install a built Solr war and configure it so
> I can use multiple instances, I'll gladly try it out.
>
> I found a similar issue to mine at
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200911.mbox/,
> From that email, I tried copying the built Solr jars into the Solr site's
> lib directory, then realized that the likelihood of that working was pretty
> slim - jars built from a nightly build trying to work with a .war from 1.4.0
> was probably not going work.  As you might have guessed, it didn't.  This is
> when I tried building Solr from source (thinking that if all the Solr stuff
> was at the same revision, it might work).
>
> I have not tried all of this under Jetty.  It's my understanding that Jetty
> won't let me do multiple instances, and since this is a requirement for what
> I'm doing, I'm more or less constrained to Tomcat.
>
> I have also seen some other references to using OpenJDK instead of Sun JDK.
>  This resulted in the same error (don't recall the site where I saw this
> referenced).
>
> Any help would be greatly appreciated.  I am new to Tomcat and Solr, so I
> may have some dumb follow-up questions that will be googled thoroughly
> first.  Sorry in advance..
>
> Tim
>
> --
>
> -
> org.apache.solr.common.SolrException: Error loading class
> 'org.apache.solr.handler.extraction.ExtractingRequestHandler'
>        at
> org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373)
>        at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:414)
>        at
> org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:450)
>        at
> org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:152)
>        at org.apache.solr.core.SolrCore.(SolrCore.java:557)
>        at
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
>        at
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
>        at
> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:295)
>        at
> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:422)
>        at
> org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:115)
>        at
> org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3838)
>        at
> org.apache.catalina.core.StandardContext.start(StandardContext.java:4488)
>        at
> org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
>        at
> org.apache.ca

RE: How to 'filter' facet results

2010-07-27 Thread Jonathan Rochkind
> Is there a way to tell Solr to only return a specific set of facet values?  I
> feel like the facet query must be able to do this, but I'm not really
> understanding the facet query.  In my specific case, I'd like to only see 
> facet
> values for the same values I pass in as query filters, i.e. if I run this 
> query:
>fq=keyword:man OR keyword:bear OR keyword:pig
>facet=on
>facet.field=keyword

> then I only want it to return the facet counts for man, bear, and pig.  The
> resulting docs might have a number of different values for keyword, in 
> addition

For the general case of filtering facet values, I've wanted to do that too in 
more complex situations, and there is no good way I've found. 

For your very specific use case though, yeah, you can do it with facet.query.  
Leave out the facet.field, but instead:

facet.query=keyword:man
facet.query=keyword:bear
facet.query=keyword:pig

You'll get three facet.query results in the response, one each for man, bear, 
pig. 

Solr behind the scenes will kind of do three separate 'sub-queries', one for 
each facet.query, but since the query itself should be cached, you shouldn't 
notice much difference. Especially if you have a warming query that facets on 
the keyword field (I'm never entirely sure when caches created by warming 
queries will be used by a facet.query, or if it depends on the facet method in 
use, but it can't hurt). 

Jonathan
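
For reference, the same multiple-facet.query request in SolrJ - a minimal
sketch, with the server URL and field names as placeholders (1.4-era API):

import java.util.Map;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetQueryExample {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("*:*");
    q.addFilterQuery("keyword:man OR keyword:bear OR keyword:pig");
    q.setFacet(true);
    // One facet.query per value of interest; no facet.field needed.
    q.addFacetQuery("keyword:man");
    q.addFacetQuery("keyword:bear");
    q.addFacetQuery("keyword:pig");
    QueryResponse rsp = server.query(q);
    for (Map.Entry<String, Integer> e : rsp.getFacetQuery().entrySet()) {
      System.out.println(e.getKey() + " => " + e.getValue());
    }
  }
}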



How to 'filter' facet results

2010-07-27 Thread David Thompson
Is there a way to tell Solr to only return a specific set of facet values?  I 
feel like the facet query must be able to do this, but I'm not really 
understanding the facet query.  In my specific case, I'd like to only see facet 
values for the same values I pass in as query filters, i.e. if I run this query:
fq=keyword:man OR keyword:bear OR keyword:pig
facet=on
facet.field=keyword

then I only want it to return the facet counts for man, bear, and pig.  The 
resulting docs might have a number of different values for keyword, in addition 
to those specified in the filter because keyword is a multiValued field.  How 
can I tell it to only return the facet values for man, bear, and pig?  On the 
client side I could programmatically remove the other facets that I don't care 
about, except that the resulting docs could return hundreds of different 
values.  If I were faceting on a single value, I could say facet.prefix=man, 
and 
that would work, but mostly I need this to work for more than one filter value. 
 
Is there a way to set multiple facet.prefix values?  Any ideas?

-dKt



  

Re: Highlighting parameters wiki

2010-07-27 Thread Koji Sekiguchi

(10/07/27 23:16), Stephen Green wrote:

The wiki entry for hl.highlightMultiTerm:

http://wiki.apache.org/solr/HighlightingParameters#hl.highlightMultiTerm

doesn't appear to be correct.  It says:

If the SpanScorer is also being used, enables highlighting for
range/wildcard/fuzzy/prefix queries. Default is false.

But the code in DefaultSolrHighlighter (both on the 1.4 branch that
I'm using and in the trunk) does:

  Boolean highlightMultiTerm =
      request.getParams().getBool(HighlightParams.HIGHLIGHT_MULTI_TERM, true);
  if (highlightMultiTerm == null) {
    highlightMultiTerm = false;
  }

which looks to me like it's going to default to true, since
getBool will never return null: if the parameter is absent internally,
getBool returns the supplied default, which here is true.

Shall I file a Jira on this one?  Perhaps it's easier just to fix the Wiki page?

Steve
   

Hi Steve,

Please just fix the wiki page. Thank you for reporting this!

Koji

--
http://www.rondhuit.com/en/



Re: Indexing Problem: Where's my data?

2010-07-27 Thread kenf_nc

for STRING_VALUE, I assume there is a property in the 'select *' results
called string_value? if so I'm not sure why it wouldn't work. If not, then
that's why, it doesn't have anything to put there.

For ATTRIBUTE_NAME, is it possibly a case issue? you called it
'Attribute_Name' in your query, but ATTRIBUTE_NAME in your schema...just
something to check I guess.

Also, not sure why you are using name= in your fields, for example, 

 
I thought 'column' was the source field name and 'name' was supposed to be
the schema field name and if not there it would assume 'column' name. You
don't have a schema field called "Parent Family" so it looks like it's
defaulting to column name too which is lucky for you I suppose. But you may
want to either remove 'name=' or make it match the schema. (and I may be
completely wrong on this, it's been a while since I got DIH going).


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-Problem-Where-s-my-data-tp1000660p1000843.html
Sent from the Solr - User mailing list archive at Nabble.com.
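
(In DIH terms: a field element maps a result-set column to a schema field,
e.g., with hypothetical names, <field column="PARENT_FAMILY"
name="parent_family"/> - 'column' must match the SQL result set and 'name'
must match a field declared in schema.xml.)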


RE: Querying throws java.util.ArrayList.RangeCheck

2010-07-27 Thread Manepalli, Kalyan
Yonik,
One more update on this. I used the filter query that was throwing the 
error to delete a subset of results. 
After that the queries started working correctly, 
which indicates that the particular docId was present in the index somewhere, 
but Lucene was not able to find it.

-Kalyan


-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Tuesday, July 27, 2010 4:46 PM
To: solr-user@lucene.apache.org
Subject: Re: Querying throws java.util.ArrayList.RangeCheck

I haven't been able to reproduce anything...
But if you guys are sure you're not running any custom code, then
there definitely seems to be a bug somewhere.

Can anyone reproduce this in something you can share?

-Yonik
http://www.lucidimagination.com


Re: Querying throws java.util.ArrayList.RangeCheck

2010-07-27 Thread Yonik Seeley
I haven't been able to reproduce anything...
But if you guys are sure you're not running any custom code, then
there definitely seems to be a bug somewhere.

Can anyone reproduce this in something you can share?

-Yonik
http://www.lucidimagination.com


Indexing Problem: Where's my data?

2010-07-27 Thread Michael Griffiths
Hi,

(The first version of this was rejected for spam).

I'm setting up a test instance of Solr, and keep running into the problem of 
having Solr not work the way I think it should work. Specifically, the data I 
want to go into the index isn't there after indexing. I'm extracting the data 
from MSSQL via DataImportHandler, JDBC 4.0.

My data is set up that for every product ID there is one category 
(hierarchical, but I'm not dealing with that ATM), a family, and a set of 
attributes (which includes name, etc). After indexing, I get Category, Family, 
and Product ID - but nothing from my attribute values (STRING_NAME, below) - 
which is the most useful data.

Is there something wrong with my schema?

I thought it might be that the schema.xml file wasn't respecting the names I 
assigned via the DataImportHandler; when I changed to the column names in the 
schema.xml, I picked up Family and Category (previously, it was only product 
ID).

I'm really banging my head against the wall at this point, so I'd appreciate 
any help. My next step will probably be to do a considerably more complicated 
denormalization (in terms of the SQL), which would make the Solr end simpler 
(but that has problems of its own).

Config information below.

Any help appreciated.

Thanks,
Michael

Data Config:

[The data-config.xml was stripped by the list archive.]

Schema:

[The schema.xml was likewise stripped; the only values that survive are
Product_ID (apparently the uniqueKey) and text (apparently the default
search field).]

min/max, StatsComponent, performance

2010-07-27 Thread Jonathan Rochkind
I thought I asked a variation of this before, but I don't see it on the 
list, apologies if this is a duplicate, but I have new questions.


So I need to find the min and max value of a result set. Which can be 
several million documents. One way to do this is the StatsComponent.


One problem is that I'm having performance problems with StatsComponent 
across so many documents, adding the stats component on the field I'm 
interested in is adding 10s to my query response time.


So one question is if there's any way to increase StatsComponent 
performance. Does it use any caches, or does it operate without caches?  
My Solr is running near the top of its heap size, although I'm not 
currently getting any OOM errors, perhaps not enough free memory is 
somehow hurting StatsComponent performance. Or any other ideas for 
increasing StatsComponent performance?


But it also occurs to me that the StatsComponent is doing a lot more 
than I need. I just need min/max. And the cardinality of this field is a 
couple orders of magnitude lower than the total number of documents. But 
StatsComponent is also doing a bunch of other things, like sum, median, 
etc.  Perhaps if there were a way to _just_ get min/max, it would be 
faster. Is there any way to get min/max values in a result set other 
than StatsComponent?


Jonathan
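
For reference, requesting just the stats from SolrJ looks roughly like this
minimal sketch (server URL and field name are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FieldStatsInfo;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MinMaxExample {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("some query");
    q.setRows(0);                        // only the stats are wanted, not the docs
    q.setGetFieldStatistics("myfield");  // adds stats=true&stats.field=myfield
    QueryResponse rsp = server.query(q);
    FieldStatsInfo stats = rsp.getFieldStatsInfo().get("myfield");
    System.out.println("min=" + stats.getMin() + " max=" + stats.getMax());
  }
}

(One StatsComponent-free alternative: run the query twice with rows=1, sorted
ascending and then descending on the field, and read the value off the single
returned document.)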


Re: Difficulties with Highlighting

2010-07-27 Thread Nathaniel Grove

Erik,

You're right on both counts. I'll upgrade and then check into whether 
our tokenizer is working properly.


Thanks,

Than

Erik Hatcher wrote:

Than -

Looks like maybe your text_bo field type isn't analyzing how you'd 
like?   Though that's just a hunch.  I pasted the value of that field 
returned in the link you provided into your analysis.jsp page and it 
chunked tokens by whitespace.  Though I could be experiencing a 
copy/paste/i18n issue.


Also looks like you're on Solr 1.3 - so it's likely quite worth 
upgrading to 1.4.1 (don't know if that directly affects this 
highlighting issue, just a general recommendation).


Erik

On Jul 27, 2010, at 3:43 PM, Nathaniel Grove wrote:

I'm a relative beginner at SOLR, indexing and searching Unicode 
Tibetan texts. I am trying to use the highlighter, but it just 
returns empty elements, such as:


  
  
  

What am I doing wrong?

The query that generated that is:

http://www.thlib.org:8080/thdl-solr/thdl-texts/select?indent=on&version=2.2&q=%E0%BD%91%E0%BD%84%E0%BD%B4%E0%BD%A3%E0%BC%8B%E0%BD%98%E0%BD%81%E0%BD%93%E0%BC%8B+AND+type%3Atext&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&hl=true&hl.fl=pg_bo&hl.snippets=50 



The hit is in the multivalued field named "pg_bo" and in a doc with 
that id #. I've looked at the various highlighting parameters (not 
that I fully understand them) and tried fiddling with those but 
nothing helped. I did notice that if you change the hl.fl=*. Then you 
get the type field highlighted:



  
 
  text
  
  


But that's not much help. We are using a custom Tibetan tokenizer for 
the Unicode Tibetan text fields. Would this have something to do with 
it?


Any suggestions would be appreciated!

Thanks for your help,

Than Grove

--
Nathaniel Grove
Research Associate & Technical Director
Tibetan & Himalayan Library
University of Virginia
http://www.thlib.org






--
Nathaniel Grove
Research Associate & Technical Director
Tibetan & Himalayan Library
University of Virginia
http://www.thlib.org



Re: Querying throws java.util.ArrayList.RangeCheck

2010-07-27 Thread Jason Ronallo
I am getting a similar error with today's nightly build:

HTTP Status 500 - Index: 54, Size: 24
java.lang.IndexOutOfBoundsException: Index: 54, Size: 24 at
java.util.ArrayList.RangeCheck(ArrayList.java:547) at
java.util.ArrayList.get(ArrayList.java:322) at
org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:264) at

I'm adding and deleting a batch of documents. Currently during
indexing for each document there is a commit. In some cases the
document is deleted just before it is added with a commit for the
delete and a commit for the add.

It appears that if I wait to commit until the end of all indexing, I
avoid this error.

Jason

On Tue, Jul 27, 2010 at 10:25 AM, Manepalli, Kalyan
 wrote:
> Hi Yonik,
> I am using Solr 1.4 release dated Feb-9 2010. There is no custom code. I am 
> using regular out of box dismax requesthandler.
> The query is a simple one with 4 filter queries (fq's) and one sort query.
> During the index generation, I delete a set of rows based on date filter, 
> then add new rows to the index. Then another process queries the index and 
> generates some stats and updates the index again. Not sure if during this 
> process something is going wrong with the index.
>
> Thanks
> Kalyan
>
> -Original Message-
> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
> Sent: Tuesday, July 27, 2010 12:15 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Querying throws java.util.ArrayList.RangeCheck
>
> Do you have any custom code, or is this stock solr (and which version,
> and what is the request)?
>
> -Yonik
> http://www.lucidimagination.com
>
> On Tue, Jul 27, 2010 at 12:30 AM, Manepalli, Kalyan
>  wrote:
>> Hi,
>>   I am stuck at this weird problem during querying. While querying the solr 
>> index I am getting the following error.
>> Index: 52, Size: 16 java.lang.IndexOutOfBoundsException: Index: 52, Size: 16 
>> at java.util.ArrayList.RangeCheck(ArrayList.java:547) at 
>> java.util.ArrayList.get(ArrayList.java:322) at 
>> org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:288) at 
>> org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:217) at 
>> org.apache.lucene.index.SegmentReader.document(SegmentReader.java:948) at 
>> org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:506) 
>> at org.apache.lucene.index.IndexReader.document(IndexReader.java:947) at 
>> org.apache.solr.search.SolrIndexReader.document(SolrIndexReader.java:444) at
>>
>> During debugging I found that the SolrIndexReader is trying to read a 
>> document which doesnt exist in the index.
>> I tried optimizing the index and restarting the server but still no luck.
>>
>> Any help in resolving this issue will be appreciated.
>>
>> Thanks
>> Kalyan
>


Re: Difficulties with Highlighting

2010-07-27 Thread Erik Hatcher

Than -

Looks like maybe your text_bo field type isn't analyzing how you'd  
like?   Though that's just a hunch.  I pasted the value of that field  
returned in the link you provided into your analysis.jsp page and it  
chunked tokens by whitespace.  Though I could be experiencing a 
copy/paste/i18n issue.


Also looks like you're on Solr 1.3 - so it's likely quite worth  
upgrading to 1.4.1 (don't know if that directly affects this  
highlighting issue, just a general recommendation).


Erik

On Jul 27, 2010, at 3:43 PM, Nathaniel Grove wrote:

I'm a relative beginner at SOLR, indexing and searching Unicode  
Tibetan texts. I am trying to use the highlighter, but it just  
returns empty elements, such as:


  
  
  

What am I doing wrong?

The query that generated that is:

http://www.thlib.org:8080/thdl-solr/thdl-texts/select?indent=on&version=2.2&q=%E0%BD%91%E0%BD%84%E0%BD%B4%E0%BD%A3%E0%BC%8B%E0%BD%98%E0%BD%81%E0%BD%93%E0%BC%8B+AND+type%3Atext&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&hl=true&hl.fl=pg_bo&hl.snippets=50

The hit is in the multivalued field named "pg_bo" and in a doc with  
that id #. I've looked at the various highlighting parameters (not  
that I fully understand them) and tried fiddling with those but  
nothing helped. I did notice that if you change the hl.fl=*. Then  
you get the type field highlighted:



  
 
  text
  
  


But that's not much help. We are using a custom Tibetan tokenizer  
for the Unicode Tibetan text fields. Would this have something to do  
with it?


Any suggestions would be appreciated!

Thanks for your help,

Than Grove

--
Nathaniel Grove
Research Associate & Technical Director
Tibetan & Himalayan Library
University of Virginia
http://www.thlib.org





Re: SolrCore has a large number of SolrIndexSearchers retained in "infoRegistry"

2010-07-27 Thread skommuri

Thank you very much Hoss for the reply.

I am using the embedded mode (SolrServer). I am not explicitly accessing
SolrIndexSearcher. I am explicitly closing the SolrCore after the request
has been processed.

Although I did notice that I am using a SolrQueryRequest object that is not
explicitly getting closed. I will test that out and will let you know.
Thanks again!


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCore-has-a-large-number-of-SolrIndexSearchers-retained-in-infoRegistry-tp483900p1000472.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrCore has a large number of SolrIndexSearchers retained in "infoRegistry"

2010-07-27 Thread Ken Krugler


On Jul 27, 2010, at 12:21pm, Chris Hostetter wrote:


:
: I was wondering if anyone has found any resolution to this email  
thread?


As Grant asked in his reply when this thread was first started  
(December 2009)...



It sounds like you are either using embedded mode or you have some
custom code.  Are you sure you are releasing your resources  
correctly?


...there was no response to his question for clarification.

the problem, given the info we have to work with, definitely seems to be
that the custom code utilizing the SolrCore directly is not releasing the
resources that it is using in every case.

if you are calling the execute method, that means you have a
SolrQueryRequest object -- which means you somehow got an instance of
a SolrIndexSearcher (every SolrQueryRequest has one associated with it)
and you are somehow not releasing that SolrIndexSearcher (probably because
you are not calling close() on your SolrQueryRequest)


One thing that bit me previously with using APIs in this area of Solr  
is that if you call CoreContainer.getCore(), this increments the open  
count, so you have to balance each getCore() call with a close() call.


The naming here could be better - I think it's common to have an  
expectation that calls to get something don't change any state. Maybe  
openCore()?


-- Ken


But it really all depends on how you got ahold of that
SolrQueryRequest/SolrIndexSearcher pair in the first place ... every
method in SolrCore that gives you access to a SolrIndexSearcher is
documented very clearly on how to "release" it when you are done with it
so the ref count can be decremented.


-Hoss



Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Difficulties with Highlighting

2010-07-27 Thread Nathaniel Grove
I'm a relative beginner at SOLR, indexing and searching Unicode Tibetan 
texts. I am trying to use the highlighter, but it just returns empty 
elements, such as:


   
   
   

What am I doing wrong?

The query that generated that is:

http://www.thlib.org:8080/thdl-solr/thdl-texts/select?indent=on&version=2.2&q=%E0%BD%91%E0%BD%84%E0%BD%B4%E0%BD%A3%E0%BC%8B%E0%BD%98%E0%BD%81%E0%BD%93%E0%BC%8B+AND+type%3Atext&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&hl=true&hl.fl=pg_bo&hl.snippets=50

The hit is in the multivalued field named "pg_bo" and in a doc with that 
id #. I've looked at the various highlighting parameters (not that I 
fully understand them) and tried fiddling with those but nothing helped. 
I did notice that if you change the hl.fl=*. Then you get the type field 
highlighted:



   
  
   text
   
   


But that's not much help. We are using a custom Tibetan tokenizer for 
the Unicode Tibetan text fields. Would this have something to do with it?


Any suggestions would be appreciated!

Thanks for your help,

Than Grove

--
Nathaniel Grove
Research Associate & Technical Director
Tibetan & Himalayan Library
University of Virginia
http://www.thlib.org



Re: help finding illegal chars in XML doc

2010-07-27 Thread Chris Hostetter

: Thanks for your reply. I could not find in the log files any mention to
: that.  By the way I only have _MM_DD.request.log files in my directory.
: 
: Do I have to enable any specific log or level to catch those errors?

if you are using the "java -jar start.jar" command for the example jetty 
instance, then the log messages I'm referring to are written directly to your 
console.  if you are running solr in some other servlet container, 
then it all depends on the servlet container...

http://wiki.apache.org/solr/SolrLogging
http://wiki.apache.org/solr/LoggingInDefaultJettySetup



-Hoss



Re: SolrCore has a large number of SolrIndexSearchers retained in "infoRegistry"

2010-07-27 Thread Chris Hostetter
: 
: I was wondering if anyone has found any resolution to this email thread?

As Grant asked in his reply when this thread was first started (December 
2009)...

>> It sounds like you are either using embedded mode or you have some 
>> custom code.  Are you sure you are releasing your resources correctly?

...there was no response to his question for clarification.

the problem, given the info we have to work with, definitely seems to be 
that the custom code utilizing the SolrCore directly is not releasing the 
resources that it is using in every case.

if you are calling the execute method, that means you have a 
SolrQueryRequest object -- which means you somehow got an instance of 
a SolrIndexSearcher (every SolrQueryRequest has one associated with it) 
and you are somehow not releasing that SolrIndexSearcher (probably because 
you are not calling close() on your SolrQueryRequest)

But it really all depends on how you got ahold of that 
SolrQueryRequest/SolrIndexSearcher pair in the first place ... every 
method in SolrCore that gives you access to a SolrIndexSearcher is 
documented very clearly on how to "release" it when you are done with it 
so the ref count can be decremented.


-Hoss



Re: Timeout in distributed search

2010-07-27 Thread Chris Hostetter

:   Is there any way to have timeout support in distributed search? I 
: searched https://issues.apache.org/jira/browse/SOLR-502 but looks it is 
: not in main release of solr1.4

note that issue is marked "Fix Version/s: 1.3" ... that means it 
was fixed in Solr 1.3, well before 1.4 came out.

You should also take a look at the functionality added in SOLR-850, which 
explicitly deals with hard timeouts in distributed searching...

https://issues.apache.org/jira/browse/SOLR-850

...that was first included in Solr 1.4



-Hoss
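
(For the record, the SOLR-502-style timeout is exposed as the timeAllowed
request parameter, in milliseconds - e.g. q=ipod&timeAllowed=1000 - and when
the limit is hit the response header reports partialResults=true.)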



RE: Spellchecking and frequency

2010-07-27 Thread Dyer, James
Mark,

I'd like to see your code if you open a JIRA for this.  I recently
opened SOLR-2010 with a patch that does something similar to the second
part only of what you describe (find combinations that actually return a
match).  But I'm not sure if my approach is the best one so I would like
to see yours to compare.

James Dyer
E-Commerce Systems
Ingram Book Company
(615) 213-4311

-Original Message-
From: Mark Holland [mailto:mark.holl...@zoopla.co.uk] 
Sent: Tuesday, July 27, 2010 1:04 PM
To: solr-user@lucene.apache.org
Subject: Re: Spellchecking and frequency

Hi,

I found the suggestions returned from the standard solr spellcheck not
to be
that relevant. By contrast, aspell, given the same dictionary and
misspelled
words, gives much more accurate suggestions.

I therefore wrote an implementation of SolrSpellChecker that wraps
jazzy,
the java aspell library. I also extended the SpellCheckComponent to take
the
matrix of suggested words and query the corpus to find the first
combination
of suggestions which returned a match. This works well for my use case,
where term frequency is irrelevant to spelling or scoring.

I'd like to publish the code in case someone finds it useful (although
it's
a bit crude at the moment and will need a decent tidy up). Would it be
appropriate to open up a Jira issue for this?

Cheers,
~mark

On 27 July 2010 09:33, dan sutton  wrote:

> Hi,
>
> I've recently been looking into Spellchecking in solr, and was struck
by
> how
> limited the usefulness of the tool was.
>
> Like most corpora, ours contains lots of different spelling mistakes
for
> the same word, so the 'spellcheck.onlyMorePopular' is not really that
> useful
> unless you click on it numerous times.
>
> I was thinking that since most of the time people spell words
correctly why
> was there no other frequency parameter that could enter into the
score?
> i.e.
> something like:
>
> spell_score ~ edit_dist * freq
>
> I'm sure others have come across this issue and was wondering what
> steps/algorithms they have used to overcome these limitations?
>
> Cheers,
> Dan
>


Re: Total number of terms in an index?

2010-07-27 Thread Michael McCandless
In trunk (flex) you can ask each segment for its unique term count.

But to compute the unique term count across all segments is
necessarily costly (requires merging them, to de-dup), as Hoss
described.

Mike

On Tue, Jul 27, 2010 at 12:27 PM, Burton-West, Tom  wrote:
> Hi Jason,
>
> Are you looking for the total number of unique terms or total number of term 
> occurrences?
>
> Checkindex reports both, but does a bunch of other work so is probably not 
> the fastest.
>
> If you are looking for total number of term occurrences, you might look at 
> contrib/org/apache/lucene/misc/HighFreqTerms.java.
>
> If you are just looking for the total number of unique terms, I wonder if 
> there is some low level API that would allow you to just access the in-memory 
> representation of the tii file and then multiply the number of terms in it by 
> your indexDivisor (default 128). I haven't dug in to the code so I don't 
> actually know how the tii file gets loaded into a data structure in memory.  
> If there is api access, it seems like this might be the quickest way to get 
> the number of unique terms.  (Of course you would have to do this for each 
> segment).
>
> Tom
> -Original Message-
> From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
> Sent: Monday, July 26, 2010 8:39 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Total number of terms in an index?
>
>
> : Sorry, like the subject, I mean the total number of terms.
>
> it's not stored anywhere, so the only way to fetch it is to actually
> iterate all of the terms and count them (that's why LukeRequestHandler is
> so slow to compute this particular value)
>
> If i remember right, someone mentioned at one point that flex would let
> you store data about stuff like this in your index as part of the segment
> writing, but frankly i'm still not sure how that will help -- because
> unless your index is fully optimized, you still have to iterate the terms
> in each segment to 'de-dup' them.
>
>
> -Hoss
>
>


RE: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

2010-07-27 Thread David Thibault
Alessandro & all,

I was having the same issue with Tika crashing on certain PDFs.  I also noticed 
the bug where no content was extracted after upgrading Tika.  

When I went to the SOLR issue you link to below, I applied all the patches, 
downloaded the Tika 0.8 jars, restarted tomcat, posted a file via curl, and got 
the following error:
SEVERE: java.lang.NoSuchMethodError: 
org.apache.solr.core.SolrResourceLoader.getClassLoader()Ljava/lang/ClassLoader;
at 
org.apache.solr.handler.extraction.ExtractingRequestHandler.inform(ExtractingRequestHandler.java:93)
at 
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:244)
at 
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at 
org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:859)
at 
org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:579)
at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1555)
at java.lang.Thread.run(Thread.java:619)

This is really weird because I DID apply the SolrResourceLoader patch that adds 
the getClassLoader method.  I even verified by opening up the JARs and 
looking at the class file in Eclipse...I can see the 
SolrResourceLoader.getClassLoader() method.  

Does anyone know why it can't find the method?  After patching the source I did 
ant clean dist in the base directory of the Solr source tree and everything 
looked like it compiles (BUILD SUCCESSFUL).  Then I copied all the jars from 
dist/ and all the library dependencies from contrib/extraction/lib/ into my 
SOLR_HOME. Restarting tomcat, everything in the logs looked good.

I'm stumped.  It would be very nice to have a Solr implementation using the 
newest versions of PDFBox & Tika and actually have content being extracted...=)

Best,
Dave
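
(Worth checking in cases like this: a NoSuchMethodError at runtime usually
means an older copy of the class is being loaded first - e.g. the unpatched
solr-core jar inside the deployed solr.war's WEB-INF/lib, or elsewhere on
Tomcat's classpath, shadowing the rebuilt jars.)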


-Original Message-
From: Alessandro Benedetti [mailto:benedetti.ale...@gmail.com] 
Sent: Tuesday, July 27, 2010 6:09 AM
To: solr-user@lucene.apache.org
Subject: Re: Extracting PDF text/comment/callout/typewriter boxes with Solr 
CELL/Tika/PDFBox

Hi Jon,
During the last days we faced the same problem.
Using Solr 1.4.1 classic (Tika 0.4), from some PDF files we can't extract
content, and from others Solr throws an exception during the indexing
process.
You must:
Update the Tika libraries (in /contrib/extraction/lib) with the tika-core 0.8
snapshot and tika-parsers 0.8.
Update PDFBox and all related libraries.
After that you have to patch Solr 1.4.1 following this patch:
https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.ext.subversion%3Asubversion-commits-tabpanel
This is the first way to solve the problem.

Using Solr 1.4.1 (with the Tika 0.8 snapshot and PDFBox updated) no exception is
thrown during the indexing process, but no content is extracted.
Using the latest Solr trunk (with the Tika 0.8 snapshot and PDFBox updated) all
sounds good, but we don't know how stable it is!
I hope you now have a clear vision of this issue,
Best Regards



2010/7/26 Sharp, Jonathan 

>
> Every so often I need to index new batches of scanned PDFs and occasionally
> Adobe's OCR can't recognize the text in a couple of these documents. In
> these situations I would like to type in a small amount of text onto the
> document and have it be extracted by Solr CELL.
>
> Adobe Pro 9 has a number of different ways to add text directly to a PDF
> file:
>
> *Typewriter
> *Sticky Note
> *Callout boxes
> *Text boxes
>
> I tried indexing documents with each of these text additions with Solr
> 1.4.1 + Solr CELL but can't extract the text in any of these boxes.
>
> If someone has modified their Solr CELL installation to use more recent
> versions of Tika (above 0.4) or PDFBox (above 0.7.3) and/or can can comment
> on whether newer versions can pull the text out of any of these various text
> boxes I'd appreciate that very much.
>
> -Jon
>
>
>
>

Re: Spellchecking and frequency

2010-07-27 Thread Mark Holland
Hi,

I found the suggestions returned from the standard solr spellcheck not to be
that relevant. By contrast, aspell, given the same dictionary and misspelled
words, gives much more accurate suggestions.

I therefore wrote an implementation of SolrSpellChecker that wraps jazzy,
the java aspell library. I also extended the SpellCheckComponent to take the
matrix of suggested words and query the corpus to find the first combination
of suggestions which returned a match. This works well for my use case,
where term frequency is irrelevant to spelling or scoring.

I'd like to publish the code in case someone finds it useful (although it's
a bit crude at the moment and will need a decent tidy up). Would it be
appropriate to open up a Jira issue for this?

Cheers,
~mark

On 27 July 2010 09:33, dan sutton  wrote:

> Hi,
>
> I've recently been looking into Spellchecking in solr, and was struck by
> how
> limited the usefulness of the tool was.
>
> Like most corpora, ours contains lots of different spelling mistakes for
> the same word, so the 'spellcheck.onlyMorePopular' is not really that
> useful
> unless you click on it numerous times.
>
> I was thinking that since most of the time people spell words correctly why
> was there no other frequency parameter that could enter into the score?
> i.e.
> something like:
>
> spell_score ~ edit_dist * freq
>
> I'm sure others have come across this issue and was wondering what
> steps/algorithms they have used to overcome these limitations?
>
> Cheers,
> Dan
>


does this indicate a commit happened for every add?

2010-07-27 Thread Robert Petersen
I'm adding lots of small docs with several threads to solr and the adds
start fast but then slow down.  I didn't do any explicit commits and
autocommit is turned off but the logs show lots of commit activity on
this core and restarting this solr core logged the below.  Where did all
these commits come from, the exact same number as my adds?  I'm
stumped...

Jul 27, 2010 10:07:17 AM org.apache.solr.update.DirectUpdateHandler2
close
INFO: closed
DirectUpdateHandler2{commits=456389,autocommits=0,optimizes=0,rollbacks=0,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=456393,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0}


SpatialSearch: sorting by distance

2010-07-27 Thread Pavel Minchenkov
Hi,

I'm trying to sort by distance like this:

sort=dist(2,lat,lon,55.755786,37.617633) asc

In general the results are sorted, but some documents are not in the right order.
I'm using DistanceUtils.getDistanceMi(...) from Lucene spatial to calculate the
real distance after reading documents from Solr.

Solr version from trunk.





Thanks.

-- 
Pavel Minchenkov


RE: Total number of terms in an index?

2010-07-27 Thread Burton-West, Tom
Hi Jason,

Are you looking for the total number of unique terms or total number of term 
occurrences?

Checkindex reports both, but does a bunch of other work so is probably not the 
fastest.

If you are looking for total number of term occurrences, you might look at 
contrib/org/apache/lucene/misc/HighFreqTerms.java.
 
If you are just looking for the total number of unique terms, I wonder if there 
is some low level API that would allow you to just access the in-memory 
representation of the tii file and then multiply the number of terms in it by 
your indexDivisor (default 128). I haven't dug in to the code so I don't 
actually know how the tii file gets loaded into a data structure in memory.  If 
there is api access, it seems like this might be the quickest way to get the 
number of unique terms.  (Of course you would have to do this for each segment).

Tom
-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Monday, July 26, 2010 8:39 PM
To: solr-user@lucene.apache.org
Subject: Re: Total number of terms in an index?


: Sorry, like the subject, I mean the total number of terms.

it's not stored anywhere, so the only way to fetch it is to actually 
iterate all of the terms and count them (that's why LukeRequestHandler is 
so slow to compute this particular value)

If i remember right, someone mentioned at one point that flex would let 
you store data about stuff like this in your index as part of the segment 
writing, but frankly i'm still not sure how that will help -- because 
unless your index is fully optimized, you still have to iterate the terms 
in each segment to 'de-dup' them.


-Hoss



Re: java "GC overhead limit exceeded"

2010-07-27 Thread Text Analysis
Look into -XX:-UseGCOverheadLimit

On 7/26/10, Jonathan Rochkind  wrote:
> I am now occasionally getting a Java "GC overhead limit exceeded" error
> in my Solr. This may or may not be related to recently adding much
> better (and more) warming queries.
>
> I can get it when trying a 'commit', after deleting all documents in my
> index, or in other cases.
>
> Anyone run into this, and have suggestions as to how to set my java
> options to eliminate?  I'm not sure this simply means that my heap size
> needs to be bigger, it seems to be something else.
>
> Any advice appreciated. Googling didn't get me much I trusted.
>
> Jonathan
>

-- 
Sent from my mobile device


Is it possible to get keyword/match's position?

2010-07-27 Thread Ryan Chan
According to SO:
http://stackoverflow.com/questions/1557616/retrieving-per-keyword-field-match-position-in-lucene-solr-possible

The answer there was that it is not possible, but that was a year ago. Is it still true now?

Thanks.


RE: Querying throws java.util.ArrayList.RangeCheck

2010-07-27 Thread Manepalli, Kalyan
Hi Yonik,
I am using the Solr 1.4 release dated Feb-9 2010. There is no custom code. I am 
using the regular out-of-the-box dismax request handler.
The query is a simple one with 4 filter queries (fq's) and one sort query. 
During the index generation, I delete a set of rows based on a date filter, then 
add new rows to the index. Then another process queries the index and generates 
some stats and updates the index again. Not sure if during this process 
something is going wrong with the index.

Thanks
Kalyan

-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Tuesday, July 27, 2010 12:15 AM
To: solr-user@lucene.apache.org
Subject: Re: Querying throws java.util.ArrayList.RangeCheck

Do you have any custom code, or is this stock solr (and which version,
and what is the request)?

-Yonik
http://www.lucidimagination.com

On Tue, Jul 27, 2010 at 12:30 AM, Manepalli, Kalyan
 wrote:
> Hi,
>   I am stuck at this weird problem during querying. While querying the solr 
> index I am getting the following error.
> Index: 52, Size: 16 java.lang.IndexOutOfBoundsException: Index: 52, Size: 16
> at java.util.ArrayList.RangeCheck(ArrayList.java:547)
> at java.util.ArrayList.get(ArrayList.java:322)
> at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:288)
> at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:217)
> at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:948)
> at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:506)
> at org.apache.lucene.index.IndexReader.document(IndexReader.java:947)
> at org.apache.solr.search.SolrIndexReader.document(SolrIndexReader.java:444)
> at ...
>
> During debugging I found that the SolrIndexReader is trying to read a 
> document which doesn't exist in the index.
> I tried optimizing the index and restarting the server but still no luck.
>
> Any help in resolving this issue will be appreciated.
>
> Thanks
> Kalyan


RE: Spellcheck help

2010-07-27 Thread Dyer, James
If you could, let me know how your testing goes with this change.  I too am 
interested in having the Collate work as well as it can.  It looks like the 
code would be better with this change but then again I don't know what the 
original author was thinking when this was put in.

James Dyer
E-Commerce Systems
Ingram Book Company
(615) 213-4311

-Original Message-
From: Marc Ghorayeb [mailto:dekay...@hotmail.com] 
Sent: Tuesday, July 27, 2010 8:07 AM
To: solr-user@lucene.apache.org
Subject: RE: Spellcheck help


Thanks for the input, i'll check it out!
Marc

> Subject: RE: Spellcheck help
> Date: Fri, 23 Jul 2010 13:12:04 -0500
> From: james.d...@ingrambook.com
> To: solr-user@lucene.apache.org
> 
> In org.apache.solr.spelling.SpellingQueryConverter, find the line (#84):
> 
> final static String PATTERN = "(?:(?!(" + NMTOKEN + 
> ":|\\d+)))[\\p{L}_\\-0-9]+";
> 
> and remove the |\\d+ to make it:
> 
> final static String PATTERN = "(?:(?!" + NMTOKEN + ":))[\\p{L}_\\-0-9]+";
> 
> My testing shows this solves your problem.  The caution is to test it against 
> all your use cases because obviously someone thought we should ignore leading 
> digits from keywords.  Surely there's a reason, although I can't think of 
> it.
> 
> James Dyer
> E-Commerce Systems
> Ingram Book Company
> (615) 213-4311
> 
> -Original Message-
> From: dekay...@hotmail.com [mailto:dekay...@hotmail.com] 
> Sent: Saturday, July 17, 2010 12:41 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Spellcheck help
> 
> Can anybody help me with this? :(
> 
> -Original Message- 
> From: Marc Ghorayeb
> Sent: Thursday, July 08, 2010 9:46 AM
> To: solr-user@lucene.apache.org
> Subject: Spellcheck help
> 
> 
> Hello,
> I've been trying to get rid of a bug when using the spellcheck, but so far 
> with no success :( When searching for a word that starts with a number, 
> for example "3dsmax", I get the results that I want, BUT the spellcheck says 
> it is not correctly spelled AND the collation gives me "33dsmax". Further 
> investigation shows that the spellcheck is actually only checking "dsmax", 
> which it considers does not exist, and gives me "3dsmax" for better results; 
> but since I have spellcheck.collate = true, the collation that I show is 
> "33dsmax", with the first 3 being the one discarded by the spellchecker... 
> Otherwise, the spellcheck works correctly for normal words... any ideas? :( 
> My spellcheck field is fairly classic: whitespace tokenizer, with 
> lowercase filter... Any help would be greatly appreciated :)
> Thanks,
> Marc


Highlighting parameters wiki

2010-07-27 Thread Stephen Green
The wiki entry for hl.highlightMultiTerm:

http://wiki.apache.org/solr/HighlightingParameters#hl.highlightMultiTerm

doesn't appear to be correct.  It says:

If the SpanScorer is also being used, enables highlighting for
range/wildcard/fuzzy/prefix queries. Default is false.

But the code in DefaultSolrHighlighter (both on the 1.4 branch that
I'm using and in the trunk) does:

Boolean highlightMultiTerm =
    request.getParams().getBool(HighlightParams.HIGHLIGHT_MULTI_TERM, true);
if (highlightMultiTerm == null) {
    highlightMultiTerm = false;
}

which looks to me like it's going to default to true, since
getBool will never return null, and if it gets a null value from the
parameters internally, it will return true.

Shall I file a Jira on this one?  Perhaps it's easier just to fix the Wiki page?

Steve
-- 
Stephen Green
http://thesearchguy.wordpress.com


Re: Russian stemmer

2010-07-27 Thread Robert Muir
right, but your problem is that this is the current output:

Ковров -> Ковр
Коврову -> Ковров
Ковровом -> Ковров
Коврове -> Ковров

so, if Ковров was simply left alone, all your forms would match...

2010/7/27 Oleg Burlaca 

> Thanks Robert for all your help,
>
> The idea of [A-Z].* stopwords is ideal for the English language,
> although in Russian nouns are inflected: Борис, Борису, Бориса, Борисом
>
> I'll try the RussianLightStemFilterFactory (the article in the PDF
> mentioned
> it's more accurate).
>
> Once again thanks,
> Oleg Burlaca
>
> On Tue, Jul 27, 2010 at 12:07 PM, Robert Muir  wrote:
>
> > 2010/7/27 Oleg Burlaca 
> >
> > > Actually the situation with Немцов is ok,
> > > I've just checked how Yandex works with Немцов and Немцова:
> > > http://nano.yandex.ru/project/inflect/
> > >
> > > I think there are two solutions:
> > > a) manually search for both Немцов and then Немцова
> > > b) use wildcard query: Немцов*
> > >
> >
> > Well, here is one idea of a more general solution.
> > The problem with "protected words" is you must have a complete list.
> >
> > One idea would be to add a filter that protects any words from stemming
> > that
> > match a regular expression:
> > In English maybe someone wants to avoid any capitalized words to reduce
> > trouble: [A-Z].*
> > in your case then some pattern like [A-Я].*ов might prevent problems.
> >
> >
> > > Robert, thanks for the RussianLightStemFilterFactory info,
> > > I've found this page
> > >
> http://www.mail-archive.com/solr-comm...@lucene.apache.org/msg06857.html
> > > that somehow describes it. Where can I read more about
> > > RussianLightStemFilterFactory ?
> > >
> > >
> > Here is the link:
> >
> >
> http://doc.rero.ch/lm.php?url=1000,43,4,20091209094227-CA/Dolamic_Ljiljana_-_Indexing_and_Searching_Strategies_for_the_Russian_20091209.pdf
> >
> >
> > > Regards,
> > > Oleg
> > >
> > > 2010/7/27 Oleg Burlaca 
> > >
> > > > A similar word is Немцов.
> > > > The strange thing is that searching for "Немцова" will not find
> > documents
> > > > containing "Немцов"
> > > >
> > > > Немцова: 14 articles
> > > >
> > > >
> > >
> >
> http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0
> > > >
> > > > Немцов: 74 articles
> > > >
> > > >
> > >
> >
> http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2
> > > >
> > > >
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Robert Muir
> > rcm...@gmail.com
> >
>



-- 
Robert Muir
rcm...@gmail.com


Re: Russian stemmer

2010-07-27 Thread Oleg Burlaca
Thanks Robert for all your help,

The idea of [A-Z].* stopwords is ideal for the English language,
although in Russian nouns are inflected: Борис, Борису, Бориса, Борисом

I'll try the RussianLightStemFilterFactory (the article in the PDF mentioned
it's more accurate).

Once again thanks,
Oleg Burlaca

On Tue, Jul 27, 2010 at 12:07 PM, Robert Muir  wrote:

> 2010/7/27 Oleg Burlaca 
>
> > Actually the situation with Немцов is ok,
> > I've just checked how Yandex works with Немцов and Немцова:
> > http://nano.yandex.ru/project/inflect/
> >
> > I think there are two solutions:
> > a) manually search for both Немцов and then Немцова
> > b) use wildcard query: Немцов*
> >
>
> Well, here is one idea of a more general solution.
> The problem with "protected words" is you must have a complete list.
>
> One idea would be to add a filter that protects any words from stemming
> that
> match a regular expression:
> In English maybe someone wants to avoid any capitalized words to reduce
> trouble: [A-Z].*
> in your case then some pattern like [A-Я].*ов might prevent problems.
>
>
> > Robert, thanks for the RussianLightStemFilterFactory info,
> > I've found this page
> > http://www.mail-archive.com/solr-comm...@lucene.apache.org/msg06857.html
> > that somehow describes it. Where can I read more about
> > RussianLightStemFilterFactory ?
> >
> >
> Here is the link:
>
> http://doc.rero.ch/lm.php?url=1000,43,4,20091209094227-CA/Dolamic_Ljiljana_-_Indexing_and_Searching_Strategies_for_the_Russian_20091209.pdf
>
>
> > Regards,
> > Oleg
> >
> > 2010/7/27 Oleg Burlaca 
> >
> > > A similar word is Немцов.
> > > The strange thing is that searching for "Немцова" will not find
> documents
> > > containing "Немцов"
> > >
> > > Немцова: 14 articles
> > >
> > >
> >
> http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0
> > >
> > > Немцов: 74 articles
> > >
> > >
> >
> http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2
> > >
> > >
> > >
> > >
> >
>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>


RE: Spellcheck help

2010-07-27 Thread Marc Ghorayeb

Thanks for the input, i'll check it out!
Marc

> Subject: RE: Spellcheck help
> Date: Fri, 23 Jul 2010 13:12:04 -0500
> From: james.d...@ingrambook.com
> To: solr-user@lucene.apache.org
> 
> In org.apache.solr.spelling.SpellingQueryConverter, find the line (#84):
> 
> final static String PATTERN = "(?:(?!(" + NMTOKEN + 
> ":|\\d+)))[\\p{L}_\\-0-9]+";
> 
> and remove the |\\d+ to make it:
> 
> final static String PATTERN = "(?:(?!" + NMTOKEN + ":))[\\p{L}_\\-0-9]+";
> 
> My testing shows this solves your problem.  The caution is to test it against 
> all your use cases because obviously someone thought we should ignore leading 
> digits from keywords.  Surely there's a reason why although I can't think of 
> it.
> 
> James Dyer
> E-Commerce Systems
> Ingram Book Company
> (615) 213-4311
> 
> -Original Message-
> From: dekay...@hotmail.com [mailto:dekay...@hotmail.com] 
> Sent: Saturday, July 17, 2010 12:41 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Spellcheck help
> 
> Can anybody help me with this? :(
> 
> -Original Message- 
> From: Marc Ghorayeb
> Sent: Thursday, July 08, 2010 9:46 AM
> To: solr-user@lucene.apache.org
> Subject: Spellcheck help
> 
> 
> Hello,I've been trying to get rid of a bug when using the spellcheck but so 
> far with no success :(When searching for a word that starts with a number, 
> for example "3dsmax", i get the results that i want, BUT the spellcheck says 
> it is not correctly spelled AND the collation gives me "33dsmax". Further 
> investigation shows that the spellcheck is actually only checking "dsmax" 
> which it considers does not exist and gives me "3dsmax" for better results, 
> but since i have spellcheck.collate = true, the collation that i show is 
> "33dsmax" with the first 3 being the one discarded by the spellchecker... 
> Otherwise, the spellcheck works correctly for normal words... any ideas? 
> :(My spellcheck field is fairly classic, whitespace tokenizer, with 
> lowercase filter...Any help would be greatly appreciated :)Thanks,Marc
> _
> Messenger arrive enfin sur iPhone ! Venez le télécharger gratuitement !
> http://www.messengersurvotremobile.com/?d=iPhone 
> 
  
_
Exclu : Téléchargez la nouvelle version de Messenger !
http://clk.atdmt.com/FRM/go/244627952/direct/01/

Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

2010-07-27 Thread Chantal Ackermann
Hi Mitch,

thanks for the code. Currently, I've got a different solution running
but it's always good to have examples.

> > I realized 
> > that I have to throw an exception and add the onError attribute to the 
> > entity to make that work. 
> > 
> I am curious:
> Can you show how to make a method throwing an exception that is accepted by
> the onError-attribute?

the catch clause looks for "Exception" so it's actually easy. :-D

Anyway, I've found a "cleaner" way. It is better to subclass the
XPathEntityProcessor and put it in a state that prevents it from calling
"initQuery" which triggers the dataSource.getData() call.
I have overridden the initContext() method setting a go/no go flag that
I am using in the overridden nextRow() to find out whether to delegate
to the superclass or not.

This way I can also avoid the code that fills the tmp field with an
empty value if there is no value to query on.
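
For anyone trying the same thing, here is a rough sketch of that shape (the
hook names follow the description above and are not verified against every DIH
version; the variable name comes from the earlier example in this thread, and
the emptiness check is an assumption to adapt):

import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.XPathEntityProcessor;

public class GuardedXPathEntityProcessor extends XPathEntityProcessor {
    private boolean haveInput; // the go/no go flag

    @Override
    protected void initContext(Context context) {
        super.initContext(context);
        // "prog.vip" is the variable from the earlier example in this thread
        Object vip = context.getVariableResolver().resolve("prog.vip");
        haveInput = vip != null && vip.toString().trim().length() > 0;
    }

    @Override
    public Map<String, Object> nextRow() {
        if (!haveInput) {
            return null; // report "no rows", so dataSource.getData() is never reached
        }
        return super.nextRow();
    }
}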

Cheers,
Chantal



question: solrCloud with multiple cores on each machine

2010-07-27 Thread Yatir Ben Shlomo
Hi
 I am using solrCloud.
Suppose I have a total 4 machines dedicated for solr.
I want to have 2 machines as replicas (slaves) and 2 as masters,
but I want to work with 8 logical cores rather than 2,
i.e. each master (and each slave) will have 4 cores on it.
The reason is that I can optimize the cores one at a time, so the IO intensity 
at any given moment will be low and will not degrade the online performance.

Is there a way to configure my solr.xml so that when I am doing a distributed 
search (distrib=true) it will know to query all 8 cores?

Thanks
Yatir


Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

2010-07-27 Thread MitchK

Hi Chantal,

instead of:

  <field column="vip" />  /* multivalued, not required */
  <entity name="ssc_entry"
          query="select SSC_VALUE from SSC_VALUE
                 where SSC_ATTRIBUTE_ID=1
                 and SSC_VALUE in (${prog.vip})">
    <field column="SSC_VALUE" />
  </entity>

you do:

  <field column="vip" />  /* multivalued, not required */
  <entity name="ssc_entry"
          query="${dih.functions.yourCustomFunctionToReturnAQueryString(prog.vip, ...)}">
    <field column="SSC_VALUE" />
  </entity>

The function yourCustomFunctionToReturnAQueryString(vip, querystring1, querystring2) would look like:

public String yourCustomFunctionToReturnAQueryString(String vip, String querystring1, String querystring2)
{
    if (vip != null && !vip.equals(""))
    {
        StringBuilder sb = new StringBuilder(50);
        sb.append(querystring1); // SELECT SSC_VALUE from SSC_VALUE where SSC_ATTRIBUTE_ID=1 and SSC_VALUE in (
        sb.append(vip);          // the VIP value
        sb.append(querystring2); // just the closing ")"
        return sb.toString();
    }
    else
    {
        return "SELECT \"\" AS yourFieldName";
    }
}

I expect that this method is called for every vip-value, if there is one.

Solr DIH uses the returned querystring to query the database. So, if
vip-value is empty or null, you can use a different query that is blazing
fast (i.e. SELECT "" AS yourFieldName - just an example to show the logic).
This query should return a row with an empty string. So Solr fills the
current field with an empty string.

I don't know how to prevent Solr from calling your ssc_entry-entity, when
vip is null or empty.
But this would be a solution to handle empty vip-strings as efficiently as
possible. 



> I realized 
> that I have to throw an exception and add the onError attribute to the 
> entity to make that work. 
> 
I am curious:
Can you show how to make a method throwing an exception that is accepted by
the onError-attribute?

I hope we do not talk past each other here. :-)

Kind regards,
- Mitch


Re: slave index is bigger than master index

2010-07-27 Thread Peter Karich

> We have three dedicated servers for solr, two for slaves and one for master,
> all with linux/debian packages installed. 
>
> I understand that replication always copies over the index in the exact
> form it has in the master index directory (or it is supposed to do that at least),
> and if the master index was optimized after indexing, one doesn't need to
> run an optimize call again on the master to optimize the slave's index. But in
> our case that's what fixed it, and I agree it is even more confusing now :s
>   

That's why I said: try it on the slaves too ;-)
In our case it helped too, shrinking 2*index to 1*index.
I think the data which is necessary for the replication won't be cleaned up
before the next replication or before an optimize.
For us it was crucial to shrink the size because of limited
disc resources and to make sure that the next
replication does not increase the index to >3 times the initial size.

@muneeb: so I think optimization is not necessary - or do you have disc
limitations too?
@Hoss or others: does this explanation sound logical?

> Another problem is, we are serving live services using slave nodes, so I
> don't want to affect the live search while playing with the slave nodes'
> indices. 
>   

What do you mean here? Optimizing is too CPU expensive?

> We will be running the indexing on the master node tonight. Let's
> see if it does it again.
>   

Do you mean increase to double size?


Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

2010-07-27 Thread Chantal Ackermann
Hi Mitch,


> New idea:
> Create a method which returns the query-string:
> 
> returnString(theVIP)
> {
>    if (theVIP != null && !theVIP.equals(""))
>{
>return "a query-string to find the vip"
>}
>else
>{
>return "SELECT 1" // you need to modify this, so that it
> matches your field-definition
>}
> }
> 
> The main-idea is to perform a blazing fast query, instead of a complex
> IN-clause-query.
> Does this sounds like a solution???

I was using "in" because it's a multiValued input that results in
multiValued output (not necessarily but it's most probable - it's either
empty or multiple values).
I don't understand how I can make your solution work with multivalued
input/output?

> > The new approach is to query the solr index for that other database that 
> > I've already setup. This is only a bit slower than the original query 
> > (20min). (I'm using URLDataSource to be 1.4.1 conform.) 
> > 
> Unfortunately I can not follow you. 
> You are querying a solr-index for a database?

Yes, because I've already put one up (second core) and used SolrJ to get
what I want later on, but it would be better to compute the relation
between the two indexes at index time instead of at query time. (If it
had worked with the db entity, the second index wouldn't have been
required anymore.)
But now that it works well with the url entity I'm fine with maintaining
that second index. It's not that much effort.
I've subclassed URLDataSource to add a check whether the list of input
values is empty and only proceed when this is not the case. I realized
that I have to throw an exception and add the onError attribute to the
entity to make that work.
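
A minimal sketch of what such a subclass could look like (the emptiness check
against the interpolated URL is an assumption about the template; the
DataImportHandlerException pairs with the onError attribute mentioned above):

import java.io.Reader;
import org.apache.solr.handler.dataimport.DataImportHandlerException;
import org.apache.solr.handler.dataimport.URLDataSource;

public class GuardedURLDataSource extends URLDataSource {
    @Override
    public Reader getData(String query) {
        // hypothetical check: the URL template ends in "q=" when no input values exist
        if (query == null || query.endsWith("q=")) {
            throw new DataImportHandlerException(DataImportHandlerException.SKIP,
                    "no input values - skipping request");
        }
        return super.getData(query);
    }
}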

Thanks!
Chantal



Re: LucidWorks 1.4 compilation

2010-07-27 Thread Eric Grobler
I did not realize the LucidWorks.jar comes with an option to install the
sources :-)

On Tue, Jul 27, 2010 at 10:59 AM, Eric Grobler wrote:

> Good Morning, afternoon or evening...
>
> If someone installed Solr using the LucidWorks.jar (1.4) installation, how
> can one make a small change and recompile?
>
> Is there a LucidWorks (tomcat) build somewhere?
>
> Regards
> ericz
>
>
>
>
>


Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

2010-07-27 Thread MitchK

Hi Chantal,



> However, with this approach indexing time went up from 20min to more 
> than 5 hours. 
> 
This is 15x slower than the initial solution... wow.
From MySQL I know that IN ()-clauses are the embodiment of endlessness -
they perform very, very badly.

New idea:
Create a method which returns the query-string:

String returnString(String theVIP)
{
    if (theVIP != null && !theVIP.equals(""))
    {
        return "a query-string to find the vip";
    }
    else
    {
        return "SELECT 1"; // you need to modify this, so that it
                           // matches your field-definition
    }
}

The main-idea is to perform a blazing fast query, instead of a complex
IN-clause-query.
Does this sounds like a solution???



> The new approach is to query the solr index for that other database that 
> I've already set up. This is only a bit slower than the original query 
> (20min). (I'm using URLDataSource to stay 1.4.1-compatible.) 
> 
Unfortunately I can not follow you. 
You are querying a solr-index for a database?

Kind regards,
- Mitch


Re: NullPointerException with CURL, but not in browser

2010-07-27 Thread Rene Rath
Ouch! Absolutely correct - quoting the URL fixed it. Thanks for saving me a
sleepless night!

cheers - rene

2010/7/26 Chris Hostetter 

>
> : However, when I'm trying this very URL with curl within my (perl) script,
> I
> : receive a NullPointerException:
> : CURL-COMMAND: curl -sL
> :
> http://localhost:8983/solr/select?indent=on&version=2.2&q=*&fq=ListId%3A881&start=0&rows=0&fl=*%2Cscore&qt=standard&wt=standard
>
> it appears you aren't quoting the URL, so that first "&" character is
> causing the shell to think you are done with the command, and you want it
> to be backgrounded (although I'm not certain, since it depends on how you
> are having Perl execute curl)
>
> i would suggest that you avoid exec/system calls to "curl" from Perl, and
> use an LWP::UserAgent instead.
>
>
> -Hoss
>
>


DIH $deleteDocByQuery

2010-07-27 Thread Maddy.Jsh

Hi,

I have been using DIH to index documents from a database. I am hoping to
use DIH to delete documents from the index. I searched the wiki and found the
special commands in DIH to do so.

http://wiki.apache.org/solr/DataImportHandler#Special_Commands


But there is no example of how to use them. I tried searching the web but
couldn't find any samples.

Any help regarding this would be most welcome.
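
For what it's worth, the pattern I have seen described is to emit the special
key as a column, either from the entity query itself or from a custom
transformer. A hedged sketch of the transformer variant (the "deleted" and "id"
columns are invented for illustration):

import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class DeleteMarkerTransformer extends Transformer {
    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
        // hypothetical column "deleted" coming from the entity query
        if ("Y".equals(row.get("deleted"))) {
            // asks DIH to delete the matching docs instead of adding this row
            row.put("$deleteDocByQuery", "id:" + row.get("id"));
        }
        return row;
    }
}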

Thanks,
Maddy.


Re: Extracting PDF text/comment/callout/typewriter boxes with Solr CELL/Tika/PDFBox

2010-07-27 Thread Alessandro Benedetti
Hi Jon,
Over the last few days we faced the same problem.
Using classic Solr 1.4.1 (Tika 0.4), from some PDF files we can't extract
content, and for others Solr throws an exception during the indexing
process.
You must:
Update the Tika libraries (in /contrib/extraction/lib) with the tika-core 0.8
snapshot and tika-parsers 0.8.
Update PDFBox and all related libraries.
After that you have to patch Solr 1.4.1 with this patch:
https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.ext.subversion%3Asubversion-commits-tabpanel
This is the first way to solve the problem.

Using Solr 1.4.1 (with the Tika 0.8 snapshot and PDFBox updated) no exception is
thrown during the indexing process, but no content is extracted.
Using the latest Solr trunk (with the Tika 0.8 snapshot and PDFBox updated) all
sounds good, but we don't know how stable it is!
I hope you now have a clear vision of this issue.
Best Regards



2010/7/26 Sharp, Jonathan 

>
> Every so often I need to index new batches of scanned PDFs and occasionally
> Adobe's OCR can't recognize the text in a couple of these documents. In
> these situations I would like to type in a small amount of text onto the
> document and have it be extracted by Solr CELL.
>
> Adobe Pro 9 has a number of different ways to add text directly to a PDF
> file:
>
> *Typewriter
> *Sticky Note
> *Callout boxes
> *Text boxes
>
> I tried indexing documents with each of these text additions with Solr
> 1.4.1 + Solr CELL but can't extract the text in any of these boxes.
>
> If someone has modified their Solr CELL installation to use more recent
> versions of Tika (above 0.4) or PDFBox (above 0.7.3) and/or can comment
> on whether newer versions can pull the text out of any of these various text
> boxes I'd appreciate that very much.
>
> -Jon


-- 
--

Benedetti Alessandro
Personal Page: http://tigerbolt.altervista.org

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


LucidWorks 1.4 compilation

2010-07-27 Thread Eric Grobler
Good Morning, afternoon or evening...

If someone installed Solr using the LucidWorks.jar (1.4) installation, how
can one make a small change and recompile?

Is there a LucidWorks (tomcat) build somewhere?

Regards
ericz


Re: DIH : SQL query (sub-entity) is executed although variable is not set (null or empty list)

2010-07-27 Thread Chantal Ackermann
Hi Mitch,

thanks for that suggestion. I wasn't aware of that. I've already added a
temporary field in my ScriptTransformer that does basically the same.

However, with this approach indexing time went up from 20min to more
than 5 hours.

The new approach is to query the solr index for that other database that
I've already set up. This is only a bit slower than the original query
(20min). (I'm using URLDataSource to stay 1.4.1-compatible.)

As with the db entity before, for every document a request is sent to
the solr core even if it is useless because the input variable is empty.
It seems that once an entity processor kicks in you cannot avoid the
initial request to its data source?

Thanks,
Chantal

On Mon, 2010-07-26 at 16:22 +0200, MitchK wrote:
> Hi Chantal,
> 
> did you try to write a custom DIH function?
> http://wiki.apache.org/solr/DIHCustomFunctions
> If not, I think this will be a solution.
> Just check, whether "${prog.vip}" is an empty string or null.
> If so, you need to replace it with a value that can never return anything.
> 
> So the vip-field will always be empty for such queries. 
> Maybe that helps?
> 
> Hopefully, the variable resolver is able to resolve something like
> ${dih.functions.getReplacementIfNeeded(prog.vip)}.
> 
> Kind regards,
> - Mitch
> 
> 
> 
> Chantal Ackermann wrote:
> > 
> > Hi,
> > 
> > my use case is the following:
> > 
> > In a sub-entity I request rows from a database for an input list of
> > strings:
> > 
> >   <field column="vip" />  /* multivalued, not required */
> >   <entity name="ssc_entry"
> >           query="select SSC_VALUE from SSC_VALUE
> >                  where SSC_ATTRIBUTE_ID=1
> >                  and SSC_VALUE in (${prog.vip})">
> >     <field column="SSC_VALUE" />
> >   </entity>
> > 
> > The root entity is "prog" and it has an optional multivalued field
> > called "vip". When the list of "vip" values is empty, the SQL for the
> > sub-entity above throws an SQLException. (Working with Oracle which does
> > not allow an empty expression in the "in"-clause.)
> > 
> > Two things:
> > (A) best would be not to run the query whenever ${prog.vip} is null or
> > empty.
> > (B) From the documentation, it is not clear that onError is only checked
> > in the transformer runs but not checked when the SQL for the entity
> > throws an exception. (Trunk version JdbcDataSource lines 250pp).
> > 
> > IMHO, (A) is the better fix, and if so, (B) is the right decision. (If
> > (A) is not easily fixable, making (B) work would be helpful.)
> > 
> > Looking through the code, I've realized that the replacement of the
> > variables is done in a very generic way. I've not yet seen an
> > appropriate way to check on those variables in order to stop the
> > processing of the entity if the variable is empty.
> > Is there a way to do this? Or maybe there is a completely different way
> > to get my use case working. Any help most appreciated!
> > 
> > Thanks,
> > Chantal
> > 
> > 
> > 




Re: slave index is bigger than master index

2010-07-27 Thread Muneeb Ali

We have three dedicated servers for solr, two for slaves and one for master,
all with linux/debian packages installed. 

I understand that replication always copies over the index in the exact
form it has in the master index directory (or it is supposed to do that at least),
and if the master index was optimized after indexing, one doesn't need to
run an optimize call again on the master to optimize the slave's index. But in
our case that's what fixed it, and I agree it is even more confusing now :s

Another problem is, we are serving live services using slave nodes, so I
don't want to affect the live search while playing with the slave nodes'
indices. 

We will be running the indexing on the master node tonight. Let's
see if it does it again.


Re: clustering component

2010-07-27 Thread Stanislaw Osinski
Hi Matt,

I'm attempting to get the carrot based clustering component (in trunk) to
> work. I see that the clustering contrib has been disabled for the time
> being. Does anyone know if this will be re-enabled soon, or even better,
> know how I could get it working as it is?
>

I've recently created a patch to update the clustering algorithms in
branch_3x:

https://issues.apache.org/jira/browse/SOLR-1804

The patch should also work with trunk, but I haven't verified it yet.

S.


clustering component

2010-07-27 Thread Matt Mitchell
Hi,

I'm attempting to get the carrot based clustering component (in trunk) to
work. I see that the clustering contrib has been disabled for the time
being. Does anyone know if this will be re-enabled soon, or even better,
know how I could get it working as it is?

Thanks,
Matt


Re: Russian stemmer

2010-07-27 Thread Robert Muir
2010/7/27 Oleg Burlaca 

> Actually the situation with Немцов is ok,
> I've just checked how Yandex works with Немцов and Немцова:
> http://nano.yandex.ru/project/inflect/
>
> I think there are two solutions:
> a) manually search for both Немцов and then Немцова
> b) use wildcard query: Немцов*
>

Well, here is one idea of a more general solution.
The problem with "protected words" is you must have a complete list.

One idea would be to add a filter that protects any words from stemming that
match a regular expression:
In English maybe someone wants to avoid any capitalized words to reduce
trouble: [A-Z].*
in your case then some pattern like [A-Я].*ов might prevent problems.
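
As far as I know nothing like this ships out of the box, but on trunk the
sketch below is roughly what such a filter could look like, using the keyword
attribute that the stemmers already respect (the class name and constructor are
mine):

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

public final class PatternProtectedWordsFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final KeywordAttribute keywordAtt = addAttribute(KeywordAttribute.class);
    private final Matcher matcher;

    public PatternProtectedWordsFilter(TokenStream in, Pattern pattern) {
        super(in);
        this.matcher = pattern.matcher("");
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        // mark matching tokens as keywords so a downstream stemmer leaves them alone
        if (matcher.reset(termAtt).matches()) {
            keywordAtt.setKeyword(true);
        }
        return true;
    }
}

e.g. new PatternProtectedWordsFilter(stream, Pattern.compile("[А-Я].*ов")) for
the case above.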


> Robert, thanks for the RussianLightStemFilterFactory info,
> I've found this page
> http://www.mail-archive.com/solr-comm...@lucene.apache.org/msg06857.html
> that somehow describes it. Where can I read more about
> RussianLightStemFilterFactory ?
>
>
Here is the link:
http://doc.rero.ch/lm.php?url=1000,43,4,20091209094227-CA/Dolamic_Ljiljana_-_Indexing_and_Searching_Strategies_for_the_Russian_20091209.pdf


> Regards,
> Oleg
>
> 2010/7/27 Oleg Burlaca 
>
> > A similar word is Немцов.
> > The strange thing is that searching for "Немцова" will not find documents
> > containing "Немцов"
> >
> > Немцова: 14 articles
> >
> >
> http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0
> >
> > Немцов: 74 articles
> >
> >
> http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2
> >
> >
> >
> >
>



-- 
Robert Muir
rcm...@gmail.com


Re: Russian stemmer

2010-07-27 Thread Oleg Burlaca
Actually the situation with Немцов is ok,
I've just checked how Yandex works with Немцов and Немцова:
http://nano.yandex.ru/project/inflect/

I think there are two solutions:
a) manually search for both Немцов and then Немцова
b) use wildcard query: Немцов*

Robert, thanks for the RussianLightStemFilterFactory info,
I've found this page
http://www.mail-archive.com/solr-comm...@lucene.apache.org/msg06857.html
that somehow describes it. Where can I read more about
RussianLightStemFilterFactory ?

Regards,
Oleg

2010/7/27 Oleg Burlaca 

> A similar word is Немцов.
> The strange thing is that searching for "Немцова" will not find documents
> containing "Немцов"
>
> Немцова: 14 articles
>
> http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0
>
> Немцов: 74 articles
>
> http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2
>
>
>
>


Re: Russian stemmer

2010-07-27 Thread Oleg Burlaca
A similar word is Немцов.
The strange thing is that searching for "Немцова" will not find documents
containing "Немцов"

Немцова: 14 articles
http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0

Немцов: 74 articles
http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2


Re: Russian stemmer

2010-07-27 Thread Oleg Burlaca
Yes, I'm sure I've enabled SnowballPorterFilterFactory both at Index and
Query time, because the search works ok,
except for names and geo locations.

I've noticed that searching by
Коврова

also shows documents that contain Коврову, Коврове

Search by Ковров, 7 results:
http://www.sova-center.ru/search/?q=%D0%BA%D0%BE%D0%B2%D1%80%D0%BE%D0%B2

Search by Коврова, 26 results:
http://www.sova-center.ru/search/?lg=1&q=%D0%BA%D0%BE%D0%B2%D1%80%D0%BE%D0%B2%D0%B0

Adding such words to stopwords.txt would be a tedious task, as there are 7
million Russian names :)

Kind Regards,
Oleg Burlaca



On Tue, Jul 27, 2010 at 11:35 AM, Robert Muir  wrote:

> taking another look, your problem is ковров itself... it's mapped to ковр
>
> a workaround might be to use the protected words functionality to
> keep ковров and any other problematic people/geo names as-is.
>
> separately, in trunk there is an alternative russian stemmer
> (RussianLightStemFilterFactory), which might give you less problems on
> average, but I noticed it has this same problem with the example you gave.
>
> On Tue, Jul 27, 2010 at 4:25 AM, Robert Muir  wrote:
>
> > All of your examples stem to "ковров":
> >
> >assertAnalyzesTo(a, "Коврова Коврову Ковровом Коврове",
> >   new String[] { "ковров", "ковров", "ковров", "ковров" });
> > }
> >
> > Are you sure you enabled this at *both* index and query time?
> >
> > 2010/7/27 Oleg Burlaca 
> >
> > Hello,
> >>
> >> I'm using SnowballPorterFilterFactory with language="Russian".
> >> The stemming works ok except for people's names and geographical places.
> >> Here are some examples:
> >>
> >> searching for Ковров should also find Коврова, Коврову, Ковровом,
> Коврове.
> >>
> >> Are there other stemming plugins for the Russian language that can
> handle
> >> this?
> >> If not, what are the options. A simple solution may be to use the
> wildcard
> >> queries in Standard mode instead of the DisMaxQueryHandler:
> >> Ковров*
> >>
> >> but I'd like to avoid it.
> >>
> >> Thanks.
> >>
> >
> >
> >
> > --
> > Robert Muir
> > rcm...@gmail.com
> >
>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>


Re: Russian stemmer

2010-07-27 Thread Robert Muir
taking another look, your problem is ковров itself... it's mapped to ковр

a workaround might be to use the protected words functionality to
keep ковров and any other problematic people/geo names as-is.

separately, in trunk there is an alternative russian stemmer
(RussianLightStemFilterFactory), which might give you less problems on
average, but I noticed it has this same problem with the example you gave.

On Tue, Jul 27, 2010 at 4:25 AM, Robert Muir  wrote:

> All of your examples stem to "ковров":
>
>assertAnalyzesTo(a, "Коврова Коврову Ковровом Коврове",
>   new String[] { "ковров", "ковров", "ковров", "ковров" });
> }
>
> Are you sure you enabled this at *both* index and query time?
>
> 2010/7/27 Oleg Burlaca 
>
> Hello,
>>
>> I'm using SnowballPorterFilterFactory with language="Russian".
>> The stemming works ok except for people's names and geographical places.
>> Here are some examples:
>>
>> searching for Ковров should also find Коврова, Коврову, Ковровом, Коврове.
>>
>> Are there other stemming plugins for the Russian language that can handle
>> this?
>> If not, what are the options? A simple solution may be to use the wildcard
>> queries in Standard mode instead of the DisMaxQueryHandler:
>> Ковров*
>>
>> but I'd like to avoid it.
>>
>> Thanks.
>>
>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>



-- 
Robert Muir
rcm...@gmail.com


Spellchecking and frequency

2010-07-27 Thread dan sutton
Hi,

I've recently been looking into Spellchecking in solr, and was struck by how
limited the usefulness of the tool was.

Like most corpora, ours contains lots of different spelling mistakes for
the same word, so the 'spellcheck.onlyMorePopular' is not really that useful
unless you click on it numerous times.

I was thinking that since most of the time people spell words correctly why
was there no other frequency parameter that could enter into the score? i.e.
something like:

spell_score ~ edit_dist * freq
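
As a strawman, a combined score along those lines might normalize the edit
distance and damp the raw frequency; every weighting choice below is an
assumption:

public class SpellScore {
    // classic dynamic-programming Levenshtein edit distance
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    // higher is better: similarity in [0,1] times a damped document frequency
    static double spellScore(String query, String suggestion, int docFreq) {
        int maxLen = Math.max(query.length(), suggestion.length());
        double similarity = 1.0 - (double) editDistance(query, suggestion) / maxLen;
        return similarity * Math.log(1 + docFreq);
    }
}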

I'm sure others have come across this issue and was wondering what
steps/algorithms they have used to overcome these limitations?

Cheers,
Dan


Re: Russian stemmer

2010-07-27 Thread Robert Muir
All of your examples stem to "ковров":

   assertAnalyzesTo(a, "Коврова Коврову Ковровом Коврове",
  new String[] { "ковров", "ковров", "ковров", "ковров" });
}

Are you sure you enabled this at *both* index and query time?

2010/7/27 Oleg Burlaca 

> Hello,
>
> I'm using SnowballPorterFilterFactory with language="Russian".
> The stemming works ok except for people's names and geographical places.
> Here are some examples:
>
> searching for Ковров should also find Коврова, Коврову, Ковровом, Коврове.
>
> Are there other stemming plugins for the Russian language that can handle
> this?
> If not, what are the options? A simple solution may be to use the wildcard
> queries in Standard mode instead of the DisMaxQueryHandler:
> Ковров*
>
> but I'd like to avoid it.
>
> Thanks.
>



-- 
Robert Muir
rcm...@gmail.com


Russian stemmer

2010-07-27 Thread Oleg Burlaca
Hello,

I'm using SnowballPorterFilterFactory with language="Russian".
The stemming works ok except for people's names and geographical places.
Here are some examples:

searching for Ковров should also find Коврова, Коврову, Ковровом, Коврове.

Are there other stemming plugins for the Russian language that can handle
this?
If not, what are the options? A simple solution may be to use the wildcard
queries in Standard mode instead of the DisMaxQueryHandler:
Ковров*

but I'd like to avoid it.

Thanks.


Any tips/guidelines to turning the Solr/luence performance in a master/slave/sharding environment

2010-07-27 Thread Chengyang
How do I reduce the index file size, decrease the sync time between nodes, and 
decrease the index create/update time?
Thanks.



Re: How to Combine Drupal solrconfig.xml with Nutch solrconfig.xml?

2010-07-27 Thread David Stuart
I would use the string version, as Drupal will probably populate it with a URL-like 
thing, something that may not validate as type url.


On 27 Jul 2010, at 04:00, Savannah Beckett wrote:

> 
> I am trying to merge the schema.xml from the solr/nutch setup with the one 
> from the Drupal ApacheSolr module.  I encountered a field that is not mergeable.
> From drupal module:
>  
> From solr/nutch setup:
>  required="true"/>
> I am not sure if there are any more stuff like this that is not mergeable.
>  
> Is there an easy way to deal with schema.xml?
> Thanks.
> From: David Stuart 
> To: solr-user@lucene.apache.org
> Sent: Mon, July 26, 2010 1:46:58 PM
> Subject: Re: How to Combine Drupal solrconfig.xml with Nutch solrconfig.xml?
> 
> Hi Savannah,
> 
> I have just answered this question over on drupal.org. 
> http://drupal.org/node/811062
> 
> Response number 5 and 11 will help you. On the solrconfig.xml side of things 
> you will only really need Drupal's version.
> 
> Although still in alpha my Nutch module will help you out with integration 
> http://drupal.org/project/nutch
> 
> Regards,
> 
> David Stuart
> 
> On 26 Jul 2010, at 21:37, Savannah Beckett wrote:
> 
> > I am using Drupal ApacheSolr module to integrate solr with drupal.  I 
> > already 
> > integrated solr with nutch.  I already moved nutch's solrconfig.xml and 
> > schema.xml to solr's example directory, and it work.  I tried to append 
> > Drupal's 
> > ApacheSolr module's own solrconfig.xml and schema.xml into the same xml 
> > files, 
> > but I got the following error when I "java -jar start.jar":
> >  
> > Jul 26, 2010 1:18:31 PM org.apache.solr.common.SolrException log
> > SEVERE: Exception during parsing file: 
> > solrconfig.xml:org.xml.sax.SAXParseException: The markup in the document 
> > following the root element must be well-formed.
> > at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:249)
> > at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
> > at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:124)
> > at org.apache.solr.core.Config.<init>(Config.java:110)
> > at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:130)
> > at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:134)
> > at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
> > 
> > Why?  Does solrconfig.xml allow having 2 <config> sections?  Does 
> > schema.xml 
> > allow having 2 <schema> sections?  
> > 
> > Thanks.
> > 
> > 
> 
> 
> 



Re: Design questions/Schema Help

2010-07-27 Thread Chantal Ackermann
Hi,

IMHO you can do this with date range queries and (date) facets.
The DateMathParser will allow you to normalize dates on min/hours/days.
If you hit a limit there, then just add a field with an integer for
either min/hour/day. This way you'll lose the month information - which
is sometimes what you want.

You probably want the document entity to be a query with fields:
query
user (id? if you have that)
sessionid
date

the most popular query within a date range is the query that was logged
the most times? Do a search on the date range:
q=date:[start TO end]
with facet on the query which gives you the count similar to "group by &
count" aggregation functionality in an RDBMS. You can do multiple facets
at the same time, but be careful what you are querying for - it will
impact the facet count. You can use functions to change the base of each
facet.

http://wiki.apache.org/solr/SimpleFacetParameters
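
In SolrJ the whole thing might look roughly like this (the field names follow
the entity sketch above; the range values and facet limit are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PopularQueries {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("date:[2010-07-01T00:00:00Z TO 2010-07-27T00:00:00Z]");
        q.setRows(0);        // only the facet counts are needed
        q.setFacet(true);
        q.addFacetField("query");
        q.setFacetLimit(10); // the 10 most popular queries in the range
        QueryResponse rsp = server.query(q);
        for (FacetField.Count c : rsp.getFacetField("query").getValues()) {
            System.out.println(c.getName() + ": " + c.getCount());
        }
    }
}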

Cheers,
Chantal

On Tue, 2010-07-27 at 01:43 +0200, Mark wrote:
> We are thinking about using Cassandra to store our search logs. Can 
> someone point me in the right direction/lend some guidance on design? I 
> am new to Cassandra and I am having trouble wrapping my head around some 
> of these new concepts. My brain keeps wanting to go back to an RDBMS design.
> 
> We will be storing the user query, # of hits returned and their session 
> id. We would like to be able to answer the following questions.
> 
> - What is the n most popular queries and their counts within the last x 
> (mins/hours/days/etc). Basically the most popular searches within a 
> given time range.
> - What is the most popular query within the last x where hits = 0. Same 
> as above but with an extra "where" clause
> - For session id x give me all their other queries
> - What are all the session ids that searched for 'foos'
> 
> We accomplish the above functionality w/ MySQL using 2 tables. One for 
> the raw search log information and the other to keep the 
> aggregate/running counts of queries.
> 
> Would this sort of ad-hoc querying be better implemented using Hadoop + 
> Hive? If so, should I be storing all this information in Cassandra then 
> using Hadoop to retrieve it?
> 
> Thanks for your suggestions