Re: AndQueryNode to NearSpanQuery

2011-06-14 Thread mtraynham
Thanks for your help, great solution! It turned out perfectly. Too bad they
don't actually add this to the SDK.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/AndQueryNode-to-NearSpanQuery-tp3061286p3066035.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to involve JMX by configuration

2011-06-14 Thread Gora Mohanty
On Wed, Jun 15, 2011 at 7:41 AM, kun xiong  wrote:
> Hi,
>
> I am wondering how to start JMX monitor without code change.
>
> Currently, I have to insert code "LocateRegistry.createRegistry();" into
> SolrCore.java.
>
> And I specify <jmx serviceUrl="service:jmx:rmi:///jndi/rmi://localhost:/solr/${
> solr.core.name}"/> in solrconfig.xml.
>
> Can I do this with only a configuration change?

Please take a look at http://wiki.apache.org/solr/SolrJmx
While this mainly covers JMX configuration for the built-in
Jetty server, you can use the described JMX parameters
for other containers, like Tomcat.
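
For reference, a configuration-only sketch along the lines of that wiki page
(the port and flags below are examples, not values from the original mail):
enable the platform MBeanServer with JVM system properties when starting the
servlet container, and let Solr register its MBeans with it through a bare
<jmx/> element in solrconfig.xml.

  # container startup (Jetty shown; the same -D flags work for Tomcat)
  java -Dcom.sun.management.jmxremote \
       -Dcom.sun.management.jmxremote.port=9999 \
       -Dcom.sun.management.jmxremote.authenticate=false \
       -Dcom.sun.management.jmxremote.ssl=false \
       -jar start.jar

  <!-- solrconfig.xml: register Solr MBeans with the existing MBeanServer -->
  <jmx/>

With that in place the LocateRegistry.createRegistry() call in SolrCore.java
should not be needed.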

Regards,
Gora


How to involve JMX by configuration

2011-06-14 Thread kun xiong
Hi,

I am wondering how to start JMX monitor without code change.

Currently, I have to insert code "LocateRegistry.createRegistry();" into
SolrCore.java.

And I specify <jmx serviceUrl="service:jmx:rmi:///jndi/rmi://localhost:/solr/${solr.core.name}"/> in solrconfig.xml.

Can I do this with only a configuration change?

Thanks

Kun


Re: International filters/tokenizers doing too much

2011-06-14 Thread Shawn Heisey

On 6/14/2011 5:34 PM, Robert Muir wrote:

On Tue, Jun 14, 2011 at 7:07 PM, Shawn Heisey  wrote:

Because the text in my index comes in many different languages with no
ability to know the language ahead of time, I have a need to use
ICUTokenizer and/or the CJK filters, but I have a problem with them as they
are implemented currently.  They do extra things like handle email
addresses, tokenize on non-alphanumeric characters, etc.  I need them to not
do these things.  This is my current index analyzer chain:

the idea is that you customize it to whatever your app needs, by
passing ICUTokenizerConfig to the Tokenizer:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/java/org/apache/lucene/analysis/icu/segmentation/ICUTokenizerConfig.java

the default implementation (DefaultICUTokenizerConfig) is pretty
minimal, mostly the unicode default word break implementation,
described here: http://unicode.org/reports/tr29/

as you see, you just need to provide a BreakIterator given the script
code, you could implement this by hand in java code, or it could use a
dictionary, or whatever.

But the easiest and usually most performant approach is just to use rules,
especially since they are compiled to an efficient form for
processing. The syntax is described here:
http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules

you compile them into a state machine with this:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/tools/java/org/apache/lucene/analysis/icu/RBBIRuleCompiler.java
and you can load the serialized form (statically, or in your factory,
or whatever) with
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/RuleBasedBreakIterator.html#getInstanceFromCompiledRules%28java.io.InputStream%29

The reason the script code is provided is that, if you are customizing,
it's pretty easy to screw some languages over with rules that happen to
work well for another set of languages. This way you can provide
different rules depending upon the writing system.

For example, you could return special punctuation rules for western
languages when it's the Latin script, but still return the default impl
for Tibetan or something you might be less familiar with (maybe you
actually speak Tibetan, this was just an example).


My understanding starts to break down horribly with things like this.  I 
can make sense out of very simple Java code, but I can't make sense out 
of this, and don't know how to take these bits of information you've 
given me and do something useful with them.  I will take the information 
to our programming team before I bug you about it again.  They will 
probably have some idea what to do.  I'm hoping that I can just create 
an extra .jar and not touch the existing lucene/solr code.


Beyond the ICU stuff, what kind of options do I have for dealing with 
other character sets (CJK, arabic, cyrillic, etc) in some sane manner 
while not touching typical Latin punctuation?  I notice that for CJK, 
there is only a Tokenizer and an Analyzer, what I really need is a token 
filter that ONLY deals with the CJK characters.  Is that going to be a 
major undertaking that is best handled by an experienced Lucene 
developer?  Would such a thing be required for Arabic and Cyrillic, or 
are they pretty well covered by whitespace and WDF?


Thanks,
Shawn



Re: How to avoid double counting for facet query

2011-06-14 Thread Way Cool
I just checked SolrQueryParser.java from the 3.2.0 source. Looks like Yonik
Seeley's changes for LUCENE-996 are not in.
I will check trunk later. Thanks!

On Tue, Jun 14, 2011 at 5:34 PM, Way Cool  wrote:

> I already checked out facet range query. By the way, I did put the
> facet.range.include as below:
> lower
>
> Couple things I don't like though are:
> 1. It returns the following without end values (I have to re-calculate the
> end values) :
> 
> 20
> 3
> 
> 50.0
> 0.0
> 600.0
> 0
>
> 2. I can't specify custom ranges of values, for example, 1,2,3,4,5,...10,
> 15, 20, 30,40,50,60,80,90,100,200, ..., 600, 800, 900, 1000, 2000, ... etc.
>
> Thanks.
>
>
> On Tue, Jun 14, 2011 at 3:50 PM, Chris Hostetter  > wrote:
>
>>
>> : You can use exclusive range queries which are denoted by curly brackets.
>>
>> that will solve the problem of making the fq exclude a bound, but
>> for the range facet counts you'll want to pay attention to look at
>> facet.range.include...
>>
>> http://wiki.apache.org/solr/SimpleFacetParameters#facet.range.include
>>
>>
>> -Hoss
>>
>
>


Re: International filters/tokenizers doing too much

2011-06-14 Thread Robert Muir
On Tue, Jun 14, 2011 at 7:07 PM, Shawn Heisey  wrote:
> Because the text in my index comes in many different languages with no
> ability to know the language ahead of time, I have a need to use
> ICUTokenizer and/or the CJK filters, but I have a problem with them as they
> are implemented currently.  They do extra things like handle email
> addresses, tokenize on non-alphanumeric characters, etc.  I need them to not
> do these things.  This is my current index analyzer chain:

the idea is that you customize it to whatever your app needs, by
passing ICUTokenizerConfig to the Tokenizer:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/java/org/apache/lucene/analysis/icu/segmentation/ICUTokenizerConfig.java

the default implementation (DefaultICUTokenizerConfig) is pretty
minimal, mostly the unicode default word break implementation,
described here: http://unicode.org/reports/tr29/

as you see, you just need to provide a BreakIterator given the script
code, you could implement this by hand in java code, or it could use a
dictionary, or whatever.

But the easiest and usually most performant approach is just to use rules,
especially since they are compiled to an efficient form for
processing. The syntax is described here:
http://userguide.icu-project.org/boundaryanalysis#TOC-RBBI-Rules

you compile them into a state machine with this:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/icu/src/tools/java/org/apache/lucene/analysis/icu/RBBIRuleCompiler.java
and you can load the serialized form (statically, or in your factory,
or whatever) with
http://icu-project.org/apiref/icu4j/com/ibm/icu/text/RuleBasedBreakIterator.html#getInstanceFromCompiledRules%28java.io.InputStream%29

The reason the script code is provided is that, if you are customizing,
it's pretty easy to screw some languages over with rules that happen to
work well for another set of languages. This way you can provide
different rules depending upon the writing system.

For example, you could return special punctuation rules for western
languages when it's the Latin script, but still return the default impl
for Tibetan or something you might be less familiar with (maybe you
actually speak Tibetan, this was just an example).
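
To make that concrete, here is a minimal sketch of the kind of
ICUTokenizerConfig subclass being described. It assumes the 3.x API where
DefaultICUTokenizerConfig has a no-arg constructor and a
getBreakIterator(int script) method; the class name and the idea of loading
pre-compiled Latin rules are illustrative, not the shipped code.

  import java.io.InputStream;
  import com.ibm.icu.lang.UScript;
  import com.ibm.icu.text.BreakIterator;
  import com.ibm.icu.text.RuleBasedBreakIterator;
  import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;

  // Hypothetical config: custom RBBI rules for Latin script, defaults elsewhere.
  public class LatinRulesTokenizerConfig extends DefaultICUTokenizerConfig {
    private final BreakIterator latinBreaker;

    public LatinRulesTokenizerConfig(InputStream compiledLatinRules) throws Exception {
      // the rules file would be produced beforehand with RBBIRuleCompiler
      this.latinBreaker =
          RuleBasedBreakIterator.getInstanceFromCompiledRules(compiledLatinRules);
    }

    @Override
    public BreakIterator getBreakIterator(int script) {
      if (script == UScript.LATIN) {
        // BreakIterators are stateful, so hand out a copy per use
        return (BreakIterator) latinBreaker.clone();
      }
      return super.getBreakIterator(script); // default UAX#29 behaviour for other scripts
    }
  }

A Tokenizer (or a custom TokenizerFactory for Solr) would then be constructed
with an instance of this config instead of the default one.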


Re: How to avoid double counting for facet query

2011-06-14 Thread Way Cool
I already checked out facet range query. By the way, I did put the
facet.range.include as below:
lower

Couple things I don't like though are:
1. It returns the following without end values (I have to re-calculate the
end values) :

20
3

50.0
0.0
600.0
0

2. I can't specify custom ranges of values, for example, 1,2,3,4,5,...10,
15, 20, 30,40,50,60,80,90,100,200, ..., 600, 800, 900, 1000, 2000, ... etc.

Thanks.

On Tue, Jun 14, 2011 at 3:50 PM, Chris Hostetter
wrote:

>
> : You can use exclusive range queries which are denoted by curly brackets.
>
> that will solve the problem of making the fq exclude a bound, but
> for the range facet counts you'll want to pay attention to look at
> facet.range.include...
>
> http://wiki.apache.org/solr/SimpleFacetParameters#facet.range.include
>
>
> -Hoss
>


International filters/tokenizers doing too much

2011-06-14 Thread Shawn Heisey
Because the text in my index comes in many different languages with no 
ability to know the language ahead of time, I have a need to use 
ICUTokenizer and/or the CJK filters, but I have a problem with them as 
they are implemented currently.  They do extra things like handle email 
addresses, tokenize on non-alphanumeric characters, etc.  I need them to 
not do these things.  This is my current index analyzer chain:


http://pastebin.com/dNBGmeeW

My current idea for how to change this is to use the ICUTokenizer 
instead of the WhitespaceTokenizer, then as one of the later steps, run 
it through CJK so that it outputs bigrams for the CJK characters.  The 
reason I can't do this now is that I must let WordDelimiterFilter handle 
punctuation, case changes, and numbers, because of the magic of the 
preserveOriginal flag.


Is it possible to turn off these extra features in these analyzer 
components as they are written now?  If not, is it a painful process for 
someone with Java experience to customize the code so it IS possible?  I 
have not yet looked at the code, but I will do so in the next couple of 
days.  Ideally, I would also like to have a WordDelimiterFilter that is 
fully aware of international capitalization via ICU.  Does any such 
thing exist?


In the current chain, you'll notice a pattern filter.  What this does is 
remove leading and trailing punctuation from tokens.  Punctuation inside 
the token is preserved, for later handling with WordDelimiterFilter.


Thanks,
Shawn



Re: Modifying Configuration from a Browser

2011-06-14 Thread Stefan Matheis

Brandon,

Actually, AFAIK there is no way to do this, not via an API or anything
else.


If you really want to start building such an API, would you mind
building a generic one? I'm asking because this was already requested as
a feature for the new admin UI
[https://issues.apache.org/jira/browse/SOLR-2399?focusedCommentId=13007832&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13007832]



Regards
Stefan

Am 14.06.2011 21:47, schrieb Brandon Fish:

Does anyone have any examples of modifying a configuration file, like
"elevate.xml" from a browser? Is there an API that would help for this?

If nothing exists for this, I am considering implementing something that
would change the "elevate.xml" file then reload the core. Or is there a
better approach for dynamic configuration?

Thank you.



Re: How to avoid double counting for facet query

2011-06-14 Thread Chris Hostetter

: You can use exclusive range queries which are denoted by curly brackets.

that will solve the problem of making the fq exclude a bound, but 
for the range facet counts you'll want to pay attention to look at 
facet.range.include...

http://wiki.apache.org/solr/SimpleFacetParameters#facet.range.include


-Hoss


Search with Dynamic indexing

2011-06-14 Thread zarni aung
Hi,

I have a requirement to make a large amount of data (> 5 million documents)
searchable.
The problem is that more than half of them have highly volatile field values.  I
will also have a data store specifically for metadata.
Committing frequently isn't a solution.  What I'm basically trying to
achieve is NRT.
I've read so many postings and articles everywhere and even considered
sharing a single index between one write-only Solr instance and 1-n Solr
instances.
Apparently this will not work, since calling commit on a searcher is the
only way new documents will become searchable.  I've also considered using
one WriteOnly Master Instance with > 1-n ReadOnly Solr Slaves but that would
mean there will be lag between snapshots of the master.  Another solution
that I was thinking about is having a smaller R/W Dynamic Master Solr
instance that would only store deltas while I will still have a WriteOnly
Master with a set of ReadOnly slaves.  That would mean I would have to add
some logic to combine and intersect the results from the dynamic Solr
instance and R/O slaves.  In this scenario, I wonder what would happen if I
were to search for the top 25 documents that contains "x"?  What would
happen to scoring and other factors?  Would sharding be better in this
situation?

One more question: I have not seen a lot of people discussing Solr-RA
NRT. Is anyone familiar with it?  There's not much mention of it except
here: http://solr-ra.tgels.com.

Thanks,

Zarni


Re: Text field case sensitivity problem

2011-06-14 Thread Mike Sokolov

Oops, please s/Highlight/Wildcard/

On 06/14/2011 05:31 PM, Mike Sokolov wrote:
Wildcard queries aren't analyzed, I think?  I'm not completely sure 
what the best workaround is here: perhaps simply lowercasing the query 
terms yourself in the application.  Also - I hope someone more 
knowledgeable will say that the new HighlightQuery in trunk doesn't 
have this restriction, but I'm not sure about that.


-Mike

On 06/14/2011 05:13 PM, Jamie Johnson wrote:

Also of interest to me is this returns results
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kristine


On Tue, Jun 14, 2011 at 5:08 PM, Jamie Johnson  
wrote:



I am using the following for my text field:























I have a field defined as


when I execute a go to the following url I get results
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:kris*
but if I do
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kris*
I get nothing.  I thought the LowerCaseFilterFactory would have handled
lowercasing both the query and what is being indexed, am I missing
something?



RE: Text field case sensitivity problem

2011-06-14 Thread Bob Sandiford
Unfortunately, wild card search terms don't get processed by the analyzers.

One suggestion that's fairly common is to make sure you lower case your wild 
card search terms yourself before issuing the query.
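
A minimal client-side sketch of that suggestion, using SolrJ (the URL and the
exact query handling are assumptions, not from Jamie's setup): lowercase the
term yourself before it reaches the query parser, and leave the trailing '*'
intact.

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

  public class WildcardSearch {
    public static void main(String[] args) throws Exception {
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
      String userInput = "Kris*";
      // wildcard terms bypass the analyzer, so apply the lowercasing here
      String term = userInput.toLowerCase();
      SolrQuery q = new SolrQuery("Person_Name:" + term);
      System.out.println(server.query(q).getResults().getNumFound() + " hits");
    }
  }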

Bob Sandiford | Lead Software Engineer | SirsiDynix
P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com
www.sirsidynix.com

> -Original Message-
> From: Jamie Johnson [mailto:jej2...@gmail.com]
> Sent: Tuesday, June 14, 2011 5:13 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Text field case sensitivity problem
> 
> Also of interest to me is this returns results
> http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kristine
> 
> 
> On Tue, Jun 14, 2011 at 5:08 PM, Jamie Johnson 
> wrote:
> 
> > I am using the following for my text field:
> >
> >  > positionIncrementGap="100" autoGeneratePhraseQueries="true">
> >   
> > 
> > 
> > 
> >  > ignoreCase="true"
> > words="stopwords.txt"
> > enablePositionIncrements="true"
> > />
> >  > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > 
> >  > protected="protwords.txt"/>
> > 
> >   
> >   
> > 
> >  synonyms="synonyms.txt"
> > ignoreCase="true" expand="true"/>
> >  > ignoreCase="true"
> > words="stopwords.txt"
> > enablePositionIncrements="true"
> > />
> >  > generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> > 
> >  > protected="protwords.txt"/>
> > 
> >   
> > 
> >
> > I have a field defined as
> > />
> >
> > when I execute a go to the following url I get results
> > http://localhost:8983/solr/select?defType=lucene&q=Person_Name:kris*
> > but if I do
> > http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kris*
> > I get nothing.  I thought the LowerCaseFilterFactory would have
> handled
> > lowercasing both the query and what is being indexed, am I missing
> > something?
> >



Re: Text field case sensitivity problem

2011-06-14 Thread Mike Sokolov
Wildcard queries aren't analyzed, I think?  I'm not completely sure what 
the best workaround is here: perhaps simply lowercasing the query terms 
yourself in the application.  Also - I hope someone more knowledgeable 
will say that the new HighlightQuery in trunk doesn't have this 
restriction, but I'm not sure about that.


-Mike

On 06/14/2011 05:13 PM, Jamie Johnson wrote:

Also of interest to me is this returns results
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kristine


On Tue, Jun 14, 2011 at 5:08 PM, Jamie Johnson  wrote:

   

I am using the following for my text field:

 
   
 
 
 
 
 
 
 
 
   
   
 
 
 
 
 
 
 
   
 

I have a field defined as


when I execute a go to the following url I get results
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:kris*
but if I do
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kris*
I get nothing.  I thought the LowerCaseFilterFactory would have handled
lowercasing both the query and what is being indexed, am I missing
something?

 
   


Re: Strange behavior

2011-06-14 Thread Erick Erickson
Well, you could provide the results with &debugQuery=on. You could
provide the schema.xml and solrconfig.xml files for both. You
could provide a listing of your index files. You could provide some
evidence that you've tried chasing down your problem using tools
like Luke or the Solr admin interface. Something please...

You might also review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

2011/6/14 Denis Kuzmenok :
> What  should  i provide, OS is the same, environment is the same, solr
> is  completely  copied,  searches  work,  except that one, and that is
> strange..
>
>> I think you will need to provide more information than this, no-one on this 
>> list is omniscient AFAIK.
>
>> François
>
>> On Jun 14, 2011, at 10:44 AM, Denis Kuzmenok wrote:
>
>>> Hi.
>>>
>>> I've  debugged search on test machine, after copying to production server
>>> the  entire  directory  (entire solr directory), i've noticed that one
>>> query  (SDR  S70EE  K)  does  match  on  test  server, and does not on
>>> production.
>>> How can that be?
>>>
>
>
>
>
>


Re: ampersand, dismax, combining two fields, one of which is keywordTokenizer

2011-06-14 Thread Jonathan Rochkind

Okay, let's try the debug trace again without a pf to be less confusing.

One field in qf, that's ordinary text tokenized, and does get hits:

q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t&mm=100%&debugQuery=true&pf=

churchill : roosevelt
churchill : roosevelt

+((DisjunctionMaxQuery((title1_t:churchil)~0.01) 
DisjunctionMaxQuery((title1_t:roosevelt)~0.01))~2) ()



+(((title1_t:churchil)~0.01 (title1_t:roosevelt)~0.01)~2) ()


And that gets 25 hits. Now we add in a second field to the qf, this 
second field is also ordinarily tokenized. We expect no _fewer_ than 25 
hits, adding another field into qf, right? And indeed it still results 
in exactly 25 hits (no additional hits from the additional qf field).


?q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20title2_t&mm=100%&debugQuery=true&pf=


+((DisjunctionMaxQuery((title2_t:churchil | title1_t:churchil)~0.01) 
DisjunctionMaxQuery((title2_t:roosevelt | title1_t:roosevelt)~0.01))~2) ()



+(((title2_t:churchil | title1_t:churchil)~0.01 (title2_t:roosevelt | 
title1_t:roosevelt)~0.01)~2) ()





Okay, now we go back to just that first (ordinarily tokenized) field,
but add a second field that uses KeywordTokenizerFactory.  We expect
this not necessarily to ever match for a multi-word query, but we don't
expect fewer than 25 hits; the 25 hits from the first field in
the qf should still be there, right? But they're not. What happened, why not?


q=churchill%20%3A%20roosevelt&qt=search&qf=title1_t%20isbn_t&mm=100%&debugQuery=true&pf=


<str name="rawquerystring">churchill : roosevelt</str>
<str name="querystring">churchill : roosevelt</str>
+((DisjunctionMaxQuery((isbn_t:churchill | 
title1_t:churchil)~0.01) DisjunctionMaxQuery((isbn_t::)~0.01) 
DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3) 
()
+(((isbn_t:churchill | 
title1_t:churchil)~0.01 (isbn_t::)~0.01 (isbn_t:roosevelt | 
title1_t:roosevelt)~0.01)~3) ()




On 6/14/2011 5:19 PM, Jonathan Rochkind wrote:
I'm aware that using a field tokenized with KeywordTokenizerFactory
in a dismax 'qf' is often going to result in 0 hits on that field
(when a whitespace-containing query is entered).  But I do it anyway,
for cases where a non-whitespace-containing query is entered, then it
hits.  And in those cases where it doesn't hit, I figure okay, well,
the other fields in qf will hit or not, that's good enough.


And usually that works. But it works _differently_ when my query
contains an ampersand (or any other punctuation), resulting in 0 hits
when it shouldn't, and I can't figure out why.


basically,

&defType=dismax&mm=100%&q=one : two&qf=text_field

gets hits.  The ":" is thrown out the text_field, but the mm still 
passes somehow, right?


But, in the same index:

&defType=dismax&mm=100%&q=one : two&qf=text_field 
keyword_tokenized_text_field


gets 0 hits.  Somehow maybe the inclusion of the 
keyword_tokenized_text_field in the qf causes dismax to calculate the 
mm differently, decide there are three tokens in there and they all 
must match, and the token ":" can never match because it's not in my 
index it's stripped out... but somehow this isn't a problem unless I 
include a keyword-tokenized  field in the qf?


This is really confusing, if anyone has any idea what I'm talking 
about it and can shed any light on it, much appreciated.


The conclusion I am reaching is just NEVER include anything but a more 
or less ordinarily tokenized field in a dismax qf. Sadly, it was 
useful for certain use cases for me.


Oh, hey, the debugging trace would probably be useful:




churchill : roosevelt


churchill : roosevelt


+((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01) 
DisjunctionMaxQuery((isbn_t::)~0.01) 
DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3) 
DisjunctionMaxQuery((title2_unstem:"churchill roosevelt"~3^240.0 | 
text:"churchil roosevelt"~3^10.0 | title2_t:"churchil 
roosevelt"~3^50.0 | author_unstem:"churchill roosevelt"~3^400.0 | 
title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil 
roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 | 
author2_unstem:"churchill roosevelt"~3^240.0 | 
title3_unstem:"churchill roosevelt"~3^80.0 | subject_t:"churchil 
roosevelt"~3^10.0 | other_number_unstem:"churchill roosevelt"~3^40.0 | 
subject_unstem:"churchill roosevelt"~3^80.0 | title_series_t:"churchil 
roosevelt"~3^40.0 | title_series_unstem:"churchill roosevelt"~3^60.0 | 
text_unstem:"churchill roosevelt"~3^80.0)~0.01)



+(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01 
(isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3) 
(title2_unstem:"churchill roosevelt"~3^240.0 | text:"churchil 
roosevelt"~3^10.0 | title2_t:"churchil roosevelt"~3^50.0 | 
author_unstem:"churchill roosevelt"~3^400.0 | 
title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil 
roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 | 
author2_unstem:"churchill roosevelt"~3^240.0 | 
title3_unstem:"churchill roosevelt"~3^80.0 | subjec

ampersand, dismax, combining two fields, one of which is keywordTokenizer

2011-06-14 Thread Jonathan Rochkind
I'm aware that using a field tokenized with KeywordTokenizerFactory
in a dismax 'qf' is often going to result in 0 hits on that field
(when a whitespace-containing query is entered).  But I do it anyway,
for cases where a non-whitespace-containing query is entered, then it
hits.  And in those cases where it doesn't hit, I figure okay, well, the
other fields in qf will hit or not, that's good enough.


And usually that works. But it works _differently_ when my query
contains an ampersand (or any other punctuation), resulting in 0 hits when
it shouldn't, and I can't figure out why.


basically,

&defType=dismax&mm=100%&q=one : two&qf=text_field

gets hits.  The ":" is thrown out the text_field, but the mm still 
passes somehow, right?


But, in the same index:

&defType=dismax&mm=100%&q=one : two&qf=text_field 
keyword_tokenized_text_field


gets 0 hits.  Somehow maybe the inclusion of the 
keyword_tokenized_text_field in the qf causes dismax to calculate the mm 
differently, decide there are three tokens in there and they all must 
match, and the token ":" can never match because it's not in my index 
it's stripped out... but somehow this isn't a problem unless I include a 
keyword-tokenized  field in the qf?


This is really confusing, if anyone has any idea what I'm talking about 
it and can shed any light on it, much appreciated.


The conclusion I am reaching is just NEVER include anything but a more 
or less ordinarily tokenized field in a dismax qf. Sadly, it was useful 
for certain use cases for me.


Oh, hey, the debugging trace would probably be useful:




churchill : roosevelt


churchill : roosevelt


+((DisjunctionMaxQuery((isbn_t:churchill | title1_t:churchil)~0.01) 
DisjunctionMaxQuery((isbn_t::)~0.01) 
DisjunctionMaxQuery((isbn_t:roosevelt | title1_t:roosevelt)~0.01))~3) 
DisjunctionMaxQuery((title2_unstem:"churchill roosevelt"~3^240.0 | 
text:"churchil roosevelt"~3^10.0 | title2_t:"churchil roosevelt"~3^50.0 
| author_unstem:"churchill roosevelt"~3^400.0 | 
title_exactmatch:churchill roosevelt^500.0 | title1_t:"churchil 
roosevelt"~3^60.0 | title1_unstem:"churchill roosevelt"~3^320.0 | 
author2_unstem:"churchill roosevelt"~3^240.0 | title3_unstem:"churchill 
roosevelt"~3^80.0 | subject_t:"churchil roosevelt"~3^10.0 | 
other_number_unstem:"churchill roosevelt"~3^40.0 | 
subject_unstem:"churchill roosevelt"~3^80.0 | title_series_t:"churchil 
roosevelt"~3^40.0 | title_series_unstem:"churchill roosevelt"~3^60.0 | 
text_unstem:"churchill roosevelt"~3^80.0)~0.01)



+(((isbn_t:churchill | title1_t:churchil)~0.01 (isbn_t::)~0.01 
(isbn_t:roosevelt | title1_t:roosevelt)~0.01)~3) 
(title2_unstem:"churchill roosevelt"~3^240.0 | text:"churchil 
roosevelt"~3^10.0 | title2_t:"churchil roosevelt"~3^50.0 | 
author_unstem:"churchill roosevelt"~3^400.0 | title_exactmatch:churchill 
roosevelt^500.0 | title1_t:"churchil roosevelt"~3^60.0 | 
title1_unstem:"churchill roosevelt"~3^320.0 | author2_unstem:"churchill 
roosevelt"~3^240.0 | title3_unstem:"churchill roosevelt"~3^80.0 | 
subject_t:"churchil roosevelt"~3^10.0 | other_number_unstem:"churchill 
roosevelt"~3^40.0 | subject_unstem:"churchill roosevelt"~3^80.0 | 
title_series_t:"churchil roosevelt"~3^40.0 | 
title_series_unstem:"churchill roosevelt"~3^60.0 | 
text_unstem:"churchill roosevelt"~3^80.0)~0.01




<str name="QParser">DisMaxQParser</str>


Re: Text field case sensitivity problem

2011-06-14 Thread Jamie Johnson
Also of interest to me is this returns results
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kristine


On Tue, Jun 14, 2011 at 5:08 PM, Jamie Johnson  wrote:

> I am using the following for my text field:
>
>  positionIncrementGap="100" autoGeneratePhraseQueries="true">
>   
> 
> 
> 
>  ignoreCase="true"
> words="stopwords.txt"
> enablePositionIncrements="true"
> />
>  generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> 
>  protected="protwords.txt"/>
> 
>   
>   
> 
>  ignoreCase="true" expand="true"/>
>  ignoreCase="true"
> words="stopwords.txt"
> enablePositionIncrements="true"
> />
>  generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> 
>  protected="protwords.txt"/>
> 
>   
> 
>
> I have a field defined as
>
>
> when I execute a go to the following url I get results
> http://localhost:8983/solr/select?defType=lucene&q=Person_Name:kris*
> but if I do
> http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kris*
> I get nothing.  I thought the LowerCaseFilterFactory would have handled
> lowercasing both the query and what is being indexed, am I missing
> something?
>


Text field case sensitivity problem

2011-06-14 Thread Jamie Johnson
I am using the following for my text field:


  








  
  







  


I have a field defined as
   

when I execute a go to the following url I get results
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:kris*
but if I do
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kris*
I get nothing.  I thought the LowerCaseFilterFactory would have handled
lowercasing both the query and what is being indexed, am I missing
something?


Re: How to avoid double counting for facet query

2011-06-14 Thread Ahmet Arslan
> That's good to know. From the ticket,
> looks like the fix will be in 4.0
> then?

It is already committed. You can use trunk:
svn checkout http://svn.apache.org/repos/asf/lucene/dev/trunk
 
> Currently I can see {} and [] worked, but not combined for
> Solr 3.1. I will
> try 3.2 soon. 

After re-thinking, you can simulate the same thing by using a negative clause
too: facet.query=price:[110 TO 160] -price:160

I saw a facet-by-range example in solrconfig.xml. Maybe this will work for
you?

http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range

<int name="f.price.facet.range.start">0</int>
<int name="f.price.facet.range.end">600</int>
<int name="f.price.facet.range.gap">50</int>
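
If the fixed gap is the limitation, arbitrary buckets can also be built with
one facet.query per bucket instead of facet.range. A sketch with made-up
boundaries, written with the negative-clause trick above so it works without
the mixed-bracket syntax (parameters shown un-encoded for readability):

  facet=true
  &facet.query=price:[1 TO 5] -price:5
  &facet.query=price:[5 TO 10] -price:10
  &facet.query=price:[10 TO 100] -price:100
  &facet.query=price:[100 TO *]

Each facet.query comes back as its own count under facet_queries, so the end
values are known to the client because it chose them.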


Re: Modifying Configuration from a Browser

2011-06-14 Thread Way Cool
+1 Good idea! I was thinking of writing a web interface to change the contents
of elevate.xml and feed it back to the Solr core.

On Tue, Jun 14, 2011 at 1:51 PM, Markus Jelsma
wrote:

> There is no API. Upload and restart the core is the way to go.
>
> > Does anyone have any examples of modifying a configuration file, like
> > "elevate.xml" from a browser? Is there an API that would help for this?
> >
> > If nothing exists for this, I am considering implementing something that
> > would change the "elevate.xml" file then reload the core. Or is there a
> > better approach for dynamic configuration?
> >
> > Thank you.
>


Re: How to avoid double counting for facet query

2011-06-14 Thread Way Cool
That's good to know. From the ticket, looks like the fix will be in 4.0
then?

Currently I can see {} and [] worked, but not combined for Solr 3.1. I will
try 3.2 soon. Thanks.

On Tue, Jun 14, 2011 at 2:07 PM, Ahmet Arslan  wrote:

> > You sure Solr supports that?
> > I am getting exceptions by doing that. Ahmet, do you
> > remember where you see
> > that document? Thanks.
>
> I tested it with trunk.
> https://issues.apache.org/jira/browse/SOLR-355
> https://issues.apache.org/jira/browse/LUCENE-996
>
>


Re: How to avoid double counting for facet query

2011-06-14 Thread Ahmet Arslan
> You sure Solr supports that?
> I am getting exceptions by doing that. Ahmet, do you
> remember where you see
> that document? Thanks.

I tested it with trunk. 
https://issues.apache.org/jira/browse/SOLR-355
https://issues.apache.org/jira/browse/LUCENE-996



Re: How to avoid double counting for facet query

2011-06-14 Thread Way Cool
Are you sure Solr supports that?
I am getting exceptions when doing that. Ahmet, do you remember where you saw
that documented? Thanks.



On Tue, Jun 14, 2011 at 1:58 PM, Way Cool  wrote:

> Thanks! That's what I was trying to find.
>
>
> On Tue, Jun 14, 2011 at 1:48 PM, Ahmet Arslan  wrote:
>
>> > 23
>> > 1
>> > 
>> > ...
>> > *
>> >
>> > As you notice, the number of the results is 23, however an
>> > extra doc was
>> > found in the 160-200 range.
>> >
>> > Any way I can avoid double counting issue?
>>
>> You can use exclusive range queries which are denoted by curly brackets.
>>
>> price:[110 TO 160}
>> price:[160 TO 200}
>>
>
>


Re: How to avoid double counting for facet query

2011-06-14 Thread Way Cool
Thanks! That's what I was trying to find.

On Tue, Jun 14, 2011 at 1:48 PM, Ahmet Arslan  wrote:

> > 23
> > 1
> > 
> > ...
> > *
> >
> > As you notice, the number of the results is 23, however an
> > extra doc was
> > found in the 160-200 range.
> >
> > Any way I can avoid double counting issue?
>
> You can use exclusive range queries which are denoted by curly brackets.
>
> price:[110 TO 160}
> price:[160 TO 200}
>


Re: Updating only one indexed field for all documents quickly.

2011-06-14 Thread karthik
Look at solr-2272. It might help in your situation. you can have a separate
core & join using the document unique id.

This way in the separate core you can just have the document id & the view
stats & you can just keep updating those 2 fields alone instead of the
entire document.

-- karthik

On Tue, Jun 14, 2011 at 2:41 PM, Adam Duston  wrote:

> Hi Erick,
>
> Thanks for your message.
>
> > What is the use-case you're considering?
>
> The use case is actually quite similar to the one in the blog post. We
> have view counts for Videos in our mysql database. We want to be able
> to find "most viewed videos" that match certain search criteria. So,
> for example, videos that contain a particular word and were also
> viewed the greatest number of times.
>
> We keep track of the view statistics in real-time using Redis, and
> then we dump the view stats into mysql once every 2 hours. It takes a
> while to update the solr search index, so we don't want to update the
> entire index once every 2 hours.
>
> > with the integer field. If you just want to influence the
> > score, then just plain external field fields should work for
> > you.
>
> Is this an appropriate solution, given our use case?
>
> Thank you again,
> Adam
>
>
> On Tue, Jun 14, 2011 at 2:36 PM, Erick Erickson 
> wrote:
> > Nope, there isn't a way to index a single field, it's always
> > the entire document.
> >
> > That said, the URL you pointed to is very interesting, but
> > it may be overkill depending upon what you want to do
> > with the integer field. If you just want to influence the
> > score, then just plain external field fields should work for
> > you.
> >
> > What is the use-case you're considering?
> >
> > Best
> > Erick
> >
> > On Tue, Jun 14, 2011 at 10:33 AM, Adam Duston  wrote:
> >> We are updating one indexed integer field in Solr for all documents
> >> once every two hours. We're using Solr through Haystack so we're not
> >> exactly Solr experts. Is there a way to update just one indexed field
> >> for all documents without reindexing all other fields also? We saw
> >> this blog post [1], which appears to be one solution.
> >>
> >> Adam
> >>
> >> [1]
> http://sujitpal.blogspot.com/2011/05/custom-sorting-in-solr-using-external.html
> >>
> >> --
> >> adus...@gmail.com
> >> 312-375-9879
> >> Skype: aduston
> >>
> >
>
>
>
> --
> adus...@gmail.com
> 312-375-9879
> Skype: aduston
>


Re: Modifying Configuration from a Browser

2011-06-14 Thread Markus Jelsma
There is no API. Upload and restart the core is the way to go.
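
If the cores are defined in solr.xml, the "restart" part can usually be done
without bouncing the whole container by calling the CoreAdmin handler after
uploading the new elevate.xml (the core name here is just an example):

  http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0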

> Does anyone have any examples of modifying a configuration file, like
> "elevate.xml" from a browser? Is there an API that would help for this?
> 
> If nothing exists for this, I am considering implementing something that
> would change the "elevate.xml" file then reload the core. Or is there a
> better approach for dynamic configuration?
> 
> Thank you.


Re: How to avoid double counting for facet query

2011-06-14 Thread Ahmet Arslan
> 23
> 1
> 
> ...
> *
> 
> As you notice, the number of the results is 23, however an
> extra doc was
> found in the 160-200 range.
> 
> Any way I can avoid double counting issue? 

You can use exclusive range queries which are denoted by curly brackets.

price:[110 TO 160}
price:[160 TO 200}
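
Put together as a request, and assuming a trunk build where the mixed-bracket
syntax is accepted (spaces and brackets would need percent-encoding in a real
URL), that would look something like:

  http://localhost:8983/solr/select?q=Shakespeare&rows=0&facet=true
      &fq=price:[110 TO 160]
      &facet.query=price:[110 TO 160}
      &facet.query=price:[160 TO 200]

so a document with price exactly 160 is counted in the second bucket only.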


Modifying Configuration from a Browser

2011-06-14 Thread Brandon Fish
Does anyone have any examples of modifying a configuration file, like
"elevate.xml" from a browser? Is there an API that would help for this?

If nothing exists for this, I am considering implementing something that
would change the "elevate.xml" file then reload the core. Or is there a
better approach for dynamic configuration?

Thank you.


Re: Updating only one indexed field for all documents quickly.

2011-06-14 Thread Adam Duston
Hi Erick,

Thanks for your message.

> What is the use-case you're considering?

The use case is actually quite similar to the one in the blog post. We
have view counts for Videos in our mysql database. We want to be able
to find "most viewed videos" that match certain search criteria. So,
for example, videos that contain a particular word and were also
viewed the greatest number of times.

We keep track of the view statistics in real-time using Redis, and
then we dump the view stats into mysql once every 2 hours. It takes a
while to update the solr search index, so we don't want to update the
entire index once every 2 hours.

> with the integer field. If you just want to influence the
> score, then just plain external field fields should work for
> you.

Is this an appropriate solution, given our use case?

Thank you again,
Adam


On Tue, Jun 14, 2011 at 2:36 PM, Erick Erickson  wrote:
> Nope, there isn't a way to index a single field, it's always
> the entire document.
>
> That said, the URL you pointed to is very interesting, but
> it may be overkill depending upon what you want to do
> with the integer field. If you just want to influence the
> score, then just plain external field fields should work for
> you.
>
> What is the use-case you're considering?
>
> Best
> Erick
>
> On Tue, Jun 14, 2011 at 10:33 AM, Adam Duston  wrote:
>> We are updating one indexed integer field in Solr for all documents
>> once every two hours. We're using Solr through Haystack so we're not
>> exactly Solr experts. Is there a way to update just one indexed field
>> for all documents without reindexing all other fields also? We saw
>> this blog post [1], which appears to be one solution.
>>
>> Adam
>>
>> [1] 
>> http://sujitpal.blogspot.com/2011/05/custom-sorting-in-solr-using-external.html
>>
>> --
>> adus...@gmail.com
>> 312-375-9879
>> Skype: aduston
>>
>



-- 
adus...@gmail.com
312-375-9879
Skype: aduston


Re: Updating only one indexed field for all documents quickly.

2011-06-14 Thread Erick Erickson
Nope, there isn't a way to index a single field, it's always
the entire document.

That said, the URL you pointed to is very interesting, but
it may be overkill depending upon what you want to do
with the integer field. If you just want to influence the
score, then just plain external file fields should work for
you.

What is the use-case you're considering?

Best
Erick
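
For reference, a rough sketch of what the external file field approach looks
like (field, type and file names here are made up, not from Adam's setup): the
values live in a flat file next to the index and are re-read when a new
searcher is opened, so they can be refreshed every couple of hours without
reindexing any documents.

  <!-- schema.xml -->
  <fieldType name="file" class="solr.ExternalFileField" keyField="id"
             defVal="0" stored="false" indexed="false" valType="float"/>
  <field name="view_count" type="file"/>

  # contents of <dataDir>/external_view_count, one "uniqueKey=value" per line
  VIDEO-001=1432
  VIDEO-002=87

The field can then be used in function queries (for example a dismax
bf=view_count) to push frequently viewed videos up the ranking.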

On Tue, Jun 14, 2011 at 10:33 AM, Adam Duston  wrote:
> We are updating one indexed integer field in Solr for all documents
> once every two hours. We're using Solr through Haystack so we're not
> exactly Solr experts. Is there a way to update just one indexed field
> for all documents without reindexing all other fields also? We saw
> this blog post [1], which appears to be one solution.
>
> Adam
>
> [1] 
> http://sujitpal.blogspot.com/2011/05/custom-sorting-in-solr-using-external.html
>
> --
> adus...@gmail.com
> 312-375-9879
> Skype: aduston
>


How to avoid double counting for facet query

2011-06-14 Thread Way Cool
Hi, guys,

I fixed the Solr search UI (solr/browse) to display the price range facet values
via
http://thetechietutorials.blogspot.com/2011/06/fix-price-facet-display-in-solr-search.html:

   - Under 50 (1331)
   - [50.0 TO 100] (133)
   - [100.0 TO 150] (31)
   - [150.0 TO 200] (7)
   - [200.0 TO 250] (2)
   - [250.0 TO 300] (5)
   - [300.0 TO 350] (3)
   - [350.0 TO 400] (6)
   - [400.0 TO 450] (1)
   - 600.0+ (1)

However I am having double counting issue.

Here is the URL to only return docs whose prices are in between 110.0 and
160.0 and price facets:
http://localhost:8983/solr/select/?q=Shakespeare&version=2.2&rows=0&fq=price:[110.0+TO+160]&facet.query=price:[110%20TO%20160]&facet.query=price:[160%20TO%20200]&facet.field=price

The response is as below:

<result name="response" numFound="23" start="0"/>
<lst name="facet_queries">
  <int name="price:[110 TO 160]">23</int>
  <int name="price:[160 TO 200]">1</int>
</lst>
...

As you notice, the number of the results is 23, however an extra doc was
found in the 160-200 range.

Any way I can avoid double counting issue? Or does anyone have similar
issues?

Thanks,

YH


Re: query parsing - removes a term

2011-06-14 Thread Dmitry Kan
Do you use stop word removal on text field?
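
If the field uses the stock text analyzer, the likely culprit is a stop filter
along these lines; the example stopwords.txt that ships with Solr contains
"was", so it is dropped at both index and query time, which is exactly what
the parsed query shows:

  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt" enablePositionIncrements="true"/>

Removing "was" from stopwords.txt (and reindexing) or dropping the filter from
the query analyzer would keep the term.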

Dmitry

On Tue, Jun 14, 2011 at 9:18 PM, Andrea Eakin <
andrea.ea...@systemsbiology.org> wrote:

> I am trying to do the following type of query:
>
> +text:(was wasp) +pub_date_year:[1991 TO 2011]
>
> When I turn debugQuery=on I find that the parsedquery is only sending in
> the
> +text:(wasp) on parsing, and doesn't use the "was" value.  Why is it
> removing one of the terms?
>
> Thanks!
> Andrea
>



-- 
Regards,

Dmitry Kan


query parsing - removes a term

2011-06-14 Thread Andrea Eakin
I am trying to do the following type of query:

+text:(was wasp) +pub_date_year:[1991 TO 2011]

When I turn debugQuery=on I find that the parsedquery is only sending in the
+text:(wasp) on parsing, and doesn't use the "was" value.  Why is it
removing one of the terms?

Thanks!
Andrea


Re: huge shards (300GB each) and load balancing

2011-06-14 Thread Dmitry Kan
Hi Tom,

Thanks a lot for sharing this. We have about half a terabyte total index
size, and we have split our index over 10 shards (horizontal scaling, no
replication). Each shard currently is allocated max 12GB memory. We use
facet search a lot and non-facet search with parameter values generated by
facet search (hence more focused search that hits small portion of solr
documents).

The parameters you have mentioned -- termInfosIndexDivisor and
termIndexInterval -- are not found in the solr 1.4.1 config|schema. Are you
using SOLR 3.1? Did you do logical sharding or document hash based sharding? Do
you have a load balancer between the front SOLR (or front entity) and shards,
and do you do merging?
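
For what it's worth, as far as I know neither setting lives in schema.xml; in
the 3.x example solrconfig.xml they would go roughly here (the values are the
ones Tom mentions, not recommendations, and the exact placement may differ
between versions):

  <!-- reader side: only load every Nth entry of the .tii terms index into RAM -->
  <indexReaderFactory name="IndexReaderFactory"
                      class="org.apache.solr.core.StandardIndexReaderFactory">
    <int name="setTermIndexDivisor">8</int>
  </indexReaderFactory>

  <!-- index-time alternative: write a sparser terms index to begin with -->
  <indexDefaults>
    <termIndexInterval>1024</termIndexInterval>
  </indexDefaults>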

On Wed, Jun 8, 2011 at 10:23 PM, Burton-West, Tom wrote:

> Hi Dmitry,
>
> I am assuming you are splitting one very large index over multiple shards
> rather than replicating and index multiple times.
>
> Just for a point of comparison, I thought I would describe our experience
> with large shards. At HathiTrust, we run a 6 terabyte index over 12 shards.
>  This is split over 4 machines with 3 shards per machine and our shards are
> about 400-500GB.  We get average response times of around 200 ms with the
> 99th percentile queries up around 1-2 seconds. We have a very low qps rate,
> i.e. less than 1 qps.  We also index offline on a separate machine and
> update the indexes nightly.
>
> Some of the issues we have found with very large shards are:
> 1) Because of the very large shard size, I/O tends to be the bottleneck,
> with phrase queries containing common words being the slowest.
> 2) Because of the I/O issues running cache-warming queries to get postings
> into the OS disk cache is important as is leaving significant free memory
> for the OS to use for disk caching
> 3) Because of the I/O issues using stop words or CommonGrams produces a
> significant performance increase.
> 4) We have a huge number of unique terms in our indexes.  In order to
> reduce the amount of memory needed by the in-memory terms index we set the
> termInfosIndexDivisor to 8, which causes Solr to only load every 8th term
> from the tii file into memory. This reduced memory use from over 18GB to
> below 3G and got rid of 30 second stop the world java Garbage Collections.
> (See
> http://www.hathitrust.org/blogs/large-scale-search/too-many-words-againfor 
> details)  We later ran into memory problems when indexing so instead
> changed the index time parameter termIndexInterval from 128 to 1024.
>
> (More details here: http://www.hathitrust.org/blogs/large-scale-search)
>
> Tom Burton-West
>
>


-- 
Regards,

Dmitry Kan


Re: Using Edismax

2011-06-14 Thread Jan Høydahl
Hi,

Let's assume you're using Solr version 3.1.0 and an unmodified FieldType 
"text_rev". It looks like this:


  





  
  ...

Also let's assume that what you have two docs in your index with these URLs:
A:"http://my.host/SPC265_SharePoint_2010.pptx";
B:"http://my.host/OpenTRs2010.xlsx";

Now you want to match only A and not B, and you attempt that using q=url:_2010

What happens here can easily be simulated by 
http://localhost:8983/solr/admin/analysis.jsp:


Your Tokenizer keeps the whole URL as a token.
The WordDelimiterFilter splits on all kinds of things, also removing the "_". 
Thus you get a match on 2010

What you need to do is design a new FieldType in your schema specifically for 
your need.
Choose a Tokenizer based on what you want to be your tokens.
My suggestion is like this:

  


  


Now your tokens will be "http my host SPC265 UNDERSCORE SharePoint UNDERSCORE 
2010 pptx"
A search for url:_2010 would match because the _ is replaced with a special 
token which can then be matched. Proof:



You could do similar things for other special cases you wish to match. I assume
that the normal case is that you want to match whole words like sharepoint or
pptx, and that the _ matching is a special case.
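
Since the XML of that suggestion was stripped from the archive, here is a
guess at the kind of FieldType that produces the token stream shown above; the
charFilter pattern and the tokenizer choice are assumptions on my part:

  <fieldType name="url_words" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- turn every "_" into a searchable marker token before tokenizing -->
      <charFilter class="solr.PatternReplaceCharFilterFactory"
                  pattern="_" replacement=" UNDERSCORE "/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>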

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 14. juni 2011, at 11.42, Tirthankar Chatterjee wrote:

> Erick,
> Thanks for the reply. But what can I do to avoid getting 2010? I wanted a phrase
> query with the underscore, so it would only return results containing "_2010".
> 
> Sent from iPod
> 
> On Jun 13, 2011, at 3:47 PM, "Erick Erickson"  wrote:
> 
>> You haven't supplied the information that's really
>> needed to help here, please review:
>> 
>> http://wiki.apache.org/solr/UsingMailingLists
>> 
>> But at a guess your analysis chain contains
>> WordDelimiterFilterFactory, which is splitting
>> the input stream into tokens on letter/number
>> changes, and capitalization changes. So you're
>> getting "2010" indexed as a separate token and
>> you're also searching on it...
>> 
>> Best
>> Erick
>> 
>> On Mon, Jun 13, 2011 at 3:07 PM, Tirthankar Chatterjee
>>  wrote:
>>> We are using edismax for query and the query fired is (url:_2010)
>>> 
>>> http://redcarpet2.dm2.commvault.com:27000/solr/select/?q=url: 
>>> 2010&version=2.2&start=0&rows=10&indent=on&defType=edismax
>>> 
>>> the url field is of type text_rev
>>> 
>>> Results that SOLR returns has 1 extra item which we don't want to get. How 
>>> do we achieve that?
>>> 
>>> Results:
>>> 
>>> SPC265_SharePoint_2010.pptx
>>> OpenTRs2010.xlsx(we don't want this to be returned)
>>> 
>>> 
>>> Thanks in advance!!!
>>> 
>>> Tirthankar
>>> 
>>> 
>>> **Legal Disclaimer***
>>> "This communication may contain confidential and privileged
>>> material for the sole use of the intended recipient. Any
>>> unauthorized review, use or distribution by others is strictly
>>> prohibited. If you have received the message in error, please
>>> advise the sender by reply email and delete the message. Thank
>>> you."
>>> *



Solr Data Import Handler - German Language Database

2011-06-14 Thread venkat
Hi All,

I am experiencing a problem with DataImportHandler trying to index a column
in a table that has several German umlaut characters. The problem is that the
field is not getting indexed at all, and I get an error log stating that this
field, which has been defined as required in the schema, is missing, though
the database has values for the field. Can anyone help me with this?

Kind Regards

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Data-Import-Handler-German-Language-Database-tp3062651p3062651.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: AndQueryNode to NearSpanQuery

2011-06-14 Thread mtraynham
Thanks for the correction!  I thought I had read that phrases were assumed to
be in order and the slop was the distance between them.  I'll look into this
also.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/AndQueryNode-to-NearSpanQuery-tp3061286p3063673.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: AndQueryNode to NearSpanQuery

2011-06-14 Thread mtraynham
That is a really good idea.  I'll have to try that.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/AndQueryNode-to-NearSpanQuery-tp3061286p3063668.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Strange behavior

2011-06-14 Thread Denis Kuzmenok
What should I provide? The OS is the same, the environment is the same, Solr
is completely copied, searches work, except that one, and that is
strange...

> I think you will need to provide more information than this, no-one on this 
> list is omniscient AFAIK.

> François

> On Jun 14, 2011, at 10:44 AM, Denis Kuzmenok wrote:

>> Hi.
>> 
>> I've  debugged search on test machine, after copying to production server
>> the  entire  directory  (entire solr directory), i've noticed that one
>> query  (SDR  S70EE  K)  does  match  on  test  server, and does not on
>> production.
>> How can that be?
>> 






DIH entity threads

2011-06-14 Thread Mark

Hello all,

We are using DIH to index our data (~6M documents) and its taking an 
extremely long time (~24 hours). I am trying to find ways that we can 
speed this up. I've been reading through older posts and it's my 
understanding this should not take that long.


One probable bottleneck is that we have a sub entity pulling in item 
descriptions from a separate datasource which we then strip html from. 
Before stripping the html we run it through JTidy. Our data-config looks 
something like this: http://pastie.org/2067011


I've heard about entity threads and I was wondering if this would help 
in my case? I haven't been able to find any good documentation on this.
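
For reference, the entity threading being referred to is (as far as I can tell
from the 3.x DIH) just a threads attribute on the root entity in
data-config.xml; the entity names and thread count below are made up:

  <document>
    <!-- 'threads' runs this entity's rows through several indexing threads -->
    <entity name="item" threads="4" query="select id, title from item">
      <entity name="description"
              query="select body from item_desc where item_id = '${item.id}'"/>
    </entity>
  </document>

As far as I know it applies to the root entity only, and whether it helps
depends on where the time actually goes (the JTidy/HTML stripping versus the
sub-entity queries).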


Another possible bottleneck is the number of sub entities we have... 
5 (only 1 of which is CachedSqlEntityProcessor). Any ideas?


Thanks for the help




Re: Strange behavior

2011-06-14 Thread François Schiettecatte
I think you will need to provide more information than this, no-one on this 
list is omniscient AFAIK.

François

On Jun 14, 2011, at 10:44 AM, Denis Kuzmenok wrote:

> Hi.
> 
> I've  debugged search on test machine, after copying to production server
> the  entire  directory  (entire solr directory), i've noticed that one
> query  (SDR  S70EE  K)  does  match  on  test  server, and does not on
> production.
> How can that be?
> 



Strange behavior

2011-06-14 Thread Denis Kuzmenok
Hi.

I've debugged search on a test machine; after copying the entire directory
(the entire solr directory) to the production server, I've noticed that one
query (SDR S70EE K) does match on the test server, and does not on
production.
How can that be?



Updating only one indexed field for all documents quickly.

2011-06-14 Thread Adam Duston
We are updating one indexed integer field in Solr for all documents
once every two hours. We're using Solr through Haystack so we're not
exactly Solr experts. Is there a way to update just one indexed field
for all documents without reindexing all other fields also? We saw
this blog post [1], which appears to be one solution.

Adam

[1] 
http://sujitpal.blogspot.com/2011/05/custom-sorting-in-solr-using-external.html

-- 
adus...@gmail.com
312-375-9879
Skype: aduston


Re: Huge performance drop in distributed search w/ shards on the same server/container

2011-06-14 Thread Johannes Goll
I increased the maximum POST size and headerBufferSize to 10MB; lowThreads
to 50, maxThreads to 10 and lowResourceMaxIdleTime=15000. We tried
Tomcat 6 using the following Connector settings:



I am getting the same exception as for jetty

SEVERE: org.apache.solr.common.SolrException:
org.apache.solr.client.solrj.SolrServerException: java.net.SocketException:
Connection reset

This seems to point towards a Solr-specific issue (solrj.SolrServerException
during individual shard searches).  I monitored the CPU utilization
executing sequential distributed searches and noticed that in the beginning
all CPUs are getting used for a short period of time (multiple lines for
shard searches are shown in the log with isShard=true arguments), then all
CPU except one become idle and the request is being processed by this one
CPU for the longest period of time.

I also noticed in the logs that while most of the individual shard searches
(isShard=true) have low QTimes (5-10), a minority has extreme QTimes
(104402-105126). All shards are fairly similar in size and content (1.2 M
documents) and the StatsComponent is being used
[stats=true&stats.field=weight&stats.facet=library_id]. Here library_id
equals the shard/core name.

Is there an internal timeout for gathering shard results or other fixed
resource limitation ?

Johannes






2011/6/13 Yonik Seeley 

> On Sun, Jun 12, 2011 at 9:10 PM, Johannes Goll 
> wrote:
> > However, sporadically, Jetty 6.1.2X (shipped with  Solr 3.1.)
> > sporadically throws Socket connect exceptions when executing distributed
> > searches.
>
> Are you using the exact jetty.xml that shipped with the solr example
> server,
> or did you make any modifications?
>
> -Yonik
> http://www.lucidimagination.com
>



-- 
Johannes Goll
211 Curry Ford Lane
Gaithersburg, Maryland 20878


RE: ISOLatin1AccentFilterFactory vs ASCIIFoldingFilterFactory

2011-06-14 Thread Steven A Rowe
On 6/14/2011 at 7:12 AM, Ahmet Arslan wrote:
> --- On Tue, 6/14/11, Nils Weinander  wrote:
> > The documentation states that ISOLatin1AccentFilterFactory
> > is deprecated in favour of ASCIIFoldingFilterFactory:
[...]
> > Is there a way to limit which characters are folded?
> 
> With MappingCharFilterFactory you have fully control over which
> characters are folded. You can see the default mappings in
> mapping-ISOLatin1Accent.txt file.
> 
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.MappingCharFilterFactory

There is also mapping-FoldToASCII.txt, which, when used with 
MappingCharFilterFactory, corresponds to ASCIIFoldingFilterFactory.
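
For example, roughly (the charFilter goes at the top of the analyzer, before
the tokenizer, and the mapping file ships with the Solr example, so entries
can be removed or edited to limit which characters are folded):

  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>

  # a typical line inside the mapping file
  "é" => "e"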

Steve


Re: WordDelimiter and stemEnglishPossessive doesn't work

2011-06-14 Thread roySolr
THANK YOU!!

I thought I could only use one character for the pattern. Now I use a
regular expression :)



I don't need the WordDelimiter anymore. It splits on # and whitespace:

dataset: mcdonald's#burgerking#Free record shop#h&m

mcdonald's
burgerking
free
record
shop
h&m

This is exactly how we want it.
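
The tokenizer definition itself was stripped from the archive; here is a guess
at the kind of configuration that produces the tokens above (the exact pattern
is an assumption):

  <fieldType name="text_split" class="solr.TextField">
    <analyzer>
      <!-- split on '#' and whitespace only; apostrophes and '&' stay in the token -->
      <tokenizer class="solr.PatternTokenizerFactory" pattern="[#\s]+"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>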

--
View this message in context: 
http://lucene.472066.n3.nabble.com/WordDelimiter-and-stemEnglishPossessive-doesn-t-work-tp3047678p3062984.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: WordDelimiter and stemEnglishPossessive doesn't work

2011-06-14 Thread Erick Erickson
It's a little obscure, but you can use
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternReplaceCharFilterFactory

in front of WhitespaceTokenizer if you prefer. Note that
a CharFilterFactory is different than a FilterFactory, so
read carefully ..

Best
Erick

On Tue, Jun 14, 2011 at 6:15 AM, roySolr  wrote:
> Ok, with catenatewords the index term will be mcdonalds. But that's not what
> i want.
>
> I only use the wordDelimiter to split on whitespace. I have already used the
> PatternTokenizerFactory so i can't use the whitespacetokenizer.
>
> I want my index looks like this:
>
> dataset: mcdonald's#burgerking#Free record shop#h&m
>
> mcdonald's
> burgerking
> free
> record
> shop
> h&m
>
> Can i configure the wordDelimiter as an whitespaceTokenizer? So it only
> splits on whitespaces and nothing more(not removing 's etc)..
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/WordDelimiter-and-stemEnglishPossessive-doesn-t-work-tp3047678p3062461.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: AndQueryNode to NearSpanQuery

2011-06-14 Thread Erick Erickson
<<The issue with doing a boosted phrase clause is that the terms have to be in order to be considered a hit.>>

This is not true. "slop" includes re-arranging the terms, it just takes
a little more slop (see Lucene In Action for an excellent pictorial
explanation).

Best
Erick

On Mon, Jun 13, 2011 at 10:45 PM, mtraynham  wrote:
> Hey Erick,
>
> Thanks for the feedback, but I it doesn't particularly solve my problem.
> The issue with doing a boosted phrase clause is that the terms have to be in
> order to be considered a hit.  I'm seeking a solution where the terms can be
> near another term in any direction.
>
> If I were to use a phrase clause, I would have to permute every possible
> ordering to get all possible solutions, i.e. Tom Cruise Dancing, Tom
> Dancing Cruise, Dancing Cruise Tom, etc.
>
> I did try using a TokenizedPhraseQueryNode, which takes a phrase and breaks
> down each word into FieldableNodes.  I did have luck with passing most of
> the later processor mutations, but some still affected it and therefore made
> parsing each field node into a SpanQuery pretty hard.
>
> If I can make a FieldableTokenizedPhraseQueryNode to have untouched
> FieldableNode children, then I could translate all the children into
> SpanQueries and put them into a subsequent NearSpanQuery at the builder
> stage, but this is still pretty incompatible with most of the pipeline.
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/AndQueryNode-to-NearSpanQuery-tp3061286p3061607.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Adding documents in a batch using Solrj

2011-06-14 Thread Erick Erickson
Have fun. Note that the intent is to have the logging/record
keeping in the superclass (whose name escapes me), and each
update type should be able to use that.

Best
Erick

On Mon, Jun 13, 2011 at 11:15 PM, karthik  wrote:
> Thanks Erick. Will certainly take a look.
>
> I am looking to do this for binary objects since i have started with that.
>
> -- karthik
>
> On Mon, Jun 13, 2011 at 8:52 PM, Erick Erickson 
> wrote:
>
>> Take a look at SOLR-445, I started down this road a while
>> ago but then got distracted. If you'd like to pick it up and
>> take it farther, feel free. I haven't applied that patch in a
>> while, so I don't know how easy it will be to apply.
>>
>> Last I left it, it would do much of what you're asking for for
>> xml documents fed to solr, and I was going to get around to
>> some of the other input types but haven't yet. That was what
>> committing this was waiting on.
>>
>> Best
>> Erick
>>
>> On Mon, Jun 13, 2011 at 4:39 PM, karthik  wrote:
>> > Hi Everyone,
>> >
>> > I am trying to use Solrj to add documents to my solr index. In the
>> process
>> > of playing around with the implementation I noticed that when we add
>> > documents in a batch to Solr the response back from solr is just - status
>> &
>> > qtime. I am using Solr 3.1 right now.
>> >
>> > I came across the following scenario that I would like to handle
>> carefully
>> > -
>> >
>> > When there are exceptions caused by one of the document within the batch
>> > then the documents after that specific documents doesnt make it to the
>> index
>> > ie., lets say out of 100 documents trying to get added, doc 56 has an
>> issue
>> > due to schema restrictions, etc., then docs 57 - 100 dont make it to the
>> > index. Even for docs 1 - 55 to get indexed I need the commit outside the
>> > exception handling block of the addBeans() method.
>> >
>> > In the above scenario I would like Solr (or) Solrj to return the doc id's
>> > that got indexed successfully or the doc id's that failed. I would also
>> like
>> > for the documents 57 - 100 to be processed & not get dropped abruptly
>> > because doc 56 had an issue.
>> >
>> > Not sure if there is a way for me to get these details/functionality
>> right
>> > now. If I cant get them, I can try to take a crack at developing a patch.
>> I
>> > would require a lot more help in the latter scenario ;-)
>> >
>> > Thanks in advance.
>> >
>> > -- karthik
>> >
>>
>


Re: Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)

2011-06-14 Thread Peter Sturge
SOLR-1872 doesn't add discrete booleans to the query, it does it
programmatically, so you shouldn't see this problem. (If you have a
look at the code, you'll see how it filters queries.)
I suppose you could modify SOLR-1872 to use an in-memory,
dynamically-updated user list (+ associated filters) instead of using
the ACL file.
This would give you the 'changing users' and 'expiry' functionality you need.



On Tue, Jun 14, 2011 at 10:08 AM, Sujatha Arun  wrote:
> Thanks Peter , for your input .
>
> I really would like a document- and schema-agnostic solution, as in
> SOLR-1872.
>
> Am I right in my assumption that SOLR-1872 is the same as the solution that
> we currently have, where we add a filter query of the products to the original
> query, and hence (SOLR-1872) will also run into the "too many boolean clauses"
> expansion error?
>
> Regards
> Sujatha
>
>
> On Tue, Jun 14, 2011 at 1:53 PM, Peter Sturge wrote:
>
>> Hi,
>>
>> SOLR-1834 is good when the original documents' ACL is accessible.
>> SOLR-1872 is good where the usernames are persistent - neither of
>> these really fit your use case.
>> It sounds like you need more of an 'in-memory', transient access
>> control mechanism. Does the access have to exist beyond the user's
>> session (or the Solr vm session)?
>> Your best bet is probably something like a custom SearchComponent or
>> similar, that keeps track of user purchases, and either adjusts/limits
>> the query or the results to suit.
>> With your own module in the query chain, you can then decide when the
>> 'expiry' is, and limit results accordingly.
>>
>> SearchComponent's are pretty easy to write and integrate. Have a look at:
>>   http://wiki.apache.org/solr/SearchComponent
>> for info on SearchComponent and its usage.
>>
>>
>>
>>
>> On Tue, Jun 14, 2011 at 8:18 AM, Sujatha Arun  wrote:
>> > Hello,
>> >
>> >
>> > Our Use Case is as follows
>> >
>> > Several solr webapps (one JVM) ,Each webapp catering to one client .Each
>> > client has their users who can purchase products from the  site .Once
>> they
>> > purchase ,they have full access to the products ,other wise they can only
>> > view details .
>> >
>> > The products are not tied to the user at the document  level, simply
>> because
>> > , once the purchase duration of product expires ,the user will no longer
>> > have access to that product.
>> >
>> > So a search for a product once the user logs in and searches for only the
>> > products that he has access to Will translate to something like this .
>> ,the
>> > product ids are obtained form the db  for a particular user and can run
>> > into  n  number.
>> >
>> >  &fq=product_id(100 10001  ..n number)
>> >
>> > but we are currently running into too many Boolean expansion error .We
>> are
>> > not able to tie the user also into roles as each user is mainly any one
>> who
>> > comes to site and purchases a product .
>> >
>> > Given the 2 solutions above as SOLR -1872 where we have to specify the
>> user
>> > in an ACL file  and
>> > query for allow and deny also translates to what  we are trying to do
>> above
>> >
>> > In Case of SOLR 1834 ,we are required to use a crawler (APACHE
>> manifoldCF)
>> > for indexing the Permissions(also the data) into the document and then
>> > querying on it ,this will also not work in our scenario as we have  n web
>> > apps having the same requirement  ,it would be tedious to set this up for
>> > each webapp and also the  requirement that once the user permission for a
>> > product is revoked ,then he should not be able to search  on the same
>> within
>> > his subscribed products.
>> >
>> > Any pointers would be helpful and sorry about the lengthy description.
>> >
>> > Regards
>> > Sujatha
>> >
>>
>


Re: WordDelimiter and stemEnglishPossessive doesn't work

2011-06-14 Thread lee carroll
Do you need the word delimiter?
#|\s
I think it's just regex in the pattern tokeniser - I might be wrong though?




On 14 June 2011 11:15, roySolr  wrote:
> Ok, with catenatewords the index term will be mcdonalds. But that's not what
> i want.
>
> I only use the wordDelimiter to split on whitespace. I have already used the
> PatternTokenizerFactory so i can't use the whitespacetokenizer.
>
> I want my index looks like this:
>
> dataset: mcdonald's#burgerking#Free record shop#h&m
>
> mcdonald's
> burgerking
> free
> record
> shop
> h&m
>
> Can i configure the wordDelimiter as an whitespaceTokenizer? So it only
> splits on whitespaces and nothing more(not removing 's etc)..
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/WordDelimiter-and-stemEnglishPossessive-doesn-t-work-tp3047678p3062461.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: disable sort by score

2011-06-14 Thread Ahmet Arslan


--- On Tue, 6/14/11, Jason, Kim  wrote:

> From: Jason, Kim 
> Subject: Re: disable sort by score
> To: solr-user@lucene.apache.org
> Date: Tuesday, June 14, 2011, 6:57 AM
> Thanks to reply, Erick!
> 
> Actually, I need sort by score.
> I was just curious whether a search result without sorting is
> possible.
> Then I found
> http://lucene.472066.n3.nabble.com/MaxRows-and-disabling-sort-td2260650.html
> In above context, Chris Hostetter-3 wrote
> ++
> http://wiki.apache.org/solr/CommonQueryParameters#sort
> "You can sort by index id using sort=_docid_ asc or
> sort=_docid_ desc"
> 
> if you specify _docid_ asc then solr should return as soon
> as it finds the
> first N matching results w/o scoring all docs (because no
> score will be
> computed) 
> ++
> 
> I tried to check performance using _docid_ asc.
> But _docid_ didn't work in distributed search.
> So I asked to find out whether there is another method.

In Lucene it is possible with a Collector.

"...Collector decouples the score from the collected doc: the score computation 
is skipped entirely if it's not needed. Collectors that do need the score 
should implement the..." [1]

[1]http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/search/Collector.html

But I am not sure how to plug it into Solr. There has been some discussion
about it:

http://search-lucene.com/?q=custom+collector&fc_project=Solr&fc_type=mail+_hash_+user


Re: ISOLatin1AccentFilterFactory vs ASCIIFoldingFilterFactory

2011-06-14 Thread Nils Weinander
On Tue, Jun 14, 2011 at 1:11 PM, Ahmet Arslan  wrote:
>
> With MappingCharFilterFactory you have full control over which characters
> are folded. You can see the default mappings in the
> mapping-ISOLatin1Accent.txt file.
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.MappingCharFilterFactory

Thanks Ahmet! Exactly what I needed.

Nils Weinander


Re: ISOLatin1AccentFilterFactory vs ASCIIFoldingFilterFactory

2011-06-14 Thread Ahmet Arslan


--- On Tue, 6/14/11, Nils Weinander  wrote:

> From: Nils Weinander 
> Subject: ISOLatin1AccentFilterFactory vs ASCIIFoldingFilterFactory
> To: solr-user@lucene.apache.org
> Date: Tuesday, June 14, 2011, 12:18 PM
> Hi all, I'm new to the list (but not
> totally new to Solr).
> 
> The documentation states that ISOLatin1AccentFilterFactory
> is deprecated
> in favour of ASCIIFoldingFilterFactory:
> 
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ISOLatin1AccentFilterFactory
> 
> I see problems with this. If I have understood
> ASCIIFoldingFilterFactory
> correctly it folds both accented characters like 'é' to
> 'e' and national
> characters like 'ö' to 'o'. The former is desirable, the
> latter very much not
> when indexing for example scandinavian languages. Is there
> a way to
> limit which characters are folded?

With MappingCharFilterFactory you have full control over which characters are
folded. You can see the default mappings in the
mapping-ISOLatin1Accent.txt file.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.MappingCharFilterFactory
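
As a rough sketch (the field type name, the mapping file name, and the specific
entries below are only illustrative), you would point the char filter at a
trimmed-down mapping file and simply leave out the characters you do not want
folded:

  <fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-accents-only.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  # mapping-accents-only.txt (excerpt)
  "é" => "e"
  "è" => "e"
  "ê" => "e"
  # no entries for "ö", "ä" or "å", so they are indexed unchanged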


Re: Using Edismax

2011-06-14 Thread Ahmet Arslan
> Thx for the reply. But what can I do to avoid getting 2010.
> I wanted a phrase query with underscore, so it would return
> results with underscore2010 only.

For example, you can remove WordDelimiterFilterFactory from your field type
definition.

Depending on your needs, you can use another fieldType for your url field.
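
As an illustration only (the field type name and filters below are assumptions,
not your actual text_rev definition), a field type without
WordDelimiterFilterFactory stops "2010" from being split out of tokens like
OpenTRs2010 as a separate term:

  <fieldType name="text_url" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>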




Re: WordDelimiter and stemEnglishPossessive doesn't work

2011-06-14 Thread roySolr
Ok, with catenateWords the index term will be mcdonalds. But that's not what
I want.

I only use the wordDelimiter to split on whitespace. I have already used the
PatternTokenizerFactory, so I can't use the WhitespaceTokenizer.

I want my index to look like this:

dataset: mcdonald's#burgerking#Free record shop#h&m 

mcdonald's
burgerking
free
record
shop
h&m 

Can I configure the wordDelimiter as a whitespace tokenizer, so it only
splits on whitespace and nothing more (not removing 's, etc.)?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/WordDelimiter-and-stemEnglishPossessive-doesn-t-work-tp3047678p3062461.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Query on Synonyms feature in Solr

2011-06-14 Thread roySolr
Maybe you can try to escape the synonyms so they are not tokenized on whitespace:

Private\ schools,NGO\ Schools,Unaided\ schools 
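
Those escaped entries would then go into the synonyms file referenced from the
analyzer, along these lines (the attribute values here are just typical
defaults, not taken from your actual config):

  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>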

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Query-on-Synonyms-feature-in-Solr-tp3058197p3062392.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Using Edismax

2011-06-14 Thread Tirthankar Chatterjee
Erick,
Thx for the reply. But what can I do to avoid getting 2010? I wanted a phrase
query with underscore, so it would return results with underscore2010 only.

Sent from iPod

On Jun 13, 2011, at 3:47 PM, "Erick Erickson"  wrote:

> You haven't supplied the information that's really
> needed to help here, please review:
> 
> http://wiki.apache.org/solr/UsingMailingLists
> 
> But at a guess your analysis chain contains
> WordDelimiterFilterFactory, which is splitting
> the input stream into tokens on letter/number
> changes, and capitalization changes. So you're
> getting "2010" indexed as a separate token and
> you're also searching on it...
> 
> Best
> Erick
> 
> On Mon, Jun 13, 2011 at 3:07 PM, Tirthankar Chatterjee
>  wrote:
>> We are using edismax for query and the query fired is (url:_2010)
>> 
>> http://redcarpet2.dm2.commvault.com:27000/solr/select/?q=url: 
>> 2010&version=2.2&start=0&rows=10&indent=on&defType=edismax
>> 
>> the url field is of type text_rev
>> 
>> Results that SOLR returns has 1 extra item which we don't want to get. How 
>> do we achieve that?
>> 
>> Results:
>> 
>> SPC265_SharePoint_2010.pptx
>> OpenTRs2010.xlsx(we don't want this to be returned)
>> 
>> 
>> Thanks in advance!!!
>> 
>> Tirthankar
>> 
>> 
>> **Legal Disclaimer***
>> "This communication may contain confidential and privileged
>> material for the sole use of the intended recipient. Any
>> unauthorized review, use or distribution by others is strictly
>> prohibited. If you have received the message in error, please
>> advise the sender by reply email and delete the message. Thank
>> you."
>> *


ISOLatin1AccentFilterFactory vs ASCIIFoldingFilterFactory

2011-06-14 Thread Nils Weinander
Hi all, I'm new to the list (but not totally new to Solr).

The documentation states that ISOLatin1AccentFilterFactory is deprecated
in favour of ASCIIFoldingFilterFactory:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ISOLatin1AccentFilterFactory

I see problems with this. If I have understood ASCIIFoldingFilterFactory
correctly, it folds both accented characters like 'é' to 'e' and national
characters like 'ö' to 'o'. The former is desirable, but the latter is very
much not when indexing, for example, Scandinavian languages. Is there a way to
limit which characters are folded?

-- 

Nils Weinander


Re: Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)

2011-06-14 Thread Sujatha Arun
Thanks Peter , for your input .

I really would like a document- and schema-agnostic solution, as in
SOLR-1872.

Am I right in my assumption that SOLR-1872 is the same as the solution that
we currently have, where we add a filter query of the products to the original
query, and hence (SOLR-1872) will also run into the "too many boolean clauses"
expansion error?

Regards
Sujatha


On Tue, Jun 14, 2011 at 1:53 PM, Peter Sturge wrote:

> Hi,
>
> SOLR-1834 is good when the original documents' ACL is accessible.
> SOLR-1872 is good where the usernames are persistent - neither of
> these really fit your use case.
> It sounds like you need more of an 'in-memory', transient access
> control mechanism. Does the access have to exist beyond the user's
> session (or the Solr vm session)?
> Your best bet is probably something like a custom SearchComponent or
> similar, that keeps track of user purchases, and either adjusts/limits
> the query or the results to suit.
> With your own module in the query chain, you can then decide when the
> 'expiry' is, and limit results accordingly.
>
> SearchComponent's are pretty easy to write and integrate. Have a look at:
>   http://wiki.apache.org/solr/SearchComponent
> for info on SearchComponent and its usage.
>
>
>
>
> On Tue, Jun 14, 2011 at 8:18 AM, Sujatha Arun  wrote:
> > Hello,
> >
> >
> > Our Use Case is as follows
> >
> > Several solr webapps (one JVM) ,Each webapp catering to one client .Each
> > client has their users who can purchase products from the  site .Once
> they
> > purchase ,they have full access to the products ,other wise they can only
> > view details .
> >
> > The products are not tied to the user at the document  level, simply
> because
> > , once the purchase duration of product expires ,the user will no longer
> > have access to that product.
> >
> > So a search for a product once the user logs in and searches for only the
> > products that he has access to Will translate to something like this .
> ,the
> > product ids are obtained form the db  for a particular user and can run
> > into  n  number.
> >
> >  &fq=product_id(100 10001  ..n number)
> >
> > but we are currently running into too many Boolean expansion error .We
> are
> > not able to tie the user also into roles as each user is mainly any one
> who
> > comes to site and purchases a product .
> >
> > Given the 2 solutions above as SOLR -1872 where we have to specify the
> user
> > in an ACL file  and
> > query for allow and deny also translates to what  we are trying to do
> above
> >
> > In Case of SOLR 1834 ,we are required to use a crawler (APACHE
> manifoldCF)
> > for indexing the Permissions(also the data) into the document and then
> > querying on it ,this will also not work in our scenario as we have  n web
> > apps having the same requirement  ,it would be tedious to set this up for
> > each webapp and also the  requirement that once the user permission for a
> > product is revoked ,then he should not be able to search  on the same
> within
> > his subscribed products.
> >
> > Any pointers would be helpful and sorry about the lengthy description.
> >
> > Regards
> > Sujatha
> >
>


Re: Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)

2011-06-14 Thread Peter Sturge
Hi,

SOLR-1834 is good when the original documents' ACL is accessible.
SOLR-1872 is good where the usernames are persistent - neither of
these really fit your use case.
It sounds like you need more of an 'in-memory', transient access
control mechanism. Does the access have to exist beyond the user's
session (or the Solr vm session)?
Your best bet is probably something like a custom SearchComponent or
similar, that keeps track of user purchases, and either adjusts/limits
the query or the results to suit.
With your own module in the query chain, you can then decide when the
'expiry' is, and limit results accordingly.

SearchComponents are pretty easy to write and integrate. Have a look at:
   http://wiki.apache.org/solr/SearchComponent
for info on SearchComponent and its usage.
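
A custom component gets plugged in via solrconfig.xml, roughly like this (the
component name and class below are placeholders for whatever you write, not an
existing module):

  <searchComponent name="purchase-filter" class="com.example.PurchaseFilterComponent"/>

  <requestHandler name="/select" class="solr.SearchHandler">
    <arr name="last-components">
      <str>purchase-filter</str>
    </arr>
  </requestHandler>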




On Tue, Jun 14, 2011 at 8:18 AM, Sujatha Arun  wrote:
> Hello,
>
>
> Our Use Case is as follows
>
> Several solr webapps (one JVM) ,Each webapp catering to one client .Each
> client has their users who can purchase products from the  site .Once they
> purchase ,they have full access to the products ,other wise they can only
> view details .
>
> The products are not tied to the user at the document  level, simply because
> , once the purchase duration of product expires ,the user will no longer
> have access to that product.
>
> So a search for a product once the user logs in and searches for only the
> products that he has access to Will translate to something like this . ,the
> product ids are obtained form the db  for a particular user and can run
> into  n  number.
>
>  &fq=product_id(100 10001  ..n number)
>
> but we are currently running into too many Boolean expansion error .We are
> not able to tie the user also into roles as each user is mainly any one who
> comes to site and purchases a product .
>
> Given the 2 solutions above as SOLR -1872 where we have to specify the user
> in an ACL file  and
> query for allow and deny also translates to what  we are trying to do above
>
> In Case of SOLR 1834 ,we are required to use a crawler (APACHE manifoldCF)
> for indexing the Permissions(also the data) into the document and then
> querying on it ,this will also not work in our scenario as we have  n web
> apps having the same requirement  ,it would be tedious to set this up for
> each webapp and also the  requirement that once the user permission for a
> product is revoked ,then he should not be able to search  on the same within
> his subscribed products.
>
> Any pointers would be helpful and sorry about the lengthy description.
>
> Regards
> Sujatha
>


Re: AndQueryNode to NearSpanQuery

2011-06-14 Thread karsten-solr
Hi member of digitalsmiths,

I also implemented SpanNearQueryNode and some QueryNodeProcessors.
Most probably you can solve your problem by using
QueryNode#setTag:
In QueryNodeProcessor#preProcessNode you can set, remove and reset a Tag to
mark the AndNodes that should become SpanNodes;
after this you can use the QueryNodeProcessor#postProcessNode method to
substitute these AndNodes in your OrNodes.

(But be aware of https://issues.apache.org/jira/browse/LUCENE-3045 )

Best regards
  Karsten

 Original-Nachricht 
> Datum: Mon, 13 Jun 2011 19:45:49 -0700 (PDT)
> Von: mtraynham 
> An: solr-user@lucene.apache.org
> Betreff: AndQueryNode to NearSpanQuery
> ...
> The SpanNearQueryNode is a class I made that implements FieldableNode
> and extends QueryNodeImpl (as I want all Fieldable children to be from 
> the same field, therefore just remembering the terms).  Plus it
>  maintains a distance or slop factor and a inOrder boolean.
> 
> The problem here is that I can't keep the children from getting
>  manipulated further down the pipeline, because I want my 
> NearSpanQueryBuilder to use it's original children nodes and at the same 
> time be cloned/changed/etc.  QueryNodeImpl has many private and final 
> methods and you can't override setChildren, etc, etc., but I'd rather 
> stay away from monkey patching. 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/AndQueryNode-to-NearSpanQuery-tp3061286p3061607.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Document Level Security (SOLR-1872 ,SOLR,SOLR-1834)

2011-06-14 Thread Sujatha Arun
Hello,


Our Use Case is as follows

Several Solr webapps (one JVM), each webapp catering to one client. Each
client has their users who can purchase products from the site. Once they
purchase, they have full access to the products; otherwise they can only
view details.

The products are not tied to the user at the document level, simply because,
once the purchase duration of a product expires, the user will no longer
have access to that product.

So a search, once the user logs in and searches only within the products that
he has access to, will translate to something like this (the product ids are
obtained from the DB for a particular user and can run to n in number):

 &fq=product_id(100 10001  ..n number)

but we are currently running into the "too many boolean clauses" expansion
error. We are also not able to tie the user into roles, as each user is mainly
anyone who comes to the site and purchases a product.

Given the 2 solutions above: in SOLR-1872 we have to specify the user
in an ACL file, and
querying for allow and deny also translates to what we are trying to do above.

In the case of SOLR-1834, we are required to use a crawler (Apache ManifoldCF)
for indexing the permissions (and also the data) into the document and then
querying on it. This will also not work in our scenario, as we have n web
apps having the same requirement; it would be tedious to set this up for
each webapp. There is also the requirement that once the user's permission for
a product is revoked, he should not be able to search on the same within
his subscribed products.

Any pointers would be helpful and sorry about the lengthy description.

Regards
Sujatha