Querying for multi-word synonyms

2009-03-31 Thread Mark Ferguson
Hi list,

I've been working for the last couple of days on some synonym functionality
and I've been reading about the limitations regarding query-time multi-word
synonyms. All the recommended solutions that I've come across so far suggest
using the SynonymFilter at index time rather than at query time.

Unfortunately, I have to use SynonymFilter at query time due to the nature
of the data I'm indexing. At index time, all I have are keywords but at
query time I will have some semantic markup which allows me to expand into
synonyms. I am wondering if any progress has been made into making query
time synonym searching work correctly. If not, does anyone have some ideas
for alternatives to using SynonymFilter? The only thing I can think of is to
simply create a custom BooleanQuery for the search and feed the synonyms in
manually, but then I am missing out on all the functionality of the dismax
query parser. Any ideas are appreciated, thanks very much.

Mark
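
A minimal sketch of the manual BooleanQuery alternative mentioned above,
assuming the modern Lucene builder API (the 2009-era API differs) and
borrowing the "dba" / "database administrator" synonym from the thread below:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

class SynonymQueryBuilder {
    static Query expand() {
        // OR the original keyword with its multi-word synonym as a phrase,
        // so either form matches without requiring both.
        BooleanQuery.Builder bq = new BooleanQuery.Builder();
        bq.add(new TermQuery(new Term("body_text", "dba")),
               BooleanClause.Occur.SHOULD);

        PhraseQuery.Builder phrase = new PhraseQuery.Builder();
        phrase.add(new Term("body_text", "database"));
        phrase.add(new Term("body_text", "administrator"));
        bq.add(phrase.build(), BooleanClause.Occur.SHOULD);

        return bq.build();
    }
}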


Re: Problems with synonyms

2009-03-31 Thread Mark Ferguson
It's okay not to use the SynonymFilter at both query and index time. In
fact, you would really only want to use one or the other: either expand
synonyms at index time, or at query time, but not both.

I have read that there are issues with multi-word synonyms and my guess is
that this is where your problem is, but my understanding of the issue is
limited. Hopefully someone else can provide more insight.

Mark


On Tue, Mar 31, 2009 at 2:04 PM, Vernon Chapman wrote:

> Leonardo,
>
> The only other thing I can think of is to check the
> field type in the schema.xml file and make sure that you are using the
> same filters.
>
> For example, if your index analyzer uses the solr.SynonymFilterFactory
> filter, make sure your query analyzer also uses the same filter class.
>
> Other than that I am stuck. Hope that helps.
>
> Vernon
>
>
>
> On 3/31/09 3:39 PM, "Leonardo Dias"  wrote:
>
> > Hi, Vernon!
> >
> > We tried both approaches: OR and AND. In both cases, the results were
> > smaller when the synonyms were set up, with no change at all when it
> > comes to synonyms.
> >
> > Any other ideas? Is it likely to be a bug?
> >
> > Best,
> >
> > Leonardo
> >
> > Vernon Chapman escreveu:
> >>
> >> Leonardo,
> >>
> >> I am no expert, but I would check to make sure that the
> >> defaultOperator parameter in your schema.xml file is set to
> >> OR rather than AND.
> >>
> >> Vernon
> >>
> >> On 3/31/09 3:24 PM, "Leonardo Dias" 
> >>   wrote:
> >>
> >>
> >>
> >>>
> >>> Hello there. How are you guys?
> >>>
> >>> We're having problems with synonyms here and I thought that maybe you
> >>> guys could help us on how SOLR works for synonyms.
> >>>
> >>> The problem is the following: I'd like to setup a synonym like "dba,
> >>> database administrator".
> >>>
> >>> Instead of increasing the number of results for the keyword "dba", the
> >>> results got smaller and it only brought me back results that had both
> >>> the keywords "dba" and "database administrator" at the same time
> instead
> >>> of bringing back both "dba" and "database administrator" as expected
> >>> since our synonym configuration is using expand=true.
> >>>
> >>> Since this was not the behavior in the past, I'd like to know
> >>> whether something changed in the solr/lucene internals so that this
> >>> functionality is now lost, or if I'm doing something wrong with my
> >>> setup.
> >>>
> >>> Currently all fields pass through the Synonym filter factory. The
> >>> analysis shows me that it tries to search for database administrator
> and
> >>> DBA. A debug query also shows me that the query it's trying to do is
> >>> something like this:
> >>>
> >>> +DisjunctionMaxQuery((title:"(dba datab) administr")~0.1)
> >>> DisjunctionMaxQuery((title:"(dba datab) administr"^10.0 |
> >>> observation:"(dba datab) administr"^10.0 | description:"(dba datab)
> >>> administr"^10.0 | company:"(dba datab) administr")~0.1)
> >>>
> >>> The problem is: when I search for this, I get 5 results. When I search
> >>> for dba only, without the "dba, database administrator" line in the
> >>> synonyms.txt file, I get more than 100 results.
> >>>
> >>> Do you guys know why this is happening?
> >>>
> >>> Thank you,
> >>>
> >>> Leonardo
> >>>
> >>>

Filter query for number of matches

2009-03-02 Thread Mark Ferguson
Hi,

I am wondering if there is a way to set a filter on the frequency of a
keyword match in a document. For example, if I search for the word "cheerio",
I would like that word to appear at least x times in a field in order for
the document to be returned. I know that Lucene internals already give
higher scores to more matches, but I would like to set a minimum barrier to
entry.

Thank you,

Mark
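
One possible approach, assuming a Solr version that provides the termfreq()
function and the frange parser (neither is confirmed in this thread), with a
hypothetical body_text field:

import org.apache.solr.client.solrj.SolrQuery;

class MinTermFreqQuery {
    static SolrQuery build() {
        // Match "cheerio", but keep only docs where it occurs at least
        // 3 times in body_text.
        SolrQuery q = new SolrQuery("cheerio");
        q.addFilterQuery("{!frange l=3}termfreq(body_text,'cheerio')");
        return q;
    }
}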


Re: Near real-time search of user data

2009-02-19 Thread Mark Ferguson
Thanks Noble and Otis for your suggestions.

After reading more messages on the mailing list relating to this problem, I
decided to implement one suggestion which was to keep an archive index and a
smaller delta index containing only recent updates, then do a distributed
search across them. The delta index is small so can handle rapid commits
(every 1-2 seconds). This setup works well for my architecture because it is
easy to keep track of recent changes in the database and then send those to
the archive index every hour or so, then clear out the delta.

I really like your ideas about closing inactive indexes when using a
multicore setup; having too many indexes open was definitely the issue
plaguing me. Thanks for your great ideas and the time you take on this
project!

Mark
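
For reference, a minimal sketch of the delta-plus-archive query described
above, assuming two cores named delta and archive on one host (the names and
URLs are hypothetical):

import org.apache.solr.client.solrj.SolrQuery;

class DeltaArchiveSearch {
    static SolrQuery build(String userQuery) {
        // Distributed search across the rapidly committed delta core and
        // the large archive core.
        SolrQuery q = new SolrQuery(userQuery);
        q.setParam("shards",
                "localhost:8080/solr/delta,localhost:8080/solr/archive");
        return q;
    }
}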



On Thu, Feb 19, 2009 at 9:31 PM, Noble Paul നോബിള്‍ नोब्ळ् <
noble.p...@gmail.com> wrote:

> We have a similar use case and I have raised an issue for the same
> (SOLR-880).
> Currently we are using an internal patch and we hope to submit one soon.
>
> We also use an LRU-based automatic loading/unloading feature: if a
> request comes in for a core that is 'STOPPED', the core is 'STARTED'
> and the request is served.
>
> We keep an upper limit on the number of cores to be kept loaded and if
> the limit is crossed, the least recently used core is 'STOPPED'.
>
> --Noble
>
>
> On Fri, Feb 20, 2009 at 8:53 AM, Otis Gospodnetic
>  wrote:
> >
> > I've used a similar strategy for Simpy.com, but with raw Lucene and not
> Solr.  The crucial piece is to close (inactive) user indices periodically
> and thus free the memory.  Are you doing the same with your per-user Solr
> cores and still running into memory issues?
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> >
> > - Original Message 
> >> From: Mark Ferguson 
> >> To: solr-user@lucene.apache.org
> >> Sent: Friday, February 20, 2009 1:14:15 AM
> >> Subject: Near real-time search of user data
> >>
> >> Hi,
> >>
> >> I am trying to come up with a strategy for a solr setup in which a
> user's
> >> indexed data can be nearly immediately available to them for search. My
> >> current strategy (which is starting to cause problems) is as follows:
> >>
> >>   - each user has their own personal index (core), which gets committed
> >> after each update
> >>   - there is a main index which is basically an aggregate of all user
> >> indexes. This index gets committed every 5 minutes or so.
> >>
> >> In this way, I can search a user's personal index to get real-time
> results,
> >> and concatenate the world results from the main index, which aren't as
> >> important to be immediate.
> >>
> >> This multicore strategy worked well in test scenarios but as the user
> >> indexes get larger it is starting to fall apart as I run into memory
> issues
> >> in maintaining too many cores. It's not realistic to dedicate a new
> machine
> >> to every 5K-10K users and I think this is what I will have to do to
> maintain
> >> the multicore strategy.
> >>
> >> So I am hoping that someone will be able to provide some tips on how to
> >> accomplish what I am looking for. One option is to simply send a commit
> to
> >> the main index every couple seconds, but I was hoping someone with
> >> experience could shed some light on whether this is a viable option
> before I
> >> attempt that route (i.e. can commits be sent that frequently on a large
> >> index?). The indexes are distributed but they could still be in the
> 2-100GB
> >> range.
> >>
> >> Thanks very much for any suggestions!
> >>
> >> Mark
> >
> >
>
>
>
> --
> --Noble Paul
>


Near real-time search of user data

2009-02-19 Thread Mark Ferguson
Hi,

I am trying to come up with a strategy for a solr setup in which a user's
indexed data can be nearly immediately available to them for search. My
current strategy (which is starting to cause problems) is as follows:

  - each user has their own personal index (core), which gets committed
after each update
  - there is a main index which is basically an aggregate of all user
indexes. This index gets committed every 5 minutes or so.

In this way, I can search a user's personal index to get real-time results,
and concatenate the world results from the main index, which aren't as
important to be immediate.

This multicore strategy worked well in test scenarios but as the user
indexes get larger it is starting to fall apart as I run into memory issues
in maintaining too many cores. It's not realistic to dedicate a new machine
to every 5K-10K users and I think this is what I will have to do to maintain
the multicore strategy.

So I am hoping that someone will be able to provide some tips on how to
accomplish what I am looking for. One option is to simply send a commit to
the main index every couple seconds, but I was hoping someone with
experience could shed some light on whether this is a viable option before I
attempt that route (i.e. can commits be sent that frequently on a large
index?). The indexes are distributed but they could still be in the 2-100GB
range.

Thanks very much for any suggestions!

Mark


UpdateResponse status codes?

2009-02-09 Thread Mark Ferguson
Hello,

I am wondering if the UpdateResponse status codes are documented somewhere?
I haven't been able to find them. I know 0 is success...

Thanks,

Mark
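
For what it's worth, a sketch of reading the status with a recent SolrJ
client (the exact client classes have changed across versions):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;

class UpdateStatusCheck {
    // getStatus() returns the response status code; 0 is success. Most
    // failures surface as thrown exceptions rather than a nonzero status.
    static void addAndCheck(SolrClient client, SolrInputDocument doc)
            throws Exception {
        UpdateResponse rsp = client.add(doc);
        if (rsp.getStatus() != 0) {
            System.err.println("update failed with status " + rsp.getStatus());
        }
    }
}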


Re: Searchers in single/multi-core environments

2009-02-06 Thread Mark Ferguson
>
> > What I'm also curious about is how searchers are handled in a multi-core
> > environment. Does the maxWarmingSearchers argument apply to the entire
> > set of cores, or to each individual core?
>
>
> It applies to one core unless, of course, you are sharing the solrconfig.xml
> with multiple cores. Also, if you call core reload, a new core is created
> (with its own searcher) which replaces the old core.
>
>
Thanks very much for your time and explanation, it is a huge help. Just to
clarify that I am understanding correctly...

For example, if I have 10 cores and maxWarmingSearchers is 2 for each core,
sending a commit to all of them at once will not cause any exceptions,
because each core handles its searchers separately?

Mark


Searchers in single/multi-core environments

2009-02-06 Thread Mark Ferguson
Hello,

My apologies if this topic has already been discussed but I haven't been
able to find a lot of information in the wiki or mailing lists.

I am looking for more information about how searchers work in different
environments. Correct me if I'm mistaken, but my understanding is that in a
single core environment, there is one searcher for the one index which
handles all queries. When a commit occurs, a new searcher is opened up on
the index during the commit. The old searcher is still available until the
commit finishes, at which point the active searcher becomes the new one and
the old searcher is destroyed. This is the purpose of the
maxWarmingSearchers argument -- it is the total number of searchers that can
be open in memory at any given point. What I'm not sure about is how this
number could ever be greater than 2 in a single core environment -- unless
another commit is sent before the new searcher finishes warming?

What I'm also curious about is how searchers are handled in a multi-core
environment. Does the maxWarmingSearchers argument apply to the entire set
of cores, or to each individual core? If the latter, how is this handled if
each core uses a different solrconfig.xml and has a different value for
maxWarmingSearchers?

Thanks for any information that you can provide.

Mark


Re: instanceDir value is incorrect in multicore environment

2009-02-04 Thread Mark Ferguson
I looked at the core status page and it looks like the problem isn't
actually the instanceDir property, but rather dataDir. It's not being
appended to instanceDir so its path is relative to cwd.

I'm using a patched version of Solr with some of my own custom changes
relating to dataDir, so this is probably just something I screwed up, so
feel free to ignore this email.

Mark


On Wed, Feb 4, 2009 at 6:25 PM, Mark Ferguson wrote:

> Hello,
>
> I have a problem with setting the instanceDir property for the cores in
> solr.xml. When I set the value to be relative, it sets it as relative to the
> location from which I started the application, instead of relative to the
> solr.home property.
>
> I am using Tomcat and I am creating a context for each instance of solr
> that I am running in the conf/Catalina/localhost directory, as per the
> instructions. For example, my solr1.xml file looks like this:
>
> <Context docBase="...">
>   <Environment name="solr/home" type="java.lang.String"
>     value="/srv/solr/solr1" override="true" />
> </Context>
>
> My solr.xml file in /srv/solr/solr1 looks something like this:
>
> <solr persistent="true">
>   <cores adminPath="/admin/cores">
>     <core name="..." instanceDir="..." />
>     <core name="..." instanceDir="..." />
>     ...
>   </cores>
> </solr>
>
> Now, from whatever location I start the app, it considers that my root
> directory. For example if I run the command from this prompt:
>
> m...@linux-1hpr:/tmp> ~/bin/tomcat/bin/catalina.sh start
>
> It then creates a 'data' directory in /tmp with subdirectories p20, p0 etc.
>
> Any ideas what I'm doing wrong? Thanks a lot.
>
> Mark
>


instanceDir value is incorrect in multicore environment

2009-02-04 Thread Mark Ferguson
Hello,

I have a problem with setting the instanceDir property for the cores in
solr.xml. When I set the value to be relative, it sets it as relative to the
location from which I started the application, instead of relative to the
solr.home property.

I am using Tomcat and I am creating a context for each instance of solr that
I am running in the conf/Catalina/localhost directory, as per the
instructions. For example, my solr1.xml file looks like this:

<Context docBase="...">
  <Environment name="solr/home" type="java.lang.String"
    value="/srv/solr/solr1" override="true" />
</Context>

My solr.xml file in /srv/solr/solr1 looks something like this:

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="..." instanceDir="..." />
    <core name="..." instanceDir="..." />
    ...
  </cores>
</solr>

Now, from whatever location I start the app, it considers that my root
directory. For example if I run the command from this prompt:

m...@linux-1hpr:/tmp> ~/bin/tomcat/bin/catalina.sh start

It then creates a 'data' directory in /tmp with subdirectories p20, p0 etc.

Any ideas what I'm doing wrong? Thanks a lot.

Mark


Re: Setting dataDir in multicore environment

2009-01-27 Thread Mark Ferguson
This is just what I needed, thank you so much for the quick response! It's
really appreciated!

Mark


On Tue, Jan 27, 2009 at 9:59 PM, Noble Paul നോബിള്‍ नोब्ळ् <
noble.p...@gmail.com> wrote:

> There is a patch given for SOLR-883 .
>
> On Wed, Jan 28, 2009 at 9:43 AM, Noble Paul നോബിള്‍  नोब्ळ्
>  wrote:
> > I shall give a patch today
> >
> > On Tue, Jan 27, 2009 at 11:58 PM, Mark Ferguson
> >  wrote:
> >> Oh I see, thanks for the clarification.
> >>
> >> Unfortunately this brings me back to the same problem I started with:
> >> implicit properties aren't available when managing indexes through the
> >> REST api.
> I
> >> know there is a patch in the works for this issue but I can't wait for
> it.
> >> Is there any way to share the solrconfig.xml file and create indexes
> >> dynamically?
> >>
> >> Mark
> >>
> >>
> >> On Mon, Jan 26, 2009 at 9:02 PM, Noble Paul നോബിള്‍ नोब्ळ् <
> >> noble.p...@gmail.com> wrote:
> >>
> >>> The behavior is expected
> >>> properties set in solr.xml are not implicitly used anywhere.
> >>> you will have to use those variables explicitly in
> >>> solrconfig.xml/schema.xml
> >>> instead of hardcoding dataDir in solrconfig.xml you can use it as a
> >>> variable, ${dataDir}
> >>>
> >>> BTW there is an issue (https://issues.apache.org/jira/browse/SOLR-943)
> >>> which helps you specify the dataDir in solr.xml
> >>>
> >>>
> >>> On Tue, Jan 27, 2009 at 5:19 AM, Mark Ferguson
> >>>  wrote:
> >>> > Hi,
> >>> >
> >>> > In my solr.xml file, I am trying to set the dataDir property the way
> it
> >>> is
> >>> > described in the CoreAdmin page on the wiki:
> >>> >
> >>> > <core name="..." instanceDir="...">
> >>> >   <property name="dataDir" value="..." />
> >>> > </core>
> >>> >
> >>> > However, the property is being completely ignored. It is using
> whatever I
> >>> > have set in the solrconfig.xml file (or ./data, the default value, if
> I
> >>> set
> >>> > nothing in that file). Any idea what I am doing wrong? I am trying
> this
> >>> > approach to avoid using ${solr.core.name} in the solrconfig.xml
> file,
> >>> since
> >>> > dynamic properties are broken for creating cores via the REST api.
> >>> >
> >>> > Mark
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> --Noble Paul
> >>>
> >>
> >
> >
> >
> > --
> > --Noble Paul
> >
>
>
>
> --
> --Noble Paul
>


Re: Setting dataDir in multicore environment

2009-01-27 Thread Mark Ferguson
Oh I see, thanks for the clarification.

Unfortunately this brings me back to the same problem I started with: implicit
properties aren't available when managing indexes through the REST api. I
know there is a patch in the works for this issue but I can't wait for it.
Is there any way to share the solrconfig.xml file and create indexes
dynamically?

Mark


On Mon, Jan 26, 2009 at 9:02 PM, Noble Paul നോബിള്‍ नोब्ळ् <
noble.p...@gmail.com> wrote:

> The behavior is expected
> properties set in solr.xml are not implicitly used anywhere.
> you will have to use those variables explicitly in
> solrconfig.xml/schema.xml
> instead of hardcoding dataDir in solrconfig.xml you can use it as a
> variable, ${dataDir}
>
> BTW there is an issue (https://issues.apache.org/jira/browse/SOLR-943)
> which helps you specify the dataDir in solr.xml
>
>
> On Tue, Jan 27, 2009 at 5:19 AM, Mark Ferguson
>  wrote:
> > Hi,
> >
> > In my solr.xml file, I am trying to set the dataDir property the way it
> is
> > described in the CoreAdmin page on the wiki:
> >
> > <core name="..." instanceDir="...">
> >   <property name="dataDir" value="..." />
> > </core>
> >
> > However, the property is being completely ignored. It is using whatever I
> > have set in the solrconfig.xml file (or ./data, the default value, if I
> set
> > nothing in that file). Any idea what I am doing wrong? I am trying this
> > approach to avoid using ${solr.core.name} in the solrconfig.xml file,
> since
> > dynamic properties are broken for creating cores via the REST api.
> >
> > Mark
> >
>
>
>
> --
> --Noble Paul
>


Setting dataDir in multicore environment

2009-01-26 Thread Mark Ferguson
Hi,

In my solr.xml file, I am trying to set the dataDir property the way it is
described in the CoreAdmin page on the wiki:

<core name="..." instanceDir="...">
  <property name="dataDir" value="..." />
</core>

However, the property is being completely ignored. It is using whatever I
have set in the solrconfig.xml file (or ./data, the default value, if I set
nothing in that file). Any idea what I am doing wrong? I am trying this
approach to avoid using ${solr.core.name} in the solrconfig.xml file, since
dynamic properties are broken for creating cores via the REST api.

Mark
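
A sketch of creating a core with an explicit dataDir from SolrJ, assuming a
recent client where CoreAdminRequest.Create exposes setDataDir (the 1.x-era
API discussed in this thread may differ; the paths are hypothetical):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

class CreateCoreWithDataDir {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client =
                new HttpSolrClient.Builder("http://localhost:8080/solr").build();
        CoreAdminRequest.Create create = new CoreAdminRequest.Create();
        create.setCoreName("p5");
        create.setInstanceDir("/srv/solr/cores/common"); // shared conf dir
        create.setDataDir("/srv/solr/cores/data/p5");    // per-core data dir
        create.process(client);
        client.close();
    }
}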


Re: solr.core.name property not available on core creation

2009-01-26 Thread Mark Ferguson
Thanks Shalin. Any ideas on a workaround in the meantime? I suppose I could
set the instanceDir property to the data directory rather than the common
directory, then set the config and schema explicitly.

Mark

On Mon, Jan 26, 2009 at 12:16 PM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> This is a known issue. I'll try to give a patch soon.
>
> https://issues.apache.org/jira/browse/SOLR-883
>
> On Mon, Jan 26, 2009 at 11:59 PM, Mark Ferguson
> wrote:
>
> > Hi,
> >
> > I am trying to set up a multi-core environment in which I share a single
> > conf folder. I am following the instructions described in this thread:
> > http://www.mail-archive.com/solr-user@lucene.apache.org/msg16954.html
> >
> > In solrconfig.xml, I am setting dataDir to
> > /srv/solr/cores/data/${solr.core.name}.
> >
> > It works well when using existing cores, but when I try to create a core
> > dynamically via the core admin web interface, it gives me an error
> message:
> >
> > HTTP Status 500 - No system property or default value specified for
> > solr.core.name
> >
> > To create the core I am entering the following url:
> >
> >
> > http://localhost:8080/solrcore/admin/cores?action=CREATE&name=p5&instanceDir=.
> >
> > Any suggestions?
> >
> > Mark
> >
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


solr.core.name property not available on core creation

2009-01-26 Thread Mark Ferguson
Hi,

I am trying to set up a multi-core environment in which I share a single
conf folder. I am following the instructions described in this thread:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg16954.html

In solrconfig.xml, I am setting dataDir to
/srv/solr/cores/data/${solr.core.name}.

It works well when using existing cores, but when I try to create a core
dynamically via the core admin web interface, it gives me an error message:

HTTP Status 500 - No system property or default value specified for
solr.core.name

To create the core I am entering the following url:
http://localhost:8080/solrcore/admin/cores?action=CREATE&name=p5&instanceDir=.

Any suggestions?

Mark


Re: Unable to choose request handler

2009-01-06 Thread Mark Ferguson
Thanks, this fixed the problem. Maybe this parameter could be added to the
standard request handler in the sample solrconfig.xml, as it is confusing
that it uses the default request handler's defType even when not using that
handler. I didn't completely understand your explanation, though. Thanks for
the fix.

Mark


On Tue, Jan 6, 2009 at 3:40 PM, Yonik Seeley  wrote:

> On Tue, Jan 6, 2009 at 5:01 PM, Mark Ferguson 
> wrote:
> > It seems that the problem is related to the defType parameter. When I
> > specify defType=, it uses the correct request handler. It seems that it
> is
> > using the correct request handler, but it is defaulting to
> defType=dismax,
> > even though I have not specified that parameter in the standard request
> > handler configuration.
>
> defType only controls the default type of the main query (not the
> whole handler).
> Try defType=lucene
>
> -Yonik
>
> > On Tue, Jan 6, 2009 at 2:57 PM, Mark Ferguson  >wrote:
> >
> >> Hi,
> >>
> >> In my solrconfig.xml file there are two request handlers configured: one
> >> uses defType=dismax, and the other doesn't. However, it seems that when
> the
> >> dismax request handler is set as my default, I have no way of using the
> >> standard request handler . Here is the relevant part of my
> solrconfig.xml:
> >>
> >> <requestHandler name="standard" class="solr.SearchHandler">
> >>   <lst name="defaults">
> >>     <str name="echoParams">explicit</str>
> >>   </lst>
> >> </requestHandler>
> >>
> >> <requestHandler name="dismax" class="solr.SearchHandler" default="true">
> >>   <lst name="defaults">
> >>     <str name="defType">dismax</str>
> >>     <str name="echoParams">explicit</str>
> >>   </lst>
> >> </requestHandler>
> >>
> >>
> >> When I run a query with the parameters qt=standard&debugQuery=true, I
> can
> >> see that it is still using the DismaxQueryParser. There doesn't seem to
> be
> >> any way to use the standard request handler.
> >>
> >> On the other hand, when I set the standard request handler as my
> default,
> >> the behaviour is equally strange. When I specify no qt parameter at all,
> it
> >> uses the standard request handler as it should. However, when I enter
> either
> >> qt=standard or qt=dismax, it uses the dismax request handler!
> >>
> >> So it appears that the only way I can choose the request handler I want
> is
> >> to make the standard request handler my default, then specify no qt
> >> parameter if I want to use it. Has anyone else tried this?
> >>
> >> Mark
> >>
> >
>
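
Yonik's suggestion, expressed as a SolrJ sketch (the handler name is from the
poster's configuration):

import org.apache.solr.client.solrj.SolrQuery;

class StandardParserQuery {
    static SolrQuery build() {
        // Force the lucene (standard) query parser even when the handler's
        // defaults specify defType=dismax.
        SolrQuery q = new SolrQuery("page_title:map");
        q.setParam("qt", "standard");
        q.setParam("defType", "lucene");
        return q;
    }
}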


Re: Unable to choose request handler

2009-01-06 Thread Mark Ferguson
I apologize, entering the defType parameter explicitly has nothing to do
with it, this was a caching issue. I tested the different configurations
thoroughly, and this is what I've come up with:

  - When using 'dismax' request handler as default:
- Queries are always parsed using the dismax parser, whether I use
qt=standard, qt=dismax, or qt=. It _does_ use the correct request handler,
because the echo'd params are correct for that handler. However, it seems to
always be using defType=dismax. I can tell this because when I use the
parameter debugQuery=true, I can see that it is creating a
DisjunctionMaxQuery.
  - When using 'standard' request handler as default:
- The behaviour is as expected. When I enter no qt parameter or
qt=standard, it uses the standard request handler and doesn't use dismax for
the defType. When I use qt=dismax, it uses the dismax request handler and
dismax for the defType.

So the problem is when setting the default request handler to dismax, it
always uses defType=dismax (even though it uses the 'standard' request
handler). defType=dismax does not show up in the echo'd parameters, but I
can tell by using debugQuery=true (and the fact that I get no results when I
specify a field).

Can someone try reproducing this using the configuration I specified in my
first post? Sorry again for being confusing, I got sidetracked by the
caching issue.

Mark



On Tue, Jan 6, 2009 at 3:01 PM, Mark Ferguson wrote:

> It seems that the problem is related to the defType parameter. When I
> specify defType=, it uses the correct request handler. It seems that it is
> using the correct request handler, but it is defaulting to defType=dismax,
> even though I have not specified that parameter in the standard request
> handler configuration.
>
>
> On Tue, Jan 6, 2009 at 2:57 PM, Mark Ferguson 
> wrote:
>
>> Hi,
>>
>> In my solrconfig.xml file there are two request handlers configured: one
>> uses defType=dismax, and the other doesn't. However, it seems that when the
>> dismax request handler is set as my default, I have no way of using the
>> standard request handler . Here is the relevant part of my solrconfig.xml:
>>
>> <requestHandler name="standard" class="solr.SearchHandler">
>>   <lst name="defaults">
>>     <str name="echoParams">explicit</str>
>>   </lst>
>> </requestHandler>
>>
>> <requestHandler name="dismax" class="solr.SearchHandler" default="true">
>>   <lst name="defaults">
>>     <str name="defType">dismax</str>
>>     <str name="echoParams">explicit</str>
>>   </lst>
>> </requestHandler>
>>
>>
>> When I run a query with the parameters qt=standard&debugQuery=true, I can
>> see that it is still using the DismaxQueryParser. There doesn't seem to be
>> any way to use the standard request handler.
>>
>> On the other hand, when I set the standard request handler as my default,
>> the behaviour is equally strange. When I specify no qt parameter at all, it
>> uses the standard request handler as it should. However, when I enter either
>> qt=standard or qt=dismax, it uses the dismax request handler!
>>
>> So it appears that the only way I can choose the request handler I want is
>> to make the standard request handler my default, then specify no qt
>> parameter if I want to use it. Has anyone else tried this?
>>
>> Mark
>>
>
>


Re: Unable to choose request handler

2009-01-06 Thread Mark Ferguson
It seems that the problem is related to the defType parameter. When I
specify defType=, it uses the correct request handler. It seems that it is
using the correct request handler, but it is defaulting to defType=dismax,
even though I have not specified that parameter in the standard request
handler configuration.

On Tue, Jan 6, 2009 at 2:57 PM, Mark Ferguson wrote:

> Hi,
>
> In my solrconfig.xml file there are two request handlers configured: one
> uses defType=dismax, and the other doesn't. However, it seems that when the
> dismax request handler is set as my default, I have no way of using the
> standard request handler . Here is the relevant part of my solrconfig.xml:
>
> <requestHandler name="standard" class="solr.SearchHandler">
>   <lst name="defaults">
>     <str name="echoParams">explicit</str>
>   </lst>
> </requestHandler>
>
> <requestHandler name="dismax" class="solr.SearchHandler" default="true">
>   <lst name="defaults">
>     <str name="defType">dismax</str>
>     <str name="echoParams">explicit</str>
>   </lst>
> </requestHandler>
>
>
> When I run a query with the parameters qt=standard&debugQuery=true, I can
> see that it is still using the DismaxQueryParser. There doesn't seem to be
> any way to use the standard request handler.
>
> On the other hand, when I set the standard request handler as my default,
> the behaviour is equally strange. When I specify no qt parameter at all, it
> uses the standard request handler as it should. However, when I enter either
> qt=standard or qt=dismax, it uses the dismax request handler!
>
> So it appears that the only way I can choose the request handler I want is
> to make the standard request handler my default, then specify no qt
> parameter if I want to use it. Has anyone else tried this?
>
> Mark
>


Unable to choose request handler

2009-01-06 Thread Mark Ferguson
Hi,

In my solrconfig.xml file there are two request handlers configured: one
uses defType=dismax, and the other doesn't. However, it seems that when the
dismax request handler is set as my default, I have no way of using the
standard request handler . Here is the relevant part of my solrconfig.xml:

<requestHandler name="standard" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>

<requestHandler name="dismax" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>


When I run a query with the parameters qt=standard&debugQuery=true, I can
see that it is still using the DismaxQueryParser. There doesn't seem to be
any way to use the standard request handler.

On the other hand, when I set the standard request handler as my default,
the behaviour is equally strange. When I specify no qt parameter at all, it
uses the standard request handler as it should. However, when I enter either
qt=standard or qt=dismax, it uses the dismax request handler!

So it appears that the only way I can choose the request handler I want is
to make the standard request handler my default, then specify no qt
parameter if I want to use it. Has anyone else tried this?

Mark


Re: Dismax query parser with different field classes

2009-01-02 Thread Mark Ferguson
Hi again,

I have a small problem with using a boost query, which is that I would like
documents found in the boost query to be returned even if the main query
does not include those results. So what I am effectively looking for is an
OR between the dismax query and the boost query, rather than a required main
query or'd with the boost query. Does anything currently exist which can
facilitate this?

For example, currently my parsed query looks something like this, where the
domain, page_title and body_text fields are part of the dismax query, and
user_id is part of the boost query:

+DisjunctionMaxQuery((domain:maps^4.0 page_title:map^2.0 |
body_text:map)~0.01) user_id:12^5.0

Whereas I would like it to look like this:

+(DisjunctionMaxQuery((domain:maps^4.0 page_title:map^2.0 |
body_text:map)~0.01 user_id:12^5.0)

This is because I want all documents with user_id:12 to be returned, even if
there are no keyword matches. I also want all documents with keyword matches
to be returned, even when the user_id doesn't match, so I can't just switch
the query and the boost query.

Any ideas? Thanks for your time.

Mark
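
One possible sketch, assuming a Solr version that supports the _query_
nested-query syntax (not confirmed in this thread); the field names and
boosts are taken from the example above:

import org.apache.solr.client.solrj.SolrQuery;

class DismaxOrBoostQuery {
    static SolrQuery build() {
        // OR the dismax keyword query with the user_id clause, so user_id
        // matches are returned even when there are no keyword matches.
        return new SolrQuery(
                "_query_:\"{!dismax qf='domain^4.0 page_title^2.0 body_text'}map\""
                + " OR user_id:12^5.0");
    }
}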


On Fri, Jan 2, 2009 at 2:33 PM, Mark Ferguson wrote:

> Hello,
>
> It looks like a boost query will accomplish what I am looking for quite
> nicely.
>
> Mark
>
>
>
> On Wed, Dec 31, 2008 at 5:29 PM, Mark Ferguson 
> wrote:
>
>> Hello,
>>
>> I have a set of documents in which I have different classes of fields that
>> I would like to search separately. For example, I would like to search the
>> HTML body and title of a webpage using one set of keywords, and the page
>> author using another set. I cannot use the dismax parser for this problem
>> because all keywords will search across all the query fields. However, I
>> like the dismax query parser because it handles matching and scoring very
>> nicely.
>>
>> I read one suggestion on this group which was to make one of the queries a
>> query filter. So for example, I may use the dismax query parser to search
>> the body and title of a webpage, then use a filter query for the author.
>> There are two problems with this approach in regards to what I need:
>>   1. The filter query does not affect scoring, but I need the scoring to
>> be influenced by the results of all fields being searched.
>>   2. A filter query will do a simple AND or OR filter, while I would need
>> the search to be an OR search with higher scoring for multiple matches
>> (related to the first problem).
>>
>> I think what I need is a dismax parser into which the parsed query will
>> not just contain all keywords for all fields, but into which you can specify
>> which fields correspond to which sets of keywords. Has anything like this
>> been tackled before? If not, can someone help point me in the right
>> direction for how I would build this myself? Thanks very much for your time.
>>
>> Regards,
>>
>> Mark Ferguson
>>
>
>


Re: Dismax query parser with different field classes

2009-01-02 Thread Mark Ferguson
Hello,

It looks like a boost query will accomplish what I am looking for quite
nicely.

Mark


On Wed, Dec 31, 2008 at 5:29 PM, Mark Ferguson wrote:

> Hello,
>
> I have a set of documents in which I have different classes of fields that
> I would like to search separately. For example, I would like to search the
> HTML body and title of a webpage using one set of keywords, and the page
> author using another set. I cannot use the dismax parser for this problem
> because all keywords will search across all the query fields. However, I
> like the dismax query parser because it handles matching and scoring very
> nicely.
>
> I read one suggestion on this group which was to make one of the queries a
> query filter. So for example, I may use the dismax query parser to search
> the body and title of a webpage, then use a filter query for the author.
> There are two problems with this approach in regards to what I need:
>   1. The filter query does not affect scoring, but I need the scoring to be
> influenced by the results of all fields being searched.
>   2. A filter query will do a simple AND or OR filter, while I would need
> the search to be an OR search with higher scoring for multiple matches
> (related to the first problem).
>
> I think what I need is a dismax parser into which the parsed query will not
> just contain all keywords for all fields, but into which you can specify
> which fields correspond to which sets of keywords. Has anything like this
> been tackled before? If not, can someone help point me in the right
> direction for how I would build this myself? Thanks very much for your time.
>
> Regards,
>
> Mark Ferguson
>


Dismax query parser with different field classes

2008-12-31 Thread Mark Ferguson
Hello,

I have a set of documents in which I have different classes of fields that I
would like to search separately. For example, I would like to search the
HTML body and title of a webpage using one set of keywords, and the page
author using another set. I cannot use the dismax parser for this problem
because all keywords will search across all the query fields. However, I
like the dismax query parser because it handles matching and scoring very
nicely.

I read one suggestion on this group which was to make one of the queries a
query filter. So for example, I may use the dismax query parser to search
the body and title of a webpage, then use a filter query for the author.
There are two problems with this approach in regards to what I need:
  1. The filter query does not affect scoring, but I need the scoring to be
influenced by the results of all fields being searched.
  2. A filter query will do a simple AND or OR filter, while I would need
the search to be an OR search with higher scoring for multiple matches
(related to the first problem).

I think what I need is a dismax parser into which the parsed query will not
just contain all keywords for all fields, but into which you can specify
which fields correspond to which sets of keywords. Has anything like this
been tackled before? If not, can someone help point me in the right
direction for how I would build this myself? Thanks very much for your time.

Regards,

Mark Ferguson


Re: Solrj: Getting response attributes from QueryResponse

2008-12-19 Thread Mark Ferguson
Oops .. thanks for the quick reply, I shouldn't have missed this. :)

Mark

On Fri, Dec 19, 2008 at 1:25 PM, Kevin Hagel  wrote:

>
> http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/response/QueryResponse.html#getResults()
>
> returns a SolrDocumentList
>
>
> http://lucene.apache.org/solr/api/org/apache/solr/common/SolrDocumentList.html
>
> which has that information
>
> On Fri, Dec 19, 2008 at 2:22 PM, Mark Ferguson  >wrote:
>
> > Hello,
> >
> > I am trying to get the numFound attribute from a returned QueryResponse
> > object, but for the life of me I can't find where it is stored. When I
> view
> > a response in XML format, it is stored as an attribute on the response
> > node,
> > e.g.:
> >
> > <result name="response" numFound="1228" start="0" maxScore="3.633028">
> >
> > However, I can't find a way to retrieve these attributes (numFound, start
> > and maxScore). When I look at the QueryResponse itself, I can see that
> the
> > attributes are being stored somewhere, because the toString method
> returns
> > them. For example, queryResponse.toString() returns:
> >
> >
> > {responseHeader={status=0,QTime=139,params={wt=javabin,hl=true,rows=15,version=2.2,fl=urlmd5,start=0,q=java}},response={
> > numFound=1228,start=0,maxScore=3.633028,docs=[SolrDocument[{urlmd5=...
> >
> > The problem is that when I call queryResponse.get('response'), all I get
> is
> > the list of SolrDocuments, I don't have any other attributes. Am I
> missing
> > something or are these attributes just not publicly available? If
> they're
> > not, shouldn't they be? Thanks a lot,
> >
> > Mark Ferguson
> >
>
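
In code, per the links above, the attributes live on the SolrDocumentList:

import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

class ResultAttributes {
    static void print(QueryResponse rsp) {
        // numFound, start and maxScore are carried on the document list.
        SolrDocumentList results = rsp.getResults();
        System.out.println("numFound: " + results.getNumFound());
        System.out.println("start:    " + results.getStart());
        System.out.println("maxScore: " + results.getMaxScore());
    }
}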


Solrj: Getting response attributes from QueryResponse

2008-12-19 Thread Mark Ferguson
Hello,

I am trying to get the numFound attribute from a returned QueryResponse
object, but for the life of me I can't find where it is stored. When I view
a response in XML format, it is stored as an attribute on the response node,
e.g.:

<result name="response" numFound="1228" start="0" maxScore="3.633028">

However, I can't find a way to retrieve these attributes (numFound, start
and maxScore). When I look at the QueryResponse itself, I can see that the
attributes are being stored somewhere, because the toString method returns
them. For example, queryResponse.toString() returns:

{responseHeader={status=0,QTime=139,params={wt=javabin,hl=true,rows=15,version=2.2,fl=urlmd5,start=0,q=java}},response={
numFound=1228,start=0,maxScore=3.633028,docs=[SolrDocument[{urlmd5=...

The problem is that when I call queryResponse.get('response'), all I get is
the list of SolrDocuments, I don't have any other attributes. Am I missing
something or are these attributes just not publicly available? If they're
not, shouldn't they be? Thanks a lot,

Mark Ferguson


Re: Some solrconfig.xml attributes being ignored

2008-12-16 Thread Mark Ferguson
Hi Erik,

Thanks a lot for looking into this, it's greatly appreciated.

Mark


On Tue, Dec 16, 2008 at 2:51 AM, Erik Hatcher wrote:

> Mark,
>
> Looked at the code to discern this...
>
> A fragmenter isn't responsible for the number of snippets - the higher
> level SolrHighlighter is the component that uses that parameter.  So yes, it
> must be specified at the request handler level, not the fragmenter
> configuration.
>
>Erik
>
>
> On Dec 15, 2008, at 7:35 PM, Mark Ferguson wrote:
>
>> It seems like maybe the fragmenter parameters just don't get displayed
>> with echoParams=all set. It may only display as far as the request handler's
>> parameters. The reason I think this is because I tried increasing
>> hl.fragsize to 1000 and the results were returned correctly (much larger
>> snippets), so I know it was read correctly.
>>
>> I moved hl.snippets into the requestHandler config instead of the
>> fragmenter, and this seems to have solved the problem. However, I'm uneasy
>> with this solution because I don't know why it wasn't being read correctly
>> when setting it inside the fragmenter.
>>
>> Mark
>>
>>
>>
>> On Mon, Dec 15, 2008 at 5:08 PM, Mark Ferguson wrote:
>>
>>  Thanks for this tip, it's very helpful. Indeed, it looks like none of the
>>> highlighting parameters are being included. It's using the correct
>>> request
>>> handler and hl is set to true, but none of the highlighting parameters
>>> from
>>> solrconfig.xml are in the parameter list.
>>>
>>> Here is my query:
>>>
>>>
>>>
>>> http://localhost:8080/solr1/select?rows=50&hl=true&fl=url,urlmd5,page_title,score&echoParams=all&q=java
>>>
>>> Here are the settings for the request handler and the highlighter:
>>>
>>> <requestHandler name="..." class="solr.SearchHandler">
>>>   <lst name="defaults">
>>>     <str name="defType">dismax</str>
>>>     <float name="tie">0.01</float>
>>>     <str name="qf">body_text^1.0 page_title^1.6 meta_desc^1.3</str>
>>>     <str name="q.alt">*:*</str>
>>>     <str name="hl.fl">body_text page_title meta_desc</str>
>>>     <str name="...">0</str>
>>>     <str name="...">0</str>
>>>     <str name="hl.fragmenter">regex</str>
>>>   </lst>
>>> </requestHandler>
>>>
>>> <highlighting>
>>>   <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter"
>>>     default="true">
>>>     <lst name="defaults">
>>>       <int name="hl.snippets">3</int>
>>>       <int name="hl.fragsize">100</int>
>>>       <float name="hl.regex.slop">0.5</float>
>>>       <str name="hl.regex.pattern">\w[-\w ,/\n\"']{50,150}</str>
>>>     </lst>
>>>   </fragmenter>
>>> </highlighting>
>>>
>>> And here is the param list returned to me:
>>>
>>> <lst name="params">
>>>   <str name="echoParams">all</str>
>>>   <str name="tie">0.01</str>
>>>   <str name="hl.fragmenter">regex</str>
>>>   <str name="...">0</str>
>>>   <str name="qf">body_text^1.0 page_title^1.6 meta_desc^1.3</str>
>>>   <str name="...">0</str>
>>>   <str name="q.alt">*:*</str>
>>>   <str name="hl.fl">page_title,body_text</str>
>>>   <str name="defType">dismax</str>
>>>   <str name="echoParams">all</str>
>>>   <str name="fl">url,urlmd5,page_title,score</str>
>>>   <str name="q">java</str>
>>>   <str name="hl">true</str>
>>>   <str name="rows">50</str>
>>> </lst>
>>>
>>> So it seems like everything is working except for the highlighter. I
>>> should
>>> mention that when I enter a bogus fragmenter as a parameter (e.g.
>>> hl.fragmenter=bogus), it returns a 400 error that the fragmenter cannot
>>> be
>>> found, so the config file _is_ finding the regex fragmenter. It just
>>> doesn't
>>> seem to actually be including its parameters... Any ideas are
>>> appreciated,
>>> thanks again for the help.
>>>
>>> Mark
>>>
>>>
>>>
>>> On Mon, Dec 15, 2008 at 4:23 PM, Yonik Seeley  wrote:
>>>
>>>  Try adding echoParams=all to your query to verify the params that the
>>>> solr request handler is getting.
>>>>
>>>> -Yonik
>>>>
>>>> On Mon, Dec 15, 2008 at 6:10 PM, Mark Ferguson
>>>>  wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> In my solrconfig.xml file I am setting the attribute hl.snippets to 3.
>>>>> When I perform a search, it returns only a single snippet for each
>>>>> highlighted field. However, when I set the hl.snippets field manually
>>>>> as a search parameter, I get up to 3 highlighted snippets. This is the
>>>>> configuration that I am using to set the highlighted parameters:
>>>>>
>>>>> <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter"
>>>>>   default="true">
>>>>>   <lst name="defaults">
>>>>>     <int name="hl.snippets">3</int>
>>>>>     <int name="hl.fragsize">100</int>
>>>>>     <float name="hl.regex.slop">0.5</float>
>>>>>     <str name="hl.regex.pattern">\w[-\w ,/\n\"']{50,150}</str>
>>>>>   </lst>
>>>>> </fragmenter>
>>>>>
>>>>> I tried setting hl.fragmenter=regex as a parameter as well, to be sure
>>>>> that it was using the correct one, and the result set is the same. Any
>>>>> ideas what could be causing this attribute not to be read? It has me
>>>>> concerned that other attributes are being ignored as well.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mark Ferguson
>>>>>
>>>>>
>>>>
>>>
>>>
>


Re: Some solrconfig.xml attributes being ignored

2008-12-15 Thread Mark Ferguson
It seems like maybe the fragmenter parameters just don't get displayed with
echoParams=all set. It may only display as far as the request handler's
parameters. The reason I think this is because I tried increasing
hl.fragsize to 1000 and the results were returned correctly (much larger
snippets), so I know it was read correctly.

I moved hl.snippets into the requestHandler config instead of the
fragmenter, and this seems to have solved the problem. However, I'm uneasy
with this solution because I don't know why it wasn't being read correctly
when setting it inside the fragmenter.

Mark
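
A client-side sketch of that workaround, passing the highlighting parameters
explicitly on each request (parameter names as used in this thread):

import org.apache.solr.client.solrj.SolrQuery;

class ExplicitHighlightParams {
    static SolrQuery build() {
        SolrQuery q = new SolrQuery("java");
        q.setHighlight(true);
        // Send the fragmenter settings per request instead of relying on
        // the fragmenter defaults in solrconfig.xml.
        q.setParam("hl.fragmenter", "regex");
        q.setParam("hl.snippets", "3");
        q.setParam("hl.fragsize", "100");
        return q;
    }
}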



On Mon, Dec 15, 2008 at 5:08 PM, Mark Ferguson wrote:

> Thanks for this tip, it's very helpful. Indeed, it looks like none of the
> highlighting parameters are being included. It's using the correct request
> handler and hl is set to true, but none of the highlighting parameters from
> solrconfig.xml are in the parameter list.
>
> Here is my query:
>
>
> http://localhost:8080/solr1/select?rows=50&hl=true&fl=url,urlmd5,page_title,score&echoParams=all&q=java
>
> Here are the settings for the request handler and the highlighter:
>
> <requestHandler name="..." class="solr.SearchHandler">
>   <lst name="defaults">
>     <str name="defType">dismax</str>
>     <float name="tie">0.01</float>
>     <str name="qf">body_text^1.0 page_title^1.6 meta_desc^1.3</str>
>     <str name="q.alt">*:*</str>
>     <str name="hl.fl">body_text page_title meta_desc</str>
>     <str name="...">0</str>
>     <str name="...">0</str>
>     <str name="hl.fragmenter">regex</str>
>   </lst>
> </requestHandler>
>
> <highlighting>
>   <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter"
>     default="true">
>     <lst name="defaults">
>       <int name="hl.snippets">3</int>
>       <int name="hl.fragsize">100</int>
>       <float name="hl.regex.slop">0.5</float>
>       <str name="hl.regex.pattern">\w[-\w ,/\n\"']{50,150}</str>
>     </lst>
>   </fragmenter>
> </highlighting>
>
> And here is the param list returned to me:
>
> <lst name="params">
>   <str name="echoParams">all</str>
>   <str name="tie">0.01</str>
>   <str name="hl.fragmenter">regex</str>
>   <str name="...">0</str>
>   <str name="qf">body_text^1.0 page_title^1.6 meta_desc^1.3</str>
>   <str name="...">0</str>
>   <str name="q.alt">*:*</str>
>   <str name="hl.fl">page_title,body_text</str>
>   <str name="defType">dismax</str>
>   <str name="echoParams">all</str>
>   <str name="fl">url,urlmd5,page_title,score</str>
>   <str name="q">java</str>
>   <str name="hl">true</str>
>   <str name="rows">50</str>
> </lst>
>
> So it seems like everything is working except for the highlighter. I should
> mention that when I enter a bogus fragmenter as a parameter (e.g.
> hl.fragmenter=bogus), it returns a 400 error that the fragmenter cannot be
> found, so the config file _is_ finding the regex fragmenter. It just doesn't
> seem to actually be including its parameters... Any ideas are appreciated,
> thanks again for the help.
>
> Mark
>
>
>
> On Mon, Dec 15, 2008 at 4:23 PM, Yonik Seeley  wrote:
>
>> Try adding echoParams=all to your query to verify the params that the
>> solr request handler is getting.
>>
>> -Yonik
>>
>> On Mon, Dec 15, 2008 at 6:10 PM, Mark Ferguson
>>  wrote:
>> > Hello,
>> >
>> > In my solrconfig.xml file I am setting the attribute hl.snippets to 3.
>> > When I perform a search, it returns only a single snippet for each
>> > highlighted field. However, when I set the hl.snippets field manually as
>> > a search parameter, I get up to 3 highlighted snippets. This is the
>> > configuration that I am using to set the highlighted parameters:
>> >
>> > <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter"
>> >   default="true">
>> >   <lst name="defaults">
>> >     <int name="hl.snippets">3</int>
>> >     <int name="hl.fragsize">100</int>
>> >     <float name="hl.regex.slop">0.5</float>
>> >     <str name="hl.regex.pattern">\w[-\w ,/\n\"']{50,150}</str>
>> >   </lst>
>> > </fragmenter>
>> >
>> > I tried setting hl.fragmenter=regex as a parameter as well, to be sure
>> > that it was using the correct one, and the result set is the same. Any
>> > ideas what could be causing this attribute not to be read? It has me
>> > concerned that other attributes are being ignored as well.
>> >
>> > Thanks,
>> >
>> > Mark Ferguson
>> >
>>
>
>


Re: Some solrconfig.xml attributes being ignored

2008-12-15 Thread Mark Ferguson
Thanks for this tip, it's very helpful. Indeed, it looks like none of the
highlighting parameters are being included. It's using the correct request
handler and hl is set to true, but none of the highlighting parameters from
solrconfig.xml are in the parameter list.

Here is my query:

http://localhost:8080/solr1/select?rows=50&hl=true&fl=url,urlmd5,page_title,score&echoParams=all&q=java

Here are the settings for the request handler and the highlighter:

<requestHandler name="..." class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <float name="tie">0.01</float>
    <str name="qf">body_text^1.0 page_title^1.6 meta_desc^1.3</str>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">body_text page_title meta_desc</str>
    <str name="...">0</str>
    <str name="...">0</str>
    <str name="hl.fragmenter">regex</str>
  </lst>
</requestHandler>

<highlighting>
  <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter"
    default="true">
    <lst name="defaults">
      <int name="hl.snippets">3</int>
      <int name="hl.fragsize">100</int>
      <float name="hl.regex.slop">0.5</float>
      <str name="hl.regex.pattern">\w[-\w ,/\n\"']{50,150}</str>
    </lst>
  </fragmenter>
</highlighting>

And here is the param list returned to me:

<lst name="params">
  <str name="echoParams">all</str>
  <str name="tie">0.01</str>
  <str name="hl.fragmenter">regex</str>
  <str name="...">0</str>
  <str name="qf">body_text^1.0 page_title^1.6 meta_desc^1.3</str>
  <str name="...">0</str>
  <str name="q.alt">*:*</str>
  <str name="hl.fl">page_title,body_text</str>
  <str name="defType">dismax</str>
  <str name="echoParams">all</str>
  <str name="fl">url,urlmd5,page_title,score</str>
  <str name="q">java</str>
  <str name="hl">true</str>
  <str name="rows">50</str>
</lst>

So it seems like everything is working except for the highlighter. I should
mention that when I enter a bogus fragmenter as a parameter (e.g.
hl.fragmenter=bogus), it returns a 400 error that the fragmenter cannot be
found, so the config file _is_ finding the regex fragmenter. It just doesn't
seem to actually be including its parameters... Any ideas are appreciated,
thanks again for the help.

Mark


On Mon, Dec 15, 2008 at 4:23 PM, Yonik Seeley  wrote:

> Try adding echoParams=all to your query to verify the params that the
> solr request handler is getting.
>
> -Yonik
>
> On Mon, Dec 15, 2008 at 6:10 PM, Mark Ferguson
>  wrote:
> > Hello,
> >
> > In my solrconfig.xml file I am setting the attribute hl.snippets to 3. When
> > I perform a search, it returns only a single snippet for each highlighted
> > field. However, when I set the hl.snippets field manually as a search
> > parameter, I get up to 3 highlighted snippets. This is the configuration
> > that I am using to set the highlighted parameters:
> >
> > <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter"
> >   default="true">
> >   <lst name="defaults">
> >     <int name="hl.snippets">3</int>
> >     <int name="hl.fragsize">100</int>
> >     <float name="hl.regex.slop">0.5</float>
> >     <str name="hl.regex.pattern">\w[-\w ,/\n\"']{50,150}</str>
> >   </lst>
> > </fragmenter>
> >
> > I tried setting hl.fragmenter=regex as a parameter as well, to be sure that
> > it was using the correct one, and the result set is the same. Any ideas what
> > could be causing this attribute not to be read? It has me concerned that
> > other attributes are being ignored as well.
> >
> > Thanks,
> >
> > Mark Ferguson
> >
>


Re: Using Regex fragmenter to extract paragraphs

2008-12-15 Thread Mark Ferguson
You actually don't need to escape most characters inside a character class,
the escaping of the period was unnecessary.

I've tried using the example regex ([-\w ,/\n\"']{20,200}), and I'm _still_
getting lots of highlighted snippets that don't match the regex (starting
with a period, etc.). Has anyone else had any trouble with the default regex
fragmenter? If someone has used it and gotten the expected results, can you
let me know, so I know that the problem is on my end?

Thanks for your help,

Mark


On Sun, Dec 14, 2008 at 8:34 AM, Erick Erickson wrote:

> Shouldn't you escape the question mark at the end too?
>
> On Fri, Dec 12, 2008 at 6:22 PM, Mark Ferguson  >wrote:
>
> > Someone helped me with the regex and pointed out a couple mistakes, most
> > notably the extra quantifier in .*{400,600}. My new regex is this:
> >
> > \w.{400,600}[\.!?]
> >
> > Unfortunately, my results still aren't any better. Some results start with
> > a word character, some don't, and none seem to end with punctuation. Any
> > ideas what else could be wrong?
> >
> > Mark
> >
> >
> >
> > On Fri, Dec 12, 2008 at 2:37 PM, Mark Ferguson <
> mark.a.fergu...@gmail.com
> > >wrote:
> >
> > > Hello,
> > >
> > > I am trying to use the regex fragmenter and am having a hard time
> > > getting the results I want. I am trying to get fragments that start on
> > > a word character and end on punctuation, but for some reason the
> > > fragments being returned to me seem to be very inflexible, despite that
> > > I've provided a large slop. Here are the relevant parameters I'm using,
> > > maybe someone can help point out where I've gone wrong:
> > >
> > > <int name="hl.fragsize">500</int>
> > > <str name="hl.fragmenter">regex</str>
> > > <float name="hl.regex.slop">0.8</float>
> > > <str name="hl.regex.pattern">[\w].*{400,600}[.!?]</str>
> > > <str name="hl">true</str>
> > > <str name="...">chinese</str>
> > >
> > > This should be matching between 400-600 characters, beginning with a
> > > word character and ending with one of .!?. Here is an example of a typical
> > > result:
> > >
> > > . Check these pictures out. Nine panda cubs on display for the first
> > > time Thursday in southwest China. They're less than a year old. They
> > > just recently stopped nursing. There are only 1,600 of these guys left
> > > in the mountain forests of central China, another 120 in Chinese
> > > breeding facilities and zoos. And they're about 20 that live outside
> > > China in zoos. They exist almost entirely on bamboo. They can live to
> > > be 30 years old. And these little guys will eventually get much bigger.
> > > They'll grow
> > >
> > > As you can see, it is starting with a period and ending on a word
> > > character! It's almost as if the fragments are just coming out as they
> > > will and the regex isn't doing anything at all, but the results are
> > > different when I use the gap fragmenter. In the above result I don't
> > > see any reason why it shouldn't have stripped out the preceding period
> > > and the last two words, there is plenty of room in the slop and in the
> > > regex pattern. Please help me figure out what I'm doing wrong...
> > >
> > > Thanks a lot,
> > >
> > > Mark Ferguson
> > >
> >
>


Some solrconfig.xml attributes being ignored

2008-12-15 Thread Mark Ferguson
Hello,

In my solrconfig.xml file I am setting the attribute hl.snippets to 3. When
I perform a search, it returns only a single snippet for each highlighted
field. However, when I set the hl.snippets field manually as a search
parameter, I get up to 3 highlighted snippets. This is the configuration
that I am using to set the highlighted parameters:



  3
  100
  0.5
  \w[-\w ,/\n\"']{50,150}



I tried setting hl.fragmenter=regex as a parameter as well, to be sure that
it was using the correct one, and the result set is the same. Any ideas what
could be causing this attribute not to be read? It has me concerned that
other attributes are being ignored as well.

Thanks,

Mark Ferguson


Re: Using Regex fragmenter to extract paragraphs

2008-12-12 Thread Mark Ferguson
Someone helped me with the regex and pointed out a couple mistakes, most
notably the extra quantifier in .*{400,600}. My new regex is this:

\w.{400,600}[\.!?]

Unfortunately, my results still aren't any better. Some results start with a
word character, some don't, and none seem to end with punctuation. Any ideas
what else could be wrong?

Mark
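
For what it's worth, a quick way to sanity-check the corrected pattern
outside Solr (a minimal sketch; the sample text is a placeholder):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FragmentPatternCheck {
    public static void main(String[] args) {
        String text = "...";  // paste a document body here
        // DOTALL so '.' can cross newlines, as fragments may.
        Pattern p = Pattern.compile("\\w.{400,600}[.!?]", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        while (m.find()) {
            System.out.println("fragment: [" + m.group() + "]");
        }
    }
}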



On Fri, Dec 12, 2008 at 2:37 PM, Mark Ferguson wrote:

> Hello,
>
> I am trying to use the regex fragmenter and am having a hard time getting
> the results I want. I am trying to get fragments that start on a word
> character and end on punctuation, but for some reason the fragments being
> returned to me seem to be very inflexible, despite that I've provided a
> large slop. Here are the relevant parameters I'm using, maybe someone can
> help point out where I've gone wrong:
>
> <int name="hl.fragsize">500</int>
> <str name="hl.fragmenter">regex</str>
> <float name="hl.regex.slop">0.8</float>
> <str name="hl.regex.pattern">[\w].*{400,600}[.!?]</str>
> <str name="hl">true</str>
> <str name="...">chinese</str>
>
> This should be matching between 400-600 characters, beginning with a word
> character and ending with one of .!?. Here is an example of a typical
> result:
>
> . Check these pictures out. Nine panda cubs on display for the first time
> Thursday in southwest China. They're less than a year old. They just
> recently stopped nursing. There are only 1,600 of these guys left in the
> mountain forests of central China, another 120 in Chinese breeding
> facilities and zoos. And they're about 20
> that live outside China in zoos. They exist almost entirely on bamboo. They
> can live to be 30 years old. And these little guys will eventually get much
> bigger. They'll grow
>
> As you can see, it is starting with a period and ending on a word
> character! It's almost as if the fragments are just coming out as they will
> and the regex isn't doing anything at all, but the results are different
> when I use the gap fragmenter. In the above result I don't see any reason
> why it shouldn't have stripped out the preceding period and the last two
> words, there is plenty of room in the slop and in the regex pattern. Please
> help me figure out what I'm doing wrong...
>
> Thanks a lot,
>
> Mark Ferguson
>


Using Regex fragmenter to extract paragraphs

2008-12-12 Thread Mark Ferguson
Hello,

I am trying to use the regex fragmenter and am having a hard time getting
the results I want. I am trying to get fragments that start on a word
character and end on punctuation, but for some reason the fragments being
returned to me seem to be very inflexible, despite that I've provided a
large slop. Here are the relevant parameters I'm using, maybe someone can
help point out where I've gone wrong:

<int name="hl.fragsize">500</int>
<str name="hl.fragmenter">regex</str>
<float name="hl.regex.slop">0.8</float>
<str name="hl.regex.pattern">[\w].*{400,600}[.!?]</str>
<str name="hl">true</str>
<str name="...">chinese</str>

This should be matching between 400-600 characters, beginning with a word
character and ending with one of .!?. Here is an example of a typical
result:

. Check these pictures out. Nine panda cubs on display for the first time
Thursday in southwest China. They're less than a year old. They just
recently stopped nursing. There are only 1,600 of these guys left in the
mountain forests of central China, another 120 in Chinese breeding facilities and zoos. And they're about 20
that live outside China in zoos. They exist almost entirely on bamboo. They
can live to be 30 years old. And these little guys will eventually get much
bigger. They'll grow

As you can see, it is starting with a period and ending on a word character!
It's almost as if the fragments are just coming out as they will and the
regex isn't doing anything at all, but the results are different when I use
the gap fragmenter. In the above result I don't see any reason why it
shouldn't have stripped out the preceding period and the last two words,
there is plenty of room in the slop and in the regex pattern. Please help me
figure out what I'm doing wrong...

Thanks a lot,

Mark Ferguson


Format of highlighted fields in query response

2008-12-11 Thread Mark Ferguson
Hello,

I am making a query to my Solr server in which I would like to have a number
of fields returned, with highlighting if available. I've noticed that in the
query response, I get back both the original field name and then in a
different section, the highlighted snippet. I am wondering if there is a
parameter which will allow me to collapse this data, returning only the
highlighted snippet in the doc itself, when available. For example, I am
currently receiving the following data:

<doc>
  <float name="score">0.2963915</float>
  <str name="page_title">Chinese Visa Information</str>
  <str name="url">http://www.danwei.org/china_information/chinese_visa_information.php</str>
  <str name="urlmd5">01598a6e06190bd8b05c8b03f51233a1</str>
</doc>
... and farther down ...
<lst name="highlighting">
  <lst name="01598a6e06190bd8b05c8b03f51233a1">
    <arr name="page_title">
      <str>TITLE: Chinese <em>Visa</em> Information</str>
    </arr>
  </lst>
</lst>

I would like it to just look like this:

<doc>
  <float name="score">0.2963915</float>
  <str name="page_title">Chinese <em>Visa</em> Information</str>
  <str name="url">http://www.danwei.org/china_information/chinese_visa_information.php</str>
  <str name="urlmd5">01598a6e06190bd8b05c8b03f51233a1</str>
</doc>

The reason I would prefer this second response format is because I don't
need the first field, and it greatly simplifies my call to
QueryResponse.getBeans() in SolrJ, as it will fill in everything I need in
one call.

Thanks very much,

Mark Ferguson
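
I'm not aware of a response parameter that does this collapse, but a
client-side merge before getBeans() is straightforward; a sketch, using the
field names from the example above:

import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

class HighlightMerger {
    // Overwrite each doc's page_title with its highlighted snippet, if any,
    // so a later getBeans() call picks up the highlighted value.
    static void merge(QueryResponse rsp) {
        Map<String, Map<String, List<String>>> hl = rsp.getHighlighting();
        if (hl == null) return;
        for (SolrDocument doc : rsp.getResults()) {
            String key = (String) doc.getFieldValue("urlmd5");
            Map<String, List<String>> docHl = hl.get(key);
            if (docHl != null && docHl.containsKey("page_title")) {
                doc.setField("page_title", docHl.get("page_title").get(0));
            }
        }
    }
}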


Re: Taxonomy Support on Solr

2008-12-11 Thread Mark Ferguson
I had a similar problem and I solved it by making the directory a
multi-valued field in the index and giving each directory a unique id. So
for example, a document in directory 2 would contain in the index: "dir_id:A
dir_id:B dir_id:2". A search on any of those fields will then return
directory 2. Conversely, a search for dir_id:2 will not return directory B.
I hope I understood your question correctly.

Mark Ferguson
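
A sketch of indexing a node this way with SolrJ (the field name and ids
match the description above):

import org.apache.solr.common.SolrInputDocument;

class TaxonomyDoc {
    static SolrInputDocument node2() {
        // Node 2 is a child of B, which is a child of A, so it is indexed
        // with its own id plus every ancestor id in the multi-valued field.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "2");
        doc.addField("dir_id", "A");
        doc.addField("dir_id", "B");
        doc.addField("dir_id", "2");
        // dir_id:B now matches this node; dir_id:2 does not match node B.
        return doc;
    }
}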

On Thu, Dec 11, 2008 at 3:03 AM, Jana, Kumar Raja  wrote:

> Hi,
>
> Any plans of supporting user-defined classifications on Solr? Is there
> any component which returns all the children of a node (till the leaf
> node) when I search for any node?
>
> May be this would help:
>
> Say I have a few SolrDocuments classified as:
>
>          A
>        /   \
>       B     C
>     / | \   / \
>    1  2  3  8  9
>
> (i.e., A has 2 child nodes B and C, B has 3 child nodes 1,2,3 and C has 2
> child nodes 8,9)
> When my search criteria matches B, my results should contain B as well
> as 1,2 and 3 too.
> Search for A would return all the nodes mentioned above.
>
> -Kumar