Re: Query time boosting with dismax

2009-12-04 Thread Otis Gospodnetic
Terms no, but fields (with terms) and phrases, yes.
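
For example (field names and boost values illustrative):

  q=term1 term2&defType=dismax&qf=field1^2.0 field2^3.0&pf=field1^5.0

This boosts field1 matches over field2 and adds a phrase boost on field1; a
per-term boost like term2^3.0 is not part of the dismax syntax.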


Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
> From: Girish Redekar 
> To: solr-user@lucene.apache.org
> Sent: Fri, December 4, 2009 11:42:16 PM
> Subject: Query time boosting with dismax
> 
> Hi,
> 
> Is it possible to weigh specific query terms with a Dismax query parser? Is
> it possible to write queries of the sort ...
> field1:(term1)^2.0 + (term2^3.0)
> with dismax?
> 
> Thanks,
> Girish Redekar
> http://girishredekar.net



Re: Debian Lenny + Apache Tomcat 5.5 + Solr 1.4

2009-12-04 Thread rajan chandi
The local Solr box doesn't look like 64-bit:

ra...@rajan-desktop:~$ uname -a
Linux rajan-desktop 2.6.28-16-server #55-Ubuntu SMP Tue Oct 20 20:50:00 UTC
2009 i686 GNU/Linux


But the Xen Solr server does:

ra...@rajan-desktop:~$ uname -a
Linux rajan-desktop 2.6.28-16-server #55-Ubuntu SMP Tue Oct 20 20:50:00 UTC
2009 i686 GNU/Linux


Maybe that is the reason why the server is taking more RAM.

Thanks all for your responses.

Regards
Rajan

On Sat, Dec 5, 2009 at 11:06 AM, rajan chandi wrote:

> My local Ubuntu 9.04 64-bit machine taking 1.5 GB is not a VM, and the Debian
> Lenny 64-bit machine taking 2 GB is a Xen instance.
>
> - Rajan
>
>
> On Sat, Dec 5, 2009 at 10:51 AM, rajan chandi wrote:
>
>> We are using a 64-bit VM with a 64-bit JDK on it.
>> It is a 2.00 GB RAM Xen instance.
>>
>> We're setting a max JVM heap size of 1800 MB.
>>
>> - Rajan
>>
>>
>>
>> On Fri, Dec 4, 2009 at 8:19 PM, Yonik Seeley 
>> wrote:
>>
>>> Are you explicitly setting the heap sizes?  If not, the JVM is
>>> deciding for itself based on what the box looks like (ram, cpus, OS,
>>> etc).  Are they both the same architecture (32 bit or 64 bit?)
>>>
>>> -Yonik
>>> http://www.lucidimagination.com
>>>
>>> p.s. in general cross-posting to both solr-user and solr-dev is
>>> discouraged.
>>>
>>>
>>> On Fri, Dec 4, 2009 at 5:27 AM, rajan chandi 
>>> wrote:
>>> > Hi All,
>>> >
>>> > We've deployed 4 instances of Solr on a debian server.
>>> >
>>> > It is taking only 1.5 GB of RAM on local ubuntu machine but it is
>>> taking 2.0
>>> > GB plus on Debian Lenny server.
>>> >
>>> > Any ideas/pointers will help.
>>> >
>>> > Regards
>>> > Rajan
>>> >
>>>
>>
>>
>


Re: Debian Lenny + Apache Tomcat 5.5 + Solr 1.4

2009-12-04 Thread rajan chandi
My local Ubuntu 9.04 64-bit machine taking 1.5 GB is not a VM, and the Debian
Lenny 64-bit machine taking 2 GB is a Xen instance.

- Rajan

On Sat, Dec 5, 2009 at 10:51 AM, rajan chandi wrote:

> We are using a 64-bit VM with a 64-bit JDK on it.
> It is a 2.00 GB RAM Xen instance.
>
> We're setting a max JVM heap size of 1800 MB.
>
> - Rajan
>
>
>
> On Fri, Dec 4, 2009 at 8:19 PM, Yonik Seeley 
> wrote:
>
>> Are you explicitly setting the heap sizes?  If not, the JVM is
>> deciding for itself based on what the box looks like (ram, cpus, OS,
>> etc).  Are they both the same architecture (32 bit or 64 bit?)
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>> p.s. in general cross-posting to both solr-user and solr-dev is
>> discouraged.
>>
>>
>> On Fri, Dec 4, 2009 at 5:27 AM, rajan chandi 
>> wrote:
>> > Hi All,
>> >
>> > We've deployed 4 instances of Solr on a debian server.
>> >
>> > It is taking only 1.5 GB of RAM on local ubuntu machine but it is taking
>> 2.0
>> > GB plus on Debian Lenny server.
>> >
>> > Any ideas/pointers will help.
>> >
>> > Regards
>> > Rajan
>> >
>>
>
>


Re: Debian Lenny + Apache Tomcat 5.5 + Solr 1.4

2009-12-04 Thread rajan chandi
We are using a 64-bit VM with a 64-bit JDK on it.
It is a 2.00 GB RAM Xen instance.

We're setting a max JVM heap size of 1800 MB.

- Rajan


On Fri, Dec 4, 2009 at 8:19 PM, Yonik Seeley wrote:

> Are you explicitly setting the heap sizes?  If not, the JVM is
> deciding for itself based on what the box looks like (ram, cpus, OS,
> etc).  Are they both the same architecture (32 bit or 64 bit?)
>
> -Yonik
> http://www.lucidimagination.com
>
> p.s. in general cross-posting to both solr-user and solr-dev is
> discouraged.
>
>
> On Fri, Dec 4, 2009 at 5:27 AM, rajan chandi 
> wrote:
> > Hi All,
> >
> > We've deployed 4 instances of Solr on a debian server.
> >
> > It is taking only 1.5 GB of RAM on local ubuntu machine but it is taking
> 2.0
> > GB plus on Debian Lenny server.
> >
> > Any ideas/pointers will help.
> >
> > Regards
> > Rajan
> >
>


Query time boosting with dismax

2009-12-04 Thread Girish Redekar
Hi,

Is it possible to weigh specific query terms with a Dismax query parser? Is
it possible to write queries of the sort ...
field1:(term1)^2.0 + (term2^3.0)
with dismax?

Thanks,
Girish Redekar
http://girishredekar.net


Re: Sanity check on numeric types and which of them to use

2009-12-04 Thread Yonik Seeley
On Fri, Dec 4, 2009 at 7:38 PM, Jay Hill  wrote:
> 1) Is there any benefit to using the "int" type as a TrieIntField w/
> precisionStep=0 over the "pint" type for simple ints that won't be sorted or
> range queried?

No.  But given that people could throw in a random range query and
have it work correctly with a trie-based int (vs a plain int), that seems
reason enough to prefer it.
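
For instance (field name illustrative), a range query like quantity:[2 TO 10]
is evaluated lexicographically on a plain "pint" field (so "10" sorts before
"2" and results can be wrong), while a TrieIntField with precisionStep=0
evaluates it numerically, just without the multi-precision speedup.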

> 2) In 1.4, what type is now most efficient for sorting?

trie and plain should be pretty equivalent (trie might be slightly
faster to uninvert the first time).  Both take up less memory in the
field cache than sint.

> 3) The only reason to use a "sint" field is for backward compatibility
> and/or to use sortMissingFirst/SortMissingLast, correct?

I believe so.

-Yonik
http://www.lucidimagination.com


Sanity check on numeric types and which of them to use

2009-12-04 Thread Jay Hill
Looking at the example version of schema.xml there seems to be some
confusion on which numeric field types are best used in different
situations. What confused me was that the type of "int" is now set to a
TrieIntField, but with a precisionStep of 0:
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true"/>

the "tint" type is set up as a TrieIntField with a precisionStep of 8:

<fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true"/>

the "sint" type is unchanged:

<fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>

and the old IntField is now of type "pint":

<fieldType name="pint" class="solr.IntField" omitNorms="true"/>

It's obvious that the "tint" type would be preferred for range queries. But
these questions come to mind:
1) Is there any benefit to using the "int" type as a TrieIntField w/
precisionStep=0 over the "pint" type for simple ints that won't be sorted or
range queried?
2) In 1.4, what type is now most efficient for sorting?
3) The only reason to use a "sint" field is for backward compatibility
and/or to use sortMissingFirst/SortMissingLast, correct?

-Jay


Re: how to set multiple fq while building a query in solrj

2009-12-04 Thread Erik Hatcher


On Dec 4, 2009, at 4:21 PM, javaxmlsoapdev wrote:



How do I create a query string with multiple fq params using the solrj
SolrQuery API?

e.g. I want to build a query as follows

http://servername:port/solr/issues/select/?q=testing&fq=statusName:(Female
OR Male)&fq=name="Joe"

I am using the solrj client APIs to build the query, using SolrQuery as
follows

solrQuery.setParam("fq", statusString);
solrQuery.setParam("fq", nameString);

It only sets the last "fq" (fq=nameString) in the string. If I switch the
above setParam order it sets fq=statusString. How do I set multiple fq
params in the SolrQuery object?


Use SolrQuery#add() instead.  Or SolrQuery#addFilterQuery()
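
A minimal sketch (values taken from the example above):

  SolrQuery solrQuery = new SolrQuery("testing");
  solrQuery.addFilterQuery("statusName:(Female OR Male)");
  solrQuery.addFilterQuery("name:\"Joe\"");
  // or, equivalently: add() appends where setParam() overwrites
  solrQuery.add("fq", "statusName:(Female OR Male)");
  solrQuery.add("fq", "name:\"Joe\"");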

Erik



Re: WELCOME to solr-user@lucene.apache.org

2009-12-04 Thread khalid y
Hi,

I have a problem with solr. I'm indexing some html content and solr crashes
because my id field ends up multivalued.
I found that Tika reads the html and extracts metadata (an id among them)
from my htmls, but my documents already have an id set by
literal.id=10.

I tried to map the id from Tika with fmap.id=ignored_ but it also ignores my
literal.id

I'm using solr 1.4 and tika 0.5

Can someone explain to me how I can ignore the Tika id metadata?

Thanks


Re: Dumping solr requests for indexing

2009-12-04 Thread Mark Miller
Teruhiko Kurosaka wrote:
>> Aha!
>> Sounds like a job for a simple, custom 
>> UpdateRequestProcessor.  Actually, I think URP doesn't get 
>> access to the actual XML, but what it has access may be 
>> enough for you: http://wiki.apache.org/solr/UpdateRequestProcessor
>> 
>
> I added this to solrconfig.xml but I don't see any extra output 
> in the log file.
>
> [updateRequestProcessorChain config stripped by the mail archive]
>
> Do I need to do something else to make this effective?
>
> The commented-out example in solrconfig.xml has 'name="dedupe"'
> attribute. Do I have to specify a name? If so, how do I use 
> that name in the request?
>
> Kuro
>   
Look at the comment above the dedupe declaration:

   You have to link the chain to an update handler above to use it ie:

   <requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
     <lst name="defaults">
       <str name="update.processor">dedupe</str>
     </lst>
   </requestHandler>

-- 
- Mark

http://www.lucidimagination.com





RE: Dumping solr requests for indexing

2009-12-04 Thread Teruhiko Kurosaka

> Aha!
> Sounds like a job for a simple, custom 
> UpdateRequestProcessor.  Actually, I think URP doesn't get 
> access to the actual XML, but what it has access to may be 
> enough for you: http://wiki.apache.org/solr/UpdateRequestProcessor

I added this to solrconfig.xml but I don't see any extra output 
in the log file.

[updateRequestProcessorChain config stripped by the mail archive]

Do I need to do something else to make this effective?

The commented-out example in solrconfig.xml has 'name="dedupe"'
attribute. Do I have to specify a name? If so, how do I use 
that name in the request?

Kuro


Re: Debian Lenny + Apache Tomcat 5.5 + Solr 1.4

2009-12-04 Thread Kay Kay

What are the nature of the machines / VM run on ?  32-bit / 64-bit ?

rajan chandi wrote:

Hi All,

We've deployed 4 instances of Solr on a debian server.

It is taking only 1.5 GB of RAM on local ubuntu machine but it is taking 2.0
GB plus on Debian Lenny server.

Any ideas/pointers will help.

Regards
Rajan

  




Re: HTML Stripping slower in Solr 1.4?

2009-12-04 Thread Yonik Seeley
Is BaseCharFilter required for the html strip filter?

-Yonik
http://www.lucidimagination.com


On Tue, Dec 1, 2009 at 1:17 AM, Koji Sekiguchi  wrote:
> Robin,
>
> Thank you for reporting this. Performance degradation of HTML Stripper
> could be in 1.4. I opened a ticket in Lucene:
>
> https://issues.apache.org/jira/browse/LUCENE-2098
>
> Koji
>
> --
> http://www.rondhuit.com/en/
>
>


Re: how is score computed with hsin functionquery?

2009-12-04 Thread gdeconto

Thanks Lance, I appreciate your response.  

I know what a DIH is and have already written custom transformers.  I just
misunderstood your response to my message (I wasn't aware that we could use
JS to create transformers).

Anyhow, my intent is to change the tool (create a variation of hsin to
support degrees) rather than change the data (which introduces other issues,
such as having to support most systems in degrees and this one system in
radians).

Any ideas/advice in that regard?
-- 
View this message in context: 
http://old.nabble.com/how-is-score-computed-with-hsin-functionquery--tp26504265p26638720.html
Sent from the Solr - User mailing list archive at Nabble.com.



how to set multiple fq while building a query in solrj

2009-12-04 Thread javaxmlsoapdev

How do I create a query string with multiple fq params using the solrj SolrQuery
API?

e.g. I want to build a query as follows

http://servername:port/solr/issues/select/?q=testing&fq=statusName:(Female
OR Male)&fq=name="Joe"

I am using the solrj client APIs to build the query, using SolrQuery as follows

solrQuery.setParam("fq", statusString);
solrQuery.setParam("fq", nameString);

It only sets the last "fq" (fq=nameString) in the string. If I switch the above
setParam order it sets fq=statusString. How do I set multiple fq params in the
SolrQuery object?

Thanks,
-- 
View this message in context: 
http://old.nabble.com/how-to-set-multiple-fq-while-building-a-query-in-solrj-tp26638650p26638650.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Dumping solr requests for indexing

2009-12-04 Thread Otis Gospodnetic
Aha!
Sounds like a job for a simple, custom UpdateRequestProcessor.  Actually, I 
think URP doesn't get access to the actual XML, but what it has access to may be 
enough for you: http://wiki.apache.org/solr/UpdateRequestProcessor

Alternatively, unpack the war, add a custom logging servlet filter, chain it in 
web.xml and that might do the trick.
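
A rough sketch of such a filter (untested; the class name, dump path, and
buffering are made up for illustration):

import java.io.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class DumpRequestFilter implements Filter {
  public void init(FilterConfig cfg) {}
  public void destroy() {}

  public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
      throws IOException, ServletException {
    // buffer the request body so it can be both dumped and replayed to Solr
    final byte[] body = readFully(req.getInputStream());
    OutputStream out = new FileOutputStream("/tmp/solr-requests.dump", true);
    out.write(body);
    out.write('\n');
    out.close();
    HttpServletRequestWrapper replay =
        new HttpServletRequestWrapper((HttpServletRequest) req) {
      public ServletInputStream getInputStream() {
        final InputStream in = new ByteArrayInputStream(body);
        return new ServletInputStream() {
          public int read() throws IOException { return in.read(); }
        };
      }
    };
    chain.doFilter(replay, res);
  }

  private static byte[] readFully(InputStream in) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    byte[] b = new byte[4096];
    for (int n; (n = in.read(b)) != -1; ) buf.write(b, 0, n);
    return buf.toByteArray();
  }
}

Mapped in web.xml ahead of Solr's dispatch filter, it appends every request
body to a dump file and then passes a replayable copy down the chain.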

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
> From: Teruhiko Kurosaka 
> To: "solr-user@lucene.apache.org" 
> Sent: Fri, December 4, 2009 3:05:57 PM
> Subject: RE: Dumping solr requests for indexing
> 
> The log only gives high-level descriptions of what was done.
> I'd like to capture the exact XML requests with data, so that
> I could re-feed it to Solr to reproduce the issue my
> customer is encountering.
> 
> -kuro  
> 
> > -Original Message-
> > From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
> > Sent: Friday, December 04, 2009 11:41 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Dumping solr requests for indexing
> > 
> > The solr log, as well as the servlet container log should 
> > have them all.
> > 
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> > 
> > 
> > 
> > - Original Message 
> > > From: Teruhiko Kurosaka 
> > > To: "solr-user@lucene.apache.org" 
> > > Sent: Fri, December 4, 2009 2:23:17 PM
> > > Subject: Dumping solr requests for indexing
> > > 
> > > Is there any way to dump all incoming requests to Solr into a file?
> > > 
> > > My customer is seeing a strange problem of disappearing docs from 
> > > index and I'd like to ask them to capture all incoming requests.
> > > 
> > > Thanks.
> > > 
> > > -kuro
> > 
> > 



Answer: RE: Question: Write to Solr but not via http, and still store date_format

2009-12-04 Thread Peter 4U

Oops, of course the answer was staring me in the face!

   --> Use the EmbeddedSolrServer, rather than the CommonsHttpSolrServer.
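
A minimal sketch of the embedded setup (paths illustrative, per
http://wiki.apache.org/solr/Solrj):

  System.setProperty("solr.solr.home", "/path/to/solr/home");
  CoreContainer.Initializer initializer = new CoreContainer.Initializer();
  CoreContainer coreContainer = initializer.initialize();
  SolrServer server = new EmbeddedSolrServer(coreContainer, "");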

 

Live and learn. Live. and learn.

 

Thanks,

Peter

 


 
> From: pete...@hotmail.com
> To: solr-user@lucene.apache.org
> Subject: Question: Write to Solr but not via http, and still store date_format
> Date: Fri, 4 Dec 2009 20:09:19 +
> 
> 
> Hi Solr team,
> 
> 
> 
> Has anyone been able to write to Solr, keeping things like 'date_format', but 
> indexing directly, rather than via http?
> 
> 
> 
> I've been indexing using Lucene Java, and this works well and is very fast, 
> except that any data indexed this way doesn't store date_format et al 
> information (date.format results always return 0).
> 
> I like indexing directly into Lucene, rather than via http requests, as it is 
> much faster, particularly at very high input rates.
> 
> 
> 
> Anyone encountered this and managed to solve it?
> 
> 
> 
> Many thanks,
> 
> peter
> 
> 
> 
> _
> Got more than one Hotmail account? Save time by linking them together
> http://clk.atdmt.com/UKM/go/186394591/direct/01/
  
_
Got more than one Hotmail account? Save time by linking them together
 http://clk.atdmt.com/UKM/go/186394591/direct/01/

Re: Question: Write to Solr but not via http, and still store date_format

2009-12-04 Thread Otis Gospodnetic
Are you looking for http://wiki.apache.org/solr/EmbeddedSolr ?

 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
> From: Peter 4U 
> To: Solr 
> Sent: Fri, December 4, 2009 3:09:19 PM
> Subject: Question: Write to Solr but not via http, and still store date_format
> 
> 
> Hi Solr team,
> 
> 
> 
> Has anyone been able to write to Solr, keeping things like 'date_format', but 
> indexing directly, rather than via http?
> 
> 
> 
> I've been indexing using Lucene Java, and this works well and is very fast, 
> except that any data indexed this way doesn't store date_format et al 
> information (date.format results always return 0).
> 
> I like indexing directly into Lucene, rather than via http requests, as it is 
> much faster, particularly at very high input rates.
> 
> 
> 
> Anyone encountered this and managed to solve it?
> 
> 
> 
> Many thanks,
> 
> peter
> 
> 
>   
> _
> Got more than one Hotmail account? Save time by linking them together
> http://clk.atdmt.com/UKM/go/186394591/direct/01/



Question: Write to Solr but not via http, and still store date_format

2009-12-04 Thread Peter 4U

Hi Solr team,

 

Has anyone been able to write to Solr, keeping things like 'date_format', but 
indexing directly, rather than via http?

 

I've been indexing using Lucene Java, and this works well and is very fast, 
except that any data indexed this way doesn't store date_format et al 
information (date.format results always return 0).

I like indexing directly into Lucene, rather than via http requests, as it is 
much faster, particularly at very high input rates.

 

Anyone encountered this and managed to solve it?

 

Many thanks,

peter

 
  
_
Got more than one Hotmail account? Save time by linking them together
 http://clk.atdmt.com/UKM/go/186394591/direct/01/

RE: Dumping solr requests for indexing

2009-12-04 Thread Teruhiko Kurosaka
The log only gives high-level descriptions of what was done.
I'd like to capture the exact XML requests with data, so that
I could re-feed it to Solr to reproduce the issue my
customer is encountering.

-kuro  

> -Original Message-
> From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
> Sent: Friday, December 04, 2009 11:41 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Dumping solr requests for indexing
> 
> The solr log, as well as the servlet container log should 
> have them all.
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> 
> 
> 
> - Original Message 
> > From: Teruhiko Kurosaka 
> > To: "solr-user@lucene.apache.org" 
> > Sent: Fri, December 4, 2009 2:23:17 PM
> > Subject: Dumping solr requests for indexing
> > 
> > Is there any way to dump all incoming requests to Solr into a file?
> > 
> > My customer is seeing a strange problem of disappearing docs from 
> > index and I'd like to ask them to capture all incoming requests.
> > 
> > Thanks.
> > 
> > -kuro
> 
> 

Re: Best way to handle bitfields in solr...

2009-12-04 Thread Otis Gospodnetic
Would http://wiki.apache.org/solr/FunctionQuery#fieldvalue help?

 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
> From: William Pierce 
> To: solr-user@lucene.apache.org
> Sent: Fri, December 4, 2009 2:43:25 PM
> Subject: Best way to handle bitfields in solr...
> 
> Folks:
> 
> In my db I currently have fields that represent bitmasks.   Thus, for 
> example, a 
> value of the mask of 48 might represent an "undergraduate" (value = 16) and 
> "graduate" (value = 32).   Currently,  the corresponding field in solr is a 
> multi-valued string field called "EdLevel" which will have 
> Undergraduate and Graduate  as its two values (for 
> this example).   I do the conversion from the int into the list of values as 
> I 
> do the indexing.
> 
> Ideally, I'd like solr to have bitwise operations so that I could store the 
> int 
> value, and then simply search by using bit operations.  However, given that 
> this 
> is not possible,  and that there have been recent threads speaking to 
> performance issues with multi-valued fields,  is there something better I 
> could 
> do?
> 
> TIA,
> 
> - Bill



Best way to handle bitfields in solr...

2009-12-04 Thread William Pierce
Folks:

In my db I currently have fields that represent bitmasks.   Thus, for example, 
a value of the mask of 48 might represent an "undergraduate" (value = 16) and 
"graduate" (value = 32).   Currently,  the corresponding field in solr is a 
multi-valued string field called "EdLevel" which will have 
Undergraduate and Graduate  as its two values 
(for this example).   I do the conversion from the int into the list of values 
as I do the indexing.

Ideally, I'd like solr to have bitwise operations so that I could store the int 
value, and then simply search by using bit operations.  However, given that 
this is not possible,  and that there have been recent threads speaking to 
performance issues with multi-valued fields,  is there something better I could 
do?

TIA,

- Bill

Re: Dumping solr requests for indexing

2009-12-04 Thread Otis Gospodnetic
The solr log, as well as the servlet container log should have them all.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
> From: Teruhiko Kurosaka 
> To: "solr-user@lucene.apache.org" 
> Sent: Fri, December 4, 2009 2:23:17 PM
> Subject: Dumping solr requests for indexing
> 
> Is there any way to dump all incoming requests to Solr
> into a file?
> 
> My customer is seeing a strange problem of disappearing
> docs from index and I'd like to ask them to capture all
> incoming requests.
> 
> Thanks.
> 
> -kuro 



Dumping solr requests for indexing

2009-12-04 Thread Teruhiko Kurosaka
Is there any way to dump all incoming requests to Solr
into a file?

My customer is seeing a strange problem of disappearing
docs from index and I'd like to ask them to capture all
incoming requests.

Thanks.

-kuro 


Re: search on tomcat server

2009-12-04 Thread William Pierce

Have you gone through the solr tomcat wiki?

http://wiki.apache.org/solr/SolrTomcat

I found this very helpful when I did our solr installation on tomcat.

- Bill

--
From: "Jill Han" 
Sent: Friday, December 04, 2009 8:54 AM
To: 
Subject: RE: search on tomcat server

I went through all the links on 
http://wiki.apache.org/solr/#Search_and_Indexing

And I still have no clue as to how to proceed.
1. Do I have to do some implementation in order to get Solr to search docs 
on the Tomcat server?
2. If I have files such as .doc, .docx, .pdf, .jsp, .html, etc. under 
Windows XP, in c:/tomcat/webapps/test1, /webapps/test2,

   what should I do to make Solr search those directories?
3. Since I am using Tomcat instead of Jetty, is there any demo that shows 
the Solr searching features and real searching results?


Thanks,
Jill


-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
Sent: Monday, November 30, 2009 10:40 AM
To: solr-user@lucene.apache.org
Subject: Re: search on tomcat server

On Mon, Nov 30, 2009 at 9:55 PM, Jill Han  wrote:


I got solr running on the tomcat server,
http://localhost:8080/solr/admin/

After I enter a search word, such as, solr, then hit Search button, it
will go to

http://localhost:8080/solr/select/?q=solr&version=2.2&start=0&rows=10&indent=on

and display

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="rows">10</str>
      <str name="start">0</str>
      <str name="indent">on</str>
      <str name="q">solr</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>

 My question is what is the next step to search files on tomcat server?




Looks like you have not added any documents to Solr. See the "Indexing
Documents" section at http://wiki.apache.org/solr/#Search_and_Indexing

--
Regards,
Shalin Shekhar Mangar.



how to do auto-suggest case-insensitive match and return original case field values

2009-12-04 Thread hermida

Hi everyone,

New to forum and to Solr, doing my first major project with it and enjoying
it so far, great software.

In my web application I want to set up auto-suggest as you type
functionality which will search case-insensitively yet return the original
case terms.  It doesn't seem like TermsComponent can do this as it can only
return the lowercase indexed terms your are searching against, not the
original case terms.

There was one post on this forum 
http://old.nabble.com/Auto-suggest..-how-to-do-mixed-case-td2410.html#a24143981
where someone asked the same question, and the answer given was:

There is no way to do this right now using TermsComponent. You can index
lower case terms and store the mixed case terms. Then you can use a prefix
query which will return documents (and hence stored field values).

So this got me started, I set out to use Solr Query instead of
TermsComponent to try to do this.  I did the following as mentioned:

<fieldType name="text_lc" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="test" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="test_lc" type="text_lc" indexed="true" stored="false" multiValued="true"/>

And used copyField to populate the test_lc field:

<copyField source="test" dest="test_lc"/>

This is the easy part (the forum user didn't explain the hard part!) It is
very hard to get the same information that TermsComponent returns using the
regular Solr Query functionality!  For example:

http://localhost:8983/solr/terms?terms.fl=test_lc&terms.prefix=a&terms.sort=count&terms.limit=5&omitHeader=true

<int name="...">15</int>
<int name="...">6</int>
<int name="...">3</int>
<int name="...">1</int>
<int name="...">1</int>

which provides useful sorting by and returning of term frequency counts in
your index.  How does one get this same information with regular Solr Query? 
I set up the following prefix query, searching by the indexed lowercased
field and returning the other:

http://localhost:8983/solr/select?fl=test&q=test_lc%3Aa*&sort=score+desc&rows=5&omitHeader=true

<result name="response" numFound="..." start="0">
  <doc>
    <arr name="test">
      <str>3D-structure</str>
      <str>acetylation</str>
      <str>alternative promoter usage</str>
      <str>HLC-7</str>
    </arr>
  </doc>
  <doc>
    <arr name="test">
      <str>alternative splicing</str>
      <str>complete proteome</str>
      <str>DNA-binding</str>
      <str>RACK1</str>
    </arr>
  </doc>
  <doc>
    <arr name="test">
      <str>acetylation</str>
      <str>AIG21</str>
      <str>WD repeat</str>
      <str>GNB2L1</str>
    </arr>
  </doc>
  <doc>
    <arr name="test">
      <str>3D-structure</str>
      <str>apoptosis</str>
      <str>cathepsin G-like 1</str>
      <str>ATSGL1</str>
      <str>CTLA-1</str>
    </arr>
  </doc>
  <doc>
    <arr name="test">
      <str>autoantigen Ge-1</str>
      <str>autoantigen RCD-8</str>
      <str>HERV-H LTR-associating protein 3</str>
      <str>HHLA3</str>
    </arr>
  </doc>
</result>

I can see how to process this in my front-end app to extract the original
terms starting with the prefix letter(s) used in the query, but there are
still some major problems when compared to TermsComponent:

- How do I make sure my auto-suggest list is at least a certain number of
terms long?  Using rows of course doesn't work like terms.limit, because
the same term can appear in multiple returned docs and duplicates get
collapsed.
- How do I get term frequency counts like TermsComponent does?  I looked at
faceting but I don't understand how to get the TermsComponent behavior using
it.
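
(For what it's worth, plain faceting can get close to the counts; an untested
sketch:

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=test_lc&facet.prefix=a&facet.limit=5&facet.sort=count

returns the top five terms starting with "a" together with their frequencies,
but like TermsComponent it returns the indexed, lowercased terms, so the
original-case strings would still have to come from stored values.)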

Sorry for the long message, just wanted to fully explain, thanks for any
help!

leandro

-- 
View this message in context: 
http://old.nabble.com/how-to-do-auto-suggest-case-insensitive-match-and-return-original-case-field-values-tp26636365p26636365.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: High add/delete rate and index fragmentation

2009-12-04 Thread Otis Gospodnetic
Hello,

> You are right that we would need near realtime support. The problem is not
> so much about new records becoming available, but guaranteeing that deleted
> records will not be returned. For this reason, our plan would be to update
> and search a master index, provided that: (1) search while updating records
> is ok, 

It is in general, though I haven't fully tested NRT under high load.

> (2) performance is not degraded substantially due to fragmentation,

You can control that somewhat via mergeFactor.
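
e.g. in solrconfig.xml (the value is illustrative):

  <mergeFactor>10</mergeFactor>

A lower value means fewer segments and more merging up front; a higher value
defers the merge cost.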

> (3) optimization does not impact search, 

It will - disk IO, OS cache, and such will be affected, and that will affect 
search.

> and (4) we ensure durability - if a
> node goes down, an update was replicated to another node who can take over.

Maybe just index to > 1 masters?  For example, another non-search tool I'm 
using (Voldemort) has the notion of "required writes", which represents how 
many copies of data should be written at insert/add time.

> It seems that 1 and 2 are not so much of a problem, 3 would need to be
> tested. I would like to know more about how 4 has been addressed, so we don't
> lose updates if a master goes down between updates and index replication.

Lucene buffers documents while indexing, to avoid constant disk writes.  HDD 
itself does some of that, too.  So I think you can always lose some data if 
whatever is in the buffers doesn't get flushed when somebody trips over the 
power cord in the data center.

Otis

> > #3 is a mixed bag at this point, and there is no official
> > solution, yet. Shell scripts, and a load balancer could kind of
> > work. Check out SOLR-1277 or SOLR-1395 for progress along these
> > lines.
> >
> 
> Thanks for the links.
> 
> Rodrigo
> 
> 
> > On Wed, Dec 2, 2009 at 11:53 AM, Rodrigo De Castro 
> > wrote:
> > > We are considering Solr to store events which will be added and deleted
> > from
> > > the index in a very fast rate. Solr will be used, in this case, to find
> > the
> > > right event we need to process (since they may have several attributes
> > and
> > > we may search the best match based on the query attributes). Our
> > > understanding is that the common use cases are those wherein the read
> > rate
> > > is much higher than writes, and deletes are not as frequent, so we are
> > not
> > > sure Solr handles our use case very well or if it is the right fit. Given
> > > that, I have a few questions:
> > >
> > > 1 - How does Solr/Lucene degrade with the fragmentation? That would
> > probably
> > > determine the rate at which we would need to optimize the index. I
> > presume
> > > that it depends on the rate of insertions and deletions, but would you
> > have
> > > any benchmark on this degradation? Or, in general, how has been your
> > > experience with this use case?
> > >
> > > 2 - Optimizing seems to be a very expensive process. While optimizing the
> > > index, how much does search performance degrade? In this case, having a
> > huge
> > > degradation would not allow us to optimize unless we switch to another
> > copy
> > > of the index while optimize is running.
> > >
> > > 3 - In terms of high availability, what has been your experience
> > detecting
> > > failure of master and having a slave taking over?
> > >
> > > Thanks,
> > > Rodrigo
> > >
> >



Re: Grouping

2009-12-04 Thread Otis Gospodnetic
Not out of the box.  You could "group by" using SOLR-236 perhaps?

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
> From: Bruno 
> To: solr-user@lucene.apache.org
> Sent: Fri, December 4, 2009 1:08:59 PM
> Subject: Grouping
> 
> Is there a way to make a group by or distinct query?
> 
> -- 
> Bruno Morelli Vargas
> Mail: brun...@gmail.com
> Msn: brun...@hotmail.com
> Icq: 165055101
> Skype: morellibmv



Re: creating Lucene document from an external XML file.

2009-12-04 Thread Otis Gospodnetic
I think you'd have to dig into Solr (Lucene actually) to inject yourself after 
Analysis.  The UpdateRequestProcessor, as the name implies, is at the request 
level, so pretty high up/early on.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
> From: Phanindra Reva 
> To: solr-user@lucene.apache.org
> Sent: Fri, December 4, 2009 7:48:46 AM
> Subject: Re: creating Lucene document from an external XML file.
> 
> Hello,
>   You mentioned I can make use of the UpdateProcessor API.
> May I know when the flow of execution enters the
> UpdateRequestProcessor class? To be brief, it would be perfect for
> my case if it's after analysis but exactly before the document is added to
> the index.
> Thanks a lot.
> 
> On Wed, Dec 2, 2009 at 8:56 PM, Chris Hostetter
> wrote:
> >
> > : // "  one possibility to think about is that instead of modifying the 
> documents
> > : before sending them to Solr, you could write an UpdateProcessor tha runs
> > : direclty in Solr and gets access to those Documents after Solr has already
> > : parsed that XML (or even if the documents come from someplace else, like
> > : DIH, or a CSV file) and then make your changes. " //
> > :I have not decided to modify documents, instead I go for
> > : modifying them at run time. (modifying Java object's variables that
> > : contains information extracted from the document-file).
> > : my question is : Is there any part of the api which take document file
> > : path as input , returns java object and gives us a way to modify
> > : inbetween before sending the same object for indexing (to the
> > : IndexWriter - lucene api).
> >
> > Yes ... as i mentioned the UpdateProcessor API is where you have access to
> > the Documents as Lucene objects inside of Solr before they are indexed.
> >
> >
> >
> > -Hoss
> >
> >



Grouping

2009-12-04 Thread Bruno
Is there a way to make a group by or distinct query?

-- 
Bruno Morelli Vargas
Mail: brun...@gmail.com
Msn: brun...@hotmail.com
Icq: 165055101
Skype: morellibmv


Re: High add/delete rate and index fragmentation

2009-12-04 Thread Rodrigo De Castro
On Wed, Dec 2, 2009 at 2:43 PM, Jason Rutherglen  wrote:

> It sounds like you're asking about near realtime search support,
> I'm not sure.  So here are a few ideas.
>
> #1 How often do you need to be able to search on the latest
> updates (as opposed to updates from lets say, 10 minutes ago)?
>

You are right that we would need near realtime support. The problem is not
so much about new records becoming available, but guaranteeing that deleted
records will not be returned. For this reason, our plan would be to update
and search a master index, provided that: (1) search while updating records
is ok, (2) performance is not degraded substantially due to fragmentation,
(3) optimization does not impact search, and (4) we ensure durability - if a
node goes down, an update was replicated to another node who can take over.
It seems that 1 and 2 are not so much of a problem, 3 would need to be
tested. I would like to know more about how 4 has been addressed, so we don't
lose updates if a master goes down between updates and index replication.


> #3 is a mixed bag at this point, and there is no official
> solution, yet. Shell scripts, and a load balancer could kind of
> work. Check out SOLR-1277 or SOLR-1395 for progress along these
> lines.
>

Thanks for the links.

Rodrigo


> On Wed, Dec 2, 2009 at 11:53 AM, Rodrigo De Castro 
> wrote:
> > We are considering Solr to store events which will be added and deleted
> from
> > the index in a very fast rate. Solr will be used, in this case, to find
> the
> > right event we need to process (since they may have several attributes
> and
> > we may search the best match based on the query attributes). Our
> > understanding is that the common use cases are those wherein the read
> rate
> > is much higher than writes, and deletes are not as frequent, so we are
> not
> > sure Solr handles our use case very well or if it is the right fit. Given
> > that, I have a few questions:
> >
> > 1 - How does Solr/Lucene degrade with the fragmentation? That would
> probably
> > determine the rate at which we would need to optimize the index. I
> presume
> > that it depends on the rate of insertions and deletions, but would you
> have
> > any benchmark on this degradation? Or, in general, how has been your
> > experience with this use case?
> >
> > 2 - Optimizing seems to be a very expensive process. While optimizing the
> > index, how much does search performance degrade? In this case, having a
> huge
> > degradation would not allow us to optimize unless we switch to another
> copy
> > of the index while optimize is running.
> >
> > 3 - In terms of high availability, what has been your experience
> detecting
> > failure of master and having a slave taking over?
> >
> > Thanks,
> > Rodrigo
> >
>


Re: HTML Stripping slower in Solr 1.4?

2009-12-04 Thread Robin Wojciki
Thanks Koji for logging the ticket. I noticed its priority is set to
minor. Is there any work around? I feel like I am being half as
productive as every iteration is taking twice as much time.

Thanks
Robin

On Tue, Dec 1, 2009 at 11:47 AM, Koji Sekiguchi  wrote:
> Robin,
>
> Thank you for reporting this. Performance degradation of HTML Stripper
> could be in 1.4. I opened a ticket in Lucene:
>
> https://issues.apache.org/jira/browse/LUCENE-2098
>
> Koji
>
> --
> http://www.rondhuit.com/en/
>
>


Re: question about schemas (and SOLR-1131?)

2009-12-04 Thread wojtekpia

Could this be solved with a multi-valued custom field type (including a
custom comparator)? The OP's situation deals with multi-valuing products for
each customer. If products contain strictly numeric fields then it seems
like a custom field implementation (or extension of BinaryField?) *should*
be easy - only the comparator part needs work. I'm not clear on how the
existing query parsers would handle this though, so there's probably some
work there too. SOLR-1131 seems like a more general solution that supports
analysis that numeric fields don't need.


gdeconto wrote:
> 
> I saw an interesting thread in the solr-dev forum about multiple fields
> per fieldtype (https://issues.apache.org/jira/browse/SOLR-1131)
> 
> from the sounds of it, it might be of interest and/or use in these types
> of problems;  for your example, you might be able to define a fieldtype
> that houses the product data.
> 
> note that I only skimmed the thread. hopefully, I'll get get some time to
> look at it more closely
> 

-- 
View this message in context: 
http://old.nabble.com/question-about-schemas-tp26600956p26636170.html
Sent from the Solr - User mailing list archive at Nabble.com.



Solr 1.4: StringIndexOutOfBoundsException in SpellCheckComponent with HTMLStripCharFilterFactory

2009-12-04 Thread Robin Wojciki
I am running a search in Solr 1.4 and I am getting the
StringIndexOutOfBoundsException pasted below. The spell check field
uses HTMLStripCharFilterFactory. However, the search works fine if I
do not use the HTMLStripCharFilterFactory.

If I set a breakpoint at SpellCheckComponent.java: 248, the value of
the variable "best" is as shown in the screenshot:
http://yfrog.com/j5solrdebuginspectp

At the end of the first iteration, offset = 5 - (24 - 0) = -19.
This causes the index out of bounds exception.

The spell check field is defined as:

[fieldType definition stripped by the mail archive; per the note above, its
analyzer includes HTMLStripCharFilterFactory]

Stack Trace:
=
String index out of range: -19

java.lang.StringIndexOutOfBoundsException: String index out of range: -19
at java.lang.AbstractStringBuilder.replace(Unknown Source)
at java.lang.StringBuilder.replace(Unknown Source)
at 
org.apache.solr.handler.component.SpellCheckComponent.toNamedList(SpellCheckComponent.java:248)
at 
org.apache.solr.handler.component.SpellCheckComponent.process(SpellCheckComponent.java:143)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at 
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at 
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)


Re: High add/delete rate and index fragmentation

2009-12-04 Thread Rodrigo De Castro
On Thu, Dec 3, 2009 at 3:59 PM, Lance Norskog  wrote:

> #2: The standard architecture is with a master that only does indexing
> and one or more slaves that only handle queries. The slaves poll the
> master for index updates regularly. Solr 1.4 has a built-in system for
> this.
>


How do you achieve durability with the standard architecture? For one of our
use cases (which does not have much churn), we are considering this
architecture, but I don't want an update to be lost if the master goes down
before slaves update. What I was thinking initially is that this could be
achieved having a master per datacenter, which would synchronously update
other masters through a RequestHandler. So I could guarantee this
durability, but of course this architecture would have issues of its own,
like when there is a network partitioning, how you could handle masters no
longer being in sync. Is there some work being done to address this use
case?



> An alternate architecture has multiple servers which do both indexing
> and queries in the same index. This provides the shortest "pipeline"
> time from recieving the data to making it available for search.
>


For our use case where there is a high add/delete rate, I was thinking of
using this architecture, as I noticed that records become available right
away. But in this case we have the concern about how well it performs when
adding/deleting. I did an initial test adding many thousands of elements and
did not see any degradation, that's why I asked about its performance when
deleting records (since it only marks for deletion and we have some control
over the automatic segment merging, I guess this is not much of a problem).

Rodrigo


>
> On Wed, Dec 2, 2009 at 2:43 PM, Jason Rutherglen
>  wrote:
> > Rodrigo,
> >
> > It sounds like you're asking about near realtime search support,
> > I'm not sure.  So here are a few ideas.
> >
> > #1 How often do you need to be able to search on the latest
> > updates (as opposed to updates from lets say, 10 minutes ago)?
> >
> > To topic #2, Solr provides master slave replication. The
> > optimize would happen on the master and the new index files
> > replicated to the slave(s).
> >
> > #3 is a mixed bag at this point, and there is no official
> > solution, yet. Shell scripts, and a load balancer could kind of
> > work. Check out SOLR-1277 or SOLR-1395 for progress along these
> > lines.
> >
> > Jason
> > On Wed, Dec 2, 2009 at 11:53 AM, Rodrigo De Castro 
> wrote:
> >> We are considering Solr to store events which will be added and deleted
> from
> >> the index in a very fast rate. Solr will be used, in this case, to find
> the
> >> right event we need to process (since they may have several attributes
> and
> >> we may search the best match based on the query attributes). Our
> >> understanding is that the common use cases are those wherein the read
> rate
> >> is much higher than writes, and deletes are not as frequent, so we are
> not
> >> sure Solr handles our use case very well or if it is the right fit.
> Given
> >> that, I have a few questions:
> >>
> >> 1 - How does Solr/Lucene degrade with the fragmentation? That would
> probably
> >> determine the rate at which we would need to optimize the index. I
> presume
> >> that it depends on the rate of insertions and deletions, but would you
> have
> >> any benchmark on this degradation? Or, in general, how has been your
> >> experience with this use case?
> >>
> >> 2 - Optimizing seems to be a very expensive process. While optimizing
> the
> >> index, how much does search performance degrade? In this case, having a
> huge
> >> degradation would not allow us to optimize unless we switch to another
> copy
> >> of the index while optimize is running.
> >>
> >> 3 - In terms of high availability, what has been your experience
> detecting
> >> failure of master and having a slave taking over?
> >>
> >> Thanks,
> >> Rodrigo
> >>
> >
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Re: question about schemas

2009-12-04 Thread solr-user


Lance Norskog-2 wrote:
> 
> But, in general, this is a "shopping cart" database and Solr/Lucene may
> not be the best fit for this problem.
> 

True, every tool has strengths and weaknesses. Given how powerful Solr
appears to be, I would be surprised if I was not able to handle this use
case.


Lance Norskog-2 wrote:
> 
> You can make a separate facet field which contains a range of "buckets":
> 10, 20, 50, or 100 means that the field has a value 0-10, 11-20, 21-50, or
> 51-100. You could use a separate filter query with values for these
> buckets. Filter queries are very fast in Solr 1.4 and this would limit
> your range query execution to documents which match the buckets.
> 

Thank you for this suggestion.  I will look into this.

-- 
View this message in context: 
http://old.nabble.com/question-about-schemas-tp26600956p26636155.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: edismax using bigrams instead of phrases?

2009-12-04 Thread Bill Dueber
I see that edismax already defines pf (bigrams) and pf3 (trigrams) -- how
would folks think about just calling them pf / pf1 (aliases for each
other?), pf2, and pf3? The pf would then behave exactly as it does in
dismax.

And it sounds like the solution to my single-token fields is to just move
them into the query itself.

Thanks!

On Fri, Dec 4, 2009 at 11:58 AM, Yonik Seeley wrote:

> On Fri, Dec 4, 2009 at 11:26 AM, Bill Dueber  wrote:
> > I've started trying edismax, and have noticed that my relevancy ranking
> is
> > messed up with edismax because, according to the debug output, it's using
> > bigrams instead of phrases and inexplicably ignoring a couple of the pf
> > fields. While the hit count isn't changing,  this kills my ability to
> boost
> > exact title matches (or, I would guess, exact-anything-else matches,
> too).
>
> It's a feature in general - the problem with putting all the terms in
> a single phrase query is that you get no boosting at all if all of the
> terms don't appear.
>
> But since it may be useful as an option, perhaps we should add the
> single-phrase option to extended dismax as well.
>
> > edismax is also completely ignoring the title_a and title_ab fields,
> which
> > are defined as "exactmatcher" as follows.
>
> I believe this is because extended dismax only adds phrases for
> boosting... hence if a field type outputs a single token, it's
> considered redundant with the main query.  This is an optimization to
> speed up queries (esp single-word queries).
> Perhaps one way to fix this would be to check if the pf is in the qf
> list before removing single term phrases?
>
> -Yonik
> http://www.lucidimagination.com
>



-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: edismax using bigrams instead of phrases?

2009-12-04 Thread Yonik Seeley
On Fri, Dec 4, 2009 at 11:26 AM, Bill Dueber  wrote:
> I've started trying edismax, and have noticed that my relevancy ranking is
> messed up with edismax because, according to the debug output, it's using
> bigrams instead of phrases and inexplicably ignoring a couple of the pf
> fields. While the hit count isn't changing,  this kills my ability to boost
> exact title matches (or, I would guess, exact-anything-else matches, too).

It's a feature in general - the problem with putting all the terms in
a single phrase query is that you get no boosting at all if all of the
terms don't appear.

But since it may be useful as an option, perhaps we should add the
single-phrase option to extended dismax as well.

> edismax is also completely ignoring the title_a and title_ab fields, which
> are defined as "exactmatcher" as follows.

I believe this is because extended dismax only adds phrases for
boosting... hence if a field type outputs a single token, it's
considered redundant with the main query.  This is an optimization to
speed up queries (esp single-word queries).
Perhaps one way to fix this would be to check if the pf is in the qf
list before removing single term phrases?

-Yonik
http://www.lucidimagination.com


RE: search on tomcat server

2009-12-04 Thread Jill Han
I went through all the links on http://wiki.apache.org/solr/#Search_and_Indexing
And I still have no clue as to how to proceed.
1. Do I have to do some implementation in order to get Solr to search docs on 
the Tomcat server?
2. If I have files such as .doc, .docx, .pdf, .jsp, .html, etc. under Windows XP, 
in c:/tomcat/webapps/test1, /webapps/test2, 
   what should I do to make Solr search those directories?
3. Since I am using Tomcat instead of Jetty, is there any demo that shows the 
Solr searching features and real searching results?

Thanks,
Jill 


-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] 
Sent: Monday, November 30, 2009 10:40 AM
To: solr-user@lucene.apache.org
Subject: Re: search on tomcat server

On Mon, Nov 30, 2009 at 9:55 PM, Jill Han  wrote:

> I got solr running on the tomcat server,
> http://localhost:8080/solr/admin/
>
> After I enter a search word, such as, solr, then hit Search button, it
> will go to
>
> http://localhost:8080/solr/select/?q=solr&version=2.2&start=0&rows=10&indent=on
>
>  and display
>
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">0</int>
>     <lst name="params">
>       <str name="rows">10</str>
>       <str name="start">0</str>
>       <str name="indent">on</str>
>       <str name="q">solr</str>
>       <str name="version">2.2</str>
>     </lst>
>   </lst>
>   <result name="response" numFound="0" start="0"/>
> </response>
>
>  My question is what is the next step to search files on tomcat server?
>
>
>
Looks like you have not added any documents to Solr. See the "Indexing
Documents" section at http://wiki.apache.org/solr/#Search_and_Indexing

-- 
Regards,
Shalin Shekhar Mangar.


Re: how to get list of unique terms for a field

2009-12-04 Thread Bill Dueber
Here's a pretty simple perl script. Call it as "scriptname facetindex" (or
"scriptname facetindex maxnum")

#

#!/usr/local/bin/perl
use strict;
use JSON::XS;
use LWP::Simple;

### CHANGE THIS TO YOUR URL!! ###

my $select = 'http://solr-vufind:8026/solr/biblio/select';


# Get facet and (optional) maxnum from the command line
my ($facet, $num) = @ARGV;
$num ||= -1; # all values

my $url =
  "$select?q=*:*&rows=0&facet=true&facet.limit=$num&facet.field=$facet&wt=json&json.nl=arrarr";
my $json = decode_json(get($url));

foreach my $a (@{$json->{facet_counts}{facet_fields}{$facet}}) {
print $a->[0], "\n";
}


edismax using bigrams instead of phrases?

2009-12-04 Thread Bill Dueber
I've started trying edismax, and have noticed that my relevancy ranking is
messed up with edismax because, according to the debug output, it's using
bigrams instead of phrases and inexplicably ignoring a couple of the pf
fields. While the hit count isn't changing,  this kills my ability to boost
exact title matches (or, I would guess, exact-anything-else matches, too).

debugQuery output can be seen at:

http://paste.lisp.org/display/91582

That's the exact same query except for the defType.

Note that instead of looking in the 'pf' fields for the search string "gone
with the wind", it's looking individually for "gone with", "with the", and
"the wind".

edismax is also completely ignoring the title_a and title_ab fields, which
are defined as "exactmatcher" as follows.

[the "exactmatcher" fieldType definition was stripped by the mail archive]

Any help on this would be much appreciated.


-- 
Bill Dueber
Library Systems Programmer
University of Michigan Library


Re: latency in solr response is observed after index is updated

2009-12-04 Thread Bharath Venkatesh

Hi Kay Kay ,
 We have commented out the auto commit frequency in solrconfig.xml.

Below is the cache configuration:

[cache configuration XML stripped by the mail archive]

Will further requests after the index is updated wait for auto-warming to 
complete?

Thanks,
Bharath


Kay Kay wrote:
> What would be the average doc size?  What is the autoCommit frequency set in 
> solrconfig.xml?
>
> Another place to look at is the field cache size and the nature of warmup 
> queries run after a new searcher is created ( happens due to a commit ).
>
>
>
> Bharath Venkatesh wrote:
>> Hi Kalidoss,
>>  I am not aware of using solr-config for committing the document, but I 
>> have mentioned below how we update and commit documents:
>>  
>> curl http://solr_url/update --data-binary @feeds.xml -H 
>> 'Content-type:text/xml; charset=utf-8'
>> curl http://solr_url/update --data-binary '<commit/>' -H 
>> 'Content-type:text/xml; charset=utf-8'
>>
>> where feeds.xml contains the document in xml format
>>
>> we have master and slave replication for solr server.
>>
>> updates happens in master , snappuller and snapinstaller is run on slaves 
>> periodically
>> queries don't happen at master , only happens at slaves
>>
>> is there any thing which can be said from above information ?
>>
>> Thanks,
>> Bharath
>>
>>
>>
>> -Original Message-
>> From: kalidoss [mailto:kalidoss.muthuramalin...@sifycorp.com]
>> Sent: Tue 12/1/2009 2:38 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: latency in solr response  is observed  after index is updated
>>  
>> Are you using solr-config for committing the document?
>>
>> bharath venkatesh wrote:
>>  
>>> Hi,
>>>
>>> We are observing latency (sometimes huge latency, up to 10-20 secs) in 
>>> solr response after the index is updated. What's the reason for this latency 
>>> and how can it be minimized?
>>> Note: our index size is pretty large.
>>>
>>> any help would be appreciated as we largely affected by it
>>>
>>> Thanks in advance.
>>> Bharath
>>> 



Re: dismax query syntax to replace standard query

2009-12-04 Thread javaxmlsoapdev

Thanks. When I do it that way it gives me following query.

params={indent=on&start=0&q=risk+test&qt=dismax&fq=statusName:(Male+OR+Female)+name:"Joe"&hl=on&rows=10&version=2.2}
hits=63 status=0 QTime=54 

I typed in 'Risk test' (no quotes in the text) in the text field in the UI. I
want the search to AND the "statusName" and "name" attributes (all
attributes in the fq param).

Following is my dismax configuration in solrconfig.xml

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">title^2 description</str>
    <str name="pf">title description</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title description</str>
    <int name="hl.snippets">10</int>
    <str name="hl.alternateField">title</str>
    <str name="hl.fragmenter">regex</str>
  </lst>
</requestHandler>

And schema.xml has

<defaultSearchField>title</defaultSearchField>
<solrQueryParser defaultOperator="OR"/> -- when I change this to AND, it does
AND all params in fq and also does ANDing between words in the text field,
e.g. "risk+test", and doesn't return me results.

Basically I want to do ORing between the words in the "q" list and ANDing
between the params in the "fq" list.
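
(One thing worth checking, a guess based on the log line above: multiple
filter queries are intersected only when sent as separate fq parameters, e.g.

  ...&q=risk test&fq=statusName:(Male OR Female)&fq=name:"Joe"

whereas the log shows both clauses inside a single fq, where the schema's
default operator applies instead.)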

Any pointers would be appreciated.

Thanks,


isugar wrote:
> 
> I believe you need to use the fq parameter with dismax (not to be confused
> with qf) to add a "filter query" in addition to the q parameter.
> 
> So your text search value goes in q parameter (which searches on the
> fields
> you configure) and the rest of the query goes in the fq.
> 
> Would that work?
> 
> On Thu, Dec 3, 2009 at 7:28 PM, javaxmlsoapdev  wrote:
> 
>>
>> I have configured dismax handler to search against both "title" &
>> "description" fields now I have some other attributes on the page e.g.
>> "status", "name" etc. On the search page I have three fields for user to
>> input search values
>>
>> 1)Free text search field (which searchs against both "title" &
>> "description")
>> 2)Status (multi select dropdown)
>> 3)name(single select dropdown)
>>
>> I want to form query like textField1:value AND status:(Male OR Female)
>> AND
>> name:"abc". I know first (textField1:value searchs against both "title" &
>> "description" as that's how I have configured dixmax in the
>> configuration)
>> but not sure how I can AND other attributes (in my case "status" &
>> "name")
>>
>> note; standadquery looks like following (w/o using dixmax handler)
>> title:"test"description:"test"name:"Joe"statusName:(Male OR Female)
>> --
>> View this message in context:
>> http://old.nabble.com/dismax-query-syntax-to-replace-standard-query-tp26631725p26631725.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://old.nabble.com/dismax-query-syntax-to-replace-standard-query-tp26631725p26635928.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Debian Lenny + Apache Tomcat 5.5 + Solr 1.4

2009-12-04 Thread Yonik Seeley
Are you explicitly setting the heap sizes?  If not, the JVM is
deciding for itself based on what the box looks like (ram, cpus, OS,
etc).  Are they both the same architecture (32 bit or 64 bit?)
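
For Tomcat that usually means something like (values illustrative):

  JAVA_OPTS="-Xms512m -Xmx1024m"

in the startup environment, so the heap is pinned rather than auto-sized.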

-Yonik
http://www.lucidimagination.com

p.s. in general cross-posting to both solr-user and solr-dev is discouraged.


On Fri, Dec 4, 2009 at 5:27 AM, rajan chandi  wrote:
> Hi All,
>
> We've deployed 4 instances of Solr on a debian server.
>
> It is taking only 1.5 GB of RAM on local ubuntu machine but it is taking 2.0
> GB plus on Debian Lenny server.
>
> Any ideas/pointers will help.
>
> Regards
> Rajan
>


Re: how to get list of unique terms for a field

2009-12-04 Thread Erik Hatcher


On Dec 4, 2009, at 8:59 AM, Joel Nylund wrote:
lets say I have a field called countryName, is there a way to get a  
list of all the countries for this field? Trying to figure out a  
nice way to keep my categories and the solr results in sync, would  
be nice to get these from solr instead of the database.


A couple of ways, depending on what you want:

  1) faceting, as part of the search results, filtered within  
constraints (q/fq's):  &facet=on&facet.field=countryName


  2) TermsComponent: http://wiki.apache.org/solr/TermsComponent
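
For example (the /terms handler must be registered in solrconfig.xml):

  /solr/select?q=*:*&rows=0&facet=on&facet.field=countryName&facet.limit=-1
  /solr/terms?terms.fl=countryName&terms.limit=-1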

Erik



how to get list of unique terms for a field

2009-12-04 Thread Joel Nylund

Hi,

lets say I have a field called countryName, is there a way to get a  
list of all the countries for this field? Trying to figure out a nice  
way to keep my categories and the solr results in sync, would be nice  
to get these from solr instead of the database.


thanks
Joel



Re: Issues with alphanumeric search terms

2009-12-04 Thread Erick Erickson
as Ahmet says, you need to re-index.

Nothing about WordDelimiterFilterFactory alters case as far as I can tell
from
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

Are you applying this in addition to the LowerCaseTokenizerFactory? If so,
it's too late; the numbers have already been stripped...

Please get a copy of Luke and examine your index to see what actually
gets indexed, it'll give you a *much* better idea of what the various
analyzers actually put in your index.

Best
Erick

On Fri, Dec 4, 2009 at 6:57 AM, AHMET ARSLAN  wrote:

> > I have added
> > <filter class="solr.WordDelimiterFilterFactory" catenateAll="1" />
> > to both index and query but am still getting the same behaviour.
> >
> > Is there any other that i am missing?
> >
>
> Did you re-start tomcat and re-index? Why not use StandardTokenizerFactory?
>
>
>
>


Re: creating Lucene document from an external XML file.

2009-12-04 Thread Phanindra Reva
Hello,
  You mentioned I can make use of the UpdateProcessor API.
May I know when the flow of execution enters the
UpdateRequestProcessor class? To be brief, it would be perfect for
my case if it's after analysis but exactly before the document is added to
the index.
Thanks a lot.

On Wed, Dec 2, 2009 at 8:56 PM, Chris Hostetter
 wrote:
>
> : // "  one possibility to think about is that instead of modifying the 
> documents
> : before sending them to Solr, you could write an UpdateProcessor tha runs
> : direclty in Solr and gets access to those Documents after Solr has already
> : parsed that XML (or even if the documents come from someplace else, like
> : DIH, or a CSV file) and then make your changes. " //
> :        I have not decided to modify documents, instead I go for
> : modifying them at run time. (modifying Java object's variables that
> : contains information extracted from the document-file).
> : my question is : Is there any part of the api which take document file
> : path as input , returns java object and gives us a way to modify
> : inbetween before sending the same object for indexing (to the
> : IndexWriter - lucene api).
>
> Yes ... as i mentioned the UpdateProcessor API is where you have access to
> the Documents as Lucene objects inside of Solr before they are indexed.
>
>
>
> -Hoss
>
>


Re: Issues with alphanumeric search terms

2009-12-04 Thread AHMET ARSLAN
> I have added 
> <filter class="solr.WordDelimiterFilterFactory" catenateAll="1" />
> to both index and query but am still getting the same behaviour.
> 
> Is there any other that i am missing?
> 

Did you re-start tomcat and re-index? Why not use StandardTokenizerFactory?





Debian Lenny + Apache Tomcat 5.5 + Solr 1.4

2009-12-04 Thread rajan chandi
Hi All,

We've deployed 4 instances of Solr on a debian server.

It is taking only 1.5 GB of RAM on local ubuntu machine but it is taking 2.0
GB plus on Debian Lenny server.

Any ideas/pointers will help.

Regards
Rajan


Re: deleteById without solrj?

2009-12-04 Thread Erik Hatcher
Also note that the XML that can be POSTed to /solr/update can also be  
sent as a content stream on the URL for a plain GET request:


/solr/update?stream.body=...&commit=true
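
e.g. (the id value is made up):

/solr/update?stream.body=<delete><id>1234</id></delete>&commit=true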

Erik


On Dec 3, 2009, at 3:05 PM, Tom Hill wrote:


http://wiki.apache.org/solr/UpdateXmlMessages#A.22delete.22_by_ID_and_by_Query

On Thu, Dec 3, 2009 at 11:57 AM, Joel Nylund   
wrote:



Is there a url based approach to delete a document?

thanks
Joel