Re: Sentence level searching

2006-11-12 Thread Michael Imbeault
So basically it's just as I thought it was, thanks for the help :) I had 
checked the wiki before asking, but it lacks details and is often vague, 
or presupposes that you know some specific terms without 
explaining them. It's all clear now, thanks to you ;)


Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212





Re: Index & search questions; special cases

2006-11-12 Thread Michael Imbeault

Chris Hostetter wrote:

> A couple of things make your question really hard to answer ... first off,
> you can specify different analyzer chains for index time and query time --
> when dealing with the WordDelim filter (or the synonym filter) this is
> frequently necessary -- so the answers to your questions really depend on
> whether you use WordDelim at both index time and query time (or if you do
> use it in both cases, but configure it differently)
  

For clarification, I'm using the filter both at index and query time.

> Have you by any chance played with the "Analysis" page on your Solr index?
>
> http://localhost:8983/solr/admin/analysis.jsp?name=&verbose=on&highlight=on&qverbose=on&;
>
> ...it makes it really easy to see exactly how your various fields will get
> parsed at index time and query time.  I would also suggest you use the
> "debugQuery=on" option when doing some searches -- even if there aren't
> any documents in your index, that will help you see how your query is
> getting parsed and what Query structure QueryParser is building based on
> the tokens it gets from each of the Analyzers.
  
Will try that. I played with it in the past, but not for this particular 
problem. Good idea :)

: My thought is to parse the user query and rephrase it to do phrase
: searches on nearby terms containing single letters / numbers. If a user
: searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR
: ("1 hepatitis" AND hiv). Is it a sensible solution?

> that's kind of a strange behavior for a search application to have ... you
> might just want to trust that your users will be smart and if they find
> that 'HIV 1 hepatitis' is matching docs where "1" doesn't appear near
> "HIV" or "hepatitis" then they will start entering '"HIV 1" hepatitis' (or
> 'HIV "1 hepatitis"' if that's what they meant.)
  
Sadly I can't rely on users' smartness for this :) I'm concerned that 
for something like Hepatitis A, it will match just about every document 
containing hepatitis and the very common word 'a' anywhere in the 
document. I can't stopword single letters, because then there would be no 
way to find documents about 'hepatitis c' and not about 'hepatitis b', 
for example. I will test my solution and report back; if you have any other 
ideas, just tell me.


And thanks for the help! :)



Re: Index & search questions; special cases

2006-11-12 Thread Chris Hostetter

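: - Let's say I index "HIV-1" with <filter
: class="solr.WordDelimiterFilterFactory" generateWordParts="1"
: generateNumberParts="1" catenateWords="1" catenateNumbers="1"
: catenateAll="1"/>. Would a search on HIV AND 1 (or even HIV-1, which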
: after parsing by the above filter would yield HIV1 or HIV 1) also find
: documents which have HIV and the number "1" somewhere in the document,
: but not directly after HIV? If so, how should I fix this? I could boost
: score by proximity, but I'm doing a sort on date anyway, so I guess it
: would be pointless to do so.

A couple of things make your question really hard to answer ... first off,
you can specify different analyzer chains for index time and query time --
when dealing with the WordDelim filter (or the synonym filter) this is
frequently necessary -- so the answers to your questions really depend on
whether you use WordDelim at both index time and query time (or if you do
use it in both cases, but configure it differently)

Have you by any chance played with the "Analysis" page on your Solr index?

http://localhost:8983/solr/admin/analysis.jsp?name=&verbose=on&highlight=on&qverbose=on&;

...it makes it really easy to see exactly how your various fields will get
parsed at index time and query time.  I would also suggest you use the
"debugQuery=on" option when doing some searches -- even if there aren't
any documents in your index, that will help you see how your query is
getting parsed and what Query structure QueryParser is building based on
the tokens it gets from each of the Analyzers.
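
For instance, a debug request could be put together like this. It is only a
sketch in Python, assuming the example server on localhost:8983 and its
/solr/select handler; build_debug_url is an illustrative helper name, not
part of Solr.

```python
from urllib.parse import urlencode


def build_debug_url(query, base="http://localhost:8983/solr/select"):
    """Return a search URL that asks Solr to include query-parsing debug output."""
    params = {"q": query, "debugQuery": "on"}
    return base + "?" + urlencode(params)


# Paste the printed URL into a browser to see how the query was parsed.
print(build_debug_url('"HIV 1" AND hepatitis'))
```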

: - Somewhat related : Let's say I index "Polymyxin B". If I stopword
: single letters, would a phrase search ("Polymyxin B") still find the
: right documents (I don't think so, but still)? If not, I'll have to

depends on what the "right documents" are .. if you strip stopwords out
both at index time and at query time then it will ultimately match exactly
the same thing as a query on "Polymyxin" which i guess must be the "right
documents" since no documents will contain the letter "B" so what else
could be right? :)
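
A toy illustration of that point in Python. The analyze function is a
made-up stand-in for an analyzer chain that drops single-letter tokens; it
is not Solr's actual stop filter.

```python
def analyze(text):
    """Lowercase, split on whitespace, and drop single-letter tokens."""
    return [tok for tok in text.lower().split() if len(tok) > 1]


# The same analysis runs at index time and at query time,
# so the "B" is gone from both sides before any matching happens.
print(analyze("Polymyxin B"))
print(analyze("Polymyxin B") == analyze("Polymyxin"))
```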

: index single letters; how do I prevent the same problem as in the first
: question (i.e., a search on Polymyxin B yielding documents with
: Polymyxin and B, but not close to one another).
:
: My thought is to parse the user query and rephrase it to do phrase
: searches on nearby terms containing single letters / numbers. If a user
: searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR
: ("1 hepatitis" AND hiv). Is it a sensible solution?

that's kind of a strange behavior for a search application to have ... you
might just want to trust that your users will be smart and if they find
that 'HIV 1 hepatitis' is matching docs where "1" doesn't appear near
"HIV" or "hepatitis" then they will start entering '"HIV 1" hepatitis' (or
'HIV "1 hepatitis"' if that's what they meant.)




-Hoss



Re: Sentence level searching

2006-11-12 Thread Chris Hostetter

: Thanks for the answer Yonik; I forgot about multivalued fields! I'm not
: exactly sure of how to add multiple values to a single field (aside from
: copyField). The code I'm thinking of using :

If you look at the exampledocs, "features" and "cat" are both multivalued
fields... you just list multiple <field>s with the same name in your <doc>.

: Field in schema.xml : <field ... stored="false" multiValued="true" />
:
: Where am I supposed to configure the value of the gap?
: positionIncrementGap in the fieldtype definition is my guess, but I'm

correct.

: not sure. Also, am I supposed to put multivalued in the fieldtype
: definition? Alternatively, could I put positionIncrementGap in the
: <field> that I posted just above?

I *think* positionIncrementGap has to be set on the fieldtype ... but
I'm not 100% certain of that.

multiValued and the other field attributes (indexed, stored,
compressed, omitNorms) can be set on the field or inherited from the
fieldtype.

More info can be found in the comments of the example schema.xml, as well
as these wiki pages...

http://wiki.apache.org/solr/SchemaXml
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters


-Hoss



Re: Sentence level searching

2006-11-12 Thread Michael Imbeault

Hello everyone,

> Solr puts a configurable gap between values of the same field, so you
> could index every sentence as a separate value of a multi-valued
> field.

Thanks for the answer Yonik; I forgot about multivalued fields! I'm not 
exactly sure of how to add multiple values to a single field (aside from 
copyField). The code I'm thinking of using :


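   PHP code to build the XML

   // One <field name="abstract"> per sentence, all appended to the same doc.
   // $sentences is assumed to hold the abstract already split into sentences.
   foreach ($sentences as $sentence) {
       $abstract_element = $dom->createElement('field');
       $abstract_element->setAttribute('name', 'abstract');
       $abstract_text = $dom->createTextNode($sentence);
       $abstract_element->appendChild($abstract_text);
       // Append inside the loop, otherwise only the last sentence is kept.
       $doc->appendChild($abstract_element);
   }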

Field in schema.xml : <field ... stored="false" multiValued="true" />


Where am I supposed to configure the value of the gap? 
positionIncrementGap in the fieldtype definition is my guess, but I'm 
not sure. Also, am I supposed to put multivalued in the fieldtype 
definition? Alternatively, could I put positionIncrementGap in the 
<field> that I posted just above?


Thanks for the help,
Michael





Re: Sentence level searching

2006-11-12 Thread Yonik Seeley

On 11/12/06, Michael Imbeault <[EMAIL PROTECTED]> wrote:

> I'm trying to do some sentence-level searching with Solr; basically, I
> want to find if two words are in the same sentence. As I read on the
> Lucene mailing list, there are many ways to do this, including but not
> limited to :
>
> -inserting special boundary terms to denote the start and end of a
> sentence. It is unclear to me what kind of query should be used to fetch
> results from within one sentence (something like: start_sentence_token
> word1 word2 end_sentence_token)?


Span queries... but there isn't really query parser support for them.


> -increasing token position at a sentence boundary by a large factor
> (1000?) so that "x y"~500 (or more) won't match across sentence boundaries.


That's probably the easiest and simplest.
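
To see why a large gap keeps a sloppy phrase query inside one sentence, here
is a toy model in Python of how token positions could be laid out. It is
only a sketch: assign_positions and within_slop are made-up helper names,
whitespace splitting stands in for a real analyzer, and the distance check
ignores the extra slop Lucene charges for out-of-order terms.

```python
def assign_positions(sentences, gap):
    """Give every token a position; jump ahead by `gap` between sentences."""
    positions = []
    pos = -1
    for index, sentence in enumerate(sentences):
        if index > 0:
            pos += gap  # the artificial jump at a sentence boundary
        for token in sentence.lower().split():
            pos += 1
            positions.append((token, pos))
    return positions


def within_slop(positions, first, second, slop):
    """True if the two terms occur with at most `slop` positions between them."""
    first_pos = [p for t, p in positions if t == first]
    second_pos = [p for t, p in positions if t == second]
    return any(abs(a - b) - 1 <= slop for a in first_pos for b in second_pos)


sentences = ["HIV infects cells", "Hepatitis damages liver"]
positions = assign_positions(sentences, gap=1000)

print(within_slop(positions, "hiv", "cells", 500))        # same sentence
print(within_slop(positions, "cells", "hepatitis", 500))  # across the boundary
```

With a gap of 1000 and a slop of 500, the first check succeeds and the
second fails, because the boundary pushes "hepatitis" more than 500
positions away from "cells".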


> Is there an existing filter class that I could use to do this, or should
> I first parse my text fields with PHP and some NLP tool, and index the
> result (for the first case)? For the second case (increment token
> position), how should I do this within Solr?


Solr puts a configurable gap between values of the same field, so you
could index every sentence as a separate value of a multi-valued
field.
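
A minimal sketch of what such an update document could look like, built here
in Python with the standard library. The field names id and abstract, the
helper name build_add_xml, and the naive split on ". " are assumptions for
illustration; real sentence detection needs something smarter.

```python
import xml.etree.ElementTree as ET


def build_add_xml(doc_id, abstract):
    """Build an <add><doc> message with one 'abstract' field per sentence."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    id_field = ET.SubElement(doc, "field", name="id")
    id_field.text = doc_id

    # Naive sentence split, for illustration only.
    sentences = [s.strip() for s in abstract.split(". ") if s.strip()]
    for sentence in sentences:
        field = ET.SubElement(doc, "field", name="abstract")
        field.text = sentence

    return ET.tostring(add, encoding="unicode")


print(build_add_xml("1", "HIV-1 infects T cells. Hepatitis B affects the liver."))
```

Each sentence ends up as its own value of the same field, so the gap
configured on the field type applies between sentences.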

A better solution would be to have a tokenizer that could detect the
end of sentences and either insert a gap or a special token that
another filter could act on.

-Yonik


Index & search questions; special cases

2006-11-12 Thread Michael Imbeault

Hello again,

- Let's say I index "HIV-1" with <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
generateNumberParts="1" catenateWords="1" catenateNumbers="1" 
catenateAll="1"/>. Would a search on HIV AND 1 (or even HIV-1, which 
after parsing by the above filter would yield HIV1 or HIV 1) also find 
documents which have HIV and the number "1" somewhere in the document, 
but not directly after HIV? If so, how should I fix this? I could boost 
score by proximity, but I'm doing a sort on date anyway, so I guess it 
would be pointless to do so.


- Somewhat related : Let's say I index "Polymyxin B". If I stopword 
single letters, would a phrase search ("Polymyxin B") still find the 
right documents (I don't think so, but still)? If not, I'll have to 
index single letters; how do I prevent the same problem as in the first 
question (i.e., a search on Polymyxin B yielding documents with 
Polymyxin and B, but not close to one another).


My thought is to parse the user query and rephrase it to do phrase 
searches on nearby terms containing single letters / numbers. If a user 
searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR 
("1 hepatitis" AND hiv). Is it a sensible solution?
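
A rough sketch of such a rewrite, in Python. The helper name rewrite_query
and the rule "a one-character term is glued to each multi-character
neighbour" are assumptions made for illustration; terms keep the case the
user typed, and queries with several one-character terms are only handled
one term at a time.

```python
def rewrite_query(query):
    """Turn one-character terms into phrase clauses with their neighbours."""
    terms = query.split()
    short = [i for i, t in enumerate(terms) if len(t) == 1]
    clauses = []
    for i in short:
        for j in (i - 1, i + 1):  # neighbour before, then neighbour after
            if 0 <= j < len(terms) and len(terms[j]) > 1:
                lo, hi = min(i, j), max(i, j)
                phrase = f'"{terms[lo]} {terms[hi]}"'
                rest = [t for k, t in enumerate(terms) if k not in (i, j)]
                clauses.append("(" + " AND ".join([phrase] + rest) + ")")
    if not clauses:
        # Nothing to glue together: fall back to a plain AND query.
        return " AND ".join(terms)
    return " OR ".join(clauses)


print(rewrite_query("HIV 1 hepatitis"))
```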


Thanks,

--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Sentence level searching

2006-11-12 Thread Michael Imbeault

Hello everyone,

I'm trying to do some sentence-level searching with Solr; basically, I 
want to find if two words are in the same sentence. As I read on the 
Lucene mailing list, there are many ways to do this, including but not 
limited to :


-inserting special boundary terms to denote the start and end of a 
sentence. It is unclear to me what kind of query should be used to fetch 
results from within one sentence (something like: start_sentence_token 
word1 word2 end_sentence_token)?

-increasing token position at a sentence boundary by a large factor 
(1000?) so that "x y"~500 (or more) won't match across sentence boundaries.


Is there an existing filter class that I could use to do this, or should 
I first parse my text fields with PHP and some NLP tool, and index the 
result (for the first case)? For the second case (increment token 
position), how should I do this within Solr?


Are there any plans to implement such functionality as standard?

Thanks for the help,

--
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212



Re: Tomcat SecurityManager config for SOLR

2006-11-12 Thread Wade Leftwich
Chris Hostetter wrote:
> : log. I have not been able to find much in the way of a howto for
> : SecurityManager, and am pretty much groping in the dark at this point.
> 
> I found this...
>   http://tomcat.apache.org/tomcat-5.5-doc/security-manager-howto.html
> ...but I'm afraid I don't have much concrete advice to offer you, i've
> never run any servlet container with a security manager, mainly because
> i've never run a WAR i didn't trust completely...
> 
> to start with, it looks like you're going to want to enable JndiPermission
> (since that's how you are setting the solr home) and you should enable
> read FilePermission for the entire solr home dir, and write FilePermission
> for the data dir.  if you want to use snapshooting/snappulling you'll also
> need RuntimePermission.  i think that's probably it ... but according
> to that doc, the best way to figure out what permissions you need to turn
> on if something isn't working, is with the -Djava.security.debug=all
> option.
> 
> if you have any luck with this, then by all means please update the
> SolrTomcat wiki with what you find.
> 
> 
> -Hoss
> 
> 

Hoss, thanks for the Saturday night reply. Open Source does indeed rule.

I've gotten as far as making Solr completely trusted, which I think will
be sufficient for my own purposes, since only people with root access
can touch the Solr directories. Here's the command I added to
catalina.policy:

grant codeBase "file:${catalina.home}/webapps/solr/-" {
    permission java.security.AllPermission;
};

-- Wade Leftwich
Ithaca, NY



Re: Jetty for production?

2006-11-12 Thread Panayiotis Papadopoulos

Personally I have deployed SOLR within a PHP, AJAX framework.
I have just deployed Jetty for SOLR and I created a PHP wrapper so 
that I can send XML docs to SOLR and get JSON back. Besides that I have 
filtered the traffic to Jetty so that only the PHP wrapper can access it. So 
it is super easy, super fast and quite secure...