Re: issue with highlighting in solr 4.10.2

2015-06-29 Thread Dmitry Kan
Hi Erick,

The Contents field contains only one sentence, and "watch" does not occur
in it. We also use a snippet size large enough to cover the whole field.

Dmitry

On Sat, Jun 27, 2015 at 6:16 PM, Erick Erickson 
wrote:

> Does watch exist in the Contents field somewhere outside the snippet
> size you've specified?
>
> Shot in the dark,
> Erick
>
> On Fri, Jun 26, 2015 at 3:22 AM, Dmitry Kan  wrote:
> > Hi,
> >
> > When highlighting hits for the following query:
> >
> > (+Contents:apple +Contents:watch) Contents:iphone
> >
> > I expect the standard solr highlighter to highlight either iphone or
> iphone
> > AND apple, only if watch is present.
> >
> > However, solr highlights iphone along with only apple. Is this a bug or a
> > known feature? Is there any way to debug the highlighter using solr
> admin?
> >
> > --
> > Dmitry Kan
> > Luke Toolbox: http://github.com/DmitryKey/luke
> > Blog: http://dmitrykan.blogspot.com
> > Twitter: http://twitter.com/dmitrykan
> > SemanticAnalyzer: www.semanticanalyzer.info
>



-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Re: issue with highlighting in solr 4.10.2

2015-06-27 Thread Erick Erickson
Does watch exist in the Contents field somewhere outside the snippet
size you've specified?

Shot in the dark,
Erick

On Fri, Jun 26, 2015 at 3:22 AM, Dmitry Kan  wrote:
> Hi,
>
> When highlighting hits for the following query:
>
> (+Contents:apple +Contents:watch) Contents:iphone
>
> I expect the standard solr highlighter to highlight either iphone or iphone
> AND apple, only if watch is present.
>
> However, solr highlights iphone along with only apple. Is this a bug or a
> known feature? Is there any way to debug the highlighter using solr admin?
>
> --
> Dmitry Kan
> Luke Toolbox: http://github.com/DmitryKey/luke
> Blog: http://dmitrykan.blogspot.com
> Twitter: http://twitter.com/dmitrykan
> SemanticAnalyzer: www.semanticanalyzer.info


issue with highlighting in solr 4.10.2

2015-06-26 Thread Dmitry Kan
Hi,

When highlighting hits for the following query:

(+Contents:apple +Contents:watch) Contents:iphone

I expect the standard Solr highlighter to highlight either iphone alone, or
iphone and apple together, with apple highlighted only if watch is also
present.

However, Solr highlights apple along with iphone even when watch is absent.
Is this a bug or a known feature? Is there any way to debug the highlighter
using the Solr admin UI?

-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: www.semanticanalyzer.info


Re: Highlighting in Solr

2015-04-26 Thread Zheng Lin Edwin Yeo
I suppose that currently the only way to show the highlighting snippets in
XML and JSON output is via a separate section at the bottom, and that it is
not possible to show the highlighted snippets together with the rest of the
response?

Regards,
Edwin
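
For readers of the archive: since the snippets arrive in a separate
"highlighting" section keyed by document id, one option is to join the two
sections on the client side. The following is only a minimal sketch, assuming
the JSON response has already been parsed into a dict shaped like the excerpt
in this thread. The function name, the "highlights" key, and the <em> markers
in the sample data are assumptions made for illustration, not something the
thread confirms.

```python
def merge_highlighting(solr_response, id_field="id", target="highlights"):
    """Return copies of the docs with each doc's snippets attached.

    `target` is a hypothetical key name chosen for this sketch.
    """
    snippets_by_id = solr_response.get("highlighting", {})
    merged = []
    for doc in solr_response["response"]["docs"]:
        combined = dict(doc)  # copy, so the parsed response is left untouched
        # The highlighting section is keyed by the unique key rendered as a string.
        combined[target] = snippets_by_id.get(str(doc[id_field]), {})
        merged.append(combined)
    return merged


# Sample data modelled on the excerpt above; the <em> tags are assumed.
sample = {
    "response": {"numFound": 1, "start": 0, "docs": [
        {"id": "1-1", "Summary": "i) Trial conducted", "Content": "Completed"},
    ]},
    "highlighting": {"1-1": {"Summary": ["i) Trial <em>conducted</em>"]}},
}

if __name__ == "__main__":
    for doc in merge_highlighting(sample):
        print(doc["id"], doc["highlights"])
```

Documents without any snippet simply get an empty dict, so the caller can fall
back to the stored field value.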


On 22 April 2015 at 21:57, Zheng Lin Edwin Yeo  wrote:

> Hi,
>
> I'm currently implementing highlighting on my Solr-5.0.0. When I issue the
> following command:
> http://localhost:8983/solr/collection1/select?q=conducted
> 
> &hl=true&hl.fl=Content,Summary&wt=json&indent=true&rows=10,
> the highlighting result is listed at the bottom of the output, instead of
> together with the rest of the response above. The result is shown below:
>
>   "response":{"numFound":10,"start":0,"docs":[
>   {
> "id":"1-1",
> "Summary":"i} Trial conducted",
> "Content":"Completed",
> "_version_":1498407036159787020},
>
>
>   "highlighting":{
> "1-1":{
>   "Summary":["i) Trial conducted"]}
>
>
> Is there any way to get the highlighted output to be displayed together with 
> the rest of the response, instead of having it display separately at the 
> bottom? Which is something like this
>
>
>   "response":{"numFound":10,"start":0,"docs":[
>   {
> "id":"1-1",
> "Summary":"i} Trial conducted",
> "Content":"Completed",
> "_version_":1498407036159787020},
>
>
> Regards,
> Edwin
>
>


Highlighting in Solr

2015-04-22 Thread Zheng Lin Edwin Yeo
Hi,

I'm currently implementing highlighting on Solr 5.0.0. When I issue the
following request:

http://localhost:8983/solr/collection1/select?q=conducted&hl=true&hl.fl=Content,Summary&wt=json&indent=true&rows=10

the highlighting result is listed at the bottom of the output, instead of
together with the rest of the response above. The result is shown below:

  "response":{"numFound":10,"start":0,"docs":[
      {
        "id":"1-1",
        "Summary":"i} Trial conducted",
        "Content":"Completed",
        "_version_":1498407036159787020},


  "highlighting":{
    "1-1":{
      "Summary":["i) Trial conducted"]}


Is there any way to get the highlighted output displayed together with
the rest of the response, instead of having it displayed separately at
the bottom? Something like this:


  "response":{"numFound":10,"start":0,"docs":[
      {
        "id":"1-1",
        "Summary":"i} Trial conducted",
        "Content":"Completed",
        "_version_":1498407036159787020},


Regards,
Edwin


Re: Inconsistent highlighting in Solr

2013-11-28 Thread harsh kapoor
Hi Ahmet,

Now things are making sense. Thank you for your reply.


On Thu, Nov 28, 2013 at 3:26 PM, Ahmet Arslan  wrote:

> Hi Hars,
>
> Highlighted text samples are matching because of
>  WordDelimiterFilterFactory splits them. You can see/test the behaviour of
> your fieldType name="text" at analysis page.
>
>
>
> On Thursday, November 28, 2013 11:51 AM, harsh kapoor <
> harshlnm...@gmail.com> wrote:
>
> Hi Ahmet,
>
> Thanks for your reply but i am still not clear on this.Why highlighting
> occurs in text (fernando_*alonso, *Fernando*Alonso*(CamelCase) ) these are
> also words and Solr is highlighting inside words.
>
> But no highlighting takes place in lowercase 'fernandoalonso'. why is this?
>
>
>
>
>
> On Thu, Nov 28, 2013 at 2:58 PM, Ahmet Arslan  wrote:
>
> > Hi Harsh,
> >
> > Your query 'alonso' is not matching the text in your non-highlighted
> > instance examples. Thats why they are not highlighted. It seems that you
> > want to be able to search inside words too. You can use wildcard operator
> > for this. Please see for similar discussion:
> > http://search-lucene.com/m/HiKY02e1KgI1
> >
> >
> >
> > On Thursday, November 28, 2013 10:57 AM, harsh kapoor <
> > harshlnm...@gmail.com> wrote:
> >
> > I have indexed data using Solr.I want to highlight matched keyword in
> > search results. highlighting is inconsistent.
> > eg. if search keyword is 'alonso'.
> >
> > highlighted instances are:
> *Alonso*,fernando_*alonso*,**#Alonso**MeetVettel
> >
> > non-highlightes instances are : @fernandoalonso, www.alonsodriver.com
> >
> > Can anyone tell me why is that?
> >
> > I am using this configuration-
> >
> >positionIncrementGap="100">
> >   
> > 
> >  > words="stopwords.txt" enablePositionIncrements="true"/>
> >  > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> > 
> >  > language="English" protected="protwords.txt"/>
> >   
> >   
> > 
> >  > words="stopwords.txt" enablePositionIncrements="true"/>
> >  > generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> > 
> >  > language="English" protected="protwords.txt"/>
> >   
> > 
> >
> > --
> > Harsh Kapoor
> > Developer
> > Serendio Softwares Pvt ltd.
> > Contact: 7401551935,9571702158
>
> >
>
>
>
> --
> Harsh Kapoor
> Developer
> Serendio Softwares Pvt ltd.
> Contact: 7401551935,9571702158
>



-- 
Harsh Kapoor
Developer
Serendio Softwares Pvt ltd.
Contact: 7401551935,9571702158


Re: Inconsistent highlighting in Solr

2013-11-28 Thread Ahmet Arslan
Hi Harsh,

The highlighted text samples match because WordDelimiterFilterFactory
splits them into separate tokens. You can see and test the behaviour of your
fieldType name="text" on the analysis page.
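
To make the splitting concrete, here is a rough simulation of the behaviour
described above. It is not the real WordDelimiterFilterFactory (which has many
more options, and the stop word and stemming filters are ignored here); it
only assumes that tokens are split on non-alphanumeric characters and on
lower-to-upper case changes, as generateWordParts="1" and
splitOnCaseChange="1" suggest, and then lowercased. The function name is made
up for this sketch.

```python
import re

# One "word part": an acronym run, a capitalised or lowercase word, or digits.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def word_parts(token):
    """Approximate the sub-tokens a word-delimiter style filter would emit."""
    parts = []
    for chunk in re.split(r"[^A-Za-z0-9]+", token):  # split on delimiters
        parts.extend(_PART.findall(chunk))           # split on case changes
    return [p.lower() for p in parts]


if __name__ == "__main__":
    for sample in ["Alonso", "fernando_alonso", "FernandoAlonso",
                   "#AlonsoMeetVettel", "@fernandoalonso",
                   "www.alonsodriver.com"]:
        parts = word_parts(sample)
        print(sample, parts, "matches" if "alonso" in parts else "no match")
```

Under these assumptions "fernando_alonso" and "FernandoAlonso" both produce a
separate token "alonso", while "@fernandoalonso" and "www.alonsodriver.com"
only produce "fernandoalonso" and "alonsodriver", which a query for 'alonso'
does not match.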



On Thursday, November 28, 2013 11:51 AM, harsh kapoor  
wrote:
 
Hi Ahmet,

Thanks for your reply but i am still not clear on this.Why highlighting
occurs in text (fernando_*alonso, *Fernando*Alonso*(CamelCase) ) these are
also words and Solr is highlighting inside words.

But no highlighting takes place in lowercase 'fernandoalonso'. why is this?





On Thu, Nov 28, 2013 at 2:58 PM, Ahmet Arslan  wrote:

> Hi Harsh,
>
> Your query 'alonso' is not matching the text in your non-highlighted
> instance examples. Thats why they are not highlighted. It seems that you
> want to be able to search inside words too. You can use wildcard operator
> for this. Please see for similar discussion:
> http://search-lucene.com/m/HiKY02e1KgI1
>
>
>
> On Thursday, November 28, 2013 10:57 AM, harsh kapoor <
> harshlnm...@gmail.com> wrote:
>
> I have indexed data using Solr.I want to highlight matched keyword in
> search results. highlighting is inconsistent.
> eg. if search keyword is 'alonso'.
>
> highlighted instances are: *Alonso*,fernando_*alonso*,**#Alonso**MeetVettel
>
> non-highlightes instances are : @fernandoalonso, www.alonsodriver.com
>
> Can anyone tell me why is that?
>
> I am using this configuration-
>
>   
>   
>     
>      words="stopwords.txt" enablePositionIncrements="true"/>
>      generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>     
>      language="English" protected="protwords.txt"/>
>   
>   
>     
>      words="stopwords.txt" enablePositionIncrements="true"/>
>      generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>     
>      language="English" protected="protwords.txt"/>
>   
> 
>
> --
> Harsh Kapoor
> Developer
> Serendio Softwares Pvt ltd.
> Contact: 7401551935,9571702158

>



-- 
Harsh Kapoor
Developer
Serendio Softwares Pvt ltd.
Contact: 7401551935,9571702158

Re: Inconsistent highlighting in Solr

2013-11-28 Thread harsh kapoor
Hi Ahmet,

Thanks for your reply, but I am still not clear on this. Why does
highlighting occur in text such as fernando_*alonso* and Fernando*Alonso*
(CamelCase)? These are also single words, yet Solr is highlighting inside
them.

But no highlighting takes place in the lowercase 'fernandoalonso'. Why is this?





On Thu, Nov 28, 2013 at 2:58 PM, Ahmet Arslan  wrote:

> Hi Harsh,
>
> Your query 'alonso' is not matching the text in your non-highlighted
> instance examples. Thats why they are not highlighted. It seems that you
> want to be able to search inside words too. You can use wildcard operator
> for this. Please see for similar discussion:
> http://search-lucene.com/m/HiKY02e1KgI1
>
>
>
> On Thursday, November 28, 2013 10:57 AM, harsh kapoor <
> harshlnm...@gmail.com> wrote:
>
> I have indexed data using Solr.I want to highlight matched keyword in
> search results. highlighting is inconsistent.
> eg. if search keyword is 'alonso'.
>
> highlighted instances are: *Alonso*,fernando_*alonso*,**#Alonso**MeetVettel
>
> non-highlightes instances are : @fernandoalonso, www.alonsodriver.com
>
> Can anyone tell me why is that?
>
> I am using this configuration-
>
>   
>   
> 
>  words="stopwords.txt" enablePositionIncrements="true"/>
>  generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> 
>  language="English" protected="protwords.txt"/>
>   
>   
> 
>  words="stopwords.txt" enablePositionIncrements="true"/>
>  generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> 
>  language="English" protected="protwords.txt"/>
>   
> 
>
> --
> Harsh Kapoor
> Developer
> Serendio Softwares Pvt ltd.
> Contact: 7401551935,9571702158
>



-- 
Harsh Kapoor
Developer
Serendio Softwares Pvt ltd.
Contact: 7401551935,9571702158


Re: Inconsistent highlighting in Solr

2013-11-28 Thread Ahmet Arslan
Hi Harsh,

Your query 'alonso' does not match the text in your non-highlighted
examples. That's why they are not highlighted. It seems that you want to be
able to search inside words too. You can use the wildcard operator for this.
For a similar discussion, please see: http://search-lucene.com/m/HiKY02e1KgI1
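
As an illustration of the wildcard suggestion, the sketch below builds a
select URL whose query wraps the term in wildcards. The base URL and the field
name "text" are assumptions (the thread only names the fieldType), and whether
wildcard matches are actually highlighted depends on the highlighter settings
of the installation. Leading wildcards can also be slow on large indexes.

```python
from urllib.parse import urlencode


def wildcard_highlight_url(base_url, field, term):
    """Build a select URL that searches inside words and asks for highlights."""
    params = {
        "q": "%s:*%s*" % (field, term),  # leading and trailing wildcard
        "hl": "true",
        "hl.fl": field,
        "wt": "json",
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical host, core and field name.
    print(wildcard_highlight_url(
        "http://localhost:8983/solr/collection1/select", "text", "alonso"))
```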



On Thursday, November 28, 2013 10:57 AM, harsh kapoor  
wrote:
 
I have indexed data using Solr.I want to highlight matched keyword in
search results. highlighting is inconsistent.
eg. if search keyword is 'alonso'.

highlighted instances are: *Alonso*,fernando_*alonso*,**#Alonso**MeetVettel

non-highlightes instances are : @fernandoalonso, www.alonsodriver.com

Can anyone tell me why is that?

I am using this configuration-

  
  
    
    
    
    
    
  
  
    
    
    
    
    
  


-- 
Harsh Kapoor
Developer
Serendio Softwares Pvt ltd.
Contact: 7401551935,9571702158

Inconsistent highlighting in Solr

2013-11-28 Thread harsh kapoor
I have indexed data using Solr. I want to highlight the matched keyword in
search results, but the highlighting is inconsistent.
E.g. if the search keyword is 'alonso':

highlighted instances are: *Alonso*,fernando_*alonso*,**#Alonso**MeetVettel

non-highlighted instances are: @fernandoalonso, www.alonsodriver.com

Can anyone tell me why that is?

I am using this configuration-

  
  





  
  





  


-- 
Harsh Kapoor
Developer
Serendio Softwares Pvt ltd.
Contact: 7401551935,9571702158


Re: how to do sorting on no. of highlighting in solr

2011-09-08 Thread lboutros
Hi,

It is possible to create a new similarity class which returns the term
occurrences.
You have to disable IDF (just return 1), normalization and so on.

Then you have to declare it in your schema:

http://wiki.apache.org/solr/SchemaXml#Similarity
http://wiki.apache.org/solr/SolrPlugins#Similarity


We did something like that for one particular project.


Ludovic.
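
For readers unfamiliar with the idea: with IDF fixed at 1 and normalization
disabled, the score of a single-term query is driven by how often the term
occurs in the document. The sketch below is not the Lucene Similarity API; it
only imitates the resulting ordering on the client side, using made-up
function names and a naive word tokenizer.

```python
import re


def term_occurrences(text, term):
    """Count exact (case-insensitive) occurrences of `term` among the words."""
    wanted = term.lower()
    return sum(1 for tok in re.findall(r"\w+", text.lower()) if tok == wanted)


def sort_by_occurrences(docs, field, term):
    """Order docs by raw term frequency, highest first."""
    return sorted(docs,
                  key=lambda d: term_occurrences(d.get(field, ""), term),
                  reverse=True)


if __name__ == "__main__":
    docs = [
        {"id": "a", "text": "solr"},
        {"id": "b", "text": "solr solr lucene solr"},
        {"id": "c", "text": "lucene"},
    ]
    print([d["id"] for d in sort_by_occurrences(docs, "text", "solr")])
```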

-
Jouve
France.


Re: Highlighting in solr

2011-05-24 Thread Darren Govoni
You should be able to retrieve the snippets in your search engine and
combine or format them however you like before returning the results to
your client, right?

So in your middle tier, you invoke Solr with a search, get the results,
retrieve the snippets, iterate over them and format them to your needs, then
return the result. That's basically what I do for mine.
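
The middle-tier step described above can be sketched as follows. This is an
illustration in Python rather than SolrNet code; it assumes the snippets for
one field have already been pulled out of the response as a list of strings,
and the function name and defaults are made up.

```python
def format_snippets(snippets, separator=" ... ", max_snippets=3):
    """Join highlight snippets into one display string, Google style."""
    # Collapse internal whitespace and drop empty snippets before joining.
    cleaned = [" ".join(s.split()) for s in snippets if s and s.strip()]
    return separator.join(cleaned[:max_snippets])


if __name__ == "__main__":
    print(format_snippets(["Toshiba GSM cellphones.",
                           "  Toshiba   phones. ",
                           ""]))
```

Limiting the number of snippets keeps the result summary short; the limit of
three is arbitrary.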


On Tue, 2011-05-24 at 16:38 +0530, Vignesh Raj wrote:

> Hi,
> 
> I use Solrnet to develop a search engine. In my application, I have a field
> called file_contents which I use for highlighting. Am able to get the
> highlights without a problem. Now I need to format it. For example, if the
> keyword occurs multiple times in the field, I have to display it lime what
> google does. May be like this.
> 
> GSMArena.com: Toshiba GSM cellphones. ... Toshiba phones. Toshiba. Filters.
> Available . Coming soon . Smartphone . Touchscreen . Camera . Bluetooth .
> Wi-Fi ...
> 
> Multiple snippets here are separated by "...". 
> 
> I need to achieve something like this. Am able to get multiple snippets. But
> how do I deal with the separating?
> 
> Regards
> 
> Vignesh
> 
>  
> 




Highlighting in solr

2011-05-24 Thread Vignesh Raj
Hi,

I use SolrNet to develop a search engine. In my application, I have a field
called file_contents which I use for highlighting. I am able to get the
highlights without a problem. Now I need to format them. For example, if the
keyword occurs multiple times in the field, I have to display it like what
Google does, maybe like this:

GSMArena.com: Toshiba GSM cellphones. ... Toshiba phones. Toshiba. Filters.
Available . Coming soon . Smartphone . Touchscreen . Camera . Bluetooth .
Wi-Fi ...

Multiple snippets here are separated by "...".

I need to achieve something like this. I am able to get multiple snippets, but
how do I deal with the separators?

Regards

Vignesh

 



Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Adrian Sutton
> Named entity references are valid in XML.  They just need to be declared
> before they are used[1], unless they are one of the builtin named
> entities &lt; &gt; &apos; &quot; or &amp; -- these are always valid when
> parsing with an XML parser.


Correct, it was an offhand comment and I skipped over all the  
details. In general named entities other than the built-ins aren't  
declared at the top of the file and many parsers don't bother to read  
in external DTDs so any entities declared there aren't read and are  
therefore considered invalid.



> XHTML is XML, so if parsed by an XML parser, XML's builtin named
> entities are available, and if the parser doesn't ignore external
> entities, then the same set of (roughly) 250 named entities defined in
> HTML are available as well[2].


Except that no browser that I know of actually reads in the XHTML DTD  
when in standards compliant mode, so none of those entities are  
actually viable to be used unless you include the declarations for  
them at the top of every XHTML document (which is ludicrous).


The bottom line is that it's far, far better to use numeric entities  
in XML and simply ignore all but the built-in named entities if you  
want to have any confidence that the document will be parsed  
correctly - hence my offhand comment.
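
A small sketch of that advice, for illustration only: it rewrites HTML named
entities as numeric character references and leaves the five XML built-ins
alone. It relies on Python's standard html.entities table (the HTML 4 entity
set); unknown names are left untouched rather than guessed. The function name
is made up.

```python
import re
from html.entities import name2codepoint

# The five entities every XML parser understands without a DTD.
XML_BUILTINS = {"lt", "gt", "amp", "apos", "quot"}

_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text):
    """Replace named entities such as &eacute; with numeric ones (&#233;)."""
    def replace(match):
        name = match.group(1)
        if name in XML_BUILTINS or name not in name2codepoint:
            return match.group(0)  # keep built-ins and unknown names as-is
        return "&#%d;" % name2codepoint[name]
    return _ENTITY.sub(replace, text)


if __name__ == "__main__":
    print(named_to_numeric("caf&eacute; &amp; cr&egrave;me&nbsp;br&ucirc;l&eacute;e"))
```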


Regards,

Adrian Sutton
http://www.symphonious.net


Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Mike Klaas

On 5-Oct-07, at 11:59 AM, Ravish Bhagdev wrote:


But a different use-case might be for the highlighting to encompass

the markup rather than >just the text, e.g.
  Parisspan>

which would have to be accomplished some other way.


Yes, exactly.  And I think nutch handles this somehow as I remember
using it for indexing HTML and then returning snippets with accurate
highlighting placed within html snippets.

Is there a potential for code reuse from nutch?  Maybe this is topic
for solr developer list?  Or has it been already considered?


Last time I looked at the Nutch highlighter I don't remember seeing
anything about handling this correctly (which would involve a
considerable amount of HTML finagling to get perfect).


Also, I don't see the use case for web docs: you absolutely never
want to serve up the raw HTML from an unknown page.


I'm not against improving Solr's handling of HTML data, but it is the  
type of thing that is unlikely to happen unless someone who cares  
about it steps up.


Patches welcome :)

-Mike


Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Ravish Bhagdev
Thanks all for very valuable contributions, I understand these aspects
of Solr much better now

but...

>But a different use-case might be for the highlighting to encompass
the markup rather than >just the text, e.g.
>   Paris
>which would have to be accomplished some other way.

Yes, exactly.  And I think nutch handles this somehow as I remember
using it for indexing HTML and then returning snippets with accurate
highlighting placed within html snippets.

Is there a potential for code reuse from nutch?  Maybe this is topic
for solr developer list?  Or has it been already considered?

Bests,
Ravish


Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread J.J. Larrea
At 9:32 PM +1000 10/5/07, Adrian Sutton wrote:
>From what people are suggesting though you'd be better off converting to plain 
>text before indexing it with Solr. Something like JTidy (http://jtidy.sf.net) 
>can parse most HTML that's around and you can iterate over the DOM to extract 
>the text from there.

It depends entirely on the use-case.  You can fire HTML or XML at a Solr field 
(possibly wrapping it in a CDATA block as just suggested by Pieter Berkel) and 
have it stored verbatim, then what happens at index-time is entirely dependent 
on the Analyzer chain: Treat tags and attributes as if they were text, remove 
them entirely, etc.  You can strip the markup before sending the data and so 
store and/or index just the text content.  You can use XSLT or other means to 
extract data to be indexed in specific fields.  And, as Benoit Pauwels just 
wrote, a combination of these techniques might be the most appropriate for a 
particular application, e.g. field-specific search yielding marked-up documents.

The HTMLStripXXX tokenizers appear to do a fine job of entity conversion and 
tag stripping, and so if highlighting is not a consideration then it makes the 
markup stripping very convenient, allowing storage of the document with markup 
and indexing of just the text content.

The primary issue with HTMLStripXXX is for the use-case when one wants to 
return the stored HTML/XML content with highlighting markup inserted around the 
text content, but preserving the original markup.  For example, have
Paris
highlighted as
Paris

For that the original marked-up version (rather than stripped) must be stored, 
a markup-stripped version should probably (but not necessarily) be indexed, and 
the offsets of the indexed tokens must properly point to the locations of those 
tokens in the stored version.  The HTMLStripXXX tokenizers ignore the offset of 
the stripped content (both tags and attributes, but also when entities are 
converted to characters) and so the token /paris/ in the example above is given 
the offset of the opening <, and the highlighting falls within (and thus 
destroys) the  tag.  The PatternTokenizer workaround posted to SOLR-42 
will fulfill this use-case.
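
To illustrate why the offsets matter, here is a toy version of the idea. It is
neither the SOLR-42 workaround nor Solr's highlighter: it uses a much cruder
tag pattern, a hypothetical sample string (the original example markup did not
survive in the archive), and <em> as an assumed highlight tag. Tokens are taken
only from the text between tags, but their offsets point into the original
marked-up string, so the highlight tags land around the text and leave the
markup intact.

```python
import re

TAG = re.compile(r"<[^>]*>")   # crude: anything between < and >
WORD = re.compile(r"\w+")


def tokens_with_offsets(markup):
    """Return (token, start, end) with offsets into the original markup."""
    tokens = []
    pos = 0
    for tag in TAG.finditer(markup):
        # Tokenize only the text that precedes this tag.
        for w in WORD.finditer(markup, pos, tag.start()):
            tokens.append((w.group().lower(), w.start(), w.end()))
        pos = tag.end()  # skip over the tag itself
    for w in WORD.finditer(markup, pos):
        tokens.append((w.group().lower(), w.start(), w.end()))
    return tokens


def highlight(markup, term, pre="<em>", post="</em>"):
    """Wrap every occurrence of `term` in the text, leaving tags untouched."""
    result = markup
    # Work from the end so earlier offsets stay valid while inserting.
    for token, start, end in reversed(tokens_with_offsets(markup)):
        if token == term.lower():
            result = result[:start] + pre + result[start:end] + post + result[end:]
    return result


if __name__ == "__main__":
    print(highlight('<a href="/paris">Paris</a> in spring', "paris"))
```

If the token were instead given the offset of the opening <, as described
above, the inserted tags would fall inside the anchor tag and break it.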

But a different use-case might be for the highlighting to encompass the markup 
rather than just the text, e.g.
Paris
which would have to be accomplished some other way.

- J.J.


Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Steven Rowe
Adrian Sutton wrote:
> We didn't do anything at all to the HTML, the editor returns valid XHTML
> (using numeric entities, never named entities which aren't valid in XML
> and don't tend to work in XHTML) [...]

Named entity references are valid in XML.  They just need to be declared
before they are used[1], unless they are one of the builtin named
entities &lt; &gt; &apos; &quot; or &amp; -- these are always valid when
parsing with an XML parser.

XHTML is XML, so if parsed by an XML parser, XML's builtin named
entities are available, and if the parser doesn't ignore external
entities, then the same set of (roughly) 250 named entities defined in
HTML are available as well[2].

Steve

[1] XML well-formedness constraint - entities must be declared:


[2] Named entities defined in XHTML 1.0



Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Walter Underwood
That is one seriously manly regex, but I'd recommend using the Tag Soup
parser instead:

  http://ccil.org/~cowan/XML/tagsoup/

wunder

On 10/4/07 10:11 PM, "J.J. Larrea" <[EMAIL PROTECTED]> wrote:

> It uses a PatternTokenizerFactory with a RegEx that swallows runs of HTML- or
> XML-like tags:
> 
>   (?:\s*\s]+))?)\s*|\s*)/?>\s*)|\s



Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Adrian Sutton

One last one, when you send HTML to solr, do you too replace special
chars and tags with named entities?  I did this and HTMLStripper
doesn't seem to recognise them the tags :-S  While if I try and input
HTML as is indexer throws exceptions (as having tags within XML tags
is obviously not valid.  How to do this part?


We didn't do anything at all to the HTML, the editor returns valid  
XHTML (using numeric entities, never named entities which aren't  
valid in XML and don't tend to work in XHTML) and we do string  
concatenation to build up the /update request body like:


requestBody += "" + xhtmlContent + "";

Solr seems to handle it. From what people are suggesting though you'd  
be better off converting to plain text before indexing it with Solr.  
Something like JTidy (http://jtidy.sf.net) can parse most HTML that's  
around and you can iterate over the DOM to extract the text from there.
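
On the question of how to put HTML inside the XML update message: one
alternative to raw string concatenation is to escape the content so that the
tags travel as text. The sketch below assumes Solr's XML update format
(add/doc/field) and uses hypothetical field names "id" and "content"; the
exact request body used in the message above did not survive in the archive,
so this is not a reconstruction of it.

```python
from xml.sax.saxutils import escape


def build_add_request(doc_id, html):
    """Build an XML update body with the HTML carried as escaped text."""
    return (
        "<add><doc>"
        '<field name="id">%s</field>'
        '<field name="content">%s</field>'
        "</doc></add>"
    ) % (escape(doc_id), escape(html))


if __name__ == "__main__":
    print(build_add_request("page-1", "<p>Fish &amp; chips in <b>Paris</b></p>"))
```

With escaping, the field receives the original markup verbatim once the XML
is parsed, whether or not that markup is well-formed.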


Regards,

Adrian Sutton
http://www.symphonious.net


Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Ravish Bhagdev
Thanks Adrian, I'm very new to Solr myself so I'm struggling a bit in the
initial stages...

One last question: when you send HTML to Solr, do you also replace special
chars and tags with named entities?  I did this and HTMLStripper
doesn't seem to recognise the tags :-S  Whereas if I try to input the
HTML as is, the indexer throws exceptions (as having tags within XML tags
is obviously not valid).  How should I do this part?

Ravish

On 10/5/07, Adrian Sutton <[EMAIL PROTECTED]> wrote:
> On 05/10/2007, at 4:07 PM, Ravish Bhagdev wrote:
> > (Query esp. Adrian):
> >
> > If you are indexing XHTML, do you replace tags with entities before
> > giving it to solr, if so, when you get back snippets do you get tags
> > or entities or do you convert again to tags for presentation?  What's
> > the best way out?  It would help me a lot if you briefly explain your
> > configuration.
>
> We happen to develop a HTML editor so we know 100% for certain that
> the XHTML is valid XML. Given that we just throw the raw XHTML at
> Solr which uses the HTMLStripWhitespaceTokenizer. However, at this
> stage we haven't configured highlighting at all, so our index is used
> for search and retrieving a document ID. At some point I'd like to
> add highlighting and it sounds like the best way to do so would be to
> index the document text instead of the HTML.
>
> Beyond that, we also use Solr as an optimization for extracting
> information such as what content was most recently changed, which
> pages link to others etc. On the page linking, we actually identify
> what pages are linked to prior to indexing and store them as a
> separate field - Solr itself has no understanding of the linking.
>
> Oh and I should note, I'm very new to Solr so I'm probably not doing
> things the best way, but I'm getting great results anyway.
>
> Regards,
>
> Adrian Sutton
> http://www.symphonious.net
>
>


Re: unable to figure out nutch type highlighting in solr....

2007-10-05 Thread Adrian Sutton

On 05/10/2007, at 4:07 PM, Ravish Bhagdev wrote:

(Query esp. Adrian):

If you are indexing XHTML, do you replace tags with entities before
giving it to solr, if so, when you get back snippets do you get tags
or entities or do you convert again to tags for presentation?  What's
the best way out?  It would help me a lot if you briefly explain your
configuration.


We happen to develop a HTML editor so we know 100% for certain that  
the XHTML is valid XML. Given that we just throw the raw XHTML at  
Solr which uses the HTMLStripWhitespaceTokenizer. However, at this  
stage we haven't configured highlighting at all, so our index is used  
for search and retrieving a document ID. At some point I'd like to  
add highlighting and it sounds like the best way to do so would be to  
index the document text instead of the HTML.


Beyond that, we also use Solr as an optimization for extracting  
information such as what content was most recently changed, which  
pages link to others etc. On the page linking, we actually identify  
what pages are linked to prior to indexing and store them as a  
separate field - Solr itself has no understanding of the linking.


Oh and I should note, I'm very new to Solr so I'm probably not doing  
things the best way, but I'm getting great results anyway.


Regards,

Adrian Sutton
http://www.symphonious.net



Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Ravish Bhagdev
Thanks all for help.

Just to make sure I understand correctly, am I right in summarizing
it this way, then?

No significance of using HTML: Unlike Nutch, Solr doesn't parse HTML,
so it ignores the anchors, titles, etc. and is not good for
PageRank-esque indexing.

HTMLAnalyser (by which you probably mean HTMLStripWhitespaceTokenizer?):
Its main purpose is to allow users to index HTML code. It will strip the
HTML tags and index the contents, but if used for getting snippets in
results, the highlighting tags may end up in the wrong locations.

To avoid using HTMLAnalyser, strip out the tags yourself and only send
text to Solr for indexing using one of the "normal" analysers.
Highlighting should be accurate in this case.

(Query esp. Adrian):

If you are indexing XHTML, do you replace tags with entities before
giving it to solr, if so, when you get back snippets do you get tags
or entities or do you convert again to tags for presentation?  What's
the best way out?  It would help me a lot if you briefly explain your
configuration.

Do let me know if my assumptions are wrong!

Cheers,
Ravish

On 10/5/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> : In general, I don't recommend indexing HTML content straight to Solr.  None 
> of
> : the Solr contributors do this so the use case hasn't received a lot of love.
>
> I second that comment ... the HTML Striping code was never intended to be
> an "HTML Parser" it was designed to be a workarround for dealing with
> "dirty data" where people had unwanted HTML tags in what should be plain
> text.  indexing as is with some analyzers would result in words like
> "script", "strong", and "class" matching lots of docs where the words
> never relaly appear in the text.
>
> if you have wellformed HTML documents, use an HTML parser to extract the
> real content.
>
>
>
> -Hoss
>
>


Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread J.J. Larrea
At 3:45 PM -0700 10/4/07, Mike Klaas wrote:
>I'm actually somewhat surprised that several people are interested in this but 
>none have have been sufficiently interested to implement a solution to 
>contribute:
>
>http://issues.apache.org/jira/browse/SOLR-42

I just devised a workaround earlier in the week and was planning on posting it; 
thanks to your nudge I just did (to SOLR-42).  Hopefully it may be of use to 
someone else.

It uses a PatternTokenizerFactory with a RegEx that swallows runs of HTML- or 
XML-like tags:

  (?:\s*\s]+))?)\s*|\s*)/?>\s*)|\s

and it will treat runs of "things that look like HTML/XML open or close tags 
with optional attributes, optionally preceded or followed by spaces" 
identically to "runs of one or more spaces" as token delimiters, and swallow 
them up, so the previous and following tokens have the correct offsets.

Of course this is just a hack: It doesn't have any real understanding of HTML 
or XML syntax, so something invalid like  will get matched. 
On the other hand, < and > in text will be left alone.

Also note it doesn't decode XML or HTML numeric or symbolic entity references, 
as HTMLStripReader does (my indexer is pre-decoding the entity references 
before sending the text to Solr for indexing).

So fixing HTMLStripReader and its dependent HTMLStripXXXTokenizers to do the 
right thing with offsets would still be a worthy task.  I wonder whether 
recasting HTMLStripReader using the 
org.apache.lucene.analysis.standard.CharStream interface would make sense for 
this?

(I just added the above to the Jira comment, please pardon the redundancy)

- J.J.


Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Walter Underwood
Wow, well-formed HTML. That's a rare beast. --wunder

On 10/4/07 7:08 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> if you have wellformed HTML documents, use an HTML parser to extract the
> real content.



Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Chris Hostetter

: In general, I don't recommend indexing HTML content straight to Solr.  None of
: the Solr contributors do this so the use case hasn't received a lot of love.

I second that comment ... the HTML stripping code was never intended to be
an "HTML Parser"; it was designed to be a workaround for dealing with
"dirty data" where people had unwanted HTML tags in what should be plain
text.  Indexing as is with some analyzers would result in words like
"script", "strong", and "class" matching lots of docs where the words
never really appear in the text.

If you have well-formed HTML documents, use an HTML parser to extract the
real content.



-Hoss



Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Adrian Sutton

On 05/10/2007, at 8:45 AM, Mike Klaas wrote:
> In general, I don't recommend indexing HTML content straight to
> Solr.  None of the Solr contributors do this so the use case hasn't
> received a lot of love.

We're indexing XHTML straight to Solr and it's working great so far.

> I'm actually somewhat surprised that several people are interested
> in this but none have been sufficiently interested to
> implement a solution to contribute:
>
> http://issues.apache.org/jira/browse/SOLR-42

Didn't know there was a problem to solve. We're a fair way off
actually playing with highlighting but I'll keep an eye on this for
when we get to it.

> -Mike


Thanks,

Adrian Sutton
http://www.symphonious.net



Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Mike Klaas

On 4-Oct-07, at 3:19 PM, Adrian Sutton wrote:

>> I see that you're using the HTML analyzer.  Unfortunately that
>> does not play very well with highlighting at the moment. You may
>> get garbled output.
>
> Is it the HTML analyzer or the fact that it's HTML content? If it's
> just the analyzer you could always just copy the HTML content to
> another field with a different analyzer and use that for
> highlighting (but search on the original field). Would this work,
> and if so, which analyzer would be suitable for the second field?


The HTML analyzer strips HTML but doesn't update the offsets nicely
(the highlighter uses these to determine where to insert the <em> tags).

If you use a "normal" analyzer (like WordDelimiterFilter) directly on
the HTML, the offsets will be correct but you will get HTML tags
returned in your output, which you will have to be careful to strip
(which means you couldn't use the default '<em>' as highlighting
markers).
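
To make the offset problem concrete, here is a small Python sketch. It is an illustration only, not Solr's highlighter: `insert_markers` is a made-up helper, and `<em>`/`</em>` are assumed as the markers. It inserts markers into the stored text at given token offsets, first with offsets computed against the stored HTML and then with offsets computed against the stripped text.

```python
def insert_markers(stored, spans, pre="<em>", post="</em>"):
    """Wrap each (start, end) span of `stored` in the given markers."""
    out = []
    pos = 0
    for start, end in sorted(spans):
        out.append(stored[pos:start])
        out.append(pre + stored[start:end] + post)
        pos = end
    out.append(stored[pos:])
    return "".join(out)


stored = "<p>fast search</p>"   # what is stored and returned
stripped = "fast search"        # what the analyzer saw after stripping

# Offsets of "search" relative to the stored text: correct output.
good = (stored.index("search"), stored.index("search") + len("search"))
print(insert_markers(stored, [good]))

# Offsets of "search" relative to the stripped text: garbled output,
# because they are applied to a string they were not computed for.
stale = (stripped.index("search"), stripped.index("search") + len("search"))
print(insert_markers(stored, [stale]))
```

The second call shows the kind of garbling to expect whenever the analyzer changes the text length without adjusting token offsets.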


In general, I don't recommend indexing HTML content straight to  
Solr.  None of the Solr contributors do this so the use case hasn't  
received a lot of love.


I'm actually somewhat surprised that several people are interested in  
this but none have been sufficiently interested to implement a  
solution to contribute:


http://issues.apache.org/jira/browse/SOLR-42

-Mike


Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Adrian Sutton
> I see that you're using the HTML analyzer.  Unfortunately that does
> not play very well with highlighting at the moment. You may get
> garbled output.


Is it the HTML analyzer or the fact that it's HTML content? If it's  
just the analyzer you could always just copy the HTML content to  
another field with a different analyzer and use that for highlighting  
(but search on the original field). Would this work, and if so, which  
analyzer would be suitable for the second field?


Adrian Sutton
http://www.symphonious.net


Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Mike Klaas


On 2-Oct-07, at 12:52 AM, Ravish Bhagdev wrote:





I see that you're using the HTML analyzer.  Unfortunately that does  
not play very well with highlighting at the moment. You may get  
garbled output.


-Mike


Re: unable to figure out nutch type highlighting in solr....

2007-10-04 Thread Mike Klaas

On 2-Oct-07, at 12:52 AM, Ravish Bhagdev wrote:

> I have tried very hard to follow documentation and forums that try to
> answer questions about how to return snippets with highlights for
> relevant searched term using Solr (as nutch does with such ease).
>
> I will be really grateful if someone can guide me with the basics. I have
> made sure that the field to be highlighted is "stored" in the index, etc.
> Still I can't figure out why it doesn't return the snippet and instead
> returns the whole document.
>
> I have tried all different highlight parameters with variations, but
> no idea what's happening.  Can I test highlighting using given
> application using "full search interface" option?  How, it just
> returns xml with full document between field tag at the moment.


Note that the highlighting data is _not_ returned in the   
section of the response.  Getting the whole document back is probably  
due to asking for all fields (coupled with having stored the main  
text field).
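
A minimal Python sketch of reading the highlighting data separately from the documents. The `SAMPLE` response below is a hand-written, hypothetical example shaped like Solr's XML response format (a `highlighting` list keyed by document id, then by field name); the field name `myfield`, the id `doc1`, and the helper `extract_snippets` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written response; real responses carry more sections.
SAMPLE = """<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">doc1</str>
      <str name="myfield">a long stored text about solr highlighting</str>
    </doc>
  </result>
  <lst name="highlighting">
    <lst name="doc1">
      <arr name="myfield">
        <str>about solr &lt;em&gt;highlighting&lt;/em&gt;</str>
      </arr>
    </lst>
  </lst>
</response>"""


def extract_snippets(xml_text):
    """Return {doc_id: {field: [snippet, ...]}} from the highlighting section."""
    root = ET.fromstring(xml_text)
    snippets = {}
    for section in root.findall("lst"):
        if section.get("name") != "highlighting":
            continue
        for doc in section.findall("lst"):
            fields = {}
            for arr in doc.findall("arr"):
                fields[arr.get("name")] = [s.text for s in arr.findall("str")]
            snippets[doc.get("name")] = fields
    return snippets


print(extract_snippets(SAMPLE))
```

The full stored text still appears under the document itself; the snippets only show up in the separate section.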


You can play with the highlighting in the admin ui.  Besides having a  
few parameters directly present, the others can be added directly to  
the url for testing.


The minimum required for highlighting is:
 1. hl=true
 2. hl.fl=myfield

_If_ that field matches one of the query terms, you should see
snippets in the generated response.  Even if not, you should see a
highlighting section of the response (it will be empty).
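
For testing from a script rather than the admin UI, the two parameters can be added to the query URL. This is a minimal Python sketch; the base URL `http://localhost:8983/solr/select`, the query `apple`, and the helper name `build_highlight_url` are assumptions for illustration, so substitute your own host, handler path, query, and field.

```python
from urllib.parse import urlencode


def build_highlight_url(base_url, query, field):
    """Build a select URL with the minimum highlighting parameters."""
    params = [
        ("q", query),
        ("hl", "true"),    # turn highlighting on
        ("hl.fl", field),  # field to generate snippets from
    ]
    return base_url + "?" + urlencode(params)


# Assumed host, port and handler path; adjust for your installation.
url = build_highlight_url("http://localhost:8983/solr/select", "apple", "myfield")
print(url)
```

Other hl.* parameters can be appended to the same list once the basic request returns snippets.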


regards,
-Mike


unable to figure out nutch type highlighting in solr....

2007-10-02 Thread Ravish Bhagdev
I have tried very hard to follow documentation and forums that try to
answer questions about how to return snippets with highlights for
relevant searched term using Solr (as nutch does with such ease).

I will be really grateful if someone can guide me with the basics. I have
made sure that the field to be highlighted is "stored" in the index, etc.
Still I can't figure out why it doesn't return the snippet and instead
returns the whole document.

I have tried all different highlight parameters with variations, but
no idea what's happening.  Can I test highlighting using given
application using "full search interface" option?  How, it just
returns xml with full document between field tag at the moment.

Please find attached my conf files as well.

[Attachments: solrconfig.xml and schema.xml. Their XML markup was lost in the archive, so the contents are not reproduced here.]