How to do MoreLikeThis with documents in separate indexes?

2010-10-03 Thread Savannah Beckett
Is it possible to do MoreLikeThis with documents that are in separate indexes?  
If so, how?  Thanks.


  

How to Index Pure Text into Separate Fields?

2010-09-29 Thread Savannah Beckett
Hi,
  I am using xpath to index different parts of the html pages into different 
fields.  Now, I have some pure text documents that have no html, so I can't use 
xpath.  How do I index this pure text into different fields of the index?  How 
do I make nutch/solr understand that these different parts belong to different 
fields?  Maybe I can use the existing content in the fields in my index?
Thanks.


  

Re: How to Index Pure Text into Separate Fields?

2010-09-29 Thread Savannah Beckett
No, I am using xpath for html; that is not the question.  I am indexing pure 
text in addition to the html that I was already indexing.  Pure text like a TXT 
file or a Microsoft Word doc.  So, with no xpath for TXT, how do I index a TXT 
file into different fields in my index, the way I use xpath to index html into 
different fields in my index?

My question is about pure text like .txt files and Microsoft Word, not 
html.  I am completely fine with html.
Thanks.





From: Erick Erickson erickerick...@gmail.com
To: solr-user@lucene.apache.org
Sent: Wed, September 29, 2010 2:59:26 PM
Subject: Re: How to Index Pure Text into Separate Fields?

Can you provide a few more details? You mention xpath, which leads me
to believe that you are using DIH, is that true? How are you getting
your documents to index? Parts of a filesystem?

Because it's possible to do many things. If you're using DIH against a
filesystem, you could use two fileDataSources, one that works only on files
with a particular extension (xml, say) and another that processes .txt files.

But that said, if you're trying to index just the text of a Word document, you
have to parse it quite differently than a plain text file; take a look at Tika.

All of which may not help you at all, because I'm guessing...

So I think a more complete problem statement would help us help you.

Best
Erick
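
A rough sketch of the idea above, in Python rather than as a DIH
configuration: route each file by extension, and for plain text fall back on a
positional convention, since there is no markup for xpath to address. The field
names (title, body), the "first non-empty line is the title" rule, and the
route helper are assumptions made up for illustration. Word documents would
still need a real extractor such as Tika, which is only named here, not called.

```python
from pathlib import Path


def split_plain_text(text):
    """Map plain text to fields by position: first non-empty line is the title."""
    non_empty = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not non_empty:
        return {"title": "", "body": ""}
    return {"title": non_empty[0], "body": " ".join(non_empty[1:])}


def route(filename):
    """Pick a (hypothetical) extraction strategy from the file extension."""
    ext = Path(filename).suffix.lower()
    if ext in {".html", ".htm", ".xml"}:
        return "xpath"        # existing xpath-based extraction
    if ext == ".txt":
        return "plain_text"   # handled by split_plain_text above
    if ext in {".doc", ".docx"}:
        return "tika"         # needs a real text extractor first
    return "skip"


doc = split_plain_text("Quarterly Report\n\nRevenue grew.\nCosts fell.\n")
```

Any other convention (a "Key: value" header block, fixed line offsets) would
slot into split_plain_text the same way; the point is only that the rule has to
come from the documents, because plain text carries no structure of its own.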







  

Need help with spellcheck city name

2010-09-27 Thread Savannah Beckett
Hi,
  I have the city name as a text field, and I want to spellcheck it.  I use the 
settings from http://wiki.apache.org/solr/SpellCheckComponent

If I set up the city name as a text field and spellcheck "San Jos" (meant as 
"San Jose"), I get "ojos" as the suggestion for "Jos".  I checked the extended 
results and found that "Jose" is in the middle of all 10 suggestions in terms 
of score and frequency.  I then set the city name up as a string field and ran 
the spellcheck again; I got "Van" for "San" and "Ross" for "Jos", which is 
weird because "San" is correct.  


How do you set up the spellchecker to spellcheck city names?  A city name can 
have multiple words.
Thanks.


  

Re: Need help with spellcheck city name

2010-09-27 Thread Savannah Beckett
No, it doesn't work; I got a weird result. I set my city name field to be 
parsed as a single token, as follows:

    <fieldType name="autocomplete1" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

I got the following result for spellcheck:

    <lst name="spellcheck">
      <lst name="suggestions">
        <lst name="san">
          <int name="numFound">1</int>
          <int name="startOffset">0</int>
          <int name="endOffset">3</int>
          <arr name="suggestion">
            <str>swan</str>
          </arr>
        </lst>
        <lst name="clar">
          <int name="numFound">1</int>
          <int name="startOffset">4</int>
          <int name="endOffset">8</int>
          <arr name="suggestion">
            <str>clark</str>
          </arr>
        </lst>
      </lst>
    </lst>

 




From: Tom Hill solr-l...@worldware.com
To: solr-user@lucene.apache.org
Sent: Mon, September 27, 2010 3:52:48 PM
Subject: Re: Need help with spellcheck city name

Maybe process the city name as a single token?
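
To illustrate what "a single token" buys you, here is a small Python sketch
(not Solr code). It lowercases the whole city name and compares it to the
dictionary as one string, with a plain Levenshtein distance standing in for
whatever distance measure the spellchecker is actually configured with. The
CITIES list is a made-up dictionary, and suggest is a hypothetical helper.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def suggest(query, cities):
    """Treat the whole name as one lowercase token; return the closest city."""
    q = query.lower()
    return min(cities, key=lambda c: (edit_distance(q, c.lower()), c))


CITIES = ["San Jose", "Swan", "Clark", "Ojos", "San Juan"]  # hypothetical data
best = suggest("San Jos", CITIES)

# Word by word, "jos" is equally close to "ojos" and "jose" (distance 1 each),
# so the ranking between them has to come from something other than distance.
tie = edit_distance("jos", "ojos") == edit_distance("jos", "jose")
```

Compared as a whole string, "san jos" is one edit away from "san jose" and much
further from "swan" or "ojos". That only holds if both the spelling dictionary
and the query are treated as one token; if the query is still split on
whitespace, each word is checked on its own.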







  

Re: Need help with spellcheck city name

2010-09-27 Thread Savannah Beckett
No, I checked: there is a city called Swan in Iowa, so the suggestion is coming 
from the city index, and so is Clark.  But why does it favor Swan over San?  
Spellcheck gets weird once I treat the city name as one token.  If I do it the 
old way, it lets San through and corrects Jos to Ojos instead of Jose, because 
Ojos is ranked #1 and Jose is in the middle.  Any more suggestions?  Ranking by 
frequency first and then by score doesn't work either.  


 


From: Erick Erickson erickerick...@gmail.com
To: solr-user@lucene.apache.org
Sent: Mon, September 27, 2010 5:24:25 PM
Subject: Re: Need help with spellcheck city name

Hmmm, did you rebuild your spelling index after the config changes?

And it really looks like somehow you're getting results from a field other
than city. Are you also sure that your cityname field is of type
autocomplete1?

Shooting in the dark here, but these results are so weird that I suspect it's
something fundamental

Best
Erick
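
For the rebuild question, a minimal Python sketch of a spellcheck request that
also asks for a rebuild. The spellcheck.* parameters are the ones described on
the SpellCheckComponent wiki page; the host, port, and handler path in the
example URL are assumptions about a default local install and will differ in
other setups.

```python
from urllib.parse import urlencode


def spellcheck_url(base, query, build=False, count=10):
    """Build a spellcheck request; build=True also asks Solr to rebuild
    the spelling index before answering."""
    params = {
        "q": query,
        "spellcheck": "true",
        "spellcheck.count": count,
        "spellcheck.extendedResults": "true",
    }
    if build:
        params["spellcheck.build"] = "true"
    return base + "?" + urlencode(params)


# Assumed handler location; adjust to your own solrconfig.xml.
url = spellcheck_url("http://localhost:8983/solr/spell", "san jos", build=True)
```

The rebuild only needs to be requested once after the field type or the
source field changes (and after reindexing); later requests can leave it off.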









  

Re: How to Update Value of One Field of a Document in Index?

2010-09-10 Thread Savannah Beckett
Thanks.  I am trying to use MoreLikeThis in Solr to find similar documents in 
the Solr index and use the data from these similar documents to modify a field 
in each document that I am indexing.  I found that MoreLikeThis in Solr only 
works when the document is already in the index; is that true?  If so, I may 
have to wait until indexing is finished, then run my own command to do 
MoreLikeThis on each document in the index, and then reindex each document.  
That doesn't sound efficient.  Is there a better way?
Thanks.





From: Liam O'Boyle liam.obo...@intelligencebank.com
To: solr-user@lucene.apache.org
Cc: u...@nutch.apache.org
Sent: Thu, September 9, 2010 11:06:36 PM
Subject: Re: How to Update Value of One Field of a Document in Index?

Hi Savannah,

You can only reindex the entire document; if you only have the ID,
then do a search to retrieve the rest of the data, then reindex.  This
assumes that all of the fields you need to index are stored (so that
you can retrieve them) and not just indexed.

Liam
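
A sketch of the "retrieve, change, reindex" cycle described above, in Python.
It only builds the XML add message for the update handler; fetching the stored
fields by id and posting the message are left out. The document contents and
the build_add_message helper are made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_add_message(stored_doc, field, new_value):
    """Return an <add> message that re-sends every stored field of the
    document, with one field replaced by new_value."""
    updated = dict(stored_doc)      # do not modify the caller's copy
    updated[field] = new_value
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in updated.items():
        values = value if isinstance(value, list) else [value]
        for v in values:            # multi-valued fields repeat the element
            el = ET.SubElement(doc, "field", name=name)
            el.text = str(v)
    return ET.tostring(add, encoding="unicode")


# Stand-in for the result of searching by id; all fields must be stored.
stored = {"id": "doc-1", "title": "Example", "rating": 3}
message = build_add_message(stored, "rating", 5)
```

Because the new document replaces the old one with the same id, any field
that is indexed but not stored would be lost in this round trip, which is the
caveat in the message above.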

On Fri, Sep 10, 2010 at 3:29 PM, Savannah Beckett
savannah_becket...@yahoo.com wrote:

 I use nutch to crawl and index to Solr.  My code is working.  Now, I want to
 update the value of one of the fields of a document in the solr index after
 the document was already indexed, and I have only the document id.  How do I
 do that?

 Thanks.






  

Re: How to Update Value of One Field of a Document in Index?

2010-09-10 Thread Savannah Beckett
I want to do MoreLikeThis to find documents that are similar to the document 
that I am indexing.  Then I want to calculate the average of one of the fields 
across all those documents and put this average into a field of the document 
that I am indexing.  From my research, it seems that MoreLikeThis can only be 
used to find documents similar to one that is already in the index.  So, I 
think I need to index it first, then use MoreLikeThis to find similar documents 
in the index, and then reindex that document.  Any better way?  I am trying not 
to reindex a document because it's not efficient.  I don't have to use 
MoreLikeThis.
Thanks.
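
The averaging step itself is small once the similar documents are in hand. A
Python sketch, assuming they have already been retrieved as a list of
dictionaries; the field names (rating, avg_rating) and the sample data are
invented for illustration.

```python
def average_field(similar_docs, field):
    """Average a numeric field over the documents that actually have it.
    Returns None when no document carries the field."""
    values = [d[field] for d in similar_docs if d.get(field) is not None]
    if not values:
        return None
    return sum(values) / len(values)


# Stand-in for a MoreLikeThis result list.
similar = [
    {"id": "a", "rating": 4},
    {"id": "b", "rating": 5},
    {"id": "c"},                 # no rating: left out of the average
]

new_doc = {"id": "d", "title": "New document"}
new_doc["avg_rating"] = average_field(similar, "rating")
```

If the average can be computed before the new document is sent to Solr, the
document only has to be indexed once, with the extra field already filled in.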




From: Jonathan Rochkind rochk...@jhu.edu
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Fri, September 10, 2010 9:58:20 AM
Subject: RE: How to Update Value of One Field of a Document in Index?

More like this is intended to be run at query time. For what reasons are you 
thinking you want to (re-)index each document based on the results of 
MoreLikeThis?  You're right that that's not what the component is intended for. 


Jonathan






  

How to Update Value of One Field of a Document in Index?

2010-09-09 Thread Savannah Beckett
I use nutch to crawl and index to Solr.  My code is working.  Now, I want to 
update the value of one of the fields of a document in the solr index after the 
document was already indexed, and I have only the document id.  How do I do 
that?  

Thanks.


  

How to set custom fields for SolrSearchBean Query in Nutch?

2010-08-25 Thread Savannah Beckett
I am using SolrSearchBean inside my custom parse filter in Nutch 1.1.  My 
solr/nutch setup is working.  I have Nutch crawl and index into Solr, and I am 
able to search the Solr index from my Solr admin page.  My Solr schema is 
completely different from the one in Nutch.  When I query my Solr index using 
SolrSearchBean, it somehow always builds the query with fields like content, 
site, url, etc., and my Solr index has none of those fields.  Of course, there 
is an exception complaining that it cannot execute the query.  


How do I make SolrSearchBean use my Solr setup's fields instead of the Nutch 
ones?  
Thanks.


  

How to do Spatial Search with Solr?

2010-08-23 Thread Savannah Beckett
Hi,
  I am using nutch to do the crawling and solr to do the searching.  The index 
has City and State.  I want to be able to get all nearby cities by entering a 
city name, e.g. when I type New York, I want to get the following as facets:

New York, NY (1905) 
Brooklyn, NY (89) 
Jersey City, NJ (55) 
New York City, NY (34) 
Montclair, NJ (25) 

How do I do that?  More importantly, where do I get the latitude and 
longitude data for all cities?  

Thanks.
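
Whatever the source of the coordinates, the "nearby" part comes down to a
great-circle distance check. A Python sketch using the haversine formula; the
coordinates below are rough values typed in by hand for illustration, not taken
from any gazetteer, and the nearby helper is hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

# Approximate (latitude, longitude) pairs, for illustration only.
CITIES = {
    "New York, NY": (40.71, -74.01),
    "Brooklyn, NY": (40.68, -73.94),
    "Jersey City, NJ": (40.72, -74.04),
    "Montclair, NJ": (40.83, -74.21),
    "Los Angeles, CA": (34.05, -118.24),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearby(city, radius_km):
    """Names of all other cities within radius_km of the given city."""
    lat, lon = CITIES[city]
    return sorted(
        name for name, (la, lo) in CITIES.items()
        if name != city and haversine_km(lat, lon, la, lo) <= radius_km
    )


near_ny = nearby("New York, NY", 50)
```

The resulting list of names could then be used to filter or facet on the City
and State fields; doing the distance test inside Solr itself needs latitude
and longitude stored per document.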



  

How to Combine Drupal solrconfig.xml with Nutch solrconfig.xml?

2010-07-26 Thread Savannah Beckett
I am using the Drupal ApacheSolr module to integrate solr with drupal.  I have 
already integrated solr with nutch: I moved nutch's solrconfig.xml and 
schema.xml to solr's example directory, and it works.  I then tried to append 
the Drupal ApacheSolr module's own solrconfig.xml and schema.xml to the same 
xml files, but I got the following error when I ran java -jar start.jar:
 
Jul 26, 2010 1:18:31 PM org.apache.solr.common.SolrException log
SEVERE: Exception during parsing file: solrconfig.xml:org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:249)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:124)
    at org.apache.solr.core.Config.init(Config.java:110)
    at org.apache.solr.core.SolrConfig.init(SolrConfig.java:130)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:134)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)

Why?  Does solrconfig.xml allow 2 config sections?  Does schema.xml 
allow 2 schema sections?  

Thanks.
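
The error message can be reproduced outside Solr: an XML document may have
only one root element, so a file made by appending one complete config to
another is not well-formed. A Python sketch with tiny stand-in documents (the
element names a and b are placeholders, not real Solr settings), plus a naive
merge that moves the second root's children under the first.

```python
import xml.etree.ElementTree as ET

# Two complete documents pasted back to back: two roots, so not valid XML.
appended = "<config><a/></config><config><b/></config>"


def is_well_formed(xml_text):
    """True if the text parses as a single XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def merge_under_one_root(first_xml, second_xml):
    """Copy the children of the second root into the first root."""
    first = ET.fromstring(first_xml)
    second = ET.fromstring(second_xml)
    first.extend(list(second))
    return ET.tostring(first, encoding="unicode")


merged = merge_under_one_root("<config><a/></config>", "<config><b/></config>")
```

This only fixes well-formedness. It does not resolve conflicts between the two
files, such as handlers, field types, or fields declared in both with the same
name, which would still have to be reconciled by hand.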


  

Which is a good XPath generator?

2010-07-24 Thread Savannah Beckett
Hi,
  I am looking for an XPath generator that can generate an xpath by picking a 
specific tag inside an html page.  Do you know a good xpath generator?  If 
possible, a free xpath generator would be great.
Thanks.


  

faceted search with job title

2010-07-21 Thread Savannah Beckett
Hi,
  I am currently using nutch to crawl some job pages from job boards.  They are 
in my solr index now.  I want to do faceted search on the job titles.  How?  
The job title can be anywhere on the page, e.g. title, header, content...  If I 
use an indexfilter in Nutch to search the content for the job title, there are 
hundreds of thousands of job titles, and I can't hard code them all.  Do you 
have a better idea?  I think I need the job title in a separate field in the 
index to make it work with solr faceted search, am I right?
Thanks.
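
For reference, once the job title does sit in its own field, the facet request
is straightforward. A Python sketch that builds the query string; the field
name job_title and the base URL are assumptions, while facet, facet.field,
facet.limit and facet.mincount are standard Solr faceting parameters.

```python
from urllib.parse import urlencode


def facet_url(base, query, facet_field, limit=10):
    """Build a request that returns facet counts only (rows=0)."""
    params = [
        ("q", query),
        ("rows", 0),                  # skip the documents, keep the counts
        ("facet", "true"),
        ("facet.field", facet_field),
        ("facet.limit", limit),
        ("facet.mincount", 1),        # hide titles with no matching docs
    ]
    return base + "?" + urlencode(params)


# Assumed field name and handler location.
url = facet_url("http://localhost:8983/solr/select", "*:*", "job_title")
```

Facet counts are per indexed term, so the field would normally be an
untokenized string type; otherwise "Software Engineer" is counted as two
separate values.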


  

Re: faceted search with job title

2010-07-21 Thread Savannah Beckett
Hmm... there must be a better way.  Each job board has a different format.  If 
new job boards are constantly being crawled, I don't think I can manually look 
for the specific sequence of tags that leads to the job title.  Most of them 
don't even have a class or id.  There is no guarantee that the job title will 
be in the title tag or a header tag; something else can be in the title.  
Should I do this in a class that extends IndexFilter in Nutch?
Thanks. 





From: Dave Searle dave.sea...@magicalia.com
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Wed, July 21, 2010 8:42:55 AM
Subject: RE: faceted search with job title

You'd probably need to do some post processing on the pages and set up rules 
for each website to grab that specific bit of data. You could load the html 
into an xml parser, then use xpath to grab content from a particular tag with a 
class or id, based on the particular website
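
A sketch of the per-website rule idea in Python. To stay within the standard
library it uses html.parser and a (tag, class) pair per site instead of a full
XML parser with xpath, so it is a stand-in for the approach described above,
not the same thing. The host name and the rule table are invented.

```python
from html.parser import HTMLParser

# Hypothetical rule table: host -> (tag, class) that wraps the job title.
SITE_RULES = {"jobs.example.com": ("h1", "job-title")}


class FirstMatchText(HTMLParser):
    """Collect the text of the first element matching a tag and class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag, self.css_class = tag, css_class
        self.depth = 0        # > 0 while inside the matched element
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            if tag == self.tag:
                self.depth += 1   # nested element with the same tag name
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_title(host, html):
    """Apply the rule for this host; None if the site has no rule or no match."""
    rule = SITE_RULES.get(host)
    if rule is None:
        return None
    parser = FirstMatchText(*rule)
    parser.feed(html)
    text = " ".join("".join(parser.parts).split())
    return text or None


page = ('<html><body><div class="nav">Home</div>'
        '<h1 class="job-title main">Senior <b>Java</b> Developer</h1>'
        '<h1 class="job-title">Other</h1></body></html>')
title = extract_title("jobs.example.com", page)
```

Adding a job board then means adding one entry to the rule table rather than
writing new parsing code, though someone still has to look at each site once.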





  

Re: faceted search with job title

2010-07-21 Thread Savannah Beckett
I don't see how it can be done without writing sax or dom code for each job 
board; that is unmaintainable if a lot of new job boards are being crawled.  
Maybe I should use regex matching?  Then I just need to substitute the regex 
pattern for each job board without writing any new sax or dom code.  But is a 
regex pattern flexible enough for all job boards?
Thanks.
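
For comparison, the regex idea would look roughly like this in Python: one
pattern per job board, kept in a table, with a shared function that applies it.
The host names and patterns are invented for illustration, and regexes over
html are brittle, so this is a sketch of the trade-off rather than a
recommendation.

```python
import re

# Hypothetical per-site patterns; each must define a named group "title".
SITE_PATTERNS = {
    "jobs.example.com": re.compile(r"<h1[^>]*>(?P<title>.*?)</h1>", re.I | re.S),
    "careers.example.org": re.compile(r"Position:\s*(?P<title>[^<\n]+)", re.I),
}


def title_by_regex(host, page):
    """Return the job title for a known host, or None if nothing matches."""
    pattern = SITE_PATTERNS.get(host)
    if pattern is None:
        return None
    match = pattern.search(page)
    if not match:
        return None
    text = re.sub(r"<[^>]+>", "", match.group("title"))  # drop inner tags
    return " ".join(text.split())                        # collapse whitespace


a = title_by_regex("jobs.example.com", "<h1 class='t'>Data <em>Engineer</em></h1>")
b = title_by_regex("careers.example.org", "<p>Position: QA Analyst</p>")
c = title_by_regex("unknown.example.net", "<h1>x</h1>")
```

This keeps the per-site work down to one pattern each, but a pattern still has
to be written and maintained for every board, the same as an xpath rule would.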





From: Nagelberg, Kallin knagelb...@globeandmail.com
To: solr-user@lucene.apache.org solr-user@lucene.apache.org
Sent: Wed, July 21, 2010 10:39:32 AM
Subject: RE: faceted search with job title

Yeah, you should definitely just set up a custom parser for each site. It 
should be easy to extract the title using groovy's xml parsing along with 
tagsoup for sloppy html. If you can't find the pattern for each site leading to 
the job title, how can you expect solr to? Humans have the advantage here :P

-Kallin Nagelberg
