How to do MoreLikeThis with documents in separate indexes?

2010-10-03 Thread Savannah Beckett
Is it possible to do MoreLikeThis with documents that are in separate indexes?  
If so, how?  Thanks.


  

Re: How to Index Pure Text into Separate Fields?

2010-09-29 Thread Savannah Beckett
No, I am using xpath for html; that is not the question.  I am now indexing pure 
text in addition to the html that I was already indexing.  Pure text means a TXT 
file or a Microsoft Word doc.  There is no xpath for TXT, so how do I index a TXT 
file into different fields in my index, the way I use xpath to index html into 
different fields?

My question refers to pure text like .txt files and Microsoft Word documents, 
not html.  I am completely fine with html.
Thanks.





From: Erick Erickson 
To: solr-user@lucene.apache.org
Sent: Wed, September 29, 2010 2:59:26 PM
Subject: Re: How to Index Pure Text into Separate Fields?

Can you provide a few more details? You mention xpath, which leads me
to believe that you are using DIH, is that true? How are you getting
your documents to index? Parts of a filesystem?

Because it's possible to do many things. If you're using DIH against a
filesystem, you could use two fileDataSources, one that works only on files
with a particular extension (xml, say) and another that processes .txt files.

But that said, if you're trying to index "just the text" of a Word document,
you have to parse it quite differently than a plain text file; take a look at
Tika.

All of which may not help you at all, because I'm guessing...

So I think a more complete problem statement would help us help you.

Best
Erick

On Wed, Sep 29, 2010 at 3:56 PM, Savannah Beckett <
savannah_becket...@yahoo.com> wrote:

> Hi,
>  I am using xpath to index different parts of the html pages into different
> fields.  Now, I have some pure text documents that have no html, so I can't
> use xpath.  How do I index this pure text into different fields of the index?
> How do I make nutch/solr understand that these different parts belong to
> different fields?  Maybe I can use existing content in the fields in my index?
> Thanks.
>
>
>



  

Re: How to Index Pure Text into Separate Fields?

2010-09-29 Thread Savannah Beckett
No, these new documents are not html; they are pure text, like the ones you see 
in Notepad or Microsoft Word.  I have no problem indexing html, but I got stuck 
with this pure text.





From: Scott Gonyea 
To: solr-user@lucene.apache.org
Sent: Wed, September 29, 2010 1:20:20 PM
Subject: Re: How to Index Pure Text into Separate Fields?

Break your HTML pages into the desired fields, and format them as follows:

http://wiki.apache.org/solr/UpdateXmlMessages

And away you go.  You may want to search / review the Wiki.  Also, if
you're indexing websites and want to place it in Solr, you should look
at Nutch.  It can do all that work for you, and more.

Scott
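
The update message format linked above wraps each document in add/doc elements with one field element per value. A minimal sketch of building such a message, assuming hypothetical field names (id, title, body) that would have to match your own schema.xml; it only builds the XML string and does not post it anywhere:

```python
import xml.etree.ElementTree as ET


def build_add_message(doc):
    """Build a Solr <add><doc>...</doc></add> update message from a dict."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


# Hypothetical field names; use the ones defined in your own schema.xml.
message = build_add_message({
    "id": "doc-1",
    "title": "Plain text example",
    "body": "Text taken from a .txt file",
})
print(message)
```

How the text gets split into title and body is the open question in this thread; the sketch only covers the last step, once the parts are known.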

On Wed, Sep 29, 2010 at 12:56 PM, Savannah Beckett
 wrote:
> Hi,
>   I am using xpath to index different parts of the html pages into different
> fields.  Now, I have some pure text documents that have no html, so I can't
> use xpath.  How do I index this pure text into different fields of the index?
> How do I make nutch/solr understand that these different parts belong to
> different fields?  Maybe I can use existing content in the fields in my index?
> Thanks.
>
>
>



  

How to Index Pure Text into Separate Fields?

2010-09-29 Thread Savannah Beckett
Hi,
  I am using xpath to index different parts of the html pages into different 
fields.  Now, I have some pure text documents that have no html, so I can't use 
xpath.  How do I index this pure text into different fields of the index?  How 
do I make nutch/solr understand that these different parts belong to different 
fields?  Maybe I can use existing content in the fields in my index?
Thanks.


  

Re: Need help with spellcheck city name

2010-09-27 Thread Savannah Beckett
No, I checked; there is a city called Swan in Iowa, so it is coming from the 
city index, and so is Clark.  But why does it favor Swan over San?  Spellcheck 
gets weird after I treat the city name as one token.  If I do it the old way, it 
lets San through and corrects Jos to Ojos instead of Jose, because Ojos is 
ranked #1 and Jose is in the middle.  Any more suggestions?  Ranking by 
frequency first and then by score doesn't work either.  


 


From: Erick Erickson 
To: solr-user@lucene.apache.org
Sent: Mon, September 27, 2010 5:24:25 PM
Subject: Re: Need help with spellcheck city name

Hmmm, did you rebuild your spelling index after the config changes?

And it really looks like somehow you're getting results from a field other
than city. Are you also sure that your cityname field is of type
autocomplete1?

Shooting in the dark here, but these results are so weird that I suspect
it's something fundamental

Best
Erick

On Mon, Sep 27, 2010 at 8:05 PM, Savannah Beckett <
savannah_becket...@yahoo.com> wrote:

> No, it doesn't work; I got a weird result. I set my city name field to be
> parsed as a single token, as follows:
>
>         positionIncrementGap="100">
>          
>            
>            
>          
>          
>            
>            
>          
>        
>
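> I got the following result for spellcheck:
>
> [spellcheck response; XML markup lost in the archive]
>   first term:  numFound 1, startOffset 0, endOffset 3, suggestion: swan
>   second term: numFound 1, startOffset 4, endOffset 8, suggestion: clark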
>
>
>
>
>
> 
> From: Tom Hill 
> To: solr-user@lucene.apache.org
> Sent: Mon, September 27, 2010 3:52:48 PM
> Subject: Re: Need help with spellcheck city name
>
> Maybe process the city name as a single token?
>
> On Mon, Sep 27, 2010 at 3:25 PM, Savannah Beckett
>  wrote:
> > Hi,
> >  I have city name as a text field, and I want to do spellcheck on it.  I
> > use the settings in http://wiki.apache.org/solr/SpellCheckComponent
> >
> > If I set up city name as a text field and spellcheck "San Jos" for San
> > Jose, I get the suggestion "ojos" for Jos.  I checked the extended results
> > and found that Jose is in the middle of all 10 suggestions in terms of
> > score and frequency.  I then set city name as a string field and ran
> > spellcheck again; I got Van for San and Ross for Jos, which is weird
> > because San is correct.
> >
> >
> > How do you set up the spellchecker to spellcheck city names?  A city name
> > can have multiple words.
> > Thanks.
> >
> >
> >
>
>
>
>
>



  

Re: Need help with spellcheck city name

2010-09-27 Thread Savannah Beckett
No, it doesn't work; I got a weird result. I set my city name field to be parsed 
as a single token, as follows:

    
  
    
    
  
  
    
    
  
    

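I got the following result for spellcheck:

[spellcheck response; XML markup lost in the archive]
  first term:  numFound 1, startOffset 0, endOffset 3, suggestion: swan
  second term: numFound 1, startOffset 4, endOffset 8, suggestion: clark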

 




From: Tom Hill 
To: solr-user@lucene.apache.org
Sent: Mon, September 27, 2010 3:52:48 PM
Subject: Re: Need help with spellcheck city name

Maybe process the city name as a single token?
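
To illustrate why tokenization matters here, the sketch below compares fuzzy matching against whole city names with fuzzy matching against single words. It uses Python's difflib purely as a stand-in; it is not the algorithm Solr's spellchecker uses, and both vocabularies are made-up examples rather than data from the index in this thread.

```python
from difflib import get_close_matches

# Hypothetical vocabulary: whole city names kept as single entries.
CITY_NAMES = ["san jose", "san juan", "santa rosa", "swan", "clark"]

# Hypothetical vocabulary: the same kind of data split into single words.
CITY_WORDS = ["san", "van", "jose", "ojos", "ross"]


def suggest_city(query, n=1):
    """Return up to n whole city names that are close to the query."""
    return get_close_matches(query.lower(), CITY_NAMES, n=n)


def suggest_word(word, n=3):
    """Return up to n single words that are close to one query word."""
    return get_close_matches(word.lower(), CITY_WORDS, n=n)


print(suggest_city("San Jos"))  # matched against complete names
print(suggest_word("Jos"))      # matched word by word, without context
```

With these toy lists, the single word "Jos" is equally close to "jose" and "ojos", so nothing but ranking separates them, while the full string "San Jos" has one clearly best match. Whether Solr behaves the same way depends on the analyzer and spellchecker configuration.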

On Mon, Sep 27, 2010 at 3:25 PM, Savannah Beckett
 wrote:
> Hi,
>   I have city name as a text field, and I want to do spellcheck on it.  I use
> the settings in http://wiki.apache.org/solr/SpellCheckComponent
>
> If I set up city name as a text field and spellcheck "San Jos" for San Jose,
> I get the suggestion "ojos" for Jos.  I checked the extended results and found
> that Jose is in the middle of all 10 suggestions in terms of score and
> frequency.  I then set city name as a string field and ran spellcheck again; I
> got Van for San and Ross for Jos, which is weird because San is correct.
>
>
> How do you set up the spellchecker to spellcheck city names?  A city name can
> have multiple words.
> Thanks.
>
>
>



  

Need help with spellcheck city name

2010-09-27 Thread Savannah Beckett
Hi,
  I have city name as a text field, and I want to do spellcheck on it.  I use 
the settings in http://wiki.apache.org/solr/SpellCheckComponent

If I set up city name as a text field and spellcheck "San Jos" for San Jose, 
I get the suggestion "ojos" for Jos.  I checked the extended results and found 
that Jose is in the middle of all 10 suggestions in terms of score and 
frequency.  I then set city name as a string field and ran spellcheck again; I 
got Van for San and Ross for Jos, which is weird because San is correct.  


How do you set up the spellchecker to spellcheck city names?  A city name can 
have multiple words.
Thanks.


  

spellcheck on multiple fields?

2010-09-27 Thread Savannah Beckett
Is it possible to do spellcheck on multiple fields in my solr index?  If so, 
how?  The following setup works for only one field:
    

  default
  solr.IndexBasedSpellChecker
  myfield
  ./spellchecker1
  0.5
  true
    


Thanks.
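
One approach, as far as I know, is to declare several spellcheckers in the spellcheck component, each with its own name and field, and to select one per request with the spellcheck.dictionary parameter. A minimal sketch of building such request URLs, assuming one spellchecker per field already exists; the dictionary names, host, and handler path below are all hypothetical:

```python
from urllib.parse import urlencode

# Assumption: one spellchecker per field is declared in solrconfig.xml.
# These dictionary names are hypothetical.
DICTIONARY_FOR_FIELD = {
    "city": "city_spell",
    "title": "title_spell",
}


def spellcheck_url(base_url, field, text):
    """Build a query URL that asks Solr to use one named spellchecker."""
    params = {
        "q": text,
        "spellcheck": "true",
        "spellcheck.dictionary": DICTIONARY_FOR_FIELD[field],
    }
    return base_url + "?" + urlencode(params)


# Hypothetical host and handler path.
url = spellcheck_url("http://localhost:8983/solr/select", "city", "San Jos")
print(url)
```

An alternative that avoids per-request selection would be to copy several fields into one combined field and build a single spellchecker on it, at the cost of mixing the vocabularies.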


  

Re: How to Update Value of One Field of a Document in Index?

2010-09-10 Thread Savannah Beckett
I want to use MoreLikeThis to find documents that are similar to the document 
that I am indexing.  Then I want to calculate the average of one of the fields 
across all those documents and put this average into a field of the document 
that I am indexing.  From my research, it seems that MoreLikeThis can only be 
used to find documents similar to one that is already in the index.  So I think 
I need to index it first, then use MoreLikeThis to find similar documents in 
the index, and then reindex that document.  Is there a better way?  I would 
rather not reindex a document because it's not efficient.  I don't have to use 
MoreLikeThis.
Thanks.
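
The averaging step described above is independent of how the similar documents are found. A minimal sketch, assuming the MoreLikeThis results have already been parsed into dicts; the field name "price" is a hypothetical placeholder:

```python
def average_field(similar_docs, field):
    """Average a numeric field over documents, skipping docs that lack it."""
    values = [doc[field] for doc in similar_docs if field in doc]
    if not values:
        return None  # nothing to average
    return sum(values) / len(values)


# Hypothetical MoreLikeThis results, already parsed into dicts.
similar = [
    {"id": "a", "price": 10.0},
    {"id": "b", "price": 20.0},
    {"id": "c"},  # no price stored
]
print(average_field(similar, "price"))
```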




From: Jonathan Rochkind 
To: "solr-user@lucene.apache.org" 
Sent: Fri, September 10, 2010 9:58:20 AM
Subject: RE: How to Update Value of One Field of a Document in Index?

"More like this" is intended to be run at query time. For what reasons are you 
thinking you want to (re-)index each document based on the results of 
MoreLikeThis?  You're right that that's not what the component is intended for. 


Jonathan
________
From: Savannah Beckett [savannah_becket...@yahoo.com]
Sent: Friday, September 10, 2010 11:18 AM
To: solr-user@lucene.apache.org
Subject: Re: How to Update Value of One Field of a Document in Index?

Thanks.  I am trying to use MoreLikeThis in Solr to find similar documents in
the solr index and use the data from these similar documents to modify a field
in each document that I am indexing.  I found that MoreLikeThis in Solr only
works when the document is in the index; is that true?  If so, I may have to wait
until the indexing is finished, then run my own command to do MoreLikeThis on each
document in the index, and then reindex each document?  That sounds inefficient.
Is there a better way?
Thanks.





From: Liam O'Boyle 
To: solr-user@lucene.apache.org
Cc: u...@nutch.apache.org
Sent: Thu, September 9, 2010 11:06:36 PM
Subject: Re: How to Update Value of One Field of a Document in Index?

Hi Savannah,

You can only reindex the entire document; if you only have the ID,
then do a search to retrieve the rest of the data, then reindex.  This
assumes that all of the fields you need to index are stored (so that
you can retrieve them) and not just indexed.

Liam
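
The retrieve, modify, reindex cycle described above can be sketched as follows. This is only an illustration: a plain dict stands in for the Solr index, and the document and field names are hypothetical. The point is that the whole document is written back, which only works if every field was stored.

```python
def update_one_field(index, doc_id, field, value):
    """Replace a whole document in order to change a single field.

    `index` is a plain dict standing in for the Solr index: looking a
    document up plays the role of the search by id, and assigning it
    back plays the role of re-adding the full document.
    """
    stored = index.get(doc_id)
    if stored is None:
        raise KeyError(f"no document with id {doc_id!r}")
    updated = dict(stored)      # copy every stored field
    updated[field] = value      # change just the one field
    index[doc_id] = updated     # re-add the complete document
    return updated


# Hypothetical document and field names.
index = {"doc-1": {"id": "doc-1", "title": "Old title", "url": "http://example.com/"}}
update_one_field(index, "doc-1", "title", "New title")
print(index["doc-1"])
```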

On Fri, Sep 10, 2010 at 3:29 PM, Savannah Beckett
 wrote:
>
> I use nutch to crawl and index to Solr.  My code is working.  Now, I want to
> update the value of one of the fields of a document in the solr index after
> the document was already indexed, and I have only the document id.  How do I
> do that?
>
> Thanks.
>
>
>


  

Re: How to Update Value of One Field of a Document in Index?

2010-09-10 Thread Savannah Beckett
Thanks.  I am trying to use MoreLikeThis in Solr to find similar documents in 
the solr index and use the data from these similar documents to modify a field 
in each document that I am indexing.  I found that MoreLikeThis in Solr only 
works when the document is in the index; is that true?  If so, I may have to wait 
until the indexing is finished, then run my own command to do MoreLikeThis on each 
document in the index, and then reindex each document?  That sounds inefficient.  
Is there a better way?
Thanks.





From: Liam O'Boyle 
To: solr-user@lucene.apache.org
Cc: u...@nutch.apache.org
Sent: Thu, September 9, 2010 11:06:36 PM
Subject: Re: How to Update Value of One Field of a Document in Index?

Hi Savannah,

You can only reindex the entire document; if you only have the ID,
then do a search to retrieve the rest of the data, then reindex.  This
assumes that all of the fields you need to index are stored (so that
you can retrieve them) and not just indexed.

Liam

On Fri, Sep 10, 2010 at 3:29 PM, Savannah Beckett
 wrote:
>
> I use nutch to crawl and index to Solr.  My code is working.  Now, I want to
> update the value of one of the fields of a document in the solr index after
> the document was already indexed, and I have only the document id.  How do I
> do that?
>
> Thanks.
>
>
>



  

How to Update Value of One Field of a Document in Index?

2010-09-09 Thread Savannah Beckett
I use nutch to crawl and index to Solr.  My code is working.  Now, I want to 
update the value of one of the fields of a document in the solr index after the 
document was already indexed, and I have only the document id.  How do I do 
that?  

Thanks.


  

How to set custom fields for SolrSearchBean Query in Nutch?

2010-08-25 Thread Savannah Beckett
I am using SolrSearchBean inside my custom parse filter in Nutch 1.1.  My 
solr/nutch setup is working.  I have Nutch crawl and index into Solr, and I am 
able to search the solr index with my solr admin page.  My solr schema is 
completely different from the one in Nutch.  When I try to query my solr index 
using SolrSearchBean, it somehow always builds my query with fields like 
content, site, url, etc., but my solr index has none of those fields.  Of 
course, there is an exception complaining that the query cannot be executed.  


How do I make SolrSearchBean use my solr setup's fields instead of the nutch ones?  
Thanks.


  

How to do Spatial Search with Solr?

2010-08-22 Thread Savannah Beckett
Hi,
  I am using nutch to do the crawling and solr to do the searching.  The index 
has City and State.  I want to be able to get all nearby cities by entering a 
city name, e.g. when I type New York, I want to get the following as facets:

New York, NY (1905) 
Brooklyn, NY (89) 
Jersey City, NJ (55) 
New York City, NY (34) 
Montclair, NJ (25) 

How do I do that?  More importantly, where do I get the latitude and 
longitude data for all cities?  

Thanks.
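
Once each city has a latitude and longitude, "nearby" reduces to a distance computation. The sketch below uses the haversine formula on a small hand-typed table; the coordinates are approximate and for illustration only, and the sketch does not show how Solr itself would do spatial filtering or faceting.

```python
from math import asin, cos, radians, sin, sqrt

# Approximate coordinates, typed in by hand for illustration only.
CITIES = {
    "New York, NY": (40.7128, -74.0060),
    "Brooklyn, NY": (40.6782, -73.9442),
    "Jersey City, NJ": (40.7178, -74.0431),
    "Montclair, NJ": (40.8259, -74.2090),
    "Los Angeles, CA": (34.0522, -118.2437),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def nearby(city, radius_km):
    """Names of other cities within radius_km of the given city, nearest first."""
    origin = CITIES[city]
    hits = [(haversine_km(origin, pos), name)
            for name, pos in CITIES.items() if name != city]
    return [name for dist, name in sorted(hits) if dist <= radius_km]


print(nearby("New York, NY", 30))
```

The names returned by a lookup like this could then be used to restrict or facet the search on the City and State fields.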



  

How to Combine Drupal solrconfig.xml with Nutch solrconfig.xml?

2010-07-26 Thread Savannah Beckett
I am using the Drupal ApacheSolr module to integrate solr with drupal.  I have 
already integrated solr with nutch: I moved nutch's solrconfig.xml and 
schema.xml to solr's example directory, and it works.  I tried to append the 
Drupal ApacheSolr module's own solrconfig.xml and schema.xml to the same xml 
files, but I got the following error when I ran "java -jar start.jar":
 
Jul 26, 2010 1:18:31 PM org.apache.solr.common.SolrException log
SEVERE: Exception during parsing file: solrconfig.xml:org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:249)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:284)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:124)
    at org.apache.solr.core.Config.<init>(Config.java:110)
    at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:130)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:134)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)

Why?  Does solrconfig.xml allow two <config> sections?  Does schema.xml 
allow two <schema> sections?  

Thanks.
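
An XML document may have only one root element, so appending one complete file to another leaves a second root behind, which matches the parser message above. A small sketch that reproduces the condition with Python's standard XML parser; the element content is a made-up stand-in, not the real Nutch or Drupal configuration:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


one_root = "<config><requestHandler name='a'/></config>"
# Appending a second file's content yields a second root element.
two_roots = one_root + "<config><requestHandler name='b'/></config>"

print(is_well_formed(one_root))
print(is_well_formed(two_roots))
```

Merging would therefore mean moving the child elements of one file inside the single root of the other, and resolving any entries that clash, rather than concatenating the files.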


  

Which is a good XPath generator?

2010-07-24 Thread Savannah Beckett
Hi,
  I am looking for an XPath generator that can generate an xpath by picking a 
specific tag inside an html page.  Do you know a good xpath generator?  If 
possible, a free xpath generator would be great.
Thanks.
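
For reference, what such a generator does can be sketched in a few lines: walk from the chosen element up to the root and record each tag with its position among same-named siblings. This sketch assumes well-formed markup, since it uses Python's strict XML parser; real-world html would need a lenient parser first. The sample page is made up.

```python
import xml.etree.ElementTree as ET


def xpath_for(root, target):
    """Return a positional path from root to target, or None if not found."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parent_of.get(node)
        if parent is None:
            return None  # target is not inside this tree
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))


# Made-up, well-formed sample page.
page = ET.fromstring(
    "<html><body><div><p>intro</p></div>"
    "<div><h1>Java Developer</h1></div></body></html>"
)
heading = page.find(".//h1")
print(xpath_for(page, heading))
```

Positional paths like this are brittle when the page layout changes, which is one reason paths based on a class or id are usually preferred when the page offers them.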


  

Re: faceted search with job title

2010-07-21 Thread Savannah Beckett
And I would have to recompile the dom or sax code each time I add a job board 
for crawling.  A regex pattern is only a string, which can be stored in a text 
file or db and retrieved based on the job board.  What do you think?
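
The idea of one stored pattern per job board can be sketched like this. The site names and patterns are hypothetical; in practice the strings would be loaded from the text file or database mentioned above rather than hard-coded.

```python
import re

# Hypothetical job boards and patterns; in practice these strings could be
# loaded from a text file or a database table keyed by site.
TITLE_PATTERNS = {
    "jobs.example.com": r"<h1[^>]*>(.*?)</h1>",
    "careers.example.org": r'<span class="position">(.*?)</span>',
}


def extract_job_title(site, html):
    """Apply the site's stored pattern and return the first captured group."""
    pattern = TITLE_PATTERNS.get(site)
    if pattern is None:
        return None  # no rule registered for this site yet
    match = re.search(pattern, html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


print(extract_job_title("jobs.example.com", "<h1 id='t'> Java Developer </h1>"))
```

Adding a board then means adding one row of data, not new code. Regexes are fragile on nested or irregular html, though, so the same lookup table could just as well hold an xpath per site.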





From: "Nagelberg, Kallin" 
To: "solr-user@lucene.apache.org" 
Sent: Wed, July 21, 2010 10:39:32 AM
Subject: RE: faceted search with job title

Yeah you should definitely just setup a custom parser for each site.. should be 
easy to extract title using groovy's xml parsing along with tagsoup for sloppy 
html. If you can't find the pattern for each site leading to the job title how 
can you expect solr to? Humans have the advantage here :P

-Kallin Nagelberg

-----Original Message-----
From: Savannah Beckett [mailto:savannah_becket...@yahoo.com] 
Sent: Wednesday, July 21, 2010 12:20 PM
To: solr-user@lucene.apache.org
Cc: dave.sea...@magicalia.com
Subject: Re: faceted search with job title

mmm...there must be a better way...each job board has a different format.  If 
there are constantly new job boards being crawled, I don't think I can manually 
look for the specific sequence of tags that leads to the job title.  Most of them 
don't even have a class or id.  There is no guarantee that the job title will be 
in the title tag or a header tag.  Something else can be in the title.  Should I 
do this in a class that extends IndexFilter in Nutch?
Thanks. 





From: Dave Searle 
To: "solr-user@lucene.apache.org" 
Sent: Wed, July 21, 2010 8:42:55 AM
Subject: RE: faceted search with job title

You'd probably need to do some post processing on the pages and set up rules 
for each website to grab that specific bit of data. You could load the html into 
an xml parser, then use xpath to grab content from a particular tag with a class 
or id, based on the particular website



-----Original Message-----
From: Savannah Beckett [mailto:savannah_becket...@yahoo.com] 
Sent: 21 July 2010 16:38
To: solr-user@lucene.apache.org
Subject: faceted search with job title

Hi,
  I am currently using nutch to crawl some job pages from job boards.  They are 
in my solr index now.  I want to do faceted search with the job titles.  How?  
The job titles can be in any location on the page, e.g. title, header, 
content...   If I use an indexfilter in Nutch to search the content for the job 
title, there are hundreds of thousands of job titles; I can't hard code them 
all.  Do you have a better idea?  I think I need the job title in a separate 
field in the index to make it work with solr faceted search, am I right?
Thanks.


  

Re: faceted search with job title

2010-07-21 Thread Savannah Beckett
I don't see how it can be done without writing sax or dom code for each job 
board; that is not maintainable if a lot of new job boards are being 
crawled.  Maybe I should use regex matching?  Then I just need to substitute the 
regex pattern for each job board without writing any new sax or dom code.  But 
are regex patterns flexible enough for all job boards?
Thanks.





From: "Nagelberg, Kallin" 
To: "solr-user@lucene.apache.org" 
Sent: Wed, July 21, 2010 10:39:32 AM
Subject: RE: faceted search with job title

Yeah you should definitely just setup a custom parser for each site.. should be 
easy to extract title using groovy's xml parsing along with tagsoup for sloppy 
html. If you can't find the pattern for each site leading to the job title how 
can you expect solr to? Humans have the advantage here :P

-Kallin Nagelberg

-----Original Message-----
From: Savannah Beckett [mailto:savannah_becket...@yahoo.com] 
Sent: Wednesday, July 21, 2010 12:20 PM
To: solr-user@lucene.apache.org
Cc: dave.sea...@magicalia.com
Subject: Re: faceted search with job title

mmm...there must be a better way...each job board has a different format.  If 
there are constantly new job boards being crawled, I don't think I can manually 
look for the specific sequence of tags that leads to the job title.  Most of them 
don't even have a class or id.  There is no guarantee that the job title will be 
in the title tag or a header tag.  Something else can be in the title.  Should I 
do this in a class that extends IndexFilter in Nutch?
Thanks. 





From: Dave Searle 
To: "solr-user@lucene.apache.org" 
Sent: Wed, July 21, 2010 8:42:55 AM
Subject: RE: faceted search with job title

You'd probably need to do some post processing on the pages and set up rules 
for each website to grab that specific bit of data. You could load the html into 
an xml parser, then use xpath to grab content from a particular tag with a class 
or id, based on the particular website



-----Original Message-----
From: Savannah Beckett [mailto:savannah_becket...@yahoo.com] 
Sent: 21 July 2010 16:38
To: solr-user@lucene.apache.org
Subject: faceted search with job title

Hi,
  I am currently using nutch to crawl some job pages from job boards.  They are 
in my solr index now.  I want to do faceted search with the job titles.  How?  
The job titles can be in any location on the page, e.g. title, header, 
content...   If I use an indexfilter in Nutch to search the content for the job 
title, there are hundreds of thousands of job titles; I can't hard code them 
all.  Do you have a better idea?  I think I need the job title in a separate 
field in the index to make it work with solr faceted search, am I right?
Thanks.


  

Re: faceted search with job title

2010-07-21 Thread Savannah Beckett
mmm...there must be a better way...each job board has a different format.  If 
there are constantly new job boards being crawled, I don't think I can manually 
look for the specific sequence of tags that leads to the job title.  Most of them 
don't even have a class or id.  There is no guarantee that the job title will be 
in the title tag or a header tag.  Something else can be in the title.  Should I 
do this in a class that extends IndexFilter in Nutch?
Thanks. 





From: Dave Searle 
To: "solr-user@lucene.apache.org" 
Sent: Wed, July 21, 2010 8:42:55 AM
Subject: RE: faceted search with job title

You'd probably need to do some post processing on the pages and set up rules 
for each website to grab that specific bit of data. You could load the html into 
an xml parser, then use xpath to grab content from a particular tag with a class 
or id, based on the particular website



-----Original Message-----
From: Savannah Beckett [mailto:savannah_becket...@yahoo.com] 
Sent: 21 July 2010 16:38
To: solr-user@lucene.apache.org
Subject: faceted search with job title

Hi,
  I am currently using nutch to crawl some job pages from job boards.  They are 
in my solr index now.  I want to do faceted search with the job titles.  How?  
The job titles can be in any location on the page, e.g. title, header, 
content...   If I use an indexfilter in Nutch to search the content for the job 
title, there are hundreds of thousands of job titles; I can't hard code them 
all.  Do you have a better idea?  I think I need the job title in a separate 
field in the index to make it work with solr faceted search, am I right?
Thanks.


  

faceted search with job title

2010-07-21 Thread Savannah Beckett
Hi,
  I am currently using nutch to crawl some job pages from job boards.  They are 
in my solr index now.  I want to do faceted search with the job titles.  How?  
The job titles can be in any location on the page, e.g. title, header, 
content...   If I use an indexfilter in Nutch to search the content for the job 
title, there are hundreds of thousands of job titles; I can't hard code them 
all.  Do you have a better idea?  I think I need the job title in a separate 
field in the index to make it work with solr faceted search, am I right?
Thanks.
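
Conceptually, a facet on job title is a count of documents per distinct value of a dedicated field, which is why the title has to be extracted into its own field first. A toy sketch of that counting, with made-up documents and a hypothetical field name "job_title"; Solr computes this internally when faceting is requested on a field:

```python
from collections import Counter


def facet_counts(docs, field):
    """Count documents per distinct value of one field, most common first."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    return counts.most_common()


# Hypothetical documents with the job title stored in its own field.
docs = [
    {"id": 1, "job_title": "Java Developer"},
    {"id": 2, "job_title": "Java Developer"},
    {"id": 3, "job_title": "QA Lead"},
    {"id": 4},  # title could not be extracted
]
print(facet_counts(docs, "job_title"))
```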