Re: Memory use with sorting problem

2007-11-27 Thread Chris Laux
Hi again,

In the meantime I discovered jmap (I'm not a Java programmer) and found
that all the memory was being used up by String and char[] objects.

The Lucene docs have the following to say on sorting memory use:

 For String fields, the cache is larger: in addition to the above
array, the value of every term in the field is kept in memory. If there
are many unique terms in the field, this could be quite large.

(http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Sort.html)

I am sorting on the slong schema type, which is of course stored as a
string. The above quote seems to indicate that it is possible for a
field not to be a string for the purposes of the sort, while I took it
from LiA (Lucene in Action) that everything is a string to Lucene.

What can I do to make sure the additional memory is not used for every
unique term? I.e., how can I keep the slong from being a String field
for sorting purposes?
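
(One possible workaround, sketched under the assumption that the created
values fit in 32 bits, e.g. a Unix timestamp in seconds: index a
parallel field with the stock example schema's plain "integer" type,
whose sort cache is a single int[] per searcher rather than per-term
Strings, and sort on that instead. The field name created_i is made up
for illustration:

  <field name="created_i" type="integer" indexed="true" stored="false" multiValued="false" />

  select/?q=solr&start=0&rows=20&sort=created_i+desc

Plain integer fields don't handle range queries correctly, so keep the
slong field if you also need range queries on created.)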

Cheers,
Chris


Chris Laux wrote:
 Hi all,
 
 I've been struggling with this problem for over a month now, and
 although memory issues have been discussed often, I don't seem to be
 able to find a fitting solution.
 
 The index is only 1.5 GB, but memory use quickly fills the 1 GB heap
 maximum on a 2 GB machine. This works fine until auto-warming starts;
 switching auto-warming off altogether is unattractive, as it leads to
 response times of up to 30 s. When auto-warming starts, I get this
 error:
 
 SEVERE: Error during auto-warming of
 key:org.apache.solr.search.QueryResultKey@e0b93139:
 java.lang.OutOfMemoryError: Java heap space
 
 Now when I reduce the size of the caches (to a fraction of the default
 settings) and the number of warming searchers (to 2), memory use is not
 reduced and the problem stays; only deactivating auto-warming helps.
 When I set the heap size limit higher (and go into swap space), all the
 extra memory seems to be used up right away, independently of
 auto-warming.
 
 This all seems to be closely connected to sorting by a numerical field,
 as switching this off makes memory use a lot friendlier.
 
 Is it normal to need that much memory for such a small index?
 
 I suspect the problem is in Lucene; would it be better to post on their
 list?
 
 Does anyone know a better way of getting the sorting done?
 
 Thanks in advance for your help,
 
 Chris
 
 
 This is the field setup in schema.xml:
 
 <field name="id" type="long" stored="true" required="true" multiValued="false" />
 <field name="user-id" type="long" stored="true" required="true" multiValued="false" />
 <field name="text" type="text" indexed="true" multiValued="false" />
 <field name="created" type="slong" indexed="true" multiValued="false" />
 
 And this is a sample query:
 
 select/?q=solr&start=0&rows=20&sort=created+desc
 
 



Re: Inconsistent results in Solr Search with Lucene Index

2007-11-27 Thread Grant Ingersoll
Have you set up your Analyzers, etc. so they correspond to the exact
ones you were using in Lucene?  Under the Solr Admin you can try the
analysis tool to see how your index and queries are treated.  What
happens if you do a *:* query from the Admin query screen?
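
(As an illustration of the first point: a schema.xml fragment along
these lines, assuming the existing index was built with Lucene's
StandardAnalyzer, makes Solr analyze that field the same way at index
and query time; substitute whatever analyzer actually built the index:

  <fieldtype name="text" class="solr.TextField">
    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
  </fieldtype>
)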


If your index is reasonably sized, I would just reindex, but you  
shouldn't have to do this.


-Grant

On Nov 27, 2007, at 8:18 AM, trysteps wrote:


Hi All,
I am trying to use Solr search with a Lucene index, so I just set up all
the schema.xml configuration, like tokenizers and the necessary fields.

But I cannot get the same results as in Lucene.
For example, a search for 'dog' returns lots of results with Lucene, but
in Solr I get no results, while a search for 'dog*' returns the same
results as Lucene.
What is the best way to integrate a Lucene index into Solr? Are there
any well-documented sources?

Thanks for your Attention,
Trysteps



--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





Re: CJK Analyzers for Solr

2007-11-27 Thread Eswar K
Is there any specific reason why the CJK analyzers in Solr were chosen
to be n-gram based rather than morphological analyzers, which is roughly
what Google implements and which are considered more effective than the
n-gram ones?

Regards,
Eswar



On Nov 27, 2007 7:57 AM, Eswar K [EMAIL PROTECTED] wrote:

 Thanks James...

 How much time does it take to index 18M docs?

 - Eswar


 On Nov 27, 2007 7:43 AM, James liu [EMAIL PROTECTED]  wrote:

  I don't use the HYLANDA analyzer.
 
  I use je-analyzer and have indexed at least 18M docs.
 
  I'm sorry, I have only used a Chinese analyzer.
 
 
  On Nov 27, 2007 10:01 AM, Eswar K [EMAIL PROTECTED] wrote:
 
   What is the performance of these CJK analyzers (the one in Lucene and
   hylanda)? We would potentially be indexing millions of documents.
 
   James,
 
   We would have a look at hylanda too. What about Japanese and Korean
   analyzers, any recommendations?
  
   - Eswar
  
   On Nov 27, 2007 7:21 AM, James liu [EMAIL PROTECTED] wrote:
  
    I don't think NGram is a good method for Chinese.
 
    CJKAnalyzer of Lucene is 2-gram.
 
    Eswar K:
     If you need a Chinese analyzer, I recommend hylanda
    (www.hylanda.com); it is the best Chinese analyzer, but it is not
    free.
     If you want a free Chinese analyzer, maybe you can try je-analyzer,
    though it has some problems.
   
   
   
On Nov 27, 2007 5:56 AM, Otis Gospodnetic 
  [EMAIL PROTECTED]
wrote:
   
 Eswar,

 We've used the NGram stuff that exists in Lucene's contrib/analyzers
 instead of CJK.  Doesn't that allow you to do everything that the
 Chinese and CJK analyzers do?  It's been a few months since I've looked
 at the Chinese and CJK Analyzers, so I could be off.

 Otis

 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

 - Original Message 
 From: Eswar K [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Monday, November 26, 2007 8:30:52 AM
 Subject: CJK Analyzers for Solr

 Hi,

 Does Solr come with language analyzers for CJK? If not, can you please
 direct me to some good CJK analyzers?

 Regards,
 Eswar




   
   
--
regards
jl
   
  
 
 
 
  --
  regards
  jl
 




Re: CJK Analyzers for Solr

2007-11-27 Thread John Stewart
Eswar,

What type of morphological analysis do you suspect (or know) that
Google does on East Asian text?  I don't think you can treat the three
languages in the same way here.  Japanese has multi-morphemic words,
but Chinese doesn't really.

jds

On Nov 27, 2007 11:54 AM, Eswar K [EMAIL PROTECTED] wrote:
 Is there any specific reason why the CJK analyzers in Solr were chosen
 to be n-gram based rather than morphological analyzers, which is roughly
 what Google implements and which are considered more effective than the
 n-gram ones?

 Regards,
 Eswar






Combining SOLR and JAMon to monitor query execution times from a browser

2007-11-27 Thread Siegfried Goeschl

Hi folks,

working on a closed-source project for an IP-concerned company is not
always fun ... we combined SOLR with JAMon
(http://jamonapi.sourceforge.net/) to keep an eye on the query times,
and this might be of general interest:

+) JAMon comes with a ready-to-use ServletFilter (see the web.xml sketch
below)
+) we extended this implementation to keep track of queries issued by a
customer and the requested domain objects, e.g. artist, album, track
+) this allows us to track execution times and their distribution, to
quickly find long-running queries from a web browser without needing
access to the access.log
+) a small presentation can be found at
http://people.apache.org/~sgoeschl/presentations/jamon-20070717.pdf
+) if it is of general interest, I can rewrite the code as a
contribution
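
(A sketch of hooking such a filter into Solr's web.xml; the filter class
name below is illustrative, check the JAMon distribution for the actual
one:

  <filter>
    <filter-name>jamon</filter-name>
    <filter-class>com.jamonapi.JAMonFilter</filter-class>
  </filter>
  <filter-mapping>
    <filter-name>jamon</filter-name>
    <url-pattern>/select/*</url-pattern>
  </filter-mapping>
)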

Cheers,

Siegfried Goeschl


Re: Combining SOLR and JAMon to monitor query execution times from a browser

2007-11-27 Thread Matthew Runo
I'd be interested in seeing more logging in the admin section! I saw  
that there is QPS in 1.3, which is great, but it'd be wonderful to see  
more.


--Matthew Runo





Re: CJK Analyzers for Solr

2007-11-27 Thread Mike Klaas

On 27-Nov-07, at 8:54 AM, Eswar K wrote:

 Is there any specific reason why the CJK analyzers in Solr were chosen
 to be n-gram based rather than morphological analyzers, which is roughly
 what Google implements and which are considered more effective than the
 n-gram ones?


The CJK analyzers are just wrappers of the already-available analyzers
in Lucene.  I suspect (but am not sure) that the core devs aren't fluent
in the issues surrounding the analysis of Asian text (I certainly am
not).  Any improvements in this regard would be greatly appreciated.
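
(Concretely, such a wrapper is just a type declaration in schema.xml; a
sketch, assuming Lucene's contrib CJKAnalyzer is on the classpath:

  <fieldtype name="text_cjk" class="solr.TextField">
    <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
  </fieldtype>

CJKAnalyzer emits overlapping character bigrams for CJK input, which is
the 2-gram behavior discussed in this thread.)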


-Mike


two solr instances?

2007-11-27 Thread Jörg Kiegeland
Is it possible to deploy solr.war once to Tomcat (which sits behind an
Apache HTTP Server in my configuration) and have it manage two Solr
indexes?

I need to make two different Solr indexes (with different schema.xml
files) accessible over the web. If the above architecture is not
possible, is there any other solution?


Re: CJK Analyzers for Solr

2007-11-27 Thread Walter Underwood
Dictionaries are surprisingly expensive to build and maintain, and
bi-gram is surprisingly effective for Chinese. See this paper:

   http://citeseer.ist.psu.edu/kwok97comparing.html

I expect that n-gram indexing would be less effective for Japanese
because it is an inflected language. Korean is even harder. It might
work to break Korean into the phonetic subparts and use n-gram on
those.

You should not do term highlighting with any of the n-gram methods.
The relevance can be very good, but the highlighting just looks dumb.

wunder

On 11/27/07 8:54 AM, Eswar K [EMAIL PROTECTED] wrote:

 Is there any specific reason why the CJK analyzers in Solr were chosen to be
 n-gram based instead of it being a morphological analyzer which is kind of
 implemented in Google as it considered to be more effective than the n-gram
 ones?
 
 Regards,
 Eswar
 
 
 
 On Nov 27, 2007 7:57 AM, Eswar K [EMAIL PROTECTED] wrote:
 
 thanks james...
 
 How much time does it take to index 18m docs?
 
 - Eswar
 
 
 On Nov 27, 2007 7:43 AM, James liu [EMAIL PROTECTED]  wrote:
 
 i not use HYLANDA analyzer.
 
 i use je-analyzer and indexing at least 18m docs.
 
 i m sorry i only use chinese analyzer.
 
 
 On Nov 27, 2007 10:01 AM, Eswar K [EMAIL PROTECTED] wrote:
 
 What is the performance of these CJK analyzers (one in lucene and
 hylanda
 )?
 We would potentially be indexing millions of documents.
 
 James,
 
 We would have a look at hylanda too. What abt japanese and korean
 analyzers,
 any recommendations?
 
 - Eswar
 
 On Nov 27, 2007 7:21 AM, James liu [EMAIL PROTECTED] wrote:
 
 I don't think NGram is good method for Chinese.
 
 CJKAnalyzer of Lucene is 2-Gram.
 
 Eswar K:
  if it is chinese analyzer,,i recommend hylanda(www.hylanda.com),,,it
 is
 the best chinese analyzer and it not free.
  if u wanna free chinese analyzer, maybe u can try je-analyzer. it
 have
 some problem when using it.
 
 
 
 On Nov 27, 2007 5:56 AM, Otis Gospodnetic 
 [EMAIL PROTECTED]
 wrote:
 
 Eswar,
 
 We've uses the NGram stuff that exists in Lucene's
 contrib/analyzers
 instead of CJK.  Doesn't that allow you to do everything that the
 Chinese
 and CJK analyzers do?  It's been a few months since I've looked at
 Chinese
 and CJK Analzyers, so I could be off.
 
 Otis
 
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 - Original Message 
 From: Eswar K [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Monday, November 26, 2007 8:30:52 AM
 Subject: CJK Analyzers for Solr
 
 Hi,
 
 Does Solr come with Language analyzers for CJK? If not, can you
 please
 direct me to some good CJK analyzers?
 
 Regards,
 Eswar
 
 
 
 
 
 
 --
 regards
 jl
 
 
 
 
 
 --
 regards
 jl
 
 
 



Re: two solr instances?

2007-11-27 Thread Chris Laux
Have you looked at this page on the wiki:
http://wiki.apache.org/solr/SolrTomcat#head-024d7e11209030f1dbcac9974e55106abae837ac

That should get you started.
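
(In short, the wiki's approach is to deploy the same solr.war once per
index, each deployment with its own JNDI solr/home. A sketch with
hypothetical paths, e.g. two Tomcat context fragments
conf/Catalina/localhost/solr1.xml and solr2.xml:

  <Context docBase="/opt/solr/solr.war" debug="0" crossContext="true">
    <Environment name="solr/home" type="java.lang.String"
                 value="/opt/solr/home1" override="true"/>
  </Context>

Each fragment points the shared war at a different Solr home, so the two
indexes are served at /solr1 and /solr2.)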

-Chris


Jörg Kiegeland wrote:
 Is it possible to deploy solr.war once to Tomcat (which sits behind an
 Apache HTTP Server in my configuration) and have it manage two Solr
 indexes?
 
 I need to make two different Solr indexes (with different schema.xml
 files) accessible over the web. If the above architecture is not
 possible, is there any other solution?
 



RE: LSA Implementation

2007-11-27 Thread Norskog, Lance
WordNet itself is English-only. There are various ontology projects for
it.

http://www.globalwordnet.org/ is a separate world-language database
project. I found it at the bottom of the WordNet Wikipedia page. Thanks
for starting me on the search!

Lance 

-Original Message-
From: Eswar K [mailto:[EMAIL PROTECTED] 
Sent: Monday, November 26, 2007 6:50 PM
To: solr-user@lucene.apache.org
Subject: Re: LSA Implementation

The languages also include CJK :) among others.

- Eswar

On Nov 27, 2007 8:16 AM, Norskog, Lance [EMAIL PROTECTED] wrote:

 The WordNet project at Princeton (USA) is a large database of synonyms.
 If you're only working in English, this might be useful instead of
 running your own analyses.

 http://en.wikipedia.org/wiki/WordNet
 http://wordnet.princeton.edu/

 Lance

 -Original Message-
 From: Eswar K [mailto:[EMAIL PROTECTED]
 Sent: Monday, November 26, 2007 6:34 PM
 To: solr-user@lucene.apache.org
 Subject: Re: LSA Implementation

 In addition to recording which keywords a document contains, the method
 examines the document collection as a whole, to see which other
 documents contain some of those same words. The algorithm should
 consider documents that have many words in common to be semantically
 close, and ones with few words in common to be semantically distant.
 This simple method correlates surprisingly well with how a human being,
 looking at content, might classify a document collection. Although the
 algorithm doesn't understand anything about what the words *mean*, the
 patterns it notices can make it seem astonishingly intelligent.

 When you search such an index, the search engine looks at the
 similarity values it has calculated for every content word and returns
 the documents that it thinks best fit the query. Because two documents
 may be semantically very close even if they do not share a particular
 keyword, this algorithm will often return relevant documents that don't
 contain the keyword at all, where a plain keyword search would fail.
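
 (For reference, the standard latent semantic analysis formulation
 behind this description: the term-document matrix $X$ is factored with
 a truncated singular value decomposition,

   $X \approx U_k \Sigma_k V_k^T$

 and documents are compared by cosine similarity of their projections
 into the resulting k-dimensional concept space, so two documents can
 score as close without sharing a single literal term.)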

 - Eswar

 On Nov 27, 2007 7:51 AM, Marvin Humphrey [EMAIL PROTECTED]
wrote:

 
  On Nov 26, 2007, at 6:06 PM, Eswar K wrote:
 
   We essentially are looking at having an implementation for doing
   search which can return documents having conceptually similar words
   without necessarily having the original word searched for.
 
  Very challenging.  Say someone searches for LSA and hits an archived
  version of the mail you sent to this list.  LSA is a reasonably
  discriminating term.  But so is Eswar.
 
  If you knew that the original term was LSA, then you might look for
  documents near it in term vector space.  But if you don't know the
  original term, only the content of the document, how do you know
  whether you should look for docs near lsa or eswar?
 
  Marvin Humphrey
  Rectangular Research
  http://www.rectangular.com/
 
 
 



Related Search

2007-11-27 Thread William Silva
Hi,
What is the best way to implement a related search like CNET's with SOLR?
Ex.: searching for tv, the related searches are: lcd tv, lcd, hdtv,
vizio, plasma tv, panasonic, gps, plasma
Thanks,
William.


Re: Related Search

2007-11-27 Thread Cool Coder
Take a look at this thread:
http://www.gossamer-threads.com/lists/lucene/java-user/54996

There was a need to get all related topics for any selected topic. I
used the Lucene sandbox WordNet project to get all synonyms of
user-selected topics. I am not sure whether the WordNet project would
help you, as you are looking for product synonyms. In your case, you
might need to maintain a vector of product synonyms. E.g., if a user
searches for TV, internally you would search for lcd tv, lcd, hdtv,
etc.

Take a look at www.ajaxtrend.com to see how related topics are
displayed; I keep refining the related query search as the site evolves.
This is just a prototype.
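
(One way to wire such a product-synonym vector into Solr is the stock
synonym filter with an explicit mapping file; the entries below are just
the ones from this example:

  tv => tv, lcd tv, lcd, hdtv, plasma tv

A synonyms.txt line like that, used with solr.SynonymFilterFactory at
query time, makes a search for tv also match the mapped phrases.)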
   
  - BR

Solr and nutch, for reading a nutch index

2007-11-27 Thread bbrown
I couldn't tell if this was asked before, but I want to perform a Nutch
crawl, without any Solr plugin, that will simply write to some index
directory, and then ideally use Solr for searching. I am assuming this
is possible?

--
Berlin Brown
[berlin dot brown at gmail dot com]
http://botspiritcompany.com/botlist/?



Re: Solr and nutch, for reading a nutch index

2007-11-27 Thread Brian Whitman


On Nov 27, 2007, at 6:08 PM, bbrown wrote:

 I couldn't tell if this was asked before, but I want to perform a Nutch
 crawl, without any Solr plugin, that will simply write to some index
 directory, and then ideally use Solr for searching. I am assuming this
 is possible?



Yes, this is quite possible. You need a Solr schema that mimics the
Nutch schema; see Sami's solrindexer for an example. Once you've got
that schema, simply set the data dir in your solrconfig to the Nutch
index location and you'll be set.
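
(A sketch of that last step with a hypothetical crawl path: Solr expects
the Lucene index in an index/ subdirectory under its data dir, so point
dataDir in solrconfig.xml at the directory containing the Nutch-built
index:

  <dataDir>/path/to/nutch/crawl</dataDir>

This assumes the merged Nutch index ends up at
/path/to/nutch/crawl/index.)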





Re: LSA Implementation

2007-11-27 Thread Grant Ingersoll
Using WordNet may require some type of disambiguation approach,
otherwise you can end up with a lot of synonyms.  I would also look into
how much coverage there is for non-English languages.

If you have the resources, you may be better off developing or finding
your own synonym/concept list based on your genres.  You might also look
into other approaches for assigning concepts offline and adding them to
the document.
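
(If you do build such a list, a sketch of wiring it into an analyzer
chain with Solr's stock synonym filter; the type name and the contents
of synonyms.txt are placeholders:

  <fieldtype name="text_concepts" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
    </analyzer>
  </fieldtype>
)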


-Grant


--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ





Re: Combining SOLR and JAMon to monitor query execution times from a browser

2007-11-27 Thread Norberto Meijome

Thanks Siegfried,

I am further interested in plugging this information into something like
Nagios, Cacti, Zenoss, Big Sister, OpenView, or your monitoring system
of choice, but I haven't had much time to look into this yet. How does
JAMon compare to JMX
(http://java.sun.com/javase/technologies/core/mntr-mgmt/javamanagement/)?

cheers,
B

_
{Beto|Norberto|Numard} Meijome

There are no stupid questions, but there are a LOT of inquisitive idiots.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Re: Solr and nutch, for reading a nutch index

2007-11-27 Thread Norberto Meijome
On Tue, 27 Nov 2007 18:12:13 -0500
Brian Whitman [EMAIL PROTECTED] wrote:

 
 On Nov 27, 2007, at 6:08 PM, bbrown wrote:
 
  I couldn't tell if this was asked before, but I want to perform a Nutch
  crawl, without any Solr plugin, that will simply write to some index
  directory, and then ideally use Solr for searching. I am assuming this
  is possible?
 
 Yes, this is quite possible. You need a Solr schema that mimics the
 Nutch schema; see Sami's solrindexer for an example. Once you've got
 that schema, simply set the data dir in your solrconfig to the Nutch
 index location and you'll be set.

I think you should keep an eye on the versions of the Lucene library
used by both Nutch and Solr - differences at this layer *could* make
them incompatible - but I am not an expert...
B

_
{Beto|Norberto|Numard} Meijome

Against logic there is no armor like ignorance.
  Laurence J. Peter

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


Re: Solr and nutch, for reading a nutch index

2007-11-27 Thread Otis Gospodnetic
I only glanced at Sami's post recently, and what I think I saw there is
something different. In other words, what Sami described is not a Solr
instance pointing to a Nutch-built Lucene index, but rather an app that
reads the appropriate Nutch/Hadoop files with fetched content and posts
that content to a Solr instance using a Solr Java client like solrj.
No?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





Re: Solr and nutch, for reading a nutch index

2007-11-27 Thread Brian Whitman


On Nov 28, 2007, at 1:24 AM, Otis Gospodnetic wrote:

 I only glanced at Sami's post recently, and what I think I saw there is
 something different. In other words, what Sami described is not a Solr
 instance pointing to a Nutch-built Lucene index, but rather an app that
 reads the appropriate Nutch/Hadoop files with fetched content and posts
 that content to a Solr instance using a Solr Java client like solrj.
 
 No?



Yes, to be clear: all you need from Sami's thing is the schema file.
Ignore everything else. Then point Solr at the Nutch index directory
(it's just a Lucene index).

Sami's entire thing is for indexing with Solr instead of Nutch - a
separate issue...






--
http://variogr.am/





Re: CJK Analyzers for Solr

2007-11-27 Thread Otis Gospodnetic
For what it's worth, I worked on indexing and searching a *massive* pile
of data, a good portion of which was in CJ and some K. The n-gram
approach was used for all three languages, and the quality of search
results, including highlighting, was evaluated and okayed by native
speakers of these languages.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Walter Underwood [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Tuesday, November 27, 2007 2:41:38 PM
Subject: Re: CJK Analyzers for Solr

Dictionaries are surprisingly expensive to build and maintain and
bi-gram is surprisingly effective for Chinese. See this paper:

   http://citeseer.ist.psu.edu/kwok97comparing.html

I expect that n-gram indexing would be less effective for Japanese
because it is an inflected language. Korean is even harder. It might
work to break Korean into the phonetic subparts and use n-gram on
those.

You should not do term highlighting with any of the n-gram methods.
The relevance can be very good, but the highlighting just looks dumb.

wunder






Re: CJK Analyzers for Solr

2007-11-27 Thread Otis Gospodnetic
James - can you elaborate on why you think the n-gram approach is not good for 
Chinese?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: James liu [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Monday, November 26, 2007 8:51:23 PM
Subject: Re: CJK Analyzers for Solr

I don't think NGram is a good method for Chinese.

CJKAnalyzer of Lucene is 2-gram.

Eswar K:
  If you need a Chinese analyzer, I recommend hylanda (www.hylanda.com);
it is the best Chinese analyzer, but it is not free.
  If you want a free Chinese analyzer, maybe you can try je-analyzer,
though it has some problems.





-- 
regards
jl





Re: CJK Analyzers for Solr

2007-11-27 Thread Eswar K
Otis,

Thanks for the information, we will check this out.

Regards,
Eswar

On Nov 28, 2007 12:20 PM, Otis Gospodnetic [EMAIL PROTECTED]
wrote:

 Eswar,

 I wouldn't worry about the performance of those CJK analyzers too much -
 they are fairly trivial.  The StandardAnalyzer is slower, for example.
 I recently indexed ca. 20MM large docs on an 8-core, 8 GB RAM box in 10
 hours - 550 docs/second.  No CJK, just English.

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

 - Original Message 
 From: Eswar K [EMAIL PROTECTED]
 To: solr-user@lucene.apache.org
 Sent: Monday, November 26, 2007 9:27:15 PM
 Subject: Re: CJK Analyzers for Solr

 Thanks James...
 
 How much time does it take to index 18M docs?
 
 - Eswar







Re: CJK Analyzers for Solr

2007-11-27 Thread Otis Gospodnetic
Eswar - I can answer the Google question.  Actually, you are pointing to it in 
1) :)

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Eswar K [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Wednesday, November 28, 2007 2:21:40 AM
Subject: Re: CJK Analyzers for Solr

John,

There were two parts to my question:

1) n-gram vs. morphological analyzer - This was based on what I read in
a few places that rate morphological analysis higher than n-gram, one
example being
http://www.basistech.com/knowledge-center/products/N-Gram-vs-morphological-analysis.pdf.
My intention was not to question the effectiveness of the existing
implementation; I was curious about the thought process behind the
decision, and whether there are any downsides to using a morphological
analyzer instead of the CJK analyzer.

2) Morphological analyzer used by Google - I don't know which
morphological analyzer Google uses, but I have read in several places
that they do use one.

- Eswar

On Nov 27, 2007 10:42 PM, John Stewart [EMAIL PROTECTED] wrote:

 Eswar,
 
 What type of morphological analysis do you suspect (or know) that
 Google does on East Asian text?  I don't think you can treat the three
 languages in the same way here.  Japanese has multi-morphemic words,
 but Chinese doesn't really.
 
 jds
