Re: question about fl=score

2008-03-20 Thread 李银松
My customer wants to get the 10000th-10010th added docs,
so I have to sort by timestamp to get the top 10010 docs' timestamps ……


2008/3/20, Walter Underwood [EMAIL PROTECTED]:

 Why do you want the 10,000th most relevant result?
 That seems very, very odd. Most people need the most
 relevant result. Maybe the ten most relevant results.

 I'm searching for the movie 'Ratatouille', but please
 give me the 10,001st result instead of that movie.

 If you explain your desire, we may have a better approach.

 wunder
 ==
 Search Guy, Netflix

 On 3/19/08 10:43 PM, 李银松 [EMAIL PROTECTED] wrote:

  I am not getting 10000 records,
  I am getting records from 10000-10010.
  So I need the top 10010 records' *sort field* to merge and get the final
  results, just like the distributed search.
  The data to transport is about 500k (10000 docs' scores)
  and the QTime is about 100ms,
  but the total time I used is about 10+ seconds.
  I want to know whether it really costs so much time or whether something
  else is wrong.


  2008/3/20, Walter Underwood [EMAIL PROTECTED]:
  
   Getting 10,000 records will be slow.
  
   What are you doing with 10,000 records?
  
   wunder
  
   On 3/19/08 10:07 PM, 李银松 [EMAIL PROTECTED] wrote:
  
    I want to get the top 10000-10010 records from two different servers,
    so I have to get the top 10010 scores from each server and merge them
    to get the results.
    I found that the time was mostly spent in XMLResponseParser while
    parsing the inputstream.
    I wonder whether that time is spent on network transport or on Solr
    preparing the response for transport? Or is something wrong with my
    server?
  
    On 08-3-20, Yonik Seeley [EMAIL PROTECTED] wrote:
   
     2008/3/19 李银松 [EMAIL PROTECTED]:
      1、When I set fl=score, solr returns just as fl=*,score, not just
      scores. Is it a bug or is it on purpose?
    
     On purpose... a score alone with no other context doesn't seem useful.
    
      2、I'm using solrj to get about 10000 docs' scores on a LAN. It costs
      me about 10+ seconds the first time (QTime is less than 100ms), but
      1-2 seconds the second time with the same querystring. It seems a bit
      too long for the first time (total size of the docs to transport is
      about 500k). Is there anything I can do about it?
    
     What are you trying to do with that many scores?
     Search engines are optimized more for retrieving the top n matches
     (where n is ~10 - 100)
    
     -Yonik






Re: question about fl=score

2008-03-20 Thread j . L
2008/3/20 李银松 [EMAIL PROTECTED]:

 1、When I set fl=score, solr returns just as fl=*,score, not just scores.
 Is it a bug or is it on purpose?


You can set fl=id,score; Solr does not support a field list like fl=score on its own.


 My customer wants to get the 10000th-10010th added docs,
 so I have to sort by timestamp to get the top 10010 docs' timestamps ……


 limit 10000, 10010 order by timestamp?
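
A minimal SolrJ sketch of that suggestion, for anyone following along: the
timestamp field name and the offsets are assumptions taken from this thread,
and the exact SolrJ calls may differ between versions.

import org.apache.solr.client.solrj.SolrQuery;

public class DeepPageByTimestamp {
    // Builds the kind of request sketched above: a small field list,
    // an explicit sort on the added-time field, and start/rows paging.
    public static SolrQuery build() {
        SolrQuery q = new SolrQuery("*:*");
        q.set("fl", "id,score");                           // fl=score on its own is not supported
        q.addSortField("timestamp", SolrQuery.ORDER.asc);  // "order by timestamp"
        q.setStart(10000);                                 // roughly "limit 10000, 10"
        q.setRows(10);
        return q;
    }
}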


-- 
regards
j.L


Re: Help Requested

2008-03-20 Thread Norberto Meijome
On Wed, 19 Mar 2008 21:22:42 -0700 (PDT)
Raghav Kapoor [EMAIL PROTECTED] wrote:

 I am new to Solr and I am wondering whether Solr can be helpful in a 
 project that I'm working on.

welcome :)

 The project is a client/server app that requires a client app to index the 
 documents and send the results in rdf to server. 
 The client needs to be smart enough to know when a new document has been 
 added to a specified folder, index it and send the results in rdf/xml to the 
 server. The server will be a web service which will parse the xml and store 
 the metadata in a database. The search will be conducted on the server 
 and will return the results from the database which will be links to the 
 documents on the client.

Any particular reason why you need the server in this situation? Pretty much
everything you are doing can be done locally, except, probably, cross-linking
between clients' documents. I have no idea what kind of environment this app
is supposed to run in (home? office LAN? the interweb :P ?). 

 The client , which is also running a webserver will take the request when the 
 user clicks on the link to the document residing on the client. 

you don't need a webserver for this: just generate a page with file:// links
and all you need is to render it locally. 

 I believe lucene will be useful in this scenario and solr can be used as a 
 web app. 
 I would like to get any input on this architecture and would request any 
 pointers if there is any app already doing something similar and how 
 lucene/solr can be useful in this case.

there are plenty of desktop document indexers using Lucene in some form or
another, and other indexing technologies. I don't know if any of them use Solr
yet. And I know of a few apps out there that do something similar to what you
describe, though with a different design, as the goals are somewhat different.

B
_
{Beto|Norberto|Numard} Meijome

Some cause happiness wherever they go; others, whenever they go.
  Oscar Wilde

I speak for myself, not my employer. Contents may be hot. Slippery when wet.
Reading disclaimers makes you go blind. Writing them is worse. You have been
Warned.


Re: RAM Based Index for Solr

2008-03-20 Thread Norberto Meijome
On Wed, 19 Mar 2008 17:04:34 -0700 (PDT)
swarag [EMAIL PROTECTED] wrote:

 In Lucene there is a Ram Based Index
 org.apache.lucene.store.RAMDirectory.
 Is there a way to setup my index in solr to use a RAMDirectory?

create a mountpoint on a ramdrive (tmpfs in Linux, I think), and put your index 
in there...? Or does Lucene do anything other than that?
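
For plain Lucene (outside Solr), one equivalent is to copy an on-disk index
into a RAMDirectory at startup; a minimal sketch, with the index path as an
assumption:

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class RamIndexExample {
    public static IndexSearcher openInRam(String indexPath) throws Exception {
        // Copy the on-disk index into memory, then search the in-memory copy.
        RAMDirectory ramDir = new RAMDirectory(FSDirectory.getDirectory(indexPath));
        return new IndexSearcher(ramDir);
    }
}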

B

_
{Beto|Norberto|Numard} Meijome

Unix is very simple, but it takes a genius to understand the simplicity.
   Dennis Ritchie

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.


RAM size

2008-03-20 Thread Geert Van Huychem

Hi all,

is there a way (or formula) to determine the required amount of RAM 
memory, e.g. by number of documents, document size?


I need to index about 15.000.000 documents; each document is 1 to 3 KB in 
size, and only the id of the document will be stored.


I've just implemented a test case on one of our older servers (only 512 MB 
RAM) with 4.000.000 documents. Searching the index is quite fast, but 
when I try to sort the results, I get the well-known OutOfMemory error. 
I'm aware that 512 MB is way too little, but I'm trying to 
determine what size will be needed for 15.000.000 documents.
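
There is no exact formula, but a hedged back-of-the-envelope for the sort case
looks roughly like this: Lucene's FieldCache keeps about one 4-byte entry per
document for a sorted field, plus the distinct term values for string sorts.
Every concrete number below is an assumption.

public class SortMemoryEstimate {
    public static void main(String[] args) {
        long docs = 15000000L;           // planned index size from the question
        long ordBytes = docs * 4L;       // one int ord per document (~57 MB)
        long distinctTerms = 2000000L;   // assumed number of distinct sort keys
        long avgTermBytes = 32L;         // assumed average key size incl. object overhead
        long termBytes = distinctTerms * avgTermBytes;
        System.out.println("rough FieldCache size per sorted field: "
                + (ordBytes + termBytes) / (1024 * 1024) + " MB");
    }
}

Sorting on a numeric field needs roughly just the per-document array, but
either way 512 MB is tight for 15.000.000 documents once the searcher and
other caches are counted.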


Thanks in advance.

--
Kind regards,

Geert Van Huychem
Project Leader
Mediargus NV

tel +32 2 741 60 22
fax +32 2 740 09 71



RE: Language support

2008-03-20 Thread nicolas . dessaigne
You may be interested in a recent discussion that took place on a similar
subject:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg09332.html

Nicolas

-Message d'origine-
De : David King [mailto:[EMAIL PROTECTED] 
Envoyé : mercredi 19 mars 2008 20:07
À : solr-user@lucene.apache.org
Objet : Language support

This has probably been asked before, but I'm having trouble finding  
it. Basically, we want to be able to search for content across several  
languages, given that we know what language a datum and a query are  
in. Is there an obvious way to do this?

Here's the longer version: I am trying to index content that occurs in  
multiple languages, including Asian languages. I'm in the process of  
moving from PyLucene to Solr. In PyLucene, I would have a list of  
analysers:

 analyzers = dict(en = pyluc.SnowballAnalyzer("English"),
                  cs = pyluc.CzechAnalyzer(),
                  pt = pyluc.SnowballAnalyzer("Portuguese"),
                  ...

Then when I want to index something, I do

writer = pyluc.IndexWriter(store, analyzer, create)
writer.addDocument(d.doc)

That is, I tell Lucene the language of every datum, and the analyser  
to use when writing out the field. Then when I want to search against  
it, I do

 analyzer = LanguageAnalyzer.getanal(lang)
 q = pyluc.QueryParser(field, analyzer).parse(value)

And use that QueryParser to parse the query in the given language  
before sending it off to PyLucene. (off-topic: getanal() is perhaps my  
favourite function-name ever). So the language of a given datum is  
attached to the datum itself. In Solr, however, this appears to be  
attached to the field, not to the individual data in it:

 <fieldType name="text_greek" class="solr.TextField">
   <analyzer class="org.apache.lucene.analysis.el.GreekAnalyzer"/>
 </fieldType>

Does this mean that there's no way to have a single "contents" field  
that has content in multiple languages, and still have the queries be  
parsed and stemmed correctly? How are other people handling this? Does  
it make sense to write a tokeniser factory and a query factory that  
look at, say, the 'lang' field and return the correct tokenisers? Does  
this already exist?

The other alternative is to have a text_zh field, a text_en field,  
etc, and to modify the query to search on that field depending on the  
language of the query, but that seems kind of hacky to me, especially  
if a query may be against more than one language. Is this the accepted  
way to go about it? Is there a benefit to this method over writing a  
detecting tokeniser factory?


Re: RAM Based Index for Solr

2008-03-20 Thread Jeryl Cook
There is currently no way to use RAMDirectory instead of FSDirectory
in Solr yet; however, there is a feature request to implement this.
I personally think this would be great because we could use Terracotta
to handle the clustering.

Jeryl Cook


On Thu, Mar 20, 2008 at 1:07 AM, Norberto Meijome [EMAIL PROTECTED] wrote:
 On Wed, 19 Mar 2008 17:04:34 -0700 (PDT)
  swarag [EMAIL PROTECTED] wrote:

   In Lucene there is a Ram Based Index
   org.apache.lucene.store.RAMDirectory.
   Is there a way to setup my index in solr to use a RAMDirectory?

  create a mountpoint on a ramdrive (tmpfs in linux, i think), and put your 
 index in there... ? or does lucene do anything other than that?

  B

  _
  {Beto|Norberto|Numard} Meijome

  Unix is very simple, but it takes a genius to understand the simplicity.
Dennis Ritchie

  I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
 Reading disclaimers makes you go blind. Writing them is worse. You have been 
 Warned.




-- 
Jeryl Cook
/^\ Pharaoh /^\
http://pharaohofkush.blogspot.com/
..Act your age, and not your shoe size.. -Prince(1986)


Re: what's up with: java -Ddata=args -jar post.jar optimize/

2008-03-20 Thread Bill Au
What messages do you see in your log file?

Bill

On Wed, Mar 19, 2008 at 3:15 PM, [EMAIL PROTECTED] wrote:


 Hi,



 I'm a new Solr user. I figured my way around Solr just fine (I think) ...
 I can index and search etc. And so far I have indexed over 300k documents.



 What I can't figure out is the following. I'm using:



  java -Ddata=args -jar post.jar "<optimize/>"


 to post an optimize command. What I'm finding is that I have to do it
 twice in order for the files to be optimized ... i.e.: the first post
 takes 3-4 minutes but leaves the file count as is at 44 ... the second post
 takes 2-3 seconds but shrinks the file count from 44 to 8.


 So my question is the following, is this the expected behavior or am I
 doing something wrong? Do I need two optimize posts to really optimize my
 index?!


 Thanks in advance


 -JM



Re: Faceting Problem

2008-03-20 Thread Erik Hatcher
When faced with these sorts of issues, it is worthwhile to step back  
and experiment with Solr's analysis page:  
http://localhost:8983/solr/admin/analysis.jsp


Select your field type either by name of field or by type, put in  
some text, and see what happens to it at both indexing and querying  
time.


Erik


On Mar 19, 2008, at 10:08 AM, Tejaswi_Haramurali wrote:


Hi ,

 I am facing a problem using solrj. I am using Java (solrj) to index as
well as search data in the Solr search engine.
This is some of the code:


exer.setField("name", "DOC" + identity);
exer.setField("features", "The Mellon Foundation");
exer.setField("language", langmap.get("008lang"));
exer.setField("date", datemap.get("008date"));
exer.setField("format", formatmap.get("formats"));


The problem is, when I do a search on 'Mellon' or any word associated with
the 'features' field, I get results. However, when I do a search on any of
the other fields, I don't get results. I have ensured that indexed=true in
schema.xml for all these fields and have also tried displaying the values I
am indexing. I don't know what mistake I am committing.

I would be glad if someone could help me on this.

Tejaswi
--
View this message in context: 
http://www.nabble.com/Faceting-Problem-tp16144141p16144141.html

Sent from the Solr - User mailing list archive at Nabble.com.




Re: what's up with: java -Ddata=args -jar post.jar optimize/

2008-03-20 Thread John

Thanks Bill!!

Here is the content of the log file (I restarted Solr so we have a clean log):

127.0.0.1 - - [20/03/2008:13:38:09 +0000] GET /solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on HTTP/1.1 200 2538 
127.0.0.1 - - [20/03/2008:13:38:31 +0000] GET /solr/admin/logging.jsp HTTP/1.1 200 138 
127.0.0.1 - - [20/03/2008:13:38:33 +0000] GET /solr/admin/logging.xsl HTTP/1.1 304 0 
127.0.0.1 - - [20/03/2008:13:38:33 +0000] GET /solr/admin/meta.xsl HTTP/1.1 304 0 
127.0.0.1 - - [20/03/2008:13:38:36 +0000] GET /solr/admin/action.jsp?log=ALL HTTP/1.1 200 901 
127.0.0.1 - - [20/03/2008:13:38:55 +0000] GET /solr/admin/ HTTP/1.1 200 3818 
127.0.0.1 - - [20/03/2008:13:38:59 +0000] GET /solr/select/?q=solr&version=2.2&start=0&rows=10&indent=on HTTP/1.1 200 2161 
127.0.0.1 - - [20/03/2008:13:39:01 +0000] GET /solr/select/?q=solr&version=2.2&start=0&rows=10&indent=on HTTP/1.1 200 2159 
127.0.0.1 - - [20/03/2008:13:39:17 +0000] GET /solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on HTTP/1.1 200 2536 
127.0.0.1 - - [20/03/2008:13:39:44 +0000] POST /solr/update HTTP/1.1 200 152 
127.0.0.1 - - [20/03/2008:13:43:32 +0000] POST /solr/update HTTP/1.1 200 149 
127.0.0.1 - - [20/03/2008:13:44:16 +0000] GET /solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on HTTP/1.1 200 2537 
127.0.0.1 - - [20/03/2008:13:44:17 +0000] GET /solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on HTTP/1.1 200 2536 
127.0.0.1 - - [20/03/2008:13:44:18 +0000] GET /solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on HTTP/1.1 200 2536 
127.0.0.1 - - [20/03/2008:13:44:26 +0000] POST /solr/update HTTP/1.1 200 149 
127.0.0.1 - - [20/03/2008:13:44:27 +0000] POST /solr/update HTTP/1.1 200 149 
127.0.0.1 - - [20/03/2008:13:44:51 +0000] GET /solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on HTTP/1.1 200 2536 
127.0.0.1 - - [20/03/2008:13:44:51 +0000] GET /solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on HTTP/1.1 200 2536 

The two POSTs are a result of issuing java -Ddata=args -jar post.jar 
"<optimize/>"

Just before I issued the first POST, there were 71 files in the index (total 
size ~1.4Gb) ... after the first POST, there were 20 files (total size ~2.7Gb) 
.. after the second POST, there were 8 files (total size ~1.3Gb)

The increase in the index size ... from 1.4Gb to 2.7Gb ... as well as the total 
files count (3 different counts) .. is something I have not observed before!!? 
In all of my previous experiments ... on the first POST ... the index size 
increased slightly, but the file count never (I think) went up!

Who can explain to me what's going on here?!!


I'm using Solr 1.2 ... the only change I have made is adding new fields to 
support my data type.

-JM

-Original Message-
From: Bill Au [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thu, 20 Mar 2008 8:58 am
Subject: Re: what's up with: java -Ddata=args -jar post.jar optimize/




What messages do you see in your log file?

Bill

On Wed, Mar 19, 2008 at 3:15 PM, [EMAIL PROTECTED] wrote:


 Hi,



 I'm a new Solr user. I figured my way around Solr just fine (I think) ...
 I can index and search ets. And so far I have indexed over 300k documents.



 What I can't figure out is the following. I'm using:



  java -Ddata=args -jar post.jar "<optimize/>"


 to post an optimize command. What I'm finding is that I have to do it
 twice in order for the files to be optimized ... i.e.: the first post
 takes 3-4 minutes but leaves the file count as is at 44 ... the second post
 takes 2-3 seconds but shrinks the file count from 44 to 8.


 So my question is the following, is this the expected behavior or am I
 doing something wrong? Do I need two optimize posts to really optimize my
 index?!


 Thanks in advance


 -JM




Re: what's up with: java -Ddata=args -jar post.jar optimize/

2008-03-20 Thread Yonik Seeley
On Wed, Mar 19, 2008 at 3:15 PM,  [EMAIL PROTECTED] wrote:
 What I'm finding is that I have to do it twice in order for the files to be 
 optimized ... i.e.: the
 first post takes 3-4 minutes but leaves the file count as is at 44 ... the 
 second post takes 2-3
 seconds but shrinks the file count from 44 to 8.

Let me guess, are you on windows?

This is actually expected behavior.  The first optimize actually does
optimize the whole index.  When the optimize finishes, it can't delete
the old files because they are still in use by the current
IndexSearcher.  If you were on UNIX, the files would be deleted sooner
(or at least look like they were).

In short, Solr + Lucene are doing the right thing... just optimize
once, and don't worry about it.

-Yonik


Re: what's up with: java -Ddata=args -jar post.jar optimize/

2008-03-20 Thread John
Thanks Yonik!!

Yep, I'm on Windows ... so if it can't delete the old files, shouldn't a 
restart of Solr do the trick?? i.e. the files are no longer locked by Windows 
... so they can now be deleted when Solr exits ... I tried it and didn't see 
any change.

Who is keeping those files around / locked ... Solr or Lucene?? and what is 
going on with the second call to optimize that's able to really delete those 
old files where the first optimize couldn't?

-JM


-Original Message-
From: Yonik Seeley [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thu, 20 Mar 2008 10:13 am
Subject: Re: what's up with: java -Ddata=args -jar post.jar optimize/



On Wed, Mar 19, 2008 at 3:15 PM,  [EMAIL PROTECTED] wrote:
 What I'm finding is that I have to do it twice in order for the files to be 
optimized ... i.e.: the
 first post takes 3-4 minutes but leaves the file count as is at 44 ... the 
second post takes 2-3
 seconds but shrinks the file count from 44 to 8.

Let me guess, are you on windows?

This is actually expected behavior.  The first optimize actually does
optimize the whole index.  When the optimize finishes, it can't delete
the old files because they are still in use by the current
IndexSearcher.  If you were on UNIX, the files would be deleted sooner
(or at least look like they were).

In short, Solr + Lucene are doing the right thing... just optimize
once, and don't worry about it.

-Yonik



Re: what's up with: java -Ddata=args -jar post.jar optimize/

2008-03-20 Thread Yonik Seeley
On Thu, Mar 20, 2008 at 10:55 AM, John [EMAIL PROTECTED] wrote:
  Yep, I'm on Windows ... so if it can't delete the old files, shouldn't a 
 restart of Solr do the trick?? i.e. the files are no longer locked by Windows 
 ... so they can now be deleted when Solr exits ... I tried it and didn't see 
 any change.

  Who is keeping those files around / locked ... Solr or Lucene?? and what is 
 going on with the second call to optimize that's able to really delete 
 those old files where the first optimize couldn't?


The IndexWriter cleans up old unreferenced files periodically... so as
you continue to add to the index, those files will be removed (maybe
on a segment merge, definitely on another commit).
As I said... don't worry about it, they will get cleaned up sooner or
later (unless you are never going to change the index again after you
build it).

-Yonik


Re: Help Requested

2008-03-20 Thread Raghav Kapoor
Thanks Norberto !

 Any particular reason why need the server in this
 situation? pretty much
 everything you are doing can be done locally.
 Except, probably, cross linking
 between client's documents. I have no idea in what
 kind of environment this app
 is supposed to run (home? office LAN? the interweb
 :P ? ). 

So it's going to be a client/server app where all the
documents will be stored on the client and only
metadata of those docs will be sent to the server.
That way the server does not have to store any real
documents. It's an internet-based application. Search
on the server will read the metadata for keywords and
send the request to all the clients that contain
documents with that keyword. We cannot store
everything on one client; all clients are different
machines distributed all over the world.

 you don't need a webserver for this, just generate a
 page in from your
 webserver with file:// links and all you need is to
 render it locally. 

How will the client serve the documents stored locally
through a standard mechanism (like port 80) when the
server requests them? The client will not open any
special ports for the server, so we need the web
server, I guess? 


 there are plenty of desktop document indexers using
 lucene on some form or
 another, and other indexing technologies. I dont
 know if any uses Solr  - yet.
 And i know of a few apps out there that do something
 similar to what you
 describe though with different design as the goals
 are somewhat different.


The client application needs to give the users the
ability to handle metadata of the documents that
will be sent to the server so that efficient searching
can be conducted, so I assume we need a web app like
Solr (or we create our own using Lucene).

Let me know your thoughts and thanks again for your
reply !

Regards

Raghav.


--- Norberto Meijome [EMAIL PROTECTED] wrote:

 On Wed, 19 Mar 2008 21:22:42 -0700 (PDT)
 Raghav Kapoor [EMAIL PROTECTED] wrote:
 
  I am new to Solr and I am facing a question if
 solr can be helpful in a project that I'm working
 on.
 
 welcome :)
 
  The project is a client/server app that requires a
 client app to index the documents and send the
 results in rdf to server. 
  The client needs to be smart enough to know when a
 new document has been added to a specified folder,
 index it and send the results in rdf/xml to the
 server. The server will be a web service which will
 parse the xml and store the metadata in the a
 database. The search will be conducted on the server
 and will return the results from the database which
 will be links to the documents on the client.
 
 Any particular reason why need the server in this
 situation? pretty much
 everything you are doing can be done locally.
 Except, probably, cross linking
 between client's documents. I have no idea in what
 kind of environment this app
 is supposed to run (home? office LAN? the interweb
 :P ? ). 
 
  The client , which is also running a webserver
 will take the request when the user clicks on the
 link to the document residing on the client. 
 
 you don't need a webserver for this, just generate a
 page in from your
 webserver with file:// links and all you need is to
 render it locally. 
 
  I believe lucene will be useful in this scenario
 and solr can be used as a web app. 
  I would like to get any input on this architecture
 and would request any pointers if there is any app
 already doing something similar and how lucene/solr
 can be useful in this case.
 
 there are plenty of desktop document indexers using
 lucene on some form or
 another, and other indexing technologies. I dont
 know if any uses Solr  - yet.
 And i know of a few apps out there that do something
 similar to what you
 describe though with different design as the goals
 are somewhat different.
 
 B
 _
 {Beto|Norberto|Numard} Meijome
 
 Some cause happiness wherever they go; others,
 whenever they go.
   Oscar Wilde
 
 I speak for myself, not my employer. Contents may be
 hot. Slippery when wet.
 Reading disclaimers makes you go blind. Writing them
 is worse. You have been
 Warned.
 



  



Re: Language support

2008-03-20 Thread David King
You may be interested in a recent discussion that took place on a  
similar

subject:
http://www.mail-archive.com/solr-user@lucene.apache.org/msg09332.html


Interesting, yes. But since it doesn't actually exist, it's not much  
help.


I guess what I'm asking is, if my approach seems convoluted, I'm  
probably doing it wrong, so how *are* people solving the problem of  
searching over multiple languages? What is the canonical way to do this?






Nicolas

-Message d'origine-
De : David King [mailto:[EMAIL PROTECTED]
Envoyé : mercredi 19 mars 2008 20:07
À : solr-user@lucene.apache.org
Objet : Language support

This has probably been asked before, but I'm having trouble finding
it. Basically, we want to be able to search for content across several
languages, given that we know what language a datum and a query are
in. Is there an obvious way to do this?

Here's the longer version: I am trying to index content that occurs in
multiple languages, including Asian languages. I'm in the process of
moving from PyLucene to Solr. In PyLucene, I would have a list of
analysers:

analyzers = dict(en = pyluc.SnowballAnalyzer(English),
 cs = pyluc.CzechAnalyzer(),
 pt = pyluc.SnowballAnalyzer(Portuguese),
 ...

Then when I want to index something, I do

   writer = pyluc.IndexWriter(store, analyzer, create)
   writer.addDocument(d.doc)

That is, I tell Lucene the language of every datum, and the analyser
to use when writing out the field. Then when I want to search against
it, I do

analyzer = LanguageAnalyzer.getanal(lang)
q = pyluc.QueryParser(field, analyzer).parse(value)

And use that QueryParser to parse the query in the given language
before sending it off to PyLucene. (off-topic: getanal() is perhaps my
favourite function-name ever). So the language of a given datum is
attached to the datum itself. In Solr, however, this appears to be
attached to the field, not to the individual data in it:

fieldType name=text_greek class=solr.TextField
  analyzer class=org.apache.lucene.analysis.el.GreekAnalyzer/
/fieldType

Does this mean there there's no way to have a single contents field
that has content in multiple languages, and still have the queries be
parsed and stemmed correctly? How are other people handling this? Does
it makes sense to write a tokeniser factory and a query factory that
look at, say, the 'lang' field and return the correct tokenisers? Does
this already exist?

The other alternative is to have a text_zh field, a text_en field,
etc, and to modify the query to search on that field depending on the
language of the query, but that seems kind of hacky to me, especially
if a query may be against more than one language. Is this the accepted
way to go about it? Is there a benefit to this method over writing a
detecting tokeniser factory?




Re: Language support

2008-03-20 Thread Benson Margulies
Unless you can come up with language-neutral tokenization and stemming, you
need to:

a) know the language of each document.
b) run a different analyzer depending on the language.
c) force the user to tell you the language of the query.
d) run the query through the same analyzer.
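
A minimal Java sketch of that recipe using plain Lucene contrib analyzers
(class and method names here are illustrative; the analyzer choices mirror
the PyLucene dict quoted below):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cz.CzechAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class PerLanguageAnalyzers {
    private final Map<String, Analyzer> analyzers = new HashMap<String, Analyzer>();

    public PerLanguageAnalyzers() {
        // a) the caller must know the language code of each document/query
        analyzers.put("en", new SnowballAnalyzer("English"));
        analyzers.put("cs", new CzechAnalyzer());
        analyzers.put("pt", new SnowballAnalyzer("Portuguese"));
    }

    // b) pick the analyzer by language; use the same one at index and query time
    public Analyzer forLanguage(String lang) {
        Analyzer a = analyzers.get(lang);
        if (a == null) {
            throw new IllegalArgumentException("no analyzer for language: " + lang);
        }
        return a;
    }

    // c) + d) the user supplies the query language; parse with that analyzer
    public Query parse(String field, String lang, String text) throws Exception {
        return new QueryParser(field, forLanguage(lang)).parse(text);
    }
}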



On Thu, Mar 20, 2008 at 12:17 PM, David King [EMAIL PROTECTED] wrote:

  You may be interested in a recent discussion that took place on a
  similar
  subject:
  http://www.mail-archive.com/solr-user@lucene.apache.org/msg09332.html

 Interesting, yes. But since it doesn't actually exist, it's not much
 help.

 I guess what I'm asking is, if my approach seems convoluted, I'm
 probably doing it wrong, so how *a*re people solving the problem of
 searching over multiple languages? What is the canonical way to do this?


 
 
  Nicolas
 
  -Message d'origine-
  De : David King [mailto:[EMAIL PROTECTED]
  Envoyé : mercredi 19 mars 2008 20:07
  À : solr-user@lucene.apache.org
  Objet : Language support
 
  This has probably been asked before, but I'm having trouble finding
  it. Basically, we want to be able to search for content across several
  languages, given that we know what language a datum and a query are
  in. Is there an obvious way to do this?
 
  Here's the longer version: I am trying to index content that occurs in
  multiple languages, including Asian languages. I'm in the process of
  moving from PyLucene to Solr. In PyLucene, I would have a list of
  analysers:
 
  analyzers = dict(en = pyluc.SnowballAnalyzer(English),
   cs = pyluc.CzechAnalyzer(),
   pt = pyluc.SnowballAnalyzer(Portuguese),
   ...
 
  Then when I want to index something, I do
 
 writer = pyluc.IndexWriter(store, analyzer, create)
 writer.addDocument(d.doc)
 
  That is, I tell Lucene the language of every datum, and the analyser
  to use when writing out the field. Then when I want to search against
  it, I do
 
  analyzer = LanguageAnalyzer.getanal(lang)
  q = pyluc.QueryParser(field, analyzer).parse(value)
 
  And use that QueryParser to parse the query in the given language
  before sending it off to PyLucene. (off-topic: getanal() is perhaps my
  favourite function-name ever). So the language of a given datum is
  attached to the datum itself. In Solr, however, this appears to be
  attached to the field, not to the individual data in it:
 
  fieldType name=text_greek class=solr.TextField
analyzer class=org.apache.lucene.analysis.el.GreekAnalyzer/
  /fieldType
 
  Does this mean there there's no way to have a single contents field
  that has content in multiple languages, and still have the queries be
  parsed and stemmed correctly? How are other people handling this? Does
  it makes sense to write a tokeniser factory and a query factory that
  look at, say, the 'lang' field and return the correct tokenisers? Does
  this already exist?
 
  The other alternative is to have a text_zh field, a text_en field,
  etc, and to modify the query to search on that field depending on the
  language of the query, but that seems kind of hacky to me, especially
  if a query may be against more than one language. Is this the accepted
  way to go about it? Is there a benefit to this method over writing a
  detecting tokeniser factory?




Re: Language support

2008-03-20 Thread David King
Unless you can come up with language-neutral tokenization and  
stemming, you

need to:
a) know the language of each document.
b) run a different analyzer depending on the language.
c) force the user to tell you the language of the query.
d) run the query through the same analyzer.


I can do all of those. This implies storing all of the different  
languages in different fields, right? Then changing the default search  
field to the language of the query for every query?








On Thu, Mar 20, 2008 at 12:17 PM, David King [EMAIL PROTECTED]  
wrote:



You may be interested in a recent discussion that took place on a
similar
subject:
http://www.mail-archive.com/solr-user@lucene.apache.org/ 
msg09332.html


Interesting, yes. But since it doesn't actually exist, it's not much
help.

I guess what I'm asking is, if my approach seems convoluted, I'm
probably doing it wrong, so how *a*re people solving the problem of
searching over multiple languages? What is the canonical way to do  
this?






Nicolas

-Message d'origine-
De : David King [mailto:[EMAIL PROTECTED]
Envoyé : mercredi 19 mars 2008 20:07
À : solr-user@lucene.apache.org
Objet : Language support

This has probably been asked before, but I'm having trouble finding
it. Basically, we want to be able to search for content across  
several

languages, given that we know what language a datum and a query are
in. Is there an obvious way to do this?

Here's the longer version: I am trying to index content that  
occurs in

multiple languages, including Asian languages. I'm in the process of
moving from PyLucene to Solr. In PyLucene, I would have a list of
analysers:

   analyzers = dict(en = pyluc.SnowballAnalyzer(English),
cs = pyluc.CzechAnalyzer(),
pt = pyluc.SnowballAnalyzer(Portuguese),
...

Then when I want to index something, I do

  writer = pyluc.IndexWriter(store, analyzer, create)
  writer.addDocument(d.doc)

That is, I tell Lucene the language of every datum, and the analyser
to use when writing out the field. Then when I want to search  
against

it, I do

   analyzer = LanguageAnalyzer.getanal(lang)
   q = pyluc.QueryParser(field, analyzer).parse(value)

And use that QueryParser to parse the query in the given language
before sending it off to PyLucene. (off-topic: getanal() is  
perhaps my

favourite function-name ever). So the language of a given datum is
attached to the datum itself. In Solr, however, this appears to be
attached to the field, not to the individual data in it:

   fieldType name=text_greek class=solr.TextField
 analyzer class=org.apache.lucene.analysis.el.GreekAnalyzer/
   /fieldType

Does this mean there there's no way to have a single contents  
field
that has content in multiple languages, and still have the queries  
be
parsed and stemmed correctly? How are other people handling this?  
Does

it makes sense to write a tokeniser factory and a query factory that
look at, say, the 'lang' field and return the correct tokenisers?  
Does

this already exist?

The other alternative is to have a text_zh field, a text_en field,
etc, and to modify the query to search on that field depending on  
the
language of the query, but that seems kind of hacky to me,  
especially
if a query may be against more than one language. Is this the  
accepted

way to go about it? Is there a benefit to this method over writing a
detecting tokeniser factory?







Re: Language support

2008-03-20 Thread Benson Margulies
You can store in one field if you manage to hide a language code with the
text. XML is overkill but effective for this. At one point, we'd
investigated how to allow a Lucene analyzer to see more than one field (the
language code as well as the text) but I don't think we came up with
anything.


On Thu, Mar 20, 2008 at 12:39 PM, David King [EMAIL PROTECTED] wrote:

  Unless you can come up with language-neutral tokenization and
  stemming, you
  need to:
  a) know the language of each document.
  b) run a different analyzer depending on the language.
  c) force the user to tell you the language of the query.
  d) run the query through the same analyzer.

 I can do all of those. This implies storing all of the different
 languages in different fields, right? Then changing the default search-
 field to the language of the query for every query?


 
 
 
 
  On Thu, Mar 20, 2008 at 12:17 PM, David King [EMAIL PROTECTED]
  wrote:
 
  You may be interested in a recent discussion that took place on a
  similar
  subject:
  http://www.mail-archive.com/solr-user@lucene.apache.org/
  msg09332.html
 
  Interesting, yes. But since it doesn't actually exist, it's not much
  help.
 
  I guess what I'm asking is, if my approach seems convoluted, I'm
  probably doing it wrong, so how *a*re people solving the problem of
  searching over multiple languages? What is the canonical way to do
  this?
 
 
 
 
  Nicolas
 
  -Message d'origine-
  De : David King [mailto:[EMAIL PROTECTED]
  Envoyé : mercredi 19 mars 2008 20:07
  À : solr-user@lucene.apache.org
  Objet : Language support
 
  This has probably been asked before, but I'm having trouble finding
  it. Basically, we want to be able to search for content across
  several
  languages, given that we know what language a datum and a query are
  in. Is there an obvious way to do this?
 
  Here's the longer version: I am trying to index content that
  occurs in
  multiple languages, including Asian languages. I'm in the process of
  moving from PyLucene to Solr. In PyLucene, I would have a list of
  analysers:
 
 analyzers = dict(en = pyluc.SnowballAnalyzer(English),
  cs = pyluc.CzechAnalyzer(),
  pt = pyluc.SnowballAnalyzer(Portuguese),
  ...
 
  Then when I want to index something, I do
 
writer = pyluc.IndexWriter(store, analyzer, create)
writer.addDocument(d.doc)
 
  That is, I tell Lucene the language of every datum, and the analyser
  to use when writing out the field. Then when I want to search
  against
  it, I do
 
 analyzer = LanguageAnalyzer.getanal(lang)
 q = pyluc.QueryParser(field, analyzer).parse(value)
 
  And use that QueryParser to parse the query in the given language
  before sending it off to PyLucene. (off-topic: getanal() is
  perhaps my
  favourite function-name ever). So the language of a given datum is
  attached to the datum itself. In Solr, however, this appears to be
  attached to the field, not to the individual data in it:
 
 fieldType name=text_greek class=solr.TextField
   analyzer class=org.apache.lucene.analysis.el.GreekAnalyzer/
 /fieldType
 
  Does this mean there there's no way to have a single contents
  field
  that has content in multiple languages, and still have the queries
  be
  parsed and stemmed correctly? How are other people handling this?
  Does
  it makes sense to write a tokeniser factory and a query factory that
  look at, say, the 'lang' field and return the correct tokenisers?
  Does
  this already exist?
 
  The other alternative is to have a text_zh field, a text_en field,
  etc, and to modify the query to search on that field depending on
  the
  language of the query, but that seems kind of hacky to me,
  especially
  if a query may be against more than one language. Is this the
  accepted
  way to go about it? Is there a benefit to this method over writing a
  detecting tokeniser factory?
 
 




Re: Language support

2008-03-20 Thread Benson Margulies
Token-by-token seems a bit extreme. Are you concerned with macaronic
documents?

On Thu, Mar 20, 2008 at 12:42 PM, Walter Underwood [EMAIL PROTECTED]
wrote:

 Nice list.

 You may still need to mark the language of each document. There are
 plenty of cross-language collisions: die and boot have different
 meanings in German and English. Proper nouns (Laserjet) may be the
 same in all languages, a different problem if you are trying to get
 answers in one language.

 At one point, I considered using Unicode language tagging on each
 token to keep it all straight. Effectively, index de/Boot or
 en/Laserjet.

 wunder

 On 3/20/08 9:20 AM, Benson Margulies [EMAIL PROTECTED] wrote:

  Unless you can come up with language-neutral tokenization and stemming,
  you
 need to:
 
  a) know the language of each document.
  b) run a different
  analyzer depending on the language.
  c) force the user to tell you the language of the query.
  d) run the query through the same analyzer.





Re: Language support

2008-03-20 Thread Walter Underwood
Extreme, but guaranteed to work and it avoids bad IDF when there are
inter-language collisions. In Ultraseek, we only stored the hash, so
the size of the source token didn't matter.

Trademarks are a bad source of collisions and anomalous IDF. If you have
LaserJet support docs in 20 languages, the term LaserJet will have
a document frequency 20X higher than the terms in a single language
and will score too low.

Ultraseek handles macaronic documents when the script makes it possible,
for example, roman is sent to the English stemmer in a Japanese document,
Hangul always goes to the Korean segmenter/stemmer.

A simpler approach is to tag each document with a language, like lang:de,
then use a filter query to restrict the documents to the query language.
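
A tiny hedged SolrJ sketch of that document-level tagging: the lang field
name and the client calls are assumptions, not something prescribed here.

import org.apache.solr.client.solrj.SolrQuery;

public class LanguageFilteredQuery {
    public static SolrQuery build(String userQuery, String queryLang) {
        SolrQuery q = new SolrQuery(userQuery);
        q.add("fq", "lang:" + queryLang);   // e.g. ...&fq=lang:de on the request URL
        return q;
    }
}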

Per-token tagging still strikes me as the right approach. It makes
all sorts of things work, like keeping fuzzy matches within the same
language. We didn't do it in Ultraseek because it would have been an
incompatible index change and the benefit didn't justify that.
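
A hedged sketch of what per-token tagging could look like with the Lucene 2.x
token API: a TokenFilter that prefixes every term with its language code, so
a German document indexes de/Boot. This is only an illustration of the idea,
not how Ultraseek or Solr implements it.

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class LanguagePrefixFilter extends TokenFilter {
    private final String lang;

    public LanguagePrefixFilter(TokenStream input, String lang) {
        super(input);
        this.lang = lang;
    }

    public Token next() throws IOException {
        Token t = input.next();
        if (t == null) {
            return null;
        }
        // Emit "de/Boot" instead of "Boot", keeping offsets and type.
        return new Token(lang + "/" + t.termText(), t.startOffset(), t.endOffset(), t.type());
    }
}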

wunder
==
Walter Underwood
Former Ultraseek Architect
Current Entire Netflix Search Department

On 3/20/08 9:45 AM, Benson Margulies [EMAIL PROTECTED] wrote:

 Token/by/token seems a bit extreme. Are you concerned with macaronic
 documents?
 
 On Thu, Mar 20, 2008 at 12:42 PM, Walter Underwood [EMAIL PROTECTED]
 wrote:
 
 Nice list.
 
 You may still need to mark the language of each document. There are
 plenty of cross-language collisions: die and boot have different
 meanings in German and English. Proper nouns (Laserjet) may be the
 same in all languages, a different problem if you are trying to get
 answers in one language.
 
 At one point, I considered using Unicode language tagging on each
 token to keep it all straight. Effectively, index de/Boot or
 en/Laserjet.
 
 wunder
 
 On 3/20/08 9:20 AM, Benson Margulies [EMAIL PROTECTED] wrote:
 
 Unless you can come up with language-neutral tokenization and stemming,
 you
 need to:
 
 a) know the language of each document.
 b) run a different
 analyzer depending on the language.
 c) force the user to tell you the language of the query.
 d) run the query through the same analyzer.
 
 
 



Re: Language support

2008-03-20 Thread Benson Margulies
Oh, Walter! Hello! I thought that name was familiar. Greetings from Basis.
All that makes sense.

On Thu, Mar 20, 2008 at 1:00 PM, Walter Underwood [EMAIL PROTECTED]
wrote:

 Extreme, but guaranteed to work and it avoids bad IDF when there are
 inter-language collisions. In Ultraseek, we only stored the hash, so
 the size of the source token didn't matter.

 Trademarks are a bad source of collisions and anomalous IDF. If you have
 LaserJet support docs in 20 languages, the term LaserJet will have
 a document frequency 20X higher than the terms in a single language
 and will score too low.

 Ultraseek handles macaronic documents when the script makes it possible,
 for example, roman is sent to the English stemmer in a Japanese document,
 Hangul always goes to the Korean segmenter/stemmer.

 A simpler approach is to tag each document with a language, like
 lang:de,
 then use a filter query to restrict the documents to the query language.

 Per-token tagging still strikes me as the right approach. It makes
 all sorts of things work, like keeping fuzzy matches within the same
 language. We didn't do it in Ultraseek because it would have been an
 incompatible index change and the benefit didn't justify that.

 wunder
 ==
 Walter Underwood
 Former Ultraseek Architect
 Current Entire Netflix Search Department

 On 3/20/08 9:45 AM, Benson Margulies [EMAIL PROTECTED] wrote:

  Token/by/token seems a bit extreme. Are you concerned with macaronic
  documents?
 
  On Thu, Mar 20, 2008 at 12:42 PM, Walter Underwood 
 [EMAIL PROTECTED]
  wrote:
 
  Nice list.
 
  You may still need to mark the language of each document. There are
  plenty of cross-language collisions: die and boot have different
  meanings in German and English. Proper nouns (Laserjet) may be the
  same in all languages, a different problem if you are trying to get
  answers in one language.
 
  At one point, I considered using Unicode language tagging on each
  token to keep it all straight. Effectively, index de/Boot or
  en/Laserjet.
 
  wunder
 
  On 3/20/08 9:20 AM, Benson Margulies [EMAIL PROTECTED] wrote:
 
  Unless you can come up with language-neutral tokenization and
 stemming,
  you
  need to:
 
  a) know the language of each document.
  b) run a different
  analyzer depending on the language.
  c) force the user to tell you the language of the query.
  d) run the query through the same analyzer.
 
 
 




Re: FunctionQuery in a custom request handler

2008-03-20 Thread evol__

Hi again,
digging this one up.

This is the code I've used in my handler.

ReciprocalFloatFunction tb_valuesource;
tb_valuesource = new ReciprocalFloatFunction(new
ReverseOrdFieldSource(TIMEBIAS_FIELD), m, a, b);
FunctionQuery timebias = new FunctionQuery(tb_valuesource);

// adding to main query
BooleanQuery main = new BooleanQuery();
other_queries.setBoost(BOOST_OTHER_QUERIES);
main.add(other_queries);
timebias.setBoost(BOOST_TIMEBIAS);
main.add(timebias);

It worked, but the problem is that I fail to get a decent ratio between my
other_queries and timebias. I would like to keep timebias at ~15% max
(for totally fresh docs), kind of dropping to nothing at ~one-week-old docs.
Adding to BooleanQuery sums the subquery scores, so I guess there's no way
of controlling the ratio, right?

What I tried to do is to use multiplication:

// this part stays the same
ReciprocalFloatFunction tb_valuesource;
tb_valuesource = new ReciprocalFloatFunction(new
ReverseOrdFieldSource(TIMEBIAS_FIELD), m, a, b);
FunctionQuery timebias = new FunctionQuery(tb_valuesource);

ConstValueSource tb_const = new ConstValueSource(1.0f);
ValueSource[] tb_summa_arr = {tb_const, tb_valuesource};
SumFloatFunction tb_summa = new SumFloatFunction(tb_summa_arr);

QueryValueSource query_vs = new QueryValueSource(query, DEF_VAL);

ValueSource[] vs_arr = {query_vs, tb_summa};
ProductFloatFunction pff = new ProductFloatFunction(vs_arr);

FunctionQuery THE_QUERY = new FunctionQuery(pff);
docs.docList = searcher.getDocList(THE_QUERY, filters, null, start, 
rows,
flags);


(All of the float tweak values are of course placeholders.)
The problem is this crashes at the last line with

Mar 20, 2008 6:59:57 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
at si.david.MyRequestHandler.handleRequestBody(Unknown Source)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:117)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:815)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:206)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:174)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874)
at
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
at
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
at
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
at
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
at java.lang.Thread.run(Thread.java:619)

I'm using a nightly build from late November.
Any ideas? Does what I am doing make any sense? Is there any other way to
accomplish what I'm trying to do?
I'm kind of lost here, thanks for the info.

D.




hossman wrote:
 
 
 : How do I access the ValueSource for my DateField? I'd like to use a
 : ReciprocalFloatFunction from inside the code, adding it aside others in
 the
 : main BooleanQuery.
 
 The FieldType API provides a getValueSource method (so every FieldType 
  picks its own best ValueSource implementation).
 
 
 -Hoss
 
 
 

-- 
View this message in context: 
http://www.nabble.com/FunctionQuery-in-a-custom-request-handler-tp14838957p16186230.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Quoted searches

2008-03-20 Thread Chris Hostetter

:  When I issue a search in quotes, like tay sachs
:   lucene is returning results as if it were written: tay OR sachs
: 
: If you are using the standard request handler, the default operator is
: OR (I assume you didn't use quotes in your query).  You can switch the

But Justin said "When I issue a search in quotes" ... so it really 
should have been a phrase query requiring both terms.

Justin: can you add debugQuery=true to your request, and then let us know 
what the parsedquery_toString and score explanation info looks like?




-Hoss



Preferential boosting

2008-03-20 Thread Lance Norskog
Suppose I have a schema with an integer field called 'duration'. I want to
find all records, but if the duration is 3 I want those records to be
boosted.
 
The index has 10 records, with duration between 2 and 4.  What is the query
that will find all of the records and place the records with duration 3
above the others?
 
These do not work (at least for me):
 
*:* OR duration:3^2.0
duration:[* TO *] duration:3^2.0
duration:3^2.0 OR -duration:3
 
Thanks,
 
Lance Norskog
 
 



Re: Does emty fields affect index size?

2008-03-20 Thread Yonik Seeley
Make sure you omit norms for those fields if possible.  If you do
that, the index should only be marginally bigger.

-Yonik


On Thu, Mar 20, 2008 at 3:20 PM, Evgeniy Strokin
[EMAIL PROTECTED] wrote:
 Hello, let's say I have 10 fields and usually some 5 of them are present in 
 each document. And the size of my index is 100Mb.
  I want to change my schema and I'll have 100 fields, but each document will 
 still have only 5 fields present.
  After I reindex my data, will the size be affected? Could you guess how big 
 the increase will be?
  Any related information, suggestions will be helpful as well.

  Thanks in advance,
  Eugene


RE: Preferential boosting

2008-03-20 Thread Lance Norskog
I was doing something wrong. Bisecting the result set does not work. Using a
much larger boost and ORing with the entire index does work.  Thanks. 

*:* OR duration:3^20.0   works
-duration:3 OR duration:3^20gives empty result set

Now we come to another question: why doesn't X OR -X select the entire
index?

Thanks,

Lance

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
Sent: Thursday, March 20, 2008 12:34 PM
To: solr-user@lucene.apache.org
Subject: Re: Preferential boosting

On Thu, Mar 20, 2008 at 3:13 PM, Lance Norskog [EMAIL PROTECTED] wrote:
 Suppose I have a schema with an integer field called 'duration'. I 
 want to  find all records, but if the duration is 3 I want those 
 records to be  boosted.

  The index has 10 records, with duration between 2 and 4.  What is the 
 query  that will find all of the records and place the records with 
 duration 3  above the others?

  These do not work (at least for me):

 *:* OR duration:3^2.0
 duration:[* TO *] duration:3^2.0

In what way don't these work?
Perhaps a bigger boost would help?

-Yonik



Re: highlighting pt2: returning tokens out of order from PhraseQuery

2008-03-20 Thread Erik Hatcher


On Mar 19, 2008, at 10:26 AM, Brian Whitman wrote:
Can we somehow force the highlighter to not return snips that do  
not exactly match the query?


Unfortunately not with the current highlighter.  But there has been a  
great deal of work towards fixing this here:  http:// 
issues.apache.org/jira/browse/LUCENE-794


Erik



Re: highlighting pt2: returning tokens out of order from PhraseQuery

2008-03-20 Thread Brian Whitman


Unfortunately not with the current highlighter.  But there has been  
a great deal of work towards fixing this here:  http://issues.apache.org/jira/browse/LUCENE-794




ah, thanks Erik, didn't think to check w/ the Lucene folks.
I see they have somewhat-working patches -- does this kind of stuff  
port over easily to Solr?


Re: Does emty fields affect index size?

2008-03-20 Thread Evgeniy Strokin
Thanks for the info. But what about cache? Will it take more memory for 100 
fields schema with the same amount of data?


- Original Message 
From: Yonik Seeley [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thursday, March 20, 2008 3:48:28 PM
Subject: Re: Does emty fields affect index size?

Make sure you omit norms for those fields if possible.  If you do
that, the index should only be marginally bigger.

-Yonik


On Thu, Mar 20, 2008 at 3:20 PM, Evgeniy Strokin
[EMAIL PROTECTED] wrote:
 Hello, lets say I have 10 fields and usually some 5 of them are present in 
 each document. And the size of my index is 100Mb.
  I want to change my schema and I'll have 100 fields, but each document will 
 still have only 5 fields present.
  After I reindex my data, will the size be affected? Could you guess how big 
 the increase will be?
  Any related information, suggestions will be helpful as well.

  Thanks in advance,
  Eugene

Re: Does emty fields affect index size?

2008-03-20 Thread Yonik Seeley
On Thu, Mar 20, 2008 at 4:23 PM, Evgeniy Strokin
[EMAIL PROTECTED] wrote:
 Thanks for the info. But what about cache? Will it take more memory for 100 
 fields schema with the same amount of data?

For normal searches, not really.

-Yonik


  - Original Message 
  From: Yonik Seeley [EMAIL PROTECTED]
  To: solr-user@lucene.apache.org
  Sent: Thursday, March 20, 2008 3:48:28 PM
  Subject: Re: Does emty fields affect index size?

  Make sure you omit norms for those fields if possible.  If you do
  that, the index should only be marginally bigger.

  -Yonik


  On Thu, Mar 20, 2008 at 3:20 PM, Evgeniy Strokin
  [EMAIL PROTECTED] wrote:
   Hello, lets say I have 10 fields and usually some 5 of them are present in 
 each document. And the size of my index is 100Mb.
I want to change my schema and I'll have 100 fields, but each document 
 will still have only 5 fields present.
After I reindex my data, will the size be affected? Could you guess how 
 big the increase will be?
Any related information, suggestions will be helpful as well.
  
Thanks in advance,
Eugene


Re: Does emty fields affect index size?

2008-03-20 Thread Evgeniy Strokin
This is what I found in the docs:
 
Omitting norms is useful for saving memory on Fields that do not affect 
scoring, such as those used for calculating facets.
 
I don't really understand the statement, but does it mean I cannot use those 
fields as facet fields? Because this is exactly why I need those 100 fields.
 


- Original Message 
From: Yonik Seeley [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thursday, March 20, 2008 3:48:28 PM
Subject: Re: Does emty fields affect index size?

Make sure you omit norms for those fields if possible.  If you do
that, the index should only be marginally bigger.

-Yonik


On Thu, Mar 20, 2008 at 3:20 PM, Evgeniy Strokin
[EMAIL PROTECTED] wrote:
 Hello, lets say I have 10 fields and usually some 5 of them are present in 
 each document. And the size of my index is 100Mb.
  I want to change my schema and I'll have 100 fields, but each document will 
 still have only 5 fields present.
  After I reindex my data, will the size be affected? Could you guess how big 
 the increase will be?
  Any related information, suggestions will be helpful as well.

  Thanks in advance,
  Eugene

Re: Does emty fields affect index size?

2008-03-20 Thread Yonik Seeley
On Thu, Mar 20, 2008 at 4:46 PM, Evgeniy Strokin
[EMAIL PROTECTED] wrote:
 This is I found in docs:

  Omitting norms is useful for saving memory on Fields that do not affect 
 scoring, such as those used for calculating facets.

  I don't really understand the statement, but does it mean I cannot use those 
 fields as facet fields, because this is exactly why I need those 100 fields.

It just means that the norm has been omitted (which is 1 byte per doc
in the complete index).  The norm is just used for length
normalization and index-time boosting.  You can still search and facet
a field that has norms omitted.  Norms are only recommended for better
relevance for big text fields.

-Yonik
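
That one byte per document applies to every indexed field that keeps norms,
whether or not a given document has a value in it, which is why omitting
norms matters for a sparse 100-field schema. A hedged back-of-the-envelope,
with every number an assumption:

public class NormsOverheadEstimate {
    public static void main(String[] args) {
        long docs = 1000000L;        // assumed document count
        int fieldsWithNorms = 100;   // the proposed 100-field schema
        long normsBytes = docs * fieldsWithNorms;   // one byte per doc per field with norms
        System.out.println("norms overhead: " + normsBytes / (1024 * 1024) + " MB");
    }
}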


Re: highlighting pt2: returning tokens out of order from PhraseQuery

2008-03-20 Thread Erik Hatcher

On Mar 20, 2008, at 4:13 PM, Brian Whitman wrote:


Unfortunately not with the current highlighter.  But there has  
been a great deal of work towards fixing this here:  http:// 
issues.apache.org/jira/browse/LUCENE-794




ah, thanks Eric, didn't think to check w/ the lucene folks.
I see they have somewhat working patches -- does this kind of stuff  
port over easy to solr?


If I had replied a bit earlier today the answer would have been  
different, but I see that Mike has just committed the SOLR-386 patch  
today which makes highlighters pluggable, so it shouldn't be too  
terrible to wire it in.


Erik



Re: what's up with: java -Ddata=args -jar post.jar optimize/

2008-03-20 Thread John

Thanks Yonik. Now that I understand it ... I'm not worried about it. :)



-JM


-Original Message-
From: Yonik Seeley [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thu, 20 Mar 2008 11:19 am
Subject: Re: what's up with: java -Ddata=args -jar post.jar optimize/




On Thu, Mar 20, 2008 at 10:55 AM, John [EMAIL PROTECTED] wrote:
  Yep, I'm on Windows ... so if it can't delete the old files, shouldn't a 
restart of Solr do the trick?? i.e. the files are no longer locked by Windows 
... so they can now be deleted when Solr exits ... I tried it and didn't see 
any 
change.

  Who is keeping those files around / locked ... Solr or Lucene?? and what is 
going on with the second call to optimize that's able to really delete those 
old files where the first optimize couldn't?


The IndexWriter cleans up old unreferenced files periodically... so as
you continue to add to the index, those files will be removed (maybe
on a segment merge, definitely on another commit).
As I said... don't worry about it, they will get cleaned up sooner or
later (unless you are never going to change the index again after you
build it).

-Yonik



cannot start solr after adding Analyzer, ClassCastException error

2008-03-20 Thread xunzhang huang
Hi, everyone

After I add an Analyzer to Solr, there is a ClassCastException and Solr
cannot be started. The details are:

environment: solr 1.2, jdk 1.6.03, ubuntu linux 7.10, and a chinese analyzer

I add some lines in schema.xml:

<fieldtype name="text_chinese" class="solr.TextField">
  <analyzer class="net.paoding.analysis.analyzer.PaodingAnalyzer"/>
</fieldtype>

I tried some different analyzers, but the same exception happened, so I think
it is Solr's problem or something is wrong with my configuration.

Any ideas?

the error message is:

org.apache.solr.core.SolrException: Schema Parsing Failed
        at org.apache.solr.schema.IndexSchema.readConfig(IndexSchema.java:556)
        at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:71)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:196)
        at org.apache.solr.core.SolrCore.getSolrCore(SolrCore.java:177)
        at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
        at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
        at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
        at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
        at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
        at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
        at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
        at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
        at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
        at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
        at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
        at org.mortbay.jetty.Server.doStart(Server.java:210)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
        at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.mortbay.start.Main.invokeMain(Main.java:183)
        at org.mortbay.start.Main.start(Main.java:497)
        at org.mortbay.start.Main.main(Main.java:115)
Caused by: java.lang.ClassCastException: net.paoding.analysis.analyzer.PaodingAnalyzer cannot be cast to org.apache.lucene.analysis.Analyzer
        at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:583)
        at org.apache.solr.schema.IndexSchema.readConfig(IndexSchema.java:331)
        ... 28 more