Re: question about fl=score
My customer wants to get the 1st-10010th added docs, so I have to sort by timestamp to get the top 10010 docs' timestamps...

2008/3/20, Walter Underwood [EMAIL PROTECTED]:
> Why do you want the 10,000th most relevant result? That seems very, very odd. Most people need the most relevant result. Maybe the ten most relevant results. "I'm searching for the movie 'Ratatouille', but please give me the 10,001st result instead of that movie." If you explain your desire, we may have a better approach.
>
> wunder
> == Search Guy, Netflix
>
> On 3/19/08 10:43 PM, 李银松 [EMAIL PROTECTED] wrote:
>> I am not getting 10,000 records; I am getting records 1-10010. So I need the top 10010 records' *sort field* to merge and get the final results, just like distributed search. The data to transport is about 500k (10,000 docs' scores) and the QTime is about 100ms, but the total time I spent is about 10+ seconds. I want to know whether it really costs that much time or whether something else is wrong.
>>
>> 2008/3/20, Walter Underwood [EMAIL PROTECTED]:
>>> Getting 10,000 records will be slow. What are you doing with 10,000 records?
>>>
>>> wunder
>>>
>>> On 3/19/08 10:07 PM, 李银松 [EMAIL PROTECTED] wrote:
>>>> I want to get the top 1-10010 records from two different servers, so I have to get the top 10010 scores from each server and merge them to get the results. I found the cost was mostly in XMLResponseParser while parsing the input stream. I wonder whether the time went to network transport or to Solr preparing the response. Or is something just wrong with my server?
>>>>
>>>> On 2008-3-20, Yonik Seeley [EMAIL PROTECTED] wrote:
>>>>> 2008/3/19 李银松 [EMAIL PROTECTED]:
>>>>>> 1. When I set fl=score, Solr returns just as if fl=*,score, not just the scores. Is it a bug, or is it done on purpose?
>>>>>
>>>>> On purpose... a score alone with no other context doesn't seem useful.
>>>>>
>>>>>> 2. I'm using solrj to get about 10,000 docs' scores over a LAN. It costs me about 10+ seconds the first time (QTime is less than 100ms), but 1-2 seconds the second time with the same query string. That seems a bit too long for the first time (the total size of the docs to transport is about 500k). Is there anything I can do about it?
>>>>>
>>>>> What are you trying to do with that many scores? Search engines are optimized more for retrieving the top n matches (where n is ~10 - 100)
>>>>>
>>>>> -Yonik
Re: question about fl=score
2008/3/20 李银松 [EMAIL PROTECTED]:
> 1. When I set fl=score, Solr returns just as if fl=*,score, not just the scores. Is it a bug, or is it done on purpose?

You can set fl=id,score; Solr does not support a style like fl=score on its own.

> My customer wants to get the 1st-10010th added docs, so I have to sort by timestamp to get the top 10010 docs' timestamps...

limit 1, 10010 order by timestamp?

--
regards
j.L
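To make the thread concrete: the merge-by-sort-field approach 李银松 describes looks roughly like this in solrj. This is a hedged sketch, not code from the thread - the class and method names (CommonsHttpSolrServer, addSortField) match 2008-era solrj nightlies but may differ in your build, and the timestamp field name is an assumption:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class TopNBySortField {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery q = new SolrQuery("*:*");
            q.setFields("id", "timestamp", "score"); // fl=score alone is not supported
            q.addSortField("timestamp", SolrQuery.ORDER.asc);
            q.setStart(0);
            q.setRows(10010); // deep windows like this are slow by design

            QueryResponse rsp = server.query(q);
            System.out.println("QTime=" + rsp.getQTime()
                + ", numFound=" + rsp.getResults().getNumFound());
        }
    }

As the thread notes, most of the wall-clock time for a request this size goes into transporting and parsing the 10,010-document XML response, not into the search itself (QTime ~100ms).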
Re: Help Requested
On Wed, 19 Mar 2008 21:22:42 -0700 (PDT) Raghav Kapoor [EMAIL PROTECTED] wrote:
> I am new to Solr and I am trying to figure out whether Solr can be helpful in a project that I'm working on.

welcome :)

> The project is a client/server app that requires a client app to index the documents and send the results in RDF to the server. The client needs to be smart enough to know when a new document has been added to a specified folder, index it, and send the results in RDF/XML to the server. The server will be a web service which will parse the XML and store the metadata in a database. The search will be conducted on the server and will return results from the database, which will be links to the documents on the client.

Any particular reason why you need the server in this situation? Pretty much everything you are doing can be done locally - except, probably, cross-linking between clients' documents. I have no idea in what kind of environment this app is supposed to run (home? office LAN? the interweb :P ?).

> The client, which is also running a webserver, will take the request when the user clicks on the link to the document residing on the client.

You don't need a webserver for this: just generate a page with file:// links, and all you need is to render it locally.

> I believe lucene will be useful in this scenario and solr can be used as a web app. I would like to get any input on this architecture and would request any pointers if there is any app already doing something similar and how lucene/solr can be useful in this case.

There are plenty of desktop document indexers using Lucene in some form or another, and other indexing technologies. I don't know if any use Solr - yet. And I know of a few apps out there that do something similar to what you describe, though with a different design as the goals are somewhat different.

B
_
{Beto|Norberto|Numard} Meijome

"Some cause happiness wherever they go; others, whenever they go." Oscar Wilde

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: RAM Based Index for Solr
On Wed, 19 Mar 2008 17:04:34 -0700 (PDT) swarag [EMAIL PROTECTED] wrote:
> In Lucene there is a RAM-based index, org.apache.lucene.store.RAMDirectory. Is there a way to set up my index in Solr to use a RAMDirectory?

Create a mountpoint on a ramdrive (tmpfs in Linux, I think) and put your index in there...? Or does Lucene do anything other than that?

B
_
{Beto|Norberto|Numard} Meijome

"Unix is very simple, but it takes a genius to understand the simplicity." Dennis Ritchie

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
RAM size
Hi all,

Is there a way (or formula) to determine the required amount of RAM, e.g. by number of documents and document size? I need to index about 15,000,000 documents; each document is 1 to 3KB, and only the id of the document will be stored.

I've just implemented a test case on one of our older servers (only 512MB RAM) with 4,000,000 documents. Searching the index is quite fast, but when I try to sort the results, I get the well-known OutOfMemory error. I'm aware of the fact that 512MB is way too little, but I'm trying to determine what size will be needed for 15,000,000 documents.

Thanks in advance.

--
Kind regards,

Geert Van Huychem
Project Leader
Mediargus NV
tel +32 2 741 60 22
fax +32 2 740 09 71
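No formula came back on the list, but a back-of-envelope bound for the sort-induced memory is possible from how Lucene sorts: sorting on a string field builds a FieldCache entry of roughly one int ordinal per document plus one String per unique term. The figures below are illustrative assumptions, not measurements:

    // Rough sort-memory estimate for one string sort field (assumptions only).
    // Lucene's FieldCache holds ~ int[maxDoc] (4 bytes/doc) plus the unique terms.
    public class SortRamEstimate {
        public static void main(String[] args) {
            long maxDoc = 15000000L;       // target corpus size from the question
            long ordBytes = maxDoc * 4L;   // ordinal array: ~60 MB
            long uniqueTerms = 1000000L;   // assumed unique values in the sort field
            long bytesPerTerm = 40L;       // assumed String + object overhead
            long total = ordBytes + uniqueTerms * bytesPerTerm;
            System.out.println("~" + total / (1024 * 1024) + " MB per sort field");
        }
    }

On top of that come the usual index structures and any Solr caches, which is why 512MB fails at 4,000,000 documents as soon as sorting kicks in.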
RE: Language support
You may be interested in a recent discussion that took place on a similar subject: http://www.mail-archive.com/solr-user@lucene.apache.org/msg09332.html

Nicolas

-----Original message-----
From: David King [mailto:[EMAIL PROTECTED]]
Sent: Wednesday 19 March 2008 20:07
To: solr-user@lucene.apache.org
Subject: Language support

This has probably been asked before, but I'm having trouble finding it. Basically, we want to be able to search for content across several languages, given that we know what language a datum and a query are in. Is there an obvious way to do this? Here's the longer version:

I am trying to index content that occurs in multiple languages, including Asian languages. I'm in the process of moving from PyLucene to Solr. In PyLucene, I would have a list of analysers:

    analyzers = dict(en = pyluc.SnowballAnalyzer("English"),
                     cs = pyluc.CzechAnalyzer(),
                     pt = pyluc.SnowballAnalyzer("Portuguese"),
                     ...

Then when I want to index something, I do

    writer = pyluc.IndexWriter(store, analyzer, create)
    writer.addDocument(d.doc)

That is, I tell Lucene the language of every datum, and the analyser to use when writing out the field. Then when I want to search against it, I do

    analyzer = LanguageAnalyzer.getanal(lang)
    q = pyluc.QueryParser(field, analyzer).parse(value)

And use that QueryParser to parse the query in the given language before sending it off to PyLucene. (Off-topic: getanal() is perhaps my favourite function-name ever.)

So the language of a given datum is attached to the datum itself. In Solr, however, this appears to be attached to the field, not to the individual data in it:

    <fieldType name="text_greek" class="solr.TextField">
      <analyzer class="org.apache.lucene.analysis.el.GreekAnalyzer"/>
    </fieldType>

Does this mean that there's no way to have a single contents field that has content in multiple languages, and still have the queries be parsed and stemmed correctly? How are other people handling this? Does it make sense to write a tokeniser factory and a query factory that look at, say, the 'lang' field and return the correct tokenisers? Does this already exist?

The other alternative is to have a text_zh field, a text_en field, etc, and to modify the query to search on that field depending on the language of the query, but that seems kind of hacky to me, especially if a query may be against more than one language. Is this the accepted way to go about it? Is there a benefit to this method over writing a detecting tokeniser factory?
Re: RAM Based Index for Solr
There is currently no way to use RAMDirectory instead of FSDirectory in Solr, but there is a feature request to implement this. I personally think it would be great, because we could use Terracotta to handle the clustering.

Jeryl Cook

On Thu, Mar 20, 2008 at 1:07 AM, Norberto Meijome [EMAIL PROTECTED] wrote:
> On Wed, 19 Mar 2008 17:04:34 -0700 (PDT) swarag [EMAIL PROTECTED] wrote:
>> In Lucene there is a RAM-based index, org.apache.lucene.store.RAMDirectory. Is there a way to set up my index in Solr to use a RAMDirectory?
>
> Create a mountpoint on a ramdrive (tmpfs in Linux, I think) and put your index in there...? Or does Lucene do anything other than that?

--
Jeryl Cook
/^\ Pharaoh /^\
http://pharaohofkush.blogspot.com/
"..Act your age, and not your shoe size.." -Prince(1986)
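Since Solr exposes no configuration hook for this yet, the difference is easiest to see at the Lucene level. A minimal sketch assuming the Lucene 2.3-era API (the index path is hypothetical); it copies an on-disk index into the JVM heap, which is roughly what a tmpfs mount buys you without consuming heap:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.RAMDirectory;

    public class RamIndexSketch {
        public static void main(String[] args) throws Exception {
            // Load an existing on-disk index entirely into memory.
            RAMDirectory ram = new RAMDirectory(
                FSDirectory.getDirectory("/path/to/solr/data/index"));
            IndexReader reader = IndexReader.open(ram);
            System.out.println("docs in RAM-resident index: " + reader.numDocs());
            reader.close();
        }
    }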
Re: what's up with: java -Ddata=args -jar post.jar &lt;optimize/&gt;
What messages do you see in your log file?

Bill

On Wed, Mar 19, 2008 at 3:15 PM, [EMAIL PROTECTED] wrote:
> Hi, I'm a new Solr user. I figured my way around Solr just fine (I think)... I can index and search etc., and so far I have indexed over 300k documents. What I can't figure out is the following. I'm using:
>
>     java -Ddata=args -jar post.jar "<optimize/>"
>
> to post an optimize command. What I'm finding is that I have to do it twice in order for the files to be optimized, i.e.: the first post takes 3-4 minutes but leaves the file count as is at 44... the second post takes 2-3 seconds but shrinks the file count from 44 to 8. So my question is the following: is this the expected behavior, or am I doing something wrong? Do I need two optimize posts to really optimize my index?!
>
> Thanks in advance
> -JM
Re: Faceting Problem
When faced with these sorts of issues, it is worthwhile to step back and experiment with Solr's analysis page: http://localhost:8983/solr/admin/analysis.jsp

Select your field type either by name of field or by type, put in some text, and see what happens to it at both indexing and querying time.

Erik

On Mar 19, 2008, at 10:08 AM, Tejaswi_Haramurali wrote:
> Hi,
>
> I am facing a problem in using solrj. I am using Java (solrj) to index as well as search data in the Solr search engine. This is some of the code:
>
>     exer.setField("name", "DOC" + identity);
>     exer.setField("features", "The Mellon Foundation");
>     exer.setField("language", langmap.get("008lang"));
>     exer.setField("date", datemap.get("008date"));
>     exer.setField("format", formatmap.get("formats"));
>
> The problem is, when I do a search on 'Mellon' or any word associated with the 'features' field, I get results. However, when I do a search on any of the other fields, I don't get results. I have ensured that indexed=true in schema.xml for all these fields and have also tried displaying the values I am indexing. I don't know what mistake I am committing. I would be glad if someone could help me on this.
>
> Tejaswi
> --
> View this message in context: http://www.nabble.com/Faceting-Problem-tp16144141p16144141.html
> Sent from the Solr - User mailing list archive at Nabble.com.
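One more quick check besides the analysis page: with the standard request handler a bare term only searches the default field from schema.xml, so the other fields must be queried with a field: prefix. A hedged solrj sketch (the server URL and the language:eng value are made-up examples, not from the thread):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class FieldQueryCheck {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");
            // A bare term hits only the schema's default search field:
            long a = server.query(new SolrQuery("Mellon"))
                           .getResults().getNumFound();
            // Prefixing the field name searches that field explicitly:
            long b = server.query(new SolrQuery("language:eng"))
                           .getResults().getNumFound();
            System.out.println("default field: " + a + ", language field: " + b);
        }
    }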
Re: what's up with: java -Ddata=args -jar post.jar &lt;optimize/&gt;
Thanks Bill!! Here is the content of the log file (I restarted Solr so we have a clean log):

127.0.0.1 - - [20/03/2008:13:38:09 +0000] GET /solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on HTTP/1.1 200 2538
127.0.0.1 - - [20/03/2008:13:38:31 +0000] GET /solr/admin/logging.jsp HTTP/1.1 200 138
127.0.0.1 - - [20/03/2008:13:38:33 +0000] GET /solr/admin/logging.xsl HTTP/1.1 304 0
127.0.0.1 - - [20/03/2008:13:38:33 +0000] GET /solr/admin/meta.xsl HTTP/1.1 304 0
127.0.0.1 - - [20/03/2008:13:38:36 +0000] GET /solr/admin/action.jsp?log=ALL HTTP/1.1 200 901
127.0.0.1 - - [20/03/2008:13:38:55 +0000] GET /solr/admin/ HTTP/1.1 200 3818
127.0.0.1 - - [20/03/2008:13:38:59 +0000] GET /solr/select/?q=solr&version=2.2&start=0&rows=10&indent=on HTTP/1.1 200 2161
127.0.0.1 - - [20/03/2008:13:39:01 +0000] GET /solr/select/?q=solr&version=2.2&start=0&rows=10&indent=on HTTP/1.1 200 2159
127.0.0.1 - - [20/03/2008:13:39:17 +0000] GET /solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on HTTP/1.1 200 2536
127.0.0.1 - - [20/03/2008:13:39:44 +0000] POST /solr/update HTTP/1.1 200 152
127.0.0.1 - - [20/03/2008:13:43:32 +0000] POST /solr/update HTTP/1.1 200 149
127.0.0.1 - - [20/03/2008:13:44:16 +0000] GET /solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on HTTP/1.1 200 2537
127.0.0.1 - - [20/03/2008:13:44:17 +0000] GET /solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on HTTP/1.1 200 2536
127.0.0.1 - - [20/03/2008:13:44:18 +0000] GET /solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on HTTP/1.1 200 2536
127.0.0.1 - - [20/03/2008:13:44:26 +0000] POST /solr/update HTTP/1.1 200 149
127.0.0.1 - - [20/03/2008:13:44:27 +0000] POST /solr/update HTTP/1.1 200 149
127.0.0.1 - - [20/03/2008:13:44:51 +0000] GET /solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on HTTP/1.1 200 2536
127.0.0.1 - - [20/03/2008:13:44:51 +0000] GET /solr/select/?q=*%3A*&version=2.2&start=0&rows=10&indent=on HTTP/1.1 200 2536

The two POSTs are the result of issuing: java -Ddata=args -jar post.jar "<optimize/>"

Just before I issued the first POST, there were 71 files in the index (total size ~1.4GB)... after the first POST, there were 20 files (total size ~2.7GB)... after the second POST, there were 8 files (total size ~1.3GB).

The increase in the index size - from 1.4GB to 2.7GB - as well as the three different file counts is something I have not observed before! In all of my previous experiments, on the first POST the index size increased slightly, but the file count never (I think) went up! Who can explain to me what's going on here?! I'm using Solr 1.2... the only change I have made is adding new fields to support my data type.

-JM

-----Original Message-----
From: Bill Au [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thu, 20 Mar 2008 8:58 am
Subject: Re: what's up with: java -Ddata=args -jar post.jar <optimize/>

What messages do you see in your log file?

Bill

On Wed, Mar 19, 2008 at 3:15 PM, [EMAIL PROTECTED] wrote:
> Hi, I'm a new Solr user. [...] Do I need two optimize posts to really optimize my index?!
>
> Thanks in advance
> -JM
Re: what's up with: java -Ddata=args -jar post.jar &lt;optimize/&gt;
On Wed, Mar 19, 2008 at 3:15 PM, [EMAIL PROTECTED] wrote:
> What I'm finding is that I have to do it twice in order for the files to be optimized... i.e.: the first post takes 3-4 minutes but leaves the file count as is at 44... the second post takes 2-3 seconds but shrinks the file count from 44 to 8.

Let me guess, are you on Windows? This is actually expected behavior. The first optimize actually does optimize the whole index. When the optimize finishes, it can't delete the old files because they are still in use by the current IndexSearcher. If you were on UNIX, the files would be deleted sooner (or at least look like they were).

In short, Solr + Lucene are doing the right thing... just optimize once, and don't worry about it.

-Yonik
Re: what's up with: java -Ddata=args -jar post.jar &lt;optimize/&gt;
Thanks Yonik!!

Yep, I'm on Windows... so if it can't delete the old files, shouldn't a restart of Solr do the trick? I.e. the files are no longer locked by Windows, so they can be deleted when Solr exits... I tried it and didn't see any change. Who is keeping those files around / locked - Solr or Lucene? And what is going on with the second call to optimize that's able to really delete those old files where the first optimize couldn't?

-JM

-----Original Message-----
From: Yonik Seeley [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thu, 20 Mar 2008 10:13 am
Subject: Re: what's up with: java -Ddata=args -jar post.jar <optimize/>

[...]

In short, Solr + Lucene are doing the right thing... just optimize once, and don't worry about it.

-Yonik
Re: what's up with: java -Ddata=args -jar post.jar &lt;optimize/&gt;
On Thu, Mar 20, 2008 at 10:55 AM, John [EMAIL PROTECTED] wrote:
> Yep, I'm on Windows... so if it can't delete the old files, shouldn't a restart of Solr do the trick? I.e. the files are no longer locked by Windows, so they can be deleted when Solr exits... I tried it and didn't see any change. Who is keeping those files around / locked - Solr or Lucene? And what is going on with the second call to optimize that's able to really delete those old files where the first optimize couldn't?

The IndexWriter cleans up old unreferenced files periodically... so as you continue to add to the index, those files will be removed (maybe on a segment merge, definitely on another commit). As I said, don't worry about it; they will get cleaned up sooner or later (unless you are never going to change the index again after you build it).

-Yonik
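For reference, post.jar here just sends one XML command to the update handler; a single optimize request is enough. A minimal equivalent using only the JDK (the URL assumes the example Jetty setup):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class Optimize {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://localhost:8983/solr/update");
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setRequestMethod("POST");
            con.setDoOutput(true);
            con.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
            OutputStream out = con.getOutputStream();
            out.write("<optimize/>".getBytes("UTF-8")); // send it once; old
            out.close();                                // files are cleaned up later
            System.out.println("HTTP " + con.getResponseCode());
            con.disconnect();
        }
    }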
Re: Help Requested
Thanks Norberto!

> Any particular reason why you need the server in this situation? Pretty much everything you are doing can be done locally - except, probably, cross-linking between clients' documents. I have no idea in what kind of environment this app is supposed to run (home? office LAN? the interweb :P ?).

So it's going to be a client/server app where all the documents will be stored on the client and only metadata of those docs will be sent to the server. That way the server does not have to store any real documents. It's an internet-based application. Search on the server will read the metadata for keywords and send the request to all the clients that contain documents with that keyword. We cannot store everything on one client; all clients are different machines distributed all over the world.

> You don't need a webserver for this: just generate a page with file:// links, and all you need is to render it locally.

How will the client serve the documents stored locally through a standard mechanism (like port 80) to send documents to the server when the server requests them? The client will not open any special ports for the server, so we need the web server, I guess?

> There are plenty of desktop document indexers using Lucene in some form or another, and other indexing technologies. I don't know if any use Solr - yet. And I know of a few apps out there that do something similar to what you describe, though with a different design as the goals are somewhat different.

The client application needs to give the users an ability to handle metadata of the documents that will be sent to the server so that efficient searching can be conducted, so I assume we need a web app like Solr (or create our own using Lucene). Let me know your thoughts, and thanks again for your reply!

Regards
Raghav.

--- Norberto Meijome [EMAIL PROTECTED] wrote:
> [...]
Re: Language support
> You may be interested in a recent discussion that took place on a similar subject: http://www.mail-archive.com/solr-user@lucene.apache.org/msg09332.html

Interesting, yes. But since it doesn't actually exist, it's not much help. I guess what I'm asking is: if my approach seems convoluted, I'm probably doing it wrong, so how *are* people solving the problem of searching over multiple languages? What is the canonical way to do this?

> Nicolas
>
> -----Original message-----
> From: David King [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday 19 March 2008 20:07
> To: solr-user@lucene.apache.org
> Subject: Language support
>
> [...]
Re: Language support
Unless you can come up with language-neutral tokenization and stemming, you need to:

a) know the language of each document.
b) run a different analyzer depending on the language.
c) force the user to tell you the language of the query.
d) run the query through the same analyzer.

On Thu, Mar 20, 2008 at 12:17 PM, David King [EMAIL PROTECTED] wrote:
> Interesting, yes. But since it doesn't actually exist, it's not much help. I guess what I'm asking is: if my approach seems convoluted, I'm probably doing it wrong, so how *are* people solving the problem of searching over multiple languages? What is the canonical way to do this?
>
> [...]
Re: Language support
> Unless you can come up with language-neutral tokenization and stemming, you need to:
>
> a) know the language of each document.
> b) run a different analyzer depending on the language.
> c) force the user to tell you the language of the query.
> d) run the query through the same analyzer.

I can do all of those. This implies storing all of the different languages in different fields, right? Then changing the default search field to the language of the query for every query?
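The per-language-field variant David describes doesn't require touching the schema's default field at query time - the field can be chosen per query. A hedged solrj-style sketch (the field names and language codes are illustrative assumptions):

    import org.apache.solr.client.solrj.SolrQuery;

    public class LangRoutedQuery {
        // Map a known query language to its language-specific field.
        static String fieldFor(String lang) {
            if ("zh".equals(lang)) return "text_zh";
            if ("cs".equals(lang)) return "text_cs";
            return "text_en"; // fallback
        }

        public static void main(String[] args) {
            String lang = "en";
            String userQuery = "ratatouille";
            SolrQuery q = new SolrQuery(fieldFor(lang) + ":(" + userQuery + ")");
            System.out.println(q.getQuery()); // text_en:(ratatouille)
        }
    }

A multi-language query can OR several such field clauses together, at the cost of the cross-language IDF issues discussed later in the thread.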
Re: Language support
You can store in one field if you manage to hide a language code with the text. XML is overkill but effective for this. At one point, we'd investigated how to allow a Lucene analyzer to see more than one field (the language code as well as the text), but I don't think we came up with anything.

On Thu, Mar 20, 2008 at 12:39 PM, David King [EMAIL PROTECTED] wrote:
>> Unless you can come up with language-neutral tokenization and stemming, you need to:
>>
>> a) know the language of each document.
>> b) run a different analyzer depending on the language.
>> c) force the user to tell you the language of the query.
>> d) run the query through the same analyzer.
>
> I can do all of those. This implies storing all of the different languages in different fields, right? Then changing the default search field to the language of the query for every query?
>
> [...]
Re: Language support
Token-by-token seems a bit extreme. Are you concerned with macaronic documents?

On Thu, Mar 20, 2008 at 12:42 PM, Walter Underwood [EMAIL PROTECTED] wrote:
> Nice list. You may still need to mark the language of each document. There are plenty of cross-language collisions: "die" and "boot" have different meanings in German and English. Proper nouns (Laserjet) may be the same in all languages - a different problem if you are trying to get answers in one language.
>
> At one point, I considered using Unicode language tagging on each token to keep it all straight. Effectively, index de/Boot or en/Laserjet.
>
> wunder
>
> On 3/20/08 9:20 AM, Benson Margulies [EMAIL PROTECTED] wrote:
>> Unless you can come up with language-neutral tokenization and stemming, you need to:
>>
>> a) know the language of each document.
>> b) run a different analyzer depending on the language.
>> c) force the user to tell you the language of the query.
>> d) run the query through the same analyzer.
Re: Language support
Extreme, but guaranteed to work, and it avoids bad IDF when there are inter-language collisions. In Ultraseek, we only stored the hash, so the size of the source token didn't matter.

Trademarks are a bad source of collisions and anomalous IDF. If you have LaserJet support docs in 20 languages, the term "LaserJet" will have a document frequency 20X higher than the terms in a single language and will score too low.

Ultraseek handles macaronic documents when the script makes it possible; for example, Roman-script text is sent to the English stemmer in a Japanese document, and Hangul always goes to the Korean segmenter/stemmer.

A simpler approach is to tag each document with a language, like lang:de, then use a filter query to restrict the documents to the query language.

Per-token tagging still strikes me as the right approach. It makes all sorts of things work, like keeping fuzzy matches within the same language. We didn't do it in Ultraseek because it would have been an incompatible index change and the benefit didn't justify that.

wunder
==
Walter Underwood
Former Ultraseek Architect
Current Entire Netflix Search Department

On 3/20/08 9:45 AM, Benson Margulies [EMAIL PROTECTED] wrote:
> Token-by-token seems a bit extreme. Are you concerned with macaronic documents?
>
> [...]
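Walter's simpler per-document tagging maps directly onto a Solr filter query. A hedged solrj sketch - it assumes each document was indexed with a lang field, which is not something the stock schema provides:

    import org.apache.solr.client.solrj.SolrQuery;

    public class LangFilteredQuery {
        public static void main(String[] args) {
            SolrQuery q = new SolrQuery("boot");
            // Restrict matches to German documents so "Boot" (de) never
            // collides with "boot" (en); assumes a lang field per document.
            q.addFilterQuery("lang:de");
            System.out.println(q.getQuery() + " fq=" + q.getFilterQueries()[0]);
        }
    }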
Re: Language support
Oh, Walter! Hello! I thought that name was familiar. Greetings from Basis. All that makes sense.

On Thu, Mar 20, 2008 at 1:00 PM, Walter Underwood [EMAIL PROTECTED] wrote:
> Extreme, but guaranteed to work, and it avoids bad IDF when there are inter-language collisions. In Ultraseek, we only stored the hash, so the size of the source token didn't matter.
>
> [...]
Re: FunctionQuery in a custom request handler
Hi again, digging this one up. This is the code I've used in my handler:

    ReciprocalFloatFunction tb_valuesource;
    tb_valuesource = new ReciprocalFloatFunction(
        new ReverseOrdFieldSource(TIMEBIAS_FIELD), m, a, b);
    FunctionQuery timebias = new FunctionQuery(tb_valuesource);

    // adding to main query
    BooleanQuery main = new BooleanQuery();
    other_queries.setBoost(BOOST_OTHER_QUERIES);
    main.add(other_queries, BooleanClause.Occur.SHOULD);
    timebias.setBoost(BOOST_TIMEBIAS);
    main.add(timebias, BooleanClause.Occur.SHOULD);

It worked, but the problem is that I fail to get a decent ratio between my other_queries and timebias. I would like to keep timebias at ~15% max (for totally fresh docs), dropping to nothing at ~one week old. Adding to a BooleanQuery sums the subquery scores, so I guess there's no way of controlling the ratio, right?

What I tried to do is to use multiplication:

    // this part stays the same
    ReciprocalFloatFunction tb_valuesource;
    tb_valuesource = new ReciprocalFloatFunction(
        new ReverseOrdFieldSource(TIMEBIAS_FIELD), m, a, b);
    FunctionQuery timebias = new FunctionQuery(tb_valuesource);

    ConstValueSource tb_const = new ConstValueSource(1.0f);
    ValueSource[] tb_summa_arr = {tb_const, tb_valuesource};
    SumFloatFunction tb_summa = new SumFloatFunction(tb_summa_arr);

    QueryValueSource query_vs = new QueryValueSource(query, DEF_VAL);
    ValueSource[] vs_arr = {query_vs, tb_summa};
    ProductFloatFunction pff = new ProductFloatFunction(vs_arr);
    FunctionQuery THE_QUERY = new FunctionQuery(pff);

    docs.docList = searcher.getDocList(THE_QUERY, filters, null, start, rows, flags);

(All of the float tweakish values are of course foo.) The problem is that this crashes at the last line with:

    Mar 20, 2008 6:59:57 PM org.apache.solr.common.SolrException log
    SEVERE: java.lang.NullPointerException
        at si.david.MyRequestHandler.handleRequestBody(Unknown Source)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:117)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:815)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:206)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:174)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874)
        at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
        at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
        at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
        at java.lang.Thread.run(Thread.java:619)

I'm using a nightly build from late November. Any ideas? Does what I am doing make any sense? Is there any other way to accomplish what I'm trying to do? I'm kind of lost here - thanks for the info.

D.
hossman wrote:
> : How do I access the ValueSource for my DateField? I'd like to use a
> : ReciprocalFloatFunction from inside the code, adding it aside others in the
> : main BooleanQuery.
>
> The FieldType API provides a getValueSource method (so every FieldType picks its own best ValueSource implementation).
>
> -Hoss

--
View this message in context: http://www.nabble.com/FunctionQuery-in-a-custom-request-handler-tp14838957p16186230.html
Sent from the Solr - User mailing list archive at Nabble.com.
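For later readers: the function classes used above are the ones Solr ships, so the NPE most likely comes from a null reference inside the custom handler (e.g. an unparsed query) rather than from the multiplicative combination itself. A hedged re-sketch of that combination with the most likely null guarded - the field name and constants are placeholders, and the import package is assumed from the nightly in question:

    import org.apache.lucene.search.Query;
    import org.apache.solr.search.function.*;

    public class TimeBiasSketch {
        /** score * (1 + recency): fresh docs gain at most the recency fraction. */
        public static Query timeBiased(Query userQuery) {
            if (userQuery == null) {
                throw new IllegalArgumentException("user query was not parsed");
            }
            ValueSource recency = new ReciprocalFloatFunction(
                new ReverseOrdFieldSource("timestamp"), 1.0f, 1.0f, 1.0f);
            ValueSource combined = new ProductFloatFunction(new ValueSource[] {
                new QueryValueSource(userQuery, 0.0f),
                new SumFloatFunction(new ValueSource[] {
                    new ConstValueSource(1.0f), recency })
            });
            return new FunctionQuery(combined);
        }
    }

Tuning the ReciprocalFloatFunction constants controls how large the recency fraction gets and how quickly it decays toward zero for week-old documents.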
Re: Quoted searches
: When I issue a search in quotes, like "tay sachs"
: lucene is returning results as if it were written: tay OR sachs
:
: If you are using the standard request handler, the default operator is
: OR (I assume you didn't use quotes in your query). You can switch the

But Justin said "When I issue a search in quotes"... so it really should have been a phrase query requiring both terms.

Justin: can you add debugQuery=true to your request, and then let us know what the parsedquery_toString and score explanation info looks like?

-Hoss
Preferential boosting
Suppose I have a schema with an integer field called 'duration'. I want to find all records, but if the duration is 3 I want those records to be boosted. The index has 10 records, with duration between 2 and 4. What is the query that will find all of the records and place the records with duration 3 above the others?

These do not work (at least for me):

    *:* OR duration:3^2.0
    duration:[* TO *] duration:3^2.0
    duration:3^2.0 OR -duration:3

Thanks,
Lance Norskog
Re: Do empty fields affect index size?
Make sure you omit norms for those fields if possible. If you do that, the index should only be marginally bigger.

-Yonik

On Thu, Mar 20, 2008 at 3:20 PM, Evgeniy Strokin [EMAIL PROTECTED] wrote:
> Hello. Let's say I have 10 fields and usually some 5 of them are present in each document, and the size of my index is 100MB. I want to change my schema so that I'll have 100 fields, but each document will still have only 5 fields present. After I reindex my data, will the size be affected? Could you guess how big the increase will be? Any related information or suggestions will be helpful as well.
>
> Thanks in advance,
> Eugene
RE: Preferential boosting
I was doing something wrong: bisecting the result set does not work. Using a much larger boost and ORing with the entire index does work. Thanks.

    *:* OR duration:3^20.0          works
    -duration:3 OR duration:3^20    gives an empty result set

Now we come to another question: why doesn't "X OR -X" select the entire index?

Thanks,
Lance

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Yonik Seeley
Sent: Thursday, March 20, 2008 12:34 PM
To: solr-user@lucene.apache.org
Subject: Re: Preferential boosting

On Thu, Mar 20, 2008 at 3:13 PM, Lance Norskog [EMAIL PROTECTED] wrote:
> Suppose I have a schema with an integer field called 'duration'. I want to find all records, but if the duration is 3 I want those records to be boosted. The index has 10 records, with duration between 2 and 4. What is the query that will find all of the records and place the records with duration 3 above the others?
>
> These do not work (at least for me):
>
>     *:* OR duration:3^2.0
>     duration:[* TO *] duration:3^2.0

In what way don't these work? Perhaps a bigger boost would help?

-Yonik
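On the side question: in Lucene's boolean model, -duration:3 is a prohibited clause applied to the whole boolean query, not a per-operand NOT, and a purely negative clause matches nothing on its own - so "-X OR X" excludes X's matches and matches nothing else, leaving an empty set. The working form, as a hedged solrj sketch (field name taken from the thread):

    import org.apache.solr.client.solrj.SolrQuery;

    public class PreferentialBoost {
        public static void main(String[] args) {
            // Match everything; docs with duration:3 also match the boosted
            // clause and therefore sort above the rest.
            SolrQuery q = new SolrQuery("*:* OR duration:3^20.0");
            System.out.println(q.getQuery());
        }
    }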
Re: highlighting pt2: returning tokens out of order from PhraseQuery
On Mar 19, 2008, at 10:26 AM, Brian Whitman wrote:
> Can we somehow force the highlighter to not return snips that do not exactly match the query?

Unfortunately not with the current highlighter. But there has been a great deal of work towards fixing this here: http://issues.apache.org/jira/browse/LUCENE-794

Erik
Re: highlighting pt2: returning tokens out of order from PhraseQuery
> Unfortunately not with the current highlighter. But there has been a great deal of work towards fixing this here: http://issues.apache.org/jira/browse/LUCENE-794

Ah, thanks Erik - didn't think to check with the Lucene folks. I see they have somewhat-working patches; does this kind of stuff port over easily to Solr?
Re: Do empty fields affect index size?
Thanks for the info. But what about the cache? Will it take more memory for a 100-field schema with the same amount of data?

----- Original Message -----
From: Yonik Seeley [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thursday, March 20, 2008 3:48:28 PM
Subject: Re: Do empty fields affect index size?

Make sure you omit norms for those fields if possible. If you do that, the index should only be marginally bigger.

-Yonik

[...]
Re: Do empty fields affect index size?
On Thu, Mar 20, 2008 at 4:23 PM, Evgeniy Strokin [EMAIL PROTECTED] wrote:
> Thanks for the info. But what about the cache? Will it take more memory for a 100-field schema with the same amount of data?

For normal searches, not really.

-Yonik
Re: Do empty fields affect index size?
This is what I found in the docs: "Omitting norms is useful for saving memory on Fields that do not affect scoring, such as those used for calculating facets." I don't really understand the statement. Does it mean I cannot use those fields as facet fields? Because that is exactly why I need those 100 fields.

----- Original Message -----
From: Yonik Seeley [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thursday, March 20, 2008 3:48:28 PM
Subject: Re: Do empty fields affect index size?

Make sure you omit norms for those fields if possible. If you do that, the index should only be marginally bigger.

-Yonik

[...]
Re: Do empty fields affect index size?
On Thu, Mar 20, 2008 at 4:46 PM, Evgeniy Strokin [EMAIL PROTECTED] wrote:
> This is what I found in the docs: "Omitting norms is useful for saving memory on Fields that do not affect scoring, such as those used for calculating facets." I don't really understand the statement. Does it mean I cannot use those fields as facet fields? Because that is exactly why I need those 100 fields.

It just means that the norm has been omitted (which is 1 byte per doc in the complete index). The norm is just used for length normalization and index-time boosting. You can still search and facet a field that has norms omitted. Norms are only recommended for better relevance on big text fields.

-Yonik
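To make the "1 byte per doc" concrete: every indexed field that keeps norms costs one byte per document in the complete index, whether or not a given document has a value in that field. Illustrative arithmetic only - the document count below is made up:

    public class NormOverhead {
        public static void main(String[] args) {
            long numDocs = 10000000L;      // hypothetical corpus size
            int fieldsWithNorms = 100;     // every field keeps its norms
            long extraBytes = numDocs * fieldsWithNorms; // 1 byte/doc/field
            System.out.println("norm overhead ~"
                + extraBytes / (1024 * 1024) + " MB");
        }
    }

With omitNorms on the 95 mostly-empty fields, that overhead disappears, which is why the index ends up only marginally bigger.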
Re: highlighting pt2: returning tokens out of order from PhraseQuery
On Mar 20, 2008, at 4:13 PM, Brian Whitman wrote:
>> Unfortunately not with the current highlighter. But there has been a great deal of work towards fixing this here: http://issues.apache.org/jira/browse/LUCENE-794
>
> Ah, thanks Erik - didn't think to check with the Lucene folks. I see they have somewhat-working patches; does this kind of stuff port over easily to Solr?

If I had replied a bit earlier today the answer would have been different, but I see that Mike has just committed the SOLR-386 patch today, which makes highlighters pluggable, so it shouldn't be too terrible to wire it in.

Erik
Re: what's up with: java -Ddata=args -jar post.jar &lt;optimize/&gt;
Thanks Yonik. Now that I understand it... I'm not worried about it. :)

-JM

-----Original Message-----
From: Yonik Seeley [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Thu, 20 Mar 2008 11:19 am
Subject: Re: what's up with: java -Ddata=args -jar post.jar <optimize/>

[...]

As I said, don't worry about it; they will get cleaned up sooner or later (unless you are never going to change the index again after you build it).

-Yonik
cannot start solr after adding Analyzer, ClassCastException error
Hi, everyone.

After I add an Analyzer to Solr, a ClassCastException is thrown and Solr cannot be started. The details:

Environment: Solr 1.2, JDK 1.6.03, Ubuntu Linux 7.10, and a Chinese analyzer.

I added some lines to schema.xml:

    <fieldtype name="text_chinese" class="solr.TextField">
      <analyzer class="net.paoding.analysis.analyzer.PaodingAnalyzer"/>
    </fieldtype>

I tried some different analyzers, but the same exception happened, so I think it is Solr's problem or my configuration has something wrong. Any ideas?

The error message is:

    org.apache.solr.core.SolrException: Schema Parsing Failed
        at org.apache.solr.schema.IndexSchema.readConfig(IndexSchema.java:556)
        at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:71)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:196)
        at org.apache.solr.core.SolrCore.getSolrCore(SolrCore.java:177)
        at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
        at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
        at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
        at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
        at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
        at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
        at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
        at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
        at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
        at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
        at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
        at org.mortbay.jetty.Server.doStart(Server.java:210)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
        at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.mortbay.start.Main.invokeMain(Main.java:183)
        at org.mortbay.start.Main.start(Main.java:497)
        at org.mortbay.start.Main.main(Main.java:115)
    Caused by: java.lang.ClassCastException: net.paoding.analysis.analyzer.PaodingAnalyzer cannot be cast to org.apache.lucene.analysis.Analyzer
        at org.apache.solr.schema.IndexSchema.readAnalyzer(IndexSchema.java:583)
        at org.apache.solr.schema.IndexSchema.readConfig(IndexSchema.java:331)
        ... 28 more
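The thread ends here, but a frequent cause of "X cannot be cast to Analyzer" when X really does extend Analyzer is two copies of lucene-core on the classpath (for example, one inside solr.war and one bundled with the analyzer jar), so the two Analyzer classes come from different classloaders. A hedged sketch of a check you could run inside the same webapp:

    import org.apache.lucene.analysis.Analyzer;
    import net.paoding.analysis.analyzer.PaodingAnalyzer;

    public class ClassLoaderCheck {
        public static void main(String[] args) throws Exception {
            // The Analyzer class Solr links against:
            Class<?> solrSide = Analyzer.class;
            // The Analyzer class visible to the analyzer jar's classloader:
            Class<?> paodingSide = Class.forName(
                "org.apache.lucene.analysis.Analyzer",
                false, PaodingAnalyzer.class.getClassLoader());
            // "false" here would mean a duplicate lucene-core jar, which
            // would explain the ClassCastException during schema parsing.
            System.out.println("same Analyzer class: " + (solrSide == paodingSide));
        }
    }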