RE: XML data in solr field
Thank you Tommy. But the real problem here is that the XML is dynamic and the element names will be different in different docs, which means there would be a lot of field names to add to the schema if I were to index those XML nodes separately. Is it possible to have nested indexing (XML within XML) in Solr without the overhead of adding all those inner XML nodes as actual fields in the Solr schema?

Manas

From: Tommy Chheng [mailto:tommy.chh...@gmail.com]
Sent: Tue 3/16/2010 5:05 PM
To: solr-user@lucene.apache.org
Subject: Re: XML data in solr field

Do you have the option of just importing each XML node as a field/value when you add the document? That'll let you do the search easily. If you need to store the raw XML, you can use an extra field.

Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com

On 3/16/10 12:59 PM, Nair, Manas wrote:

Hello Experts,

I need help on this issue of mine. I am unsure if this scenario is possible. I have a field in my solr document named inputxml, the value of which is an XML string as below. This XML structure is within the inputxml field value. I need help searching this XML structure, i.e. if I search for Venue, I should get Radio City Music Hall as the result and not the complete tag like <Venue value="Radio City Music Hall" />. Is this supported in Solr? If it is, how can it be implemented?

<root>
  <Venue value="Radio City Music Hall" />
  <Link value="http://bit.ly/Rndab" />
  <LinkText value="En savoir +" />
  <Address value="New-York, USA" />
</root>

Any help is appreciated. I do not need the tag name in the result; instead I need the tag value.

Thanks in advance,
Manas Nair
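For reference, one way to follow Tommy's suggestion without declaring every inner element up front is a dynamic field in schema.xml; a minimal sketch (the "*_xmlval" naming is illustrative, not from the thread):

<dynamicField name="*_xmlval" type="text" indexed="true" stored="true" />

Each inner node can then be posted as its own field (e.g. Venue_xmlval with value "Radio City Music Hall") and queried as Venue_xmlval:"Radio City Music Hall", while the raw XML stays in the inputxml field.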
RE: Issue in search
You could write your query like:

q=fieldName1:searchValue AND fieldName2:value OR fieldName3:value

Regards,
Manas

From: Suram [mailto:reactive...@yahoo.com]
Sent: Wed 3/17/2010 12:44 AM
To: solr-user@lucene.apache.org
Subject: Issue in search

In Solr, how can I perform AND, OR, NOT searches while querying the data?
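For reference, the standard Lucene query parser also supports grouping and the +/- prefix operators; a sketch with illustrative field names:

q=(fieldName1:value1 AND fieldName2:value2) OR fieldName3:value3
q=+fieldName1:required -fieldName2:excluded

Note that AND, OR, and NOT must be upper case, and parentheses are usually needed to make mixed AND/OR queries unambiguous.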
Weird behaviour for certain search terms
Solr is behaving a bit weirdly for some of the search terms, e.g. co-ownership, co ownership. It works fine with terms like quasi-delict, non-interference, etc. The issue is, it's not returning any excerpts in the highlighting key of the result dictionary. My search query is something like this:

http://192.168.1.50:8080/solr/core_SFS/select?q=content:(co-ownership)+AND+permauid:(AAAE1292-rw)&hl=true&hl.fl=content&hl.requireFieldMatch=true&hl.fragsize=600&hl.usePhraseHighlighter=true&facet=true&facet.field=permauid&facet.field=info_owner&facet.sort=true&facet.mincount=1&facet.limit=-1&wt=python&sort=promulgation_date_igprs_date+asc&start=0&rows=200&fl=uid,permauid

but when I search for terms like quasi-delict or non-interference, it gives me proper excerpts. I am using Solr 1.4 with Python. Any help will be highly appreciated. Thanks
Re: Solr query parser doesn't invoke analyzer for simple term query?
Hello,

You can see what happens (which analyzers are used for this field and what output they produce) for this search using the analysis page of the Solr admin web interface. I assume you are using the same analyzers and tokenizers for indexing and searching on this field in your schema.

Regards,
Marco Martínez Bautista

2010/3/17 Teruhiko Kurosaka k...@basistech.com

It seems that Solr's query parser doesn't pass a single-term query to the Analyzer for the field. For example, if I give it 2001年 (year 2001 in Japanese), the searcher returns 0 hits, but if I quote it with double-quotes, it returns hits. In this experiment, I configured schema.xml so that the field in question uses the morphological Analyzer my company makes, which is capable of splitting 2001年 into two tokens, 2001 and 年. I am guessing that this Analyzer is called ONLY IF the term is a phrase. Is my observation correct? If so, is there any configuration parameter that I can tweak to force any query against text fields to be processed by the Analyzer?

One might ask why users won't put a space between 2001 and 年. Well, if they are clearly two separate words, people do that. But 年 works more like a suffix in this case, and in many Japanese speakers' minds 2001年 seems like one token, so many people won't. (Remember Japanese doesn't use spaces in normal writing.) Forcing use of the Analyzer would also be useful for compound-word handling, often desirable for languages like German.

Teruhiko Kuro Kurosaka
RLP + Lucene & Solr = powerful search for global contents
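For illustration, the analysis page compares the index-time and query-time chains declared in schema.xml; a minimal fieldType sketch using the CJK tokenizer bundled with Solr as a stand-in for the custom morphological analyzer discussed above:

<fieldType name="text_cjk" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>

A single <analyzer> element with no type attribute applies the same chain to both indexing and querying, which avoids the index/query mismatch Marco mentions.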
Re: APR setup
I think I know many sites that ignore this warning... using mod_proxy is a much easier method by comparison. Maybe if you are aiming at millions of queries per second you should consider it; I wonder if it makes sense before that point.

paul

On 17 March 2010 at 04:36, blargy wrote:

[java] INFO: The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: .:/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java

What the heck is this and why is it recommended for production settings? Anyone?
Will Solr fit our needs?
Hi List,

we are running a marketplace with functionality roughly comparable to eBay (auctions, fixed-price items, etc.). The items are placed on the market by users who want to sell their goods. Currently we are using Sphinx as an indexing engine, but, as Sphinx returns only document ids, we have to make a database query to fetch the data to display. This massively decreases performance as we have to make two requests to display data. I heard that Solr is able to return a complete dataset, and we hope a switch to Solr can boost performance.

A critical question is left, and I was not able to find a solution for it in the docs: is it possible to update attributes directly in the index? An example for better illustration: we have an index which holds all the auctions (containing auctionid, auction title) with their current prices (field: current_price). When a user places a new bid, is it possible to update the attribute 'current_price' directly in the index so that we can fetch the current_price from Solr and not from the database?

I hope you understood my problem. It would be kind if someone could point me in the right direction.

Thanks a lot! Moritz
Solr 1.4 - Stemmer expansion
I'm using the SnowballPorterFilterFactory for stemming French words. Some words are not recognized by this stemmer; I wonder whether, like synonym processing, the stemmers have the option of expansion. Thanks.
Re: Will Solr fit our needs?
Hi,

Solr runs on top of Lucene, and as far as I know Lucene supports only one approach to updating a document's field content: delete first, then (re)index with new values. However, that does not mean you cannot implement what you need. Take a look at the ParallelReader API: http://lucene.apache.org/java/3_0_1/api/all/org/apache/lucene/index/ParallelReader.html

I am not sure if this functionality is directly exposed via the Solr API. Try digging in the mailing lists (search-lucene.com or lucidimagination.com can be of good help; you can narrow the search to Solr only: http://search-lucene.com/?q=ParallelReader&fc_project=Solr or http://www.lucidimagination.com/search/?q=ParallelReader#/p:solr). For example, the following mail thread seems to be relevant: http://search-lucene.com/m/iT2hMvtDt5 (though it is a bit dated).

Do you really use only one physical index for all auctions? If yes, then you might consider using ParallelReader, but if the index is large then I am not sure about the performance. If you are planning to partition your index, then it can get more complex but faster.

Regards,
Lukas

On Wed, Mar 17, 2010 at 9:49 AM, Moritz Mädler m...@moritz-maedler.de wrote: [...]
Re: Will Solr fit our needs?
I have been thinking about your question again, and I think that if you expect the price value to change a lot, especially when talking about auctions, then you should consider not storing the actual price in the full-text index but in some fast datastore. Some kind of scalable in-memory hash map with journal-based backup would do this job better, I think. Just my 2 cents.

Regards,
Lukas

On Wed, Mar 17, 2010 at 10:36 AM, Lukáš Vlček lukas.vl...@gmail.com wrote: [...]
Re: Stopwords
I was reading Scaling Lucene and Solr (http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr/) and I came across the section on stop words. It mentions that it's not recommended to remove stop words at index time. Why is this the case? Don't all the extraneous stop words bloat the index and lead to less relevant results? Can someone please explain this to me. Thanks

There was a discussion about stopwords (remove them, don't remove them, or index them with CommonGramsFilterFactory) and good references in this thread:
http://search-lucene.com/m/QvJtF1mIPP22/When+Stopword+Lists+Make+the+Difference
Re: Weired behaviour for certain search terms
Solr is behaving a bit weirdly for some of the search terms, e.g. co-ownership, co ownership. It works fine with terms like quasi-delict, non-interference, etc. The issue is, it's not returning any excerpts in the highlighting key of the result dictionary. [...]

If the problem is only empty snippets (numFound > 0 but no highlight fragments are returned), then adding hl.maxAnalyzedChars=-1 can help.
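Concretely, per the suggestion above, that would mean appending the parameter to the query URL from the quoted message (everything else unchanged):

...&hl.fragsize=600&hl.usePhraseHighlighter=true&hl.maxAnalyzedChars=-1&...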
Re: SQL and $deleteDocById
On 16.03.2010, at 15:42, Lukas Kahwe Smith wrote:

Hi,

I am trying to use $deleteDocById to delete rows based on an SQL query in my db-data-config.xml. The following tag is a top-level tag inside the document tag:

<entity name="company_del" query="SELECT e.id AS `$deleteDocById` ROM deletedentity AS e"/>

that's obviously a typo from trying to simplify the example .. should be FROM

However, it seems like it's only fetching the rows; it's not actually issuing any index deletes. I can see that the special-case handler is triggered when looking at the console, but no actual deletes are happening, as I can verify via Luke or just by trying a query:

INFO: [core1] webapp=/solr path=/dataimport params={command=full-import&clean=false} status=0 QTime=7
Mar 17, 2010 11:29:15 AM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
INFO: Read dataimport.properties
Mar 17, 2010 11:29:15 AM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Creating a connection for entity company_del with URL: jdbc:mysql://localhost/xxx
Mar 17, 2010 11:29:16 AM org.apache.solr.handler.dataimport.JdbcDataSource$1 call
INFO: Time taken for getConnection(): 809
Mar 17, 2010 11:29:16 AM org.apache.solr.handler.dataimport.SolrWriter deleteDoc
INFO: Deleting document: 1
Mar 17, 2010 11:29:16 AM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=1
commit{dir=/Users/lsmith/htdocs/liip/xxx/trunk/jetty/solr/core1/data_test/index,segFN=segments_9,version=1268742459863,generation=9,filenames=[_8.nrm, segments_9, _8.tis, _8.prx, _8.fnm, _8.tii, _8.frq, _8.fdx, _8.fdt]
Mar 17, 2010 11:29:16 AM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: newest commit = 1268742459863
Mar 17, 2010 11:29:16 AM org.apache.solr.handler.dataimport.SolrWriter deleteDoc
INFO: Deleting document: 2
[identical SolrWriter deleteDoc entries repeat for documents 3 through 30]
Re: Will Solr fit our needs?
Hi Moritz,

You can take a look at the project ZOIE - http://code.google.com/p/zoie/. I think it's what you are looking for.

br
Krzysztof

On Wed, Mar 17, 2010 at 9:49 AM, Moritz Mädler m...@moritz-maedler.de wrote: [...]
Re: Will Solr fit our needs?
If you don't plan on filtering, sorting, and/or faceting on fast-changing fields, it would be better to store them outside of Solr/Lucene, in my opinion.

If you must: for indexing-performance reasons you will probably end up maintaining separate indices (one for slow-changing/static fields and one for fast-changing fields). You frequently commit the fast-changing index to incorporate the changes in current_price. Afterwards you have two options, I believe:

1. Use ParallelReader to query the separate indices directly. Afaik this is not (completely) integrated in Solr... I wouldn't recommend it.
2. After you commit the fast-changing index, merge it with the static index. You're left with one fresh index, which you can push to your slave servers (all this at regular intervals).

Disadvantages:
- In any case, you must be very careful when maintaining multiple parallel indices with the purpose of treating them as one. For instance, document inserts must be done in exactly the same order; otherwise the indices go 'out of sync' and are unusable.
- Higher maintenance.
- There is always a time window in which the current_price values are stale. If that's within your requirements, that's OK.

The other path, which I recommend, would be to store the current_price outside of Solr (like you're currently doing), but instead of using a relational db, try looking into persistent key-value stores. Many of them exist, and a lot of progress has been made in the last couple of years. For simple key lookups (what you need, as far as I can tell) they really blow every relational db out of the water (considering the same hardware, of course). We're currently using Tokyo Cabinet with the server frontend Tokyo Tyrant and are seeing almost a 5x increase in lookup performance compared to our previous kv-store, MemcacheDB, which is based on BerkeleyDB. MemcacheDB was already several times faster than our MySQL setup (although not optimally tuned).

To sum things up: use the best tools for what they were meant to do.
- index/search --> Solr/Lucene without a doubt.
- kv-lookup --> consensus is still forming, and there are a lot of players (with a lot of different types of functionality), but if all you need is simple key-value lookup, I would go for Tokyo Cabinet (TC) / Tyrant at the moment.

Please note that TC and its competitors aren't just some code/hobby projects but are usually born out of a real need at huge websites / social networks, such as TC, which was born from mixi (a big social network in Japan). So at least you're in good company.

For kv-stores I would suggest beginning your research at:
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/ (beginning of 2009)
http://randomfoo.net/2009/04/20/some-notes-on-distributed-key-stores (mid-2009)
and get a feel for the kv playing field.

Hope this (pretty long) post helps,
Geert-Jan

2010/3/17 Krzysztof Grodzicki krzysztof.grodzi...@iterate.pl wrote: [...]
Re: SQL and $deleteDocById
On 17.03.2010, at 11:36, Lukas Kahwe Smith wrote: [...]

I have managed to get things working with a different approach now:

<entity name="entity" pk="id"
        query="SELECT e.id, e.name FROM entity AS e"
        deletedPkQuery="SELECT e.id FROM deletedentity AS e"/>

regards,
Lukas Kahwe Smith
m...@pooteeweet.org
London open-source search social - 6th April
Hi all, We're meeting up at the Elgin just by Ladbroke Grove on the 6th for a bit of relaxed chat about search, and related technology. Come along, we're nice. http://www.meetup.com/london-search-social/calendar/12781861/ It's a regular event, so if you want prior warning about future meetups you can sign up here: http://www.meetup.com/london-search-social/ Cheers, Rich
Re: XML data in solr field
Have you considered an XML database? Because this is exactly what they are designed to do. eXist is open source, or you can use Mark Logic (my employer), which is much faster and more scalable. We do give out free academic and community licenses for Mark Logic.

wunder

On Mar 16, 2010, at 11:04 PM, Nair, Manas wrote: [...]
Re: Solr 1.4 - Stemmer expansion
The configuration is correct and it works perfectly for French. So far, all the French words I tried got stemmed correctly except the word studios. This is why I thought about expansion; I might need it for other words too.

Thanks,
-Saïd

2010/3/17 Erick Erickson erickerick...@gmail.com

Did you specify language="French"? Did you re-index after specifying this? Can you give some examples of unrecognized words? Did you look in your index to see what was actually indexed, via the admin pages and/or Luke? Did you use debugQuery=on to see how your search was parsed? Could you post your schema definition for the field in question so folks can look at it?

We need some details in order to actually be helpful <G>...

Best
Erick

On Wed, Mar 17, 2010 at 5:05 AM, Saïd Radhouani r.steve@gmail.com wrote: [...]
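The Snowball stemmer itself has no expansion option, but two common workarounds exist in schema.xml: protect a word from stemming entirely, or normalize it with a synonym mapping before the stemmer runs. A sketch (the stem_overrides.txt file name and its "studios => studio" line are illustrative assumptions, not from the thread):

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="stem_overrides.txt" ignoreCase="true" expand="false"/>
  <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
</analyzer>

The same chain must be used at index and query time so the overridden forms match.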
RE: PDFBox/Tika Performance Issues
Hmm. Unfortunately that didn't work. Same problem - Solr doesn't report an error, but the data doesn't get extracted. Using the same PDF with my previous /Lib contents works fine. Any other ideas? These are the jar files I have in my /Lib:

apache-solr-cell-1.4-dev.jar, asm-3.1.jar, bcmail-jdk15-1.45.jar, bcprov-jdk15-1.45.jar, commons-codec-1.3.jar, commons-compress-1.0.jar, commons-io-1.4.jar, commons-lang-2.1.jar, commons-logging-1.1.1.jar, dom4j-1.6.1.jar, fontbox-1.0.0.jar, geronimo-stax-api_1.0_spec-1.0.1.jar, hamcrest-core-1.1.jar, icu4j-3.8.jar, jempbox-1.0.0.jar, junit-3.8.1.jar, log4j-1.2.14.jar, lucene-core-2.9.1-dev.jar, lucene-misc-2.9.1-dev.jar, metadata-extractor-2.4.0-beta-1.jar, mockito-core-1.7.jar, nekohtml-1.9.9.jar, objenesis-1.0.jar, ooxml-schemas-1.0.jar, pdfbox-1.0.0.jar, poi-3.6.jar, poi-ooxml-3.6.jar, poi-ooxml-schemas-3.6.jar, poi-scratchpad-3.6.jar, tagsoup-1.2.jar, tika-core-0.7-SNAPSHOT.jar, tika-parsers-0.7-SNAPSHOT.jar, xercesImpl-2.8.1.jar, xml-apis-1.0.b2.jar, xmlbeans-2.3.0.jar

-----Original Message-----
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Tuesday, March 16, 2010 11:50 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hi Giovanni,

Comments below:

I'm pretty unclear on how to patch the Tika 0.7-trunk on our Solr instance. This is what I've tried so far (which was really just me guessing):
1. Got the latest version of the trunk code from http://svn.apache.org/repos/asf/lucene/tika/trunk
2. Built this using Maven (mvn install)

On track so far.

3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib folder for my Solr Core, and renamed it to the name of the existing Tika jar (tika-0.3.jar).

I don't think you need to do this (w.r.t. the renaming). I think what you need to do is to drop:

tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar

into your Solr core /lib folder. Also you should make sure to take the updated PDFBox 1.0.0 jar (you can get this by running mvn dependency:copy-dependencies in the tika-parsers project; see here: http://maven.apache.org/plugins/maven-dependency-plugin/copy-dependencies-mojo.html), along with the rest of the jar deps for tika-parsers, and drop them in there as well. Then make sure to remove the existing tika-0.3.jar, as well as any of the existing parser lib jar files, and replace them with the new deps. A bunch of manual labor, yes, but you're on the bleeding edge, so c'est la vie, right? :) The alternative is to wait for Tika 0.7 to be released and then for Solr to upgrade to it.

4. Then I bounced my servlet server and tried indexing a document. The document was successfully indexed, and there were no errors logged as a result, but the PDF data does not appear to have been extracted (the field I used for map.content had an empty string as a value).

I think this probably has to do with the lib deps. Try what I mentioned above and let's go from there.

Cheers,
Chris

-----Original Message-----
From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
Sent: Tuesday, March 16, 2010 5:41 PM
To: solr-user@lucene.apache.org
Subject: RE: PDFBox/Tika Performance Issues

Thanks Chris! I'll try the patch.

-----Original Message-----
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Tuesday, March 16, 2010 5:37 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Guys, I think this is an issue with PDFBox and the version that Tika 0.6 depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may include a fix for the problem you're seeing. See this discussion [2] on how to patch Tika to use the new PDFBox if you can't wait for the 0.7 release, which should happen soon (hopefully in the next few weeks).

Cheers,
Chris

[1] http://issues.apache.org/jira/browse/TIKA-380
[2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html

On 3/16/10 2:31 PM, Giovanni Fernandez-Kincade gfernandez-kinc...@capitaliq.com wrote:

Originally 16 (the number of CPUs on the machine), but even with 5 threads it's not looking so hot.

-----Original Message-----
From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Ingersoll
Sent: Tuesday, March 16, 2010 5:15 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hmm, that is an ugly thing in PDFBox. We should probably take this over to the PDFBox project. How many threads are you indexing with? FWIW, for that many documents, I might consider using Tika on the client side to save on a lot of network traffic.

-Grant

On Mar 16, 2010, at 4:37 PM, Giovanni Fernandez-Kincade wrote:

I've been trying to bulk index about 11 million PDFs, and while profiling our Solr instance, I noticed that all of the threads that are processing indexing requests are constantly blocking
Re: Stopwords
That discussion cites a paper via a URL:
http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana__When_Stopword_Lists_Make_the_Difference_20091218.pdf

Unfortunately when I go to this URL I get: "L'accès à ce document est limité." ("Access to this document is restricted.")

But I tracked down the paper. Here is its reference (which may require a subscription; sorry):

US: http://dx.doi.org/10.1002/asi.21186
AU: Ljiljana Dolamic
AU: Jacques Savoy
TI: When stopword lists make the difference
SO: Journal of the American Society for Information Science and Technology
VL: 61
NO: 1
PG: 200-203
YR: 2010
CP: © 2009 ASIST
ON: 1532-2890
PN: 1532-2882
AD: Computer Science Department, University of Neuchâtel, 2009 Neuchâtel, Switzerland
DOI: 10.1002/asi.21186

-Glen

On 17 March 2010 06:02, Ahmet Arslan iori...@yahoo.com wrote: [...]
RE: Indexing CLOB Column in Oracle
To convert an XMLTYPE to a CLOB, use the getClobVal() method like this:

SELECT d.XML.getClobVal() FROM DOC d WHERE d.ARCHIVE_ID = '${doc.ARCHIVE_ID}'

-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Tuesday, March 16, 2010 7:37 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing CLOB Column in Oracle

Disclaimer: my Oracle experience is minuscule at best. I am also a beginner at Solr, so grab yourself the proverbial grain of salt. I googled a bit on CLOB. One page I found mentioned setting up a view to return the data type you want. Can you use the functions described on these pages in either the Solr query or a view?

http://www.oradev.com/dbms_lob.jsp
http://www.dba-oracle.com/t_dbms_lob.htm
http://www.praetoriate.com/dbms_packages/ddp_dbms_lob.htm

I also was trying to find a way to convert from xmltype directly to a string in a query, but that quickly got way over my level of understanding. I saw hints that it is possible, though.

Shawn

On 3/16/2010 4:59 PM, Neil Chaudhuri wrote:

Since my original thread was straying to a new topic, I thought it made sense to create a new thread of discussion. I am using the DataImportHandler to index 3 fields in a table: an id, a date, and the text of a document. This is an Oracle database, and the document is an XML document stored as Oracle's xmltype data type, which is an instance of oracle.sql.OPAQUE. Still, it is nothing more than a fancy CLOB.
Re: spanish solr tutorial
Very nice. I'd suggest adding a link to the wiki near the tutorial link. -Grant On Mar 16, 2010, at 11:44 PM, Juan Pedro Danculovic wrote: Hi all, we translated the Solr tutorial to Spanish due to a client's request. For all you Spanish speakers/readers out there, you can have a look at it: http://www.linebee.com/?p=155 We hope this can expand the usage of the project and lower the language barrier to non-english speakers. Thanks Juan Danculovic CTO - www.linebee.com
Re: Stopwords
It has apparently moved; it's here now:
http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf

From: Glen Newton glen.new...@gmail.com
Sent: Wednesday, March 17, 2010 11:13 AM
To: solr-user@lucene.apache.org
Subject: Re: Stopwords

[...]
Re: Stopwords
On Mar 16, 2010, at 9:51 PM, blargy wrote: [...]

Yes and no. Putting our historian hat on: stop words were often seen as contributing very little to scores and also taking up a lot of room on disk, back in the days when disk was very precious. Times, as they say, have changed. Disk is cheap, so that is no longer a concern.

Think about stop words a little bit from a language perspective. While it is true that they are of little value in search, they are not of no value (if they were of no value in a language, one could argue that the word shouldn't even exist, right?). This is especially true when the user enters a query that is entirely stop words (for instance, there is a band called The The). Thus, the trick becomes knowing when to use stop words and when not to. If you remove them at indexing time, you have no choice, as the information is lost; that is why more and more people keep them during indexing and then deal with them at query time.

It turns out stop words are often also useful as parts of phrases. Consider the following two documents:

1. The President of the United States went to China last week.
2. Joe is the President. The United States is investigating him for corruption.

If the user enters the query "The President of the United States" and stop words are removed at indexing and search time, then both documents will match, whereas with stop words, the first is the only (and correct) match, at least based on my intent.

To deal with them at query time, you need an intelligent query parser that:
1. Recognizes when the query is all stop words
2. Keeps stop words as part of phrases

Unfortunately, none of the existing Solr query parsers address these two things.

HTH,
Grant

--
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
Re: Stopwords
On Wed, Mar 17, 2010 at 11:48 AM, Grant Ingersoll gsing...@apache.org wrote: [...]

Yes, and the take-away from the Dolamic and Savoy paper is that, performance aside, removing stopwords is still a necessary evil for good relevance, at least for some languages. Ideally we wouldn't have to remove information to have good relevance, and a good step forward would be to support relevance-ranking algorithms such as BM25, mentioned in the paper, that provide good relevance without the need to remove stopwords.

For now, at least the CommonGrams solution is available in Solr; it provides an alternative that can address both concerns (performance and relevance) to some degree.

--
Robert Muir
rcm...@gmail.com
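For reference, a minimal sketch of the CommonGrams setup in a schema.xml fieldType: the index-time filter forms token pairs for common words, and the matching query-time filter keeps queries consistent (stopwords.txt here is the usual Solr stopword file):

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>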
Re: Stopwords
On 03/17/2010 12:03 PM, Robert Muir wrote: [...]

In general I prefer to have the option of removing stopwords at query time (CommonGrams solution aside). Too many times have I removed stopwords and then had user complaints about phrase and proximity queries, with no server downtime available to reindex and fix the issue. It was never fun supporting librarians.

--
- Mark
http://www.lucidimagination.com
Re: Exception encountered during replication on slave....Any clues?
Hi William,

We are facing the same issue as you; just thought of checking whether you had already resolved it?

Thanks,
Barani

William Pierce-3 wrote:

Folks:

I am seeing this exception in my logs that is causing my replication to fail. I start with a clean slate (empty data directory). I index the data on the postingsmaster using the dataimport handler and it succeeds. When the replication slave attempts to replicate, it encounters this error:

Dec 7, 2009 9:20:00 PM org.apache.solr.handler.SnapPuller fetchLatestIndex
SEVERE: Master at: http://localhost/postingsmaster/replication is not available. Index fetch failed. Exception: Invalid version or the data in not in 'javabin' format

Any clues as to what I should look for to debug this further? Replication is enabled as follows. The postingsmaster solrconfig.xml looks like this:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <!-- Replicate on 'optimize'; it can also be 'commit' -->
    <str name="replicateAfter">commit</str>
    <!-- If configuration files need to be replicated, give the names here, comma separated -->
    <str name="confFiles"></str>
  </lst>
</requestHandler>

The postings slave solrconfig.xml looks like this:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- Fully qualified url for the replication handler of master -->
    <str name="masterUrl">http://localhost/postingsmaster/replication</str>
    <!-- Interval at which the slave should poll master. Format is HH:mm:ss. If this is absent the slave does not poll automatically, but a snappull can be triggered from the admin or the HTTP API -->
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>

Thanks,
- Bill
Replication failed due to HTTP PROXY?
Hi,

One of my colleagues back in India is not able to replicate the index present on the servers (USA). I am now wondering if this is due to a proxy-related issue. He is getting the error message below. Is there a way to configure a PROXY in the Solr config files?

Server logs:

INFO: [] Registered new searcher searc...@edf730 main
Mar 17, 2010 8:38:06 PM org.apache.solr.handler.ReplicationHandler getReplicationDetails
WARNING: Exception while invoking 'details' method for replication on master
org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 5000 ms
    at org.apache.commons.httpclient.protocol.ReflectionSocketFactory.createSocket(ReflectionSocketFactory.java:155)
    at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:125)
    at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
    at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1361)
    at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
    at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
    at org.apache.solr.handler.SnapPuller.getNamedListResponse(SnapPuller.java:193)
    at org.apache.solr.handler.SnapPuller.getCommandResponse(SnapPuller.java:188)
    at org.apache.solr.handler.ReplicationHandler.getReplicationDetails(ReplicationHandler.java:581)
    at org.apache.solr.handler.ReplicationHandler.handleRequestBody(ReplicationHandler.java:180)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.jsp.admin.replication.index_jsp.executeCommand(org.apache.jsp.admin.replication.index_jsp:50)
    at org.apache.jsp.admin.replication.index_jsp._jspService(org.apache.jsp.admin.replication.index_jsp:231)
    [remaining Jetty servlet and JSP container frames omitted]
Caused by:
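There is no proxy setting in the Solr config files themselves; the usual first thing to try is the standard JVM proxy properties when starting the slave (proxy.example.com is a placeholder, not from the thread):

java -Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 -jar start.jar

One caveat: the commons-httpclient 3.x library that SnapPuller uses (visible in the trace above) does not automatically honor these system properties, so this is only a sketch of a starting point, not a guaranteed fix.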
related search
How can I make a related search in Solr? If I search for ipod, I need to get answers like ipod shuffle, ipod nano, iphone, without using the MoreLikeThis option.
Re: Solr RAM Requirements
Hi Chak,

Rather than comparing the overall size of your index to the RAM available for the OS disk cache, you might want to look at particular files. For example, if you allow phrase queries, then the size of the *.prx files is relevant; if you don't, you can look at the size of your *.frq files. You also might want to take a look at the free memory when you start up Solr and then watch it fill up as you get more queries (or send cache-warming queries).

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

KaktuChakarabati wrote:

My question was mainly about the fact that there seem to be two different aspects to Solr RAM usage: in-process and out-of-process. By that I mean: yes, I know the many different parameters/caches related to Solr's in-process memory usage and the related culprits; however, I also understand that for actual index access (postings lists, positional index, etc.), Solr mostly delegates the access/caching to the OS disk cache. So I guess my question is more about that: namely, what would be a good way to calculate an overall RAM requirement profile for a server running Solr?
Querying multiple fields with the MoreLikeThis handler and mlt.fl
I'm wondering if there's been any progress on an issue described a year or so ago in "More details on my MoreLikeThis mlt.qf boosting problem" (http://markmail.org/thread/nmabm5ly3wk2nqyy), where it was pointed out that the MoreLikeThis handler only queries one field for each of the interesting terms that it finds in the input text.

I was hoping that using

/mlt?mlt.fl=title+text&mlt.qf=title^2+text^0.5&mlt.interestingTerms=details&stream.body=tony+blair

would produce

title:tony^2 title:blair^2 text:tony^0.5 text:blair^0.5

but it actually produces just

text:tony^0.5 text:blair^0.5

i.e. despite including the title field in both mlt.qf and mlt.fl, it only searches the text field. If I set mlt.fl=title, it produces

title:tony^2 title:blair^2

so it is having an effect, just not the one I'm hoping for...

As it stands, in Solr 1.4, the MoreLikeThis result set from the query above doesn't produce the document with title "Tony Blair" as the first result, which would seem appropriate given the input text "tony blair" and a boost on the title field.

alf
XPath Processing Applied to Clob
I am using the DataImportHandler to index 3 fields in a table: an id, a date, and the text of a document. This is an Oracle database, and the document is an XML document stored as Oracle's xmltype data type. Since this is nothing more than a fancy CLOB, I am using the ClobTransformer to extract the actual XML. However, I don't want to index/store all the XML, but instead just the XML within a set of tags. The XPath itself is trivial, but it seems like the XPathEntityProcessor only works for XML file content rather than the output of a Transformer. Here is what I currently have that fails:

<document>
  <entity name="doc"
          query="SELECT d.EFFECTIVE_DT, d.ARCHIVE_ID, d.XML.getClobVal() AS TEXT FROM DOC d"
          transformer="ClobTransformer">
    <field column="EFFECTIVE_DT" name="effectiveDate" />
    <field column="ARCHIVE_ID" name="id" />
    <field column="TEXT" name="text" clob="true" />
    <entity name="text" processor="XPathEntityProcessor" forEach="/MESSAGE" url="${doc.text}">
      <field column="body" xpath="//BODY" />
    </entity>
  </entity>
</document>

Is there an easy way to do this without writing my own custom transformer? Thanks.
Trouble getting results from Dismax query
I'm trying to use the Dismax request handler, and thanks to the list, I fixed one problem, which was the existing configs in solrconfig.xml. I'm now just not getting any results from the query, though. I changed the dismax section in solrconfig.xml to this:

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">artist title</str>
    <str name="f.name.hl.fragsize">0</str>
    <str name="f.name.hl.alternateField">title</str>
    <str name="f.text.hl.fragmenter">regex</str>
  </lst>
</requestHandler>

I'm using solr-php-client for my query, and the code looks like this:

$params['qt'] = 'dismax';
$params['qf'] = 'title^100 artist^150 description^5 tags^'.$tagboost.' artist_title^500 featuring_artist^20 collaborators^50';
$params['pf'] = 'artist_title';
$query = "title:$search artist:$search description:$search tags:$search +type:$type artist_title:$search featuring_artist:$search collaborators:$search";
$response = $solr->search($query, 0, 30, $params);

The raw query ends up as this:

/solr/select?qt=dismax&qf=title%5E100+artist%5E150+description%5E5+tags%5E10+artist_title%5E500+featuring_artist%5E20+collaborators%5E50&pf=artist_title&q=title%3Aakon+artist%3Aakon+description%3Aakon+tags%3Aakon+%2Btype%3Avideo+artist_title%3Aakon+featuring_artist%3Aakon+collaborators%3Aakon&version=1.2&wt=json&json.nl=map&start=0&rows=30

The response header is this:

{"responseHeader":{"status":0,"QTime":9,"params":{"pf":"artist_title","start":"0","q":"title:akon artist:akon description:akon tags:akon +type:video artist_title:akon featuring_artist:akon collaborators:akon","qf":"title^100 artist^150 description^5 tags^10 artist_title^500 featuring_artist^20 collaborators^50","json.nl":"map","qt":"dismax","wt":"json","version":"1.2","rows":"30"}},"response":{"numFound":0,"start":0,"docs":[]}}

If I remove the qt=dismax, I get results like I should. Can anyone shed some light?

Thanks,
Alex
Re: Trouble getting results from Dismax query
On Mar 17, 2010, at 3:38 PM, Alex Thurlow wrote: [...]

The dismax parser does not support fielded queries, so title:$search etc. is not parsing as you expect. qf/pf control the fields searched. If you need fielded searches like you're attempting, you'll need to overhaul how you're doing the parsing. You'll likely also want to tune the mm parameter.

If I remove the qt=dismax, I get results like I should. Can anyone shed some light?

Right, because the default is to use the SolrQueryParser, which supports fielded syntax.

Erik
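A sketch of the same request reshaped for dismax: the bare search terms go in q, the fielded type restriction moves into an fq filter, and qf/pf carry the per-field boosts (values taken from the message above):

/solr/select?qt=dismax&q=akon&fq=type:video&qf=title%5E100+artist%5E150+description%5E5+tags%5E10+artist_title%5E500+featuring_artist%5E20+collaborators%5E50&pf=artist_title&wt=json&start=0&rows=30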
RE: XPath Processing Applied to Clob
Incidentally, I tried adding this:

<datasource name="f" type="FieldReaderDataSource" />
<document>
  <entity dataSource="f" processor="XPathEntityProcessor" dataField="d.text" forEach="/MESSAGE">
    <field column="body" xpath="//BODY" />
  </entity>
</document>

But this didn't seem to change anything. Any insight is appreciated. Thanks. From: Neil Chaudhuri Sent: Wednesday, March 17, 2010 3:24 PM To: solr-user@lucene.apache.org Subject: XPath Processing Applied to Clob I am using the DataImportHandler to index 3 fields in a table: an id, a date, and the text of a document. This is an Oracle database, and the document is an XML document stored as Oracle's xmltype data type. Since this is nothing more than a fancy CLOB, I am using the ClobTransformer to extract the actual XML. However, I don't want to index/store all the XML but instead just the XML within a set of tags. The XPath itself is trivial, but it seems like the XPathEntityProcessor only works for XML file content rather than the output of a Transformer. Here is what I currently have that fails:

<document>
  <entity name="doc" query="SELECT d.EFFECTIVE_DT, d.ARCHIVE_ID, d.XML.getClobVal() AS TEXT FROM DOC d" transformer="ClobTransformer">
    <field column="EFFECTIVE_DT" name="effectiveDate" />
    <field column="ARCHIVE_ID" name="id" />
    <field column="TEXT" name="text" clob="true" />
    <entity name="text" processor="XPathEntityProcessor" forEach="/MESSAGE" url="${doc.text}">
      <field column="body" xpath="//BODY" />
    </entity>
  </entity>
</document>

Is there an easy way to do this without writing my own custom transformer? Thanks.
Re: XPath Processing Applied to Clob
The XPath parser in the DIH is a limited implementation. The unit test program is the only enumeration (that I can find) of what it handles: http://svn.apache.org/repos/asf/lucene/solr/trunk/contrib/dataimporthandler/src/test/java/org/apache/solr/handler/dataimport/TestXPathRecordReader.java //BODY in fact is not allowed, and should throw an Exception, or at least some kind of error message. Perhaps there is one in the logs? On Wed, Mar 17, 2010 at 2:45 PM, Neil Chaudhuri nchaudh...@potomacfusion.com wrote: Incidentally, I tried adding this:

<datasource name="f" type="FieldReaderDataSource" />
<document>
  <entity dataSource="f" processor="XPathEntityProcessor" dataField="d.text" forEach="/MESSAGE">
    <field column="body" xpath="//BODY" />
  </entity>
</document>

But this didn't seem to change anything. Any insight is appreciated. Thanks. From: Neil Chaudhuri Sent: Wednesday, March 17, 2010 3:24 PM To: solr-user@lucene.apache.org Subject: XPath Processing Applied to Clob I am using the DataImportHandler to index 3 fields in a table: an id, a date, and the text of a document. This is an Oracle database, and the document is an XML document stored as Oracle's xmltype data type. Since this is nothing more than a fancy CLOB, I am using the ClobTransformer to extract the actual XML. However, I don't want to index/store all the XML but instead just the XML within a set of tags. The XPath itself is trivial, but it seems like the XPathEntityProcessor only works for XML file content rather than the output of a Transformer. Here is what I currently have that fails:

<document>
  <entity name="doc" query="SELECT d.EFFECTIVE_DT, d.ARCHIVE_ID, d.XML.getClobVal() AS TEXT FROM DOC d" transformer="ClobTransformer">
    <field column="EFFECTIVE_DT" name="effectiveDate" />
    <field column="ARCHIVE_ID" name="id" />
    <field column="TEXT" name="text" clob="true" />
    <entity name="text" processor="XPathEntityProcessor" forEach="/MESSAGE" url="${doc.text}">
      <field column="body" xpath="//BODY" />
    </entity>
  </entity>
</document>

Is there an easy way to do this without writing my own custom transformer? Thanks. -- Lance Norskog goks...@gmail.com
Re: Will Solr fit our needs?
Another option is the ExternalFileField: http://www.lucidimagination.com/search/document/CDRG_ch04_4.4.4?q=ExternalFileField This lets you store the current prices for all items in a separate file. You can only use it in a function query, though. But it does allow you to maintain one Solr index, which is very, very worthwhile. (A concrete sketch follows this thread.) On Wed, Mar 17, 2010 at 4:19 AM, Geert-Jan Brits gbr...@gmail.com wrote: If you don't plan on filtering/sorting and/or faceting on fast-changing fields, it would be better to store them outside of solr/lucene in my opinion. If you must: for indexing-performance reasons you will probably end up maintaining separate indices (one for slow-changing/static fields and one for fast-changing fields). You frequently commit the fast-changing index to incorporate the changes in current_price. Afterwards you have two options, I believe: 1. Use ParallelReader to query the separate indices directly. Afaik, this is not (completely) integrated in Solr... I wouldn't recommend it. 2. After you commit the fast-changing index, merge it with the static index. You're left with one fresh index, which you can push to your slave servers. (All this at regular intervals.) Disadvantages: - In any case, you must be very careful when maintaining multiple parallel indexes with the purpose of treating them as one. For instance, document inserts must be done in exactly the same order, otherwise the indices go 'out of sync' and are unusable. - Higher maintenance. - There is always a time window in which the current_price values are stale. If that's within reqs, that's ok. The other path, which I recommend, would be to store current_price outside of solr (like you're currently doing), but instead of using a relational db, try looking into persistent key-value stores. Many of them exist and a lot of progress has been made in the last couple of years. For simple key lookups (what you need, as far as I can tell) they really blow every relational db out of the water (considering the same hardware, of course). We're currently using Tokyo Cabinet with the server frontend Tokyo Tyrant and seeing almost a 5x increase in lookup performance compared to our previous kv-store, MemcacheDB, which is based on BerkeleyDB. MemcacheDB was already several times faster than our mysql setup (although not optimally tuned). To sum things up: use the best tools for what they were meant to do. - index/search -- solr/lucene, without a doubt. - kv-lookup -- consensus is still forming, with a lot of players (and a lot of different types of functionality), but if all you need is simple key-value lookup, I would go for Tokyo Cabinet (TC) / Tyrant at the moment. Please note that TC and its competitors aren't just some code/hobby projects but are usually born out of a real need at huge websites / social networks; TC, for instance, was born from mixi (a big social network in Japan). So at least you're in good company. For kv-stores I would suggest beginning your research at: http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/ (beginning 2009) http://randomfoo.net/2009/04/20/some-notes-on-distributed-key-stores (half 2009) and getting a feel for the kv playing field. Hope this (pretty long) post helps, Geert-Jan 2010/3/17 Krzysztof Grodzicki krzysztof.grodzi...@iterate.pl Hi Moritz, You can take a look at the project ZOIE - http://code.google.com/p/zoie/. I think it's what you are looking for.
br Krzysztof On Wed, Mar 17, 2010 at 9:49 AM, Moritz Mädler m...@moritz-maedler.de wrote: Hi List, we are running a marketplace which has functionality roughly comparable to eBay's (auctions, fixed-price items, etc). The items are placed on the market by users who want to sell their goods. Currently we are using Sphinx as an indexing engine, but as Sphinx returns only document ids, we have to make a database query to fetch the data to display. This massively decreases performance, as we have to do two requests to display data. I heard that Solr is able to return a complete dataset, and we hope a switch to Solr can boost performance. A critical question is left, and I was not able to find a solution for it in the docs: Is it possible to update attributes directly in the index? An example for better illustration: We have an index which holds all the auctions (containing auctionid, auction title) with their current prices (field: current_price). When a user places a new bid, is it possible to update the attribute 'current_price' directly in the index so that we can fetch the current_price from Solr and not from the database? I hope you understood my problem. It would be kind if someone could point me in the right direction. Thanks a lot! Moritz -- Lance Norskog goks...@gmail.com
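To make Lance's ExternalFileField suggestion concrete, here is a minimal sketch of the schema.xml side, modeled on the Solr 1.4 example schema (the current_price name comes from Moritz's question; the rest of the names and the defVal are assumptions):

<fieldType name="file" class="solr.ExternalFileField" keyField="id" defVal="0" stored="false" indexed="false" valType="pfloat" />
<field name="current_price" type="file" />

The values live outside the index, in a file named external_current_price in the index data directory, with one uniqueKey=value line per document (e.g. auction123=42.50). That file can be rewritten at any time and is picked up when a new searcher opens on commit, with no reindexing. The trade-off is that such a field is only usable from function queries, e.g. q={!func}current_price or a _val_ boost term, not as a regular stored or searchable field.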
Re: XML data in solr field
You can use dynamic fields (wildcard field names) to add any and all element names. You would have to add a suffix to every element name in your preparation, but you will not have to add all of the element names to your schema. On Wed, Mar 17, 2010 at 7:04 AM, Walter Underwood wun...@wunderwood.org wrote: Have you considered an XML database? Because this is exactly what they are designed to do. eXist is open source, or you can use Mark Logic (my employer), which is much faster and more scalable. We do give out free academic and community licenses for Mark Logic. wunder On Mar 16, 2010, at 11:04 PM, Nair, Manas wrote: Thank you Tommy. But the real problem here is that the xml is dynamic and the element names will be different in different docs, which means that there will be a lot of field names to be added to the schema if I were to index those xml nodes separately. Is it possible to have nested indexing (xml within xml) in solr without the overhead of adding all those inner xml nodes as actual fields in the solr schema? Manas From: Tommy Chheng [mailto:tommy.chh...@gmail.com] Sent: Tue 3/16/2010 5:05 PM To: solr-user@lucene.apache.org Subject: Re: XML data in solr field Do you have the option of just importing each xml node as a field/value when you add the document? That'll let you do the search easily. If you need to store the raw XML, you can use an extra field. Tommy Chheng Programmer and UC Irvine Graduate Student Twitter @tommychheng http://tommy.chheng.com On 3/16/10 12:59 PM, Nair, Manas wrote: Hello Experts, I need help on this issue of mine. I am unsure if this scenario is possible. I have a field in my solr document named inputxml, the value of which is an xml string as below. This xml structure is within the inputxml field value. I needed help on searching this xml structure, i.e. if I search for Venue, I should get Radio City Music Hall as the result and not the complete tag like <Venue value="Radio City Music Hall" />. Is this supported in solr? If it is, how can this be implemented?

<root>
  <Venue value="Radio City Music Hall" />
  <Link value="http://bit.ly/Rndab" />
  <LinkText value="En savoir +" />
  <Address value="New-York, USA" />
</root>

Any help is appreciated. I do not need the tag name in the result, instead I need the tag value. Thanks in advance, Manas Nair -- Lance Norskog goks...@gmail.com
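A minimal sketch of Lance's dynamic-field suggestion (the *_t suffix is an assumption; any wildcard pattern works): one declaration in schema.xml covers every element name the incoming XML might contain,

<dynamicField name="*_t" type="text" indexed="true" stored="true" />

and the preparation step flattens each element into a suffixed field when the document is added:

<add>
  <doc>
    <field name="Venue_t">Radio City Music Hall</field>
    <field name="Address_t">New-York, USA</field>
    <field name="inputxml">...the raw XML, if it still needs to be stored...</field>
  </doc>
</add>

A search on Venue_t then returns the element's value rather than the tag, with no per-element schema edits.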
Re: Exception encountered during replication on slave....Any clues?
The localhost URLs have no port numbers. Is there a more complete error in the logs? On Wed, Mar 17, 2010 at 9:15 AM, JavaGuy84 bbar...@gmail.com wrote: Hi William, We are facing the same issue as yourself.. just thought of checking if you had already resolved this issue? Thanks, Barani William Pierce-3 wrote: Folks: I am seeing this exception in my logs that is causing my replication to fail. I start with a clean slate (empty data directory). I index the data on the postingsmaster using the dataimport handler and it succeeds. When the replication slave attempts to replicate, it encounters this error. Dec 7, 2009 9:20:00 PM org.apache.solr.handler.SnapPuller fetchLatestIndex SEVERE: Master at: http://localhost/postingsmaster/replication is not available. Index fetch failed. Exception: Invalid version or the data in not in 'javabin' format Any clues as to what I should look for to debug this further? Replication is enabled as follows. The postingsmaster solrconfig.xml looks as follows:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <!-- Replicate on 'optimize'; it can also be 'commit' -->
    <str name="replicateAfter">commit</str>
    <!-- If configuration files need to be replicated, give the names here, comma separated -->
    <str name="confFiles"></str>
  </lst>
</requestHandler>

The postings slave solrconfig.xml looks as follows:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- fully qualified url for the replication handler of master -->
    <str name="masterUrl">http://localhost/postingsmaster/replication</str>
    <!-- Interval in which the slave should poll master. Format is HH:mm:ss. If this is absent, the slave does not poll automatically. But a snappull can be triggered from the admin or the http API -->
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>

Thanks, - Bill -- View this message in context: http://old.nabble.com/Exception-encountered-during-replication-on-slaveAny-clues--tp26684769p27933575.html Sent from the Solr - User mailing list archive at Nabble.com. -- Lance Norskog goks...@gmail.com
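If the missing port is in fact the problem, the fix would be a fully qualified masterUrl on the slave; a one-line sketch assuming the default Jetty port (8983 is an assumption, not from the thread -- use whatever port the master actually listens on):

<str name="masterUrl">http://localhost:8983/postingsmaster/replication</str>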
Re: Replication failed due to HTTP PROXY?
A 5-second connection timeout is not going to work trans-globally. The replication engine is generally tested at local sites. If it is possible to set defaults for the Apache Commons http classes via system properties, that might let this work. This doc does not seem promising: http://www.jdocs.com/httpclient/3.0.1/api-index.html?m=package&p=org.apache.commons.httpclient&render=classic On Wed, Mar 17, 2010 at 9:22 AM, JavaGuy84 bbar...@gmail.com wrote: Hi, One of my colleagues back in India is not able to replicate the index present on the servers (USA). I am now wondering if this is due to a proxy-related issue? He is getting the below-mentioned error message. Is there a way to configure a PROXY in the SOLR config files? Server logs:

INFO: [] Registered new searcher searc...@edf730 main
Mar 17, 2010 8:38:06 PM org.apache.solr.handler.ReplicationHandler getReplicationDetails
WARNING: Exception while invoking 'details' method for replication on master
org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 5000 ms
        at org.apache.commons.httpclient.protocol.ReflectionSocketFactory.createSocket(ReflectionSocketFactory.java:155)
        at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:125)
        at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
        at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1361)
        at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
        at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
        at org.apache.solr.handler.SnapPuller.getNamedListResponse(SnapPuller.java:193)
        at org.apache.solr.handler.SnapPuller.getCommandResponse(SnapPuller.java:188)
        at org.apache.solr.handler.ReplicationHandler.getReplicationDetails(ReplicationHandler.java:581)
        at org.apache.solr.handler.ReplicationHandler.handleRequestBody(ReplicationHandler.java:180)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.jsp.admin.replication.index_jsp.executeCommand(org.apache.jsp.admin.replication.index_jsp:50)
        at org.apache.jsp.admin.replication.index_jsp._jspService(org.apache.jsp.admin.replication.index_jsp:231)
        at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:80)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:373)
        at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:464)
        at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:358)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:487)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:367)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        at org.mortbay.jetty.servlet.Dispatcher.forward(Dispatcher.java:268)
        at org.mortbay.jetty.servlet.Dispatcher.forward(Dispatcher.java:126)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:264)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
        at org.mortbay.jetty.Server.handle(Server.java:285)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
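For what it's worth, the standard JVM proxy switches look like the sketch below (host and port are placeholders). The caveat is exactly the doubt Lance raises: Commons HttpClient 3.x, which SnapPuller uses here, configures its proxy programmatically and is not guaranteed to honor these properties, so treat this as an experiment rather than a known fix:

java -Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 -jar start.jar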
Re: Solr Performance Issues
Try cutting back Solr's memory - the OS knows how to manage disk caches better than Solr does. Another approach is to raise and lower the queryResultCache and see if the hit ratio changes. On Wed, Mar 17, 2010 at 9:44 AM, Siddhant Goel siddhantg...@gmail.com wrote: Hi, Apparently the bottleneck seems to be the time periods when the CPU is waiting to do some I/O. Out of all the numbers I can see, the CPU wait times for I/O seem to be the highest. I've allotted 4GB to Solr out of the total 8GB available. There's only 47MB free on the machine, so I assume the rest of the memory is being used for OS disk caches. In addition, the hit ratio for the queryResultCache isn't going beyond 20%. So the problem, I think, is not at Solr's end. Are there any pointers available on how I can resolve such issues related to disk I/O? Does this mean I need more overall memory? Or would reducing the amount of memory allocated to Solr, so that the disk cache has more memory, help? Thanks, On Fri, Mar 12, 2010 at 11:21 PM, Erick Erickson erickerick...@gmail.com wrote: Sounds like you're pretty well on your way then. This is pretty typical of multi-threaded situations... Threads 1-n wait around on I/O, and increasing the number of threads increases throughput without changing (much) the individual response time. Threads n+1 to p don't change throughput much, but increase the response time for each request. On aggregate, though, the throughput doesn't change (much). Adding threads after p+1 *decreases* throughput while *increasing* individual response time as your processors start spending way too much time context-switching and/or memory swapping. The trick is finding out what n and p are <g>. Best Erick On Fri, Mar 12, 2010 at 12:06 PM, Siddhant Goel siddhantg...@gmail.com wrote: Hi, Thanks for your responses. It actually feels good to be able to locate where the bottlenecks are. I've created two sets of data - in the first one I'm measuring the time taken purely on Solr's end, and in the other one I'm including network latency (just for reference). The data that I'm posting below contains the time taken purely by Solr. I'm running 10 threads simultaneously and the average response time (for each query in each thread) remains close to 40 to 50 ms. But as soon as I increase the number of threads to something like 100, the response time goes up to ~600ms, and further up when the number of threads is close to 500. Yes, the average time definitely depends on the number of concurrent requests. Going from memory, debugQuery=on will let you know how much time was spent in various operations in SOLR. It's important to know whether it was the searching, assembling the response, or transmitting the data back to the client. I just tried this. The information that it gives me for a query that took 7165ms is - http://pastebin.ca/1835644 So out of the total time of 7165ms, QueryComponent took most of the time. Plus I can see the load average going up when the number of threads is really high. So it actually makes sense. (I didn't add any other component while searching; it was a plain /select?q=query call.) Like I mentioned earlier in this mail, I'm maintaining separate sets for data with/without network latency, and I don't think it's the bottleneck. How many threads does it take to peg the CPU? And what response times are you getting when your number of threads is around 10? If the number of threads is greater than 100, that really takes its toll on the CPU. So probably that's the number.
When the number of threads is around 10, the response times average to something like 60ms (and 95% of the queries fall within 100ms of that value). Thanks, Erick On Fri, Mar 12, 2010 at 3:39 AM, Siddhant Goel siddhantg...@gmail.com wrote: I've allocated 4GB to Solr, so the rest of the 4GB is free for the OS disk caching. I think that at any point of time, there can be at most as many concurrent requests as there are threads, which happens to make sense btw (does it?). As I increase the number of threads, the load average shown by top goes up to as high as 80%. But if I keep the number of threads low (~10), the load average never goes beyond ~8. So probably that's the number of requests I can expect Solr to serve concurrently on this index size with this hardware. Can anyone give a general opinion as to how much hardware should be sufficient for a Solr deployment with an index size of ~43GB, containing around 2.5 million documents? I'm expecting it to serve at least 20 requests per second. Any experiences? Thanks On Fri, Mar 12, 2010 at 12:47 AM, Tom Burton-West tburtonw...@gmail.com wrote: How much of your memory are you allocating to the JVM and how much are you leaving free?
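As a concrete reading of the "cut back Solr's memory" advice above: lowering the JVM heap flags leaves more RAM for the OS page cache, which is what actually serves index I/O. A sketch assuming the stock Jetty start.jar launcher (the 2g figure is purely illustrative, not a recommendation for this particular index):

java -Xms2g -Xmx2g -jar start.jar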
Re: Indexing CLOB Column in Oracle
This could be the problem: the text field in the example schema is indexed, but not stored. If you query the index with text:monkeys it will find records with monkeys, but the text field will not appear in the returned XML because it was not stored. On Wed, Mar 17, 2010 at 11:17 AM, Neil Chaudhuri nchaudh...@potomacfusion.com wrote: For those who might encounter a similar issue, merging what I had into a single entity and using getClobVal() did the trick. In other words:

<document>
  <entity name="doc" query="SELECT d.EFFECTIVE_DT, d.ARCHIVE_ID, d.XML.getClobVal() AS TEXT FROM DOC d" transformer="ClobTransformer">
    <field column="EFFECTIVE_DT" name="effectiveDate" />
    <field column="ARCHIVE_ID" name="id" />
    <field column="TEXT" name="text" clob="true" />
  </entity>
</document>

Thanks. -Original Message- From: Craig Christman [mailto:cchrist...@caci.com] Sent: Wednesday, March 17, 2010 11:23 AM To: solr-user@lucene.apache.org Subject: RE: Indexing CLOB Column in Oracle To convert an XMLTYPE to CLOB use the getClobVal() method like this: SELECT d.XML.getClobVal() FROM DOC d WHERE d.ARCHIVE_ID = '${doc.ARCHIVE_ID}' -Original Message- From: Shawn Heisey [mailto:s...@elyograg.org] Sent: Tuesday, March 16, 2010 7:37 PM To: solr-user@lucene.apache.org Subject: Re: Indexing CLOB Column in Oracle Disclaimer: My Oracle experience is minuscule at best. I am also a beginner at Solr, so grab yourself the proverbial grain of salt. I googled a bit on CLOB. One page I found mentioned setting up a view to return the data type you want. Can you use the functions described on these pages in either the Solr query or a view? http://www.oradev.com/dbms_lob.jsp http://www.dba-oracle.com/t_dbms_lob.htm http://www.praetoriate.com/dbms_packages/ddp_dbms_lob.htm I also was trying to find a way to convert from xmltype directly to a string in a query, but that quickly got way over my level of understanding. I saw hints that it is possible, though. Shawn On 3/16/2010 4:59 PM, Neil Chaudhuri wrote: Since my original thread was straying to a new topic, I thought it made sense to create a new thread of discussion. I am using the DataImportHandler to index 3 fields in a table: an id, a date, and the text of a document. This is an Oracle database, and the document is an XML document stored as Oracle's xmltype data type, which is an instance of oracle.sql.OPAQUE. Still, it is nothing more than a fancy clob. -- Lance Norskog goks...@gmail.com
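To make the stored-vs-indexed point concrete: for the extracted text to come back in query responses, the field declaration in schema.xml needs stored="true"; a one-line sketch, assuming the text field type from the example schema:

<field name="text" type="text" indexed="true" stored="true" />

The trade-off is index size, since the full document text is then kept verbatim alongside the inverted index.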
Re: Dummy boost question
: I want to *search* on title and content, and then, within these results *boost* by keyword. ... : <str name="bq">keyword:(*.*)^1.0</str> : : But I'm fairly sure that this is boosting on all keywords (not just ones matching my search term) correct. : Does anyone know how to achieve what I want (I'm using the DisMax query request handler btw.) Hmmm... you could use the pf param to specify your keywords field (if you aren't already) so that queries where the entire search string matches the keyword field are boosted (it's not clear to me if that's what you want or not). Alternately, you could specify the bq at query time and copy your search terms into it. Actually ... i've never tried it, but something like this might work...

<str name="bq">{!lucene df=keyword v=$q}</str>

...that should use the LocalParams dereferencing feature to use the q param as the value of the bq param (typically it's used to bake localparams into the config while pulling some other param from the request -- but i can't think of any reason why you can't use it to access q directly). -Hoss
Re: indexing key/value field type
: tags<key,value>, where key is a String and value is an Int. : key is a given tag and value is a count of how many users used this tag for : a given document. : : How can I index and store a key/value type of field? such that one can : search on the values as well as keys of this field. It depends on what types of searches you want to do. Some people only care about searching on the tag string and just want the numeric value to boost the score -- in which case Payloads work really well (and there's already a Tokenizer that makes it easy to index the pairs, but i think you still need a custom QParser to query them). If you actually want to be able to apply arbitrary numeric constraints (ie: find all docs where more than 13 and less than 34 people applied the tag 'food') then things get a lot more complicated ... you can do it with parallel fields (ie: the tags in one multiValued string field, and the numbers in another multiValued int field) but then you really have to write a lot of custom query code to pay attention to the position info when evaluating matches. : I have looked at FAQs, where one mailing-list suggests using the dynamic : field type such as: : : <dynamicField name="tags_*" type="string" indexed="true" stored="true" omitNorms="true" /> : : but how would we search on the dynamic field names? tags_food:[13 TO 34] ...if you want to know if a document has a tag at all, you could use something like tags_food:[* TO *] or lump all the tag strings into a tags field as well (tags:food). -Hoss
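One caveat worth sketching on the dynamic-field route: with type="string", a range like tags_food:[13 TO 34] compares lexicographically (so "2" would match but "100" would not). For true numeric semantics, the wildcard field needs an integer type; a sketch assuming the tint type from the Solr 1.4 example schema:

<dynamicField name="tags_*" type="tint" indexed="true" stored="true" />

A document then carries e.g. <field name="tags_food">17</field> and matches tags_food:[13 TO 34] as a proper numeric range.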
Re: XPath Processing Applied to Clob
Keep in mind that the xpath is case-sensitive. Paste a sample xml. What is dataField="d.text"? It does not seem to refer to anything -- where is the enclosing entity? Did you mean dataField="doc.text"? xpath="//BODY" is a supported syntax as long as you are using Solr 1.4 or higher. On Thu, Mar 18, 2010 at 3:15 AM, Neil Chaudhuri nchaudh...@potomacfusion.com wrote: Incidentally, I tried adding this:

<datasource name="f" type="FieldReaderDataSource" />
<document>
  <entity dataSource="f" processor="XPathEntityProcessor" dataField="d.text" forEach="/MESSAGE">
    <field column="body" xpath="//BODY" />
  </entity>
</document>

But this didn't seem to change anything. Any insight is appreciated. Thanks. From: Neil Chaudhuri Sent: Wednesday, March 17, 2010 3:24 PM To: solr-user@lucene.apache.org Subject: XPath Processing Applied to Clob I am using the DataImportHandler to index 3 fields in a table: an id, a date, and the text of a document. This is an Oracle database, and the document is an XML document stored as Oracle's xmltype data type. Since this is nothing more than a fancy CLOB, I am using the ClobTransformer to extract the actual XML. However, I don't want to index/store all the XML but instead just the XML within a set of tags. The XPath itself is trivial, but it seems like the XPathEntityProcessor only works for XML file content rather than the output of a Transformer. Here is what I currently have that fails:

<document>
  <entity name="doc" query="SELECT d.EFFECTIVE_DT, d.ARCHIVE_ID, d.XML.getClobVal() AS TEXT FROM DOC d" transformer="ClobTransformer">
    <field column="EFFECTIVE_DT" name="effectiveDate" />
    <field column="ARCHIVE_ID" name="id" />
    <field column="TEXT" name="text" clob="true" />
    <entity name="text" processor="XPathEntityProcessor" forEach="/MESSAGE" url="${doc.text}">
      <field column="body" xpath="//BODY" />
    </entity>
  </entity>
</document>

Is there an easy way to do this without writing my own custom transformer? Thanks. -- - Noble Paul | Systems Architect| AOL | http://aol.com
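Pulling Noble's corrections together (FieldReaderDataSource on the inner entity, with dataField referencing the enclosing entity), a sketch of what the data-config might look like. This is untested; the inner entity name is mine, it assumes the JDBC dataSource for the outer entity is defined elsewhere as the default, and the case of the reference (doc.text vs. the uppercase column alias TEXT) is one of the things to verify:

<dataConfig>
  <dataSource name="f" type="FieldReaderDataSource" />
  <document>
    <entity name="doc" query="SELECT d.EFFECTIVE_DT, d.ARCHIVE_ID, d.XML.getClobVal() AS TEXT FROM DOC d" transformer="ClobTransformer">
      <field column="EFFECTIVE_DT" name="effectiveDate" />
      <field column="ARCHIVE_ID" name="id" />
      <field column="TEXT" name="text" clob="true" />
      <!-- inner entity reads the already-transformed clob from the enclosing row -->
      <entity name="msg" dataSource="f" processor="XPathEntityProcessor" dataField="doc.text" forEach="/MESSAGE">
        <field column="body" xpath="//BODY" />
      </entity>
    </entity>
  </document>
</dataConfig>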
What is the use of Solr configuration in Katta master and nodes after integrating katta into Solr
Hi All, Can somebody please explain what the Solr configuration in the Katta master and nodes is used for after integrating Katta into Solr (the SOLR-1395 patch)? Thanks, vsreddy
Re: What is the use of Solr configuration in Katta master and nodes after integrating katta into Solr
The katta master is set up to act as a solr master server. The config there is meant to be set up to distribute requests to the individual shards. The solr config in the nodes is the default config used to start the solr instance in each node. On 3/17/10 9:05 PM, V SudershanReddy vsre...@huawei.com wrote: Hi All, Can somebody please explain what the Solr configuration in the Katta master and nodes is used for after integrating Katta into Solr (the SOLR-1395 patch)? Thanks, vsreddy