DIH delta import - last modified date
I am struggling with the concept of delta import in DIH. According to the documentation, the delta import will automatically record the last index timestamp and make it available for use in the delta query. However, in many cases, when the last_modified timestamp in the database lags behind the current time, the last index timestamp is not good for the delta query. Can I pick a different mechanism to generate last_index_time, using a timestamp computed from the database (such as from a column of the database)? -- View this message in context: http://old.nabble.com/DIH-delta-import---last-modified-date-tp27231449p27231449.html Sent from the Solr - User mailing list archive at Nabble.com.
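One possible workaround (a sketch only; the table, column, and tracking-table names below are hypothetical) is to stop relying on ${dataimporter.last_index_time} altogether and let the database compute its own watermark inside the deltaQuery, e.g. against a small tracking table you maintain yourself:

```xml
<!-- data-config.xml sketch: the watermark comes from the database,
     not from Solr's recorded last_index_time -->
<entity name="item" pk="id"
        query="SELECT * FROM item"
        deltaQuery="SELECT id FROM item
                    WHERE last_modified &gt;
                          (SELECT last_indexed FROM index_watermark)"
        deltaImportQuery="SELECT * FROM item WHERE id='${dih.delta.id}'"/>
```

After each successful import you would update index_watermark yourself (for example to MAX(last_modified) over the rows just indexed), so lagging database timestamps are only ever compared against database time.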
DIH - Export to XML
For the Data Import Handler, is there a way to dump data to a Solr XML feed format file?
Re: Google Side-By-Side UI
Yes. I think it would be a very helpful tool for tuning search relevancy - you can do a controlled experiment with your target audiences to understand their responses to parameter changes. We plan to use this feature to benchmark Lucene/Solr against our in-house commercial search engine - it will be an interesting test. Lance Norskog-2 wrote: http://googleenterprise.blogspot.com/2009/08/compare-enterprise-search-relevance.html This is really cool, and a version for Solr would help in doing relevance experiments. We don't need the select A or B feature; just seeing search result sets side-by-side would be great. -- Lance Norskog goks...@gmail.com
Re: Item Facet
Are your product_name* fields numeric fields (integer or float)? Dals wrote: Hi... Is there any way to group values like shopping.yahoo.com or shopper.cnet.com do? For instance, I have documents like: doc1 - product_name1 - value1 doc2 - product_name1 - value2 doc3 - product_name1 - value3 doc4 - product_name2 - value4 doc5 - product_name2 - value5 doc6 - product_name2 - value6 I'd like to have a result grouped by product name, with the value range per product. Something like: product_name1 - (value1 to value3) product_name2 - (value4 to value6) It is not like the current facet, because the information is grouped by item, not by the entire result. Any idea? Thanks! David Lojudice Sobrinho
Re: Limiting facets for huge data - setting indexed=false in schema.xml
Having a large number of fields is not the same as having a large number of facets. Facets are something you would display to users as an aid for query refinement or navigation. There is no way for a user to use 3700 facets at the same time. So it is more a question of how to determine which facets to fetch at search time, based on the user's actions or on certain predefined configurations. I have written an application with 30-some facetable fields on millions of records, and I also ran into the issue of calculating all facets, since server resources limit the number of caches available and the CPU cycles available for facet calculations. I then realized: why display all these facets regardless of whether the user wants to see them or not? I then changed the approach to fetch only a minimum set of facets by default and make the rest of the facet fields open on-demand (using AJAX). I was able to dramatically improve response time by spreading the facet loading over time. There are still issues with total facet cache size when you have a large number of available facets, but you need to realistically evaluate what it means to a user to have a large number of facets. I don't think that, on a typical user interface, having more than 10 filters showing at the same time will be any more effective than having a small number of filters to begin with and progressively showing more on-demand (hierarchical facets?). Rahul R wrote: Hello, We are trying to get Solr to work for a really huge parts database. Details of the database: - 55 million parts - 3700 properties (facets) in total, but each record will not have a value for all properties - Most of these facets are defined as dynamic fields within the Solr index. We were getting really unacceptable timings while doing faceting/searches on an index created with this database. With only one user using the system, query times are in excess of 1 minute. With more users concurrently using the system, the response times are even higher.
We thought that by limiting the number of properties that are available for faceting, the performance could be improved. To test this, we enabled only 6 properties for faceting by setting indexed=true (in schema.xml) for only these properties. All other properties, which are defined as dynamic properties, had indexed=false. The observations after this change: - Index size reduced by a meagre 5% only - Performance did not improve. In fact, during the PSR run we observed that it degraded. My questions: - Will reducing the number of facets improve faceting and search performance? - Is there a better way to reduce the number of facets? - Will having a large number of properties defined as dynamic fields reduce performance? Thank you. Regards Rahul
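The on-demand approach described above can be driven entirely by request parameters (the field names below are made up for illustration): request only a small default facet set on the initial search, then fetch an additional facet lazily from the AJAX call with rows=0 so no documents are returned again:

```
# initial search: only the default facets
/solr/select?q=part&facet=true&facet.field=brand&facet.field=family&facet.limit=10

# user expands one more facet: same query, counts only, no docs
/solr/select?q=part&rows=0&facet=true&facet.field=voltage_rating&facet.limit=10
```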
Re: Solr's MLT query call doesn't work
A couple of things: your mlt.fl value must be part of fl. In this case, content_mlt is not included in fl. I also think the fl parameter value needs to be comma separated; try fl=title,author,content_mlt,score -Yao SergeyG wrote: Hi, Recently, while implementing the MoreLikeThis search, I've run into a situation where Solr's mlt query calls don't work. More specifically, the following query: http://localhost:8080/solr/select?q=id:10&mlt=true&mlt.fl=content_mlt&mlt.maxqt=5&mlt.interestingTerms=details&fl=title+author+score brings back just the doc with id=10 and nothing else. While using the GetMethod approach (putting /mlt explicitly into the url), I got back some results. I've been trying to solve this problem for more than a week with no luck. If anybody has any hint, please help. Below, I put log outputs from 3 runs: a) Solr; b) GetMethod (/mlt); c) GetMethod (/select). Thanks a lot. Regards, Sergey Goldberg Here are the logs: a) Solr (http://localhost:8080/solr/select) 08.07.2009 15:50:33 org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/select params={fl=title+author+score&mlt.fl=content_mlt&q=id:10&mlt=true&mlt.interestingTerms=details&mlt.maxqt=5&wt=javabin&version=2.2} hits=1 status=0 QTime=172 INFO MLTSearchRequestProcessor:49 - SolrServer url: http://localhost:8080/solr INFO MLTSearchRequestProcessor:67 - solrQuery q=id%3A10&mlt=true&mlt.fl=content_mlt&mlt.maxqt=5&mlt.interestingTerms=details&fl=title+author+score INFO MLTSearchRequestProcessor:73 - Number of docs found = 1 INFO MLTSearchRequestProcessor:77 - title = SG_Book; score = 2.098612 b) GetMethod (http://localhost:8080/solr/mlt) 08.07.2009 16:55:44 org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/mlt params={fl=title+author+score&mlt.fl=content_mlt&q=id:10&mlt.maxqt=5&mlt.interestingTerms=details} status=0 QTime=15 INFO MLT2SearchRequestProcessor:76 - <?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int></lst>
  <result name="match" numFound="1" start="0" maxScore="2.098612">
    <doc><float name="score">2.098612</float><arr name="author"><str>S.G.</str></arr><str name="title">SG_Book</str></doc>
  </result>
  <result name="response" numFound="4" start="0" maxScore="0.28923997">
    <doc><float name="score">0.28923997</float><arr name="author"><str>O. Henry</str><str>S.G.</str></arr><str name="title">Four Million, The</str></doc>
    <doc><float name="score">0.08667877</float><arr name="author"><str>Katherine Mosby</str></arr><str name="title">The Season of Lillian Dawes</str></doc>
    <doc><float name="score">0.07947738</float><arr name="author"><str>Jerome K. Jerome</str></arr><str name="title">Three Men in a Boat</str></doc>
    <doc><float name="score">0.047219563</float><arr name="author"><str>Charles Oliver</str><str>S.G.</str></arr><str name="title">ABC's of Science</str></doc>
  </result>
  <lst name="interestingTerms"><float name="content_mlt:ye">1.0</float><float name="content_mlt:tobin">1.0</float><float name="content_mlt:a">1.0</float><float name="content_mlt:i">1.0</float><float name="content_mlt:his">1.0</float></lst>
</response>
c) GetMethod (http://localhost:8080/solr/select) 08.07.2009 17:06:45 org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/select params={fl=title+author+score&mlt.fl=content_mlt&q=id:10&mlt.maxqt=5&mlt.interestingTerms=details} hits=1 status=0 QTime=16 INFO MLT2SearchRequestProcessor:80 - <?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader"><int name="status">0</int><int name="QTime">16</int>
    <lst name="params"><str name="fl">title author score</str><str name="mlt.fl">content_mlt</str><str name="q">id:10</str><str name="mlt.maxqt">5</str><str name="mlt.interestingTerms">details</str></lst>
  </lst>
  <result name="response" numFound="1" start="0" maxScore="2.098612">
    <doc><float name="score">2.098612</float><arr name="author"><str>S.G.</str></arr><str name="title">SG_Book</str></doc>
  </result>
  <lst name="debug">
    <str name="rawquerystring">id:10</str><str name="querystring">id:10</str><str name="parsedquery">id:10</str><str name="parsedquery_toString">id:10</str>
    <lst name="explain"><str name="10">2.098612 = (MATCH) weight(id:10 in 3), product of: 0.9994 = queryWeight(id:10), product of: 2.0986123 = idf(docFreq=1, numDocs=5) 0.47650534 = queryNorm 2.0986123 = (MATCH) fieldWeight(id:10 in 3), product of: 1.0 = tf(termFreq(id:10)=1) 2.0986123 = idf(docFreq=1, numDocs=5) 1.0 = fieldNorm(field=id, doc=3)</str></lst>
    <str name="QParser">OldLuceneQParser</str>
    <lst name="timing"><double name="time">16.0</double>
      <lst name="prepare"><double name="time">0.0</double>
        <lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.FacetComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.MoreLikeThisComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.HighlightComponent"><double name="time">0.0</double></lst>
        <lst name="org.apache.solr.handler.component.DebugComponent"><double name="time">0.0</double></lst>
      </lst>
      <lst name="process"><double name="time">16.0</double>
        <lst name="org.apache.solr.handler.component.QueryComponent"><double name="time">0.0</double></lst>
Re: Filtering MoreLikeThis results
I am not sure about the parameters for the MLT request handler plugin. Can one of you share the solrconfig.xml entry for MLT? Thanks in advance. -Yao Bill Au wrote: I have been using the StandardRequestHandler (i.e. /solr/select). fq does work with the MoreLikeThisHandler. I will switch to use that. Thanks. Bill On Tue, Jul 7, 2009 at 11:02 AM, Marc Sturlese marc.sturl...@gmail.com wrote: At least in trunk, if you request: http://localhost:8084/solr/core_A/mlt?q=id:7468365&fq=price:[100 TO 200] it will filter the MoreLikeThis results. Bill Au wrote: I think fq only works on the main response, not the mlt matches. I found a couple of related jira issues: http://issues.apache.org/jira/browse/SOLR-295 http://issues.apache.org/jira/browse/SOLR-281 If I am reading them correctly, I should be able to use DisMax and MoreLikeThis together. I will give that a try and report back. Bill On Tue, Jul 7, 2009 at 4:45 AM, Marc Sturlese marc.sturl...@gmail.com wrote: Using the MoreLikeThisHandler you can use fq to filter your results. As far as I know bq is not allowed. Bill Au wrote: I have been trying to restrict MoreLikeThis results without any luck also. In addition to restricting the results, I am also looking to influence the scores similar to the way boost query (bq) works in the DisMaxRequestHandler. I think Solr's MoreLikeThis depends on Lucene's contrib queries MoreLikeThis, or at least it used to. Has anyone looked into enhancing Solr's MoreLikeThis to support bq and restricting mlt results? Bill On Mon, Jul 6, 2009 at 2:16 PM, Yao Ge yao...@gmail.com wrote: I could not find any support from http://wiki.apache.org/solr/MoreLikeThis on how to restrict MLT results to certain subsets. I passed along an fq parameter and it is ignored. Since we cannot incorporate the filters in the query itself, which is used to retrieve the target for similarity comparison, it appears there is no way to filter MLT results. BTW.
I am using Solr 1.3. Please let me know if there is a way (other than hacking the source code) to do this. Thanks!
Re: Filtering MoreLikeThis results
The answer to my own question:

<requestHandler name="mlt" class="solr.MoreLikeThisHandler">
  <lst name="defaults"/>
</requestHandler>

would work. -Yao Yao Ge wrote: I am not sure about the parameters for the MLT request handler plugin. Can one of you share the solrconfig.xml entry for MLT? Thanks in advance. -Yao Bill Au wrote: I have been using the StandardRequestHandler (i.e. /solr/select). fq does work with the MoreLikeThisHandler. I will switch to use that. Thanks. Bill On Tue, Jul 7, 2009 at 11:02 AM, Marc Sturlese marc.sturl...@gmail.com wrote: At least in trunk, if you request: http://localhost:8084/solr/core_A/mlt?q=id:7468365&fq=price:[100 TO 200] it will filter the MoreLikeThis results. Bill Au wrote: I think fq only works on the main response, not the mlt matches. I found a couple of related jira issues: http://issues.apache.org/jira/browse/SOLR-295 http://issues.apache.org/jira/browse/SOLR-281 If I am reading them correctly, I should be able to use DisMax and MoreLikeThis together. I will give that a try and report back. Bill On Tue, Jul 7, 2009 at 4:45 AM, Marc Sturlese marc.sturl...@gmail.com wrote: Using the MoreLikeThisHandler you can use fq to filter your results. As far as I know bq is not allowed. Bill Au wrote: I have been trying to restrict MoreLikeThis results without any luck also. In addition to restricting the results, I am also looking to influence the scores similar to the way boost query (bq) works in the DisMaxRequestHandler. I think Solr's MoreLikeThis depends on Lucene's contrib queries MoreLikeThis, or at least it used to. Has anyone looked into enhancing Solr's MoreLikeThis to support bq and restricting mlt results? Bill On Mon, Jul 6, 2009 at 2:16 PM, Yao Ge yao...@gmail.com wrote: I could not find any support from http://wiki.apache.org/solr/MoreLikeThis on how to restrict MLT results to certain subsets. I passed along an fq parameter and it is ignored.
Since we cannot incorporate the filters in the query itself, which is used to retrieve the target for similarity comparison, it appears there is no way to filter MLT results. BTW, I am using Solr 1.3. Please let me know if there is a way (other than hacking the source code) to do this. Thanks!
Re: Faceting with MoreLikeThis
Faceting on MLT results requires the use of the MoreLikeThisHandler. The standard request handler, while providing support for MLT via a search component, does not return facets on MLT results. To enable the MLT handler, add an entry like the one below to your solrconfig.xml:

<requestHandler name="mlt" class="solr.MoreLikeThisHandler">
  <lst name="defaults"/>
</requestHandler>

The query parameter syntax for faceting remains the same as with the standard request handler. -Yao Yao Ge wrote: Does Solr support faceting on MoreLikeThis search results?
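A slightly fuller entry (the defaults shown below are illustrative only, not required; mlt.fl must name a suitable field from your own schema, e.g. one that is stored or has term vectors):

```xml
<requestHandler name="mlt" class="solr.MoreLikeThisHandler">
  <lst name="defaults">
    <str name="mlt.fl">content_mlt</str>
    <int name="mlt.mintf">2</int>
    <int name="mlt.mindf">5</int>
  </lst>
</requestHandler>
```

A faceted MLT request then looks like /solr/mlt?q=id:10&facet=true&facet.field=category&facet.limit=10, with the facet counts computed over the MLT result set.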
Re: A big question about Solr and SolrJ range query ?
Use Solr's filter query parameter fq: fq=x:[10 TO 100]&fq=y:[20 TO 300]&fl=title -Yao huenzhao wrote: Hi all: Suppose that my index has 3 fields: title, x and y. I know one range (10 < x < 100) can be queried like this: http://localhost:8983/solr/select?q=x:[10 TO 100]&fl=title I want a two-range query (10 < x < 100 AND 20 < y < 300) like the SQL (select title where x > 10 and x < 100 and y > 20 and y < 300) using a Solr range query or SolrJ, but I do not know how to implement it. Anybody know? Thanks Email: enzhao...@gmail.com
Re: about defaultSearchField
Try with fl=* or fl=*,score added to your request string. -Yao Yang Lin-2 wrote: Hi, I have some problems. For my Solr program, I want to type only the query string and get all field results that include the query string. But now I can't get any result without a specified field. For example, a query for tina gets nothing, but Sentence:tina does. I have adjusted the schema.xml like this:

<fields>
  <field name="CategoryNamePolarity" type="text" indexed="true" stored="true" multiValued="true"/>
  <field name="CategoryNameStrenth" type="text" indexed="true" stored="true" multiValued="true"/>
  <field name="CategoryNameSubjectivity" type="text" indexed="true" stored="true" multiValued="true"/>
  <field name="Sentence" type="text" indexed="true" stored="true" multiValued="true"/>
  <field name="allText" type="text" indexed="true" stored="true" multiValued="true"/>
</fields>
<uniqueKey required="false">Sentence</uniqueKey>
<!-- field for the QueryParser to use when an explicit fieldname is absent -->
<defaultSearchField>allText</defaultSearchField>
<!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
<solrQueryParser defaultOperator="OR"/>
<copyField source="CategoryNamePolarity" dest="allText"/>
<copyField source="CategoryNameStrenth" dest="allText"/>
<copyField source="CategoryNameSubjectivity" dest="allText"/>
<copyField source="Sentence" dest="allText"/>

I think the problem is in defaultSearchField, but I don't know how to fix it. Could anyone help me? Thanks Yang
Faceting with MoreLikeThis
Does Solr support faceting on MoreLikeThis search results?
Filtering MoreLikeThis results
I could not find any support from http://wiki.apache.org/solr/MoreLikeThis on how to restrict MLT results to certain subsets. I passed along an fq parameter and it is ignored. Since we cannot incorporate the filters in the query itself, which is used to retrieve the target for similarity comparison, it appears there is no way to filter MLT results. BTW, I am using Solr 1.3. Please let me know if there is a way (other than hacking the source code) to do this. Thanks!
Re: Query Filter fq with OR operator
I would like to submit a JIRA issue for this. Can anyone help me on where to go? -Yao Otis Gospodnetic wrote: Brian, Opening a JIRA issue if it doesn't already exist is the best way. If you can provide a patch, even better! Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: brian519 bpear...@desire2learn.com To: solr-user@lucene.apache.org Sent: Tuesday, June 16, 2009 1:32:41 PM Subject: Re: Query Filter fq with OR operator This feature is very important to me .. should I post something on the dev forum? Not sure what the proper protocol is for adding a feature to the roadmap. Thanks, Brian.
Re: Faceting on text fields
FYI, I did a direct integration with Carrot2 via SolrJ, with a separate Ajax call from the UI for the top 100 hits, to cluster terms in the two text fields. It got comparable performance to other facets in terms of response time. In terms of algorithms, they listed two, Lingo and STC, which I don't recognize, but I think at least one of them might use SVD (http://en.wikipedia.org/wiki/Singular_value_decomposition). -Yao Otis Gospodnetic wrote: I'd call it related (their application in search encourages exploration), but also distinct enough to never mix them up. I think your assessment below is correct, although I'm not familiar with the details of Carrot2 any more (was once), so I can't tell you exactly which algo is used under the hood. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Michael Ludwig m...@as-guides.com To: solr-user@lucene.apache.org Sent: Wednesday, June 10, 2009 9:41:54 AM Subject: Re: Faceting on text fields Otis Gospodnetic schrieb: Solr can already cluster the top N hits using Carrot2: http://wiki.apache.org/solr/ClusteringComponent Would it be fair to say that clustering as detailed on the page you're referring to is a kind of dynamic faceting? The faceting not being done based on distinct values of certain fields, but on the presence (and frequency) of terms in one field? The main difference seems to be that with faceting, grouping criteria (facets) are known beforehand, while with clustering, grouping criteria (the significant terms which create clusters - the cluster keys) have yet to be determined. Is that a correct assessment? Michael Ludwig
Re: Faceting on text fields
BTW, Carrot2 has a very impressive Clustering Workbench (based on Eclipse) that has built-in integration with Solr. If you have a Solr service running, it is just a matter of pointing the workbench to it. The clustering results and visualization are amazing. (http://project.carrot2.org/download.html) Yao Ge wrote: FYI, I did a direct integration with Carrot2 via SolrJ, with a separate Ajax call from the UI for the top 100 hits, to cluster terms in the two text fields. It got comparable performance to other facets in terms of response time. In terms of algorithms, they listed two, Lingo and STC, which I don't recognize, but I think at least one of them might use SVD (http://en.wikipedia.org/wiki/Singular_value_decomposition). -Yao Otis Gospodnetic wrote: I'd call it related (their application in search encourages exploration), but also distinct enough to never mix them up. I think your assessment below is correct, although I'm not familiar with the details of Carrot2 any more (was once), so I can't tell you exactly which algo is used under the hood. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Michael Ludwig m...@as-guides.com To: solr-user@lucene.apache.org Sent: Wednesday, June 10, 2009 9:41:54 AM Subject: Re: Faceting on text fields Otis Gospodnetic schrieb: Solr can already cluster the top N hits using Carrot2: http://wiki.apache.org/solr/ClusteringComponent Would it be fair to say that clustering as detailed on the page you're referring to is a kind of dynamic faceting? The faceting not being done based on distinct values of certain fields, but on the presence (and frequency) of terms in one field? The main difference seems to be that with faceting, grouping criteria (facets) are known beforehand, while with clustering, grouping criteria (the significant terms which create clusters - the cluster keys) have yet to be determined. Is that a correct assessment?
Michael Ludwig
Re: Faceting on text fields
Thanks for the insight, Otis. I had no awareness of the ClusteringComponent until now. It is time to move to Solr 1.4. -Yao Otis Gospodnetic wrote: Yao, Solr can already cluster the top N hits using Carrot2: http://wiki.apache.org/solr/ClusteringComponent I've also done ugly manual counting of terms in top N hits. For example, look at the right side of this: http://www.simpy.com/user/otis/tag/%22machine+learning%22 Something like http://www.sematext.com/product-key-phrase-extractor.html could also be used. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Yao Ge yao...@gmail.com To: solr-user@lucene.apache.org Sent: Tuesday, June 9, 2009 3:46:13 PM Subject: Re: Faceting on text fields Michael, Thanks for the update! I definitely need to get a 1.4 build and see if it makes a difference. BTW, maybe instead of using faceting for text mining/clustering/visualization purposes, we could build a separate feature in Solr for this. Many of the commercial search engines I have experience with (Google Search Appliance, Vivisimo, etc.) provide dynamic term clustering based on the top N ranked documents (N is a configurable parameter). When a facet field is highly fragmented (say, a text field), the existing set-intersection-based approach might no longer be optimal. Aggregating term vectors over the top N docs might be more attractive. Another feature I would really appreciate is search-time n-gram term clustering. Maybe this might be better suited for the spell checker, as it is just a different way to display alternative search terms. -Yao Michael Ludwig-4 wrote: Yao Ge schrieb: The facet query is considerably slower compared to other facets from structured database fields (with highly repeated values). What I found interesting is that even after I constrained search results to just a few hundred hits using other facets, these text facets are still very slow.
I understand that text fields are not good candidates for faceting, as they can contain a very large number of unique values. However, why is it still slow after my matching documents are reduced to hundreds? Is it because the whole filter is cached (regardless of the matching docs) and I don't have enough filter cache size to fit the whole list? Very interesting questions! I think an answer would both require and further an understanding of how filters work, which might even lead to a more general guideline on when and how to use filters and facets. Even though faceting appears to have changed in 1.4 vs 1.3, it would still be interesting to understand the 1.3 side of things. Lastly, what I really want is to give users a chance to visualize and filter on the top relevant words in the free-text fields. Are there alternatives to the facet field approach? Term vectors? I can do client-side processing based on the top N (say 100) hits for this, but it is my last option. Also a very interesting data mining question! I'm sorry I don't have any answers for you. Maybe someone else does. Best, Michael Ludwig
Re: Faceting on text fields
Michael, Thanks for the update! I definitely need to get a 1.4 build and see if it makes a difference. BTW, maybe instead of using faceting for text mining/clustering/visualization purposes, we could build a separate feature in Solr for this. Many of the commercial search engines I have experience with (Google Search Appliance, Vivisimo, etc.) provide dynamic term clustering based on the top N ranked documents (N is a configurable parameter). When a facet field is highly fragmented (say, a text field), the existing set-intersection-based approach might no longer be optimal. Aggregating term vectors over the top N docs might be more attractive. Another feature I would really appreciate is search-time n-gram term clustering. Maybe this might be better suited for the spell checker, as it is just a different way to display alternative search terms. -Yao Michael Ludwig-4 wrote: Yao Ge schrieb: The facet query is considerably slower compared to other facets from structured database fields (with highly repeated values). What I found interesting is that even after I constrained search results to just a few hundred hits using other facets, these text facets are still very slow. I understand that text fields are not good candidates for faceting, as they can contain a very large number of unique values. However, why is it still slow after my matching documents are reduced to hundreds? Is it because the whole filter is cached (regardless of the matching docs) and I don't have enough filter cache size to fit the whole list? Very interesting questions! I think an answer would both require and further an understanding of how filters work, which might even lead to a more general guideline on when and how to use filters and facets. Even though faceting appears to have changed in 1.4 vs 1.3, it would still be interesting to understand the 1.3 side of things. Lastly, what I really want is to give users a chance to visualize and filter on the top relevant words in the free-text fields.
Are there alternatives to the facet field approach? Term vectors? I can do client-side processing based on the top N (say 100) hits for this, but it is my last option. Also a very interesting data mining question! I'm sorry I don't have any answers for you. Maybe someone else does. Best, Michael Ludwig
Query Filter fq with OR operator
If I want to use the OR operator with multiple query filters, I can do: fq=popularity:[10 TO *] OR section:0 Is there a more efficient alternative to this?
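One thing to keep in mind (based on how Solr's filterCache works, to the best of my understanding): each separate fq parameter is cached as its own entry, while an OR inside a single fq is cached as one unit, so the two forms below differ both in meaning and in caching:

```
# one filterCache entry; matches the UNION (OR) of the two clauses
fq=popularity:[10 TO *] OR section:0

# two independent, reusable cache entries; matches the INTERSECTION (AND)
fq=popularity:[10 TO *]&fq=section:0
```

There is no way to get OR semantics out of two separate fq parameters, so the single combined fq is the right form; it is reasonably efficient as long as the same combined filter is reused across queries.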
Faceting on text fields
I am indexing a database with over 1 million rows. Two of the fields contain unstructured text, but the size of each field is limited (256 characters). I came up with an idea to visualize the text fields as a text cloud by turning the two text fields into facets. The font weight and size of each facet value (word) is derived from the facet counts. I used a simpler field type so that there is no stemming of these facet values:

<fieldType name="word" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

The facet query is considerably slower compared to other facets from structured database fields (with highly repeated values). What I found interesting is that even after I constrained search results to just a few hundred hits using other facets, these text facets are still very slow. I understand that text fields are not good candidates for faceting, as they can contain a very large number of unique values. However, why is it still slow after my matching documents are reduced to hundreds? Is it because the whole filter is cached (regardless of the matching docs) and I don't have enough filter cache size to fit the whole list? The following is my filterCache setting:

<filterCache class="solr.LRUCache" size="5120" initialSize="512" autowarmCount="128"/>

Lastly, what I really want is to give users a chance to visualize and filter on the top relevant words in the free-text fields. Are there alternatives to the facet field approach? Term vectors? I can do client-side processing based on the top N (say 100) hits for this, but it is my last option.
-- View this message in context: http://www.nabble.com/Faceting-on-text-fields-tp23872891p23872891.html Sent from the Solr - User mailing list archive at Nabble.com.
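For reference, a faceting request over a tokenized field like the one described above typically looks like the sketch below; the host, core path, and field name comment_words are assumptions, and facet.limit/facet.mincount trim the long tail of rare tokens that makes text faceting expensive:

```text
http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=comment_words&facet.limit=100&facet.mincount=5
```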
Re: Faceting on text fields
Yes, I am using 1.3. When is 1.4 due for release?

Yonik Seeley-2 wrote: Are you using Solr 1.3? You might want to try the latest 1.4 test build - faceting has changed a lot. -Yonik http://www.lucidimagination.com

On Thu, Jun 4, 2009 at 12:01 PM, Yao Ge yao...@gmail.com wrote: I am indexing a database with over 1 million rows. Two of the fields contain unstructured text, but the size of each field is limited (256 characters). I came up with the idea of visualizing the text fields as a text cloud by turning the two text fields into facets. The font weight and size of each facet value (word) are derived from the facet counts. I used a simpler field type so that no stemming is applied to these facet values:

<fieldType name="word" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

The facet query is considerably slower compared to other facets on structured database fields (with highly repeated values). What I found interesting is that even after I constrained the search results to just a few hundred hits using other facets, these text facets are still very slow. I understand that text fields are not good candidates for faceting, as they can contain a very large number of unique values. However, why is it still slow after the matching documents are reduced to hundreds? Is it because the whole filter is cached (regardless of the matching docs) and I don't have enough filterCache size to fit the whole list? The following is my filterCache setting:

<filterCache class="solr.LRUCache" size="5120" initialSize="512" autowarmCount="128"/>

Lastly, what I really want is to give users a chance to visualize and filter on the top relevant words in the free-text fields. Are there alternatives to the facet-field approach? Term vectors? I can do client-side processing based on the top N (say 100) hits, but that is my last option. -- View this message in context: http://www.nabble.com/Faceting-on-text-fields-tp23872891p23872891.html Sent from the Solr - User mailing list archive at Nabble.com.

-- View this message in context: http://www.nabble.com/Faceting-on-text-fields-tp23872891p23876051.html Sent from the Solr - User mailing list archive at Nabble.com.
spell checking
Can someone help by providing a tutorial-like introduction on how to get spell checking working in Solr? It appears many steps are required before the spell-checking functions can be used. It also appears that a dictionary (a list of correctly spelled words) is required to set up the spell checker. Can anyone validate my impression? Thanks. -- View this message in context: http://www.nabble.com/spell-checking-tp23835427p23835427.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: spell checking
Yes, I did. I was not able to grasp the concept of making spell checking work. For example, the wiki page says a spell check index needs to be built, but does not say how to build it. Does Solr build the index out of thin air? Is the index built from the main index? Or is it built from a dictionary or word list? Please help.

Grant Ingersoll-6 wrote: Have you gone through: http://wiki.apache.org/solr/SpellCheckComponent

On Jun 2, 2009, at 8:50 AM, Yao Ge wrote: Can someone help by providing a tutorial-like introduction on how to get spell checking working in Solr? It appears many steps are required before the spell-checking functions can be used. It also appears that a dictionary (a list of correctly spelled words) is required to set up the spell checker. Can anyone validate my impression? Thanks. -- View this message in context: http://www.nabble.com/spell-checking-tp23835427p23835427.html Sent from the Solr - User mailing list archive at Nabble.com.

-- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search

-- View this message in context: http://www.nabble.com/spell-checking-tp23835427p23840843.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: spell checking
Sorry for not being able to get my point across. I know the syntax that leads to an index build for spell checking. I actually ran the command and saw some additional files created in the data\spellchecker1 directory. What I don't understand is what is in there, as I cannot trick Solr into making spell suggestions using the query structure documented on the wiki. Can anyone tell me what happens after the default spell check index is built? In my case, I used copyField to copy a couple of text fields into a field called spell. These fields are the original text; they are the ones with the typos that I need to run spell check on. But how can this original data be used as a base for spell checking? How does Solr know which words are correctly spelled?

<field name="tech_comment" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="cust_comment" type="text" indexed="true" stored="true" multiValued="true"/>
...
<field name="spell" type="textSpell" indexed="true" stored="true" multiValued="true"/>
...
<copyField source="tech_comment" dest="spell"/>
<copyField source="cust_comment" dest="spell"/>

Yao Ge wrote: Can someone help by providing a tutorial-like introduction on how to get spell checking working in Solr? It appears many steps are required before the spell-checking functions can be used. It also appears that a dictionary (a list of correctly spelled words) is required to set up the spell checker. Can anyone validate my impression? Thanks. -- View this message in context: http://www.nabble.com/spell-checking-tp23835427p23841373.html Sent from the Solr - User mailing list archive at Nabble.com.
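For what it's worth, the wiring that makes a spell field usable usually involves a SpellCheckComponent entry in solrconfig.xml along the lines of the sketch below; the index directory and names here are assumptions for illustration, not a copy of any particular setup:

```xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <!-- Name of this checker; referenced by spellcheck.dictionary at query time -->
    <str name="name">default</str>
    <!-- Source field: the spellchecker index is built from this field's terms -->
    <str name="field">spell</str>
    <!-- Where the auxiliary index is written (the files seen under data\spellchecker1) -->
    <str name="spellcheckIndexDir">./spellchecker1</str>
  </lst>
</searchComponent>
```

The checker index is then built with a request such as /select?q=foo&spellcheck=true&spellcheck.build=true, assuming the component has been added to a request handler's components list.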
Re: spell checking
Excellent. Now everything makes sense to me. :-) The spell checking suggestion is the closest variant of the user input that actually exists in the main index; the so-called correction is relative to the indexed text. So there is no need for a brute-force list of all correctly spelled words. Maybe we should call these alternative search terms or suggested search terms instead of spell checking, which is misleading: there is no right or wrong spelling here, only popular (term frequency?) alternatives. Thanks for the insight.

Otis Gospodnetic wrote: Hello, In short, the assumption behind this type of SC is that the text in the main index is (mostly) correctly spelled. When the SC finds query terms that are close in spelling to words indexed in the SC, it offers spelling suggestions/corrections using those presumably correctly spelled terms (there are other parameters that control the exact behaviour, but this is the idea). Solr (Lucene's spellchecker, which Solr uses under the hood, actually) turns the input text (values from those fields you copy to the spell field) into so-called n-grams. You can see that if you open up the SC index with something like Luke. Please see http://wiki.apache.org/jakarta-lucene/SpellChecker . Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message From: Yao Ge yao...@gmail.com To: solr-user@lucene.apache.org Sent: Tuesday, June 2, 2009 5:34:07 PM Subject: Re: spell checking

Sorry for not being able to get my point across. I know the syntax that leads to an index build for spell checking. I actually ran the command and saw some additional files created in the data\spellchecker1 directory. What I don't understand is what is in there, as I cannot trick Solr into making spell suggestions using the query structure documented on the wiki. Can anyone tell me what happens after the default spell check index is built? In my case, I used copyField to copy a couple of text fields into a field called spell. These fields are the original text; they are the ones with the typos that I need to run spell check on. But how can this original data be used as a base for spell checking? How does Solr know which words are correctly spelled?

<field name="tech_comment" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="cust_comment" type="text" indexed="true" stored="true" multiValued="true"/>
...
<field name="spell" type="textSpell" indexed="true" stored="true" multiValued="true"/>
...

Yao Ge wrote: Can someone help by providing a tutorial-like introduction on how to get spell checking working in Solr? It appears many steps are required before the spell-checking functions can be used. It also appears that a dictionary (a list of correctly spelled words) is required to set up the spell checker. Can anyone validate my impression? Thanks. -- View this message in context: http://www.nabble.com/spell-checking-tp23835427p23841373.html Sent from the Solr - User mailing list archive at Nabble.com.

-- View this message in context: http://www.nabble.com/spell-checking-tp23835427p23844050.html Sent from the Solr - User mailing list archive at Nabble.com.
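The n-gram decomposition Otis mentions can be sketched roughly as follows; this is a simplification of what the Lucene spellchecker stores, and the gram size here is an illustrative assumption:

```python
def ngrams(word: str, n: int) -> list[str]:
    """Return the overlapping character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

# The spellchecker indexes each source term under several gram sizes, so a
# misspelled query term still shares many grams with the correctly indexed one.
print(ngrams("faceting", 3))   # ['fac', 'ace', 'cet', 'eti', 'tin', 'ing']
print(ngrams("facetting", 3))  # shares most of its grams with "faceting"
```

Matching on shared grams (rather than exact terms) is what lets the checker find close variants without any external dictionary.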
Query Boost Functions
I have a field named last-modified that I would like to use in the bf (boost functions) parameter of the DisMaxRequestHandler: recip(rord(last-modified),1,1000,1000). However, the Solr query parser complains about the syntax of the formula. I think it is related to the hyphen in the field name. I have tried adding single and double quotes around the field name, but that didn't help. Can a field name contain a hyphen in boost functions? If so, how? If not, where do I find the restrictions on special characters in field names? -Yao -- View this message in context: http://www.nabble.com/Query-Boost-Functions-tp23595860p23595860.html Sent from the Solr - User mailing list archive at Nabble.com.
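As background, Solr's recip(x,m,a,b) function query computes a/(m*x + b), so the boost above decays smoothly as the reverse ordinal of last-modified grows. A quick sketch of the curve in plain Python, just to illustrate the shape:

```python
def recip(x: float, m: float, a: float, b: float) -> float:
    """Solr's recip() function query: a / (m*x + b)."""
    return a / (m * x + b)

# For recip(rord(last-modified),1,1000,1000): the newest documents
# (small reverse ordinal) score near 1.0, and the boost halves by
# ordinal 1000, then tails off slowly for older documents.
for ordinal in (1, 1000, 10000):
    print(ordinal, recip(ordinal, 1, 1000, 1000))
```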
Re: Solr Shard - Strange results
Maybe you want to try the docNumber field with type string and see if it makes a difference.

CB-PO wrote: I'm not quite sure what logs you are talking about, but in the tomcat/logs/catalina.out logs, I found the following [note, I can't copy/paste, so I am typing up a summary]:

I execute the command: localhost:8080/bravo/select?q=fred&rows=102&start=0&shards=localhost:8080/alpha,localhost:8080/bravo

In this example, alpha has 27 instances of fred, while bravo has 0. Then in catalina.out:

- There is the request for the command I sent, shards parameters and all. It has the proper query string.
- Then I see the two requests sent to the shards, alpha and bravo. These two requests weave between each other until they are finished:
INFO: REQUEST URI =/alpha/select
INFO: REQUEST URI =/bravo/select
The parameters have changed to: wt=javabin&fsv=true&version=2.2&fl=docNumber,score&q=fred&rows=102&isShard=true&start=0
- Then two INFOs scroll across:
INFO: [] webapp=/bravo path=/select params={wt=javabin&fsv=true&version=2.2&fl=docNumber,score&q=fred&rows=102&isShard=true&start=0} hits=0 status=0 QTime=1
INFO: [] webapp=/alpha path=/select params={wt=javabin&fsv=true&version=2.2&fl=docNumber,score&q=fred&rows=102&isShard=true&start=0} hits=27 status=0 QTime=1
**Note, hits=27
- Then I see some octet-streams being transferred, with status 200, so those are OK.
- Then I see something peculiar: it calls alpha with the following parameters: wt=javabin&version=2.2&ids=ABC-1353,ABC-408,ABC-1355,ABC-1824,ABC-1354,FRED-ID-27,55&q=fred&rows=102&isShard=true&start=0

Performing this query on my own (without the wt=javabin) gives me numFound=2, the result set I get back from the overarching query. Changing it to rows=10 gives me numFound=2 and 2 docs. This is not the strange functionality I was seeing with the overarching query and the mismatched numFound and docs. This does beg the question: why did it add ids=ABC-1353,ABC-408,ABC-1355,ABC-1824,ABC-1354,FRED-ID-27,55 to the query?
They are the format that would be under docNumber, if that helps. Any thoughts? I will do some research on those particular ID-numbered docs in the meantime. Here's the configuration information. I only posted the differences from the default files in solr/example/solr/conf:

[solrconfig.xml]
<config>
  <dataDir>${solr.data.dir:/data/indices/bravo/solr/data}</dataDir>
  <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">/data/indices/bravo/solr/conf/data-config.xml</str>
    </lst>
  </requestHandler>
</config>

[schema.xml]
<schema>
  <fields>
    <field name="docNumber" type="text" indexed="true" stored="true" />
    <field name="column1" type="text" indexed="true" stored="true" />
    <field name="column2" type="text" indexed="true" stored="true" />
    <field name="column3" type="text" indexed="true" stored="true" />
    <field name="column4" type="text" indexed="true" stored="true" />
    <field name="column5" type="text" indexed="true" stored="true" />
    <field name="column6" type="text" indexed="true" stored="true" />
    <field name="column7" type="text" indexed="true" stored="true" />
    <field name="column8" type="text" indexed="true" stored="true" />
    <field name="column9" type="text" indexed="true" stored="true" />
  </fields>
  <uniqueKey>docNumber</uniqueKey>
  <defaultSearchField>column2</defaultSearchField>
</schema>

[data-config.xml]
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.metamatrix.jdbc.MMDriver" url="jdbc:metamatrix:b...@mms://hostname:port" user="username" password="password"/>
  <document name="DOC_NAME">
    <entity name="ENT_NAME" query="select * from ASDF.TABLE">
      <field column="TABLE_COL_NO" name="docNumber" />
      <field column="TABLE_COL_1" name="column1" />
      <field column="TABLE_COL_2" name="column2" />
      <field column="TABLE_COL_3" name="column3" />
      <field column="TABLE_COL_4" name="column4" />
      <field column="TABLE_COL_5" name="column5" />
      <field column="TABLE_COL_6" name="column6" />
      <field column="TABLE_COL_7" name="column7" />
      <field column="TABLE_COL_8" name="column8" />
      <field column="TABLE_COL_9" name="column9" />
    </entity>
  </document>
</dataConfig>

Yonik Seeley-2 wrote: On Fri, May 15, 2009 at 4:11 PM, CB-PO
charles.bush...@gmail.com wrote: Yeah, the first thing I thought of was that perhaps there was something wrong with the uniqueKey and they were clashing between the indexes, however upon visual inspection of the data the field we are using as the unique key in each of the indexes is grossly different between the two databases, so
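The ids=... request seen in the logs above is the second phase of Solr's distributed search: shards first return only (uniqueKey, score) pairs, the coordinator merges them by score, and then fetches the full documents by id from the owning shards. Roughly, with made-up scores for the ids from the log:

```python
# Phase 1: each shard returns only (docNumber, score) for its local hits
# (the fl=docNumber,score & isShard=true requests in catalina.out).
shard_hits = {
    "alpha": [("ABC-1353", 3.2), ("ABC-408", 2.9)],
    "bravo": [],  # bravo had hits=0 for "fred"
}

# The coordinator merges all shard results by descending score.
merged = sorted(
    (hit for hits in shard_hits.values() for hit in hits),
    key=lambda pair: pair[1],
    reverse=True,
)
top_ids = [doc_id for doc_id, _ in merged]

# Phase 2: fetch the stored fields for just those ids from the owning shard
# (the follow-up request with ids=... in the log).
print(top_ids)  # ['ABC-1353', 'ABC-408']
```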
DataImportHandler Template Transformer
It took me a while to understand that, to use the TemplateTransformer (http://lucene.apache.org/solr/api/org/apache/solr/handler/dataimport/TemplateTransformer.html), none of the variables used to build the template (e.g. ${e.firstName}, ${e.lastName}) may contain null values. I hope the parser can do a better job of explaining this. It would also be nice to simply pad null values with a blank string. Should this be considered as an enhancement? -- View this message in context: http://www.nabble.com/DataImportHandler-Template-Transformer-tp23609267p23609267.html Sent from the Solr - User mailing list archive at Nabble.com.
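Until such padding exists, one workaround is to guarantee non-null values on the SQL side so the template variables are always populated. A sketch of a data-config entity; the entity, column, and table names here are hypothetical:

```xml
<entity name="e" transformer="TemplateTransformer"
        query="select COALESCE(first_name, '') AS firstName,
                      COALESCE(last_name, '')  AS lastName
               from people">
  <!-- Per the behaviour described above, the template field is not
       produced when a referenced variable is null, hence the COALESCE. -->
  <field column="fullName" template="${e.firstName} ${e.lastName}"/>
</entity>
```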