How to rank an exact match higher?
I'm using Solr 3.5 for a type-ahead search system. I want to rank exact matches (lowercased) higher than non-exact matches. For example, if I have two docs:

Doc One: title=New York
Doc Two: title=New York City

I would expect a query of "new york" to rank New York over New York City. It looks like I need to take into account the number of matched tokens vs. the total number of tokens in a field, but I'm not sure how to do this. My debug output shows the two docs with identical scores:

<lst name="debug">
  <str name="rawquerystring">new york</str>
  <str name="querystring">new york</str>
  <str name="parsedquery">+DisjunctionMaxQuery((title:"new york"^50.0 | textng:"new york"^40.0))</str>
  <str name="parsedquery_toString">+(title:"new york"^50.0 | textng:"new york"^40.0)</str>
  <lst name="explain">
    <str name="4f553cbc03643929d093d467">
      1.1890696 = (MATCH) max of:
        1.1890696 = (MATCH) weight(title:"new york"^50.0 in 0), product of:
          0.9994 = queryWeight(title:"new york"^50.0), product of:
            50.0 = boost
            1.1890697 = idf(title: new=2 york=2)
            0.01681987 = queryNorm
          1.1890697 = fieldWeight(title:"new york" in 0), product of:
            1.0 = tf(phraseFreq=1.0)
            1.1890697 = idf(title: new=2 york=2)
            1.0 = fieldNorm(field=title, doc=0)
    </str>
    <str name="4f553cbc03643929d093d468">
      1.1890696 = (MATCH) max of:
        1.1890696 = (MATCH) weight(title:"new york"^50.0 in 1), product of:
          0.9994 = queryWeight(title:"new york"^50.0), product of:
            50.0 = boost
            1.1890697 = idf(title: new=2 york=2)
            0.01681987 = queryNorm
          1.1890697 = fieldWeight(title:"new york" in 1), product of:
            1.0 = tf(phraseFreq=1.0)
            1.1890697 = idf(title: new=2 york=2)
            1.0 = fieldNorm(field=title, doc=1)
    </str>
  </lst>
</lst>

I posted my solrconfig/schema here: https://gist.github.com/1984052 -- Tommy Chheng
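One common pattern for this (a sketch only: the names title_exact and string_lc are made up here, and the boost values are illustrative) is to index the title a second time as a single lowercased token and boost that field above the tokenized one in the dismax qf, so that a query matching the entire title outscores a partial match:

```xml
<!-- schema.xml: whole-title field, lowercased but not tokenized -->
<fieldType name="string_lc" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="title_exact" type="string_lc" indexed="true" stored="false"/>
<copyField source="title" dest="title_exact"/>

<!-- solrconfig.xml, in the dismax handler defaults:
     the whole-field match outranks the tokenized fields -->
<str name="qf">title_exact^100 title^50 textng^40</str>
```

With this setup the query "new york" equals the entire title_exact value of Doc One but not Doc Two, so only Doc One receives the higher boost.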
Re: Solr with Scala
I have created a Solr plugin using Scala. It works without problems. I wouldn't go as far as using Scala to improve Solr performance, but you can definitely use Scala to add missing functionality or custom query parsing. Just build a jar using maven/sbt and put it in Solr's lib directory. On Sun, Feb 5, 2012 at 4:06 PM, deniz denizdurmu...@gmail.com wrote: Hi all, I have a question about scala and solr... I am curious if we can use solr with scala (plugins etc) to improve performance. anybody used scala on solr? could you tell me opinions about them? - Zeki ama calismiyor... Calissa yapar... -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-with-Scala-tp3718539p3718539.html Sent from the Solr - User mailing list archive at Nabble.com. -- Tommy Chheng
Re: phrase auto-complete with suggester component
Thanks for the link; that's the approach I'm going to try. On Wed, Jan 25, 2012 at 2:39 PM, O. Klein kl...@octoweb.nl wrote: O. Klein wrote I agree. Suggester could use some attention. Looking at Wiki there were some features planned, but not much has happened lately. Or check out this post http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/ looking very promising as an alternative. -- View this message in context: http://lucene.472066.n3.nabble.com/phrase-auto-complete-with-suggester-component-tp3685572p3689240.html Sent from the Solr - User mailing list archive at Nabble.com. -- Tommy Chheng
phrase auto-complete with suggester component
I'm testing out the various auto-complete functionalities on the Wikipedia dataset. I first tried facet.prefix and found it slow at times. I'm now looking at the Suggester component. Given a query like "new york", I would like to get results like "New York" or "New York City". When I tried the suggest component, it suggests entries for each word rather than the phrase (even if I add quotes). How can I change my config to get title matches and not have the query broken into individual words?

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="new">
      <int name="numFound">5</int>
      <int name="startOffset">0</int>
      <int name="endOffset">3</int>
      <arr name="suggestion">
        <str>newt</str>
        <str>newwy patitta</str>
        <str>newyddion</str>
        <str>newyorker</str>
        <str>newyork–presbyterian hospital</str>
      </arr>
    </lst>
    <lst name="york">
      <int name="numFound">5</int>
      <int name="startOffset">4</int>
      <int name="endOffset">8</int>
      <arr name="suggestion">
        <str>york</str>
        <str>york–dauphin (septa station)</str>
        <str>york—humber</str>
        <str>york—scarborough</str>
        <str>york—simcoe</str>
      </arr>
    </lst>
    <str name="collation">newt york</str>
  </lst>
</lst>

/solr/suggest?q=new%20york&omitHeader=true&spellcheck.count=5&spellcheck.collate=true

solrconfig.xml:

<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">title_autocomplete</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

schema.xml:

<fieldType name="text_auto" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title_autocomplete" type="text_auto" indexed="true" stored="false" multiValued="false"/>

-- Tommy Chheng
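One thing worth trying (an untested sketch): the SpellCheckComponent supports a queryAnalyzerFieldType option that controls how the incoming query is analyzed before lookup. Pointing it at the keyword-tokenized type should keep "new york" as one token instead of splitting it on whitespace:

```xml
<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <!-- analyze the incoming query with text_auto (KeywordTokenizer),
       so the whole phrase is looked up as a single token -->
  <str name="queryAnalyzerFieldType">text_auto</str>
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">title_autocomplete</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
```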
Re: phrase auto-complete with suggester component
Thanks, I'll try out the custom class file. Any possibility this class can be merged into Solr? It seems like the expected behavior. On Tue, Jan 24, 2012 at 11:29 AM, O. Klein kl...@octoweb.nl wrote: You might wanna read http://lucene.472066.n3.nabble.com/suggester-issues-td3262718.html#a3264740 which contains the solution to your problem. -- View this message in context: http://lucene.472066.n3.nabble.com/phrase-auto-complete-with-suggester-component-tp3685572p3685730.html Sent from the Solr - User mailing list archive at Nabble.com. -- Tommy Chheng
Re: snapshot-4.0 and maven
You can use maven-assembly-plugin's jar-with-dependencies descriptor to build a single jar with all its dependencies: http://stackoverflow.com/questions/574594/how-can-i-create-an-executable-jar-with-dependencies-using-maven @tommychheng On 10/19/10 6:53 AM, Matt Mitchell wrote: Hey thanks Tommy. To be more specific, I'm trying to use SolrJ in a clojure project. When I try to use SolrJ using what you showed me, I get errors saying lucene classes can't be found etc.. Is there a way to build everything SolrJ (snapshot-4.0) needs into one jar? Matt On Mon, Oct 18, 2010 at 11:01 PM, Tommy Chheng tommy.chh...@gmail.com wrote: Once you built the solr 4.0 jar, you can use mvn's install command like this: mvn install:install-file -DgroupId=org.apache -DartifactId=solr -Dpackaging=jar -Dversion=4.0-SNAPSHOT -Dfile=solr-4.0-SNAPSHOT.jar -DgeneratePom=true @tommychheng On 10/18/10 7:28 PM, Matt Mitchell wrote: I'd like to get solr snapshot-4.0 pushed into my local maven repo. Is this possible to do? If so, could someone give me a tip or two on getting started? Thanks, Matt
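A minimal pom.xml fragment for that approach (a sketch: add it under build/plugins and run `mvn package` to produce a *-jar-with-dependencies.jar):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <!-- built-in descriptor that bundles all transitive deps into one jar -->
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
</plugin>
```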
Re: snapshot-4.0 and maven
Once you built the solr 4.0 jar, you can use mvn's install command like this: mvn install:install-file -DgroupId=org.apache -DartifactId=solr -Dpackaging=jar -Dversion=4.0-SNAPSHOT -Dfile=solr-4.0-SNAPSHOT.jar -DgeneratePom=true @tommychheng On 10/18/10 7:28 PM, Matt Mitchell wrote: I'd like to get solr snapshot-4.0 pushed into my local maven repo. Is this possible to do? If so, could someone give me a tip or two on getting started? Thanks, Matt
Re: DIH - deleting documents, high performance (delta) imports, and passing parameters
Thanks for the section on Passing parameters to DIH config: I'm going to try the parameter passing to allow the DIH to index different DBs based on the system environment(local dev machine or production machine) @tommychheng Programmer and UC Irvine Graduate Student Find a great grad school based on research interests: http://gradschoolnow.com On 8/30/10 5:07 AM, Ephraim Ofir wrote: After wasting a few days navigating the somewhat uncharted and murky waters of DIH, thought I'd share my insights with the community to save other newbies time, so here goes... First off, this is not to say DIH is bad, I think it's great and it works really well for my uses, but it has a few undocumented quirks which cost me a lot of time. Deleting documents - several options: 1. the deletedPkQuery in delta import - you'll need to make a DB query which generates the IDs to be deleted (something like: SELECT id FROM yourTable WHERE deletedFlag = 1). Make sure that you have a pk in your entity and that it's the same one returned by your query (in this case - pk=id). 2. Add the $deleteDocById or $deleteDocByQuery special command to your full/delta import. This one is a bit tricky, see comment below**. 3. Use preImportDeleteQuery/postImportDeleteQuery in your full/delta query (contrary to what the wiki says, this works for delta-import as well as full-import). Any one of these can be used separately from your import, you can put them in a separate entity and do a full/delta import just on that entity if that's what you want. High performance imports with sub entities: DIH's sub entity architecture is very easy to understand and makes a lot of sense, but it performs sub queries for each row in the root entity, which is not practical for high volumes. I opted for a solution I found in the Solr book by Packt (excellent book BTW) which involves pushing multi-valued data into a single field with a separator which is then split by DIH with a RegexTransformer. 
This way Solr issues only one query to the DB and the DB does all the heavy lifting. I actually implemented my query as a stored procedure so it can be optimized by the DBA and by the DB and be kept separate from the Solr config. The following (MySQL) query concatenates 3 lang_code fields from the main table into one field and multiple emails from a secondary table into another field:

SELECT u.id,
       u.name,
       IF((u.lang_code1 IS NULL AND u.lang_code2 IS NULL AND u.lang_code3 IS NULL),
          NULL,
          CONVERT(CONCAT_WS('|', u.lang_code1, u.lang_code2, u.lang_code3) USING ascii)) AS multi_lang_codes,
       GROUP_CONCAT(e.email SEPARATOR '|') AS multiple_emails
FROM users_tb u
LEFT JOIN emails_tb e ON u.id = e.id
GROUP BY u.id

The entity in data-config.xml looks something like:

<entity name="my_entity" query="call get_solr_full();" transformer="RegexTransformer">
  <field name="email" column="multiple_emails" splitBy="\|"/>
  <field name="lang_code" column="multiple_lang_codes" splitBy="\|"/>
</entity>

High performance delta imports: DIH's delta import architecture suffers from the same problem as above. It performs one query to create a list of IDs which need to be updated and then performs one query to update each ID, which is not practical for high volumes of data. I was fervently looking for a way to do this in a single simple query which would be basically like the full import query, only adding a WHERE last_updated > '${dataimporter.last_index_time}' clause. The closest thing I found was how to do a delta-import using full-import (DIH FAQ). I fiddled around with it a bit until I finally realized that you can actually do exactly what I wanted very simply - you just need to put a dummy query in the deltaQuery (you have to have a query there which returns one row for each time you want the deltaImportQuery to run - once in my case) and put whatever query you want in the deltaImportQuery.
You could even use the deltaQuery to get some parameter from the DB to use with the deltaImportQuery instead of using the dataimporter's timestamp (I saw a lot of questions concerning time differences between the Solr host and the DB, or other methods of determining the delta, which could be solved this way). I have no need for this, so my entity in data-config.xml looks something like:

<entity name="my_entity" pk="id"
        deltaQuery="SELECT 1 AS dummy;"
        deltaImportQuery="call get_solr_delta('${dataimporter.last_index_time}');">
  ...
  <field ... />
</entity>

Passing parameters to DIH config: I have multiple Solr shards in my setup and wanted to reuse as many config files as possible. The problem is that data-config.xml doesn't seem to support system property substitution like solrconfig.xml does (at least not in 1.4.1; I think I saw something about that in JIRA somewhere). I found a workaround for this by using the property substitution in solrconfig.xml and passing it as a parameter to DIH. Here's an excerpt from my
Re: specifying the doc id in clustering component
Yes, that's the approach I'm taking right now. I do a lookup of the doc ids in the result set to find the matching document. I can live with the manual lookup; I wanted to see if it would be possible to pick a custom field to represent the document in the docs array. Thanks for contributing the plugin to Solr! @tommychheng Programmer and UC Irvine Graduate Student Find a great grad school based on research interests: http://gradschoolnow.com On 8/19/10 10:51 PM, Stanislaw Osinski wrote: The solr schema has the fields, id, name and desc. I would like to get docs:[name Field here ] instead of the doc Id field as in docs:[200066, 195650, The idea behind using the document ids was that based on them you could access the individual documents' content, including the other fields, right from the response field. Using ids limits duplication in the response text as a whole. Is it possible to use this approach in your application? Staszek
changable DIH datasource based on environment variables
I defined my DIH datasource in solrconfig.xml. Is there a way to define two sets of data sources and use one based on the current system's environment variable (e.g. APP_ENV=production or APP_ENV=development)? I run the DIH on my local machine and a remote server; they use different MySQL datasources for importing. -- @tommychheng Programmer and UC Irvine Graduate Student Find a great grad school based on research interests: http://gradschoolnow.com
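One possible approach (an untested sketch; the property name db.url is made up here): solrconfig.xml supports ${property:default} substitution from JVM system properties, and DIH exposes handler parameters to data-config.xml via ${dataimporter.request.*}, so the environment can be selected with a -D flag at startup instead of reading APP_ENV directly:

```xml
<!-- solrconfig.xml: pass a JVM property through to DIH; start Solr with
     e.g. java -Ddb.url=jdbc:mysql://prod-host/mydb -jar start.jar -->
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="db.url">${db.url:jdbc:mysql://localhost/dev_db}</str>
  </lst>
</requestHandler>

<!-- data-config.xml: reference the passed-in value -->
<dataSource driver="com.mysql.jdbc.Driver" url="${dataimporter.request.db.url}"/>
```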
specifying the doc id in clustering component
I'm using the clustering component with solr 1.4. The response is given by the id field in the doc array like: labels:[Devices], docs:[200066, 195650, 204850, Is there a way to change the doc label to be another field? I couldn't find this option in http://wiki.apache.org/solr/ClusteringComponent -- @tommychheng Programmer and UC Irvine Graduate Student Find a great grad school based on research interests: http://gradschoolnow.com
Re: DIH and multivariable fields problems
For multiple-value fields using the DIH, I use group_concat with the RegexTransformer's splitBy, e.g.:

<entity dataSource="grad_schools"
        query="SELECT group_concat(professors.name SEPARATOR '|') AS university_professors
               FROM professors
               WHERE professors.university_guid = '${universities.guid}'"
        transformer="RegexTransformer">
  <field column="university_professors" splitBy="\|"/>
</entity>

Hope that's helpful. @tommychheng Programmer and UC Irvine Graduate Student Find a great grad school based on research interests: http://gradschoolnow.com On 8/6/10 4:39 PM, harrysmith wrote: I'm having a difficult time understanding how multivariable fields work with the DataImportHandler when the source is a RDBMS. I've read the following from the wiki: -- What is a row? A row in DataImportHandler is a Map (MapString, Object). In the map, the key is the name of the field and the value can be anything which is a valid Solr type. The value can also be a Collection of the valid Solr types (this may get mapped to a multi-valued field). If the DataSource is RDBMS a query cannot emit a multivalued field. But it is possible to create a multivalued field by joining an entity with another, i.e. if the sub-entity returns multiple rows for one row from the parent entity it can go into a multivalued field. If the datasource is xml, it is possible to return a multivalued field. -- How does one 'join an entity with another'? Below are the relevant sections of my schema.xml and data-config.xml.
schema.xml:

<dynamicField name="*_s" type="string" indexed="true" stored="true" multiValued="true"/>

data-config.xml:

<entity name="item" query="select * from project_items where projectid_fk=1">
  <field column="ID_PK" name="id"/>
  <entity name="terms" query="select distinct DESC_TERM from term_metadata where item_id=${item.ID_PK}">
    <entity name="metadata" query="select * from term_metadata where item_id=${item.ID_PK} AND desc_term='${terms.DESC_TERM}'">
      <field name="${terms.DESC_TERM}_s" column="TEXT_VALUE"/>
    </entity>
  </entity>
</entity>

I have multiple terms (rows) in the term_metadata table that are returned from the query, but only the first one gets added. Am I missing something obvious?
Re: Design questions/Schema Help
Alternatively, have you considered storing (or I should say indexing) the search logs with Solr? This lets you text-search across your search queries. You can perform time range queries with Solr as well. @tommychheng Programmer and UC Irvine Graduate Student Find a great grad school based on research interests: http://gradschoolnow.com On 7/26/10 4:43 PM, Mark wrote: We are thinking about using Cassandra to store our search logs. Can someone point me in the right direction/lend some guidance on design? I am new to Cassandra and I am having trouble wrapping my head around some of these new concepts. My brain keeps wanting to go back to a RDBMS design. We will be storing the user query, # of hits returned and their session id. We would like to be able to answer the following questions. - What is the n most popular queries and their counts within the last x (mins/hours/days/etc). Basically the most popular searches within a given time range. - What is the most popular query within the last x where hits = 0. Same as above but with an extra where clause - For session id x give me all their other queries - What are all the session ids that searched for 'foos' We accomplish the above functionality w/ MySQL using 2 tables. One for the raw search log information and the other to keep the aggregate/running counts of queries. Would this sort of ad-hoc querying be better implemented using Hadoop + Hive? If so, should I be storing all this information in Cassandra then using Hadoop to retrieve it? Thanks for your suggestions
DIH stalling, how to debug?
Hi, When I run my DIH script, it says it's busy but the Total Requests made to DataSource and Total Rows Fetched remain unchanged at 4 and 6. It hasn't reported a failure. How can I debug what is blocking the DIH? -- @tommychheng Programmer and UC Irvine Graduate Student Find a great grad school based on research interests: http://gradschoolnow.com
Re: DIH stalling, how to debug?
Ok, it was a runaway SQL query which isn't using an index. @tommychheng Programmer and UC Irvine Graduate Student Find a great grad school based on research interests: http://gradschoolnow.com On 7/22/10 4:26 PM, Tommy Chheng wrote: Hi, When I run my DIH script, it says it's busy but the Total Requests made to DataSource and Total Rows Fetched remain unchanged at 4 and 6. It hasn't reported a failure. How can I debug what is blocking the DIH?
Re: csv response writer
I fixed the path of the queryResponseWriter class in the example solrconfig.xml. This was successfully applied against solr 4.0 trunk. A few quirks:

* When I didn't specify a default delimiter, it printed out "null" as the delimiter. I couldn't figure out why, because init(NamedList args) specifies it'll use a default of ",": organizationnull2null
* If I don't specify the column names, the output doesn't insert empty values correctly, e.g. the output has a mismatched number of commas:
  organization,1,Test,Name,2, ,200,8,
  organization,4,Solar,4,0,

I added the patch to https://issues.apache.org/jira/browse/SOLR-1925 @tommychheng Programmer and UC Irvine Graduate Student Find a great grad school based on research interests: http://gradschoolnow.com On 7/13/10 1:41 PM, Erik Hatcher wrote: Tommy, It's not committed to trunk or any other branch at the moment, so no future released version until then. Have you tested it out? Any feedback we should incorporate? When I can carve out some time over the next week or so I'll review and commit if there are no issues brought up. Erik On Jul 13, 2010, at 3:42 PM, Tommy Chheng wrote: Hi, Which next version of solr is the csv response writer set to be included in? https://issues.apache.org/jira/browse/SOLR-1925 -- @tommychheng Programmer and UC Irvine Graduate Student Find a great grad school based on research interests: http://gradschoolnow.com
csv response writer
Hi, Which next version of solr is the csv response writer set to be included in? https://issues.apache.org/jira/browse/SOLR-1925 -- @tommychheng Programmer and UC Irvine Graduate Student Find a great grad school based on research interests: http://gradschoolnow.com
Re: csv response writer
I'll try it out and let you know! @tommychheng Programmer and UC Irvine Graduate Student Find a great grad school based on research interests: http://gradschoolnow.com On 7/13/10 1:41 PM, Erik Hatcher wrote: Tommy, It's not committed to trunk or any other branch at the moment, so no future released version until then. Have you tested it out? Any feedback we should incorporate? When I can carve out some time over the next week or so I'll review and commit if there are no issues brought up. Erik On Jul 13, 2010, at 3:42 PM, Tommy Chheng wrote: Hi, Which next version of solr is the csv response writer set to be included in? https://issues.apache.org/jira/browse/SOLR-1925 -- @tommychheng Programmer and UC Irvine Graduate Student Find a great grad school based on research interests: http://gradschoolnow.com
Re: Query modification
Hi, I actually did something similar on http://researchwatch.net/. If you search for "stanford university solar", it will process the query by tagging "stanford university" to the organization field. I created a QueryComponent class and altered the query string like this (in Scala, but easily translatable to Java):

override def prepare(rb: ResponseBuilder) {
  val params: SolrParams = rb.req.getParams
  if (params.getBool(COMPONENT_NAME, false)) {
    val queryString = params.get("q").trim // rb.getQueryString()
    val entityTransform = new ClearboxEntityDetection
    val (transformedQuery, explainMap) = entityTransform.transformQuery(queryString)
    rb.setQueryString(transformedQuery)
    rb.rsp.add("clearboxExplain", explainMap)
  }
}

@tommychheng Programmer and UC Irvine Graduate Student Find a great grad school based on research interests: http://gradschoolnow.com On 7/2/10 3:12 PM, osocurious2 wrote: If I wanted to intercept a query and turn q=romantic italian restaurant in seattle into q=romantic tag:restaurant city:seattle cuisine:italian would I subclass QueryComponent, modify the query, and pass it to super? Or is there a standard way already to do this? What about changing it to q=romantic city:seattle cuisine:italian&fq=type:restaurant would that be the same process, or is there a nuance to modifying a query into a query+filterQuery? Ken
Re: Query modification
I tried OpenNLP but found it's not very good for search queries because it relies on grammar features like capitalization. I coded up a Bayesian model with mutual information to model dependence between terms, e.g. grouping "stanford university" together in the query "stanford university solar". @tommychheng Programmer and UC Irvine Graduate Student Find a great grad school based on research interests: http://gradschoolnow.com On 7/2/10 3:26 PM, caman wrote: And what did you use for entity detection? GATE, openNLP? Do you mind sharing that please? From: Tommy Chheng-2 [via Lucene] [mailto:ml-node+939600-682384129-124...@n3.nabble.com] Sent: Friday, July 02, 2010 3:20 PM To: caman Subject: Re: Query modification Hi, I actually did something similar on http://researchwatch.net/. If you search for "stanford university solar", it will process the query by tagging "stanford university" to the organization field. I created a QueryComponent class and altered the query string like this (in Scala, but easily translatable to Java):

override def prepare(rb: ResponseBuilder) {
  val params: SolrParams = rb.req.getParams
  if (params.getBool(COMPONENT_NAME, false)) {
    val queryString = params.get("q").trim // rb.getQueryString()
    val entityTransform = new ClearboxEntityDetection
    val (transformedQuery, explainMap) = entityTransform.transformQuery(queryString)
    rb.setQueryString(transformedQuery)
    rb.rsp.add("clearboxExplain", explainMap)
  }
}

@tommychheng Programmer and UC Irvine Graduate Student Find a great grad school based on research interests: http://gradschoolnow.com On 7/2/10 3:12 PM, osocurious2 wrote: If I wanted to intercept a query and turn q=romantic italian restaurant in seattle into q=romantic tag:restaurant city:seattle cuisine:italian would I subclass QueryComponent, modify the query, and pass it to super? Or is there a standard way already to do this?
What about changing it to q=romantic city:seattle cuisine:italian&fq=type:restaurant would that be the same process, or is there a nuance to modifying a query into a query+filterQuery? Ken
dismax and AND as the default operator
I'm using the dismax request handler and want to set the default operator to AND. Using the standard handler, I could just use q.op or defaultOperator in the schema, but this doesn't work with the dismax request handler. For example, if I call solr/select/?q=fuel+cell, I want Solr to handle it as solr/select/?q=fuel+AND+cell -- @tommychheng Programmer and UC Irvine Graduate Student Find a great grad school based on research interests: http://gradschoolnow.com
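A common workaround (a sketch; worth verifying against your Solr version) is dismax's mm (minimum-should-match) parameter: setting mm to 100% requires every query term to match, which behaves like a default AND:

```xml
<!-- solrconfig.xml: dismax handler with all terms required -->
<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <!-- 100% of optional clauses must match: fuel cell acts like fuel AND cell -->
    <str name="mm">100%</str>
  </lst>
</requestHandler>
```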
readonly access for all host except for localhost
Is there a way to configure Solr to allow only read-only access for all external hosts, with full access from localhost? E.g. solr-server.com:8983/solr/select is accessible from a remote server, but the remote server is not allowed to do any update/delete POST actions. -- Tommy Chheng Programmer and UC Irvine Graduate Student Twitter @tommychheng http://tommy.chheng.com
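Solr itself has no built-in authentication, so this is usually handled in the servlet container or a fronting proxy rather than in Solr config. A sketch of one building block (standard servlet web.xml, not Solr-specific; you would combine it with firewall rules, or a second connector bound to localhost for indexing, which this fragment does not cover):

```xml
<!-- web.xml: an empty <auth-constraint/> denies all access to the pattern,
     so no remote client can reach the update handler -->
<security-constraint>
  <web-resource-collection>
    <web-resource-name>solr-update</web-resource-name>
    <url-pattern>/update/*</url-pattern>
  </web-resource-collection>
  <auth-constraint/>
</security-constraint>
```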
Re: use a solr-built index with lucene?
I was thinking of the reverse case: going from Solr to Lucene. Lucene doesn't use a schema.xml. Tommy Chheng Programmer and UC Irvine Graduate Student Twitter @tommychheng http://tommy.chheng.com On 4/9/10 12:15 AM, Paul Libbrecht wrote: This looks like an interesting avenue for a smooth transition from lucene to solr. thanks for more hints you find around. (e.g. maybe it is not too hard to pre-generate a schema.xml from an actual index for the field-types?) paul Le 09-avr.-10 à 02:32, Erik Hatcher a écrit : Yes... gotta jive with schema.xml though. Erik On Apr 8, 2010, at 7:18 PM, Tommy Chheng wrote: If i build an index with solr, is it possible to use the index folder with lucene? -- Tommy Chheng Programmer and UC Irvine Graduate Student Twitter @tommychheng http://tommy.chheng.com
Re: Drill down a solr result set by facets
Try adding quotes to your query, since the parser will split on whitespace otherwise: DepartmentName:Chemistry +fSponsor:"US Cancer/Diabetic Research Institute" Tommy Chheng Programmer and UC Irvine Graduate Student Twitter @tommychheng http://tommy.chheng.com On 3/29/10 8:49 AM, Dhanushka Samarakoon wrote: Hi, I'm trying to perform a search based on keywords and then reduce the result set based on facets that the user selects. The first query for a search would look like this:

http://localhost:8983/solr/select/?q=cancer+stem&version=2.2&wt=php&start=&rows=10&indent=on&qt=dismax&facet=on&facet.mincount=1&facet.field=fDepartmentName&facet.field=fInvestigatorName&facet.field=fSponsor&facet.date=DateAwarded&facet.date.start=2009-01-01T00:00:00Z&facet.date.end=2010-01-01T00:00:00Z&facet.date.gap=%2B1MONTH

In the above query (as per dismax in the solr config file) it searches multiple fields such as GrantTitle, DepartmentName, InvestigatorName, etc... Then if the user selects 'Chemistry' from the facet field 'fDepartmentName' and 'US Cancer/Diabetic Research Institute' from 'fSponsor', I need to reduce the result set above to only records where fDepartmentName is 'Chemistry' and fSponsor is 'US Cancer/Diabetic Research Institute'. The following query is not working:

select/?q=cancer+stem+fDepartmentName:Chemistry+fSponsor:US Cancer/Diabetic Research Institute&version=2.2

Fields starting with 'f' are defined in the schema.xml as copy fields:

<field name="DepartmentName" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="fDepartmentName" type="string" indexed="true" stored="false" multiValued="true"/>
<copyField source="DepartmentName" dest="fDepartmentName"/>

Any ideas on the correct syntax? Thanks, Dhanushka.
Re: document categorization using solr?
Hi Joel, Do you need supervised or unsupervised classification?

supervised: you have examples of your classes
unsupervised: you don't know your classes in advance

In the contribs, there is a Solr clustering component which will handle unsupervised classification: http://wiki.apache.org/solr/ClusteringComponent (I think the component is meant to support small quantities of documents). For supervised solutions (or larger-scale unsupervised solutions), Mahout could be a good start, as it can use the Solr index. Tommy Chheng Programmer and UC Irvine Graduate Student Twitter @tommychheng http://tommy.chheng.com On 3/25/10 6:40 PM, Joel Nylund wrote: Hi, Does solr have something built in, or recommended add-on that does document categorization? ( I found a thread about a year ago, but not exact same topic) For example, here is a commercial categorization product that will take a website and categorize it http://grapeshot.co.uk/online-demo-3.php?url=http://www.solutionstreet.com I am looking for something similar that works with Solr/Lucene and is open source based. Seems like Weka (http://weka.wikispaces.com/Frequently+Asked+Questions) might be close, but not sure. Also not sure how to come up with a category list thanks Joel
Re: keyword query tokenizer
Multi-field searches are one reason for doing the tokenizing in the parser. Imagine your query was name:bob content:climate. The parser can tokenize the query into name:bob and content:climate and pass each to its own analyzer. Tommy Chheng Programmer and UC Irvine Graduate Student Twitter @tommychheng http://tommy.chheng.com On 3/25/10 7:37 PM, Jason Chaffee wrote: I am curious as to why the query parser does any tokenizing? I would think you would want control/configure this with your analyzers? Does anyone know the answer to this. Is there a performance gain or something? Thanks, Jason On Mar 25, 2010, at 4:04 PM, Ahmet Arslan iori...@yahoo.com wrote: I have the following configured for a particular field: analyzer type=query tokenizer class=solr.KeywordTokenizerFactory / filter class=solr.LowerCaseFilterFactory / /analyzer I am using dismax and querying multiple fields and I expect the query to be parsed different for each field. For some reason, it is not kept as single token for this field's query. For example, the query Apple Store is being broken into two tokens, apple and store. I would expect it to be apple store. Does anyone have ideas of what might be going on here? Before analysis phase, QueryParser splits on whitespace. You can alter this behavior by escaping whitespace with back slash. apple\ store
phrase segmentation plugin in component, analyzer, filter or parser?
I'm writing an experimental phrase segmentation plugin for Solr. My current plan is to write it as a SearchComponent, overriding the queryString with the new grouped query; e.g. the query university of california irvine 2009 would be re-written with the phrase grouped as "university of california irvine" 2009. Is the SearchComponent the right class to extend for this type of logic? I picked the component because it was one place where I could get access to overwrite the whole query string. Or is it better design to write it as an analyzer, tokenizer, filter, or parser plugin? -- Tommy Chheng Programmer and UC Irvine Graduate Student Twitter @tommychheng http://tommy.chheng.com
trimfilterfactory on string fieldtype?
Can the trim filter factory work on string fieldtypes? When I define a trim filter factory on a string fieldtype, I get an exception:

org.apache.solr.common.SolrException: Unknown fieldtype 'string' specified on field id
  at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:477)
  at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:95)
  at org.apache.solr.core.SolrCore.init(SolrCore.java:520)
  at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)

This is how I define the field in the schema:

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true">
  <analyzer type="index">
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

-- Tommy Chheng Programmer and UC Irvine Graduate Student Twitter @tommychheng http://tommy.chheng.com
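The likely cause, hedged: solr.StrField is not analyzed at all, so declaring an analyzer on it makes the fieldType fail to load, and every field referencing 'string' then reports an unknown fieldtype. A sketch of a workaround, assuming a TextField-based type is acceptable for the use case:

```xml
<!-- schema.xml sketch: solr.StrField accepts no analyzer, so attach the
     trim filter to a TextField that keeps the value as a single token. -->
<fieldType name="trimmed_string" class="solr.TextField"
           sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
```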
Re: XML data in solr field
Do you have the option of just importing each xml node as a field/value when you add the document? That'll let you do the search easily. If you need to store the raw XML, you can use an extra field.

Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com

On 3/16/10 12:59 PM, Nair, Manas wrote:
Hello Experts, I need help on this issue of mine. I am unsure if this scenario is possible. I have a field in my solr document named inputxml, the value of which is an xml string as below. This xml structure is within the inputxml field value. I needed help on searching this xml structure, i.e. if I search for Venue, I should get Radio City Music Hall as the result and not the complete tag like <Venue value="Radio City Music Hall" />. Is this supported in solr? If it is, how can this be implemented?

<root>
  <Venue value="Radio City Music Hall" />
  <Link value="http://bit.ly/Rndab" />
  <LinkText value="En savoir +" />
  <Address value="New-York, USA" />
</root>

Any help is appreciated. I do not need the tag name in the result; instead I need the tag value. Thanks in advance, Manas Nair
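A sketch of that flattening approach at indexing time (field names hypothetical): each attribute becomes an ordinary searchable field, and the original markup rides along in a stored-only field, XML-escaped:

```xml
<!-- Indexing sketch, hypothetical field names. rawxml_s would be
     stored="true" indexed="false" in the schema, holding the escaped
     original markup for display. -->
<add>
  <doc>
    <field name="id">event-1</field>
    <field name="venue_s">Radio City Music Hall</field>
    <field name="address_s">New-York, USA</field>
    <field name="rawxml_s">&lt;root&gt;&lt;Venue value="Radio City Music Hall"/&gt;&lt;/root&gt;</field>
  </doc>
</add>
```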
Re: DIH field options
Haven't tried this myself but try adding a default value and don't specify it during the import. http://wiki.apache.org/solr/SchemaXml

On 3/12/10 7:56 AM, blargy wrote:
Forgive me but I'm slightly retarded... I grew up underneath some power lines ;) I've read through that wiki but I still can't find what I'm looking for. I just want to give one of the DIH entities/fields a static value (i.e. it doesn't come from a database column). How can I configure this? FYI this is data-config.xml, not schema.xml.

<document>
  <entity name="item" query="select * from items">
    <field name="my_field" column="static_value_not_from_db"/>
  </entity>
</document>

Tommy Chheng-4 wrote:
The wiki page has most of the info you need: http://wiki.apache.org/solr/DataImportHandler To use multi-value fields, your schema.xml must define it with multiValued=true

On 3/11/10 10:58 PM, blargy wrote:
How can you simply add a static value, like <field name="id" value="123"/>? How does one add a static multi-value field, <field name="category_ids" values="123, 456"/>? Is there any documentation on all the options for the field tag in data-config.xml? Thanks for the help

-- Tommy Chheng Programmer and UC Irvine Graduate Student Twitter @tommychheng http://tommy.chheng.com
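Besides a schema default, DIH's TemplateTransformer can stamp a literal value into a field without reading any database column; a minimal data-config.xml sketch (entity and field names taken from the thread):

```xml
<!-- data-config.xml sketch: TemplateTransformer fills my_field from a
     literal template string rather than a column in the result set. -->
<document>
  <entity name="item" query="select * from items"
          transformer="TemplateTransformer">
    <field column="my_field" template="static_value_not_from_db"/>
  </entity>
</document>
```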
Re: How to get Term Positions?
I contributed a little reward to whoever can complete this task too: http://nextsprocket.com/tasks/solr-1337-spans-and-payloads-query-support-asf-jira Feel free to contribute to the reward if you need this done too!

Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com

On 3/12/10 2:14 PM, Grant Ingersoll wrote:
OK, you need https://issues.apache.org/jira/browse/SOLR-1337 and its related item: https://issues.apache.org/jira/browse/SOLR-1485 Unfortunately, not implemented yet.

On Mar 12, 2010, at 1:36 PM, MitchK wrote:
Thanks for your response, Grant! Imagine you are searching for foo. foo occurs in doc1 three times. It is the 5th, the 20th, and the 50th term in the document. I want to get these positions. Of course, if I am searching for foo bar and bar occurs at the 4th and the 21st position, I also want to know that. I am not sure, but I think this is what you mean by per doc basis, right? Since I need the TermPosition at scoring time, TermVectorComponent seems to be no option in this case, or do you think it could be one if I create such vectors at index-time? -- View this message in context: http://old.nabble.com/How-to-get-Term-Positions--tp27880551p27881024.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: DIH field options
The wiki page has most of the info you need: http://wiki.apache.org/solr/DataImportHandler To use multi-value fields, your schema.xml must define it with multiValued=true

On 3/11/10 10:58 PM, blargy wrote:
How can you simply add a static value, like <field name="id" value="123"/>? How does one add a static multi-value field, <field name="category_ids" values="123, 456"/>? Is there any documentation on all the options for the field tag in data-config.xml? Thanks for the help

-- Tommy Chheng Programmer and UC Irvine Graduate Student Twitter @tommychheng http://tommy.chheng.com
Re: persistent cache
One solution is to add the persistent cache with memcache at the application layer.

-- Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com

On 2/12/10 5:19 AM, Tim Terlegård wrote:
2010/2/12 Shalin Shekhar Mangar shalinman...@gmail.com:
2010/2/12 Tim Terlegård tim.terleg...@gmail.com
Does Solr use some sort of a persistent cache?
Solr does not have a persistent cache. That is the operating system's file cache at work.
Aha, that's very interesting and seems to make sense. So is the primary goal of warmup queries to allow the operating system to cache all the files in the data/index directory? Because I think the difference (768ms vs 52ms) is pretty big. I just do one warmup query and get 52 ms response on a 40 million documents index. I think that's pretty nice performance without tinkering with the caches at all. The only tinkering that seems to be needed is this operating system file caching. What's the best way to make sure that my warmup queries have cached all the files? And does a file cache have the complete file in memory? I guess it can get tough to get my 100GB index into the 16GB memory. /Tim
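On the warmup side of this thread, solrconfig.xml event listeners can replay representative queries so index files are pulled into the OS file cache before users hit the server; a sketch, where the query shown is a hypothetical placeholder:

```xml
<!-- solrconfig.xml sketch: replay a representative query (hypothetical
     here) at startup so the relevant index files land in the OS cache. -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">solr</str>
      <str name="start">0</str>
      <str name="rows">10</str>
    </lst>
  </arr>
</listener>
```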
DataImportHandlerException for custom DIH Transformer
I'm having trouble making a custom DIH transformer in solr 1.4. I compiled the general TrimTransformer into a jar (just copy/paste sample code from http://wiki.apache.org/solr/DIHCustomTransformer). I placed the jar along with the dataimporthandler jar in solr/lib (same directory as the jetty jar). Then I added to my DIH data-config.xml file: transformer="DateFormatTransformer, RegexTransformer, com.chheng.dih.transformers.TrimTransformer" Now I get this exception when I try running the import:

org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NoSuchMethodException: com.chheng.dih.transformers.TrimTransformer.transformRow(java.util.Map)
  at org.apache.solr.handler.dataimport.EntityProcessorWrapper.loadTransformers(EntityProcessorWrapper.java:120)

I noticed the exception lists TrimTransformer.transformRow(java.util.Map) but the abstract Transformer class defines a two-parameter method: transformRow(Map<String, Object> row, Context context)?

-- Tommy Chheng Programmer and UC Irvine Graduate Student Twitter @tommychheng http://tommy.chheng.com
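One hedged possibility is a classloader mismatch: with the jar sitting next to the jetty jar, the custom class and DIH's own Transformer base class can be loaded by different classloaders, so the instanceof Transformer check fails and DIH falls back to reflectively looking for a one-argument transformRow(Map), producing exactly this NoSuchMethodException. Loading the jar through Solr's own lib mechanism keeps everything in one classloader:

```xml
<!-- solrconfig.xml sketch: point Solr at the directory holding the
     custom transformer jar (relative to the core's instance dir) so it
     is loaded by Solr's classloader, not the container's. -->
<lib dir="./lib" />
```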
Re: Using solr to store data
Hey AJ, For simplicity sake, I am using Solr to serve as storage and search for http://researchwatch.net. The dataset is 110K NSF grants from 1999 to 2009. The faceting is all dynamic fields and I use a catch all to copy all fields to a default text field. All fields are also stored and used for individual grant view. The performance seems fine for my purposes. I haven't done any extensive benchmarking with it. The site was built using a light ROR/rsolr layer on a small EC2 instance. Feel free to bang against the site with jmeter if you want to stress test a sample server to failure. :) -- Tommy Chheng Developer UC Irvine Graduate Student http://tommy.chheng.com On 2/3/10 5:41 PM, AJ Asver wrote: Hi all, I work on search at Scoopler.com, a real-time search engine which uses Solr. We current use solr for indexing but then fetch data from our couchdb cluster using the IDs solr returns. We are now considering storing a larger portion of data in Solr's index itself so we don't have to hit the DB too. Assuming that we are still storing data on the db (for backend and back up purposes) are there any significant disadvantages to using solr as a data store too? We currently run a master-slave setup on EC2 using x-large slave instances to allow for the disk cache to use as much memory as possible. I imagine we would definitely have to add more slave instances to accomodate the extra data we're storing (and make sure it stays in memory). Any tips would be really helpful. -- AJ Asver Co-founder, Scoopler.com +44 (0) 7834 609830 / +1 (415) 670 9152 a...@scoopler.com Follow me on Twitter: http://www.twitter.com/_aj Add me on Linkedin: http://www.linkedin.com/in/ajasver or YouNoodle: http://younoodle.com/people/ajmal_asver My Blog: http://ajasver.com
filter querying working on dynamic int fields but not dynamic string fields?
I'm having trouble doing a filter query on a string field. Any ideas why it's working on dynamic int fields but not dynamic string fields? ex.

http://localhost:8983/solr/select?indent=on&version=2.2&q=climate - correct
http://localhost:8983/solr/select?version=2.2&q=climate&fq=awardedamounttodate_i%3A88900 FQ with dynamic int field returns one result - correct
http://localhost:8983/solr/select?indent=on&version=2.2&q=climate&fq=awardinstrument_s:Continuing+grant returns zero results - incorrect

In my schema.xml, I set up dynamic fields like this:

<dynamicField name="*_i" type="int"    indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>

In my index, I have a record like this, which should have matched the last query:

<str name="id">9987644</str>
<int name="awardedamounttodate_i">88900</int>
<str name="awardinstrument_s">Continuing grant </str>
<str name="abstract_t">Abstract ATM-987644 Zeng, Ning University of California, Los Angeles Title: Hierarchical Modeling of Vegetation-Climate </str>

This is the query debug section:

<lst name="debug">
  <str name="rawquerystring">climate</str>
  <str name="querystring">climate</str>
  <str name="parsedquery">text:climat</str>
  <str name="parsedquery_toString">text:climat</str>
  <lst name="explain"/>
  <str name="QParser">LuceneQParser</str>
  <arr name="filter_queries">
    <str>awardinstrument_s:Continuing grant</str>
  </arr>
  <arr name="parsed_filter_queries">
    <str>+awardinstrument_s:Continuing +text:grant</str>
  </arr>
</lst>
Re: filter querying working on dynamic int fields but not dynamic string fields?
Thanks, quoting it fixed it. I'm also going to strip the leading/trailing whitespace at index time. Tommy

On 1/20/10 1:47 PM, Erik Hatcher wrote:
On Jan 20, 2010, at 4:27 PM, Tommy Chheng wrote:
I'm having trouble doing a filter query on a string field. Any ideas why it's working on dynamic int fields but not dynamic string fields? ex.
http://localhost:8983/solr/select?indent=on&version=2.2&q=climate - correct
http://localhost:8983/solr/select?version=2.2&q=climate&fq=awardedamounttodate_i%3A88900 FQ with dynamic int field returns one result - correct
http://localhost:8983/solr/select?indent=on&version=2.2&q=climate&fq=awardinstrument_s:Continuing+grant returns zero results - incorrect

fq=field:value with spaces is problematic - it is being parsed as a SolrQueryParser expression. It should work if you quote it - fq=field:"value with spaces". However, as I mentioned earlier today on the list, I think the best option for facet narrowing on string fields is this: fq={!raw f=field}value with spaces. Of course, all of the above need to be URL encoded too.

<str>+awardinstrument_s:Continuing +text:grant</str>

This explains the problem exactly. Note how it parsed the second word to the text field, not the field you specified. Erik
Re: Facet query help
ok, so fq != facet.query. I thought it was an alias. I'm trying your suggestion fq=Memory_s:"1 GB" and now it's returning zero documents even though there is one document that has tommy and Memory_s:1 GB as seen in the original pastie (http://pastie.org/650932). I tried the fq query body with quotes and without quotes. http://lh:8983/solr/select/?facet=true&facet.field=CPU_s&facet.field=Memory_s&facet.field=Video+Card_s&wt=ruby&fq=%22Memory_s:1+GB%22&q=tommy&indent=on Any thoughts? thanks, tommy

On 10/12/09 1:00 AM, Shalin Shekhar Mangar wrote:
On Mon, Oct 12, 2009 at 6:07 AM, Tommy Chheng tommy.chh...@gmail.com wrote:
The dummy data set is composed of 6 docs. My query is set for 'tommy' with the facet query of Memory_s:1+GB http://lh:8983/solr/select/?facet=true&facet.field=CPU_s&facet.field=Memory_s&facet.field=Video+Card_s&wt=ruby&facet.query=Memory_s:1+GB&q=tommy&indent=on However, in the response (http://pastie.org/650932), I get two docs: one which has the correct field Memory_s:1 GB and the second document which has a Memory_s:3+GB. Why did the second document match if I set the facet.query to just 1+GB??

facet.query does not limit documents. It is used for finding the number of documents matching the query. In order to filter the result set you should use a filter query, e.g. fq=Memory_s:"1 GB"
Facet query help
The dummy data set is composed of 6 docs. My query is set for 'tommy' with the facet query of Memory_s:1+GB http://lh:8983/solr/select/?facet=truefacet.field=CPU_sfacet.field=Memory_sfacet.field=Video+Card_swt=rubyfacet.query=Memory_s:1+GBq=tommyindent=on However, in the response (http://pastie.org/650932), I get two docs: one which has the correct field Memory_s:1 GB and the second document which has a Memory_s:3+GB. Why did the second document match if i set the facet.query to just 1+GB?? I'm using Solr 1.4 trunk thanks tommy