truncating indexed docs
Is it possible to truncate large documents once they are indexed? (Can this be done without re-indexing?)

Regards,
CI
Re: custom reranking
Would it not be a good idea to provide ranking as a Solr plugin, in which users could write their own custom ranking algorithms and reorder the results returned by Solr in whichever way they need? It would also let Solr users incorporate learning from search-user feedback (such as click logs) and reorder the results returned by Solr accordingly, rather than depending purely on relevance as we do today.

Regards,
CI

On Fri, Feb 27, 2009 at 5:21 PM, Grant Ingersoll wrote:
>
> On Feb 26, 2009, at 11:16 PM, CIF Search wrote:
>
>> I believe the query component will generate the query in such a way that I
>> get the results that I want, but not process the returned results; is that
>> correct? Is there a way in which I can group the returned results, rank
>> each group separately, and return the results together? In other words,
>> which component do I need to write to reorder the returned results as per
>> my requirements?
>
> I'd have a look at what I did for the Clustering patch, i.e. SOLR-769. It
> may even be the case that you can simply plug in your own SolrClusterer or
> whatever it's called. Or, if it doesn't quite fit your needs, give me
> feedback/a patch and we can update it. I'm definitely open to ideas on it.
>
>> Also, the deduplication patch seems interesting, but it doesn't appear to
>> be expected to work across multiple shards.
>
> Yeah, that does seem a bit tricky. Since Solr doesn't support distributed
> indexing, it would be tricky to support just yet.
>
>> Regards,
>> CI
>>
>> On Thu, Feb 26, 2009 at 8:03 PM, Grant Ingersoll wrote:
>>
>>> On Feb 26, 2009, at 6:04 AM, CIF Search wrote:
>>>
>>>> We have a distributed index consisting of several shards. There could be
>>>> some documents repeated across shards. We want to remove the duplicate
>>>> records from the documents returned by the shards, and re-order the
>>>> results by grouping them on the basis of a clustering algorithm and
>>>> reranking the documents within a cluster on the basis of the log of a
>>>> particular returned field value.
>>>
>>> I think you would have to implement your own QueryComponent. However, you
>>> may be able to get away with implementing/using Solr's FunctionQuery
>>> capabilities.
>>>
>>> FieldCollapsing is also a likely source of inspiration/help
>>> (http://www.lucidimagination.com/search/?q=Field+Collapsing#/s:email,issues)
>>>
>>> As a side note, have you looked at
>>> http://issues.apache.org/jira/browse/SOLR-769 ?
>>>
>>> You might also have a look at the de-duplication patch that is working
>>> its way through dev: http://wiki.apache.org/solr/Deduplication
>>>
>>>> How do we go about achieving this? Should we write this logic by
>>>> implementing QueryResponseWriter? Also, if we remove duplicate records,
>>>> the total number of records actually returned is less than what was
>>>> asked for in the query.
>>>>
>>>> Regards,
>>>> CI
>>>
>>> --
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>>> Solr/Lucene: http://www.lucidimagination.com/search
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene: http://www.lucidimagination.com/search
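Such a plugin would live inside Solr as a Java component, but the reordering itself can be prototyped client-side first. Below is a minimal sketch (not Solr code; the `id` field and the click-log structure are hypothetical) that reranks the `docs` list of a parsed Solr JSON response by aggregated click counts, keeping the original relevance order for unclicked documents:

```python
def rerank_by_clicks(docs, click_counts):
    """Reorder Solr result docs by logged click counts, assuming each doc
    carries a unique 'id' field. Ties keep the original relevance order
    because sorted() is stable."""
    return sorted(docs, key=lambda d: click_counts.get(d["id"], 0), reverse=True)

# 'docs' mimics the "response" -> "docs" list of a Solr JSON response.
docs = [
    {"id": "a", "score": 3.2},
    {"id": "b", "score": 2.9},
    {"id": "c", "score": 2.1},
]
clicks = {"c": 40, "b": 12}  # hypothetical click-log aggregates

reranked = rerank_by_clicks(docs, clicks)
print([d["id"] for d in reranked])  # → ['c', 'b', 'a']
```

Doing this in the client means paging and faceting no longer line up with the displayed order, which is exactly why a server-side plugin hook, as proposed above, would be cleaner.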
Re: response time
Yes, these are non-cached numbers. If I repeat a query, the response is fast since the results are cached.

2009/4/7 Noble Paul നോബിള്‍ नोब्ळ्
> are these the numbers for non-cached requests?
>
> On Tue, Apr 7, 2009 at 11:46 AM, CIF Search wrote:
>> Hi,
>>
>> I have around 10 Solr servers running indexes of around 80-85 GB each,
>> with 16,000,000 docs each. When I use distrib for querying, I am not
>> getting a satisfactory response time. My response time is around 4-5
>> seconds. Any suggestions to improve the response time for queries (to
>> bring it below 1 second)? Is the response slow due to the size of the
>> index? I have already gone through the pointers provided at:
>> http://wiki.apache.org/solr/SolrPerformanceFactors
>>
>> Regards,
>> CI
>
> --
> --Noble Paul
response time
Hi,

I have around 10 Solr servers running indexes of around 80-85 GB each, with 16,000,000 docs each. When I use distrib for querying, I am not getting a satisfactory response time. My response time is around 4-5 seconds. Any suggestions to improve the response time for queries (to bring it below 1 second)? Is the response slow due to the size of the index? I have already gone through the pointers provided at: http://wiki.apache.org/solr/SolrPerformanceFactors

Regards,
CI
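For context, a distributed request fans out to every server listed in the `shards` parameter and waits for all of them, so the slowest shard bounds the overall response time; timing each shard's `select` individually usually pinpoints the bottleneck. A small sketch of how such a request is assembled (hostnames hypothetical):

```python
from urllib.parse import urlencode

# Hypothetical shard hosts; the distributed query is only as fast as the
# slowest of these, so each is worth benchmarking in isolation too.
shards = ["solr%d:8983/solr" % i for i in range(1, 11)]

params = {
    "q": "title:lucene",
    "rows": 10,                 # keep rows small; every shard returns candidates
    "shards": ",".join(shards),
}
url = "http://solr1:8983/solr/select?" + urlencode(params)
print(url)
```

The same URL can then be wrapped in a timing loop (or fired per-shard without the `shards` parameter) to separate merge overhead from per-shard latency.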
Re: input XSLT
But these documents have to be converted to a particular format before being posted; an arbitrary XML document cannot be posted to Solr (with the XSLT handled by Solr internally). DIH handles any XML format, but it operates in pull mode.

On Fri, Mar 13, 2009 at 11:45 AM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:
> On Fri, Mar 13, 2009 at 11:36 AM, CIF Search wrote:
>
>> There is a fundamental problem with using the 'pull' approach of DIH.
>> Normally people want delta imports, which are done using a timestamp
>> field. Now it may not always be possible for application servers to sync
>> their timestamps (given protocol restrictions due to security reasons).
>> Due to this, the Solr application is likely to miss a few records
>> occasionally. Such a problem does not arise if applications themselves
>> identify their records and post them. Should we not have such a feature
>> in Solr, which would allow users to push data onto the index in whichever
>> format they wish? This would also facilitate plugging Solr seamlessly
>> into all kinds of applications.
>
> You can of course push your documents to Solr using the XML/CSV update (or
> using the solrj client). It's just that you can't push documents with DIH.
>
> http://wiki.apache.org/solr/#head-98c3ee61c5fc837b09e3dfe3fb420491c9071be3
>
> --
> Regards,
> Shalin Shekhar Mangar.
Re: input XSLT
There is a fundamental problem with using the 'pull' approach of DIH. Normally people want delta imports, which are done using a timestamp field. Now it may not always be possible for application servers to sync their timestamps (given protocol restrictions due to security reasons). Due to this, the Solr application is likely to miss a few records occasionally. Such a problem does not arise if applications themselves identify their records and post them. Should we not have such a feature in Solr, which would allow users to push data onto the index in whichever format they wish? This would also facilitate plugging Solr seamlessly into all kinds of applications.

Regards,
CI

On Wed, Mar 11, 2009 at 11:52 PM, Noble Paul നോബിള്‍ नोब्ळ् <noble.p...@gmail.com> wrote:
> On Tue, Mar 10, 2009 at 12:17 PM, CIF Search wrote:
>> Just as you have an XSLT response writer to convert the Solr XML response
>> to make it compatible with any application, on the input side do you have
>> an XSLT module that will parse XML documents into Solr format before
>> posting them to the Solr indexer? I have gone through DataImportHandler,
>> but it works in data 'pull' mode, i.e. Solr pulls data from the given
>> location. I would still want to work with applications 'posting'
>> documents to the Solr indexer as and when they want.
>
> It is a limitation of DIH, but if you can put your XML in a file behind an
> HTTP server, then you can fire a command to DIH to pull data from the URL
> quite easily.
>
>> Regards,
>> CI
>
> --
> --Noble Paul
input XSLT
Just as you have an XSLT response writer to convert the Solr XML response to make it compatible with any application, on the input side do you have an XSLT module that will parse XML documents into Solr format before posting them to the Solr indexer? I have gone through DataImportHandler, but it works in data 'pull' mode, i.e. Solr pulls data from the given location. I would still want to work with applications 'posting' documents to the Solr indexer as and when they want.

Regards,
CI
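Absent such a server-side module, the transform can live in the posting client: map the source XML into Solr's `<add><doc>` update format and POST the result to `/update`. A minimal sketch using the Python standard library (the `<books>` source schema and field names are made up; a real XSLT stylesheet applied client-side would accomplish the same mapping):

```python
import xml.etree.ElementTree as ET

SOURCE = """<books>
  <book isbn="123"><title>Lucene in Action</title></book>
  <book isbn="456"><title>Solr Basics</title></book>
</books>"""

def to_solr_add(source_xml):
    """Convert an arbitrary (here: <books>) document into Solr <add> XML."""
    add = ET.Element("add")
    for book in ET.fromstring(source_xml).iter("book"):
        doc = ET.SubElement(add, "doc")
        for name, value in (("id", book.get("isbn")),
                            ("title", book.findtext("title"))):
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")

solr_xml = to_solr_add(SOURCE)
print(solr_xml)
# The result can then be POSTed to the /update handler (Content-Type
# text/xml), followed by a <commit/>.
```

This keeps the push model: the application decides when to transform and post, and no timestamp synchronization with the Solr host is needed.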
Re: custom reranking
I believe the query component will generate the query in such a way that I get the results that I want, but not process the returned results; is that correct? Is there a way in which I can group the returned results, rank each group separately, and return the results together? In other words, which component do I need to write to reorder the returned results as per my requirements?

Also, the deduplication patch seems interesting, but it doesn't appear to be expected to work across multiple shards.

Regards,
CI

On Thu, Feb 26, 2009 at 8:03 PM, Grant Ingersoll wrote:
>
> On Feb 26, 2009, at 6:04 AM, CIF Search wrote:
>
>> We have a distributed index consisting of several shards. There could be
>> some documents repeated across shards. We want to remove the duplicate
>> records from the documents returned by the shards, and re-order the
>> results by grouping them on the basis of a clustering algorithm and
>> reranking the documents within a cluster on the basis of the log of a
>> particular returned field value.
>
> I think you would have to implement your own QueryComponent. However, you
> may be able to get away with implementing/using Solr's FunctionQuery
> capabilities.
>
> FieldCollapsing is also a likely source of inspiration/help
> (http://www.lucidimagination.com/search/?q=Field+Collapsing#/s:email,issues)
>
> As a side note, have you looked at
> http://issues.apache.org/jira/browse/SOLR-769 ?
>
> You might also have a look at the de-duplication patch that is working its
> way through dev: http://wiki.apache.org/solr/Deduplication
>
>> How do we go about achieving this? Should we write this logic by
>> implementing QueryResponseWriter? Also, if we remove duplicate records,
>> the total number of records actually returned is less than what was asked
>> for in the query.
>>
>> Regards,
>> CI
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene: http://www.lucidimagination.com/search
custom reranking
We have a distributed index consisting of several shards. There could be some documents repeated across shards. We want to remove the duplicate records from the documents returned by the shards, re-order the results by grouping them on the basis of a clustering algorithm, and rerank the documents within each cluster on the basis of the log of a particular returned field value.

How do we go about achieving this? Should we write this logic by implementing QueryResponseWriter? Also, if we remove duplicate records, the total number of records actually returned is less than what was asked for in the query.

Regards,
CI
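Whichever Solr component ends up hosting it, the post-processing itself is mechanical. A sketch of the steps described above (the `id`, `cluster`, and `views` field names are hypothetical stand-ins): deduplicate the merged shard results by unique key, group by cluster label, then sort within each cluster by the log of the numeric field:

```python
import math
from itertools import groupby

def dedupe_and_rerank(docs):
    """docs: merged results from all shards, each with a unique 'id',
    a 'cluster' label, and a positive numeric 'views' field."""
    seen, unique = set(), []
    for d in docs:                              # keep first copy of each id
        if d["id"] not in seen:
            seen.add(d["id"])
            unique.append(d)
    unique.sort(key=lambda d: d["cluster"])     # bring clusters together
    out = []
    for _, group in groupby(unique, key=lambda d: d["cluster"]):
        # within a cluster, highest log(views) first
        out.extend(sorted(group, key=lambda d: math.log(d["views"]),
                          reverse=True))
    return out

shard_results = [
    {"id": "a", "cluster": 1, "views": 10},
    {"id": "b", "cluster": 2, "views": 500},
    {"id": "a", "cluster": 1, "views": 10},     # duplicate from another shard
    {"id": "c", "cluster": 1, "views": 300},
]
print([d["id"] for d in dedupe_and_rerank(shard_results)])  # → ['c', 'a', 'b']
```

Note that deduplication shrinks the result set below the requested `rows`, so the caller has to over-fetch from each shard to compensate, which is the short-count problem raised in the question.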