Re: Update Plugins (was Re: Handling disparate data sources in Solr)
data and wrote it out in the current update response format .. so the
current SolrUpdateServlet could be completely replaced with a simple url
mapping...

  /update -> /select?qt=xmlupdate&wt=legacyxmlupdate

Using the filter method above, it could (and i think should) be mapped to:

  /update
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
talking about the URL structure made me realize that the Servlet should
dictate the URL structure and the param parsing, but it should do it
after giving the RequestParser a crack at any streams it wants (actually
i think that may be a direct quote from JJ ... can't remember now) ...
*BUT* the RequestParser may not want to provide a list of streams, until
the params have been parsed (if for example, one of the params is the
name of a file)

so what if the interface for RequestParser looked like this...

  interface RequestParser {
    public init(NamedList nl); // the usual

    /** will be passed the raw input stream from the
     * HttpServletRequest, ... may need other HttpServletRequest info as
     * SolrParam (ie: method, content-type/content-length, ...) but we use
     * a SolrParam instance instead of the HttpServletRequest to
     * maintain an abstraction.
     */
    public Iterable<ContentStream> preProcess(SolrParam headers, InputStream s);

    /** guaranteed that the second arg will be the result from
     * a previous call to preProcess, and that that Iterable from
     * preProcess will not have been inspected or touched in any way, nor
     * will any references to it be maintained after this call.
     * this method is responsible for calling
     * request.setContentStreams(Iterable<ContentStream>) as it sees fit
     */
    public void process(SolrRequest request, Iterable<ContentStream> i);
  }

...the idea being that many RequestParsers will choose to implement one
or both of those methods as a NOOP that just returns null, but if they
want to implement both, they have the choice of obliterating the
Iterable returned by preProcess and completely replacing it once they
see the SolrParams in the request

: specifically what i had in mind was something like this...
:
:   class SolrUberServlet extends HttpServlet {
:     public service(HttpServletRequest req, HttpServletResponse response) {
:       SolrCore core = getCore();
:       Solr(Query)Response solrRsp = new Solr(Query)Response();
:
:       // servlet specific method which does minimal inspection of
:       // req to determine the parser name
:       String p = pickRequestParser(req);
:
:       // looks up a registered instance (from solrconfig.xml)
:       // matching that name
:       RequestParser solrParser = core.getParserByName(p);

        // let the parser preprocess the streams if it wants...
        Iterable<ContentStream> s = solrParser.preProcess(req.getInputStream());

        // build the request using servlet specific URL rules
        Solr(Query)Request solrReq = makeSolrRequest(req);

        // let the parser decide what to do with the existing streams,
        // or provide new ones
        solrParser.process(solrReq, s);

:       // does exactly what it does now: picks the RequestHandler to
:       // use based on the params, calls its handleRequest method
:       core.execute(solrReq, solrRsp);
:
:       // the rest of this is cut/paste from the current SolrServlet.
:       // use SolrParams to pick OutputWriter name, ask core for instance,
:       // have that writer write the results.
:       QueryResponseWriter responseWriter = core.getQueryResponseWriter(solrReq);
:       response.setContentType(responseWriter.getContentType(solrReq, solrRsp));
:       PrintWriter out = response.getWriter();
:       responseWriter.write(out, solrReq, solrRsp);
:     }
:   }

-Hoss
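[To make the two-phase idea above concrete, here is a minimal sketch of
two parsers written against the proposed interface.  Everything here is
illustrative: InputStreamContentStream, FileContentStream, and
request.getParam() are hypothetical helpers assumed for the sketch, not
existing Solr API.]

  import java.io.File;
  import java.io.InputStream;
  import java.util.Collections;

  // Sketch: handles the current "raw XML in the POST body" model by
  // claiming the stream in phase 1 and keeping it in phase 2.
  class RawPostRequestParser implements RequestParser {
    public void init(NamedList nl) {}

    // phase 1: wrap the raw POST body before any params are parsed
    public Iterable<ContentStream> preProcess(SolrParam headers, InputStream s) {
      return Collections.singletonList((ContentStream) new InputStreamContentStream(s));
    }

    // phase 2: keep the streams produced by preProcess as-is
    public void process(SolrRequest request, Iterable<ContentStream> i) {
      request.setContentStreams(i);
    }
  }

  // Sketch: can't produce streams until params are parsed, because a
  // param names the file -- so phase 1 is a NOOP that returns null.
  class LocalFileRequestParser implements RequestParser {
    public void init(NamedList nl) {}

    public Iterable<ContentStream> preProcess(SolrParam headers, InputStream s) {
      return null; // nothing to claim yet
    }

    // phase 2: a query param names the local file to stream
    public void process(SolrRequest request, Iterable<ContentStream> ignored) {
      File f = new File(request.getParam("file")); // hypothetical accessor
      request.setContentStreams(
          Collections.singletonList((ContentStream) new FileContentStream(f)));
    }
  }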
Re: Can this be achieved? (Was: document support for file system crawling)
: 2) contrib code that runs as its own process to crawl documents and
: send them to a Solr server. (maybe it parses them, or maybe it relies
: on the next item...)
:
: Do you know FAST? It uses a step-by-step approach (pipeline) in which
: all of these tasks are done. Much of it is tuned in an easy web tool.
:
: The point I'm trying to make is that contrib code is nice, but a
: complete package with these possibilities could broaden Solr's appeal
: somewhat.

in my limited experience, commercial applications tend to be all in one
solutions not so much because it really adds value that they are all in
one, but because it helps with vendor lock-in -- companies tend to want
to give you a single monolithic product, because if they gave you lots
of little products that tried to do just one thing very well, you might
decide that one of their little products is crap, and write your own
replacement for just that piece using a great open-source library you
found .. and then you might realize that this *other* open-source
library would make it really easy for you to replace this other little
piece of their system and would be a lot more efficient ... etc.

the point being that once they've got you using a monolithic
application, it's a lot harder to stop using the whole thing all at
once, than it would be for you to stop using 1 of N mini-applications
they provide.

open source projects on the other hand, tend to work well when they are
composed of lots of little pieces -- because little pieces are easier to
work on when you have a finite number of developers working in their
spare time, because each developer can work on a few pieces at a time,
and those pieces can be reviewed/used by other people even if the system
as a whole isn't finished.

I ramble about this to try and explain why Solr may not be what you
would consider a complete package at the moment, and why it may never
reach the state you think would make it a complete package ... because
there are a lot of people out there who don't need it to be -- it would
be hard to be a full blown GUI configurable, web crawling, document
detecting, customizable schema based application and still allow for
people to use small pieces of it.

To put it another way: it's a lot easier for people to put reusable
components with clean APIs together in interesting ways, than it is for
people to extract reusable components with clean APIs from a monolithic
application.

: Exactly, this sounds more like it. But if similar inputstreams can be
: handled by Nutch, what's the point in using Solr at all? The http
: API's? In other words, both Nutch and Solr seem to have functionality
: that enterprises would want. But neither gives you the total solution.

if what you care about is extracting text from arbitrary documents,
that's what Nutch does well -- it doesn't worry about trying to extract
complex structure from the documents, so it can parse/index lots of
document formats into the same index.  Solr's goal is to let *you*
define the index format, but that requires you defining what data goes
into which fields as well, and that makes generic reusable document
crawlers/parsers harder to get right in a way that can work for anyone.

-Hoss
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Chris Hostetter wrote:
 i'm totally on board now ... the RequestParser decides where the
 streams come from if any (post body, file upload, local file, remote
 url, etc...); the RequestHandler decides what it wants to do with those
 streams, and has a library of DocumentProcessors it can pick from to
 help it parse them if it wants to, then it takes whatever actions it
 wants, and puts the response information in the existing
 Solr(Query)Response class, which the core hands off to any of the
 various OutputWriters to format according to the user's wishes.

+1

--
Alan Burlison
--
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
Ryan McKinley wrote:
 In addition, consider the case where you want to index a SVN
 repository.  Yes, this could be done in a SolrRequestParser that logs
 in and returns the files as a stream iterator.  But this seems like
 more 'work' than the RequestParser is supposed to do.  Not to mention
 you would need to augment the Document with svn specific attributes.

This is indeed one of the things I'd like to do - use Solr as a back-end
for OpenGrok (http://www.opensolaris.org/os/project/opengrok/)

--
Alan Burlison
--
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
At 11:48 PM -0800 1/16/07, Chris Hostetter wrote:
 yeah ... once we have a RequestHandler doing that work, and populating
 a SolrQueryResponse with its result info, it would probably be pretty
 trivial to make an extremely bare-bones LegacyUpdateOutputWRiter that
 only expected that simple amount of response data and wrote it out in
 the current update response format .. so the current SolrUpdateServlet
 could be completely replaced with a simple url mapping...

   /update -> /select?qt=xmlupdate&wt=legacyxmlupdate

Yah!  But in my vision it would be

  /update -> qt=update

because pathInfo is "update".  There's no need to remap anything in the
URL, the existing SolrServlet is ready for dispatch once it:

- Prepares request params into SolrParams
- Sets params(qt) to pathInfo
- Somehow (perhaps with StreamIterator) prepares streams for
  RequestParser use

I'm still trying to conceptually maintain a separation of concerns
between handling the details of HTTP (servlet-layer) and handling
different payload encodings (a different layer, one I believe can be
invoked after config is read).

The following is vision more than proposal or suggestion...

  <requestHandler name="update" class="lets.write.this.UpdateRequestHandler">
    <lst name="invariants">
      <str name="wt">legacyxml</str>
    </lst>
    <lst name="defaults">
      <!-- rp matches a queryRequestParser -->
      <str name="rp">xml</str>
    </lst>
  </requestHandler>

  <!-- only if the standard responseWriter is not up to the task -->
  <queryResponseWriter name="legacyxml" class="do.we.really.need.LegacyUpdateOutputWRiter"/>

  <queryRequestParser name="xml" class="solr.XMLStreamRequestParser"/>
  <queryRequestParser name="json" class="solr.JSONStreamRequestParser"/>

So when an incoming URL comes in:

  /update?rp=json

the pipeline which is established is:

  SolrServlet -> solr.JSONStreamRequestParser
      |
      |-- request data carrier, e.g. SolrQueryRequest
      |
  lets.write.this.UpdateRequestHandler
      |
      |-- response data carrier, e.g. SolrQueryResponse
      |
  do.we.really.need.LegacyUpdateOutputWRiter

I expect this is all fairly straightforward, except for one sticky
question: Is there a universal format which can efficiently (e.g.
lazily, for stream input) convey all kinds of different request body
encodings, such that the RequestHandler has no idea how it was
dispatched?  Something to think about...

- J.J.
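[For illustration, the dispatch J.J. describes might look roughly like
this in servlet code.  This is a hedged sketch only: the three
registries, RequestParser.parse(), and solrReq.getParam() are assumed as
described in this thread, not the actual Solr 1.2 API.]

  import java.io.IOException;
  import java.util.Map;
  import javax.servlet.http.HttpServletRequest;
  import javax.servlet.http.HttpServletResponse;

  // Assumed to be populated from solrconfig.xml at init time.
  private Map<String, RequestParser> parsers;       // from <queryRequestParser .../>
  private Map<String, SolrRequestHandler> handlers; // from <requestHandler .../>
  private Map<String, QueryResponseWriter> writers; // from <queryResponseWriter .../>

  // Sketch: pathInfo picks the handler, "rp" picks the parser, "wt"
  // picks the writer.
  public void service(HttpServletRequest req, HttpServletResponse rsp)
      throws IOException {
    String qt = req.getPathInfo().substring(1);            // e.g. "update"
    String rp = req.getParameter("rp");
    if (rp == null) rp = "xml";                            // handler default

    RequestParser parser = parsers.get(rp);                // e.g. JSONStreamRequestParser
    SolrQueryRequest solrReq = parser.parse(req);          // request data carrier
    SolrQueryResponse solrRsp = new SolrQueryResponse();   // response data carrier

    handlers.get(qt).handleRequest(solrReq, solrRsp);      // e.g. UpdateRequestHandler

    QueryResponseWriter writer = writers.get(solrReq.getParam("wt"));
    rsp.setContentType(writer.getContentType(solrReq, solrRsp));
    writer.write(rsp.getWriter(), solrReq, solrRsp);
  }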
Re: Solr graduates and joins Lucene as sub-project
On Wed, 2007-01-17 at 10:07 -0500, Yonik Seeley wrote:
 Solr has just graduated from the Incubator, and has been accepted as a
 Lucene sub-project!  Thanks to all the Lucene and Solr users,
 contributors, and developers who helped make this happen!

Yeah, congrats to the whole community and especially to the incubator
mentors and first minute solr project members.  Thanks for this awesome
project.

 I have a feeling we're just getting started :-)

+1

salu2

 -Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
I'm not sure i understand preProcess( ) and what it gets us.  I like the
model that:

 1. The URL path selects the RequestHandler
 2. RequestParser = RequestHandler.getRequestParser() (typically from
    its default params)
 3. SolrRequest = RequestParser.parse( HttpServletRequest )
 4. handler.handleRequest( req, res );
 5. write the response

If anyone needs to customize this chain of events, they could easily
write their own Servlet/Filter

On 1/17/07, Chris Hostetter [EMAIL PROTECTED] wrote:

 Actually, i have to amend that ... it occurred to me in my sleep last
 night that calling HttpServletRequest.getInputStream() wasn't safe
 unless we *know* the RequestParser wants it, and will close it if it's
 non-null, so the API for preProcess would need to look more like
 this...

   interface Pointer<T> { T get(); }

   interface RequestParser {
     ...
     /** will be passed a Pointer to the raw input stream from the
      * HttpServletRequest, ... if this method accesses the InputStream
      * from the pointer, it is required to close it if it is non-null.
      */
     public Iterable<ContentStream> preProcess(SolrParam headers, Pointer<InputStream> s);
     ...
   }

 -Hoss
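[A minimal sketch of the lazy Pointer Hoss describes -- the class name
and the wasAccessed() bookkeeping are assumptions for illustration; the
idea is simply that the servlet can tell afterwards whether the parser
claimed the stream, and therefore who is responsible for closing it.]

  import java.io.IOException;
  import java.io.InputStream;
  import javax.servlet.http.HttpServletRequest;

  // Hands out the servlet's InputStream only on demand, and remembers
  // whether a RequestParser ever claimed it.
  class ServletStreamPointer implements Pointer<InputStream> {
    private final HttpServletRequest req;
    private boolean accessed = false;

    ServletStreamPointer(HttpServletRequest req) { this.req = req; }

    public InputStream get() {
      accessed = true; // from here on, the parser owns closing the stream
      try {
        return req.getInputStream();
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }

    // lets the servlet decide whether it still needs to clean up
    boolean wasAccessed() { return accessed; }
  }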
Re: Java version for solr development (was Re: Update Plugins)
I also think it is too early to move to 1.6.  Only Sun has released
their 1.6 JVM.

Bill

On 1/17/07, Bertrand Delacretaz [EMAIL PROTECTED] wrote:
 On 1/17/07, Thorsten Scherler [EMAIL PROTECTED] wrote:
  ...Should I use 1.6 for a patch or above mentioned libs?...

 IMHO moving to 1.6 is way too soon, and if it's only to save two jars
 it's not worth it.

 -Bertrand
[jira] Updated: (SOLR-104) Update Plugins
[ https://issues.apache.org/jira/browse/SOLR-104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley updated SOLR-104:
-------------------------------
    Attachment: DispatchFilter.patch

 Update Plugins
 --------------
          Key: SOLR-104
          URL: https://issues.apache.org/jira/browse/SOLR-104
      Project: Solr
   Issue Type: Improvement
   Components: update
 Affects Versions: 1.2
     Reporter: Ryan McKinley
      Fix For: 1.2
  Attachments: DispatchFilter.patch, HandlerRefactoring-DRAFT-SRC.zip,
               HandlerRefactoring-DRAFT-SRC.zip, HandlerRefactoring.DRAFT.patch,
               HandlerRefactoring.DRAFT.patch, HandlerRefactoring.DRAFT.zip

 The plugin framework should work for 'update' actions in addition to
 'search' actions.  For more discussion on this, see:
 http://www.nabble.com/Re%3A-Handling-disparate-data-sources-in-Solr-tf2918621.html#a8305828

--
This message is automatically generated by JIRA.
- If you think it was sent incorrectly contact one of the administrators:
  https://issues.apache.org/jira/secure/Administrators.jspa
- For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: graduation todo list
On 1/17/07, Chris Hostetter [EMAIL PROTECTED] wrote:

: OK, here's the TODO list I can think of.

 i added this as a new section on the TaskList (like we did for the
 first release) so it can evolve as people think of other things that
 need to be done (or do things on the list)

Hopefully it won't take as long as the last release :-)

-Yonik
Re: Update Plugins (was Re: Handling disparate data sources in Solr)
On 1/17/07, Chris Hostetter [EMAIL PROTECTED] wrote:

: I'm not sure i understand preProcess( ) and what it gets us.

 it gets us the ability for a RequestParser to be able to pull out the
 raw InputStream from the HTTP POST body, and make it available to the
 RequestHandler as a ContentStream, and/or it can wait until the servlet
 has parsed the URL to get the params and *then* it can generate
 ContentStreams based on those param values.
  - preProcess is necessary to write a RequestParser that can handle
    the current POST raw XML model,
  - process is necessary to write RequestParsers that can get file
    names or URLs out of escaped query params and fetch them as streams

I think the confusion is that (in my view) the RequestParser is the
*only* object able to touch the stream.  I don't think anything should
happen between preProcess() and process(); a RequestParser converts a
HttpServletRequest to a SolrRequest.  Nothing else will touch the
servlet request.

: 1. The URL path selects the RequestHandler
: 2. RequestParser = RequestHandler.getRequestParser() (typically from
:    its default params)
: 3. SolrRequest = RequestParser.parse( HttpServletRequest )
: 4. handler.handleRequest( req, res );
: 5. write the response

 the problem i see with that, is that the RequestHandler shouldn't have
 any say in what RequestParser is used --

... got it.  Then i vote we use a syntax like:

  /path/registered/in/solr/config:requestparser?params

If no ':' is in the URL, use the 'standard' parser.

 1. The URL path determines the RequestHandler
 2. The URL path determines the RequestParser
 3. SolrRequest = RequestParser.parse( HttpServletRequest )
 4. handler.handleRequest( req, res );
 5. write the response

: If anyone needs to customize this chain of events, they could easily
: write their own Servlet/Filter

 this is why i was confused about your Filter comment earlier: if the
 only way a user can customize behavior is by writing a Servlet, they
 can't specify that servlet in a solr config file -- they'd have to
 unpack the war and manually edit the web.xml ... which makes upgrading
 a pain.

I don't *think* this would happen often, and people would only do it if
they are unhappy with the default URL structure -> behavior mapping.  I
am not suggesting this would be the normal way to configure solr.

The main case where I imagine someone would need to write their own
servlet/filter is if they insist the parameters need to be in the URL.
For example:

  /delete/id/

The URL structure I am proposing could not support this (unless you had
a handler mapped to each id :)

ryan
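[The ':' path syntax proposed above could be split with a helper along
these lines -- a hypothetical sketch, not code from the patch:]

  // Splits "/path/registered/in/solr/config:requestparser" into the
  // handler path and the parser name, defaulting to "standard" when no
  // ':' appears in the path.
  static String[] splitHandlerAndParser(String pathInfo) {
    int colon = pathInfo.indexOf(':');
    if (colon < 0) {
      return new String[] { pathInfo, "standard" };
    }
    return new String[] { pathInfo.substring(0, colon),
                          pathInfo.substring(colon + 1) };
  }

[So splitHandlerAndParser("/update:json") would yield {"/update",
"json"}, and splitHandlerAndParser("/update") would yield {"/update",
"standard"}.]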
RE: Update Plugins (was Re: Handling disparate data sources in Solr)
Sorry for the flame, but I've used Spring on 2 large projects and it
worked out great.  You should check out some of the GUIs to help manage
the XML configuration files, if the configuration is the reason your
team thought it was a nightmare (we broke ours up to help).

Jeryl Cook

-----Original Message-----
From: Alan Burlison [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 16, 2007 10:52 AM
To: solr-dev@lucene.apache.org
Subject: Re: Update Plugins (was Re: Handling disparate data sources in Solr)

Bertrand Delacretaz wrote:
 With all this talk about plugins, registries etc., /me can't help
 thinking that this would be a good time to introduce the Spring IoC
 container to manage this stuff.

 More info at http://www.springframework.org/docs/reference/beans.html
 for people who are not familiar with it.  It's very easy to use for
 simple cases like the ones we're talking about.

Please, no.  I work on a big webapp that uses Spring - it's a complete
nightmare to figure out what's going on.

--
Alan Burlison
--
[jira] Updated: (SOLR-104) Update Plugins
[ https://issues.apache.org/jira/browse/SOLR-104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan McKinley updated SOLR-104:
-------------------------------
    Attachment: DispatchFilter.patch

 Update Plugins
 --------------
          Key: SOLR-104
          URL: https://issues.apache.org/jira/browse/SOLR-104
      Project: Solr
   Issue Type: Improvement
   Components: update
 Affects Versions: 1.2
     Reporter: Ryan McKinley
      Fix For: 1.2
  Attachments: DispatchFilter.patch, DispatchFilter.patch,
               HandlerRefactoring-DRAFT-SRC.zip, HandlerRefactoring-DRAFT-SRC.zip,
               HandlerRefactoring.DRAFT.patch, HandlerRefactoring.DRAFT.patch,
               HandlerRefactoring.DRAFT.zip

 The plugin framework should work for 'update' actions in addition to
 'search' actions.  For more discussion on this, see:
 http://www.nabble.com/Re%3A-Handling-disparate-data-sources-in-Solr-tf2918621.html#a8305828
[jira] Commented: (SOLR-104) Update Plugins
[ https://issues.apache.org/jira/browse/SOLR-104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12465563 ]

Ryan McKinley commented on SOLR-104:
------------------------------------

removed getRequestParser() from the Handler interface.  using ':' in the
URL to specify the request parser:

  http://localhost:8983/solr/standard:requestparser?q=video

NOTE: it still uses a default request parser.

 Update Plugins
 --------------
          Key: SOLR-104
          URL: https://issues.apache.org/jira/browse/SOLR-104
      Project: Solr
   Issue Type: Improvement
   Components: update
 Affects Versions: 1.2
     Reporter: Ryan McKinley
      Fix For: 1.2
  Attachments: DispatchFilter.patch, DispatchFilter.patch,
               HandlerRefactoring-DRAFT-SRC.zip, HandlerRefactoring-DRAFT-SRC.zip,
               HandlerRefactoring.DRAFT.patch, HandlerRefactoring.DRAFT.patch,
               HandlerRefactoring.DRAFT.zip

 The plugin framework should work for 'update' actions in addition to
 'search' actions.  For more discussion on this, see:
 http://www.nabble.com/Re%3A-Handling-disparate-data-sources-in-Solr-tf2918621.html#a8305828
Re: Can this be achieved? (Was: document support for file system crawling)
On 1/17/07, Eivind Hasle Amundsen [EMAIL PROTECTED] wrote:

: (...) the point being that once they've got you using a monolithic
: application, it's a lot harder to stop using the whole thing all at
: once, than it would be for you to stop using 1 of N mini-applications
: they provide.

 Well, FAST is composed of many small, modular products that can be
 replaced by other (open source) parts.  It is not monolithic.  The
 first time you install it, it might appear to be just one giant beast.
 However it is not.

: I ramble about this to try and explain why Solr may not be what you
: would consider a complete package at the moment and why it may never
: reach the state you think would make it a complete package ... because
: there are a lot of people out there who don't need it to be -- it
: would be hard to be a full blown GUI configurable, web crawling,
: document detecting, customizable schema based application and still
: allow for people to use small pieces of it.

 I am not arguing on this.  I think my point didn't get through, then.
 Compare this to Linux distributions.  People still use them, right?
 What about an enterprise search distro?  That is exactly what some
 commercial vendors offer, only far less elegant than anything
 containing Lucene et al would probably be.

: To put it another way: it's a lot easier for people to put reusable
: components with clean APIs together in interesting ways, than it is
: for people to extract reusable components with clean APIs from a
: monolithic application.

 Yes, I agree completely, and the strength is exactly what you say -
 they focus on doing a small thing very well.  I believe this fact
 would make such a search distribution even more appealing.

I am not sure I follow.  "Enterprise search distro"?  Anyway, any
enterprise interested in having a serious search solution (i.e. buy
FAST, Autonomy or do open source lucene) will want a custom solution,
i.e. pick and choose the modules/features they need/want and then let an
integrator/consultancy-firm/IT department do the actual implementation.
So a "search distribution" as pointed out is somewhat meaningless if
customization is important.

Now there are organizations that will want to have a black-box solution,
i.e. Google-mini or Searchblox or the new IBM/Yahoo/Lucene search
solution (sorry, I can't remember the name).  These are pre-packaged,
low cost alternatives, in some cases free, that offer no customization,
and I am 100% sure those organizations do not even want customization.

So having the possibility to pick and choose and make a custom solution
from Lucene, Solr, Nutch, Hadoop is super perfect.. You can do more cool
things than if all of these were bundled.

Just some thoughts.

Cheers
Zaheed
Re: [Solr Wiki] Update of TaskList by YonikSeeley
Apache Wiki wrote:
 * have everyone update their subversion working directories (remember
   to update SVN paths in IDEs too, etc)

Note that 'svn switch' makes this easy.

Doug
Bucketing result set (Dev list posting)...
I have a requirement wherein the documents that are retrieved based on
the similarity computation are bucketed and resorted based on user
score.  An example - let us say a search returns the following data set:

  Doc ID    Lucene score    User score
  1000      1000            125
  1000      900             225
  1000      800             25
  1000      700             525
  1000      50              25
  1000      40              125

Assuming two buckets are created, the expected result is:

  Doc ID    Lucene score    User score
  1000      900             225
  1000      1000            125
  1000      800             25
  ---
  1000      700             525
  1000      40              125
  1000      50              25

I am assuming that the only way to do this is to change some of the Solr
internals.  Any pointers would be most helpful on the best way to go
about it.  Thanks.

--
View this message in context:
http://www.nabble.com/Bucketing-result-set-%28Dev-list-posting%29...-tf3031130.html#a8421969
Sent from the Solr - Dev mailing list archive at Nabble.com.
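[The transformation being asked for can be pictured as a standalone
post-processing step -- a sketch only, not Solr internals; the Hit class
and its fields are hypothetical stand-ins for a scored search result:]

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.List;

  // Hits arrive sorted by Lucene score; cut them into fixed-size
  // buckets and re-sort each bucket by user score (descending).
  public class BucketResort {
    static class Hit {
      final int docId; final float luceneScore; final float userScore;
      Hit(int d, float l, float u) { docId = d; luceneScore = l; userScore = u; }
    }

    static List<Hit> bucketAndResort(List<Hit> byLuceneScore, int bucketSize) {
      List<Hit> out = new ArrayList<Hit>(byLuceneScore.size());
      for (int start = 0; start < byLuceneScore.size(); start += bucketSize) {
        int end = Math.min(start + bucketSize, byLuceneScore.size());
        List<Hit> bucket = new ArrayList<Hit>(byLuceneScore.subList(start, end));
        Collections.sort(bucket, new Comparator<Hit>() {
          public int compare(Hit a, Hit b) {
            return Float.compare(b.userScore, a.userScore); // descending
          }
        });
        out.addAll(bucket);
      }
      return out;
    }
  }

[With bucketSize = 3, this reproduces the expected result in the example
above.]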
subversion move
Solr's source in subversion has moved within the ASF repository to
https://svn.apache.org/repos/asf/lucene/solr/
(Thanks Doug!)

The easiest way to change your working directories is to use "svn
switch".  For example, if you have the trunk of solr checked out, cd to
that directory and execute

  svn switch https://svn.apache.org/repos/asf/lucene/solr/trunk

Don't forget to change any SVN paths that may be configured in your IDEs
too.

-Yonik
Re: Can this be achieved? (Was: document support for file system crawling)
: (...) any enterprise interested in having a serious search solution
: (i.e. buy FAST, Autonomy or do open source lucene) will want a custom
: solution (...) then let an integrator/consultancy-firm/IT department
: do the actual implementation.  So a "search distribution" as pointed
: out is somewhat meaningless if customization is important.

I'm talking about creating something that works much more easily out of
the box, and that can be customized as much as now - at the same time.
Of course serious search solutions would be completely customized,
always.  And there are out of the box solutions (Google Appliance etc.).
But is there no market for a middle ground here?

: Now there are organizations that will want to have a black-box
: solution, i.e. Google-mini or Searchblox or the new IBM/Yahoo/Lucene
: search solution (sorry, I can't remember the name).  These are
: pre-packaged, low cost alternatives, in some cases free, that offer no
: customization, and I am 100% sure those organizations do not even want
: customization.

Which ones are free?  Are there any FLOSS alternatives to these black
box solutions?  (IANAL, but the Apache license is more like LGPL than
GPL, right?)

: So having the possibility to pick and choose and make a custom
: solution from Lucene, Solr, Nutch, Hadoop is super perfect.. You can
: do more cool things than if all of these were bundled.

What I am really talking about is this: There is a growing market for
simple search solutions that can work out of the box, and that can still
be customized.  Something that:

- organizations can use on their network, out of the box
- works on their intraweb, out of the box, just give credentials
- can handle user access out of the box (LDAP/NIS/AD)
- is FLOSS(!)
- can be fully customized, if desired
- is modularized for even more customization if needed

Sure, one can argue like you have done so far by saying that they could
just compose their own solution completely...  But then we are falling
outside the market again - which I hypothesize exists.

I am not looking to change Solr in that direction.  But take a look at
Solr.  Or Nutch.  They are already built on Lucene and many other
projects.  Why/not build something on top of this?  Something more/else?

Thanks for all the feedback :)  Please keep it coming.
Re: [Solr Wiki] Update of TaskList by YonikSeeley
Apache Wiki wrote:
 * move website
   * checkout in new location (from the new svn location too)

Note that you can update the .htaccess file in
/www/incubator.apache.org/solr to redirect the old site to the new site.

http://svn.apache.org/repos/asf/incubator/public/trunk/site-publish/.htaccess

Doug
Re: Can this be achieved? (Was: document support for file system crawling)
On 1/17/07, Eivind Hasle Amundsen [EMAIL PROTECTED] wrote:
 What I am really talking about is this: There is a growing market for
 simple search solutions that can work out of the box, and that can
 still be customized.  Something that:
 - organizations can use on their network, out of the box
 - works on their intraweb, out of the box, just give credentials
 - can handle user access out of the box (LDAP/NIS/AD)
 - is FLOSS(!)
 - can be fully customized, if desired
 - is modularized for even more customization if needed

 I am not looking to change Solr in that direction.  But take a look at
 Solr.  Or Nutch.  They are already built on Lucene and many other
 projects.  Why/not build something on top of this?  Something
 more/else?

I don't think that anyone is arguing that this product shouldn't exist
in the open-source world, just that it shouldn't be part of Solr's
mandate.  It sounds like a cool project (though the closer you get to a
commercial product, the more important support, packaging, marketing,
etc. become -- some of which are very difficult to achieve in a purely
open-source setting).

-Mike