Re: Is Solr ready for graduation?
On 1/3/07, Yoav Shapira [EMAIL PROTECTED] wrote: ...I'd say definitely ask Lucene first. (And in general ask the accepting TLP first, before asking the Incubator). +1 from me to starting the discussion, and +1 for graduating Same opinion on all points here. -Bertrand
Re: Handling disparate data sources in Solr
Chris Hostetter wrote: For your purposes, if you've got a system that works and does the Document conversion for you, then you are probably right: Solr may not be a useful addition to your architecture. Solr doesn't really attempt to solve the problem of parsing different kinds of data streams into a unified Document module -- it just tries to expose all of the Lucene goodness through an easy to use, easy to configure, HTTP interface. Besides the configuration, Solr's other means of being a value add is in its IndexReader management, its caching, and its plugin support for mixing and matching request handlers, output writers, and field types as easily as you can mix and match Analyzers. There has been some discussion about adding plugin support for the update side of things as well -- at a very simple level this could allow for messages to be sent via JSON, or CSV instead of just XML -- but there's no reason a more complex update plugin couldn't read in a binary PDF file and parse it into its appropriate fields ... but we aren't quite there yet. Feel free to bring this up on solr-dev if you'd be interested in working on it. I'm interested in discussing this further. I've moved the discussion onto solr-dev, as suggested. -- Alan Burlison --
Re: Is Solr ready for graduation?
And, of course, likewise. Solr is more than ready to get voted on for graduation. Erik On Jan 4, 2007, at 4:21 AM, Bertrand Delacretaz wrote: On 1/3/07, Yoav Shapira [EMAIL PROTECTED] wrote: ...I'd say definitely ask Lucene first. (And in general ask the accepting TLP first, before asking the Incubator). +1 from me to starting the discussion, and +1 for graduating Same opinion on all points here. -Bertrand
Re: Use whiteboard for experimental stuff? (was: duplication in client/ruby/solrb/solr)
On Jan 4, 2007, at 4:26 AM, Bertrand Delacretaz wrote: On 1/4/07, Mike Klaas [EMAIL PROTECTED] wrote: ...Might labs.apache.org make sense for this project?... As Flare is closely related to Solr, I think it belongs in our repository. OTOH, it might be good to put such experimental stuff in a whiteboard directory instead of trunk, to make it clear that it's not (yet) part of what we're releasing. Would making this clear in the README files in both the solrb and flare directories be sufficient? I'm happy to move things wherever folks would like. It's certainly a playground of sorts for me right now, though I expect to have flare functional very soon. solrb is already functional, for what it's worth, though it's not particularly fancy yet. Erik
Re: Is Solr ready for graduation?
+1 on going for graduation. Bill On 1/4/07, Erik Hatcher [EMAIL PROTECTED] wrote: And, of course, likewise. Solr is more than ready to get voted on for graduation. Erik On Jan 4, 2007, at 4:21 AM, Bertrand Delacretaz wrote: On 1/3/07, Yoav Shapira [EMAIL PROTECTED] wrote: ...I'd say definitely ask Lucene first. (And in general ask the accepting TLP first, before asking the Incubator). +1 from me to starting the discussion, and +1 for graduating Same opinion on all points here. -Bertrand
Re: Use whiteboard for experimental stuff? (was: duplication in client/ruby/solrb/solr)
On 1/4/07, Erik Hatcher [EMAIL PROTECTED] wrote: Would making this clear in the README files in both the solrb and flare directories be sufficient? I'm happy to move things wherever folks would like. It's certainly a playground of sorts for me right now, though I expect to have flare functional very soon. solrb is already functional, for what it's worth, though it's not particularly fancy yet. I don't think the package ant task currently includes anything in clients, so I don't think there's an issue w.r.t. releasing. A README note would be fine. -Yonik
Re: Is Solr ready for graduation?
Hi, For the curious, here's what votes will be needed and what's binding in them. It may seem like a long road, but don't be discouraged: for a project like Solr, there's largely consensus so these votes are quick and painless. First, the Solr PPMC must approve the graduation request. In this vote, Solr PPMC members' votes are binding. Unless I'm mistaken, right now all Solr committers are also PPMC members, or close to it. Next, the adopting PMC (in this case Lucene) must vote to accept Solr. In that vote, only Lucene PMC members' votes are binding; you can see those people at http://lucene.apache.org/who.html#Lucene+PMC . Finally, after the Lucene PMC approves, we ask the Incubator PMC. In that vote, only Incubator PMC members' votes are binding. You can see that list of people at http://incubator.apache.org/whoweare.html . You will note at least several people (such as Yonik and Erik Hatcher) will have binding votes in more than one of the above votes. That's fine, it's even expected, e.g. from mentors. They can wear multiple hats without (hopefully) acquiring some clinical disease. Yoav On 1/4/07, Bill Au [EMAIL PROTECTED] wrote: +1 on going for graduation. Bill On 1/4/07, Erik Hatcher [EMAIL PROTECTED] wrote: And, of course, likewise. Solr is more than ready to get voted on for graduation. Erik On Jan 4, 2007, at 4:21 AM, Bertrand Delacretaz wrote: On 1/3/07, Yoav Shapira [EMAIL PROTECTED] wrote: ...I'd say definitely ask Lucene first. (And in general ask the accepting TLP first, before asking the Incubator). +1 from me to starting the discussion, and +1 for graduating Same opinion on all points here. -Bertrand
Re: Is Solr ready for graduation?
Thanks for the summary Yoav, This thread looks like it's the first vote (unless anyone objects), so here's my +1 for graduation. -Yonik On 1/4/07, Yoav Shapira [EMAIL PROTECTED] wrote: Hi, For the curious, here's what votes will be needed and what's binding in them. It may seem like a long road, but don't be discouraged: for a project like Solr, there's largely consensus so these votes are quick and painless. First, the Solr PPMC must approve the graduation request. In this vote, Solr PPMC members' votes are binding. Unless I'm mistaken, right now all Solr committers are also PPMC members, or close to it. Next, the adopting PMC (in this case Lucene) must vote to accept Solr. In that vote, only Lucene PMC members' votes are binding; you can see those people at http://lucene.apache.org/who.html#Lucene+PMC . Finally, after the Lucene PMC approves, we ask the Incubator PMC. In that vote, only Incubator PMC members' votes are binding. You can see that list of people at http://incubator.apache.org/whoweare.html . You will note at least several people (such as Yonik and Erik Hatcher) will have binding votes in more than one of the above votes. That's fine, it's even expected, e.g. from mentors. They can wear multiple hats without (hopefully) acquiring some clinical disease. Yoav On 1/4/07, Bill Au [EMAIL PROTECTED] wrote: +1 on going for graduation. Bill On 1/4/07, Erik Hatcher [EMAIL PROTECTED] wrote: And, of course, likewise. Solr is more than ready to get voted on for graduation. Erik On Jan 4, 2007, at 4:21 AM, Bertrand Delacretaz wrote: On 1/3/07, Yoav Shapira [EMAIL PROTECTED] wrote: ...I'd say definitely ask Lucene first. (And in general ask the accepting TLP first, before asking the Incubator). +1 from me to starting the discussion, and +1 for graduating Same opinion on all points here. -Bertrand
Re: Use whiteboard for experimental stuff? (was: duplication in client/ruby/solrb/solr)
On 1/4/07, Bertrand Delacretaz [EMAIL PROTECTED] wrote: On 1/4/07, Mike Klaas [EMAIL PROTECTED] wrote: ...Might labs.apache.org make sense for this project?... As Flare is closely related to Solr, I think it belongs in our repository. OTOH, it might be good to put such experimental stuff in a whiteboard directory instead of trunk, to make it clear that it's not (yet) part of what we're releasing. I'm cool with Flare being part of the Solr repository, as long as it is in a way that makes sense. But Erik has indicated that the current situation is temporary, so I think we can worry about that later. There is definitely no point in throwing up administrative hurdles to cool nascent subprojects. -Mike
Re: Is Solr ready for graduation?
On 1/4/07, Yonik Seeley [EMAIL PROTECTED] wrote: ...This thread looks like it's the first vote (unless anyone objects), so here's my +1 for graduation... I hate to be formal, but I'd much prefer voting to happen in clearly identified [VOTE] threads. As the community grows, or when people get busy, this helps in not missing these all-important threads. -Bertrand
[VOTE] graduate Solr to Lucene subproject
It's time that Solr graduate from the incubator and become an official Lucene subproject. So, please cast your votes: [ ] +1 ask Lucene PMC and the Incubator PMC to graduate Solr from the Incubator to become a Lucene subproject. [ ] 0 Don't care [ ] -1 Not at this time, stay in the Incubator for now. -Yonik
Re: [VOTE] graduate Solr to Lucene subproject
On 1/4/07, Yonik Seeley [EMAIL PROTECTED] wrote: It's time that Solr graduate from the incubator and become an official Lucene subproject. So, please cast your votes: +1
Re: [VOTE] graduate Solr to Lucene subproject
[X ] +1 ask Lucene PMC and the Incubator PMC to graduate Solr from the Incubator to become a Lucene subproject. -Bertrand
Re: [VOTE] graduate Solr to Lucene subproject
[x] +1 ask Lucene PMC and the Incubator PMC to graduate Solr from the Incubator to become a Lucene subproject. I'm new to Solr, but it's been great so far. The community is great, and I will do whatever I can to make it better.
solr-42
Hi, I was wondering if the Highlighting problems with HTMLStripWhitespaceTokenizerFactory (see http://issues.apache.org/jira/browse/SOLR-42) could be resolved in the following simple way. The HTMLStripWhitespaceTokenizerFactory basically passes the input through an HTMLStripReader, which removes the HTML, and then passes the result to the WhitespaceTokenizer. If the HTMLStripReader simply replaced the HTML with spaces (same length as the removed HTML part), then the positions for the highlighter would be correct. And most of the Tokenizers would be happy with this solution (except maybe the KeywordTokenizer). mirko
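To make the suggestion concrete, here is a minimal sketch of the idea in Python rather than in Solr's own Java. It is not the HTMLStripReader code: the regular expression is a deliberately crude stand-in for real HTML parsing, and the function names and sample string are made up for illustration. Each tag is replaced by a run of spaces of the same length, so every token offset found in the padded text still points at the same characters in the original input.

```python
import re

# Crude tag matcher, assumed for illustration only; it ignores comments,
# entities, and scripts, which a real HTML stripper would have to handle.
TAG = re.compile(r"<[^>]*>")


def strip_html_preserving_offsets(text):
    """Replace every tag with the same number of spaces (hypothetical helper)."""
    return TAG.sub(lambda m: " " * len(m.group()), text)


def whitespace_tokens(text):
    """Split on whitespace and return (token, start, end) triples."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


source = "<b>Solr</b> makes <i>Lucene</i> easy"
padded = strip_html_preserving_offsets(source)

# Offsets computed on the padded text are valid in the original markup.
for word, start, end in whitespace_tokens(padded):
    assert source[start:end] == word
```

One side effect to keep in mind: text that was separated only by a tag, such as 40</SUP>Ar, becomes two whitespace-separated tokens after padding, whereas removing the tag outright would give the single token 40Ar. Whether that matters depends on the tokenizer and the content.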
Re: Handling disparate data sources in Solr
Original problem statement: -- I'm considering using Solr to replace an existing bare-metal Lucene deployment - the current Lucene setup is embedded inside an existing monolithic webapp, and I want to factor out the search functionality into a separate webapp so it can be reused more easily. At present the content of the Lucene index comes from many different sources (web pages, documents, blog posts etc) and can be in different formats (plaintext, HTML, PDF etc). All the various content types are rendered to plaintext before being inserted into the Lucene index. The net result is that the data in one field in the index (say content) may have come from one of a number of source document types. I'm having difficulty understanding how I might map this functionality onto Solr. I understand how (for example) I could use HTMLStripStandardTokenizer to insert the contents of an HTML document into a field called content, but (assuming I'd written a PDF analyser) how would I insert the content of a PDF document into the same content field? I know I could do this by preprocessing the various document types to plaintext in the various Solr clients before inserting the data into the index, but that means that each client would need to know how to do the document transformation. As well as centralising the index, I also want to centralise the handling of the different document types. -- My initial suggestion, to get the discussion started, is to extend the doc and field elements with the following attributes: mime-type: MIME type of the document, e.g. application/pdf, text/html and so on. encoding: Encoding of the document, with base64 being the standard implementation. href: The URL of any document that can be accessed over HTTP, instead of embedding it in the indexing request. The indexer would fetch the document using the specified URL. There would then be entries in the configuration file that map each MIME type to a handler that is capable of dealing with that document type.
Thoughts? -- Alan Burlison --
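For discussion purposes, the MIME-type-to-handler mapping proposed above could be sketched roughly as follows. This is Python rather than Java, purely to keep it short, and nothing here is an existing Solr API: the handler functions, the HANDLERS table (standing in for the configuration file entries), and extract_field_text are all hypothetical names. The href case is left out because it would need a network fetch.

```python
import base64
import re

# Hypothetical handlers: each turns the raw bytes of one document type
# into plain text suitable for a single "content" field.


def handle_plain(raw):
    return raw.decode("utf-8")


def handle_html(raw):
    # Crude tag removal, assumed for illustration; then collapse whitespace.
    text = re.sub(r"<[^>]*>", " ", raw.decode("utf-8"))
    return " ".join(text.split())


# Stand-in for the configuration entries that map a MIME type to a handler.
HANDLERS = {
    "text/plain": handle_plain,
    "text/html": handle_html,
}


def extract_field_text(payload, mime_type, encoding="base64"):
    """Decode an embedded field value and dispatch on its MIME type."""
    if encoding != "base64":
        raise ValueError("unsupported encoding: " + encoding)
    handler = HANDLERS.get(mime_type)
    if handler is None:
        raise ValueError("no handler configured for " + mime_type)
    return handler(base64.b64decode(payload))


payload = base64.b64encode(b"<p>Hello <b>Solr</b></p>").decode("ascii")
content = extract_field_text(payload, "text/html")
```

A PDF handler would be one more entry in the table, which is the point of the design: clients only declare the MIME type, and the document conversion stays centralised on the server side.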
Re: [VOTE] graduate Solr to Lucene subproject
: [ ] +1 ask Lucene PMC and the Incubator PMC to graduate Solr from the : Incubator to become a Lucene subproject. +1 -Hoss
[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory
[ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462338 ] Hoss Man commented on SOLR-42: -- Suggestion from Mirko on solr-dev: change HTMLStripReader to replace stripped HTML with equal-length whitespace. (This could possibly be made a constructor option.) Highlighting problems with HTMLStripWhitespaceTokenizerFactory -- Key: SOLR-42 URL: https://issues.apache.org/jira/browse/SOLR-42 Project: Solr Issue Type: Bug Components: update Reporter: Andrew May Indexing content that contains HTML markup causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable). Example title field: <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags). Response from Yonik on the solr-user mailing-list: HTMLStripWhitespaceTokenizerFactory works in two phases... HTMLStripReader removes the HTML and passes the result to WhitespaceTokenizer... at that point, Tokens are generated, but the offsets will correspond to the text after HTML removal, not before. I did it this way so that HTMLStripReader could go before any tokenizer (like StandardTokenizer). Can you open a JIRA bug for this? The fix would be a special version of HTMLStripReader integrated with a WhitespaceTokenizer to keep offsets correct.
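The offset drift described in this issue can be reproduced with a small Python sketch. It is only an approximation of the two-phase behaviour (strip first, tokenize second), not the Solr code, and it assumes the example title originally carried <SUP> and </SUP> tags, which is consistent with the reported 22-character shift. The helper name and the regular expression are illustrative.

```python
import re

# Crude tag matcher, assumed for illustration only.
TAG = re.compile(r"<[^>]*>")


def strip_then_tokenize(text):
    """Phase 1 removes tags; phase 2 tokenizes the stripped text on whitespace.

    The returned offsets therefore refer to the stripped text, not the input.
    """
    stripped = TAG.sub("", text)
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", stripped)]


# Assumed reconstruction of the example title field, with its markup.
title = ("<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic "
         "fabrics in a polyorogenic terrane of NW Iberia")

tokens = strip_then_tokenize(title)
word, start, end = next(t for t in tokens if t[0] == "fabrics")

# Where the word really starts in the stored field value.
true_start = title.index("fabrics")

# Total length of the removed tags: 5 + 6 + 5 + 6 characters.
drift = true_start - start
print(drift)  # 22
```

A highlighter that applies start and end to the stored (unstripped) value will wrap the wrong span, which matches the misplaced tags reported above.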