Re: Vote on merging dev of Lucene and Solr
+1

On Thu, Mar 4, 2010 at 6:32 PM, Mark Miller markrmil...@gmail.com wrote:
For those committers that don't follow the general mailing list, or don't follow it that closely, we are currently having a vote for committers:
http://search.lucidimagination.com/search/document/4722d3144c2e3a8b/vote_merge_lucene_solr_development

--
- Mark
http://www.lucidimagination.com

--
- Noble Paul | Systems Architect | AOL | http://aol.com

To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: Solr 1.5 or 2.0?
Option 3 looks best. But do we plan to remove anything we have not already marked as deprecated?

On Thu, Nov 19, 2009 at 8:10 PM, Uwe Schindler u...@thetaphi.de wrote:
We also had some (maybe helpful) opinions :-)

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Thursday, November 19, 2009 3:31 PM
To: java-dev@lucene.apache.org
Subject: Re: Solr 1.5 or 2.0?

Oops... of course I meant to post this in solr-dev.

-Yonik
http://www.lucidimagination.com

On Wed, Nov 18, 2009 at 8:53 PM, Yonik Seeley yo...@lucidimagination.com wrote:
What should the next version of Solr be? Options:
- have a Solr 1.5 with a Lucene 2.9.x
- have a Solr 1.5 with a Lucene 3.x, with weaker back compat given all of the removed Lucene deprecations from 2.9-3.0
- have a Solr 2.0 with a Lucene 3.x

-Yonik
http://www.lucidimagination.com

--
- Noble Paul | Principal Engineer | AOL | http://aol.com
Re: Solr 1.5 or 2.0?
On Fri, Nov 20, 2009 at 6:30 AM, Ryan McKinley ryan...@gmail.com wrote:

On Nov 19, 2009, at 3:34 PM, Mark Miller wrote:

Ryan McKinley wrote:
I would love to set goals that are ~3 months out so that we don't have another 1-year release cycle. For a 2.0 release where we could have more back-compatibility flexibility, I would love to see some work that may be too ambitious... In particular, the config spaghetti needs some attention. I don't see the need to increment Solr to 2.0 for the Lucene 3.0 change -- of course that needs to be noted, but incrementing the major number in Solr only makes sense if we are going to change *Solr* significantly.

Lucene major numbers don't work that way, and I don't think Solr needs to work that way by default. I think major numbers are better for indicating backwards-compat issues than major features, with the way these projects work. Which is why Yonik mentions 1.5 with weaker back compat -- it's not just the fact that we are going to Lucene 3.x, it's that Solr still relies on some of the APIs that won't be around in 3.x, and they are not all trivial to remove, or to remove while preserving back compat.

I confess I don't know the details of the changes that have not yet been integrated in Solr -- the only Lucene changes I am familiar with are what was required for Solr 1.4. The Lucene 2.x - 3.0 upgrade path seems independent of that to me. I would even argue that with Solr 1.4 we have already required many Lucene 3.0 changes -- all my custom Lucene stuff had to be reworked to work with Solr 1.4 (tokenizers, multi-reader filters).

Many -- but certainly not all.

Just my luck... I'm batting 1000 :) But that means my code can upgrade to 3.0 without an issue now!

In general, I wonder where the Solr back-compatibility contract applies (and to what degree). For Solr, I would rank the importance as:
#1 - the URL API syntax.
Client query parameters should change as little as possible.
#2 - configuration
#3 - Java APIs

Someone else would likely rank it differently -- not everyone using Solr even uses HTTP with it. Someone heavily involved in custom plugins might care more about that than config. As a dev, I just plainly rank them all as important and treat them on a case-by-case basis.

I think it is fair to suggest that people will have the most stable/consistent/seamless upgrade path if you stick to the HTTP API (and by extension most of the solrj API). I am not suggesting that the Java APIs are not important and that back-compatibility is not important. Solr has some APIs with a clear purpose, place, and intended use -- we need to take these very seriously. We also have lots of APIs that are half-baked and loosey-goosey. If a developer is working on the edges, I think it is fair to expect more hiccups in the upgrade path. With that in mind, I think 'solr 1.5 with lucene 3.x' makes the most sense.

Unless we see ourselves making serious changes to Solr that would warrant a major release bump, Solr 1.5 with Lucene 3.x is a good option. Solr 2.0 can have non-back-compat changes for Solr itself, e.g. removing the single-core option, changing configuration, REST API changes, etc.

What is a serious change that would warrant a bump, in your opinion?

For example:
- config overhaul: detangle the XML from the components, perhaps using Spring. (This is already done. No components read config from XML anymore -- SOLR-1198.)
- major URL request changes: perhaps we change things to be more RESTful -- perhaps let Jersey take care of the URL/request building: https://jersey.dev.java.net/
- perhaps OSGi support/control/configuration

Lucene has an explicit back-compatibility contract: http://wiki.apache.org/lucene-java/BackwardsCompatibility
I don't know if Solr has one...
If we make one, I would like it to focus on the URL syntax + configuration.

It's not nice to give people plugins and then not worry about back compat for them :)

I want to be nice. I just think that a different back-compatibility contract applies for Solr than for Lucene. It seems reasonable to consider the HTTP API, configs, and Java API independently. From my perspective, saying Solr 1.5 uses Lucene 3.0 implies everything a plugin developer using Lucene APIs needs to know about the changes.

To be clear, I am not against bumping to Solr 2.0 -- I just have high aspirations (yet little time) for what a 2.0 bump could mean for Solr.

ryan

--
- Noble Paul | Principal Engineer | AOL | http://aol.com
Re: Blob storage
On Fri, Dec 26, 2008 at 2:11 PM, Babak Farhang farh...@gmail.com wrote:
Most of all, I'm trying to communicate an *idea* which itself cannot be encumbered by any license, anyway. But if you want to incorporate some of this code into an asf project, I'd be happy to also release it under the apache license. Hope the license I chose for my project doesn't get in the way of this conversation..

It would be more useful if the user could store the data using his own id (a long). This forces me to have a mapping of key -> skwish_id elsewhere, which is a serious limitation. For retrieval it means I may need to do a lookup, and that can be costly if I have tens of millions of records. BTW, the license is a problem.

On Fri, Dec 26, 2008 at 12:46 AM, Noble Paul നോബിള് नोब्ळ् noble.p...@gmail.com wrote:
The license is GPL. It cannot be used directly in any Apache projects.

On Fri, Dec 26, 2008 at 12:47 PM, Babak Farhang farh...@gmail.com wrote:
I assume one could use Skwish instead of Lucene's normal stored fields to store/retrieve document data?

Exactly: instead of storing the field's value directly in Lucene, you could store it in skwish and then store its skwish id in the Lucene field instead. This works well for serving large streams (e.g. original document contents).

Have you run any threaded performance tests comparing the two?

No direct comps, yet.
-b

On Thu, Dec 25, 2008 at 5:22 AM, Michael McCandless luc...@mikemccandless.com wrote:
This looks interesting! I assume one could use Skwish instead of Lucene's normal stored fields to store/retrieve document data? Have you run any threaded performance tests comparing the two?

Mike

Babak Farhang farh...@gmail.com wrote:
Hi everyone, I've been working on a library called Skwish to complement indexes like Lucene, for blob storage and retrieval. This is nothing more than a structured implementation of storing all the files in one file and managing their offsets in another.
The idea is to provide a fast, concurrent, lock-free way to serve lots of files to lots of users. Hope you find it useful or interesting.

-Babak
http://skwish.sourceforge.net/

--
--Noble Paul
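The layout Babak describes -- every blob appended to one file, with the offsets kept in another -- can be sketched in plain Java. This is an in-memory stand-in, not the actual Skwish API; the class and method names are invented for illustration:

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// In-memory sketch of the two-file layout: each blob is appended to a
// single data stream, and a separate offset list records where entries
// end, so entry i occupies bytes [offsets.get(i), offsets.get(i+1)).
class BlobStore {
    private final ByteArrayOutputStream data = new ByteArrayOutputStream();
    private final List<Integer> offsets = new ArrayList<>();

    BlobStore() {
        offsets.add(0); // sentinel: the first entry starts at byte 0
    }

    /** Appends a blob and returns its sequentially assigned id. */
    synchronized int add(byte[] blob) {
        data.write(blob, 0, blob.length);
        offsets.add(data.size());
        return offsets.size() - 2; // id of the entry just written
    }

    /** Retrieves a blob by id: two offset lookups, then one contiguous read. */
    synchronized byte[] get(int id) {
        int start = offsets.get(id);
        int end = offsets.get(id + 1);
        byte[] all = data.toByteArray(); // a file-backed store would seek instead of copying
        byte[] out = new byte[end - start];
        System.arraycopy(all, start, out, 0, out.length);
        return out;
    }
}
```

The id returned by add(...) is what one would put in a Lucene stored field in place of the blob itself, along the lines Babak suggests.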
Re: Blob storage
On Fri, Dec 26, 2008 at 10:05 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:
Similar thoughts here. I don't have ML thread pointers nor JIRA issue pointers, but there has been discussion in this area before, and I believe the thinking was that what's needed is a general interface/abstraction/API for storing and loading field data to an external component, be that a BDB, an RDBMS, or something like Skwish. I *think* that often came up in the context of Document updates (as opposed to delete+add).

This is an area of interest for me as well -- SOLR-828.

I didn't look at Skwish, but I think this is the direction to explore, Babak, esp. if we can come up with something that lets one plug in other types of storage, as well as deal with the transaction-type stuff that Ian mentioned.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message
From: Ian Holsman li...@holsman.net
To: java-dev@lucene.apache.org
Sent: Friday, December 26, 2008 5:40:36 AM
Subject: Re: Blob storage

Babak Farhang wrote:
Most of all, I'm trying to communicate an *idea* which itself cannot be encumbered by any license, anyway. But if you want to incorporate some of this code into an asf project, I'd be happy to also release it under the apache license. Hope the license I chose for my project doesn't get in the way of this conversation..

As an idea, let me offer some thoughts:
- There will be a trade-off where reading the info from a 2nd system would be slower than just a single call which has all the results, especially if you have to fetch a couple of these things.
- How is this different from BDB and a UUID? Couldn't you just store it using that?
- How are you going to deal with situations where the commit fails in Lucene? Does the client have to recognize this and roll back skwish?
- There will need to be some kind of reconciliation process to deal with inconsistencies where someone forgets to delete the skwish object when they have deleted the Lucene record.

On a positive note, it would shrink the index size and allow more records to fit in memory.

Regards
Ian

On Fri, Dec 26, 2008 at 12:46 AM, Noble Paul നോബിള് नोब्ळ् wrote:
The license is GPL. It cannot be used directly in any Apache projects.

On Fri, Dec 26, 2008 at 12:47 PM, Babak Farhang wrote:
I assume one could use Skwish instead of Lucene's normal stored fields to store/retrieve document data?

Exactly: instead of storing the field's value directly in Lucene, you could store it in skwish and then store its skwish id in the Lucene field instead. This works well for serving large streams (e.g. original document contents).

Have you run any threaded performance tests comparing the two?

No direct comps, yet.
-b

On Thu, Dec 25, 2008 at 5:22 AM, Michael McCandless wrote:
This looks interesting! I assume one could use Skwish instead of Lucene's normal stored fields to store/retrieve document data? Have you run any threaded performance tests comparing the two?

Mike

Babak Farhang wrote:
Hi everyone, I've been working on a library called Skwish to complement indexes like Lucene, for blob storage and retrieval. This is nothing more than a structured implementation of storing all the files in one file and managing their offsets in another. The idea is to provide a fast, concurrent, lock-free way to serve lots of files to lots of users. Hope you find it useful or interesting.
-Babak
http://skwish.sourceforge.net/

--
--Noble Paul
Re: Blob storage
The license is GPL. It cannot be used directly in any Apache projects.

On Fri, Dec 26, 2008 at 12:47 PM, Babak Farhang farh...@gmail.com wrote:
I assume one could use Skwish instead of Lucene's normal stored fields to store/retrieve document data?

Exactly: instead of storing the field's value directly in Lucene, you could store it in skwish and then store its skwish id in the Lucene field instead. This works well for serving large streams (e.g. original document contents).

Have you run any threaded performance tests comparing the two?

No direct comps, yet.
-b

On Thu, Dec 25, 2008 at 5:22 AM, Michael McCandless luc...@mikemccandless.com wrote:
This looks interesting! I assume one could use Skwish instead of Lucene's normal stored fields to store/retrieve document data? Have you run any threaded performance tests comparing the two?

Mike

Babak Farhang farh...@gmail.com wrote:
Hi everyone, I've been working on a library called Skwish to complement indexes like Lucene, for blob storage and retrieval. This is nothing more than a structured implementation of storing all the files in one file and managing their offsets in another. The idea is to provide a fast, concurrent, lock-free way to serve lots of files to lots of users. Hope you find it useful or interesting.

-Babak
http://skwish.sourceforge.net/

--
--Noble Paul
Re: Mark Miller as core Lucene committer
Congrats Mark. You have been a great contributor to the Solr community as well.

On Sat, Nov 22, 2008 at 7:09 AM, Michael Busch [EMAIL PROTECTED] wrote:
Welcome Mark! Good to have you on board!
-Michael

Grant Ingersoll wrote:
Please welcome Mark Miller as a core Lucene committer. For a while now, Mark has been a contrib committer and has recently stepped up his efforts in contributions to the core. In recognition, the PMC has voted to make him a core committer.

Cheers,
Grant

--
--Noble Paul
Re: Realtime Search for Social Networks Collaboration
Moving back to the RDBMS model will be a big step backwards, where we miss multivalued fields and arbitrary fields.

On Tue, Sep 9, 2008 at 4:17 AM, Jason Rutherglen [EMAIL PROTECTED] wrote:
Cool. I mention H2 because it does have some Lucene code in it, yes. Also, according to some benchmarks it's the fastest of the open source databases. I think it's possible to integrate realtime search for H2. I suppose there is no need to store the data in Lucene in this case? One loses the multiple values per field Lucene offers, and the schema becomes static. Perhaps it's a trade-off?

On Mon, Sep 8, 2008 at 6:17 PM, J. Delgado [EMAIL PROTECTED] wrote:
Yes, both Marcelo and I would be interested. We looked into H2 and it looks like something similar to Oracle's ODCI can be implemented. Plus, the primitive full-text implementation is based on Lucene. I say primitive because, looking at the code, I saw that one cannot define an Analyzer, and for each scan corresponding to a where clause a searcher is opened and closed instead of having a pool; plus it does not have any way to queue changes to reduce the use of the IndexWriter, etc. But it's open source and that is a great starting point!
-- Joaquin

On Mon, Sep 8, 2008 at 2:05 PM, Jason Rutherglen [EMAIL PROTECTED] wrote:
Perhaps an interesting project would be to integrate Ocean with H2 www.h2database.com to take advantage of both models. I'm not sure how exactly that would work, but it seems like it would not be too difficult. Perhaps this would solve being able to perform faster hierarchical queries, and perhaps other types of queries that Lucene is not capable of. Is this something, Joaquin, you are interested in collaborating on? I am definitely interested in it.

On Sun, Sep 7, 2008 at 4:04 AM, J.
Delgado [EMAIL PROTECTED] wrote:

On Sat, Sep 6, 2008 at 1:36 AM, Otis Gospodnetic [EMAIL PROTECTED] wrote:
Regarding real-time search and Solr, my feeling is the focus should be on first adding real-time search to Lucene, and then we'll figure out how to incorporate that into Solr later.

Otis, what do you mean exactly by adding real-time search to Lucene? Note that Lucene, being an indexing/search library (and not a full-blown search engine), is by definition real-time: once you add/write a document to the index it becomes immediately searchable, and if a document is logically deleted it is no longer returned in a search, though physical deletion happens during an index optimization.

Now, the problem of adding/deleting documents in bulk, as part of a transaction, and making these documents available for search immediately after the transaction is committed sounds more like a search engine problem (i.e. SOLR, Nutch, Ocean), especially if these transactions are known to be I/O expensive and thus are usually implemented as batched processes with some kind of sync mechanism, which makes them non-real-time.

For example, in my previous life, I designed and helped implement a quasi-realtime enterprise search engine using Lucene, having a set of multi-threaded indexers hitting a set of multiple indexes allocated across different search services which powered a broker-based distributed search interface. The most recent documents provided to the indexers were always added to the smaller in-memory (RAM) indexes, which usually could absorb the load of a bulk add transaction and later would be merged into larger disk-based indexes and then flushed to make them ready to absorb new fresh docs.
We even had further partitioning of the indexes that reflected time periods, with caps on size, for them to be merged into older, more archive-based indexes which were used less (yes, the search engine's default search was on data no more than 1 month old, though the user could open the time window by including archives).

As for SOLR and OCEAN, I would argue that these semi-structured search engines are becoming more and more like relational databases with full-text search capabilities (without the benefit of full relational algebra -- for example, joins are not possible using SOLR). Notice that real-time CRUD operations and transactionality are core DB concepts and have been studied and developed by database communities for quite a long time. There have been recent efforts on how to efficiently integrate Lucene into relational databases (see the Lucene JVM ORACLE integration: http://marceloochoa.blogspot.com/2007/09/running-lucene-inside-your-oracle-jvm.html).

I think we should seriously look at joining efforts with open-source database engine projects written in Java (see http://java-source.net/open-source/database-engines) in order to blend IR and ORM for once and for all.
-- Joaquin

I've read Jason's Wiki as well. Actually, I had to read it a number of times to understand bits and pieces of it. I have to admit there is still
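The RAM-plus-disk pattern Joaquin describes -- fresh documents searchable immediately from a small in-memory buffer that is later merged into larger on-disk segments -- can be reduced to a toy sketch. The names and the size-cap merge policy here are invented for illustration; this is not Lucene, Solr, or Ocean code:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of a quasi-realtime index: new docs land in a small
// in-memory segment and are searchable at once; when the segment
// reaches its cap, it is merged into the larger "disk" segment.
class QuasiRealtimeIndex {
    private final List<String> ramSegment = new ArrayList<>();
    private final List<String> diskSegment = new ArrayList<>();
    private final int ramCap;

    QuasiRealtimeIndex(int ramCap) {
        this.ramCap = ramCap;
    }

    synchronized void add(String doc) {
        ramSegment.add(doc);                // searchable right away
        if (ramSegment.size() >= ramCap) {  // absorb the burst, then merge
            diskSegment.addAll(ramSegment);
            ramSegment.clear();
        }
    }

    /** Searches both segments so just-added docs are always visible. */
    synchronized List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        for (String d : diskSegment) if (d.contains(term)) hits.add(d);
        for (String d : ramSegment) if (d.contains(term)) hits.add(d);
        return hits;
    }
}
```

A real system would replace the string lists with Lucene RAM and disk indexes and do the merge on a background thread, but the visibility rule -- query both the fresh buffer and the merged store -- is the same.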
Re: ThreadLocal causing memory leak with J2EE applications
Why do you need to keep a strong reference? Why not a WeakReference?
--Noble

On Wed, Sep 10, 2008 at 12:27 AM, Chris Lu [EMAIL PROTECTED] wrote:
The problem should be similar to what's talked about in this discussion: http://lucene.markmail.org/message/keosgz2c2yjc7qre?q=ThreadLocal

There is a memory leak in Lucene search from LUCENE-1195 (svn r659602, May 23, 2008). This patch brings in a ThreadLocal cache to TermInfosReader.

It's usually recommended to keep the reader open and reuse it when possible. In a common J2EE application, the HTTP requests are usually handled by different threads. But since the cache is ThreadLocal, the cache is not really usable by other threads. What's worse, the cache cannot be cleared by another thread!

This leak is not so obvious usually. But my case is using RAMDirectory, having several hundred megabytes, so one un-released resource is obvious to me. Here is the reference tree:

org.apache.lucene.store.RAMDirectory
 |- directory of org.apache.lucene.store.RAMFile
    |- file of org.apache.lucene.store.RAMInputStream
       |- base of org.apache.lucene.index.CompoundFileReader$CSIndexInput
          |- input of org.apache.lucene.index.SegmentTermEnum
             |- value of java.lang.ThreadLocal$ThreadLocalMap$Entry

After I switched back to svn revision 659601, right before this patch was checked in, the memory leak is gone. Although my case is RAMDirectory, I believe this will affect disk-based indexes also.

-- Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 Million Euro funding!

--
--Noble Paul
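Chris's point -- that a ThreadLocal cache filled by one request thread can neither be reused nor cleared by another -- follows directly from ThreadLocal semantics, which a few lines can demonstrate. The byte[] here is just a stand-in for the per-thread TermInfosReader cache; the class name is invented for illustration:

```java
// Each thread gets its own value from a ThreadLocal, so a cache filled
// by one request thread is invisible to every other thread -- and it
// stays strongly reachable for as long as the owning thread is alive.
class ThreadLocalIsolation {
    static final ThreadLocal<byte[]> CACHE =
        ThreadLocal.withInitial(() -> new byte[1024]); // stand-in for the per-thread cache

    /** Returns the cache instance as observed from a freshly started thread. */
    static byte[] valueOnNewThread() {
        final byte[][] box = new byte[1][];
        Thread worker = new Thread(() -> box[0] = CACHE.get());
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return box[0];
    }
}
```

In a J2EE container the worker thread is pooled and never dies, which is exactly why the per-thread entry (and the hundreds of megabytes it references) never becomes collectable.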
Re: ThreadLocal causing memory leak with J2EE applications
When I look at the reference tree, that is the feeling I get. If you held a WeakReference it would get released:

 |- base of org.apache.lucene.index.CompoundFileReader$CSIndexInput
 |- input of org.apache.lucene.index.SegmentTermEnum
 |- value of java.lang.ThreadLocal$ThreadLocalMap$Entry

On Wed, Sep 10, 2008 at 8:39 PM, Chris Lu [EMAIL PROTECTED] wrote:
Does this make any difference? If intentionally closing the searcher and reader failed to release the memory, I cannot rely on some magic of the JVM to release it.

-- Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 Million Euro funding!

On Wed, Sep 10, 2008 at 4:03 AM, Noble Paul നോബിള് नोब्ळ् [EMAIL PROTECTED] wrote:
Why do you need to keep a strong reference? Why not a WeakReference?
--Noble

On Wed, Sep 10, 2008 at 12:27 AM, Chris Lu [EMAIL PROTECTED] wrote:
The problem should be similar to what's talked about in this discussion: http://lucene.markmail.org/message/keosgz2c2yjc7qre?q=ThreadLocal

There is a memory leak in Lucene search from LUCENE-1195 (svn r659602, May 23, 2008). This patch brings in a ThreadLocal cache to TermInfosReader. It's usually recommended to keep the reader open and reuse it when possible. In a common J2EE application, the HTTP requests are usually handled by different threads. But since the cache is ThreadLocal, the cache is not really usable by other threads. What's worse, the cache cannot be cleared by another thread! This leak is not so obvious usually. But my case is using RAMDirectory, having several hundred megabytes, so one un-released resource is obvious to me.
Here is the reference tree:

org.apache.lucene.store.RAMDirectory
 |- directory of org.apache.lucene.store.RAMFile
    |- file of org.apache.lucene.store.RAMInputStream
       |- base of org.apache.lucene.index.CompoundFileReader$CSIndexInput
          |- input of org.apache.lucene.index.SegmentTermEnum
             |- value of java.lang.ThreadLocal$ThreadLocalMap$Entry

After I switched back to svn revision 659601, right before this patch was checked in, the memory leak is gone. Although my case is RAMDirectory, I believe this will affect disk-based indexes also.

-- Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got 2.6 Million Euro funding!

--
--Noble Paul
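Noble's WeakReference suggestion can be sketched as a thin wrapper (a hypothetical class, not the actual TermInfosReader patch): the ThreadLocal holds the cached value only weakly, so once the owner drops its strong reference -- e.g. when the reader is closed -- the GC is free to clear the per-thread entry instead of pinning it to a pooled thread forever.

```java
import java.lang.ref.WeakReference;

// A ThreadLocal cache that holds its value only through a WeakReference.
// While the owner keeps a strong reference to the value (reader open),
// get() returns it; once that strong reference is gone (reader closed),
// the GC may clear the entry, so hundreds of megabytes are no longer
// kept alive by an idle pooled thread.
class WeakThreadLocalCache<T> {
    private final ThreadLocal<WeakReference<T>> slot = new ThreadLocal<>();

    void set(T value) {
        slot.set(new WeakReference<>(value));
    }

    /** Returns the cached value, or null if never set or already collected. */
    T get() {
        WeakReference<T> ref = slot.get();
        return ref == null ? null : ref.get();
    }
}
```

When the GC actually clears the entry is up to the collector; the point, as in Noble's suggestion, is only that the cache no longer forces the value to stay reachable after the reader is closed.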