EmbeddedSolrServer and BinaryRequestWriter
I'm trying to reduce memory usage when indexing, and it looks like using the binary request format may be a good way to do this. Unfortunately I can't see a way to use it with EmbeddedSolrServer, since only CommonsHttpSolrServer has a setRequestWriter method. If I'm running out of memory constructing XML request documents, does that mean I just have to switch away from EmbeddedSolrServer?

I understand I can stream requests if I'm just indexing files already on disk, but I'm constructing documents on the fly. I run out of memory while constructing the XML document to submit to Solr, not during the actual indexing, so it seems writing the document to disk first would run into the same problem.

thanks,
Phil
Date ranges for indexes constructed outside Solr
I'm working on an application that will build indexes directly using the Lucene API but expose them to clients through Solr. I'm seeing plenty of documentation on how to support date range fields in Solr, but it all assumes that you are inserting documents through Solr rather than merging already-generated indexes. Where can I find details about the Lucene-level field operations that can be used to generate date fields Solr will work with? In particular, date resolution settings are unclear.

On a similar note: how much of schema.xml is relevant in cases where Solr is not performing insertions? Obviously defaultSearchField is, as is the solrQueryParser defaultOperator attribute, but it seems like most of the field declarations might not matter.

thanks,
Phil
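For what it's worth, here's a sketch of what I've pieced together so far; treat it as a guess rather than an answer. Solr's date type presents dates externally as a restricted ISO-8601 string, always in UTC, so the fixed-width form sorts lexicographically in chronological order. The exact *indexed* (internal) form may differ from the external one (DateField.toInternal in the Solr source is the authority), so anything like this should be verified against an index that Solr itself built:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Sketch: format a java.util.Date the way Solr's DateField presents dates
// externally (restricted ISO-8601, always UTC). The indexed internal form
// may differ slightly (check DateField.toInternal in the Solr source), so
// verify against a Solr-built index before relying on this for merging.
public class SolrDateSketch {
    static String toSolrDate(Date d) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(d);
    }

    public static void main(String[] args) {
        System.out.println(toSolrDate(new Date(0L)));
    }
}
```

At the Lucene level the idea would then be to index this string as an unanalyzed field (e.g. Field.Index.NOT_ANALYZED) so string range queries line up chronologically, but again, that's an assumption on my part, not something I've confirmed against the Solr source.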
core size
I'm planning out a system with large indexes and wondering what kind of performance boost I'd see if I split documents out into many cores rather than using a single core and partitioning by a field. I've got about 500GB worth of indexes, ranging from 100MB to 50GB each. I'm assuming that if we split them out into multiple cores we'd see the most dramatic benefit in searches on the smaller cores; I'm just wondering what level of speedup I should expect. Eventually the cores will be split up anyway; I'm just trying to determine how to prioritize it.

thanks,
Phil
Re: no .war with ubuntu release ?
On Thu, Jun 18, 2009 at 4:00 PM, Jonathan Vanasco wrote:

> can anyone give me a suggestion ? i haven't touched java / jetty / tomcat /
> whatever in at least a good 8 years and am lost.

I spent a lot of time trying to get this working too. My conclusion was simply that the .deb packages for Solr are unmaintained and have fallen victim to bitrot. You'll have a much easier time getting it from a Maven repository or just downloading a binary release.

If it isn't going to be fixed, I wish it would be removed from the Ubuntu repositories; its presence there seems to cause more harm than good.

-Phil
Re: Replication problems on 1.4
Phil Hagelberg writes:

> Noble Paul നോബിള് नोब्ळ् writes:
>
>> if you removed the files while the slave is running, then the slave
>> will not know that you removed the files (assuming it is a *nix box)
>> and it will serve the search requests. But if you restart the slave,
>> it should have automatically picked up the current index.
>>
>> if it doesn't it is a bug
>
> I did restart the slave server in my case. If I can confirm this with
> the latest build from trunk, I will submit an issue.

Hmm... I can't reproduce this with a fresh checkout and my indices recreated from it. Maybe it was something specifically misconfigured in my last setup.

-Phil
Re: Replication problems on 1.4
Noble Paul നോബിള് नोब्ळ् writes:

> if you removed the files while the slave is running, then the slave
> will not know that you removed the files (assuming it is a *nix box)
> and it will serve the search requests. But if you restart the slave,
> it should have automatically picked up the current index.
>
> if it doesn't it is a bug

I did restart the slave server in my case. If I can confirm this with the latest build from trunk, I will submit an issue.

-Phil
Re: Replication problems on 1.4
Shalin Shekhar Mangar writes:

> You are right. In Solr/Lucene, a commit exposes updates to searchers. So you
> need to call commit on the master for the slave to pick up the changes.
> Replicating changes from the master and then not exposing new documents to
> searchers does not make sense. However, there is a lot of work going on in
> Lucene to enable near real-time search (exposing documents to searchers as
> soon as possible). Once those features are mature enough, Solr's replication
> will follow suit.

I understand that; it's totally reasonable. What it doesn't explain is what happened in my case: the master added a bunch of docs, committed, and then the slave replicated fine. Then the slave lost all its data (because I issued an rm -rf on the data directory, but let's say it had happened due to a disk failure or something) and tried to replicate again, but got zero docs. Only once another commit was issued on the master could the slave replicate properly again.

I would expect that the slave should be able to replicate after losing its data but before the second commit. I can see why the master would not expose uncommitted documents, but I can't see why it would refuse to allow _any_ of its index to be replicated from. I feel like I'm missing a piece of the picture here.

-Phil
Re: Replication problems on 1.4
Phil Hagelberg writes:

> My only guess as to what's going wrong here is that deleting the
> coreN/data directory is not a good way to "reset" a core back to its
> initial condition. Maybe there's a bit of state somewhere that's making
> the slave think that it's already up-to-date with this master and so it
> doesn't need to do any replicating? But this is a wild conjecture; I'd
> appreciate any tips on where to look for what's going wrong.

OK, so I inserted some more documents into the master, and now replication works. I get the feeling it may be due to the replicateAfter setting in the master's solrconfig.xml, which is set to "commit".

Now this is confusing, since it seems that the timing of replication is not up to the master; it's up to the slave. The slave's config has settings for the interval at which to replicate, and you POST to the slave to force a replication. So why is there a setting on the master to control when replication happens? My only interpretation from the config files is that the master has some sort of "you may not replicate from me unless..." condition. This seems pretty undesirable, since you may have a slave that needs to replicate from the master immediately; it shouldn't have to wait for a commit on the master.

Am I misunderstanding what's going on here? It certainly isn't clear from the documents on the wiki, so I'm kind of grasping in the dark. Perhaps I'm missing something.

thanks,
Phil Hagelberg
http://technomancy.us
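P.S. For anyone else hitting this, the setting in question lives in the ReplicationHandler section of the master's solrconfig.xml; mine looks roughly like this (reconstructed from memory, so double-check it against the wiki):

```xml
<!-- Master-side replication config, based on the Solr 1.4 example. -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <!-- publish a new index version to slaves only after a commit -->
    <str name="replicateAfter">commit</str>
    <!-- config files to ship alongside the index -->
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>
```

The wiki also mentions a "startup" value for replicateAfter, which sounds like it would cover the case of a slave needing a full copy after the master restarts; I haven't tried it yet.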
Replication problems on 1.4
I'm trying out the replication features on 1.4 (trunk) with multiple indices, using a setup based on the example multicore config. The first time I tried it (replicating through the admin web interface), it worked fine. I was a little surprised that telling one core to replicate caused both to replicate, since the docs seem to imply that replication is done on a per-core basis, but I was happy to see that it worked.

I wanted to replay my steps, so on the slave machine I deleted core0/data/* and core1/data/* and restarted the server. I restarted the server on the master just to be sure. Now replication doesn't work at all. I've tried it both through the admin interface and with curl:

  curl http://localhost:8983/solr/core0/replication?command=snappull

The response from curl indicates that the replication was successful, but nothing happened; my slave index is still empty.

My only guess as to what's going wrong here is that deleting the coreN/data directory is not a good way to "reset" a core back to its initial condition. Maybe there's a bit of state somewhere that's making the slave think it's already up-to-date with this master, so it doesn't need to do any replicating? But this is a wild conjecture; I'd appreciate any tips on where to look for what's going wrong. As for why the replication claims to be successful, I've no idea. Am I missing some crucial log file that explains what's going wrong?

It's also possible that this stuff is still in a heavy enough state of development that it shouldn't be expected to work for casual users; if that's the case I can go back to the external-script-based replication features of 1.3.

thanks,
Phil Hagelberg
http://technomancy.us
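P.S. In case it matters, the slave-side config I'm using is based on the example setup; roughly (from memory, with a placeholder master hostname):

```xml
<!-- Slave-side replication config, based on the Solr 1.4 example;
     master-host is a placeholder for the actual master's hostname. -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- each core points at the matching core's replication handler -->
    <str name="masterUrl">http://master-host:8983/solr/core0/replication</str>
    <!-- poll every 60 seconds; the snappull command forces a check sooner -->
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```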
Schema vs Dynamic Fields
On the wiki, it says:

> One of the powerful features of Lucene is that you don't have to
> pre-define every field when you first create your index. Even though
> Solr provides strong datatyping for fields, it still preserves that
> flexibility using "Dynamic Fields".

Is the use of a predefined schema primarily a "type safety" feature? We're considering using Solr for a data set that is very free-form; will we get much slower results if the majority of our data lives in dynamic fields? I'm a little unclear on the trade-offs involved and would appreciate a hint.

Phil Hagelberg
http://technomancy.us
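P.S. By dynamic fields I mean catch-all declarations along the lines of the ones in the stock example schema, e.g.:

```xml
<!-- From the example schema.xml: any field whose name ends in _s is
     treated as an indexed, stored string without an explicit declaration. -->
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
```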