Hi everyone, I thought I'd document my process of getting set up with Solr 4.3.0 on a Linux server in case it's of use to anyone. I'm a moderately experienced Linux system administrator, so without passing judgment (at least for now), let me just say that I found getting Solr to work to be extremely difficult--more difficult than just about any other package I've ever dealt with, including ones I've built from source.
I downloaded the .tgz file from the Apache site without a problem and decompressed it into its own directory. I was surprised to find the unconventional (at least in my experience) directory structure, where all of the important files are contained in the "example" directory (and really a second "solr" directory under that). Following the directions I'd read on-line and in the README, I got it running pretty quickly with "java -jar start.jar" and went to the web interface on port 8983 of my server. Here's where the problems began.
First, a note: the install wiki contains an error, or at least a very misleading piece of text, on the installation page (http://wiki.apache.org/solr/SolrInstall), one of many in the wiki. Port 8983 is indeed "a port other than 8080." (And since I'm talking about errors, "containers" should have an apostrophe.)
The server status dashboard showed up fine, and I poked around to figure out what was what. In short order, I noticed that Solr had already thrown a warning on the Logging section about "/non/existent/dir/yields/warning", which didn't make much sense to me since I hadn't really done anything yet. I looked into that some more and wrote up a bug here: https://issues.apache.org/jira/browse/SOLR-4890. I don't think I've ever seen another piece of software that deliberately warned users that mistakes cause warnings, but I suppose there's a first time for everything.
Aside from that, I tried posting documents to the example collection1, which amazingly worked, so, satisfied, I decided to delete it and make my own new collection. This was a mistake, apparently. The Solr web console can't function without at least one core at all times--but it doesn't tell you that until after you've deleted it and it's totally non-functional. To a novice, this is scary. Hence bug number two: https://issues.apache.org/jira/browse/SOLR-3633.
I didn't have any idea how to get Solr working again--there are way too many XML configuration files in way too many directories for a new user to figure them all out. So I just started from scratch by decompressing the .tgz file again, and went back to my default state, which again warned me about warnings. Now I knew not to delete the collection1 core. So I left it alone, and tried to make a new one of my own. This threw an error. The new core could not be created. Why? Because the user is expected to create a directory ahead of time corresponding to that core via the shell, at least according to Stefan Matheis in bug number three's discussion: https://issues.apache.org/jira/browse/SOLR-4461. If you look at the comments for that bug you'll see what I wrote there: "So I created a new folder with the name of the core I wanted in the same place that I found the collection1 folder. That didn't work. I got the same error. Then I looked at the README.txt file for the collection1 folder and saw that you are actually supposed to duplicate the collection1 folder for your new core. (In that case, the web UI, which really doesn't want you to delete collection1 anyway, should just treat collection1 as some kind of default template that you are encouraged to duplicate to create a new core.) So with the folder duplicated, I tried adding my new core again. It kind of worked. I got a new listing on the left-hand side, but I also got this new error: SolrCore Initialization Failures new_core: org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: Could not load config for solrconfig.xml" The suggested solution, involving file locking, did not help. 
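For anyone who would rather script the folder duplication described above than do it by hand, here is a minimal Java sketch. It is only an illustration, not an official Solr procedure: the class and method names are made up, and skipping the template's top-level "data" directory is an assumption on my part (that being where the example index would normally live, so that the new core does not start out with the example documents).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper: duplicates a template core folder (e.g. collection1)
// under a new name so it can then be registered as a new core.
class CoreTemplateCopy {

    // Recursively copies templateDir to newCoreDir, skipping the template's
    // top-level "data" directory (assumption: that is where the index lives).
    // Existing files in newCoreDir are never overwritten; Files.copy fails instead.
    static void copyCore(Path templateDir, Path newCoreDir) throws IOException {
        Path dataDir = templateDir.resolve("data");
        try (Stream<Path> paths = Files.walk(templateDir)) {
            // Files.walk visits a directory before its contents, so each
            // target directory exists by the time its files are copied.
            paths.filter(p -> !p.startsWith(dataDir))
                 .forEach(p -> {
                     Path target = newCoreDir.resolve(templateDir.relativize(p));
                     try {
                         if (Files.isDirectory(p)) {
                             Files.createDirectories(target);
                         } else {
                             Files.copy(p, target);
                         }
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: CoreTemplateCopy <template-core-dir> <new-core-dir>");
            return;
        }
        copyCore(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

This only produces the folder; the new core still has to be added through the Core Admin section, and any configuration problem in the copied files will surface at that point.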
solrconfig.xml wasn't loading because my data import handler wasn't loading because MySQL's configuration, which I had put into a data-config.xml file (which for some reason always has a hyphen in the docs when solrconfig.xml does not), wasn't able to be used because the MySQL Java connector wasn't loading because the connector was nowhere to be found. The wiki doesn't talk about this at all. The closest it gets is "You might need to download and install the Oracle JDBC Driver in the /lib directory of your Solr installation." under the heading of "Oracle Example." I don't use Oracle (even though MySQL is an Oracle product now). Only in some of the on-line documents I read did it even mention that there was something you had to download separately from MySQL at http://www.mysql.com/downloads/connector/j/, and the particular blog I found suggesting this told me to put the whole decompressed folder in the lib directory, which I soon learned was not going to help at all. In fact, you just needed the JAR file. But whether I put the JAR file in lib, or dist, or /var/share/java (with or without symlinks), or any folder at all, Solr refused to find it. Finally, I caught a clue from some of the lines referencing libraries in solrconfig.xml and realized I might have to specifically tell Solr to look for the MySQL library by including a line for it. First I tried using a basic regular expression of "mysql*\.jar", which didn't work. Then in a fit of desperation (by this point I had consumed many, many hours on getting seemingly nowhere with Solr) I tried "mysql-connector-java-\d.*\.jar" and it finally worked. I'm not sure why, as I think both expressions are valid. But it didn't really work. Solr could connect to MySQL, but that didn't mean it could import anything. 
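A likely explanation for the regular-expression puzzle above: in a regular expression, "mysql*" means "mysq" followed by zero or more "l" characters, not "mysql followed by anything" as it would in a shell glob. The sketch below checks both patterns against a hypothetical connector file name. It assumes the pattern has to match the whole file name, which is my assumption about how Solr applies it; the class name, method name, and version number in the file name are illustrative only.

```java
import java.util.regex.Pattern;

// Sketch: compare the two patterns from the post against a connector JAR name.
class LibRegexCheck {

    // Returns true only if the entire file name matches the pattern
    // (full-match semantics; assumed, not confirmed, to be what Solr does).
    static boolean matchesWholeName(String regex, String fileName) {
        return Pattern.compile(regex).matcher(fileName).matches();
    }

    public static void main(String[] args) {
        // Hypothetical file name; the real version suffix will differ.
        String jar = "mysql-connector-java-5.1.25.jar";

        // "l*" repeats only the letter l, so ".jar" would have to follow "mysql" directly.
        System.out.println(matchesWholeName("mysql*\\.jar", jar));                     // false

        // The pattern that worked: literal prefix, one digit, then anything, then ".jar".
        System.out.println(matchesWholeName("mysql-connector-java-\\d.*\\.jar", jar)); // true

        // The regex equivalent of the glob "mysql*.jar" needs ".*" rather than "*".
        System.out.println(matchesWholeName("mysql.*\\.jar", jar));                    // true
    }
}
```

Under that assumption both expressions are valid, but the first one only matches names such as "mysql.jar", which would explain why the JAR was never picked up.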
For that I had to set up a proper Solr schema, which took me a while to understand because the schema actually spans two XML configuration files, and because the "collection1" example schema seemed to handle both information from a hypothetical relational database table about products, and generic fields from Word documents. I planned to handle neither, but I did need to import multiple tables. So I deleted all of those fields from my schema and learned that doing so was a great way to make Solr crash, as in, fail to load your core at all and throw lots of Java exceptions. In particular, it didn't like that I had removed the id and _version_ fields. So, I put the fields back and then very carefully changed them until I got my queries to work (after many more hours). Solr crashed some more because I had date values in MySQL of 0000-00-00, which is a pretty common occurrence. I needed to append "?zeroDateTimeBehavior=convertToNull" to my JDBC connection string in data-config.xml for that to start working. This was not obvious to me. Solr also crashed whenever I made an SQL error, of course. Only it never said that there was an SQL error per se. (Isn't there a call to the C or Java equivalent of PHP's mysql_error() in the JDBC connector somewhere?) At one point I had an inner entity referencing an outer entity's ID that just refused to fill in the variable ${outer.keyid} with anything--because, I realized, the keyid field was missing from my query, because I had had to concatenate it with the table name as a string, e.g. CONCAT('table-',`keyid`) AS `id`, in order to make one global "id" that Solr would like. For some reason, this missing key failed silently, whereas everything else I did caused massive numbers of errors, and so MySQL spent a lot of time looking for records that had a NULL key. The web interface was confusing in a number of respects. 
In its default state, the core selector on the bottom left panel looks like a disabled combo box, so it took me about an hour to realize it was even there. Documentation about Solr tends to reference "the query tab," but in 4.3.0 there is no "query tab," just this disabled combo box that happens to be hiding a query UI and section for each core, completely separate from the Core Admin section at the top. Sometimes when the web interface had to display an error, such as a long warning or error in the Logging section, the left and right panes would become disjointed and content would start to pile up on top of itself. There appears to be no obvious way to secure the web interface with something so obvious as a username and password, which has me worried, and wondering how many vulnerable servers there are out there with port 8983 open for all the world to see. Nor do I recall seeing any obvious way to change the active port through the web interface to something else unlikely to be guessed (though a port scan would render that irrelevant). All of the blogs I've found about the security issue reference sections of configuration files that don't seem to exist anymore in 4.3.0 so I have no idea what to do. Before I got MySQL running, the web interface unhelpfully told me that no data import modules were set up, and rather than indicating what modules were available or some way I might be able to change that or configure them, I was left to figure it out for myself. After I got it running, it became apparent that knowing the *latest* status was for some reason an option, and if you didn't check the box, you'd only have stale and unhelpful information, unless you also looked at the command prompt. Depending on one's server, that might be easy or hard. I eventually did get my MySQL import queries to work, and then tried some example searches. I got back no results no matter what I tried. 
First, I realized that I had to reload each core through the Core Admin section at the top in order for it to realize that there were now documents present in the database. If I searched for *:*, Solr showed that there were documents in the database. Still, for any other query, there were no results. Then I did some digging around the internet and realized that I had to use the unintuitively named dismax query parser, which I'd never heard of. Since every field in the web interface query section is labeled with its code value, with no hint of what those letters might mean, I had no idea what I was looking at (and basically still don't). Finally, though, I was able to get some basic queries to work.
This process leaves me with some questions for the Solr community:
- Are XML configuration files the best way to do this, or are they merely convenient for Java programmers?
- Are XML configuration files that are 90% [unhelpful] comments and *deliberate, punitive, pre-emptive warnings built-in* the best way to do user documentation?
- Why bother with a web interface if it's just going to force you to use the command line anyway?
So at this point let me conclude by summarizing all of this with a more judgmental, and I think substantiated, statement. Solr features the worst-designed user experience I have ever seen in an enterprise-grade program, and I've used some pretty awful software (SCO OpenServer, Microsoft Exchange Server 5.5, etc.). The search engine, in contrast, works great, which is why I'm bothering to write this at all. Nonetheless, I don't care if it's open-source or closed-source. No program should work like this--and certainly not anything called "version 4." I say this not because I enjoy starting flame wars or because I have the time to participate in them--I don't. I realize that there's a long history to Solr and I am the new kid who doesn't get it. Nonetheless, that doesn't change the way it works, and many users will be just like me.
So just know that I'd like to see Solr improve--frankly, I need it to--and if these issues were not already glaringly obvious, they should be now.
Aaron
Aaron Greenspan
President & CEO
Think Computer Corporation
telephone +1 415 670 9350
fax +1 415 373 3959
e-mail aar...@thinkcomputer.com
web http://www.thinkcomputer.com