Setting up Solr

Aaron Greenspan Tue, 04 Jun 2013 22:49:35 -0700

Hi everyone,

I thought I'd document my process of getting set up with Solr 4.3.0 on a Linux 
server in case it's of use to anyone. I'm a moderately experienced Linux system 
administrator, so without passing judgment (at least for now), let me just say 
that I found getting Solr to work to be extremely difficult--more difficult 
than just about any other package I've ever dealt with, including ones I've 
built from source.


I downloaded the .tgz file from the Apache site without a problem and 
decompressed it into its own directory. I was surprised to find the 
unconventional (at least in my experience) directory structure, where all of 
the important files are contained in the "example" directory (and really a 
second "solr" directory under that). Following the directions I'd read on-line 
and in the README, I got it running pretty quickly with "java -jar start.jar" 
and went to the web interface on port 8983 of my server. Here's where the 
problems began.

First, a note: the installation page of the wiki 
(http://wiki.apache.org/solr/SolrInstall) contains an error, or at least a very 
misleading piece of text, one of many in the wiki. Port 8983 is indeed "a port 
other than 8080." (And since I'm talking about errors, "containers" should have 
an apostrophe.)

The server status dashboard showed up fine, and I poked around to figure out 
what was what. In short order, I noticed that Solr had already thrown a warning 
on the Logging section about "/non/existent/dir/yields/warning", which didn't 
make much sense to me since I hadn't really done anything yet. I looked into 
that some more and wrote up a bug here: 
https://issues.apache.org/jira/browse/SOLR-4890. I don't think I've ever seen 
another piece of software that deliberately warned users that mistakes cause 
warnings, but I suppose there's a first time for everything.

Aside from that, I tried posting documents to the example collection1, which 
amazingly worked, so, satisfied, I decided to delete it and make my own new 
collection.

This was a mistake, apparently. The Solr web console can't function without at 
least one core at all times--but it doesn't tell you that until after you've 
deleted it and it's totally non-functional. To a novice, this is scary. Hence 
bug number two: https://issues.apache.org/jira/browse/SOLR-3633.

I didn't have any idea how to get Solr working again--there are way too many 
XML configuration files in way too many directories for a new user to figure 
them all out. So I just started from scratch by decompressing the .tgz file 
again, and went back to my default state, which again warned me about warnings.

Now I knew not to delete the collection1 core. So I left it alone, and tried to 
make a new one of my own. This threw an error. The new core could not be 
created. Why? Because the user is expected to create a directory ahead of time 
corresponding to that core via the shell, at least according to Stefan Matheis 
in bug number three's discussion: 
https://issues.apache.org/jira/browse/SOLR-4461.

If you look at the comments for that bug you'll see what I wrote there: "So I 
created a new folder with the name of the core I wanted in the same place that 
I found the collection1 folder. That didn't work. I got the same error. Then I 
looked at the README.txt file for the collection1 folder and saw that you are 
actually supposed to duplicate the collection1 folder for your new core. (In 
that case, the web UI, which really doesn't want you to delete collection1 
anyway, should just treat collection1 as some kind of default template that you 
are encouraged to duplicate to create a new core.)

So with the folder duplicated, I tried adding my new core again. It kind of 
worked. I got a new listing on the left-hand side, but I also got this new 
error:

SolrCore Initialization Failures

new_core: 
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: 
Could not load config for solrconfig.xml"
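
For anyone retracing these steps, here is a minimal sketch of the "duplicate collection1, then create the core" sequence. It is an illustration under assumptions, not the documented procedure: the Solr home path, the base URL http://localhost:8983/solr, and the helper names clone_core_dir and core_create_url are all hypothetical. It also copies only the conf/ subdirectory, which is a choice made here to avoid duplicating the example index; the README describes duplicating the whole folder.

```python
import shutil
import tempfile
from pathlib import Path
from urllib.parse import urlencode


def clone_core_dir(solr_home: Path, template: str, new_core: str) -> Path:
    """Copy the template core's conf/ directory into a new core directory."""
    src = solr_home / template / "conf"
    dst = solr_home / new_core / "conf"
    shutil.copytree(src, dst)  # raises FileExistsError if dst already exists
    return dst.parent


def core_create_url(base_url: str, new_core: str) -> str:
    """Build a CoreAdmin CREATE request URL for the new core."""
    params = urlencode(
        {"action": "CREATE", "name": new_core, "instanceDir": new_core}
    )
    return f"{base_url}/admin/cores?{params}"


if __name__ == "__main__":
    # Demonstrate against a throwaway directory instead of a real Solr home.
    with tempfile.TemporaryDirectory() as tmp:
        home = Path(tmp)
        (home / "collection1" / "conf").mkdir(parents=True)
        (home / "collection1" / "conf" / "solrconfig.xml").write_text("<config/>")
        print(clone_core_dir(home, "collection1", "new_core"))
    print(core_create_url("http://localhost:8983/solr", "new_core"))
```

The URL printed at the end can be opened in a browser or fetched with any HTTP client once the directory exists on the server.
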

The suggested solution, involving file locking, did not help. solrconfig.xml 
wasn't loading because my data import handler wasn't loading because MySQL's 
configuration, which I had put into a data-config.xml file (which for some 
reason always has a hyphen in the docs when solrconfig.xml does not), wasn't 
able to be used because the MySQL Java connector wasn't loading because the 
connector was nowhere to be found. The wiki doesn't talk about this at all. The 
closest it gets is "You might need to download and install the Oracle JDBC 
Driver in the /lib directory of your Solr installation." under the heading of 
"Oracle Example." I don't use Oracle (even though MySQL is an Oracle product 
now).

Only some of the on-line documents I read even mentioned that there was 
something you had to download separately from MySQL at 
http://www.mysql.com/downloads/connector/j/, and the particular blog I found 
suggesting this told me to put the whole decompressed folder in the lib 
directory, which I soon learned was not going to help at all. In fact, you just 
needed the JAR file. But whether I put the JAR file in lib, or dist, or 
/var/share/java (with or without symlinks), or any folder at all, Solr refused 
to find it.

Finally, I caught a clue from some of the lines referencing libraries in 
solrconfig.xml and realized I might have to specifically tell Solr to look for 
the MySQL library by including a line for it. First I tried using a basic 
regular expression of "mysql*\.jar", which didn't work. Then in a fit of 
desperation (by this point I had consumed many, many hours on getting seemingly 
nowhere with Solr) I tried "mysql-connector-java-\d.*\.jar" and it finally 
worked. I'm not sure why, as I think both expressions are valid.
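
A likely explanation for the difference, sketched below in Python: "mysql*\.jar" reads like a shell glob, but as a regular expression the "*" applies only to the preceding "l", so the pattern means "mysq", then zero or more "l", then ".jar". The sketch assumes that Solr tests the regex against the whole file name, and the JAR name used is hypothetical (the real connector version will differ).

```python
import re

# Hypothetical file name; the actual connector version will differ.
JAR_NAME = "mysql-connector-java-5.1.25.jar"


def matches_whole_name(pattern: str, file_name: str) -> bool:
    """Return True if the regex matches the entire file name."""
    return re.fullmatch(pattern, file_name) is not None


# Glob-style thinking: "*" here only repeats the letter "l".
print(matches_whole_name(r"mysql*\.jar", JAR_NAME))  # False

# The pattern that worked: "\d.*" consumes the version number.
print(matches_whole_name(r"mysql-connector-java-\d.*\.jar", JAR_NAME))  # True

# Regex equivalent of the glob "mysql*.jar": ".*" means "any characters".
print(matches_whole_name(r"mysql.*\.jar", JAR_NAME))  # True
```

So both expressions are valid regular expressions, but only the second one describes the connector's file name.
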

But it didn't really work. Solr could connect to MySQL, but that didn't mean it 
could import anything. For that I had to set up a proper Solr schema, which 
took me a while to understand because the schema actually spans two XML 
configuration files, and because the "collection1" example schema seemed to 
handle both information from a hypothetical relational database table about 
products, and generic fields from Word documents. I planned to handle neither, 
but I did need to import multiple tables. So I deleted all of those fields from 
my schema and learned that doing so was a great way to make Solr crash, as in, 
fail to load your core at all and throw lots of Java exceptions. In particular, 
it didn't like that I had removed the id and _version_ fields. So, I put the 
fields back and then very carefully changed them until I got my queries to work 
(after many more hours).

Solr crashed some more because I had date values in MySQL of 0000-00-00, which 
is a pretty common occurrence. I needed to append 
"?zeroDateTimeBehavior=convertToNull" to my JDBC connection string in 
data-config.xml for that to start working. This was not obvious to me.
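
As a rough illustration of what "append to the connection string" means, the sketch below adds the parameter to a JDBC URL. The host, port, and database name are made up, and with_jdbc_param is a hypothetical helper, not part of Solr or the MySQL connector.

```python
def with_jdbc_param(jdbc_url: str, key: str, value: str) -> str:
    """Append key=value to a JDBC URL, using '?' for the first parameter
    and '&' for any later one."""
    separator = "&" if "?" in jdbc_url else "?"
    return f"{jdbc_url}{separator}{key}={value}"


# Hypothetical connection string; substitute your own host and database.
url = with_jdbc_param(
    "jdbc:mysql://localhost:3306/mydb", "zeroDateTimeBehavior", "convertToNull"
)
print(url)
# jdbc:mysql://localhost:3306/mydb?zeroDateTimeBehavior=convertToNull
```

If the URL already carries other parameters, the separator becomes "&", which has to be written as "&amp;" inside an XML attribute such as the one in data-config.xml.
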

Solr also crashed whenever I made an SQL error, of course. Only it never said 
that there was an SQL error per se. (Isn't there a call to the C or Java 
equivalent of PHP's mysql_error() in the JDBC connector somewhere?)

At one point I had an inner entity referencing an outer entity's ID that just 
refused to fill in the variable ${outer.keyid} with anything--because, I 
realized, the keyid field was missing from my query, because I had had to 
concatenate it with the table name as a string, e.g. CONCAT('table-',`keyid`) 
AS `id`, in order to make one global "id" that Solr would like. For some 
reason, this missing key failed silently, whereas everything else I did caused 
massive numbers of errors, and so MySQL spent a lot of time looking for records 
that had a NULL key.
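
The failure mode can be shown with a toy model of the variable substitution. This is not the data import handler's actual code: the resolve function, the table names, and the choice to substitute an empty string for a missing column are all assumptions made for illustration. The point is only that the outer query has to return keyid as its own column, in addition to the concatenated id.

```python
import re

VARIABLE = re.compile(r"\$\{(\w+)\.(\w+)\}")


def resolve(template: str, rows: dict) -> str:
    """Replace ${entity.column} with rows[entity][column].

    A missing column silently becomes an empty string (an assumption
    that mimics the silent failure described above).
    """
    def substitute(match: re.Match) -> str:
        value = rows.get(match.group(1), {}).get(match.group(2))
        return "" if value is None else str(value)

    return VARIABLE.sub(substitute, template)


# Hypothetical inner-entity query that refers to the outer entity's key.
inner_query = "SELECT * FROM `inner_table` WHERE `outer_id` = '${outer.keyid}'"

# Outer query selected only CONCAT('table-',`keyid`) AS `id`.
print(resolve(inner_query, {"outer": {"id": "table-42"}}))
# SELECT * FROM `inner_table` WHERE `outer_id` = ''

# Outer query selected CONCAT('table-',`keyid`) AS `id`, `keyid`.
print(resolve(inner_query, {"outer": {"id": "table-42", "keyid": 42}}))
# SELECT * FROM `inner_table` WHERE `outer_id` = '42'
```
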

The web interface was confusing in a number of respects. In its default state, 
the core selector on the bottom left panel looks like a disabled combo box, so 
it took me about an hour to realize it was even there. Documentation about Solr 
tends to reference "the query tab," but in 4.3.0 there is no "query tab," just 
this disabled combo box that happens to be hiding a query UI and section for 
each core, completely separate from the Core Admin section at the top. 
Sometimes when the web interface had to display an error, such as a long 
warning or error in the Logging section, the left and right panes would become 
disjointed and content would start to pile up on top of itself.

There appears to be no obvious way to secure the web interface with something 
so obvious as a username and password, which has me worried, and wondering how 
many vulnerable servers there are out there with port 8983 open for all the 
world to see. Nor do I recall seeing any obvious way to change the active port 
through the web interface to something else unlikely to be guessed (though a 
port scan would render that irrelevant). All of the blogs I've found about the 
security issue reference sections of configuration files that don't seem to 
exist anymore in 4.3.0 so I have no idea what to do.

Before I got MySQL running, the web interface unhelpfully told me that no data 
import modules were set up, and rather than indicating what modules were 
available or some way I might be able to change that or configure them, I was 
left to figure it out for myself. After I got it running, it became apparent 
that knowing the *latest* status was for some reason an option, and if you 
didn't check the box, you'd only have stale and unhelpful information, unless 
you also looked at the command prompt. Depending on one's server, that might be 
easy or hard.
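
The import can also be started and polled without the web interface by requesting the handler's URL directly. The sketch below only builds those URLs. It assumes the handler is registered under the name /dataimport in solrconfig.xml and that Solr is at http://localhost:8983/solr with a core called new_core; dataimport_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def dataimport_url(base_url: str, core: str, command: str, **extra: str) -> str:
    """Build a data import handler request URL for the given command."""
    params = {"command": command, **extra}
    return f"{base_url}/{core}/dataimport?{urlencode(params)}"


BASE = "http://localhost:8983/solr"  # assumed location of the Solr server

# Start a full import, clearing the existing index first.
print(dataimport_url(BASE, "new_core", "full-import", clean="true"))
# http://localhost:8983/solr/new_core/dataimport?command=full-import&clean=true

# Ask for the current status of the import.
print(dataimport_url(BASE, "new_core", "status"))
# http://localhost:8983/solr/new_core/dataimport?command=status
```

Requesting the status URL repeatedly gives the current state of the import rather than whatever the page last displayed.
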

I eventually did get my MySQL import queries to work, and then tried some 
example searches. I got back no results no matter what I tried. First, I 
realized that I had to reload each core through the Core Admin section at the 
top in order for it to realize that there were now documents present in the 
database. If I searched for *:*, Solr showed that there were documents in the 
database. Still, for any other query, there were no results. Then I did some 
digging around the internet and realized that I had to use the 
unintuitively named dismax query parser, which I'd never heard of. Since every 
field in the web interface query section is labeled with its code value, and no 
hint of what those letters might mean, I had no idea what I was looking at (and 
basically still don't). Finally, though, I was able to get some basic queries 
to work.
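
For reference, the short labels in the query form correspond to request parameters: q is the query, defType selects the query parser, qf lists the fields dismax should search, and wt sets the response format. A minimal sketch of building such requests follows. The core name new_core, the field names title and body, and the helper select_url are assumptions; substitute whatever your schema defines.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def select_url(base_url: str, core: str, **params: str) -> str:
    """Build a /select request URL with the given Solr query parameters."""
    return f"{base_url}/{core}/select?{urlencode(params)}"


BASE = "http://localhost:8983/solr"  # assumed location of the Solr server

# Match every document ("*:*" means any field, any value).
match_all = select_url(BASE, "new_core", q="*:*", wt="json")

# Free-text search through the dismax parser over two hypothetical fields.
dismax = select_url(
    BASE, "new_core", q="search terms", defType="dismax", qf="title body", wt="json"
)

print(match_all)
print(dismax)
print(parse_qs(urlsplit(dismax).query)["defType"])  # ['dismax']
```
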

This process leaves me with some questions for the Solr community:
- Are XML configuration files the best way to do this, or are they merely 
convenient for Java programmers?
- Are XML configuration files that are 90% [unhelpful] comments and 
*deliberate, punitive, pre-emptive warnings built-in* the best way to do user 
documentation?
- Why bother with a web interface if it's just going to force you to use the 
command line anyway?

So at this point let me conclude by summarizing all of this with a more 
judgmental, and I think substantiated, statement. Solr features the 
worst-designed user experience I have ever seen in an enterprise-grade program, 
and I've used some pretty awful software (SCO OpenServer, Microsoft Exchange 
Server 5.5, etc.). The search engine, in contrast, works great, which is why 
I'm bothering to write this at all. Nonetheless, I don't care if it's 
open-source or closed-source. No program should work like this--and certainly 
not anything called "version 4."

I say this not because I enjoy starting flame wars or because I have the time 
to participate in them--I don't. I realize that there's a long history to Solr 
and I am the new kid who doesn't get it. Nonetheless, that doesn't change the 
way it works, and many users will be just like me. So please know that I'd just 
like to see Solr improve--frankly, I need it to--and if these issues were not 
already glaringly obvious, they should be now.

Aaron

        
Aaron Greenspan
President & CEO
Think Computer Corporation

telephone +1 415 670 9350
fax +1 415 373 3959
e-mail aar...@thinkcomputer.com
web http://www.thinkcomputer.com
