Here's a quick demo I wrote at one point. I haven't run it in a while,
but you should be able to get the idea.
package jdbc;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.client.solrj.impl.XMLResponseParser;
import org.apache.solr.common.SolrInputDocument;

import java.io.IOException;
import java.sql.*;
import java.util.ArrayList;
import java.util.Collection;

public class Indexer {
  public static void main(String[] args) {
    startIndex("http://localhost:8983/solr");
  }

  private static void startIndex(String url) {
    Connection con = DataSource.getConnection();
    try {
      long start = System.currentTimeMillis();
      // Create a multi-threaded communications channel to the Solr server.
      // Full interface (3.3) at:
      // http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html
      StreamingUpdateSolrServer server = new StreamingUpdateSolrServer(url, 10, 4);

      // You may want to set these timeouts higher; Solr occasionally
      // has long pauses while segments merge.
      server.setSoTimeout(1000); // socket read timeout
      server.setConnectionTimeout(100);
      //server.setDefaultMaxConnectionsPerHost(100);
      //server.setMaxTotalConnections(100);
      //server.setFollowRedirects(false); // defaults to false
      // allowCompression defaults to false.
      // Server side must support gzip or deflate for this to have any effect.
      //server.setAllowCompression(true);
      server.setMaxRetries(1); // defaults to 0. > 1 not recommended.
      server.setParser(new XMLResponseParser()); // binary parser is used by default

      doDocuments(server, con);
      server.commit(); // Only needs to be done at the end; autocommit
                       // or commitWithin should do the rest.
      long endTime = System.currentTimeMillis();
      System.out.println("Total time taken -> " + (endTime - start) + " ms");
    } catch (Exception e) {
      e.printStackTrace();
      System.out.println(e.getMessage());
    } finally {
      DataSource.closeConnection(con);
    }
  }

  private static void doDocuments(StreamingUpdateSolrServer server, Connection con)
      throws SQLException, IOException, SolrServerException {
    Statement st = con.createStatement();
    ResultSet rs = st.executeQuery("select id,title,text from test");

    // SolrInputDocument interface (3.3) at
    // http://lucene.apache.org/solr/api/org/apache/solr/common/SolrInputDocument.html
    Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
    int total = 0;
    int counter = 0;
    while (rs.next()) {
      SolrInputDocument doc = new SolrInputDocument(); // DO NOT move this outside
                                                       // the while loop, or be sure
                                                       // to call doc.clear()
      String id = rs.getString("id");
      String title = rs.getString("title");
      String text = rs.getString("text");
      doc.addField("id", id);
      doc.addField("title", title);
      doc.addField("text", text);
      docs.add(doc);
      ++counter;
      ++total;
      if (counter > 1000) { // Completely arbitrary; just batch up more than
                            // one document for throughput!
        server.add(docs);
        docs.clear();
        counter = 0;
      }
    }
    if (!docs.isEmpty()) { // Don't forget the final partial batch.
      server.add(docs);
    }
    rs.close();
    st.close();
    System.out.println("Total " + total + " docs added successfully");
  }
}

// Trivial class showing connecting to a MySQL database server via JDBC...
class DataSource {
  public static Connection getConnection() {
    Connection conn = null;
    try {
      Class.forName("com.mysql.jdbc.Driver").newInstance();
      System.out.println("Driver loaded......");
      conn = DriverManager.getConnection("jdbc:mysql://172.16.0.169:3306/test?"
          + "user=testuser&password=test123");
      System.out.println("Connection built......");
    } catch (Exception ex) {
      System.out.println(ex);
    }
    return conn;
  }

  public static void closeConnection(Connection con) {
    try {
      if (con != null)
        con.close();
    } catch (SQLException e) {
      e.printStackTrace();
    }
  }
}
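By the way, for the GROUP_CONCAT/transformer approach Gora describes in the thread below, the splitting step is plain Java and pretty simple. Here's a rough sketch (the class name, separator, and method are just illustrative, not part of any Solr API; a real DIH transformer would do this inside its transformRow method):

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class MultiValueSplit {
  // Separator chosen to be something unlikely to occur in real data,
  // matching the SEPARATOR used in the GROUP_CONCAT select.
  static final String SEP = "@||@";

  static List<String> splitMultiValue(String raw) {
    // Pattern.quote is important: '|' is regex syntax, so splitting
    // on the raw separator string would behave badly.
    return Arrays.asList(raw.split(Pattern.quote(SEP)));
  }

  public static void main(String[] args) {
    // Each element would become one value of the multi-valued field.
    List<String> values = splitMultiValue("java@||@mysql@||@solr");
    System.out.println(values); // [java, mysql, solr]
  }
}
```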
On Mon, Dec 12, 2011 at 2:57 PM, Brian Lamb
<[email protected]> wrote:
> Thanks all. Erick, is there documentation on doing things with SolrJ and a
> JDBC connection?
>
> On Mon, Dec 12, 2011 at 1:34 PM, Erick Erickson
> <[email protected]>wrote:
>
>> You might want to consider just doing the whole
>> thing in SolrJ with a JDBC connection. When things
>> get complex, it's sometimes more straightforward.
>>
>> Best
>> Erick...
>>
>> P.S. Yes, it's pretty standard to have a single
>> field be the destination for several copyField
>> directives.
>>
>> On Mon, Dec 12, 2011 at 12:48 PM, Gora Mohanty <[email protected]> wrote:
>> > On Mon, Dec 12, 2011 at 2:24 AM, Brian Lamb
>> > <[email protected]> wrote:
>> >> Hi all,
>> >>
>> >> I have a few questions about how the MySQL data import works. It seems
>> it
>> >> creates a separate connection for each entity I create. Is there any
>> way to
>> >> avoid this?
>> >
>> > Not sure, but I do not think that it is possible. However, from your
>> description
>> > below, I think that you are unnecessarily multiplying entities.
>> >
>> >> By nature of my schema, I have several multivalued fields. Each one I
>> >> populate with a separate entity. Is there a better way to do it? For
>> >> example, could I pull in all the singular data in one sitting and then
>> come
>> >> back in later and populate with the multivalued items.
>> >
>> > Not quite sure as to what you mean. Would it be possible for you
>> > to post your schema.xml, and the DIH configuration file? Preferably,
>> > put these on pastebin.com, and send us links. Also, you should
>> > obfuscate details like access passwords.
>> >
>> >> An alternate approach in some cases would be to do a GROUP_CONCAT and
>> then
>> >> populate the multivalued column with some transformation. Is that
>> possible?
>> > [...]
>> >
>> > This is how we have been handling it. A complete description would
>> > be long, but here is the gist of it:
>> > * A transformer will be needed. In this case, we found it easiest
>> > to use a Java-based transformer. Thus, your entity should include
>> > something like
>> > <entity name="myname" dataSource="mysource"
>> > transformer="com.mycompany.search.solr.handler.JobsNumericTransformer...>
>> > ...
>> > </entity>
>> > Here, the class name to be used for the transformer attribute follows
>> > the usual Java rules, and the .jar needs to be made available to Solr.
>> > * The SELECT statement for the entity looks something like
>> > select group_concat( myfield SEPARATOR '@||@')...
>> > The separator should be something that does not occur in your
>> > normal data stream.
>> > * Within the entity, define
>> > <field column="myfield"/>
>> > * There are complications involved if NULL values are allowed
>> > for the field, in which case you would need to use COALESCE,
>> > maybe along with CAST
>> > * The transformer would look up "myfield", split along the separator,
>> > and populate the multi-valued field.
>> >
>> > This *is* a little complicated, so I would also like to hear about
>> > possible alternatives.
>> >
>> > Regards,
>> > Gora
>>