On 1/11/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> On 1/11/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> > On 1/11/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> > > I'd like to be able to add/update documents from an SQL query.
> >
> > Me too... it's been on the todo list a long time.
> > A lot of people have data in databases, and it's a shame to require
> > code to index their data if it can be expressed in SQL.
> >
> >  If it were less common, I'd say it would be better as a standalone
> > app talking to Solr over XML/HTTP, but given that it's *such* a common
> > case, I'd support it going into the core.
> >
> > I'd envision query args instead of XML though... something that could
> > be generated by a browser.
> >
> > overwrite=true&sql=SELECT * FROM my_stats_table&etc

> To keep /update clean, perhaps updating from SQL should get its own servlet.

It should definitely be separate... either a servlet or update plugin.
You could start by implementing it as a request handler plugin; it should be
easy to change it to an update plugin later, depending on if/how we end up handling those.

See
http://issues.apache.org/jira/browse/SOLR-66
for how I handle parameters, reusing SolrParams and enabling per-field
specifications for things like separator.
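
For illustration, a per-field separator could look something like this
(the f.<field>.* names are just a guess at this point, nothing is settled):

  sql=SELECT * FROM my_stats_table&f.tags.separator=,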

> /updateSQL?overwrite=true&sql=SELECT * FROM my_stats_table&etc

> I think connection settings and driver should be set by the request,
> not through configuration.

If we do things the same way as request handlers, then you can have it
both ways... one can specify it in configuration *or* in the request.
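
Roughly, the lookup would just be request-first, config-second. A sketch
(names invented, nothing Solr-specific assumed here):

// Per-request params win; otherwise fall back to whatever the handler was
// configured with in solrconfig.xml.
static String getSetting(java.util.Map<String, String> requestParams,
                         java.util.Map<String, String> configuredDefaults,
                         String name) {
  String v = requestParams.get(name);
  return (v != null) ? v : configuredDefaults.get(name);
}

// e.g. "driver", "url", "user", "password" for the JDBC connection could all
// be resolved this way.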

> > The big question in my mind is if the database schema is simple enough
> > for something like this to work... esp w.r.t multi-valued fields.

> I like the idea of adding a 'separator' token.  That could split a
> single string into multiple field values.

That should definitely be there, but I just don't know if it's
sufficient.  Many fields that need to be multi-valued won't be stored
like that in the database.
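
The splitting itself is the easy part, though. A sketch, with the separator
coming from a hypothetical per-field parameter:

// A stored value like "red;green;blue" becomes three field values.
// Pattern.quote keeps characters like '|' or '.' from being treated as regex.
static String[] splitValue(String columnValue, String separator) {
  if (separator == null) return new String[] { columnValue };
  return columnValue.split(java.util.regex.Pattern.quote(separator));
}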

> > Multiple values in a database may be in multiple rows... can we handle
> > that case somehow?

> If the rows are sorted by the ID, we could keep building a document
> until the ID is different from the previous one.  Multiple rows would
> keep adding to the same document.

I had thought about that, but it's tricky.
When you do a join and get multiple rows, don't values repeat in each
row, making it hard to tell if there were multiple values to begin
with or not?
I guess we could:
1) go by the schema and only add multiple values if it's marked multiValued
2) only add distinct values... ignore repeated values

Requiring rows to be sorted by document ID seems like an ok
restriction - the alternative is to load everything into memory until
you hit the end of the result set.

The latter would break for large collections, so I guess it should be
the former.
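
To make the sorted-by-id approach concrete, here is a rough sketch in plain
JDBC (all class/method names are invented for illustration; addDocument is a
placeholder for building and adding the actual Solr document):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class SqlIndexerSketch {

  // Placeholder: the real thing would build a Solr document and add it.
  static void addDocument(String id, Map<String, Set<Object>> fields) {
    System.out.println(id + " -> " + fields);
  }

  // Walk a result set that is sorted by the id column, folding consecutive
  // rows with the same id into one document.  A Set per field drops the
  // values a join repeats on every row (option 2 above); consulting the
  // schema's multiValued flag instead would be option 1.
  static void index(Connection conn, String sql, String idColumn) throws SQLException {
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery(sql);
    ResultSetMetaData meta = rs.getMetaData();

    String prevId = null;
    Map<String, Set<Object>> doc = null;

    while (rs.next()) {
      String id = rs.getString(idColumn);
      if (prevId == null || !prevId.equals(id)) {
        if (doc != null) addDocument(prevId, doc);   // flush the finished doc
        doc = new LinkedHashMap<String, Set<Object>>();
        prevId = id;
      }
      for (int i = 1; i <= meta.getColumnCount(); i++) {
        Object val = rs.getObject(i);
        if (val == null) continue;
        String field = meta.getColumnName(i);
        Set<Object> vals = doc.get(field);
        if (vals == null) {
          vals = new LinkedHashSet<Object>();
          doc.put(field, vals);
        }
        vals.add(val);   // duplicate values from the join are ignored
      }
    }
    if (doc != null) addDocument(prevId, doc);   // flush the last doc
    rs.close();
    stmt.close();
  }
}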

-Yonik
