Re: SQL UpdatePlugin?

2007-01-12 Thread Bertrand Delacretaz

On 1/12/07, Ryan McKinley [EMAIL PROTECTED] wrote:


...What is the best way to post unfinished code and ask for review?
Should i make a JIRA issue and post a patch with a not ready for
prime time note? ...


Yes, this would work. Our personal NotReadyForPrimeTime detectors
would flag it immediately anyway, if needed ;-)

Putting TODO comments in your code in places where you feel others
should have a look can help if you're unsure about how to best
implement certain things.

-Bertrand


Re: SQL UpdatePlugin?

2007-01-12 Thread Yonik Seeley

On 1/12/07, Ryan McKinley [EMAIL PROTECTED] wrote:

For now, it extends SolrRequestHandler (not UpdateHandler) because the
request plugin framework seems more stable.


There is no framework for pluggable update handlers yet...
Solr's current UpdateHandler classes (DirectUpdateHandler2) are a bit
misnamed in this context... they deal with the low-level lucene index
munging and concurrency issues, more like an IndexWriter.


What is the best way to post unfinished code and ask for review?
Should i make a JIRA issue and post a patch with a not ready for
prime time note?  Should I post the code on a personal website?


Yep, and adding a note that it's just a draft is important.


Does this belong in the main package o.a.s.update?


IMO, yes (or in o.a.s.update.db or sql).  SQL databases are
ubiquitous, and we don't want everyone having to do the same
integration steps themselves to index their db content.


or a 'contrib'
package/project/source folder?  If it is in the main tree: the unit
tests require mysql-connector.jar and assume you have a database at:
jdbc:mysql://localhost/solrtestsql with user=solr and password=solr.
Any suggestions on how this should be handled or structured.


I'm split on this one...
It seems like it would be nice if we didn't have any vendor specific
db stuff in the core or in the repository.  On the other hand, if the
vast majority of people will be using this with mysql, why make them
go through the trouble.

-Yonik


Re: SQL UpdatePlugin?

2007-01-12 Thread Ryan McKinley


 or a 'contrib'
 package/project/source folder?  If it is in the main tree: the unit
 tests require mysql-connector.jar and assume you have a database at:
 jdbc:mysql://localhost/solrtestsql with user=solr and password=solr.
 Any suggestions on how this should be handled or structured.

I'm split on this one...
It seems like it would be nice if we didn't have any vendor specific
db stuff in the core or in the repository.  On the other hand, if the
vast majority of people will be using this with mysql, why make them
go through the trouble.



Perhaps we could add an 'ext' directory to lib where we put libraries
used by many plugins, but are not essential to solr.  These would
probably include:
 commons-fileupload-1.1.1.jar
 mysql-connector-java-5.0.4.jar
 postgresql.jar (i don't actually know what it is called)
 poi-2.5.1.jar
 commons-csv-20061121.jar

If people are worried about a more lean distribution, there could be a
'core' and 'complete' version.

At least this would solve my current problem: where to put the
libraries for unit testing!


Re: SQL UpdatePlugin?

2007-01-11 Thread Thorsten Scherler
On Wed, 2007-01-10 at 22:32 -0800, Ryan McKinley wrote:
 I'd like to be able to add/update documents from an SQL query.  Perhaps:
 
 addFromSQL
   mode=add or replace fields
   connection=jdbc:mysql://localhost/nblmc?username=xxxpassword=xxx
   driver=com.mysql.jdbc.Driver
   multifieldSeperator=\n 
   SELECT * FROM my_stats_table
 /addFromSQL
 
 This would use the the column names as the field name, and the cell
 value.toString() as the field value.
 
 If the schema says the field can have multiple values AND a
 multifieldSeperator is defined, it will split the value on that
 string.
 
 To get intended results, you may need to use the 'AS' command and
 perhaps format the cells using SQL.  For example:
 
  SELECT itemID AS id, name, DATE_FORMAT( addedTime, '%Y-%m-%dT%H:%i:%s.000Z' )
 
 Should this be an implemented as an Update Plugin?  or added directly
 to the DirectUpdateHandler.
 
 If it should be an UpdatePlugin, how do i get started?

Hmm, I am not an expert but IMO that should not go directly to the
DirectUpdateHandler.

Solr is following the push model for updates till now. The above is
changing this since now solr is pulling the documents to add from the
db. Not saying this is bad or good.

I think you should consider something like this:
DbToSolrXml.java - this component connects to the db and generates
proper solr xml update statement. From here you do as usual.

Have a look at https://issues.apache.org/jira/browse/SOLR-66 maybe you
can use this.

HTH

salu2


 
 thanks
 ryan



Re: SQL UpdatePlugin?

2007-01-11 Thread Yonik Seeley

On 1/11/07, Ryan McKinley [EMAIL PROTECTED] wrote:

I'd like to be able to add/update documents from an SQL query.


Me too... it's been on the todo list a long time.
A lot of people have data in databases, and it's a shame to require
code to index their data if it can be expressed in SQL.

If it were less common, I'd say it would be better as a standalone
app talking to Solr over XML/HTTP, but given that it's *such* a common
case, I'd support it going into the core.

I'd envision query args instead of XML though... something that could
be generated by a browser.

overwrite=truesql=SELECT * FROM my_stats_tableetc

The big question in my mind is if the database schema is simple enough
for something like this to work... esp w.r.t multi-valued fields.
Multiple values in a database may be in multiple rows... can we handle
that case somehow?  What types of joins can we handle?

-Yonik


addFromSQL
  mode=add or replace fields
  connection=jdbc:mysql://localhost/nblmc?username=xxxpassword=xxx
  driver=com.mysql.jdbc.Driver
  multifieldSeperator=\n 
  SELECT * FROM my_stats_table
/addFromSQL

This would use the the column names as the field name, and the cell
value.toString() as the field value.

If the schema says the field can have multiple values AND a
multifieldSeperator is defined, it will split the value on that
string.

To get intended results, you may need to use the 'AS' command and
perhaps format the cells using SQL.  For example:

 SELECT itemID AS id, name, DATE_FORMAT( addedTime, '%Y-%m-%dT%H:%i:%s.000Z' )

Should this be an implemented as an Update Plugin?  or added directly
to the DirectUpdateHandler.

If it should be an UpdatePlugin, how do i get started?

thanks
ryan


Re: SQL UpdatePlugin?

2007-01-11 Thread Ryan McKinley

On 1/11/07, Yonik Seeley [EMAIL PROTECTED] wrote:

On 1/11/07, Ryan McKinley [EMAIL PROTECTED] wrote:
 I'd like to be able to add/update documents from an SQL query.

Me too... it's been on the todo list a long time.
A lot of people have data in databases, and it's a shame to require
code to index their data if it can be expressed in SQL.

 If it were less common, I'd say it would be better as a standalone
app talking to Solr over XML/HTTP, but given that it's *such* a common
case, I'd support it going into the core.

I'd envision query args instead of XML though... something that could
be generated by a browser.

overwrite=truesql=SELECT * FROM my_stats_tableetc


To keep /update a clean, perhaps updating from SQL should get its own servlet.
/updateSQL?overwrite=truesql=SELECT * FROM my_stats_tableetc

I think connection settings and driver should be set by the request,
not through configuration.

People would need to add an sql driver (mysql-connector-java-5.0.4.jar
or whatever) to the /lib directory for anything to work.



The big question in my mind is if the database schema is simple enough
for something like this to work... esp w.r.t multi-valued fields.


I like the idea of adding a 'separator' token.  that could split a
single string into multiple fields.


Multiple values in a database may be in multiple rows... can we handle
that case somehow?


If the rows are sorted by the ID, we could keep building a document
until the ID is different from the previous one.  Multiple rows would
keep adding to the same document.

Requiring fields to be sorted by document ID seems like an ok
restriction - the alternative is to load everything into memory until
you hit the end of the result set.



What types of joins can we handle?



Anything you can pass to:
 ResultSet resultSet = stmt.executeQuery( query );

If this throws an exception, it will be passed to the user.

I can't think of any (or don't know about any) specific join functions
that would be problematic

- - - - - -

In my real use cases, documents are made from some fields that are
directly from SQL and others that have complex logic behind them.  The
SQL fields (stats) must be updated frequently and the others only
once.  This is why i also need to implement an 'update' mode.


Re: SQL UpdatePlugin?

2007-01-11 Thread Yonik Seeley

On 1/11/07, Ryan McKinley [EMAIL PROTECTED] wrote:

On 1/11/07, Yonik Seeley [EMAIL PROTECTED] wrote:
 On 1/11/07, Ryan McKinley [EMAIL PROTECTED] wrote:
  I'd like to be able to add/update documents from an SQL query.

 Me too... it's been on the todo list a long time.
 A lot of people have data in databases, and it's a shame to require
 code to index their data if it can be expressed in SQL.

  If it were less common, I'd say it would be better as a standalone
 app talking to Solr over XML/HTTP, but given that it's *such* a common
 case, I'd support it going into the core.

 I'd envision query args instead of XML though... something that could
 be generated by a browser.

 overwrite=truesql=SELECT * FROM my_stats_tableetc

To keep /update a clean, perhaps updating from SQL should get its own servlet.


It should definitely be separate... either a servlet or update plugin.
You could start by implementing it as a request handler plugin and it should be
easy to change it to an update plugin depending on if/how we handle that.

See
http://issues.apache.org/jira/browse/SOLR-66
for how I handle parameters, reusing SolrParams and enabling per-field
specifications for things like separator.


/updateSQL?overwrite=truesql=SELECT * FROM my_stats_tableetc

I think connection settings and driver should be set by the request,
not through configuration.


If we do things the same way as request handlers, then you can have it
both ways... one can specify it in configuration *or* in the request.


 The big question in my mind is if the database schema is simple enough
 for something like this to work... esp w.r.t multi-valued fields.

I like the idea of adding a 'separator' token.  that could split a
single string into multiple fields.


That should definitely be there, but I just don't know if it's
sufficient.  Many fields that need to be multi-valued won't be stored
like that in the database.


 Multiple values in a database may be in multiple rows... can we handle
 that case somehow?

If the rows are sorted by the ID, we could keep building a document
until the ID is different from the previous one.  Multiple rows would
keep adding to the same document.


I had thought about that, but it's tricky.
When you do a join and get multiple rows, don't values repeat in each
row, making it hard to tell if there were multiple values to begin
with or not?
I guess we could
1) go by the schema and only add multiple values if it's marked multiValued
2) only add distinct values... ignore repeated values


Requiring fields to be sorted by document ID seems like an ok
restriction - the alternative is to load everything into memory until
you hit the end of the result set.


The latter would break for large collections, so I guess it should be
the former.

-Yonik


Re: SQL UpdatePlugin?

2007-01-11 Thread Chris Hostetter


this relates pretty much directly to a seperate ongoing thread about
making the update mechanism more general to support differnet types of
input streams then just XML...

http://www.nabble.com/Re%3A-Handling-disparate-data-sources-in-Solr-tf2918621.html

...tackling an approach like that, with an out of the box
DbUpdateRequestHandler that could be configured with a JDBC refrence and a
prepared statement string might be the best avenue to persue instead of
adding DB specific code to the existing SolrUpdateServlet or SolrCore.


: Date: Wed, 10 Jan 2007 22:32:54 -0800
: From: Ryan McKinley [EMAIL PROTECTED]
: Reply-To: solr-dev@lucene.apache.org
: To: solr-dev@lucene.apache.org
: Subject: SQL UpdatePlugin?
:
: I'd like to be able to add/update documents from an SQL query.  Perhaps:
:
: addFromSQL
:   mode=add or replace fields
:   connection=jdbc:mysql://localhost/nblmc?username=xxxpassword=xxx
:   driver=com.mysql.jdbc.Driver
:   multifieldSeperator=\n 
:   SELECT * FROM my_stats_table
: /addFromSQL
:
: This would use the the column names as the field name, and the cell
: value.toString() as the field value.
:
: If the schema says the field can have multiple values AND a
: multifieldSeperator is defined, it will split the value on that
: string.
:
: To get intended results, you may need to use the 'AS' command and
: perhaps format the cells using SQL.  For example:
:
:  SELECT itemID AS id, name, DATE_FORMAT( addedTime, '%Y-%m-%dT%H:%i:%s.000Z' )
:
: Should this be an implemented as an Update Plugin?  or added directly
: to the DirectUpdateHandler.
:
: If it should be an UpdatePlugin, how do i get started?
:
: thanks
: ryan
:



-Hoss



Re: SQL UpdatePlugin?

2007-01-11 Thread Ryan McKinley

I have written a first pass of an SQLUpdateHandler based on this
discussion.  It is working well for simple cases and has not tried
anything complex yet.

For now, it extends SolrRequestHandler (not UpdateHandler) because the
request plugin framework seems more stable.

What is the best way to post unfinished code and ask for review?
Should i make a JIRA issue and post a patch with a not ready for
prime time note?  Should I post the code on a personal website?

Does this belong in the main package o.a.s.update? or a 'contrib'
package/project/source folder?  If it is in the main tree: the unit
tests require mysql-connector.jar and assume you have a database at:
jdbc:mysql://localhost/solrtestsql with user=solr and password=solr.
Any suggestions on how this should be handled or structured.

While we are discussing package layout and external dependencies,
consider the not-too-distant case where I will need to do this with an
excel spreadsheet and depend on jakarta poi.

Perhaps it would be good to stub folders and examples for 'contrib'
RequestHandlers and UpdateHandlers.

Thanks!  I'm asking so many questions because i want to make sure i
can give back whatever useful stuff I'm doing now.


SQL UpdatePlugin?

2007-01-10 Thread Ryan McKinley

I'd like to be able to add/update documents from an SQL query.  Perhaps:

addFromSQL
 mode=add or replace fields
 connection=jdbc:mysql://localhost/nblmc?username=xxxpassword=xxx
 driver=com.mysql.jdbc.Driver
 multifieldSeperator=\n 
 SELECT * FROM my_stats_table
/addFromSQL

This would use the the column names as the field name, and the cell
value.toString() as the field value.

If the schema says the field can have multiple values AND a
multifieldSeperator is defined, it will split the value on that
string.

To get intended results, you may need to use the 'AS' command and
perhaps format the cells using SQL.  For example:

SELECT itemID AS id, name, DATE_FORMAT( addedTime, '%Y-%m-%dT%H:%i:%s.000Z' )

Should this be an implemented as an Update Plugin?  or added directly
to the DirectUpdateHandler.

If it should be an UpdatePlugin, how do i get started?

thanks
ryan