Re: SQL UpdatePlugin?
On 1/12/07, Ryan McKinley [EMAIL PROTECTED] wrote: ...What is the best way to post unfinished code and ask for review? Should i make a JIRA issue and post a patch with a not ready for prime time note? ... Yes, this would work. Our personal NotReadyForPrimeTime detectors would flag it immediately anyway, if needed ;-) Putting TODO comments in your code in places where you feel others should have a look can help if you're unsure about how to best implement certain things. -Bertrand
Re: SQL UpdatePlugin?
On 1/12/07, Ryan McKinley [EMAIL PROTECTED] wrote: For now, it extends SolrRequestHandler (not UpdateHandler) because the request plugin framework seems more stable. There is no framework for pluggable update handlers yet... Solr's current UpdateHandler classes (DirectUpdateHandler2) are a bit misnamed in this context... they deal with the low-level lucene index munging and concurrency issues, more like an IndexWriter. What is the best way to post unfinished code and ask for review? Should i make a JIRA issue and post a patch with a not ready for prime time note? Should I post the code on a personal website? Yep, and adding a note that it's just a draft is important. Does this belong in the main package o.a.s.update? IMO, yes (or in o.a.s.update.db or sql). SQL databases are ubiquitous, and we don't want everyone having to do the same integration steps themselves to index their db content. or a 'contrib' package/project/source folder? If it is in the main tree: the unit tests require mysql-connector.jar and assume you have a database at: jdbc:mysql://localhost/solrtestsql with user=solr and password=solr. Any suggestions on how this should be handled or structured. I'm split on this one... It seems like it would be nice if we didn't have any vendor specific db stuff in the core or in the repository. On the other hand, if the vast majority of people will be using this with mysql, why make them go through the trouble. -Yonik
Re: SQL UpdatePlugin?
or a 'contrib' package/project/source folder? If it is in the main tree: the unit tests require mysql-connector.jar and assume you have a database at: jdbc:mysql://localhost/solrtestsql with user=solr and password=solr. Any suggestions on how this should be handled or structured. I'm split on this one... It seems like it would be nice if we didn't have any vendor specific db stuff in the core or in the repository. On the other hand, if the vast majority of people will be using this with mysql, why make them go through the trouble. Perhaps we could add an 'ext' directory to lib where we put libraries used by many plugins, but are not essential to solr. These would probably include: commons-fileupload-1.1.1.jar mysql-connector-java-5.0.4.jar postgresql.jar (i don't actually know what it is called) poi-2.5.1.jar commons-csv-20061121.jar If people are worried about a more lean distribution, there could be a 'core' and 'complete' version. At least this would solve my current problem: where to put the libraries for unit testing!
Re: SQL UpdatePlugin?
On Wed, 2007-01-10 at 22:32 -0800, Ryan McKinley wrote: I'd like to be able to add/update documents from an SQL query. Perhaps: addFromSQL mode=add or replace fields connection=jdbc:mysql://localhost/nblmc?username=xxxpassword=xxx driver=com.mysql.jdbc.Driver multifieldSeperator=\n SELECT * FROM my_stats_table /addFromSQL This would use the the column names as the field name, and the cell value.toString() as the field value. If the schema says the field can have multiple values AND a multifieldSeperator is defined, it will split the value on that string. To get intended results, you may need to use the 'AS' command and perhaps format the cells using SQL. For example: SELECT itemID AS id, name, DATE_FORMAT( addedTime, '%Y-%m-%dT%H:%i:%s.000Z' ) Should this be an implemented as an Update Plugin? or added directly to the DirectUpdateHandler. If it should be an UpdatePlugin, how do i get started? Hmm, I am not an expert but IMO that should not go directly to the DirectUpdateHandler. Solr is following the push model for updates till now. The above is changing this since now solr is pulling the documents to add from the db. Not saying this is bad or good. I think you should consider something like this: DbToSolrXml.java - this component connects to the db and generates proper solr xml update statement. From here you do as usual. Have a look at https://issues.apache.org/jira/browse/SOLR-66 maybe you can use this. HTH salu2 thanks ryan
Re: SQL UpdatePlugin?
On 1/11/07, Ryan McKinley [EMAIL PROTECTED] wrote: I'd like to be able to add/update documents from an SQL query. Me too... it's been on the todo list a long time. A lot of people have data in databases, and it's a shame to require code to index their data if it can be expressed in SQL. If it were less common, I'd say it would be better as a standalone app talking to Solr over XML/HTTP, but given that it's *such* a common case, I'd support it going into the core. I'd envision query args instead of XML though... something that could be generated by a browser. overwrite=truesql=SELECT * FROM my_stats_tableetc The big question in my mind is if the database schema is simple enough for something like this to work... esp w.r.t multi-valued fields. Multiple values in a database may be in multiple rows... can we handle that case somehow? What types of joins can we handle? -Yonik addFromSQL mode=add or replace fields connection=jdbc:mysql://localhost/nblmc?username=xxxpassword=xxx driver=com.mysql.jdbc.Driver multifieldSeperator=\n SELECT * FROM my_stats_table /addFromSQL This would use the the column names as the field name, and the cell value.toString() as the field value. If the schema says the field can have multiple values AND a multifieldSeperator is defined, it will split the value on that string. To get intended results, you may need to use the 'AS' command and perhaps format the cells using SQL. For example: SELECT itemID AS id, name, DATE_FORMAT( addedTime, '%Y-%m-%dT%H:%i:%s.000Z' ) Should this be an implemented as an Update Plugin? or added directly to the DirectUpdateHandler. If it should be an UpdatePlugin, how do i get started? thanks ryan
Re: SQL UpdatePlugin?
On 1/11/07, Yonik Seeley [EMAIL PROTECTED] wrote: On 1/11/07, Ryan McKinley [EMAIL PROTECTED] wrote: I'd like to be able to add/update documents from an SQL query. Me too... it's been on the todo list a long time. A lot of people have data in databases, and it's a shame to require code to index their data if it can be expressed in SQL. If it were less common, I'd say it would be better as a standalone app talking to Solr over XML/HTTP, but given that it's *such* a common case, I'd support it going into the core. I'd envision query args instead of XML though... something that could be generated by a browser. overwrite=truesql=SELECT * FROM my_stats_tableetc To keep /update a clean, perhaps updating from SQL should get its own servlet. /updateSQL?overwrite=truesql=SELECT * FROM my_stats_tableetc I think connection settings and driver should be set by the request, not through configuration. People would need to add an sql driver (mysql-connector-java-5.0.4.jar or whatever) to the /lib directory for anything to work. The big question in my mind is if the database schema is simple enough for something like this to work... esp w.r.t multi-valued fields. I like the idea of adding a 'separator' token. that could split a single string into multiple fields. Multiple values in a database may be in multiple rows... can we handle that case somehow? If the rows are sorted by the ID, we could keep building a document until the ID is different from the previous one. Multiple rows would keep adding to the same document. Requiring fields to be sorted by document ID seems like an ok restriction - the alternative is to load everything into memory until you hit the end of the result set. What types of joins can we handle? Anything you can pass to: ResultSet resultSet = stmt.executeQuery( query ); If this throws an exception, it will be passed to the user. I can't think of any (or don't know about any) specific join functions that would be problematic - - - - - - In my real use cases, documents are made from some fields that are directly from SQL and others that have complex logic behind them. The SQL fields (stats) must be updated frequently and the others only once. This is why i also need to implement an 'update' mode.
Re: SQL UpdatePlugin?
On 1/11/07, Ryan McKinley [EMAIL PROTECTED] wrote: On 1/11/07, Yonik Seeley [EMAIL PROTECTED] wrote: On 1/11/07, Ryan McKinley [EMAIL PROTECTED] wrote: I'd like to be able to add/update documents from an SQL query. Me too... it's been on the todo list a long time. A lot of people have data in databases, and it's a shame to require code to index their data if it can be expressed in SQL. If it were less common, I'd say it would be better as a standalone app talking to Solr over XML/HTTP, but given that it's *such* a common case, I'd support it going into the core. I'd envision query args instead of XML though... something that could be generated by a browser. overwrite=truesql=SELECT * FROM my_stats_tableetc To keep /update a clean, perhaps updating from SQL should get its own servlet. It should definitely be separate... either a servlet or update plugin. You could start by implementing it as a request handler plugin and it should be easy to change it to an update plugin depending on if/how we handle that. See http://issues.apache.org/jira/browse/SOLR-66 for how I handle parameters, reusing SolrParams and enabling per-field specifications for things like separator. /updateSQL?overwrite=truesql=SELECT * FROM my_stats_tableetc I think connection settings and driver should be set by the request, not through configuration. If we do things the same way as request handlers, then you can have it both ways... one can specify it in configuration *or* in the request. The big question in my mind is if the database schema is simple enough for something like this to work... esp w.r.t multi-valued fields. I like the idea of adding a 'separator' token. that could split a single string into multiple fields. That should definitely be there, but I just don't know if it's sufficient. Many fields that need to be multi-valued won't be stored like that in the database. Multiple values in a database may be in multiple rows... can we handle that case somehow? If the rows are sorted by the ID, we could keep building a document until the ID is different from the previous one. Multiple rows would keep adding to the same document. I had thought about that, but it's tricky. When you do a join and get multiple rows, don't values repeat in each row, making it hard to tell if there were multiple values to begin with or not? I guess we could 1) go by the schema and only add multiple values if it's marked multiValued 2) only add distinct values... ignore repeated values Requiring fields to be sorted by document ID seems like an ok restriction - the alternative is to load everything into memory until you hit the end of the result set. The latter would break for large collections, so I guess it should be the former. -Yonik
Re: SQL UpdatePlugin?
this relates pretty much directly to a seperate ongoing thread about making the update mechanism more general to support differnet types of input streams then just XML... http://www.nabble.com/Re%3A-Handling-disparate-data-sources-in-Solr-tf2918621.html ...tackling an approach like that, with an out of the box DbUpdateRequestHandler that could be configured with a JDBC refrence and a prepared statement string might be the best avenue to persue instead of adding DB specific code to the existing SolrUpdateServlet or SolrCore. : Date: Wed, 10 Jan 2007 22:32:54 -0800 : From: Ryan McKinley [EMAIL PROTECTED] : Reply-To: solr-dev@lucene.apache.org : To: solr-dev@lucene.apache.org : Subject: SQL UpdatePlugin? : : I'd like to be able to add/update documents from an SQL query. Perhaps: : : addFromSQL : mode=add or replace fields : connection=jdbc:mysql://localhost/nblmc?username=xxxpassword=xxx : driver=com.mysql.jdbc.Driver : multifieldSeperator=\n : SELECT * FROM my_stats_table : /addFromSQL : : This would use the the column names as the field name, and the cell : value.toString() as the field value. : : If the schema says the field can have multiple values AND a : multifieldSeperator is defined, it will split the value on that : string. : : To get intended results, you may need to use the 'AS' command and : perhaps format the cells using SQL. For example: : : SELECT itemID AS id, name, DATE_FORMAT( addedTime, '%Y-%m-%dT%H:%i:%s.000Z' ) : : Should this be an implemented as an Update Plugin? or added directly : to the DirectUpdateHandler. : : If it should be an UpdatePlugin, how do i get started? : : thanks : ryan : -Hoss
Re: SQL UpdatePlugin?
I have written a first pass of an SQLUpdateHandler based on this discussion. It is working well for simple cases and has not tried anything complex yet. For now, it extends SolrRequestHandler (not UpdateHandler) because the request plugin framework seems more stable. What is the best way to post unfinished code and ask for review? Should i make a JIRA issue and post a patch with a not ready for prime time note? Should I post the code on a personal website? Does this belong in the main package o.a.s.update? or a 'contrib' package/project/source folder? If it is in the main tree: the unit tests require mysql-connector.jar and assume you have a database at: jdbc:mysql://localhost/solrtestsql with user=solr and password=solr. Any suggestions on how this should be handled or structured. While we are discussing package layout and external dependencies, consider the not-too-distant case where I will need to do this with an excel spreadsheet and depend on jakarta poi. Perhaps it would be good to stub folders and examples for 'contrib' RequestHandlers and UpdateHandlers. Thanks! I'm asking so many questions because i want to make sure i can give back whatever useful stuff I'm doing now.
SQL UpdatePlugin?
I'd like to be able to add/update documents from an SQL query. Perhaps: addFromSQL mode=add or replace fields connection=jdbc:mysql://localhost/nblmc?username=xxxpassword=xxx driver=com.mysql.jdbc.Driver multifieldSeperator=\n SELECT * FROM my_stats_table /addFromSQL This would use the the column names as the field name, and the cell value.toString() as the field value. If the schema says the field can have multiple values AND a multifieldSeperator is defined, it will split the value on that string. To get intended results, you may need to use the 'AS' command and perhaps format the cells using SQL. For example: SELECT itemID AS id, name, DATE_FORMAT( addedTime, '%Y-%m-%dT%H:%i:%s.000Z' ) Should this be an implemented as an Update Plugin? or added directly to the DirectUpdateHandler. If it should be an UpdatePlugin, how do i get started? thanks ryan