Re: loading many documents by ID
On 2/3/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: the schema creator should still have some say in what kinds of things are
: allowed/disallowed though -- the person doing the "update" may not fully
: understand the underlying model.

I think the two concerns should be separated:
 1) updateable docs implementation
 2) constraint checking

IMO, it's unnecessary to link these features, and requiring (2) will just
delay (1).  (2) should also cover things unrelated to updateable docs, such
as mandatory fields (say someone changes the schema and can't provide a
default, and they want the clients to fail until they are updated).

-Yonik
Re: loading many documents by ID
: I agree.  I started down that path, and it gets pretty ugly.  I
: stopped.  I have opted for a syntax that 'updates' all stored fields,
: but lets you say explicitly what to do for each field.  If there is a
: stored field you want to skip, you can specify that in the command rather
: than in the schema.

the schema creator should still have some say in what kinds of things are
allowed/disallowed though -- the person doing the "update" may not fully
understand the underlying model.

: > another simple approach would be to make "updatability" a property of the
: > schema, that can contain a few different values...

: This is an interesting idea, but (if i'm understanding your suggestion
: correctly) it seems like TOO big of a change from the existing schema.

the schema.xml format wouldn't change much .. just a new attribute on the
tag ... the existing example schema would either be labeled "loose" or
"none" and we could provide another example of "strict" ... or we would
label it "strict" and remove the references to indexed/stored and only
mention them in comments describing other things you can do if you don't
require the ability to mutate documents.

: think throwing an error if there are no stored fields is reasonable
: and only updating stored fields is simple enough logic I don't think
: we need to over complicate it.

throwing an error if there are no stored fields in the schema, or no
stored fields in the existing document, or no stored fields in the mutate
request?  what if the document just doesn't have any stored fields because
the first time it was added, the stored fields weren't known yet?  what if
the document does have stored fields, but it also has an indexed but not
stored field, and the person doing the update doesn't realize that and
doesn't send a replacement value for that field?

: > another approach i don't really have fully fleshed out in my head would be
: > to introduce a concept of "fieldsets" ... an update that
: > sets/appends/increments a field in a fieldset which does not provide a

: I may be working on this, but not sure if it is what you are saying.  I have:

no, i was thinking of it as a new bit of syntax in the schema ... after
defining all of your <field>s you have some <fieldset>s and any time you
update a doc, and mutate a field (either overwrite, append, increment,
whatever) which is in some <fieldset>s then you have to also provide a new
value for any non-stored field also listed in those fieldsets.  in a
simple schema, you'd only need one and it would list every field (we'd
probably even want a simple syntactic alias for that) but in more complex
schemas where you want Solr to provide some sanity checking on your docs,
but you frequently have different "types" of docs in your schema with
different sets of common overlapping fields -- the <fieldset>s are your
way of telling Solr when to complain.

:  public enum FieldMODE {
:    APPEND,    // add the fields to existing fields
:    OVERWRITE, // overwrite existing fields
:    INCREMENT, // increment existing field. Must be a number!
:    DISTINCT,  // same as APPEND, but make sure there are distinct values
:    IGNORE     // ignore the previous value -- don't copy it

as i understand it, these are options specified by the client triggering
the "mutate doc" command right? ... they totally make sense, but they
don't really address what Solr should do if the command doesn't mention a
field which is in the schema.

the use case i'm thinking about is an existing solr index with lots of
clients from different parts of a company adding/mutating documents, and
then the schema needs to change.  the Schema Owner should have some way of
saying what happens if one of those clients attempts to mutate a document
and doesn't provide a replacement value for an indexed/unstored field --
but there's no easy/fast way for the UpdateHandler to realize that a given
document has indexed values for that field -- hence either some simple
broad rules the schema owner can put in about the schema as a whole, or
sets of fields the schema owner can define: (if they try to mutate x, y,
or z, then they better be providing a, b and c because they are all used
together)

: default mode. I have not tried to tackle dynamic fields yet... it
: seems a bit more complicated!

yeah .. that's what i'm worried about with the fieldset idea too.  It's
one of the reasons why it might be a good idea to just say:

 * if you want to be able to mutate docs, and you want to be guaranteed
   it will always work, then every indexed field must be stored.
 * if you want to be able to mutate docs, and you can't feasibly store
   every indexed field; then add this one line to your schema.xml and
   Solr will trust that the clients sending mutate requests know what
   they are doing.
 * if you don't trust your clients to know what they are doing when
   mutating documents, add this one line to your schema and Solr will
   reject any attempt to mutate a document (only wholesale document
   replacement will be allowed)

-Hoss
Re: loading many documents by ID
> > How about:  Iterable
>
> Maybe... but that might not be the easiest for request handlers to use...
> they would then need to spin up a different thread and use a pull model
> (provide a new doc on demand) rather than push (call addDocument()).

With Iterable, you don't need to start a thread to implement a 'streaming'
parser.  You can use an anonymous inner class that waits until next() is
called before reading the next row/line/document, etc.  In effect this
lets the RequestHandler set up all the common configuration and then lets
the UpdateHandler ask for documents one at a time.

What I like about this is that the code that loops through each row of my
SQL updater does not need to know *anything* about the UpdateHandler.  I
would rather not call updater.addDoc( cmd ) within the while( rs.next() )
loop.  This makes it much cleaner and easier to test.

If writing a 'streaming' Iterable is more trouble than someone wants to go
through, they can easily return a Collection or an array with a single
element.

> When I'm coding, the design tends to morph a lot.

mine too!

> I think we need to figure out what type of update semantics we want
> w.r.t. adding multiple documents, and all the other misc autocommit
> params.

Right now, what i am working with is an 'update' command that you can pass
along modes for each field.  If no modes are specified (or they are all
OVERWRITE) it behaves exactly as we have now (SQL REPLACE).  If any field
uses something other than OVERWRITE, it behaves like an SQL INSERT ... ON
DUPLICATE KEY UPDATE.
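For illustration, a minimal self-contained sketch of that anonymous-inner-class approach (the names here are invented for the example, not Solr API; a real one would wrap a ResultSet or a Reader instead of an array):

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Sketch of a lazy "streaming" Iterable: nothing is produced until the
// consumer calls next(), so an UpdateHandler can pull one document at a
// time while the producer (SQL rows, CSV lines, XML stream) is read on
// demand -- no extra thread needed.
public class StreamingDocs {
    static Iterable<String> docsFrom(final String[] rows) {
        return new Iterable<String>() {
            public Iterator<String> iterator() {
                return new Iterator<String>() {
                    private int i = 0;
                    public boolean hasNext() { return i < rows.length; }
                    public String next() {
                        if (!hasNext()) throw new NoSuchElementException();
                        // the expensive row -> document work happens here,
                        // only when the UpdateHandler asks for it
                        return "doc:" + rows[i++];
                    }
                    public void remove() { throw new UnsupportedOperationException(); }
                };
            }
        };
    }
}
```

A caller that only has one document in hand can just return a singleton Collection instead; the consuming loop is the same either way.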
Re: loading many documents by ID
On 2/1/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> > Not sure... depends on how update handlers will use it...
>
> by update handler, you mean UpdateRequestHandler(s)?  or UpdateHandler?

Both.

> > One thing we might not want to get rid of though is streaming
> > (constructing and adding a document, then discarding it).  People are
> > starting to add a lot of documents in a single XML request, and this
> > will be much larger for CSV/SQL.
>
> So you are uncomfortable with the Collection because you would have to
> load all the documents before indexing them.  If this was many, it
> could be a problem...
>
> If UpdateHandler is going to take care of stuff like autocommit and
> modifying documents, It seems best to have that apply to all the
> documents you are going to modify as a unit.  For example, say i have a
> SQL updater that will modify 100,000 documents incrementing field
> 'count_*' and replacing 'fl_*'.  If the DocumentCommand only applies to
> a single document, it would have to match each field as it went along
> rather than once when it starts.
>
> How about:  Iterable

Maybe... but that might not be the easiest for request handlers to use...
they would then need to spin up a different thread and use a pull model
(provide a new doc on demand) rather than push (call addDocument()).

I'm really just thinking a little out loud... just first impressions -
don't read too much into it.  When I'm coding, the design tends to morph a
lot.

I think we need to figure out what type of update semantics we want w.r.t.
adding multiple documents, and all the other misc autocommit params.

-Yonik
Re: loading many documents by ID
> 1) regardless of the verb (updatable/modifiable) i'm not sure that it
> makes sense to annotate in the schema the fields that should be copied
> on update, and not label the fields that must be "set" on update (ie:
> the fields that cannot be copied)

I agree.  I started down that path, and it gets pretty ugly.  I stopped.
I have opted for a syntax that 'updates' all stored fields, but lets you
say explicitly what to do for each field.  If there is a stored field you
want to skip, you can specify that in the command rather than in the
schema.

> another simple approach would be to make "updatability" a property of
> the schema, that can contain a few different values...
>
>  "strict" - indexed and stored are no longer valid field(type)
>    attributes -- all fields are indexed and stored.  all fields are
>    copied on "update" unless the update command includes instructions
>    to replace, append or increment the field value
>  "loose" - indexed/stored still exist, any attempt to "update" an
>    existing document is legal, all stored fields are copied on update
>    unless the update command includes instructions to replace, append
>    or increment the field value.
>  "none" - any attempt to update will fail.

This is an interesting idea, but (if i'm understanding your suggestion
correctly) it seems like TOO big of a change from the existing schema.

The more I think about the 'error' behavior, I am convinced we just need
solid, easily explainable logic for what happens and why.  I think
throwing an error if there are no stored fields is reasonable, and only
updating stored fields is simple enough logic; I don't think we need to
over complicate it.

> another approach i don't really have fully fleshed out in my head would
> be to introduce a concept of "fieldsets" ... an update that
> sets/appends/increments a field in a fieldset which does not provide a

I may be working on this, but not sure if it is what you are saying.
I have:

  public class IndexDocumentCommand {
    public enum FieldMODE {
      APPEND,    // add the fields to existing fields
      OVERWRITE, // overwrite existing fields
      INCREMENT, // increment existing field. Must be a number!
      DISTINCT,  // same as APPEND, but make sure there are distinct values
      IGNORE     // ignore the previous value -- don't copy it
    };

    public Iterable docs;
    public Map fieldMode; // What to do for each field.
    public int commitMaxTime = -1;
  }

If fieldMode is null or they are all OVERWRITE, the addDoc command behaves
as it always has.  Otherwise, it first extracts the existing stored values
(unless the fieldMode is IGNORE) then applies the new document's values on
top of the old ones.

Currently I am only handling wildcard substitution for "*" - the default
mode.  I have not tried to tackle dynamic fields yet... it seems a bit
more complicated!
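To make those merge semantics concrete, here is a self-contained sketch of how the modes could combine old stored values with new ones.  Documents are modeled as plain maps of field name to value list; this is an illustration of the idea, not the actual DocumentBuilder/UpdateHandler code:

```java
import java.util.*;

// Illustration of the FieldMODE merge semantics described above.
// A "document" is just Map<fieldName, List<values>> for this sketch.
public class FieldMerge {
    public enum FieldMODE { APPEND, OVERWRITE, INCREMENT, DISTINCT, IGNORE }

    public static Map<String, List<String>> merge(
            Map<String, List<String>> oldDoc,
            Map<String, List<String>> newDoc,
            Map<String, FieldMODE> modes) {
        Map<String, List<String>> out = new LinkedHashMap<String, List<String>>();
        Set<String> names = new LinkedHashSet<String>(oldDoc.keySet());
        names.addAll(newDoc.keySet());
        for (String f : names) {
            FieldMODE mode = modes.containsKey(f) ? modes.get(f) : FieldMODE.OVERWRITE;
            List<String> oldVals = oldDoc.containsKey(f) ? oldDoc.get(f) : Collections.<String>emptyList();
            List<String> newVals = newDoc.containsKey(f) ? newDoc.get(f) : Collections.<String>emptyList();
            Collection<String> merged;
            switch (mode) {
                case OVERWRITE: // new values win; keep old only if none were sent
                    merged = newVals.isEmpty() ? oldVals : newVals;
                    break;
                case APPEND:    // keep old values and add the new ones
                    merged = new ArrayList<String>(oldVals);
                    merged.addAll(newVals);
                    break;
                case DISTINCT:  // APPEND, but collapse duplicate values
                    merged = new LinkedHashSet<String>(oldVals);
                    merged.addAll(newVals);
                    break;
                case INCREMENT: // numeric: old value plus the increment sent
                    long oldN = oldVals.isEmpty() ? 0 : Long.parseLong(oldVals.get(0));
                    long incN = newVals.isEmpty() ? 0 : Long.parseLong(newVals.get(0));
                    merged = Collections.singletonList(String.valueOf(oldN + incN));
                    break;
                case IGNORE:    // drop the old value entirely; use only the new
                default:
                    merged = newVals;
            }
            if (!merged.isEmpty()) out.put(f, new ArrayList<String>(merged));
        }
        return out;
    }
}
```

Note the DISTINCT/APPEND difference is exactly the Set-vs-List distinction discussed earlier in the thread.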
Re: loading many documents by ID
: 1. Set the "updateable" fields explicitly in the schema.
:
: * throw an exception at startup if an updateable field is not stored.
:   If somewhere down the road we figure out how to efficiently handle
:   unstored fields, we can remove this error.
: * when 'updating', only copy the fields marked 'updateable'
: * If someone sends an 'update' request and there are no fields marked
:   updateable, return an error

i have two concerns:

1) regardless of the verb (updatable/modifiable) i'm not sure that it
makes sense to annotate in the schema the fields that should be copied on
update, and not label the fields that must be "set" on update (ie: the
fields that cannot be copied)

2) Solr makes it very easy to support different "classes" of documents
that use different subsets of the fields in the schema -- some of which
may overlap.  if we assume that it's okay to allow an "update" of a
document because there's at least one field in the schema that is stored,
we won't catch cases where that one field isn't used for that "type" of
document.

a simple way to go that wouldn't catch all user mistakes, but could be
confident it never errored incorrectly, would be to assume that any doc
can be "updated" as long as it has at least one stored field -- that's the
simplest possible use case afterall: i want to modify a doc in place,
replacing all of the indexed but unstored values with new values, and i
only want the stored fields to be copied over again unchanged.

another simple approach would be to make "updatability" a property of the
schema, that can contain a few different values...

 "strict" - indexed and stored are no longer valid field(type) attributes
   -- all fields are indexed and stored.  all fields are copied on
   "update" unless the update command includes instructions to replace,
   append or increment the field value
 "loose" - indexed/stored still exist, any attempt to "update" an
   existing document is legal, all stored fields are copied on update
   unless the update command includes instructions to replace, append or
   increment the field value.
 "none" - any attempt to update will fail.

...novice users who want updatability should use strict, more experienced
users who want updatability but smaller index sizes and understand the
issues with fields that are indexed but unstored can use loose.

another approach i don't really have fully fleshed out in my head would be
to introduce a concept of "fieldsets" ... an update that
sets/appends/increments a field in a fieldset which does not provide a
value for any unstored fields in that fieldset could trigger an error ...
that would help with the different 'classes' of documents, but i'm not
sure if it could really work with dynamicFields.

-Hoss
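A hypothetical schema.xml fragment for the "updatability" and "fieldset" ideas above might look like the following; the updatability attribute and the <fieldset> element are invented for illustration and are not actual Solr syntax:

```xml
<!-- hypothetical: updatability could be "strict", "loose", or "none" -->
<schema name="example" version="1.1" updatability="loose">
  <fields>
    <field name="id"    type="string" indexed="true" stored="true"/>
    <field name="title" type="text"   indexed="true" stored="true"/>
    <field name="body"  type="text"   indexed="true" stored="false"/>
  </fields>

  <!-- hypothetical: these fields are used together, so mutating any of
       them would require the client to re-supply the unstored "body" -->
  <fieldset name="article">
    <field>title</field>
    <field>body</field>
  </fieldset>
</schema>
```

Under "strict" the indexed/stored attributes would be disallowed entirely; under "none" any mutate request would be rejected outright.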
Re: loading many documents by ID
> Not sure... depends on how update handlers will use it...

by update handler, you mean UpdateRequestHandler(s)?  or UpdateHandler?

> One thing we might not want to get rid of though is streaming
> (constructing and adding a document, then discarding it).  People are
> starting to add a lot of documents in a single XML request, and this
> will be much larger for CSV/SQL.

So you are uncomfortable with the Collection because you would have to
load all the documents before indexing them.  If this was many, it could
be a problem...

If UpdateHandler is going to take care of stuff like autocommit and
modifying documents, It seems best to have that apply to all the documents
you are going to modify as a unit.  For example, say i have a SQL updater
that will modify 100,000 documents incrementing field 'count_*' and
replacing 'fl_*'.  If the DocumentCommand only applies to a single
document, it would have to match each field as it went along rather than
once when it starts.

How about:  Iterable

this way, an UpdateRequestHandler can start the UpdateHandler running
while it streams each document from XML/CSV/SQL

ryan
Re: loading many documents by ID
On 2/1/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> I am (was?) using DISTINCT to say, only add the unique fields.  As
> implemented, it keeps a Collection for each field name.  If the 'mode'
> is 'DISTINCT' the collection is a Set, otherwise a List

Ah, OK... that does seem useful.

> How would you feel about an interface like this:

Not sure... depends on how update handlers will use it...

One thing we might not want to get rid of though is streaming
(constructing and adding a document, then discarding it).  People are
starting to add a lot of documents in a single XML request, and this will
be much larger for CSV/SQL.  For that reason, I'm not sure how often the
"Collection" part will be utilized.  I like it OK on the conceptual level
though.

-Yonik

>  public class IndexDocumentsCommand {
>    public enum MODE {
>      APPEND,    // add the fields to existing fields
>      OVERWRITE, // overwrite existing fields
>      INCREMENT, // increment existing field
>      DISTINCT   // same as APPEND, but make sure there are distinct values
>    };
>
>    // optional id in "internal" indexed form... if it is needed and not
>    // supplied, it will be obtained from the doc.
>    public String indexedId;
>
>    public Collection docs;
>    public boolean allowDups;
>    public boolean overwrite;
>    public SimpleOrderedMap modifyFieldMode; // What to do for each field.
>                                             // We should support *
>    public int commitMaxTime = -1; // make sure these documents are
>                                   // committed within this much time
>  }
Re: loading many documents by ID
> > REPLACE_DOCUMENT
> > REPLACE_FIELDS
> > REPLACE_DISTINCT_FIELDS
> > ADD_FIELDS
> > ADD_DISTINCT_FIELDS
>
> What does "distinct" mean in this context?

I am (was?) using DISTINCT to say, only add the unique fields.  As
implemented, it keeps a Collection for each field name.  If the 'mode' is
'DISTINCT' the collection is a Set, otherwise a List

> There is a lot of processing going on inside Document Builder.  Once
> you get to the UpdateCommand, you have already lost some information
> (copyFields have executed, some things have been converted to index
> form, etc).

I noticed that!  It made sense when I was implementing this in a
RequestHandler, but it gets a little wonky inside the UpdateHandler - as
you said, copyFields already executed.

I think the best thing is to make a new command that does not directly
take a lucene document as its input.  perhaps:
http://svn.lapnap.net/solr/solrj/src/org/apache/solr/client/solrj/SolrDocument.java
http://svn.lapnap.net/solr/solrj/src/org/apache/solr/client/solrj/impl/SimpleSolrDoc.java

Then the UpdateHandler would open the DocumentBuilder and merge the
existing document with the passed in document using whatever method is
specified.

> I would think one would also want to specify things per field.
>  - append this value to this field
>  - increment the value of this field
>  - append this value to this field
>  - overwrite this field

How would you feel about an interface like this:

  public class IndexDocumentsCommand {
    public enum MODE {
      APPEND,    // add the fields to existing fields
      OVERWRITE, // overwrite existing fields
      INCREMENT, // increment existing field
      DISTINCT   // same as APPEND, but make sure there are distinct values
    };

    // optional id in "internal" indexed form... if it is needed and not
    // supplied, it will be obtained from the doc.
    public String indexedId;

    public Collection docs;
    public boolean allowDups;
    public boolean overwrite;
    public SimpleOrderedMap modifyFieldMode; // What to do for each field.
                                             // We should support *
    public int commitMaxTime = -1; // make sure these documents are
                                   // committed within this much time
  }

ryan
Re: loading many documents by ID
On 2/1/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> I have something working that adds a 'mode' to AddUpdateCommand.

Feel free to suggest replacements for the UpdateCommand classes if things
become cumbersome.

> The modes I need are:
>
>  REPLACE_DOCUMENT
>  REPLACE_FIELDS
>  REPLACE_DISTINCT_FIELDS
>  ADD_FIELDS
>  ADD_DISTINCT_FIELDS

What does "distinct" mean in this context?

There is a lot of processing going on inside Document Builder.  Once you
get to the UpdateCommand, you have already lost some information
(copyFields have executed, some things have been converted to index form,
etc).

I would think one would also want to specify things per field.
 - append this value to this field
 - increment the value of this field
 - append this value to this field
 - overwrite this field

CSV/SQL handlers could define these per-field for multiple docs (column)
for a request.  XML could define per-field instance if we want, or we
might want to restrict per field (column) for a single request.

-Yonik
Re: loading many documents by ID
On 2/1/07 10:55 AM, "Ryan McKinley" <[EMAIL PROTECTED]> wrote:
> Is there a better word than 'update'?  It seems there is already enough
> confusion between UpdateHandlers, "Update Plugins",
> UpdateRequestHandler etc.

Try "modify".  Solr uses "update" to include "add".

wunder
Re: loading many documents by ID
What I think I'm seeing is two validation options:

1. Set the "updateable" fields explicitly in the schema.

 * throw an exception at startup if an updateable field is not stored.
   If somewhere down the road we figure out how to efficiently handle
   unstored fields, we can remove this error.
 * when 'updating', only copy the fields marked 'updateable'
 * If someone sends an 'update' request and there are no fields marked
   updateable, return an error

2. Assume all stored fields that are not copied to are 'updateable'

 * return an error if someone sends an 'update' request and there are no
   stored fields

I vote for option #1 -- although most configurations that want to 'update'
fields will probably mark all stored fields as 'updateable', it seems
valuable to make the schema designer explicitly specify what will happen
on an 'update'

- - - - - - -

Is there a better word than 'update'?  It seems there is already enough
confusion between UpdateHandlers, "Update Plugins", UpdateRequestHandler
etc.  In this case "update" makes sense as it is the SQL equivalent.

- - - - - - -

I have something working that adds a 'mode' to AddUpdateCommand.  The
modes I need are:

 REPLACE_DOCUMENT
 REPLACE_FIELDS
 REPLACE_DISTINCT_FIELDS
 ADD_FIELDS
 ADD_DISTINCT_FIELDS

ryan
Re: loading many documents by ID
On Feb 1, 2007, at 12:05 AM, Ryan McKinley wrote:
> > > We'd have to make it very clear that this only works if all fields
> > > are STORED.
> >
> > Isn't there some way to do this automatically instead of relying on
> > documentation?  We might need to add something, maybe a "required"
> > attribute on fields, but a runtime error would be much, much better
> > than a page on the wiki.
>
> what about copyField?
>
> With copyField, it is reasonable to have fields that are not stored and
> are generated from the other stored fields.  (this is what my setup
> looks like)

I would think copyFields would be exempt from the STORED mandate, and only
the definitions would matter for an update restriction.

Erik
Re: loading many documents by ID
On 1/31/07 9:05 PM, "Ryan McKinley" <[EMAIL PROTECTED]> wrote:
> > > We'd have to make it very clear that this only works if all fields
> > > are STORED.
> >
> > Isn't there some way to do this automatically instead of relying
> > on documentation?  We might need to add something, maybe a
> > "required" attribute on fields, but a runtime error would be
> > much, much better than a page on the wiki.
>
> what about copyField?
>
> With copyField, it is reasonable to have fields that are not stored
> and are generated from the other stored fields.  (this is what my
> setup looks like).

Mine, too.  That is why I suggested explicit declarations in the schema to
say which fields are required.

wunder
Re: loading many documents by ID
> > We'd have to make it very clear that this only works if all fields
> > are STORED.
>
> Isn't there some way to do this automatically instead of relying
> on documentation?  We might need to add something, maybe a
> "required" attribute on fields, but a runtime error would be
> much, much better than a page on the wiki.

what about copyField?

With copyField, it is reasonable to have fields that are not stored and
are generated from the other stored fields.  (this is what my setup looks
like)
Re: loading many documents by ID
On 1/31/07 3:39 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:
> : Oh, and there have been numerous people interested in "updateable"
> : documents, so it would be nice if that part was in the update handler.
>
> We'd have to make it very clear that this only works if all fields are
> STORED.

Isn't there some way to do this automatically instead of relying on
documentation?  We might need to add something, maybe a "required"
attribute on fields, but a runtime error would be much, much better than a
page on the wiki.

wunder
Re: loading many documents by ID
On 1/31/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> On Jan 31, 2007, at 6:39 PM, Chris Hostetter wrote:
> > : Oh, and there have been numerous people interested in "updateable"
> > : documents, so it would be nice if that part was in the update
> > handler.
> >
> > We'd have to make it very clear that this only works if all fields
> > are STORED.
>
> That is perfectly reasonable, for sure.  And I would support an
> "update" feature issuing an exception if it detected this case.
>
> There is an important caveat to all fields being stored though... if an
> update was sending in updated fields for all the non-stored fields, and
> only stored fields were being copied internally, all would be fine too.

I think there might be two useful types of updates:
 1) overwrite original field
 2) add an additional value for a multi-valued field (useful for tagging?)

> I think eventually we could have this sort of feature internally copy
> the terms for non-stored fields somehow, but maybe that would only come
> along once Lucene supported something to facilitate this more?

Not unless you store more info (a lot more info).  We should also be able
to copy unstored fields with term vectors stored.

ParallelReader might also hold some promise (putting a field to be updated
in a separate index).  The problem is that the lucene ids need to be kept
in sync... I don't know how to do that w/o reindexing.

-Yonik
Re: loading many documents by ID
On Jan 31, 2007, at 6:39 PM, Chris Hostetter wrote:
> : Oh, and there have been numerous people interested in "updateable"
> : documents, so it would be nice if that part was in the update handler.
>
> We'd have to make it very clear that this only works if all fields are
> STORED.

That is perfectly reasonable, for sure.  And I would support an "update"
feature issuing an exception if it detected this case.

There is an important caveat to all fields being stored though... if an
update was sending in updated fields for all the non-stored fields, and
only stored fields were being copied internally, all would be fine too.

I think eventually we could have this sort of feature internally copy the
terms for non-stored fields somehow, but maybe that would only come along
once Lucene supported something to facilitate this more?

Erik
Re: loading many documents by ID
: Oh, and there have been numerous people interested in "updateable"
: documents, so it would be nice if that part was in the update handler.

We'd have to make it very clear that this only works if all fields are
STORED.

-Hoss
Re: loading many documents by ID
On 1/30/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> I am working on an SQLUpdatePlugin that needs to set a few fields in
> the document and leave the rest as they were.

Oh, and there have been numerous people interested in "updateable"
documents, so it would be nice if that part was in the update handler.

-Yonik
Re: loading many documents by ID
On 1/30/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> I am working on an SQLUpdatePlugin that needs to set a few fields in
> the document and leave the rest as they were.  My plan is to load the
> old documents from the current searcher and overwrite any fields set
> from SQL.
>
> What is the best way to load a (potentially) large set of documents by
> ID?  Should I make one query with a list of all IDs?  make a separate
> query for each ID?  Is there a cache i can try to pull them from?

See SolrIndexSearcher:
  public int getFirstMatch(Term t) throws IOException

That returns an internal lucene docid for the first match of a term (make
sure to convert it to internal form first via FieldType.toInternal()).
Then use SolrIndexSearcher.doc(int) to get the document (it will check the
cache).

Trying to retrieve multiple documents at once doesn't offer a performance
benefit, as Lucene needs to seek to each term anyway.

-Yonik
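To make that lookup loop concrete, a toy sketch of the pattern (the Searcher interface and toInternal method below are invented stand-ins, not the real SolrIndexSearcher/FieldType API):

```java
import java.util.*;

// Toy stand-in for the retrieval loop described above: convert each
// external ID to indexed form, look up the first matching lucene docid
// for that term, then fetch the (cached) stored document.  One term
// lookup per ID -- batching would not help, since lucene must seek to
// each term individually anyway.
public class LoadById {
    interface Searcher {
        int getFirstMatch(String indexedId);       // -1 if no match
        Map<String, String> doc(int luceneDocId);  // stored fields
    }

    // stand-in for FieldType.toInternal(): external -> indexed form
    static String toInternal(String externalId) {
        return externalId.toLowerCase(Locale.ROOT);
    }

    static List<Map<String, String>> load(Searcher s, List<String> ids) {
        List<Map<String, String>> out = new ArrayList<Map<String, String>>();
        for (String id : ids) {
            int docId = s.getFirstMatch(toInternal(id));
            if (docId >= 0) out.add(s.doc(docId));  // skip IDs with no match
        }
        return out;
    }
}
```

In the real code, doc(int) hitting Solr's document cache is what keeps the per-ID loop cheap for recently read documents.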
loading many documents by ID
I am working on an SQLUpdatePlugin that needs to set a few fields in the
document and leave the rest as they were.  My plan is to load the old
documents from the current searcher and overwrite any fields set from SQL.

What is the best way to load a (potentially) large set of documents by ID?
Should I make one query with a list of all IDs?  make a separate query for
each ID?  Is there a cache i can try to pull them from?

thanks for any pointers.

ryan