My language might be a bit off (I am saying "string" when I probably mean "text" in the context of solr), but I'm pretty sure that my story is unwavering ;)
`id` int(11) NOT NULL AUTO_INCREMENT
`created` int(10)
`data` varbinary(255)
`user_id` int(11)

So, imagine that we have 1000 entries come in where "data" above is exactly the same for all 1000 entries, but user_id is different (id and created being different is irrelevant). I am thinking that prior to inserting into mysql, I should be able to concatenate the user_ids together with whitespace and then insert them into something like:

`id` int(11) NOT NULL AUTO_INCREMENT
`created` int(10)
`data` varbinary(255)
`user_id` blob

Then on solr's end it will treat the user_id as Text and parse it (I want to say tokenize, but maybe my language is incorrect here?).

Then when I search

user_id:2002+AND+created:[${from}+TO+${until}]+data:"more"

I want to be sure that if I look for user_id "2002", I will get data that only has a value "2002" in the user_id column and that a separate user with id "20" cannot accidentally pull data for user_id "2002" as a result of a fuzzy (my language ok?) match of 20 against (20)02.

Current schema definition:

<field name="user_id" type="int" indexed="true" stored="true"/>

New schema definition:

<field name="user_id" type="user_id_string" indexed="true" stored="true"/>
...
<fieldType name="user_id_string" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" maxTokenLength="120"/>
  </analyzer>
</fieldType>

I am obviously not a 1337 solr haxor :P

Why do this? We have a lot of data coming in and I want to compact it as best as I can.

Regards,

Nate

On Fri, Jun 7, 2013 at 1:23 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> To be clear, one normally doesn't do queries on portions of an "ID" - usually it is one integrated string.
>
> Further, strings are definitely NOT tokenized in Solr.
>
> Your story keeps changing, which is why I have to keep hedging my answers.
>
> At least with your latest story, your user_id should be a text/TextField so that it will be tokenized.
> A query for "2002" will match on complete tokens, not parts of tokens. If you want to match exactly on the full user_id, use a quoted phrase for the full user_id.
>
> But... I still have to hedge, because you refer to "a string of concatenated user id values". You seem to have two distinct definitions for user id.
>
> So, until you disclose all of your requirements and your data model, including a clarification about user id vs. "a string of concatenated user id values", I can't answer your question definitively, other than "Maybe, depending on what you really mean by user id."
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: z z
> Sent: Friday, June 07, 2013 12:11 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Schema Change: Int -> String (i am the original poster, new email address)
>
> The unique key is an auto-incremented int in the db. Sorry for having given the impression that user_id is the unique key per document. This is a table of events that are happening as users interact with our system. It just so happens that we were inserting individual records for each user before we even began to think about using something like Solr. Now, however, it seems to me that we should be able to ask questions like "give me all records for user "2002" that have this string value "more" in data2, across this time stamp range [ .... ]. Several simultaneously inserted rows into the db are exactly the same aside from the user_ids. I just want to know beforehand if I can still maintain exact matches for a user if the user_id becomes a string of concatenated user id values.
>
> From what you are saying it sounds like the "user_id_str" is really all I need. It is tokenized and allows for partial searches. I just want to make sure that "2002 15000 45" when tokenized doesn't allow "20" to partially match the token "2002".
>
> On Fri, Jun 7, 2013 at 12:57 PM, Jack Krupansky <j...@basetechnology.com> wrote:
>
>> In that case, you will need to keep two copies of the user ID, one which is a single, complete string, and one which is a tokenized field text/TextField so that you can do a keyword search against it. Use the string/StrField as the main copy and then use a <copyField> directive in the schema to copy from the main copy to the other copy.
>>
>> So, maybe "user_id" is the full unique key - you would have to specify the full exact key to query against it, or use wildcards for partial matches, and "user" or "user_id_str" would be the tokenized text version that would allow a simple search by partial value, such as "2002".
>>
>> Even so, I'm still not convinced that you have given us your complete requirements. Is the user_id in fact the unique key for the documents?
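
For reference, the two-copy arrangement described in the quoted reply above (a StrField main copy plus a tokenized copy filled by <copyField>) might look roughly like the fragment below. This is only a sketch under assumptions: it assumes a "string" fieldType backed by solr.StrField is already defined in the schema, it reuses the "user_id_string" type from the top of this message, and stored="false" on the copy is my own guess at a sensible setting, not something anyone in the thread specified.

```xml
<!-- Main copy: the whole value as one untokenized string.
     Assumes a "string" fieldType (solr.StrField) already exists in schema.xml. -->
<field name="user_id" type="string" indexed="true" stored="true"/>

<!-- Tokenized copy: "user_id_str" is the name floated in this thread;
     "user_id_string" is the whitespace-tokenized TextField type defined earlier.
     stored="false" is an assumption, since the main copy already keeps the stored value. -->
<field name="user_id_str" type="user_id_string" indexed="true" stored="false"/>

<!-- Copy the raw value into the tokenized field at index time. -->
<copyField source="user_id" dest="user_id_str"/>
```

With something like this in place, a query on user_id_str:2002 would be matched against whole whitespace-separated tokens, so "20" should not match the token "2002" unless a wildcard or fuzzy operator is added to the query explicitly.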