Re: Schema Change: Int - String (i am the original poster, new email address)
Maybe it would clarify things if I said that the column user_id will become user_ids?

user_id:2002+AND+created:[${from}+TO+${until}]+data:more

becomes

user_ids:2002+AND+created:[${from}+TO+${until}]+data:more

where I want 2002 to be an exact positive match on one of the user_ids embedded in the TEXT ... not string :)

If I am totally off or making no sense, feedback is very welcome. I am just seeing lots of similar data going into my db and it feels like Solr should be able to handle this. I just want to know if transforming the data like that will still allow exact searches against a user_id. My language, from a Solr guru's point of view, is probably *very* poorly phrased ... exact and TEXT might not go hand in hand.

Is the TEXT 20 1442 35 parsed as 20, 1442, 35 so that a search against it for 1442 will yield exact results? A search for 442 won't match, right?

1. 20 1442 35
2. 20 442 35
3. 20 1442

user_ids:1442 - yields #1 and #3 always?
user_ids:442 - yields only #2 always?

My lack of understanding about what Solr does when it indexes is shining through :)

On Fri, Jun 7, 2013 at 1:43 PM, z z zenlok.testi...@gmail.com wrote:
> I want to be sure that if I look for user_id 2002, I will get data that only has a value 2002 in the user_id column [...]
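The question about the three sample values can be illustrated with a toy model. This is a minimal sketch in plain Python, not Solr code: it assumes the field is split on whitespace only and that a query term must equal a whole token. The names `docs`, `tokenize` and `search` are made up for illustration.

```python
# Toy model of whitespace tokenization followed by exact term lookup.
# It imitates the behaviour being asked about; it does not call Solr.

docs = {
    1: "20 1442 35",
    2: "20 442 35",
    3: "20 1442",
}


def tokenize(value):
    """Split on runs of whitespace, the way a whitespace tokenizer would."""
    return value.split()


def search(term):
    """Return ids of documents that contain `term` as a complete token."""
    return sorted(
        doc_id for doc_id, value in docs.items() if term in tokenize(value)
    )


print(search("1442"))  # [1, 3]
print(search("442"))   # [2]
```

Under those assumptions, 442 never matches inside 1442, because the comparison is made token by token rather than character by character.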
Re: Schema Change: Int - String (i am the original poster, new email address)
Right, a search for 442 would not match 1442.

-- Jack Krupansky

-----Original Message-----
From: z z
Sent: Friday, June 07, 2013 2:18 AM
To: solr-user@lucene.apache.org
Subject: Re: Schema Change: Int - String (i am the original poster, new email address)
Re: Schema Change: Int - String (i am the original poster, new email address)
> 3. Too hard to say from the way you have described it. Show us some sample input.

Jack,

Here you go.

*Row X*
column1: data here
column2: more data here
...
user_id: 2002

*Row Y*
column1: data here
column2: more data here
...
user_id: 45

*Row Z*
column1: data here
column2: more data here
...
user_id: 45664

So what I plan on doing before inserting into mysql, which is where solr pulls the data from, is shrinking similar datasets into one row:

*Single Row XYZ*
column1: data here
column2: more data here
...
user_id: 2002 45 45664

Then I would like to have solr parse the user_id as a string. I just want to be sure that there won't be any fuzzy searching happening against the user_id. That is, 566 shouldn't be a valid value for the user_id list above. It has to return exact results based on user ids.

Also I am wondering if this will affect performance at all, but I am thinking not, because solr is very fast in general.

Regards,
Nate
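The row-shrinking step described in this message could look roughly like the following. It is only a sketch under stated assumptions: the rows are held as Python dicts, `column1` and `column2` stand in for the columns elided by "...", and `collapse` is a hypothetical helper, not part of any existing pipeline.

```python
from collections import defaultdict

# Sample rows X, Y and Z from the message; column names are placeholders.
rows = [
    {"column1": "data here", "column2": "more data here", "user_id": 2002},
    {"column1": "data here", "column2": "more data here", "user_id": 45},
    {"column1": "data here", "column2": "more data here", "user_id": 45664},
]


def collapse(rows, key_columns=("column1", "column2")):
    """Merge rows that agree on key_columns, joining user ids with spaces."""
    grouped = defaultdict(list)
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        grouped[key].append(str(row["user_id"]))
    return [
        {**dict(zip(key_columns, key)), "user_id": " ".join(ids)}
        for key, ids in grouped.items()
    ]


merged = collapse(rows)
print(merged[0]["user_id"])  # 2002 45 45664
```

Splitting the merged value on whitespace gives back 2002, 45 and 45664 as separate ids; 566 is not among them even though it is a substring of 45664.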
Re: Schema Change: Int - String (i am the original poster, new email address)
I want to query against one user_id in the string, e.g.

user_id:2002+AND+created:[${from}+TO+${until}]+data:more

So all of the records with a 2002 in user_id need to be returned, and only those records. If this can only be guaranteed by having user_id be an integer, then that is fine, but I would like to reduce the growth of our table.

> *Single Row XYZ*
> column1: data here
> column2: more data here
> ...
> user_id: 2002 45 45664
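In the query above, each + is a URL-encoded space. A small sketch of building such a query string with Python's standard library follows. The helper `build_query` and the sample timestamps are hypothetical. The explicit AND before the data clause is also an assumption: the original query leaves it out, so that clause falls back on whatever default operator is configured.

```python
from urllib.parse import parse_qs, urlencode


def build_query(user_id, start, end, data_term):
    """Return a URL-encoded q parameter for the three-clause query.

    Assumes data_term is a single plain word that needs no query escaping.
    """
    q = (
        f"user_id:{int(user_id)} "
        f"AND created:[{int(start)} TO {int(end)}] "
        f"AND data:{data_term}"
    )
    # urlencode turns spaces into '+' and percent-encodes ':' '[' and ']'.
    return urlencode({"q": q})


encoded = build_query(2002, 1370500000, 1370600000, "more")
print(encoded)
print(parse_qs(encoded)["q"][0])  # decoded form, for checking by eye
```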
Re: Schema Change: Int - String (i am the original poster, new email address)
Okay, now, how about a few queries that you want to use? Do you want to query by parts of the user ID, or only by the whole (exact) value?

If the user ID will be a string, fine, but having spaces makes it a little more painful to enter in a query - maybe use dashes.

-- Jack Krupansky

-----Original Message-----
From: z z
Sent: Thursday, June 06, 2013 11:31 PM
To: solr-user@lucene.apache.org
Subject: Re: Schema Change: Int - String (i am the original poster, new email address)
Re: Schema Change: Int - String (i am the original poster, new email address)
e.g.

user_id:2002+AND+created:[${from}+TO+${until}]+data:more

Expected results: return row XYZ, but ignore this row:

column1: data here
column2: more data here
...
user_id: 45 15001 45664

> *Single Row XYZ*
> column1: data here
> column2: more data here
> ...
> user_id: 2002 45 45664
Re: Schema Change: Int - String (i am the original poster, new email address)
In that case, you will need to keep two copies of the user ID: one which is a single, complete string, and one which is a tokenized text/TextField field so that you can do a keyword search against it. Use the string/StrField as the main copy and then use a copyField directive in the schema to copy from the main copy to the other copy.

So, maybe user_id is the full unique key - you would have to specify the full exact key to query against it, or use wildcards for partial matches - and user or user_id_str would be the tokenized text version that would allow a simple search by partial value, such as 2002.

Even so, I'm still not convinced that you have given us your complete requirements. Is the user_id in fact the unique key for the documents?

-- Jack Krupansky

-----Original Message-----
From: z z
Sent: Thursday, June 06, 2013 11:48 PM
To: solr-user@lucene.apache.org
Subject: Re: Schema Change: Int - String (i am the original poster, new email address)
Re: Schema Change: Int - String (i am the original poster, new email address)
The unique key is an auto-incremented int in the db. Sorry for having given the impression that user_id is the unique key per document. This is a table of events that are happening as users interact with our system. It just so happens that we were inserting individual records for each user before we even began to think about using something like Solr. Now, however, it seems to me that we should be able to ask questions like "give me all records for user 2002 that have this string value 'more' in data2, across this time stamp range [ ]".

Several rows inserted into the db simultaneously are exactly the same aside from the user_ids. I just want to know beforehand if I can still maintain exact matches for a user if the user_id becomes a string of concatenated user id values.

From what you are saying, it sounds like the user_id_str is really all I need. It is tokenized and allows for partial searches. I just want to make sure that 2002 15000 45, when tokenized, doesn't allow 20 to partially match the token 2002.

On Fri, Jun 7, 2013 at 12:57 PM, Jack Krupansky j...@basetechnology.com wrote:
> Is the user_id in fact the unique key for the documents?
Re: Schema Change: Int - String (i am the original poster, new email address)
To be clear, one normally doesn't do queries on portions of an ID - usually it is one integrated string. Further, strings are definitely NOT tokenized in Solr.

Your story keeps changing, which is why I have to keep hedging my answers.

At least with your latest story, your user_id should be a text/TextField so that it will be tokenized. A query for 2002 will match on complete tokens, not parts of tokens.

If you want to match exactly on the full user_id, use a quoted phrase for the full user_id.

But... I still have to hedge, because you refer to "a string of concatenated user id values". You seem to have two distinct definitions for "user id".

So, until you disclose all of your requirements and your data model, including a clarification about "user id" vs. "a string of concatenated user id values", I can't answer your question definitively, other than "Maybe, depending on what you really mean by user id."

-- Jack Krupansky

-----Original Message-----
From: z z
Sent: Friday, June 07, 2013 12:11 AM
To: solr-user@lucene.apache.org
Subject: Re: Schema Change: Int - String (i am the original poster, new email address)
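The difference between an untokenized string field and a tokenized text field can be pictured with a small model. This is a sketch in plain Python that only mimics which terms each kind of field would hold for the value 2002 15000 45; the function names are invented for illustration and nothing here is Solr API.

```python
def index_as_string(value):
    """String-field-like: the whole value is kept as one indexed term."""
    return {value}


def index_as_text(value):
    """Text field with a whitespace tokenizer: one term per token."""
    return set(value.split())


value = "2002 15000 45"
string_terms = index_as_string(value)
text_terms = index_as_text(value)

print("2002" in string_terms)  # False: only the full value is a term
print("2002" in text_terms)    # True: 2002 is a complete token
print("20" in text_terms)      # False: 20 is only part of a token
print(value in string_terms)   # True: exact match on the full value
```

In this model the text field answers the "one id out of many" question, while the string field only answers "is the whole value exactly this".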
Re: Schema Change: Int - String (i am the original poster, new email address)
My language might be a bit off (I am saying string when I probably mean text in the context of solr), but I'm pretty sure that my story is unwavering ;)

`id` int(11) NOT NULL AUTO_INCREMENT
`created` int(10)
`data` varbinary(255)
`user_id` int(11)

So, imagine that we have 1000 entries come in where data above is exactly the same for all 1000 entries, but user_id is different (id and created being different is irrelevant). I am thinking that prior to inserting into mysql, I should be able to concatenate the user_ids together with whitespace and then insert them into something like:

`id` int(11) NOT NULL AUTO_INCREMENT
`created` int(10)
`data` varbinary(255)
`user_id` blob

Then on solr's end it will treat the user_id as Text and parse it (I want to say tokenize, but maybe my language is incorrect here?).

Then when I search

user_id:2002+AND+created:[${from}+TO+${until}]+data:more

I want to be sure that if I look for user_id 2002, I will get data that only has a value 2002 in the user_id column, and that a separate user with id 20 cannot accidentally pull data for user_id 2002 as a result of a fuzzy (my language ok?) match of 20 against (20)02.

Current schema definition:

<field name="user_id" type="int" indexed="true" stored="true"/>

New schema definition:

<field name="user_id" type="user_id_string" indexed="true" stored="true"/>
...
<fieldType name="user_id_string" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" maxTokenLength="120"/>
  </analyzer>
</fieldType>

I am obviously not a 1337 solr haxor :P

Why do this? We have a lot of data coming in and I want to compact it as best as I can.

Regards,
Nate

On Fri, Jun 7, 2013 at 1:23 PM, Jack Krupansky j...@basetechnology.com wrote:
> Your story keeps changing, which is why I have to keep hedging my answers.