Re: Schema Change: Int - String (i am the original poster, new email address)

2013-06-07 Thread z z
Maybe if I were to say that the column user_id will become user_ids
that would clarify things?

user_id:2002+AND+created:[${from}+TO+${until}]+data:more

becomes

user_ids:2002+AND+created:[${from}+TO+${until}]+data:more

where I want 2002 to be an exact positive match on one of the user_ids
embedded in the TEXT ... not string :)  If I am totally off or making no
sense, feedback is very welcome.  I am just seeing lots of similar data
going into my db and it feels like Solr should be able to handle this.

I just want to know if transforming the data like that will still allow
exact searches against a user_id.  My language from a Solr guru's point of
view is probably *very* poorly phrased ... "exact" and TEXT might not go
hand in hand.

Is the TEXT "20 1442 35" parsed as "20" "1442" "35" so that a search
against it for 1442 will yield exact results?  A search against 442
won't match, right?

1. 20 1442 35
2. 20 442 35
3. 20 1442

user_ids:1442 - yields #1 and #3 always?
user_ids:442 - yields only #2 always?

My lack of understanding about what Solr does when it indexes is shining
through :)
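
A minimal sketch of the behaviour being asked about, written in plain Python rather than against Solr itself. It assumes the field is split on whitespace only (as a whitespace tokenizer would do) and that a term query matches whole tokens; the names build_index and search are hypothetical helpers, not Solr APIs.

```python
# Toy inverted index over the three sample values above.
# Assumption: tokens are produced by splitting on whitespace only.
docs = {1: "20 1442 35", 2: "20 442 35", 3: "20 1442"}


def build_index(documents):
    """Map each whole token to the set of document numbers containing it."""
    index = {}
    for doc_id, text in documents.items():
        for token in text.split():
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, term):
    """Return the documents whose token list contains exactly `term`."""
    return sorted(index.get(term, set()))


index = build_index(docs)
print(search(index, "1442"))  # [1, 3]
print(search(index, "442"))   # [2]
print(search(index, "144"))   # [] because "144" is only part of a token
```

Under these assumptions a lookup only ever compares complete tokens, so 442 cannot match 1442.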






Re: Schema Change: Int - String (i am the original poster, new email address)

2013-06-07 Thread Jack Krupansky

Right, a search for 442 would not match 1442.

-- Jack Krupansky







Re: Schema Change: Int - String (i am the original poster, new email address)

2013-06-06 Thread z z
> 3. Too hard to say from the way you have described it. Show us some sample
> input.

Jack,

Here you go.

*Row X*
column1: data here
column2: more data here
...
user_id: 2002

*Row Y*
column1: data here
column2: more data here
...
user_id: 45

*Row Z*
column1: data here
column2: more data here
...
user_id: 45664

So what I plan on doing before inserting into MySQL, which is where Solr
pulls the data from, is shrinking similar datasets into one row:

*Single Row XYZ*
column1: data here
column2: more data here
...
user_id: 2002 45 45664

Then I would like to have Solr parse the user_id as a string.  I just want
to be sure that there won't be any fuzzy searching happening against the
user_id.  That is, 566 shouldn't be a valid value for the user_id list
above.  It has to return exact results based on user ids.  Also I am
wondering if this will affect performance at all, but I am thinking not
because Solr is very fast in general.
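
A rough sketch of the row-shrinking step described above, in Python. It assumes rows count as "similar" when column1 and column2 are identical, and it joins the user ids with single spaces; merge_rows and the dictionary layout are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Sample rows X, Y and Z from the message above.
rows = [
    {"column1": "data here", "column2": "more data here", "user_id": 2002},
    {"column1": "data here", "column2": "more data here", "user_id": 45},
    {"column1": "data here", "column2": "more data here", "user_id": 45664},
]


def merge_rows(records):
    """Collapse rows with identical data columns into one row whose
    user_id is a whitespace-separated list of the original ids."""
    grouped = defaultdict(list)
    for record in records:
        key = (record["column1"], record["column2"])
        grouped[key].append(str(record["user_id"]))
    return [
        {"column1": c1, "column2": c2, "user_id": " ".join(ids)}
        for (c1, c2), ids in grouped.items()
    ]


merged = merge_rows(rows)
print(merged[0]["user_id"])  # 2002 45 45664
```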

Regards,
Nate


Re: Schema Change: Int - String (i am the original poster, new email address)

2013-06-06 Thread z z
I want to query against one user_id in the string.

e.g.  user_id:2002+AND+created:[${from}+TO+${until}]+data:more

So all of the records with a 2002 in user_id need to be returned and only
those records.  If this can only be guaranteed by having user_id be an
integer, then that is fine, but I would like to reduce the growth of our
table.
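
A minimal sketch of how a query like the one above could be assembled, assuming the + signs are URL-encoded spaces and that from and until are epoch-second timestamps. The two timestamp values are made up for illustration, and only Python's standard urllib.parse is used; no Solr client is involved.

```python
from urllib.parse import urlencode

# Hypothetical values standing in for ${from} and ${until}.
from_ts = 1370000000
until_ts = 1370600000
user_id = 2002

# The query as it would be written with literal spaces.
q = f"user_id:{user_id} AND created:[{from_ts} TO {until_ts}] data:more"

# Keep ':' and the brackets readable; spaces are encoded as '+'.
url_query = urlencode({"q": q}, safe=":[]")
print(url_query)
# q=user_id:2002+AND+created:[1370000000+TO+1370600000]+data:more
```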




Re: Schema Change: Int - String (i am the original poster, new email address)

2013-06-06 Thread Jack Krupansky
Okay, now, how about a few queries that you want to use? Do you want to 
query by parts of the user ID, or only by the whole (exact) value?


If the user ID will be a string, fine, but having spaces makes it a little 
more painful to enter in a query - maybe use dashes.


-- Jack Krupansky




Re: Schema Change: Int - String (i am the original poster, new email address)

2013-06-06 Thread z z
e.g.  user_id:2002+AND+created:[${from}+TO+${until}]+data:more

Expected results:  return row XYZ but ignore this row:

column1: data here
column2: more data here
...
user_id: 45 15001 45664







Re: Schema Change: Int - String (i am the original poster, new email address)

2013-06-06 Thread Jack Krupansky
In that case, you will need to keep two copies of the user ID: one which is
a single, complete string, and one which is a tokenized field
("text"/TextField) so that you can do a keyword search against it. Use the
"string"/StrField as the main copy and then use a <copyField> directive in
the schema to copy from the main copy to the other copy.

So, maybe "user_id" is the full unique key - you would have to specify the
full exact key to query against it, or use wildcards for partial matches -
and "user" or "user_id_str" would be the tokenized text version that would
allow a simple search by partial value, such as 2002.

Even so, I'm still not convinced that you have given us your complete
requirements. Is the user_id in fact the unique key for the documents?
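
To illustrate the difference between the two copies, here is a small Python sketch. It is not Solr code: it assumes a string field is indexed as one verbatim term and a text field as whitespace-separated tokens, and the helper names index_as_string and index_as_text are hypothetical.

```python
def index_as_string(value):
    """String-style field: the whole value is a single indexed term."""
    return {value}


def index_as_text(value):
    """Text-style field: one indexed term per whitespace-separated token."""
    return set(value.split())


value = "2002 45 45664"
string_terms = index_as_string(value)
text_terms = index_as_text(value)

print("2002" in string_terms)           # False: only the full value matches
print("2002 45 45664" in string_terms)  # True
print("2002" in text_terms)             # True: matches a whole token
print("20" in text_terms)               # False: "20" is only part of a token
```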


-- Jack Krupansky






Re: Schema Change: Int - String (i am the original poster, new email address)

2013-06-06 Thread z z
The unique key is an auto-incremented int in the db.  Sorry for having
given the impression that user_id is the unique key per document.  This is
a table of events that are happening as users interact with our system.
It just so happens that we were inserting individual records for each user
before we even began to think about using something like Solr.  Now,
however, it seems to me that we should be able to ask questions like "give
me all records for user 2002 that have this string value 'more' in data2,
across this time stamp range [  ]".  Several simultaneously inserted
rows into the db are exactly the same aside from the user_ids.  I just want
to know beforehand if I can still maintain exact matches for a user if the
user_id becomes a string of concatenated user id values.

From what you are saying it sounds like the user_id_str is really all I
need.  It is tokenized and allows for partial searches.  I just want to
make sure that "2002 15000 45" when tokenized doesn't allow "20" to
partially match the token "2002".





Re: Schema Change: Int - String (i am the original poster, new email address)

2013-06-06 Thread Jack Krupansky
To be clear, one normally doesn't do queries on portions of an ID -
usually it is one integrated string.

Further, strings are definitely NOT tokenized in Solr.

Your story keeps changing, which is why I have to keep hedging my answers.

At least with your latest story, your user_id should be a "text"/TextField
so that it will be tokenized. A query for 2002 will match on complete
tokens, not parts of tokens. If you want to match exactly on the full
user_id, use a quoted phrase for the full user_id.

But... I still have to hedge, because you refer to "a string of concatenated
user id values". You seem to have two distinct definitions for "user id".

So, until you disclose all of your requirements and your data model,
including a clarification about "user id" vs. "a string of concatenated user
id values", I can't answer your question definitively, other than "Maybe,
depending on what you really mean by user id."


-- Jack Krupansky







Re: Schema Change: Int - String (i am the original poster, new email address)

2013-06-06 Thread z z
My language might be a bit off (I am saying "string" when I probably mean
"text" in the context of Solr), but I'm pretty sure that my story is
unwavering ;)

`id` int(11) NOT NULL AUTO_INCREMENT
`created` int(10)
`data` varbinary(255)
`user_id` int(11)

So, imagine that we have 1000 entries come in where "data" above is exactly
the same for all 1000 entries, but user_id is different (id and created
being different is irrelevant).  I am thinking that prior to inserting into
MySQL, I should be able to concatenate the user_ids together with
whitespace and then insert them into something like:

`id` int(11) NOT NULL AUTO_INCREMENT
`created` int(10)
`data` varbinary(255)
`user_id` blob

Then on Solr's end it will treat the user_id as text and parse it (I want
to say "tokenize", but maybe my language is incorrect here?).

Then when I search

user_id:2002+AND+created:[${from}+TO+${until}]+data:more

I want to be sure that if I look for user_id 2002, I will get data that
only has a value 2002 in the user_id column and that a separate user with
id 20 cannot accidentally pull data for user_id 2002 as a result of a
fuzzy (my language ok?) match of 20 against (20)02.

Current schema definition:

  <field name="user_id" type="int" indexed="true" stored="true"/>

New schema definition:

  <field name="user_id" type="user_id_string" indexed="true" stored="true"/>
  ...
  <fieldType name="user_id_string" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"
                 maxTokenLength="120"/>
    </analyzer>
  </fieldType>

I am obviously not a 1337 solr haxor :P

Why do this?  We have a lot of data coming in and I want to compact it as
best as I can.

Regards,
Nate




