RE: Were changes made to facetting on multivalued fields recently?

2014-04-11 Thread Jean-Sebastien Vachon
Thanks to both of you. I finally found the issue and you were right (again) ;)

The problem was not coming from the full indexing code containing the SQL 
replace statement but from another process whose job is to keep our index 
up to date. This process had no idea that commas were to be replaced by spaces 
for some fields (and it should not have to know about this either).

I changed the Tokenizer used for the field to the following and everything is 
fine now.
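
A minimal Python sketch of the idea, assuming the new tokenizer splits on commas as well as whitespace (the pattern and the function name are illustrative assumptions, not the actual configuration). With such a tokenizer it no longer matters whether an upstream process replaced the commas:

```python
import re

# Hypothetical stand-in for a tokenizer that treats commas and whitespace
# as separators. This is an assumption, not the real field definition.
_SEPARATORS = re.compile(r"[,\s]+")


def tokenize(value):
    """Split a raw field value into terms, dropping empty pieces."""
    return [term for term in _SEPARATORS.split(value) if term]


print(tokenize("4,1"))    # ['4', '1']
print(tokenize("4 5 1"))  # ['4', '5', '1']
```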


Thanks for your help

> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: April-10-14 1:54 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Were changes made to facetting on multivalued fields recently?
> 
> bq: The SQL query contains a Replace statement that does this
> 
> Well, I suspect that's where the issue is. The facet values being reported
> include:
> 134826
> which indicates that the incoming text to Solr still has the commas.
> Solr is seeing the commas and all.
> 
> You can cure this by using PatternReplaceCharFilterFactory and doing the
> substitution at index time if you want to.
> 
> That doesn't clarify why the behavior has changed though, but my
> supposition is that it has nothing to do with Solr, and something about your
> SQL statement is different.
> 
> Best,
> Erick
> 
> On Thu, Apr 10, 2014 at 9:33 AM, Jean-Sebastien Vachon wrote:
> > The SQL query contains a Replace statement that does this
> >
> >> -----Original Message-----
> >> From: Shawn Heisey [mailto:s...@elyograg.org]
> >> Sent: April-10-14 11:30 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Were changes made to facetting on multivalued fields
> recently?
> >>
> >> On 4/10/2014 9:14 AM, Jean-Sebastien Vachon wrote:
> >> > Here are the field definitions for both our old and new index... as
> >> > you can
> >> see that are identical. We've been using this chain and field type
> >> starting with Solr 1.4 and never had any problem. As for the
> >> documents, both indexes are using the same data source. They could be
> >> slightly out of sync from time to time but we tend to index them on a
> >> daily basis. Both indexes are also using the same code (indexing through
> SolrJ) to index their content.
> >> >
> >> > The source is a column in MySql that contains entries such as "4,1"
> >> > that get stored in a Multivalued fields after replacing commas by
> >> > spaces
> >> >
> >> > OLD (4.6.1):
> >> > >> positionIncrementGap="100">
> >> >   
> >> > 
> >> >   
> >> > 
> >> >
> >> >  >> > stored="true" required="false" multiValued="true" />
> >>
> >> Just so you know, there's nothing here that would require the field
> >> to be multivalued.  WhitespaceTokenizerFactory does not create
> >> multiple field values, it creates multiple terms.  If you are
> >> actually inserting multiple values for the field in SolrJ, then you would
> need a multivalued field.
> >>
> >> What is replacing the commas with spaces?  I don't see anything here
> >> that would do that.  It sounds like that part of your indexing is not
> working.
> >>
> >> Thanks,
> >> Shawn


Re: Were changes made to facetting on multivalued fields recently?

2014-04-10 Thread Erick Erickson
bq: The SQL query contains a Replace statement that does this

Well, I suspect that's where the issue is. The facet values being
reported include:
134826
which indicates that the incoming text to Solr still has the commas.
Solr is seeing the commas and all.

You can cure this by using PatternReplaceCharFilterFactory and doing
the substitution at index time if you want to.
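
To illustrate the order of operations, here is a small Python model of that suggestion: a character filter rewrites commas to spaces before a whitespace tokenizer runs. The function names are hypothetical stand-ins, not Solr APIs. In a real schema the equivalent would be a PatternReplaceCharFilterFactory placed ahead of the tokenizer in the analyzer.

```python
import re


def char_filter(text, pattern=",", replacement=" "):
    """Rewrite the raw character stream before any tokenization happens."""
    return re.sub(pattern, replacement, text)


def whitespace_tokenize(text):
    """Split on runs of whitespace only, so commas stay inside a term."""
    return text.split()


def analyze(text):
    """Index-time chain in this sketch: char filter first, then tokenizer."""
    return whitespace_tokenize(char_filter(text))


print(whitespace_tokenize("4,1"))  # ['4,1']
print(analyze("4,1"))              # ['4', '1']
```

Because the substitution happens inside the analysis chain, every process that feeds the index gets the same treatment, whether or not it replaces commas itself.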

That doesn't clarify why the behavior has changed though, but my
supposition is that it has nothing to do with Solr, and something
about your SQL statement is different.

Best,
Erick

On Thu, Apr 10, 2014 at 9:33 AM, Jean-Sebastien Vachon wrote:
> The SQL query contains a Replace statement that does this
>
>> -----Original Message-----
>> From: Shawn Heisey [mailto:s...@elyograg.org]
>> Sent: April-10-14 11:30 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Were changes made to facetting on multivalued fields recently?
>>
>> On 4/10/2014 9:14 AM, Jean-Sebastien Vachon wrote:
>> > Here are the field definitions for both our old and new index... as you can
>> see that are identical. We've been using this chain and field type starting 
>> with
>> Solr 1.4 and never had any problem. As for the documents, both indexes are
>> using the same data source. They could be slightly out of sync from time to
>> time but we tend to index them on a daily basis. Both indexes are also using
>> the same code (indexing through SolrJ) to index their content.
>> >
>> > The source is a column in MySql that contains entries such as "4,1"
>> > that get stored in a Multivalued fields after replacing commas by
>> > spaces
>> >
>> > OLD (4.6.1):
>> >> positionIncrementGap="100">
>> >   
>> > 
>> >   
>> > 
>> >
>> > > > stored="true" required="false" multiValued="true" />
>>
>> Just so you know, there's nothing here that would require the field to be
>> multivalued.  WhitespaceTokenizerFactory does not create multiple field
>> values, it creates multiple terms.  If you are actually inserting multiple 
>> values
>> for the field in SolrJ, then you would need a multivalued field.
>>
>> What is replacing the commas with spaces?  I don't see anything here that
>> would do that.  It sounds like that part of your indexing is not working.
>>
>> Thanks,
>> Shawn


RE: Were changes made to facetting on multivalued fields recently?

2014-04-10 Thread Jean-Sebastien Vachon
The SQL query contains a Replace statement that does this

> -----Original Message-----
> From: Shawn Heisey [mailto:s...@elyograg.org]
> Sent: April-10-14 11:30 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Were changes made to facetting on multivalued fields recently?
> 
> On 4/10/2014 9:14 AM, Jean-Sebastien Vachon wrote:
> > Here are the field definitions for both our old and new index... as you can
> see that are identical. We've been using this chain and field type starting 
> with
> Solr 1.4 and never had any problem. As for the documents, both indexes are
> using the same data source. They could be slightly out of sync from time to
> time but we tend to index them on a daily basis. Both indexes are also using
> the same code (indexing through SolrJ) to index their content.
> >
> > The source is a column in MySql that contains entries such as "4,1"
> > that get stored in a Multivalued fields after replacing commas by
> > spaces
> >
> > OLD (4.6.1):
> > positionIncrementGap="100">
> >   
> > 
> >   
> > 
> >
> >  > stored="true" required="false" multiValued="true" />
> 
> Just so you know, there's nothing here that would require the field to be
> multivalued.  WhitespaceTokenizerFactory does not create multiple field
> values, it creates multiple terms.  If you are actually inserting multiple 
> values
> for the field in SolrJ, then you would need a multivalued field.
> 
> What is replacing the commas with spaces?  I don't see anything here that
> would do that.  It sounds like that part of your indexing is not working.
> 
> Thanks,
> Shawn


Re: Were changes made to facetting on multivalued fields recently?

2014-04-10 Thread Shawn Heisey
On 4/10/2014 9:14 AM, Jean-Sebastien Vachon wrote:
> Here are the field definitions for both our old and new index... as you can 
> see that are identical. We've been using this chain and field type starting 
> with Solr 1.4 and never had any problem. As for the documents, both indexes 
> are using the same data source. They could be slightly out of sync from time 
> to time but we tend to index them on a daily basis. Both indexes are also 
> using the same code (indexing through SolrJ) to index their content.
> 
> The source is a column in MySql that contains entries such as "4,1" that get 
> stored in a Multivalued fields after replacing commas by spaces
> 
> OLD (4.6.1):
> positionIncrementGap="100">
>   
> 
>   
> 
> 
>  required="false" multiValued="true" />

Just so you know, there's nothing here that would require the field to
be multivalued.  WhitespaceTokenizerFactory does not create multiple
field values, it creates multiple terms.  If you are actually inserting
multiple values for the field in SolrJ, then you would need a
multivalued field.
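
The difference between values and terms can be sketched in plain Python. This is only a model under stated assumptions: a field is represented as a list of values, and `indexed_terms` is a hypothetical helper that mimics whitespace tokenization.

```python
def indexed_terms(field_values):
    """Flatten the whitespace-separated terms of every value in a field."""
    terms = []
    for value in field_values:
        terms.extend(value.split())
    return terms


# One value that the tokenizer breaks into three terms.
single_value = ["4 5 1"]
# Three separate values, which is what needs multiValued="true".
multi_value = ["4", "5", "1"]

print(len(single_value), indexed_terms(single_value))  # 1 ['4', '5', '1']
print(len(multi_value), indexed_terms(multi_value))    # 3 ['4', '5', '1']
```

Both shapes yield the same terms in this sketch. Only the number of values differs, and that is what the multiValued attribute is about.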

What is replacing the commas with spaces?  I don't see anything here
that would do that.  It sounds like that part of your indexing is not
working.

Thanks,
Shawn



RE: Were changes made to facetting on multivalued fields recently?

2014-04-10 Thread Jean-Sebastien Vachon
Here are the field definitions for both our old and new index. As you can see, 
they are identical. We've been using this chain and field type since 
Solr 1.4 and never had any problem. As for the documents, both indexes 
use the same data source. They could be slightly out of sync from time to 
time, but we tend to index them on a daily basis. Both indexes also use 
the same code (indexing through SolrJ) to index their content.

The source is a column in MySQL that contains entries such as "4,1", which get 
stored in a multivalued field after commas are replaced with spaces.

OLD (4.6.1):
   
  

  




NEW (4.7.1):


  

  
 



It looks like the /analysis/field handler is not active in our current setup. I 
will look into this and perform additional checks later, as we are currently 
doing a full reindex of our DB.

Thanks for your time

> -----Original Message-----
> From: Shawn Heisey [mailto:s...@elyograg.org]
> Sent: April-09-14 5:23 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Were changes made to facetting on multivalued fields recently?
> 
> On 4/9/2014 2:15 PM, Erick Erickson wrote:
> > Right, but the response in the doc when you make a request is almost,
> > but not quite totally, unrelated to how facet values are tallied. It's
> > all about what tokens are actually in your index, which you can see in
> > the "schema browser"...
> 
> Supplement to what Erick has told you:
> 
> SOLR-5512 seems to be related to facets using docValues. The commit for
> that issue looks like it only touches on that specifically. If you do not have
> (and never have had) docValues on this field, then SOLR-5512 should not
> apply.
> 
> I am reasonably sure that for facets on fields with docValues, your facets
> would reflect the *stored* information, not the indexed information.
> 
> Finally, I don't think that docValues work on fieldtypes whose class is
> solr.TextField, which is the only class that can have an analysis chain that
> would turn "4 5 1" into three separate tokens.  The response that you shared
> where the value is "4 5 1" looks like there is only one value in the field -- 
> so
> for that document, it is effectively the same as one that is single-valued.
> 
> Bottom line: It looks like either your analysis chain is working differently 
> in
> the newer version, or you have documents in your newer index that are not
> in the older one.  Can you share the field and fieldType definitions from both
> versions?  Did your luceneMatchVersion change with the upgrade?  If you are
> using DIH to populate your index, can you also share your DIH config?
> 
> Thanks,
> Shawn


Re: Were changes made to facetting on multivalued fields recently?

2014-04-09 Thread Shawn Heisey

On 4/9/2014 2:15 PM, Erick Erickson wrote:

Right, but the response in the doc when you make a request is almost,
but not quite totally, unrelated to how facet values are tallied. It's
all about what tokens are actually in your index, which you can see in
the "schema browser"...


Supplement to what Erick has told you:

SOLR-5512 seems to be related to facets using docValues. The commit for 
that issue looks like it only touches on that specifically. If you do not 
have (and never have had) docValues on this field, then SOLR-5512 should 
not apply.


I am reasonably sure that for facets on fields with docValues, your 
facets would reflect the *stored* information, not the indexed information.


Finally, I don't think that docValues work on fieldtypes whose class is 
solr.TextField, which is the only class that can have an analysis chain 
that would turn "4 5 1" into three separate tokens.  The response that 
you shared where the value is "4 5 1" looks like there is only one value 
in the field -- so for that document, it is effectively the same as one 
that is single-valued.


Bottom line: It looks like either your analysis chain is working 
differently in the newer version, or you have documents in your newer 
index that are not in the older one.  Can you share the field and 
fieldType definitions from both versions?  Did your luceneMatchVersion 
change with the upgrade?  If you are using DIH to populate your index, 
can you also share your DIH config?


Thanks,
Shawn



Re: Were changes made to facetting on multivalued fields recently?

2014-04-09 Thread Erick Erickson
Right, but the response in the doc when you make a request is almost,
but not quite totally, unrelated to how facet values are tallied. It's
all about what tokens are actually in your index, which you can see in
the "schema browser"...
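
A rough Python model of why the stored value and the facet tally can disagree. It assumes facet counts are per-document counts over indexed terms. The sample documents and the `facet_counts` helper are made up for illustration.

```python
from collections import Counter


def whitespace_only(text):
    """Analyzer that splits on whitespace and leaves commas untouched."""
    return text.split()


def facet_counts(stored_values, analyze):
    """Count documents per indexed term, not per stored string."""
    counts = Counter()
    for stored in stored_values:
        # A document contributes at most once to each distinct term.
        counts.update(set(analyze(stored)))
    return counts


# Hypothetical stored values; the second one still contains a comma.
docs = ["4 5 1", "4,1", "5"]
counts = facet_counts(docs, whitespace_only)
print(sorted(counts.items()))
# [('1', 1), ('4', 1), ('4,1', 1), ('5', 2)]
```

The un-split "4,1" shows up as its own facet value, which matches the kind of extra entries reported for the 4.7.1 index.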

Let me know what the results are
Erick

On Wed, Apr 9, 2014 at 11:40 AM, Jean-Sebastien Vachon wrote:
> Thanks Erick I will check this as soon as I can.
>
> In the meantime, here is a sample query and how it looks in our index. It 
> looks good to me (at least that what is showing up as well in our other and 
> older indexes)
>
> http://10.0.5.227:8201/solr/Current/select?q=*:*&fl=ad_job_type_id&fq=ad_job_type_id:[*%20TO%20*]&facet=on&facet.field=ad_job_type_id&rows=1
>
> 
>  
>
>4 5 1
> 
>   
> 
>
>> -----Original Message-----
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: April-09-14 2:21 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Were changes made to facetting on multivalued fields recently?
>>
>> That is...um...very strange. It looks to me like you have somehow indexed a
>> bunch of new values. I'm guessing here, but it's suspicious that you have a
>> value "4,1" should that have been indexed as "4" and "1" as separate tokens?
>>
>> So here's what I'd do
>> 1> take a look at the solr/admin/schema browser output for that field
>> in the two versions. I suspect you'll see 7 values in 4.6 and a bazillion in 
>> 4.7.1.
>> 2> if <1> is true, take a look at the admin/analysis page for the
>> field in question and see some sample index-time inputs, especially for the
>> theoretical "4,1" entries. I suspect that 4.6 will break these up into two
>> tokens and 4.7.1 won't.
>> 3> if <2> is true, take a very careful look at the index-time analysis
>> chains in the two versions, I bet they're different and that accounts for 
>> your
>> observations.
>> 4> try 1-3, discover I'm totally off base and paste the schema.xml
>> definitions for the field in question in both 4.6 and 4.7.1 to this thread 
>> and
>> we can take a look.
>>
>> This should not have changed between 4.6 and 4.7.1, at least not
>> intentionally.
>>
>> Best,
>> Erick
>>
>> On Wed, Apr 9, 2014 at 11:04 AM, Jean-Sebastien Vachon wrote:
>> > Hi All,
>> >
>> > We just discovered that the response from Solr (4.7.1) when faceting on
>> one of our multi-valued fields has changed considerably.
>> >
>> > In the past (4.6.1 and prior versions as well) we used to have
>> > something like this: (there are 7 possible values for this attribute)
>> >
>> > 
>> > 
>> > 
>> > 
>> > 11454652
>> > 11387070
>> > 2095603
>> > 809992
>> > 567244
>> > 139389
>> > 4120
>> > 
>> > 
>> > 
>> > 
>> >
>> > And now with 4.7.1 we are getting this:
>> > 
>> > 
>> > 
>> > 
>> > 10954552
>> > 10884418
>> > 2000530
>> > 784491
>> > 535935
>> > 134826
>> > 11770
>> > ... there are too many values to list them all ...
>> >
>> > I checked the Change log for 4.7.1 and only saw an optimization made
>> > for https://issues.apache.org/jira/browse/SOLR-5512
>> >
>> > Is there any new configuration directive that we should be aware of?
>> >
>> > Thanks


RE: Were changes made to facetting on multivalued fields recently?

2014-04-09 Thread Jean-Sebastien Vachon
Thanks Erick I will check this as soon as I can.

In the meantime, here is a sample query and how the result looks in our index. It looks 
good to me (at least, that is what shows up in our other, older 
indexes as well).

http://10.0.5.227:8201/solr/Current/select?q=*:*&fl=ad_job_type_id&fq=ad_job_type_id:[*%20TO%20*]&facet=on&facet.field=ad_job_type_id&rows=1
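
For reference, the same request can be assembled from its parameters with Python's standard library. The host, core and field name come from the URL above; the `facet_query_url` helper is only an illustrative name.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "http://10.0.5.227:8201/solr/Current/select"


def facet_query_url(field, rows=1):
    """Build a select URL that facets on `field` for docs that have a value."""
    params = {
        "q": "*:*",
        "fl": field,
        "fq": f"{field}:[* TO *]",  # only documents where the field exists
        "facet": "on",
        "facet.field": field,
        "rows": rows,
    }
    return f"{BASE_URL}?{urlencode(params)}"


url = facet_query_url("ad_job_type_id")
# Decode the query string again to check what Solr would receive.
print(parse_qs(urlsplit(url).query))
```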


 
   
   4 5 1

  


> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: April-09-14 2:21 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Were changes made to facetting on multivalued fields recently?
> 
> That is...um...very strange. It looks to me like you have somehow indexed a
> bunch of new values. I'm guessing here, but it's suspicious that you have a
> value "4,1" should that have been indexed as "4" and "1" as separate tokens?
> 
> So here's what I'd do
> 1> take a look at the solr/admin/schema browser output for that field
> in the two versions. I suspect you'll see 7 values in 4.6 and a bazillion in 
> 4.7.1.
> 2> if <1> is true, take a look at the admin/analysis page for the
> field in question and see some sample index-time inputs, especially for the
> theoretical "4,1" entries. I suspect that 4.6 will break these up into two
> tokens and 4.7.1 won't.
> 3> if <2> is true, take a very careful look at the index-time analysis
> chains in the two versions, I bet they're different and that accounts for your
> observations.
> 4> try 1-3, discover I'm totally off base and paste the schema.xml
> definitions for the field in question in both 4.6 and 4.7.1 to this thread and
> we can take a look.
> 
> This should not have changed between 4.6 and 4.7.1, at least not
> intentionally.
> 
> Best,
> Erick
> 
> On Wed, Apr 9, 2014 at 11:04 AM, Jean-Sebastien Vachon wrote:
> > Hi All,
> >
> > We just discovered that the response from Solr (4.7.1) when faceting on
> one of our multi-valued fields has changed considerably.
> >
> > In the past (4.6.1 and prior versions as well) we used to have
> > something like this: (there are 7 possible values for this attribute)
> >
> > 
> > 
> > 
> > 
> > 11454652
> > 11387070
> > 2095603
> > 809992
> > 567244
> > 139389
> > 4120
> > 
> > 
> > 
> > 
> >
> > And now with 4.7.1 we are getting this:
> > 
> > 
> > 
> > 
> > 10954552
> > 10884418
> > 2000530
> > 784491
> > 535935
> > 134826
> > 11770
> > ... there are too many values to list them all ...
> >
> > I checked the Change log for 4.7.1 and only saw an optimization made
> > for https://issues.apache.org/jira/browse/SOLR-5512
> >
> > Is there any new configuration directive that we should be aware of?
> >
> > Thanks


Re: Were changes made to facetting on multivalued fields recently?

2014-04-09 Thread Erick Erickson
That is...um...very strange. It looks to me like you have somehow
indexed a bunch of new values. I'm guessing here, but it's suspicious
that you have a value "4,1". Should that have been indexed as "4" and
"1" as separate tokens?

So here's what I'd do:
1> take a look at the solr/admin/schema browser output for that field
in the two versions. I suspect you'll see 7 values in 4.6 and a
bazillion in 4.7.1.
2> if <1> is true, take a look at the admin/analysis page for the
field in question and see some sample index-time inputs, especially
for the theoretical "4,1" entries. I suspect that 4.6 will break these
up into two tokens and 4.7.1 won't.
3> if <2> is true, take a very careful look at the index-time analysis
chains in the two versions, I bet they're different and that accounts
for your observations.
4> try 1-3, discover I'm totally off base and paste the schema.xml
definitions for the field in question in both 4.6 and 4.7.1 to this
thread and we can take a look.
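
Step 1 above can be sketched as a simple comparison once the term lists from the two schema browsers have been copied out. The term lists below are invented placeholders and the helper names are hypothetical. Only the comparison logic is the point.

```python
def unexpected_terms(old_terms, new_terms):
    """Terms present in the new index but absent from the old one."""
    return sorted(set(new_terms) - set(old_terms))


def looks_unsplit(term):
    """A term that still contains a comma was never broken apart."""
    return "," in term


# Placeholder data: seven expected ids, plus two un-split entries.
old_index_terms = ["1", "2", "3", "4", "5", "6", "7"]
new_index_terms = old_index_terms + ["4,1", "2,5,1"]

extra = unexpected_terms(old_index_terms, new_index_terms)
print(extra)                                   # ['2,5,1', '4,1']
print([t for t in extra if looks_unsplit(t)])  # ['2,5,1', '4,1']
```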

This should not have changed between 4.6 and 4.7.1, at least not intentionally.

Best,
Erick

On Wed, Apr 9, 2014 at 11:04 AM, Jean-Sebastien Vachon wrote:
> Hi All,
>
> We just discovered that the response from Solr (4.7.1) when faceting on one 
> of our multi-valued fields has changed considerably.
>
> In the past (4.6.1 and prior versions as well) we used to have something like 
> this: (there are 7 possible values for this attribute)
>
> 
> 
> 
> 
> 11454652
> 11387070
> 2095603
> 809992
> 567244
> 139389
> 4120
> 
> 
> 
> 
>
> And now with 4.7.1 we are getting this:
> 
> 
> 
> 
> 10954552
> 10884418
> 2000530
> 784491
> 535935
> 134826
> 11770
> ... there are too many values to list them all ...
>
> I checked the Change log for 4.7.1 and only saw an optimization made for 
> https://issues.apache.org/jira/browse/SOLR-5512
>
> Is there any new configuration directive that we should be aware of?
>
> Thanks