Re: deduplication of suggester results is not enough

2020-03-26 Thread Michal Hlavac
Hi Roland,

I wrote an AnalyzingInfixSuggester that deduplicates data on several levels at 
index time.
I will publish it on GitHub in a few days and will write to this thread when it's done.

m.

On Thursday, 26 March 2020 at 16:01:57 CET, Szűcs Roland wrote:
> Hi All,
> 
> I have been following the suggester-related discussions for quite a while.
> Everybody agrees that it is not the expected behaviour for a
> Suggester, where the terms are the entities and not the documents, to return
> the same string representation several times.
> 
> One suggestion was to do the deduplication on the client side of Solr. That is
> very easy in most client solutions, as any set-based data structure solves
> this.
> 
> *But deduplication does not solve one important problem: suggest.count*.
> 
> If the suggester has 15 matches, suggest.count=10 and the first
> 9 matches are identical, I will get back only 2 after deduplication and
> the remaining 5 unique terms will never be shown.
> 
> What is the solution for this?
> 
> Cheers,
> Roland
> 


Re: Deduplication

2015-05-20 Thread Shalin Shekhar Mangar
On Wed, May 20, 2015 at 12:59 PM, Bram Van Dam  wrote:

> >> Write a custom update processor and include it in your update chain.
> >> You will then have the ability to do anything you want with the entire
> >> input document before it hits the code to actually do the indexing.
>
> This sounded like the perfect option ... until I read Jack's comment:
>
> >
> > My understanding was that the distributed update processor is near the
> end
> > of the chain, so that running of user update processors occurs before the
> > distribution step, but is that distribution to the leader, or
> distribution
> > from leader to replicas for a shard?
>
> That would pose some potential problems.
>
> Would a custom update processor make the solution "cloud-safe"?
>

Starting with Solr 5.1, you have the ability to specify update processors
on the fly in a request, and you can even control whether they are to be
executed before any distribution happens or just before the document is
actually indexed on the replica.

e.g. you can specify processor=xyz,MyCustomUpdateProc in the request to
have processor xyz run first and then MyCustomUpdateProc and then the
default update processor chain (which will also distribute the doc to the
leader or from the leader to a replica). This also means that such
processors will not be executed on the replicas at all.

You can also specify post-processor=xyz,MyCustomUpdateProc to have xyz and
MyCustomUpdateProc run on each replica (including the leader) right
before the doc is indexed (i.e. just before RunUpdateProcessor).

Unfortunately, due to an oversight, this feature hasn't been documented
well which is something I'll fix. See
https://issues.apache.org/jira/browse/SOLR-6892 for more details.
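
As a concrete sketch (collection name, processor name and document below are
placeholders, not taken from this thread), such a request could look like:

curl "http://localhost:8983/solr/mycollection/update?processor=MyCustomUpdateProc&commit=true" \
  -H "Content-Type: application/json" \
  -d '[{"id":"1","title":"hello"}]'

Moving MyCustomUpdateProc to the post-processor parameter instead would run it
on each replica (leader included) just before RunUpdateProcessor.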


>
> Thx,
>
>  - Bram
>
>


-- 
Regards,
Shalin Shekhar Mangar.


Re: Deduplication

2015-05-20 Thread Alessandro Benedetti
What Solr de-duplication offers you is to calculate, for each input
document, a hash (based on a set of fields).
You can then select one of two options:
 - index everything; documents with the same signature will be considered equal;
 - avoid the overwriting of duplicates.

How the similarity hash is calculated is something you can play with and
customise if needed.
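
For reference, a minimal sketch of the corresponding solrconfig.xml chain
(chain name, field list and signature field below are placeholders, not taken
from this thread):

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <str name="fields">name,features,cat</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

With overwriteDupes=true a newly indexed document replaces previously indexed
documents that carry the same signature; with false the signature is merely
stored in signatureField, which you can then facet or group on.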

Having clarified that, do you think it can fit in some way, or are you
definitely not talking about dedupe?

2015-05-20 8:37 GMT+01:00 Bram Van Dam :

> On 19/05/15 14:47, Alessandro Benedetti wrote:
> > Hi Bram,
> > what do you mean by:
> > "  I
> > would like it to provide the unique value myself, without having the
> > deduplicator create a hash of field values " .
> >
> > This is not de-duplication, but simple document filtering based on a
> > constraint.
> > If you do want de-duplication (which is what the very first part of your
> > mail seemed to suggest), here you can find a lot of info:
>
> Not sure whether de-duplication is the right word for what I'm after, I
> essentially want a unique constraint on an arbitrary field. Without
> overwrite semantics, because I want Solr to tell me if a duplicate is
> sent to Solr.
>
> I was thinking that the de-duplication feature could accomplish this
> somehow.
>
>
>  - Bram
>



-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: Deduplication

2015-05-20 Thread Bram Van Dam
On 19/05/15 14:47, Alessandro Benedetti wrote:
> Hi Bram,
> what do you mean by:
> "  I
> would like it to provide the unique value myself, without having the
> deduplicator create a hash of field values " .
> 
> This is not de-duplication, but simple document filtering based on a
> constraint.
> If you do want de-duplication (which is what the very first part of your
> mail seemed to suggest), here you can find a lot of info:

Not sure whether de-duplication is the right word for what I'm after, I
essentially want a unique constraint on an arbitrary field. Without
overwrite semantics, because I want Solr to tell me if a duplicate is
sent to Solr.

I was thinking that the de-duplication feature could accomplish this
somehow.


 - Bram


Re: Deduplication

2015-05-20 Thread Bram Van Dam
>> Write a custom update processor and include it in your update chain.
>> You will then have the ability to do anything you want with the entire
>> input document before it hits the code to actually do the indexing.

This sounded like the perfect option ... until I read Jack's comment:

>
> My understanding was that the distributed update processor is near the end
> of the chain, so that running of user update processors occurs before the
> distribution step, but is that distribution to the leader, or distribution
> from leader to replicas for a shard?

That would pose some potential problems.

Would a custom update processor make the solution "cloud-safe"?

Thx,

 - Bram



Re: Deduplication

2015-05-19 Thread Jack Krupansky
Shawn, I was going to say the same thing, but... then I was thinking about
SolrCloud and the fact that update processors are invoked before the
document is sent to its target node, so there wouldn't be a reliable way to
tell whether the input document's field value exists on the target node rather
than the current node.

Or does the update processing only occur on the leader node after being
forwarded from the originating node? Is the doc clear on this detail?

My understanding was that the distributed update processor is near the end
of the chain, so that running of user update processors occurs before the
distribution step, but is that distribution to the leader, or distribution
from leader to replicas for a shard?


-- Jack Krupansky

On Tue, May 19, 2015 at 9:01 AM, Shawn Heisey  wrote:

> On 5/19/2015 3:02 AM, Bram Van Dam wrote:
> > I'm looking for a way to have Solr reject documents if a certain field
> > value is duplicated (reject, not overwrite). There doesn't seem to be
> > any kind of unique option in schema fields.
> >
> > The de-duplication feature seems to make this (somewhat) possible, but I
> > would like it to provide the unique value myself, without having the
> > deduplicator create a hash of field values.
> >
> > Am I missing an obvious (or less obvious) way of accomplishing this?
>
> Write a custom update processor and include it in your update chain.
> You will then have the ability to do anything you want with the entire
> input document before it hits the code to actually do the indexing.
>
> > A script update processor is included with Solr that allows you to write your
> > processor in a language other than Java, such as JavaScript.
>
>
> https://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html
>
> Here's how to discard a document in an update processor written in Java:
>
>
> http://stackoverflow.com/questions/27108200/how-to-cancel-indexing-of-a-solr-document-using-update-request-processor
>
> The javadoc that I linked above describes the ability to return "false"
> in other languages to discard the document.
>
> Thanks,
> Shawn
>
>


Re: Deduplication

2015-05-19 Thread Shawn Heisey
On 5/19/2015 3:02 AM, Bram Van Dam wrote:
> I'm looking for a way to have Solr reject documents if a certain field
> value is duplicated (reject, not overwrite). There doesn't seem to be
> any kind of unique option in schema fields.
> 
> The de-duplication feature seems to make this (somewhat) possible, but I
> would like it to provide the unique value myself, without having the
> deduplicator create a hash of field values.
> 
> Am I missing an obvious (or less obvious) way of accomplishing this?

Write a custom update processor and include it in your update chain.
You will then have the ability to do anything you want with the entire
input document before it hits the code to actually do the indexing.

A script update processor is included with Solr that allows you to write your
processor in a language other than Java, such as JavaScript.

https://lucene.apache.org/solr/5_1_0/solr-core/org/apache/solr/update/processor/StatelessScriptUpdateProcessorFactory.html

Here's how to discard a document in an update processor written in Java:

http://stackoverflow.com/questions/27108200/how-to-cancel-indexing-of-a-solr-document-using-update-request-processor

The javadoc that I linked above describes the ability to return "false"
in other languages to discard the document.

Thanks,
Shawn
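
To make the suggestion above concrete, here is a minimal sketch in Java of a
custom processor factory that drops documents it considers duplicates. The
field name ("uniq_field") and the in-memory "seen" set are illustrative
assumptions only, not something from this thread; a real implementation would
check the index itself, and in SolrCloud it would have to run at the right
point in the chain (see the rest of this thread) to be reliable.

import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class RejectDuplicatesProcessorFactory extends UpdateRequestProcessorFactory {

  // Toy duplicate check: remembers the keys this node has seen since startup.
  private final Set<Object> seen = ConcurrentHashMap.newKeySet();

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                            SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object key = doc.getFieldValue("uniq_field"); // assumed field name
        if (key != null && !seen.add(key)) {
          return; // duplicate: discard it by not passing it down the chain
        }
        super.processAdd(cmd); // not a duplicate: continue the chain
      }
    };
  }
}

Registered in solrconfig.xml like any other update processor factory, it
discards a duplicate simply by not calling super.processAdd(), which is the
Java equivalent of returning "false" from a script processor.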



Re: Deduplication

2015-05-19 Thread Alessandro Benedetti
Hi Bram,
what do you mean by:
"  I
would like it to provide the unique value myself, without having the
deduplicator create a hash of field values " .

This is not de-duplication, but simple document filtering based on a
constraint.
If you do want de-duplication (which is what the very first part of your
mail seemed to suggest), here you can find a lot of info:

https://cwiki.apache.org/confluence/display/solr/De-Duplication

Let me know if you have more detailed requirements!

2015-05-19 10:02 GMT+01:00 Bram Van Dam :

> Hi folks,
>
> I'm looking for a way to have Solr reject documents if a certain field
> value is duplicated (reject, not overwrite). There doesn't seem to be
> any kind of unique option in schema fields.
>
> The de-duplication feature seems to make this (somewhat) possible, but I
> would like it to provide the unique value myself, without having the
> deduplicator create a hash of field values.
>
> Am I missing an obvious (or less obvious) way of accomplishing this?
>
> Thanks,
>
>  - Bram
>



-- 
--

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: Deduplication in SolrCloud

2012-07-27 Thread Lance Norskog
Should the old Signature code be removed? Given that the goal is to
have everyone use SolrCloud, maybe this kind of landmine should be
removed?

On Fri, Jul 27, 2012 at 8:43 AM, Markus Jelsma
 wrote:
> This issue doesn't really describe your problem but a more general problem of 
> distributed deduplication:
> https://issues.apache.org/jira/browse/SOLR-3473
>
>
> -Original message-
>> From:Daniel Brügge 
>> Sent: Fri 27-Jul-2012 17:38
>> To: solr-user@lucene.apache.org
>> Subject: Deduplication in SolrCloud
>>
>> Hi,
>>
>> in my old Solr Setup I have used the deduplication feature in the update
>> chain
>> with a couple of fields.
>>
>> <updateRequestProcessorChain name="dedupe">
>>   <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>>     <bool name="enabled">true</bool>
>>     <str name="signatureField">signature</str>
>>     <bool name="overwriteDupes">false</bool>
>>     <str name="fields">uuid,type,url,content_hash</str>
>>     <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
>>   </processor>
>>   <processor class="solr.LogUpdateProcessorFactory" />
>>   <processor class="solr.RunUpdateProcessorFactory" />
>> </updateRequestProcessorChain>
>>
>> This worked fine. But when I now use this in my 2-shard SolrCloud setup and
>> insert 150.000 documents,
>> I always get an error:
>>
>> *INFO: end_commit_flush*
>> *Jul 27, 2012 3:29:36 PM org.apache.solr.common.SolrException log*
>> *SEVERE: null:java.lang.RuntimeException: java.lang.OutOfMemoryError:
>> unable to create new native thread*
>> * at
>> org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:456)
>> *
>> * at
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:284)
>> *
>>
>> I am inserting the documents via CSV import and curl command and split them
>> also into 50k chunks.
>>
>> Without the dedupe chain, the import finishes after 40secs.
>>
>> The curl command writes to one of my shards.
>>
>>
>> Do you have an idea why this happens? Should I reduce the fields to one? I
>> have read that not using the id as
>> dedupe fields could be an issue?
>>
>>
>> I have searched for deduplication with SolrCloud and I am wondering if it
>> is already working correctly? see e.g.
>> http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html
>>
>> Thanks & regards
>>
>> Daniel
>>



-- 
Lance Norskog
goks...@gmail.com


RE: Deduplication in SolrCloud

2012-07-27 Thread Markus Jelsma
This issue doesn't really describe your problem but a more general problem of 
distributed deduplication:
https://issues.apache.org/jira/browse/SOLR-3473
 
 
-Original message-
> From:Daniel Brügge 
> Sent: Fri 27-Jul-2012 17:38
> To: solr-user@lucene.apache.org
> Subject: Deduplication in SolrCloud
> 
> Hi,
> 
> in my old Solr Setup I have used the deduplication feature in the update
> chain
> with a couple of fields.
> 
> <updateRequestProcessorChain name="dedupe">
>   <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>     <bool name="enabled">true</bool>
>     <str name="signatureField">signature</str>
>     <bool name="overwriteDupes">false</bool>
>     <str name="fields">uuid,type,url,content_hash</str>
>     <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory" />
>   <processor class="solr.RunUpdateProcessorFactory" />
> </updateRequestProcessorChain>
> 
> This worked fine. But when I now use this in my 2-shard SolrCloud setup and
> insert 150.000 documents,
> I always get an error:
> 
> *INFO: end_commit_flush*
> *Jul 27, 2012 3:29:36 PM org.apache.solr.common.SolrException log*
> *SEVERE: null:java.lang.RuntimeException: java.lang.OutOfMemoryError:
> unable to create new native thread*
> * at
> org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:456)
> *
> * at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:284)
> *
> 
> I am inserting the documents via CSV import and curl command and split them
> also into 50k chunks.
> 
> Without the dedupe chain, the import finishes after 40secs.
> 
> The curl command writes to one of my shards.
> 
> 
> Do you have an idea why this happens? Should I reduce the fields to one? I
> have read that not using the id as
> dedupe fields could be an issue?
> 
> 
> I have searched for deduplication with SolrCloud and I am wondering if it
> is already working correctly? see e.g.
> http://lucene.472066.n3.nabble.com/SolrCloud-deduplication-td3984657.html
> 
> Thanks & regards
> 
> Daniel
> 


Re: Deduplication questions

2011-04-11 Thread Chris Hostetter

: Q1. Is it possible to pass *analyzed* content to the
: 
: public abstract class Signature {

No, analysis happens as the documents are being written to the lucene 
index, well after the UpdateProcessors have had a chance to interact with 
the values.

: Q2. Method calculate() uses the concatenated fields from name,features,cat.
: Is there any mechanism by which I could build "field dependent signatures"?

At the moment the Signature API is fairly minimal, but it could definitely 
be improved by adding more methods (with sensible defaults in the 
base class) that would give the impl more control over the resulting 
signature ... we just need people to propose good suggestions with example 
use cases.

: Is the idea to make two UpdateProcessors and chain them OK? (It's ugly, but
: would work)

I don't know whether what you describe is really intentional or not, but it 
should work


-Hoss
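
For what it's worth, a bare-bones custom Signature along the lines discussed
above might look like the sketch below. It assumes the API quoted in the
question (an abstract calculate(String) that receives the concatenated field
values); method names may differ in other Solr versions, and the normalisation
shown is purely illustrative.

import org.apache.solr.update.processor.Signature;

public class NormalizingSignature extends Signature {
  @Override
  public String calculate(String content) {
    // Analyzed content is not available here (see the answer above), so any
    // normalisation has to be done by hand before hashing.
    String normalised = content.toLowerCase().replaceAll("\\s+", " ").trim();
    return Integer.toHexString(normalised.hashCode());
  }
}

It would then be referenced through the signatureClass parameter of
SignatureUpdateProcessorFactory in solrconfig.xml.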


Re: Deduplication

2010-05-19 Thread Ahmet Arslan
> TermsComponent maybe? 
> 
> or faceting?
> q=*:*&facet=true&facet.field=signatureField&defType=lucene&rows=0&start=0
> 
> if you append &facet.mincount=1 to above url you can
> see your duplications
> 

After re-reading your message: sometimes you want to show duplicates, sometimes 
you don't want them. I have never used FieldCollapsing by myself but heard 
about it many times.

http://wiki.apache.org/solr/FieldCollapsing


  


Re: Deduplication

2010-05-19 Thread Ahmet Arslan

> Basically for some use cases I would like to show
> duplicates, for others I
> want them ignored.
> 
> If I have overwriteDupes=false and I just create the dedup
> hash how can I
> query for only unique hash values... ie something like a
> SQL group by. 

TermsComponent maybe? 

or faceting? 
q=*:*&facet=true&facet.field=signatureField&defType=lucene&rows=0&start=0

if you append &facet.mincount=1 to above url you can see your duplications


  


Re: Deduplication in 1.4

2009-11-26 Thread Martijn v Groningen
Two sites that use field-collapsing:
1) www.ilocal.nl
2) www.welke.nl
I'm not sure what you mean by double-tripping? The sites mentioned
do not have performance problems that are caused by field collapsing.

Field-collapsing currently only supports quasi distributed
field-collapsing (as I have described on the Solr wiki). Currently I
don't know a distributed field-collapsing algorithm that works
properly and does not influence the search time in such a way that the
search becomes slow.

Martijn

2009/11/26 Otis Gospodnetic :
> Hi Martijn,
>
>
> - Original Message 
>
>> From: Martijn v Groningen 
>> To: solr-user@lucene.apache.org
>> Sent: Thu, November 26, 2009 3:19:40 AM
>> Subject: Re: Deduplication in 1.4
>>
>> Field collapsing has been used by many in their production
>> environment.
>
> Got any pointers to public sites you know use it?  I know of a high traffic 
> site that used an early version, and it caused performance problems.  Is 
> double-tripping still required?
>
>> The last few months the stability of the patch grew as
>> quite some bugs were fixed. The only big feature missing currently is
>> caching of the collapsing algorithm. I'm currently working on that and
>
> Is it also full distributed-search-ready?
>
>> I will put it in a new patch in the coming days.  So yes the
>> patch is very near being production ready.
>
> Thanks,
> Otis
>
>> Martijn
>>
>> 2009/11/26 KaktuChakarabati :
>> >
>> > Hey Otis,
>> > Yep, I realized this myself after playing some with the dedupe feature
>> > yesterday.
>> > So it does look like Field collapsing is what I need pretty much.
>> > Any idea on how close it is to being production-ready?
>> >
>> > Thanks,
>> > -Chak
>> >
>> > Otis Gospodnetic wrote:
>> >>
>> >> Hi,
>> >>
>> >> As far as I know, the point of deduplication in Solr (
>> >> http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate
>> >> document before indexing it in order to avoid duplicates in the index in
>> >> the first place.
>> >>
>> >> What you are describing is closer to field collapsing patch in SOLR-236.
>> >>
>> >>  Otis
>> >> --
>> >> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>> >> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>> >>
>> >>
>> >>
>> >> - Original Message 
>> >>> From: KaktuChakarabati
>> >>> To: solr-user@lucene.apache.org
>> >>> Sent: Tue, November 24, 2009 5:29:00 PM
>> >>> Subject: Deduplication in 1.4
>> >>>
>> >>>
>> >>> Hey,
>> >>> I've been trying to find some documentation on using this feature in 1.4
>> >>> but
>> >>> the Wiki page is a little sparse..
>> >>> In specific, here's what i'm trying to do:
>> >>>
>> >>> I have a field, say 'duplicate_group_id' that i'll populate based on some
>> >>> offline documents deduplication process I have.
>> >>>
>> >>> All I want is for solr to compute a 'duplicate_signature' field based on
>> >>> this one at update time, so that when i search for documents later, all
>> >>> documents with same original 'duplicate_group_id' value will be rolled up
>> >>> (e.g i'll just get the first one that came back  according to relevancy).
>> >>>
>> >>> I enabled the deduplication processor and put it into updater, but i'm
>> >>> not
>> >>> seeing any difference in returned results (i.e results with same
>> >>> duplicate_id are returned separately..)
>> >>>
>> >>> is there anything i need to supply in query-time for this to take effect?
>> >>> what should be the behaviour? is there any working example of this?
>> >>>
>> >>> Anything will be helpful..
>> >>>
>> >>> Thanks,
>> >>> Chak
>> >>> --
>> >>> View this message in context:
>> >>> http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html
>> >>> Sent from the Solr - User mailing list archive at Nabble.com.
>> >>
>> >>
>> >>
>> >
>> > --
>> > View this message in context:
>> http://old.nabble.com/Deduplication-in-1.4-tp26504403p26522386.html
>> > Sent from the Solr - User mailing list archive at Nabble.com.
>> >
>> >
>
>


Re: Deduplication in 1.4

2009-11-26 Thread Otis Gospodnetic
Hi Martijn,

 
- Original Message 

> From: Martijn v Groningen 
> To: solr-user@lucene.apache.org
> Sent: Thu, November 26, 2009 3:19:40 AM
> Subject: Re: Deduplication in 1.4
> 
> Field collapsing has been used by many in their production
> environment. 

Got any pointers to public sites you know use it?  I know of a high traffic 
site that used an early version, and it caused performance problems.  Is 
double-tripping still required?

> The last few months the stability of the patch grew as
> quite some bugs were fixed. The only big feature missing currently is
> caching of the collapsing algorithm. I'm currently working on that and

Is it also full distributed-search-ready?

> I will put it in a new patch in the coming days.  So yes the
> patch is very near being production ready.

Thanks,
Otis

> Martijn
> 
> 2009/11/26 KaktuChakarabati :
> >
> > Hey Otis,
> > Yep, I realized this myself after playing some with the dedupe feature
> > yesterday.
> > So it does look like Field collapsing is what I need pretty much.
> > Any idea on how close it is to being production-ready?
> >
> > Thanks,
> > -Chak
> >
> > Otis Gospodnetic wrote:
> >>
> >> Hi,
> >>
> >> As far as I know, the point of deduplication in Solr (
> >> http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate
> >> document before indexing it in order to avoid duplicates in the index in
> >> the first place.
> >>
> >> What you are describing is closer to field collapsing patch in SOLR-236.
> >>
> >>  Otis
> >> --
> >> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> >> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> >>
> >>
> >>
> >> - Original Message 
> >>> From: KaktuChakarabati 
> >>> To: solr-user@lucene.apache.org
> >>> Sent: Tue, November 24, 2009 5:29:00 PM
> >>> Subject: Deduplication in 1.4
> >>>
> >>>
> >>> Hey,
> >>> I've been trying to find some documentation on using this feature in 1.4
> >>> but
> >>> the Wiki page is a little sparse..
> >>> In specific, here's what i'm trying to do:
> >>>
> >>> I have a field, say 'duplicate_group_id' that i'll populate based on some
> >>> offline documents deduplication process I have.
> >>>
> >>> All I want is for solr to compute a 'duplicate_signature' field based on
> >>> this one at update time, so that when i search for documents later, all
> >>> documents with same original 'duplicate_group_id' value will be rolled up
> >>> (e.g i'll just get the first one that came back  according to relevancy).
> >>>
> >>> I enabled the deduplication processor and put it into updater, but i'm
> >>> not
> >>> seeing any difference in returned results (i.e results with same
> >>> duplicate_id are returned separately..)
> >>>
> >>> is there anything i need to supply in query-time for this to take effect?
> >>> what should be the behaviour? is there any working example of this?
> >>>
> >>> Anything will be helpful..
> >>>
> >>> Thanks,
> >>> Chak
> >>> --
> >>> View this message in context:
> >>> http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html
> >>> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
> >>
> >>
> >
> > --
> > View this message in context: 
> http://old.nabble.com/Deduplication-in-1.4-tp26504403p26522386.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
> >



Re: Deduplication in 1.4

2009-11-26 Thread Martijn v Groningen
Field collapsing has been used by many in their production
environment. The last few months the stability of the patch grew as
quite some bugs were fixed. The only big feature missing currently is
caching of the collapsing algorithm. I'm currently working on that and
I will put it in a new patch in the coming days.  So yes the
patch is very near being production ready.

Martijn

2009/11/26 KaktuChakarabati :
>
> Hey Otis,
> Yep, I realized this myself after playing some with the dedupe feature
> yesterday.
> So it does look like Field collapsing is what I need pretty much.
> Any idea on how close it is to being production-ready?
>
> Thanks,
> -Chak
>
> Otis Gospodnetic wrote:
>>
>> Hi,
>>
>> As far as I know, the point of deduplication in Solr (
>> http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate
>> document before indexing it in order to avoid duplicates in the index in
>> the first place.
>>
>> What you are describing is closer to field collapsing patch in SOLR-236.
>>
>>  Otis
>> --
>> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>>
>>
>>
>> - Original Message 
>>> From: KaktuChakarabati 
>>> To: solr-user@lucene.apache.org
>>> Sent: Tue, November 24, 2009 5:29:00 PM
>>> Subject: Deduplication in 1.4
>>>
>>>
>>> Hey,
>>> I've been trying to find some documentation on using this feature in 1.4
>>> but
>>> the Wiki page is a little sparse..
>>> In specific, here's what i'm trying to do:
>>>
>>> I have a field, say 'duplicate_group_id' that i'll populate based on some
>>> offline documents deduplication process I have.
>>>
>>> All I want is for solr to compute a 'duplicate_signature' field based on
>>> this one at update time, so that when i search for documents later, all
>>> documents with same original 'duplicate_group_id' value will be rolled up
>>> (e.g i'll just get the first one that came back  according to relevancy).
>>>
>>> I enabled the deduplication processor and put it into updater, but i'm
>>> not
>>> seeing any difference in returned results (i.e results with same
>>> duplicate_id are returned separately..)
>>>
>>> is there anything i need to supply in query-time for this to take effect?
>>> what should be the behaviour? is there any working example of this?
>>>
>>> Anything will be helpful..
>>>
>>> Thanks,
>>> Chak
>>> --
>>> View this message in context:
>>> http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>>
>
> --
> View this message in context: 
> http://old.nabble.com/Deduplication-in-1.4-tp26504403p26522386.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: Deduplication in 1.4

2009-11-25 Thread KaktuChakarabati

Hey Otis,
Yep, I realized this myself after playing some with the dedupe feature
yesterday.
So it does look like Field collapsing is what I need pretty much.
Any idea on how close it is to being production-ready?

Thanks,
-Chak

Otis Gospodnetic wrote:
> 
> Hi,
> 
> As far as I know, the point of deduplication in Solr (
> http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate
> document before indexing it in order to avoid duplicates in the index in
> the first place.
> 
> What you are describing is closer to field collapsing patch in SOLR-236.
> 
>  Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> 
> 
> 
> - Original Message 
>> From: KaktuChakarabati 
>> To: solr-user@lucene.apache.org
>> Sent: Tue, November 24, 2009 5:29:00 PM
>> Subject: Deduplication in 1.4
>> 
>> 
>> Hey,
>> I've been trying to find some documentation on using this feature in 1.4
>> but
>> the Wiki page is a little sparse..
>> In specific, here's what i'm trying to do:
>> 
>> I have a field, say 'duplicate_group_id' that i'll populate based on some
>> offline documents deduplication process I have.
>> 
>> All I want is for solr to compute a 'duplicate_signature' field based on
>> this one at update time, so that when i search for documents later, all
>> documents with same original 'duplicate_group_id' value will be rolled up
>> (e.g i'll just get the first one that came back  according to relevancy).
>> 
>> I enabled the deduplication processor and put it into updater, but i'm
>> not
>> seeing any difference in returned results (i.e results with same
>> duplicate_id are returned separately..)
>> 
>> is there anything i need to supply in query-time for this to take effect?
>> what should be the behaviour? is there any working example of this?
>> 
>> Anything will be helpful..
>> 
>> Thanks,
>> Chak
>> -- 
>> View this message in context: 
>> http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
> 
> 
> 

-- 
View this message in context: 
http://old.nabble.com/Deduplication-in-1.4-tp26504403p26522386.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Deduplication in 1.4

2009-11-24 Thread Otis Gospodnetic
Hi,

As far as I know, the point of deduplication in Solr ( 
http://wiki.apache.org/solr/Deduplication ) is to detect a duplicate document 
before indexing it in order to avoid duplicates in the index in the first place.

What you are describing is closer to field collapsing patch in SOLR-236.

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
> From: KaktuChakarabati 
> To: solr-user@lucene.apache.org
> Sent: Tue, November 24, 2009 5:29:00 PM
> Subject: Deduplication in 1.4
> 
> 
> Hey,
> I've been trying to find some documentation on using this feature in 1.4 but
> the Wiki page is a little sparse..
> In specific, here's what i'm trying to do:
> 
> I have a field, say 'duplicate_group_id' that i'll populate based on some
> offline documents deduplication process I have.
> 
> All I want is for solr to compute a 'duplicate_signature' field based on
> this one at update time, so that when i search for documents later, all
> documents with same original 'duplicate_group_id' value will be rolled up
> (e.g i'll just get the first one that came back  according to relevancy).
> 
> I enabled the deduplication processor and put it into updater, but i'm not
> seeing any difference in returned results (i.e results with same
> duplicate_id are returned separately..)
> 
> is there anything i need to supply in query-time for this to take effect?
> what should be the behaviour? is there any working example of this?
> 
> Anything will be helpful..
> 
> Thanks,
> Chak
> -- 
> View this message in context: 
> http://old.nabble.com/Deduplication-in-1.4-tp26504403p26504403.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Deduplication patch not working in nightly build

2009-01-10 Thread Grant Ingersoll
I've seen similar errors when large background merges happen while  
looping in a result set.  See http://lucene.grantingersoll.com/2008/07/16/mysql-solr-and-communications-link-failure/




On Jan 9, 2009, at 12:50 PM, Mark Miller wrote:

You're basically writing segments more often now, and somehow avoiding  
a longer merge I think. Also, likely, deduplication is probably  
adding enough extra data to your index to hit a sweet spot where a  
merge is too long. Or something to that effect - MySql is especially  
sensitive to timeouts when doing a select * on a huge db in my  
testing. I didnt understand your answer on the autocommit - I take  
it you are using it? Or no?


All a guess, but it def points to a merge taking a bit long and  
causing a timeout. I think you can relax the MySql timeout settings  
if that is it.


I'd like to get to the bottom of this as well, so any other info you  
can provide would be great.


- Mark

Marc Sturlese wrote:

Hey Shalin,

In the beginning (when the error was appearing) I had
<ramBufferSizeMB>32</ramBufferSizeMB>
and no maxBufferedDocs set.

Now I have:
<ramBufferSizeMB>32</ramBufferSizeMB>
<maxBufferedDocs>50</maxBufferedDocs>

I think that setting maxBufferedDocs to 50 forces more disk writing
than I would like... but at least it works fine (though a bit slower, obviously).

I keep saying that the weirdest thing is that I don't have that problem
using solr1.3, just with the nightly...

Even though it works well now, it would be great if someone could
give me an explanation of why this is happening


Shalin Shekhar Mangar wrote:


On Fri, Jan 9, 2009 at 9:23 PM, Marc Sturlese
wrote:



hey there,
I hadn't autoCommit set to true but I have it sorted! The error
stopped
appearing after setting the property maxBufferedDocs in  
solrconfig.xml. I

can't exactly understand why but it just worked.
Anyway, maxBufferedDocs is deprecated, would ramBufferSizeMB do
the same?





What I find strange is this line in the exception:
"Last packet sent to the server was 202481 ms ago."

Something took very very long to complete and the connection got  
closed by

the time the next row was fetched from the opened resultset.

Just curious, what was the previous value of maxBufferedDocs and  
what did

you change it to?




--
View this message in context:
http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21374908.html
Sent from the Solr - User mailing list archive at Nabble.com.




--
Regards,
Shalin Shekhar Mangar.










--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ












Re: Deduplication patch not working in nightly build

2009-01-09 Thread Marc Sturlese

Hey Mark,
Sorry I was not specific enough; I meant that I have always
had autoCommit=false.
I will do some more traces and test. Will post if I have any new important
thing to mention.

Thanks.


Marc Sturlese wrote:
> 
> Hey Shalin,
> 
> In the beginning (when the error was appearing) I had
> <ramBufferSizeMB>32</ramBufferSizeMB>
> and no maxBufferedDocs set.
> 
> Now I have:
> <ramBufferSizeMB>32</ramBufferSizeMB>
> <maxBufferedDocs>50</maxBufferedDocs>
> 
> I think that setting maxBufferedDocs to 50 forces more disk writing
> than I would like... but at least it works fine (though a bit slower, obviously).
> 
> I keep saying that the weirdest thing is that I don't have that problem
> using solr1.3, just with the nightly...
> 
> Even though it works well now, it would be great if someone could
> give me an explanation of why this is happening
>  
> 
> 
> Shalin Shekhar Mangar wrote:
>> 
>> On Fri, Jan 9, 2009 at 9:23 PM, Marc Sturlese
>> wrote:
>> 
>>>
>>> hey there,
>>> I hadn't autoCommit set to true but I have it sorted! The error
>>> stopped
>>> appearing after setting the property maxBufferedDocs in solrconfig.xml.
>>> I
>>> can't exactly understand why but it just worked.
>>> Anyway, maxBufferedDocs is deprecated, would ramBufferSizeMB do the
>>> same?
>>>
>>>
>> What I find strange is this line in the exception:
>> "Last packet sent to the server was 202481 ms ago."
>> 
>> Something took very very long to complete and the connection got closed
>> by
>> the time the next row was fetched from the opened resultset.
>> 
>> Just curious, what was the previous value of maxBufferedDocs and what did
>> you change it to?
>> 
>> 
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21374908.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>> 
>> 
>> -- 
>> Regards,
>> Shalin Shekhar Mangar.
>> 
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21378069.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Deduplication patch not working in nightly build

2009-01-09 Thread Mark Miller
You're basically writing segments more often now, and somehow avoiding a 
longer merge I think. Also, likely, deduplication is probably adding 
enough extra data to your index to hit a sweet spot where a merge is too 
long. Or something to that effect - MySql is especially sensitive to 
timeouts when doing a select * on a huge db in my testing. I didn't 
understand your answer on the autocommit - I take it you are using it? 
Or no?


All a guess, but it def points to a merge taking a bit long and causing 
a timeout. I think you can relax the MySql timeout settings if that is it.


I'd like to get to the bottom of this as well, so any other info you can 
provide would be great.


- Mark

Marc Sturlese wrote:

Hey Shalin,

In the beginning (when the error was appearing) I had
<ramBufferSizeMB>32</ramBufferSizeMB>
and no maxBufferedDocs set.

Now I have:
<ramBufferSizeMB>32</ramBufferSizeMB>
<maxBufferedDocs>50</maxBufferedDocs>

I think that setting maxBufferedDocs to 50 forces more disk writing
than I would like... but at least it works fine (though a bit slower, obviously).

I keep saying that the weirdest thing is that I don't have that problem
using solr1.3, just with the nightly...

Even though it works well now, it would be great if someone could
give me an explanation of why this is happening
 



Shalin Shekhar Mangar wrote:
  

On Fri, Jan 9, 2009 at 9:23 PM, Marc Sturlese
wrote:



hey there,
I hadn't autoCommit set to true but I have it sorted! The error
stopped
appearing after setting the property maxBufferedDocs in solrconfig.xml. I
can't exactly understand why but it just worked.
Anyway, maxBufferedDocs is deprecated, would ramBufferSizeMB do the same?


  

What I find strange is this line in the exception:
"Last packet sent to the server was 202481 ms ago."

Something took very very long to complete and the connection got closed by
the time the next row was fetched from the opened resultset.

Just curious, what was the previous value of maxBufferedDocs and what did
you change it to?




--
View this message in context:
http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21374908.html
Sent from the Solr - User mailing list archive at Nabble.com.


  

--
Regards,
Shalin Shekhar Mangar.





  




Re: Deduplication patch not working in nightly build

2009-01-09 Thread Marc Sturlese

Hey Shalin,

In the beginning (when the error was appearing) I had
<ramBufferSizeMB>32</ramBufferSizeMB>
and no maxBufferedDocs set.

Now I have:
<ramBufferSizeMB>32</ramBufferSizeMB>
<maxBufferedDocs>50</maxBufferedDocs>

I think that setting maxBufferedDocs to 50 forces more disk writing
than I would like... but at least it works fine (though a bit slower, obviously).

I keep saying that the weirdest thing is that I don't have that problem
using solr1.3, just with the nightly...

Even though it works well now, it would be great if someone could
give me an explanation of why this is happening
 


Shalin Shekhar Mangar wrote:
> 
> On Fri, Jan 9, 2009 at 9:23 PM, Marc Sturlese
> wrote:
> 
>>
>> hey there,
>> I hadn't autoCommit set to true but I have it sorted! The error
>> stopped
>> appearing after setting the property maxBufferedDocs in solrconfig.xml. I
>> can't exactly understand why but it just worked.
>> Anyway, maxBufferedDocs is deprecated, would ramBufferSizeMB do the same?
>>
>>
> What I find strange is this line in the exception:
> "Last packet sent to the server was 202481 ms ago."
> 
> Something took very very long to complete and the connection got closed by
> the time the next row was fetched from the opened resultset.
> 
> Just curious, what was the previous value of maxBufferedDocs and what did
> you change it to?
> 
> 
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21374908.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21376235.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Deduplication patch not working in nightly build

2009-01-09 Thread Shalin Shekhar Mangar
On Fri, Jan 9, 2009 at 9:23 PM, Marc Sturlese wrote:

>
> hey there,
> I hadn't autoCommit set to true but I have it sorted! The error stopped
> appearing after setting the property maxBufferedDocs in solrconfig.xml. I
> can't exactly understand why but it just worked.
> Anyway, maxBufferedDocs is deprecated, would ramBufferSizeMB do the same?
>
>
What I find strange is this line in the exception:
"Last packet sent to the server was 202481 ms ago."

Something took very very long to complete and the connection got closed by
the time the next row was fetched from the opened resultset.

Just curious, what was the previous value of maxBufferedDocs and what did
you change it to?


>
> --
> View this message in context:
> http://www.nabble.com/Deduplication-patch-not-working-in-nightly-build-tp21287327p21374908.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
Regards,
Shalin Shekhar Mangar.


Re: Deduplication patch not working in nightly build

2009-01-09 Thread Marc Sturlese

hey there,
I hadn't autoCommit set to true but I have it sorted! The error stopped
appearing after setting the property maxBufferedDocs in solrconfig.xml. I
can't exactly understand why but it just worked.
Anyway, maxBufferedDocs is deprecated, would ramBufferSizeMB do the same?

Thanks


Marc Sturlese wrote:
> 
> Hey there,
> I was using the Deduplication patch with Solr 1.3 release and everything
> was working perfectly. Now I upgraded to a nightly build (20th december)
> to be able to use new facet algorithm and other stuff and DeDuplication is
> not working any more. I have followed exactly the same steps to apply the
> patch to the source code. I am geting this error:
> 
> WARNING: Error reading data 
> com.mysql.jdbc.CommunicationsException: Communications link failure due to
> underlying exception: 
> 
> ** BEGIN NESTED EXCEPTION ** 
> 
> java.io.EOFException
> 
> STACKTRACE:
> 
> java.io.EOFException
> at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
> at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
> at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
> at
> com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
> at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
> at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
> at
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
> at
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
> at
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
> at
> org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
> at
> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
> at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
> at
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
> at
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
> at
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
> at
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
> at
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
> 
> 
> ** END NESTED EXCEPTION **
> Last packet sent to the server was 202481 ms ago.
> at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
> at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
> at
> com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
> at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
> at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
> at
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
> at
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
> at
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
> at
> org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
> at
> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
> at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
> at
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
> at
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
> at
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
> at
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
> at
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
> Jan 5, 2009 10:06:16 AM org.apache.solr.handler.dataimport.JdbcDataSource
> logError
> WARNING: Exception while closing result set
> com.mysql.jdbc.CommunicationsException: Communications link failure due to
> underlying exception: 
> 
> ** BEGIN NESTED EXCEPTION ** 
> 
> java.io.EOFException
> 
> STACKTRACE:
> 
> java.io.EOFException
> at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
> at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2351)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
> at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
> at com.mysql.jdbc.MysqlIO.nextRow(My

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Mark Miller
I can't imagine why dedupe would have anything to do with this, other 
than what was said: perhaps it is taking a bit longer to get a document 
to the db, and it times out (maybe a long signature calculation?). Have 
you tried changing your MySql settings to allow for a longer timeout? 
(sorry, I'm not too up to date on what you have tried).


Also, are you using autocommit during the import? If so, you might try 
turning it off for the full import.


- Mark

Marc Sturlese wrote:

Hey there,
I have been stuck on this problem for 3 days and have no idea how to sort it.

I am using the nightly from a week ago, mysql and this driver and url:
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/my_db"

I can use the deduplication patch with indexes of 200.000 docs and no problem.
When I try a full-import with a db of 1.500.000 it stops indexing at doc
number 15.000 approx, showing me the error posted above.
Once I get the exception, i restart tomcat and start a delta-import... this
time everything works fine!
I need to avoid this error in the full import. I have tried:

url="jdbc:mysql://localhost/my_db?autoReconnect=true to sort it in case the
connection was closed due to long time until next doc was indexed, but
nothing changed... I keep having this:
Jan 9, 2009 1:38:18 PM org.apache.solr.handler.dataimport.JdbcDataSource
logError
WARNING: Error reading data 
com.mysql.jdbc.CommunicationsException: Communications link failure due to
underlying exception: 

** BEGIN NESTED EXCEPTION ** 


java.io.EOFException

STACKTRACE:

java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:279)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$500(JdbcDataSource.java:167)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:205)
at
org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:77)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:387)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:209)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:160)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:368)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:437)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:428)


** END NESTED EXCEPTION **



Last packet sent to the server was 206097 ms ago.
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:279)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$500(JdbcDataSource.java:167)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:205)
at
org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:77)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:387)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:209)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:160)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:368)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:437)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:428)
Jan 9, 2009 1:38:18 PM org.apache.solr.handler.dataimport.JdbcDataSource
logError
WARNING: Exception while closing result set
com.mysql.jdbc.CommunicationsExcepti

Re: Deduplication patch not working in nightly build

2009-01-09 Thread Marc Sturlese

Hey there,
I have been stuck on this problem for 3 days and have no idea how to sort it.

I am using the nightly from a week ago, mysql and this driver and url:
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://localhost/my_db"

I can use the deduplication patch with indexes of 200.000 docs and no problem.
When I try a full-import with a db of 1.500.000 it stops indexing at doc
number 15.000 approx, showing me the error posted above.
Once I get the exception, i restart tomcat and start a delta-import... this
time everything works fine!
I need to avoid this error in the full import. I have tried:

url="jdbc:mysql://localhost/my_db?autoReconnect=true to sort it in case the
connection was closed due to long time until next doc was indexed, but
nothing changed... I keep having this:
Jan 9, 2009 1:38:18 PM org.apache.solr.handler.dataimport.JdbcDataSource
logError
WARNING: Error reading data 
com.mysql.jdbc.CommunicationsException: Communications link failure due to
underlying exception: 

** BEGIN NESTED EXCEPTION ** 

java.io.EOFException

STACKTRACE:

java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:279)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$500(JdbcDataSource.java:167)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:205)
at
org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:77)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:387)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:209)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:160)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:368)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:437)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:428)


** END NESTED EXCEPTION **



Last packet sent to the server was 206097 ms ago.
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:279)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$500(JdbcDataSource.java:167)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:205)
at
org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
at
org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:77)
at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:387)
at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:209)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:160)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:368)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:437)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:428)
Jan 9, 2009 1:38:18 PM org.apache.solr.handler.dataimport.JdbcDataSource
logError
WARNING: Exception while closing result set
com.mysql.jdbc.CommunicationsException: Communications link failure due to
underlying exception: 

** BEGIN NESTED EXCEPTION ** 

java.io.EOFException

STACKTRACE:

java.io.EOFException
at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2351)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Marc Sturlese

Thanks, I will have a look at my JdbcDataSource. Anyway it's weird because
using the 1.3 release I don't have that problem...

Shalin Shekhar Mangar wrote:
> 
> Yes, initially I figured that we are accidentally re-using a closed data
> source. But Noble has pinned it right. I guess you can try looking into
> your
> JDBC driver's documentation for a setting which increases the connection
> alive-ness.
> 
> On Mon, Jan 5, 2009 at 5:29 PM, Noble Paul നോബിള്‍ नोब्ळ् <
> noble.p...@gmail.com> wrote:
> 
>> I guess the indexing of a doc is taking too long (maybe because of
>> the de-dup patch) and the resultset gets closed automatically (timed
>> out)
>> --Noble
>>
>> On Mon, Jan 5, 2009 at 5:14 PM, Marc Sturlese 
>> wrote:
>> >
>> > Doing this fix I get the same error :(
>> >
>> > I am going to try to set up the latest nightly build... let's see if I
>> > have better luck.
>> >
>> > The thing is it stops indexing at doc num 150.000 approx... and gives me
>> > that mysql exception error... Without the DeDuplication patch I can index 2
>> > million docs without problems...
>> >
>> > I am pretty lost with this... :(
>> >
>> >
>> > Shalin Shekhar Mangar wrote:
>> >>
>> >> Yes I meant the 05/01/2008 build. The fix is a one line change
>> >>
>> >> Add the following as the last line of DataConfig.Entity.clearCache()
>> >> dataSrc = null;
>> >>
>> >>
>> >>
>> >> On Mon, Jan 5, 2009 at 4:22 PM, Marc Sturlese
>> >> wrote:
>> >>
>> >>>
>> >>> Shalin you mean I should test the 05/01/2008 nighlty? maybe with this
>> one
>> >>> works? If the fix you did is not really big can u tell me where in
>> the
>> >>> source is and what is it for? (I have been debuging and tracing a lot
>> the
>> >>> dataimporthandler source and I I would like to know what the
>> imporovement
>> >>> is
>> >>> about if it is not a problem...)
>> >>>
>> >>> Thanks!
>> >>>
>> >>>
>> >>> Shalin Shekhar Mangar wrote:
>> >>> >
>> >>> > Marc, I've just committed a fix which may have caused the bug. Can
>> you
>> >>> use
>> >>> > svn trunk (or the next nightly build) and confirm?
>> >>> >
>> >>> > On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള്‍ नोब्ळ् <
>> >>> > noble.p...@gmail.com> wrote:
>> >>> >
>> >>> >> looks like a bug w/ DIH with the recent fixes.
>> >>> >> --Noble
>> >>> >>
>> >>> >> On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese
>> >>> 
>> >>> >> wrote:
>> >>> >> >
>> >>> >> > Hey there,
>> >>> >> > I was using the Deduplication patch with Solr 1.3 release and
>> >>> >> everything
>> >>> >> was
>> >>> >> > working perfectly. Now I upgraded to a nigthly build (20th
>> december)
>> >>> to
>> >>> >> be
>> >>> >> > able to use new facet algorithm and other stuff and
>> DeDuplication
>> is
>> >>> >> not
>> >>> >> > working any more. I have followed exactly the same steps to
>> apply
>> >>> the
>> >>> >> patch
>> >>> >> > to the source code. I am geting this error:
>> >>> >> >
>> >>> >> > WARNING: Error reading data
>> >>> >> > com.mysql.jdbc.CommunicationsException: Communications link
>> failure
>> >>> due
>> >>> >> to
>> >>> >> > underlying exception:
>> >>> >> >
>> >>> >> > ** BEGIN NESTED EXCEPTION **
>> >>> >> >
>> >>> >> > java.io.EOFException
>> >>> >> >
>> >>> >> > STACKTRACE:
>> >>> >> >
>> >>> >> > java.io.EOFException
>> >>> >> >at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
>> >>> >> >at
>> >>> com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
>> >>> >> >at
>> com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
>> >>> >> >at
>> com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
>> >>> >> >at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
>> >>> >> >at
>> >>> >> com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
>> >>> >> >at
>> >>> com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
>> >>> >> >at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
>> >>> >> >at
>> >>> >> >
>> >>> >>
>> >>>
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
>> >>> >> >at
>> >>> >> >
>> >>> >>
>> >>>
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
>> >>> >> >at
>> >>> >> >
>> >>> >>
>> >>>
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
>> >>> >> >at
>> >>> >> >
>> >>> >>
>> >>>
>> org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
>> >>> >> >at
>> >>> >> >
>> >>> >>
>> >>>
>> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
>> >>> >> >at
>> >>> >> >
>> >>> >>
>> >>>
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
>> >>> >> >at
>> >>> >> >
>> >>> >>
>> >>>
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
>> >>> >> >at
>> >>> >> >
>> >>> >>
>> >>>
>> org.apache.s

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Shalin Shekhar Mangar
Yes, initially I figured that we were accidentally re-using a closed data
source, but Noble has pinned it right. I guess you can try looking into your
JDBC driver's documentation for a setting that keeps the connection alive
longer.
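
For example, with MySQL Connector/J those settings usually go on the JDBC URL in
the DIH data-config.xml. A rough sketch, assuming Connector/J (the URL property
names are driver options and should be verified against your driver version's
documentation; the host, database and credentials are placeholders):

  <!-- data-config.xml (sketch): keep a long streaming import from timing out.
       netTimeoutForStreamingResults raises MySQL's net_write_timeout for the
       session; autoReconnect only helps new statements, not a result set that
       is already being streamed. batchSize="-1" makes the DIH JdbcDataSource
       stream rows instead of buffering the whole result. -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb?netTimeoutForStreamingResults=7200&amp;autoReconnect=true"
              batchSize="-1"
              user="solr"
              password="secret"/>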

On Mon, Jan 5, 2009 at 5:29 PM, Noble Paul നോബിള്‍ नोब्ळ् <
noble.p...@gmail.com> wrote:

> I guess the indexing of a doc is taking too long (may be because of
> the de-dup patch) and the resultset gets closed automaticallly (timed
> out)
> --Noble
>
> On Mon, Jan 5, 2009 at 5:14 PM, Marc Sturlese 
> wrote:
> >
> > Donig this fix I get the same error :(
> >
> > I am going to try to set up the last nigthly build... let's see if I have
> > better luck.
> >
> > The thing is it stop indexing at the doc num 150.000 aprox... and give me
> > that mysql exception error... Without DeDuplication patch I can index 2
> > milion docs without problems...
> >
> > I am pretty lost with this... :(
> >
> >
> > Shalin Shekhar Mangar wrote:
> >>
> >> Yes I meant the 05/01/2008 build. The fix is a one line change
> >>
> >> Add the following as the last line of DataConfig.Entity.clearCache()
> >> dataSrc = null;
> >>
> >>
> >>
> >> On Mon, Jan 5, 2009 at 4:22 PM, Marc Sturlese
> >> wrote:
> >>
> >>>
> >>> Shalin you mean I should test the 05/01/2008 nighlty? maybe with this
> one
> >>> works? If the fix you did is not really big can u tell me where in the
> >>> source is and what is it for? (I have been debuging and tracing a lot
> the
> >>> dataimporthandler source and I I would like to know what the
> imporovement
> >>> is
> >>> about if it is not a problem...)
> >>>
> >>> Thanks!
> >>>
> >>>
> >>> Shalin Shekhar Mangar wrote:
> >>> >
> >>> > Marc, I've just committed a fix which may have caused the bug. Can
> you
> >>> use
> >>> > svn trunk (or the next nightly build) and confirm?
> >>> >
> >>> > On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള്‍ नोब्ळ् <
> >>> > noble.p...@gmail.com> wrote:
> >>> >
> >>> >> looks like a bug w/ DIH with the recent fixes.
> >>> >> --Noble
> >>> >>
> >>> >> On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese
> >>> 
> >>> >> wrote:
> >>> >> >
> >>> >> > Hey there,
> >>> >> > I was using the Deduplication patch with Solr 1.3 release and
> >>> >> everything
> >>> >> was
> >>> >> > working perfectly. Now I upgraded to a nigthly build (20th
> december)
> >>> to
> >>> >> be
> >>> >> > able to use new facet algorithm and other stuff and DeDuplication
> is
> >>> >> not
> >>> >> > working any more. I have followed exactly the same steps to apply
> >>> the
> >>> >> patch
> >>> >> > to the source code. I am geting this error:
> >>> >> >
> >>> >> > WARNING: Error reading data
> >>> >> > com.mysql.jdbc.CommunicationsException: Communications link
> failure
> >>> due
> >>> >> to
> >>> >> > underlying exception:
> >>> >> >
> >>> >> > ** BEGIN NESTED EXCEPTION **
> >>> >> >
> >>> >> > java.io.EOFException
> >>> >> >
> >>> >> > STACKTRACE:
> >>> >> >
> >>> >> > java.io.EOFException
> >>> >> >at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
> >>> >> >at
> >>> com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
> >>> >> >at
> com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
> >>> >> >at
> com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
> >>> >> >at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
> >>> >> >at
> >>> >> com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
> >>> >> >at
> >>> com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
> >>> >> >at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
> >>> >> >at
> >>> >> >
> >>> >>
> >>>
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
> >>> >> >at
> >>> >> >
> >>> >>
> >>>
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
> >>> >> >at
> >>> >> >
> >>> >>
> >>>
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
> >>> >> >at
> >>> >> >
> >>> >>
> >>>
> org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
> >>> >> >at
> >>> >> >
> >>> >>
> >>>
> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
> >>> >> >at
> >>> >> >
> >>> >>
> >>>
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
> >>> >> >at
> >>> >> >
> >>> >>
> >>>
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
> >>> >> >at
> >>> >> >
> >>> >>
> >>>
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
> >>> >> >at
> >>> >> >
> >>> >>
> >>>
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
> >>> >> >at
> >>> >> >
> >>> >>
> >>>
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
> >>> >> > 

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Noble Paul നോബിള്‍ नोब्ळ्
I guess the indexing of a doc is taking too long (maybe because of
the de-dup patch) and the result set gets closed automatically (timed
out).
--Noble

On Mon, Jan 5, 2009 at 5:14 PM, Marc Sturlese  wrote:
>
> Donig this fix I get the same error :(
>
> I am going to try to set up the last nigthly build... let's see if I have
> better luck.
>
> The thing is it stop indexing at the doc num 150.000 aprox... and give me
> that mysql exception error... Without DeDuplication patch I can index 2
> milion docs without problems...
>
> I am pretty lost with this... :(
>
>
> Shalin Shekhar Mangar wrote:
>>
>> Yes I meant the 05/01/2008 build. The fix is a one line change
>>
>> Add the following as the last line of DataConfig.Entity.clearCache()
>> dataSrc = null;
>>
>>
>>
>> On Mon, Jan 5, 2009 at 4:22 PM, Marc Sturlese
>> wrote:
>>
>>>
>>> Shalin you mean I should test the 05/01/2008 nighlty? maybe with this one
>>> works? If the fix you did is not really big can u tell me where in the
>>> source is and what is it for? (I have been debuging and tracing a lot the
>>> dataimporthandler source and I I would like to know what the imporovement
>>> is
>>> about if it is not a problem...)
>>>
>>> Thanks!
>>>
>>>
>>> Shalin Shekhar Mangar wrote:
>>> >
>>> > Marc, I've just committed a fix which may have caused the bug. Can you
>>> use
>>> > svn trunk (or the next nightly build) and confirm?
>>> >
>>> > On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള്‍ नोब्ळ् <
>>> > noble.p...@gmail.com> wrote:
>>> >
>>> >> looks like a bug w/ DIH with the recent fixes.
>>> >> --Noble
>>> >>
>>> >> On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese
>>> 
>>> >> wrote:
>>> >> >
>>> >> > Hey there,
>>> >> > I was using the Deduplication patch with Solr 1.3 release and
>>> >> everything
>>> >> was
>>> >> > working perfectly. Now I upgraded to a nigthly build (20th december)
>>> to
>>> >> be
>>> >> > able to use new facet algorithm and other stuff and DeDuplication is
>>> >> not
>>> >> > working any more. I have followed exactly the same steps to apply
>>> the
>>> >> patch
>>> >> > to the source code. I am geting this error:
>>> >> >
>>> >> > WARNING: Error reading data
>>> >> > com.mysql.jdbc.CommunicationsException: Communications link failure
>>> due
>>> >> to
>>> >> > underlying exception:
>>> >> >
>>> >> > ** BEGIN NESTED EXCEPTION **
>>> >> >
>>> >> > java.io.EOFException
>>> >> >
>>> >> > STACKTRACE:
>>> >> >
>>> >> > java.io.EOFException
>>> >> >at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
>>> >> >at
>>> com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
>>> >> >at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
>>> >> >at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
>>> >> >at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
>>> >> >at
>>> >> com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
>>> >> >at
>>> com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
>>> >> >at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
>>> >> >at
>>> >> >
>>> >>
>>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
>>> >> >at
>>> >> >
>>> >>
>>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
>>> >> >at
>>> >> >
>>> >>
>>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
>>> >> >at
>>> >> >
>>> >>
>>> org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
>>> >> >at
>>> >> >
>>> >>
>>> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
>>> >> >at
>>> >> >
>>> >>
>>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
>>> >> >at
>>> >> >
>>> >>
>>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
>>> >> >at
>>> >> >
>>> >>
>>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
>>> >> >at
>>> >> >
>>> >>
>>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
>>> >> >at
>>> >> >
>>> >>
>>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
>>> >> >at
>>> >> >
>>> >>
>>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
>>> >> >
>>> >> >
>>> >> > ** END NESTED EXCEPTION **
>>> >> > Last packet sent to the server was 202481 ms ago.
>>> >> >at
>>> com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
>>> >> >at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
>>> >> >at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
>>> >> >at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
>>> >> >at
>>> >> com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
>>> >> >at
>>> 

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Marc Sturlese

Doing this fix I get the same error :(

I am going to try to set up the last nightly build... let's see if I have
better luck.

The thing is it stops indexing at around doc number 150,000... and gives me
that mysql exception error... Without the DeDuplication patch I can index 2
million docs without problems...

I am pretty lost with this... :(


Shalin Shekhar Mangar wrote:
> 
> Yes I meant the 05/01/2008 build. The fix is a one line change
> 
> Add the following as the last line of DataConfig.Entity.clearCache()
> dataSrc = null;
> 
> 
> 
> On Mon, Jan 5, 2009 at 4:22 PM, Marc Sturlese
> wrote:
> 
>>
>> Shalin you mean I should test the 05/01/2008 nighlty? maybe with this one
>> works? If the fix you did is not really big can u tell me where in the
>> source is and what is it for? (I have been debuging and tracing a lot the
>> dataimporthandler source and I I would like to know what the imporovement
>> is
>> about if it is not a problem...)
>>
>> Thanks!
>>
>>
>> Shalin Shekhar Mangar wrote:
>> >
>> > Marc, I've just committed a fix which may have caused the bug. Can you
>> use
>> > svn trunk (or the next nightly build) and confirm?
>> >
>> > On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള്‍ नोब्ळ् <
>> > noble.p...@gmail.com> wrote:
>> >
>> >> looks like a bug w/ DIH with the recent fixes.
>> >> --Noble
>> >>
>> >> On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese
>> 
>> >> wrote:
>> >> >
>> >> > Hey there,
>> >> > I was using the Deduplication patch with Solr 1.3 release and
>> >> everything
>> >> was
>> >> > working perfectly. Now I upgraded to a nigthly build (20th december)
>> to
>> >> be
>> >> > able to use new facet algorithm and other stuff and DeDuplication is
>> >> not
>> >> > working any more. I have followed exactly the same steps to apply
>> the
>> >> patch
>> >> > to the source code. I am geting this error:
>> >> >
>> >> > WARNING: Error reading data
>> >> > com.mysql.jdbc.CommunicationsException: Communications link failure
>> due
>> >> to
>> >> > underlying exception:
>> >> >
>> >> > ** BEGIN NESTED EXCEPTION **
>> >> >
>> >> > java.io.EOFException
>> >> >
>> >> > STACKTRACE:
>> >> >
>> >> > java.io.EOFException
>> >> >at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
>> >> >at
>> com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
>> >> >at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
>> >> >at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
>> >> >at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
>> >> >at
>> >> com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
>> >> >at
>> com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
>> >> >at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
>> >> >at
>> >> >
>> >>
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
>> >> >at
>> >> >
>> >>
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
>> >> >at
>> >> >
>> >>
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
>> >> >at
>> >> >
>> >>
>> org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
>> >> >at
>> >> >
>> >>
>> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
>> >> >at
>> >> >
>> >>
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
>> >> >at
>> >> >
>> >>
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
>> >> >at
>> >> >
>> >>
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
>> >> >at
>> >> >
>> >>
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
>> >> >at
>> >> >
>> >>
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
>> >> >at
>> >> >
>> >>
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
>> >> >
>> >> >
>> >> > ** END NESTED EXCEPTION **
>> >> > Last packet sent to the server was 202481 ms ago.
>> >> >at
>> com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
>> >> >at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
>> >> >at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
>> >> >at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
>> >> >at
>> >> com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
>> >> >at
>> com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
>> >> >at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
>> >> >at
>> >> >
>> >>
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
>> >> >at
>> >> >
>> >>
>> org.apache.solr.handler.dataimport.JdbcDataSource$Resul

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Shalin Shekhar Mangar
Yes, I meant the 05/01/2008 build. The fix is a one-line change.

Add the following as the last line of DataConfig.Entity.clearCache():
dataSrc = null;
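
In context, the change looks roughly like the sketch below. Only the added last
line is the actual fix; the surrounding method body is just an illustration, not
the real source:

  // DataConfig.Entity (illustrative sketch; only the last line is the actual fix)
  public void clearCache() {
      // ... existing cache-clearing work ...
      // the fix: drop the cached data source reference so a closed/stale
      // data source is not accidentally re-used on the next import
      dataSrc = null;
  }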



On Mon, Jan 5, 2009 at 4:22 PM, Marc Sturlese wrote:

>
> Shalin you mean I should test the 05/01/2008 nighlty? maybe with this one
> works? If the fix you did is not really big can u tell me where in the
> source is and what is it for? (I have been debuging and tracing a lot the
> dataimporthandler source and I I would like to know what the imporovement
> is
> about if it is not a problem...)
>
> Thanks!
>
>
> Shalin Shekhar Mangar wrote:
> >
> > Marc, I've just committed a fix which may have caused the bug. Can you
> use
> > svn trunk (or the next nightly build) and confirm?
> >
> > On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള്‍ नोब्ळ् <
> > noble.p...@gmail.com> wrote:
> >
> >> looks like a bug w/ DIH with the recent fixes.
> >> --Noble
> >>
> >> On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese 
> >> wrote:
> >> >
> >> > Hey there,
> >> > I was using the Deduplication patch with Solr 1.3 release and
> >> everything
> >> was
> >> > working perfectly. Now I upgraded to a nigthly build (20th december)
> to
> >> be
> >> > able to use new facet algorithm and other stuff and DeDuplication is
> >> not
> >> > working any more. I have followed exactly the same steps to apply the
> >> patch
> >> > to the source code. I am geting this error:
> >> >
> >> > WARNING: Error reading data
> >> > com.mysql.jdbc.CommunicationsException: Communications link failure
> due
> >> to
> >> > underlying exception:
> >> >
> >> > ** BEGIN NESTED EXCEPTION **
> >> >
> >> > java.io.EOFException
> >> >
> >> > STACKTRACE:
> >> >
> >> > java.io.EOFException
> >> >at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
> >> >at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
> >> >at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
> >> >at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
> >> >at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
> >> >at
> >> com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
> >> >at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
> >> >at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
> >> >at
> >> >
> >>
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
> >> >at
> >> >
> >>
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
> >> >at
> >> >
> >>
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
> >> >at
> >> >
> >>
> org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
> >> >at
> >> >
> >>
> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
> >> >at
> >> >
> >>
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
> >> >at
> >> >
> >>
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
> >> >at
> >> >
> >>
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
> >> >at
> >> >
> >>
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
> >> >at
> >> >
> >>
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
> >> >at
> >> >
> >>
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
> >> >
> >> >
> >> > ** END NESTED EXCEPTION **
> >> > Last packet sent to the server was 202481 ms ago.
> >> >at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
> >> >at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
> >> >at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
> >> >at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
> >> >at
> >> com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
> >> >at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
> >> >at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
> >> >at
> >> >
> >>
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
> >> >at
> >> >
> >>
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
> >> >at
> >> >
> >>
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
> >> >at
> >> >
> >>
> org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
> >> >at
> >> >
> >>
> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
> >> >at
> >> >
> >>
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Marc Sturlese

Shalin, you mean I should test the 05/01/2008 nightly? Maybe it works with this
one? If the fix you did is not really big, can you tell me where in the
source it is and what it is for? (I have been debugging and tracing the
dataimporthandler source a lot and I would like to know what the improvement is
about, if it is not a problem...)

Thanks!


Shalin Shekhar Mangar wrote:
> 
> Marc, I've just committed a fix which may have caused the bug. Can you use
> svn trunk (or the next nightly build) and confirm?
> 
> On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള്‍ नोब्ळ् <
> noble.p...@gmail.com> wrote:
> 
>> looks like a bug w/ DIH with the recent fixes.
>> --Noble
>>
>> On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese 
>> wrote:
>> >
>> > Hey there,
>> > I was using the Deduplication patch with Solr 1.3 release and
>> everything
>> was
>> > working perfectly. Now I upgraded to a nigthly build (20th december) to
>> be
>> > able to use new facet algorithm and other stuff and DeDuplication is
>> not
>> > working any more. I have followed exactly the same steps to apply the
>> patch
>> > to the source code. I am geting this error:
>> >
>> > WARNING: Error reading data
>> > com.mysql.jdbc.CommunicationsException: Communications link failure due
>> to
>> > underlying exception:
>> >
>> > ** BEGIN NESTED EXCEPTION **
>> >
>> > java.io.EOFException
>> >
>> > STACKTRACE:
>> >
>> > java.io.EOFException
>> >at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
>> >at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
>> >at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
>> >at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
>> >at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
>> >at
>> com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
>> >at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
>> >at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
>> >at
>> >
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
>> >at
>> >
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
>> >at
>> >
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
>> >at
>> >
>> org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
>> >at
>> >
>> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
>> >at
>> >
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
>> >at
>> >
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
>> >at
>> >
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
>> >at
>> >
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
>> >at
>> >
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
>> >at
>> >
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
>> >
>> >
>> > ** END NESTED EXCEPTION **
>> > Last packet sent to the server was 202481 ms ago.
>> >at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
>> >at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
>> >at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
>> >at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
>> >at
>> com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
>> >at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
>> >at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
>> >at
>> >
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
>> >at
>> >
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
>> >at
>> >
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
>> >at
>> >
>> org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
>> >at
>> >
>> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
>> >at
>> >
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
>> >at
>> >
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
>> >at
>> >
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
>> >at
>> >
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
>> >at
>> >
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
>> >at
>> >
>> org.apache.solr.handler.dataimport.DataImporter$1

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Marc Sturlese


Yeah, looks like it, but... if I don't use the DeDuplication patch everything
works perfectly. I can create my indexes using full import and delta import
without problems. The JdbcDataSource of the nightly is pretty similar to the
1.3 release...
The DeDuplication patch doesn't touch the dataimporthandler classes... that's
why I thought the problem was not there (but I can't say that for sure...)

I was thinking that the problem has something to do with the
UpdateRequestProcessorChain but don't know how this part of the source
works...
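
As far as I understand it, the patch hooks into indexing exactly there: a
SignatureUpdateProcessorFactory is registered in an updateRequestProcessorChain
in solrconfig.xml, in front of RunUpdateProcessorFactory, and the request
handler doing the import has to point at that chain (via the update.processor
parameter in this version, if I remember right). A rough sketch of such a chain
(the field list, signature field and options are placeholders taken from the
wiki description of the SOLR-799 patch, so the exact names may differ):

  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">signature</str>
      <bool name="overwriteDupes">true</bool>
      <str name="fields">title,body</str>
      <str name="signatureClass">solr.processor.Lookup3Signature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

Every imported document then gets its signature computed in that chain before it
is handed to the indexer, which is also why a slow signature step can stretch out
how long the JDBC result set stays open.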

I am really interested in updating to the nightly build as I think the new facet
algorithm and SolrDeletionPolicy are really great stuff!

>>Marc, I've just committed a fix which may have caused the bug. Can you use
>>svn trunk (or the next nightly build) and confirm? 
You mean the last nightly build?

Thanks


Noble Paul നോബിള്‍ नोब्ळ् wrote:
> 
> looks like a bug w/ DIH with the recent fixes.
> --Noble
> 
> On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese 
> wrote:
>>
>> Hey there,
>> I was using the Deduplication patch with Solr 1.3 release and everything
>> was
>> working perfectly. Now I upgraded to a nigthly build (20th december) to
>> be
>> able to use new facet algorithm and other stuff and DeDuplication is not
>> working any more. I have followed exactly the same steps to apply the
>> patch
>> to the source code. I am geting this error:
>>
>> WARNING: Error reading data
>> com.mysql.jdbc.CommunicationsException: Communications link failure due
>> to
>> underlying exception:
>>
>> ** BEGIN NESTED EXCEPTION **
>>
>> java.io.EOFException
>>
>> STACKTRACE:
>>
>> java.io.EOFException
>>at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
>>at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
>>at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
>>at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
>>at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
>>at
>> com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
>>at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
>>at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
>>at
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
>>at
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
>>at
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
>>at
>> org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
>>at
>> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
>>at
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
>>at
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
>>at
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
>>
>>
>> ** END NESTED EXCEPTION **
>> Last packet sent to the server was 202481 ms ago.
>>at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
>>at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
>>at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
>>at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
>>at
>> com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
>>at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
>>at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
>>at
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
>>at
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
>>at
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
>>at
>> org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
>>at
>> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
>>at
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
>>at
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
>>at
>> org.apache.s

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Marc Sturlese

Yeah, looks like it, but... if I don't use the DeDuplication patch everything
works perfectly. I can create my indexes using full import and delta import
without problems. The JdbcDataSource of the nightly is pretty similar to the
1.3 release...
The DeDuplication patch doesn't touch the dataimporthandler classes... that's
why I thought the problem was not there (but I can't say that for sure...)

I was thinking that the problem has something to do with the
UpdateRequestProcessorChain but don't know how this part of the source
works...

Any advice on how I could sort this out? I am really interested in updating to the
nightly build as I think the new facet algorithm and SolrDeletionPolicy are
really great stuff!

Thanks


Noble Paul നോബിള്‍ नोब्ळ् wrote:
> 
> looks like a bug w/ DIH with the recent fixes.
> --Noble
> 
> On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese 
> wrote:
>>
>> Hey there,
>> I was using the Deduplication patch with Solr 1.3 release and everything
>> was
>> working perfectly. Now I upgraded to a nigthly build (20th december) to
>> be
>> able to use new facet algorithm and other stuff and DeDuplication is not
>> working any more. I have followed exactly the same steps to apply the
>> patch
>> to the source code. I am geting this error:
>>
>> WARNING: Error reading data
>> com.mysql.jdbc.CommunicationsException: Communications link failure due
>> to
>> underlying exception:
>>
>> ** BEGIN NESTED EXCEPTION **
>>
>> java.io.EOFException
>>
>> STACKTRACE:
>>
>> java.io.EOFException
>>at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
>>at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
>>at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
>>at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
>>at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
>>at
>> com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
>>at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
>>at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
>>at
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
>>at
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
>>at
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
>>at
>> org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
>>at
>> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
>>at
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
>>at
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
>>at
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
>>
>>
>> ** END NESTED EXCEPTION **
>> Last packet sent to the server was 202481 ms ago.
>>at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
>>at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
>>at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
>>at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
>>at
>> com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
>>at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
>>at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
>>at
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
>>at
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
>>at
>> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
>>at
>> org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
>>at
>> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
>>at
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
>>at
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
>>at
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
>> Jan 5, 2009 10:06:16 AM org.apache.solr.handler.dataimport.JdbcData

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Shalin Shekhar Mangar
Marc, I've just committed a fix for what may have caused the bug. Can you use
svn trunk (or the next nightly build) and confirm?

On Mon, Jan 5, 2009 at 3:10 PM, Noble Paul നോബിള്‍ नोब्ळ् <
noble.p...@gmail.com> wrote:

> looks like a bug w/ DIH with the recent fixes.
> --Noble
>
> On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese 
> wrote:
> >
> > Hey there,
> > I was using the Deduplication patch with Solr 1.3 release and everything
> was
> > working perfectly. Now I upgraded to a nigthly build (20th december) to
> be
> > able to use new facet algorithm and other stuff and DeDuplication is not
> > working any more. I have followed exactly the same steps to apply the
> patch
> > to the source code. I am geting this error:
> >
> > WARNING: Error reading data
> > com.mysql.jdbc.CommunicationsException: Communications link failure due
> to
> > underlying exception:
> >
> > ** BEGIN NESTED EXCEPTION **
> >
> > java.io.EOFException
> >
> > STACKTRACE:
> >
> > java.io.EOFException
> >at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
> >at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
> >at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
> >at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
> >at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
> >at
> com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
> >at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
> >at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
> >at
> >
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
> >at
> >
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
> >at
> >
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
> >at
> >
> org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
> >at
> >
> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
> >at
> >
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
> >at
> >
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
> >at
> >
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
> >at
> >
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
> >at
> >
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
> >at
> >
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
> >
> >
> > ** END NESTED EXCEPTION **
> > Last packet sent to the server was 202481 ms ago.
> >at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
> >at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
> >at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
> >at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
> >at
> com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
> >at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
> >at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
> >at
> >
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
> >at
> >
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
> >at
> >
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
> >at
> >
> org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
> >at
> >
> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
> >at
> >
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
> >at
> >
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
> >at
> >
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
> >at
> >
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
> >at
> >
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
> >at
> >
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
> > Jan 5, 2009 10:06:16 AM org.apache.solr.handler.dataimport.JdbcDataSource
> > logError
> > WARNING: Exception while closing result set
> > com.mysql.jdbc.CommunicationsException: Communications link failure due
> to
> > underlying exception:
> >
> > ** BEGIN NESTED EXCEPTION **
> >
> > java.io.EOFException
> >
> > STACKTRACE:
> >
> > java.io.EOFException
> >at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
> >at com.mysql.jdbc.MysqlIO.reuseAndReadPa

Re: Deduplication patch not working in nightly build

2009-01-05 Thread Noble Paul നോബിള്‍ नोब्ळ्
looks like a bug w/ DIH with the recent fixes.
--Noble

On Mon, Jan 5, 2009 at 2:36 PM, Marc Sturlese  wrote:
>
> Hey there,
> I was using the Deduplication patch with Solr 1.3 release and everything was
> working perfectly. Now I upgraded to a nigthly build (20th december) to be
> able to use new facet algorithm and other stuff and DeDuplication is not
> working any more. I have followed exactly the same steps to apply the patch
> to the source code. I am geting this error:
>
> WARNING: Error reading data
> com.mysql.jdbc.CommunicationsException: Communications link failure due to
> underlying exception:
>
> ** BEGIN NESTED EXCEPTION **
>
> java.io.EOFException
>
> STACKTRACE:
>
> java.io.EOFException
>at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
>at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2404)
>at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
>at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
>at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
>at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
>at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
>at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
>at
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
>at
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
>at
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
>at
> org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
>at
> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
>at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
>at
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
>at
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
>at
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
>at
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
>at
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
>
>
> ** END NESTED EXCEPTION **
> Last packet sent to the server was 202481 ms ago.
>at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2563)
>at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
>at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
>at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
>at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
>at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
>at com.mysql.jdbc.ResultSet.next(ResultSet.java:6144)
>at
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:294)
>at
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.access$400(JdbcDataSource.java:189)
>at
> org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator$1.hasNext(JdbcDataSource.java:225)
>at
> org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:229)
>at
> org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:76)
>at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:351)
>at
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:193)
>at
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:144)
>at
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:334)
>at
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:407)
>at
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:388)
> Jan 5, 2009 10:06:16 AM org.apache.solr.handler.dataimport.JdbcDataSource
> logError
> WARNING: Exception while closing result set
> com.mysql.jdbc.CommunicationsException: Communications link failure due to
> underlying exception:
>
> ** BEGIN NESTED EXCEPTION **
>
> java.io.EOFException
>
> STACKTRACE:
>
> java.io.EOFException
>at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:1905)
>at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:2351)
>at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:2862)
>at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:771)
>at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1289)
>at com.mysql.jdbc.RowDataDynamic.nextRecord(RowDataDynamic.java:362)
>at com.mysql.jdbc.RowDataDynamic.next(RowDataDynamic.java:352)
>at com.mysql.jdbc.RowDataDynamic.close(RowDataDynamic.java:150)
>at com.mysql.jdbc.R