How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
To search for duplicate IDs, I am running the following query: select?q=*:*&facet=true&facet.field=id&rows=0 However, since upgrading from Solr 4.1 to Solr 4.3 I am receiving OutOfMemoryError errors instead of the desired facet: java.lang.OutOfMemoryError: Java heap space java.lang.RuntimeExceptio

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Aloke Ghoshal
Does adding facet.mincount=2 help? On Tue, Jul 30, 2013 at 11:46 PM, Dotan Cohen wrote: > To search for duplicate IDs, I am running the following query: > select?q=*:*&facet=true&facet.field=id&rows=0 > > However, since upgrading from Solr 4.1 to Solr 4.3 I am receiving > OutOfMemoryError error

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Shawn Heisey
On 7/30/2013 12:16 PM, Dotan Cohen wrote: To search for duplicate IDs, I am running the following query: select?q=*:*&facet=true&facet.field=id&rows=0 However, since upgrading from Solr 4.1 to Solr 4.3 I am receiving OutOfMemoryError errors instead of the desired facet: Might there be a les

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Michael Della Bitta
Are you talking about the document's ID field? If so, you can't have duplicates... the latter document would overwrite the earlier. If not, sorry for asking irrelevant questions. :)

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 9:21 PM, Aloke Ghoshal wrote: > Does adding facet.mincount=2 help? > > In fact, when adding facet.mincount=20 (I know that some dupes are in the hundreds) I got the OutOfMemoryError in seconds instead of minutes.

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 9:23 PM, Michael Della Bitta wrote: > Are you talking about the document's ID field? > > If so, you can't have duplicates... the latter document would overwrite the > earlier. > > If not, sorry for asking irrelevant questions. :) > In Solr 4.1 we were using overwrite=false

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Michael Della Bitta
Since this is a one-time problem, have you thought of just dumping all the IDs and looking for dupes using sort and awk or something similar?
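
The dump-and-scan idea can be sketched roughly as follows. This is only an illustration, not something posted in the thread: it assumes the IDs have already been exported to a sorted stream (for example by `sort`), and the function name `find_duplicate_ids` and the sample data are hypothetical.

```python
from itertools import groupby
from typing import Iterable, Iterator, Tuple


def find_duplicate_ids(sorted_ids: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (id, count) for every ID that occurs more than once.

    Assumes the input is already sorted, so equal IDs are adjacent and
    only one run of equal IDs is examined at a time.
    """
    for doc_id, group in groupby(sorted_ids):
        count = sum(1 for _ in group)
        if count > 1:
            yield doc_id, count


if __name__ == "__main__":
    # Hypothetical sample standing in for a sorted dump of the id field.
    sample = ["a1", "a1", "b2", "c3", "c3", "c3"]
    for doc_id, count in find_duplicate_ids(sample):
        print(doc_id, count)
```

Because the function consumes an iterator, a sorted dump file could be streamed through it line by line (stripping the trailing newline from each line) without loading every ID into memory at once.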

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 9:24 PM, Shawn Heisey wrote: > Add &facet.method=enum to the query URL. This will cause Solr to enumerate > the facet information on every query rather than load it into the field > cache, which takes a lot of memory. Solr 4.1 was probably very close to > running out of m
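
For reference, a minimal sketch of how the original facet query might look with facet.method=enum added. The helper names are hypothetical, and the use of wt=json with Solr's flat [term, count, term, count, ...] facet list is an assumption for illustration, not something stated in the thread.

```python
from urllib.parse import urlencode


def build_dupe_facet_query(field="id", mincount=2, method="enum"):
    """Build the relative query string for a facet-based duplicate search.

    facet.method=enum is the setting suggested in the thread; wt=json is an
    assumption made here so the response is easy to parse.
    """
    params = [
        ("q", "*:*"),
        ("rows", "0"),
        ("facet", "true"),
        ("facet.field", field),
        ("facet.method", method),
        ("facet.mincount", str(mincount)),
        ("wt", "json"),
    ]
    return "select?" + urlencode(params)


def pair_up_facet_counts(flat):
    """Turn a flat [term, count, term, count, ...] list into a dict.

    Assumes the flat list layout of a JSON facet response.
    """
    return dict(zip(flat[0::2], flat[1::2]))


if __name__ == "__main__":
    print(build_dupe_facet_query())
    # Hypothetical facet_fields["id"] payload, not real output.
    print(pair_up_facet_counts(["a1", 3, "b2", 2]))
```

Any ID left in the resulting dict with facet.mincount=2 would be one that appears in at least two documents.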

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 9:43 PM, Michael Della Bitta wrote: > Since this is a one-time problem, have you thought of just dumping all the > IDs and looking for dupes using sort and awk or something similar to that? > All 100,000,000 of them :) That would take even longer! Also, I fear that this is

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Shawn Heisey
On 7/30/2013 12:49 PM, Dotan Cohen wrote: Thanks, the query ran for almost 2 full minutes but it returned results! I'll google for how to increase the disk cache for queries like this. Other than the Qtime, is there no way to judge the amount of memory required for a particular query to run? T

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Mikhail Khludnev
Dotan, could you please provide more lines of the stack trace? I have no idea why it got worse in 4.3. I know that 4.3 can use facets backed by DocValues, which are modest on the heap. But from what I saw (I may be wrong), it's disabled for numeric facets. Hence, I suggest reindexing id as s

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Jack Krupansky
Krupansky -Original Message- From: Dotan Cohen Sent: Tuesday, July 30, 2013 2:16 PM To: solr-user@lucene.apache.org Subject: How might one search for dupe IDs other than faceting on the ID field? To search for duplicate IDs, I am running the following query: select?q=*:*&facet=

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Jack Krupansky
Jack Krupansky -Original Message- From: Jack Krupansky Sent: Tuesday, July 30, 2013 4:14 PM To: solr-user@lucene.apache.org Subject: Re: How might one search for dupe IDs other than faceting on the ID field? The Solr SignatureUpdateProcessorFactory is designed to facilitate dedupe... any particu

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Bill Bell
This seems like a fairly large issue. Can you create a Jira issue? Bill Bell On Jul 30, 2013, at 12:34 PM, Dotan Cohen wrote: > On Tue, Jul 30, 2013 at 9:21 PM, Aloke Ghoshal wrote: >> Does adding facet.mincount=2 help? >> >> > > In fact, when adding facet.mincount=20 (I

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 9:56 PM, Shawn Heisey wrote: > On 7/30/2013 12:49 PM, Dotan Cohen wrote: >> >> Thanks, the query ran for almost 2 full minutes but it returned >> results! I'll google for how to increase the disk cache for queries >> like this. Other than the Qtime, is there no way to judg

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 11:00 PM, Mikhail Khludnev wrote: > Dotan, > > Could you please provide more lines of the stack trace? Sure, thanks: java.lang.OutOfMemoryError: Java heap space java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space at org.apache.solr.servlet.SolrDispat

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Tue, Jul 30, 2013 at 11:14 PM, Jack Krupansky wrote: > The Solr SignatureUpdateProcessorFactory is designed to facilitate dedupe... > any particular reason you did not use it? > > See: > http://wiki.apache.org/solr/Deduplication > > and > > https://cwiki.apache.org/confluence/display/solr/De-Du

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Wed, Jul 31, 2013 at 12:48 AM, Jack Krupansky wrote: > You could also try the terms component which provides a very efficient > facet-like feature - counting the terms. And you can set a minimum term > frequency of 2, so only the dups would come back: > > curl "http://localhost:8983/solr/terms?
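
As a rough sketch of the terms-component approach: the curl command in the message above is cut off, so the parameters below are not taken from it. They are the standard TermsComponent parameters (terms.fl, terms.mincount, terms.limit), and the helper name, the default limit of 100, and wt=json are assumptions for illustration. It also assumes a /terms request handler is configured on the core.

```python
from urllib.parse import urlencode


def build_terms_query(base="http://localhost:8983/solr/terms",
                      field="id", mincount=2, limit=100):
    """Build a terms-component URL that returns only terms whose
    document frequency is at least `mincount`.

    The base URL matches the host and path visible in the thread; the
    query parameters are assumptions, since the original is truncated.
    """
    params = [
        ("terms.fl", field),              # field whose terms are counted
        ("terms.mincount", str(mincount)),  # 2 means "duplicates only"
        ("terms.limit", str(limit)),      # cap on the number of terms returned
        ("wt", "json"),
    ]
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_terms_query())
```

The printed URL can be passed to curl or any HTTP client; raising the hypothetical `limit` argument returns more duplicate IDs per request.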

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-30 Thread Dotan Cohen
On Wed, Jul 31, 2013 at 4:56 AM, Bill Bell wrote: > On Jul 30, 2013, at 12:34 PM, Dotan Cohen wrote: >> On Tue, Jul 30, 2013 at 9:21 PM, Aloke Ghoshal wrote: >>> Does adding facet.mincount=2 help? >> >> In fact, when adding facet.mincount=20 (I know that some dupes are in >> the hundreds) I got

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-31 Thread Mikhail Khludnev
fwiw, this code won't capture uncommitted duplicates. On Wed, Jul 31, 2013 at 9:41 AM, Dotan Cohen wrote: > On Tue, Jul 30, 2013 at 11:14 PM, Jack Krupansky > wrote: > > The Solr SignatureUpdateProcessorFactory is designed to facilitate > dedupe... > > any particular reason you did not use it

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-31 Thread Jack Krupansky
Good to note! But... any "search" will not detect dupe IDs for uncommitted documents. -- Jack Krupansky -Original Message- From: Mikhail Khludnev Sent: Wednesday, July 31, 2013 6:11 AM To: solr-user Subject: Re: How might one search for dupe IDs other than faceting on the

Re: How might one search for dupe IDs other than faceting on the ID field?

2013-07-31 Thread Shawn Heisey
On 7/30/2013 11:22 PM, Dotan Cohen wrote: > I see, thanks. I thought that 'disk cache' was something on disk, such > as swap space. The server is already maxed out on RAM: > $ free -m > total used free shared buffers cached > Mem: 14980 14906 7