On Mon, Sep 19, 2011 at 7:19 AM, Johan De Meersman <vegiv...@tuxera.be>wrote:
> ----- Original Message ----- > > From: "Hank" <hes...@gmail.com> > > > > While running a -rq on a large table, I got the following error: > > > > myisamchk: warning: Duplicate key for record at 54381140 against > > record at 54380810 > > > > How do I find which records are duplicated (without doing the typical > > self-join or "having cnt(*)>1" query)? This table has 144 million > > rows, so that's not really feasible. > > Given that the error is a duplicate *key*, "select <key> from <table> group > by <key> having count(<key>) > 1" is an index-covered query, and should thus > be perfectly feasible :-) > > What I'm not so sure about, is wether the duplicate key will show up > correctly in the index - as that index may be marked corrupt - and so, if it > falls back to a full tablescan, it's indeed going to take a long time. If it > does, however, there's no other option anyway: the only way to do it fast is > an index, and that index is untrustworthy. Exactly - I can't create an index on the table until I remove the duplicate records. I suppose I could create a non-unique index on the key fields, and try that, but then I'd have to (1) create that index, and then (2) do the full table scan query. Either way, it's going to take a tremendous amount of time to do that. Alternatively, it would be most helpful if the tools provided that find and output the offending record block #s also provided a quick way to actually print out those offending rows so I could track down how they got in there in the first place. -Hank