Sorry for the delayed answer.
On Mon, 10 Oct 2011, Jean-Francois Dockes wrote:
It would seem that there is some file in your document set which is
crashing recoll. We need to determine which it is, get it out of the
indexed set so that you can begin to use recoll again, and if at all
possible, I would very much like to get a copy to fix the bug (if this
is confidential data, we'll try other ways to get details about the
issue).
For a beginning, we need to have a look at the log file before the point
where recoll crashes.
I rebuilt the package with noopt,nostrip,debug. I debugged it down to
recoll-1.13.04/common/unacpp.cpp, function unacmaybefold(). It is called
with dofold = true. unacfold_string() returns -1 with errno set to 12
(ENOMEM). Then unacmaybefold() goes on to format an error message:
45 if (status < 0) {
46 if (cout)
47 free(cout);
48 char cerrno[20];
49 sprintf(cerrno, "%d", errno);
50 out = string("unac_string failed, errno : ") + cerrno;
51 return false;
52 }
However, on line 50 the string concatenation itself runs out of memory
(not surprising after unacfold_string() already failed with ENOMEM), and
that is the source of the std::bad_alloc exception object.
This happens right after a million or so lines of
:3:../rcldb/rcldb.cpp:813:Db::splitter::takeword: unac failed for [...]
have been printed; during that phase the VSZ of the recollindex process
grows constantly. When the process finally reaches the code above, the
VSZ is 1,521,648 KB.
I followed unacfold_string() to unacmaybefold_string() and started to
suspect that it leaks somewhere. The code was very hard to follow in
gdb/ddd (I guess some optimization remained enabled, because the line
numbers kept jumping around and it was hard to set breakpoints). After
a while I got tired and started it under valgrind, and thankfully valgrind
completed the top of the stack: it is indeed convert(), called by
unacmaybefold_string(), that leaks an iconv() conversion descriptor (and
therefore, memory) in the error path(s). (I think it's very wasteful to
open/close a descriptor for the same conversion thousands of times, but I
digress.)
I identified the file that caused this huge number of conversion errors --
it's a Maildir file with a zip and a rar attachment. Both compressed files
have the same contents: two latin2-encoded text files (tables, actually),
1.3 and 1.4 MB in size. In total, 5.4 MB of latin2-encoded text caused
90,228 conversion failures (and presumably leaked the same number of
conversion descriptors).
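As a sketch of the "don't reopen the descriptor thousands of times" aside above (a hypothetical helper, not part of the proposed patch): a descriptor can be cached for the whole run and reset between inputs, since passing all-NULL buffers to iconv() resets its shift state.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: keep one latin2->UTF-8 descriptor alive instead
 * of opening and closing one per string. iconv(cd, NULL, NULL, NULL,
 * NULL) resets the descriptor's conversion state, so it is safe to
 * reuse it for the next independent input. */
static iconv_t cached_cd = (iconv_t)-1;

iconv_t get_latin2_to_utf8(void)
{
    if (cached_cd == (iconv_t)-1)
        cached_cd = iconv_open("UTF-8", "ISO-8859-2");   /* may still fail */
    else
        iconv(cached_cd, NULL, NULL, NULL, NULL);        /* reset state */
    return cached_cd;
}
```

This would sidestep the open/close cost entirely, at the price of a little global state; the patch below is the minimal fix for the leak itself.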
The following patch fixed my problem. VSZ peaks around 160 MB.
Laszlo
--- build/recoll-1.13.04/unac/unac.c 2010-01-30 08:58:40.000000000 +0100
+++ build2/recoll-1.13.04/unac/unac.c 2011-10-11 23:05:21.000000000 +0200
@@ -10661,7 +10661,7 @@ static int convert(const char* from, con
if(errno == E2BIG)
/* fall thru to the E2BIG case below */;
else
- return -1;
+ goto err;
} else {
/* The offending character was replaced by a SPACE, skip it. */
in += 2;
@@ -10670,7 +10670,7 @@ static int convert(const char* from, con
break;
}
} else {
- return -1;
+ goto err;
}
case E2BIG:
{
@@ -10690,7 +10690,7 @@ static int convert(const char* from, con
DEBUG("realloc %d bytes failed\n", out_size+1);
free(saved);
*outp = 0;
- return -1;
+ goto err;
}
}
out = out_base + length;
@@ -10698,7 +10698,7 @@ static int convert(const char* from, con
}
break;
default:
- return -1;
+ goto err;
break;
}
}
@@ -10710,6 +10710,9 @@ static int convert(const char* from, con
(*outp)[*out_lengthp] = '\0';
return 0;
+err:
+ iconv_close(cd);
+ return -1;
}
int unacmaybefold_string(const char* charset,