Hello Invenio developers,
sorry for this long mail and for my inability to provide patches that fix my
current issue, but I would like to raise it on the -users list because I
think it may matter to other installations.
After finishing our 0.99.1 migration, we've started to index fulltext in
our installation. To start with, and as stated in the -developers list,
I've had to patch bibindex_engine.py so that it accepts any 856 second
indicator, changing _ to %, e.g.:
@@ -1455,7 +1458,7 @@ def task_run_core():
# Let's work on single words!
wordTables = get_word_tables(task_get_option("windex"))
for index_id, index_tags in wordTables:
- wordTable = WordTable(index_id, index_tags, 'idxWORD%02dF',
get_words_from_phrase, {'8564_u': get_words_from_fulltext})
+ wordTable = WordTable(index_id, index_tags, 'idxWORD%02dF',
get_words_from_phrase, {'8564%u': get_words_from_fulltext})
_last_word_table = wordTable
wordTable.report_on_table_consistency()
I know that this may not be accepted for merging upstream yet, and I
still have to do a full check myself.
A second issue I'm having is that our site holds a lot of HTML
documents, and a bunch of them are in a non-UTF-8 charset (mostly
iso-8859-1 and windows-1251). I have spent the whole morning watching and
debugging this. In a word, bibindex_engine expects everything in UTF-8,
and when it is not, it complains loudly. After adding the exception to the
message, I got:
2010-03-15 09:47:05 --> Error: Cannot put word num??riques with sign 1 for
recID 10 (exception: 'utf8' codec can't decode bytes in position 9-11: invalid
data).
How can I get clean UTF-8 text from any HTML document, in any charset?
html2text has the -ascii option to output unaccented text, but it didn't
do anything good with my files. Fortunately, lynx does it cleanly. This
quick-and-dirty patch allows me to make some progress:
@@ -417,6 +417,8 @@ def get_words_from_fulltext(url_direct_or_indirect,
stemming_language=None):
elif os.path.basename(conv_program) == "html2text":
cmd = "%s %s > %s" % \
(conv_program, tmp_name, tmp_dst_name)
+ cmd = "lynx -dump -display_charset=utf8 %s >%s" % \
+ (tmp_name, tmp_dst_name)
else:
write_message("Error: Do not know how to handle %s
conversion program." % conv_program, sys.stderr)
# try to run it:
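
For comparison, here is a minimal sketch of doing the charset conversion in Python itself rather than through an external program. It is not part of Invenio: the function name html_to_utf8, the regular expression and the iso-8859-1 default are all assumptions made for illustration, and it is written for a current Python 3 rather than the Python used by 0.99.1. It tries UTF-8 first, then the charset the HTML declares (if any), then a single-byte fallback that can decode any byte sequence. It only converts the encoding; stripping the HTML tags is still left to the converter.

```python
import re

# Hypothetical helper, not an Invenio function: looks for charset=... in the
# first bytes of the page (meta tag or XML declaration style).
CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE)


def html_to_utf8(raw, default="iso-8859-1"):
    """Return (utf8_bytes, encoding_used) for raw HTML bytes of unknown charset."""
    candidates = ["utf-8"]
    match = CHARSET_RE.search(raw[:2048])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    # iso-8859-1 maps every byte to a character, so it must come last.
    candidates.append(default)
    for encoding in candidates:
        try:
            return raw.decode(encoding).encode("utf-8"), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Only reached if the caller passed a default that can fail.
    return raw.decode("utf-8", errors="replace").encode("utf-8"), "utf-8-lossy"


sample = (
    "<html><head><meta http-equiv='Content-Type' "
    "content='text/html; charset=iso-8859-1'></head>"
    "<body>numériques</body></html>"
).encode("iso-8859-1")

clean, used = html_to_utf8(sample)
print(used)
```

Trying UTF-8 first is deliberate: single-byte charsets such as iso-8859-1 or windows-1251 accept almost any input, so they cannot be used to detect a wrong guess, whereas a failed UTF-8 decode is a reliable signal.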
But my joy was short-lived: my bibindex task ended in error. Running it
in verbose -v9 mode, I can see a lot of 'got signal 12 frame' messages
like this one:
[...]
2010-03-15 11:54:08 --> ... data to elaborate: [('pdf',
'http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909-1910v4nINDICE.pdf')]
2010-03-15 11:54:08 --> .... processing pdf from
http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909-1910v4nINDICE.pdf started
2010-03-15 11:54:09 --> ..... launching /usr/bin/pdftotext -enc UTF-8
/tmp/tmpGyWn-Minvenio.tmp /home/ddd/invenio/var/tmp/tmpvz3OGFinvenio.tmp.txt
2010-03-15 11:54:09 --> .... processing pdf from
http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909-1910v4nINDICE.pdf ended
2010-03-15 11:54:09 --> ... reading fulltext files from
http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909-1910v4nINDICE.pdf ended
2010-03-15 11:54:09 --> ... reading fulltext files from
http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909m9v4n1.pdf started
2010-03-15 11:54:09 --> ... data to elaborate: [('pdf',
'http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909m9v4n1.pdf')]
2010-03-15 11:54:09 --> .... processing pdf from
http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909m9v4n1.pdf started
2010-03-15 11:54:10 --> task_sig_ping(), got signal 12 frame <frame object at
0x2b60d70>
2010-03-15 11:54:10 --> Updating task status to ERROR.
2010-03-15 11:54:10 --> Task #615 finished but not resubmitted. [ERROR]
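
As background for the log above: on most Linux platforms signal number 12 is SIGUSR2, and a Python signal handler is always called with the signal number and the interrupted stack frame, which is where the "frame object" in the message comes from. The following is a minimal sketch in current Python 3 for POSIX systems, not Invenio's actual task_sig_ping; the handler name on_ping and the received list are made up for illustration.

```python
import os
import signal

received = []


def on_ping(signum, frame):
    # Python passes the signal number and the frame that was interrupted;
    # here we only record the symbolic name of the signal.
    received.append(signal.Signals(signum).name)


# Install the handler, then deliver SIGUSR2 to our own process.
signal.signal(signal.SIGUSR2, on_ping)
os.kill(os.getpid(), signal.SIGUSR2)

print(received)
```

This only shows how such a message is produced; it does not explain why the task is then marked as ERROR.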
I've been digging into this signal issue in the git repository, and I found
the following patch from Samuele:
http://cdsware.cern.ch/repo/?p=cds-invenio.git;a=commitdiff;h=795252c39cdaaefd8649185373a8869064801d14
But after adjusting the file paths to match my running installation, it
doesn't apply cleanly:
$ guilt push
Applying patch..dropped-signal-usage-in-bibsched-bibtasks.patch
error: patch failed: lib/python/invenio/bibsched.py:747
error: lib/python/invenio/bibsched.py: patch does not apply
error: patch failed: lib/python/invenio/bibtask.py:53
error: lib/python/invenio/bibtask.py: patch does not apply
To force apply this patch, use 'guilt push -f'
To sum up my help request: I have found workarounds for everything except
this signal issue, which is well above my skills. Do you have any
suggestions for overcoming it?
Thanks,
Ferran