Hello Invenio developers,

sorry for this long mail and my inability to provide patches to fix my
current issue, but I'd like to raise it on the -users list because I
think it may matter to other installations.

After finishing our 0.99.1 migration, we've started to index fulltext in
our installation.  To start with, and as stated on the -developers list,
I've had to patch bibindex_engine.py so that it accepts any 856 second
indicator, changing _ to %, e.g.:


@@ -1455,7 +1458,7 @@ def task_run_core():
     # Let's work on single words!
     wordTables = get_word_tables(task_get_option("windex"))
     for index_id, index_tags in wordTables:
-        wordTable = WordTable(index_id, index_tags, 'idxWORD%02dF', get_words_from_phrase, {'8564_u': get_words_from_fulltext})
+        wordTable = WordTable(index_id, index_tags, 'idxWORD%02dF', get_words_from_phrase, {'8564%u': get_words_from_fulltext})
         _last_word_table = wordTable
         wordTable.report_on_table_consistency()


I know this may not be accepted for merging upstream yet, and I still
have to run a full check myself.

A second issue is that our site has a lot of HTML documents, and a bunch
of them are in non-UTF-8 charsets (mostly iso-8859-1 and windows-1251).
I have been watching and debugging this the whole morning.  In short,
bibindex_engine expects everything in UTF-8, and when it is not, it
complains loudly.  After adding the exception to the message, I got:


 2010-03-15 09:47:05 --> Error: Cannot put word num??riques with sign 1 for recID 10 (exception: 'utf8' codec can't decode bytes in position 9-11: invalid data).


How can one get clean UTF-8 text from any HTML document, in any charset?
html2text has the -ascii option to output unaccented text, but it didn't
do anything good with my files.  Fortunately, lynx does it cleanly.  This
quick-and-dirty patch lets me make some progress:


@@ -417,6 +417,8 @@ def get_words_from_fulltext(url_direct_or_indirect, stemming_language=None):
                 elif os.path.basename(conv_program) == "html2text":
                     cmd = "%s %s > %s" % \
                           (conv_program, tmp_name, tmp_dst_name)
+                    cmd = "lynx -dump -display_charset=utf8 %s >%s" % \
+                        (tmp_name, tmp_dst_name)
                 else:
                     write_message("Error: Do not know how to handle %s conversion program." % conv_program, sys.stderr)
                 # try to run it:
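
As an alternative to swapping the converter, the text could be
normalised to UTF-8 after conversion.  What follows is only a minimal
sketch, not Invenio code: the function name to_utf8 and the fallback
list are assumptions made for illustration, and it is written for
Python 3, so it would need adapting for the Python 2 that Invenio
0.99.1 runs on.  It first checks whether the bytes are already valid
UTF-8 and otherwise tries each fallback charset in turn.

```python
def to_utf8(raw, fallbacks=("iso-8859-1",)):
    """Return `raw` (bytes) re-encoded as UTF-8.

    Hypothetical helper: `fallbacks` lists the charsets to try, in order,
    when the input is not already valid UTF-8.
    """
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8, leave untouched
    except UnicodeDecodeError:
        pass
    for charset in fallbacks:
        try:
            return raw.decode(charset).encode("utf-8")
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset, try the next one
    # Last resort: keep what can be decoded, mark the rest as U+FFFD.
    return raw.decode("utf-8", errors="replace").encode("utf-8")


latin1_bytes = "numériques".encode("iso-8859-1")
print(to_utf8(latin1_bytes).decode("utf-8"))
```

Note that iso-8859-1 accepts every possible byte, so it never fails and
any charset listed after it is never tried.  Telling iso-8859-1 apart
from windows-1251 reliably needs the charset declared in the HTML
itself or a detection library.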


But my joy was short-lived: my bibindex task ended in error.  Running it
in verbose -v9 mode, I can see a lot of 'got signal 12 frame' messages
like this one:


 [...]
 2010-03-15 11:54:08 --> ... data to elaborate: [('pdf', 'http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909-1910v4nINDICE.pdf')]
 2010-03-15 11:54:08 --> .... processing pdf from http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909-1910v4nINDICE.pdf started
 2010-03-15 11:54:09 --> ..... launching /usr/bin/pdftotext -enc UTF-8 /tmp/tmpGyWn-Minvenio.tmp /home/ddd/invenio/var/tmp/tmpvz3OGFinvenio.tmp.txt
 2010-03-15 11:54:09 --> .... processing pdf from http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909-1910v4nINDICE.pdf ended
 2010-03-15 11:54:09 --> ... reading fulltext files from http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909-1910v4nINDICE.pdf ended
 2010-03-15 11:54:09 --> ... reading fulltext files from http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909m9v4n1.pdf started
 2010-03-15 11:54:09 --> ... data to elaborate: [('pdf', 'http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909m9v4n1.pdf')]
 2010-03-15 11:54:09 --> .... processing pdf from http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909m9v4n1.pdf started
 2010-03-15 11:54:10 --> task_sig_ping(), got signal 12 frame <frame object at 0x2b60d70>
 2010-03-15 11:54:10 --> Updating task status to ERROR.
 2010-03-15 11:54:10 --> Task #615 finished but not resubmitted. [ERROR]
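
For reading such logs, it helps to translate the signal number into its
name.  On most Linux platforms signal 12 is SIGUSR2, but the numbering
is platform-dependent, so it is safer to ask the running system.  This
is a small Python 3 sketch; the helper name signal_name is made up for
illustration and is not part of Invenio.

```python
import signal


def signal_name(signum):
    """Return the symbolic name of a signal number on this platform."""
    try:
        return signal.Signals(signum).name
    except ValueError:
        # The number does not correspond to a signal known here.
        return "unknown signal %d" % signum


print(signal_name(12))
```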


I've been digging into this signal issue in the git repository, and I
found the following patch from Samuele:

 
http://cdsware.cern.ch/repo/?p=cds-invenio.git;a=commitdiff;h=795252c39cdaaefd8649185373a8869064801d14

But after adjusting the file paths to match my running installation, it
doesn't apply cleanly:


 $ guilt push
 Applying patch..dropped-signal-usage-in-bibsched-bibtasks.patch
 error: patch failed: lib/python/invenio/bibsched.py:747
 error: lib/python/invenio/bibsched.py: patch does not apply
 error: patch failed: lib/python/invenio/bibtask.py:53
 error: lib/python/invenio/bibtask.py: patch does not apply
 To force apply this patch, use 'guilt push -f'


To sum up my request for help: I have found workarounds for everything
except this signal issue, which is well above my skills.  Do you have
any suggestions for overcoming it?

Thanks,

Ferran
