Re: Recommend lynx instead of html2text, and signal issues when indexing fulltext

Samuele Kaplun Mon, 15 Mar 2010 16:13:11 +0100

Hi Ferran!

I'll reply quickly, just on the signal issue:


There is a patch available that you can apply directly to release 0.99.1.

You can find it in my public branch here:

<http://cdsware.cern.ch/repo/?p=personal/cds-invenio-sam.git;a=commit;h=04d2cee71d151b7ce600011f4d2414ff28020419>

This should apply cleanly to your repository.

Best regards,
    Samuele
________________________________________
From: Ferran Jorba [[email protected]]
Sent: Monday, 15 March 2010 12:55
To: project-cdsware-users (CDS Invenio users)
Subject: Recommend lynx instead of html2text, and signal issues when indexing fulltext

Hello Invenio developers,

sorry for this long mail and my inability to provide patches to fix my
current issue, but I'd like to raise it on the -users list because I
think it may matter to other installations.

After finishing our 0.99.1 migration, we've started to index fulltext in
our installation.  To start with, and as stated in the -developers list,
I've had to patch bibindex_engine.py so that it accepts any 856 second
indicator, changing _ to %, e.g.:


@@ -1455,7 +1458,7 @@ def task_run_core():
     # Let's work on single words!
     wordTables = get_word_tables(task_get_option("windex"))
     for index_id, index_tags in wordTables:
-        wordTable = WordTable(index_id, index_tags, 'idxWORD%02dF', get_words_from_phrase, {'8564_u': get_words_from_fulltext})
+        wordTable = WordTable(index_id, index_tags, 'idxWORD%02dF', get_words_from_phrase, {'8564%u': get_words_from_fulltext})
         _last_word_table = wordTable
         wordTable.report_on_table_consistency()


I know that it may not be accepted for merging upstream yet, and I
still have to do a full check myself.
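
To illustrate the intent of that one-character change, here is a minimal
sketch. It assumes that % acts as a wildcard matching any run of
characters (as in SQL LIKE) and that _ is compared literally; the helper
tag_pattern_matches is hypothetical and is not Invenio's actual matching
code.

```python
import re


def tag_pattern_matches(pattern, tag):
    """Return True if a MARC tag matches a pattern where '%' is a wildcard.

    Hypothetical helper for illustration only, not part of Invenio.
    """
    # Escape every literal piece, then let each '%' match any run of characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, tag) is not None


# Compare the original pattern with the patched one on three 856 variants.
for tag in ("8564_u", "85641u", "85642u"):
    print(tag, tag_pattern_matches("8564_u", tag), tag_pattern_matches("8564%u", tag))
```

Under these assumptions the original pattern matches only the blank
second indicator, while the patched one matches every second indicator.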

A second issue I'm having is that our site has a lot of HTML
documents, and a bunch of them are in a non-utf8 charset (mostly
iso-8859-1 and windows-1251).  I have been watching and debugging it the
whole morning.  In short, bibindex_engine expects everything in utf8,
and when it is not, it complains loudly.  After adding the exception to
the message, I got:


 2010-03-15 09:47:05 --> Error: Cannot put word num??riques with sign 1 for recID 10 (exception: 'utf8' codec can't decode bytes in position 9-11: invalid data).
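
This kind of failure can be reproduced outside bibindex. The sketch
below is an assumption-laden illustration, not Invenio code: the helper
to_utf8 and its candidate list are hypothetical, and the word
"numériques" is only a guess at the garbled word in the message. It
tries utf-8 first and falls back to iso-8859-1, which never fails to
decode and therefore has to come last.

```python
def to_utf8(raw, candidates=("utf-8", "iso-8859-1")):
    """Re-encode bytes of uncertain charset as UTF-8.

    Hypothetical helper: tries each candidate charset in order and
    returns the UTF-8 bytes of the first successful decode.
    """
    for charset in candidates:
        try:
            text = raw.decode(charset)
        except UnicodeDecodeError:
            continue  # not valid in this charset, try the next one
        return text.encode("utf-8")
    raise ValueError("none of the candidate charsets could decode the input")


# A word stored as iso-8859-1 is not valid utf-8 on its own...
latin1_bytes = "numériques".encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# ...but the fallback turns it into clean utf-8.
clean = to_utf8(latin1_bytes)
```

A fallback like this cannot tell iso-8859-1 from windows-1251, so text
in the latter would come out wrong; proper charset detection is beyond
this sketch.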


How can I get clean utf8 text from any HTML document, in any charset?
html2text has the -ascii option to output unaccented text, but it didn't
help with my files.  Fortunately, lynx does it cleanly.  This
quick-and-dirty patch lets me make some progress:


@@ -417,6 +417,8 @@ def get_words_from_fulltext(url_direct_or_indirect, stemming_language=None):
                 elif os.path.basename(conv_program) == "html2text":
                     cmd = "%s %s > %s" % \
                           (conv_program, tmp_name, tmp_dst_name)
+                    cmd = "lynx -dump -display_charset=utf8 %s >%s" % \
+                        (tmp_name, tmp_dst_name)
                 else:
                     write_message("Error: Do not know how to handle %s conversion program." % conv_program, sys.stderr)
                 # try to run it:
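
For comparison, the same lynx call could be made without building a
shell string, which avoids quoting problems with unusual file names.
This is only a sketch: the function names are hypothetical, the lynx
flags are copied verbatim from the patch above, and it assumes lynx is
installed and on the PATH.

```python
import subprocess


def build_lynx_command(src_path):
    """Return the lynx argument list used to dump an HTML file as text.

    The flags are the ones from the quick-and-dirty patch.
    """
    return ["lynx", "-dump", "-display_charset=utf8", src_path]


def html_to_text(src_path, dst_path):
    """Run lynx on src_path and write its text dump to dst_path."""
    with open(dst_path, "wb") as dst:
        # check=True raises CalledProcessError if lynx exits with an error.
        subprocess.run(build_lynx_command(src_path), stdout=dst, check=True)
```

Calling html_to_text(tmp_name, tmp_dst_name) would then stand in for
the shell redirection in the patch.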


But my joy was short-lived once I saw that my bibindex task ended in
error.  Running it in verbose -v9 mode, I can see a lot of 'got signal
12 frame' messages like this one:


 [...]
 2010-03-15 11:54:08 --> ... data to elaborate: [('pdf', 'http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909-1910v4nINDICE.pdf')]
 2010-03-15 11:54:08 --> .... processing pdf from http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909-1910v4nINDICE.pdf started
 2010-03-15 11:54:09 --> ..... launching /usr/bin/pdftotext -enc UTF-8 /tmp/tmpGyWn-Minvenio.tmp /home/ddd/invenio/var/tmp/tmpvz3OGFinvenio.tmp.txt
 2010-03-15 11:54:09 --> .... processing pdf from http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909-1910v4nINDICE.pdf ended
 2010-03-15 11:54:09 --> ... reading fulltext files from http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909-1910v4nINDICE.pdf ended
 2010-03-15 11:54:09 --> ... reading fulltext files from http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909m9v4n1.pdf started
 2010-03-15 11:54:09 --> ... data to elaborate: [('pdf', 'http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909m9v4n1.pdf')]
 2010-03-15 11:54:09 --> .... processing pdf from http://ddd.uab.cat/pub/revvetesp/revvetesp_a1909m9v4n1.pdf started
 2010-03-15 11:54:10 --> task_sig_ping(), got signal 12 frame <frame object at 0x2b60d70>
 2010-03-15 11:54:10 --> Updating task status to ERROR.
 2010-03-15 11:54:10 --> Task #615 finished but not resubmitted. [ERROR]
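
For readers unfamiliar with the 'got signal 12 frame' wording: it is
the shape of the two arguments that Python passes to a signal handler.
The sketch below assumes a Linux host, where signal 12 is SIGUSR2 on
common architectures such as x86 and ARM. Only the handler name
task_sig_ping comes from the log; its body here is hypothetical, and the
example does not reproduce the task failure itself.

```python
import os
import signal

received = []  # signal numbers seen by the handler


def task_sig_ping(sig, frame):
    """Hypothetical handler body: record the signal and report it."""
    received.append(sig)
    print("task_sig_ping(), got signal %s frame %s" % (sig, frame))


# Register the handler, then send SIGUSR2 to our own process.
signal.signal(signal.SIGUSR2, task_sig_ping)
os.kill(os.getpid(), signal.SIGUSR2)
```

Python runs such handlers in the main thread, between bytecode
instructions of whatever the task is doing at that moment.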


I've been digging into this signal issue in the git repository, and I
found the following patch from Samuele:

 
http://cdsware.cern.ch/repo/?p=cds-invenio.git;a=commitdiff;h=795252c39cdaaefd8649185373a8869064801d14

But after adjusting the path of the files to my running installation, it
doesn't apply cleanly:


 $ guilt push
 Applying patch..dropped-signal-usage-in-bibsched-bibtasks.patch
 error: patch failed: lib/python/invenio/bibsched.py:747
 error: lib/python/invenio/bibsched.py: patch does not apply
 error: patch failed: lib/python/invenio/bibtask.py:53
 error: lib/python/invenio/bibtask.py: patch does not apply
 To force apply this patch, use 'guilt push -f'


To sum up my request for help: I have found workarounds for everything
except this signal issue, which is well above my skills.  Do you have
any suggestions for overcoming it?

Thanks,

Ferran
