On Thursday 16 November 2006 at 18:55 +0000, Jamie McCracken wrote:
> Luca Ferretti wrote:
> > I'm trying to check and possibly expand the info in
> > http://live.gnome.org/Tracker/SupportedFormats.
> > 
> > So I'm planning to create files of various formats, then search for text
> > inside them. 
> > 
> > ############ Test Procedure ###
> > 
> > I used the "stable" version (0.5.1), although I have the CVS version
> > installed too (I'll test it later).
> > 
> > So far I have tested some word processor document formats: I wrote a
> > one-line document in OO.o Writer (the one in Ubuntu Edgy) and saved it in
> > various formats. The file has content and some metadata (the kind you
> > can add in File->Properties).
> > 
> > The exact procedure is:
> >      1. create the ODT file
> >      2. save it and close OO.o
> >      3. open the ODT file
> >      4. use File -> Save As ..
> >      5. choose a different format
> >      6. save the file in new "alien" format
> >      7. close the file and OO.o
> >      8. restart from #3
> > 
> > Then I searched with `tracker-search` at least twice for each file:
> > once for a word that's only in the file content ("potenzialità"), once
> > for a word that's only in the file metadata ("particolare"). Of course,
> > I wrote this file in Italian.
> > 
> > ############# Test Results ###
> > 
> > ODT (OpenDocument Text)
> >   content:                  yes
> >   metadata:                 yes [1]
> >   extra:                    keywords metadata are auto-tagged
> > 
> > OTT (OpenDocument Text Template)
> >   content:                  no (????)

Now yes.

> >   metadata:                 yes [1]
> >   extra:                    as above
> > 
> > SXW (OpenOffice 1.x Text)
> >   content:                  yes
> >   metadata:                 no
> > 
> > STW (OpenOffice 1.x Text Template)
> >   content:                  no

Now yes.

> >   metadata:                 no
> > 
> > DOC (Word 97/2000/XP | Word 95 | Word 6.0)
> >   content:                  yes [2]
> >   metadata:                 no  [3]
> > 
> > RTF (Rich Text Format)
> >   content:                  no  [4]
> >   metadata:                 no  [4]


Following up on the two "Now yes" notes above: I'm attaching a trivial
patch that just adds two new filters, one for each template format.


> > ########### Conclusions ###
> > 
> > I suspect that the RTF format is currently not managed by tracker. We
> > should manage it, 'cause it's the only format supported by all Word
> > Processors. Read note [4] about metadata and non ASCII characters.
> 
> package unrtf in debian/ubuntu  universe might help with this - it has 
> command line to convert to plain text - anyone wanna write a filter for 
> this?

Hmm,

$ unrtf --text pooooo.rtf
This is UnRTF, version 0.19.2
By Dave Davey and Marcos Serrou do Amaral
Original Author: Zach T. Smith
Processing pooooo.rtf...
### Translation from RTF performed by UnRTF, version 0.19.2
### For information about this marvellous program,
### please go to http://www.gnu.org/software/unrtf/unrtf.html
### document uses ANSI character set
### font table contains 4 fonts total
modello, ,schema,
AUTHOR: Luca Ferretti
### creaton date: 16 November 2006 15:29
### revision date: 1 January 1601
### last printed: 1 January 1601
### comments: StarWriter

-----------------
Questo ?? un semplice esempio delle potenzialit?? di OO.o
Done.
165 words were hashed.


I don't see a way to suppress the extra output (other than with shell
commands, of course)... Perhaps we should reuse parts of its source code
instead.
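
For the shell-command route, here is a minimal sketch of what an RTF filter
could look like. It assumes that `unrtf --text` always prints its header and
metadata first, then a line of dashes, then the document text, as in the
sample output above. The function name `strip_unrtf_noise` is hypothetical,
and the name and location the script would need in filters/ are left open.

```shell
#!/bin/sh
# Hypothetical RTF filter sketch, modelled on the existing *_filter scripts.
# Assumption: `unrtf --text` prints its header and metadata first, then a
# line of dashes, then the document text (as in the sample output above).

# Read unrtf text output on stdin and print only the document body.
strip_unrtf_noise() {
    awk '
        # Once past the separator, print everything except the trailer lines.
        body && !/^Done\.$/ && !/^[0-9]+ words were hashed\.$/ { print }
        # A line of five or more dashes marks the start of the body.
        /^-----+$/ { body = 1 }
    '
}

# Same calling convention as the other filters: $1 = input, $2 = output.
if [ "$#" -eq 2 ]; then
    nice -n19 unrtf --text "$1" 2>/dev/null | strip_unrtf_noise > "$2"
fi
```

Two caveats: a body line that consists only of "Done." would be dropped too,
and this does nothing about the non-ASCII characters that come out as "??".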


> > Extraction of metadata seems to work only for ODT and OTT formats.
> 
> that might be a libgsf limitation
> 
> > 
> > Moreover, I don't understand why Tracker doesn't extract contents for
> > OpenDocument and OpenOffice templates. Is it a format design choice or a
> > tracker issue?
> 
> dunno - I guess they have different mime types?
> 
> could be an easy matter to sort out (just copy the filter)
> 
> > 
> > Finally, it could be interesting to investigate the core dump (read:
> > crash) that occurs every time DOC files are created or touch-ed. Is it
> > possible to execute tracker-extract directly with some debug options?
> 
> no, but the errors are probably occurring in libgsf, if that's any help?
> 
> gdb tracker-extract and make sure last param (mime) is "application/msword"
> 
> 
> 
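
The gdb suggestion above could be wrapped roughly like this. It is only a
sketch: the thread says the last parameter is the MIME type, but the file
path coming first is an assumption, and the `debug_extract` function and
`GDB` override variable are hypothetical names used for illustration.

```shell
#!/bin/sh
# Sketch: start tracker-extract under gdb for a DOC file.
# Assumptions: tracker-extract takes the file path followed by the MIME
# type (only "last param is the mime" comes from the thread), and the
# GDB override variable is purely illustrative.
debug_extract() {
    file=$1
    mime=${2:-application/msword}
    # --args makes gdb pass the remaining words to the program being debugged.
    "${GDB:-gdb}" --args tracker-extract "$file" "$mime"
}

# Example: debug_extract ~/Documents/test.doc
# Then type "run" at the gdb prompt and "bt" after the crash.
```

Setting `GDB=echo` prints the command line instead of starting the debugger,
which is handy for checking the arguments first.
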
> > 
> > I'll update the page on the gnome.org wiki, but I'd like someone to
> > first perform the same test using my file (attached) or by creating a
> > custom one (maybe in other languages).
> > 
> > It could also be interesting to try indexing DOC files created directly
> > in MS Word (which I don't have on my computer).
> > 
> > ########### Notes ###
> > 
> > [1] in File->Properties there are 2 tabs for metadata. Metadata in User
> > tab are ignored by tracker.
> 
> we only extract specific metadata not any old junk. We have File.Other 
> as a dumping ground though for misc crap that needs to be indexed
> 
> > 
> > [2] tail-ing ~/.Tracker/tracker.lo I see that saving the doc file, a
> > core file is created. This could depend on tracker-extract
> 
> most likely
> 
diff -pruN -x CVS tracker.orig/filters/application/Makefile.am tracker.modif/filters/application/Makefile.am
--- tracker.orig/filters/application/Makefile.am	2006-11-12 17:05:20.000000000 +0100
+++ tracker.modif/filters/application/Makefile.am	2006-11-16 20:43:02.000000000 +0100
@@ -5,10 +5,12 @@ thumbappbin_SCRIPTS = 	pdf_filter \
 			vnd.oasis.opendocument.presentation_filter \
 			vnd.oasis.opendocument.spreadsheet_filter \
 			vnd.oasis.opendocument.text_filter \
+			vnd.oasis.opendocument.text-template_filter \
 			vnd.stardivision.writer_filter \
 			vnd.sun.xml.calc_filter \
 			vnd.sun.xml.impress_filter \
 			vnd.sun.xml.writer_filter \
+			vnd.sun.xml.writer.template_filter \
 			x-abiword_filter
 
 EXTRA_DIST = $(thumbappbin_SCRIPTS)
diff -pruN -x CVS tracker.orig/filters/application/vnd.oasis.opendocument.text-template_filter tracker.modif/filters/application/vnd.oasis.opendocument.text-template_filter
--- tracker.orig/filters/application/vnd.oasis.opendocument.text-template_filter	1970-01-01 01:00:00.000000000 +0100
+++ tracker.modif/filters/application/vnd.oasis.opendocument.text-template_filter	2006-11-16 20:43:02.000000000 +0100
@@ -0,0 +1,3 @@
+#!/bin/sh
+
+nice -n19 unzip -p "$1" content.xml | o3totxt > "$2"
diff -pruN -x CVS tracker.orig/filters/application/vnd.sun.xml.writer.template_filter tracker.modif/filters/application/vnd.sun.xml.writer.template_filter
--- tracker.orig/filters/application/vnd.sun.xml.writer.template_filter	1970-01-01 01:00:00.000000000 +0100
+++ tracker.modif/filters/application/vnd.sun.xml.writer.template_filter	2006-11-16 20:43:02.000000000 +0100
@@ -0,0 +1,3 @@
+#!/bin/sh
+
+nice -n19 unzip -p "$1" content.xml | o3totxt > "$2"