Dear all,

I am running into problems when I try to parse SGML documents [0] that are
valid XML apart from the fact that they lack a root tag. The whole issue is
complicated by the fact that the input files are pretty large (i.e. 9.1 GB
gzipped files) and that I therefore cannot read them completely into memory.
My goal is to extract each "document" and index it using Lucene, so I need
access to the data at one point, but can throw it away immediately after
processing.

The input data looks something like [1], and none of the parsers I tried cope
with the missing root tag. If the SGML is parsed with either clj-tagsoup's [3]
parse-xml or data.xml's [4] parse function, I get a broken representation and
cannot extract all data, either with zippers (as in [5]) or by working on the
parsed data directly (as in [6]).

The cause is that the *first* <DOC> is (wrongly) assumed to be the root
tag for the entire document, so the result of the parse looks something
like [7] (for tagsoup) or [8] (for data.xml). What I want is output like the
following from my (hypothetical) processing function:

(documents-from-gigaword-file (-> in-file
                                  (io/input-stream)
                                  (GZIPInputStream.)))
({:id "AFP_ENG_20101220.0219"
  :type "story"
  :headline "Headline 1"
  :paragraphs ("Paragraph 1" "Paragraph 2")}

 {:id "AFP_ENG_20101206.0235"
  :type "story"
  :headline "Headline 2"
  :paragraphs ("Paragraph 3")})

But this is what I get right now with my actual code [9] (no wonder!):

user=> (clojure.pprint/pprint
         (gw-file->documents (io/file "/home/babilen/foo.gz")))
({:id "AFP_ENG_20101206.0235",
  :type "story",
  :headline " Headline 2 ",
  :paragraphs (" Paragraph 3 ")})


I am, however, unsure how to proceed. I tried wrapping the input stream in
"<XML> ... </XML>" [10] but that requires me to read the entire file into memory
and I get OutOfMemory errors when working on the complete corpus. So in short
my questions are:

* Do you know a parser that I can use to parse this data?
* Lacking that: How can I wrap the GZIPInputStream in opening and closing
  tags?
* Do you think that I should just write a parser myself? (seems a lot of work
  just because the enclosing tags are missing)
* Are there other feasible approaches?
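
Regarding the second question, one idea I have not tested on the full corpus
would be java.io.SequenceInputStream, which concatenates streams without
buffering them. The sketch below is only that: wrap-in-root, string-stream and
wrapped-gw-stream are names I made up, and it assumes the files contain no XML
declaration or DOCTYPE before the first <DOC>.

```clojure
(require '[clojure.java.io :as io])
(import '(java.io ByteArrayInputStream SequenceInputStream)
        '(java.util.zip GZIPInputStream))

(defn string-stream
  "Returns an InputStream over the UTF-8 bytes of s (hypothetical helper)."
  [^String s]
  (ByteArrayInputStream. (.getBytes s "UTF-8")))

(defn wrap-in-root
  "Returns an InputStream that yields <XML>, then everything from in,
  then </XML>. The contents of in are read on demand, not slurped."
  [in]
  (SequenceInputStream.
    (SequenceInputStream. (string-stream "<XML>") in)
    (string-stream "</XML>")))

(defn wrapped-gw-stream
  "Opens the gzipped in-file and wraps its contents in a root element."
  [in-file]
  (-> in-file
      (io/input-stream)
      (GZIPInputStream.)
      (wrap-in-root)))
```

The result could be passed to a parser in place of the bare GZIPInputStream.
This only removes the slurp; whether memory stays bounded would still depend
on the parser building its tree lazily rather than all at once.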

Any input would be most appreciated!

References
----------

[0] The input data is the English gigaword corpus from
    http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2011T07

[1] Example data:

    <DOC id="AFP_ENG_20101220.0219" type="story" >
    <HEADLINE>
    Headline 1
    </HEADLINE>
    <DATELINE>
    Location, Dec 20, 2010 (AFP)
    </DATELINE>
    <TEXT>
    <P>
    Paragraph 1
    </P>
    <P>
    Paragraph 2
    </P>
    </TEXT>
    </DOC>
    <DOC id="AFP_ENG_20101206.0235" type="story" >
    <HEADLINE>
    Headline 2
    </HEADLINE>
    <DATELINE>
    Location, Dec 6, 2010 (AFP)
    </DATELINE>
    <TEXT>
    <P>
    Paragraph 3
    </P>
    </TEXT>
    </DOC>

[3] https://github.com/nathell/clj-tagsoup/blob/master/src/pl/danieljanus/tagsoup.clj
[4] https://github.com/clojure/data.xml/
[5] Extraction using a zipper:
    (defn gw-file->documents
      [in-file]
      (let [xml-zipper (zip/xml-zip (parse-gw-file in-file))]
        (map (fn [doc]
               {:id (dzx/xml1-> doc (dzx/attr :id))
                :type (dzx/xml1-> doc (dzx/attr :type))
                :headline (dzx/xml1-> doc :HEADLINE dzx/text)
                :paragraphs (dzx/xml-> doc :TEXT :p dzx/text)})
             (dzx/xml-> xml-zipper :DOC))))
[6] Example extraction of data on the output of parse(-xml) directly:
    I use filter-tag to search for all :DOC's and call process-document for
    each.

    (defn- filter-tag
      "Returns the elements in xmls whose :tag equals tag."
      [tag xmls]
      (filter #(= tag (:tag %)) xmls))

    (defn process-document
      [doc]
      {:id   (:id (:attrs doc))
       :type (:type (:attrs doc))
       :headline (filter-tag :HEADLINE (xml-seq doc))})
[7] Parsing with tagsoup

user=> (clojure.pprint/pprint 
  (tagsoup/parse-xml 
   (-> "/home/babilen/foo.gz" (io/file) (io/input-stream) (GZIPInputStream.))))
{:tag :DOC,
 :attrs {:id "AFP_ENG_20101220.0219", :type "story"},
 :content
 [{:tag :HEADLINE, :attrs nil, :content ["\nHeadline 1\n"]}
  {:tag :DATELINE,
   :attrs nil,
   :content ["\nLocation, Dec 20, 2010 (AFP)\n"]}
  {:tag :TEXT,
   :attrs nil,
   :content
   [{:tag :p, :attrs nil, :content ["\nParagraph 1\n"]}
    {:tag :p, :attrs nil, :content ["\nParagraph 2\n"]}]}
  {:tag :DOC,
   :attrs {:id "AFP_ENG_20101206.0235", :type "story"},
   :content
   [{:tag :HEADLINE, :attrs nil, :content ["\nHeadline 2\n"]}
    {:tag :DATELINE,
     :attrs nil,
     :content ["\nLocation, Dec 6, 2010 (AFP)\n"]}
    {:tag :TEXT,
     :attrs nil,
     :content [{:tag :p, :attrs nil, :content ["\nParagraph 3\n"]}]}]}]}

[8] Parsing with clojure.data.xml/parse
user=> (clojure.pprint/pprint 
  (clojure.data.xml/parse 
    (-> "/home/babilen/foo.gz" (io/file) (io/input-stream) (GZIPInputStream.))))
{:tag :DOC,
 :attrs {:id "AFP_ENG_20101220.0219", :type "story"},
 :content
 ({:tag :HEADLINE, :attrs {}, :content ("\nHeadline 1\n")}
  {:tag :DATELINE,
   :attrs {},
   :content ("\nLocation, Dec 20, 2010 (AFP)\n")}
  {:tag :TEXT,
   :attrs {},
   :content
   ({:tag :P, :attrs {}, :content ("\nParagraph 1\n")}
    {:tag :P, :attrs {}, :content ("\nParagraph 2\n")})})}

[9] My actual code:

(defn- parse-gw-file
  [in-file]
  (->> in-file
    (io/input-stream)
    (GZIPInputStream.)
    (ts/parse-xml)))

(defn gw-file->documents
  [in-file]
  (let [xml-zipper (zip/xml-zip (parse-gw-file in-file))]
    (map (fn [doc]
           {:id (dzx/xml1-> doc (dzx/attr :id))
            :type (dzx/xml1-> doc (dzx/attr :type))
            :headline (dzx/xml1-> doc :HEADLINE dzx/text)
            :paragraphs (dzx/xml-> doc :TEXT :p dzx/text)})
         (dzx/xml-> xml-zipper :DOC))))
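
An alternative to parsing the whole file in one go, as the code above does,
might be to cut the input into one string per <DOC> block first and hand each
small string to the parser separately. A minimal sketch, assuming every </DOC>
sits on a line of its own as in [1]; doc-chunks and each-gw-document are
hypothetical names, and I have not run this on the real corpus.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])
(import '(java.util.zip GZIPInputStream))

(defn doc-chunks
  "Lazily groups lines into one string per <DOC>...</DOC> block.
  Assumes each closing </DOC> tag is alone on its line."
  [lines]
  (lazy-seq
    (when-let [s (seq (drop-while str/blank? lines))]
      (let [[body more] (split-with #(not= "</DOC>" (str/trim %)) s)]
        ;; body holds everything before </DOC>; (first more) is the </DOC> line
        (cons (str/join "\n" (concat body (take 1 more)))
              (doc-chunks (rest more)))))))

(defn each-gw-document
  "Calls f with the text of every <DOC> block in the gzipped in-file,
  one block at a time."
  [f in-file]
  (with-open [rdr (io/reader (GZIPInputStream. (io/input-stream in-file)))]
    (doseq [chunk (doc-chunks (line-seq rdr))]
      (f chunk))))
```

Each chunk is a complete document with <DOC> as its own root, so f could turn
it into a ByteArrayInputStream, parse it, extract the fields and index them
before the next chunk is read.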

[10] Wrapping the stream:
(defn- parse-gw-file
  [in-file]
  (let [unzipped-file (->> in-file
                        (io/input-stream)
                        (GZIPInputStream.))
        ;; slurp reads the whole decompressed file into memory, which is
        ;; what causes the OutOfMemory errors on the complete corpus
        wrapped-file (str "<XML>" (slurp unzipped-file) "</XML>")]
    (ts/parse-xml (ByteArrayInputStream. (.getBytes wrapped-file "UTF-8")))))
-- 
Wolodja <babi...@gmail.com>

4096R/CAF14EFC
081C B7CD FF04 2BA9 94EA  36B2 8B7F 7D30 CAF1 4EFC
