I am processing a fairly large XML file (13MB) using clojure.xml/parse
and clojure.contrib.zip-filter.xml with Clojure 1.0.0.
The XML file contains information on 13000 Japanese characters, and I'm
extracting about 200 of them.
At its core it extracts a very small subset of elements using:
(xml-> kdic :character [:literal #(contains? kcset (text %))] node)
Where kcset is a set of desired characters.
My understanding is that this returns a lazy seq, and that calling
count on it should return 200 (not 13000). In practice, though, it
throws a StackOverflowError.
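
To make that expectation concrete, here is a toy sketch that uses plain filter over numbers rather than the real xml-> query. The names examined, matches and match-count are made up for illustration. It suggests that counting the lazy result yields only the matching items, even though every candidate is examined along the way:

```clojure
;; Toy stand-in for the real query: 13000 candidates, 200 of which match.
;; This is NOT the zip-filter pipeline, only an illustration of laziness.
(def examined (atom 0))

(def matches
  (filter (fn [n]
            (swap! examined inc)   ; record each candidate the predicate sees
            (zero? (rem n 65)))    ; every 65th number counts as a match
          (range 13000)))

;; count forces the whole lazy seq to be realized
(def match-count (count matches))
```

Here match-count is 200, while @examined ends up at 13000 once the count has run.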
At the end of this post is a relatively short version of the program
that reproduces the stack overflow; there, a (count ...) call triggers
it. In the full program I tried a few variations, such as:
(dorun (for [knode knodes] (print-kinfo knode)))
This was meant to print the information, but it also throws a stack
overflow before reaching the end of the list.
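
As an aside, the same loop can also be written with doseq, which exists for side effects and does not build an intermediate sequence the way for does. This is only a sketch: print-kinfo below is a hypothetical placeholder for the real printing function, and since the sequence still has to be realized, this form may well not avoid the overflow.

```clojure
;; print-kinfo is a hypothetical placeholder for the real printing function.
(defn print-kinfo [knode]
  (println knode))

(defn print-kinfos [knodes]
  ;; doseq walks the sequence eagerly for its side effects and returns nil
  (doseq [knode knodes]
    (print-kinfo knode)))
```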
I also have the stack trace at the end as well.
Thanks!
Here's the short version of the program:
(ns kanji.prkanji
  (:use clojure.xml)
  (:use [clojure.zip :only (xml-zip node)])
  (:use clojure.contrib.zip-filter.xml)
  (:import java.lang.Character$UnicodeBlock)
  (:import java.io.File))

(def CJK Character$UnicodeBlock/CJK_UNIFIED_IDEOGRAPHS)

(defn filter-for-kanji
  [chars]
  (filter #(= CJK (Character$UnicodeBlock/of %)) chars))

(defn get-unique-kanji
  [chars]
  (let [kchars (filter-for-kanji chars)]
    (set kchars)))

(defn print-kinfos
  [knodes]
  (count knodes))
;; this is what I would normally do:
;; (dorun (for [knode knodes] (print-kinfo knode)))

(defn get-kdic-info
  [kdic kchars]
  (let [kcset (set (map str kchars))]
    (xml-> kdic :character [:literal #(contains? kcset (text %))]
           node)))

(defn load-kdic
  [fname]
  (xml-zip (parse (File. fname))))

(defn process-file
  [file]
  (let [kchars (get-unique-kanji (slurp file))]
    (print-kinfos
      (get-kdic-info
        (load-kdic "kanji/kdic-data.xml") kchars))))

(process-file (second *command-line-args*))
And here's the top of the stack trace:
Exception in thread "main" java.lang.StackOverflowError (prkanji.clj:0)
at clojure.lang.Compiler.eval(Compiler.java:4543)
at clojure.lang.Compiler.load(Compiler.java:4857)
at clojure.lang.Compiler.loadFile(Compiler.java:4824)
at clojure.main$load_script__5833.invoke(main.clj:206)
at clojure.main$init_opt__5836.invoke(main.clj:211)
at clojure.main$initialize__5846.invoke(main.clj:239)
at clojure.main$null_opt__5868.invoke(main.clj:264)
at clojure.main$legacy_script__5883.invoke(main.clj:295)
at clojure.lang.Var.invoke(Var.java:346)
at clojure.main.legacy_script(main.java:34)
at clojure.lang.Script.main(Script.java:20)
Caused by: java.lang.StackOverflowError
at clojure.lang.Cons.next(Cons.java:37)
at clojure.lang.RT.boundedLength(RT.java:1117)
at clojure.lang.AFn.applyToHelper(AFn.java:168)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at clojure.core$apply__3243.doInvoke(core.clj:390)
at clojure.lang.RestFn.invoke(RestFn.java:443)
at clojure.core$mapcat__3842.doInvoke(core.clj:1528)
at clojure.lang.RestFn.invoke(RestFn.java:428)
at clojure.contrib.zip_filter$descendants__48$fn__50.invoke(zip_filter.clj:63)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.RT.seq(RT.java:436)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.RT.seq(RT.java:436)
at clojure.core$seq__3133.invoke(core.clj:103)
at clojure.core$map__3815$fn__3817.invoke(core.clj:1502)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.Cons.next(Cons.java:37)
at clojure.lang.RT.boundedLength(RT.java:1117)
at clojure.lang.RestFn.applyTo(RestFn.java:135)
at clojure.core$apply__3243.doInvoke(core.clj:390)
at clojure.lang.RestFn.invoke(RestFn.java:428)
at clojure.core$mapcat__3842.doInvoke(core.clj:1528)
at clojure.lang.RestFn.invoke(RestFn.java:428)
at clojure.contrib.zip_filter$mapcat_chain__65$fn__67.invoke(zip_filter.clj:88)
at clojure.lang.ArraySeq.reduce(ArraySeq.java:116)
at clojure.core$reduce__3319.invoke(core.clj:536)
at clojure.contrib.zip_filter$mapcat_chain__65.invoke(zip_filter.clj:89)
at clojure.contrib.zip_filter.xml$xml__GT___119.doInvoke(xml.clj:75)
at clojure.lang.RestFn.invoke(RestFn.java:460)
at clojure.contrib.zip_filter.xml$text__102.invoke(xml.clj:43)
at kanji.prkanji$get_kdic_info__147$fn__149.invoke(prkanji.clj:36)
at clojure.contrib.zip_filter$fixup_apply__60.invoke(zip_filter.clj:76)
at clojure.contrib.zip_filter$mapcat_chain__65$fn__67$fn__69.invoke(zip_filter.clj:88)
at clojure.core$map__3815$fn__3817.invoke(core.clj:1503)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.RT.seq(RT.java:436)
at clojure.core$seq__3133.invoke(core.clj:103)
at clojure.core$spread__3240.invoke(core.clj:383)
at clojure.core$apply__3243.doInvoke(core.clj:390)
at clojure.lang.RestFn.invoke(RestFn.java:428)
at clojure.core$mapcat__3842.doInvoke(core.clj:1528)
at clojure.lang.RestFn.invoke(RestFn.java:428)
at clojure.contrib.zip_filter$mapcat_chain__65$fn__67.invoke(zip_filter.clj:88)
at clojure.lang.APersistentVector$Seq.reduce(APersistentVector.java:476)
at clojure.core$reduce__3319.invoke(core.clj:536)
at clojure.contrib.zip_filter$mapcat_chain__65.invoke(zip_filter.clj:89)
at clojure.contrib.zip_filter.xml$xml__GT___119.doInvoke(xml.clj:75)
at clojure.lang.RestFn.applyTo(RestFn.java:144)
at clojure.core$apply__3243.doInvoke(core.clj:390)
at clojure.lang.RestFn.invoke(RestFn.java:443)
at clojure.contrib.zip_filter.xml$seq_test__111$fn__113.invoke(xml.clj:55)
at clojure.contrib.zip_filter$fixup_apply__60.invoke(zip_filter.clj:76)
at clojure.contrib.zip_filter$mapcat_chain__65$fn__67$fn__69.invoke(zip_filter.clj:88)
at clojure.core$map__3815$fn__3817.invoke(core.clj:1503)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.Cons.next(Cons.java:37)
at clojure.lang.RT.next(RT.java:560)
at clojure.core$next__3117.invoke(core.clj:50)
at clojure.core$concat__3255$cat__3269$fn__3270.invoke(core.clj:428)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.RT.seq(RT.java:436)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.RT.seq(RT.java:436)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.RT.seq(RT.java:436)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.RT.seq(RT.java:436)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.RT.seq(RT.java:436)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.RT.seq(RT.java:436)
at clojure.lang.LazySeq.seq(LazySeq.java:41)
at clojure.lang.RT.seq(RT.java:436)
at clojure.lang.LazySeq.seq(LazySeq.java:41)