I have a plain text file containing an English-language essay that I'd like 
to split into sentences, based on the presence of punctuation.

I wrote this function to determine if a given character is an English 
punctuation mark:

(defn ispunc? [c]
  (> (count (filter #(= % c) '("." "!" "?" ";"))) 0))

I know that this method is not grammatically perfect, in that acronyms such 
as "U.S." will get mis-parsed, etc., but this is just an experiment and 
does not need that level of precision.
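
One detail that may be relevant here: the list in ispunc? holds one-character strings, whereas walking a string in Clojure yields characters, and a character never equals a string. Below is a minimal sketch of the same predicate written against character literals. The names sentence-enders and punc-char? are hypothetical and are not part of the code above.

```clojure
;; Hypothetical names; the set holds character literals, not strings.
(def sentence-enders #{\. \! \? \;})

(defn punc-char?
  "True when c is one of the sentence-ending characters in the set."
  [c]
  (contains? sentence-enders c))

(punc-char? \.)   ; true:  a character found in the set
(punc-char? ".")  ; false: a one-character string is not a character
(= \. ".")        ; false: a character and a string are never equal
```

A set lookup also replaces the filter/count pair, so the predicate does no counting per character.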

Then, I tried applying it with partition-by on a file I've slurped:

(def my-text (slurp "mytext.txt"))
(def my-sentences (partition-by ispunc? my-text))

Unfortunately, this returns a sequence of length 1, whose first and only 
element contains the entire text. My guess is that this happens because 
ispunc? depends on looking at a single character.
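
To illustrate how partition-by groups its input, here is a small sketch on a short literal string rather than the real file. It assumes a predicate built from a set of character literals; the names enders, groups, as-strings, and one-group are hypothetical. partition-by starts a new group whenever the predicate's value changes, so a predicate that returns the same value for every element produces exactly one group.

```clojure
;; Hypothetical names; a short literal stands in for the slurped text.
(def enders #{\. \! \? \;})

;; A string is walked character by character; a new group starts each
;; time the predicate's result flips between false and true.
(def groups (partition-by #(contains? enders %) "Hi. Yo!"))
;; groups is ((\H \i) (\.) (\space \Y \o) (\!))

;; Join each group of characters back into a string.
(def as-strings (map #(apply str %) groups))
;; as-strings is ("Hi" "." " Yo" "!")

;; A predicate that never changes value yields a single group.
(def one-group (partition-by (constantly false) "Hi. Yo!"))
;; (count one-group) is 1
```

Note that the punctuation marks end up in groups of their own, so they would still need to be re-attached to the preceding text to form sentences.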

So I tried producing a list of chars from the string and passing it to 
partition-by with ispunc? like this:

(def my-text-chars (partition (count my-text) my-text))
(def my-sentences (partition-by ispunc? (first my-text-chars)))

That worked, in that it's logically "correct", but when I try to access any 
of the elements in my-sentences I get a java.lang.OutOfMemoryError (the 
source text file, "mytext.txt", is 1.3 MB in size).

So is there a simpler and more idiomatic way of doing this without using up 
all the heap space?
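
One possible direction, offered as a sketch rather than a definitive answer: re-seq with a regular expression works on the string directly, so no intermediate sequence of individual characters is built. The helper name split-sentences and the pattern are assumptions, chosen to match the four punctuation marks used in ispunc?. This has not been measured against the 1.3 MB file.

```clojure
(require '[clojure.string :as str])

(defn split-sentences
  "Hypothetical helper: returns a lazy seq of trimmed sentences, each
  made of a run of non-punctuation characters followed by any trailing
  run of . ! ? or ; characters."
  [text]
  (->> (re-seq #"[^.!?;]+[.!?;]*" text)
       (map str/trim)
       (remove str/blank?)))

(split-sentences "Hello world. How are you? Fine; thanks!")
;; ("Hello world." "How are you?" "Fine;" "thanks!")
```

With the file from above this would be called as (split-sentences (slurp "mytext.txt")). Like the original approach, it will still split acronyms such as "U.S." in the wrong place.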

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en