I have a plain text file containing an English-language essay that I'd like to split into sentences, based on the presence of punctuation.
I wrote this function to determine whether a given character is an English punctuation mark:

    (defn ispunc? [c]
      (> (count (filter #(= % c) '("." "!" "?" ";"))) 0))

I know that this method is not grammatically perfect, in that acronyms such as "U.S." will get mis-parsed, etc., but this is just an experiment and does not need that level of precision.

Then I tried applying it with partition-by to a file I've slurped:

    (def my-text (slurp "mytext.txt"))
    (def my-sentences (partition-by ispunc? my-text))

Unfortunately, this returns a sequence of one element, which contains the entire text, since ispunc? depends on looking at a single character. So I tried producing a list of chars from the string and passing it to partition-by with ispunc?, like this:

    (def my-text-chars (partition (count my-text) my-text))
    (def my-sentences (partition-by ispunc? (first my-text-chars)))

That worked, in that it's logically "correct", but when I try to access any of the elements in my-sentences I get a java.lang.OutOfMemoryError (the source text file, "mytext.txt", is 1.3 MB in size).

So is there a simpler and more idiomatic way of doing this without using up all the heap space?
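
For reference, here is a minimal sketch of one way the splitting could be approached. It rests on one observation: a Clojure string is traversed as a sequence of characters (\. rather than "."), and a character never compares equal to a string, so a predicate that checks against a list of strings returns false for every character of the text. The names sentence-enders, punc?, and split-sentences are illustrative only, and using re-seq instead of partition-by is a swapped-in technique, not something taken from the post.

```clojure
(require '[clojure.string :as str])

;; Character literals, not strings: (seq "a.b") yields \a \. \b,
;; and (= \. ".") is false.
;; `sentence-enders` is a hypothetical name for this sketch.
(def sentence-enders #{\. \! \? \;})

(defn punc?
  "True if c is one of the characters in sentence-enders."
  [c]
  (contains? sentence-enders c))

(defn split-sentences
  "Returns a lazy seq of trimmed chunks, each a run of non-punctuation
  followed by any punctuation that terminates it. Like the original
  predicate, this will mis-split abbreviations such as \"U.S.\"."
  [text]
  (->> (re-seq #"[^.!?;]+[.!?;]*" text)
       (map str/trim)
       (remove str/blank?)))

(split-sentences "Hello there. How are you? Fine!")
;; => ("Hello there." "How are you?" "Fine!")
```

Because re-seq walks the string with a regular expression matcher and produces one string per match, it avoids building a separate sequence object for every character of the file. Whether that alone resolves the OutOfMemoryError described above is an assumption that would need to be checked against the actual 1.3 MB input.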