I have been working on a program that takes a web page and extracts all the links from the body of the HTML. I am using clj-tagsoup <https://github.com/nathell/clj-tagsoup> to create a tree structure from the HTML page.
The current issue I am running into is traversing the tree structure and pulling out all the links. I have a function that parses the tree using a for loop and recursion, but it does not feel very idiomatic. The list it returns is filled with vectors of empty lists and nil values. I can flatten out the data structure and grab everything I need out of it, but it feels clunky. I am looking for some tips on how I could improve my code, since this is the first complicated Clojure program I have written.

Here is the code I have written for extracting the a tags out of the HTML tree.

    (defn get-tags
      ([tag html] (get-tags tag [] html))
      ([tag found html]
         (if html
           (for [el html]
             (if (vector? el)
               (if (= (soup/tag el) tag)
                 (conj found el)
                 (->> (soup/children el)
                      (remove #(or (string? %) (nil? %)))
                      (get-tags tag found)
                      (conj found))))))))

It gets called with something like this. Normally the site would be a lot bigger, but I deleted a lot of the tree for this post.

    (def html-tree
      [:body {}
       [:a {:href "conditionedtransiti.php", :shape "rect", :style "display: none;"} "triangular-nordic"]
       [:table {}
        [:tr {}
         [:td {:colspan "1", :rowspan "1"}
          [:a {:href "/files/", :shape "rect"}
           [:img {:src "truck.gif", :title "Slug's File Archive"}]]]
         [:td {:colspan "1", :rowspan "1"}
          [:a {:href "/docs/", :shape "rect"}
           [:img {:src "magnify.gif"}]]]
         [:td {:colspan "1", :rowspan "1"}
          [:a {:href "http://forecast.weather.gov/MapClick.php?lat=35.045627427000454&lon=-85.30967786199966", :shape "rect"} "Forecast"]
          [:br {:clear "none"}]
          [:a {:href "http://radar.weather.gov/radar.php?rid=htx&product=N0R&overlay=11101111&loop=no", :shape "rect"} "Radar"]
          [:br {:clear "none"}]
          [:a {:href "http://news.google.com/", :shape "rect"} "News"]
          [:br {:clear "none"}]]
         [:td {:colspan "1", :rowspan "1"}
          [:a {:href "http://reddit.com", :shape "rect"} "Reddit"]
          [:br {:clear "none"}]
          [:a {:href "http://digg.com", :shape "rect"} "Digg"]
          [:br {:clear "none"}]]]]])

    (get-tags :a html-tree)

This evaluates to

    (nil
     nil
     [[:a {:href "conditionedtransiti.php", :shape "rect", :style "display: none;"} "triangular-nordic"]]
     [([([([[:a {:href "/files/", :shape "rect"} [:img {:src "truck.gif", :title "Slug's File Archive"}]]])]
         [([[:a {:href "/docs/", :shape "rect"} [:img {:src "magnify.gif"}]]])]
         [([[:a {:href "http://forecast.weather.gov/MapClick.php?lat=35.045627427000454&lon=-85.30967786199966", :shape "rect"} "Forecast"]]
           [()]
           [[:a {:href "http://radar.weather.gov/radar.php?rid=htx&product=N0R&overlay=11101111&loop=no", :shape "rect"} "Radar"]]
           [()]
           [[:a {:href "http://news.google.com/", :shape "rect"} "News"]]
           [()])]
         [([[:a {:href "http://reddit.com", :shape "rect"} "Reddit"]]
           [()]
           [[:a {:href "http://digg.com", :shape "rect"} "Digg"]]
           [()])])])])

-- 
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en
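
For the traversal described in this post, one possible direction is to let clojure.core/tree-seq do the walking and then filter the resulting flat sequence, instead of building nested results by hand. The sketch below is only an illustration under stated assumptions: it assumes every element node is a vector shaped [tag attrs & children], as in the tree shown in this post, and the names find-tags, hrefs, and sample-tree are hypothetical. It uses plain vector operations rather than any tagsoup function, and unlike the original function it also descends into elements that already match.

```clojure
;; Assumption: element nodes are vectors of the form [tag attrs & children];
;; strings, nil, and anything else are treated as leaves.

(defn find-tags
  "Hypothetical helper: lazy seq of every element in `tree` whose tag
  equals `tag`, in depth-first document order."
  [tag tree]
  (->> (tree-seq vector? #(drop 2 %) tree)          ; walk all nodes, skipping tag and attrs
       (filter #(and (vector? %) (= tag (first %)))))) ; keep only matching elements

(defn hrefs
  "Hypothetical helper: the :href attribute of every matching element.
  Elements without an :href are skipped."
  [tag tree]
  (keep #(:href (second %)) (find-tags tag tree)))

;; A small made-up tree in the same shape as the one in this post.
(def sample-tree
  [:body {}
   [:a {:href "/files/"} [:img {:src "truck.gif"}]]
   [:table {}
    [:tr {}
     [:td {}
      [:a {:href "http://reddit.com"} "Reddit"]
      [:br {:clear "none"}]]]]])

(hrefs :a sample-tree)
;; => ("/files/" "http://reddit.com")
```

Because tree-seq returns one flat lazy sequence, there are no nil or empty-list placeholders to clean up afterwards, which is the part that felt clunky in the for-based version.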