Re: HttpKit, Enlive (html retrieval and parsing)

2014-01-13 Thread Jarrod Swart
This is exactly what I do and it works great!

On Saturday, January 11, 2014 7:00:22 PM UTC-5, Jan Herich wrote:

 I don't recommend using Java's built-in HTTP retrieval (by passing a
 java.net.URL object to enlive's html-resource function).
 Not only is it significantly slower than using clj-http (which uses the
 Apache HTTP client under the hood), but it's also unreliable
 when issuing multiple parallel requests.
 The current enlive library supports pluggable parsers. The default one is
 TagSoup, but you can easily switch it, for example
 to JSoup, by setting the *parser* dynamic var.
 You can have a look at one of my little projects where I used enlive for
 HTML scraping here:
 https://github.com/janherich/lazada-quest/blob/master/src/lazada_quest/scrapper.clj
 In this case, I used clj-http as the HTTP client:

 (ns lazada-quest.scrapper
   (:require [clojure.string :as string]
             [clj-http.client :as client]
             [net.cgrand.enlive-html :as html]))

 (defn fetch-url
   "Given a url string, fetch the HTML content of the resource served at that
    address and return it in the form of enlive nodes."
   [url]
   (html/html-resource (:body (client/get url {:as :stream}))))

 It would be straightforward to replace clj-http with http-kit's
 synchronous API, or, with some changes, its asynchronous API.






-- 
-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


HttpKit, Enlive (html retrieval and parsing)

2014-01-11 Thread Dave Tenny
I'm just playing around with tool kits to retrieve and parse html from web 
pages and files that I already have on disk (such as JDK API documentation).

Based on too little time, it looks like [http-kit 2.1.16] will retrieve 
but not parse html, and [enlive 1.1.5] will retrieve AND parse html.

Or is there a whole built-in parse capability I'm missing in http-kit?

Also, http-kit doesn't seem to want to retrieve content from a file:/// 
url, whereas enlive is happy with both local and remote content.

I'm just messing around, I wanted to have some REPL javadoc logic that 
didn't fire up a browser or use the swing app (whose fonts are unreadable 
for me, and half a day spent trying to change it was not fruitful).

Any tips or suggestions?  Just want to make sure I'm not missing obvious 
things.

Thanks!





Re: HttpKit, Enlive (html retrieval and parsing)

2014-01-11 Thread Dave Tenny
I was using net.cgrand.enlive-html/html-resource and org.httpkit.client/get 
for the page retrievals.








Re: HttpKit, Enlive (html retrieval and parsing)

2014-01-11 Thread Matching Socks
Java has HTTP retrieval built in.  Clojure's core functions can use file or 
http URLs:

user> (slurp "http://google.com")

user> (slurp "file:///etc/passwd")

Parsing HTML on the other hand is a question of not just science but also 
art.  Doesn't enlive use Tag Soup?
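
For anyone who wants to try the file: URL form without reading a system file, here is a minimal sketch using only clojure.core. The helper name slurp-temp-html and the sample markup are made up for illustration; it writes a temporary file, reads it back through its file: URL, and deletes it.

```clojure
(defn slurp-temp-html
  "Write a small HTML document to a temp file and read it back via its file: URL.
   Hypothetical helper, for illustration only."
  []
  (let [f (java.io.File/createTempFile "sample" ".html")]
    (spit f "<html><body><p>hello</p></body></html>")
    (try
      ;; (.toURI f) yields a file: URI; slurp accepts its string form as a URL
      (slurp (str (.toURI f)))
      (finally
        (.delete f)))))
```

The result is the raw markup as a string. Turning it into a node tree is still a separate parsing step.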



Re: HttpKit, Enlive (html retrieval and parsing)

2014-01-11 Thread Jan Herich
I don't recommend using Java's built-in HTTP retrieval (by passing a
java.net.URL object to enlive's html-resource function).
Not only is it significantly slower than using clj-http (which uses the
Apache HTTP client under the hood), but it's also unreliable
when issuing multiple parallel requests.
The current enlive library supports pluggable parsers. The default one is
TagSoup, but you can easily switch it, for example
to JSoup, by setting the *parser* dynamic var.
You can have a look at one of my little projects where I used enlive for
HTML scraping here:
https://github.com/janherich/lazada-quest/blob/master/src/lazada_quest/scrapper.clj
In this case, I used clj-http as the HTTP client:

(ns lazada-quest.scrapper
  (:require [clojure.string :as string]
            [clj-http.client :as client]
            [net.cgrand.enlive-html :as html]))

(defn fetch-url
  "Given a url string, fetch the HTML content of the resource served at that
   address and return it in the form of enlive nodes."
  [url]
  (html/html-resource (:body (client/get url {:as :stream}))))

It would be straightforward to replace clj-http with http-kit's
synchronous API, or, with some changes, its asynchronous API.
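
As a rough illustration of that swap, here is a minimal sketch, not taken from the linked project. It assumes http-kit and enlive are on the classpath. The namespace example.fetch and the helper parse-stream are hypothetical names, and the error handling is just one possible choice. Dereferencing the promise returned by org.httpkit.client/get gives the synchronous behaviour, and the two-argument arity of parse-stream shows how *parser* could be rebound to enlive's JSoup parser.

```clojure
(ns example.fetch
  (:require [org.httpkit.client :as http]
            [net.cgrand.enlive-html :as html]
            [net.cgrand.jsoup :as jsoup]))

(defn parse-stream
  "Parse an InputStream into enlive nodes. With one argument the currently
   bound parser is used (TagSoup by default); with two, the given parser."
  ([stream]
   (html/html-resource stream))
  ([stream parser]
   (binding [html/*parser* parser]
     (html/html-resource stream))))

(defn fetch-url
  "Fetch url with http-kit and return enlive nodes.
   Dereferencing the promise blocks until the response (or an error) arrives."
  [url]
  (let [{:keys [body error]} @(http/get url {:as :stream})]
    (if error
      ;; http-kit reports failures in :error instead of throwing
      (throw (ex-info "HTTP request failed" {:url url} error))
      (parse-stream body))))
```

Calling (parse-stream some-stream jsoup/parser) would parse with JSoup instead of TagSoup. For the asynchronous variant, one option would be to pass a callback to http/get and do the parsing there.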





