Hello everybody,

I’m Matteo Ceccarello, a PhD student in Computer Engineering from 
university of
Padova, Italy. Recently, I’ve started working with Clojure, a language that 
I
find both powerful and fun. I’ve participated to GSoC 2013 and 2012 with
JPF <http://babelfish.arc.nasa.gov/trac/jpf/wiki>. I’m interested in
participating to GSoC 2014 with Clojure, in order to gain confidence with
clojure under the guidance of a mentor while contributing to the community. 
I’ll
describe what I’d like to do below.
Motivation 

Currently, alongside my research on parallel graph algorithms, I’ve started
writing a web crawler in Clojure. The reasons I want to do this in clojure 
are
many:

   - The web crawling software we are using
   (Heritrix <https://crawler.archive.org>) performs blocking IO. I think 
   that in
   big crawls this is a showstopper, since to achieve a good throughput one 
   must
   use many Java threads (in the order of 500) that end up spending most of 
   their
   time performing blocking waits. 
   - Clojure has the wonderful core.async library, with all the magic that 
   go
   blocks and channels bring. 
   - The http-kit library makes possible to perform async requests to web
   servers in a very simple way. 
   - Clojure has an awesome support for concurrency. 

What I want to achieve eventually is a concurrent asynchronous web crawler,
without a single blocking call.

Since I’m new to the language, before starting to get some real work done, 
I’ve
worked on a few small projects, to get a feel of the language. What I 
learned is
that everything you put in a ref or atom must be immutable. A fundamental
component of a web crawler is a data structure that tells you if you have
already visited a URL. Bloom filters are a common choice for the 
implementation
of this component. Since this bloom filter will be put in a ref, I want it 
to
be immutable and, for efficiency reasons, persistent. After some research 
on the
web, I found out that an immutable persistent implementation of the bloom 
filter
is still missing from the Clojure ecosystem.

Since it seems that Clojure is missing a persistent implementation of many
probabilistic data structures, I came up with the following idea.
Persistent probabilistic data structures for Clojure 

Clojure seems to miss persistent implementations of many useful 
probabilistic
data structures, in particular:

   - Bloom filters <http://webhdd.ru/library/files/10.1.1.127.9672.pdf> 
   - Counting Bloom filters<http://webhdd.ru/library/files/10.1.1.127.9672.pdf> 
   - Compact approximators <http://arxiv.org/pdf/cs.DS/0306046.pdf> 
   - HyperLogLog 
counters<http://www.dmtcs.org/dmtcs-ojs/index.php/proceedings/article/viewPDFInterstitial/dmAH0110/2100>
 
   - Count-min 
sketches<http://twiki.di.uniroma1.it/pub/Ing_algo/WebHome/p14_Cormode_JAl_05.pdf>
 

Of course one can use one of the various implementations of these data 
structures for
Java, however, being these implementations mutable, they cannot be used
in idiomatic concurrent clojure code (as for my understanding of idiomatic
concurrent clojure code).

What I propose to realize is an optimized persistent implementation of these
libraries. Hence I plan to explore different paths using
benchmarks <https://github.com/hugoduncan/criterium> as a guide,
for instance to decide whether it’s more convenient to use standard clojure 
vectors
to represent bit vectors or to provide a custom persistent
bit vector implementation.

Since I like the feeling of having a static type checker that
prevents common errors, the library will be annotated with
Typed Clojure <http://typedclojure.org/>. Moreover I find interesting the 
idea
of using tests as machine checkable documentation, and
midje-doc <http://docs.caudate.me/lein-midje-doc/> seems the right tool for 
the job.

Is there someone interested in mentoring me with this project?

Yours sincerely
Matteo

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Reply via email to