Hello everybody,
I’m Matteo Ceccarello, a PhD student in Computer Engineering from university of Padova, Italy. Recently, I’ve started working with Clojure, a language that I find both powerful and fun. I’ve participated to GSoC 2013 and 2012 with JPF <http://babelfish.arc.nasa.gov/trac/jpf/wiki>. I’m interested in participating to GSoC 2014 with Clojure, in order to gain confidence with clojure under the guidance of a mentor while contributing to the community. I’ll describe what I’d like to do below. Motivation Currently, alongside my research on parallel graph algorithms, I’ve started writing a web crawler in Clojure. The reasons I want to do this in clojure are many: - The web crawling software we are using (Heritrix <https://crawler.archive.org>) performs blocking IO. I think that in big crawls this is a showstopper, since to achieve a good throughput one must use many Java threads (in the order of 500) that end up spending most of their time performing blocking waits. - Clojure has the wonderful core.async library, with all the magic that go blocks and channels bring. - The http-kit library makes possible to perform async requests to web servers in a very simple way. - Clojure has an awesome support for concurrency. What I want to achieve eventually is a concurrent asynchronous web crawler, without a single blocking call. Since I’m new to the language, before starting to get some real work done, I’ve worked on a few small projects, to get a feel of the language. What I learned is that everything you put in a ref or atom must be immutable. A fundamental component of a web crawler is a data structure that tells you if you have already visited a URL. Bloom filters are a common choice for the implementation of this component. Since this bloom filter will be put in a ref, I want it to be immutable and, for efficiency reasons, persistent. After some research on the web, I found out that an immutable persistent implementation of the bloom filter is still missing from the Clojure ecosystem. Since it seems that Clojure is missing a persistent implementation of many probabilistic data structures, I came up with the following idea. Persistent probabilistic data structures for Clojure Clojure seems to miss persistent implementations of many useful probabilistic data structures, in particular: - Bloom filters <http://webhdd.ru/library/files/10.1.1.127.9672.pdf> - Counting Bloom filters<http://webhdd.ru/library/files/10.1.1.127.9672.pdf> - Compact approximators <http://arxiv.org/pdf/cs.DS/0306046.pdf> - HyperLogLog counters<http://www.dmtcs.org/dmtcs-ojs/index.php/proceedings/article/viewPDFInterstitial/dmAH0110/2100> - Count-min sketches<http://twiki.di.uniroma1.it/pub/Ing_algo/WebHome/p14_Cormode_JAl_05.pdf> Of course one can use one of the various implementations of these data structures for Java, however, being these implementations mutable, they cannot be used in idiomatic concurrent clojure code (as for my understanding of idiomatic concurrent clojure code). What I propose to realize is an optimized persistent implementation of these libraries. Hence I plan to explore different paths using benchmarks <https://github.com/hugoduncan/criterium> as a guide, for instance to decide whether it’s more convenient to use standard clojure vectors to represent bit vectors or to provide a custom persistent bit vector implementation. Since I like the feeling of having a static type checker that prevents common errors, the library will be annotated with Typed Clojure <http://typedclojure.org/>. Moreover I find interesting the idea of using tests as machine checkable documentation, and midje-doc <http://docs.caudate.me/lein-midje-doc/> seems the right tool for the job. Is there someone interested in mentoring me with this project? Yours sincerely Matteo -- You received this message because you are subscribed to the Google Groups "Clojure" group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en --- You received this message because you are subscribed to the Google Groups "Clojure" group. To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out.