Re: [CODE4LIB] apache hadoop

2018-12-19 Thread Eric Lease Morgan
Thank you for the replies, and now I am aware of three different tools for parallel/distributed computing: 1. GNU Parallel (https://www.gnu.org/software/parallel/) - 'Works great for me. Easy to install. Easy to use. Given well-written programs, can be used to take advantage of all the cores

Re: [CODE4LIB] apache hadoop

2018-12-18 Thread Erik Hatcher
These days, in my industry experience, the predominant way to go is Spark for distributed work, rather than Hadoop. Since many of you have your catalog in Solr (and Blacklight, blush, thank you), you could straightforwardly leverage our open source spark-solr library. https://github.com/L

Re: [CODE4LIB] apache hadoop

2018-12-18 Thread Roy Tennant
One other aspect that occurred to me is that since I had essentially been a procedural programmer, wrapping my head around the Map/Reduce paradigm took a while. Specifically, it took me a while to understand how to use the M/R paradigm for particular processing tasks. I suppose a key breakthrough f

Re: [CODE4LIB] apache hadoop

2018-12-18 Thread Stephen Meyer
Hi Eric, I have extremely limited experience with one small set of tests, but wanted to share a couple of quick book recommendations that helped me run MapReduce jobs in Hadoop. First, with a shout out to the Spark in the Dark reading club that formed after last year's conference, see the chap

Re: [CODE4LIB] apache hadoop

2018-12-17 Thread Roy Tennant
Péter provided a good start; I just wanted to mention that, using the "streaming" option, you can write code in pretty much whatever language you want, certainly Python and Perl. I've even mixed and matched, where my mapping program is in Python and my reducing program (optional since the mapper might just be
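
As a rough illustration of the "streaming" approach mentioned above, here is a minimal word-count sketch in Python. It is not taken from the thread: the function names (`mapper`, `reducer`, `run_locally`) are hypothetical, and it assumes the usual streaming convention of tab-separated key/value lines, with the reducer receiving its input sorted by key.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated "word<TAB>1" record per word (hypothetical helper)."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each key; assumes the records arrive sorted by key."""
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> sort -> reduce in a single process, without a cluster."""
    return list(reducer(sorted(mapper(lines))))
```

In a real streaming job, the mapper and reducer would presumably be separate scripts that read from standard input and print to standard output, which is what makes it possible to mix languages between the two steps. Running `run_locally` on a handful of lines is a cheap way to check the logic before involving a cluster.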

Re: [CODE4LIB] apache hadoop

2018-12-17 Thread Péter Király
Hi Eric, sounds like an interesting project! You have multiple choices. Hadoop is at least two kinds of things: a distributed file system and a distributed computation engine with its own API. If you upload files to the file system, Hadoop will distribute them in a safe way. The basic idea of the comput

[CODE4LIB] apache hadoop

2018-12-17 Thread Eric Lease Morgan
What is your experience with Apache Hadoop? I have very recently been granted root privileges on as many as three virtual machines. Each machine has forty-four cores, and more hard disk space & RAM than I really know how to exploit. I got access to these machines to work on a project I call The