Not right now. If it matters: I had built a one-off Hadoop-like system on EC2 that had a lot of J components (J and shell scripts, mostly). But right now I am debugging someone else's existing Hadoop-based implementation, and that little essay was as much for my benefit (putting some of my thoughts in order) as it was for yours (assuming you got any benefit from it).
Thanks,

--
Raul

On Fri, Dec 26, 2014 at 10:59 AM, Jon Hough <[email protected]> wrote:
> Happy holidays!
>
> I'm just a little curious. Are you using Hadoop with J?
>
>> From: [email protected]
>> Date: Fri, 26 Dec 2014 05:41:10 -0500
>> To: [email protected]
>> Subject: [Jchat] hadoop map/reduce from a J user's perspective
>>
>> Hadoop's use of the "map/reduce" term would suggest, to the J
>> programmer, something like:
>>
>>    v/ u"0
>>
>> or maybe:
>>
>>    v/ u"1
>>
>> But that's actually not a very good representation of what it's
>> doing. This is closer:
>>
>>    v;.2 ;/:~<;.2 u;.2
>>
>> (Oh, and early releases of Hadoop did not have the "reduce" step;
>> they were just a mechanism to farm jobs out across a bunch of
>> different machines.)
>>
>> More specifically, it's set up to handle a unix pipeline (reading
>> from stdin, writing to stdout, implicitly operating on lines
>> terminated by ASCII character 10), farming the job out across
>> multiple machines for the "map" operation, then sorting the result
>> and farming that out again across multiple machines for the
>> "reduce" operation.
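To make that picture concrete, here is a minimal worked example of the same line-oriented map/sort/reduce cycle as a J session. The data and names are invented for illustration (a wordcount-style job where "map" just passes each line through as a key):

   in =: 'apple',LF,'banana',LF,'apple',LF  NB. three LF-terminated "lines"
   lines =: <;._2 in               NB. cut on the trailing LF fret: the map inputs
   sorted =: /:~ lines             NB. the sort/shuffle between map and reduce
   (~. sorted) ,. <"0 #/.~ sorted  NB. "reduce": count the copies of each key
┌──────┬─┐
│apple │2│
├──────┼─┤
│banana│1│
└──────┴─┘

The real thing differs mainly in that the map and reduce stages run as separate processes on separate machines, with the sort done for you in between.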
>> Also, "u" and "v" are expected to be java classes (and you can
>> specify which *.jar file contains those classes), though there are
>> prebuilt implementations which let you run any arbitrary unix
>> command (for example, hadoop-streaming.jar).
>>
>> (Of course, there are a lot of ornate details, most of which
>> usually get ignored, having to do with tuning the environment
>> and/or catering to somewhat arbitrary constraints of somewhat
>> arbitrary programs. Also, there are numerous versions and flavors,
>> all slightly incompatible.)
>>
>> But what if the data you are processing is not "lines" but
>> something else? Files? Directories extracted from archives? Etc.?
>>
>> Well, for that case you should probably set things up so that a
>> "line" is a descriptor of work to be done, and have the code being
>> run do the real downloading/processing/uploading, while stdout
>> becomes lines briefly summarizing the work which was done. (This
>> assumes you are only "mapping"; a sketch of this pattern follows
>> below this message.) Alternatively, if you are "reducing", the
>> lines output from the mapper would describe the next phase of work
>> to be done (and this assumes that sorting is a useful mechanism in
>> this context). There are also resource constraints which you must
>> not run afoul of: too much time, too much memory, or too much disk,
>> and a process can get shut down. There's some variation in what
>> those limits are, but they are relatively small by default --
>> maybe a gigabyte of memory and 10 minutes of processing time, for
>> example.
>>
>> (On the flip side, once you are running code in a Hadoop cluster,
>> you pretty much have unrestricted access to the machine -- as long
>> as you stay within your resource constraints -- since the machine
>> instance gets discarded when you are done with it.)
>>
>> Finally, speed: when running a Hadoop job, it takes several seconds
>> to set up a small cluster and minutes to set up a larger one. I've
>> not pushed the boundaries enough to know what a really large job
>> looks like. The good documentation seems to be mostly for early
>> versions of Hadoop (capable of handling only 4000 machines in a
>> cluster), and I have noticed that some features documented for
>> early versions are no longer present [or, at least, are
>> syntactically different] in more recent versions.
>>
>> But I guess that's pretty normal for all software projects. Makes
>> it a pain to use, though, when old documentation just plain doesn't
>> work and you can't find the relevant newer documentation. (And,
>> sadly, search engines seem to be getting worse and worse at
>> locating relevant documentation, as opposed to auto-generated
>> spam... though perhaps that also has to do with my idea of
>> relevance?)
>>
>> Anyways, enough musing for now; time to get back to debugging...
>>
>> Thanks,
>>
>> --
>> Raul
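The "work descriptor" pattern above is the part that matters most for non-line data, so here is a minimal sketch of what such a streaming mapper could look like in J. Everything specific here is hypothetical -- the script name, the process verb, and the invocation in the comments -- and stdin, stdout, each, and exit are assumed to come from J's standard profile:

   NB. mapper.ijs -- sketch of a hadoop-streaming "mapper" in J
   NB. hypothetical invocation (paths and flags vary by Hadoop version):
   NB.   hadoop jar hadoop-streaming.jar -input work -output done
   NB.     -mapper 'jconsole mapper.ijs' -file mapper.ijs
   work =: <;._2 stdin ''           NB. all of stdin as LF-terminated lines
   process =: 3 : 'y , '' done'''   NB. stand-in for the real download/process/upload
   summaries =: process each work   NB. one brief summary per work item
   stdout ; (,&LF) each summaries   NB. re-terminate with LF, write to stdout
   exit ''                          NB. don't fall through into the interpreter

Each input line names a unit of work; a real process verb would fetch that work, do the heavy lifting locally, push the result somewhere durable, and return a one-line summary, which keeps stdout small and the job within the default resource limits described above.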
