Re: Issues with ResourceManager scheduling functions

2013-12-09 Thread Sandy Ryza
to them? Can you give an example? Many thanks. Bill On Mon, Dec 9, 2013 at 3:30 AM, Sandy Ryza sandy.r...@cloudera.com wrote: YARN currently is unable to handle requests with different resource requirements at the same priority (YARN-314). Using different priorities would likely solve

Re: Perfect configuration setting

2013-12-03 Thread Sandy Ryza
Hi Geelong, Check out Todd Lipcon's presentation on tuning MapReduce performance: http://www.slideshare.net/cloudera/mr-perf -Sandy On Mon, Dec 2, 2013 at 11:14 PM, Geelong Yao geelong...@gmail.com wrote: Hi Everyone I am now testing the best performance of my cluster Can anyone give me

Re: Perfect configuration setting

2013-12-03 Thread Sandy Ryza
allowed. On Tue, Dec 3, 2013 at 4:26 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Geelong, Check out Todd Lipcon's presentation on tuning MapReduce performance: http://www.slideshare.net/cloudera/mr-perf -Sandy On Mon, Dec 2, 2013 at 11:14 PM, Geelong Yao geelong...@gmail.com wrote

Re: MRV2 job takes too long to start

2013-11-28 Thread Sandy Ryza
What scheduler are you using? What do you mean by start? For the first map task to start? -Sandy On Thu, Nov 28, 2013 at 6:07 AM, Juan Martin Pampliega jpampli...@gmail.com wrote: Hi, I have a map-reduce job that was developed for MRV1 and is now being run daily with no modifications in

Re: Reply: problems of FairScheduler in hadoop2.2.0

2013-11-27 Thread Sandy Ryza
/allocations *From:* Sandy Ryza [mailto:sandy.r...@cloudera.com] *Sent:* November 27, 2013 16:33 *To:* user@hadoop.apache.org *Subject:* Re: problems of FairScheduler in hadoop2.2.0 Hi, Can you share the contents of your fair-scheduler.xml? If you submit just a single job, does it run? What do you

Re: Any reference for upgrade hadoop from 1.x to 2.2

2013-11-22 Thread Sandy Ryza
For MapReduce and YARN, we recently published a couple blog posts on migrating: http://blog.cloudera.com/blog/2013/11/migrating-to-mapreduce-2-on-yarn-for-users/ http://blog.cloudera.com/blog/2013/11/migrating-to-mapreduce-2-on-yarn-for-operators/ hope that helps, Sandy On Fri, Nov 22, 2013 at

Re: Limit on total jobs running using fair scheduler

2013-11-19 Thread Sandy Ryza
Unfortunately, this is not possible in the MR1 fair scheduler without setting the jobs for individual pools. In MR2, fair scheduler hierarchical queues will allow setting maxRunningApps at the top of the hierarchy, which would have the effect you're looking for. -Sandy On Tue, Nov 19, 2013 at

Re: Allocating Containers on a particular Node in Yarn

2013-11-14 Thread Sandy Ryza
()); amRmClient.releaseAssignedContainer(containerId); } return amRmClient.allocate(0); -Gaurav On 11/13/2013 07:36 PM, Sandy Ryza wrote: In that case, the AMRMClient code looks correct to me. Can you share the code you've written against it that's not receiving the correct

Re: Allocating Containers on a particular Node in Yarn

2013-11-14 Thread Sandy Ryza
on and relax locality set to true without requesting rack, I don’t get the containers on the required host What scheduler are you using and what properties are you using to turn the scheduler delay on? Thanks -Gaurav *From:* Sandy Ryza [mailto:sandy.r...@cloudera.com] *Sent:* Thursday

Re: Allocating Containers on a particular Node in Yarn

2013-11-14 Thread Sandy Ryza
by default, set to -1. </description> </property> -Gaurav *From:* Sandy Ryza [mailto:sandy.r...@cloudera.com] *Sent:* Thursday, November 14, 2013 12:41 PM *To:* user@hadoop.apache.org *Subject:* Re: Allocating Containers on a particular Node in Yarn Great to hear. Other

Re: Allocating Containers on a particular Node in Yarn

2013-11-13 Thread Sandy Ryza
-Gaurav On 11/13/2013 4:02 PM, Sandy Ryza wrote: Yeah, specifying a host name with relaxLocality is meaningful. Schedulers use delay scheduling ( http://www.cs.berkeley.edu/~matei/talks/2010/eurosys_delaysched.pdf) to achieve locality when relaxLocality is on. But it is turned off by default

Re: Allocating Containers on a particular Node in Yarn

2013-11-13 Thread Sandy Ryza
-Gaurav On 11/13/2013 4:24 PM, Sandy Ryza wrote: [moving to user list] Right. relaxLocality needs to be set on the next level up. It determines whether locality can be relaxed to that level. Confusing, I know. If you are using AMRMClient, you should be able to accomplish what you're

Re: Allocating Containers on a particular Node in Yarn

2013-11-13 Thread Sandy Ryza
-Gaurav On 11/13/2013 5:04 PM, gaurav wrote: I have hadoop-2.2.0 Thanks -Gaurav On 11/13/2013 4:59 PM, Sandy Ryza wrote: What version are you using? Setting the relax locality to true for nodes is always correct. For racks, this is not necessarily the case. When I look at trunk

Re: question about preserving data locality in MapReduce with Yarn

2013-10-31 Thread Sandy Ryza
that needs to consider HDFS data-locality. thx. r. On Mon, Oct 28, 2013 at 10:21 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Ricky, The input splits contain the locations of the blocks they cover. The AM gets the information from the input splits and submits requests for those locations

Re: question about preserving data locality in MapReduce with Yarn

2013-10-28 Thread Sandy Ryza
Hi Ricky, The input splits contain the locations of the blocks they cover. The AM gets the information from the input splits and submits requests for those locations. Each container request spans all the replicas that the block is located on. Are you interested in something more specific?

Re: Yarn never use TeraSort#TotalOrderPartitioner when run TeraSort job?

2013-10-18 Thread Sandy Ryza
Hi Sam, Have you tried changing the map or reduce classes and seeing if that has any effect? -Sandy On Fri, Oct 18, 2013 at 8:05 AM, Ravi Prakash ravi...@ymail.com wrote: Sam, I would guess that the jar file you think is running, is not actually the one. I am guessing that in the task

Re: State of Art in Hadoop Log aggregation

2013-10-11 Thread Sandy Ryza
Just a clarification: Cloudera Manager is now free for any number of nodes. Ref: http://www.cloudera.com/content/cloudera/en/products/cloudera-manager.html -Sandy On Fri, Oct 11, 2013 at 7:05 AM, DSuiter RDX dsui...@rdx.com wrote: Sagar, It sounds like you want a management console. We are

Re: Non data-local scheduling

2013-10-03 Thread Sandy Ryza
Hi Andre, Try setting yarn.scheduler.capacity.node-locality-delay to a number between 0 and 1. This will turn on delay scheduling - here's the doc on how this works: For applications that request containers on particular nodes, the number of scheduling opportunities since the last container

Re: Non data-local scheduling

2013-10-03 Thread Sandy Ryza
is a scheduling opportunity, how many are there?). It does not seem to be in the current documentation http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html 2013/10/3 Sandy Ryza sandy.r...@cloudera.com Hi Andre, Try setting yarn.scheduler.capacity.node-locality

Re: Cluster config: Mapper:Reducer Task Capacity

2013-09-30 Thread Sandy Ryza
Hi Himanshu, Changing the ratio is definitely a reasonable thing to do. The capacities come from the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum tasktracker configurations. You can tweak these on your nodes to get your desired ratio. -Sandy On Mon, Sep

Re: Which Subphases Do Times on JobHistory Web UI Cover

2013-09-24 Thread Sandy Ryza
Average map time includes everything the map task is doing, i.e. all the things you mentioned. Reduce time does not cover shuffle time. Reduce time is the time spent calling the reducer function and writing its output to HDFS. Merge time is related to reduce, not map. -Sandy On Tue, Sep 24,

Re: Semantics of ApplicationResourceUsageReport

2013-09-21 Thread Sandy Ryza
Hi Albert, You're correct about used. Reserved is a little bit more arcane - it refers to a mechanism that schedulers use to prevent applications with larger container sizes from starving. Applications place container reservations on nodes, and no other containers can be placed on the node

Re: YARN MapReduce 2 concepts

2013-09-19 Thread Sandy Ryza
Hi Mohit, answers inline On Fri, Sep 20, 2013 at 1:33 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I am going through the concepts of resource manager, application master and node manager. As I understand resource manager receives the job submission and launches application master. It also

Re: Scheduler question

2013-09-09 Thread Sandy Ryza
Hi John, YARN schedulers handle this with the concept of reservations. Scheduling decisions occur on node heartbeats. When a node that is full heartbeats, the next application that should be able to place a container on it gets to place a reservation on it. Each node has space for a single

Re: question about fair scheduler

2013-08-23 Thread Sandy Ryza
That's right that the other 2 apps will end up getting 10 resources each, but as more resources become released, eventually the cluster will converge to a fair state. I.e. if the first app requested additional resources after releasing resources, it would not receive any more until either another

Re: Is fair scheduler still experimental?

2013-08-22 Thread sandy . ryza
Moving to cdh-user, Hi, The Fair Scheduler in 4.3 is stable and is recommended by Cloudera. -Sandy On Aug 22, 2013, at 6:20 PM, ch huang justlo...@gmail.com wrote: hi, all: I use cdh4.3 yarn, its default scheduler is capacity scheduler, I want to switch to fair scheduler, but I

Re: Is there any way to use a hdfs file as a Circular buffer?

2013-08-15 Thread Sandy Ryza
Hi Lin, It might be worth checking out Apache Flume, which was built for highly parallel ingest into HDFS. -Sandy On Thu, Aug 15, 2013 at 11:16 AM, Adam Faris afa...@linkedin.com wrote: If every device can send its information as an 'event', you could use a publish-subscribe messaging

Re: Calling a MATLAB library in map reduce program

2013-08-14 Thread Sandy Ryza
To add to that, if you want to take advantage of MapReduce, e.g. you need to do a distributed grouping or sort, pipes or streaming would be the way to go. If you're mainly interested in running your code in parallel on a cluster, distributed shell, a YARN application outside of MapReduce, could

Re: Maven Cloudera Configuration problem

2013-08-13 Thread sandy . ryza
Hi Pavan, Configuration properties generally aren't included in the jar itself unless you explicitly set them in your java code. Rather they're picked up from the mapred-site.xml file located in the Hadoop configuration directory on the host you're running your job from. Is there an issue

Re: Maven Cloudera Configuration problem

2013-08-13 Thread sandy . ryza
Nothing in your pom.xml should affect the configurations your job runs with. Are you running your job from a node on the cluster? When you say localhost configurations, do you mean it's using the LocalJobRunner? -sandy (iphnoe tpying) On Aug 13, 2013, at 9:07 AM, Pavan Sudheendra

Re: why FairScheduler prefer to schedule MR jobs into the same node?

2013-08-08 Thread Sandy Ryza
Hi devdoer, What version are you using? -Sandy On Thu, Aug 8, 2013 at 4:25 AM, devdoer bird devd...@gmail.com wrote: Hi: I configure the FairScheduler with default settings and my job has 19 reduce tasks. I found that all the reduce tasks are scheduled to run on one node. While with

Re: whitelist feature of YARN

2013-08-07 Thread Sandy Ryza
is this white list feature is supported with. But am not sure what is meant by submitting ResourceRequests directly to RM. Can you please elaborate on this or give me a pointer to some example code on how to do it... Thanks for the reply, -Kishore On Mon, Jul 8, 2013 at 10:53 PM, Sandy Ryza

Re: whitelist feature of YARN

2013-08-07 Thread Sandy Ryza
relaxLocality) that means the old argument containerCount is gone! How would I be able to specify how many containers do I need? We now expect that you submit a ContainerRequest for each container you want. -Kishore On Wed, Aug 7, 2013 at 11:37 AM, Sandy Ryza sandy.r

Re: Parameter 'yarn.nodemanager.resource.cpu-cores' does not work

2013-07-22 Thread Sandy Ryza
Hi Sam, LinuxResourceCalculatorPlugin and DominantResourceCalculator control separate things. The former is for a NodeManager to calculate the resource usage of a container process so that it can kill it if it gets too large. The latter is used by the Capacity Scheduler to allocate containers,

Re: API difference between hadoop branch1 and branch2?

2013-07-18 Thread Sandy Ryza
Hi Li Yu, I don't think it has been published yet, but a document on the MapReduce changes was recently completed at https://issues.apache.org/jira/browse/MAPREDUCE-5184. -Sandy On Thu, Jul 11, 2013 at 4:18 AM, Yu Li car...@gmail.com wrote: Dear all, I have some applications used to run on

Re: Running hadoop for processing sources in full sky maps

2013-07-13 Thread Sandy Ryza
Hi Andrea, For copying the full sky map to each node, look up the distributed cache. It works by placing the sky map file on HDFS and each task will pull it down when needed. For feeding the input data into Hadoop, what format is it in currently? One simple way would be to have a text file

Re: temporary folders for YARN tasks

2013-07-02 Thread Sandy Ryza
LocalDirAllocator should help with this. You can look through MapReduce code to see how it's used. -Sandy On Mon, Jul 1, 2013 at 11:01 PM, Devaraj k devara...@huawei.com wrote: You can make use of this configuration to do the same. <property> <description>List of

Re: Containers and CPU

2013-07-02 Thread Sandy Ryza
CPU limits are only enforced if cgroups is turned on. With cgroups on, they are only limited when there is contention, in which case tasks are given CPU time in proportion to the number of cores requested for/allocated to them. Does that make sense? -Sandy On Tue, Jul 2, 2013 at 9:50 AM,
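The proportional split described above can be illustrated with a little arithmetic. This is only a simplified model, not how the NodeManager or the kernel computes anything: the class name `CpuShareSketch` and the method `contendedShares` are made up for this sketch, and it assumes every container is fully CPU-bound at the same time.

```java
import java.util.Arrays;

// Simplified model: under contention, each container's fraction of CPU time
// is its requested cores divided by the total cores requested on the node.
class CpuShareSketch {

    static double[] contendedShares(int[] coresPerContainer) {
        int total = Arrays.stream(coresPerContainer).sum();
        if (total <= 0) {
            throw new IllegalArgumentException("at least one core must be requested");
        }
        double[] shares = new double[coresPerContainer.length];
        for (int i = 0; i < coresPerContainer.length; i++) {
            shares[i] = (double) coresPerContainer[i] / total;
        }
        return shares;
    }

    public static void main(String[] args) {
        // Two busy containers, one asking for 1 core and one for 3.
        System.out.println(Arrays.toString(contendedShares(new int[] {1, 3})));
    }
}
```

When there is no contention the model does not apply, since a container can then use more CPU than its proportional share.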

Re: Containers and CPU

2013-07-02 Thread Sandy Ryza
cores and simply fight it out in the OS thread scheduler. Thanks, john *From:* Sandy Ryza [mailto:sandy.r...@cloudera.com] *Sent:* Tuesday, July 02, 2013 11:56 AM *To:* user@hadoop.apache.org *Subject:* Re: Containers and CPU CPU limits are only enforced

Re: Containers and CPU

2013-07-02 Thread Sandy Ryza
containers per 8-core node? John *From:* Sandy Ryza [mailto:sandy.r...@cloudera.com] *Sent:* Tuesday, July 02, 2013 1:26 PM *To:* user@hadoop.apache.org *Subject:* Re: Containers and CPU Use of cgroups for controlling CPU is off by default, but can

Re: Yarn job stuck with no application master being assigned

2013-06-21 Thread Sandy Ryza
:28 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Siddhi, Moving this question to the CDH list. Does setting yarn.scheduler.capacity.maximum-am-resource-percent to .5 help? Have you tried using the Fair Scheduler? -Sandy On Fri, Jun 21, 2013 at 4:21 PM, Siddhi Mehta smehtau

Re: Container size configuration

2013-06-13 Thread Sandy Ryza
Hi Yuzhang, Moving this question to the Hadoop user list. Are you using MapReduce or writing your own YARN application? In MapReduce, all maps must request the same amount of memory and all reduces must request the same amount of memory. It would be trivial to do this in your own YARN

Re: Why my tests shows Yarn is worse than MRv1 for terasort?

2013-06-07 Thread Sandy Ryza
Hey Sam, Thanks for sharing your results. I'm definitely curious about what's causing the difference. A couple observations: It looks like you've got yarn.nodemanager.resource.memory-mb in there twice with two different values. Your max JVM memory of 1000 MB is (dangerously?) close to the

Re: History server - Yarn

2013-06-07 Thread Sandy Ryza
Hi Rahul, The job history server is currently specific to MapReduce. -Sandy On Fri, Jun 7, 2013 at 8:56 AM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Hello, I was doing some sort of prototyping on top of YARN. I was able to launch AM and then AM in turn was able to spawn a few

Re: What is the best way to build and debug MR application?

2013-06-06 Thread Sandy Ryza
Hi Lin, This is by no means a comprehensive answer to your question, but I've found that I'm able to iterate fastest by writing unit tests using MRUnit ( http://mrunit.apache.org/) -Sandy On Thu, Jun 6, 2013 at 7:02 PM, Lin Yang lin.yang.ja...@gmail.com wrote: Hi, dear friends, I have set up

Re: What is the best way to build and debug MR application?

2013-06-06 Thread Sandy Ryza
(and running them from Eclipse) On Thu, Jun 6, 2013 at 7:21 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Lin, This is by no means a comprehensive answer to your question, but I've found that I'm able to iterate fastest by writing unit tests using MRUnit ( http://mrunit.apache.org

Re: built hadoop! please help with next steps?

2013-05-31 Thread Sandy Ryza
Hi John, Here's how I deploy/debug Hadoop locally: To build and tar Hadoop: mvn clean package -Pdist -Dtar -DskipTests=true The tar will be located in the project directory under hadoop-dist/target/. I untar it into my deploy directory. I then copy these scripts into the same directory:

Re: built hadoop! please help with next steps?

2013-05-31 Thread Sandy Ryza
-reduce-jobs-with-eclipse looks promising as a Hadoop-in-Eclipse strategy, but it is over a year old and I’m not sure if it applies to Hadoop 2.0 and YARN. John *From:* Sandy Ryza [mailto:sandy.r...@cloudera.com] *Sent:* Friday, May 31, 2013 12:13 PM *To:* user

Re: Shuffle phase replication factor

2013-05-23 Thread Sandy Ryza
In MR1, the tasktracker serves the mapper files (so that tasks don't have to stick around taking up resources). In MR2, the shuffle service, which lives inside the nodemanager, serves them. -Sandy On Thu, May 23, 2013 at 10:22 AM, John Lilley john.lil...@redpoint.net wrote: Ling,

Re: Is there a way to limit # of hadoop tasks per user at runtime?

2013-05-21 Thread Sandy Ryza
Hi Mehmet, Are you using MR1 or MR2? The fair scheduler, present in both versions, but configured slightly differently, allows you to limit the number of map and reduce tasks in a queue. The configuration can be updated at runtime by modifying the scheduler's allocations file. It also has a

Re: YARN in 0.23 vs 2.0

2013-05-16 Thread Sandy Ryza
Hi John, You are correct that both 0.23 and 2.0 contain YARN, and that 1.x does not. The (confusing) reason for this is that the 1.x line descends from the 0.20 line, while the 2.0 line descends from the 0.23 line. -Sandy On Thu, May 16, 2013 at 11:46 AM, John Lilley

Re: Installed Hadoop on Linux server - not able to see web UI

2013-05-16 Thread Sandy Ryza
Hi Raj, The web UIs are located on different ports than the RPC ports you specified. If you are using MR1, the HDFS UI is typically located on port 50070, and the MapReduce UI is typically located on port 50030. -Sandy On Thu, May 16, 2013 at 2:58 PM, Raj Hadoop hadoop...@yahoo.com wrote:

Re: Why could not find finished jobs in yarn.resourcemanager.webapp.address?

2013-05-02 Thread Sandy Ryza
This shouldn't be asked on the dev lists, so putting mapreduce-dev and hdfs-dev in the bcc. Have you made sure you're not using the local job runner? Did you restart the resourcemanager after running the job? -Sandy On Thu, May 2, 2013 at 6:31 PM, sam liu samliuhad...@gmail.com wrote: Can

Re: Why could not find finished jobs in yarn.resourcemanager.webapp.address?

2013-05-02 Thread Sandy Ryza
/hadoop-mapreduce-examples-2.0.3-alpha.jar pi 2 30' 2013/5/3 Sandy Ryza sandy.r...@cloudera.com This shouldn't be asked on the dev lists, so putting mapreduce-dev and hdfs-dev in the bcc. Have you made sure you're not using the local job runner? Did you restart the resourcemanager after

Re: YARN - container networking and ports

2013-04-22 Thread Sandy Ryza
The yarn-default.xml file in the Hadoop repository contains the default ports for all of the YARN protocols. -Sandy On Mon, Apr 22, 2013 at 8:27 AM, Marcos Luis Ortiz Valmaseda marcosluis2...@gmail.com wrote: A great overview of MR2, you can find it in the Cloudera's blog:

Re: Article: 'How to Deploy Hadoop 2 (Yarn) on EC2'

2013-04-17 Thread Sandy Ryza
This is great, Keith. On Wed, Apr 17, 2013 at 12:58 PM, Keith Wiley kwi...@keithwiley.com wrote: I've posted an article on my website that details precisely how to deploy Hadoop 2.0 with Yarn on AWS (or at least how I did it, whether or not such an approach will translate to others'

Re: YARN - specify hosts in ContainerRequest

2013-04-12 Thread Sandy Ryza
. Huffman bhuff...@etinternational.com wrote: I get a container, but not on the node I'm asking for. Thanks, Brian On 04/12/2013 03:01 PM, Sandy Ryza wrote: What do you mean when you say it doesn't seem to use the code? That you're not getting containers back? -Sandy On Fri, Apr 12

Re: Streaming value of (200MB) from a SequenceFile

2013-03-31 Thread Sandy Ryza
Hi Jerry, I assume you're providing your own Writable implementation? The Writable readFields method is given a stream. Are you able to perform your processing while reading it there? -Sandy On Sat, Mar 30, 2013 at 10:52 AM, Jerry Lam chiling...@gmail.com wrote: Hi

Re: Streaming value of (200MB) from a SequenceFile

2013-03-31 Thread Sandy Ryza
Hi Rahul, I don't think saving the stream for later use would work - I was just suggesting that if only some aggregate statistics needed to be calculated, they could be calculated at read time instead of in the mapper. Nothing requires a Writable to contain all the data that it reads. That's a

Re: Yarn communication between containers

2013-03-27 Thread Sandy Ryza
Hi tmp, YARN doesn't provide an explicit protocol for doing this. Applications are expected to have their own mechanism for communication between task containers, other task containers, and app masters. If you want to see how this is done in MapReduce, I would suggest looking at the

Re: Any answer ? Candidate application for map reduce

2013-03-25 Thread Sandy Ryza
Hi Bala, A standard benchmark program for mapreduce is terasort, which is included in the hadoop examples jar. You can generate data for it using teragen, which runs a map-only job: hadoop jar <path-to-examples-jar.jar> teragen <number of records> <directory to put them in> and then sort the data using

Re: Too many open files error with YARN

2013-03-20 Thread Sandy Ryza
Hi Kishore, 50010 is the datanode port. Does your lsof indicate that the sockets are in CLOSE_WAIT? I had come across an issue like this where that was a symptom. -Sandy On Wed, Mar 20, 2013 at 4:24 AM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote: Hi, I am running a date

Re: Question regarding hadoop jar command usage

2013-03-13 Thread Sandy Ryza
the jar command vs mapred job command (looks like the hadoop job command is deprecated). Thanks Kay On Wed, Mar 13, 2013 at 10:14 AM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Kay, The jar is just executed locally. If the jar fires up a mapreduce job and sets itself as the job jar

Re: Transpose

2013-03-05 Thread Sandy Ryza
Hi, Essentially what you want to do is group your data points by their position in the column, and have each reduce call construct the data for each row into a row. To have each record that the mapper processes be one of the columns, you can use TextInputFormat with
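The grouping idea in this reply can be tried out in plain Java without a cluster. The sketch below is not a MapReduce job: `TransposeSketch` and `transpose` are hypothetical names, each input string stands in for one record holding a whole column, and a `TreeMap` stands in for the shuffle that groups values by their position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory stand-in for the map / group / reduce steps of a transpose.
class TransposeSketch {

    static List<String> transpose(List<String> columnRecords) {
        // "Map" step: for every value emit (position in column -> (record index, value)).
        // The outer TreeMap plays the role of the shuffle, grouping by position.
        Map<Integer, TreeMap<Integer, String>> groupedByPosition = new TreeMap<>();
        for (int recordIndex = 0; recordIndex < columnRecords.size(); recordIndex++) {
            String[] values = columnRecords.get(recordIndex).trim().split("\\s+");
            for (int position = 0; position < values.length; position++) {
                groupedByPosition
                        .computeIfAbsent(position, k -> new TreeMap<>())
                        .put(recordIndex, values[position]);
            }
        }
        // "Reduce" step: each group becomes one output row, ordered by record index.
        List<String> rows = new ArrayList<>();
        for (TreeMap<Integer, String> group : groupedByPosition.values()) {
            rows.add(String.join(" ", group.values()));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> columns = new ArrayList<>();
        columns.add("1 2 3");
        columns.add("4 5 6");
        for (String row : transpose(columns)) {
            System.out.println(row);
        }
    }
}
```

In a real job the record index would have to come from the data or from the input key, since reducers do not see values in any guaranteed order.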

Re: Map reduce technique

2013-03-05 Thread Sandy Ryza
Hi Balachandar, In MapReduce, interpreting input files as key value pairs is accomplished through InputFormats. Some common InputFormats are TextInputFormat, which uses lines in a text file as values and their byte offset into the file as keys, KeyValueTextInputFormat, which interprets the first
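To make the TextInputFormat description concrete, here is a small stand-alone sketch that turns file contents into (byte offset, line) pairs. It only imitates the key/value shape described above and does not use Hadoop; `LineOffsetSketch` and `records` are hypothetical names, and it assumes UTF-8 text with `\n` line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Imitates the records described above: key = byte offset of the line,
// value = the line's text without its trailing newline.
class LineOffsetSketch {

    static Map<Long, String> records(String fileContents) {
        byte[] bytes = fileContents.getBytes(StandardCharsets.UTF_8);
        Map<Long, String> result = new LinkedHashMap<>();
        int start = 0;
        for (int i = 0; i <= bytes.length; i++) {
            boolean atEnd = (i == bytes.length);
            if (atEnd || bytes[i] == '\n') {
                // Skip the empty "line" after a final newline.
                if (!(atEnd && start == i)) {
                    String line = new String(bytes, start, i - start, StandardCharsets.UTF_8);
                    result.put((long) start, line);
                }
                start = i + 1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        records("ab\ncde\n").forEach((offset, line) -> System.out.println(offset + "\t" + line));
    }
}
```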

Re: Accumulo and Mapreduce

2013-03-04 Thread Sandy Ryza
Hi Aji, Oozie is a mature project for managing MapReduce workflows. http://oozie.apache.org/ -Sandy On Mon, Mar 4, 2013 at 8:17 AM, Justin Woody justin.wo...@gmail.com wrote: Aji, Why don't you just chain the jobs together? http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining

Re: Custom output value for map function

2013-02-27 Thread Sandy Ryza
Hi Paul, To do this, you need to make your Dog class implement Hadoop's Writable interface, so that it can be serialized to and deserialized from bytes. http://hadoop.apache.org/docs/r1.1.1/api/org/apache/hadoop/io/Writable.html The methods you implement would look something like this: public

Re: Custom output value for map function

2013-02-27 Thread Sandy Ryza
) out.writeInt(12) the following would be correct text = in.readUTF(); number = in.readInt(); and this would fail: number = in.readInt(); text = in.readUTF(); ? 2013/2/27 Sandy Ryza sandy.r...@cloudera.com: Hi Paul, To do this, you need to make your Dog class implement Hadoop's Writable
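On the ordering question: fields have to be read back in the same order they were written, because the serialized bytes carry no field names. A minimal sketch, assuming a Dog with a name and an age (both fields are made up here). `DogSketch` does not implement Hadoop's Writable so that it compiles without Hadoop on the classpath, but its `write` and `readFields` methods have the same signatures as the interface's.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in for a class that would implement org.apache.hadoop.io.Writable.
class DogSketch {
    String name; // hypothetical field
    int age;     // hypothetical field

    // Same signature as Writable.write(DataOutput).
    void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeInt(age);
    }

    // Same signature as Writable.readFields(DataInput).
    // Reads the fields in exactly the order write() wrote them.
    void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        age = in.readInt();
    }

    // Serializes to bytes and deserializes into a fresh instance.
    static DogSketch roundTrip(DogSketch original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));
            DogSketch copy = new DogSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        DogSketch dog = new DogSketch();
        dog.name = "Rex";
        dog.age = 12;
        DogSketch copy = roundTrip(dog);
        System.out.println(copy.name + " " + copy.age);
    }
}
```

Swapping the two calls in `readFields` would make `readInt` interpret the UTF length prefix and the first characters as an integer, so the result would be wrong or an exception would be thrown.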

Re: Can I perform a MR on my local filesystem

2013-02-18 Thread Sandy Ryza
Hi Nikhil, The jobtracker doesn't do any deployment of other daemons. They are expected to be installed and started on other nodes separately. If I understand your question more broadly, MR doesn't necessarily run its map and reduce tasks on the nodes that contain the data. All data is read

Re: Sorting huge text files in Hadoop

2013-02-15 Thread Sandy Ryza
A map-only job does not result in the standard shuffle-sort. Map outputs are written directly to HDFS. -Sandy On Fri, Feb 15, 2013 at 12:23 PM, Jay Vyas jayunit...@gmail.com wrote: Maybe I'm mistaken about what is meant by map-only. Does a map-only job still result in standard shuffle-sort?

Re: Generic output key class

2013-02-10 Thread Sandy Ryza
Hi Amit, One way to accomplish this would be to create a custom writable implementation, TextOrIntWritable, that has fields for both. It could look something like: class TextOrIntWritable implements Writable { private boolean isText; private Text text; private IntWritable integer; void
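Since the snippet above is cut off, here is one way the idea might be filled in. This is a sketch only, not the code from the original message: it uses `String` and `int` in place of `Text` and `IntWritable` and leaves out `implements Writable` so that it runs without Hadoop, and the factory and accessor names are invented. The flag is written first so that `readFields` knows which field follows.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Holds either a text value or an integer, tagged by a boolean flag.
class TextOrIntSketch {
    private boolean isText;
    private String text;   // stand-in for org.apache.hadoop.io.Text
    private int integer;   // stand-in for org.apache.hadoop.io.IntWritable

    static TextOrIntSketch ofText(String value) {
        TextOrIntSketch w = new TextOrIntSketch();
        w.isText = true;
        w.text = value;
        return w;
    }

    static TextOrIntSketch ofInt(int value) {
        TextOrIntSketch w = new TextOrIntSketch();
        w.isText = false;
        w.integer = value;
        return w;
    }

    boolean isText() { return isText; }
    String getText() { return text; }
    int getInt() { return integer; }

    // Write the tag first, then only the field that is in use.
    void write(DataOutput out) throws IOException {
        out.writeBoolean(isText);
        if (isText) {
            out.writeUTF(text);
        } else {
            out.writeInt(integer);
        }
    }

    // Read the tag, then the matching field; clear the unused one.
    void readFields(DataInput in) throws IOException {
        isText = in.readBoolean();
        if (isText) {
            text = in.readUTF();
            integer = 0;
        } else {
            integer = in.readInt();
            text = null;
        }
    }

    static TextOrIntSketch roundTrip(TextOrIntSketch original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));
            TextOrIntSketch copy = new TextOrIntSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(ofText("abc")).getText());
        System.out.println(roundTrip(ofInt(7)).getInt());
    }
}
```

Because the thread is about output keys, note that a real key class would need to implement WritableComparable rather than just Writable, which means adding a `compareTo` method as well.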