Re: Question about log files

2015-04-06 Thread Fabio C.
I noticed that too. I think Hadoop keeps the file open the whole time, so
when you delete it the process simply can no longer write to it, and it
doesn't try to recreate it. Not sure if it's a Log4j problem or a Hadoop
one...
yanghaogn, what is the *correct* way to delete the Hadoop logs? I haven't
found anything better than deleting the file and restarting the service...
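
For what it's worth, here is a minimal sketch (plain Java, nothing
Hadoop-specific, the path is made up) of the POSIX behavior I suspect is
behind this: on Linux, deleting a file that a process holds open only
removes the name, and writes to the open handle keep succeeding, so Log4j
never sees an error and never recreates the file.

import java.io.File;
import java.io.FileOutputStream;

public class DeletedLogDemo {
    public static void main(String[] args) throws Exception {
        File log = new File("/tmp/demo.log");
        FileOutputStream out = new FileOutputStream(log);
        out.write("before delete\n".getBytes());

        // Unlink the directory entry while the stream is still open.
        log.delete();

        // On Linux this write still succeeds: the open handle points at
        // the old inode, which survives until the last handle is closed.
        out.write("after delete\n".getBytes());
        out.flush();
        System.out.println("path still exists? " + log.exists()); // false
        out.close();
    }
}

If that is what happens, the "correct" way is presumably not to delete the
files by hand at all, but to let a rolling appender rotate them via the
log4j configuration.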

On Mon, Apr 6, 2015 at 9:27 AM, 杨浩 yangha...@gmail.com wrote:

 I think the log information has been lost.

  Hadoop is not designed for the case where you delete these files incorrectly.

 2015-04-02 11:45 GMT+08:00 煜 韦 yu20...@hotmail.com:

 Hi there,
 If log files are deleted without restarting the service, it seems that all
 later logging is lost, for example on the namenode or the datanode.
 Why can't log files be re-created when they are deleted, by mistake or on
 purpose, while the cluster is running?

 Thanks,
 Jared





Re: Prune out data to a specific reduce task

2015-03-11 Thread Fabio C.
As far as I know, the code running in each reducer is just what you specify
in your reduce function, so if you know in advance which features of the
data you want to ignore, you can simply instruct the reducers to skip them.
If you can already tell at the beginning whether or not to keep an entry,
you can filter entries out within the map function.
Think of a wordcount example where we tell the map phase to ignore all the
words starting with a specific letter... (there is a rough sketch of this
below).
What kind of data are you processing, and what is the filtering condition?
Sorry I can't help with actual working code, but I'm not really into this
right now.
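
Just to make that concrete, here is a rough, untested sketch of both
options (the class names and the filter letter are all made up for
illustration): map-side filtering, plus a partitioner that sends every key
to partition 0 so the other reducer gets no input.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;

public class FilteredWordCount {

    // Map-side filtering: words starting with 'z' are dropped here and
    // never reach any reducer.
    public static class FilteringMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                String token = itr.nextToken();
                if (token.startsWith("z")) {
                    continue; // filtered out in the map phase
                }
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Alternative for the partitioner question quoted below: route every
    // key to partition 0, so reducer 1 runs but receives nothing.
    public static class SingleReducerPartitioner
            extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            return 0;
        }
    }
}

You would register these on the Job with job.setMapperClass(...),
job.setPartitionerClass(...) and job.setNumReduceTasks(2).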

On Wed, Mar 11, 2015 at 12:13 PM, xeonmailinglist-gmail 
xeonmailingl...@gmail.com wrote:

  Maybe the correct question is: how can I filter data in MapReduce in Java?



 On 11-03-2015 10:36, xeonmailinglist-gmail wrote:

 To exclude data from a specific reducer, should I build a partitioner that
 does this? Should I have a map function that checks which reduce task the
 output goes to?

 Can anyone give me some suggestion?

 And by the way, I really do want to exclude data from a reduce task. So I
 will run more than one reducer, even if one of them gets no input data.


 On 11-03-2015 10:28, xeonmailinglist-gmail wrote:

 Hi,

 I have a job with 3 map tasks and 2 reduce tasks, but I want to exclude
 the data that would go to reduce task 2. This means that only reducer 1
 will produce data, and the other one will be empty or may not even execute.

 How can I do this in MapReduce?

 [image: Example Job Execution]


 Thanks,





Can RM ignore heartbeats?

2015-02-24 Thread Fabio C.
Hi everyone,
I have a question about the ResourceManager's behavior:
when the ResourceManager allocates a container, some time passes before the
NMToken is sent and then received by the ApplicationMaster.
During this window, the RM may receive another heartbeat from the AM that
is identical to the last one (since the AM is not yet aware of the
allocated resources).
Is there any policy in YARN that makes the RM aware of this and lets it
ignore such a heartbeat?
I ask because otherwise I would expect many more superfluous containers to
be allocated than the ones I can see in the logs.
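
To make the window concrete, here is a stripped-down, hypothetical AM
heartbeat loop (setup heavily simplified; the resource size and sleep
interval are arbitrary assumptions, not taken from any real AM):

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class HeartbeatSketch {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();
        rmClient.registerApplicationMaster("", 0, "");

        ContainerRequest ask = new ContainerRequest(
                Resource.newInstance(1024, 1), null, null,
                Priority.newInstance(0));
        rmClient.addContainerRequest(ask);

        int granted = 0;
        while (granted < 1) {
            // Each allocate() call is one AM->RM heartbeat. If the RM has
            // already allocated a container but the response (with the
            // NMToken) has not reached us yet, this heartbeat still carries
            // the same ask: exactly the duplicate the question is about.
            AllocateResponse response = rmClient.allocate(0.0f);
            for (Container c : response.getAllocatedContainers()) {
                granted++;
                // Only here can the AM drop the satisfied request, so that
                // the next heartbeat stops asking.
                rmClient.removeContainerRequest(ask);
            }
            Thread.sleep(1000);
        }
        rmClient.stop();
    }
}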

Thanks in advance

Fabio


Re: hadoop learning

2015-02-21 Thread Fabio C.
Hi Rishabh,
I didn't know anything about Hadoop a few months ago, and I started from
the very beginning. I don't suggest starting with the online documentation,
which is often fragmented, incomplete, and sometimes not even up to date.
Starting by using Hadoop directly is also the fastest way to frustration
and will just lead you to abandon this technology.
I can suggest two books I used to start with; they were quite helpful for
someone who didn't even know what MapReduce was. They provide many examples
and use cases (especially the first one):
- O'Reilly - Hadoop: The Definitive Guide, 3rd Edition. This is quite old
but, coding part aside, it explains quite well what Hadoop is, what it does
and how it works. It is mainly about old versions of Hadoop, but I believe
that's something you should know anyway, especially since most articles
online still use the pre-YARN terminology.
- Addison-Wesley Professional - Apache Hadoop YARN: Moving beyond MapReduce
and Batch Processing with Apache Hadoop 2. This is what I used to really
understand the new Hadoop architecture and terminology. Sometimes it gives
too many details, but better more than less. It also has a couple of
chapters about installing Hadoop.

Good luck

Fabio

On Sat, Feb 21, 2015 at 3:33 PM, Ted Yu yuzhih...@gmail.com wrote:

 Rishabh:
 You can start with:
 http://wiki.apache.org/hadoop/HowToContribute

 There're several components: common, hdfs, YARN, mapreduce, ...
 Which ones are you interested in ?

 Cheers

 On Sat, Feb 21, 2015 at 12:18 AM, Bhupendra Gupta bhupendra1...@gmail.com
  wrote:

 I have been learning and trying to implement a Hadoop ecosystem for a POC
 over the last month or so, and I think the best way to learn is by doing
 it...

 Hadoop as a concept has many implementations, and I picked the Hortonworks
 sandbox for learning...
 This has helped me gauge some of the concepts and gain some practical
 understanding as well.

 Happy learning

 Sent from my iPhone

 Bhupendra Gupta

  On 21-Feb-2015, at 1:39 pm, Rishabh Agrawal ss.rishab...@gmail.com
 wrote:
 
  Hello,
 
  Please tell me where I can learn the concepts of Big Data and Hadoop
 from scratch. Please provide some online links.
 
 
 
  Rishabh Agrawal





Steps for container release

2015-02-20 Thread Fabio C.
Hi everyone,
I was trying to understand the process that makes a container's resources
available again to the ResourceManager.
As far as I can tell from the logs, the AM:
- sends a stop request to the NodeManager for the specific container
- then immediately tells the RM about the release of the resources, which
become available again (queues are re-sorted).
I was actually expecting the RM to wait for an acknowledgment from the NM
(through the NM-RM heartbeat) that the container had really ended, but it
looks to me like the resources are made available as soon as this info
arrives from the AM (AM-RM heartbeat).
Maybe the container decommission time is so small as to be irrelevant?
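
For reference, this hypothetical snippet (assuming 'nmClient' and
'rmClient' are the already-started clients of a running ApplicationMaster)
is the two-step sequence I believe I am seeing:

import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;

public class ReleaseSketch {
    static void release(NMClient nmClient,
                        AMRMClient<ContainerRequest> rmClient,
                        Container container) throws Exception {
        // Step 1: AM -> NM, ask the NodeManager to stop the container.
        nmClient.stopContainer(container.getId(), container.getNodeId());

        // Step 2: AM -> RM, mark the container released. The release rides
        // on the next allocate() heartbeat, which seems to be the moment
        // the scheduler frees the capacity and re-sorts the queues, without
        // waiting for the NM -> RM heartbeat to confirm the actual stop.
        rmClient.releaseAssignedContainer(container.getId());
    }
}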

The logs are at INFO level, and I can't change them to DEBUG since I'm not
the only one using the cluster, so maybe I am missing something.

Thanks

Fabio