Decision Tree Implementation in Hadoop MapReduce

2013-11-23 Thread unmesha sreeveni
I want to go through the decision tree implementation in Mahout. I referred to
the Apache Mahout release announcement:

6 Feb 2012 - Apache Mahout 0.6 released
Apache Mahout has reached version 0.6. All developers are encouraged
to begin using version 0.6. Highlights include:
Improved Decision Tree performance and added support for regression problems

Where can I find its source code and documentation?

Should I download Mahout?

-- 
*Thanks & Regards*

Unmesha Sreeveni U.B

*Junior Developer*


Re: Decision Tree Implementation in Hadoop MapReduce

2013-11-23 Thread Yexi Jiang
You can find the source directly at https://github.com/apache/mahout, or you
can check it out from SVN by following the instructions at
https://cwiki.apache.org/confluence/display/MAHOUT/Version+Control.




-- 
--
Yexi Jiang,
ECS 251,  yjian...@cs.fiu.edu
School of Computer and Information Science,
Florida International University
Homepage: http://users.cis.fiu.edu/~yjian004/


RE: In YARN, how does a task tracker know the address of a job tracker?

2013-11-23 Thread John Lilley
Ricky,

What you are doing sounds familiar.  We are in the process of implementing, not 
exactly MapReduce, but a system that has to do many of the things that 
MapReduce does (find data splits, define tasks, choose execution affinity, 
launch an app master, etc.).

There is another special thing that MapReduce under YARN uses that a normal 
YARN app cannot easily access: "auxiliary services".  MapReduce sets up a YARN 
auxiliary service to serve the mapper outputs.  I think it is based on Netty or 
Jetty and HTTP.  The point is that the MR aux service is part of the Hadoop 
distro, so all MR has to do is tell the NM to run it.  Regular YARN apps don't 
have this luxury without installing jars on each node and adding them to the 
Hadoop stack's CLASSPATH.  There doesn't appear to be any standard or 
documented way to inject extra jars into the Hadoop install.  As they say, that 
exercise is left to the reader.

john

From: ricky l [mailto:rickylee0...@gmail.com]
Sent: Thursday, November 21, 2013 3:40 PM
To: user@hadoop.apache.org
Subject: Re: In YARN, how does a task tracker knows the address of a job 
tracker?

Hi John, thanks for your reply. I suspect there will be some external 
communication between the AM and container tasks. I am trying to implement a 
Hadoop-like system on YARN, and I wanted to sketch the high-level steps before 
starting the work. Thanks,


On Thu, Nov 21, 2013 at 3:27 PM, John Lilley <john.lil...@redpoint.net> wrote:
MapReduce also communicates outside of what is directly supported by YARN.
In a YARN application, there is very little direct communication between the 
client and the AM, or between the AM and container tasks.
I think that an AM can report two pieces of information to the client: 
"state" and "percent complete".
However, at launch time an AM can open up a protocol port and tell the client 
and the container tasks how to connect back.
I don't know the details, but I believe that the MapReduce AM communicates 
directly with all mapper and reducer tasks as well as the client.
John
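
The connect-back pattern described above (the AM opens a port at launch and
tells its tasks where to reach it) can be sketched in plain Java. This is a
minimal illustration using only java.net sockets, not YARN's API: the class
and method names are hypothetical, both sides run in one JVM on the loopback
interface, and a real AM would typically hand the address to its containers
through their launch command or environment.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class ConnectBackSketch {

    /** Hypothetical "task" side: connects to the advertised address and sends one status line. */
    static void reportStatus(String host, int port, String status) {
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(
                 new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8), true)) {
            out.println(status);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hypothetical "master" side: binds an ephemeral port, starts one task, returns what it reported. */
    static String runOnce(String status) {
        // Port 0 asks the OS for any free port; the master then advertises the real one.
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            String host = server.getInetAddress().getHostAddress();
            int port = server.getLocalPort();

            // Stand-in for a container task that was told host:port at launch time.
            Thread task = new Thread(() -> reportStatus(host, port, status));
            task.start();

            try (Socket conn = server.accept();
                 BufferedReader in = new BufferedReader(
                     new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line = in.readLine();
                task.join();
                return line;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("master received: " + runOnce("map-0001 50%"));
    }
}
```

Binding to port 0 and advertising the port that was actually assigned avoids
hard-coding a port that might already be in use on the node.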


From: ricky l [mailto:rickylee0...@gmail.com]
Sent: Thursday, November 21, 2013 12:36 PM
To: user@hadoop.apache.org
Subject: Re: In YARN, how does a task tracker knows the address of a job 
tracker?

Thank you for the answer, Omkar.

I read the links, which were helpful. Though the concept of job tracker/task 
tracker does not exist in YARN MapReduce, doesn't it use the binary of the 
job/task tracker? I thought the application master runs the job tracker binary 
and the containers on the node run the task tracker binary. Thanks.

On Thu, Nov 21, 2013 at 2:06 PM, Omkar Joshi <ojo...@hortonworks.com> wrote:
Hi,

Starting with YARN there is no notion of a job tracker or task tracker. Here is 
a quick summary:
JobTracker :-
1) Resource management :- Now done by the Resource Manager (it does all the 
scheduling work)
2) Application state management :- managing and launching new map/reduce tasks 
(done by the Application Master; there is one per job, not a single entity in 
the cluster for all jobs as in MRv1).
TaskTracker :- replaced by the Node Manager

I would suggest you read the YARN blog post. This will answer most of your 
questions. Plus read this (slide 12) for how a job actually gets executed.

Thanks,
Omkar Joshi
Hortonworks Inc.

On Thu, Nov 21, 2013 at 7:52 AM, ricky l <rickylee0...@gmail.com> wrote:
Hi all,

I have a question about how a task tracker identifies the job tracker address 
when I submit an MR job through YARN. As far as I know, both the job tracker 
and the task trackers are launched through the application master, and I am 
curious about the details of the job and task tracker launch sequence.

thanks.






Re: Unmanaged AMs

2013-11-23 Thread Hitesh Shah
Hello Kishore, 

An unmanaged AM has no relation to the language being used. An unmanaged AM is 
an AM that is launched outside of the YARN cluster, i.e. manually launched 
elsewhere and not by the RM (using the application submission context provided 
by a client). It was built as a dev tool so that application developers can 
test their AMs (attach debuggers, etc.) and is not meant to be used in 
production.

As for other languages, all interactions with the YARN components are via 
protobuf-based RPC, and you could use the appropriate language binding for 
protobuf. Take a look at https://github.com/hortonworks/gohadoop - this has 
code for a YARN app written in Go. There is still some work left to get this to 
work seamlessly for all languages, but the Go code should point you in the 
right direction.

-- Hitesh

On Nov 21, 2013, at 6:18 AM, Krishna Kishore Bonagiri wrote:

> Hi,
> 
>   I have seen in the code comments in UnmanagedAMLauncher.java that the AM 
> can be in any language. What does that mean? Can the AM be written in C++? If 
> so, how would I be able to connect to the RM and how would I be able to 
> request containers? I mean, what is the interface for doing these things? Is 
> there sample code or an example somewhere to get an idea of how to do it?
> 
> Thanks,
> Kishore



Uncompressed size of Sequence files

2013-11-23 Thread Robert Dyer
Is there an easy way to get the uncompressed size of a sequence file that
is block compressed?  I am using the Snappy compressor.

I realize I can obviously just decompress them to temporary files to get
the size, but I would assume there is an easier way.  Perhaps an existing
tool that my search did not turn up?

If not, I will have to run an MR job to load each compressed block and read the
Snappy header to get the size.  I need to do this for a large number of
files, so I'd prefer a simple CLI tool (sort of like 'hadoop fs -du').

- Robert
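
Reading the Snappy header, as suggested above, could look roughly like the
sketch below. The raw Snappy format begins with the uncompressed length encoded
as a little-endian varint, and the hypothetical helper readUncompressedLength
decodes just that preamble. It assumes the start of a raw Snappy buffer has
already been located; the SequenceFile block layout and any length framing
added by Hadoop's codec layer are not parsed here.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class SnappyLengthSketch {

    /**
     * Decodes the varint preamble of a raw Snappy buffer, which holds the
     * uncompressed length in bytes. Each byte carries 7 data bits, lowest
     * group first; a set high bit means another byte follows.
     */
    static long readUncompressedLength(InputStream in) throws IOException {
        long result = 0;
        // A 32-bit length needs at most 5 varint bytes (shifts 0, 7, 14, 21, 28).
        for (int shift = 0; shift < 35; shift += 7) {
            int b = in.read();
            if (b < 0) {
                throw new EOFException("stream ended inside the length preamble");
            }
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IOException("length preamble is longer than 5 bytes");
    }

    /** Convenience overload for a buffer that is already in memory. */
    static long readUncompressedLength(byte[] rawSnappy) {
        try {
            return readUncompressedLength(new ByteArrayInputStream(rawSnappy));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 0xFE 0xFF 0x7F is the varint encoding of 2097150 (0x1FFFFE).
        byte[] preamble = {(byte) 0xFE, (byte) 0xFF, 0x7F};
        System.out.println(readUncompressedLength(preamble));
    }
}
```

Summing this value over every compressed buffer in a file would give the
uncompressed payload size without decompressing the data, although the
per-block bookkeeping of the sequence file itself would still have to be
walked to find those buffers.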