Get Hadoop update

2013-05-22 Thread Vimal Jain
Hi,
I would like to receive Hadoop notifications.

-- 
Thanks and Regards,
Vimal Jain


How is sharing done in HDFS ?

2013-05-22 Thread Agarwal, Nikhil
Hi,

Can anyone guide me to some pointers or explain how HDFS shares the information 
put in the temporary directories (hadoop.tmp.dir, mapred.tmp.dir, etc.) to all 
other nodes?

I suppose that during execution of a MapReduce job, the JobTracker prepares a 
file called jobtoken and puts it in the temporary directories, and this file 
needs to be read by all TaskTrackers. So how does HDFS share the contents? 
Does it use an NFS mount, or something else?

Thanks & Regards,
Nikhil



Re: How is sharing done in HDFS ?

2013-05-22 Thread Kun Ling
Hi Agarwal,
Hadoop just puts the jobtoken, _partition.lst, and some other files that need
to be shared into a directory located under hdfs://namenode:port/tmp//.

   All the TaskTrackers then access these files from that shared tmp
directory, just as they access the job's input files in HDFS.
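
A minimal sketch of what that sharing amounts to (this is not TaskTracker
source, and the staging path below is hypothetical): any node can open the
same HDFS path through the ordinary FileSystem client API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadSharedFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml (fs.default.name)
    FileSystem fs = FileSystem.get(conf);          // the same HDFS view on every node
    // Hypothetical path under the shared tmp/system directory mentioned above.
    Path shared = new Path("/tmp/hadoop/mapred/system/job_201305220001_0001/jobToken");
    FSDataInputStream in = fs.open(shared);        // any TaskTracker can open the identical path
    IOUtils.copyBytes(in, System.out, conf, true); // stream the shared contents and close
  }
}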



yours,
Ling Kun


On Wed, May 22, 2013 at 4:29 PM, Agarwal, Nikhil wrote:



-- 
http://www.lingcc.com


Re: How is sharing done in HDFS ?

2013-05-22 Thread Harsh J
The job-specific files, placed by the client, are downloaded individually
by every TaskTracker from HDFS (the process is called "localization" and
happens before the task starts up) and then used.
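
A hedged sketch of the "placed by the client" half, with hypothetical paths:
the submitter writes the job files once into a shared directory in HDFS, and
every TaskTracker later reads them back during localization.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StageJobFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical staging location; the real one is chosen by the framework.
    Path staging = new Path("/tmp/hadoop/mapred/staging/job_201305220001_0001");
    fs.mkdirs(staging);
    // The client copies job.jar and job.xml into HDFS exactly once...
    fs.copyFromLocalFile(new Path("build/job.jar"), new Path(staging, "job.jar"));
    fs.copyFromLocalFile(new Path("conf/job.xml"), new Path(staging, "job.xml"));
    // ...and each TaskTracker downloads ("localizes") its own copy before the task starts.
  }
}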


On Wed, May 22, 2013 at 1:59 PM, Agarwal, Nikhil wrote:



-- 
Harsh J


Auto created 'target' folder?

2013-05-22 Thread Taco Jan Osinga
Hi all,

Quite a newbie here.

I'm creating an application for internal use, which deploys several demo sites 
of our HBase application. These demos should contain a blank state (fixtures) 
with some data. Therefore I have created export files which need to be 
imported (using the MapReduce way of importing). This all works when I run my 
script as root, so the process itself works.

However, it doesn't work when running the script as user tomcat7 (permission 
denied). Even adding tomcat7 to the supergroup didn't fix the problem.
 
In the process I noticed there's a directory structure created, named 
"target/test-dir", which contains hadoop*.jar files. If I chmod -R 777 this 
target folder, I am suddenly able to import as tomcat7!

My questions:
- What are these target folders (they appear in the working directory)?
- How can I make sure the user(s) in supergroup have write permissions (is it 
even wise to do that)?
- It seems like some kind of temp folder: shouldn't this folder be removed 
after the process is finished (it keeps growing)?

Regards from The Netherlands,
Taco Jan Osinga

Re: How is sharing done in HDFS ?

2013-05-22 Thread Kun Ling
Hi Agarwal,
Thanks to Harsh J's reply, I have found the following code (based on
hadoop-1.0.4) that may give you some help:

   localizeJobTokenFile() in TaskTracker.java, which localizes a file named
"jobToken".
   localizeJobConfFile() in TaskTracker.java, which localizes a file named
"job.xml".
   Some distributed cache files will also be localized by calling
taskDistributedCacheManager.setupCache().

   All of the above functions are called in the initializeJob() method of
TaskTracker.java.

The jobToken file is copied from the directory returned by
jobClient.getSystemDir(), which is initialized as a shared directory in
HDFS in offerService() of TaskTracker.java.
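
As a rough sketch of what that localization boils down to (this is not the
TaskTracker code itself; the system directory, job id, and local target below
are hypothetical), each node copies the shared HDFS file to its own local disk:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalizeSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);                   // shared HDFS view
    // Roughly what jobClient.getSystemDir() points at; hypothetical values here.
    Path systemDir = new Path("/tmp/hadoop/mapred/system");
    Path remoteToken = new Path(systemDir, "job_201305220001_0001/jobToken");
    Path localToken = new Path("/tmp/mapred/local/jobToken");  // hypothetical local target
    // copyToLocalFile is the essence of localization: every TaskTracker pulls its own copy.
    hdfs.copyToLocalFile(remoteToken, localToken);
    System.out.println("localized to " + localToken);
  }
}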


  To Harsh: after looking into the source code (based on hadoop-1.0.4), I
have the following questions:
   1. Where is the job.xml stored in the shared HDFS? Looking into the code,
I only found the readFields(DataInput in) method of class Task in Task.java,
and the only relevant statement is "jobFile = Text.readString(in)".

   2. There is also a _partition.lst file and a job.jar file, which are also
shared by all the tasks, but I did not find any code that localizes them. Do
you know what code in which file makes the _partition.lst localization happen?

   3. Are there any other files that need to be shared, besides jobToken,
job.xml, distributed cache files, _partition.lst, and job.jar?

   4. All of the above observations are based on the Hadoop 1.0.4 source
code. Is there any update in the latest hadoop-2.0-alpha or in Hadoop trunk?


On Wed, May 22, 2013 at 4:45 PM, Harsh J wrote:



-- 
http://www.lingcc.com


Re: ETL Tools

2013-05-22 Thread Lenin Raj
We have used Pentaho in our projects; it meets all your conditions, and it can
connect to Hadoop.

Good community support too.

--
Lenin.
Sent from my Android.
On May 22, 2013 2:19 AM, "Aji Janis"  wrote:

> Thanks for the suggestion. What about Clover or Talend? Have any of you
> tried them before? I am interested in knowing how they compare against Pentaho.
>
>
> On Tue, May 21, 2013 at 12:26 PM, sudhakara st wrote:
>
>> Hello,
>>
>> Flume is better; a more sophisticated one is Pentaho, which is open source
>> but you have to pay for support.
>>
>>
>> On Tue, May 21, 2013 at 9:52 PM, Shahab Yunus wrote:
>>
>>> For batch imports, I would also suggest Sqoop. Very easy to use,
>>> especially if you have MySQL in play. I have not used Sqoop 2, but it is
>>> supposed to add enterprise-level robustness and admin support as well.
>>>
>>> -Shahab
>>>
>>>
>>> On Tue, May 21, 2013 at 12:17 PM, Peyman Mohajerian wrote:
>>>
 Apache Flume is one option.


 On Tue, May 21, 2013 at 7:32 AM, Aji Janis  wrote:

> Hello users,
>
> I am interested in hearing about what sort of ETL tools you are using
> with your cloud-based apps. Ideally, I am looking for ETL(s) with the
> following features:
>
> -free (yup)
> -open-source/community support
> -handles different types of sources, or at least has plugins (email, rss,
> filesystem, relational databases, etc.)
> -can load data to HBase (or HDFS)
> -ease of use (e.g. setting up a new extract source shouldn't be a week-long
> effort)
>
> I am open to combining multiple ETLs to get the job done too. Any
> suggestions? Thank you for your time
>


>>>
>>
>>
>> --
>>
>> Regards,
>> ...Sudhakara.st
>>
>>
>
>


Rack-awareness in Hadoop-2.0.3-alpha

2013-05-22 Thread Mohammad Mustaqeem
Does Hadoop-2.0.3-alpha not support rack-awareness?
I have been trying to make my Hadoop cluster rack-aware for a week but I
haven't succeeded.

What I am doing:
I am adding the following property in etc/hadoop/core-site.xml:

<property>
  <name>net.topology.script.file.name</name>
  <value>/home/hadoop/hadoop-2.0.3-alpha/etc/hadoop/topology.sh</value>
</property>

The topology.sh and topology.data files are attached to this mail.

I want to mention that I once used this property (topology.script.file.name)
and these files to make Hadoop-0.22.0 rack-aware, but I am unable to do the
same for Hadoop-2.0.3-alpha.
What am I doing wrong?
Please, somebody help me.


-- 
*With regards ---*
*Mohammad Mustaqeem*,
M.Tech (CSE)
MNNIT Allahabad


topology.data
Description: Binary data


topology.sh
Description: Bourne shell script


RE: Shuffle phase replication factor

2013-05-22 Thread John Lilley
Oh I see.  Does this mean there is another service and TCP listen port for this 
purpose?
Thanks for your indulgence... I would really like to read more about this 
without bothering the group, but I am not sure where to start to learn these 
internals other than the code.
john

From: Kai Voigt [mailto:k...@123.org]
Sent: Tuesday, May 21, 2013 12:59 PM
To: user@hadoop.apache.org
Subject: Re: Shuffle phase replication factor

The map output doesn't get written to HDFS. The map task writes its output to 
its local disk, the reduce tasks will pull the data through HTTP for further 
processing.

On 21.05.2013, at 19:57, John Lilley <john.lil...@redpoint.net> wrote:


When MapReduce enters "shuffle" to partition the tuples, I am assuming that it 
writes intermediate data to HDFS.  What replication factor is used for those 
temporary files?
john


--
Kai Voigt
k...@123.org






RE: MapReduce shuffle algorithm

2013-05-22 Thread John Lilley
Thanks!  I will read the elephant book more thoroughly.
john

From: Bertrand Dechoux [mailto:decho...@gmail.com]
Sent: Tuesday, May 21, 2013 1:22 PM
To: user@hadoop.apache.org
Subject: Re: MapReduce shuffle algorithm

An introduction to the subject can be found in the best-known reference:

Hadoop: The Definitive Guide, 3rd Edition
Storage and Analysis at Internet Scale
By Tom White
Publisher: O'Reilly Media / Yahoo Press
Released: May 2012
Chapter 6 How MapReduce Works -> Shuffle and Sort -> around page 208
http://shop.oreilly.com/product/0636920021773.do

After reading this, you should have a good understanding of the architecture 
and know that indeed there is no "shuffle phase replication factor" (cf your 
question on another thread). For the technical details, the code is probably 
the next step.
Regards
Bertrand


On Tue, May 21, 2013 at 6:58 PM, John Lilley <john.lil...@redpoint.net> wrote:
I am very interested in a deep understanding of the MapReduce "Shuffle" phase 
algorithm and implementation.  Are there whitepapers I could read for an 
explanation?  Or another mailing list for this question?  Obviously there is 
the code ;-)
john




Re: Shuffle phase replication factor

2013-05-22 Thread Shahab Yunus
As mentioned by Bertrand, Hadoop: The Definitive Guide is, well... a really
definitive :) place to start. It is pretty thorough for starters, and once you
have gone through it, the code will start making more sense too.

Regards,
Shahab


On Wed, May 22, 2013 at 10:33 AM, John Lilley wrote:



YARN in 2.0 and 0.23

2013-05-22 Thread John Lilley
We intend to use the YARN APIs fairly soon.  Are there notable differences in 
YARN's classes, interfaces, or semantics between 0.23 and 2.0?  It seems to be 
supported in both versions.
Thanks,
John



RE: Shuffle phase replication factor

2013-05-22 Thread John Lilley
This brings up another nagging question I've had for some time.  Between HDFS 
and shuffle, there seems to be the potential for "every node connecting to 
every other node" via TCP.  Are there explicit mechanisms in place to manage or 
limit simultaneous connections?  Is the protocol simply robust enough to allow 
a server-side to disconnect at any time to free up slots and the client-side 
will retry the request?
Thanks
john

From: Shahab Yunus [mailto:shahab.yu...@gmail.com]
Sent: Wednesday, May 22, 2013 8:38 AM
To: user@hadoop.apache.org
Subject: Re: Shuffle phase replication factor







Please, Un-subscribe!

2013-05-22 Thread Simone Martinelli
Hi there, could you please un-subscribe me from this mailing list?
Thank you
Simone

-- 


Best regards / Cordialement / Cordiali saluti

Simone Martinelli
Associate, K2 Partnering Solutions


Re: Shuffle phase replication factor

2013-05-22 Thread Rahul Bhattacharjee
There are configuration properties to control the number of threads involved
in the copy; for example:
tasktracker.http.threads=40
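
A small sketch of where these knobs live, assuming the Hadoop 1.x property
names discussed in this thread (tasktracker.http.threads is a daemon-side
setting in mapred-site.xml; it is shown here only to illustrate reading it
through the Configuration API):

import org.apache.hadoop.conf.Configuration;

public class ShuffleKnobs {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.addResource("mapred-site.xml");  // cluster-side config, if present on the classpath
    // Server threads on each TaskTracker that serve map output over HTTP (default 40).
    int httpThreads = conf.getInt("tasktracker.http.threads", 40);
    // Parallel fetch threads per reducer during the shuffle (default 5).
    int parallelCopies = conf.getInt("mapred.reduce.parallel.copies", 5);
    System.out.println("http threads=" + httpThreads + ", parallel copies=" + parallelCopies);
  }
}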
Thanks,
Rahul


On Wed, May 22, 2013 at 8:16 PM, John Lilley wrote:



RE: Shuffle phase replication factor

2013-05-22 Thread John Lilley
Um, is that also the limit for the number of simultaneous connections?  In 
general, one does not need a 1:1 map between threads and connections.
If this is the connection limit, does it imply that the client or server side 
aggressively disconnects after a transfer?
What happens to the pending/failing connection attempts that exceed the limit?
Thanks!
john

From: Rahul Bhattacharjee [mailto:rahul.rec@gmail.com]
Sent: Wednesday, May 22, 2013 8:52 AM
To: user@hadoop.apache.org
Subject: Re: Shuffle phase replication factor








RE: Shuffle phase

2013-05-22 Thread John Lilley
I was reading the elephant book trying to understand which process actually 
serves up the HTTP transfer on the mapper side.  Is it each map task?  Or is 
there some persistent task on each worker that serves up mapper output for 
all map tasks?
Thanks,
John

From: Kai Voigt [mailto:k...@123.org]
Sent: Tuesday, May 21, 2013 12:59 PM
To: user@hadoop.apache.org
Subject: Re: Shuffle phase replication factor







RE: Viewing snappy compressed files

2013-05-22 Thread Robert Rapplean
Thanks! This shortcuts my current process considerably, and should take the 
pressure off for the short term. I'd still like to be able to analyze the data 
in a python script without having to make a local copy, but that can wait.

Best,

Robert Rapplean
Senior Software Engineer
303-872-2256  direct  | 303.438.9597  main | www.trueffect.com

From: Sanjay Subramanian [mailto:sanjay.subraman...@wizecommerce.com]
Sent: Tuesday, May 21, 2013 11:56 AM
To: user@hadoop.apache.org
Subject: Re: Viewing snappy compressed files

+1 Thanks Rahul-da

Or u can use
hdfs dfs -text /path/to/dir/on/hdfs/part-r-0.snappy | less
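
If you want to decompress such a file programmatically rather than shelling
out, below is a hedged sketch assuming the Hadoop jars (and the native Snappy
library) are available: Hadoop's .snappy output is written through the codec's
own block framing rather than the generic snappy framing, which is why plain
snappy tools reject it. The path is hypothetical.

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CatSnappy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path p = new Path("part-r-00000.snappy");              // hypothetical local or HDFS path
    FileSystem fs = p.getFileSystem(conf);
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(p);          // resolves SnappyCodec from the suffix
    InputStream in = codec.createInputStream(fs.open(p));  // decompress the Hadoop-framed stream
    IOUtils.copyBytes(in, System.out, conf, true);         // same effect as "hdfs dfs -text"
  }
}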


From: Rahul Bhattacharjee <rahul.rec@gmail.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Tuesday, May 21, 2013 9:52 AM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Viewing snappy compressed files

I haven't tried this with snappy, but you can try using hadoop fs -text.

On Tue, May 21, 2013 at 8:28 PM, Robert Rapplean <robert.rappl...@trueffect.com> wrote:
Hey, there. My Google skills have failed me, and I hope someone here can point 
me in the right direction.


We're storing data on our Hadoop cluster in Snappy compressed format. When we 
pull a raw file down and try to read it, however, the Snappy libraries don't 
know how to read the files. They tell me that the stream is missing the snappy 
identifier. I tried inserting 0xff 0x06 0x00 0x00 0x73 0x4e 0x61 0x50 0x70 0x59 
into the beginning of the file, but that didn't do it.

Can someone point me to resources for figuring out how to uncompress these 
files without going through Hadoop?


Robert Rapplean
Senior Software Engineer
303-872-2256  direct  | 303.438.9597  main 
| www.trueffect.com





Re: Rack-awareness in Hadoop-2.0.3-alpha

2013-05-22 Thread Chris Nauroth
common-dev and hdfs-dev removed/bcc'd

Hi Mohammad,

Rack awareness is supported in 2.0.3-alpha.  The only potential problem I
see in your configuration is that topology.sh contains a definition for
HADOOP_CONF that points back at your hadoop-0.22.0/conf directory.  If that
directory doesn't contain the right topology.data file, then resolution to
a rack might not work as you expected.

If that doesn't fix the problem, then I have another suggestion: change
topology.sh to add some debugging statements that echo to a temp file the
arguments that the script has received, and the results it decides to
print.  This would show you exactly how Hadoop called your script and
exactly what rack your script replied back to Hadoop.  It might give you an
idea for what to investigate next.
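
One way to do that kind of check from the Hadoop side (a hedged sketch using
hadoop-common's ScriptBasedMapping; the script path and host names are taken
from this thread) is to drive the configured script through the same mapping
class the NameNode uses and print what it resolves:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.net.ScriptBasedMapping;

public class ResolveRack {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("net.topology.script.file.name",
        "/home/hadoop/hadoop-2.0.3-alpha/etc/hadoop/topology.sh");
    ScriptBasedMapping mapping = new ScriptBasedMapping();
    mapping.setConf(conf);  // the mapping reads the script name from the configuration
    // Hosts taken from the thread; expect something like [/default/rack, /rack2].
    System.out.println(mapping.resolve(Arrays.asList("172.31.13.133", "mustaqeem-1")));
  }
}

If hosts you expect elsewhere come back as /default-rack, the script (or the
data file it reads) is the place to look.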

Hope this helps,

Chris Nauroth
Hortonworks
http://hortonworks.com/



On Wed, May 22, 2013 at 6:29 AM, Mohammad Mustaqeem <3m.mustaq...@gmail.com> wrote:


Re: Auto created 'target' folder?

2013-05-22 Thread Chris Nauroth
Can you provide additional information about the exact commands that you
are trying to run?

"target/test-dir" is something that gets created during the Hadoop
codebase's Maven build process.  Are you running Maven commands?  If so,
are you running Maven commands as a user different from tomcat7?  This
would result in a target directory owned by a different user, and the
tomcat 7 user might not have permission to access it.

If you're not actually running Maven commands, are you otherwise
referencing target/test-dir in your process, such as on a classpath?  If
so, then the same problems would apply: if your process launches as a
different user from the owner of the target/test-dir directory, then you
might have a permission problem.

So far, this sounds like a local file system permission issue rather than a
Hadoop-specific issue.

Hope this helps,

Chris Nauroth
Hortonworks
http://hortonworks.com/



On Wed, May 22, 2013 at 2:21 AM, Taco Jan Osinga wrote:


Re: Get Hadoop update

2013-05-22 Thread Chris Nauroth
Hi Vimal,

Full information on how to subscribe and unsubscribe from the various lists
is here:

http://hadoop.apache.org/mailing_lists.html

Chris Nauroth
Hortonworks
http://hortonworks.com/



On Wed, May 22, 2013 at 1:01 AM, Vimal Jain wrote:


Re: Please, Un-subscribe!

2013-05-22 Thread Chris Nauroth
Hi Simone,

Please see this wiki page for full information on how to subscribe or
unsubscribe from the various mailing lists:

http://hadoop.apache.org/mailing_lists.html

Chris Nauroth
Hortonworks
http://hortonworks.com/



On Wed, May 22, 2013 at 7:49 AM, Simone Martinelli <smartine...@k2partnering.com> wrote:


Re: Rack-awareness in Hadoop-2.0.3-alpha

2013-05-22 Thread Patai Sangbutsarakum
I believe that his topology.sh and .data files are already correct.
bash topology.sh 172.31.13.133 mustaqeem-1 mustaqeem-4
/default/rack /rack2 /rack3

The output looks exactly the same as mine.


Mohammad,
1. Did you restart the namenode after you modified the configuration? In 0.20,
restarting the namenode is required.
2. Try to revert the scheduler to the default (I believe it is FIFO) and play
with rack-awareness again.
3. Check whether you need to chown topology.sh to the user that runs the namenode.
4. Check whether you need to chmod u+x topology.sh.
5. Check whether topology.data can be read by that user.


Hope this helps



On Wed, May 22, 2013 at 9:49 AM, Chris Nauroth wrote:



Hive tmp logs

2013-05-22 Thread Raj Hadoop
Hi,
 
My hive job logs are being written to the /tmp/hadoop directory. I want to change 
it to a different location, i.e. a sub-directory somewhere under the 'hadoop' 
user home directory.
How do I change it?
 
Thanks,
Ra

Re: Hive tmp logs

2013-05-22 Thread Sanjay Subramanian

<property>
  <name>hive.querylog.location</name>
  <value>/path/to/hivetmp/dir/on/local/linux/disk</value>
</property>

<property>
  <name>hive.exec.scratchdir</name>
  <value>/data01/workspace/hive scratch/dir/on/local/linux/disk</value>
</property>


From: Anurag Tangri <tangri.anu...@gmail.com>
Reply-To: "u...@hive.apache.org" <u...@hive.apache.org>
Date: Wednesday, May 22, 2013 11:56 AM
To: "u...@hive.apache.org" <u...@hive.apache.org>
Cc: Hive <u...@hive.apache.org>, User <user@hadoop.apache.org>
Subject: Re: Hive tmp logs

Hi,
You can add the Hive query log property in your hive-site.xml and point it to
the directory you want.

Thanks,
Anurag Tangri

Sent from my iPhone

On May 22, 2013, at 11:53 AM, Raj Hadoop <hadoop...@yahoo.com> wrote:



Eclipse plugin

2013-05-22 Thread Bharati
Hi,

I am trying to get or build the Eclipse plugin for Hadoop 1.2.0.

None of the methods I found on the web worked for me. Any tutorial or method 
to build the plugin would help.

I need to build a Hadoop MapReduce project and be able to debug it in Eclipse.

Thanks,
Bharati
Sent from my iPad


Sqoop Import Oracle Error - Attempted to generate class with no columns!

2013-05-22 Thread Raj Hadoop
Hi,
 
I just finished setting up Apache Sqoop 1.4.3. I am trying to test a basic 
sqoop import against Oracle.
 
sqoop import --connect jdbc:oracle:thin:@//intelli.dmn.com:1521/DBT --table 
usr1.testonetwo --username usr123 --password passwd123
 
 
I am getting the following error: 
13/05/22 17:18:16 INFO manager.SqlManager: Executing SQL statement: SELECT t.* 
FROM usr1.testonetwo t WHERE 1=0
13/05/22 17:18:16 ERROR tool.ImportTool: Imported Failed: Attempted to generate 
class with no columns!
 
I checked the database and the query runs fine from Oracle sqlplus client and 
Toad.
 
Thanks,
Raj

Re: Eclipse plugin

2013-05-22 Thread Jing Zhao
Hi Bharati,

Usually you only need to run "ant clean jar jar-test" and "ant
eclipse" on your code base, and then import the project into your
eclipse. Can you provide some more detailed description about the
problem you met?

Thanks,
-Jing

On Wed, May 22, 2013 at 2:25 PM, Bharati wrote:


Re: YARN in 2.0 and 0.23

2013-05-22 Thread Arun C Murthy
I'd use the 2.0 APIs; they are days away from getting frozen and will be 
supported compatibly for the foreseeable future.

Details to track here:
https://issues.apache.org/jira/browse/YARN-386

hth,
Arun

On May 22, 2013, at 7:38 AM, John Lilley wrote:


--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




Re: Eclipse plugin

2013-05-22 Thread Bharati
Hi Jing,

I want to be able to open a project as a MapReduce project in Eclipse instead of 
a Java project, as in some of the videos on YouTube.

For now, let us say I want to write a wordcount program and step through it with 
Hadoop 1.2.0.
How can I use Eclipse to write the code?

The goal here is to set up the development environment to start a project as 
MapReduce right in Eclipse or NetBeans, whichever works better. The idea is to be 
able to step through the code.

Thanks,
Bharati

Sent from my iPad

On May 22, 2013, at 2:42 PM, Jing Zhao wrote:


Re: Eclipse plugin

2013-05-22 Thread Sanjay Subramanian
Hi

I don't use or need any special plugin to walk through the code.

All my MapReduce jobs have a

JobMapper.java
JobReducer.java
JobProcessor.java (set any configs u like)

I create a new Maven project in Eclipse (easier to manage dependencies). The 
elements below are in the order in which they should appear in the POM.

Then in Eclipse Debug Configurations I create a new Java Application launch and 
start debugging. That's it.


MAVEN REPO INFO

<repositories>
  <repository>
    <name>Cloudera repository</name>
    <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
  </repository>
</repositories>

<properties>
  <cloudera_version>2.0.0-cdh4.1.2</cloudera_version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>${cloudera_version}</version>
    <scope>compile</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>${cloudera_version}</version>
    <scope>compile</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${cloudera_version}</version>
    <scope>compile</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${cloudera_version}</version>
    <scope>compile</scope>
  </dependency>
</dependencies>

WordCountNew (please modify as needed)
==


public class WordCountNew {

  public static class Map extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context ctxt) throws IOException, InterruptedException {
      FileSplit fileSplit = (FileSplit) ctxt.getInputSplit();
      // System.out.println(value.toString());
      String fileName = fileSplit.getPath().toString();
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        ctxt.write(word, one);
      }
    }
  }

  public static class Reduce extends org.apache.hadoop.mapreduce.Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context ctxt) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      ctxt.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    org.apache.hadoop.conf.Configuration hadoopConf = new org.apache.hadoop.conf.Configuration();
    hadoopConf.set(MapredConfEnum.IMPRESSIONS_LOG_REC_SEPARATOR.getVal(), MapredConfEnum.PRODUCT_IMPR_LOG_REC_END.getVal());
    hadoopConf.set(MapredConfEnum.IMPRESSIONS_LOG_REC_CACHED_SEPARATOR.getVal(), MapredConfEnum.PRODUCT_IMPR_LOG_REC_CACHED.getVal());
    hadoopConf.set("io.compression.codecs", "org.apache.hadoop.io.compress.GzipCodec");

    Job job = new Job(hadoopConf);
    job.setJobName("wordcountNEW");
    job.setJarByClass(WordCountNew.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    job.setMapperClass(WordCountNew.Map.class);
    job.setCombinerClass(WordCountNew.Reduce.class);
    job.setReducerClass(Reduce.class);

    // job.setInputFormatClass(ZipMultipleLineRecordInputFormat.class);
    job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    if (FileUtils.doesFileOrDirectoryExist(args[1])) {
      org.apache.commons.io.FileUtils.deleteDirectory(new File(args[1]));
    }
    org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(job, new Path(args[0]));
    org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
    System.out.println();
  }
}





From: Bharati <bharati.ad...@mparallelo.com>
Date: Wednesday, May 22, 2013 3:39 PM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Eclipse plugin

Re: Eclipse plugin

2013-05-22 Thread Sanjay Subramanian
Forgot to add: if u run Windows and Eclipse and want to do Hadoop, u have to 
set up Cygwin and add $CYGWIN_PATH/bin to PATH.

Good luck,

Sanjay

From: Sanjay Subramanian <sanjay.subraman...@wizecommerce.com>
Reply-To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Wednesday, May 22, 2013 4:23 PM
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: Re: Eclipse plugin

Re: Eclipse plugin

2013-05-22 Thread Bharati Adkar
Hi,
I am using a Mac.

I have not used Maven before; I am new to Hadoop and Eclipse.

Any directions to start a project as MapReduce, as in the videos on YouTube?

Thanks,
Bharati


On May 22, 2013, at 4:23 PM, Sanjay Subramanian wrote:

Re: Sqoop Import Oracle Error - Attempted to generate class with no columns!

2013-05-22 Thread Venkat Ranganathan
Resending, as the last one bounced.

You need to specify the username and table name in uppercase; otherwise the
job will fail.

Thanks

Venkat




RE: YARN in 2.0 and 0.23

2013-05-22 Thread John Lilley
We don't necessarily have the freedom to choose; we are an application provider, 
and we want compatibility with as many versions as possible so as to fit into 
existing Hadoop installations.  For example, we currently read and write HDFS 
for 0.23, 1.0, 1.1, and 2.0.  Given that, I am trying to understand what 
significant differences to expect between the 0.23 and 2.0 YARN environments.

Thanks
John


From: Arun C Murthy [mailto:a...@hortonworks.com]
Sent: Wednesday, May 22, 2013 4:35 PM
To: user@hadoop.apache.org
Subject: Re: YARN in 2.0 and 0.23




Re: Shuffle phase replication factor

2013-05-22 Thread Kun Ling
Hi John,


   1. For the number of simultaneous connections: you can configure this
using the mapred.reduce.parallel.copies flag; the default is 5.

   2. As for whether this implies aggressive disconnection, only a little.
Normally, each reducer connects to each mapper task and asks for its
partitions of the map output file. Because there are only about 5
simultaneous connections fetching map output for each reducer, even for a
large MR cluster with 1000 nodes running a huge MR job with 1000 mappers and
1000 reducers, each node sees only about 5 connections at a time. So the
impact is only a little.

   3. What happens to pending/failing connections: the short answer is that
the reducer just tries to reconnect. There is a List<> that maintains all the
map outputs that need to be copied, and an element is removed only once the
map output has been successfully copied. A loop keeps looking into the list
and fetching the corresponding map output.

   All of the above is based on the Hadoop 1.0.4 source code, especially the
ReduceTask.java file.

yours,
Ling Kun


On Wed, May 22, 2013 at 10:57 PM, John Lilley wrote:




-- 
http://www.lingcc.com


dncp_block_verification log

2013-05-22 Thread Brahma Reddy Battula
Hi All,

On some systems, I noticed that when the block scanner runs, the
dncp_block_verification.log.curr file under the block pool gets quite large.

Please let me know:

i) Why is it growing on only some machines?

ii) What is the solution?

The following link also describes the problem:



http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201303.mbox/%3ccajzooycpad5w6cqdteliufy-h9r0pind9f0xelvt2bftwmm...@mail.gmail.com%3E



Thanks



Brahma Reddy