Re: Issue with running hadoop program using eclipse

2013-01-30 Thread Mohammad Tariq
Hello Vikas,

 It clearly shows that the class cannot be found. For
debugging, you can write your MR job as a standalone Java program and debug
it. It works. And if you want to debug just your mapper / reducer logic,
you should look into using MRUnit. There is a good write-up at
Cloudera's blog which talks about it in detail.
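
A minimal sketch of what such an MRUnit test could look like, assuming
WordCount.MapClass is a new-API Mapper<LongWritable, Text, Text, IntWritable>
(the class name comes from the error below; the input/output values are made up):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {
  @Test
  public void mapperEmitsOneCountPerWord() throws Exception {
    // MapDriver runs the mapper in-memory: no cluster and no job jar needed.
    MapDriver<LongWritable, Text, Text, IntWritable> driver =
        MapDriver.newMapDriver(new WordCount.MapClass());
    driver.withInput(new LongWritable(0), new Text("hello hello"))
          .withOutput(new Text("hello"), new IntWritable(1))
          .withOutput(new Text("hello"), new IntWritable(1))
          .runTest();
  }
}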

HTH

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Thu, Jan 31, 2013 at 11:56 AM, Vikas Jadhav wrote:

> Hi
> I have one Windows machine and one Linux machine.
> My Eclipse is on the Windows machine
> and Hadoop is running on a single Linux machine.
> I am trying to run the wordcount program from Eclipse (on the Windows machine) against
> Hadoop (on the Linux machine).
>  I am getting the following error:
>
>  13/01/31 11:48:14 WARN mapred.JobClient: Use GenericOptionsParser for
> parsing the arguments. Applications should implement Tool for the same.
> 13/01/31 11:48:15 WARN mapred.JobClient: No job jar file set.  User
> classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 13/01/31 11:48:16 INFO input.FileInputFormat: Total input paths to process
> : 1
> 13/01/31 11:48:16 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 13/01/31 11:48:16 WARN snappy.LoadSnappy: Snappy native library not loaded
> 13/01/31 11:48:26 INFO mapred.JobClient: Running job: job_201301300613_0029
> 13/01/31 11:48:27 INFO mapred.JobClient:  map 0% reduce 0%
> 13/01/31 11:48:40 INFO mapred.JobClient: Task Id :
> attempt_201301300613_0029_m_00_0, Status : FAILED
> java.lang.RuntimeException: java.lang.ClassNotFoundException:
> WordCount$MapClass
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:867)
>  at
> org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:719)
>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>  at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:396)
>  at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>  at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.lang.ClassNotFoundException: WordCount$MapClass
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>  at java.lang.Class.forName0(Native Method)
>  at java.lang.Class.forName(Class.java:247)
>  at
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
>  at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:865)
>  ... 8 more
>
>
>
>
> Also, I want to know how to debug a Hadoop program using Eclipse.
>
>
>
>
> Thank You.
>
>


Re: What is the best way to load data from one cluster to another cluster (Urgent requirement)

2013-01-30 Thread samir das mohapatra
thanks all.


On Thu, Jan 31, 2013 at 11:19 AM, Satbeer Lamba wrote:

> I might be wrong but have you considered distcp?
> On Jan 31, 2013 11:15 AM, "samir das mohapatra" 
> wrote:
>
>> Hi All,
>>
>>    Does anyone know how to load data from one Hadoop cluster (CDH4) to
>> another cluster (CDH4)? Our project's needs are:
>>    1) It should be a delta load or incremental load.
>>    2) It should be based on the timestamp.
>>    3) Data volume is 5 PB.
>>
>> Any help is appreciated.
>>
>> Regards,
>> samir.
>>
>


Recommendation required for Right Hadoop Distribution (CDH OR HortonWork)

2013-01-30 Thread samir das mohapatra
Hi All,
   My company wants to pick the right distribution of Apache Hadoop
   for its production as well as dev environments. Can anyone suggest which one
will be good for the future?

Hints:
They want to know the pros and cons of both.


Regards,
samir.


Re: What is the best way to load data from one cluster to another cluster (Urgent requirement)

2013-01-30 Thread Satbeer Lamba
I might be wrong but have you considered distcp?
On Jan 31, 2013 11:15 AM, "samir das mohapatra" 
wrote:

> Hi All,
>
>    Does anyone know how to load data from one Hadoop cluster (CDH4) to
> another cluster (CDH4)? Our project's needs are:
>    1) It should be a delta load or incremental load.
>    2) It should be based on the timestamp.
>    3) Data volume is 5 PB.
>
> Any help is appreciated.
>
> Regards,
> samir.
>
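
For reference, a rough sketch of what an incremental DistCp run could look
like (the hostnames and paths are made up, and -update only copies files that
differ on the destination, so true timestamp-based deltas still have to be
selected on the source side, e.g. via date-partitioned directories):

  hadoop distcp -update hdfs://source-nn:8020/data/events hdfs://dest-nn:8020/data/events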


Re: Filesystem closed exception

2013-01-30 Thread Hemanth Yamijala
FS Caching is enabled on the cluster (i.e. the default is not changed).

Our code isn't actually mapper code, but a standalone java program being
run as part of Oozie. It just seemed confusing and not a very clear
strategy to leave unclosed resources. Hence my suggestion to get an
uncached FS handle for this use case alone. Note, I am not suggesting to
disable FS caching in general.
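
For reference, a minimal sketch of the uncached-handle approach described
above (the class name and the path are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UncachedFsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Ask for an uncached instance so closing it cannot invalidate the shared,
    // cached FileSystem used elsewhere in the same JVM (e.g. by the framework).
    conf.setBoolean("fs.hdfs.impl.disable.cache", true);
    FileSystem fs = FileSystem.get(conf);
    try {
      fs.exists(new Path("/tmp"));  // ... use fs as needed ...
    } finally {
      fs.close();  // safe: this handle is private to this code
    }
  }
}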

Thanks
Hemanth


On Thu, Jan 31, 2013 at 12:19 AM, Alejandro Abdelnur wrote:

> Hemanth,
>
> Is FS caching enabled or not in your cluster?
>
> A simple solution would be to modify your mapper code not to close the FS.
> It will go away when the task ends anyway.
>
> Thx
>
>
> On Thu, Jan 24, 2013 at 5:26 PM, Hemanth Yamijala <
> yhema...@thoughtworks.com> wrote:
>
>> Hi,
>>
>> We are noticing a problem where we get a filesystem closed exception when
>> a map task is done and is finishing execution. By map task, I literally
>> mean the MapTask class of the map reduce code. Debugging this we found that
>> the mapper is getting a handle to the filesystem object and itself calling
>> a close on it. Because filesystem objects are cached, I believe the
>> behaviour is as expected in terms of the exception.
>>
>> I just wanted to confirm that:
>>
>> - if we do have a requirement to use a filesystem object in a mapper or
>> reducer, we should either not close it ourselves
>> - or (seems better to me) ask for a new version of the filesystem
>> instance by setting the fs.hdfs.impl.disable.cache property to true in job
>> configuration.
>>
>> Also, does anyone know if this behaviour was any different in Hadoop 0.20
>> ?
>>
>> For some context, this behaviour is actually seen in Oozie, which runs a
>> launcher mapper for a simple java action. Hence, the java action could very
>> well interact with a file system. I know this is probably better addressed
>> in Oozie context, but wanted to get the map reduce view of things.
>>
>>
>> Thanks,
>> Hemanth
>>
>
>
>
> --
> Alejandro
>


Fwd: YARN NM containers were killed

2013-01-30 Thread YouPeng Yang
Hi

   I posted my question a day ago; please can somebody help me to
figure out
what the problem is.
   Thank you.
regards
YouPeng Yang


-- Forwarded message --
From: YouPeng Yang 
Date: 2013/1/30
Subject: YARN NM containers were killed
To: user@hadoop.apache.org


I've tested hadoop-mapreduce-examples-2.0.0-cdh4.1.2.jar on my Hadoop
environment
(1 RM - Hadoop01 and 3 NMs - Hadoop02, Hadoop03, Hadoop04;
  OS: RHEL 5.5, CDH4.1.2):
./bin/hadoop jar
share/hadoop/mapreduce/hadoop-mapreduce-examples-2.0.0-cdh4.1.2.jar
 wordcount 1/input output

When I checked the logs, I was confused:
my Hadoop created 2 containers on Hadoop02 and 1 container on Hadoop03, however
0 containers on Hadoop04.

the result of the containers processing:

Hadoop02:
* container_1359422495723_0001_01_01
(its state changes as follows:NEW --> LOCALIZING --> LOCALIZED --> RUNNING
--> KILLING --> EXITED_WITH_SUCCESS)

  the log indicates that:
NodeStatusUpdaterImpl: Sending out status for container: container_id {,
app_attempt_id {, application_id {, id: 1, cluster_timestamp:
1359422495723, }, attemptId: 1, }, id: 1, }, state: C_RUNNING, diagnostics:
"", exit_status: -1000,
 ContainerLaunch: Container container_1359422495723_0001_01_01
succeeded
Container: Container container_1359422495723_0001_01_01 transitioned
from RUNNING to EXITED_WITH_SUCCESS
 ContainerLaunch: Cleaning up container
container_1359422495723_0001_01_01
NMAuditLogger: USER=hadoop OPERATION=Container Finished - Succeeded
TARGET=ContainerImpl RESULT=SUCCESSAPPID=application_1359422495723_0001
CONTAINERID=container_1359422495723_0001_01_01
 * container_1359422495723_0001_01_03
(its state changes as follows:NEW --> LOCALIZING --> LOCALIZED --> RUNNING
--> KILLING --> CONTAINER_CLEANEDUP_AFTER_KILL--> DONE)
 the log indicates that:
NodeStatusUpdaterImpl: Sending out status for container: container_id {,
app_attempt_id {, application_id {, id: 1, cluster_timestamp:
1359422495723, }, attemptId: 1, }, id: 3, }, state: C_RUNNING, diagnostics:
"Container killed by the ApplicationMaster.\n", exit_status: -1000,
 DefaultContainerExecutor: Exit code from task is : 137
NMAuditLogger: USER=hadoop OPERATION=Container Finished - Killed
TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1359422495723_0001
CONTAINERID=container_1359422495723_0001_01_03

Hadoop03:
* container_1359422495723_0001_01_02
(its state changes as follows:NEW --> LOCALIZING --> LOCALIZED --> RUNNING
--> KILLING --> CONTAINER_CLEANEDUP_AFTER_KILL--> DONE)
 NodeStatusUpdaterImpl: Sending out status for container: container_id {,
app_attempt_id {, application_id {, id: 1, cluster_timestamp:
1359422495723, }, attemptId: 1, }, id: 2, }, state: C_RUNNING, diagnostics:
"Container killed by the ApplicationMaster.\n", exit_status: -1000,
DefaultContainerExecutor: Exit code from task is : 143
NMAuditLogger: USER=hadoop OPERATION=Container Finished - Killed
TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1359422495723_0001
CONTAINERID=container_1359422495723_0001_01_02

My questions:
1. Why were 2 containers created on Hadoop02 while Hadoop04 got
nothing? Is that normal?
2. What is the principle that guides where containers are created?
 3. Why were the two containers (container_*_03 and
container_*_02) killed, while container_*_01 succeeded?
   Is that normal?


logs of Hadoop01 as follows:

2013-01-29 09:23:48,904 INFO
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated
new applicationId: 1
2013-01-29 09:23:50,201 INFO
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application
with id 1 submitted by user hadoop
2013-01-29 09:23:50,204 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=hadoop
IP=10.167.14.221 OPERATION=Submit Application Request
TARGET=ClientRMServiceRESULT=SUCCESS APPID=application_1359422495723_0001
2013-01-29 09:23:50,221 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1359422495723_0001 State change from NEW to SUBMITTED
2013-01-29 09:23:50,221 INFO
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
Registering appattempt_1359422495723_0001_01
2013-01-29 09:23:50,222 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1359422495723_0001_01 State change from NEW to SUBMITTED
2013-01-29 09:23:50,242 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler:
Application Submission: application_1359422495723_0001 from hadoop,
currently active: 1
2013-01-29 09:23:50,250 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1359422495723_0001_01 State change from SUBMITTED to
SCHEDULED
2013-01-29 09:23:50,250 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1359422495723_0001 State change from SUBMITTED to ACCEPTED
2013-01-29 09:23:50,581 INFO
org.a

Re: How to find Blacklisted Nodes via cli.

2013-01-30 Thread Hemanth Yamijala
Hi,

Part answer: you can get the blacklisted tasktrackers using the command
line:

mapred job -list-blacklisted-trackers.

Also, I think that a blacklisted tasktracker becomes 'unblacklisted' if it
works fine after some time. Though I am not very sure about this.

Thanks
hemanth


On Wed, Jan 30, 2013 at 9:35 PM, Dhanasekaran Anbalagan
wrote:

> Hi Guys,
>
> How do I find blacklisted nodes via the command line? I want to see the
> JobTracker blacklisted nodes and the HDFS blacklisted nodes.
>
> Also, how do I clear blacklisted nodes for a clean start? Is the only option to
> restart the service, or is there some other way to clear the blacklisted nodes?
>
> please guide me.
>
> -Dhanasekaran.
>
> Did I learn something today? If not, I wasted it.
>


Problem in reading Map Output file via RecordReader

2013-01-30 Thread anil gupta
Hi All,

I am using HBase 0.92.1. I am trying to break the HBase bulk loading into
multiple MR jobs since I want to populate more than one HBase table from a
single CSV file. I have looked into the MultiTableOutputFormat class but it
doesn't solve my purpose because it does not generate HFiles.

I modified the bulk loader job of HBase and removed the reducer phase so
that I can generate output of <ImmutableBytesWritable, Put> pairs for multiple
tables in one MR job (phase 1).
Now, I ended up writing an input format that reads <ImmutableBytesWritable,
Put> pairs, to use it to read the output of the mappers (phase 1) and generate the
HFiles for each table.

I implemented a RecordReader assuming that I can use
readFields(DataInput) to read the ImmutableBytesWritable and the Put respectively.

As per my understanding, the format of the input file (the output files of the mappers
of phase 1) is <ImmutableBytesWritable, Put>.
However, when I try to read the file like that, the size of the
ImmutableBytesWritable is wrong and it throws an OOM due to that. The size of the
ImmutableBytesWritable (rowkey) should not be greater than 32 bytes for my
use case, but as per the input it is 808460337 bytes. I am pretty sure
that either my understanding of the input format is wrong or my implementation
of the record reader has some problem.

Can someone tell me the correct way of deserializing the output file of a
mapper? Or is there some problem with my code?
Here is the link to my initial stab at RecordReader:
https://dl.dropbox.com/u/64149128/ImmutableBytesWritable_Put_RecordReader.java
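
For what it's worth, if the phase-1 job writes its map output with
SequenceFileOutputFormat (an assumption; the thread does not say which output
format is used), the pairs are stored as a SequenceFile rather than raw
readFields() records, and can be read back with something like this sketch
(the path handling is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.SequenceFile;

public class MapOutputDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path file = new Path(args[0]);  // one output file of the phase-1 job
    SequenceFile.Reader reader =
        new SequenceFile.Reader(file.getFileSystem(conf), file, conf);
    try {
      ImmutableBytesWritable key = new ImmutableBytesWritable();
      Put value = new Put();
      // next() deserializes the key/value classes declared in the file header,
      // so the row key comes out with the expected length.
      while (reader.next(key, value)) {
        System.out.println(key.getLength() + "-byte row key, "
            + value.size() + " KeyValues");
      }
    } finally {
      reader.close();
    }
  }
}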
-- 
Thanks & Regards,
Anil Gupta


Re: Oozie workflow error - renewing token issue

2013-01-30 Thread Daryn Sharp
The token renewer needs to be the job tracker principal.  I think oozie had "mr 
token" hardcoded at one point, but later changed it to use a conf setting.

The rest of the log looks very odd - i.e. it looks like security is off, but it
can't be.  It's trying to renew HDFS tokens issued for the "hdfs" superuser.
Are you running your job as hdfs?  With security enabled, the container
executor won't run jobs for hdfs…  It's also trying to renew as "mapred
(auth:SIMPLE)".  SIMPLE means that security is disabled, which doesn't make
sense since tokens require Kerberos.

The cannot find renewer class for MAPREDUCE_DELEGATION_TOKEN is also odd 
because it did subsequently try to renew the token.

Daryn



On Jan 30, 2013, at 2:14 PM, Alejandro Abdelnur wrote:

Corbett,

[Moving thread to user@oozie.a.o, BCCing common-user@hadoop.a.o]

* What version of Oozie are you using?
* Is the cluster a secure setup (Kerberos enabled)?
* Would you mind posting the complete launcher logs?

Thx



On Wed, Jan 30, 2013 at 6:14 AM, Corbett Martin 
mailto:comar...@nhin.com>> wrote:
Thanks for the tip.

The sqoop command listed in the stdout log file is:
sqoop
import
--driver
org.apache.derby.jdbc.ClientDriver
--connect
jdbc:derby://test-server:1527/mondb
--username
monuser
--password
x
--table
MONITOR
--split-by
request_id
--target-dir
/mon/import
--append
--incremental
append
--check-column
request_id
--last-value
200

The following information is from the stderr and stdout log files.

/var/log/hadoop-0.20-mapreduce/userlogs/job_201301231648_0029/attempt_201301231648_0029_m_00_0
 # more stderr
No such sqoop tool: sqoop. See 'sqoop help'.
Intercepting System.exit(1)
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SqoopMain], 
exit code [1]


/var/log/hadoop-0.20-mapreduce/userlogs/job_201301231648_0029/attempt_201301231648_0029_m_00_0
 # more stdout
...
...
...
>>> Invoking Sqoop command line now >>>

1598 [main] WARN  org.apache.sqoop.tool.SqoopTool  - $SQOOP_CONF_DIR has not 
been set in the environment. Cannot check for additional configuration.
Intercepting System.exit(1)

<<< Invocation of Main class completed <<<

Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SqoopMain], 
exit code [1]

Oozie Launcher failed, finishing Hadoop job gracefully

Oozie Launcher ends


~Corbett Martin
Software Architect
AbsoluteAR Accounts Receivable Services - An NHIN Solution

-Original Message-
From: Harsh J [mailto:ha...@cloudera.com]
Sent: Tuesday, January 29, 2013 10:11 PM
To: mailto:user@hadoop.apache.org>>
Subject: Re: Oozie workflow error - renewing token issue

The job that Oozie launches for your action, which you are observing is 
failing, does its own logs (task logs) show any errors?

On Wed, Jan 30, 2013 at 4:59 AM, Corbett Martin 
mailto:comar...@nhin.com>> wrote:
> Oozie question
>
>
>
> I'm trying to run an Oozie workflow (sqoop action) from the Hue
> console and it fails every time.  No exception in the oozie log but I
> see this in the Job Tracker log file.
>
>
>
> Two primary issues seem to be
>
> 1.   Client mapred tries to renew a token with renewer specified as mr token
>
>
>
> And
>
>
>
> 2.   Cannot find class for token kind MAPREDUCE_DELEGATION_TOKEN
>
>
>
> Any ideas how to get past this?
>
>
>
> Full Stacktrace:
>
>
>
> 2013-01-29 17:11:28,860 INFO
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
> Creating password for identifier: owner=hdfs, renewer=mr token,
> realUser=oozie, issueDate=1359501088860, maxDate=136010560,
> sequenceNumber=75, masterKeyId=8
>
> 2013-01-29 17:11:28,871 INFO
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
> Creating password for identifier: owner=hdfs, renewer=mr token,
> realUser=oozie, issueDate=1359501088871, maxDate=136010571,
> sequenceNumber=76, masterKeyId=8
>
> 2013-01-29 17:11:29,202 INFO
> org.apache.hadoop.mapreduce.security.token.DelegationTokenRenewal:
> registering token for renewal for service 
> =10.204.12.62:8021 and jobID
> =
> job_201301231648_0029
>
> 2013-01-29 17:11:29,211 INFO
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
> Token renewal requested for identifier: owner=hdfs, renewer=mr token,
> realUser=oozie, issueDate=1359501088871, maxDate=136010571,
> sequenceNumber=76, masterKeyId=8
>
> 2013-01-29 17:11:29,211 ERROR
> org.apache.hadoop.security.UserGroupInformation:
>

Re: How to Integrate MicroStrategy with Hadoop

2013-01-30 Thread samir das mohapatra
thanks for quick reply.


On Thu, Jan 31, 2013 at 2:19 AM, Nitin Pawar wrote:

> this is specific to Cloudera so please post only to CDH users
>
> here is the link
>
> http://www.cloudera.com/content/cloudera/en/solutions/partner/Microstrategy.html
>
> you can follow links from there on
>
>
> On Thu, Jan 31, 2013 at 2:16 AM, samir das mohapatra <
> samir.help...@gmail.com> wrote:
>
>> We are using Cloudera Hadoop
>>
>>
>> On Thu, Jan 31, 2013 at 2:12 AM, samir das mohapatra <
>> samir.help...@gmail.com> wrote:
>>
>>> Hi All,
>>>I wanted to know how to connect Hadoop with MicroStrategy.
>>>Any help is very helpful.
>>>
>>>   Waiting for your response.
>>>
>>> Note: Any URL and example will be really helpful for me.
>>>
>>> Thanks,
>>> samir
>>>
>>
>>
>
>
> --
> Nitin Pawar
>


How to Integrate Cloudera Hadoop With Microstrategy and Hadoop With SAP HANA

2013-01-30 Thread samir das mohapatra
Regards,
samir.


Re: How to Integrate MicroStrategy with Hadoop

2013-01-30 Thread Nitin Pawar
this is specific to Cloudera so please post only to CDH users

here is the link
http://www.cloudera.com/content/cloudera/en/solutions/partner/Microstrategy.html

you can follow links from there on


On Thu, Jan 31, 2013 at 2:16 AM, samir das mohapatra <
samir.help...@gmail.com> wrote:

> We are using Cloudera Hadoop
>
>
> On Thu, Jan 31, 2013 at 2:12 AM, samir das mohapatra <
> samir.help...@gmail.com> wrote:
>
>> Hi All,
>>I wanted to know how to connect Hadoop with MicroStrategy.
>>Any help is very helpful.
>>
>>   Waiting for your response.
>>
>> Note: Any URL and example will be really helpful for me.
>>
>> Thanks,
>> samir
>>
>
>


-- 
Nitin Pawar


Re: How to Integrate MicroStrategy with Hadoop

2013-01-30 Thread samir das mohapatra
We are using Cloudera Hadoop


On Thu, Jan 31, 2013 at 2:12 AM, samir das mohapatra <
samir.help...@gmail.com> wrote:

> Hi All,
>I wanted to know how to connect Hadoop with MicroStrategy.
>Any help is very helpful.
>
>   Waiting for your response.
>
> Note: Any URL and example will be really helpful for me.
>
> Thanks,
> samir
>


How to Integrate SAP HANA WITH Hadoop

2013-01-30 Thread samir das mohapatra
Hi all,
We need connectivity between SAP HANA and Hadoop.
 Do you have any experience with that? Can you please share some documents
and examples with me, so that it will be really helpful for me?

thanks,
samir


How to Integrate MicroStrategy with Hadoop

2013-01-30 Thread samir das mohapatra
Hi All,
   I wanted to know how to connect Hadoop with MicroStrategy.
   Any help is very helpful.

  Waiting for your response.

Note: Any URL and example will be really helpful for me.

Thanks,
samir


Re: Oozie workflow error - renewing token issue

2013-01-30 Thread Alejandro Abdelnur
Corbett,

[Moving thread to user@oozie.a.o, BCCing common-user@hadoop.a.o]

* What version of Oozie are you using?
* Is the cluster a secure setup (Kerberos enabled)?
* Would you mind posting the complete launcher logs?

Thx



On Wed, Jan 30, 2013 at 6:14 AM, Corbett Martin  wrote:

> Thanks for the tip.
>
> The sqoop command listed in the stdout log file is:
> sqoop
> import
> --driver
> org.apache.derby.jdbc.ClientDriver
> --connect
> jdbc:derby://test-server:1527/mondb
> --username
> monuser
> --password
> x
> --table
> MONITOR
> --split-by
> request_id
> --target-dir
> /mon/import
> --append
> --incremental
> append
> --check-column
> request_id
> --last-value
> 200
>
> The following information is from the stderr and stdout log files.
>
> /var/log/hadoop-0.20-mapreduce/userlogs/job_201301231648_0029/attempt_201301231648_0029_m_00_0
> # more stderr
> No such sqoop tool: sqoop. See 'sqoop help'.
> Intercepting System.exit(1)
> Failing Oozie Launcher, Main class
> [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]
>
>
> /var/log/hadoop-0.20-mapreduce/userlogs/job_201301231648_0029/attempt_201301231648_0029_m_00_0
> # more stdout
> ...
> ...
> ...
> >>> Invoking Sqoop command line now >>>
>
> 1598 [main] WARN  org.apache.sqoop.tool.SqoopTool  - $SQOOP_CONF_DIR has
> not been set in the environment. Cannot check for additional configuration.
> Intercepting System.exit(1)
>
> <<< Invocation of Main class completed <<<
>
> Failing Oozie Launcher, Main class
> [org.apache.oozie.action.hadoop.SqoopMain], exit code [1]
>
> Oozie Launcher failed, finishing Hadoop job gracefully
>
> Oozie Launcher ends
>
>
> ~Corbett Martin
> Software Architect
> AbsoluteAR Accounts Receivable Services - An NHIN Solution
>
> -Original Message-
> From: Harsh J [mailto:ha...@cloudera.com]
> Sent: Tuesday, January 29, 2013 10:11 PM
> To: 
> Subject: Re: Oozie workflow error - renewing token issue
>
> The job that Oozie launches for your action, which you are observing is
> failing, does its own logs (task logs) show any errors?
>
> On Wed, Jan 30, 2013 at 4:59 AM, Corbett Martin  wrote:
> > Oozie question
> >
> >
> >
> > I'm trying to run an Oozie workflow (sqoop action) from the Hue
> > console and it fails every time.  No exception in the oozie log but I
> > see this in the Job Tracker log file.
> >
> >
> >
> > Two primary issues seem to be
> >
> > 1.   Client mapred tries to renew a token with renewer specified as mr
> token
> >
> >
> >
> > And
> >
> >
> >
> > 2.   Cannot find class for token kind MAPREDUCE_DELEGATION_TOKEN
> >
> >
> >
> > Any ideas how to get past this?
> >
> >
> >
> > Full Stacktrace:
> >
> >
> >
> > 2013-01-29 17:11:28,860 INFO
> >
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
> > Creating password for identifier: owner=hdfs, renewer=mr token,
> > realUser=oozie, issueDate=1359501088860, maxDate=136010560,
> > sequenceNumber=75, masterKeyId=8
> >
> > 2013-01-29 17:11:28,871 INFO
> >
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
> > Creating password for identifier: owner=hdfs, renewer=mr token,
> > realUser=oozie, issueDate=1359501088871, maxDate=136010571,
> > sequenceNumber=76, masterKeyId=8
> >
> > 2013-01-29 17:11:29,202 INFO
> > org.apache.hadoop.mapreduce.security.token.DelegationTokenRenewal:
> > registering token for renewal for service =10.204.12.62:8021 and jobID
> > =
> > job_201301231648_0029
> >
> > 2013-01-29 17:11:29,211 INFO
> >
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
> > Token renewal requested for identifier: owner=hdfs, renewer=mr token,
> > realUser=oozie, issueDate=1359501088871, maxDate=136010571,
> > sequenceNumber=76, masterKeyId=8
> >
> > 2013-01-29 17:11:29,211 ERROR
> > org.apache.hadoop.security.UserGroupInformation:
> > PriviledgedActionException as:mapred (auth:SIMPLE)
> > cause:org.apache.hadoop.security.AccessControlException: Client mapred
> > tries to renew a token with renewer specified as mr token
> >
> > 2013-01-29 17:11:29,211 WARN org.apache.hadoop.security.token.Token:
> > Cannot find class for token kind MAPREDUCE_DELEGATION_TOKEN
> >
> > 2013-01-29 17:11:29,211 INFO org.apache.hadoop.ipc.Server: IPC Server
> > handler 9 on 8021, call renewDelegationToken(Kind:
> > MAPREDUCE_DELEGATION_TOKEN, Service: 10.204.12.62:8021, Ident: 00 04
> > 68 64
> > 66 73 08 6d 72 20 74 6f 6b 65 6e 05 6f 6f 7a 69 65 8a 01 3c 88 94 58
> > 67 8a
> > 01 3c ac a0 dc 67 4c 08), rp

Re: Filesystem closed exception

2013-01-30 Thread Alejandro Abdelnur
Hemanth,

Is FS caching enabled or not in your cluster?

A simple solution would be to modify your mapper code not to close the FS.
It will go away when the task ends anyway.

Thx


On Thu, Jan 24, 2013 at 5:26 PM, Hemanth Yamijala  wrote:

> Hi,
>
> We are noticing a problem where we get a filesystem closed exception when
> a map task is done and is finishing execution. By map task, I literally
> mean the MapTask class of the map reduce code. Debugging this we found that
> the mapper is getting a handle to the filesystem object and itself calling
> a close on it. Because filesystem objects are cached, I believe the
> behaviour is as expected in terms of the exception.
>
> I just wanted to confirm that:
>
> - if we do have a requirement to use a filesystem object in a mapper or
> reducer, we should either not close it ourselves
> - or (seems better to me) ask for a new version of the filesystem instance
> by setting the fs.hdfs.impl.disable.cache property to true in job
> configuration.
>
> Also, does anyone know if this behaviour was any different in Hadoop 0.20 ?
>
> For some context, this behaviour is actually seen in Oozie, which runs a
> launcher mapper for a simple java action. Hence, the java action could very
> well interact with a file system. I know this is probably better addressed
> in Oozie context, but wanted to get the map reduce view of things.
>
>
> Thanks,
> Hemanth
>



-- 
Alejandro


Re: Same Map/Reduce works on the command line BUT it hangs through Oozie without any error msg

2013-01-30 Thread Alejandro Abdelnur
Yaotian,

*Oozie version?
*More details on what exactly your workflow action is (mapred, java, shell,
etc.)
*What is in the task log of the oozie launcher job for that action?

Thx



On Fri, Jan 25, 2013 at 10:43 PM, yaotian  wrote:

> I manually run it in Hadoop. It works.
>
> But when I run it as a job through Oozie, the map/reduce hangs without any
> error msg.
>
> I checked master and datanode.
>
> ===> From the master. i saw:
> 2013-01-26 06:32:38,517 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task (JOB_SETUP) 'attempt_201301251528_0014_r_04_0' to tip
> task_201301251528_0014_r_04, for tracker 'tracker_datanode1:localhost/
> 127.0.0.1:45695'
> 2013-01-26 06:32:44,538 INFO org.apache.hadoop.mapred.JobInProgress: Task
> 'attempt_201301251528_0014_r_04_0' has completed
> task_201301251528_0014_r_04 successfully.
> 2013-01-26 06:33:40,640 INFO org.apache.hadoop.mapred.JSPUtil: Loading Job
> History file job_201301090834_0089.   Cache size is 0
> 2013-01-26 06:33:40,640 WARN org.mortbay.log: /jobtaskshistory.jsp:
> java.io.FileNotFoundException: File
> /data/hadoop-0.20.205.0/logs/history/done/version-1/master_1357720451772_/2013/01/11/00/job_201301090834_0089_1357872144366_hadoop_sorting+locations+per+user
> does not exist.
> 2013-01-26 06:35:17,008 INFO org.apache.hadoop.mapred.JSPUtil: Loading Job
> History file job_201301090834_0051.   Cache size is 0
> 2013-01-26 06:35:17,009 WARN org.mortbay.log: /jobtaskshistory.jsp:
> java.io.FileNotFoundException: File
> /data/hadoop-0.20.205.0/logs/history/done/version-1/master_1357720451772_/2013/01/11/00/job_201301090834_0051_1357870007893_hadoop_oozie%3Alauncher%3AT%3Djava%3AW%3Dmap-reduce-wf%3AA%3Dsort%5Fuser%3A
> does not exist.
> 2013-01-26 06:40:04,251 INFO org.apache.hadoop.mapred.JSPUtil: Loading Job
> History file job_201301090834_0026.   Cache size is 0
> 2013-01-26 06:40:04,251 WARN org.mortbay.log: /taskstatshistory.jsp:
> java.io.FileNotFoundException: File
> /data/hadoop-0.20.205.0/logs/history/done/version-1/master_1357720451772_/2013/01/09/00/job_201301090834_0026_1357722497582_hadoop_oozie%3Alauncher%3AT%3Djava%3AW%3Dmap-reduce-wf%3AA%3Dreport%5Fsta
> does not exist.
>
> ===> From the datanode, I just saw:
> 2013-01-26 06:39:11,143 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201301251528_0013_m_00_0 0.0%
> hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5
> 2013-01-26 06:39:41,256 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201301251528_0013_m_00_0 0.0%
> hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5
> 2013-01-26 06:40:11,371 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201301251528_0013_m_00_0 0.0%
> hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5
> 2013-01-26 06:40:41,486 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201301251528_0013_m_00_0 0.0%
> hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5
> 2013-01-26 06:41:11,604 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201301251528_0013_m_00_0 0.0%
> hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5
> 2013-01-26 06:41:41,724 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201301251528_0013_m_00_0 0.0%
> hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5
> 2013-01-26 06:42:11,845 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201301251528_0013_m_00_0 0.0%
> hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5
> 2013-01-26 06:42:41,959 INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201301251528_0013_m_00_0 0.0%
> hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5
>
>
>
>
>


-- 
Alejandro


Re: what will happen when HDFS restarts but with some dead nodes

2013-01-30 Thread Bertrand Dechoux
Well, the documentation is more explicit.

Specifies the percentage of blocks that should satisfy the minimal
replication requirement defined by dfs.namenode.replication.min.

Which happens to be 1 by default but doesn't need to stay that way.
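
A minimal hdfs-site.xml sketch of the two related settings (the values shown
are just the defaults, for illustration):

<property>
  <name>dfs.namenode.replication.min</name>
  <value>1</value>
</property>
<property>
  <name>dfs.namenode.safemode.threshold-pct</name>
  <value>0.999f</value>
</property>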

Regards

Bertrand

On Wed, Jan 30, 2013 at 5:45 PM, Nan Zhu  wrote:

>  I think Chen is asking about lost replicas,
>
> so, according to Harsh's reply, in safe mode the NN will know all blocks
> that have fewer replicas than 3 (the default setup) but no fewer than 1,
> and after getting out of safe mode it will schedule the actual re-replication
> work? Hope I understand it correctly.
>
> Best,
>
> --
> Nan Zhu
> School of Computer Science,
> McGill University
>
>
> On Wednesday, 30 January, 2013 at 11:39 AM, Harsh J wrote:
>
> Yes, if there are missing blocks (i.e. all replicas lost), and the
> block availability threshold is set to its default of 0.999f (99.9%
> availability required), then NN will not come out of safemode
> automatically. You can control this behavior by configuring
> dfs.namenode.safemode.threshold.
>
> On Wed, Jan 30, 2013 at 10:06 PM, Chen He  wrote:
>
> Hi Harsh
>
> I have a question. How does the namenode get out of safemode when data
> blocks are lost? Only via the administrator? According to my experience, the NN (0.21)
> stayed in safemode for several days before I manually turned safemode off.
> There were 2 blocks lost.
>
> Chen
>
>
> On Wed, Jan 30, 2013 at 10:27 AM, Harsh J  wrote:
>
>
> NN does recalculate new replication work to do due to unavailable
> replicas ("under-replication") when it starts and receives all block
> reports, but executes this only after out of safemode. When in
> safemode, across the HDFS services, no mutations are allowed.
>
> On Wed, Jan 30, 2013 at 8:34 AM, Nan Zhu  wrote:
>
> Hi, all
>
> I'm wondering if HDFS is stopped, and some of the machines of the
> cluster
> are moved, some of the block replication are definitely lost for moving
> machines
>
> when I restart the system, will the namenode recalculate the data
> distribution?
>
> Best,
>
> --
> Nan Zhu
> School of Computer Science,
> McGill University
>
>
>
>
> --
> Harsh J
>
>
>
>
> --
> Harsh J
>
>
>


Re: Data migration from one cluster to other running diff. versions

2013-01-30 Thread Harsh J
DistCp is the fastest option, letting you copy data in parallel. For
incompatible RPC versions between different HDFS clusters, the HFTP
solution can work (documented on the DistCp manual).
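
For example, a cross-version copy is typically run from the destination
cluster, reading the source over HFTP (the hostnames below are made up, and
the ports are just the usual defaults):

  hadoop distcp hftp://source-namenode:50070/src/path hdfs://dest-namenode:8020/dst/path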

On Wed, Jan 30, 2013 at 10:13 PM, Siddharth Tiwari
 wrote:
> Hi Team,
>
> What is the best way to migrate data residing on one cluster to another
> cluster ?
> Are there better methods available than distcp ?
> What if both the clusters running different RPC protocol versions ?
>
>
> **
> Cheers !!!
> Siddharth Tiwari
> Have a refreshing day !!!
> "Every duty is holy, and devotion to duty is the highest form of worship of
> God.”
> "Maybe other people will try to limit me but I don't limit myself"



-- 
Harsh J


Re: what will happen when HDFS restarts but with some dead nodes

2013-01-30 Thread Nan Zhu
I think Chen is asking about lost replicas,

so, according to Harsh's reply, in safe mode the NN will know all blocks that have
fewer replicas than 3 (the default setup) but no fewer than 1, and after
getting out of safe mode it will schedule the actual re-replication work? Hope I
understand it correctly.

Best,  

-- 
Nan Zhu
School of Computer Science,
McGill University



On Wednesday, 30 January, 2013 at 11:39 AM, Harsh J wrote:

> Yes, if there are missing blocks (i.e. all replicas lost), and the
> block availability threshold is set to its default of 0.999f (99.9%
> availability required), then NN will not come out of safemode
> automatically. You can control this behavior by configuring
> dfs.namenode.safemode.threshold.
> 
> On Wed, Jan 30, 2013 at 10:06 PM, Chen He wrote:
> > Hi Harsh
> > 
> > I have a question. How does the namenode get out of safemode when data
> > blocks are lost? Only via the administrator? According to my experience, the NN (0.21)
> > stayed in safemode for several days before I manually turned safemode off.
> > There were 2 blocks lost.
> > 
> > Chen
> > 
> > 
> > On Wed, Jan 30, 2013 at 10:27 AM, Harsh J wrote:
> > > 
> > > NN does recalculate new replication work to do due to unavailable
> > > replicas ("under-replication") when it starts and receives all block
> > > reports, but executes this only after out of safemode. When in
> > > safemode, across the HDFS services, no mutations are allowed.
> > > 
> > > On Wed, Jan 30, 2013 at 8:34 AM, Nan Zhu wrote:
> > > > Hi, all
> > > > 
> > > > I'm wondering if HDFS is stopped, and some of the machines of the
> > > > cluster
> > > > are moved, some of the block replication are definitely lost for moving
> > > > machines
> > > > 
> > > > when I restart the system, will the namenode recalculate the data
> > > > distribution?
> > > > 
> > > > Best,
> > > > 
> > > > --
> > > > Nan Zhu
> > > > School of Computer Science,
> > > > McGill University
> > > > 
> > > 
> > > 
> > > 
> > > 
> > > --
> > > Harsh J
> > > 
> > 
> > 
> 
> 
> 
> 
> -- 
> Harsh J
> 
> 




Data migration from one cluster to other running diff. versions

2013-01-30 Thread Siddharth Tiwari
Hi Team,

What is the best way to migrate data residing on one cluster to another cluster 
?
Are there better methods available than distcp ?
What if both the clusters running different RPC protocol versions ?

**

Cheers !!!

Siddharth Tiwari

Have a refreshing day !!!
"Every duty is holy, and devotion to duty is the highest form of worship of 
God.” 

"Maybe other people will try to limit me but I don't limit myself"
  

Re: what will happen when HDFS restarts but with some dead nodes

2013-01-30 Thread Harsh J
Yes, if there are missing blocks (i.e. all replicas lost), and the
block availability threshold is set to its default of 0.999f (99.9%
availability required), then NN will not come out of safemode
automatically. You can control this behavior by configuring
dfs.namenode.safemode.threshold.

On Wed, Jan 30, 2013 at 10:06 PM, Chen He  wrote:
> Hi Harsh
>
> I have a question. How does the namenode get out of safemode when data
> blocks are lost? Only via the administrator? According to my experience, the NN (0.21)
> stayed in safemode for several days before I manually turned safemode off.
> There were 2 blocks lost.
>
> Chen
>
>
> On Wed, Jan 30, 2013 at 10:27 AM, Harsh J  wrote:
>>
>> NN does recalculate new replication work to do due to unavailable
>> replicas ("under-replication") when it starts and receives all block
>> reports, but executes this only after out of safemode. When in
>> safemode, across the HDFS services, no mutations are allowed.
>>
>> On Wed, Jan 30, 2013 at 8:34 AM, Nan Zhu  wrote:
>> > Hi, all
>> >
>> > I'm wondering if HDFS is stopped, and some of the machines of the
>> > cluster
>> > are moved,  some of the block replication are definitely lost for moving
>> > machines
>> >
>> > when I restart the system, will the namenode recalculate the data
>> > distribution?
>> >
>> > Best,
>> >
>> > --
>> > Nan Zhu
>> > School of Computer Science,
>> > McGill University
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J


Re: what will happen when HDFS restarts but with some dead nodes

2013-01-30 Thread Nitin Pawar
Following are the configs it looks at. Unless the admin forces it to come out
of safemode, it respects the values below:

dfs.namenode.safemode.threshold-pct (default 0.999f): Specifies the percentage
of blocks that should satisfy the minimal replication requirement defined by
dfs.namenode.replication.min. Values less than or equal to 0 mean not to
wait for any particular percentage of blocks before exiting safemode.
Values greater than 1 will make safe mode permanent.

dfs.namenode.safemode.min.datanodes (default 0): Specifies the number of
datanodes that must be considered alive before the name node exits safemode.
Values less than or equal to 0 mean not to take the number of live datanodes
into account when deciding whether to remain in safe mode during startup.
Values greater than the number of datanodes in the cluster will make safe mode
permanent.

dfs.namenode.safemode.extension (default 30000): Determines extension of safe
mode in milliseconds after the threshold level is reached.
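
For reference, the safe mode state can also be inspected and overridden from
the command line (a minimal sketch; on newer releases "hdfs dfsadmin" is the
equivalent entry point):

  hadoop dfsadmin -safemode get     # report whether the NameNode is in safe mode
  hadoop dfsadmin -safemode leave   # admin override: force the NameNode out of safe mode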


On Wed, Jan 30, 2013 at 10:06 PM, Chen He  wrote:

> Hi Harsh
>
> I have a question. How does the namenode get out of safemode when data
> blocks are lost? Only via the administrator? According to my experience, the NN (0.21)
> stayed in safemode for several days before I manually turned safemode off.
> There were 2 blocks lost.
>
> Chen
>
>
> On Wed, Jan 30, 2013 at 10:27 AM, Harsh J  wrote:
>
>> NN does recalculate new replication work to do due to unavailable
>> replicas ("under-replication") when it starts and receives all block
>> reports, but executes this only after out of safemode. When in
>> safemode, across the HDFS services, no mutations are allowed.
>>
>> On Wed, Jan 30, 2013 at 8:34 AM, Nan Zhu  wrote:
>> > Hi, all
>> >
>> > I'm wondering if HDFS is stopped, and some of the machines of the
>> cluster
>> > are moved,  some of the block replication are definitely lost for moving
>> > machines
>> >
>> > when I restart the system, will the namenode recalculate the data
>> > distribution?
>> >
>> > Best,
>> >
>> > --
>> > Nan Zhu
>> > School of Computer Science,
>> > McGill University
>> >
>> >
>>
>>
>>
>> --
>> Harsh J
>>
>
>


-- 
Nitin Pawar


Re: what will happen when HDFS restarts but with some dead nodes

2013-01-30 Thread Chen He
Hi Harsh

I have a question. How does the namenode get out of safemode when data
blocks are lost? Only via the administrator? According to my experience, the NN (0.21)
stayed in safemode for several days before I manually turned safemode off.
There were 2 blocks lost.

Chen

On Wed, Jan 30, 2013 at 10:27 AM, Harsh J  wrote:

> NN does recalculate new replication work to do due to unavailable
> replicas ("under-replication") when it starts and receives all block
> reports, but executes this only after out of safemode. When in
> safemode, across the HDFS services, no mutations are allowed.
>
> On Wed, Jan 30, 2013 at 8:34 AM, Nan Zhu  wrote:
> > Hi, all
> >
> > I'm wondering if HDFS is stopped, and some of the machines of the cluster
> > are moved,  some of the block replication are definitely lost for moving
> > machines
> >
> > when I restart the system, will the namenode recalculate the data
> > distribution?
> >
> > Best,
> >
> > --
> > Nan Zhu
> > School of Computer Science,
> > McGill University
> >
> >
>
>
>
> --
> Harsh J
>


Re: what will happen when HDFS restarts but with some dead nodes

2013-01-30 Thread Harsh J
NN does recalculate new replication work to do due to unavailable
replicas ("under-replication") when it starts and receives all block
reports, but executes this only after out of safemode. When in
safemode, across the HDFS services, no mutations are allowed.

On Wed, Jan 30, 2013 at 8:34 AM, Nan Zhu  wrote:
> Hi, all
>
> I'm wondering if HDFS is stopped, and some of the machines of the cluster
> are moved,  some of the block replication are definitely lost for moving
> machines
>
> when I restart the system, will the namenode recalculate the data
> distribution?
>
> Best,
>
> --
> Nan Zhu
> School of Computer Science,
> McGill University
>
>



-- 
Harsh J


Re: How to find Blacklisted Nodes via cli.

2013-01-30 Thread Nitin Pawar
bin/hadoop dfsadmin -report should give you what you are looking for.

A node is blacklisted only if there are too many failures on a particular
node. You can clear it by restarting the particular datanode or tasktracker
service. This is for the better performance of your Hadoop cluster: removing
a node where too many failures are happening lets you figure
out the reason and then add it back into the cluster. You can also exclude the
node from the cluster altogether.

I never tried removing a node from the blacklist on the fly myself, so I am not
sure how that is done. I will wait to see if someone can tell me whether it is
possible without a restart (and without increasing the failure thresholds).


On Wed, Jan 30, 2013 at 9:35 PM, Dhanasekaran Anbalagan
wrote:

> Hi Guys,
>
> How do I find blacklisted nodes via the command line? I want to see the
> JobTracker blacklisted nodes and the HDFS blacklisted nodes.
>
> Also, how do I clear blacklisted nodes for a clean start? Is the only option to
> restart the service, or is there some other way to clear the blacklisted nodes?
>
> please guide me.
>
> -Dhanasekaran.
>
> Did I learn something today? If not, I wasted it.
>



-- 
Nitin Pawar


Re: what will happen when HDFS restarts but with some dead nodes

2013-01-30 Thread Chen He
That is correct if you do not manually exit NN safemode.

Regards

Chen
On Jan 30, 2013 8:59 AM, "Jean-Marc Spaggiari" 
wrote:

> Hi Nan,
>
> When the Namenode EXITS the safemode, you can assume that all
> blocks ARE fully replicated. If the Namenode is still IN safemode, that
> means that not all blocks are fully replicated.
>
> JM
>
> 2013/1/29, Nan Zhu :
> > So, we can assume that all blocks are fully replicated at the start
> point of
> > HDFS?
> >
> > Best,
> >
> > --
> > Nan Zhu
> > School of Computer Science,
> > McGill University
> >
> >
> >
> > On Tuesday, 29 January, 2013 at 10:50 PM, Chen He wrote:
> >
> >> Hi Nan
> >>
> >> Namenode will stay in safemode before all blocks are replicated. During
> >> this time, the jobtracker can not see any tasktrackers. (MRv1).
> >>
> >> Chen
> >>
> >> On Tue, Jan 29, 2013 at 9:04 PM, Nan Zhu wrote:
> >> > Hi, all
> >> >
> >> > I'm wondering if HDFS is stopped, and some of the machines of the
> >> > cluster are moved,  some of the block replication are definitely lost
> >> > for moving machines
> >> >
> >> > when I restart the system, will the namenode recalculate the data
> >> > distribution?
> >> >
> >> > Best,
> >> >
> >> > --
> >> > Nan Zhu
> >> > School of Computer Science,
> >> > McGill University
> >> >
> >> >
> >>
> >
> >
>


Re: Tricks to upgrading Sequence Files?

2013-01-30 Thread Terry Healy
Avro's versioning capability might help, if it could replace
SequenceFile in your workflow.

Just a thought.

-Terry

On 1/29/13 9:17 PM, David Parks wrote:
> I'll consider a patch to the SequenceFile, if we could manually override the
> sequence file input Key and Value that's read from the sequence file headers
> we'd have a clean solution.
>
> I don't like versioning my Model object because it's used by 10's of other
> classes and I don't want to risk less maintained classes continuing to use
> an old version.
>
> For the time being I just used 2 jobs. First I renamed the old Model Object
> to the original name, read it in, upgraded it, and wrote the new version
> with a different class name.
>
> Then I renamed the classes again so the new model object used the original
> name and read in the altered name and cloned it into the original name.
>
> All in all an hours work only, but having a cleaner process would be better.
> I'll add the request to JIRA at a minimum.
>
> Dave
>
>
> -Original Message-
> From: Harsh J [mailto:ha...@cloudera.com] 
> Sent: Wednesday, January 30, 2013 2:32 AM
> To: 
> Subject: Re: Tricks to upgrading Sequence Files?
>
> This is a pretty interesting question, but unfortunately there isn't an
> inbuilt way in SequenceFiles itself to handle this. However, your key/value
> classes can be made to handle versioning perhaps - detecting if what they've
> read is of an older time and decoding it appropriately (while handling newer
> encoding separately, in the normal fashion).
> This would be much better than going down the classloader hack paths I
> think?
>
> On Tue, Jan 29, 2013 at 1:11 PM, David Parks  wrote:
>> Anyone have any good tricks for upgrading a sequence file.
>>
>>
>>
>> We maintain a sequence file like a flat file DB and the primary object 
>> in there changed in recent development.
>>
>>
>>
>> It's trivial to write a job to read in the sequence file, update the 
>> object, and write it back out in the new format.
>>
>>
>>
>> But since sequence files read and write the key/value class I would 
>> either need to rename the model object with a version number, or 
>> change the header of each sequence file.
>>
>>
>>
>> Just wondering if there are any nice tricks to this.
>
>
> --
> Harsh J
>
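
For reference, a minimal sketch of the version-aware Writable idea Harsh
describes above, assuming the serialized form already carries a leading
version byte (the class and field names are illustrative, not from the thread):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Hypothetical model object that embeds a format version so records written
// with the old layout can still be decoded by the new class.
public class ModelRecord implements Writable {
  private static final byte CURRENT_VERSION = 2;

  private String name;       // present since version 1
  private long updatedAtMs;  // added in version 2

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeByte(CURRENT_VERSION);
    Text.writeString(out, name);
    out.writeLong(updatedAtMs);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    byte version = in.readByte();
    name = Text.readString(in);
    // Older records predate the timestamp field; fall back to a default.
    updatedAtMs = (version >= 2) ? in.readLong() : 0L;
  }
}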



Re: what will happen when HDFS restarts but with some dead nodes

2013-01-30 Thread Jean-Marc Spaggiari
Hi Nan,

When the Namenode EXITS the safemode, you can assume that all
blocks ARE fully replicated. If the Namenode is still IN safemode, that
means that not all blocks are fully replicated.

JM

2013/1/29, Nan Zhu :
> So, we can assume that all blocks are fully replicated at the start point of
> HDFS?
>
> Best,
>
> --
> Nan Zhu
> School of Computer Science,
> McGill University
>
>
>
> On Tuesday, 29 January, 2013 at 10:50 PM, Chen He wrote:
>
>> Hi Nan
>>
>> Namenode will stay in safemode before all blocks are replicated. During
>> this time, the jobtracker can not see any tasktrackers. (MRv1).
>>
>> Chen
>>
>> On Tue, Jan 29, 2013 at 9:04 PM, Nan Zhu wrote:
>> > Hi, all
>> >
>> > I'm wondering if HDFS is stopped, and some of the machines of the
>> > cluster are moved,  some of the block replication are definitely lost
>> > for moving machines
>> >
>> > when I restart the system, will the namenode recalculate the data
>> > distribution?
>> >
>> > Best,
>> >
>> > --
>> > Nan Zhu
>> > School of Computer Science,
>> > McGill University
>> >
>> >
>>
>
>


RE: Oozie workflow error - renewing token issue

2013-01-30 Thread Corbett Martin
Thanks for the tip.

The sqoop command listed in the stdout log file is:
sqoop
import
--driver
org.apache.derby.jdbc.ClientDriver
--connect
jdbc:derby://test-server:1527/mondb
--username
monuser
--password
x
--table
MONITOR
--split-by
request_id
--target-dir
/mon/import
--append
--incremental
append
--check-column
request_id
--last-value
200

The following information is from the stderr and stdout log files.

/var/log/hadoop-0.20-mapreduce/userlogs/job_201301231648_0029/attempt_201301231648_0029_m_00_0
 # more stderr
No such sqoop tool: sqoop. See 'sqoop help'.
Intercepting System.exit(1)
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SqoopMain], 
exit code [1]


/var/log/hadoop-0.20-mapreduce/userlogs/job_201301231648_0029/attempt_201301231648_0029_m_00_0
 # more stdout
...
...
...
>>> Invoking Sqoop command line now >>>

1598 [main] WARN  org.apache.sqoop.tool.SqoopTool  - $SQOOP_CONF_DIR has not 
been set in the environment. Cannot check for additional configuration.
Intercepting System.exit(1)

<<< Invocation of Main class completed <<<

Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SqoopMain], 
exit code [1]

Oozie Launcher failed, finishing Hadoop job gracefully

Oozie Launcher ends


~Corbett Martin
Software Architect
AbsoluteAR Accounts Receivable Services - An NHIN Solution

-Original Message-
From: Harsh J [mailto:ha...@cloudera.com]
Sent: Tuesday, January 29, 2013 10:11 PM
To: 
Subject: Re: Oozie workflow error - renewing token issue

The job that Oozie launches for your action, which you are observing is 
failing, does its own logs (task logs) show any errors?

On Wed, Jan 30, 2013 at 4:59 AM, Corbett Martin  wrote:
> Oozie question
>
>
>
> I'm trying to run an Oozie workflow (sqoop action) from the Hue
> console and it fails every time.  No exception in the oozie log but I
> see this in the Job Tracker log file.
>
>
>
> Two primary issues seem to be
>
> 1.   Client mapred tries to renew a token with renewer specified as mr token
>
>
>
> And
>
>
>
> 2.   Cannot find class for token kind MAPREDUCE_DELEGATION_TOKEN
>
>
>
> Any ideas how to get past this?
>
>
>
> Full Stacktrace:
>
>
>
> 2013-01-29 17:11:28,860 INFO
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
> Creating password for identifier: owner=hdfs, renewer=mr token,
> realUser=oozie, issueDate=1359501088860, maxDate=136010560,
> sequenceNumber=75, masterKeyId=8
>
> 2013-01-29 17:11:28,871 INFO
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
> Creating password for identifier: owner=hdfs, renewer=mr token,
> realUser=oozie, issueDate=1359501088871, maxDate=136010571,
> sequenceNumber=76, masterKeyId=8
>
> 2013-01-29 17:11:29,202 INFO
> org.apache.hadoop.mapreduce.security.token.DelegationTokenRenewal:
> registering token for renewal for service =10.204.12.62:8021 and jobID
> =
> job_201301231648_0029
>
> 2013-01-29 17:11:29,211 INFO
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
> Token renewal requested for identifier: owner=hdfs, renewer=mr token,
> realUser=oozie, issueDate=1359501088871, maxDate=136010571,
> sequenceNumber=76, masterKeyId=8
>
> 2013-01-29 17:11:29,211 ERROR
> org.apache.hadoop.security.UserGroupInformation:
> PriviledgedActionException as:mapred (auth:SIMPLE)
> cause:org.apache.hadoop.security.AccessControlException: Client mapred
> tries to renew a token with renewer specified as mr token
>
> 2013-01-29 17:11:29,211 WARN org.apache.hadoop.security.token.Token:
> Cannot find class for token kind MAPREDUCE_DELEGATION_TOKEN
>
> 2013-01-29 17:11:29,211 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 9 on 8021, call renewDelegationToken(Kind:
> MAPREDUCE_DELEGATION_TOKEN, Service: 10.204.12.62:8021, Ident: 00 04
> 68 64
> 66 73 08 6d 72 20 74 6f 6b 65 6e 05 6f 6f 7a 69 65 8a 01 3c 88 94 58
> 67 8a
> 01 3c ac a0 dc 67 4c 08), rpc version=2, client version=28,
> methodsFingerPrint=1830206421 from 10.204.12.62:9706: error:
> org.apache.hadoop.security.AccessControlException: Client mapred tries
> to renew a token with renewer specified as mr token
>
> org.apache.hadoop.security.AccessControlException: Client mapred tries
> to renew a token with renewer specified as mr token
>
>   at
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSec
> retManager.renewToken(AbstractDelegationTokenSecretManager.java:274)
>
>   at
> org.apache.hadoop.mapred.Job

negative map input bytes counter?

2013-01-30 Thread Jim Donofrio
I have seen the map input bytes counter go negative temporarily on 
hadoop 1.x at the beginning of a job. It then corrects itself later in 
the job and seems to be accurate. Any ideas?


http://terrier.org/docs/v2.2.1/hadoop_indexing.html

I also saw this behavior in a job output listed at the above site:

INFO JobClient - Map output bytes=7724117
INFO JobClient - Map input bytes=-4167449
INFO JobClient - Combine input records=0


Re: Maximum Storage size in a Single datanode

2013-01-30 Thread Michel Segel
Can you say Centos?
:-)

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jan 30, 2013, at 4:21 AM, Jean-Marc Spaggiari  
wrote:

> Hi,
> 
> Also, think about the memory you will need in your DataNode to serve
> all this data... I'm not sure there is any server which can take that
> today. You need a certain amount of memory per block in the DN. With
> all this data, you will have S many blocks...
> 
> Regarding RH vs Ubuntu, I think Ubuntu is more an end user
> distribution than a server one. And I found RH a bit "not enought
> free". I have installed Debian on all my servers.
> 
> JM
> 
> 2013/1/30, Vijay Thakorlal :
>> Jeba,
>> 
>> 
>> 
>> I'm not aware of any hadoop limitations in this respect (others may be able
>> to comment on this); since blocks are just files on the OS, the datanode
>> will create subdirectories to store blocks to avoid problems with large
>> numbers of files in a single directory. So I would think the limitations
>> are
>> primarily around the type of file system you select, for ext3 it
>> theoretically supports up to 16TB (http://en.wikipedia.org/wiki/Ext3) and
>> for ext4 up to 1EB (http://en.wikipedia.org/wiki/Ext4). Although you're
>> probably already planning on deploying 64-bit servers, I believe for large
>> FS on ext4 you'd be better off with a 64-bit server.
>> 
>> 
>> 
>> As far as OS is concerned anecdotally (based on blogs, hadoop mailing lists
>> etc) I believe there are more production deployments using RHEL and/or
>> CentOS than Ubuntu.
>> 
>> 
>> 
>> It's probably not practical to have nodes with 1PB of data for the reasons
>> that others have mentioned and due to the replication traffic that will be
>> generated if the node dies. Not to mention fsck times with large file
>> systems.
>> 
>> 
>> 
>> Vijay
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> From: jeba earnest [mailto:jebaearn...@yahoo.com]
>> Sent: 30 January 2013 10:40
>> To: user@hadoop.apache.org
>> Subject: Re: Maximum Storage size in a Single datanode
>> 
>> 
>> 
>> 
>> 
>> I want to use either UBUNTU or REDHAT .
>> 
>> I just want to know how much storage space we can allocate in a single data
>> node.
>> 
>> 
>> 
>> Is there any limitations in hadoop for storage in single node?
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> Regards,
>> 
>> Jeba
>> 
>>  _
>> 
>> From: "Pamecha, Abhishek" 
>> To: "user@hadoop.apache.org" ; jeba earnest
>> 
>> Sent: Wednesday, 30 January 2013 2:45 PM
>> Subject: Re: Maximum Storage size in a Single datanode
>> 
>> 
>> 
>> What would be the reason you would do that?
>> 
>> 
>> 
>> You would want to leverage distributed dataset for higher availability and
>> better response times.
>> 
>> 
>> 
>> The maximum storage depends completely on the disks  capacity of your nodes
>> and what your OS supports. Typically I have heard of about 1-2 TB/node to
>> start with, but I may be wrong.
>> 
>> -abhishek
>> 
>> 
>> 
>> 
>> 
>> From: jeba earnest 
>> Reply-To: "user@hadoop.apache.org" , jeba earnest
>> 
>> Date: Wednesday, January 30, 2013 1:38 PM
>> To: "user@hadoop.apache.org" 
>> Subject: Maximum Storage size in a Single datanode
>> 
>> 
>> 
>> 
>> 
>> Hi,
>> 
>> 
>> 
>> Is it possible to keep 1 Petabyte in a single data node?
>> 
>> If not, How much is the maximum storage for a particular data node?
>> 
>> 
>> 
>> Regards,
>> M. Jeba
> 


Re: Maximum Storage size in a Single datanode

2013-01-30 Thread Fatih Haltas
I think he just wants to learn the approximate storage capacity he should
configure for each datanode; 1 PB is just a made-up amount of storage, I
guess. In my opinion he probably already knows that, Hadoop aside, this is
too much for any single server, and the 1 PB was just an off-hand, made-up
figure :)

Don't keep on at him that much :)

On Wednesday, 30 January 2013, Jean-Marc Spaggiari wrote:

> Hi,
>
> Also, think about the memory you will need in your DataNode to serve
> all this data... I'm not sure there is any server which can take that
> today. You need a certain amount of memory per block in the DN. With
> all this data, you will have S many blocks...
>
> Regarding RH vs Ubuntu, I think Ubuntu is more an end user
> distribution than a server one. And I found RH a bit "not enought
> free". I have installed Debian on all my servers.
>
> JM
>
> 2013/1/30, Vijay Thakorlal :
> > Jeba,
> >
> >
> >
> > I'm not aware of any hadoop limitations in this respect (others may be
> able
> > to comment on this); since blocks are just files on the OS, the datanode
> > will create subdirectories to store blocks to avoid problems with large
> > numbers of files in a single directory. So I would think the limitations
> > are
> > primarily around the type of file system you select, for ext3 it
> > theoretically supports up to 16TB (http://en.wikipedia.org/wiki/Ext3)
> and
> > for ext4 up to 1EB (http://en.wikipedia.org/wiki/Ext4). Although you're
> > probably already planning on deploying 64-bit servers, I believe for
> large
> > FS on ext4 you'd be better off with a 64-bit server.
> >
> >
> >
> > As far as OS is concerned anecdotally (based on blogs, hadoop mailing
> lists
> > etc) I believe there are more production deployments using RHEL and/or
> > CentOS than Ubuntu.
> >
> >
> >
> > It's probably not practical to have nodes with 1PB of data for the
> reasons
> > that others have mentioned and due to the replication traffic that will
> be
> > generated if the node dies. Not to mention fsck times with large file
> > systems.
> >
> >
> >
> > Vijay
> >
> >
> >
> >
> >
> >
> >
> > From: jeba earnest [mailto:jebaearn...@yahoo.com]
> > Sent: 30 January 2013 10:40
> > To: user@hadoop.apache.org
> > Subject: Re: Maximum Storage size in a Single datanode
> >
> >
> >
> >
> >
> > I want to use either UBUNTU or REDHAT .
> >
> > I just want to know how much storage space we can allocate in a single
> data
> > node.
> >
> >
> >
> > Is there any limitations in hadoop for storage in single node?
> >
> >
> >
> >
> >
> >
> >
> > Regards,
> >
> > Jeba
> >
> >   _
> >
> > From: "Pamecha, Abhishek" 
> > To: "user@hadoop.apache.org" ; jeba earnest
> > 
> > Sent: Wednesday, 30 January 2013 2:45 PM
> > Subject: Re: Maximum Storage size in a Single datanode
> >
> >
> >
> > What would be the reason you would do that?
> >
> >
> >
> > You would want to leverage distributed dataset for higher availability
> and
> > better response times.
> >
> >
> >
> > The maximum storage depends completely on the disks  capacity of your
> nodes
> > and what your OS supports. Typically I have heard of about 1-2 TB/node to
> > start with, but I may be wrong.
> >
> > -abhishek
> >
> >
> >
> >
> >
> > From: jeba earnest 
> > Reply-To: "user@hadoop.apache.org" , jeba
> earnest
> > 
> > Date: Wednesday, January 30, 2013 1:38 PM
> > To: "user@hadoop.apache.org" 
> > Subject: Maximum Storage size in a Single datanode
> >
> >
> >
> >
> >
> > Hi,
> >
> >
> >
> > Is it possible to keep 1 Petabyte in a single data node?
> >
> > If not, How much is the maximum storage for a particular data node?
> >
> >
> >
> > Regards,
> > M. Jeba
> >
> >
> >
> >
>


Re: Maximum Storage size in a Single datanode

2013-01-30 Thread Jean-Marc Spaggiari
Hi,

Also, think about the memory you will need in your DataNode to serve
all this data... I'm not sure there is any server which can take that
today. You need a certain amount of memory per block in the DN. With
all this data, you will have so many blocks...

Regarding RH vs Ubuntu, I think Ubuntu is more an end user
distribution than a server one. And I found RH a bit "not free
enough". I have installed Debian on all my servers.

JM
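
A rough back-of-the-envelope sketch of the per-block memory point above. The
128 MB block size and the ~150 bytes of bookkeeping per block are only
assumptions (the latter is a commonly quoted rule of thumb, mainly for the
NameNode's namespace objects), so treat the output as an order of magnitude:

public class BlockCountSketch {
    public static void main(String[] args) {
        double nodeCapacityBytes = 1e15;     // the hypothetical 1 PB on a single datanode
        double blockSizeBytes = 128L << 20;  // assume 128 MB HDFS blocks
        double bytesPerBlock = 150;          // rough rule-of-thumb bookkeeping per block

        double blocks = nodeCapacityBytes / blockSizeBytes;
        double heapBytes = blocks * bytesPerBlock;

        System.out.printf("Blocks from this one node: ~%.1f million%n", blocks / 1e6);
        System.out.printf("Per-block bookkeeping alone: ~%.1f GB of heap%n", heapBytes / 1e9);
    }
}

So a single 1 PB node would contribute on the order of 7-8 million blocks
before replication is even considered.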

2013/1/30, Vijay Thakorlal :
> Jeba,
>
>
>
> I'm not aware of any hadoop limitations in this respect (others may be able
> to comment on this); since blocks are just files on the OS, the datanode
> will create subdirectories to store blocks to avoid problems with large
> numbers of files in a single directory. So I would think the limitations
> are
> primarily around the type of file system you select, for ext3 it
> theoretically supports up to 16TB (http://en.wikipedia.org/wiki/Ext3) and
> for ext4 up to 1EB (http://en.wikipedia.org/wiki/Ext4). Although you're
> probably already planning on deploying 64-bit servers, I believe for large
> FS on ext4 you'd be better off with a 64-bit server.
>
>
>
> As far as OS is concerned anecdotally (based on blogs, hadoop mailing lists
> etc) I believe there are more production deployments using RHEL and/or
> CentOS than Ubuntu.
>
>
>
> It's probably not practical to have nodes with 1PB of data for the reasons
> that others have mentioned and due to the replication traffic that will be
> generated if the node dies. Not to mention fsck times with large file
> systems.
>
>
>
> Vijay
>
>
>
>
>
>
>
> From: jeba earnest [mailto:jebaearn...@yahoo.com]
> Sent: 30 January 2013 10:40
> To: user@hadoop.apache.org
> Subject: Re: Maximum Storage size in a Single datanode
>
>
>
>
>
> I want to use either UBUNTU or REDHAT .
>
> I just want to know how much storage space we can allocate in a single data
> node.
>
>
>
> Is there any limitations in hadoop for storage in single node?
>
>
>
>
>
>
>
> Regards,
>
> Jeba
>
>   _
>
> From: "Pamecha, Abhishek" 
> To: "user@hadoop.apache.org" ; jeba earnest
> 
> Sent: Wednesday, 30 January 2013 2:45 PM
> Subject: Re: Maximum Storage size in a Single datanode
>
>
>
> What would be the reason you would do that?
>
>
>
> You would want to leverage distributed dataset for higher availability and
> better response times.
>
>
>
> The maximum storage depends completely on the disks  capacity of your nodes
> and what your OS supports. Typically I have heard of about 1-2 TB/node to
> start with, but I may be wrong.
>
> -abhishek
>
>
>
>
>
> From: jeba earnest 
> Reply-To: "user@hadoop.apache.org" , jeba earnest
> 
> Date: Wednesday, January 30, 2013 1:38 PM
> To: "user@hadoop.apache.org" 
> Subject: Maximum Storage size in a Single datanode
>
>
>
>
>
> Hi,
>
>
>
> Is it possible to keep 1 Petabyte in a single data node?
>
> If not, How much is the maximum storage for a particular data node?
>
>
>
> Regards,
> M. Jeba
>
>
>
>


RE: Maximum Storage size in a Single datanode

2013-01-30 Thread Vijay Thakorlal
Jeba,

 

I'm not aware of any Hadoop limitations in this respect (others may be able
to comment on this); since blocks are just files on the OS, the datanode
creates subdirectories to store blocks, which avoids problems with large
numbers of files in a single directory. So I would think the limitations are
primarily around the type of file system you select: ext3 theoretically
supports up to 16 TB (http://en.wikipedia.org/wiki/Ext3) and ext4 up to
1 EB (http://en.wikipedia.org/wiki/Ext4). You're probably already planning
on deploying 64-bit servers, but for large ext4 filesystems in particular
you'd be better off with 64-bit hardware.

 

As far as OS is concerned anecdotally (based on blogs, hadoop mailing lists
etc) I believe there are more production deployments using RHEL and/or
CentOS than Ubuntu. 

 

It's probably not practical to have nodes with 1PB of data for the reasons
that others have mentioned and due to the replication traffic that will be
generated if the node dies. Not to mention fsck times with large file
systems.

 

Vijay
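
As a minimal, hypothetical illustration of the "blocks are just files" point:
a datanode's raw capacity is simply the sum of its configured data
directories, each bounded by the local filesystem it sits on rather than by
HDFS itself. The mount points below are made up; on Hadoop 1.x they would
correspond to the comma-separated entries of dfs.data.dir:

import java.io.File;

public class DataNodeCapacitySketch {
    public static void main(String[] args) {
        // Hypothetical mount points, one per physical disk.
        String[] dataDirs = {"/data/1/dfs/dn", "/data/2/dfs/dn", "/data/3/dfs/dn"};

        long totalBytes = 0;
        for (String dir : dataDirs) {
            // Each directory lives on an ordinary local filesystem (ext3, ext4, ...),
            // so its size is capped by that filesystem's limits, not by Hadoop.
            totalBytes += new File(dir).getTotalSpace(); // returns 0 if the path doesn't exist
        }
        System.out.printf("Aggregate raw capacity across data dirs: %.2f TB%n",
                totalBytes / 1e12);
    }
}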

 

 

 

From: jeba earnest [mailto:jebaearn...@yahoo.com] 
Sent: 30 January 2013 10:40
To: user@hadoop.apache.org
Subject: Re: Maximum Storage size in a Single datanode

 

 

I want to use either UBUNTU or REDHAT .

I just want to know how much storage space we can allocate in a single data
node.

 

Is there any limitations in hadoop for storage in single node?

 

 

 

Regards,

Jeba

  _  

From: "Pamecha, Abhishek" 
To: "user@hadoop.apache.org" ; jeba earnest
 
Sent: Wednesday, 30 January 2013 2:45 PM
Subject: Re: Maximum Storage size in a Single datanode

 

What would be the reason you would do that? 

 

You would want to leverage distributed dataset for higher availability and
better response times.

 

The maximum storage depends completely on the disks  capacity of your nodes
and what your OS supports. Typically I have heard of about 1-2 TB/node to
start with, but I may be wrong.

-abhishek

 

 

From: jeba earnest 
Reply-To: "user@hadoop.apache.org" , jeba earnest

Date: Wednesday, January 30, 2013 1:38 PM
To: "user@hadoop.apache.org" 
Subject: Maximum Storage size in a Single datanode

 

 

Hi,



Is it possible to keep 1 Petabyte in a single data node?

If not, How much is the maximum storage for a particular data node? 

 

Regards,
M. Jeba

 



Re: Maximum Storage size in a Single datanode

2013-01-30 Thread Mohammad Tariq
I completely agree with everyone in the thread. Perhaps you are not
concerned much about the processing part, but it is still not a good idea.
Remember the power of Hadoop lies in the principle of "divide and rule" and
you are trying to go against that.

On Wednesday, January 30, 2013, Chris Embree  wrote:
> You should probably think about this in a more cluster fashion.  A single
node with a PB of data is probably not a good allocation of CPU : Disk
ratio.  In addition, you need enough RAM on your NameNode to keep track of
all of your blocks.  A few nodes with a PB each would quickly drive up NN
RAM requirements.
> As others have mentioned, the local file system that HDFS sits on top of
may have limits.  We're going to use EXT4 which should handle that much,
but it's probably still not a good idea.
> If you're just thinking of storing lots of data, you might consider
GlusterFS instead.
> I highly recommend RedHat over Ubuntu.
> Hope that helps.
>
> On Wed, Jan 30, 2013 at 5:40 AM, jeba earnest 
wrote:
>
> I want to use either UBUNTU or REDHAT .
> I just want to know how much storage space we can allocate in a single
data node.
> Is there any limitations in hadoop for storage in single node?
>
>
>
> Regards,
> Jeba
> 
> From: "Pamecha, Abhishek" 
> To: "user@hadoop.apache.org" ; jeba earnest <
jebaearn...@yahoo.com>
> Sent: Wednesday, 30 January 2013 2:45 PM
> Subject: Re: Maximum Storage size in a Single datanode
>
> What would be the reason you would do that?
> You would want to leverage distributed dataset for higher availability
and better response times.
> The maximum storage depends completely on the disks  capacity of your
nodes and what your OS supports. Typically I have heard of about 1-2
TB/node to start with, but I may be wrong.
> -abhishek
>
> From: jeba earnest 
> Reply-To: "user@hadoop.apache.org" , jeba earnest

> Date: Wednesday, January 30, 2013 1:38 PM
> To: "user@hadoop.apache.org" 
> Subject: Maximum Storage size in a Single datanode
>
>
> Hi,
>
>
> Is it possible to keep 1 Petabyte in a single data node?
> If not, How much is the maximum storage for a particular data node?
>
> Regards,
> M. Jeba
>

-- 
Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


Re: Maximum Storage size in a Single datanode

2013-01-30 Thread Chris Embree
You should probably think about this in a more cluster fashion.  A single
node with a PB of data is probably not a good allocation of CPU : Disk
ratio.  In addition, you need enough RAM on your NameNode to keep track of
all of your blocks.  A few nodes with a PB each would quickly drive up NN
RAM requirements.

As others have mentioned, the local file system that HDFS sits on top of
may have limits.  We're going to use EXT4 which should handle that much,
but it's probably still not a good idea.

If you're just thinking of storing lots of data, you might consider
GlusterFS instead.

I highly recommend RedHat over Ubuntu.

Hope that helps.

On Wed, Jan 30, 2013 at 5:40 AM, jeba earnest  wrote:

>
> I want to use either UBUNTU or REDHAT .
> I just want to know how much storage space we can allocate in a single
> data node.
>
> Is there any limitations in hadoop for storage in single node?
>
>
>
> Regards,
> Jeba
>   --
> *From:* "Pamecha, Abhishek" 
> *To:* "user@hadoop.apache.org" ; jeba earnest <
> jebaearn...@yahoo.com>
> *Sent:* Wednesday, 30 January 2013 2:45 PM
> *Subject:* Re: Maximum Storage size in a Single datanode
>
>  What would be the reason you would do that?
>
>  You would want to leverage distributed dataset for higher availability
> and better response times.
>
>  The maximum storage depends completely on the disks  capacity of your
> nodes and what your OS supports. Typically I have heard of about 1-2
> TB/node to start with, but I may be wrong.
> -abhishek
>
>
>   From: jeba earnest 
> Reply-To: "user@hadoop.apache.org" , jeba earnest
> 
> Date: Wednesday, January 30, 2013 1:38 PM
> To: "user@hadoop.apache.org" 
> Subject: Maximum Storage size in a Single datanode
>
>
>  Hi,
>
>
>  Is it possible to keep 1 Petabyte in a single data node?
>  If not, How much is the maximum storage for a particular data node?
>
> Regards,
> M. Jeba
>
>
>


Re: Maximum Storage size in a Single datanode

2013-01-30 Thread jeba earnest


I want to use either UBUNTU or REDHAT.
I just want to know how much storage space we can allocate in a single data
node.

Are there any limitations in Hadoop on storage in a single node?



 
Regards,

Jeba



 From: "Pamecha, Abhishek" 
To: "user@hadoop.apache.org" ; jeba earnest 
 
Sent: Wednesday, 30 January 2013 2:45 PM
Subject: Re: Maximum Storage size in a Single datanode
 

What would be the reason you would do that? 

You would want to leverage distributed dataset for higher availability and 
better response times.

The maximum storage depends completely on the disks  capacity of your nodes and 
what your OS supports. Typically I have heard of about 1-2 TB/node to start 
with, but I may be wrong.
-abhishek

From: jeba earnest 
Reply-To: "user@hadoop.apache.org" , jeba earnest 

Date: Wednesday, January 30, 2013 1:38 PM
To: "user@hadoop.apache.org" 
Subject: Maximum Storage size in a Single datanode




Hi,



Is it possible to keep 1 Petabyte in a single data node?

If not, How much is the maximum storage for a particular data node? 
 
Regards,
M. Jeba

Re: Maximum Storage size in a Single datanode

2013-01-30 Thread Pamecha, Abhishek
What would be the reason you would do that?

You would want to leverage a distributed dataset for higher availability and
better response times.

The maximum storage depends completely on the disk capacity of your nodes and
what your OS supports. Typically I have heard of about 1-2 TB/node to start
with, but I may be wrong.
-abhishek


From: jeba earnest 
Reply-To: "user@hadoop.apache.org" , jeba earnest 
Date: Wednesday, January 30, 2013 1:38 PM
To: "user@hadoop.apache.org" 
Subject: Maximum Storage size in a Single datanode


Hi,


Is it possible to keep 1 Petabyte in a single data node?
If not, How much is the maximum storage for a particular data node?

Regards,
M. Jeba


RE: Maximum Storage size in a Single datanode

2013-01-30 Thread Vijay Thakorlal
Hi Jeba,

 

There are other considerations too. For example, if a single node holds 1 PB
of data and it were to die, this would cause a significant amount of
traffic as the NameNode arranges for new replicas to be created.

 

Vijay
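
To put a very rough number on that re-replication traffic, here is a sketch
where both figures are assumptions chosen only to show the order of
magnitude, not measurements:

public class ReReplicationSketch {
    public static void main(String[] args) {
        double lostBytes = 1e15;            // assume the failed node held 1 PB of block data
        double spareBandwidthBps = 1.25e9;  // assume ~10 Gb/s of spare cluster bandwidth

        double seconds = lostBytes / spareBandwidthBps;
        System.out.printf("Time to restore full replication: ~%.1f days%n",
                seconds / 86400.0);
    }
}

Even with generous bandwidth assumptions, losing a 1 PB node would keep the
cluster busy re-replicating for days.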

 

From: Bertrand Dechoux [mailto:decho...@gmail.com] 
Sent: 30 January 2013 09:14
To: user@hadoop.apache.org; jeba earnest
Subject: Re: Maximum Storage size in a Single datanode

 

I would say the hard limit is due to the OS local file system (and your
budget).

So the short answer for ext3: it doesn't seem so.
http://en.wikipedia.org/wiki/Ext3

And I am not sure the answer is the most interesting. Even if you could put
1 Peta on one node, what is usually interesting is the ratio
storage/compute.

Bertrand

On Wed, Jan 30, 2013 at 9:08 AM, jeba earnest  wrote:

 

Hi,



Is it possible to keep 1 Petabyte in a single data node?

If not, How much is the maximum storage for a particular data node?

 

Regards,
M. Jeba

 



Re: Maximum Storage size in a Single datanode

2013-01-30 Thread Bertrand Dechoux
I would say the hard limit is due to the OS local file system (and your
budget).

So the short answer for ext3: it doesn't seem so.
http://en.wikipedia.org/wiki/Ext3

And I am not sure the answer is the most interesting. Even if you could put
1 Peta on one node, what is usually interesting is the ratio
storage/compute.

Bertrand

On Wed, Jan 30, 2013 at 9:08 AM, jeba earnest  wrote:

>
> Hi,
>
>
> Is it possible to keep 1 Petabyte in a single data node?
> If not, How much is the maximum storage for a particular data node?
>
> Regards,
> M. Jeba
>


Maximum Storage size in a Single datanode

2013-01-30 Thread jeba earnest


Hi,


Is it possible to keep 1 Petabyte in a single data node?
If not, how much is the maximum storage for a particular data node?

Regards,
M. Jeba