RE: Limitation of key-value pairs for a particular key.

2013-01-18 Thread Utkarsh Gupta
You are right.
Actually, we were expecting the values to be sorted.
We tried to reproduce the problem with this simple code:
private final IntWritable one = new IntWritable(1);
private Text word = new Text();

@Override
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    int N = 3;
    for (int i = 0; i < N; i++) {
        word.set(i + "");
        System.out.println(i);
        context.write(one, word);
    }
}
For smaller N the numbers were in order, but for N > 300 the order was not maintained.

From: Harsh J [mailto:ha...@cloudera.com]
Sent: Thursday, January 17, 2013 1:57 AM
To: mapreduce-user
Subject: RE: Limitation of key-value pairs for a particular key.


We don't sort values (only keys) nor apply any manual limits in MR. Can you 
post a reproducible test case to support your suspicion?
On Jan 16, 2013 4:34 PM, Utkarsh Gupta utkarsh_gu...@infosys.com wrote:
Hi,
Thanks for the response. There were some issues with my code. I have checked 
that in detail.
All the values from the map are present in the reducer, but not in sorted order. This 
happens if the number of values for a key is too large.

Thanks
Utkarsh

From: Vinod Kumar Vavilapalli [mailto:vino...@hortonworks.com]
Sent: Thursday, January 10, 2013 11:00 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Limitation of key-value pairs for a particular key.

There isn't any limit like that. Can you reproduce this consistently? If so, 
please file a ticket.

It will definitely help if you can provide a test case which can reproduce this 
issue.

Thanks,
+Vinod

On Thu, Jan 10, 2013 at 12:41 AM, Utkarsh Gupta 
utkarsh_gu...@infosys.com wrote:
Hi,

I am using Apache Hadoop 1.0.4 on a 10-node cluster of commodity machines running 
Ubuntu 12.04 Server edition. I am having an issue with my MapReduce code. While 
debugging I found that the reducer can take 262145 values for a particular key. 
If there are more values, they seem to be corrupted. I checked the values while 
emitting from the map and again checked in the reducer.
I am wondering whether there is any such limitation in Hadoop or whether it is a 
configuration problem.


Thanks and Regards
Utkarsh Gupta







--
+Vinod
Hortonworks Inc.
http://hortonworks.com/


RE: Limitation of key-value pairs for a particular key.

2013-01-18 Thread Sven Groot
Hi,

 

I think I know what's going on here. It has to do with how many spills the
map task performs.

 

You are emitting the numbers in order, so if there is only one spill, they
stay in order. For a larger number of records, the map task will create more
than one spill, which must be merged. During the merge, the original order
is not preserved.

 

If you want the original order to be preserved, you must set io.sort.mb
and/or io.sort.record.percent such that the map task requires only a single
spill.
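
For illustration, a rough sketch of how those could be set from the job driver
(property names as used in Hadoop 1.x; the 400 MB buffer and 0.2 record fraction
are placeholders, not tuned recommendations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch only: enlarge the map-side sort buffer so all map output fits in a
// single spill. Values here are illustrative.
Configuration conf = new Configuration();
conf.setInt("io.sort.mb", 400);                 // map output buffer size, in MB
conf.setFloat("io.sort.record.percent", 0.2f);  // fraction of buffer for record metadata
Job job = new Job(conf, "single-spill-test");   // "single-spill-test" is a placeholder name

Note that a single spill still depends on the actual map output size, so this
only preserves the original order for bounded inputs.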

 

Cheers,

Sven 

 

From: Utkarsh Gupta [mailto:utkarsh_gu...@infosys.com] 
Sent: 18 January 2013 18:25
To: mapreduce-user@hadoop.apache.org
Subject: RE: Limitation of key-value pairs for a particular key.

 

You are right 

Actually we were expecting the values to be sorted.

We tried to reproduce the problem with this simple code:

private final IntWritable one = new IntWritable(1);
private Text word = new Text();

@Override
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    int N = 3;
    for (int i = 0; i < N; i++) {
        word.set(i + "");
        System.out.println(i);
        context.write(one, word);
    }
}

For smaller N the numbers were in order, but for N > 300 the order was not
maintained.

 

From: Harsh J [mailto:ha...@cloudera.com] 
Sent: Thursday, January 17, 2013 1:57 AM
To: mapreduce-user
Subject: RE: Limitation of key-value pairs for a particular key.

 

We don't sort values (only keys) nor apply any manual limits in MR. Can you
post a reproducible test case to support your suspicion?

On Jan 16, 2013 4:34 PM, Utkarsh Gupta utkarsh_gu...@infosys.com wrote:

Hi,

Thanks for the response. There were some issues with my code. I have checked
that in detail. 

All the values from the map are present in the reducer, but not in sorted order. This
happens if the number of values for a key is too large. 

 

Thanks

Utkarsh

 

From: Vinod Kumar Vavilapalli [mailto:vino...@hortonworks.com]
Sent: Thursday, January 10, 2013 11:00 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Limitation of key-value pairs for a particular key.

 

There isn't any limit like that. Can you reproduce this consistently? If so,
please file a ticket.

It will definitely help if you can provide a test case which can reproduce
this issue.

Thanks,
+Vinod

 

On Thu, Jan 10, 2013 at 12:41 AM, Utkarsh Gupta utkarsh_gu...@infosys.com wrote:

Hi,

 

I am using Apache Hadoop 1.0.4 on a 10-node cluster of commodity machines
running Ubuntu 12.04 Server edition. I am having an issue with my MapReduce
code. While debugging I found that the reducer can take 262145 values for a
particular key. If there are more values, they seem to be corrupted. I
checked the values while emitting from the map and again checked in the reducer.

I am wondering whether there is any such limitation in Hadoop or whether it is a
configuration problem.

 

 

Thanks and Regards

Utkarsh Gupta

 

 






-- 
+Vinod
Hortonworks Inc.
http://hortonworks.com/ 



RE: On a lighter note

2013-01-18 Thread Fabio Pitzolu
Awesome Tariq!!

You made my day!! :-D

 

Fabio Pitzolu

www.gr-ci.com

 

From: Anand Sharma [mailto:anand2sha...@gmail.com] 
Sent: Friday, 18 January 2013 04:10
To: user@hadoop.apache.org
Subject: Re: On a lighter note

 

Awesome one Tariq!!

 

On Fri, Jan 18, 2013 at 6:39 AM, Mohammad Tariq donta...@gmail.com wrote:

You are right Michael, as always :)




Warm Regards,

Tariq

https://mtariq.jux.com/

cloudfront.blogspot.com

 

On Fri, Jan 18, 2013 at 6:33 AM, Michael Segel michael_se...@hotmail.com wrote:

I'm thinking 'Downfall'

 

But I could be wrong.

 

On Jan 17, 2013, at 6:56 PM, Yongzhi Wang wang.yongzhi2...@gmail.com wrote:





Who can tell me what is the name of the original film? Thanks!

Yongzhi

 

On Thu, Jan 17, 2013 at 3:05 PM, Mohammad Tariq donta...@gmail.com wrote:

I am sure you will suffer from severe stomach ache after watching this :)

http://www.youtube.com/watch?v=hEqQMLSXQlY




Warm Regards,

Tariq

https://mtariq.jux.com/

cloudfront.blogspot.com

 

 

 

 



Re: building a department GPU cluster

2013-01-18 Thread Roberto Nunnari

Thiago Vieira wrote:
I've seen some academic research in this direction, with good results. 
Some computations can be expressed with GPGPU, but it is still a restricted 
number of cases. If it is not easy to solve problems using MapReduce, solving 
some problems with SIMD is harder.


Ok.. Thank you all for your time.. I'll keep searching.
Best regards.
Robi




--
Thiago Vieira


On Thu, Jan 17, 2013 at 9:24 PM, Russell Jurney russell.jur...@gmail.com wrote:


Hadoop streaming can do this, and there's been some discussion in
the past, but it's not a core use case. Check the list archives.

Russell Jurney http://datasyndrome.com

    On Jan 17, 2013, at 9:25 AM, Jeremy Lewi jer...@lewi.us wrote:


I don't think running hadoop on a GPU cluster is a common use
case; the types of workloads for a hadoop vs. gpu cluster are very
different although a quick google search did turn up some. So this
is probably not the best mailing list for your question.

J


    On Thu, Jan 17, 2013 at 5:18 AM, Roberto Nunnari roberto.nunn...@supsi.ch wrote:

Roberto Nunnari wrote:

Hi all.

I'm writing to you to ask for advice or a hint to the
right direction.

In our department, more and more researchers ask us (IT
administrators) to assemble (or to buy) GPGPU powered
workstations to do parallel computing.

As I already manage a small CPU cluster (resources managed
using SGE), with my boss we talked about building a new
GPU cluster. The problem is that I have no experience at
all with GPU clusters.

Apart from the already running GPU workstations, we
already have some new HW that looks promising to me as a
starting point for a GPU cluster.

- 1x Dell PowerEdge R720
- 1x Dell PowerEdge C410x
- 1x NVIDIA M2090 PCIe x16
- 1x NVIDIA iPASS Cable Kit
(Dell forgot to include the iPASS adapter for the R720!! :-D)

I'd be grateful if you could kindly give me some advice
and/or hint to the right direction.

In particular I'm interested on your opinion on:
1) is the above HW suitable for a small (2 to 4/6 GPUs)
GPU cluster?
2) is Apache Hadoop suitable (or what could we use?) as a
queuing and resource management system? We would like the
cluster to be usable by many users at once in a way that
no user has to worry about resources, just like we do on
the CPU cluster with SGE.
3) What distribution of linux would be more appropriate?
4) necessary stack of sw? (cuda, hadoop, other?)

Thank you very much for your valuable insight!

Best regards.
Robi


Anybody on this, please?
Robi




Re: On a lighter note

2013-01-18 Thread Josh Long
LOL

Thanks,
Josh Long
Spring Developer Advocate
SpringSource, a Division of VMware
http://www.joshlong.com || joshlong.com || http://twitter.com/starbuxman


On Fri, Jan 18, 2013 at 5:06 PM, Mohammad Tariq donta...@gmail.com wrote:
 lol :)

 Warm Regards,
 Tariq
 https://mtariq.jux.com/
 cloudfront.blogspot.com


 On Fri, Jan 18, 2013 at 1:54 PM, Fabio Pitzolu fabio.pitz...@gr-ci.com
 wrote:

 Awesome Tariq!!

 You made my day!! :-D



 Fabio Pitzolu

 www.gr-ci.com



 From: Anand Sharma [mailto:anand2sha...@gmail.com]
 Sent: Friday, 18 January 2013 04:10
 To: user@hadoop.apache.org
 Subject: Re: On a lighter note



 Awesome one Tariq!!



 On Fri, Jan 18, 2013 at 6:39 AM, Mohammad Tariq donta...@gmail.com
 wrote:

 You are right Michael, as always :)


 Warm Regards,

 Tariq

 https://mtariq.jux.com/

 cloudfront.blogspot.com



 On Fri, Jan 18, 2013 at 6:33 AM, Michael Segel michael_se...@hotmail.com
 wrote:

 I'm thinking 'Downfall'



 But I could be wrong.



 On Jan 17, 2013, at 6:56 PM, Yongzhi Wang wang.yongzhi2...@gmail.com
 wrote:



 Who can tell me what is the name of the original film? Thanks!

 Yongzhi



 On Thu, Jan 17, 2013 at 3:05 PM, Mohammad Tariq donta...@gmail.com
 wrote:

 I am sure you will suffer from severe stomach ache after watching this :)

 http://www.youtube.com/watch?v=hEqQMLSXQlY


 Warm Regards,

 Tariq

 https://mtariq.jux.com/

 cloudfront.blogspot.com












Re: On a lighter note

2013-01-18 Thread iwannaplay games
Awesome
:)



Regards
Prabhjot


Re: OutofMemoryError when running an YARN application with 25 containers

2013-01-18 Thread Krishna Kishore Bonagiri
Hi Anil,

  Thanks for the reply. I was searching Google to find out how to increase the heap
size, and found that the option -Xmx1500m has to be passed as a command line
argument to java. Is that the way you are suggesting? If so, how can I
pass it for the Application Master, because it is the Client program that actually
launches the AM...
  Or is there any other way of doing it?

Thanks,
Kishore


On Tue, Jan 15, 2013 at 11:48 AM, anil gupta anilgupt...@gmail.com wrote:

 The following log tells you the exact error:

 JVMDUMP013I Processed dump event systhrow, detail
 java/lang/OutOfMemoryError.

 Exception in thread "Thread-7" java.lang.OutOfMemoryError
 at ApplicationMaster.readMessage(ApplicationMaster.java:241)
 at ApplicationMaster$SectionLeaderRunnable.run(ApplicationMaster.java:825)
 at java.lang.Thread.run(Thread.java:736)

 You might need to increase the HeapSize of ApplicationMaster.
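
 The heap size is normally chosen by whoever builds the AM's launch command,
 i.e. the Client. A minimal sketch, assuming a hand-rolled client along the
 lines of the distributed-shell example (the ApplicationMaster class name, log
 paths and the 1500m value are placeholders, not fixed names):

 import java.util.Collections;
 import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
 import org.apache.hadoop.yarn.util.Records;

 // Sketch only: the -Xmx flag is just another token in the AM launch command
 // that the client places into the AM's ContainerLaunchContext.
 ContainerLaunchContext amContainer =
     Records.newRecord(ContainerLaunchContext.class);
 amContainer.setCommands(Collections.singletonList(
     "${JAVA_HOME}/bin/java -Xmx1500m ApplicationMaster"
         + " 1>/tmp/AppMaster.stdout 2>/tmp/AppMaster.stderr"));
 // ... the rest of the ApplicationSubmissionContext is set up as before.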

 HTH,
 Anil Gupta


 On Mon, Jan 14, 2013 at 4:35 AM, Krishna Kishore Bonagiri 
 write2kish...@gmail.com wrote:

 Hi,

   I am getting the following error in ApplicationMaster.stderr when
 running an application with around 25 container launches. How can I resolve
 this issue?

 JVMDUMP006I Processing dump event systhrow, detail
 java/lang/OutOfMemoryError - please wait.
 JVMDUMP032I JVM requested Heap dump using
 '/tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/heapdump.20130114.044646.16631.0001.phd'
 in response to an event
 JVMDUMP010I Heap dump written to
 /tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/heapdump.20130114.044646.16631.0001.phd
 JVMDUMP032I JVM requested Java dump using
 '/tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/javacore.20130114.044646.16631.0002.txt'
 in response to an event
 JVMDUMP010I Java dump written to
 /tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/javacore.20130114.044646.16631.0002.txt
 JVMDUMP032I JVM requested Snap dump using
 '/tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/Snap.20130114.044646.16631.0003.trc'
 in response to an event
 JVMDUMP010I Snap dump written to
 /tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/Snap.20130114.044646.16631.0003.trc
 JVMDUMP013I Processed dump event systhrow, detail
 java/lang/OutOfMemoryError.
 Exception in thread Thread-7 java.lang.OutOfMemoryError
 at ApplicationMaster.readMessage(ApplicationMaster.java:241)
 at
 ApplicationMaster$SectionLeaderRunnable.run(ApplicationMaster.java:825)
 at java.lang.Thread.run(Thread.java:736)


 Thanks,
 Kishore




 --
 Thanks & Regards,
 Anil Gupta


Re: OutofMemoryError when running an YARN application with 25 containers

2013-01-18 Thread Krishna Kishore Bonagiri
Hi Arun,

  Thanks for the reply. I am not running a MapReduce application; I am running
some other distributed application. And I am using 2.0.0-alpha. Also, I have one
more query.

 I am seeing that, from the time the ApplicationMaster is submitted by the Client to
the ASM part of the RM, it is taking around 7 seconds for the AM to come up. Is
there a way to improve that time?


Thanks,
Kishore


On Tue, Jan 15, 2013 at 5:43 PM, Arun C Murthy a...@hortonworks.com wrote:

 How many maps & reduces did your job have? Also, what release are you
 using? I'd recommend at least 2.0.2-alpha, though we should be able to
 release 2.0.3-alpha very soon.

 Arun

 On Jan 14, 2013, at 4:35 AM, Krishna Kishore Bonagiri wrote:

 Hi,

   I am getting the following error in ApplicationMaster.stderr when
 running an application with around 25 container launches. How can I resolve
 this issue?

 JVMDUMP006I Processing dump event systhrow, detail
 java/lang/OutOfMemoryError - please wait.
 JVMDUMP032I JVM requested Heap dump using
 '/tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/heapdump.20130114.044646.16631.0001.phd'
 in response to an event
 JVMDUMP010I Heap dump written to
 /tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/heapdump.20130114.044646.16631.0001.phd
 JVMDUMP032I JVM requested Java dump using
 '/tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/javacore.20130114.044646.16631.0002.txt'
 in response to an event
 JVMDUMP010I Java dump written to
 /tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/javacore.20130114.044646.16631.0002.txt
 JVMDUMP032I JVM requested Snap dump using
 '/tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/Snap.20130114.044646.16631.0003.trc'
 in response to an event
 JVMDUMP010I Snap dump written to
 /tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/Snap.20130114.044646.16631.0003.trc
 JVMDUMP013I Processed dump event systhrow, detail
 java/lang/OutOfMemoryError.
 Exception in thread Thread-7 java.lang.OutOfMemoryError
 at ApplicationMaster.readMessage(ApplicationMaster.java:241)
 at
 ApplicationMaster$SectionLeaderRunnable.run(ApplicationMaster.java:825)
 at java.lang.Thread.run(Thread.java:736)


 Thanks,
 Kishore


 --
 Arun C. Murthy
 Hortonworks Inc.
 http://hortonworks.com/





RE: On a lighter note

2013-01-18 Thread Viral Bajaria
LOL just amazing... I remember having a similar conversation with someone
who didn't understand the meaning of secondary namenode :-)

Viral
--
From: iwannaplay games
Sent: 1/18/2013 1:24 AM
To: user@hadoop.apache.org
Subject: Re: On a lighter note

Awesome
:)



Regards
Prabhjot


Re: how to restrict the concurrent running map tasks?

2013-01-18 Thread Harsh J
You will need to use an alternative scheduler for this.

Look at minMaps/maxMaps/etc. properties in FairScheduler at
http://hadoop.apache.org/docs/stable/fair_scheduler.html#Allocation+File+%28fair-scheduler.xml%29
Alternatively, look at resource-based scheduling in CapacityScheduler at
http://hadoop.apache.org/docs/stable/capacity_scheduler.html#Resource+based+scheduling

P.s. Do not use general@ list for user level queries. The right list is
user@hadoop.apache.org.
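
As a concrete illustration, a minimal fair-scheduler.xml allocation file that
caps a pool at 10 concurrently running map tasks could look like the sketch
below (the pool name "limited" is only an example, and the job must then be
submitted to that pool via the pool-name property described on the
FairScheduler page above):

<?xml version="1.0"?>
<allocations>
  <!-- Sketch only: cap concurrently running tasks for jobs in this pool. -->
  <pool name="limited">
    <maxMaps>10</maxMaps>
    <maxReduces>10</maxReduces>
  </pool>
</allocations>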


On Fri, Jan 18, 2013 at 3:52 PM, hwang joe.haiw...@gmail.com wrote:

 Hi all:

 My Hadoop version is 1.0.2. Now I want at most 10 map tasks running at the
 same time. I have found 2 parameters related to this question.

 a) mapred.job.map.capacity

 but in my hadoop version, this parameter seems abandoned.

 b) mapred.jobtracker.taskScheduler.maxRunningTasksPerJob (

 http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.collector/1.0.2/mapred-default.xml
 )

 I set this variable like below:

 Configuration conf = new Configuration();
 conf.set("date", date);
 conf.set("mapred.job.queue.name", "hadoop");
 conf.set("mapred.jobtracker.taskScheduler.maxRunningTasksPerJob", "10");

 DistributedCache.createSymlink(conf);
 Job job = new Job(conf, "ConstructApkDownload_" + date);
 ...

 The problem is that it doesn't work. There are still more than 50 maps
 running when the job starts.

 I'm not sure whether I set this parameter in the wrong way, or misunderstand
 it.

 After looking through the Hadoop documentation, I can't find another parameter
 to limit the number of concurrently running map tasks.

 Hope someone can help me. Thanks.




-- 
Harsh J


Re: On a lighter note

2013-01-18 Thread Mohammad Tariq
Folks quite often get confused by the name. But this one is just unbeatable
:)

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Fri, Jan 18, 2013 at 4:52 PM, Viral Bajaria viral.baja...@gmail.comwrote:

 LOL just amazing... I remember having a similar conversation with someone
 who didn't understand meaning of secondary namenode :-)

 Viral
 --
 From: iwannaplay games
 Sent: 1/18/2013 1:24 AM

 To: user@hadoop.apache.org
 Subject: Re: On a lighter note

 Awesome
 :)



 Regards
 Prabhjot




Re: Estimating disk space requirements

2013-01-18 Thread Jean-Marc Spaggiari
Hi Panshul,

If you have 20 GB with a replication factor set to 3, you have only
6.6GB available, not 11GB. You need to divide the total space by the
replication factor.

Also, if you store your JSon into HBase, you need to add the key size
to it. If your key is 4 bytes, or 1024 bytes, it makes a difference.

So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space to
store it. Without including the key size. Even with a replication
factor set to 5 you don't have the space.

Now, you can add some compression, but even with a lucky factor of 50%
you still don't have the space. You will need something like 90%
compression factor to be able to store this data in your cluster.

A 1 TB drive is now less than $100... So you might think about replacing
your 20 GB drives with something bigger.
To reply to your last question, for your data here you will need AT
LEAST 350GB of overall storage. But that's a bare minimum. Don't go under
500GB.

IMHO

JM

2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
 Hello,

 I was estimating how much disk space do I need for my cluster.

 I have 24 million JSON documents approx. 5kb each
 the JSON is to be stored into HBase with some identifying data in columns
 and I also want to store the Json for later retrieval based on the Id data
 as keys in Hbase.
 I have my HDFS replication set to 3
 each node has Hadoop and hbase and Ubuntu installed on it.. so approx 11 GB
 is available for use on my 20 GB node.

 I have no idea, if I have not enabled Hbase replication, is the HDFS
 replication enough to keep the data safe and redundant.
 How much total disk space I will need for the storage of the data.

 Please help me estimate this.

 Thank you so much.

 --
 Regards,
 Ouch Whisper
 010101010101



Re: Estimating disk space requirements

2013-01-18 Thread Mirko Kämpf
Hi,

some comments are inside your message ...


2013/1/18 Panshul Whisper ouchwhis...@gmail.com

 Hello,

 I was estimating how much disk space do I need for my cluster.

 I have 24 million JSON documents approx. 5kb each
 the Json is to be stored into HBASE with some identifying data in coloumns
 and I also want to store the Json for later retrieval based on the Id data
 as keys in Hbase.
 I have my HDFS replication set to 3
 each node has Hadoop and hbase and Ubuntu installed on it.. so approx 11
 GB is available for use on my 20 GB node.


11 GB is quite small  - or is there a typo?

The amount of raw data is about 115 GB:

  nr of items:      24 x 1.00E+006
  size of an item:  5 x 1.02E+003 Bytes
  total:            1.2288E+011 Bytes = 114.4409179688 GB  (without additional key and metadata)

Depending on the amount of overhead this could be about 200 GB, times 3 for
replication, i.e. 600 GB just for distributed storage.

And then you need some capacity to store intermediate processing data; 20%
to 30% of the processed data is recommended.

So you might prepare a capacity of 1 TB or even more if your dataset grows.
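
To make the arithmetic explicit, a throwaway sketch of the estimate (the
overhead and headroom factors are rough assumptions mirroring the numbers
above, not measurements):

public class CapacityEstimate {
    public static void main(String[] args) {
        long items = 24000000L;            // 24 million JSON documents
        long itemBytes = 5L * 1024;        // ~5 KB each
        double gib = 1024.0 * 1024 * 1024;

        double rawGB = items * itemBytes / gib;     // ~114 GB of raw JSON
        double withOverheadGB = rawGB * 1.75;       // assumed keys + HBase/HDFS overhead (~200 GB)
        double replicatedGB = withOverheadGB * 3;   // HDFS replication factor 3 (~600 GB)
        double plannedGB = replicatedGB * 1.25;     // assumed 20-30% headroom for intermediate data

        System.out.printf("raw=%.0f GB, replicated=%.0f GB, plan for=%.0f GB%n",
                rawGB, replicatedGB, plannedGB);
    }
}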





 I have no idea, if I have not enabled Hbase replication, is the HDFS
 replication enough to keep the data safe and redundant.


The replication on the HDFS level is sufficient for keeping the data safe,
no need to replicate the HBase tables separately.


  How much total disk space I will need for the storage of the data.




 Please help me estimate this.

 Thank you so much.

 --
 Regards,
 Ouch Whisper
 010101010101


Best wishes
Mirko


FW: HBase Master not getting started

2013-01-18 Thread Kumar, Deepak8
Hi,
Could you please guide me?

Regards,
Deepak

-Original Message-
From: Kumar, Deepak8 [CCC-OT_IT NE] 
Sent: Thursday, January 17, 2013 1:36 PM
To: 'cdh-u...@cloudera.org'
Cc: Kumar, Deepak8 [CCC-OT_IT NE]
Subject: HBase Master not getting started

Hi,
Something abnormal happened in my cluster. Actually the default location of 
the snapshot & dataDir for ZooKeeper is /var/lib/zookeeper in CDH4. The disk on 
which the /var location is configured became full and the cluster went down 
(ZooKeeper & HBase were in ERROR status). I have cleaned the /var location, but it 
seems the snapshot & dataDir location of ZooKeeper is not getting updated, and the 
HBase master is not able to connect to ZooKeeper.

Could you please guide me?

Regards,
Deepak


Re: On a lighter note

2013-01-18 Thread Fabio Pitzolu
Someone should make one about unsubscribing from this mailing list! :D


*Fabio Pitzolu*
Consultant - BI & Infrastructure

Mob. +39 3356033776
Telefono 02 87157239
Fax. 02 93664786

*Gruppo Consulenza Innovazione - http://www.gr-ci.com*


2013/1/18 Mohammad Tariq donta...@gmail.com

 Folks quite often get confused by the name. But this one is just
 unbeatable :)

 Warm Regards,
 Tariq
 https://mtariq.jux.com/
 cloudfront.blogspot.com


 On Fri, Jan 18, 2013 at 4:52 PM, Viral Bajaria viral.baja...@gmail.comwrote:

 LOL just amazing... I remember having a similar conversation with someone
 who didn't understand meaning of secondary namenode :-)

 Viral
 --
 From: iwannaplay games
 Sent: 1/18/2013 1:24 AM

 To: user@hadoop.apache.org
 Subject: Re: On a lighter note

 Awesome
 :)



 Regards
 Prabhjot





Re: Estimating disk space requirements

2013-01-18 Thread Panshul Whisper
Thank you for the replies,

So I take it that I should have at least 800 GB of total free space on
HDFS (combined free space of all the nodes connected to the cluster). So
I can connect 20 nodes having 40 GB of HDD each to my cluster. Will
this be enough for the storage?
Please confirm.

Thanking You,
Regards,
Panshul.


On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari 
jean-m...@spaggiari.org wrote:

 Hi Panshul,

 If you have 20 GB with a replication factor set to 3, you have only
 6.6GB available, not 11GB. You need to divide the total space by the
 replication factor.

 Also, if you store your JSon into HBase, you need to add the key size
 to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.

 So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space to
 store it. Without including the key size. Even with a replication
 factor set to 5 you don't have the space.

 Now, you can add some compression, but even with a lucky factor of 50%
 you still don't have the space. You will need something like 90%
 compression factor to be able to store this data in your cluster.

 A 1T drive is now less than $100... So you might think about replacing
 you 20 GB drives by something bigger.
 to reply to your last question, for your data here, you will need AT
 LEAST 350GB overall storage. But that's a bare minimum. Don't go under
 500GB.

 IMHO

 JM

 2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
  Hello,
 
  I was estimating how much disk space do I need for my cluster.
 
  I have 24 million JSON documents approx. 5kb each
  the Json is to be stored into HBASE with some identifying data in
 coloumns
  and I also want to store the Json for later retrieval based on the Id
 data
  as keys in Hbase.
  I have my HDFS replication set to 3
  each node has Hadoop and hbase and Ubuntu installed on it.. so approx 11
 GB
  is available for use on my 20 GB node.
 
  I have no idea, if I have not enabled Hbase replication, is the HDFS
  replication enough to keep the data safe and redundant.
  How much total disk space I will need for the storage of the data.
 
  Please help me estimate this.
 
  Thank you so much.
 
  --
  Regards,
  Ouch Whisper
  010101010101
 




-- 
Regards,
Ouch Whisper
010101010101


Re: Estimating disk space requirements

2013-01-18 Thread Jean-Marc Spaggiari
20 nodes with 40 GB will do the work.

After that you will have to consider performance based on your access
pattern. But that's another story.

JM

2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
 Thank you for the replies,

 So I take it that I should have atleast 800 GB on total free space on
 HDFS.. (combined free space of all the nodes connected to the cluster). So
 I can connect 20 nodes having 40 GB of hdd on each node to my cluster. Will
 this be enough for the storage?
 Please confirm.

 Thanking You,
 Regards,
 Panshul.


 On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari 
 jean-m...@spaggiari.org wrote:

 Hi Panshul,

 If you have 20 GB with a replication factor set to 3, you have only
 6.6GB available, not 11GB. You need to divide the total space by the
 replication factor.

 Also, if you store your JSon into HBase, you need to add the key size
 to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.

 So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space to
 store it. Without including the key size. Even with a replication
 factor set to 5 you don't have the space.

 Now, you can add some compression, but even with a lucky factor of 50%
 you still don't have the space. You will need something like 90%
 compression factor to be able to store this data in your cluster.

 A 1T drive is now less than $100... So you might think about replacing
 you 20 GB drives by something bigger.
 to reply to your last question, for your data here, you will need AT
 LEAST 350GB overall storage. But that's a bare minimum. Don't go under
 500GB.

 IMHO

 JM

 2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
  Hello,
 
  I was estimating how much disk space do I need for my cluster.
 
  I have 24 million JSON documents approx. 5kb each
  the Json is to be stored into HBASE with some identifying data in
 coloumns
  and I also want to store the Json for later retrieval based on the Id
 data
  as keys in Hbase.
  I have my HDFS replication set to 3
  each node has Hadoop and hbase and Ubuntu installed on it.. so approx
  11
 GB
  is available for use on my 20 GB node.
 
  I have no idea, if I have not enabled Hbase replication, is the HDFS
  replication enough to keep the data safe and redundant.
  How much total disk space I will need for the storage of the data.
 
  Please help me estimate this.
 
  Thank you so much.
 
  --
  Regards,
  Ouch Whisper
  010101010101
 




 --
 Regards,
 Ouch Whisper
 010101010101



Re: On a lighter note

2013-01-18 Thread shashwat shriparv
:)



∞
Shashwat Shriparv



On Fri, Jan 18, 2013 at 6:43 PM, Fabio Pitzolu fabio.pitz...@gr-ci.comwrote:

 Someone should made one about unsubscribing from this mailing list ! :D


 *Fabio Pitzolu*
 Consultant - BI  Infrastructure

 Mob. +39 3356033776
 Telefono 02 87157239
 Fax. 02 93664786

 *Gruppo Consulenza Innovazione - http://www.gr-ci.com*


 2013/1/18 Mohammad Tariq donta...@gmail.com

 Folks quite often get confused by the name. But this one is just
 unbeatable :)

 Warm Regards,
 Tariq
 https://mtariq.jux.com/
 cloudfront.blogspot.com


 On Fri, Jan 18, 2013 at 4:52 PM, Viral Bajaria 
 viral.baja...@gmail.comwrote:

 LOL just amazing... I remember having a similar conversation with
 someone who didn't understand meaning of secondary namenode :-)

 Viral
 --
 From: iwannaplay games
 Sent: 1/18/2013 1:24 AM

 To: user@hadoop.apache.org
 Subject: Re: On a lighter note

 Awesome
 :)



 Regards
 Prabhjot






Re: Estimating disk space requirements

2013-01-18 Thread Panshul Whisper
If we look at it with performance in mind,
is it better to have 20 Nodes with 40 GB HDD
or is it better to have 10 Nodes with 80 GB HDD?

they are connected on a gigabit LAN

Thnx


On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari 
jean-m...@spaggiari.org wrote:

 20 nodes with 40 GB will do the work.

 After that you will have to consider performances based on your access
 pattern. But that's another story.

 JM

 2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
  Thank you for the replies,
 
  So I take it that I should have atleast 800 GB on total free space on
  HDFS.. (combined free space of all the nodes connected to the cluster).
 So
  I can connect 20 nodes having 40 GB of hdd on each node to my cluster.
 Will
  this be enough for the storage?
  Please confirm.
 
  Thanking You,
  Regards,
  Panshul.
 
 
  On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari 
  jean-m...@spaggiari.org wrote:
 
  Hi Panshul,
 
  If you have 20 GB with a replication factor set to 3, you have only
  6.6GB available, not 11GB. You need to divide the total space by the
  replication factor.
 
  Also, if you store your JSon into HBase, you need to add the key size
  to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
 
  So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space to
  store it. Without including the key size. Even with a replication
  factor set to 5 you don't have the space.
 
  Now, you can add some compression, but even with a lucky factor of 50%
  you still don't have the space. You will need something like 90%
  compression factor to be able to store this data in your cluster.
 
  A 1T drive is now less than $100... So you might think about replacing
  you 20 GB drives by something bigger.
  to reply to your last question, for your data here, you will need AT
  LEAST 350GB overall storage. But that's a bare minimum. Don't go under
  500GB.
 
  IMHO
 
  JM
 
  2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
   Hello,
  
   I was estimating how much disk space do I need for my cluster.
  
   I have 24 million JSON documents approx. 5kb each
   the Json is to be stored into HBASE with some identifying data in
  coloumns
   and I also want to store the Json for later retrieval based on the Id
  data
   as keys in Hbase.
   I have my HDFS replication set to 3
   each node has Hadoop and hbase and Ubuntu installed on it.. so approx
   11
  GB
   is available for use on my 20 GB node.
  
   I have no idea, if I have not enabled Hbase replication, is the HDFS
   replication enough to keep the data safe and redundant.
   How much total disk space I will need for the storage of the data.
  
   Please help me estimate this.
  
   Thank you so much.
  
   --
   Regards,
   Ouch Whisper
   010101010101
  
 
 
 
 
  --
  Regards,
  Ouch Whisper
  010101010101
 




-- 
Regards,
Ouch Whisper
010101010101


Re: Query: Hadoop's threat to Informatica

2013-01-18 Thread Jeff Bean
Informatica's take on the question:

http://www.informatica.com/hadoop/

My take on the question:

Hadoop is definitely disruptive and there have been times where we've been
able to blow missed data pipeline SLAs out of the water using Hadoop where
tools like Informatica were not able to. But Informatica's points on metadata
management, mixed workloads, and governance are somewhat well taken. It's
not that this stuff isn't doable with Hadoop, but the maturity of
enterprise tools like Informatica is a little farther along.

Jeff

On Thu, Jan 17, 2013 at 10:51 PM, Mohammad Tariq donta...@gmail.com wrote:

 Hello Sameer,

 Please find my comments embedded below:

 Warm Regards,
 Tariq
 https://mtariq.jux.com/
 cloudfront.blogspot.com


 On Fri, Jan 18, 2013 at 11:21 AM, Sameer Jain sameer.j...@evalueserve.com
  wrote:

   Hi,



 I am trying to understand the different data analysis algorithms
 available in the market. Analyst opinion suggests that Informatica and
 Hadoop have the best offerings in this space.



 However, I am not very clear as to how the two are different and how they
 compete, because Hadoop is being used by IBM etc. Since you appear to be a
 fairly seasoned expert in this domain, I would like to get your perspective
 on the following:



 I would hugely appreciate any thoughts/insights around

 · The workings of Hadoop/Mapreduce

 Hadoop is an open source platform that allows
 us to store and process huge, really huge, amounts
 of data over a network of machines (which need not be
 very sophisticated). It has 2 layers, viz. HDFS &
 MapReduce, for storage & processing respectively.

 · Informatica’s product offering

 They can tell you better. This list is specific to
 Hadoop ecosystem.

  · A comparison of which one of these is better

 Depends upon the particular use case. One size
 doesn't fit all.

  · A view of whether Hadoop can be and/or is in competition with
 Informatica.

 I don't think so. Informatica is basically an ETL tool (if I
 am not wrong), while we leverage Hadoop's power to create
 ETL tools with the help of different Hadoop sub-projects.
 Though it is possible to use them together.



 Regards,

 Sameer



 *Sameer Jain*
   --

 Research Lead

 Evalueserve

 Office: + 91 124 4621615

 Mob: + 91 7827256066

 Fax: + 91 124 406 3430

 www.evalueserve.com










Re: Estimating disk space requirements

2013-01-18 Thread Jean-Marc Spaggiari
It all depends on what you want to do with this data and the power of each
single node. There is no one-size-fits-all rule.

The more nodes you have, the more CPU power you will have to process
the data... But if your 80GB boxes' CPUs are faster than your 40GB boxes'
CPUs, maybe you should take the 80GB ones.

If you want to get better advice from the list, you will need to
better define your needs and the nodes you can have.

JM

2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
 If we look at it with performance in mind,
 is it better to have 20 Nodes with 40 GB HDD
 or is it better to have 10 Nodes with 80 GB HDD?

 they are connected on a gigabit LAN

 Thnx


 On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari 
 jean-m...@spaggiari.org wrote:

 20 nodes with 40 GB will do the work.

 After that you will have to consider performances based on your access
 pattern. But that's another story.

 JM

 2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
  Thank you for the replies,
 
  So I take it that I should have atleast 800 GB on total free space on
  HDFS.. (combined free space of all the nodes connected to the cluster).
 So
  I can connect 20 nodes having 40 GB of hdd on each node to my cluster.
 Will
  this be enough for the storage?
  Please confirm.
 
  Thanking You,
  Regards,
  Panshul.
 
 
  On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari 
  jean-m...@spaggiari.org wrote:
 
  Hi Panshul,
 
  If you have 20 GB with a replication factor set to 3, you have only
  6.6GB available, not 11GB. You need to divide the total space by the
  replication factor.
 
  Also, if you store your JSon into HBase, you need to add the key size
  to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
 
  So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space to
  store it. Without including the key size. Even with a replication
  factor set to 5 you don't have the space.
 
  Now, you can add some compression, but even with a lucky factor of 50%
  you still don't have the space. You will need something like 90%
  compression factor to be able to store this data in your cluster.
 
  A 1T drive is now less than $100... So you might think about replacing
  you 20 GB drives by something bigger.
  to reply to your last question, for your data here, you will need AT
  LEAST 350GB overall storage. But that's a bare minimum. Don't go under
  500GB.
 
  IMHO
 
  JM
 
  2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
   Hello,
  
   I was estimating how much disk space do I need for my cluster.
  
   I have 24 million JSON documents approx. 5kb each
   the Json is to be stored into HBASE with some identifying data in
  coloumns
   and I also want to store the Json for later retrieval based on the
   Id
  data
   as keys in Hbase.
   I have my HDFS replication set to 3
   each node has Hadoop and hbase and Ubuntu installed on it.. so
   approx
   11
  GB
   is available for use on my 20 GB node.
  
   I have no idea, if I have not enabled Hbase replication, is the HDFS
   replication enough to keep the data safe and redundant.
   How much total disk space I will need for the storage of the data.
  
   Please help me estimate this.
  
   Thank you so much.
  
   --
   Regards,
   Ouch Whisper
   010101010101
  
 
 
 
 
  --
  Regards,
  Ouch Whisper
  010101010101
 




 --
 Regards,
 Ouch Whisper
 010101010101



Re: Estimating disk space requirements

2013-01-18 Thread Panshul Whisper
Thank you for the reply.

It would be great if someone could suggest whether it is better to set up my cluster on
Rackspace or on Amazon using EC2 servers,
keeping in mind that Amazon services have been having a lot of downtime...
My main points of concern are performance and availability.
My cluster has to be very highly available.

Thanks.


On Fri, Jan 18, 2013 at 3:12 PM, Jean-Marc Spaggiari 
jean-m...@spaggiari.org wrote:

 It all depend what you want to do with this data and the power of each
 single node. There is no one size fit all rule.

 The more nodes you have, the more CPU power you will have to process
 the data... But if you 80GB boxes CPUs are faster than your 40GB boxes
 CPU ,maybe you should take the 80GB then.

 If you want to get better advices from the list, you will need to
 beter define you needs and the nodes you can have.

 JM

 2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
  If we look at it with performance in mind,
  is it better to have 20 Nodes with 40 GB HDD
  or is it better to have 10 Nodes with 80 GB HDD?
 
  they are connected on a gigabit LAN
 
  Thnx
 
 
  On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari 
  jean-m...@spaggiari.org wrote:
 
  20 nodes with 40 GB will do the work.
 
  After that you will have to consider performances based on your access
  pattern. But that's another story.
 
  JM
 
  2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
   Thank you for the replies,
  
   So I take it that I should have atleast 800 GB on total free space on
   HDFS.. (combined free space of all the nodes connected to the
 cluster).
  So
   I can connect 20 nodes having 40 GB of hdd on each node to my cluster.
  Will
   this be enough for the storage?
   Please confirm.
  
   Thanking You,
   Regards,
   Panshul.
  
  
   On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari 
   jean-m...@spaggiari.org wrote:
  
   Hi Panshul,
  
   If you have 20 GB with a replication factor set to 3, you have only
   6.6GB available, not 11GB. You need to divide the total space by the
   replication factor.
  
   Also, if you store your JSon into HBase, you need to add the key size
   to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
  
   So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space
 to
   store it. Without including the key size. Even with a replication
   factor set to 5 you don't have the space.
  
   Now, you can add some compression, but even with a lucky factor of
 50%
   you still don't have the space. You will need something like 90%
   compression factor to be able to store this data in your cluster.
  
   A 1T drive is now less than $100... So you might think about
 replacing
   you 20 GB drives by something bigger.
   to reply to your last question, for your data here, you will need AT
   LEAST 350GB overall storage. But that's a bare minimum. Don't go
 under
   500GB.
  
   IMHO
  
   JM
  
   2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
Hello,
   
I was estimating how much disk space do I need for my cluster.
   
I have 24 million JSON documents approx. 5kb each
the Json is to be stored into HBASE with some identifying data in
   coloumns
and I also want to store the Json for later retrieval based on the
Id
   data
as keys in Hbase.
I have my HDFS replication set to 3
each node has Hadoop and hbase and Ubuntu installed on it.. so
approx
11
   GB
is available for use on my 20 GB node.
   
I have no idea, if I have not enabled Hbase replication, is the
 HDFS
replication enough to keep the data safe and redundant.
How much total disk space I will need for the storage of the data.
   
Please help me estimate this.
   
Thank you so much.
   
--
Regards,
Ouch Whisper
010101010101
   
  
  
  
  
   --
   Regards,
   Ouch Whisper
   010101010101
  
 
 
 
 
  --
  Regards,
  Ouch Whisper
  010101010101
 




-- 
Regards,
Ouch Whisper
010101010101


Re: Estimating disk space requirements

2013-01-18 Thread Mohammad Tariq
I have been using AWS for quite some time and I have
never faced any issues. Personally speaking, I find AWS
really flexible. You get a great deal of flexibility in choosing
services depending upon your requirements.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Fri, Jan 18, 2013 at 7:54 PM, Panshul Whisper ouchwhis...@gmail.comwrote:

 Thank you for the reply.

 It will be great if someone can suggest, if setting up my cluster on
 Rackspace is good or on Amazon using EC2 servers?
 keeping in mind Amazon services have been having a lot of downtimes...
 My main point of concern is performance and availablitiy.
 My cluster has to be very Highly Available.

 Thanks.


 On Fri, Jan 18, 2013 at 3:12 PM, Jean-Marc Spaggiari 
 jean-m...@spaggiari.org wrote:

 It all depend what you want to do with this data and the power of each
 single node. There is no one size fit all rule.

 The more nodes you have, the more CPU power you will have to process
 the data... But if you 80GB boxes CPUs are faster than your 40GB boxes
 CPU ,maybe you should take the 80GB then.

 If you want to get better advices from the list, you will need to
 beter define you needs and the nodes you can have.

 JM

 2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
  If we look at it with performance in mind,
  is it better to have 20 Nodes with 40 GB HDD
  or is it better to have 10 Nodes with 80 GB HDD?
 
  they are connected on a gigabit LAN
 
  Thnx
 
 
  On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari 
  jean-m...@spaggiari.org wrote:
 
  20 nodes with 40 GB will do the work.
 
  After that you will have to consider performances based on your access
  pattern. But that's another story.
 
  JM
 
  2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
   Thank you for the replies,
  
   So I take it that I should have atleast 800 GB on total free space on
   HDFS.. (combined free space of all the nodes connected to the
 cluster).
  So
   I can connect 20 nodes having 40 GB of hdd on each node to my
 cluster.
  Will
   this be enough for the storage?
   Please confirm.
  
   Thanking You,
   Regards,
   Panshul.
  
  
   On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari 
   jean-m...@spaggiari.org wrote:
  
   Hi Panshul,
  
   If you have 20 GB with a replication factor set to 3, you have only
   6.6GB available, not 11GB. You need to divide the total space by the
   replication factor.
  
   Also, if you store your JSon into HBase, you need to add the key
 size
   to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
  
   So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space
 to
   store it. Without including the key size. Even with a replication
   factor set to 5 you don't have the space.
  
   Now, you can add some compression, but even with a lucky factor of
 50%
   you still don't have the space. You will need something like 90%
   compression factor to be able to store this data in your cluster.
  
   A 1T drive is now less than $100... So you might think about
 replacing
   you 20 GB drives by something bigger.
   to reply to your last question, for your data here, you will need AT
   LEAST 350GB overall storage. But that's a bare minimum. Don't go
 under
   500GB.
  
   IMHO
  
   JM
  
   2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
Hello,
   
I was estimating how much disk space do I need for my cluster.
   
I have 24 million JSON documents approx. 5kb each
the Json is to be stored into HBASE with some identifying data in
   coloumns
and I also want to store the Json for later retrieval based on the
Id
   data
as keys in Hbase.
I have my HDFS replication set to 3
each node has Hadoop and hbase and Ubuntu installed on it.. so
approx
11
   GB
is available for use on my 20 GB node.
   
I have no idea, if I have not enabled Hbase replication, is the
 HDFS
replication enough to keep the data safe and redundant.
How much total disk space I will need for the storage of the data.
   
Please help me estimate this.
   
Thank you so much.
   
--
Regards,
Ouch Whisper
010101010101
   
  
  
  
  
   --
   Regards,
   Ouch Whisper
   010101010101
  
 
 
 
 
  --
  Regards,
  Ouch Whisper
  010101010101
 




 --
 Regards,
 Ouch Whisper
 010101010101



Cohesion of Hadoop team?

2013-01-18 Thread Glen Mazza
Hi, looking at the derivation of the 0.23.x & 2.0.x branches on one 
hand, and the 1.x branches on the other, as described here:

http://mail-archives.apache.org/mod_mbox/hadoop-user/201301.mbox/%3CCD0CAB8B.1098F%25evans%40yahoo-inc.com%3E

One gets the impression the Hadoop committers are split into two teams, 
with one team working on 0.23.x/2.0.2 and another team working on 1.x, 
running the risk of increasingly diverging products eventually competing 
with each other.  Is that the case?  Is there expected to be a Hadoop 
3.0 where the results of the two lines of development will merge or is 
it increasingly likely the subteams will continue their separate routes?


Thanks,
Glen

--
Glen Mazza
Talend Community Coders - coders.talend.com
blog: www.jroller.com/gmazza



Re: Execution of udf

2013-01-18 Thread nagarjuna kanamarlapudi
No, but the query execution shows a reducer running... And in fact I feel
that a reduce phase can be there.

On Friday, January 18, 2013, Dean Wampler wrote:

 There is no reduce phase needed in this query.

  On Fri, Jan 18, 2013 at 6:59 AM, nagarjuna kanamarlapudi 
  nagarjuna.kanamarlap...@gmail.com wrote:

 Hi,

 Select col1,myudf(col2,col3) from table1;


 In what phase of MapReduce is a UDF executed?

 In the very beginning, I assumed that Hive would be joining two tables,
 getting the required columns and then applying the UDF on the specified columns,
 i.e., essentially in the reduce phase. But later on I realised that I was
 wrong.

 Is there any specific parameter which tells Hive to call the UDF in the
 reduce phase rather than in the map phase?


 Regards,
 Nagarjuna


 --
 Sent from iPhone




 --
 *Dean Wampler, Ph.D.*
 thinkbiganalytics.com
 +1-312-339-1330



-- 
Sent from iPhone


Re: Problems

2013-01-18 Thread Sean Hudson

Leo,
   I downloaded the suggested 1.6.0_32 Java version to my home 
directory, but I am still experiencing the same problem (See error below).
The only thing that I have set in my hadoop-env.sh file is the JAVA_HOME 
environment variable. I have also tried it with the Java directory added to 
PATH.


export JAVA_HOME=/home/shu/jre1.6.0_32
export PATH=$PATH:/home/shu/jre1.6.0_32

Every other environment variable is defaulted.

Just to clarify, I have tried this in Local Standalone mode and also in 
Pseudo-Distributed Mode with the same result.


Frustrating to say the least,

Sean Hudson


shu@meath-nua:~/hadoop-1.0.4 bin/hadoop jar hadoop-examples-1.0.4.jar grep 
input output 'dfs[a-z.]+'

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGFPE (0x8) at pc=0xb7fc51fb, pid=23112, tid=3075554208
#
# JRE version: 6.0_32-b05
# Java VM: Java HotSpot(TM) Client VM (20.7-b02 mixed mode, sharing 
linux-x86 )

# Problematic frame:
# C  [ld-linux.so.2+0x91fb]  double+0xab
#
# An error report file with more information is saved as:
# /home/shu/hadoop-1.0.4/hs_err_pid23112.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Aborted

-Original Message- 
From: Leo Leung

Sent: Thursday, January 17, 2013 6:46 PM
To: user@hadoop.apache.org
Subject: RE: Problems

Use Sun/Oracle  1.6.0_32+   Build should be 20.7-b02+

1.7 causes failure and AFAIK,  not supported,  but you are free to try the 
latest version and report back.




-Original Message-
From: Sean Hudson [mailto:sean.hud...@ostiasolutions.com]
Sent: Thursday, January 17, 2013 6:57 AM
To: user@hadoop.apache.org
Subject: Re: Problems

Hi,
 My Java version is

java version "1.6.0_25"
Java(TM) SE Runtime Environment (build 1.6.0_25-b06)
Java HotSpot(TM) Client VM (build 20.0-b11, mixed mode, sharing)


Would you advise obtaining a later Java version?

Sean

-Original Message-
From: Jean-Marc Spaggiari
Sent: Thursday, January 17, 2013 2:52 PM
To: user@hadoop.apache.org
Subject: Re: Problems

Hi Sean,

This is an issue with your JVM. Not related to hadoop.

Which JVM are you using, and can you try with the last from Sun?

JM

2013/1/17, Sean Hudson sean.hud...@ostiasolutions.com:

Hi,
  I have recently installed hadoop-1.0.4 on a linux machine.
Whilst working through the post-install instructions contained in the
“Quick Start”
guide, I incurred the following catastrophic Java runtime error (See
below).
I have attached the error report file “hs_err_pid24928.log”. I have
submitted a Java bug report, but perhaps it is a known hadoop-1.0.4
version problem.

I am a first time user of Hadoop and would welcome guidance on this
problem,

Regards,

Sean Hudson.

shu@meath-nua:~/hadoop-1.0.4 bin/hadoop jar hadoop-examples-1.0.4.jar
grep input output 'dfs[a-z.]+'
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGFPE (0x8) at pc=0xb7f2b1fb, pid=24928, tid=3074923424 # # JRE
version: 6.0_25-b06 # Java VM: Java HotSpot(TM) Client VM (20.0-b11
mixed mode, sharing
linux-x86 )
# Problematic frame:
# C  [ld-linux.so.2+0x91fb]  double+0xab # # An error report file with
more information is saved as:
# /home/shu/hadoop-1.0.4/hs_err_pid24928.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Aborted








Re: On a lighter note

2013-01-18 Thread varun kumar
:) :)

On Fri, Jan 18, 2013 at 7:08 PM, shashwat shriparv 
dwivedishash...@gmail.com wrote:

 :)



 ∞
 Shashwat Shriparv



 On Fri, Jan 18, 2013 at 6:43 PM, Fabio Pitzolu fabio.pitz...@gr-ci.comwrote:

 Someone should made one about unsubscribing from this mailing list ! :D


 *Fabio Pitzolu*
 Consultant - BI  Infrastructure

 Mob. +39 3356033776
 Telefono 02 87157239
 Fax. 02 93664786

 *Gruppo Consulenza Innovazione - http://www.gr-ci.com*


 2013/1/18 Mohammad Tariq donta...@gmail.com

 Folks quite often get confused by the name. But this one is just
 unbeatable :)

 Warm Regards,
 Tariq
 https://mtariq.jux.com/
 cloudfront.blogspot.com


 On Fri, Jan 18, 2013 at 4:52 PM, Viral Bajaria 
 viral.baja...@gmail.comwrote:

 LOL just amazing... I remember having a similar conversation with
 someone who didn't understand meaning of secondary namenode :-)

 Viral
 --
 From: iwannaplay games
 Sent: 1/18/2013 1:24 AM

 To: user@hadoop.apache.org
 Subject: Re: On a lighter note

 Awesome
 :)



 Regards
 Prabhjot







-- 
Regards,
Varun Kumar.P


Re: Problems

2013-01-18 Thread Jean-Marc Spaggiari
Hi Sean,

It's strange. You should not be facing that. I faced the same kind of issues
on a desktop with memory errors. Can you install memtest86 and fully
test your memory (one pass is enough) to make sure you don't have
issues on that side?

2013/1/18, Sean Hudson sean.hud...@ostiasolutions.com:
 Leo,
 I downloaded the suggested 1.6.0_32 Java version to my home
 directory, but I am still experiencing the same problem (See error below).
 The only thing that I have set in my hadoop-env.sh file is the JAVA_HOME
 environment variable. I have also tried it with the Java directory added to

 PATH.

 export JAVA_HOME=/home/shu/jre1.6.0_32
 export PATH=$PATH:/home/shu/jre1.6.0_32

 Every other environment variable is defaulted.

 Just to clarify, I have tried this in Local Standalone mode and also in
 Pseudo-Distributed Mode with the same result.

 Frustrating to say the least,

 Sean Hudson


 shu@meath-nua:~/hadoop-1.0.4 bin/hadoop jar hadoop-examples-1.0.4.jar grep

 input output 'dfs[a-z.]+'
 #
 # A fatal error has been detected by the Java Runtime Environment:
 #
 #  SIGFPE (0x8) at pc=0xb7fc51fb, pid=23112, tid=3075554208
 #
 # JRE version: 6.0_32-b05
 # Java VM: Java HotSpot(TM) Client VM (20.7-b02 mixed mode, sharing
 linux-x86 )
 # Problematic frame:
 # C  [ld-linux.so.2+0x91fb]  double+0xab
 #
 # An error report file with more information is saved as:
 # /home/shu/hadoop-1.0.4/hs_err_pid23112.log
 #
 # If you would like to submit a bug report, please visit:
 #   http://java.sun.com/webapps/bugreport/crash.jsp
 # The crash happened outside the Java Virtual Machine in native code.
 # See problematic frame for where to report the bug.
 #
 Aborted

 -Original Message-
 From: Leo Leung
 Sent: Thursday, January 17, 2013 6:46 PM
 To: user@hadoop.apache.org
 Subject: RE: Problems

 Use Sun/Oracle 1.6.0_32 or later; the build should be 20.7-b02 or later.

 1.7 causes failures and, AFAIK, is not supported, but you are free to try the
 latest version and report back.



 -Original Message-
 From: Sean Hudson [mailto:sean.hud...@ostiasolutions.com]
 Sent: Thursday, January 17, 2013 6:57 AM
 To: user@hadoop.apache.org
 Subject: Re: Problems

 Hi,
   My Java version is

 java version "1.6.0_25"
 Java(TM) SE Runtime Environment (build 1.6.0_25-b06)
 Java HotSpot(TM) Client VM (build 20.0-b11, mixed mode, sharing)

 Would you advise obtaining a later Java version?

 Sean

 -Original Message-
 From: Jean-Marc Spaggiari
 Sent: Thursday, January 17, 2013 2:52 PM
 To: user@hadoop.apache.org
 Subject: Re: Problems

 Hi Sean,

 This is an issue with your JVM. Not related to hadoop.

 Which JVM are you using, and can you try with the last from Sun?

 JM

 2013/1/17, Sean Hudson sean.hud...@ostiasolutions.com:
 Hi,
   I have recently installed hadoop-1.0.4 on a linux machine.
 Whilst working through the post-install instructions contained in the
 “Quick Start”
 guide, I incurred the following catastrophic Java runtime error (See
 below).
 I have attached the error report file “hs_err_pid24928.log”. I have
 submitted a Java bug report, but perhaps it is a known hadoop-1.0.4
 version problem.

 I am a first time user of Hadoop and would welcome guidance on this
 problem,

 Regards,

 Sean Hudson.

 shu@meath-nua:~/hadoop-1.0.4 bin/hadoop jar hadoop-examples-1.0.4.jar
 grep input output 'dfs[a-z.]+'
 #
 # A fatal error has been detected by the Java Runtime Environment:
 #
 #  SIGFPE (0x8) at pc=0xb7f2b1fb, pid=24928, tid=3074923424
 #
 # JRE version: 6.0_25-b06
 # Java VM: Java HotSpot(TM) Client VM (20.0-b11 mixed mode, sharing linux-x86 )
 # Problematic frame:
 # C  [ld-linux.so.2+0x91fb]  double+0xab
 #
 # An error report file with more information is saved as:
 # /home/shu/hadoop-1.0.4/hs_err_pid24928.log
 #
 # If you would like to submit a bug report, please visit:
 #   http://java.sun.com/webapps/bugreport/crash.jsp
 # The crash happened outside the Java Virtual Machine in native code.
 # See problematic frame for where to report the bug.
 #
 Aborted

unsubscribe

2013-01-18 Thread Cristian Cira
Please unsubscribe me from this news feed.

Thank you

Cristian Cira
Graduate Research Assistant
Parallel Architecture and System Laboratory(PASL)
Shelby Center 2105
Auburn University, AL 36849


From: yiyu jia [jia.y...@gmail.com]
Sent: Friday, January 18, 2013 12:12 AM
To: user@hadoop.apache.org
Subject: run hadoop in standalone mode

Hi,

I tried to run hadoop in standalone mode according to the Hadoop online
documentation, but I get the error message below. I ran the command ./bin/hadoop jar 
hadoop-examples-1.1.1.jar pi 10 100.


13/01/18 01:07:05 INFO ipc.Client: Retrying connect to server: 
localhost/127.0.0.1:8020. Already tried 9 time(s); retry 
policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 
SECONDS)
java.lang.RuntimeException: java.net.ConnectException: Call to 
localhost/127.0.0.1:8020 failed on connection exception: 
java.net.ConnectException: Connection refused



I disabled IPv6 and the firewall on my linux machine, but I still get this error 
message. localhost is bound to 127.0.0.1. core-site.xml and 
mapreduce-site.xml are empty as they have not been modified.

Can anybody give me a hint whether I need to do some specific configuration to run 
hadoop in standalone mode?

thanks and regards,

Yiyu




How to unsubscribe from the list (Re: unsubscribe)

2013-01-18 Thread Jean-Marc Spaggiari
Search on Google and click on the first link ;)

https://www.google.ca/search?q=unsubscribe+hadoop+mailing+list

2013/1/18, Cristian Cira cmc0...@tigermail.auburn.edu:
 Please unsubscribe be from this news feed

 Thank you

 Cristian Cira
 Graduate Research Assistant
 Parallel Architecture and System Laboratory(PASL)
 Shelby Center 2105
 Auburn University, AL 36849

 
 From: yiyu jia [jia.y...@gmail.com]
 Sent: Friday, January 18, 2013 12:12 AM
 To: user@hadoop.apache.org
 Subject: run hadoop in standalone mode

 Hi,

 I tried to run hadoop in standalone mode according to hadoop online
 document. But, I get error message as below. I run command ./bin/hadoop jar
 hadoop-examples-1.1.1.jar pi 10 100.


 13/01/18 01:07:05 INFO ipc.Client: Retrying connect to server:
 localhost/127.0.0.1:8020. Already tried 9 time(s);
 retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
 sleepTime=1 SECONDS)
 java.lang.RuntimeException: java.net.ConnectException: Call to
 localhost/127.0.0.1:8020 failed on connection
 exception: java.net.ConnectException: Connection refused



 I disabled ipV6, firewall on my linux machine. But, i still get this error
 message. localhost is bound with 127.0.01 . core-site.xml and
 mapreduce-site.xml are empty as they are not modified.

 Anybody can give me a hint if I need to do some specific configuration to
 run hadoop in standalone mode?

 thanks and regards,

 Yiyu





Re: How to unsubscribe from the list (Re: unsubscribe)

2013-01-18 Thread Fabio Pitzolu
This was EPIC!! :-D


*Fabio Pitzolu*
*
*
2013/1/18 Jean-Marc Spaggiari jean-m...@spaggiari.org

 Search on google and clic on the first link ;)

 https://www.google.ca/search?q=unsubscribe+hadoop+mailing+list

 2013/1/18, Cristian Cira cmc0...@tigermail.auburn.edu:
  Please unsubscribe be from this news feed
 
  Thank you
 
  Cristian Cira
  Graduate Research Assistant
  Parallel Architecture and System Laboratory(PASL)
  Shelby Center 2105
  Auburn University, AL 36849
 
  
  From: yiyu jia [jia.y...@gmail.com]
  Sent: Friday, January 18, 2013 12:12 AM
  To: user@hadoop.apache.org
  Subject: run hadoop in standalone mode
 
  Hi,
 
  I tried to run hadoop in standalone mode according to hadoop online
  document. But, I get error message as below. I run command ./bin/hadoop
 jar
  hadoop-examples-1.1.1.jar pi 10 100.
 
 
  13/01/18 01:07:05 INFO ipc.Client: Retrying connect to server:
  localhost/127.0.0.1:8020http://127.0.0.1:8020. Already tried 9
 time(s);
  retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10,
  sleepTime=1 SECONDS)
  java.lang.RuntimeException: java.net.ConnectException: Call to
  localhost/127.0.0.1:8020http://127.0.0.1:8020 failed on connection
  exception: java.net.ConnectException: Connection refused
 
 
 
  I disabled ipV6, firewall on my linux machine. But, i still get this
 error
  message. localhost is bound with 127.0.01 . core-site.xml and
  mapreduce-site.xml are empty as they are not modified.
 
  Anybody can give me a hint if I need to do some specific configuration to
  run hadoop in standalone mode?
 
  thanks and regards,
 
  Yiyu
 
 
 



Re: Hadoop Scalability

2013-01-18 Thread Ted Dunning
Also, you may have to adjust your algorithms.

For instance, the conventional standard algorithm for SVD is a Lanczos
iterative algorithm.  Iteration in Hadoop is death because of job
invocation time ... what you wind up with is an algorithm that will handle
big data but with a slow-down factor that makes a single node perform at
the same level as 100 Hadoop nodes or more.  Scaling with iterative
algorithms like this is irrelevant because of the enormous fixed cost.

On the other hand, you can switch to some of the recently developed
stochastic projection algorithms which give a non-iterative algorithm that
requires 4-7 map-reduce steps (depending on which outputs you need).  With
these projection algorithms, Hadoop can out-run other techniques even with
quite modest cluster sizes and will scale linearly.

On Thu, Jan 17, 2013 at 9:47 PM, Stephen Boesch java...@gmail.com wrote:

 Hi Thiago,
   Subjectively:  there are a number of items to consider to achieve nearly
 linear scaling:


- if the work is well balanced among the tasks - no skew
- No skew in the association of tasks to nodes. Note: this skew
actually happens by default if the number of tasks is less than the cluster
capacity of slots.  You will notice that on a cluster with 20 nodes, with
each node set to 20 mapper tasks, if you launch a job with 20 maps it may
well have all of them running on one node.
- with a higher number of tasks, the risk of stragglers affecting
overall throughput/performance increases unless speculative execution is
set properly (a config sketch follows below)
- hadoop configuration settings come under more pressure as the number of
nodes and tasks grows
- properly tuning the number of mappers and reducers to (a) your node
and cluster characteristics and (b) the particular tasks has a large impact
on performance. In my experience the settings are often set too
conservatively / too low to take advantage of the node and cluster
resources

 So in summary, hadoop itself is capable of nearly linear scaling to low
 thousands of nodes, but configuring the cluster to really achieve that
 requires effort.
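
To make the speculative-execution and per-node slot settings above concrete, here is a
minimal mapred-site.xml sketch (Hadoop 1.x property names; the slot counts are
placeholders to tune for your own nodes, not recommendations):

  <configuration>
    <!-- Speculative execution: re-run slow tasks elsewhere to blunt stragglers -->
    <property>
      <name>mapred.map.tasks.speculative.execution</name>
      <value>true</value>
    </property>
    <property>
      <name>mapred.reduce.tasks.speculative.execution</name>
      <value>true</value>
    </property>
    <!-- Per-TaskTracker slot counts: size to the node's cores and memory -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>8</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>
    </property>
  </configuration>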


 2013/1/17 Thiago Vieira tpbvie...@gmail.com

 Hello!

 It is common to see the sentence: Hadoop Scales Linearly. But is there
 any performance evaluation to confirm this?

 In my evaluations, Hadoop processing capacity scales linearly, but not
 proportional to number of nodes, the processing capacity achieved with 20
 nodes is not the double of the processing capacity achieved with 10 nodes.
 Is there any evaluation about this?

 Thank you!

 --
 Thiago Vieira





Re: how to restrict the concurrent running map tasks?

2013-01-18 Thread Robert Evans
General is for product announcements and the like.  You really should
direct your question to mapreduce-user@.  I have bcced general.

I am not an expert on this, but I looked and it appears that you have to
use a special scheduler in the JobTracker to make this happen.

org.apache.hadoop.mapred.LimitTasksPerJobTaskScheduler


It looks a lot like the fifo scheduler but with a limit on the number of
tasks.  I am not sure if this is something that will work for you or not.

--Bobby
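
For reference, a rough sketch of what selecting that scheduler would look like in
mapred-site.xml on the JobTracker (Hadoop 1.x property names; the limit value is a
placeholder, and the JobTracker needs a restart to pick up a scheduler change):

  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.LimitTasksPerJobTaskScheduler</value>
  </property>
  <property>
    <!-- cap on concurrently running tasks per job (placeholder value) -->
    <name>mapred.jobtracker.taskScheduler.maxRunningTasksPerJob</name>
    <value>10</value>
  </property>

As far as I can tell the scheduler reads this limit from the JobTracker-side
configuration, so passing it as a per-job option from the client (as in the message
below) would not be expected to take effect.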

On 1/18/13 4:22 AM, hwang joe.haiw...@gmail.com wrote:

Hi all:

My hadoop version is 1.0.2. Now I want at most 10 map tasks running at the
same time. I have found 2 parameter related to this question.

a) mapred.job.map.capacity

but in my hadoop version, this parameter seems abandoned.

b) mapred.jobtracker.taskScheduler.maxRunningTasksPerJob (
http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.collector
/1.0.2/mapred-default.xml
)

I set this variable like below:

Configuration conf = new Configuration();
conf.set("date", date);
conf.set("mapred.job.queue.name", "hadoop");
conf.set("mapred.jobtracker.taskScheduler.maxRunningTasksPerJob", "10");

DistributedCache.createSymlink(conf);
Job job = new Job(conf, "ConstructApkDownload_" + date);
...

The problem is that it doesn't work. There is still more than 50 maps
running as the job starts.

I'm not sure whether I set this parameter in wrong way ? or misunderstand
it.

After looking through the hadoop document, I can't find another parameter
to limit the concurrent running map tasks.

Hope someone can help me ,Thanks.



Re: Cohesion of Hadoop team?

2013-01-18 Thread Suresh Srinivas
On Fri, Jan 18, 2013 at 6:48 AM, Glen Mazza gma...@talend.com wrote:

  Hi, looking at the derivation of the 0.23.x and 2.0.x branches on one hand,
 and the 1.x branches on the other, as described here:

 http://mail-archives.apache.org/mod_mbox/hadoop-user/201301.mbox/%3CCD0CAB8B.1098F%25evans%40yahoo-inc.com%3E

 One gets the impression the Hadoop committers are split into two teams,
 with one team working on 0.23.x/2.0.2 and another team working on 1.x,
 running the risk of increasingly diverging products eventually competing
 with each other.  Is that the case?


I am not sure how you came to this conclusion. The way I see it is, all the
folks are working on trunk. Subset of this work from trunk is pushed to
older releases such as 1.x or 0.23.x. In Apache Hadoop, features always go
to trunk first before going to any older releases 1.x or 0.23.x. That means
trunk is a superset of all the features.

Is there expected to be a Hadoop 3.0 where the results of the two lines of
 development will merge or is it increasingly likely the subteams will
 continue their separate routes?


2.0.3-alpha, which is the latest release based off of trunk and is in the
final stage of completion, should have all the features that all the other
releases have. Let me know if there are any exceptions to this that you
know of.



 Thanks,
 Glen

 --
 Glen Mazza
 Talend Community Coders - coders.talend.com
 blog: www.jroller.com/gmazza




-- 
http://hortonworks.com/download/


Re: On a lighter note

2013-01-18 Thread Mattmann, Chris A (388J)
This…is….hilarious lol

Cheers,
Chris Mattmann

From: Anand Sharma anand2sha...@gmail.com
Reply-To: user@hadoop.apache.org
Date: Thursday, January 17, 2013 7:09 PM
To: user@hadoop.apache.org
Subject: Re: On a lighter note

Awesome one Tariq!!


On Fri, Jan 18, 2013 at 6:39 AM, Mohammad Tariq donta...@gmail.com wrote:
You are right Michael, as always :)

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Fri, Jan 18, 2013 at 6:33 AM, Michael Segel michael_se...@hotmail.com wrote:
I'm thinking 'Downfall'

But I could be wrong.

On Jan 17, 2013, at 6:56 PM, Yongzhi Wang wang.yongzhi2...@gmail.com wrote:

Who can tell me what is the name of the original film? Thanks!

Yongzhi


On Thu, Jan 17, 2013 at 3:05 PM, Mohammad Tariq donta...@gmail.com wrote:
I am sure you will suffer from severe stomach ache after watching this :)
http://www.youtube.com/watch?v=hEqQMLSXQlY

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com






Re: Cohesion of Hadoop team?

2013-01-18 Thread Glen Mazza

On 01/18/2013 11:58 AM, Suresh Srinivas wrote:




On Fri, Jan 18, 2013 at 6:48 AM, Glen Mazza gma...@talend.com wrote:


Hi, looking at the derivation of the 0.23.x  2.0.x branches on
one hand, and the 1.x branches on the other, as described here:

http://mail-archives.apache.org/mod_mbox/hadoop-user/201301.mbox/%3CCD0CAB8B.1098F%25evans%40yahoo-inc.com%3E

One gets the impression the Hadoop committers are split into two
teams, with one team working on 0.23.x/2.0.2 and another team
working on 1.x, running the risk of increasingly diverging
products eventually competing with each other.  Is that the case?


I am not sure how you came to this conclusion. The way I see it is, 
all the folks are working on trunk. Subset of this work from trunk is 
pushed to older releases such as 1.x or 0.23.x. In Apache Hadoop, 
features always go to trunk first before going to any older releases 
1.x or 0.23.x. That means trunk is a superset of all the features.


Is there expected to be a Hadoop 3.0 where the results of the two
lines of development will merge or is it increasingly likely the
subteams will continue their separate routes?


2.0.3-alpha, which is the latest release based off of trunk, that is 
in final stage of completion should have all the features that all the 
other releases have. Let me know if there are any exceptions to this 
that you know of.


I had entered a JIRA here: 
https://issues.apache.org/jira/browse/HADOOP-9206 .  The instructions 
for single-node setup on 1.1.x are radically different from the 
instructions for 0.23 and 2.0.2; furthermore, the JARs and folder 
structure of what you get from the 1.1.x download and what you get with 
either 0.23.x or 2.0.x-alpha is also considerably different.  The deltas 
here, along with Bobby Evans' explanation of the version histories I 
linked to above, gave me the impression that 1.x has one team working on 
it while the other branches have another.  If that was the case (as 
you're now clarifying, it's not) I was then wondering when all 
committers would be more or less on the same page again.


Thanks for the clarification.

Glen




Thanks,
Glen

-- 
Glen Mazza

Talend Community Coders - coders.talend.com
blog: www.jroller.com/gmazza




--
http://hortonworks.com/download/



--
Glen Mazza
Talend Community Coders - coders.talend.com
blog: www.jroller.com/gmazza



RE: On a lighter note

2013-01-18 Thread Chris Folsom



Now if only we really could change the name of secondary namenode...


Against the assault of laughter nothing can stand - Mark Twain


 Original Message 
Subject: Re: On a lighter note
From: Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov
Date: Fri, January 18, 2013 10:46 am
To: user@hadoop.apache.org user@hadoop.apache.org

 This…is….hilarious lol

 
Cheers,
Chris Mattmann

 
  From: Anand Sharma anand2sha...@gmail.com
 Reply-To: user@hadoop.apache.org user@hadoop.apache.org
 Date: Thursday, January 17, 2013 7:09 PM
 To: user@hadoop.apache.org user@hadoop.apache.org
 Subject: Re: On a lighter note
 

 
Awesome one Tariq!!
 

 
 On Fri, Jan 18, 2013 at 6:39 AM, Mohammad Tariq  donta...@gmail.com
wrote:
  You are right Michael, as always :)

  Warm Regards, Tariq
https://mtariq.jux.com/
 
cloudfront.blogspot.com
 



 
 
On Fri, Jan 18, 2013 at 6:33 AM, Michael Segel 
michael_se...@hotmail.com wrote:
  I'm thinking 'Downfall' 
 
But I could be wrong.

  On Jan 17, 2013, at 6:56 PM, Yongzhi Wang wang.yongzhi2...@gmail.com
wrote:

   Who can tell me what is the name of the original film? Thanks!
 
 
Yongzhi
 

 
 On Thu, Jan 17, 2013 at 3:05 PM, Mohammad Tariq  donta...@gmail.com
wrote:
  I am sure you will suffer from severe stomach ache after watching this
:) http://www.youtube.com/watch?v=hEqQMLSXQlY

  Warm Regards, Tariq
https://mtariq.jux.com/
 
cloudfront.blogspot.com


Re: On a lighter note

2013-01-18 Thread Mohammad Tariq
Inspired by this, I would call it the 'Downfall node' ;)

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Sat, Jan 19, 2013 at 12:14 AM, Chris Folsom jcfol...@pureperfect.comwrote:




 Now if only we really could change the name of secondary namenode...


 Against the assault of laughter nothing can stand - Mark Twain


  Original Message 
 Subject: Re: On a lighter note
 From: Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov
 Date: Fri, January 18, 2013 10:46 am
 To: user@hadoop.apache.org user@hadoop.apache.org

  This…is….hilarious lol


 Cheers,
 Chris Mattmann


   From: Anand Sharma anand2sha...@gmail.com
  Reply-To: user@hadoop.apache.org user@hadoop.apache.org
  Date: Thursday, January 17, 2013 7:09 PM
  To: user@hadoop.apache.org user@hadoop.apache.org
  Subject: Re: On a lighter note



 Awesome one Tariq!!



  On Fri, Jan 18, 2013 at 6:39 AM, Mohammad Tariq  donta...@gmail.com
 wrote:
   You are right Michael, as always :)

   Warm Regards, Tariq
 https://mtariq.jux.com/

 cloudfront.blogspot.com






 On Fri, Jan 18, 2013 at 6:33 AM, Michael Segel
 michael_se...@hotmail.com wrote:
   I'm thinking 'Downfall'

 But I could be wrong.

   On Jan 17, 2013, at 6:56 PM, Yongzhi Wang wang.yongzhi2...@gmail.com
 wrote:

Who can tell me what is the name of the original film? Thanks!


 Yongzhi



  On Thu, Jan 17, 2013 at 3:05 PM, Mohammad Tariq  donta...@gmail.com
 wrote:
   I am sure you will suffer from severe stomach ache after watching this
 :) http://www.youtube.com/watch?v=hEqQMLSXQlY

   Warm Regards, Tariq
 https://mtariq.jux.com/

 cloudfront.blogspot.com



config for high memory jobs does not work, please help.

2013-01-18 Thread Shaojun Zhao
Dear all,

I know it is best to use a small amount of memory in the mapper and reducer.
However, sometimes it is hard to do so. For example, in machine
learning algorithms, it is common to load the model into memory in the
mapper step. When the model is big, I have to allocate a lot of memory
for the mapper.

Here is my question: how can I configure hadoop so that it does not fork
too many mappers and run out of physical memory?

My machines have 24G, and I have 100 of them. Each time, hadoop
forks 6 mappers on each machine, no matter what config I use. I really
want to reduce it to whatever number I want, for example, just 1
mapper per machine.

Here are the config I tried. (I use streaming, and I pass the config
in the command line)

-Dmapred.child.java.opts=-Xmx8000m  -- did not bring down the number of mappers

-Dmapred.cluster.map.memory.mb=32000 -- did not bring down the number
of mappers

Am I missing something here?
I use Hadoop 0.20.205

Thanks a lot in advance!
-Shaojun


Re: config for high memory jobs does not work, please help.

2013-01-18 Thread Jeffrey Buell
Try:

-Dmapred.tasktracker.map.tasks.maximum=1

Although I usually put this parameter in mapred-site.xml.

Jeff
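
For reference, a minimal sketch of that setting as it would appear in mapred-site.xml
(Hadoop 0.20.x/1.x property name; set the value to whatever per-TaskTracker slot count
you want). Since it is read by the TaskTracker, each TaskTracker has to be restarted
after changing it:

  <property>
    <!-- maximum number of map tasks a single TaskTracker runs at once -->
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value>
  </property>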


Dear all,

I know it is best to use small amount of mem in mapper and reduce.
However, sometimes it is hard to do so. For example, in machine
learning algorithms, it is common to load the model into mem in the
mapper step. When the model is big, I have to allocate a lot of mem
for the mapper.

Here is my question: how can I config hadoop so that it does not fork
too many mappers and run out of physical memory?

My machines have 24G, and I have 100 of them. Each time, hadoop will
fork 6 mappers on each machine, no matter what config I used. I really
want to reduce it to what ever number I want, for example, just 1
mapper per machine.

Here are the config I tried. (I use streaming, and I pass the config
in the command line)

-Dmapred.child.java.opts=-Xmx8000m  -- did not bring down the number of mappers

-Dmapred.cluster.map.memory.mb=32000 -- did not bring down the number
of mappers

Am I missing something here?
I use Hadoop 0.20.205

Thanks a lot in advance!
-Shaojun


RE: On a lighter note

2013-01-18 Thread Chris Folsom

LOL



 Original Message 
Subject: Re: On a lighter note
From: Mohammad Tariq donta...@gmail.com
Date: Fri, January 18, 2013 2:08 pm
To: user@hadoop.apache.org user@hadoop.apache.org

Inspired by this, I would call it the 'Downfall node' ;)

Warm Regards,Tariq
https://mtariq.jux.com/
 
cloudfront.blogspot.com





On Sat, Jan 19, 2013 at 12:14 AM, Chris Folsom
jcfol...@pureperfect.com wrote:
 
 
 
 Now if only we really could change the name of secondary namenode...
 
 
 Against the assault of laughter nothing can stand - Mark Twain
 
 
  Original Message 
 Subject: Re: On a lighter note
 
From: Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov
 Date: Fri, January 18, 2013 10:46 am
 To: user@hadoop.apache.org user@hadoop.apache.org
 
  This…is….hilarious lol
 
 
 Cheers,
 Chris Mattmann
 
 
   From: Anand Sharma anand2sha...@gmail.com
  Reply-To: user@hadoop.apache.org user@hadoop.apache.org
  Date: Thursday, January 17, 2013 7:09 PM
  To: user@hadoop.apache.org user@hadoop.apache.org
  Subject: Re: On a lighter note
 
 
 
 Awesome one Tariq!!
 
 
 
  On Fri, Jan 18, 2013 at 6:39 AM, Mohammad Tariq  donta...@gmail.com
 wrote:
   You are right Michael, as always :)
 
   Warm Regards, Tariq
 https://mtariq.jux.com/
 
 cloudfront.blogspot.com
 
 
 
 
 
 
 On Fri, Jan 18, 2013 at 6:33 AM, Michael Segel
 michael_se...@hotmail.com wrote:
   I'm thinking 'Downfall'
 
 But I could be wrong.
 
   On Jan 17, 2013, at 6:56 PM, Yongzhi Wang
wang.yongzhi2...@gmail.com
 wrote:
 
Who can tell me what is the name of the original film? Thanks!
 
 
 Yongzhi
 
 
 
  On Thu, Jan 17, 2013 at 3:05 PM, Mohammad Tariq  donta...@gmail.com
 wrote:
   I am sure you will suffer from severe stomach ache after watching
this
 :) http://www.youtube.com/watch?v=hEqQMLSXQlY
 
   Warm Regards, Tariq
 https://mtariq.jux.com/
 
 cloudfront.blogspot.com


Re: config for high memory jobs does not work, please help.

2013-01-18 Thread Arun C Murthy
Take a look at the CapacityScheduler and 'High RAM' jobs, whereby you can run M 
map slots per node and request, per-job, that you want N (where N is anywhere 
between 1 and M).

Some more info:
http://hadoop.apache.org/docs/stable/capacity_scheduler.html#Resource+based+scheduling
http://hortonworks.com/blog/understanding-apache-hadoops-capacity-scheduler/

hth,
Arun
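
For illustration only, a rough sketch with Hadoop 0.20.20x/1.x property names (the
values are placeholders; the capacity scheduler also needs its contrib jar on the
JobTracker classpath and a queue definition in capacity-scheduler.xml, per the links
above).

Cluster side, in mapred-site.xml on the JobTracker:

  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
  </property>
  <property>
    <!-- memory "size" of one map slot, in MB (placeholder) -->
    <name>mapred.cluster.map.memory.mb</name>
    <value>2048</value>
  </property>
  <property>
    <!-- largest per-task memory a single job may request, in MB (placeholder) -->
    <name>mapred.cluster.max.map.memory.mb</name>
    <value>12288</value>
  </property>

Per job, e.g. on the streaming command line:

  -Dmapred.job.map.memory.mb=8192

With those placeholder numbers, each map task of the job is charged 8192/2048 = 4
slots, so a TaskTracker that normally runs 6 mappers would run at most 1 for this job.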

On Jan 18, 2013, at 12:05 PM, Shaojun Zhao wrote:

 Dear all,
 
 I know it is best to use small amount of mem in mapper and reduce.
 However, sometimes it is hard to do so. For example, in machine
 learning algorithms, it is common to load the model into mem in the
 mapper step. When the model is big, I have to allocate a lot of mem
 for the mapper.
 
 Here is my question: how can I config hadoop so that it does not fork
 too many mappers and run out of physical memory?
 
 My machines have 24G, and I have 100 of them. Each time, hadoop will
 fork 6 mappers on each machine, no matter what config I used. I really
 want to reduce it to what ever number I want, for example, just 1
 mapper per machine.
 
 Here are the config I tried. (I use streaming, and I pass the config
 in the command line)
 
 -Dmapred.child.java.opts=-Xmx8000m  -- did not bring down the number of 
 mappers
 
 -Dmapred.cluster.map.memory.mb=32000 -- did not bring down the number
 of mappers
 
 Am I missing something here?
 I use Hadoop 0.20.205
 
 Thanks a lot in advance!
 -Shaojun

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




Re: config for high memory jobs does not work, please help.

2013-01-18 Thread Shaojun Zhao
I do have this in my command line, and it did not work.
-Dmapred.tasktracker.map.tasks.maximum=2

I also tried changing mapred-site.xml and restarting the tasktracker; it
did not work either. I am sure it will work if I restart everything,
but I really do not want to lose my data on hdfs. So I have not tried
restarting everything.

Best regards,
-Shaojun
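
One note that may help: restarting only the MapReduce daemons leaves HDFS and its data
untouched, so on a stock Apache 0.20.x/1.x install something like the following (script
names as shipped in bin/; EMR manages its daemons differently, so adjust there) is safe
with respect to the data:

  bin/stop-mapred.sh    # stops the JobTracker and all TaskTrackers
  bin/start-mapred.sh   # starts them again; NameNode/DataNodes keep running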


On Fri, Jan 18, 2013 at 12:23 PM, Jeffrey Buell jbu...@vmware.com wrote:
 Try:

 -Dmapred.tasktracker.map.tasks.maximum=1

 Although I usually put this parameter in mapred-site.xml.

 Jeff


 Dear all,

 I know it is best to use small amount of mem in mapper and reduce.
 However, sometimes it is hard to do so. For example, in machine
 learning algorithms, it is common to load the model into mem in the
 mapper step. When the model is big, I have to allocate a lot of mem
 for the mapper.

 Here is my question: how can I config hadoop so that it does not fork
 too many mappers and run out of physical memory?

 My machines have 24G, and I have 100 of them. Each time, hadoop will
 fork 6 mappers on each machine, no matter what config I used. I really
 want to reduce it to what ever number I want, for example, just 1
 mapper per machine.

 Here are the config I tried. (I use streaming, and I pass the config
 in the command line)

 -Dmapred.child.java.opts=-Xmx8000m  -- did not bring down the number of 
 mappers

 -Dmapred.cluster.map.memory.mb=32000 -- did not bring down the number
 of mappers

 Am I missing something here?
 I use Hadoop 0.20.205

 Thanks a lot in advance!
 -Shaojun


Re: Hadoop Scalability

2013-01-18 Thread Arun C Murthy
Obviously the algorithm matters, but here are some very old numbers (things 
today are much better); you do see the 'linear' scaling with both nodes and 
datasets:

http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/
100TB Sort - 97 mins
1000 TB Sort - 975 mins

Arun
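
Taking just the two figures quoted above, the sustained sort throughput stays roughly
constant as the dataset grows tenfold, which is the sense in which the scaling is
called linear (the hardware behind each run is described in the linked post):

  100 TB  /  97 min  ~ 1.03 TB/min
  1000 TB / 975 min  ~ 1.03 TB/min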

On Jan 17, 2013, at 7:09 PM, Thiago Vieira wrote:

 Hello!
 
 Is common to see this sentence: Hadoop Scales Linearly. But, is there any 
 performance evaluation to confirm this? 
 
 In my evaluations, Hadoop processing capacity scales linearly, but not 
 proportional to number of nodes, the processing capacity achieved with 20 
 nodes is not the double of the processing capacity achieved with 10 nodes. Is 
 there any evaluation about this?
 
 Thank you!
 
 --
 Thiago Vieira

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




Re: config for high memory jobs does not work, please help.

2013-01-18 Thread Arun C Murthy
Not sure about EMR, but if you install your own cluster on EC2 you can use the 
configs mentioned here:

 http://hadoop.apache.org/docs/stable/capacity_scheduler.html

Arun

On Jan 18, 2013, at 2:50 PM, Shaojun Zhao wrote:

 I am using Amazon EC2/EMR.
 jps give this
 16600 JobTracker
 2732 RunJar
 2504 StatePusher
 31902 instance-controller.jar
 23553 Jps
 22444 RunJar
 2077 NameNode
 
 I am not sure how I can impose capacityscheduler on ec2/emr machines.
 -Shaojun
 
 On Fri, Jan 18, 2013 at 1:18 PM, Arun C Murthy a...@hortonworks.com wrote:
 Take a look at the CapacityScheduler and 'High RAM' jobs where-by you can 
 run M map slots per node and request, per-job, that you want N (where N = 
 max(1, N, M)).
 
 Some more info:
 http://hadoop.apache.org/docs/stable/capacity_scheduler.html#Resource+based+scheduling
 http://hortonworks.com/blog/understanding-apache-hadoops-capacity-scheduler/
 
 hth,
 Arun
 
 On Jan 18, 2013, at 12:05 PM, Shaojun Zhao wrote:
 
 Dear all,
 
 I know it is best to use small amount of mem in mapper and reduce.
 However, sometimes it is hard to do so. For example, in machine
 learning algorithms, it is common to load the model into mem in the
 mapper step. When the model is big, I have to allocate a lot of mem
 for the mapper.
 
 Here is my question: how can I config hadoop so that it does not fork
 too many mappers and run out of physical memory?
 
 My machines have 24G, and I have 100 of them. Each time, hadoop will
 fork 6 mappers on each machine, no matter what config I used. I really
 want to reduce it to what ever number I want, for example, just 1
 mapper per machine.
 
 Here are the config I tried. (I use streaming, and I pass the config
 in the command line)
 
 -Dmapred.child.java.opts=-Xmx8000m  -- did not bring down the number of 
 mappers
 
 -Dmapred.cluster.map.memory.mb=32000 -- did not bring down the number
 of mappers
 
 Am I missing something here?
 I use Hadoop 0.20.205
 
 Thanks a lot in advance!
 -Shaojun
 
 --
 Arun C. Murthy
 Hortonworks Inc.
 http://hortonworks.com/
 
 

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




Re: Estimating disk space requirements

2013-01-18 Thread Ted Dunning
It is usually better not to subdivide nodes into virtual nodes.  You will
generally get better performance from the original node because you only
pay for the OS once and because your disk I/O will be scheduled better.

If you look at EC2 pricing, however, the spot market often has arbitrage
opportunities where one size node is absurdly cheap relative to others.  In
that case, it pays to scale the individual nodes up or down.

The only reasonable reason to split nodes to very small levels is for
testing and training.

On Fri, Jan 18, 2013 at 2:30 PM, Panshul Whisper ouchwhis...@gmail.comwrote:

 Thnx for the reply Ted,

 You can find 40 GB disks when u make virtual nodes on a cloud like
 Rackspace ;-)

 About the os partitions I did not exactly understand what you meant.
 I have made a server on the cloud.. And I just installed and configured
 hadoop and hbase in the /use/local folder.
 And I am pretty sure it does not have a separate partition for root.

 Please help me explain what u meant and what else precautions should I
 take.

 Thanks,

 Regards,
 Ouch Whisper
 01010101010
 On Jan 18, 2013 11:11 PM, Ted Dunning tdunn...@maprtech.com wrote:

 Where do you find 40gb disks now a days?

 Normally your performance is going to be better with more space but your
 network may be your limiting factor for some computations.  That could give
 you some paradoxical scaling.  Hbase will rarely show this behavior.

 Keep in mind you also want to allow for an os partition. Current standard
 practice is to reserve as much as 100 GB for that partition but in your
 case 10gb better:-)

 Note that if you account for this, the node counts don't scale as simply.
  The overhead of these os partitions goes up with number of nodes.

 On Jan 18, 2013, at 8:55 AM, Panshul Whisper ouchwhis...@gmail.com
 wrote:

 If we look at it with performance in mind,
 is it better to have 20 Nodes with 40 GB HDD
 or is it better to have 10 Nodes with 80 GB HDD?

 they are connected on a gigabit LAN

 Thnx


 On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari 
 jean-m...@spaggiari.org wrote:

 20 nodes with 40 GB will do the work.

 After that you will have to consider performances based on your access
 pattern. But that's another story.

 JM

 2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
  Thank you for the replies,
 
  So I take it that I should have atleast 800 GB on total free space on
  HDFS.. (combined free space of all the nodes connected to the
 cluster). So
  I can connect 20 nodes having 40 GB of hdd on each node to my cluster.
 Will
  this be enough for the storage?
  Please confirm.
 
  Thanking You,
  Regards,
  Panshul.
 
 
  On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari 
  jean-m...@spaggiari.org wrote:
 
  Hi Panshul,
 
  If you have 20 GB with a replication factor set to 3, you have only
  6.6GB available, not 11GB. You need to divide the total space by the
  replication factor.
 
  Also, if you store your JSon into HBase, you need to add the key size
  to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
 
  So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space to
  store it. Without including the key size. Even with a replication
  factor set to 5 you don't have the space.
 
  Now, you can add some compression, but even with a lucky factor of 50%
  you still don't have the space. You will need something like 90%
  compression factor to be able to store this data in your cluster.
 
  A 1T drive is now less than $100... So you might think about replacing
  you 20 GB drives by something bigger.
  to reply to your last question, for your data here, you will need AT
  LEAST 350GB overall storage. But that's a bare minimum. Don't go under
  500GB.
 
  IMHO
 
  JM
 
  2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
   Hello,
  
   I was estimating how much disk space do I need for my cluster.
  
   I have 24 million JSON documents approx. 5kb each
   the Json is to be stored into HBASE with some identifying data in
  coloumns
   and I also want to store the Json for later retrieval based on the
 Id
  data
   as keys in Hbase.
   I have my HDFS replication set to 3
   each node has Hadoop and hbase and Ubuntu installed on it.. so
 approx
   11
  GB
   is available for use on my 20 GB node.
  
   I have no idea, if I have not enabled Hbase replication, is the HDFS
   replication enough to keep the data safe and redundant.
   How much total disk space I will need for the storage of the data.
  
   Please help me estimate this.
  
   Thank you so much.
  
   --
   Regards,
   Ouch Whisper
   010101010101
  
 
 
 
 
  --
  Regards,
  Ouch Whisper
  010101010101
 




 --
 Regards,
 Ouch Whisper
 010101010101
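
Pulling together the numbers quoted in this thread (and ignoring row keys, HBase
overhead, compression, intermediate job output and the OS footprint on each node):

  logical data:              24,000,000 docs x 5 KiB  ~ 114 GiB
  raw HDFS space at 3x:      3 x 114 GiB  ~ 343 GiB   (the "at least 350 GB" figure above)
  proposed cluster capacity: 20 nodes x 40 GB = 800 GB raw

So the 20 x 40 GB layout covers the 3x-replicated data with a bit over 2x headroom,
and sits comfortably above the 500 GB floor suggested earlier in the thread.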




Re: Estimating disk space requirements

2013-01-18 Thread Panshul Whisper
Ah, now I understand what you mean.
I will be creating 20 individual servers on the cloud, rather than creating one
big server and making several virtual nodes inside it.
I will be paying for 20 different nodes, all configured with hadoop and
connected to the cluster.

Thanks for the info :)


On Fri, Jan 18, 2013 at 11:59 PM, Ted Dunning tdunn...@maprtech.com wrote:

 It is usually better to not subdivide nodes into virtual nodes.  You will
 generally get better performance form the original node because you only
 pay for the OS once and because your disk I/O will be scheduled better.

 If you look at EC2 pricing, however, the spot market often has arbitrage
 opportunities where one size node is absurdly cheap relative to others.  In
 that case, it pays to scale the individual nodes up or down.

 The only reasonable reason to split nodes to very small levels is for
 testing and training.


 On Fri, Jan 18, 2013 at 2:30 PM, Panshul Whisper ouchwhis...@gmail.comwrote:

 Thnx for the reply Ted,

 You can find 40 GB disks when u make virtual nodes on a cloud like
 Rackspace ;-)

 About the os partitions I did not exactly understand what you meant.
 I have made a server on the cloud.. And I just installed and configured
 hadoop and hbase in the /use/local folder.
 And I am pretty sure it does not have a separate partition for root.

 Please help me explain what u meant and what else precautions should I
 take.

 Thanks,

 Regards,
 Ouch Whisper
 01010101010
 On Jan 18, 2013 11:11 PM, Ted Dunning tdunn...@maprtech.com wrote:

 Where do you find 40gb disks now a days?

 Normally your performance is going to be better with more space but your
 network may be your limiting factor for some computations.  That could give
 you some paradoxical scaling.  Hbase will rarely show this behavior.

 Keep in mind you also want to allow for an os partition. Current
 standard practice is to reserve as much as 100 GB for that partition but in
 your case 10gb better:-)

 Note that if you account for this, the node counts don't scale as
 simply.  The overhead of these os partitions goes up with number of nodes.

 On Jan 18, 2013, at 8:55 AM, Panshul Whisper ouchwhis...@gmail.com
 wrote:

 If we look at it with performance in mind,
 is it better to have 20 Nodes with 40 GB HDD
 or is it better to have 10 Nodes with 80 GB HDD?

 they are connected on a gigabit LAN

 Thnx


 On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari 
 jean-m...@spaggiari.org wrote:

 20 nodes with 40 GB will do the work.

 After that you will have to consider performances based on your access
 pattern. But that's another story.

 JM

 2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
  Thank you for the replies,
 
  So I take it that I should have atleast 800 GB on total free space on
  HDFS.. (combined free space of all the nodes connected to the
 cluster). So
  I can connect 20 nodes having 40 GB of hdd on each node to my
 cluster. Will
  this be enough for the storage?
  Please confirm.
 
  Thanking You,
  Regards,
  Panshul.
 
 
  On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari 
  jean-m...@spaggiari.org wrote:
 
  Hi Panshul,
 
  If you have 20 GB with a replication factor set to 3, you have only
  6.6GB available, not 11GB. You need to divide the total space by the
  replication factor.
 
  Also, if you store your JSon into HBase, you need to add the key size
  to it. If you key is 4 bytes, or 1024 bytes, it makes a difference.
 
  So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space
 to
  store it. Without including the key size. Even with a replication
  factor set to 5 you don't have the space.
 
  Now, you can add some compression, but even with a lucky factor of
 50%
  you still don't have the space. You will need something like 90%
  compression factor to be able to store this data in your cluster.
 
  A 1T drive is now less than $100... So you might think about
 replacing
  you 20 GB drives by something bigger.
  to reply to your last question, for your data here, you will need AT
  LEAST 350GB overall storage. But that's a bare minimum. Don't go
 under
  500GB.
 
  IMHO
 
  JM
 
  2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
   Hello,
  
   I was estimating how much disk space do I need for my cluster.
  
   I have 24 million JSON documents approx. 5kb each
   the Json is to be stored into HBASE with some identifying data in
  coloumns
   and I also want to store the Json for later retrieval based on the
 Id
  data
   as keys in Hbase.
   I have my HDFS replication set to 3
   each node has Hadoop and hbase and Ubuntu installed on it.. so
 approx
   11
  GB
   is available for use on my 20 GB node.
  
   I have no idea, if I have not enabled Hbase replication, is the
 HDFS
   replication enough to keep the data safe and redundant.
   How much total disk space I will need for the storage of the data.
  
   Please help me estimate this.
  
   Thank you so much.
  
   --
   Regards,
   Ouch Whisper
   010101010101
  
 
 
 
 
  

Re: Spring for hadoop

2013-01-18 Thread Jilani Shaik
Yes, we have used Spring for Apache Hadoop (Spring Data) for reading data from
and writing data to HBase.

We used the link below for the implementation in our project.

http://static.springsource.org/spring-hadoop/docs/current/reference/html/hbase.html

Thank you,
Jilani



On Sat, Jan 19, 2013 at 4:06 AM, Mohammad Tariq donta...@gmail.com wrote:

 You might find this link http://www.springsource.org/spring-data/hadoop
  useful.

 Warm Regards,
 Tariq
 https://mtariq.jux.com/
 cloudfront.blogspot.com


 On Sat, Jan 19, 2013 at 4:04 AM, Panshul Whisper ouchwhis...@gmail.comwrote:

 Hello,

 I was wondering if anyone is using spring for hadoop to execute map
 reduce jobs or to perform hbase operations on a hadoop cluster using spring
 data for hadoop.
 Please suggest me a working example as I am unable to find any working
 sample and spring data documentation is of no use for beginners.

 Thanks

 Regards,
 Ouch Whisper
 01010101010





Re: Spring for hadoop

2013-01-18 Thread Jilani Shaik
Hi,

Please find below the URL where you will find sample code for Spring for Apache Hadoop.

https://github.com/SpringSource/spring-hadoop-samples

Thank you,
Jilani


On Sat, Jan 19, 2013 at 11:43 AM, Jilani Shaik jilani2...@gmail.com wrote:

 Yes, We have used spring hadoop data for our hbase data reading and
 writing to HBase.

 We have used the below link for implementation in our project.


 http://static.springsource.org/spring-hadoop/docs/current/reference/html/hbase.html

 Thank you,
 Jilani



 On Sat, Jan 19, 2013 at 4:06 AM, Mohammad Tariq donta...@gmail.comwrote:

 You might find this link http://www.springsource.org/spring-data/hadoop
  useful.

 Warm Regards,
 Tariq
 https://mtariq.jux.com/
 cloudfront.blogspot.com


 On Sat, Jan 19, 2013 at 4:04 AM, Panshul Whisper 
 ouchwhis...@gmail.comwrote:

 Hello,

 I was wondering if anyone is using spring for hadoop to execute map
 reduce jobs or to perform hbase operations on a hadoop cluster using spring
 data for hadoop.
 Please suggest me a working example as I am unable to find any working
 sample and spring data documentation is of no use for beginners.

 Thanks

 Regards,
 Ouch Whisper
 01010101010