Re: Doubt in DoubleWritable

2015-11-23 Thread unmesha sreeveni
Please try this
for (DoubleArrayWritable avalue : values) {
    Writable[] value = avalue.get();
    // DoubleWritable[] value = new DoubleWritable[6];
    // for (int k = 0; k < 6; k++) {
    //     value[k] = DoubleWritable(wvalue[k]);
    // }
    // parse accordingly
    if (Double.parseDouble(value[1].toString()) != 0) {
        total_records_Temp = total_records_Temp + 1;
        sumvalueTemp = sumvalueTemp + Double.parseDouble(value[0].toString());
    }
    if (Double.parseDouble(value[3].toString()) != 0) {
        total_records_Dewpoint = total_records_Dewpoint + 1;
        sumvalueDewpoint = sumvalueDewpoint + Double.parseDouble(value[2].toString());
    }
    if (Double.parseDouble(value[5].toString()) != 0) {
        total_records_Windspeed = total_records_Windspeed + 1;
        sumvalueWindspeed = sumvalueWindspeed + Double.parseDouble(value[4].toString());
    }
}
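
The snippet assumes a DoubleArrayWritable class that is not shown in this
mail. ArrayWritable needs a small subclass with a no-arg constructor that
fixes the element type, otherwise the framework cannot deserialize the
values. A minimal sketch (the constructor taking DoubleWritable[] is just a
convenience):

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.DoubleWritable;

public class DoubleArrayWritable extends ArrayWritable {
	// Required so Hadoop can instantiate and read values of this type.
	public DoubleArrayWritable() {
		super(DoubleWritable.class);
	}

	public DoubleArrayWritable(DoubleWritable[] values) {
		super(DoubleWritable.class, values);
	}
}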
Attaching the code below.


-- 
*Thanks & Regards *


*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Centre for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/
//cc MaxTemperature Application to find the maximum temperature in the weather dataset
//vv MaxTemperature
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.conf.Configuration;

public class MapReduce {

	public static void main(String[] args) throws Exception {
		if (args.length != 2) {
			System.err
					.println("Usage: MaxTemperature <input path> <output path>");
			System.exit(-1);
		}

		/*
		 * Job job = new Job(); job.setJarByClass(MaxTemperature.class);
		 * job.setJobName("Max temperature");
		 */

		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf);
		Job job = Job.getInstance(conf, "AverageTempValues");

		/*
		 * Deleting the output directory so the same dir can be reused
		 */
		Path dest = new Path(args[1]);
		if(fs.exists(dest)){
			fs.delete(dest, true);
		}
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		job.setNumReduceTasks(2);

		job.setMapperClass(NewMapper.class);
		job.setReducerClass(NewReducer.class);

		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(DoubleArrayWritable.class);

		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}
// ^^ MaxTemperature
// cc MaxTemperatureMapper Mapper for maximum temperature example
// vv MaxTemperatureMapper
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NewMapper extends
		Mapper<LongWritable, Text, Text, DoubleArrayWritable> {

	public void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {

		String Str = value.toString();
		String[] Mylist = new String[1000];
		int i = 0;

		for (String retval : Str.split("\\s+")) {
			System.out.println(retval);
			Mylist[i++] = retval;

		}
		String Val = Mylist[2];
		String Year = Val.substring(0, 4);
		String Month = Val.substring(5, 6);
		String[] Section = Val.split("_");

		String section_string = "0";
		if (Section[1].matches("^(0|1|2|3|4|5)$")) {
			section_string = "4";
		} else if (Section[1].matches("^(6|7|8|9|10|11)$")) {
			section_string = "1";
		} else if (Section[1].matches("^(12|13|14|15|16|17)$")) {
			section_string = "2";
		} else if (Section[1].matches("^(18|19|20|21|22|23)$")) {
			section_string = "3";
		}

		DoubleWritable[] array = new DoubleWritable[6];
		DoubleArrayWritable output = new DoubleArrayWritable();
		// Values: [0] temperature, [2] dew point, [4] wind speed; the element
		// after each value is a validity flag (1 = valid, 0 = missing, i.e. 999.9).
		array[0] = new DoubleWritable(Double.parseDouble(Mylist[3]));
		array[2] = new DoubleWritable(Double.parseDouble(Mylist[4]));
		array[4] = new DoubleWritable(Double.parseDouble(Mylist[12]));
		for (int j = 0; j < 6; j = j + 2) {
			if (999.9 == array[j].get()) {
				array[j + 1] = new DoubleWritable(0);
			} else {
				array[j + 1] = new DoubleWritable(1);
			}
		}
		output.set(array);
		context.write(new Text(Year + section_string + Month), output);
	}
}
//cc MaxTemperatureReducer Reducer for maximum temperature example
//vv MaxTemperatureReducer
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class NewReducer extends
		Reducer<Text, DoubleArrayWritable, Text, DoubleArrayWritable> {

	@Override
	public void reduce(Text key, Iterable<DoubleArrayWritable> values,
			Context context) throws IOException, InterruptedException {
		double sumvalueTemp = 0;
		double sumvalueDewpoint = 0;
		double sumvalueWindspeed = 0;
		double total_records_Temp = 0;
		double total_records_Dewpoint = 0;
		double total_records_Windspeed = 0;
		double average_Temp = Integer.MIN_VALUE;
		double average_Dewpoint = Integer.MIN_VALUE;
		double average_Windspeed = Integer.MIN_VALUE;
		DoubleWritable[] temp = new DoubleWritable[3];
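
The attached reducer is cut off at this point in the archive. Based on the
accumulation loop at the top of this message, the remaining part would
roughly compute and emit the averages; this is a rough sketch only, and the
emitted layout is an assumption, not the original code:

		// ...accumulate sums and counts as in the loop quoted at the top, then:
		if (total_records_Temp != 0) {
			average_Temp = sumvalueTemp / total_records_Temp;
		}
		if (total_records_Dewpoint != 0) {
			average_Dewpoint = sumvalueDewpoint / total_records_Dewpoint;
		}
		if (total_records_Windspeed != 0) {
			average_Windspeed = sumvalueWindspeed / total_records_Windspeed;
		}
		temp[0] = new DoubleWritable(average_Temp);
		temp[1] = new DoubleWritable(average_Dewpoint);
		temp[2] = new DoubleWritable(average_Windspeed);
		DoubleArrayWritable result = new DoubleArrayWritable();
		result.set(temp);
		context.write(key, result);
	}
}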

Re: Doubt Regarding QJM protocol - example 2.10.6 of Quorum-Journal Design document

2014-09-28 Thread Ulul

Hi

A developer should answer that, but a quick look at an edits file with od 
suggests that records are not fixed length. So maybe the likelihood of 
the situation you suggest is so low that there is no need to check more 
than the file size.


Ulul

On 28/09/2014 11:17, Giridhar Addepalli wrote:

Hi All,

I am going through Quorum Journal Design document.

It is mentioned in Section 2.8 - In Accept Recovery RPC section

If the current on-disk log is missing, or a *different length* than 
the proposed recovery, the JN downloads the log from the provided URI, 
replacing any current copy of the log segment.



I can see that the code follows the above design.

Source :: Journal.java

  public synchronized void acceptRecovery(RequestInfo reqInfo,
      SegmentStateProto segment, URL fromUrl)
      throws IOException {
    ...
    if (currentSegment == null ||
        currentSegment.getEndTxId() != segment.getEndTxId()) {
      ...
    } else {
      LOG.info("Skipping download of log " +
          TextFormat.shortDebugString(segment) +
          ": already have up-to-date logs");
    }
    ...
  }


My question is: what if the on-disk log is present and is of the *same length* 
as the proposed recovery?


If the JournalNode is skipping the download because the logs are of the same 
length, then we could end up in a situation where finalized log 
segments contain different data!


This could happen if we follow example 2.10.6

As per that example, we write transactions (151-153) on JN1; 
then, when recovery proceeded with only JN2 & JN3, let us assume that we 
write again *different transactions* as (151-153). Then, after the 
crash, when we run recovery, JN1 will skip downloading the correct segment 
from JN2/JN3 as it thinks it already has the correct segment (as per the code 
pasted above). This will result in a situation where the finalized segment 
(edits_151-153) on JN1 is different from the finalized segment 
edits_151-153 on JN2/JN3.


Please let me know if I have gone wrong somewhere, and whether this situation 
is taken care of.


Thanks,
Giridhar.




RE: Doubt regarding Binary Compatibility\Source Compatibility with old *mapred* APIs and new *mapreduce* APIs in Hadoop

2014-04-15 Thread Radhe Radhe
Thanks John for your comments,

I believe MRv2 has support for both the old *mapred* APIs and new *mapreduce* 
APIs.

I see it this way:
[1.]  One may have the binaries, i.e. the jar file, of the M\R program that 
used the old *mapred* APIs. This will work directly on MRv2 (YARN).
[2.]  One may have the source code, i.e. the Java programs, of the M\R program 
that used the old *mapred* APIs. For this I need to recompile and generate the 
binaries, i.e. the jar file. Do I have to change the old *org.apache.hadoop.mapred* 
APIs to the new *org.apache.hadoop.mapreduce* APIs, or are no code changes needed?
-RR
 Date: Mon, 14 Apr 2014 10:37:56 -0400
 Subject: Re: Doubt regarding Binary Compatibility\Source Compatibility with 
 old *mapred* APIs and new *mapreduce* APIs in Hadoop
 From: john.meag...@gmail.com
 To: user@hadoop.apache.org
 
 Also, Source Compatibility also means ONLY a recompile is needed.
 No code changes should be needed.
 
 On Mon, Apr 14, 2014 at 10:37 AM, John Meagher john.meag...@gmail.com wrote:
  Source Compatibility = you need to recompile and use the new version
  as part of the compilation
 
  Binary Compatibility = you can take something compiled against the old
  version and run it on the new version
 
  On Mon, Apr 14, 2014 at 9:19 AM, Radhe Radhe
  radhe.krishna.ra...@live.com wrote:
  Hello People,
 
  As per the Apache site
  http://hadoop.apache.org/docs/r2.3.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html
 
  Binary Compatibility
  
  First, we ensure binary compatibility to the applications that use old
  mapred APIs. This means that applications which were built against MRv1
  mapred APIs can run directly on YARN without recompilation, merely by
  pointing them to an Apache Hadoop 2.x cluster via configuration.
 
  Source Compatibility
  
  We cannot ensure complete binary compatibility with the applications that
  use mapreduce APIs, as these APIs have evolved a lot since MRv1. However, 
  we
  ensure source compatibility for mapreduce APIs that break binary
  compatibility. In other words, users should recompile their applications
  that use mapreduce APIs against MRv2 jars. One notable binary
  incompatibility break is Counter and CounterGroup.
 
  For Binary Compatibility I understand that if I had build a MR job with
  old *mapred* APIs then they can be run directly on YARN without and 
  changes.
 
  Can anybody explain what do we mean by Source Compatibility here and also
  a use case where one will need it?
 
  Does that mean code changes if I already have a MR job source code written
  with with old *mapred* APIs and I need to make some changes to it to run in
  then I need to use the new mapreduce* API and generate the new  binaries?
 
  Thanks,
  -RR
 
 
  

Re: Doubt regarding Binary Compatibility\Source Compatibility with old *mapred* APIs and new *mapreduce* APIs in Hadoop

2014-04-15 Thread Zhijie Shen
1. If you have the binaries that were compiled against MRv1 *mapred* libs,
it should just work with MRv2.
2. If you have the source code that refers to MRv1 *mapred* libs, it should
be compilable without code changes. Of course, you're free to change your
code.
3. If you have the binaries that were compiled against MRv1 *mapreduce* libs,
it may not be executable directly with MRv2, but you should be able to compile
it against MRv2 *mapreduce* libs without code changes, and execute it.
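
To make the difference concrete, here is an illustration (not code from this
thread): two minimal identity mappers, one per API family. The first, old
*mapred*-style mapper is what binary compatibility covers; the second, new
*mapreduce*-style mapper is what may need a recompile against MRv2 jars but
should not need code changes:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Old API (org.apache.hadoop.mapred): interface based.
class OldApiIdentityMapper extends org.apache.hadoop.mapred.MapReduceBase
    implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, LongWritable, Text> {
  public void map(LongWritable key, Text value,
      org.apache.hadoop.mapred.OutputCollector<LongWritable, Text> output,
      org.apache.hadoop.mapred.Reporter reporter) throws IOException {
    output.collect(key, value);
  }
}

// New API (org.apache.hadoop.mapreduce): abstract class based.
class NewApiIdentityMapper
    extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(key, value);
  }
}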

- Zhijie


On Tue, Apr 15, 2014 at 12:44 PM, Radhe Radhe
radhe.krishna.ra...@live.comwrote:

 Thanks John for your comments,

 I believe MRv2 has support for both the old *mapred* APIs and new
 *mapreduce* APIs.

 I see this way:
 [1.]  One may have binaries i.e. jar file of the M\R program that used old
 *mapred* APIs
 This will work directly on MRv2(YARN).

 [2.]  One may have the source code i.e. Java Programs of the M\R program
 that used old *mapred* APIs
 For this I need to recompile and generate the binaries i.e. jar file.
 Do I have to change the old *org.apache.hadoop.mapred* APIs to new *
 org.apache.hadoop.mapreduce* APIs? or No code changes are needed?

 -RR

  Date: Mon, 14 Apr 2014 10:37:56 -0400
  Subject: Re: Doubt regarding Binary Compatibility\Source Compatibility
 with old *mapred* APIs and new *mapreduce* APIs in Hadoop
  From: john.meag...@gmail.com
  To: user@hadoop.apache.org

 
  Also, Source Compatibility also means ONLY a recompile is needed.
  No code changes should be needed.
 
  On Mon, Apr 14, 2014 at 10:37 AM, John Meagher john.meag...@gmail.com
 wrote:
   Source Compatibility = you need to recompile and use the new version
   as part of the compilation
  
   Binary Compatibility = you can take something compiled against the old
   version and run it on the new version
  
   On Mon, Apr 14, 2014 at 9:19 AM, Radhe Radhe
   radhe.krishna.ra...@live.com wrote:
   Hello People,
  
   As per the Apache site
  
 http://hadoop.apache.org/docs/r2.3.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html
  
   Binary Compatibility
   
   First, we ensure binary compatibility to the applications that use old
   mapred APIs. This means that applications which were built against
 MRv1
   mapred APIs can run directly on YARN without recompilation, merely by
   pointing them to an Apache Hadoop 2.x cluster via configuration.
  
   Source Compatibility
   
   We cannot ensure complete binary compatibility with the applications
 that
   use mapreduce APIs, as these APIs have evolved a lot since MRv1.
 However, we
   ensure source compatibility for mapreduce APIs that break binary
   compatibility. In other words, users should recompile their
 applications
   that use mapreduce APIs against MRv2 jars. One notable binary
   incompatibility break is Counter and CounterGroup.
  
   For Binary Compatibility I understand that if I had build a MR job
 with
   old *mapred* APIs then they can be run directly on YARN without and
 changes.
  
   Can anybody explain what do we mean by Source Compatibility here
 and also
   a use case where one will need it?
  
   Does that mean code changes if I already have a MR job source code
 written
   with with old *mapred* APIs and I need to make some changes to it to
 run in
   then I need to use the new mapreduce* API and generate the new
 binaries?
  
   Thanks,
   -RR
  
  




-- 
Zhijie Shen
Hortonworks Inc.
http://hortonworks.com/



RE: Doubt regarding Binary Compatibility\Source Compatibility with old *mapred* APIs and new *mapreduce* APIs in Hadoop

2014-04-15 Thread Radhe Radhe
Thanks Zhijie for the explanation.
Regarding #3: if I have ONLY the binaries, i.e. the jar file (compiled\built against 
the old MRv1 mapred APIs), then how can I compile it, since I don't have the source 
code, i.e. the Java files? All I can do with the binaries, i.e. the jar file, is execute it. 
-RR
Date: Tue, 15 Apr 2014 13:03:53 -0700
Subject: Re: Doubt regarding Binary Compatibility\Source Compatibility with old 
*mapred* APIs and new *mapreduce* APIs in Hadoop
From: zs...@hortonworks.com
To: user@hadoop.apache.org

1. If you have the binaries that were compiled against MRv1 mapred libs, it 
should just work with MRv2.
2. If you have the source code that refers to MRv1 mapred libs, it should be 
compilable without code changes. Of course, you're free to change your code.
3. If you have the binaries that were compiled against MRv1 mapreduce libs, it 
may not be executable directly with MRv2, but you should be able to compile it 
against MRv2 mapreduce libs without code changes, and execute it.

- Zhijie

On Tue, Apr 15, 2014 at 12:44 PM, Radhe Radhe radhe.krishna.ra...@live.com 
wrote:




Thanks John for your comments,
I believe MRv2 has support for both the old *mapred* APIs and new *mapreduce* 
APIs.
I see this way:[1.]  One may have binaries i.e. jar file of the M\R program 
that used old *mapred* APIs
This will work directly on MRv2(YARN).
[2.]  One may have the source code i.e. Java Programs of the M\R program that 
used old *mapred* APIs
For this I need to recompile and generate the binaries i.e. jar file. Do I have 
to change the old *org.apache.hadoop.mapred* APIs to new 
*org.apache.hadoop.mapreduce* APIs? or No code changes are needed?

-RR
 Date: Mon, 14 Apr 2014 10:37:56 -0400
 Subject: Re: Doubt regarding Binary Compatibility\Source Compatibility with 
 old *mapred* APIs and new *mapreduce* APIs in Hadoop

 From: john.meag...@gmail.com
 To: user@hadoop.apache.org
 

 Also, Source Compatibility also means ONLY a recompile is needed.
 No code changes should be needed.
 
 On Mon, Apr 14, 2014 at 10:37 AM, John Meagher john.meag...@gmail.com wrote:

  Source Compatibility = you need to recompile and use the new version
  as part of the compilation
 
  Binary Compatibility = you can take something compiled against the old
  version and run it on the new version

 
  On Mon, Apr 14, 2014 at 9:19 AM, Radhe Radhe
  radhe.krishna.ra...@live.com wrote:
  Hello People,

 
  As per the Apache site
  http://hadoop.apache.org/docs/r2.3.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html

 
  Binary Compatibility
  
  First, we ensure binary compatibility to the applications that use old
  mapred APIs. This means that applications which were built against MRv1

  mapred APIs can run directly on YARN without recompilation, merely by
  pointing them to an Apache Hadoop 2.x cluster via configuration.
 
  Source Compatibility

  
  We cannot ensure complete binary compatibility with the applications that
  use mapreduce APIs, as these APIs have evolved a lot since MRv1. However, 
  we

  ensure source compatibility for mapreduce APIs that break binary
  compatibility. In other words, users should recompile their applications
  that use mapreduce APIs against MRv2 jars. One notable binary

  incompatibility break is Counter and CounterGroup.
 
  For Binary Compatibility I understand that if I had build a MR job with
  old *mapred* APIs then they can be run directly on YARN without and 
  changes.

 
  Can anybody explain what do we mean by Source Compatibility here and also
  a use case where one will need it?
 
  Does that mean code changes if I already have a MR job source code written

  with with old *mapred* APIs and I need to make some changes to it to run in
  then I need to use the new mapreduce* API and generate the new  binaries?
 
  Thanks,

  -RR
 
 
  


-- 
Zhijie Shen
Hortonworks Inc.
http://hortonworks.com/






Re: Doubt regarding Binary Compatibility\Source Compatibility with old *mapred* APIs and new *mapreduce* APIs in Hadoop

2014-04-15 Thread Zhijie Shen
bq. Regarding #3 if I have ONLY the binaries i.e. jar file (compiled\build
against old MRv1 mapred APIS)

Which APIs are you talking about, *mapred* or *mapreduce*? In #3, I was
talking about *mapreduce*. If this is the case, you may be in trouble,
unfortunately, because MRv2 has evolved so much in the *mapreduce* APIs that
it's difficult to ensure binary compatibility. Anyway, you should still try
your luck, as your binaries may not use the incompatible APIs. On the other
hand, if you meant the *mapred* APIs instead, your binaries should just work.

- Zhijie


On Tue, Apr 15, 2014 at 1:35 PM, Radhe Radhe
radhe.krishna.ra...@live.comwrote:

 Thanks Zhijie for the explanation.

 Regarding #3 if I have ONLY the binaries i.e. jar file (compiled\build
 against old MRv1 *mapred* APIS) then how can I compile it since I don't
 have the source code i.e. Java files. All I can do with binaries i.e. jar
 file is execute it.

 -RR
 --
 Date: Tue, 15 Apr 2014 13:03:53 -0700

 Subject: Re: Doubt regarding Binary Compatibility\Source Compatibility
 with old *mapred* APIs and new *mapreduce* APIs in Hadoop
 From: zs...@hortonworks.com
 To: user@hadoop.apache.org


 1. If you have the binaries that were compiled against MRv1 *mapred* libs, it 
 should just work with MRv2.
 2. If you have the source code that refers to MRv1 *mapred* libs, it
 should be compilable without code changes. Of course, you're free to change
 your code.
 3. If you have the binaries that were compiled against MRv1 *mapreduce* libs,
 it may not be executable directly with MRv2, but you should able to compile
 it against MRv2 *mapreduce* libs without code changes, and execute it.

 - Zhijie


 On Tue, Apr 15, 2014 at 12:44 PM, Radhe Radhe 
 radhe.krishna.ra...@live.com wrote:

 Thanks John for your comments,

 I believe MRv2 has support for both the old *mapred* APIs and new
 *mapreduce* APIs.

 I see this way:
 [1.]  One may have binaries i.e. jar file of the M\R program that used old
 *mapred* APIs
 This will work directly on MRv2(YARN).

 [2.]  One may have the source code i.e. Java Programs of the M\R program
 that used old *mapred* APIs
 For this I need to recompile and generate the binaries i.e. jar file.
 Do I have to change the old *org.apache.hadoop.mapred* APIs to new *
 org.apache.hadoop.mapreduce* APIs? or No code changes are needed?

 -RR

  Date: Mon, 14 Apr 2014 10:37:56 -0400
  Subject: Re: Doubt regarding Binary Compatibility\Source Compatibility
 with old *mapred* APIs and new *mapreduce* APIs in Hadoop
  From: john.meag...@gmail.com
  To: user@hadoop.apache.org

 
  Also, Source Compatibility also means ONLY a recompile is needed.
  No code changes should be needed.
 
  On Mon, Apr 14, 2014 at 10:37 AM, John Meagher john.meag...@gmail.com
 wrote:
   Source Compatibility = you need to recompile and use the new version
   as part of the compilation
  
   Binary Compatibility = you can take something compiled against the old
   version and run it on the new version
  
   On Mon, Apr 14, 2014 at 9:19 AM, Radhe Radhe
   radhe.krishna.ra...@live.com wrote:
   Hello People,
  
   As per the Apache site
  
 http://hadoop.apache.org/docs/r2.3.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html
  
   Binary Compatibility
   
   First, we ensure binary compatibility to the applications that use old
   mapred APIs. This means that applications which were built against
 MRv1
   mapred APIs can run directly on YARN without recompilation, merely by
   pointing them to an Apache Hadoop 2.x cluster via configuration.
  
   Source Compatibility
   
   We cannot ensure complete binary compatibility with the applications
 that
   use mapreduce APIs, as these APIs have evolved a lot since MRv1.
 However, we
   ensure source compatibility for mapreduce APIs that break binary
   compatibility. In other words, users should recompile their
 applications
   that use mapreduce APIs against MRv2 jars. One notable binary
   incompatibility break is Counter and CounterGroup.
  
   For Binary Compatibility I understand that if I had build a MR job
 with
   old *mapred* APIs then they can be run directly on YARN without and
 changes.
  
   Can anybody explain what do we mean by Source Compatibility here
 and also
   a use case where one will need it?
  
   Does that mean code changes if I already have a MR job source code
 written
   with with old *mapred* APIs and I need to make some changes to it to
 run in
   then I need to use the new mapreduce* API and generate the new
 binaries?
  
   Thanks,
   -RR
  
  




 --
 Zhijie Shen
 Hortonworks Inc.
 http://hortonworks.com/


Re: Doubt regarding Binary Compatibility\Source Compatibility with old *mapred* APIs and new *mapreduce* APIs in Hadoop

2014-04-14 Thread John Meagher
Also, Source Compatibility also means ONLY a recompile is needed.
No code changes should be needed.

On Mon, Apr 14, 2014 at 10:37 AM, John Meagher john.meag...@gmail.com wrote:
 Source Compatibility = you need to recompile and use the new version
 as part of the compilation

 Binary Compatibility = you can take something compiled against the old
 version and run it on the new version

 On Mon, Apr 14, 2014 at 9:19 AM, Radhe Radhe
 radhe.krishna.ra...@live.com wrote:
 Hello People,

 As per the Apache site
 http://hadoop.apache.org/docs/r2.3.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html

 Binary Compatibility
 
 First, we ensure binary compatibility to the applications that use old
 mapred APIs. This means that applications which were built against MRv1
 mapred APIs can run directly on YARN without recompilation, merely by
 pointing them to an Apache Hadoop 2.x cluster via configuration.

 Source Compatibility
 
 We cannot ensure complete binary compatibility with the applications that
 use mapreduce APIs, as these APIs have evolved a lot since MRv1. However, we
 ensure source compatibility for mapreduce APIs that break binary
 compatibility. In other words, users should recompile their applications
 that use mapreduce APIs against MRv2 jars. One notable binary
 incompatibility break is Counter and CounterGroup.

 For Binary Compatibility I understand that if I had build a MR job with
 old *mapred* APIs then they can be run directly on YARN without and changes.

 Can anybody explain what do we mean by Source Compatibility here and also
 a use case where one will need it?

 Does that mean code changes if I already have a MR job source code written
 with with old *mapred* APIs and I need to make some changes to it to run in
 then I need to use the new mapreduce* API and generate the new  binaries?

 Thanks,
 -RR




Re: Doubt

2014-03-19 Thread Jay Vyas
Certainly it is , and quite common especially if you have some high
performance machines : they  can run as mapreduce slaves and also double as
mongo hosts.  The problem would of course be that when running mapreduce
jobs you might have very slow network bandwidth at times, and if your front
end needs fast response times all the time from mongo instances you could
be in trouble.



On Wed, Mar 19, 2014 at 11:50 AM, praveenesh kumar praveen...@gmail.comwrote:

 Why not ? Its just a matter of installing 2 different packages.
 Depends on what do you want to use it for, you need to take care of few
 things, but as far as installation is concerned, it should be easily doable.

 Regards
 Prav


 On Wed, Mar 19, 2014 at 3:41 PM, sri harsha rsharsh...@gmail.com wrote:

 Hi all,
 is it possible to install Mongodb on the same VM which consists hadoop?

 --
 amiable harsha





-- 
Jay Vyas
http://jayunit100.blogspot.com


Re: Doubt

2014-03-19 Thread sri harsha
Thanks Jay and Praveen,
I want to use both separately; I don't want to use MongoDB in place of
HBase.


On Wed, Mar 19, 2014 at 9:25 PM, Jay Vyas jayunit...@gmail.com wrote:

 Certainly it is , and quite common especially if you have some high
 performance machines : they  can run as mapreduce slaves and also double as
 mongo hosts.  The problem would of course be that when running mapreduce
 jobs you might have very slow network bandwidth at times, and if your front
 end needs fast response times all the time from mongo instances you could
 be in trouble.



 On Wed, Mar 19, 2014 at 11:50 AM, praveenesh kumar 
 praveen...@gmail.comwrote:

 Why not ? Its just a matter of installing 2 different packages.
 Depends on what do you want to use it for, you need to take care of few
 things, but as far as installation is concerned, it should be easily doable.

 Regards
 Prav


 On Wed, Mar 19, 2014 at 3:41 PM, sri harsha rsharsh...@gmail.com wrote:

 Hi all,
 is it possible to install Mongodb on the same VM which consists hadoop?

 --
 amiable harsha





 --
 Jay Vyas
 http://jayunit100.blogspot.com




-- 
amiable harsha


Re: Doubt

2014-03-19 Thread praveenesh kumar
Why not? It's just a matter of installing 2 different packages.
Depending on what you want to use it for, you need to take care of a few
things, but as far as installation is concerned, it should be easily doable.

Regards
Prav


On Wed, Mar 19, 2014 at 3:41 PM, sri harsha rsharsh...@gmail.com wrote:

 Hi all,
 is it possible to install Mongodb on the same VM which consists hadoop?

 --
 amiable harsha



Re: doubt

2014-01-19 Thread Justin Black
I've installed a hadoop single node cluster on a VirtualBox machine running
ubuntu 12.04LTS (64-bit) with 512MB RAM and 8GB HD. I haven't seen any
errors in my testing yet. Is 1GB RAM required? Will I run into issues when
I expand the cluster?


On Sat, Jan 18, 2014 at 11:24 PM, Alexander Pivovarov
apivova...@gmail.comwrote:

 It's enough. Hadoop uses only 1GB RAM by default.


 On Sat, Jan 18, 2014 at 10:11 PM, sri harsha rsharsh...@gmail.com wrote:

 Hi ,
 i want to install 4 node cluster in 64-bit LINUX. 4GB RAM 500HD is enough
 for this or shall i need to expand ?
 please suggest about my query.

 than x

 --
 amiable harsha





-- 
-jblack


Re: doubt

2014-01-18 Thread Alexander Pivovarov
It's enough. Hadoop uses only 1GB RAM by default.


On Sat, Jan 18, 2014 at 10:11 PM, sri harsha rsharsh...@gmail.com wrote:

 Hi ,
 i want to install 4 node cluster in 64-bit LINUX. 4GB RAM 500HD is enough
 for this or shall i need to expand ?
 please suggest about my query.

 than x

 --
 amiable harsha



Re: Doubt on Input and Output Mapper - Key value pairs

2012-11-07 Thread Harsh J
The answer (a) is correct, in general.

On Wed, Nov 7, 2012 at 6:09 PM, Ramasubramanian Narayanan
ramasubramanian.naraya...@gmail.com wrote:
 Hi,

 Which of the following is correct w.r.t mapper.

 (a) It accepts a single key-value pair as input and can emit any number of
 key-value pairs as output, including zero.
 (b) It accepts a single key-value pair as input and emits a single key and
 list of corresponding values as output


 regards,
 Rams



-- 
Harsh J


Re: Doubt on Input and Output Mapper - Key value pairs

2012-11-07 Thread Mahesh Balija
Hi Rams,

   A mapper will accept a single key-value pair as input and can emit
0 or more key-value pairs, based on what you want to do in the mapper function
(I mean based on your business logic in the mapper function).
   The framework will then aggregate the list of values
associated with a given key and send the key and the list of values to the
reducer function.
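
As a small illustration of (a), not part of the original question: a mapper
that splits each line into words and emits one pair per word, so an empty
line produces zero output pairs and a long line produces many:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Zero, one, or many output pairs per input pair, depending on the line.
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);
    }
  }
}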

Best,
Mahesh Balija.

On Wed, Nov 7, 2012 at 6:09 PM, Ramasubramanian Narayanan 
ramasubramanian.naraya...@gmail.com wrote:

 Hi,

 Which of the following is correct w.r.t mapper.

 (a) It accepts a single key-value pair as input and can emit any number of
 key-value pairs as output, including zero.
 (b) It accepts a single key-value pair as input and emits a single key and
 list of corresponding values as output


 regards,
 Rams



Re: doubt about reduce tasks and block writes

2012-08-26 Thread Raj Vishwanathan
Harsh

I did leave an escape route open with a bit about corner cases :-) 

Anyway, I agree that HDFS has no notion of block 0. I just meant that if 
dfs.replication is 1, there will be, under normal circumstances :-), no blocks 
of the output file written to node A.

Raj


- Original Message -
 From: Harsh J ha...@cloudera.com
 To: common-user@hadoop.apache.org; Raj Vishwanathan rajv...@yahoo.com
 Cc: 
 Sent: Saturday, August 25, 2012 4:02 AM
 Subject: Re: doubt about reduce tasks and block writes
 
 Raj's almost right. In times of high load or space fillup on a local
 DN, the NameNode may decide to instead pick a non-local DN for
 replica-writing. In this way, the Node A may get a copy 0 of a
 replica from a task. This is per the default block placement policy.
 
 P.s. Note that HDFS hardly makes any differences between replicas,
 hence there is no hard-concept of a copy 0 or copy 1 
 block, at the
 NN level, it treats all DNs in pipeline equally and same for replicas.
 
 On Sat, Aug 25, 2012 at 4:14 AM, Raj Vishwanathan rajv...@yahoo.com 
 wrote:
  But since node A has no TT running, it will not run map or reduce tasks. 
 When the reducer node writes the output file, the fist block will be written 
 on 
 the local node and never on node A.
 
  So, to answer the question, Node A will contain copies of blocks of all 
 output files. It wont contain the copy 0 of any output file.
 
 
  I am reasonably sure about this , but there could be corner cases in case 
 of node failure and such like! I need to look into the code.
 
 
  Raj
 
  From: Marc Sturlese marc.sturl...@gmail.com
 To: hadoop-u...@lucene.apache.org
 Sent: Friday, August 24, 2012 1:09 PM
 Subject: doubt about reduce tasks and block writes
 
 Hey there,
 I have a doubt about reduce tasks and block writes. Do a reduce task 
 always
 first write to hdfs in the node where they it is placed? (and then these
 blocks would be replicated to other nodes)
 In case yes, if I have a cluster of 5 nodes, 4 of them run DN and TT and 
 one
 (node A) just run DN, when running MR jobs, map tasks would never read 
 from
 node A? This would be because maps have data locality and if the reduce
 tasks write first to the node where they live, one replica of the block
 would always be in a node that has a TT. Node A would just contain 
 blocks
 created from replication by the framework as no reduce task would write
 there directly. Is this correct?
 Thanks in advance
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/doubt-about-reduce-tasks-and-block-writes-tp4003185.html
 Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
 
 
 
 
 
 
 -- 
 Harsh J



Re: doubt about reduce tasks and block writes

2012-08-25 Thread Marc Sturlese
Thanks, Raj you got exactly my point. I wanted to confirm this assumption as
I was guessing if a shared HDFS cluster with MR and Hbase like this would
make sense:
http://old.nabble.com/HBase-User-f34655.html



--
View this message in context: 
http://lucene.472066.n3.nabble.com/doubt-about-reduce-tasks-and-block-writes-tp4003185p4003211.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: doubt about reduce tasks and block writes

2012-08-25 Thread Harsh J
Raj's almost right. In times of high load or space fillup on a local
DN, the NameNode may decide to instead pick a non-local DN for
replica-writing. In this way, the Node A may get a copy 0 of a
replica from a task. This is per the default block placement policy.

P.s. Note that HDFS hardly makes any difference between replicas,
hence there is no hard concept of a "copy 0" or "copy 1" block; at the
NN level, it treats all DNs in the pipeline equally, and the same goes for replicas.
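
If anyone wants to verify where the replicas of a reduce output file actually
landed, here is a small sketch using the public FileSystem API (the path
argument is just an example; pass any HDFS file):

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhereAreMyBlocks {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path(args[0])); // e.g. a part-r-* file
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      // Prints the datanode hosts holding each block's replicas.
      System.out.println(block.getOffset() + " -> " + Arrays.toString(block.getHosts()));
    }
  }
}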

On Sat, Aug 25, 2012 at 4:14 AM, Raj Vishwanathan rajv...@yahoo.com wrote:
 But since node A has no TT running, it will not run map or reduce tasks. When 
 the reducer node writes the output file, the fist block will be written on 
 the local node and never on node A.

 So, to answer the question, Node A will contain copies of blocks of all 
 output files. It wont contain the copy 0 of any output file.


 I am reasonably sure about this , but there could be corner cases in case of 
 node failure and such like! I need to look into the code.


 Raj

 From: Marc Sturlese marc.sturl...@gmail.com
To: hadoop-u...@lucene.apache.org
Sent: Friday, August 24, 2012 1:09 PM
Subject: doubt about reduce tasks and block writes

Hey there,
I have a doubt about reduce tasks and block writes. Do a reduce task always
first write to hdfs in the node where they it is placed? (and then these
blocks would be replicated to other nodes)
In case yes, if I have a cluster of 5 nodes, 4 of them run DN and TT and one
(node A) just run DN, when running MR jobs, map tasks would never read from
node A? This would be because maps have data locality and if the reduce
tasks write first to the node where they live, one replica of the block
would always be in a node that has a TT. Node A would just contain blocks
created from replication by the framework as no reduce task would write
there directly. Is this correct?
Thanks in advance



--
View this message in context: 
http://lucene.472066.n3.nabble.com/doubt-about-reduce-tasks-and-block-writes-tp4003185.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.






-- 
Harsh J


Re: doubt about reduce tasks and block writes

2012-08-24 Thread Minh Duc Nguyen
Marc, see my inline comments.

On Fri, Aug 24, 2012 at 4:09 PM, Marc Sturlese marc.sturl...@gmail.comwrote:

 Hey there,
 I have a doubt about reduce tasks and block writes. Do a reduce task always
 first write to hdfs in the node where they it is placed? (and then these
 blocks would be replicated to other nodes)


Yes, if there is a DN running on that server (it's possible to be running
TT without a DN).


 In case yes, if I have a cluster of 5 nodes, 4 of them run DN and TT and
 one
 (node A) just run DN, when running MR jobs, map tasks would never read from
 node A? This would be because maps have data locality and if the reduce
 tasks write first to the node where they live, one replica of the block
 would always be in a node that has a TT. Node A would just contain blocks
 created from replication by the framework as no reduce task would write
 there directly. Is this correct?


I believe that it's possible that a map task would read from node A's DN.
 Yes, the JobTracker tries to schedule map tasks on nodes where the data
would be local, but it can't always do so.  If there's a node with a free
map slot, but that node doesn't have the data blocks locally, the
JobTracker will assign the map task to that free map slot.  Some work done
(albeit slower than the ideal case because of the increased network IO) is
better than no work done.


 Thanks in advance



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/doubt-about-reduce-tasks-and-block-writes-tp4003185.html
 Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



Re: doubt about reduce tasks and block writes

2012-08-24 Thread Bertrand Dechoux
Assuming that node A only contains replicas, there is no guarantee that its
data would never be read.
First, you might lose a replica. The copy inside node A could be used
to create the missing replica again.
Second, data locality is on a best-effort basis. If all the map slots are occupied
except one, on a node without a replica of the data, then your node A is as
likely as any other to be chosen as a source.

Regards

Bertrand

On Fri, Aug 24, 2012 at 10:09 PM, Marc Sturlese marc.sturl...@gmail.comwrote:

 Hey there,
 I have a doubt about reduce tasks and block writes. Do a reduce task always
 first write to hdfs in the node where they it is placed? (and then these
 blocks would be replicated to other nodes)
 In case yes, if I have a cluster of 5 nodes, 4 of them run DN and TT and
 one
 (node A) just run DN, when running MR jobs, map tasks would never read from
 node A? This would be because maps have data locality and if the reduce
 tasks write first to the node where they live, one replica of the block
 would always be in a node that has a TT. Node A would just contain blocks
 created from replication by the framework as no reduce task would write
 there directly. Is this correct?
 Thanks in advance



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/doubt-about-reduce-tasks-and-block-writes-tp4003185.html
 Sent from the Hadoop lucene-users mailing list archive at Nabble.com.




-- 
Bertrand Dechoux


Re: doubt about reduce tasks and block writes

2012-08-24 Thread Raj Vishwanathan
But since node A has no TT running, it will not run map or reduce tasks. When 
the reducer node writes the output file, the first block will be written on the 
local node and never on node A.

So, to answer the question, Node A will contain copies of blocks of all output 
files. It won't contain the "copy 0" of any output file.


I am reasonably sure about this, but there could be corner cases in case of 
node failure and such like! I need to look into the code. 


Raj

 From: Marc Sturlese marc.sturl...@gmail.com
To: hadoop-u...@lucene.apache.org 
Sent: Friday, August 24, 2012 1:09 PM
Subject: doubt about reduce tasks and block writes
 
Hey there,
I have a doubt about reduce tasks and block writes. Do a reduce task always
first write to hdfs in the node where they it is placed? (and then these
blocks would be replicated to other nodes)
In case yes, if I have a cluster of 5 nodes, 4 of them run DN and TT and one
(node A) just run DN, when running MR jobs, map tasks would never read from
node A? This would be because maps have data locality and if the reduce
tasks write first to the node where they live, one replica of the block
would always be in a node that has a TT. Node A would just contain blocks
created from replication by the framework as no reduce task would write
there directly. Is this correct?
Thanks in advance



--
View this message in context: 
http://lucene.472066.n3.nabble.com/doubt-about-reduce-tasks-and-block-writes-tp4003185.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.





Re: doubt on Hadoop job submission process

2012-08-13 Thread Harsh J
Hi Manoj,

Reply inline.

On Mon, Aug 13, 2012 at 3:42 PM, Manoj Babu manoj...@gmail.com wrote:
 Hi All,

 Normal Hadoop job submission process involves:

 Checking the input and output specifications of the job.
 Computing the InputSplits for the job.
 Setup the requisite accounting information for the DistributedCache of the
 job, if necessary.
 Copying the job's jar and configuration to the map-reduce system directory
 on the distributed file-system.
 Submitting the job to the JobTracker and optionally monitoring it's status.

 I have a doubt in 4th point of  job execution flow could any of you explain
 it?

 What is job's jar?

The job.jar is the jar you supply via hadoop jar <jar>. Technically,
though, it is the jar pointed to by JobConf.getJar() (set via the setJar or
setJarByClass calls).

 Is it job's jar is the one we submitted to hadoop or hadoop will build based
 on the job configuration object?

It is the former, as explained above.

-- 
Harsh J


Re: doubt on Hadoop job submission process

2012-08-13 Thread Manoj Babu
Hi Harsh,

Thanks for your reply.

Consider that from my main program I am doing so
many activities (reading/writing/updating, non-Hadoop activities) before
invoking JobClient.runJob(conf);
Is there any way to separate the process flow programmatically instead of going
for a workflow engine?

Cheers!
Manoj.



On Mon, Aug 13, 2012 at 4:10 PM, Harsh J ha...@cloudera.com wrote:

 Hi Manoj,

 Reply inline.

 On Mon, Aug 13, 2012 at 3:42 PM, Manoj Babu manoj...@gmail.com wrote:
  Hi All,
 
  Normal Hadoop job submission process involves:
 
  Checking the input and output specifications of the job.
  Computing the InputSplits for the job.
  Setup the requisite accounting information for the DistributedCache of
 the
  job, if necessary.
  Copying the job's jar and configuration to the map-reduce system
 directory
  on the distributed file-system.
  Submitting the job to the JobTracker and optionally monitoring it's
 status.
 
  I have a doubt in 4th point of  job execution flow could any of you
 explain
  it?
 
  What is job's jar?

 The job.jar is the jar you supply via hadoop jar jar. Technically
 though, it is the jar pointed by JobConf.getJar() (Set via setJar or
 setJarByClass calls).

  Is it job's jar is the one we submitted to hadoop or hadoop will build
 based
  on the job configuration object?

 It is the former, as explained above.

 --
 Harsh J



Re: doubt on Hadoop job submission process

2012-08-13 Thread Harsh J
Sure, you may separate the logic as you want it to be, but just ensure
the configuration object has a proper setJar or setJarByClass done on
it before you submit the job.
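
A minimal sketch of that split, with placeholder names for the non-Hadoop
part (your actual activities and paths will differ):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class Driver {
  public static void main(String[] args) throws Exception {
    doNonHadoopWork(); // placeholder for the reading/writing/updating activities

    JobConf conf = new JobConf();
    conf.setJarByClass(Driver.class); // ensures the job's jar is set before submission
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf); // copies job.jar + configuration to the MR system dir, then submits
  }

  private static void doNonHadoopWork() {
    // placeholder for the pre-job, non-Hadoop activities mentioned above
  }
}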

On Mon, Aug 13, 2012 at 4:43 PM, Manoj Babu manoj...@gmail.com wrote:
 Hi Harsh,

 Thanks for your reply.

 Consider from my main program i am doing so many
 activities(Reading/writing/updating non hadoop activities) before invoking
 JobClient.runJob(conf);
 Is it anyway to separate the process flow by programmatic instead of going
 for any workflow engine?

 Cheers!
 Manoj.



 On Mon, Aug 13, 2012 at 4:10 PM, Harsh J ha...@cloudera.com wrote:

 Hi Manoj,

 Reply inline.

 On Mon, Aug 13, 2012 at 3:42 PM, Manoj Babu manoj...@gmail.com wrote:
  Hi All,
 
  Normal Hadoop job submission process involves:
 
  Checking the input and output specifications of the job.
  Computing the InputSplits for the job.
  Setup the requisite accounting information for the DistributedCache of
  the
  job, if necessary.
  Copying the job's jar and configuration to the map-reduce system
  directory
  on the distributed file-system.
  Submitting the job to the JobTracker and optionally monitoring it's
  status.
 
  I have a doubt in 4th point of  job execution flow could any of you
  explain
  it?
 
  What is job's jar?

 The job.jar is the jar you supply via hadoop jar jar. Technically
 though, it is the jar pointed by JobConf.getJar() (Set via setJar or
 setJarByClass calls).

  Is it job's jar is the one we submitted to hadoop or hadoop will build
  based
  on the job configuration object?

 It is the former, as explained above.

 --
 Harsh J





-- 
Harsh J


Re: Doubt from the book Definitive Guide

2012-04-05 Thread Mohit Anchlia
On Wed, Apr 4, 2012 at 10:02 PM, Prashant Kommireddi prash1...@gmail.comwrote:

 Hi Mohit,

 What would be the advantage? Reducers in most cases read data from all
 the mappers. In the case where mappers were to write to HDFS, a
 reducer would still require to read data from other datanodes across
 the cluster.


The only advantage I was thinking of was that in some cases reducers might be
able to take advantage of data locality and avoid multiple HTTP calls, no?
The data is written anyway, so the last merged file could go on HDFS instead of
the local disk.
I am new to Hadoop, so I'm just asking questions to understand the rationale
behind using the local disk for the final map output.

 Prashant

 On Apr 4, 2012, at 9:55 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

  On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote:
 
  Hi Mohit,
 
  On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com
  wrote:
  I am going through the chapter How mapreduce works and have some
  confusion:
 
  1) Below description of Mapper says that reducers get the output file
  using
  HTTP call. But the description under The Reduce Side doesn't
  specifically
  say if it's copied using HTTP. So first confusion, Is the output copied
  from mapper - reducer or from reducer - mapper? And second, Is the
 call
  http:// or hdfs://
 
  The flow is simple as this:
  1. For M+R job, map completes its task after writing all partitions
  down into the tasktracker's local filesystem (under mapred.local.dir
  directories).
  2. Reducers fetch completion locations from events at JobTracker, and
  query the TaskTracker there to provide it the specific partition it
  needs, which is done over the TaskTracker's HTTP service (50060).
 
  So to clear things up - map doesn't send it to reduce, nor does reduce
  ask the actual map task. It is the task tracker itself that makes the
  bridge here.
 
  Note however, that in Hadoop 2.0 the transfer via ShuffleHandler would
  be over Netty connections. This would be much more faster and
  reliable.
 
  2) My understanding was that mapper output gets written to hdfs, since
  I've
  seen part-m-0 files in hdfs. If mapper output is written to HDFS
 then
  shouldn't reducers simply read it from hdfs instead of making http
 calls
  to
  tasktrackers location?
 
  A map-only job usually writes out to HDFS directly (no sorting done,
  cause no reducer is involved). If the job is a map+reduce one, the
  default output is collected to local filesystem for partitioning and
  sorting at map end, and eventually grouping at reduce end. Basically:
  Data you want to send to reducer from mapper goes to local FS for
  multiple actions to be performed on them, other data may directly go
  to HDFS.
 
  Reducers currently are scheduled pretty randomly but yes their
  scheduling can be improved for certain scenarios. However, if you are
  pointing that map partitions ought to be written to HDFS itself (with
  replication or without), I don't see performance improving. Note that
  the partitions aren't merely written but need to be sorted as well (at
  either end). To do that would need ability to spill frequently (cause
  we don't have infinite memory to do it all in RAM) and doing such a
  thing on HDFS would only mean slowdown.
 
  Thanks for clearing my doubts. In this case I was merely suggesting that
  if the mapper output (merged output in the end or the shuffle output) is
  stored in HDFS then reducers can just retrieve it from HDFS instead of
  asking tasktracker for it. Once reducer threads read it they can continue
  to work locally.
 
 
 
  I hope this helps clear some things up for you.
 
  --
  Harsh J
 



Re: Doubt from the book Definitive Guide

2012-04-05 Thread Jean-Daniel Cryans
On Thu, Apr 5, 2012 at 7:03 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
 Only advantage I was thinking of was that in some cases reducers might be
 able to take advantage of data locality and avoid multiple HTTP calls, no?
 Data is anyways written, so last merged file could go on HDFS instead of
 local disk.
 I am new to hadoop so just asking question to understand the rational
 behind using local disk for final output.

So basically it's a tradeoff here, you get more replicas to copy from
but you have 2 more copies to write. Considering that that data's very
short lived and that it doesn't need to be replicated (since if the
machine fails the maps are replayed anyway) it seems that writing 2
replicas that are potentially unused would be hurtful.

Regarding locality, it might make sense on a small cluster but the
more you add nodes the smaller the chance to have local replicas for
each blocks of data you're looking for.

J-D


Re: Doubt from the book Definitive Guide

2012-04-04 Thread Prashant Kommireddi
Answers inline.

On Wed, Apr 4, 2012 at 4:56 PM, Mohit Anchlia mohitanch...@gmail.comwrote:

 I am going through the chapter How mapreduce works and have some
 confusion:

 1) Below description of Mapper says that reducers get the output file using
 HTTP call. But the description under The Reduce Side doesn't specifically
 say if it's copied using HTTP. So first confusion, Is the output copied
  from mapper -> reducer or from reducer -> mapper? And second, Is the call
 http:// or hdfs://


Map output is written to local FS, not HDFS.


 2) My understanding was that mapper output gets written to hdfs, since I've
 seen part-m-0 files in hdfs. If mapper output is written to HDFS then
 shouldn't reducers simply read it from hdfs instead of making http calls to
 tasktrackers location?

 Map output is sent to HDFS when reducer is not used.



 - from the book ---
 Mapper
 The output file’s partitions are made available to the reducers over HTTP.
 The number of worker threads used to serve the file partitions is
 controlled by the tasktracker.http.threads property
 this setting is per tasktracker, not per map task slot. The default of 40
 may need increasing for large clusters running large jobs.6.4.2.

 The Reduce Side
 Let’s turn now to the reduce part of the process. The map output file is
 sitting on the local disk of the tasktracker that ran the map task
 (note that although map outputs always get written to the local disk of the
 map tasktracker, reduce outputs may not be), but now it is needed by the
 tasktracker
 that is about to run the reduce task for the partition. Furthermore, the
 reduce task needs the map output for its particular partition from several
 map tasks across the cluster.
 The map tasks may finish at different times, so the reduce task starts
 copying their outputs as soon as each completes. This is known as the copy
 phase of the reduce task.
 The reduce task has a small number of copier threads so that it can fetch
 map outputs in parallel.
 The default is five threads, but this number can be changed by setting the
 mapred.reduce.parallel.copies property.
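
For reference, the copy-phase property quoted above is a job-level setting and
can be set through the Configuration object; a minimal sketch with an example
value (tasktracker.http.threads, by contrast, is a per-tasktracker daemon
setting and belongs in mapred-site.xml on the cluster, not in job code):

import org.apache.hadoop.conf.Configuration;

public class ShuffleTuning {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Reduce side: number of copier threads fetching map outputs in parallel
    // (5 is the default; 10 is only an example value).
    conf.setInt("mapred.reduce.parallel.copies", 10);
    System.out.println(conf.getInt("mapred.reduce.parallel.copies", 5));
  }
}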



Re: Doubt from the book Definitive Guide

2012-04-04 Thread Harsh J
Hi Mohit,

On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
 I am going through the chapter How mapreduce works and have some
 confusion:

 1) Below description of Mapper says that reducers get the output file using
 HTTP call. But the description under The Reduce Side doesn't specifically
 say if it's copied using HTTP. So first confusion, Is the output copied
  from mapper -> reducer or from reducer -> mapper? And second, Is the call
 http:// or hdfs://

The flow is simple as this:
1. For M+R job, map completes its task after writing all partitions
down into the tasktracker's local filesystem (under mapred.local.dir
directories).
2. Reducers fetch completion locations from events at JobTracker, and
query the TaskTracker there to provide it the specific partition it
needs, which is done over the TaskTracker's HTTP service (50060).

So to clear things up - map doesn't send it to reduce, nor does reduce
ask the actual map task. It is the task tracker itself that makes the
bridge here.

Note however, that in Hadoop 2.0 the transfer via ShuffleHandler would
be over Netty connections. This would be much more faster and
reliable.

 2) My understanding was that mapper output gets written to hdfs, since I've
 seen part-m-0 files in hdfs. If mapper output is written to HDFS then
 shouldn't reducers simply read it from hdfs instead of making http calls to
 tasktrackers location?

A map-only job usually writes out to HDFS directly (no sorting done,
cause no reducer is involved). If the job is a map+reduce one, the
default output is collected to local filesystem for partitioning and
sorting at map end, and eventually grouping at reduce end. Basically:
Data you want to send to reducer from mapper goes to local FS for
multiple actions to be performed on them, other data may directly go
to HDFS.

Reducers currently are scheduled pretty randomly but yes their
scheduling can be improved for certain scenarios. However, if you are
pointing that map partitions ought to be written to HDFS itself (with
replication or without), I don't see performance improving. Note that
the partitions aren't merely written but need to be sorted as well (at
either end). To do that would need ability to spill frequently (cause
we don't have infinite memory to do it all in RAM) and doing such a
thing on HDFS would only mean slowdown.

I hope this helps clear some things up for you.

-- 
Harsh J


Re: Doubt from the book Definitive Guide

2012-04-04 Thread Mohit Anchlia
On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote:

 Hi Mohit,

 On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
  I am going through the chapter How mapreduce works and have some
  confusion:
 
  1) Below description of Mapper says that reducers get the output file
 using
  HTTP call. But the description under The Reduce Side doesn't
 specifically
  say if it's copied using HTTP. So first confusion, Is the output copied
  from mapper - reducer or from reducer - mapper? And second, Is the call
  http:// or hdfs://

 The flow is simple as this:
 1. For M+R job, map completes its task after writing all partitions
 down into the tasktracker's local filesystem (under mapred.local.dir
 directories).
 2. Reducers fetch completion locations from events at JobTracker, and
 query the TaskTracker there to provide it the specific partition it
 needs, which is done over the TaskTracker's HTTP service (50060).

 So to clear things up - map doesn't send it to reduce, nor does reduce
 ask the actual map task. It is the task tracker itself that makes the
 bridge here.

 Note however, that in Hadoop 2.0 the transfer via ShuffleHandler would
 be over Netty connections. This would be much more faster and
 reliable.

  2) My understanding was that mapper output gets written to hdfs, since
 I've
  seen part-m-0 files in hdfs. If mapper output is written to HDFS then
  shouldn't reducers simply read it from hdfs instead of making http calls
 to
  tasktrackers location?

 A map-only job usually writes out to HDFS directly (no sorting done,
 cause no reducer is involved). If the job is a map+reduce one, the
 default output is collected to local filesystem for partitioning and
 sorting at map end, and eventually grouping at reduce end. Basically:
 Data you want to send to reducer from mapper goes to local FS for
 multiple actions to be performed on them, other data may directly go
 to HDFS.

 Reducers currently are scheduled pretty randomly but yes their
 scheduling can be improved for certain scenarios. However, if you are
 pointing that map partitions ought to be written to HDFS itself (with
 replication or without), I don't see performance improving. Note that
 the partitions aren't merely written but need to be sorted as well (at
 either end). To do that would need ability to spill frequently (cause
 we don't have infinite memory to do it all in RAM) and doing such a
 thing on HDFS would only mean slowdown.

 Thanks for clearing my doubts. In this case I was merely suggesting that
if the mapper output (merged output in the end or the shuffle output) is
stored in HDFS then reducers can just retrieve it from HDFS instead of
asking tasktracker for it. Once reducer threads read it they can continue
to work locally.



 I hope this helps clear some things up for you.

 --
 Harsh J



Re: Doubt from the book Definitive Guide

2012-04-04 Thread Prashant Kommireddi
Hi Mohit,

What would be the advantage? Reducers in most cases read data from all
the mappers. In the case where mappers were to write to HDFS, a
reducer would still require to read data from other datanodes across
the cluster.

Prashant

On Apr 4, 2012, at 9:55 PM, Mohit Anchlia mohitanch...@gmail.com wrote:

 On Wed, Apr 4, 2012 at 8:42 PM, Harsh J ha...@cloudera.com wrote:

 Hi Mohit,

 On Thu, Apr 5, 2012 at 5:26 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:
 I am going through the chapter How mapreduce works and have some
 confusion:

 1) Below description of Mapper says that reducers get the output file
 using
 HTTP call. But the description under The Reduce Side doesn't
 specifically
 say if it's copied using HTTP. So first confusion, Is the output copied
 from mapper - reducer or from reducer - mapper? And second, Is the call
 http:// or hdfs://

 The flow is simple as this:
 1. For M+R job, map completes its task after writing all partitions
 down into the tasktracker's local filesystem (under mapred.local.dir
 directories).
 2. Reducers fetch completion locations from events at JobTracker, and
 query the TaskTracker there to provide it the specific partition it
 needs, which is done over the TaskTracker's HTTP service (50060).

 So to clear things up - map doesn't send it to reduce, nor does reduce
 ask the actual map task. It is the task tracker itself that makes the
 bridge here.

 Note however, that in Hadoop 2.0 the transfer via ShuffleHandler would
 be over Netty connections. This would be much more faster and
 reliable.

 2) My understanding was that mapper output gets written to hdfs, since
 I've
 seen part-m-0 files in hdfs. If mapper output is written to HDFS then
 shouldn't reducers simply read it from hdfs instead of making http calls
 to
 tasktrackers location?

 A map-only job usually writes out to HDFS directly (no sorting done,
 cause no reducer is involved). If the job is a map+reduce one, the
 default output is collected to local filesystem for partitioning and
 sorting at map end, and eventually grouping at reduce end. Basically:
 Data you want to send to reducer from mapper goes to local FS for
 multiple actions to be performed on them, other data may directly go
 to HDFS.

 Reducers currently are scheduled pretty randomly but yes their
 scheduling can be improved for certain scenarios. However, if you are
 pointing that map partitions ought to be written to HDFS itself (with
 replication or without), I don't see performance improving. Note that
 the partitions aren't merely written but need to be sorted as well (at
 either end). To do that would need ability to spill frequently (cause
 we don't have infinite memory to do it all in RAM) and doing such a
 thing on HDFS would only mean slowdown.

 Thanks for clearing my doubts. In this case I was merely suggesting that
 if the mapper output (merged output in the end or the shuffle output) is
 stored in HDFS then reducers can just retrieve it from HDFS instead of
 asking tasktracker for it. Once reducer threads read it they can continue
 to work locally.



 I hope this helps clear some things up for you.

 --
 Harsh J



Re: [Doubt]: Submission of Mapreduce from outside Hadoop Cluster

2011-07-01 Thread Harsh J
Narayanan,


On Fri, Jul 1, 2011 at 11:28 AM, Narayanan K knarayana...@gmail.com wrote:
 Hi all,

 We are basically working on a research project and I require some help
 regarding this.

Always glad to see research work being done! What're you working on? :)

 How do I submit a mapreduce job from outside the cluster i.e from a
 different machine outside the Hadoop cluster?

If you use Java APIs, use the Job#submit(…) method and/or
JobClient.runJob(…) method.
Basically Hadoop will try to create a jar with all requisite classes within
and will push it out to the JobTracker's filesystem (HDFS, if you run HDFS).
From there on, it's like a regular operation.

This even happens on the Hadoop nodes themselves, so doing it from an
external machine, as long as that machine has access to Hadoop's JT and
HDFS, should be no different at all.

If you are packing custom libraries along, don't forget to use
DistributedCache. If you are packing custom MR Java code, don't forget
to use Job#setJarByClass/JobClient#setJarByClass and other appropriate
API methods.
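
As a rough sketch of the above, under stated assumptions (the identity
Mapper/Reducer are placeholders for real job classes, and the extra jar path
on HDFS is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver that can be launched from outside the cluster: the job jar is
// shipped via setJarByClass, and an extra library (already on HDFS) is
// shipped via DistributedCache.
public class RemoteSubmitExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical extra jar the tasks need on their classpath.
        DistributedCache.addFileToClassPath(new Path("/libs/my-extra-lib.jar"), conf);

        Job job = Job.getInstance(conf, "RemoteSubmitExample");
        job.setJarByClass(RemoteSubmitExample.class); // tells Hadoop which jar to ship
        job.setMapperClass(Mapper.class);             // identity map (placeholder)
        job.setReducerClass(Reducer.class);           // identity reduce (placeholder)
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}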

 If the above can be done, How can I schedule map reduce jobs to run in
 hadoop like crontab from a different machine?
 Are there any webservice APIs that I can leverage to access a hadoop cluster
 from outside and submit jobs or read/write data from HDFS.

For scheduling jobs, have a look at Oozie: http://yahoo.github.com/oozie/
It is well supported and is very useful in writing MR workflows (which
is a common requirement). You also get coordinator features and can
schedule similar to crontab functionalities.

For HDFS r/w over web, not sure of an existing web app specifically
for this purpose without limitations, but there is a contrib/thriftfs
you can leverage upon (if not writing your own webserver in Java, in
which case its as simple as using HDFS APIs).

Also have a look at the pretty mature Hue project which aims to
provide a great frontend that lets you design jobs, submit jobs,
monitor jobs and upload files or browse the filesystem (among several
other things): http://cloudera.github.com/hue/

-- 
Harsh J


Re: [Doubt]: Submission of Mapreduce from outside Hadoop Cluster

2011-07-01 Thread Harsh J
Narayanan,

On Fri, Jul 1, 2011 at 12:57 PM, Narayanan K knarayana...@gmail.com wrote:
 So the report will be run from a different machine outside the cluster. So
 we need a way to pass the parameters on to the hadoop cluster (master) and
 initiate a mapreduce job dynamically. Similarly, the output of the mapreduce
 job needs to be tunneled back to the machine from where the report was run.

 Some more clarification I need: does the machine (outside of the cluster)
 which runs the report require something like a client installation which
 will talk with the Hadoop master server via TCP? Or can it run a job on the
 hadoop server by using a passwordless scp to the master machine or
 something of the like?

The regular way is to let the client talk to your nodes over TCP ports.
This is what Hadoop's plain ol' submitter process does for you.

Have you tried running a simple hadoop jar <your jar> from a remote client
machine?

If that works, invoking the same from your code (with the appropriate
configurations set) should work too, because it's basically the same plain
ol' runjar submission process either way.
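
For illustration, a minimal client-side sketch of those configurations (the
host names and the classic 1.x port numbers below are assumptions; use
whatever your cluster actually exposes):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Point the client's Configuration at the cluster's NameNode and JobTracker,
// then use the normal APIs; a Job built from this conf would be submitted to
// the remote JobTracker.
public class RemoteClientConfig {
    public static Configuration clusterConf() {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode.example.com:9000");   // hypothetical host/port
        conf.set("mapred.job.tracker", "jobtracker.example.com:9001");     // hypothetical host/port
        return conf;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = clusterConf();
        FileSystem fs = FileSystem.get(conf);          // talks to the remote HDFS
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());      // e.g. browse the remote root
        }
    }
}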

If not, maybe you need to think of opening ports to let things happen
(if there's a firewall here).

Hadoop does not use SSH/SCP to move code around. Please give this a
read if you believe you're confused about how SSH+Hadoop is integrated
(or not): http://wiki.apache.org/hadoop/FAQ#Does_Hadoop_require_SSH.3F

-- 
Harsh J


Re: [Doubt]: Submission of Mapreduce from outside Hadoop Cluster

2011-07-01 Thread Yaozhen Pan
Narayanan,

Regarding the client installation, you should make sure that the client and
the server use the same Hadoop version for submitting jobs and transferring
data. If you use a different user on the client than the one that runs the
hadoop jobs, configure the hadoop ugi property (sorry, I forget the exact
name).

On 2011-7-1 15:28, Narayanan K knarayana...@gmail.com wrote:
 Hi Harsh

 Thanks for the quick response...

 Have a few clarifications regarding the 1st point :

 Let me tell the background first..

 We have actually set up a Hadoop cluster with HBase installed. We are
 planning to load HBase with data, perform some computations on the data,
 and show the data in a report format. The report should be accessible from
 outside the cluster, and the report accepts certain parameters; these will
 in turn be passed on to the hadoop master server, where a mapreduce job
 will be run that queries HBase to retrieve the data.

 So the report will be run from a different machine outside the cluster. So
 we need a way to pass the parameters on to the hadoop cluster (master) and
 initiate a mapreduce job dynamically. Similarly, the output of the mapreduce
 job needs to be tunneled back to the machine from where the report was run.

 Some more clarification I need: does the machine (outside of the cluster)
 which runs the report require something like a client installation which
 will talk with the Hadoop master server via TCP? Or can it run a job on the
 hadoop server by using a passwordless scp to the master machine or
 something of the like?


 Regards,
 Narayanan




 On Fri, Jul 1, 2011 at 11:41 AM, Harsh J ha...@cloudera.com wrote:

 Narayanan,


 On Fri, Jul 1, 2011 at 11:28 AM, Narayanan K knarayana...@gmail.com
 wrote:
  Hi all,
 
  We are basically working on a research project and I require some help
  regarding this.

 Always glad to see research work being done! What're you working on? :)

  How do I submit a mapreduce job from outside the cluster i.e from a
  different machine outside the Hadoop cluster?

 If you use Java APIs, use the Job#submit(…) method and/or
 JobClient.runJob(…) method.
 Basically Hadoop will try to create a jar with all requisite classes
 within and will push it out to the JobTracker's filesystem (HDFS, if
 you run HDFS). From there on, its like a regular operation.

 This even happens on the Hadoop nodes itself, so doing so from an
 external place as long as that place has access to Hadoop's JT and
 HDFS, should be no different at all.

 If you are packing custom libraries along, don't forget to use
 DistributedCache. If you are packing custom MR Java code, don't forget
 to use Job#setJarByClass/JobClient#setJarByClass and other appropriate
 API methods.

  If the above can be done, How can I schedule map reduce jobs to run in
  hadoop like crontab from a different machine?
  Are there any webservice APIs that I can leverage to access a hadoop
 cluster
  from outside and submit jobs or read/write data from HDFS.

 For scheduling jobs, have a look at Oozie: http://yahoo.github.com/oozie/
 It is well supported and is very useful in writing MR workflows (which
 is a common requirement). You also get coordinator features and can
 schedule similar to crontab functionalities.

 For HDFS r/w over web, not sure of an existing web app specifically
 for this purpose without limitations, but there is a contrib/thriftfs
 you can leverage upon (if not writing your own webserver in Java, in
 which case its as simple as using HDFS APIs).

 Also have a look at the pretty mature Hue project which aims to
 provide a great frontend that lets you design jobs, submit jobs,
 monitor jobs and upload files or browse the filesystem (among several
 other things): http://cloudera.github.com/hue/

 --
 Harsh J



RE: Doubt: Regarding running Hadoop on a cluster with shared disk.

2010-05-05 Thread Michael Segel

Udaya,

You can use non-local disk on your hadoop cloud, however it will have
sub-optimal performance, and you will have to tune accordingly.

If it's a shared drive on all of your nodes, you need to create different
directories for each machine.

Suppose your shared drive is /foo; then you would need to set up a
/foo/<name of node>/data directory for each machine in your cluster.

The drawback is not only the I/O traffic and constraints, but also that
you'll have to tune ZK and watch out for timing issues, as your disk I/O is
your constraint.

Definitely not recommended.


 Date: Wed, 5 May 2010 15:52:11 +0530
 Subject: Doubt: Regarding running Hadoop on a cluster with shared disk.
 From: udaya...@gmail.com
 To: common-user@hadoop.apache.org
 
  Hi,
    I have an account on a cluster which has a file system similar to NFS:
  if I create a file on one machine, it is visible on all the machines in
  the cluster. But hadoop normally works on a cluster of machines where each
  machine has a disk of its own. Can someone please help me use hadoop on my
  cluster?
 Thanks,
 Udaya.
  

Re: Doubt: Using PBS to run mapreduce jobs.

2010-05-04 Thread Craig Macdonald
HOD supports a PBS environment, namely Torque. Torque is the vastly 
improved fork of OpenPBS. You may be able to get HOD working on OpenPBS, 
or better still persuade your cluster admins to upgrade to a more recent 
version of Torque (e.g. at least 2.1.x)


Craig


On 22/07/28164 20:59, Udaya Lakshmi wrote:

Hi,
I am given an account on a cluster which uses OpenPBS as the cluster
management software. The only way I can run a job is by submitting it to
OpenPBS. How do I run mapreduce programs on it? Is there any possible
workaround?

Thanks,
Udaya.
   




Re: Doubt: Using PBS to run mapreduce jobs.

2010-05-04 Thread Udaya Lakshmi
Thank you, Craig. My cluster has got Torque. Can you please point me to
something which has a detailed explanation of using HOD on Torque?

On Tue, May 4, 2010 at 10:17 PM, Craig Macdonald cra...@dcs.gla.ac.ukwrote:

 HOD supports a PBS environment, namely Torque. Torque is the vastly
 improved fork of OpenPBS. You may be able to get HOD working on OpenPBS, or
 better still persuade your cluster admins to upgrade to a more recent
 version of Torque (e.g. at least 2.1.x)

 Craig



 On 22/07/28164 20:59, Udaya Lakshmi wrote:

 Hi,
I am given an account on a cluster which uses OpenPBS as the cluster
 management software. The only way I can run a job is by submitting it to
 OpenPBS. How to run mapreduce programs on it? Is there any possible work
 around?

 Thanks,
 Udaya.






Re: Doubt: Using PBS to run mapreduce jobs.

2010-05-04 Thread Peeyush Bishnoi
Udaya,

Following link will help you for HOD on torque.
http://hadoop.apache.org/common/docs/r0.20.0/hod_user_guide.html


Thanks,

---
Peeyush

On Tue, 2010-05-04 at 22:49 +0530, Udaya Lakshmi wrote:

 Thank you Craig. My cluster has got Torque. Can you please point me
 something which will have detailed explanation about using HOD on Torque.
 
 On Tue, May 4, 2010 at 10:17 PM, Craig Macdonald cra...@dcs.gla.ac.ukwrote:
 
  HOD supports a PBS environment, namely Torque. Torque is the vastly
  improved fork of OpenPBS. You may be able to get HOD working on OpenPBS, or
  better still persuade your cluster admins to upgrade to a more recent
  version of Torque (e.g. at least 2.1.x)
 
  Craig
 
 
 
  On 22/07/28164 20:59, Udaya Lakshmi wrote:
 
  Hi,
 I am given an account on a cluster which uses OpenPBS as the cluster
  management software. The only way I can run a job is by submitting it to
  OpenPBS. How to run mapreduce programs on it? Is there any possible work
  around?
 
  Thanks,
  Udaya.
 
 
 
 


Re: Doubt: Using PBS to run mapreduce jobs.

2010-05-04 Thread Allen Wittenauer

On May 4, 2010, at 7:46 AM, Udaya Lakshmi wrote:

 Hi,
   I am given an account on a cluster which uses OpenPBS as the cluster
 management software. The only way I can run a job is by submitting it to
 OpenPBS. How to run mapreduce programs on it? Is there any possible work
 around?


Take a look at Hadoop on Demand.  It was built with Torque in mind, but any PBS 
system should work with few changes.




Re: Doubt: Using PBS to run mapreduce jobs.

2010-05-04 Thread Udaya Lakshmi
Thank you.
Udaya.

On Wed, May 5, 2010 at 12:23 AM, Allen Wittenauer
awittena...@linkedin.comwrote:


 On May 4, 2010, at 7:46 AM, Udaya Lakshmi wrote:

  Hi,
I am given an account on a cluster which uses OpenPBS as the cluster
  management software. The only way I can run a job is by submitting it to
  OpenPBS. How to run mapreduce programs on it? Is there any possible work
  around?


 Take a look at Hadoop on Demand.  It was built with Torque in mind, but any
 PBS system should work with few changes.





Re: Doubt about SequenceFile.Writer

2010-02-07 Thread Jeff Zhang
A SequenceFile is not a text file, so you cannot see the content by
invoking the unix command cat.
But you can get the text content by using the hadoop command:
hadoop fs -text <src>
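
If it helps, a small sketch of reading such a file programmatically, roughly
what hadoop fs -text does (the path argument and the assumption that the
key/value classes are Writables are mine):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Reads a SequenceFile and prints its records as text. The key/value
// classes are recorded in the file header, so they are instantiated via
// reflection here.
public class SequenceFileDump {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]);   // the side-effect file to inspect

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}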


On Sun, Feb 7, 2010 at 8:51 AM, Andiana Squazo Ringa 
andriana.ri...@gmail.com wrote:

 Hi,
   I have written to a side-effect file using SequenceFile.Writer. But when
 I cat the file, it prints some unreadable characters. I did not use any
 compression codec.
 Why is this so?
 Thanks,
 Ringa.




-- 
Best Regards

Jeff Zhang


Re: Doubt about SequenceFile.Writer

2010-02-07 Thread Ravi
Thanks a lot Jeff.

Ringa.


On Sun, Feb 7, 2010 at 10:30 PM, Jeff Zhang zjf...@gmail.com wrote:

 The SequenceFile is not text file, so you can not see the content by
 invoking unix command cat.
 But you can get the text content by using hadoop command : hadoop fs -text
 src


 On Sun, Feb 7, 2010 at 8:51 AM, Andiana Squazo Ringa 
 andriana.ri...@gmail.com wrote:

  Hi,
I have written to a sideeffect file using SequenceFile.Writer  . But
 when
  I cat the file, it is printing some unreadable characters . I did not use
  any compression code.
  Why is this so?
  Thanks,
  Ringa.
 



 --
 Best Regards

 Jeff Zhang



Re: Re: Re: Re: Doubt in Hadoop

2009-11-29 Thread aa225
Hi,
   Actually, I just made the change suggested by Aaron and my code worked. But I
would still like to know why the setJarByClass() method has to be called
when the Main class and the Map and Reduce classes are in the same package?

Thank You

Abhishek Agrawal

SUNY- Buffalo
(716-435-7122)

On Sun 11/29/09 10:39 AM , aa...@buffalo.edu sent:
 Hi,
 I dont set job.setJarByClass(Map.class). But my main class, the map class
 and the reduce class are all in the same package. Does this make any
 difference at all, or do I still have to call it?

 Thank You

 Abhishek Agrawal

 SUNY- Buffalo
 (716-435-7122)

 On Fri 11/27/09 1:42 PM , Aaron Kimball aa...@cloudera.com sent:
  When you set up the Job object, do you call job.setJarByClass(Map.class)?
  That will tell Hadoop which jar file to ship with the job and to use for
  classloading in your code.

  - Aaron

  On Thu, Nov 26, 2009 at 11:56 PM, aa...@buffalo.edu wrote:
   Hi,
   I am running the job from command line. The job runs fine in the local
   mode but something happens when I try to run the job in the distributed
   mode.

   Abhishek Agrawal
   SUNY- Buffalo
   (716-435-7122)

   On Fri 11/27/09 2:31 AM , Jeff Zhang sent:
    Do you run the map reduce job in command line or IDE? In map reduce
    mode, you should put the jar containing the map and reduce class in
    your classpath.

    Jeff Zhang

    On Fri, Nov 27, 2009 at 2:19 PM, aa...@buffalo.edu wrote:
     Hello Everybody,
     I have a doubt in Hadoop and was wondering if anybody has faced a
     similar problem. I have a package called test. Inside that I have
     classes called A.java, Map.java, Reduce.java. In A.java I have the
     main method where I am trying to initialize the jobConf object. I have
     written jobConf.setMapperClass(Map.class) and similarly for the reduce
     class as well. The code works correctly when I run the code locally
     via jobConf.set(mapred.job.tracker,local) but I get an exception when
     I try to run this code on my cluster. The stack trace of the exception
     is as under. I cannot understand the problem. Any help would be
     appreciated.

     java.lang.RuntimeException: java.lang.RuntimeException:
     java.lang.ClassNotFoundException: test.Map
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:752)
        at org.apache.hadoop.mapred.JobConf.getMapperClass(JobConf.java:690)
        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)
     Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException:
     Markowitz.covarMatrixMap
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:720)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:744)
        ... 6 more
     Caused by: java.lang.ClassNotFoundException: test.Map
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:673)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:718)
        ... 7 more

     Thank You
     Abhishek Agrawal
     SUNY- Buffalo
     (716-435-7122)



Re: Re: Doubt in Hadoop

2009-11-27 Thread Aaron Kimball
When you set up the Job object, do you call job.setJarByClass(Map.class)?
That will tell Hadoop which jar file to ship with the job and to use for
classloading in your code.
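
For illustration, a minimal sketch of where that call sits with the old
JobConf API the poster is using (the test.Map and test.Reduce classes and
the output types are assumed from the thread; the paths come from the
command line):

package test;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Driver sketch (old mapred.* API): setJarByClass tells Hadoop which jar to
// ship to the cluster so the tasks can load test.Map and test.Reduce.
public class A {
    public static void main(String[] args) throws Exception {
        JobConf jobConf = new JobConf();
        jobConf.setJarByClass(A.class);        // any class packaged in your job jar works
        jobConf.setMapperClass(Map.class);     // the poster's test.Map
        jobConf.setReducerClass(Reduce.class); // the poster's test.Reduce
        jobConf.setOutputKeyClass(Text.class);       // assumed output types
        jobConf.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
        FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
        JobClient.runJob(jobConf);
    }
}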

- Aaron


On Thu, Nov 26, 2009 at 11:56 PM, aa...@buffalo.edu wrote:

 Hi,
   I am running the job from command line. The job runs fine in the local
 mode
 but something happens when I try to run the job in the distributed mode.


 Abhishek Agrawal

 SUNY- Buffalo
 (716-435-7122)

 On Fri 11/27/09  2:31 AM , Jeff Zhang zjf...@gmail.com sent:
  Do you run the map reduce job in command line or IDE?  in map reduce
  mode, you should put the jar containing the map and reduce class in
  your classpath
  Jeff Zhang
  On Fri, Nov 27, 2009 at 2:19 PM,   wrote:
  Hello Everybody,
 I have a doubt in Haddop and was wondering if
  anybody has faced a
  similar problem. I have a package called test. Inside that I have
  class called
  A.java, Map.java, Reduce.java. In A.java I have the main method
  where I am trying
  to initialize the jobConf object. I have written
  jobConf.setMapperClass(Map.class) and similarly for the reduce class
  as well. The
  code works correctly when I run the code locally via
  jobConf.set(mapred.job.tracker,local) but I get an exception
  when I try to
  run this code on my cluster. The stack trace of the exception is as
  under. I
  cannot understand the problem. Any help would be appreciated.
  java.lang.RuntimeException: java.lang.RuntimeException:
  java.lang.ClassNotFoundException: test.Map
 at
  org.apache.hadoop.conf.Configuration.getClass(Configuration.java:752)
 at
  org.apache.hadoop.mapred.JobConf.getMapperClass(JobConf.java:690)
 at
  org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
 at
  org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
 at
 
 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
 at org.apache.hadoop.mapred.Child.main(Child.java:158)
  Caused by: java.lang.RuntimeException:
  java.lang.ClassNotFoundException:
  Markowitz.covarMatrixMap
 at
  org.apache.hadoop.conf.Configuration.getClass(Configuration.java:720)
 at
  org.apache.hadoop.conf.Configuration.getClass(Configuration.java:744)
 ... 6 more
  Caused by: java.lang.ClassNotFoundException: test.Map
 at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
 at java.security.AccessController.doPrivileged(Native
  Method)
 at
  java.net.URLClassLoader.findClass(URLClassLoader.java:188)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
 at
  sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
 at
  java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:247)
 at
 
 
 org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:673)
 at
  org.apache.hadoop.conf.Configuration.getClass(Configuration.java:718)
 ... 7 more
  Thank You
  Abhishek Agrawal
  SUNY- Buffalo
  (716-435-7122)
 
 




Re: Doubt in Hadoop

2009-11-26 Thread Jeff Zhang
Do you run the map reduce job from the command line or from an IDE? In map
reduce mode, you should put the jar containing the map and reduce classes in
your classpath.


Jeff Zhang



On Fri, Nov 27, 2009 at 2:19 PM, aa...@buffalo.edu wrote:

 Hello Everybody,
  I have a doubt in Hadoop and was wondering if anybody has faced a similar
  problem. I have a package called test. Inside that I have classes called
  A.java, Map.java, Reduce.java. In A.java I have the main method where I am
  trying to initialize the jobConf object. I have written
  jobConf.setMapperClass(Map.class) and similarly for the reduce class as
  well. The code works correctly when I run the code locally via
  jobConf.set(mapred.job.tracker,local) but I get an exception when I try to
  run this code on my cluster. The stack trace of the exception is as under.
  I cannot understand the problem. Any help would be appreciated.

 java.lang.RuntimeException: java.lang.RuntimeException:
 java.lang.ClassNotFoundException: test.Map
at
 org.apache.hadoop.conf.Configuration.getClass(Configuration.java:752)
at org.apache.hadoop.mapred.JobConf.getMapperClass(JobConf.java:690)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
at
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
at
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
at org.apache.hadoop.mapred.Child.main(Child.java:158)
 Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException:
 Markowitz.covarMatrixMap
at
 org.apache.hadoop.conf.Configuration.getClass(Configuration.java:720)
at
 org.apache.hadoop.conf.Configuration.getClass(Configuration.java:744)
... 6 more
 Caused by: java.lang.ClassNotFoundException: test.Map
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at
 org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:673)
at
 org.apache.hadoop.conf.Configuration.getClass(Configuration.java:718)
... 7 more

 Thank You


 Abhishek Agrawal

 SUNY- Buffalo
 (716-435-7122)






Re: Re: Doubt in Hadoop

2009-11-26 Thread aa225
Hi,
   I am running the job from command line. The job runs fine in the local mode
but something happens when I try to run the job in the distributed mode.


Abhishek Agrawal

SUNY- Buffalo
(716-435-7122)

On Fri 11/27/09  2:31 AM , Jeff Zhang zjf...@gmail.com sent:
 Do you run the map reduce job in command line or IDE?  in map reduce
 mode, you should put the jar containing the map and reduce class in
 your classpath
 Jeff Zhang
 On Fri, Nov 27, 2009 at 2:19 PM,   wrote:
 Hello Everybody,
                I have a doubt in Haddop and was wondering if
 anybody has faced a
 similar problem. I have a package called test. Inside that I have
 class called
 A.java, Map.java, Reduce.java. In A.java I have the main method
 where I am trying
 to initialize the jobConf object. I have written
 jobConf.setMapperClass(Map.class) and similarly for the reduce class
 as well. The
 code works correctly when I run the code locally via
 jobConf.set(mapred.job.tracker,local) but I get an exception
 when I try to
 run this code on my cluster. The stack trace of the exception is as
 under. I
 cannot understand the problem. Any help would be appreciated.
 java.lang.RuntimeException: java.lang.RuntimeException:
 java.lang.ClassNotFoundException: test.Map
        at
 org.apache.hadoop.conf.Configuration.getClass(Configuration.java:752)
        at
 org.apache.hadoop.mapred.JobConf.getMapperClass(JobConf.java:690)
        at
 org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
        at
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
        at
 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)
 Caused by: java.lang.RuntimeException:
 java.lang.ClassNotFoundException:
 Markowitz.covarMatrixMap
        at
 org.apache.hadoop.conf.Configuration.getClass(Configuration.java:720)
        at
 org.apache.hadoop.conf.Configuration.getClass(Configuration.java:744)
        ... 6 more
 Caused by: java.lang.ClassNotFoundException: test.Map
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native
 Method)
        at
 java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at
 sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
        at
 java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at
 
 org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:673)
        at
 org.apache.hadoop.conf.Configuration.getClass(Configuration.java:718)
        ... 7 more
 Thank You
 Abhishek Agrawal
 SUNY- Buffalo
 (716-435-7122)
 
 



Re: Doubt in reducer

2009-08-27 Thread Vladimir Klimontovich

But the reducer can do some preparation during the map process: it can
distribute the map output across the nodes that will work as reducers.

Copying and sorting the map output is also a time-consuming process (maybe
more consuming than the reduce itself). For example, a piece of a job run
log on a 40-node cluster could look like this:

09/08/27 11:08:24 INFO job.JobRunningListener:  map 36% reduce 10%
09/08/27 11:08:28 INFO job.JobRunningListener:  map 37% reduce 10%
09/08/27 11:08:29 INFO job.JobRunningListener:  map 37% reduce 11%

But if you run the job on a single-node cluster, the reduce will start only
after the map has finished.
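
(Purely as an illustration, and assuming the classic 0.20-era property name
rather than anything from this thread: the point at which reducers may begin
copying map output is tunable, as in the sketch below.)

import org.apache.hadoop.mapred.JobConf;

// Illustrative only: the classic MapReduce knob controlling how early the
// reduce copy phase may begin, expressed as the fraction of maps that must
// have completed first.
public class SlowstartExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Let reducers start fetching only after 80% of the maps have completed.
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f);
        System.out.println(conf.get("mapred.reduce.slowstart.completed.maps"));
    }
}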


On Aug 27, 2009, at 4:31 PM, Harish Mallipeddi wrote:

On Thu, Aug 27, 2009 at 5:22 PM, Rakhi Khatwani  
rkhatw...@gmail.com wrote:




but I want my reduce to run, that is, if 25% of the map is done, then I
want the reduce to save that much data, so that even if the 2nd map fails, I
don't lose data.

Any pointers?
Regards,
Raakhi



What you're asking for will break the semantics of reduce(). Reduce can
only proceed after receiving all the map-outputs.

--
Harish Mallipeddi
http://blog.poundbang.in


---
Vladimir Klimontovich,
skype: klimontovich
GoogleTalk/Jabber: klimontov...@gmail.com
Cell phone: +7926 890 2349



Re: Doubt in HBase

2009-08-21 Thread Ryan Rawson
Well, the inputs to those reducers would be the empty set; they wouldn't
have anything to do, and their output would be nil as well.

If you are doing something like this, and your operation is commutative,
consider using a combiner so that you don't shuffle as much data. A large
amount of shuffled data can make map-reduces slower. While map-reduce is a
sorter, shuffling 1500 GB just takes a little while, you know?

You can also set the # of reducers, but the mapping of reduce keys to
reducer instances is random/hashed IIRC. The normative case, however, is a
large number of reduce keys, rather than only a small amount.

Generally speaking, use the combiner functionality. It keeps the data
sizes low.  High reduce counts is better for when you have to shuffle
a lot of data with many distinct reduce keys.
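
To make the combiner suggestion concrete, a generic word-count-style sketch
(not code from this thread; because summing is commutative and associative,
the reducer class can double as the combiner):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Combiner sketch: per-key counts are summed on the map side before the
// shuffle, so far less data crosses the network.
public class SumCombinerExample {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    // Fragment to wire into a driver: the same class serves as combiner and reducer.
    public static void configure(Job job) {
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // local pre-aggregation, cuts shuffle volume
        job.setReducerClass(SumReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
    }
}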

This is getting pretty OT, I suggest revisiting the map-reduce paper
and the hadoop docs.

-ryan

On Thu, Aug 20, 2009 at 9:24 PM, john smithjs1987.sm...@gmail.com wrote:
 Thanks for all your replies, guys. As bharath said, what is the case when
 the number of reducers becomes more than the number of distinct map output
 keys?

 On Fri, Aug 21, 2009 at 9:39 AM, bharath vissapragada 
 bharathvissapragada1...@gmail.com wrote:

 Aamandeep , Gray and Purtell thanks for your replies .. I have found them
 very useful.

 You said to increase the number of reduce tasks . Suppose the number of
 reduce tasks is more than number of distinct map output keys , some of the
 reduce processes may go waste ? is that the case?

 Also  I have one more doubt ..I have 5 values for a corresponding key on
 one
 region  and other 2 values on 2 different region servers.
 Does hadoop Map reduce take care of moving these 2 diff values to the
 region
 with 5 values instead of moving those 5 values to other system to minimize
 the dataflow? Is this what is happening inside ?

 On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell apurt...@apache.org
 wrote:

  The behavior of TableInputFormat is to schedule one mapper for every
 table
  region.
 
  In addition to what others have said already, if your reducer is doing
  little more than storing data back into HBase (via TableOutputFormat),
 then
  you can consider writing results back to HBase directly from the mapper
 to
  avoid incurring the overhead of sort/shuffle/merge which happens within
 the
  Hadoop job framework as map outputs are input into reducers. For that
 type
  of use case -- using the Hadoop mapreduce subsystem as essentially a grid
  scheduler -- something like job.setNumReducers(0) will do the trick.
 
  Best regards,
 
    - Andy
 
 
 
 
  
  From: john smith js1987.sm...@gmail.com
  To: hbase-user@hadoop.apache.org
  Sent: Friday, August 21, 2009 12:42:36 AM
  Subject: Doubt in HBase
 
  Hi all ,
 
  I have one small doubt . Kindly answer it even if it sounds silly.
 
  I am using Map Reduce in HBase in distributed mode.  I have a table which
  spans across 5 region servers. I am using TableInputFormat to read the
  data from the tables in the map. When I run the program, by default how
  many map tasks are created? Is it one per region server or more?

  Also, after the map task is over, the reduce task is taking a bit more
  time. Is it due to moving the map output across the regionservers, i.e.,
  moving the values of the same key to a particular reduce phase to start
  the reducer? Is there any way I can optimize the code (e.g. by storing
  data of the same reducer nearby)?
 
  Thanks :)
 
 
 
 




Re: Doubt in HBase

2009-08-21 Thread Ryan Rawson
hey,

Yes, the hadoop system attempts to assign map tasks data-locally, but why
would you be worried about this for 5 values?  The max value size in hbase
is Integer.MAX_VALUE, so it's not like you have much data to shuffle. Once
your blobs are > ~64mb or so, it might make more sense to use HDFS directly
and keep only the metadata in hbase (including things like the location of
the data blob).

I think people are confused about how optimal map reduces have to be.
Keeping all the data super-local on each machine is not always helping
you, since you have to read via a socket anyways. Going remote doesn't
actually make things that much slower, since on a modern LAN ping times
are < 0.1ms.  If your entire cluster is hanging off a single switch, there
is nearly unlimited bandwidth between all nodes (certainly much higher than
any single system could push).  Only once you go multi-switch does
switch-locality (aka rack locality) become important.

Remember, hadoop isn't about the instantaneous speed of any job, but
about running jobs in a highly scalable manner that works on tens or
tens of thousands of nodes. You end up blocking on single machine
limits anyways, and the r=3 of HDFS helps you transcend a single
machine read speed for large files. Keeping the data transfer local in
this case results in lower performance.

If you want max local speed, I suggest looking at CUDA.


On Thu, Aug 20, 2009 at 9:09 PM, bharath
vissapragadabharathvissapragada1...@gmail.com wrote:
 Aamandeep , Gray and Purtell thanks for your replies .. I have found them
 very useful.

 You said to increase the number of reduce tasks . Suppose the number of
 reduce tasks is more than number of distinct map output keys , some of the
 reduce processes may go waste ? is that the case?

 Also  I have one more doubt ..I have 5 values for a corresponding key on one
 region  and other 2 values on 2 different region servers.
 Does hadoop Map reduce take care of moving these 2 diff values to the region
 with 5 values instead of moving those 5 values to other system to minimize
 the dataflow? Is this what is happening inside ?

 On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell apurt...@apache.org wrote:

 The behavior of TableInputFormat is to schedule one mapper for every table
 region.

 In addition to what others have said already, if your reducer is doing
 little more than storing data back into HBase (via TableOutputFormat), then
 you can consider writing results back to HBase directly from the mapper to
 avoid incurring the overhead of sort/shuffle/merge which happens within the
 Hadoop job framework as map outputs are input into reducers. For that type
 of use case -- using the Hadoop mapreduce subsystem as essentially a grid
 scheduler -- something like job.setNumReducers(0) will do the trick.

 Best regards,

   - Andy




 
 From: john smith js1987.sm...@gmail.com
 To: hbase-user@hadoop.apache.org
 Sent: Friday, August 21, 2009 12:42:36 AM
 Subject: Doubt in HBase

 Hi all ,

 I have one small doubt . Kindly answer it even if it sounds silly.

 Iam using Map Reduce in HBase in distributed mode .  I have a table which
 spans across 5 region servers . I am using TableInputFormat to read the
 data
 from the tables in the map . When i run the program , by default how many
 map regions are created ? Is it one per region server or more ?

 Also after the map task is over.. reduce task is taking a bit more time .
 Is
 it due to moving the map output across the regionservers? i.e, moving the
 values of same key to a particular reduce phase to start the reducer? Is
 there any way i can optimize the code (e.g. by storing data of same reducer
 nearby )

 Thanks :)







Re: Doubt in HBase

2009-08-21 Thread bharath vissapragada
Thanks Ryan

I was just explaining with an example .. I have TBs of data to work with. I
just wanted to know whether the scheduler TRIES to assign the reduce phase
so as to keep the data local (i.e., TRYING to assign it to the machine with
the greater number of key values).

Thanks for your reply (following you on twitter :))

On Fri, Aug 21, 2009 at 12:13 PM, Ryan Rawson ryano...@gmail.com wrote:

 hey,

 Yes the hadoop system attempts to assign map tasks to data local, but
 why would you be worried about this for 5 values?  The max value size
 in hbase is Integer.MAX_VALUE, so it's not like you have much data to
 shuffle. Once your blobs  ~ 64mb or so, it might make more sense to
 use HDFS directly and keep only the metadata in hbase (including
 things like location of the data blob).

 I think people are confused about how optimal map reduces have to be.
 Keeping all the data super-local on each machine is not always helping
 you, since you have to read via a socket anyways. Going remote doesn't
 actually make things that much slower, since on a modern lan ping
 times are  0.1ms.  If your entire cluster is hanging off a single
 switch, there is nearly unlimited bandwidth between all nodes
 (certainly much higher than any single system could push).  Only once
 you go multi-switch then switch-locality (aka rack locality) becomes
 important.

 Remember, hadoop isn't about the instantaneous speed of any job, but
 about running jobs in a highly scalable manner that works on tens or
 tens of thousands of nodes. You end up blocking on single machine
 limits anyways, and the r=3 of HDFS helps you transcend a single
 machine read speed for large files. Keeping the data transfer local in
 this case results in lower performance.

 If you want max local speed, I suggest looking at CUDA.


 On Thu, Aug 20, 2009 at 9:09 PM, bharath
 vissapragadabharathvissapragada1...@gmail.com wrote:
  Aamandeep , Gray and Purtell thanks for your replies .. I have found them
  very useful.
 
  You said to increase the number of reduce tasks . Suppose the number of
  reduce tasks is more than number of distinct map output keys , some of
 the
  reduce processes may go waste ? is that the case?
 
  Also  I have one more doubt ..I have 5 values for a corresponding key on
 one
  region  and other 2 values on 2 different region servers.
  Does hadoop Map reduce take care of moving these 2 diff values to the
 region
  with 5 values instead of moving those 5 values to other system to
 minimize
  the dataflow? Is this what is happening inside ?
 
  On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell apurt...@apache.org
 wrote:
 
  The behavior of TableInputFormat is to schedule one mapper for every
 table
  region.
 
  In addition to what others have said already, if your reducer is doing
  little more than storing data back into HBase (via TableOutputFormat),
 then
  you can consider writing results back to HBase directly from the mapper
 to
  avoid incurring the overhead of sort/shuffle/merge which happens within
 the
  Hadoop job framework as map outputs are input into reducers. For that
 type
  of use case -- using the Hadoop mapreduce subsystem as essentially a
 grid
  scheduler -- something like job.setNumReducers(0) will do the trick.
 
  Best regards,
 
- Andy
 
 
 
 
  
  From: john smith js1987.sm...@gmail.com
  To: hbase-user@hadoop.apache.org
  Sent: Friday, August 21, 2009 12:42:36 AM
  Subject: Doubt in HBase
 
  Hi all ,
 
  I have one small doubt . Kindly answer it even if it sounds silly.
 
  Iam using Map Reduce in HBase in distributed mode .  I have a table
 which
  spans across 5 region servers . I am using TableInputFormat to read the
  data
  from the tables in the map . When i run the program , by default how
 many
  map regions are created ? Is it one per region server or more ?
 
  Also after the map task is over.. reduce task is taking a bit more time
 .
  Is
  it due to moving the map output across the regionservers? i.e, moving
 the
  values of same key to a particular reduce phase to start the reducer? Is
  there any way i can optimize the code (e.g. by storing data of same
 reducer
  nearby )
 
  Thanks :)
 
 
 
 
 



Re: Doubt in HBase

2009-08-21 Thread Jonathan Gray

Ryan,

In older versions of HBase, when we did not attempt any data locality,
we had a few users running jobs that became network i/o bound.  It
wasn't a latency issue; it was a bandwidth issue.


That's actually when/why an attempt at better data locality for HBase MR 
was made in the first place.


I hadn't personally experienced it but I recall two users who had. 
After they made a first-stab patch, I ran some comparisons and noticed a 
significant reduction in network i/o for data-intensive MR jobs.  They 
also were no longer network i/o bound on their jobs, if I recall, and 
became disk i/o bound (as one would expect/hope).


For a majority of use cases, it doesn't matter in a significant way at 
all.  But I have seen it make a measurable difference for some.


JG

bharath vissapragada wrote:

Thanks Ryan

I was just explaining with an example .. I have TBs of data to work
with.Just i wanted to know that scheduler TRIES to assign the reduce phase
to keep the data local (i.e.,TRYING  to assign it to the machine with
machine with greater num of key values).
I was just explaining it with an example .

Thanks for ur reply (following u on twitter :))

On Fri, Aug 21, 2009 at 12:13 PM, Ryan Rawson ryano...@gmail.com wrote:


hey,

Yes the hadoop system attempts to assign map tasks to data local, but
why would you be worried about this for 5 values?  The max value size
in hbase is Integer.MAX_VALUE, so it's not like you have much data to
shuffle. Once your blobs  ~ 64mb or so, it might make more sense to
use HDFS directly and keep only the metadata in hbase (including
things like location of the data blob).

I think people are confused about how optimal map reduces have to be.
Keeping all the data super-local on each machine is not always helping
you, since you have to read via a socket anyways. Going remote doesn't
actually make things that much slower, since on a modern lan ping
times are  0.1ms.  If your entire cluster is hanging off a single
switch, there is nearly unlimited bandwidth between all nodes
(certainly much higher than any single system could push).  Only once
you go multi-switch then switch-locality (aka rack locality) becomes
important.

Remember, hadoop isn't about the instantaneous speed of any job, but
about running jobs in a highly scalable manner that works on tens or
tens of thousands of nodes. You end up blocking on single machine
limits anyways, and the r=3 of HDFS helps you transcend a single
machine read speed for large files. Keeping the data transfer local in
this case results in lower performance.

If you want max local speed, I suggest looking at CUDA.


On Thu, Aug 20, 2009 at 9:09 PM, bharath
vissapragadabharathvissapragada1...@gmail.com wrote:

Aamandeep , Gray and Purtell thanks for your replies .. I have found them
very useful.

You said to increase the number of reduce tasks . Suppose the number of
reduce tasks is more than number of distinct map output keys , some of

the

reduce processes may go waste ? is that the case?

Also  I have one more doubt ..I have 5 values for a corresponding key on

one

region  and other 2 values on 2 different region servers.
Does hadoop Map reduce take care of moving these 2 diff values to the

region

with 5 values instead of moving those 5 values to other system to

minimize

the dataflow? Is this what is happening inside ?

On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell apurt...@apache.org

wrote:

The behavior of TableInputFormat is to schedule one mapper for every

table

region.

In addition to what others have said already, if your reducer is doing
little more than storing data back into HBase (via TableOutputFormat),

then

you can consider writing results back to HBase directly from the mapper

to

avoid incurring the overhead of sort/shuffle/merge which happens within

the

Hadoop job framework as map outputs are input into reducers. For that

type

of use case -- using the Hadoop mapreduce subsystem as essentially a

grid

scheduler -- something like job.setNumReducers(0) will do the trick.

Best regards,

  - Andy





From: john smith js1987.sm...@gmail.com
To: hbase-user@hadoop.apache.org
Sent: Friday, August 21, 2009 12:42:36 AM
Subject: Doubt in HBase

Hi all ,

I have one small doubt . Kindly answer it even if it sounds silly.

Iam using Map Reduce in HBase in distributed mode .  I have a table

which

spans across 5 region servers . I am using TableInputFormat to read the
data
from the tables in the map . When i run the program , by default how

many

map regions are created ? Is it one per region server or more ?

Also after the map task is over.. reduce task is taking a bit more time

.

Is
it due to moving the map output across the regionservers? i.e, moving

the

values of same key to a particular reduce phase to start the reducer? Is
there any way i can optimize the code (e.g. by storing data of same

reducer

nearby )

Thanks :)








Re: Doubt in HBase

2009-08-21 Thread bharath vissapragada
JG

Can you please elaborate on the last statement (for some) by giving an
example or some kind of scenario in which it can take place where MR jobs
involve a huge amount of data?

Thanks.

On Fri, Aug 21, 2009 at 11:24 PM, Jonathan Gray jl...@streamy.com wrote:

 Ryan,

 In older versions of HBase, when we did not attempt any data locality, we
 had a few users running jobs that became network i/o bound.  It wasn't a
 latency issue it was a bandwidth issue.

 That's actually when/why an attempt at better data locality for HBase MR
 was made in the first place.

 I hadn't personally experienced it but I recall two users who had. After
 they made a first-stab patch, I ran some comparisons and noticed a
 significant reduction in network i/o for data-intensive MR jobs.  They also
 were no longer network i/o bound on their jobs, if I recall, and became disk
 i/o bound (as one would expect/hope).

 For a majority of use cases, it doesn't matter in a significant way at all.
  But I have seen it make a measurable difference for some.

 JG


 bharath vissapragada wrote:

 Thanks Ryan

 I was just explaining with an example .. I have TBs of data to work
 with.Just i wanted to know that scheduler TRIES to assign the reduce phase
 to keep the data local (i.e.,TRYING  to assign it to the machine with
 machine with greater num of key values).
 I was just explaining it with an example .

 Thanks for ur reply (following u on twitter :))

 On Fri, Aug 21, 2009 at 12:13 PM, Ryan Rawson ryano...@gmail.com wrote:

  hey,

 Yes the hadoop system attempts to assign map tasks to data local, but
 why would you be worried about this for 5 values?  The max value size
 in hbase is Integer.MAX_VALUE, so it's not like you have much data to
 shuffle. Once your blobs  ~ 64mb or so, it might make more sense to
 use HDFS directly and keep only the metadata in hbase (including
 things like location of the data blob).

 I think people are confused about how optimal map reduces have to be.
 Keeping all the data super-local on each machine is not always helping
 you, since you have to read via a socket anyways. Going remote doesn't
 actually make things that much slower, since on a modern lan ping
 times are  0.1ms.  If your entire cluster is hanging off a single
 switch, there is nearly unlimited bandwidth between all nodes
 (certainly much higher than any single system could push).  Only once
 you go multi-switch then switch-locality (aka rack locality) becomes
 important.

 Remember, hadoop isn't about the instantaneous speed of any job, but
 about running jobs in a highly scalable manner that works on tens or
 tens of thousands of nodes. You end up blocking on single machine
 limits anyways, and the r=3 of HDFS helps you transcend a single
 machine read speed for large files. Keeping the data transfer local in
 this case results in lower performance.

 If you want max local speed, I suggest looking at CUDA.


 On Thu, Aug 20, 2009 at 9:09 PM, bharath
 vissapragadabharathvissapragada1...@gmail.com wrote:

 Aamandeep , Gray and Purtell thanks for your replies .. I have found
 them
 very useful.

 You said to increase the number of reduce tasks . Suppose the number of
 reduce tasks is more than number of distinct map output keys , some of

 the

 reduce processes may go waste ? is that the case?

 Also  I have one more doubt ..I have 5 values for a corresponding key on

 one

 region  and other 2 values on 2 different region servers.
 Does hadoop Map reduce take care of moving these 2 diff values to the

 region

 with 5 values instead of moving those 5 values to other system to

 minimize

 the dataflow? Is this what is happening inside ?

 On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell apurt...@apache.org

 wrote:

 The behavior of TableInputFormat is to schedule one mapper for every

 table

 region.

 In addition to what others have said already, if your reducer is doing
 little more than storing data back into HBase (via TableOutputFormat),

 then

 you can consider writing results back to HBase directly from the mapper

 to

 avoid incurring the overhead of sort/shuffle/merge which happens within

 the

 Hadoop job framework as map outputs are input into reducers. For that

 type

 of use case -- using the Hadoop mapreduce subsystem as essentially a

 grid

 scheduler -- something like job.setNumReducers(0) will do the trick.

 Best regards,

  - Andy




 
 From: john smith js1987.sm...@gmail.com
 To: hbase-user@hadoop.apache.org
 Sent: Friday, August 21, 2009 12:42:36 AM
 Subject: Doubt in HBase

 Hi all ,

 I have one small doubt . Kindly answer it even if it sounds silly.

 Iam using Map Reduce in HBase in distributed mode .  I have a table

 which

 spans across 5 region servers . I am using TableInputFormat to read the
 data
 from the tables in the map . When i run the program , by default how

 many

 map regions are created ? Is it one per region server or more ?

 Also after the map task is 

Re: Doubt in HBase

2009-08-21 Thread Jonathan Gray

I really couldn't be specific.

The more data that has to be moved across the wire, the more network i/o.

For example, if you have very large values and a very large table, and you
have that as the input to your MR, you could potentially be network i/o
bound.


It should be very easy to test how your own jobs run on your own cluster 
using Ganglia and hadoop/mr logging/output.


bharath vissapragada wrote:

JG

Can you please elaborate on the last statement for some.. by giving an
example or some kind of scenario in which it can take place where MR jobs
involve huge amount of data.

Thanks.

On Fri, Aug 21, 2009 at 11:24 PM, Jonathan Gray jl...@streamy.com wrote:


Ryan,

In older versions of HBase, when we did not attempt any data locality, we
had a few users running jobs that became network i/o bound.  It wasn't a
latency issue it was a bandwidth issue.

That's actually when/why an attempt at better data locality for HBase MR
was made in the first place.

I hadn't personally experienced it but I recall two users who had. After
they made a first-stab patch, I ran some comparisons and noticed a
significant reduction in network i/o for data-intensive MR jobs.  They also
were no longer network i/o bound on their jobs, if I recall, and became disk
i/o bound (as one would expect/hope).

For a majority of use cases, it doesn't matter in a significant way at all.
 But I have seen it make a measurable difference for some.

JG


bharath vissapragada wrote:


Thanks Ryan

I was just explaining with an example .. I have TBs of data to work
with.Just i wanted to know that scheduler TRIES to assign the reduce phase
to keep the data local (i.e.,TRYING  to assign it to the machine with
machine with greater num of key values).
I was just explaining it with an example .

Thanks for ur reply (following u on twitter :))

On Fri, Aug 21, 2009 at 12:13 PM, Ryan Rawson ryano...@gmail.com wrote:

 hey,

Yes the hadoop system attempts to assign map tasks to data local, but
why would you be worried about this for 5 values?  The max value size
in hbase is Integer.MAX_VALUE, so it's not like you have much data to
shuffle. Once your blobs  ~ 64mb or so, it might make more sense to
use HDFS directly and keep only the metadata in hbase (including
things like location of the data blob).

I think people are confused about how optimal map reduces have to be.
Keeping all the data super-local on each machine is not always helping
you, since you have to read via a socket anyways. Going remote doesn't
actually make things that much slower, since on a modern lan ping
times are  0.1ms.  If your entire cluster is hanging off a single
switch, there is nearly unlimited bandwidth between all nodes
(certainly much higher than any single system could push).  Only once
you go multi-switch then switch-locality (aka rack locality) becomes
important.

Remember, hadoop isn't about the instantaneous speed of any job, but
about running jobs in a highly scalable manner that works on tens or
tens of thousands of nodes. You end up blocking on single machine
limits anyways, and the r=3 of HDFS helps you transcend a single
machine read speed for large files. Keeping the data transfer local in
this case results in lower performance.

If you want max local speed, I suggest looking at CUDA.


On Thu, Aug 20, 2009 at 9:09 PM, bharath
vissapragadabharathvissapragada1...@gmail.com wrote:


Aamandeep , Gray and Purtell thanks for your replies .. I have found them
very useful.

You said to increase the number of reduce tasks . Suppose the number of
reduce tasks is more than number of distinct map output keys , some of the
reduce processes may go waste ? is that the case?

Also  I have one more doubt ..I have 5 values for a corresponding key on one
region  and other 2 values on 2 different region servers.
Does hadoop Map reduce take care of moving these 2 diff values to the region
with 5 values instead of moving those 5 values to other system to minimize
the dataflow? Is this what is happening inside ?

On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell apurt...@apache.org


wrote:


The behavior of TableInputFormat is to schedule one mapper for every
table
region.

In addition to what others have said already, if your reducer is doing
little more than storing data back into HBase (via TableOutputFormat),


then
you can consider writing results back to HBase directly from the mapper
to
avoid incurring the overhead of sort/shuffle/merge which happens within
the
Hadoop job framework as map outputs are input into reducers. For that
type
of use case -- using the Hadoop mapreduce subsystem as essentially a
grid
scheduler -- something like job.setNumReducers(0) will do the trick.

Best regards,

 - Andy
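
A rough sketch of such a map-only write-back job, using the newer org.apache.hadoop.hbase.mapreduce API rather than the older mapred API that appears elsewhere on this list; the table names ("src", "dst") and the cf:q column are placeholders, and method names such as Put.addColumn vary slightly between HBase versions:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyCopy {

  // Reads each row of the source table and writes a Put straight back out.
  static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException, InterruptedException {
      byte[] v = value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("q"));
      if (v == null) return;
      Put put = new Put(row.get());
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), v);
      context.write(row, put);  // goes to TableOutputFormat, no reduce phase
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "map-only copy");
    job.setJarByClass(MapOnlyCopy.class);
    TableMapReduceUtil.initTableMapperJob("src", new Scan(), CopyMapper.class,
        ImmutableBytesWritable.class, Put.class, job);
    // Null reducer: only wires up TableOutputFormat for the output table.
    TableMapReduceUtil.initTableReducerJob("dst", null, job);
    job.setNumReduceTasks(0);  // the modern spelling of "setNumReducers(0)"
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With zero reduce tasks the mapper's Puts skip the sort/shuffle/merge entirely, which is the point Andy is making above.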





From: john smith js1987.sm...@gmail.com
To: hbase-user@hadoop.apache.org
Sent: Friday, August 21, 2009 12:42:36 AM
Subject: Doubt in HBase

Hi all ,

I have one small doubt . Kindly answer it even if it sounds silly.


Re: Doubt in HBase

2009-08-21 Thread bharath vissapragada
JG,

In one of your above replies, you said that data locality was not
considered in older versions of HBase. Is there any development on the
same in 0.20 RC1/2 or 0.19.x? If not, can you tell me where that patch is
available so that I can test my programs.

Thanks in advance

On Sat, Aug 22, 2009 at 12:12 AM, Jonathan Gray jl...@streamy.com wrote:

 I really couldn't be specific.

 The more data that has to be moved across the wire, the more network i/o.

 For example, if you have very large values, and a very large table, and you
 have that as the input to your MR.  You could potentially be network i/o
 bound.

 It should be very easy to test how your own jobs run on your own cluster
 using Ganglia and hadoop/mr logging/output.


 bharath vissapragada wrote:

 JG

 Can you please elaborate on the last statement for some.. by giving an
 example or some kind of scenario in which it can take place where MR jobs
 involve huge amount of data.

 Thanks.

 On Fri, Aug 21, 2009 at 11:24 PM, Jonathan Gray jl...@streamy.com
 wrote:

  Ryan,

 In older versions of HBase, when we did not attempt any data locality, we
 had a few users running jobs that became network i/o bound.  It wasn't a
 latency issue it was a bandwidth issue.

 That's actually when/why an attempt at better data locality for HBase MR
 was made in the first place.

 I hadn't personally experienced it but I recall two users who had. After
 they made a first-stab patch, I ran some comparisons and noticed a
 significant reduction in network i/o for data-intensive MR jobs.  They
 also
 were no longer network i/o bound on their jobs, if I recall, and became
 disk
 i/o bound (as one would expect/hope).

 For a majority of use cases, it doesn't matter in a significant way at
 all.
  But I have seen it make a measurable difference for some.

 JG


 bharath vissapragada wrote:

  Thanks Ryan

 I was just explaining with an example .. I have TBs of data to work
 with. I just wanted to know whether the scheduler TRIES to assign the reduce
 phase so as to keep the data local (i.e., TRYING to assign it to the machine
 with the greater number of key values).

 Thanks for ur reply (following u on twitter :))

 On Fri, Aug 21, 2009 at 12:13 PM, Ryan Rawson ryano...@gmail.com
 wrote:

  hey,

 Yes the hadoop system attempts to assign map tasks to data local, but
 why would you be worried about this for 5 values?  The max value size
 in hbase is Integer.MAX_VALUE, so it's not like you have much data to
 shuffle. Once your blobs > ~64MB or so, it might make more sense to
 use HDFS directly and keep only the metadata in hbase (including
 things like location of the data blob).

 I think people are confused about how optimal map reduces have to be.
 Keeping all the data super-local on each machine is not always helping
 you, since you have to read via a socket anyways. Going remote doesn't
 actually make things that much slower, since on a modern lan ping
 times are < 0.1ms.  If your entire cluster is hanging off a single
 switch, there is nearly unlimited bandwidth between all nodes
 (certainly much higher than any single system could push).  Only once
 you go multi-switch then switch-locality (aka rack locality) becomes
 important.

 Remember, hadoop isn't about the instantaneous speed of any job, but
 about running jobs in a highly scalable manner that works on tens or
 tens of thousands of nodes. You end up blocking on single machine
 limits anyways, and the r=3 of HDFS helps you transcend a single
 machine read speed for large files. Keeping the data transfer local in
 this case results in lower performance.

 If you want max local speed, I suggest looking at CUDA.


 On Thu, Aug 20, 2009 at 9:09 PM, bharath
 vissapragadabharathvissapragada1...@gmail.com wrote:

 Aamandeep , Gray and Purtell thanks for your replies .. I have found them
 very useful.

 You said to increase the number of reduce tasks . Suppose the number of
 reduce tasks is more than number of distinct map output keys , some of the
 reduce processes may go waste ? is that the case?

 Also  I have one more doubt ..I have 5 values for a corresponding key on one
 region  and other 2 values on 2 different region servers.
 Does hadoop Map reduce take care of moving these 2 diff values to the region
 with 5 values instead of moving those 5 values to other system to minimize
 the dataflow? Is this what is happening inside ?

 On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell apurt...@apache.org

  wrote:

  The behavior of TableInputFormat is to schedule one mapper for every
 table
 region.

 In addition to what others have said already, if your reducer is
 doing
 little more than storing data back into HBase (via
 TableOutputFormat),

  then
 you can consider writing results back to HBase directly from the
 mapper
 to
 avoid incurring the overhead of sort/shuffle/merge which happens
 within
 the
 Hadoop job framework as map outputs are 

Re: Doubt in HBase

2009-08-20 Thread Amandeep Khurana
On Thu, Aug 20, 2009 at 9:42 AM, john smith js1987.sm...@gmail.com wrote:

 Hi all ,

 I have one small doubt . Kindly answer it even if it sounds silly.


No questions are silly.. Dont worry



 Iam using Map Reduce in HBase in distributed mode .  I have a table which
 spans across 5 region servers . I am using TableInputFormat to read the
 data
 from the tables in the map . When i run the program , by default how many
 map regions are created ? Is it one per region server or more ?


If you set the number of map tasks to a high number, it automatically spawns
one map task for each region (not region server). Otherwise, it'll spawn the
number you have explicitly specified in the job.
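
Since TableInputFormat schedules one split (and so one map task) per region by default, you can estimate the map task count up front by counting regions. A small sketch with the newer client API (Connection/RegionLocator); the table name is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;

public class RegionCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         RegionLocator locator = conn.getRegionLocator(TableName.valueOf("mytable"))) {
      // one TableInputFormat split per region, hence roughly one mapper each
      int regions = locator.getAllRegionLocations().size();
      System.out.println("Expect roughly " + regions + " map tasks");
    }
  }
}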



 Also after the map task is over.. reduce task is taking a bit more time .
 Is
 it due to moving the map output across the regionservers? i.e, moving the
 values of same key to a particular reduce phase to start the reducer? Is
 there any way i can optimize the code (e.g. by storing data of same reducer
 nearby )


Increase the number of reducers. Each reducer will have lesser data to move.



 Thanks :)



Re: Doubt in HBase

2009-08-20 Thread Jonathan Gray

What Amandeep said.

Also, one clarification for you.  You mentioned the reduce task moving 
map output across regionservers.  Remember, HBase is just a MapReduce 
input source or output sink.  The sort/shuffle/reduce is a part of 
Hadoop MapReduce and has nothing to do with HBase directly.  It is 
utilizing the JobTracker/TaskTrackers, not the RegionServers.


Like AK said, you can increase the number of reducers, or reduce the 
amount of data you output from the maps.
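
A hedged sketch of those two knobs for a plain Hadoop job: raise the reduce task count, and add a combiner so less map output crosses the wire. It assumes the job's map output is Text/IntWritable counts; adapt the types and the reduce task count to your own job:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class ShuffleTuning {

  // A pre-aggregating combiner: same shape as a reducer, but it runs on the map side
  // and shrinks the map output before the shuffle.
  public static class SumCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void configure(Job job) {
    job.setNumReduceTasks(16);               // more reducers, so less data per reducer to move
    job.setCombinerClass(SumCombiner.class); // less map output to move in the first place
  }
}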


JG

Amandeep Khurana wrote:

On Thu, Aug 20, 2009 at 9:42 AM, john smith js1987.sm...@gmail.com wrote:


Hi all ,

I have one small doubt . Kindly answer it even if it sounds silly.



No questions are silly.. Dont worry



Iam using Map Reduce in HBase in distributed mode .  I have a table which
spans across 5 region servers . I am using TableInputFormat to read the
data
from the tables in the map . When i run the program , by default how many
map regions are created ? Is it one per region server or more ?



If you set the number of map tasks to a high number, it automatically spawns
one map task for each region (not region server). Otherwise, it'll spawn the
number you have explicitly specified in the job.



Also after the map task is over.. reduce task is taking a bit more time .
Is
it due to moving the map output across the regionservers? i.e, moving the
values of same key to a particular reduce phase to start the reducer? Is
there any way i can optimize the code (e.g. by storing data of same reducer
nearby )



Increase the number of reducers. Each reducer will have lesser data to move.



Thanks :)





Re: Doubt in HBase

2009-08-20 Thread bharath vissapragada
Aamandeep , Gray and Purtell thanks for your replies .. I have found them
very useful.

You said to increase the number of reduce tasks . Suppose the number of
reduce tasks is more than number of distinct map output keys , some of the
reduce processes may go waste ? is that the case?

Also  I have one more doubt ..I have 5 values for a corresponding key on one
region  and other 2 values on 2 different region servers.
Does hadoop Map reduce take care of moving these 2 diff values to the region
with 5 values instead of moving those 5 values to other system to minimize
the dataflow? Is this what is happening inside ?

On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell apurt...@apache.org wrote:

 The behavior of TableInputFormat is to schedule one mapper for every table
 region.

 In addition to what others have said already, if your reducer is doing
 little more than storing data back into HBase (via TableOutputFormat), then
 you can consider writing results back to HBase directly from the mapper to
 avoid incurring the overhead of sort/shuffle/merge which happens within the
 Hadoop job framework as map outputs are input into reducers. For that type
 of use case -- using the Hadoop mapreduce subsystem as essentially a grid
 scheduler -- something like job.setNumReducers(0) will do the trick.

 Best regards,

   - Andy




 
 From: john smith js1987.sm...@gmail.com
 To: hbase-user@hadoop.apache.org
 Sent: Friday, August 21, 2009 12:42:36 AM
 Subject: Doubt in HBase

 Hi all ,

 I have one small doubt . Kindly answer it even if it sounds silly.

 Iam using Map Reduce in HBase in distributed mode .  I have a table which
 spans across 5 region servers . I am using TableInputFormat to read the
 data
 from the tables in the map . When i run the program , by default how many
 map regions are created ? Is it one per region server or more ?

 Also after the map task is over.. reduce task is taking a bit more time .
 Is
 it due to moving the map output across the regionservers? i.e, moving the
 values of same key to a particular reduce phase to start the reducer? Is
 there any way i can optimize the code (e.g. by storing data of same reducer
 nearby )

 Thanks :)






Re: Doubt in HBase

2009-08-20 Thread john smith
Thanks for all your replies guys. As bharath said, what is the case when
the number of reducers becomes more than the number of distinct map output keys?

On Fri, Aug 21, 2009 at 9:39 AM, bharath vissapragada 
bharathvissapragada1...@gmail.com wrote:

 Aamandeep , Gray and Purtell thanks for your replies .. I have found them
 very useful.

 You said to increase the number of reduce tasks . Suppose the number of
 reduce tasks is more than number of distinct map output keys , some of the
 reduce processes may go waste ? is that the case?

 Also  I have one more doubt ..I have 5 values for a corresponding key on
 one
 region  and other 2 values on 2 different region servers.
 Does hadoop Map reduce take care of moving these 2 diff values to the
 region
 with 5 values instead of moving those 5 values to other system to minimize
 the dataflow? Is this what is happening inside ?

 On Fri, Aug 21, 2009 at 9:03 AM, Andrew Purtell apurt...@apache.org
 wrote:

  The behavior of TableInputFormat is to schedule one mapper for every
 table
  region.
 
  In addition to what others have said already, if your reducer is doing
  little more than storing data back into HBase (via TableOutputFormat),
 then
  you can consider writing results back to HBase directly from the mapper
 to
  avoid incurring the overhead of sort/shuffle/merge which happens within
 the
  Hadoop job framework as map outputs are input into reducers. For that
 type
  of use case -- using the Hadoop mapreduce subsystem as essentially a grid
  scheduler -- something like job.setNumReducers(0) will do the trick.
 
  Best regards,
 
- Andy
 
 
 
 
  
  From: john smith js1987.sm...@gmail.com
  To: hbase-user@hadoop.apache.org
  Sent: Friday, August 21, 2009 12:42:36 AM
  Subject: Doubt in HBase
 
  Hi all ,
 
  I have one small doubt . Kindly answer it even if it sounds silly.
 
  Iam using Map Reduce in HBase in distributed mode .  I have a table which
  spans across 5 region servers . I am using TableInputFormat to read the
  data
  from the tables in the map . When i run the program , by default how many
  map regions are created ? Is it one per region server or more ?
 
  Also after the map task is over.. reduce task is taking a bit more time .
  Is
  it due to moving the map output across the regionservers? i.e, moving the
  values of same key to a particular reduce phase to start the reducer? Is
  there any way i can optimize the code (e.g. by storing data of same
 reducer
  nearby )
 
  Thanks :)
 
 
 
 



Re: Doubt regarding mem cache.

2009-08-12 Thread Erik Holstad
Hi Rakhi!

On Wed, Aug 12, 2009 at 11:49 AM, Rakhi Khatwani rkhatw...@gmail.comwrote:

 Hi,
 I am not very clear as to how does the mem cache thing works.


MemCache was a name that was used and caused some confusion about what its
purpose is.
It has now been renamed to MemStore and is basically a write buffer that
gets flushed to disk/HDFS
when it is too big.


 1. When you set memcache to say 1MB, does hbase write all the table
 information into some cache memory and when the size reaches 1MB, it writes
 into hadoop and after that the replication takes place???

So yeah, kinda


 2. Is there any minimum limit on the mem cache property in hbase site??

Not sure if there is a minimum limit but have a look at hbase-default.xml and
you will see a bunch of settings for memStore.
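
For example, the per-region flush threshold looks roughly like the following in hbase-site.xml; this assumes the 0.20-era property name (older releases spell it hbase.hregion.memcache.flush.size), so check your own hbase-default.xml for the exact name and default:

<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <!-- flush the per-region write buffer to HDFS once it reaches ~32 MB -->
  <value>33554432</value>
</property>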


 Regards,
 Raakhi


Regards Erik


Re: Doubt regarding Replication Factor

2009-08-12 Thread Konstantin Shvachko

You can try it: start a 3 node cluster and create a file with replication 5.

The answer is that each data-node can store only one replica of a block.
So in your case you will get an exception on close() saying the file cannot
be fully replicated.

Thanks,
--Konstantin
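
A small sketch of that experiment from the Java API side, asking for 5 replicas at create time; the path and buffer size are arbitrary:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/tmp/replication-demo.txt");

    // Request replication = 5 at create time.
    FSDataOutputStream out =
        fs.create(p, true, 4096, (short) 5, fs.getDefaultBlockSize(p));
    out.writeUTF("hello");
    // On a 3-node cluster, this is where the "cannot be fully replicated"
    // complaint described above would surface, since each datanode can hold
    // at most one replica of a block.
    out.close();
  }
}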

Rakhi Khatwani wrote:

Hi,
   I just wanted to know what if we have set the replication factor greater
than the number of nodes in the cluster.
for example, i have only 3 nodes in my cluster but i set the replication
factor to 5.

will it create 3 copies and save it in each node, or can it create more than
one copy per node?

Regards,
Raakhi Khatwani



Re: Doubt regarding Replication Factor

2009-08-12 Thread Tarandeep Singh
A similar question-

If in an N node cluster, a file's replication is set to N (replicate on each
node) and later if a node goes down, will HDFS throw an exception since the
file's replication has gone down below the specified number ?

Thanks,
Tarandeep

On Wed, Aug 12, 2009 at 12:11 PM, Konstantin Shvachko s...@yahoo-inc.comwrote:

 You can try it: start a 3 node cluster and create a file with replication
 5.

 The answer is that each data-node can store only one replica of a block.
 So in your case you will get an exception on close() saying the file cannot
 be fully replicated.

 Thanks,
 --Konstantin


 Rakhi Khatwani wrote:

 Hi,
   I just wanted to know what if we have set the replication factor greater
 than the number of nodes in the cluster.
 for example, i have only 3 nodes in my cluster but i set the replication
 factor to 5.

 will it create 3 copies and save it in each node, or can it create more
 than
 one copy per node?

 Regards,
 Raakhi Khatwani




Re: Doubt in implementing TableReduce Interface

2009-07-28 Thread Ninad Raut
the method looks fine. Put some logging inside the reduce method to trace
the inputs to the reduce.  Here's an example... change IntWritable to Text
in your case...
 static class ReadTableReduce2 extends MapReduceBase implements
TableReduce<Text, IntWritable> {

  SortedMap<Text, Text> buzz = new TreeMap<Text, Text>();

  @Override
  public void reduce(Text key,
  Iterator<IntWritable> values,
  OutputCollector<ImmutableBytesWritable, BatchUpdate> output,
  Reporter report) throws IOException {

  // sum all the counts emitted for this key by the map phase
  Integer sum = 0;
  while (values.hasNext()) {
  sum += values.next().get();
  }
  if (sum >= 3) {
  // rowCounter and Counters are class-level members assumed elsewhere
  BatchUpdate outval = new BatchUpdate(rowCounter.toString());
  String keyStr = key.toString();
  String[] keyArr = keyStr.split(":");
  outval.put("buzzcount:" + keyArr[1], sum.toString().getBytes());
  report.incrCounter(Counters.REDUCE_LINES, 1);
  report.setStatus("sum:" + sum);
  rowCounter++;
  output.collect(new ImmutableBytesWritable(key.getBytes()), outval);
  }

  }


On Fri, Jul 24, 2009 at 11:29 PM, bharath vissapragada 
bhara...@students.iiit.ac.in wrote:

 Hi all,

 i have implemented the TableMap interface successfully to emit <Text, Text>
 pairs .. so now i must implement the TableReduce interface to receive those
 <Text, Text> pairs correspondingly ...  Is the following code correct?

 public class MyTableReduce
 extends MapReduceBase
 implements TableReduce<Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
  OutputCollector<ImmutableBytesWritable, BatchUpdate> output,
  @SuppressWarnings("unused") Reporter reporter)
  throws IOException {
  }
 }



Re: Doubt regarding permissions

2009-04-13 Thread Tsz Wo (Nicholas), Sze

Hi Amar,

I just have tried.  Everything worked as expected.  I guess user A in your 
experiment was a superuser so that he could read anything.

Nicholas Sze

/// permission testing //
drwx-wx-wx   - nicholas supergroup  0 2009-04-13 10:55 /temp
drwx-w--w-   - tsz supergroup  0 2009-04-13 10:58 /temp/test
-rw-r--r--   3 tsz supergroup   1366 2009-04-13 10:58 /temp/test/r.txt

//login as nicholas (non-superuser)

$ whoami
nicholas

$ ./bin/hadoop fs -lsr /temp
drwx-w--w-   - tsz supergroup  0 2009-04-13 10:58 /temp/test
lsr: could not get get listing for 'hdfs://:9000/temp/test' : 
org.apache.hadoop.security.AccessControlException: Permission denied: 
user=nicholas, access=READ_EXECUTE, inode=test:tsz:supergroup:rwx-w--w-

$ ./bin/hadoop fs -cat /temp/test/r.txt
cat: org.apache.hadoop.security.AccessControlException: Permission denied: 
user=nicholas, access=EXECUTE, inode=test:tsz:supergroup:rwx-w--w-



- Original Message 
 From: Amar Kamat ama...@yahoo-inc.com
 To: core-user@hadoop.apache.org
 Sent: Monday, April 13, 2009 2:02:24 AM
 Subject: Doubt regarding permissions
 
 Hey, I tried the following :
 
 -  created a dir temp for user A and permission 733
 
 -  created a dir temp/test for user B and permission 722
 
 -  created a file temp/test/test.txt for user B and permission 722
 
 
 
 Now in HDFS, user A can list as well as read the contents of file
 temp/test/test.txt while on my RHEL box I cant. Is it a feature or a
 bug. Can someone please try this out and confirm?
 
 
 
 Thanks
 
 Amar



Re: Doubt in MultiFileWordCount.java

2008-09-29 Thread Arun C Murthy


On Sep 29, 2008, at 3:11 AM, Geethajini C wrote:


Hi everyone,

In the example MultiFileWordCount.java
(hadoop-0.17.0), what happens when the statement
JobClient.runJob(job); is executed? What methods will be
called in sequence?




This might help:
http://hadoop.apache.org/core/docs/r0.17.2/api/org/apache/hadoop/mapred/JobClient.html

Arun


Re: Doubt in RegExpRowFilter and RowFilters in general

2008-02-11 Thread stack
Have you tried enabling DEBUG-level logging?  Filters have lots of 
logging around state changes.  Might help figure this issue.  You might 
need to add extra logging around line #2401 in HStore.
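
For example, adding something like the following line to conf/log4j.properties turns on DEBUG for the HBase classes (exact logger/package names may differ by version), so the filter state-change logging mentioned above shows up in the region server logs:

# enable DEBUG for all HBase classes, including the row filters
log4j.logger.org.apache.hadoop.hbase=DEBUG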


(I just spent some time trying to bend my head around whats going on.  
Filters are run at the Store level.  It looks like that in 
RegExpRowFilter, a map is made on construction of column to value.  If 
value matches, filter returns false, so cell should be added in each 
family.  I don't see anything obviously wrong in here).


St.Ack


David Alves wrote:

St.Ack

Thanks for your reply.

When I use RegExpRowFilter with only one (either one) of the conditions
it works (the rows are passed onto the Map/Reduce task), but there is
still a problem because only one of the columns is present in the
resulting MapWritable (I'm using my own tableinputformat) from the
scanner.
So I still use the filter to check for more rows (build a scanner with
one of the conditions, the rarest one, and iterate through to try and find
the other) but not in the tableinputformat itself (I just discard the
unwanted values in the Mapper), which is a performance hit (if it were
in the scanner the row simply wouldn't be sent to the master, right,
therefore less traffic in distributed mode?), but no big deal.
It seems to me that when the filter is applied only the column that
matches (or the one that doesn't match, I'm not sure at the moment) is
passed to the scanner result.

As to the second point I'm running HBase in local mode for development
and the DEBUG log for the HMaster shows nothing, my process simply hangs
indefinitely.

When I'll have some free time I'll try to look into the sources, and
pinpoint the problem more accurately.

David

On Mon, 2008-02-11 at 10:36 -0800, stack wrote:
  

David:

<disclaimer>IMO, filters are a bit of sweet functionality but they are 
not easy to use.  They also have seen little exercise so you are 
probably tripping over bugs.  That said, I know they basically 
work.</disclaimer>


I'd suggest you progress from basic filtering toward the filter you'd 
like to implement.   Does the RegExpRowFilter do the right thing when 
filtering one column only?


On the ClassNotFoundException, yeah, it should be coming out on the 
client.  Can you see it in the server logs?  Do you get any exceptions 
client-side?


St.Ack



David Alves wrote:


Hi Again

In my previous example I seem to have misplaced a new keyword (new
"myvalue1".getBytes() where it should have been "myvalue1".getBytes()).

On another note my program hangs when I supply my own filter to the
scanner (I suppose it's clear that the nodes don't know my class so
there should be a ClassNotFoundException right?).

Regards
David Alves 



On Mon, 2008-02-11 at 16:51 +, David Alves wrote: 
  
  

Hi Guys
In my previous email I might have misunderstood the roles of the
RowFilterInterfaces so I'll pose my question more clearly (since the
last one wasn't in question form :)).
I have a setup where a table has two columns belonging to different
column families (Table A: cf1:a, cf2:b).

I'm trying to build a filter so that a scanner only returns the rows
where cf1:a = "myvalue1" and cf2:b = "myvalue2".

I've built a RegExpRowFilter like this:
Map<Text, byte[]> conditionalsMap = new HashMap<Text, byte[]>();
conditionalsMap.put(new Text("cf1:a"), new "myvalue1".getBytes());
conditionalsMap.put(new Text("cf2:b"), "myvalue2".getBytes());
return new RegExpRowFilter(".*", conditionalsMap);

My problem is this filter always fails when I know for sure that there
are rows whose columns match my values.

I'm building the scanner like this (the purpose in this case is to
find out if there are more values that match my filter):

final Text startKey = this.htable.getStartKeys()[0];
HScannerInterface scanner = htable.obtainScanner(new Text[] {
    new Text("cf1:a"), new Text("cf2:b")}, startKey, rowFilterInterface);
return scanner.iterator().hasNext();

Can anyone give me a hand please.

Thanks in advance
David Alves





  
  


  




Re: Doubt in RegExpRowFilter and RowFilters in general

2008-02-11 Thread David Alves
Hi Again

In my previous example I seem to have misplaced a new keyword (new
"myvalue1".getBytes() where it should have been "myvalue1".getBytes()).

On another note my program hangs when I supply my own filter to the
scanner (I suppose it's clear that the nodes don't know my class so
there should be a ClassNotFoundException right?).

Regards
David Alves 


On Mon, 2008-02-11 at 16:51 +, David Alves wrote: 
 Hi Guys
   In my previous email I might have misunderstood the roles of the
 RowFilterInterfaces so I'll pose my question more clearly (since the
 last one wasn't in question form :)).
   I have a setup where a table has two columns belonging to different
 column families (Table A: cf1:a, cf2:b).
 
 I'm trying to build a filter so that a scanner only returns the rows
 where cf1:a = "myvalue1" and cf2:b = "myvalue2".
 
 I've built a RegExpRowFilter like this:
 Map<Text, byte[]> conditionalsMap = new HashMap<Text, byte[]>();
   conditionalsMap.put(new Text("cf1:a"), new "myvalue1".getBytes());
   conditionalsMap.put(new Text("cf2:b"), "myvalue2".getBytes());
   return new RegExpRowFilter(".*", conditionalsMap);
 
 My problem is this filter always fails when I know for sure that there
 are rows whose columns match my values.
 
 I'm building the scanner like this (the purpose in this case is to
 find out if there are more values that match my filter):
 
 final Text startKey = this.htable.getStartKeys()[0];
   HScannerInterface scanner = htable.obtainScanner(new Text[] {
       new Text("cf1:a"), new Text("cf2:b")}, startKey, rowFilterInterface);
   return scanner.iterator().hasNext();
 
 Can anyone give me a hand please.
 
 Thanks in advance
 David Alves