Understanding Sys.output from mapper & partitioner

2013-03-27 Thread Sai Sai
Below are my simple mapper and partitioner classes, the input file, and the output
displayed on the console at the end of the message.

My question is about the keys printed in the console output (shown in full at the
end of the message).

Here are the first few lines of the console output:

...

13/03/27 02:20:57 INFO mapred.MapTask: data buffer = 79691776/99614720
13/03/27 02:20:57 INFO mapred.MapTask: record buffer = 262144/327680
key = 0 value = 10    10
token[0] = 10 token[1] = 10
Printing Result in Partitioner = 0
IntPair in Mapper = 10-10
key = 6 value = 20    200
token[0] = 20 token[1] = 200
Printing Result in Partitioner = 0
IntPair in Mapper = 20-200 

Q1: I am confused about how/where these values Key = 0, Key = 6, and so on are
calculated.

Q2: After the output of the first 2 lines it prints the output from the partitioner
class:
   Printing Result in Partitioner = 0
Is this because the mapper and the partitioner are running in parallel?

I will really appreciate it if someone can take a quick look and shed some light on
this.

*** Mapper Class ***


public class SecondarySortMapper
        extends Mapper<LongWritable, Text, IntPair, IntWritable> {

    private String[] tokens = null;
    // Reused output value; set to the second column of each record before writing.
    private final IntWritable ONE = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        System.out.println("key = " + key.toString() + " value = " + value.toString());

        if (value != null) {
            tokens = value.toString().split("\\s+");
            System.out.println("token[0] = " + tokens[0] + " token[1] = " + tokens[1]);
            ONE.set(Integer.parseInt(tokens[1]));
            IntPair ip = new IntPair(Integer.parseInt(tokens[0]),
                    Integer.parseInt(tokens[1]));
            context.write(ip, ONE);
            System.out.println("IntPair in Mapper = " + ip.toString());
        }
    }
}

*** Partitioner Class ***

public class SecondarySortPartitioner
        extends Partitioner<IntPair, IntWritable> {

    @Override
    public int getPartition(IntPair key, IntWritable value, int numOfPartitions) {
        // Partition on the first field only, so all records sharing the first
        // value go to the same reducer.
        int result = (key.getFirst().hashCode()) % numOfPartitions;
        System.out.println("Printing Result in Partitioner = " + result);
        return result;
    }
}


*** input file ***

10    10
20    200
30    2500
40    400
50    500
60    1
10    10
30    2500
50    500
10    100
20    2000
30    25000
40    4000
50    5000
60    10
10    100
30    25000
50    5000



*** Here is the output in the console ***
...

13/03/27 02:20:57 INFO mapred.MapTask: data buffer = 79691776/99614720
13/03/27 02:20:57 INFO mapred.MapTask: record buffer = 262144/327680
key = 0 value = 10    10
token[0] = 10 token[1] = 10
Printing Result in Partitioner = 0
IntPair in Mapper = 10-10
key = 6 value = 20    200
token[0] = 20 token[1] = 200
Printing Result in Partitioner = 0
IntPair in Mapper = 20-200
key = 13 value = 30    2500
token[0] = 30 token[1] = 2500
Printing Result in Partitioner = 0
IntPair in Mapper = 30-2500
key = 21 value = 40    400
token[0] = 40 token[1] = 400
Printing Result in Partitioner = 0
IntPair in Mapper = 40-400
key = 28 value = 50    500
token[0] = 50 token[1] = 500
Printing Result in Partitioner = 0
IntPair in Mapper = 50-500
key = 35 value = 60    1
token[0] = 60 token[1] = 1
Printing Result in Partitioner = 0
IntPair in Mapper = 60-1
key = 40 value = 10    10
token[0] = 10 token[1] = 10
Printing Result in Partitioner = 0
IntPair in Mapper = 10-10
key = 46 value = 30    2500
token[0] = 30 token[1] = 2500
Printing Result in Partitioner = 0
IntPair in Mapper = 30-2500
key = 54 value = 50    500
token[0] = 50 token[1] = 500
Printing Result in Partitioner = 0
IntPair in Mapper = 50-500
key = 61 value = 10    100
token[0] = 10 token[1] = 100
Printing Result in Partitioner = 0
IntPair in Mapper = 10-100
key = 68 value = 20    2000
token[0] = 20 token[1] = 2000
Printing Result in Partitioner = 0
IntPair in Mapper = 20-2000
key = 76 value = 30    25000
token[0] = 30 token[1] = 25000
Printing Result in Partitioner = 0
IntPair in Mapper = 30-25000
key = 85 value = 40    4000
token[0] = 40 token[1] = 4000
Printing Result in Partitioner = 0
IntPair in Mapper = 40-4000
key = 93 value = 50    5000
token[0] = 50 token[1] = 5000
Printing Result in Partitioner = 0
IntPair in Mapper = 50-5000
key = 101 value = 60    10
token[0] = 60 token[1] = 10
Printing Result in Partitioner = 0
IntPair in Mapper = 60-10
key = 107 value = 10    100
token[0] = 10 token[1] = 100
Printing Result in Partitioner = 0
IntPair in Mapper = 10-100
key = 114 value = 30    25000
token[0] = 30 token[1] = 25000
Printing Result in Partitioner = 0
IntPair in Mapper = 30-25000
key = 123 value = 50    5000
token[0] = 50 token[1] = 5000
Printing Result in Partitioner = 0
IntPair in Mapper = 50-5000



Thanks
Sai


System.out.println vs Counters

2013-03-27 Thread Sai Sai
Q1. Is it right to assume that System.out.println statements are usable only in the
Eclipse environment, and that in a multi-node cluster environment we need to use
counters?

Q2. I am slightly confused: it appears that with System.out.println statements we
are able to get detailed info at every line of code in Eclipse, whereas counters
give just a few lines and are not as detailed as System.out.println statements. So
what should we do in a multi-node cluster environment?

Q3. Also, when they say the limit on counters is 120, does that mean that if we use:
context.getCounter("TestGroup1", "TestName1").increment(1);
more than 120 times it will not print it, or does it refer to 120 distinct counters
(for example the values of an enum) that we can define?
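For reference, here is a minimal sketch of counter usage with the new-API Mapper
(the class, enum and counter names below are made up for illustration); my
understanding is that the 120 limit is on the number of distinct counters in a job,
not on how many times increment() is called, but please correct me if that is wrong:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CounterDemoMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // One enum value = one counter. Counter values are aggregated across all
    // tasks and reported with the job, which is why they are more useful than
    // println on a multi-node cluster where stdout is scattered across task logs.
    static enum DemoCounters { EMPTY_LINES }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().trim().isEmpty()) {
            // increment() may be called any number of times on the same counter.
            context.getCounter(DemoCounters.EMPTY_LINES).increment(1);
            return;
        }
        context.write(value, new IntWritable(1));
    }
}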

Any help is really appreciated.
Thanks
Sai

Storage Block vs File Block

2013-03-27 Thread Sai Sai
Hadoop splits large files into file blocks of 64 MB (by default). Are these the same
as the underlying storage blocks, or are they different?

Thanks
Sai


Re: CompareTo vs equals

2013-03-27 Thread Sai Sai
The IntPair class has these 2 methods. I understand that compareTo is used for
comparing, but when is the equals method used, and is it necessary to write it when
we have already implemented the compareTo method?

@Override
public int compareTo(IntPair that) {
    int cmp = first.compareTo(that.first);
    if (cmp == 0) {
        cmp = second.compareTo(that.second);
    }
    return cmp;
}

@Override
public boolean equals(Object obj) {
    if (obj instanceof IntPair) {
        IntPair that = (IntPair) obj;
        return (first.equals(that.first) && second.equals(that.second));
    }
    return false;
}

Thanks

Sai

Static class vs Normal Class when to use

2013-03-27 Thread Sai Sai
In some examples/articles sometimes they use:
public static class MyMapper 

and sometimes they use

public class MyMapper 


When/why should we use a static class vs a normal class?

Thanks
Sai

Re: Static class vs Normal Class when to use

2013-03-27 Thread Sai Sai
Hi Ted
Thanks for your help it is really appreciated.
I have gone through the links that you provided, but I am looking for a simple
explanation of using static classes in Hadoop/MR. Syntactically I know how static
classes and variables are referenced; I am wondering whether there is something more
to why we use static classes. In other words, I am assuming static classes are
loaded when the enclosing classes are loaded and not when an object is created, and
that the lifespan of objects of static classes is shorter than that of normal
classes. Please let me know if this makes sense, and provide any additional info if
you could.
Thanks again.
Sai




 From: Ted Yu 
To: user@hadoop.apache.org; Sai Sai  
Sent: Thursday, 28 March 2013 9:41 AM
Subject: Re: Static class vs Normal Class when to use
 

Take a look at Effective Java 2nd edition:  
Item 22: Favor static member classes over nonstatic  

On Wed, Mar 27, 2013 at 9:05 PM, Ted Yu  wrote:

See 
http://stackoverflow.com/questions/1353309/java-static-vs-non-static-inner-class
>
>
>I believe Josh Bloch covers this in his famous book.
>
>
>
>On Wed, Mar 27, 2013 at 9:01 PM, Sai Sai  wrote:
>
>In some examples/articles sometimes they use:
>>public static class MyMapper 
>>
>>
>>and sometimes they use
>>
>>
>>public class MyMapper 
>>
>>
>>
>>When/why should we use static vs normal class.
>>
>>
>>Thanks
>>Sai
>

Re: Inspect a context object and see whats in it

2013-03-27 Thread Sai Sai
I have put a breakpoint in the map/reduce method and tried looking through the
context object using the Inspect option. I see a lot of variables inside it, but I
am wondering if it is possible to look at its contents meaningfully; by contents I
mean only the keys and values that we add at each step.
Please let me know if you have suggestions.
Thanks
Sai

Re: Serialized comparator vs normal comparator

2013-03-27 Thread Sai Sai


Just wondering what the difference is between the serialized comparator and the
normal comparator given below. The reason I am trying to understand this: if you are
using the serialized comparator, how will you verify during debugging whether the
comparator is working or not, since when you debug in Eclipse it shows byte info
which cannot easily be understood by developers?

Here are the methods I am referring to:

/** A Comparator that compares serialized IntPair. */ 
    public static class Comparator extends WritableComparator {
      public Comparator() {
        super(IntPair.class);
      }

     public int compare(byte[] b1, int s1, int l1,
                         byte[] b2, int s2, int l2) {
        return compareBytes(b1, s1, l1, b2, s2, l2);
      }
    }

    static {                                        // register this comparator
      WritableComparator.define(IntPair.class, new Comparator());
    }

    @Override
   public int compareTo(IntPair o) {
      if (first != o.first) {
        return first < o.first ? -1 : 1;
      } else if (second != o.second) {
        return second < o.second ? -1 : 1;
      } else {
        return 0;
      }
    }
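One possible way to check it during debugging (a sketch, not part of the original
code, and it assumes IntPair serializes its two fields as plain 4-byte ints, so it
would have to match whatever IntPair.write() actually does) is to decode the leading
bytes back into ints inside compare() before delegating to compareBytes():

    /** Like the Comparator above, this extends org.apache.hadoop.io.WritableComparator. */
    public static class DebugComparator extends WritableComparator {
      public DebugComparator() {
        super(IntPair.class);
      }

      @Override
      public int compare(byte[] b1, int s1, int l1,
                         byte[] b2, int s2, int l2) {
        // Decode the first serialized int of each key so a breakpoint or a
        // println shows readable values instead of raw bytes.
        int first1 = readInt(b1, s1);
        int first2 = readInt(b2, s2);
        System.out.println("raw compare of first fields: " + first1 + " vs " + first2);
        return compareBytes(b1, s1, l1, b2, s2, l2);
      }
    }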

Re: Bloom Filter analogy in SQL

2013-03-29 Thread Sai Sai
Can someone give a simple analogy of a Bloom Filter in SQL?
I am trying to understand it and always get confused.
Thanks

Re: list of linux commands for hadoop

2013-03-29 Thread Sai Sai
Just wondering if there is a list of Linux commands, or any article covering them,
that one needs for learning Hadoop.
Thanks

Re: Understanding Sys.output from mapper & partitioner

2013-03-29 Thread Sai Sai

Thanks
Sai


____
 From: Jens Scheidtmann 
To: user@hadoop.apache.org; Sai Sai  
Sent: Friday, 29 March 2013 9:26 PM
Subject: Re: Understanding Sys.output from mapper & partitioner
 

Hello Sai,

the interesting bit is how your job is configured. Depending on how you defined the
input to the MR job, e.g. as a text file, you might get this result. Unfortunately,
you didn't include that source code...
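For what it's worth, assuming the job uses the default TextInputFormat and the two
columns are separated by a single tab character (neither is shown in the posted
code), the LongWritable key passed to map() would simply be the byte offset of each
line within the input file:

"10\t10\n"   -> 6 bytes, so line 1 starts at offset 0 and line 2 at offset 6
"20\t200\n"  -> 7 bytes, so line 3 starts at offset 13
"30\t2500\n" -> 8 bytes, so line 4 starts at offset 21

... which matches key = 0, 6, 13, 21, 28, ... in the original console output.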

Best regards,

Jens

Re: Who splits the file into blocks

2013-03-31 Thread Sai Sai
Here is my understanding of putting a file into HDFS:
A client contacts the NameNode and gets the locations of the DataNodes where it
needs to put the blocks.
But before this, how does the NameNode know how many blocks the file needs to be
split into?
Who splits the file: is it the client itself, or is it the NameNode?
Since the file is never written to the NameNode, there is no chance it is the
NameNode.
Please clarify.
Thanks
Sai


Re: fsImage & editsLog questions

2013-04-03 Thread Sai Sai
1. Does the fsImage maintain the data or the metadata of the NameNode?
2. Are any input files stored in the fsImage?
3. When a NameNode goes down, is all the data on the NameNode lost or just the
metadata, and what happens to the fsImage and edits log?
4. Is the fsImage file also maintained as a backup on multiple DataNodes and the
secondary NameNode, or does it just sit on the NameNode?
5. Does the fsImage sit in HDFS or on the NameNode server?
6. What exactly does the edits log do?
Please shed some light.
Thanks
Sai


Re: Reduce starts before map completes (at 23%)

2013-04-11 Thread Sai Sai
I am running the wordcount from hadoop-examples, giving a bunch of test files as
input. I have noticed in the output given below that reduce starts when the map is
at 23%. I was under the impression that reducers start only after the mapping is
completely done, i.e. when map is at 100%. Why are the reducers starting when map is
still at 23%?

13/04/11 21:10:32 INFO mapred.JobClient:  map 0% reduce 0%
13/04/11 21:10:56 INFO mapred.JobClient:  map 1% reduce 0%
13/04/11 21:10:59 INFO mapred.JobClient:  map 2% reduce 0%
13/04/11 21:11:02 INFO mapred.JobClient:  map 3% reduce 0%
13/04/11 21:11:05 INFO mapred.JobClient:  map 4% reduce 0%
13/04/11 21:11:08 INFO mapred.JobClient:  map 6% reduce 0%
13/04/11 21:11:11 INFO mapred.JobClient:  map 7% reduce 0%
13/04/11 21:11:17 INFO mapred.JobClient:  map 8% reduce 0%
13/04/11 21:11:23 INFO mapred.JobClient:  map 10% reduce 0%
13/04/11 21:11:26 INFO mapred.JobClient:  map 12% reduce 0%
13/04/11 21:11:32 INFO mapred.JobClient:  map 14% reduce 0%
13/04/11 21:11:44 INFO mapred.JobClient:  map 23% reduce 0%
13/04/11 21:11:50 INFO mapred.JobClient:  map 23% reduce 1%
13/04/11 21:11:53 INFO mapred.JobClient:  map 33% reduce 7%
13/04/11 21:12:02 INFO mapred.JobClient:  map 42% reduce 7%

Please shed some light.
Thanks
Sai

Re: Does a Map task run 3 times on 3 TTs or just once

2013-04-12 Thread Sai Sai
Just wondering if it is right to assume that a map task is run 3 times on 3
different TTs in parallel, and whichever finishes processing the task first has its
output picked up and written to the intermediate location.
Or is it true that a map task, even though its data is replicated 3 times, will run
only once, with the other 2 copies on standby: in case this one fails the second one
will run, followed by the 3rd one if the 2nd mapper fails.
Please shed some light.
Thanks
Sai

Re: 10 TB of a data file.

2013-04-12 Thread Sai Sai
In the real world, can a file really be as big as 10 TB?
Will the data be put into a txt file, or what kind of file?
If someone would like to open such a big file to look at its content, will the OS
support opening such big files?
If not, how is this kind of scenario handled?
Any input will be appreciated.
Thanks
Sai

Re: How to find the num of Mappers

2013-04-12 Thread Sai Sai
If we have a 640 MB data file and 3 DataNodes in a cluster,
the file can be split into 10 blocks, and mappers M1, M2, M3 start first.
As each one completes its task, M4 and so on will be run.
It appears that it is not necessary to run all 10 map tasks in parallel at once.
Just wondering if this is the right assumption.
What if we have a 10 TB data file with 3 DataNodes: how do we find the number of
mappers that will be created?
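As a rough worked example of that arithmetic (assuming the split size equals the
default 64 MB block size and plain FileInputFormat splitting, which may not hold for
compressed input or many small files):

   640 MB / 64 MB =      10 splits  -> about 10 map tasks
   10 TB  / 64 MB = 163,840 splits  -> about 163,840 map tasks

If the assumption above is right, the 3 DataNodes would only limit how many of these
map tasks run at the same time (via the map slots on their TaskTrackers), not how
many are created in total.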
Thanks
Sai

Re: Will HDFS refer to the memory of NameNode & DataNode or is it a separate machine

2013-04-12 Thread Sai Sai
A few basic questions:

Does HDFS refer to the memory/storage of the NameNode & DataNodes, or is it a
separate machine?

For the NameNode, DataNode and the others there is a process associated with each of
them, but there is no process for HDFS itself; wondering why? I understand that the
fsImage has the metadata of HDFS, so when the NameNode, a DataNode or the
JobTracker/TT needs to get file info, do they just look into the fsImage?

When we put a file in HDFS, is it possible to find out on which node (NN/DN) it
physically sits?

Any help is appreciated.
Thanks
Sai

Re: 100K Maps scenario

2013-04-12 Thread Sai Sai


Just a follow-up to see if anyone can shed some light on this:
My understanding is that after each block is replicated 3 times, a map task is run
on each of the replicas in parallel.
The thing I am trying to double-check is that, in a scenario where a file is split
into 10K or 100K or more blocks, this would result in at least 300K map tasks being
performed, which looks like overkill from a performance or simply a logical
perspective.
Will appreciate any thoughts on this.
Thanks
Sai


 From: Sai Sai 
To: "user@hadoop.apache.org" ; Sai Sai 
 
Sent: Friday, 12 April 2013 1:37 PM
Subject: Re: Does a Map task run 3 times on 3 TTs or just once
 


Just wondering if it is right to assume that a Map task is run 3 times on 3 
different TTs in parallel and whoever completes processing the task first that 
output is picked up and written to intermediate location.
Or is it true that a map task even though its data is replicated 3 times will 
run only once and other 2 will be on the stand by just incase this fails the 
second one will run followed by 3rd one if the 2nd Mapper fails.
Plesae pour some light.
Thanks
Sai

Re: 100K Maps scenario

2013-04-12 Thread Sai Sai
Thanks Kai for confirming it.



 From: Kai Voigt 
To: user@hadoop.apache.org; Sai Sai  
Sent: Saturday, 13 April 2013 7:18 AM
Subject: Re: 100K Maps scenario
 


No, only one copy of each block will be processed.

If a task fails, it will be retried on another copy. Also, if speculative 
execution is enabled, slow tasks might be executed twice in parallel. But this 
will only happen rarely.

Kai


On 12.04.2013 at 18:45, Sai Sai wrote:


>
>Just a follow up to see if anyone can shed some light on this:
>My understanding is that each block after getting replicated 3 times, a map 
>task is run on each of the replica in parallel.
>The thing i am trying to double verify is in a scenario where a file is split 
>into 10K or 100K or more blocks it will result in atleast 300K Map tasks being 
>performed and this looks like an overkill from a performance or just a logical 
>perspective. 
>Will appreciate any thoughts on this.
>Thanks
>Sai
>
>____________
> From: Sai Sai 
>To: "user@hadoop.apache.org" ; Sai Sai 
> 
>Sent: Friday, 12 April 2013 1:37 PM
>Subject: Re: Does a Map task run 3 times on 3 TTs or just once
> 
>
>
>Just wondering if it is right to assume that a Map task is run 3 times on 3 
>different TTs in parallel and whoever completes processing the task first that 
>output is picked up and written to intermediate location.
>Or is it true that a map task even though its data is replicated 3 times will 
>run only once and other 2 will be on the stand by just incase this fails the 
>second one will run followed by 3rd one if the 2nd Mapper fails.
>Plesae pour some light.
>Thanks
>Sai
>
>

-- 
Kai Voigt
k...@123.org

Re: Flume port issue

2013-05-20 Thread Sai Sai
Not sure if this is the right group to ask questions about Flume:

I am getting an exception about being unable to open a port in Flume when trying to
create a remote agent; more details below:
---
13/05/20 04:55:30 ERROR avro.AvroCLIClient: Unable to open connection to Flume. 
Exception follows.
org.apache.flume.FlumeException: NettyAvroRpcClient { host: ubuntu, port: 41414 
}: RPC connection error
---


Here are the steps I have followed:

Step 1: Here is my agent3.conf created in the flume/conf dir:

**
agent3.sources = avrosource
agent3.sinks = filesink
agent3.channels = jdbcchannel

agent3.sources.avrosource.type = avro
agent3.sources.avrosource.bind = localhost
agent3.sources.avrosource.port = 4000
agent3.sources.avrosource.threads = 5

agent3.sinks.filesink.type = FILE_ROLL
agent3.sinks.filesink.sink.directory = 
/home/satish/work/apache-flume-1.3.1-bin/files
agent3.sinks.filesink.sink.rollInterval = 0

agent3.channels.jdbcchannel.type = jdbc

agent3.sources.avrosource.channels = jdbcchannel
agent3.sinks.filesink.channel = jdbcchannel

**


Step 2: Then I saved it successfully and created a new test file like this:

Step 3: echo "Hello World" > /home/satish/message3

Step 4: Tried executing this command:

./flume-ng avro-client -H localhost -p 4000 -F /home/satish/message3

I get this exception below, please help:

--

Djava.library.path=:/home/satish/work/hadoop-1.0.4/libexec/../lib/native/Linux-i386-32
 org.apache.flume.client.avro.AvroCLIClient -H ubuntu -p 41414 -F 
/usr/logs/log.10
13/05/20 04:55:30 ERROR avro.AvroCLIClient: Unable to open connection to Flume. 
Exception follows.
org.apache.flume.FlumeException: NettyAvroRpcClient { host: ubuntu, port: 41414 
}: RPC connection error
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:117)
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:93)
at 
org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:507)
at 
org.apache.flume.api.RpcClientFactory.getDefaultInstance(RpcClientFactory.java:169)
at org.apache.flume.client.avro.AvroCLIClient.run(AvroCLIClient.java:180)
at org.apache.flume.client.avro.AvroCLIClient.main(AvroCLIClient.java:71)
Caused by: java.io.IOException: Error connecting to ubuntu/127.0.0.1:41414
at org.apache.avro.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:261)
at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:203)
at org.apache.avro.ipc.NettyTransceiver.<init>(NettyTransceiver.java:152)
at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:106)
... 5 more
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
at 
org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:396)
at 
org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:358)
at 
org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:679)


Please help.
Thanks
Sai

Re: Flume port issue

2013-05-20 Thread Sai Sai
Lenin,
Thanks for your reply.
Here is the 1st sample, which works; I am not sure if this is the one you are
referring to:
---
agent1.sources = netsource
agent1.sinks = logsink
agent1.channels = memorychannel

agent1.sources.netsource.type = netcat
agent1.sources.netsource.bind = localhost
agent1.sources.netsource.port = 3000

agent1.sinks.logsink.type = logger

agent1.channels.memorychannel.type = memory
agent1.channels.memorychannel.capacity = 1000
agent1.channels.memorychannel.transactionCapacity = 100

agent1.sources.netsource.channels = memorychannel
agent1.sinks.logsink.channel = memorychannel
---

Please let me know if you have any suggestions.
Thanks
Sai



 From: Lenin Raj 
To: user@hadoop.apache.org 
Sent: Monday, 20 May 2013 5:54 PM
Subject: Re: Flume port issue
 


Sai, Are you able to run the netcat flume sample?
--
Lenin.
On May 20, 2013 5:40 PM, "Sai Sai"  wrote:

Not sure if this is the right group to ask questions about flume:
>
>
>I am getting an exception about unable to open a port in flume when trying to 
>create a remote agent, more details below:
>---
>13/05/20 04:55:30 ERROR avro.AvroCLIClient: Unable to open connection to 
>Flume. Exception follows.
>org.apache.flume.FlumeException: NettyAvroRpcClient { host: ubuntu, port: 
>41414 }: RPC connection error
>---
>
>
>
>Here r the steps i have followed:
>
>
>Step 1: Here is my agent3.conf created in the flume/conf dir:
>
>
>**
>agent3.sources = avrosource
>agent3.sinks = filesink
>agent3.channels = jdbcchannel
>
>
>agent3.sources.avrosource.type = avro
>agent3.sources.avrosource.bind = localhost
>agent3.sources.avrosource.port = 4000
>agent3.sources.avrosource.threads = 5
>
>
>agent3.sinks.filesink.type = FILE_ROLL
>agent3.sinks.filesink.sink.directory = 
>/home/satish/work/apache-flume-1.3.1-bin/files
>agent3.sinks.filesink.sink.rollInterval = 0
>
>
>agent3.channels.jdbcchannel.type = jdbc
>
>
>agent3.sources.avrosource.channels = jdbcchannel
>agent3.sinks.filesink.channel = jdbcchannel
>
>
>**
>
>
>
>Step 2: Then i have saved it successfully and created a new test file like 
>this:
>
>
>Step 3: echo "Hello World" > /home/satish/message3
>
>
>Step 4: Tried executing this command:
>
>
>./flume-ng avro-client -H localhost -p 4000 -F /home/satish/message3
>
>
>I get this exception below, please help:
>
>
>--
>
>
>Djava.library.path=:/home/satish/work/hadoop-1.0.4/libexec/../lib/native/Linux-i386-32
> org.apache.flume.client.avro.AvroCLIClient -H ubuntu -p 41414 -F 
>/usr/logs/log.10
>13/05/20 04:55:30 ERROR avro.AvroCLIClient: Unable to open connection to 
>Flume. Exception follows.
>org.apache.flume.FlumeException: NettyAvroRpcClient { host: ubuntu, port: 
>41414 }: RPC connection error
>at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:117)
>at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:93)
>at 
>org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:507)
>at 
>org.apache.flume.api.RpcClientFactory.getDefaultInstance(RpcClientFactory.java:169)
>at org.apache.flume.client.avro.AvroCLIClient.run(AvroCLIClient.java:180)
>at org.apache.flume.client.avro.AvroCLIClient.main(AvroCLIClient.java:71)
>Caused by: java.io.IOException: Error connecting to ubuntu/127.0.0.1:41414
>at org.apache.avro.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:261)
>at org.apache.avro.ipc.NettyTransceiver.(NettyTransceiver.java:203)
>at org.apache.avro.ipc.NettyTransceiver.(NettyTransceiver.java:152)
>at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:106)
>... 5 more
>Caused by: java.net.ConnectException: Connection refused
>at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
>at 
>org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:396)
>at 
>org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:358)
>at 
>org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:274)
>at 
>java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>at 
>java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>at java.lang.Thread.run(Thread.java:679)
>
>
>
>
>Please help.
>Thanks
>Sai
>
>
>
>
>
>
>
>

Re: Flume port issue

2013-05-21 Thread Sai Sai
Just a friendly follow-up to see if anyone has any suggestions for the port issue
described below.
Any help is appreciated.
Thanks
Sai

On May 20, 2013 5:40 PM, "Sai Sai"  wrote:

Not sure if this is the right group to ask questions about flume:
>
>
>I am getting an exception about unable to open a port in flume when trying to 
>create a remote agent, more details below:
>---
>13/05/20 04:55:30 ERROR avro.AvroCLIClient: Unable to open connection to 
>Flume. Exception follows.
>org.apache.flume.FlumeException: NettyAvroRpcClient { host: ubuntu, port: 
>41414 }: RPC connection error
>---
>
>
>
>Here r the steps i have followed:
>
>
>Step 1: Here is my agent3.conf created in the flume/conf dir:
>
>
>**
>agent3.sources = avrosource
>agent3.sinks = filesink
>agent3.channels = jdbcchannel
>
>
>agent3.sources.avrosource.type = avro
>agent3.sources.avrosource.bind = localhost
>agent3.sources.avrosource.port = 4000
>agent3.sources.avrosource.threads = 5
>
>
>agent3.sinks.filesink.type = FILE_ROLL
>agent3.sinks.filesink.sink.directory = 
>/home/satish/work/apache-flume-1.3.1-bin/files
>agent3.sinks.filesink.sink.rollInterval = 0
>
>
>agent3.channels.jdbcchannel.type = jdbc
>
>
>agent3.sources.avrosource.channels = jdbcchannel
>agent3.sinks.filesink.channel = jdbcchannel
>
>
>**
>
>
>
>Step 2: Then i have saved it successfully and created a new test file like 
>this:
>
>
>Step 3: echo "Hello World" > /home/satish/message3
>
>
>Step 4: Tried executing this command:
>
>
>./flume-ng avro-client -H localhost -p 4000 -F /home/satish/message3
>
>
>I get this exception below, please help:
>
>
>--
>
>
>Djava.library.path=:/home/satish/work/hadoop-1.0.4/libexec/../lib/native/Linux-i386-32
> org.apache.flume.client.avro.AvroCLIClient -H ubuntu -p 41414 -F 
>/usr/logs/log.10
>13/05/20 04:55:30 ERROR avro.AvroCLIClient: Unable to open connection to 
>Flume. Exception follows.
>org.apache.flume.FlumeException: NettyAvroRpcClient { host: ubuntu, port: 
>41414 }: RPC connection error
>at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:117)
>at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:93)
>at 
>org.apache.flume.api.NettyAvroRpcClient.configure(NettyAvroRpcClient.java:507)
>at 
>org.apache.flume.api.RpcClientFactory.getDefaultInstance(RpcClientFactory.java:169)
>at org.apache.flume.client.avro.AvroCLIClient.run(AvroCLIClient.java:180)
>at org.apache.flume.client.avro.AvroCLIClient.main(AvroCLIClient.java:71)
>Caused by: java.io.IOException: Error connecting to ubuntu/127.0.0.1:41414
>at org.apache.avro.ipc.NettyTransceiver.getChannel(NettyTransceiver.java:261)
>at org.apache.avro.ipc.NettyTransceiver.(NettyTransceiver.java:203)
>at org.apache.avro.ipc.NettyTransceiver.(NettyTransceiver.java:152)
>at org.apache.flume.api.NettyAvroRpcClient.connect(NettyAvroRpcClient.java:106)
>... 5 more
>Caused by: java.net.ConnectException: Connection refused
>at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
>at 
>org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:396)
>at 
>org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:358)
>at 
>org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:274)
>at 
>java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>at 
>java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>at java.lang.Thread.run(Thread.java:679)
>
>
>
>
>Please help.
>Thanks
>Sai
>
>
>
>
>
>
>
>

Re: Project ideas

2013-05-21 Thread Sai Sai
Excellent input, Sanjay, many thanks for it.
I have always been thinking about some ideas but never knew which one to proceed
with.
Thanks again.
Sai



 From: Sanjay Subramanian 
To: "user@hadoop.apache.org"  
Sent: Tuesday, 21 May 2013 11:51 PM
Subject: Re: Project ideas
 


+1 

My $0.02 is to look around and see problems you can solve… It's better to get a
list of problems and see if you can model a solution using the map-reduce framework.

An example is as follows

PROBLEM 
Build a Cars Pricing Model based on advertisements on Craigs list

OBJECTIVE
Recommend a price to the Craigslist car seller when the user gives info about 
make,model,color,miles

DATA required
Collect RSS feeds daily from Craigslist (don't pound their website, else they will
lock you down).

DESIGN COMPONENTS
- Daily RSS Collector - pulls data and puts into HDFS
- Data Loader - Structures the columns u need to analyze and puts into HDFS
- Hive Aggregator and analyzer - studies and queries data and brings out 
recommendation models for car pricing
- REST Web service to return query results in XML/JSON
- iPhone App that talks to web service and gets info

There you go… this should keep a couple of students busy for 3 months.

I find this kind of problem statement and solution simpler to understand
because it's all there in the real world!

This way of thinking is an example of what led me to found the non-profit
www.medicalsidefx.org, which gives users valuable metrics regarding medical side
effects.
It uses Hadoop to aggregate and Lucene to search… This year I am redesigning the
core to use Hive :-)

Good luck 

Sanjay

 


From: Michael Segel 
Reply-To: "user@hadoop.apache.org" 
Date: Tuesday, May 21, 2013 6:46 AM
To: "user@hadoop.apache.org" 
Subject: Re: Project ideas


Drink heavily?  

Sorry.

Let me rephrase.

Part of the exercise is for you, the student, to come up with the idea, not to
solicit someone else for a suggestion. This is how you learn.

The exercise is to get you to think about the following:

1) What is Hadoop
2) How does it work
3) Why would you want to use it

You need to understand #1 and #2 to be able to do #3.

But at the same time... you need to also incorporate your own view of the 
world. 
What are your hobbies? What do you like to do? 
What scares you the most?  What excites you the most? 
Why are you here? 
And most importantly, what do you think you can do within the time period. 
(What data can you easily capture and work with...) 

Have you ever seen 'Eden of the East' ? ;-) 

HTH



On May 21, 2013, at 8:35 AM, Anshuman Mathur  wrote:

Hello fellow users,
>We are a group of students studying in National University of Singapore. As 
>part of our course curriculum we need to develop an application using Hadoop 
>and  map-reduce. Can you please suggest some innovative ideas for our project?
>Thanks in advance.
>Anshuman



Re: Hadoop Development on cloud in a secure and economical way.

2013-05-21 Thread Sai Sai


Is it possible to do Hadoop development in the cloud in a secure and economical way,
without worrying about our source code being taken away? We would like to have
Hadoop and Eclipse installed on a VM in the cloud, and our developers would log into
the cloud on a daily basis and work there. This way we are hoping that if we develop
any product, we minimize the risk of the source being taken away by the devs or
others, and it stays secure. Please let me know any suggestions you may have.
Thanks,
Sai

Re: diff between these 2 dirs

2013-05-24 Thread Sai Sai
Just wondering if someone can explain the difference between these 2 directories:

Contents of directory /home/satish/work/mapred/staging/satish/.staging

and this dir:
/hadoop/mapred/system


Thanks
Sai

Hadoop based product recommendations.

2013-05-28 Thread Sai Sai
Just wondering if anyone has any suggestions.
We are a bunch of developers who have been on the bench for a few months, trained on
Hadoop but without any projects to work on.
We would like to develop a Hadoop/Hive/Pig based product for our company so we can
be of value to the company and not be scared of layoffs. We are wondering if anyone
could share ideas for a product that we could develop and be of value to our
company, rather than just hoping we will get projects to work on. Any
help/suggestions/direction will be really appreciated.
Thanks,
Sai

Re: Install hadoop on multiple VMs in 1 laptop like a cluster

2013-05-31 Thread Sai Sai
Just wondering if anyone has any documentation or references to articles on how to
simulate a multi-node cluster setup on 1 laptop, with Hadoop running on multiple
Ubuntu VMs. Any help is appreciated.
Thanks
Sai

Re: Why/When partitioner is used.

2013-06-07 Thread Sai Sai
I always get confused about why we should partition and what the use of it is.
Why would one want to send all the keys starting with A to Reducer1, B to R2, and so
on?
Is it just to parallelize the reduce process?
Please help.
Thanks
Sai

Re: How hadoop processes image or video files

2013-06-07 Thread Sai Sai
How are image files or video files processed using Hadoop?
I understand that the byte[] is read by Hadoop using the SequenceFile format in the
map phase, but what is done after that with this byte[], as it is something that
does not make sense in its raw form?
Any input please.

Thanks
Sai

Re: Is counter a static var

2013-06-07 Thread Sai Sai
Is a counter like a static variable? If so, is it persisted on the NameNode or on a
DataNode?
Any input please.

Thanks
Sai

Re: Is it possible to define num of mappers to run for a job

2013-06-07 Thread Sai Sai
Is it possible to define the number of mappers to run for a job?

What are the conditions we need to be aware of when defining such a thing?
Please help.
Thanks
Sai

Re: Pool & slot questions

2013-06-07 Thread Sai Sai
1. Can we think of a job pool as similar to a queue?

2. Is it possible to configure a slot, and if so how?

Please help.
Thanks

Sai

Re: 2 Map tasks running for a small input file

2013-09-26 Thread Sai Sai
Hi
Here is the input file for the wordcount job:
**

Hi This is a simple test.
Hi Hadoop how r u.
Hello Hello.
Hi Hi.
Hadoop Hadoop Welcome.
**


After running the wordcount successfully,
here is the counters info:

***
Job Counters SLOTS_MILLIS_MAPS 0 0 8,386
Launched reduce tasks 0 0 1
Total time spent by all reduces waiting after reserving slots (ms) 0 0 0
Total time spent by all maps waiting after reserving slots (ms) 0 0 0
Launched map tasks 0 0 2
Data-local map tasks 0 0 2
SLOTS_MILLIS_REDUCES 0 0 9,199
***

My question: why are there 2 launched map tasks when I have only a small file?
Per my understanding it is only 1 block,
and should be only 1 split.
Then for each line a map computation should occur,
but it shows 2 map tasks.
Please let me know.
Thanks
Sai


Re: 2 Map tasks running for a small input file

2013-09-26 Thread Sai Sai
Thanks Viji.
I am a little confused: when the data is small, why would there be 2 tasks?
You would use the minimum of 2 if you needed it, but in this case it is not needed
because the data is small, so why do 2 map tasks execute?
Since it results in 1 block with 5 lines of data in it,
I am assuming this results in 5 map computations, 1 per line,
and all of them in 1 process/node since I am using a pseudo-distributed VM.
Where is the second task coming from?
Are the 5 map computations on the 5 lines together 1 task?
Is this right?
Please help.
Thanks




 From: Viji R 
To: user@hadoop.apache.org; Sai Sai  
Sent: Thursday, 26 September 2013 5:09 PM
Subject: Re: 2 Map tasks running for a small input file
 

Hi,

Default number of map tasks is 2. You can set mapred.map.tasks to 1 to
avoid this.
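For example, a minimal MRv1 driver sketch of that setting (the class and job names
here are made up, and as far as I understand mapred.map.tasks is only a hint to the
framework):

import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");
        // Equivalent to passing -D mapred.map.tasks=1 on the command line.
        conf.setNumMapTasks(1);
        // ... set the mapper/reducer classes and input/output paths here,
        // then submit with JobClient.runJob(conf).
    }
}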

Regards,
Viji

On Thu, Sep 26, 2013 at 4:28 PM, Sai Sai  wrote:
> Hi
> Here is the input file for the wordcount job:
> **
> Hi This is a simple test.
> Hi Hadoop how r u.
> Hello Hello.
> Hi Hi.
> Hadoop Hadoop Welcome.
> **
>
> After running the wordcount successfully
> here r the counters info:
>
> ***
> Job Counters SLOTS_MILLIS_MAPS 0 0 8,386
> Launched reduce tasks 0 0 1
> Total time spent by all reduces waiting after reserving slots (ms) 0 0 0
> Total time spent by all maps waiting after reserving slots (ms) 0 0 0
> Launched map tasks 0 0 2
> Data-local map tasks 0 0 2
> SLOTS_MILLIS_REDUCES 0 0 9,199
> ***
> My question why r there 2 launched map tasks when i have only a small file.
> Per my understanding it is only 1 block.
> and should be only 1 split.
> Then for each line a map computation should occur
> but it shows 2 map tasks.
> Please let me know.
> Thanks
> Sai
>

Re: Input Split vs Task vs attempt vs computation

2013-09-26 Thread Sai Sai
Hi
I have a few questions i am trying to understand:

1. Is each input split the same as a record (a record can be a single line or
multiple lines)?

2. Is each task a collection of a few computations or attempts?

For example, say I have a small file with 5 lines.

By default each map computation is performed on 1 line,
so in total 5 computations are done on 1 node.

This means the JT will spawn 1 JVM for 1 TaskTracker on a node,
and another JVM for the map task, which will instantiate 5 map objects, 1 for each
line.

The map-task JVM is called the task, which will have 5 attempts, one for each line.
This means an attempt is the same as a computation.

Please let me know if anything is incorrect.
Thanks
Sai


Re: Retrieve and compute input splits

2013-09-30 Thread Sai Sai
Thanks for your suggestions and replies.
I am still confused about this:

"To create the list of tasks to run, the job scheduler first retrieves the input
splits computed by the JobClient from the shared filesystem (step 6)."

My question:

Does the input split in the above statement refer to the physical block or the
logical input split?
I understand that the client splits the file and saves the blocks at the time of
writing the file to the cluster, and that the metadata about the blocks is in the
NameNode.
The only place where the metadata about the blocks lives is the NN, so can we assume
that in step 6 the scheduler goes to the NN to retrieve this metadata, and that this
is what is indicated in the diagram as the shared filesystem (HDFS)?
And if this is right, is the input split the physical block info and not the logical
input split info, which could be just a single line if we are using TextInputFormat,
the default one?
Any suggestions?
Thanks
Sai


 From: Jay Vyas 
To: "common-u...@hadoop.apache.org"  
Cc: Sai Sai  
Sent: Saturday, 28 September 2013 5:35 AM
Subject: Re: Retrieve and compute input splits
 


Technically, the block locations are provided by the InputSplit, which in the
FileInputFormat case is provided by the FileSystem interface.

http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/InputSplit.html

The thing to realize here is that the FileSystem implementation is provided at
runtime, so the InputSplit class is responsible for creating a FileSystem
implementation using reflection and then calling getBlockLocations on the given
file or set of files that the input split corresponds to.

I think your confusion here lies in the fact that the input splits create a
filesystem; however, they don't know what the filesystem implementation actually is.
They only rely on the abstract contract, which provides a set of block locations.

See the FileSystem abstract class for details on that.
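As a small illustration of that contract (a sketch, not from the original mail; the
file path comes from the command line), the same block-location lookup that the
FileInputFormat splits rely on can be made directly against whatever FileSystem the
configuration resolves to at runtime:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path file = new Path(args[0]);              // e.g. an HDFS file path
        FileSystem fs = file.getFileSystem(conf);   // concrete FS chosen at runtime
        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block, listing the hosts that hold a replica.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println(b.getOffset() + "+" + b.getLength()
                    + " on " + Arrays.toString(b.getHosts()));
        }
    }
}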




On Fri, Sep 27, 2013 at 7:02 PM, Peyman Mohajerian  wrote:

For the JobClient to compute the input splits, doesn't it need to contact the
NameNode? Only the NameNode knows where the splits are; how can it compute them
without that additional call?
>
>
>
>
>On Fri, Sep 27, 2013 at 1:41 AM, Sonal Goyal  wrote:
>
>The input splits are not copied, only the information on the location of the 
>splits is copied to the jobtracker so that it can assign tasktrackers which 
>are local to the split.
>>
>>
>>Check the Job Initialization section at 
>>http://answers.oreilly.com/topic/459-anatomy-of-a-mapreduce-job-run-with-hadoop/
>>
>>
>>
>>To create the list of tasks to run, the job scheduler first retrieves the 
>>input splits computed by the JobClient from the shared filesystem (step 6). 
>>It then creates one map task for each split. The number of reduce tasks to 
>>create is determined by the mapred.reduce.tasks property in the JobConf, 
>>which is set by the setNumReduceTasks() method, and the scheduler simply 
>>creates this number of reduce tasks to be run. Tasks are given IDs at this 
>>point.
>>
>>
>>
>>Best Regards,
>>Sonal
>>Nube Technologies 
>>
>>
>>
>>
>>
>>
>>
>>
>>On Fri, Sep 27, 2013 at 10:55 AM, Sai Sai  wrote:
>>
>>Hi
>>>I have attached the anatomy of MR from definitive guide.
>>>
>>>
>>>In step 6 it says JT/Scheduler  retrieve  input splits computed by the 
>>>client from hdfs.
>>>
>>>
>>>In the above line it refers to as the client computes input splits.
>>>
>>>
>>>
>>>1. Why does the JT/Scheduler retrieve the input splits and what does it do.
>>>If it is retrieving the input split does this mean it goes to the block and 
>>>reads each record 
>>>and gets the record back to JT. If so this is a lot of data movement for 
>>>large files.
>>>which is not data locality. so i m getting confused.
>>>
>>>
>>>2. How does the client know how to calculate the input splits.
>>>
>>>
>>>Any help please.
>>>Thanks
>>>Sai
>>
>


-- 
Jay Vyas
http://jayunit100.blogspot.com 

Re: Newbie Debuggin Question

2013-02-21 Thread Sai Sai
This may be a basic beginner debugging question; I will appreciate it if anyone can
shed some light:

Here is the method i have in Eclipse:


***

@Override
    protected void setup(Context context) throws java.io.IOException,
            InterruptedException {
        Path[] cacheFiles = DistributedCache.getLocalCacheFiles(context
                .getConfiguration());
        lookUp = cacheFiles[0];
    }
***

I have put a breakpoint at the second line and inspected cacheFiles[0], and here is
what I see:

[/tmp/hadoop-sai/mapred/local/archive/3401759285981873176_334405473_2022582449/fileinput/lookup.txt]

I went back to my local folders looking for these directories to see if they are
there, but I do not see them.

Just wondering where it is getting this file from.
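In case it helps to see the other half, here is a minimal driver-side sketch of how
such a file usually gets there (the class name and the /fileinput/lookup.txt path
are guesses based on the localized path above, and it assumes the old
DistributedCache API):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;

public class CacheSetupSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The driver registers a file with the cache; before a task runs, the
        // framework copies it to a local directory under mapred.local.dir
        // (the /tmp/hadoop-<user>/mapred/local/archive/... path seen above),
        // and that local copy is what getLocalCacheFiles() returns in setup().
        DistributedCache.addCacheFile(new URI("/fileinput/lookup.txt"), conf);
        // ... the rest of the job setup and submission would follow here.
    }
}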

Any help will be really appreciated.
Thanks
Sai

Re: mapr videos question

2013-02-23 Thread Sai Sai

Hi
Could someone please verify whether the MapR videos are meant for learning Hadoop or
for learning MapR. If we are interested in learning Hadoop only, will they help? As
a starter I would like to understand just Hadoop, and not MapR yet.

Just wondering if others can share their thoughts and any relevant links.
Thanks,
Sai

Re: WordPairCount Mapreduce question.

2013-02-23 Thread Sai Sai


Hello

I have a question about how MapReduce sorting works internally with multiple
columns.

Below are my classes, using the 2-column input file given below.

1st question: about the hashCode method, we are using a factor of 31; I am wondering
why this is required. What does 31 refer to?

2nd question: what if my input file has 3 columns instead of 2, how would you write
a compare method? It would also be really helpful if someone could map this to a
real-world scenario.



    @Override
    public int compareTo(WordPairCountKey o) {
        int diff = word1.compareTo(o.word1);
        if (diff == 0) {
            diff = word2.compareTo(o.word2);
        }
        return diff;
    }
    
    @Override
    public int hashCode() {
        return word1.hashCode() + 31 * word2.hashCode();
    }

**

Here is my input file wordpair.txt

**

a    b
a    c
a    b
a    d
b    d
e    f
b    d
e    f
b    d

**


Here is my WordPairObject:

*

public class WordPairCountKey implements WritableComparable<WordPairCountKey> {

    private String word1;
    private String word2;

    @Override
    public int compareTo(WordPairCountKey o) {
        int diff = word1.compareTo(o.word1);
        if (diff == 0) {
            diff = word2.compareTo(o.word2);
        }
        return diff;
    }
    
    @Override
    public int hashCode() {
        return word1.hashCode() + 31 * word2.hashCode();
    }

    
    public String getWord1() {
        return word1;
    }

    public void setWord1(String word1) {
        this.word1 = word1;
    }

    public String getWord2() {
        return word2;
    }

    public void setWord2(String word2) {
        this.word2 = word2;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        word1 = in.readUTF();
        word2 = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word1);
        out.writeUTF(word2);
    }

    
    @Override
    public String toString() {
        return "[word1=" + word1 + ", word2=" + word2 + "]";
    }

}

**

Any help will be really appreciated.
Thanks
Sai


Re: WordPairCount Mapreduce question.

2013-02-24 Thread Sai Sai
Thanks Mahesh for your help.

Wondering if you can provide some insight into the compare method below, which uses
byte[], from the SecondarySort example:

public static class Comparator extends WritableComparator {
        public Comparator() {
            super(URICountKey.class);
        }

        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int 
l2) {
            return compareBytes(b1, s1, l1, b2, s2, l2);
        }
    }


My question: in the compareTo method I gave earlier we compare word1/word2, which
makes sense, but what about this byte[] comparison? Is it right to assume that it
compares the serialized byte[] form of each object's word1/word2/word3?
If so, is it done for performance reasons?
Could you please verify.
Thanks
Sai



 From: Mahesh Balija 
To: user@hadoop.apache.org; Sai Sai  
Sent: Saturday, 23 February 2013 5:23 AM
Subject: Re: WordPairCount Mapreduce question.
 

Please check the in-line answers...


On Sat, Feb 23, 2013 at 6:22 PM, Sai Sai  wrote:


>
>Hello
>
>
>I have a question about how Mapreduce sorting works internally with multiple 
>columns.
>
>
>Below r my classes using 2 columns in an input file given below.
>
>
>
>1st question: About the method hashCode, we r adding a "31 + ", i am wondering 
>why is this required. what does 31 refer to.
>
This is how the hashcode is usually calculated for any String instance
(s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]), where n stands for the length of the
String. Since in your case you only have 2 chars, it will be a * 31^0 + b * 31^1.
 


>
>2nd question: what if my input file has 3 columns instead of 2 how would you 
>write a compare method and was wondering if anyone can map this to a real 
>world scenario it will be really helpful.
>
you will extend the same approach for the third column:

public int compareTo(WordPairCountKey o) {
    int diff = word1.compareTo(o.word1);
    if (diff == 0) {
        diff = word2.compareTo(o.word2);
        if (diff == 0) {
            diff = word3.compareTo(o.word3);
        }
    }
    return diff;
}

>
>
>
>    @Override
>    public int compareTo(WordPairCountKey o) {
>        int diff = word1.compareTo(o.word1);
>        if (diff == 0) {
>            diff = word2.compareTo(o.word2);
>        }
>        return diff;
>    }
>    
>    @Override
>    public int hashCode() {
>        return word1.hashCode() + 31 * word2.hashCode();
>    }
>
>
>**
>
>Here is my input file wordpair.txt
>
>**
>
>a    b
>a    c
>a    b
>a    d
>b    d
>e    f
>b    d
>e    f
>b    d
>
>**
>
>
>Here is my WordPairObject:
>
>*
>
>public class WordPairCountKey implements WritableComparable {
>
>    private String word1;
>    private String word2;
>
>    @Override
>    public int compareTo(WordPairCountKey o) {
>        int diff = word1.compareTo(o.word1);
>        if (diff == 0) {
>            diff = word2.compareTo(o.word2);
>        }
>        return diff;
>    }
>    
>    @Override
>    public int hashCode() {
>        return word1.hashCode() + 31 * word2.hashCode();
>    }
>
>    
>    public String getWord1() {
>        return word1;
>    }
>
>    public void setWord1(String word1) {
>        this.word1 = word1;
>    }
>
>    public String getWord2() {
>        return word2;
>    }
>
>    public void setWord2(String word2) {
>        this.word2 = word2;
>    }
>
>    @Override
>    public void readFields(DataInput in) throws IOException {
>        word1 = in.readUTF();
>        word2 = in.readUTF();
>    }
>
>    @Override
>    public void
 write(DataOutput out) throws IOException {
>        out.writeUTF(word1);
>        out.writeUTF(word2);
>    }
>
>    
>    @Override
>    public String toString() {
>        return "[word1=" + word1 + ", word2=" + word2 + "]";
>    }
>
>}
>
>**
>
>Any help will be really appreciated.
>Thanks
>Sai
>

Re: Trying to copy file to Hadoop file system from a program

2013-02-24 Thread Sai Sai


Greetings,

Below is the program I am trying to run; I am getting this exception:
***

Test Start.
java.net.UnknownHostException: unknown host: master
    at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:214)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1196)
    at org.apache.hadoop.ipc.Client.call(Client.java:1050)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
    at $Proxy1.getProtocolVersion(Unknown Source)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
    at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
    at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
    at kelly.hadoop.hive.test.HadoopTest.main(HadoopTest.java:54)






public class HdpTest {
    
    public static String fsURI = "hdfs://master:9000";

    
    public static void copyFileToDFS(FileSystem fs, String srcFile,
            String dstFile) throws IOException {
        try {
            System.out.println("Initialize copy...");
            URI suri = new URI(srcFile);
            URI duri = new URI(fsURI + "/" + dstFile);
            Path dst = new Path(duri.toString());
            Path src = new Path(suri.toString());
            System.out.println("Start copy...");
            fs.copyFromLocalFile(src, dst);
            System.out.println("End copy...");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        try {
            System.out.println("Test Start.");
            Configuration conf = new Configuration();
            DistributedFileSystem fs = new DistributedFileSystem();
            URI duri = new URI(fsURI);
            fs.initialize(duri, conf); // Here is where the exception occurs
            long start = 0, end = 0;
            start = System.nanoTime();
            //writing data from local to HDFS
            copyFileToDFS(fs, "/home/kosmos/Work/input/wordpair.txt",
                    "/input/raptor/trade1.txt");
            //Writing data from HDFS to Local
//         copyFileFromDFS(fs, "/input/raptor/trade1.txt", 
"/home/kosmos/Work/input/wordpair1.txt");
            end = System.nanoTime();
            System.out.println("Total Execution times: " + (end - start));
            fs.close();
        } catch (Throwable t) {
            t.printStackTrace();
        }
    }

**
I am trying to access this URL in Firefox:

hdfs://master:9000

I get an error message that Firefox does not know how to display it.

I can successfully access my admin page:

http://localhost:50070/dfshealth.jsp

Just wondering if anyone can give me any suggestions; your help will be really
appreciated.
Thanks
Sai

Re: Trying to copy file to Hadoop file system from a program

2013-02-24 Thread Sai Sai
Many thanks Nitin for your quick reply.

Here is what I have in my hosts file; I am running in a VM, and I am assuming it is
pseudo-distributed mode:

*
127.0.0.1    localhost.localdomain    localhost
#::1    ubuntu    localhost6.localdomain6    localhost6
#127.0.1.1    ubuntu
127.0.0.1   ubuntu

# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
*
In my masters i have:
ubuntu
In my slaves i have:
localhost
***
My question is about the variable below:
public static String fsURI = "hdfs://master:9000";

What would be the right value so I can connect to Hadoop successfully?
Please let me know if you need more info.
Thanks
Sai








 From: Nitin Pawar 
To: user@hadoop.apache.org; Sai Sai  
Sent: Sunday, 24 February 2013 3:42 AM
Subject: Re: Trying to copy file to Hadoop file system from a program
 

If you want to use master as your hostname, then make such an entry in your
/etc/hosts file,

or change hdfs://master to hdfs://localhost.



On Sun, Feb 24, 2013 at 5:10 PM, Sai Sai  wrote:


>
>Greetings,
>
>
>Below is the program i am trying to run and getting this exception:
>***
>
>Test Start.
>java.net.UnknownHostException: unknown host: master
>    at org.apache.hadoop.ipc.Client$Connection.(Client.java:214)
>    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1196)
>    at org.apache.hadoop.ipc.Client.call(Client.java:1050)
>    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
>    at $Proxy1.getProtocolVersion(Unknown Source)
>    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
>    at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
>    at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
>    at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:238)
>    at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:203)
>    at
 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
>    at kelly.hadoop.hive.test.HadoopTest.main(HadoopTest.java:54)
>
>
>
>
>
>
>
>
>public class HdpTest {
>    
>    public static String fsURI = "hdfs://master:9000";
>
>    
>    public static void copyFileToDFS(FileSystem fs, String srcFile,
>   
         String dstFile) throws IOException {
>        try {
>            System.out.println("Initialize copy...");
>            URI suri = new URI(srcFile);
>            URI duri = new URI(fsURI + "/" + dstFile);
>            Path dst = new Path(duri.toString());
>            Path src = new Path(suri.toString());
>            System.out.println("Start copy...");
>            fs.copyFromLocalFile(src, dst);
>            System.out.println("End copy...");
>        } catch (Exception e)
 {
>            e.printStackTrace();
>        }
>    }
>
>    public static void main(String[] args) {
>        try {
>            System.out.println("Test Start.");
>            Configuration conf = new Configuration();
>            DistributedFileSystem fs = new DistributedFileSystem();
>            URI duri = new URI(fsURI);
>            fs.initialize(duri, conf); // Here is the xception occuring
>            long start = 0, end = 0;
>       
     start = System.nanoTime();
>            //writing data from local to HDFS
>            copyFileToDFS(fs, "/home/kosmos/Work/input/wordpair.txt",
>                    "/input/raptor/trade1.txt");
>            //Writing data from HDFS to Local
>//         copyFileFromDFS(fs, "/input/raptor/trade1.txt", 
>"/home/kosmos/Work/input/wordpair1.txt");
>            end = System.nanoTime();
>            System.out.println("Total Execution times: " + (end - start));
>            fs.close();
>        } catch
 (Throwable t) {
>            t.printStackTrace();
>        }
>    }
>
>**
>I am trying to access in FireFox this url: 
>
>hdfs://master:9000
>
>
>Get an error msg FF does not know how to display this message.
>
>
>I can successfully access my admin page:
>
>
>http://localhost:50070/dfshealth.jsp
>
>
>Just wondering if anyone can give me any suggestions, your help will be really 
>appreciated.
>ThanksSai
>
>
>


-- 
Nitin Pawar

Files in hadoop.

2013-03-04 Thread Sai Sai
After we put a file into Hadoop for running MR jobs and we are done with it,
is it standard practice to delete it, or to just leave it there?
Just wondering what others do.
Any input will be appreciated.
Thanks
Sai

Re: Unknown processes unable to terminate

2013-03-04 Thread Sai Sai
I have the list of processes given below, and I am trying to kill process 13082
using:

kill 13082

It is not terminating RunJar.

I have run stop-all.sh hoping it would stop all the processes, but it only stopped
the Hadoop-related processes.
I am just wondering whether it is necessary to stop all the other processes before
starting the Hadoop processes, and how to stop these other processes.

Here is the list of processes which are appearing:



30969 FileSystemCat
30877 FileSystemCat
5647 StreamCompressor
32200 DataNode
25015 Jps
2227 URLCat
5563 StreamCompressor
5398 StreamCompressor
13082 RunJar
32578 JobTracker
7215 
385 TaskTracker
31884 NameNode
32489 SecondaryNameNode


Thanks
Sai


Re: Find current version & cluster info of hadoop

2013-03-07 Thread Sai Sai
Just wondering if there are any commands in Hadoop which would give us the current
version that we are using, and any command which would give us info about the
cluster setup of the Hadoop installation we are working on.
Thanks
Sai


Re: Block vs FileSplit vs record vs line

2013-03-14 Thread Sai Sai
Just wondering if this is the right way to understand this:
a large file is split into multiple blocks, each block is split into multiple file
splits, each file split has multiple records, and each record has multiple lines.
Each line is processed by 1 instance of the mapper.
Any help is appreciated.
Thanks
Sai

Re: Increase the number of mappers in PM mode

2013-03-14 Thread Sai Sai



In pseudo-distributed mode, where is the setting to increase the number of mappers,
or is this not possible?
Thanks
Sai


Re: Setup/Cleanup question

2013-03-22 Thread Sai Sai
When running an MR job/program, assuming there are 'n' (=100) mappers triggered, my
question is: will setup & cleanup run n times, meaning once for each mapper, or will
they run only once for all the mappers?
Any help is appreciated.
Thanks
Sai


Re: Setup/Cleanup question

2013-03-22 Thread Sai Sai
Thanks Harsh.
So the setup/cleanup are for the job and not the mappers, I take it.
Thanks.






 From: Harsh J 
To: "" ; Sai Sai 
 
Sent: Friday, 22 March 2013 10:05 PM
Subject: Re: Setup/Cleanup question
 
Assuming you speak of MRv1 (1.x/0.20.x versions), there is just 1 Job
Setup and 1 Job Cleanup tasks additionally run for each Job.

On Sat, Mar 23, 2013 at 9:10 AM, Sai Sai  wrote:
> When running an MR job/program assuming there r 'n' (=100) Mappers triggered
> then my question is will the setup & cleanup run n number of times which
> means once for each mapper or for all the mappers they will run only once.
> Any help is appreciated.
> Thanks
> Sai



-- 
Harsh J

Re: Dissecting MR output article

2013-03-22 Thread Sai Sai


Just wondering if there is any step-by-step explanation/article of the MR output we
get when we run a job, either in Eclipse or on Ubuntu. Any help is appreciated.
Thanks
Sai


Difference between FILE_Bytes_READ vs HDFS_Bytes_Read.

2014-03-13 Thread Sai Sai
Can someone please help:
1. What is the difference between FILE_BYTES_READ vs HDFS_BYTES_READ?

Thanks
Sai

fault tolerance question.

2014-03-13 Thread Sai Sai
Let's say the client is writing the first block to the first DataNode and the node
fails. What will happen now: will the client or the NN do something about it?
Thanks
Sai

Is hdinsights a C# version of hadoop or is it in java.

2014-03-13 Thread Sai Sai
Is HDInsight a C# version of Hadoop, or is it in Java?

Please let me know.
Thanks
Sai

File_bytes_read vs hdfs_bytes_read

2014-03-14 Thread Sai Sai
Just wondering what the difference is between FILE_BYTES_READ vs HDFS_BYTES_READ,
which get displayed in the output of a job.
Thanks
Sai

how to unzip a .tar.bz2 file in hadoop/hdfs

2014-03-14 Thread Sai Sai
Can someone please help:
how do I extract a .tar.bz2 file which is in Hadoop/HDFS?

Thanks
Sai