Re: how to connect to remote hadoop dfs by eclipse plugin?

2009-05-14 Thread Rasit OZDAS
Why don't you use it with localhost? Is there a disadvantage to that?
As far as I know, there were several host <=> IP resolution problems in Hadoop,
but that was a while ago; I think those should have been solved by now.

It can also be related to the order of the IP-to-hostname mappings in your hosts file.

2009/5/14 andy2005cst 

>
> when set the IP to localhost, it works well, but if change localhost into
> IP
> address, it does not work at all.
> so, it is to say my hadoop is ok, just the connection failed.
>
>
> Rasit OZDAS wrote:
> >
> > Your hadoop isn't working at all or isn't working at the specified port.
> > - try stop-all.sh command on namenode. if it says "no namenode to stop",
> > then take a look at namenode logs and paste here if anything seems
> > strange.
> > - If namenode logs are ok (filled with INFO messages), then take a look
> at
> > all logs.
> > - In eclipse plugin, left side is for map reduce port, right side is for
> > namenode port, make sure both are same as your configuration in xml files
> >
> > 2009/5/12 andy2005cst 
> >
> >>
> >> when i use eclipse plugin hadoop-0.18.3-eclipse-plugin.jar and try to
> >> connect
> >> to a remote hadoop dfs, i got ioexception. if run a map/reduce program
> it
> >> outputs:
> >> 09/05/12 16:53:52 INFO ipc.Client: Retrying connect to server:
> >> /**.**.**.**:9100. Already tried 0 time(s).
> >> 09/05/12 16:53:52 INFO ipc.Client: Retrying connect to server:
> >> /**.**.**.**:9100. Already tried 1 time(s).
> >> 09/05/12 16:53:52 INFO ipc.Client: Retrying connect to server:
> >> /**.**.**.**:9100. Already tried 2 time(s).
> >> 
> >> Exception in thread "main" java.io.IOException: Call to
> /**.**.**.**:9100
> >> failed on local exception: java.net.SocketException: Connection refused:
> >> connect
> >>
> >> looking forward your help. thanks a lot.
> >> --
> >> View this message in context:
> >>
> http://www.nabble.com/how-to-connect-to-remote-hadoop-dfs-by-eclipse-plugin--tp23498736p23498736.html
> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>
> >>
> >
> >
> > --
> > M. Raşit ÖZDAŞ
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/how-to-connect-to-remote-hadoop-dfs-by-eclipse-plugin--tp23498736p23533748.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
M. Raşit ÖZDAŞ


Re: how to improve the Hadoop's capability of dealing with small files

2009-05-12 Thread Rasit OZDAS
I have a similar situation: I have very small files.
I never tried HBase (I want to), but you can also group the small files
and write, let's say, 20-30 of them into one big file, so that every original
file becomes a key in that big file.

There are also methods in the API with which you can write an object as a file
into HDFS and read it back to get the original object. Keeping a list of items
in one object can solve this problem.
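
If it helps, here is a minimal, untested sketch of the grouping idea using a
SequenceFile (original file name as key, file contents as value); the local and
HDFS paths are made up for illustration:

import java.io.File;
import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path packed = new Path("/user/me/packed.seq");                   // assumed target
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, packed, Text.class, BytesWritable.class);
    try {
      for (File small : new File("/data/small-files").listFiles()) { // assumed dir
        byte[] bytes = new byte[(int) small.length()];
        FileInputStream in = new FileInputStream(small);
        try {
          int off = 0;
          while (off < bytes.length) {
            int r = in.read(bytes, off, bytes.length - off);
            if (r < 0) break;
            off += r;
          }
        } finally {
          in.close();
        }
        // one record per small file: name -> contents
        writer.append(new Text(small.getName()), new BytesWritable(bytes));
      }
    } finally {
      writer.close();
    }
  }
}

Reading the records back with SequenceFile.Reader (or SequenceFileInputFormat in
a job) gives you the original name/content pairs again.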


Re: how to connect to remote hadoop dfs by eclipse plugin?

2009-05-12 Thread Rasit OZDAS
Your Hadoop isn't running at all, or isn't listening on the specified port.
- Try the stop-all.sh command on the namenode. If it says "no namenode to stop",
then take a look at the namenode logs and paste them here if anything seems strange.
- If the namenode logs are OK (filled with INFO messages), then take a look at
all the logs.
- In the Eclipse plugin, the left field is for the map/reduce port and the right
field is for the namenode port; make sure both match your configuration in the xml files.

2009/5/12 andy2005cst 

>
> when i use eclipse plugin hadoop-0.18.3-eclipse-plugin.jar and try to
> connect
> to a remote hadoop dfs, i got ioexception. if run a map/reduce program it
> outputs:
> 09/05/12 16:53:52 INFO ipc.Client: Retrying connect to server:
> /**.**.**.**:9100. Already tried 0 time(s).
> 09/05/12 16:53:52 INFO ipc.Client: Retrying connect to server:
> /**.**.**.**:9100. Already tried 1 time(s).
> 09/05/12 16:53:52 INFO ipc.Client: Retrying connect to server:
> /**.**.**.**:9100. Already tried 2 time(s).
> 
> Exception in thread "main" java.io.IOException: Call to /**.**.**.**:9100
> failed on local exception: java.net.SocketException: Connection refused:
> connect
>
> looking forward your help. thanks a lot.
> --
> View this message in context:
> http://www.nabble.com/how-to-connect-to-remote-hadoop-dfs-by-eclipse-plugin--tp23498736p23498736.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
M. Raşit ÖZDAŞ


Re: Distributed Agent

2009-04-15 Thread Rasit OZDAS
Take a look at this topic:

http://dsonline.computer.org/portal/site/dsonline/menuitem.244c5fa74f801883f1a516106bbe36ec/index.jsp?&pName=dso_level1_about&path=dsonline/topics/agents&file=about.xml&xsl=generic.xsl&;

2009/4/14 Burak ISIKLI :
> Hello everyone;
> I want to write a distributed agent program. But i can't understand one thing 
> that what's difference between client-server program and agent program? Pls 
> help me...
>
>
>
>
> 
> Burak ISIKLI
> Dumlupinar University
> Electric & Electronic - Computer Engineering
>
> http://burakisikli.wordpress.com
> http://burakisikli.blogspot.com
> 
>
>
>
>



-- 
M. Raşit ÖZDAŞ


Re: Ynt: Re: Cannot access Jobtracker and namenode

2009-04-12 Thread Rasit OZDAS
It's normal that they are all empty. Look at the files with the ".log" extension instead.

On Sunday, April 12, 2009 at 23:30, halilibrahimcakir
 wrote:
> I followed these steps:
>
> $ bin/stop-all.sh
> $ rm -ri /tmp/hadoop-root
> $ bin/hadoop namenode -format
> $ bin/start-all.sh
>
> and looked "localhost:50070" and "localhost:50030" in my browser that the
> result was not different. Again "Error 404". I looked these files:
>
> $ gedit hadoop-0.19.0/logs/hadoop-root-namenode-debian.out1
> $ gedit hadoop-0.19.0/logs/hadoop-root-namenode-debian.out2
> $ gedit hadoop-0.19.0/logs/hadoop-root-namenode-debian.out3
> $ gedit hadoop-0.19.0/logs/hadoop-root-namenode-debian.out4
>
> 4th file is the last one related to namenode logs in the logs directory.
> All of them are empty. I don't understand what is wrong.
>
> - Original Message -
> From : core-user@hadoop.apache.org
> To : core-user@hadoop.apache.org
> Sent : 12/04/2009 22:56
> Subject : Re: Cannot access Jobtracker and namenode
> Try looking at namenode logs (under "logs" directory). There should be
> an exception. Paste it here if you don't understand what it means.
>
> On Sunday, April 12, 2009 at 22:22, halilibrahimcakir
> 
> wrote:
> > I typed:
> >
> > $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
> > $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
> >
> > Deleted this directory:
> >
> > $ rm -ri /tmp/hadoop-root
> >
> > Formatted namenode again:
> >
> > $ /bin/hadoop namenode -format
> >
> > Stopped:
> >
> > $ /bin/stop-all.sh
> >
> >
> > then typed:
> >
> >
> >
> > $ ssh localhost
> >
> > so it didn't want me to enter a password. I started:
> >
> > $ /bin/start-all.sh
> >
> > But nothing changed :(
> >
> > - Original Message -
> > From : core-user@hadoop.apache.org
> > To : core-user@hadoop.apache.org
> > Sent : 12/04/2009 21:33
> > Subject : Re: Ynt: Re: Cannot access Jobtracker and namenode
> > There are two commands in hadoop quick start, used for passwordless
> ssh.
> > Try those.
> >
> > $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
> > $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
> >
> > http://hadoop.apache.org/core/docs/current/quickstart.html
> >
> > --
> > M. Raşit ÖZDAŞ
> >
> > Halil İbrahim ÇAKIR
> >
> > Dumlupınar University, Computer Engineering
> >
> > http://cakirhal.blogspot.com
> >
> >
>
>
>
> --
> M. Raşit ÖZDAŞ
>
> Halil İbrahim ÇAKIR
>
> Dumlupınar University, Computer Engineering
>
> http://cakirhal.blogspot.com
>
>



-- 
M. Raşit ÖZDAŞ


Re: Cannot access Jobtracker and namenode

2009-04-12 Thread Rasit OZDAS
Try looking at namenode logs (under "logs" directory). There should be
an exception. Paste it here if you don't understand what it means.

On Sunday, April 12, 2009 at 22:22, halilibrahimcakir
 wrote:
> I typed:
>
> $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
> $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
>
> Deleted this directory:
>
> $ rm -ri /tmp/hadoop-root
>
> Formatted namenode again:
>
> $ /bin/hadoop namenode -format
>
> Stopped:
>
> $ /bin/stop-all.sh
>
>
> then typed:
>
>
>
> $ ssh localhost
>
> so it didn't want me to enter a password. I started:
>
> $ /bin/start-all.sh
>
> But nothing changed :(
>
> - Original Message -
> From : core-user@hadoop.apache.org
> To : core-user@hadoop.apache.org
> Sent : 12/04/2009 21:33
> Subject : Re: Ynt: Re: Cannot access Jobtracker and namenode
> There are two commands in hadoop quick start, used for passwordless ssh.
> Try those.
>
> $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
> $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
>
> http://hadoop.apache.org/core/docs/current/quickstart.html
>
> --
> M. Raşit ÖZDAŞ
>
> Halil İbrahim ÇAKIR
>
> Dumlupınar University, Computer Engineering
>
> http://cakirhal.blogspot.com
>
>



-- 
M. Raşit ÖZDAŞ


Re: Ynt: Re: Cannot access Jobtracker and namenode

2009-04-12 Thread Rasit OZDAS
There are two commands in hadoop quick start, used for passwordless ssh.
Try those.

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

http://hadoop.apache.org/core/docs/current/quickstart.html

-- 
M. Raşit ÖZDAŞ


Re: Cannot access Jobtracker and namenode

2009-04-12 Thread Rasit OZDAS
Does your system request a password when you ssh to localhost outside hadoop?

On Sunday, April 12, 2009 at 20:51, halilibrahimcakir
 wrote:
>
> Hi
>
> I am new at hadoop. I downloaded Hadoop-0.19.0 and followed the
> instructions in the quick start
> manual(http://hadoop.apache.org/core/docs/r0.19.1/quickstart.html). When I
> came to Pseudo-Distributed Operation section there was no problem but
> localhost:50070 and localhost:50030 couldn't be opened. It says "localhost
> refused the connection". I tried this on another machine, but it says
> like "Http Error 404: /dfshealth.jsp ...". How can I see these pages and
> continue using hadoop? Thanks.
>
> Additional Information:
>
> OS: Debian 5.0 (latest version)
> JDK: Sun-Java 1.6 (latest version)
>  rsync and ssh installed
> edited hadoop-site.xml properly
>
> Halil İbrahim ÇAKIR
>
> Dumlupınar University, Computer Engineering
>
> http://cakirhal.blogspot.com
>
>



-- 
M. Raşit ÖZDAŞ


Re: Web ui

2009-04-08 Thread Rasit OZDAS
@Nick, I use Ajax very often and have previously done projects with ZK
and jQuery; I can easily say that GWT was the easiest of them.
JavaScript is only needed where the core features aren't enough, and I can
safely assume that we won't need any inline JavaScript.

@Philip,
Thanks for the pointer. That is actually a better solution than I imagined,
and I won't have to wait, since it's a resolved issue.

-- 
M. Raşit ÖZDAŞ


Web ui

2009-04-07 Thread Rasit OZDAS
Hi,

I started to write my own web UI with GWT. With GWT I can manage
everything within one page, and I can set a refresh interval for
each part of the page, plus get a better look and feel with the help
of GWT styling.

But I can't get references to the NameNode and JobTracker instances.
I found out that they're passed to the web UI as application attributes when
Hadoop initializes.

I'll try to contribute the GUI part of my project back to the Hadoop source if
you want, no problem.
But I need static references to the NameNode and JobTracker for this.

And I think it will be useful for everyone like me.

M. Rasit OZDAS


Re: Running MapReduce without setJar

2009-04-02 Thread Rasit OZDAS
You can point to them with
conf.setMapperClass(..) and conf.setReducerClass(..) (or something very close
to that; I don't have the source nearby).
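
A rough sketch of a driver that sets everything through JobConf instead of
setJar (untested, from memory). IdentityMapper/IdentityReducer just stand in
for your own classes, and the job name is a placeholder; the output key/value
classes below match the default TextInputFormat types:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class MyJobDriver {
  public static void main(String[] args) throws Exception {
    // the class argument also tells Hadoop which jar to ship, instead of setJar()
    JobConf conf = new JobConf(MyJobDriver.class);
    conf.setJobName("my-job");
    conf.setMapperClass(IdentityMapper.class);    // replace with your Mapper
    conf.setReducerClass(IdentityReducer.class);  // replace with your Reducer
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}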

But something weird has happened to my code: it runs locally when I
start it as a plain java process (it tries to find the input path on the local
file system). I'm now using trunk; maybe something has changed in the new
version, because with version 0.19 it was fine.
Can somebody point out a clue?

Rasit


Re: HELP: I wanna store the output value into a list not write to the disk

2009-04-02 Thread Rasit OZDAS
That seems interesting, we have 3 replications as default.
Is there a way to define, lets say, 1 replication for only job-specific files?

2009/4/2 Owen O'Malley :
>
> On Apr 2, 2009, at 2:41 AM, andy2005cst wrote:
>
>>
>> I need to use the output of the reduce, but I don't know how to do.
>> use the wordcount program as an example if i want to collect the wordcount
>> into a hashtable for further use, how can i do?
>
> You can use an output format and then an input format that uses a database,
> but in practice, the cost of writing to hdfs and reading it back is not a
> problem, especially if you set the replication of the output files to 1.
> (You'll need to re-run the job if you lose a node, but it will be fast.)
>
> -- Owen
>



-- 
M. Raşit ÖZDAŞ


Re: HadoopConfig problem -Datanode not able to connect to the server

2009-04-02 Thread Rasit OZDAS
I have no idea, but there are many "use hostname instead of IP"
issues. Try the hostname once instead of the IP.

2009/3/26 mingyang :
> check you iptable is off
>
> 2009/3/26 snehal nagmote 
>
>> hello,
>> We configured hadoop successfully, but after some days  its configuration
>> file from datanode( hadoop-site.xml) went off , and datanode was not coming
>> up ,so we again did the same configuration, its showing one datanode and
>> its
>> name as localhost rather than expected as either name of respected datanode
>> m/c or ip address of   actual datanode in ui interfece of hadoop.
>>
>> But capacity as 80.0gb ,(we have  one namenode (40 gb) and datanode(40
>> gb))means capacity is updated ,we can browse the filesystem , it is showing
>> whatever directories we are creating in namenode .
>>
>> but when we try to access the same through the datanode  machine
>> means doing ssh and executing series of commands its not able to connect to
>> the server.
>> saying retrying connect to the server
>>
>> 09/03/26 11:25:11 INFO ipc.Client: Retrying connect to server: /
>> 172.16.6.102:21011. Already tried 0 time(s).
>>
>> 09/03/26 11:25:11 INFO ipc.Client: Retrying connect to server: /
>> 172.16.6.102:21011. Already tried 1 time(s)
>>
>>
>> moreover we added one datanode into it and formatted namenode ,but that
>> datanode is not getting added. we are not understanding whats the problem.
>>
>> Can configuration files in case of datanode automatcally lost  after some
>> days??
>>
>> I have again one doubt , according to my understanding namenode doesnt
>> store
>> any data , it stores metadata of all the data , so when i execute mkdir in
>> namenode machine  and copying some files into it, it means that data is
>> getting stored in datanode provided to it, please correct me if i am wrong
>> ,
>> i am very new to hadoop.
>> So if i am able to view the data through inteface means its properly
>> storing
>> data into respected datanode, So
>> why its showing localhost as datanode name rather than respected datanode
>> name.
>>
>> can you please help.
>>
>>
>> Regards,
>> Snehal Nagmote
>> IIIT hyderabad
>>
>
>
>
> --
> Regards!
>
>
> 王明阳
>



-- 
M. Raşit ÖZDAŞ


Re: HELP: I wanna store the output value into a list not write to the disk

2009-04-02 Thread Rasit OZDAS
Andy, I haven't tried this feature, but I know that Yahoo set a
performance record with this file format.
I came across a file system included in the Hadoop code (probably that
one) when searching the source.
Luckily I found it: org.apache.hadoop.fs.InMemoryFileSystem
But if you have a lot of big files, this approach won't be suitable, I think.

Maybe someone can give further info.

2009/4/2 andy2005cst :
>
> thanks for your reply. Let me explain more clearly, since Map Reduce is just
> one step of my program, I need to use the output of reduce for furture
> computation, so i do not need to want to wirte the output into disk, but
> wanna to get the collection or list of the output in RAM. if it directly
> wirtes into disk, I have to read it back into RAM again.
> you have mentioned a special file format, will you please show me what is
> it? and give some example if possible.
>
> thank you so much.
>
>
> Rasit OZDAS wrote:
>>
>> Hi, Hadoop is normally designed to write to disk. There is a special file
>> format which writes output to RAM instead of disk,
>> but I don't know if it's what you're looking for.
>> For what you describe to exist, there would have to be a mechanism which sends
>> output as objects rather than file content across computers; as far as I know
>> there is no such feature yet.
>>
>> Good luck.
>>
>> 2009/4/2 andy2005cst 
>>
>>>
>>> I need to use the output of the reduce, but I don't know how to do.
>>> use the wordcount program as an example if i want to collect the
>>> wordcount
>>> into a hashtable for further use, how can i do?
>>> the example just show how to let the result onto disk.
>>> myemail is : andy2005...@gmail.com
>>> looking forward your help. thanks a lot.
>>> --
>>> View this message in context:
>>> http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22844277.html
>>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>> --
>> M. Raşit ÖZDAŞ
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22848070.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>



-- 
M. Raşit ÖZDAŞ


Re: Identify the input file for a failed mapper/reducer

2009-04-02 Thread Rasit OZDAS
Two quotes for this problem:

"Streaming map tasks should have a "map_input_file" environment
variable like the following:
map_input_file=hdfs://HOST/path/to/file"

"the value for map.input.file gives you the exact information you need."

(didn't try)
Rasit

2009/3/26 Jason Fennell :
> Is there a way to identify the input file a mapper was running on when
> it failed?  When a large job fails because of bad input lines I have
> to resort to rerunning the entire job to isolate a single bad line
> (since the log doesn't contain information on the file that that
> mapper was running on).
>
> Basically, I would like to be able to do one of the following:
> 1. Find the file that a mapper was running on when it failed
> 2. Find the block that a mapper was running on when it failed (and be
> able to find file names from block ids)
>
> I haven't been able to find any documentation on facilities to
> accomplish either (1) or (2), so I'm hoping someone on this list will
> have a suggestion.
>
> I am using the Hadoop streaming API on hadoop 0.18.2.
>
> -Jason
>



-- 
M. Raşit ÖZDAŞ


Re: hdfs-doubt

2009-04-02 Thread Rasit OZDAS
It seems that either the NameNode or the DataNode did not start.
You can take a look at the log files and paste the related lines here.

2009/3/29 deepya :
>
> Thanks,
>
> I have another doubt.I just want to run the examples and see how it works.I
> am trying to copy the file from local file system to hdfs using the command
>
>  bin/hadoop fs -put conf input
>
> It is giving the following error.
> 09/03/29 05:50:54 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> java.net.NoRouteToHostException: No route to host
> 09/03/29 05:50:54 INFO hdfs.DFSClient: Abandoning block
> blk_-5733385806393158149_1053
>
> I have only one datanode in my cluster and my replication factor is also
> 1(as configured in the conf file in hadoop-site.xml).Can you please provide
> the solution for this.
>
>
> Thanks in advance
>
> SreeDeepya
>
>
> sree deepya wrote:
>>
>> Hi sir/madam,
>>
>> I am SreeDeepya,doing Mtech in IIIT.I am working on a project named cost
>> effective and scalable storage server.Our main goal of the project is to
>> be
>> able to store images in a server and the data can be upto petabytes.For
>> that
>> we are using HDFS.I am new to hadoop and am just learning about it.
>>     Can you please clarify some of the doubts I have.
>>
>>
>>
>> At present we configured one datanode and one namenode.Jobtracker is
>> running
>> on namenode and tasktracker on datanode.Now namenode also acts as
>> client.Like we are writing programs in the namenode to store or retrieve
>> images.My doubts are
>>
>> 1.Can we put the client and namenode in two separate systems?
>>
>> 2.Can we access the images from the datanode of hadoop cluster from a
>> machine in which hdfs is not there?
>>
>> 3.At present we may not have data upto petabytes but will be in
>> gigabytes.Is
>> hadoop still efficient in storing mega and giga bytes of data
>>
>>
>> Thanking you,
>>
>> Yours sincerely,
>> SreeDeepya
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/hdfs-doubt-tp22764502p22765332.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>



-- 
M. Raşit ÖZDAŞ


Re: a doubt regarding an appropriate file system

2009-04-02 Thread Rasit OZDAS
I doubt If I understood you correctly, but if so, there is a previous
thread to better understand what hadoop is intended to be, and what
disadvantages it has:
http://www.nabble.com/Using-HDFS-to-serve-www-requests-td22725659.html

2009/4/2 Rasit OZDAS 
>
> If performance is important to you, Look at the quote from a previous thread:
>
> "HDFS is a file system for distributed storage typically for distributed
> computing scenerio over hadoop. For office purpose you will require a SAN
> (Storage Area Network) - an architecture to attach remote computer storage
> devices to servers in such a way that, to the operating system, the devices
> appear as locally attached. Or you can even go for AmazonS3, if the data is
> really authentic. For opensource solution related to SAN, you can go with
> any of the linux server distributions (eg. RHEL, SuSE) or Solaris (ZFS +
> zones) or perhaps best plug-n-play solution (non-open-source) would be a Mac
> Server + XSan."
>
> --nitesh
>
> Besides, I wouldn't use HDFS for this purpose.
>
> Rasit



--
M. Raşit ÖZDAŞ


Re: a doubt regarding an appropriate file system

2009-04-02 Thread Rasit OZDAS
If performance is important to you, Look at the quote from a previous
thread:

"HDFS is a file system for distributed storage typically for distributed
computing scenerio over hadoop. For office purpose you will require a SAN
(Storage Area Network) - an architecture to attach remote computer storage
devices to servers in such a way that, to the operating system, the devices
appear as locally attached. Or you can even go for AmazonS3, if the data is
really authentic. For opensource solution related to SAN, you can go with
any of the linux server distributions (eg. RHEL, SuSE) or Solaris (ZFS +
zones) or perhaps best plug-n-play solution (non-open-source) would be a Mac
Server + XSan."

--nitesh

Besides, I wouldn't use HDFS for this purpose.

Rasit


Re: A bizarre problem in reduce method

2009-04-02 Thread Rasit OZDAS
Hi, Husain,

1. You can use a boolean flag in your code:

   boolean hasAlreadyOned = false;
   int iCount = 0;
   String sValues = "";
   String sValue;
   while (values.hasNext()) {
       sValue = values.next().toString();
       iCount++;
       if (sValue.equals("1"))
           hasAlreadyOned = true;

       if (!hasAlreadyOned)
           sValues += "\t" + sValue;
   }
   ...

2. You're actually checking for 3 elements, not 2. You should use
   if (iCount == 1)

2009/4/1 Farhan Husain 

> Hello All,
>
> I am facing some problems with a reduce method I have written which I
> cannot
> understand. Here is the method:
>
>@Override
>public void reduce(Text key, Iterator values,
> OutputCollector output, Reporter reporter)
>throws IOException {
>String sValues = "";
>int iCount = 0;
>String sValue;
>while (values.hasNext()) {
>sValue = values.next().toString();
>iCount++;
>sValues += "\t" + sValue;
>
>}
>sValues += "\t" + iCount;
>//if (iCount == 2)
>output.collect(key, new Text(sValues));
>}
>
> The output of the code is like the following:
>
> D0U0:GraduateStudent0lehigh:GraduateStudent111
> D0U0:GraduateStudent1lehigh:GraduateStudent111
> D0U0:GraduateStudent10lehigh:GraduateStudent111
> D0U0:GraduateStudent100lehigh:GraduateStudent11
>  1
> D0U0:GraduateStudent101lehigh:GraduateStudent1
> D0U0:GraduateCourse0121
> D0U0:GraduateStudent102lehigh:GraduateStudent11
>  1
> D0U0:GraduateStudent103lehigh:GraduateStudent11
>  1
> D0U0:GraduateStudent104lehigh:GraduateStudent11
>  1
> D0U0:GraduateStudent105lehigh:GraduateStudent11
>  1
>
> The problem is there cannot be so many 1's in the output value. The output
> which I expect should be like this:
>
> D0U0:GraduateStudent0lehigh:GraduateStudent1
> D0U0:GraduateStudent1lehigh:GraduateStudent1
> D0U0:GraduateStudent10lehigh:GraduateStudent1
> D0U0:GraduateStudent100lehigh:GraduateStudent1
> D0U0:GraduateStudent101lehigh:GraduateStudent
> D0U0:GraduateCourse02
> D0U0:GraduateStudent102lehigh:GraduateStudent1
> D0U0:GraduateStudent103lehigh:GraduateStudent1
> D0U0:GraduateStudent104lehigh:GraduateStudent1
> D0U0:GraduateStudent105lehigh:GraduateStudent1
>
> If I do not append the iCount variable to sValues string, I get the
> following output:
>
> D0U0:GraduateStudent0lehigh:GraduateStudent
> D0U0:GraduateStudent1lehigh:GraduateStudent
> D0U0:GraduateStudent10lehigh:GraduateStudent
> D0U0:GraduateStudent100lehigh:GraduateStudent
> D0U0:GraduateStudent101lehigh:GraduateStudent
> D0U0:GraduateCourse0
> D0U0:GraduateStudent102lehigh:GraduateStudent
> D0U0:GraduateStudent103lehigh:GraduateStudent
> D0U0:GraduateStudent104lehigh:GraduateStudent
> D0U0:GraduateStudent105lehigh:GraduateStudent
>
> This confirms that there is no 1's after each of those values (which I
> already know from the intput data). I do not know why the output is
> distorted like that when I append the iCount to sValues (like the given
> code). Can anyone help in this regard?
>
> Now comes the second problem which is equally perplexing. Actually, the
> reduce method which I want to run is like the following:
>
>@Override
>public void reduce(Text key, Iterator values,
> OutputCollector output, Reporter reporter)
>throws IOException {
>String sValues = "";
>int iCount = 0;
>String sValue;
>while (values.hasNext()) {
>sValue = values.next().toString();
>iCount++;
>sValues += "\t" + sValue;
>
>}
>sValues += "\t" + iCount;
>if (iCount == 2)
>output.collect(key, new Text(sValues));
>}
>
> I want to output only if "values" contained only two elements. By looking
> at
> the output above you can see that there is at least one such key values
> pair
> where values have exactly two elements. But when I run the code I get an
> empty output file. Can anyone solve this?
>
> I have tried many versions of the code (e.g. using StringBuffer instead of
> String, using flags instead of integer count) but nothing works. Are these
> problems due to bugs in Hadoop? Please let me know any kind of solution you
> can think of.
>
> Thanks,
>
> --
> Mohammad Farhan Husain
> Research Assistant
> Department of Computer Science
> Erik Jonsson School of Engineering and Computer Science
> University of T

Re: what change to be done in OutputCollector to print custom writable object

2009-04-02 Thread Rasit OZDAS
There is also a good alternative:
we use ObjectInputFormat and ObjectRecordReader.
With them you can easily do File <-> Object translations.
I can send a code sample to your mail if you want.
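
Until then, here is the basic idea as a rough, untested sketch using plain Java
serialization over HDFS streams (the ObjectInputFormat/ObjectRecordReader I
mentioned are not shown here; this only illustrates the kind of File <-> Object
logic involved):

import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsObjectIO {
  // write any Serializable object as a file in HDFS
  public static void writeObject(Configuration conf, Path path, Serializable obj)
      throws Exception {
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(path);
    ObjectOutputStream oos = new ObjectOutputStream(out);
    oos.writeObject(obj);
    oos.close();
  }

  // read the file back and get the original object
  public static Object readObject(Configuration conf, Path path) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream in = fs.open(path);
    ObjectInputStream ois = new ObjectInputStream(in);
    Object obj = ois.readObject();
    ois.close();
    return obj;
  }
}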


Re: Running MapReduce without setJar

2009-04-02 Thread Rasit OZDAS
Yes. As additional info,
you can use this code to just submit the job without waiting until it's finished:

JobClient client = new JobClient(conf);
RunningJob job = client.submitJob(conf);

(The static JobClient.runJob(conf) call, by contrast, blocks until the job completes.)

2009/4/1 javateck javateck 

> you can run from java program:
>
>JobConf conf = new JobConf(MapReduceWork.class);
>
>// setting your params
>
>JobClient.runJob(conf);
>
>


Re: Reducer side output

2009-04-02 Thread Rasit OZDAS
I think it's because you don't have the right to write to the path you define.
Did you try it with a path under your user directory?

You can change the permissions from the console (e.g. with bin/hadoop fs -chmod).

2009/4/1 Nagaraj K 

> Hi,
>
> I am trying to do a side-effect output along with the usual output from the
> reducer.
> But for the side-effect output attempt, I get the following error.
>
> org.apache.hadoop.fs.permission.AccessControlException:
> org.apache.hadoop.fs.permission.AccessControlException: Permission denied:
> user=nagarajk, access=WRITE, inode="":hdfs:hdfs:rwxr-xr-x
>at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>at
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:90)
>at
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:52)
>at
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:2311)
>at org.apache.hadoop.dfs.DFSClient.create(DFSClient.java:477)
>at
> org.apache.hadoop.dfs.DistributedFileSystem.create(DistributedFileSystem.java:178)
>at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:503)
>at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:484)
>at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:391)
>at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:383)
>at
> org.yahoo.delphi.DecisionTree$AttStatReducer.reduce(DecisionTree.java:1310)
>at
> org.yahoo.delphi.DecisionTree$AttStatReducer.reduce(DecisionTree.java:1275)
>at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:319)
>at
> org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2206)
>
> My reducer code;
> =
> conf.set("group_stat", "some_path"); // Set during the configuration of
> jobconf object
>
> public static class ReducerClass extends MapReduceBase implements
> Reducer {
>FSDataOutputStream part=null;
>JobConf conf;
>
>public void reduce(Text key, Iterator values,
>   OutputCollector output,
>   Reporter reporter) throws IOException {
>double i_sum = 0.0;
>while (values.hasNext()) {
>i_sum += ((Double) values.next()).valueOf();
>}
>String [] fields = key.toString().split(SEP);
>if(fields.length==1)
>{
>   if(part==null)
>   {
>   FileSystem fs = FileSystem.get(conf);
>String jobpart =
> conf.get("mapred.task.partition");
>part = fs.create(new
> Path(conf.get("group_stat"),"/part-000"+jobpart)) ; // Failing here
>   }
>   part.writeBytes(fields[0] +"\t" + i_sum +"\n");
>
>}
>else
>output.collect(key, new DoubleWritable(i_sum));
>}
> }
>
> Can you guys let me know what I am doing wrong here!.
>
> Thanks
> Nagaraj K
>



-- 
M. Raşit ÖZDAŞ


Re: Strange Reduce Bahavior

2009-04-02 Thread Rasit OZDAS
Yes, we've built a local version of a Hadoop process.
We needed 500 input files in Hadoop to reach the speed of the local process;
the total time was 82 seconds on a cluster of 6 machines.
And I think that's good performance compared with other distributed processing
systems.

2009/4/2 jason hadoop 

> 3) The framework is designed for working on large clusters of machines
> where
> there needs to be a little delay between operations to avoid massive
> network
> loading spikes, and the initial setup of the map task execution environment
> on a machine, and the initial setup of the reduce task execution
> environment
> take a bit of time.
> In production jobs, these delays and setup times are lost in the overall
> task run time.
> In the small test job case the delays and setup times will be the bulk of
> the time spent executing the test.
>
>
>


Re: Cannot resolve Datonode address in slave file

2009-04-02 Thread Rasit OZDAS
Hi, Sim,

I have two suggestions, if you haven't tried them yet:

1. Check whether your other hosts can ssh to the master.
2. Take a look at the logs of the other hosts.

2009/4/2 Puri, Aseem 

>
> Hi
>
>I have a small Hadoop cluster with 3 machines. One is my
> NameNode/JobTracker + DataNode/TaskTracker and other 2 are
> DataNode/TaskTracker. So I have made all 3 as slave.
>
>
>
> In slave file I have put names of all there machines as:
>
>
>
> master
>
> slave
>
> slave1
>
>
>
> When I start Hadoop cluster it always start DataNode/TaskTracker on last
> slave in the list and do not start DataNode/TaskTracker on other two
> machines. Also I got the message as:
>
>
>
> slave1:
>
> : no address associated with name
>
> : no address associated with name
>
> slave1: starting datanode, logging to
> /home/HadoopAdmin/hadoop/bin/../logs/hadoo
>
> p-HadoopAdmin-datanode-ie11dtxpficbfise.out
>
>
>
> If I change the order in slave file like this:
>
>
>
> slave
>
> slave1
>
> master
>
>
>
> then DataNode/TaskTracker on master m/c starts and not on other two.
>
>
>
> Please tell how I should solve this problem.
>
>
>
> Sim
>
>


-- 
M. Raşit ÖZDAŞ


Re: reducer in M-R

2009-04-02 Thread Rasit OZDAS
Since every file name is different, you have a unique key for each map
output.
That means every iterator has only one element, so you won't need to search
for a given name.
But it's possible that I misunderstood you.
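
If file name as key and file content as value is what you want, the reduce side
would look roughly like this (an untested sketch; it assumes the map already
emits (file name, content or lines of that file)):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class FileContentReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text fileName, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    StringBuilder content = new StringBuilder();
    while (values.hasNext()) {           // with one whole-file value per key,
      content.append(values.next());     // this loop runs exactly once
    }
    output.collect(fileName, new Text(content.toString()));
  }
}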

2009/4/2 Vishal Ghawate 

> Hi ,
>
> I just wanted to know that values parameter passed to the reducer is always
> iterator ,
>
> Which is then used to iterate through for particular key
>
> Now I want to use file name as key and file content as its value
>
> So how can I set the parameters in the reducer
>
>
>
> Can anybody please help me on this.
>
>
>



-- 
M. Raşit ÖZDAŞ


Re: HELP: I wanna store the output value into a list not write to the disk

2009-04-02 Thread Rasit OZDAS
Hi, Hadoop is normally designed to write to disk. There is a special file
format which writes output to RAM instead of disk,
but I don't know if it's what you're looking for.
For what you describe to exist, there would have to be a mechanism which sends
output as objects rather than file content across computers; as far as I know
there is no such feature yet.

Good luck.

2009/4/2 andy2005cst 

>
> I need to use the output of the reduce, but I don't know how to do.
> use the wordcount program as an example if i want to collect the wordcount
> into a hashtable for further use, how can i do?
> the example just show how to let the result onto disk.
> myemail is : andy2005...@gmail.com
> looking forward your help. thanks a lot.
> --
> View this message in context:
> http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22844277.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
M. Raşit ÖZDAŞ


Re: mapreduce problem

2009-04-02 Thread Rasit OZDAS
MultipleOutputFormat would be what you want; it lets a job write to multiple
output files.
I can paste some code here if you want.
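
In the meantime, here is a rough sketch of what I mean (class and type names are
placeholders; adapt the key/value types to your job):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
  @Override
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    // one output file per key (e.g. per value extracted from the log line),
    // keeping the usual part-xxxxx name as a prefix
    return name + "_" + key.toString();
  }
}

// in the job setup:
//   conf.setOutputFormat(KeyBasedOutputFormat.class);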

2009/4/2 Vishal Ghawate 

> Hi,
>
> I am new to map-reduce programming model ,
>
>  I am writing a MR that will process the log file and results are written
> to
> different files on hdfs  based on some values in the log file
>
>The program is working fine even if I haven't done any
> processing in reducer ,I am not getting how to use reducer for solving my
> problem efficiently
>
> Can anybody please help me on this.
>
>
>
>
>



-- 
M. Raşit ÖZDAŞ


Re: Eclipse version for Hadoop-0.19.1

2009-04-01 Thread Rasit OZDAS
Try this page for eclipse europa:
http://rm.mirror.garr.it/mirrors/eclipse/technology/epp/downloads/release/europa/winter/

This one is the fastest for me;
if it's too slow, you can download from here:
http://archive.eclipse.org/eclipse/downloads/

Rasit

2009/4/1 Puri, Aseem 

> Hi
>Please tell which eclipse version should I use which support
> hadoop-0.19.0-eclipse-plugin and from where I can download it?
>
>
> -Original Message-----
> From: Rasit OZDAS [mailto:rasitoz...@gmail.com]
> Sent: Friday, March 20, 2009 9:19 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Eclipse version for Hadoop-0.19.1
>
> I also couldn't succeed in running it in ganymede.
> I use eclipse europa with v. 0.19.0. I would give it a try for 19.1,
> though.
>
> 2009/3/18 Puri, Aseem 
>
> > I am using Hadoop - HBase 0.18 and my eclipse supports
> > hadoop-0.18.0-eclipse-plugin.
> >
> >
> >
> >   When I switch to Hadoop 0.19.1 and use
> > hadoop-0.19.0-eclipse-plugin then my eclipse doesn't show mapreduce
> > perspective. I am using Eclipse Platform (GANYMEDE), Version: 3.4.1.
> >
> >
> >
> > Can anyone pls tell which version of eclipse supports Hadoop 0.19.1?
> >
> >
> >
> >
> >
> > Thanks & Regards
> >
> > Aseem Puri
> >
> >
> >
> >
> >
> >
>
>
> --
> M. Raşit ÖZDAŞ
>



-- 
M. Raşit ÖZDAŞ


Re: Reduce doesn't start until map finishes

2009-03-24 Thread Rasit OZDAS
Just to follow up: we installed v0.21.0-dev and the issue is gone now.

2009/3/6 Rasit OZDAS 

> So, is there currently no solution to my problem?
> Should I live with it? Or do we have to have a JIRA for this?
> What do you think?
>
>
> 2009/3/4 Nick Cen 
>
> Thanks, about the "Secondary Sort", can you provide some example. What does
>> the intermediate keys stands for?
>>
>> Assume I have two mapper, m1 and m2. The output of m1 is (k1,v1),(k2,v2)
>> and
>> the output of m2 is (k1,v3),(k2,v4). Assume k1 and k2 belongs to the same
>> partition and k1 < k2, so i think the order inside reducer maybe:
>> (k1,v1)
>> (k1,v3)
>> (k2,v2)
>> (k2,v4)
>>
>> can the Secondary Sort change this order?
>>
>>
>>
>> 2009/3/4 Chris Douglas 
>>
>> > The output of each map is sorted by partition and by key within that
>> > partition. The reduce merges sorted map output assigned to its partition
>> > into the reduce. The following may be helpful:
>> >
>> > http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
>> >
>> > If your job requires total order, consider
>> > o.a.h.mapred.lib.TotalOrderPartitioner. -C
>> >
>> >
>> > On Mar 3, 2009, at 7:24 PM, Nick Cen wrote:
>> >
>> >  can you provide more info about sorting? Does the sort happen on the
>> whole
>> >> data set, or just on the specified partition?
>> >>
>> >> 2009/3/4 Mikhail Yakshin 
>> >>
>> >>  On Wed, Mar 4, 2009 at 2:09 AM, Chris Douglas wrote:
>> >>>
>> >>>> This is normal behavior. The Reducer is guaranteed to receive all the
>> >>>> results for its partition in sorted order. No reduce can start until
>> all
>> >>>>
>> >>> the
>> >>>
>> >>>> maps are completed, since any running map could emit a result that
>> would
>> >>>> violate the order for the results it currently has. -C
>> >>>>
>> >>>
>> >>> _Reducers_ usually start almost immediately and start downloading data
>> >>> emitted by mappers as they go. This is their first phase. Their second
>> >>> phase can start only after completion of all mappers. In their second
>> >>> phase, they're sorting received data, and in their third phase they're
>> >>> doing real reduction.
>> >>>
>> >>> --
>> >>> WBR, Mikhail Yakshin
>> >>>
>> >>>
>> >>
>> >>
>> >> --
>> >> http://daily.appspot.com/food/
>> >>
>> >
>> >
>>
>>
>> --
>> http://daily.appspot.com/food/
>>
>
>
>
> --
> M. Raşit ÖZDAŞ
>



-- 
M. Raşit ÖZDAŞ


Running Balancer from API

2009-03-23 Thread Rasit OZDAS
Hi,

I try to start balancer from API
(org.apache.hadoop.hdfs.server.balancer.Balancer.main() ), but I get
NullPointerException.

09/03/23 15:17:37 ERROR dfs.Balancer: java.lang.NullPointerException
at org.apache.hadoop.dfs.Balancer.run(Balancer.java:1453)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.dfs.Balancer.main(Balancer.java:792)

It's this line (Balancer.java:1453):
fs.delete(BALANCER_ID_PATH, true);

The process doesn't start at all; I assume it tries to delete a path that the
balancer hasn't created yet.
Is this a known issue?

Rasit


Re: How to write custom web-ui

2009-03-20 Thread Rasit OZDAS
Thanks, Stefan.

That's a good starting point. Is there any way to get the JobTracker instance?
I also need to pass it in from my own code.
Do I have to initialize the JobTracker myself?

2009/3/19 Stefan Podkowinski 

> This is done in the JobTracker class during the bootstrap process of
> the jetty servlet engine.
> Searching for 'setAttribute("job.tracker"' should find the exact position.
>
>
>


Re: Eclipse version for Hadoop-0.19.1

2009-03-20 Thread Rasit OZDAS
I also couldn't get it running in Ganymede.
I use Eclipse Europa with v0.19.0; I would give it a try with 0.19.1, though.

2009/3/18 Puri, Aseem 

> I am using Hadoop - HBase 0.18 and my eclipse supports
> hadoop-0.18.0-eclipse-plugin.
>
>
>
>   When I switch to Hadoop 0.19.1 and use
> hadoop-0.19.0-eclipse-plugin then my eclipse doesn't show mapreduce
> perspective. I am using Eclipse Platform (GANYMEDE), Version: 3.4.1.
>
>
>
> Can anyone pls tell which version of eclipse supports Hadoop 0.19.1?
>
>
>
>
>
> Thanks & Regards
>
> Aseem Puri
>
>
>
>
>
>


-- 
M. Raşit ÖZDAŞ


How to write custom web-ui

2009-03-18 Thread Rasit OZDAS
Hi,

The Hadoop web UI isn't sufficient for our project, so I need to change
it a little bit. But the pages generally start with
application.getAttribute("job.tracker") to get the JobTracker instance, or
similarly for the NameNode instance. I couldn't find anything in the code about
where they're first initialized.
Where in the code are these instances set? Or is there another way to get the
instances?

Any help is appreciated,
Rasit


Re: merging files

2009-03-18 Thread Rasit OZDAS
I would use DistributedCache: put file2 into the distributed cache, but note
that you have to read it in every map task.
If you find a better solution, please let me know, because I have a similar
issue.
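
A rough, untested sketch of what I mean (the paths and the comma format are
taken from your example; the class name is made up):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Mapper over file1; file2 is shipped to every node in the driver with
// something like: DistributedCache.addCacheFile(new URI("/user/me/file2"), conf);
public class ExcludeKeysMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final Set<String> excludedKeys = new HashSet<String>();

  public void configure(JobConf job) {
    try {
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
      String line;
      while ((line = in.readLine()) != null) {
        excludedKeys.add(line.trim());            // file2 holds one key per line
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("could not read cached file2", e);
    }
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String[] parts = line.toString().split(",");  // file1 lines look like "key,value"
    if (!excludedKeys.contains(parts[0])) {
      output.collect(new Text(parts[0]), new Text(parts[1]));
    }
  }
}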

Rasit

2009/3/18 Nir Zohar 

> Hi,
>
>
>
> I would like your help with the below question.
>
> I have 2 files: file1 (key, value), file2 (only key) and I need to exclude
> all records from file1 that these key records not in file2.
>
> 1. The output format is key-value, not only keys.
>
> 2. The key is not primary key; hence it's not possible to have joined in
> the
> end.
>
>
>
> Can you assist?
>
>
>
> Thanks,
>
> Nir.
>
>
>
>
>
> Example:
>
>
>
> file1:
>
> 2,1
>
> 2,3
>
> 2,5
>
> 3,1
>
> 3,2
>
> 4,7
>
> 4,9
>
> 6,3
>
>
>
> file2:
>
> 4
>
> 2
>
>
>
> Output:
>
> 3,1
>
> 3,2
>
> 6,3
>
>
>
>
>
>
>
>


-- 
M. Raşit ÖZDAŞ


Re: MultipleOutputFormat with sorting functionality

2009-03-09 Thread Rasit OZDAS
Thanks, Nick!

It seems that sorting takes place on the map side, not in reduce :)
I've added double values in front of every map key, and the problem is solved
now.
I know it's more of a workaround than a real solution,
and I don't know if it has performance problems. Any ideas? I'm not
familiar with what Hadoop does exactly when I do this.

Rasit

2009/3/9 Nick Cen 

> I think the sort is not relatived to the output format.
>
> I previously have try this class ,but has a little different compared to
> your code. I extend the MultipleTextOutputFormat class and override
> its generateFileNameForKeyValue()
> method, and everything seems working fine.
>
> 2009/3/9 Rasit OZDAS 
>
> > Hi, all!
> >
> > I'm using multiple output format to write out 4 different files, each one
> > has the same type.
> > But it seems that outputs aren't being sorted.
> >
> > Should they be sorted? Or isn't it implemented for multiple output
> format?
> >
> > Here is some code:
> >
> > // in main function
> > MultipleOutputs.addMultiNamedOutput(conf, "text", TextOutputFormat.class,
> > DoubleWritable.class, Text.class);
> >
> > // in Reducer.configure()
> > mos = new MultipleOutputs(conf);
> >
> > // in Reducer.reduce()
> > if (keystr.equalsIgnoreCase("BreachFace"))
> >mos.getCollector("text", "BreachFace",
> reporter).collect(new
> > Text(key), dbl);
> >else if (keystr.equalsIgnoreCase("Ejector"))
> >mos.getCollector("text", "Ejector", reporter).collect(new
> > Text(key), dbl);
> >else if (keystr.equalsIgnoreCase("FiringPin"))
> >mos.getCollector("text", "FiringPin",
> reporter).collect(new
> > Text(key), dbl);
> >else if (keystr.equalsIgnoreCase("WeightedSum"))
> >mos.getCollector("text", "WeightedSum",
> > reporter).collect(new Text(key), dbl);
> >else
> >mos.getCollector("text", "Diger", reporter).collect(new
> > Text(key), dbl);
> >
> >
> > --
> > M. Raşit ÖZDAŞ
> >
>
>
>
> --
> http://daily.appspot.com/food/
>



-- 
M. Raşit ÖZDAŞ


MultipleOutputFormat with sorting functionality

2009-03-09 Thread Rasit OZDAS
Hi, all!

I'm using multiple output format to write out 4 different files, each one
has the same type.
But it seems that outputs aren't being sorted.

Should they be sorted? Or isn't it implemented for multiple output format?

Here is some code:

// in main function
MultipleOutputs.addMultiNamedOutput(conf, "text", TextOutputFormat.class,
DoubleWritable.class, Text.class);

// in Reducer.configure()
mos = new MultipleOutputs(conf);

// in Reducer.reduce()
if (keystr.equalsIgnoreCase("BreachFace"))
mos.getCollector("text", "BreachFace", reporter).collect(new
Text(key), dbl);
else if (keystr.equalsIgnoreCase("Ejector"))
mos.getCollector("text", "Ejector", reporter).collect(new
Text(key), dbl);
else if (keystr.equalsIgnoreCase("FiringPin"))
mos.getCollector("text", "FiringPin", reporter).collect(new
Text(key), dbl);
else if (keystr.equalsIgnoreCase("WeightedSum"))
mos.getCollector("text", "WeightedSum",
reporter).collect(new Text(key), dbl);
else
mos.getCollector("text", "Diger", reporter).collect(new
Text(key), dbl);


-- 
M. Raşit ÖZDAŞ


Re: question about released version id

2009-03-09 Thread Rasit OZDAS
Hi, here is the versioning methodology of the Apache Portable Runtime;
I think Hadoop's is more or less the same:

http://apr.apache.org/versioning.html

Rasit


2009/3/3 鞠適存 

> hi,
>
> I wonder how to make the hadoop version number.
> The HowToRelease page on the hadoop web site just describes
> the process about new release but not mentions the rules on
> assigning the version number. Are there any criteria for version number?
> For example,under what condition the next version of 0.18.0 would be call
> as
> 0.19.0, and
> under what condtion  the next version of 0.18.0 would be call as 0.18.1?
> In addition, did the other Apache projects (such as hbase) use the same
> criteria to decide the
> version number?
>
> Thank you in advance for any pointers.
>
> Chu, ShihTsun
>



-- 
M. Raşit ÖZDAŞ


Re: Profiling Map/Reduce Tasks

2009-03-09 Thread Rasit OZDAS
I note System.currentTimeMillis() at the beginning of the main function,
then at the end I use a while loop to wait for the job:

while (!runningJob.isComplete())
  Thread.sleep(1000);

Then I note the system time again. But this only gives the total elapsed time.
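
The whole pattern, roughly (old API, untested; conf is your JobConf and this is
the body of a main() that declares throws Exception):

long start = System.currentTimeMillis();        // at the beginning of main()

JobClient client = new JobClient(conf);
RunningJob runningJob = client.submitJob(conf); // submit without blocking

while (!runningJob.isComplete()) {
  Thread.sleep(1000);                           // 1-second polling limits precision
}

long elapsedMs = System.currentTimeMillis() - start;
System.out.println("job finished in ~" + elapsedMs + " ms");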

Rasit

2009/3/8 Richa Khandelwal 

> Hi,
> Does Map/Reduce profiles jobs down to milliseconds. From what I can see in
> the logs, there is no time specified for the job. Although CPU TIME is an
> information that should be present in the logs, it was not profiled and the
> response time can only be noted in down to seconds from the runtime
> progress
> of the jobs.
>
> Does someone know how to efficiently profile map reduce jobs?
>
> Thanks,
> Richa Khandelwal
>
>
> University Of California,
> Santa Cruz.
> Ph:425-241-7763
>



-- 
M. Raşit ÖZDAŞ


Re: Does "hadoop-default.xml" + "hadoop-site.xml" matter for whole cluster or each node?

2009-03-09 Thread Rasit OZDAS
Some parameters are global (I can't give an example right now);
they are cluster-wide even if they're defined in hadoop-site.xml.

Rasit

2009/3/9 Nick Cen 

> for Q1: i think so , but i think it is a good practice to keep the
> hadoop-default.xml untouched.
> for Q2: i use this property for debugging in eclipse.
>
>
>
> 2009/3/9 
>
> >
> >
> >  The hadoop-site.xml will take effect only on that specified node. So
> each
> >> node can have its own configuration with hadoop-site.xml.
> >>
> >>
> > As i understand, parameters in "hadoop-site" overwrites these ones in
> > "hadoop-default".
> > So "hadoop-default" also individual for each node?
> >
> > Q2: what means "local" as value of "mapred.job.tracker"?
> >
> > thanks
> >
>
>
>
> --
> http://daily.appspot.com/food/
>



-- 
M. Raşit ÖZDAŞ


Re: Setting ctime in HDFS

2009-03-06 Thread Rasit OZDAS
Cosmin, unfortunately there isn't such a method yet (in the FileSystem API).

Rasit

2009/3/6 Cosmin Lehene 

> Hi,
>
> Is there any way to create a file in HDFS and set the creation date(ctime)
> in the file attributes?
>
> Thanks,
> Cosmin
>


Re: MapReduce jobs with expensive initialization

2009-03-06 Thread Rasit OZDAS
Owen, I tried this and it doesn't work.
I doubt the static singleton method will work either,
since it's more or less the same thing.

Rasit

2009/3/2 Owen O'Malley 

>
> On Mar 2, 2009, at 3:03 AM, Tom White wrote:
>
>  I believe the static singleton approach outlined by Scott will work
>> since the map classes are in a single classloader (but I haven't
>> actually tried this).
>>
>
> Even easier, you should just be able to do it with static initialization in
> the Mapper class. (I haven't tried it either... )
>
> -- Owen
>



-- 
M. Raşit ÖZDAŞ


Re: Reduce doesn't start until map finishes

2009-03-05 Thread Rasit OZDAS
So, is there currently no solution to my problem?
Should I live with it? Or do we have to have a JIRA for this?
What do you think?


2009/3/4 Nick Cen 

> Thanks, about the "Secondary Sort", can you provide some example. What does
> the intermediate keys stands for?
>
> Assume I have two mapper, m1 and m2. The output of m1 is (k1,v1),(k2,v2)
> and
> the output of m2 is (k1,v3),(k2,v4). Assume k1 and k2 belongs to the same
> partition and k1 < k2, so i think the order inside reducer maybe:
> (k1,v1)
> (k1,v3)
> (k2,v2)
> (k2,v4)
>
> can the Secondary Sort change this order?
>
>
>
> 2009/3/4 Chris Douglas 
>
> > The output of each map is sorted by partition and by key within that
> > partition. The reduce merges sorted map output assigned to its partition
> > into the reduce. The following may be helpful:
> >
> > http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
> >
> > If your job requires total order, consider
> > o.a.h.mapred.lib.TotalOrderPartitioner. -C
> >
> >
> > On Mar 3, 2009, at 7:24 PM, Nick Cen wrote:
> >
> >  can you provide more info about sortint? The sort is happend on the
> whole
> >> data set, or just on the specified partition?
> >>
> >> 2009/3/4 Mikhail Yakshin 
> >>
> >>  On Wed, Mar 4, 2009 at 2:09 AM, Chris Douglas wrote:
> >>>
>  This is normal behavior. The Reducer is guaranteed to receive all the
>  results for its partition in sorted order. No reduce can start until
> all
> 
> >>> the
> >>>
>  maps are completed, since any running map could emit a result that
> would
>  violate the order for the results it currently has. -C
> 
> >>>
> >>> _Reducers_ usually start almost immediately and start downloading data
> >>> emitted by mappers as they go. This is their first phase. Their second
> >>> phase can start only after completion of all mappers. In their second
> >>> phase, they're sorting received data, and in their third phase they're
> >>> doing real reduction.
> >>>
> >>> --
> >>> WBR, Mikhail Yakshin
> >>>
> >>>
> >>
> >>
> >> --
> >> http://daily.appspot.com/food/
> >>
> >
> >
>
>
> --
> http://daily.appspot.com/food/
>



-- 
M. Raşit ÖZDAŞ


Re: Reduce doesn't start until map finishes

2009-03-01 Thread Rasit OZDAS
Strange: last night I tried 1 input files (maps), and the waiting time
after the maps increases (probably linearly).

2009/3/2 Rasit OZDAS 

> I have 6 reducers, Nick, still no luck..
>
> 2009/3/2 Nick Cen 
>
> how many reducer do you have? You should make this value larger then 1 to
>> make mapper and reducer run concurrently. You can set this value from
>> JobConf.*setNumReduceTasks*().
>>
>>
>> 2009/3/2 Rasit OZDAS 
>>
>> > Hi!
>> >
>> > Whatever code I run on hadoop, reduce starts a few seconds after map
>> > finishes.
>> > And worse, when I run 10 jobs parallely (using threads and sending one
>> > after
>> > another)
>> > all maps finish sequentially, then after 8-10 seconds reduces start.
>> > I use reducer also as combiner, my cluster has 6 machines, namenode and
>> > jobtracker run also as slaves.
>> > There were 44 maps and 6 reduces in the last example, I never tried a
>> > bigger
>> > job.
>> >
>> > What can the problem be? I've read somewhere that this is not the normal
>> > behaviour.
>> > Replication factor is 3.
>> > Thank you in advance for any pointers.
>> >
>> > Rasit
>> >
>>
>>
>>
>> --
>> http://daily.appspot.com/food/
>>
>
>
>
> --
> M. Raşit ÖZDAŞ
>



-- 
M. Raşit ÖZDAŞ


Re: Reduce doesn't start until map finishes

2009-03-01 Thread Rasit OZDAS
I have 6 reducers, Nick, still no luck..

2009/3/2 Nick Cen 

> how many reducer do you have? You should make this value larger then 1 to
> make mapper and reducer run concurrently. You can set this value from
> JobConf.*setNumReduceTasks*().
>
>
> 2009/3/2 Rasit OZDAS 
>
> > Hi!
> >
> > Whatever code I run on hadoop, reduce starts a few seconds after map
> > finishes.
> > And worse, when I run 10 jobs parallely (using threads and sending one
> > after
> > another)
> > all maps finish sequentially, then after 8-10 seconds reduces start.
> > I use reducer also as combiner, my cluster has 6 machines, namenode and
> > jobtracker run also as slaves.
> > There were 44 maps and 6 reduces in the last example, I never tried a
> > bigger
> > job.
> >
> > What can the problem be? I've read somewhere that this is not the normal
> > behaviour.
> > Replication factor is 3.
> > Thank you in advance for any pointers.
> >
> > Rasit
> >
>
>
>
> --
> http://daily.appspot.com/food/
>



-- 
M. Raşit ÖZDAŞ


Re: When do we use the Key value for a map function?

2009-03-01 Thread Rasit OZDAS
Amit, it's not used in this example, but it has other uses.
With the default TextInputFormat the key is the byte offset of the line in the
input file; with other input formats you can, for example, pass in the name of
the input file as the key, as I needed to.

Rasit

2009/3/1 Kumar, Amit H. 

> A very Basic Question:
>
> From the WordCount example below: I don't see why we need the
> "LongWritable key" argument in the Map function. Can anybody tell me the
> importance of it?
> As I understand the worker process reads in the designated input split as a
> series of strings, which the map function operates on to produce the <key, value> pair, in this case the 'output' variable. Then, why would one need
> "LongWritable key" as the argument for map function?
>
> Thank you,
> Amit
>
> 
> public static class MapClass extends MapReduceBase
>implements Mapper {
>
>private final static IntWritable one = new IntWritable(1);
>private Text word = new Text();
>
>public void map(LongWritable key, Text value,
>OutputCollector output,
>Reporter reporter) throws IOException {
>  String line = value.toString();
>  StringTokenizer itr = new StringTokenizer(line);
>  while (itr.hasMoreTokens()) {
>word.set(itr.nextToken());
>output.collect(word, one);
>  }
>}
>  }
> 
>
>
>


-- 
M. Raşit ÖZDAŞ


Reduce doesn't start until map finishes

2009-03-01 Thread Rasit OZDAS
Hi!

Whatever code I run on hadoop, reduce starts a few seconds after map
finishes.
And worse, when I run 10 jobs in parallel (using threads, submitting one after
another),
all the maps finish sequentially, and then after 8-10 seconds the reduces start.
I use the reducer also as a combiner; my cluster has 6 machines, and the
namenode and jobtracker also run as slaves.
There were 44 maps and 6 reduces in the last example; I never tried a bigger
job.

What can the problem be? I've read somewhere that this is not the normal
behaviour.
Replication factor is 3.
Thank you in advance for any pointers.

Rasit


Re: why print this error when using MultipleOutputFormat?

2009-02-25 Thread Rasit OZDAS
Qiang,
I can't find which one right now, but there is a JIRA issue about
MultipleTextOutputFormat (especially when the number of reducers is 0).
If you have no reducers, you can try having one or two; then you can see whether
your problem is related to that one.

Cheers,
Rasit

2009/2/25 ma qiang 

> Thanks for your reply.
> If I increase the number of computers, can we solve this problem of
> running out of file descriptors?
>
>
>
>
> On Wed, Feb 25, 2009 at 11:07 AM, jason hadoop 
> wrote:
> > My 1st guess is that your application is running out of file
> > descriptors,possibly because your MultipleOutputFormat  instance is
> opening
> > more output files than you expect.
> > Opening lots of files in HDFS is generally a quick route to bad job
> > performance if not job failure.
> >
> > On Tue, Feb 24, 2009 at 6:58 PM, ma qiang  wrote:
> >
> >> Hi all,
> >>   I have one class extends MultipleOutputFormat as below,
> >>
> >>  public class MyMultipleTextOutputFormat<K, V> extends
> >>  MultipleOutputFormat<K, V> {
> >>      private TextOutputFormat<K, V> theTextOutputFormat = null;
> >>
> >>      @Override
> >>      protected RecordWriter<K, V> getBaseRecordWriter(FileSystem fs,
> >>              JobConf job, String name, Progressable arg3) throws
> >>              IOException {
> >>          if (theTextOutputFormat == null) {
> >>              theTextOutputFormat = new TextOutputFormat<K, V>();
> >>          }
> >>          return theTextOutputFormat.getRecordWriter(fs, job, name,
> >>              arg3);
> >>      }
> >>
> >>      @Override
> >>      protected String generateFileNameForKeyValue(K key, V value,
> >>              String name) {
> >>          return name + "_" + key.toString();
> >>      }
> >>  }
> >>
> >>
> >> also conf.setOutputFormat(MultipleTextOutputFormat2.class) in my job
> >> configuration. but when the program run, error print as follow:
> >>
> >> 09/02/25 10:22:32 INFO mapred.JobClient: Task Id :
> >> attempt_200902250959_0002_r_01_0, Status : FAILED
> >> java.io.IOException: Could not read from stream
> >>at
> >> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:119)
> >>at java.io.DataInputStream.readByte(DataInputStream.java:248)
> >>at
> >> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:325)
> >>at
> >> org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:346)
> >>at org.apache.hadoop.io.Text.readString(Text.java:400)
> >>at
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2779)
> >>at
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2704)
> >>at
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
> >>at
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> >>
> >> 09/02/25 10:22:42 INFO mapred.JobClient:  map 100% reduce 69%
> >> 09/02/25 10:22:55 INFO mapred.JobClient:  map 100% reduce 0%
> >> 09/02/25 10:22:55 INFO mapred.JobClient: Task Id :
> >> attempt_200902250959_0002_r_00_1, Status : FAILED
> >> org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
> >>
> >>
> /user/qiang/output/_temporary/_attempt_200902250959_0002_r_00_1/part-0_t0x5y3
> >> could only be replicated to 0 nodes, instead of 1
> >>at
> >>
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1270)
> >>at
> >>
> org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
> >>at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
> >>at
> >>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>at java.lang.reflect.Method.invoke(Method.java:597)
> >>at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
> >>at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
> >>at org.apache.hadoop.ipc.Client.call(Client.java:696)
> >>at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
> >>at $Proxy1.addBlock(Unknown Source)
> >>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >>at
> >>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >>at
> >>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>at java.lang.reflect.Method.invoke(Method.java:597)
> >>at
> >>
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
> >>at
> >>
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
> >>at $Proxy1.addBlock(Unknown Source)
> >>at
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815)
> >>at
> >>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java

Re: Hadoop Streaming -file option

2009-02-23 Thread Rasit OZDAS
Hadoop uses its own RPC plus a socket-based data transfer protocol (not RMI)
for file copy operations.
Datanodes listen on port 50010 for this.
I assume it's sending the file as a byte stream.
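
For what it's worth, a typical streaming invocation with -file looks roughly
like this (jar name and paths are made up); as far as I know the shipped files
end up in each task's current working directory:

    bin/hadoop jar contrib/streaming/hadoop-0.19.0-streaming.jar \
      -input  /user/bing/input \
      -output /user/bing/output \
      -mapper "python my_mapper.py" \
      -reducer "python my_reducer.py" \
      -file my_mapper.py \
      -file my_reducer.py \
      -file big_lookup_table.dat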

Cheers,
Rasit

2009/2/23 Bing TANG 

> Hi, everyone,
> Could somdone tell me the principle of "-file" when using Hadoop
> Streaming. I want to ship a big file to Slaves, so how it works?
>
> Hadoop uses "SCP" to copy? How does Hadoop deal with -file option?
>
>
>


-- 
M. Raşit ÖZDAŞ


Re: Super-long reduce task timeouts in hadoop-0.19.0

2009-02-21 Thread Rasit OZDAS
I agree about the timeout period, Bryan.
Reporter has a progress() method to tell the framework (the tasktracker, and
through it the jobtracker) that the task is still working, so there should be
no need to kill the job.
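
For example, a long-running reduce can simply call it inside its loop - a
minimal sketch with the old API (imports are the usual
org.apache.hadoop.io/mapred ones plus java.util.Iterator):

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
        reporter.progress();   // tell the framework this task is still alive
      }
      output.collect(key, new IntWritable(sum));
    }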


2009/2/21 Bryan Duxbury 

> We didn't customize this value, to my knowledge, so I'd suspect it's the
> default.
> -Bryan
>
>
> On Feb 20, 2009, at 5:00 PM, Ted Dunning wrote:
>
>  How often do your reduce tasks report status?
>>
>> On Fri, Feb 20, 2009 at 3:58 PM, Bryan Duxbury  wrote:
>>
>>  (Repost from the dev list)
>>>
>>>
>>> I noticed some really odd behavior today while reviewing the job history
>>> of
>>> some of our jobs. Our Ganglia graphs showed really long periods of
>>> inactivity across the entire cluster, which should definitely not be the
>>> case - we have a really long string of jobs in our workflow that should
>>> execute one after another. I figured out which jobs were running during
>>> those periods of inactivity, and discovered that almost all of them had
>>> 4-5
>>> failed reduce tasks, with the reason for failure being something like:
>>>
>>> Task attempt_200902061117_3382_r_38_0 failed to report status for
>>> 1282
>>> seconds. Killing!
>>>
>>> The actual timeout reported varies from 700-5000 seconds. Virtually all
>>> of
>>> our longer-running jobs were affected by this problem. The period of
>>> inactivity on the cluster seems to correspond to the amount of time the
>>> job
>>> waited for these reduce tasks to fail.
>>>
>>> I checked out the tasktracker log for the machines with timed-out reduce
>>> tasks looking for something that might explain the problem, but the only
>>> thing I came up with that actually referenced the failed task was this
>>> log
>>> message, which was repeated many times:
>>>
>>> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>>
>>> taskTracker/jobcache/job_200902061117_3388/attempt_200902061117_3388_r_66_0/output/file.out
>>> in any of the configured local directories
>>>
>>> I'm not sure what this means; can anyone shed some light on this message?
>>>
>>> Further confusing the issue, on the affected machines, I looked in
>>> logs/userlogs/, and to my surprise, the directory and log files
>>> existed, and the syslog file seemed to contain logs of a perfectly good
>>> reduce task!
>>>
>>> Overall, this seems like a pretty critical bug. It's consuming up to 50%
>>> of
>>> the runtime of our jobs in some instances, killing our throughput. At the
>>> very least, it seems like the reduce task timeout period should be MUCH
>>> shorter than the current 10-20 minutes.
>>>
>>> -Bryan
>>>
>>>
>>
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>>
>> 111 West Evelyn Ave. Ste. 202
>> Sunnyvale, CA 94086
>> www.deepdyve.com
>> 408-773-0110 ext. 738
>> 858-414-0013 (m)
>> 408-773-0220 (fax)
>>
>
>


-- 
M. Raşit ÖZDAŞ


Re: Map/Recuce Job done locally?

2009-02-20 Thread Rasit OZDAS
Philipp, I have no problem running jobs locally with eclipse (via the hadoop
plugin) and observing them from the browser.
(Please note that the jobtracker page doesn't refresh automatically; you need
to refresh it manually. Also, if the job runs with the local job runner -
i.e. mapred.job.tracker is "local" in the configuration the job uses - it
will never show up in the JobTracker page at all.)

Cheers,
Rasit

2009/2/19 Philipp Dobrigkeit 

> When I start my job from eclipse it gets processed and the output is
> generated, but it never shows up in my JobTracker, which is opened in my
> browser. Why is this happening?
> --
>



-- 
M. Raşit ÖZDAŞ


Re: empty log file...

2009-02-20 Thread Rasit OZDAS
Zander,
I've looked at my datanode logs on the slaves, and they are all quite small,
although we've run many jobs on them.
Running 2 new jobs also didn't add anything to them.
(As I understand from their contents, hadoop mainly logs operations such as
DFS performance tests there.)

Cheers,
Rasit

2009/2/20 zander1013 

>
> hi,
>
> i am setting up hadoop for the first time on multi-node cluster. right now
> i
> have two nodes. the two node cluster consists of two laptops connected via
> ad-hoc wifi network. they they do not have access to the internet. i
> formated the datanodes on both machines prior to startup...
>
> output form the commands /usr/local/hadoop/bin/start-all.sh, jps (on both
> machines), and /usr/local/hadoop/bin/stop-all.sh all appear normal. however
> the file /usr/local/hadoop/logs/hadoop-hadoop-datanode-node1.log (the slave
> node) is empty.
>
> the same file for the master node shows the startup and shutdown events as
> normal and without error.
>
> is it okay that the log file on the slave is empty?
>
> zander
> --
> View this message in context:
> http://www.nabble.com/empty-log-file...-tp22113398p22113398.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
M. Raşit ÖZDAŞ


Re: Problems getting Eclipse Hadoop plugin to work.

2009-02-20 Thread Rasit OZDAS
Erik, did you place the ports correctly in the properties window?
Port 9001 goes under "Map/Reduce Master" on the left, 9000 under "DFS Master"
on the right (matching mapred.job.tracker and fs.default.name respectively).


2009/2/19 Erik Holstad 

> Thanks guys!
> Running Linux and the remote cluster is also Linux.
> I have the properties set up like that already on my remote cluster, but
> not sure where to input this info into Eclipse.
> And when changing the ports to 9000 and 9001 I get:
>
> Error: java.io.IOException: Unknown protocol to job tracker:
> org.apache.hadoop.dfs.ClientProtocol
>
> Regards Erik
>



-- 
M. Raşit ÖZDAŞ


Re: Problems getting Eclipse Hadoop plugin to work.

2009-02-19 Thread Rasit OZDAS
Erik,
Try to add the following properties into hadoop-site.xml:

<property>
  <name>fs.default.name</name>
  <value>hdfs://<host>:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>hdfs://<host>:9001</value>
</property>

This way your ports become static. Then use port 9001 for MR, 9000 for HDFS
in your properties window.
If it still doesn't work, try to write ip address instead of host name as
target host.

Hope this helps,
Rasit

2009/2/18 Erik Holstad 

> I'm using Eclipse 3.3.2 and want to view my remote cluster using the Hadoop
> plugin.
> Everything shows up and I can see the map/reduce perspective but when
> trying
> to
> connect to a location I get:
> "Error: Call failed on local exception"
>
> I've set the host to for example xx0, where xx0 is a remote machine
> accessible from
> the terminal, and the ports to 50020/50040 for M/R master and
> DFS master respectively. Is there anything I'm missing to set for remote
> access to the
> Hadoop cluster?
>
> Regards Erik
>



-- 
M. Raşit ÖZDAŞ


Re: Overriding mapred.tasktracker.map.tasks.maximum with -jobconf

2009-02-18 Thread Rasit OZDAS
I see, John.
I also use 0.19. Just to note, the -D option should come first, since it's
one of the generic options. I use it without any errors.
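
A rough example of the ordering (generic options before the command options;
jar name, paths and the property are only illustrative):

    bin/hadoop jar contrib/streaming/hadoop-0.19.0-streaming.jar \
      -D mapred.reduce.tasks=2 \
      -input  /user/john/input \
      -output /user/john/output \
      -mapper cat \
      -reducer cat

(As far as I know, a tasktracker-level property such as
mapred.tasktracker.map.tasks.maximum is read when the tasktracker starts, so
it can only be changed in the tasktracker's hadoop-site.xml, not per job.)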

Cheers,
Rasit

2009/2/18 S D 

> Thanks for your response Rasit. You may have missed a portion of my post.
>
> > On a different note, when I attempt to pass params via -D I get a usage
> message; when I use
> > -jobconf the command goes through (and works in the case of
> mapred.reduce.tasks=0 for
> > example) but I get  a deprecation warning).
>
> I'm using Hadoop 0.19.0 and -D is not working. Are you using version 0.19.0
> as well?
>
> John
>
>
> On Wed, Feb 18, 2009 at 9:14 AM, Rasit OZDAS  wrote:
>
> > John, did you try -D option instead of -jobconf,
> >
> > I had -D option in my code, I changed it with -jobconf, this is what I
> get:
> >
> > ...
> > ...
> > Options:
> >  -input    <path>     DFS input file(s) for the Map step
> >  -output   <path>     DFS output directory for the Reduce step
> >  -mapper   <cmd|JavaClassName>      The streaming command to run
> >  -combiner <JavaClassName> Combiner has to be a Java class
> >  -reducer  <cmd|JavaClassName>      The streaming command to run
> >  -file     <file>     File/dir to be shipped in the Job jar file
> >  -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
> >  -outputformat TextOutputFormat(default)|JavaClassName  Optional.
> >  -partitioner JavaClassName  Optional.
> >  -numReduceTasks <num>  Optional.
> >  -inputreader <spec>  Optional.
> >  -cmdenv   <n>=<v>    Optional. Pass env.var to streaming commands
> >  -mapdebug <path>  Optional. To run this script when a map task fails
> >  -reducedebug <path>  Optional. To run this script when a reduce task fails
> >
> >  -verbose
> >
> > Generic options supported are
> > -conf <configuration file>     specify an application configuration file
> > -D <property=value>            use value for given property
> > -fs <local|namenode:port>      specify a namenode
> > -jt <local|jobtracker:port>    specify a job tracker
> > -files <comma separated list of files>    specify comma separated files to
> > be copied to the map reduce cluster
> > -libjars <comma separated list of jars>    specify comma separated jar
> > files to include in the classpath.
> > -archives <comma separated list of archives>    specify comma separated
> > archives to be unarchived on the compute machines.
> >
> > The general command line syntax is
> > bin/hadoop command [genericOptions] [commandOptions]
> >
> > For more details about these options:
> > Use $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar -info
> >
> >
> >
> > I think -jobconf is not used in v.0.19 .
> >
> > 2009/2/18 S D 
> >
> > > I'm having trouble overriding the maximum number of map tasks that run
> on
> > a
> > > given machine in my cluster. The default value of
> > > mapred.tasktracker.map.tasks.maximum is set to 2 in hadoop-default.xml.
> > > When
> > > running my job I passed
> > >
> > > -jobconf mapred.tasktracker.map.tasks.maximum=1
> > >
> > > to limit map tasks to one per machine but each machine was still
> > allocated
> > > 2
> > > map tasks (simultaneously).  The only way I was able to guarantee a
> > maximum
> > > of one map task per machine was to change the value of the property in
> > > hadoop-site.xml. This is unsatisfactory since I'll often be changing
> the
> > > maximum on a per job basis. Any hints?
> > >
> > > On a different note, when I attempt to pass params via -D I get a usage
> > > message; when I use -jobconf the command goes through (and works in the
> > > case
> > > of mapred.reduce.tasks=0 for example) but I get  a deprecation
> warning).
> > >
> > > Thanks,
> > > John
> > >
> >
> >
> >
> > --
> > M. Raşit ÖZDAŞ
> >
>



-- 
M. Raşit ÖZDAŞ


Re: Overriding mapred.tasktracker.map.tasks.maximum with -jobconf

2009-02-18 Thread Rasit OZDAS
John, did you try the -D option instead of -jobconf?

I had the -D option in my code; I replaced it with -jobconf, and this is what
I get:

...
...
Options:
  -input    <path>     DFS input file(s) for the Map step
  -output   <path>     DFS output directory for the Reduce step
  -mapper   <cmd|JavaClassName>      The streaming command to run
  -combiner <JavaClassName> Combiner has to be a Java class
  -reducer  <cmd|JavaClassName>      The streaming command to run
  -file     <file>     File/dir to be shipped in the Job jar file
  -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
  -outputformat TextOutputFormat(default)|JavaClassName  Optional.
  -partitioner JavaClassName  Optional.
  -numReduceTasks <num>  Optional.
  -inputreader <spec>  Optional.
  -cmdenv   <n>=<v>    Optional. Pass env.var to streaming commands
  -mapdebug <path>  Optional. To run this script when a map task fails
  -reducedebug <path>  Optional. To run this script when a reduce task fails

  -verbose

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to
be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files
to include in the classpath.
-archives <comma separated list of archives>    specify comma separated
archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

For more details about these options:
Use $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar -info



I think -jobconf is deprecated in v0.19.

2009/2/18 S D 

> I'm having trouble overriding the maximum number of map tasks that run on a
> given machine in my cluster. The default value of
> mapred.tasktracker.map.tasks.maximum is set to 2 in hadoop-default.xml.
> When
> running my job I passed
>
> -jobconf mapred.tasktracker.map.tasks.maximum=1
>
> to limit map tasks to one per machine but each machine was still allocated
> 2
> map tasks (simultaneously).  The only way I was able to guarantee a maximum
> of one map task per machine was to change the value of the property in
> hadoop-site.xml. This is unsatisfactory since I'll often be changing the
> maximum on a per job basis. Any hints?
>
> On a different note, when I attempt to pass params via -D I get a usage
> message; when I use -jobconf the command goes through (and works in the
> case
> of mapred.reduce.tasks=0 for example) but I get  a deprecation warning).
>
> Thanks,
> John
>



-- 
M. Raşit ÖZDAŞ


Re: GenericOptionsParser warning

2009-02-18 Thread Rasit OZDAS
Hi,
There is a JIRA issue about this problem, if I understand it correctly:
https://issues.apache.org/jira/browse/HADOOP-3743

Strangely, I searched all the source code, but this check exists in only
2 places:

if (!(job.getBoolean("mapred.used.genericoptionsparser", false))) {
  LOG.warn("Use GenericOptionsParser for parsing the arguments. " +
   "Applications should implement Tool for the same.");
}

Just an if block for logging, no extra controls.
Am I missing something?

If your class implements Tool, then there shouldn't be a warning.
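
For reference, a minimal Tool skeleton that avoids the warning (class name
and job setup are placeholders, not a complete job):

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyJob extends Configured implements Tool {

      public int run(String[] args) throws Exception {
        // getConf() already contains whatever -D / -conf options were parsed
        JobConf conf = new JobConf(getConf(), MyJob.class);
        conf.setJobName("my-job");
        // ... set mapper, reducer, input and output paths here ...
        JobClient.runJob(conf);
        return 0;
      }

      public static void main(String[] args) throws Exception {
        // ToolRunner runs GenericOptionsParser on the arguments for us
        System.exit(ToolRunner.run(new MyJob(), args));
      }
    }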

Cheers,
Rasit

2009/2/18 Steve Loughran 

> Sandhya E wrote:
>
>> Hi All
>>
>> I prepare my JobConf object in a java class, by calling various set
>> apis in JobConf object. When I submit the jobconf object using
>> JobClient.runJob(conf), I'm seeing the warning:
>> "Use GenericOptionsParser for parsing the arguments. Applications
>> should implement Tool for the same". From hadoop sources it looks like
>> setting "mapred.used.genericoptionsparser" will prevent this warning.
>> But if I set this flag to true, will it have some other side effects.
>>
>> Thanks
>> Sandhya
>>
>
> Seen this message too -and it annoys me; not tracked it down
>



-- 
M. Raşit ÖZDAŞ


Re: Allowing other system users to use Haddoop

2009-02-18 Thread Rasit OZDAS
Nicholas, like Matei said,
there are 2 possibilities in terms of permissions:

(the permission commands work just like in Linux)

1. Create a directory for each user and make that user the owner of the
directory: hadoop dfs -chown ... (assuming hadoop doesn't need write access
to any file outside the user's home directory)
2. Change the group ownership of all files in HDFS to a group that every
user belongs to (hadoop dfs -chgrp -R <group> /), then give that group write
access to all files (hadoop dfs -chmod -R g+w /). (Here any user can run
jobs, and hadoop automatically creates a separate home directory for each.)
This way is better for a development environment, I think.
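
As a concrete sketch of option 1 (the user name is a placeholder), run
something like this as the user that started the cluster:

    bin/hadoop dfs -mkdir /user/nicholas
    bin/hadoop dfs -chown nicholas /user/nicholas
    # or, to set a group at the same time:
    bin/hadoop dfs -chown nicholas:nicholas /user/nicholas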

Cheers,
Rasit

2009/2/18 Matei Zaharia 

> Other users should be able to submit jobs using the same commands
> (bin/hadoop ...). Are there errors you ran into?
> One thing is that you'll need to grant them permissions over any files in
> HDFS that you want them to read. You can do it using bin/hadoop fs -chmod,
> which works like chmod on Linux. You may need to run this as the root user
> (sudo bin/hadoop fs -chmod). Also, I don't remember exactly, but you may
> need to create home directories for them in HDFS as well (again create them
> as root, and then sudo bin/hadoop fs -chown them).
>
> On Tue, Feb 17, 2009 at 10:48 AM, Nicholas Loulloudes <
> loulloude...@cs.ucy.ac.cy> wrote:
>
> > Hi all,
> >
> > I just installed Hadoop (Single Node) on a Linux Ubuntu distribution as
> > per the instructions found in the following website:
> >
> >
> >
> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)
> >
> > I followed the instructions of the website to create a "hadoop" system
> > user and group and i was able to run a Map Reduce job successfully.
> >
> > What i want to do now is to create more system users which will be able
> > to use Hadoop for running Map Reduce jobs.
> >
> > Is there any guide on how to achieve these??
> >
> > Any suggestions will be highly appreciated.
> >
> > Thanks in advance,
> >
> > --
> > _
> >
> > Nicholas Loulloudes
> > High Performance Computing Systems Laboratory (HPCL)
> > University of Cyprus,
> > Nicosia, Cyprus
> >
> >
> >
> >
> >
>



-- 
M. Raşit ÖZDAŞ


Re: HDFS bytes read job counters?

2009-02-17 Thread Rasit OZDAS
Nathan, if you're using BytesWritable, I've heard that it doesn't return
only the valid bytes; it can actually return more than that (the backing
buffer is larger than the data).
Here is where this issue is discussed:
http://www.nabble.com/can%27t-read-the-SequenceFile-correctly-td21866960.html
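
If that turns out to be the case, a small sketch of trimming a BytesWritable
down to its valid bytes (the accessor is getLength() on recent releases,
getSize() on older ones):

    // Returns only the valid bytes; the backing array can be longer than the data.
    static byte[] validBytes(BytesWritable bw) {
      int len = bw.getLength();                       // or bw.getSize() on older versions
      byte[] valid = new byte[len];
      System.arraycopy(bw.getBytes(), 0, valid, 0, len);
      return valid;
    }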

Cheers,
Rasit



2009/2/18 Nathan Marz 

> Hello,
>
> I'm seeing very odd numbers from the HDFS job tracker page. I have a job
> that operates over approximately 200 GB of data (209715200047 bytes to be
> exact), and HDFS bytes read is 2,103,170,802,501 (2 TB).
>
> The "Map input bytes" is set to "209,714,811,510", which is a correct
> number.
>
> The job only took 10 minutes to run, so there's no way that that much data
> was actually read. Anyone have any idea of what's going on here?
>
> Thanks,
> Nathan Marz
>



-- 
M. Raşit ÖZDAŞ


Re: AlredyBeingCreatedExceptions after upgrade to 0.19.0

2009-02-17 Thread Rasit OZDAS
:D

Then I found out that there are 3 similar issues about this problem :D
Quite useful information, isn't it? ;)


2009/2/17 Thibaut_ 

>
> Hello Rasi,
>
> https://issues.apache.org/jira/browse/HADOOP-5268 is my bug report.
>
> Thibaut
>
> --
> View this message in context:
> http://www.nabble.com/AlredyBeingCreatedExceptions-after-upgrade-to-0.19.0-tp21631077p22060926.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
M. Raşit ÖZDAŞ


Re: Can never restart HDFS after a day or two

2009-02-17 Thread Rasit OZDAS
I agree with Amandeep; that way the data will remain until you manually
delete it. (/tmp is typically cleaned by the operating system, which is most
likely why the filesystem has to be reformatted the next day.)

If we are on the right track,
changing the hadoop.tmp.dir property to point outside of /tmp, or changing
dfs.name.dir and dfs.data.dir, should be enough for basic use (I didn't have
to change anything else).
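
For example, a hadoop-site.xml fragment along these lines (paths are
placeholders):

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/home/hadoop/hadoop-datastore/hadoop-${user.name}</value>
    </property>
    <!-- or set the storage directories explicitly: -->
    <property>
      <name>dfs.name.dir</name>
      <value>/home/hadoop/dfs/name</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>/home/hadoop/dfs/data</value>
    </property>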

Cheers,
Rasit

2009/2/17 Amandeep Khurana 

> Where are your namenode and datanode storing the data? By default, it goes
> into the /tmp directory. You might want to move that out of there.
>
> Amandeep
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Mon, Feb 16, 2009 at 8:11 PM, Mark Kerzner 
> wrote:
>
> > Hi all,
> >
> > I consistently have this problem that I can run HDFS and restart it after
> > short breaks of a few hours, but the next day I always have to reformat
> > HDFS
> > before the daemons begin to work.
> >
> > Is that normal? Maybe this is treated as temporary data, and the results
> > need to be copied out of HDFS and not stored for long periods of time? I
> > verified that the files in /tmp related to hadoop are seemingly intact.
> >
> > Thank you,
> > Mark
> >
>



-- 
M. Raşit ÖZDAŞ


Re: AlredyBeingCreatedExceptions after upgrade to 0.19.0

2009-02-17 Thread Rasit OZDAS
Stefan and Thibaut, are you using MultipleOutputFormat, and how many
reducers do you have?
If you're using MultipleOutputFormat and have no reducers, there is a JIRA
ticket about this issue.
https://issues.apache.org/jira/browse/HADOOP-5268

Or there is a different JIRA issue (it's not resolved yet, but gives some
underlying info)
https://issues.apache.org/jira/browse/HADOOP-4264

Or this issue (not resolved):
https://issues.apache.org/jira/browse/HADOOP-1583

Rasit

2009/2/16 Thibaut_ 

>
> I have the same problem.
>
> is there any solution to this?
>
> Thibaut
>
>
> --
> View this message in context:
> http://www.nabble.com/AlredyBeingCreatedExceptions-after-upgrade-to-0.19.0-tp21631077p22043484.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
M. Raşit ÖZDAŞ


Re: datanode not being started

2009-02-16 Thread Rasit OZDAS
Sandy, I have no idea about your issue :(

Zander,
Your problem is probably about this JIRA issue:
http://issues.apache.org/jira/browse/HADOOP-1212

Here are 2 workarounds explained:
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)#java.io.IOException:_Incompatible_namespaceIDs

I haven't tried it, hope it helps.
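
The simpler (but destructive - it wipes that datanode's blocks) of the two
workarounds is roughly this, on the affected slave, using the dfs.data.dir
path from your log:

    # with the cluster stopped
    rm -rf /usr/local/hadoop-datastore/hadoop-hadoop/dfs/data
    # then restart; the datanode recreates the directory with the right namespaceID
    bin/start-all.sh

The non-destructive alternative described on that page is to edit the
namespaceID value in <dfs.data.dir>/current/VERSION on the slave so that it
matches the namenode's.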
Rasit

2009/2/17 zander1013 :
>
> hi,
>
> i am not seeing the DataNode run either. but i am seeing an extra process
> TaskTracker run.
>
> here is what hapens when i start the cluster run jps and stop the cluster...
>
> had...@node0:/usr/local/hadoop$ bin/start-all.sh
> starting namenode, logging to
> /usr/local/hadoop/bin/../logs/hadoop-hadoop-namenode-node0.out
> node0.local: starting datanode, logging to
> /usr/local/hadoop/bin/../logs/hadoop-hadoop-datanode-node0.out
> node1.local: starting datanode, logging to
> /usr/local/hadoop/bin/../logs/hadoop-hadoop-datanode-node1.out
> node0.local: starting secondarynamenode, logging to
> /usr/local/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-node0.out
> starting jobtracker, logging to
> /usr/local/hadoop/bin/../logs/hadoop-hadoop-jobtracker-node0.out
> node0.local: starting tasktracker, logging to
> /usr/local/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node0.out
> node1.local: starting tasktracker, logging to
> /usr/local/hadoop/bin/../logs/hadoop-hadoop-tasktracker-node1.out
> had...@node0:/usr/local/hadoop$ jps
> 13353 TaskTracker
> 13126 SecondaryNameNode
> 12846 NameNode
> 13455 Jps
> 13232 JobTracker
> had...@node0:/usr/local/hadoop$ bin/stop-all.sh
> stopping jobtracker
> node0.local: stopping tasktracker
> node1.local: stopping tasktracker
> stopping namenode
> node0.local: no datanode to stop
> node1.local: no datanode to stop
> node0.local: stopping secondarynamenode
> had...@node0:/usr/local/hadoop$
>
> here is the tail of the log file for the session above...
> /
> 2009-02-16 19:35:13,999 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
> /
> STARTUP_MSG: Starting DataNode
> STARTUP_MSG:   host = node1/127.0.1.1
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 0.19.0
> STARTUP_MSG:   build =
> https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 713890;
> compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008
> /
> 2009-02-16 19:35:18,999 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException:
> Incompatible namespaceIDs in
> /usr/local/hadoop-datastore/hadoop-hadoop/dfs/data: namenode namespaceID =
> 1050914495; datanode namespaceID = 722953254
>at
> org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233)
>at
> org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148)
>at
> org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:287)
>at
> org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:205)
>at
> org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1199)
>at
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1154)
>at
> org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1162)
>at
> org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1284)
>
> 2009-02-16 19:35:19,000 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down DataNode at node1/127.0.1.1
> /
>
> i have not seen DataNode run yet. i have only started and stopped the
> cluster a couple of times.
>
> i tried to reformat datanode and namenode with bin/hadoop datanode -format
> and bin/hadoop namenode -format from /usr/local/hadoop dir.
>
> please advise
>
> zander
>
>
>
> Mithila Nagendra wrote:
>>
>> Hey Sandy
>> I had a similar problem with Hadoop. All I did was I stopped all the
>> daemons
>> using stop-all.sh. Then formatted the namenode again using hadoop namenode
>> -format. After this I went on to restarting everything by using
>> start-all.sh
>>
>> I hope you dont have much data on the datanode, reformatting it would
>> erase
>> everything out.
>>
>> Hope this helps!
>> Mithila
>>
>>
>>
>> On Sat, Feb 14, 2009 at 2:39 AM, james warren  wrote:
>>
>>> Sandy -
>>>
>>> I suggest you take a look into your NameNode and DataNode logs.  From the
>>> information posted, these likely would be at
>>>
>>>
>>> /Users/hadoop/hadoop-0.18.2/bin/../logs/hadoop-hadoop-namenode-loteria.cs.tamu.edu.log
>>>
>>> /Users/hadoop/hadoop-0.18.2/bin/../logs/hadoop-hadoop-jobtracker-loteria.cs.tamu.edu.log
>>>
>>> If the cause isn't obvious from what you see there, could you please post
>>> the last few lines from each log?
>>>
>>> -

Re: Copying a file to specified nodes

2009-02-16 Thread Rasit OZDAS
Yes, I've tried the long solution;
when I execute ./hadoop dfs -put ... from a datanode,
one copy always gets written to that datanode.

But I think I would have to use SSH for this.
Does anybody know a better way?
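
What I have in mind is roughly this (host names and paths are made up, and it
assumes the file is reachable from that node, e.g. on a shared mount):

    ssh node1 "/usr/local/hadoop/bin/hadoop dfs -put /shared/userA_file1 /user/userA/file1"
    ssh node2 "/usr/local/hadoop/bin/hadoop dfs -put /shared/userA_file2 /user/userA/file2"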

Thanks,
Rasit

2009/2/16 Rasit OZDAS :
> Thanks, Jeff.
> After considering JIRA link you've given and making some investigation:
>
> It seems that this JIRA ticket didn't draw much attention, so will
> take much time to be considered.
> After some more investigation I found out that when I copy the file to
> HDFS from a specific DataNode, first copy will be written to that
> DataNode itself. This solution will take long to implement, I think.
> But we definitely need this feature, so if we have no other choice,
> we'll go though it.
>
> Any further info (or comments on my solution) is appreciated.
>
> Cheers,
> Rasit
>
> 2009/2/10 Jeff Hammerbacher :
>> Hey Rasit,
>>
>> I'm not sure I fully understand your description of the problem, but
>> you might want to check out the JIRA ticket for making the replica
>> placement algorithms in HDFS pluggable
>> (https://issues.apache.org/jira/browse/HADOOP-3799) and add your use
>> case there.
>>
>> Regards,
>> Jeff
>>
>> On Tue, Feb 10, 2009 at 5:05 AM, Rasit OZDAS  wrote:
>>>
>>> Hi,
>>>
>>> We have thousands of files, each dedicated to a user.  (Each user has
>>> access to other users' files, but they do this not very often.)
>>> Each user runs map-reduce jobs on the cluster.
>>> So we should seperate his/her files equally across the cluster,
>>> so that every machine can take part in the process (assuming he/she is
>>> the only user running jobs).
>>> For this we should initially copy files to specified nodes:
>>> User A :   first file : Node 1, second file: Node 2, .. etc.
>>> User B :   first file : Node 1, second file: Node 2, .. etc.
>>>
>>> I know, hadoop create also replicas, but in our solution at least one
>>> file will be in the right place
>>> (or we're willing to control other replicas too).
>>>
>>> Rebalancing is also not a problem, assuming it uses the information
>>> about how much a computer is in use.
>>> It even helps for a better organization of files.
>>>
>>> How can we copy files to specified nodes?
>>> Or do you have a better solution for us?
>>>
>>> I couldn't find a solution to this, probably such an option doesn't exist.
>>> But I wanted to take an expert's opinion about this.
>>>
>>> Thanks in advance..
>>> Rasit
>>
>
>
>
> --
> M. Raşit ÖZDAŞ
>



-- 
M. Raşit ÖZDAŞ


Re: datanode not being started

2009-02-16 Thread Rasit OZDAS
Sandy, as far as I remember, there were some threads about the same
problem (I don't know if it's solved). Searching the mailing list for
this error: "could only be replicated to 0 nodes, instead of 1" may
help.

Cheers,
Rasit

2009/2/16 Sandy :
> just some more information:
> hadoop fsck produces:
> Status: HEALTHY
>  Total size: 0 B
>  Total dirs: 9
>  Total files: 0 (Files currently being written: 1)
>  Total blocks (validated): 0
>  Minimally replicated blocks: 0
>  Over-replicated blocks: 0
>  Under-replicated blocks: 0
>  Mis-replicated blocks: 0
>  Default replication factor: 1
>  Average block replication: 0.0
>  Corrupt blocks: 0
>  Missing replicas: 0
>  Number of data-nodes: 0
>  Number of racks: 0
>
>
> The filesystem under path '/' is HEALTHY
>
> on the newly formatted hdfs.
>
> jps says:
> 4723 Jps
> 4527 NameNode
> 4653 JobTracker
>
>
> I can't copy files onto the dfs since I get "NotReplicatedYetExceptions",
> which I suspect has to do with the fact that there are no datanodes. My
> "cluster" is a single MacPro with 8 cores. I haven't had to do anything
> extra before in order to get datanodes to be generated.
>
> 09/02/15 15:56:27 WARN dfs.DFSClient: Error Recovery for block null bad
> datanode[0]
> copyFromLocal: Could not get block locations. Aborting...
>
>
> The corresponding error in the logs is:
>
> 2009-02-15 15:56:27,123 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 1 on 9000, call addBlock(/user/hadoop/input/.DS_Store,
> DFSClient_755366230) from 127.0.0.1:49796: error: java.io.IOException: File
> /user/hadoop/input/.DS_Store could only be replicated to 0 nodes, instead of
> 1
> java.io.IOException: File /user/hadoop/input/.DS_Store could only be
> replicated to 0 nodes, instead of 1
> at
> org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1120)
> at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:330)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
>
> On Sun, Feb 15, 2009 at 3:26 PM, Sandy  wrote:
>
>> Thanks for your responses.
>>
>> I checked in the namenode and jobtracker logs and both say:
>>
>> INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 9000, call
>> delete(/Users/hadoop/hadoop-0.18.2/hadoop-hadoop/mapred/system, true) from
>> 127.0.0.1:61086: error: org.apache.hadoop.dfs.SafeModeException: Cannot
>> delete /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/mapred/system. Name node
>> is in safe mode.
>> The ratio of reported blocks 0. has not reached the threshold 0.9990.
>> Safe mode will be turned off automatically.
>> org.apache.hadoop.dfs.SafeModeException: Cannot delete
>> /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/mapred/system. Name node is in
>> safe mode.
>> The ratio of reported blocks 0. has not reached the threshold 0.9990.
>> Safe mode will be turned off automatically.
>> at
>> org.apache.hadoop.dfs.FSNamesystem.deleteInternal(FSNamesystem.java:1505)
>> at
>> org.apache.hadoop.dfs.FSNamesystem.delete(FSNamesystem.java:1477)
>> at org.apache.hadoop.dfs.NameNode.delete(NameNode.java:425)
>> at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> at java.lang.reflect.Method.invoke(Method.java:597)
>> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
>> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)
>>
>>
>> I think this is a continuation of my running problem. The nodes stay in
>> safe mode, but won't come out, even after several minutes. I believe this is
>> due to the fact that it keep trying to contact a datanode that does not
>> exist. Any suggestions on what I can do?
>>
>> I have recently tried to reformat the hdfs, using bin/hadoop namenode
>> -format. From the output directed to standard out, I thought this completed
>> correctly:
>>
>> Re-format filesystem in /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/dfs/name
>> ? (Y or N) Y
>> 09/02/15 15:16:39 INFO fs.FSNamesystem:
>> fsOwner=hadoop,staff,_lpadmin,com.apple.sharepoint.group.8,com.apple.sharepoint.group.3,com.apple.sharepoint.group.4,com.apple.sharepoint.group.2,com.apple.sharepoint.group.6,com.apple.sharepoint.group.9,com.apple.sharepoint.group.1,com.apple.sharepoint.group.5
>> 09/02/15 15:16:39 INFO fs.FSNamesystem: supergroup=supergroup
>> 09/02/15 15:16:39 INFO fs.FSNamesystem: isPermissionEnabled=true
>> 09/02/15 15:16:39 INFO dfs.Storage: Image file of size 80 saved in 0
>> seconds.
>> 09/02/15 15:16:39 INFO dfs.Storage: Storage directory
>> /Users/hadoop/hadoop-0.18.2/hadoop-hadoop/dfs/name has been

Re: Copying a file to specified nodes

2009-02-16 Thread Rasit OZDAS
Thanks, Jeff.
After considering the JIRA link you've given and doing some investigation:

It seems that this JIRA ticket didn't draw much attention, so it will
take a long time to be considered.
After some more investigation I found out that when I copy the file to
HDFS from a specific DataNode, the first copy will be written to that
DataNode itself. This solution will take a while to implement, I think.
But we definitely need this feature, so if we have no other choice,
we'll go through it.

Any further info (or comments on my solution) is appreciated.

Cheers,
Rasit

2009/2/10 Jeff Hammerbacher :
> Hey Rasit,
>
> I'm not sure I fully understand your description of the problem, but
> you might want to check out the JIRA ticket for making the replica
> placement algorithms in HDFS pluggable
> (https://issues.apache.org/jira/browse/HADOOP-3799) and add your use
> case there.
>
> Regards,
> Jeff
>
> On Tue, Feb 10, 2009 at 5:05 AM, Rasit OZDAS  wrote:
>>
>> Hi,
>>
>> We have thousands of files, each dedicated to a user.  (Each user has
>> access to other users' files, but they do this not very often.)
>> Each user runs map-reduce jobs on the cluster.
>> So we should seperate his/her files equally across the cluster,
>> so that every machine can take part in the process (assuming he/she is
>> the only user running jobs).
>> For this we should initially copy files to specified nodes:
>> User A :   first file : Node 1, second file: Node 2, .. etc.
>> User B :   first file : Node 1, second file: Node 2, .. etc.
>>
>> I know, hadoop create also replicas, but in our solution at least one
>> file will be in the right place
>> (or we're willing to control other replicas too).
>>
>> Rebalancing is also not a problem, assuming it uses the information
>> about how much a computer is in use.
>> It even helps for a better organization of files.
>>
>> How can we copy files to specified nodes?
>> Or do you have a better solution for us?
>>
>> I couldn't find a solution to this, probably such an option doesn't exist.
>> But I wanted to take an expert's opinion about this.
>>
>> Thanks in advance..
>> Rasit
>



-- 
M. Raşit ÖZDAŞ


Re: HDFS architecture based on GFS?

2009-02-15 Thread Rasit OZDAS
"If there was a
malicious process though, then I imagine it could talk to a datanode
directly and request a specific block."

I didn't understand the usage of "malicious" here,
but any process using the HDFS API should first ask the NameNode where the
file's block replicas are.
Then - I assume - the NameNode returns the IP of the best DataNode (or all of
them), and then a call to that specific DataNode is made.
Please correct me if I'm wrong.
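
In API terms, a plain client read looks like this (the path is made up); the
open() call goes to the NameNode for block locations, and the actual read()
pulls the bytes from a DataNode:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadDemo {                      // hypothetical class name
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);    // talks to the NameNode
        FSDataInputStream in = fs.open(new Path("/user/demo/part-00000"));
        byte[] buf = new byte[4096];
        int n = in.read(buf);                    // data is streamed from a DataNode
        System.out.println("read " + n + " bytes");
        in.close();
      }
    }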

Cheers,
Rasit

2009/2/16 Matei Zaharia :
> In general, yeah, the scripts can access any resource they want (within the
> permissions of the user that the task runs as). It's also possible to access
> HDFS from scripts because HDFS provides a FUSE interface that can make it
> look like a regular file system on the machine. (The FUSE module in turn
> talks to the namenode as a regular HDFS client.)
>
> On Sun, Feb 15, 2009 at 8:43 PM, Amandeep Khurana  wrote:
>
>> I dont know much about Hadoop streaming and have a quick question here.
>>
>> The snippets of code/programs that you attach into the map reduce job might
>> want to access outside resources (like you mentioned). Now these might not
>> need to go to the namenode right? For example a python script. How would it
>> access the data? Would it ask the parent java process (in the tasktracker)
>> to get the data or would it go and do stuff on its own?
>>
>>
>> Amandeep Khurana
>> Computer Science Graduate Student
>> University of California, Santa Cruz
>>
>>
>> On Sun, Feb 15, 2009 at 8:23 PM, Matei Zaharia  wrote:
>>
>> > Nope, typically the JobTracker just starts the process, and the
>> tasktracker
>> > talks directly to the namenode to get a pointer to the datanode, and then
>> > directly to the datanode.
>> >
>> > On Sun, Feb 15, 2009 at 8:07 PM, Amandeep Khurana 
>> > wrote:
>> >
>> > > Alright.. Got it.
>> > >
>> > > Now, do the task trackers talk to the namenode and the data node
>> directly
>> > > or
>> > > do they go through the job tracker for it? So, if my code is such that
>> I
>> > > need to access more files from the hdfs, would the job tracker get
>> > involved
>> > > or not?
>> > >
>> > >
>> > >
>> > >
>> > > Amandeep Khurana
>> > > Computer Science Graduate Student
>> > > University of California, Santa Cruz
>> > >
>> > >
>> > > On Sun, Feb 15, 2009 at 7:20 PM, Matei Zaharia 
>> > wrote:
>> > >
>> > > > Normally, HDFS files are accessed through the namenode. If there was
>> a
>> > > > malicious process though, then I imagine it could talk to a datanode
>> > > > directly and request a specific block.
>> > > >
>> > > > On Sun, Feb 15, 2009 at 7:15 PM, Amandeep Khurana 
>> > > > wrote:
>> > > >
>> > > > > Ok. Got it.
>> > > > >
>> > > > > Now, when my job needs to access another file, does it go to the
>> > > Namenode
>> > > > > to
>> > > > > get the block ids? How does the java process know where the files
>> are
>> > > and
>> > > > > how to access them?
>> > > > >
>> > > > >
>> > > > > Amandeep Khurana
>> > > > > Computer Science Graduate Student
>> > > > > University of California, Santa Cruz
>> > > > >
>> > > > >
>> > > > > On Sun, Feb 15, 2009 at 7:05 PM, Matei Zaharia > >
>> > > > wrote:
>> > > > >
>> > > > > > I mentioned this case because even jobs written in Java can use
>> the
>> > > > HDFS
>> > > > > > API
>> > > > > > to talk to the NameNode and access the filesystem. People often
>> do
>> > > this
>> > > > > > because their job needs to read a config file, some small data
>> > table,
>> > > > etc
>> > > > > > and use this information in its map or reduce functions. In this
>> > > case,
>> > > > > you
>> > > > > > open the second file separately in your mapper's init function
>> and
>> > > read
>> > > > > > whatever you need from it. In general I wanted to point out that
>> > you
>> > > > > can't
>> > > > > > know which files a job will access unless you look at its source
>> > code
>> > > > or
>> > > > > > monitor the calls it makes; the input file(s) you provide in the
>> > job
>> > > > > > description are a hint to the MapReduce framework to place your
>> job
>> > > on
>> > > > > > certain nodes, but it's reasonable for the job to access other
>> > files
>> > > as
>> > > > > > well.
>> > > > > >
>> > > > > > On Sun, Feb 15, 2009 at 6:14 PM, Amandeep Khurana <
>> > ama...@gmail.com>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Another question that I have here - When the jobs run arbitrary
>> > > code
>> > > > > and
>> > > > > > > access data from the HDFS, do they go to the namenode to get
>> the
>> > > > block
>> > > > > > > information?
>> > > > > > >
>> > > > > > >
>> > > > > > > Amandeep Khurana
>> > > > > > > Computer Science Graduate Student
>> > > > > > > University of California, Santa Cruz
>> > > > > > >
>> > > > > > >
>> > > > > > > On Sun, Feb 15, 2009 at 6:00 PM, Amandeep Khurana <
>> > > ama...@gmail.com>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Assuming that the job is purely in Java and not involving
>> > > streaming
>> > > > > or
>> > > > > > > > pipes, wouldnt the resources (files) required by 

Re: Hadoop setup questions

2009-02-13 Thread Rasit OZDAS
With this configuration, any user belonging to that group will be able
to write to any location.
(I've only tried this on a local network, though.)


2009/2/14 Rasit OZDAS :
> I agree with Amar and James,
>
> if you require permissions for your project,
> then
> 1. create a group in linux for your user.
> 2. give group write access to all files in HDFS. (hadoop dfs -chmod -R
> g+w /  - or sth, I'm not totally sure.)
> 3. change group ownership of all files in HDFS. (hadoop dfs -chgrp -R
> <group> /  - I'm not totally sure again..)
>
> cheers,
> Rasit
>
>
> 2009/2/12 james warren :
>> Like Amar said.  Try adding
>>
>> <property>
>>   <name>dfs.permissions</name>
>>   <value>false</value>
>> </property>
>>
>>
>> to your conf/hadoop-site.xml file (or flip the value in hadoop-default.xml),
>> restart your daemons and give it a whirl.
>>
>> cheers,
>> -jw
>>
>> On Wed, Feb 11, 2009 at 8:44 PM, Amar Kamat  wrote:
>>
>>> bjday wrote:
>>>
>>>> Good morning everyone,
>>>>
>>>> I have a question about correct setup for hadoop.  I have 14 Dell
>>>> computers in a lab.   Each connected to the internet and each independent 
>>>> of
>>>> each other.  All run CentOS.  Logins are handled by NIS.  If userA logs 
>>>> into
>>>> the master and starts the daemons and UserB logs into the master and wants
>>>> to run a job while the daemons from UserA are still running the following
>>>> error occurs:
>>>>
>>>> copyFromLocal: org.apache.hadoop.security.AccessControlException:
>>>> Permission denied: user=UserB, access=WRITE,
>>>> inode="user":UserA:supergroup:rwxr-xr-x
>>>>
>>> Looks like one of your files (input or output) is of different user. Seems
>>> like your DFS has permissions enabled. If you dont require permissions then
>>> disable it else make sure that the input/output paths are under your
>>> permission (/user/userB is the hone directory for userB).
>>> Amar
>>>
>>>
>>>> what needs to be changed to allow UserB-UserZ to run their jobs?  Does
>>>> there need to be a local user the everyone logs into as and run from there?
>>>>  Should Hadoop be ran in an actual cluster instead of independent 
>>>> computers?
>>>>  Any ideas what is the correct configuration settings that allow it?
>>>>
>>>> I followed Ravi Phulari suggestions and followed:
>>>>
>>>> http://hadoop.apache.org/core/docs/current/quickstart.html
>>>>
>>>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)<
>>>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29>
>>>>
>>>>
>>>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)<
>>>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29>
>>>>
>>>>
>>>> These allowed me to get Hadoop running on the 14 computers when I login
>>>> and everything works fine, thank you Ravi.  The problem occurs when
>>>> additional people attempt to run jobs simultaneously.
>>>>
>>>> Thank you,
>>>>
>>>> Brian
>>>>
>>>>
>>>
>>
>
>
>
> --
> M. Raşit ÖZDAŞ
>



-- 
M. Raşit ÖZDAŞ


Re: Hadoop setup questions

2009-02-13 Thread Rasit OZDAS
I agree with Amar and James,

if you require permissions for your project,
then
1. create a group in linux for your user.
2. give group write access to all files in HDFS. (hadoop dfs -chmod -R
g+w /  - or sth, I'm not totally sure.)
3. change group ownership of all files in HDFS. (hadoop dfs -chgrp -R
<group> /   - I'm not totally sure again..)

cheers,
Rasit


2009/2/12 james warren :
> Like Amar said.  Try adding
>
> <property>
>   <name>dfs.permissions</name>
>   <value>false</value>
> </property>
>
>
> to your conf/hadoop-site.xml file (or flip the value in hadoop-default.xml),
> restart your daemons and give it a whirl.
>
> cheers,
> -jw
>
> On Wed, Feb 11, 2009 at 8:44 PM, Amar Kamat  wrote:
>
>> bjday wrote:
>>
>>> Good morning everyone,
>>>
>>> I have a question about correct setup for hadoop.  I have 14 Dell
>>> computers in a lab.   Each connected to the internet and each independent of
>>> each other.  All run CentOS.  Logins are handled by NIS.  If userA logs into
>>> the master and starts the daemons and UserB logs into the master and wants
>>> to run a job while the daemons from UserA are still running the following
>>> error occurs:
>>>
>>> copyFromLocal: org.apache.hadoop.security.AccessControlException:
>>> Permission denied: user=UserB, access=WRITE,
>>> inode="user":UserA:supergroup:rwxr-xr-x
>>>
>> Looks like one of your files (input or output) is of different user. Seems
>> like your DFS has permissions enabled. If you dont require permissions then
>> disable it else make sure that the input/output paths are under your
>> permission (/user/userB is the hone directory for userB).
>> Amar
>>
>>
>>> what needs to be changed to allow UserB-UserZ to run their jobs?  Does
>>> there need to be a local user the everyone logs into as and run from there?
>>>  Should Hadoop be ran in an actual cluster instead of independent computers?
>>>  Any ideas what is the correct configuration settings that allow it?
>>>
>>> I followed Ravi Phulari suggestions and followed:
>>>
>>> http://hadoop.apache.org/core/docs/current/quickstart.html
>>>
>>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)<
>>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29>
>>>
>>>
>>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)<
>>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29>
>>>
>>>
>>> These allowed me to get Hadoop running on the 14 computers when I login
>>> and everything works fine, thank you Ravi.  The problem occurs when
>>> additional people attempt to run jobs simultaneously.
>>>
>>> Thank you,
>>>
>>> Brian
>>>
>>>
>>
>



-- 
M. Raşit ÖZDAŞ


Re: Running Map and Reduce Sequentially

2009-02-13 Thread Rasit OZDAS
Kris,
This is the case when you have only 1 reducer, so you could try setting the
number of reducers to 1 - if that doesn't have any side effects for you.

Rasit


2009/2/14 Kris Jirapinyo :
> Is there a way to tell Hadoop to not run Map and Reduce concurrently?  I'm
> running into a problem where I set the jvm to Xmx768 and it seems like 2
> mappers and 2 reducers are running on each machine that only has 1.7GB of
> ram, so it complains of not being able to allocate memory...(which makes
> sense since 4x768mb > 1.7GB).  So, if it would just finish the Map and then
> start on Reduce, then there would be 2 jvm's running on one machine at any
> given time and thus possibly avoid this out of memory error.
>



-- 
M. Raşit ÖZDAŞ


Re: Best practices on spliltting an input line?

2009-02-12 Thread Rasit OZDAS
Hi, Andy

Your problem seems to be a general Java problem, rather than hadoop.
In a java forum you may get better help.
String.split uses regular expressions, which you definitely don't need.
I would write my own split function, without regular expressions.

This link may help to better understand underlying operations:
http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter10/stringBufferToken.html#split

Also, there is a constructor of StringTokenizer that returns the delimiters
as well:
StringTokenizer(String str, String delim, boolean returnDelims);
(I would write my own, though.)
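
A simple sketch of such a split - no regular expressions, and it keeps empty
tokens, so split("a\t\tb", '\t') gives "a", "", "b":

    import java.util.ArrayList;
    import java.util.List;

    public class SimpleSplitter {
      // Splits on a single delimiter character and keeps empty tokens.
      public static List<String> split(String line, char delim) {
        List<String> tokens = new ArrayList<String>();
        int start = 0;
        for (int i = 0; i < line.length(); i++) {
          if (line.charAt(i) == delim) {
            tokens.add(line.substring(start, i));
            start = i + 1;
          }
        }
        tokens.add(line.substring(start));   // last (possibly empty) token
        return tokens;
      }
    }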

Rasit

2009/2/10 Andy Sautins :
>
>
>   I have question.  I've dabbled with different ways of tokenizing an
> input file line for processing.  I've noticed in my somewhat limited
> tests that there seem to be some pretty reasonable performance
> differences between different tokenizing methods.  For example, roughly
> it seems to split a line on tokens ( tab delimited in my case ) that
> Scanner is the slowest, followed by String.split and StringTokenizer
> being the fastest.  StringTokenizer, for my application, has the
> unfortunate characteristic of not returning blank tokens ( i.e., parsing
> "a,b,c,,d" would return "a","b","c","d" instead of "a","b","c","","d").
> The WordCount example uses StringTokenizer which makes sense to me,
> except I'm currently getting hung up on not returning blank tokens.  I
> did run across the com.Ostermiller.util StringTokenizer replacement that
> handles null/blank tokens
> (http://ostermiller.org/utils/StringTokenizer.html ) which seems
> possible to use, but it sure seems like someone else has solved this
> problem already better than I have.
>
>
>
>   So, my question is, is there a "best practice" for splitting an input
> line especially when NULL tokens are expected ( i.e., two consecutive
> delimiter characters )?
>
>
>
>   Any thoughts would be appreciated
>
>
>
>   Thanks
>
>
>
>   Andy
>
>



-- 
M. Raşit ÖZDAŞ


Re: what's going on :( ?

2009-02-11 Thread Rasit OZDAS
Hi, Mark

Try to add an extra property to that file, and try to examine if
hadoop recognizes it.
This way you can find out if hadoop uses your configuration file.
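
For instance (the property name here is made up just for the test):

    <property>
      <name>my.test.marker</name>
      <value>hello</value>
    </property>

Then log conf.get("my.test.marker") somewhere in your code; if it comes back
null, that hadoop-site.xml is not the one being read.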

2009/2/10 Jeff Hammerbacher :
> Hey Mark,
>
> In NameNode.java, the DEFAULT_PORT specified for NameNode RPC is 8020.
> From my understanding of the code, your fs.default.name setting should
> have overridden this port to be 9000. It appears your Hadoop
> installation has not picked up the configuration settings
> appropriately. You might want to see if you have any Hadoop processes
> running and terminate them (bin/stop-all.sh should help) and then
> restart your cluster with the new configuration to see if that helps.
>
> Later,
> Jeff
>
> On Mon, Feb 9, 2009 at 9:48 PM, Amar Kamat  wrote:
>> Mark Kerzner wrote:
>>>
>>> Hi,
>>> Hi,
>>>
>>> why is hadoop suddenly telling me
>>>
>>>  Retrying connect to server: localhost/127.0.0.1:8020
>>>
>>> with this configuration
>>>
>>> <configuration>
>>>  <property>
>>>    <name>fs.default.name</name>
>>>    <value>hdfs://localhost:9000</value>
>>>  </property>
>>>  <property>
>>>    <name>mapred.job.tracker</name>
>>>    <value>localhost:9001</value>
>>>
>>
>> Shouldnt this be
>>
>> hdfs://localhost:9001
>>
>> Amar
>>>
>>>  </property>
>>>  <property>
>>>    <name>dfs.replication</name>
>>>    <value>1</value>
>>>  </property>
>>> </configuration>
>>>
>>> and both this http://localhost:50070/dfshealth.jsp and this
>>> http://localhost:50030/jobtracker.jsp links work fine?
>>>
>>> Thank you,
>>> Mark
>>>
>>>
>>
>>
>



-- 
M. Raşit ÖZDAŞ


Re: Loading native libraries

2009-02-11 Thread Rasit OZDAS
I have the same problem as well.
It would be wonderful if someone had some info about this.

Rasit

2009/2/10 Mimi Sun :
> I see UnsatisfiedLinkError.  Also I'm calling
>  System.getProperty("java.library.path") in the reducer and logging it. The
> only thing that prints out is
> ...hadoop-0.18.2/bin/../lib/native/Mac_OS_X-i386-32
> I'm using Cascading, not sure if that affects anything.
>
> - Mimi
>
> On Feb 10, 2009, at 11:40 AM, Arun C Murthy wrote:
>
>>
>> On Feb 10, 2009, at 11:06 AM, Mimi Sun wrote:
>>
>>> Hi,
>>>
>>> I'm new to Hadoop and I'm wondering what the recommended method is for
>>> using native libraries in mapred jobs.
>>> I've tried the following separately:
>>> 1. set LD_LIBRARY_PATH in .bashrc
>>> 2. set LD_LIBRARY_PATH and  JAVA_LIBRARY_PATH in hadoop-env.sh
>>> 3. set -Djava.library.path=... for mapred.child.java.opts
>>
>> For what you are trying (i.e. given that the JNI libs are present on all
>> machines at a constant path) setting -Djava.library.path for the child task
>> via mapred.child.java.opts should work. What are you seeing?
>>
>> Arun
>>
>>>
>>> 4. change bin/hadoop to include  $LD_LIBRARY_PATH in addition to the path
>>> it generates:  HADOOP_OPTS="$HADOOP_OPTS
>>> -Djava.library.path=$LD_LIBRARY_PATH:$JAVA_LIBRARY_PATH"
>>> 5. drop the .so files I need into hadoop/lib/native/...
>>>
>>> 1~3 didn't work, 4 and 5 did but seem to be hacks. I also read that I can
>>> do this using DistributedCache, but that seems to be extra work for loading
>>> libraries that are already present on each machine. (I'm using the JNI libs
>>> for berkeley db).
>>> It seems that there should be a way to configure java.library.path for
>>> the mapred jobs.  Perhaps bin/hadoop should make use of LD_LIBRARY_PATH?
>>>
>>> Thanks,
>>> - Mimi
>>
>
>



-- 
M. Raşit ÖZDAŞ


Re: stable version

2009-02-11 Thread Rasit OZDAS
Yes, version 0.18.3 is the most stable one. It has the added patches,
without unproven new functionality.

2009/2/11 Owen O'Malley :
>
> On Feb 10, 2009, at 7:21 PM, Vadim Zaliva wrote:
>
>> Maybe version 0.18
>> is better suited for production environment?
>
> Yahoo is mostly on 0.18.3 + some patches at this point.
>
> -- Owen
>



-- 
M. Raşit ÖZDAŞ


Copying a file to specified nodes

2009-02-10 Thread Rasit OZDAS
Hi,

We have thousands of files, each dedicated to a user.  (Each user has
access to other users' files, but they do this not very often.)
Each user runs map-reduce jobs on the cluster.
So we should separate his/her files equally across the cluster,
so that every machine can take part in the process (assuming he/she is
the only user running jobs).
For this we should initially copy files to specified nodes:
User A :   first file : Node 1, second file: Node 2, .. etc.
User B :   first file : Node 1, second file: Node 2, .. etc.

I know hadoop also creates replicas, but in our solution at least one
file will be in the right place
(or we're willing to control other replicas too).

Rebalancing is also not a problem, assuming it uses the information
about how much a computer is in use.
It even helps for a better organization of files.

How can we copy files to specified nodes?
Or do you have a better solution for us?

I couldn't find a solution to this, probably such an option doesn't exist.
But I wanted to take an expert's opinion about this.

Thanks in advance..
Rasit


Re: Cannot copy from local file system to DFS

2009-02-07 Thread Rasit OZDAS
Hi, Mithila,

"File /user/mithila/test/20417.txt could only be replicated to 0
nodes, instead of 1"

I think your datanode isn't working properly.
Please take a look at your datanode's log file (logs/*datanode*.log).

If there is no error in that log file: I've heard that Hadoop can sometimes mark
a datanode as "BAD" and refuse to send blocks to that node, so this
could be the cause.
(List, please correct me if I'm wrong!)
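Another quick check that may help: "bin/hadoop dfsadmin -report" lists the
datanodes the namenode actually sees; if it reports 0 live datanodes, the
"could only be replicated to 0 nodes" error is the expected symptom.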

Hope this helps,
Rasit

2009/2/6 Mithila Nagendra :
> Hey all
> I was trying to run the word count example on one of the hadoop systems I
> installed, but when i try to copy the text files from the local file system
> to the DFS, it throws up the following exception:
>
> [mith...@node02 hadoop]$ jps
> 8711 JobTracker
> 8805 TaskTracker
> 8901 Jps
> 8419 NameNode
> 8642 SecondaryNameNode
> [mith...@node02 hadoop]$ cd ..
> [mith...@node02 mithila]$ ls
> hadoop  hadoop-0.17.2.1.tar  hadoop-datastore  test
> [mith...@node02 mithila]$ hadoop/bin/hadoop dfs -copyFromLocal test test
> 09/02/06 11:26:26 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException:
> java.io.IOException: File /user/mithila/test/20417.txt could only be
> replicated to 0 nodes, instead of 1
>at
> org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145)
>at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:597)
>at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
>at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
>
>at org.apache.hadoop.ipc.Client.call(Client.java:557)
>at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
>at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:597)
>at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
>at
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2335)
>at
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2220)
>at
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1700(DFSClient.java:1702)
>at
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1842)
>
> 09/02/06 11:26:26 WARN dfs.DFSClient: NotReplicatedYetException sleeping
> /user/mithila/test/20417.txt retries left 4
> 09/02/06 11:26:27 INFO dfs.DFSClient: org.apache.hadoop.ipc.RemoteException:
> java.io.IOException: File /user/mithila/test/20417.txt could only be
> replicated to 0 nodes, instead of 1
>at
> org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1145)
>at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:300)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:597)
>at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
>at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
>
>at org.apache.hadoop.ipc.Client.call(Client.java:557)
>at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
>at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:597)
>at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
>at
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2335)
>at
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2220)
>at
> org.apache.h

Re: Heap size error

2009-02-07 Thread Rasit OZDAS
Hi, Amandeep,
I've copied the following lines from a site:
--
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

This can have two reasons:

* Your Java application has a memory leak. There are tools like
YourKit Java Profiler that help you to identify such leaks.
* Your Java application really needs a lot of memory (more than
128 MB by default!). In this case the Java heap size can be increased
using the following runtime parameters:

java -Xms<initial heap size> -Xmx<maximum heap size>

Defaults are:

java -Xms32m -Xmx128m

You can set this either in the Java Control Panel or on the command
line, depending on the environment you run your application.
-
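For a Hadoop job specifically, the knob that reaches the map/reduce child
JVMs is usually mapred.child.java.opts. A hedged sketch from the driver
(512m is just an example value):

  JobConf conf = new JobConf(TableJoin.class);
  conf.set("mapred.child.java.opts", "-Xmx512m");  // heap for each task JVM

The same property can also be set once in hadoop-site.xml instead of in code.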

Hope this helps,
Rasit

2009/2/7 Amandeep Khurana :
> I'm getting the following error while running my hadoop job:
>
> 09/02/06 15:33:03 INFO mapred.JobClient: Task Id :
> attempt_200902061333_0004_r_00_1, Status : FAILED
> java.lang.OutOfMemoryError: Java heap space
>at java.util.Arrays.copyOf(Unknown Source)
>at java.lang.AbstractStringBuilder.expandCapacity(Unknown Source)
>at java.lang.AbstractStringBuilder.append(Unknown Source)
>at java.lang.StringBuffer.append(Unknown Source)
>at TableJoin$Reduce.reduce(TableJoin.java:61)
>at TableJoin$Reduce.reduce(TableJoin.java:1)
>at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:430)
>at org.apache.hadoop.mapred.Child.main(Child.java:155)
>
> Any inputs?
>
> Amandeep
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>



-- 
M. Raşit ÖZDAŞ


Re: Not able to copy a file to HDFS after installing

2009-02-05 Thread Rasit OZDAS
Rajshekar,
I also have a couple of threads about this ;)

http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200803.mbox/%3cpine.lnx.4.64.0803132200480.5...@localhost.localdomain%3e
http://www.mail-archive.com/hadoop-...@lucene.apache.org/msg03226.html

Please try the following:

- Give the local file path for the jar
- Give an absolute path, not one relative to hadoop/bin
- Make sure the HADOOP_HOME env. variable is set correctly

Hope this helps,
Rasit

2009/2/6 Rajshekar :
>
> Hi
> Thanks Rasi,
>
> From Yest evening I am able to start Namenode. I did few changed in
> hadoop-site.xml. it working now, but the new problem is I am not able to do
> map/reduce jobs using .jar files. it is giving following error
>
> had...@excel-desktop:/usr/local/hadoop$ bin/hadoop jar
> hadoop-0.19.0-examples.jar wordcount gutenberg gutenberg-output
> java.io.IOException: Error opening job jar: hadoop-0.19.0-examples.jar
>at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
>at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
>at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)
> Caused by: java.util.zip.ZipException: error in opening zip file
>at java.util.zip.ZipFile.open(Native Method)
>at java.util.zip.ZipFile.(ZipFile.java:131)
>at java.util.jar.JarFile.(JarFile.java:150)
>at java.util.jar.JarFile.(JarFile.java:87)
>at org.apache.hadoop.util.RunJar.main(RunJar.java:88)
>... 4 more
>
> Pls help me out
>
>
>
> Rasit OZDAS wrote:
>>
>> Rajshekar,
>> It seems that your namenode isn't able to load FsImage file.
>>
>> Here is a thread about a similar issue:
>> http://www.nabble.com/Hadoop-0.17.1-%3D%3E-EOFException-reading-FSEdits-file,-what-causes-this---how-to-prevent--td21440922.html
>>
>> Rasit
>>
>> 2009/2/5 Rajshekar :
>>>
>>> Name naode is localhost with an ip address.Now I checked when i give
>>> /bin/hadoop namenode i am getting error
>>>
>>> r...@excel-desktop:/usr/local/hadoop/hadoop-0.17.2.1# bin/hadoop namenode
>>> 09/02/05 13:27:43 INFO dfs.NameNode: STARTUP_MSG:
>>> /
>>> STARTUP_MSG: Starting NameNode
>>> STARTUP_MSG:   host = excel-desktop/127.0.1.1
>>> STARTUP_MSG:   args = []
>>> STARTUP_MSG:   version = 0.17.2.1
>>> STARTUP_MSG:   build =
>>> https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r
>>> 684969;
>>> compiled by 'oom' on Wed Aug 20 22:29:32 UTC 2008
>>> /
>>> 09/02/05 13:27:43 INFO metrics.RpcMetrics: Initializing RPC Metrics with
>>> hostName=NameNode, port=9000
>>> 09/02/05 13:27:43 INFO dfs.NameNode: Namenode up at:
>>> localhost/127.0.0.1:9000
>>> 09/02/05 13:27:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with
>>> processName=NameNode, sessionId=null
>>> 09/02/05 13:27:43 INFO dfs.NameNodeMetrics: Initializing NameNodeMeterics
>>> using context object:org.apache.hadoop.metrics.spi.NullContext
>>> 09/02/05 13:27:43 INFO fs.FSNamesystem: fsOwner=root,root
>>> 09/02/05 13:27:43 INFO fs.FSNamesystem: supergroup=supergroup
>>> 09/02/05 13:27:43 INFO fs.FSNamesystem: isPermissionEnabled=true
>>> 09/02/05 13:27:44 INFO ipc.Server: Stopping server on 9000
>>> 09/02/05 13:27:44 ERROR dfs.NameNode: java.io.EOFException
>>>at java.io.RandomAccessFile.readInt(RandomAccessFile.java:776)
>>>at
>>> org.apache.hadoop.dfs.FSImage.isConversionNeeded(FSImage.java:488)
>>>at
>>> org.apache.hadoop.dfs.Storage$StorageDirectory.analyzeStorage(Storage.java:283)
>>>at
>>> org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:149)
>>>at
>>> org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
>>>at
>>> org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:274)
>>>at
>>> org.apache.hadoop.dfs.FSNamesystem.(FSNamesystem.java:255)
>>>at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:133)
>>>at org.apache.hadoop.dfs.NameNode.(NameNode.java:178)
>>>at org.apache.hadoop.dfs.NameNode.(NameNode.java:164)
>>>at
>>> org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
>>>at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)
>>>
>>>

Re: Regarding "Hadoop multi cluster" set-up

2009-02-05 Thread Rasit OZDAS
Ian,
there is a list in the docs under
"Setting up Hadoop on a single node > Basic Configuration > Jobtracker
and Namenode settings".
Maybe it's what you're looking for.
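From memory (worth double-checking against your version's defaults): the
NameNode and JobTracker RPC ports are whatever you configure in
fs.default.name and mapred.job.tracker (54310/54311 in many tutorials);
the web UIs default to 50070 (NameNode), 50030 (JobTracker),
50060 (TaskTracker), 50075 (DataNode) and 50090 (SecondaryNameNode);
datanodes also listen on 50010 for data transfer and 50020 for IPC.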

Cheers,
Rasit

2009/2/4 Ian Soboroff :
> I would love to see someplace a complete list of the ports that the various
> Hadoop daemons expect to have open.  Does anyone have that?
>
> Ian
>
> On Feb 4, 2009, at 1:16 PM, shefali pawar wrote:
>
>>
>> Hi,
>>
>> I will have to check. I can do that tomorrow in college. But if that is
>> the case what should i do?
>>
>> Should i change the port number and try again?
>>
>> Shefali
>>
>> On Wed, 04 Feb 2009 S D wrote :
>>>
>>> Shefali,
>>>
>>> Is your firewall blocking port 54310 on the master?
>>>
>>> John
>>>
>>> On Wed, Feb 4, 2009 at 12:34 PM, shefali pawar
>>> wrote:
>>>
 Hi,

 I am trying to set-up a two node cluster using Hadoop0.19.0, with 1
 master(which should also work as a slave) and 1 slave node.

 But while running bin/start-dfs.sh the datanode is not starting on the
 slave. I had read the previous mails on the list, but nothing seems to
 be
 working in this case. I am getting the following error in the
 hadoop-root-datanode-slave log file while running the command
 bin/start-dfs.sh =>

 2009-02-03 13:00:27,516 INFO
 org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
 /
 STARTUP_MSG: Starting DataNode
 STARTUP_MSG:   host = slave/172.16.0.32
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 0.19.0
 STARTUP_MSG:   build =
 https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r
 713890; compiled by 'ndaley' on Fri Nov 14 03:12:29 UTC 2008
 /
 2009-02-03 13:00:28,725 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: master/172.16.0.46:54310. Already tried 0 time(s).
 2009-02-03 13:00:29,726 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: master/172.16.0.46:54310. Already tried 1 time(s).
 2009-02-03 13:00:30,727 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: master/172.16.0.46:54310. Already tried 2 time(s).
 2009-02-03 13:00:31,728 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: master/172.16.0.46:54310. Already tried 3 time(s).
 2009-02-03 13:00:32,729 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: master/172.16.0.46:54310. Already tried 4 time(s).
 2009-02-03 13:00:33,730 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: master/172.16.0.46:54310. Already tried 5 time(s).
 2009-02-03 13:00:34,731 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: master/172.16.0.46:54310. Already tried 6 time(s).
 2009-02-03 13:00:35,732 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: master/172.16.0.46:54310. Already tried 7 time(s).
 2009-02-03 13:00:36,733 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: master/172.16.0.46:54310. Already tried 8 time(s).
 2009-02-03 13:00:37,734 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: master/172.16.0.46:54310. Already tried 9 time(s).
 2009-02-03 13:00:37,738 ERROR
 org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException:
 Call
 to master/172.16.0.46:54310 failed on local exception: No route to host
  at org.apache.hadoop.ipc.Client.call(Client.java:699)
  at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
  at $Proxy4.getProtocolVersion(Unknown Source)
  at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
  at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:306)
  at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:343)
  at org.apache.hadoop.ipc.RPC.waitForProxy(RPC.java:288)
  at

 org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:258)
  at

 org.apache.hadoop.hdfs.server.datanode.DataNode.(DataNode.java:205)
  at

 org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1199)
  at

 org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1154)
  at

 org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1162)
  at
 org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1284)
 Caused by: java.net.NoRouteToHostException: No route to host
  at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
  at
 sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
  at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:100)
  at
 org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:299)
  at
 org.apache

Re: Not able to copy a file to HDFS after installing

2009-02-05 Thread Rasit OZDAS
Rajshekar,
It seems that your namenode isn't able to load the FsImage file.

Here is a thread about a similar issue:
http://www.nabble.com/Hadoop-0.17.1-%3D%3E-EOFException-reading-FSEdits-file,-what-causes-this---how-to-prevent--td21440922.html

Rasit

2009/2/5 Rajshekar :
>
> Name naode is localhost with an ip address.Now I checked when i give
> /bin/hadoop namenode i am getting error
>
> r...@excel-desktop:/usr/local/hadoop/hadoop-0.17.2.1# bin/hadoop namenode
> 09/02/05 13:27:43 INFO dfs.NameNode: STARTUP_MSG:
> /
> STARTUP_MSG: Starting NameNode
> STARTUP_MSG:   host = excel-desktop/127.0.1.1
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 0.17.2.1
> STARTUP_MSG:   build =
> https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 684969;
> compiled by 'oom' on Wed Aug 20 22:29:32 UTC 2008
> /
> 09/02/05 13:27:43 INFO metrics.RpcMetrics: Initializing RPC Metrics with
> hostName=NameNode, port=9000
> 09/02/05 13:27:43 INFO dfs.NameNode: Namenode up at:
> localhost/127.0.0.1:9000
> 09/02/05 13:27:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with
> processName=NameNode, sessionId=null
> 09/02/05 13:27:43 INFO dfs.NameNodeMetrics: Initializing NameNodeMeterics
> using context object:org.apache.hadoop.metrics.spi.NullContext
> 09/02/05 13:27:43 INFO fs.FSNamesystem: fsOwner=root,root
> 09/02/05 13:27:43 INFO fs.FSNamesystem: supergroup=supergroup
> 09/02/05 13:27:43 INFO fs.FSNamesystem: isPermissionEnabled=true
> 09/02/05 13:27:44 INFO ipc.Server: Stopping server on 9000
> 09/02/05 13:27:44 ERROR dfs.NameNode: java.io.EOFException
>at java.io.RandomAccessFile.readInt(RandomAccessFile.java:776)
>at
> org.apache.hadoop.dfs.FSImage.isConversionNeeded(FSImage.java:488)
>at
> org.apache.hadoop.dfs.Storage$StorageDirectory.analyzeStorage(Storage.java:283)
>at
> org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:149)
>at
> org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
>at
> org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:274)
>at org.apache.hadoop.dfs.FSNamesystem.(FSNamesystem.java:255)
>at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:133)
>at org.apache.hadoop.dfs.NameNode.(NameNode.java:178)
>at org.apache.hadoop.dfs.NameNode.(NameNode.java:164)
>at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
>at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)
>
> 09/02/05 13:27:44 INFO dfs.NameNode: SHUTDOWN_MSG:
> /
> SHUTDOWN_MSG: Shutting down NameNode at excel-desktop/127.0.1.1
> /
>  Rajshekar
>
>
>
>
>
> Sagar Naik-3 wrote:
>>
>>
>> where is the namenode running ? localhost or some other host
>>
>> -Sagar
>> Rajshekar wrote:
>>> Hello,
>>> I am new to Hadoop and I jus installed on Ubuntu 8.0.4 LTS as per
>>> guidance
>>> of a web site. I tested it and found working fine. I tried to copy a file
>>> but it is giving some error pls help me out
>>>
>>> had...@excel-desktop:/usr/local/hadoop/hadoop-0.17.2.1$  bin/hadoop jar
>>> hadoop-0.17.2.1-examples.jar wordcount /home/hadoop/Download\ URLs.txt
>>> download-output
>>> 09/02/02 11:18:59 INFO ipc.Client: Retrying connect to server:
>>> localhost/127.0.0.1:9000. Already tried 1 time(s).
>>> 09/02/02 11:19:00 INFO ipc.Client: Retrying connect to server:
>>> localhost/127.0.0.1:9000. Already tried 2 time(s).
>>> 09/02/02 11:19:01 INFO ipc.Client: Retrying connect to server:
>>> localhost/127.0.0.1:9000. Already tried 3 time(s).
>>> 09/02/02 11:19:02 INFO ipc.Client: Retrying connect to server:
>>> localhost/127.0.0.1:9000. Already tried 4 time(s).
>>> 09/02/02 11:19:04 INFO ipc.Client: Retrying connect to server:
>>> localhost/127.0.0.1:9000. Already tried 5 time(s).
>>> 09/02/02 11:19:05 INFO ipc.Client: Retrying connect to server:
>>> localhost/127.0.0.1:9000. Already tried 6 time(s).
>>> 09/02/02 11:19:06 INFO ipc.Client: Retrying connect to server:
>>> localhost/127.0.0.1:9000. Already tried 7 time(s).
>>> 09/02/02 11:19:07 INFO ipc.Client: Retrying connect to server:
>>> localhost/127.0.0.1:9000. Already tried 8 time(s).
>>> 09/02/02 11:19:08 INFO ipc.Client: Retrying connect to server:
>>> localhost/127.0.0.1:9000. Already tried 9 time(s).
>>> 09/02/02 11:19:09 INFO ipc.Client: Retrying connect to server:
>>> localhost/127.0.0.1:9000. Already tried 10 time(s).
>>> java.lang.RuntimeException: java.net.ConnectException: Connection refused
>>> at org.apache.hadoop.mapred.JobConf.getWorkingDirecto
>>> ry(JobConf.java:356)
>>> at org.apache.hadoop.mapred.FileInputFormat.setInputP
>>> aths(FileInputFormat.java:331)
>>> at org.apache.hadoop.mapred.FileInputFormat.setInputP
>>> aths(FileInputFormat.java:304)
>>> at org.apache.hadoop.examples.

Re: Problem with Counters

2009-02-05 Thread Rasit OZDAS
Sharath,

You're using  reporter.incrCounter(enumVal, intVal);  to increment the counter,
so the method you use to read it back should take the same enum.

Try findCounter(enumVal).getCounter() or getCounter(enumVal).
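A minimal sketch of the driver-side read-back (conf is your JobConf from
main; the enum is assumed to stay nested inside your Reduce class, so it
has to be qualified from the driver):

  RunningJob running = JobClient.runJob(conf);
  Counters ct = running.getCounters();
  long res = ct.getCounter(Reduce.MyCounter.ct_key1);  // note the Reduce. qualifier
  System.out.println("ct_key1 = " + res);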

Hope this helps,
Rasit

2009/2/5 some speed :
> In fact I put the enum in my Reduce method as the following link (from
> Yahoo) says so:
>
> http://public.yahoo.com/gogate/hadoop-tutorial/html/module5.html#metrics
> --->Look at the section under Reporting Custom Metrics.
>
> 2009/2/5 some speed 
>
>> Thanks Rasit.
>>
>> I did as you said.
>>
>> 1) Put the static enum MyCounter{ct_key1} just above main()
>>
>> 2) Changed  result =
>> ct.findCounter("org.apache.hadoop.mapred.Task$Counter", 1,
>> "Reduce.MyCounter").getCounter();
>>
>> Still is doesnt seem to help. It throws a null pointer exception.Its not
>> able to find the Counter.
>>
>>
>>
>> Thanks,
>>
>> Sharath
>>
>>
>>
>>
>> On Thu, Feb 5, 2009 at 8:04 AM, Rasit OZDAS  wrote:
>>
>>> Forgot to say, value "0" means that the requested counter does not exist.
>>>
>>> 2009/2/5 Rasit OZDAS :
>>> > Sharath,
>>> >  I think the static enum definition should be out of Reduce class.
>>> > Hadoop probably tries to find it elsewhere with "MyCounter", but it's
>>> > actually "Reduce.MyCounter" in your example.
>>> >
>>> > Hope this helps,
>>> > Rasit
>>> >
>>> > 2009/2/5 some speed :
>>> >> I Tried the following...It gets compiled but the value of result seems
>>> to be
>>> >> 0 always.
>>> >>
>>> >>RunningJob running = JobClient.runJob(conf);
>>> >>
>>> >> Counters ct = new Counters();
>>> >> ct = running.getCounters();
>>> >>
>>> >>long result =
>>> >> ct.findCounter("org.apache.hadoop.mapred.Task$Counter", 0,
>>> >> "*MyCounter*").getCounter();
>>> >> //even tried MyCounter.Key1
>>> >>
>>> >>
>>> >>
>>> >> Does anyone know whay that is happening?
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Sharath
>>> >>
>>> >>
>>> >>
>>> >> On Thu, Feb 5, 2009 at 5:59 AM, some speed 
>>> wrote:
>>> >>
>>> >>> Hi Tom,
>>> >>>
>>> >>> I get the error :
>>> >>>
>>> >>> Cannot find Symbol* "**MyCounter.ct_key1 " *
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> On Thu, Feb 5, 2009 at 5:51 AM, Tom White  wrote:
>>> >>>
>>> >>>> Hi Sharath,
>>> >>>>
>>> >>>> The code you posted looks right to me. Counters#getCounter() will
>>> >>>> return the counter's value. What error are you getting?
>>> >>>>
>>> >>>> Tom
>>> >>>>
>>> >>>> On Thu, Feb 5, 2009 at 10:09 AM, some speed 
>>> wrote:
>>> >>>> > Hi,
>>> >>>> >
>>> >>>> > Can someone help me with the usage of counters please? I am
>>> incrementing
>>> >>>> a
>>> >>>> > counter in Reduce method but I am unable to collect the counter
>>> value
>>> >>>> after
>>> >>>> > the job is completed.
>>> >>>> >
>>> >>>> > Its something like this:
>>> >>>> >
>>> >>>> > public static class Reduce extends MapReduceBase implements
>>> >>>> Reducer>> >>>> > FloatWritable, Text, FloatWritable>
>>> >>>> >{
>>> >>>> >static enum MyCounter{ct_key1};
>>> >>>> >
>>> >>>> > public void reduce(..) throws IOException
>>> >>>> >{
>>> >>>> >
>>> >>>> >reporter.incrCounter(MyCounter.ct_key1, 1);
>>> >>>> >
>>> >>>> >output.collect(..);
>>> >>>> >
>>> >>>> >}
>>> >>>> > }
>>> >>>> >
>>> >>>> > -main method
>>> >>>> > {
>>> >>>> >RunningJob running = null;
>>> >>>> >running=JobClient.runJob(conf);
>>> >>>> >
>>> >>>> >Counters ct = running.getCounters();
>>> >>>> > /*  How do I Collect the ct_key1 value ***/
>>> >>>> >long res = ct.getCounter(MyCounter.ct_key1);
>>> >>>> >
>>> >>>> > }
>>> >>>> >
>>> >>>> >
>>> >>>> >
>>> >>>> >
>>> >>>> >
>>> >>>> > Thanks,
>>> >>>> >
>>> >>>> > Sharath
>>> >>>> >
>>> >>>>
>>> >>>
>>> >>>
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > M. Raşit ÖZDAŞ
>>> >
>>>
>>>
>>>
>>> --
>>> M. Raşit ÖZDAŞ
>>>
>>
>>
>



-- 
M. Raşit ÖZDAŞ


Re: Problem with Counters

2009-02-05 Thread Rasit OZDAS
Forgot to say: a value of "0" means that the requested counter does not exist.

2009/2/5 Rasit OZDAS :
> Sharath,
>  I think the static enum definition should be out of Reduce class.
> Hadoop probably tries to find it elsewhere with "MyCounter", but it's
> actually "Reduce.MyCounter" in your example.
>
> Hope this helps,
> Rasit
>
> 2009/2/5 some speed :
>> I Tried the following...It gets compiled but the value of result seems to be
>> 0 always.
>>
>>RunningJob running = JobClient.runJob(conf);
>>
>> Counters ct = new Counters();
>> ct = running.getCounters();
>>
>>long result =
>> ct.findCounter("org.apache.hadoop.mapred.Task$Counter", 0,
>> "*MyCounter*").getCounter();
>> //even tried MyCounter.Key1
>>
>>
>>
>> Does anyone know whay that is happening?
>>
>> Thanks,
>>
>> Sharath
>>
>>
>>
>> On Thu, Feb 5, 2009 at 5:59 AM, some speed  wrote:
>>
>>> Hi Tom,
>>>
>>> I get the error :
>>>
>>> Cannot find Symbol* "**MyCounter.ct_key1 " *
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Feb 5, 2009 at 5:51 AM, Tom White  wrote:
>>>
>>>> Hi Sharath,
>>>>
>>>> The code you posted looks right to me. Counters#getCounter() will
>>>> return the counter's value. What error are you getting?
>>>>
>>>> Tom
>>>>
>>>> On Thu, Feb 5, 2009 at 10:09 AM, some speed  wrote:
>>>> > Hi,
>>>> >
>>>> > Can someone help me with the usage of counters please? I am incrementing
>>>> a
>>>> > counter in Reduce method but I am unable to collect the counter value
>>>> after
>>>> > the job is completed.
>>>> >
>>>> > Its something like this:
>>>> >
>>>> > public static class Reduce extends MapReduceBase implements
>>>> Reducer>>> > FloatWritable, Text, FloatWritable>
>>>> >{
>>>> >static enum MyCounter{ct_key1};
>>>> >
>>>> > public void reduce(..) throws IOException
>>>> >{
>>>> >
>>>> >reporter.incrCounter(MyCounter.ct_key1, 1);
>>>> >
>>>> >output.collect(..);
>>>> >
>>>> >}
>>>> > }
>>>> >
>>>> > -main method
>>>> > {
>>>> >RunningJob running = null;
>>>> >running=JobClient.runJob(conf);
>>>> >
>>>> >Counters ct = running.getCounters();
>>>> > /*  How do I Collect the ct_key1 value ***/
>>>> >long res = ct.getCounter(MyCounter.ct_key1);
>>>> >
>>>> > }
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> > Thanks,
>>>> >
>>>> > Sharath
>>>> >
>>>>
>>>
>>>
>>
>
>
>
> --
> M. Raşit ÖZDAŞ
>



-- 
M. Raşit ÖZDAŞ


Re: Problem with Counters

2009-02-05 Thread Rasit OZDAS
Sharath,
I think the static enum definition should be outside the Reduce class.
Hadoop probably looks it up as "MyCounter", but it's
actually "Reduce.MyCounter" in your example.

Hope this helps,
Rasit

2009/2/5 some speed :
> I Tried the following...It gets compiled but the value of result seems to be
> 0 always.
>
>RunningJob running = JobClient.runJob(conf);
>
> Counters ct = new Counters();
> ct = running.getCounters();
>
>long result =
> ct.findCounter("org.apache.hadoop.mapred.Task$Counter", 0,
> "*MyCounter*").getCounter();
> //even tried MyCounter.Key1
>
>
>
> Does anyone know whay that is happening?
>
> Thanks,
>
> Sharath
>
>
>
> On Thu, Feb 5, 2009 at 5:59 AM, some speed  wrote:
>
>> Hi Tom,
>>
>> I get the error :
>>
>> Cannot find Symbol* "**MyCounter.ct_key1 " *
>>
>>
>>
>>
>>
>>
>> On Thu, Feb 5, 2009 at 5:51 AM, Tom White  wrote:
>>
>>> Hi Sharath,
>>>
>>> The code you posted looks right to me. Counters#getCounter() will
>>> return the counter's value. What error are you getting?
>>>
>>> Tom
>>>
>>> On Thu, Feb 5, 2009 at 10:09 AM, some speed  wrote:
>>> > Hi,
>>> >
>>> > Can someone help me with the usage of counters please? I am incrementing
>>> a
>>> > counter in Reduce method but I am unable to collect the counter value
>>> after
>>> > the job is completed.
>>> >
>>> > Its something like this:
>>> >
>>> > public static class Reduce extends MapReduceBase implements
>>> Reducer>> > FloatWritable, Text, FloatWritable>
>>> >{
>>> >static enum MyCounter{ct_key1};
>>> >
>>> > public void reduce(..) throws IOException
>>> >{
>>> >
>>> >reporter.incrCounter(MyCounter.ct_key1, 1);
>>> >
>>> >output.collect(..);
>>> >
>>> >}
>>> > }
>>> >
>>> > -main method
>>> > {
>>> >RunningJob running = null;
>>> >running=JobClient.runJob(conf);
>>> >
>>> >Counters ct = running.getCounters();
>>> > /*  How do I Collect the ct_key1 value ***/
>>> >long res = ct.getCounter(MyCounter.ct_key1);
>>> >
>>> > }
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > Thanks,
>>> >
>>> > Sharath
>>> >
>>>
>>
>>
>



-- 
M. Raşit ÖZDAŞ


Re: Bad connection to FS.

2009-02-05 Thread Rasit OZDAS
I can add a little trick for spotting namenode failures:
I find such problems by first running start-all.sh and then stop-all.sh.
If the namenode started without error, stop-all.sh prints
"stopping namenode..", but in case of an error it says "no namenode
to stop..".
In case of an error, the Hadoop log directory is always the first place to look.
It doesn't save the day, but it's worth noting.

Hope this helps,
Rasit

2009/2/5 lohit :
> As noted by others NameNode is not running.
> Before formatting anything (which is like deleting your data), try to see why 
> NameNode isnt running.
> search for the value of HADOOP_LOG_DIR in ./conf/hadoop-env.sh; if you have not
> set it explicitly it would default to <hadoop installation>/logs/*namenode*.log
> Lohit
>
>
>
> - Original Message 
> From: Amandeep Khurana 
> To: core-user@hadoop.apache.org
> Sent: Wednesday, February 4, 2009 5:26:43 PM
> Subject: Re: Bad connection to FS.
>
> Here's what I had done..
>
> 1. Stop the whole system
> 2. Delete all the data in the directories where the data and the metadata is
> being stored.
> 3. Format the namenode
> 4. Start the system
>
> This solved my problem. I'm not sure if this is a good idea to do for you or
> not. I was pretty much installing from scratch so didnt mind deleting the
> files in those directories..
>
> Amandeep
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Wed, Feb 4, 2009 at 3:49 PM, TCK  wrote:
>
>>
>> I believe the debug logs location is still specified in hadoop-env.sh (I
>> just read the 0.19.0 doc). I think you have to shut down all nodes first
>> (stop-all), then format the namenode, and then restart (start-all) and make
>> sure that NameNode comes up too. We are using a very old version, 0.12.3,
>> and are upgrading.
>> -TCK
>>
>>
>>
>> --- On Wed, 2/4/09, Mithila Nagendra  wrote:
>> From: Mithila Nagendra 
>> Subject: Re: Bad connection to FS.
>> To: core-user@hadoop.apache.org, moonwatcher32...@yahoo.com
>> Date: Wednesday, February 4, 2009, 6:30 PM
>>
>> @TCK: Which version of hadoop have you installed?
>> @Amandeep: I did tried reformatting the namenode, but it hasn't helped me
>> out in anyway.
>> Mithila
>>
>>
>> On Wed, Feb 4, 2009 at 4:18 PM, TCK  wrote:
>>
>>
>>
>> Mithila, how come there is no NameNode java process listed by your jps
>> command? I would check the hadoop namenode logs to see if there was some
>> startup problem (the location of those logs would be specified in
>> hadoop-env.sh, at least in the version I'm using).
>>
>>
>> -TCK
>>
>>
>>
>>
>>
>>
>>
>> --- On Wed, 2/4/09, Mithila Nagendra  wrote:
>>
>> From: Mithila Nagendra 
>>
>> Subject: Bad connection to FS.
>>
>> To: "core-user@hadoop.apache.org" , "
>> core-user-subscr...@hadoop.apache.org" <
>> core-user-subscr...@hadoop.apache.org>
>>
>>
>> Date: Wednesday, February 4, 2009, 6:06 PM
>>
>>
>>
>> Hey all
>>
>>
>>
>> When I try to copy a folder from the local file system in to the HDFS using
>>
>> the command hadoop dfs -copyFromLocal, the copy fails and it gives an error
>>
>> which says "Bad connection to FS". How do I get past this? The
>>
>> following is
>>
>> the output at the time of execution:
>>
>>
>>
>> had...@renweiyu-desktop:/usr/local/hadoop$ jps
>>
>> 6873 Jps
>>
>> 6299 JobTracker
>>
>> 6029 DataNode
>>
>> 6430 TaskTracker
>>
>> 6189 SecondaryNameNode
>>
>> had...@renweiyu-desktop:/usr/local/hadoop$ ls
>>
>> bin  docslib  README.txt
>>
>> build.xmlhadoop-0.18.3-ant.jar   libhdfs  src
>>
>> c++  hadoop-0.18.3-core.jar  librecordio  webapps
>>
>> CHANGES.txt  hadoop-0.18.3-examples.jar  LICENSE.txt
>>
>> conf hadoop-0.18.3-test.jar  logs
>>
>> contrib  hadoop-0.18.3-tools.jar NOTICE.txt
>>
>> had...@renweiyu-desktop:/usr/local/hadoop$ cd ..
>>
>> had...@renweiyu-desktop:/usr/local$ ls
>>
>> bin  etc  games  gutenberg  hadoop  hadoop-0.18.3.tar.gz  hadoop-datastore
>>
>> include  lib  man  sbin  share  src
>>
>> had...@renweiyu-desktop:/usr/local$ hadoop/bin/hadoop dfs -copyFromLocal
>>
>> gutenberg gutenberg
>>
>> 09/02/04 15:58:21 INFO ipc.Client: Retrying connect to server: localhost/
>>
>> 127.0.0.1:54310. Already tried 0 time(s).
>>
>> 09/02/04 15:58:22 INFO ipc.Client: Retrying connect to server: localhost/
>>
>> 127.0.0.1:54310. Already tried 1 time(s).
>>
>> 09/02/04 15:58:23 INFO ipc.Client: Retrying connect to server: localhost/
>>
>> 127.0.0.1:54310. Already tried 2 time(s).
>>
>> 09/02/04 15:58:24 INFO ipc.Client: Retrying connect to server: localhost/
>>
>> 127.0.0.1:54310. Already tried 3 time(s).
>>
>> 09/02/04 15:58:25 INFO ipc.Client: Retrying connect to server: localhost/
>>
>> 127.0.0.1:54310. Already tried 4 time(s).
>>
>> 09/02/04 15:58:26 INFO ipc.Client: Retrying connect to server: localhost/
>>
>> 127.0.0.1:54310. Already tried 5 time(s).
>>
>> 09/02/04 15:58:27 INFO ipc.Client: Retrying connect to server: localhost/
>>
>> 127.0.0.1:54310. Alr

Re: copying binary files to a SequenceFile

2009-02-05 Thread Rasit OZDAS
Mark,
http://stuartsierra.com/2008/04/24/a-million-little-files/comment-page-1

That link describes a tool that creates sequence files from tar.gz and
tar.bz2 archives.
I don't think it is a real solution, but at least it frees some memory
and postpones the problem (a last-resort kind of fix).
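If memory is the real constraint, another option might be to stream each
file into the SequenceFile in fixed-size chunks instead of one huge
byte[]. A rough sketch (chunk size, paths and the key scheme below are
arbitrary):

  // hedged sketch: write one large local file as many small BytesWritable chunks
  // (needs org.apache.hadoop.conf.*, org.apache.hadoop.fs.*, org.apache.hadoop.io.*, java.io.*)
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);
  File localFile = new File("/data/big.bin");
  SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
      new Path("/user/mark/big.seq"), Text.class, BytesWritable.class);
  InputStream in = new BufferedInputStream(new FileInputStream(localFile));
  byte[] buf = new byte[64 * 1024];
  BytesWritable chunk = new BytesWritable();
  int n, part = 0;
  while ((n = in.read(buf)) > 0) {
    chunk.set(buf, 0, n);  // only the bytes actually read
    writer.append(new Text(localFile.getName() + "#" + part++), chunk);
  }
  in.close();
  writer.close();

The reader then has to reassemble the chunks in key order, but nothing
ever needs to fit in RAM at once.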

Rasit

2009/2/5 Mark Kerzner :
> Hi all,
>
> I am copying regular binary files to a SequenceFile, and I am using
> BytesWritable, to which I am giving all the byte[] content of the file.
> However, once it hits a file larger than my computer memory, it may have
> problems. Is there a better way?
>
> Thank you,
> Mark
>



-- 
M. Raşit ÖZDAŞ


Re: Hadoop FS Shell - command overwrite capability

2009-02-04 Thread Rasit OZDAS
John, I also couldn't find a way to do this from the console.
Maybe you already know this and prefer not to use it, but the API solves the problem:
FileSystem.copyFromLocalFile(boolean delSrc, boolean overwrite, Path
src, Path dst)
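A minimal sketch of that call (the bucket and paths below are just taken
from your example; the S3 credentials are assumed to be in the
configuration already):

  Configuration conf = new Configuration();
  Path src = new Path("/home/john/adirectory");
  Path dst = new Path("s3n://wholeinthebucket/adirectory");
  FileSystem fs = dst.getFileSystem(conf);
  fs.copyFromLocalFile(false /* delSrc */, true /* overwrite */, src, dst);

With overwrite set to true an existing destination is replaced instead of
causing an error.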

If you have to stay on the console, a longer route is to wrap this call in
a small jar and invoke it the same way the "hadoop" script in the bin
directory calls the FileSystem class.

I think the FileSystem API also needs some improvement here. I wonder if
the core developers have considered it.

Hope this helps,
Rasit

2009/2/4 S D :
> I'm using the Hadoop FS commands to move files from my local machine into
> the Hadoop dfs. I'd like a way to force a write to the dfs even if a file of
> the same name exists. Ideally I'd like to use a "-force" switch or some
> such; e.g.,
>hadoop dfs -copyFromLocal -force adirectory s3n://wholeinthebucket/
>
> Is there a way to do this or does anyone know if this is in the future
> Hadoop plans?
>
> Thanks
> John SD
>



-- 
M. Raşit ÖZDAŞ


Re: How to use DBInputFormat?

2009-02-04 Thread Rasit OZDAS
Amandeep,
"SQL command not properly ended"
I get this error whenever I forget the semicolon at the end.
I know, it doesn't make sense, but I recommend giving it a try
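Also, for reference, a minimal DBInputFormat wiring looks roughly like this
(the driver, URL, table and column names are copied from your code;
everything else is just a sketch, not your exact setup):

  JobConf job = new JobConf(LoadTable1.class);
  DBConfiguration.configureDB(job,
      "oracle.jdbc.driver.OracleDriver",
      "jdbc:oracle:thin:@dbhost:1521:PSEDEV", "user", "pass");
  DBInputFormat.setInput(job, ose_epr_contract.class,
      "OSE_EPR_CONTRACT",                        // table
      null,                                      // conditions (WHERE), none
      "CONTRACT_NUMBER",                         // orderBy
      "PORTFOLIO_NUMBER", "CONTRACT_NUMBER");    // fields

If the generated SELECT still fails, running it from a plain JDBC client
usually shows what Oracle is complaining about.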

Rasit

2009/2/4 Amandeep Khurana :
> The same query is working if I write a simple JDBC client and query the
> database. So, I'm probably doing something wrong in the connection settings.
> But the error looks to be on the query side more than the connection side.
>
> Amandeep
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana  wrote:
>
>> Thanks Kevin
>>
>> I couldnt get it work. Here's the error I get:
>>
>> bin/hadoop jar ~/dbload.jar LoadTable1
>> 09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with
>> processName=JobTracker, sessionId=
>> 09/02/03 19:21:20 INFO mapred.JobClient: Running job: job_local_0001
>> 09/02/03 19:21:21 INFO mapred.JobClient:  map 0% reduce 0%
>> 09/02/03 19:21:22 INFO mapred.MapTask: numReduceTasks: 0
>> 09/02/03 19:21:24 WARN mapred.LocalJobRunner: job_local_0001
>> java.io.IOException: ORA-00933: SQL command not properly ended
>>
>> at
>> org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289)
>> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
>> at
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
>> java.io.IOException: Job failed!
>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
>> at LoadTable1.run(LoadTable1.java:130)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>> at LoadTable1.main(LoadTable1.java:107)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>> at java.lang.reflect.Method.invoke(Unknown Source)
>> at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
>> at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>> at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>>
>> Exception closing file
>> /user/amkhuran/contract_table/_temporary/_attempt_local_0001_m_00_0/part-0
>> java.io.IOException: Filesystem closed
>> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198)
>> at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65)
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3084)
>> at
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3053)
>> at
>> org.apache.hadoop.hdfs.DFSClient$LeaseChecker.close(DFSClient.java:942)
>> at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:210)
>> at
>> org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:243)
>> at
>> org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:1413)
>> at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:236)
>> at
>> org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:221)
>>
>>
>> Here's my code:
>>
>> public class LoadTable1 extends Configured implements Tool  {
>>
>>   // data destination on hdfs
>>   private static final String CONTRACT_OUTPUT_PATH = "contract_table";
>>
>>   // The JDBC connection URL and driver implementation class
>>
>> private static final String CONNECT_URL = "jdbc:oracle:thin:@dbhost
>> :1521:PSEDEV";
>>   private static final String DB_USER = "user";
>>   private static final String DB_PWD = "pass";
>>   private static final String DATABASE_DRIVER_CLASS =
>> "oracle.jdbc.driver.OracleDriver";
>>
>>   private static final String CONTRACT_INPUT_TABLE =
>> "OSE_EPR_CONTRACT";
>>
>>   private static final String [] CONTRACT_INPUT_TABLE_FIELDS = {
>> "PORTFOLIO_NUMBER", "CONTRACT_NUMBER"};
>>
>>   private static final String ORDER_CONTRACT_BY_COL =
>> "CONTRACT_NUMBER";
>>
>>
>> static class ose_epr_contract implements Writable, DBWritable {
>>
>>
>> String CONTRACT_NUMBER;
>>
>>
>> public void readFields(DataInput in) throws IOException {
>>
>> this.CONTRACT_NUMBER = Text.readString(in);
>>
>> }
>>
>> public void write(DataOutput out) throws IOException {
>>
>> Text.writeString(out, this.CONTRACT_NUMBER);
>>
>>
>> }
>>
>> public void readFields(ResultSet in_set) throws SQLException {
>>
>> this.CONTRACT_NUMBER = in_set.getString(1);
>>
>> }
>>
>> @Override
>> public void write(PreparedStatement prep_st) throws SQLException {
>>

Re: Value-Only Reduce Output

2009-02-04 Thread Rasit OZDAS
I tried it myself; it doesn't work.
I've also tried the stream.map.output.field.separator and
map.output.key.field.separator parameters for this purpose, and they
don't work either: when Hadoop sees an empty string, it falls back to the
default tab character.

Rasit

2009/2/4 jason hadoop 
>
> Ooops, you are using streaming., and I am not familar.
> As a terrible hack, you could set mapred.textoutputformat.separator to the
> empty string, in your configuration.
>
> On Tue, Feb 3, 2009 at 9:26 PM, jason hadoop  wrote:
>
> > If you are using the standard TextOutputFormat, and the output collector is
> > passed a null for the value, there will not be a trailing tab character
> > added to the output line.
> >
> > output.collect( key, null );
> > Will give you the behavior you are looking for if your configuration is as
> > I expect.
> >
> >
> > On Tue, Feb 3, 2009 at 7:49 PM, Jack Stahl  wrote:
> >
> >> Hello,
> >>
> >> I'm interested in a map-reduce flow where I output only values (no keys)
> >> in
> >> my reduce step.  For example, imagine the canonical word-counting program
> >> where I'd like my output to be an unlabeled histogram of counts instead of
> >> (word, count) pairs.
> >>
> >> I'm using HadoopStreaming (specifically, I'm using the dumbo module to run
> >> my python scripts).  When I simulate the map reduce using pipes and sort
> >> in
> >> bash, it works fine.   However, in Hadoop, if I output a value with no
> >> tabs,
> >> Hadoop appends a trailing "\t", apparently interpreting my output as a
> >> (value, "") KV pair.  I'd like to avoid outputing this trailing tab if
> >> possible.
> >>
> >> Is there a command line option that could be use to effect this?  More
> >> generally, is there something wrong with outputing arbitrary strings,
> >> instead of key-value pairs, in your reduce step?
> >>
> >
> >



--
M. Raşit ÖZDAŞ


Re: A record version mismatch occured. Expecting v6, found v32

2009-02-02 Thread Rasit OZDAS
Thanks, Tom
The reason the content looked different was that
I converted one sample to Base64 byte-by-byte and the other one
byte-array to byte-array (strange that they produce different output).
Thanks for the good points.
Rasit

2009/2/2 Tom White 

> The SequenceFile format is described here:
>
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/SequenceFile.html
> .
> The format of the keys and values depends on the serialization classes
> used. For example, BytesWritable writes out the length of its byte
> array followed by the actual bytes in the array (see the write()
> method in BytesWritable).
>
> Hope this helps.
> Tom
>
> On Mon, Feb 2, 2009 at 3:21 PM, Rasit OZDAS  wrote:
> > I tried to use SequenceFile.Writer to convert my binaries into Sequence
> > Files,
> > I read the binary data with FileInputStream, getting all bytes with
> > reader.read(byte[])  , wrote it to a file with SequenceFile.Writer, with
> > parameters NullWritable as key, BytesWritable as value. But the content
> > changes,
> > (I can see that by converting to Base64)
> >
> > Binary File:
> > 73 65 65 65 81 65 65 65 65 65 81 81 65 119 84 81 65 111 67 81 65 52 57 81
> 65
> > 103 54 81 65 65 97 81 65 65 65 81 ...
> >
> > Sequence File:
> > 73 65 65 65 65 69 65 65 65 65 65 65 65 69 66 65 65 77 66 77 81 103 67 103
> 67
> > 69 77 65 52 80 86 67 65 73 68 114 ...
> >
> > Thanks for any points..
> > Rasit
> >
> > 2009/2/2 Rasit OZDAS 
> >
> >> Hi,
> >> I tried to use SequenceFileInputFormat, for this I appended "SEQ" as
> first
> >> bytes of my "binary" files (with hex editor).
> >> but I get this exception:
> >>
> >> A record version mismatch occured. Expecting v6, found v32
> >> at
> >> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1460)
> >> at
> >> org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1428)
> >> at
> >> org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1417)
> >> at
> >> org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1412)
> >> at
> >>
> org.apache.hadoop.mapred.SequenceFileRecordReader.(SequenceFileRecordReader.java:43)
> >> at
> >>
> org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58)
> >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
> >> at org.apache.hadoop.mapred.Child.main(Child.java:155)
> >>
> >> What could it be? Is it not enough just to add "SEQ" to binary files?
> >> I use Hadoop v.0.19.0 .
> >>
> >> Thanks in advance..
> >> Rasit
> >>
> >>
> >> different *version* of *Hadoop* between your server and your client.
> >>
> >> --
> >> M. Raşit ÖZDAŞ
> >>
> >
> >
> >
> > --
> > M. Raşit ÖZDAŞ
> >
>



-- 
M. Raşit ÖZDAŞ


Re: A record version mismatch occured. Expecting v6, found v32

2009-02-02 Thread Rasit OZDAS
I tried to use SequenceFile.Writer to convert my binaries into sequence
files.
I read the binary data with a FileInputStream, getting all the bytes with
read(byte[]), and wrote them out with SequenceFile.Writer, using
NullWritable as the key and BytesWritable as the value. But the content
changes
(I can see that by converting to Base64):

Binary File:
73 65 65 65 81 65 65 65 65 65 81 81 65 119 84 81 65 111 67 81 65 52 57 81 65
103 54 81 65 65 97 81 65 65 65 81 ...

Sequence File:
73 65 65 65 65 69 65 65 65 65 65 65 65 69 66 65 65 77 66 77 81 103 67 103 67
69 77 65 52 80 86 67 65 73 68 114 ...

Thanks for any points..
Rasit

2009/2/2 Rasit OZDAS 

> Hi,
> I tried to use SequenceFileInputFormat, for this I appended "SEQ" as first
> bytes of my "binary" files (with hex editor).
> but I get this exception:
>
> A record version mismatch occured. Expecting v6, found v32
> at
> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1460)
> at
> org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1428)
> at
> org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1417)
> at
> org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1412)
> at
> org.apache.hadoop.mapred.SequenceFileRecordReader.(SequenceFileRecordReader.java:43)
> at
> org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
> at org.apache.hadoop.mapred.Child.main(Child.java:155)
>
> What could it be? Is it not enough just to add "SEQ" to binary files?
> I use Hadoop v.0.19.0 .
>
> Thanks in advance..
> Rasit
>
>
> different *version* of *Hadoop* between your server and your client.
>
> --
> M. Raşit ÖZDAŞ
>



-- 
M. Raşit ÖZDAŞ


A record version mismatch occured. Expecting v6, found v32

2009-02-02 Thread Rasit OZDAS
Hi,
I tried to use SequenceFileInputFormat, for this I appended "SEQ" as first
bytes of my "binary" files (with hex editor).
but I get this exception:

A record version mismatch occured. Expecting v6, found v32
at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1460)
at
org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1428)
at
org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1417)
at
org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1412)
at
org.apache.hadoop.mapred.SequenceFileRecordReader.(SequenceFileRecordReader.java:43)
at
org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
at org.apache.hadoop.mapred.Child.main(Child.java:155)

What could it be? Is it not enough just to add "SEQ" to binary files?
I use Hadoop v.0.19.0 .

Thanks in advance..
Rasit


different *version* of *Hadoop* between your server and your client.

-- 
M. Raşit ÖZDAŞ


Re: Using HDFS for common purpose

2009-01-29 Thread Rasit OZDAS
Today Nitesh gave an answer on a similar thread that covers what I wanted
to learn.
I'm copying it here to help others with the same question.

HDFS is a file system for distributed storage typically for distributed
computing scenerio over hadoop. For office purpose you will require a SAN
(Storage Area Network) - an architecture to attach remote computer storage
devices to servers in such a way that, to the operating system, the devices
appear as locally attached. Or you can even go for AmazonS3, if the data is
really authentic. For opensource solution related to SAN, you can go with
any of the linux server distributions (eg. RHEL, SuSE) or Solaris (ZFS +
zones) or perhaps best plug-n-play solution (non-open-source) would be a Mac
Server + XSan.

--nitesh

Thanks,
Rasit

2009/1/28 Rasit OZDAS 

> Thanks for responses,
>
> Sorry, I made a mistake, it's actually not a db what I wanted. We need a
> simple storage for files. Only get and put commands are enough (no queries
> needed). We don't even need append, chmod, etc.
>
> Probably from a thread on this list, I came across a link to a KFS-HDFS
> comparison:
> http://deliberateambiguity.typepad.com/blog/2007/10/advantages-of-k.html
>
> It's good, that KFS is written in C++, but handling errors in C++ is
> usually more difficult.
> I need your opinion about which one could best fit.
>
> Thanks,
> Rasit
>
> 2009/1/27 Jim Twensky 
>
> You may also want to have a look at this to reach a decision based on your
>> needs:
>>
>> http://www.swaroopch.com/notes/Distributed_Storage_Systems
>>
>> Jim
>>
>> On Tue, Jan 27, 2009 at 1:22 PM, Jim Twensky 
>> wrote:
>>
>> > Rasit,
>> >
>> > What kind of data will you be storing on Hbase or directly on HDFS? Do
>> you
>> > aim to use it as a data source to do some key/value lookups for small
>> > strings/numbers or do you want to store larger files labeled with some
>> sort
>> > of a key and retrieve them during a map reduce run?
>> >
>> > Jim
>> >
>> >
>> > On Tue, Jan 27, 2009 at 11:51 AM, Jonathan Gray 
>> wrote:
>> >
>> >> Perhaps what you are looking for is HBase?
>> >>
>> >> http://hbase.org
>> >>
>> >> HBase is a column-oriented, distributed store that sits on top of HDFS
>> and
>> >> provides random access.
>> >>
>> >> JG
>> >>
>> >> > -Original Message-
>> >> > From: Rasit OZDAS [mailto:rasitoz...@gmail.com]
>> >> > Sent: Tuesday, January 27, 2009 1:20 AM
>> >> > To: core-user@hadoop.apache.org
>> >> > Cc: arif.yil...@uzay.tubitak.gov.tr; emre.gur...@uzay.tubitak.gov.tr
>> ;
>> >> > hilal.tara...@uzay.tubitak.gov.tr; serdar.ars...@uzay.tubitak.gov.tr
>> ;
>> >> > hakan.kocaku...@uzay.tubitak.gov.tr;
>> caglar.bi...@uzay.tubitak.gov.tr
>> >> > Subject: Using HDFS for common purpose
>> >> >
>> >> > Hi,
>> >> > I wanted to ask, if HDFS is a good solution just as a distributed db
>> >> > (no
>> >> > running jobs, only get and put commands)
>> >> > A review says that "HDFS is not designed for low latency" and
>> besides,
>> >> > it's
>> >> > implemented in Java.
>> >> > Do these disadvantages prevent us using it?
>> >> > Or could somebody suggest a better (faster) one?
>> >> >
>> >> > Thanks in advance..
>> >> > Rasit
>> >>
>> >>
>> >
>>
>
>
>
> --
> M. Raşit ÖZDAŞ
>



-- 
M. Raşit ÖZDAŞ


Re: Is Hadoop Suitable for me?

2009-01-29 Thread Rasit OZDAS
Oh, I can't believe it: my problem was the same, and I thought the last reply was an
answer to my thread.
Who cares, the problem is solved, thanks!

2009/1/29 Rasit OZDAS 

> Thanks for responses, the problem is solved :)
> I'll be forwarding the thread to my colleagues.
>
> 2009/1/29 nitesh bhatia 
>
> HDFS is a file system for distributed storage typically for distributed
>> computing scenerio over hadoop. For office purpose you will require a SAN
>> (Storage Area Network) - an architecture to attach remote computer storage
>> devices to servers in such a way that, to the operating system, the
>> devices
>> appear as locally attached. Or you can even go for AmazonS3, if the data
>> is
>> really authentic. For opensource solution related to SAN, you can go with
>> any of the linux server distributions (eg. RHEL, SuSE) or Solaris (ZFS +
>> zones) or perhaps best plug-n-play solution (non-open-source) would be a
>> Mac
>> Server + XSan.
>>
>> --nitesh
>>
>> On Wed, Jan 28, 2009 at 10:20 PM, Simon  wrote:
>>
>> > But we are looking for an open source solution.
>> >
>> > If I do decide to implement this for the office storage, what problems
>> will
>> > I run into?
>> >
>> > -Original Message-
>> > From: Dmitry Pushkarev [mailto:u...@stanford.edu]
>> > Sent: Thursday, 29 January 2009 5:15 PM
>> > To: core-user@hadoop.apache.org
>> >  Cc: sim...@bigair.net.au
>> > Subject: RE: Is Hadoop Suitable for me?
>> >
>> > Definitely not,
>> >
>> > You should be looking at expandable Ethernet storage that can be
>> extended
>> > by
>> > connecting additional SAS arrays. (like dell powervault and similar
>> things
>> > from other companies)
>> >
>> > 600Mb is just 6 seconds over gigabit network...
>> >
>> > ---
>> > Dmitry Pushkarev
>> >
>> >
>> > -Original Message-
>> > From: Simon [mailto:sim...@bigair.net.au]
>> > Sent: Wednesday, January 28, 2009 10:02 PM
>> > To: core-user@hadoop.apache.org
>> > Subject: Is Hadoop Suitable for me?
>> >
>> > Hi Hadoop Users,
>> >
>> >
>> > I am trying to build a storage system for the office of about 20-30
>> users
>> > which will store everything.
>> >
>> > From normal everyday documents to computer configuration files to big
>> files
>> > (600mb) which are generated every hour.
>> >
>> >
>> >
>> > Is Hadoop suitable for this kind of environment?
>> >
>> >
>> >
>> > Regards,
>> >
>> > Simon
>> >
>> >
>> >
>> > No virus found in this incoming message.
>> > Checked by AVG - http://www.avg.com
>> > Version: 8.0.176 / Virus Database: 270.10.15/1921 - Release Date:
>> 1/28/2009
>> > 6:37 AM
>> >
>> >
>>
>>
>> --
>> Nitesh Bhatia
>> Dhirubhai Ambani Institute of Information & Communication Technology
>> Gandhinagar
>> Gujarat
>>
>> "Life is never perfect. It just depends where you draw the line."
>>
>> visit:
>> http://www.awaaaz.com - connecting through music
>> http://www.volstreet.com - lets volunteer for better tomorrow
>> http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
>>
>
>
>
> --
> M. Raşit ÖZDAŞ
>



-- 
M. Raşit ÖZDAŞ


Re: Is Hadoop Suitable for me?

2009-01-29 Thread Rasit OZDAS
Thanks for the responses, the problem is solved :)
I'll be forwarding the thread to my colleagues.

2009/1/29 nitesh bhatia 

> HDFS is a file system for distributed storage typically for distributed
> computing scenerio over hadoop. For office purpose you will require a SAN
> (Storage Area Network) - an architecture to attach remote computer storage
> devices to servers in such a way that, to the operating system, the devices
> appear as locally attached. Or you can even go for AmazonS3, if the data is
> really authentic. For opensource solution related to SAN, you can go with
> any of the linux server distributions (eg. RHEL, SuSE) or Solaris (ZFS +
> zones) or perhaps best plug-n-play solution (non-open-source) would be a
> Mac
> Server + XSan.
>
> --nitesh
>
> On Wed, Jan 28, 2009 at 10:20 PM, Simon  wrote:
>
> > But we are looking for an open source solution.
> >
> > If I do decide to implement this for the office storage, what problems
> will
> > I run into?
> >
> > -Original Message-
> > From: Dmitry Pushkarev [mailto:u...@stanford.edu]
> > Sent: Thursday, 29 January 2009 5:15 PM
> > To: core-user@hadoop.apache.org
> >  Cc: sim...@bigair.net.au
> > Subject: RE: Is Hadoop Suitable for me?
> >
> > Definitely not,
> >
> > You should be looking at expandable Ethernet storage that can be extended
> > by
> > connecting additional SAS arrays. (like dell powervault and similar
> things
> > from other companies)
> >
> > 600Mb is just 6 seconds over gigabit network...
> >
> > ---
> > Dmitry Pushkarev
> >
> >
> > -Original Message-
> > From: Simon [mailto:sim...@bigair.net.au]
> > Sent: Wednesday, January 28, 2009 10:02 PM
> > To: core-user@hadoop.apache.org
> > Subject: Is Hadoop Suitable for me?
> >
> > Hi Hadoop Users,
> >
> >
> > I am trying to build a storage system for the office of about 20-30 users
> > which will store everything.
> >
> > From normal everyday documents to computer configuration files to big
> files
> > (600mb) which are generated every hour.
> >
> >
> >
> > Is Hadoop suitable for this kind of environment?
> >
> >
> >
> > Regards,
> >
> > Simon
> >
> >
> >
> > No virus found in this incoming message.
> > Checked by AVG - http://www.avg.com
> > Version: 8.0.176 / Virus Database: 270.10.15/1921 - Release Date:
> 1/28/2009
> > 6:37 AM
> >
> >
>
>
> --
> Nitesh Bhatia
> Dhirubhai Ambani Institute of Information & Communication Technology
> Gandhinagar
> Gujarat
>
> "Life is never perfect. It just depends where you draw the line."
>
> visit:
> http://www.awaaaz.com - connecting through music
> http://www.volstreet.com - lets volunteer for better tomorrow
> http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
>



-- 
M. Raşit ÖZDAŞ


Re: Using HDFS for common purpose

2009-01-28 Thread Rasit OZDAS
Thanks for the responses,

Sorry, I made a mistake; it's actually not a database that I wanted. We need
simple storage for files. Only get and put commands are enough (no queries
needed). We don't even need append, chmod, etc.
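
Something along these lines would be enough for us -- just a minimal sketch,
assuming hadoop-site.xml is on the classpath; the paths are made up, not a
finished implementation:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SimpleStore {
      public static void main(String[] args) throws Exception {
        // Picks up fs.default.name etc. from the hadoop-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // "put": copy a local file into HDFS (both paths are placeholders).
        fs.copyFromLocalFile(new Path("/tmp/report.dat"), new Path("/store/report.dat"));

        // "get": copy it back out to the local file system.
        fs.copyToLocalFile(new Path("/store/report.dat"), new Path("/tmp/report.copy.dat"));

        fs.close();
      }
    }

That is really all we would call -- no append, no chmod.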

Probably from a thread on this list, I came across a link to a KFS-HDFS
comparison:
http://deliberateambiguity.typepad.com/blog/2007/10/advantages-of-k.html

It's good that KFS is written in C++, but handling errors in C++ is usually
more difficult.
I'd like your opinion on which one would fit best.

Thanks,
Rasit

2009/1/27 Jim Twensky 

> You may also want to have a look at this to reach a decision based on your
> needs:
>
> http://www.swaroopch.com/notes/Distributed_Storage_Systems
>
> Jim
>
> On Tue, Jan 27, 2009 at 1:22 PM, Jim Twensky 
> wrote:
>
> > Rasit,
> >
> > What kind of data will you be storing on HBase or directly on HDFS? Do you
> > aim to use it as a data source to do some key/value lookups for small
> > strings/numbers, or do you want to store larger files labeled with some sort
> > of a key and retrieve them during a map reduce run?
> >
> > Jim
> >
> >
> > On Tue, Jan 27, 2009 at 11:51 AM, Jonathan Gray 
> wrote:
> >
> >> Perhaps what you are looking for is HBase?
> >>
> >> http://hbase.org
> >>
> >> HBase is a column-oriented, distributed store that sits on top of HDFS and
> >> provides random access.
> >>
> >> JG
> >>
> >> > -Original Message-
> >> > From: Rasit OZDAS [mailto:rasitoz...@gmail.com]
> >> > Sent: Tuesday, January 27, 2009 1:20 AM
> >> > To: core-user@hadoop.apache.org
> >> > Cc: arif.yil...@uzay.tubitak.gov.tr; emre.gur...@uzay.tubitak.gov.tr;
> >> > hilal.tara...@uzay.tubitak.gov.tr; serdar.ars...@uzay.tubitak.gov.tr;
> >> > hakan.kocaku...@uzay.tubitak.gov.tr; caglar.bi...@uzay.tubitak.gov.tr
> >> > Subject: Using HDFS for common purpose
> >> >
> >> > Hi,
> >> > I wanted to ask whether HDFS is a good solution just as a distributed
> >> > database (no jobs running, only get and put commands).
> >> > A review says that "HDFS is not designed for low latency", and besides,
> >> > it's implemented in Java.
> >> > Do these disadvantages prevent us from using it?
> >> > Or could somebody suggest a better (faster) one?
> >> >
> >> > Thanks in advance..
> >> > Rasit
> >>
> >>
> >
>



-- 
M. Raşit ÖZDAŞ


Re: Netbeans/Eclipse plugin

2009-01-28 Thread Rasit OZDAS
Both the DFS viewer and job submission work on Eclipse 3.3.2.
I've given up using Ganymede, unfortunately..

2009/1/26 Aaron Kimball 

> The Eclipse plugin (which, btw, is now part of Hadoop core in src/contrib/)
> currently is inoperable. The DFS viewer works, but the job submission code
> is broken.
>
> - Aaron
>
> On Sun, Jan 25, 2009 at 9:07 PM, Amit k. Saha 
> wrote:
>
> > On Sun, Jan 25, 2009 at 9:32 PM, Edward Capriolo 
> > wrote:
> > > On Sun, Jan 25, 2009 at 10:57 AM, vinayak katkar <
> vinaykat...@gmail.com>
> > wrote:
> > >> Anyone know of a NetBeans or Eclipse plugin for Hadoop Map-Reduce jobs? I
> > >> want to make a plugin for NetBeans.
> > >>
> > >> http://vinayakkatkar.wordpress.com
> > >> --
> > >> Vinayak Katkar
> > >> Sun Campus Ambassador
> > >> Sun Microsytems,India
> > >> COEP
> > >>
> > >
> > > There is an Eclipse plugin:
> > > http://www.alphaworks.ibm.com/tech/mapreducetools
> > >
> > > Seems like some work is being done on netbeans
> > > https://nbhadoop.dev.java.net/
> >
> > I started this project. But well, it's caught up in the requirements
> > gathering phase.
> >
> > @ Vinayak,
> >
> > Lets take this offline and discuss. What do you think?
> >
> >
> > Thanks,
> > Amit
> >
> > >
> > > The world needs more netbeans love.
> > >
> >
> > Definitely :-)
> >
> >
> > --
> > Amit Kumar Saha
> > http://amitksaha.blogspot.com
> > http://amitsaha.in.googlepages.com/
> > *Bangalore Open Java Users Group*: http://www.bojug.in
> >
>



-- 
M. Raşit ÖZDAŞ


Re: Number of records in a MapFile

2009-01-28 Thread Rasit OZDAS
Do you mean without scanning the files line by line?
I know little about the implementation of Hadoop, but as a programmer I
presume it's not possible without a complete scan.
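
If a complete scan is acceptable, something like this should do it -- just a
rough sketch against the 0.18/0.19 API, and the MapFile path is made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class MapFileCount {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // "/data/myMapFile" is a placeholder for the MapFile directory.
        MapFile.Reader reader = new MapFile.Reader(fs, "/data/myMapFile", conf);
        WritableComparable key =
            (WritableComparable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value =
            (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);

        long count = 0;
        while (reader.next(key, value)) {   // reads every record once
          count++;
        }
        reader.close();

        System.out.println("records: " + count);
      }
    }

The work-around below avoids paying that scan cost at job time.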

But I can suggest a work-around:
- Compute the number of records manually before putting the file into HDFS.
- Append the computed number to the filename.
- Modify your record reader so that it appends that number to the key of every
map input.

Hope this helps,
Rasit

2009/1/27 Andy Liu 

> Is there a way to programmatically get the number of records in a MapFile
> without doing a complete scan?
>



-- 
M. Raşit ÖZDAŞ


Re: Where are the meta data on HDFS ?

2009-01-27 Thread Rasit OZDAS
Hi Tien,

Configuration config = new Configuration(true);
config.addResource(new Path("/etc/hadoop-0.19.0/conf/hadoop-site.xml"));

FileSystem fileSys = FileSystem.get(config);
BlockLocation[] locations = fileSys.getFileBlockLocations(.

I copied some lines from my code; this approach may also help if you prefer
using the API. It has other useful methods as well:
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/FileSystem.html
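
A slightly fuller version of the same idea, in case it helps -- only a sketch:
the file path is made up, and it assumes the
getFileBlockLocations(FileStatus, long, long) signature of 0.18/0.19:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
      public static void main(String[] args) throws Exception {
        Configuration config = new Configuration(true);
        config.addResource(new Path("/etc/hadoop-0.19.0/conf/hadoop-site.xml"));

        FileSystem fileSys = FileSystem.get(config);

        Path file = new Path("/user/someone/somefile");   // placeholder path
        FileStatus status = fileSys.getFileStatus(file);
        BlockLocation[] locations =
            fileSys.getFileBlockLocations(status, 0, status.getLen());

        // Each BlockLocation says where one block of the file physically lives.
        for (BlockLocation loc : locations) {
          System.out.println("offset " + loc.getOffset()
              + ", length " + loc.getLength()
              + ", hosts " + Arrays.toString(loc.getHosts()));
        }
      }
    }

If I remember correctly, the fsck tool shows similar information from the
command line with the -files -blocks -locations options.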


2009/1/24 tienduc_dinh 

>
> that's what I needed !
>
> Thank you so much.
> --
> View this message in context:
> http://www.nabble.com/Where-are-the-meta-data-on-HDFS---tp21634677p21644206.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
M. Raşit ÖZDAŞ


Using HDFS for common purpose

2009-01-27 Thread Rasit OZDAS
Hi,
I wanted to ask whether HDFS is a good solution just as a distributed database
(no jobs running, only get and put commands).
A review says that "HDFS is not designed for low latency", and besides, it's
implemented in Java.
Do these disadvantages prevent us from using it?
Or could somebody suggest a better (faster) one?

Thanks in advance..
Rasit


Re: Problem running hdfs_test

2009-01-23 Thread Rasit OZDAS
Hi Arifa,

I had to add the LD_LIBRARY_PATH environment variable to run my example
correctly. I have no idea whether it helps, because my error wasn't a
segmentation fault, but I would try it anyway.

LD_LIBRARY_PATH:/usr/JRE/jre1.6.0_11/jre1.6.0_11/lib:/usr/JRE/jre1.6.0_11/jre1.6.0_11/lib/amd64/server

(the server directory of the JRE, which contains the libjvm.so file, plus the
lib directory of the same JRE.)

Hope this helps,
Rasit

2009/1/21 Arifa Nisar 

> Hello,
>
> As I mentioned in my previous email, I am getting a segmentation fault at
> 0x0001 while running hdfs_test. I was advised to build and run
> hdfs_test using ant, as ant should set some environment variables which the
> Makefile won't. I tried building libhdfs and running hdfs_test using ant,
> but I am still having the same problem. Now, instead of hdfs_test, I am
> testing a simple program with libhdfs. I linked the following hello world
> program with libhdfs.
>
> #include "hdfs.h"
> int main() {
>  printf("Hello World.\n");
>  return(0);
> }
>
> I added a line to compile this test program in
> ${HADOOP_HOME}/src/c++/libhdfs/Makefile and replaced hdfs_test with this
> test program in ${HADOOP_HOME}/src/c++/libhdfs/tests/test-libhdfs.sh. I built
> and invoked this test using the test-libhdfs target in build.xml, but I am
> still getting a segmentation fault when this simple test program is invoked
> from test-libhdfs.sh. These are the steps I followed:
>
> cd ${HADOOP_HOME}
> ant clean
> cd ${HADOOP_HOME}/src/c++/libhdfs/
> rm -f hdfs_test hdfs_write hdfs_read libhdfs.so* *.o test
> Cd ${HADOOP_HOME}
> ant test-libhdfs -Dlibhdfs=1
>
> Error Line
> --
> [exec] ./tests/test-libhdfs.sh: line 85: 23019 Segmentation fault
> $LIBHDFS_BUILD_DIR/$HDFS_TEST
>
> I have attached the output of this command with this email. I have added
> "env" in test-libhdfs.sh to see what environmental variable are set. Please
> suggest if any variable is wrongly set. Any kind of suggestion will be
> helpful for me as I have already spent a lot of time on this problem.
>
> I have added following lines in Makefile and test-libhdfs.sh
>
> Makefile
> -
> export JAVA_HOME=/usr/lib/jvm/java-1.7.0-icedtea-1.7.0.0.x86_64
> export OS_ARCH=amd64
> export OS_NAME=Linux
> export LIBHDFS_BUILD_DIR=$(HADOOP_HOME)/src/c++/libhdfs
> export SHLIB_VERSION=1
>
> test-libhdfs.sh
> --
>
> HADOOP_CONF_DIR=${HADOOP_HOME}/conf
> HADOOP_LOG_DIR=${HADOOP_HOME}/logs
> LIBHDFS_BUILD_DIR=${HADOOP_HOME}/src/c++/libhdfs
> HDFS_TEST=test
>
> When I don't link libhdfs with test.c, it doesn't give an error and prints
> "Hello World" when "ant test-libhdfs -Dlibhdfs=1" is run. I made sure that
> "ant" and "hadoop" use the same Java installation, and I have tried this on a
> 32-bit machine, but I am still getting a segmentation fault. Now I am clueless
> about what I can do to correct this. Please help.
>
> Thanks,
> Arifa.
>
> PS: Also, could you suggest whether there is any Java version of hdfs_test?
>
> -Original Message-
> From: Delip Rao [mailto:delip...@gmail.com]
> Sent: Saturday, January 17, 2009 3:49 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Problem running unning hdfs_test
>
> Try enabling the debug flags while compiling to get more information.
>
> On Sat, Jan 17, 2009 at 4:19 AM, Arifa Nisar 
> wrote:
> > Hello all,
> >
> >
> >
> > I am trying to test hdfs_test.c provided with hadoop installation.
> > libhdfs.so and hdfs_test are built fine after making a few  changes in
> > $(HADOOP_HOME)/src/c++/libhdfs/Makefile. But when I try to run
> ./hdfs_test,
> > I get segmentation fault at 0x0001
> >
> >
> >
> > Program received signal SIGSEGV, Segmentation fault.
> >
> > 0x0001 in ?? ()
> >
> > (gdb) bt
> >
> > #0  0x0001 in ?? ()
> >
> > #1  0x7fffd0c51af5 in ?? ()
> >
> > #2  0x in ?? ()
> >
> >
> >
> > A simple hello world program linked with libhdfs.so also gives the same
> > error. In CLASSPATH all the jar files in $(HADOOP_HOME),
> > $(HADOOP_HOME)/conf, $(HADOOP_HOME)/lib,$(JAVA_HOME)/lib are included.
> > Please help.
> >
> >
> >
> > Thanks,
> >
> > Arifa.
> >
> >
> >
> >
>



-- 
M. Raşit ÖZDAŞ


Re: Null Pointer with Pattern file

2009-01-21 Thread Rasit OZDAS
Hi,
Try to use:

conf.setJarByClass(WordCount.class);  // conf is the JobConf instance in your example
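
For reference, the relevant part of my driver looks roughly like this -- only a
sketch, not your exact code: WordCount, Map, Reduce and the paths are
placeholders, and I am assuming the patterns file is shipped through
DistributedCache as in the tutorial:

    // Imports needed at the top of WordCount.java:
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Inside your run() method:
    JobConf conf = new JobConf(WordCount.class);   // also derives the job jar from this class
    conf.setJobName("wordcount");
    conf.setJarByClass(WordCount.class);           // explicit, avoids "No job jar file set"
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(WordCount.Map.class);
    conf.setReducerClass(WordCount.Reduce.class);

    // Ship the patterns file so Map.configure() can find it in the local cache.
    DistributedCache.addCacheFile(new Path("/user/you/patterns.txt").toUri(), conf);

    FileInputFormat.setInputPaths(conf, new Path("/user/you/input"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/you/output"));
    JobClient.runJob(conf);

Also, since your log shows job_local_0001 (the LocalJobRunner), check that the
-skip argument really reaches DistributedCache.addCacheFile; a missing cache
file is one possible reason for configure() to hit a NullPointerException.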

Hope this helps,
Rasit

2009/1/20 Shyam Sarkar 

> Hi,
>
> I was trying to run the Hadoop WordCount version 2 example under Cygwin. I
> tried it without the pattern.txt file -- it works fine.
> When I try it with the pattern.txt file to skip some patterns, I get a
> NullPointerException, as follows:
>
> 09/01/20 12:56:16 INFO jvm.JvmMetrics: Initializing JVM Metrics with
> processName=JobTracker, sessionId=
> 09/01/20 12:56:17 WARN mapred.JobClient: No job jar file set.  User classes
> may not be found. See JobConf(Class) or JobConf#setJar(String).
> 09/01/20 12:56:17 INFO mapred.FileInputFormat: Total input paths to process
> : 4
> 09/01/20 12:56:17 INFO mapred.JobClient: Running job: job_local_0001
> 09/01/20 12:56:17 INFO mapred.FileInputFormat: Total input paths to process
> : 4
> 09/01/20 12:56:17 INFO mapred.MapTask: numReduceTasks: 1
> 09/01/20 12:56:17 INFO mapred.MapTask: io.sort.mb = 100
> 09/01/20 12:56:17 INFO mapred.MapTask: data buffer = 79691776/99614720
> 09/01/20 12:56:17 INFO mapred.MapTask: record buffer = 262144/327680
> 09/01/20 12:56:17 WARN mapred.LocalJobRunner: job_local_0001
> java.lang.NullPointerException
>  at org.myorg.WordCount$Map.configure(WordCount.java:39)
>  at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>  at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
>  at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
>  at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>  at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328)
>  at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
> java.io.IOException: Job failed!
>  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
>  at org.myorg.WordCount.run(WordCount.java:114)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>  at org.myorg.WordCount.main(WordCount.java:119)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>  at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>  at java.lang.reflect.Method.invoke(Method.java:597)
>  at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
>  at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>  at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
>
>
> Please tell me what I should do.
>
> Thanks,
> shyam.s.sar...@gmail.com
>



-- 
M. Raşit ÖZDAŞ

