Re: Streaming data - Available tools

2014-07-04 Thread Marcos Ortiz

Storm is another project sponsored by ASF. Look here:
http://storm.apache.org
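
Since the original question below mentions Apache Spark, here is a minimal Spark Streaming word count (Spark 1.x Java API) as one hedged sketch of what processing an event stream looks like; the socket source on localhost:9999, the local[2] master and the 1-second batch interval are illustrative assumptions only, not part of this thread.

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingWordCount {
  public static void main(String[] args) throws Exception {
    // local[2]: one thread for the receiver, one for processing
    SparkConf conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]");
    JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1000));

    // Text lines arriving on a TCP socket (e.g. fed by: nc -lk 9999)
    JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);

    JavaDStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      public Iterable<String> call(String line) { return Arrays.asList(line.split(" ")); }
    });

    JavaPairDStream<String, Integer> counts = words
      .mapToPair(new PairFunction<String, String, Integer>() {
        public Tuple2<String, Integer> call(String w) { return new Tuple2<String, Integer>(w, 1); }
      })
      .reduceByKey(new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer a, Integer b) { return a + b; }
      });

    counts.print();      // dump the per-batch counts to stdout
    ssc.start();
    ssc.awaitTermination();
  }
}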

On 04/07/14 12:28, Adaryl Bob Wakefield, MBA wrote:
Storm. It’s not a part of the Apache project but it seems to be what 
people are using to process event data.

B.
*From:* santosh.viswanat...@accenture.com
*Sent:* Friday, July 04, 2014 11:25 AM
*To:* user@hadoop.apache.org
*Subject:* Streaming data - Available tools

Hello Experts,

I wanted to explore the available tools in the market for streaming data.
I know Apache Spark exists. Are there any other tools available?


Regards,
Santosh Karthikeyan






--
Marcos Ortiz http://www.linkedin.com/in/mlortiz (@marcosluis2186 
http://twitter.com/marcosluis2186)

http://about.me/marcosortiz




Re: Storing videos in Hdfs

2014-06-17 Thread Marcos Ortiz
What do you want to achieve with this?
I've seen Hadoop used for video analytics, storing video metadata,
unique-view counts and that kind of thing, but I've never seen
this use case.
A good example is Ooyala, which used Hadoop + Apache Cassandra
for this, although they migrated to a Spark/Shark + Cassandra solution.
They wrote a whitepaper called "Designing a Scalable Database for Online Video
Analytics", and Evan Chan (@evanfchan) gave a great talk at Cassandra Summit
2013 about using Spark/Shark + Cassandra for real-time video analytics.

-- 
Marcos Ortiz[1] (@marcosluis2186[2])
http://about.me/marcosortiz[3] 
On Tuesday, June 17, 2014 06:12:49 PM alajangikish...@gmail.com wrote:
 Hi hadoopers,
 
 What is the best way to store video files in Hdfs?
 
 Sent from my iPhone


[1] http://www.linkedin.com/in/mlortiz
[2] http://twitter.com/marcosluis2186
[3] http://about.me/marcosortiz


Re: MapReduce scalability study

2014-05-22 Thread Marcos Ortiz

On Thursday, May 22, 2014 10:17:42 PM Sylvain Gault wrote:
 Hello,
 
 I'm new to this mailing list, so forgive me if I don't do everything
 right.
 
 I didn't know whether I should ask on this mailing list or on
 mapreduce-dev or on yarn-dev. So I'll just start there. ^^
 
 Short story: I'm looking for some paper(s) studying the scalability
 of Hadoop MapReduce. And I found this extremely difficult to find on
 google scholar. Do you have something worth citing in a PhD thesis?
 
 Long story: I'm writing my PhD thesis about MapReduce and when I talk
 about Hadoop I'd like to say how much it scales. I heard two years
 ago some people say that Yahoo! got it to scale up to 4000 nodes and planned
 to try on 6000 nodes or something like that. I also heard that
 YARN/MRv2 should scale better, but I don't plan to talk much about
 YARN/MRv2. So I'd take anything I could cite as a reference in my
 manuscript. :)
Hello, Sylvain.
One of the reasons the Hadoop dev team began working on YARN was precisely
to get a more scalable and resource-efficient Hadoop system, so if you actually
want to talk about Hadoop scalability, you should talk about YARN and MR2.

The paper is here:
https://developer.yahoo.com/blogs/hadoop/next-generation-apache-hadoop-mapreduce-3061.html

and the related JIRA issues here:
https://issues.apache.org/jira/browse/MAPREDUCE-278
https://issues.apache.org/jira/browse/MAPREDUCE-279

You should talk with Arun C Murthy, Chief Architect at Hortonworks, about all
these topics. He could help you much more than I could.

-- 
Marcos Ortiz[1] (@marcosluis2186[2])
http://about.me/marcosortiz[3] 
 
 
 Best regards,
 Sylvain Gault


[1] http://www.linkedin.com/in/mlortiz
[2] http://twitter.com/marcosluis2186
[3] http://about.me/marcosortiz


Re: Job Tracker Stops as Task Tracker starts

2014-05-20 Thread Marcos Ortiz
What version of JDK are you using in your servers?
What version of Hadoop are you using?

-- 
Marcos Ortiz[1] (@marcosluis2186[2])
http://about.me/marcosortiz[3] 
On Tuesday, May 20, 2014 09:01:07 PM Faisal Rabbani wrote:
 Hi,
 I just installed jobtracker and task trackers but as soon as I start any of
 my tasktrackers Job trackers homepage gives following error:
 
 
 
 java.lang.NoSuchMethodError: sun.misc.FloatingDecimal.digitsRoundedUp()Z
 at java.text.DigitList.set(DigitList.java:292)
 at java.text.DecimalFormat.format(DecimalFormat.java:599)
 at java.text.DecimalFormat.format(DecimalFormat.java:522)
 at java.text.NumberFormat.format(NumberFormat.java:271)
 at
 org.apache.hadoop.mapred.jobtracker_jsp.generateSummaryTable(jobtracker_jsp.
 java:26) at
 org.apache.hadoop.mapred.jobtracker_jsp._jspService(jobtracker_jsp.java:146)
 at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:98) at
 javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
 at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511)
 at
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler
 .java:1221) at
 org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(Sta
 ticUserWebFilter.java:109) at
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler
 .java:1212) at
 org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.jav
 a:1069) at
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler
 .java:1212) at
 org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler
 .java:1212) at
 org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45) at
 org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler
 .java:1212) at
 org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at
 org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
 at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
 at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
 at
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerColl
 ection.java:230) at
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at
 org.mortbay.jetty.Server.handle(Server.java:326)
 at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
 at
 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnectio
 n.java:928) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
 at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
 at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
 at
 org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
 at
 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582
 )
 
 
 whereas the active tracker page (jobtracker:50030/machines.jsp?type=active) shows all
 tasktrackers in the running state: tracker_hslav01, tracker_hslave02, tracker_hslave03
 and tracker_hslave04 each report 0 running tasks, 2 max map tasks, 2 max reduce tasks,
 and no task or directory failures.
 
 
 Any suggestions please.
 --
 Thanks,
 Faisal Ali Rabbani


[1] http://www.linkedin.com/in/mlortiz
[2] http://twitter.com/marcosluis2186
[3] http://about.me/marcosortiz


Re: Random Exception

2014-05-02 Thread Marcos Ortiz
It seems that your Hadoop data directory is broken or your disk has problems.
Which version of Hadoop are you using?

On Friday, May 02, 2014 08:43:44 AM S.L wrote:
 Hi All,
 
 I get this exception after I resubmit my failed MapReduce job. Can someone
 please let me know what this exception means?
 
 14/05/02 01:28:25 INFO mapreduce.Job: Task Id :
 attempt_1398989569957_0021_m_00_0, Status : FAILED
 Error: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not
 find any valid local directory for
 attempt_1398989569957_0021_m_00_0/intermediate.26
 at
 org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWr
 ite(LocalDirAllocator.java:402) at
 org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocato
 r.java:150) at
 org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocato
 r.java:131) at
 org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:711) at
 org.apache.hadoop.mapred.Merger$MergeQueue.merge(Merger.java:579) at
 org.apache.hadoop.mapred.Merger.merge(Merger.java:150)
 at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:187
 0) at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1482)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:437)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Unknown Source)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.ja
 va:1548)



Re: Which database should be used

2014-05-02 Thread Marcos Ortiz

On Friday, May 02, 2014 04:21:58 PM Alex Lee wrote:
 There are many databases, such as HBase, Hive and MongoDB. I need to choose
 one to store a big volume of streaming data from sensors.

 Will HBase be good? Thanks.
HBase could be a good ally for this use case. You should check the OpenTSDB project
as a similar case to your problem:
http://opentsdb.net/

You should also check the HBaseCon presentations and videos to see what you could use
for your case.
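
As a hedged illustration of the OpenTSDB-style approach (sensor id first in the row key, then the timestamp, so one sensor's readings stay contiguous on disk), here is a small sketch against the 0.94-era HBase client API. The table name sensor_readings and the column family d are assumptions for the example; OpenTSDB itself uses a more compact key encoding.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SensorWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Assumes the table was created beforehand, e.g.: create 'sensor_readings', 'd'
    HTable table = new HTable(conf, "sensor_readings");

    String sensorId = "sensor-42";
    long ts = System.currentTimeMillis();

    // Row key = sensor id + timestamp, so a scan over one sensor returns a time range
    byte[] rowKey = Bytes.add(Bytes.toBytes(sensorId), Bytes.toBytes(ts));

    Put put = new Put(rowKey);
    put.add(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes(23.5d));
    table.put(put);
    table.close();
  }
}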
 




Re: upgrade to CDH5 from CDH4.6 hadoop 2.0

2014-04-28 Thread Marcos Ortiz
Regards, Motty.
This kind of question should be asked on the CDH Users mailing list; there you
will get a better and faster answer.
Best wishes

On Monday, April 28, 2014 01:00:13 PM motty cruz wrote:
 Hello, I'm upgrading to CDH5. I downloaded the latest parcel from
 http://archive.cloudera.com/cdh5/parcels/latest/

 to /opt/cloudera/parcel-repo; then, on the cluster in Cloudera under Parcels,
 I hit the Distribute button. It started to distribute and got to 50%, but it does
 not go any further. Any ideas how to proceed?
 
 Thanks,




Re: Intel Hadoop Distribution.

2013-03-01 Thread Marcos Ortiz

Regards, Chengi.
Intel is working on a battle-tested Hadoop distribution, with a marked
focus on security enhancements. You can see it here:

https://github.com/intel-hadoop/project-rhino/

Best wishes


On 03/01/2013 04:47 PM, Chengi Liu wrote:

Hi,
  I am curious. At this Strata, Intel made an announcement of their
own Hadoop distribution, optimized for their chips and with some of
their own implementations.
I was a bit surprised to see Intel's involvement in the Hadoop
world, but now it somehow makes sense (it's a big market after all).

http://www.javaworld.com/javaworld/jw-02-2013/130227-intel-releases-hadoop-distribution.html

I was wondering how their distribution is different from the other
players', and why anyone would buy Intel's distribution at all?

(This is probably not suited for this mailing list; if so, please let me know.)
Thanks



--
Marcos Ortiz Valmaseda,
Product Manager  Data Scientist at UCI
Blog: http://marcosluis2186.posterous.com
Twitter: @marcosluis2186 http://twitter.com/marcosluis2186


Re: How to handle sensitive data

2013-02-15 Thread Marcos Ortiz Valmaseda
Regards, Abhishek.
I agree with Michael: you can encrypt your incoming data in your
application.
I recommend using HBase too.
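
As a minimal sketch of the field-level encryption Michael suggests below, this snippet encrypts a single sensitive column with AES before the record is written to HDFS. The hard-coded key/IV, the /data/customers path and the tab-separated record layout are illustrative assumptions only; a real application would load the key from a key-management system.

import java.io.OutputStream;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import javax.xml.bind.DatatypeConverter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class EncryptFieldBeforeWrite {
  public static void main(String[] args) throws Exception {
    // Demo key and IV only (128-bit AES); never hard-code these in production
    byte[] key = "0123456789abcdef".getBytes("UTF-8");
    byte[] iv  = "fedcba9876543210".getBytes("UTF-8");

    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
    cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));

    String ssn = "123-45-6789"; // the sensitive column
    String encrypted = DatatypeConverter.printBase64Binary(cipher.doFinal(ssn.getBytes("UTF-8")));

    // Write the record to HDFS with only the sensitive field encrypted
    FileSystem fs = FileSystem.get(new Configuration());
    OutputStream out = fs.create(new Path("/data/customers/part-00000"));
    out.write(("cust-1001\t" + encrypted + "\n").getBytes("UTF-8"));
    out.close();
    fs.close();
  }
}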

- Original Message -

From: Michael Segel michael_se...@hotmail.com
To: common-user@hadoop.apache.org
CC: cdh-u...@cloudera.org
Sent: Friday, February 15, 2013 8:47:16
Subject: Re: How to handle sensitive data

Simple, have your app encrypt the field prior to writing to HDFS.

Also consider HBase.

On Feb 14, 2013, at 10:35 AM, abhishek abhishek.dod...@gmail.com wrote:


 Hi all,

 We have some sensitive data in some particular fields (columns). Can I
 know how to handle sensitive data in Hadoop?

 How do different people handle sensitive data in Hadoop?

 Thanks
 Abhi


Michael Segel | (m) 312.755.9623

Segel and Associates





--

Marcos Ortiz Valmaseda,
Product Manager  Data Scientist at UCI
Blog : http://marcosluis2186.posterous.com
LinkedIn: http://www.linkedin.com/in/marcosluis2186
Twitter : @marcosluis2186


Re: .deflate trouble

2013-02-15 Thread Marcos Ortiz Valmaseda
Yes, I know, Keith. I know that you want more control over your Hadoop cluster,
so I recommend a few things:
- You can use Whirr to manage your Hadoop cluster installations on EC2 [1]
- You can create your own Hadoop-focused AMI based on your requirements (my
favorite choice here)
- Or simply install Hadoop on EC2 with Puppet or Chef, to have better control
over your configuration and management.
- Or, if you have a good paycheck, you can choose the MapR M3 or M5 distribution
in the Amazon Marketplace. [2][3][4]

[1] http://whirr.apache.org 
[2] https://aws.amazon.com/marketplace/pp/B008B7VT2C 
[3] 
https://aws.amazon.com/marketplace/pp/B008B7WAAW/ref=sp_mpg_product_title?ie=UTF8sr=0-2
 
[4] http://aws.amazon.com/es/elasticmapreduce/mapr/ 

- Original Message -

From: Keith Wiley kwi...@keithwiley.com
To: user@hadoop.apache.org
Sent: Friday, February 15, 2013 12:36:20
Subject: Re: .deflate trouble

I might contact them but we are specifically avoiding EMR for this project. We
have already successfully deployed EMR, but we want more precise control over
the cluster, namely the ability to persist and reawaken it on demand. We really
want a direct Hadoop installation instead of an EMR-based installation. But I
might contact them anyway to see what they recommend. Thanks for the refs.

On Feb 14, 2013, at 19:09 , Marcos Ortiz Valmaseda wrote: 

 Regards, Keith. For EMR issues and the like, you can contact Jeff
 Barr (Chief Evangelist for AWS) or Saurabh Baji (Product Manager for AWS
 EMR) directly.
 Best wishes.
 
 From: Keith Wiley kwi...@keithwiley.com
 To: user@hadoop.apache.org
 Sent: Thursday, February 14, 2013 15:46:05
 Subject: Re: .deflate trouble
 
 Good call. We can't use the conventional web-based JT due to corporate access 
 issues, but I looked at the job_XXX.xml file directly, and sure enough, it 
 set mapred.output.compress to true. Now I just need to remember how that 
 occurs. I simply ran the wordcount example straight off the command line, I 
 didn't specify any overridden conf settings for the job. 
 
 Ultimately, the solution (or part of it) is to get away from .19 to a more 
 up-to-date version of Hadoop. I would prefer 2.0 over 1.0 in fact, but due to 
 a remarkable lack of concise EC2/Hadoop documentation (and the fact that what 
 docs I did find were very old and therefore conformed to .19 style Hadoop), I 
 have fallen back on old versions of Hadoop for my initial tests. In the long 
 run, I will need to get a more modern version of Hadoop to successfully 
 deploy on EC2. 
 
 Thanks. 
 
 On Feb 14, 2013, at 15:02 , Harsh J wrote: 
 
  Did the job.xml of the job that produced this output also carry
  mapred.output.compress=false in it? The file should be viewable on the
  JT UI page for the job. Unless explicitly turned on, even 0.19
  wouldn't have enabled compression on its own.



 
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com 

What I primarily learned in grad school is how much I *don't* know. 
Consequently, I left grad school with a higher ignorance to knowledge ratio 
than 
when I entered. 
-- Keith Wiley 

 




-- 

Marcos Ortiz Valmaseda, 
Product Manager  Data Scientist at UCI 
Blog : http://marcosluis2186.posterous.com 
LinkedIn: http://www.linkedin.com/in/marcosluis2186 
Twitter : @marcosluis2186 


Re: Hadoop 2.0.3 namenode issue

2013-02-15 Thread Marcos Ortiz Valmaseda
Regards, Dheeren. It seems that you are using an incompatible version of HDFS 
with this 
version of HBase. Can you provide the exact version of your HBase package? 


- Original Message -

From: Dheeren Bebortha dbebor...@salesforce.com
To: user@hadoop.apache.org
Sent: Friday, February 15, 2013 13:28:19
Subject: Hadoop 2.0.3 namenode issue



Hi,
In one of our test clusters, which has NameNode HA using QJM + YARN + HBase 0.94,
the namenode came down with the following logs. I am trying to root-cause the issue.
Any help is appreciated.
= 
2013-02-13 10:18:27,521 INFO hdfs.StateChange - BLOCK* 
NameSystem.fsync: file 
/hbase/.logs/datanode-X.sfdomain.com,60020,1360091866476/datanode-X.sfdomain.com%2C60020%2C1360091866476.1360750706694
 
for DFSClient_hb_rs_datanode-X.sfdomain.com,60020,1360091866476_470800334_38 
2013-02-13 10:20:01,861 WARN ipc.Server - Incorrect header or version mismatch 
from 10.232.29.4:49933 got version 4 expected version 7 
2013-02-13 10:20:01,884 WARN ipc.Server - Incorrect header or version mismatch 
from 10.232.29.4:49935 got version 4 expected version 7 
2013-02-13 10:20:02,550 WARN ipc.Server - Incorrect header or version mismatch 
from 10.232.29.4:49938 got version 4 expected version 7 
2013-02-13 10:20:08,210 INFO namenode.FSNamesystem - Roll Edit Log from 
10.232.29.14 
= 
= 
== 
2013-02-13 12:14:32,879 INFO namenode.FileJournalManager - Finalizing edits 
file /data/hdfs/current/edits_inprogress_0065699 - 
/data/hdfs/current/edits_0065699-0065700 
2013-02-13 12:14:32,879 INFO namenode.FSEditLog - Starting log segment at 65701 
2013-02-13 12:15:02,507 INFO namenode.NameNode - FSCK started by sfdc 
(auth:SIMPLE) from /10.232.29.4 for path / at Wed Feb 13 12:15:02 
GMT+00:00 2013 
2013-02-13 12:15:02,663 WARN ipc.Server - Incorrect header or version mismatch 
from 10.232.29.4:40025 got version 4 expected version 7 
2013-02-13 12:15:02,663 WARN ipc.Server - Incorrect header or version mismatch 
from 10.232.29.4:40027 got version 4 expected version 7 
2013-02-13 12:15:03,391 WARN ipc.Server - Incorrect header or version mismatch 
from 10.232.29.4:40031 got version 4 expected version 7 
2013-02-13 12:16:33,181 INFO namenode.FSNamesystem - Roll Edit Log from 
10.232.29.14 
== 
== 



-- 

Marcos Ortiz Valmaseda, 
Product Manager  Data Scientist at UCI 
Blog : http://marcosluis2186.posterous.com 
LinkedIn: http://www.linkedin.com/in/marcosluis2186 
Twitter : @marcosluis2186 


Re: .deflate trouble

2013-02-14 Thread Marcos Ortiz Valmaseda
Regards, Keith. For EMR issues and the like, you can contact Jeff
Barr (Chief Evangelist for AWS) or Saurabh Baji (Product Manager for AWS
EMR) directly.
Best wishes.

- Original Message -

From: Keith Wiley kwi...@keithwiley.com
To: user@hadoop.apache.org
Sent: Thursday, February 14, 2013 15:46:05
Subject: Re: .deflate trouble

Good call. We can't use the conventional web-based JT due to corporate access 
issues, but I looked at the job_XXX.xml file directly, and sure enough, it set 
mapred.output.compress to true. Now I just need to remember how that occurs. I 
simply ran the wordcount example straight off the command line, I didn't 
specify any overridden conf settings for the job. 

Ultimately, the solution (or part of it) is to get away from .19 to a more 
up-to-date version of Hadoop. I would prefer 2.0 over 1.0 in fact, but due to a 
remarkable lack of concise EC2/Hadoop documentation (and the fact that what 
docs I did find were very old and therefore conformed to .19 style Hadoop), I 
have fallen back on old versions of Hadoop for my initial tests. In the long 
run, I will need to get a more modern version of Hadoop to successfully deploy 
on EC2. 

Thanks. 

On Feb 14, 2013, at 15:02 , Harsh J wrote: 

 Did the job.xml of the job that produced this output also carry
 mapred.output.compress=false in it? The file should be viewable on the
 JT UI page for the job. Unless explicitly turned on, even 0.19
 wouldn't have enabled compression on its own.
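
For reference, a hedged sketch of how one might force an old-API job's output to stay uncompressed, regardless of what the cluster-side configuration (or an inherited 0.19-era defaults file) says. The identity job below is only an illustration; if the stock examples driver uses ToolRunner, passing -D mapred.output.compress=false on the command line should have the same effect.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class NoCompressIdentityJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(NoCompressIdentityJob.class);
    conf.setJobName("identity-no-compress");

    // Explicitly turn job output compression off; this is the
    // programmatic equivalent of mapred.output.compress=false
    FileOutputFormat.setCompressOutput(conf, false);

    // No mapper/reducer set: the old API falls back to the identity
    // mapper and reducer, so the job simply copies its text input.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}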



 
Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com 

The easy confidence with which I know another man's religion is folly teaches 
me to suspect that my own is also. 
-- Mark Twain 

 




-- 

Marcos Ortiz Valmaseda, 
Product Manager  Data Scientist at UCI 
Blog : http://marcosluis2186.posterous.com 
LinkedIn: http://www.linkedin.com/in/marcosluis2186 
Twitter : @marcosluis2186 


Re: Hadoop Tutorial help

2012-12-09 Thread Marcos Ortiz Valmaseda
Hi, Jennifer.
Precisely, Robert Evans from the Yahoo! team was working on updating this
tutorial to at least the Hadoop 1.x series, but right now I don't know the
progress of that project.

Also, you don't need to download Hadoop 0.18.0, because it's included in the
VMware Hadoop VM.
You can download the VM and try it.

Best wishes.
- Original Message -
From: Jennifer Lopez lopez.miri...@gmail.com
To: user@hadoop.apache.org
Sent: Sunday, December 9, 2012 10:53:55
Subject: Hadoop Tutorial help

I am going through the tutorial presented at
http://developer.yahoo.com/hadoop/tutorial/module3.html#vm-jobs

I have installed VMware and the Hadoop virtual machine. This tutorial talks about
the Hadoop 0.18.0 version and states that the compilation would be done on the
Windows host machine. I want to try out simple examples.

And now I see that this Hadoop 0.18.0 version is not available at the Apache Hadoop
website.

How do I go ahead now? Are any other Hadoop virtual machines available for such
tutorials?
Any info would be highly appreciated.

-- Lopez





Re: Strange machine behavior

2012-12-08 Thread Marcos Ortiz

Are you sure that 24 map slots is a good number for this machine?
Remember that you have three services (DN, TT and HRegionServer) with
12 GB of heap in total.
Try a lower number of map slots (12, for example) and launch your
MR job again.
Can you share your logs on pastebin?


On Sat 08 Dec 2012 07:09:02 PM CST, Robert Dyer wrote:

Has anyone experienced a TaskTracker/DataNode behaving like the
attached image?

This was during a MR job (which runs often).  Note the extremely high
System CPU time.  Upon investigating I saw that out of 64GB ram the
system had allocated almost 45GB to cache!

I did a sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches; sync",
which is roughly where the graph goes back to normal (much lower
System, much higher User).

This has happened a few times.

I have tried playing with the sysctl vm.swappiness value (default of
60) by setting it to 30 (which it was at when the graph was collected)
and now to 10.  I am not sure that helps.

Any ideas?  Anyone else run into this before?

24 cores
64GB ram
4x2TB sata3 hdd

Running Hadoop 1.0.4, with a DataNode (2gb heap), TaskTracker (2gb
heap) on this machine.

24 map slots (1gb heap each), no reducers.

Also running HBase 0.94.2 with a RS (8gb ram) on this machine.


--
Marcos Luis Ortíz Valmaseda
about.me/marcosortiz http://about.me/marcosortiz
@marcosluis2186 http://twitter.com/marcosluis2186





Re: HADOOP UPGRADE ERROR

2012-11-22 Thread Marcos Ortiz


On 11/22/2012 08:55 PM, yogesh dhari wrote:

Hi All,

I am trying to upgrade hadoop-0.20.2 to hadoop-1.0.4.
I used command

*hadoop namenode -upgrade*

After that, if I start the cluster with the command

*start-all.sh*

the TT and DN don't start.

Which steps did you follow to perform the upgrade process?
In Tom White's Hadoop: The Definitive Guide, Chapter 10, there is a great
section dedicated to upgrades, where he describes the basic procedure to
do this:


1. Make sure that any previous upgrade is finalized before proceeding 
with another

upgrade.

2. Shut down MapReduce, and kill any orphaned task processes on the 
tasktrackers.


3. Shut down HDFS, and back up the namenode directories.

4. Install new versions of Hadoop HDFS and MapReduce on the cluster and on
clients.

5. Start HDFS with the -upgrade option:

   $NEW_HADOOP_INSTALL/bin/start-dfs.sh -upgrade

6. Wait until the upgrade is complete:
   $NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -upgradeProgress status

7. Perform some sanity checks on HDFS.

8. Start MapReduce:
   $NEW_HADOOP_INSTALL/bin/start-mapred.sh

9. Roll back or finalize the upgrade (optional):
$NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -finalizeUpgrade
$NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -upgradeProgress status




1). Log file of  TT...

2012-11-23 07:15:54,399 ERROR 
org.apache.hadoop.security.UserGroupInformation: 
PriviledgedActionException as:yogesh cause:java.io.IOException: Call 
to localhost/127.0.0.1:9001 failed on local exception: 
java.io.IOException: Connection reset by peer
2012-11-23 07:15:54,400 ERROR org.apache.hadoop.mapred.TaskTracker: 
Can not start task tracker because java.io.IOException: Call to 
localhost/127.0.0.1:9001 failed on local exception: 
java.io.IOException: Connection reset by peer

at org.apache.hadoop.ipc.Client.wrapException(Client.java:1107)
at org.apache.hadoop.ipc.Client.call(Client.java:1075)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)


Which mode have you enabled in your Hadoop cluster?



2).  Log file of DN...


2012-11-23 07:07:57,095 INFO 
org.apache.hadoop.hdfs.server.common.Storage: Cannot access storage 
directory /opt/hadoop_newdata_dirr
2012-11-23 07:07:57,096 INFO 
org.apache.hadoop.hdfs.server.common.Storage: Storage directory 
/opt/hadoop_newdata_dirr does not exist.
2012-11-23 07:07:57,199 ERROR 
org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: 
All specified directories are not accessible or do not exist.
at 
org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:139)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:385)




Although /opt/hadoop_new_dirr exists with file permission 755.

Does the user yogesh have all privileges on that directory?
Note that /opt/hadoop_new_dirr is not the same as /opt/hadoop_newdata_dirr.





Please suggest

Thanks & Regards
Yogesh Kumar







--

Marcos Luis Ortíz Valmaseda
about.me/marcosortiz http://about.me/marcosortiz
@marcosluis2186 http://twitter.com/marcosluis2186




Re: hadoop - running examples

2012-11-08 Thread Marcos Ortiz Valmaseda
Mohammad is right.
When you write a file to HDFS, it can't be modified.
The pattern in HDFS is write-once/read-many-times.

If you want to use a distribution where you can both read and write files, you
should take a look at the MapR distribution.
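
A small hedged sketch of the usual workflow (edit locally, then overwrite the HDFS copy) using the FileSystem API; the paths are placeholders, not taken from the thread. The overwrite flag replaces the existing file in one call, which is what re-running hadoop fs -put amounts to.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReUploadAfterEdit {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // delSrc=false keeps the local copy; overwrite=true replaces the HDFS file
    fs.copyFromLocalFile(false, true,
        new Path("input/data.txt"),            // local file you just corrected
        new Path("/user/ak/input/data.txt"));  // existing HDFS destination
    fs.close();
  }
}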

- Original Message -
From: Mohammad Tariq donta...@gmail.com
To: user@hadoop.apache.org
Sent: Thu, 08 Nov 2012 17:33:56 -0500 (CST)
Subject: Re: hadoop - running examples

Apologies for the wrong word. Yes, I meant non modifiable.

Regards,
Mohammad Tariq



On Fri, Nov 9, 2012 at 4:01 AM, Jay Vyas jayunit...@gmail.com wrote:

 What do you mean by immutable? Do you mean non-modifiable, maybe? Immutable
 implies that they can't be deleted.

 Jay Vyas
 MMSB
 UCHC

 On Nov 8, 2012, at 5:28 PM, Mohammad Tariq donta...@gmail.com wrote:

 Files are immutable, once written into the Hdfs. And touchz creates a file
 of 0 length.

 Regards,
 Mohammad Tariq



 On Fri, Nov 9, 2012 at 3:18 AM, Kartashov, Andy andy.kartas...@mpac.cawrote:

 Guys,

 When running examples, you bring them into HDFS. Say you need to make
 some correction to a file: you make it on the local FS and run
 $hadoop fs -put ... again. You cannot just make changes to files inside
 HDFS, except for touchz-ing a file, correct?

 Just making sure.

 Thnx,
 AK









Re: monitoring CPU cores (resource consumption) in hadoop

2012-11-03 Thread Marcos Ortiz

Regards, Jim.
In the open source world I don't know.
In the Enterprise world, Boundary is a great choice.
Look here:
http://boundary.com/why-boundary/product/

On 11/03/2012 02:59 PM, ugiwgh wrote:

The Paramon tool can resolve this problem. It can monitor CPU cores.

--GHui

-- Original --
From:  Jim Neofotistosjim.neofotis...@oracle.com;
Date:  Sun, Nov 4, 2012 03:00 AM
To:  useruser@hadoop.apache.org;

Subject:  monitoring CPU cores (resource consumption) in hadoop

The standard Hadoop monitoring metrics system doesn't allow monitoring of CPU
cores. Ganglia open source monitoring does not have the capability with the
RRD tool either.

Top is an option, but I was looking for something cluster-wide.

Jim


--

Marcos Luis Ortíz Valmaseda
about.me/marcosortiz http://about.me/marcosortiz
@marcosluis2186 http://twitter.com/marcosluis2186




Re: Set the number of maps

2012-11-01 Thread Marcos Ortiz
Since 0.21 the option was renamed to
mapreduce.tasktracker.map.tasks.maximum, and, like
Harsh said, it is a TaskTracker service-level option.

Another thing is that this option is closely tied to
mapreduce.child.java.opts, so make sure
to constantly monitor the effect of these changes on your cluster.



On 11/01/2012 11:55 AM, Harsh J wrote:

It can't be set from the code this way - the slot property is applied
at the TaskTracker service level (as the name goes).

Since you're just testing at the moment, try to set these values,
restart TTs, and run your jobs again. You do not need to restart JT at
any point for tweaking these values.

On Thu, Nov 1, 2012 at 7:13 PM, Cogan, Peter (Peter)
peter.co...@alcatel-lucent.com wrote:

Hi

I understand that the maximum number of concurrent map tasks is set by
mapred.tasktracker.map.tasks.maximum  - however I wish to run with a smaller
number of maps (am testing disk IO). I thought that I could set that within
the main program using

conf.set("mapred.tasktracker.map.tasks.maximum", "4");


to run with 4 maps – but that seems to have no impact. I know I could just
change the mapred-site.xml and restart map reduce but that's kind of a pain.
Can it be set from within the code?


Thanks

Peter





--

Marcos Luis Ortíz Valmaseda
about.me/marcosortiz http://about.me/marcosortiz
@marcosluis2186 http://twitter.com/marcosluis2186




Re: File Permissions on s3 FileSystem

2012-10-23 Thread Marcos Ortiz

On 23/10/12 13:32, Parth Savani wrote:

Hello Everyone,
I am trying to run a hadoop job with s3n as my filesystem.
I changed the following properties in my hdfs-site.xml

fs.default.name=s3n://KEY:VALUE@bucket/
A good practice for this is to use these two properties in
core-site.xml, if you will use S3 often:

<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>AWS_ACCESS_KEY_ID</value>
</property>

<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>AWS_SECRET_ACCESS_KEY</value>
</property>

After that, you can access your URIs in a friendlier way:
S3:
 s3://s3-bucket/s3-filepath

S3n:
 s3n://s3-bucket/s3-filepath


mapreduce.jobtracker.staging.root.dir=s3n://KEY:VALUE@bucket/tmp

When i run the job from ec2, I get the following error

The ownership on the staging directory 
s3n://KEY:VALUE@bucket/tmp/ec2-user/.staging is not as expected. It is 
owned by   The directory must be owned by the submitter ec2-user or by 
ec2-user
at 
org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:113)

at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:844)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)
at 
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:844)

at org.apache.hadoop.mapreduce.Job.submit(Job.java:481)

I am using cloudera CDH4 hadoop distribution. The error is thrown from 
JobSubmissionFiles.java class

 public static Path getStagingDir(JobClient client, Configuration conf)
  throws IOException, InterruptedException {
Path stagingArea = client.getStagingAreaDir();
FileSystem fs = stagingArea.getFileSystem(conf);
String realUser;
String currentUser;
UserGroupInformation ugi = UserGroupInformation.getLoginUser();
realUser = ugi.getShortUserName();
currentUser = 
UserGroupInformation.getCurrentUser().getShortUserName();

    if (fs.exists(stagingArea)) {
      FileStatus fsStatus = fs.getFileStatus(stagingArea);
      String owner = fsStatus.getOwner();
      if (!(owner.equals(currentUser) || owner.equals(realUser))) {
         throw new IOException("The ownership on the staging directory " +
                      stagingArea + " is not as expected. " +
                      "It is owned by " + owner + ". The directory must " +
                      "be owned by the submitter " + currentUser + " or " +
                      "by " + realUser);
      }
      if (!fsStatus.getPermission().equals(JOB_DIR_PERMISSION)) {
        LOG.info("Permissions on staging directory " + stagingArea + " are " +
          "incorrect: " + fsStatus.getPermission() + ". Fixing permissions " +
          "to correct value " + JOB_DIR_PERMISSION);
        fs.setPermission(stagingArea, JOB_DIR_PERMISSION);
      }
    } else {
      fs.mkdirs(stagingArea,
          new FsPermission(JOB_DIR_PERMISSION));
    }
    return stagingArea;
  }


I think my job calls getOwner() which returns NULL since s3 does not 
have file permissions which results in the IO exception that i am 
getting.

With which user are you launching the job in EC2?




Any workaround for this? Any idea how I could use S3 as the filesystem
with Hadoop in distributed mode?


Look here:
http://wiki.apache.org/hadoop/AmazonS3




Re: Java heap space error

2012-10-21 Thread Marcos Ortiz Valmaseda
Regards, Subash.
Can you share more information about your YARN cluster?

- Original Message -
From: Subash D'Souza sdso...@truecar.com
To: user@hadoop.apache.org
Sent: Sun, 21 Oct 2012 09:18:43 -0400 (CDT)
Subject: Java heap space error

I'm running CDH 4 on a 4-node cluster, each node with 96 GB of RAM. Up until last
week the cluster was running, until there was an error in the namenode log file
and I had to reformat it and put the data back.

Now when I run Hive on YARN, I keep getting a Java heap space error. Based on
the research I did, I upped my mapred.child.java.opts first from 200m to
400m, then to 800m, and I still have the same issue. It seems to fail near the 100%
mapper mark.

I checked the log files and the only thing that it does output is java heap 
space error. Nothing more.

Any help would be appreciated.

Thanks
Subash





Re: hadoop 0.23.3 configurations

2012-10-11 Thread Marcos Ortiz

  
  
Regards, Visioner.
Look here; this is a quick and useful guide to do this:
http://practicalcloudcomputing.com/post/26448910436/install-and-run-hadoop-yarn-in-10-easy-steps

Best wishes
On 11/10/2012 10:07, Visioner Sadak wrote:

hi, I just installed 0.23.3 and it seems that the configurations are
entirely different. Does anyone know how to configure JAVA_HOME?
hadoop-env.sh and mapred-site.xml are also not present in the
etc/hadoop/ folder


-- 
Marcos Ortiz Valmaseda,
http://about.me/marcosortiz
Twitter: @marcosluis2186


  












Re: issue with permissions of mapred.system.dir

2012-10-09 Thread Marcos Ortiz


On 10/09/2012 07:44 PM, Goldstone, Robin J. wrote:
I am bringing up a Hadoop cluster for the first time (but am an 
experienced sysadmin with lots of cluster experience) and running into 
an issue with permissions on mapred.system.dir. It has generally been 
a chore to figure out all the various directories that need to be 
created to get Hadoop working, some on the local FS, others within 
HDFS, getting the right ownership and permissions, etc..  I think I am 
mostly there but can't seem to get past my current issue with 
mapred.system.dir.


Some general info first:
OS: RHEL6
Hadoop version: hadoop-1.0.3-1.x86_64

20 node cluster configured as follows
1 node as primary namenode
1 node as secondary namenode + job tracker
18 nodes as datanode + tasktracker

I have HDFS up and running and have the following in mapred-site.xml:
<property>
  <name>mapred.system.dir</name>
  <value>hdfs://hadoop1/mapred</value>
  <description>Shared data for JT - this must be in HDFS</description>
</property>

I have created this directory in HDFS, owner mapred:hadoop, 
permissions 700 which seems to be the most common recommendation 
amongst multiple, often conflicting articles about how to set up 
Hadoop.  Here is the top level of my filesystem:

hyperion-hdp4@hdfs:hadoop fs -ls /
Found 3 items
drwx--   - mapred hadoop  0 2012-10-09 12:58 /mapred
drwxrwxrwx   - hdfs   hadoop  0 2012-10-09 13:00 /tmp
drwxr-xr-x   - hdfs   hadoop  0 2012-10-09 12:51 /user

Note, it doesn't seem to really matter what permissions I set on 
/mapred since when the Jobtracker starts up it changes them to 700.


However, when I try to run the hadoop example teragen program as a 
regular user I am getting this error:
hyperion-hdp4@robing:hadoop jar /usr/share/hadoop/hadoop-examples*.jar 
teragen -D dfs.block.size=536870912 100 
/user/robing/terasort-input

Generating 100 using 2 maps with step of 50
12/10/09 16:27:02 INFO mapred.JobClient: Running job: 
job_201210072045_0003

12/10/09 16:27:03 INFO mapred.JobClient:  map 0% reduce 0%
12/10/09 16:27:03 INFO mapred.JobClient: Job complete: 
job_201210072045_0003

12/10/09 16:27:03 INFO mapred.JobClient: Counters: 0
12/10/09 16:27:03 INFO mapred.JobClient: Job Failed: Job 
initialization failed:
org.apache.hadoop.security.AccessControlException: 
org.apache.hadoop.security.AccessControlException: Permission denied: 
user=robing, access=EXECUTE, inode=mapred:mapred:hadoop:rwx--

at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)

at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at 
org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
at 
org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:57)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.init(DFSClient.java:3251)

at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:713)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:182)

at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:555)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:536)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:443)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:435)
at 
org.apache.hadoop.security.Credentials.writeTokenStorageFile(Credentials.java:169)
at 
org.apache.hadoop.mapred.JobInProgress.generateAndStoreTokens(JobInProgress.java:3537)
at 
org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:696)

at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:4207)
at 
org.apache.hadoop.mapred.FairScheduler$JobInitializer$InitJob.run(FairScheduler.java:291)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:662)
rest of stack trace omitted

This seems to be saying that it is trying to write to the HDFS /mapred
filesystem as me (robing) rather than as mapred, the username under
which the jobtracker and tasktracker run.


To verify this is what is happening, I manually changed the 
permissions on /mapred from 700 to 755 since it claims to want execute 
access:

hyperion-hdp4@mapred:hadoop fs -chmod 755 /mapred
hyperion-hdp4@mapred:hadoop fs -ls /
Found 3 items
drwxr-xr-x   - mapred hadoop  0 2012-10-09 12:58 /mapred
drwxrwxrwx   - hdfs   hadoop  0 2012-10-09 13:00 /tmp
drwxr-xr-x   - hdfs   hadoop  0 2012-10-09 12:51 /user
hyperion-hdp4@mapred:

Now I try running again and it fails again, this time complaining it 
wants write access to /mapred:
hyperion-hdp4@robing:hadoop jar /usr/share/hadoop/hadoop-examples*.jar 
teragen -D dfs.block.size=536870912 100 

Re: use S3 as input to MR job

2012-10-02 Thread Marcos Ortiz
[Quoted message content lost in archiving; only signature fragments survive: a
Medio Systems Inc. signature block (701 Pike St. #1500, Seattle, WA 98101,
"Predictive Analytics for a Connected World"), Harsh J, and Benjamin Kim
(benkimkimben at gmail).]
-- 
Marcos Ortiz Valmaseda,
Data Engineer  Senior System Administrator at UCI
Blog: http://marcosluis2186.posterous.com
Linkedin: http://www.linkedin.com/in/marcosluis2186
Twitter: @marcosluis2186


  












Re: Which hardware to choose

2012-10-02 Thread Marcos Ortiz

What is a reasonable number for this hardware?

On 10/02/2012 09:40 PM, Michael Segel wrote:

I think he's saying that it's 24 map and 8 reduce slots per node, and at 48GB that could
be too many mappers.
Especially if they want to run HBase.

On Oct 2, 2012, at 8:14 PM, hadoopman hadoop...@gmail.com wrote:


Only 24 map and 8 reduce tasks for 38 data nodes?  are you sure that's right?  
Sounds VERY low for a cluster that size.

We have only 10 c2100's and are running I believe 140 map and 70 reduce slots 
so far with pretty decent performance.



On 10/02/2012 12:55 PM, Alexander Pivovarov wrote:

38 data nodes + 2 Name Nodes

  
Data Node:
Dell PowerEdge C2100 series
2 x XEON x5670
48 GB RAM ECC  (12x4GB 1333MHz)
12 x 2 TB  7200 RPM SATA HDD (with hot swap)  JBOD
Intel Gigabit ET Dual port PCIe x4
Redundant Power Supply
Hadoop CDH3
max map tasks 24
max reduce tasks 8






--

Marcos Luis Ortíz Valmaseda
*Data Engineer  Sr. System Administrator at UCI*
about.me/marcosortiz http://about.me/marcosortiz
My Blog http://marcosluis2186.posterous.com
Tumblr's blog http://marcosortiz.tumblr.com/
@marcosluis2186 http://twitter.com/marcosluis2186




Re: Hadoop Archives under 0.23

2012-10-02 Thread Marcos Ortiz


On 02/10/2012 2:12, Alexander Hristov wrote:

Hello

I'm trying to test the Hadoop archive functionality under 0.23 and I 
can't get it working.


I have in HDFS a /test folder with  several text files. I created a 
hadoop archive using


hadoop archive -archiveName test.har -p /test *.txt  /sample

OK, this creates /sample/test.har with the appropriate parts
(_index, _SUCCESS, _masterindex, part-0). Performing a cat on _index
shows the text files.

However, when I try to even list the contents of the HAR file using

hdfs dfs -ls -R har:///sample/test.har

The right command to do this is:
hdfs dfs -lsr har:///sample/test.har
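
As an aside, a hedged sketch of doing the same listing programmatically through the FileSystem API; it assumes the default fs.har.impl mapping to HarFileSystem and reuses the /sample/test.har path from this thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHarContents {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // har:// URIs are resolved by HarFileSystem on top of the default filesystem
    Path har = new Path("har:///sample/test.har");
    FileSystem fs = har.getFileSystem(conf);
    for (FileStatus status : fs.listStatus(har)) {
      System.out.println(status.getPath());
    }
  }
}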



I simply get har:///sample/test.har : No such file or directory! WTF?

Accessing the individual files does work, however:

hdfs dfs -cat har:///sample/test.har/file.txt

works

Regards

Alexander




--
Marcos Ortiz Valmaseda,
Data Engineer  Senior System Administrator at UCI
Blog: http://marcosluis2186.posterous.com
Linkedin: http://www.linkedin.com/in/marcosluis2186
Twitter: @marcosluis2186






Re: How to run multiple jobs at the same time?

2012-09-23 Thread Marcos Ortiz

Apache Mahout was built for that
Look here: 
https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering


If you don't want to use Mahout's approach (highly recommended), you can use
the MultipleInputs class for that:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/lib/input/MultipleInputs.html

An example from Tom White's book using MultipleInputs:
MultipleInputs.addInputPath(job, ncdcInputPath,
TextInputFormat.class, MaxTemperatureMapper.class);
MultipleInputs.addInputPath(job, metOfficeInputPath,
TextInputFormat.class, MetOfficeMaxTemperatureMapper.class);
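
If the goal really is to run the same iterative driver over several datasets at once (rather than merging inputs), one option not covered above is to give each dataset its own thread, since JobClient.runJob() only blocks the calling thread. The following is a structural sketch only, under the assumption that configureIteration() and converged() are filled in with the existing per-iteration K-Means setup; it is not a complete driver.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

// Structural sketch: one iterative K-Means chain per dataset, each in its own thread.
public abstract class KMeansChain implements Runnable {
  private final String dataset;
  private final int maxIterations;

  protected KMeansChain(String dataset, int maxIterations) {
    this.dataset = dataset;
    this.maxIterations = maxIterations;
  }

  // Fill in with the existing per-iteration setup: mapper, reducer,
  // input = previous iteration's output, output = a new path per iteration.
  protected abstract JobConf configureIteration(String dataset, int iteration) throws Exception;

  // Inspect counters/output of the finished job to decide convergence.
  protected abstract boolean converged(RunningJob finishedJob) throws Exception;

  public void run() {
    try {
      for (int i = 0; i < maxIterations; i++) {
        RunningJob job = JobClient.runJob(configureIteration(dataset, i)); // blocks this thread only
        if (converged(job)) {
          break;
        }
      }
    } catch (Exception e) {
      throw new RuntimeException("K-Means chain failed for " + dataset, e);
    }
  }
}

Each dataset then gets its own new Thread(...).start(), and the main program join()s them all.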


On 09/23/2012 12:31 PM, Jason Yang wrote:

Hi, all

I have implemented a K-Means algorithm in MapReduce. This program 
consists of many iterations and each iteration is a MapReduce Job. 
here is my pseudo-code:


-
int count  = 0;
do
{

SET input path = output path of last iteration;
SET output path = new path(count);
...
runJob
}
while( (!converged) && (count < maxCount) )
--

Now I have a question: what should I do if I would like to apply
this algorithm to multiple datasets at the same time?


Because there are dependencies between iterations, I have to use
JobClient.runJob(), which blocks until the iteration finishes.


Could I use thread?

BTW, I'm using hadoop-0.20.2
--
YANG, Lin



--

Marcos Luis Ortíz Valmaseda
*Data Engineer  Sr. System Administrator at UCI*
about.me/marcosortiz http://about.me/marcosortiz
My Blog http://marcosluis2186.posterous.com
Tumblr's blog http://marcosortiz.tumblr.com/
@marcosluis2186 http://twitter.com/marcosluis2186




Re: Suggestions required for learning Hadoop

2012-09-13 Thread Marcos Ortiz

Regards, Munnavar.
There are two great Refcardz from DZone, written by Eugene Ciurana
(http://eugeneciurana.com), which are perfect for
sysadmins interested in Hadoop:
- Getting Started with Hadoop
- Deploying Hadoop

http://refcardz.dzone.com

If you want to know more, there are a lot of courses available from 
Cloudera[1], Hortonworks[2] or MapR[3]

[1] http://www.cloudera.com/product-services
[2] http://hortonworks.com/
[3] http://academy.mapr.com

And if you want to go deeper, there are certification programs from
Cloudera and Hortonworks.

Best wishes

On 09/13/2012 01:37 PM, Munnavar Shaik wrote:


Dear Team Members,

I am working as a Linux administrator and I am interested in working on
Hadoop. Please let me know where and how I can start learning.


It would be greatly helpful to have guidance for learning Hadoop and its
related projects.


Thank you Team,

*Munnavar*





--

Marcos Luis Ortíz Valmaseda
*Data Engineer  Sr. System Administrator at UCI*
about.me/marcosortiz http://about.me/marcosortiz
My Blog http://marcosluis2186.posterous.com
Tumblr's blog http://marcosortiz.tumblr.com/
@marcosluis2186 http://twitter.com/marcosluis2186






Re: Hadoop or HBase

2012-08-28 Thread Marcos Ortiz

Regards to all the list.
Well, you should ask the Tumblr folks: they use a combination
of MySQL and HBase for their blogging platform. They talked about this
topic at the last HBaseCon. Here is the link:

http://www.hbasecon.com/sessions/growing-your-inbox-hbase-at-tumblr/

Blake Matheny, Director of Platform Engineering at Tumblr, was the
presenter of this topic.

Best wishes

On 28/08/2012 6:18, Kai Voigt wrote:

Having a distributed filesystem doesn't save you from having backups. If 
someone deletes a file in HDFS, it's gone.

What backend storage is supported by your CMS?

Kai

On 28.08.2012 at 08:36, Kushal Agrawal kushalagra...@teledna.com wrote:


As the data is too large (tens of terabytes), it's difficult to take a backup;
it takes 1.5 days to back up the data every time. Instead, if we use a
distributed file system, we do not need to do that.

Thanks  Regards,
Kushal Agrawal
kushalagra...@teledna.com
  
-Original Message-

From: Kai Voigt [mailto:k...@123.org]
Sent: Tuesday, August 28, 2012 11:57 AM
To: common-u...@hadoop.apache.org
Subject: Re: Hadoop or HBase

Typically, CMSs require a RDBMS. Which Hadoop and HBase are not.

Which CMS do you plan to use, and what's wrong with MySQL or other open
source RDBMSs?

Kai

On 28.08.2012 at 08:21, Kushal Agrawal kushalagra...@teledna.com wrote:


Hi,
I want to use a DFS for a Content Management System (CMS), in
which I just want to store and retrieve files.

Please suggest me what should I use:
Hadoop or HBase

Thanks & Regards,
Kushal Agrawal
kushalagra...@teledna.com

--
Kai Voigt
k...@123.org












Re: distcp error.

2012-08-28 Thread Marcos Ortiz
Hi, Tao. Does this problem occur only with 2.0.1, or with both versions?
Have you tried using distcp from 1.0.3 to 1.0.3?

On 28/08/2012 11:36, Tao wrote:

 Hi, all

 I use distcp copying data from hadoop1.0.3 to hadoop 2.0.1.

 When the file path (or file name) contains Chinese characters, an
 exception is thrown, like below. I need some help with this.

 Thanks.

 [hdfs@host ~]$ hadoop distcp -i -prbugp -m 14 -overwrite -log /tmp/distcp.log
 hftp://10.xx.xx.aa:50070/tmp/中文路径测试 hdfs://10.xx.xx.bb:54310/tmp/distcp_test14

 12/08/28 23:32:31 INFO tools.DistCp: Input Options:
 DistCpOptions{atomicCommit=false, syncFolder=false,
 deleteMissing=false, ignoreFailures=true, maxMaps=14,
 sslConfigurationFile='null', copyStrategy='uniformsize',
 sourceFileListing=null, sourcePaths=[hftp://10.xx.xx.aa:50070/tmp/中文
 路径测试], targetPath=hdfs://10.xx.xx.bb:54310/tmp/distcp_test14}

 12/08/28 23:32:33 INFO tools.DistCp: DistCp job log path: /tmp/distcp.log

 12/08/28 23:32:34 WARN conf.Configuration: io.sort.mb is deprecated.
 Instead, use mapreduce.task.io.sort.mb

 12/08/28 23:32:34 WARN conf.Configuration: io.sort.factor is
 deprecated. Instead, use mapreduce.task.io.sort.factor

 12/08/28 23:32:34 WARN util.NativeCodeLoader: Unable to load
 native-hadoop library for your platform... using builtin-java classes
 where applicable

 12/08/28 23:32:36 INFO mapreduce.JobSubmitter: number of splits:1

 12/08/28 23:32:36 WARN conf.Configuration: mapred.jar is deprecated.
 Instead, use mapreduce.job.jar

 12/08/28 23:32:36 WARN conf.Configuration:
 mapred.map.tasks.speculative.execution is deprecated. Instead, use
 mapreduce.map.speculative

 12/08/28 23:32:36 WARN conf.Configuration: mapred.reduce.tasks is
 deprecated. Instead, use mapreduce.job.reduces

 12/08/28 23:32:36 WARN conf.Configuration:
 mapred.mapoutput.value.class is deprecated. Instead, use
 mapreduce.map.output.value.class

 12/08/28 23:32:36 WARN conf.Configuration: mapreduce.map.class is
 deprecated. Instead, use mapreduce.job.map.class

 12/08/28 23:32:36 WARN conf.Configuration: mapred.job.name is
 deprecated. Instead, use mapreduce.job.name

 12/08/28 23:32:36 WARN conf.Configuration: mapreduce.inputformat.class
 is deprecated. Instead, use mapreduce.job.inputformat.class

 12/08/28 23:32:36 WARN conf.Configuration: mapred.output.dir is
 deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir

 12/08/28 23:32:36 WARN conf.Configuration:
 mapreduce.outputformat.class is deprecated. Instead, use
 mapreduce.job.outputformat.class

 12/08/28 23:32:36 WARN conf.Configuration: mapred.map.tasks is
 deprecated. Instead, use mapreduce.job.maps

 12/08/28 23:32:36 WARN conf.Configuration: mapred.mapoutput.key.class
 is deprecated. Instead, use mapreduce.map.output.key.class

 12/08/28 23:32:36 WARN conf.Configuration: mapred.working.dir is
 deprecated. Instead, use mapreduce.job.working.dir

 12/08/28 23:32:37 INFO mapred.ResourceMgrDelegate: Submitted
 application application_1345831938927_0039 to ResourceManager at
 baby20/10.1.1.40:8040

 12/08/28 23:32:37 INFO mapreduce.Job: The url to track the job:
 http://baby20:8088/proxy/application_1345831938927_0039/

 12/08/28 23:32:37 INFO tools.DistCp: DistCp job-id: job_1345831938927_0039

 12/08/28 23:32:37 INFO mapreduce.Job: Running job: job_1345831938927_0039

 12/08/28 23:32:50 INFO mapreduce.Job: Job job_1345831938927_0039
 running in uber mode : false

 12/08/28 23:32:50 INFO mapreduce.Job: map 0% reduce 0%

 12/08/28 23:33:00 INFO mapreduce.Job: map 100% reduce 0%

 12/08/28 23:33:00 INFO mapreduce.Job: Task Id :
 attempt_1345831938927_0039_m_00_0, Status : FAILED

 Error: java.io.IOException: File copy failed: hftp://10.1.1.26:50070
 /tmp/中文路径测试/part-r-00017 --
 hdfs://10.1.1.40:54310/tmp/distcp_test14/part-r-00017

 at
 org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:262)

 at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:229)

 at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:45)

 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)

 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:725)

 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)

 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152)

 at java.security.AccessController.doPrivileged(Native Method)

 at javax.security.auth.Subject.doAs(Subject.java:396)

 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1232)

 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147)

 Caused by: java.io.IOException: Couldn't run retriable-command:
 Copying hftp://10.1.1.26:50070/tmp/中文路径测试/part-r-00017 to
 hdfs://10.1.1.40:54310/tmp/distcp_test14/part-r-00017

 at
 org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)

 at
 org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:258)

 ... 10 more

 Caused by:
 

Re: Hadoop 1.0.3 setup

2012-07-09 Thread Marcos Ortiz


On 07/09/2012 09:58 AM, prabhu K wrote:

Yes, I have configured a multinode setup, 1 master and 2 slaves.

I have formatted the namenode and then I ran the start-dfs.sh script and
start-mapred.sh script.

I ran the bin/hadoop fs -put input input command and got the following error
on my terminal.

hduser@md-trngpoc1:/usr/local/hadoop_dir/hadoop$ bin/hadoop fs -put input
input
Warning: $HADOOP_HOME is deprecated.
put: org.apache.hadoop.security.AccessControlException: Permission denied:
user=hduser, access=WRITE, inode=:root:supergroup:rwxr-xr-x
and executed the below command, getting the /hadoop-install/hadoop directory; I
couldn't understand what I am doing wrong.
Well, this error says that you have the wrong permissions on the Hadoop
directory: the user and group you have there is root:supergroup, and the
correct value is:

 hduser:supergroup


hduser@md-trngpoc1:/usr/local/hadoop_dir/hadoop$ echo $HADOOP_HOME
/hadoop-install/hadoop

*Namenode log:*
==

java.lang.InterruptedException: sleep interrupted
 at java.lang.Thread.sleep(Native Method)
 at
org.apache.hadoop.hdfs.server.namenode.DecommissionManager$Monitor.run(DecommissionManager.java:65)
 at java.lang.Thread.run(Thread.java:662)
2012-07-09 19:02:12,696 ERROR
org.apache.hadoop.hdfs.server.namenode.NameNode: java.net.BindException:
Problem binding to md-trngpoc1/10.5.114.110:54310 : Address alrea
dy in use

It seems that address:port is already in use.
Use these commands:
netstat -puta | grep namenode
netstat -puta | grep datanode

to check which ports the NN and DN are using.

 at org.apache.hadoop.ipc.Server.bind(Server.java:227)
 at org.apache.hadoop.ipc.Server$Listener.init(Server.java:301)
 at org.apache.hadoop.ipc.Server.init(Server.java:1483)
 at org.apache.hadoop.ipc.RPC$Server.init(RPC.java:545)
 at org.apache.hadoop.ipc.RPC.getServer(RPC.java:506)
 at
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:294)
 at
org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:496)
 at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
 at
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
Caused by: java.net.BindException: Address already in use
 at sun.nio.ch.Net.bind(Native Method)
 at
sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:126)
 at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
 at org.apache.hadoop.ipc.Server.bind(Server.java:225)
 ... 8 more
*Datanode log*
=
2012-07-09 18:44:39,949 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
/
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = md-trngpoc3/10.5.114.168
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 1.0.3
STARTUP_MSG:   build =
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r
1335192; compiled by 'hortonfo' on Tue May  8 20:31:25 UTC 2012
/
2012-07-09 18:44:40,039 INFO org.apache.hadoop.metrics2.impl.MetricsConfig:
loaded properties from hadoop-metrics2.properties
2012-07-09 18:44:40,047 INFO
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
MetricsSystem,sub=Stats registered.
2012-07-09 18:44:40,048 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
period at 10 second(s).
2012-07-09 18:44:40,048 INFO
org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics system
started
2012-07-09 18:44:40,125 INFO
org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi
registered.
2012-07-09 18:44:40,163 WARN
org.apache.hadoop.hdfs.server.datanode.DataNode: Invalid directory in
dfs.data.dir: can not create directory: /app/hadoop_dir/hadoop/tmp/df
s/data
2012-07-09 18:44:40,163 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode: All directories in
dfs.data.dir are invalid.
2012-07-09 18:44:40,163 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2012-07-09 18:44:40,164 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down DataNode at md-trngpoc3/10.5.114.168
/
2012-07-09 18:46:09,586 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
/
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = md-trngpoc3/10.5.114.168
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 1.0.3
STARTUP_MSG:   build =
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r
1335192; compiled by 'hortonfo' on Tue May  8 20:31:25 UTC 2012

Re: Versions

2012-07-07 Thread Marcos Ortiz


On 07/07/2012 02:39 PM, Harsh J wrote:

The Apache Bigtop project was started for this very purpose (building
stable, well inter-operating version stacks). Take a read at
http://incubator.apache.org/bigtop/ and for 1.x Bigtop packages, see
https://cwiki.apache.org/confluence/display/BIGTOP/How+to+install+Hadoop+distribution+from+Bigtop

To specifically answer your question though, your list appears fine to
me. They 'should work', but I am not suggesting that I have tested
this stack completely myself.

On Sat, Jul 7, 2012 at 11:57 PM, prabhu K prabhu.had...@gmail.com wrote:

Hi users list,

I am planing to install following tools.

Hadoop 1.0.3
hive 0.9.0
flume 1.2.0
Hbase 0.92.1
sqoop 1.4.1
My only suggestion here is to use the 0.94 version of HBase; it has a lot of
improvements over 0.92.1.

See the Cloudera's blog post for it:
http://www.cloudera.com/blog/2012/05/apache-hbase-0-94-is-now-released/

Best wishes



my questions are.

1. the above tools are compatible with all the versions.

2. any tool need to change the version

3. list out all the tools with compatible versions.

Please suggest on this?





--

Marcos Luis Ortíz Valmaseda
*Data Engineer  Sr. System Administrator at UCI*




Re: set up Hadoop cluster on mixed OS

2012-07-06 Thread Marcos Ortiz
I have a mixed cluster too, with Linux (CentOS) and Solaris. The only
recommendation that I can give you

is to use exactly the same Hadoop version on all machines.

Best wishes
On 07/06/2012 05:31 AM, Senthil Kumar wrote:

You can setup hadoop cluster on mixed environment. We have a cluster with
Mac, Linux and Solaris.

Regards
Senthil

On Fri, Jul 6, 2012 at 1:50 PM, Yongwei Xing jdxyw2...@gmail.com wrote:


I have one MBP with 10.7.4 and one laptop with Ubuntu 12.04. Is it possible
to set up a hadoop cluster by such mixed environment?

Best Regards,

--
Welcome to my ET Blog http://www.jdxyw.com





--

Marcos Luis Ortíz Valmaseda
*Data Engineer  Sr. System Administrator at UCI*
about.me/marcosortiz http://about.me/marcosortiz
My Blog http://marcosluis2186.posterous.com
@marcosluis2186 http://twitter.com/marcosluis2186






Re: How to connect to a cluster by using eclipse

2012-07-04 Thread Marcos Ortiz
Jason,
Ramon is right.
The best way to debug a MapReduce job is to run it on a local (single-machine)
setup; then, once you have tested your code enough, you can
deploy it to the real distributed cluster.
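(For reference, a minimal old-API driver forced into local mode, the kind of
thing you can run and step through directly inside Eclipse, could look like
this; the class name and the input/output paths are just placeholders:)

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class LocalDebugDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(LocalDebugDriver.class);
        // Run the whole job inside this JVM so Eclipse breakpoints are hit
        conf.set("mapred.job.tracker", "local");
        // Read and write the local filesystem instead of HDFS
        conf.set("fs.default.name", "file:///");
        // set your Mapper/Reducer classes here as usual
        FileInputFormat.setInputPaths(conf, new Path("testdata/input"));
        FileOutputFormat.setOutputPath(conf, new Path("testdata/output"));
        JobClient.runJob(conf);
    }
}

Once the logic is verified there, the same driver can be pointed at the real
JobTracker/NameNode without code changes.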
On 07/04/2012 10:00 PM, Jason Yang wrote:
 ramon,

 Thank for your reply very much.

 However, I was still wonder whether I could debug a MR application in
 this way.

 I have read some posts talking about using NAT to redirect all the
 packets to the network card which connect to the local LAN, but it
 does not work as I tried to redirect by using iptables :(

 在 2012年7月4日星期三, 写道:

 Jason,


 the easiest way to debug a MapReduce program with Eclipse is
 working in Hadoop local mode.
 http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#Local


 In this mode all the components run locally on the same VM and
 can be easily debugged using Eclipse.


 Hope this will be useful.

 *From:*Jason Yang [mailto:lin.yang.ja...@gmail.com
 javascript:_e({}, 'cvml', 'lin.yang.ja...@gmail.com');]
 *Sent:* miércoles, 04 de julio de 2012 11:25
 *To:* mapreduce-user
 *Subject:* How to connect to a cluster by using eclipse

 Hi, all

 I have a hadoop cluster with 3 nodes, the network topology is like
 this:

 1. For each DataNode, its IP address is like :192.168.0.XXX;

 2. For the NameNode, it has two network cards: one is connect with
 the DataNodes as a local LAN with IP address 192.168.0.110, while
 the other one is connect to the company network(which eventually
 connect to the Internet);

 --

 now I'm trying to debug a MapReduce program on a computer which is
 in the company network. Since the jobtracker in this scenario is
 192.168.0.110:9001 http://192.168.0.110:9001, I was wondering
 how could I connect to the cluster by using eclipse?

 -- 

 YANG, Lin


 
 Subject to local law, communications with Accenture and its
 affiliates including telephone calls and emails (including
 content), may be monitored by our systems for the purposes of
 security and the assessment of internal compliance with Accenture
 policy.
 
 __

 www.accenture.com http://www.accenture.com



 -- 
 YANG, Lin


-- 

Marcos Luis Ortíz Valmaseda
*Data Engineer  Sr. System Administrator at UCI*
about.me/marcosortiz http://about.me/marcosortiz
My Blog http://marcosluis2186.posterous.com
@marcosluis2186 http://twitter.com/marcosluis2186






Re: Yarn job runs in Local Mode even though the cluster is running in Distributed Mode

2012-06-13 Thread Marcos Ortiz
According to the CDH 4 official documentation, you should install a 
JobHistory server for your MRv2 (YARN)

cluster.
https://ccp.cloudera.com/display/CDH4DOC/Deploying+MapReduce+v2+%28YARN%29+on+a+Cluster

How to configure the HistoryServer
https://ccp.cloudera.com/display/CDH4DOC/Deploying+MapReduce+v2+%28YARN%29+on+a+Cluster#DeployingMapReducev2%28YARN%29onaCluster-Step3



On 06/13/2012 03:16 PM, anil gupta wrote:

Hi All

I am using cdh4 for running a HBase cluster on CentOs6.0. I have 5
nodes in my cluster(2 Admin Node and 3 DN).
My resourcemanager is up and running and showing that all three DN are
running the nodemanager. HDFS is also working fine and showing 3 DN's.

But when i fire the pi example job. It starts to run in Local mode.
Here is the console output:
sudo -u hdfs yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-
examples.jar pi 10 10
Number of Maps  = 10
Samples per Map = 10
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
12/06/13 12:03:27 WARN conf.Configuration: session.id is deprecated.
Instead, use dfs.metrics.session-id
12/06/13 12:03:27 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
12/06/13 12:03:27 INFO util.NativeCodeLoader: Loaded the native-hadoop
library
12/06/13 12:03:27 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the
same.
12/06/13 12:03:28 INFO mapred.FileInputFormat: Total input paths to
process : 10
12/06/13 12:03:29 INFO mapred.JobClient: Running job: job_local_0001
12/06/13 12:03:29 INFO mapred.LocalJobRunner: OutputCommitter set in
config null
12/06/13 12:03:29 INFO mapred.LocalJobRunner: OutputCommitter is
org.apache.hadoop.mapred.FileOutputCommitter
12/06/13 12:03:29 WARN mapreduce.Counters: Group
org.apache.hadoop.mapred.Task$Counter is deprecated. Use
org.apache.hadoop.mapreduce.TaskCounter instead
12/06/13 12:03:29 INFO util.ProcessTree: setsid exited with exit code
0
12/06/13 12:03:29 INFO mapred.Task:  Using ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3d46e381
12/06/13 12:03:29 WARN mapreduce.Counters: Counter name
MAP_INPUT_BYTES is deprecated. Use FileInputFormatCounters as group
name and  BYTES_READ as counter name instead
12/06/13 12:03:29 INFO mapred.MapTask: numReduceTasks: 1
12/06/13 12:03:29 INFO mapred.MapTask: io.sort.mb = 100
12/06/13 12:03:30 INFO mapred.MapTask: data buffer = 79691776/99614720
12/06/13 12:03:30 INFO mapred.MapTask: record buffer = 262144/327680
12/06/13 12:03:30 INFO mapred.JobClient:  map 0% reduce 0%
12/06/13 12:03:35 INFO mapred.LocalJobRunner: Generated 95735000
samples.
12/06/13 12:03:36 INFO mapred.JobClient:  map 100% reduce 0%
12/06/13 12:03:38 INFO mapred.LocalJobRunner: Generated 151872000
samples.

Here is the content of yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>

  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>

  <property>
    <description>List of directories to store localized files in.</description>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/disk/yarn/local</value>
  </property>

  <property>
    <description>Where to store container logs.</description>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/disk/yarn/logs</value>
  </property>

  <property>
    <description>Where to aggregate logs to.</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/var/log/hadoop-yarn/apps</value>
  </property>

  <property>
    <description>Classpath for typical applications.</description>
    <name>yarn.application.classpath</name>
    <value>
      $HADOOP_CONF_DIR,
      $HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,
      $HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,
      $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,
      $YARN_HOME/*,$YARN_HOME/lib/*
    </value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>ihub-an-g1:8025</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>ihub-an-g1:8040</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>ihub-an-g1:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>ihub-an-g1:8141</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>ihub-an-g1:8088</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>/disk/mapred/jobhistory/intermediate/done</value>
  </property>
  <property>
 

Re: Yarn job runs in Local Mode even though the cluster is running in Distributed Mode

2012-06-13 Thread Marcos Ortiz
Can you share with us in pastebin all conf files that you are using for 
YARN?



On 06/13/2012 05:26 PM, anil gupta wrote:

Hi Marcus,

Sorry i forgot to mention that Job history server is installed and 
running and AFAIK resourcemanager is responsible for running MR jobs. 
Historyserver is only used to get info about MR jobs.


Thanks,
Anil

On Wed, Jun 13, 2012 at 2:04 PM, Marcos Ortiz mlor...@uci.cu 
mailto:mlor...@uci.cu wrote:


According to the CDH 4 official documentation, you should install
a JobHistory server for your MRv2 (YARN)
cluster.

https://ccp.cloudera.com/display/CDH4DOC/Deploying+MapReduce+v2+%28YARN%29+on+a+Cluster

How to configure the HistoryServer

https://ccp.cloudera.com/display/CDH4DOC/Deploying+MapReduce+v2+%28YARN%29+on+a+Cluster#DeployingMapReducev2%28YARN%29onaCluster-Step3





On 06/13/2012 03:16 PM, anil gupta wrote:

Hi All

I am using cdh4 for running a HBase cluster on CentOs6.0. I have 5
nodes in my cluster(2 Admin Node and 3 DN).
My resourcemanager is up and running and showing that all
three DN are
running the nodemanager. HDFS is also working fine and showing
3 DN's.

But when i fire the pi example job. It starts to run in Local
mode.
Here is the console output:
sudo -u hdfs yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-
examples.jar pi 10 10
Number of Maps  = 10
Samples per Map = 10
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
12/06/13 12:03:27 WARN conf.Configuration: session.id
http://session.id is deprecated.
Instead, use dfs.metrics.session-id
12/06/13 12:03:27 INFO jvm.JvmMetrics: Initializing JVM
Metrics with
processName=JobTracker, sessionId=
12/06/13 12:03:27 INFO util.NativeCodeLoader: Loaded the
native-hadoop
library
12/06/13 12:03:27 WARN mapred.JobClient: Use
GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the
same.
12/06/13 12:03:28 INFO mapred.FileInputFormat: Total input
paths to
process : 10
12/06/13 12:03:29 INFO mapred.JobClient: Running job:
job_local_0001
12/06/13 12:03:29 INFO mapred.LocalJobRunner: OutputCommitter
set in
config null
12/06/13 12:03:29 INFO mapred.LocalJobRunner: OutputCommitter is
org.apache.hadoop.mapred.FileOutputCommitter
12/06/13 12:03:29 WARN mapreduce.Counters: Group
org.apache.hadoop.mapred.Task$Counter is deprecated. Use
org.apache.hadoop.mapreduce.TaskCounter instead
12/06/13 12:03:29 INFO util.ProcessTree: setsid exited with
exit code
0
12/06/13 12:03:29 INFO mapred.Task:  Using
ResourceCalculatorPlugin :
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3d46e381
12/06/13 12:03:29 WARN mapreduce.Counters: Counter name
MAP_INPUT_BYTES is deprecated. Use FileInputFormatCounters as
group
name and  BYTES_READ as counter name instead
12/06/13 12:03:29 INFO mapred.MapTask: numReduceTasks: 1
12/06/13 12:03:29 INFO mapred.MapTask: io.sort.mb = 100
12/06/13 12:03:30 INFO mapred.MapTask: data buffer =
79691776/99614720
12/06/13 12:03:30 INFO mapred.MapTask: record buffer =
262144/327680
12/06/13 12:03:30 INFO mapred.JobClient:  map 0% reduce 0%
12/06/13 12:03:35 INFO mapred.LocalJobRunner: Generated 95735000
samples.
12/06/13 12:03:36 INFO mapred.JobClient:  map 100% reduce 0%
12/06/13 12:03:38 INFO mapred.LocalJobRunner: Generated 151872000
samples.

Here is the content of yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>

  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>

  <property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
  </property>

  <property>
    <description>List of directories to store localized files in.</description>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/disk/yarn/local</value>
  </property>

  <property>
    <description>Where to store container logs.</description>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/disk/yarn/logs</value>
  </property>

  <property>
    <description>Where to aggregate logs to.</description>

Re: override mapred-site.xml from command line

2012-06-07 Thread Marcos Ortiz



On 06/06/2012 07:44 PM, Sid Kumar wrote:
I am able to set it via the API. 
Configuration.setBoolean(mapred.output.compress,true). This works!


But the -D from the command line still doesn't work. Any idea what I 
may be missing here?


Some additional info - Also when I try running the -D on command line 
on a local cluster (pseudo distributed mode) it works, but when I try 
it on a fully distributed cluster running jobs from a client machine 
it doesn't work. Is there a different way for setting it in this case 
- in hadoop-env perhaps?


Thanks
Sid

On Wed, Jun 6, 2012 at 4:06 PM, Sid Kumar sqlsid...@gmail.com 
mailto:sqlsid...@gmail.com wrote:


Mayank,
I dont have a final tag for that property set. I looked at the
mapred-default.xml in the src/mapred folder and that doesn't have
a final tag too. Should I set it explicitly to false?


You should do it explicitly.
You should read the excellent blog post from Lars Francke, where he does a
great job explaining parameter by parameter and why it is recommendable to
set them to final.

http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html

Regards
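
(For the -D options to take effect, the job also has to go through
GenericOptionsParser, typically via ToolRunner; a minimal sketch, with a
hypothetical class name and paths:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class CompressedOutputJob extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // getConf() already carries any -D properties parsed by ToolRunner
        Job job = new Job(getConf(), "compressed output job");
        job.setJarByClass(CompressedOutputJob.class);
        // set Mapper/Reducer classes here as usual
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options (-D, -conf, -files, ...) before calling run()
        System.exit(ToolRunner.run(new Configuration(), new CompressedOutputJob(), args));
    }
}

Invoked as, for example:
hadoop jar myjob.jar CompressedOutputJob -D mapred.compress.map.output=true -D mapred.output.compression.codec=org.apache.hadoop.io.SnappyCodec input output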



Sid


On Wed, Jun 6, 2012 at 3:50 PM, Mayank Bansal may...@apache.org
mailto:may...@apache.org wrote:

Check your mapred site xml if these parameters have
finaltrue/final

making final to false should solve your problem.


On Wed, Jun 6, 2012 at 3:41 PM, Sid Kumar sqlsid...@gmail.com
mailto:sqlsid...@gmail.com wrote:

Hi,
I am trying to override mapred-site.xml (more specifically
mapred.compress.map.output
and mapred.output.compression.
codec) from the command line when I
execute the jar.
I have been using hadoop jar jarname class -
Dmapred.compress.map.output=true and
-Dmapred.output.compression.codec=org.apache.hadoop.io.SnappyCodec


The above doesnt work as the job.xml for the jar still
uses the default properties and not the one i specify
here. Is there a different approach to override these
properties. I am submitting jobs from a client machine
that has the same version of configuration files as my
cluster.

Thanks

Sid






--
Marcos Luis Ortíz Valmaseda
 Data Engineer  Sr. System Administrator at UCI
 http://marcosluis2186.posterous.com
 http://www.linkedin.com/in/marcosluis2186
 Twitter: @marcosluis2186




Re: HBase is able to connect to ZooKeeper but the connection closes immediately

2012-06-06 Thread Marcos Ortiz

Can you show us the code that you are developing?
Which HBase version are you using?

You should check if you are creating multiple HBaseConfiguration objects.
The approach here is to create one single HBaseConfiguration object and then

reuse it in all your code.

Regards
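
(A minimal sketch of what that looks like; the table, column family, and
qualifier names below are just placeholders:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseWriter {
    // One shared Configuration per JVM, so every HTable created from it
    // reuses the same underlying ZooKeeper connection
    private static final Configuration CONF = HBaseConfiguration.create();

    public void write(String tableName, String rowKey) throws Exception {
        HTable table = new HTable(CONF, tableName);
        try {
            Put put = new Put(Bytes.toBytes(rowKey));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
            table.put(put);
        } finally {
            table.close();
        }
    }
}

The same applies inside a MapReduce job: build the configuration once (for
example in setup()) and share it, instead of creating a new one per map() call.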


On 06/06/2012 10:25 AM, Manu S wrote:

Hi All,

We are running a mapreduce job in a fully distributed cluster.The 
output of the job is writing to HBase.


While running this job we are getting an error:
*Caused by: org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able 
to connect to ZooKeeper but the connection closes immediately. This could be a 
sign that the server has too many connections (30 is the default). Consider 
inspecting your ZK server logs for that error and then make sure you are 
reusing HBaseConfiguration as often as you can. See HTable's javadoc for more 
information.*
at 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.init(ZooKeeperWatcher.java:155)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:1002)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:304)
at 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.init(HConnectionManager.java:295)
at 
org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:157)
at org.apache.hadoop.hbase.client.HTable.init(HTable.java:169)
at 
org.apache.hadoop.hbase.client.HTableFactory.createHTableInterface(HTableFactory.java:36)

I had gone through some threads related to this issue and I modified 
the *zoo.cfg* accordingly. These configurations are same in all the nodes.

Please find the configuration of HBase  ZooKeeper:

Hbase-site.xml:

<configuration>

  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>

  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode/hbase</value>
  </property>

  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>namenode</value>
  </property>

</configuration>


Zoo.cfg:

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/var/zookeeper
# the port at which the clients will connect
clientPort=2181
#server.0=localhost:2888:3888
server.0=namenode:2888:3888

# Max Client connections ###
*maxClientCnxns=1000
minSessionTimeout=4000
maxSessionTimeout=4*


It would be really great if anyone can help me to resolve this issue 
by giving your thoughts/suggestions.


Thanks,
Manu S


--
Marcos Luis Ortíz Valmaseda
 Data Engineer  Sr. System Administrator at UCI
 http://marcosluis2186.posterous.com
 http://www.linkedin.com/in/marcosluis2186
 Twitter: @marcosluis2186




Re: No space left on device

2012-05-28 Thread Marcos Ortiz

Do you have the JT and NN on the same node?
Look here at Lars Francke's post:
http://gbif.blogspot.com/2011/01/setting-up-hadoop-cluster-part-1-manual.html
It is a very good schema for installing Hadoop; look at the configuration
that he used for the name and data directories.
If these directories are on the same disk and you don't have enough
space, you can hit that exception.


My recommendation is to split these directories across separate disks, with a
schema very similar to Lars's configuration.

Another recommendation is to check Hadoop's logs. Read about this here:
http://www.cloudera.com/blog/2010/11/hadoop-log-location-and-retention/

Regards

On 05/28/2012 02:20 AM, yingnan.ma wrote:

ok,I find it. the jobtracker server is full.


2012-05-28



yingnan.ma



From: yingnan.ma
Sent: 2012-05-28  13:01:56
To: common-user
Cc:
Subject: No space left on device

Hi,
I encounter a problem as following:
  Error - Job initialization failed:
org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
  at 
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:201)
 at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
 at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
 at java.io.FilterOutputStream.close(FilterOutputStream.java:140)
 at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
 at 
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
 at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.close(ChecksumFileSystem.java:348)
 at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
 at 
org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
 at 
org.apache.hadoop.mapred.JobHistory$JobInfo.logSubmitted(JobHistory.java:1344)
 ..
So, I think that the HDFS is full or something, but I cannot find a way to 
address the problem, if you had some suggestion, Please show me , thank you.
Best Regards


--
Marcos Luis Ortíz Valmaseda
 Data Engineer  Sr. System Administrator at UCI
 http://marcosluis2186.posterous.com
 http://www.linkedin.com/in/marcosluis2186
 Twitter: @marcosluis2186




Re: EOFException at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)......

2012-05-25 Thread Marcos Ortiz


Regards, waqas. I think that you should ask the MapR experts about that.


On 05/25/2012 05:42 AM, waqas latif wrote:

Hi Experts,

I am fairly new to hadoop MapR and I was trying to run a matrix
multiplication example presented by Mr. Norstadt under following link
http://www.norstad.org/matrix-multiply/index.html. I can run it
successfully with hadoop 0.20.2 but I tried to run it with hadoop 1.0.3 but
I am getting following error. Is it the problem with my hadoop
configuration or it is compatibility problem in the code which was written
in hadoop 0.20 by author.Also please guide me that how can I fix this error
in either case. Here is the error I am getting.

The same code that you wrote for 0.20.2 should work in 1.0.3 too.
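
(For what it's worth, the stable SequenceFile reader API is unchanged between
the two releases; a minimal dump program, with the path passed as an argument,
would look roughly like this in either 0.20.2 or 1.0.3:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SeqFileDump {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);   // e.g. one of the part-* files produced by the job
        FileSystem fs = path.getFileSystem(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            // Instantiate key/value objects of whatever types the file was written with
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}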



in thread main java.io.EOFException
 at java.io.DataInputStream.readFully(DataInputStream.java:180)
 at java.io.DataInputStream.readFully(DataInputStream.java:152)
 at
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
 at
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1486)
 at
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1475)
 at
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1470)
 at TestMatrixMultiply.fillMatrix(TestMatrixMultiply.java:60)
 at TestMatrixMultiply.readMatrix(TestMatrixMultiply.java:87)
 at TestMatrixMultiply.checkAnswer(TestMatrixMultiply.java:112)
 at TestMatrixMultiply.runOneTest(TestMatrixMultiply.java:150)
 at TestMatrixMultiply.testRandom(TestMatrixMultiply.java:278)
 at TestMatrixMultiply.main(TestMatrixMultiply.java:308)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Thanks in advance

Regards,
waqas

Can you post the complete log for this here?
Best wishes

--
Marcos Luis Ortíz Valmaseda
 Data Engineer  Sr. System Administrator at UCI
 http://marcosluis2186.posterous.com
 http://www.linkedin.com/in/marcosluis2186
 Twitter: @marcosluis2186




Re: While Running in cloudera version of hadoop getting error

2012-05-24 Thread Marcos Ortiz

Why don't you use the same Hadoop version in both clusters?
It will bring you fewer troubles.


On 05/24/2012 02:26 PM, samir das mohapatra wrote:

Hi
   I created application jar and i was trying to run in 2 node cluster using
cludera  .20 version  , it was running fine,
But when i am running that same jar in Deployment server (Cloudera version
.20.x ) having 40 node cluster I am getting error

cloude any one please help me with this.

12/05/24 09:39:09 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the same.

As it says here, you should implement Tool in your MapReduce job


12/05/24 09:39:10 INFO mapred.FileInputFormat: Total input paths to process
: 1

12/05/24 09:39:10 INFO mapred.JobClient: Running job: job_201203231049_12426

12/05/24 09:39:11 INFO mapred.JobClient:  map 0% reduce 0%

12/05/24 09:39:20 INFO mapred.JobClient: Task Id :
attempt_201203231049_12426_m_00_0, Status : FAILED

java.lang.RuntimeException: Error in configuring object

 at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)

 at
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)

 at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)

 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387)

 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)

 at org.apache.hadoop.mapred.Child$4.run(Child.java:270)

 at java.security.AccessController.doPrivileged(Native Method)

 at javax.security.auth.Subject.doAs(Subject.java:396)

 at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)

 at org.apache.hadoop.mapred.Child.main(Child.java:264)

Caused by: java.lang.reflect.InvocationTargetException

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)

 at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.jav

attempt_201203231049_12426_m_00_0: getDefaultExtension()

12/05/24 09:39:20 INFO mapred.JobClient: Task Id :
attempt_201203231049_12426_m_01_0, Status : FAILED



Thanks

samir





--
Marcos Luis Ortíz Valmaseda
 Data Engineer  Sr. System Administrator at UCI
 http://marcosluis2186.posterous.com
 http://www.linkedin.com/in/marcosluis2186
 Twitter: @marcosluis2186




Re: Is it okay to upgrade from CDH3U2 to hadoop 1.0.2 and hbase 0.92.1?

2012-05-21 Thread Marcos Ortiz
I think that you should follow the CDH4 Beta 2 docs, specifically the
known issues for this version:

https://ccp.cloudera.com/display/CDH4B2/Known+Issues+and+Work+Arounds+in+CDH4

Then, you should see the HBase installation and upgrading on this version:
https://ccp.cloudera.com/display/CDH4B2/HBase+Installation#HBaseInstallation-InstallingHBase

Another thing to keep in mind is that with HBase 0.92.1 you need a full
cluster restart, because the wire protocol changed from 0.90 to 0.92, so
rolling restarts do not work here.


Best wishes

On 05/21/2012 10:44 PM, edward choi wrote:

Hi,
I have used CDH3U2 for almost a year now. Since it is a quite old
distribution, there are certain glitches that keep bothering me.
So I was considering upgrading to Hadoop 1.0.3 and Hbase 0.92.1.

My concern is that, if it is okay to just install the new packages and set
the configurations the same as before?
Or do I need to download all the files on HDFS to local hard drive and
upload them again once the new packages are installed? (that would be a
horrible job to do though)
Any advice will be helpful.
Thanks.

Ed




--
Marcos Luis Ortíz Valmaseda
 Data Engineer  Sr. System Administrator at UCI
 http://marcosluis2186.posterous.com
 http://www.linkedin.com/in/marcosluis2186
 Twitter: @marcosluis2186




Re: namenode directory disappear after machines restart

2012-05-21 Thread Marcos Ortiz
This is the usual behavior on Unix/Linux systems. When you restart the
system, the content of the /tmp directory is cleaned, because the purpose of
that directory is precisely to hold temporary files.
For that reason, the data directory for the HDFS filesystem should be
somewhere else, /var/hadoop/data for example; that is, a directory that
survives reboots.
So, you should change your dfs.name.dir and dfs.data.dir variables
in your hdfs-site.xml.


Regards

On 05/21/2012 11:21 PM, Brendan cheng wrote:

Hi,
I'm not sure if there is a setting to avoid the Namenode removed after hosting 
machine of Namenode restart.I found that after successfully installed single 
node pseudo distributed hadoop following from your website, the name node dir 
/tmp/hadoop-brendan/dfs/name are removed if machine reboot.
What do I miss?
Brendan
2012-05-22 11:14:05,678 INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory 
/tmp/hadoop-brendan/dfs/name does not exist.2012-05-22 11:14:05,680 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization 
failed.org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory 
/tmp/hadoop-brendan/dfs/name is in an inconsistent state: storage directory does not exist or is not 
accessible.at 
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:303)   at 
org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100) at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:362)  at 
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)at 
org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:496)  at 
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)   at 
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)2012-05-22 11:14:05,685 ERROR 
org.apache.hadoop.hdfs.server.namenode.NameNode: 
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory 
/tmp/hadoop-brendan/dfs/name is in an inconsistent state: storage directory does not exist or is not 
accessible. at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:303)   
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100) at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:362)  at 
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)at 
org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:496)  at 
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)   at 
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
 


--
Marcos Luis Ortíz Valmaseda
 Data Engineer  Sr. System Administrator at UCI
 http://marcosluis2186.posterous.com
 http://www.linkedin.com/in/marcosluis2186
 Twitter: @marcosluis2186





Re: hadoop on fedora 15

2012-04-26 Thread Marcos Ortiz



On 04/26/2012 01:49 AM, john cohen wrote:

I had the same issue.  My problem was the use of VPN
connected to work, and at the same time working
with M/R jobs on my Mac.  It occurred to me that
maybe Hadoop was binding to the wrong IP (the IP
given to you after connecting through VPN),
bottom line, I disconnect from the VPN, and the M/R job
finished as expected after that.

This makes sense because, once you connect to the VPN, your machine gets a
different IP, assigned by the private network. You can test this by changing
the configured IPs to the new ones assigned over the VPN.

--
Marcos Luis Ortíz Valmaseda (@marcosluis2186)
 Data Engineer at UCI
 http://marcosluis2186.posterous.com




Re: unable to resolve the heap space error even when running the examples

2012-04-12 Thread Marcos Ortiz

Can you show us the logs of your NN/DN?

On 04/12/2012 03:28 AM, SRIKANTH KOMMINENI (RIT Student) wrote:

Tried that it didn't work for a lot of combinations of values

On Thu, Apr 12, 2012 at 3:25 AM, Mapred Learn mapred.le...@gmail.com 
mailto:mapred.le...@gmail.com wrote:


Try exporting HADOOP_HEAPSIZE to bigger value like 1500 (1.5 gb)
before running program or change it in hadoop-env.sh

If still gives error, u can try with bigger value.

Sent from my iPhone

On Apr 12, 2012, at 12:10 AM, SRIKANTH KOMMINENI (RIT Student)
sxk7...@rit.edu mailto:sxk7...@rit.edu wrote:


Hello,

I have searched a lot and still cant find any solution that can
fix my problem.

I am using the the basic downloaded version of hadoop-1.0.2 and I
have edited only what has been asked in the setup page of hadoop
and I have set it up to work in a pseudo random distributed mode.

My JAVA_HOME is set to /usr/lib/jvm/java-6-sun, I tried editing
the heap size in hadoop-env.sh that didn't work. I tried setting
the CHILD_OPTS that didn't work, I found that there was another
hadoop-env.sh in /etc/hadoop/ as per the recommendations in the
mailing list archives that didn't work . I tried increasing the
io.sort.mb that didn't work. I am totally frustrated but it
still doesn't work.please help.


-- 
Srikanth Kommineni,

Graduate Assistant,
Dept of Computer Science,
Rochester Institute of Technology.




-- 
Srikanth Kommineni,

Graduate Assistant,
Dept of Computer Science,
Rochester Institute of Technology.




-- 
Srikanth Kommineni,

Graduate Assistant,
Dept of Computer Science,
Rochester Institute of Technology.





--
Srikanth Kommineni,
Graduate Assistant,
Dept of Computer Science,
Rochester Institute of Technology.



--
Marcos Luis Ortíz Valmaseda (@marcosluis2186)
 Data Engineer at UCI
 http://marcosluis2186.posterous.com




Yahoo Hadoop Tutorial with new APIs?

2012-04-04 Thread Marcos Ortiz

Regards to all the list.
There are many people that use the Hadoop Tutorial released by Yahoo at 
http://developer.yahoo.com/hadoop/tutorial/ 
http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
The main issue here is that this tutorial is written with the old APIs
(Hadoop 0.18, I think).
Is there a project to update this tutorial to the new APIs, for Hadoop
1.0.2 or YARN (Hadoop 0.23)?


Best wishes

--
Marcos Luis Ortíz Valmaseda (@marcosluis2186)
 Data Engineer at UCI
 http://marcosluis2186.posterous.com





Re: Yahoo Hadoop Tutorial with new APIs?

2012-04-04 Thread Marcos Ortiz



On 04/04/2012 09:15 AM, Jagat Singh wrote:

Hello Marcos

Yes, Yahoo tutorials are pretty old, but they still explain the concepts of
MapReduce and HDFS beautifully. The way the tutorials have been divided into
subsections, each building on the previous one, is awesome. I remember when I
started I was digging in there for many days. The tutorials are now lagging
from the new API point of view.
Yes, precisely because of that quality this tutorial is read by many Hadoop
newcomers, so I think it needs an update.


Let's have a documentation session one day; I would love to volunteer to
update those tutorials if the people at Yahoo take input from the outside
world :)
I want to help with this too, so we need to talk with the Hadoop colleagues
to do it.

Regards and best wishes


Regards,

Jagat


- Original Message -

From: Marcos Ortiz

Sent: 04/04/12 08:32 AM

To: common-user@hadoop.apache.org, 'hdfs-u...@hadoop.apache.org', 
mapreduce-u...@hadoop.apache.org


Subject: Yahoo Hadoop Tutorial with new APIs?


Regards to all the list.
There are many people that use the Hadoop Tutorial released by Yahoo 
at http://developer.yahoo.com/hadoop/tutorial/ 
http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
The main issue here is that, this tutorial is written with the old 
APIs? (Hadoop 0.18 I think).
Is there a project for update this tutorial to the new APIs? to 
Hadoop 1.0.2 or YARN (Hadoop 0.23)


Best wishes

--
Marcos Luis Ortíz Valmaseda (@marcosluis2186)
  Data Engineer at UCI
  http://marcosluis2186.posterous.com  


http://www.uci.cu/








--
Marcos Luis Ortíz Valmaseda (@marcosluis2186)
 Data Engineer at UCI
 http://marcosluis2186.posterous.com




Re: opensuse 12.1

2012-04-04 Thread Marcos Ortiz
Since OpenSUSE is an RPM-based distribution, you can try the Apache
Bigtop project [1]: look for the RPM packages and give them a try.
Note that the RPM specification changes a little between OpenSUSE and Red
Hat-based distributions, but it can be a starting point.

See the documentation for the project [2].

[1] http://incubator.apache.org/projects/bigtop.html
[2] 
https://cwiki.apache.org/confluence/display/BIGTOP/Index%3bjsessionid=AA31645DFDAE1F3282D0159DB9B6AE9A


Regards

On 04/04/2012 12:24 PM, Raj Vishwanathan wrote:

Lots of people seem to start with this.

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ 



Raj





From: Barry, Sean Fsean.f.ba...@intel.com
To: common-user@hadoop.apache.orgcommon-user@hadoop.apache.org
Sent: Wednesday, April 4, 2012 9:12 AM
Subject: FW: opensuse 12.1



-Original Message-
From: Barry, Sean F [mailto:sean.f.ba...@intel.com]
Sent: Wednesday, April 04, 2012 9:10 AM
To: common-user@hadoop.apache.org
Subject: opensuse 12.1

 What is the best way to install hadoop on opensuse 12.1 for a 
small two node cluster.

-SB








--
Marcos Luis Ortíz Valmaseda (@marcosluis2186)
 Data Engineer at UCI
 http://marcosluis2186.posterous.com




Re: Yahoo Hadoop Tutorial with new APIs?

2012-04-04 Thread Marcos Ortiz
Ok, Robert, I will be waiting for you then. There are many folks that
use this tutorial, so I think this is a good effort in favor of the Hadoop
community. It would be nice if Yahoo! donated this work, because I have some
ideas beyond it, for example: releasing a Spanish version of the tutorial.

Regards and best wishes

On 04/04/2012 05:29 PM, Robert Evans wrote:
I am dropping the cross posts and leaving this on common-user with the 
others BCCed.


Marcos,

That is a great idea to be able to update the tutorial, especially if 
the community is interested in helping to do so.  We are looking into 
the best way to do this.  The idea right now is to donate this to the 
Hadoop project so that the community can keep it up to date, but we 
need some time to jump through all of the corporate hoops to get this 
to happen.  We have a lot going on right now, so if you don't see any 
progress on this please feel free to ping me and bug me about it.


--
Bobby Evans


On 4/4/12 8:15 AM, Jagat Singh jagatsi...@gmail.com wrote:

Hello Marcos

 Yes , Yahoo tutorials are pretty old but still they explain the
concepts of Map Reduce , HDFS beautifully. The way in which
tutorials have been defined into sub sections , each builing on
previous one is awesome. I remember when i started i was digged in
there for many days. The tutorials are lagging now from new API
point of view.

 Lets have some documentation session one day , I would love to
Volunteer to update those tutorials if people at Yahoo take input
from outside world :)

 Regards,

 Jagat

- Original Message -
From: Marcos Ortiz
Sent: 04/04/12 08:32 AM
To: common-user@hadoop.apache.org, 'hdfs-u...@hadoop.apache.org
%27hdfs-u...@hadoop.apache.org', mapreduce-u...@hadoop.apache.org
Subject: Yahoo Hadoop Tutorial with new APIs?

Regards to all the list.
 There are many people that use the Hadoop Tutorial released by
Yahoo at http://developer.yahoo.com/hadoop/tutorial/
http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
The main issue here is that, this tutorial is written with the old
APIs? (Hadoop 0.18 I think).
 Is there a project for update this tutorial to the new APIs? to
Hadoop 1.0.2 or YARN (Hadoop 0.23)

 Best wishes
 -- Marcos Luis Ortíz Valmaseda (@marcosluis2186) Data Engineer at
UCI http://marcosluis2186.posterous.com
http://www.uci.cu/



http://www.uci.cu/


--
Marcos Luis Ortíz Valmaseda (@marcosluis2186)
 Data Engineer at UCI
 http://marcosluis2186.posterous.com






Re: How can I build a Collaborative Filtering recommendation framework based on mapreduce

2012-03-31 Thread Marcos Ortiz
Mahout is built precisely for this, so I think that you can evaluate it again.
It has two kinds of collaborative filtering recommenders:

- Non-distributed recommenders (Taste)
https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+Documentation

- Distributed recommenders (Item-based)
https://cwiki.apache.org/confluence/display/MAHOUT/Itembased+Collaborative+Filtering

- First-timer FAQ
https://cwiki.apache.org/confluence/display/MAHOUT/Recommender+First-Timer+FAQ
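
(To give a feel for the non-distributed (Taste) side, a minimal user-based
recommender sketch, where the CSV path, neighborhood size, and user id are
just placeholders, looks roughly like this:)

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv lines: userID,itemID,preference
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top 5 recommendations for user 1
        List<RecommendedItem> items = recommender.recommend(1, 5);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}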

About the test that you did with Mahout:
- What are the specs of your machine?
If you are working with 175M of data, a single machine
is not the best way to do it. It is more worthwhile to use a
small Hadoop cluster for this (1 NN/JT and 3 DN/TT), and then
you can ask on the Mahout mailing list how to improve the performance
of your system.

Regards

On 3/31/2012 6:17 AM, chao yin wrote:
 Hi all:
 I'm new to mapreduce, but familiar with Collaborative Filtering 
 recommendation framework.
 I tried to use mahout to do this work. But it disappointed me. My 
 machine work all day to do this job without any result with about 175M data.
 Is there anyone knows anything about Collaborative Filtering 
 recommendation framework based on mapreduce, or mahout, any suggestion 
 to improve performance?
 
 -- 
 Best regards,
 Yin

-- 
Marcos Luis Ortíz Valmaseda (@marcosluis2186)
 Data Engineer at UCI
 http://marcosluis2186.posterous.com



Re: Job tracker service start issue.

2012-03-26 Thread Marcos Ortiz



On 03/23/2012 06:57 AM, kasi subrahmanyam wrote:

Hi Oliver,

I am not sure my suggestion might solve your problem or it might be already
solved on your side.
It seems the task tracker is having a problem accessing the tmp directory.
Try going to the core and mapred site xml and change the tmp directory to a
new one.
If this is not yet working, then manually change the permissions of that
directory using:
chmod -R 777 tmp
Please, don't do chmod -R 777 on the tmp directory. It's not recommended
for production servers.

The first option is wiser:
1- change the tmp directory in the core and mapred site files
2- chown this new directory to the hadoop group, where the mapred and
hdfs users are


On Fri, Mar 23, 2012 at 3:33 PM, Olivier Sallouolivier.sal...@irisa.frwrote:



On 3/23/12 8:50 AM, Manish Bhoge wrote:

I have Hadoop running on a standalone box. When I start the daemons for
the namenode, secondarynamenode, job tracker, task tracker and data node, everything

starts gracefully. But soon after it starts the job tracker, the

job tracker service doesn't show up. When I run 'jps' it shows me all the
services, including the task tracker, except the Job Tracker.

Is there any time limit that needs to be set up, or is it going into safe
mode? Because this is what the job tracker log shows; it looks
like it is starting, but it shuts down soon after:

2012-03-22 23:26:04,061 INFO org.apache.hadoop.mapred.JobTracker:

STARTUP_MSG:

/
STARTUP_MSG: Starting JobTracker
STARTUP_MSG:   host = manish/10.131.18.119
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.2-cdh3u3
STARTUP_MSG:   build =

file:///data/1/tmp/nightly_2012-02-16_09-46-24_3/hadoop-0.20-0.20.2+923.195-1~maverick
-r 217a3767c48ad11d4632e19a22897677268c40c4; compiled by 'root' on Thu Feb
16 10:22:53 PST 2012

/
2012-03-22 23:26:04,140 INFO

org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
Updating the current master key for generating delegation tokens

2012-03-22 23:26:04,141 INFO

org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
Starting expired delegation token remover thread,
tokenRemoverScanInterval=60 min(s)

2012-03-22 23:26:04,141 INFO

org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
Updating the current master key for generating delegation tokens

2012-03-22 23:26:04,142 INFO org.apache.hadoop.mapred.JobTracker:

Scheduler configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT,
limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)

2012-03-22 23:26:04,143 INFO org.apache.hadoop.util.HostsFileReader:

Refreshing hosts (include/exclude) list

2012-03-22 23:26:04,186 INFO org.apache.hadoop.mapred.JobTracker:

Starting jobtracker with owner as mapred

2012-03-22 23:26:04,201 INFO org.apache.hadoop.ipc.Server: Starting

Socket Reader #1 for port 54311

2012-03-22 23:26:04,203 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:

Initializing RPC Metrics with hostName=JobTracker, port=54311

2012-03-22 23:26:04,206 INFO

org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC Metrics
with hostName=JobTracker, port=54311

2012-03-22 23:26:09,250 INFO org.mortbay.log: Logging to

org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
org.mortbay.log.Slf4jLog

2012-03-22 23:26:09,298 INFO org.apache.hadoop.http.HttpServer: Added

global filtersafety
(class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)

2012-03-22 23:26:09,318 INFO org.apache.hadoop.http.HttpServer: Port

returned by webServer.getConnectors()[0].getLocalPort() before open() is
-1. Opening the listener on 50030

2012-03-22 23:26:09,318 INFO org.apache.hadoop.http.HttpServer:

listener.getLocalPort() returned 50030
webServer.getConnectors()[0].getLocalPort() returned 50030

2012-03-22 23:26:09,318 INFO org.apache.hadoop.http.HttpServer: Jetty

bound to port 50030

2012-03-22 23:26:09,319 INFO org.mortbay.log: jetty-6.1.26.cloudera.1
2012-03-22 23:26:09,517 INFO org.mortbay.log: Started

SelectChannelConnector@0.0.0.0:50030

2012-03-22 23:26:09,519 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:

Initializing JVM Metrics with processName=JobTracker, sessionId=

2012-03-22 23:26:09,519 INFO org.apache.hadoop.mapred.JobTracker:

JobTracker up at: 54311

2012-03-22 23:26:09,519 INFO org.apache.hadoop.mapred.JobTracker:

JobTracker webserver: 50030

2012-03-22 23:26:09,648 WARN org.apache.hadoop.mapred.JobTracker: Failed

to operate on mapred.system.dir
(hdfs://localhost:54310/app/hadoop/tmp/mapred/system) because of
permissions.

2012-03-22 23:26:09,648 WARN org.apache.hadoop.mapred.JobTracker: This

directory should be owned by the user 'mapred (auth:SIMPLE)'

2012-03-22 23:26:09,650 WARN org.apache.hadoop.mapred.JobTracker:

Bailing out ...

org.apache.hadoop.security.AccessControlException: The systemdir


Apache Hadoop works with IPv6?

2012-03-21 Thread Marcos Ortiz

Regards.
I'm very interested to know if Apache Hadoop works with IPv6 hosts. One 
of my clients
has some hosts with this feature and they want to know if Hadoop 
supports this.

Has anyone tested this?

Best wishes

--
Marcos Luis Ortíz Valmaseda (@marcosluis2186)
 Data Engineer at UCI
 http://marcosluis2186.posterous.com




Re: Reduce copy speed too slow

2012-03-20 Thread Marcos Ortiz

Hi, Gayatri


On 03/20/2012 11:59 AM, Gayatri Rao wrote:

Hi all,

I am running a MapReduce job on EC2 instances and it seems to be very
slow. It takes many hours even for simple projection and aggregation of
data.

What filesystem are you using for data storage: HDFS in EC2 or Amazon S3?
What is the size of the data that you are analyzing?


Upon observation, I gathered that the reduce copy speed is 0.01 MB/sec. I
am new to Hadoop. Could anyone please share insights about what reduce
copy speeds
are good to work with? If anyone has experience, any tips on improving
it would be appreciated.
Hadoop Map/Reduce jobs shuffle lots of data, so the recommended 
configuration is to use 10 Gbps networks for
the underlying connection (and dedicated switches on dual-gigabit networks).

Remember too that Hadoop is not a real-time system; if you need 
real-time random access to your data, use HBase:
real-time random access to your data, use HBase

http://hbase.apache.org

Regards


Thanks
Gayatri




--
Marcos Luis Ortíz Valmaseda (@marcosluis2186)
 Data Engineer at UCI
 http://marcosluis2186.posterous.com




Re: Retry question

2012-03-18 Thread Marcos Ortiz

HDFS is built precisely with these concerns in mind.
If you are reading a 60 GB file and a rack goes down, the system
will transparently serve you another copy, based on your
replication factor.
A block can also become unavailable due to corruption; in that case
it can be re-replicated to other live machines, and you can detect and fix
the error with the fsck utility.
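For what it's worth, a rough sketch of the equivalent read through the Java API is
below (libhdfs is a JNI wrapper around this same client, so the behaviour is the
same). The NameNode URI and the path are placeholders; the point is that the
failover to another replica happens inside the client, with no extra code on the
caller's side.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hdfs://namenode:8020 and /data/bigfile are placeholders
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path("/data/bigfile"));
            // If the DataNode serving the current block dies, the client picks
            // another replica by itself; the caller just sees a normal stream.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}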

Regards

On 3/18/2012 9:46 AM, Rita wrote:

My replication factor is 3. If I were reading data through libhdfs using
C, is there a retry method? I am reading a 60 GB file; what would
happen if a rack goes down and the next block isn't available? Will the
API retry? Is there a way to configure this option?


--
--- Get your facts first, then you can distort them as you please.--


--
Marcos Luis Ortíz Valmaseda (@marcosluis2186)
 Data Engineer at UCI
 http://marcosluis2186.posterous.com



Re: Best practice to setup Sqoop,Pig and Hive for a hadoop cluster ?

2012-03-15 Thread Marcos Ortiz



On 03/15/2012 09:22 AM, Manu S wrote:

Thanks a lot Bijoy, that makes sense :)

Suppose I have a MySQL database on some other node (not in the Hadoop 
cluster); can I import its tables into my HDFS using Sqoop?

Yes, this is the main purpose of Sqoop.
On the Cloudera site you have the complete documentation for it:

Sqoop User Guide
http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html

Sqoop installation
https://ccp.cloudera.com/display/CDHDOC/Sqoop+Installation

Sqoop for MySQL
http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html#_mysql

Sqoop site on GitHub
http://github.com/cloudera/sqoop

Cloudera blog related post to Sqoop
http://www.cloudera.com/blog/category/sqoop/


Best wishes




On Thu, Mar 15, 2012 at 6:27 PM, Bejoy Ks bejoy.had...@gmail.com 
mailto:bejoy.had...@gmail.com wrote:


Hi Manu
 Please find my responses inline

I had read that we can install Pig, Hive & Sqoop on the client
node, with no
need to install them in the cluster. What is the client node actually?
Can I use
my management-node as a client?

On larger clusters we have a separate node that is outside the Hadoop
cluster, and
these tools stay there, so user programs are triggered from this
node.
This is the node referred to as the client node / edge node, etc. For your
cluster, the management node and the client node can be the same.

What is the best practice to install Pig, Hive & Sqoop?

On a client node

For the fully distributed cluster, do we need to install Pig,
Hive & Sqoop
on each node?

No; they can be on a client node or on any of the nodes.

MySQL is needed for Hive as a metastore, and Sqoop can import a
MySQL database
into HDFS or Hive or Pig, so can we make use of a MySQL DB residing on
another node?
Regarding your first point, Sqoop import has a different purpose:
to get
data from an RDBMS into HDFS. The metastore, on the other hand, is used by Hive in
framing
the MapReduce jobs corresponding to your Hive query, so here Sqoop
can't help
you much.
I recommend having the Hive metastore DB on the same node where
Hive is
installed, since executing Hive queries requires a lot of metadata
lookups,
especially when your table has a large number of partitions.

Regards
Bejoy.K.S

On Thu, Mar 15, 2012 at 5:34 PM, Manu S manupk...@gmail.com
mailto:manupk...@gmail.com wrote:

 Greetings All !!!

 I am using Cloudera CDH3 for Hadoop deployment. We have 7 nodes,
in which 5
 are used for a fully distributed cluster, 1 for
pseudo-distributed  1 as
 management-node.

 Fully distributed cluster: HDFS, Mapreduce  Hbase cluster
 Pseudo distributed mode: All

 I had read about we can install Pig, hive  Sqoop on the client
node, no
 need to install it in cluster. What is the client node actually?
Can I use
 my management-node as a client?

 What is the best practice to install Pig, Hive,  Sqoop?
 For the fully distributed cluster do we need to install Pig,
Hive,  Sqoop
 in each nodes?

 Mysql is needed for Hive as a metastore and sqoop can import
mysql database
 to HDFS or hive or pig, so can we make use of mysql DB's residing on
 another node?

 --
 Thanks  Regards
 
 Manu S
 SI Engineer - OpenSource  HPC
 Wipro Infotech
 Mob: +91 8861302855Skype: manuspkd
 www.opensourcetalk.co.in http://www.opensourcetalk.co.in





--
Thanks  Regards

Manu S
SI Engineer - OpenSource  HPC
Wipro Infotech
Mob: +91 8861302855Skype: manuspkd
www.opensourcetalk.co.in http://www.opensourcetalk.co.in





--
Marcos Luis Ortíz Valmaseda
 Sr. Software Engineer (UCI)
 http://marcosluis2186.posterous.com
 http://postgresql.uci.cu/blog/38





Re: Error while using libhdfs C API

2012-03-09 Thread Marcos Ortiz

  
  


On 03/09/2012 07:34 AM, Amritanshu Shekhar wrote:

  
  
  
  
  
Hi Marcos,
I figured out the compilation issue. It was due to the error.h header file, which
was not used and not present in the distribution.
There is one small issue, however. I was trying to test an HDFS read. I
copied an input file to /user/inputData (this can be listed
using bin/hadoop dfs -ls /user/inputData). The hdfsExists
call fails for this directory; however, it works when I copy
my file to /tmp. Is it because HDFS only recognizes /tmp
as a valid dir? So I was wondering what directory
structure HDFS recognizes by default, and if we can
override it through a conf variable, what that variable
would be and where to set it.
Thanks,
Amritanshu
  

Awesome, Amritanshu. CC'ing hdfs-user@hadoop.apache.org.
Please share some logs/details about how you solved the
compilation issue, so that we have it in the mailing list archives.

About your other issue:
1- Did you check that $HADOOP_USER has access to
/user/inputData?

HDFS:
It recognizes the directories that you entered in
hdfs-site.xml via the dfs.name.dir (NN) and
dfs.data.dir (DN) properties, but by default it works under the /tmp directory (not
recommended in production). Look at Eugene Ciurana's Refcard
called Deploying Hadoop, where he does an amazing job of explaining
some tricky configuration tips in a few pages.
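To rule out a path or permission problem independently of the C API, the same
check can be done from the Java side. A rough sketch (the NameNode URI is a
placeholder; the path is the one from your mail):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PathCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // the NameNode URI is a placeholder; use the one from your core-site.xml
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path p = new Path("/user/inputData");
        if (fs.exists(p)) {
            FileStatus status = fs.getFileStatus(p);
            System.out.println("owner: " + status.getOwner()
                    + "  group: " + status.getGroup()
                    + "  permissions: " + status.getPermission());
        } else {
            System.out.println(p + " is not visible with this user/URI");
        }
    }
}

If the path shows up here but hdfsExists still fails, the problem is more likely
in how the C program picks up its configuration (CLASSPATH/HADOOP_CONF_DIR) than
in HDFS itself.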

Regards
 

  



  
From:
        Marcos Ortiz [mailto:mlor...@uci.cu] 
Sent: Wednesday, March 07, 2012 7:36 PM
To: Amritanshu Shekhar
Subject: Re: Error while using libhdfs C API
  



  
  On 03/07/2012 01:15 AM, Amritanshu Shekhar wrote: 
Hi Marcos,
Thanks for the quick reply. Actually, I am using a gmake build system
where the library is being linked as a static library (.a)
rather than a shared object. It seems strange, since stderr
is a standard symbol which should be resolved. Currently I
am using the version that came with the
distribution ($HOME/c++/Linux-amd64-64/lib/libhdfs.a). I
tried building the library from source, but there were
build dependencies that could not be resolved. I tried
building $HOME/hadoop/hdfs/src/c++/libhdfs by running:
./configure
./make
I got a lot of dependency errors, so I gave up on that effort. If you happen
to have a working application that makes use of libhdfs,
please let me know. Any input would be welcome, as I have
hit a roadblock as far as libhdfs is concerned.
Thanks,
Amritanshu
No, Amritanshu. I don't have
any examples of the use of the libhdfs API, but I remember
that some folks were using it. Search the mailing list
archives (http://www.search-hadoop.com).
Can you post the errors that you got on your system when you
tried to compile the library?
Regards and best wishes



  
From:
        Marcos Ortiz [mailto:mlor...@uci.cu]

Sent: Monday, March 05, 2012 6:51 PM
To: hdfs-user@hadoop.apache.org
Cc: Amritanshu Shekhar
Subject: Re: Error while using libhdfs C API
  


Which platform are you using?
  Did you update the dynamic linker runtime bindings (ldconfig)?
  
  ldconfig $HOME/hadoop/c++/Linux-amd64/lib
  Regards
  
  On 03/06/2012 02:38 AM, Amritanshu Shekhar wrote: 
Hi,
I was trying to link 64 bit libhdfs in my
  application program but it seems there is an issue with this
  library. Get the following error:
Undefined first
  referenced
symbol in file
stderr
  libhdfs.a(hdfs.o)
__errno_location
  libhdfs.a(hdfs.o)
ld: fatal: Symbol referencing errors. No
  output written to ../../bin/sun86/mapreduce
collect2: ld returned 1 exit status

Now I was wondering if this a common
  error and is there an actual issue with the library or am I
  getting an error because of an incorrect configuration? I am
  using the following library:
  $HOME/hadoop/c++/Linux-amd64-64/lib/libhdfs.a
Thanks,
Amritanshu






  
-- 
Marcos Luis Ortíz

Re: Hadoop 0.23.1 installation

2012-03-01 Thread Marcos Ortiz

On 03/01/2012 04:48 AM, raghavendhra rahul wrote:

Hi,
   I tried to configure Hadoop 0.23.1. I added all the libs from the share 
folder to the lib directory, but I still get this error while formatting the 
namenode:



Exception in thread main java.lang.NoClassDefFoundError: 
org/apache/hadoop/hdfs/server/namenode/NameNode
Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.hdfs.server.namenode.NameNode

at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
Could not find the main class: 
org.apache.hadoop.hdfs.server.namenode.NameNode. Program will exit.



Any help???

Can you show us your configuration files here?
core-site.xml
mapred-site.xml
hdfs-site.xml

What is your configuration in conf/hadoop-env.sh?

Regards

--
Marcos Luis Ortíz Valmaseda
 Sr. Software Engineer (UCI)
 http://marcosluis2186.posterous.com
 http://postgresql.uci.cu/blog/38





Re: Query Regarding design MR job for Billing

2012-02-28 Thread Marcos Ortiz

On 02/27/2012 11:33 PM, Stuti Awasthi wrote:

Hi Marcos,

Thanks for the pointers. I am also thinking on the similar lines.
I am doubtful at 1 point :

I will be having separate data files for every interval. Let's take an example: if 
I have a 5-minute-interval file which contains data for 2 hours and 10 minutes, in this 
scenario I want to process the 2 hours of data with the hours job and the 10 minutes of data with the 
minutes job. Since I provide my data file as input to the MR jobs, I think the 
original file needs to be split into 2 files: HourFile and
MinsFile. HourFile will contain the data for 2 hours and MinsFile will contain the data 
for 10 minutes.
Well, you can use Oozie (http://yahoo.github.com/oozie/) or 
Cascading (http://cascading.org) for complex workflow programming.
1- For example, you can write a MapReduce job to split your data: one file by 
hour, and one by minutes. In your case, a simple output would be one data 
file containing your data for 2 hours, and another data file for your 10 
minutes. I think that this job could be a Mapper-only job using 
MultipleOutputFormat (or the newer MultipleOutputs), as sketched below.

2- Then you can write the different jobs for each interval 
(HourIntervalJob, MonthIntervalJob, etc.), splitting their outputs by 
interval in HDFS.

You can define your complete workflow, and then evaluate Oozie 
or Cascading to control that workflow.

Regards

Remember that all these are just suggestions; I'm not a MR expert.
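To illustrate point 1, here is a rough sketch of such a map-only splitter, written
against org.apache.hadoop.mapreduce.lib.output.MultipleOutputs. Everything specific
to the data (the class name, the isOnHourBoundary() rule, the named outputs "hour"
and "minute") is a made-up placeholder. The driver would register the named outputs
with MultipleOutputs.addNamedOutput(...) and set the number of reduce tasks to 0.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Map-only splitter: routes each "timestamp <tab> data" record to a named output.
public class IntervalSplitMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    private MultipleOutputs<NullWritable, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        if (isOnHourBoundary(record.toString())) {
            mos.write("hour", NullWritable.get(), record);     // ends up in hour-m-* files
        } else {
            mos.write("minute", NullWritable.get(), record);   // ends up in minute-m-* files
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();   // required, otherwise the named-output files may not be finalized
    }

    // Placeholder rule: parse the timestamp and decide which bucket the record belongs to.
    private boolean isOnHourBoundary(String line) {
        return line.contains(":00 ");
    }
}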



I have achieved the file splitting with a simple Java class, but I think there are too 
many I/O operations; if I can do this in MR or in some more efficient 
way, it would be good, because the original data files can be huge and then the 
initial splitting of the files will itself take too much time.

Please suggest.
Thanks

-Original Message-
From: Marcos Ortiz [mailto:mlor...@uci.cu]
Sent: Sunday, February 26, 2012 7:40 PM
To: mapreduce-user@hadoop.apache.org
Cc: Stuti Awasthi
Subject: Re: Query Regarding design MR job for Billing

Well, first, you can design 6 MR jobs:
1- for 5 mins interval
2- for 1 hour
3- for 1 day
4- for 1 month
5- for 1 year
6- and a last for any interval

If, as you say, you have to do a different calculation for each interval, this 
approach could be a solution (at least I think so).
You can read the design patterns for MapReduce algorithms proposed by Jimmy Lin and 
Chris Dyer in their book Data-Intensive Text Processing with MapReduce.

Regards


On 02/27/2012 05:39 AM, Stuti Awasthi wrote:

No. The data will be either of 5 mins interval, or 1 hour interval or 1 day 
interval and so on 
So suppose utilization is for 40 days then I will charge 30 days according to 
months billing and remaining 10 days as days billing job.

-Original Message-
From: Rohit Kelkar [mailto:rohitkel...@gmail.com]
Sent: Monday, February 27, 2012 4:06 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Query Regarding design MR job for Billing

Just trying to understand your use case you need an hour job to run on
data between 6:40 AM and 7:40 AM. Would it be like a moving window?
For ex. run hour jobs on
6:41 AM to 7:41 AM
6:42 AM to 7:42 AM
and so on...


On Mon, Feb 27, 2012 at 1:01 PM, Stuti Awasthistutiawas...@hcl.com   wrote:

Hi all,

I have to implement a BillingEngine using MR jobs. My use case is like this:
I will be having data files of the format: TimeStamp, Information for Billing.
Now these data files will contain timestamps at either minute interval, 
hour interval, day interval, month interval, or year interval. Every type of 
interval has a different type of calculation for billing, so basically 
there are different jobs for every type of interval.

Suppose I have a data file which contains minute-interval timestamps. I have a 
scenario where, if data is present for whole hours, then it should be processed by 
the hourly job and the remainder will be processed by the minute job.

Example :

2/10/12 6:40 AM    data for billing
2/10/12 6:40 AM    data for billing
.
2/10/12 6:45 AM    data for billing
2/10/12 6:45 AM    data for billing
.
.
2/10/12 7:40 AM    data for billing
2/10/12 7:40 AM    data for billing
.
.
2/10/12 7:45 AM    data for billing
2/10/12 7:45 AM    data for billing
.

Now I want the data between 2/10/12 6:40 AM and 2/10/12 7:40 AM to be processed by 
the HourJob and 2/10/12 7:45 AM to be processed by the MinuteJob.
Please suggest how to design my MR jobs to achieve this.

Thanks
Stuti


Re: Query Regarding design MR job for Billing

2012-02-27 Thread Marcos Ortiz

Well, first, you can design 6 MR jobs:
1- for 5 mins interval
2- for 1 hour
3- for 1 day
4- for 1 month
5- for 1 year
6- and a last for any interval

If, as you say, you have to do a different 
calculation for each interval, this approach could be a solution (at least I think so).
You can read the design patterns for MapReduce algorithms proposed by 
Jimmy Lin and Chris Dyer in their book Data-Intensive Text Processing with 
MapReduce.


Regards


On 02/27/2012 05:39 AM, Stuti Awasthi wrote:

No. The data will be either of 5 mins interval, or 1 hour interval or 1 day 
interval and so on 
So suppose utilization is for 40 days then I will charge 30 days according to 
months billing and remaining 10 days as days billing job.

-Original Message-
From: Rohit Kelkar [mailto:rohitkel...@gmail.com]
Sent: Monday, February 27, 2012 4:06 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Query Regarding design MR job for Billing

Just trying to understand your use case
you need an hour job to run on data between 6:40 AM and 7:40 AM. Would it be 
like a moving window? For ex. run hour jobs on
6:41 AM to 7:41 AM
6:42 AM to 7:42 AM
and so on...


On Mon, Feb 27, 2012 at 1:01 PM, Stuti Awasthistutiawas...@hcl.com  wrote:

Hi all,

I have to implement a BillingEngine using MR jobs. My use case is like this:
I will be having data files of the format: TimeStamp, Information for Billing.
Now these data files will contain timestamps at either minute interval, 
hour interval, day interval, month interval, or year interval. Every type of 
interval has a different type of calculation for billing, so basically 
there are different jobs for every type of interval.

Suppose I have a data file which contains minute-interval timestamps. I have a 
scenario where, if data is present for whole hours, then it should be processed by 
the hourly job and the remainder will be processed by the minute job.

Example :

2/10/12 6:40 AM    data for billing
2/10/12 6:40 AM    data for billing
.
2/10/12 6:45 AM    data for billing
2/10/12 6:45 AM    data for billing
.
.
2/10/12 7:40 AM    data for billing
2/10/12 7:40 AM    data for billing
.
.
2/10/12 7:45 AM    data for billing
2/10/12 7:45 AM    data for billing
.

Now I want the data between 2/10/12 6:40 AM and 2/10/12 7:40 AM to be processed by 
the HourJob and 2/10/12 7:45 AM to be processed by the MinuteJob.
Please suggest how to design my MR jobs to achieve this.

Thanks
Stuti




--
Marcos Luis Ortíz Valmaseda
 Senior Software Engineer (UCI)
 http://marcosluis2186.posterous.com
 http://www.linkedin.com/in/marcosluis2186
 Twitter: @marcosluis2186





Re: MapReduce jobs hanging or failing near completion

2011-07-07 Thread Marcos Ortiz



On 7/7/2011 8:43 PM, Kai Ju Liu wrote:

Over the past week or two, I've run into an issue where MapReduce jobs
hang or fail near completion. The percent completion of both map and
reduce tasks is often reported as 100%, but the actual number of
completed tasks is less than the total number. It appears that either
tasks backtrack and need to be restarted or the last few reduce tasks
hang interminably on the copy step.

In certain cases, the jobs actually complete. In other cases, I can't
wait long enough and have to kill the job manually.

My Hadoop cluster is hosted in EC2 on instances of type c1.xlarge with 4
attached EBS volumes. The instances run Ubuntu 10.04.1 with the
2.6.32-309-ec2 kernel, and I'm currently using Cloudera's CDH3u0
distribution. Has anyone experienced similar behavior in their clusters,
and if so, had any luck resolving it? Thanks!


Can you post your NN and DN log files here?
Regards


Kai Ju


--
Marcos Luís Ortíz Valmaseda
 Software Engineer (UCI)
 Linux User # 418229
 http://marcosluis2186.posterous.com
 http://twitter.com/marcosluis2186



Re: AW: How to split a big file in HDFS by size

2011-06-20 Thread Marcos Ortiz
Evert Lammerts at Sara.nl did something similar to your problem, splitting 
a big 2.7 TB file into chunks of 10 GB.
This work was presented at the BioAssist Programmers' Day in January of 
this year, under the name

Large-Scale Data Storage and Processing for Scientist in The Netherlands

http://www.slideshare.net/evertlammerts

P.S.: I sent this message with a copy to him.

On 6/20/2011 10:38 AM, Niels Basjes wrote:

Hi,

On Mon, Jun 20, 2011 at 16:13, Mapred Learnmapred.le...@gmail.com  wrote:
   

But this file is a gzipped text file. In this case it will only go to 1 mapper, 
unlike the case where it is
split into 60 1-GB files, which will make the map-reduce job finish earlier than one 60 
GB file, as it will
have 60 mappers running in parallel. Isn't that so?
 

Yes, that is very true.

   


--
Marcos Luís Ortíz Valmaseda
 Software Engineer (UCI)
 http://marcosluis2186.posterous.com
 http://twitter.com/marcosluis2186
  



Re: Query about hadoop dfs -cat in hadoop-0-0.20.2

2011-06-17 Thread Marcos Ortiz

On 06/17/2011 07:41 AM, Lemon Cheng wrote:

Hi,

I am using hadoop-0.20.2. After calling ./start-all.sh, I can run 
hadoop dfs -ls.
However, when I run hadoop dfs -cat 
/usr/lemon/wordcount/input/file01, the error shown below appears.
I have searched for this problem on the web, but I can't find a 
solution that helps me solve it.

Can anyone give a suggestion?
Many thanks.



11/06/17 19:27:12 INFO hdfs.DFSClient: No node available for block: 
blk_7095683278339921538_1029 file=/usr/lemon/wordcount/input/file01
11/06/17 19:27:12 INFO hdfs.DFSClient: Could not obtain block 
blk_7095683278339921538_1029 from any node:  java.io.IOException: No 
live nodes contain current block
11/06/17 19:27:15 INFO hdfs.DFSClient: No node available for block: 
blk_7095683278339921538_1029 file=/usr/lemon/wordcount/input/file01
11/06/17 19:27:15 INFO hdfs.DFSClient: Could not obtain block 
blk_7095683278339921538_1029 from any node:  java.io.IOException: No 
live nodes contain current block
11/06/17 19:27:18 INFO hdfs.DFSClient: No node available for block: 
blk_7095683278339921538_1029 file=/usr/lemon/wordcount/input/file01
11/06/17 19:27:18 INFO hdfs.DFSClient: Could not obtain block 
blk_7095683278339921538_1029 from any node:  java.io.IOException: No 
live nodes contain current block
11/06/17 19:27:21 WARN hdfs.DFSClient: DFS Read: java.io.IOException: 
Could not obtain block: blk_7095683278339921538_1029 
file=/usr/lemon/wordcount/input/file01
at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1812)
at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1638)
at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1767)

at java.io.DataInputStream.read(DataInputStream.java:83)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:85)
at org.apache.hadoop.fs.FsShell.printToStdout(FsShell.java:114)
at org.apache.hadoop.fs.FsShell.access$100(FsShell.java:49)
at org.apache.hadoop.fs.FsShell$1.process(FsShell.java:352)
at 
org.apache.hadoop.fs.FsShell$DelayedExceptionThrowing.globAndProcess(FsShell.java:1898)
at org.apache.hadoop.fs.FsShell.cat(FsShell.java:346)

at org.apache.hadoop.fs.FsShell.doall(FsShell.java:1543)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:1761)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:1880)


Regards,
Lemon

Are you sure that all your DataNodes are online?


--
Marcos Luís Ortíz Valmaseda
 Software Engineer (UCI)
 http://marcosluis2186.posterous.com
 http://twitter.com/marcosluis2186




Re: Query about hadoop dfs -cat in hadoop-0-0.20.2

2011-06-17 Thread Marcos Ortiz

On 06/17/2011 09:51 AM, Lemon Cheng wrote:

Hi,

Thanks for your reply.
I am not sure that. How can I prove that?

What are your dfs.tmp.dir and dfs.data.dir values?

You can check the DataNodes' health with bin/slaves.sh jps | grep 
Datanode | sort

What is the output of bin/hadoop dfsadmin -report?

One recommendation I can give you is to have at least 1 NameNode 
and two DataNodes.

Regards


I checked the localhost:50070, it shows 1 live node and 0 dead node.
And  the log hadoop-appuser-datanode-localhost.localdomain.log shows:
/
2011-06-17 19:59:38,658 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:

/
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = localhost.localdomain/127.0.0.1
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = 
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 
911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010

/
2011-06-17 19:59:46,738 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode: Registered 
FSDatasetStatusMBean
2011-06-17 19:59:46,749 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode: Opened info server at 
50010
2011-06-17 19:59:46,752 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwith is 
1048576 bytes/s
2011-06-17 19:59:46,812 INFO org.mortbay.log: Logging to 
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via 
org.mortbay.log.Slf4jLog
2011-06-17 19:59:46,870 INFO org.apache.hadoop.http.HttpServer: Port 
returned by webServer.getConnectors()[0].getLocalPort() before open() 
is -1. Opening the listener on 50075
2011-06-17 19:59:46,871 INFO org.apache.hadoop.http.HttpServer: 
listener.getLocalPort() returned 50075 
webServer.getConnectors()[0].getLocalPort() returned 50075
2011-06-17 19:59:46,871 INFO org.apache.hadoop.http.HttpServer: Jetty 
bound to port 50075

2011-06-17 19:59:46,875 INFO org.mortbay.log: jetty-6.1.14
2011-06-17 20:01:45,702 INFO org.mortbay.log: Started 
SelectChannelConnector@0.0.0.0:50075
2011-06-17 20:01:45,709 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=DataNode, sessionId=null
2011-06-17 20:01:45,743 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: 
Initializing RPC Metrics with hostName=DataNode, port=50020
2011-06-17 20:01:45,751 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode: dnRegistration = 
DatanodeRegistration(localhost.localdomain:50010, 
storageID=DS-993704729-127.0.0.1-50010-1308296320968, infoPort=50075, 
ipcPort=50020)
2011-06-17 20:01:45,751 INFO org.apache.hadoop.ipc.Server: IPC Server 
listener on 50020: starting
2011-06-17 20:01:45,753 INFO org.apache.hadoop.ipc.Server: IPC Server 
Responder: starting
2011-06-17 20:01:45,754 INFO org.apache.hadoop.ipc.Server: IPC Server 
handler 2 on 50020: starting
2011-06-17 20:01:45,754 INFO org.apache.hadoop.ipc.Server: IPC Server 
handler 0 on 50020: starting
2011-06-17 20:01:45,754 INFO org.apache.hadoop.ipc.Server: IPC Server 
handler 1 on 50020: starting
2011-06-17 20:01:45,795 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(127.0.0.1:50010, 
storageID=DS-993704729-127.0.0.1-50010-1308296320968, infoPort=50075, 
ipcPort=50020)In DataNode.run, data = 
FSDataset{dirpath='/tmp/hadoop-appuser/dfs/data/current'}



2011-06-17 20:01:45,799 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode: using 
BLOCKREPORT_INTERVAL of 360msec Initial delay: 0msec
2011-06-17 20:01:45,828 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 0 
blocks got processed in 11 msecs
2011-06-17 20:01:45,833 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode: Starting Periodic 
block scanner.
2011-06-17 20:56:02,945 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 0 
blocks got processed in 1 msecs
2011-06-17 21:56:02,248 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 0 
blocks got processed in 1 msecs



On Fri, Jun 17, 2011 at 9:42 PM, Marcos Ortiz mlor...@uci.cu 
mailto:mlor...@uci.cu wrote:


On 06/17/2011 07:41 AM, Lemon Cheng wrote:

Hi,

I am using the hadoop-0.20.2. After calling ./start-all.sh, i can
type hadoop dfs -ls.
However, when i type hadoop dfs -cat
/usr/lemon/wordcount/input/file01, the error is shown as follow.
I have searched the related problem in the web, but i can't find
a solution for helping me to solve this problem.
Anyone can give suggestion?
Many Thanks.



11/06/17 19:27:12 INFO hdfs.DFSClient: No node available for
block: blk_7095683278339921538_1029
file=/usr/lemon/wordcount/input/file01
11/06/17 19:27:12 INFO hdfs.DFSClient: Could not obtain block
blk_7095683278339921538_1029 from any node

Re: can't compile the mapreduce project in eclipse

2011-06-14 Thread Marcos Ortiz

Did you add all dependencies of the source code?


On 6/14/2011 10:32 AM, Erix Yao wrote:

hi
   I checked out the source code from 
http://svn.apache.org/repos/asf/hadoop/mapreduce/tags/release-0.21.0 
and executed ant compile eclipse-files, but after importing the project 
into Eclipse, I found the errors below:


Description Resource Path Location Type
Project 'mapreduce' is missing required library: 
'build/ivy/lib/Hadoop/common/avro-1.3.0.jar' mapreduce Build path 
Build Path Problem

Here, Avro jar is missing


Description Resource Path Location Type
Project 'mapreduce' is missing required source folder: 
'src/contrib/sqoop/src/java' mapreduce Build path Build Path Problem
And here, Sqoop is missing. I don't know why this library is required for 
this, but it seems to be the problem.
You should add all the required dependencies to your classpath variables 
under Window > Preferences > Java > Build Path > Classpath Variables.

Please go to the Cloudera Resources site and search for the 
Eclipse/Hadoop screencast, which explains quickly and easily 
how to build the Hadoop project.

Regards


--
Marcos Luís Ortíz Valmaseda
 Software Engineer (UCI)
 http://marcosluis2186.posterous.com
 http://twitter.com/marcosluis2186
  



Re: Programming Multiple rounds of mapreduce

2011-06-13 Thread Marcos Ortiz
Well, you can define a job for each round, then define the 
running workflow based on your implementation and chain the jobs; a bare-bones sketch of such a driver follows.
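This is only a sketch of the chaining pattern: FirstRoundMapper/FirstRoundReducer
and SecondRoundMapper/SecondRoundReducer are hypothetical classes standing in for
your own per-round logic, and the paths come from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoRoundDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);   // output of round 1, input of round 2
        Path output = new Path(args[2]);

        Job round1 = new Job(conf, "round-1");
        round1.setJarByClass(TwoRoundDriver.class);
        round1.setMapperClass(FirstRoundMapper.class);      // placeholder class
        round1.setReducerClass(FirstRoundReducer.class);    // placeholder class
        round1.setOutputKeyClass(Text.class);
        round1.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(round1, input);
        FileOutputFormat.setOutputPath(round1, intermediate);
        if (!round1.waitForCompletion(true)) {
            System.exit(1);                                  // stop the chain if round 1 fails
        }

        Job round2 = new Job(conf, "round-2");
        round2.setJarByClass(TwoRoundDriver.class);
        round2.setMapperClass(SecondRoundMapper.class);      // placeholder class
        round2.setReducerClass(SecondRoundReducer.class);    // placeholder class
        round2.setOutputKeyClass(Text.class);
        round2.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(round2, intermediate);  // round 1 output feeds round 2
        FileOutputFormat.setOutputPath(round2, output);
        System.exit(round2.waitForCompletion(true) ? 0 : 1);
    }
}

For longer pipelines the same idea applies, or a workflow tool such as Oozie or
Cascading can manage the chain for you.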


On 6/13/2011 5:46 PM, Arko Provo Mukherjee wrote:

Hello,

I am trying to write a program where I need to write multiple rounds 
of map and reduce.


The output of the last round of map-reduce must be fed into the input 
of the next round.


Can anyone please guide me to any link / material that can teach me as 
to how I can achieve this.


Thanks a lot in advance!

Thanks  regards
Arko


--
Marcos Luís Ortíz Valmaseda
 Software Engineer (UCI)
 http://marcosluis2186.posterous.com
 http://twitter.com/marcosluis2186
  



Re: Input examples

2011-06-07 Thread Marcos Ortiz

You can also use the HackReduce datasets for this:
http://hackreduce.org/datasets

Regards

On 6/7/2011 1:56 PM, Jonathan Coveney wrote:
Have you taken a look at the O'Reilly Hadoop book? It deals 
consistently with a weather dataset that is, I believe, largely available.


2011/6/7 Francesco De Luca f.deluc...@gmail.com 
mailto:f.deluc...@gmail.com


Hello Sean,

Not exactly. I mean some applications like word count or inverted
index, together with the corresponding input data.

2011/6/7 Sean Owen sro...@gmail.com mailto:sro...@gmail.com

Not sure if it's quite what you mean, but, Apache Mahout is
essentially all applications of Hadoop for machine learning, a
bunch of runnable jobs (some with example data too).

mahout.apache.org http://mahout.apache.org/

On Tue, Jun 7, 2011 at 3:54 PM, Francesco De Luca
f.deluc...@gmail.com mailto:f.deluc...@gmail.com wrote:

Where can I find some Hadoop MapReduce application
examples (other than word count)
with associated input files?

Thanks






--
Marcos Luís Ortíz Valmaseda
 Software Engineer (UCI)
 http://marcosluis2186.posterous.com
 http://twitter.com/marcosluis2186
  



Re: Changing dfs.block.size

2011-06-06 Thread Marcos Ortiz
Another piece of advice here is to test for the right block size in an 
environment similar to your production system before you deploy the real 
system; that way you can avoid these kinds of changes later.
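For what it's worth, the new dfs.block.size only applies to files created after
the change; files that already exist keep the block size they were written with,
and the only way to change them is to rewrite them (for example with distcp, or
with a small program like the sketch below). The NameNode URI, the paths and the
256 MB figure are placeholders.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Rewrites one file with an explicit block size.
public class RewriteWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path src = new Path("/data/old-file");           // placeholder paths
        Path dst = new Path("/data/old-file.256mb");
        long newBlockSize = 256L * 1024 * 1024;          // 256 MB
        FSDataInputStream in = fs.open(src);
        FSDataOutputStream out =
                fs.create(dst, true, 65536, fs.getDefaultReplication(), newBlockSize);
        try {
            IOUtils.copyBytes(in, out, 65536, false);    // copy the bytes into the new layout
        } finally {
            IOUtils.closeStream(in);
            IOUtils.closeStream(out);
        }
    }
}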


On 6/6/2011 3:09 PM, J. Ryan Earl wrote:

Hello,

So I have a question about changing dfs.block.size in 
$HADOOP_HOME/conf/hdfs-site.xml.  I understand that when files are 
created, blocksizes can be modified from default.  What happens if you 
modify the blocksize of an existing HDFS site?  Do newly created files 
get the default blocksize and old files remain the same?  Is there a 
way to change the blocksize of existing files; I'm assuming you could 
write MapReduce job to do it, but any build in facilities?


Thanks,
-JR




--
Marcos Luís Ortíz Valmaseda
 Software Engineer (UCI)
 http://marcosluis2186.posterous.com
 http://twitter.com/marcosluis2186
  



Re: cant remove files from tmp

2011-06-06 Thread Marcos Ortiz

How many DNs do you have?
If more than 1, check another DN to see if this 
happens there too.
Check /var/log/messages or dmesg (like Todd told you), for 
example like this (this is from one of my Ubuntu servers):

less dmesg | grep EXT4-fs

[1.583836] EXT4-fs (sda7): INFO: recovery required on readonly 
filesystem

[1.583843] EXT4-fs (sda7): write access will be enabled during recovery
[2.572935] EXT4-fs (sda7): orphan cleanup on readonly fs
[2.620969] EXT4-fs (sda7): ext4_orphan_cleanup: deleting 
unreferenced inode 455946
[2.621015] EXT4-fs (sda7): ext4_orphan_cleanup: deleting 
unreferenced inode 455942

[2.621029] EXT4-fs (sda7): 2 orphan inodes deleted
[2.621034] EXT4-fs (sda7): recovery complete
[2.785283] EXT4-fs (sda7): mounted filesystem with ordered data 
mode. Opts: (null)

[   22.041130] EXT4-fs (sda7): re-mounted. Opts: errors=remount-ro
[   22.505474] EXT4-fs (sda8): mounted filesystem with ordered data 
mode. Opts: (null)


Regards

On 6/6/2011 4:43 PM, Todd Lipcon wrote:

Hi Prem,

My guess is that your Linux filesystem on this partition is corrupt. 
Check dmesg for output indicating fs-level errors.


-Todd

On Mon, Jun 6, 2011 at 1:23 PM, Jain, Prem premanshu.j...@netapp.com 
mailto:premanshu.j...@netapp.com wrote:


Mapuser or hdfs user didn't seem to help, so I switched to root:

[root@hadoop20 mapred]# ls -la /part/data
total 0
drwx-- 3 hdfs   hadoop 16 Jun  6 10:22 .
drwxrwxrwx 4 hdfs   hadoop 47 May 26 18:36 ..
drwxr-xr-x 4 mapred mapred 35 May 26 21:02 tmp
[root@hadoop20 mapred]#

[root@hadoop20 mapred]# pwd

/part/data/tmp/distcache/642114211252449475_2038269146_799583695/hmaster/user/mapred
[root@hadoop20 mapred]# ls -la
total 0
drwxr-xr-x 3 mapred mapred 22 Jun  6 12:46 .
drwxr-xr-x 3 mapred mapred 19 May 26 21:17 ..
?- ? ?  ?   ?? input-dir



-Original Message-
From: Marcos Ortiz [mailto:mlor...@uci.cu mailto:mlor...@uci.cu]
Sent: Monday, June 06, 2011 1:17 PM
To: hdfs-user@hadoop.apache.org mailto:hdfs-user@hadoop.apache.org
Cc: Jain, Prem
Subject: Re: cant remove files from tmp

* Why are you using the root user for these operations?
* What are the permissions on your data directory (ls -la
/part/data)?

Regards

On 6/6/2011 3:41 PM, Jain, Prem wrote:
 I have a wrecked datanode which is giving me hard time
restarting. It
 keeps complaining of Datanode dead, pid file exists.  I already
tried
 deleting the files but seems like the files are corrupted and don't
 allow me delete.

 

 Here is the log:
 

 /
 STARTUP_MSG: Starting DataNode
 STARTUP_MSG:   host = hadoop20/192.168.1.190 http://192.168.1.190
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 0.20.2-cdh3u0
 STARTUP_MSG:   build =  -r 81256ad0f2e4ab2bd34b04f53d25a6c23686dd14;
 compiled by 'root' on Fri Mar 25 20:07:24 EDT 2011
 /
 2011-06-06 09:11:01,232 INFO
 org.apache.hadoop.security.UserGroupInformation: JAAS Configuration
 already set up for Hadoop, not re-installing.
 2011-06-06 09:11:01,369 ERROR
 org.apache.hadoop.hdfs.server.datanode.DataNode:
 org.apache.hadoop.util.Shell$ExitCodeException: du: cannot access
 `/part/data/tmp/distcache/642114211252449475_2038269146_79
 9583695/hmaster/user/mapred/input-dir': No such file or directory
 du: cannot read directory
 `/part/data/tmp/mapred/jobcache/job_201105261845_0005': Permission
 denied


 _
 Here is the file I can't delete
 _
 [root@hadoop20 distcache]# pwd
 /part/data/tmp/distcache
 [root@hadoop20 distcache]# ls -la
 total 0
 drwxr-xr-x 3 mapred mapred 52 May 26 21:36 .
 drwxr-xr-x 4 mapred mapred 35 May 26 21:02 ..
 drwxr-xr-x 3 mapred mapred 20 May 26 21:17
 642114211252449475_2038269146_799583695
 [root@hadoop20 distcache]# cd *
 [root@hadoop20 642114211252449475_2038269146_799583695]# ls -la
 total 0
 drwxr-xr-x 3 mapred mapred 20 May 26 21:17 .
 drwxr-xr-x 3 mapred mapred 52 May 26 21:36 ..
 drwxr-xr-x 3 mapred mapred 17 May 26 21:17 hmaster
 [root@hadoop20 642114211252449475_2038269146_799583695]# cd h*
 [root@hadoop20 hmaster]# ls
 user
 [root@hadoop20 hmaster]# cd *
 [root@hadoop20 user]# ls -la
 total 0
 drwxr-xr-x 3 mapred mapred 19 May 26 21:17 .
 drwxr-xr-x 3 mapred mapred 17 May 26 21:17 ..
 drwxr-xr-x 3 mapred mapred 22 May 26 21:17 mapred
 [root@hadoop20 user]# cd m*
 [root@hadoop20 mapred]# ls -la

Re: question about using java in streaming mode

2011-06-05 Thread Marcos Ortiz Valmaseda
Why are you using Java in streaming mode instead of writing native Mapper/Reducer code?
Can you show us the JobTracker's logs?
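For reference, a minimal sketch of what the native (non-streaming) classes look
like with the org.apache.hadoop.mapreduce API, in case that route turns out to be
simpler than driving the same jar through Streaming. The class names here are
placeholders and are not taken from your jar.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountClasses {

    public static class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);      // emit (word, 1) for each token
            }
        }
    }

    public static class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();                // add up the partial counts
            }
            context.write(word, new IntWritable(sum));
        }
    }
}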

Regards
- Original Message -
From: Siddhartha Jonnalagadda sid@gmail.com
To: mapreduce-user@hadoop.apache.org
Sent: Sunday, June 5, 2011 7:16:08 GMT +01:00 Amsterdam / Berlin / 
Bern / Rome / Stockholm / Vienna
Subject: question about using java in streaming mode

Hi, 


I was able to use streaming in Hadoop using Python for the wordcount program, but 
created a Mapper and Reducer in Java since all my code is currently in Java. 
I first tried this: 
echo “foo foo quux labs foo bar quux” |java -cp ~/dummy.jar WCMapper | sort | 
java -cp ~/dummy.jar WCReducer 

It gave the correct output: 
labs 1 
foo 3 
bar 1 
quux 2 

Then, I installed a single-node cluster in hadoop and tried this: hadoop jar 
contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper “java -cp ~/dummy.jar 
WCMapper” -reducer “java -cp ~/dummy.jar WCReducer” -input gutenberg/* -output 
gutenberg-output -file dummy.jar (by tailoring the python command) 

This is the error: 
hadoop@siddhartha-laptop:/usr/local/hadoop$ hadoop jar 
contrib/streaming/hadoop-streaming-0.20.203.0.jar -mapper “java -cp ~/dummy.jar 
WCMapper” -reducer “java -cp ~/dummy.jar WCReducer” -input gutenberg/* -output 
gutenberg-output -file dummy.jar 
packageJobJar: [dummy.jar, /app/hadoop/tmp/hadoop-unjar5573454211442575176/] [] 
/tmp/streamjob6721719460213928092.jar tmpDir=null 
11/06/04 20:47:15 INFO mapred.FileInputFormat: Total input paths to process : 3 
11/06/04 20:47:15 INFO streaming.StreamJob: getLocalDirs(): 
[/app/hadoop/tmp/mapred/local] 
11/06/04 20:47:15 INFO streaming.StreamJob: Running job: job_201106031901_0039 
11/06/04 20:47:15 INFO streaming.StreamJob: To kill this job, run: 
11/06/04 20:47:15 INFO streaming.StreamJob: /usr/local/hadoop/bin/../bin/hadoop 
job -Dmapred.job.tracker=localhost:54311 -kill job_201106031901_0039 
11/06/04 20:47:15 INFO streaming.StreamJob: Tracking URL: 
http://localhost:50030/jobdetails.jsp?jobid=job_201106031901_0039 
11/06/04 20:47:16 INFO streaming.StreamJob: map 0% reduce 0% 
11/06/04 20:48:00 INFO streaming.StreamJob: map 100% reduce 100% 
11/06/04 20:48:00 INFO streaming.StreamJob: To kill this job, run: 
11/06/04 20:48:00 INFO streaming.StreamJob: /usr/local/hadoop/bin/../bin/hadoop 
job -Dmapred.job.tracker=localhost:54311 -kill job_201106031901_0039 
11/06/04 20:48:00 INFO streaming.StreamJob: Tracking URL: 
http://localhost:50030/jobdetails.jsp?jobid=job_201106031901_0039 
11/06/04 20:48:00 ERROR streaming.StreamJob: Job not successful. Error: NA 
11/06/04 20:48:00 INFO streaming.StreamJob: killJob… 
Streaming Job Failed! 

Any advice? 
Sincerely, 
Siddhartha Jonnalagadda, 
Text mining Researcher, Lnx Research, LLC, Orange, CA 
sjonnalagadda.wordpress.com 








-- 
Marcos Luís Ortíz Valmaseda
 Software Engineer (Large-Scaled Distributed Systems)
http://marcosluis2186.posterous.com



Re: question about using java in streaming mode

2011-06-05 Thread Marcos Ortiz

On 6/5/2011 4:01 PM, Siddhartha Jonnalagadda wrote:

Hi Marcos,

I thought that streaming would make it easier because I was getting 
different errors with extending mapper and reducer in java.


I tried: hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar 
-file dummy.jar -mapper java -cp dummy.jar WCMapper -reducer java 
-cp dummy.jar WCReducer -input gutenberg/* -output gutenberg-output


The error log in the map task:
*_stderr logs_*
Exception in thread main java.lang.NoClassDefFoundError: WCMapper
Caused by: java.lang.ClassNotFoundException: WCMapper
   
What is the definition of your classpath? This error is raised when the 
system cannot find the definition of a class.

at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)


at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:248)


Could not find the main class: WCMapper.  Program will exit.
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed 
with code 1
at 
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)


at 
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:121)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)


at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:435)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:371)
at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
at java.security.AccessController.doPrivileged(Native Method)


at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:253)
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed 
with code 1


at 
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at 
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)


at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:435)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:371)


at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)


at org.apache.hadoop.mapred.Child.main(Child.java:253)
   





Sincerely,
Siddhartha Jonnalagadda,
sjonnalagadda.wordpress.com http://sjonnalagadda.wordpress.com









On Sun, Jun 5, 2011 at 10:59 AM, Marcos Ortiz Valmaseda 
mlor...@uci.cu mailto:mlor...@uci.cu wrote:


Why are using Java in streming mode instead use the native
Mapper/Reducer code?
Can you show to us the JobTracker's logs?

Regards
- Original Message -
From: Siddhartha Jonnalagadda sid@gmail.com
To: mapreduce-user@hadoop.apache.org
Sent: Sunday, June 5, 2011 7:16:08 GMT +01:00 Amsterdam /
Berlin / Bern / Rome / Stockholm / Vienna
Subject: question about using java in streaming mode

Hi,


I was able use streaming in hadoop using python for the wordcount
program, but created a Mapper and Reducer in Java since all my
code is currently in Java.
I first tried this:
echo “foo foo quux labs foo bar quux” |java -cp ~/dummy.jar
WCMapper | sort | java -cp ~/dummy.jar WCReducer

It gave the correct output:
labs 1
foo 3
bar 1
quux 2

Then, I installed a single-node cluster in hadoop and tried this:
hadoop jar contrib/streaming/hadoop-streaming-0.20.203.0.jar
-mapper “java -cp ~/dummy.jar WCMapper” -reducer “java -cp
~/dummy.jar WCReducer

Re: Unable to start hadoop-0.20.2 but able to start hadoop-0.20.203 cluster

2011-05-31 Thread Marcos Ortiz

On 05/31/2011 10:06 AM, Xu, Richard wrote:


1 namenode, 1 datanode. Dfs.replication=3. We also tried 0, 1, 2, same 
result.


*From:*Yaozhen Pan [mailto:itzhak@gmail.com]
*Sent:* Tuesday, May 31, 2011 10:34 AM
*To:* hdfs-user@hadoop.apache.org
*Subject:* Re: Unable to start hadoop-0.20.2 but able to start 
hadoop-0.20.203 cluster


How many datanodes are in your cluster? and what is the value of 
dfs.replication in hdfs-site.xml (if not specified, default value is 
3)?


From the error log, it seems there are not enough datanodes to 
replicate the files in hdfs.


On 2011-5-31 22:23, Harsh J ha...@cloudera.com wrote:
Xu,

Please post the output of `hadoop dfsadmin -report` and attach the
tail of a started DN's log?


On Tue, May 31, 2011 at 7:44 PM, Xu, Richard richard...@citi.com
mailto:richard...@citi.com wrote:
 2. Also, Configured Cap...

This might easily be the cause. I'm not sure if its a Solaris thing
that can lead to this though.


 3. in datanode server, no error in logs, but tasktracker logs has
the following suspicious thing:...

I don't see any suspicious log message in what you'd posted. Anyhow,
the TT does not matter here.

--
Harsh J


Regards, Xu
When you installed on Solaris:
- Did you synchronize the NTP server on all nodes:
  echo "server youservernetp.com" >> /etc/inet/ntp.conf
  svcadm enable svc:/network/ntp:default

- Are you using the same Java version on both systems (Ubuntu and Solaris)?

- Can you test with one NN and two DN?



--
Marcos Luis Ortiz Valmaseda
 Software Engineer (Distributed Systems)
 http://uncubanitolinuxero.blogspot.com



Re: MultipleOutputs Files remain in temporary folder

2011-05-30 Thread Marcos Ortiz

On 05/30/2011 11:02 AM, Panayotis Antonopoulos wrote:

Hello,
I just noticed that the files that are created using MultipleOutputs 
remain in the temporary folder into attempt sub-folders when there is 
no normal output  (using context.write(...)).


Has anyone else noticed that?
Is there any way to change that and make the files appear in the 
output directory?


Thank you in advance!
Panagiotis.



   mapred.local.dir

This lets the MapReduce servers know where to store intermediate files.
This may be a comma-separated list of directories to spread the load.
Make sure there’s enough space here for all your intermediate files. We
share the same disks for MapReduce and HDFS.


   mapred.system.dir

This is a folder in the defaultFS where MapReduce stores some control
files. In our case that would be a directory in HDFS. If you have
dfs.permissions enabled (which it is by default), make sure that
this directory exists and is owned by mapred:hadoop.


   mapred.temp.dir

This is a folder to store temporary files in. It is hardly -- if at all --
used. If I understand the description correctly this is supposed to be
in HDFS but I’m not entirely sure by reading the source code. So we set
this to a directory that exists on the local filesystem as well as in HDFS.
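
As a quick way to see what a given installation resolves these to, here is a small sketch using the Configuration/JobConf API (property names as above; the class name is just for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

// Prints the effective MapReduce directory settings of the local configuration.
public class ShowMapredDirs {
  public static void main(String[] args) {
    JobConf conf = new JobConf(new Configuration());
    System.out.println("mapred.local.dir  = " + conf.get("mapred.local.dir"));
    System.out.println("mapred.system.dir = " + conf.get("mapred.system.dir"));
    System.out.println("mapred.temp.dir   = " + conf.get("mapred.temp.dir"));
  }
}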




--
Marcos Luis Ortiz Valmaseda
 Software Engineer (Distributed Systems)
 http://uncubanitolinuxero.blogspot.com



Re: run hadoop pseudo-distribute examples failed

2011-05-20 Thread Marcos Ortiz
On 05/19/2011 10:35 PM, 李�S wrote:
 Hi Marcos,
 Thanks for your reply.
 The temporary directory '/tmp/hadoop-xxx' is defined in hadoop core
 jar's configuration file *core-default.xml*. Do you think this may
 cause the failure? Below is the detailed config:

 <property>
   <name>hadoop.tmp.dir</name>
   <value>/tmp/hadoop-${user.name}</value>
   <description>A base for other temporary directories.</description>
 </property>

 And what other config files do you need? Mostly, I didn't modify
 any configuration after downloading the hadoop-0.20.2 files, so I think
 those configurations are all the default values.
Yes, those are the default values, but I think you should test with
another directory, because this is a temporary directory and it can be
erased easily.
For example, when you use CDH3, the default value there is
/var/lib/hadoop-0.20.2/cache/${user.name}, which is more convenient.
Of course, it's only a recommendation.
You can search Lars Francke's blog (http://blog.lars-francke.de/),
where he did an excellent job explaining the manual installation of a
Hadoop cluster.

Regards

 2011-05-20
 
 李�S
 
 *发件人:* Marcos Ortiz
 *发送时间:* 2011-05-19 20:40:06
 *收件人:* mapreduce-user
 *抄送:* 李�S
 *主题:* Re: run hadoop pseudo-distribute examples failed
 On 05/18/2011 10:53 PM, 李�S wrote:
 Hi All,
 I'm trying to run hadoop(0.20.2) examples in Pseudo-Distributed Mode
 following the hadoop user guide. After I run the 'start-all.sh', it
 seems the namenode can't connect to datanode.
 'SSH localhost' is OK on my server. Someone advises to rm
 '/tmp/hadoop-' and format namenode again, but it doesn't work.
 And 'iptables -L' shows there is no firewall rules in my server:

 test:/home/liyun2010# iptables -L
 Chain INPUT (policy ACCEPT)
 target prot opt source destination
 Chain FORWARD (policy ACCEPT)
 target prot opt source destination
 Chain OUTPUT (policy ACCEPT)
 target prot opt source destination

 Is there anyone can give me more advice? Thanks!
 Bellow is my namenode and datanode log files:
 liyun2010@test:~/hadoop-0.20.2/logs$ cat hadoop-liyun2010-namenode-test.puppet.com.log

 2011-05-19 10:58:25,938 INFO
 org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
 /
 STARTUP_MSG: Starting NameNode
 STARTUP_MSG: host = test.puppet.com/127.0.0.1
 STARTUP_MSG: args = []
 STARTUP_MSG: version = 0.20.2
 STARTUP_MSG: build =
 https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20
 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
 /
 2011-05-19 10:58:26,197 INFO
 org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC
 Metrics with hostName=NameNode, port=9000
 2011-05-19 10:58:26,212 INFO
 org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at:
 test.puppet.com/127.0.0.1:9000
 2011-05-19 10:58:26,220 INFO
 org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM
 Metrics with processName=NameNode, sessionId=null
 2011-05-19 10:58:26,224 INFO
 org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics:
 Initializing NameNodeMeterics using context
 object:org.apache.hadoop.metrics.spi.NullContext
 2011-05-19 10:58:26,405 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 fsOwner=liyun2010,users
 2011-05-19 10:58:26,406 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 supergroup=supergroup
 2011-05-19 10:58:26,406 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 isPermissionEnabled=true
 2011-05-19 10:58:26,429 INFO
 org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics:
 Initializing FSNamesystemMetrics using context
 object:org.apache.hadoop.metrics.spi.NullContext
 2011-05-19 10:58:26,434 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
 FSNamesystemStatusMBean
 2011-05-19 10:58:26,511 INFO
 org.apache.hadoop.hdfs.server.common.Storage: Number of files = 9
 2011-05-19 10:58:26,524 INFO
 org.apache.hadoop.hdfs.server.common.Storage: Number of files
 under construction = 1
 2011-05-19 10:58:26,530 INFO
 org.apache.hadoop.hdfs.server.common.Storage: Image file of size
 920 loaded in 0 seconds.
 2011-05-19 10:58:26,606 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Invalid
 opcode, reached end of edit log Number of transactions found 99
 2011-05-19 10:58:26,606 INFO
 org.apache.hadoop.hdfs.server.common.Storage: Edits file
 /tmp/hadoop-liyun2010/dfs/name/current/edits of size 1049092
 edits # 99 loaded in 0 seconds.
 2011-05-19 10:58:26,660

Re: Starting Datanode

2011-05-20 Thread Marcos Ortiz

On 05/20/2011 03:46 PM, Anh Nguyen wrote:

On 05/20/2011 01:15 PM, Marcos Ortiz wrote:

On 05/20/2011 01:02 PM, Anh Nguyen wrote:

Hi,

I just upgraded to hadoop-0.20.203.0, and am having problem starting 
the

datanode:
# hadoop datanode
Unrecognized option: -jvm
Could not create the Java virtual machine.

It looks like it has something to do with daemon.sh, particularly the
setting of HADOOP_OPTS:
   if [[ $EUID -eq 0 ]]; then
 HADOOP_OPTS="$HADOOP_OPTS -jvm server $HADOOP_DATANODE_OPTS"
   else
 HADOOP_OPTS="$HADOOP_OPTS -server $HADOOP_DATANODE_OPTS"
   fi

Am I missing something?

Thanks in advance.

Anh-


Which Java's version are you using?



# java -version
java version 1.6.0_20
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

It worked with hadoop-0.20.2.
Anh-
How are you starting the services? Using bin/start-all.sh, or simply
starting the datanode?



--
Marcos Luís Ortíz Valmaseda
 Software Engineer (Large-Scaled Distributed Systems)
 University of Information Sciences,
 La Habana, Cuba
 Linux User # 418229
 http://about.me/marcosortiz



Re: Starting Datanode

2011-05-20 Thread Marcos Ortiz

On 05/20/2011 04:08 PM, Anh Nguyen wrote:

On 05/20/2011 02:06 PM, Marcos Ortiz wrote:

On 05/20/2011 04:27 PM, Marcos Ortiz wrote:

On 05/20/2011 03:46 PM, Anh Nguyen wrote:

On 05/20/2011 01:15 PM, Marcos Ortiz wrote:

On 05/20/2011 01:02 PM, Anh Nguyen wrote:

Hi,

I just upgraded to hadoop-0.20.203.0, and am having problem 
starting the

datanode:
# hadoop datanode
Unrecognized option: -jvm
Could not create the Java virtual machine.

It looks like it has something to do with daemon.sh, particularly 
the

setting of HADOOP_OPTS:
   if [[ $EUID -eq 0 ]]; then
 HADOOP_OPTS="$HADOOP_OPTS -jvm server $HADOOP_DATANODE_OPTS"
   else
 HADOOP_OPTS="$HADOOP_OPTS -server $HADOOP_DATANODE_OPTS"
   fi

Am I missing something?

Thanks in advance.

Anh-


Which Java's version are you using?



# java -version
java version 1.6.0_20
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)

It worked with hadoop-0.20.2.
Anh-
How are you starting the services? Using bin/start-all.sh, or
simply starting the datanode?



Anh, first test whether that option (-jvm) is included in that Java
version.




Tested earlier:
# java -jvm
Unrecognized option: -jvm
Could not create the Java virtual machine.

Did you check the requirements for that release? I don't know if this
version requires a Java release newer than 1.6.0_20.

Did you test with 1.6.0_24?

I think that can be a bug.
Take a time to review the last issues for Hadoop on the JIRA of the project.

Regards

--
Marcos Luís Ortíz Valmaseda
 Software Engineer (Large-Scaled Distributed Systems)
 University of Information Sciences,
 La Habana, Cuba
 Linux User # 418229
 http://about.me/marcosortiz



Re: Profiling Hadoop Code

2011-05-19 Thread Marcos Ortiz

On 05/19/2011 04:26 AM, Shuja Rehman wrote:

Hi All,

I was investigating ways to profile Hadoop code. All I found
is to use JobConf.setProfileEnabled(boolean)
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setProfileEnabled%28boolean%29
but I believe this is not available in the new API, so can anybody let
me know how I can profile my Hadoop code and see which part is
taking how much time, in order to tune the application?


Thanks

--
Regards
Shuja-ur-Rehman Baig



Version 0.20.2
Location:/hadoop-0.20.2/src/mapred/org/apache/hadoop/mapred/JobConf.java

/**
   * Get whether the task profiling is enabled.
   * @return true if some tasks will be profiled
   */
  public boolean getProfileEnabled() {
    return getBoolean("mapred.task.profile", false);
  }

  /**
   * Set whether the system should collect profiler information for some of
   * the tasks in this job? The information is stored in the user log
   * directory.
   * @param newValue true means it should be gathered
   */
  public void setProfileEnabled(boolean newValue) {
    setBoolean("mapred.task.profile", newValue);
  }

  /**
   * Get the profiler configuration arguments.
   *
   * The default value for this property is
   * "-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y,verbose=n,file=%s"
   *
   * @return the parameters to pass to the task child to configure profiling
   */
  public String getProfileParams() {
    return get("mapred.task.profile.params",
               "-agentlib:hprof=cpu=samples,heap=sites,force=n,thread=y," +
               "verbose=n,file=%s");
  }

  /**
   * Set the profiler configuration arguments. If the string contains a '%s' it
   * will be replaced with the name of the profiling output file when the task
   * runs.
   *
   * This value is passed to the task child JVM on the command line.
   *
   * @param value the configuration string
   */
  public void setProfileParams(String value) {
    set("mapred.task.profile.params", value);
  }

  /**
   * Get the range of maps or reduces to profile.
   * @param isMap is the task a map?
   * @return the task ranges
   */
  public IntegerRanges getProfileTaskRange(boolean isMap) {
    return getRange((isMap ? "mapred.task.profile.maps" :
                     "mapred.task.profile.reduces"), "0-2");
  }

  /**
   * Set the ranges of maps or reduces to profile. setProfileEnabled(true)
   * must also be called.
   * @param newValue a set of integer ranges of the map ids
   */
  public void setProfileTaskRange(boolean isMap, String newValue) {
    // parse the value to make sure it is legal
    new Configuration.IntegerRanges(newValue);
    set((isMap ? "mapred.task.profile.maps" : "mapred.task.profile.reduces"),
        newValue);
  }
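
With the new API, one way is to set the same underlying properties on the job's Configuration. A minimal sketch, assuming only the property names from the source above (the class and job names are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ProfiledJobExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Same keys the old JobConf setters write.
    conf.setBoolean("mapred.task.profile", true);
    conf.set("mapred.task.profile.maps", "0-2");      // profile map attempts 0, 1 and 2
    conf.set("mapred.task.profile.reduces", "0-2");   // profile reduce attempts 0, 1 and 2
    Job job = new Job(conf, "profiled-job");
    // ... set mapper, reducer, input and output paths as usual ...
    // job.waitForCompletion(true);
  }
}

The hprof output then ends up in the task attempt's user log directory, as the javadoc above says.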


--
Marcos Luís Ortíz Valmaseda
 Software Engineer (Large-Scaled Distributed Systems)
 University of Information Sciences,
 La Habana, Cuba
 Linux User # 418229
 http://about.me/marcosortiz



Re: FW: NNbench and MRBench

2011-05-08 Thread Marcos Ortiz

El 5/8/2011 12:46 AM, stanley@emc.com escribió:

Thanks Marcos.
This post of  Michael Noll does provide some information about how to run these 
benchmarks, but there's not much information about how to evaluate the results.
Do you know some resources about the result analysis?

Thanks very much :)

Regards,
Stanley

-Original Message-
From: Marcos Ortiz [mailto:mlor...@uci.cu]
Sent: 2011年5月8日 11:09
To: mapreduce-user@hadoop.apache.org
Cc: Shi, Stanley
Subject: Re: FW: NNbench and MRBench

El 5/7/2011 10:33 PM, stanley@emc.com escribió:
   

Thanks, Marcos,

Through these links, I still can't find anything about the NNbench and MRBench.

-Original Message-
From: Marcos Ortiz [mailto:mlor...@uci.cu]
Sent: 2011年5月8日 10:23
To: mapreduce-user@hadoop.apache.org
Cc: Shi, Stanley
Subject: Re: FW: NNbench and MRBench

El 5/7/2011 8:53 PM, stanley@emc.com escribió:

 

Hi guys,

I have a cluster of 16 machines running Hadoop. Now I want to do some benchmark on this cluster 
with the nnbench and mrbench.
I'm new to the hadoop thing and have no one to refer to. I don't know what the 
supposed result should I have?
Now for mrbench, I have an average time of 22sec for a one map job. Is this too 
bad? What the supposed results might be?

For nnbench, what's the supposed results? Below is my result.

  Datetime: 2011-05-05 20:40:25,459

   Test Operation: rename
   Start time: 2011-05-05 20:40:03,820
  Maps to run: 1
   Reduces to run: 1
   Block Size (bytes): 1
   Bytes to write: 0
   Bytes per checksum: 1
  Number of files: 1
   Replication factor: 1
   Successful file operations: 1

   # maps that missed the barrier: 0
 # exceptions: 0

  TPS: Rename: 1763
   Avg Exec time (ms): Rename: 0.5672
 Avg Lat (ms): Rename: 0.4844
null

RAW DATA: AL Total #1: 4844
RAW DATA: AL Total #2: 0
 RAW DATA: TPS Total (ms): 5672
  RAW DATA: Longest Map Time (ms): 5672.0
  RAW DATA: Late maps: 0
RAW DATA: # of exceptions: 0
=
One more question, when I set maps number to bigger, I get all zeros results:
=
Test Operation: create_write
   Start time: 2011-05-03 23:22:39,239
  Maps to run: 160
   Reduces to run: 160
   Block Size (bytes): 1
   Bytes to write: 0
   Bytes per checksum: 1
  Number of files: 1
   Replication factor: 1
   Successful file operations: 0

   # maps that missed the barrier: 0
 # exceptions: 0

  TPS: Create/Write/Close: 0
Avg exec time (ms): Create/Write/Close: 0.0
   Avg Lat (ms): Create/Write: NaN
  Avg Lat (ms): Close: NaN

RAW DATA: AL Total #1: 0
RAW DATA: AL Total #2: 0
 RAW DATA: TPS Total (ms): 0
  RAW DATA: Longest Map Time (ms): 0.0
  RAW DATA: Late maps: 0
RAW DATA: # of exceptions: 0
=

Can anyone point me to some documents?
I really appreciate your help :)

Thanks,
stanley


   

You can use these resources:
http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
http://answers.oreilly.com/topic/460-how-to-benchmark-a-hadoop-cluster/
http://wiki.apache.org/hadoop/HardwareBenchmarks
http://www.quora.com/Apache-Hadoop/Are-there-any-good-Hadoop-benchmark-problems

Regards


 

Well, Michael Noll's post says this:

NameNode benchmark (nnbench)
===
NNBench (see src/test/org/apache/hadoop/hdfs/NNBench.java) is useful for
load testing the NameNode hardware and configuration. It generates a lot
of HDFS-related requests with normally very small payloads for the
sole purpose of putting a high HDFS management stress on the NameNode.
The benchmark can simulate requests for creating, reading, renaming and
deleting files on HDFS.

I like to run this test simultaneously from several machines -- e.g.
from a set of DataNode boxes -- in order to hit the NameNode from
multiple locations at the same time.

The syntax of NNBench is as follows:

NameNode Benchmark 0.4
Usage: nnbench <options>
Options:
  -operation <Available operations are create_write open_read
rename delete. This option is mandatory>
   * NOTE: The open_read, rename and delete

Re: FW: NNbench and MRBench

2011-05-07 Thread Marcos Ortiz

El 5/7/2011 8:53 PM, stanley@emc.com escribió:

Hi guys,

I have a cluster of 16 machines running Hadoop. Now I want to do some benchmark on this cluster 
with the nnbench and mrbench.
I'm new to the hadoop thing and have no one to refer to. I don't know what the 
supposed result should I have?
Now for mrbench, I have an average time of 22sec for a one map job. Is this too 
bad? What the supposed results might be?

For nnbench, what's the supposed results? Below is my result.

Date  time: 2011-05-05 20:40:25,459

 Test Operation: rename
 Start time: 2011-05-05 20:40:03,820
Maps to run: 1
 Reduces to run: 1
 Block Size (bytes): 1
 Bytes to write: 0
 Bytes per checksum: 1
Number of files: 1
 Replication factor: 1
 Successful file operations: 1

 # maps that missed the barrier: 0
   # exceptions: 0

TPS: Rename: 1763
 Avg Exec time (ms): Rename: 0.5672
   Avg Lat (ms): Rename: 0.4844
null

  RAW DATA: AL Total #1: 4844
  RAW DATA: AL Total #2: 0
   RAW DATA: TPS Total (ms): 5672
RAW DATA: Longest Map Time (ms): 5672.0
RAW DATA: Late maps: 0
  RAW DATA: # of exceptions: 0
=
One more question, when I set maps number to bigger, I get all zeros results:
=
Test Operation: create_write
 Start time: 2011-05-03 23:22:39,239
Maps to run: 160
 Reduces to run: 160
 Block Size (bytes): 1
 Bytes to write: 0
 Bytes per checksum: 1
Number of files: 1
 Replication factor: 1
 Successful file operations: 0

 # maps that missed the barrier: 0
   # exceptions: 0

TPS: Create/Write/Close: 0
Avg exec time (ms): Create/Write/Close: 0.0
 Avg Lat (ms): Create/Write: NaN
Avg Lat (ms): Close: NaN

  RAW DATA: AL Total #1: 0
  RAW DATA: AL Total #2: 0
   RAW DATA: TPS Total (ms): 0
RAW DATA: Longest Map Time (ms): 0.0
RAW DATA: Late maps: 0
  RAW DATA: # of exceptions: 0
=

Can anyone point me to some documents?
I really appreciate your help :)

Thanks,
stanley
   

You can use these resources:
http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
http://answers.oreilly.com/topic/460-how-to-benchmark-a-hadoop-cluster/
http://wiki.apache.org/hadoop/HardwareBenchmarks
http://www.quora.com/Apache-Hadoop/Are-there-any-good-Hadoop-benchmark-problems

Regards

--
Marcos Luís Ortíz Valmaseda
 Software Engineer (Large-Scaled Distributed Systems)
 University of Information Sciences,
 La Habana, Cuba
 Linux User # 418229
 http://about.me/marcosortiz



Re: FW: NNbench and MRBench

2011-05-07 Thread Marcos Ortiz

El 5/7/2011 10:33 PM, stanley@emc.com escribió:

Thanks, Marcos,

Through these links, I still can't find anything about the NNbench and MRBench.

-Original Message-
From: Marcos Ortiz [mailto:mlor...@uci.cu]
Sent: 2011年5月8日 10:23
To: mapreduce-user@hadoop.apache.org
Cc: Shi, Stanley
Subject: Re: FW: NNbench and MRBench

El 5/7/2011 8:53 PM, stanley@emc.com escribió:
   

Hi guys,

I have a cluster of 16 machines running Hadoop. Now I want to do some benchmark on this cluster 
with the nnbench and mrbench.
I'm new to the hadoop thing and have no one to refer to. I don't know what the 
supposed result should I have?
Now for mrbench, I have an average time of 22sec for a one map job. Is this too 
bad? What the supposed results might be?

For nnbench, what's the supposed results? Below is my result.

 Date   time: 2011-05-05 20:40:25,459

  Test Operation: rename
  Start time: 2011-05-05 20:40:03,820
 Maps to run: 1
  Reduces to run: 1
  Block Size (bytes): 1
  Bytes to write: 0
  Bytes per checksum: 1
 Number of files: 1
  Replication factor: 1
  Successful file operations: 1

  # maps that missed the barrier: 0
# exceptions: 0

 TPS: Rename: 1763
  Avg Exec time (ms): Rename: 0.5672
Avg Lat (ms): Rename: 0.4844
null

   RAW DATA: AL Total #1: 4844
   RAW DATA: AL Total #2: 0
RAW DATA: TPS Total (ms): 5672
 RAW DATA: Longest Map Time (ms): 5672.0
 RAW DATA: Late maps: 0
   RAW DATA: # of exceptions: 0
=
One more question, when I set maps number to bigger, I get all zeros results:
=
Test Operation: create_write
  Start time: 2011-05-03 23:22:39,239
 Maps to run: 160
  Reduces to run: 160
  Block Size (bytes): 1
  Bytes to write: 0
  Bytes per checksum: 1
 Number of files: 1
  Replication factor: 1
  Successful file operations: 0

  # maps that missed the barrier: 0
# exceptions: 0

 TPS: Create/Write/Close: 0
Avg exec time (ms): Create/Write/Close: 0.0
  Avg Lat (ms): Create/Write: NaN
 Avg Lat (ms): Close: NaN

   RAW DATA: AL Total #1: 0
   RAW DATA: AL Total #2: 0
RAW DATA: TPS Total (ms): 0
 RAW DATA: Longest Map Time (ms): 0.0
 RAW DATA: Late maps: 0
   RAW DATA: # of exceptions: 0
=

Can anyone point me to some documents?
I really appreciate your help :)

Thanks,
stanley

 

You can use these resources:
http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
http://answers.oreilly.com/topic/460-how-to-benchmark-a-hadoop-cluster/
http://wiki.apache.org/hadoop/HardwareBenchmarks
http://www.quora.com/Apache-Hadoop/Are-there-any-good-Hadoop-benchmark-problems

Regards

   

Well, Michael Noll's post says this:

NameNode benchmark (nnbench)
===
NNBench (see src/test/org/apache/hadoop/hdfs/NNBench.java) is useful for 
load testing the NameNode hardware and configuration. It generates a lot 
of HDFS-related requests with normally very small payloads for the 
sole purpose of putting a high HDFS management stress on the NameNode. 
The benchmark can simulate requests for creating, reading, renaming and 
deleting files on HDFS.


I like to run this test simultaneously from several machines -- e.g. 
from a set of DataNode boxes -- in order to hit the NameNode from 
multiple locations at the same time.


The syntax of NNBench is as follows:

NameNode Benchmark 0.4
Usage: nnbench <options>
Options:
-operation <Available operations are create_write open_read
rename delete. This option is mandatory>
 * NOTE: The open_read, rename and delete operations assume
that the files they operate on are already available. The create_write
operation must be run before running the other operations.

-maps <number of maps. default is 1. This is not mandatory>
-reduces <number of reduces. default is 1. This is not mandatory>
-startTime <time to start, given in seconds from the epoch.
Make sure this is far enough into the future, so all maps (operations)
will start at the same time. default is launch time + 2 mins. This is
not mandatory>
-blockSize <Block size in bytes>

Re: Other FS Pointer?

2011-05-04 Thread Marcos Ortiz Valmaseda
For example:

 * Amazon S3 (Amazon Simple Storage Service): http://aws.amazon.com/s3/
   On the Hadoop wiki, there is a competed guide to work Hadoop with Amazon S3
   http://wiki.apache.org/hadoop/AmazonS3 

 * IBM GPFS:
   http://www.ibm.com/systems/gpfs/ 
   https://issues.apache.org/jira/browse/HADOOP-6330
   http://www.almaden.ibm.com/StorageSystems/projects/gpfs/

 * CloudStore: 
   http://kosmosfs.sourceforge.net/

 * ASter Data ´s Integration with Hadoop: 
   http://www.asterdata.com/news/091001-Aster-Hadoop-connector.php

Regards
- Mensaje original -
De: Anh Nguyen angu...@redhat.com
Para: hdfs-user@hadoop.apache.org
Enviados: Miércoles, 4 de Mayo 2011 12:57:24 (GMT-0500) Auto-Detected
Asunto: Other FS Pointer?

Hi,
Can anyone point me to a doc describing how to port/use another 
clustered FS?
Thanks.

Anh-

-- 
Marcos Luís Ortíz Valmaseda
 Software Engineer 
 Universidad de las Ciencias Informáticas
 Linux User # 418229

http://uncubanitolinuxero.blogspot.com
http://www.linkedin.com/in/marcosluis2186



Re: hadoop branch-0.20-append Build error:build.xml:933: exec returned: 1

2011-04-12 Thread Marcos Ortiz

El 4/11/2011 10:45 PM, Alex Luya escribió:

BUILD FAILED
.../branch-0 .20-append/build.xml:927: The following error
occurred while executing this line:
../branch-0 .20-append/build.xml:933: exec returned: 1

Total time: 1 minute 17 seconds
+ RESULT=1
+ '[' 1 '!=' 0 ']'
+ echo 'Build Failed: 64-bit build not run'
Build Failed: 64-bit build not run
+ exit 1
-
I checked content in file build.xml:

line 927: <antcall target="cn-docs"/></target><target name="cn-docs"
depends="forrest.check, init" description="Generate forrest-based
Chinese documentation. To use, specify -Dforrest.home=&lt;base of Apache
Forrest installation&gt; on the command line." if="forrest.home">
line 933: <exec dir="${src.docs.cn}"
executable="${forrest.home}/bin/forrest" failonerror="true">
---
It seems try to execute forrest,what is the problem here?I am running a
64bit ubuntu,with 64+32bit-jdk-1.6 and 64-bit-jdk-1.5  installed.Some
guys told there are some tricks in this
page:http://wiki.apache.org/hadoop/HowToRelease  to get forrest build to
work.But I can't find any tricks in the page.
Any help is appreciated.


   

1- Which version of Java do you have in the JAVA_HOME variable?
You can browse the Forrest page to see how to build it:
http://forrest.apache.org


2- Another question for you:
Do you actually need Forrest?

Regards

--
Marcos Luís Ortíz Valmaseda
 Software Engineer (Large-Scaled Distributed Systems)
 University of Information Sciences,
 La Habana, Cuba
 Linux User # 418229



Re: Question regarding datanode been wiped by hadoop

2011-04-12 Thread Marcos Ortiz

El 4/12/2011 10:46 AM, felix gao escribió:


What reason/condition would cause a datanode’s blocks to be removed? 
  Our cluster had a one of its datanodes crash because of bad RAM. 
  After the system was upgraded and the datanode/tasktracker brought 
online the next day we noticed the amount of space utilized was 
minimal and the cluster was rebalancing blocks to the datanode.   It 
would seem the prior blocks were removed.   Was this because the 
datanode was declared dead?   What is the criteria for a namenode to 
decide (Assuming its the namenode) when a datanode should remove prior 
blocks?



1- Did you check the DataNode's logs?
2- Did you protect the NameNode's dfs.name.dir and dfs.edits.dir
directories?
In these directories, the NameNode stores the file system image and the
edit log (journal), respectively. A good practice is to have them on
RAID 1 or RAID 10 to guarantee the consistency of your cluster.


Any data loss in these directories (dfs.name.dir and dfs.edits.dir)
will result in a loss of data in your HDFS. So the second good practice
is to set up a Secondary NameNode in case the primary
NameNode fails.


Another thing to keep in mind is that when the NameNode fails, you have
to restart the JobTracker and the TaskTrackers after the NameNode
has been restarted.

Regards

--
Marcos Luís Ortíz Valmaseda
 Software Engineer (Large-Scaled Distributed Systems)
 University of Information Sciences,
 La Habana, Cuba
 Linux User # 418229



Re: hadoop branch-0.20-append Build error:build.xml:933: exec returned: 1

2011-04-11 Thread Marcos Ortiz

El 4/11/2011 10:45 PM, Alex Luya escribió:

BUILD FAILED
.../branch-0 .20-append/build.xml:927: The following error
occurred while executing this line:
../branch-0 .20-append/build.xml:933: exec returned: 1

Total time: 1 minute 17 seconds
+ RESULT=1
+ '[' 1 '!=' 0 ']'
+ echo 'Build Failed: 64-bit build not run'
Build Failed: 64-bit build not run
+ exit 1
-
I checked content in file build.xml:

line 927: <antcall target="cn-docs"/></target><target name="cn-docs"
depends="forrest.check, init" description="Generate forrest-based
Chinese documentation. To use, specify -Dforrest.home=&lt;base of Apache
Forrest installation&gt; on the command line." if="forrest.home">
line 933: <exec dir="${src.docs.cn}"
executable="${forrest.home}/bin/forrest" failonerror="true">
---
It seems try to execute forrest,what is the problem here?I am running a
64bit ubuntu,with 64+32bit-jdk-1.6 and 64-bit-jdk-1.5  installed.Some
guys told there are some tricks in this
page:http://wiki.apache.org/hadoop/HowToRelease  to get forrest build to
work.But I can't find any tricks in the page.
Any help is appreciated.


   

1- Which version of Java do you have in the JAVA_HOME variable?
You can browse the Forrest page to see how to build it:
http://forrest.apache.org


2- Another question for you:
Do you actually need Forrest?

Regards

--
Marcos Luís Ortíz Valmaseda
 Software Engineer (Large-Scaled Distributed Systems)
 University of Information Sciences,
 La Habana, Cuba
 Linux User # 418229



Re: mapred.min.split.size

2011-03-18 Thread Marcos Ortiz

El 3/18/2011 3:54 PM, Pedro Costa escribió:

Hi

What's the purpose of the parameter mapred.min.split.size?

Thanks,
   
There are many parameters that control the number of map tasks for a
job, and mapred.min.split.size controls the minimum size of an input split.
Other parameters that come into play are (see the sketch after the list):

- mapreduce.map.tasks: The suggested number of map tasks
- dfs.block.size: the file system block size in bytes of the input file
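
Roughly, the old-API FileInputFormat combines these values as in the sketch below (an approximation of the actual logic, not the exact source):

// Approximate rule used to pick the split size (old API).
public class SplitSizeSketch {
  static long splitSize(long totalInputBytes, int suggestedMaps,
                        long minSplitSize, long blockSize) {
    long goalSize = totalInputBytes / Math.max(1, suggestedMaps); // from mapreduce.map.tasks
    return Math.max(minSplitSize, Math.min(goalSize, blockSize)); // mapred.min.split.size wins last
  }

  public static void main(String[] args) {
    // Example: 1 GB of input, 4 suggested maps, min split size of 1 byte, 64 MB blocks
    System.out.println(splitSize(1L << 30, 4, 1L, 64L << 20)); // prints 67108864, i.e. one split per block
  }
}

So raising mapred.min.split.size above the block size is a common way to get fewer, larger map tasks.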

Regards

--
Marcos Luís Ortíz Valmaseda
 Software Engineer
 Universidad de las Ciencias Informáticas
 Linux User # 418229

http://uncubanitolinuxero.blogspot.com
http://www.linkedin.com/in/marcosluis2186



Re: Lost Task Tracker because of no heartbeat

2011-03-16 Thread Marcos Ortiz
On Wed, 2011-03-16 at 17:50 +0100, baran cakici wrote:
 Hi Everyone,
 
 I am doing a project with Hadoop MapReduce for my master's thesis. I have a
 strange problem on my system.
 
 First of all, I use Hadoop-0.20.2 on Windows XP Pro with Eclipse
 Plug-In. When I start a job with big Input(4GB - it`s may be not to
 big, but algorithm require some time), then i lose my Task Tracker in
 several minutes or seconds. I mean, Seconds since heartbeat
 increase 
 and then after 600 Seconds I lose TaskTracker.  
 
 I read somewhere, that can be occured because of small number of open
 files (ulimit -n). I try to increase this value, but i can write as
 max value in Cygwin 3200.(ulimit -n 3200) and default value is 256.
 Actually I don`t know, is it helps or not.
 
 In my job and task tracker.log have I some Errors, I posted those to.
 
 Jobtracker.log
 
 -Call to localhost/127.0.0.1:9000 failed on local exception:
 java.io.IOException: An existing connection was forcibly closed by the
 remote host
 
 another :
 -
 2011-03-15 12:13:30,718 INFO org.apache.hadoop.mapred.JobTracker:
 attempt_201103151143_0002_m_91_0 is 97125 ms debug.
 2011-03-15 12:16:50,718 INFO org.apache.hadoop.mapred.JobTracker:
 attempt_201103151143_0002_m_91_0 is 297125 ms debug.
 2011-03-15 12:20:10,718 INFO org.apache.hadoop.mapred.JobTracker:
 attempt_201103151143_0002_m_91_0 is 497125 ms debug.
 2011-03-15 12:23:30,718 INFO org.apache.hadoop.mapred.JobTracker:
 attempt_201103151143_0002_m_91_0 is 697125 ms debug.
 
 Error launching task
 Lost tracker 'tracker_apple:localhost/127.0.0.1:2654'
 
 there are my logs(jobtracker.log, tasktracker.log ...) in attachment 
 
 I really need help; I don't have much time left for my thesis.
 
 Thanks a lot for your Helps,
 
 Baran 

Regards, Baran 
I was analyzing your logs and I have several questions:
1- On the hadoop-Baran-jobtracker-apple.log you have this:
Cleaning up the system directory
2011-03-15 01:18:44,468 INFO org.apache.hadoop.mapred.JobTracker:
problem cleaning system directory:
hdfs://localhost:9000/cygwin/usr/local/hadoop-datastore/hadoop-Baran/mapred/system
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot
delete /cygwin/usr/local/hadoop-datastore/hadoop-Baran/mapred/system.
Name node is in safe mode.

This is a notice that you are doing something wrong with HDFS.
Can you provide the output of:
 hadoop dfsadmin -report 
on the NameNode?

Regards

-- 
 Marcos Luís Ortíz Valmaseda
 Software Engineer
 Centro de Tecnologías de Gestión de Datos (DATEC)
 Universidad de las Ciencias Informáticas
 http://uncubanitolinuxero.blogspot.com
 http://www.linkedin.com/in/marcosluis2186




Re: Lost Task Tracker because of no heartbeat

2011-03-16 Thread Marcos Ortiz
On Thu, 2011-03-17 at 00:19 +0530, Harsh J wrote: 
 On Thu, Mar 17, 2011 at 12:42 AM, Marcos Ortiz mlor...@uci.cu wrote:
  2011-03-15 01:18:44,468 INFO org.apache.hadoop.mapred.JobTracker:
  problem cleaning system directory:
  hdfs://localhost:9000/cygwin/usr/local/hadoop-datastore/hadoop-Baran/mapred/system
  org.apache.hadoop.ipc.RemoteException:
  org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot
  delete /cygwin/usr/local/hadoop-datastore/hadoop-Baran/mapred/system.
  Name node is in safe mode.
 
 Marcos, the JT keeps attempting to clear the mapred.system.dir on the
 DFS at startup, and fails because the NameNode wasn't ready when it
 tried (and thereby reattempts after a time, and passes later when NN
 is ready for some editing action). This is mostly because Baran is
 issuing a start-all/stop-all instead of a simple start/stop of mapred
 components.
 
Thanks a lot, Harsh for the response.
I think that's a good entry to add to the Problems/Solutions section on
the Hadoop Wiki.

Regards
-- 
 Marcos Luís Ortíz Valmaseda
 Software Engineer
 Centro de Tecnologías de Gestión de Datos (DATEC)
 Universidad de las Ciencias Informáticas
 http://uncubanitolinuxero.blogspot.com
 http://www.linkedin.com/in/marcosluis2186




Re: cloudera CDH3 error: namenode running,but:Error: JAVA_HOME is not set and Java could not be found

2011-03-16 Thread Marcos Ortiz
On Wed, 2011-03-16 at 23:19 +0800, Alex Luya wrote: 
 I download cloudera CDH3 beta:hadoop-0.20.2+228,and modified three
 files:hdfs.xml,core-site.xml and hadoop-env.sh.and I do have set
 JAVA_HOME in file:hadoop-env.sh,and then try to run:start-dfs.sh,got
 this error,but strange thing is that namenode is running.I can't
 understand why.Any help is appreciate. 
 
I think this question belongs on the cdh-users mailing list, but I
will try to help you.

1- Are you sure that you installed the Java Development Kit(JDK 1.6+)?
2- Which is your environment?
   - Operating System
   - Architecture
3- Can you check the Cloudera Documentation about: Installing and
configuring CDH3? http://docs.cloudera.com/

4- If you created a new user account (recommended) called hadoop, did you
check that the JAVA_HOME variable is set in that user's
environment?
If you are using a Linux distribution supported by the Cloudera team,
I recommend that you use the official repositories.

http://archives.cloudera.com

You can check the last news on the blog, where they talked about the new
Linux distributions supported by CDH3:
- Debian 5/6
- Ubuntu 10.10
- Red Hat 5.4
- CentOS 5
- SUSE EL 11

The last recommendation is to check the DZone Refcard from Eugene
Ciurana (http://eugeneciurana.eu) and the Cloudera team, called "Apache
Hadoop Deployment: A Blueprint for Reliable Distributed Computing".

Regards, 
-- 
 Marcos Luís Ortíz Valmaseda
 Software Engineer
 Centro de Tecnologías de Gestión de Datos (DATEC)
 Universidad de las Ciencias Informáticas
 http://uncubanitolinuxero.blogspot.com
 http://www.linkedin.com/in/marcosluis2186




Re: Could not obtain block

2011-03-09 Thread Marcos Ortiz

El 3/9/2011 6:27 AM, Evert Lammerts escribió:

We see a lot of IOExceptions coming from HDFS during a job that does nothing 
but untar 100 files (1 per Mapper, sizes vary between 5GB and 80GB) that are in 
HDFS, to HDFS. DataNodes are also showing Exceptions that I think are related. 
(See stacktraces below.)

This job should not be able to overload the system I think... I realize that 
much data needs to go over the lines, but HDFS should still be responsive. Any 
ideas / help is much appreciated!

Some details:
* Hadoop 0.20.2 (CDH3b4)
* 5 node cluster plus 1 node for JT/NN (Sun Thumpers)
* 4 cores/node, 4GB RAM/core
* CentOS 5.5

Job output:

java.io.IOException: java.io.IOException: Could not obtain block: 
blk_-3695352030358969086_130839 
file=/user/emeij/icwsm-data-test/01-26-SOCIAL_MEDIA.tar.gz
   

What is the output of:
  bin/hadoop dfsadmin -report

What is the output of:
  bin/hadoop fsck /user/emeij/icwsm-data-test/

at ilps.DownloadICWSM$UntarMapper.map(DownloadICWSM.java:449)
at ilps.DownloadICWSM$UntarMapper.map(DownloadICWSM.java:1)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:390)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:324)
at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
at org.apache.hadoop.mapred.Child.main(Child.java:234)
Caused by: java.io.IOException: Could not obtain block: 
blk_-3695352030358969086_130839 
file=/user/emeij/icwsm-data-test/01-26-SOCIAL_MEDIA.tar.gz
   

What is the output of:
 bin/hadoop fsck /user/emeij/icwsm-data-test/01-26-SOCIAL_MEDIA.tar.gz
-files -blocks -racks

at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1977)
at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1784)
at 
org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1932)
at java.io.DataInputStream.read(DataInputStream.java:83)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:55)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:74)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:335)
at ilps.DownloadICWSM$CopyThread.run(DownloadICWSM.java:149)


Example DataNode Exceptions (not that these come from the node at 
192.168.28.211):

2011-03-08 19:40:40,297 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Exception in receiveBlock for block blk_-9222067946733189014_3798233 
java.io.EOFException: while trying to read 3067064 bytes
2011-03-08 19:40:41,018 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/192.168.28.211:50050, dest: /192.168.28.211:49748, bytes: 0, op: HDFS_READ, 
cliID: DFSClient_attempt_201103071120_0030_m_32_0, offset: 30
72, srvID: DS-568746059-145.100.2.180-50050-1291128670510, blockid: 
blk_3596618013242149887_4060598, duration: 2632000
2011-03-08 19:40:41,049 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Exception in receiveBlock for block blk_-9221028436071074510_2325937 
java.io.EOFException: while trying to read 2206400 bytes
2011-03-08 19:40:41,348 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Exception in receiveBlock for block blk_-9221549395563181322_4024529 
java.io.EOFException: while trying to read 3037288 bytes
2011-03-08 19:40:41,357 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Exception in receiveBlock for block blk_-9221885906633018147_3895876 
java.io.EOFException: while trying to read 1981952 bytes
2011-03-08 19:40:41,434 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
Block blk_-9221885906633018147_3895876 unfinalized and removed.
2011-03-08 19:40:41,434 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
writeBlock blk_-9221885906633018147_3895876 received exception 
java.io.EOFException: while trying to read 1981952 bytes
2011-03-08 19:40:41,434 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(192.168.28.211:50050, 
storageID=DS-568746059-145.100.2.180-50050-1291128670510, infoPort=50075, 
ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 1981952 bytes
 at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
 at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
 at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
 at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
 at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
 at 

Re: Could not obtain block

2011-03-09 Thread Marcos Ortiz

El 3/9/2011 11:09 AM, Evert Lammerts escribió:

I didn't mention it but the complete filesystem is reported healthy by fsck. 
I'm guessing that the java.io.EOFException indicates a problem caused by the 
load of the job.

Any ideas?

   

It's very tricky work to debug a MapReduce job execution, but I'll try.

java.io.EOFException: while trying to read 1981952 bytes
  at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
  at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
  at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
  at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
  at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
  at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
 2011-03-08 19:40:41,465 WARN 
org.apache.hadoop.hdfs.server.datanode.DataNode: Block 
blk_-9221549395563181322_4024529 unfinalized and removed.


1- Did you check this?
2- What are the file permissions on /user/emeij/icwsm-data-test/ ?

If the fsck command reports that everything is fine, I really don't know more.

Regards

--
Marcos Luís Ortíz Valmaseda
 Software Engineer
 Universidad de las Ciencias Informáticas
 Linux User # 418229

http://uncubanitolinuxero.blogspot.com
http://www.linkedin.com/in/marcosluis2186



how to use hadoop apis with cloudera distribution ?

2011-03-08 Thread Marcos Ortiz
On Tue, 2011-03-08 at 07:16 -0800, Mapred Learn wrote: 
 
  Hi,
  I downloaded CDH3 VM for hadoop but if I want to use something like:
   
  import org.apache.hadoop.conf.Configuration;
   
  in my java code, what else do I need to do ?

You can see all the tutorials that Cloudera has on its site:
http://www.cloudera.com/presentations
http://www.cloudera.com/info/training
http://www.cloudera.com/developers/learn-hadoop/

You can check the CDH3 official documentation and the latest news about
the new release:

http://docs.cloudera.com
http://www.cloudera.com/blog/category/cdh/

  
  Do i need to download hadoop from apache ? 
No, CDH3 beta comes with all the required tools to work with Hadoop, and even
more applications like Hue, Oozie, ZooKeeper, Pig, Hive, Chukwa, HBase,
Flume, etc.
   
  if yes, then what does cdh3 do ?
The Cloudera colleagues have done an excellent job packaging the most used
applications with Hadoop on a single virtual machine for testing, and
they provide an easier way to get started with Hadoop.

They have Red Hat and Ubuntu/Debian compatible packages to make the
installation, configuration and use of Hadoop easier on these operating
systems.

Please, read http://docs.cloudera.com

   
  if not, then where can i find hadoop code on cdh VM ?
   
   
  I am using above line in my java code in eclipse and eclipse is not able to 
  find it.
Did you set JAVA_HOME and HADOOP_HOME on your system?

If you have any doubts about this, you can check the excellent DZone
refcards "Getting Started with Hadoop" and "Deploying Hadoop",
written by Eugene Ciurana (http://eugeneciurana.eu), VP of Technology at
Badoo.com.
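
As a quick sanity check that the Hadoop jars from the CDH installation are on the Eclipse build path, a tiny program like this (purely illustrative; only standard Hadoop classes are assumed) should compile and run:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// If this compiles in Eclipse, the hadoop-core jar is on the build path;
// if it runs, the configuration files on the classpath are being picked up.
public class HadoopClasspathCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    System.out.println("fs.default.name = " + conf.get("fs.default.name"));
    FileSystem fs = FileSystem.get(conf);
    System.out.println("working dir     = " + fs.getWorkingDirectory());
  }
}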

Regards, and I hope that this information could be useful for you.

-- 
 Marcos Luís Ortíz Valmaseda
 Software Engineer
 Centro de Tecnologías de Gestión de Datos (DATEC)
 Universidad de las Ciencias Informáticas
 http://uncubanitolinuxero.blogspot.com
 http://www.linkedin.com/in/marcosluis2186




Re: Dataset comparison and ranking - views

2011-03-08 Thread Marcos Ortiz
On Tue, 2011-03-08 at 10:51 +0530, Sonal Goyal wrote:
 Hi Marcos,
 
 Thanks for replying. I think I was not very clear in my last post. Let
 me describe my use case in detail.
 
 I have two datasets coming from different sources, lets call them
 dataset1 and dataset2. Both of them contain records for entities, say
 Person. A single record looks like:
 
 First Name Last Name,  Street, City, State,Zip
 
 We want to compare each record of dataset1 with each record of
 dataset2, in effect a cross join. 
 
 We know that the way data is collected, names will not match exactly,
 but we want to find close enoughs. So we have a rule which says create
 bigrams and find the matching bigrams. If 0 to 5 match, give a score
 of 10, if 5-15 match, give a score of 20 and so on. 
Well, an approach to this problem is given by Milind
Bhandarkar in his presentation called "Practical Problem Solving with
Hadoop and Pig".
He talks about a solution for bigrams, giving an example with word
matching.
Bigrams


Input: A large text corpus
• Output: List(word1, TopK(word2))
• Two Stages:
• Generate all possible bigrams
• Find most frequent K bigrams for each word

Bigrams: Stage 1
Map
===
• Generate all possible bigrams
• Map Input: Large text corpus
• Map computation
• In each sentence, for each “word1 word2”
• Output (word1, word2), (word2, word1)
• Partition & Sort by (word1, word2)

pairs.pl

while(<STDIN>) {
chomp;
$_ =~ s/[^a-zA-Z]+/ /g ;
$_ =~ s/^\s+//g ;
$_ =~ s/\s+$//g ;
$_ =~ tr/A-Z/a-z/;
my @words = split(/\s+/, $_);
for (my $i = 0; $i < $#words; ++$i) {
print "$words[$i]:$words[$i+1]\n";
print "$words[$i+1]:$words[$i]\n";
}
}

Bigrams: Stage 1
Reduce
==
• Input: List(word1, word2) sorted and partitioned
• Output: List(word1, [freq, word2])
• Counting similar to Unigrams example

count.pl


$_ = <STDIN>; chomp;
my ($pw1, $pw2) = split(/:/, $_);
$count = 1;
while(<STDIN>) {
chomp;
my ($w1, $w2) = split(/:/, $_);
if ($w1 eq $pw1 && $w2 eq $pw2) {
$count++;
} else {
print "$pw1:$count:$pw2\n";
$pw1 = $w1;
$pw2 = $w2;
$count = 1;
}
}
print "$pw1:$count:$pw2\n";

Bigrams: Stage 2
Map
===
• Input: List(word1, [freq, word2])
• Output: List(word1, [freq, word2])
• Identity Mapper (/bin/cat)
• Partition by word1
• Sort descending by (word1, freq)

Bigrams: Stage 2
Reduce
==
• Input: List(word1, [freq, word2])
• partitioned by word1
• sorted descending by (word1, freq)
• Output: TopK(List(word1, [freq, word2]))
• For each word, throw away after K records

firstN.pl
$N = 5;
$_ = <STDIN>; chomp;
my ($pw1, $count, $pw2) = split(/:/, $_);
$idx = 1;
$out = "$pw1\t$pw2,$count;";
while(<STDIN>) {
chomp;
my ($w1, $c, $w2) = split(/:/, $_);
if ($w1 eq $pw1) {
if ($idx < $N) {
$out .= "$w2,$c;";
$idx++;
}
} else {
print "$out\n";
$pw1 = $w1;
$idx = 1;
$out = "$pw1\t$w2,$c;";
}
}
print "$out\n";


You can translate this approach to your specific problem (a sketch follows below).
I recommend that you discuss this with him because he has vast
experience with all this, much more than me.
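
To give a flavour of that translation, here is a minimal sketch of a map step that emits character bigrams of a name field (new API; the CSV layout, field positions and class name are only assumptions for illustration):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: one CSV record per line, first two fields are first/last name.
// Emits (bigram, 1); in a real matching job you would emit the record id and
// source dataset as the value instead, so matching bigrams can be joined and scored.
public class NameBigramMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text bigram = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");
    if (fields.length < 2) {
      return; // skip malformed records
    }
    String name = (fields[0] + " " + fields[1]).toLowerCase();
    for (int i = 0; i + 1 < name.length(); i++) {
      bigram.set(name.substring(i, i + 2));
      context.write(bigram, ONE);
    }
  }
}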

Regards

 For Zip, we have our rule saying exact match or within 5 kms of each
 other(through a lookup), give a score of 50 and so on.
 
 Once we have each person of dataset1 compared with that of dataset2,
 we find the overall rank. Which is a weighted average of scores of
 name, address etc comparison. 
 
 One approach is to use the DistributedCache for the smaller dataset
 and do a nested loop join in the mapper. The second approach is to use
 multiple  MR flows, and compare the fields and reduce/collate the
 results. 
 
 I am curious to know if people have other approaches they have
 implemented, what are the efficiencies they have built up etc.
 
 Thanks and Regards,
 Sonal
 Hadoop ETL and Data Integration
 Nube Technologies 
 
 
 
 
 
 
 
 On Tue, Mar 8, 2011 at 12:55 AM, Marcos Ortiz mlor...@uci.cu wrote:
 
 On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote:
  Hi,
 
  I am working on a problem to compare two different datasets,
 and rank
  each record of the first with respect to the other, in terms
 of how
  similar they are. The records are dimensional, but do not
 have a lot
  of dimensions. Some of the fields will be compared for exact
 matches,
  some for similar sound, some with closest match etc. One of
 the
  datasets is large, and the other is much smaller.  The final
 goal is
  to compute a rank between each record of first dataset with
 each
  record of the second. The rank is based on weighted scores
 of each
  dimension comparison.
 
  I was wondering if people in the community have any
 advice/suggested
  patterns/thoughts about cross joining two datasets in map
 reduce. Do
  let me know if you have any suggestions.
 
  Thanks and Regards,
  Sonal
  Hadoop ETL and Data Integration
  Nube

Re: how to use hadoop apis with cloudera distribution ?

2011-03-08 Thread Marcos Ortiz Valmaseda
You can check the Cloudera training videos, where there is a screencast explaining
how to develop for Hadoop using Eclipse:
http://www.cloudera.com/presentations
http://vimeo.com/cloudera

Now, for working with the Hadoop APIs in Eclipse and developing applications
based on Hadoop, you can use the Karmasphere plugin for Hadoop development, or
if you are a NetBeans user, they have a module for that.

Regards.


- Mensaje original -
De: Mapred Learn mapred.le...@gmail.com
Para: Marcos Ortiz mlor...@uci.cu
CC: mapreduce-user@hadoop.apache.org
Enviados: Martes, 8 de Marzo 2011 12:26:00 (GMT-0500) Auto-Detected
Asunto: Re: how to use hadoop apis with cloudera distribution ?


Thanks Marcos!
I was trying to use CDH3 with Eclipse and could not figure out why Eclipse
complains about the import statement for the Hadoop APIs when Cloudera already
includes them.

I did not understand how CDH3 works with Eclipse; does it download the Hadoop APIs
when we add SVN URLs?



On Tue, Mar 8, 2011 at 7:22 AM, Marcos Ortiz  mlor...@uci.cu  wrote: 



On Tue, 2011-03-08 at 07:16 -0800, Mapred Learn wrote: 
 
  Hi, 
  I downloaded CDH3 VM for hadoop but if I want to use something like: 
  
  import org.apache.hadoop.conf.Configuration; 
  
  in my java code, what else do I need to do ? 

Can you see all tutorial that Cloudera has on its site 
http://www.cloudera.com/presentations 
http://www.cloudera.com/info/training 
http://www.cloudera.com/developers/learn-hadoop/ 

Can you check the CDH3 Official Documentation and the last news about 
the new release: 

http://docs.cloudera.com 
http://www.cloudera.com/blog/category/cdh/ 



  Do i need to download hadoop from apache ? 
No, CDH beta 3 has with all required tools to work with Hadoop, even 
more applications like HUE, Oozie, Zookepper, Pig, Hive, Chukwa, HBase, 
Flume, etc 

  
  if yes, then what does cdh3 do ? 
The Cloudera' colleagues has a excelent work packaging the most used 
applications with Hadoop on a single virtual machine for testing and 
they did a better approach to use Hadoop. 

They has Red Hat and Ubuntu/Debian compatible packages to do more easy 
the installation, configuration and use of Hadoop on these operating 
systems. 

Please, read http://docs.cloudera.com 


  
  if not, then where can i find hadoop code on cdh VM ? 
  
  
  I am using above line in my java code in eclipse and eclipse is not able to 
  find it. 
Do you set JAVA_HOME, and HADOOP_HOME on your system? 

If you have any doubt with this, you can check the excellent DZone' 
refcards about Getting Started with Hadood and Deploying Hadoop 
written by Eugene Ciurana( http://eugeneciurana.eu ), VP of Technology at 
Badoo.com 

Regards, and I hope that this information could be useful for you. 

-- 
Marcos Luís Ortíz Valmaseda 
Software Engineer 
Centro de Tecnologías de Gestión de Datos (DATEC) 
Universidad de las Ciencias Informáticas 
http://uncubanitolinuxero.blogspot.com 
http://www.linkedin.com/in/marcosluis2186 

-- 
Marcos Luís Ortíz Valmaseda
 Software Engineer 
 Universidad de las Ciencias Informáticas
 Linux User # 418229

http://uncubanitolinuxero.blogspot.com
http://www.linkedin.com/in/marcosluis2186



Re: Dataset comparison and ranking - views

2011-03-07 Thread Marcos Ortiz
On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote:
 Hi,
 
 I am working on a problem to compare two different datasets, and rank
 each record of the first with respect to the other, in terms of how
 similar they are. The records are dimensional, but do not have a lot
 of dimensions. Some of the fields will be compared for exact matches,
 some for similar sound, some with closest match etc. One of the
 datasets is large, and the other is much smaller.  The final goal is
 to compute a rank between each record of first dataset with each
 record of the second. The rank is based on weighted scores of each
 dimension comparison.
 
 I was wondering if people in the community have any advice/suggested
 patterns/thoughts about cross joining two datasets in map reduce. Do
 let me know if you have any suggestions.   
 
 Thanks and Regards,
 Sonal
 Hadoop ETL and Data Integration
 Nube Technologies 

Regards, Sonal. Can you give us more information about a basic workflow
of your idea?

Some questions:
- How do you know that two records are identical? By id?
- Can you give a example of the ranking that you want to archieve with a
match of each case:
- two records that are identical
- two records that ar similar
- two records with the closest match

For MapReduce algorithm design, I recommend this excellent post from
Ricky Ho:
http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html

For the join of the two datasets, you can use Pig for this. Here you
have a basic Pig example from Milind Bhandarkar
(mili...@yahoo-inc.com)'s talk Practical Problem Solving with Hadoop
and Pig:
Users = load ‘users’ as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load ‘pages’ as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group,
COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into ‘top5sites’;


-- 
 Marcos Luís Ortíz Valmaseda
 Software Engineer
 Centro de Tecnologías de Gestión de Datos (DATEC)
 Universidad de las Ciencias Informáticas
 http://uncubanitolinuxero.blogspot.com
 http://www.linkedin.com/in/marcosluis2186