Re: Working of combiner in hadoop
The key/value pairs are processed by the mapper independently of each other. The combiner logic deals with the outputs of many key/value pairs, so that logic cannot live in the map method itself.

On Jul 4, 2014 1:29 AM, Chhaya Vishwakarma chhaya.vishwaka...@lntinfotech.com wrote:
Hi,
If I have two map tasks running on one node, and I have also written a combiner class, will the combiner be called once for each map task or just once for both map tasks? Can I write logic inside map which will work as a combiner? If yes, will there be any side effects?
Regards,
Chhaya Vishwakarma
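The idea the question is circling around — doing the combining inside the map task itself — is usually called the "in-mapper combiner" pattern: keep per-task state across map() calls and flush it once at the end. A minimal sketch in the style of a Hadoop Streaming mapper (illustrative only, not code from the thread; `run` is a name chosen here for testability):

```python
# In-mapper combiner sketch (illustrative): a Hadoop Streaming style
# mapper that keeps per-task state across records and flushes aggregated
# counts once at the end, instead of emitting one (word, 1) pair per
# token and relying on a separate Combiner class.
import sys
from collections import defaultdict

def run(lines, out=sys.stdout):
    counts = defaultdict(int)      # state shared across all "map()" calls
    for line in lines:             # one iteration = one input record
        for word in line.split():
            counts[word] += 1      # combine locally instead of emitting now
    for word, n in sorted(counts.items()):   # "cleanup()": flush once per task
        out.write(f"{word}\t{n}\n")

# In a real Streaming job this file would end with: run(sys.stdin)
```

The side effect to watch for is memory: the dict grows with the number of distinct keys seen by the task, so for high-cardinality keys you would flush periodically rather than only at the end.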
Re: Need to evaluate the price of a Hadoop cluster
Some comments:
- Three drives of 1 TB each will serve you better than one 3 TB drive.
- On a small cluster you cannot afford to reserve a whole machine for each master daemon; the NameNode and JobTracker will have to cohabit with DataNodes and TaskTrackers.
- As for pricing: if it is for an institution, you should visit a few vendor websites; if it is for yourself, try eBay.
- You should add networking hardware to your budget.
Cheers

On Jul 3, 2014 11:19 AM, YIMEN YIMGA Gael gael.yimen-yi...@sgcib.com wrote:
Hello Dear all,
I would like to evaluate the price of a Hadoop cluster with the characteristics below for my NameNode and DataNodes. My cluster should have one NameNode and three DataNodes. Could someone help me with the price of commodity hardware with these characteristics, please? Standing by...

NAMENODE
Model: xxx
CPU: 2 CPUs, 2 GHz
RAM: 14 GB
HD: 1 TB
OS: RHEL or Debian
Content: RHEL 6 or Debian 7.5, ssh daemon, Apache 2.4, JobTracker daemon, NameNode daemon, DHCP service

DATANODE
Model: xxx
CPU: 1 CPU, 3 GHz
RAM: 16 GB
HD: 3 TB
OS: RHEL or Debian
Content: RHEL 6 or Debian 7.5, ssh daemon, Apache 2.4, TaskTracker daemon, DataNode daemon, DHCP service

Warm regards
Gaël YIMEN YIMGA
Re: Working of combiner in hadoop
On Fri, Jul 4, 2014 at 10:59 AM, Chhaya Vishwakarma chhaya.vishwaka...@lntinfotech.com wrote:
Hi,
If I have two map tasks running on one node, and I have also written a combiner class, will the combiner be called once for each map task or just once for both map tasks? Can I write logic inside map which will work as a combiner? If yes, will there be any side effects?

Hi Chhaya,
Refer to the following URLs:
http://java.dzone.com/articles/designing-mapreduce-algorithms
http://isaacslavitt.com/2014/01/01/in-mapper-combiner-pattern-for-mapreduce/
http://alpinenow.com/blog/in-mapper-combiner/
Best regards
--
JAGANADH G
http://jaganadhg.in
ILUGCBE
http://ilugcbe.org.in
In-progress edit log from last run not being replayed on cluster (HA) restart
Hi All,
I am running Hadoop 2.4.0. I am trying to restart my HA cluster, but since there isn't a way to gracefully shut down the NN (AFAIK), I am running into a (sort of) race condition. A client issues a delete command and the NN successfully deletes the requested file (the in-progress edit logs across the JNs are updated and the DNs physically delete the blocks). But before the current in-progress edit log segment can be closed, the NN is stopped. When the NN is started again, it reads all edit logs from the JNs but does not consider the last in-progress edit log from the previous run. Because of this, the NN expects more blocks to be reported than the DNs actually have. Unfortunately this difference can sometimes be large enough (relative to dfs.namenode.safemode.threshold-pct) to leave the NN in safemode forever.

This problem looks generic to me. Can someone please confirm whether this is indeed a bug, or point out where I may be wrong (either in my process or in my understanding)? I modified the NN code to also read the in-progress edit log from the JNs and my problem was resolved, but I am not sure what implications this might have. Here is the code change I made:

diff --git a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
index e78153f..b864ec1 100644
--- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
@@ -623,7 +623,7 @@ private boolean loadFSImage(FSNamesystem target, StartupOption startOpt,
       }
       editStreams = editLog.selectInputStreams(
           imageFiles.get(0).getCheckpointTxId() + 1,
-          toAtLeastTxId, recovery, false);
+          toAtLeastTxId, recovery, true);
     } else {
       editStreams = FSImagePreTransactionalStorageInspector
           .getEditLogStreams(storage);

--
Regards
Nitin Goyal
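For reference, the safemode threshold mentioned above is configured in hdfs-site.xml. Lowering it can let the NN leave safemode with fewer reported blocks, but this is only a stopgap that masks the missing transactions rather than fixing the edit-log issue; the value shown here is purely illustrative:

```xml
<!-- hdfs-site.xml: fraction of blocks that must be reported by DataNodes
     before the NameNode leaves safemode automatically. The Hadoop default
     is 0.999f; the value below is only an illustration of a lowered
     threshold, not a recommendation. -->
<property>
  <name>dfs.namenode.safemode.threshold-pct</name>
  <value>0.99f</value>
</property>
```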
Re: Multi-Cluster Setup
hey Rahul,
thanks for pointing me to that page. It's definitely worth a read. Do both clusters need to be at least v2.3 for that? I was also digging a little further: there is the property fs.defaultFS, which might be the exact setting I was looking for. Unfortunately MapR restricts access to the CLDB and not directly to the NameNode, which makes this command useless for us right now (we have a lot of data in a MapR cluster, but want to access it another way).
Thanks everyone who helped here.
Cheers
Wolli

2014-07-03 18:33 GMT+02:00 Rahul Chaudhari rahulchaudhari0...@gmail.com:
Fabian, I see this as the classic case of federation of Hadoop clusters. The MR job can refer to a specific hdfs:// file location as input but at the same time run on another cluster. You can refer to the following link for further details on federation: http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/Federation.html
Regards,
Rahul Chaudhari

On Thu, Jul 3, 2014 at 9:06 PM, fab wol darkwoll...@gmail.com wrote:
Hey Nitin,
I'm not talking about it concept-wise; I'm talking about how to actually do it technically and how to set it up. Imagine this: I have two clusters, both running fine, and they are both (setup-wise) the same, except that one has far more TaskTrackers/NodeManagers than the other. Now I want to incorporate some data from the small cluster into an analysis on the big cluster. How could I access the data natively (just giving the job another HDFS folder as input)? In MapR I configure the specified file and then I have another folder in MapR-FS with all the content from the other cluster. Could I somehow have one NameNode look up another NameNode and incorporate all the uncommon files?
Cheers
Fabian

2014-07-03 17:09 GMT+02:00 Nitin Pawar nitinpawar...@gmail.com:
Nothing is stopping you from building the cluster the way you want. You can have storage-only nodes for HDFS and not run TaskTrackers on them. Start a bunch of machines with lots of RAM and CPU but no storage. The only thing to worry about then is the network bandwidth carrying data from HDFS to the tasks and back.

On Thu, Jul 3, 2014 at 8:29 PM, fab wol darkwoll...@gmail.com wrote:
hey everyone,
MapR offers the possibility to access another cluster's HDFS/MapR-FS from one cluster (e.g. a compute-only cluster without much storage capability; see http://doc.mapr.com/display/MapR/mapr-clusters.conf). In times of Hadoop-as-a-Service this becomes very interesting. Is this somehow possible with the normal Hadoop distributions (CDH and HDP, I'm looking at you ;-) ), or even without help from those distributors? Any hacks and tricks or even specific functions are welcome. If this is not possible, has anyone filed a ticket for it? Ticket number forwarding is also appreciated...
Cheers
Wolli
--
Nitin Pawar
--
Regards,
Rahul Chaudhari
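For what it's worth, with stock HDFS a job on one cluster can usually read another cluster's data simply by using fully qualified paths; fs.defaultFS only controls which filesystem unqualified paths resolve to. A sketch of the relevant config (host names and port are placeholders, not from the thread):

```xml
<!-- core-site.xml on the compute cluster. fs.defaultFS sets the default
     filesystem for unqualified paths; a job can still read from another
     cluster directly by giving a fully qualified input path such as
     hdfs://storage-nn.example.com:8020/data/input
     (both host names here are hypothetical). -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://compute-nn.example.com:8020</value>
</property>
```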
Streaming data - Available tools
Hello Experts,
I wanted to explore the available tools in the market for streaming data. I know Apache Spark exists. Are there any other tools available?
Regards,
Santosh Karthikeyan
Thank you, and what advice would you give me on running my first Hadoop cluster-based job?
Hi,
Over the past two weeks, from a standing start, I've worked on a Hadoop-based parallel genetic sequence alignment algorithm as part of my university master's project. Thankfully that's now up and running; along the way I got some great help from members of this group, and I deeply appreciate that strangers would take time out of their busy lives to shed a bit of light on what seemed at times an insurmountable task. On Monday I get to play with a 32-node system, and the only advice I have so far is to benchmark my algorithm with 5 GB per node. I wonder: if you were starting out again on your first big Hadoop MapReduce job, what would you do differently? What advice would you give me starting out?
Thanks again, I really appreciate your support.
Best
Chris

Regards,
Chris MacKenzie
http://www.chrismackenziephotography.co.uk/
http://plus.google.com/+ChrismackenziephotographyCoUk/posts
http://www.linkedin.com/in/chrismackenziephotography/
Re: Streaming data - Available tools
Storm. It's not a part of the Apache project but it seems to be what people are using to process event data.
B.

From: santosh.viswanat...@accenture.com
Sent: Friday, July 04, 2014 11:25 AM
To: user@hadoop.apache.org
Subject: Streaming data - Available tools
Hello Experts,
I wanted to explore the available tools in the market for streaming data. I know Apache Spark exists. Are there any other tools available?
Regards,
Santosh Karthikeyan
Re: Streaming data - Available tools
Storm is in fact another project sponsored by the ASF. Look here: http://storm.apache.org

On 04/07/14 12:28, Adaryl Bob Wakefield, MBA wrote:
Storm. It's not a part of the Apache project but it seems to be what people are using to process event data.
B.
From: santosh.viswanat...@accenture.com
Sent: Friday, July 04, 2014 11:25 AM
To: user@hadoop.apache.org
Subject: Streaming data - Available tools
Hello Experts,
I wanted to explore the available tools in the market for streaming data. I know Apache Spark exists. Are there any other tools available?
Regards,
Santosh Karthikeyan
--
Marcos Ortiz
http://www.linkedin.com/in/mlortiz
(@marcosluis2186)
http://about.me/marcosortiz
VII International Summer School at UCI, June 30 to July 11, 2014. See www.uci.cu
Re: Streaming data - Available tools
My information is out of date. It looks like it's a full-on incubator project now. Here is a working link: https://storm.incubator.apache.org/
B.

From: Marcos Ortiz
Sent: Friday, July 04, 2014 11:31 AM
To: user@hadoop.apache.org
Subject: Re: Streaming data - Available tools
Storm is another project sponsored by the ASF. Look here: http://storm.apache.org
--
Marcos Ortiz (@marcosluis2186)
http://about.me/marcosortiz
Re: Streaming data - Available tools
Try Storm + Esper: http://tomdzk.wordpress.com/2011/09/28/storm-esper/

On Fri, Jul 4, 2014 at 12:38 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote:
My information is out of date. It looks like it's a full-on incubator project now. Here is a working link: https://storm.incubator.apache.org/
B.
From: Marcos Ortiz mlor...@uci.cu
Sent: Friday, July 04, 2014 11:31 AM
To: user@hadoop.apache.org
Subject: Re: Streaming data - Available tools
Storm is another project sponsored by the ASF. Look here: http://storm.apache.org
PageRank in Hadoop
I want to run a PageRank job in Hadoop. I know that there is a Pegasus implementation of PageRank. How do I submit the job to Hadoop to run the PageRank algorithm? I also want to know whether I have to supply the code myself.
Thank You
--
Whether you think you can or you cannot, either way you are right.
With Regards...
Deep
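As a point of reference for what such a job computes: PageRank is repeated sparse matrix-vector multiplication, which is the part a MapReduce implementation like Pegasus distributes across the cluster. A tiny single-machine power-iteration sketch on a toy graph (illustrative only, not Pegasus code):

```python
# Power-iteration PageRank on a toy in-memory graph. Every node that
# appears as a neighbour is assumed to also be a key of `links`.
def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping node -> list of out-neighbours."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}          # start uniform
    for _ in range(iters):
        # teleport term: (1 - d) / n to every node
        new = {u: (1.0 - damping) / n for u in nodes}
        for u, outs in links.items():
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:                  # u passes rank to its links
                    new[v] += share
            else:                               # dangling node: spread evenly
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank
```

Submission of a distributed version is typically done with the `hadoop jar` command and the jar that the implementation ships; the exact driver class and arguments for Pegasus are in its own documentation, so you should not need to write the algorithm yourself, only invoke it on your input.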