Re: Working of combiner in hadoop

2014-07-04 Thread Chris Mawata
The key/value pairs are processed by the mapper independently of each
other. The combiner logic deals with the accumulated output of multiple
key/value pairs, so that logic cannot be in the map method.
 On Jul 4, 2014 1:29 AM, Chhaya Vishwakarma 
chhaya.vishwaka...@lntinfotech.com wrote:

  Hi,



 If I have two map tasks running on one node, and I have also written a
 combiner class:

 Will the combiner be called once for each map task, or just once for both
 map tasks?



 Can I write logic inside the map method that works as a combiner? If yes,
 will there be any side effects?



 Regards,

 Chhaya Vishwakarma



 --
 The contents of this e-mail and any attachment(s) may contain confidential
 or privileged information for the intended recipient(s). Unintended
 recipients are prohibited from taking action on the basis of information in
 this e-mail and using or disseminating the information, and must notify the
 sender and delete it from their system. LT Infotech will not accept
 responsibility or liability for the accuracy or completeness of, or the
 presence of any virus or disabling code in this e-mail



Re: Need to evaluate the price of a Hadoop cluster

2014-07-04 Thread Chris Mawata
Some comments:
Three drives of 1 TB each will be better than one 3 TB drive.

On a small cluster you cannot afford to reserve a whole machine for each
master daemon. The NameNode and JobTracker will have to cohabit with
DataNodes and TaskTrackers.

As for pricing: if it is for an institution, you should visit a few vendor
websites. If it is for yourself, try eBay.

You should add networking hardware to your budget.

Cheers
On Jul 3, 2014 11:19 AM, YIMEN YIMGA Gael gael.yimen-yi...@sgcib.com
wrote:

 Hello Dear all,



 I would like to evaluate the price of a Hadoop cluster using the below
 characteristics for my Namenode and for my Datanode.



 My cluster should have one Namenode and three Datanodes.



 Could someone help me with the price of commodity hardware with these
 characteristics, please ?



 Standing by …



 *NAMENODE*



 *Model* : xxx

 *CPU* : 2 CPUs @ 2 GHz

 *RAM* : 14 GB

 *HD*: 1 TB

 *OS*: RHEL or Debian



 *Content** :*

 RHEL 6 or Debian 7.5

 ssh daemon

 Apache 2.4

 job tracker daemon

 namenode daemon

 dhcp service


 *DATANODE*



 *Model* : xxx

 *CPU* : 1 CPU @ 3 GHz

 *RAM* : 16 GB

 *HD*: 3 TB

 *OS*: RHEL or Debian



 *Content** :*

 RHEL 6 or Debian 7.5

 ssh daemon

 Apache 2.4

 task tracker daemon

 datanode daemon

 dhcp service

 Warm regards



 *---*

 *Gaël YIMEN YIMGA*



 *Stagiaire - GBIS*

 *ITEC/CSY/SAT*



 *Tour CB 3 - S 04 025/ 58*

 *170, place Henri Regnault*

 *Paris - La Défense 6*



 *
 This message and any attachments (the message) are confidential,
 intended solely for the addressee(s), and may contain legally privileged
 information.
 Any unauthorised use or dissemination is prohibited. E-mails are
 susceptible to alteration.
 Neither SOCIETE GENERALE nor any of its subsidiaries or affiliates shall
 be liable for the message if altered, changed or
 falsified.
 Please visit http://swapdisclosure.sgcib.com for important information
 with respect to derivative products.
   
 *



Re: Working of combiner in hadoop

2014-07-04 Thread JAGANADH G
On Fri, Jul 4, 2014 at 10:59 AM, Chhaya Vishwakarma 
chhaya.vishwaka...@lntinfotech.com wrote:

Hi Chhaya,

Refer to the following URLs:
http://java.dzone.com/articles/designing-mapreduce-algorithms
http://isaacslavitt.com/2014/01/01/in-mapper-combiner-pattern-for-mapreduce/
http://alpinenow.com/blog/in-mapper-combiner/
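The in-mapper combining pattern those links describe can be sketched in plain Java. This is a hypothetical illustration outside Hadoop: in a real job the accumulation would live in a Mapper subclass, with the aggregated pairs written out from the cleanup() hook rather than returned.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of in-mapper combining: instead of emitting (word, 1) for every
// token and relying on a separate Combiner class, the mapper accumulates
// partial counts in task-level state and emits them once at the end.
public class InMapperCombiner {

    private final Map<String, Integer> partialCounts = new HashMap<>();

    // Stands in for Mapper.map(): called once per input record.
    public void map(String line) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                partialCounts.merge(word, 1, Integer::sum);
            }
        }
    }

    // Stands in for Mapper.cleanup(): emit each aggregated pair once.
    public Map<String, Integer> emit() {
        return partialCounts;
    }
}
```

The trade-off is memory: the task must hold one entry per distinct key it sees, whereas a regular combiner works on buffered map output at spill time.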

Best regards
 --
JAGANADH G
http://jaganadhg.in
*ILUGCBE*
http://ilugcbe.org.in


In-progress edit log from last run not being replayed in case of a cluster (HA) restart

2014-07-04 Thread Nitin Goyal
Hi All,

I am running Hadoop 2.4.0 and trying to restart my HA cluster, but since
there isn't a way to gracefully shut down the NN (AFAIK), I am running into
a (sort of) race condition. A client issues a delete command and the NN
successfully deletes the requested file (the in-progress edit logs across
the NN and JNs are updated and the DNs physically delete the blocks). But
before the current in-progress edit log segment can be finalized, the NN is
stopped. When the NN is started again, it reads all edit logs from the JNs
but does not consider the last in-progress edit log from the previous run.
Because of this, the NN expects more blocks to be reported than the DNs
actually have. Unfortunately this difference can sometimes be large enough
(relative to dfs.namenode.safemode.threshold-pct) to leave the NN in
safemode forever.

This problem looks generic to me. Can someone please confirm whether this
is indeed a bug, or point out where I may be wrong (either in my process or
my understanding)?
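The "stuck in safemode" outcome follows from a simple threshold check. A rough sketch of the condition (a simplification for illustration, not the actual NameNode code): the NN leaves safemode only once the fraction of expected blocks that DataNodes have reported reaches dfs.namenode.safemode.threshold-pct (0.999 by default), so if replay inflates the expected count with already-deleted blocks, the fraction can stay below the threshold indefinitely.

```java
// Simplified sketch of the safemode exit condition. With the default
// threshold of 0.999, a cluster expecting 1,000,000 blocks tolerates
// at most 1,000 unreported (e.g. already-deleted) blocks.
public class SafemodeCheck {
    public static boolean canLeaveSafemode(long expectedBlocks,
                                           long reportedBlocks,
                                           double thresholdPct) {
        if (expectedBlocks == 0) {
            return true; // nothing to wait for
        }
        return (double) reportedBlocks / expectedBlocks >= thresholdPct;
    }
}
```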


I modified the NN code to also read the in-progress edit log from the JNs,
and my problem was resolved. But I am not sure what implications this might
have. Here is the code change I made:

diff --git
a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImag
index e78153f..b864ec1 100644
---
a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
+++
b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSImage.java
@@ -623,7 +623,7 @@ private boolean loadFSImage(FSNamesystem target,
StartupOption startOpt,
   }
   editStreams = editLog.selectInputStreams(
   imageFiles.get(0).getCheckpointTxId() + 1,
-  toAtLeastTxId, recovery, false);
+  toAtLeastTxId, recovery, true);
 } else {
   editStreams = FSImagePreTransactionalStorageInspector
 .getEditLogStreams(storage);

-- 
Regards
Nitin Goyal


Re: Multi-Cluster Setup

2014-07-04 Thread fab wol
hey Rahul,

thanks for pointing me to that page. It's definitely worth a read. Do both
clusters need to be at least v2.3 for that?

I was also digging a little further. There is the property fs.defaultFS,
which might be the exact setting I was actually looking for.
Unfortunately MapR restricts access to the CLDB rather than exposing the
Namenode directly, which makes this setting useless for us right now (we
have a lot of data in a MapR cluster, but want to access it in another way).
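For plain Hadoop clusters, where the Namenode RPC endpoint is reachable, the fs.defaultFS idea would look roughly like this in core-site.xml. The hostname and port below are placeholders, not real endpoints:

```xml
<!-- core-site.xml sketch: make clients resolve unqualified paths against
     a remote cluster's Namenode. Placeholder host/port. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://remote-nn.example.com:8020</value>
  </property>
</configuration>
```

Alternatively, a job can address the other cluster without changing the default by using a fully qualified input path such as hdfs://remote-nn.example.com:8020/data/input.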

Thanks everyone, who helped here.

Cheers
Wolli


2014-07-03 18:33 GMT+02:00 Rahul Chaudhari rahulchaudhari0...@gmail.com:

 Fabian,
I see this as a classic case of federation of Hadoop clusters. The MR
 job can refer to a specific hdfs:// file location as input while at the
 same time running on another cluster.
 You can refer to the following link for further details on federation.


 http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/Federation.html

 Regards,
 Rahul Chaudhari


 On Thu, Jul 3, 2014 at 9:06 PM, fab wol darkwoll...@gmail.com wrote:

 Hey Nitin,

 I'm not talking about the concept. I'm talking about how to actually do
 it technically and how to set it up. Imagine this: I have two clusters,
 both running fine, and they are both (setup-wise) the same, except that one
 has far more tasktrackers/Nodemanagers than the other. Now I want to
 incorporate some data from the small cluster in the analysis on the big
 cluster. How could I access the data natively (just giving the job
 another HDFS folder as input)? In MapR I configure the specified file and
 then I have another folder in the MapRFS with all the content from the
 other cluster ... Could I somehow tell one Namenode to look up another
 Namenode and incorporate all the uncommon files?

 Cheers
 Fabian


 2014-07-03 17:09 GMT+02:00 Nitin Pawar nitinpawar...@gmail.com:

 Nothing is stopping you from implementing the cluster the way you want.
 You can have storage-only nodes for your HDFS and not run
 tasktrackers on them.

 Start a bunch of machines with high RAM and high CPU but no storage.

 The only thing to worry about then would be the network bandwidth to
 carry data from HDFS to the tasks and back to HDFS.


 On Thu, Jul 3, 2014 at 8:29 PM, fab wol darkwoll...@gmail.com wrote:

 hey everyone,

 MapR offers the possibility to access another cluster's HDFS/MapRFS from
 one cluster (e.g. a compute-only cluster without much storage capability);
 see http://doc.mapr.com/display/MapR/mapr-clusters.conf.
 In times of Hadoop-as-a-Service this becomes very interesting. Is this
 somehow possible with the normal Hadoop distributions (CDH and
 HDP, I'm looking at you ;-) ), or even without help from those
 distributors? Any hacks and tricks or even specific functions are welcome.
 If this is not possible, has anyone filed this as a ticket or
 something? Ticket number forwarding is also appreciated ...

 Cheers
 Wolli




 --
 Nitin Pawar





 --
 Regards,
 Rahul Chaudhari



Streaming data - Available tools

2014-07-04 Thread santosh.viswanathan
Hello Experts,

I wanted to explore the tools available in the market for streaming data. I know
Apache Spark exists. Are there any other tools available?


Regards,
Santosh Karthikeyan



This message is for the designated recipient only and may contain privileged, 
proprietary, or otherwise confidential information. If you have received it in 
error, please notify the sender immediately and delete the original. Any other 
use of the e-mail by you is prohibited. Where allowed by local law, electronic 
communications with Accenture and its affiliates, including e-mail and instant 
messaging (including content), may be scanned by our systems for the purposes 
of information security and assessment of internal compliance with Accenture 
policy.
__

www.accenture.com


Thank you, and what advice would you give me on running my first Hadoop cluster-based job?

2014-07-04 Thread Chris MacKenzie
Hi,

Over the past two weeks, from a standing start, I've worked on a Hadoop-based
parallel genetic sequence alignment algorithm as part of my university
masters project.

Thankfully that's now up and running. Along the way I got some great help
from members of this group, and I deeply appreciate that strangers would
take time out of their busy lives to shed a bit of light on what seemed at
times an insurmountable task.

On Monday I get to play with a 32-node system, and the only advice I have
so far is to benchmark my algorithm with 5 GB per node.

I wonder: if you were starting out again on your first big Hadoop MapReduce
job, what would you do differently? What advice would you give me starting
out?

Thanks again, I really appreciate your support.

Best Chris


Regards,

Chris MacKenzie
http://www.chrismackenziephotography.co.uk/
http://plus.google.com/+ChrismackenziephotographyCoUk/posts
http://www.linkedin.com/in/chrismackenziephotography/




Re: Streaming data - Available tools

2014-07-04 Thread Adaryl Bob Wakefield, MBA
Storm. It’s not a part of the Apache project but it seems to be what people are 
using to process event data.

B.

From: santosh.viswanat...@accenture.com 
Sent: Friday, July 04, 2014 11:25 AM
To: user@hadoop.apache.org 
Subject: Streaming data - Available tools

Hello Experts,

 

Wanted to explore the available tools in the market on streaming data. I know 
Apache Spark exists. Are there any other tools available?

 

 

Regards,
Santosh Karthikeyan







Re: Streaming data - Available tools

2014-07-04 Thread Marcos Ortiz

Storm is another project sponsored by the ASF. Look here:
http://storm.apache.org

On 04/07/14 12:28, Adaryl Bob Wakefield, MBA wrote:
Storm. It’s not a part of the Apache project but it seems to be what 
people are using to process event data.

B.
*From:* santosh.viswanat...@accenture.com 
mailto:santosh.viswanat...@accenture.com

*Sent:* Friday, July 04, 2014 11:25 AM
*To:* user@hadoop.apache.org mailto:user@hadoop.apache.org
*Subject:* Streaming data - Available tools

Hello Experts,

Wanted to explore the available tools in the market on streaming data. 
I know Apache Spark exists. Are there any other tools available?


Regards,
Santosh Karthikeyan






--
Marcos Ortiz (@marcosluis2186)
http://www.linkedin.com/in/mlortiz
http://twitter.com/marcosluis2186
http://about.me/marcosortiz

VII Escuela Internacional de Verano en la UCI del 30 de junio al 11 de julio de 
2014. Ver www.uci.cu



Re: Streaming data - Avaiable tools

2014-07-04 Thread Adaryl Bob Wakefield, MBA
My information is out of date. It looks like it's a full-on incubator
project now. Here is a working link:
https://storm.incubator.apache.org/
B.
From: Marcos Ortiz 
Sent: Friday, July 04, 2014 11:31 AM
To: user@hadoop.apache.org 
Subject: Re: Streaming data - Available tools

Storm is another project sponsored by ASF. Look here:
http://storm.apache.org


On 04/07/14 12:28, Adaryl Bob Wakefield, MBA wrote:

  Storm. It’s not a part of the Apache project but it seems to be what people 
are using to process event data.

  B.

  From: santosh.viswanat...@accenture.com 
  Sent: Friday, July 04, 2014 11:25 AM
  To: user@hadoop.apache.org 
  Subject: Streaming data - Available tools

  Hello Experts,

   

  Wanted to explore the available tools in the market on streaming data. I know 
Apache Spark exists. Are there any other tools available?

   

   

  Regards,
  Santosh Karthikeyan






-- 
Marcos Ortiz (@marcosluis2186) 
http://about.me/marcosortiz







Re: Streaming data - Available tools

2014-07-04 Thread Cristóbal Giadach
Try Storm + Esper

http://tomdzk.wordpress.com/2011/09/28/storm-esper/


On Fri, Jul 4, 2014 at 12:38 PM, Adaryl Bob Wakefield, MBA 
adaryl.wakefi...@hotmail.com wrote:

   My information is out of date. It looks like it's a full on incubator
 project now. Here is a working link:
 https://storm.incubator.apache.org/
  B.
   *From:* Marcos Ortiz mlor...@uci.cu
 *Sent:* Friday, July 04, 2014 11:31 AM
 *To:* user@hadoop.apache.org
 *Subject:* Re: Streaming data - Available tools

 Storm is another project sponsored by ASF. Look here:
 http://storm.apache.org

 On 04/07/14 12:28, Adaryl Bob Wakefield, MBA wrote:

  Storm. It's not a part of the Apache project but it seems to be what
 people are using to process event data.

 B.

  *From:* santosh.viswanat...@accenture.com
 *Sent:* Friday, July 04, 2014 11:25 AM
 *To:* user@hadoop.apache.org
 *Subject:* Streaming data - Available tools


 Hello Experts,



 Wanted to explore the available tools in the market on streaming data. I
 know Apache Spark exists. Are there any other tools available?





 Regards,
 Santosh Karthikeyan




Pagerank In Hadoop

2014-07-04 Thread Deep Pradhan
I want to run a PageRank job on Hadoop. I know that there is a Pegasus
implementation of PageRank. How do I submit the job to Hadoop to run the
PageRank algorithm? I also want to know whether I have to supply the code
myself.
Thank You
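For context, the core computation a Pegasus/Hadoop PageRank job distributes across map and reduce stages is repeated sparse matrix-vector multiplication over the link graph. A minimal single-machine sketch of one iteration (a hypothetical illustration that ignores dangling nodes; the damping factor 0.85 is the usual choice):

```java
// One power-iteration step of PageRank over an adjacency list.
// outLinks[p] lists the pages that page p links to.
public class PageRank {
    public static double[] iterate(int[][] outLinks, double[] rank, double damping) {
        int n = rank.length;
        double[] next = new double[n];
        // Each page distributes its current rank evenly across its out-links.
        for (int p = 0; p < n; p++) {
            for (int target : outLinks[p]) {
                next[target] += rank[p] / outLinks[p].length;
            }
        }
        // Mix in the (1 - d)/n teleport mass with the damped link mass.
        for (int p = 0; p < n; p++) {
            next[p] = (1 - damping) / n + damping * next[p];
        }
        return next;
    }
}
```

A Hadoop job repeats this step until the rank vector converges, with the map phase emitting (target, contribution) pairs and the reduce phase summing them.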

-- 
*Whether you think you can or you cannot, either way you are right*
With Regards...
Deep