Re: Loading Data to HDFS

2012-10-30 Thread M. C. Srivas
Loading a petabyte from a single machine will take you about 4 months,
assuming you can push 100MB/s (1 GigE) continuously for 24 hrs/day over
those 4 months. Any interruptions and the 4 months will become 6 months.

You might want to consider a more parallel solution instead of a single
gateway machine.
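For what it's worth, the back-of-the-envelope arithmetic behind that estimate:

  1 PB / 100 MB/s  =  10^15 bytes / 10^8 bytes/s  =  10^7 s  ≈  116 days  ≈  4 months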


On Tue, Oct 30, 2012 at 3:07 AM, sumit ghosh sumi...@yahoo.com wrote:

 Hi,

 I have data on a remote machine accessible over ssh. I have Hadoop CDH4
 installed on RHEL. I am planning to load quite a few petabytes of data onto
 HDFS.

 Which will be the fastest method to use and are there any projects around
 Hadoop which can be used as well?


 I cannot install Hadoop-Client on the remote machine.

 Have a great Day Ahead!
 Sumit.


 ---
 Here I am attaching my previous discussion on CDH-user to avoid
 duplication.
 ---
 On Wed, Oct 24, 2012 at 9:29 PM, Alejandro Abdelnur t...@cloudera.com
 wrote:
 In addition to Jarcec's suggestions, you could use HttpFS. Then you'd only
 need to poke a single host:port in your firewall, as all the traffic goes
 through it.
 thx
 Alejandro
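For reference, a rough sketch of what an upload through HttpFS can look like
from a plain JVM on the remote machine, with no Hadoop client installed. The
host, port, user and paths below are placeholders (14000 is the usual HttpFS
port), and the two-step PUT follows the WebHDFS REST API that HttpFS exposes:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WebHdfsUpload {
  public static void main(String[] args) throws Exception {
    // Single firewall hole: the HttpFS (or WebHDFS) host and port.
    String create = "http://httpfs-host:14000/webhdfs/v1/user/sumit/data.bin"
                  + "?op=CREATE&user.name=sumit&overwrite=true";

    // Step 1: issue the CREATE; the server answers with a 307 redirect
    // whose Location header is the URL that will accept the bytes.
    HttpURLConnection c1 = (HttpURLConnection) new URL(create).openConnection();
    c1.setRequestMethod("PUT");
    c1.setInstanceFollowRedirects(false);
    String location = c1.getHeaderField("Location");
    c1.disconnect();

    // Step 2: stream the local file to the redirect target.
    HttpURLConnection c2 = (HttpURLConnection) new URL(location).openConnection();
    c2.setRequestMethod("PUT");
    c2.setDoOutput(true);
    c2.setRequestProperty("Content-Type", "application/octet-stream");
    OutputStream out = c2.getOutputStream();
    Files.copy(Paths.get("/local/path/data.bin"), out);
    out.close();
    System.out.println("HTTP " + c2.getResponseCode()); // expect 201 Created
  }
}

Running several such copies in parallel from multiple source machines is the
kind of parallel load suggested above.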

 On Oct 24, 2012, at 8:28 AM, Jarek Jarcec Cecho jar...@cloudera.com
 wrote:
  Hi Sumit,
  There are plenty of ways to achieve that. Please find my feedback
 below:
 
  Does Sqoop support loading flat files to HDFS?
 
  No, Sqoop only supports moving data from external database and
 warehouse systems. Copying files is not supported at the moment.
 
  Can I use distcp?
 
  No. DistCp can be used only to copy data between HDFS filesystems.
 
  How do we use the core-site.xml file on the remote machine to use
  copyFromLocal?
 
  Yes, you can install the Hadoop binaries on your machine (with no Hadoop
 services running) and use the hadoop binary to upload data. The installation
 procedure is described in the CDH4 installation guide [1] (follow the client
 installation).
 
  Another way that I can think of is leveraging WebHDFS [2] or maybe
 hdfs-fuse [3]?
 
  Jarcec
 
  Links:
  1: https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation
  2:
 https://ccp.cloudera.com/display/CDH4DOC/Deploying+HDFS+on+a+Cluster#DeployingHDFSonaCluster-EnablingWebHDFS
  3: https://ccp.cloudera.com/display/CDH4DOC/Mountable+HDFS
 
  On Wed, Oct 24, 2012 at 01:33:29AM -0700, Sumit Ghosh wrote:
 
 
  Hi,
 
  I have data on a remote machine accessible over ssh. What is the fastest
  way to load the data onto HDFS?
 
  Does Sqoop support loading flat files to HDFS?
  Can I use distcp?
  How do we use the core-site.xml file on the remote machine to use
  copyFromLocal?
 
  Which will be the best to use and are there any other open source
 projects
  around Hadoop which can be used as well?
  Have a great Day Ahead!
  Sumit


Re: Restricting the number of slave nodes used for a given job (regardless of the # of map/reduce tasks involved)

2012-09-22 Thread M. C. Srivas
What you are asking for (and much more sophisticated slicing/dicing of
the cluster) is possible with MapR's distro. Please contact me offline if
you are interested, or try it for yourself at www.mapr.com/download

On Mon, Sep 10, 2012 at 2:06 AM, Safdar Kureishy
safdar.kurei...@gmail.com wrote:

 Hi,

 I need to run some benchmarking tests for a given MapReduce job on a
 *subset* of a 10-node Hadoop cluster. Not that it matters, but the current
 cluster settings allow for ~20 map slots and 10 reduce slots per node.

 Without loss of generality, let's say I want a job with these
 constraints below:
 - to use only *5* out of the 10 nodes for running the mappers,
 - to use only *5* out of the 10 nodes for running the reducers.

 Is there any other way of achieving this through Hadoop property overrides
 during job-submission time? I understand that the Fair Scheduler can
 potentially be used to create pools of a proportionate # of mappers and
 reducers, to achieve a similar outcome, but the problem is that I still
 cannot tie such a pool to a fixed # of machines (right?). Essentially,
 regardless of the # of map/reduce tasks involved, I only want a *fixed # of
 machines* to handle the job.
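For reference, a minimal Fair Scheduler allocation-file sketch (MR1-era
syntax; the pool name and numbers are made up to roughly match 5 nodes at
~20 map / 10 reduce slots each). Note that it caps a pool's concurrent
slots, not which machines run them:

<?xml version="1.0"?>
<allocations>
  <pool name="benchmark">
    <!-- at most 100 concurrent map slots and 50 reduce slots -->
    <maxMaps>100</maxMaps>
    <maxReduces>50</maxReduces>
  </pool>
</allocations>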

 Any tips on how I can go about achieving this?

 Thanks,
 Safdar



Re: Ideal file size

2012-06-07 Thread M. C. Srivas
On Wed, Jun 6, 2012 at 10:14 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

 On Wed, Jun 6, 2012 at 9:48 AM, M. C. Srivas mcsri...@gmail.com wrote:

  There are many more factors to consider than just the size of the file. How
  long can you wait before you *have to* process the data?  5 minutes? 5 hours?
  5 days?  If you want good timeliness, you need to roll over faster.  The
  longer you wait:
 
  1.  the lesser the load on the NN.
  2.  but the poorer the timeliness
  3.  and the larger chance of lost data  (ie, the data is not saved until
  the file is closed and rolled over, unless you want to sync() after every
  write)
 
  To begin with, I was going to use Flume and specify a rollover file size. I
 understand the above parameters; I just want to ensure that too many small
 files don't cause problems on the NameNode. For instance, there would be
 times when we get GBs of data in an hour and at times only a few hundred MB.
 From what Harsh, Edward and you've described, it doesn't cause issues with
 the NameNode but rather an increase in processing times if there are too many
 small files. Looks like I need to find that balance.

 It would also be interesting to see how others solve this problem when not
 using Flume.
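For reference, the roll-over knobs being discussed, as a hypothetical Flume NG
HDFS-sink snippet (the agent and sink names are placeholders):

# Roll the current HDFS file hourly or at ~128 MB, whichever comes first;
# rollCount = 0 disables rolling on event count.
agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/events
agent.sinks.hdfs1.hdfs.rollInterval = 3600
agent.sinks.hdfs1.hdfs.rollSize = 134217728
agent.sinks.hdfs1.hdfs.rollCount = 0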



They use NFS with MapR.

Any and all log rotators (like the one in log4j) just work over NFS,
and MapR does not have a NN, so the problems with small files or large numbers
of files do not exist.





 
 
  On Wed, Jun 6, 2012 at 7:00 AM, Mohit Anchlia mohitanch...@gmail.com
  wrote:
 
   We have a continuous flow of data into a sequence file. I am wondering
   what would be the ideal file size before the file gets rolled over. I know
   too many small files are not good, but could someone tell me what would be
   the ideal size such that it doesn't overload the NameNode.
  
 



Re: Ideal file size

2012-06-06 Thread M. C. Srivas
There are many more factors to consider than just the size of the file. How
long can you wait before you *have to* process the data?  5 minutes? 5 hours?
5 days?  If you want good timeliness, you need to roll over faster.  The
longer you wait:

1.  the lesser the load on the NN.
2.  but the poorer the timeliness
3.  and the larger chance of lost data  (ie, the data is not saved until
the file is closed and rolled over, unless you want to sync() after every
write)
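The per-write durability call mentioned in point 3 looks roughly like this
with the Hadoop 2.x / CDH4 client API (the path is a placeholder; older 0.20
releases expose the same thing as FSDataOutputStream.sync()):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DurableAppend {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataOutputStream out = fs.create(new Path("/logs/current"));
    byte[] record = "one event\n".getBytes("UTF-8");
    out.write(record);
    out.hflush();   // make the bytes visible to readers without closing the file
    // out.hsync(); // additionally force them to disk on the datanodes (costlier)
    out.close();
  }
}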



On Wed, Jun 6, 2012 at 7:00 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

 We have a continuous flow of data into a sequence file. I am wondering what
 would be the ideal file size before the file gets rolled over. I know too many
 small files are not good, but could someone tell me what would be the ideal
 size such that it doesn't overload the NameNode.



Re: Feedback on real world production experience with Flume

2012-04-21 Thread M. C. Srivas
Karl,

 Since you did ask for alternatives, people using MapR prefer to use the
NFS access to deposit data directly (or to access it). It works seamlessly from
all Linuxes, Solaris, Windows, AIX and a myriad of other legacy systems
without having to load any agents on those machines. And it is fully
automatic HA.

Since compression is built into MapR, the data gets compressed automatically
as it comes in over NFS, without much fuss.

With regard to performance, you can get about 870 MB/s per node if you have
10GigE attached (of course, with compression, the effective throughput will
surpass that, depending on how well the data can be squeezed).


On Fri, Apr 20, 2012 at 3:14 PM, Karl Hennig khen...@baynote.com wrote:

 I am investigating automated methods of moving our data from the web tier
 into HDFS for processing, a process that's performed periodically.

 I am looking for feedback from anyone who has actually used Flume in a
 production setup (redundant, failover) successfully.  I understand it is
 now being largely rearchitected during its incubation as Apache Flume-NG,
 so I don't have full confidence in the old, stable releases.

 The other option would be to write our own tools.  What methods are you
 using for these kinds of tasks?  Did you write your own or does Flume (or
 something else) work for you?

 I'm also on the Flume mailing list, but I wanted to ask these questions
 here because I'm interested in Flume _and_ alternatives.

 Thank you!




Re: NameNode per-block memory usage?

2012-01-17 Thread M. C. Srivas
Konstantin's paper
http://www.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf

mentions that on average a file consumes about 600 bytes of memory in the
name-node (1 file object + 2 block objects).

To quote from his paper (see page 9)

.. in order to store 100 million files (referencing 200 million blocks) a
name-node should have at least 60GB of RAM. This matches observations on
deployed clusters.
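The arithmetic behind that figure, using the paper's per-file estimate:

  100,000,000 files x ~600 bytes/file (1 file object + 2 block objects) ≈ 60 GB of NN heap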



On Tue, Jan 17, 2012 at 7:08 AM, Otis Gospodnetic 
otis_gospodne...@yahoo.com wrote:

 Hello,

 How much memory/JVM heap does NameNode use for each block?

 I've tried locating this in the FAQ and on search-hadoop.com, but
 couldn't find a ton of concrete numbers, just these two:

 http://search-hadoop.com/m/RmxWMVyVvK1 - 150 bytes/block?
 http://search-hadoop.com/m/O886P1VyVvK1 - 1 GB heap for every object?

 Thanks,
 Otis



Re: HDFS Backup nodes

2011-12-13 Thread M. C. Srivas
Suresh,

As of today, there is no option except to use NFS.  And as you yourself
mention, the first HA prototype when it comes out will require NFS.

(a) I wasn't aware that BookKeeper had progressed that far. I wonder
whether it would be able to keep up with the data rates that are required in
order to hold the NN log without falling behind.

(b) I do know Karthik Ranga at FB just started a design to put the NN data
in HDFS itself, but that is in very preliminary design stages with no real
code there.

The problem is that the HA code written with NFS in mind is very different
from the HA code written with HDFS in mind, which are both quite different
from the code that is written with Bookkeeper in mind. Essentially the
three options will form three different implementations, since the failure
modes of each of the back-ends are different. Am I totally off base?

thanks,
Srivas.




On Tue, Dec 13, 2011 at 11:00 AM, Suresh Srinivas sur...@hortonworks.com wrote:

 Srivas,

 As you may know already, NFS is just being used in the first prototype for
 HA.

 Two options for editlog store are:
 1. Using BookKeeper. Work has already completed on trunk towards this. This
 will replace need for NFS to  store the editlogs and is highly available.
 This solution will also be used for HA.
 2. We have a short term goal also to enable editlogs going to HDFS itself.
 The work is in progress.

 Regards,
 Suresh


 
  -- Forwarded message --
  From: M. C. Srivas mcsri...@gmail.com
  Date: Sun, Dec 11, 2011 at 10:47 PM
  Subject: Re: HDFS Backup nodes
  To: common-user@hadoop.apache.org
 
 
  You are out of luck if you don't want to use NFS, and yet want redundancy
  for the NN.  Even the new NN HA work being done by the community will
  require NFS ... and the NFS itself needs to be HA.
 
  But if you use a Netapp, then the likelihood of the Netapp crashing is
  lower than the likelihood of a garbage-collection-of-death happening in
 the
  NN.
 
  [ disclaimer:  I don't work for Netapp, I work for MapR ]
 
 
  On Wed, Dec 7, 2011 at 4:30 PM, randy randy...@comcast.net wrote:
 
   Thanks Joey. We've had enough problems with nfs (mainly under very high
   load) that we thought it might be riskier to use it for the NN.
  
   randy
  
  
   On 12/07/2011 06:46 PM, Joey Echeverria wrote:
  
   Hey Rand,
  
   It will mark that storage directory as failed and ignore it from then
   on. In order to do this correctly, you need a couple of options
   enabled on the NFS mount to make sure that it doesn't retry
    infinitely. I usually run with the tcp,soft,intr,timeo=10,retrans=10
   options set.
  
   -Joey
  
    On Wed, Dec 7, 2011 at 12:37 PM, randy...@comcast.net wrote:
  
   What happens then if the nfs server fails or isn't reachable? Does
 hdfs
   lock up? Does it gracefully ignore the nfs copy?
  
   Thanks,
   randy
  
   - Original Message -
    From: Joey Echeverria j...@cloudera.com
   To: common-user@hadoop.apache.org
   Sent: Wednesday, December 7, 2011 6:07:58 AM
   Subject: Re: HDFS Backup nodes
  
    You should also configure the NameNode to use an NFS mount for one of
    its storage directories. That will give the most up-to-date backup of
    the metadata in case of total node failure.
  
   -Joey
  
   On Wed, Dec 7, 2011 at 3:17 AM, praveenesh kumar
 praveen...@gmail.com
wrote:
  
    This means we are still relying on the Secondary NameNode ideology for
    the NameNode's backup.
    Is OS-mirroring of the NameNode a good alternative to keep it alive all
    the time?
  
   Thanks,
   Praveenesh
  
   On Wed, Dec 7, 2011 at 1:35 PM, Uma Maheswara Rao G
    mahesw...@huawei.com wrote:
  
 AFAIK the Backup Node was introduced from version 0.21 onwards.
    ________________________________
   From: praveenesh kumar [praveen...@gmail.com]
   Sent: Wednesday, December 07, 2011 12:40 PM
   To: common-user@hadoop.apache.org
   Subject: HDFS Backup nodes
  
    Does Hadoop 0.20.205 support configuring HDFS backup nodes?
  
   Thanks,
   Praveenesh
  
  
  
  
   --
   Joseph Echeverria
   Cloudera, Inc.
   443.305.9434
  
  
  
  
  
  
 
 



Re: HDFS Backup nodes

2011-12-11 Thread M. C. Srivas
You are out of luck if you don't want to use NFS, and yet want redundancy
for the NN.  Even the new NN HA work being done by the community will
require NFS ... and the NFS itself needs to be HA.

But if you use a Netapp, then the likelihood of the Netapp crashing is
lower than the likelihood of a garbage-collection-of-death happening in the
NN.

[ disclaimer:  I don't work for Netapp, I work for MapR ]


On Wed, Dec 7, 2011 at 4:30 PM, randy randy...@comcast.net wrote:

 Thanks Joey. We've had enough problems with nfs (mainly under very high
 load) that we thought it might be riskier to use it for the NN.

 randy


 On 12/07/2011 06:46 PM, Joey Echeverria wrote:

 Hey Rand,

 It will mark that storage directory as failed and ignore it from then
 on. In order to do this correctly, you need a couple of options
 enabled on the NFS mount to make sure that it doesn't retry
 infinitely. I usually run with the tcp,soft,intr,timeo=10,retrans=10
 options set.

 -Joey

 On Wed, Dec 7, 2011 at 12:37 PM, randy...@comcast.net wrote:

 What happens then if the nfs server fails or isn't reachable? Does hdfs
 lock up? Does it gracefully ignore the nfs copy?

 Thanks,
 randy

 - Original Message -
 From: Joey Echeverria j...@cloudera.com
 To: common-user@hadoop.apache.org
 Sent: Wednesday, December 7, 2011 6:07:58 AM
 Subject: Re: HDFS Backup nodes

 You should also configure the NameNode to use an NFS mount for one of
 its storage directories. That will give the most up-to-date backup of
 the metadata in case of total node failure.
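A minimal hdfs-site.xml sketch of that setup for 0.20.x (the property name is
per that release; the local and NFS paths are examples):

<property>
  <name>dfs.name.dir</name>
  <!-- local disk first, NFS-mounted directory second -->
  <value>/data/1/dfs/nn,/mnt/nfs/dfs/nn</value>
</property>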

 -Joey

 On Wed, Dec 7, 2011 at 3:17 AM, praveenesh kumar praveen...@gmail.com
  wrote:

 This means we are still relying on the Secondary NameNode ideology for
 the NameNode's backup.
 Is OS-mirroring of the NameNode a good alternative to keep it alive all the
 time?

 Thanks,
 Praveenesh

 On Wed, Dec 7, 2011 at 1:35 PM, Uma Maheswara Rao G
 mahesw...@huawei.com wrote:

  AFAIK the Backup Node was introduced from version 0.21 onwards.
  ________________________________
 From: praveenesh kumar [praveen...@gmail.com]
 Sent: Wednesday, December 07, 2011 12:40 PM
 To: common-user@hadoop.apache.org
 Subject: HDFS Backup nodes

 Does Hadoop 0.20.205 support configuring HDFS backup nodes?

 Thanks,
 Praveenesh




 --
 Joseph Echeverria
 Cloudera, Inc.
 443.305.9434








Re: Any related paper on how to resolve hadoop SPOF issue?

2011-10-09 Thread M. C. Srivas
Are there any performance benchmarks available for Ceph? (with Hadoop,
without, both?)

On Thu, Aug 25, 2011 at 11:44 AM, Alex Nelson ajnel...@cs.ucsc.edu wrote:

 Hi George,

 UC Santa Cruz contributed a ;login: article describing replacing HDFS with
 Ceph.  (I was one of the authors.)  One of the key architectural advantages
 of Ceph over HDFS is that Ceph distributes its metadata service over
 multiple metadata servers.

 I hope that helps.

 --Alex


 On Aug 25, 2011, at 02:54 , George Kousiouris wrote:

 
  Hi,
 
  many thanks!!
 
  I also found this one, which has some related work also with other
 efforts, in case it is helpful for someone else:
 
 https://ritdml.rit.edu/bitstream/handle/1850/13321/ATalwalkarThesis1-2011.pdf?sequence=1
 
 
  BR,
  George
 
  On 8/25/2011 12:46 PM, Nan Zhu wrote:
  Hope it helps
 
  http://www.springerlink.com/content/h17r882710314147/
 
  Best,
 
  Nan
 
  On Thu, Aug 25, 2011 at 5:43 PM, George Kousiouris
  gkous...@mail.ntua.gr wrote:
 
  Hi guys,
 
  We are currently in the process of writing a paper regarding Hadoop, and we
  would like to reference any attempt to remove the single point of failure of
  the NameNode. We have found some efforts in various presentations (like
  dividing the namespace between more than one NameNode, if I remember
  correctly), but the search for a concrete paper on the issue came to
  nothing.
 
  Is anyone aware or has participated in such an effort?
 
  Thanks,
  George
 
  --
 
  ---
 
  George Kousiouris
  Electrical and Computer Engineer
  Division of Communications,
  Electronics and Information Engineering
  School of Electrical and Computer Engineering
  Tel: +30 210 772 2546
  Mobile: +30 6939354121
  Fax: +30 210 772 2569
  Email: gkous...@mail.ntua.gr
  Site: http://users.ntua.gr/gkousiou/
 
  National Technical University of Athens
  9 Heroon Polytechniou str., 157 73 Zografou, Athens, Greece
 
 
 
 
 
  --
 
  ---
 
  George Kousiouris
  Electrical and Computer Engineer
  Division of Communications,
  Electronics and Information Engineering
  School of Electrical and Computer Engineering
  Tel: +30 210 772 2546
  Mobile: +30 6939354121
  Fax: +30 210 772 2569
  Email: gkous...@mail.ntua.gr
  Site: http://users.ntua.gr/gkousiou/
 
  National Technical University of Athens
  9 Heroon Polytechniou str., 157 73 Zografou, Athens, Greece
 




Re: question regarding MapR

2011-10-09 Thread M. C. Srivas
MapR's entire distro is free, and consists mostly of open-source components.
 The only closed-source component is MapR's file-system, which is a complete
drop-in replacement for HDFS.

If you wish, we can discuss your concerns re: MapR further outside
this mailing list.

thanks,
Srivas.


On Sun, Oct 9, 2011 at 4:12 AM, George Kousiouris gkous...@mail.ntua.gr wrote:


 Hi all,

 We were looking at how to overcome the random-write/concurrent-write issue
 with HDFS. We came across the MapR platform, which claims to resolve this,
 along with a distributed NameNode. However, we are a bit concerned that this
 does not seem to be open source. Does anyone have any info on which parts
 of MapR are indeed open source? Or any alternative solution?

 BR,
 George

 --

 ---

 George Kousiouris
 Electrical and Computer Engineer
 Division of Communications,
 Electronics and Information Engineering
 School of Electrical and Computer Engineering
 Tel: +30 210 772 2546
 Mobile: +30 6939354121
 Fax: +30 210 772 2569
 Email: gkous...@mail.ntua.gr
 Site: http://users.ntua.gr/gkousiou/

 National Technical University of Athens
 9 Heroon Polytechniou str., 157 73 Zografou, Athens, Greece




Re: YCSB Benchmarking for HBase

2011-08-04 Thread M. C. Srivas
Lohit did some work on making YCSB run on a bunch of machines in a
coordinated manner. Plus fixed some limits in how many zk
connections/threads can run inside one process. See
http://github.com/lohitvijayarenu/YCSB

I believe that code also has a data-verification option to ensure that a
*get* reads what was written and not something else.



On Wed, Aug 3, 2011 at 3:10 AM, praveenesh kumar praveen...@gmail.com wrote:

 Hi,

 Anyone working on YCSB (Yahoo! Cloud Serving Benchmark) for HBase?

 I am trying to run it, and it's giving me an error:

 $ java -cp build/ycsb.jar com.yahoo.ycsb.CommandLine -db
 com.yahoo.ycsb.db.HBaseClient

 YCSB Command Line client
 Type help for command line help
 Start with -help for usage info
 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/hadoop/conf/Configuration
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2406)
at java.lang.Class.getConstructor0(Class.java:2716)
at java.lang.Class.newInstance0(Class.java:343)
at java.lang.Class.newInstance(Class.java:325)
at com.yahoo.ycsb.CommandLine.main(Unknown Source)
 Caused by: java.lang.ClassNotFoundException:
 org.apache.hadoop.conf.Configuration
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
... 6 more

 From the error, it seems like it's not able to find the hadoop-core jar file,
 but it's already in the class path.
 Has anyone worked on YCSB with HBase?

 Thanks,
 Praveenesh



Re: The best architecture for EC2/Hadoop interface?

2011-08-02 Thread M. C. Srivas
Avoid Swing.

Try GWT, it's quite nice:  http://code.google.com/webtoolkit/.
You get the Java debuggability, and a browser-based GUI.


On Tue, Aug 2, 2011 at 3:35 AM, Steve Loughran ste...@apache.org wrote:

 On 02/08/11 05:09, Mark Kerzner wrote:

 Hi,

 I want to give my users a GUI that would allow them to start Hadoop
 clusters
 and run applications that I will provide on the AMIs. What would be a good
 approach to make it simple for the user? Should I write a Java Swing app
 that will wrap around the EC2 commands? Should I use some more direct EC2
 API? Or should I use a web browser interface?

 My idea was to give the user a Java Swing GUI, so that he gives his Amazon
 credentials to it, and it would be secure because the application is not
 exposed to the outside. Does this approach make sense?


 1. I'm not sure that a Java Swing GUI makes sense for anything anymore - if
 it ever did.

 2. Have a look at what other people have done first before writing your
 own. Amazon provide something for their derivative of Hadoop, Elastic MR, I
 suspect KarmaSphere and others may provide UIs in front of it too.

 The other thing is that most big jobs are more than one operation, so you are
 in a workflow world. Things like Cascading, Pig and Oozie help here, and if
 you can bring them up in-cluster, you can get a web UI.



Re: Which release to use?

2011-07-18 Thread M. C. Srivas
Mike,

Just a minor inaccuracy in your email. Here's setting the record straight:

1. MapR directly sells their distribution of Hadoop. Support is from  MapR.
2. EMC also sells the MapR distribution, for use on any hardware. Support is
from EMC worldwide.
3. EMC also sells a Hadoop appliance, which has the MapR distribution
specially built for it. Support is from EMC.

4. MapR also has a free, unlimited, unrestricted version called M3, which
has the same 2-5x performance, management and stability improvements, and
includes NFS. It is not crippleware, and the unlimited, unrestricted, free
use does not expire on any date.

Hope that clarifies what MapR is doing.

thanks & regards,
Srivas.

On Mon, Jul 18, 2011 at 11:33 AM, Michael Segel
michael_se...@hotmail.com wrote:


 EMC has inked a deal with MapRTech to resell their release and support
 services for MapRTech.
 Does this mean that they are going to stop selling their own release on
 Greenplum? Maybe not in the near future, however,
 a Greenplum appliance may not get the customer traction that their
 reselling of MapR will generate.

 It sounds like they are hedging their bets and are taking an 'IBM'
 approach.


  Subject: RE: Which release to use?
  Date: Mon, 18 Jul 2011 08:30:59 -0500
  From: jeff.schm...@shell.com
  To: common-user@hadoop.apache.org
 
  Steve,
 
  I read your blog - nice post. I believe EMC is selling the Greenplum
  solution as an appliance.
 
  Cheers -
 
  Jeffery
 
  -Original Message-
  From: Steve Loughran [mailto:ste...@apache.org]
  Sent: Friday, July 15, 2011 4:07 PM
  To: common-user@hadoop.apache.org
  Subject: Re: Which release to use?
 
  On 15/07/2011 18:06, Arun C Murthy wrote:
   Apache Hadoop is a volunteer driven, open-source project. The
  contributors to Apache Hadoop, both individuals and folks across a
  diverse set of organizations, are committed to driving the project
  forward and making timely releases - see discussion on hadoop-0.23 with
  a raft of newer features such as HDFS Federation, NextGen MapReduce and
  plans for HA NameNode etc.
  
   As with most successful projects there are several options for
  commercial support to Hadoop or its derivatives.
  
   However, Apache Hadoop has thrived before there was any commercial
  support (I've personally been involved in over 20 releases of Apache
  Hadoop and deployed them while at Yahoo) and I'm sure it will in this
  new world order.
  
   We, the Apache Hadoop community, are committed to keeping Apache
  Hadoop 'free', providing support to our users and to move it forward at
  a rapid rate.
  
 
  Arun makes a good point which is that the Apache project depends on
  contributions from the community to thrive. That includes
 
-bug reports
-patches to fix problems
-more tests
-documentation improvements: more examples, more on getting started,
  troubleshooting, etc.
 
  If there's something lacking in the codebase, and you think you can fix
  it, please do so. Helping with the documentation is a good start, as it
  can be improved, and you aren't going to break anything.
 
  Once you get into changing the code, you'll end up working with the head
 
  of whichever branch you are targeting.
 
  The other area everyone can contribute on is testing. Yes, Y! and FB can
 
  test at scale, yes, other people can test large clusters too -but nobody
 
  has a network that looks like yours but you. And Hadoop does care about
  network configurations. Testing beta and release candidate releases in
  your infrastructure, helps verify that the final release will work on
  your site, and you don't end up getting all the phone calls about
  something not working
 
 



Re: large memory tasks

2011-06-15 Thread M. C. Srivas
I like this, Shi!  Very clever!

On Wed, Jun 15, 2011 at 4:36 PM, Shi Yu sh...@uchicago.edu wrote:

 Suppose you are looking up a value V of a key K.   And V is required for an
 upcoming process. Suppose the data in the upcoming process  has the form

 R1  K1 K2 K3,

 where R1 is the record number, K1 to K3 are the keys occurring in the
 record, which means in the look up case you would query for V1, V2, V3

 Using inner join you could attach all the V values for a single record and
 prepare the data like

 R1 K1 K2 K3 V1 V2 V3

 then each record has the complete information for the next process. So you
 pay the storage for the efficiency. Even taking into account the time
 required for preparing the data, it is still faster than the look-up
 approach.

 I have also tried Tokyo Cabinet; you need to compile and install some
 extensions to get it working. Sometimes getting things and APIs to work can
 be painful. If you don't need to update the lookup table, installing TC,
 MemCache, or MongoDB locally on each node would be the most efficient solution
 because all the look-ups are local.


 On 6/15/2011 5:56 PM, Ian Upright wrote:

 If the data set doesn't fit in working memory, but is still of a
 reasonable
 size (let's say a few hundred gigabytes), then I'd probably use something
 like this:

 http://fallabs.com/tokyocabinet/

  From reading the Hadoop docs (which I'm very new to), I might use
 DistributedCache to replicate that database around.  My impression would
 be
 that this might be among the most efficient things one could do.

 However, for my particular application, even using Tokyo Cabinet introduces
 too much inefficiency, and a pure plain old memory-based lookups is by far
 the most efficient.  (not to mention that some of the lookups I'm doing
 are
 specialized trees that can't be done with tokyocabinet or any typical db,
 but thats beside the point)

 I'm having trouble understanding your more efficient method of using more
 data and HDFS, and how it could possibly be any more efficient than, say,
 the above approach.

 How is increasing the size minimizing the lookups?

 Ian

  I had the same problem before, a big lookup table too large to load in
 memory.

 I tried and compared the following approaches:  in-memory MySQL DB, a
 dedicated central memcache server, a dedicated central MongoDB server,
 local DB (each node has its own MongoDB server) model.

 The local DB model is the most efficient one.  I believe dedicated
 server approach could get improved if the number of server is increased
 and distributed. I just tried single server.

 But later I dropped the lookup-table approach. Instead, I attached
 the table information in HDFS (which could be considered an inner-join
 DB process), which significantly increases the size of the data sets
 but avoids the bottleneck of the table lookup. There is a trade-off: with
 no table lookups, the data to process is intensive (TB size). In
 contrast, a lookup table could save 90% of the data storage.

 According to our experiments on a 30-node cluster, attaching information
 in HDFS is even 20%  faster than the local DB model. When attaching
 information in HDFS, it is also easier to ping-pong Map/Reduce
 configuration to further improve the efficiency.

 Shi

 On 6/15/2011 5:05 PM, GOEKE, MATTHEW (AG/1000) wrote:

 Is the lookup table constant across each of the tasks? You could try
 putting it into memcached:

 http://hcil.cs.umd.edu/trs/2009-01/2009-01.pdf

 Matt

 -Original Message-
 From: Ian Upright [mailto:i...@upright.net]
 Sent: Wednesday, June 15, 2011 3:42 PM
 To: common-user@hadoop.apache.org
 Subject: large memory tasks

 Hello, I'm quite new to Hadoop, so I'd like to get an understanding of
 something.

 Let's say I have a task that requires 16 GB of memory in order to
 execute.
 Let's say hypothetically it's some sort of big lookup table that
 needs that kind of memory.

 I could have 8 cores run the task in parallel (multithreaded), and all 8
 cores can share that 16gb lookup table.

 On another machine, I could have 4 cores run the same task, and they
 still
 share that same 16gb lookup table.

 Now, with my understanding of Hadoop, each task has its own memory.

 So if I have 4 tasks that run on one machine, and 8 tasks on another,
 then
 the 4 tasks need a 64 GB machine, and the 8 tasks need a 128 GB machine,
 but
 really, lets say I only have two machines, one with 4 cores and one with
 8,
 each machine only having 24 GB.

 How can the work be evenly distributed among these machines?  Am I
 missing
 something?  What other ways can this be configured such that this works
 properly?

 Thanks, Ian

Re: Hadoop and WikiLeaks

2011-05-19 Thread M. C. Srivas
Interesting to note that Cassandra and ZK are now considered Hadoop
projects.

They were independent of Hadoop before the recent update.


On Thu, May 19, 2011 at 4:18 AM, Steve Loughran ste...@apache.org wrote:

 On 18/05/11 18:05, javam...@cox.net wrote:

 Yes!

 -Pete

 Edward Capriolo edlinuxg...@gmail.com wrote:

 =
 http://hadoop.apache.org/#What+Is+Apache%E2%84%A2+Hadoop%E2%84%A2%3F

 March 2011 - Apache Hadoop takes top prize at Media Guardian Innovation
 Awards

 The Hadoop project won the innovator of the year award from the UK's
 Guardian newspaper, where it was described as having the potential to be a
 greater catalyst for innovation than other nominees including WikiLeaks
 and the iPad.

 Does this copy text bother anyone else? Sure winning any award is great
 but
 does hadoop want to be associated with innovation like WikiLeaks?



 Ian updated the page yesterday with changes I'd put in for trademarks, and
 I added this news quote directly from the paper. We could strip out the
 quote easily enough.




Re: Cluster hard drive ratios

2011-05-04 Thread M. C. Srivas
Hey Matt,

 we are using the same Dell boxes, and we can get 2 GB/s per node (read and
write) without problems.


On Wed, May 4, 2011 at 8:43 AM, Matt Goeke msg...@gmail.com wrote:

 I have been reviewing quite a few presentations on the web from
 various businesses, in addition to the ones I watched first hand at
 the cloudera data summit last week, and I am curious as to others
 thoughts around hard drive ratios. Various sources including Cloudera
 have cited 1 HDD x 2 cores x 4 GB ECC, but this makes me wonder what
 the upper bound for HDDs is in this ratio. We have specced out various
 machines from Dell and it is possible to get dual hexacores with 14
 drives (2 raided for OS and 12x2TB) but this seems to conflict with
 that original ratio and some of the specs I have witnessed in
 presentations (which are mostly 4 drive configurations). I would
 assume all you incur is additional complexity and more potential for
 hardware failure on a specific machine but I have seen little to no
 data stating at what point there is a plateau in write speed
 performance. Can anyone give personal experience around this type of
 setup?

 If we accept that we are incurring the negatives I stated above but we
 gain higher data density in the cluster, then is this setup fine or are we
 overlooking something?

 Thanks,
 Matt



Re: Memory mapped resources

2011-04-13 Thread M. C. Srivas
Sorry, don't mean to say you don't know mmap or didn't do cool things in the
past.

But you will see why anyone would've interpreted the original post, given
the title of the posting and the following wording, to mean "can I mmap
files that are in HDFS?"

On Mon, Apr 11, 2011 at 3:57 PM, Benson Margulies bimargul...@gmail.com wrote:

 We have some very large files that we access via memory mapping in
 Java. Someone's asked us about how to make this conveniently
 deployable in Hadoop. If we tell them to put the files into hdfs, can
 we obtain a File for the underlying file on any given node?



Re: Memory mapped resources

2011-04-12 Thread M. C. Srivas
I am not sure if you realize, but HDFS is not VM integrated.  What you are
asking for is support  *inside* the linux kernel for HDFS file systems. I
don't see that happening for the next few years, and probably never at all.
(HDFS is all Java today, and Java certainly is not going to go inside the
kernel)

The ways to get there are
a) use the hdfs-fuse proxy
b) do this by hand  - copy the file into each individual machine's local
disk, and then mmap the local path
c) more or less do the same as (b), using a thing called the Distributed
Cache in Hadoop, and then mmap the local path (see the sketch after this list)
d) don't use HDFS, and instead use something else for this purpose
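A rough sketch of option (c) with the org.apache.hadoop.mapreduce API -- the
class name, model path and driver wiring are made up, and a single
MappedByteBuffer tops out at 2 GB, so a very large model has to be mapped in
chunks:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ModelMapper extends Mapper<LongWritable, Text, Text, Text> {
  private MappedByteBuffer model;

  @Override
  protected void setup(Context context) throws IOException {
    // The job driver registers the model once, e.g.
    //   DistributedCache.addCacheFile(new URI("hdfs:///models/big.bin"), conf);
    // and the framework copies it to each node's local disk exactly once.
    Path[] local = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    RandomAccessFile raf = new RandomAccessFile(local[0].toString(), "r");
    FileChannel ch = raf.getChannel();
    // mmap the local copy: the OS page cache keeps one physical copy that
    // every task JVM on the node shares.
    model = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
    ch.close();
    raf.close();
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // ... look values up in 'model' here ...
  }
}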


On Tue, Apr 12, 2011 at 10:40 AM, Benson Margulies bimargul...@gmail.com wrote:

 Here's the OP again.

 I want to make it clear that my question here has to do with the
 problem of distributing 'the program' around the cluster, not 'the
 data'. In the case at hand, the issue a system that has a large data
 resource that it needs to do its work. Every instance of the code
 needs the entire model. Not just some blocks or pieces.

 Memory mapping is a very attractive tactic for this kind of data
 resource. The data is read-only. Memory-mapping it allows the
 operating system to ensure that only one copy of the thing ends up in
 physical memory.

 If we force the model into a conventional file (storable in HDFS) and
 read it into the JVM in a conventional way, then we get as many copies
 in memory as we have JVMs.  On a big machine with a lot of cores, this
 begins to add up.

 For people who are running a cluster of relatively conventional
 systems, just putting copies on all the nodes in a conventional place
 is adequate.



Re: decommissioning node woes

2011-03-19 Thread M. C. Srivas
All trunking/bonding at the switch (eg, LACP) gives only 1 NIC's worth of
bandwidth point-to-point, even if your boxes all have multiple NICs.   It
chooses a NIC at connection initiation (via round-robin, or load, or
whatever). But once the TCP connection is established, there is no
load-balancing --


On Sat, Mar 19, 2011 at 7:11 PM, Michael Segel michael_se...@hotmail.com wrote:


 Usually the port bonding is done at a lower level so that you and your
 applications see this as a single port. So you don't have to worry about
 load balancing between the ports.
 (Or am I missing something?)

 thx

 -Mike


  From: tdunn...@maprtech.com
  Date: Sat, 19 Mar 2011 09:00:30 -0700
  Subject: Re: decommissioning node woes
  To: common-user@hadoop.apache.org
  CC: michael_se...@hotmail.com
 
  Unfortunately this doesn't help much because it is hard to get the ports
 to
  balance the load.
 
  On Fri, Mar 18, 2011 at 8:30 PM, Michael Segel 
  michael_se...@hotmail.com wrote:
 
   With a 1GBe port, you could go 100Mbs for the bandwidth limit.
   If you bond your ports, you could go higher.
  



Re: Is this a fair summary of HDFS failover?

2011-02-14 Thread M. C. Srivas
The summary is quite inaccurate.

On Mon, Feb 14, 2011 at 8:48 AM, Mark Kerzner markkerz...@gmail.com wrote:

 Hi,

 is it accurate to say that

   - In 0.20 the Secondary NameNode acts as a cold spare; it can be used to
   recreate the HDFS if the Primary NameNode fails, but with the delay of
   minutes if not hours, and there is also some data loss;



The Secondary NN is not a spare. It is used to augment the work of the
Primary, by offloading some of its work to another machine. The work
offloaded is log rollup or checkpointing. This has been a source of
constant confusion (some named it incorrectly as a secondary and now we
are stuck with it).

The Secondary NN certainly cannot take over for the Primary. It is not its
purpose.

Yes, there is data loss.




   - in 0.21 there are streaming edits to a Backup Node (HADOOP-4539), which
   replaces the Secondary NameNode. The Backup Node can be used as a warm
   spare, with the failover being a matter of seconds. There can be multiple
   Backup Nodes, for additional insurance against failure, and previous best
   common practices apply to it;



There is no Backup NN in the manner you are thinking of. It is completely
manual, and requires restart of the whole world, and takes about 2-3 hours
to happen. If you are lucky, you may have only a little data loss (people
have lost entire clusters due to this -- from what I understand, you are far
better off resurrecting the Primary instead of trying to bring up a Backup
NN).

In any case, when you run it like you mention above, you will have to
(a) make sure that the primary is dead
(b) edit hdfs-site.xml on *every* datanode to point to the new IP address of
the backup, and restart each datanode.
(c) wait for 2-3 hours for all the block-reports from every restarted DN to
finish

2-3 hrs afterwards:
(d) after that, restart all TT and the JT to connect to the new NN
(e) finally, restart all the clients (eg, HBase, Oozie, etc)

Many companies, including Yahoo! and Facebook, use a couple of NetApp filers
to hold the actual data that the NN writes. The two NetApp filers are run in
HA mode with NVRAM copying.  But the NN remains a single point of failure,
and there is probably some data loss.



   - 0.22 will have further improvements to the HDFS performance, such
   as HDFS-1093.

 Does the paper on HDFS Reliability by Tom
 White
 http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf
 still
 represent the current state of things?



See Dhruba's blog-post about the Avatar NN + some custom stackable HDFS
code on all the clients + Zookeeper + the dual NetApp filers.

It helps Facebook do manual, controlled, fail-over during software upgrades,
at the cost of some performance loss on the DataNodes (the DataNodes have to
do 2x block reports, and each block-report is expensive, so it limits the
DataNode a bit).  The article does not talk about dataloss when the
fail-over is initiated manually, so I don't know about that.

http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html





 Thank you. Sincerely,
 Mark



Re: Is this a fair summary of HDFS failover?

2011-02-14 Thread M. C. Srivas
I understand you are writing a book, "Hadoop in Practice."  If so, it's
important that what's recommended in the book be verified in
practice. (I mean, beyond simply posting in this newsgroup - for instance,
the recommendations on NN fail-over should be tried out first before writing
about how to do it.) Otherwise you won't know whether your recommendations
really work or not.



On Mon, Feb 14, 2011 at 12:31 PM, Mark Kerzner markkerz...@gmail.com wrote:

 Thank you, M. C. Srivas, that was enormously useful. I understand it now,
 but just to be complete, I have re-formulated my points according to your
 comments:

   - In 0.20 the Secondary NameNode performs snapshotting. Its data can be
   used to recreate the HDFS if the Primary NameNode fails. The procedure is
   manual and may take hours, and there is also data loss since the last
   snapshot;
   - In 0.21 there is a Backup Node (HADOOP-4539), which aims to help with
   HA and act as a cold spare. The data loss is less than with Secondary NN,
   but it is still manual and potentially error-prone, and it takes hours;
   - There is an AvatarNode patch available for 0.20, and Facebook runs its
   cluster that way, but the patch submitted to Apache requires testing and
 the
   developers adopting it must do some custom configurations and also
 exercise
   care in their work.

 As a conclusion, when building an HA HDFS cluster, one needs to follow the
 best
 practices outlined by Tom
 White
 http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf,
 and may still need to resort to specialized NFS filers for running the
 NameNode.

 Sincerely,
 Mark



 On Mon, Feb 14, 2011 at 11:50 AM, M. C. Srivas mcsri...@gmail.com wrote:

  The summary is quite inaccurate.
 
  On Mon, Feb 14, 2011 at 8:48 AM, Mark Kerzner markkerz...@gmail.com
  wrote:
 
   Hi,
  
   is it accurate to say that
  
 - In 0.20 the Secondary NameNode acts as a cold spare; it can be used
  to
 recreate the HDFS if the Primary NameNode fails, but with the delay
 of
 minutes if not hours, and there is also some data loss;
  
 
 
  The Secondary NN is not a spare. It is used to augment the work of the
  Primary, by offloading some of its work to another machine. The work
  offloaded is log rollup or checkpointing. This has been a source of
  constant confusion (some named it incorrectly as a secondary and now we
  are stuck with it).
 
  The Secondary NN certainly cannot take over for the Primary. It is not
 its
  purpose.
 
  Yes, there is data loss.
 
 
 
 
 - in 0.21 there are streaming edits to a Backup Node (HADOOP-4539),
  which
 replaces the Secondary NameNode. The Backup Node can be used as a
 warm
 spare, with the failover being a matter of seconds. There can be
  multiple
 Backup Nodes, for additional insurance against failure, and previous
  best
 common practices apply to it;
  
 
 
  There is no Backup NN in the manner you are thinking of. It is
 completely
  manual, and requires restart of the whole world, and takes about 2-3
  hours
  to happen. If you are lucky, you may have only a little data loss (people
  have lost entire clusters due to this -- from what I understand, you are
  far
  better off resurrecting the Primary instead of trying to bring up a
 Backup
  NN).
 
  In any case, when you run it like you mention above, you will have to
  (a) make sure that the primary is dead
  (b) edit hdfs-site.xml on *every* datanode to point to the new IP address
  of
  the backup, and restart each datanode.
  (c) wait for 2-3 hours for all the block-reports from every restarted DN
 to
  finish
 
  2-3 hrs afterwards:
  (d) after that, restart all TT and the JT to connect to the new NN
  (e) finally, restart all the clients (eg, HBase, Oozie, etc)
 
  Many companies, including Yahoo! and Facebook, use a couple of NetApp
  filers
  to hold the actual data that the NN writes. The two NetApp filers are run
  in
  HA mode with NVRAM copying.  But the NN remains a single point of
  failure,
  and there is probably some data loss.
 
 
 
 - 0.22 will have further improvements to the HDFS performance, such
 as HDFS-1093.
  
   Does the paper on HDFS Reliability by Tom
   White
  
 http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf
   still
   represent the current state of things?
  
 
 
  See Dhruba's blog-post about the Avatar NN + some custom stackable HDFS
  code on all the clients + Zookeeper + the dual NetApp filers.
 
  It helps Facebook do manual, controlled, fail-over during software
  upgrades,
  at the cost of some performance loss on the DataNodes (the DataNodes have
  to
  do 2x block reports, and each block-report is expensive, so it limits the
  DataNode a bit).  The article does not talk about dataloss when the
  fail-over is initiated manually, so I don't know about that.
 
 
 
 http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html
 
 
 
 
  
   Thank you. Sincerely,
   Mark
  
 



Re: the performance of HDFS

2011-01-25 Thread M. C. Srivas
On Tue, Jan 25, 2011 at 12:33 PM, Da Zheng zhengda1...@gmail.com wrote:

 Hello,

 I am trying to measure the performance of HDFS, but the write rate is quite
 low. When the replication factor is 1, the rate of writing to HDFS is about
 60MB/s. When the replication factor is 3, the rate drops significantly to
 about 15MB/s. Even though the actual rate of writing data to the disk is
 about 45MB/s, it's still much lower than when replication factor is 1. The
 link between two nodes in the cluster is 1Gbps. CPU is Dual-Core AMD
 Opteron(tm) Processor 2212, so CPU isn't bottleneck either. I thought I
 should be able to saturate the disk very easily. I wonder where the
 bottleneck is. What is the throughput for writing on a Hadoop cluster when
 the replication factor is 3?


The numbers above seem correct as per my observations.  If your data is
3-way replicated, the data-node writes about 3x the actual data written.
Conversely, your write-rate will be limited to 1/3 of  how fast the disk can
write, minus some overhead for replication.

The aggregate write-rate can get much higher if you use more drives, but a
single stream throughput is limited to the speed of one disk spindle.
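For reference, the reported numbers are self-consistent:

  ~15 MB/s at the client x 3 replicas ≈ 45 MB/s of physical disk writes,
  which matches the disk write rate observed above.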






 Thanks,
 Da



Re: Approached to combing the output of reducers

2010-10-23 Thread M. C. Srivas
Not with HDFS, since only one process may write to a single file (and it's
not visible until the file is closed). In fact, it's worse than that ... the
same process that's writing that file cannot see it or read it until after
it's done.

 If you have multiple reducers, you are simply out of luck and will have to
run a separate job to copy the data out.


On Sat, Oct 23, 2010 at 3:08 PM, Steve Lewis lordjoe2...@gmail.com wrote:

 Once I run a map-reduce job I get output in the form of
 part-r-0 part-r-1 ...

 In many cases the output is significantly smaller than the original input -
 take the classic word count

 In most cases I want to combine the output into a single file that may well
 not live on HDFS but on a more accessible file system

 Are there standard libraries or approaches for consolidating reducer
 output.

 A second Map-Reduce job taking the output directory as an input is an OK
 start but as output there needs to be a single reducer that
 writes a real file and not reduce output -

 Are there standard libraries or approaches to this?

 --
 Steven M. Lewis PhD
 4221 105th Ave Ne
 Kirkland, WA 98033
 206-384-1340 (cell)
 Institute for Systems Biology
 Seattle WA



Re: Approached to combing the output of reducers

2010-10-23 Thread M. C. Srivas
On Sat, Oct 23, 2010 at 4:19 PM, Steve Lewis lordjoe2...@gmail.com wrote:

 
  I am assuming the first job outputs multiple files and that the second
 (and
  I presume a map-reduce job)

 will assign the output intended for a single file to a single reducer (in
 some cases multiple output files might be supported - one
 per reducer). One issue is how to allow the reducer to write to some
 'external file system' - i.e. not HDFS or the instance's local file system,
 but S3 on Amazon or some mounted NFS system on a stand-alone cluster.


 bin/hadoop jar  jarname  input-dir  output-dir

Thus:

bin/hadoop jar  jarname  hdfs://...  file:///my/nfs/mounted/dir/...

will work, if you nfs-mount your destination dir on all the nodes in the
cluster.



 


  On Oct 23, 2010, at 3:32 PM, M. C. Srivas mcsri...@gmail.com wrote:
 
   Not with HDFS, since only one process may write to a single file (and
 its
   not visible until the file is closed). In fact, its worse than that ...
  the
   same process that's writing that file cannot see it or read it until
  after
   its done.
  
   If you have multiple reducers, you are simply out of luck and will have
  to
   run a separate job to copy the data out.
  
  
   On Sat, Oct 23, 2010 at 3:08 PM, Steve Lewis lordjoe2...@gmail.com
  wrote:
  
   Once I run a map-reduce job I get output in the form of
   part-r-0 part-r-1 ...
  
   In many cases the output is significantly smaller than the original
  input -
   take the classic word count
  
   In most cases I want to combine the output into a single file that may
  well
   not live on HDFS but on a more accessible file system
  
   Are there standard libraries or approaches for consolidating reducer
   output.
  
   A second Map-Reduce job taking the output directory as an input is an
 OK
   start but as output there needs to be a single reducer that
   writes a real file and not reduce output -
  
   Are there standard libraries or approaches to this?
  
   --
   Steven M. Lewis PhD
   4221 105th Ave Ne
   Kirkland, WA 98033
   206-384-1340 (cell)
   Institute for Systems Biology
   Seattle WA
  
 



 --
 Steven M. Lewis PhD
 4221 105th Ave Ne
 Kirkland, WA 98033
 206-384-1340 (cell)
 Institute for Systems Biology
 Seattle WA



Re: BUG: Anyone use block size more than 2GB before?

2010-10-21 Thread M. C. Srivas
I thought the petasort benchmark you published used 12.5G block sizes. How
did you make that work?

On Mon, Oct 18, 2010 at 4:27 PM, Owen O'Malley omal...@apache.org wrote:

 Block sizes larger than 2**31 are known to not work. I haven't ever tracked
 down the problem, just set my block size to be smaller than that.

 -- Owen