Re: Questions about hadoop-metrics2.properties

2013-10-24 Thread Luke Lu
1. File and Ganglia are the only bundled sinks, though there are patches in
the works for socket/JSON (for Chukwa) and Graphite sinks.
2. Hadoop metrics (and metrics2) is mostly designed for system/process
metrics, which means you'd need to attach jconsole to your map/reduce task
processes to see task metrics instrumented via the metrics API. What you
actually want is probably custom job counters.
3. You don't need any configuration to use JMX to access metrics2, as JMX
is currently on by default. The configuration in hadoop-metrics2.properties
is mostly for optional sink configuration and metrics filtering.
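
For concreteness, a minimal hadoop-metrics2.properties sketch of the kind of
optional sink configuration meant above (the file paths are placeholders, and
option names should be checked against the metrics2 package docs for your
release):

# syntax: [prefix].[source|sink].[instance].[option]
*.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
# write all NodeManager metrics to a local file every 10 seconds
nodemanager.sink.file.filename=/var/log/hadoop/nodemanager-metrics.out
nodemanager.sink.file.period=10
# a second sink instance restricted to the jvm context only
nodemanager.sink.jvmfile.class=${*.sink.file.class}
nodemanager.sink.jvmfile.context=jvm
nodemanager.sink.jvmfile.filename=/var/log/hadoop/nodemanager-jvm-metrics.out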

__Luke



On Wed, Oct 23, 2013 at 4:21 PM, Benyi Wang bewang.t...@gmail.com wrote:

 1. Does hadoop metrics2 only support File and Ganglia sinks?
 2. Can I expose metrics via JMX, especially customized metrics? I
 created some metrics in my mapreduce job and could successfully output
 them using a FileSink. But if I use jconsole to access the YARN nodemanager, I
 can only see hadoop metrics, e.g. Hadoop/NodeManager/NodeManagerMetrics,
 etc., not mine with the maptask prefix. How do I set things up to see
 maptask/reducetask prefix metrics?
 3. Is there an example using JMX? I could not find one.

 The configuration syntax is:

   [prefix].[source|sink|jmx|].[instance].[option]


 http://hadoop.apache.org/docs/current/api/org/apache/hadoop/metrics2/package-summary.html



Re: What happens when you have fewer input files than mapper slots?

2013-03-21 Thread Luke Lu
 Short version : let's say you have 20 nodes, and each node has 10 mapper
 slots.  You start a job with 20 very small input files.  How is the work
 distributed to the cluster?  Will it be even, with each node spawning one
 mapper task?  Is there any way of predicting or controlling how the work
 will be distributed?


You're right to expect that the tasks of the small job will likely be
distributed evenly among the 20 nodes, provided the 20 files are evenly
distributed among the nodes and there are free slots on every node.


 Long version : My cluster is currently used for two different jobs.  The
 cluster is currently optimized for Job A, so each node has a maximum of 18
 mapper slots.  However, I also need to run Job B.  Job B is VERY
 cpu-intensive, so we really only want one mapper to run on a node at any
 given time.  I've done a bunch of research, and it doesn't seem like Hadoop
 gives you any way to set the maximum number of mappers per node on a
 per-job basis.  I'm at my wit's end here, and considering some rather
 egregious workarounds.  If you can think of anything that can help me, I'd
 very much appreciate it.


Are you seeing that Job B tasks are not being evenly distributed to each node?
You can check the locations of the files with hadoop fsck. If evenness is the
goal, you can also write your own input format that returns empty locations
for each split and reads the small files directly in the map task (see the
sketch below). If you're using Hadoop 1.0.x and the fair scheduler, you might
need to set mapred.fairscheduler.assignmultiple to false in mapred-site.xml
(JT restart required) to work around a bug in the fair scheduler
(MAPREDUCE-2905) that causes tasks to be assigned unevenly. The bug is fixed
in Hadoop 1.1+.
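
A minimal sketch of such an input format against the Hadoop 1.x (old mapred)
API; the class name is illustrative only. It wraps TextInputFormat and strips
the data-local host hints so the scheduler spreads tasks instead of chasing
block locations:

import java.io.IOException;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class NoLocalityTextInputFormat extends TextInputFormat {
  @Override
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    InputSplit[] splits = super.getSplits(job, numSplits);
    InputSplit[] wrapped = new InputSplit[splits.length];
    for (int i = 0; i < splits.length; i++) {
      FileSplit fs = (FileSplit) splits[i];
      // Re-create each split with an empty location list ("no preferred host").
      wrapped[i] = new FileSplit(fs.getPath(), fs.getStart(), fs.getLength(),
                                 new String[0]);
    }
    return wrapped;
  }
}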

__Luke


Re: how to resolve conflicts with jar dependencies

2013-03-12 Thread Luke Lu
The problem is resolved in the next release of Hadoop (2.0.3-alpha, cf.
MAPREDUCE-1700).

For Hadoop 1.x based releases/distributions, put
-Dmapreduce.user.classpath.first=true on the hadoop command line and/or in
the client config.
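
For example, a hedged sketch of what the command line could look like (the job
jar, driver class, and Jackson jar names are placeholders; -D and -libjars are
only picked up if the driver goes through ToolRunner/GenericOptionsParser):

hadoop jar myjob.jar com.example.MyDriver \
    -Dmapreduce.user.classpath.first=true \
    -libjars jackson-core-asl-1.9.11.jar,jackson-mapper-asl-1.9.11.jar \
    input output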


On Tue, Mar 12, 2013 at 6:49 AM, Jane Wayne jane.wayne2...@gmail.com wrote:

 hi,

 i need to know how to resolve conflicts with jar dependencies.

 * first, my job requires Jackson JSON-processor v1.9.11.
 * second, the hadoop cluster has Jackson JSON-processor v1.5.2. the
 jars are installed in $HADOOP_HOME/lib.

 according to this link,

 http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
 ,
 there are 3 ways to include 3rd party libraries in a map/reduce (mr)
 job.
 * use the -libjars flag
 * include the dependent libraries in the executing jar file's /lib
 directory
 * put the jars in the $HADOOP_HOME/lib directory

 i can report that using -libjars and including the libraries in my
 jar's /lib directory do not work (in my case of jar conflicts). i
 still get a NoSuchMethodException. the only way to get my job to run
 is the last option, placing the newer jars in $HADOOP_HOME/lib. the
 last option is fine on a sandbox or development instance, but there
 are some political difficulties (not only technical) in modifying our
 production environment.

 my questions/concerns are:
 1. how come the -libjars and /lib options do not work? how does class
 loading work in mr tasks?
 2. is there another option available that i am not aware of to try and
 get dependent jars by the job to overwrite what's in
 $HADOOP_HOME/lib at runtime of the mr tasks?


 any help is appreciated. thank you all.



Re: Introducing Parquet: efficient columnar storage for Hadoop.

2013-03-12 Thread Luke Lu
IMO, it would be enlightening for Hadoop users to compare Parquet with Trevni
and ORCFile, all of which are relatively new columnar formats for Hadoop. Do
we really need three columnar formats?


On Tue, Mar 12, 2013 at 8:45 AM, Dmitriy Ryaboy dvrya...@gmail.com wrote:

 Fellow Hadoopers,

 We'd like to introduce a joint project between Twitter and Cloudera
 engineers -- a new columnar storage format for Hadoop called Parquet (
 http://parquet.github.com).

 We created Parquet to make the advantages of compressed, efficient columnar
 data representation available to any project in the Hadoop ecosystem,
 regardless of the choice of data processing framework, data model, or
 programming language.

 Parquet is built from the ground up with complex nested data structures in
 mind. We adopted the repetition/definition level approach to encoding such
 data structures, as described in Google's Dremel paper; we have found this
 to be a very efficient method of encoding data in non-trivial object
 schemas.

 Parquet is built to support very efficient compression and encoding
 schemes. Parquet allows compression schemes to be specified on a per-column
 level, and is future-proofed to allow adding more encodings as they are
 invented and implemented. We separate the concepts of encoding and
 compression, allowing parquet consumers to implement operators that work
 directly on encoded data without paying decompression and decoding penalty
 when possible.

 Parquet is built to be used by anyone. The Hadoop ecosystem is rich with
 data processing frameworks, and we are not interested in playing favorites.
 We believe that an efficient, well-implemented columnar storage substrate
 should be useful to all frameworks without the cost of extensive and
 difficult to set up dependencies.

 The initial code, available at https://github.com/Parquet, defines the
 file
 format, provides Java building blocks for processing columnar data, and
 implements Hadoop Input/Output Formats, Pig Storers/Loaders, and an example
 of a complex integration -- Input/Output formats that can convert
 Parquet-stored data directly to and from Thrift objects.

 A preview version of Parquet support will be available in Cloudera's Impala
 0.7.

 Twitter is starting to convert some of its major data sources to Parquet in
 order to take advantage of the compression and deserialization savings.

 Parquet is currently under heavy development. Parquet's near-term roadmap
 includes:
 * Hive SerDes (Criteo)
 * Cascading Taps (Criteo)
 * Support for dictionary encoding, zigzag encoding, and RLE encoding of
 data (Cloudera and Twitter)
 * Further improvements to Pig support (Twitter)

 Company names in parentheses indicate whose engineers signed up to do the
 work -- others can feel free to jump in too, of course.

 We've also heard requests to provide an Avro container layer, similar to
 what we do with Thrift. Seeking volunteers!

 We welcome all feedback, patches, and ideas; to foster community
 development, we plan to contribute Parquet to the Apache Incubator when the
 development is farther along.

 Regards,
 Nong Li, Julien Le Dem, Marcel Kornacker, Todd Lipcon, Dmitriy Ryaboy,
 Jonathan Coveney, and friends.



Re: Hadoop cluster hangs on big hive job

2013-03-11 Thread Luke Lu
You mean HDFS-4479?

The log seems to indicate the infamous jetty hang issue (MAPREDUCE-2386)
though.


On Mon, Mar 11, 2013 at 1:52 PM, Suresh Srinivas sur...@hortonworks.com wrote:

 I have seen one such problem related to big hive jobs that open a lot of
 files. See HDFS-4496 for more details. Snippet from the description:
 The following issue was observed in a cluster that was running a Hive job
 and was writing to 100,000 temporary files (each task is writing to 1000s
 of files). When this job is killed, a large number of files are left open
 for write. Eventually when the lease for open files expires, lease recovery
 is started for all these files in a very short duration of time. This
 causes a large number of commitBlockSynchronization where logSync is
 performed with the FSNamesystem lock held. This overloads the namenode
 resulting in slowdown.

 Could this be the cause? Can you see namenode log to see if you have lease
 recovery activity? If not, can you send some information about what is
 happening in the namenode logs at the time of this slowdown?



 On Mon, Mar 11, 2013 at 1:32 PM, Daning Wang dan...@netseer.com wrote:

 [hive@mr3-033 ~]$ hadoop version
 Hadoop 1.0.4
 Subversion
 https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r
 1393290
 Compiled by hortonfo on Wed Oct  3 05:13:58 UTC 2012


 On Sun, Mar 10, 2013 at 8:16 AM, Suresh Srinivas 
 sur...@hortonworks.com wrote:

 What is the version of hadoop?

 Sent from phone

 On Mar 7, 2013, at 11:53 AM, Daning Wang dan...@netseer.com wrote:

 We have a hive query processing zipped csv files. The query was scanning
 10 days of data (partitioned by date), around 130G per day. The
 problem is not consistent: if you run it again, it might go through.
 But the problem has never happened on smaller jobs (like processing only
 one day's data).

 We don't have a space issue.

 I have attached the log file from when the problem was happening. It is
 stuck like the following (just search for 19706 of 49964):

 2013-03-05 15:13:51,587 INFO org.apache.hadoop.mapred.TaskTracker:
 attempt_201302270947_0010_r_19_0 0.131468% reduce  copy (19706 of
 49964 at 0.00 MB/s) 
 2013-03-05 15:13:51,811 INFO org.apache.hadoop.mapred.TaskTracker:
 attempt_201302270947_0010_r_39_0 0.131468% reduce  copy (19706 of
 49964 at 0.00 MB/s) 
 2013-03-05 15:13:52,551 INFO org.apache.hadoop.mapred.TaskTracker:
 attempt_201302270947_0010_r_32_0 0.131468% reduce  copy (19706 of
 49964 at 0.00 MB/s) 
 2013-03-05 15:13:52,760 INFO org.apache.hadoop.mapred.TaskTracker:
 attempt_201302270947_0010_r_00_0 0.131468% reduce  copy (19706 of
 49964 at 0.00 MB/s) 
 2013-03-05 15:13:52,946 INFO org.apache.hadoop.mapred.TaskTracker:
 attempt_201302270947_0010_r_24_0 0.131468% reduce  copy (19706 of
 49964 at 0.00 MB/s) 
 2013-03-05 15:13:54,742 INFO org.apache.hadoop.mapred.TaskTracker:
 attempt_201302270947_0010_r_08_0 0.131468% reduce  copy (19706 of
 49964 at 0.00 MB/s) 

 Thanks,

 Daning


 On Thu, Mar 7, 2013 at 12:21 AM, Håvard Wahl Kongsgård 
 haavard.kongsga...@gmail.com wrote:

 hadoop logs?
 On 6. mars 2013 21:04, Daning Wang dan...@netseer.com wrote:

 We have a 5-node cluster (Hadoop 1.0.4). It hung a couple of times while
 running big jobs. Basically all the nodes are dead; from the
 tasktracker's log it looks like it went into some kind of loop forever.

 All the log entries like this when problem happened.

 Any idea how to debug the issue?

 Thanks in advance.


 2013-03-05 15:13:19,526 INFO org.apache.hadoop.mapred.TaskTracker:
 attempt_201302270947_0010_r_12_0 0.131468% reduce  copy (19706 of
 49964 at 0.00 MB/s) 
 2013-03-05 15:13:19,552 INFO org.apache.hadoop.mapred.TaskTracker:
 attempt_201302270947_0010_r_28_0 0.131468% reduce  copy (19706 of
 49964 at 0.00 MB/s) 
 2013-03-05 15:13:20,858 INFO org.apache.hadoop.mapred.TaskTracker:
 attempt_201302270947_0010_r_36_0 0.131468% reduce  copy (19706 of
 49964 at 0.00 MB/s) 
 2013-03-05 15:13:21,141 INFO org.apache.hadoop.mapred.TaskTracker:
 attempt_201302270947_0010_r_16_0 0.131468% reduce  copy (19706 of
 49964 at 0.00 MB/s) 
 2013-03-05 15:13:21,486 INFO org.apache.hadoop.mapred.TaskTracker:
 attempt_201302270947_0010_r_19_0 0.131468% reduce  copy (19706 of
 49964 at 0.00 MB/s) 
 2013-03-05 15:13:21,692 INFO org.apache.hadoop.mapred.TaskTracker:
 attempt_201302270947_0010_r_39_0 0.131468% reduce  copy (19706 of
 49964 at 0.00 MB/s) 
 2013-03-05 15:13:22,448 INFO org.apache.hadoop.mapred.TaskTracker:
 attempt_201302270947_0010_r_32_0 0.131468% reduce  copy (19706 of
 49964 at 0.00 MB/s) 
 2013-03-05 15:13:22,643 INFO org.apache.hadoop.mapred.TaskTracker:
 attempt_201302270947_0010_r_00_0 0.131468% reduce  copy (19706 of
 49964 at 0.00 MB/s) 
 2013-03-05 15:13:22,840 INFO org.apache.hadoop.mapred.TaskTracker:
 attempt_201302270947_0010_r_24_0 0.131468% reduce  copy (19706 of
 49964 at 0.00 MB/s) 
 2013-03-05 15:13:24,628 INFO org.apache.hadoop.mapred.TaskTracker:
 

Re: issue with permissions of mapred.system.dir

2012-10-14 Thread Luke Lu
It's a known issue for the fair scheduler in Hadoop 1.x; see
MAPREDUCE-4398. A workaround is to submit 4 or more jobs as the user
running the jobtracker, and everything will work fine afterwards. BTW, IBM
BigInsights community version (open source) has contained the right fix
(properly initializing the job init threads) since BigInsights 1.3.0.1.
Unfortunately IBM devs are too busy to port/submit the patches to
Apache right now :)
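
A hedged sketch of that workaround: prime the job-init threads by submitting
a few trivial jobs as the jobtracker user (the mapred user name and the
examples jar path are assumptions to adjust for your install):

for i in 1 2 3 4; do
  sudo -u mapred hadoop jar /usr/share/hadoop/hadoop-examples*.jar pi 1 1
done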

__Luke

On Wed, Oct 10, 2012 at 9:32 AM, Goldstone, Robin J.
goldsto...@llnl.gov wrote:
 There is no /hadoop1 directory.  It is //hadoop1 which is the name of the
 server running the name node daemon:

 <value>hdfs://hadoop1/mapred</value>

 Per offline conversations with Arpit, it appears this problem is related to
 the fact that I am using the fair scheduler.  The fair scheduler is designed
 to run map reduce jobs as the user, rather than under the mapred username.
 Apparently there are some issues with this scheduler related to permissions
 on certain directories not allowing other users to execute/write in places
 that are necessary for the job to run.  I haven't yet tried Arpit's
 suggestion to switch to the task scheduler but I imagine it will resolve my
 issue, at least for now.  Ultimately I do want to use the fair scheduler, as
 multi-tenancy is a key requirement for our Hadoop deployment.

 From: Manu S manupk...@gmail.com
 Reply-To: user@hadoop.apache.org user@hadoop.apache.org
 Date: Wednesday, October 10, 2012 3:34 AM
 To: user@hadoop.apache.org user@hadoop.apache.org
 Subject: Re: issue with permissions of mapred.system.dir

 What is the permission for the /hadoop1 dir in HDFS? Does the mapred user
 have permission on the same directory?

 Thanks,
 Manu S

 On Wed, Oct 10, 2012 at 5:52 AM, Arpit Gupta ar...@hortonworks.com wrote:

 What is your mapreduce.jobtracker.staging.root.dir set to? This is a
 directory that needs to be writable by the user and is recommended to be
 set to /user so it writes to the appropriate user's home directory.

 --
 Arpit Gupta
 Hortonworks Inc.
 http://hortonworks.com/

 On Oct 9, 2012, at 4:44 PM, Goldstone, Robin J. goldsto...@llnl.gov
 wrote:

 I am bringing up a Hadoop cluster for the first time (but am an
 experienced sysadmin with lots of cluster experience) and running into an
 issue with permissions on mapred.system.dir.   It has generally been a chore
 to figure out all the various directories that need to be created to get
 Hadoop working, some on the local FS, others within HDFS, getting the right
 ownership and permissions, etc..  I think I am mostly there but can't seem
 to get past my current issue with mapred.system.dir.

 Some general info first:
 OS: RHEL6
 Hadoop version: hadoop-1.0.3-1.x86_64

 20 node cluster configured as follows
 1 node as primary namenode
 1 node as secondary namenode + job tracker
 18 nodes as datanode + tasktracker

 I have HDFS up and running and have the following in mapred-site.xml:
 <property>
   <name>mapred.system.dir</name>
   <value>hdfs://hadoop1/mapred</value>
   <description>Shared data for JT - this must be in HDFS</description>
 </property>

 I have created this directory in HDFS, owner mapred:hadoop, permissions
 700 which seems to be the most common recommendation amongst multiple, often
 conflicting articles about how to set up Hadoop.  Here is the top level of
 my filesystem:
 hyperion-hdp4@hdfs:hadoop fs -ls /
 Found 3 items
 drwx--   - mapred hadoop  0 2012-10-09 12:58 /mapred
 drwxrwxrwx   - hdfs   hadoop  0 2012-10-09 13:00 /tmp
 drwxr-xr-x   - hdfs   hadoop  0 2012-10-09 12:51 /user

 Note, it doesn't seem to really matter what permissions I set on /mapred
 since when the Jobtracker starts up it changes them to 700.

 However, when I try to run the hadoop example teragen program as a
 regular user I am getting this error:
 hyperion-hdp4@robing:hadoop jar /usr/share/hadoop/hadoop-examples*.jar
 teragen -D dfs.block.size=536870912 100 /user/robing/terasort-input
 Generating 100 using 2 maps with step of 50
 12/10/09 16:27:02 INFO mapred.JobClient: Running job:
 job_201210072045_0003
 12/10/09 16:27:03 INFO mapred.JobClient:  map 0% reduce 0%
 12/10/09 16:27:03 INFO mapred.JobClient: Job complete:
 job_201210072045_0003
 12/10/09 16:27:03 INFO mapred.JobClient: Counters: 0
 12/10/09 16:27:03 INFO mapred.JobClient: Job Failed: Job initialization
 failed:
 org.apache.hadoop.security.AccessControlException:
 org.apache.hadoop.security.AccessControlException: Permission denied:
 user=robing, access=EXECUTE, inode=mapred:mapred:hadoop:rwx--
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
 at
 org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
 at
 

Re: Writing click stream data to hadoop

2012-05-30 Thread Luke Lu
SequenceFile.Writer#syncFs is in Hadoop 1.0.0 (actually since 0.20.205);
it calls the underlying FSDataOutputStream#sync, which is actually hflush
semantically (data is not durable in case of a data-center-wide power
outage). An hsync implementation is not yet in 2.0; HDFS-744 just brought
hsync into trunk.
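
A minimal sketch of the approach discussed in this thread, against the Hadoop
1.x API (the path, key/value types, and flush policy are assumptions): append
click records to a SequenceFile and call syncFs() periodically so that
already-written records survive a client crash.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ClickStreamWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/clicks/part-00001.seq"),
        LongWritable.class, Text.class);
    long count = 0;
    for (String click : new String[] {"click-a", "click-b", "click-c"}) {
      writer.append(new LongWritable(count++), new Text(click));
      // In practice, call this every N records or seconds rather than per record:
      writer.syncFs();  // hflush-like: push buffered data to the datanodes
    }
    writer.close();
  }
}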

__Luke

On Fri, May 25, 2012 at 9:30 AM, Harsh J ha...@cloudera.com wrote:
 Mohit,

 Not if you call sync (or hflush/hsync in 2.0) periodically to persist
 your changes to the file. SequenceFile doesn't currently have a
 sync-API inbuilt in it (in 1.0 at least), but you can call sync on the
 underlying output stream instead at the moment. This is possible to do
 in 1.0 (just own the output stream).

 Your use case also sounds like you may want to simply use Apache Flume
 (Incubating) [http://incubator.apache.org/flume/] that already does
 provide these features and the WAL-kinda reliability you seek.

 On Fri, May 25, 2012 at 8:24 PM, Mohit Anchlia mohitanch...@gmail.com wrote:
 We get click data through API calls. I now need to send this data to our
 hadoop environment. I am wondering if I could open one sequence file and
 write to it until it's of certain size. Once it's over the specified size I
 can close that file and open a new one. Is this a good approach?

 Only thing I worry about is what happens if the server crashes before I am
 able to cleanly close the file. Would I lose all previous data?



 --
 Harsh J


Re: JMX Monitoring of DataNode?

2011-10-25 Thread Luke Lu
JMX metrics are available via a web service in JSON format at node:port/jmx
in 0.20.204+ and 0.23+.
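
A hedged example of fetching them (host names are placeholders; 50075 and
50070 are the usual 1.x DataNode and NameNode web ports, and the qry
parameter narrows the output to a single MBean):

curl http://datanode-host:50075/jmx
curl 'http://namenode-host:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo'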

On Mon, Oct 24, 2011 at 5:29 PM, Time Less timelessn...@gmail.com wrote:
 I setup JMX monitoring of the NameNode, and it worked fine. Tried to do the
 same for the DataNode, and it fails.

 The datanode listens on the port I specify, but VisualVM can't connect (I'm
 specifying no user/password/SSL).

 For troubleshooting purposes, I'm totally willing to try different tools
 than VisualVM, so if anyone knows some tool that works particularly well,
 I'd love to hear it.

 --
 Tim Ellis
 Data Architect, Riot Games




Re: How do I add Hadoop dependency to a Maven project?

2011-08-12 Thread Luke Lu
Pre-0.21 (sustaining releases, large-scale tested) hadoop:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>0.20.203.0</version>
</dependency>

Pre-0.23 (small scale tested) hadoop:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapred</artifactId>
  <version>...</version>
</dependency>

Trunk (currently targeting 0.23.0, large-scale tested) hadoop WILL be:
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapreduce</artifactId>
  <version>...</version>
</dependency>

On Fri, Aug 12, 2011 at 2:20 PM, W.P. McNeill bill...@gmail.com wrote:
 I'm building a Hadoop project using Maven. I want to add
 Maven dependencies to my project. What do I do?

 I think the answer is I add a <dependency></dependency> section to my POM
 file, but I'm not sure what the contents of this section (groupId,
 artifactId etc.) should be. Googling does not turn up a clear answer. Is
 there a canonical Hadoop Maven dependency specification?



Re: How do I add Hadoop dependency to a Maven project?

2011-08-12 Thread Luke Lu
There is a reason I capitalized WILL (SHALL) :) The current trunk
mapreduce code is in flux. Once MR2 (MAPREDUCE-279) is merged into
trunk (soon!), we'll be producing hadoop-mapreduce-0.23.0-SNAPSHOT,
which depends on hadoop-hdfs-0.23.0-SNAPSHOT, which depends on
hadoop-common-0.23.0-SNAPSHOT.

If you just want to play with the new API, you can use the
0.22.0-SNAPSHOT artifacts. 0.23.0 is supposedly source compatible with
previous hadoop versions including 0.20.x (for the legacy API).
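
A hedged sketch of what pulling the 0.22.0-SNAPSHOT artifact mentioned above
might look like in a POM (the snapshot repository URL is an assumption to
verify against the Apache Nexus instance):

<!-- in the <repositories> section of the POM -->
<repository>
  <id>apache.snapshots</id>
  <url>https://repository.apache.org/content/repositories/snapshots/</url>
</repository>

<!-- in the <dependencies> section -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-mapred</artifactId>
  <version>0.22.0-SNAPSHOT</version>
</dependency>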

On Fri, Aug 12, 2011 at 4:08 PM, W.P. McNeill bill...@gmail.com wrote:
 I want the latest version of Hadoop (with the new API). I guess that's the
 trunk version, but I don't see the hadoop-mapreduce artifact listed on
 https://repository.apache.org/index.html#nexus-search;quick~hadoop

 On Fri, Aug 12, 2011 at 2:47 PM, Luke Lu l...@apache.org wrote:

 Pre-0.21 (sustaining releases, large-scale tested) hadoop:
 <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-core</artifactId>
   <version>0.20.203.0</version>
 </dependency>

 Pre-0.23 (small scale tested) hadoop:
 <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-mapred</artifactId>
   <version>...</version>
 </dependency>

 Trunk (currently targeting 0.23.0, large-scale tested) hadoop WILL be:
 <dependency>
   <groupId>org.apache.hadoop</groupId>
   <artifactId>hadoop-mapreduce</artifactId>
   <version>...</version>
 </dependency>

 On Fri, Aug 12, 2011 at 2:20 PM, W.P. McNeill bill...@gmail.com wrote:
  I'm building a Hadoop project using Maven. I want to add
  Maven dependencies to my project. What do I do?
 
  I think the answer is I add a <dependency></dependency> section to my
 POM
  file, but I'm not sure what the contents of this section (groupId,
  artifactId etc.) should be. Googling does not turn up a clear answer. Is
  there a canonical Hadoop Maven dependency specification?
 




Re: one question about hadoop

2011-05-26 Thread Luke Lu
Hadoop embeds Jetty directly into the Hadoop server daemons with the
org.apache.hadoop.http.HttpServer class for servlets. For JSPs, web.xml
is auto-generated with the Jasper compiler during the build phase. The
new web framework for mapreduce 2.0 (MAPREDUCE-2399) wraps the hadoop
HttpServer and doesn't need web.xml or JSP support either.

On Thu, May 26, 2011 at 12:14 AM, 王晓峰 sanlang2...@gmail.com wrote:
 hi, admin:

    I'm a newcomer from China.
    I want to know how Jetty is combined with hadoop.
    I can't find the file named web.xml that usually exists in systems
 that use Jetty.
    I'll be very happy to receive your answer.
    If you have any questions, please feel free to contact me.

 Best Regards,

 Jack



Re: Configuring jvm metrics in hadoop-0.20.203.0

2011-05-21 Thread Luke Lu
On Fri, May 20, 2011 at 9:02 AM, Matyas Markovics
markovics.mat...@gmail.com wrote:

 Hi,
 I am trying to get jvm metrics from the new verison of hadoop.
 I have read the migration instructions and come up with the following
 content for hadoop-metrics2.properties:

 *.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
 jvm.sink.file.period=2
 jvm.sink.file.filename=/home/ec2-user/jvmmetrics.log

The (documented) syntax is
[lowercased-service].sink.[sink-name].[option], so for the jobtracker it
would be jobtracker.sink.file...

This will get all metrics from all contexts (unlike metrics1, where
you're required to configure each context). If you want to restrict
the sink to only jvm metrics, do this:

jobtracker.sink.jvmfile.class=${*.sink.file.class}
jobtracker.sink.jvmfile.context=jvm
jobtracker.sink.jvmfile.filename=/path/to/namenode-jvm-metrics.out

 Any help would be appreciated even if you have a different approach to
 get memory usage from reducers.

reducetask.sink.file.filename=/path/to/reducetask-metrics.out
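
Putting the pieces in this thread together, a minimal hadoop-metrics2.properties
for this setup might look like the following sketch (file paths are
placeholders):

*.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
# all jobtracker metrics
jobtracker.sink.file.period=2
jobtracker.sink.file.filename=/path/to/jobtracker-metrics.out
# a jvm-context-only sink instance for the jobtracker
jobtracker.sink.jvmfile.class=${*.sink.file.class}
jobtracker.sink.jvmfile.context=jvm
jobtracker.sink.jvmfile.filename=/path/to/jobtracker-jvm-metrics.out
# jvm/memory metrics from reduce tasks
reducetask.sink.file.period=2
reducetask.sink.file.filename=/path/to/reducetask-metrics.out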

__Luke


Re: metrics2 ganglia monitoring

2011-05-18 Thread Luke Lu
The Ganglia plugin is not yet ported to metrics v2 (because Y! doesn't use
Ganglia; see also the discussion links on HADOOP-6728). It shouldn't
be hard to do a port though, as the new sink interface is actually
simpler.
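
To give a sense of that sink interface, here is a hedged sketch of a trivial
custom sink, written against the 0.20.203-era metrics2 interfaces (class and
method names should be verified against the release in use; this is not an
actual Ganglia port):

import org.apache.commons.configuration.SubsetConfiguration;
import org.apache.hadoop.metrics2.Metric;
import org.apache.hadoop.metrics2.MetricsRecord;
import org.apache.hadoop.metrics2.MetricsSink;

public class LoggingSink implements MetricsSink {
  @Override
  public void init(SubsetConfiguration conf) {
    // read sink options from hadoop-metrics2.properties here, e.g. a servers list
  }

  @Override
  public void putMetrics(MetricsRecord record) {
    // called periodically with a snapshot of each metrics record
    for (Metric metric : record.metrics()) {
      System.out.println(record.context() + "." + record.name() + "."
          + metric.name() + "=" + metric.value());
    }
  }

  @Override
  public void flush() {}
}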

On Wed, May 18, 2011 at 4:07 AM, Eric Berkowitz
eberkow...@roosevelt.edu wrote:
 We have a 2 rack hadoop cluster with ganglia 3.0 monitoring on all stations 
 both on the native os and within hadoop.

 We want to upgrade to hadoop 0.20.203, but with the migration to metrics2
 we need help configuring the metrics to continue ganglia monitoring.

 All tasktrackers/datanodes push unicast udp upstream to a central gmond 
 daemon on their rack that is then polled by a single gmetad daemon for the 
 cluster.

 The current metrics files includes entries similar to the following for all 
 contexts:

 Configuration of the dfs context for ganglia
 dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
 dfs.period=25
 dfs.servers=xsrv1.cs.roosevelt.edu:8670

 We have reviewed the package documentation for metrics2 but the examples and 
 explanation are not helpful.  Any assistance in the proper configuration of
 hadoop-metrics2.properties file to support our current ganglia configuration 
 would be appreciated.

 Eric


Re: Memory mapped resources

2011-04-12 Thread Luke Lu
You can use the distributed cache for memory-mapped files (they're local
to the nodes the tasks run on).

http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata
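
A minimal sketch of that approach against the Hadoop 1.x API (paths and class
names are placeholders): ship the model file via the distributed cache, then
memory-map the node-local copy in each task so the OS keeps a single physical
copy shared across task JVMs.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class ModelLoader {
  // In the job driver: register the HDFS file for distribution.
  public static void addModel(JobConf job) throws Exception {
    DistributedCache.addCacheFile(new Path("/models/model.bin").toUri(), job);
  }

  // In Mapper.configure(JobConf): mmap the local copy on the task node.
  public static MappedByteBuffer mapModel(JobConf job) throws Exception {
    Path[] local = DistributedCache.getLocalCacheFiles(job);
    RandomAccessFile raf = new RandomAccessFile(local[0].toString(), "r");
    return raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, raf.length());
  }
}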

On Tue, Apr 12, 2011 at 10:40 AM, Benson Margulies
bimargul...@gmail.com wrote:
 Here's the OP again.

 I want to make it clear that my question here has to do with the
 problem of distributing 'the program' around the cluster, not 'the
 data'. In the case at hand, the issue a system that has a large data
 resource that it needs to do its work. Every instance of the code
 needs the entire model. Not just some blocks or pieces.

 Memory mapping is a very attractive tactic for this kind of data
 resource. The data is read-only. Memory-mapping it allows the
 operating system to ensure that only one copy of the thing ends up in
 physical memory.

 If we force the model into a conventional file (storable in HDFS) and
 read it into the JVM in a conventional way, then we get as many copies
 in memory as we have JVMs.  On a big machine with a lot of cores, this
 begins to add up.

 For people who are running a cluster of relatively conventional
 systems, just putting copies on all the nodes in a conventional place
 is adequate.



Re: location for mapreduce next generation branch

2011-03-28 Thread Luke Lu
MR-279 is the branch you're looking for.

On Mon, Mar 28, 2011 at 3:59 PM, Stephen Boesch java...@gmail.com wrote:

 Looking under http://svn.apache.org/repos/asf/hadoop/mapreduce/branches/  it
 does not seem to be present.

 pointers to correct location appreciated.



Re: monitor the hadoop cluster

2010-11-11 Thread Luke Lu
The job detail page on the jobtracker shows a lot of information
about any given job: the start/finish times of each task and various
counters (like time spent in various phases, the input/output
bytes/records, etc.).

For monitoring the aggregate performance of a cluster, the hadoop
metrics system can send a lot of information to standard monitoring
tools (ganglia etc.) to graph and monitor various aggregate metrics
(running/waiting maps/reduces etc.).

__Luke

On Thu, Nov 11, 2010 at 12:23 PM, Da Zheng zhengda1...@gmail.com wrote:
 Hello,

 On 11/11/2010 03:00 PM, David Rosenstrauch wrote:

 On 11/11/2010 02:52 PM, Da Zheng wrote:

 Hello,

 I wrote a MapReduce program and ran it on a 3-node hadoop cluster, but
 its running time varies a lot, from 2 minutes to 3 minutes. I want to
 understand how time is used by the map phase and the reduce phase, and
 hope to find the place to improve the performance.

 Also the current input data is sorted, so I wrote a customized
 partitioner to reduce the data shuffling across the network. I need some
 means to help me observe the data movement.

 I know hadoop community developed chukwa for monitoring, but it seems
 very immature right now. I wonder how people monitor hadoop cluster
 right now. Is there a good way to solve my problems listed above?

 Thanks,
 Da

 Just my $0.02, but IMO you're working on some faulty assumptions here.
 Hadoop is explicitly *not* a real-time system, and so it's not reasonable
 for you to expect to have such fine-grained control over its processing
 speed.  It's a distributed system, where many things can affect how long a
 job takes, such as:  how many nodes in the cluster, how many other jobs are
 running, the technical specs of each node, whether/how Hadoop implements
 speculative execution during your job, whether your job has any task
 failures/retries, whether you have any hardware failures during your job,
 ..

 You can have control over performance on a Hadoop cluster, via things like
 adding nodes, tweaking some config parms, etc.  But you're much more likely
 to be able to make performance improvements like cutting a job down from 3
 hours to 2 hours, not from 3 minutes to 2 minutes. You're just not going to
 get that kind of fine-grained control with Hadoop.  Nor should you be
 looking for it, IMO.  If that's what you want, then Hadoop is probably the
 wrong tool for your job.

 I don't really try to cut the time from 3 minutes to 2 minutes. I was asking
 whether I can have some tools to monitor the hadoop cluster and possibly
 find the spot for performance improvement. I'm very new to hadoop, and I
 hope to have a good view how time is used by each mapper and reducer, so
 I'll have more confidence to run it on a much larger dataset.

 More importantly, I want to see how much data shuffling can be saved if I
 use the customized partitioner.

 Best,
 Da



Re: is there any way to set the niceness of map/reduce jobs

2010-11-02 Thread Luke Lu
nice(1) only changes cpu scheduling priority. It doesn't really help
if you have tasks (and their child processes) that use too much
memory, which causes swapping, which is probably the real culprit
causing servers to freeze. Decreasing kernel swappiness probably helps.
Another thing to try is ionice (on Linux, if you have a reasonably recent
kernel with cfq as the io scheduler, the default for rhel5) if the freeze
is caused by io contention (assuming no swapping).

You can write a simple script to periodically renice(1) and ionice(1)
these processes to see if they actually work for you.
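
A hedged sketch of such a script (the process pattern matches the 1.x task
child JVM main class; the priority values and the sleep interval are
assumptions to tune):

while true; do
  for pid in $(pgrep -f 'org.apache.hadoop.mapred.Child'); do
    renice -n 10 -p "$pid" > /dev/null
    ionice -c2 -n7 -p "$pid"
  done
  sleep 60
done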

On Tue, Nov 2, 2010 at 4:51 PM, Jinsong Hu jinsong...@hotmail.com wrote:
 Hi there:
  I have a cluster that is used for both hadoop mapreduce and hbase. What I
 found is that when I am running map/reduce jobs, the jobs can be very
 memory/cpu intensive and cause hbase or data nodes to freeze. In hbase's
 case, the region server may shut itself down.
  In order to avoid this, I made a very conservative configuration of the
 maximum number of mappers and reducers. However, I wonder if hadoop
 allows me to start map/reduce with the nice command so that
 those jobs get lower priority than datanode/tasktracker/hbase regionserver.
 That way, if there is enough resource, the jobs can fully utilize it, but
 if not, those jobs will yield to other processes.

 Jimmy.



Re: load a serialized object in hadoop

2010-10-13 Thread Luke Lu
On Wed, Oct 13, 2010 at 12:27 PM, Shi Yu sh...@uchicago.edu wrote:
 I haven't implemented anything in map/reduce yet for this issue. I just tried
 to invoke the same java class using the bin/hadoop command. The thing is that
 a very simple program can be executed in Java, but is not doable with the
 bin/hadoop command.

If you are just trying to use bin/hadoop jar your.jar command, your
code runs in a local client jvm and mapred.child.java.opts has no
effect. You should run it with HADOOP_CLIENT_OPTS=-Xmx1000m bin/hadoop
jar your.jar

 I think if I couldn't get through the first stage, then even if I had a
 map/reduce program it would also fail. I am using Hadoop 0.19.2. Thanks.

 Best Regards,

 Shi

 On 2010-10-13 14:15, Luke Lu wrote:

 Can you post your mapper/reducer implementation? or are you using
 hadoop streaming? for which mapred.child.java.opts doesn't apply to
 the jvm you care about. BTW, what's the hadoop version you're using?

 On Wed, Oct 13, 2010 at 11:45 AM, Shi Yush...@uchicago.edu  wrote:


 Here is my code. There is no Map/Reduce in it. I could run this code
 using java -Xmx1000m; however, when using bin/hadoop -D
 mapred.child.java.opts=-Xmx3000M it gets a heap-space-not-enough error. I
 have tried other programs in Hadoop with the same settings, so the memory
 is available on my machines.


 public static void main(String[] args) {
   try {
     String myFile = "xxx.dat";
     FileInputStream fin = new FileInputStream(myFile);
     ois = new ObjectInputStream(fin);
     margintagMap = ois.readObject();
     ois.close();
     fin.close();
   } catch (Exception e) {
     //
   }
 }

 On 2010-10-13 13:30, Luke Lu wrote:

  On Wed, Oct 13, 2010 at 8:04 AM, Shi Yu sh...@uchicago.edu wrote:

   As a follow-up to my own question, I think invoking the JVM in Hadoop
   requires much more memory than an ordinary JVM.

  That's simply not true. The default mapreduce task Xmx is 200M, which
  is much smaller than the standard jvm default 512M, and most users
  don't need to increase it. Please post the code reading the object (in
  hdfs?) in your tasks.

   I found that instead of serializing the object, maybe I could create a
   MapFile as an index to permit lookups by key in Hadoop. I have also
   compared the performance of MongoDB and Memcache. I will let you know
   the result after I try the MapFile approach.

   Shi

   On 2010-10-12 21:59, M. C. Srivas wrote:

    On Tue, Oct 12, 2010 at 4:50 AM, Shi Yu sh...@uchicago.edu wrote:

     Hi,

     I want to load a serialized HashMap object in hadoop. The file of the
     stored object is 200M. I could read that object efficiently in Java by
     setting -Xmx as 1000M. However, in hadoop I could never load it into
     memory. The code is very simple (just read the ObjectInputStream) and
     there is yet no map/reduce implemented. I set
     mapred.child.java.opts=-Xmx3000M, but still get
     java.lang.OutOfMemoryError: Java heap space. Could anyone explain a
     little bit how memory is allocated to the JVM in hadoop? Why does
     hadoop take up so much memory? If a program requires 1G memory on a
     single node, how much memory does it require (generally) in Hadoop?

    The JVM reserves swap space in advance, at the time of launching the
    process. If your swap is too low (or you do not have any swap
    configured), you will hit this.

    Or, you are on a 32-bit machine, in which case 3G is not possible in
    the JVM.

    -Srivas.

     Thanks.

     Shi

     --
     Postdoctoral Scholar
     Institute for Genomics and Systems Biology
     Department of Medicine, the University of Chicago
     Knapp Center for Biomedical Discovery
     900 E. 57th St. Room 10148
     Chicago, IL 60637, US
     Tel: 773-702-6799




Re: load a serialized object in hadoop

2010-10-13 Thread Luke Lu
On Wed, Oct 13, 2010 at 2:21 PM, Shi Yu sh...@uchicago.edu wrote:
 Hi,  thanks for the advice. I tried with your settings,
 $ bin/hadoop jar Test.jar OOloadtest -D HADOOP_CLIENT_OPTS=-Xmx4000m
 still no effect. Or is this a system variable? Should I export it? How do I
 configure it?

HADOOP_CLIENT_OPTS is an environment variable, so you should run it as
HADOOP_CLIENT_OPTS=-Xmx1000m bin/hadoop jar Test.jar OOloadtest

if you use sh-derivative shells (bash, ksh, etc.); for other shells, prepend env.

__Luke


 Shi

  java -Xms3G -Xmx3G -classpath
 .:WordCount.jar:hadoop-0.19.2-core.jar:lib/log4j-1.2.15.jar:lib/commons-collections-3.2.1.jar:lib/stanford-postagger-2010-05-26.jar
 OOloadtest


 On 2010-10-13 15:28, Luke Lu wrote:

  On Wed, Oct 13, 2010 at 12:27 PM, Shi Yu sh...@uchicago.edu wrote:

   I haven't implemented anything in map/reduce yet for this issue. I just
   tried to invoke the same java class using the bin/hadoop command. The
   thing is that a very simple program can be executed in Java, but is not
   doable with the bin/hadoop command.

  If you are just trying to use the bin/hadoop jar your.jar command, your
  code runs in a local client jvm and mapred.child.java.opts has no
  effect. You should run it with HADOOP_CLIENT_OPTS=-Xmx1000m bin/hadoop
  jar your.jar

   I think if I couldn't get through the first stage, then even if I had a
   map/reduce program it would also fail. I am using Hadoop 0.19.2. Thanks.

   Best Regards,

   Shi

   On 2010-10-13 14:15, Luke Lu wrote:

    Can you post your mapper/reducer implementation? or are you using
    hadoop streaming? for which mapred.child.java.opts doesn't apply to
    the jvm you care about. BTW, what's the hadoop version you're using?

    On Wed, Oct 13, 2010 at 11:45 AM, Shi Yu sh...@uchicago.edu wrote:

     Here is my code. There is no Map/Reduce in it. I could run this code
     using java -Xmx1000m; however, when using bin/hadoop -D
     mapred.child.java.opts=-Xmx3000M it gets a heap-space-not-enough
     error. I have tried other programs in Hadoop with the same settings,
     so the memory is available on my machines.

     public static void main(String[] args) {
       try {
         String myFile = "xxx.dat";
         FileInputStream fin = new FileInputStream(myFile);
         ois = new ObjectInputStream(fin);
         margintagMap = ois.readObject();
         ois.close();
         fin.close();
       } catch (Exception e) {
         //
       }
     }

     On 2010-10-13 13:30, Luke Lu wrote:

      On Wed, Oct 13, 2010 at 8:04 AM, Shi Yu sh...@uchicago.edu wrote:

       As a follow-up to my own question, I think invoking the JVM in
       Hadoop requires much more memory than an ordinary JVM.

      That's simply not true. The default mapreduce task Xmx is 200M,
      which is much smaller than the standard jvm default 512M, and most
      users don't need to increase it. Please post the code reading the
      object (in hdfs?) in your tasks.

       I found that instead of serializing the object, maybe I could
       create a MapFile as an index to permit lookups by key in Hadoop. I
       have also compared the performance of MongoDB and Memcache. I will
       let you know the result after I try the MapFile approach.

       Shi

       On 2010-10-12 21:59, M. C. Srivas wrote:

        On Tue, Oct 12, 2010 at 4:50 AM, Shi Yu sh...@uchicago.edu wrote:

         Hi,

         I want to load a serialized HashMap object in hadoop. The file
         of the stored object is 200M. I could read that object
         efficiently in Java by setting -Xmx as 1000M. However, in hadoop
         I could never load it into memory. The code is very simple (just
         read the ObjectInputStream) and there is yet no map/reduce
         implemented. I set mapred.child.java.opts=-Xmx3000M, but still
         get java.lang.OutOfMemoryError: Java heap space. Could anyone
         explain a little bit how memory is allocated to the JVM in
         hadoop? Why does hadoop take up so much memory? If a program
         requires 1G memory on a single node, how much memory does it
         require (generally) in Hadoop?

        The JVM reserves swap space in advance, at the time of launching
        the process. If your swap is too low (or you do not have any swap
        configured), you will hit this.

        Or, you are on a 32-bit machine, in which case 3G is not possible
        in the JVM.

        -Srivas.

         Thanks.

         Shi

         --
         Postdoctoral Scholar
         Institute for Genomics and Systems Biology
         Department of Medicine, the University of Chicago
         Knapp Center for Biomedical Discovery
         900 E. 57th St. Room 10148
         Chicago, IL 60637, US
         Tel: 773-702-6799








Re: load a serialized object in hadoop

2010-10-13 Thread Luke Lu
Just took a look at the bin/hadoop of your particular version
(http://svn.apache.org/viewvc/hadoop/common/tags/release-0.19.2/bin/hadoop?revision=796970&view=markup).
It looks like HADOOP_CLIENT_OPTS doesn't work with the jar
command in that version, which is fixed in later versions.

So try HADOOP_OPTS=-Xmx1000M bin/hadoop ... instead. It would work
because it just translates to the same java command line that worked
for you :)

__Luke

On Wed, Oct 13, 2010 at 4:18 PM, Shi Yu sh...@uchicago.edu wrote:
 Hi, I tried the following five ways:

 Approach 1: in command line
 HADOOP_CLIENT_OPTS=-Xmx4000m bin/hadoop jar WordCount.jar OOloadtest


 Approach 2: I added the following element to the hadoop-site.xml file.
 Each time I changed it, I stopped and restarted hadoop on all the nodes.
 ...
 <property>
   <name>HADOOP_CLIENT_OPTS</name>
   <value>-Xmx4000m</value>
 </property>

 run the command
 $bin/hadoop jar WordCount.jar OOloadtest

 Approach 3: I changed it like this:
 ...
 <property>
   <name>HADOOP_CLIENT_OPTS</name>
   <value>4000m</value>
 </property>
 

 Then run the command:
 $bin/hadoop jar WordCount.jar OOloadtest

 Approach 4: To make sure, I changed the m suffix to a plain number, that is:
 ...
 <property>
   <name>HADOOP_CLIENT_OPTS</name>
   <value>40</value>
 </property>
 

 Then run the command:
 $bin/hadoop jar WordCount.jar OOloadtest

 All four of these approaches result in the same Java heap space error.

 java.lang.OutOfMemoryError: Java heap space
        at
 java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:45)
        at java.lang.StringBuilder.<init>(StringBuilder.java:68)
        at
 java.io.ObjectInputStream$BlockDataInputStream.readUTFBody(ObjectInputStream.java:2997)
        at
 java.io.ObjectInputStream$BlockDataInputStream.readUTF(ObjectInputStream.java:2818)
        at java.io.ObjectInputStream.readString(ObjectInputStream.java:1599)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1320)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
        at java.util.HashMap.readObject(HashMap.java:1028)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at
 java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:974)
        at
 java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1846)
        at
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1753)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
        at ObjectManager.loadObject(ObjectManager.java:42)
        at OOloadtest.main(OOloadtest.java:21)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)


 Approach 5:
 In comparison, I called the Java command directly as follows (there is a
 counter showing how much time it costs if the serialized object is
 successfully loaded):

 $java -Xms3G -Xmx3G -classpath
 .:WordCount.jar:hadoop-0.19.2-core.jar:lib/log4j-1.2.15.jar OOloadtest

 return:
 object loaded, timing (hms): 0 hour(s) 1 minute(s) 12 second(s)
 162millisecond(s)


 What was the problem in my command? Where can I find the documentation about
 HADOOP_CLIENT_OPTS? Have you tried the same thing and found it works?

 Shi


 On 2010-10-13 16:28, Luke Lu wrote:

  On Wed, Oct 13, 2010 at 2:21 PM, Shi Yu sh...@uchicago.edu wrote:

   Hi, thanks for the advice. I tried with your settings,
   $ bin/hadoop jar Test.jar OOloadtest -D HADOOP_CLIENT_OPTS=-Xmx4000m
   still no effect. Or is this a system variable? Should I export it? How
   do I configure it?

  HADOOP_CLIENT_OPTS is an environment variable, so you should run it as
  HADOOP_CLIENT_OPTS=-Xmx1000m bin/hadoop jar Test.jar OOloadtest
  if you use sh-derivative shells (bash, ksh, etc.); for other shells,
  prepend env.

  __Luke

   Shi

   java -Xms3G -Xmx3G -classpath
   .:WordCount.jar:hadoop-0.19.2-core.jar:lib/log4j-1.2.15.jar:lib/commons-collections-3.2.1.jar:lib/stanford-postagger-2010-05-26.jar
   OOloadtest

   On 2010-10-13 15:28, Luke Lu wrote:

    On Wed, Oct 13, 2010 at 12:27 PM, Shi Yu sh...@uchicago.edu wrote:

     I haven't implemented anything in map/reduce yet for this issue. I