Event: Meetup in Munich, Thursday, May 24

2012-05-22 Thread alo alt
Folks,

for our people in Germany, Switzerland, and Austria:

Thu, May 24 NoSQL Meetup in Munich, Bavaria, Germany:
http://www.nosqlmunich.de/

eCircle GmbH
Nymphenburger Höfe NY II
Dachauer Str. 63
80335 München

Register here: http://www.doodle.com/7e5a6ecizinaznbu
Entry is free.

Speakers:
Doug (Hypertable), Christian (HBase NRT), and me (Flume NG / Sqoop)

- Alex

--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF



Re: Stream data processing

2012-05-22 Thread Zhiwei Lin
Hi Robert,
Thank you.
How quickly do you have to get the result out once the new data is added?
If possible, I hope to get the result instantly.

How far back in time do you have to look for the earlier event from the occurrence of
the later one?
The time slot is not constant. It depends on the last occurrence of the earlier event
before the later one, so I need to look back through the history to find that last
occurrence in this case.

Do you have to do this for all combinations of values or is it just a small
subset of values?
I think this depends on the time of the last occurrence of the earlier event in the
history. If it rarely occurred, then data from the early stages has to be taken into
account.

Definitely, I think HDFS is a good place to store the data I have (the size
of daily log is above 1GB). But I am not sure if Map/Reduce can help to
handle the stated problem.

Zhiwei


On 21 May 2012 22:07, Robert Evans ev...@yahoo-inc.com wrote:

 Zhiwei,

 How quickly do you have to get the result out once the new data is added?
  How far back in time do you have to look for  from the occurrence of
 ?  Do you have to do this for all combinations of values or is it just
 a small subset of values?

 --Bobby Evans

 On 5/21/12 3:01 PM, Zhiwei Lin zhiwei...@gmail.com wrote:

 I have large volume of stream log data. Each data record contains a time
 stamp, which is very important to the analysis.
 For example, I have data format like this:
 (1) 20:30:21 01/April/2012A.
 (2) 20:30:51 01/April/2012.
 (3) 21:30:21 01/April/2012.

 Moreover, new data comes every few minutes.
 I have to calculate the probability of the occurrence  given the
 occurrence of  (where  occurs earlier than ). So, it is
 really time-dependant.

 I wonder if Hadoop  is the right platform for this job? Is there any
 package available for this kind of work?

 Thank you.

 Zhiwei




-- 

Best wishes.

Zhiwei


Re: Moving blocks from a datanode

2012-05-22 Thread Chris Smith
M,

See http://wiki.apache.org/hadoop/FAQ - 3.6, "I want to make a large
cluster smaller by taking out a bunch of nodes simultaneously. How can this
be done?"

This explains how to decommission nodes by moving the data off the
existing node. It's fairly easy and painless (just add the node name to the
slaves.exclude file and notify DFS), and once the data is off the node you
could swap out the disks and then re-introduce the node back into the
cluster with larger drives (removing the node name from slaves.exclude).
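
For reference, a hedged sketch of the decommissioning steps described above; the
host name and exclude-file path are illustrative, and dfs.hosts.exclude is assumed
to already point at the exclude file:

  echo "datanode5.example.com" >> /etc/hadoop/conf/slaves.exclude
  hadoop dfsadmin -refreshNodes     # tell the NameNode to start decommissioning
  hadoop dfsadmin -report           # wait until the node is reported as Decommissioned
  # swap the disks, remove the entry from slaves.exclude, then:
  hadoop dfsadmin -refreshNodes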

Chris

On 17 May 2012 02:55, Mayuran Yogarajah
mayuran.yogara...@casalemedia.comwrote:

 Our cluster has several nodes which have smaller disks than other nodes and
 as a result fill up quicker.

 I am looking to move data off these nodes and onto the others.



 Here is what I am planning to do:

 1)  On the nodes with smaller disks, set dfs.datanode.du.reserved to a
 larger value

 2)  Restart data nodes

 3)  Run balancer



 Will this have the desired effect?

 If there is a better way to accomplish this please let me know.



 Thanks,

 M
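
For reference, a hedged sketch of steps 1 and 3 from the plan above. The reserved
value is in bytes (the 50 GB figure is only an example) and goes into hdfs-site.xml
on the smaller nodes; the balancer threshold is a percentage and optional:

  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>53687091200</value>
  </property>

  hadoop balancer -threshold 5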




Question about reducers

2012-05-22 Thread Andrés Durán
Hello,

I'm working with Hadoop 1.0.3, configured in pseudo-distributed mode.

I have 128 reducer tasks and the job runs on a local machine with 32 cores. The
job works fine and is fast: it takes 1 hour and 30 minutes to finish. But when the
job starts, the reducers move from the task queue into the running phase very
slowly; it takes 7 minutes to get 32 tasks into the running phase. Why is task
allocation so slow, and is it possible to adjust any variable in the JobTracker
setup to reduce this allocation time?

 Thanks to all!

 Best regards,
Andrés Durán

Re: Question about reducers

2012-05-22 Thread Harsh J
Hi,

This may be because, depending on your scheduler, only one reducer may
be allocated per TT heartbeat. The reasoning behind why this is the case is
explained here: http://search-hadoop.com/m/KYv8JhkOHc1

You may have better results in 1.0.3 using an alternative scheduler
such as the FairScheduler with multiple-assignments-per-heartbeat turned
on (see http://hadoop.apache.org/common/docs/current/fair_scheduler.html
and set the boolean property mapred.fairscheduler.assignmultiple to enable it),
or via the CapacityScheduler (see
http://hadoop.apache.org/common/docs/current/capacity_scheduler.html),
which does this as well out of the box.
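
For reference, a hedged sketch of what that looks like in mapred-site.xml on the
JobTracker; the property names are taken from the 1.x FairScheduler docs linked
above and should be checked against your exact version:

  <property>
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
  </property>
  <property>
    <name>mapred.fairscheduler.assignmultiple</name>
    <value>true</value>
  </property>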

On Tue, May 22, 2012 at 5:36 PM, Andrés Durán du...@tadium.es wrote:
 Hello,

        I'm working with a Hadoop, version is 1.0.3 and configured in 
 pseudo-distributed mode.

        I have 128 reducers tasks and it's running in a local machine with 32 
 cores. The job is working fine and fast it  takes 1 hour and 30 minutes to 
 fininsh. But when the Job starts, the reducers are comming to the running 
 phase from the tasks queue very slow, it takes 7 minutes to allocate 32 tasks 
 in the running phase. Why is too slow to allocate task in running mode? It's 
 possible to adjust any variable in the jobs tracker setup to reduce this 
 allocation time?

  Thanks to all!

  Best regards,
        Andrés Durán



-- 
Harsh J


Re: Question about reducers

2012-05-22 Thread Harsh J
A minor correction: CapacityScheduler doesn't seem to do multi-reducer
assignments (or at least not in 1.x), but does do multi-map
assignments. This is for the same reason as
http://search-hadoop.com/m/KYv8JhkOHc1. FairScheduler in 1.x supports
multi-map and multi-reducer assignments over single heartbeats, which
should work well on your single 32-core machine.

Do give it a try and let us know!

On Tue, May 22, 2012 at 5:51 PM, Harsh J ha...@cloudera.com wrote:
 Hi,

 This may be cause, depending on your scheduler, only one Reducer may
 be allocated per TT heartbeat. A reasoning of why this is the case is
 explained here: http://search-hadoop.com/m/KYv8JhkOHc1

 You may have better results in 1.0.3 using an alternative scheduler
 such as FairScheduler with multiple-assignments-per-heartbeat turned
 on (See http://hadoop.apache.org/common/docs/current/fair_scheduler.html
 and boolean property mapred.fairscheduler.assignmultiple to enable)
 or via CapacityScheduler (See
 http://hadoop.apache.org/common/docs/current/capacity_scheduler.html)
 which does it as well (OOB).

 On Tue, May 22, 2012 at 5:36 PM, Andrés Durán du...@tadium.es wrote:
 Hello,

        I'm working with a Hadoop, version is 1.0.3 and configured in 
 pseudo-distributed mode.

        I have 128 reducers tasks and it's running in a local machine with 32 
 cores. The job is working fine and fast it  takes 1 hour and 30 minutes to 
 fininsh. But when the Job starts, the reducers are comming to the running 
 phase from the tasks queue very slow, it takes 7 minutes to allocate 32 
 tasks in the running phase. Why is too slow to allocate task in running 
 mode? It's possible to adjust any variable in the jobs tracker setup to 
 reduce this allocation time?

  Thanks to all!

  Best regards,
        Andrés Durán



 --
 Harsh J



-- 
Harsh J


Where does Hadoop store its maps?

2012-05-22 Thread Mark Kerzner
Hi,

I am using a Hadoop cluster of my own construction on EC2, and I am running
out of hard drive space with maps. If I knew which directories are used by
Hadoop for map spill, I could use the large ephemeral drive on EC2 machines
for that. Otherwise, I would have to keep increasing my available hard
drive on root, and that's not very smart.

Thank you. The error I get is below.

Sincerely,
Mark



org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
any valid local directory for output/file.out
at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:376)
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
at 
org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1495)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1180)
at 
org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:582)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs
java.io.IOException: Spill failed
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:886)
at 
org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:574)
at 
org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at 
org.frd.main.ZipFileProcessor.emitAsMap(ZipFileProcessor.java:279)
at 
org.frd.main.ZipFileProcessor.processWithTrueZip(ZipFileProcessor.java:107)
at org.frd.main.ZipFileProcessor.process(ZipFileProcessor.java:55)
at org.frd.main.Map.map(Map.java:70)
at org.frd.main.Map.map(Map.java:24)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(User
java.io.IOException: Spill failed
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:886)
at 
org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:574)
at 
org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at 
org.frd.main.ZipFileProcessor.emitAsMap(ZipFileProcessor.java:279)
at 
org.frd.main.ZipFileProcessor.processWithTrueZip(ZipFileProcessor.java:107)
at org.frd.main.ZipFileProcessor.process(ZipFileProcessor.java:55)
at org.frd.main.Map.map(Map.java:70)
at org.frd.main.Map.map(Map.java:24)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(User
org.apache.hadoop.io.SecureIOUtils$AlreadyExistsException: EEXIST: File exists
at 
org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:178)
at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:292)
at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:365)
at org.apache.hadoop.mapred.Child$4.run(Child.java:272)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: EEXIST: File exists
at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method)
at 
org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:172)
... 7 more


Re: Where does Hadoop store its maps?

2012-05-22 Thread Harsh J
Mark,

The property mapred.local.dir is what you are looking for. It takes a
path or a list of comma-separated paths of local FS that maps and
reduces may use as intermediate storage directories.

So you may set mapred.local.dir to
/path/to/ephemeral/disk/mount/mapred/local (or multiple comma
separated points if you have multiple disk mounts).

Hope this helps.
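
For reference, a hedged example of that setting in mapred-site.xml; the ephemeral
mount points are illustrative:

  <property>
    <name>mapred.local.dir</name>
    <value>/mnt/ephemeral0/mapred/local,/mnt/ephemeral1/mapred/local</value>
  </property>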

On Tue, May 22, 2012 at 6:58 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote:
 Hi,

 I am using a Hadoop cluster of my own construction on EC2, and I am running
 out of hard drive space with maps. If I knew which directories are used by
 Hadoop for map spill, I could use the large ephemeral drive on EC2 machines
 for that. Otherwise, I would have to keep increasing my available hard
 drive on root, and that's not very smart.

 Thank you. The error I get is below.

 Sincerely,
 Mark



 org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
 any valid local directory for output/file.out
        at 
 org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:376)
        at 
 org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
        at 
 org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
        at 
 org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
        at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1495)
        at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1180)
        at 
 org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:582)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs
 java.io.IOException: Spill failed
        at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:886)
        at 
 org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:574)
        at 
 org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
        at 
 org.frd.main.ZipFileProcessor.emitAsMap(ZipFileProcessor.java:279)
        at 
 org.frd.main.ZipFileProcessor.processWithTrueZip(ZipFileProcessor.java:107)
        at org.frd.main.ZipFileProcessor.process(ZipFileProcessor.java:55)
        at org.frd.main.Map.map(Map.java:70)
        at org.frd.main.Map.map(Map.java:24)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(User
 java.io.IOException: Spill failed
        at 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:886)
        at 
 org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:574)
        at 
 org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
        at 
 org.frd.main.ZipFileProcessor.emitAsMap(ZipFileProcessor.java:279)
        at 
 org.frd.main.ZipFileProcessor.processWithTrueZip(ZipFileProcessor.java:107)
        at org.frd.main.ZipFileProcessor.process(ZipFileProcessor.java:55)
        at org.frd.main.Map.map(Map.java:70)
        at org.frd.main.Map.map(Map.java:24)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(User
 org.apache.hadoop.io.SecureIOUtils$AlreadyExistsException: EEXIST: File exists
        at 
 org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:178)
        at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:292)
        at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:365)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:272)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
        at org.apache.hadoop.mapred.Child.main(Child.java:264)
 Caused by: EEXIST: File exists
        at 

Re: Need some help for writing map reduce functions in hadoop-1.0.1 java

2012-05-22 Thread madhu phatak
Hi,
 You can go through the code of this project (
https://github.com/zinnia-phatak-dev/Nectar) to understand how the complex
algorithms are implemented using M/R.
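
For reference, a minimal hedged sketch of how a single K-Means iteration could be
laid out as one map (assign each point to its nearest centroid) and one reduce
(recompute each centroid). All class names here are illustrative; in a real job the
current centroids would be read from a side file or the DistributedCache rather
than hard-coded, the driver would rerun the job until convergence, and K (the
number of clusters) is chosen by the user, not by Hadoop:

 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.List;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;

 public class KMeansSketch {

     public static class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
         private final List<double[]> centroids = new ArrayList<double[]>();

         @Override
         protected void setup(Context ctx) {
             // Normally loaded from a side file per iteration; hard-coded only
             // to keep this sketch self-contained.
             centroids.add(new double[]{0.0, 0.0});
             centroids.add(new double[]{10.0, 10.0});
         }

         @Override
         protected void map(LongWritable key, Text value, Context ctx)
                 throws IOException, InterruptedException {
             // Input lines are assumed to be "x,y" point coordinates.
             String[] xy = value.toString().split(",");
             double x = Double.parseDouble(xy[0]), y = Double.parseDouble(xy[1]);
             int best = 0;
             double bestDist = Double.MAX_VALUE;
             for (int i = 0; i < centroids.size(); i++) {
                 double dx = x - centroids.get(i)[0], dy = y - centroids.get(i)[1];
                 double d = dx * dx + dy * dy;
                 if (d < bestDist) { bestDist = d; best = i; }
             }
             // Key = index of the nearest centroid, value = the point itself.
             ctx.write(new IntWritable(best), value);
         }
     }

     public static class KMeansReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
         @Override
         protected void reduce(IntWritable clusterId, Iterable<Text> points, Context ctx)
                 throws IOException, InterruptedException {
             double sx = 0, sy = 0;
             long n = 0;
             for (Text p : points) {
                 String[] xy = p.toString().split(",");
                 sx += Double.parseDouble(xy[0]);
                 sy += Double.parseDouble(xy[1]);
                 n++;
             }
             // Emit the recomputed centroid for this cluster.
             ctx.write(clusterId, new Text((sx / n) + "," + (sy / n)));
         }
     }
 }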

On Fri, May 18, 2012 at 12:16 PM, Ravi Joshi ravi.josh...@yahoo.com wrote:

 I am writing my own map and reduce methods to implement the K-Means
 algorithm in Hadoop 1.0.1 in Java. Although I found some example links for the
 K-Means algorithm in Hadoop on blogs, I don't want to copy their code; as a
 learner I want to implement it myself. So I just need some ideas/clues. Below is
 the work I have already done.

 I have Point and Cluster classes which are Writable. The Point class has an x
 coordinate, a y coordinate, and the Cluster to which the Point belongs. On the
 other hand, my Cluster class has an ArrayList which stores all the Point objects
 belonging to that Cluster. The Cluster class also has a centroid variable. I hope
 I am on the right track (if not, please correct me).

 Now, first of all, my input (a file containing some point coordinates) must be
 loaded into Point objects. I mean this input file must be mapped to all the
 Points. This should be done ONCE in the map class (but how?). After assigning a
 value to each Point, some random Clusters must be chosen in the initial phase
 (this must be done only ONCE, but how?). Now every Point must be mapped to every
 Cluster along with the distance between that Point and the centroid. In the
 reduce method, every Point will be checked and assigned to the Cluster that is
 nearest to it (by comparing the distances). Then a new centroid is calculated for
 each Cluster. (Should map and reduce be called recursively? If yes, then where
 would all the initialization go? By initialization I mean providing input to the
 Point objects (which must be done ONCE initially) and choosing some random
 centroids (initially we have to choose random centroids ONCE).)
 One more question: should the value of the parameter K (which decides the total
 number of clusters) be assigned by the user, or will Hadoop decide it itself?

 Somebody please explain; I don't need the code, I want to write it myself. I just
 need a way forward. Thank you.

 -Ravi




-- 
https://github.com/zinnia-phatak-dev/Nectar


Re: Splunk + Hadoop

2012-05-22 Thread Edward Capriolo
So a while back there was an article:
http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data

I recently did my own take on full-text searching your logs with
Solandra, though I have prototyped using Solr inside DataStax
Enterprise as well.

http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/more_taco_bell_programming_with

Splunk has a graphical front end with a good deal of sophistication,
but I am quite happy just being able to search everything with Solr and
to provide my own front ends built on it.

On Mon, May 21, 2012 at 5:13 PM, Abhishek Pratap Singh
manu.i...@gmail.com wrote:
 I have used Hadoop and Splunk both. Can you please let me know what is your
 requirement?
 Real time processing with hadoop depends upon What defines Real time in
 particular scenario. Based on requirement, Real time (near real time) can
 be achieved.

 ~Abhishek

 On Fri, May 18, 2012 at 3:58 PM, Russell Jurney 
 russell.jur...@gmail.comwrote:

 Because that isn't Cube.

 Russell Jurney
 twitter.com/rjurney
 russell.jur...@gmail.com
 datasyndrome.com

 On May 18, 2012, at 2:01 PM, Ravi Shankar Nair
 ravishankar.n...@gmail.com wrote:

  Why not Hbase with Hadoop?
  It's a best bet.
  Rgds, Ravi
 
  Sent from my Beethoven
 
 
  On May 18, 2012, at 3:29 PM, Russell Jurney russell.jur...@gmail.com
 wrote:
 
  I'm playing with using Hadoop and Pig to load MongoDB with data for
 Cube to
  consume. Cube https://github.com/square/cube/wiki is a realtime
 tool...
  but we'll be replaying events from the past.  Does that count?  It is
 nice
  to batch backfill metrics into 'real-time' systems in bulk.
 
  On Fri, May 18, 2012 at 12:11 PM, shreya@cognizant.com wrote:
 
  Hi ,
 
  Has anyone used Hadoop and splunk, or any other real-time processing
 tool
  over Hadoop?
 
  Regards,
  Shreya
 
 
 
 
 
  Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
 datasyndrome.com



Re: Question about reducers

2012-05-22 Thread Andrés Durán
Many thanks Harsh, I will try it.  :D

Best regards,
Andrés Durán


On 22/05/2012, at 14:25, Harsh J wrote:

 A minor correction: CapacityScheduler doesn't seem to do multi-reducer
 assignments (or at least not in 1.x), but does do multi-map
 assignments. This is for the same reason as
 http://search-hadoop.com/m/KYv8JhkOHc1. FairScheduler in 1.x supports
 multi-map and multi-reducer assignments over single heartbeats, which
 should do good on your single 32-task machine.
 
 Do give it a try and let us know!
 
 On Tue, May 22, 2012 at 5:51 PM, Harsh J ha...@cloudera.com wrote:
 Hi,
 
 This may be cause, depending on your scheduler, only one Reducer may
 be allocated per TT heartbeat. A reasoning of why this is the case is
 explained here: http://search-hadoop.com/m/KYv8JhkOHc1
 
 You may have better results in 1.0.3 using an alternative scheduler
 such as FairScheduler with multiple-assignments-per-heartbeat turned
 on (See http://hadoop.apache.org/common/docs/current/fair_scheduler.html
 and boolean property mapred.fairscheduler.assignmultiple to enable)
 or via CapacityScheduler (See
 http://hadoop.apache.org/common/docs/current/capacity_scheduler.html)
 which does it as well (OOB).
 
 On Tue, May 22, 2012 at 5:36 PM, Andrés Durán du...@tadium.es wrote:
 Hello,
 
I'm working with a Hadoop, version is 1.0.3 and configured in 
 pseudo-distributed mode.
 
I have 128 reducers tasks and it's running in a local machine with 
 32 cores. The job is working fine and fast it  takes 1 hour and 30 minutes 
 to fininsh. But when the Job starts, the reducers are comming to the 
 running phase from the tasks queue very slow, it takes 7 minutes to 
 allocate 32 tasks in the running phase. Why is too slow to allocate task in 
 running mode? It's possible to adjust any variable in the jobs tracker 
 setup to reduce this allocation time?
 
  Thanks to all!
 
  Best regards,
Andrés Durán
 
 
 
 --
 Harsh J
 
 
 
 -- 
 Harsh J



Re: Stream data processing

2012-05-22 Thread Zhiwei Lin
Hi Bobby,

Thank you. Great help.

Zhiwei

On 22 May 2012 14:52, Robert Evans ev...@yahoo-inc.com wrote:

 If you want the results to come out instantly Map/Reduce is not the proper
 choice.  Map/Reduce is designed for batch processing.  It can do small
 batches, but the overhead of launching the map/reduce jobs can be very high
 compared to the amount of processing you are doing.  I personally would
 look into using either Storm, S4, or some other realtime stream processing
 framework.  From what you have said it sounds like you probably want to use
 Storm, as it can be used to guarantee that each event is processed once and
 only once.  You can also store your results into HDFS if you want, perhaps
 through HBASE, if you need to do further processing on the data.

 --Bobby Evans

 On 5/22/12 5:02 AM, Zhiwei Lin zhiwei...@gmail.com wrote:

 Hi Robert,
 Thank you.
 How quickly do you have to get the result out once the new data is added?
 If possible, I hope to get the result instantly.

 How far back in time do you have to look for  from the occurrence of
 ?
 The time slot is not constant. It depends on the last occurrence of 
 in front of .  So, I need to look up the history to get the last 
 in this case.

 Do you have to do this for all combinations of values or is it just a small
 subset of values?
 I think this depends on the time of last occurrence of  in the history.
 If  rarely occurred, then the early stage data has to be taken into
 account.

 Definitely, I think HDFS is a good place to store the data I have (the size
 of daily log is above 1GB). But I am not sure if Map/Reduce can help to
 handle the stated problem.

 Zhiwei


 On 21 May 2012 22:07, Robert Evans ev...@yahoo-inc.com wrote:

  Zhiwei,
 
  How quickly do you have to get the result out once the new data is added?
   How far back in time do you have to look for  from the occurrence of
  ?  Do you have to do this for all combinations of values or is it
 just
  a small subset of values?
 
  --Bobby Evans
 
  On 5/21/12 3:01 PM, Zhiwei Lin zhiwei...@gmail.com wrote:
 
  I have large volume of stream log data. Each data record contains a time
  stamp, which is very important to the analysis.
  For example, I have data format like this:
  (1) 20:30:21 01/April/2012A.
  (2) 20:30:51 01/April/2012.
  (3) 21:30:21 01/April/2012.
 
  Moreover, new data comes every few minutes.
  I have to calculate the probability of the occurrence  given the
  occurrence of  (where  occurs earlier than ). So, it is
  really time-dependant.
 
  I wonder if Hadoop  is the right platform for this job? Is there any
  package available for this kind of work?
 
  Thank you.
 
  Zhiwei
 
 


 --

 Best wishes.

 Zhiwei




-- 

Best wishes.

Zhiwei


Re: Where does Hadoop store its maps?

2012-05-22 Thread Mark Kerzner
Thank you, Harsh and Madhu, that is exactly what I was looking for.

Mark

On Tue, May 22, 2012 at 8:36 AM, madhu phatak phatak@gmail.com wrote:

 Hi,
  Set mapred.local.dir in mapred-site.xml to point to a directory on /mnt so
 that it will not use the EC2 instance's EBS storage.

 On Tue, May 22, 2012 at 6:58 PM, Mark Kerzner mark.kerz...@shmsoft.com
 wrote:

  Hi,
 
  I am using a Hadoop cluster of my own construction on EC2, and I am
 running
  out of hard drive space with maps. If I knew which directories are used
 by
  Hadoop for map spill, I could use the large ephemeral drive on EC2
 machines
  for that. Otherwise, I would have to keep increasing my available hard
  drive on root, and that's not very smart.
 
  Thank you. The error I get is below.
 
  Sincerely,
  Mark
 
 
 
  org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
  any valid local directory for output/file.out
 at
 
 org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:376)
 at
 
 org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
 at
 
 org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
 at
 
 org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
 at
 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1495)
 at
  org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1180)
 at
 
 org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:582)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs
  java.io.IOException: Spill failed
 at
 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:886)
 at
 
 org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:574)
 at
 
 org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
 at
  org.frd.main.ZipFileProcessor.emitAsMap(ZipFileProcessor.java:279)
 at
 
 org.frd.main.ZipFileProcessor.processWithTrueZip(ZipFileProcessor.java:107)
 at
  org.frd.main.ZipFileProcessor.process(ZipFileProcessor.java:55)
 at org.frd.main.Map.map(Map.java:70)
 at org.frd.main.Map.map(Map.java:24)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(User
  java.io.IOException: Spill failed
 at
 
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:886)
 at
 
 org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:574)
 at
 
 org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
 at
  org.frd.main.ZipFileProcessor.emitAsMap(ZipFileProcessor.java:279)
 at
 
 org.frd.main.ZipFileProcessor.processWithTrueZip(ZipFileProcessor.java:107)
 at
  org.frd.main.ZipFileProcessor.process(ZipFileProcessor.java:55)
 at org.frd.main.Map.map(Map.java:70)
 at org.frd.main.Map.map(Map.java:24)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(User
  org.apache.hadoop.io.SecureIOUtils$AlreadyExistsException: EEXIST: File
  exists
 at
  org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:178)
 at
  org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:292)
 at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:365)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:272)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at
 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
 at org.apache.hadoop.mapred.Child.main(Child.java:264)
  Caused by: EEXIST: File exists
 at 

Re: Map/Reduce Tasks Fails

2012-05-22 Thread Harsh J
Sandeep,

Is the same DN 10.0.25.149 reported across all failures? And do you
notice any machine patterns when observing the failed tasks (i.e. are
they clumped on any one or a few particular TTs repeatedly)?

On Tue, May 22, 2012 at 7:32 PM, Sandeep Reddy P
sandeepreddy.3...@gmail.com wrote:
 Hi,
 We have a 5node cdh3u4 cluster running. When i try to do teragen/terasort
 some of the map tasks are Failed/Killed and the logs show similar error on
 all machines.

 2012-05-22 09:43:50,831 INFO org.apache.hadoop.hdfs.DFSClient:
 Exception in createBlockOutputStream 10.0.25.149:50010
 java.net.SocketTimeoutException: 69000 millis timeout while waiting
 for channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/10.0.25.149:55835
 remote=/10.0.25.149:50010]
 2012-05-22 09:44:25,968 INFO org.apache.hadoop.hdfs.DFSClient:
 Abandoning block blk_7260720956806950576_1825
 2012-05-22 09:44:25,973 INFO org.apache.hadoop.hdfs.DFSClient:
 Excluding datanode 10.0.25.149:50010
 2012-05-22 09:46:36,350 WARN org.apache.hadoop.mapred.Task: Parent
 died.  Exiting attempt_201205211504_0007_m_16_1.



 Are these kind of errors common?? Atleast 1 map task is failing due to
 above reason on all the machines.We are using 24 mappers for teragen.
 For us it took 3hrs 44min 17 sec to generate 50Gb data with 24 mappers
 and 17failed/8 killed task attempts.

 24min 10 sec for 5GB data with 24 mappers and 9 killed Task attempts.
 Cluster works good for small datasets.



-- 
Harsh J


Re: Map/Reduce Tasks Fails

2012-05-22 Thread Raj Vishwanathan






 From: Harsh J ha...@cloudera.com
To: common-user@hadoop.apache.org 
Sent: Tuesday, May 22, 2012 7:13 AM
Subject: Re: Map/Reduce Tasks Fails
 
Sandeep,

Is the same DN 10.0.25.149 reported across all failures? And do you
notice any machine patterns when observing the failed tasks (i.e. are
they clumped on any one or a few particular TTs repeatedly)?

On Tue, May 22, 2012 at 7:32 PM, Sandeep Reddy P
sandeepreddy.3...@gmail.com wrote:
 Hi,
 We have a 5node cdh3u4 cluster running. When i try to do teragen/terasort
 some of the map tasks are Failed/Killed and the logs show similar error on
 all machines.

 2012-05-22 09:43:50,831 INFO org.apache.hadoop.hdfs.DFSClient:
 Exception in createBlockOutputStream 10.0.25.149:50010
 java.net.SocketTimeoutException: 69000 millis timeout while waiting
 for channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/10.0.25.149:55835
 remote=/10.0.25.149:50010]
 2012-05-22 09:44:25,968 INFO org.apache.hadoop.hdfs.DFSClient:
 Abandoning block blk_7260720956806950576_1825
 2012-05-22 09:44:25,973 INFO org.apache.hadoop.hdfs.DFSClient:
 Excluding datanode 10.0.25.149:50010
 2012-05-22 09:46:36,350 WARN org.apache.hadoop.mapred.Task: Parent
 died.  Exiting attempt_201205211504_0007_m_16_1.



 Are these kind of errors common?? Atleast 1 map task is failing due to
 above reason on all the machines.We are using 24 mappers for teragen.
 For us it took 3hrs 44min 17 sec to generate 50Gb data with 24 mappers
 and 17failed/8 killed task attempts.

 24min 10 sec for 5GB data with 24 mappers and 9 killed Task attempts.
 Cluster works good for small datasets.



-- 
Harsh J




Re: Map/Reduce Tasks Fails

2012-05-22 Thread Raj Vishwanathan
What kind of storage is attached to the data nodes? This kind of error can
happen when the CPU is really busy with I/O or interrupts.

Can you run top or dstat on some of the data nodes to see how the system is 
performing?

Raj




 From: Sandeep Reddy P sandeepreddy.3...@gmail.com
To: common-user@hadoop.apache.org 
Sent: Tuesday, May 22, 2012 7:23 AM
Subject: Re: Map/Reduce Tasks Fails
 
Task Trackers (columns: Name; Host; # running tasks; Max Map Tasks; Max Reduce
Tasks; Task Failures; Directory Failures; Node Health Status; Seconds Since Node
Last Healthy; Total Tasks Since Start; Succeeded Tasks Since Start; Total Tasks
Last Day; Succeeded Tasks Last Day; Total Tasks Last Hour; Succeeded Tasks Last
Hour; Seconds since heartbeat):

tracker_hadoop2.liaisondevqa.local:localhost/127.0.0.1:56225 (http://hadoop2.liaisondevqa.local:50060/)
  hadoop2.liaisondevqa.local  0  6  2  22  0  N/A  0  93  60  59  28  64  38  0
tracker_hadoop4.liaisondevqa.local:localhost/127.0.0.1:40363 (http://hadoop4.liaisondevqa.local:50060/)
  hadoop4.liaisondevqa.local  0  6  2  19  0  N/A  0  91  59  65  33  36  33  0
tracker_hadoop5.liaisondevqa.local:localhost/127.0.0.1:46605 (http://hadoop5.liaisondevqa.local:50060/)
  hadoop5.liaisondevqa.local  1  6  2  21  0  N/A  0  83  47  69  35  45  19  0
tracker_hadoop3.liaisondevqa.local:localhost/127.0.0.1:37305 (http://hadoop3.liaisondevqa.local:50060/)
  hadoop3.liaisondevqa.local  0  6  2  18  0  N/A  0  87  55  55  28  57  34  0

Highest Failures: tracker_hadoop2.liaisondevqa.local:localhost/127.0.0.1:56225 with 22
failures




Re: Problems with block compression using native codecs (Snappy, LZO) and MapFile.Reader.get()

2012-05-22 Thread Jason B
JIRA entry created:

https://issues.apache.org/jira/browse/HADOOP-8423


On 5/21/12, Jason B urg...@gmail.com wrote:
 Sorry about using attachment. The code is below for the reference.
 (I will also file a jira as you suggesting)

 package codectest;

 import com.hadoop.compression.lzo.LzoCodec;
 import java.io.IOException;
 import java.util.Formatter;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.io.MapFile;
 import org.apache.hadoop.io.SequenceFile.CompressionType;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.io.compress.CompressionCodec;
 import org.apache.hadoop.io.compress.DefaultCodec;
 import org.apache.hadoop.io.compress.SnappyCodec;
 import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;

 public class MapFileCodecTest implements Tool {
     private Configuration conf = new Configuration();

     private void createMapFile(Configuration conf, FileSystem fs, String path,
             CompressionCodec codec, CompressionType type, int records)
             throws IOException {
         MapFile.Writer writer = new MapFile.Writer(conf, fs, path,
                 Text.class, Text.class, type, codec, null);
         Text key = new Text();
         for (int j = 0; j < records; j++) {
             StringBuilder sb = new StringBuilder();
             Formatter formatter = new Formatter(sb);
             formatter.format("%03d", j);
             key.set(sb.toString());
             writer.append(key, key);
         }
         writer.close();
     }

     private void testCodec(Configuration conf, Class<? extends CompressionCodec> clazz,
             CompressionType type, int records) throws IOException {
         FileSystem fs = FileSystem.getLocal(conf);
         try {
             System.out.println("Creating MapFiles with " + records
                     + " records using codec " + clazz.getSimpleName());
             String path = clazz.getSimpleName() + records;
             createMapFile(conf, fs, path, clazz.newInstance(), type, records);
             MapFile.Reader reader = new MapFile.Reader(fs, path, conf);
             Text key1 = new Text("002");
             if (reader.get(key1, new Text()) != null) {
                 System.out.println("1st key found");
             }
             Text key2 = new Text("004");
             if (reader.get(key2, new Text()) != null) {
                 System.out.println("2nd key found");
             }
         } catch (Throwable ex) {
             ex.printStackTrace();
         }
     }

     @Override
     public int run(String[] strings) throws Exception {
         System.out.println("Using native library " +
                 System.getProperty("java.library.path"));

         testCodec(conf, DefaultCodec.class, CompressionType.RECORD, 100);
         testCodec(conf, SnappyCodec.class, CompressionType.RECORD, 100);
         testCodec(conf, LzoCodec.class, CompressionType.RECORD, 100);

         testCodec(conf, DefaultCodec.class, CompressionType.RECORD, 10);
         testCodec(conf, SnappyCodec.class, CompressionType.RECORD, 10);
         testCodec(conf, LzoCodec.class, CompressionType.RECORD, 10);

         testCodec(conf, DefaultCodec.class, CompressionType.BLOCK, 100);
         testCodec(conf, SnappyCodec.class, CompressionType.BLOCK, 100);
         testCodec(conf, LzoCodec.class, CompressionType.BLOCK, 100);

         testCodec(conf, DefaultCodec.class, CompressionType.BLOCK, 10);
         testCodec(conf, SnappyCodec.class, CompressionType.BLOCK, 10);
         testCodec(conf, LzoCodec.class, CompressionType.BLOCK, 10);
         return 0;
     }

     @Override
     public void setConf(Configuration c) {
         this.conf = c;
     }

     @Override
     public Configuration getConf() {
         return conf;
     }

     public static void main(String[] args) throws Exception {
         ToolRunner.run(new MapFileCodecTest(), args);
     }

 }


 On 5/21/12, Todd Lipcon t...@cloudera.com wrote:
 Hi Jason,

 Sounds like a bug. Unfortunately the mailing list strips attachments.

 Can you file a jira in the HADOOP project, and attach your test case
 there?

 Thanks
 Todd

 On Mon, May 21, 2012 at 3:57 PM, Jason B urg...@gmail.com wrote:
 I am using Cloudera distribution cdh3u1.

 When trying to check native codecs for better decompression
 performance such as Snappy or LZO, I ran into issues with random
 access using MapFile.Reader.get(key, value) method.
 First call of MapFile.Reader.get() works but a second call fails.

 Also  I am getting different exceptions depending on number of entries
 in a map file.
 With LzoCodec and 10 record file, jvm gets aborted.

 At the same time the DefaultCodec works fine for all cases, as well as
 record compression for the native codecs.

 I created a simple test program (attached) that creates map files
 locally with sizes of 10 and 100 records for three codecs: Default,
 Snappy, and LZO.
 (The test requires corresponding native library available)

 The summary 

Re: Problems with block compression using native codecs (Snappy, LZO) and MapFile.Reader.get()

2012-05-22 Thread Edward Capriolo
If you are getting a SIGSEGV, it never hurts to try a more recent JVM.
21 has many bug fixes at this point.

On Tue, May 22, 2012 at 11:45 AM, Jason B urg...@gmail.com wrote:
 JIRA entry created:

 https://issues.apache.org/jira/browse/HADOOP-8423


 On 5/21/12, Jason B urg...@gmail.com wrote:
 Sorry about using attachment. The code is below for the reference.
 (I will also file a jira as you suggesting)

 package codectest;

 import com.hadoop.compression.lzo.LzoCodec;
 import java.io.IOException;
 import java.util.Formatter;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.io.MapFile;
 import org.apache.hadoop.io.SequenceFile.CompressionType;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.io.compress.CompressionCodec;
 import org.apache.hadoop.io.compress.DefaultCodec;
 import org.apache.hadoop.io.compress.SnappyCodec;
 import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;

 public class MapFileCodecTest implements Tool {
     private Configuration conf = new Configuration();

     private void createMapFile(Configuration conf, FileSystem fs, String
 path,
             CompressionCodec codec, CompressionType type, int records)
 throws IOException {
         MapFile.Writer writer = new MapFile.Writer(conf, fs, path,
 Text.class, Text.class,
                 type, codec, null);
         Text key = new Text();
         for (int j = 0; j  records; j++) {
             StringBuilder sb = new StringBuilder();
             Formatter formatter = new Formatter(sb);
             formatter.format(%03d, j);
             key.set(sb.toString());
             writer.append(key, key);
         }
         writer.close();
     }

     private void testCodec(Configuration conf, Class? extends
 CompressionCodec clazz,
             CompressionType type, int records) throws IOException {
         FileSystem fs = FileSystem.getLocal(conf);
         try {
             System.out.println(Creating MapFiles with  + records  +
                      records using codec  + clazz.getSimpleName());
             String path = clazz.getSimpleName() + records;
             createMapFile(conf, fs, path, clazz.newInstance(), type,
 records);
             MapFile.Reader reader = new MapFile.Reader(fs, path, conf);
             Text key1 = new Text(002);
             if (reader.get(key1, new Text()) != null) {
                 System.out.println(1st key found);
             }
             Text key2 = new Text(004);
             if (reader.get(key2, new Text()) != null) {
                 System.out.println(2nd key found);
             }
         } catch (Throwable ex) {
             ex.printStackTrace();
         }
     }

     @Override
     public int run(String[] strings) throws Exception {
         System.out.println(Using native library  +
 System.getProperty(java.library.path));

         testCodec(conf, DefaultCodec.class, CompressionType.RECORD, 100);
         testCodec(conf, SnappyCodec.class, CompressionType.RECORD, 100);
         testCodec(conf, LzoCodec.class, CompressionType.RECORD, 100);

         testCodec(conf, DefaultCodec.class, CompressionType.RECORD, 10);
         testCodec(conf, SnappyCodec.class, CompressionType.RECORD, 10);
         testCodec(conf, LzoCodec.class, CompressionType.RECORD, 10);

         testCodec(conf, DefaultCodec.class, CompressionType.BLOCK, 100);
         testCodec(conf, SnappyCodec.class, CompressionType.BLOCK, 100);
         testCodec(conf, LzoCodec.class, CompressionType.BLOCK, 100);

         testCodec(conf, DefaultCodec.class, CompressionType.BLOCK, 10);
         testCodec(conf, SnappyCodec.class, CompressionType.BLOCK, 10);
         testCodec(conf, LzoCodec.class, CompressionType.BLOCK, 10);
         return 0;
     }

     @Override
     public void setConf(Configuration c) {
         this.conf = c;
     }

     @Override
     public Configuration getConf() {
         return conf;
     }

     public static void main(String[] args) throws Exception {
         ToolRunner.run(new MapFileCodecTest(), args);
     }

 }


 On 5/21/12, Todd Lipcon t...@cloudera.com wrote:
 Hi Jason,

 Sounds like a bug. Unfortunately the mailing list strips attachments.

 Can you file a jira in the HADOOP project, and attach your test case
 there?

 Thanks
 Todd

 On Mon, May 21, 2012 at 3:57 PM, Jason B urg...@gmail.com wrote:
 I am using Cloudera distribution cdh3u1.

 When trying to check native codecs for better decompression
 performance such as Snappy or LZO, I ran into issues with random
 access using MapFile.Reader.get(key, value) method.
 First call of MapFile.Reader.get() works but a second call fails.

 Also  I am getting different exceptions depending on number of entries
 in a map file.
 With LzoCodec and 10 record file, jvm gets aborted.

 At the same time the DefaultCodec works fine for all cases, as well as
 record compression for the native codecs.

 I created a simple test program (attached) that 

Re: Problems with block compression using native codecs (Snappy, LZO) and MapFile.Reader.get()

2012-05-22 Thread Jason B
This is from our production environment.
Unfortunately, I cannot test this on any newer version until it is
upgraded to cdh4 (0.23)

But since this is a cdh3u1 release, it presumably already contains a lot
of bug fixes backported from 0.21.

The SIGSEGV is just one of the issues.
An EOFException is raised gracefully in the other cases.


On 5/22/12, Edward Capriolo edlinuxg...@gmail.com wrote:
 if You are getting a SIGSEG it never hurts to try a more recent JVM.
 21 has many bug fixes at this point.

 On Tue, May 22, 2012 at 11:45 AM, Jason B urg...@gmail.com wrote:
 JIRA entry created:

 https://issues.apache.org/jira/browse/HADOOP-8423


 On 5/21/12, Jason B urg...@gmail.com wrote:
 Sorry about using attachment. The code is below for the reference.
 (I will also file a jira as you suggesting)

 package codectest;

 import com.hadoop.compression.lzo.LzoCodec;
 import java.io.IOException;
 import java.util.Formatter;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.io.MapFile;
 import org.apache.hadoop.io.SequenceFile.CompressionType;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.io.compress.CompressionCodec;
 import org.apache.hadoop.io.compress.DefaultCodec;
 import org.apache.hadoop.io.compress.SnappyCodec;
 import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;

 public class MapFileCodecTest implements Tool {
     private Configuration conf = new Configuration();

     private void createMapFile(Configuration conf, FileSystem fs, String
 path,
             CompressionCodec codec, CompressionType type, int records)
 throws IOException {
         MapFile.Writer writer = new MapFile.Writer(conf, fs, path,
 Text.class, Text.class,
                 type, codec, null);
         Text key = new Text();
         for (int j = 0; j  records; j++) {
             StringBuilder sb = new StringBuilder();
             Formatter formatter = new Formatter(sb);
             formatter.format(%03d, j);
             key.set(sb.toString());
             writer.append(key, key);
         }
         writer.close();
     }

     private void testCodec(Configuration conf, Class? extends
 CompressionCodec clazz,
             CompressionType type, int records) throws IOException {
         FileSystem fs = FileSystem.getLocal(conf);
         try {
             System.out.println(Creating MapFiles with  + records  +
                      records using codec  + clazz.getSimpleName());
             String path = clazz.getSimpleName() + records;
             createMapFile(conf, fs, path, clazz.newInstance(), type,
 records);
             MapFile.Reader reader = new MapFile.Reader(fs, path, conf);
             Text key1 = new Text(002);
             if (reader.get(key1, new Text()) != null) {
                 System.out.println(1st key found);
             }
             Text key2 = new Text(004);
             if (reader.get(key2, new Text()) != null) {
                 System.out.println(2nd key found);
             }
         } catch (Throwable ex) {
             ex.printStackTrace();
         }
     }

     @Override
     public int run(String[] strings) throws Exception {
         System.out.println(Using native library  +
 System.getProperty(java.library.path));

         testCodec(conf, DefaultCodec.class, CompressionType.RECORD,
 100);
         testCodec(conf, SnappyCodec.class, CompressionType.RECORD, 100);
         testCodec(conf, LzoCodec.class, CompressionType.RECORD, 100);

         testCodec(conf, DefaultCodec.class, CompressionType.RECORD, 10);
         testCodec(conf, SnappyCodec.class, CompressionType.RECORD, 10);
         testCodec(conf, LzoCodec.class, CompressionType.RECORD, 10);

         testCodec(conf, DefaultCodec.class, CompressionType.BLOCK, 100);
         testCodec(conf, SnappyCodec.class, CompressionType.BLOCK, 100);
         testCodec(conf, LzoCodec.class, CompressionType.BLOCK, 100);

         testCodec(conf, DefaultCodec.class, CompressionType.BLOCK, 10);
         testCodec(conf, SnappyCodec.class, CompressionType.BLOCK, 10);
         testCodec(conf, LzoCodec.class, CompressionType.BLOCK, 10);
         return 0;
     }

     @Override
     public void setConf(Configuration c) {
         this.conf = c;
     }

     @Override
     public Configuration getConf() {
         return conf;
     }

     public static void main(String[] args) throws Exception {
         ToolRunner.run(new MapFileCodecTest(), args);
     }

 }


 On 5/21/12, Todd Lipcon t...@cloudera.com wrote:
 Hi Jason,

 Sounds like a bug. Unfortunately the mailing list strips attachments.

 Can you file a jira in the HADOOP project, and attach your test case
 there?

 Thanks
 Todd

 On Mon, May 21, 2012 at 3:57 PM, Jason B urg...@gmail.com wrote:
 I am using Cloudera distribution cdh3u1.

 When trying to check native codecs for better decompression
 performance such as Snappy or LZO, I ran into issues with random
 access using MapFile.Reader.get(key, 

Re: Map/Reduce Tasks Fails

2012-05-22 Thread Arun C Murthy
Seems like a question better suited for Cloudera lists...

On May 22, 2012, at 7:02 AM, Sandeep Reddy P wrote:

 Hi,
 We have a 5node cdh3u4 cluster running. When i try to do teragen/terasort
 some of the map tasks are Failed/Killed and the logs show similar error on
 all machines.
 
 2012-05-22 09:43:50,831 INFO org.apache.hadoop.hdfs.DFSClient:
 Exception in createBlockOutputStream 10.0.25.149:50010
 java.net.SocketTimeoutException: 69000 millis timeout while waiting
 for channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/10.0.25.149:55835
 remote=/10.0.25.149:50010]
 2012-05-22 09:44:25,968 INFO org.apache.hadoop.hdfs.DFSClient:
 Abandoning block blk_7260720956806950576_1825
 2012-05-22 09:44:25,973 INFO org.apache.hadoop.hdfs.DFSClient:
 Excluding datanode 10.0.25.149:50010
 2012-05-22 09:46:36,350 WARN org.apache.hadoop.mapred.Task: Parent
 died.  Exiting attempt_201205211504_0007_m_16_1.
 
 
 
 Are these kind of errors common?? Atleast 1 map task is failing due to
 above reason on all the machines.We are using 24 mappers for teragen.
 For us it took 3hrs 44min 17 sec to generate 50Gb data with 24 mappers
 and 17failed/8 killed task attempts.
 
 24min 10 sec for 5GB data with 24 mappers and 9 killed Task attempts.
 Cluster works good for small datasets.

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




Re: Map/Reduce Tasks Fails

2012-05-22 Thread Sandeep Reddy P
I got similar errors with Apache Hadoop 1.0.0.
Thanks,
Sandeep.


Re: Map/Reduce Tasks Fails

2012-05-22 Thread Sandeep Reddy P
Raj,
Top from one datanode when i get error from that machine

top - 14:10:15 up 23:12,  1 user,  load average: 13.45, 12.91, 8.31
Tasks: 187 total,   1 running, 186 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.7%us,  0.4%sy,  0.0%ni,  0.0%id, 98.9%wa,  0.0%hi,  0.1%si,
0.0%st
Mem:   8061608k total,  7927124k used,   134484k free,19316k buffers
Swap:  2097144k total,  384k used,  2096760k free,  6694656k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
 1622 hdfs  20   0 1619m 157m  11m S  2.0  2.0  33:55.42 java
14712 mapred20   0  709m 119m  11m S  1.3  1.5   0:10.06 java
 1706 mapred20   0 1588m 126m  11m S  1.0  1.6  24:51.69 java
14663 mapred20   0  708m  89m  11m S  1.0  1.1   0:11.23 java
14686 mapred20   0  714m 106m  11m S  0.7  1.4   0:11.53 java
14762 mapred20   0  710m  89m  11m S  0.7  1.1   0:10.05 java
14640 mapred20   0  704m 119m  11m S  0.3  1.5   0:11.36 java

Error Message:
12/05/22 14:09:52 INFO mapred.JobClient: Task Id :
attempt_201205211504_0009_m_02_0, Status : FAILED
java.io.IOException: All datanodes 10.0.24.175:50010 are bad. Aborting...
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3181)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2100(DFSClient.java:2720)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2892)

attempt_201205211504_0009_m_02_0: log4j:WARN No appenders could be
found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201205211504_0009_m_02_0: log4j:WARN Please initialize the
log4j system properly.

But other map tasks are running on the same datanode.

Thanks,
sandeep.


Re: CopyFromLocal

2012-05-22 Thread Ranjith
Harsh,

Thanks for the response bud. Appreciate it!

Thanks,
Ranjith

On May 21, 2012, at 11:09 PM, Harsh J ha...@cloudera.com wrote:

 Ranjith,
 
 MapReduce and HDFS are two different things. MapReduce uses HDFS (and
 can use any other FS as well) to do some efficient work, but HDFS does
 not use MapReduce.
 
  A simple HDFS transfer is done over the network directly - yes, it's just a
  block-by-block copy/write to/from the relevant DataNodes, done over
  network sockets at each end.
 
 On Tue, May 22, 2012 at 8:58 AM, Ranjith ranjith.raghuna...@gmail.com wrote:
 Thanks harsh. So when it connects directly to the data nodes it does not 
 fire off any mappers. So how does it get the data over? Is it just a block 
 by block copy?
 
 Thanks,
 Ranjith
 
 On May 21, 2012, at 9:22 PM, Harsh J ha...@cloudera.com wrote:
 
 Ranjith,
 
 Are you speaking of DistCp?
 http://hadoop.apache.org/common/docs/current/distcp.html
 
 An 'fs -copyFromLocal' otherwise just runs as a single program that
 connects to your DFS nodes and writes data from a single client
 thread, and is not distributed on its own.
 
 On Tue, May 22, 2012 at 6:48 AM, Ranjith ranjith.raghuna...@gmail.com 
 wrote:
 
  I have always wondered about this and am not sure what actually happens. When
  I fire a map reduce job to copy data over in a distributed fashion I would
  expect to see mappers executing the copy. What happens with a copy command
  from Hadoop fs?
 
 Thanks,
 Ranjith
 
 
 
 --
 Harsh J
 
 
 
 -- 
 Harsh J
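
For reference, a hedged illustration of the distinction Harsh describes above; the
paths and host names are made up:

  # plain copy: a single client streams the blocks to the DataNodes, no mappers run
  hadoop fs -copyFromLocal /data/local/events.log /user/ranjith/events.log

  # DistCp: launches a MapReduce job whose map tasks perform the copies in parallel
  hadoop distcp hdfs://nn1:8020/src hdfs://nn2:8020/dest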


Job constructor deprecation

2012-05-22 Thread Jay Vyas
Hi guys: I have noticed that the comments for this class encourage us to
use the Job constructors, yet they are deprecated.

http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/Job.html

What is the idiomatic way to create a Job in Hadoop? And why have the Job
constructors been deprecated?

-- 
Jay Vyas
MMSB/UCHC
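

For reference, a hedged sketch of the factory-method style the later Javadoc points
to in place of the public constructors (Job.getInstance appears in the releases
where the constructors are deprecated); the job name and the commented-out
mapper/reducer classes are illustrative:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 public class JobFactoryExample {
     public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();
         // Static factory in place of the deprecated "new Job(conf, "my job")".
         Job job = Job.getInstance(conf, "my job");
         job.setJarByClass(JobFactoryExample.class);
         // job.setMapperClass(MyMapper.class);      // hypothetical classes
         // job.setReducerClass(MyReducer.class);
         FileInputFormat.addInputPath(job, new Path(args[0]));
         FileOutputFormat.setOutputPath(job, new Path(args[1]));
         System.exit(job.waitForCompletion(true) ? 0 : 1);
     }
 }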