Re: How to manage large record in MapReduce

2011-01-07 Thread Sonal Goyal
Jerome,

You can take a look at FileStreamInputFormat at
https://github.com/sonalgoyal/hiho/tree/hihoApache0.20/src/co/nubetech/hiho/mapreduce/lib/input

This provides an input stream per file. In our case, we are using the input
stream to load data into the database directly. Maybe you can use this or a
similar approach for working with your videos.

HTH

Thanks and Regards,
Sonal
Connect Hadoop with databases, Salesforce, FTP servers and others:
https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Thu, Jan 6, 2011 at 4:23 PM, Jérôme Thièvre jthie...@gmail.com wrote:

 Hi,

 We are currently using Hadoop (version 0.20.2) to manage some web archiving
 processes like fulltext indexing, and it works very well with small records
 that contain HTML.
 Now, we would like to work with other types of web data, like videos. This
 kind of data can be really large, and of course these records don't fit
 in memory.

 Is it possible to manage records whose content doesn't reside in memory but
 on disk?
 A possibility would be to implement a Writable that reads its content from a
 DataInput but doesn't load it into memory; instead it would copy that content
 to a temporary file in the local file system and allow its content to be
 streamed using an InputStream (an InputStreamWritable).

 Has somebody tested a similar approach, and if not, do you think some big
 problems could happen (that impact performance) with this method?

 Thanks,

 Jérôme Thièvre
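
A minimal sketch of the InputStreamWritable idea described above (the class
name and the assumed field layout, a long length followed by the raw bytes,
are illustrative and not part of Hadoop or hiho): readFields() spools the
payload to a local temporary file in fixed-size chunks instead of buffering
it in memory, and getInputStream() streams it back.

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.io.Writable;

// Hypothetical sketch: the payload is assumed to be serialized as a long
// length followed by that many raw bytes.
public class InputStreamWritable implements Writable {

    private File spillFile;   // local temporary file holding the payload
    private long length;

    @Override
    public void readFields(DataInput in) throws IOException {
        length = in.readLong();
        spillFile = File.createTempFile("record-", ".bin");
        spillFile.deleteOnExit();
        byte[] buf = new byte[64 * 1024];
        OutputStream out = new BufferedOutputStream(new FileOutputStream(spillFile));
        try {
            long remaining = length;
            while (remaining > 0) {
                int chunk = (int) Math.min(buf.length, remaining);
                in.readFully(buf, 0, chunk);   // copy in small chunks, never the whole record
                out.write(buf, 0, chunk);
                remaining -= chunk;
            }
        } finally {
            out.close();
        }
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(length);
        byte[] buf = new byte[64 * 1024];
        InputStream in = getInputStream();
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } finally {
            in.close();
        }
    }

    /** Stream the spooled content back without loading it into memory. */
    public InputStream getInputStream() throws IOException {
        return new BufferedInputStream(new FileInputStream(spillFile));
    }

    public long getLength() {
        return length;
    }
}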



ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed. java.io.EOFException

2011-01-07 Thread Shuja Rehman
Hi,

After a power failure, the name node is not starting, giving the following
error. Kindly let me know how to resolve it.
Thanks



2011-01-07 04:14:49,666 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
/
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ubuntu/192.168.1.2
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.2+737
STARTUP_MSG:   build =  -r 98c55c28258aa6f42250569bd7fa431ac657bdbd;
compiled by 'root' on Mon Oct 11 17:21:30 UTC 2010
/
2011-01-07 04:14:50,610 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=NameNode, sessionId=null
2011-01-07 04:14:50,670 INFO
org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing
NameNodeMeterics using context
object:org.apache.hadoop.metrics.spi.NoEmitMetricsContext
2011-01-07 04:14:50,907 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hdfs
2011-01-07 04:14:50,908 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2011-01-07 04:14:50,908 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
isPermissionEnabled=false
2011-01-07 04:14:50,931 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s),
accessTokenLifetime=0 min(s)
2011-01-07 04:14:52,378 INFO
org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics:
Initializing FSNamesystemMetrics using context
object:org.apache.hadoop.metrics.spi.NoEmitMetricsContext
2011-01-07 04:14:52,392 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
FSNamesystemStatusMBean
2011-01-07 04:14:52,651 ERROR
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
initialization failed.
java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.readCheckpointTime(FSImage.java:571)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.getFields(FSImage.java:562)
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.read(Storage.java:237)
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.read(Storage.java:226)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:316)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:99)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:343)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:317)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:214)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:394)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1148)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1157)
2011-01-07 04:14:52,662 ERROR
org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.readCheckpointTime(FSImage.java:571)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.getFields(FSImage.java:562)
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.read(Storage.java:237)
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.read(Storage.java:226)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:316)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:99)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:343)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:317)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:214)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:394)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1148)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1157)

2011-01-07 04:14:52,673 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:


-- 
Regards
Shuja-ur-Rehman Baig
http://pk.linkedin.com/in/shujamughal


Too-many fetch failure Reduce Error

2011-01-07 Thread Adarsh Sharma

Dear all,

I am researching the below error and have not been able to find the
reason:


Data Size : 3.4 GB
Hadoop-0.20.0

had...@ws32-test-lin:~/project/hadoop-0.20.2$ bin/hadoop jar 
hadoop-0.20.2-examples.jar wordcount /user/hadoop/page_content.txt 
page_content_output.txt
11/01/07 16:11:14 INFO input.FileInputFormat: Total input paths to 
process : 1

11/01/07 16:11:15 INFO mapred.JobClient: Running job: job_201101071129_0001
11/01/07 16:11:16 INFO mapred.JobClient:  map 0% reduce 0%
11/01/07 16:11:41 INFO mapred.JobClient:  map 1% reduce 0%
11/01/07 16:11:45 INFO mapred.JobClient:  map 2% reduce 0%
11/01/07 16:11:48 INFO mapred.JobClient:  map 3% reduce 0%
11/01/07 16:11:52 INFO mapred.JobClient:  map 4% reduce 0%
11/01/07 16:11:56 INFO mapred.JobClient:  map 5% reduce 0%
11/01/07 16:12:00 INFO mapred.JobClient:  map 6% reduce 0%
11/01/07 16:12:05 INFO mapred.JobClient:  map 7% reduce 0%
11/01/07 16:12:08 INFO mapred.JobClient:  map 8% reduce 0%
11/01/07 16:12:11 INFO mapred.JobClient:  map 9% reduce 0%
11/01/07 16:12:14 INFO mapred.JobClient:  map 10% reduce 0%
11/01/07 16:12:17 INFO mapred.JobClient:  map 11% reduce 0%
11/01/07 16:12:21 INFO mapred.JobClient:  map 12% reduce 0%
11/01/07 16:12:24 INFO mapred.JobClient:  map 13% reduce 0%
11/01/07 16:12:27 INFO mapred.JobClient:  map 14% reduce 0%
11/01/07 16:12:30 INFO mapred.JobClient:  map 15% reduce 0%
11/01/07 16:12:33 INFO mapred.JobClient:  map 16% reduce 0%
11/01/07 16:12:36 INFO mapred.JobClient:  map 17% reduce 0%
11/01/07 16:12:40 INFO mapred.JobClient:  map 18% reduce 0%
11/01/07 16:12:45 INFO mapred.JobClient:  map 19% reduce 0%
11/01/07 16:12:48 INFO mapred.JobClient:  map 20% reduce 0%
11/01/07 16:12:54 INFO mapred.JobClient:  map 21% reduce 0%
11/01/07 16:13:00 INFO mapred.JobClient:  map 22% reduce 0%
11/01/07 16:13:04 INFO mapred.JobClient:  map 22% reduce 1%
11/01/07 16:13:13 INFO mapred.JobClient:  map 23% reduce 1%
11/01/07 16:13:19 INFO mapred.JobClient:  map 24% reduce 1%
11/01/07 16:13:25 INFO mapred.JobClient:  map 25% reduce 1%
11/01/07 16:13:30 INFO mapred.JobClient:  map 26% reduce 1%
11/01/07 16:13:34 INFO mapred.JobClient:  map 26% reduce 3%
11/01/07 16:13:36 INFO mapred.JobClient:  map 27% reduce 3%
11/01/07 16:13:37 INFO mapred.JobClient:  map 27% reduce 4%
11/01/07 16:13:39 INFO mapred.JobClient:  map 28% reduce 4%
11/01/07 16:13:43 INFO mapred.JobClient:  map 29% reduce 4%
11/01/07 16:13:46 INFO mapred.JobClient:  map 30% reduce 4%
11/01/07 16:13:49 INFO mapred.JobClient:  map 31% reduce 4%
11/01/07 16:13:52 INFO mapred.JobClient:  map 32% reduce 4%
11/01/07 16:13:55 INFO mapred.JobClient:  map 33% reduce 4%
11/01/07 16:13:58 INFO mapred.JobClient:  map 34% reduce 4%
11/01/07 16:14:02 INFO mapred.JobClient:  map 35% reduce 4%
11/01/07 16:14:05 INFO mapred.JobClient:  map 36% reduce 4%
11/01/07 16:14:08 INFO mapred.JobClient:  map 37% reduce 4%
11/01/07 16:14:11 INFO mapred.JobClient:  map 38% reduce 4%
11/01/07 16:14:15 INFO mapred.JobClient:  map 39% reduce 4%
11/01/07 16:14:19 INFO mapred.JobClient:  map 40% reduce 4%
11/01/07 16:14:20 INFO mapred.JobClient:  map 40% reduce 5%
11/01/07 16:14:25 INFO mapred.JobClient:  map 41% reduce 5%
11/01/07 16:14:32 INFO mapred.JobClient:  map 42% reduce 5%
11/01/07 16:14:38 INFO mapred.JobClient:  map 43% reduce 5%
11/01/07 16:14:41 INFO mapred.JobClient:  map 43% reduce 6%
11/01/07 16:14:43 INFO mapred.JobClient:  map 44% reduce 6%
11/01/07 16:14:47 INFO mapred.JobClient:  map 45% reduce 6%
11/01/07 16:14:50 INFO mapred.JobClient:  map 46% reduce 6%
11/01/07 16:14:54 INFO mapred.JobClient:  map 47% reduce 7%
11/01/07 16:14:59 INFO mapred.JobClient:  map 48% reduce 7%
11/01/07 16:15:02 INFO mapred.JobClient:  map 49% reduce 7%
11/01/07 16:15:05 INFO mapred.JobClient:  map 50% reduce 7%
11/01/07 16:15:11 INFO mapred.JobClient:  map 51% reduce 7%
11/01/07 16:15:14 INFO mapred.JobClient:  map 52% reduce 7%
11/01/07 16:15:16 INFO mapred.JobClient:  map 52% reduce 8%
11/01/07 16:15:20 INFO mapred.JobClient:  map 53% reduce 8%
11/01/07 16:15:25 INFO mapred.JobClient:  map 54% reduce 8%
11/01/07 16:15:29 INFO mapred.JobClient:  map 55% reduce 8%
11/01/07 16:15:31 INFO mapred.JobClient:  map 55% reduce 9%
11/01/07 16:15:33 INFO mapred.JobClient:  map 56% reduce 9%
11/01/07 16:15:38 INFO mapred.JobClient:  map 57% reduce 9%
11/01/07 16:15:42 INFO mapred.JobClient:  map 58% reduce 9%
11/01/07 16:15:43 INFO mapred.JobClient:  map 58% reduce 10%
11/01/07 16:15:46 INFO mapred.JobClient:  map 59% reduce 10%
11/01/07 16:15:49 INFO mapred.JobClient:  map 60% reduce 10%
11/01/07 16:15:53 INFO mapred.JobClient:  map 61% reduce 10%
11/01/07 16:15:56 INFO mapred.JobClient:  map 62% reduce 10%
11/01/07 16:16:00 INFO mapred.JobClient:  map 63% reduce 10%
11/01/07 16:16:06 INFO mapred.JobClient:  map 64% reduce 10%
11/01/07 16:16:10 INFO mapred.JobClient:  map 65% reduce 10%
11/01/07 16:16:15 INFO mapred.JobClient:  map 66% reduce 10%
11/01/07 16:16:18 INFO mapred.JobClient:  map 67% reduce 10%

Re: How to manage large record in MapReduce

2011-01-07 Thread Jérôme Thièvre INA
Hi Sonal,

thank you, I have just implemented a solution similar to yours (without
copying to a temp file as suggested in my initial post), and it seems to
work.
Best Regards,

Jérôme

2011/1/7 Sonal Goyal sonalgoy...@gmail.com

 Jerome,

 You can take a look at FileStreamInputFormat at

 https://github.com/sonalgoyal/hiho/tree/hihoApache0.20/src/co/nubetech/hiho/mapreduce/lib/input

 This provides an input stream per file. In our case, we are using the input
 stream to load data into the database directly. Maybe you can use this or a
 similar approach for working with your videos.

 HTH

 Thanks and Regards,
 Sonal
 Connect Hadoop with databases, Salesforce, FTP servers and others:
 https://github.com/sonalgoyal/hiho
 Nube Technologies http://www.nubetech.co

 http://in.linkedin.com/in/sonalgoyal





 On Thu, Jan 6, 2011 at 4:23 PM, Jérôme Thièvre jthie...@gmail.com wrote:

  Hi,
 
  We are currently using Hadoop (version 0.20.2) to manage some web archiving
  processes like fulltext indexing, and it works very well with small records
  that contain HTML.
  Now, we would like to work with other types of web data, like videos. This
  kind of data can be really large, and of course these records don't fit
  in memory.
 
  Is it possible to manage records whose content doesn't reside in memory but
  on disk?
  A possibility would be to implement a Writable that reads its content from a
  DataInput but doesn't load it into memory; instead it would copy that content
  to a temporary file in the local file system and allow its content to be
  streamed using an InputStream (an InputStreamWritable).
 
  Has somebody tested a similar approach, and if not, do you think some big
  problems could happen (that impact performance) with this method?
 
  Thanks,
 
  Jérôme Thièvre
 



Regarding chaining multiple map-reduce jobs in Hadoop streaming

2011-01-07 Thread Varadharajan Mukundan
Hi,

I need to chain a couple of MapReduce jobs in Hadoop streaming. I am
planning to use Python to write the mapper and reducer scripts. Is
there any other way to chain these jobs other than using a shell
script and a temporary directory in HDFS?

-- 
Thanks,
M. Varadharajan



Experience is what you get when you didn't get what you wanted
               -By Prof. Randy Pausch in The Last Lecture

My Journal :- www.thinkasgeek.wordpress.com


Re: Regarding chaining multiple map-reduce jobs in Hadoop streaming

2011-01-07 Thread Harsh J
You can use Oozie from Yahoo! for building an elegant workflow out of
your streaming jobs; but you would still require output spots on your
HDFS as output records aren't really pipelined into the next job.

-- 
Harsh J
www.harshj.com


Re: Too-many fetch failure Reduce Error

2011-01-07 Thread Esteban Gutierrez Moguel
Adarsh,

Do you have the hostnames for the masters and slaves in /etc/hosts?

esteban.

On Fri, Jan 7, 2011 at 06:47, Adarsh Sharma adarsh.sha...@orkash.com wrote:

 Dear all,

 I am researching the below error and have not been able to find the
 reason:

 Data Size : 3.4 GB
 Hadoop-0.20.0

 had...@ws32-test-lin:~/project/hadoop-0.20.2$ bin/hadoop jar
 hadoop-0.20.2-examples.jar wordcount /user/hadoop/page_content.txt
 page_content_output.txt
 11/01/07 16:11:14 INFO input.FileInputFormat: Total input paths to process
 : 1
 11/01/07 16:11:15 INFO mapred.JobClient: Running job: job_201101071129_0001
 [...]

Re: ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed. java.io.EOFException

2011-01-07 Thread Todd Lipcon
Hi Shuja,

Can you paste the output of ls -lR on all of your dfs.name.dirs?
(hopefully you have more than one, with one on an external machine via NFS,
right?)

Thanks
-Todd

On Fri, Jan 7, 2011 at 4:39 AM, Shuja Rehman shujamug...@gmail.com wrote:

 Hi,

 After a power failure, the name node is not starting, giving the following
 error. Kindly let me know how to resolve it.
 Thanks



 [...]

 2011-01-07 04:14:52,673 INFO
 org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:


 --
 Regards
 Shuja-ur-Rehman Baig
 http://pk.linkedin.com/in/shujamughal




-- 
Todd Lipcon
Software 

Help: How to increase amount of map tasks per job?

2011-01-07 Thread Tali K




We have a job which runs in several map/reduce stages. In the first job, a
large number of map tasks (82) are initiated, as expected, and that causes
all nodes to be used.
In a later job, where we are still dealing with large amounts of data, only
4 map tasks are initiated, which causes only 4 nodes to be used.
This stage is actually the workhorse of the job, and requires much more
processing power than the initial stage.
We are trying to understand why only a few map tasks are being used, as we
are not getting the full advantage of our cluster.



  

Data Platform Engineer at Turn Inc.

2011-01-07 Thread Songting Chen
Data Platform Engineer at Turn Inc.

If you're passionate about large-scale distributed systems, petabyte data
warehouses, NoSQL key-value stores, high-throughput real-time reporting
systems, and are interested in joining a world-class engineering team, you
might well be the person we're looking for. This hands-on role contributes
to the organization's success through expertise in large-scale MapReduce
systems, advanced database programming and database architecture.

We are a small team but build innovative and powerful systems. Our results were 
published in top-tier conferences 
such as VLDB (http://www.vldb2010.org/proceedings/files/papers/I08.pdf). 

If any of the following areas interest you, please send your resume to 
j...@turn.com
* Hybrid MapReduce/Database system 
* Distributed and parallel computing
* Performance tuning and optimization
* Large scale semi-structured data store
* Real time reporting system in 24 x 7 environment
* Turning research results into enterprise class software 

About Turn Inc.

Turn was founded to bring the efficiencies of search to digital advertising and 
empower the world's premier advertising 
agencies and brands to reach custom audiences at scale. We are a software and 
services company with the industry's only 
end-to-end platform for delivering the most effective data-driven digital 
advertising in the world. Our technology 
infrastructure, self-service interface, optimization algorithms, real-time 
analytics, and interoperability represent 
the future of media and data management. The company is based in Silicon Valley 
with locations in New York City, 
Charlotte, Chicago, London, Los Angeles, and San Francisco.

We are a rapidly growing, well funded startup in Redwood City, CA, with a 
growing business, a working business model 
and a seasoned executive team. We're changing the way the world thinks about 
online advertising and we are looking 
for talented engineers to help us take it to the next level.



Re: Help: How to increase amount of map tasks per job?

2011-01-07 Thread Ted Yu
Set higher values for mapred.tasktracker.map.tasks.maximum (and
mapred.tasktracker.reduce.tasks.maximum) in mapred-site.xml

On Fri, Jan 7, 2011 at 12:58 PM, Tali K ncherr...@hotmail.com wrote:





 We have a jobs which runs in several map/reduce stages.  In the first job,
 a large number of map tasks -82  are initiated, as expected.
 And that cause all nodes to be used.
  In a
 later job, where we are still dealing with large amounts of
  data, only 4 map tasks are initiated, and that caused to use only 4 nodes.
 This stage is actually the
 workhorse of the job, and requires much more processing power than the
 initial stage.
  We are trying to understand why only a few map tasks are
 being used, as we are not getting the full advantage of our cluster.






Re: Help: How to increase amount of map tasks per job?

2011-01-07 Thread Rahul Jain
Also make sure you have enough input files for the next stage's mappers to
work with...

Read through the input splits part of the tutorial:
http://wiki.apache.org/hadoop/HadoopMapReduce

If the last stage had only 4 reducers running, they'd generate 4 output
files. This will limit the # of mappers started in the next stage to 4,
unless you tune your input split parameters or write a custom input split.

Hope this helps; there is a lot more literature on this on the web and in the
Hadoop books released to date.

-Rahul


On Fri, Jan 7, 2011 at 1:19 PM, Ted Yu yuzhih...@gmail.com wrote:

 Set higher values for mapred.tasktracker.map.tasks.maximum (and
 mapred.tasktracker.reduce.tasks.maximum) in mapred-site.xml

 On Fri, Jan 7, 2011 at 12:58 PM, Tali K ncherr...@hotmail.com wrote:

 
 
 
 
  We have a jobs which runs in several map/reduce stages.  In the first
 job,
  a large number of map tasks -82  are initiated, as expected.
  And that cause all nodes to be used.
   In a
  later job, where we are still dealing with large amounts of
   data, only 4 map tasks are initiated, and that caused to use only 4
 nodes.
  This stage is actually the
  workhorse of the job, and requires much more processing power than the
  initial stage.
   We are trying to understand why only a few map tasks are
  being used, as we are not getting the full advantage of our cluster.
 
 
 
 



RE: Help: How to increase amount of map tasks per job?

2011-01-07 Thread Tali K

According to the documentation, that parameter is for the number of
tasks *per TaskTracker*.  I am asking about the number of tasks
for the entire job and entire cluster.  That parameter is already
set to 3, which is one less than the number of cores on each node's
CPU, as recommended. In my question I stated that
82 tasks were run for the first job, yet only 4 for the second -
both numbers being cluster-wide.



 Date: Fri, 7 Jan 2011 13:19:42 -0800
 Subject: Re: Help: How to increase amount of map tasks per job?
 From: yuzhih...@gmail.com
 To: common-user@hadoop.apache.org
 
 Set higher values for mapred.tasktracker.map.tasks.maximum (and
 mapred.tasktracker.reduce.tasks.maximum) in mapred-site.xml
 
 On Fri, Jan 7, 2011 at 12:58 PM, Tali K ncherr...@hotmail.com wrote:
 
 
 
 
 
  We have a jobs which runs in several map/reduce stages.  In the first job,
  a large number of map tasks -82  are initiated, as expected.
  And that cause all nodes to be used.
   In a
  later job, where we are still dealing with large amounts of
   data, only 4 map tasks are initiated, and that caused to use only 4 nodes.
  This stage is actually the
  workhorse of the job, and requires much more processing power than the
  initial stage.
   We are trying to understand why only a few map tasks are
  being used, as we are not getting the full advantage of our cluster.
 
 
 
 
  

Accessing HDFS

2011-01-07 Thread maha
Hi everyone,

  I'm wondering if there is a way of doing the following against HDFS ...

   File LocalinputDir = new File("/user/maha/inputDir");
   String[] file = LocalinputDir.list();

I'm given Hadoop and an input directory with files {f1, f2, ...}. I would like
Hadoop to open the directory and list the file names. How can I do that?
I know I can copyToLocal the directory and use the regular local File.list()
to see the file names, but the other way is faster.

  Thanks, 

Maha



Re: Help: How to increase amount of map tasks per job?

2011-01-07 Thread Niels Basjes
You said you have a large amount of data.
How large is that approximately?
Did you compress the intermediate data (with what codec)?

Niels

2011/1/7 Tali K ncherr...@hotmail.com:

 According to the documentation, that parameter is for the number of
    tasks *per TaskTracker*.  I am asking about the number of tasks
    for the entire job and entire cluster.  That parameter is already
    set to 3, which is one less than the number of cores on each node's
    CPU, as recommended.In my question I stated   that
    82 tasks were run for the first job, yet only 4 for the second -
    both numbers being cluster-wide.



 Date: Fri, 7 Jan 2011 13:19:42 -0800
 Subject: Re: Help: How to increase amount of map tasks per job?
 From: yuzhih...@gmail.com
 To: common-user@hadoop.apache.org

 Set higher values for mapred.tasktracker.map.tasks.maximum (and
 mapred.tasktracker.reduce.tasks.maximum) in mapred-site.xml

 On Fri, Jan 7, 2011 at 12:58 PM, Tali K ncherr...@hotmail.com wrote:

 
 
 
 
  We have a jobs which runs in several map/reduce stages.  In the first job,
  a large number of map tasks -82  are initiated, as expected.
  And that cause all nodes to be used.
   In a
  later job, where we are still dealing with large amounts of
   data, only 4 map tasks are initiated, and that caused to use only 4 nodes.
  This stage is actually the
  workhorse of the job, and requires much more processing power than the
  initial stage.
   We are trying to understand why only a few map tasks are
  being used, as we are not getting the full advantage of our cluster.
 
 
 
 




-- 
Kind regards,

Niels Basjes
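
For reference, the intermediate (map output) compression Niels asks about can
be switched on per job. A minimal sketch with the 0.20 JobConf API follows;
the driver class name is a placeholder and GzipCodec is just one possible
codec.

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class CompressedShuffleDriver {
    public static void main(String[] args) {
        JobConf conf = new JobConf(CompressedShuffleDriver.class);
        conf.setCompressMapOutput(true);                    // compress intermediate map output
        conf.setMapOutputCompressorClass(GzipCodec.class);  // example codec; others are possible
        // ... set input/output paths, mapper and reducer as usual,
        // then submit the job, e.g. with JobClient.runJob(conf).
    }
}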


Re: Help: How to increase amount of map tasks per job?

2011-01-07 Thread Ted Yu
Check out mapred.map.tasks and mapred.reduce.tasks

On Fri, Jan 7, 2011 at 1:40 PM, Tali K ncherr...@hotmail.com wrote:


 According to the documentation, that parameter is for the number of
tasks *per TaskTracker*.  I am asking about the number of tasks
for the entire job and entire cluster.  That parameter is already
set to 3, which is one less than the number of cores on each node's
CPU, as recommended.In my question I stated   that
82 tasks were run for the first job, yet only 4 for the second -
both numbers being cluster-wide.



  Date: Fri, 7 Jan 2011 13:19:42 -0800
  Subject: Re: Help: How to increase amount of map tasks per job?
  From: yuzhih...@gmail.com
  To: common-user@hadoop.apache.org
 
  Set higher values for mapred.tasktracker.map.tasks.maximum (and
  mapred.tasktracker.reduce.tasks.maximum) in mapred-site.xml
 
  On Fri, Jan 7, 2011 at 12:58 PM, Tali K ncherr...@hotmail.com wrote:
 
  
  
  
  
   We have a jobs which runs in several map/reduce stages.  In the first
 job,
   a large number of map tasks -82  are initiated, as expected.
   And that cause all nodes to be used.
In a
   later job, where we are still dealing with large amounts of
data, only 4 map tasks are initiated, and that caused to use only 4
 nodes.
   This stage is actually the
   workhorse of the job, and requires much more processing power than the
   initial stage.
We are trying to understand why only a few map tasks are
   being used, as we are not getting the full advantage of our cluster.
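
A brief sketch (old 0.20 mapred API; the driver class name and numbers are
placeholders) of setting mapred.map.tasks and mapred.reduce.tasks per job
rather than per TaskTracker. Note that mapred.map.tasks / setNumMapTasks() is
only a hint to the framework (the actual map task count is determined by the
input splits), while the reduce count is honoured as given.

import org.apache.hadoop.mapred.JobConf;

public class TuneTaskCountsDriver {
    public static void main(String[] args) {
        JobConf conf = new JobConf(TuneTaskCountsDriver.class);
        conf.setNumMapTasks(82);     // hint only: equivalent to mapred.map.tasks
        conf.setNumReduceTasks(16);  // honoured: equivalent to mapred.reduce.tasks
        // ... set input/output paths, mapper and reducer, then submit the job.
    }
}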
  
  
  
  




Re: Accessing HDFS

2011-01-07 Thread Jacob R Rideout
  I'm wondering if there is a way to doing the following commands to  HDFS  ...

   File LocalinputDir = new File (/user/maha/inputDir);
   String[] file = LocalinputDir.list();

 I'm given Hadoop and input directory with files {f1,f2 ..}. I would like for 
 hadoop to open the directory and list the file names .. how can I do that?
 or  I know I can copyToLocal the directory and use the regular-local 
 fs.List() to see the file names . But the other way is faster,


You could do something like:

Path f = new Path("hdfs://user/maha/inputDir");
FileSystem fs = f.getFileSystem(new Configuration());
for (FileStatus s : fs.listStatus(f)) {
  System.out.println(s.getPath().getName());
}


Jacob RIDEOUT
Software Engineer
Return Path, Inc
Skype:   jrideout
Twitter: @jrideout
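
A self-contained variant of the snippet above, in case the imports and
configuration are useful (the class name and directory path are
placeholders). With fs.default.name pointing at the cluster, a plain
/user/... path resolves on HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDir {
    public static void main(String[] args) throws Exception {
        // Picks up fs.default.name from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path dir = new Path("/user/maha/inputDir");   // placeholder directory
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath().getName());
        }
    }
}

Once the class is on the classpath it can be launched with the hadoop script,
e.g. "hadoop ListHdfsDir".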


Re: Accessing HDFS

2011-01-07 Thread maha
Nice! I'd better try that. So the trick is just to add hdfs:// to the path to
access that namespace.

 Thanks a ton :)
 Maha

On Jan 7, 2011, at 1:55 PM, Jacob R Rideout wrote:

  I'm wondering if there is a way to doing the following commands to  HDFS  
 ...
 
   File LocalinputDir = new File (/user/maha/inputDir);
   String[] file = LocalinputDir.list();
 
 I'm given Hadoop and input directory with files {f1,f2 ..}. I would like for 
 hadoop to open the directory and list the file names .. how can I do that?
 or  I know I can copyToLocal the directory and use the regular-local 
 fs.List() to see the file names . But the other way is faster,
 
 
 You could do something like:
 
  Path f = new Path("hdfs://user/maha/inputDir");
 FileSystem fs = f.getFileSystem(new Configuration());
 for(FileStatus s : fs.listStatus(f)) {
  System.out.println(s.getPath().getName());
 }
 
 
 Jacob RIDEOUT
 Software Engineer
 Return Path, Inc
 Skype:   jrideout
 Twitter: @jrideout



Re: Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

2011-01-07 Thread Sonal Goyal
Which Hadoop versions are you testing and compiling against?

Thanks and Regards,
Sonal
Connect Hadoop with databases, Salesforce, FTP servers and others:
https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co

http://in.linkedin.com/in/sonalgoyal





On Wed, Jan 5, 2011 at 3:20 PM, Cavus,M.,Fa. Post Direkt 
m.ca...@postdirekt.de wrote:

 Hi,
 I get the output below; does anyone know why I get this error?


 11/01/05 10:46:55 WARN conf.Configuration: fs.checkpoint.period is
 deprecated. Instead, use dfs.namenode.checkpoint.period
 11/01/05 10:46:55 WARN conf.Configuration: mapred.map.tasks is
 deprecated. Instead, use mapreduce.job.maps
 11/01/05 10:46:55 INFO mapreduce.JobSubmitter: number of splits:1
 11/01/05 10:46:55 INFO mapreduce.JobSubmitter: adding the following
 namenodes' delegation tokens:null
 11/01/05 10:46:56 INFO mapreduce.Job: Running job: job_201101051016_0008
 11/01/05 10:46:57 INFO mapreduce.Job:  map 0% reduce 0%
 11/01/05 10:47:04 INFO mapreduce.Job:  map 100% reduce 0%
 11/01/05 10:47:13 INFO mapreduce.Job: Task Id :
 attempt_201101051016_0008_r_00_0, Status : FAILED
 Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext,
 but class was expected
 11/01/05 10:47:23 INFO mapreduce.Job: Task Id :
 attempt_201101051016_0008_r_00_1, Status : FAILED
 Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext,
 but class was expected
 11/01/05 10:47:34 INFO mapreduce.Job: Task Id :
 attempt_201101051016_0008_r_00_2, Status : FAILED
 Error: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext,
 but class was expected
 11/01/05 10:47:47 INFO mapreduce.Job: Job complete:
 job_201101051016_0008
 11/01/05 10:47:47 INFO mapreduce.Job: Counters: 19
FileSystemCounters
FILE_BYTES_WRITTEN=38
HDFS_BYTES_READ=69
Job Counters
Data-local map tasks=1
Total time spent by all maps waiting after reserving
 slots (ms)=0
Total time spent by all reduces waiting after reserving
 slots (ms)=0
Failed reduce tasks=1
SLOTS_MILLIS_MAPS=5781
SLOTS_MILLIS_REDUCES=6379
Launched map tasks=1
Launched reduce tasks=4
Map-Reduce Framework
Combine input records=0
Failed Shuffles=0
GC time elapsed (ms)=97
Map input records=0
Map output bytes=0
Map output records=0
Merged Map outputs=0
Spilled Records=0
SPLIT_RAW_BYTES=69
 11/01/05 10:47:47 INFO zookeeper.ZooKeeper: Session: 0x12d555a4ed80018
 closed




Re: Help: How to increase amount of map tasks per job?

2011-01-07 Thread Harsh J
It would depend on your input format. If the job is using an
InputFormat that does not let it split files, you would get only
mappers == no. of files. For splittable input files, you get mappers >=
no. of files. A little more information on what the input format is
would help track down the problem a bit more.

On Sat, Jan 8, 2011 at 3:10 AM, Tali K ncherr...@hotmail.com wrote:

 According to the documentation, that parameter is for the number of
    tasks *per TaskTracker*.  I am asking about the number of tasks
    for the entire job and entire cluster.  That parameter is already
    set to 3, which is one less than the number of cores on each node's
    CPU, as recommended.In my question I stated   that
    82 tasks were run for the first job, yet only 4 for the second -
    both numbers being cluster-wide.



 Date: Fri, 7 Jan 2011 13:19:42 -0800
 Subject: Re: Help: How to increase amount of map tasks per job?
 From: yuzhih...@gmail.com
 To: common-user@hadoop.apache.org

 Set higher values for mapred.tasktracker.map.tasks.maximum (and
 mapred.tasktracker.reduce.tasks.maximum) in mapred-site.xml

 On Fri, Jan 7, 2011 at 12:58 PM, Tali K ncherr...@hotmail.com wrote:

 
 
 
 
  We have a jobs which runs in several map/reduce stages.  In the first job,
  a large number of map tasks -82  are initiated, as expected.
  And that cause all nodes to be used.
   In a
  later job, where we are still dealing with large amounts of
   data, only 4 map tasks are initiated, and that caused to use only 4 nodes.
  This stage is actually the
  workhorse of the job, and requires much more processing power than the
  initial stage.
   We are trying to understand why only a few map tasks are
  being used, as we are not getting the full advantage of our cluster.
 
 
 
 




-- 
Harsh J
www.harshj.com
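
To illustrate the non-splittable case described above, here is a minimal
sketch (the class name is illustrative) of an input format that produces
exactly one map task per input file:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class OneMapperPerFileInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;   // each input file becomes a single split, hence a single map task
    }
}

With a splittable format such as the stock TextInputFormat, the number of map
tasks is instead driven by the split size (block size and the
mapred.min.split.size / mapred.max.split.size settings), so a few large files
can still yield many map tasks.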