Re: Indexed Hashtables

2009-01-14 Thread Sean Shanny

Delip,

So far we have had pretty good luck with memcached.  We are building a
Hadoop-based solution for data warehouse ETL on XML-based log files
that represent clickstream data on steroids.


We process about 34 million records, or about 70 GB of data, a day.  We
have to process dimensional data in our warehouse and then load the
surrogate key pairs into memcached so we can traverse the XML
files once again to perform the substitutions.  We are using the
memcached solution because it scales out just like hadoop.  We will
have code that allows us to fall back to the DB if the memcached
lookup fails, but that should not happen too often.
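
Something along these lines covers the fallback path (a rough sketch assuming
the spymemcached client; class and helper names below are illustrative, not our
actual code):

import java.io.IOException;
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;

public class SurrogateKeyLookup {
  private final MemcachedClient cache;

  public SurrogateKeyLookup(String host, int port) throws IOException {
    this.cache = new MemcachedClient(new InetSocketAddress(host, port));
  }

  /** Look up a surrogate key, falling back to the warehouse DB on a cache miss. */
  public long lookup(String naturalKey) {
    Object cached = cache.get(naturalKey);              // memcached lookup
    if (cached != null) {
      return Long.parseLong(cached.toString());
    }
    long surrogate = lookupFromDb(naturalKey);          // hypothetical DB fallback
    cache.set(naturalKey, 0, Long.toString(surrogate)); // repopulate the cache
    return surrogate;
  }

  private long lookupFromDb(String naturalKey) {
    // Placeholder for the JDBC query against the dimension table.
    throw new UnsupportedOperationException("DB fallback not shown");
  }
}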


Thanks.

--sean

Sean Shanny
ssha...@tripadvisor.com




On Jan 14, 2009, at 9:47 PM, Delip Rao wrote:


Hi,

I need to look up a large number of key/value pairs in my map(). Is
there any indexed hashtable available as part of the Hadoop I/O API?
I find HBase an overkill for my application; something along the lines of
HashStore (www.cellspark.com/hashstore.html) should be fine.

Thanks,
Delip




Re: TestDFSIO delivers bad values of "throughput" and "average IO rate"

2009-01-14 Thread Konstantin Shvachko

In TestDFSIO we want each task to create only one file.
It is a one-to-one mapping from files to map tasks.
And splits are defined so that each map gets only
one file name, which it creates or reads.

--Konstantin

tienduc_dinh wrote:

I don't understand why the parameter -nrFiles of TestDFSIO should override
mapred.map.tasks.
nrFiles is the number of files that will be created, and
mapred.map.tasks is the number of splits that will be made from the input
file.

Thanks


Konstantin Shvachko wrote:

Hi tienduc_dinh,

Just a bit of background, which should help answer your questions.
TestDFSIO mappers perform one operation (read or write) each, measure
the time taken by the operation and output the following three values:
(I am intentionally omitting some other output stuff.)
- size(i)
- time(i)
- rate(i) = size(i) / time(i)
i is the index of the map task 0 <= i < N, and N is the "-nrFiles" value,
which equals the number of maps.

Then the reduce sums those values and writes them into "part-0".
That is, you get three fields in it:
size = size(0) + ... + size(N-1)
time = time(0) + ... + time(N-1)
rate = rate(0) + ... + rate(N-1)

Then we calculate
throughput = size / time
averageIORate = rate / N
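
In code, that last step looks roughly like this (a sketch with illustrative
names, not the actual TestDFSIO source):

public class IOSummary {
  static void summarize(double sizeSum, double timeSum, double rateSum, int nrFiles) {
    double throughput = sizeSum / timeSum;      // size(0)+...+size(N-1) over time(0)+...+time(N-1)
    double averageIORate = rateSum / nrFiles;   // divide the summed rates by N, do not multiply
    System.out.println("throughput    = " + throughput);
    System.out.println("averageIORate = " + averageIORate);
  }

  public static void main(String[] args) {
    // Made-up numbers for a single map task (-nrFiles 1).
    summarize(2048.0, 65.0, 31.5, 1);
  }
}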

So, answering your questions:
- There should be only one reduce task, otherwise you will have to
manually sum the corresponding values in "part-0" and "part-1".
- The value of "rate" after the reduce equals the sum of the individual
rates of each operation. So if you want an average you should
divide it by the number of tasks rather than multiply.

Now, in your case you create only one file "-nrFiles 1", which means
you run only one map task.
Setting "mapred.map.tasks" to 10 in hadoop-site.xml defines the default
number of tasks per job. See here
http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.map.tasks
In the case of TestDFSIO it is overridden by "-nrFiles".

Hope this answers your questions.
Thanks,
--Konstantin



tienduc_dinh wrote:

Hello,

I'm now using hadoop-0.18.0 and testing it on a cluster with 1 master and 4
slaves. In hadoop-site.xml the value of "mapred.map.tasks" is 10. Because
the values of "throughput" and "average IO rate" are similar, I just post the
"throughput" values from running the same command 3 times:

- > hadoop-0.18.0/bin/hadoop jar testDFSIO.jar -write -fileSize 2048 -nrFiles 1

+ with "dfs.replication = 1" => 33,60 / 31,48 / 30,95

+ with "dfs.replication = 2" => 26,40 / 20,99 / 21,70

I find something strange while reading the source code. 

- The value of mapred.reduce.tasks is always set to 1 


job.setNumReduceTasks(1) in the function runIOTest() and reduceFile =
new Path(WRITE_DIR, "part-0") in analyzeResult().

So I think that if we had mapred.reduce.tasks = 2, we would have
2 paths on the file system, "part-0" and "part-1", e.g.
/benchmarks/TestDFSIO/io_write/part-0

- And I don't understand the line with "double med = rate / 1000 / tasks".
Shouldn't it be "double med = rate * tasks / 1000"?






Indexed Hashtables

2009-01-14 Thread Delip Rao
Hi,

I need to look up a large number of key/value pairs in my map(). Is
there any indexed hashtable available as part of the Hadoop I/O API?
I find HBase an overkill for my application; something along the lines of
HashStore (www.cellspark.com/hashstore.html) should be fine.

Thanks,
Delip


Re: Merging reducer outputs into a single part-00000 file

2009-01-14 Thread Jim Twensky
Owen and Rasit,

Thank you for the responses. I've figured out that mapred.reduce.tasks was set
to 1 in my hadoop-default.xml and I didn't override it in my
hadoop-site.xml configuration file.

Jim

On Wed, Jan 14, 2009 at 11:23 AM, Owen O'Malley  wrote:

> On Jan 14, 2009, at 12:46 AM, Rasit OZDAS wrote:
>
>  Jim,
>>
>> As far as I know, there is no operation done after Reducer.
>>
>
> Correct, other than output promotion, which moves the output file to the
> final filename.
>
>  But if you  are a little experienced, you already know these.
>> Ordered list means one final file, or am I missing something?
>>
>
> There is no value and a lot of cost associated with creating a single file
> for the output. The question is how you want the keys divided between the
> reduces (and therefore output files). The default partitioner hashes the key
> and mods by the number of reduces, which "stripes" the keys across the
> output files. You can use the mapred.lib.InputSampler to generate good
> partition keys and mapred.lib.TotalOrderPartitioner to get completely sorted
> output based on the partition keys. With the total order partitioner, each
> reduce gets an increasing range of keys and thus has all of the nice
> properties of a single file without the costs.
>
> -- Owen
>


Re: RAID vs. JBOD

2009-01-14 Thread Runping Qi

Hi,

We at Yahoo did some Hadoop benchmarking experiments on clusters with JBOD
and RAID0. We found that under heavy loads (such as gridmix), JBOD cluster
performed better.

Gridmix tests:

Load: gridmix2
Cluster size: 190 nodes
Test results:

RAID0: 75 minutes
JBOD:  67 minutes
Difference: 10%

Tests on HDFS write performance

We ran map-only jobs writing data to DFS concurrently on different clusters.
The overall DFS write throughput on the JBOD cluster was 30% better (with a
58-node cluster) and 50% better (with an 18-node cluster) than on the RAID0
cluster, respectively.

To understand why, we did some file-level benchmarking on both clusters.
We found that the file write throughput on a JBOD machine is 30% higher than
that on a comparable machine with RAID0. This performance difference may be
explained by the fact that the throughputs of different disks can vary by 30%
to 50%. With such variations, the overall throughput of a RAID0 system may
be bottlenecked by the slowest disk.
 

-- Runping





On 1/11/09 1:23 PM, "David B. Ritch"  wrote:

> How well does Hadoop handle multiple independent disks per node?
> 
> I have a cluster with 4 identical disks per node.  I plan to use one
> disk for OS and temporary storage, and dedicate the other three to
> HDFS.  Our IT folks have some disagreement as to whether the three disks
> should be striped, or treated by HDFS as three independent disks.  Could
> someone with more HDFS experience comment on the relative advantages and
> disadvantages to each approach?
> 
> Here are some of my thoughts.  It's a bit easier to manage a 3-disk
> striped partition, and we wouldn't have to worry about balancing files
> between them.  Single-file I/O should be considerably faster.  On the
> other hand, I would expect typical use to require multiple file reads
> or writes simultaneously.  I would expect Hadoop to be able to manage
> read/write to/from the disks independently.  Managing 3 streams to 3
> independent devices would likely result in less disk head movement, and
> therefore better performance.  I would expect Hadoop to be able to
> balance load between the disks fairly well.  Availability doesn't really
> differentiate between the two approaches - if a single disk dies, the
> striped array would go down, but all its data should be replicated on
> another datanode, anyway.  And besides, I understand that the datanode will
> shut itself down, even if only one of its 3 independent disks crashes.
> 
> So - any one want to agree or disagree with these thoughts?  Anyone have
> any other ideas, or - better - benchmarks and experience with layouts
> like these two?
> 
> Thanks!
> 
> David
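
For reference, exposing several disks to Hadoop independently is just a matter
of listing one directory per disk in the configuration; a hadoop-site.xml
sketch (paths are examples only):

<property>
  <name>dfs.data.dir</name>
  <value>/disk1/hdfs/data,/disk2/hdfs/data,/disk3/hdfs/data</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/disk1/mapred/local,/disk2/mapred/local,/disk3/mapred/local</value>
</property>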



Re: getting null from CompressionCodecFactory.getCodec(Path file)

2009-01-14 Thread Chris Douglas
I got it. For some reason getDefaultExtension() returns  
".lzo_deflate".


Is that a bug? Shouldn't it be .lzo?


The .lzo suffix is reserved for lzop (LzopCodec). LzoCodec doesn't  
generate compatible output, hence "lzo_deflate". -C
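
A minimal sketch of how the factory resolves codecs by extension (assuming
LzoCodec is on the classpath and listed in io.compression.codecs):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookup {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // The factory only knows the codecs listed here; LzoCodec registers ".lzo_deflate".
    conf.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.DefaultCodec,"
        + "org.apache.hadoop.io.compress.GzipCodec,"
        + "org.apache.hadoop.io.compress.LzoCodec");
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);

    // getCodec() matches on the file suffix, so .lzo_deflate resolves but .lzo does not.
    CompressionCodec found = factory.getCodec(new Path("test.lzo_deflate"));
    CompressionCodec notFound = factory.getCodec(new Path("test.lzo"));
    System.out.println(found);     // LzoCodec
    System.out.println(notFound);  // null unless a codec registering .lzo (e.g. LzopCodec) is added
  }
}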





In the head revision I couldn't find it at all in
http://svn.apache.org/repos/asf/hadoop/core/trunk/src/core/org/apache/hadoop/io/compress/

There should be a Class LzoCodec.java. Was that moved to somewhere  
else?


Gert

Gert Pfeifer wrote:

Arun C Murthy wrote:

On Jan 13, 2009, at 7:29 AM, Gert Pfeifer wrote:


Hi,
I want to use an lzo file as input for a mapper. The record reader
determines the codec using a CompressionCodecFactory, like this:

(Hadoop version 0.19.0)


http://hadoop.apache.org/core/docs/r0.19.0/native_libraries.html


I should have mentioned that I have these native libs running:
2009-01-14 10:00:21,107 INFO org.apache.hadoop.util.NativeCodeLoader:
Loaded the native-hadoop library
2009-01-14 10:00:21,111 INFO org.apache.hadoop.io.compress.LzoCodec:
Successfully loaded & initialized native-lzo library

Is that what you tried to point out with this link?

Gert


hth,
Arun


compressionCodecs = new CompressionCodecFactory(job);
System.out.println("Using codecFactory: " + compressionCodecs.toString());

final CompressionCodec codec = compressionCodecs.getCodec(file);
System.out.println("Using codec: " + codec + " for file " + file.getName());



The output that I get is:

Using codecFactory: { etalfed_ozl.:
org.apache.hadoop.io.compress.LzoCodec }
Using codec: null for file test.lzo

Of course, the mapper does not work without codec. What could be the
problem?

Thanks,
Gert




Re: General questions about Map-Reduce

2009-01-14 Thread tienduc_dinh

I got it ...

Thanks to all

Cheers,
Duc
-- 
View this message in context: 
http://www.nabble.com/General-questions-about-Map-Reduce-tp21399361p21461628.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: libhdfs append question

2009-01-14 Thread Craig Macdonald

Tamas,

There is a patch attached to the issue, which you should be able to
apply to get O_APPEND.


https://issues.apache.org/jira/browse/HADOOP-4494

Craig

Tamás Szokol wrote:


Hi!


I'm using the latest stable 0.19.0 version of hadoop. I'd like to try
the new append functionality. Is it available from libhdfs? I didn't
find it in the hdfs.h interface or in the hdfs.c implementation.


I saw the hdfsOpenFile's new flag O_APPEND in: HADOOP-4494
(http://mail-archives.apache.org/mod_mbox/hadoop-core-dev/200810.mbox/%3c270645717.1224722204314.javamail.j...@brutus%3e),
but I still don't find it in the latest release.

Is it available as a patch, or maybe only available in the svn repository?

Could you please give me some pointers how to use the append 
functionality from libhdfs?

Thank you in advance!

Cheers,
Tamas



 





Re: Merging reducer outputs into a single part-00000 file

2009-01-14 Thread Owen O'Malley

On Jan 14, 2009, at 12:46 AM, Rasit OZDAS wrote:


Jim,

As far as I know, there is no operation done after Reducer.


Correct, other than output promotion, which moves the output file to  
the final filename.



But if you  are a little experienced, you already know these.
Ordered list means one final file, or am I missing something?


There is no value and a lot of cost associated with creating a single  
file for the output. The question is how you want the keys divided  
between the reduces (and therefore output files). The default  
partitioner hashes the key and mods by the number of reduces, which  
"stripes" the keys across the output files. You can use the  
mapred.lib.InputSampler to generate good partition keys and  
mapred.lib.TotalOrderPartitioner to get completely sorted output based  
on the partition keys. With the total order partitioner, each reduce  
gets an increasing range of keys and thus has all of the nice  
properties of a single file without the costs.
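
A rough sketch of that setup with the 0.19 API, assuming Text keys and values
(the sampling parameters and partition file location are illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

public class TotalOrderSetup {
  public static void configure(JobConf job) throws Exception {
    // Sample up to 10,000 keys from at most 10 splits to pick partition boundaries.
    InputSampler.RandomSampler<Text, Text> sampler =
        new InputSampler.RandomSampler<Text, Text>(0.01, 10000, 10);

    // Tell the partitioner where its partition file lives, then write it.
    TotalOrderPartitioner.setPartitionFile(job, new Path("/tmp/partitions.lst"));
    InputSampler.writePartitionFile(job, sampler);

    // Each reduce now gets a contiguous, increasing range of keys.
    job.setPartitionerClass(TotalOrderPartitioner.class);
  }
}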


-- Owen


Hello, world for Hadoop + Lucene

2009-01-14 Thread John Howland
Howdy!

Is there any sort of "Hello, world!" example for building a Lucene
index with Hadoop? I am looking through the source in contrib/index
and it is a bit beyond me at the moment. Alternatively, is there more
documentation related to the contrib/index example code?

There seems to be a lot of information out on the tubes about how to
distribute indices and query them (e.g. Katta). Nutch obviously also
comes up, but it is not clear to me how to come to grips with Nutch,
and I'm not interested in web crawling. What I'm looking for is a
simple example for the hadoop/lucene newbie where you can take a
String or a Text object and index it as a document. If,
understandably, such an example does not exist, any pointers from the
experts would be appreciated. I don't care as much about real-world
usage/performance as I do about pedagogical code that can serve as a
base for learning, just to give me a toehold.

Many thanks,

John


Does the logging properties still work?

2009-01-14 Thread Wu Wei

Hi all,

In the hadoop-default.xml file, I found the properties hadoop.logfile.size and
hadoop.logfile.count. I didn't change the default settings, but the log
files still grew beyond the limit. Do these logging properties still
work in hadoop 0.19.0?


btw: I found that the class org.apache.hadoop.util.LogFormatter, which used the
property hadoop.logfile.size, went missing in hadoop 0.9 after LogFormatter was
replaced by log4j.


-Wei


HOD: cluster allocation doesn't work

2009-01-14 Thread Thomas Kobienia
Hi,

We use a hadoop cluster with version 0.18.2. The static hdfs and mapred
system works fine. Sometimes tasktracker child processes don't finish
after job completion, but that is another problem.

My present issue is hod. I've set up torque and hod according to the online
documentation. The torque service runs on all cluster nodes and works
fine. I want to use hod with an external hdfs. I run 'hod allocate -d
~/hod-clusters/test -n 4' to allocate a mapred cluster. According to the
log messages it finds the external hdfs but doesn't start the jobtracker
and tasktracker.

I would appreciate any pointers on this problem.

Thanks,
Thomas

[2009-01-14 11:35:15,885] DEBUG/10 torque:54 - qsub stdin: #!/bin/sh
[2009-01-14 11:35:15,891] DEBUG/10 torque:54 - qsub stdin:
/opt/hadoop/hod-0.18.2/bin/ringmaster
--hodring.tarball-retry-initial-time 1.0
--hodring.cmd-retry-initial-time 2.0 --hodring.log-destination-uri
hdfs://hn01.berlin.semgine.com:3/hod/logs --hodring.log-dir
/opt/hadoop/hod-0.18.2/logs --hodring.temp-dir /tmp/hod --hodring.userid
hadoop --hodring.register --hodring.http-port-range 8000-9000
--hodring.java-home /usr/lib/jvm/java-1.6.0-sun-1.6.0.11
--hodring.tarball-retry-interval 3.0 --hodring.cmd-retry-interval 2.0
--hodring.mapred-system-dir-root /hod/mapredsystem
--hodring.xrs-port-range 32768-65536 --hodring.debug 4 --hodring.pkgs
/opt/hadoop/hadoop-0.18.2 --resource_manager.queue batch
--resource_manager.env-vars "HOD_PYTHON_HOME=/usr/bin/python25"
--resource_manager.id torque --resource_manager.batch-home /usr
--gridservice-hdfs.fs_port 3 --gridservice-hdfs.host
hn01.berlin.semgine.com --gridservice-hdfs.pkgs
/opt/hadoop/hadoop-0.18.2 --gridservice-hdfs.info_port 30030
--gridservice-hdfs.external --ringmaster.http-port-range 8000-9000
--ringmaster.temp-dir /tmp/hod --ringmaster.userid hadoop
--ringmaster.register --ringmaster.max-master-failures 2
--ringmaster.work-dirs /tmp/hod/1,/tmp/hod/2 --ringmaster.svcrgy-addr
hn01.berlin.semgine.com:44223 --ringmaster.log-dir
/opt/hadoop/hod-0.18.2/logs --ringmaster.max-connect 30
--ringmaster.xrs-port-range 32768-65536 --ringmaster.jt-poll-interval
120 --ringmaster.debug 4 --ringmaster.idleness-limit 3600
--gridservice-mapred.tracker_port 0 --gridservice-mapred.host localhost
--gridservice-mapred.pkgs /opt/hadoop/hadoop-0.18.2
--gridservice-mapred.info_port 0
[2009-01-14 11:35:15,904] DEBUG/10 torque:76 - qsub jobid:
38.hn01.berlin.semgine.com
[2009-01-14 11:35:15,912] DEBUG/10 torque:87 - /usr/bin/qstat -f -1
38.hn01.berlin.semgine.com
[2009-01-14 11:35:15,955] DEBUG/10 hadoop:196 - job state R
[2009-01-14 11:35:15,959] INFO/20 hadoop:543 - Cluster Id
38.hn01.berlin.semgine.com
[2009-01-14 11:35:19,996] DEBUG/10 hadoop:547 - Ringmaster at :
http://hn04.berlin.semgine.com:39581/
[2009-01-14 11:35:20,039] INFO/20 hadoop:556 - HDFS UI at
http://hn01.berlin.semgine.com:30030
[2009-01-14 11:35:50,767] DEBUG/10 torque:87 - /usr/bin/qstat -f -1
38.hn01.berlin.semgine.com
[2009-01-14 11:35:50,810] DEBUG/10 hadoop:196 - job state R
[2009-01-14 11:36:21,644] DEBUG/10 torque:87 - /usr/bin/qstat -f -1
38.hn01.berlin.semgine.com
[2009-01-14 11:36:21,684] DEBUG/10 hadoop:196 - job state R

The ringmaster process logs the following messages.

[2009-01-14 11:35:21,614] DEBUG/10 torque:147 - pbsdsh command:
/usr/bin/pbsdsh /opt/hadoop/hod-0.18.2/bin/hodring --hodring.log-dir
/opt/hadoop/hod-0.18.2/logs --hodring.tar
ball-retry-initial-time 1.0 --hodring.cmd-retry-initial-time 2.0
--hodring.cmd-retry-interval 2.0 --hodring.service-id
38.hn01.berlin.semgine.com --hodring.temp-dir /tmp/hod
--hodring.http-port-range 8000-9000 --hodring.userid hadoop
--hodring.java-home /usr/lib/jvm/java-1.6.0-sun-1.6.0.11
--hodring.svcrgy-addr hn01.berlin.semgine.com:44223 --hod
ring.tarball-retry-interval 3.0 --hodring.log-destination-uri
hdfs://hn01.berlin.semgine.com:3/hod/logs
--hodring.mapred-system-dir-root /hod/mapredsystem --hodring.pkgs
/opt/hadoop/hadoop-0.18.2 --hodring.debug 4
--hodring.ringmaster-xrs-addr hn04.berlin.semgine.com:39581
--hodring.xrs-port-range 32768-65536 --hodring.register
[2009-01-14 11:35:21,634] DEBUG/10 ringMaster:479 - getServiceAddr name:
mapred
[2009-01-14 11:35:21,643] DEBUG/10 ringMaster:487 - getServiceAddr
service: 
[2009-01-14 11:35:21,653] DEBUG/10 ringMaster:504 - getServiceAddr addr
mapred: not found
[2009-01-14 11:35:21,657] DEBUG/10 ringMaster:925 - Returned from
runWorkers.
[2009-01-14 11:35:22,087] DEBUG/10 ringMaster:479 - getServiceAddr name:
mapred
[2009-01-14 11:35:22,096] DEBUG/10 ringMaster:487 - getServiceAddr
service: 
[2009-01-14 11:35:22,104] DEBUG/10 ringMaster:504 - getServiceAddr addr
mapred: not found
[2009-01-14 11:35:23,113] DEBUG/10 ringMaster:479 - getServiceAddr name:
mapred
[2009-01-14 11:35:23,120] DEBUG/10 ringMaster:487 - getServiceAddr
service: 
[2009-01-14 11:35:23,127] DEBUG/10 ringMaster:504 - getServiceAddr addr
mapred: not found



Re: Re: getting null from CompressionCodecFactory.getCodec(Path file)

2009-01-14 Thread Tom White
LZO was removed due to license incompatibility:
https://issues.apache.org/jira/browse/HADOOP-4874

Tom

On Wed, Jan 14, 2009 at 11:18 AM, Gert Pfeifer
 wrote:
> I got it. For some reason getDefaultExtension() returns ".lzo_deflate".
>
> Is that a bug? Shouldn't it be .lzo?
>
> In the head revision I couldn't find it at all in
> http://svn.apache.org/repos/asf/hadoop/core/trunk/src/core/org/apache/hadoop/io/compress/
>
> There should be a Class LzoCodec.java. Was that moved to somewhere else?
>
> Gert
>
> Gert Pfeifer wrote:
>> Arun C Murthy wrote:
>>> On Jan 13, 2009, at 7:29 AM, Gert Pfeifer wrote:
>>>
 Hi,
 I want to use an lzo file as input for a mapper. The record reader
 determines the codec using a CompressionCodecFactory, like this:

 (Hadoop version 0.19.0)

>>> http://hadoop.apache.org/core/docs/r0.19.0/native_libraries.html
>>
>> I should have mentioned that I have these native libs running:
>> 2009-01-14 10:00:21,107 INFO org.apache.hadoop.util.NativeCodeLoader:
>> Loaded the native-hadoop library
>> 2009-01-14 10:00:21,111 INFO org.apache.hadoop.io.compress.LzoCodec:
>> Successfully loaded & initialized native-lzo library
>>
>> Is that what you tried to point out with this link?
>>
>> Gert
>>
>>> hth,
>>> Arun
>>>
 compressionCodecs = new CompressionCodecFactory(job);
 System.out.println("Using codecFactory: "+compressionCodecs.toString());
 final CompressionCodec codec = compressionCodecs.getCodec(file);
 System.out.println("Using codec: "+codec+" for file "+file.getName());


 The output that I get is:

 Using codecFactory: { etalfed_ozl.:
 org.apache.hadoop.io.compress.LzoCodec }
 Using codec: null for file test.lzo

 Of course, the mapper does not work without codec. What could be the
 problem?

 Thanks,
 Gert
>


Re: Re: getting null from CompressionCodecFactory.getCodec(Path file)

2009-01-14 Thread Gert Pfeifer
I got it. For some reason getDefaultExtension() returns ".lzo_deflate".

Is that a bug? Shouldn't it be .lzo?

In the head revision I couldn't find it at all in
http://svn.apache.org/repos/asf/hadoop/core/trunk/src/core/org/apache/hadoop/io/compress/

There should be a Class LzoCodec.java. Was that moved to somewhere else?

Gert

Gert Pfeifer wrote:
> Arun C Murthy wrote:
>> On Jan 13, 2009, at 7:29 AM, Gert Pfeifer wrote:
>>
>>> Hi,
>>> I want to use an lzo file as input for a mapper. The record reader
>>> determines the codec using a CompressionCodecFactory, like this:
>>>
>>> (Hadoop version 0.19.0)
>>>
>> http://hadoop.apache.org/core/docs/r0.19.0/native_libraries.html
> 
> I should have mentioned that I have these native libs running:
> 2009-01-14 10:00:21,107 INFO org.apache.hadoop.util.NativeCodeLoader:
> Loaded the native-hadoop library
> 2009-01-14 10:00:21,111 INFO org.apache.hadoop.io.compress.LzoCodec:
> Successfully loaded & initialized native-lzo library
> 
> Is that what you tried to point out with this link?
> 
> Gert
> 
>> hth,
>> Arun
>>
>>> compressionCodecs = new CompressionCodecFactory(job);
>>> System.out.println("Using codecFactory: "+compressionCodecs.toString());
>>> final CompressionCodec codec = compressionCodecs.getCodec(file);
>>> System.out.println("Using codec: "+codec+" for file "+file.getName());
>>>
>>>
>>> The output that I get is:
>>>
>>> Using codecFactory: { etalfed_ozl.:
>>> org.apache.hadoop.io.compress.LzoCodec }
>>> Using codec: null for file test.lzo
>>>
>>> Of course, the mapper does not work without codec. What could be the
>>> problem?
>>>
>>> Thanks,
>>> Gert


Re: Re: getting null from CompressionCodecFactory.getCodec(Path file)

2009-01-14 Thread Gert Pfeifer
Arun C Murthy wrote:
> 
> On Jan 13, 2009, at 7:29 AM, Gert Pfeifer wrote:
> 
>> Hi,
>> I want to use an lzo file as input for a mapper. The record reader
>> determines the codec using a CompressionCodecFactory, like this:
>>
>> (Hadoop version 0.19.0)
>>
> 
> http://hadoop.apache.org/core/docs/r0.19.0/native_libraries.html

I should have mentioned that I have these native libs running:
2009-01-14 10:00:21,107 INFO org.apache.hadoop.util.NativeCodeLoader:
Loaded the native-hadoop library
2009-01-14 10:00:21,111 INFO org.apache.hadoop.io.compress.LzoCodec:
Successfully loaded & initialized native-lzo library

Is that what you tried to point out with this link?

Gert

> 
> hth,
> Arun
> 
>> compressionCodecs = new CompressionCodecFactory(job);
>> System.out.println("Using codecFactory: "+compressionCodecs.toString());
>> final CompressionCodec codec = compressionCodecs.getCodec(file);
>> System.out.println("Using codec: "+codec+" for file "+file.getName());
>>
>>
>> The output that I get is:
>>
>> Using codecFactory: { etalfed_ozl.:
>> org.apache.hadoop.io.compress.LzoCodec }
>> Using codec: null for file test.lzo
>>
>> Of course, the mapper does not work without codec. What could be the
>> problem?
>>
>> Thanks,
>> Gert


Re: Dynamic Node Removal and Addition

2009-01-14 Thread Rasit OZDAS
Hi Alyssa,

http://markmail.org/message/jyo4wssouzlb4olm#query:%22Decommission%20of%20datanodes%22+page:1+mid:p2krkt6ebysrsrpl+state:results
as pointed out here, decommissioning (removal) of datanodes was not an easy job
as of version 0.12.
I strongly suspect it's still not easy.
As far as I know, a node should be used both as a datanode and a tasktracker,
so the performance loss will possibly be far greater than the performance gain
of your design.
My solution would be to keep using them as datanodes and to change the
TaskTracker code a little bit, so that they won't be used for jobs. That code
change should be easy, I assume.

Hope this helps,
Rasit

2009/1/12 Hargraves, Alyssa :
> Hello everyone,
>
> I have a question and was hoping some on the mailinglist could offer some
> pointers. I'm working on a project with another student and for part of this
> project we are trying to create something that will allow nodes to be added
> and removed from the hadoop cluster at will.  The goal is to have the nodes
> run a program that gives the user the freedom to add or remove themselves
> from the cluster to take advantage of a workstation when the user leaves (or
> if they'd like it running anyway when they're at the PC).  This would be on
> Windows computers of various different OSes.
>
> From what we can find, hadoop does not already support this feature, but
> it does seem to support dynamically adding nodes and removing nodes in other
> ways.  For example, to add a node, one would have to make sure hadoop is set
> up on the PC along with cygwin, Java, and ssh, but after that initial setup
> it's just a matter of adding the PC to the conf/slaves file, making sure the
> node is not listed in the exclude file, and running the start datanode and
> start tasktracker commands from the node you are adding (basically described
> in FAQ item 25).  To remove a node, it seems to be just a matter of adding
> it to dfs.hosts.exclude and refreshing the list of nodes (described in
> hadoop FAQ 17).
>
> Our question is whether or not a simple interface for this already exists,
> and whether or not anyone sees any potential flaws with how we are planning
> to accomplish these tasks.  From our research we were not able to find
> anything that already exists for this purpose, but we find it surprising
> that an interface for this would not already exist.  We welcome any
> comments, recommendations, and insights anyone might have for accomplishing
> this task.
>
> Thank you,
> Alyssa Hargraves
> Patrick Crane
> WPI Class of 2009
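
For reference, a sketch of the commands behind the add/remove steps described
in the quoted message (assuming the stock 0.18/0.19 scripts and that
dfs.hosts.exclude is already configured):

  # On the workstation being added, after listing it in conf/slaves:
  bin/hadoop-daemon.sh start datanode
  bin/hadoop-daemon.sh start tasktracker

  # To remove it later: add its hostname to the dfs.hosts.exclude file,
  # then ask the namenode to re-read the list:
  bin/hadoop dfsadmin -refreshNodes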



-- 
M. Raşit ÖZDAŞ


Namenode freeze

2009-01-14 Thread Sagar Naik

Hi
A datanode goes down, and then it looks like the ReplicationMonitor tries to
even out the replication.

However, while doing so
it holds the lock on FSNamesystem.
With this lock held, other threads wait on this lock to respond.
As a result, the namenode does not list dirs and the Web UI does not respond.

I would appreciate any pointers on this problem?

(Hadoop .18.1)

-Sagar


Namenode freeze stackdump :


2009-01-14 00:57:02
Full thread dump Java HotSpot(TM) 64-Bit Server VM (10.0-b23 mixed mode):

"SocketListener0-4" prio=10 tid=0x2aac54008000 nid=0x644d in 
Object.wait() [0x4535a000..0x4535aa80]

  java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on <0x2aab6cb1dba0> (a 
org.mortbay.util.ThreadPool$PoolThread)

   at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:522)
   - locked <0x2aab6cb1dba0> (a org.mortbay.util.ThreadPool$PoolThread)

  Locked ownable synchronizers:
   - None

"SocketListener0-5" prio=10 tid=0x2aac54008c00 nid=0x63f1 in 
Object.wait() [0x4545b000..0x4545bb00]

  java.lang.Thread.State: TIMED_WAITING (on object monitor)
   at java.lang.Object.wait(Native Method)
   - waiting on <0x2aab6c2ea1a8> (a 
org.mortbay.util.ThreadPool$PoolThread)

   at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:522)
   - locked <0x2aab6c2ea1a8> (a org.mortbay.util.ThreadPool$PoolThread)

  Locked ownable synchronizers:
   - None

"Trash Emptier" daemon prio=10 tid=0x511ca400 nid=0x1fd waiting 
on condition [0x45259000..0x45259a00]

  java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
   at org.apache.hadoop.fs.Trash$Emptier.run(Trash.java:219)
   at java.lang.Thread.run(Thread.java:619)

  Locked ownable synchronizers:
   - None

"org.apache.hadoop.dfs.dfsclient$leasechec...@767a9224" daemon prio=10 
tid=0x51384400 nid=0x1fc 
sleeping[0x45158000..0x45158a80]

  java.lang.Thread.State: TIMED_WAITING (sleeping)
   at java.lang.Thread.sleep(Native Method)
   at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:792)
   at java.lang.Thread.run(Thread.java:619)

  Locked ownable synchronizers:
   - None

"IPC Server handler 44 on 54310" daemon prio=10 tid=0x2aac40183c00 
nid=0x1f4 waiting for monitor entry [0x44f56000..0x44f56d80]

  java.lang.Thread.State: BLOCKED (on object monitor)
   at 
org.apache.hadoop.dfs.FSNamesystem.blockReportProcessed(FSNamesystem.java:1880)
   - waiting to lock <0x2aaab423a530> (a 
org.apache.hadoop.dfs.FSNamesystem)
   at 
org.apache.hadoop.dfs.FSNamesystem.handleHeartbeat(FSNamesystem.java:2127)

   at org.apache.hadoop.dfs.NameNode.sendHeartbeat(NameNode.java:602)
   at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
   at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)

  Locked ownable synchronizers:
   - None

"IPC Server handler 43 on 54310" daemon prio=10 tid=0x2aac40182400 
nid=0x1f3 waiting for monitor entry [0x44e55000..0x44e55a00]

  java.lang.Thread.State: BLOCKED (on object monitor)
   at 
org.apache.hadoop.dfs.FSNamesystem.startFileInternal(FSNamesystem.java:922)
   - waiting to lock <0x2aaab423a530> (a 
org.apache.hadoop.dfs.FSNamesystem)

   at org.apache.hadoop.dfs.FSNamesystem.startFile(FSNamesystem.java:903)
   at org.apache.hadoop.dfs.NameNode.create(NameNode.java:284)
   at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
   at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)

  Locked ownable synchronizers:
   - None

"IPC Server handler 42 on 54310" daemon prio=10 tid=0x2aac40181000 
nid=0x1f2 waiting for monitor entry [0x44d54000..0x44d54a80]

  java.lang.Thread.State: BLOCKED (on object monitor)
   at 
org.apache.hadoop.dfs.FSNamesystem.blockReportProcessed(FSNamesystem.java:1880)
   - waiting to lock <0x2aaab423a530> (a 
org.apache.hadoop.dfs.FSNamesystem)
   at 
org.apache.hadoop.dfs.FSNamesystem.handleHeartbeat(FSNamesystem.java:2127)

   at org.apache.hadoop.dfs.NameNode.sendHeartbeat(NameNode.java:602)
   at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
   at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:888)

  Locked ownable synchronizers:
   - None

"IPC Server hand

Re: Merging reducer outputs into a single part-00000 file

2009-01-14 Thread Rasit OZDAS
Jim,

As far as I know, there is no operation done after Reducer.
At first look, the situation reminds me of the same keys for all the tasks.
This can be the result of one of the following cases:
- the input format reads the same keys for every task.
- the mapper collects every incoming key-value pair under the same key.
- the reducer does the same.

But if you are a little experienced, you already know these.
Ordered list means one final file, or am I missing something?

Hope this helps,
Rasit


2009/1/11 Jim Twensky :
> Hello,
>
> The original map-reduce paper states: "After successful completion, the
> output of the map-reduce execution is available in the R output files (one
> per reduce task, with file names as specified by the user)." However, when
> using Hadoop's TextOutputFormat, all the reducer outputs are combined in a
> single file called part-0. I was wondering how and when this merging
> process is done. When the reducer calls output.collect(key,value), is this
> record written to a local temporary output file in the reducer's disk and
> then these local files (a total of R) are later merged into one single file
> with a final thread or is it directly written to the final output file
> (part-0)? I am asking this because I'd like to get an ordered sample of
> the final output data, ie. one record per every 1000 records or something
> similar and I don't want to run a serial process that iterates on the final
> output file.
>
> Thanks,
> Jim
>



-- 
M. Raşit ÖZDAŞ