Re: Re: getting null from CompressionCodecFactory.getCodec(Path file)

2009-01-14 Thread Gert Pfeifer
I got it. For some reason getDefaultExtension() returns ".lzo_deflate".

Is that a bug? Shouldn't it be .lzo?
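As far as I can tell, the factory only resolves a codec whose
getDefaultExtension() is a suffix of the file name, so a file called
test.lzo can never map to a codec that registers ".lzo_deflate". (If I
remember right, ".lzo" is the lzop container format and would belong to
LzopCodec, while LzoCodec writes a raw lzo stream, hence ".lzo_deflate".)
A quick sketch of what I mean, assuming LzoCodec is listed in
io.compression.codecs as in my setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookupCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // built from io.compression.codecs, keyed by each codec's extension
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);

    // prints null: no registered codec claims the ".lzo" suffix
    System.out.println("test.lzo         -> "
        + factory.getCodec(new Path("test.lzo")));

    // resolves to LzoCodec, because the suffix matches ".lzo_deflate"
    System.out.println("test.lzo_deflate -> "
        + factory.getCodec(new Path("test.lzo_deflate")));
  }
}

So renaming the input to *.lzo_deflate (or using a codec whose default
extension really is ".lzo") should work around it for now.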

In the head revision I couldn't find the codec at all in
http://svn.apache.org/repos/asf/hadoop/core/trunk/src/core/org/apache/hadoop/io/compress/

There should be a class LzoCodec.java. Was it moved somewhere else?

Gert

Gert Pfeifer wrote:
> Arun C Murthy wrote:
>> On Jan 13, 2009, at 7:29 AM, Gert Pfeifer wrote:
>>
>>> Hi,
>>> I want to use an lzo file as input for a mapper. The record reader
>>> determines the codec using a CompressionCodecFactory, like this:
>>>
>>> (Hadoop version 0.19.0)
>>>
>> http://hadoop.apache.org/core/docs/r0.19.0/native_libraries.html
> 
> I should have mentioned that I have these native libs running:
> 2009-01-14 10:00:21,107 INFO org.apache.hadoop.util.NativeCodeLoader:
> Loaded the native-hadoop library
> 2009-01-14 10:00:21,111 INFO org.apache.hadoop.io.compress.LzoCodec:
> Successfully loaded & initialized native-lzo library
> 
> Is that what you tried to point out with this link?
> 
> Gert
> 
>> hth,
>> Arun
>>
>>> compressionCodecs = new CompressionCodecFactory(job);
>>> System.out.println("Using codecFactory: "+compressionCodecs.toString());
>>> final CompressionCodec codec = compressionCodecs.getCodec(file);
>>> System.out.println("Using codec: "+codec+" for file "+file.getName());
>>>
>>>
>>> The output that I get is:
>>>
>>> Using codecFactory: { etalfed_ozl.:
>>> org.apache.hadoop.io.compress.LzoCodec }
>>> Using codec: null for file test.lzo
>>>
>>> Of course, the mapper does not work without codec. What could be the
>>> problem?
>>>
>>> Thanks,
>>> Gert


Re: Re: getting null from CompressionCodecFactory.getCodec(Path file)

2009-01-14 Thread Gert Pfeifer
Arun C Murthy wrote:
> 
> On Jan 13, 2009, at 7:29 AM, Gert Pfeifer wrote:
> 
>> Hi,
>> I want to use an lzo file as input for a mapper. The record reader
>> determines the codec using a CompressionCodecFactory, like this:
>>
>> (Hadoop version 0.19.0)
>>
> 
> http://hadoop.apache.org/core/docs/r0.19.0/native_libraries.html

I should have mentioned that I have these native libs running:
2009-01-14 10:00:21,107 INFO org.apache.hadoop.util.NativeCodeLoader:
Loaded the native-hadoop library
2009-01-14 10:00:21,111 INFO org.apache.hadoop.io.compress.LzoCodec:
Successfully loaded & initialized native-lzo library

Is that what you tried to point out with this link?

Gert

> 
> hth,
> Arun
> 
>> compressionCodecs = new CompressionCodecFactory(job);
>> System.out.println("Using codecFactory: "+compressionCodecs.toString());
>> final CompressionCodec codec = compressionCodecs.getCodec(file);
>> System.out.println("Using codec: "+codec+" for file "+file.getName());
>>
>>
>> The output that I get is:
>>
>> Using codecFactory: { etalfed_ozl.:
>> org.apache.hadoop.io.compress.LzoCodec }
>> Using codec: null for file test.lzo
>>
>> Of course, the mapper does not work without codec. What could be the
>> problem?
>>
>> Thanks,
>> Gert


getting null from CompressionCodecFactory.getCodec(Path file)

2009-01-13 Thread Gert Pfeifer
Hi,
I want to use an lzo file as input for a mapper. The record reader
determines the codec using a CompressionCodecFactory, like this:

(Hadoop version 0.19.0)

compressionCodecs = new CompressionCodecFactory(job);
System.out.println("Using codecFactory: "+compressionCodecs.toString());
final CompressionCodec codec = compressionCodecs.getCodec(file);
System.out.println("Using codec: "+codec+" for file "+file.getName());


The output that I get is:

Using codecFactory: { etalfed_ozl.: org.apache.hadoop.io.compress.LzoCodec }
Using codec: null for file test.lzo
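
(The "etalfed_ozl." in the factory output is just ".lzo_deflate" reversed;
the factory seems to key its codec map by the reversed extension so it can
match file suffixes.) In case it helps, here is a little sketch (untested)
that should dump which codecs are registered and which extensions they claim:

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.util.ReflectionUtils;

public class ListCodecs {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // the same list the factory is built from (io.compression.codecs)
    List<Class<? extends CompressionCodec>> classes =
        CompressionCodecFactory.getCodecClasses(conf);
    for (Class<? extends CompressionCodec> c : classes) {
      CompressionCodec codec = ReflectionUtils.newInstance(c, conf);
      System.out.println(c.getName() + " handles *" + codec.getDefaultExtension());
    }
  }
}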

Of course, the mapper does not work without a codec. What could be the
problem?

Thanks,
Gert


Re: Name node heap space problem

2008-07-28 Thread Gert Pfeifer

Bull's eye. I am using 0.17.1.

Taeho Kang wrote:

Gert,
What version of Hadoop are you using?

One of the people at my work who is using 0.17.1 is reporting a similar
problem - the namenode's heap space filling up too fast.

This is the status of his cluster (17 node cluster with version 0.17.1):
- 174541 files and directories, 121000 blocks = 295541 total. Heap Size is
898.38 MB / 1.74 GB (50%)

Here is the status of one of my clusters (70 node cluster with version
0.16.3):
- 265241 files and directories, 1155060 blocks = 1420301 total. Heap Size
is 797.94 MB / 1.39 GB (56%)

Notice that the second cluster has about 9 times more blocks than the first
one (and more files and directories, too), but its heap usage is in a similar
range (actually smaller...).

Has anyone else noticed any problems/inefficiencies in the namenode's memory
utilization in the 0.17.x versions?




On Mon, Jul 28, 2008 at 2:13 AM, Gert Pfeifer
<[EMAIL PROTECTED]>wrote:


There I have:
  export HADOOP_HEAPSIZE=8000
which should be enough (actually, in this case I don't know).

Running fsck on the directory, it turned out that there are 1785959
files in this dir... I have no clue how I can get the data out of there.
Can I somehow calculate how much heap a namenode would need to do an ls on
this dir?

Gert


Taeho Kang wrote:

Check how much memory is allocated for the JVM running namenode.

In a file HADOOP_INSTALL/conf/hadoop-env.sh
you should change a line that starts with "export HADOOP_HEAPSIZE=1000"

It's set to 1GB by default.


On Fri, Jul 25, 2008 at 2:51 AM, Gert Pfeifer <
[EMAIL PROTECTED]>
wrote:

Update on this one...

I put some more memory in the machine running the name node. Now fsck is
running. Unfortunately ls fails with a time-out.

I identified one directory that causes the trouble. I can run fsck on it
but not ls.

What could be the problem?

Gert

Gert Pfeifer wrote:

Hi,


I am running a Hadoop DFS on a cluster of 5 data nodes with a name node
and one secondary name node.

I have 1788874 files and directories, 1465394 blocks = 3254268 total.
Heap Size max is 3.47 GB.

My problem is that I produce many small files. Therefore I have a cron
job which just runs daily across the new files and copies them into
bigger files and deletes the small files.

Apart from this program, even an fsck kills the cluster.

The problem is that, as soon as I start this program, the heap space of
the name node reaches 100%.

What could be the problem? There are not many small files right now and
still it doesn't work. I guess we have had this problem since the upgrade
to 0.17.

Here is some additional data about the DFS:
Capacity      : 2 TB
DFS Remaining : 1.19 TB
DFS Used      : 719.35 GB
DFS Used%     : 35.16 %

Thanks for hints,
Gert








Re: Name node heap space problem

2008-07-27 Thread Gert Pfeifer

There I have:
   export HADOOP_HEAPSIZE=8000
which should be enough (actually, in this case I don't know).

Running fsck on the directory, it turned out that there are 1785959
files in this dir... I have no clue how I can get the data out of there.
Can I somehow calculate how much heap a namenode would need to do an ls
on this dir?
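
The only back-of-the-envelope estimate I can come up with assumes the
often-quoted rule of thumb of very roughly 150 bytes of namenode heap per
file, directory and block (surely not exact for 0.17, but it should give an
order of magnitude):

public class NamenodeHeapEstimate {
  public static void main(String[] args) {
    long files  = 1785959;       // files in the problematic directory
    long blocks = 1785959;       // assume roughly one block per small file
    long bytesPerObject = 150;   // rule-of-thumb heap cost per namespace object

    long estimate = (files + blocks) * bytesPerObject;
    System.out.printf("~%.0f MB just for this directory's metadata%n",
        estimate / (1024.0 * 1024.0));

    // On top of that, "fs -ls" apparently makes the namenode build the
    // complete listing for a single RPC response, which would explain the
    // extra heap spike and the RPC timeout.
  }
}

Does that sound about right?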


Gert


Taeho Kang wrote:

Check how much memory is allocated for the JVM running namenode.

In a file HADOOP_INSTALL/conf/hadoop-env.sh
you should change a line that starts with "export HADOOP_HEAPSIZE=1000"

It's set to 1GB by default.


On Fri, Jul 25, 2008 at 2:51 AM, Gert Pfeifer <[EMAIL PROTECTED]>
wrote:


Update on this one...

I put some more memory in the machine running the name node. Now fsck is
running. Unfortunately ls fails with a time-out.

I identified one directory that causes the trouble. I can run fsck on it
but not ls.

What could be the problem?

Gert

Gert Pfeifer wrote:

Hi,

I am running a Hadoop DFS on a cluster of 5 data nodes with a name node
and one secondary name node.

I have 1788874 files and directories, 1465394 blocks = 3254268 total.
Heap Size max is 3.47 GB.

My problem is that I produce many small files. Therefore I have a cron
job which just runs daily across the new files and copies them into
bigger files and deletes the small files.

Apart from this program, even an fsck kills the cluster.

The problem is that, as soon as I start this program, the heap space of
the name node reaches 100%.

What could be the problem? There are not many small files right now and
still it doesn't work. I guess we have had this problem since the upgrade
to 0.17.

Here is some additional data about the DFS:
Capacity      : 2 TB
DFS Remaining : 1.19 TB
DFS Used      : 719.35 GB
DFS Used%     : 35.16 %

Thanks for hints,
Gert









Re: Name node heap space problem

2008-07-24 Thread Gert Pfeifer

Update on this one...

I put some more memory in the machine running the name node. Now fsck is 
running. Unfortunately ls fails with a time-out.


I identified one directory that causes the trouble. I can run fsck on it 
but not ls.


What could be the problem?

Gert

Gert Pfeifer wrote:

Hi,
I am running a Hadoop DFS on a cluster of 5 data nodes with a name node
and one secondary name node.

I have 1788874 files and directories, 1465394 blocks = 3254268 total.
Heap Size max is 3.47 GB.

My problem is that I produce many small files. Therefore I have a cron
job which just runs daily across the new files and copies them into
bigger files and deletes the small files.

Apart from this program, even an fsck kills the cluster.

The problem is that, as soon as I start this program, the heap space of
the name node reaches 100%.

What could be the problem? There are not many small files right now and
still it doesn't work. I guess we have had this problem since the upgrade
to 0.17.

Here is some additional data about the DFS:
Capacity      : 2 TB
DFS Remaining : 1.19 TB
DFS Used      : 719.35 GB
DFS Used%     : 35.16 %

Thanks for hints,
Gert




Re: Can a MapReduce task only consist of a Map step?

2008-07-21 Thread Gert Pfeifer

Did you try to use the IdentityReducer?
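
Something along these lines with the old mapred API (untested sketch;
IdentityMapper just stands in for your own mapper):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapOnlyJob.class);
    conf.setJobName("map-only");

    conf.setMapperClass(IdentityMapper.class);    // swap in your own mapper
    conf.setReducerClass(IdentityReducer.class);  // pass-through reduce

    // Alternatively, conf.setNumReduceTasks(0) should skip the reduce
    // (and the shuffle/sort) entirely and write map output directly
    // to the output directory.

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}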

Zhou, Yunqing wrote:
> I only use it to do something in parallel, but the reduce step will cost me
> several additional days. Is it possible to make Hadoop not use a reduce
> step?
> 
> Thanks
> 



Re: Memory leak in DFS client

2008-07-21 Thread Gert Pfeifer

I found out that it is not a bug in my code. I can simply run

bin $ ./hadoop fs -ls /seDNS/data/33
ls: timed out waiting for rpc response

It times out for this directory, but before it does so, the name node
takes 2 GB more heap and never gives it back.

Any ideas?

Gert


Gert Pfeifer wrote:
> Hi,
> I am running some code dealing with file system operations (copying
> files and deleting). While it is running, the web interface of the name
> node tells me that the heap size grows dramatically.
> 
> Are there any server-side data structures that I have to close
> explicitly, except FSData{In|Out}putStreams? Anything that takes heap
> in the name node...
> 
> I had something in mind like Statements in JDBC, but I just can't find
> anything.
> 
> Gert


Memory leak in DFS client

2008-07-21 Thread Gert Pfeifer

Hi,
I am running some code dealing with file system operations (copying
files and deleting). While it is running, the web interface of the name
node tells me that the heap size grows dramatically.

Are there any server-side data structures that I have to close
explicitly, except FSData{In|Out}putStreams? Anything that takes heap
in the name node...

I had something in mind like Statements in JDBC, but I just can't find
anything.
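
For reference, the kind of operation I mean boils down to something like
this (simplified sketch, with both streams closed explicitly):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CopyAndDelete {
  public static void copy(FileSystem fs, Path src, Path dst, Configuration conf)
      throws IOException {
    FSDataInputStream in = fs.open(src);
    FSDataOutputStream out = fs.create(dst);
    try {
      IOUtils.copyBytes(in, out, conf, false);
    } finally {
      // As far as I understand, an output stream that is never closed keeps
      // the file "under construction" (a lease) on the name node, so both
      // ends get closed here no matter what.
      IOUtils.closeStream(in);
      IOUtils.closeStream(out);
    }
    fs.delete(src, false);
  }
}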

Gert


running hadoop with gij

2008-07-17 Thread Gert Pfeifer

Did anyone try to get Hadoop running on the GNU Java environment (gij)?
Does that work?

Cheers,
Gert


Name node heap space problem

2008-07-16 Thread Gert Pfeifer
Hi,
I am running a Hadoop DFS on a cluster of 5 data nodes with a name node
and one secondary name node.

I have 1788874 files and directories, 1465394 blocks = 3254268 total.
Heap Size max is 3.47 GB.

My problem is that I produce many small files. Therefore I have a cron
job which just runs daily across the new files and copies them into
bigger files and deletes the small files.
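
The merge job itself is essentially this kind of loop over the FileSystem
API (simplified sketch; packing the data into something like a SequenceFile
would probably be the cleaner long-term fix):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class MergeSmallFiles {
  public static void merge(Configuration conf, Path dir, Path target)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(target);
    try {
      for (FileStatus stat : fs.listStatus(dir)) {
        if (stat.isDir()) {
          continue;                          // only merge plain files
        }
        FSDataInputStream in = fs.open(stat.getPath());
        try {
          IOUtils.copyBytes(in, out, conf, false);
        } finally {
          in.close();
        }
        fs.delete(stat.getPath(), false);    // drop the merged small file
      }
    } finally {
      out.close();
    }
  }
}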

Apart from this program, even an fsck kills the cluster.

The problem is that, as soon as I start this program, the heap space of
the name node reaches 100%.

What could be the problem? There are not many small files right now and
still it doesn't work. I guess we have had this problem since the upgrade
to 0.17.

Here is some additional data about the DFS:
Capacity      : 2 TB
DFS Remaining : 1.19 TB
DFS Used      : 719.35 GB
DFS Used%     : 35.16 %

Thanks for hints,
Gert