Re: Re: getting null from CompressionCodecFactory.getCodec(Path file)
I got it. For some reason getDefaultExtension() returns ".lzo_deflate". Is that a bug? Shouldn't it be ".lzo"? In the head revision I couldn't find it at all under http://svn.apache.org/repos/asf/hadoop/core/trunk/src/core/org/apache/hadoop/io/compress/ There should be a class LzoCodec.java there. Was it moved somewhere else?

Gert

Gert Pfeifer wrote:
> Arun C Murthy wrote:
>> On Jan 13, 2009, at 7:29 AM, Gert Pfeifer wrote:
>>> Hi,
>>> I want to use an lzo file as input for a mapper. The record reader
>>> determines the codec using a CompressionCodecFactory, like this:
>>> (Hadoop version 0.19.0)
>>
>> http://hadoop.apache.org/core/docs/r0.19.0/native_libraries.html
>
> I should have mentioned that I have these native libs running:
> 2009-01-14 10:00:21,107 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
> 2009-01-14 10:00:21,111 INFO org.apache.hadoop.io.compress.LzoCodec: Successfully loaded & initialized native-lzo library
>
> Is that what you tried to point out with this link?
>
> Gert
>
>> hth,
>> Arun
>>
>>> compressionCodecs = new CompressionCodecFactory(job);
>>> System.out.println("Using codecFactory: " + compressionCodecs.toString());
>>> final CompressionCodec codec = compressionCodecs.getCodec(file);
>>> System.out.println("Using codec: " + codec + " for file " + file.getName());
>>>
>>> The output that I get is:
>>>
>>> Using codecFactory: { etalfed_ozl.: org.apache.hadoop.io.compress.LzoCodec }
>>> Using codec: null for file test.lzo
>>>
>>> Of course, the mapper does not work without a codec. What could be the problem?
>>>
>>> Thanks,
>>> Gert
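The odd-looking key "etalfed_ozl." in the factory's output is ".lzo_deflate" reversed, and that explains the null: the factory matches files against codecs by reversed extension, so a codec whose default extension is ".lzo_deflate" can never match "test.lzo". The following is an illustrative re-creation of that lookup in plain Java, not the actual Hadoop source; class and method names are mine.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative sketch of CompressionCodecFactory's matching scheme:
// codecs are keyed by their *reversed* default extension, and a file
// matches a codec only when the reversed filename starts with that key.
class ReversedExtensionLookup {
    private final SortedMap<String, String> codecs = new TreeMap<>();

    void register(String defaultExtension, String codecClass) {
        String key = new StringBuilder(defaultExtension).reverse().toString();
        codecs.put(key, codecClass);  // ".lzo_deflate" becomes "etalfed_ozl."
    }

    /** Returns the codec class name, or null when no registered suffix matches. */
    String getCodec(String filename) {
        String reversed = new StringBuilder(filename).reverse().toString();
        for (Map.Entry<String, String> e : codecs.entrySet()) {
            if (reversed.startsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ReversedExtensionLookup f = new ReversedExtensionLookup();
        f.register(".lzo_deflate", "org.apache.hadoop.io.compress.LzoCodec");
        // "test.lzo" does not end with ".lzo_deflate", so the lookup misses:
        System.out.println(f.getCodec("test.lzo"));          // prints null
        System.out.println(f.getCodec("test.lzo_deflate"));  // prints the codec class
    }
}
```

Under this scheme, renaming the input to end in ".lzo_deflate", or using a codec whose getDefaultExtension() returns ".lzo", would make the lookup succeed.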
Re: Re: getting null from CompressionCodecFactory.getCodec(Path file)
Arun C Murthy wrote:
> On Jan 13, 2009, at 7:29 AM, Gert Pfeifer wrote:
>> Hi,
>> I want to use an lzo file as input for a mapper. The record reader
>> determines the codec using a CompressionCodecFactory, like this:
>> (Hadoop version 0.19.0)
>
> http://hadoop.apache.org/core/docs/r0.19.0/native_libraries.html

I should have mentioned that I have these native libs running:

2009-01-14 10:00:21,107 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2009-01-14 10:00:21,111 INFO org.apache.hadoop.io.compress.LzoCodec: Successfully loaded & initialized native-lzo library

Is that what you tried to point out with this link?

Gert

> hth,
> Arun
>
>> compressionCodecs = new CompressionCodecFactory(job);
>> System.out.println("Using codecFactory: " + compressionCodecs.toString());
>> final CompressionCodec codec = compressionCodecs.getCodec(file);
>> System.out.println("Using codec: " + codec + " for file " + file.getName());
>>
>> The output that I get is:
>>
>> Using codecFactory: { etalfed_ozl.: org.apache.hadoop.io.compress.LzoCodec }
>> Using codec: null for file test.lzo
>>
>> Of course, the mapper does not work without a codec. What could be the problem?
>>
>> Thanks,
>> Gert
getting null from CompressionCodecFactory.getCodec(Path file)
Hi,
I want to use an lzo file as input for a mapper. The record reader determines the codec using a CompressionCodecFactory, like this (Hadoop version 0.19.0):

compressionCodecs = new CompressionCodecFactory(job);
System.out.println("Using codecFactory: " + compressionCodecs.toString());
final CompressionCodec codec = compressionCodecs.getCodec(file);
System.out.println("Using codec: " + codec + " for file " + file.getName());

The output that I get is:

Using codecFactory: { etalfed_ozl.: org.apache.hadoop.io.compress.LzoCodec }
Using codec: null for file test.lzo

Of course, the mapper does not work without a codec. What could be the problem?

Thanks,
Gert
Re: Name node heap space problem
Bull's eye. I am using 0.17.1.

Taeho Kang wrote:
> Gert,
>
> What version of Hadoop are you using? One of the people at my work who is using 0.17.1 is reporting a similar problem: the namenode's heap space filling up too fast.
>
> This is the status of his cluster (17-node cluster with version 0.17.1):
> - 174541 files and directories, 121000 blocks = 295541 total. Heap Size is 898.38 MB / 1.74 GB (50%)
>
> Here is the status of one of my clusters (70-node cluster with version 0.16.3):
> - 265241 files and directories, 1155060 blocks = 1420301 total. Heap Size is 797.94 MB / 1.39 GB (56%)
>
> Notice that the second cluster has about 9 times more blocks than the first one (and more files and directories, too), but heap usage is similar (actually smaller...).
>
> Has anyone else noticed problems or inefficiencies in the namenode's memory utilization in the 0.17.x versions?
>
> On Mon, Jul 28, 2008 at 2:13 AM, Gert Pfeifer <[EMAIL PROTECTED]> wrote:
>> There I have:
>> export HADOOP_HEAPSIZE=8000
>> which should be enough (actually in this case I don't know).
>>
>> Running fsck on the directory, it turned out that there are 1785959 files in this dir... I have no clue how I can get the data out of there.
>>
>> Can I somehow calculate how much heap a namenode would need to do an ls on this dir?
>>
>> Gert
>>
>> Taeho Kang wrote:
>>> Check how much memory is allocated for the JVM running the namenode. In the file HADOOP_INSTALL/conf/hadoop-env.sh you should change the line that starts with "export HADOOP_HEAPSIZE=1000". It's set to 1 GB by default.
>>>
>>> On Fri, Jul 25, 2008 at 2:51 AM, Gert Pfeifer <[EMAIL PROTECTED]> wrote:
>>>> Update on this one...
>>>> I put some more memory in the machine running the name node. Now fsck is running. Unfortunately ls fails with a time-out.
>>>>
>>>> I identified one directory that causes the trouble. I can run fsck on it, but not ls. What could be the problem?
>>>>
>>>> Gert
>>>>
>>>> Gert Pfeifer wrote:
>>>>> Hi,
>>>>> I am running a Hadoop DFS on a cluster of 5 data nodes with a name node and one secondary name node. I have 1788874 files and directories, 1465394 blocks = 3254268 total. Max heap size is 3.47 GB.
>>>>>
>>>>> My problem is that I produce many small files, so I have a daily cron job that runs across the new files, copies them into bigger files, and deletes the small ones. Apart from this program, even an fsck kills the cluster. As soon as I start this program, the heap space of the name node reaches 100%. What could be the problem? There are not many small files right now and still it doesn't work. I guess we have had this problem since the upgrade to 0.17.
>>>>>
>>>>> Here is some additional data about the DFS:
>>>>> Capacity: 2 TB
>>>>> DFS Remaining: 1.19 TB
>>>>> DFS Used: 719.35 GB
>>>>> DFS Used%: 35.16 %
>>>>>
>>>>> Thanks for hints,
>>>>> Gert
Re: Name node heap space problem
There I have:
export HADOOP_HEAPSIZE=8000
which should be enough (actually in this case I don't know).

Running fsck on the directory, it turned out that there are 1785959 files in this dir... I have no clue how I can get the data out of there.

Can I somehow calculate how much heap a namenode would need to do an ls on this dir?

Gert

Taeho Kang wrote:
> Check how much memory is allocated for the JVM running the namenode. In the file HADOOP_INSTALL/conf/hadoop-env.sh you should change the line that starts with "export HADOOP_HEAPSIZE=1000". It's set to 1 GB by default.
>
> On Fri, Jul 25, 2008 at 2:51 AM, Gert Pfeifer <[EMAIL PROTECTED]> wrote:
>> Update on this one...
>> I put some more memory in the machine running the name node. Now fsck is running. Unfortunately ls fails with a time-out.
>>
>> I identified one directory that causes the trouble. I can run fsck on it, but not ls. What could be the problem?
>>
>> Gert
>>
>> Gert Pfeifer wrote:
>>> Hi,
>>> I am running a Hadoop DFS on a cluster of 5 data nodes with a name node and one secondary name node. I have 1788874 files and directories, 1465394 blocks = 3254268 total. Max heap size is 3.47 GB.
>>>
>>> My problem is that I produce many small files, so I have a daily cron job that runs across the new files, copies them into bigger files, and deletes the small ones. Apart from this program, even an fsck kills the cluster. As soon as I start this program, the heap space of the name node reaches 100%. What could be the problem? There are not many small files right now and still it doesn't work. I guess we have had this problem since the upgrade to 0.17.
>>>
>>> Here is some additional data about the DFS:
>>> Capacity: 2 TB
>>> DFS Remaining: 1.19 TB
>>> DFS Used: 719.35 GB
>>> DFS Used%: 35.16 %
>>>
>>> Thanks for hints,
>>> Gert
Re: Name node heap space problem
Update on this one...
I put some more memory in the machine running the name node. Now fsck is running. Unfortunately ls fails with a time-out.

I identified one directory that causes the trouble. I can run fsck on it, but not ls. What could be the problem?

Gert

Gert Pfeifer wrote:
> Hi,
> I am running a Hadoop DFS on a cluster of 5 data nodes with a name node and one secondary name node. I have 1788874 files and directories, 1465394 blocks = 3254268 total. Max heap size is 3.47 GB.
>
> My problem is that I produce many small files, so I have a daily cron job that runs across the new files, copies them into bigger files, and deletes the small ones. Apart from this program, even an fsck kills the cluster. As soon as I start this program, the heap space of the name node reaches 100%. What could be the problem? There are not many small files right now and still it doesn't work. I guess we have had this problem since the upgrade to 0.17.
>
> Here is some additional data about the DFS:
> Capacity: 2 TB
> DFS Remaining: 1.19 TB
> DFS Used: 719.35 GB
> DFS Used%: 35.16 %
>
> Thanks for hints,
> Gert
Re: Can a MapReduce task only consist of a Map step?
Did you try to use the IdentityReducer?

Zhou, Yunqing wrote:
> I only use it to do something in parallel, but the reduce step will cost me
> several additional days. Is it possible to make Hadoop skip the reduce step?
>
> Thanks
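Note that an IdentityReducer still pays for the full sort and shuffle. A sketch of the alternative, assuming the 0.17-era JobConf API (MyJob and MyMapper are placeholders for your own driver and mapper classes):

```java
// Setting the number of reduce tasks to 0 makes the job map-only:
// map output is written directly to the output directory, and the
// sort/shuffle phase is skipped entirely.
JobConf conf = new JobConf(MyJob.class);
conf.setJobName("map-only");
conf.setMapperClass(MyMapper.class);
conf.setNumReduceTasks(0);  // no reduce phase at all
JobClient.runJob(conf);
```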
Re: Memory leak in DFS client
I found out that it is not a bug in my code. I can run:

$ bin/hadoop fs -ls /seDNS/data/33
ls: timed out waiting for rpc response

It times out for this directory, but before it does so, the name node takes 2 GB more heap and never gives it back. Any ideas?

Gert

Gert Pfeifer wrote:
> Hi,
> I am running some code dealing with file system operations (copying files and deleting). While it is running, the web interface of the name node tells me that the heap size grows dramatically.
>
> Are there any server-side data structures that I have to close explicitly, except FSData{In|Out}putStream? Anything that takes heap in the name node...
>
> I had something in mind like Statements in JDBC, but I just can't find anything.
>
> Gert
Memory leak in DFS client
Hi,
I am running some code dealing with file system operations (copying files and deleting). While it is running, the web interface of the name node tells me that the heap size grows dramatically.

Are there any server-side data structures that I have to close explicitly, except FSData{In|Out}putStream? Anything that takes heap in the name node...

I had something in mind like Statements in JDBC, but I just can't find anything.

Gert
running hadoop with gij
Did anyone try to get Hadoop running on the GNU Java environment (gij)? Does that work?

Cheers,
Gert
Name node heap space problem
Hi,
I am running a Hadoop DFS on a cluster of 5 data nodes with a name node and one secondary name node. I have 1788874 files and directories, 1465394 blocks = 3254268 total. Max heap size is 3.47 GB.

My problem is that I produce many small files, so I have a daily cron job that runs across the new files, copies them into bigger files, and deletes the small ones. Apart from this program, even an fsck kills the cluster. As soon as I start this program, the heap space of the name node reaches 100%. What could be the problem? There are not many small files right now and still it doesn't work. I guess we have had this problem since the upgrade to 0.17.

Here is some additional data about the DFS:
Capacity: 2 TB
DFS Remaining: 1.19 TB
DFS Used: 719.35 GB
DFS Used%: 35.16 %

Thanks for hints,
Gert
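For the "can I calculate the heap?" question raised in this thread, a rough back-of-envelope sketch is possible, assuming the commonly cited rule of thumb of roughly 150 bytes of namenode heap per file, directory, or block (the real per-object cost varies with version and name lengths; the helper below is illustrative, not a Hadoop API):

```java
// Back-of-envelope namenode heap estimate: every file, directory, and
// block is a live object on the namenode heap, at roughly 150 bytes
// each (rule of thumb only; actual cost depends on version and names).
class NamenodeHeapEstimate {
    static long estimateBytes(long filesAndDirs, long blocks, long bytesPerObject) {
        return (filesAndDirs + blocks) * bytesPerObject;
    }

    public static void main(String[] args) {
        // Figures from the cluster described above:
        long filesAndDirs = 1_788_874L;
        long blocks = 1_465_394L;
        long est = estimateBytes(filesAndDirs, blocks, 150L);
        System.out.printf("~%.0f MB of steady-state heap%n", est / (1024.0 * 1024.0));
    }
}
```

For the cluster above this gives well under 1 GB of steady-state usage, which suggests the 100% heap spikes come from transient load (e.g. materializing a directory listing with 1.7 million entries in a single RPC response) rather than from the namespace itself.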