Re: datanode auto down

2009-07-19 Thread Jason Venner
Did you run this command on the datanode that is not responding? On Sun, Jul 19, 2009 at 3:59 AM, mingyang wrote: > in datanode logs, I found a new error message. > Would like to help solve the problem > > 2009-07-19 18:40:43,464 ERROR > org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeReg

Re: Output of a Reducer as a zip file?

2009-07-24 Thread Jason Venner
I used to write zip files in my reducer, and it was very fast; pulling the files out of hdfs was also very fast. In part this was because each reducer might otherwise need to write 26k individual files, while writing them as a zip file meant only 1 hdfs file. The job ran about 15x faster that way. I do

Re: dfs fail to Unable to create new block

2009-07-28 Thread Jason Venner
Looks like a possible communication failure with your datanode, possibly out of file descriptors or some networking issue? What version of hadoop are you running? > 2009-07-28 18:01:30,622 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file > "/data/segment/dat_4_8

Re: Output of a Reducer as a zip file?

2009-07-28 Thread Jason Venner
> these into one zip output. > If I do as suggested > ZipOutputStream zos = new ZipOutputStream( fs.create("Output.zip")); > how does this zos work instead of output? > Thank you, > Mark
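As a hedged illustration of the pattern discussed in this thread, here is a minimal reducer that writes all of its output into a single zip file on HDFS. The class name, file name, and Text key/value types are assumptions rather than the original poster's code, and the zip is placed under the task's work output directory so the framework can clean it up if the task is aborted (see the speculative-execution note in a later reply).

```java
// Hypothetical sketch only: one zip file per reduce task, with one zip entry per key.
import java.io.IOException;
import java.util.Iterator;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ZipOutputReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private ZipOutputStream zos;

  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      // Place the zip in the task's work output directory so an aborted or
      // speculative task's file is removed by the framework.
      Path zipPath = new Path(FileOutputFormat.getWorkOutputPath(job), "output.zip");
      zos = new ZipOutputStream(fs.create(zipPath));
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // One zip entry per key instead of one HDFS file per key.
    zos.putNextEntry(new ZipEntry(key.toString()));
    while (values.hasNext()) {
      zos.write(values.next().toString().getBytes("UTF-8"));
      zos.write('\n');
    }
    zos.closeEntry();
  }

  public void close() throws IOException {
    zos.close(); // writes the zip central directory and closes the HDFS file
  }
}
```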

Re: Map performance with custom binary format

2009-07-28 Thread Jason Venner
Is it possible that your tasks are not falling evenly over the machines of your cluster, but piling up on a small number of machines? On Tue, Jul 28, 2009 at 3:35 PM, Scott Carey wrote: > See below: > > > On 7/28/09 12:15 PM, "william kinney" wrote: > > > Sorry, forgot to include that detail. >

Re: RecordReader Key/Value classes

2009-07-28 Thread Jason Venner
In hadoop 18 and beyond, the key and value do not have to implement Writable. As a general rule, the key and value objects passed to the map task will be the same objects each call, with a fresh value initialized by the record reader. The output.collect method will serialize the value during the call (unless

Re: map side join

2009-07-30 Thread Jason Venner
The mapside join code builds multiple map tasks, and each map task will receive as input one partition from each of your input sources. In your case, your job would have 3 map tasks, and each map task would receive data from 1 partition in each source file. The mapside join code maintains a reader
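A rough driver-side sketch of that setup, assuming three pre-partitioned, identically sorted sources at hypothetical paths; the "inner" join type and KeyValueTextInputFormat are illustrative choices, not taken from the original question.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class MapSideJoinSetup {
  public static JobConf configure() {
    JobConf conf = new JobConf(MapSideJoinSetup.class);
    // All sources must have the same number of partitions, partitioned and
    // sorted identically on the join key.
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", KeyValueTextInputFormat.class,
        new Path("/data/source1"),   // hypothetical paths
        new Path("/data/source2"),
        new Path("/data/source3")));
    // Each map task then sees one key plus a TupleWritable carrying one value
    // slot per source for that key.
    return conf;
  }
}
```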

Re: conf.setNumReduceTasks(1) but the code called 3 times

2009-07-30 Thread Jason Venner
A rule of thumb is to not enable speculative execution if the tasks have side effects that are not cleaned up on task abort. The tasktracker will clean up the task output directory on task abort. Writing your zip files into the task output directory will allow the framework to remove zip file creat

Re: dfs fail to Unable to create new block

2009-07-30 Thread Jason Venner
networks issue? > > Thanks, > Jianmin > > > > > ________ > From: Jason Venner > To: common-user@hadoop.apache.org > Sent: Tuesday, July 28, 2009 8:30:23 PM > Subject: Re: dfs fail to Unable to create new block > > Looks like a possible

Re: Reading GZIP input files.

2009-07-31 Thread Jason Venner
If the file names end in .gz, 18.3 will just work; you will get 1 map task per file. On Fri, Jul 31, 2009 at 8:01 AM, prashant ullegaddi < prashullega...@gmail.com> wrote: > Hi guys, > > I have a set of 1000 gzipped plain text files. How to read them in Hadoop? > Is there any built-in class avail

Re: how to process small fraction of input?

2009-07-31 Thread Jason Venner
If the read time is the bulk of the time, there is no simple way to handle this. You could greatly increase the failure tolerance for the map tasks by setting mapred.max.map.failures.percent to 90%, turn off speculative execution and failed map task retry, and then in your map or configure method abo
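A hedged sketch of those knobs in driver code; the 90% figure comes from the message above, while the driver class name is a placeholder.

```java
import org.apache.hadoop.mapred.JobConf;

public class SampleFractionDriver {
  public static JobConf configure() {
    JobConf conf = new JobConf(SampleFractionDriver.class);
    conf.setMaxMapTaskFailuresPercent(90);  // mapred.max.map.failures.percent
    conf.setMapSpeculativeExecution(false); // no speculative copies of map tasks
    conf.setMaxMapAttempts(1);              // mapred.map.max.attempts: no retries
    return conf;
  }
}
```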

Re: Task process exit with nonzero status of 255

2009-08-03 Thread Jason Venner
That generally means that the process that was running the task crashed. The actual map/reduce task is run in a separate jvm by the task tracker, and that JVM is exiting abnormally. This used to happen to my jobs quite a bit when they were using a buggy native library via jni. If you are trying to

Re: Difference between "Killed Task Attempts" and "Killed Tasks"

2009-08-03 Thread Jason Venner
You only get killed tasks when speculative execution is enabled: when one of a pair of identical running tasks finishes, the other task is killed, but it is not considered an attempt. On Mon, Aug 3, 2009 at 6:18 AM, Harish Mallipeddi < harish.mallipe...@gmail.com> wrote: > Agreed. But

Re: File is closed but data is not visible

2009-08-11 Thread Jason Venner
Please provide information on what version of hadoop you are using and the method of opening and closing the file. On Tue, Aug 11, 2009 at 12:48 AM, Pallavi Palleti < pallavi.pall...@corp.aol.com> wrote: > Hi all, > > We have an application where we pull logs from an external server(far apart >

Re: File is closed but data is not visible

2009-08-12 Thread Jason Venner
i wrote: > > Hi Jason, > > > > Apologies for missing version information in my previous mail. I am > > using hadoop-0.18.3. I am getting FSDataOutputStream object using > > fs.create(new Path(some_file_name)), where fs is FileSystem object. And, > > I am closin

Re: File is closed but data is not visible

2009-08-12 Thread Jason Venner
getting closed at expected time period. But, when I look for > the same file in hadoop cluster, it is still not created and if I wait > for another 1 to 2 hours, I could see the file. > > Thanks > Pallavi > > > -Original Message- > From: Jason Venner [mailto:jason.

Re: File is closed but data is not visible

2009-08-12 Thread Jason Venner
> logger.error("Unexpected error while writing to HDFS, exiting > ...", e); > // before exiting do the cleanup > close(reader); > > System.exit(-1); >} finally { > close(reader); > } > > Thanks > Pallavi > > > -

Re: What OS?

2009-08-13 Thread Jason Venner
Anyone have any performance numbers for Solaris or ZFS based datanodes? The directory and inode cache sizes are a limiting factor on linux for large and busy datanodes. On Wed, Aug 12, 2009 at 7:45 AM, tim robertson wrote: > Thanks guys. I'll chat with sys admin and see what he thinks. > We kn

Re: Running Cloudera's distribution without their support agreement - is that a bad idea?

2009-08-19 Thread Jason Venner
Cloudera submits their patches back to the projects, and people are free to pick them up. It is becoming a normal thing to run a patched distribution, particularly since Yahoo made their version of 0.20 available. On Wed, Aug 19, 2009 at 5:46 AM, Edward Capriolo wrote: > Generally if I have an i

Re: Why the jobs are suspended when I add new nodes?

2009-08-19 Thread Jason Venner
I have added small numbers of nodes to running clusters with running jobs without issue, when the machines were correctly configured for the cluster, so this is known to work at least in the 0.18 release series (when I was doing this operation). On Mon, Aug 17, 2009 at 6:56 AM, yang song wrot

Re: utilizing all cores on single-node hadoop

2009-08-19 Thread Jason Venner
Another reason you may not see full utilization of your map task slots per tracker is if the mean run time of a task is very short: all the slots are being used, but the setup and teardown time for each task is large compared to the run time of the task, so it appears that not all the task slo

Re: How to deal with "too many fetch failures"?

2009-08-19 Thread Jason Venner
The number 1 cause of this is something that causes a connection made to fetch a map output to fail. I have seen: 1) firewall; 2) misconfigured ip addresses (i.e. the task tracker attempting the fetch received an incorrect ip address when it looked up the name of the tasktracker with the map segment); 3) rar

Re: Faster alternative to FSDataInputStream

2009-08-21 Thread Jason Venner
It may be some kind of hostname or reverse lookup delay, either on the origination or destination side. On Thu, Aug 20, 2009 at 10:43 AM, Raghu Angadi wrote: > Ananth T. Sarathy wrote: > >> it's on s3. and it always happens. >> > > I have no experience with S3. You might want to check out S3

Re: Help.

2009-08-21 Thread Jason Venner
It may be that the individual datanodes get different names for their ip addresses than the namenode does. It may also be that some subset of your namenode/datanodes do not have write access to the hdfs storage directories. On Mon, Aug 17, 2009 at 10:05 PM, qiu tian wrote: > Hi everyone. > I in

Re: JVM reuse

2009-08-21 Thread Jason Venner
I think simply because it was a new feature, and it really only helps for jobs where there are a large number of tasks compared to the available task slots, coupled with the concern that subsequent tasks run in a reused jvm may not behave identically to tasks run in a fresh jvm. On Fri, Aug 21, 2009 at
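For reference, a hedged one-liner showing how JVM reuse is switched on from 0.19 onward; the value -1 means reuse without limit, and the surrounding driver class is hypothetical.

```java
import org.apache.hadoop.mapred.JobConf;

public class JvmReuseDriver {
  public static JobConf configure() {
    JobConf conf = new JobConf(JvmReuseDriver.class);
    // mapred.job.reuse.jvm.num.tasks: 1 = no reuse (default), -1 = unlimited reuse.
    conf.setNumTasksToExecutePerJvm(-1);
    return conf;
  }
}
```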

Re: Testing Hadoop job

2009-08-26 Thread Jason Venner
I put together a framework for the Pro Hadoop book that I use quite a bit, and it has some documentation in the book examples ;) I haven't tried it with 0.20.0 however. The nicest thing that I did with the framework was provide a way to run a persistent mini virtual cluster for running multiple tests
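A rough sketch of such an in-process cluster, using the MiniDFSCluster and MiniMRCluster classes from Hadoop's test jar; constructor signatures and package names have shifted between releases, so treat this as an outline rather than the book's actual framework.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.MiniDFSCluster;   // lived in org.apache.hadoop.dfs in 0.18
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MiniMRCluster;

public class MiniClusterHarness {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    MiniDFSCluster dfs = new MiniDFSCluster(conf, 2, true, null); // 2 datanodes, formatted
    FileSystem fs = dfs.getFileSystem();
    MiniMRCluster mr = new MiniMRCluster(2, fs.getUri().toString(), 1); // 2 tasktrackers

    JobConf job = mr.createJobConf();
    // ... configure and run test jobs against the mini cluster here ...

    mr.shutdown();
    dfs.shutdown();
  }
}
```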

Re: Testing Hadoop job

2009-08-31 Thread Jason Venner
> i have used the basic functionalities of the MRUnit testing framework > i would like to know the limitations (e.g. i found out that MRUnit does not > check the partitioner logic) and its feasibility with hadoop 0.20... > No proper documentation i found ! :(

Re: discyp between different versions of Hadoop...

2009-09-06 Thread Jason Venner
You pretty much have to stage the files through something. If you can make the source version of hadoop's fuse mount work, you can copy in using the fuse mount as a source. On Sun, Sep 6, 2009 at 10:50 PM, C G wrote: > Sorry...subject should be "distcp" obviously... > Also trying to pull from the n

Re: discyp between different versions of Hadoop...

2009-09-07 Thread Jason Venner
Thank you, I didn't think of the ftp interface at all and had completely forgotten it. On Mon, Sep 7, 2009 at 12:00 AM, Erik Forsberg wrote: > On Sun, 6 Sep 2009 22:45:28 -0700 (PDT) > C G wrote: > > > Hi All: > > Does anybody know if it's possible to distcp between an old version > > of Hadoop

Re: Multiple disks for DFS

2009-09-13 Thread Jason Venner
When you have multiple partitions specified for hdfs storage, they are used for block storage in a round robin fashion. If a partition has insufficient space it is dropped from the set used for storing new blocks. On Sun, Sep 13, 2009 at 3:01 AM, Stas Oskin wrote: > Hi. > > When I specify multipl

Re: Question about mapred.child.java.opts

2009-09-14 Thread Jason Venner
For streaming, you just need enough space to run the child that forks and handles the streaming mapper/reducer, and enough space to handle sorting the data without the task getting stuck in GC hell. On Fri, Sep 11, 2009 at 12:30 PM, Mayuran Yogarajah < mayuran.yogara...@casalemedia.com> wrote

Re: "Timed out waiting for rpc response" after running a large number of jobs

2009-09-19 Thread Jason Venner
It is not uncommon for the task tracker http servers to get overwhelmed with requests for map outputs when there were many map tasks. Increasing the number of threads can help. On Sat, Sep 19, 2009 at 6:32 PM, Kunsheng Chen wrote: > Hi everyone, > > > I am running two map-reduce program, they
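For reference, the relevant tasktracker-side setting is tasktracker.http.threads (default 40); a hedged hadoop-site.xml fragment, with the value chosen arbitrarily:

```xml
<property>
  <name>tasktracker.http.threads</name>
  <!-- default is 40; raise it on clusters with many map tasks per job -->
  <value>80</value>
</property>
```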

Re: Multithread Question

2009-09-27 Thread Jason Venner
A Map/Reduce task is a single jvm, running a single map or reduce thread. The number of map and reduce task execution slots in a cluster is fixed at cluster start time (actually at the task tracker start time). This restriction may be lifted at some point. It is possible to tell a task to use a m

Re: Native libraries and HDFS

2009-09-28 Thread Jason Venner
The codecs are run on the client side, so there is no effect other than reduced storage / transfer time on the datanode. The native codecs are generally more cpu efficient, and therefore more time efficient, on the client side. On Mon, Sep 28, 2009 at 5:39 PM, Stas Oskin wrote: > Hi. > > I have a question - ar

Re: dfs create block sticking

2009-09-28 Thread Jason Venner
How long does it take to create a file on one of your datanodes, in the dfs block storage area, while your job is running? It could simply be that the OS-level file creation is taking longer than the RPC timeout. On Mon, Sep 28, 2009 at 5:30 PM, dave bayer wrote: > On a cluster running 0.

Re: dfs create block sticking

2009-09-29 Thread Jason Venner
create/use a file with the same name, therefore I got > AlreadyBeingCreatedException. > > your case may be different, but I thought to share mine. > > On Tue, Sep 29, 2009 at 11:03 AM, Jason Venner >wrote: > > > How long does it take you to create a file in on one of your da

Re: Distributed cache - are files unique per job?

2009-09-29 Thread Jason Venner
When you use the command line option -archives, a directory "archives" is created in hdfs under the per-job submission area to store the archives. So there should be no collisions, as long as no other job tracker is using the same system directory path (conf.get("mapred.system.dir", "/tmp/hadoop/mapred

Re: Final Reminder: NSF, Google, IBM CLuE PI Meeting: October 5, 2009

2009-09-29 Thread Jason Venner
You can also publish them on www.prohadoop.com, as well as announce your events ;) On Tue, Sep 29, 2009 at 7:58 AM, Oliver Senn wrote: > +1 > > > Steve Lihn wrote: > >> Can the group make these speeches available online (such as youtube) >> for the global community? >> >> Thx, steve >> >> On 9/2

Re: NameNode metadata destination

2009-10-01 Thread Jason Venner
If you are looking for moment-by-moment recovery, you need to have multiple directories, preferably on several devices, for your Namenode edit log (which is modified for each metadata change) and also multiple directories for the FS image, which is updated every few minutes by the secondary Namen
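A hedged hadoop-site.xml sketch of what that looks like; the paths are placeholders. dfs.name.dir takes a comma-separated list of directories that each receive a full copy of the edit log and image, and fs.checkpoint.dir plays the same role for the secondary namenode's checkpoints.

```xml
<property>
  <name>dfs.name.dir</name>
  <value>/disk1/hdfs/name,/disk2/hdfs/name,/remote/nfs/hdfs/name</value>
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary</value>
</property>
```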

Re: NameNode metadata destination

2009-10-01 Thread Jason Venner
s? How reliable is this? > > Regards. > > > 2009/10/1 Jason Venner > > > If you are looking for moment by moment recovery, you need to have > multiple > > directories, preferably on several devices, for your Namenode edit log > > (which is modified for each meta da

Re: OutputCollector: key and value are separated by tab, why?

2009-10-01 Thread Jason Venner
They are separated by a tab so that the KeyValueLineRecordReader can read the output. In hadoop 19, the parameter mapred.textoutputformat.separator can be set to an arbitrary string which will be used as the separator for TextOutputFormat. On Thu, Oct 1, 2009 at 6:54 PM, Mark Kerzner wrote: >
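A one-line hedged example of that parameter in a job driver (hadoop 0.19+); the comma is an arbitrary choice and the class name is a placeholder.

```java
import org.apache.hadoop.mapred.JobConf;

public class SeparatorExample {
  public static JobConf configure() {
    JobConf conf = new JobConf(SeparatorExample.class);
    // Use a comma instead of the default tab between key and value.
    conf.set("mapred.textoutputformat.separator", ",");
    return conf;
  }
}
```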

Re: Having multiple values in Value field

2009-10-05 Thread Jason Venner
You can always pass them as comma delimited strings, which is what you are already doing with your python streaming code, and then use Text as your value. On Mon, Oct 5, 2009 at 10:54 PM, akshaya iyengar wrote: > I am having issues having multiple values in my value field.My desired > result is >

Re: Creating Lucene index in Hadoop

2009-10-07 Thread Jason Venner
Check out katta, as it can pull indexes from hdfs and deploy them into your search cluster. Katta also handles index directories that have been packed into a zip file. Katta can pull indexes from any file system that hadoop supports, hdfs, s3, hftp, file etc. We have been doing this with our solr

Re: Recommended file-system for DataNode

2009-10-08 Thread Jason Venner
I have used xfs pretty extensively, and it seemed to be somewhat faster than ext3. The only trouble we had related to some machines running the PAE 32 bit kernels, where the filesystems locked up. That is an obscure use case however. Running JBOD with your dfs.data.dir listing a directory on each devi

Re: Recommended file-system for DataNode

2009-10-08 Thread Jason Venner
Busy datanodes become bound by the metadata lookup times for the directory and inode entries required to open a block. Anything that optimizes that will help substantially. We are thinking of playing with btrfs, and using a small SSD for our file system metadata and the spinning disks for the blo

Re: Recommended file-system for DataNode

2009-10-08 Thread Jason Venner
noatime is absolutely essential; I forgot to mention it because it is automatic for me now. I have a fun story about atime: I have some Solaris machines with ZFS file systems, and I was doing a find on a 6 level hashed directory tree with 25 leaf nodes. The find on a cold idle file system wa

Re: fuse-dfs:fuse-dfs didn't recognize /dfs,-2

2009-10-08 Thread Jason Venner
Are you by chance running 0.19.0? That ls output looks like 0.19.0's output. On Thu, Oct 8, 2009 at 11:35 PM, yibo820217 wrote: > > Hi, > > I get the following error when trying to mount the fuse dfs, > > the first problem is: > [r...@puppet ~]# fuse_dfs_wrapper.sh dfs://100.207.100.25:9000/ /df

Re: Recommended file-system for DataNode

2009-10-12 Thread Jason Venner
Unless you are serving mail via imap or pop, it is generally considered safe. On Sun, Oct 11, 2009 at 1:11 AM, Stas Oskin wrote: > Hi. > > By the way, about the noatime - is it safe just to set this for all > partitions used, including / and boot? > > Thanks. > > 2009/10/9 Stas Oskin > > > Hi.

Re: Using Hadoop for File Conversion

2009-10-12 Thread Jason Venner
Hadoop is very well suited for file conversion. Do you have any more specific questions? What you might do is give your hadoop job as input a file or set of files containing the paths or urls to the images you wish to convert. Then in your map task, load the image file, apply your conversion, and s

Re: map function

2009-10-12 Thread Jason Venner
Yes you may call it recursively. On Mon, Oct 12, 2009 at 9:42 AM, Amandeep Khurana wrote: > Nope (as far as I'm aware).. Why do you want that? > > On Mon, Oct 12, 2009 at 9:40 AM, hellpizza wrote: > > > > > Can map function be called recursively? > > -- > > View this message in context: > > ht

Re: Optimization of cpu and i/o usage / other bottlenecks?

2009-10-13 Thread Jason Venner
Are your network interfaces or the namenode/jobtracker/datanodes saturated? On Tue, Oct 13, 2009 at 9:05 AM, Chris Seline wrote: > I am using the 0.3 Cloudera scripts to start a Hadoop cluster on EC2 of 11 > c1.xlarge instances (1 master, 10 slaves), that is the biggest instance > available with

Re: Optimization of cpu and i/o usage / other bottlenecks?

2009-10-14 Thread Jason Venner
s on a query that takes 10 minutes, but that is still less than > what I see in scp transfers on EC2, which is typically about 30 MB/s. > > thanks > > Chris > > > Jason Venner wrote: > >> are your network interface or the namenode/jobtracker/datanodes saturated >

Re: Optimization of cpu and i/o usage / other bottlenecks?

2009-10-14 Thread Jason Venner
he value to Long.MAX_VALUE, but that is not what > I have found to be best. I see about 25% improvement at 300MB (3), > CPU utilization is up to about 50-70%+, but I am still fine tuning. > > > thanks! > > Chris > > Jason Venner wrote: > >> I remember having a proble

Re: How can I deploy 100 blocks onto 10 datanodes with each node have 10 blocks?

2009-10-19 Thread Jason Venner
If you set your replication count to one and, on each datanode, create 10 files, you will achieve the pattern you are trying for. By default, when a file is created on a machine hosting a datanode, that datanode will receive 1 replica of the file and will be responsible for sending the file data to

Re: Problem to create sequence file for

2009-10-27 Thread Jason Venner
How large is the string that is being written? Does it contain the entire contents of your file? You may simply need to increase the heap size of your jvm. On Tue, Oct 27, 2009 at 3:43 AM, bhushan_mahale < bhushan_mah...@persistent.co.in> wrote: > Hi, > > I have written a code to create sequen

Re: Streaming ignoring stderr output

2009-10-27 Thread Jason Venner
Most likely one gets buffered when the file descriptor is a pipe, and the other is at most line buffered, as it is when the code is run by the streaming mapper task. On Mon, Oct 26, 2009 at 11:06 AM, Ryan Rosario wrote: > Thanks. I think that I may have tripped on some sort of bug. > Unfortunately,

Re: Secondary NameNodes or NFS exports?

2009-10-27 Thread Jason Venner
We have been having some trouble with the secondary on a cluster that has one edit log partition on an nfs server, with the namenode rejecting the merged images due to timestamp mismatches. On Mon, Oct 26, 2009 at 10:14 AM, Stas Oskin wrote: > Hi. > > Thanks for the advice, it seems that the i

Re: Problem to create sequence file for

2009-10-27 Thread Jason Venner
> Thanks for the reply. > The string is the entire content of the input text file. > It could be as long as ~300MB. > I tried increasing jvm heap but unfortunately it was giving the same error. > > The other option I am thinking of is to split input files first. > > - Bhushan > -

Re: too many 100% mapper does not complete / finish / commit

2009-11-02 Thread Jason Venner
Nominally, when the map is done, the close is fired, all framework-opened output files are flushed, and the task waits for all of the acks from the block-hosting datanodes; then the output committer stages files into the task output directory. It sounds like there may be an issue with the clos

Re: architecture help

2009-11-16 Thread Jason Venner
What version of hadoop are you using? It may be that you are creating a new connection in each map call. Create your connection in the configure method and close it in the close method, perhaps committing every 1000 calls in the mapper. On Mon, Nov 16, 2009 at 3:33 PM, yz5od2 wrote: > Thanks all for the repl
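A hedged sketch of that pattern; the JDBC property name, key/value types, and the 1000-record commit interval are illustrative, not the original poster's code.

```java
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class DbWritingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private Connection connection;
  private int pending;

  public void configure(JobConf job) {
    try {
      // One connection per task, opened once, instead of one per map() call.
      connection = DriverManager.getConnection(job.get("myapp.jdbc.url"));
      connection.setAutoCommit(false);
    } catch (SQLException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<NullWritable, NullWritable> output, Reporter reporter)
      throws IOException {
    try {
      // ... add the record to a batch / prepared statement here ...
      if (++pending % 1000 == 0) {
        connection.commit(); // commit every 1000 calls, as suggested above
      }
    } catch (SQLException e) {
      throw new IOException("commit failed: " + e);
    }
  }

  public void close() throws IOException {
    try {
      connection.commit();
      connection.close();
    } catch (SQLException e) {
      throw new IOException("close failed: " + e);
    }
  }
}
```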

Re: common reasons a map task would fail on a distributed cluster but not locally?

2009-11-16 Thread Jason Venner
The common reasons I have failures with streaming jobs are: 1) the script exits with a non-zero exit status, which is considered task failure by the task tracker; 2) at least in 18 and 19, if the script writes to stdout before reading the first input record, the streaming code will NPE because all r

Re: Join Documentation Correct?

2009-11-19 Thread Jason Venner
Are you certain that your records are being split into key and value the way you expect? That is the usual reason for odd join behavior. I haven't used the join code past 19.1, however. On Wed, Nov 18, 2009 at 12:42 PM, Edmund Kohlwey wrote: > I'm using Cloudera's distribution for Hadoop 0.20.1

Re: to get hadoop working around with multiple users on the same instance

2009-11-21 Thread Jason Venner
Disable hdfs permission checking with the dfs.permissions property: if "true", permission checking is enabled in HDFS; if "false", permission checking is turned off, but all other behavior is unchanged. Switching from one parameter value to the other does not change the mode, owner or gr
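As a hedged example, turning checking off would look like the following in the namenode's configuration file (false being the "turned off" setting described above):

```xml
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
```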

Re: Is the mapper output type must the same as reducer if combiner is used ?

2009-11-22 Thread Jason Venner
Your combiner has to have the same output types as the mapper, as it is run within the context of the mapper. On Sun, Nov 22, 2009 at 7:41 AM, Jeff Zhang wrote: > karthik, > > In your case, then the Combiner is not the same as Reducer. > So it is not necessary to make the combiner same as reducer, ri

Re: Saving Intermediate Results from the Mapper

2009-11-22 Thread Jason Venner
You can manually write the map output to a new file; there are a number of examples of opening a sequence file and writing to it on the web and in the example code for various hadoop books. You can also disable the removal of intermediate data, which will result in potentially large amounts of data
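A hedged sketch of the first option: a mapper that copies each intermediate pair into its own SequenceFile as well as the normal collector. The side-file name, the word-count-style types, and the use of the task work output directory are assumptions, not a prescribed layout.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class IntermediateSavingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private SequenceFile.Writer saved;
  private final IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      // A per-task side file under the task's work output directory.
      Path side = new Path(FileOutputFormat.getWorkOutputPath(job),
          "intermediate-" + job.get("mapred.task.id"));
      saved = SequenceFile.createWriter(fs, job, side, Text.class, IntWritable.class);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    word.set(value.toString());
    saved.append(word, one);   // keep a copy of the intermediate pair
    output.collect(word, one); // and still feed the normal shuffle
  }

  public void close() throws IOException {
    saved.close();
  }
}
```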

Re: Re: Re: Help in Hadoop

2009-11-22 Thread Jason Venner
set the number of reduce tasks to 1. 2009/11/22 > Hi everybody, > The 10 different map-reducers store their respective outputs in > 10 > different files. This is the snap shot > > had...@zeus:~/hadoop-0.19.1$ bin/hadoop dfs -ls output5 > Found 2 items > drwxr-xr-x - hadoop supergro

Re: bad connect ack

2009-11-22 Thread Jason Venner
At one point my admin staff added 5 machines to my cluster but accidentally left port 50010, among others, firewalled. This resulted in chaos for a while until the firewall was found. On Fri, Nov 20, 2009 at 1:01 PM, Bill Brune wrote: > > Hi I'm trying to get a small cluster up with hdfs (hadoop 0.

Re: Saving Intermediate Results from the Mapper

2009-11-23 Thread Jason Venner
ason, > > which option is for setting disable the removal of intermediate data ? > > Thank you > > Jeff Zhang > > > On Mon, Nov 23, 2009 at 10:27 AM, Jason Venner >wrote: > > > You can manually write the map output to a new file, there are a number > of > >

Re: to get hadoop working around with multiple users on the same instance

2009-11-24 Thread Jason Venner
in the ...-site.xml file. The name in ... varies with your hadoop version. On Tue, Nov 24, 2009 at 5:44 AM, Siddu wrote: > On Sun, Nov 22, 2009 at 8:19 AM, Jason Venner >wrote: > > > disable hdfs permission checking > > > > > > dfs.permissions > &g

Re: Processing 10MB files in Hadoop

2009-11-26 Thread Jason Venner
Are the record processing steps bound by a local machine resource - cpu, disk io or other? What I often do when I have lots of small files to handle is use the NLineInputFormat, as data locality for the input files is a much lesser issue than short task run times in that case. Each line of my inpu
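A hedged driver-side sketch of that setup, where a single input file lists one small file (or URL) per line and each map task gets a handful of lines; the path and the 10-lines-per-map value are placeholders.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class SmallFileDriver {
  public static JobConf configure() {
    JobConf conf = new JobConf(SmallFileDriver.class);
    conf.setInputFormat(NLineInputFormat.class);
    // Each map task processes 10 lines (i.e. 10 small files) of the listing.
    conf.setInt("mapred.line.input.format.linespermap", 10);
    FileInputFormat.setInputPaths(conf, new Path("/jobs/input/file-list.txt"));
    return conf;
  }
}
```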

Re: Secondary NameNodes or NFS exports?

2009-12-04 Thread Jason Venner
s because of >> timestamps? >> >> Regards. >> >> On Tue, Oct 27, 2009 at 4:49 PM, Jason Venner > >wrote: >> >> > We have been having some trouble with the secondary on a cluster that >> has >> > one edit log partition on an nfs server,

Re: LeaseExpiredException Exception

2009-12-08 Thread Jason Venner
Is it possible that this is occurring in a task that is being killed by the framework? Sometimes there is a little lag between the time the tracker 'kills a task' and the time the task fully dies; you could be getting into a situation like that, where the task is in the process of dying but the last write i

Re: Why I can only run 2 map/reduce task at a time?

2009-12-22 Thread Jason Venner
1) the number of map/reduce slots per task tracker is fixed at task tracker start time, not at job start time; 2) the rate of launching tasks is relatively slow through hadoop 0.19; 3) the number of tasks for a job is determined by the number of input files and the computed split size
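Point 1 refers to the per-tasktracker slot settings; a hedged hadoop-site.xml fragment follows, where the values are placeholders and only take effect when the tasktracker is restarted.

```xml
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
```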

Re: Writing different output types from map.

2009-12-22 Thread Jason Venner
Your choice is, as Edward says, to write a wrapper class to hold all of the objects you wish to write, or to write multiple files by manually opening additional output files and writing the particular objects to their own file. On Thu, Dec 10, 2009 at 11:31 AM, Edward Capriolo wrote: > On Thu, Dec 10

Re: all or nothing?

2009-12-22 Thread Jason Venner
If you are using a version of hadoop that has a scheduler, which appeared in 19, you can provide scheduling domains that include subsets of the machines in your cluster; otherwise jobs fill as much of the cluster as they can, and the cluster resources are shared somewhat unevenly among all of the currently runn

Re: Large Text object to String conversion

2009-12-22 Thread Jason Venner
The Text class supports low-level access to the underlying byte array in the Text object. You can call getBytes directly and then incrementally transcode the bytes into characters using the charset decoder tools, or call the charAt method to get the characters one by one. The bytesToCodePoint method
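A small hedged sketch of the incremental approach using the static Text.bytesToCodePoint helper; the class name is illustrative.

```java
import java.nio.ByteBuffer;
import org.apache.hadoop.io.Text;

public final class TextScanner {
  public static void scan(Text text) {
    // Only the first getLength() bytes of the backing array are valid data.
    ByteBuffer buf = ByteBuffer.wrap(text.getBytes(), 0, text.getLength());
    while (buf.hasRemaining()) {
      int codePoint = Text.bytesToCodePoint(buf); // decodes one char, advances buf
      // ... process one code point at a time, never building a huge String ...
    }
  }
}
```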

Re: File Split

2009-12-22 Thread Jason Venner
The way an input reader works with a file split is that the reader is responsible for finding the first record start boundary in the input split, and for stopping at the first record end boundary at or after the end of the input split. In your case, if your image data is structured in the fi

Re: sharing variables across chained jobs

2009-12-23 Thread Jason Venner
If your jobs are launched by separate jvm instances, the only real persistence framework you have is hdfs. You have two basic choices: 1. Write summary data to a persistent store, an hdfs file being a simple case, that your next job reads. 2. Write the data you need as a job counter, via
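A hedged sketch of the counter route: the first job increments a custom counter from its tasks, and the driver reads it back after JobClient.runJob returns and hands it to the next job through its configuration. The enum, property name, and job classes are illustrative.

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class ChainedJobDriver {

  // Incremented from tasks via reporter.incrCounter(Stats.INTERESTING_RECORDS, 1).
  public enum Stats { INTERESTING_RECORDS }

  public static void main(String[] args) throws Exception {
    JobConf first = new JobConf(ChainedJobDriver.class);
    // ... configure the first job ...
    RunningJob done = JobClient.runJob(first); // blocks until completion

    long interesting = done.getCounters().getCounter(Stats.INTERESTING_RECORDS);

    JobConf second = new JobConf(ChainedJobDriver.class);
    // Pass the summary value to the next job through its configuration.
    second.setLong("myapp.interesting.records", interesting);
    // ... configure and run the second job ...
    JobClient.runJob(second);
  }
}
```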

Re: Secondary NameNodes or NFS exports?

2009-12-23 Thread Jason Venner
the rolled (old) edit log. As long as no transactions have hit, the time stamps are the same. On Wed, Dec 23, 2009 at 11:23 AM, Stas Oskin wrote: > Hi. > > What was your solution to this then? > > Regards. > > On Sat, Dec 5, 2009 at 7:43 AM, Jason Venner > wrote: &g

Re: Secondary NameNodes or NFS exports?

2009-12-23 Thread Jason Venner
> We roll the edits log successfully during periods of high transfer, when a > new file is being created every 1 second or so. > > We have had issues with unmergeable edits before - there might be some race > conditions in this area. > > Brian > > On Dec 23, 2009, at 7:07

Re: Secondary NameNodes or NFS exports?

2009-12-24 Thread Jason Venner
eckpoints will overlap and might trigger this. (this is conjecture, so > definitely worth testing) > > -Todd > > On Wed, Dec 23, 2009 at 6:38 PM, Jason Venner >wrote: > > > I agree, it seems very wrong, that is why I need a block of time to > really > > verify t

Re: hadoop job progress going back

2009-12-27 Thread Jason Venner
There are some issues with the way counter values are collected and summarized that cause this display behavior. It partially has to do with speculative execution. I believe it is fixed in the later versions of 0.19 and beyond. I can't remember or find the jira associated with this. On Sun, Dec 2

Re: large reducer output with same key

2009-12-31 Thread Jason Venner
The mapred.local.dir parameter will be used by each tasktracker node to provide the directory(ies) in which to store transitory data about the tasks the tasktracker runs. This includes the map output, and can be very large. On Thu, Dec 31, 2009 at 10:03 AM, himanshu chandola < himanshu_cool...@yahoo.com> wrote

Re: large reducer output with same key

2010-01-02 Thread Jason Venner
> Morpheus: Why Not? > Neo: Because I don't like the idea that I'm not in control of my life. > > > > - Original Message > From: Jason Venner > To: common-user@hadoop.apache.org > Sent: Thu, December 31, 2009 1:46:47 PM > Subject: Re: large reduce

Re: Passing whole text file to a single map

2010-01-23 Thread Jason Venner
http://prohadoop.ning.com/forum/topics/passing-whole-file-to-map On Sat, Jan 23, 2010 at 8:41 AM, Edward Capriolo wrote: > My bible code problem is someone similar. I have many small files and > one mapper needs to process an entire file. So I generate an input > file > > /user/bc/ecapriolo/bible

Re: distributing hadoop push

2010-01-24 Thread Jason Venner
You can indeed use file:/// urls, when the mount point is shared. Expect extreme io loading on the machines hosting that mount point ;) On Sat, Jan 23, 2010 at 8:57 AM, prasenjit mukherjee wrote: > I have hundreds of large files ( ~ 100MB ) in a /mnt/ location which is > shared by all my hadoo

Re: JNI in MAp REuce

2010-02-18 Thread Jason Venner
We used to do this all the time at Attributor - now if I can remember how we did it. If the libraries are constant you can just install them on your nodes to save pushing them through the distributed cache, and then set up LD_LIBRARY_PATH correctly. The key issue if you push them through the distr
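A hedged sketch of the distributed cache route; the HDFS path, library name, and heap setting are placeholders, and the "#name" URI fragment plus createSymlink are what make the .so appear in each task's working directory.

```java
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class JniJobSetup {
  public static JobConf configure() throws Exception {
    JobConf conf = new JobConf(JniJobSetup.class);
    // Symlink cache files into each task's working directory.
    DistributedCache.createSymlink(conf);
    // Hypothetical HDFS location of the native library.
    DistributedCache.addCacheFile(
        new URI("hdfs:///libs/libmycodec.so#libmycodec.so"), conf);
    // Make the task working directory part of java.library.path so
    // System.loadLibrary("mycodec") finds the symlinked .so.
    conf.set("mapred.child.java.opts", "-Xmx512m -Djava.library.path=.");
    return conf;
  }
}
```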

Re: What is the biggest problem of extremely large hadoop cluster ?

2010-02-21 Thread Jason Venner
Underlying network bandwidth and rack locality, as well as the operational overhead of managing the machines. After a certain scale point, there will almost always be at least one machine failing. On Sun, Feb 21, 2010 at 7:54 AM, Jeff Zhang wrote: > -- Forwarded message -- > From:

Re: Many child processes dont exit

2010-02-22 Thread Jason Venner
Someone is using a threadpool that does not have daemon threads and that is not shut down before the main method returns. The non-daemon threads prevent the jvm from exiting. We had this problem for a while and modified the Child.main to exit, rather than trying to work out and fix the thi
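For illustration, a small pool factory whose threads are marked daemon so they cannot keep the child JVM alive; the alternative is simply calling shutdown() on the pool before main returns.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

public final class DaemonPool {
  public static ExecutorService newDaemonPool(int threads) {
    return Executors.newFixedThreadPool(threads, new ThreadFactory() {
      public Thread newThread(Runnable r) {
        Thread t = new Thread(r);
        t.setDaemon(true); // daemon threads will not keep the JVM alive
        return t;
      }
    });
  }
}
```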

Re: Hadoop DFS IO Performance measurement

2010-03-31 Thread Jason Venner
Unless you are getting all local IO, and/or you have better than GigE nic interfaces, 100MB/sec is your cap. For local IO the bound is going to be your storage subsystem. Decent drives in a raid 0 configuration are going to cap out on those machines at about 400MB/sec, which is the buffer cache bandwidth

Re: Hadoop DFS IO Performance measurement

2010-03-31 Thread Jason Venner
ayer which adds additional copying and latency. On Wed, Mar 31, 2010 at 6:31 PM, Jason Venner wrote: > Unless you are getting all local IO, and or you have better than GigE > nic interfaces > 100MB/sec is your cap. > > For local IO the bound is going to be your storage subsystem. > Decent