Re: MultipleOutputs or MultipleTextOutputFormat?

2009-05-28 Thread Ankur Goel
One way of doing what you need is to extend MultipleTextOutputFormat and override the following APIs - generateFileNameForKeyValue() - generateActualKey() - generateActualValue() You will need to prefix the directory and file-name of your choice to the key/value depending upon your needs. Assum
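
A minimal sketch of that approach against the 0.20-era org.apache.hadoop.mapred API; the class name and the key-derived directory layout are illustrative, not from the thread:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each record into a directory derived from its key, e.g.
// key "200904/batch_20090429" -> 200904/batch_20090429/part-00000.
public class BucketedTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {

    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // Prefix the leaf file name (e.g. "part-00000") with the key's path.
        return key.toString() + "/" + name;
    }

    @Override
    protected Text generateActualKey(Text key, Text value) {
        // Drop the routing key so it is not written into the output records.
        return null;
    }

    @Override
    protected Text generateActualValue(Text key, Text value) {
        return value;
    }
}

Returning null from generateActualKey() keeps the routing key out of the written records while still using it to pick the file name.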

RE: SequenceFile and streaming

2009-05-28 Thread walter steffe
Hi Tom, I have seen the tar-to-seq tool, but the person who made it says it is very slow: "It took about an hour and a half to convert a 615MB tar.bz2 file to an 868MB sequence file". To me that is not acceptable. Normally, generating a tar file from 615MB of data takes less than one minute. A
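
For a sense of what a hand-rolled converter involves, here is a minimal sketch; the choice of Text keys and BytesWritable values, and sidestepping the tar parsing by reading an already-extracted directory, are assumptions:

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Usage: FilesToSequenceFile <local-input-dir> <output-file>
// Assumes a flat directory of regular files; the file name becomes the key,
// the file bytes become the value.
public class FilesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path(args[1]), Text.class, BytesWritable.class,
                SequenceFile.CompressionType.BLOCK);
        for (File f : new File(args[0]).listFiles()) {
            byte[] buf = new byte[(int) f.length()];
            DataInputStream in = new DataInputStream(new FileInputStream(f));
            in.readFully(buf);
            in.close();
            writer.append(new Text(f.getName()), new BytesWritable(buf));
        }
        writer.close();
    }
}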

Re: org.apache.hadoop.ipc.client : trying connect to server failed

2009-05-28 Thread ashish pareek
Yes, I am able to ping and ssh between the two virtual machines, and I have even set the IP addresses of both virtual machines in their respective /etc/hosts files ... Thanks for the reply. Could you suggest anything else I might have missed, or any remedy? Regards, Ashish

Re: org.apache.hadoop.ipc.client : trying connect to server failed

2009-05-28 Thread Pankil Doshi
Make sure you can ping that datanode and ssh to it. On Thu, May 28, 2009 at 12:02 PM, ashish pareek wrote: > Hi, > I am trying to set up a hadoop cluster on 512 MB machines using > hadoop 0.18, and have followed the procedure given on the Apache Hadoop site for > a hadoop cluster. > I included

Re: Efficient algorithm for many-to-many reduce-side join?

2009-05-28 Thread jason hadoop
Use the map-side join stuff; if I understand your problem, it provides a good solution, but requires getting over the learning hurdle. It is well described in chapter 8 of my book :) On Thu, May 28, 2009 at 8:29 AM, Chris K Wensel wrote: > I believe PIG, and I know Cascading use a kind of 'spillable' li
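
A minimal configuration sketch of Hadoop's built-in map-side join (org.apache.hadoop.mapred.join); the paths and input format are placeholders, and both inputs must be sorted and identically partitioned on the join key:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class MapSideJoinSetup {
    public static void main(String[] args) {
        JobConf conf = new JobConf(MapSideJoinSetup.class);
        conf.setInputFormat(CompositeInputFormat.class);
        // "inner" can also be "outer" or "override".
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
                "inner", SequenceFileInputFormat.class,
                new Path("/data/left"), new Path("/data/right")));
        // The mapper then receives each join key together with a
        // TupleWritable holding the matching values from both inputs.
    }
}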

Re: org.apache.hadoop.ipc.client : trying connect to server failed

2009-05-28 Thread ashish pareek
Hi, can someone help me out? On Thu, May 28, 2009 at 10:32 PM, ashish pareek wrote: > Hi, > I am trying to set up a hadoop cluster on 512 MB machines using > hadoop 0.18, and have followed the procedure given on the Apache Hadoop site for > a hadoop cluster. > I included in conf/slaves

Re: Reduce() time takes ~4x Map()

2009-05-28 Thread jason hadoop
At a minimum, enable map output compression (mapred.compress.map.output); it may make some difference. Sorting is very expensive when there are many keys and the values are large. Are you quite certain your keys are unique? Also, do you need them sorted by document id? On Thu, May 28, 2009
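
For reference, a sketch of flipping that switch in code with the old JobConf API (the codec choice is just an example; the same can be done in the config file):

import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapred.JobConf;

public class CompressMapOutput {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Compress the intermediate map output to cut shuffle and spill I/O.
        conf.setCompressMapOutput(true);  // mapred.compress.map.output=true
        // Optional: pick the codec used for the intermediate data.
        conf.setMapOutputCompressorClass(DefaultCodec.class);
    }
}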

Re: Reduce() time takes ~4x Map()

2009-05-28 Thread Jothi Padmanabhan
Hi David, If you go to the JobTracker history, click on this job, and then do Analyse This Job, you should be able to get a breakdown of the timings for the individual phases of the map and reduce tasks, including the average, best and worst times. Could you provide those numbers so that we can get a

Re: InputFormat for fixed-width records?

2009-05-28 Thread Stuart White
On Thu, May 28, 2009 at 9:50 AM, Owen O'Malley wrote: > > The update to the terasort example has an InputFormat that does exactly > that. The key is 10 bytes and the value is the next 90 bytes. It is pretty > easy to write, but I should upload it soon. The output types are Text, but > they just h

MultipleOutputs or MultipleTextOutputFormat?

2009-05-28 Thread Kevin Peterson
I am trying to figure out the best way to split output into different directories. My goal is to have a directory structure allowing me to add the content from each batch into the right bucket, like this: ... /content/200904/batch_20090429 /content/200904/batch_20090430 /content/200904/batch_20090

Re: New version/API stable?

2009-05-28 Thread Alex Loddengaard
0.19 is considered unstable by us at Cloudera and by the Y! folks; they never deployed it to their clusters. That said, we recommend 0.18.3 as the most stable version of Hadoop right now. Y! has deployed (or will soon deploy) 0.20, which implies that it's at least stable enough for them to give it a g

Question: index package in contrib (lucene index)

2009-05-28 Thread Tenaali Ram
Hi, I am trying to understand the code of the index package to build a distributed Lucene index. I have some very basic questions and would really appreciate it if someone could help me understand this code: 1) If I already have a Lucene index (divided into shards), should I upload these indexes into HDFS a

New version/API stable?

2009-05-28 Thread David Rosenstrauch
Hadoop noob here, just starting to learn it, as we're planning to start using it heavily in our processing. Just wondering, though, which version of the code I should start learning/working with. It looks like the Hadoop API changed pretty significantly from 0.19 to 0.20 (e.g., org.apache.hadoop.

How do I convert DataInput and ResultSet to array of String?

2009-05-28 Thread dealmaker
Hi, How do I convert a DataInput to an array of String? How do I convert a ResultSet to an array of String? Thanks. Following is the code: static class Record implements Writable, DBWritable { String[] aSAssoc; public void write(DataOutput arg0) throws IOException { throw new Unsuppo
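
A sketch of how the missing readFields() methods might fill the array: a length prefix plus strings on the Writable side, and 1-based JDBC column reads on the DBWritable side. Reading every selected column as a string is an assumption about the poster's query:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.lib.db.DBWritable;

class Record implements Writable, DBWritable {
    String[] aSAssoc;

    public void write(DataOutput out) throws IOException {
        out.writeInt(aSAssoc.length);          // length prefix, then the strings
        for (String s : aSAssoc) {
            Text.writeString(out, s);
        }
    }

    public void readFields(DataInput in) throws IOException {
        int n = in.readInt();                  // must mirror write() exactly
        aSAssoc = new String[n];
        for (int i = 0; i < n; i++) {
            aSAssoc[i] = Text.readString(in);
        }
    }

    public void readFields(ResultSet rs) throws SQLException {
        // Pull every selected column as a string; JDBC columns are 1-based.
        int cols = rs.getMetaData().getColumnCount();
        aSAssoc = new String[cols];
        for (int i = 0; i < cols; i++) {
            aSAssoc[i] = rs.getString(i + 1);
        }
    }

    public void write(PreparedStatement stmt) throws SQLException {
        for (int i = 0; i < aSAssoc.length; i++) {
            stmt.setString(i + 1, aSAssoc[i]);
        }
    }
}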

Re: Appending to a file / updating a file

2009-05-28 Thread Olivier Smadja
Thanks Damien. And can I update a file with Hadoop, or just create it and read it later? Olivier On Thu, May 28, 2009 at 1:31 PM, Damien Cooke wrote: > Olivier, > Append is not supported or recommended at this point. You can turn it on > via dfs.support.append in hdfs-site.xml under 0.20.0. T

Re: hadoop hardware configuration

2009-05-28 Thread Patrick Angeles
On Thu, May 28, 2009 at 6:02 AM, Steve Loughran wrote: > That really depends on the work you are doing...the bytes in/out to CPU > work, and the size of any memory structures that are built up over the run. > > With 1 core per physical disk, you get the bandwidth of a single disk per > CPU; for s

Re: hadoop hardware configuration

2009-05-28 Thread Brian Bockelman
On May 28, 2009, at 2:00 PM, Patrick Angeles wrote: On Thu, May 28, 2009 at 10:24 AM, Brian Bockelman wrote: We do both -- push the disk image out to NFS and have mirrored SAS hard drives on the namenode. The SAS drives appear to be overkill. This sounds like a nice approach, takin

Re: hadoop hardware configuration

2009-05-28 Thread Patrick Angeles
On Thu, May 28, 2009 at 10:24 AM, Brian Bockelman wrote: > > We do both -- push the disk image out to NFS and have mirrored SAS hard > drives on the namenode. The SAS drives appear to be overkill. > This sounds like a nice approach, taking into account hardware, labor and downtime costs... $70

Re: Persistent storage on EC2

2009-05-28 Thread Kevin Peterson
On Tue, May 26, 2009 at 7:50 PM, Malcolm Matalka < mmata...@millennialmedia.com> wrote: > I'm using EBS volumes to have a persistent HDFS on EC2. Do I need to keep > the master updated on how to map the internal IPs, which change as I > understand, to a known set of host names so it knows where t

org.apache.hadoop.ipc.client : trying connect to server failed

2009-05-28 Thread ashish pareek
Hi, I am trying to set up a hadoop cluster on 512 MB machines using hadoop 0.18, and have followed the procedure given on the Apache Hadoop site for a hadoop cluster. I included two datanodes in conf/slaves, i.e. including the namenode virtual machine and the other virtual machine. an

Reduce() time takes ~4x Map()

2009-05-28 Thread David Batista
Hi everyone, I'm processing XML files, around 500MB each, with several documents. To the map() function I pass a document from the XML file, which takes some time to process depending on its size - I'm applying NER to the texts. Each document has a unique identifier, so I'm using that identifier as a

Re: Appending to a file / updating a file

2009-05-28 Thread Damien Cooke
Olivier, Append is not supported or recommended at this point. You can turn it on via dfs.support.append in hdfs-site.xml under 0.20.0. There have been some issues making it reliable. If this is not production code or a production job then turning it on will probably have no detrimental
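
A minimal client-side sketch, assuming the cluster's hdfs-site.xml also has dfs.support.append set and the daemons have been restarted; the path is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Mirrors the hdfs-site.xml setting; the server side must agree.
        conf.setBoolean("dfs.support.append", true);
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.append(new Path("/tmp/test.log"));
        out.writeBytes("another line\n");
        out.close();
    }
}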

Re: hadoop hardware configuration

2009-05-28 Thread Brian Bockelman
On May 28, 2009, at 10:32 AM, Ian Soboroff wrote: Brian Bockelman writes: Despite my trying, I've never been able to come even close to pegging the CPUs on our NN. I'd recommend going for the fastest dual-cores which are affordable -- latency is king. Clue? Surely the latencies in Had

Re: hadoop hardware configuration

2009-05-28 Thread Ian Soboroff
Brian Bockelman writes: > Despite my trying, I've never been able to come even close to pegging > the CPUs on our NN. > > I'd recommend going for the fastest dual-cores which are affordable -- > latency is king. Clue? Surely the latencies in Hadoop that dominate are not cured with faster proce

Re: Efficient algorithm for many-to-many reduce-side join?

2009-05-28 Thread Chris K Wensel
I believe PIG, and I know Cascading use a kind of 'spillable' list that can be re-iterated across. PIG's version is a bit more sophisticated last I looked. that said, if you were using either one of them, you wouldn't need to write your own many-to-many join. cheers, ckw On May 28, 2009,

Re: Efficient algorithm for many-to-many reduce-side join?

2009-05-28 Thread Todd Lipcon
One last possible trick to consider: If you were to subclass SequenceFileRecordReader, you'd have access to its seek method, allowing you to rewind the reducer input. You could then implement a block hash join with something like the following pseudocode: ahash = new HashMap(); while (i have ram
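
Fleshing that pseudocode out a little: plain Java lists and stdout stand in for the rewindable reader and the reducer's OutputCollector, and records are modeled as {joinKey, value} string pairs:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BlockHashJoin {
    // sideB is re-iterated once per block of sideA; in the real thing that
    // re-iteration is the seek back to the start of the reducer's input.
    public static void join(List<String[]> sideA, List<String[]> sideB, int blockSize) {
        for (int start = 0; start < sideA.size(); start += blockSize) {
            // 1. Load the next block of side A into a hash map, keyed by join key.
            Map<String, List<String>> ahash = new HashMap<String, List<String>>();
            int stop = Math.min(start + blockSize, sideA.size());
            for (String[] rec : sideA.subList(start, stop)) {
                List<String> vals = ahash.get(rec[0]);
                if (vals == null) {
                    vals = new ArrayList<String>();
                    ahash.put(rec[0], vals);
                }
                vals.add(rec[1]);
            }
            // 2. Stream all of side B against the block and emit the matches.
            for (String[] rec : sideB) {
                List<String> matches = ahash.get(rec[0]);
                if (matches == null) {
                    continue;
                }
                for (String aVal : matches) {
                    System.out.println(rec[0] + "\t" + aVal + "\t" + rec[1]);
                }
            }
        }
    }
}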

Re: Efficient algorithm for many-to-many reduce-side join?

2009-05-28 Thread Todd Lipcon
Hi Stuart, It seems to me like you have a few options. Option 1: Just use a lot of RAM. Unless you really expect many millions of entries on both sides of the join, you might be able to get away with buffering despite its inefficiency. Option 2: Use LocalDirAllocator to find some local storage t

Re: Appending to a file / updating a file

2009-05-28 Thread Sasha Dolgy
Did you restart Hadoop? Sorry, I'm stuck in the middle of something so I can't give this more attention. I can assure you, however, that we have append working in our POC ... and the code isn't that much different from what you have posted. -sd On Thu, May 28, 2009 at 3:31 PM, Olivier Smadja wrote: >

Re: InputFormat for fixed-width records?

2009-05-28 Thread Owen O'Malley
On May 28, 2009, at 5:15 AM, Stuart White wrote: I need to process a dataset that contains text records of fixed length in bytes. For example, each record may be 100 bytes in length. The update to the terasort example has an InputFormat that does exactly that. The key is 10 bytes and the val

Re: Appending to a file / updating a file

2009-05-28 Thread Olivier Smadja
Thanks Sasha, My hdfs-site.xml now looks like this (as hadoop-site.xml seems to be deprecated): dfs.support.append true But I keep receiving the exception. Checking the Hadoop source code, I saw public FSDataOutputStream append(Path f, int bufferSize,

Re: SequenceFile and streaming

2009-05-28 Thread Tom White
Hi Walter, On Thu, May 28, 2009 at 6:52 AM, walter steffe wrote: > Hello > I am a new user and I would like to use hadoop streaming with > SequenceFile on both the input and output sides. > > -The first difficulty arises from the lack of a simple tool to generate > a SequenceFile starting from a set

Re: hadoop hardware configuration

2009-05-28 Thread Brian Bockelman
On May 28, 2009, at 5:02 AM, Steve Loughran wrote: Patrick Angeles wrote: Sorry for cross-posting, I realized I sent the following to the hbase list when it's really more a Hadoop question. This is an interesting question. Obviously as an HP employee you must assume that I'm biased when

Re: Appending to a file / updating a file

2009-05-28 Thread Sasha Dolgy
http://www.mail-archive.com/core-user@hadoop.apache.org/msg10002.html On Thu, May 28, 2009 at 3:03 PM, Olivier Smadja wrote: > Hi Sacha! > > Thanks for the quick answer. Is there a simple way to search the mailing > list? by text or by author. > > At http://mail-archives.apache.org/mod_mbox/hado

Re: Appending to a file / updating a file

2009-05-28 Thread Olivier Smadja
Hi Sasha! Thanks for the quick answer. Is there a simple way to search the mailing list, by text or by author? At http://mail-archives.apache.org/mod_mbox/hadoop-core-user/ I only see browsing by month... Thanks, Olivier On Thu, May 28, 2009 at 10:57 AM, Sasha Dolgy wrote: > append isn't su

Re: Appending to a file / updating a file

2009-05-28 Thread Sasha Dolgy
Append isn't supported without modifying the Hadoop configuration file. Check out the mailing list threads ... I've sent a post in the past explaining how to enable it. On Thu, May 28, 2009 at 2:46 PM, Olivier Smadja wrote: > Hello, > > I'm trying hadoop for the first time and I'm just tryi

Re: InputFormat for fixed-width records?

2009-05-28 Thread Tom White
Hi Stuart, There isn't an InputFormat that comes with Hadoop to do this. Rather than pre-processing the file, it would be better to implement your own InputFormat. Subclass FileInputFormat and provide an implementation of getRecordReader() that returns your implementation of RecordReader to read f
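
A minimal sketch of such an InputFormat against the old (mapred) API; the class names are mine, the 100-byte record length is hard-coded from the example, and the file length is assumed to be an exact multiple of the record length:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class FixedWidthInputFormat extends FileInputFormat<LongWritable, BytesWritable> {
    static final int RECORD_LEN = 100;  // a real version would read this from the JobConf

    public RecordReader<LongWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new FixedWidthRecordReader((FileSplit) split, job);
    }

    static class FixedWidthRecordReader
            implements RecordReader<LongWritable, BytesWritable> {
        private final FSDataInputStream in;
        private final long start;  // first record boundary at or after the split start
        private final long end;    // records starting before this offset belong to us
        private long pos;

        FixedWidthRecordReader(FileSplit split, JobConf job) throws IOException {
            FileSystem fs = split.getPath().getFileSystem(job);
            in = fs.open(split.getPath());
            start = ((split.getStart() + RECORD_LEN - 1) / RECORD_LEN) * RECORD_LEN;
            end = split.getStart() + split.getLength();
            pos = start;
        }

        public boolean next(LongWritable key, BytesWritable value) throws IOException {
            if (pos >= end) {
                return false;  // the next record belongs to the following split
            }
            byte[] buf = new byte[RECORD_LEN];
            in.readFully(pos, buf);     // may read past 'end'; that is intended
            key.set(pos / RECORD_LEN);  // record number as the key
            value.set(buf, 0, RECORD_LEN);
            pos += RECORD_LEN;
            return true;
        }

        public LongWritable createKey() { return new LongWritable(); }
        public BytesWritable createValue() { return new BytesWritable(); }
        public long getPos() { return pos; }
        public void close() throws IOException { in.close(); }
        public float getProgress() {
            return end == start ? 1.0f
                    : Math.min(1.0f, (pos - start) / (float) (end - start));
        }
    }
}

Rounding the split start up to a record boundary, and reading any record that starts before the split end even if it straddles it, ensures each record is read by exactly one split.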

Appending to a file / updating a file

2009-05-28 Thread Olivier Smadja
Hello, I'm trying hadoop for the first time and I'm just trying to create a file and append some text in it with the following code: import java.io.IOException; import org.apache.hadoop.conf. Configuration; import org.apache.hadoop.fs.FSDataOutputStream; import org.apache.hadoop.fs.FileSystem;

Re: Issue with usage of fs -test

2009-05-28 Thread pankaj jairath
Thanks, Koji. This is the issue I am facing and I have been using version 0.18.x. -Pankaj Koji Noguchi wrote: Maybe https://issues.apache.org/jira/browse/HADOOP-3792 ? Koji -Original Message- From: pankaj jairath [mailto:pjair...@yahoo-inc.com] Sent: Thursday, May 28, 2009 4:49 AM

InputFormat for fixed-width records?

2009-05-28 Thread Stuart White
I need to process a dataset that contains text records of fixed length in bytes. For example, each record may be 100 bytes in length, with the first field being the first 10 bytes, the second field being the second 10 bytes, etc. There are no newlines in the file. Field values have been either

Efficient algorithm for many-to-many reduce-side join?

2009-05-28 Thread Stuart White
I need to do a reduce-side join of two datasets. It's a many-to-many join; that is, each dataset can contain multiple records with any given key. Every description of a reduce-side join I've seen involves constructing your keys in your mapper such that records from one dataset will be presented t

RE: Issue with usage of fs -test

2009-05-28 Thread Koji Noguchi
Maybe https://issues.apache.org/jira/browse/HADOOP-3792 ? Koji -Original Message- From: pankaj jairath [mailto:pjair...@yahoo-inc.com] Sent: Thursday, May 28, 2009 4:49 AM To: core-user@hadoop.apache.org Subject: Issue with usage of fs -test Hello, I am facing a strange issue, where i

Issue with usage of fs -test

2009-05-28 Thread pankaj jairath
Hello, I am facing a strange issue wherein fs -test -e fails and fs -ls succeeds in listing the file. Following is the grep of such a result: bin]$ hadoop fs -ls /projects/myproject///.done Found 1 items -rw--- 3 user hdfs 0 2009-03-19 22:28 /projects/mypro

Re: hadoop hardware configuration

2009-05-28 Thread Steve Loughran
Patrick Angeles wrote: Sorry for cross-posting, I realized I sent the following to the hbase list when it's really more a Hadoop question. This is an interesting question. Obviously as an HP employee you must assume that I'm biased when I say HP DL160 servers are good value for the workers,

Re: Can I have Reducer with No Output?

2009-05-28 Thread Jothi Padmanabhan
If your reducer does not write anything, you could look at NullOutputFormat as well. Jothi On 5/28/09 1:38 PM, "tim robertson" wrote: > Yes you can do this. > > It is complaining because you are not declaring the output types in > the method signature, but you will not use them anyway. > > S
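
Putting the two suggestions in this thread together, a sketch (class name illustrative; the MySQL side elided):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Declares NullWritable output types to satisfy the interface, but never
// calls output.collect(); the real work (writing the file to be loaded
// into MySQL) happens as a side effect.
public class LoadingReducer extends MapReduceBase
        implements Reducer<Text, Text, NullWritable, NullWritable> {

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<NullWritable, NullWritable> output, Reporter reporter)
            throws IOException {
        while (values.hasNext()) {
            Text value = values.next();
            // ... append 'value' to the file that will be loaded into MySQL ...
        }
    }
}

With conf.setOutputFormat(NullOutputFormat.class) set on the JobConf (org.apache.hadoop.mapred.lib.NullOutputFormat), the job writes no part files at all.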

Re: Can I have Reducer with No Output?

2009-05-28 Thread tim robertson
Yes you can do this. It is complaining because you are not declaring the output types in the method signature, but you will not use them anyway. So please try: private static class Reducer extends MapReduceBase implements Reducer { ... The output format will be a TextOutputFormat, but it will no

Can I have Reducer with No Output?

2009-05-28 Thread dealmaker
Hi, I have maps that do most of the work, and they output the data to a reducer; basically the key is a constant, and the reducer combines all the input from the maps into a file and does a "LOAD DATA" of the file into a MySQL db. So there won't be any output.collect() in the reducer function. But whe

Intermittent "Already Being Created Exception"

2009-05-28 Thread Palleti, Pallavi
Hi all, I have a 50-node cluster and I am trying to write some logs of size 1GB each into HDFS. I need to write them in a temporal fashion: say, for every 15 minutes' worth of data, I close the previously opened file and create a new file. The snippet of code is if() {
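
A sketch of one way to structure that rolling write (the path and naming scheme are assumptions): closing the old stream before creating the next file, and giving each window's file a unique name, avoids two writers contending for the same path, which is one common way to hit AlreadyBeingCreatedException:

import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RollingHdfsWriter {
    private static final long WINDOW_MS = 15 * 60 * 1000L;  // 15-minute windows
    private final FileSystem fs;
    private FSDataOutputStream out;
    private long windowStart = -1;

    public RollingHdfsWriter(Configuration conf) throws Exception {
        fs = FileSystem.get(conf);
    }

    public synchronized void write(byte[] record) throws Exception {
        long now = System.currentTimeMillis();
        if (out == null || now - windowStart >= WINDOW_MS) {
            if (out != null) {
                out.close();  // close before creating the next file
            }
            windowStart = now - (now % WINDOW_MS);  // align to the window boundary
            String stamp = new SimpleDateFormat("yyyyMMdd-HHmm")
                    .format(new Date(windowStart));
            out = fs.create(new Path("/logs/log-" + stamp + ".dat"), false);
        }
        out.write(record);
    }

    public synchronized void close() throws Exception {
        if (out != null) {
            out.close();
        }
    }
}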