Re: Is it possible in Hadoop to overwrite or update a file?
You can overwrite it, but you can't update it. Soon you will be able to append to it, but you won't be able to do any other updates.

On 4/2/08 11:39 PM, "Garri Santos" <[EMAIL PROTECTED]> wrote:

> Hi!
>
> I'm starting to take a look at hadoop and the whole HDFS idea. I'm wondering
> if it's just fine to update or overwrite a file copied to hadoop?
>
> Thanks,
> Garri
Re: Is it possible in Hadoop to overwrite or update a file?
On Apr 2, 2008, at 11:39 PM, Garri Santos wrote:

> Hi! I'm starting to take a look at hadoop and the whole HDFS idea. I'm wondering
> if it's just fine to update or overwrite a file copied to hadoop?

No. Although we are making progress on HADOOP-1700, which would allow appending onto files.

-- Owen
Is it possible in Hadoop to overwrite or update a file?
Hi!

I'm starting to take a look at hadoop and the whole HDFS idea. I'm wondering if it's just fine to update or overwrite a file copied to hadoop?

Thanks,
Garri
Re: Help: libhdfs SIGSEGV
Hello Christian.

As you said, it does work. Thank you very much. Would you like to explain further why it works?

-- Yingyuan

Christian Kunz wrote:
> Hi Yingyuan,
>
> Did you try to connect to hdfs in the main function before spawning off the
> threads and using the same handle in all the threads?
>
> -Christian
Re: Help: libhdfs SIGSEGV
Hi Yingyuan,

Did you try to connect to hdfs in the main function before spawning off the threads and using the same handle in all the threads?

-Christian

On 4/2/08 6:21 PM, "Yingyuan Cheng" <[EMAIL PROTECTED]> wrote:

> Hello Arun and all.
>
> I generated a data block in my main function, then spawned several
> threads to write this block several times to different files in hdfs.
> My thread function is as follows:
>
> ---
> // File: hdfsbench.cpp
>
> void* worker_thread_w(void *arg)
> {
>     int idx = (int)arg;
>     std::string path = g_config.path_prefix;
>     string_append(path, "%d", idx);
>
>     hdfsFS fs = hdfsConnect("default", 0);
>     if (!fs) {
>         PERRORL(ERROR, "Thread %d failed to connect to hdfs!\n", idx);
>         g_workers->at(idx).status = TS_FAIL;
>         return (void*)false;
>     }
>
>     hdfsFile writeFile = hdfsOpenFile(fs, path.c_str(), O_WRONLY, 0, 0, 0);
>     if (!writeFile) {
>         PERRORL(ERROR, "Thread %d failed to open %s for writing!\n", idx,
>                 path.c_str());
>         g_workers->at(idx).status = TS_FAIL;
>         //hdfsDisconnect(fs);
>         return (void*)false;
>     }
>
>     int i;
>     int bwrite, boffset;
>     struct timeval tv_start, tv_end;
>
>     gettimeofday(&tv_start, NULL);
>
>     for (i = 0; i < g_config.num_blocks; i++) {
>         boffset = 0;
>         while (boffset < g_config.block_size) {
>             bwrite = hdfsWrite(fs, writeFile, g_buffer->ptr + boffset,
>                                g_config.block_size - boffset);
>             if (bwrite < 0) {
>                 PERRORL(ERROR, "Thread %d failed when writing %s at block %d offset %d\n",
>                         idx, path.c_str(), i, boffset);
>                 g_workers->at(idx).status = TS_FAIL;
>                 i = g_config.num_blocks;
>                 break;
>             }
>             boffset += bwrite;
>             g_workers->at(idx).nbytes += bwrite;
>         }
>     }
>
>     gettimeofday(&tv_end, NULL);
>     hdfsCloseFile(fs, writeFile);
>     //hdfsDisconnect(fs);
>     g_workers->at(idx).elapsed = get_timeval_differ(tv_start, tv_end);
>
>     if (g_workers->at(idx).status == TS_UNSET ||
>         g_workers->at(idx).status == TS_RUNNING) {
>         g_workers->at(idx).status = TS_SUCCESS;
>     }
>
>     return (void*)true;
> }
>
> ---
>
> And the following is my running script:
>
> ---
>
> #!/bin/bash
> #
> # File: hdfsbench.sh
>
> HADOOP_HOME="/opt/hadoop"
> JAVA_HOME="/opt/java"
>
> export CLASSPATH="$HADOOP_HOME/hadoop-0.16.1-core.jar:$HADOOP_HOME/lib/commons-logging-1.0.4.jar:$HADOOP_HOME/lib/log4j-1.2.13.jar:$HADOOP_HOME/conf"
> export LD_LIBRARY_PATH="$HADOOP_HOME/libhdfs:$JAVA_HOME/lib/i386/server"
>
> ./hdfsbench $@
>
> ---
>
> Yingyuan
>
> Arun C Murthy wrote:
>>
>> On Apr 2, 2008, at 1:36 AM, Yingyuan Cheng wrote:
>>
>>> Hello.
>>>
>>> Is libhdfs thread-safe? I can run a single thread reading/writing HDFS
>>> through libhdfs well, but when increasing the number of threads to 2 or
>>> above, I received a sigsegv error:
>>>
>>
>> Could you explain a bit more? What are you doing in different threads?
>> Are you writing to the same file?
>>
>> Arun
Re: Help: libhdfs SIGSEGV
Hello Arun and all.

I generated a data block in my main function, then spawned several threads to write this block several times to different files in hdfs. My thread function is as follows:

---
// File: hdfsbench.cpp

void* worker_thread_w(void *arg)
{
    int idx = (int)arg;
    std::string path = g_config.path_prefix;
    string_append(path, "%d", idx);

    hdfsFS fs = hdfsConnect("default", 0);
    if (!fs) {
        PERRORL(ERROR, "Thread %d failed to connect to hdfs!\n", idx);
        g_workers->at(idx).status = TS_FAIL;
        return (void*)false;
    }

    hdfsFile writeFile = hdfsOpenFile(fs, path.c_str(), O_WRONLY, 0, 0, 0);
    if (!writeFile) {
        PERRORL(ERROR, "Thread %d failed to open %s for writing!\n", idx,
                path.c_str());
        g_workers->at(idx).status = TS_FAIL;
        //hdfsDisconnect(fs);
        return (void*)false;
    }

    int i;
    int bwrite, boffset;
    struct timeval tv_start, tv_end;

    gettimeofday(&tv_start, NULL);

    for (i = 0; i < g_config.num_blocks; i++) {
        boffset = 0;
        while (boffset < g_config.block_size) {
            bwrite = hdfsWrite(fs, writeFile, g_buffer->ptr + boffset,
                               g_config.block_size - boffset);
            if (bwrite < 0) {
                PERRORL(ERROR, "Thread %d failed when writing %s at block %d offset %d\n",
                        idx, path.c_str(), i, boffset);
                g_workers->at(idx).status = TS_FAIL;
                i = g_config.num_blocks;
                break;
            }
            boffset += bwrite;
            g_workers->at(idx).nbytes += bwrite;
        }
    }

    gettimeofday(&tv_end, NULL);
    hdfsCloseFile(fs, writeFile);
    //hdfsDisconnect(fs);
    g_workers->at(idx).elapsed = get_timeval_differ(tv_start, tv_end);

    if (g_workers->at(idx).status == TS_UNSET ||
        g_workers->at(idx).status == TS_RUNNING) {
        g_workers->at(idx).status = TS_SUCCESS;
    }

    return (void*)true;
}

---

And the following is my running script:

---

#!/bin/bash
#
# File: hdfsbench.sh

HADOOP_HOME="/opt/hadoop"
JAVA_HOME="/opt/java"

export CLASSPATH="$HADOOP_HOME/hadoop-0.16.1-core.jar:$HADOOP_HOME/lib/commons-logging-1.0.4.jar:$HADOOP_HOME/lib/log4j-1.2.13.jar:$HADOOP_HOME/conf"
export LD_LIBRARY_PATH="$HADOOP_HOME/libhdfs:$JAVA_HOME/lib/i386/server"

./hdfsbench $@

---

Yingyuan

Arun C Murthy wrote:

> On Apr 2, 2008, at 1:36 AM, Yingyuan Cheng wrote:
>
>> Hello.
>>
>> Is libhdfs thread-safe? I can run a single thread reading/writing HDFS
>> through libhdfs well, but when increasing the number of threads to 2 or
>> above, I received a sigsegv error:
>
> Could you explain a bit more? What are you doing in different threads?
> Are you writing to the same file?
>
> Arun
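[Editor's note] Christian's fix (open ONE connection in the main thread and hand the same handle to every worker, instead of calling hdfsConnect() from each spawned thread) can be sketched without libhdfs. Below is a minimal Python illustration of the pattern only; FakeHdfs is a stand-in for the hdfsFS handle, not the real API.

```python
# Pattern sketch: connect once in the main thread, share the handle.
# FakeHdfs is an assumption standing in for hdfsConnect()/hdfsWrite().
import threading

class FakeHdfs:
    """Stand-in for an hdfsFS handle (illustrative only)."""
    def __init__(self):
        self.lock = threading.Lock()
        self.writes = 0
    def write(self, data):
        # Serialize access to the shared handle.
        with self.lock:
            self.writes += len(data)

def worker(fs, idx, results):
    # Workers reuse the handle created in main; no per-thread connect.
    fs.write(b"x" * 1024)
    results[idx] = True

def run_workers(num_threads=4):
    fs = FakeHdfs()               # "connect" once, in the main thread
    results = [False] * num_threads
    threads = [threading.Thread(target=worker, args=(fs, i, results))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return fs, results

fs, results = run_workers()
print(all(results), fs.writes)   # prints: True 4096
```

In the C program above, the analogous change is to call hdfsConnect() once in main() before pthread_create() and pass the resulting hdfsFS pointer to each worker through its argument.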
Re: Test Data for Hadoop Student
My apologies to the list - the previous email was supposed to go to Aaron Kimball.

Bruce
Re: Test Data for Hadoop Student
I met Christophe at the Hadoop Conference at Yahoo last week. I really liked him. He asked me to maintain the Google Ubuntu Hadoop image, and I sent him the following about my project. Would you read it and offer any comments?

I sent him the following:

"Can I tell you more about my Hadoop in education project?

My project started when I found out Amazon.com will (who would have thought) let you rent their computers by the hour. I realized this would be an ideal way for small schools (really ALL schools - even MIT/Berkeley has a hard time coming up with a hundred idle computers for a student to use) to have access to the resources to expose students (not just Computer Science, but Physics, Astronomy, Biology, etc.) to working in this environment. That is what my independent study project is about. I am producing a student client workstation image and a department server image with everything needed to teach a course and hook up to Amazon. The courseware I am getting from the University of Washington, and documentation from all over. I first emailed the hadoop list, and Aaron Kimball responded, offered his courseware, and said he would "highly endorse your Amazon EC2 idea for doing your labs". I would love any resources you can point me to.

The economic model is that the students don't need to buy a textbook and instead use the money to buy computer time from Amazon. I have already gotten an agency of the state of California to fund my computer time as a student, so that is a good precedent. I am getting weather data and an application to process it with snazzy graphics as a student project. I hope to add more as time goes on.

I know I can be accused of the buzzword of the moment, but I hope to put software on the server image to provide a linkup to other people using the server to form a community. The more communication, the faster things will happen. So the model is more "seeding" than "doing". This is an example of what I am putting into this.

This needs to work "on its own", with common problems pre-solved, without a lot of case-by-case work at each institution. I talked to Jinesh Varia from Amazon, and I may be able to get them to design a custom product for education, accounting, and billing that works with my server to make it simple and secure for the students and teacher to use. If a teacher has to deal with money and billing, it will fail. If you require a school's bureaucracy to handle something new, it will be a harder sell. Schools supply student course needs through the school bookstore. Who is the approved vendor to the school bookstore - Amazon? More case-by-case work. And then they want to mark it up 50%-100%. No no no. So a student logs into my server and buys time like you would anything else on the Internet, but bookkeeping and usage records are kept for the teacher. Simple. No extra work in this area to offer the course.

When I met you, I felt I had met someone who thought exactly like me on how important it is to facilitate moving people along, not just in CS but in other areas, to take advantage of the potential of this technology. I want the guy off in a corner somewhere who has a crazy idea that deserves a Nobel Prize to have what it takes to succeed. Elite should be elite based on worth, not restricted by access to elite-level resources, as much as can be made possible."

I would appreciate your comments, and if you like the ideas, any support you could give yourself and any encouragement you could give Christophe to support this would be appreciated.

BTW - my personal email is [EMAIL PROTECTED] electricranch - "herds of CPUs". I use the gmail address for lists that may expose me to spam.

Bruce

On Fri, Nov 16, 2007 at 7:21 PM, Aaron Kimball <[EMAIL PROTECTED]> wrote:

> Bruce,
>
> I helped design and teach an undergrad course based on Hadoop last year.
> Along with some folks at Google, we then made the resources available
> together to distribute to other universities and the public at large (via
> Creative Commons license, actually).
>
> All the materials are available online here:
> http://code.google.com/edu/content/parallel.html
> (lecture notes, labs, and even video lectures.)
>
> It includes suggested lab activities. Good free data sets you can download
> include Netflix prize data and a copy of the wikipedia corpus. Of course,
> you can set up Nutch and do your own web crawl too.
>
> We also highly endorse the Amazon EC2 idea for doing your own labs :)
>
> Best of luck,
> - Aaron
>
> Edward Bruce Williams wrote:
>> Hello
>>
>> I am a student doing an independent study project investigating the
>> possibility of teaching large scale computing on a small scale budget. Th
>>
>> My thought is to use available Open Source (Hadoop) and Creative Commons
>> and other materials as the text. A student could then do significant
>> computing on Amazon for the cost of what they would usually pay for a
>> textbook. I have co
secondary namenode web interface
Hi,

I'm running Hadoop (latest snapshot) on several machines, and in our setup the namenode and secondary namenode are on different systems. I see from the logs that the secondary namenode regularly checkpoints the fs from the primary namenode. But when I go to the secondary namenode HTTP address (dfs.secondary.http.address) in my browser, I see something like this:

HTTP ERROR: 500
init
RequestURI=/dfshealth.jsp
Powered by Jetty://

And in the secondary's log I find these lines:

2008-04-02 11:26:25,357 WARN /: /dfshealth.jsp:
java.lang.NullPointerException
        at org.apache.hadoop.dfs.dfshealth_jsp.<init>(dfshealth_jsp.java:21)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:539)
        at java.lang.Class.newInstance0(Class.java:373)
        at java.lang.Class.newInstance(Class.java:326)
        at org.mortbay.jetty.servlet.Holder.newInstance(Holder.java:199)
        at org.mortbay.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:326)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:405)
        at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
        at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
        at org.mortbay.http.HttpServer.service(HttpServer.java:954)
        at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
        at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
        at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
        at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
        at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

Is something missing from my configuration? Has anybody else seen these?

Thanks,
-Yuri
Re: distcp fails :Input source not found
It might be a bug. Could you try the following?

bin/hadoop fs -ls s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml

Nicholas

- Original Message -
From: Prasan Ary <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Wednesday, April 2, 2008 7:41:50 AM
Subject: Re: distcp fails :Input source not found

Anybody? Any thoughts on why this might be happening?

Here is what is happening, directly from the ec2 screen. The ID and Secret Key are the only things changed. I'm running hadoop 15.3 from the public ami. I launched a 2 machine cluster using the ec2 scripts in src/contrib/ec2/bin . . . The file I try to copy is 9KB (I noticed previous discussion on empty files and files that are > 10MB).

> First I make sure that we can copy the file from s3

[EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop fs -copyToLocal s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml /usr/InputFileFormat.xml

> Now I see that the file is copied to the ec2 master (where I'm logged in)

[EMAIL PROTECTED] hadoop-0.15.3]# dir /usr/Input*
/usr/InputFileFormat.xml

> Next I make sure I can access the HDFS and that the input directory is there

[EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop fs -ls /
Found 2 items
/input  2008-04-01 15:45
/mnt    2008-04-01 15:42
[EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop fs -ls /input/
Found 0 items

> I make sure hadoop is running just fine by running an example

[EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop jar hadoop-0.15.3-examples.jar pi 10 1000
Number of Maps = 10
Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
08/04/01 17:38:14 INFO mapred.FileInputFormat: Total input paths to process : 10
08/04/01 17:38:14 INFO mapred.JobClient: Running job: job_200804011542_0001
08/04/01 17:38:15 INFO mapred.JobClient: map 0% reduce 0%
08/04/01 17:38:22 INFO mapred.JobClient: map 20% reduce 0%
08/04/01 17:38:24 INFO mapred.JobClient: map 30% reduce 0%
08/04/01 17:38:25 INFO mapred.JobClient: map 40% reduce 0%
08/04/01 17:38:27 INFO mapred.JobClient: map 50% reduce 0%
08/04/01 17:38:28 INFO mapred.JobClient: map 60% reduce 0%
08/04/01 17:38:31 INFO mapred.JobClient: map 80% reduce 0%
08/04/01 17:38:33 INFO mapred.JobClient: map 90% reduce 0%
08/04/01 17:38:34 INFO mapred.JobClient: map 100% reduce 0%
08/04/01 17:38:43 INFO mapred.JobClient: map 100% reduce 20%
08/04/01 17:38:44 INFO mapred.JobClient: map 100% reduce 100%
08/04/01 17:38:45 INFO mapred.JobClient: Job complete: job_200804011542_0001
08/04/01 17:38:45 INFO mapred.JobClient: Counters: 9
08/04/01 17:38:45 INFO mapred.JobClient: Job Counters
08/04/01 17:38:45 INFO mapred.JobClient: Launched map tasks=10
08/04/01 17:38:45 INFO mapred.JobClient: Launched reduce tasks=1
08/04/01 17:38:45 INFO mapred.JobClient: Data-local map tasks=10
08/04/01 17:38:45 INFO mapred.JobClient: Map-Reduce Framework
08/04/01 17:38:45 INFO mapred.JobClient: Map input records=10
08/04/01 17:38:45 INFO mapred.JobClient: Map output records=20
08/04/01 17:38:45 INFO mapred.JobClient: Map input bytes=240
08/04/01 17:38:45 INFO mapred.JobClient: Map output bytes=320
08/04/01 17:38:45 INFO mapred.JobClient: Reduce input groups=2
08/04/01 17:38:45 INFO mapred.JobClient: Reduce input records=20
Job Finished in 31.028 seconds
Estimated value of PI is 3.1556

> Finally, I try to copy the file over

[EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop distcp s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml /input/InputFileFormat.xml
With failures, global counters are inaccurate; consider running with -i
Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input source s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml does not exist.
        at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:470)
        at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:550)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:563)
Error msg: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
Hi,

My colleagues and I are implementing a small search engine in my University Laboratory, and we would like to use Hadoop as the file system. For now we are having trouble running the following simple code example:

#include "hdfs.h"

int main(int argc, char **argv)
{
    hdfsFS fs = hdfsConnect("apolo.latin.dcc.ufmg.br", 51070);
    if (!fs) {
        fprintf(stderr, "Oops! Failed to connect to hdfs!\n");
        exit(-1);
    }

    int result = hdfsDisconnect(fs);
    if (result != 0) {  /* hdfsDisconnect returns 0 on success */
        fprintf(stderr, "Oops! Failed to disconnect from hdfs!\n");
        exit(-1);
    }
}

The error msg is:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration

We configured the following environment variables:

export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun-1.5.0.13
export OS_NAME=linux
export OS_ARCH=i386
export LIBHDFS_BUILD_DIR=/mnt/hd1/hadoop/hadoop-0.14.4/libhdfs
export SHLIB_VERSION=1
export HADOOP_HOME=/mnt/hd1/hadoop/hadoop-0.14.4
export HADOOP_CONF_DIR=/mnt/hd1/hadoop/hadoop-0.14.4/conf
export HADOOP_LOG_DIR=/mnt/hd1/hadoop/hadoop-0.14.4/logs

The following commands were used to compile the code, in directory hadoop-0.14.4/src/c++/libhdfs:

1 - make all
2 - gcc -I/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/include/ -c my_hdfs_test.c
3 - gcc my_hdfs_test.o -I/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/include/ -L/mnt/hd1/hadoop/hadoop-0.14.4/libhdfs -lhdfs -L/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/jre/lib/i386/server -ljvm -o my_hdfs_test

Note: The hadoop file system seems to be OK, since we can run commands like this:

[EMAIL PROTECTED]:/mnt/hd1/hadoop/hadoop-0.14.4$ bin/hadoop dfs -ls
Found 0 items

--
[]s
Anisio Mendes Lacerda
Re: Hadoop Port Configuration
hi,

If your master and slave are two different boxes, don't use 127.0.0.1 as the address. Use something in your LAN, e.g., 192.168.x.x, 10.x.x.x, etc.

HTH,
Yan

2008/4/1, Natarajan, Senthil <[EMAIL PROTECTED]>:
> Hi,
>
> I am using default settings from hadoop-default.xml and hadoop-site.xml,
> and I just changed this port number:
>
> mapred.task.tracker.report.address
>
> I created the firewall rule to allow port range 5:50100 between the
> slaves and master.
>
> But reduce on the slaves seems to use some other ports, so reduce always
> hangs with the firewall enabled. If I disable the firewall, it works fine.
>
> Could you please let me know what I am missing, or where to control
> hadoop's random port creation?
>
> Thanks,
> Senthil
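[Editor's note] A hedged hadoop-site.xml sketch of the port-related properties worth pinning when a firewall sits between master and slaves. Property names match hadoop-default.xml of roughly the 0.16 era and the values are illustrative assumptions; verify both against the hadoop-default.xml that ships with your version.

```xml
<!-- Assumption: Hadoop ~0.16 property names; verify before use. -->
<property>
  <name>mapred.task.tracker.report.address</name>
  <!-- Default is 127.0.0.1:0; port 0 means "pick any free port".
       It binds to loopback only, so the firewall normally need not
       allow it, but a fixed port makes traffic easier to audit. -->
  <value>127.0.0.1:50050</value>
</property>
<property>
  <name>mapred.task.tracker.http.address</name>
  <!-- Reduces fetch map output from every tasktracker over this HTTP
       port, so it must be open between all slaves, not just to the
       master. -->
  <value>0.0.0.0:50060</value>
</property>
<property>
  <name>dfs.datanode.address</name>
  <!-- Block transfers between clients and datanodes. -->
  <value>0.0.0.0:50010</value>
</property>
```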
Re: one key per output part file
Thanks for this information - I might be missing something here, but can my perl script reducer (which is run via streaming, and is not linked to HDFS libraries) just start writing to HDFS? I thought I would have to write it locally, i.e. in "." for the reduce script, and then rely on the MapReduce mechanism to promote the file into the output directory...

Thanks for all the help!

Ashish

On Wed, Apr 2, 2008 at 11:22 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> Writing to HDFS leaves the files as accessible as anything else, if not more
> so.
>
> You can retrieve a file using a URL of the form:
>
> http://<namenode>/data/<path>
>
> Similarly, you can list a directory using a similar URL (whose details I
> forget for the nonce).
>
> On 4/2/08 7:57 AM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
>
>> On Wed, Apr 2, 2008 at 3:36 AM, Joydeep Sen Sarma <[EMAIL PROTECTED]>
>> wrote:
>>
>>> curious - why do we need a file per XXX?
>>>
>>> - if the data needs to be exported (either to a sql db or an external file
>>> system) - then why not do so directly from the reducer (instead of trying to
>>> create these intermediate small files in hdfs)? data can be written to tmp
>>> tables/files and can be overwritten in case the reducer re-runs (and then
>>> committed to final location once the job is complete)
>>
>> The second case (data needs to be exported) is the reason that I have. Each
>> of these small files is used in an external process. This seems like a good
>> solution - the only question then is where can these files be written to
>> safely? Local directory? /tmp?
>>
>> Ashish
>>
>>> -Original Message-
>>> From: [EMAIL PROTECTED] on behalf of Ashish Venugopal
>>> Sent: Tue 4/1/2008 6:42 PM
>>> To: core-user@hadoop.apache.org
>>> Subject: Re: one key per output part file
>>>
>>> This seems like a reasonable solution - but I am using Hadoop streaming and
>>> my reducer is a perl script. Is it possible to handle side-effect files in
>>> streaming? I haven't found anything that indicates that you can...
>>>
>>> Ashish
>>>
>>> On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>>
>>>> Try opening the desired output file in the reduce method. Make sure that
>>>> the output files are relative to the correct task specific directory (look
>>>> for side-effect files on the wiki).
>>>>
>>>> On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hi, I am using Hadoop streaming and I am trying to create a MapReduce that
>>>>> will generate output where a single key is found in a single output part
>>>>> file.
>>>>> Does anyone know how to ensure this condition? I want the reduce task (no
>>>>> matter how many are specified) to only receive
>>>>> key-value output from a single key each, process the key-value pairs for
>>>>> this key, write an output part-XXX file, and only
>>>>> then process the next key.
>>>>>
>>>>> Here is the task that I am trying to accomplish:
>>>>>
>>>>> Input: Corpus T (lines of text), Corpus V (each line has 1 word)
>>>>> Output: Each part-XXX should contain the lines of T that contain the word
>>>>> from line XXX in V.
>>>>>
>>>>> Any help/ideas are appreciated.
>>>>>
>>>>> Ashish
Re: one key per output part file
Writing to HDFS leaves the files as accessible as anything else, if not more so.

You can retrieve a file using a URL of the form:

http://<namenode>/data/<path>

Similarly, you can list a directory using a similar URL (whose details I forget for the nonce).

On 4/2/08 7:57 AM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:

> On Wed, Apr 2, 2008 at 3:36 AM, Joydeep Sen Sarma <[EMAIL PROTECTED]>
> wrote:
>
>> curious - why do we need a file per XXX?
>>
>> - if the data needs to be exported (either to a sql db or an external file
>> system) - then why not do so directly from the reducer (instead of trying to
>> create these intermediate small files in hdfs)? data can be written to tmp
>> tables/files and can be overwritten in case the reducer re-runs (and then
>> committed to final location once the job is complete)
>
> The second case (data needs to be exported) is the reason that I have. Each
> of these small files is used in an external process. This seems like a good
> solution - the only question then is where can these files be written to
> safely? Local directory? /tmp?
>
> Ashish
>
>> -Original Message-
>> From: [EMAIL PROTECTED] on behalf of Ashish Venugopal
>> Sent: Tue 4/1/2008 6:42 PM
>> To: core-user@hadoop.apache.org
>> Subject: Re: one key per output part file
>>
>> This seems like a reasonable solution - but I am using Hadoop streaming and
>> my reducer is a perl script. Is it possible to handle side-effect files in
>> streaming? I haven't found anything that indicates that you can...
>>
>> Ashish
>>
>> On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>
>>> Try opening the desired output file in the reduce method. Make sure that
>>> the output files are relative to the correct task specific directory (look
>>> for side-effect files on the wiki).
>>>
>>> On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi, I am using Hadoop streaming and I am trying to create a MapReduce that
>>>> will generate output where a single key is found in a single output part
>>>> file.
>>>> Does anyone know how to ensure this condition? I want the reduce task (no
>>>> matter how many are specified) to only receive
>>>> key-value output from a single key each, process the key-value pairs for
>>>> this key, write an output part-XXX file, and only
>>>> then process the next key.
>>>>
>>>> Here is the task that I am trying to accomplish:
>>>>
>>>> Input: Corpus T (lines of text), Corpus V (each line has 1 word)
>>>> Output: Each part-XXX should contain the lines of T that contain the word
>>>> from line XXX in V.
>>>>
>>>> Any help/ideas are appreciated.
>>>>
>>>> Ashish
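[Editor's note] Ted's half-remembered URL can be made concrete with a small helper. This is a sketch only: the /data servlet path and the default web UI port 50070 are assumptions about the namenode HTTP interface of that era, and should be checked against your own cluster.

```python
# Build an HTTP URL for fetching an HDFS file through the namenode's
# web interface. Host, port, and the "/data" servlet prefix are
# assumptions for illustration, not confirmed by the thread.
def hdfs_http_url(namenode_host, path, port=50070):
    if not path.startswith("/"):
        path = "/" + path
    return "http://%s:%d/data%s" % (namenode_host, port, path)

# Example use against a reachable cluster (hostname is a placeholder):
# import urllib.request
# data = urllib.request.urlopen(
#     hdfs_http_url("namenode.example.com", "/user/ashish/part-00000")).read()
```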
Re: one key per output part file
On Wed, Apr 2, 2008 at 3:36 AM, Joydeep Sen Sarma <[EMAIL PROTECTED]> wrote:

> curious - why do we need a file per XXX?
>
> - if the data needs to be exported (either to a sql db or an external file
> system) - then why not do so directly from the reducer (instead of trying to
> create these intermediate small files in hdfs)? data can be written to tmp
> tables/files and can be overwritten in case the reducer re-runs (and then
> committed to final location once the job is complete)

The second case (data needs to be exported) is the reason that I have. Each of these small files is used in an external process. This seems like a good solution - the only question then is where can these files be written to safely? Local directory? /tmp?

Ashish

> -Original Message-
> From: [EMAIL PROTECTED] on behalf of Ashish Venugopal
> Sent: Tue 4/1/2008 6:42 PM
> To: core-user@hadoop.apache.org
> Subject: Re: one key per output part file
>
> This seems like a reasonable solution - but I am using Hadoop streaming and
> my reducer is a perl script. Is it possible to handle side-effect files in
> streaming? I haven't found anything that indicates that you can...
>
> Ashish
>
> On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>> Try opening the desired output file in the reduce method. Make sure that
>> the output files are relative to the correct task specific directory (look
>> for side-effect files on the wiki).
>>
>> On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
>>
>>> Hi, I am using Hadoop streaming and I am trying to create a MapReduce that
>>> will generate output where a single key is found in a single output part
>>> file.
>>> Does anyone know how to ensure this condition? I want the reduce task (no
>>> matter how many are specified) to only receive
>>> key-value output from a single key each, process the key-value pairs for
>>> this key, write an output part-XXX file, and only
>>> then process the next key.
>>>
>>> Here is the task that I am trying to accomplish:
>>>
>>> Input: Corpus T (lines of text), Corpus V (each line has 1 word)
>>> Output: Each part-XXX should contain the lines of T that contain the word
>>> from line XXX in V.
>>>
>>> Any help/ideas are appreciated.
>>>
>>> Ashish
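[Editor's note] Ted's side-effect-file suggestion can be sketched for a streaming-style reducer. This is a hedged illustration, not Hadoop's own API: the `part-<key>` naming and the output directory are assumptions; a real streaming task should write under its task-attempt work directory so speculative re-runs don't clobber committed output. The reducer relies on the streaming framework delivering input lines sorted by key.

```python
# Sketch: write each key's values to its own file instead of stdout.
# Input is an iterable of "key\tvalue" lines, sorted by key, as a
# streaming reducer would read them from stdin.
import os

def split_by_key(lines, outdir):
    """Write one file per key (name 'part-<key>' is an assumption)."""
    os.makedirs(outdir, exist_ok=True)
    current_key, out = None, None
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current_key:
            # Keys arrive grouped, so a key change means a new file.
            if out:
                out.close()
            current_key = key
            out = open(os.path.join(outdir, "part-%s" % key), "w")
        out.write(value + "\n")
    if out:
        out.close()
```

As a reducer, the entry point would be `split_by_key(sys.stdin, workdir)`; the same grouping logic works for the perl version.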
Re: distcp fails :Input source not found
Anybody? Any thoughts why this might be happening?

Here is what is happening directly from the ec2 screen. The ID and Secret Key are the only things changed. I'm running hadoop 15.3 from the public ami. I launched a 2 machine cluster using the ec2 scripts in src/contrib/ec2/bin . . . The file I try and copy is 9KB (I noticed previous discussion on empty files and files that are > 10MB).

> First I make sure that we can copy the file from s3

[EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop fs -copyToLocal s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml /usr/InputFileFormat.xml

> Now I see that the file is copied to the ec2 master (where I'm logged in)

[EMAIL PROTECTED] hadoop-0.15.3]# dir /usr/Input*
/usr/InputFileFormat.xml

> Next I make sure I can access the HDFS and that the input directory is there

[EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop fs -ls /
Found 2 items
/input  2008-04-01 15:45
/mnt    2008-04-01 15:42
[EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop fs -ls /input/
Found 0 items

> I make sure hadoop is running just fine by running an example

[EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop jar hadoop-0.15.3-examples.jar pi 10 1000
Number of Maps = 10
Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
08/04/01 17:38:14 INFO mapred.FileInputFormat: Total input paths to process : 10
08/04/01 17:38:14 INFO mapred.JobClient: Running job: job_200804011542_0001
08/04/01 17:38:15 INFO mapred.JobClient: map 0% reduce 0%
08/04/01 17:38:22 INFO mapred.JobClient: map 20% reduce 0%
08/04/01 17:38:24 INFO mapred.JobClient: map 30% reduce 0%
08/04/01 17:38:25 INFO mapred.JobClient: map 40% reduce 0%
08/04/01 17:38:27 INFO mapred.JobClient: map 50% reduce 0%
08/04/01 17:38:28 INFO mapred.JobClient: map 60% reduce 0%
08/04/01 17:38:31 INFO mapred.JobClient: map 80% reduce 0%
08/04/01 17:38:33 INFO mapred.JobClient: map 90% reduce 0%
08/04/01 17:38:34 INFO mapred.JobClient: map 100% reduce 0%
08/04/01 17:38:43 INFO mapred.JobClient: map 100% reduce 20%
08/04/01 17:38:44 INFO mapred.JobClient: map 100% reduce 100%
08/04/01 17:38:45 INFO mapred.JobClient: Job complete: job_200804011542_0001
08/04/01 17:38:45 INFO mapred.JobClient: Counters: 9
08/04/01 17:38:45 INFO mapred.JobClient: Job Counters
08/04/01 17:38:45 INFO mapred.JobClient: Launched map tasks=10
08/04/01 17:38:45 INFO mapred.JobClient: Launched reduce tasks=1
08/04/01 17:38:45 INFO mapred.JobClient: Data-local map tasks=10
08/04/01 17:38:45 INFO mapred.JobClient: Map-Reduce Framework
08/04/01 17:38:45 INFO mapred.JobClient: Map input records=10
08/04/01 17:38:45 INFO mapred.JobClient: Map output records=20
08/04/01 17:38:45 INFO mapred.JobClient: Map input bytes=240
08/04/01 17:38:45 INFO mapred.JobClient: Map output bytes=320
08/04/01 17:38:45 INFO mapred.JobClient: Reduce input groups=2
08/04/01 17:38:45 INFO mapred.JobClient: Reduce input records=20
Job Finished in 31.028 seconds
Estimated value of PI is 3.1556

> Finally, I try and copy the file over

[EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop distcp s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml /input/InputFileFormat.xml
With failures, global counters are inaccurate; consider running with -i
Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input source s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml does not exist.
        at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:470)
        at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:550)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:563)
Re: Help: libhdfs SIGSEGV
On Apr 2, 2008, at 1:36 AM, Yingyuan Cheng wrote:

> Hello.
>
> Is libhdfs thread-safe? I can run a single thread reading/writing HDFS
> through libhdfs fine, but when I increase the number of threads to 2 or
> more, I receive a SIGSEGV error:

Could you explain a bit more? What are you doing in the different threads? Are you writing to the same file?

Arun

> #
> # An unexpected error has been detected by Java Runtime Environment:
> #
> # Internal Error (53484152454432554E54494D450E4350500214), pid=15614, tid=1080834960
> #
> # Java VM: Java HotSpot(TM) Server VM (1.6.0_03-b05 mixed mode)
> # An error report file with more information is saved as hs_err_pid15614.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://java.sun.com/webapps/bugreport/crash.jsp
> #
> /bin/sh: line 1: 15614 Aborted ./hdfsbench -a w /tmp/test/txt -t 2
>
> and the bt output:
>
> (gdb) bt full
> #0  0x061707b8 in ChunkPool::allocate () from /usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
> No symbol table info available.
> #1  0x061703a6 in Arena::Arena () from /usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
> No symbol table info available.
> #2  0x06501729 in Thread::Thread () from /usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
> No symbol table info available.
> #3  0x06503144 in JavaThread::JavaThread () from /usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
> No symbol table info available.
> #4  0x062f006e in attach_current_thread () from /usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
> No symbol table info available.
> #5  0x062eee08 in jni_AttachCurrentThread () from /usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
> No symbol table info available.
> #6  0xb7f1a4bd in getJNIEnv () at hdfsJniHelper.c:347
>         vm = (JavaVM *) 0x65ccfcc
>         vmBuf = 0xb74a0270
>         env = (JNIEnv *) 0xb7eb73e2
>         rv = 106745804
>         noVMs = 1
> #7  0xb7f17143 in hdfsConnect (host=0x804aac2 "default", port=0) at hdfs.c:119
>         env = (JNIEnv *) 0x0
>         jConfiguration = (jobject) 0x0
>         jFS = (jobject) 0x0
>         jURI = (jobject) 0x0
>         jURIString = (jstring) 0x0
>         jVal = {z = 0 '\0', b = 0 '\0', c = 0, s = 0, i = 0, j = 0, f = 0, d = 0, l = 0x0}
>         cURI = 0x0
>         gFsRef = (jobject) 0x0
> #8  0x0804904b in worker_thread_w (arg=0x1) at hdfsbench.cpp:192
>         idx = 1
>         path = {static npos = 4294967295, _M_dataplus = {> = {<__gnu_cxx::new_allocator> = {}, fields>}, _M_p = 0x804e484 "/tmp/test/txt1"}}
>         fs = (hdfsFS) 0x0
>         writeFile = (hdfsFile) 0x0
>         i = 0
>         bwrite = 0
>         boffset = 0
>         tv_start = {tv_sec = 0, tv_usec = 0}
>         tv_end = {tv_sec = 0, tv_usec = 0}
> #9  0xb7f2246b in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
> No symbol table info available.
> #10 0xb7da06de in clone () from /lib/tls/i686/cmov/libc.so.6
>
> --
> Yingyuan Cheng
Re: DFS get blocked when writing a file.
Thanks very much for the help. I will investigate that further.

Iván

On Mon, 2008-03-31 at 11:11 -0700, Raghu Angadi wrote:
> Iván,
>
> Whether this was expected or an error depends on what happened on the
> client. This could happen and would not be a bug if, for example, the
> client was killed for some other reason. But if the client is also
> similarly surprised, then it's a different case.
>
> You could grep for this block in the NameNode log and on the client. If
> you are still interested in looking into this, I would suggest opening a
> JIRA.
>
> Raghu.
>
> Iván de Prado wrote:
> > Thanks,
> >
> > I have tried with the trunk version, and now the exception "Trying to
> > change block file offset of block blk_... to ... but actual size of file
> > is ..." has disappeared and the jobs don't seem to get blocked.
> >
> > But I see other "Broken pipe" and "EOF" exceptions in the DFS logs.
> > They seem similar to the https://issues.apache.org/jira/browse/HADOOP-2042
> > ticket. The jobs end, but I'm not sure whether they run smoothly. Are
> > these exceptions normal? As an example, the exceptions for block
> > 6801211507359331627 appear on two nodes (I have a replication factor of
> > 2) and look like:
> >
> > hn2: 2008-03-31 05:03:13,736 INFO org.apache.hadoop.dfs.DataNode:
> > Datanode 0 forwarding connect ack to upstream firstbadlink is
> > hn2: 2008-03-31 05:03:14,507 INFO org.apache.hadoop.dfs.DataNode:
> > Receiving block blk_6801211507359331627 src: /172.16.3.6:38218
> > dest: /172.16.3.6:50010
> >
> > hn2: 2008-03-31 05:04:14,528 INFO org.apache.hadoop.dfs.DataNode:
> > Exception in receiveBlock for block blk_6801211507359331627
> > java.io.EOFException
> > hn2: 2008-03-31 05:04:14,528 INFO org.apache.hadoop.dfs.DataNode:
> > PacketResponder 0 for block blk_6801211507359331627 Interrupted.
> > hn2: 2008-03-31 05:04:14,528 INFO org.apache.hadoop.dfs.DataNode:
> > PacketResponder 0 for block blk_6801211507359331627 terminating
> > hn2: 2008-03-31 05:04:14,530 INFO org.apache.hadoop.dfs.DataNode:
> > writeBlock blk_6801211507359331627 received exception
> > java.io.EOFException
> > hn2: 2008-03-31 05:04:14,530 ERROR org.apache.hadoop.dfs.DataNode:
> > 172.16.3.4:50010:DataXceiver: java.io.EOFException
> > hn2:    at java.io.DataInputStream.readInt(DataInputStream.java:375)
> > hn2:    at org.apache.hadoop.dfs.DataNode
> > $BlockReceiver.receiveBlock(DataNode.java:2243)
> > hn2:    at org.apache.hadoop.dfs.DataNode
> > $DataXceiver.writeBlock(DataNode.java:1157)
> > hn2:    at org.apache.hadoop.dfs.DataNode
> > $DataXceiver.run(DataNode.java:938)
> > hn2:    at java.lang.Thread.run(Thread.java:619)
> >
> > hn4: 2008-03-31 05:03:13,590 INFO org.apache.hadoop.dfs.DataNode:
> > Datanode 0 forwarding connect ack to upstream firstbadlink is
> > hn4: 2008-03-31 05:03:14,506 INFO org.apache.hadoop.dfs.DataNode:
> > Receiving block blk_6801211507359331627 src: /172.16.3.6:41112
> > dest: /172.16.3.6:50010
> >
> > hn4: 2008-03-31 05:03:26,825 INFO org.apache.hadoop.dfs.DataNode:
> > Exception in receiveBlock for block blk_6801211507359331627
> > java.io.EOFException
> >
> > hn4: 2008-03-31 05:04:14,524 INFO org.apache.hadoop.dfs.DataNode:
> > PacketResponder blk_6801211507359331627 1 Exception
> > java.net.SocketException: Broken pipe
> > hn4:    at java.net.SocketOutputStream.socketWrite0(Native Method)
> > hn4:    at
> > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
> > hn4:    at
> > java.net.SocketOutputStream.write(SocketOutputStream.java:136)
> > hn4:    at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
> > hn4:    at org.apache.hadoop.dfs.DataNode
> > $PacketResponder.run(DataNode.java:1825)
> > hn4:    at java.lang.Thread.run(Thread.java:619)
> > hn4:
> > hn4: 2008-03-31 05:04:14,525 INFO org.apache.hadoop.dfs.DataNode:
> > PacketResponder 1 for block blk_6801211507359331627 terminating
> > hn4: 2008-03-31 05:04:14,525 INFO org.apache.hadoop.dfs.DataNode:
> > writeBlock blk_6801211507359331627 received exception
> > java.io.EOFException
> > hn4: 2008-03-31 05:04:14,526 ERROR org.apache.hadoop.dfs.DataNode:
> > 172.16.3.6:50010:DataXceiver: java.io.EOFException
> > hn4:    at java.io.DataInputStream.readInt(DataInputStream.java:375)
> > hn4:    at org.apache.hadoop.dfs.DataNode
> > $BlockReceiver.receiveBlock(DataNode.java:2243)
> > hn4:    at org.apache.hadoop.dfs.DataNode
> > $DataXceiver.writeBlock(DataNode.java:1157)
> > hn4:    at org.apache.hadoop.dfs.DataNode
> > $DataXceiver.run(DataNode.java:938)
> > hn4:    at java.lang.Thread.run(Thread.java:619)
> > hn4:
> >
> > Many thanks,
> >
> > Iván de Prado Alonso
> > http://ivandeprado.blogspot.com/
Re: What happens if a namenode fails?
Hello, see this URL, which might help you with your query:

http://www.nabble.com/Namenode-cluster-and-fail-over-td15903856.html

---
Peeyush

On Tue, 2008-04-01 at 14:44 -0700, Xavier Stevens wrote:
> What happens to your data if the namenode fails (hardware failure)?
> Assuming you replace it with a fresh box, can you restore all of your
> data from the slaves?
>
> -Xavier
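Some context for the question: in Hadoop of this vintage the NameNode is a single point of failure. The datanodes ("slaves") hold only block data, not the namespace, so if the NameNode's metadata (filesystem image and edit log) is lost, the blocks cannot be reassembled into files. A commonly suggested mitigation is to have the NameNode write its metadata to several directories at once, one of them on a remote NFS mount, via the `dfs.name.dir` property (which accepts a comma-separated list). A sketch with hypothetical paths:

```xml
<!-- hadoop-site.xml sketch: mirror NameNode metadata to more than one
     directory. The paths are hypothetical; /mnt/nfs/... is assumed to be
     a remote NFS mount so one copy survives NameNode hardware failure. -->
<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/dfs/name,/mnt/nfs/hadoop/dfs/name</value>
</property>
```

If the NameNode box then dies, a replacement can be started against the surviving metadata copy, and the SecondaryNameNode's periodic checkpoints provide a further (slightly stale) fallback.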
Re: Nutch and Distributed Lucene
Hi Ning, Thanks a lot ! Naama On Tue, Apr 1, 2008 at 7:06 PM, Ning Li <[EMAIL PROTECTED]> wrote: > Hi, > > Nutch builds Lucene indexes. But Nutch is much more than that. It is a > web search application software that crawls the web, inverts links and > builds indexes. Each step is one or more Map/Reduce jobs. You can find > more information at http://lucene.apache.org/nutch/ > > The Map/Reduce job to build Lucene indexes in Nutch is customized to > the data schema/structures used in Nutch. The index contrib package in > Hadoop provides a general/configurable process to build Lucene indexes > in parallel using a Map/Reduce job. That's the main difference. There > is also the difference that the index build job in Nutch builds > indexes in reduce tasks, while the index contrib package builds > indexes in both map and reduce tasks and there are advantages in doing > that... > > Regards, > Ning > > > On 4/1/08, Naama Kraus <[EMAIL PROTECTED]> wrote: > > Hi, > > > > I'd like to know if Nutch is running on top of Lucene, or is it non > related > > to Lucene. I.e. indexing, parsing, crawling, internal data structures > ... - > > all written from scratch using MapReduce (my impression) ? > > > > What is the relation between Nutch and the distributed Lucene patch that > was > > inserted lately into Hadoop ? > > > > Thanks for any enlightening, > > Naama > > > > -- > > oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 > oo > > 00 oo 00 oo > > "If you want your children to be intelligent, read them fairy tales. If > you > > want them to be more intelligent, read them more fairy tales." (Albert > > Einstein) > > > -- oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo "If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales." (Albert Einstein)
Help: libhdfs SIGSEGV
Hello.

Is libhdfs thread-safe? I can run a single thread reading/writing HDFS through libhdfs fine, but when I increase the number of threads to 2 or more, I receive a SIGSEGV error:

#
# An unexpected error has been detected by Java Runtime Environment:
#
# Internal Error (53484152454432554E54494D450E4350500214), pid=15614, tid=1080834960
#
# Java VM: Java HotSpot(TM) Server VM (1.6.0_03-b05 mixed mode)
# An error report file with more information is saved as hs_err_pid15614.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#
/bin/sh: line 1: 15614 Aborted ./hdfsbench -a w /tmp/test/txt -t 2

and the bt output:

(gdb) bt full
#0  0x061707b8 in ChunkPool::allocate () from /usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
No symbol table info available.
#1  0x061703a6 in Arena::Arena () from /usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
No symbol table info available.
#2  0x06501729 in Thread::Thread () from /usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
No symbol table info available.
#3  0x06503144 in JavaThread::JavaThread () from /usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
No symbol table info available.
#4  0x062f006e in attach_current_thread () from /usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
No symbol table info available.
#5  0x062eee08 in jni_AttachCurrentThread () from /usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
No symbol table info available.
#6  0xb7f1a4bd in getJNIEnv () at hdfsJniHelper.c:347
        vm = (JavaVM *) 0x65ccfcc
        vmBuf = 0xb74a0270
        env = (JNIEnv *) 0xb7eb73e2
        rv = 106745804
        noVMs = 1
#7  0xb7f17143 in hdfsConnect (host=0x804aac2 "default", port=0) at hdfs.c:119
        env = (JNIEnv *) 0x0
        jConfiguration = (jobject) 0x0
        jFS = (jobject) 0x0
        jURI = (jobject) 0x0
        jURIString = (jstring) 0x0
        jVal = {z = 0 '\0', b = 0 '\0', c = 0, s = 0, i = 0, j = 0, f = 0, d = 0, l = 0x0}
        cURI = 0x0
        gFsRef = (jobject) 0x0
#8  0x0804904b in worker_thread_w (arg=0x1) at hdfsbench.cpp:192
        idx = 1
        path = {static npos = 4294967295, _M_dataplus = {> = {<__gnu_cxx::new_allocator> = {}, }, _M_p = 0x804e484 "/tmp/test/txt1"}}
        fs = (hdfsFS) 0x0
        writeFile = (hdfsFile) 0x0
        i = 0
        bwrite = 0
        boffset = 0
        tv_start = {tv_sec = 0, tv_usec = 0}
        tv_end = {tv_sec = 0, tv_usec = 0}
#9  0xb7f2246b in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
No symbol table info available.
#10 0xb7da06de in clone () from /lib/tls/i686/cmov/libc.so.6

--
Yingyuan Cheng
RE: one key per output part file
Curious: why do we need a file per XXX?

- If further processing is going to be done in Hadoop itself, then it's hard to see a reason. One can always have multiple entries in the same HDFS file. Note that it's possible to align map task splits on sort-key boundaries in pre-sorted data (it's not something that Hadoop supports natively right now, but you can write your own InputFormat to do this). This means that subsequent processing that wants all entries corresponding to XXX in one group (as in a reducer) can do so in the map phase itself (i.e., it's very cheap and doesn't require sorting the data all over again).

- If the data needs to be exported (either to a SQL database or an external file system), then why not do so directly from the reducer, instead of creating these intermediate small files in HDFS? Data can be written to temporary tables/files that are overwritten if the reducer re-runs (and then committed to the final location once the job is complete).

-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Ashish Venugopal
Sent: Tue 4/1/2008 6:42 PM
To: core-user@hadoop.apache.org
Subject: Re: one key per output part file

This seems like a reasonable solution, but I am using Hadoop streaming and my reducer is a Perl script. Is it possible to handle side-effect files in streaming? I haven't found anything that indicates that you can...

Ashish

On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> Try opening the desired output file in the reduce method. Make sure that
> the output files are relative to the correct task-specific directory (look
> for side-effect files on the wiki).
>
> On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
>
> > Hi, I am using Hadoop streaming and I am trying to create a MapReduce
> > job that will generate output where a single key is found in a single
> > output part file.
> > Does anyone know how to ensure this condition? I want the reduce task
> > (no matter how many are specified) to receive key-value output from a
> > single key each, process the key-value pairs for this key, write an
> > output part-XXX file, and only then process the next key.
> >
> > Here is the task that I am trying to accomplish:
> >
> > Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> > Output: Each part-XXX should contain the lines of T that contain the
> > word from line XXX in V.
> >
> > Any help/ideas are appreciated.
> >
> > Ashish