Re: Is it possible in Hadoop to overwrite or update a file?

2008-04-02 Thread Ted Dunning

You can overwrite it, but you can't update it.  Soon you will be able to
append to it, but you won't be able to do any other updates.
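
A rough sketch of what "overwrite" amounts to in practice (not from the
original reply), using the libhdfs C API that shows up later in this digest;
the helper name is made up and the exact signatures are assumed from that
era's hdfs.h. There is no in-place update, so the old file is deleted and a
new one is written under the same path:

---
#include <fcntl.h>      /* O_WRONLY */
#include "hdfs.h"

/* Replace an existing HDFS file wholesale: delete the old copy, then
 * create and write a new file under the same path.  Error handling is
 * trimmed for brevity. */
int overwrite_file(const char *path, const char *data, int len)
{
    hdfsFS fs = hdfsConnect("default", 0);
    if (!fs) return -1;

    hdfsDelete(fs, path);                     /* ignore failure if absent */
    hdfsFile out = hdfsOpenFile(fs, path, O_WRONLY, 0, 0, 0);
    if (!out) { hdfsDisconnect(fs); return -1; }

    hdfsWrite(fs, out, (void *)data, len);
    hdfsCloseFile(fs, out);
    hdfsDisconnect(fs);
    return 0;
}
---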


On 4/2/08 11:39 PM, "Garri Santos" <[EMAIL PROTECTED]> wrote:

> Hi!
> 
> I'm starting to take a look at Hadoop and the whole HDFS idea. I'm wondering
> if it's just fine to update or overwrite a file copied to Hadoop?
> 
> 
> Thanks,
> Garri



Re: Is it possible in Hadoop to overwrite or update a file?

2008-04-02 Thread Owen O'Malley


On Apr 2, 2008, at 11:39 PM, Garri Santos wrote:


Hi!

I'm starting to take a look at Hadoop and the whole HDFS idea. I'm wondering
if it's just fine to update or overwrite a file copied to Hadoop?


No, although we are making progress on HADOOP-1700, which would allow
appending to files.


-- Owen


Is it possible in Hadoop to overwrite or update a file?

2008-04-02 Thread Garri Santos
Hi!

I'm starting to take a look at Hadoop and the whole HDFS idea. I'm wondering
if it's just fine to update or overwrite a file copied to Hadoop?


Thanks,
Garri


Re: Help: libhdfs SIGSEGV

2008-04-02 Thread Yingyuan Cheng
Hello Christian.

As you said, it does work. Thank you very much.

Could you explain a bit further why that works?

--

Yingyuan


Christian Kunz wrote:
> Hi Yingyuan,
>
> Did you try to connect to HDFS in the main function before spawning off the
> threads, and using the same handle in all the threads?
>
> -Christian
>
>


Re: Help: libhdfs SIGSEGV

2008-04-02 Thread Christian Kunz
Hi Yingyuan,

Did you try to connect to HDFS in the main function before spawning off the
threads, and using the same handle in all the threads?

-Christian
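
A rough sketch of that pattern (not from the original message; the worker
details are made up for illustration, but the libhdfs calls mirror the
hdfsbench.cpp code quoted below): connect once in main() and hand the same
hdfsFS handle to every thread, rather than calling hdfsConnect() concurrently
from each one.

---
#include <pthread.h>
#include <stdio.h>
#include <fcntl.h>
#include "hdfs.h"

/* Each worker shares the connection created in main() instead of opening
 * its own; only the per-thread output file is private. */
struct worker_arg {
    hdfsFS fs;     /* shared HDFS handle */
    int    idx;    /* worker index, used to pick a file name */
};

static void *worker(void *p)
{
    struct worker_arg *a = (struct worker_arg *)p;
    char path[64];
    snprintf(path, sizeof(path), "/tmp/test/txt%d", a->idx);

    hdfsFile f = hdfsOpenFile(a->fs, path, O_WRONLY, 0, 0, 0);
    if (!f) return NULL;

    const char msg[] = "hello from a worker\n";
    hdfsWrite(a->fs, f, (void *)msg, sizeof(msg) - 1);
    hdfsCloseFile(a->fs, f);
    return NULL;
}

int main(void)
{
    hdfsFS fs = hdfsConnect("default", 0);   /* connect once, up front */
    if (!fs) return 1;

    pthread_t tid[2];
    struct worker_arg args[2];
    int i;

    for (i = 0; i < 2; i++) {
        args[i].fs = fs;
        args[i].idx = i;
        pthread_create(&tid[i], NULL, worker, &args[i]);
    }
    for (i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);

    hdfsDisconnect(fs);
    return 0;
}
---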


On 4/2/08 6:21 PM, "Yingyuan Cheng" <[EMAIL PROTECTED]> wrote:

> Hello Arun and all.
> 
> I generated a data block in my main function, then spawned serveral
> threads to write this block serveral times to different files in hdfs,
> my thread function as following:
> 
> ---
> // File: hdfsbench.cpp
> 
> void* worker_thread_w(void *arg)
> {
> int idx = (int)arg;
> std::string path = g_config.path_prefix;
> string_append(path, "%d", idx);
> 
> hdfsFS fs = hdfsConnect("default", 0);
> if(!fs) {
> PERRORL(ERROR, "Thread %d failed to connect to hdfs!\n", idx);
> g_workers->at(idx).status = TS_FAIL;
> return (void*)false;
> }
> 
> hdfsFile writeFile = hdfsOpenFile(fs, path.c_str(), O_WRONLY, 0, 0, 0);
> if(!writeFile) {
> PERRORL(ERROR, "Thread %d failed to open %s for writing!\n", idx,
> path.c_str());
> g_workers->at(idx).status = TS_FAIL;
> //hdfsDisconnect(fs);
> return (void*)false;
> }
> 
> int i;
> int bwrite, boffset;
> struct timeval tv_start, tv_end;
> 
> gettimeofday(&tv_start, NULL);
> 
> for (i = 0; i < g_config.num_blocks; i++) {
> boffset = 0;
> while (boffset < g_config.block_size) {
> bwrite = hdfsWrite(fs, writeFile, g_buffer->ptr + boffset,
> g_config.block_size - boffset);
> if (bwrite <0) {
> PERRORL(ERROR, "Thread %d failed when writing %s at block %d offset %d\n",
> idx, path.c_str(), i, boffset);
> g_workers->at(idx).status = TS_FAIL;
> i = g_config.num_blocks;
> break;
> }
> boffset += bwrite;
> g_workers->at(idx).nbytes += bwrite;
> }
> }
> 
> gettimeofday(&tv_end, NULL);
> hdfsCloseFile(fs, writeFile);
> //hdfsDisconnect(fs);
> g_workers->at(idx).elapsed = get_timeval_differ(tv_start, tv_end);
> 
> if (g_workers->at(idx).status == TS_UNSET ||
> g_workers->at(idx).status == TS_RUNNING) {
> g_workers->at(idx).status = TS_SUCCESS;
> }
> 
> return (void*)true;
> }
> 
> ---
> 
> And the following is my running script:
> 
> ---
> 
> #!/bin/bash
> #
> # File: hdfsbench.sh
> 
> HADOOP_HOME="/opt/hadoop"
> JAVA_HOME="/opt/java"
> 
> export CLASSPATH="$HADOOP_HOME/hadoop-0.16.1-core.jar:$HADOOP_HOME/lib/commons-logging-1.0.4.jar:$HADOOP_HOME/lib/log4j-1.2.13.jar:$HADOOP_HOME/conf"
> export LD_LIBRARY_PATH="$HADOOP_HOME/libhdfs:$JAVA_HOME/lib/i386/server"
> 
> ./hdfsbench $@
> 
> ---
> 
> 
> Yingyuan
> 
> Arun C Murthy wrote:
>> 
>> On Apr 2, 2008, at 1:36 AM, Yingyuan Cheng wrote:
>> 
>>> Hello.
>>> 
> >>> Is libhdfs thread-safe? I can run a single thread reading/writing HDFS
> >>> through libhdfs fine, but when increasing the number of threads to 2 or
> >>> above, I receive a SIGSEGV error:
>>> 
>> 
>> Could you explain a bit more? What are you doing in different threads?
>> Are you writing to the same file?
>> 
>> Arun
>> 
> 



Re: Help: libhdfs SIGSEGV

2008-04-02 Thread Yingyuan Cheng

Hello Arun and all.

I generated a data block in my main function, then spawned several
threads to write this block several times to different files in HDFS.
My thread function is as follows:


---
// File: hdfsbench.cpp

// Worker thread: each thread opens its own HDFS connection and writes
// g_config.num_blocks copies of the shared buffer to its own file.
void* worker_thread_w(void *arg)
{
    int idx = (int)arg;
    std::string path = g_config.path_prefix;
    string_append(path, "%d", idx);

    hdfsFS fs = hdfsConnect("default", 0);
    if (!fs) {
        PERRORL(ERROR, "Thread %d failed to connect to hdfs!\n", idx);
        g_workers->at(idx).status = TS_FAIL;
        return (void*)false;
    }

    hdfsFile writeFile = hdfsOpenFile(fs, path.c_str(), O_WRONLY, 0, 0, 0);
    if (!writeFile) {
        PERRORL(ERROR, "Thread %d failed to open %s for writing!\n", idx,
                path.c_str());
        g_workers->at(idx).status = TS_FAIL;
        //hdfsDisconnect(fs);
        return (void*)false;
    }

    int i;
    int bwrite, boffset;
    struct timeval tv_start, tv_end;

    gettimeofday(&tv_start, NULL);

    for (i = 0; i < g_config.num_blocks; i++) {
        boffset = 0;
        while (boffset < g_config.block_size) {
            bwrite = hdfsWrite(fs, writeFile, g_buffer->ptr + boffset,
                               g_config.block_size - boffset);
            if (bwrite < 0) {
                PERRORL(ERROR, "Thread %d failed when writing %s at block %d offset %d\n",
                        idx, path.c_str(), i, boffset);
                g_workers->at(idx).status = TS_FAIL;
                i = g_config.num_blocks;  // abort the outer loop as well
                break;
            }
            boffset += bwrite;
            g_workers->at(idx).nbytes += bwrite;
        }
    }

    gettimeofday(&tv_end, NULL);
    hdfsCloseFile(fs, writeFile);
    //hdfsDisconnect(fs);
    g_workers->at(idx).elapsed = get_timeval_differ(tv_start, tv_end);

    if (g_workers->at(idx).status == TS_UNSET ||
        g_workers->at(idx).status == TS_RUNNING) {
        g_workers->at(idx).status = TS_SUCCESS;
    }

    return (void*)true;
}

---

And the following is my running script:

---

#!/bin/bash
#
# File: hdfsbench.sh

HADOOP_HOME="/opt/hadoop"
JAVA_HOME="/opt/java"

export CLASSPATH="$HADOOP_HOME/hadoop-0.16.1-core.jar:$HADOOP_HOME/lib/commons-logging-1.0.4.jar:$HADOOP_HOME/lib/log4j-1.2.13.jar:$HADOOP_HOME/conf"

export LD_LIBRARY_PATH="$HADOOP_HOME/libhdfs:$JAVA_HOME/lib/i386/server"

./hdfsbench $@

---


Yingyuan

Arun C Murthy wrote:


On Apr 2, 2008, at 1:36 AM, Yingyuan Cheng wrote:


Hello.

Is libhdfs thread-safe? I can run a single thread reading/writing HDFS
through libhdfs fine, but when increasing the number of threads to 2 or
above, I receive a SIGSEGV error:



Could you explain a bit more? What are you doing in different threads? 
Are you writing to the same file?


Arun





Re: Test Data for Hadoop Student

2008-04-02 Thread Bruce Williams
My apologies to the list - the previous email was supposed to go to
Aaron Kimball.

Bruce


Re: Test Data for Hadoop Student

2008-04-02 Thread Bruce Williams
I met Christophe at the Hadoop Conference at Yahoo last week. I really
liked him. He asked me to maintain the Google Ubuntu Hadoop image, and I
sent him the note below about my project. Would you read it and offer
any comments?

I sent him the following:

"Can I tell you more about my Hadoop in education project?

My project started when I found out  Amazon.com will ( who would have
thought ) will let you rent their computers by the hour.  I realized
this would be an ideal way for small schools ( really ALL schools -
even MIT/Berkley has a hard time coming up with a hundred idle
computers  for  a student to use )  to have  access to the  resources
 to  expose  students  ( not just Computer  Science, but Physics,
Astronomy, Biology, etc ) to working  in this  environment.  That is
what my independent study project  is about.  I am producing a student
client workstation image and a department  server image with
everything needed to teach the a course and hookup to Amazon. The
course ware I am getting from the University of Washington and
documentation from all over. I first emailed the hadoop list and Aaron
Kimball responded and offered his courseware and to "highly endorse
your Amazon EC2 idea for doing your labs" . I would love any resources
you can point me to.  The economic model is the students don't need to
buy a textbook and use the money to buy computer time from Amazon. I
have already gotten an agency of the state of California  to fund  my
computer time as a student, so that is a good precedent.

I am getting weather data and an application to process it with snazzy
graphics as a student project. I hope to add more as time goes on. I
know I can be accused of chasing the buzzword of the moment, but I hope to put
software on the server image to provide a link to other people using
the server and to form a community. The more communication, the faster
things will happen. So the model is more "seeding" than "doing".

This is an example of what I am putting into this. This needs to work
"on its own", with common problems pre-solved, without a lot of case by
case work at each institution. I talked to Jinesh Varia from Amazon,
and I may be able to get them to design a custom product for
education, accounting and billing that works with my server to make it
simple and secure for the students and teacher to use. If a teacher
has to deal with money and billing, it will fail. If you require a
school's bureaucracy to handle something new, it will be a harder
sell. Schools supply student course needs through the school
bookstore. Who is the approved vendor to the school bookstore, Amazon?
More case by case work. And then they want to mark it up 50%-100%.
No, no, no. So a student logs into my server and buys time like they
would anything else on the Internet, but bookkeeping and usage records
are kept for the teacher. Simple. No extra work in this area to offer
the course.

When I met you, I felt I had met someone who thought exactly like me
about how important it is to move people along - not just in CS, but in
other areas - to take advantage of the potential of this technology. I
want the guy off in a corner somewhere, who has a crazy idea that
deserves a Nobel Prize, to have what it takes to succeed. The elite
should be elite based on worth, not on access to elite-level resources,
as much as can be made possible. "

I would appreciate your comments, and if you like the ideas, any
support you could give yourself and any encouragement you could give
Christophe to support this would be appreciated.

BTW - my personal email is [EMAIL PROTECTED] electricranch -
"herds of CPU's". I use the gmail address for lists that may expose me
to spam.

Bruce



On Fri, Nov 16, 2007 at 7:21 PM, Aaron Kimball <[EMAIL PROTECTED]> wrote:
> Bruce,
>
> I helped design and teach an undergrad course based on Hadoop last year.
> Along with some folks at Google, we then made the resources available
> together to distribute to other universities and the public at large (via
> Creative Commons license, actually).
>
> All the materials are available online here:
> http://code.google.com/edu/content/parallel.html
> (lecture notes, labs, and even video lectures.)
>
> It includes suggested lab activities. Good free data sets you can download
> include Netflix prize data and a copy of the wikipedia corpus. Of course,
> you can set up Nutch and do your own web crawl too.
>
> We also highly endorse the Amazon EC2 idea for doing your own labs :)
>
> Best of luck,
> - Aaron
>
>
>
>
>
> Edward Bruce Williams wrote:
> > Hello
> >
> >
> > I am a student doing an independent study project investigating the
> > possibility of teaching large scale computing on a small scale budget.  Th
> >
> >
> > My thought is to use available Open Source ( Hadoop) and Creative Commons
> > and other materials as the text.  A student could then do significant
> > computing on Amazon for the cost of what they would usually pay for a
> > textbook.  I have co

secondary namenode web interface

2008-04-02 Thread Yuri Pradkin
Hi,

I'm running Hadoop (latest snapshot) on several machines, and in our setup the
namenode and secondary namenode are on different systems.  I see from the logs that
the secondary namenode regularly checkpoints the filesystem from the primary namenode.

But when I go to the secondary namenode HTTP (dfs.secondary.http.address) in
my browser I see something like this:

HTTP ERROR: 500
init
RequestURI=/dfshealth.jsp
Powered by Jetty://

And in secondary's log I find these lines:

2008-04-02 11:26:25,357 WARN /: /dfshealth.jsp:
java.lang.NullPointerException
at org.apache.hadoop.dfs.dfshealth_jsp.<init>(dfshealth_jsp.java:21)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:539)
at java.lang.Class.newInstance0(Class.java:373)
at java.lang.Class.newInstance(Class.java:326)
at org.mortbay.jetty.servlet.Holder.newInstance(Holder.java:199)
at 
org.mortbay.jetty.servlet.ServletHolder.getServlet(ServletHolder.java:326)
at 
org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:405)
at 
org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
at 
org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
at org.mortbay.http.HttpServer.service(HttpServer.java:954)
at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
at 
org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)

Is something missing from my configuration?  Has anybody else seen these errors?

Thanks,

  -Yuri


Re: distcp fails :Input source not found

2008-04-02 Thread s29752-hadoopuser
It might be a bug.  Could you try the following?
bin/hadoop fs -ls s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml

Nicholas


- Original Message 
From: Prasan Ary <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Wednesday, April 2, 2008 7:41:50 AM
Subject: Re: distcp fails :Input source not found

Anybody? Any thoughts on why this might be happening?

  Here is what is happening directly from the EC2 screen. The ID and
 Secret Key are the only things changed.

  I'm running Hadoop 0.15.3 from the public AMI. I launched a 2-machine
 cluster using the EC2 scripts in src/contrib/ec2/bin . . .

The file I try and copy is 9KB (I noticed previous discussion on
 empty files and files that are > 10MB)
   
  > First I make sure that we can copy the file from s3
  [EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop fs
 -copyToLocal s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml
 /usr/InputFileFormat.xml
 
   > Now I see that the file is copied to the ec2 master (where I'm
 logged in)
  [EMAIL PROTECTED] hadoop-0.15.3]# dir /usr/Input*
  /usr/InputFileFormat.xml
   
  > Next I make sure I can access the HDFS and that the input
 directory is there
  [EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop fs -ls /
  Found 2 items
  /input  2008-04-01 15:45
  /mnt  2008-04-01 15:42
  [EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop fs -ls
 /input/
  Found 0 items
   
  > I make sure hadoop is running just fine by running an example
  [EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop jar
 hadoop-0.15.3-examples.jar pi 10 1000
  Number of Maps = 10 Samples per Map = 1000
  Wrote input for Map #0
  Wrote input for Map #1
  Wrote input for Map #2
  Wrote input for Map #3
  Wrote input for Map #4
  Wrote input for Map #5
  Wrote input for Map #6
  Wrote input for Map #7
  Wrote input for Map #8
  Wrote input for Map #9
  Starting Job
  08/04/01 17:38:14 INFO mapred.FileInputFormat: Total input paths to
 process : 10
  08/04/01 17:38:14 INFO mapred.JobClient: Running job:
 job_200804011542_0001
  08/04/01 17:38:15 INFO mapred.JobClient: map 0% reduce 0%
  08/04/01 17:38:22 INFO mapred.JobClient: map 20% reduce 0%
  08/04/01 17:38:24 INFO mapred.JobClient: map 30% reduce 0%
  08/04/01 17:38:25 INFO mapred.JobClient: map 40% reduce 0%
  08/04/01 17:38:27 INFO mapred.JobClient: map 50% reduce 0%
  08/04/01 17:38:28 INFO mapred.JobClient: map 60% reduce 0%
  08/04/01 17:38:31 INFO mapred.JobClient: map 80% reduce 0%
  08/04/01 17:38:33 INFO mapred.JobClient: map 90% reduce 0%
  08/04/01 17:38:34 INFO mapred.JobClient: map 100% reduce 0%
  08/04/01 17:38:43 INFO mapred.JobClient: map 100% reduce 20%
  08/04/01 17:38:44 INFO mapred.JobClient: map 100% reduce 100%
  08/04/01 17:38:45 INFO mapred.JobClient: Job complete:
 job_200804011542_0001
  08/04/01 17:38:45 INFO mapred.JobClient: Counters: 9
  08/04/01 17:38:45 INFO mapred.JobClient: Job Counters 
  08/04/01 17:38:45 INFO mapred.JobClient: Launched map tasks=10
  08/04/01 17:38:45 INFO mapred.JobClient: Launched reduce tasks=1
  08/04/01 17:38:45 INFO mapred.JobClient: Data-local map tasks=10
  08/04/01 17:38:45 INFO mapred.JobClient: Map-Reduce Framework
  08/04/01 17:38:45 INFO mapred.JobClient: Map input records=10
  08/04/01 17:38:45 INFO mapred.JobClient: Map output records=20
  08/04/01 17:38:45 INFO mapred.JobClient: Map input bytes=240
  08/04/01 17:38:45 INFO mapred.JobClient: Map output bytes=320
  08/04/01 17:38:45 INFO mapred.JobClient: Reduce input groups=2
  08/04/01 17:38:45 INFO mapred.JobClient: Reduce input records=20
  Job Finished in 31.028 seconds
  Estimated value of PI is 3.1556
   
  > Finally, I try and copy the file over
  [EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop distcp
 s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml
 /input/InputFileFormat.xml
  With failures, global counters are inaccurate; consider running with
 -i
  Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input
 source s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml does not
 exist.
  at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:470)
  at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:550)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
  at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:563)


   




Error msg: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration

2008-04-02 Thread Anisio Mendes Lacerda
Hi,

My colleagues and I are implementing a small search engine in my university
laboratory, and we would like to use Hadoop as the file system.

For now we are having trouble running the following simple code example:


#include "hdfs.h"
int main(int argc, char **argv) {
hdfsFS fs = hdfsConnect("apolo.latin.dcc.ufmg.br", 51070);
if(!fs) {
fprintf(stderr, "Oops! Failed to connect to hdfs!\n");
exit(-1);
}
int result = hdfsDisconnect(fs);
if(!result) {
fprintf(stderr, "Oops! Failed to connect to hdfs!\n");
exit(-1);
}
}


The error msg is:

Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/hadoop/conf/Configuration



We configured the following enviroment variables:

export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun-1.5.0.13
export OS_NAME=linux
export OS_ARCH=i386
export LIBHDFS_BUILD_DIR=/mnt/hd1/hadoop/hadoop-0.14.4/libhdfs
export SHLIB_VERSION=1

export HADOOP_HOME=/mnt/hd1/hadoop/hadoop-0.14.4
export HADOOP_CONF_DIR=/mnt/hd1/hadoop/hadoop-0.14.4/conf
export HADOOP_LOG_DIR=/mnt/hd1/hadoop/hadoop-0.14.4/logs

The following commands were used to compile the codes:

In directory: hadoop-0.14.4/src/c++/libhdfs

1 - make all
2 - gcc -I/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/include/ -c my_hdfs_test.c
3 - gcc my_hdfs_test.o -I/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/include/
-L/mnt/hd1/hadoop/hadoop-0.14.4/libhdfs -lhdfs
-L/usr/lib/jvm/java-1.5.0-sun-1.5.0.13/jre/lib/i386/server -ljvm  -o
my_hdfs_test

Note:

The Hadoop file system seems to be OK, since we can run commands like this:

[EMAIL PROTECTED]:/mnt/hd1/hadoop/hadoop-0.14.4$ bin/hadoop dfs -ls
Found 0 items






-- 
[]s

Anisio Mendes Lacerda


Re: Hadoop Port Configuration

2008-04-02 Thread Liu Yan
hi,

If your master and slave are two different boxes, don't use 127.0.0.1
as the address. Use something in your LAN, e.g., 192.168.x.x,
10.x.x.x, etc.

HTH,
Yan

2008/4/1, Natarajan, Senthil <[EMAIL PROTECTED]>:
> Hi,
>
>  I am using default settings from hadoop-default.xml and hadoop-site.xml
>
>  And I just changed this port number
>
>  mapred.task.tracker.report.address
>
>   
>
>
>
>  I created the firewall rule to allow port range 5:50100 between the 
> slaves and master.
>
>
>
>  But reduce on the slaves seems to be using some other ports, so reduce always
> hangs with the firewall enabled. If I disable the firewall, it works fine.
>
>  Could you please let me know what I am missing, or where to control
> Hadoop's random port selection?
>
>
>
>  Thanks,
>
> Senthil
>


Re: one key per output part file

2008-04-02 Thread Ashish Venugopal
Thanks for this information - I might be missing something here, but can my
perl script reducer (which is run via streaming, and is not linked to the HDFS
libraries) just start writing to HDFS?
I thought I would have to write it locally, i.e. in "." for the reduce script,
and then rely on the MapReduce mechanism to promote the file into the output
directory...
Thanks for all the help!

Ashish



On Wed, Apr 2, 2008 at 11:22 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>
>
> Writing to HDFS leaves the files as accessible as anything else, if not
> more
> so.
>
> You can retrieve a file using a URL of the form:
>
>  http://<namenode:port>/data/<path-in-hdfs>
>
> Similarly, you can list a directory using a similar URL (whose details I
> forget for the nonce).
>
> On 4/2/08 7:57 AM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
>
> > On Wed, Apr 2, 2008 at 3:36 AM, Joydeep Sen Sarma <[EMAIL PROTECTED]>
> > wrote:
> >
> >> curious - why do we need a file per XXX?
> >>
> >> - if the data needs to be exported (either to a sql db or an external
> file
> >> system) - then why not do so directly from the reducer (instead of
> trying to
> >> create these intermediate small files in hdfs)? data can be written to
> tmp
> >> tables/files and can be overwritten in case the reducer re-runs (and
> then
> >> committed to final location once the job is complete)
> >>
> >
> > The second case (data needs to be exported) is the reason that I have.
> Each
> > of these small files is used in an external process. This seems like a
> good
> > solution - only question then is where can these files be written to
> safely?
> > Local directory? /tmp?
> >
> > Ashish
> >
> >
> >
> >>
> >>
> >>
> >> -Original Message-
> >> From: [EMAIL PROTECTED] on behalf of Ashish Venugopal
> >> Sent: Tue 4/1/2008 6:42 PM
> >> To: core-user@hadoop.apache.org
> >> Subject: Re: one key per output part file
> >>
> >> This seems like a reasonable solution - but I am using Hadoop streaming
> >> and
> >> byreducer is a perl script. Is it possible to handle side-effect files
> in
> >> streaming? I havent found
> >> anything that indicates that you can...
> >>
> >> Ashish
> >>
> >> On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >>
> >>>
> >>>
> >>> Try opening the desired output file in the reduce method.  Make sure
> >> that
> >>> the output files are relative to the correct task specific directory
> >> (look
> >>> for side-effect files on the wiki).
> >>>
> >>>
> >>>
> >>> On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
> >>>
>  Hi, I am using Hadoop streaming and I am trying to create a MapReduce
> >>> that
>  will generate output where a single key is found in a single output
> >> part
>  file.
>  Does anyone know how to ensure this condition? I want the reduce task
> >>> (no
>  matter how many are specified), to only receive
>  key-value output from a single key each, process the key-value pairs
> >> for
>  this key, write an output part-XXX file, and only
>  then process the next key.
> 
>  Here is the task that I am trying to accomplish:
> 
>  Input: Corpus T (lines of text), Corpus V (each line has 1 word)
>  Output: Each part-XXX should contain the lines of T that contain the
> >>> word
>  from line XXX in V.
> 
>  Any help/ideas are appreciated.
> 
>  Ashish
> >>>
> >>>
> >>
> >>
>
>


Re: one key per output part file

2008-04-02 Thread Ted Dunning


Writing to HDFS leaves the files as accessible as anything else, if not more
so.

You can retrieve a file using a URL of the form:

  http://<namenode:port>/data/<path-in-hdfs>

Similarly, you can list a directory using a similar URL (whose details I
forget for the nonce).

On 4/2/08 7:57 AM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:

> On Wed, Apr 2, 2008 at 3:36 AM, Joydeep Sen Sarma <[EMAIL PROTECTED]>
> wrote:
> 
>> curious - why do we need a file per XXX?
>> 
>> - if the data needs to be exported (either to a sql db or an external file
>> system) - then why not do so directly from the reducer (instead of trying to
>> create these intermediate small files in hdfs)? data can be written to tmp
>> tables/files and can be overwritten in case the reducer re-runs (and then
>> committed to final location once the job is complete)
>> 
> 
> The second case (data needs to be exported) is the reason that I have. Each
> of these small files is used in an external process. This seems like a good
> solution - only question then is where can these files be written to safely?
> Local directory? /tmp?
> 
> Ashish
> 
> 
> 
>> 
>> 
>> 
>> -Original Message-
>> From: [EMAIL PROTECTED] on behalf of Ashish Venugopal
>> Sent: Tue 4/1/2008 6:42 PM
>> To: core-user@hadoop.apache.org
>> Subject: Re: one key per output part file
>> 
>> This seems like a reasonable solution - but I am using Hadoop streaming
>> and
>> byreducer is a perl script. Is it possible to handle side-effect files in
>> streaming? I havent found
>> anything that indicates that you can...
>> 
>> Ashish
>> 
>> On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>> 
>>> 
>>> 
>>> Try opening the desired output file in the reduce method.  Make sure
>> that
>>> the output files are relative to the correct task specific directory
>> (look
>>> for side-effect files on the wiki).
>>> 
>>> 
>>> 
>>> On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
>>> 
 Hi, I am using Hadoop streaming and I am trying to create a MapReduce
>>> that
 will generate output where a single key is found in a single output
>> part
 file.
 Does anyone know how to ensure this condition? I want the reduce task
>>> (no
 matter how many are specified), to only receive
 key-value output from a single key each, process the key-value pairs
>> for
 this key, write an output part-XXX file, and only
 then process the next key.
 
 Here is the task that I am trying to accomplish:
 
 Input: Corpus T (lines of text), Corpus V (each line has 1 word)
 Output: Each part-XXX should contain the lines of T that contain the
>>> word
 from line XXX in V.
 
 Any help/ideas are appreciated.
 
 Ashish
>>> 
>>> 
>> 
>> 



Re: one key per output part file

2008-04-02 Thread Ashish Venugopal
On Wed, Apr 2, 2008 at 3:36 AM, Joydeep Sen Sarma <[EMAIL PROTECTED]>
wrote:

> curious - why do we need a file per XXX?
>
> - if the data needs to be exported (either to a sql db or an external file
> system) - then why not do so directly from the reducer (instead of trying to
> create these intermediate small files in hdfs)? data can be written to tmp
> tables/files and can be overwritten in case the reducer re-runs (and then
> committed to final location once the job is complete)
>

The second case (data needs to be exported) is the reason that I have. Each
of these small files is used in an external process. This seems like a good
solution - only question then is where can these files be written to safely?
Local directory? /tmp?

Ashish



>
>
>
> -Original Message-
> From: [EMAIL PROTECTED] on behalf of Ashish Venugopal
> Sent: Tue 4/1/2008 6:42 PM
> To: core-user@hadoop.apache.org
> Subject: Re: one key per output part file
>
> This seems like a reasonable solution - but I am using Hadoop streaming
> and
> byreducer is a perl script. Is it possible to handle side-effect files in
> streaming? I havent found
> anything that indicates that you can...
>
> Ashish
>
> On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> >
> >
> > Try opening the desired output file in the reduce method.  Make sure
> that
> > the output files are relative to the correct task specific directory
> (look
> > for side-effect files on the wiki).
> >
> >
> >
> > On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
> >
> > > Hi, I am using Hadoop streaming and I am trying to create a MapReduce
> > that
> > > will generate output where a single key is found in a single output
> part
> > > file.
> > > Does anyone know how to ensure this condition? I want the reduce task
> > (no
> > > matter how many are specified), to only receive
> > > key-value output from a single key each, process the key-value pairs
> for
> > > this key, write an output part-XXX file, and only
> > > then process the next key.
> > >
> > > Here is the task that I am trying to accomplish:
> > >
> > > Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> > > Output: Each part-XXX should contain the lines of T that contain the
> > word
> > > from line XXX in V.
> > >
> > > Any help/ideas are appreciated.
> > >
> > > Ashish
> >
> >
>
>


Re: distcp fails :Input source not found

2008-04-02 Thread Prasan Ary
Anybody? Any thoughts on why this might be happening?

  Here is what is happening directly from the EC2 screen. The ID and
 Secret Key are the only things changed.

  I'm running Hadoop 0.15.3 from the public AMI. I launched a 2-machine
 cluster using the EC2 scripts in src/contrib/ec2/bin . . .

The file I try and copy is 9KB (I noticed previous discussion on
 empty files and files that are > 10MB)
   
  > First I make sure that we can copy the file from s3
  [EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop fs
 -copyToLocal s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml
 /usr/InputFileFormat.xml
 
   > Now I see that the file is copied to the ec2 master (where I'm
 logged in)
  [EMAIL PROTECTED] hadoop-0.15.3]# dir /usr/Input*
  /usr/InputFileFormat.xml
   
  > Next I make sure I can access the HDFS and that the input
 directory is there
  [EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop fs -ls /
  Found 2 items
  /input  2008-04-01 15:45
  /mnt  2008-04-01 15:42
  [EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop fs -ls
 /input/
  Found 0 items
   
  > I make sure hadoop is running just fine by running an example
  [EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop jar
 hadoop-0.15.3-examples.jar pi 10 1000
  Number of Maps = 10 Samples per Map = 1000
  Wrote input for Map #0
  Wrote input for Map #1
  Wrote input for Map #2
  Wrote input for Map #3
  Wrote input for Map #4
  Wrote input for Map #5
  Wrote input for Map #6
  Wrote input for Map #7
  Wrote input for Map #8
  Wrote input for Map #9
  Starting Job
  08/04/01 17:38:14 INFO mapred.FileInputFormat: Total input paths to
 process : 10
  08/04/01 17:38:14 INFO mapred.JobClient: Running job:
 job_200804011542_0001
  08/04/01 17:38:15 INFO mapred.JobClient: map 0% reduce 0%
  08/04/01 17:38:22 INFO mapred.JobClient: map 20% reduce 0%
  08/04/01 17:38:24 INFO mapred.JobClient: map 30% reduce 0%
  08/04/01 17:38:25 INFO mapred.JobClient: map 40% reduce 0%
  08/04/01 17:38:27 INFO mapred.JobClient: map 50% reduce 0%
  08/04/01 17:38:28 INFO mapred.JobClient: map 60% reduce 0%
  08/04/01 17:38:31 INFO mapred.JobClient: map 80% reduce 0%
  08/04/01 17:38:33 INFO mapred.JobClient: map 90% reduce 0%
  08/04/01 17:38:34 INFO mapred.JobClient: map 100% reduce 0%
  08/04/01 17:38:43 INFO mapred.JobClient: map 100% reduce 20%
  08/04/01 17:38:44 INFO mapred.JobClient: map 100% reduce 100%
  08/04/01 17:38:45 INFO mapred.JobClient: Job complete:
 job_200804011542_0001
  08/04/01 17:38:45 INFO mapred.JobClient: Counters: 9
  08/04/01 17:38:45 INFO mapred.JobClient: Job Counters 
  08/04/01 17:38:45 INFO mapred.JobClient: Launched map tasks=10
  08/04/01 17:38:45 INFO mapred.JobClient: Launched reduce tasks=1
  08/04/01 17:38:45 INFO mapred.JobClient: Data-local map tasks=10
  08/04/01 17:38:45 INFO mapred.JobClient: Map-Reduce Framework
  08/04/01 17:38:45 INFO mapred.JobClient: Map input records=10
  08/04/01 17:38:45 INFO mapred.JobClient: Map output records=20
  08/04/01 17:38:45 INFO mapred.JobClient: Map input bytes=240
  08/04/01 17:38:45 INFO mapred.JobClient: Map output bytes=320
  08/04/01 17:38:45 INFO mapred.JobClient: Reduce input groups=2
  08/04/01 17:38:45 INFO mapred.JobClient: Reduce input records=20
  Job Finished in 31.028 seconds
  Estimated value of PI is 3.1556
   
  > Finally, I try and copy the file over
  [EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop distcp
 s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml
 /input/InputFileFormat.xml
  With failures, global counters are inaccurate; consider running with
 -i
  Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input
 source s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml does not
 exist.
  at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:470)
  at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:550)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
  at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:563)


   

Re: Help: libhdfs SIGSEGV

2008-04-02 Thread Arun C Murthy


On Apr 2, 2008, at 1:36 AM, Yingyuan Cheng wrote:


Hello.

Is libhdfs thread-safe? I can run a single thread reading/writing HDFS
through libhdfs fine, but when increasing the number of threads to 2 or
above, I receive a SIGSEGV error:



Could you explain a bit more? What are you doing in different  
threads? Are you writing to the same file?


Arun


#
# An unexpected error has been detected by Java Runtime Environment:
#
# Internal Error (53484152454432554E54494D450E4350500214), pid=15614,
tid=1080834960
#
# Java VM: Java HotSpot(TM) Server VM (1.6.0_03-b05 mixed mode)
# An error report file with more information is saved as  
hs_err_pid15614.log

#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp
#
/bin/sh: line 1: 15614 Aborted ./hdfsbench -a w /tmp/test/txt -t 2


and the bt output:


(gdb) bt full
#0 0x061707b8 in ChunkPool::allocate () from
/usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
No symbol table info available.
#1 0x061703a6 in Arena::Arena () from
/usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
No symbol table info available.
#2 0x06501729 in Thread::Thread () from
/usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
No symbol table info available.
#3 0x06503144 in JavaThread::JavaThread () from
/usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
No symbol table info available.
#4 0x062f006e in attach_current_thread () from
/usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
No symbol table info available.
#5 0x062eee08 in jni_AttachCurrentThread () from
/usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
No symbol table info available.
#6 0xb7f1a4bd in getJNIEnv () at hdfsJniHelper.c:347
vm = (JavaVM *) 0x65ccfcc
vmBuf = 0xb74a0270
env = (JNIEnv *) 0xb7eb73e2
rv = 106745804
noVMs = 1
#7 0xb7f17143 in hdfsConnect (host=0x804aac2 "default", port=0) at
hdfs.c:119
env = (JNIEnv *) 0x0
jConfiguration = (jobject) 0x0
jFS = (jobject) 0x0
jURI = (jobject) 0x0
jURIString = (jstring) 0x0
jVal = {z = 0 '\0', b = 0 '\0', c = 0, s = 0, i = 0, j = 0, f = 0, d =
0, l = 0x0}
cURI = 0x0
gFsRef = (jobject) 0x0
#8 0x0804904b in worker_thread_w (arg=0x1) at hdfsbench.cpp:192
idx = 1
path = {static npos = 4294967295,
_M_dataplus = {<std::allocator<char>> =
{<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
_M_p = 0x804e484 "/tmp/test/txt1"}}
fs = (hdfsFS) 0x0
writeFile = (hdfsFile) 0x0
i = 0
bwrite = 0
boffset = 0
tv_start = {tv_sec = 0, tv_usec = 0}
tv_end = {tv_sec = 0, tv_usec = 0}
#9 0xb7f2246b in start_thread () from /lib/tls/i686/cmov/ 
libpthread.so.0

No symbol table info available.
#10 0xb7da06de in clone () from /lib/tls/i686/cmov/libc.so.6


--
Yingyuan Cheng





Re: DFS get blocked when writing a file.

2008-04-02 Thread Iván de Prado
Thanks very much for the help. 

I will investigate more about that. 

Iván

El lun, 31-03-2008 a las 11:11 -0700, Raghu Angadi escribió:
> Iván,
> 
> Whether this was expected or an error depends on what happened on the
> client. This could happen and would not be a bug if the client was killed
> for some other reason, for example. But if the client is also similarly
> surprised, then it's a different case.
> 
> You could grep for this block in the NameNode log and on the client. If you are
> still interested in looking into this, I would suggest opening a JIRA.
> 
> Raghu.
> 
> Iván de Prado wrote:
> > Thanks, 
> > 
> > I have tried with the trunk version and now the exception "Trying to
> > change block file offset of block blk_... to ... but actual size of file
> > is ..." has disappeared and the jobs don't seems to get blocked.
> > 
> > But I have another "Broken Pipe" and "EOF" exceptions in the dfs logs.
> > They seems similar to https://issues.apache.org/jira/browse/HADOOP-2042
> > ticket. The Jobs ends but not sure if they are executed smoothly. are
> > these exceptions normal? As example, the exceptions for the block
> > (6801211507359331627) appears in two nodes (I have 2 as replication) and
> > looks like:
> > 
> > hn2: 2008-03-31 05:03:13,736 INFO org.apache.hadoop.dfs.DataNode:
> > Datanode 0 forwarding connect ack to upstream firstbadlink is 
> > hn2: 2008-03-31 05:03:14,507 INFO org.apache.hadoop.dfs.DataNode:
> > Receiving block blk_6801211507359331627 src: /172.16.3.6:38218
> > dest: /172.16.3.6:50010
> > 
> > hn2: 2008-03-31 05:04:14,528 INFO org.apache.hadoop.dfs.DataNode:
> > Exception in receiveBlock for block blk_6801211507359331627
> > java.io.EOFException
> > hn2: 2008-03-31 05:04:14,528 INFO org.apache.hadoop.dfs.DataNode:
> > PacketResponder 0 for block blk_6801211507359331627 Interrupted.
> > hn2: 2008-03-31 05:04:14,528 INFO org.apache.hadoop.dfs.DataNode:
> > PacketResponder 0 for block blk_6801211507359331627 terminating
> > hn2: 2008-03-31 05:04:14,530 INFO org.apache.hadoop.dfs.DataNode:
> > writeBlock blk_6801211507359331627 received exception
> > java.io.EOFException
> > hn2: 2008-03-31 05:04:14,530 ERROR org.apache.hadoop.dfs.DataNode:
> > 172.16.3.4:50010:DataXceiver: java.io.EOFException
> > hn2:at java.io.DataInputStream.readInt(DataInputStream.java:375)
> > hn2:at org.apache.hadoop.dfs.DataNode
> > $BlockReceiver.receiveBlock(DataNode.java:2243)
> > hn2:at org.apache.hadoop.dfs.DataNode
> > $DataXceiver.writeBlock(DataNode.java:1157)
> > hn2:at org.apache.hadoop.dfs.DataNode
> > $DataXceiver.run(DataNode.java:938)
> > hn2:at java.lang.Thread.run(Thread.java:619)
> > 
> > hn4: 2008-03-31 05:03:13,590 INFO org.apache.hadoop.dfs.DataNode:
> > Datanode 0 forwarding connect ack to upstream firstbadlink is 
> > hn4: 2008-03-31 05:03:14,506 INFO org.apache.hadoop.dfs.DataNode:
> > Receiving block blk_6801211507359331627 src: /172.16.3.6:41112
> > dest: /172.16.3.6:50010
> > 
> > hn4: 2008-03-31 05:03:26,825 INFO org.apache.hadoop.dfs.DataNode:
> > Exception in receiveBlock for block blk_6801211507359331627
> > java.io.EOFException
> > 
> > hn4: 2008-03-31 05:04:14,524 INFO org.apache.hadoop.dfs.DataNode:
> > PacketResponder blk_6801211507359331627 1 Exception
> > java.net.SocketException: Broken pipe
> > hn4:at java.net.SocketOutputStream.socketWrite0(Native Method)
> > hn4:at
> > java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
> > hn4:at
> > java.net.SocketOutputStream.write(SocketOutputStream.java:136)
> > hn4:at java.io.DataOutputStream.writeLong(DataOutputStream.java:207)
> > hn4:at org.apache.hadoop.dfs.DataNode
> > $PacketResponder.run(DataNode.java:1825)
> > hn4:at java.lang.Thread.run(Thread.java:619)
> > hn4: 
> > hn4: 2008-03-31 05:04:14,525 INFO org.apache.hadoop.dfs.DataNode:
> > PacketResponder 1 for block blk_6801211507359331627 terminating
> > hn4: 2008-03-31 05:04:14,525 INFO org.apache.hadoop.dfs.DataNode:
> > writeBlock blk_6801211507359331627 received exception
> > java.io.EOFException
> > hn4: 2008-03-31 05:04:14,526 ERROR org.apache.hadoop.dfs.DataNode:
> > 172.16.3.6:50010:DataXceiver: java.io.EOFException
> > hn4:at java.io.DataInputStream.readInt(DataInputStream.java:375)
> > hn4:at org.apache.hadoop.dfs.DataNode
> > $BlockReceiver.receiveBlock(DataNode.java:2243)
> > hn4:at org.apache.hadoop.dfs.DataNode
> > $DataXceiver.writeBlock(DataNode.java:1157)
> > hn4:at org.apache.hadoop.dfs.DataNode
> > $DataXceiver.run(DataNode.java:938)
> > hn4:at java.lang.Thread.run(Thread.java:619)
> > hn4: 
> > 
> > Many thanks, 
> > 
> > Iván de Prado Alonso
> > http://ivandeprado.blogspot.com/



Re: What happens if a namenode fails?

2008-04-02 Thread Peeyush Bishnoi
Hello,

See this URL, which might help you with your query:

http://www.nabble.com/Namenode-cluster-and-fail-over-td15903856.html

---
Peeyush


On Tue, 2008-04-01 at 14:44 -0700, Xavier Stevens wrote:

> What happens to your data if the namenode fails (hardware failure)?
> Assuming you replace it with a fresh box can you restore all of your
> data from the slaves?
>  
> -Xavier


Re: Nutch and Distributed Lucene

2008-04-02 Thread Naama Kraus
Hi Ning,

Thanks a lot !

Naama

On Tue, Apr 1, 2008 at 7:06 PM, Ning Li <[EMAIL PROTECTED]> wrote:

> Hi,
>
> Nutch builds Lucene indexes. But Nutch is much more than that. It is a
> web search application software that crawls the web, inverts links and
> builds indexes. Each step is one or more Map/Reduce jobs. You can find
> more information at http://lucene.apache.org/nutch/
>
> The Map/Reduce job to build Lucene indexes in Nutch is customized to
> the data schema/structures used in Nutch. The index contrib package in
> Hadoop provides a general/configurable process to build Lucene indexes
> in parallel using a Map/Reduce job. That's the main difference. There
> is also the difference that the index build job in Nutch builds
> indexes in reduce tasks, while the index contrib package builds
> indexes in both map and reduce tasks and there are advantages in doing
> that...
>
> Regards,
> Ning
>
>
> On 4/1/08, Naama Kraus <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I'd like to know if Nutch is running on top of Lucene, or is it non
> related
> > to Lucene. I.e. indexing, parsing, crawling, internal data structures
> ... -
> > all written from scratch using MapReduce (my impression) ?
> >
> > What is the relation between Nutch and the distributed Lucene patch that
> was
> > inserted lately into Hadoop ?
> >
> > Thanks for any enlightening,
> > Naama
> >
> > --
> > oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00
> oo
> > 00 oo 00 oo
> > "If you want your children to be intelligent, read them fairy tales. If
> you
> > want them to be more intelligent, read them more fairy tales." (Albert
> > Einstein)
> >
>



-- 
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you
want them to be more intelligent, read them more fairy tales." (Albert
Einstein)


Help: libhdfs SIGSEGV

2008-04-02 Thread Yingyuan Cheng
Hello.

Is libhdfs thread-safe? I can run a single thread reading/writing HDFS
through libhdfs fine, but when increasing the number of threads to 2 or
above, I receive a SIGSEGV error:

#
# An unexpected error has been detected by Java Runtime Environment:
#
# Internal Error (53484152454432554E54494D450E4350500214), pid=15614,
tid=1080834960
#
# Java VM: Java HotSpot(TM) Server VM (1.6.0_03-b05 mixed mode)
# An error report file with more information is saved as hs_err_pid15614.log
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp
#
/bin/sh: line 1: 15614 Aborted ./hdfsbench -a w /tmp/test/txt -t 2


and the bt output:


(gdb) bt full
#0 0x061707b8 in ChunkPool::allocate () from
/usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
No symbol table info available.
#1 0x061703a6 in Arena::Arena () from
/usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
No symbol table info available.
#2 0x06501729 in Thread::Thread () from
/usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
No symbol table info available.
#3 0x06503144 in JavaThread::JavaThread () from
/usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
No symbol table info available.
#4 0x062f006e in attach_current_thread () from
/usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
No symbol table info available.
#5 0x062eee08 in jni_AttachCurrentThread () from
/usr/lib/jvm/java/jre/lib/i386/server/libjvm.so
No symbol table info available.
#6 0xb7f1a4bd in getJNIEnv () at hdfsJniHelper.c:347
vm = (JavaVM *) 0x65ccfcc
vmBuf = 0xb74a0270
env = (JNIEnv *) 0xb7eb73e2
rv = 106745804
noVMs = 1
#7 0xb7f17143 in hdfsConnect (host=0x804aac2 "default", port=0) at
hdfs.c:119
env = (JNIEnv *) 0x0
jConfiguration = (jobject) 0x0
jFS = (jobject) 0x0
jURI = (jobject) 0x0
jURIString = (jstring) 0x0
jVal = {z = 0 '\0', b = 0 '\0', c = 0, s = 0, i = 0, j = 0, f = 0, d =
0, l = 0x0}
cURI = 0x0
gFsRef = (jobject) 0x0
#8 0x0804904b in worker_thread_w (arg=0x1) at hdfsbench.cpp:192
idx = 1
path = {static npos = 4294967295,
_M_dataplus = {<std::allocator<char>> =
{<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>},
_M_p = 0x804e484 "/tmp/test/txt1"}}
fs = (hdfsFS) 0x0
writeFile = (hdfsFile) 0x0
i = 0
bwrite = 0
boffset = 0
tv_start = {tv_sec = 0, tv_usec = 0}
tv_end = {tv_sec = 0, tv_usec = 0}
#9 0xb7f2246b in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
No symbol table info available.
#10 0xb7da06de in clone () from /lib/tls/i686/cmov/libc.so.6


--
Yingyuan Cheng



RE: one key per output part file

2008-04-02 Thread Joydeep Sen Sarma
Curious - why do we need a file per XXX?

- if further processing is going to be done in Hadoop itself, then it's hard
to see a reason. One can always have multiple entries in the same HDFS file.
Note that it's possible to align map task splits on sort-key boundaries in
pre-sorted data (it's not something that Hadoop supports natively right now,
but you can write your own InputFormat to do this). Meaning that subsequent
processing that wants all entries corresponding to XXX in one group (as in a
reducer) can do so in the map phase itself (i.e., it's damned cheap and doesn't
require sorting the data all over again).

- if the data needs to be exported (either to a sql db or an external file 
system) - then why not do so directly from the reducer (instead of trying to 
create these intermediate small files in hdfs)? data can be written to tmp 
tables/files and can be overwritten in case the reducer re-runs (and then 
committed to final location once the job is complete)



-Original Message-
From: [EMAIL PROTECTED] on behalf of Ashish Venugopal
Sent: Tue 4/1/2008 6:42 PM
To: core-user@hadoop.apache.org
Subject: Re: one key per output part file
 
This seems like a reasonable solution - but I am using Hadoop streaming and
byreducer is a perl script. Is it possible to handle side-effect files in
streaming? I havent found
anything that indicates that you can...

Ashish

On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>
>
> Try opening the desired output file in the reduce method.  Make sure that
> the output files are relative to the correct task specific directory (look
> for side-effect files on the wiki).
>
>
>
> On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
>
> > Hi, I am using Hadoop streaming and I am trying to create a MapReduce
> that
> > will generate output where a single key is found in a single output part
> > file.
> > Does anyone know how to ensure this condition? I want the reduce task
> (no
> > matter how many are specified), to only receive
> > key-value output from a single key each, process the key-value pairs for
> > this key, write an output part-XXX file, and only
> > then process the next key.
> >
> > Here is the task that I am trying to accomplish:
> >
> > Input: Corpus T (lines of text), Corpus V (each line has 1 word)
> > Output: Each part-XXX should contain the lines of T that contain the
> word
> > from line XXX in V.
> >
> > Any help/ideas are appreciated.
> >
> > Ashish
>
>