Re: Thread safety issues with JNI/native code from map tasks?

2011-01-28 Thread Allen Wittenauer

On Jan 28, 2011, at 5:47 PM, Keith Wiley wrote:

> On Jan 28, 2011, at 15:50 , Greg Roelofs wrote:
> 
>> Does your .so depend on any other potentially thread-unsafe .so that other
>> (non-Hadoop) processes might be using?  System libraries like zlib are safe
>> (else they wouldn't make very good system libraries), but maybe some other
>> research library or something?  (That's a long shot, but I'm pretty much
>> grasping at straws here.)
> 
> Yeah, I dunno.  It's a very complicated system that hits all kinds of popular 
> conventional libraries: boost, eigen, countless other things.  I doubt any of it 
> is being accessed concurrently, however.  This is a dedicated cluster, so if my 
> task is the only one running, then it's only concurrent with the OS itself (and the JVM 
> and Hadoop).

By chance, do you have jvm reuse turned on?
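If so, it would be worth a test run with reuse disabled.  A minimal mapred-site.xml
entry for that (assuming the 0.20-era property name) would be something like:

  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <!-- 1 = a fresh JVM per task (the default); -1 = reuse without limit -->
    <value>1</value>
  </property>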




Re: Draining/Decommisioning a tasktracker

2011-01-28 Thread Allen Wittenauer

On Jan 28, 2011, at 1:09 AM, rishi pathak wrote:

> Hi,
>    Is there a way to drain a tasktracker?  What we require is to not
> schedule any more map/reduce tasks onto a tasktracker (mark it offline), while
> leaving the tasks already running on it unaffected.
> 

Decommissioning task trackers was added in 0.21.
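If I remember the mechanics right, it mirrors datanode decommissioning: list the
host in the file named by mapred.hosts.exclude in mapred-site.xml, then tell the
JobTracker to re-read its node lists with

  bin/hadoop mradmin -refreshNodes

(property and command names from memory of 0.21, so double-check them against
your build).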



Re: Why do I get SocketTimeoutException?

2011-01-28 Thread li ping
It could be an exception caused by the connection between the NN and the DN timing out.
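If the DataNode is just slow to respond (heavy load, long GC pauses), one thing to
try is raising the client-side timeouts in hdfs-site.xml, e.g. (property names as
in 0.20, values in milliseconds):

  <property>
    <name>dfs.socket.timeout</name>
    <value>120000</value>
  </property>
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>120000</value>
  </property>

Whether that actually helps depends on the root cause, of course.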

On Sat, Jan 29, 2011 at 8:11 AM, hadoop user  wrote:

> What are possible causes due to which I might get SocketTimeoutException ?
>
>
> 11/01/28 19:01:36 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> java.net.SocketTimeoutException: 69000 millis timeout while waiting for
> channel to be ready for connect. ch :
> java.nio.channels.SocketChannel[connection-pending
> remote=/XX.XXX.XX.X:50010]
> 11/01/28 19:01:36 INFO hdfs.DFSClient: Abandoning block
> blk_987175206123664825_1215418
>
> Thanks,
> Ravi
>



-- 
-李平


Re: Distributed indexing with Hadoop

2011-01-28 Thread Lance Norskog
Look at the Reuters example in the Mahout project: http://mahout.apache.org

On Fri, Jan 28, 2011 at 2:49 AM, Marco Didonna  wrote:
> Hello everyone,
> I am building a hadoop "app" to quickly index a corpus of documents.
> This app will accept one or more XML files that contain the corpus.
> Each document is made up of several sections: title, authors,
> body... these sections are not static and depend on the collection. Here's
> a sample glimpse of what the XML input file looks like:
>
> <corpus>
>  <document>
>   <title>the divine comedy</title>
>   <authors>Dante</authors>
>   <body>halfway along our life's path...</body>
>  </document>
>
>  <document>
>   ...
>  </document>
> </corpus>
>
> I would like to discuss some implementation choices:
>
> - which is the best way to "tell" my hadoop app which sections to expect
> between <document> and </document> tags?
>
> - is it more appropriate to implement a record reader that passes the
> mapper the whole content of the document tag, or one section at a time? I
> was wondering which parser to use, a DOM-like one or a SAX-like
> one... any (efficient) library to recommend?
>
> - do you know any library I could use to process text? By text
> processing I mean common preprocessing operations like tokenization and
> stopword elimination... I was thinking of using Lucene's engine... could it
> be a bottleneck?
>
> I am looking forward to reading your opinions.
>
> Thanks,
>
> Marco
>
>



-- 
Lance Norskog
goks...@gmail.com


message transmission in Hadoop

2011-01-28 Thread Da Zheng
Hello,

I monitored the system calls of HDFS with systemtap and found that HDFS actually
sends many 1-byte writes to the network.  I could also see many 8-byte and 64-byte
writes handed to the OS, though I don't know whether they go to the disk or to the
network.  I did see many 8-byte writes sent to the network.  The number of these
small writes is several times greater than the number of 64KB data packets sent by HDFS.

Could anyone tell me why HDFS sends so many small packets?  Heartbeat messages?
RPCs?  It doesn't seem to me that these messages could be just 1 byte.

Thanks,
Da


Re: Thread safety issues with JNI/native code from map tasks?

2011-01-28 Thread Greg Roelofs
I wrote:

>> Btw, keep in mind that there are memory-related bugs that don't show up
>> until there's something big in memory that pushes the code in question
>> up into a region with different data patterns in it (most frequently zero
>> vs. non-zero, but others are possible).  IOW, maybe the code is dependent
>> on uninitialized memory, but you were getting lucky when you ran it outside
>> of Hadoop.  Have you run it through valgrind or Purify or similar?

Keith Wiley wrote:

> Valgrind has turned out to be almost useless.  It can't "reach"
> through the JVM and JNI to the .so code.  If I don't
> tell valgrind to follow children, it obviously produces
> no relevant output, but if I do tell it to follow children,
> it can't successfully launch a VM to run Java in:

> Error occurred during initialization of VM
> Unknown x64 processor: SSE2 not supported

> Sigh...any thoughts on running Valgrind on Hadoop->JVM->JNI->native code?

I actually meant something simpler:  if we posit that the bug is actually
in the library code but isn't always triggering a segfault due to random
memory conditions (i.e., "getting lucky"), then running valgrind on it in
a non-Java context (i.e., what you said "runs perfectly fine outside Hadoop")
should detect such bug(s).

If that shows nothing, and you're not passing buffers across the JNI boundary
(=> possible GC issues, perhaps subtle ones?), then I'm out of ideas.  Again.
Sorry. :-/

Greg


Re: Thread safety issues with JNI/native code from map tasks?

2011-01-28 Thread Greg Roelofs
Todd Lipcon wrote:

> JNI may also work fine with no GC running, but then work badly when GC kicks
> in at a bad time. For example, if you grab a pointer to a String or array,
> you need to essentially lock them so the GC doesn't relocate the objects
> underneath you. For example, maybe you're releasing one of these references
> and then continuing to use it?

Excellent point!  And one I should have remembered, too.

Keith, take a look at the native ZlibCompressor interface to see one way of
handling this.  (It pins a number of buffers in memory and puts them into a
pool, if I remember Chris Douglas's explanation correctly.  I didn't need to
dive into that level of detail myself for what I was working on, so I never
touched the buffer code and might not be remembering it entirely accurately,
but that's the gist, anyway.)
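As far as I understand it, the rough idea is to allocate direct ByteBuffers on the
Java side (the GC never relocates their backing memory) and hand only those across
the JNI boundary.  A simplified sketch of the pattern (not the actual Hadoop code;
the class and method names are made up):

import java.nio.ByteBuffer;

class NativeFilter {
  // Direct buffers live outside the Java heap, so the GC never moves them;
  // the native side can safely use the address from GetDirectBufferAddress().
  private final ByteBuffer input  = ByteBuffer.allocateDirect(64 * 1024);
  private final ByteBuffer output = ByteBuffer.allocateDirect(64 * 1024);

  // Implemented in the .so; it only ever sees the two reusable direct buffers.
  private native int filterDirect(ByteBuffer src, int srcLen, ByteBuffer dst);

  int filter(byte[] data, int off, int len) {
    input.clear();
    input.put(data, off, Math.min(len, input.capacity()));  // copy into off-heap memory
    return filterDirect(input, input.position(), output);
  }
}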

On 0.20.x or 0.22/trunk:
  hadoop-common/src/native/src/org/apache/hadoop/io/compress
  hadoop-common/src/{core or java}/org/apache/hadoop/io/compress

Greg


Re: Thread safety issues with JNI/native code from map tasks?

2011-01-28 Thread Keith Wiley
Hmmm, none of my native code is using JNI objects, so the memory in question 
should have no relationship to the Java heap or any other aspect of the Java 
world, but I admit I'm unclear on how Java, JNI, and native libraries divide 
memory among one another or how they trade responsibility for allocating and 
deallocating memory.

I'll consider it, but since I'm not talking about JNI objects, I don't think that 
can be it.  Do you think I'm misunderstanding something?

On Jan 28, 2011, at 16:24 , Todd Lipcon wrote:

> JNI may also work fine with no GC running, but then work badly when GC kicks
> in at a bad time. For example, if you grab a pointer to a String or array,
> you need to essentially lock them so the GC doesn't relocate the objects
> underneath you. For example, maybe you're releasing one of these references
> and then continuing to use it?
> 
> -Todd



Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

"Luminous beings are we, not this crude matter."
  -- Yoda






Re: Thread safety issues with JNI/native code from map tasks?

2011-01-28 Thread Keith Wiley
On Jan 28, 2011, at 15:50 , Greg Roelofs wrote:

> Does your .so depend on any other potentially thread-unsafe .so that other
> (non-Hadoop) processes might be using?  System libraries like zlib are safe
> (else they wouldn't make very good system libraries), but maybe some other
> research library or something?  (That's a long shot, but I'm pretty much
> grasping at straws here.)

Yeah, I dunno.  It's a very complicated system that hits all kinds of popular 
conventional libraries: boost, eigen, countless other things.  I doubt any of it is 
being accessed concurrently, however.  This is a dedicated cluster, so if my task is the only 
one running, then it's only concurrent with the OS itself (and the JVM and 
Hadoop).

>> Yes, not thread safe, but what difference could that make if I
>> don't use the library in a multi-threaded fashion.  One map task,
>> one node, one Java thread calling JNI and using the native code?
>> How do thread safety issues factor into this?  I admit, it's
>> my theory that threads might be involved somehow, but I don't
>> understand how, I'm just shooting in the dark since I can't
>> solve this problem any other way yet.
> 
> Since you can reproduce it in standalone mode, can you enable core dumps
> so you can see the backtrace of the code that segfaults?  Knowing what
> specifically broke and how it got there is always a big help.

Yep, I've got core dumps and I've run them through gdb.  I know that the code 
often dies very deep inside ostensibly standard libraries, like eigen for 
example...which leads me to believe the memory corruption happened long before 
the code reached that point.

> Btw, keep in mind that there are memory-related bugs that don't show up
> until there's something big in memory that pushes the code in question
> up into a region with different data patterns in it (most frequently zero
> vs. non-zero, but others are possible).  IOW, maybe the code is dependent
> on uninitialized memory, but you were getting lucky when you ran it outside
> of Hadoop.  Have you run it through valgrind or Purify or similar?


Valgrind has turned out to be almost useless.  It can't "reach" through the JVM 
and JNI to the .so code.  If I don't tell valgrind to follow children, 
it obviously produces no relevant output, but if I do tell it to follow 
children, it can't successfully launch a VM to run Java in:

Error occurred during initialization of VM
Unknown x64 processor: SSE2 not supported

Sigh...any thoughts on running Valgrind on Hadoop->JVM->JNI->native code?

Thanks.


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

"Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy."
  -- Edwin A. Abbott, Flatland






Re: Thread safety issues with JNI/native code from map tasks?

2011-01-28 Thread Todd Lipcon
JNI may also work fine with no GC running, but then work badly when GC kicks
in at a bad time. For example, if you grab a pointer to a String or array,
you need to essentially lock them so the GC doesn't relocate the objects
underneath you. For example, maybe you're releasing one of these references
and then continuing to use it?

-Todd

On Fri, Jan 28, 2011 at 3:50 PM, Greg Roelofs  wrote:

> Keith Wiley wrote:
>
> > (1) Speculative execution would occur on a completely different
> > node, so there definitely isn't any thread cross-talk (in memory).
> > So long as they don't rely on reading/writing temp files from
> > HDFS I don't see how they could have any effect on one another.
>
> Good point.
>
> > (2) I am also getting seg faults when I run in noncluster
> > standalone mode, which is a single nonspeculated thread..I
> > presume.
>
> That's the same as "pseudo-distributed mode"?
>
> > Can you explain your thoughts on speculative execution w.r.t. the
> > problems I'm having?
>
> Thoughts?  You expect me to have thoughts, too??
>
> :-)
>
> I had not fully thought through the spec ex idea; it was the only thing
> I could think of that might put two (otherwise independent) JNI-using tasks
> onto the same node.  But as you point out above, it wouldn't...
>
> Does your .so depend on any other potentially thread-unsafe .so that other
> (non-Hadoop) processes might be using?  System libraries like zlib are safe
> (else they wouldn't make very good system libraries), but maybe some other
> research library or something?  (That's a long shot, but I'm pretty much
> grasping at straws here.)
>
> > Yes, not thread safe, but what difference could that make if I
> > don't use the library in a multi-threaded fashion.  One map task,
> > one node, one Java thread calling JNI and using the native code?
> > How do thread safety issues factor into this?  I admit, it's
> > my theory that threads might be involved somehow, but I don't
> > understand how, I'm just shooting in the dark since I can't
> > solve this problem any other way yet.
>
> Since you can reproduce it in standalone mode, can you enable core dumps
> so you can see the backtrace of the code that segfaults?  Knowing what
> specifically broke and how it got there is always a big help.
>
> Btw, keep in mind that there are memory-related bugs that don't show up
> until there's something big in memory that pushes the code in question
> up into a region with different data patterns in it (most frequently zero
> vs. non-zero, but others are possible).  IOW, maybe the code is dependent
> on uninitialized memory, but you were getting lucky when you ran it outside
> of Hadoop.  Have you run it through valgrind or Purify or similar?
>
> Greg
>



-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Hadoop Version

2011-01-28 Thread hadoop user
Redirecting to common-user.
You can check the hadoop version using either of the following methods.

CLI: run the hadoop version command:
bin/hadoop version

Web interface:
Check the NameNode or JobTracker web interface; it shows the version number.

-
Ravi

On Fri, Jan 28, 2011 at 11:24 AM,  wrote:

>  Hello all,
> I am having issues with accessing hdfs and I figured it's due to a version
> mismatch. I know my jar files have multiple copies of hadoop (pig has its
> own, I have hadoop 0.20.2, and Whirr has its own hadoop copy). My question is
> how to find the right version of hadoop that matches the one I
> installed? Where is the hadoop version info stored?
>
> 2011-01-28 14:17:24,729 WARN org.apache.hadoop.ipc.Server: Incorrect header
> or version mismatch from xx.xx.xx.xx:57271 got version 4 expected version 3
>
> Praveen
>
>


Why do I get SocketTimeoutException?

2011-01-28 Thread hadoop user
What are possible causes due to which I might get SocketTimeoutException ?


11/01/28 19:01:36 INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.net.SocketTimeoutException: 69000 millis timeout while waiting for
channel to be ready for connect. ch :
java.nio.channels.SocketChannel[connection-pending
remote=/XX.XXX.XX.X:50010]
11/01/28 19:01:36 INFO hdfs.DFSClient: Abandoning block
blk_987175206123664825_1215418

Thanks,
Ravi


Re: Thread safety issues with JNI/native code from map tasks?

2011-01-28 Thread Greg Roelofs
Keith Wiley wrote:

> (1) Speculative execution would occur on a completely different
> node, so there definitely isn't any thread cross-talk (in memory).
> So long as they don't rely on reading/writing temp files from
> HDFS I don't see how they could have any effect on one another.

Good point.

> (2) I am also getting seg faults when I run in noncluster
> standalone mode, which is a single nonspeculated thread..I
> presume.

That's the same as "pseudo-distributed mode"?

> Can you explain your thoughts on speculative execution w.r.t. the
> problems I'm having?

Thoughts?  You expect me to have thoughts, too??

:-)

I had not fully thought through the spec ex idea; it was the only thing
I could think of that might put two (otherwise independent) JNI-using tasks
onto the same node.  But as you point out above, it wouldn't...

Does your .so depend on any other potentially thread-unsafe .so that other
(non-Hadoop) processes might be using?  System libraries like zlib are safe
(else they wouldn't make very good system libraries), but maybe some other
research library or something?  (That's a long shot, but I'm pretty much
grasping at straws here.)

> Yes, not thread safe, but what difference could that make if I
> don't use the library in a multi-threaded fashion.  One map task,
> one node, one Java thread calling JNI and using the native code?
> How do thread safety issues factor into this?  I admit, it's
> my theory that threads might be involved somehow, but I don't
> understand how, I'm just shooting in the dark since I can't
> solve this problem any other way yet.

Since you can reproduce it in standalone mode, can you enable core dumps
so you can see the backtrace of the code that segfaults?  Knowing what
specifically broke and how it got there is always a big help.

Btw, keep in mind that there are memory-related bugs that don't show up
until there's something big in memory that pushes the code in question
up into a region with different data patterns in it (most frequently zero
vs. non-zero, but others are possible).  IOW, maybe the code is dependent
on uninitialized memory, but you were getting lucky when you ran it outside
of Hadoop.  Have you run it through valgrind or Purify or similar?

Greg


Re: Thread safety issues with JNI/native code from map tasks?

2011-01-28 Thread Keith Wiley

On Jan 28, 2011, at 13:37 , Greg Roelofs wrote:

> Keith Wiley wrote:
> 
>> However, it is also not run in any concurrent fashion except
>> w.r.t. the JVM itself.  For example, my map task doesn't
>> make parallel calls through JNI to the native code on concurrent
>> threads at the Java level, nor does the native code itself spawn
>> any threads (like I said, it isn't even compiled with pthreads).
> 
> Is speculative execution enabled?

I suppose I hadn't specifically disabled that, but two points:

(1) Speculative execution would occur on a completely different node, so there 
definitely isn't any thread cross-talk (in memory).  So long as they don't rely 
on reading/writing temp files from HDFS I don't see how they could have any 
effect on one another.

(2) I am also getting seg faults when I run in non-cluster standalone mode, 
which is a single, nonspeculated thread...I presume.

Can you explain your thoughts on speculative execution w.r.t. the problems I'm 
having?

>> So, the question is, in the scenario I have described, is there
>> any reason to suspect that the cause of my problems is some
>> sort of thread trampling between the native code and something
>> else in the surrounding environment (the JVM or something like
>> that), especially in the context of the surrounding Hadoop
>> infrastructure?  It doesn't really make any sense to me, but
>> I'm running out of ideas.
> 
> I don't see any obvious possibilities except speculative execution, and
> even that would depend on how the shared library was written.  Does it
> contain any global or static variables?  If so, it's almost certainly
> not thread-safe (unless, say, a global variable were basically write-
> only and advisory only, e.g., used only in an error message or a summary
> message at the end).


Yes, not thread safe, but what difference could that make if I don't use the 
library in a multi-threaded fashion?  One map task, one node, one Java thread 
calling JNI and using the native code?  How do thread safety issues factor into 
this?  I admit, it's my theory that threads might be involved somehow, but I 
don't understand how, I'm just shooting in the dark since I can't solve this 
problem any other way yet.

Thanks for the input.  Can you tell me what you're thinking w.r.t. speculative 
execution?

I'll try it without, but I don't see how it could alter the standalone behavior.

Cheers!


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

"Luminous beings are we, not this crude matter."
  -- Yoda






Re: Draining/Decommisioning a tasktracker

2011-01-28 Thread phil young
There are also some hooks available in the schedulers that could be useful.
I think they were intended to let you schedule tasks based
on the load average of the host, but I'd expect you can customize them for your
purpose.


On Fri, Jan 28, 2011 at 6:46 AM, Harsh J  wrote:

> Moving discussion to the MapReduce-User list:
> mapreduce-u...@hadoop.apache.org
>
> Reply inline:
>
> On Fri, Jan 28, 2011 at 2:39 PM, rishi pathak 
> wrote:
> > Hi,
> >        Is there a way to drain a tasktracker?  What we require is to not
> > schedule any more map/reduce tasks onto a tasktracker (mark it offline), while
> > leaving the tasks already running on it unaffected.
>
> You could simply shut the TT down. MapReduce was designed with faults
> in mind and thus tasks that are running on a particular TaskTracker
> can be re-run elsewhere if they failed. Is this not usable in your
> case?
>
> --
> Harsh J
> www.harshj.com
>


Re: Thread safety issues with JNI/native code from map tasks?

2011-01-28 Thread Greg Roelofs
Keith Wiley wrote:

> However, it is also not run in any concurrent fashion except
> w.r.t. the JVM itself.  For example, my map task doesn't
> make parallel calls through JNI to the native code on concurrent
> threads at the Java level, nor does the native code itself spawn
> any threads (like I said, it isn't even compiled with pthreads).

Is speculative execution enabled?

> So, the question is, in the scenario I have described, is there
> any reason to suspect that the cause of my problems is some
> sort of thread trampling between the native code and something
> else in the surrounding environment (the JVM or something like
> that), especially in the context of the surrounding Hadoop
> infrastructure?  It doesn't really make any sense to me, but
> I'm running out of ideas.

I don't see any obvious possibilities except speculative execution, and
even that would depend on how the shared library was written.  Does it
contain any global or static variables?  If so, it's almost certainly
not thread-safe (unless, say, a global variable were basically write-
only and advisory only, e.g., used only in an error message or a summary
message at the end).

No other ideas.

Greg


Re: Java->native .so->seg fault->core dump file?

2011-01-28 Thread Keith Wiley
On Jan 28, 2011, at 09:39 , Allen Wittenauer wrote:

> 
> On Jan 21, 2011, at 12:57 PM, Keith Wiley wrote:
>> and I have this in my .bashrc (which I believe should be propagated to the 
>> slave nodes):
>>  ulimit -c unlimited
> 
>   .bashrc likely isn't executed at task startup, btw.  Also, you would 
> need to have this in whatever account is used to run the tasktracker...

True...good point.

>> and in my native code I call rlimit() and write the results, where I see:
>>  RLIMIT_CORE:  18446744073709551615 18446744073709551615
>> 
>> which indicates the "unlimited" setting, but I can't find any core dump 
>> files in the node's hadoop directories after the job runs.
>> 
>> Any ideas what I'm doing wrong?
> 
>   Which operating system?  On Linux, what is the value of 
> /proc/sys/kernel/core_pattern ? On Solaris, what is in /etc/coreadm.conf ?

Linux.  Are you asking about the value on the cluster or on my local machine?  The 
value of "/proc/sys/kernel/core_pattern" on the namenode (I guess) is "core".

Thanks.


Keith Wiley   kwi...@keithwiley.com   www.keithwiley.com

"And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!"
  -- Homer Simpson






Example for "Strings Comparison"

2011-01-28 Thread Rawan AlSaad

Dear all,
 
I am looking for an example of a MapReduce Java implementation for string 
comparison [pair-wise comparison]. If anybody has gone through a similar 
example before, could you please help by pointing me to a code example?
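To make the question more concrete, the kind of job I have in mind would look
roughly like the sketch below; the tab-separated input format and the plain
equality test are only placeholders for whatever pairing scheme and similarity
metric is actually needed.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StringPairCompare {

  // Input: one pair per line, "stringA<TAB>stringB" (hypothetical format).
  public static class CompareMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] pair = value.toString().split("\t", 2);
      if (pair.length < 2) return;                  // skip malformed lines
      boolean equal = pair[0].equals(pair[1]);      // swap in any string metric here
      context.write(value, new Text(equal ? "MATCH" : "DIFFER"));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "string pair comparison");
    job.setJarByClass(StringPairCompare.class);
    job.setMapperClass(CompareMapper.class);
    job.setNumReduceTasks(0);                       // map-only job
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}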
 
Thanks
Rawan
  

Re: Java->native .so->seg fault->core dump file?

2011-01-28 Thread Allen Wittenauer

On Jan 21, 2011, at 12:57 PM, Keith Wiley wrote:
> and I have this in my .bashrc (which I believe should be propagated to the 
> slave nodes):
>   ulimit -c unlimited

.bashrc likely isn't executed at task startup, btw.  Also, you would 
need to have this in whatever account is used to run the tasktracker...

> and in my native code I call rlimit() and write the results, where I see:
>   RLIMIT_CORE:  18446744073709551615 18446744073709551615
> 
> which indicates the "unlimited" setting, but I can't find any core dump files 
> in the node's hadoop directories after the job runs.
> 
> Any ideas what I'm doing wrong?

Which operating system?  On Linux, what is the value of 
/proc/sys/kernel/core_pattern ? On Solaris, what is in /etc/coreadm.conf ?
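If core_pattern is just "core", the dump lands in the task's working directory,
which the TaskTracker cleans up when the task exits, so it is easy to miss.  One
way around that (assuming Linux and root on the slave nodes) is to point cores at
a fixed, world-writable directory and raise the limit for the account that runs
the tasktracker before it starts:

  mkdir -p /var/cores && chmod 1777 /var/cores
  echo '/var/cores/core.%e.%p' > /proc/sys/kernel/core_pattern
  # e.g. in hadoop-env.sh, or in the tasktracker's init script:
  ulimit -c unlimited

Setting keep.failed.task.files=true for the job also keeps the failed task's
directory around for inspection.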



Re: Best way to limit the number of concurrent tasks per job on hadoop 0.20.2

2011-01-28 Thread Allen Wittenauer

On Jan 25, 2011, at 12:48 PM, Renaud Delbru wrote:

> As it seems that the capacity and fair schedulers in hadoop 0.20.2 do not 
> allow a hard upper limit on the number of concurrent tasks, does anybody know any
> other solution to achieve this?

The specific change for capacity scheduler has been backported to 0.20.2 as 
part of https://issues.apache.org/jira/browse/MAPREDUCE-1105 .  Note that 
you'll also need https://issues.apache.org/jira/browse/MAPREDUCE-1160 which 
fixes a logging bug in the JobTracker.  Otherwise your logs will fill up.
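With those applied, the hard cap is set per queue in capacity-scheduler.xml; if I
recall the property name correctly, it is something along the lines of:

  <property>
    <name>mapred.capacity-scheduler.queue.default.maximum-capacity</name>
    <!-- hard upper limit: this queue may use at most 30% of the cluster's slots -->
    <value>30</value>
  </property>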



Re: Draining/Decommisioning a tasktracker

2011-01-28 Thread Harsh J
Moving discussion to the MapReduce-User list: mapreduce-u...@hadoop.apache.org

Reply inline:

On Fri, Jan 28, 2011 at 2:39 PM, rishi pathak  wrote:
> Hi,
>        Is there a way to drain a tasktracker?  What we require is to not
> schedule any more map/reduce tasks onto a tasktracker (mark it offline), while
> leaving the tasks already running on it unaffected.

You could simply shut the TT down. MapReduce was designed with faults
in mind and thus tasks that are running on a particular TaskTracker
can be re-run elsewhere if they failed. Is this not usable in your
case?
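On the node itself that would be something like (assuming the stock daemon
scripts):

  bin/hadoop-daemon.sh stop tasktracker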

-- 
Harsh J
www.harshj.com


Hadoop UK Meetup: 10th February

2011-01-28 Thread Dan
Hey,

The next UK Hadoop user group meetup is going to be on February 10th at
Skills Matter in London. We've got two talks arranged for the evening:

*Overview of Hadoop in 2010 and what's coming up in 2011*
 *Pig & Project Voldemort: Big data loading (15 min)*
 Dan Harvey  is a Datamining Engineer at
Mendeley 

 *Lily - an open source smart content repository running on top of HBase and
friends. (45 mins)*
 Steven Noels  is co-founder and CEO of
 Outerthought 

So it should be an interesting evening, if you would like to attend you need
to sign up on the Skills Matter page http://bit.ly/gF0FN3

Also if you are not yet following us on twitter you
can do so @hug_uk  for more information about
what we're doing in the UK.

Thanks,
Dan


Distributed indexing with Hadoop

2011-01-28 Thread Marco Didonna
Hello everyone,
I am building a hadoop "app" to quickly index a corpus of documents.
This app will accept one or more XML files that contain the corpus.
Each document is made up of several sections: title, authors,
body... these sections are not static and depend on the collection. Here's
a sample glimpse of what the XML input file looks like:


<corpus>
 <document>
  <title>the divine comedy</title>
  <authors>Dante</authors>
  <body>halfway along our life's path...</body>
 </document>

 <document>
  ...
 </document>
</corpus>

I would like to discuss some implementation choices:

- which is the best way to "tell" my hadoop app which sections to expect
between <document> and </document> tags?

- is it more appropriate to implement a record reader that passes the
mapper the whole content of the document tag, or one section at a time? I
was wondering which parser to use, a DOM-like one or a SAX-like
one... any (efficient) library to recommend?

- do you know any library I could use to process text? By text
processing I mean common preprocessing operations like tokenization and
stopword elimination... I was thinking of using Lucene's engine... could it
be a bottleneck?
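Here is a rough sketch of the preprocessing step I mean, assuming a Lucene 3.x-era
StandardAnalyzer (class and method names differ in other versions):

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class TokenizeDemo {
  public static void main(String[] args) throws Exception {
    // StandardAnalyzer lower-cases, tokenizes and drops English stopwords.
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    TokenStream ts = analyzer.tokenStream("body",
        new StringReader("halfway along our life's path..."));
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
      System.out.println(term.term());   // one index term per line
    }
    ts.close();
  }
}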

I am looking forward to reading your opinions.

Thanks,

Marco



Draining/Decommisioning a tasktracker

2011-01-28 Thread rishi pathak
Hi,
Is there a way to drain a tasktracker?  What we require is to not
schedule any more map/reduce tasks onto a tasktracker (mark it offline), while
leaving the tasks already running on it unaffected.



-- 
---
Rishi Pathak
National PARAM Supercomputing Facility
C-DAC, Pune, India