Hadoop + Servlet Problems

2008-08-08 Thread Kylie McCormick
Hi!
I've gotten Hadoop to run a search as I want, but now I'm trying to
add a servlet component to it.

Everything in Hadoop works properly from the command line, but when I set the
job's components from the servlet instead, Hadoop only produces temporary
output files and never completes.

I've looked at Nutch's NutchBean and Cached servlet code for hints, and there
is nothing terribly enlightening there. Does anyone have any information on
running Hadoop under Tomcat/servlets?
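
One difference between launching from bin/hadoop and launching from inside
Tomcat is that the servlet's classpath usually lacks both the cluster's
conf/*.xml files and the job jar, so the JobConf can silently fall back to
local defaults and never ship the map/reduce classes. A minimal sketch of
setting these explicitly (the file paths here are assumptions):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ServletJobLauncher {

    // Hypothetical helper called from a servlet's doGet()/doPost():
    // build the job the same way the command-line driver does, but with
    // the cluster configuration and job jar set explicitly.
    public void runSearchJob(String query) throws IOException {
        JobConf conf = new JobConf();
        // Tomcat will not read $HADOOP_CONF_DIR for us.
        conf.addResource(new Path("/opt/hadoop/conf/hadoop-site.xml"));
        // Without this, the tasks cannot load the map/reduce classes.
        conf.setJar("/opt/multisearch/multisearch-job.jar");
        conf.setJobName("multisearch-" + query);
        // ...set mapper, reducer, input and output paths as in the CLI driver...
        JobClient.runJob(conf);
    }
}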

Thanks,
Kylie

--
The Circle of the Dragon -- unlock the mystery that is the dragon.
http://www.blackdrago.com/index.html

"Light, seeking light, doth the light of light beguile!"
-- William Shakespeare's Love's Labor's Lost


Re: Hadoop also applicable in a web app environment?

2008-08-05 Thread Kylie McCormick
Hello:
I am actually working on this myself in my project Multisearch. The Map()
function uses clients to connect to remote services and collect responses, and
the Reduce() function merges those responses together. I'm also working on
putting this into a servlet so it can be used via Tomcat.
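
For concreteness, a rough sketch of what that map side might look like with
the old mapred API; ServiceClient.search() is a hypothetical stand-in for the
real Axis client call, and the key types and package names are assumptions:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

import edu.arsc.multisearch.ResultSetWritable;
import edu.arsc.multisearch.ServiceWritable;

// Each map task contacts one remote search service and emits that
// service's result set, keyed by the query string.
public class SearchMap extends MapReduceBase
        implements Mapper<Text, ServiceWritable, Text, ResultSetWritable> {

    public void map(Text query, ServiceWritable service,
                    OutputCollector<Text, ResultSetWritable> output,
                    Reporter reporter) throws IOException {
        // Hypothetical client call: send the query to the service's URL
        // and convert the Axis response into Writable form.
        ResultSetWritable results = ServiceClient.search(service, query.toString());
        output.collect(query, results);
    }
}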

I've worked with a number of different web-service frameworks, including
OGSA-DAI and Axis Web Services. My experience so far (not yet fully researched)
is that Hadoop is faster than using those methods alone. Hopefully by the end
of the summer I'll have more concrete results on speed.

The other links posted here are really helpful...

Kylie


On Tue, Aug 5, 2008 at 10:11 AM, Mork0075 <[EMAIL PROTECTED]> wrote:

> Hello,
>
> I just discovered the Hadoop project and it looks really interesting to me.
> As far as I can see, Hadoop is really useful for data-intensive
> computations. Is there a Hadoop scenario for scaling web applications too?
> Normally web applications are not that computation-heavy; the need to
> scale them arises from a growing number of users, each performing simple
> operations in his session, like querying some data from the database.
>
> Distributing this scenario, a Hadoop job would "map" the requests
> to a certain server in the cluster and "reduce" the results. But this is
> what load balancers normally do, so it doesn't solve the scalability problem.
>
> So my question: is there a Hadoop scenario for "non computation heavy but
> heavy load" web applications?
>
> Thanks a lot
>



-- 
The Circle of the Dragon -- unlock the mystery that is the dragon.
http://www.blackdrago.com/index.html

"Light, seeking light, doth the light of light beguile!"
-- William Shakespeare's Love's Labor's Lost


Appending to Output files

2008-08-02 Thread Kylie McCormick
Hi there:
I have built a working Map/Reduce job that generates the results of a search.
I would like to iterate through various queries/searches, with a separate
map/reduce job per query. However, I can't get the output of each run written
to the same output file, which is what I want.

Is it possible to append output to the same file/output as another
Map/Reduce run?
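
HDFS at this point has no append, so a reduce cannot add to a file an earlier
job already closed. One common workaround (a sketch only; the directory layout
is an assumption) is to give each query's job its own output directory and
concatenate the part files afterwards, either with FileUtil.copyMerge on a
single directory or with a small loop like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class MergeRuns {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Assumed layout: each per-query job wrote to output/query-0, output/query-1, ...
        FSDataOutputStream merged = fs.create(new Path("merged/all-results"));
        for (FileStatus jobDir : fs.listStatus(new Path("output"))) {
            if (!jobDir.isDir()) continue;
            for (FileStatus part : fs.listStatus(jobDir.getPath())) {
                if (part.isDir()) continue;                 // skip _logs etc.
                FSDataInputStream in = fs.open(part.getPath());
                IOUtils.copyBytes(in, merged, conf, false); // keep 'merged' open
                in.close();
            }
        }
        merged.close();
    }
}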

Thanks,
Kylie

-- 
The Circle of the Dragon -- unlock the mystery that is the dragon.
http://www.blackdrago.com/index.html

"Light, seeking light, doth the light of light beguile!"
-- William Shakespeare's Love's Labor's Lost


Confused about Reduce functions

2008-07-23 Thread Kylie McCormick
Hello!
I have been getting NullPointerExceptions in my reduce() function, shown in
the code below. (I have removed all the "check for null" if-statements from
the listing, but they are there for every object.)

I based my code on the Word Count example. Essentially, the reduce
function is supposed to rescore the DocumentWritable[] part of each
ResultSetWritable object and then let the OutputCollector have it.

However, I discovered that the iterator is returning empty instances
(i.e., the default values set when ResultSetWritable() is constructed). When I
commented this function out to see what would happen, I got the following
error:

java.lang.RuntimeException: java.lang.NoSuchMethodException:
edu.arsc.multisearch.ResultSetWritable.<init>()
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:80)
at
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:62)
at
org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
at
org.apache.hadoop.mapred.ReduceTask$ValuesIterator.readNextValue(ReduceTask.java:291)
at
org.apache.hadoop.mapred.ReduceTask$ValuesIterator.next(ReduceTask.java:232)
at
org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:311)
at edu.arsc.multisearch.MergeReduce.reduce(MergeReduce.java:38)
at edu.arsc.multisearch.MergeReduce.reduce(MergeReduce.java:18)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:391)
at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:201)
Caused by: java.lang.NoSuchMethodException:
edu.arsc.multisearch.ResultSetWritable.<init>()
at java.lang.Class.getConstructor0(Class.java:2706)
at java.lang.Class.getDeclaredConstructor(Class.java:1985)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:74)
... 9 more

...So I am confused about what the Iterator is doing. Is it generating new
ResultSetWritable objects through reflection? Why? I thought Iterator had
all the values associated with the given Key from the OutputCollector of the
Map function...
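
The NoSuchMethodException above is Hadoop's reflection layer failing to find a
no-argument constructor: for each value in the iterator, ReflectionUtils.newInstance
creates an empty ResultSetWritable and then calls readFields on it to fill it
in from the serialized bytes. If either piece is missing, the values come back
empty. A minimal sketch of what the class needs (the exact fields are
assumptions based on the reduce code below):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class ResultSetWritable implements Writable {

    private DocumentWritable[] results = new DocumentWritable[0];

    // Required: Hadoop constructs the class reflectively during
    // deserialization, so a public no-arg constructor must exist.
    public ResultSetWritable() { }

    public DocumentWritable[] getResults() { return results; }
    public void setResults(DocumentWritable[] results) { this.results = results; }

    public void write(DataOutput out) throws IOException {
        out.writeInt(results.length);
        for (DocumentWritable doc : results) {
            doc.write(out);
        }
    }

    public void readFields(DataInput in) throws IOException {
        int n = in.readInt();
        results = new DocumentWritable[n];
        for (int i = 0; i < n; i++) {
            results[i] = new DocumentWritable();
            results[i].readFields(in);
        }
    }
}

If readFields does not restore every field that write emits, the iterator
hands back exactly the kind of default-initialized instances described above.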

Thanks,
Kylie

CODE---
public void reduce(Text text, Iterator<ResultSetWritable> iterator,
                   OutputCollector<Text, ResultSetWritable> outputCol,
                   Reporter reporter) throws IOException {

    // create the appropriate kind of final set (naive merge for now)
    NaiveMergeSet nms = new NaiveMergeSet();

    // iterate through the values, merging each service's output
    while (iterator.hasNext()) {

        // grab the output result set
        ResultSetWritable rsw = iterator.next();
        DocumentWritable[] docs = rsw.getResults();

        // merge (rescore) them together
        DocumentWritable[] newScores = nms.merge(docs);

        // set the new docs back on the result set
        rsw.setResults(newScores);

        // key the merged set by the query and hand it to the collector
        Text newKey = new Text(Multisearch.getQuery());
        outputCol.collect(newKey, rsw);
    }
}


Hadoop with Axis

2008-07-18 Thread Kylie McCormick
Hello Again:
I'm currently running Hadoop with various client objects in the Map phase.
A given Axis service provides the class of the client to be used; that client
makes the call over the wire to the provided URL and translates the returned
objects into Writable objects.

When I use the code without Hadoop, it runs just fine--objects come back over
the wire. When I run the same code inside Hadoop, the returned object itself
is not null, but the fields inside it are. This is literally the same code.

Could this be a timeout issue, where the connection takes too long and Hadoop
kills the task? The call only takes a few seconds, but I thought I should ask.
Are there other things I should be looking into?
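
On the timeout question: a task that reports no progress for
mapred.task.timeout milliseconds (ten minutes by default) is killed, but that
normally shows up as a failed task attempt rather than as empty fields, so the
Axis jars and deployment files visible to the task JVM may also be worth
checking. Still, a sketch of ruling the timeout out (the 30-minute value is
arbitrary):

import org.apache.hadoop.mapred.JobConf;

public class TimeoutConfig {
    public static void raiseTaskTimeout(JobConf conf) {
        // Default is 600000 ms; a task that neither reads, writes, nor
        // reports status for this long is killed by the framework.
        conf.setLong("mapred.task.timeout", 30 * 60 * 1000L);
    }
}

Calling reporter.progress() (or reporter.setStatus(...)) inside map() around
the remote call also resets that clock.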

Thanks,
Kylie

-- 
The Circle of the Dragon -- unlock the mystery that is the dragon.
http://www.blackdrago.com/index.html

"Light, seeking light, doth the light of light beguile!"
-- William Shakespeare's Love's Labor's Lost


Re: Logging and JobTracker

2008-07-16 Thread Kylie McCormick
I am running Hadoop remotely on a server--I'm not using the web UI, so I don't
know where to look for that... sorry. Most of the work I've done with the
logging has come from the 'Hadoop Cluster Setup' page.

Thanks,
Kylie

On Wed, Jul 16, 2008 at 3:49 PM, Arun C Murthy <[EMAIL PROTECTED]> wrote:

>
> On Jul 16, 2008, at 4:09 PM, Kylie McCormick wrote:
>
>  Hello (Again):
>> I've managed to get Map/Reduce on its feet and running, but the JobClient
>> runs the Map() to 100% then idles. At least, I think it's idling. It's
>> certainly not updating, and I let it run 10+ minutes.
>>
>> I tried to get the history of the job and/or the logs, and I seem to be
>> running into snares.
>>
>
> Does the 'JobHistory' link at the bottom of the web-ui of the JobTracker
> work for you?
>
> Arun
>



-- 
The Circle of the Dragon -- unlock the mystery that is the dragon.
http://www.blackdrago.com/index.html

"Light, seeking light, doth the light of light beguile!"
-- William Shakespeare's Love's Labor's Lost


Logging and JobTracker

2008-07-16 Thread Kylie McCormick
Hello (Again):
I've managed to get Map/Reduce on its feet and running, but the JobClient
runs the Map() to 100% then idles. At least, I think it's idling. It's
certainly not updating, and I let it run 10+ minutes.

I tried to get the history of the job and/or the logs, and I seem to be
running into snares. For example, the command $HADOOP/bin/hadoop job
-history all hadoopOutput  produces the following error:

08/07/16 15:06:15 WARN fs.FileSystem: "localhost:9000" is a deprecated
filesystem name. Use "hdfs://localhost:9000/" instead.
Exception in thread "main" java.io.IOException: Not able to initialize
History viewer
at
org.apache.hadoop.mapred.HistoryViewer.<init>(HistoryViewer.java:92)
at
org.apache.hadoop.mapred.JobClient.viewHistory(JobClient.java:1335)
at org.apache.hadoop.mapred.JobClient.run(JobClient.java:1299)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobClient.main(JobClient.java:1397)

I have the NameNode/TaskTracker set up via the Cluster Setup tutorial. I
have passwordless SSH, but I can't seem to get the history to work.

I've also modified the log4j.properties to include a hadooplog, but no files
are being created.

Any advice on how to get the log files and/or history working?
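
Two small, hedged things to check, based on the warning above and on the
assumption that per-job history is written under the job's output directory in
_logs/history (so it will not exist if the job never finished):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HistoryCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Use the full URI form the deprecation warning asks for.
        conf.set("fs.default.name", "hdfs://localhost:9000/");

        // Assumed location the history viewer reads from.
        FileSystem fs = FileSystem.get(conf);
        Path history = new Path("hadoopOutput/_logs/history");
        System.out.println(history + " exists: " + fs.exists(history));
    }
}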

Thanks,
Kylie


-- 
The Circle of the Dragon -- unlock the mystery that is the dragon.
http://www.blackdrago.com/index.html

"Light, seeking light, doth the light of light beguile!"
-- William Shakespeare's Love's Labor's Lost


Re: Codec Returning null

2008-07-16 Thread Kylie McCormick
Hello Abdul:
Thanks very much! You're right, I need to check the file extension before
using a codec.

Thanks,
Kylie

On Wed, Jul 16, 2008 at 10:53 AM, Abdul Qadeer <[EMAIL PROTECTED]>
wrote:

> Kylie,
>
> If your file is, e.g., gzip compressed, its name should end in .gz. The codec
> factory tries to recognize compressed files by their extensions.
>
> Abdul Qadeer
>
> On Wed, Jul 16, 2008 at 11:47 AM, Kylie McCormick <
> [EMAIL PROTECTED]>
> wrote:
>
> > Hello:
> > My filename extension for the input file is .txt.
> >
> > Thanks,
> > Kylie
> >
> > On Wed, Jul 16, 2008 at 3:21 AM, Devaraj Das <[EMAIL PROTECTED]> wrote:
> >
> > > What does your file name extension look like?
> > >
> > > > -----Original Message-----
> > > > From: Kylie McCormick [mailto:[EMAIL PROTECTED]
> > > > Sent: Wednesday, July 16, 2008 11:18 AM
> > > > To: core-user@hadoop.apache.org
> > > > Subject: Codec Returning null
> > > >
> > > > Hello Again!
> > > >
> > > > I'm running into a NullPointerException from the following
> > > > code (taken from a recordreader). None of the other variables
> > > > are returning null; I've checked them all. I've checked the
> > > > documentation, and I still don't know why the
> > > > CompressionCodecFactory would return a null result. Any ideas?
> > > >
> > > > Thank you!
> > > > Kylie
> > > >
> > > > public ServiceRecordReader(Configuration job, FileSplit split)
> > > > throws java.io.IOException {
> > > >
> > > > final Path file = split.getPath();
> > > > compressionCodecs = new CompressionCodecFactory(job);
> > > > final CompressionCodec codec =
> > > > compressionCodecs.getCodec(file);
> > > >
> > > > //since we're not concerned with splits, we'll
> > > > just open the objects
> > > > FileSystem fs = file.getFileSystem(job);
> > > > FSDataInputStream fileIn = fs.open(split.getPath());
> > > >
> > > >
> > > > if(codec != null) {
> > > >
> > > > in = new
> > > > ServiceReader(codec.createInputStream(fileIn));
> > > >
> > > > } else {
> > > >
> > > > throw new java.io.IOException("codec is null,
> > > > curses... ");
> > > >
> > > > }
> > > >
> > >
> > >
> >
> >
> > --
> > The Circle of the Dragon -- unlock the mystery that is the dragon.
> > http://www.blackdrago.com/index.html
> >
> > "Light, seeking light, doth the light of light beguile!"
> > -- William Shakespeare's Love's Labor's Lost
> >
>



-- 
The Circle of the Dragon -- unlock the mystery that is the dragon.
http://www.blackdrago.com/index.html

"Light, seeking light, doth the light of light beguile!"
-- William Shakespeare's Love's Labor's Lost


Re: Codec Returning null

2008-07-16 Thread Kylie McCormick
Hello:
My filename extension for the input file is .txt.

Thanks,
Kylie

On Wed, Jul 16, 2008 at 3:21 AM, Devaraj Das <[EMAIL PROTECTED]> wrote:

> What does your file name extension look like?
>
> > -----Original Message-----
> > From: Kylie McCormick [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, July 16, 2008 11:18 AM
> > To: core-user@hadoop.apache.org
> > Subject: Codec Returning null
> >
> > Hello Again!
> >
> > I'm running into a NullPointerException from the following
> > code (taken from a recordreader). None of the other variables
> > are returning null; I've checked them all. I've checked the
> > documentation, and I still don't know why the
> > CompressionCodecFactory would return a null result. Any ideas?
> >
> > Thank you!
> > Kylie
> >
> > public ServiceRecordReader(Configuration job, FileSplit split)
> > throws java.io.IOException {
> >
> > final Path file = split.getPath();
> > compressionCodecs = new CompressionCodecFactory(job);
> > final CompressionCodec codec =
> > compressionCodecs.getCodec(file);
> >
> > //since we're not concerned with splits, we'll
> > just open the objects
> > FileSystem fs = file.getFileSystem(job);
> > FSDataInputStream fileIn = fs.open(split.getPath());
> >
> >
> > if(codec != null) {
> >
> > in = new
> > ServiceReader(codec.createInputStream(fileIn));
> >
> > } else {
> >
> > throw new java.io.IOException("codec is null,
> > curses... ");
> >
> > }
> >
>
>


-- 
The Circle of the Dragon -- unlock the mystery that is the dragon.
http://www.blackdrago.com/index.html

"Light, seeking light, doth the light of light beguile!"
-- William Shakespeare's Love's Labor's Lost


Codec Returning null

2008-07-15 Thread Kylie McCormick
Hello Again!

I'm running into a NullPointerException from the following code (taken from
a RecordReader). None of the other variables are returning null; I've checked
them all. I've checked the documentation, and I still don't know why the
CompressionCodecFactory would return a null result. Any ideas?

Thank you!
Kylie

public ServiceRecordReader(Configuration job, FileSplit split)
        throws java.io.IOException {

    final Path file = split.getPath();
    compressionCodecs = new CompressionCodecFactory(job);
    final CompressionCodec codec = compressionCodecs.getCodec(file);

    // since we're not concerned with splits, we'll just open the objects
    FileSystem fs = file.getFileSystem(job);
    FSDataInputStream fileIn = fs.open(split.getPath());

    if (codec != null) {

        in = new ServiceReader(codec.createInputStream(fileIn));

    } else {

        throw new java.io.IOException("codec is null, curses... ");

    }
}
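
As noted in the reply above, CompressionCodecFactory maps file extensions to
codecs, so getCodec() returning null for a plain .txt file is expected rather
than an error. The usual fix, mirroring what the built-in record readers do,
is to fall back to the raw stream (a sketch, assuming ServiceReader can wrap
any InputStream):

final Path file = split.getPath();
compressionCodecs = new CompressionCodecFactory(job);
final CompressionCodec codec = compressionCodecs.getCodec(file);

FileSystem fs = file.getFileSystem(job);
FSDataInputStream fileIn = fs.open(file);

if (codec != null) {
    // compressed input (e.g. a .gz file): wrap the decompressing stream
    in = new ServiceReader(codec.createInputStream(fileIn));
} else {
    // no codec is registered for this extension (.txt): read the raw stream
    in = new ServiceReader(fileIn);
}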


Re: Writable readFields and write functions

2008-07-14 Thread Kylie McCormick
Hello Chris:
Thanks for the prompt reply!

So, to conclude from your note:
-- Presently, my RecordReader converts XML strings from a file into MyWritable
objects
-- When readFields is called, the RecordReader should provide the next
MyWritable object, if there is one
-- When write is called, MyWriter should write the objects out

The RecordReader is record-oriented, but both readFields and write are
byte-oriented... so, for Hadoop to be happy, I need to bridge my
record-oriented view with the byte-oriented one.

Is this correct? I just want to make sure I have the design properly down
before I tinker more with the code.

Thanks!
Kylie


On Mon, Jul 14, 2008 at 3:43 PM, Chris Douglas <[EMAIL PROTECTED]>
wrote:

> It's easiest to consider write as a function that converts your record to
> bytes and readFields as a function restoring your record from bytes. So it
> should be the case that:
>
> MyWritable i = new MyWritable();
> i.initWithData(some_data);
> i.write(byte_stream);
> ...
> MyWritable j = new MyWritable();
> j.initWithData(some_other_data); // (1)
> j.readFields(byte_stream);
> assert i.equals(j);
>
> Note that the assert should be true whether or not (1) is present, i.e. a
> call to readFields should be deterministic and without hysteresis (it should
> make no difference whether the Writable is newly created or if it formerly
> held some other state). readFields must also consume the entire record, so
> for example, if write outputs three integers, readFields must consume three
> integers. Variable-sized Writables are common, but any optional/variably
> sized fields must be encoded to satisfy the preceding.
>
> So if your MyBigWritable record held two ints (integerA, integerB) and a
> MyWritable (my_writable), its write method might look like:
>
> out.writeInt(integerA);
> out.writeInt(integerB);
> my_writable.write(out);
>
> and readFields would restore:
>
> integerA = in.readInt();
> integerB = in.readInt();
> my_writable.readFields(in);
>
> There are many examples in the source of simple, compound, and
> variably-sized Writables.
>
> Your RecordReader is responsible for providing a key and value to your map.
> Most generic formats rely on Writables or another mode of serialization to
> write and restore objects to/from structured byte sequences, but less
> generic InputFormats will create Writables from byte streams.
> TextInputFormat, for example, will create Text objects from newline-delimited
> files, though Text objects are not, themselves, encoded in the file. In
> contrast, a SequenceFile storing the same data will encode the Text object
> (using its write method) and will restore that object as encoded.
>
> The critical difference is that the framework needs to convert your record
> to a byte stream at various points- hence the Writable interface- while you
> may be more particular about the format from which you consume and the
> format to which you need your output to conform. Note that you can elect to
> use a different serialization framework if you prefer.
>
> If your data structure will be used as a key (implementing
> WritableComparable), it's strongly recommended that you implement a
> RawComparator, which can compare the serialized bytes directly without
> deserializing both arguments. -C
>
>
> On Jul 14, 2008, at 3:39 PM, Kylie McCormick wrote:
>
>  Hi There!
>> I'm currently working on code for my own Writable object (called
>> ServiceWritable) and I've been working off LongWritable for this one. I
>> was
>> wondering, however, about the following two functions:
>>
>> public void readFields(java.io.DataInput in)
>> and
>> public void write(java.io.DataOutput out)
>>
>> I have my own RecordReader object to read in the complex type Service, and
>> I
>> also have my own Writer object to write my complex type ResultSet for
>> output. In LongWritable, the code is very simple:
>>
>> value = in.readLong()
>> and
>> out.writeLong(value);
>>
>> Since I am dealing with more complex objects, the ObjectWritable won't
>> help
>> me. I'm a little confused about the interaction between my RecordReader
>> and Writer objects and these two functions--there does not seem to be any
>> direct connection. Can someone help me out here?
>>
>> Thanks,
>> Kylie
>>
>
>


-- 
The Circle of the Dragon -- unlock the mystery that is the dragon.
http://www.blackdrago.com/index.html

"Light, seeking light, doth the light of light beguile!"
-- William Shakespeare's Love's Labor's Lost


Writable readFields and write functions

2008-07-14 Thread Kylie McCormick
Hi There!
I'm currently working on code for my own Writable object (called
ServiceWritable) and I've been working off LongWritable for this one. I was
wondering, however, about the following two functions:

public void readFields(java.io.DataInput in)
and
public void write(java.io.DataOutput out)

I have my own RecordReader object to read in the complex type Service, and I
also have my own Writer object to write my complex type ResultSet for
output. In LongWritable, the code is very simple:

value = in.readLong()
and
out.writeLong(value);

Since I am dealing with more complex objects, ObjectWritable won't help me.
I'm a little confused about the interaction between my RecordReader and Writer
objects and these two functions--there does not seem to be any direct
connection. Can someone help me out here?
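
A minimal sketch of such a compound Writable, built from simpler Writables so
that write() and readFields() stay symmetric; the three fields (name, URL,
client class) are taken from the RecordReader thread elsewhere in this archive
and may not match the real ServiceWritable:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class ServiceWritable implements Writable {

    private Text name = new Text();
    private Text url = new Text();
    private Text clientClass = new Text();

    public ServiceWritable() { }   // required for reflective construction

    public void set(String name, String url, String clientClass) {
        this.name.set(name);
        this.url.set(url);
        this.clientClass.set(clientClass);
    }

    // write() and readFields() must emit and consume the fields in the
    // same order so the byte stream round-trips exactly.
    public void write(DataOutput out) throws IOException {
        name.write(out);
        url.write(out);
        clientClass.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        name.readFields(in);
        url.readFields(in);
        clientClass.readFields(in);
    }
}

The RecordReader's only job is then to parse a record out of the input file
and hand such an object to map(); the framework calls write()/readFields()
whenever it needs to move the object around as bytes.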

Thanks,
Kylie


FileInput / RecordReader question

2008-07-11 Thread Kylie McCormick
Hello Again:
I'm currently working on the input side (InputFormat and InputSplit) of my
Hadoop job. There is some helpful information in the Map/Reduce tutorial, but
I'm having some trouble with the coding end of it.

I would like to have a file that lists each of the end points I want to
contact, with the following information also listed: URL, client class, and
name. Right now, I see I need to use a RecordReader, since logical splitting
of the file could cause larger entries to be cut in half or shorter entries
to be bunched together. As of right now, StreamXmlRecordReader is the closest
existing class to what I want to use.

(StreamXmlRecordReader information @
http://hadoop.apache.org/core/docs/r0.17.0/api/org/apache/hadoop/streaming/StreamXmlRecordReader.html
)

However, I'm not certain it will provide the functionality I need: I would
need to extract the three strings to generate the appropriate value. Is there
another tutorial on InputFormat/InputSplit for Hadoop? I am attempting to
write my own RecordReader, but I'm uncertain whether that is actually
necessary and, if it is, what the code should look like.
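
If a one-service-per-line file (name, URL, client class, tab-separated) is
acceptable instead of XML, a much smaller alternative to StreamXmlRecordReader
is to reuse the standard line reader and simply forbid splits; the class and
field layout below are assumptions:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// One endpoint per line: "name<TAB>url<TAB>clientClass".
// Disabling splits guarantees no entry is ever cut in half; the map can
// then recover the three strings with value.toString().split("\t").
public class ServiceInputFormat extends TextInputFormat {
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;   // read each endpoint file intact in one map task
    }
}

A hand-written RecordReader only becomes necessary if the input really has to
stay XML.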

Thanks,
Kylie


Re: Hadoop Architecture Question: Distributed Information Retrieval

2008-07-10 Thread Kylie McCormick
Thanks for the replies! If I use a single reducer, would it be possible for
there to be only one object (FinalSet) into which the reduce function merges
everything? If not, I could restructure the program, but I was hoping to keep
it as intact as possible.
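
For what it's worth, forcing a single reducer is a one-line job setting, so an
existing FinalSet-style merge could stay as it is; a sketch:

import org.apache.hadoop.mapred.JobConf;

public class SingleReducer {
    public static void configure(JobConf conf) {
        // With exactly one reduce task, every key group passes through the
        // same reducer instance, so a single merged FinalSet can be built
        // there (for example, finished off in the reducer's close() method).
        conf.setNumReduceTasks(1);
    }
}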

Yes, I am aware of Nutch, and I've been using some of the documentation to
help with my new design. It's quite exciting! I'm hoping to have another
Java package with which to continue work on large TREC tracks.

My work with OGSA-DAI can be seen @
http://snowy.arsc.alaska.edu:8080/edu/arsc/multisearch/ if you're
interested, and by the end of the summer I hope to have a write up that
discusses the differences (esp. performance) between the two. The system
from last year was used on this year's TREC collection (with 1,000 services
and 10,000 queries) and performed fairly well. I'm hoping Hadoop will make
more sense and run faster.


Thank you,
Kylie

On Thu, Jul 10, 2008 at 1:47 PM, Steve Loughran <[EMAIL PROTECTED]> wrote:

> Kylie McCormick wrote:
>
>> Hello!
>> My name is Kylie McCormick, and I'm currently working on creating a
>> distributed information retrieval package with Hadoop based on my previous
>> work with other middlewares like OGSA-DAI. I've been developing a design
>> that works with the structures of the other systems I have put together
>> for
>> distributed IR.
>>
>
>
> It would be interesting to see your write up of the different experiences
> that OGSA-DAI's storage model offers versus that of hadoop.
>
> -steve
>



-- 
The Circle of the Dragon -- unlock the mystery that is the dragon.
http://www.blackdrago.com/index.html

"Light, seeking light, doth the light of light beguile!"
-- William Shakespeare's Love's Labor's Lost