Re: Hadoop supports RDBMS?

2008-06-25 Thread Enis Soztutar
Yes, there is a way to use a DBMS over JDBC. The feature is not released 
yet, but you can try it out and give us valuable feedback.
You can find the patch and the JIRA issue at:
https://issues.apache.org/jira/browse/HADOOP-2536



Lakshmi Narayanan wrote:

Has anyone tried using any RDBMS with Hadoop?  If the data is stored in
the database, is there any way we can use MapReduce with the database
instead of HDFS?

  


Is Hadoop the thing for us ?

2008-06-25 Thread Igor Nikolic

Hello list

We will be getting access to a cluster soon, and I was wondering whether 
I should use Hadoop, or whether I am better off with the usual batch 
schedulers such as ProActive. I am not a CS/CE person, and from 
reading the website I cannot get a sense of whether Hadoop is for me.


A little background:
We have a relatively large agent-based simulation (20+ MB jar) that 
needs to be swept across very large parameter spaces. Agents communicate 
only within the simulation, so there is no interprocess communication. 
The parameter vector is at most 20 elements long, the simulation may take 
5-10 minutes on a normal desktop, and it might return a few MB of raw 
data. We need 10k-100K runs, more if possible.




Thanks for advice, even a short yes/no is welcome

Greetings
Igor

--
ir. Igor Nikolic
PhD Researcher
Section Energy & Industry
Faculty of Technology, Policy and Management
Delft University of Technology, The Netherlands

Tel: +31152781135
Email: [EMAIL PROTECTED]
Web: http://www.igornikolic.com
wiki server: http://wiki.tudelft.nl



Re: realtime hadoop

2008-06-25 Thread Daniel
2008/6/24 Konstantin Shvachko <[EMAIL PROTECTED]>:

> > Also HDFS might be critical since to access your data you need to close
> the file
>
> Not anymore. Since 0.16 files are readable while being written to.

Does this mean I can open a file as both map input and reduce output, so
I can update the files instead of creating new ones?
Also, if I want to query the records, should I rather use HBase instead
of HDFS - say, if we have a large amount of data stored as (key, value) pairs?

Thanks.

>
>
> >> it as fast as possible. I need to be able to maintain some guaranteed
> >> max. processing time, for example under 3 minutes.
>
> It looks like you do not need very strict guarantees.
> I think you can use hdfs as a data-storage.
> Don't know what kind of data-processing you do, but I agree with Stefan
> that map-reduce is designed for batch tasks rather than for real-time
> processing.
>
>
>
>
> Stefan Groschupf wrote:
>
>> Hadoop might be the wrong technology for you.
>> Map Reduce is a batch processing mechanism. Also HDFS might be critical
>> since to access your data you need to close the file - which means you might
>> have many small files, a situation where HDFS is not very strong (the
>> namespace is held in memory).
>> Hbase might be an interesting tool for you, also zookeeper if you want to
>> do something home grown...
>>
>>
>>
>> On Jun 23, 2008, at 11:31 PM, Vadim Zaliva wrote:
>>
>>  Hi!
>>>
>>> I am considering using Hadoop for (almost) realtime data processing. I
>>> have data coming every second and I would like to use hadoop cluster
>>> to process
>>> it as fast as possible. I need to be able to maintain some guaranteed
>>> max. processing time, for example under 3 minutes.
>>>
>>> Does anybody have experience with using Hadoop in such manner? I will
>>> appreciate if you can share your experience or give me pointers
>>> to some articles or pages on the subject.
>>>
>>> Vadim
>>>
>>>
>> ~~~
>> 101tec Inc.
>> Menlo Park, California, USA
>> http://www.101tec.com
>>
>>
>>
>>


running hadoop remotely from inside a java program

2008-06-25 Thread Deyaa Adranale

Hello,

I am developing a tool that will do some analysis tasks using Hadoop 
map/reduce on a cluster.


The tool's user interface will run on a client Windows system and 
should run the analysis tasks as map/reduce jobs on a Hadoop cluster 
(configured by the user).


My question is how to run Hadoop jobs on a cluster from a client machine 
(other than the master) from inside Java code.
I know that I should have a Hadoop installation on the client that 
is configured to point to the cluster's master, but I am not sure 
how to do that.


Another requirement for my tool is to copy files from the local client 
file system to the HDFS on the cluster. I am also not sure whether I can 
access the cluster's HDFS from a client machine using Java code.


I hope somebody can give me some hints.

thanks,

Deyaa


Re: Is Hadoop the thing for us ?

2008-06-25 Thread John Martyniak
I am new to Hadoop, so take this information with a grain of salt.
The power of Hadoop is breaking big problems down into small pieces and
spreading them across many (thousands of) machines, in effect creating a
massively parallel processing engine.

But in order to take advantage of that functionality you must write your
application against the Hadoop frameworks.

So if I understand your dilemma correctly, I do not think that Hadoop is
for you, unless you want to rewrite your app to take advantage of it.  And
I suspect that if you have access to a traditional cluster, that will be a
better alternative for you.

Hope that this helps some.

-John


On Wed, Jun 25, 2008 at 7:33 AM, Igor Nikolic <[EMAIL PROTECTED]> wrote:

> Hello list
>
> We will be getting access to a cluster soon, and I was wondering whether
> this I should use Hadoop ?  Or am I better of with the usual batch
> schedulers such as ProActive etc ? I am not a CS/CE person, and from reading
> the website I can not get a sense of whether hadoop is for me.
>
> A little background:
> We have a  relatively large agent based simulation ( 20+ MB jar) that needs
> to be swept across very large parameter spaces. Agents communicate only
> within the simulation, so there is no interprocess communication. The
> parameter vector is max 20 long , the simulation may take 5-10 minutes on a
> normal desktop and it might return a few mb of raw data. We need 10k-100K
> runs, more if possible.
>
>
>
> Thanks for advice, even a short yes/no is welcome
>
> Greetings
> Igor
>
> --
> ir. Igor Nikolic
> PhD Researcher
> Section Energy & Industry
> Faculty of Technology, Policy and Management
> Delft University of Technology, The Netherlands
>
> Tel: +31152781135
> Email: [EMAIL PROTECTED]
> Web: http://www.igornikolic.com
> wiki server: http://wiki.tudelft.nl
>
>


-- 
John Martyniak
Before Dawn Solutions, Inc.
9457 S. University Blvd. #266
Highlands Ranch, CO 80126
o: 1-877-499-1562 x707 (Toll Free)
c: 303-522-1756
e: [EMAIL PROTECTED]
w: http://www.beforedawn.com


Re: Is Hadoop the thing for us ?

2008-06-25 Thread Igor Nikolic

Thank you for your comment; it did confirm my suspicions.

You framed the problem correctly. I will probably invest a bit of time 
studying the framework anyway, to see if a rewrite is interesting, since 
we hit scaling limitations on our agent scheduler framework. Our main 
computational load is the massive amount of agent reasoning (think 
JbossRules) and inter-agent communication (they need to sell and buy 
stuff to each other), so I am not sure it is at all possible to break 
it down into small tasks, especially if this needs to happen across CPUs; 
the latency is going to kill us.


Thanks
igor

John Martyniak wrote:

I am new to Hadoop.  So take this information with a grain of salt.
But the power of Hadoop is breaking down big problems into small pieces and
spreading it across many (thousands) of machines, in effect creating a
massively parallel processing engine.

But in order to take advantage of that functionality you must write your
application to take advantage of it, using the Hadoop frameworks.

So if I understand  your dilemma correctly.  I do not think that Hadoop is
for you, unless you want to re-write your app to take advantage of it.  And
I suspect that if you have access to a traditional cluster, that will be a
better alternative for you.

Hope that this helps some.

-John


On Wed, Jun 25, 2008 at 7:33 AM, Igor Nikolic <[EMAIL PROTECTED]> wrote:

  

Hello list

We will be getting access to a cluster soon, and I was wondering whether
this I should use Hadoop ?  Or am I better of with the usual batch
schedulers such as ProActive etc ? I am not a CS/CE person, and from reading
the website I can not get a sense of whether hadoop is for me.

A little background:
We have a  relatively large agent based simulation ( 20+ MB jar) that needs
to be swept across very large parameter spaces. Agents communicate only
within the simulation, so there is no interprocess communication. The
parameter vector is max 20 long , the simulation may take 5-10 minutes on a
normal desktop and it might return a few mb of raw data. We need 10k-100K
runs, more if possible.



Thanks for advice, even a short yes/no is welcome

Greetings
Igor

--
ir. Igor Nikolic
PhD Researcher
Section Energy & Industry
Faculty of Technology, Policy and Management
Delft University of Technology, The Netherlands

Tel: +31152781135
Email: [EMAIL PROTECTED]
Web: http://www.igornikolic.com
wiki server: http://wiki.tudelft.nl






  



--
ir. Igor Nikolic
PhD Researcher
Section Energy & Industry
Faculty of Technology, Policy and Management
Delft University of Technology, The Netherlands

Tel: +31152781135
Email: [EMAIL PROTECTED]
Web: http://www.igornikolic.com
wiki server: http://wiki.tudelft.nl



Re: Is Hadoop the thing for us ?

2008-06-25 Thread Deyaa Adranale

Here is an informal description of the map/reduce model:

In the map/reduce paradigm the input data usually consists of a 
(very large) number of records.
The paradigm assumes that you want to do some computation on each input 
record separately (without simultaneous access to other records) to 
produce some result (the map function). Then the results from all the 
records are grouped (based on a key), and each group of results can be 
further processed together (the reduce function) to produce a final 
result for each group.

Also, global parameters can be made visible to the map function.

So you have to try to fit your problem into this model; if that is 
possible, you can rewrite your program or use Hadoop's native libraries.

regards,

Deyaa


Igor Nikolic wrote:

Thank you for your comment, it did confirm my suspicions.

You framed the problem correctly. I will probably invest a bit of time 
studying the framework anyway, to see if a rewrite is interesting, 
since we hit scaling limitations on our Agent scheduler framework. Our 
main computational load is the massive amount of agent reasoning ( 
think JbossRules) and  inter-agent communication ( they need to sell 
and buy stuff to each other)  so I am not sure if it is at all 
possible to break it down to small tasks, specially if this needs to 
happen across CPU's, the latency is going to kill us.


Thanks
igor

John Martyniak wrote:

I am new to Hadoop.  So take this information with a grain of salt.
But the power of Hadoop is breaking down big problems into small 
pieces and

spreading it across many (thousands) of machines, in effect creating a
massively parallel processing engine.

But in order to take advantage of that functionality you must write your
application to take advantage of it, using the Hadoop frameworks.

So if I understand  your dilemma correctly.  I do not think that 
Hadoop is
for you, unless you want to re-write your app to take advantage of 
it.  And
I suspect that if you have access to a traditional cluster, that will 
be a

better alternative for you.

Hope that this helps some.

-John


On Wed, Jun 25, 2008 at 7:33 AM, Igor Nikolic <[EMAIL PROTECTED]> 
wrote:


 

Hello list

We will be getting access to a cluster soon, and I was wondering 
whether

this I should use Hadoop ?  Or am I better of with the usual batch
schedulers such as ProActive etc ? I am not a CS/CE person, and from 
reading

the website I can not get a sense of whether hadoop is for me.

A little background:
We have a  relatively large agent based simulation ( 20+ MB jar) 
that needs

to be swept across very large parameter spaces. Agents communicate only
within the simulation, so there is no interprocess communication. The
parameter vector is max 20 long , the simulation may take 5-10 
minutes on a
normal desktop and it might return a few mb of raw data. We need 
10k-100K

runs, more if possible.



Thanks for advice, even a short yes/no is welcome

Greetings
Igor

--
ir. Igor Nikolic
PhD Researcher
Section Energy & Industry
Faculty of Technology, Policy and Management
Delft University of Technology, The Netherlands

Tel: +31152781135
Email: [EMAIL PROTECTED]
Web: http://www.igornikolic.com
wiki server: http://wiki.tudelft.nl






  





Re: running hadoop remotely from inside a java program

2008-06-25 Thread Steve Loughran

Deyaa Adranale wrote:

hello,

i am developing a tool that will do some analysis tasks using hadoop 
map/reduce on a cluster


the tool user interfaces will be run on the client windows system and 
should run the analysis tasks as map/reduce jobs on a hadoop cluster 
(configured by the user).


my question is how to run hadoop jobs on a cluster from a client machine 
(other than the master) from inside java code.
I know that I should have a hadoop installation on the client that 
should be configured to point to the cluster's master, but I am not sure 
how to do it.


you need the hadoop JARs; your client can then talk directly to a 
cluster provided

 * it is not too far away, network-wise
 * the client hadoop configuration is in sync with the servers

You just create a JobClient instance and submit a job through it.
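
A minimal sketch of that, assuming the client's hadoop-site.xml already names
the cluster (host names, ports, the job name and the elided job setup are
illustrative):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class RemoteSubmit {
  public static void main(String[] args) throws Exception {
    // JobConf reads hadoop-site.xml from the client's classpath, so that file
    // must point at the cluster, e.g. fs.default.name=hdfs://master:9000 and
    // mapred.job.tracker=master:9001 (hosts and ports are illustrative).
    JobConf conf = new JobConf(RemoteSubmit.class);
    conf.setJobName("analysis-job");
    // ...set mapper/reducer classes, input/output paths and formats here...

    // Submits from the client machine and blocks until the job finishes.
    RunningJob job = JobClient.runJob(conf);
    System.out.println("successful = " + job.isSuccessful());
  }
}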

another necessity for my tool would be to copy files from the local 
client file system to the HDFS on the cluster. I am also not sure if I 
can access the HDFS of the cluster from a client machine using java code.


Yes, look in the FsShell and FileUtils classes
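
For the file copy itself, a minimal sketch using the plain FileSystem API
(the paths are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToDfs {
  public static void main(String[] args) throws Exception {
    // Picks up fs.default.name from the client-side Hadoop configuration,
    // so this talks to the cluster's HDFS rather than the local disk.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Illustrative paths: a local Windows file copied into an HDFS home dir.
    fs.copyFromLocalFile(new Path("C:/data/input.csv"),
                         new Path("/user/deyaa/input.csv"));
  }
}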

* None of this stuff is documented outside the source and javadocs, so you 
will need to rummage around the source to work out what to do.
* Pull log4j.properties and commons-logging.properties from the Hadoop 
JARs if you want to route the Hadoop classes' logging through your own 
chosen logger.




Re: How Mappers function and solution for my input file problem?

2008-06-25 Thread Ted Dunning
On Tue, Jun 24, 2008 at 10:31 PM, Amar Kamat <[EMAIL PROTECTED]> wrote:

> Xuan Dzung Doan wrote:
>
>>
>>
>> The level of parallelism of a job, with respect to mappers, is largely the
>> number of map tasks spawned, which is equal to the number of InputSplits.
>> But within each InputSplit, there may be many records (many input key-value
>> pairs), each is processed by one separate call to the map() method. So are
>> these calls within one single map task also executed in parallel by the
>> framework?
>>
>>
>>
> Afaik no.


This might be a bit misunderstood.

Each task node does run a few map tasks and each of these could be
considered a "single map task executed in parallel".

It is definitely true that you have more than one map task, even per task
node.  But it is also true that you get many calls to map per map task.


Re: Hadoop supports RDBMS?

2008-06-25 Thread Ted Dunning
You can also just write an input format, but you should limit the
parallelism.  Hadoop clusters, even small ones, can completely flatten a
database server very easily.
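
A minimal sketch of the parallelism side of that, using only the stock
JobConf API (the input-format wiring is left as a comment, since the
HADOOP-2536 code is still only a patch at this point):

import org.apache.hadoop.mapred.JobConf;

public class DbJobSetup {
  public static JobConf limitDbParallelism(JobConf conf) {
    // conf.setInputFormat(YourJdbcInputFormat.class);  // hypothetical class
    // setNumMapTasks() is a hint to the framework; keeping it small means only
    // a handful of map tasks open JDBC connections against the database at once.
    conf.setNumMapTasks(4);
    return conf;
  }
}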

On Wed, Jun 25, 2008 at 12:34 AM, Enis Soztutar <[EMAIL PROTECTED]>
wrote:

> Yes, there is a way to use DBMS over JDBC. The feature is not realeased
> yet, but you can try it out, and give valuable feedback to us.
> You can find the patch and the jira issue at :
> https://issues.apache.org/jira/browse/HADOOP-2536
>
>
>
> Lakshmi Narayanan wrote:
>
>> Has anyone tried using any RDBMS with the hadoop?  If the data is stored
>> in
>> the database is there any way we can use the mapreduce with the database
>> instead of the HDFS?
>>
>>
>>
>


-- 
ted


Global Variables via DFS

2008-06-25 Thread javaxtreme

Hello all,
I am having a bit of a problem with a seemingly simple problem. I would like
to have some global variable which is a byte array that all of my map tasks
have access to. The best way that I currently know of to do this is to have
a file sitting on the DFS and load that into each map task (note: the global
variable is very small ~20kB). My problem is that I can't seem to load any
file from the Hadoop DFS into my program via the API. I know that the
DistributedFileSystem class has to come into play, but for the life of me I
can't get it to work. 

I noticed there is an initialize() method within the DistributedFileSystem
class, and I thought that I would need to call that, however I'm unsure what
the URI parameter ought to be. I tried "localhost:50070" which stalled the
system and threw a connectionTimeout error. I went on to just attempt to
call DistributedFileSystem.open() but again my program failed this time with
a NullPointerException. I'm assuming that is stemming from the fact that my
DFS object is not "initialized".

Does anyone have any information on how exactly one programmatically goes
about loading in a file from the DFS? I would greatly appreciate any help.

Cheers,
Sean M. Arietta
-- 
View this message in context: 
http://www.nabble.com/Global-Variables-via-DFS-tp18115661p18115661.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Is Hadoop the thing for us ?

2008-06-25 Thread Ted Dunning
This can work pretty well if you just use the list of parameter settings as
input.  The map task would run your simulation and output the data.  You may
not even need a reducer, although parallelized summary of output might be
very nice to have.  Because each of your sims takes a long time to run,
hadoop should be very efficient.

The only change you should need to make is to write a map class that
launches your simulation and copies whatever output you want into HDFS
instead of the local file system.  If you can get your sim to write to HDFS
directly that would be better.
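
A rough sketch of such a map class, against the classic
org.apache.hadoop.mapred API (the comma-separated parameter format and the
runSimulation() call are placeholders, not existing code):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// One input line = one parameter vector (up to 20 comma-separated values).
public class SimSweepMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text paramLine,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String[] params = paramLine.toString().split(",");

    reporter.setStatus("running sweep point at offset " + offset);
    String rawResult = runSimulation(params);  // placeholder for the real model
    reporter.progress();                       // keep a 5-10 minute task alive

    // Emit (parameter vector, result); a reducer could summarise these later.
    out.collect(paramLine, new Text(rawResult));
  }

  private String runSimulation(String[] params) {
    // Placeholder: invoke the existing simulation jar in-process or via
    // ProcessBuilder, and return (or write to HDFS) whatever it produces.
    return "raw-output-placeholder";
  }
}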

On Wed, Jun 25, 2008 at 4:33 AM, Igor Nikolic <[EMAIL PROTECTED]> wrote:

> Hello list
>
> We will be getting access to a cluster soon, and I was wondering whether
> this I should use Hadoop ?  Or am I better of with the usual batch
> schedulers such as ProActive etc ? I am not a CS/CE person, and from reading
> the website I can not get a sense of whether hadoop is for me.
>
> A little background:
> We have a  relatively large agent based simulation ( 20+ MB jar) that needs
> to be swept across very large parameter spaces. Agents communicate only
> within the simulation, so there is no interprocess communication. The
> parameter vector is max 20 long , the simulation may take 5-10 minutes on a
> normal desktop and it might return a few mb of raw data. We need 10k-100K
> runs, more if possible.
>
>
>
> Thanks for advice, even a short yes/no is welcome
>
> Greetings
> Igor
>
> --
> ir. Igor Nikolic
> PhD Researcher
> Section Energy & Industry
> Faculty of Technology, Policy and Management
> Delft University of Technology, The Netherlands
>
> Tel: +31152781135
> Email: [EMAIL PROTECTED]
> Web: http://www.igornikolic.com
> wiki server: http://wiki.tudelft.nl
>
>


-- 
ted


Re: How Mappers function and solution for my input file problem?

2008-06-25 Thread Xuan Dzung Doan



- Original Message 
From: Ted Dunning <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Wednesday, June 25, 2008 8:42:43 AM
Subject: Re: How Mappers function and solution for my input file problem?

On Tue, Jun 24, 2008 at 10:31 PM, Amar Kamat <[EMAIL PROTECTED]> wrote:

> Xuan Dzung Doan wrote:
>
>>
>>
>> The level of parallelism of a job, with respect to mappers, is largely the
>> number of map tasks spawned, which is equal to the number of InputSplits.
>> But within each InputSplit, there may be many records (many input key-value
>> pairs), each is processed by one separate call to the map() method. So are
>> these calls within one single map task also executed in parallel by the
>> framework?
>>
>>
>>
> Afaik no.


>>>This might be a bit misunderstood.

>>>Each task node does run a few map tasks and each of these could be
>>>considered a "single map task executed in parallel".

>>>It is definitely true that you have more than one map task, even per task
>>>node.  But it is also true that you get many calls to map per map task.

OK. So are these many calls to map() per map task also executed in parallel
(i.e., are the calls within one map task executed independently)?



  

Re: Global Variables via DFS

2008-06-25 Thread Steve Loughran

javaxtreme wrote:

Hello all,
I am having a bit of a problem with a seemingly simple problem. I would like
to have some global variable which is a byte array that all of my map tasks
have access to. The best way that I currently know of to do this is to have
a file sitting on the DFS and load that into each map task (note: the global
variable is very small ~20kB). My problem is that I can't seem to load any
file from the Hadoop DFS into my program via the API. I know that the
DistributedFileSystem class has to come into play, but for the life of me I
can't get it to work. 


I noticed there is an initialize() method within the DistributedFileSystem
class, and I thought that I would need to call that, however I'm unsure what
the URI parameter ought to be. I tried "localhost:50070" which stalled the
system and threw a connectionTimeout error. I went on to just attempt to
call DistributedFileSystem.open() but again my program failed this time with
a NullPointerException. I'm assuming that is stemming from he fact that my
DFS object is not "initialized".

Does anyone have any information on how exactly one programatically goes
about loading in a file from the DFS? I would greatly appreciate any help.



If the data changes, this sounds more like the kind of data that a 
distributed hash table or tuple space should be looking after...sharing 
facts between nodes


1. what is the rate of change of the data?
2. what are your requirements for consistency?

If the data is static, then yes, a shared file works.  Here are my code 
fragments for working with one. You grab the URI from the configuration, 
then initialise the DFS with both the URI and the configuration.


public static DistributedFileSystem createFileSystem(ManagedConfiguration conf)
        throws SmartFrogRuntimeException {
    String filesystemURL = conf.get(HadoopConfiguration.FS_DEFAULT_NAME);
    URI uri = null;
    try {
        uri = new URI(filesystemURL);
    } catch (URISyntaxException e) {
        throw (SmartFrogRuntimeException) SmartFrogRuntimeException
                .forward(ERROR_INVALID_FILESYSTEM_URI + filesystemURL, e);
    }
    DistributedFileSystem dfs = new DistributedFileSystem();
    try {
        dfs.initialize(uri, conf);
    } catch (IOException e) {
        throw (SmartFrogRuntimeException) SmartFrogRuntimeException
                .forward(ERROR_FAILED_TO_INITIALISE_FILESYSTEM, e);
    }
    return dfs;
}

As to what URLs work, try  "localhost:9000"; this works on machines 
where I've brought a DFS up on that port. Use netstat to verify your 
chosen port is live.


Re: Global Variables via DFS

2008-06-25 Thread lohit
As Steve mentioned, you could open an HDFS file from within your map/reduce 
task.
Also, instead of using DistributedFileSystem, you would actually use FileSystem. 
This is what I do.


FileSystem fs = FileSystem.get(new Configuration());
FSDataInputStream file = fs.open(new Path("/user/foo/jambajuice"));


Thanks,
Lohit
- Original Message 
From: Steve Loughran <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Wednesday, June 25, 2008 9:15:55 AM
Subject: Re: Global Variables via DFS

javaxtreme wrote:
> Hello all,
> I am having a bit of a problem with a seemingly simple problem. I would like
> to have some global variable which is a byte array that all of my map tasks
> have access to. The best way that I currently know of to do this is to have
> a file sitting on the DFS and load that into each map task (note: the global
> variable is very small ~20kB). My problem is that I can't seem to load any
> file from the Hadoop DFS into my program via the API. I know that the
> DistributedFileSystem class has to come into play, but for the life of me I
> can't get it to work. 
> 
> I noticed there is an initialize() method within the DistributedFileSystem
> class, and I thought that I would need to call that, however I'm unsure what
> the URI parameter ought to be. I tried "localhost:50070" which stalled the
> system and threw a connectionTimeout error. I went on to just attempt to
> call DistributedFileSystem.open() but again my program failed this time with
> a NullPointerException. I'm assuming that is stemming from he fact that my
> DFS object is not "initialized".
> 
> Does anyone have any information on how exactly one programatically goes
> about loading in a file from the DFS? I would greatly appreciate any help.
> 

If the data changes, this sounds more like the kind of data that a 
distributed hash table or tuple space should be looking after...sharing 
facts between nodes

1. what is the rate of change of the data?
2. what are your requirements for consistency?

If the data is static, then yes, a shared file works.  Here's my code 
fragments to work with one. You grab the URI from the configuration, 
then initialise the DFS with both the URI and the configuration.

 public static DistributedFileSystem 
createFileSystem(ManagedConfiguration conf) throws 
SmartFrogRuntimeException {
 String filesystemURL = 
conf.get(HadoopConfiguration.FS_DEFAULT_NAME);
 URI uri = null;
 try {
 uri = new URI(filesystemURL);
 } catch (URISyntaxException e) {
 throw (SmartFrogRuntimeException) SmartFrogRuntimeException
 .forward(ERROR_INVALID_FILESYSTEM_URI + filesystemURL,
 e);
 }
 DistributedFileSystem dfs = new DistributedFileSystem();
 try {
 dfs.initialize(uri, conf);
 } catch (IOException e) {
 throw (SmartFrogRuntimeException) SmartFrogRuntimeException
 .forward(ERROR_FAILED_TO_INITIALISE_FILESYSTEM, e);

 }
 return dfs;
 }

As to what URLs work, try  "localhost:9000"; this works on machines 
where I've brought a DFS up on that port. Use netstat to verify your 
chosen port is live.



Re: realtime hadoop

2008-06-25 Thread Konstantin Shvachko



Daniel wrote:

Also HDFS might be critical since to access your data you need to close

the file

Not anymore. Since 0.16 files are readable while being written to.


Does this mean i can open some file as map input and the reduce output ? So
i can update the files instead of creating new ones.


No, files are still write-once in HDFS; you cannot modify a file after it is 
closed.
But if it is not closed you can still write more data into it, and other 
clients will be able to read this new data.


Also if i want to do query in the records, should i rather use Hbase instead
of HDFS? - say if we have large size of data stored as (key, value).


HDFS has a file system API; there is no notion of a record in it, just files 
and bytes.
Depending on how you define a record you can use different systems, including 
HBase and Pig.
These two work well for table-like data collections.
Or you can write your own MapReduce job to process a big key-value dataset.

Regards,
--Konstantin


Thanks.




it as fast as possible. I need to be able to maintain some guaranteed
max. processing time, for example under 3 minutes.

It looks like you do not need very strict guarantees.
I think you can use hdfs as a data-storage.
Don't know what kind of data-processing you do, but I agree with Stefan
that map-reduce is designed for batch tasks rather than for real-time
processing.




Stefan Groschupf wrote:


Hadoop might be the wrong technology for you.
Map Reduce is a batch processing mechanism. Also HDFS might be critical
since to access your data you need to close the file - which means you might
have many small files, a situation where HDFS is not very strong (the
namespace is held in memory).
Hbase might be an interesting tool for you, also zookeeper if you want to
do something home grown...



On Jun 23, 2008, at 11:31 PM, Vadim Zaliva wrote:

 Hi!

I am considering using Hadoop for (almost) realtime data processing. I
have data coming every second and I would like to use hadoop cluster
to process
it as fast as possible. I need to be able to maintain some guaranteed
max. processing time, for example under 3 minutes.

Does anybody have experience with using Hadoop in such manner? I will
appreciate if you can share your experience or give me pointers
to some articles or pages on the subject.

Vadim



~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com








Compiling Word Count in C++ : Hadoop Pipes

2008-06-25 Thread Sandy
Hi,

I am currently trying to get Hadoop Pipes working. I am following the
instructions at the hadoop wiki, where it provides code for a C++
implementation of Word Count (located here:
http://wiki.apache.org/hadoop/C++WordCount?highlight=%28C%2B%2B%29)

I am having some trouble parsing the instructions. What should the file
containing the new word count program be called? "examples"?

If I were to call the file "example" and type in the following:
$ ant -Dcompile.c++=yes example
Buildfile: build.xml

BUILD FAILED
Target `example' does not exist in this project.

Total time: 0 seconds


If I try and compile with "examples" as stated on the wiki, I get:
$ ant -Dcompile.c++=yes examples
Buildfile: build.xml

clover.setup:

clover.info:
 [echo]
 [echo]  Clover not found. Code coverage reports disabled.
 [echo]

clover:

init:
[touch] Creating /tmp/null810513231
   [delete] Deleting: /tmp/null810513231
 [exec] svn: '.' is not a working copy
 [exec] svn: '.' is not a working copy

record-parser:

compile-rcc-compiler:
[javac] Compiling 29 source files to
/home/sjm/Desktop/hadoop-0.16.4/build/classes

BUILD FAILED
/home/sjm/Desktop/hadoop-0.16.4/build.xml:241: Unable to find a javac
compiler;
com.sun.tools.javac.Main is not on the classpath.
Perhaps JAVA_HOME does not point to the JDK

Total time: 1 second



I am a bit puzzled by this. Originally I got the error that tools.jar was
not found, because it was looking for it under
/usr/java/jre1.6.0_06/lib/tools.jar. There is a tools.jar under
/usr/java/jdk1.6.0_06/lib/tools.jar. If I copy this file over to the jre
folder, that message goes away and it's replaced with the above message.

My hadoop-env.sh file looks something like:
# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.  Required.
# export JAVA_HOME=$JAVA_HOME


and my .bash_profile file has this line in it:
JAVA_HOME=/usr/java/jre1.6.0_06; export JAVA_HOME
export PATH


Furthermore, if I go to the command line and type in javac -version, I get:
$ javac -version
javac 1.6.0_06


I also had no problem getting through the hadoop word count map reduce
tutorial in Java. It was able to find my java compiler fine. Could someone
please point me in the right direction? Also, since it is an sh file, should
that export line in hadoop-env.sh really start with a hash sign?

Thank you in advance for your assistance.

-SM


Re: Compiling Word Count in C++ : Hadoop Pipes

2008-06-25 Thread lohit
ant -Dcompile.c++=yes compile-c++-examples
I picked it up from build.xml

Thanks,
Lohit

- Original Message 
From: Sandy <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Wednesday, June 25, 2008 10:44:20 AM
Subject: Compiling Word Count in C++ : Hadoop Pipes

Hi,

I am currently trying to get Hadoop Pipes working. I am following the
instructions at the hadoop wiki, where it provides code for a C++
implementation of Word Count (located here:
http://wiki.apache.org/hadoop/C++WordCount?highlight=%28C%2B%2B%29)

I am having some trouble parsing the instructions. What should the file
containing the new word count program be called? "examples"?

If I were to call the file "example" and type in the following:
$ ant -Dcompile.c++=yes example
Buildfile: build.xml

BUILD FAILED
Target `example' does not exist in this project.

Total time: 0 seconds


If I try and compile with "examples" as stated on the wiki, I get:
$ ant -Dcompile.c++=yes examples
Buildfile: build.xml

clover.setup:

clover.info:
 [echo]
 [echo]  Clover not found. Code coverage reports disabled.
 [echo]

clover:

init:
[touch] Creating /tmp/null810513231
   [delete] Deleting: /tmp/null810513231
 [exec] svn: '.' is not a working copy
 [exec] svn: '.' is not a working copy

record-parser:

compile-rcc-compiler:
[javac] Compiling 29 source files to
/home/sjm/Desktop/hadoop-0.16.4/build/classes

BUILD FAILED
/home/sjm/Desktop/hadoop-0.16.4/build.xml:241: Unable to find a javac
compiler;
com.sun.tools.javac.Main is not on the classpath.
Perhaps JAVA_HOME does not point to the JDK

Total time: 1 second



I am a bit puzzled by this. Originally I got the error that tools.jar was
not found, because it was looking for it under
/usr/java/jre1.6.0_06/lib/tools.jar . There is a tools.jar under
/usr/java/jdk1.6.0_06/lib/tools.jar. If I copy this file over to the jre
folder, that message goes away and its replaced with the above message.

My hadoop-env.sh file looks something like:
# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.  Required.
# export JAVA_HOME=$JAVA_HOME


and my .bash_profile file has this line in it:
JAVA_HOME=/usr/java/jre1.6.0_06; export JAVA_HOME
export PATH


Furthermore, if I go to the command line and type in javac -version, I get:
$ javac -version
javac 1.6.0_06


I also had no problem getting through the hadoop word count map reduce
tutorial in Java. It was able to find my java compiler fine. Could someone
please point me in the right direction? Also, since it is an sh file, should
that export line in hadoop-env.sh really start with a hash sign?

Thank you in advance for your assistance.

-SM



Re: Hadoop Meetup @ Berlin

2008-06-25 Thread Isabel Drost
On Tuesday 17 June 2008, Isabel Drost wrote:
> I am happy to announce the first German Hadoop Meetup in Berlin. We will
> meet at 5 p.m. MESZ next Tuesday (24th of June) at the newthinking store in
> Berlin Mitte.

Some preliminary feedback I gathered myself at the meeting: There were about 
20 people interested in Hadoop, Mahout and Co. The slides will be online as 
soon as I get them. The newthinking store offered to publish a blog post on 
the meeting, attach the slides and additional information. 

Feedback I got concerning Mahout: In the wiki we should indicate the status of 
the algorithms somewhere (are they only planned, available as JIRA patch or 
part of the main distribution). In addition we should move our collection of 
related books, articles and the like to the new wiki page. I guess I should 
be able to fix that in the coming days.

One visitor was especially interested in the Mahout and UIMA setting, as he is 
currently working on getting UIMA on EC2 - if I remember correctly. I hope to 
see a talk on his work at the next meeting.

We agreed to meet again in about two months - for several people the trip 
to Berlin was pretty long. I will announce the next meeting on the related 
user mailing lists soon, so stay tuned.

Isabel

-- 
There will be big changes for you but you will be happy.




Re: Global Variables via DFS

2008-06-25 Thread Hairong Kuang
If the data is static, you may ship the file with your job jar and then read
the data locally in the beginning of the map in the configure() method.
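
A minimal sketch of that approach (the resource name "/lookup.bin" and the
class name are illustrative):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

// The ~20kB array is packaged into the job jar and loaded once per task
// in configure(), so every map() call can read it from memory.
public class LookupMapperBase extends MapReduceBase {
  protected byte[] lookup;

  public void configure(JobConf job) {
    try {
      InputStream in = getClass().getResourceAsStream("/lookup.bin");
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      byte[] chunk = new byte[4096];
      int n;
      while ((n = in.read(chunk)) != -1) {
        buf.write(chunk, 0, n);
      }
      in.close();
      lookup = buf.toByteArray();
    } catch (IOException e) {
      throw new RuntimeException("could not read bundled lookup data", e);
    }
  }
}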

Hairong


On 6/25/08 9:43 AM, "lohit" <[EMAIL PROTECTED]> wrote:

> As steve mentioned you could open up a HDFS file from within your map/reduce
> task.
> Also instead of using DistributedFileSystem, you would actually use
> FileSystem. This is what I do.
> 
> 
> FileSystem fs = FileSystem.get( new Configuration() );
> FSDataInputStream file = fs.open(new Path("/user/foo/jambajuice");
> 
> 
> Thanks,
> Lohit
> - Original Message 
> From: Steve Loughran <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Wednesday, June 25, 2008 9:15:55 AM
> Subject: Re: Global Variables via DFS
> 
> javaxtreme wrote:
>> Hello all,
>> I am having a bit of a problem with a seemingly simple problem. I would like
>> to have some global variable which is a byte array that all of my map tasks
>> have access to. The best way that I currently know of to do this is to have
>> a file sitting on the DFS and load that into each map task (note: the global
>> variable is very small ~20kB). My problem is that I can't seem to load any
>> file from the Hadoop DFS into my program via the API. I know that the
>> DistributedFileSystem class has to come into play, but for the life of me I
>> can't get it to work.
>> 
>> I noticed there is an initialize() method within the DistributedFileSystem
>> class, and I thought that I would need to call that, however I'm unsure what
>> the URI parameter ought to be. I tried "localhost:50070" which stalled the
>> system and threw a connectionTimeout error. I went on to just attempt to
>> call DistributedFileSystem.open() but again my program failed this time with
>> a NullPointerException. I'm assuming that is stemming from he fact that my
>> DFS object is not "initialized".
>> 
>> Does anyone have any information on how exactly one programatically goes
>> about loading in a file from the DFS? I would greatly appreciate any help.
>> 
> 
> If the data changes, this sounds more like the kind of data that a
> distributed hash table or tuple space should be looking after...sharing
> facts between nodes
> 
> 1. what is the rate of change of the data?
> 2. what are your requirements for consistency?
> 
> If the data is static, then yes, a shared file works.  Here's my code
> fragments to work with one. You grab the URI from the configuration,
> then initialise the DFS with both the URI and the configuration.
> 
>  public static DistributedFileSystem
> createFileSystem(ManagedConfiguration conf) throws
> SmartFrogRuntimeException {
>  String filesystemURL =
> conf.get(HadoopConfiguration.FS_DEFAULT_NAME);
>  URI uri = null;
>  try {
>  uri = new URI(filesystemURL);
>  } catch (URISyntaxException e) {
>  throw (SmartFrogRuntimeException) SmartFrogRuntimeException
>  .forward(ERROR_INVALID_FILESYSTEM_URI + filesystemURL,
>  e);
>  }
>  DistributedFileSystem dfs = new DistributedFileSystem();
>  try {
>  dfs.initialize(uri, conf);
>  } catch (IOException e) {
>  throw (SmartFrogRuntimeException) SmartFrogRuntimeException
>  .forward(ERROR_FAILED_TO_INITIALISE_FILESYSTEM, e);
> 
>  }
>  return dfs;
>  }
> 
> As to what URLs work, try  "localhost:9000"; this works on machines
> where I've brought a DFS up on that port. Use netstat to verify your
> chosen port is live.
> 



MultipleOutputFormat example

2008-06-25 Thread slitz
Hello,
I need the reduce to output to different files depending on the key. After
reading some JIRA entries and some previous threads of the mailing list, I
think that the MultipleTextOutputFormat class would fit my needs; the
problem is that I can't find any example of how to use it.

Could someone please show me a quick example of how to use this class or
MultipleOutputFormat subclasses in general? I'm somewhat lost...

slitz
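
For what it's worth, a minimal sketch of subclassing MultipleTextOutputFormat
against the 0.17-era org.apache.hadoop.mapred.lib API (the Text key/value
types and the key-per-directory layout are illustrative):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Route each record to a file named after its key: output for key "foo"
// ends up under <output dir>/foo/part-00000, key "bar" under bar/part-00000.
public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
  @Override
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    // 'name' is the default leaf file name for this reduce task (e.g. part-00000).
    return key.toString() + "/" + name;
  }
}

If I read the API correctly, registering this with
conf.setOutputFormat(KeyBasedOutputFormat.class) should send each key's
records to its own subdirectory under the job output directory.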


Job History Logging Location

2008-06-25 Thread Lincoln Ritter
Greetings,

I'm trying to get a handle on job history logging.  According to the
documentation in 'hadoop-default.xml', the
'hadoop.job.history.user.location' property determines where job history logs
are written.  If not specified, these logs go into
'<output dir>/_logs/history'.  This can cause problems with
applications that don't know about this convention.  It would also be
nicer in my opinion to keep logs and data separate.

It seems to me that a nice way to handle this would be to put logs in
'/logs/<job-id>/history' or something similar.

Can this be done?  Is there a need for the "job-id" folder?  If this
can't be done, are there alternatives that work well?

-lincoln

--
lincolnritter.com
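
One possible workaround, if I read hadoop-default.xml correctly, is to point
the property somewhere else explicitly; the target path below is illustrative,
and nothing here creates a per-job-id folder for you:

import org.apache.hadoop.mapred.JobConf;

public class HistoryLocation {
  // Redirect this job's history files away from <output dir>/_logs/history.
  public static void redirectHistory(JobConf conf, String historyDir) {
    conf.set("hadoop.job.history.user.location", historyDir);
  }
}

// usage (path is illustrative): redirectHistory(conf, "/logs/history");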



Re: Compiling Word Count in C++ : Hadoop Pipes

2008-06-25 Thread Sandy
I'm not sure how this answers my question. Could you be more specific? I
am still getting the above error when I type this command in. To summarize:
With my current setup, this occurs:
$ ant -Dcompile.c++=yes compile-c++-examples
Unable to locate tools.jar. Expected to find it in
/usr/java/jre1.6.0_06/lib/tools.jar
Buildfile: build.xml

init:
[touch] Creating /tmp/null2044923713
   [delete] Deleting: /tmp/null2044923713
 [exec] svn: '.' is not a working copy
 [exec] svn: '.' is not a working copy

check-c++-makefiles:

create-c++-examples-pipes-makefile:
[mkdir] Created dir:
/home/sjm/Desktop/hadoop-0.16.4/build/c++-build/Linux-i386-32/examples/pipes

BUILD FAILED
/home/sjm/Desktop/hadoop-0.16.4/build.xml:987: Execute failed:
java.io.IOException: Cannot run program
"/home/sjm/Desktop/hadoop-0.16.4/src/examples/pipes/configure" (in directory
"/home/sjm/Desktop/hadoop-0.16.4/build/c++-build/Linux-i386-32/examples/pipes"):
java.io.IOException: error=13, Permission denied

Total time: 1 second

-

If I copy the tools.jar file located in my jdk's lib folder, i get the error
message I printed in the previous message.

Could someone please tell me what I am doing wrong?

Thanks,

-SM

On Wed, Jun 25, 2008 at 1:53 PM, lohit <[EMAIL PROTECTED]> wrote:

> ant -Dcompile.c++=yes compile-c++-examples
> I picked it up from build.xml
>
> Thanks,
> Lohit
>
> - Original Message 
> From: Sandy <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Wednesday, June 25, 2008 10:44:20 AM
> Subject: Compiling Word Count in C++ : Hadoop Pipes
>
> Hi,
>
> I am currently trying to get Hadoop Pipes working. I am following the
> instructions at the hadoop wiki, where it provides code for a C++
> implementation of Word Count (located here:
> http://wiki.apache.org/hadoop/C++WordCount?highlight=%28C%2B%2B%29)
>
> I am having some trouble parsing the instructions. What should the file
> containing the new word count program be called? "examples"?
>
> If I were to call the file "example" and type in the following:
> $ ant -Dcompile.c++=yes example
> Buildfile: build.xml
>
> BUILD FAILED
> Target `example' does not exist in this project.
>
> Total time: 0 seconds
>
>
> If I try and compile with "examples" as stated on the wiki, I get:
> $ ant -Dcompile.c++=yes examples
> Buildfile: build.xml
>
> clover.setup:
>
> clover.info:
> [echo]
> [echo]  Clover not found. Code coverage reports disabled.
> [echo]
>
> clover:
>
> init:
>[touch] Creating /tmp/null810513231
>   [delete] Deleting: /tmp/null810513231
> [exec] svn: '.' is not a working copy
> [exec] svn: '.' is not a working copy
>
> record-parser:
>
> compile-rcc-compiler:
>[javac] Compiling 29 source files to
> /home/sjm/Desktop/hadoop-0.16.4/build/classes
>
> BUILD FAILED
> /home/sjm/Desktop/hadoop-0.16.4/build.xml:241: Unable to find a javac
> compiler;
> com.sun.tools.javac.Main is not on the classpath.
> Perhaps JAVA_HOME does not point to the JDK
>
> Total time: 1 second
>
>
>
> I am a bit puzzled by this. Originally I got the error that tools.jar was
> not found, because it was looking for it under
> /usr/java/jre1.6.0_06/lib/tools.jar . There is a tools.jar under
> /usr/java/jdk1.6.0_06/lib/tools.jar. If I copy this file over to the jre
> folder, that message goes away and its replaced with the above message.
>
> My hadoop-env.sh file looks something like:
> # Set Hadoop-specific environment variables here.
>
> # The only required environment variable is JAVA_HOME.  All others are
> # optional.  When running a distributed configuration it is best to
> # set JAVA_HOME in this file, so that it is correctly defined on
> # remote nodes.
>
> # The java implementation to use.  Required.
> # export JAVA_HOME=$JAVA_HOME
>
>
> and my .bash_profile file has this line in it:
> JAVA_HOME=/usr/java/jre1.6.0_06; export JAVA_HOME
> export PATH
>
>
> Furthermore, if I go to the command line and type in javac -version, I get:
> $ javac -version
> javac 1.6.0_06
>
>
> I also had no problem getting through the hadoop word count map reduce
> tutorial in Java. It was able to find my java compiler fine. Could someone
> please point me in the right direction? Also, since it is an sh file,
> should
> that export line in hadoop-env.sh really start with a hash sign?
>
> Thank you in advance for your assistance.
>
> -SM
>
>


Re: MultipleOutputFormat example

2008-06-25 Thread montag

Hi,

  You should check out the MultipleOutputs thread and patch of HADOOP-3149
(https://issues.apache.org/jira/browse/HADOOP-3149).  There are
some relevant and useful code snippets that address the issue of splitting
output to multiple files within the discussion as well as in the patch
documentation.  I found implementing this patch easier than dealing with
MultipleTextOutputFormat.

Cheers,
Mike

  

slitz wrote:
> 
> Hello,
> I need the reduce to output to different files depending on the key, after
> reading some jira entries and some previous threads of the mailing list i
> think that the MultipleTextOutputFormat class would fit my needs, the
> problem is that i can't find any example of how to use it.
> 
> Could someone please show me a quick example of how to use this class or
> MultipleOutputFormat subclasses in general? i'm somewhat lost...
> 
> slitz
> 
> 

-- 
View this message in context: 
http://www.nabble.com/MultipleOutputFormat-example-tp18118780p18119478.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



RE: Compiling Word Count in C++ : Hadoop Pipes

2008-06-25 Thread Zheng Shao
You need to set JAVA_HOME to your jdk directory (instead of jre).
This is required by ant.

Zheng
-Original Message-
From: Sandy [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 25, 2008 11:22 AM
To: core-user@hadoop.apache.org
Subject: Re: Compiling Word Count in C++ : Hadoop Pipes

I'm not sure how this answers my question. Could you be more specific? I
still am getting the above error when I type this commmand in. To
summarize:

With my current setup, this occurs:
$ ant -Dcompile.c++=yes compile-c++-examples
Unable to locate tools.jar. Expected to find it in
/usr/java/jre1.6.0_06/lib/tools.jar
Buildfile: build.xml

init:
[touch] Creating /tmp/null2044923713
   [delete] Deleting: /tmp/null2044923713
 [exec] svn: '.' is not a working copy
 [exec] svn: '.' is not a working copy

check-c++-makefiles:

create-c++-examples-pipes-makefile:
[mkdir] Created dir:
/home/sjm/Desktop/hadoop-0.16.4/build/c++-build/Linux-i386-32/examples/p
ipes

BUILD FAILED
/home/sjm/Desktop/hadoop-0.16.4/build.xml:987: Execute failed:
java.io.IOException: Cannot run program
"/home/sjm/Desktop/hadoop-0.16.4/src/examples/pipes/configure" (in
directory
"/home/sjm/Desktop/hadoop-0.16.4/build/c++-build/Linux-i386-32/examples/
pipes"):
java.io.IOException: error=13, Permission denied

Total time: 1 second

-

If I copy the tools.jar file located in my jdk's lib folder, i get the
error
message I printed in the previous message.

Could someone please tell me or suggest to me what I am doing wrong?

Thanks,

-SM

On Wed, Jun 25, 2008 at 1:53 PM, lohit <[EMAIL PROTECTED]> wrote:

> ant -Dcompile.c++=yes compile-c++-examples
> I picked it up from build.xml
>
> Thanks,
> Lohit
>
> - Original Message 
> From: Sandy <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Wednesday, June 25, 2008 10:44:20 AM
> Subject: Compiling Word Count in C++ : Hadoop Pipes
>
> Hi,
>
> I am currently trying to get Hadoop Pipes working. I am following the
> instructions at the hadoop wiki, where it provides code for a C++
> implementation of Word Count (located here:
> http://wiki.apache.org/hadoop/C++WordCount?highlight=%28C%2B%2B%29)
>
> I am having some trouble parsing the instructions. What should the
file
> containing the new word count program be called? "examples"?
>
> If I were to call the file "example" and type in the following:
> $ ant -Dcompile.c++=yes example
> Buildfile: build.xml
>
> BUILD FAILED
> Target `example' does not exist in this project.
>
> Total time: 0 seconds
>
>
> If I try and compile with "examples" as stated on the wiki, I get:
> $ ant -Dcompile.c++=yes examples
> Buildfile: build.xml
>
> clover.setup:
>
> clover.info:
> [echo]
> [echo]  Clover not found. Code coverage reports disabled.
> [echo]
>
> clover:
>
> init:
>[touch] Creating /tmp/null810513231
>   [delete] Deleting: /tmp/null810513231
> [exec] svn: '.' is not a working copy
> [exec] svn: '.' is not a working copy
>
> record-parser:
>
> compile-rcc-compiler:
>[javac] Compiling 29 source files to
> /home/sjm/Desktop/hadoop-0.16.4/build/classes
>
> BUILD FAILED
> /home/sjm/Desktop/hadoop-0.16.4/build.xml:241: Unable to find a javac
> compiler;
> com.sun.tools.javac.Main is not on the classpath.
> Perhaps JAVA_HOME does not point to the JDK
>
> Total time: 1 second
>
>
>
> I am a bit puzzled by this. Originally I got the error that tools.jar
was
> not found, because it was looking for it under
> /usr/java/jre1.6.0_06/lib/tools.jar . There is a tools.jar under
> /usr/java/jdk1.6.0_06/lib/tools.jar. If I copy this file over to the
jre
> folder, that message goes away and its replaced with the above
message.
>
> My hadoop-env.sh file looks something like:
> # Set Hadoop-specific environment variables here.
>
> # The only required environment variable is JAVA_HOME.  All others are
> # optional.  When running a distributed configuration it is best to
> # set JAVA_HOME in this file, so that it is correctly defined on
> # remote nodes.
>
> # The java implementation to use.  Required.
> # export JAVA_HOME=$JAVA_HOME
>
>
> and my .bash_profile file has this line in it:
> JAVA_HOME=/usr/java/jre1.6.0_06; export JAVA_HOME
> export PATH
>
>
> Furthermore, if I go to the command line and type in javac -version, I
get:
> $ javac -version
> javac 1.6.0_06
>
>
> I also had no problem getting through the hadoop word count map reduce
> tutorial in Java. It was able to find my java compiler fine. Could
someone
> please point me in the right direction? Also, since it is an sh file,
> should
> that export line in hadoop-env.sh really start with a hash sign?
>
> Thank you in advance for your assistance.
>
> -SM
>
>


Re: Global Variables via DFS

2008-06-25 Thread Sean Arietta

Thanks very much for your help.

I ended up figuring out a solution a few hours ago. Here is what I did:

Path file = new Path("/user/seanarietta/testDB_candidate");
FileSystem fs = file.getFileSystem(conf);

FSDataInputStream data_in = fs.open(file, 1392);

That was in the configure method of the map task and allowed me to load in
the static byte array. I'm not sure if this was suggested by one of you, but
thank you very much for responding. Hopefully this will help someone with a
similar problem.

Cheers,
Sean


Sean Arietta wrote:
> 
> Hello all,
> I am having a bit of a problem with a seemingly simple problem. I would
> like to have some global variable which is a byte array that all of my map
> tasks have access to. The best way that I currently know of to do this is
> to have a file sitting on the DFS and load that into each map task (note:
> the global variable is very small ~20kB). My problem is that I can't seem
> to load any file from the Hadoop DFS into my program via the API. I know
> that the DistributedFileSystem class has to come into play, but for the
> life of me I can't get it to work. 
> 
> I noticed there is an initialize() method within the DistributedFileSystem
> class, and I thought that I would need to call that, however I'm unsure
> what the URI parameter ought to be. I tried "localhost:50070" which
> stalled the system and threw a connectionTimeout error. I went on to just
> attempt to call DistributedFileSystem.open() but again my program failed
> this time with a NullPointerException. I'm assuming that is stemming from
> he fact that my DFS object is not "initialized".
> 
> Does anyone have any information on how exactly one programatically goes
> about loading in a file from the DFS? I would greatly appreciate any help.
> 
> Cheers,
> Sean M. Arietta
> 

-- 
View this message in context: 
http://www.nabble.com/Global-Variables-via-DFS-tp18115661p18119996.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Compiling Word Count in C++ : Hadoop Pipes

2008-06-25 Thread Sandy
I am under the impression that it already is. As I posted in my original
e-mail, here are the declarations in hadoop-env.sh and my .bash_profile

My hadoop-env.sh file looks something like:
# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.  Required.
# export JAVA_HOME=$JAVA_HOME


and my .bash_profile file has this line in it:
JAVA_HOME=/usr/java/jre1.6.0_06; export JAVA_HOME
export PATH


Is there a different way I'm supposed to set the JAVA_HOME environment
variable?

Much thanks,

-SM
On Wed, Jun 25, 2008 at 3:22 PM, Zheng Shao <[EMAIL PROTECTED]> wrote:

> You need to set JAVA_HOME to your jdk directory (instead of jre).
> This is required by ant.
>
> Zheng
> -Original Message-
> From: Sandy [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, June 25, 2008 11:22 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Compiling Word Count in C++ : Hadoop Pipes
>
> I'm not sure how this answers my question. Could you be more specific? I
> still am getting the above error when I type this commmand in. To
> summarize:
>
> With my current setup, this occurs:
> $ ant -Dcompile.c++=yes compile-c++-examples
> Unable to locate tools.jar. Expected to find it in
> /usr/java/jre1.6.0_06/lib/tools.jar
> Buildfile: build.xml
>
> init:
>[touch] Creating /tmp/null2044923713
>   [delete] Deleting: /tmp/null2044923713
> [exec] svn: '.' is not a working copy
> [exec] svn: '.' is not a working copy
>
> check-c++-makefiles:
>
> create-c++-examples-pipes-makefile:
>[mkdir] Created dir:
> /home/sjm/Desktop/hadoop-0.16.4/build/c++-build/Linux-i386-32/examples/p
> ipes
>
> BUILD FAILED
> /home/sjm/Desktop/hadoop-0.16.4/build.xml:987: Execute failed:
> java.io.IOException: Cannot run program
> "/home/sjm/Desktop/hadoop-0.16.4/src/examples/pipes/configure" (in
> directory
> "/home/sjm/Desktop/hadoop-0.16.4/build/c++-build/Linux-i386-32/examples/
> pipes"):
> java.io.IOException: error=13, Permission denied
>
> Total time: 1 second
>
> -
>
> If I copy the tools.jar file located in my jdk's lib folder, i get the
> error
> message I printed in the previous message.
>
> Could someone please tell me or suggest to me what I am doing wrong?
>
> Thanks,
>
> -SM
>
> On Wed, Jun 25, 2008 at 1:53 PM, lohit <[EMAIL PROTECTED]> wrote:
>
> > ant -Dcompile.c++=yes compile-c++-examples
> > I picked it up from build.xml
> >
> > Thanks,
> > Lohit
> >
> > - Original Message 
> > From: Sandy <[EMAIL PROTECTED]>
> > To: core-user@hadoop.apache.org
> > Sent: Wednesday, June 25, 2008 10:44:20 AM
> > Subject: Compiling Word Count in C++ : Hadoop Pipes
> >
> > Hi,
> >
> > I am currently trying to get Hadoop Pipes working. I am following the
> > instructions at the hadoop wiki, where it provides code for a C++
> > implementation of Word Count (located here:
> > http://wiki.apache.org/hadoop/C++WordCount?highlight=%28C%2B%2B%29)
> >
> > I am having some trouble parsing the instructions. What should the
> file
> > containing the new word count program be called? "examples"?
> >
> > If I were to call the file "example" and type in the following:
> > $ ant -Dcompile.c++=yes example
> > Buildfile: build.xml
> >
> > BUILD FAILED
> > Target `example' does not exist in this project.
> >
> > Total time: 0 seconds
> >
> >
> > If I try and compile with "examples" as stated on the wiki, I get:
> > $ ant -Dcompile.c++=yes examples
> > Buildfile: build.xml
> >
> > clover.setup:
> >
> > clover.info:
> > [echo]
> > [echo]  Clover not found. Code coverage reports disabled.
> > [echo]
> >
> > clover:
> >
> > init:
> >[touch] Creating /tmp/null810513231
> >   [delete] Deleting: /tmp/null810513231
> > [exec] svn: '.' is not a working copy
> > [exec] svn: '.' is not a working copy
> >
> > record-parser:
> >
> > compile-rcc-compiler:
> >[javac] Compiling 29 source files to
> > /home/sjm/Desktop/hadoop-0.16.4/build/classes
> >
> > BUILD FAILED
> > /home/sjm/Desktop/hadoop-0.16.4/build.xml:241: Unable to find a javac
> > compiler;
> > com.sun.tools.javac.Main is not on the classpath.
> > Perhaps JAVA_HOME does not point to the JDK
> >
> > Total time: 1 second
> >
> >
> >
> > I am a bit puzzled by this. Originally I got the error that tools.jar was
> > not found, because it was looking for it under
> > /usr/java/jre1.6.0_06/lib/tools.jar . There is a tools.jar under
> > /usr/java/jdk1.6.0_06/lib/tools.jar. If I copy this file over to the jre
> > folder, that message goes away and it's replaced with the above message.
> >
> > My hadoop-env.sh file looks something like:
> > # Set Hadoop-specific environment variables here.
> >
> > # The only required environment variable is JAVA_HOME.  All others are
> > # optional.  When running a distributed configuration it is best

RE: Compiling Word Count in C++ : Hadoop Pipes

2008-06-25 Thread Zheng Shao
You are setting JAVA_HOME to /usr/java/jre1.6.0_06 which is jre. You
need to set it to your jdk (sth like /usr/java/jdk1.6.0_06).

If you don't have jdk installed, go to http://java.sun.com/ and install
one.

Zheng
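
A minimal sketch of the fix being described here, assuming the JDK is installed
at /usr/java/jdk1.6.0_06 (the path that comes up later in this thread); adjust
the path to wherever your JDK actually lives:

# In ~/.bash_profile: point JAVA_HOME at the JDK, not the JRE
JAVA_HOME=/usr/java/jdk1.6.0_06; export JAVA_HOME
PATH=$JAVA_HOME/bin:$PATH; export PATH

# In conf/hadoop-env.sh: uncomment the JAVA_HOME line and set it explicitly
export JAVA_HOME=/usr/java/jdk1.6.0_06

# Reload the profile and sanity-check that ant will see a JDK
# (tools.jar ships in the JDK's lib directory, not the JRE's)
source ~/.bash_profile
echo $JAVA_HOME
ls $JAVA_HOME/lib/tools.jar
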
-Original Message-
From: Sandy [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 25, 2008 12:31 PM
To: core-user@hadoop.apache.org
Subject: Re: Compiling Word Count in C++ : Hadoop Pipes

I am under the impression that it already is. As I posted in my original
e-mail, here are the declarations in hadoop-env.sh and my .bash_profile

My hadoop-env.sh file looks something like:
# Set Hadoop-specific environment variables here.

# The only required environment variable is JAVA_HOME.  All others are
# optional.  When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use.  Required.
# export JAVA_HOME=$JAVA_HOME


and my .bash_profile file has this line in it:
JAVA_HOME=/usr/java/jre1.6.0_06; export JAVA_HOME
export PATH


Is there a different way I'm supposed to set the JAVA_HOME environment
variable?

Much thanks,

-SM
On Wed, Jun 25, 2008 at 3:22 PM, Zheng Shao <[EMAIL PROTECTED]> wrote:

> You need to set JAVA_HOME to your jdk directory (instead of jre).
> This is required by ant.
>
> Zheng
> -Original Message-
> From: Sandy [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, June 25, 2008 11:22 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Compiling Word Count in C++ : Hadoop Pipes
>
> I'm not sure how this answers my question. Could you be more specific? I
> still am getting the above error when I type this command in. To
> summarize:
>
> With my current setup, this occurs:
> $ ant -Dcompile.c++=yes compile-c++-examples
> Unable to locate tools.jar. Expected to find it in
> /usr/java/jre1.6.0_06/lib/tools.jar
> Buildfile: build.xml
>
> init:
>[touch] Creating /tmp/null2044923713
>   [delete] Deleting: /tmp/null2044923713
> [exec] svn: '.' is not a working copy
> [exec] svn: '.' is not a working copy
>
> check-c++-makefiles:
>
> create-c++-examples-pipes-makefile:
>[mkdir] Created dir:
> /home/sjm/Desktop/hadoop-0.16.4/build/c++-build/Linux-i386-32/examples/pipes
>
> BUILD FAILED
> /home/sjm/Desktop/hadoop-0.16.4/build.xml:987: Execute failed:
> java.io.IOException: Cannot run program
> "/home/sjm/Desktop/hadoop-0.16.4/src/examples/pipes/configure" (in
> directory
> "/home/sjm/Desktop/hadoop-0.16.4/build/c++-build/Linux-i386-32/examples/pipes"):
> java.io.IOException: error=13, Permission denied
>
> Total time: 1 second
>
> -
>
> If I copy the tools.jar file located in my jdk's lib folder, I get the
> error
> message I printed in the previous message.
>
> Could someone please tell me or suggest to me what I am doing wrong?
>
> Thanks,
>
> -SM
>
> On Wed, Jun 25, 2008 at 1:53 PM, lohit <[EMAIL PROTECTED]> wrote:
>
> > ant -Dcompile.c++=yes compile-c++-examples
> > I picked it up from build.xml
> >
> > Thanks,
> > Lohit
> >
> > - Original Message 
> > From: Sandy <[EMAIL PROTECTED]>
> > To: core-user@hadoop.apache.org
> > Sent: Wednesday, June 25, 2008 10:44:20 AM
> > Subject: Compiling Word Count in C++ : Hadoop Pipes
> >
> > Hi,
> >
> > I am currently trying to get Hadoop Pipes working. I am following the
> > instructions at the hadoop wiki, where it provides code for a C++
> > implementation of Word Count (located here:
> > http://wiki.apache.org/hadoop/C++WordCount?highlight=%28C%2B%2B%29)
> >
> > I am having some trouble parsing the instructions. What should the file
> > containing the new word count program be called? "examples"?
> >
> > If I were to call the file "example" and type in the following:
> > $ ant -Dcompile.c++=yes example
> > Buildfile: build.xml
> >
> > BUILD FAILED
> > Target `example' does not exist in this project.
> >
> > Total time: 0 seconds
> >
> >
> > If I try and compile with "examples" as stated on the wiki, I get:
> > $ ant -Dcompile.c++=yes examples
> > Buildfile: build.xml
> >
> > clover.setup:
> >
> > clover.info:
> > [echo]
> > [echo]  Clover not found. Code coverage reports disabled.
> > [echo]
> >
> > clover:
> >
> > init:
> >[touch] Creating /tmp/null810513231
> >   [delete] Deleting: /tmp/null810513231
> > [exec] svn: '.' is not a working copy
> > [exec] svn: '.' is not a working copy
> >
> > record-parser:
> >
> > compile-rcc-compiler:
> >[javac] Compiling 29 source files to
> > /home/sjm/Desktop/hadoop-0.16.4/build/classes
> >
> > BUILD FAILED
> > /home/sjm/Desktop/hadoop-0.16.4/build.xml:241: Unable to find a javac
> > compiler;
> > com.sun.tools.javac.Main is not on the classpath.
> > Perhaps JAVA_HOME does not point to the JDK
> >
> > Total time: 1 second
> >
> >
> >
> > I am a bit puzzled by this. Originally I got the error that tools.jar was
> > not found, because it was looking for it under
> > /usr/java/jre1.6.0_06/lib/tools.jar . There is a tools.jar under

Re: Compiling Word Count in C++ : Hadoop Pipes

2008-06-25 Thread Sandy
My apologies. I had thought I had made that change already.

Regardless, I still get the same error:
$ ant -Dcompile.c++=yes compile-c++-examples
Unable to locate tools.jar. Expected to find it in
/usr/java/jre1.6.0_06/lib/tools.jar
Buildfile: build.xml

init:
[touch] Creating /tmp/null265867151
   [delete] Deleting: /tmp/null265867151
 [exec] svn: '.' is not a working copy
 [exec] svn: '.' is not a working copy

check-c++-makefiles:

create-c++-examples-pipes-makefile:

create-c++-pipes-makefile:

create-c++-utils-makefile:

BUILD FAILED
/home/sjm/Desktop/hadoop-0.16.4/build.xml:947: Execute failed:
java.io.IOException: Cannot run program
"/home/sjm/Desktop/hadoop-0.16.4/src/c++/utils/configure" (in directory
"/home/sjm/Desktop/hadoop-0.16.4/build/c++-build/Linux-i386-32/utils"):
java.io.IOException: error=13, Permission denied

Total time: 1 second

My .bash_profile now contains the line
JAVA_HOME=/usr/java/jdk1.6.0_06; export JAVA_HOME

I then did
source .bash_profile
conf/hadoop-env.sh

Is there anything else I need to do to make the changes take effect?

Thanks again for the assistance.

-SM
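
Two hedged checks for the errors above; this is only a sketch, nothing here is
confirmed in the thread:

# 1. ant picks up JAVA_HOME from the current shell, and the output above still
#    shows the jre path, so confirm the new value is exported where ant runs:
echo $JAVA_HOME                           # should print /usr/java/jdk1.6.0_06
# Hadoop's own scripts read conf/hadoop-env.sh, which (as quoted earlier) still
# has JAVA_HOME commented out; uncomment and set it there too:
export JAVA_HOME=/usr/java/jdk1.6.0_06    # in conf/hadoop-env.sh

# 2. "error=13, Permission denied" on a configure script usually means the
#    script is not executable; check and, if needed, restore the execute bit:
ls -l src/c++/utils/configure src/examples/pipes/configure
chmod +x src/c++/utils/configure src/examples/pipes/configure
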

On Wed, Jun 25, 2008 at 3:43 PM, lohit <[EMAIL PROTECTED]> wrote:

> may be set it to JDK home? I have set it to my JDK.
>
> - Original Message 
> From: Sandy <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Wednesday, June 25, 2008 12:31:18 PM
> Subject: Re: Compiling Word Count in C++ : Hadoop Pipes
>
> I am under the impression that it already is. As I posted in my original
> e-mail, here are the declarations in hadoop-env.sh and my .bash_profile
>
> My hadoop-env.sh file looks something like:
> # Set Hadoop-specific environment variables here.
>
> # The only required environment variable is JAVA_HOME.  All others are
> # optional.  When running a distributed configuration it is best to
> # set JAVA_HOME in this file, so that it is correctly defined on
> # remote nodes.
>
> # The java implementation to use.  Required.
> # export JAVA_HOME=$JAVA_HOME
>
>
> and my .bash_profile file has this line in it:
> JAVA_HOME=/usr/java/jre1.6.0_06; export JAVA_HOME
> export PATH
>
>
> Is there a different way I'm supposed to set the JAVA_HOME environment
> variable?
>
> Much thanks,
>
> -SM
> On Wed, Jun 25, 2008 at 3:22 PM, Zheng Shao <[EMAIL PROTECTED]> wrote:
>
> > You need to set JAVA_HOME to your jdk directory (instead of jre).
> > This is required by ant.
> >
> > Zheng
> > -Original Message-
> > From: Sandy [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, June 25, 2008 11:22 AM
> > To: core-user@hadoop.apache.org
> > Subject: Re: Compiling Word Count in C++ : Hadoop Pipes
> >
> > I'm not sure how this answers my question. Could you be more specific? I
> > still am getting the above error when I type this command in. To
> > summarize:
> >
> > With my current setup, this occurs:
> > $ ant -Dcompile.c++=yes compile-c++-examples
> > Unable to locate tools.jar. Expected to find it in
> > /usr/java/jre1.6.0_06/lib/tools.jar
> > Buildfile: build.xml
> >
> > init:
> >[touch] Creating /tmp/null2044923713
> >   [delete] Deleting: /tmp/null2044923713
> > [exec] svn: '.' is not a working copy
> > [exec] svn: '.' is not a working copy
> >
> > check-c++-makefiles:
> >
> > create-c++-examples-pipes-makefile:
> >[mkdir] Created dir:
> > /home/sjm/Desktop/hadoop-0.16.4/build/c++-build/Linux-i386-32/examples/pipes
> >
> > BUILD FAILED
> > /home/sjm/Desktop/hadoop-0.16.4/build.xml:987: Execute failed:
> > java.io.IOException: Cannot run program
> > "/home/sjm/Desktop/hadoop-0.16.4/src/examples/pipes/configure" (in
> > directory
> > "/home/sjm/Desktop/hadoop-0.16.4/build/c++-build/Linux-i386-32/examples/
> > pipes"):
> > java.io.IOException: error=13, Permission denied
> >
> > Total time: 1 second
> >
> > -
> >
> > If I copy the tools.jar file located in my jdk's lib folder, I get the
> > error
> > message I printed in the previous message.
> >
> > Could someone please tell me or suggest to me what I am doing wrong?
> >
> > Thanks,
> >
> > -SM
> >
> > On Wed, Jun 25, 2008 at 1:53 PM, lohit <[EMAIL PROTECTED]> wrote:
> >
> > > ant -Dcompile.c++=yes compile-c++-examples
> > > I picked it up from build.xml
> > >
> > > Thanks,
> > > Lohit
> > >
> > > - Original Message 
> > > From: Sandy <[EMAIL PROTECTED]>
> > > To: core-user@hadoop.apache.org
> > > Sent: Wednesday, June 25, 2008 10:44:20 AM
> > > Subject: Compiling Word Count in C++ : Hadoop Pipes
> > >
> > > Hi,
> > >
> > > I am currently trying to get Hadoop Pipes working. I am following the
> > > instructions at the hadoop wiki, where it provides code for a C++
> > > implementation of Word Count (located here:
> > > http://wiki.apache.org/hadoop/C++WordCount?highlight=%28C%2B%2B%29)
> > >
> > > I am having some trouble parsing the instructions. What should the file
> > > containing the new word count program be called? "examples"?
> > >
> > > If I were to call the file "example" and type in the following:
> > 

Re: Job History Logging Location

2008-06-25 Thread Lincoln Ritter
Hello again.

I answered my own question.

Setting 'hadoop.job.history.user.location' to 'logs' works fine.

Thanks anyway!

-lincoln

--
lincolnritter.com
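
For reference, a sketch of that setting as a site override; the property name
is the one from hadoop-default.xml quoted below, and 'logs' is just the
directory chosen above:

<!-- conf/hadoop-site.xml (sketch); can also be set per job on the JobConf -->
<property>
  <name>hadoop.job.history.user.location</name>
  <value>logs</value>
</property>
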



On Wed, Jun 25, 2008 at 11:11 AM, Lincoln Ritter
<[EMAIL PROTECTED]> wrote:
> Greetings,
>
> I'm trying to get a handle on job history logging.  According to the
> documentation in 'hadoop-default.xml', the
> 'hadoop.job.history.user.location' property determines where job history logs
> are written.  If not specified, these logs go into
> '<output dir>/_logs/history'.  This can cause problems with
> applications that don't know about this convention.  It would also be
> nicer in my opinion to keep logs and data separate.
>
> It seems to me that a nice way to handle this would be to put logs in
> '.../logs/<job-id>/history' or something.
>
> Can this be done?  Is there a need for the "job-id" folder?  If this
> can't be done, are there alternatives that work well.
>
> -lincoln
>
> --
> lincolnritter.com
>


Re: How Mappers function and solultion for my input file problem?

2008-06-25 Thread Ted Dunning
The map task is not multi-threaded, but multiple map tasks typically run on
each node and obviously many nodes are in the cluster.

So, yes, calls to the map function are executed in parallel.
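
A sketch of the knob behind "multiple map tasks typically run on each node",
using the old property name (assumed set in conf/hadoop-site.xml); within a
single task, the map() calls themselves are still made one at a time over the
task's input split:

<!-- maximum map tasks one TaskTracker will run concurrently (sketch) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
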

On Wed, Jun 25, 2008 at 8:58 AM, Xuan Dzung Doan <[EMAIL PROTECTED]>
wrote:

>
>
> OK. So are these many calls to map per map task also executed in parallel
> (i.e., are the calls within one map task executed independently)?
>
>
>
>




-- 
ted


process limits for streaming jar

2008-06-25 Thread Chris Anderson
Hi there,

I'm running some streaming jobs on ec2 (ruby parsing scripts) and in
my most recent test I managed to spike the load on my large instances
to 25 or so. As a result, I lost communication with one instance. I
think I took down sshd. Whoops.

My question is, has anyone got strategies for managing resources used
by the processes spawned by streaming jar? Ideally I'd like to run my
ruby scripts under nice.

I can hack something together with wrappers, but I'm thinking there
might be a configuration option to handle this within the streaming jar.
Thanks for any suggestions!

-- 
Chris Anderson
http://jchris.mfdz.com
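
One way to do the wrapper approach mentioned above (a sketch only; the script
names are made up and the streaming jar path varies by release): ship a small
shell wrapper with -file and use it as the mapper, so the ruby process runs
under nice.

#!/bin/sh
# nice_parse.sh (hypothetical): run the ruby parser at reduced priority
exec nice -n 10 ruby parse.rb "$@"

# Streaming invocation (sketch; adjust the jar path for your release):
bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
  -input input -output output \
  -mapper nice_parse.sh \
  -reducer NONE \
  -file nice_parse.sh -file parse.rb
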


Re: MultipleOutputFormat example

2008-06-25 Thread slitz
Hello,
I just did! Thank you! And indeed it is A LOT easier, or maybe it's just that
the included snippets help a lot, or maybe it's both :)

I would still like to learn how to use
MultipleOutputFormat/MultipleTextOutputFormat, though, since it should be more
flexible, and knowing how to use this kind of class in Hadoop could help me
understand other classes and patterns.

So it would be great if someone could give me an example of how to use it.

slitz
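
A rough, untested sketch of that usage against the old org.apache.hadoop.mapred
API; MultipleTextOutputFormat and its generateFileNameForKeyValue hook are the
classes/methods named in this thread, while the subclass name and the Text/Text
types here are only illustrative:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Route each output record to a file derived from its key.
public class KeyBasedTextOutputFormat extends MultipleTextOutputFormat<Text, Text> {
  @Override
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    // 'name' is the default part-xxxxx leaf name; prefixing it with the key
    // groups all records that share a key under one subdirectory of the
    // job's output directory.
    return key.toString() + "/" + name;
  }
}

// Wiring it into a job (illustrative):
//   JobConf conf = new JobConf(MyJob.class);
//   conf.setOutputKeyClass(Text.class);
//   conf.setOutputValueClass(Text.class);
//   conf.setOutputFormat(KeyBasedTextOutputFormat.class);
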

On Wed, Jun 25, 2008 at 7:53 PM, montag <[EMAIL PROTECTED]> wrote:

>
> Hi,
>
>  You should check out the MultipleOutputs thread and patch of
> https://issues.apache.org/jira/browse/HADOOP-3149 HADOOP-3149   There are
> some relevant and useful code snippets that address the issue of splitting
> output to multiple files within the discussion as well as in the patch
> documentation.  I found implementing this patch easier than dealing with
> MultipleTextOutputFormat.
>
> Cheers,
> Mike
>
>
>
> slitz wrote:
> >
> > Hello,
> > I need the reduce to output to different files depending on the key; after
> > reading some jira entries and some previous threads of the mailing list I
> > think that the MultipleTextOutputFormat class would fit my needs. The
> > problem is that I can't find any example of how to use it.
> >
> > Could someone please show me a quick example of how to use this class or
> > MultipleOutputFormat subclasses in general? I'm somewhat lost...
> >
> > slitz
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/MultipleOutputFormat-example-tp18118780p18119478.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>