Generating many small PNGs to Amazon S3 with MapReduce

2009-04-14 Thread tim robertson
Hi all,

I am currently processing a lot of raw CSV data and producing a
summary text file which I load into mysql.  On top of this I have a
PHP application to generate tiles for google mapping (sample tile:
http://eol-map.gbif.org/php/map/getEolTile.php?tile=0_0_0_13839800).
Here is a (dev server) example of the final map client:
http://eol-map.gbif.org/EOLSpeciesMap.html?taxon_id=13839800 - the
dynamic grids as you zoom are all pre-calculated.

I am considering (for better throughput, as maps generate huge request
volumes) pre-generating all my tiles (PNG) and storing them in S3 with
CloudFront.  There will be billions of PNGs produced, each 1-3KB.

Could someone please recommend the best place to generate the PNGs and
when to push them to S3 in an MR system?
If I did the PNG generation and upload to S3 in the reduce, the same
task running on multiple machines would compete with each other, right?
Should I generate the PNGs to a local directory and then, on task
success, push the lot up?  I am assuming billions of 1-3KB files on HDFS
is not a good idea.

I will use EC2 for the MR for the time being, but this will later be
moved to a local cluster, still pushing to S3...

Cheers,

Tim


Re: Interesting Hadoop/FUSE-DFS access patterns

2009-04-14 Thread Brian Bockelman

Hey Jason,

Thanks, I'll keep this on hand as I do more tests.  I now have a C,  
Java, and python version of my testing program ;)


However, I particularly *like* the fact that there's caching going on  
- it'll help out our application immensely, I think.  I'll be looking  
at the performance both with and without the cache.


Brian

On Apr 14, 2009, at 12:01 AM, jason hadoop wrote:

The following very simple program will tell the VM to drop the pages being
cached for a file. I tend to spin this in a for loop when making large tar
files, or otherwise working with large files, and the system performance
really smooths out.
Since it uses open(path) it will churn through the inode cache and
directories.
Something like this might actually significantly speed up HDFS, by running
over the blocks on the datanodes, for busy clusters.


#define _XOPEN_SOURCE 600
#define _GNU_SOURCE
#include <fcntl.h>      /* open, posix_fadvise, O_RDONLY, O_LARGEFILE */
#include <stdio.h>      /* perror, fprintf */
#include <stdlib.h>     /* exit */
#include <string.h>     /* strerror */
#include <unistd.h>     /* close */
#include <sys/types.h>
#include <sys/stat.h>

/** Simple program to dump buffered data for specific files from the buffer
cache. Copyright Jason Venner 2009, License GPL*/

int main( int argc, char** argv )
{
  int failCount = 0;
  int i;
  for( i = 1; i < argc; i++ ) {
    char* file = argv[i];
    int fd = open( file, O_RDONLY|O_LARGEFILE );
    if (fd == -1) {
      perror( file );
      failCount++;
      continue;
    }
    /* Ask the kernel to drop any cached pages for this file. */
    int rc = posix_fadvise( fd, 0, 0, POSIX_FADV_DONTNEED );
    if (rc != 0) {
      fprintf( stderr, "Failed to flush cache for %s: %s\n",
               file, strerror( rc ) );
      failCount++;
    }
    close(fd);
  }
  exit( failCount );
}


On Mon, Apr 13, 2009 at 4:01 PM, Scott Carey  
wrote:




On 4/12/09 9:41 PM, "Brian Bockelman"  wrote:

Ok, here's something perhaps even more strange.  I removed the  
"seek"
part out of my timings, so I was only timing the "read" instead of  
the
"seek + read" as in the first case.  I also turned the read-ahead  
down

to 1-byte (aka, off).

The jump *always* occurs at 128KB, exactly.


Some random ideas:

I have no idea how FUSE interops with the Linux block layer, but 128K
happens to be the default 'readahead' value for block devices, which may
just be a coincidence.

For a disk 'sda', you check and set the value (in 512 byte blocks) with:

/sbin/blockdev --getra /dev/sda
/sbin/blockdev --setra [num blocks] /dev/sda

I know on my file system tests, the OS readahead is not activated until a
series of sequential reads go through the block device, so truly random
access is not affected by this.  I've set it to 128MB and random iops does
not change on an ext3 or xfs file system.  If this applies to FUSE too,
there may be reasons that this behavior differs.
Furthermore, one would not expect it to be slower to randomly read 4k than
to randomly read up to the readahead size itself even if it did.

I also have no idea how much of the OS device queue and block device
scheduler is involved with FUSE.  If those are involved, then there's a
bunch of stuff to tinker with there as well.

Lastly, an FYI if you don't already know the following.  If the OS is
caching pages, there is a way to flush these in Linux to evict the cache.
See /proc/sys/vm/drop_caches .





I'm a bit befuddled.  I know we say that HDFS is optimized for large,
sequential reads, not random reads - but it seems that it's one bug-fix
away from being a good general-purpose system.  Heck if I can find
what's causing the issues though...

Brian








--
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422




Re: Reduce task attempt retry strategy

2009-04-14 Thread Arun C Murthy


On Apr 14, 2009, at 9:11 AM, Jothi Padmanabhan wrote:

2. Framework kills the task because it did not progress enough


That should count as a 'failed' task, not 'killed' - it is a bug if we  
are not counting timed-out tasks against the job...


Arun



Re: Generating many small PNGs to Amazon S3 with MapReduce

2009-04-14 Thread Brian Bockelman

Hey Tim,

Why don't you put the PNGs in a SequenceFile in the output of your  
reduce task?  You could then have a post-processing step that unpacks  
the PNG and places it onto S3.  (If my numbers are correct, you're  
looking at around 3TB of data; is this right?  With that much, you  
might want another separate Map task to unpack all the files in  
parallel ... really depends on the throughput you get to Amazon)
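
For illustration, a minimal sketch of that approach against the 0.19-era
"mapred" API (class names and the tile-rendering call are made up, not
Brian's or Tim's actual code); the driver would keep the PNGs packed in a
SequenceFile with Text keys and BytesWritable values:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class TileReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, BytesWritable> {

  public void reduce(Text tileId, Iterator<Text> points,
                     OutputCollector<Text, BytesWritable> out,
                     Reporter reporter) throws IOException {
    byte[] png = renderTile(tileId, points);  // hypothetical rendering step
    // Key = tile id (later becomes the S3 key), value = raw PNG bytes,
    // packed into the job's SequenceFile output instead of one tiny HDFS
    // file per tile.
    out.collect(tileId, new BytesWritable(png));
  }

  private byte[] renderTile(Text tileId, Iterator<Text> points) {
    return new byte[0];  // placeholder for the real PNG generation
  }

  // Driver-side configuration (sketch):
  //   conf.setOutputFormat(org.apache.hadoop.mapred.SequenceFileOutputFormat.class);
  //   conf.setOutputKeyClass(Text.class);
  //   conf.setOutputValueClass(BytesWritable.class);
}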


Brian

On Apr 14, 2009, at 4:35 AM, tim robertson wrote:


Hi all,

I am currently processing a lot of raw CSV data and producing a
summary text file which I load into mysql.  On top of this I have a
PHP application to generate tiles for google mapping (sample tile:
http://eol-map.gbif.org/php/map/getEolTile.php?tile=0_0_0_13839800).
Here is a (dev server) example of the final map client:
http://eol-map.gbif.org/EOLSpeciesMap.html?taxon_id=13839800 - the
dynamic grids as you zoom are all pre-calculated.

I am considering (for better throughput as maps generate huge request
volumes) pregenerating all my tiles (PNG) and storing them in S3 with
cloudfront.  There will be billions of PNGs produced each at 1-3KB
each.

Could someone please recommend the best place to generate the PNGs and
when to push them to S3 in a MR system?
If I did the PNG generation and upload to S3 in the reduce the same
task on multiple machines will compete with each other right?  Should
I generate the PNGs to a local directory and then on Task success push
the lot up?  I am assuming billions of 1-3KB files on HDFS is not a
good idea.

I will use EC2 for the MR for the time being, but this will be moved
to a local cluster still pushing to S3...

Cheers,

Tim




Re: Generating many small PNGs to Amazon S3 with MapReduce

2009-04-14 Thread tim robertson
Thanks Brian,

This is pretty much what I was looking for.

Your calculations are correct, but based on the assumption that we will
need all tiles generated at all zoom levels.  Given the sparsity of the
data, it actually results in only a few hundred GB.  So I'll run a second
MR job with the map pushing to S3, to make use of parallel loading.

Cheers,

Tim


On Tue, Apr 14, 2009 at 2:37 PM, Brian Bockelman  wrote:
> Hey Tim,
>
> Why don't you put the PNGs in a SequenceFile in the output of your reduce
> task?  You could then have a post-processing step that unpacks the PNG and
> places it onto S3.  (If my numbers are correct, you're looking at around 3TB
> of data; is this right?  With that much, you might want another separate Map
> task to unpack all the files in parallel ... really depends on the
> throughput you get to Amazon)
>
> Brian
>
> On Apr 14, 2009, at 4:35 AM, tim robertson wrote:
>
>> Hi all,
>>
>> I am currently processing a lot of raw CSV data and producing a
>> summary text file which I load into mysql.  On top of this I have a
>> PHP application to generate tiles for google mapping (sample tile:
>> http://eol-map.gbif.org/php/map/getEolTile.php?tile=0_0_0_13839800).
>> Here is a (dev server) example of the final map client:
>> http://eol-map.gbif.org/EOLSpeciesMap.html?taxon_id=13839800 - the
>> dynamic grids as you zoom are all pre-calculated.
>>
>> I am considering (for better throughput as maps generate huge request
>> volumes) pregenerating all my tiles (PNG) and storing them in S3 with
>> cloudfront.  There will be billions of PNGs produced each at 1-3KB
>> each.
>>
>> Could someone please recommend the best place to generate the PNGs and
>> when to push them to S3 in a MR system?
>> If I did the PNG generation and upload to S3 in the reduce the same
>> task on multiple machines will compete with each other right?  Should
>> I generate the PNGs to a local directory and then on Task success push
>> the lot up?  I am assuming billions of 1-3KB files on HDFS is not a
>> good idea.
>>
>> I will use EC2 for the MR for the time being, but this will be moved
>> to a local cluster still pushing to S3...
>>
>> Cheers,
>>
>> Tim
>
>


Re: Modeling WordCount in a different way

2009-04-14 Thread Pankil Doshi
Hey,

I am trying complex queries on Hadoop, which require more than one job to
run to get the final result. The results of job one capture a few of the
query's joins, and I want to pass those results as input to the second job
for further processing so that I can get the final results. The queries are
such that I can't do all of the joins and filtering in job 1, so I require
two jobs.

Right now I write the results of job 1 to HDFS and read them back for job 2,
but that takes unnecessary IO time. So I was looking for a way to store the
results of job 1 in memory and use them as input for job 2.

Do let me know if you need any more details.
Pankil

On Mon, Apr 13, 2009 at 9:51 PM, sharad agarwal wrote:

> Pankil Doshi wrote:
>
>>
>> Hey
>>
>> Did u find any class or way out for storing results of Job1 map/reduce in
>> memory and using that as an input to job2 map/Reduce?I am facing a
>> situation
>> where I need to do similar thing.If anyone can help me out..
>>
>>  Normally you would write the job output to a file and input that to the
> next job.
> Any reason why you want to store the map reduce output in memory ? If you
> can describe your problem, perhaps it could be solved in more mapreduce-ish
> way.
>
> - Sharad
>


HDFS and web server

2009-04-14 Thread Stas Oskin
Hi.

Has anyone succeeded in running a web server from HDFS?

I mean, serving websites and applications directly from HDFS, perhaps via
FUSE/WebDAV?

Regards.


Re: Generating many small PNGs to Amazon S3 with MapReduce

2009-04-14 Thread tim robertson
Sorry Brian, can I just ask please...

I have the PNGs in the SequenceFile for my sample set.  If I use a
second MR job and push to S3 in the map, surely I run into the
scenario where multiple tasks are running on the same section of the
SequenceFile and thus pushing the same data to S3.  Am I missing
something obvious (e.g. can I disable this behavior)?
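
The duplicate pushes described here usually come from speculative
execution; a minimal hedged sketch of switching it off for the upload job
(the property names are from the 0.19-era configuration - worth verifying
against your version). Failed attempts are still retried, so the uploads
should stay idempotent (same S3 key per tile):

import org.apache.hadoop.mapred.JobConf;

public class UploadJobConfig {

  // Sketch only: a JobConf for the S3 upload pass with speculative
  // execution disabled, so each SequenceFile split is uploaded by exactly
  // one successful task attempt.
  public static JobConf create() {
    JobConf conf = new JobConf(UploadJobConfig.class);
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
    return conf;
  }
}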

Cheers

Tim


On Tue, Apr 14, 2009 at 2:44 PM, tim robertson
 wrote:
> Thanks Brian,
>
> This is pretty much what I was looking for.
>
> Your calculations are correct but based on the assumption that at all
> zoom levels we will need all tiles generated.  Given the sparsity of
> data, it actually results in only a few 100GBs.  I'll run a second MR
> job with the map pushing to S3 then to make use of parallel loading.
>
> Cheers,
>
> Tim
>
>
> On Tue, Apr 14, 2009 at 2:37 PM, Brian Bockelman  wrote:
>> Hey Tim,
>>
>> Why don't you put the PNGs in a SequenceFile in the output of your reduce
>> task?  You could then have a post-processing step that unpacks the PNG and
>> places it onto S3.  (If my numbers are correct, you're looking at around 3TB
>> of data; is this right?  With that much, you might want another separate Map
>> task to unpack all the files in parallel ... really depends on the
>> throughput you get to Amazon)
>>
>> Brian
>>
>> On Apr 14, 2009, at 4:35 AM, tim robertson wrote:
>>
>>> Hi all,
>>>
>>> I am currently processing a lot of raw CSV data and producing a
>>> summary text file which I load into mysql.  On top of this I have a
>>> PHP application to generate tiles for google mapping (sample tile:
>>> http://eol-map.gbif.org/php/map/getEolTile.php?tile=0_0_0_13839800).
>>> Here is a (dev server) example of the final map client:
>>> http://eol-map.gbif.org/EOLSpeciesMap.html?taxon_id=13839800 - the
>>> dynamic grids as you zoom are all pre-calculated.
>>>
>>> I am considering (for better throughput as maps generate huge request
>>> volumes) pregenerating all my tiles (PNG) and storing them in S3 with
>>> cloudfront.  There will be billions of PNGs produced each at 1-3KB
>>> each.
>>>
>>> Could someone please recommend the best place to generate the PNGs and
>>> when to push them to S3 in a MR system?
>>> If I did the PNG generation and upload to S3 in the reduce the same
>>> task on multiple machines will compete with each other right?  Should
>>> I generate the PNGs to a local directory and then on Task success push
>>> the lot up?  I am assuming billions of 1-3KB files on HDFS is not a
>>> good idea.
>>>
>>> I will use EC2 for the MR for the time being, but this will be moved
>>> to a local cluster still pushing to S3...
>>>
>>> Cheers,
>>>
>>> Tim
>>
>>
>


fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-14 Thread Guilherme Germoglio
(Hadoop is used in the benchmarks)

http://database.cs.brown.edu/sigmod09/

There is currently considerable enthusiasm around the MapReduce
(MR) paradigm for large-scale data analysis [17]. Although the
basic control flow of this framework has existed in parallel SQL
database management systems (DBMS) for over 20 years, some
have called MR a dramatically new computing model [8, 17]. In
this paper, we describe and compare both paradigms. Furthermore,
we evaluate both kinds of systems in terms of performance and
development complexity. To this end, we define a benchmark
consisting of a collection of tasks that we have run on an open source
version of MR as well as on two parallel DBMSs. For each task,
we measure each system’s performance for various degrees of
parallelism on a cluster of 100 nodes. Our results reveal some
interesting trade-offs. Although the process to load data into and tune
the execution of parallel DBMSs took much longer than the MR
system, the observed performance of these DBMSs was strikingly
better. We speculate about the causes of the dramatic performance
difference and consider implementation concepts that future
systems should take from both kinds of architectures.


-- 
Guilherme

msn: guigermog...@hotmail.com
homepage: http://germoglio.googlepages.com


Re: Interesting Hadoop/FUSE-DFS access patterns

2009-04-14 Thread jason hadoop
Oh, I agree, caching is wonderful when you plan to re-use the data in the
near term.

Solaris has an interesting feature: if the application writes enough
contiguous data in a short time window (tunable in later Nevada builds),
Solaris bypasses the buffer cache for the writes.

For reasons I have never had time to look into, there is a significant
impact on overall system responsiveness when there is significant cache
store activity going on, and there are patterns that work in the general
case but fail in others. Take the tar example from earlier: my theory is
that the blocks written to the tar file take priority over the read-ahead,
and so the next files to be read for the tar archive are not pre-cached.
Using the cache flush on the tar file allows the read-aheads to go ahead.
The other nice thing that happens is that the size of the dirty pool tends
not to grow to the point that the periodic sync operations pause the
system.

We had an interesting problem with Solaris under VMware some years back,
where we were running IMAP servers as part of JES for testing a middleware
mail application. The IMAP writes would accumulate in the buffer cache, and
performance would be wonderful, and the middleware performance was great;
then the must-flush-now threshold would be crossed and it would take 2
minutes to flush all of the accumulated writes out, and the middleware app
would block waiting on that to finish. In the end, as a quick hack, we did
the following: *while true; do sync; sleep 30; done*, which prevented the
stalls as it kept the flush time down. The flushes totally fill the disk
queues and will cause starvation for other apps.

I believe this is part of the block report stall problem in 4584.

On Tue, Apr 14, 2009 at 4:52 AM, Brian Bockelman wrote:

> Hey Jason,
>
> Thanks, I'll keep this on hand as I do more tests.  I now have a C, Java,
> and python version of my testing program ;)
>
> However, I particularly *like* the fact that there's caching going on -
> it'll help out our application immensely, I think.  I'll be looking at the
> performance both with and without the cache.
>
> Brian
>
>
> On Apr 14, 2009, at 12:01 AM, jason hadoop wrote:
>
>  The following very simple program will tell the VM to drop the pages being
>> cached for a file. I tend to spin this in a for loop when making large tar
>> files, or otherwise working with large files, and the system performance
>> really smooths out.
>> Since it use open(path) it will churn through the inode cache and
>> directories.
>> Something like this might actually significantly speed up HDFS by running
>> over the blocks on the datanodes, for busy clusters.
>>
>>
>> #define _XOPEN_SOURCE 600
>> #define _GNU_SOURCE
>> #include 
>> #include 
>> #include 
>> #include 
>> #include 
>> #include 
>> #include 
>>
>> /** Simple program to dump buffered data for specific files from the
>> buffer
>> cache. Copyright Jason Venner 2009, License GPL*/
>>
>> int main( int argc, char** argv )
>> {
>>  int failCount = 0;
>>  int i;
>>  for( i = 1; i < argc; i++ ) {
>>   char* file = argv[i];
>>   int fd = open( file, O_RDONLY|O_LARGEFILE );
>>   if (fd == -1) {
>> perror( file );
>> failCount++;
>> continue;
>>   }
>>   if (posix_fadvise( fd, 0, 0, POSIX_FADV_DONTNEED )!=0) {
>> fprintf( stderr, "Failed to flush cache for %s %s\n", argv[optind],
>> strerror( posix_fadvise( fd, 0, 0, POSIX_FADV_DONTNEED ) ) );
>> failCount++;
>>   }
>>   close(fd);
>>  }
>>  exit( failCount );
>> }
>>
>>
>> On Mon, Apr 13, 2009 at 4:01 PM, Scott Carey > >wrote:
>>
>>
>>> On 4/12/09 9:41 PM, "Brian Bockelman"  wrote:
>>>
>>>  Ok, here's something perhaps even more strange.  I removed the "seek"
 part out of my timings, so I was only timing the "read" instead of the
 "seek + read" as in the first case.  I also turned the read-ahead down
 to 1-byte (aka, off).

 The jump *always* occurs at 128KB, exactly.

>>>
>>> Some random ideas:
>>>
>>> I have no idea how FUSE interops with the Linux block layer, but 128K
>>> happens to be the default 'readahead' value for block devices, which may
>>> just be a coincidence.
>>>
>>> For a disk 'sda', you check and set the value (in 512 byte blocks) with:
>>>
>>> /sbin/blockdev --getra /dev/sda
>>> /sbin/blockdev --setra [num blocks] /dev/sda
>>>
>>>
>>> I know on my file system tests, the OS readahead is not activated until a
>>> series of sequential reads go through the block device, so truly random
>>> access is not affected by this.  I've set it to 128MB and random iops
>>> does
>>> not change on a ext3 or xfs file system.  If this applies to FUSE too,
>>> there
>>> may be reasons that this behavior differs.
>>> Furthermore, one would not expect it to be slower to randomly read 4k
>>> than
>>> randomly read up to the readahead size itself even if it did.
>>>
>>> I also have no idea how much of the OS device queue and block device
>>> scheduler is involved with FUSE.  If those are involved, then there's 

Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-14 Thread tim robertson
Thanks for sharing this - I find these comparisons really interesting.
 I have a small comment after skimming this very quickly.
[Please accept my apologies for commenting on such a trivial thing,
but personal experience has shown this really influences performance]

One thing not touched on in the article is the need for developers to
take into account performance gains when writing MapReduce programs,
much like they do when making sure the DB query optimizer is doing the
join order sensibly.

For example in your MR code you have 2 simple places to improve performance:

  String fields[] = value.toString().split("\\" +
BenchmarkBase.VALUE_DELIMITER);

This will create a String for "\\" + BenchmarkBase.VALUE_DELIMITER,
then compile a Pattern for it, then split the input; after that the
new String and the Pattern are good only for garbage collection.

String splitting like this is far quicker with a precompiled Pattern
that you reuse:
  static Pattern splitter = Pattern.compile("\\" +
BenchmarkBase.VALUE_DELIMITER);
  
  splitter.split(value.toString());

A simple loop splitting 10 records went from 431 msec to 69 msec on my
2GB MacBook Pro.  Now consider what happens when splitting billions of
rows (it only gets worse with a bigger input string).

The other gain is object reuse rather than creation:
  key = new Text(key.toString().substring(0, 7));

Unnecessary Object creation and garbage collection kills Java
performance in any application.
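
Putting those two points together, a minimal illustrative mapper (old
"mapred" API; the delimiter, field layout, and class name are assumptions
rather than the paper's actual code):

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SplitReuseMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Compiled once per JVM instead of once per record.
  private static final Pattern SPLITTER = Pattern.compile("\\|");  // delimiter is illustrative

  // Reused for every output record instead of new Text(...) per record;
  // safe because the framework serializes the key on collect().
  private final Text outKey = new Text();

  public void map(LongWritable offset, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String[] fields = SPLITTER.split(value.toString());
    outKey.set(fields[0].substring(0, Math.min(7, fields[0].length())));
    out.collect(outKey, value);
  }
}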

(I haven't seen it in your code, but another common performance drain is
relying on Exceptions for control flow where if/else clauses perform far
quicker.)

These are really trivial things that people often overlook but when
you are running these operations billions of times it really adds up
and is analogous to using BigInteger on a DB column with an Index
where a SmallInteger will do.

Again - I apologise for commenting on such a trivial thing (I really
feel stupid commenting on how to split a String in Java efficiently to
this mailing list), but might be worth considering when doing these
kind of tests - and, like you say, an RDBMS has 20 years of these
performance tweaks behind it.  Of course, the fact that RDBMSs mostly
shield people from these low-level things is a huge benefit and might be
worth mentioning.

Cheers,

Tim












On Tue, Apr 14, 2009 at 4:16 PM, Guilherme Germoglio
 wrote:
> (Hadoop is used in the benchmarks)
>
> http://database.cs.brown.edu/sigmod09/
>
> There is currently considerable enthusiasm around the MapReduce
> (MR) paradigm for large-scale data analysis [17]. Although the
> basic control flow of this framework has existed in parallel SQL
> database management systems (DBMS) for over 20 years, some
> have called MR a dramatically new computing model [8, 17]. In
> this paper, we describe and compare both paradigms. Furthermore,
> we evaluate both kinds of systems in terms of performance and de-
> velopment complexity. To this end, we define a benchmark con-
> sisting of a collection of tasks that we have run on an open source
> version of MR as well as on two parallel DBMSs. For each task,
> we measure each system’s performance for various degrees of par-
> allelism on a cluster of 100 nodes. Our results reveal some inter-
> esting trade-offs. Although the process to load data into and tune
> the execution of parallel DBMSs took much longer than the MR
> system, the observed performance of these DBMSs was strikingly
> better. We speculate about the causes of the dramatic performance
> difference and consider implementation concepts that future sys-
> tems should take from both kinds of architectures.
>
>
> --
> Guilherme
>
> msn: guigermog...@hotmail.com
> homepage: http://germoglio.googlepages.com
>


Re: Extending ClusterMapReduceTestCase

2009-04-14 Thread czero

I actually picked up the alpha .PDF's of your book, great job.

I'm following the example in chapter 7 to the letter now and am still
getting the same problem.  2 quick questions (and thanks for your time in
advance)...

Is the ClusterMapReduceDelegate class available anywhere yet?

Adding ~/hadoop/libs/*.jar in its entirety to my pom.xml is a lot of bulk,
so I've avoided it until now.  Are there any libs in there that are
absolutely necessary for this test to work?

Thanks again,
bc
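
For what it's worth, a hedged sketch of one way to wire those properties in
before the mini cluster starts (class, path, and test names are made up; it
assumes the base class's setUp() and getFileSystem() helpers, and note that
"~" is not expanded by the JVM, so an absolute path is used):

import java.io.File;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.ClusterMapReduceTestCase;

public class TestDoopCluster extends ClusterMapReduceTestCase {

  @Override
  protected void setUp() throws Exception {
    // Set the properties with an absolute path *before* the base class
    // starts the mini DFS/MR clusters, rather than inside the test method.
    System.setProperty("hadoop.log.dir",
        new File(System.getProperty("java.io.tmpdir"), "test-logs").getAbsolutePath());
    System.setProperty("javax.xml.parsers.SAXParserFactory",
        "com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl");
    super.setUp();
  }

  public void testClusterIsUp() throws Exception {
    FileSystem fs = getFileSystem();   // mini-DFS handle from the base class
    fs.mkdirs(new Path("/smoke-test"));
    assertTrue(fs.exists(new Path("/smoke-test")));
  }
}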



jason hadoop wrote:
> 
> I have a nice variant of this in the ch7 examples section of my book,
> including a standalone wrapper around the virtual cluster for allowing
> multiple test instances to share the virtual cluster - and allow an easier
> time to poke around with the input and output datasets.
> 
> It even works decently under Windows - my editor insisting on a version
> of Word too recent for CrossOver.
> 
> On Mon, Apr 13, 2009 at 9:16 AM, czero  wrote:
> 
>>
>> Sry, I forgot to include the not-IntelliJ-console output :)
>>
>> 09/04/13 12:07:14 ERROR mapred.MiniMRCluster: Job tracker crashed
>> java.lang.NullPointerException
>>at java.io.File.<init>(File.java:222)
>>at org.apache.hadoop.mapred.JobHistory.init(JobHistory.java:143)
>>at
>> org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1110)
>>at
>> org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:143)
>>at
>>
>> org.apache.hadoop.mapred.MiniMRCluster$JobTrackerRunner.run(MiniMRCluster.java:96)
>>at java.lang.Thread.run(Thread.java:637)
>>
>> I managed to pick up the chapter in the Hadoop Book that Jason mentions
>> that
>> deals with Unit testing (great chapter btw) and it looks like everything
>> is
>> in order.  He points out that this error is typically caused by a bad
>> hadoop.log.dir or missing log4j.properties, but I verified that my dir is
>> ok
>> and my hadoop-0.19.1-core.jar has the log4j.properties in it.
>>
>> I also tried running the same test with hadoop-core/test 0.19.0 - same
>> thing.
>>
>> Thanks again,
>>
>> bc
>>
>>
>> czero wrote:
>> >
>> > Hey all,
>> >
>> > I'm also extending the ClusterMapReduceTestCase and having a bit of
>> > trouble as well.
>> >
>> > Currently I'm getting :
>> >
>> > Starting DataNode 0 with dfs.data.dir:
>> > build/test/data/dfs/data/data1,build/test/data/dfs/data/data2
>> > Starting DataNode 1 with dfs.data.dir:
>> > build/test/data/dfs/data/data3,build/test/data/dfs/data/data4
>> > Generating rack names for tasktrackers
>> > Generating host names for tasktrackers
>> >
>> > And then nothing... just spins on that forever.  Any ideas?
>> >
>> > I have all the jetty and jetty-ext libs in the classpath and I set the
>> > hadoop.log.dir and the SAX parser correctly.
>> >
>> > This is all I have for my test class so far, I'm not even doing
>> anything
>> > yet:
>> >
>> > public class TestDoop extends ClusterMapReduceTestCase {
>> >
>> > @Test
>> > public void testDoop() throws Exception {
>> > System.setProperty("hadoop.log.dir", "~/test-logs");
>> > System.setProperty("javax.xml.parsers.SAXParserFactory",
>> > "com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl");
>> >
>> > setUp();
>> >
>> > System.out.println("done.");
>> > }
>> >
>> > Thanks!
>> >
>> > bc
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Extending-ClusterMapReduceTestCase-tp22440254p23024597.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>>
>>
> 
> 
> -- 
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Extending-ClusterMapReduceTestCase-tp22440254p23041470.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-14 Thread Bryan Duxbury
I thought it a conspicuous omission to not discuss the cost of  
various approaches. Hadoop is free, though you have to spend  
developer time; how much does Vertica cost on 100 nodes?


-Bryan

On Apr 14, 2009, at 7:16 AM, Guilherme Germoglio wrote:


(Hadoop is used in the benchmarks)

http://database.cs.brown.edu/sigmod09/

There is currently considerable enthusiasm around the MapReduce
(MR) paradigm for large-scale data analysis [17]. Although the
basic control flow of this framework has existed in parallel SQL
database management systems (DBMS) for over 20 years, some
have called MR a dramatically new computing model [8, 17]. In
this paper, we describe and compare both paradigms. Furthermore,
we evaluate both kinds of systems in terms of performance and de-
velopment complexity. To this end, we define a benchmark con-
sisting of a collection of tasks that we have run on an open source
version of MR as well as on two parallel DBMSs. For each task,
we measure each system’s performance for various degrees of par-
allelism on a cluster of 100 nodes. Our results reveal some inter-
esting trade-offs. Although the process to load data into and tune
the execution of parallel DBMSs took much longer than the MR
system, the observed performance of these DBMSs was strikingly
better. We speculate about the causes of the dramatic performance
difference and consider implementation concepts that future sys-
tems should take from both kinds of architectures.


--
Guilherme

msn: guigermog...@hotmail.com
homepage: http://germoglio.googlepages.com




Re: Map-Reduce Slow Down

2009-04-14 Thread Mithila Nagendra
I've drawn a blank here! Can't figure out what's wrong with the ports. I can
ssh between the nodes but can't access the DFS from the slaves - it says "Bad
connection to DFS". The master seems to be fine.
Mithila

On Tue, Apr 14, 2009 at 4:28 AM, Mithila Nagendra  wrote:

> Yes I can..
>
>
> On Mon, Apr 13, 2009 at 5:12 PM, Jim Twensky wrote:
>
>> Can you ssh between the nodes?
>>
>> -jim
>>
>> On Mon, Apr 13, 2009 at 6:49 PM, Mithila Nagendra 
>> wrote:
>>
>> > Thanks Aaron.
>> > Jim: The three clusters I setup had ubuntu running on them and the dfs
>> was
>> > accessed at port 54310. The new cluster which I ve setup has Red Hat
>> Linux
>> > release 7.2 (Enigma)running on it. Now when I try to access the dfs from
>> > one
>> > of the slaves i get the following response: dfs cannot be accessed. When
>> I
>> > access the DFS throught the master there s no problem. So I feel there a
>> > problem with the port. Any ideas? I did check the list of slaves, it
>> looks
>> > fine to me.
>> >
>> > Mithila
>> >
>> >
>> >
>> >
>> > On Mon, Apr 13, 2009 at 2:58 PM, Jim Twensky 
>> > wrote:
>> >
>> > > Mithila,
>> > >
>> > > You said all the slaves were being utilized in the 3 node cluster.
>> Which
>> > > application did you run to test that and what was your input size? If
>> you
>> > > tried the word count application on a 516 MB input file on both
>> cluster
>> > > setups, than some of your nodes in the 15 node cluster may not be
>> running
>> > > at
>> > > all. Generally, one map job is assigned to each input split and if you
>> > are
>> > > running your cluster with the defaults, the splits are 64 MB each. I
>> got
>> > > confused when you said the Namenode seemed to do all the work. Can you
>> > > check
>> > > conf/slaves and make sure you put the names of all task trackers
>> there? I
>> > > also suggest comparing both clusters with a larger input size, say at
>> > least
>> > > 5 GB, to really see a difference.
>> > >
>> > > Jim
>> > >
>> > > On Mon, Apr 13, 2009 at 4:17 PM, Aaron Kimball 
>> > wrote:
>> > >
>> > > > in hadoop-*-examples.jar, use "randomwriter" to generate the data
>> and
>> > > > "sort"
>> > > > to sort it.
>> > > > - Aaron
>> > > >
>> > > > On Sun, Apr 12, 2009 at 9:33 PM, Pankil Doshi 
>> > > wrote:
>> > > >
>> > > > > Your data is too small I guess for 15 clusters ..So it might be
>> > > overhead
>> > > > > time of these clusters making your total MR jobs more time
>> consuming.
>> > > > > I guess you will have to try with larger set of data..
>> > > > >
>> > > > > Pankil
>> > > > > On Sun, Apr 12, 2009 at 6:54 PM, Mithila Nagendra <
>> mnage...@asu.edu>
>> > > > > wrote:
>> > > > >
>> > > > > > Aaron
>> > > > > >
>> > > > > > That could be the issue, my data is just 516MB - wouldn't this
>> see
>> > a
>> > > > bit
>> > > > > of
>> > > > > > speed up?
>> > > > > > Could you guide me to the example? I ll run my cluster on it and
>> > see
>> > > > what
>> > > > > I
>> > > > > > get. Also for my program I had a java timer running to record
>> the
>> > > time
>> > > > > > taken
>> > > > > > to complete execution. Does Hadoop have an inbuilt timer?
>> > > > > >
>> > > > > > Mithila
>> > > > > >
>> > > > > > On Mon, Apr 13, 2009 at 1:13 AM, Aaron Kimball <
>> aa...@cloudera.com
>> > >
>> > > > > wrote:
>> > > > > >
>> > > > > > > Virtually none of the examples that ship with Hadoop are
>> designed
>> > > to
>> > > > > > > showcase its speed. Hadoop's speedup comes from its ability to
>> > > > process
>> > > > > > very
>> > > > > > > large volumes of data (starting around, say, tens of GB per
>> job,
>> > > and
>> > > > > > going
>> > > > > > > up in orders of magnitude from there). So if you are timing
>> the
>> > pi
>> > > > > > > calculator (or something like that), its results won't
>> > necessarily
>> > > be
>> > > > > > very
>> > > > > > > consistent. If a job doesn't have enough fragments of data to
>> > > > allocate
>> > > > > > one
>> > > > > > > per each node, some of the nodes will also just go unused.
>> > > > > > >
>> > > > > > > The best example for you to run is to use randomwriter to fill
>> up
>> > > > your
>> > > > > > > cluster with several GB of random data and then run the sort
>> > > program.
>> > > > > If
>> > > > > > > that doesn't scale up performance from 3 nodes to 15, then
>> you've
>> > > > > > > definitely
>> > > > > > > got something strange going on.
>> > > > > > >
>> > > > > > > - Aaron
>> > > > > > >
>> > > > > > >
>> > > > > > > On Sun, Apr 12, 2009 at 8:39 AM, Mithila Nagendra <
>> > > mnage...@asu.edu>
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > Hey all
>> > > > > > > > I recently setup a three node hadoop cluster and ran an
>> > examples
>> > > on
>> > > > > it.
>> > > > > > > It
>> > > > > > > > was pretty fast, and all the three nodes were being used (I
>> > > checked
>> > > > > the
>> > > > > > > log
>> > > > > > > > files to make sure that the slaves are utilized).
>> > > > > > > >
>> > > > > > > > Now I ve setup another cluster consisting of 15 nodes. I

Total number of records processed in mapper

2009-04-14 Thread Andy Liu
Is there a way for all the reducers to have access to the total number of
records that were processed in the Map phase?

For example, I'm trying to perform a simple document frequency calculation.
During the map phase, I emit <word, 1> pairs for every unique word in every
document.  During the reduce phase, I sum the values for each word group.
Then I want to divide that value by the total number of documents.

I suppose I can create a whole separate m/r job whose sole purpose is to
count all the records, then pass that number on.  Is there a more
straightforward way of doing this?
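
One counter-based variant of that idea, sketched against the 0.19-era API
(class, counter, and property names are illustrative, not an existing
Hadoop facility): bump a custom counter once per document in the first
job's mapper, read it from the driver when that job finishes, and hand it
to the second job through its JobConf.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class DocFrequencyDriver {

  // Bumped once per document in the first job's mapper:
  //   reporter.incrCounter(DocCounters.TOTAL_DOCS, 1);
  public enum DocCounters { TOTAL_DOCS }

  public static void main(String[] args) throws Exception {
    JobConf wordCountJob = new JobConf(DocFrequencyDriver.class);
    // ... configure mapper/reducer/paths for the <word, count> job ...
    RunningJob first = JobClient.runJob(wordCountJob);   // blocks until done

    long totalDocs = first.getCounters().getCounter(DocCounters.TOTAL_DOCS);

    JobConf dfJob = new JobConf(DocFrequencyDriver.class);
    dfJob.setLong("df.total.docs", totalDocs);  // illustrative property name
    // The second job's reducer reads it back in configure():
    //   totalDocs = job.getLong("df.total.docs", -1);
    // ... configure and submit the second job ...
  }
}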

Andy


Is combiner and map in same JVM?

2009-04-14 Thread Saptarshi Guha
Hello,
Suppose I have a Hadoop job and have set my combiner to the Reducer class.
Does the map function and the combiner function run in the same JVM in
different threads? or in different JVMs?
I ask because I have to load a native library and if they are in the same
JVM then the native library is loaded once and I have to take  precautions.

Thank you
Saptarshi Guha


Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-14 Thread Brian Bockelman

Hey Guilherme,

It's good to see comparisons, especially as it helps folks understand  
better what tool is the best for their problem.  As you show in your  
paper, a MapReduce system is hideously bad in performing tasks that  
column-store databases were designed for (selecting a single value  
along an index, joining tables).


Some comments:
1) For some of your graphs, you show Hadoop's numbers in half-grey,  
half-white.  I can't figure out for the life of me what this  
signifies!  What have I overlooked?
2) I see that one of your co-authors is the CEO/inventor of the  
Vertica DB.  Out of curiosity, how did you interact with Vertica  
versus Hadoop versus DBMS-X?  Did you get help tuning the systems from  
the experts?  I.e., if you sat down with a Hadoop expert for a few  
days, I'm certain you could squeeze out more performance, just like  
whenever I sit down with an Oracle DBA for a few hours, my DB queries  
are much faster.  You touch upon the sociological issues (having to  
program your own code versus having to only know SQL, as well as the  
comparative time it took to set up the DB) - I'd like to hear how much  
time you spent "tweaking" and learning the best practices for the  
three, very different approaches.  If you added a 5th test, what's the  
marginal effort required?
3) It would be nice to see how some of your more DB-like tasks perform  
on something like HBase.  That'd be a much more apples-to-apples  
comparison of column-store DBMS versus column-store data system,  
although the HBase work is just now revving up.  I'm a bit uninformed  
in that area, so I don't have a good gut in how that'd do.
4) I think that the UDF aggregation task (calculating the inlink count  
for each document in a sample) is interesting - it's a more Map-Reduce  
oriented task, and it sounds like it was fairly miserable to hack  
around the limitations / bugs in the DBMS.
5) I really think you undervalue the benefits of replication and  
reliability, especially in terms of cost.  As someone who helps with a  
small site (about 300 machines) that range from commodity workers to  
Sun Thumpers, if your site depends on all your storage nodes  
functioning, then your costs go way up.  You can't make cheap hardware  
scale unless your software can account for it.
  - Yes, I realize this is a different approach than you take.  There  
are pros and cons to large expensive hardware versus lots of cheap  
hardware ... the argument has been going on since the dawn of time.   
However, it's a bit unfair to just outright dismiss one approach.  I  
am a bit wary of the claims that your results can scale up to Google/ 
Yahoo scale, but I do agree that there are truly few users that are  
that large!


I love your last paragraph, it's a very good conclusion.  It kind of  
reminds me of the grid computing field which was (is?) completely  
shocked by the emergence of cloud computing.  After you cut through  
the hype surrounding the new fads, you find (a) that there are some  
very good reasons that the fads are popular - they have definite  
strengths that the existing field was missing (or didn't want to hear)  
and (b) there's a lot of common ground and learning that has to be  
done, even to get a good common terminology :)


Enjoy your conference!

Brian

On Apr 14, 2009, at 9:16 AM, Guilherme Germoglio wrote:


(Hadoop is used in the benchmarks)

http://database.cs.brown.edu/sigmod09/

There is currently considerable enthusiasm around the MapReduce
(MR) paradigm for large-scale data analysis [17]. Although the
basic control flow of this framework has existed in parallel SQL
database management systems (DBMS) for over 20 years, some
have called MR a dramatically new computing model [8, 17]. In
this paper, we describe and compare both paradigms. Furthermore,
we evaluate both kinds of systems in terms of performance and de-
velopment complexity. To this end, we define a benchmark con-
sisting of a collection of tasks that we have run on an open source
version of MR as well as on two parallel DBMSs. For each task,
we measure each system’s performance for various degrees of par-
allelism on a cluster of 100 nodes. Our results reveal some inter-
esting trade-offs. Although the process to load data into and tune
the execution of parallel DBMSs took much longer than the MR
system, the observed performance of these DBMSs was strikingly
better. We speculate about the causes of the dramatic performance
difference and consider implementation concepts that future sys-
tems should take from both kinds of architectures.


--
Guilherme

msn: guigermog...@hotmail.com
homepage: http://germoglio.googlepages.com




Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-14 Thread Guilherme Germoglio
Hi Brian,

I'm sorry but it is not my paper. :-) I've posted the link here because
we're always looking for comparison data -- so, I thought this benchmark
would be welcome.

Also, I won't attend the conference. However, it would be a good idea for
someone who will to ask the authors all these questions and comments
directly and then post their answers here.


On Tue, Apr 14, 2009 at 2:26 PM, Brian Bockelman wrote:

> Hey Guilherme,
>
> It's good to see comparisons, especially as it helps folks understand
> better what tool is the best for their problem.  As you show in your paper,
> a MapReduce system is hideously bad in performing tasks that column-store
> databases were designed for (selecting a single value along an index,
> joining tables).
>
> Some comments:
> 1) For some of your graphs, you show Hadoop's numbers in half-grey,
> half-white.  I can't figure out for the life of me what this signifies!
>  What have I overlooked?
> 2) I see that one of your co-authors is the CEO/inventor of the Vertica DB.
>  Out of curiosity, how did you interact with Vertica versus Hadoop versus
> DBMS-X?  Did you get help tuning the systems from the experts?  I.e., if you
> sat down with a Hadoop expert for a few days, I'm certain you could squeeze
> out more performance, just like whenever I sit down with an Oracle DBA for a
> few hours, my DB queries are much faster.  You touch upon the sociological
> issues (having to program your own code versus having to only know SQL, as
> well as the comparative time it took to set up the DB) - I'd like to hear
> how much time you spent "tweaking" and learning the best practices for the
> three, very different approaches.  If you added a 5th test, what's the
> marginal effort required?
> 3) It would be nice to see how some of your more DB-like tasks perform on
> something like HBase.  That'd be a much more apples-to-apples comparison of
> column-store DBMS versus column-store data system, although the HBase work
> is just now revving up.  I'm a bit uninformed in that area, so I don't have
> a good gut in how that'd do.
> 4) I think that the UDF aggregation task (calculating the inlink count for
> each document in a sample) is interesting - it's a more Map-Reduce oriented
> task, and it sounds like it was fairly miserable to hack around the
> limitations / bugs in the DBMS.
> 5) I really think you undervalue the benefits of replication and
> reliability, especially in terms of cost.  As someone who helps with a small
> site (about 300 machines) that range from commodity workers to Sun Thumpers,
> if your site depends on all your storage nodes functioning, then your costs
> go way up.  You can't make cheap hardware scale unless your software can
> account for it.
>  - Yes, I realize this is a different approach than you take.  There are
> pros and cons to large expensive hardware versus lots of cheap hardware ...
> the argument has been going on since the dawn of time.  However, it's a bit
> unfair to just outright dismiss one approach.  I am a bit wary of the claims
> that your results can scale up to Google/Yahoo scale, but I do agree that
> there are truly few users that are that large!
>
> I love your last paragraph, it's a very good conclusion.  It kind of
> reminds me of the grid computing field which was (is?) completely shocked by
> the emergence of cloud computing.  After you cut through the hype
> surrounding the new fads, you find (a) that there are some very good reasons
> that the fads are popular - they have definite strengths that the existing
> field was missing (or didn't want to hear) and (b) there's a lot of common
> ground and learning that has to be done, even to get a good common
> terminology :)
>
> Enjoy your conference!
>
> Brian
>
> On Apr 14, 2009, at 9:16 AM, Guilherme Germoglio wrote:
>
>  (Hadoop is used in the benchmarks)
>>
>> http://database.cs.brown.edu/sigmod09/
>>
>> There is currently considerable enthusiasm around the MapReduce
>> (MR) paradigm for large-scale data analysis [17]. Although the
>> basic control flow of this framework has existed in parallel SQL
>> database management systems (DBMS) for over 20 years, some
>> have called MR a dramatically new computing model [8, 17]. In
>> this paper, we describe and compare both paradigms. Furthermore,
>> we evaluate both kinds of systems in terms of performance and de-
>> velopment complexity. To this end, we define a benchmark con-
>> sisting of a collection of tasks that we have run on an open source
>> version of MR as well as on two parallel DBMSs. For each task,
>> we measure each system’s performance for various degrees of par-
>> allelism on a cluster of 100 nodes. Our results reveal some inter-
>> esting trade-offs. Although the process to load data into and tune
>> the execution of parallel DBMSs took much longer than the MR
>> system, the observed performance of these DBMSs was strikingly
>> better. We speculate about the causes of the dramatic performance
>> difference and consider 

Re: HDFS and web server

2009-04-14 Thread Michael Bieniosek
webdav server - https://issues.apache.org/jira/browse/HADOOP-496
There's a fuse issue somewhere too, but I never managed to get it working.

As far as serving websites directly from HDFS goes, I would say you'd probably 
have better luck writing a pure-java webserver that served directly from the 
java HDFS api.  I've had good experiences serving (internal tools) with jetty 
-- http://docs.codehaus.org/display/JETTY/Embedding+Jetty
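
A minimal sketch of that pure-Java route (the namenode URI, paths, and
class names are assumptions; the servlet would be registered in whatever
container you embed, e.g. Jetty):

import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFileServlet extends HttpServlet {

  private FileSystem fs;

  @Override
  public void init() throws ServletException {
    try {
      // Namenode URI is an assumption - point it at your cluster.
      fs = FileSystem.get(URI.create("hdfs://namenode:9000/"), new Configuration());
    } catch (IOException e) {
      throw new ServletException(e);
    }
  }

  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp)
      throws ServletException, IOException {
    // Map the request path straight onto an HDFS path under /www.
    Path path = new Path("/www" + req.getPathInfo());
    if (!fs.exists(path)) {
      resp.sendError(HttpServletResponse.SC_NOT_FOUND);
      return;
    }
    InputStream in = fs.open(path);
    try {
      IOUtils.copyBytes(in, resp.getOutputStream(), 4096, false);
    } finally {
      in.close();
    }
  }
}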

-Michael

On 4/14/09 6:45 AM, "Stas Oskin"  wrote:

Hi.

Has any succeed running web-server from HDFS?

I mean, to serve websites and application directly from HDFS, perhaps via
FUSE/WebDav?

Regards.



Re: HDFS and web server

2009-04-14 Thread Stas Oskin
Hi.

2009/4/14 Michael Bieniosek 

>  webdav server - https://issues.apache.org/jira/browse/HADOOP-496
> There's a fuse issue somewhere too, but I never managed to get it working.
>
> As far as serving websites directly from HDFS goes, I would say you'd
> probably have better luck writing a pure-java webserver that served directly
> from the java HDFS api.  I've had good experiences serving (internal tools)
> with jetty -- http://docs.codehaus.org/display/JETTY/Embedding+Jetty
>
> -Michael
>
>
Thanks for the advice, but I need PHP support, which Java probably can't do
much about.

Exposing HDFS via FUSE would be ideal, but from what I heard, it's not
considered very stable yet.

Might not be entirely related to this list, but what other measures can be
used to reliably replicate the content over several servers, and get it
ready for HTTP serving?

Regards.


Re: Is combiner and map in same JVM?

2009-04-14 Thread Aaron Kimball
They're in the same JVM, and I believe in the same thread.
- Aaron

On Tue, Apr 14, 2009 at 10:25 AM, Saptarshi Guha
wrote:

> Hello,
> Suppose I have a Hadoop job and have set my combiner to the Reducer class.
> Does the map function and the combiner function run in the same JVM in
> different threads? or in different JVMs?
> I ask because I have to load a native library and if they are in the same
> JVM then the native library is loaded once and I have to take  precautions.
>
> Thank you
> Saptarshi Guha
>


Re: bzip2 input format

2009-04-14 Thread John Heidemann
On Tue, 14 Apr 2009 10:52:02 +0900, "Edward J. Yoon" wrote: 
>Does anyone have a input formatter for bzip2?

Do you mean like this codec (now committed):
https://issues.apache.org/jira/browse/HADOOP-3646

and split support nearly done, pending QA auto-tests:

https://issues.apache.org/jira/browse/HADOOP-4012

?

   -John Heidemann


Re: Map-Reduce Slow Down

2009-04-14 Thread Aaron Kimball
Are there any error messages in the log files on those nodes?
- Aaron

On Tue, Apr 14, 2009 at 9:03 AM, Mithila Nagendra  wrote:

> I ve drawn a blank here! Can't figure out what s wrong with the ports. I
> can
> ssh between the nodes but cant access the DFS from the slaves - says "Bad
> connection to DFS". Master seems to be fine.
> Mithila
>
> On Tue, Apr 14, 2009 at 4:28 AM, Mithila Nagendra 
> wrote:
>
> > Yes I can..
> >
> >
> > On Mon, Apr 13, 2009 at 5:12 PM, Jim Twensky  >wrote:
> >
> >> Can you ssh between the nodes?
> >>
> >> -jim
> >>
> >> On Mon, Apr 13, 2009 at 6:49 PM, Mithila Nagendra 
> >> wrote:
> >>
> >> > Thanks Aaron.
> >> > Jim: The three clusters I setup had ubuntu running on them and the dfs
> >> was
> >> > accessed at port 54310. The new cluster which I ve setup has Red Hat
> >> Linux
> >> > release 7.2 (Enigma)running on it. Now when I try to access the dfs
> from
> >> > one
> >> > of the slaves i get the following response: dfs cannot be accessed.
> When
> >> I
> >> > access the DFS throught the master there s no problem. So I feel there
> a
> >> > problem with the port. Any ideas? I did check the list of slaves, it
> >> looks
> >> > fine to me.
> >> >
> >> > Mithila
> >> >
> >> >
> >> >
> >> >
> >> > On Mon, Apr 13, 2009 at 2:58 PM, Jim Twensky 
> >> > wrote:
> >> >
> >> > > Mithila,
> >> > >
> >> > > You said all the slaves were being utilized in the 3 node cluster.
> >> Which
> >> > > application did you run to test that and what was your input size?
> If
> >> you
> >> > > tried the word count application on a 516 MB input file on both
> >> cluster
> >> > > setups, than some of your nodes in the 15 node cluster may not be
> >> running
> >> > > at
> >> > > all. Generally, one map job is assigned to each input split and if
> you
> >> > are
> >> > > running your cluster with the defaults, the splits are 64 MB each. I
> >> got
> >> > > confused when you said the Namenode seemed to do all the work. Can
> you
> >> > > check
> >> > > conf/slaves and make sure you put the names of all task trackers
> >> there? I
> >> > > also suggest comparing both clusters with a larger input size, say
> at
> >> > least
> >> > > 5 GB, to really see a difference.
> >> > >
> >> > > Jim
> >> > >
> >> > > On Mon, Apr 13, 2009 at 4:17 PM, Aaron Kimball 
> >> > wrote:
> >> > >
> >> > > > in hadoop-*-examples.jar, use "randomwriter" to generate the data
> >> and
> >> > > > "sort"
> >> > > > to sort it.
> >> > > > - Aaron
> >> > > >
> >> > > > On Sun, Apr 12, 2009 at 9:33 PM, Pankil Doshi <
> forpan...@gmail.com>
> >> > > wrote:
> >> > > >
> >> > > > > Your data is too small I guess for 15 clusters ..So it might be
> >> > > overhead
> >> > > > > time of these clusters making your total MR jobs more time
> >> consuming.
> >> > > > > I guess you will have to try with larger set of data..
> >> > > > >
> >> > > > > Pankil
> >> > > > > On Sun, Apr 12, 2009 at 6:54 PM, Mithila Nagendra <
> >> mnage...@asu.edu>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Aaron
> >> > > > > >
> >> > > > > > That could be the issue, my data is just 516MB - wouldn't this
> >> see
> >> > a
> >> > > > bit
> >> > > > > of
> >> > > > > > speed up?
> >> > > > > > Could you guide me to the example? I ll run my cluster on it
> and
> >> > see
> >> > > > what
> >> > > > > I
> >> > > > > > get. Also for my program I had a java timer running to record
> >> the
> >> > > time
> >> > > > > > taken
> >> > > > > > to complete execution. Does Hadoop have an inbuilt timer?
> >> > > > > >
> >> > > > > > Mithila
> >> > > > > >
> >> > > > > > On Mon, Apr 13, 2009 at 1:13 AM, Aaron Kimball <
> >> aa...@cloudera.com
> >> > >
> >> > > > > wrote:
> >> > > > > >
> >> > > > > > > Virtually none of the examples that ship with Hadoop are
> >> designed
> >> > > to
> >> > > > > > > showcase its speed. Hadoop's speedup comes from its ability
> to
> >> > > > process
> >> > > > > > very
> >> > > > > > > large volumes of data (starting around, say, tens of GB per
> >> job,
> >> > > and
> >> > > > > > going
> >> > > > > > > up in orders of magnitude from there). So if you are timing
> >> the
> >> > pi
> >> > > > > > > calculator (or something like that), its results won't
> >> > necessarily
> >> > > be
> >> > > > > > very
> >> > > > > > > consistent. If a job doesn't have enough fragments of data
> to
> >> > > > allocate
> >> > > > > > one
> >> > > > > > > per each node, some of the nodes will also just go unused.
> >> > > > > > >
> >> > > > > > > The best example for you to run is to use randomwriter to
> fill
> >> up
> >> > > > your
> >> > > > > > > cluster with several GB of random data and then run the sort
> >> > > program.
> >> > > > > If
> >> > > > > > > that doesn't scale up performance from 3 nodes to 15, then
> >> you've
> >> > > > > > > definitely
> >> > > > > > > got something strange going on.
> >> > > > > > >
> >> > > > > > > - Aaron
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Sun, Apr 12, 2009 at 8:39 AM, Mithila Nagendra <
> >> > > mnage...@

Re: Is combiner and map in same JVM?

2009-04-14 Thread Owen O'Malley


On Apr 14, 2009, at 10:52 AM, Aaron Kimball wrote:


They're in the same JVM, and I believe in the same thread.


They are the same JVM. They *used* to be the same thread. In either  
0.19 or 0.20, combiners are also called in the reduce JVM if spills  
are required.


-- Owen


Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-14 Thread Brian Bockelman


On Apr 14, 2009, at 12:47 PM, Guilherme Germoglio wrote:


Hi Brian,

I'm sorry but it is not my paper. :-) I've posted the link here  
because
we're always looking for comparison data -- so, I thought this  
benchmark

would be welcome.



Ah, sorry, I guess I was being dense when looking at the author list.   
I'm dense a lot.


Also, I won't attend the conference. However, it would be a good  
idea to
someone who will to ask directly to the authors all these questions  
and

comments and then post their answers here.



It would be interesting!

In the particular field I'm working for (HEP), databases have a long,
colorful history of failure while MapReduce-like approaches (although
colored in very, very different terms and some interesting alternate
optimizations ... perhaps the best way to describe it would be Map-Reduce
on a partially column-oriented, unstructured data store) have survived.


I'm a big fan of "use the right tool for the job".  There are jobs for  
Map-Reduce, there are jobs for DBMS, and (as the authors point out),  
there is overlap and possible cross-pollination between the two.  In  
the end, if the tools get better, everyone wins.


Brian



On Tue, Apr 14, 2009 at 2:26 PM, Brian Bockelman  
wrote:



Hey Guilherme,

It's good to see comparisons, especially as it helps folks understand
better what tool is the best for their problem.  As you show in  
your paper,
a MapReduce system is hideously bad in performing tasks that column- 
store

databases were designed for (selecting a single value along an index,
joining tables).

Some comments:
1) For some of your graphs, you show Hadoop's numbers in half-grey,
half-white.  I can't figure out for the life of me what this  
signifies!

What have I overlooked?
2) I see that one of your co-authors is the CEO/inventor of the  
Vertica DB.
Out of curiosity, how did you interact with Vertica versus Hadoop  
versus
DBMS-X?  Did you get help tuning the systems from the experts?   
I.e., if you
sat down with a Hadoop expert for a few days, I'm certain you could  
squeeze
out more performance, just like whenever I sit down with an Oracle  
DBA for a
few hours, my DB queries are much faster.  You touch upon the  
sociological
issues (having to program your own code versus having to only know  
SQL, as
well as the comparative time it took to set up the DB) - I'd like  
to hear
how much time you spent "tweaking" and learning the best practices  
for the
three, very different approaches.  If you added a 5th test, what's  
the

marginal effort required?
3) It would be nice to see how some of your more DB-like tasks  
perform on
something like HBase.  That'd be a much more apples-to-apples  
comparison of
column-store DBMS versus column-store data system, although the  
HBase work
is just now revving up.  I'm a bit uninformed in that area, so I  
don't have

a good gut in how that'd do.
4) I think that the UDF aggregation task (calculating the inlink  
count for
each document in a sample) is interesting - it's a more Map-Reduce  
oriented

task, and it sounds like it was fairly miserable to hack around the
limitations / bugs in the DBMS.
5) I really think you undervalue the benefits of replication and
reliability, especially in terms of cost.  As someone who helps  
with a small
site (about 300 machines) that range from commodity workers to Sun  
Thumpers,
if your site depends on all your storage nodes functioning, then  
your costs
go way up.  You can't make cheap hardware scale unless your  
software can

account for it.
- Yes, I realize this is a different approach than you take.  There  
are
pros and cons to large expensive hardware versus lots of cheap  
hardware ...
the argument has been going on since the dawn of time.  However,  
it's a bit
unfair to just outright dismiss one approach.  I am a bit wary of  
the claims
that your results can scale up to Google/Yahoo scale, but I do  
agree that

there are truly few users that are that large!

I love your last paragraph, it's a very good conclusion.  It kind of
reminds me of the grid computing field which was (is?) completely  
shocked by

the emergence of cloud computing.  After you cut through the hype
surrounding the new fads, you find (a) that there are some very  
good reasons
that the fads are popular - they have definite strengths that the  
existing
field was missing (or didn't want to hear) and (b) there's a lot of  
common

ground and learning that has to be done, even to get a good common
terminology :)

Enjoy your conference!

Brian

On Apr 14, 2009, at 9:16 AM, Guilherme Germoglio wrote:

(Hadoop is used in the benchmarks)


http://database.cs.brown.edu/sigmod09/

There is currently considerable enthusiasm around the MapReduce
(MR) paradigm for large-scale data analysis [17]. Although the
basic control flow of this framework has existed in parallel SQL
database management systems (DBMS) for over 20 years, some
have called MR a dramatically new computing model [8, 17]. In
t

How large is one file split?

2009-04-14 Thread Foss User
In the documentation I was reading that files are stored as file
splits in the HDFS. What is the size of each file split? Is it
configurable? If yes, how can I configure it?


Re: Is combiner and map in same JVM?

2009-04-14 Thread Saptarshi Guha
Thanks. I am using 0.19, and to confirm, the map and combiner (in the map
jvm) are run in *different* threads at the same time?
My native library is not thread safe, so I would have to implement locks.
Aaron's email gave me hope (since the map and combiner would then be running
sequentially), but this appears to make things complicated.


Saptarshi Guha


On Tue, Apr 14, 2009 at 2:01 PM, Owen O'Malley  wrote:

>
> On Apr 14, 2009, at 10:52 AM, Aaron Kimball wrote:
>
>  They're in the same JVM, and I believe in the same thread.
>>
>
> They are the same JVM. They *used* to be the same thread. In either 0.19 or
> 0.20, combiners are also called in the reduce JVM if spills are required.
>
> -- Owen
>


Announcing CloudBase-1.3 release

2009-04-14 Thread Tarandeep Singh
Hi,

We have released 1.3 version of CloudBase on sourceforge-
http://cloudbase.sourceforge.net/

[ CloudBase is a data warehouse system built on top of Hadoop's Map-Reduce
architecture. It uses ANSI SQL as its query language and comes with a JDBC
driver. It is developed by Business.com and is released to the open source
community under GNU GPL license]

Please give it a try and send us your feedback on CloudBase users group-
http://groups.google.com/group/cloudbase-users

Thanks,
Tarandeep

Release notes-
--

New Features:

* User Defined Types (UDTs) - Users can create new types and use them to
create tables of those types. UDTs are used when data is not structured. The user
creates a Java class and defines the parsing logic in the constructor. Using
UDTs, one can query semi-structured or totally unstructured data using SQL.

* Update index - An Update Index command has been added to update the index when
new data is added to a table.

* CREATE JSON/XML tables - One can create tables on top of one's JSON/XML
data and query them using SQL.

Bug fixes:

* DROP table does not drop indexes
* Timestamp part is not shown on Squirrel for DateTime data type
* Range queries not working on Indexed VARCHAR column
* When inserting data to external database, the command would fail if
autocommit is set to true for the database
* CloudBase ignores the case of the external table names.
* Create index does not check if an index is already created on the column
* When submitting multiple queries, the previous queries fail

NOTE: One needs to use the new driver version (cloudbasejdbc-1.3.jar) to connect
to CloudBase-1.3. If you are using Squirrel, re-register your CloudBase JDBC
driver.

Online documentation has been updated with new features-
http://cloudbase.sourceforge.net/index.html#userDoc


Re: Using 3rd party Api in Map class

2009-04-14 Thread Farhan Husain
Hello,

I got another solution for this. I just pasted all the required jar files into
the lib folder of each Hadoop node. This way the job jar is not too big and
takes less time to distribute across the cluster.

Thanks,
Farhan

On Mon, Apr 13, 2009 at 7:22 PM, Nick Cen  wrote:

> create a directroy call 'lib' in your project's root dir, then put all the
> 3rd party jar in it.
>
> 2009/4/14 Farhan Husain 
>
> > Hello,
> >
> > I am trying to use Pellet library for some OWL inferencing in my mapper
> > class. But I can't find a way to bundle the library jar files in my job
> jar
> > file. I am exporting my project as a jar file from Eclipse IDE. Will it
> > work
> > if I create the jar manually and include all the jar files Pellet library
> > has? Is there any simpler way to include 3rd party library jar files in a
> > hadoop job jar? Without being able to include the library jars I am
> getting
> > ClassNotFoundException.
> >
> > Thanks,
> > Farhan
> >
>
>
>
> --
> http://daily.appspot.com/food/
>
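A third option that avoids copying jars onto every node is to ship them per
job with the generic -libjars option, assuming the driver goes through
ToolRunner so the generic options get parsed (a sketch; class and jar names
are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class OwlJobDriver extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // getConf() already contains whatever -libjars / -D options ToolRunner parsed
    JobConf conf = new JobConf(getConf(), OwlJobDriver.class);
    // ... set input/output paths, mapper and reducer classes here ...
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // The jars named after -libjars are shipped with the job and put on the task classpath.
    System.exit(ToolRunner.run(new Configuration(), new OwlJobDriver(), args));
  }
}

It would then be invoked as something like
bin/hadoop jar myjob.jar OwlJobDriver -libjars pellet.jar,other.jar input output
so the extra jars ride along with the job rather than living in every node's
lib directory.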


Re: Is combiner and map in same JVM?

2009-04-14 Thread Owen O'Malley


On Apr 14, 2009, at 11:10 AM, Saptarshi Guha wrote:

Thanks. I am using 0.19, and to confirm, the map and combiner (in  
the map jvm) are run in *different* threads at the same time?


And the change was actually made in 0.18. So since then, the combiner  
is called 0, 1, or many times on each key in both the mapper and the  
reducer. It is called in a separate thread from the base application  
in the map (in the reduce task, the combiner is only used during the
shuffle).


My native library is not thread safe, so I would have to implement  
locks. Aaron's email gave me hope(since the map and combiner would  
then be running sequentially), but this appears to make things  
complicated.


Yes, you'll probably need locks around your code that isn't thread safe.
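Something along these lines (a rough sketch against the 0.18/0.19 API;
NativeLib here is only a stand-in for the real non-thread-safe binding):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class LockingCombiner extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  // One lock per JVM, shared by the map thread and the combiner/spill thread.
  private static final Object NATIVE_LOCK = new Object();

  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    // Only the call into the non-thread-safe native library needs the lock;
    // the mapper would synchronize on the same object around its own native calls.
    synchronized (NATIVE_LOCK) {
      NativeLib.record(key.toString(), sum);   // hypothetical native method
    }
    output.collect(key, new LongWritable(sum));
  }
}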

-- Owen




Adding 3rd party API / Libraries to Hadoop

2009-04-14 Thread asif md
To do this, just add the relevant jars to the lib directory of Hadoop.

Thanx.


Re: Map-Reduce Slow Down

2009-04-14 Thread Mithila Nagendra
Aaron: Which log file do I look into - there are a lot of them. Here's what
the error looks like:
[mith...@node19:~]$ cd hadoop
[mith...@node19:~/hadoop]$ bin/hadoop dfs -ls
09/04/14 10:09:29 INFO ipc.Client: Retrying connect to server: node18/
192.168.0.18:54310. Already tried 0 time(s).
09/04/14 10:09:30 INFO ipc.Client: Retrying connect to server: node18/
192.168.0.18:54310. Already tried 1 time(s).
09/04/14 10:09:31 INFO ipc.Client: Retrying connect to server: node18/
192.168.0.18:54310. Already tried 2 time(s).
09/04/14 10:09:32 INFO ipc.Client: Retrying connect to server: node18/
192.168.0.18:54310. Already tried 3 time(s).
09/04/14 10:09:33 INFO ipc.Client: Retrying connect to server: node18/
192.168.0.18:54310. Already tried 4 time(s).
09/04/14 10:09:34 INFO ipc.Client: Retrying connect to server: node18/
192.168.0.18:54310. Already tried 5 time(s).
09/04/14 10:09:35 INFO ipc.Client: Retrying connect to server: node18/
192.168.0.18:54310. Already tried 6 time(s).
09/04/14 10:09:36 INFO ipc.Client: Retrying connect to server: node18/
192.168.0.18:54310. Already tried 7 time(s).
09/04/14 10:09:37 INFO ipc.Client: Retrying connect to server: node18/
192.168.0.18:54310. Already tried 8 time(s).
09/04/14 10:09:38 INFO ipc.Client: Retrying connect to server: node18/
192.168.0.18:54310. Already tried 9 time(s).
Bad connection to FS. command aborted.

Node19 is a slave and Node18 is the master.

Mithila



On Tue, Apr 14, 2009 at 8:53 PM, Aaron Kimball  wrote:

> Are there any error messages in the log files on those nodes?
> - Aaron
>
> On Tue, Apr 14, 2009 at 9:03 AM, Mithila Nagendra 
> wrote:
>
> > I ve drawn a blank here! Can't figure out what s wrong with the ports. I
> > can
> > ssh between the nodes but cant access the DFS from the slaves - says "Bad
> > connection to DFS". Master seems to be fine.
> > Mithila
> >
> > On Tue, Apr 14, 2009 at 4:28 AM, Mithila Nagendra 
> > wrote:
> >
> > > Yes I can..
> > >
> > >
> > > On Mon, Apr 13, 2009 at 5:12 PM, Jim Twensky  > >wrote:
> > >
> > >> Can you ssh between the nodes?
> > >>
> > >> -jim
> > >>
> > >> On Mon, Apr 13, 2009 at 6:49 PM, Mithila Nagendra 
> > >> wrote:
> > >>
> > >> > Thanks Aaron.
> > >> > Jim: The three clusters I setup had ubuntu running on them and the
> dfs
> > >> was
> > >> > accessed at port 54310. The new cluster which I ve setup has Red Hat
> > >> Linux
> > >> > release 7.2 (Enigma)running on it. Now when I try to access the dfs
> > from
> > >> > one
> > >> > of the slaves i get the following response: dfs cannot be accessed.
> > When
> > >> I
> > >> > access the DFS throught the master there s no problem. So I feel
> there
> > a
> > >> > problem with the port. Any ideas? I did check the list of slaves, it
> > >> looks
> > >> > fine to me.
> > >> >
> > >> > Mithila
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On Mon, Apr 13, 2009 at 2:58 PM, Jim Twensky  >
> > >> > wrote:
> > >> >
> > >> > > Mithila,
> > >> > >
> > >> > > You said all the slaves were being utilized in the 3 node cluster.
> > >> Which
> > >> > > application did you run to test that and what was your input size?
> > If
> > >> you
> > >> > > tried the word count application on a 516 MB input file on both
> > >> cluster
> > >> > > setups, than some of your nodes in the 15 node cluster may not be
> > >> running
> > >> > > at
> > >> > > all. Generally, one map job is assigned to each input split and if
> > you
> > >> > are
> > >> > > running your cluster with the defaults, the splits are 64 MB each.
> I
> > >> got
> > >> > > confused when you said the Namenode seemed to do all the work. Can
> > you
> > >> > > check
> > >> > > conf/slaves and make sure you put the names of all task trackers
> > >> there? I
> > >> > > also suggest comparing both clusters with a larger input size, say
> > at
> > >> > least
> > >> > > 5 GB, to really see a difference.
> > >> > >
> > >> > > Jim
> > >> > >
> > >> > > On Mon, Apr 13, 2009 at 4:17 PM, Aaron Kimball <
> aa...@cloudera.com>
> > >> > wrote:
> > >> > >
> > >> > > > in hadoop-*-examples.jar, use "randomwriter" to generate the
> data
> > >> and
> > >> > > > "sort"
> > >> > > > to sort it.
> > >> > > > - Aaron
> > >> > > >
> > >> > > > On Sun, Apr 12, 2009 at 9:33 PM, Pankil Doshi <
> > forpan...@gmail.com>
> > >> > > wrote:
> > >> > > >
> > >> > > > > Your data is too small I guess for 15 clusters ..So it might
> be
> > >> > > overhead
> > >> > > > > time of these clusters making your total MR jobs more time
> > >> consuming.
> > >> > > > > I guess you will have to try with larger set of data..
> > >> > > > >
> > >> > > > > Pankil
> > >> > > > > On Sun, Apr 12, 2009 at 6:54 PM, Mithila Nagendra <
> > >> mnage...@asu.edu>
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > Aaron
> > >> > > > > >
> > >> > > > > > That could be the issue, my data is just 516MB - wouldn't
> this
> > >> see
> > >> > a
> > >> > > > bit
> > >> > > > > of
> > >> > > > > > speed up?
> > >> > > > > > Could you guide me to the example? I ll run my cluster on 

Re: How large is one file split?

2009-04-14 Thread Jim Twensky
Files are stored as blocks and the default block size is 64MB. You can
change this by setting the dfs.block.size property. Map/Reduce interprets
files in large chunks of bytes and these are called splits. Splits are not
physical; think of them as logical data structures that tell you
the starting byte position in the file and the length of the split. Each
mapper generally takes a split and processes it. You can also configure the
minimum split size by setting the mapred.min.split.size property. Hope this
helps.
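
For example, in the job driver (a sketch; the property names are the 0.19
defaults, and MyJob is a placeholder class):

JobConf conf = new JobConf(MyJob.class);
// Ask for splits of at least 128 MB; FileInputFormat will not make splits smaller than this.
conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);
// dfs.block.size only matters for files written with this configuration,
// e.g. a 128 MB block size for new output files:
conf.setLong("dfs.block.size", 128L * 1024 * 1024);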

-Jim

On Tue, Apr 14, 2009 at 1:05 PM, Foss User  wrote:

> In the documentation I was reading that files are stored as file
> splits in the HDFS. What is the size of each file split? Is it
> configurable? If yes, how can I configure it?
>


Re: Map-Reduce Slow Down

2009-04-14 Thread Mithila Nagendra
Also, would the way the port is accessed change if all these nodes are
connected through a gateway? I mean in the hadoop-site.xml file? The Ubuntu
systems we worked with earlier didn't have a gateway.
Mithila

On Tue, Apr 14, 2009 at 9:48 PM, Mithila Nagendra  wrote:

> Aaron: Which log file do I look into - there are alot of them. Here s what
> the error looks like:
> [mith...@node19:~]$ cd hadoop
> [mith...@node19:~/hadoop]$ bin/hadoop dfs -ls
> 09/04/14 10:09:29 INFO ipc.Client: Retrying connect to server: node18/
> 192.168.0.18:54310. Already tried 0 time(s).
> 09/04/14 10:09:30 INFO ipc.Client: Retrying connect to server: node18/
> 192.168.0.18:54310. Already tried 1 time(s).
> 09/04/14 10:09:31 INFO ipc.Client: Retrying connect to server: node18/
> 192.168.0.18:54310. Already tried 2 time(s).
> 09/04/14 10:09:32 INFO ipc.Client: Retrying connect to server: node18/
> 192.168.0.18:54310. Already tried 3 time(s).
> 09/04/14 10:09:33 INFO ipc.Client: Retrying connect to server: node18/
> 192.168.0.18:54310. Already tried 4 time(s).
> 09/04/14 10:09:34 INFO ipc.Client: Retrying connect to server: node18/
> 192.168.0.18:54310. Already tried 5 time(s).
> 09/04/14 10:09:35 INFO ipc.Client: Retrying connect to server: node18/
> 192.168.0.18:54310. Already tried 6 time(s).
> 09/04/14 10:09:36 INFO ipc.Client: Retrying connect to server: node18/
> 192.168.0.18:54310. Already tried 7 time(s).
> 09/04/14 10:09:37 INFO ipc.Client: Retrying connect to server: node18/
> 192.168.0.18:54310. Already tried 8 time(s).
> 09/04/14 10:09:38 INFO ipc.Client: Retrying connect to server: node18/
> 192.168.0.18:54310. Already tried 9 time(s).
> Bad connection to FS. command aborted.
>
> Node19 is a slave and Node18 is the master.
>
> Mithila
>
>
>
> On Tue, Apr 14, 2009 at 8:53 PM, Aaron Kimball  wrote:
>
>> Are there any error messages in the log files on those nodes?
>> - Aaron
>>
>> On Tue, Apr 14, 2009 at 9:03 AM, Mithila Nagendra 
>> wrote:
>>
>> > I ve drawn a blank here! Can't figure out what s wrong with the ports. I
>> > can
>> > ssh between the nodes but cant access the DFS from the slaves - says
>> "Bad
>> > connection to DFS". Master seems to be fine.
>> > Mithila
>> >
>> > On Tue, Apr 14, 2009 at 4:28 AM, Mithila Nagendra 
>> > wrote:
>> >
>> > > Yes I can..
>> > >
>> > >
>> > > On Mon, Apr 13, 2009 at 5:12 PM, Jim Twensky > > >wrote:
>> > >
>> > >> Can you ssh between the nodes?
>> > >>
>> > >> -jim
>> > >>
>> > >> On Mon, Apr 13, 2009 at 6:49 PM, Mithila Nagendra 
>> > >> wrote:
>> > >>
>> > >> > Thanks Aaron.
>> > >> > Jim: The three clusters I setup had ubuntu running on them and the
>> dfs
>> > >> was
>> > >> > accessed at port 54310. The new cluster which I ve setup has Red
>> Hat
>> > >> Linux
>> > >> > release 7.2 (Enigma)running on it. Now when I try to access the dfs
>> > from
>> > >> > one
>> > >> > of the slaves i get the following response: dfs cannot be accessed.
>> > When
>> > >> I
>> > >> > access the DFS throught the master there s no problem. So I feel
>> there
>> > a
>> > >> > problem with the port. Any ideas? I did check the list of slaves,
>> it
>> > >> looks
>> > >> > fine to me.
>> > >> >
>> > >> > Mithila
>> > >> >
>> > >> >
>> > >> >
>> > >> >
>> > >> > On Mon, Apr 13, 2009 at 2:58 PM, Jim Twensky <
>> jim.twen...@gmail.com>
>> > >> > wrote:
>> > >> >
>> > >> > > Mithila,
>> > >> > >
>> > >> > > You said all the slaves were being utilized in the 3 node
>> cluster.
>> > >> Which
>> > >> > > application did you run to test that and what was your input
>> size?
>> > If
>> > >> you
>> > >> > > tried the word count application on a 516 MB input file on both
>> > >> cluster
>> > >> > > setups, than some of your nodes in the 15 node cluster may not be
>> > >> running
>> > >> > > at
>> > >> > > all. Generally, one map job is assigned to each input split and
>> if
>> > you
>> > >> > are
>> > >> > > running your cluster with the defaults, the splits are 64 MB
>> each. I
>> > >> got
>> > >> > > confused when you said the Namenode seemed to do all the work.
>> Can
>> > you
>> > >> > > check
>> > >> > > conf/slaves and make sure you put the names of all task trackers
>> > >> there? I
>> > >> > > also suggest comparing both clusters with a larger input size,
>> say
>> > at
>> > >> > least
>> > >> > > 5 GB, to really see a difference.
>> > >> > >
>> > >> > > Jim
>> > >> > >
>> > >> > > On Mon, Apr 13, 2009 at 4:17 PM, Aaron Kimball <
>> aa...@cloudera.com>
>> > >> > wrote:
>> > >> > >
>> > >> > > > in hadoop-*-examples.jar, use "randomwriter" to generate the
>> data
>> > >> and
>> > >> > > > "sort"
>> > >> > > > to sort it.
>> > >> > > > - Aaron
>> > >> > > >
>> > >> > > > On Sun, Apr 12, 2009 at 9:33 PM, Pankil Doshi <
>> > forpan...@gmail.com>
>> > >> > > wrote:
>> > >> > > >
>> > >> > > > > Your data is too small I guess for 15 clusters ..So it might
>> be
>> > >> > > overhead
>> > >> > > > > time of these clusters making your total MR jobs more time
>> > >> consuming.
>> > >> > > > > I guess you will h

Re: Total number of records processed in mapper

2009-04-14 Thread Jim Twensky
Hi Andy,

Take a look at this piece of code:

Counters counters = job.getCounters();
counters.findCounter("org.apache.hadoop.mapred.Task$Counter",
"REDUCE_INPUT_RECORDS").getCounter()

This is for reduce input records but I believe there is also a counter for
reduce output records. You should dig into the source code to find out what
it is because unfortunately, the default counters associated with the
map/reduce jobs are not public yet.
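
One way to get that number to the reducers of a follow-up job is to read the
counter on the client side after job 1 and stash it in job 2's configuration.
A sketch along those lines (it assumes one map input record per document;
"my.total.docs" and the class names are made up):

// firstConf / secondConf are ordinary JobConf objects for the two jobs;
// MyFirstJob / MySecondJob are placeholder driver classes.
JobConf firstConf = new JobConf(MyFirstJob.class);
// ... configure the first job as usual ...
RunningJob first = JobClient.runJob(firstConf);
long totalDocs = first.getCounters()
    .findCounter("org.apache.hadoop.mapred.Task$Counter", "MAP_INPUT_RECORDS")
    .getCounter();

JobConf secondConf = new JobConf(MySecondJob.class);
secondConf.setLong("my.total.docs", totalDocs);
// Reducers of the second job read it back in configure():
//   totalDocs = job.getLong("my.total.docs", -1);
JobClient.runJob(secondConf);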

-Jim


On Tue, Apr 14, 2009 at 11:19 AM, Andy Liu  wrote:

> Is there a way for all the reducers to have access to the total number of
> records that were processed in the Map phase?
>
> For example, I'm trying to perform a simple document frequency calculation.
> During the map phase, I emit <word, 1> pairs for every unique word in every
> document.  During the reduce phase, I sum the values for each word group.
> Then I want to divide that value by the total number of documents.
>
> I suppose I can create a whole separate m/r job whose sole purpose is to
> count all the records, then pass that number on.  Is there a more
> straighforward way of doing this?
>
> Andy
>


Hadoop User Group - DC meeting tomorrow

2009-04-14 Thread Sullivan, Joshua [USA]
REMINDER  

 

The DC area Hadoop User Group is meeting tomorrow.  Full details at:
http://www.meetup.com/Hadoop-DC/calendar/10073493/

 

Christophe Bisciglia and Dr. Jimmy Lin will be speaking.

 

Cloudera's Founder, Christophe Bisciglia, will give a talk about
simplifying Hadoop configuration, deployment and management using
Cloudera's Distribution for Hadoop. He'll talk about why Cloudera
packaged a Hadoop distribution, and what specific challenges they are
trying to address. 

 

From Christophe: One of the most consistent issues we hear from the
community is that Hadoop is a pain to deploy and manage. We released our
distribution to take steps towards making Hadoop play nicely with
standard systems already used to deploy and manage software. In our
initial release, this means RPM packaging and standard linux service
management, but we don't plan to stop here. We'll provide a walkthrough
of deploying Hadoop using our distribution, but more importantly, listen
to the community about what you like, what you don't, and what you'd
like to see going forward.

 

Dr. Jimmy Lin publishes articles and code libraries using Hadoop, along
with extensive research into large-data distributed applications with
MapReduce. Dr. Lin will be discussing his work with Hadoop and
data-analysis.

 

Pizza, sandwiches, and drinks will be provided by Booz Allen Hamilton.

 

This will be a really fun event so hope to see you there!

 

 



Distributed Agent

2009-04-14 Thread Burak ISIKLI
Hello everyone,
I want to write a distributed agent program. But I can't understand one thing:
what is the difference between a client-server program and an agent program?
Please help me...

 
 


Burak ISIKLI
Dumlupinar University
Electric & Electronic - Computer Engineering
 
http://burakisikli.wordpress.com
http://burakisikli.blogspot.com




  

Re: HDFS as a logfile ??

2009-04-14 Thread Ariel Rabkin
Everything gets dumped into the same files.

We don't assume anything at all about the format of the input data; it
gets dumped into Hadoop sequence files, tagged with some metadata to
say what machine and app it came from, and where it was in the
original stream.

There is a slight penalty from the log-to-local disk. In practice, you
often want a local copy anyway, both for redundancy and because it's
much more convenient for human inspection.  Having a separate
collector process is indeed inelegant. However, HDFS copes badly with
many small files, so that pushes you to merge entries across either
hosts or time partitions. And since HDFS doesn't have a flush(),
having one log per source means that log entries don't become visible
quickly enough.   Hence, collectors.

It isn't gorgeous, but it seems to work fine in practice.
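
For anyone curious, the collector-side write boils down to something like
this (a bare-bones sketch, not Chukwa's actual record format; the key layout
and path are just illustrative):

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path out = new Path("/logs/collector-0001.seq");    // hypothetical output file

// One writer per collector; many hosts/apps get merged into this single file.
SequenceFile.Writer writer =
    SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
try {
  // One record per log entry, tagged with where it came from.
  writer.append(new Text("host=nodeX app=appA offset=1234"),
                new Text("the raw log line"));
} finally {
  writer.close();
}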

On Mon, Apr 13, 2009 at 8:01 AM, Ricky Ho  wrote:
> Ari, thanks for your note.
>
> Like to understand more how Chukwa group log entries ... If I have appA 
> running in machine X, Y and appB running in machine Y, Z.  Each of them 
> calling the Chukwa log API.
>
> Do I have all entries going in the same HDFS file ?  or 4 separated HDFS 
> files based on the App/Machine combination ?
>
> If the answer of first Q is "yes", then what if appA and appB has different 
> format of log entries ?
> If the answer of second Q is "yes", then are all these HDFS files cut at the 
> same time boundary ?
>
> Looks like in Chukwa, application first log to a daemon, which buffer-write 
> the log entries into a local file.  And there is a separate process to ship 
> these data to a remote collector daemon which issue the actual HDFS write.  I 
> observe the following overhead ...
>
> 1) The overhead of extra write to local disk and ship the data over to the 
> collector.  If HDFS supports append, the application can then go directly to 
> the HDFS
>
> 2) The centralized collector establish a bottleneck to the otherwise 
> perfectly parallel HDFS architecture.
>
> Am I missing something here ?
>

-- 
Ari Rabkin asrab...@gmail.com
UC Berkeley Computer Science Department


Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-14 Thread Andrew Newman
They are comparing an indexed system with one that isn't.  Why is
Hadoop faster at loading than the others?  Surely no one would be
surprised that it would be slower - I'm surprised at how well Hadoop
does.  Who wants to write a paper for next year, "grep vs reverse
index"?

2009/4/15 Guilherme Germoglio :
> (Hadoop is used in the benchmarks)
>
> http://database.cs.brown.edu/sigmod09/
>


DBOutputFormat - Communications link failure

2009-04-14 Thread Streckfus, William [USA]
Hey guys,
 
I'm trying my hand at outputting into a MySQL table but I'm running into
a "Communications link failure" during the reduce (in the
getRecordReader() method of DBOutputFormat to be more specific). Google
tells me this seems to happen when a SQL server drops the client
(usually after a period of time) but I'm wondering why that would be the
case here. I also threw in a cookie cutter connection at the beginning
of my driver and that seems to be connecting fine:

Connection conn = null;
try {
Class.forName("com.mysql.jdbc.Driver").newInstance();
conn = DriverManager.getConnection("jdbc:mysql://localhost/wiki",
"user", "pass");
System.out.println ("Database connection established");
conn.close();
} catch (Exception e) {
e.printStackTrace();
}

Here's the main snippet:
conf.setOutputKeyClass(TermFrequencyWritable.class); // My
DBWritable
conf.setOutputValueClass(NullWritable.class);
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(TermFrequencyWritable.class);

conf.setOutputFormat(DBOutputFormat.class);

DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
"jdbc:mysql://localhost/wiki", "user", "pass");
DBOutputFormat.setOutput(conf, "table", "word", "docid",
"frequency");
 
Has anyone encountered this before or know where this is going wrong?
 
Thanks,
- Bill


Re: More Replication on dfs

2009-04-14 Thread Alex Loddengaard
Ah, I didn't realize you were using HBase.  It could definitely be the case
that HBase is explicitly setting file replication to 1 for certain files.
Unfortunately I don't know enough about HBase to know if or why certain
files are set to have no replication.  This might be a good question for the
HBase user list, or perhaps grepping the HBase source might tell you
something interesting.  Can you confirm that the under-replicated files are
created by HBase?

Alex

On Sat, Apr 11, 2009 at 7:00 AM, Puri, Aseem wrote:

> Alex,
>
> Ouput of $ bin/hadoop fsck / command after running HBase data insert
> command in a table is:
>
> .
> .
> .
> .
> .
> /hbase/test/903188508/tags/info/4897652949308499876:  Under replicated
> blk_-5193
> 695109439554521_3133. Target Replicas is 3 but found 1 replica(s).
> .
> /hbase/test/903188508/tags/mapfiles/4897652949308499876/data:  Under
> replicated
> blk_-1213602857020415242_3132. Target Replicas is 3 but found 1
> replica(s).
> .
> /hbase/test/903188508/tags/mapfiles/4897652949308499876/index:  Under
> replicated
>  blk_3934493034551838567_3132. Target Replicas is 3 but found 1
> replica(s).
> .
> /user/HadoopAdmin/hbase table.doc:  Under replicated
> blk_4339521803948458144_103
> 1. Target Replicas is 3 but found 2 replica(s).
> .
> /user/HadoopAdmin/input/bin.doc:  Under replicated
> blk_-3661765932004150973_1030
> . Target Replicas is 3 but found 2 replica(s).
> .
> /user/HadoopAdmin/input/file01.txt:  Under replicated
> blk_2744169131466786624_10
> 01. Target Replicas is 3 but found 2 replica(s).
> .
> /user/HadoopAdmin/input/file02.txt:  Under replicated
> blk_2021956984317789924_10
> 02. Target Replicas is 3 but found 2 replica(s).
> .
> /user/HadoopAdmin/input/test.txt:  Under replicated
> blk_-3062256167060082648_100
> 4. Target Replicas is 3 but found 2 replica(s).
> ...
> /user/HadoopAdmin/output/part-0:  Under replicated
> blk_8908973033976428484_1
> 010. Target Replicas is 3 but found 2 replica(s).
> Status: HEALTHY
>  Total size:48510226 B
>  Total dirs:492
>  Total files:   439 (Files currently being written: 2)
>  Total blocks (validated):  401 (avg. block size 120973 B) (Total
> open file
> blocks (not validated): 2)
>  Minimally replicated blocks:   401 (100.0 %)
>  Over-replicated blocks:0 (0.0 %)
>  Under-replicated blocks:   399 (99.50124 %)
>  Mis-replicated blocks: 0 (0.0 %)
>  Default replication factor:2
>  Average block replication: 1.3117207
>  Corrupt blocks:0
>  Missing replicas:  675 (128.327 %)
>  Number of data-nodes:  2
>  Number of racks:   1
>
>
> The filesystem under path '/' is HEALTHY
> Please tell what is wrong.
>
> Aseem
>
> -Original Message-
> From: Alex Loddengaard [mailto:a...@cloudera.com]
> Sent: Friday, April 10, 2009 11:04 PM
> To: core-user@hadoop.apache.org
> Subject: Re: More Replication on dfs
>
> Aseem,
>
> How are you verifying that blocks are not being replicated?  Have you
> ran
> fsck?  *bin/hadoop fsck /*
>
> I'd be surprised if replication really wasn't happening.  Can you run
> fsck
> and pay attention to "Under-replicated blocks" and "Mis-replicated
> blocks?"
> In fact, can you just copy-paste the output of fsck?
>
> Alex
>
> On Thu, Apr 9, 2009 at 11:23 PM, Puri, Aseem
> wrote:
>
> >
> > Hi
> >I also tried the command $ bin/hadoop balancer. But still the
> > same problem.
> >
> > Aseem
> >
> > -Original Message-
> > From: Puri, Aseem [mailto:aseem.p...@honeywell.com]
> > Sent: Friday, April 10, 2009 11:18 AM
> > To: core-user@hadoop.apache.org
> > Subject: RE: More Replication on dfs
> >
> > Hi Alex,
> >
> >Thanks for sharing your knowledge. Till now I have three
> > machines and I have to check the behavior of Hadoop so I want
> > replication factor should be 2. I started my Hadoop server with
> > replication factor 3. After that I upload 3 files to implement word
> > count program. But as my all files are stored on one machine and
> > replicated to other datanodes also, so my map reduce program takes
> input
> > from one Datanode only. I want my files to be on different data node
> so
> > to check functionality of map reduce properly.
> >
> >Also before starting my Hadoop server again with replication
> > factor 2 I formatted all Datanodes and deleted all old data manually.
> >
> > Please suggest what I should do now.
> >
> > Regards,
> > Aseem Puri
> >
> >
> > -Original Message-
> > From: Mithila Nagendra [mailto:mnage...@asu.edu]
> > Sent: Friday, April 10, 2009 10:56 AM
> > To: core-user@hadoop.apache.org
> > Subject: Re: More Replication on dfs
> >
> > To add to the question, how does one decide what is the optimal
> > replication
> > factor for a cluster. For instance what would be the appropriate
> > replication
> > factor for a cluster consisting of 5 nodes.
> > Mithila
> >
> > On Fri, Apr 10, 2009 at 8:20 AM, Alex Loddengaard 
> > wrote:
> >
> > >

Re: getting DiskErrorException during map

2009-04-14 Thread Alex Loddengaard
First, did you bounce the Hadoop daemons after you changed the configuration
files?  I think you'll have to do this.

Second, I believe 0.19.1 has hadoop-default.xml baked into the jar.  Try
setting $HADOOP_CONF_DIR to the directory where hadoop-site.xml lives.  For
whatever reason your hadoop-site.xml (and the hadoop-default.xml you tried
to change) are probably not being loaded.  $HADOOP_CONF_DIR should fix this.

Good luck!

Alex

On Mon, Apr 13, 2009 at 11:25 AM, Jim Twensky  wrote:

> Thank you Alex, you are right. There are quotas on the systems that I'm
> working. However, I tried to change mapred.local.dir as follows:
>
> --inside hadoop-site.xml:
>
> <property>
>   <name>mapred.child.tmp</name>
>   <value>/scratch/local/jim</value>
> </property>
> <property>
>   <name>hadoop.tmp.dir</name>
>   <value>/scratch/local/jim</value>
> </property>
> <property>
>   <name>mapred.local.dir</name>
>   <value>/scratch/local/jim</value>
> </property>
>
>  and observed that the intermediate map outputs are still being written
> under /tmp/hadoop-jim/mapred/local
>
> I'm confused at this point since I also tried setting these values directly
> inside the hadoop-default.xml and that didn't work either. Is there any
> other property that I'm supposed to change? I tried searching for "/tmp" in
> the hadoop-default.xml file but couldn't find anything else.
>
> Thanks,
> Jim
>
>
> On Tue, Apr 7, 2009 at 9:35 PM, Alex Loddengaard 
> wrote:
>
> > The getLocalPathForWrite function that throws this Exception assumes that
> > you have space on the disks that mapred.local.dir is configured on.  Can
> > you
> > verify with `df` that those disks have space available?  You might also
> try
> > moving mapred.local.dir off of /tmp if it's configured to use /tmp right
> > now; I believe some systems have quotas on /tmp.
> >
> > Hope this helps.
> >
> > Alex
> >
> > On Tue, Apr 7, 2009 at 7:22 PM, Jim Twensky 
> wrote:
> >
> > > Hi,
> > >
> > > I'm using Hadoop 0.19.1 and I have a very small test cluster with 9
> > nodes,
> > > 8
> > > of them being task trackers. I'm getting the following error and my
> jobs
> > > keep failing when map processes start hitting 30%:
> > >
> > > org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> any
> > > valid local directory for
> > >
> > >
> >
> taskTracker/jobcache/job_200904072051_0001/attempt_200904072051_0001_m_00_1/output/file.out
> > >at
> > >
> > >
> >
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
> > >at
> > >
> > >
> >
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
> > >at
> > >
> > >
> >
> org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
> > >at
> > >
> > >
> >
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1209)
> > >at
> > >
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:867)
> > >at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
> > >at org.apache.hadoop.mapred.Child.main(Child.java:158)
> > >
> > >
> > > I googled many blogs and web pages but I could neither understand why
> > this
> > > happens nor found a solution to this. What does that error message mean
> > and
> > > how can avoid it, any suggestions?
> > >
> > > Thanks in advance,
> > > -jim
> > >
> >
>


Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-14 Thread Tarandeep Singh
I think there is one important comparison missing in the paper - cost. The
paper does mention in the conclusion that Hadoop comes for free, but it didn't
give any details of how much it would cost to license Vertica or
DBMS-X to run on 100 nodes.

Further, with data warehouse products like Hive, CloudBase built on top of
Hadoop, one can use SQL to query the data while using Hadoop underneath.

Thanks for sharing the link,
Tarandeep

On Tue, Apr 14, 2009 at 1:39 PM, Andrew Newman wrote:

> They are comparing an indexed system with one that isn't.  Why is
> Hadoop faster at loading than the others?  Surely no one would be
> surprised that it would be slower - I'm surprised at how well Hadoop
> does.  Who want to write a paper for next year, "grep vs reverse
> index"?
>
> 2009/4/15 Guilherme Germoglio :
> > (Hadoop is used in the benchmarks)
> >
> > http://database.cs.brown.edu/sigmod09/
> >
>


RE: DBOutputFormat - Communications link failure

2009-04-14 Thread Streckfus, William [USA]
Responding to this for archiving purposes...

After being stuck for a couple of hours I realized that "localhost" refers to a
different machine on each node the reducers run on :). Thus, replacing it
with the MySQL server's IP address did the trick.
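
In other words (a sketch; "db-host" stands for whatever machine is actually
running the MySQL server):

DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
    "jdbc:mysql://db-host:3306/wiki", "user", "pass");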

-Original Message-
From: Streckfus, William [USA] [mailto:streckfus_will...@bah.com] 
Sent: Tuesday, April 14, 2009 4:45 PM
To: core-user@hadoop.apache.org
Subject: DBOutputFormat - Communications link failure

Hey guys,
 
I'm trying my hand at outputting into a MySQL table but I'm running into a
"Communications link failure" during the reduce (in the
getRecordReader() method of DBOutputFormat to be more specific). Google
tells me this seems to happen when a SQL server drops the client (usually
after a period of time) but I'm wondering why that would be the case here. I
also threw in a cookie cutter connection at the beginning of my driver and
that seems to be connecting fine:

Connection conn = null;
try {
Class.forName("com.mysql.jdbc.Driver").newInstance();
conn = DriverManager.getConnection("jdbc:mysql://localhost/wiki",
"user", "pass");
System.out.println ("Database connection established");
conn.close();
} catch (Exception e) {
e.printStackTrace();
}

Here's the main snippet:
conf.setOutputKeyClass(TermFrequencyWritable.class); // My
DBWritable
conf.setOutputValueClass(NullWritable.class);
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(TermFrequencyWritable.class);

conf.setOutputFormat(DBOutputFormat.class);

DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
"jdbc:mysql://localhost/wiki", "user", "pass");
DBOutputFormat.setOutput(conf, "table", "word", "docid",
"frequency");
 
Has anyone encountered this before or know where this is going wrong?
 
Thanks,
- Bill




Re: More Replication on dfs

2009-04-14 Thread Raghu Angadi

Aseem,

Regarding over-replication, it is mostly an app-related issue, as Alex mentioned.

But if you are concerned about under-replicated blocks in the fsck output:

These blocks should not stay under-replicated if you have enough nodes 
and enough space on them (check NameNode webui).


Try grepping for one of the blocks in the NameNode log (and datanode logs as
well, since you have just 3 nodes).


Raghu.

Puri, Aseem wrote:

Alex,

Ouput of $ bin/hadoop fsck / command after running HBase data insert
command in a table is:

.
.
.
.
.
/hbase/test/903188508/tags/info/4897652949308499876:  Under replicated
blk_-5193
695109439554521_3133. Target Replicas is 3 but found 1 replica(s).
.
/hbase/test/903188508/tags/mapfiles/4897652949308499876/data:  Under
replicated
blk_-1213602857020415242_3132. Target Replicas is 3 but found 1
replica(s).
.
/hbase/test/903188508/tags/mapfiles/4897652949308499876/index:  Under
replicated
 blk_3934493034551838567_3132. Target Replicas is 3 but found 1
replica(s).
.
/user/HadoopAdmin/hbase table.doc:  Under replicated
blk_4339521803948458144_103
1. Target Replicas is 3 but found 2 replica(s).
.
/user/HadoopAdmin/input/bin.doc:  Under replicated
blk_-3661765932004150973_1030
. Target Replicas is 3 but found 2 replica(s).
.
/user/HadoopAdmin/input/file01.txt:  Under replicated
blk_2744169131466786624_10
01. Target Replicas is 3 but found 2 replica(s).
.
/user/HadoopAdmin/input/file02.txt:  Under replicated
blk_2021956984317789924_10
02. Target Replicas is 3 but found 2 replica(s).
.
/user/HadoopAdmin/input/test.txt:  Under replicated
blk_-3062256167060082648_100
4. Target Replicas is 3 but found 2 replica(s).
...
/user/HadoopAdmin/output/part-0:  Under replicated
blk_8908973033976428484_1
010. Target Replicas is 3 but found 2 replica(s).
Status: HEALTHY
 Total size:48510226 B
 Total dirs:492
 Total files:   439 (Files currently being written: 2)
 Total blocks (validated):  401 (avg. block size 120973 B) (Total
open file
blocks (not validated): 2)
 Minimally replicated blocks:   401 (100.0 %)
 Over-replicated blocks:0 (0.0 %)
 Under-replicated blocks:   399 (99.50124 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor:2
 Average block replication: 1.3117207
 Corrupt blocks:0
 Missing replicas:  675 (128.327 %)
 Number of data-nodes:  2
 Number of racks:   1


The filesystem under path '/' is HEALTHY
Please tell what is wrong.

Aseem

-Original Message-
From: Alex Loddengaard [mailto:a...@cloudera.com] 
Sent: Friday, April 10, 2009 11:04 PM

To: core-user@hadoop.apache.org
Subject: Re: More Replication on dfs

Aseem,

How are you verifying that blocks are not being replicated?  Have you
ran
fsck?  *bin/hadoop fsck /*

I'd be surprised if replication really wasn't happening.  Can you run
fsck
and pay attention to "Under-replicated blocks" and "Mis-replicated
blocks?"
In fact, can you just copy-paste the output of fsck?

Alex

On Thu, Apr 9, 2009 at 11:23 PM, Puri, Aseem
wrote:


Hi
   I also tried the command $ bin/hadoop balancer. But still the
same problem.

Aseem

-Original Message-
From: Puri, Aseem [mailto:aseem.p...@honeywell.com]
Sent: Friday, April 10, 2009 11:18 AM
To: core-user@hadoop.apache.org
Subject: RE: More Replication on dfs

Hi Alex,

   Thanks for sharing your knowledge. Till now I have three
machines and I have to check the behavior of Hadoop so I want
replication factor should be 2. I started my Hadoop server with
replication factor 3. After that I upload 3 files to implement word
count program. But as my all files are stored on one machine and
replicated to other datanodes also, so my map reduce program takes input
from one Datanode only. I want my files to be on different data node so
to check functionality of map reduce properly.

   Also before starting my Hadoop server again with replication
factor 2 I formatted all Datanodes and deleted all old data manually.

Please suggest what I should do now.

Regards,
Aseem Puri


-Original Message-
From: Mithila Nagendra [mailto:mnage...@asu.edu]
Sent: Friday, April 10, 2009 10:56 AM
To: core-user@hadoop.apache.org
Subject: Re: More Replication on dfs

To add to the question, how does one decide what is the optimal
replication
factor for a cluster. For instance what would be the appropriate
replication
factor for a cluster consisting of 5 nodes.
Mithila

On Fri, Apr 10, 2009 at 8:20 AM, Alex Loddengaard 
wrote:


Did you load any files when replication was set to 3?  If so, you'll have to
rebalance:
<http://hadoop.apache.org/core/docs/r0.19.1/hdfs_user_guide.html#Rebalancer>

Note that most people run HDFS with a replication factor of 3.  There have
been cases when clusters running with a replication of 2 discovered new
bugs, because replication is so often set to 3.  That said, i

Re: Extending ClusterMapReduceTestCase

2009-04-14 Thread jason hadoop
I have actually built an add-on class on top of ClusterMapReduceDelegate
that just runs a virtual cluster that persists for running tests on; it is
very nice, as you can interact with it via the web UI.
That is especially handy since the virtual cluster stuff is somewhat flaky under Windows.

I have a question in to the editor about the sample code.


On Tue, Apr 14, 2009 at 8:16 AM, czero  wrote:

>
> I actually picked up the alpha .PDF's of your book, great job.
>
> I'm following the example in chapter 7 to the letter now and am still
> getting the same problem.  2 quick questions (and thanks for your time in
> advance)...
>
> Is the ClusterMapReduceDelegate class available anywhere yet?
>
> Adding ~/hadoop/libs/*.jar in it's entirety to my pom.xml is a lot of bulk,
> so I've avoided it until now.  Are there any lib's in there that are
> absolutely necessary for this test to work?
>
> Thanks again,
> bc
>
>
>
> jason hadoop wrote:
> >
> > I have a nice variant of this in the ch7 examples section of my book,
> > including a standalone wrapper around the virtual cluster for allowing
> > multiple test instances to share the virtual cluster - and allow an
> easier
> > time to poke around with the input and output datasets.
> >
> > It even works decently under windows - my editor insisting on word to
> > recent
> > for crossover.
> >
> > On Mon, Apr 13, 2009 at 9:16 AM, czero  wrote:
> >
> >>
> >> Sry, I forgot to include the not-IntelliJ-console output :)
> >>
> >> 09/04/13 12:07:14 ERROR mapred.MiniMRCluster: Job tracker crashed
> >> java.lang.NullPointerException
> >>at java.io.File.<init>(File.java:222)
> >>at org.apache.hadoop.mapred.JobHistory.init(JobHistory.java:143)
> >>at
> >> org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1110)
> >>at
> >> org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:143)
> >>at
> >>
> >>
> org.apache.hadoop.mapred.MiniMRCluster$JobTrackerRunner.run(MiniMRCluster.java:96)
> >>at java.lang.Thread.run(Thread.java:637)
> >>
> >> I managed to pick up the chapter in the Hadoop Book that Jason mentions
> >> that
> >> deals with Unit testing (great chapter btw) and it looks like everything
> >> is
> >> in order.  He points out that this error is typically caused by a bad
> >> hadoop.log.dir or missing log4j.properties, but I verified that my dir
> is
> >> ok
> >> and my hadoop-0.19.1-core.jar has the log4j.properties in it.
> >>
> >> I also tried running the same test with hadoop-core/test 0.19.0 - same
> >> thing.
> >>
> >> Thanks again,
> >>
> >> bc
> >>
> >>
> >> czero wrote:
> >> >
> >> > Hey all,
> >> >
> >> > I'm also extending the ClusterMapReduceTestCase and having a bit of
> >> > trouble as well.
> >> >
> >> > Currently I'm getting :
> >> >
> >> > Starting DataNode 0 with dfs.data.dir:
> >> > build/test/data/dfs/data/data1,build/test/data/dfs/data/data2
> >> > Starting DataNode 1 with dfs.data.dir:
> >> > build/test/data/dfs/data/data3,build/test/data/dfs/data/data4
> >> > Generating rack names for tasktrackers
> >> > Generating host names for tasktrackers
> >> >
> >> > And then nothing... just spins on that forever.  Any ideas?
> >> >
> >> > I have all the jetty and jetty-ext libs in the classpath and I set the
> >> > hadoop.log.dir and the SAX parser correctly.
> >> >
> >> > This is all I have for my test class so far, I'm not even doing
> >> anything
> >> > yet:
> >> >
> >> > public class TestDoop extends ClusterMapReduceTestCase {
> >> >
> >> > @Test
> >> > public void testDoop() throws Exception {
> >> > System.setProperty("hadoop.log.dir", "~/test-logs");
> >> > System.setProperty("javax.xml.parsers.SAXParserFactory",
> >> > "com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl");
> >> >
> >> > setUp();
> >> >
> >> > System.out.println("done.");
> >> > }
> >> >
> >> > Thanks!
> >> >
> >> > bc
> >> >
> >>
> >> --
> >> View this message in context:
> >>
> http://www.nabble.com/Extending-ClusterMapReduceTestCase-tp22440254p23024597.html
> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>
> >>
> >
> >
> > --
> > Alpha Chapters of my book on Hadoop are available
> > http://www.apress.com/book/view/9781430219422
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Extending-ClusterMapReduceTestCase-tp22440254p23041470.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Directory /tmp/hadoop-hadoop/dfs/name is in an inconsistent state: storage directory does not exist

2009-04-14 Thread Pankil Doshi
Hello Everyone,

At times I get the following error when I restart my cluster desktops (before
that I shut down mapred and dfs properly, though).
The temp folder contains the directory it is looking for, yet I still get this
error.
The only solution I have found to get rid of this error is to format my dfs
entirely, load the data again, and start the whole process over.

But in doing that I lose my data on HDFS and have to reload it.

Does anyone have any clue about it?

Error from the log file:

2009-04-14 19:40:29,963 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG:
/
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = Semantic002/192.168.1.133
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.18.3
STARTUP_MSG:   build =
https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 736250;
compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
/
2009-04-14 19:40:30,958 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
Initializing RPC Metrics with hostName=NameNode, port=9000
2009-04-14 19:40:30,996 INFO org.apache.hadoop.dfs.NameNode: Namenode up at:
Semantic002/192.168.1.133:9000
2009-04-14 19:40:31,007 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=NameNode, sessionId=null
2009-04-14 19:40:31,014 INFO org.apache.hadoop.dfs.NameNodeMetrics:
Initializing NameNodeMeterics using context
object:org.apache.hadoop.metrics.spi.NullCont
ext
2009-04-14 19:40:31,160 INFO org.apache.hadoop.fs.FSNamesystem:
fsOwner=hadoop,hadoop,adm,dialout,fax,cdrom,floppy,tape,audio,dip,plugdev,scanner,fuse,admin
2009-04-14 19:40:31,161 INFO org.apache.hadoop.fs.FSNamesystem:
supergroup=supergroup
2009-04-14 19:40:31,161 INFO org.apache.hadoop.fs.FSNamesystem:
isPermissionEnabled=true
2009-04-14 19:40:31,183 INFO org.apache.hadoop.dfs.FSNamesystemMetrics:
Initializing FSNamesystemMeterics using context
object:org.apache.hadoop.metrics.spi.
NullContext
2009-04-14 19:40:31,184 INFO org.apache.hadoop.fs.FSNamesystem: Registered
FSNamesystemStatusMBean
2009-04-14 19:40:31,248 INFO org.apache.hadoop.dfs.Storage: Storage
directory /tmp/hadoop-hadoop/dfs/name does not exist.
2009-04-14 19:40:31,251 ERROR org.apache.hadoop.fs.FSNamesystem:
FSNamesystem initialization failed.
org.apache.hadoop.dfs.InconsistentFSStateException: Directory
/tmp/hadoop-hadoop/dfs/name is in an inconsistent state: storage directory
does not exist or is
 not accessible.
at
org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:211)
at
org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
at
org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294)
at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:273)
at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:193)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:179)
at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830)
at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839)
2009-04-14 19:40:31,261 INFO org.apache.hadoop.ipc.Server: Stopping server
on 9000
2009-04-14 19:40:31,262 ERROR org.apache.hadoop.dfs.NameNode:
org.apache.hadoop.dfs.InconsistentFSStateException: Directory
/tmp/hadoop-hadoop/dfs/name is in
 an inconsistent state: storage directory does not exist or is not
accessible.
at
org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:211)
at
org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)
at
org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:294)
at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:273)
at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:148)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:193)
at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:179)
at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830)
at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839)

2009-04-14 19:40:31,267 INFO org.apache.hadoop.dfs.NameNode: SHUTDOWN_MSG:
/
:

Thanks

Pankil


hadoop pipes problem

2009-04-14 Thread stchu
Hi,

I tried to use Hadoop Pipes for C++. The program was copied from the Hadoop
Wiki: http://wiki.apache.org/hadoop/C++WordCount.
But I am confused about the file name and the path. Should the program be
named "examples"? And where should I put this code
for compiling with ant? I used the name "examples" and put it in
{$HADOOP_HOME}/src/examples/pipes/. When I typed the instruction
ant -Dcompile.c++=yes examples, I got the following message:

===
Buildfile: build.xml

clover.setup:

clover.info:
 [echo]
 [echo]  Clover not found. Code coverage reports disabled.
 [echo]

clover:

init:
[touch] Creating /tmp/null255878201
   [delete] Deleting: /tmp/null255878201
 [exec] src/saveVersion.sh: 22: svn: not found
 [exec] src/saveVersion.sh: 23: svn: not found

record-parser:

compile-rcc-compiler:

compile-core-classes:
[javac] Compiling 1 source file to /home/chu/hadoop-0.18.1/build/classes

compile-core-native:

check-c++-makefiles:

create-c++-pipes-makefile:

create-c++-utils-makefile:

compile-c++-utils:
 [exec] depbase=`echo impl/StringUtils.o | sed
's|[^/]*$|.deps/&|;s|\.o$||'`; \
 [exec] if g++ -DHAVE_CONFIG_H -I.
-I/home/chu/hadoop-0.18.1/src/c++/utils -I./impl
-I/home/chu/hadoop-0.18.1/src/c++/utils/api -Wall -g -O2 -MT
impl/StringUtils.o -MD -MP -MF "$depbase.Tpo" -c -o impl/StringUtils.o
/home/chu/hadoop-0.18.1/src/c++/utils/impl/StringUtils.cc; \
 [exec] then mv -f "$depbase.Tpo" "$depbase.Po"; else rm -f
"$depbase.Tpo"; exit 1; fi
 [exec] /home/chu/hadoop-0.18.1/src/c++/utils/impl/StringUtils.cc: In
function ‘uint64_t HadoopUtils::getCurrentMillis()’:
 [exec] /home/chu/hadoop-0.18.1/src/c++/utils/impl/StringUtils.cc:74:
error: ‘strerror’ was not declared in this scope
 [exec] /home/chu/hadoop-0.18.1/src/c++/utils/impl/StringUtils.cc: In
function ‘std::string HadoopUtils::quoteString(const std::string&, const
char*)’:
 [exec] /home/chu/hadoop-0.18.1/src/c++/utils/impl/StringUtils.cc:103:
error: ‘strchr’ was not declared in this scope
 [exec] /home/chu/hadoop-0.18.1/src/c++/utils/impl/StringUtils.cc: In
function ‘std::string HadoopUtils::unquoteString(const std::string&)’:
 [exec] /home/chu/hadoop-0.18.1/src/c++/utils/impl/StringUtils.cc:144:
error: ‘strtol’ was not declared in this scope
 [exec] make: *** [impl/StringUtils.o] Error 1

BUILD FAILED
/home/chu/hadoop-0.18.1/build.xml:1106: exec returned: 2

Total time: 1 second
==

The Hadoop version is 0.18.1, and the JDK path has been set. Can anyone give
me some suggestions or guidance?
Thanks a lot.


stchu


Re: bzip2 input format

2009-04-14 Thread Edward J. Yoon
Thanks! good job :-)

On Wed, Apr 15, 2009 at 2:51 AM, John Heidemann  wrote:
> On Tue, 14 Apr 2009 10:52:02 +0900, "Edward J. Yoon" wrote:
>>Does anyone have a input formatter for bzip2?
>
> Do you mean like this codec (now committed):
> https://issues.apache.org/jira/browse/HADOOP-3646
>
> and split support nearly done, pending QA auto-tests:
>
> https://issues.apache.org/jira/browse/HADOOP-4012
>
> ?
>
>   -John Heidemann
>



-- 
Best Regards, Edward J. Yoon
edwardy...@apache.org
http://blog.udanax.org


Re: Extending ClusterMapReduceTestCase

2009-04-14 Thread jason hadoop
BTW, that stack trace looks like the hadoop.log.dir issue.
This is the code out of the init method in JobHistory:

LOG_DIR = conf.get("hadoop.job.history.location" ,
"file:///" + new File(
System.getProperty("hadoop.log.dir")).getAbsolutePath()
+ File.separator + "history");

It looks like the hadoop.log.dir system property is not set. Note: not an
environment variable, not a configuration parameter, but a Java system property.

Try a *System.setProperty("hadoop.log.dir","/tmp");* in your code before you
initialize the virtual cluster.
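
For example, a minimal sketch of doing that inside the test class itself
(the log directory here is arbitrary):

public class TestDoop extends ClusterMapReduceTestCase {

  @Override
  protected void setUp() throws Exception {
    // These must be set before super.setUp() starts the mini DFS/MR cluster.
    System.setProperty("hadoop.log.dir", "/tmp/test-logs");
    System.setProperty("javax.xml.parsers.SAXParserFactory",
        "com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl");
    super.setUp();
  }

  public void testDoop() throws Exception {
    // The virtual cluster is running here; getFileSystem() and createJobConf()
    // from ClusterMapReduceTestCase can be used to submit a real job.
  }
}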



On Tue, Apr 14, 2009 at 5:56 PM, jason hadoop wrote:

>
> I have actually built an add on class on top of ClusterMapReduceDelegate
> that just runs a virtual cluster that persists for running tests on, it is
> very nice, as you can interact via the web ui.
> Especially since the virtual cluster stuff is somewhat flaky under windows.
>
> I have a question in to the editor about the sample code.
>
>
>
> On Tue, Apr 14, 2009 at 8:16 AM, czero  wrote:
>
>>
>> I actually picked up the alpha .PDF's of your book, great job.
>>
>> I'm following the example in chapter 7 to the letter now and am still
>> getting the same problem.  2 quick questions (and thanks for your time in
>> advance)...
>>
>> Is the ClusterMapReduceDelegate class available anywhere yet?
>>
>> Adding ~/hadoop/libs/*.jar in it's entirety to my pom.xml is a lot of
>> bulk,
>> so I've avoided it until now.  Are there any lib's in there that are
>> absolutely necessary for this test to work?
>>
>> Thanks again,
>> bc
>>
>>
>>
>> jason hadoop wrote:
>> >
>> > I have a nice variant of this in the ch7 examples section of my book,
>> > including a standalone wrapper around the virtual cluster for allowing
>> > multiple test instances to share the virtual cluster - and allow an
>> easier
>> > time to poke around with the input and output datasets.
>> >
>> > It even works decently under windows - my editor insisting on word to
>> > recent
>> > for crossover.
>> >
>> > On Mon, Apr 13, 2009 at 9:16 AM, czero  wrote:
>> >
>> >>
>> >> Sry, I forgot to include the not-IntelliJ-console output :)
>> >>
>> >> 09/04/13 12:07:14 ERROR mapred.MiniMRCluster: Job tracker crashed
>> >> java.lang.NullPointerException
>> >>at java.io.File.<init>(File.java:222)
>> >>at org.apache.hadoop.mapred.JobHistory.init(JobHistory.java:143)
>> >>at
>> >> org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1110)
>> >>at
>> >> org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:143)
>> >>at
>> >>
>> >>
>> org.apache.hadoop.mapred.MiniMRCluster$JobTrackerRunner.run(MiniMRCluster.java:96)
>> >>at java.lang.Thread.run(Thread.java:637)
>> >>
>> >> I managed to pick up the chapter in the Hadoop Book that Jason mentions
>> >> that
>> >> deals with Unit testing (great chapter btw) and it looks like
>> everything
>> >> is
>> >> in order.  He points out that this error is typically caused by a bad
>> >> hadoop.log.dir or missing log4j.properties, but I verified that my dir
>> is
>> >> ok
>> >> and my hadoop-0.19.1-core.jar has the log4j.properties in it.
>> >>
>> >> I also tried running the same test with hadoop-core/test 0.19.0 - same
>> >> thing.
>> >>
>> >> Thanks again,
>> >>
>> >> bc
>> >>
>> >>
>> >> czero wrote:
>> >> >
>> >> > Hey all,
>> >> >
>> >> > I'm also extending the ClusterMapReduceTestCase and having a bit of
>> >> > trouble as well.
>> >> >
>> >> > Currently I'm getting :
>> >> >
>> >> > Starting DataNode 0 with dfs.data.dir:
>> >> > build/test/data/dfs/data/data1,build/test/data/dfs/data/data2
>> >> > Starting DataNode 1 with dfs.data.dir:
>> >> > build/test/data/dfs/data/data3,build/test/data/dfs/data/data4
>> >> > Generating rack names for tasktrackers
>> >> > Generating host names for tasktrackers
>> >> >
>> >> > And then nothing... just spins on that forever.  Any ideas?
>> >> >
>> >> > I have all the jetty and jetty-ext libs in the classpath and I set
>> the
>> >> > hadoop.log.dir and the SAX parser correctly.
>> >> >
>> >> > This is all I have for my test class so far, I'm not even doing
>> >> anything
>> >> > yet:
>> >> >
>> >> > public class TestDoop extends ClusterMapReduceTestCase {
>> >> >
>> >> > @Test
>> >> > public void testDoop() throws Exception {
>> >> > System.setProperty("hadoop.log.dir", "~/test-logs");
>> >> > System.setProperty("javax.xml.parsers.SAXParserFactory",
>> >> > "com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl");
>> >> >
>> >> > setUp();
>> >> >
>> >> > System.out.println("done.");
>> >> > }
>> >> >
>> >> > Thanks!
>> >> >
>> >> > bc
>> >> >
>> >>
>> >> --
>> >> View this message in context:
>> >>
>> http://www.nabble.com/Extending-ClusterMapReduceTestCase-tp22440254p23024597.html
>> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>> >>
>> >>
>> >
>> >
>> > --
>> > Alpha Chapters of my book on Hadoop are available
>> > http://www.apress.com/book/view/9781430219422
>> >

Re: Modeling WordCount in a different way

2009-04-14 Thread Sharad Agarwal


> I am trying complex queries on hadoop and in which i require more than one
> job to run to get final result..results of job one captures few joins of the
> query and I want to pass those results as input to 2nd job and again do
> processing so that I can get final results.queries are such that I cant do
> all types of joins and filterin in job1 and so I require two jobs.
>
> right now I write results of job 1 to hdfs and read dem for job2..but thats
> take unecessary IO time.So was looking for something that I can store my
> results of job1 in memory and use them as input for job 2.
>
> do let me know if you need any  more details.
How big is your input and output data? How many nodes are you using?
What is your job runtime?
I don't completely understand your use case, but my guess is that the amount of IO
time might not be significant compared to your overall job runtime, assuming
you have data-local maps.

In case you are joining two data sets and one is small, you can load the smaller
one in memory. With the JVM reuse feature you can load it once into a static field
in the mapper.
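
A rough sketch of that pattern with the old API (the "join.small.table"
property, the tab-separated format, and the class names are all made up for
illustration):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class InMemoryJoinMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Loaded once per JVM; reused across tasks when mapred.job.reuse.jvm.num.tasks > 1.
  private static Map<String, String> smallTable;

  public void configure(JobConf conf) {
    synchronized (InMemoryJoinMapper.class) {
      if (smallTable != null) {
        return;                              // already loaded by an earlier task in this JVM
      }
      smallTable = new HashMap<String, String>();
      try {
        Path p = new Path(conf.get("join.small.table"));   // hypothetical property
        FileSystem fs = p.getFileSystem(conf);
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(p)));
        String line;
        while ((line = in.readLine()) != null) {
          String[] parts = line.split("\t", 2);            // assume key<TAB>value lines
          if (parts.length == 2) {
            smallTable.put(parts[0], parts[1]);
          }
        }
        in.close();
      } catch (IOException e) {
        throw new RuntimeException("could not load small table", e);
      }
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    String[] parts = value.toString().split("\t", 2);
    String joined = (parts.length == 2) ? smallTable.get(parts[0]) : null;
    if (joined != null) {
      output.collect(new Text(parts[0]), new Text(parts[1] + "\t" + joined));
    }
  }
}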

- Sharad