how to preserve original line order?

2009-03-12 Thread Roldano Cattoni
The task should be simple: I want to put all the words of a (large) file
in uppercase.

I tried the following:
 - streaming mode
 - the mapper is a perl script that puts each line in uppercase (number of
   mappers > 1)
 - no reducer (number of reducers set to zero)

It works fine except that line order is not preserved.

How to preserve the original line order?

I would appreciate any suggestion.

  Roldano
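
One common way to get the original order back is to carry the input byte offset
(which TextInputFormat already supplies as the map key) through to the output. A
minimal sketch in the plain Java API, assuming a single reducer is acceptable so
that the framework's sort restores the order, and assuming a single input file
(offsets would collide across files); the class and path names are illustrative,
not from the original post:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class UppercaseKeepOrder extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {

  // TextInputFormat hands the mapper the byte offset of each line as the key;
  // emitting it keeps enough information to re-establish the original order.
  public void map(LongWritable offset, Text line,
                  OutputCollector<LongWritable, Text> out, Reporter reporter)
      throws IOException {
    out.collect(offset, new Text(line.toString().toUpperCase()));
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(UppercaseKeepOrder.class);
    conf.setMapperClass(UppercaseKeepOrder.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    conf.setNumReduceTasks(1);   // one reducer => a single output file sorted by offset
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

The same idea carries over to streaming: have the perl mapper prefix each output
line with the offset (or a line number) as the key, and either use one reducer or
sort the part files on that key afterwards, rather than relying on the order in
which map outputs happen to be written.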



Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

2009-03-12 Thread Sriram Rao
Hey TCK,

We operate a large cluster in which we run both HDFS/KFS in the same
cluster and on the same nodes.  We run two instances of KFS and one
instance of HDFS in the cluster:
 - Our logs are in KFS and we have KFS setup in WORM mode (a mode in
which deletions/renames on files/dirs are permitted only on files with
.tmp extension).
 - Map/reduce jobs read from WORM and can write to HDFS or the KFS
setup in r/w mode.
 - For archival purposes, we back up data between the two different DFS
implementations.

The throughput you get depends on your cluster setup: in our case, we
have four 1-TB disks on each node, each of which we can push at 100MB/s.
In JBOD mode, in theory, we can get 400MB/s.  With a 1 Gbps NIC, the
theoretical limit is 125MB/s.

Sriram

> Thanks, Brian. This sounds encouraging for us.
>
> What are the advantages/disadvantages of keeping a persistent storage
> (HD/K)FS cluster separate from a processing Hadoop+(HD/K)FS cluster?
>
> The advantage I can think of is that a permanent storage cluster has
> different requirements from a map-reduce processing cluster -- the permanent
> storage cluster would need faster, bigger hard disks, and would need to grow as
> the total volume of all collected logs grows, whereas the processing cluster
> would need fast CPUs and would only need to grow with the rate of incoming data.
> So it seems to make sense to me to copy a piece of data from the permanent
> storage cluster to the processing cluster only when it needs to be processed. Is
> my line of thinking reasonable? How would this compare to running the map-reduce
> processing on same cluster as the data is stored in? Which approach is used by
> most people?
>
> Best Regards,
> TCK
>
> --- On Wed, 2/4/09, Brian Bockelman wrote:
>> From: Brian Bockelman
>> Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?
>> To: core-user@hadoop.apache.org
>> Date: Wednesday, February 4, 2009, 1:06 PM
>>
>> Hey TCK,
>>
>> We use HDFS+FUSE solely as a storage solution for a application which
>> doesn't understand MapReduce.  We've scaled this solution to around
>> 80Gbps.  For 300 processes reading from the same file, we get about 20Gbps.
>>
>> Do consider your data retention policies -- I would say that Hadoop as a
>> storage system is thus far about 99% reliable for storage and is not a backup
>> solution.  If you're scared of getting more than 1% of your logs lost, have
>> a good backup solution.  I would also add that when you are learning your
>> operational staff's abilities, expect even more data loss.  As you gain
>> experience, data loss goes down.
>>
>> I don't believe we've lost a single block in the last month, but it
>> took us 2-3 months of 1%-level losses to get here.
>>
>> Brian
>>
>> On Feb 4, 2009, at 11:51 AM, TCK wrote:
>>
>>> Hey guys,
>>>
>>> We have been using Hadoop to do batch processing of logs. The logs get
>>> written and stored on a NAS. Our Hadoop cluster periodically copies a batch of
>>> new logs from the NAS, via NFS into Hadoop's HDFS, processes them, and
>>> copies the output back to the NAS. The HDFS is cleaned up at the end of each
>>> batch (ie, everything in it is deleted).
>>>
>>> The problem is that reads off the NAS via NFS don't scale even if we
>>> try to scale the copying process by adding more threads to read in parallel.
>>>
>>> If we instead stored the log files on an HDFS cluster (instead of NAS), it
>>> seems like the reads would scale since the data can be read from multiple data
>>> nodes at the same time without any contention (except network IO, which
>>> shouldn't be a problem).
>>>
>>> I would appreciate if anyone could share any similar experience they have
>>> had with doing parallel reads from a storage HDFS.
>>>
>>> Also is it a good idea to have a separate HDFS for storage vs for doing
>>> the batch processing?
>>>
>>> Best Regards,
>>> TCK


Re: Creating Lucene index in Hadoop

2009-03-12 Thread 王红宝
you can see the nutch code.

2009/3/13 Mark Kerzner 

> Hi,
>
> How do I allow multiple nodes to write to the same index file in HDFS?
>
> Thank you,
> Mark
>


Re: tuning performance

2009-03-12 Thread Allen Wittenauer



On 3/12/09 7:13 PM, "Vadim Zaliva"  wrote:

> The machines have 4 disks each, striped.
> However I do not see disks being a bottleneck.

When you stripe you automatically make every disk in the system have the
same speed as the slowest disk.  In our experience, systems are more likely
to have a 'slow' disk than a dead one, and detecting that is really,
really hard.  In a distributed system, that multiplier effect can have
significant consequences on the whole grid's performance.

Also:

>>> 16Gb RAM each. They both run mapreduce and dfs nodes. Currently
>>> I've set up each of them to run 32 map and 8 reduce tasks.
>>> Also, HADOOP_HEAPSIZE=2048.

IIRC, HEAPSIZE is in MB, sooo...

32*2048=65536
8*2048=16384

81920MB of VM *just in heap* ...

>>> I see CPU is under utilized. If there is a guideline how I can find optimal
>>> number of tasks and memory setting for this kind of hardware.

If you actually have all 40 task slots going, the system is likely
spending quite a bit of time paging...

We generally use cores/2-1 for map and reduce slots.  This leaves some
cores and memory for the OS to use for monitoring, data node, etc. So 8
cores=3 maps, 3 reduces per node.  Going above that has been done for
extremely lightweight processes, but you're more likely headed for heartache
if you aren't careful.
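
To put numbers on that, here is my own back-of-the-envelope sketch of the
arithmetic above, assuming HADOOP_HEAPSIZE applies to each child JVM:

public class SlotBudget {
  public static void main(String[] args) {
    int cores = 8, ramMb = 16 * 1024, heapMb = 2048;
    // Configured in this thread: 32 maps + 8 reduces per node, each with a 2048 MB heap.
    System.out.println("configured: " + (32 + 8) * heapMb + " MB of child heap vs "
        + ramMb + " MB of RAM");      // 81920 MB vs 16384 MB => heavy paging
    // Rule of thumb from this thread: cores/2 - 1 map slots and reduce slots.
    int slots = cores / 2 - 1;        // 3 maps + 3 reduces on an 8-core box
    System.out.println("rule of thumb: " + 2 * slots * heapMb + " MB of child heap vs "
        + ramMb + " MB of RAM");      // 12288 MB, leaving room for the OS and daemons
  }
}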



Creating Lucene index in Hadoop

2009-03-12 Thread Mark Kerzner
Hi,

How do I allow multiple nodes to write to the same index file in HDFS?

Thank you,
Mark


Child Nodes processing jobs?

2009-03-12 Thread Richa Khandelwal
Hi,
I am running a cluster of map/reduce jobs. How do I confirm that the slaves are
actually executing the map/reduce job spawned by the JobTracker at the
master? All the slaves are running the datanodes and tasktrackers fine.

Thanks,
Richa Khandelwal


University Of California,
Santa Cruz.
Ph:425-241-7763
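
Besides the JobTracker web UI, one way to check this is to ask the JobTracker for
its cluster status and job progress programmatically. A rough sketch against the
0.19-era JobClient API; treat the exact method names as my recollection rather
than gospel:

import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.RunningJob;

public class ClusterCheck {
  public static void main(String[] args) throws Exception {
    JobClient client = new JobClient(new JobConf());
    ClusterStatus status = client.getClusterStatus();
    // If the slaves are really working, they show up here with running task counts > 0.
    System.out.println("task trackers: " + status.getTaskTrackers());
    System.out.println("running maps: " + status.getMapTasks()
        + ", running reduces: " + status.getReduceTasks());
    for (JobStatus js : client.jobsToComplete()) {
      RunningJob job = client.getJob(js.getJobID());
      if (job == null) continue;            // job may have finished in the meantime
      System.out.println(js.getJobID() + " map " + (job.mapProgress() * 100)
          + "% reduce " + (job.reduceProgress() * 100) + "%");
    }
  }
}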


Re: Reducers spawned when mapred.reduce.tasks=0

2009-03-12 Thread Amareshwari Sriramadasu

Are you seeing reducers getting spawned from the web UI? Then it is a bug.
If not, there won't be reducers spawned; it could be the job-setup/
job-cleanup task that is running on a reduce slot. See HADOOP-3150 and
HADOOP-4261.

-Amareshwari
Chris K Wensel wrote:


May have found the answer, waiting on confirmation from users.

Turns out 0.19.0 and .1 instantiate the reducer class when the task is 
actually intended for job/task cleanup.


branch-0.19 looks like it resolves this issue by not instantiating the 
reducer class in this case.


I've got a workaround in the next maint release:
http://github.com/cwensel/cascading/tree/wip-1.0.5

ckw

On Mar 12, 2009, at 10:12 AM, Chris K Wensel wrote:


Hey all

Have some users reporting intermittent spawning of Reducers when the 
job.xml shows mapred.reduce.tasks=0 in 0.19.0 and .1.


This is also confirmed when jobConf is queried in the (supposedly 
ignored) Reducer implementation.


In general this issue would likely go unnoticed since the default 
reducer is IdentityReducer.


but since it should be ignored in the Mapper-only case, we don't 
bother to leave the value unset, and it subsequently comes to one's 
attention rather abruptly.


am happy to open a JIRA, but wanted to see if anyone else is 
experiencing this issue.


note the issue seems to manifest with or without spec exec.

ckw

--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/





Re: tuning performance

2009-03-12 Thread jason hadoop
For a simple test, set the replication on your entire cluster to 6:
hadoop dfs -setrep -R -w 6 /

This will triple your disk usage and probably take a while, but then you are
guaranteed that all data is local.

You can also get a rough idea from the Job Counters, 'Data-local map tasks'
total field, of the data local map tasks.



-setrep [-R] [-w]  :  Set the replication level of a file.
The -R flag requests a recursive change of replication level
for an entire tree.


On Thu, Mar 12, 2009 at 7:13 PM, Vadim Zaliva  wrote:

> The machines have 4 disks each, striped.
> However I do not see disks being a bottleneck. Monitoring system activity
> shows that CPU is utilized 2-70%, disk usage is moderate, while network
> activity seems to be quite high. In this particular cluster we have 6
> machines
> and replication factor is 2. I was wondering if increasing replication
> factor would
> help, so there is a better chance that data block is available locally.
>
> Sincerely,
> Vadim
>
>
> On Thu, Mar 12, 2009 at 13:27, Aaron Kimball  wrote:
> > Xeon vs. Opteron is likely not going to be a major factor. More important
> > than this is the number of disks you have per machine. Task performance
> is
> > proportional to both the number of CPUs and the number of disks.
> >
> > You are probably using way too many tasks. Adding more tasks/node isn't
> > necessarily going to increase utilization if they're waiting on data from
> > the disks; you'll just increase the IO pressure and probably make things
> > seek more. You've configured up to 40 simultaneous tasks/node, to run on
> 8
> > cores. I'd lower that to maybe 8 maps and 6 reduces. You might get better
> > usage since they don't compete for IO as much. The correct value for you
> > might lie somewhere in between, too, depending on your workload.
> >
> > How many hdds does your machine have? You should stripe dfs.data.dir and
> > mapred.local.dir across all of them. Having a 2:1 core:disk (And thus a
> 2:1
> > map task:disk) ratio is probably a good starting point, so for eight
> cores,
> > you should have at least 4 disks.
> >
> > - Aaron
> >
> > On Wed, Mar 11, 2009 at 10:15 AM, Vadim Zaliva 
> wrote:
> >
> >> Hi!
> >>
> >> I have a question about fine-tuning hadoop performance on 8-core
> >> machines.
> >> I have 2 machines I am testing. One is 8-core Xeon and another is 8-core
> >> Opteron. 16Gb RAM each. They both run mapreduce and dfs nodes. Currently
> >> I've set up each of them to run 32 map and 8 reduce tasks.
> >> Also, HADOOP_HEAPSIZE=2048.
> >>
> >> I see CPU is under utilized. If there is a guideline how I can find
> optimal
> >> number of tasks and memory setting for this kind of hardware.
> >>
> >> Also, since we are going to buy more machines like this, I need to decide
> >> whether to buy Xeons or Opterons. Any advice on that?
> >>
> >> Sincerely,
> >> Vadim
> >>
> >> P.S. I am using Hadoop 19 and java version "1.6.0_12":
> >> Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
> >> Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
> >>
> >
>



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: tuning performance

2009-03-12 Thread Vadim Zaliva
The machines have 4 disks each, striped.
However I do not see disks being a bottleneck. Monitoring system activity
shows that CPU is utilized 2-70%, disk usage is moderate, while network
activity seems to be quite high. In this particular cluster we have 6 machines
and replication factor is 2. I was wondering if increasing replication
factor would
help, so there is a better chance that data block is available locally.

Sincerely,
Vadim


On Thu, Mar 12, 2009 at 13:27, Aaron Kimball  wrote:
> Xeon vs. Opteron is likely not going to be a major factor. More important
> than this is the number of disks you have per machine. Task performance is
> proportional to both the number of CPUs and the number of disks.
>
> You are probably using way too many tasks. Adding more tasks/node isn't
> necessarily going to increase utilization if they're waiting on data from
> the disks; you'll just increase the IO pressure and probably make things
> seek more. You've configured up to 40 simultaneous tasks/node, to run on 8
> cores. I'd lower that to maybe 8 maps and 6 reduces. You might get better
> usage since they don't compete for IO as much. The correct value for you
> might lie somewhere in between, too, depending on your workload.
>
> How many hdds does your machine have? You should stripe dfs.data.dir and
> mapred.local.dir across all of them. Having a 2:1 core:disk (And thus a 2:1
> map task:disk) ratio is probably a good starting point, so for eight cores,
> you should have at least 4 disks.
>
> - Aaron
>
> On Wed, Mar 11, 2009 at 10:15 AM, Vadim Zaliva  wrote:
>
>> Hi!
>>
>> I have a question about fine-tuning hadoop performance on 8-core
>> machines.
>> I have 2 machines I am testing. One is 8-core Xeon and another is 8-core
>> Opteron. 16Gb RAM each. They both run mapreduce and dfs nodes. Currently
>> I've set up each of them to run 32 map and 8 reduce tasks.
>> Also, HADOOP_HEAPSIZE=2048.
>>
>> I see CPU is under utilized. If there is a guideline how I can find optimal
>> number of tasks and memory setting for this kind of hardware.
>>
>> Also, since we are going to buy more machines like this, I need to decide
>> whether to buy Xeons or Opterons. Any advice on that?
>>
>> Sincerely,
>> Vadim
>>
>> P.S. I am using Hadoop 19 and java version "1.6.0_12":
>> Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
>> Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
>>
>


Hadoop Streaming throw an exception with wget as the mapper

2009-03-12 Thread Nick Cen
Hi All,

I am trying to use Hadoop streaming with "wget" to simulate a
distributed downloader.
The command line I use is

./bin/hadoop jar -D mapred.reduce.tasks=0
contrib/streaming/hadoop-0.19.0-streaming.jar -input urli -output urlo
-mapper /usr/bin/wget -outputformat
org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

But it throws an exception:

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess
failed with code 1
at 
org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:295)
at 
org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:519)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:136)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child.main(Child.java:155)

Can somebody point me to why this happened? Thanks.



-- 
http://daily.appspot.com/food/
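
One likely cause (an assumption on my part, not confirmed from the logs): streaming
feeds each input line to the mapper's stdin, but wget ignores stdin and expects URLs
as command-line arguments, so it exits non-zero and streaming reports exit code 1. A
wrapper script that reads URLs from stdin and invokes wget on them would avoid that;
alternatively, a small Java mapper can fetch each input line itself, along the lines
of the sketch below (class name and output layout are illustrative):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class UrlFetchMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String url = value.toString().trim();   // each input line is expected to be a URL
    if (url.length() == 0) {
      return;
    }
    StringBuilder body = new StringBuilder();
    BufferedReader in =
        new BufferedReader(new InputStreamReader(new URL(url).openStream()));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        body.append(line).append('\n');
      }
    } finally {
      in.close();
    }
    out.collect(new Text(url), new Text(body.toString()));   // key = URL, value = page body
  }
}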


Re: How to let key sorted in the final outputfile

2009-03-12 Thread Edward J. Yoon
For your information - http://wiki.apache.org/hama/MatMult

On Wed, Nov 12, 2008 at 2:05 AM, He Chen  wrote:
> hi everyone
>
> I use hadoop to do matrix multiplication. I let the key store the row
> information, and let the value be the whole row, like this:
>
> 0 (this is the key)              (0,537040)      (1,217656)  (this string
> after the key is the value, in text)
> 8                                        (0,641356)      (1,287908)
>
> Now I want to keep the keys (LongWritable) in order from low to high. I
> used
>
> "config.setOutputKeyComparatorClass(LongWritable.Comparator.class);"
>
> to realize my intent. However, when I multiply two 10x10 matrices, the keys
> do not stay in order. What is the reason? Should I rewrite the comparator?
>
> Chen
>



-- 
Best Regards, Edward J. Yoon
edwardy...@apache.org
http://blog.udanax.org
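
A couple of things worth checking (my guesses from the description, not a confirmed
diagnosis): if the row index is emitted as Text it sorts lexicographically ("10"
before "2"), and with more than one reducer each part file is only sorted
independently, not globally. A small configuration sketch that keeps a numeric sort
and a single, totally ordered output file:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public class MatMultSortConfig {               // placeholder helper name
  public static void configure(JobConf conf) {
    conf.setOutputKeyClass(LongWritable.class);  // numeric keys sort numerically
    conf.setOutputValueClass(Text.class);
    // Same comparator as in the original question; redundant when the key class
    // is LongWritable, but harmless.
    conf.setOutputKeyComparatorClass(LongWritable.Comparator.class);
    conf.setNumReduceTasks(1);                   // one reducer => one globally ordered file
  }
}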


Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?

2009-03-12 Thread Raghu Angadi

TCK wrote:

How well does the read throughput from HDFS scale with the number of data nodes?
For example, if I had a large file (say 10GB) on a 10 data node cluster, would the
time taken to read this whole file in parallel (ie, with multiple reader client
processes requesting different parts of the file in parallel) be halved if I had
the same file on a 20 data node cluster?

Depends: yes, if whatever was the bottleneck with 10 still continues to be the
bottleneck (i.e. you are able to saturate in both cases) and that resource is
scaled (disk or network).


Is this not possible because HDFS doesn't support random seeks? 

HDFS does support random seeks for reading... your case should work.

Raghu.
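
For what it's worth, a bare-bones sketch of what such a parallel reader client
could look like with the FileSystem API: each process opens the file and does a
positioned read of its own byte range (file name and split arithmetic are
placeholders, not from this thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RangeReader {
  public static void main(String[] args) throws Exception {
    Path file = new Path(args[0]);            // e.g. a 10GB file in HDFS
    long offset = Long.parseLong(args[1]);    // start of this reader's slice
    int length = Integer.parseInt(args[2]);   // number of bytes to read
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataInputStream in = fs.open(file);
    try {
      byte[] buf = new byte[length];
      in.readFully(offset, buf, 0, length);   // positioned read; independent of other readers
      System.out.write(buf);
    } finally {
      in.close();
    }
  }
}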



What about if the file was split up into multiple smaller files before placing
in the HDFS?
Thanks for your input.
-TCK


--- On Wed, 2/4/09, Brian Bockelman wrote:
From: Brian Bockelman
Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?
To: core-user@hadoop.apache.org
Date: Wednesday, February 4, 2009, 1:50 PM

Sounds overly complicated.  Complicated usually leads to mistakes :)

What about just having a single cluster and only running the tasktrackers on
the fast CPUs?  No messy cross-cluster transferring.

Brian

On Feb 4, 2009, at 12:46 PM, TCK wrote:

Thanks, Brian. This sounds encouraging for us.

What are the advantages/disadvantages of keeping a persistent storage
(HD/K)FS cluster separate from a processing Hadoop+(HD/K)FS cluster?

The advantage I can think of is that a permanent storage cluster has
different requirements from a map-reduce processing cluster -- the permanent
storage cluster would need faster, bigger hard disks, and would need to grow as
the total volume of all collected logs grows, whereas the processing cluster
would need fast CPUs and would only need to grow with the rate of incoming data.
So it seems to make sense to me to copy a piece of data from the permanent
storage cluster to the processing cluster only when it needs to be processed. Is
my line of thinking reasonable? How would this compare to running the map-reduce
processing on same cluster as the data is stored in? Which approach is used by
most people?

Best Regards,
TCK

--- On Wed, 2/4/09, Brian Bockelman wrote:
From: Brian Bockelman
Subject: Re: Batch processing with Hadoop -- does HDFS scale for parallel reads?
To: core-user@hadoop.apache.org
Date: Wednesday, February 4, 2009, 1:06 PM

Hey TCK,

We use HDFS+FUSE solely as a storage solution for a application which
doesn't understand MapReduce.  We've scaled this solution to around
80Gbps.  For 300 processes reading from the same file, we get about 20Gbps.

Do consider your data retention policies -- I would say that Hadoop as a
storage system is thus far about 99% reliable for storage and is not a backup
solution.  If you're scared of getting more than 1% of your logs lost, have
a good backup solution.  I would also add that when you are learning your
operational staff's abilities, expect even more data loss.  As you gain
experience, data loss goes down.

I don't believe we've lost a single block in the last month, but it
took us 2-3 months of 1%-level losses to get here.

Brian

On Feb 4, 2009, at 11:51 AM, TCK wrote:

Hey guys,

We have been using Hadoop to do batch processing of logs. The logs get
written and stored on a NAS. Our Hadoop cluster periodically copies a batch of
new logs from the NAS, via NFS into Hadoop's HDFS, processes them, and
copies the output back to the NAS. The HDFS is cleaned up at the end of each
batch (ie, everything in it is deleted).

The problem is that reads off the NAS via NFS don't scale even if we
try to scale the copying process by adding more threads to read in parallel.

If we instead stored the log files on an HDFS cluster (instead of NAS), it
seems like the reads would scale since the data can be read from multiple data
nodes at the same time without any contention (except network IO, which
shouldn't be a problem).

I would appreciate if anyone could share any similar experience they have
had with doing parallel reads from a storage HDFS.

Also is it a good idea to have a separate HDFS for storage vs for doing
the batch processing?

Best Regards,
TCK


Re: Reducers spawned when mapred.reduce.tasks=0

2009-03-12 Thread Chris K Wensel


May have found the answer, waiting on confirmation from users.

Turns out 0.19.0 and .1 instantiate the reducer class when the task is  
actually intended for job/task cleanup.


branch-0.19 looks like it resolves this issue by not instantiating the  
reducer class in this case.


I've got a workaround in the next maint release:
http://github.com/cwensel/cascading/tree/wip-1.0.5

ckw

On Mar 12, 2009, at 10:12 AM, Chris K Wensel wrote:


Hey all

Have some users reporting intermittent spawning of Reducers when the  
job.xml shows mapred.reduce.tasks=0 in 0.19.0 and .1.


This is also confirmed when jobConf is queried in the (supposedly  
ignored) Reducer implementation.


In general this issue would likely go unnoticed since the default  
reducer is IdentityReducer.


but since it should be ignored in the Mapper-only case, we don't  
bother to leave the value unset, and it subsequently comes to one's  
attention rather abruptly.


am happy to open a JIRA, but wanted to see if anyone else is  
experiencing this issue.


note the issue seems to manifest with or without spec exec.

ckw

--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



Hadoop User Group Meeting (Bay Area) 3/18

2009-03-12 Thread Ajay Anand
The next Bay Area Hadoop User Group meeting is scheduled for Wednesday,
March 18th at Yahoo! 2811 Mission College Blvd, Santa Clara, Building 2,
Training Rooms 5 & 6 from 6:00-7:30 pm.

 

Agenda:
"Performance Enhancement Techniques with Hadoop - a Case Study" - Milind
Bhandarkar
"RPMs for Hadoop Deployment and Configuration Management" - Matt Massie,
Christophe Bisciglia



Registration: http://upcoming.yahoo.com/event/2112338/

 

Look forward to seeing you there.

Ajay

 



Re: Building Release 0.19.1

2009-03-12 Thread Tsz Wo (Nicholas), Sze

Hi Aviad,

You are right.  The eclipse plugin cannot be compiled in Windows.  See also 
HADOOP-4310,
https://issues.apache.org/jira/browse/HADOOP-4310

Nicholas Sze



- Original Message 
> From: Aviad sela 
> To: Hadoop Users Support 
> Sent: Thursday, March 12, 2009 1:00:12 PM
> Subject: Building Release 0.19.1
> 
> Building the eclipse project in windows XP, using Eclipse 3.4
> results with the following error.
> It seems that some of the jars to build the projects are missing
> *
> 
> compile*:
> [*echo*] contrib: eclipse-plugin
> [*javac*] Compiling 45 source files to
> D:\Work\AviadWork\workspace\cur\W_ECLIPSE\E34_Hadoop_Core_19_1\Hadoop\build\contrib\eclipse-plugin\classes
> [*javac*] *
> D:\Work\Hadoop\src\contrib\eclipse-plugin\src\java\org\apache\hadoop\eclipse\launch\HadoopApplicationLaunchShortcut.java
> *:35: cannot find symbol
> [*javac*] symbol : class JavaApplicationLaunchShortcut
> [*javac*] location: package org.eclipse.jdt.internal.debug.ui.launcher
> [*javac*] import
> org.eclipse.jdt.internal.debug.ui.launcher.JavaApplicationLaunchShortcut;
> [*javac*] ^
> [*javac*] *
> D:\Work\Hadoop\src\contrib\eclipse-plugin\src\java\org\apache\hadoop\eclipse\launch\HadoopApplicationLaunchShortcut.java
> *:49: cannot find symbol
> [*javac*] symbol: class JavaApplicationLaunchShortcut
> [*javac*] JavaApplicationLaunchShortcut {
> [*javac*] ^
> [*javac*] *
> D:\Work\Hadoop\src\contrib\eclipse-plugin\src\java\org\apache\hadoop\eclipse\launch\HadoopApplicationLaunchShortcut.java
> *:66: cannot find symbol
> [*javac*] symbol : variable super
> [*javac*] location: class
> org.apache.hadoop.eclipse.launch.HadoopApplicationLaunchShortcut
> [*javac*] super.findLaunchConfiguration(type, configType);
> [*javac*] ^
> [*javac*] *
> D:\Work\Hadoop\src\contrib\eclipse-plugin\src\java\org\apache\hadoop\eclipse\launch\HadoopApplicationLaunchShortcut.java
> *:60: method does not override or implement a method from a supertype
> [*javac*] @Override
> [*javac*] ^
> [*javac*] Note: Some input files use or override a deprecated API.
> [*javac*] Note: Recompile with -Xlint:deprecation for details.
> [*javac*] Note: Some input files use unchecked or unsafe operations.
> [*javac*] Note: Recompile with -Xlint:unchecked for details.
> [*javac*] 4 errors



Re: tuning performance

2009-03-12 Thread Aaron Kimball
Xeon vs. Opteron is likely not going to be a major factor. More important
than this is the number of disks you have per machine. Task performance is
proportional to both the number of CPUs and the number of disks.

You are probably using way too many tasks. Adding more tasks/node isn't
necessarily going to increase utilization if they're waiting on data from
the disks; you'll just increase the IO pressure and probably make things
seek more. You've configured up to 40 simultaneous tasks/node, to run on 8
cores. I'd lower that to maybe 8 maps and 6 reduces. You might get better
usage since they don't compete for IO as much. The correct value for you
might lie somewhere in between, too, depending on your workload.

How many hdds does your machine have? You should stripe dfs.data.dir and
mapred.local.dir across all of them. Having a 2:1 core:disk (And thus a 2:1
map task:disk) ratio is probably a good starting point, so for eight cores,
you should have at least 4 disks.

- Aaron

On Wed, Mar 11, 2009 at 10:15 AM, Vadim Zaliva  wrote:

> Hi!
>
> I have a question about fine-tuning hadoop performance on 8-core
> machines.
> I have 2 machines I am testing. One is 8-core Xeon and another is 8-core
> Opteron. 16Gb RAM each. They both run mapreduce and dfs nodes. Currently
> I've set up each of them to run 32 map and 8 reduce tasks.
> Also, HADOOP_HEAPSIZE=2048.
>
> I see CPU is under utilized. If there is a guideline how I can find optimal
> number of tasks and memory setting for this kind of hardware.
>
> Also, since we are going to buy more machines like this, I need to decide
> whether to buy Xeons or Opterons. Any advice on that?
>
> Sincerely,
> Vadim
>
> P.S. I am using Hadoop 19 and java version "1.6.0_12":
> Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
> Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
>


Building Release 0.19.1

2009-03-12 Thread Aviad sela
Building the eclipse project in windows XP, using Eclipse 3.4
results with the following error.
It seems that some of the jars to build the projects are missing
*

compile*:
[*echo*] contrib: eclipse-plugin
[*javac*] Compiling 45 source files to
D:\Work\AviadWork\workspace\cur\W_ECLIPSE\E34_Hadoop_Core_19_1\Hadoop\build\contrib\eclipse-plugin\classes
[*javac*] *
D:\Work\Hadoop\src\contrib\eclipse-plugin\src\java\org\apache\hadoop\eclipse\launch\HadoopApplicationLaunchShortcut.java
*:35: cannot find symbol
[*javac*] symbol : class JavaApplicationLaunchShortcut
[*javac*] location: package org.eclipse.jdt.internal.debug.ui.launcher
[*javac*] import
org.eclipse.jdt.internal.debug.ui.launcher.JavaApplicationLaunchShortcut;
[*javac*] ^
[*javac*] *
D:\Work\Hadoop\src\contrib\eclipse-plugin\src\java\org\apache\hadoop\eclipse\launch\HadoopApplicationLaunchShortcut.java
*:49: cannot find symbol
[*javac*] symbol: class JavaApplicationLaunchShortcut
[*javac*] JavaApplicationLaunchShortcut {
[*javac*] ^
[*javac*] *
D:\Work\Hadoop\src\contrib\eclipse-plugin\src\java\org\apache\hadoop\eclipse\launch\HadoopApplicationLaunchShortcut.java
*:66: cannot find symbol
[*javac*] symbol : variable super
[*javac*] location: class
org.apache.hadoop.eclipse.launch.HadoopApplicationLaunchShortcut
[*javac*] super.findLaunchConfiguration(type, configType);
[*javac*] ^
[*javac*] *
D:\Work\Hadoop\src\contrib\eclipse-plugin\src\java\org\apache\hadoop\eclipse\launch\HadoopApplicationLaunchShortcut.java
*:60: method does not override or implement a method from a supertype
[*javac*] @Override
[*javac*] ^
[*javac*] Note: Some input files use or override a deprecated API.
[*javac*] Note: Recompile with -Xlint:deprecation for details.
[*javac*] Note: Some input files use unchecked or unsafe operations.
[*javac*] Note: Recompile with -Xlint:unchecked for details.
[*javac*] 4 errors


Re: about block size

2009-03-12 Thread Doug Cutting
One factor is that block size should minimize the impact of disk seeks. 
 For example, if a disk seeks in 10ms and transfers at 100MB/s, then a 
good block size will be substantially larger than 1MB.  With 100MB 
blocks, seeks would only slow things by 1%.
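
To spell out the arithmetic behind that 1% figure (my own back-of-the-envelope,
using the numbers above):

    overhead = seek_time / (seek_time + block_size / transfer_rate)
             = 0.01 s / (0.01 s + 100 MB / (100 MB/s))
             = 0.01 / 1.01, i.e. about 1%

whereas a 1 MB block would give 0.01 / (0.01 + 0.01) = 50% of the time lost to
seeking.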


Another factor is that, unless files are smaller than the block size, 
larger blocks means fewer blocks, and fewer blocks make for a more 
efficient namenode.


The primary harm of too large blocks is that you will end up with fewer 
map tasks than nodes, and not use your cluster optimally.


Doug

ChihChun Chu wrote:

Hi,

I have a question about how to decide the block size.
As I understand it, the block size is related to the namenode's heap size (how
many blocks can be handled), the total storage capacity of the cluster, the file
sizes (which depend on the application, e.g. a 1 TB log file), the number of
replicas, and the performance of mapreduce.
In Google's paper, they used 64MB as their block size. Yahoo and Facebook seem
to set the block size to 128MB. Hadoop's default value is 64MB. I don't know why
64MB or 128MB. Is that the result of the tradeoffs I mentioned above? How do I
decide the block size if I want to build my application upon Hadoop? Is there
any criterion or formula?

Any opinions or comments will be appreciated.


stchu



Reducers spawned when mapred.reduce.tasks=0

2009-03-12 Thread Chris K Wensel

Hey all

Have some users reporting intermittent spawning of Reducers when the  
job.xml shows mapred.reduce.tasks=0 in 0.19.0 and .1.


This is also confirmed when jobConf is queried in the (supposedly  
ignored) Reducer implementation.


In general this issue would likely go unnoticed since the default  
reducer is IdentityReducer.


but since it should be ignored in the Mapper-only case, we don't  
bother to leave the value unset, and it subsequently comes to one's  
attention rather abruptly.


am happy to open a JIRA, but wanted to see if anyone else is  
experiencing this issue.


note the issue seems to manifest with or without spec exec.

ckw

--
Chris K Wensel
ch...@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/



How to limit concurrent task numbers of a job.

2009-03-12 Thread Zhou, Yunqing
Here I have a job: it contains 2000 map tasks and each map needs 1
hour or so (a map cannot be split because its input is a compressed
archive).
How can I set this job's max concurrent task numbers (map and reduce)
to leave resources for other urgent jobs?

Thanks.


Re: Why is large number of [(heavy) keys , (light) value] faster than (light)key , (heavy) value

2009-03-12 Thread Richa Khandelwal
I am running the same test, and the job that completes in 10 mins for the (hk,lv)
case is still running after 30 mins have passed for the (sk,hv) case.
Would be interesting to pinpoint the reason behind it.
On Wed, Mar 11, 2009 at 1:27 PM, Gyanit  wrote:

>
> Here are exact numbers:
> # of (k,v) pairs = 1.2 Mil this is same.
> # of unique k = 1000, k is integer.
> # of  unique v = 1Mil, v is a big big string.
> For a given k, cumulative size of all v associated to it is about 30 Mb.
> (That is each v is about 25~30Kb)
> # of Mappers = 30
> # of Reducers = 10
>
> (v,k) is at least 4-5 times faster than (k,v).
>
> -Gyanit
>
>
> Scott Carey wrote:
> >
> > Well if the smaller keys are producing fewer unique values, there should
> > be some more significant differences.
> >
> > I had assumed that your test produced the same number of unique values.
> >
> > I'm still not sure why there would be that significant of a difference as
> > long as the total number of unique values in the small key test is a good
> > deal larger than the number of reducers and there is not too much skew in
> > the bucket sizes.  If there are a small subset of keys in the small key
> > test that contain a large subset of the values, then the reducers will
> > have very skewed work sizes and this could explain your observation.
> >
> >
> > On 3/11/09 11:50 AM, "Gyanit"  wrote:
> >
> >
> >
> > I noticed one more thing. Lighter keys tend to produce a smaller number of
> > unique keys.
> > For example, there may be 10 Mil (key,value) pairs, but if the key is lighter
> > the unique keys might be just 1000.
> > In the other case, if keys are heavier, unique keys might be 5 mil.
> > I think this might have something to do with it.
> > Bottom line: if your reduce is a simple dump with no combining, then put the
> > data in keys rather than values.
> >
> > I need to put data in values. Any suggestions on how to make it faster.
> >
> > -Gyanit.
> >
> >
> > Scott Carey wrote:
> >>
> >> That is a fascinating question.  I would also love to know the reason
> >> behind this.
> >>
> >> If I were to guess I would have thought that smaller keys and heavier
> >> values would slightly outperform, rather than significantly
> underperform.
> >> (assuming total pair count at each phase is the same).   Perhaps there
> is
> >> room for optimization here?
> >>
> >>
> >>
> >> On 3/10/09 6:44 PM, "Gyanit"  wrote:
> >>
> >>
> >>
> >> I have large number of key,value pairs. I don't actually care if data
> >> goes
> >> in
> >> value or key. Let me be more exact.
> >> (k,v) pair after combiner is about 1 mil. I have approx 1kb data for
> each
> >> pair. I can put it in keys or values.
> >> I have experimented with both options (heavy key , light value)  vs
> >> (light
> >> key, heavy value). It turns out that hk,lv option is much much better
> >> than
> >> (lk,hv).
> >> Has someone else also noticed this?
> >> Is there a way to make things faster in light key , heavy value option.
> >> As
> >> some application will need that also.
> >> Remember in both cases we are talking about atleast dozen or so million
> >> pairs.
> >> There is a difference of time in shuffle phase. Which is weird as amount
> >> of
> >> data transferred is same.
> >>
> >> -gyanit
> >> --
> >> View this message in context:
> >>
> http://www.nabble.com/Why-is-large-number-of---%28heavy%29-keys-%2C-%28light%29-value--faster-than-%28light%29key-%2C-%28heavy%29-value-tp22447877p22447877.html
> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>
> >>
> >>
> >>
> >
> > --
> > View this message in context:
> >
> http://www.nabble.com/Why-is-large-number-of---%28heavy%29-keys-%2C-%28light%29-value--faster-than-%28light%29key-%2C-%28heavy%29-value-tp22447877p22463049.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >
> >
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Why-is-large-number-of---%28heavy%29-keys-%2C-%28light%29-value--faster-than-%28light%29key-%2C-%28heavy%29-value-tp22447877p22463784.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
Richa Khandelwal


University Of California,
Santa Cruz.
Ph:425-241-7763


Re: using virtual slave machines

2009-03-12 Thread Steve Loughran

Karthikeyan V wrote:

There is no specific procedure for configuring virtual machine slaves.



make sure the following things are done.



I've used these as the beginning of a page on this

http://wiki.apache.org/hadoop/VirtualCluster




Re: Extending ClusterMapReduceTestCase

2009-03-12 Thread Steve Loughran

jason hadoop wrote:

I am having trouble reproducing this one. It happened in a very specific
environment that pulled in an alternate sax parser.

The bottom line is that jetty expects a parser with particular capabilities
and if it doesn't get one, odd things happen.

In a day or so I will hopefully have worked out the details, but it has been
half a year since I dealt with this last.

Unless you are forking to run your junit tests, ant won't let you change
the class path for your unit tests - much chaos will ensue.


Even if you fork, unless you set includeantruntime=false then you get 
Ant's classpath, as the junit test listeners are in the 
ant-optional-junit.jar and you'd better pull them in somehow.


I can see why AElfred would cause problems for jetty; they need to 
handle web.xml and suchlike, and probably validate them against the 
schema to reduce support calls.




Re: Persistent HDFS On EC2

2009-03-12 Thread Steve Loughran

Kris Jirapinyo wrote:

Why would you lose the locality of storage-per-machine if one EBS volume is
mounted to each machine instance?  When that machine goes down, you can just
restart the instance and re-mount the exact same volume.  I've tried this
idea before successfully on a 10 node cluster on EC2, and didn't see any
adverse performance effects--



I was thinking more of S3 FS, which is remote-ish and write times measurable


and actually amazon claims that EBS I/O should
be even better than the instance stores.  


Assuming the transient filesystems are virtual disks (and not physical 
disks that get scrubbed, formatted and mounted on every VM 
instantiation), and also assuming that EBS disks are on a SAN in the 
same datacentre, this is probably true. Disk IO performance in virtual 
disks is currently pretty slow as you are navigating through >1 
filesystem, and potentially seeking at lot, even something that appears 
unfragmented at the VM level





The only concerns I see are that

you need to pay for EBS storage regardless of whether you use that storage
or not.  So, if you have 10 EBS volumes of 1 TB each, and you're just
starting out with your cluster so you're using only 50GB on each EBS volume
so far for the month, you'd still have to pay for 10TB worth of EBS volumes,
and that could be a hefty price for each month.  Also, currently EBS needs
to be created in the same availability zone as your instances, so you need
to make sure that they are created correctly, as there is no direct
migration of EBS to different availability zones.


View EBS as renting space in SAN and it starts to make sense.


--
Steve Loughran  http://www.1060.org/blogxter/publish/5
Author: Ant in Action   http://antbook.org/


Re: HADOOP Oracle connection workaround

2009-03-12 Thread Mridul Muralidharan


Would be better to externalize this through either a template - or at 
the least, message bundles.


- Mridul

evana wrote:

The out-of-the-box Hadoop implementation has some issues connecting to Oracle.
Looks like DBInputFormat is built with mysql/hsqldb in mind. You need to
modify the out-of-the-box implementation of the getSelectQuery method in
DBInputFormat.

WORK AROUND
Here is the code snippet... (remember this works only on Oracle; if you want
to get it working on any db other than Oracle you have to have if-else logic
on db type)

protected String getSelectQuery() {
  StringBuilder query = new StringBuilder();
  if (dbConf.getInputQuery() == null) {
    query.append("SELECT ");

    for (int i = 0; i < fieldNames.length; i++) {
      query.append(fieldNames[i]);
      if (i != fieldNames.length - 1) {
        query.append(", ");
      }
    }

    query.append(" FROM ").append(tableName);
    if (conditions != null && conditions.length() > 0)
      query.append(" WHERE ").append(conditions);
    String orderBy = dbConf.getInputOrderBy();
    if (orderBy != null && orderBy.length() > 0) {
      query.append(" ORDER BY ").append(orderBy);
    }
  } else {
    // PREBUILT QUERY
    query.append(dbConf.getInputQuery());
  }

  try {
    if (split.getLength() > 0 && split.getStart() > 0) {
      String querystring = query.toString();
      query = new StringBuilder();
      query.append("select * from (select a.*,rownum rno from ( ");
      query.append(querystring);
      query.append(" ) a where rownum <= ").append(split.getStart())
           .append(" + ").append(split.getLength());
      query.append(" ) where rno >= ").append(split.getStart());
    }
  } catch (IOException ex) {
    // ignore, will not throw
  }
  return query.toString();
}




How to skip bad records in .19.1

2009-03-12 Thread 柳松


Dear all:
I have set the value "SkipBadRecords.setMapperMaxSkipRecords(conf, 1)",
and also the "SkipBadRecords.setAttemptsToStartSkipping(conf, 2)".
 
However, after 3 failed attempts, it gave me this exception message:
 
   java.lang.NullPointerException
 at 
org.apache.hadoop.io.serializer.SerializationFactory.getSerializer(SerializationFactory.java:73)
 at org.apache.hadoop.io.SequenceFile$Writer.init(SequenceFile.java:910)
 at 
org.apache.hadoop.io.SequenceFile$BlockCompressWriter.(SequenceFile.java:1198)
 at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:401)
 at org.apache.hadoop.io.SequenceFile.createWriter(SequenceFile.java:306)
 at 
org.apache.hadoop.mapred.MapTask$SkippingRecordReader.writeSkippedRec(MapTask.java:265)
 at org.apache.hadoop.mapred.MapTask$SkippingRecordReader.next(MapTask.java:237)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:342)
 at org.apache.hadoop.mapred.Child.main(Child.java:158)
 
   The last line of  syslog shows:
   2009-03-12 16:44:11,218 WARN org.apache.hadoop.mapred.SortedRanges: Skipping 
index 1-2
 
   I have two questions: 
   1. Should it skip the bad record automatically after 2 attempts? Why does it
start after 3?
 
   2. Why does the skip fail?
 
Regards
Song Liu from Suzhou University
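
For reference, a sketch of the skip-mode settings being discussed, with my reading
of the two questions as comments. The io.serializations line is only a guess about
the NullPointerException in SerializationFactory (it fails when no serializer is
registered for the skipped-record key/value classes), not a confirmed fix:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkipModeConfig {
  public static void configure(JobConf conf) {
    // Question 1: setAttemptsToStartSkipping(conf, 2) means the first two attempts run
    // normally and skip mode only kicks in from the 3rd attempt, which matches what you saw.
    SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
    SkipBadRecords.setMapperMaxSkipRecords(conf, 1);   // tolerate at most 1 bad record per skip range
    // Question 2 (an assumption): the skipped-record writer needs a registered serializer,
    // so make sure io.serializations includes WritableSerialization.
    conf.set("io.serializations",
             "org.apache.hadoop.io.serializer.WritableSerialization");
  }
}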
 

HADOOP Oracle connection workaround

2009-03-12 Thread evana

The out-of-the-box Hadoop implementation has some issues connecting to Oracle.
Looks like DBInputFormat is built with mysql/hsqldb in mind. You need to
modify the out-of-the-box implementation of the getSelectQuery method in
DBInputFormat.

WORK AROUND
Here is the code snippet... (remember this works only on Oracle; if you want
to get it working on any db other than Oracle you have to have if-else logic
on db type)

protected String getSelectQuery() {
  StringBuilder query = new StringBuilder();
  if (dbConf.getInputQuery() == null) {
    query.append("SELECT ");

    for (int i = 0; i < fieldNames.length; i++) {
      query.append(fieldNames[i]);
      if (i != fieldNames.length - 1) {
        query.append(", ");
      }
    }

    query.append(" FROM ").append(tableName);
    if (conditions != null && conditions.length() > 0)
      query.append(" WHERE ").append(conditions);
    String orderBy = dbConf.getInputOrderBy();
    if (orderBy != null && orderBy.length() > 0) {
      query.append(" ORDER BY ").append(orderBy);
    }
  } else {
    // PREBUILT QUERY
    query.append(dbConf.getInputQuery());
  }

  try {
    if (split.getLength() > 0 && split.getStart() > 0) {
      String querystring = query.toString();
      query = new StringBuilder();
      query.append("select * from (select a.*,rownum rno from ( ");
      query.append(querystring);
      query.append(" ) a where rownum <= ").append(split.getStart())
           .append(" + ").append(split.getLength());
      query.append(" ) where rno >= ").append(split.getStart());
    }
  } catch (IOException ex) {
    // ignore, will not throw
  }
  return query.toString();
}
-- 
View this message in context: 
http://www.nabble.com/HADOOP-2536-supports-Oracle-too--tp21823199p22471395.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
-- 
View this message in context: 
http://www.nabble.com/HADOOP-2536-supports-Oracle-too--tp21823199p22471395.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.