Problem : data distribution is non uniform between two different disks on datanode.

2009-03-16 Thread Vaibhav J

From: Vaibhav J [mailto:vaibh...@rediff.co.in] 
Sent: Monday, March 16, 2009 5:46 PM
To: 'nutch-...@lucene.apache.org'; 'nutch-u...@lucene.apache.org'
Subject: Problem : data distribution is non uniform between two different disks on datanode.

We have 27 datanodes and a replication factor of 1 (data size is ~6.75 TB).

We have specified two different disks for the DFS data directory on each
datanode, using the dfs.data.dir property in the hadoop-site.xml file in the
conf directory (value of dfs.data.dir: /mnt/hadoop-dfs/data,/mnt2/hadoop-dfs/data).
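
For reference, a minimal sketch of what that property looks like in
hadoop-site.xml (the paths are simply the ones quoted above):

<property>
  <name>dfs.data.dir</name>
  <value>/mnt/hadoop-dfs/data,/mnt2/hadoop-dfs/data</value>
</property>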

 

When we set the replication factor to 2, data distribution is biased toward
the first disk: more data is copied to /mnt/hadoop-dfs/data, and after
copying some data the first disk becomes full and shows no available space,
while we still have enough space on the second disk (/mnt2/hadoop-dfs/data).
So it is difficult to achieve replication factor 2.

 

Data traffic is going to the second disk as well (/mnt2/hadoop-dfs/data),
but it looks like more data is copied to the first disk (/mnt/hadoop-dfs/data).

 

 

What should we do to get uniform data distribution between the two disks on
each datanode so that we can achieve replication factor 2?

 

 

Regards

Vaibhav J.



Re: Problem : data distribution is non uniform between two different disks on datanode.

2009-03-16 Thread Brian Bockelman

Hey Vaibhav,

Two notes beforehand:
1) When asking questions, you'll want to post the Hadoop version used.
2) You'll also want to only send to one mailing list at a time; it is  
a common courtesy.


Can you provide the list with the outputs of "df -h"?  Also, can you  
share what your namenode interface thinks about the configured  
capacity, used, non-dfs used, and remaining columns for your node?



Brian

On Mar 16, 2009, at 7:19 AM, Vaibhav J wrote:

[...]

Task Side Effect files and copying(getWorkOutputPath)

2009-03-16 Thread Saptarshi Guha
Hello,
I would like to produce side-effect files which will later be copied
to the output folder.
I am using FileOutputFormat, and in the Map's close() method I copy
files (from the local tmp/ folder) to
FileOutputFormat.getWorkOutputPath(job);

void close() throws IOException {
    if (shouldcopy) {
        // move every side-effect file from the local temp dir to the task's work output path
        ArrayList<Path> lop = new ArrayList<Path>();
        for (String ff : tempdir.list()) {
            lop.add(new Path(temppfx + ff));
        }
        dstFS.moveFromLocalFile(lop.toArray(new Path[0]), dstPath);
    }
}

However, this throws an error java.io.IOException:
`hdfs://X:54310/tmp/testseq/_temporary/_attempt_200903160945_0010_m_00_0':
specified destination directory doest not exist

I thought this was the right place to drop side-effect files. Prior
to this I was copying to the output folder, but many were not copied -
or in fact all were, but many were then deleted during the reduce
output stage; I am not sure (with NullOutputFormat all the files were
present in the output folder). So I resorted to getWorkOutputPath,
which threw the above exception.

So if I'm using FileOutputFormat, and my maps and/or reduces produce
side-effect files on the local FS:
1) When should I copy them to the DFS (e.g. in the close() method, or one at
a time in the map/reduce method)?
2) Where should I copy them to?

I am using Hadoop 0.19 and have set jobConf.setNumTasksToExecutePerJvm(-1).
Also, each side-effect file produced has a unique name, i.e. there is
no overwriting.

Thank you
Saptarshi Guha


Re: Creating Lucene index in Hadoop

2009-03-16 Thread Ian Soboroff

I understand why you would index in the reduce phase, because the anchor
text gets shuffled to be next to the document.  However, when you index
in the map phase, don't you just have to reindex later?

The main point to the OP is that HDFS is a bad FS for writing Lucene
indexes because of how Lucene works.  The simple approach is to write
your index outside of HDFS in the reduce phase, and then merge the
indexes from each reducer manually.
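
As a rough sketch of that manual merge step (using the Lucene 2.x API of the
era; the local shard paths and analyzer below are just placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeShards {
  public static void main(String[] args) throws Exception {
    // hypothetical local copies of the per-reducer indexes (e.g. pulled off the cluster)
    String[] shardPaths = { "/data/indexes/part-00000", "/data/indexes/part-00001" };

    Directory[] shards = new Directory[shardPaths.length];
    for (int i = 0; i < shardPaths.length; i++) {
      shards[i] = FSDirectory.getDirectory(shardPaths[i]);
    }

    // create the merged index and pull in every shard
    IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/data/indexes/merged"),
        new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
    writer.addIndexesNoOptimize(shards);  // merge without forcing a full optimize first
    writer.optimize();
    writer.close();
  }
}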

Ian

Ning Li  writes:

> Or you can check out the index contrib. The difference of the two is that:
>   - In Nutch's indexing map/reduce job, indexes are built in the
> reduce phase. Afterwards, they are merged into smaller number of
> shards if necessary. The last time I checked, the merge process does
> not use map/reduce.
>   - In contrib/index, small indexes are built in the map phase. They
> are merged into the desired number of shards in the reduce phase. In
> addition, they can be merged into existing shards.
>
> Cheers,
> Ning
>
>
> On Fri, Mar 13, 2009 at 1:34 AM, 王红宝  wrote:
>> you can see the nutch code.
>>
>> 2009/3/13 Mark Kerzner 
>>
>>> Hi,
>>>
>>> How do I allow multiple nodes to write to the same index file in HDFS?
>>>
>>> Thank you,
>>> Mark
>>>
>>



Re: Creating Lucene index in Hadoop

2009-03-16 Thread Ning Li
I should have pointed out that the Nutch index build and contrib/index
target different applications. The latter is for applications that
simply want to build a Lucene index from a set of documents - e.g., with no
link analysis.

As to writing Lucene indexes, both work the same way - write the final
results to local file system and then copy to HDFS. In contrib/index,
the intermediate results are in memory and not written to HDFS.

Hope it clarifies things.

Cheers,
Ning


On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff  wrote:
> [...]


Re: Creating Lucene index in Hadoop

2009-03-16 Thread Ian Soboroff

Does anyone have stats on how multiple readers on an optimized Lucene
index in HDFS compare with a ParallelMultiReader (or whatever it's
called) over RPC on a local filesystem?

I'm missing why you would ever want the Lucene index in HDFS for
reading.

Ian

Ning Li  writes:

> [...]



Re: Creating Lucene index in Hadoop

2009-03-16 Thread Ning Li
> I'm missing why you would ever want the Lucene index in HDFS for
> reading.

The Lucene indexes are written to HDFS, but that does not mean you
conduct search on the indexes stored in HDFS directly. HDFS is not
designed for random access. Usually the indexes are copied to the
nodes where search will be served. With
http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
become feasible to search on HDFS directly.

Cheers,
Ning


On Mon, Mar 16, 2009 at 4:52 PM, Ian Soboroff  wrote:
> [...]


Re: Creating Lucene index in Hadoop

2009-03-16 Thread Doug Cutting

Ning Li wrote:

With
http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
become feasible to search on HDFS directly.


I don't think HADOOP-4801 is required.  It would help, certainly, but 
it's so fraught with security and other issues that I doubt it will be 
committed anytime soon.


What would probably help HDFS random access performance for Lucene 
significantly would be:
 1. A cache of connections to datanodes, so that each seek() does not 
require an open().  If we move HDFS data transfer to be RPC-based (see, 
e.g., http://issues.apache.org/jira/browse/HADOOP-4386), then this will 
come for free, since RPC already caches connections.  We hope to do this 
for Hadoop 1.0, so that we use a single transport for all Hadoop's core 
operations, to simplify security.
 2. A local cache of read-only HDFS data, equivalent to kernel's buffer 
cache.  This might be implemented as a Lucene Directory that keeps an 
LRU cache of buffers from a wrapped filesystem, perhaps a subclass of 
RAMDirectory.


With these, performance would still be slower than a local drive, but 
perhaps not so dramatically.
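
As a very rough sketch of the buffer-cache idea in 2, independent of the
actual Lucene Directory API (the block size and class names are made up):

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache of read buffers, keyed by file name and block offset.
// A Directory wrapper could check this cache before issuing a read against HDFS.
class BlockCache {
  static final int BLOCK_SIZE = 64 * 1024;  // assumed caching granularity
  private final LinkedHashMap<String, byte[]> cache;

  BlockCache(final int maxBlocks) {
    // access-ordered LinkedHashMap evicts the least recently used block
    this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
      protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
        return size() > maxBlocks;
      }
    };
  }

  synchronized byte[] get(String file, long blockStart) {
    return cache.get(file + "@" + blockStart);
  }

  synchronized void put(String file, long blockStart, byte[] block) {
    cache.put(file + "@" + blockStart, block);
  }
}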


Doug


Re: tuning performance

2009-03-16 Thread Scott Carey
Yes, I am referring to HDFS taking multiple mount points and automatically
round-robining block allocation across them.
A single file block will only exist on a single disk, but the extra speed you
can get with RAID-0 within a block can't be used effectively by almost any
mapper or reducer anyway.  Perhaps an identity mapper can read faster than a
single disk - but certainly not if the content is compressed.

RAID-0 may be more useful for local temp space.

In effect, you can say that HDFS data nodes already do RAID-0, but with a very 
large block size, and where failure of a disk reduces the redundancy minimally 
and temporarily.

For reference, today's Intel / AMD CPUs can usually decompress a gzip stream at less
than 30MB/sec of compressed input (50MB to 100MB of uncompressed data output a sec).


On 3/14/09 1:53 AM, "Vadim Zaliva"  wrote:

Scott,

Thanks for interesting information. By JBOD, I assume you mean just listing
multiple partition mount points in hadoop config?

Vadim

On Fri, Mar 13, 2009 at 12:48, Scott Carey  wrote:
> On 3/13/09 11:56 AM, "Allen Wittenauer"  wrote:
>
> On 3/13/09 11:25 AM, "Vadim Zaliva"  wrote:
>
>>>When you stripe you automatically make every disk in the system have the
>>> same speed as the slowest disk.  In our experiences, systems are more likely
>>> to have a 'slow' disk than a dead one and detecting that is really
>>> really hard.  In a distributed system, that multiplier effect can have
>>> significant consequences on the whole grids performance.
>>
>> All disk are the same, so there is no speed difference.
>
>There will be when they start to fail. :)
>
>
>
> This has been discussed before:
> http://www.nabble.com/RAID-vs.-JBOD-td21404366.html
>
> JBOD is going to be better, the only benefit of RAID-0 is slightly easier 
> management in hadoop config, but harder to manage at the OS level.
> When a single JBOD drive dies, you only lose that set of data.  The datanode 
> goes down but a restart brings back up the parts that still exist.  Then you 
> can leave it be while the replacement is procured... With RAID-0 the whole 
> node is down until you get the new drive and recreate the RAID.
>
> With JBOD, don't forget to set the linux readahead for the drives to a decent 
> level  (you'll gain up to 25% more sequential read throughput depending on 
> your kernel version).  (blockdev -setra 8192 /dev/).  I also see good 
> gains by using xfs instead of ext3.  For a big shocker check out the 
> difference in time to delete a bunch of large files with ext3 (long time) 
> versus xfs (almost instant).
>
> For the newer drives, they can do about 120MB/sec at the front of the drive 
> when tuned (xfs, readahead >4096) and the back of the drive is 60MB/sec.  If 
> you are going to not use 100% of the drive for HDFS, use this knowledge and 
> place the partitions appropriately.  The last 20% or so of the drive is a lot 
> slower than the front 60%.  Here is a typical sequential transfer rate chart 
> for a SATA drive as a function of LBA:
> http://www.tomshardware.com/reviews/Seagate-Barracuda-1.5-TB,2032-5.html
> (graphs are about 3/4 of the way down the page before the comments).
>



Re: Creating Lucene index in Hadoop

2009-03-16 Thread Ning Li
1 is good. But for 2:
  - Won't it have a security concern as well? Or is this not a general
local cache?
  - You are referring to caching in RAM, not caching in local FS,
right? In general, a Lucene index size could be quite large. We may
have to cache a lot of data to reach a reasonable hit ratio...

Cheers,
Ning


On Mon, Mar 16, 2009 at 5:36 PM, Doug Cutting  wrote:
> [...]


Problem with com.sun.pinkdots.LogHandler

2009-03-16 Thread psterk

Hi,

I have been running a hadoop cluster successfully for a few months.  During
today's run, I am seeing a new error and it is not clear to me how to
resolve it. Below are the stack traces and the configuration file I am using.
Please share any tips you may have.

Thanks,
Paul

09/03/16 16:28:25 INFO mapred.JobClient: Task Id :
task_200903161455_0003_m_000127_0, Status : FAILED
java.lang.ArrayIndexOutOfBoundsException: 3
at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:71)
at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:22)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

task_200903161455_0003_m_000127_0: Starting
null.task_200903161455_0003_m_000127_0
task_200903161455_0003_m_000127_0: Closing
task_200903161455_0003_m_000127_0: log4j:WARN No appenders could be found
for logger (org.apache.hadoop.mapred.TaskRu
task_200903161455_0003_m_000127_0: log4j:WARN Please initialize the log4j
system properly.
09/03/16 16:28:27 INFO mapred.JobClient: Task Id :
task_200903161455_0003_m_000128_0, Status : FAILED
java.lang.ArrayIndexOutOfBoundsException: 3
at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:71)
at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:22)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

task_200903161455_0003_m_000128_0: Starting
null.task_200903161455_0003_m_000128_0
task_200903161455_0003_m_000128_0: Closing
09/03/16 16:28:32 INFO mapred.JobClient: Task Id :
task_200903161455_0003_m_000128_1, Status : FAILED
java.lang.ArrayIndexOutOfBoundsException: 3
at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:71)
at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:22)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

task_200903161455_0003_m_000128_1: Starting
null.task_200903161455_0003_m_000128_1
task_200903161455_0003_m_000128_1: Closing
09/03/16 16:28:37 INFO mapred.JobClient: Task Id :
task_200903161455_0003_m_000127_1, Status : FAILED
java.lang.ArrayIndexOutOfBoundsException: 3
at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:71)
at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:22)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

task_200903161455_0003_m_000127_1: Starting
null.task_200903161455_0003_m_000127_1
task_200903161455_0003_m_000127_1: Closing
task_200903161455_0003_m_000127_1: log4j:WARN No appenders could be found
for logger (org.apache.hadoop.ipc.Client).
task_200903161455_0003_m_000127_1: log4j:WARN Please initialize the log4j
system properly.
09/03/16 16:28:40 INFO mapred.JobClient: Task Id :
task_200903161455_0003_m_000128_2, Status : FAILED
java.lang.ArrayIndexOutOfBoundsException: 3
at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:71)
at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:22)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

task_200903161455_0003_m_000128_2: Starting
null.task_200903161455_0003_m_000128_2
task_200903161455_0003_m_000128_2: Closing
09/03/16 16:28:46 INFO mapred.JobClient:  map 100% reduce 100%
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
at com.sun.pinkdots.Main.handleLogs(Main.java:63)
at com.sun.pinkdots.Main.main(Main.java:35)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)


#
# First, setup the necessary filesystem locations
#
${HADOOP}/bin/hadoop dfs -rmr hdfs:///user/${USER}/pinkdots
${HADOOP}/bin/hadoop dfs -copyFromLocal \
  file://${HOME}/pinkdots/config/glassfish_admin.xml \
  hdfs:

Cloudera's Distribution for Hadoop

2009-03-16 Thread Christophe Bisciglia
Hey Hadoop Fans,

It's been a crazy week here at Cloudera. Today we launched our
Distribution for Hadoop. This is targeted at Hadoop users who want to
use the most recent stable version of Hadoop and take advantage of
standard packaging and deployment tools like RPMs and YUM. We also
provide an AJAXy wizard to help you configure your cluster. We'll
include more options for deployment (.debs, solaris packages, etc) as
you ask for them, so please don't be shy - hit up our community
support page.

The high level features for our first release include:
* RPM Deployment and a public YUM repository
* Client RPMs for Hive and Pig (what else should we include? Tell us
on community support! Link below.)
* Standard Linux Service Management
* Local Documentation and Man Pages

We'll be going over some details and walking through deployment at the
Bay Area Hadoop Users Group at Y! this Wednesday, but if you're from
out of town, or want a head start, here are some links:
* Blog post announcement:
http://www.cloudera.com/blog/2009/03/15/cloudera-distribution-for-hadoop/
* Cloudera's Distribution for Hadoop Home Page: http://www.cloudera.com/hadoop
* Community Support: http://www.cloudera.com/community-support

Also, we're turning into twitter junkies, so if you've been infected too,
follow @cloudera for updates.

See you Wednesday!

Cheers,
Christophe


Re: Reduce task going away for 10 seconds at a time

2009-03-16 Thread Aaron Kimball
If you jstack the process in the middle of one of these pauses, can you see
where it's sticking?
- Aaron

On Fri, Mar 13, 2009 at 6:51 AM, Doug Cook  wrote:

>
> Hi folks,
>
> I've been debugging a severe performance problems with a Hadoop-based
> application (a highly modified version of Nutch). I've recently upgraded to
> Hadoop 0.19.1 from a much, much older version, and a reduce that used to
> work just fine is now running orders of magnitude more slowly.
>
> From the logs I can see that progress of my reduce stops for periods that
> average almost exactly 10 seconds (with a very narrow distribution around
> 10
> seconds), and it does so in various places in my code, but more or less in
> proportion to how much time I'd expect the task would normally spend in
> that
> particular place in the code, i.e. the behavior seems like my code is
> randomly being interrupted for 10 seconds at a time.
>
> I'm planning to keep digging, but thought that these symptoms might sound
> familiar to someone on this list. Ring any bells? Your help much
> appreciated.
>
> Thanks!
>
> Doug Cook
> --
> View this message in context:
> http://www.nabble.com/Reduce-task-going-away-for-10-seconds-at-a-time-tp22496810p22496810.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


Re: Cloudera's Distribution for Hadoop

2009-03-16 Thread Mark Kerzner
Christophe,

if you do .deb, I will be the first one to try. As it is, I am second :)

Mark

On Mon, Mar 16, 2009 at 7:42 PM, Christophe Bisciglia <
christo...@cloudera.com> wrote:

> [...]


Re: Cloudera's Distribution for Hadoop

2009-03-16 Thread Christophe Bisciglia
Mark, this is great feedback.
To everyone else, let me get a little more explicit about our community
support. We use Get Satisfaction: http://www.getsatisfaction.com/cloudera

You'll notice a topic for "Ubuntu support" - this is essentially asking for
.debs - http://www.getsatisfaction.com/cloudera/topics/ubuntu_support

If you want this, get on there and say "me too" - if you want
another platform, say that too.

Really - we'll listen. We're building this for you. We want it to be as easy
as possible for developers to get up and running so when your managers
realize how cool Hadoop is, they can consider the value of paying for
additional support.

Christophe

On Mon, Mar 16, 2009 at 7:21 PM, Mark Kerzner  wrote:

> Christophe,
>
> if you do .deb, I will be the first one to try. As it is, I am second :)
>
> Mark
>
> [...]


Re: 1 file per record

2009-03-16 Thread Sean Arietta

I have a similar issue and would like some clarification if possible. Suppose
each file is meant to be emitted as one single record to a set of map
tasks. That is, each key-value pair will include data from one file and one
file alone.

I have written custom InputFormats and RecordReaders before so I am familiar
with the general process. Does it suffice to just return an empty array from
the InputFormat.getSplits() function and then take care of the actual record
emitting from inside the custom RecordReader? 

Thanks for your time!

-Sean
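
For what it's worth, a common alternative (sketched here against the 0.19-era
mapred API; the class names are mine) is to keep the default getSplits()
behaviour but mark files as non-splittable, so each split is exactly one file
and the RecordReader emits that whole file as a single record:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.*;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;  // one split per file, never broken across blocks
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((FileSplit) split, job);
  }

  // Emits the entire file as a single key/value record, then reports end of input.
  static class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
    private final FileSplit split;
    private final JobConf job;
    private boolean processed = false;

    WholeFileRecordReader(FileSplit split, JobConf job) {
      this.split = split;
      this.job = job;
    }

    public boolean next(NullWritable key, BytesWritable value) throws IOException {
      if (processed) {
        return false;
      }
      byte[] contents = new byte[(int) split.getLength()];
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(job);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        in.readFully(0, contents);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }

    public NullWritable createKey() { return NullWritable.get(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return processed ? split.getLength() : 0; }
    public float getProgress() { return processed ? 1.0f : 0.0f; }
    public void close() throws IOException { }
  }
}

The BytesWritable value then carries the whole file's contents into a single
map() call, so each key-value pair comes from exactly one file.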


owen.omalley wrote:
> 
> On Oct 2, 2008, at 1:50 AM, chandravadana wrote:
> 
>> If we dont specify numSplits in getsplits(), then what is the default
>> number of splits taken...
> 
> The getSplits() is either library or user code, so it depends which  
> class you are using as your InputFormat. The FileInputFormats  
> (TextInputFormat and SequenceFileInputFormat) basically divide input  
> files by blocks, unless the requested number of mappers is really high.
> 
> -- Owen
> 
> 

-- 
View this message in context: 
http://www.nabble.com/1-file-per-record-tp19644985p22551968.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Cloudera's Distribution for Hadoop

2009-03-16 Thread Vadim Zaliva
Great news! I've been using homemade Hadoop RPMs for some time and
will be glad to switch to these.

Since I am using a bleeding-edge version of Pig, I will be interested in
Pig RPMs built daily from the Pig SVN.

Vadim

On Mon, Mar 16, 2009 at 19:34, Christophe Bisciglia
 wrote:
> [...]


hadoop migration

2009-03-16 Thread bayee

Hi,

We are running a website with quite a lot of traffic. At the moment we
are using about 20 SQL servers and about 60 application servers/file
servers. We are thinking of porting everything to Hadoop. My question is:
can 80 Hadoop nodes perform much better than 20 SQL servers + 60
native file servers?


We have tried setting up 1 Hadoop server and running a simple grep example,
and the speed is very slow. Does Hadoop only perform well with a lot of
nodes? What is the minimum number of nodes we need to replace our current
20 SQL servers + 60 app/file servers?


Best Wishes,
Hsin Yee


Re: hadoop migration

2009-03-16 Thread Edward J. Yoon
Hi,

Your SQL servers seem to be databases accessed over the internet. Hadoop is
a distributed file system, and it's quite different from a (database +
SAN storage) cluster architecture.

Of course, there is a storage solution called HBase for Hadoop. But,
in my experience, it is not applicable for online data access yet.

On Tue, Mar 17, 2009 at 1:33 PM, bayee  wrote:
> [...]



-- 
Best Regards, Edward J. Yoon
edwardy...@apache.org
http://blog.udanax.org


Re: hadoop migration

2009-03-16 Thread W
> Of course, There is a storage solution called HBase for Hadoop. But,
> In my experience, not applicable for online data access yet.
>

I see... how about Hypertable? Is it mature enough to be used in
production? I read that
Hypertable can be integrated with Hadoop - or is there any other
alternative besides HBase?

Thanks!

Regards,
Wildan

-- 
---
OpenThink Labs
www.tobethink.com

Aligning IT and Education

>> 021-99325243
Y! : hawking_123
Linkedln : http://www.linkedin.com/in/wildanmaulana


Re: hadoop migration

2009-03-16 Thread Amandeep Khurana
Hypertable is not as mature as HBase yet. The next release of HBase, 0.20.0,
includes some patches which reduce the latency of responses and make it
suitable to be used as a backend for a webapp. However, the current release
isn't optimized for this purpose.

The idea behind Hadoop and the rest of the tools around it is more of a data
processing system than a backend datastore for a website. The output of the
processing that Hadoop does is typically taken into a MySQL cluster which
feeds a website.




Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Mon, Mar 16, 2009 at 10:05 PM, W  wrote:

> > Of course, There is a storage solution called HBase for Hadoop. But,
> > In my experience, not applicable for online data access yet.
> >
>
> I see.., how about hypertable ? does it mature enough to be used in
> production ? , i read that
> hypertable can be integrated with hadoop, or is there any other
> alternative other than hbase ?
>
> Thanks!
>
> Regards,
> Wildan
>
> --
> ---
> OpenThink Labs
> www.tobethink.com
>
> Aligning IT and Education
>
> >> 021-99325243
> Y! : hawking_123
> Linkedln : http://www.linkedin.com/in/wildanmaulana
>


intermediate results not getting compressed

2009-03-16 Thread Billy Pearson
I am running a large streaming job that processes about 3 TB of data, and I
am seeing large jumps in hard drive space usage in the reduce part of the
job. I tracked the problem down: the job is set to compress map outputs, but
looking at the intermediate files on the local drives, they are not getting
compressed during/after merges. I am going from having, say, 2 GB of
mapfile.out files to having one intermediate.X file 100-350% larger than the
map files. I have looked at one of the files and confirmed that it is not
compressed, since I can read the data in it. If it were only one merge it
would not be a problem, but when you are merging 70-100 of these you use
tons of GBs, and my tasks are starting to die as they run out of hard drive
space, which in the end kills the job.


I am running 0.19.1-dev, r744282. I have searched the issues but found
nothing about the compression.
Should the intermediate results not be compressed as well if the map output
files are set to be compressed?
If not, then why do we have the map compression option - just to save
network traffic?





Re: Task Side Effect files and copying(getWorkOutputPath)

2009-03-16 Thread Amareshwari Sriramadasu

Saptarshi Guha wrote:

Hello,
I would like to produce side effect files which will be later copied
to the outputfolder.
I am using FileOuputFormat, and in the Map's close() method i copy
files (from the local tmp/ folder) to
FileOutputFormat.getWorkOutputPath(job);

  

FileOutputFormat.getWorkOutputPath(job) is the correct method to get the
directory for task side-effect files.

You should not use the close() method, because promotion to the output
directory happens before close(). You can use the configure() method.

See org.apache.hadoop.tools.HadoopArchives.

[...]
I am using Hadoop 0.19 and have set jobConf.setNumTasksToExecutePerJvm(-1);
Also, each side effect file produced has a unique name, i.e there is
no overwriting.
  
You need not set jobConf.setNumTasksToExecutePerJvm(-1); even otherwise,
each attempt will have a unique work output path.
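
For example, a minimal sketch (the class and file names are made up) of a map
task that writes its side-effect files directly under the work output path,
instead of staging them on the local FS and moving them in close():

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SideEffectMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private FileSystem fs;
  private Path workDir;

  public void configure(JobConf job) {
    try {
      // per-attempt _temporary directory; files written here are promoted
      // to the job output directory when the attempt is committed
      workDir = FileOutputFormat.getWorkOutputPath(job);
      fs = workDir.getFileSystem(job);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // hypothetical side-effect file with a unique, attempt-local name
    Path sideFile = new Path(workDir, "side-" + key.get());
    FSDataOutputStream out = fs.create(sideFile);
    out.writeBytes(value.toString());
    out.close();

    output.collect(new Text("seen"), value);
  }
}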


Thanks
Amareshwari


Re: hadoop migration

2009-03-16 Thread W
Thanks for the quick response Aman,

Ok .., i see the point now.

Currently I'm doing some research on creating a Google Books-like
application using HBase as
a backend for storing the files and Solr as the indexer. From this
prototype, maybe I can measure how fast
HBase is at serving data to the client... (Google uses BigTable for
books.google.com, right?)

Thanks!

Regards,
Wildan

On Tue, Mar 17, 2009 at 12:13 PM, Amandeep Khurana  wrote:
> Hypertable is not as mature as Hbase yet. The next release of Hbase, 0.20.0,
> includes some patches which reduce the latency of responses and makes it
> suitable to be used as a backend for a webapp. However the current release
> isnt optimized for this purpose.
>
> The idea behind Hadoop and the rest of the tools around it is more of a data
> processing system than a backend datastore for a website. The output of the
> processing that Hadoop does is typically taken into a MySQL cluster which
> feeds a website.
>
>
>


-- 
---
OpenThink Labs
www.tobethink.com

Aligning IT and Education

>> 021-99325243
Y! : hawking_123
Linkedln : http://www.linkedin.com/in/wildanmaulana


Re: hadoop migration

2009-03-16 Thread Amandeep Khurana
AFAIK, Google uses BigTable for pretty much most of their backend stuff. The
thing to note here is that BigTable is much more mature than Hbase.

You can try it out and see how it works out for you. Do share your results
on the mailing list...


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Mon, Mar 16, 2009 at 10:28 PM, W  wrote:

> [...]


Re: intermediate results not getting compressed

2009-03-16 Thread Chris Douglas
I am running 0.19.1-dev, r744282. I have searched the issues but  
found nothing about the compression.


AFAIK, there are no open issues that prevent intermediate compression  
from working. The following might be useful:


http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Data+Compression

Should the intermediate results not be compressed also if the map  
output files are set to be compressed?


These are controlled by separate options:

FileOutputFormat::setCompressOutput enables/disables compression of the
final output.
JobConf::setCompressMapOutput enables/disables compression of the
intermediate map output.


If not then why do we have the map compression option just to save  
network traffic?


That's part of it. Also to save on disk bandwidth and intermediate  
space. -C
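
For example, a minimal sketch of setting the two options on a JobConf (the
job class and codec choice here are just placeholders):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class CompressionSettings {
  public static JobConf configure(Class<?> jobClass) {
    JobConf conf = new JobConf(jobClass);

    // intermediate (map output) compression
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(GzipCodec.class);

    // final job output compression - a separate, independent setting
    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

    return conf;
  }
}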