Re: structured data split

2011-11-11 Thread Denny Ye
hi Structured data is always split across blocks, so a record such as a word or line can straddle a block boundary. A MapReduce task reads HDFS data in units of *lines*: it will read the whole line, from the end of the previous block into the start of the subsequent one, to obtain the full line record. So you do not need to worry about the
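Denny's rule (each reader skips the partial line at the start of its split and reads past the end of its split to finish the last line) can be sketched as a standalone simulation. This is a toy model of what Hadoop's LineRecordReader does, not its actual code:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
    // Each split skips a partial first line (unless it starts at offset 0) and
    // reads past its end until the first newline, so every line lands in
    // exactly one split even when it straddles a block boundary.
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Skip the partial line; the previous split owns it.
            while (pos < data.length && data[pos - 1] != '\n') pos++;
        }
        while (pos < end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            lines.add(new String(data, lineStart, pos - lineStart));
            pos++; // step over the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes();
        // A split boundary at byte 8 falls in the middle of "bravo":
        System.out.println(readSplit(data, 0, 8));  // first split finishes "bravo"
        System.out.println(readSplit(data, 8, 20)); // second split starts at "charlie"
    }
}
```

The first split reads past byte 8 to complete "bravo"; the second split discards the tail of "bravo" and begins at "charlie", so no line is lost or duplicated.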

Re: Sizing help

2011-11-11 Thread Matt Foley
I agree with Ted's argument that 3x replication is way better than 2x. But I do have to point out that, since 0.20.204, the loss of a disk no longer causes the loss of a whole node (thankfully!) unless it's the system disk. So in the example given, if you estimate a disk failure every 2 hours,
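The 3x-vs-2x trade-off can be made concrete with back-of-the-envelope arithmetic. In this sketch the cluster size and the re-replication window are invented illustrative numbers; only the "disk failure every 2 hours" figure comes from the thread:

```java
public class ReplicaLoss {
    // Rough chance of losing a block after one disk failure: every one of the
    // (replication - 1) disks holding the surviving replicas must also fail
    // before re-replication completes, each with probability perDiskFailProb.
    static double lossProb(double perDiskFailProb, int replication) {
        return Math.pow(perDiskFailProb, replication - 1);
    }

    public static void main(String[] args) {
        // Illustrative assumptions: one disk failure every 2 hours across a
        // 1000-disk cluster, and a 1-hour re-replication window.
        double perDisk = 1.0 / (2.0 * 1000);
        System.out.printf("2x: %.1e  3x: %.1e%n",
                lossProb(perDisk, 2), lossProb(perDisk, 3));
    }
}
```

Even with these rough numbers, the third replica multiplies the loss probability by another factor of perDisk, which is why 3x is far safer than 2x rather than just 50% safer.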

Re: structured data split

2011-11-11 Thread 臧冬松
Thanks Denny! So that means each map task will have to read from another DataNode in order to read the last line of the previous block? Cheers, Donal 2011/11/11 Denny Ye denny...@gmail.com hi Structured data is always split across blocks, so a record such as a word or line can straddle a block boundary. A MapReduce

Re: structured data split

2011-11-11 Thread 臧冬松
Hi Bejoy, I don't understand why it's impossible to have half of a line in one block, since the file is split into fixed-size blocks. My scenario is that I have lots of files from a High Energy Physics experiment. These files are in binary format, about 2G each, but basically they are composed by
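If the binary files are built from fixed-size records (an assumption — the message is cut off before describing the format), the same boundary logic applies: a reader aligns itself to the record grid instead of scanning for a newline. A minimal sketch of that alignment:

```java
public class RecordAlign {
    // Offset of the first complete record at or after splitStart, given
    // fixed-size records that begin at byte 0 of the file. A reader positioned
    // mid-record rounds up; the previous split owns the straddling record.
    static long firstRecordInSplit(long splitStart, int recordSize) {
        long rem = splitStart % recordSize;
        return rem == 0 ? splitStart : splitStart + (recordSize - rem);
    }

    public static void main(String[] args) {
        // 100-byte records; a split boundary at byte 67108901 falls mid-record,
        // so this reader starts at the next multiple of 100.
        System.out.println(firstRecordInSplit(67108901L, 100));
        System.out.println(firstRecordInSplit(200L, 100)); // already aligned
    }
}
```

As with lines, each reader then reads past its split's end to finish its last record, so a record crossing a block boundary is processed exactly once.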

Re: Could not obtain block

2011-11-11 Thread Denny Ye
hi Steve, What's your HDFS release version? From the error log and the HDFS 0.21 code, I guess that the file does not have any replicas. You may focus on the missing replicas of this file. Pay attention to the NameNode log entries with that block id and track the replica distribution. Or check the
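One standard way to do the check Denny describes is `hadoop fsck`, which reports each block's id, replica count, and DataNode locations for a file (the path below is a placeholder; substitute your own):

```shell
# Show per-block replica placement for the problem file
hadoop fsck /path/to/file -files -blocks -locations
```

A block listed with zero live replicas confirms the "Could not obtain block" error is a missing-replica problem rather than a client-side one.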

Re: structured data split

2011-11-11 Thread Charles Earl
Hi, Please also feel free to contact me. I'm working with the STAR project at Brookhaven Lab, and we are trying to build a MR workflow for analysis of particle data. I've done some preliminary experiments running ROOT and other nuclear physics analysis software in MR and have been looking at

Re: structured data split

2011-11-11 Thread Bejoy KS
Hi Donal I don't have much exposure to the domain you are pointing to, but in plain MapReduce developer terms, here would be my way of looking into processing such a data format with MapReduce - If the data is kind of flowing in continuously, then I'd use Flume to collect

Re: structured data split

2011-11-11 Thread 臧冬松
Thanks Bejoy, that helps a lot! 2011/11/11, Bejoy KS bejoy.had...@gmail.com: Hi Donal I don't have much exposure to the domain you are pointing to, but in plain MapReduce developer terms, here would be my way of looking into processing such a data format with map

Re: structured data split

2011-11-11 Thread Harsh J
Sorry Bejoy, I'd typed that URL out from memory. Fixed link is: http://wiki.apache.org/hadoop/HadoopMapReduce 2011/11/11 Bejoy KS bejoy.had...@gmail.com: Thanks Harsh for correcting me with that wonderful piece of information. Cleared a wrong assumption on HDFS storage

Re: Using HDFS to store few MB files for file sharing purposes

2011-11-11 Thread Ivan Kelly
As Todd said, HDFS isn't suited to this. You could take a look at Gluster though. It seems like it would fit your needs better. -Ivan

Re: structured data split

2011-11-11 Thread Bejoy KS
Thanks Harsh!... 2011/11/11 Harsh J ha...@cloudera.com Sorry Bejoy, I'd typed that URL out from memory. Fixed link is: http://wiki.apache.org/hadoop/HadoopMapReduce 2011/11/11 Bejoy KS bejoy.had...@gmail.com: Thanks Harsh for correcting me with that wonderful piece

Re: Sizing help

2011-11-11 Thread Ted Dunning
Matt, Thanks for pointing that out. I was talking about machine chassis failure since it is the more serious case, but should have pointed out that losing single disks is subject to the same logic with smaller amounts of data. If, however, an installation uses RAID-0 for higher read speed then

RE: Sizing help

2011-11-11 Thread Steve Ed
I understand that with 0.20.204, loss of a disk doesn't lose the node. But if we have to replace that lost disk, it's again scheduling the whole node down, kicking off replication From: Matt Foley [mailto:mfo...@hortonworks.com] Sent: Friday, November 11, 2011 1:58 AM To:

Re: Sizing help

2011-11-11 Thread Matt Foley
Nope; hot swap :-) On Nov 11, 2011, at 9:59 AM, Steve Ed sediso...@gmail.com wrote: I understand that with 0.20.204, loss of a disk doesn't lose the node. But if we have to replace that lost disk, it's again scheduling the whole node down, kicking off replication *From:* Matt Foley

Re: Sizing help

2011-11-11 Thread Todd Lipcon
On Fri, Nov 11, 2011 at 10:15 AM, Matt Foley mfo...@hortonworks.com wrote: Nope; hot swap :-) AFAIK you can't re-add the marked-dead disk to the DN, can you? But yea, you can hot-swap the disk, then kick the DN process, which should take less than 10 minutes. That means the NN won't ever notice
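The roughly 10-minute window Todd mentions comes from the NameNode's dead-node heuristic: a DataNode is declared dead after about 2 × the heartbeat recheck interval plus 10 missed heartbeats, which with the usual defaults works out to about 10.5 minutes. The property names below are from the 0.20-era hdfs-default.xml; verify them against your release before relying on them:

```xml
<!-- Defaults behind the ~10-minute dead-node window -->
<property>
  <name>heartbeat.recheck.interval</name>
  <value>300000</value> <!-- ms; NN marks a DN dead after ~2x this + 10 heartbeats -->
</property>
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value> <!-- seconds between DataNode heartbeats -->
</property>
```

This is why bouncing the DN process in under 10 minutes, as Todd describes, never registers as a node death and triggers no re-replication.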