HDFS
Hello to all,

I have been attracted by the Hadoop project while looking for a solution for my application. Basically, I have an application hosting user-generated content (images, sounds, videos) and I would like to have it available at all times to all my servers. Servers will basically add new content; users can manipulate the existing content, make compositions, etc.

We have a few servers (2 for now) dedicated to hosting content, and right now they are connected via sshfs on some folders, in order to shorten the transfer time between these content servers and the application servers.

Would the Hadoop filesystem be useful in my case? Is it worth digging into?

In case it is doable, how redundant is the system? For instance, to store 1 MB of data, how much storage do I need (I guess at least 2 MB ...)?

I hope I made myself clear enough and will get encouraging answers,

Bests to all,
Eric
Re: HDFS
Hi,

> Would the Hadoop filesystem be usefull in my case, is it worth digging
> into it.

I guess not; your best choice would be something like MogileFS. HDFS is a filesystem optimized for distributed computation, and thus it works best with big files (comparable to the block size, i.e. 64MB). Hosting lots of smaller files would be overkill.

--
WBR, Mikhail Yakshin
Re: HDFS
Hi Eric,

we are currently building a system for a very similar purpose (digital asset management) and we currently use HDFS for a volume of approx. 100TB, with the option to scale into the PB range. Since we haven't gone into production yet, I cannot say it will work flawlessly, but so far everything has worked very well, with really good performance (especially read performance, which is probably the most important factor in your case as well).

The most important thing to be aware of, IMHO, is that you will not have a real filesystem at the OS level. If you use tools which need that to process the data, you will need to do some copying (which we do in some cases). There is a project out there that makes HDFS available via FUSE, but it appears to be rather alpha, which is why we haven't dared to take a look at it for this project.

Apart from the namenode, which you have to make redundant yourself (lots of posts in the archives on this topic), you can simply configure the level of redundancy (see docs).

Hope this helps,

Robert
Re: HDFS
On Fri, Sep 12, 2008 at 3:08 AM, Robert Krüger <[EMAIL PROTECTED]> wrote:
> we are currently building a system for a very similar purpose (digital
> asset management) and we use HDFS currently for a volume of approx.
> 100TB with the option to scale into the PB range.

Robert, would you mind expanding on why you picked HDFS over something like GFS or MogileFS? I would have agreed with Mikhail - HDFS seems like it's purpose-built for Hadoop, and wouldn't necessarily be the best choice if you just wanted a filesystem.

--
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com
Re: How to move files from one location to another on hadoop
On Wed, Jul 30, 2008 at 2:06 PM, Rutuja Joshi <[EMAIL PROTECTED]> wrote:
> Could anyone suggest any efficient way to move files from one location to
> another on Hadoop. Please note that both the locations are on HDFS.
> I tried looking for inbuilt file system APIs but couldn't find anything
> suitable.

The code you want to start with is src/core/org/apache/hadoop/fs/FsShell.java (in 0.18.0, but I think it's been around for a while). That's where you'll see the implementation of 'hadoop dfs -mv filea fileb' - in this case, you're looking for rename().

--
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com
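For what it's worth, the same move can also be done programmatically through the FileSystem API rather than via the shell - a minimal sketch (the paths are made up; this assumes a configured Hadoop client on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMove {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up hadoop-site.xml
    FileSystem fs = FileSystem.get(conf);      // the default (HDFS) filesystem
    // On HDFS, rename() is a namenode metadata operation - no block data is copied
    boolean ok = fs.rename(new Path("/user/me/filea"), new Path("/user/me/fileb"));
    System.out.println("moved: " + ok);
  }
}
```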
Re: namenode multithreaded
The core of namenode functionality happens in a single thread because of a global lock, unfortunately. The other CPUs would still be used to some extent by network I/O and other threads. Usually we don't see just one CPU at 100% and nothing on the other CPUs. What kind of load do you have?

Raghu.

Dmitry Pushkarev wrote:
> Hi. My namenode runs on an 8-core server with lots of RAM, but it only uses
> one core (100%). Is it possible to tell namenode to use all available cores?
> Thanks.
Re: How to move files from one location to another on hadoop
Copying between filesystems, particularly between HDFS filesystems, is best done with distcp: http://hadoop.apache.org/core/docs/r0.18.0/distcp.html

-C
RE: namenode multithreaded
I have 15+ million small files I'd like to process and move around. Thus my operations don't really involve the datanodes - they're idle when I, for example, do FS operations (like sorting a bunch of new files written by the tasktracker into appropriate folders). Now I tried to use HADOOP_OPTS=-server and it seems to help a little, but performance still isn't great. Perhaps the problem is in the way I play with the files - it's a perl script over davfs2 over WebDAV, which uses the native API.

Can anyone give an example of a jython or jruby file that'd recursively go over an HDFS folder and move all files to a different folder? (My programming skills are very modest..)
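Not jython/jruby, but a sketch in plain Java of the kind of recursive move described above, using the FileSystem API (0.18-era method names; paths come from the command line, and this is an illustrative sketch rather than tested code - note it flattens all files into one target directory):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecursiveMove {
  // Move every file under src into dst, recursing into subdirectories.
  static void moveAll(FileSystem fs, Path src, Path dst) throws Exception {
    for (FileStatus stat : fs.listStatus(src)) {
      if (stat.isDir()) {
        moveAll(fs, stat.getPath(), dst);
      } else {
        // rename is a namenode metadata operation; no block data moves
        fs.rename(stat.getPath(), new Path(dst, stat.getPath().getName()));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    moveAll(fs, new Path(args[0]), new Path(args[1]));
  }
}
```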
Why can't Hadoop be used for online applications ?
Hi,

Here is a basic doubt. I found it mentioned in different documentation that Hadoop is not recommended for online applications. Can anyone please elaborate on the same?

Regards,
Sourav

CAUTION - Disclaimer *
This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely for the use of the addressee(s). If you are not the intended recipient, please notify the sender by e-mail and delete the original message. Further, you are not to copy, disclose, or distribute this e-mail or its contents to any other person and any such actions are unlawful. This e-mail may contain viruses. Infosys has taken every reasonable precaution to minimize this risk, but is not liable for any damage you may sustain as a result of any virus in this e-mail. You should carry out your own virus checks before opening the e-mail or attachment. Infosys reserves the right to monitor and review the content of all messages sent to or from this e-mail address. Messages sent to or from this e-mail address may be stored on the Infosys e-mail system.
***INFOSYS End of Disclaimer INFOSYS***
Re: Why can't Hadoop be used for online applications ?
Hadoop is best suited for distributed processing of large data sets across many machines. Most people use Hadoop to plow through large data sets in an offline fashion. One approach you can use is to process your data with Hadoop, then put it in an optimized form in HBase (i.e., similar to Google's Bigtable). Then you can use HBase for querying the data in an online-access fashion. Refer to http://hadoop.apache.org/hbase/ for more information about HBase.

Ryan

On Fri, Sep 12, 2008 at 2:46 PM, souravm <[EMAIL PROTECTED]> wrote:
> I found in different documentation it is mentioned that Hadoop is not
> recommended for online applications. Can anyone please elaborate on the same?
RE: Why can't Hadoop be used for online applications ?
Thanks Ryan for your inputs.

Regards,
Sourav

From: Ryan LeCompte [EMAIL PROTECTED]
Sent: Friday, September 12, 2008 11:55 AM
To: core-user@hadoop.apache.org
Subject: Re: Why can't Hadoop be used for online applications ?
Re: Why can't Hadoop be used for online applications ?
Hi Ryan!

Does this mean that HBase could be used for online applications, for example, replacing MySQL in database-driven applications?

Does anyone have any kind of benchmarks comparing MySQL queries/updates and HBase queries/updates?

Have a nice day,

Camilo.
Re: Why can't Hadoop be used for online applications ?
Hey Camilo,

HBase is not meant to be a replacement for MySQL or a traditional RDBMS (HBase is not transactional, for instance). I'd recommend reading the following article that describes what HBase/Bigtable really is:

http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable

Thanks,
Ryan
Tips on sorting using Hadoop
Hi,

I want to sort my records (consisting of string, int, float) using Hadoop. One way I have found is to set the number of reducers to 1, but this would mean all the records go to one reducer and it won't be optimized. Can anyone point me to a better way to do sorting using Hadoop?

Thanks,
Tenaali
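For what it's worth, the trick the terasort example uses (TotalOrderPartitioner) is to keep many reducers but partition by key *range*, so reducer i only receives keys smaller than reducer i+1's; each reducer sorts its own slice, and the outputs simply concatenate into a globally sorted result. A self-contained toy illustration of that idea in plain Java (not actual MapReduce; the boundary keys here are invented - the real partitioner derives them by sampling the input):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RangeSort {
  // Boundary keys splitting the key space among 3 "reducers" (made-up values).
  static final String[] BOUNDS = { "h", "p" };

  // Route a key to the partition whose range contains it.
  static int partition(String key) {
    for (int i = 0; i < BOUNDS.length; i++)
      if (key.compareTo(BOUNDS[i]) < 0) return i;
    return BOUNDS.length;
  }

  // Partition by range, sort each partition independently (as each reducer
  // would), then concatenate -- the result is globally sorted.
  public static List<String> sort(List<String> keys) {
    List<List<String>> parts = new ArrayList<List<String>>();
    for (int i = 0; i <= BOUNDS.length; i++) parts.add(new ArrayList<String>());
    for (String k : keys) parts.get(partition(k)).add(k);
    List<String> out = new ArrayList<String>();
    for (List<String> p : parts) { Collections.sort(p); out.addAll(p); }
    return out;
  }
}
```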
serialization.Deserializer.deserialize method help
This method's signature is

{code}
T deserialize(T);
{code}

But the RecordReader next method is

{code}
boolean next(K,V);
{code}

So, if the deserialize method does not return the same T (i.e., K or V), how would this new Object be propagated back through the RecordReader next method? It seems the contract on the deserialize method is that it must return the same T (although the javadocs say "may"). Am I missing something? And if not, why isn't the API boolean deserialize(T)?

Thanks, pete

P.S. For things like Thrift, there's no way to re-use the object, as there's no clear() method - so if this is the case, I don't see how it would work??
Accessing input files from different servers
Hi,

I would like to process a set of log files (say web server access logs) from a number of different machines. So I need to get those log files from the respective machines into my central HDFS. To achieve this -

a) Do I need to install hadoop and start running HDFS (using start-dfs.sh) on all the machines where the log files are getting created, and then do a file get from the central HDFS server?
b) Is there any other way to achieve this?

Regards,
Sourav
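Regarding (a): the machines producing the logs don't need to run HDFS daemons; they only need the Hadoop client libraries plus a configuration pointing at the central namenode, and can then write into HDFS directly. A rough sketch (hostnames and paths are placeholders):

```java
import java.net.InetAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PushLogs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point this client at the central cluster's namenode
    conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
    FileSystem fs = FileSystem.get(conf);
    // Copy a local access log into HDFS, keyed by this machine's hostname
    fs.copyFromLocalFile(
        new Path("/var/log/httpd/access_log"),
        new Path("/logs/" + InetAddress.getLocalHost().getHostName()));
  }
}
```

Running this (e.g. from cron) on each log-producing machine avoids any intermediate copy step.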
Re: serialization.Deserializer.deserialize method help
If you pass in null to the deserializer, it creates a new instance and returns it; passing in an instance reuses it.

I don't understand the disconnect between Deserializer and the RecordReader. Does your RecordReader generate instances that only share a common subtype T? You need separate Deserializers for K and V, if that's the issue... -C
Re: Thinking about retriving DFS metadata from datanodes!!!
叶双明 wrote:
> Thanks for paying attention to my tentative idea! What I was thinking about isn't
> how to store the metadata, but a final (last-resort) way to recover valuable data
> in the cluster when the worst happens (something that destroys the metadata on
> all the NameNodes), e.g. a terrorist attack or natural disaster destroying half of
> the cluster nodes including all NameNodes. We could recover as much data as
> possible by this mechanism, and have a big chance of recovering the entire data
> of the cluster because of the original replication.

If you want to survive any event that loses a datacentre, you need to mirror the data off site, choosing that second site with an up-to-date fault line map of the city, geological knowledge of where recent eruptions ended up, etc. Which is why nobody builds datacentres in Enumclaw WA that I'm aware of; the spec for the fabs in/near Portland is that they ought to withstand 1-2m of volcanic ash landing on them (what they'd have got if there'd been an easterly wind when Mount Saint Helens went). Then, once you have some safe location for the second site, talk to your telco about how the high-bandwidth backbones in your city flow (Metropolitan Area Ethernet and the like), and try to find somewhere that meets your requirements.

Then: come up with a protocol that efficiently keeps the two sites up to date. And reliably: S3 went down last month because they'd been using a Gossip-style update protocol but weren't checksumming everything, because there's no need on a LAN - but of course on a cross-city network more things can go wrong, and for them it did.

Something to keep multiple hadoop filesystems synchronised efficiently and reliably across sites could be very useful to many people.

-steve
Re: serialization.Deserializer.deserialize method help
What I mean is, let's say I plug in a deserializer that always returns a new Object - in that case, since the reference is passed by value, the new object cannot make its way back to the SequenceFileRecordReader user.

while (sequenceFileRecordReader.next(mykey, myvalue)) {
  // do something
}

And then one/both of my deserializers looks like:

T deserialize(T obj) {
  // ignore obj
  return new T(params);
}

Obj would be the key or the value passed in by the user, but since I ignore it, basically what happens is that the deserialized value actually gets thrown away. More specifically, it gets thrown away in SequenceFile.Reader, I believe.

-- pete
Re: How to manage a large cluster?
James Moore wrote:
> On Thu, Sep 11, 2008 at 5:46 AM, Allen Wittenauer <[EMAIL PROTECTED]> wrote:
>> On 9/11/08 2:39 AM, "Alex Loddengaard" <[EMAIL PROTECTED]> wrote:
>>> I've never dealt with a large cluster, though I'd imagine it is managed
>>> the same way as small clusters:
>> Maybe. :)
> Depends how often you like to be paged, doesn't it :)
>> Instead, use a real system configuration management package such as bcfg2,
>> smartfrog, puppet, cfengine, etc. [Steve, you owe me for the plug. :) ]

Yes Allen, I owe you a beer at the next ApacheCon we are both at. Actually, I think Y! were one of the sponsors at the UK event, so we owe you for that too.

> Or on EC2 and its competitors, just build a new image whenever you need to
> update Hadoop itself.

1. It's still good to have as much automation of your image build as you can; if you can build new machine images on demand you can have fun/make a mess of things. Look at http://instalinux.com to see the web GUI for creating linux images on demand that is used inside HP.

2. When you try to bring up everything from scratch, you have a choreography problem. DNS needs to be up early, then your authentication system and the management tools, then the other parts of the system. If you have a project where hadoop is integrated with the front-end site, for example, your app servers have to stay offline until HDFS is live. So it does get complex.

3. The Hadoop nodes are good here, in that you aren't required to bring up the namenode first; the datanodes will wait, and the same goes for the tasktrackers and jobtracker. But if you, say, need to point everything at a new hostname for the namenode, well, that's a config change that needs to be pushed out, somehow.

I'm adding some stuff on different ways to deploy hadoop here: http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment

-steve
Re: serialization.Deserializer.deserialize method help
Oh, I see what you mean. Yes, you need to reuse the objects that you're given in your deserializer. This will change with HADOOP-1230, though. -C
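A self-contained toy (invented classes, not the real Hadoop interfaces) that makes the failure mode concrete: a caller in the next(K,V) style only ever sees the instance it passed in, so a deserializer that allocates a fresh object effectively loses the data:

```java
// Toy stand-ins for the Deserializer<T> / RecordReader-style interaction.
public class ReuseContract {
  static class Record { String field; }

  // Correct: fill the caller's object in place (allocate only when given null).
  static Record deserializeInPlace(Record t, String data) {
    if (t == null) t = new Record();
    t.field = data;
    return t;
  }

  // Broken for next(K,V)-style callers: a fresh object is returned, but a
  // caller that only keeps its own reference never sees it.
  static Record deserializeFresh(Record ignored, String data) {
    Record r = new Record();
    r.field = data;
    return r;
  }

  // Simulates a caller like SequenceFileRecordReader.next(key, value),
  // which discards the deserializer's return value.
  public static String readWith(boolean inPlace) {
    Record key = new Record();   // the caller's reusable instance
    key.field = "stale";
    if (inPlace) deserializeInPlace(key, "new");
    else deserializeFresh(key, "new");
    return key.field;            // what the caller actually observes
  }
}
```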
Re: serialization.Deserializer.deserialize method help
Specifically, line 75 of SequenceFileRecordReader:

> boolean remaining = (in.next(key) != null);

throws out the return value of SequenceFile.next, which is the result of deserialize(obj).

-- pete
apache.mirror99.com mirror is very out of date
http://apache.mirror99.com/lucene/hadoop is quite out of date. The only versions available are 0.14.2 and 0.15.2. 0.14.2 is marked as "stable" and fails to build out of the box with ant (no target) on linux (seems like there are missing .template files or something; the error is from line 133 of build.xml, if anyone cares). It should probably be updated or removed from the mirror list; it's somewhat confusing to wind up there if you're new - I had to ask for help in IRC to figure out what was going wrong.

E
Re: serialization.Deserializer.deserialize method help
Sorry - saw the response after I sent this. But the current javadocs are wrong and should probably say it must return what was passed in.

On 9/12/08 3:02 PM, "Pete Wyckoff" <[EMAIL PROTECTED]> wrote:

> Specifically, line 75 of SequenceFileRecordReader:
>
>> boolean remaining = (in.next(key) != null);
>
> throws away the return value of SequenceFile.next, which is the result of
> deserialize(obj).
>
> -- pete
>
> On 9/12/08 2:28 PM, "Pete Wyckoff" <[EMAIL PROTECTED]> wrote:
>
>> What I mean is: let's say I plug in a deserializer that always returns a new
>> object - in that case, since everything is pass by value, the new object
>> cannot make its way back to the SequenceFileRecordReader user.
>>
>> while (sequenceFileRecordReader.next(mykey, myvalue)) {
>>   // do something
>> }
>>
>> And then one or both of my deserializers looks like:
>>
>> T deserialize(T obj) {
>>   // ignore obj
>>   return new T(params);
>> }
>>
>> Obj would be the key or the value passed in by the user, but since I ignore
>> it, what happens is that the deserialized value actually gets thrown
>> away - more specifically, in SequenceFile.Reader, I believe.
>>
>> -- pete
>>
>> On 9/12/08 2:20 PM, "Chris Douglas" <[EMAIL PROTECTED]> wrote:
>>
>>> If you pass in null to the deserializer, it creates a new instance and
>>> returns it; passing in an instance reuses it.
>>>
>>> I don't understand the disconnect between Deserializer and the
>>> RecordReader. Does your RecordReader generate instances that only
>>> share a common subtype T? You need separate Deserializers for K and V,
>>> if that's the issue... -C
>>>
>>> On Sep 12, 2008, at 2:01 PM, Pete Wyckoff wrote:
>>>
>>>> This method's signature is
>>>>
>>>> {code} T deserialize(T); {code}
>>>>
>>>> But the RecordReader next method is
>>>>
>>>> {code} boolean next(K, V); {code}
>>>>
>>>> So, if the deserialize method does not return the same T (i.e., K or V),
>>>> how would this new Object be propagated back through the RecordReader
>>>> next method?
>>>>
>>>> It seems the contract on the deserialize method is that it must return
>>>> the same T (although the javadocs say "may"). Am I missing something?
>>>> And if not, why isn't the API boolean deserialize(T)?
>>>>
>>>> Thanks, pete
>>>>
>>>> P.S. For things like Thrift, there's no way to re-use the object as
>>>> there's no clear() method, so if this is the case, I don't see how it
>>>> would work?
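A minimal, self-contained sketch of the problem being described, using hypothetical simplified stand-ins for the real Hadoop interfaces: when the reader discards the return value of deserialize(), a deserializer that ignores its argument (as a Thrift-style one with no clear() method must) loses the deserialized value entirely.

```java
// Hypothetical simplified stand-ins, not the actual Hadoop classes.
interface Deserializer<T> {
    T deserialize(T reuse);
}

class Holder {
    String value;
}

// Mimics the SequenceFileRecordReader pattern: the object returned by
// deserialize() is thrown away; only the passed-in object survives.
class Reader {
    private final Deserializer<Holder> deser;

    Reader(Deserializer<Holder> deser) {
        this.deser = deser;
    }

    boolean next(Holder key) {
        // "boolean remaining = (in.next(key) != null);" -- the freshly
        // deserialized object is discarded right here.
        Holder returned = deser.deserialize(key);
        return returned != null;
    }
}

public class DeserializeContract {
    public static void main(String[] args) {
        // A deserializer that ignores 'reuse' and allocates a new object,
        // as one with no way to reset the passed-in instance would.
        Deserializer<Holder> fresh = reuse -> {
            Holder h = new Holder();
            h.value = "deserialized";
            return h;
        };

        Holder key = new Holder();
        new Reader(fresh).next(key);
        // The caller's object was never filled in: the deserialized value
        // is unreachable through boolean next(K, V).
        System.out.println(key.value); // prints null
    }
}
```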
Parameterized deserializers?
If I have generic Serializer/Deserializers that take some runtime information to instantiate, how would this work with the current serializer/deserializer APIs? And depending on this runtime information, they may return different Objects, although all derived from the same class.

For example, for Thrift, I may have something general called a ThriftDeserializer:

{code}
public class ThriftDeserializer<T> implements Deserializer<T> {
  T deserialize(T);
}
{code}

How would I instantiate this, since the current getDeserializer takes only the Class but not a Configuration object? And how would I implement createKey in the RecordReader?

In other words, I think we need a {code}Class<T> getClass();{code} method in Deserializer and a {code}Deserializer<T> getDeserializer(Class<T>, Configuration conf);{code} method in Serializer.java.

Or is there another way to do this? If not, I can open a JIRA for implementing parameterized serializers.

Thanks, pete
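A sketch of the kind of runtime-parameterized deserializer being asked about, with hypothetical names and simplified interfaces (the real Hadoop Deserializer also has open/close and stream handling): the concrete record class is only known from configuration at runtime, so a Class alone is not enough to construct the deserializer.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the Hadoop Deserializer interface.
interface Deserializer<T> {
    T deserialize(T reuse);
}

// Hypothetical Thrift-style deserializer: which record type to build is
// decided by runtime information carried in a configuration map, which a
// getDeserializer(Class) factory method has no way to supply.
class ThriftishDeserializer implements Deserializer<Object> {
    private final Class<?> recordClass;

    ThriftishDeserializer(Map<String, String> conf) {
        try {
            this.recordClass = Class.forName(conf.get("record.class"));
        } catch (ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public Object deserialize(Object reuse) {
        try {
            // Ignores 'reuse': Thrift objects have no clear() to reset them.
            return recordClass.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}

public class ParameterizedDeserializerDemo {
    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("record.class", "java.util.ArrayList"); // runtime info
        Deserializer<Object> d = new ThriftishDeserializer(conf);
        System.out.println(d.deserialize(null).getClass().getName());
        // prints java.util.ArrayList
    }
}
```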
Re: serialization.Deserializer.deserialize method help
On Sep 12, 2008, at 3:01 PM, Chris Douglas wrote:

> Oh, I see what you mean. Yes, you need to reuse the objects that you're
> given in your deserializer.

This isn't true in the general case. The Java serializer, for instance, always returns a new instance. The SequenceFile reader has a pair of methods:

public Object next(Object key) throws IOException;
public Object nextValue(Object value) throws IOException;

so that you can read Java-serialized objects from a sequence file. They also work as map outputs and reduce outputs. The only place where you are hosed is the RecordReader interface. HADOOP-1230's changes to the RecordReader were designed to fix the problem.

-- Owen
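The difference between the two API shapes can be sketched with hypothetical simplified types: when next() returns the object instead of a boolean, a deserializer that allocates fresh instances (as Java serialization does) can still hand them back to the caller.

```java
// Simplified sketch contrasting the two reader API shapes discussed.
interface Deserializer<T> {
    T deserialize(T reuse);
}

class ReturningReader {
    private final Deserializer<StringBuilder> deser;

    ReturningReader(Deserializer<StringBuilder> deser) {
        this.deser = deser;
    }

    // SequenceFile-style: return the (possibly new) object, so even a
    // deserializer that ignores 'reuse' can deliver its result.
    StringBuilder next(StringBuilder key) {
        return deser.deserialize(key);
    }
}

public class NextShapes {
    public static void main(String[] args) {
        // A deserializer that ignores 'reuse' and allocates a new object.
        Deserializer<StringBuilder> fresh =
            reuse -> new StringBuilder("from-stream");

        StringBuilder key = new StringBuilder();
        // The caller uses the returned reference, not the argument --
        // something a boolean next(K, V) signature cannot express.
        StringBuilder got = new ReturningReader(fresh).next(key);
        System.out.println(got); // prints from-stream
    }
}
```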
Re: Accessing input files from different servers
> a) Do I need to install hadoop and start running HDFS (using start-dfs.sh)
> on all those machines where the log files are getting created? And then do a
> file get from the central HDFS server?

I'd install hadoop on the machine, but you don't have to start any nodes there - you can talk to a cluster running elsewhere using the command line tools to put/get data from the cluster. From what I recall, this is actually better than running nodes locally, since if you put data in locally, the blocks will tend to be placed on the local machine.

Tim
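A sketch of the remote put/get workflow Tim describes, with a hypothetical namenode host and paths; only the Hadoop install and a client config pointing at the remote cluster are needed, no local daemons.

```shell
# In the client's hadoop-site.xml, point at the remote cluster
# (host/port are placeholders):
#   fs.default.name = hdfs://namenode.example.com:9000

# Copy a local log file into the cluster, fetch data back, and list:
bin/hadoop dfs -put /var/log/myapp/app.log /logs/app.log
bin/hadoop dfs -get /logs/app.log /tmp/app.log
bin/hadoop dfs -ls /logs
```

Because the client is not a datanode, the blocks it writes are spread across the cluster rather than piling up on the uploading machine.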
Re: Parameterized deserializers?
I should mention this is out of the context of SequenceFiles, where we get the class names in the file itself. Here there is some information inserted into the JobConf that tells me the class of the records in the input file.

-- pete

On 9/12/08 3:26 PM, "Pete Wyckoff" <[EMAIL PROTECTED]> wrote:

> If I have generic Serializer/Deserializers that take some runtime
> information to instantiate, how would this work with the current
> serializer/deserializer APIs?
> [snip]
Re: Parameterized deserializers?
If you make your Serialization implement Configurable it will be given a Configuration object that it can pass to the Deserializer on construction.

Also, this thread may be related:
http://www.nabble.com/Serialization-with-additional-schema-info-td19260579.html

Tom

On Sat, Sep 13, 2008 at 12:38 AM, Pete Wyckoff <[EMAIL PROTECTED]> wrote:
> I should mention this is out of the context of SequenceFiles where we get
> the class names in the file itself. Here there is some information inserted
> into the JobConf that tells me the class of the records in the input file.
> [snip]
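A sketch of the pattern Tom describes, using simplified stand-ins for the real interfaces (which live in org.apache.hadoop.io.serializer and org.apache.hadoop.conf): the framework sets the Configuration on a Serialization that implements Configurable, and the Serialization forwards it to each Deserializer it constructs.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-ins for Hadoop's Configurable/Serialization contracts.
interface Configurable {
    void setConf(Map<String, String> conf);
}

interface Deserializer<T> {
    T deserialize(T reuse);
}

// A deserializer whose behavior depends on runtime job configuration.
class ConfiguredDeserializer implements Deserializer<Object> {
    private final String recordClassName;

    ConfiguredDeserializer(Map<String, String> conf) {
        // Runtime information comes from the configuration, not the Class.
        this.recordClassName = conf.get("thrift.record.class");
    }

    @Override
    public Object deserialize(Object reuse) {
        try {
            return Class.forName(recordClassName)
                        .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}

// The Serialization implements Configurable, so the framework hands it
// the conf, which it passes to deserializers on construction.
class ThriftishSerialization implements Configurable {
    private Map<String, String> conf = new HashMap<>();

    @Override
    public void setConf(Map<String, String> conf) {
        this.conf = conf;
    }

    Deserializer<Object> getDeserializer(Class<?> c) {
        return new ConfiguredDeserializer(conf);
    }
}

public class ConfigurableSerializationDemo {
    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("thrift.record.class", "java.util.ArrayList");

        ThriftishSerialization ser = new ThriftishSerialization();
        ser.setConf(conf); // in real Hadoop, the framework does this
        Object record = ser.getDeserializer(Object.class).deserialize(null);
        System.out.println(record.getClass().getName());
        // prints java.util.ArrayList
    }
}
```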