Re: FileSystem Caching in Hadoop
I think this is the wrong angle to go about it - like you mentioned in your first post, the Linux file system cache *should* be taking care of this for us. That it is not is a fault of the current implementation and not an inherent problem. I think one solution is HDFS-347 - I'm putting the finishing touches on a design doc for that JIRA and should have it up in the next day or two. -Todd On Tue, Oct 6, 2009 at 5:25 PM, Edward Capriolo wrote: > On Tue, Oct 6, 2009 at 6:12 PM, Aaron Kimball wrote: > > Edward, > > > > Interesting concept. I imagine that implementing "CachedInputFormat" over > > something like memcached would make for the most straightforward > > implementation. You could store 64MB chunks in memcached and try to > retrieve > > them from there, falling back to the filesystem on failure. One obvious > > potential drawback of this is that a memcached cluster might store those > > blocks on different servers than the file chunks themselves, leading to > an > > increased number of network transfers during the mapping phase. I don't > know > > if it's possible to "pin" the objects in memcached to particular nodes; > > you'd want to do this for mapper locality reasons. > > > > I would say, though, that 1 GB out of 8 GB on a datanode is somewhat > > ambitious. It's been my observation that people tend to write > memory-hungry > > mappers. If you've got 8 cores in a node, and 1 GB each have already gone > to > > the OS, the datanode, and the tasktracker, that leaves only 5 GB for task > > processes. Running 6 or 8 map tasks concurrently can easily gobble that > up. > > On a 16 GB datanode with 8 cores, you might get that much wiggle room > > though. > > > > - Aaron > > > > > > On Tue, Oct 6, 2009 at 8:16 AM, Edward Capriolo >wrote: > > > >> After looking at the HBaseRegionServer and its functionality, I began > >> wondering if there is a more general use case for memory caching of > >> HDFS blocks/files. In many use cases people wish to store data on > >> Hadoop indefinitely, however the last day,last week, last month, data > >> is probably the most actively used. For some Hadoop clusters the > >> amount of raw new data could be less then the RAM memory in the > >> cluster. > >> > >> Also some data will be used repeatedly, the same source data may be > >> used to generate multiple result sets, and those results may be used > >> as the input to other processes. > >> > >> I am thinking an answer could be to dedicate an amount of physical > >> memory on each DataNode, or on several dedicated node to a distributed > >> memcache like layer. Managing this cache should be straight forward > >> since hadoop blocks are pretty much static. (So say for a DataNode > >> with 8 GB of memory dedicate 1GB to HadoopCacheServer.) If you had > >> 1000 Nodes that cache would be quite large. > >> > >> Additionally we could create a new file system type cachedhdfs > >> implemented as a facade, or possibly implement CachedInputFormat or > >> CachedOutputFormat. > >> > >> I know that the underlying filesystems have cache, but I think Hadoop > >> writing intermediate data is going to evict some of the data which > >> "should be" semi-permanent. > >> > >> So has anyone looked into something like this? This was the closest > >> thing I found. > >> > >> http://issues.apache.org/jira/browse/HADOOP-288 > >> > >> My goal here is to keep recent data in memory so that tools like Hive > >> can get a big boost on queries for new data. > >> > >> Does anyone have any ideas? 
> >> > > > > Aaron, > > Yes 1GB out of 8GB was just an arbitrary value I decided. Remember > that 16K of ram did get a man to the moon. :) I am thinking the value > would be configurable, say dfs.cache.mb. > > Also there is the details of cache eviction, or possibly including and > excluding paths and files. > > Other then the InputFormat concept we could plug the cache in directly > into the DFSclient. In this way the cache would always end up on the > node where the data was. Otherwise the InputFormat will have to manage > that which would be a lot of work. I think if we prove the concept we > can then follow up and get it more optimized. > > I am poking around the Hadoop internals to see what options we have. > My first implementation I will probably patch some code, run some > tests, profile performance. >
Re: Creating Lucene index in Hadoop
Hi Ning, I am also looking at different approaches to indexing with Hadoop. I was able to build an index into HDFS using the Hadoop contrib package, but since HDFS is not designed for random access, what are the recommended ways to move the indexes to the local filesystem? Also, what would be the best approach to begin with? Should we look into the Katta or Solr integrations? Thanks in advance.

Ning Li-5 wrote: > >> I'm missing why you would ever want the Lucene index in HDFS for >> reading. > > The Lucene indexes are written to HDFS, but that does not mean you > conduct search on the indexes stored in HDFS directly. HDFS is not > designed for random access. Usually the indexes are copied to the > nodes where search will be served. With > http://issues.apache.org/jira/browse/HADOOP-4801, however, it may > become feasible to search on HDFS directly. > > Cheers, > Ning > > > On Mon, Mar 16, 2009 at 4:52 PM, Ian Soboroff > wrote: >> >> Does anyone have stats on how multiple readers on an optimized Lucene >> index in HDFS compares with a ParallelMultiReader (or whatever its >> called) over RPC on a local filesystem? >> >> I'm missing why you would ever want the Lucene index in HDFS for >> reading. >> >> Ian >> >> Ning Li writes: >> >>> I should have pointed out that Nutch index build and contrib/index >>> targets different applications. The latter is for applications who >>> simply want to build Lucene index from a set of documents - e.g. no >>> link analysis. >>> >>> As to writing Lucene indexes, both work the same way - write the final >>> results to local file system and then copy to HDFS. In contrib/index, >>> the intermediate results are in memory and not written to HDFS. >>> >>> Hope it clarifies things. >>> >>> Cheers, >>> Ning >>> >>> >>> On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff >>> wrote: I understand why you would index in the reduce phase, because the anchor text gets shuffled to be next to the document. However, when you index in the map phase, don't you just have to reindex later? The main point to the OP is that HDFS is a bad FS for writing Lucene indexes because of how Lucene works. The simple approach is to write your index outside of HDFS in the reduce phase, and then merge the indexes from each reducer manually. Ian Ning Li writes: > Or you can check out the index contrib. The difference of the two is > that: > - In Nutch's indexing map/reduce job, indexes are built in the > reduce phase. Afterwards, they are merged into smaller number of > shards if necessary. The last time I checked, the merge process does > not use map/reduce. > - In contrib/index, small indexes are built in the map phase. They > are merged into the desired number of shards in the reduce phase. In > addition, they can be merged into existing shards. > > Cheers, > Ning > > > On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 wrote: >> you can see the nutch code. >> >> 2009/3/13 Mark Kerzner >> >>> Hi, >>> >>> How do I allow multiple nodes to write to the same index file in >>> HDFS? >>> >>> Thank you, >>> Mark >>> >> >> >> > > -- View this message in context: http://www.nabble.com/Creating-Lucene-index-in-Hadoop-tp22490120p25780366.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
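As a rough illustration of the copy steps Ning describes — build the shard on local disk, publish it to HDFS, then pull it back to a search node's local filesystem before opening it with Lucene — here is a minimal Java sketch. The paths and the NameNode URI are placeholders, not values from this thread.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PublishIndexShard {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);

        // 1. Publish a shard that was built on local disk (e.g. by a reduce task
        //    or by contrib/index) into HDFS.
        Path localShard = new Path("/tmp/index-shard-00000");
        Path hdfsShard = new Path("/indexes/shard-00000");
        hdfs.copyFromLocalFile(false, true, localShard, hdfsShard);

        // 2. On a search node, pull the shard back onto the local filesystem,
        //    since HDFS is not designed for the random access Lucene needs.
        Path servingCopy = new Path("/var/search/shard-00000");
        hdfs.copyToLocalFile(hdfsShard, servingCopy);
        // Open the local copy with Lucene's FSDirectory and serve queries from it.
      }
    }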
Re: A question on dfs.safemode.threshold.pct
Now it's clear. Thank you, Raghu. But if you set it to 1.1, the safemode is permanent :). Thanks, Manhee - Original Message - From: "Raghu Angadi" To: Sent: Wednesday, October 07, 2009 10:03 AM Subject: Re: A question on dfs.safemode.threshold.pct I am not sure what the real concern is... You can set it to 1.0 (or even 1.1 :)) if you prefer. Many admins do. Raghu. On Tue, Oct 6, 2009 at 5:20 PM, Manhee Jo wrote: Thank you, Raghu. Then, when the percentage is below 0.999, how can you tell if some datanodes are just slower than others or some of the data blocks are lost? I think "percentage 1" should have speacial meaning like it guarantees integrity of data in HDFS. If it's below 1, then the integrity is not said to be guaranteed. Or are there any other useful means that a NameNode can fix the lost blocks, so that it doesn't care even 0.1% of data is lost? Thanks, Manhee - Original Message - From: "Raghu Angadi" To: Sent: Wednesday, October 07, 2009 1:26 AM Subject: Re: A question on dfs.safemode.threshold.pct Yes, it is mostly geared towards replication greater than 1. One of the reasons for waiting for this threshold is to avoid HDFS starting unnecessary replications of blocks at the start up when some of the datanodes are slower to start up. When the replication is 1, you don't have that issue. A block either exists or does not. Raghu 2009/10/5 Manhee Jo Hi all, Why isn't the dfs.safemode.threshold.pct 1 by default? When dfs.replication.min=1 with dfs.safemode.threshold.pct=0.999, there might be chances for a NameNode to check in with incomplete data in its file system. Am I right? Is it permissible? Or is it assuming that replication would be always more than 1? Thanks, Manhee
Re: A question on dfs.safemode.threshold.pct
If I remember correctly, Having dfs.safemode.threshold.pct = 1 may lead to a problem that the Namenode is not leaving safemode because of floating point round off errors. Having dfs.safemode.threshold.pct > 1 means that Namenode can never exit safemode since it is not achievable. Nicholas Sze - Original Message > From: Raghu Angadi > To: common-user@hadoop.apache.org > Sent: Tuesday, October 6, 2009 6:03:52 PM > Subject: Re: A question on dfs.safemode.threshold.pct > > I am not sure what the real concern is... You can set it to 1.0 (or even 1.1 > :)) if you prefer. Many admins do. > > Raghu. > > On Tue, Oct 6, 2009 at 5:20 PM, Manhee Jo wrote: > > > Thank you, Raghu. > > Then, when the percentage is below 0.999, how can you tell > > if some datanodes are just slower than others or some of the data blocks > > are lost? > > I think "percentage 1" should have speacial meaning like > > it guarantees integrity of data in HDFS. > > If it's below 1, then the integrity is not said to be guaranteed. > > > > Or are there any other useful means that a NameNode can fix the lost > > blocks, > > so that it doesn't care even 0.1% of data is lost? > > > > > > Thanks, > > Manhee > > > > - Original Message - From: "Raghu Angadi" > > To: > > Sent: Wednesday, October 07, 2009 1:26 AM > > Subject: Re: A question on dfs.safemode.threshold.pct > > > > > > > > Yes, it is mostly geared towards replication greater than 1. One of the > >> reasons for waiting for this threshold is to avoid HDFS starting > >> unnecessary > >> replications of blocks at the start up when some of the datanodes are > >> slower > >> to start up. > >> > >> When the replication is 1, you don't have that issue. A block either > >> exists > >> or does not. > >> > >> Raghu > >> 2009/10/5 Manhee Jo > >> > >> Hi all, > >>> > >>> Why isn't the dfs.safemode.threshold.pct 1 by default? > >>> When dfs.replication.min=1 with dfs.safemode.threshold.pct=0.999, > >>> there might be chances for a NameNode to check in with incomplete data > >>> in its file system. Am I right? Is it permissible? Or is it assuming that > >>> replication would be always more than 1? > >>> > >>> > >>> Thanks, > >>> Manhee > >>> > >> > >> > > > >
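As a rough illustration of the arithmetic behind Nicholas's point (this is not the actual NameNode code, which may compute the target differently): if the required block count is derived from a float threshold, a value above 1.0 can never be met, and exactly 1.0 can round the target past the real total on a large namespace.

    public class SafeModeThresholdDemo {
      // Hypothetical version of the check: how many reported blocks are needed
      // before safe mode can be left, given a float threshold.
      static long neededBlocks(long blockTotal, float thresholdPct) {
        return (long) (blockTotal * thresholdPct);  // float multiply, then truncate
      }

      public static void main(String[] args) {
        long blockTotal = 16777219L;  // larger than 2^24, not exactly representable as a float
        System.out.println(neededBlocks(blockTotal, 0.999f)); // comfortably below the total
        System.out.println(neededBlocks(blockTotal, 1.0f));   // 16777220: one more block than exists
        System.out.println(neededBlocks(blockTotal, 1.1f));   // far above the total: never reachable
      }
    }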
Re: A question on dfs.safemode.threshold.pct
I am not sure what the real concern is... You can set it to 1.0 (or even 1.1 :)) if you prefer. Many admins do. Raghu. On Tue, Oct 6, 2009 at 5:20 PM, Manhee Jo wrote: > Thank you, Raghu. > Then, when the percentage is below 0.999, how can you tell > if some datanodes are just slower than others or some of the data blocks > are lost? > I think "percentage 1" should have speacial meaning like > it guarantees integrity of data in HDFS. > If it's below 1, then the integrity is not said to be guaranteed. > > Or are there any other useful means that a NameNode can fix the lost > blocks, > so that it doesn't care even 0.1% of data is lost? > > > Thanks, > Manhee > > - Original Message - From: "Raghu Angadi" > To: > Sent: Wednesday, October 07, 2009 1:26 AM > Subject: Re: A question on dfs.safemode.threshold.pct > > > > Yes, it is mostly geared towards replication greater than 1. One of the >> reasons for waiting for this threshold is to avoid HDFS starting >> unnecessary >> replications of blocks at the start up when some of the datanodes are >> slower >> to start up. >> >> When the replication is 1, you don't have that issue. A block either >> exists >> or does not. >> >> Raghu >> 2009/10/5 Manhee Jo >> >> Hi all, >>> >>> Why isn't the dfs.safemode.threshold.pct 1 by default? >>> When dfs.replication.min=1 with dfs.safemode.threshold.pct=0.999, >>> there might be chances for a NameNode to check in with incomplete data >>> in its file system. Am I right? Is it permissible? Or is it assuming that >>> replication would be always more than 1? >>> >>> >>> Thanks, >>> Manhee >>> >> >> > >
Re: Having multiple values in Value field
Thanks Tom. The link really was helpful. the CSVs were getting nasty to handle. On Tue, Oct 6, 2009 at 12:27 PM, Tom Chen wrote: > Hi Akshaya, > > Take a look at the yahoo hadoop tutorial for custom data types. > > http://developer.yahoo.com/hadoop/tutorial/module5.html#types > > It's actually quite easy to create your own types and stream them. You can > use the > > void readFields(DataInput in); > void write(DataOutput out); > > methods to initialize your objects. > > Tom > > > On Tue, Oct 6, 2009 at 12:52 AM, Amogh Vasekar > wrote: > >>> You can always pass them as comma delimited strings > > Which would be pretty expensive per right? Would avro be looking > into solving such problems? > > > > Amogh > > > > -Original Message- > > From: Jason Venner [mailto:jason.had...@gmail.com] > > Sent: Tuesday, October 06, 2009 11:33 AM > > To: common-user@hadoop.apache.org > > Subject: Re: Having multiple values in Value field > > > > You can always pass them as comma delimited strings, which is what you > are > > already doing with your python streaming code, and then use Text as your > > value. > > > > On Mon, Oct 5, 2009 at 10:54 PM, akshaya iyengar > > wrote: > > > >> I am having issues having multiple values in my value field.My desired > >> result is > >> , or even ,. > >> > >> It seems easy in Python where I can pass a tuple as value.What is the > best > >> way to do this in Java. > >> I have tried ArrayWritable but it looks like I need to write my own > class > >> like IntArrayWritable. > >> > >> Thanks, > >> Akshaya > >> > > > > > > > > -- > > Pro Hadoop, a book to guide you from beginner to hadoop mastery, > > http://www.amazon.com/dp/1430219424?tag=jewlerymall > > www.prohadoopbook.com a community for Hadoop Professionals > > >
Reading a block of data in Map function
I am wondering how to read a block of data in the map function. I have a file with a single number on every line, and I wish to calculate some statistics. Once the file is divided into blocks and sent to different nodes by Hadoop, is it possible to read a chunk of the data in each map function? Right now each map call is reading one number at a time. Thanks, Akshaya
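One common way to get this effect without reading the split manually: keep running totals across map() calls and emit a single partial aggregate per task in cleanup(). This is a sketch against the 0.20 mapreduce API; the "stats" output key and the count/sum/sum-of-squares fields are arbitrary choices for illustration.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PartialStatsMapper extends Mapper<LongWritable, Text, Text, Text> {
      private long count = 0;
      private double sum = 0;
      private double sumSquares = 0;

      @Override
      protected void map(LongWritable offset, Text line, Context context) {
        // Each call still sees one line, but nothing is emitted here.
        double v = Double.parseDouble(line.toString().trim());
        count++;
        sum += v;
        sumSquares += v * v;
      }

      @Override
      protected void cleanup(Context context) throws IOException, InterruptedException {
        // One record per map task; a single reducer combines the partial aggregates.
        context.write(new Text("stats"), new Text(count + "," + sum + "," + sumSquares));
      }
    }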
Re: FileSystem Caching in Hadoop
On Tue, Oct 6, 2009 at 6:12 PM, Aaron Kimball wrote: > Edward, > > Interesting concept. I imagine that implementing "CachedInputFormat" over > something like memcached would make for the most straightforward > implementation. You could store 64MB chunks in memcached and try to retrieve > them from there, falling back to the filesystem on failure. One obvious > potential drawback of this is that a memcached cluster might store those > blocks on different servers than the file chunks themselves, leading to an > increased number of network transfers during the mapping phase. I don't know > if it's possible to "pin" the objects in memcached to particular nodes; > you'd want to do this for mapper locality reasons. > > I would say, though, that 1 GB out of 8 GB on a datanode is somewhat > ambitious. It's been my observation that people tend to write memory-hungry > mappers. If you've got 8 cores in a node, and 1 GB each have already gone to > the OS, the datanode, and the tasktracker, that leaves only 5 GB for task > processes. Running 6 or 8 map tasks concurrently can easily gobble that up. > On a 16 GB datanode with 8 cores, you might get that much wiggle room > though. > > - Aaron > > > On Tue, Oct 6, 2009 at 8:16 AM, Edward Capriolo wrote: > >> After looking at the HBaseRegionServer and its functionality, I began >> wondering if there is a more general use case for memory caching of >> HDFS blocks/files. In many use cases people wish to store data on >> Hadoop indefinitely, however the last day,last week, last month, data >> is probably the most actively used. For some Hadoop clusters the >> amount of raw new data could be less then the RAM memory in the >> cluster. >> >> Also some data will be used repeatedly, the same source data may be >> used to generate multiple result sets, and those results may be used >> as the input to other processes. >> >> I am thinking an answer could be to dedicate an amount of physical >> memory on each DataNode, or on several dedicated node to a distributed >> memcache like layer. Managing this cache should be straight forward >> since hadoop blocks are pretty much static. (So say for a DataNode >> with 8 GB of memory dedicate 1GB to HadoopCacheServer.) If you had >> 1000 Nodes that cache would be quite large. >> >> Additionally we could create a new file system type cachedhdfs >> implemented as a facade, or possibly implement CachedInputFormat or >> CachedOutputFormat. >> >> I know that the underlying filesystems have cache, but I think Hadoop >> writing intermediate data is going to evict some of the data which >> "should be" semi-permanent. >> >> So has anyone looked into something like this? This was the closest >> thing I found. >> >> http://issues.apache.org/jira/browse/HADOOP-288 >> >> My goal here is to keep recent data in memory so that tools like Hive >> can get a big boost on queries for new data. >> >> Does anyone have any ideas? >> >

Aaron, Yes, 1 GB out of 8 GB was just an arbitrary value I chose. Remember that 16K of RAM did get a man to the moon. :) I am thinking the value would be configurable, say dfs.cache.mb. Also, there are the details of cache eviction, or possibly of including and excluding paths and files. Other than the InputFormat concept, we could plug the cache directly into the DFSClient. In this way the cache would always end up on the node where the data was. Otherwise the InputFormat will have to manage that, which would be a lot of work. I think if we prove the concept we can then follow up and get it more optimized.
I am poking around the Hadoop internals to see what options we have. For my first implementation I will probably patch some code, run some tests, and profile performance.
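For illustration, a minimal sketch of the cache sizing and eviction Edward mentions. dfs.cache.mb is the property name proposed in this thread, not an existing Hadoop setting, the string block keys are an assumption, and the eviction policy here is a plain LRU.

    import java.util.Iterator;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;

    public class BlockCache {
      private final long capacityBytes;
      private long usedBytes = 0;
      // Access-ordered map gives least-recently-used iteration order.
      private final LinkedHashMap<String, byte[]> lru =
          new LinkedHashMap<String, byte[]>(16, 0.75f, true);

      public BlockCache(Configuration conf) {
        // Default to 1 GB, matching the 1-GB-of-8-GB example in this thread.
        this.capacityBytes = conf.getLong("dfs.cache.mb", 1024) * 1024L * 1024L;
      }

      public synchronized byte[] get(String blockKey) {
        return lru.get(blockKey);
      }

      public synchronized void put(String blockKey, byte[] data) {
        byte[] old = lru.put(blockKey, data);
        if (old != null) {
          usedBytes -= old.length;
        }
        usedBytes += data.length;
        // Evict least-recently-used blocks until we are back under dfs.cache.mb.
        Iterator<Map.Entry<String, byte[]>> it = lru.entrySet().iterator();
        while (usedBytes > capacityBytes && it.hasNext()) {
          Map.Entry<String, byte[]> eldest = it.next();
          usedBytes -= eldest.getValue().length;
          it.remove();
        }
      }
    }

Whether something like this lives inside the DFSClient or behind an InputFormat, the accounting would look roughly the same; only the lookup key and the population path change.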
Re: A question on dfs.safemode.threshold.pct
Thank you, Raghu. Then, when the percentage is below 0.999, how can you tell if some datanodes are just slower than others or some of the data blocks are lost? I think "percentage 1" should have special meaning, in that it guarantees the integrity of the data in HDFS. If it's below 1, that integrity cannot be guaranteed. Or does the NameNode have some other useful means of fixing the lost blocks, so that it doesn't matter even if 0.1% of the data is lost?

Thanks, Manhee

- Original Message - From: "Raghu Angadi" To: Sent: Wednesday, October 07, 2009 1:26 AM Subject: Re: A question on dfs.safemode.threshold.pct Yes, it is mostly geared towards replication greater than 1. One of the reasons for waiting for this threshold is to avoid HDFS starting unnecessary replications of blocks at the start up when some of the datanodes are slower to start up. When the replication is 1, you don't have that issue. A block either exists or does not. Raghu 2009/10/5 Manhee Jo Hi all, Why isn't the dfs.safemode.threshold.pct 1 by default? When dfs.replication.min=1 with dfs.safemode.threshold.pct=0.999, there might be chances for a NameNode to check in with incomplete data in its file system. Am I right? Is it permissible? Or is it assuming that replication would be always more than 1? Thanks, Manhee
Re: Locality when placing Map tasks
Map tasks are generated based on InputSplits. An InputSplit is a logical description of the work that a task should process. The array of InputSplit objects is created on the client by the InputFormat. org.apache.hadoop.mapreduce.InputSplit has an abstract method:

    /**
     * Get the list of nodes by name where the data for the split would be local.
     * The locations do not need to be serialized.
     * @return a new array of the node names.
     * @throws IOException
     * @throws InterruptedException
     */
    public abstract String[] getLocations() throws IOException, InterruptedException;

So the InputFormat needs to do something when it's creating its list of work items to hint where these should go. If you take a look at FileInputFormat, you can see how it does this based on stat'ing the files, determining the block locations for each one, and using those as node hints. Other InputFormats may ignore this entirely, in which case there is no locality.

The scheduler itself then does its "best job" of lining up tasks to nodes, but it's usually pretty naive. Basically, tasktrackers send heartbeats back to the JT wherein they may request another task. The scheduler then responds with a task. If there's a local task available, it'll send that one. If not, it'll send a non-local task instead.

- Aaron

On Fri, Oct 2, 2009 at 12:24 PM, Esteban Molina-Estolano < eesto...@cs.ucsc.edu> wrote: > Hi, > > I'm running Hadoop 0.19.1 on 19 nodes. I've been benchmarking a Hadoop > workload with 115 Map tasks, on two different distributed filesystems (KFS > and PVFS); in some tests, I also have a write-intensive non-Hadoop job > running in the background (an HPC checkpointing benchmark). I've found that > Hadoop sometimes makes most of the Map tasks data-local, and sometimes makes > none of the Map tasks data-local; this depends both on which filesystem I > use, and on whether the background task is running. (I never run multiple > Hadoop jobs concurrently in these tests.) > > I'd like to learn how the Hadoop scheduler places Map tasks, and how > locality is taken into account, so I can figure out why this is happening. > (I'm using the default FIFO scheduler.) Is there some documentation > available that would explain this? > > Thanks! >
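For illustration, a bare-bones custom split showing where that locality hint lives. The host list would normally be filled in by the InputFormat (for example from FileSystem.getFileBlockLocations, as FileInputFormat does); everything else here is a stub, and the class name is made up.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.InputSplit;

    public class HintedSplit extends InputSplit implements Writable {
      private long length = 0;
      private String[] hosts = new String[0];

      public HintedSplit() {}                       // needed for deserialization
      public HintedSplit(long length, String[] hosts) {
        this.length = length;
        this.hosts = hosts;
      }

      @Override
      public long getLength() { return length; }

      @Override
      public String[] getLocations() {              // the scheduler's locality hint
        return hosts;
      }

      // As the javadoc above says, the locations are not serialized: only the
      // client-side InputFormat and the scheduler ever look at them.
      public void write(DataOutput out) throws IOException { out.writeLong(length); }
      public void readFields(DataInput in) throws IOException { length = in.readLong(); }
    }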
Re: FileSystem Caching in Hadoop
Edward, Interesting concept. I imagine that implementing "CachedInputFormat" over something like memcached would make for the most straightforward implementation. You could store 64MB chunks in memcached and try to retrieve them from there, falling back to the filesystem on failure. One obvious potential drawback of this is that a memcached cluster might store those blocks on different servers than the file chunks themselves, leading to an increased number of network transfers during the mapping phase. I don't know if it's possible to "pin" the objects in memcached to particular nodes; you'd want to do this for mapper locality reasons. I would say, though, that 1 GB out of 8 GB on a datanode is somewhat ambitious. It's been my observation that people tend to write memory-hungry mappers. If you've got 8 cores in a node, and 1 GB each have already gone to the OS, the datanode, and the tasktracker, that leaves only 5 GB for task processes. Running 6 or 8 map tasks concurrently can easily gobble that up. On a 16 GB datanode with 8 cores, you might get that much wiggle room though. - Aaron On Tue, Oct 6, 2009 at 8:16 AM, Edward Capriolo wrote: > After looking at the HBaseRegionServer and its functionality, I began > wondering if there is a more general use case for memory caching of > HDFS blocks/files. In many use cases people wish to store data on > Hadoop indefinitely, however the last day,last week, last month, data > is probably the most actively used. For some Hadoop clusters the > amount of raw new data could be less then the RAM memory in the > cluster. > > Also some data will be used repeatedly, the same source data may be > used to generate multiple result sets, and those results may be used > as the input to other processes. > > I am thinking an answer could be to dedicate an amount of physical > memory on each DataNode, or on several dedicated node to a distributed > memcache like layer. Managing this cache should be straight forward > since hadoop blocks are pretty much static. (So say for a DataNode > with 8 GB of memory dedicate 1GB to HadoopCacheServer.) If you had > 1000 Nodes that cache would be quite large. > > Additionally we could create a new file system type cachedhdfs > implemented as a facade, or possibly implement CachedInputFormat or > CachedOutputFormat. > > I know that the underlying filesystems have cache, but I think Hadoop > writing intermediate data is going to evict some of the data which > "should be" semi-permanent. > > So has anyone looked into something like this? This was the closest > thing I found. > > http://issues.apache.org/jira/browse/HADOOP-288 > > My goal here is to keep recent data in memory so that tools like Hive > can get a big boost on queries for new data. > > Does anyone have any ideas? >
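A minimal sketch of the fetch-with-fallback idea described above. BlockCache here stands in for a thin memcached-style client with get/put methods (it could be the LRU BlockCache sketched earlier in this thread); it is not an existing Hadoop or memcached API, and keying 64 MB chunks by path plus chunk index is just one possible scheme.

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CachedChunkFetcher {
      private static final int CHUNK_SIZE = 64 * 1024 * 1024;   // one cached chunk
      private final FileSystem fs;
      private final BlockCache cache;   // hypothetical: byte[] get(String), void put(String, byte[])

      public CachedChunkFetcher(FileSystem fs, BlockCache cache) {
        this.fs = fs;
        this.cache = cache;
      }

      public byte[] fetch(Path file, long chunkIndex) throws IOException {
        String key = file.toString() + "#" + chunkIndex;
        byte[] chunk = cache.get(key);              // try the cache first
        if (chunk != null) {
          return chunk;
        }
        long offset = chunkIndex * (long) CHUNK_SIZE;
        long fileLen = fs.getFileStatus(file).getLen();
        chunk = new byte[(int) Math.min(CHUNK_SIZE, fileLen - offset)];
        FSDataInputStream in = fs.open(file);
        try {
          in.readFully(offset, chunk);              // fall back to a positioned HDFS read
        } finally {
          in.close();
        }
        cache.put(key, chunk);                      // populate on the way out
        return chunk;
      }
    }

As Aaron notes, nothing here pins a chunk to the node that actually holds the HDFS block, so a cache hit may still be a network hop.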
RE: Custom Record Reader Example?
I found the DBRecordReader a good example - it's in o.a.h.m.lib.db.DBInputFormat -Original Message- From: Mark Vigeant [mailto:mark.vige...@riskmetrics.com] Sent: Tuesday, October 06, 2009 5:22 PM To: common-user@hadoop.apache.org Subject: Custom Record Reader Example? Hey- I'm trying to update a custom recordreader written for 0.18.3 and was wondering if either A) Anyone has any example code for extending RecordReader in 0.20.1 (in the mapreduce package, not the mapred interface)? or B) Anyone can give me tips on how to write getCurrentKey() and getCurrentValue()? Thank you so much! Mark Vigeant RiskMetrics Group, Inc.
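For a skeleton of the pattern, here is a hedged sketch against the 0.20.1 org.apache.hadoop.mapreduce API: nextKeyValue() advances and caches the current pair, and getCurrentKey()/getCurrentValue() simply return what was cached. This one just delegates to LineRecordReader to stay short; a real reader would do its own parsing inside nextKeyValue(), and the class name is made up.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    public class MyRecordReader extends RecordReader<LongWritable, Text> {
      private final LineRecordReader lines = new LineRecordReader();
      private LongWritable key;
      private Text value;

      @Override
      public void initialize(InputSplit split, TaskAttemptContext context)
          throws IOException, InterruptedException {
        lines.initialize(split, context);
      }

      @Override
      public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!lines.nextKeyValue()) {
          return false;                 // no more records in this split
        }
        key = lines.getCurrentKey();    // cache the pair for the getters below
        value = lines.getCurrentValue();
        return true;
      }

      @Override
      public LongWritable getCurrentKey() { return key; }

      @Override
      public Text getCurrentValue() { return value; }

      @Override
      public float getProgress() throws IOException {
        return lines.getProgress();
      }

      @Override
      public void close() throws IOException {
        lines.close();
      }
    }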
Custom Record Reader Example?
Hey- I'm trying to update a custom RecordReader written for 0.18.3 and was wondering whether either A) anyone has any example code for extending RecordReader in 0.20.1 (in the mapreduce package, not the mapred interface), or B) anyone can give me tips on how to write getCurrentKey() and getCurrentValue()? Thank you so much! Mark Vigeant RiskMetrics Group, Inc.
state of the art WebDAV + HDFS
Hi all, What would you consider the state of the art for WebDAV integration with HDFS? I'm having trouble discerning the functionality that aligns with each patch on HDFS-225 (https://issues.apache.org/jira/browse/HDFS-225). I've read that some patches do not support write operations, but I'm not sure whether that is true. If anyone could give a patch recommendation I would greatly appreciate it. Thanks! Brien
First Boston Hadoop Meetup, Wed Oct 28th
'lo all, We're starting a Boston Hadoop Meetup (finally ;-) -- first meeting will be on Wednesday, October 28th, 7 pm, at the HubSpot offices: http://www.meetup.com/bostonhadoop/ (HubSpot is at 1 Broadway, Cambridge on the fifth floor. There Will Be Food.) I'm stealing the organizing plan from the nice people at the Seattle meetup, to wit: we'll aim to have 2 c. 20 minute presentations, with plenty of time for Q&A after each, and then a few 5-minute lightning talks. Also, the eating and the chatting, and possibly the playing of ping pong. Please feel free to contact me if you've got an idea for a talk of any length, on Hadoop, Hive, Pig, Hbase, etc. -Dan Milstein 617-401-2855 dmilst...@hubspot.com http://dev.hubspot.com/
Re: Having multiple values in Value field
Hi Akshaya, Take a look at the yahoo hadoop tutorial for custom data types. http://developer.yahoo.com/hadoop/tutorial/module5.html#types It's actually quite easy to create your own types and stream them. You can use the void readFields(DataInput in); void write(DataOutput out); methods to initialize your objects. Tom On Tue, Oct 6, 2009 at 12:52 AM, Amogh Vasekar wrote: >>> You can always pass them as comma delimited strings > Which would be pretty expensive per right? Would avro be looking into solving such problems? > > Amogh > > -Original Message- > From: Jason Venner [mailto:jason.had...@gmail.com] > Sent: Tuesday, October 06, 2009 11:33 AM > To: common-user@hadoop.apache.org > Subject: Re: Having multiple values in Value field > > You can always pass them as comma delimited strings, which is what you are > already doing with your python streaming code, and then use Text as your > value. > > On Mon, Oct 5, 2009 at 10:54 PM, akshaya iyengar > wrote: > >> I am having issues having multiple values in my value field.My desired >> result is >> , or even ,. >> >> It seems easy in Python where I can pass a tuple as value.What is the best >> way to do this in Java. >> I have tried ArrayWritable but it looks like I need to write my own class >> like IntArrayWritable. >> >> Thanks, >> Akshaya >> > > > > -- > Pro Hadoop, a book to guide you from beginner to hadoop mastery, > http://www.amazon.com/dp/1430219424?tag=jewlerymall > www.prohadoopbook.com a community for Hadoop Professionals >
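A small, self-contained example of the readFields()/write() pattern from that tutorial — a pair of ints carried as a single value. The class and field names are arbitrary; if it were used as a key it would need to implement WritableComparable instead.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class IntPairWritable implements Writable {
      private int first;
      private int second;

      public IntPairWritable() {}                        // required by the framework
      public IntPairWritable(int first, int second) {
        this.first = first;
        this.second = second;
      }

      public void write(DataOutput out) throws IOException {
        out.writeInt(first);                             // serialization order here...
        out.writeInt(second);
      }

      public void readFields(DataInput in) throws IOException {
        first = in.readInt();                            // ...must match the order here
        second = in.readInt();
      }

      public int getFirst() { return first; }
      public int getSecond() { return second; }
    }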
Re: A question on dfs.safemode.threshold.pct
Yes, it is mostly geared towards replication greater than 1. One of the reasons for waiting for this threshold is to avoid HDFS starting unnecessary replications of blocks at startup, when some of the datanodes are slower to come up. When the replication is 1, you don't have that issue. A block either exists or does not.

Raghu

2009/10/5 Manhee Jo > Hi all, > > Why isn't the dfs.safemode.threshold.pct 1 by default? > When dfs.replication.min=1 with dfs.safemode.threshold.pct=0.999, > there might be chances for a NameNode to check in with incomplete data > in its file system. Am I right? Is it permissible? Or is it assuming that > replication would be always more than 1? > > > Thanks, > Manhee
FileSystem Caching in Hadoop
After looking at the HBaseRegionServer and its functionality, I began wondering if there is a more general use case for memory caching of HDFS blocks/files. In many use cases people wish to store data on Hadoop indefinitely; however, the last day's, last week's, or last month's data is probably the most actively used. For some Hadoop clusters the amount of raw new data could be less than the RAM in the cluster.

Also, some data will be used repeatedly: the same source data may be used to generate multiple result sets, and those results may be used as the input to other processes.

I am thinking an answer could be to dedicate an amount of physical memory on each DataNode, or on several dedicated nodes, to a distributed memcache-like layer. Managing this cache should be straightforward since Hadoop blocks are pretty much static. (So, say, for a DataNode with 8 GB of memory, dedicate 1 GB to a HadoopCacheServer.) If you had 1000 nodes, that cache would be quite large.

Additionally, we could create a new filesystem type, cachedhdfs, implemented as a facade, or possibly implement CachedInputFormat or CachedOutputFormat.

I know that the underlying filesystems have a cache, but I think Hadoop writing intermediate data is going to evict some of the data which "should be" semi-permanent.

So has anyone looked into something like this? This was the closest thing I found:

http://issues.apache.org/jira/browse/HADOOP-288

My goal here is to keep recent data in memory so that tools like Hive can get a big boost on queries for new data.

Does anyone have any ideas?
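For illustration of the "facade" idea, here is a hedged sketch of a cachedhdfs-style wrapper: it consults a cache on open() and falls back to the wrapped filesystem on a miss. BlockCache is the hypothetical cache sketched earlier in this thread, not an existing Hadoop API; a real implementation would also need to decide what to admit into the cache and how to register the filesystem scheme. Only the in-memory stream plumbing below uses real Hadoop interfaces.

    import java.io.ByteArrayInputStream;
    import java.io.EOFException;
    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FilterFileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PositionedReadable;
    import org.apache.hadoop.fs.Seekable;

    public class CachedFileSystem extends FilterFileSystem {
      private final BlockCache cache;   // hypothetical: byte[] get(String), void put(String, byte[])

      public CachedFileSystem(FileSystem underlying, BlockCache cache) {
        super(underlying);
        this.cache = cache;
      }

      @Override
      public FSDataInputStream open(Path f, int bufferSize) throws IOException {
        byte[] cached = cache.get(f.toString());
        if (cached != null) {
          return new FSDataInputStream(new InMemoryStream(cached));  // cache hit
        }
        return fs.open(f, bufferSize);   // miss: read straight from the wrapped HDFS
      }

      // FSDataInputStream requires a stream that is Seekable and PositionedReadable.
      private static class InMemoryStream extends ByteArrayInputStream
          implements Seekable, PositionedReadable {
        InMemoryStream(byte[] data) { super(data); }
        public void seek(long p) { pos = (int) p; }
        public long getPos() { return pos; }
        public boolean seekToNewSource(long targetPos) { return false; }
        public int read(long position, byte[] buffer, int offset, int length) {
          int n = Math.min(length, count - (int) position);
          if (n <= 0) return -1;
          System.arraycopy(buf, (int) position, buffer, offset, n);
          return n;
        }
        public void readFully(long position, byte[] buffer, int offset, int length)
            throws IOException {
          if (read(position, buffer, offset, length) < length) throw new EOFException();
        }
        public void readFully(long position, byte[] buffer) throws IOException {
          readFully(position, buffer, 0, buffer.length);
        }
      }
    }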
RE: Having multiple values in Value field
>> You can always pass them as comma delimited strings Which would be pretty expensive per record, right? Would Avro be looking into solving such problems?

Amogh

-Original Message- From: Jason Venner [mailto:jason.had...@gmail.com] Sent: Tuesday, October 06, 2009 11:33 AM To: common-user@hadoop.apache.org Subject: Re: Having multiple values in Value field You can always pass them as comma delimited strings, which is what you are already doing with your python streaming code, and then use Text as your value. On Mon, Oct 5, 2009 at 10:54 PM, akshaya iyengar wrote: > I am having issues having multiple values in my value field. My desired > result is > , or even ,. > > It seems easy in Python where I can pass a tuple as value. What is the best > way to do this in Java. > I have tried ArrayWritable but it looks like I need to write my own class > like IntArrayWritable. > > Thanks, > Akshaya > -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
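A toy round trip of the comma-delimited approach under discussion: pack the fields into one Text value, then split and parse them back out on the other side. The field names are made up; this per-record formatting and parsing is exactly the overhead being asked about, which a custom Writable (or an Avro record) avoids.

    import org.apache.hadoop.io.Text;

    public class CommaPackedValueDemo {
      public static void main(String[] args) {
        long clicks = 42, bytes = 1024;

        // Map side: pack both fields into a single Text value.
        Text packed = new Text(clicks + "," + bytes);

        // Reduce side: split and parse them back out, once per record.
        String[] fields = packed.toString().split(",");
        System.out.println("clicks=" + Long.parseLong(fields[0])
            + " bytes=" + Long.parseLong(fields[1]));
      }
    }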