Hadoop as Cloud Storage
Dear Hadoop Gurus,

After googling and finding some information on using Hadoop as long-term cloud storage, I have a problem maintaining lots of data (around 50 TB), much of it TV commercials (video files). I know the best solution for long-term file archiving is tape backup, but I am just curious: can Hadoop be used as a 'data archiving' platform?

Thanks!

Warm Regards,
Wildan

---
OpenThink Labs
http://openthink-labs.tobethink.com/

Making IT, Business and Education in Harmony

>> 087884599249
Y! : hawking_123
LinkedIn : http://www.linkedin.com/in/wildanmaulana
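One thing worth keeping in mind for the 50 TB question above: HDFS replicates every block (3x by default), so the raw disk you need is a multiple of the logical archive size. A back-of-envelope sketch (the replication factor and headroom figure are illustrative assumptions, not from the thread):

```java
// Rough HDFS capacity planning for a 50 TB archive.
// dfs.replication defaults to 3, so raw disk ~= 3x logical data,
// before any headroom for temp space and growth.
public class ArchiveSizing {
    public static void main(String[] args) {
        double dataTb = 50.0;   // logical archive size from the question
        int replication = 3;    // HDFS default replication factor
        System.out.println(dataTb * replication); // raw TB needed
    }
}
```

So 50 TB of video would occupy roughly 150 TB of raw cluster disk at default replication.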
Re: ANN: Hadoop UI beta
+1 wow... looks fantastic :)

The summary says it works only with 0.19. Just curious, does it also work with Hadoop trunk?

Thanks!

Best Regards,
Wildan

---
OpenThink Labs
www.tobethink.com

Aligning IT and Education

>> 021-99325243
Y! : hawking_123
LinkedIn : http://www.linkedin.com/in/wildanmaulana

On Tue, Mar 31, 2009 at 6:11 PM, Stefan Podkowinski wrote:
> Hello,
>
> I'd like to invite you to take a look at the recently released first
> beta of Hadoop UI, a graphical Flex/Java based client for Hadoop Core.
> Hadoop UI currently includes an HDFS file explorer and basic job
> tracking features.
>
> Get it here:
> http://code.google.com/p/hadoop-ui/
>
> As this is the first release it may (and does) still contain bugs, but
> I'd like to give everyone the chance to send feedback as early as
> possible.
> Give it a try :)
>
> - Stefan
Re: hadoop-a small doubt
I have already tried mountable HDFS, both the WebDAV and FUSE approaches; it seems neither is production ready. CMIIW.

Best Regards,
Wildan

---
OpenThink Labs
www.tobethink.com

Aligning IT and Education

>> 021-99325243
Y! : hawking_123
LinkedIn : http://www.linkedin.com/in/wildanmaulana

On Sun, Mar 29, 2009 at 2:52 PM, Sagar Naik wrote:
> Yes, you can.
> Java client:
> Copy the conf dir (the same as the one on the namenode/datanode); the
> hadoop jars should be in the classpath of the client.
> Non-Java client:
> http://wiki.apache.org/hadoop/MountableHDFS
>
> -Sagar
>
> deepya wrote:
>>
>> Hi,
>> I am SreeDeepya, doing an MTech at IIIT. I am working on a project named
>> cost-effective and scalable storage server. I configured a small Hadoop
>> cluster with only two nodes, one namenode and one datanode. I am new to
>> Hadoop, and I have a small doubt.
>>
>> Can a system not in the Hadoop cluster access the namenode or the
>> datanode? If yes, then can you please tell me the necessary
>> configurations that have to be done?
>>
>> Thanks in advance.
>>
>> SreeDeepya
Re: hadoop migration
Thanks for the quick response, Aman.

OK, I see the point now. Currently I'm doing some research on creating a Google Books-like application using HBase as a backend for storing the files and Solr as the indexer. From this prototype, maybe I can measure how fast HBase is at serving data to clients... (Google uses Bigtable for books.google.com, right?)

Thanks!

Regards,
Wildan

On Tue, Mar 17, 2009 at 12:13 PM, Amandeep Khurana wrote:
> Hypertable is not as mature as HBase yet. The next release of HBase, 0.20.0,
> includes some patches which reduce the latency of responses and make it
> suitable to be used as a backend for a webapp. However, the current release
> isn't optimized for this purpose.
>
> The idea behind Hadoop and the rest of the tools around it is more of a data
> processing system than a backend datastore for a website. The output of the
> processing that Hadoop does is typically loaded into a MySQL cluster which
> feeds a website.

--
---
OpenThink Labs
www.tobethink.com

Aligning IT and Education

>> 021-99325243
Y! : hawking_123
LinkedIn : http://www.linkedin.com/in/wildanmaulana
Re: hadoop migration
> Of course, there is a storage solution called HBase for Hadoop. But,
> in my experience, it is not applicable for online data access yet.

I see... How about Hypertable? Is it mature enough to be used in production? I read that Hypertable can be integrated with Hadoop. Or is there any other alternative besides HBase?

Thanks!

Regards,
Wildan

--
---
OpenThink Labs
www.tobethink.com

Aligning IT and Education

>> 021-99325243
Y! : hawking_123
LinkedIn : http://www.linkedin.com/in/wildanmaulana
single node Hbase
Hello, Are there any Hadoop documentation resources showing how to run the current version of Hbase on a single node? Thanks, Peter W.
Lucene reduce
Hello,

For those interested, you can filter and search Lucene documents in the reduce.

code:

import java.io.*;
import java.util.*;

import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.Hits;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.*;

public class ql
{
  /**
   * Query Lucene using keys.
   *
   * input:
   *   java^this page is about java
   *   ruby^site only mentions rails
   *   php^another resource about php
   *   java^ejb3 discussed and spring
   *   eof^eof
   *
   * make docs, search, mapreduce
   *
   * output:
   *   php^topic^another resource about php
   *   java^topic^this page is about java
   ***/

  public static class M extends MapReduceBase implements Mapper
  {
    HashMap hm=new HashMap();
    Map group_m=Collections.synchronizedMap(hm);
    String ITEM_KEY,BATCH_KEY="";
    int batch=0;

    public void map(WritableComparable wc,Writable w,
      OutputCollector out,Reporter rep) throws IOException
    {
      String ln=((Text)w).toString();
      String[] parse_a=ln.split("\\^");

      if(batch>(100-1)) // new lucene document group
      {
        out.collect(new Text(BATCH_KEY),new BytesWritable(ob(group_m)));
        BATCH_KEY="BATCH_"+key_maker(String.valueOf(batch));
        batch=0;
        group_m.clear();
      }
      else if(parse_a[0].equals("eof")) // flush the last partial group
        out.collect(new Text(BATCH_KEY),new BytesWritable(ob(group_m)));

      ITEM_KEY="ITEM_"+key_maker(parse_a[0]);
      Document single_d=make_lucene_doc(parse_a[0],parse_a[1],ITEM_KEY);
      group_m.put(ITEM_KEY,single_d);
      batch++;
    }
  }

  public static class R extends MapReduceBase implements Reducer
  {
    public void reduce(WritableComparable wc,Iterator it,
      OutputCollector out,Reporter rep) throws IOException
    {
      while(it.hasNext())
      {
        try
        {
          // bo() deserializes the byte[] back into the document map
          Map m=(Map)bo(((BytesWritable)it.next()).get());
          if(m instanceof Map)
          {
            try
            {
              // build temp index
              Directory rd=new RAMDirectory();
              Analyzer sa=new StandardAnalyzer();
              IndexWriter iw=new IndexWriter(rd,sa,true);

              // unwrap, cast, send to mem
              List keys=new ArrayList(m.keySet());
              Iterator itr_u=keys.iterator();
              while(itr_u.hasNext())
              {
                Object k_u=itr_u.next();
                Document dtmp=(Document)m.get(k_u);
                iw.addDocument(dtmp);
              }
              iw.optimize();
              iw.close();

              Searcher is=new IndexSearcher(rd);

              // simple doc filter
              Iterator itr_s=keys.iterator();
              while(itr_s.hasNext())
              {
                Object k_s=itr_s.next();
                String tmp_topic=k_s.toString();

                // query term from key
                TermQuery tq_i=new TermQuery(new Term("item",tmp_topic.trim()));

                tmp_topic=tmp_topic.substring(tmp_topic.lastIndexOf("_")+1,
                  tmp_topic.length());

                // search topic with inventory key
                TermQuery tq_b=new TermQuery(new Term("body",tmp_topic));

                BooleanQuery bq=new BooleanQuery();
                bq.add(tq_i,BooleanClause.Occur.MUST);
                bq.add(tq_b,BooleanClause.Occur.MUST);

                Hits h=is.search(bq);
                for(int j=0;j<h.length();j++) // emit hits as topic^topic^body
                {
                  Document hd=h.doc(j);
                  out.collect(new Text(hd.get("topic")),
                    new Text("topic^"+hd.get("body")));
                }
              }
            }
            catch(Exception e){System.out.println(e);}
          }
        }
        catch(Exception e){System.out.println(e);}
      }
    }
  }

  // ob()/bo() are the elided serialize/deserialize helpers; key_maker()
  // builds the string keys used above.

  private static Document make_lucene_doc(String in_tpc,String in_bdy,String in_itm)
  {
    Document d=new Document();
    d.add(new Field("topic",in_tpc,Field.Store.YES,Field.Index.TOKENIZED));
    d.add(new Field("item",in_itm,Field.Store.NO,Field.Index.UN_TOKENIZED));
    d.add(new Field("body",in_bdy,Field.Store.YES,Field.Index.TOKENIZED));
    return d;
  }
}
Re: retrieve ObjectWritable
Hi,

If you have a requirement to pass an object or collection you can replace ObjectWritable with BytesWritable.

test code...

import java.io.*;
import java.util.*;
import org.apache.hadoop.io.BytesWritable;

public class bao
{
  public static void main(String args[])
  {
    HashMap hm=new HashMap();
    Map bm=Collections.synchronizedMap(hm);
    bm.put("123",new Integer(123));
    bm.put("456",new Integer(456));

    try
    {
      BytesWritable bw=new BytesWritable(serialize_method(bm));
      Map m=(Map)deserialize_method(bw.get());
      System.out.println("MAP_SIZE: "+m.size()); // works
    }
    catch(Exception e)
    {
      System.out.println(e);
    }
  }

  // static serialize_method, deserialize_method...

  /*** stuff like this can be sent to reduce, but can't be recast from
       ObjectWritable:

  private static MapWritable mw(Map m)
  {
    MapWritable rmw=new MapWritable();
    List l=new ArrayList(m.keySet());
    Iterator i=l.iterator();
    while(i.hasNext())
    {
      Object k=i.next();
      rmw.put(new Text(k.toString()),
        new ObjectWritable(serialize_method((Object)m.get(k))));
    }
    return rmw;
  }
  ***/
}

Good Luck,

Peter W.

Peter W. wrote:

Hi,

After trying to pass objects to reduce using ObjectWritable without success, I learned the class instead sends primitives such as float. However, you can make it go as an object by passing it as byte[] with:

new ObjectWritable(serialize_method(obj))

but it's not easy to retrieve once inside reduce because ((ObjectWritable)values.next()).get() returns an object, not an array, with no deserialize. Trying to cast this object to its original form delivers a 'ClassCastException b[' error (no ending square bracket) and an empty part file. How do you retrieve ObjectWritable?

Regards,

Peter W.
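The serialize_method/deserialize_method helpers are elided in the post above; a minimal self-contained sketch using plain java.io object streams (helper names kept from the post, implementation assumed) could look like this:

```java
import java.io.*;
import java.util.*;

public class SerializeHelpers {
    // Turn any Serializable object into a byte[] suitable for BytesWritable.
    static byte[] serialize_method(Object o) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(baos);
        oos.writeObject(o);
        oos.close();
        return baos.toByteArray();
    }

    // Reverse: byte[] back to the original object graph.
    static Object deserialize_method(byte[] b)
            throws IOException, ClassNotFoundException {
        ObjectInputStream ois =
            new ObjectInputStream(new ByteArrayInputStream(b));
        Object o = ois.readObject();
        ois.close();
        return o;
    }

    public static void main(String[] args) throws Exception {
        // Round-trip the same map as the bao test above.
        Map bm = Collections.synchronizedMap(new HashMap());
        bm.put("123", Integer.valueOf(123));
        bm.put("456", Integer.valueOf(456));
        Map m = (Map) deserialize_method(serialize_method(bm));
        System.out.println("MAP_SIZE: " + m.size());
    }
}
```

This is the same standard-library mechanism BytesWritable can wrap; everything in the map must implement java.io.Serializable for it to work.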
Re: Searching email list
On 12/03/2008 4:18 PM, Cagdas Gerede wrote: > Is there an easy way to search this email list? > I couldn't find any web interface. > > Please help. http://wiki.apache.org/hadoop/MailingListArchives Daryl
retrieve ObjectWritable
Hi,

After trying to pass objects to reduce using ObjectWritable without success, I learned the class instead sends primitives such as float. However, you can make it go as an object by passing it as byte[] with:

new ObjectWritable(serialize_method(obj))

but it's not easy to retrieve once inside reduce because ((ObjectWritable)values.next()).get() returns an object, not an array, with no deserialize. Trying to cast this object to its original form delivers a 'ClassCastException b[' error (no ending square bracket) and an empty part file. How do you retrieve ObjectWritable?

Regards,

Peter W.
Re: Yahoo's production webmap is now on Hadoop
Doug, Correction duly noted. :) Keep up the good work and congratulations on the progress and accomplishments of the Hadoop project. Kind Regards, Peter W. On Feb 19, 2008, at 2:39 PM, Doug Cutting wrote: Peter W. wrote: one trillion links=(10k million links/10 links per page)=1000 million pages=one billion. In English, a trillion usually means 10^12, not 10^10. http://en.wikipedia.org/wiki/Trillion Doug
Re: Yahoo's production webmap is now on Hadoop
Guys,

Thanks for the clarification and math explanations. Such a number would then likely be 100x my original estimate, given that the web may have doubled each year since that blog post and is growing exponentially. Index size was only a byproduct of trying to discern the significance of 1 trillion links in an inverted web graph. Hadoop has certainly arrived and become a valuable software asset likely to power next-generation Internet computing.

Thanks again,

Peter W.

On Feb 19, 2008, at 5:33 PM, Eric Baldeschwieler wrote:

Search engine index size comparison is actually a very inexact science. Various 3rd parties comparing the major search engines do not come to the same conclusions. But ours is certainly world class and well over the discussed sizes. Here is an interesting bit of web history... a blog from August 08, 2005 discussing our index of over 19.2 billion web documents. It has only grown since then.

http://www.ysearchblog.com/archives/000172.html

On Feb 19, 2008, at 2:38 PM, Ted Dunning wrote:

Sorry to be picky about the math, but 1 trillion = 10^12 = a million million. At 10 links per page, this gives 100 x 10^9 pages, not 1 x 10^9. At 100 links per page, this gives 10B pages.

On 2/19/08 2:25 PM, "Peter W." <[EMAIL PROTECTED]> wrote:

Amazing milestone. Looks like Y! had approximately 1B documents in the WebMap: one trillion links = (10k million links / 10 links per page) = 1000 million pages = one billion. If Google has 10B docs (indexed w/25 MR jobs), then Hadoop has achieved one-tenth of its scale?

Good stuff,

Peter W.

On Feb 19, 2008, at 9:58 AM, Owen O'Malley wrote:

The link inversion and ranking algorithms for Yahoo Search are now being generated on Hadoop:

http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html

Some Webmap size data:
* Number of links between pages in the index: roughly 1 trillion links
* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes
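Ted's correction in the thread above is easy to verify: 1 trillion is 10^12 links, so dividing by a links-per-page estimate gives 100 billion pages (at 10 links/page) or 10 billion pages (at 100 links/page), not the 1 billion in the original estimate:

```java
// Check the arithmetic from the webmap thread:
// 10^12 links / links-per-page => implied page count.
public class WebmapMath {
    public static void main(String[] args) {
        long links = 1000000000000L;       // 1 trillion = 10^12
        System.out.println(links / 10L);   // 10 links/page  -> 100 billion pages
        System.out.println(links / 100L);  // 100 links/page -> 10 billion pages
    }
}
```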
Re: Yahoo's production webmap is now on Hadoop
Amazing milestone. Looks like Y! had approximately 1B documents in the WebMap: one trillion links = (10k million links / 10 links per page) = 1000 million pages = one billion. If Google has 10B docs (indexed w/25 MR jobs), then Hadoop has achieved one-tenth of its scale?

Good stuff,

Peter W.

On Feb 19, 2008, at 9:58 AM, Owen O'Malley wrote:

The link inversion and ranking algorithms for Yahoo Search are now being generated on Hadoop:

http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html

Some Webmap size data:
* Number of links between pages in the index: roughly 1 trillion links
* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes
Re: Lucene-based Distributed Index Leveraging Hadoop
Howdy,

Your work is outstanding and will hopefully be adopted soon.

The HDFS-distributed Lucene index solves many of the various dependencies introduced by achieving this another way using RMI, HTTP (serialized objects w/servlets) or Tomcat balancing with MySQL databases, schemas and connection pools.

Before this, other mixed options were available where Nutch obtains documents, HTML and XML parsers extract data, Hadoop reduces those results, and Lucene stores and indexes same. Something like: get document (Nutch), REST post as XML (Solr), XML to data (ROME, Abdera), data to map (Hadoop), reduce to tables (Hadoop, HBase), then reconstruct bytes to a Lucene Document object for indexing. Obviously, yours is cleaner and more scalable.

I'd want the master also to keep track of (task[id], [comp]leted, [prog]ress) in ways kind of like tables on which you could perform status updates:

+----+------+------+
| id | comp | prog |
+----+------+------+

Also, maybe the following indexing pipeline...

index clients:
  from remote app machine1, machine2, machine3 using hdfs
  batch index lucene documents (hundred at a time)
  place in single encapsulation object
  connect to master
  select task id where (completed=0) && (progress=0)
  update progress=1
  put object (hdfs)

master:
  recreate collection from stream (in)
  iterate object, cast items to Document
  hash document key in the mapper, contents are IM
  index Lucene documents in reducer allowing Text object access
    for filtering purposes
  return indexed # as integer (rpc response)

back on clients:
  update progress=0, comp=1 when finished
  send master confirmation info with heartbeat

Then add dates and logic for fixing extended race conditions where (completed=0) && (progress=1) on the master, where clients can resubmit jobs using confirmed keys received as inventory lists. To update progress and completed tasks, somehow check the size of part-files in each labeled out dir or monitor Hadoop logs in the appropriate temp dir. Run new JobClients accordingly.

Sweet,

Peter W.
On Feb 6, 2008, at 10:59 AM, Ning Li wrote:

There have been several proposals for a Lucene-based distributed index architecture.

1) Doug Cutting's "Index Server Project Proposal" at
http://www.mail-archive.com/[EMAIL PROTECTED]/msg00338.html
2) Solr's "Distributed Search" at
http://wiki.apache.org/solr/DistributedSearch
3) Mark Butler's "Distributed Lucene" at
http://wiki.apache.org/hadoop/DistributedLucene

We have also been working on a Lucene-based distributed index architecture. Our design differs from the above proposals in the way it leverages Hadoop as much as possible. In particular, HDFS is used to reliably store Lucene instances, Map/Reduce is used to analyze documents and update Lucene instances in parallel, and Hadoop's IPC framework is used. Our design is geared for applications that require a highly scalable index and where batch updates to each Lucene instance are acceptable (versus finer-grained document-at-a-time updates).

We have a working implementation of our design and are in the process of evaluating its performance. An overview of our design is provided below. We welcome feedback and would like to know if you are interested in working on it. If so, we would be happy to make the code publicly available. At the same time, we would like to collaborate with people working on existing proposals and see if we can consolidate our efforts.

TERMINOLOGY

A distributed "index" is partitioned into "shards". Each shard corresponds to a Lucene instance and contains a disjoint subset of the documents in the index. Each shard is stored in HDFS and served by one or more "shard servers". Here we only talk about a single distributed index, but in practice multiple indexes can be supported.

A "master" keeps track of the shard servers and the shards being served by them.

An "application" updates and queries the global index through an "index client". An index client communicates with the shard servers to execute a query.
KEY RPC METHODS

This section lists the key RPC methods in our design. To simplify the discussion, some of their parameters have been omitted.

On the Shard Servers

  // Execute a query on this shard server's Lucene instance.
  // This method is called by an index client.
  SearchResults search(Query query);

On the Master

  // Tell the master to update the shards, i.e., Lucene instances.
  // This method is called by an index client.
  boolean updateShards(Configuration conf);

  // Ask the master where the shards are located.
  // This method is called by an index client.
  LocatedShards getShardLocations();

  // Send a heartbeat to the master. This method is called by a
  // shard server. In the response, the master informs the
  // shard server when to switch to a newer version of the index.
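To make the query path above concrete, here is a toy in-memory walk-through of the flow the design describes: the index client asks the master where the shards are, then fans a query out to each shard server and merges the results. All class names and behaviors here are illustrative stand-ins, not the actual implementation:

```java
import java.util.*;

public class ShardQueryDemo {
    // Stand-in for the shard server's search(Query) RPC.
    interface ShardServer { List<String> search(String query); }

    // Stand-in for the master's shard registry / getShardLocations().
    static class Master {
        Map<String, ShardServer> shards = new LinkedHashMap<String, ShardServer>();
        Collection<ShardServer> getShardLocations() { return shards.values(); }
    }

    public static void main(String[] args) {
        Master master = new Master();
        // Each "shard" holds a disjoint subset of documents; here each
        // just echoes a fake hit for the query.
        master.shards.put("shard-0", q -> Arrays.asList("doc1:" + q));
        master.shards.put("shard-1", q -> Arrays.asList("doc7:" + q));

        // Index client: locate shards via the master, query each, merge.
        List<String> merged = new ArrayList<String>();
        for (ShardServer s : master.getShardLocations())
            merged.addAll(s.search("hadoop"));
        System.out.println(merged);
    }
}
```

In the real design the ShardServer and Master calls would go over Hadoop's IPC framework rather than in-process method calls.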
Re: pig user meeting, Friday, February 8, 2008
Hi,

Can a follow-up meeting be scheduled at the same time as the Hadoop Summit on March 25th? As MapReduce becomes more prominent in corporate environments, the benefits of Pig as a Sawzall alternative become obvious to that audience. Also, please take a Flickr photo!

Later,

Peter W.

Otis Gospodnetic wrote:

... Is anyone going to be capturing the Piglet meeting on video for those of us living in other corners of the planet?

Thank you,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message -
From: Stefan Groschupf <[EMAIL PROTECTED]>

Hi there,

A couple of people plan to meet and talk about Apache Pig next Friday in the Mountain View area. (The event location is not yet certain.) If you are interested, please RSVP asap so we can plan what size of location we are looking for.

http://upcoming.yahoo.com/event/420958/

Cheers,
Stefan

~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com
Re: Mahout Machine Learning Project Launches
Hello,

This Mahout project seems very interesting. Any problem that has components reducible to mapreduce and can then be described as a linear equation would be an excellent candidate.

Most Nutch developers probably don't need HMMs but instead the power method to iterate over Markov chains, or Perron-Frobenius. However, some of that work as it pertains to the web has been patented, so it would be more productive for the Hadoop community to focus on other areas such as adjacency matrices, SALSA or bipartite graphs using HBase.

Bye,

Peter W.

On Feb 2, 2008, at 3:43 AM, edward yoon wrote:

I thought of Hidden Markov Models (HMMs) as absolutely impossible on the MR model. If anyone has some information, please let me know. Thanks.

On 2/2/08, edward yoon <[EMAIL PROTECTED]> wrote:

I read an interesting piece of information in that NIPS paper, and I was implementing it, but now there are too many mailing lists for me to read. Lucene, Core, Hbase, Pig, Solr, Mahout... :( Too distributed.

On 2/2/08, gopi <[EMAIL PROTECTED]> wrote:

I'm definitely excited about Machine Learning algorithms being implemented in this project! I'm currently a student studying Machine Learning, and would love to help out in every possible manner.

Thanks,
Chaitanya Sharma

On Jan 25, 2008 5:55 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

(Apologies for cross-posting)

The Lucene PMC is pleased to announce the creation of the Mahout Machine Learning project, located at http://lucene.apache.org/mahout. Mahout's goal is to create a suite of practical, scalable machine learning libraries. Our initial plan is to utilize Hadoop (http://hadoop.apache.org) to implement a variety of algorithms including naive bayes, neural networks, support vector machines and k-Means, among others. While our initial focus is on these algorithms, we welcome other machine learning ideas as well. Naturally, we are looking for volunteers to help grow the community and make the project successful.
So, if machine learning is your thing, come on over and lend a hand! Cheers, Grant Ingersoll http://lucene.apache.org/mahout
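The "power method to iterate over Markov chains" mentioned in this thread is simple to sketch: repeatedly multiplying a distribution vector by the transition matrix converges to the chain's stationary distribution. The matrix values below are illustrative, just to show convergence:

```java
// Power method on a tiny 2-state Markov chain.
// Repeated multiplication by the row-stochastic transition matrix P
// converges to the stationary distribution (here (5/6, 1/6)).
public class PowerMethod {
    public static void main(String[] args) {
        double[][] P = { {0.9, 0.1}, {0.5, 0.5} }; // illustrative transitions
        double[] pi = {1.0, 0.0};                  // start in state 0
        for (int it = 0; it < 100; it++) {
            double[] next = new double[2];
            for (int i = 0; i < 2; i++)
                for (int j = 0; j < 2; j++)
                    next[j] += pi[i] * P[i][j];    // pi <- pi * P
            pi = next;
        }
        System.out.printf(java.util.Locale.US, "%.4f %.4f%n", pi[0], pi[1]);
    }
}
```

At web scale, each iteration of this multiply is exactly the kind of step that maps naturally onto a mapreduce pass over an adjacency matrix.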
Re: graph data representation for mapreduce
Cam,

Making a directed graph in Hadoop is not very difficult, but traversing it live might be, since the result is a separate file. Basically, you kick out a destination node as your key in the mapper and the from-nodes as intermediate values. Concatenate the from-values in the reducer, assigning weights to each edge. The assigned edge scores come from a computation done in the reducer or a number passed with the key. This gives a simple but weighted from/to depiction and can be experimented with and improved by subsequent passes, or by REST-style calls in the mapper for MySQL-stored weights.

Later,

Peter W.

Cam Bazz wrote:

Hello,

I have been long interested in storing graphs in databases, object databases, and Lucene-like indexes. Has anyone done any work on storing and processing graphs with map reduce? If I were to start, where would I start from? I am interested in finding shortest paths in a large graph.
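Peter's recipe above can be sketched in plain Java (illustrative only, not actual Hadoop API code): the "map" step emits (destination, from-node) pairs, and the "reduce" step concatenates the from-nodes for each destination and assigns each edge a weight. Here the weight is simply 1/outdegree of the source, an assumed example of the "computation done in the reducer":

```java
import java.util.*;

public class GraphInvert {
    public static void main(String[] args) {
        String[][] edges = { {"a","b"}, {"a","c"}, {"b","c"} }; // from -> to

        // Precompute outdegrees for the example weighting.
        Map<String,Integer> outdeg = new HashMap<String,Integer>();
        for (String[] e : edges)
            outdeg.merge(e[0], 1, Integer::sum);

        // "map" + shuffle: key = destination node, values = from-nodes.
        Map<String,List<String>> grouped = new TreeMap<String,List<String>>();
        for (String[] e : edges)
            grouped.computeIfAbsent(e[1], k -> new ArrayList<String>()).add(e[0]);

        // "reduce": concatenate from-nodes with per-edge weights.
        for (Map.Entry<String,List<String>> en : grouped.entrySet()) {
            StringBuilder sb = new StringBuilder(en.getKey()).append("\t");
            for (String from : en.getValue())
                sb.append(from).append(":").append(1.0 / outdeg.get(from)).append(" ");
            System.out.println(sb.toString().trim());
        }
    }
}
```

Each output line is the weighted inverted adjacency list for one node, which is exactly the record a subsequent Hadoop pass would consume.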
Re: Hadoop future?
Lukas, I would like Hadoop and Cygwin to now come standard with each edition of Windows Vista Business. But that's just me. Bye, Peter W. Lukas Vlcek wrote: ... Does anybody have any idea how this could impact specifically Hadoop future? I know this is all about speculations now...