Hadoop as Cloud Storage
Dear Hadoop gurus,

After some googling I found a bit of information on using Hadoop as long-term cloud storage. I need to maintain a lot of data (around 50 TB), much of it TV commercials (video files). I know the best solution for long-term file archiving is tape backup, but I'm just curious: can Hadoop be used as a 'data archiving' platform?

Thanks! Warm Regards, Wildan --- OpenThink Labs http://openthink-labs.tobethink.com/ Making IT, Business and Education in Harmony 087884599249 Y! : hawking_123 LinkedIn : http://www.linkedin.com/in/wildanmaulana
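If you want to experiment with HDFS as an archive tier, below is a minimal Java sketch of copying a file into HDFS and raising its replication for durability. The namenode URI, paths, and replication factor are placeholders for illustration, not anything specific to Wildan's setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: copy a local video file into HDFS as an archive copy.
// The namenode URI, paths and replication factor are placeholders.
public class ArchiveToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode:9000"); // hypothetical namenode

    FileSystem fs = FileSystem.get(conf);
    Path local = new Path("/media/commercials/spot-001.mpg");       // hypothetical source
    Path remote = new Path("/archive/tv-commercials/spot-001.mpg"); // HDFS destination

    // copyFromLocalFile(delSrc, overwrite, src, dst)
    fs.copyFromLocalFile(false, true, local, remote);

    // keep extra copies of archived data
    fs.setReplication(remote, (short) 3);

    fs.close();
  }
}

Note that HDFS keeps whole replicas rather than tape-style offline copies, so 50 TB of video at 3x replication means roughly 150 TB of raw disk.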
Re: ANN: Hadoop UI beta
+1 wow .., looks fantastic ... :) The summary says it only works with 0.19. Just curious, does it also work with Hadoop trunk?

Thanks! Best Regards, Wildan --- OpenThink Labs www.tobethink.com Aligning IT and Education 021-99325243 Y! : hawking_123 LinkedIn : http://www.linkedin.com/in/wildanmaulana

On Tue, Mar 31, 2009 at 6:11 PM, Stefan Podkowinski spo...@gmail.com wrote: Hello, I'd like to invite you to take a look at the recently released first beta of Hadoop UI, a graphical Flex/Java based client for Hadoop Core. Hadoop UI currently includes an HDFS file explorer and basic job tracking features. Get it here: http://code.google.com/p/hadoop-ui/ As this is the first release it may (and does) still contain bugs, but I'd like to give everyone the chance to send feedback as early as possible. Give it a try :) - Stefan
Re: hadoop-a small doubt
I have already tried mountable HDFS, with both the WebDAV and FUSE approaches; neither of them seems production ready .. CMIIW

Best Regards, Wildan --- OpenThink Labs www.tobethink.com Aligning IT and Education 021-99325243 Y! : hawking_123 LinkedIn : http://www.linkedin.com/in/wildanmaulana

On Sun, Mar 29, 2009 at 2:52 PM, Sagar Naik sn...@attributor.com wrote: Yes, you can. Java client: copy the conf dir (the same one as on the namenode/datanode) and the Hadoop jars should be on the client's classpath. Non-Java client: http://wiki.apache.org/hadoop/MountableHDFS -Sagar

deepya wrote: Hi, I am SreeDeepya, doing an MTech at IIIT. I am working on a project named cost-effective and scalable storage server. I configured a small Hadoop cluster with only two nodes, one namenode and one datanode. I am new to Hadoop. I have a small doubt: can a system that is not in the Hadoop cluster access the namenode or the datanode? If yes, can you please tell me the necessary configuration that has to be done? Thanks in advance. SreeDeepya
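For the Java-client route Sagar describes, here is a minimal sketch of a machine outside the cluster listing files through the namenode. It assumes the cluster's fs.default.name is hdfs://namenode:9000; with the cluster's conf dir on the classpath, the explicit conf.set() call is unnecessary.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of an external client talking to the namenode. With the cluster's
// conf dir on the classpath, new Configuration() picks up fs.default.name
// automatically; the URI below is a placeholder.
public class RemoteHdfsClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode:9000");

    FileSystem fs = FileSystem.get(conf);
    for (FileStatus stat : fs.listStatus(new Path("/"))) {
      System.out.println(stat.getPath() + "\t" + stat.getLen() + " bytes");
    }
    fs.close();
  }
}

The client machine also needs network access to the datanodes, not just the namenode, since file data is read directly from them.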
Re: hadoop migration
Thanks for the quick response Aman, OK .., I see the point now. Currently I'm doing some research on creating a Google Books-like application using HBase as the backend for storing the files and Solr as the indexer. From this prototype, maybe I can measure how fast HBase is at serving data to clients ... (Google uses BigTable for books.google.com, right?)

Thanks! Regards, Wildan

On Tue, Mar 17, 2009 at 12:13 PM, Amandeep Khurana ama...@gmail.com wrote: Hypertable is not as mature as HBase yet. The next release of HBase, 0.20.0, includes some patches which reduce response latency and make it suitable as a backend for a webapp. However, the current release isn't optimized for this purpose. The idea behind Hadoop and the rest of the tools around it is more of a data processing system than a backend datastore for a website. The output of the processing that Hadoop does is typically loaded into a MySQL cluster which feeds a website. -- --- OpenThink Labs www.tobethink.com Aligning IT and Education 021-99325243 Y! : hawking_123 LinkedIn : http://www.linkedin.com/in/wildanmaulana
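For anyone prototyping along these lines, here is a hedged sketch of storing and fetching a page with the 0.20-style HBase client API. The table name, column family, and row key are invented for illustration, and the table is assumed to already exist with a 'content' family.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch only: assumes a pre-created 'books' table with a 'content' family.
public class BookStoreSketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "books");

    // store one scanned page under a composite row key
    Put put = new Put(Bytes.toBytes("book-0001/page-0001"));
    put.add(Bytes.toBytes("content"), Bytes.toBytes("text"),
            Bytes.toBytes("page text goes here"));
    table.put(put);

    // read it back, e.g. when Solr points a user at this page
    Get get = new Get(Bytes.toBytes("book-0001/page-0001"));
    Result result = table.get(get);
    byte[] value = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("text"));
    System.out.println(Bytes.toString(value));

    table.close();
  }
}

With the 0.19 release mentioned in the thread, the equivalent calls go through the older BatchUpdate/Cell classes instead.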
single node Hbase
Hello, Are there any Hadoop documentation resources showing how to run the current version of Hbase on a single node? Thanks, Peter W.
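HBase can run in standalone mode on one machine with little more than a pointer to a local data directory; here is a minimal hbase-site.xml sketch (the path is a placeholder, and property details vary between releases, so check the getting-started page for your version).

<?xml version="1.0"?>
<!-- Minimal standalone setup: master and regionserver run in one JVM and
     data lives on the local filesystem. The path below is a placeholder. -->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/peter/hbase-data</value>
  </property>
</configuration>

Then start it with bin/start-hbase.sh and poke at it with bin/hbase shell.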
Lucene reduce
Hello, For those interested, you can filter and search Lucene documents in the reduce.

code:

import java.io.*;
import java.util.*;

import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.Hits;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.*;

public class ql
{
  /**
   * Query Lucene using keys.
   *
   * input:
   *   java^this page is about java
   *   ruby^site only mentions rails
   *   php^another resource about php
   *   java^ejb3 discussed and spring
   *   eof^eof
   *
   * make docs, search, mapreduce
   *
   * output:
   *   php^topic^another resource about php
   *   java^topic^this page is about java
   *
   * Helper methods ob()/bo() (object <-> byte[] serialization),
   * key_maker() and make_lucene_doc(), plus the JobConf driver,
   * are not shown here.
   ***/

  public static class M extends MapReduceBase implements Mapper
  {
    HashMap hm = new HashMap();
    Map group_m = Collections.synchronizedMap(hm);
    String ITEM_KEY, BATCH_KEY = "";
    int batch = 0;

    public void map(WritableComparable wc, Writable w,
        OutputCollector out, Reporter rep) throws IOException
    {
      String ln = ((Text) w).toString();
      String[] parse_a = ln.split("\\^");

      if (batch > (100 - 1)) // new lucene document group
      {
        out.collect(new Text(BATCH_KEY), new BytesWritable(ob(group_m)));
        BATCH_KEY = "BATCH_" + key_maker(String.valueOf(batch));
        batch = 0;
        group_m.clear();
      }
      else if (parse_a[0].equals("eof"))
      {
        out.collect(new Text(BATCH_KEY), new BytesWritable(ob(group_m)));
      }

      ITEM_KEY = "ITEM_" + key_maker(parse_a[0]);
      Document single_d = make_lucene_doc(parse_a[0], parse_a[1], ITEM_KEY);
      group_m.put(ITEM_KEY, single_d);
      batch++;
    }
  }

  public static class R extends MapReduceBase implements Reducer
  {
    public void reduce(WritableComparable wc, Iterator it,
        OutputCollector out, Reporter rep) throws IOException
    {
      while (it.hasNext())
      {
        try
        {
          Map m = (Map) bo(((BytesWritable) it.next()).get());

          if (m instanceof Map)
          {
            try
            {
              // build temp index
              Directory rd = new RAMDirectory();
              Analyzer sa = new StandardAnalyzer();
              IndexWriter iw = new IndexWriter(rd, sa, true);

              // unwrap, cast, send to mem
              List keys = new ArrayList(m.keySet());
              Iterator itr_u = keys.iterator();
              while (itr_u.hasNext())
              {
                Object k_u = itr_u.next();
                Document dtmp = (Document) m.get(k_u);
                iw.addDocument(dtmp);
              }
              iw.optimize();
              iw.close();

              Searcher is = new IndexSearcher(rd);

              // simple doc filter
              Iterator itr_s = keys.iterator();
              while (itr_s.hasNext())
              {
                Object k_s = itr_s.next();
                String tmp_topic = k_s.toString();

                TermQuery tq_i = new TermQuery(new Term("item", tmp_topic.trim()));

                // query term from key
                tmp_topic = tmp_topic.substring(tmp_topic.lastIndexOf("_") + 1, tmp_topic.length());
                TermQuery tq_b = new TermQuery(new Term("body", tmp_topic));

                // search topic with inventory key
                BooleanQuery bq = new BooleanQuery();
                bq.add(tq_i, BooleanClause.Occur.MUST);
                bq.add(tq_b, BooleanClause.Occur.MUST);

                Hits h = is.search(bq);
                for (int j = 0; j < h.length(); j++)
                {
                  Document doc = h.doc(j);
                  String tmp_tpc = doc.get("topic");
                  String tmp_bdy = doc.get("body");
                  out.collect(wc, new Text(tmp_tpc + "^topic^" + tmp_bdy));
                }
              }
              keys.clear();
              is.close();
            }
            catch (Exception e) { System.out.println(e); }
          }
        }
        catch (Exception e) { System.out.println(e); }
      }
    }
  }
}
Re: Searching email list
On 12/03/2008 4:18 PM, Cagdas Gerede wrote: Is there an easy way to search this email list? I couldn't find any web interface. Please help. http://wiki.apache.org/hadoop/MailingListArchives Daryl
retrieve ObjectWritable
Hi, After trying to pass objects to reduce using ObjectWritable without success, I learned the class instead sends primitives such as float. However, you can make it go as an object by passing it as byte[] with: new ObjectWritable(serialize_method(obj)) but it's not easy to retrieve once inside reduce, because ((ObjectWritable)values.next()).get() returns an Object, not an array, with no deserialization. Trying to cast this object to its original form throws a 'ClassCastException: [B' ([B is the JVM's internal name for byte[]) and leaves an empty part file. How do you retrieve an ObjectWritable? Regards, Peter W.
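One approach that matches the byte[] trick above, sketched here under the assumption that the payload is java.io.Serializable: serialize to byte[] before wrapping, and inside reduce cast get() back to byte[] first, then deserialize. The helper names are made up (they stand in for serialize_method above).

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import org.apache.hadoop.io.ObjectWritable;

// Round-trip sketch: ObjectWritable carrying a serialized object as byte[].
public class ObjectWritableRoundTrip {

  // object -> byte[] (stand-in for serialize_method)
  static byte[] toBytes(Object obj) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    ObjectOutputStream oos = new ObjectOutputStream(bos);
    oos.writeObject(obj);
    oos.close();
    return bos.toByteArray();
  }

  // byte[] -> object
  static Object fromBytes(byte[] b) throws IOException, ClassNotFoundException {
    return new ObjectInputStream(new ByteArrayInputStream(b)).readObject();
  }

  public static void main(String[] args) throws Exception {
    // map side: wrap the serialized bytes
    ObjectWritable ow = new ObjectWritable(toBytes("hello reduce"));

    // reduce side: get() hands back an Object that is really a byte[];
    // casting straight to the original class is what triggers the
    // 'ClassCastException: [B', so cast to byte[] and deserialize instead.
    byte[] raw = (byte[]) ow.get();
    String restored = (String) fromBytes(raw);
    System.out.println(restored);
  }
}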
Re: Yahoo's production webmap is now on Hadoop
Amazing milestone, Looks like Y! had approximately 1B documents in the WebMap: one trillion links / roughly 1,000 links per page = 1,000 million pages = one billion. If Google has 10B docs (indexed w/25 MR jobs) then Hadoop has achieved one-tenth of its scale? Good stuff, Peter W.

On Feb 19, 2008, at 9:58 AM, Owen O'Malley wrote: The link inversion and ranking algorithms for Yahoo Search are now being generated on Hadoop: http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html

Some Webmap size data:
* Number of links between pages in the index: roughly 1 trillion links
* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes
Re: Lucene-based Distributed Index Leveraging Hadoop
Howdy, Your work is outstanding and will hopefully be adopted soon. The HDFS-distributed Lucene index solves many of the dependencies introduced by achieving this another way, using RMI, HTTP (serialized objects with servlets), or Tomcat load balancing with MySQL databases, schemas, and connection pools. Before this, other mixed options were available where Nutch obtains documents, HTML and XML parsers extract data, Hadoop reduces those results, and Lucene stores and indexes them. Something like: get document (Nutch), REST post as XML (Solr), XML to data (ROME, Abdera), data to map (Hadoop), reduce to tables (Hadoop, HBase), then reconstruct bytes to a Lucene Document object for indexing. Obviously, yours is cleaner and more scalable.

I'd want the master also to keep track of (task[id], [comp]leted, [prog]ress) in something like a table you could run status updates against:

+----+------+------+
| id | comp | prog |
+----+------+------+

Also, maybe the following indexing pipeline...

index clients:
  from remote app machine1, machine2, machine3 using hdfs
  batch index lucene documents (a hundred at a time)
  place in a single encapsulation object
  connect to master
  select task id where (completed=0) (progress=0)
  update progress=1
  put object (hdfs)

master:
  recreate collection from stream (in)
  iterate object, cast items to Document
  hash document key in the mapper, contents are IM
  index Lucene documents in reducer allowing Text object access for filtering purposes
  return indexed # as integer (rpc response)

back on clients:
  update progress=0, comp=1 when finished
  send master confirmation info with heartbeat

Then add dates and logic for fixing extended race conditions where (completed=0) (progress=1) on the master, where clients can resubmit jobs using confirmed keys received as inventory lists. To update progress and completed tasks, somehow check the size of part-files in each labeled out dir or monitor Hadoop logs in the appropriate temp dir. Run new JobClients accordingly.

Sweet, Peter W.

On Feb 6, 2008, at 10:59 AM, Ning Li wrote: There have been several proposals for a Lucene-based distributed index architecture.
1) Doug Cutting's Index Server Project Proposal at http://www.mail-archive.com/[EMAIL PROTECTED]/msg00338.html
2) Solr's Distributed Search at http://wiki.apache.org/solr/DistributedSearch
3) Mark Butler's Distributed Lucene at http://wiki.apache.org/hadoop/DistributedLucene

We have also been working on a Lucene-based distributed index architecture. Our design differs from the above proposals in the way it leverages Hadoop as much as possible. In particular, HDFS is used to reliably store Lucene instances, Map/Reduce is used to analyze documents and update Lucene instances in parallel, and Hadoop's IPC framework is used. Our design is geared for applications that require a highly scalable index and where batch updates to each Lucene instance are acceptable (versus finer-grained document-at-a-time updates). We have a working implementation of our design and are in the process of evaluating its performance. An overview of our design is provided below. We welcome feedback and would like to know if you are interested in working on it. If so, we would be happy to make the code publicly available. At the same time, we would like to collaborate with people working on existing proposals and see if we can consolidate our efforts.

TERMINOLOGY

A distributed index is partitioned into shards. Each shard corresponds to a Lucene instance and contains a disjoint subset of the documents in the index.
Each shard is stored in HDFS and served by one or more shard servers. Here we only talk about a single distributed index, but in practice multiple indexes can be supported. A master keeps track of the shard servers and the shards being served by them. An application updates and queries the global index through an index client. An index client communicates with the shard servers to execute a query.

KEY RPC METHODS

This section lists the key RPC methods in our design. To simplify the discussion, some of their parameters have been omitted.

On the Shard Servers:

  // Execute a query on this shard server's Lucene instance.
  // This method is called by an index client.
  SearchResults search(Query query);

On the Master:

  // Tell the master to update the shards, i.e., Lucene instances.
  // This method is called by an index client.
  boolean updateShards(Configuration conf);

  // Ask the master where the shards are located.
  // This method is called by an index client.
  LocatedShards getShardLocations();

  // Send a heartbeat to the master. This method is called by a
  // shard server. In the response, the master informs the
  // shard server when to switch to a newer version of the index.
  ShardServerCommand sendHeartbeat();

QUERYING THE INDEX

To query the index, an application sends a search request to an index client. The index
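To make the query path concrete, here is a hedged, self-contained sketch of how an index client might fan a query out to shard servers and merge the results. Every type here (Query, SearchResults, ShardServer, Master) is a stand-in inferred from the signatures quoted above, not the actual implementation Ning describes.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hedged sketch of the client-side query flow implied by the RPCs above.
public class DistributedIndexSketch {

  static class Query { final String text; Query(String t) { text = t; } }

  static class SearchResults {
    final List<String> hits;
    SearchResults(List<String> hits) { this.hits = hits; }
    static SearchResults merge(List<SearchResults> parts) {
      List<String> all = new ArrayList<String>();
      for (SearchResults r : parts) all.addAll(r.hits);
      return new SearchResults(all);   // a real client would re-rank here
    }
  }

  // search() lives on each shard server; getShardLocations() on the master.
  interface ShardServer { SearchResults search(Query q); }
  interface Master { List<ShardServer> getShardLocations(); }

  // Index client: ask the master where the shards are, fan the query out,
  // then merge the per-shard results.
  static SearchResults query(Master master, Query q) {
    List<SearchResults> partials = new ArrayList<SearchResults>();
    for (ShardServer shard : master.getShardLocations()) {
      partials.add(shard.search(q));
    }
    return SearchResults.merge(partials);
  }

  public static void main(String[] args) {
    // Two in-process "shard servers" standing in for remote Lucene instances
    // that would normally be reached over Hadoop IPC.
    final ShardServer s1 = new ShardServer() {
      public SearchResults search(Query q) {
        return new SearchResults(Arrays.asList("shard1 hit for: " + q.text));
      }
    };
    final ShardServer s2 = new ShardServer() {
      public SearchResults search(Query q) {
        return new SearchResults(Arrays.asList("shard2 hit for: " + q.text));
      }
    };
    Master master = new Master() {
      public List<ShardServer> getShardLocations() { return Arrays.asList(s1, s2); }
    };
    System.out.println(query(master, new Query("hadoop")).hits);
  }
}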
Re: Mahout Machine Learning Project Launches
Hello, This Mahout project seems very interesting. Any problem whose components are reducible with MapReduce and can then be described as a linear equation would be an excellent candidate. Most Nutch developers probably don't need HMMs, but rather the power method to iterate over Markov chains, or Perron-Frobenius. However, some of that work as it pertains to the web has been patented, so it would be more productive for the Hadoop community to focus on other areas such as adjacency matrices, SALSA, or bipartite graphs using HBase. Bye, Peter W.

On Feb 2, 2008, at 3:43 AM, edward yoon wrote: I thought of Hidden Markov Models (HMMs) as absolutely impossible on the MR model. If anyone has some information, please let me know. Thanks.

On 2/2/08, edward yoon [EMAIL PROTECTED] wrote: I read an interesting piece of information in that NIPS paper, and implemented it, but now there are too many mailing lists for me to read. Lucene, Core, HBase, Pig, Solr, Mahout ... :( Too distributed.

On 2/2/08, gopi [EMAIL PROTECTED] wrote: I'm definitely excited about Machine Learning algorithms being implemented in this project! I'm currently a student studying Machine Learning, and would love to help out in every possible manner. Thanks Chaitanya Sharma

On Jan 25, 2008 5:55 PM, Grant Ingersoll [EMAIL PROTECTED] wrote: (Apologies for cross-posting) The Lucene PMC is pleased to announce the creation of the Mahout Machine Learning project, located at http://lucene.apache.org/mahout. Mahout's goal is to create a suite of practical, scalable machine learning libraries. Our initial plan is to utilize Hadoop ( http://hadoop.apache.org ) to implement a variety of algorithms including naive Bayes, neural networks, support vector machines and k-Means, among others. While our initial focus is on these algorithms, we welcome other machine learning ideas as well. Naturally, we are looking for volunteers to help grow the community and make the project successful. So, if machine learning is your thing, come on over and lend a hand! Cheers, Grant Ingersoll http://lucene.apache.org/mahout
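As a toy illustration of the power method Peter mentions, here is a small, self-contained Java sketch that iterates a row-stochastic transition matrix to its stationary distribution. The 3-state chain is invented for the example; a web-scale version would express each multiply as a MapReduce pass over an adjacency list rather than a dense in-memory matrix.

// Toy power-method sketch: repeatedly push a probability vector through the
// transition matrix until it converges to the chain's stationary distribution.
// The 3-state matrix below is made up for illustration.
public class PowerMethodSketch {
  public static void main(String[] args) {
    double[][] p = {              // p[i][j] = Prob(i -> j), rows sum to 1
      {0.0, 0.5, 0.5},
      {0.3, 0.0, 0.7},
      {0.6, 0.4, 0.0}
    };
    double[] v = {1.0 / 3, 1.0 / 3, 1.0 / 3};   // start from uniform

    for (int iter = 0; iter < 100; iter++) {
      double[] next = new double[v.length];
      for (int i = 0; i < v.length; i++)
        for (int j = 0; j < v.length; j++)
          next[j] += v[i] * p[i][j];            // probability mass flowing i -> j

      double diff = 0.0;
      for (int j = 0; j < v.length; j++) diff += Math.abs(next[j] - v[j]);
      v = next;
      if (diff < 1e-10) break;                  // converged
    }

    for (int j = 0; j < v.length; j++)
      System.out.printf("state %d: %.4f%n", j, v[j]);
  }
}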
Re: Hadoop future?
Lukas, I would like Hadoop and Cygwin to now come standard with each edition of Windows Vista Business. But that's just me. Bye, Peter W. Lukas Vlcek wrote: ... Does anybody have any idea how this could impact specifically Hadoop future? I know this is all about speculations now...