Hadoop as Cloud Storage

2009-06-16 Thread W
Dear Hadoop Gurus,

After googling, I found some information on using Hadoop as (long-term)
cloud storage.
I have to maintain a lot of data (around 50 TB), much of it
TV commercials (video files).

I know the best solution for long-term file archiving is tape
backup, but I'm just curious: can Hadoop
be used as a 'data archiving' platform?
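
For what it's worth, writing archive data into HDFS from Java comes down to a couple of FileSystem calls. The sketch below is only an illustration of that idea, not a recommendation from this thread; the NameNode URI, the paths and the replication factor are made-up placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ArchiveToHdfs
    {
    public static void main(String[] args) throws Exception
        {
        Configuration conf=new Configuration();
        // hypothetical NameNode address; normally read from the hadoop-site.xml on the classpath
        conf.set("fs.default.name","hdfs://namenode.example.com:9000");
        FileSystem fs=FileSystem.get(conf);

        Path local=new Path("/data/tv-commercials/spot-001.mpg");      // placeholder local file
        Path remote=new Path("/archive/tv-commercials/spot-001.mpg");  // placeholder HDFS path
        fs.copyFromLocalFile(local,remote);

        // keep extra copies for durability (3 is the usual default anyway)
        fs.setReplication(remote,(short)3);
        fs.close();
        }
    }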

Thanks!

Warm Regards,
Wildan
---
OpenThink Labs
http://openthink-labs.tobethink.com/

Making IT, Business and Education in Harmony

 087884599249

Y! : hawking_123
LinkedIn : http://www.linkedin.com/in/wildanmaulana


Re: ANN: Hadoop UI beta

2009-03-31 Thread W
+1 wow .., looks fantastic ... :)

The summary says it works only with 0.19. Just curious, does it
work with the Hadoop trunk?

Thanks!

Best Regards,
Wildan

---
OpenThink Labs
www.tobethink.com

Aligning IT and Education

 021-99325243
Y! : hawking_123
LinkedIn : http://www.linkedin.com/in/wildanmaulana



On Tue, Mar 31, 2009 at 6:11 PM, Stefan Podkowinski spo...@gmail.com wrote:
 Hello,

 I'd like to invite you to take a look at the recently released first
 beta of Hadoop UI, a graphical Flex/Java based client for Hadoop Core.
 Hadoop UI currently includes an HDFS file explorer and basic job
 tracking features.

 Get it here:
 http://code.google.com/p/hadoop-ui/

 As this is the first release it may (and does) still contain bugs, but
 I'd like to give everyone the chance to send feedback as early as
 possible.
 Give it a try :)

 - Stefan



Re: hadoop-a small doubt

2009-03-30 Thread W
I have already tried mountable HDFS, both the WebDAV and FUSE approaches;
it seems neither of them is
production ready.

CMIIW

Best Regards,
Wildan

---
OpenThink Labs
www.tobethink.com

Aligning IT and Education

 021-99325243
Y! : hawking_123
LinkedIn : http://www.linkedin.com/in/wildanmaulana



On Sun, Mar 29, 2009 at 2:52 PM, Sagar Naik sn...@attributor.com wrote:
 Yes, you can.
 Java client:
 Copy the conf dir (the same as the one on the namenode/datanode), and the Hadoop jars should be
 in the classpath of the client.
 Non-Java client:
 http://wiki.apache.org/hadoop/MountableHDFS



 -Sagar


 deepya wrote:

 Hi,
   I am SreeDeepya, doing an MTech at IIIT. I am working on a project named
 cost-effective and scalable storage server. I configured a small Hadoop cluster
 with only two nodes, one namenode and one datanode. I am new to Hadoop.
 I have a small doubt.

 Can a system not in the Hadoop cluster access the namenode or the
 datanode? If yes, then can you please tell me the necessary
 configuration that has to be done?

 Thanks in advance.

 SreeDeepya
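
A minimal sketch of the Java-client approach Sagar describes above: with the cluster's conf directory and the Hadoop jars on the client's classpath, a machine outside the cluster can talk to the namenode through the FileSystem API. The paths below are hypothetical, not from this thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ExternalHdfsClient
    {
    public static void main(String[] args) throws Exception
        {
        // fs.default.name is picked up from the copied hadoop-site.xml on the classpath
        Configuration conf=new Configuration();
        FileSystem fs=FileSystem.get(conf);

        // list a directory and read back a file; both paths are made up
        for(FileStatus s:fs.listStatus(new Path("/user/deepya")))
            System.out.println(s.getPath());

        FSDataInputStream in=fs.open(new Path("/user/deepya/sample.txt"));
        byte[] buf=new byte[4096];
        int n;
        while((n=in.read(buf))>0)
            System.out.write(buf,0,n);
        in.close();
        fs.close();
        }
    }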




Re: hadoop migration

2009-03-16 Thread W
Thanks for the quick response Aman,

OK, I see the point now.

Currently I'm doing some research on creating a Google Books-like
application using HBase as
a backend for storing the files and Solr as the indexer. From this
prototype, maybe I can measure how fast
HBase is at serving data to the client (Google uses BigTable for
books.google.com, right?).
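
As a rough illustration of that idea (not code from this thread), storing and reading back one scanned page with the HBase 0.20-style client API might look like the sketch below; the table, family and row names are made up.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class BookPageSketch
    {
    public static void main(String[] args) throws Exception
        {
        HBaseConfiguration conf=new HBaseConfiguration();
        HTable table=new HTable(conf,"books");               // hypothetical table

        // store one scanned page under row "book123/page0001"
        byte[] row=Bytes.toBytes("book123/page0001");
        Put put=new Put(row);
        put.add(Bytes.toBytes("content"),Bytes.toBytes("image"),new byte[0] /* page bytes */);
        table.put(put);

        // read it back, e.g. when serving the page to a browser
        Result result=table.get(new Get(row));
        byte[] image=result.getValue(Bytes.toBytes("content"),Bytes.toBytes("image"));
        System.out.println("fetched "+(image==null?0:image.length)+" bytes");
        }
    }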

Thanks!

Regards,
Wildan

On Tue, Mar 17, 2009 at 12:13 PM, Amandeep Khurana ama...@gmail.com wrote:
 Hypertable is not as mature as Hbase yet. The next release of Hbase, 0.20.0,
 includes some patches which reduce the latency of responses and make it
 suitable to be used as a backend for a webapp. However, the current release
 isn't optimized for this purpose.

 The idea behind Hadoop and the rest of the tools around it is more of a data
 processing system than a backend datastore for a website. The output of the
 processing that Hadoop does is typically taken into a MySQL cluster which
 feeds a website.





-- 
---
OpenThink Labs
www.tobethink.com

Aligning IT and Education

 021-99325243
Y! : hawking_123
LinkedIn : http://www.linkedin.com/in/wildanmaulana


single node Hbase

2008-03-17 Thread Peter W.

Hello,

Are there any Hadoop documentation resources showing
how to run the current version of Hbase on a single node?

Thanks,

Peter W.


Lucene reduce

2008-03-15 Thread Peter W.

Hello,

For those interested, you can filter and
search Lucene documents in the reduce.

code:

import java.io.*;
import java.util.*;

import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.Hits;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.*;

public class ql
   {
   /**
   * Query Lucene using keys.
   *
   * input:
   * java^this page is about java
   * ruby^site only mentions rails
   * php^another resource about php
   * java^ejb3 discussed and spring
   * eof^eof
   *
   * make docs,search,mapreduce
   *
   * output:
   * php^topic^another resource about php
   * java^topic^this page is about java
   ***/

   public static class M extends MapReduceBase implements Mapper
      {
      HashMap hm=new HashMap();
      Map group_m=Collections.synchronizedMap(hm);
      String ITEM_KEY,BATCH_KEY="";int batch=0;

      public void map(WritableComparable wc,Writable w,
         OutputCollector out,Reporter rep)throws IOException
         {
         String ln=((Text)w).toString();
         String[] parse_a=ln.split("\\^");

         if(batch>(100-1)) // new lucene document group
            {
            out.collect(new Text(BATCH_KEY),new BytesWritable(ob(group_m)));
            BATCH_KEY="BATCH_"+key_maker(String.valueOf(batch));
            batch=0;group_m.clear();
            }
         else if(parse_a[0].equals("eof")) // flush the last group
            out.collect(new Text(BATCH_KEY),new BytesWritable(ob(group_m)));
         else ;

         // ob(), key_maker() and make_lucene_doc() are helpers from the
         // original (truncated) post: ob() serializes a Map to byte[],
         // key_maker() builds a key string, make_lucene_doc() wraps
         // topic/body/key into a Lucene Document.
         ITEM_KEY="ITEM_"+key_maker(parse_a[0]);
         Document single_d=make_lucene_doc(parse_a[0],parse_a[1],ITEM_KEY);
         group_m.put(ITEM_KEY,single_d);
         batch++;
         }
      }

   public static class R extends MapReduceBase implements Reducer
      {
      public void reduce(WritableComparable wc,Iterator it,
         OutputCollector out,Reporter rep)throws IOException
         {
         while(it.hasNext())
            {
            try
               {
               // bo() is the inverse of ob(): byte[] back to a Map of Documents
               Map m=(Map)bo(((BytesWritable)it.next()).get());
               if(m instanceof Map)
                  {
                  try
                     {
                     // build temp index
                     Directory rd=new RAMDirectory();
                     Analyzer sa=new StandardAnalyzer();
                     IndexWriter iw=new IndexWriter(rd,sa,true);

                     // unwrap, cast, send to mem
                     List keys=new ArrayList(m.keySet());
                     Iterator itr_u=keys.iterator();
                     while(itr_u.hasNext())
                        {
                        Object k_u=itr_u.next();
                        Document dtmp=(Document)m.get(k_u);
                        iw.addDocument(dtmp);
                        }

                     iw.optimize();iw.close();
                     Searcher is=new IndexSearcher(rd);

                     // simple doc filter
                     Iterator itr_s=keys.iterator();
                     while(itr_s.hasNext())
                        {
                        Object k_s=itr_s.next();
                        String tmp_topic=k_s.toString();
                        TermQuery tq_i=new TermQuery(new Term("item",tmp_topic.trim()));

                        // query term from key (text after the last '_')
                        tmp_topic=tmp_topic.substring(tmp_topic.lastIndexOf("_")+1,tmp_topic.length());
                        TermQuery tq_b=new TermQuery(new Term("body",tmp_topic));

                        // search topic with inventory key
                        BooleanQuery bq=new BooleanQuery();
                        bq.add(tq_i,BooleanClause.Occur.MUST);
                        bq.add(tq_b,BooleanClause.Occur.MUST);

                        Hits h=is.search(bq);
                        for(int j=0;j<h.length();j++)
                           {
                           Document doc=h.doc(j);
                           String tmp_tpc=doc.get("topic");
                           String tmp_bdy=doc.get("body");
                           out.collect(wc,new Text(tmp_tpc+"^topic^"+tmp_bdy));
                           }
                        }
                     keys.clear();is.close();
                     }
                  catch(Exception e){System.out.println(e

Re: Searching email list

2008-03-12 Thread Daryl C. W. O'Shea
On 12/03/2008 4:18 PM, Cagdas Gerede wrote:
 Is there an easy way to search this email list?
 I couldn't find any web interface.
 
 Please help.

http://wiki.apache.org/hadoop/MailingListArchives

Daryl



retrieve ObjectWritable

2008-02-25 Thread Peter W.

Hi,

After trying to pass objects to reduce using
ObjectWritable without success, I learned the
class instead sends primitives such as float.

However, you can make it go as an object by
passing it as a byte[] with:

new ObjectWritable(serialize_method(obj))

but it's not easy to retrieve once inside reduce,
because ((ObjectWritable)values.next()).get()
returns an Object, not an array, with no deserialize.

Trying to cast this object to its original form
delivers a 'ClassCastException b[' error
(no ending square bracket) and an empty part file.

How do you retrieve ObjectWritable?
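
One way this is often worked around (a hedged sketch, not from the original thread): since the ObjectWritable was constructed from a byte[], cast the result of get() back to byte[] and run your own deserialization, assuming serialize_method() used standard Java serialization on the map side.

import java.io.ByteArrayInputStream;
import java.io.ObjectInputStream;
import java.util.Iterator;

import org.apache.hadoop.io.ObjectWritable;

public class ObjectWritableRetrieval
    {
    // values is the Iterator handed to reduce() by the framework
    static Object readOne(Iterator values) throws Exception
        {
        // get() returns Object, but it is really the byte[] wrapped on the map side
        byte[] bytes=(byte[])((ObjectWritable)values.next()).get();

        // standard Java deserialization back to the original object
        ObjectInputStream in=new ObjectInputStream(new ByteArrayInputStream(bytes));
        Object original=in.readObject();
        in.close();
        return original;
        }
    }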

Regards,

Peter W.


Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Peter W.

Amazing milestone,

Looks like Y! had approximately 1B documents in the WebMap:

one trillion links = (10k million links / 10 links per page) = 1000 million pages = one billion.


If Google has 10B docs (indexed w/25 MR jobs) then Hadoop has
achieved one-tenth of its scale?


Good stuff,

Peter W.




On Feb 19, 2008, at 9:58 AM, Owen O'Malley wrote:

The link inversion and ranking algorithms for Yahoo Search are now
being generated on Hadoop:

http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html

Some Webmap size data:

* Number of links between pages in the index: roughly 1 trillion links
* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes





Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-07 Thread Peter W.

Howdy,

Your work is outstanding and will hopefully be adopted soon.

The HDFS distributed Lucene index removes many of the
dependencies introduced by achieving this another way, using
RMI, HTTP (serialized objects w/servlets) or Tomcat load balancing
with MySQL databases, schemas and connection pools.

Before this, other mixed options were available where Nutch
obtains documents, HTML and XML parsers extract data, Hadoop
reduces those results and Lucene stores and indexes them.

Something like: get document (Nutch), REST post as XML (Solr), XML to
data (ROME, Abdera), data to map (Hadoop), reduce to tables (Hadoop, HBase),
then reconstruct bytes to a Lucene Document object for indexing.

Obviously, yours is cleaner and more scalable.

I'd want the master also to keep track of (task[id], [comp]leted, [prog]ress)
in ways kind of like tables where you could perform status updates:

+--+--+--+
| id   | comp | prog |
+--+--+--+

Also, maybe the following indexing pipeline...

index clients:
from remote app machine1,machine2,machine3 using hdfs

batch index lucene documents (hundred at a time)
place in single encapsulation object
connect to master
select task id where (completed=0) && (progress=0)
update progress=1
put object (hdfs)

master:
recreate collection from stream (in)
iterate object, cast items to Document
hash document key in the mapper, contents are IM
index Lucene documents in reducer allowing
Text object access for filtering purposes
return indexed # as integer (rpc response)

back on clients:
update progress=0, comp=1 when finished
send master confirmation info with heartbeat

Then add dates and logic for fixing extended race
conditions where (completed=0) && (progress=1) on
the master where clients can resubmit jobs using
confirmed keys received as inventory lists.

To update progress and completed tasks, somehow
check the size of part-files in each labeled out dir
or monitor Hadoop logs in appropriate temp dir.

Run new JobClients accordingly.

Sweet,

Peter W.




On Feb 6, 2008, at 10:59 AM, Ning Li wrote:


There have been several proposals for a Lucene-based distributed index
architecture.
 1) Doug Cutting's Index Server Project Proposal at
http://www.mail-archive.com/[EMAIL PROTECTED]/msg00338.html

 2) Solr's Distributed Search at
http://wiki.apache.org/solr/DistributedSearch
 3) Mark Butler's Distributed Lucene at
http://wiki.apache.org/hadoop/DistributedLucene

We have also been working on a Lucene-based distributed index architecture.
Our design differs from the above proposals in the way it leverages Hadoop
as much as possible. In particular, HDFS is used to reliably store Lucene
instances, Map/Reduce is used to analyze documents and update Lucene instances
in parallel, and Hadoop's IPC framework is used. Our design is geared for
applications that require a highly scalable index and where batch updates
to each Lucene instance are acceptable (versus finer-grained document-at-a-time
updates).

We have a working implementation of our design and are in the process
of evaluating its performance. An overview of our design is provided below.
We welcome feedback and would like to know if you are interested in working
on it. If so, we would be happy to make the code publicly available. At the
same time, we would like to collaborate with people working on existing
proposals and see if we can consolidate our efforts.

TERMINOLOGY
A distributed index is partitioned into shards. Each shard corresponds to
a Lucene instance and contains a disjoint subset of the documents in the index.
Each shard is stored in HDFS and served by one or more shard servers. Here
we only talk about a single distributed index, but in practice multiple indexes
can be supported.

A master keeps track of the shard servers and the shards being served by
them. An application updates and queries the global index through an
them. An application updates and queries the global index through an
index client. An index client communicates with the shard servers to
execute a query.

KEY RPC METHODS
This section lists the key RPC methods in our design. To simplify the
discussion, some of their parameters have been omitted.

  On the Shard Servers
// Execute a query on this shard server's Lucene instance.
// This method is called by an index client.
SearchResults search(Query query);

  On the Master
// Tell the master to update the shards, i.e., Lucene instances.
// This method is called by an index client.
boolean updateShards(Configuration conf);

// Ask the master where the shards are located.
// This method is called by an index client.
LocatedShards getShardLocations();

// Send a heartbeat to the master. This method is called by a
// shard server. In the response, the master informs the
// shard server when to switch to a newer version of the index.
ShardServerCommand sendHeartbeat();
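
To make the method list concrete, here is a rough sketch of the calls above written down as plain Java interfaces. Only the method signatures and comments come from the post; the interface names and the SearchResults/LocatedShards/ShardServerCommand placeholder types are assumptions, with Query and Configuration taken to be the usual Lucene and Hadoop classes.

import org.apache.hadoop.conf.Configuration;
import org.apache.lucene.search.Query;

// Placeholder result types; the real design defines these.
interface SearchResults {}
interface LocatedShards {}
interface ShardServerCommand {}

// On the shard servers.
interface ShardServerProtocol
    {
    // Execute a query on this shard server's Lucene instance.
    // Called by an index client.
    SearchResults search(Query query);
    }

// On the master.
interface MasterProtocol
    {
    // Tell the master to update the shards, i.e., the Lucene instances.
    // Called by an index client.
    boolean updateShards(Configuration conf);

    // Ask the master where the shards are located.
    // Called by an index client.
    LocatedShards getShardLocations();

    // Heartbeat from a shard server; the response tells the shard server
    // when to switch to a newer version of the index.
    ShardServerCommand sendHeartbeat();
    }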

QUERYING THE INDEX
To query the index, an application sends a search request to an index client.
The index

Re: Mahout Machine Learning Project Launches

2008-02-06 Thread Peter W.

Hello,

This Mahout project seems very interesting.

Any problem that has reducible components
using MapReduce and can then be described as a
linear equation would be an excellent candidate.

Most Nutch developers probably don't need HMM
but instead the power method to iterate over
Markov chains or Perron-Frobenius.

However, some of that work as it pertains to
the web has been patented so it would be more
productive for the Hadoop community to focus
on other areas such as adjacency matrices,
SALSA or bipartite graphs using Hbase.
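
As a tiny illustration of the power method mentioned above: repeatedly multiply a rank vector by a column-stochastic link matrix until it settles on the dominant eigenvector; at web scale each multiplication pass is what you would express as a MapReduce job. The 3-page matrix here is made up purely for demonstration.

public class PowerMethodSketch
    {
    public static void main(String[] args)
        {
        // P[i][j] = probability of following a link from page j to page i; each column sums to 1
        double[][] P={{0.0,0.5,0.3},
                      {0.7,0.0,0.7},
                      {0.3,0.5,0.0}};
        double[] rank={1.0/3,1.0/3,1.0/3};

        for(int iter=0;iter<50;iter++)
            {
            double[] next=new double[rank.length];
            for(int i=0;i<P.length;i++)
                for(int j=0;j<P[i].length;j++)
                    next[i]+=P[i][j]*rank[j];
            rank=next;
            }

        for(double r:rank)System.out.println(r);
        }
    }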

Bye,

Peter W.


On Feb 2, 2008, at 3:43 AM, edward yoon wrote:

I thought of Hidden Markov Models (HMMs) as absolutely impossible on the
MR model.

If anyone has some information, please let me know.

Thanks.

On 2/2/08, edward yoon [EMAIL PROTECTED] wrote:

I read an interesting piece of information in that NIPS paper, and I
implemented it, but ...

Now, there are too many mailing lists for me to read:
Lucene, Core, Hbase, Pig, Solr, Mahout ... :(

Too distributed.

On 2/2/08, gopi [EMAIL PROTECTED] wrote:
I'm definitely excited about machine learning algorithms being implemented
in this project!
I'm currently a student studying machine learning, and would love to help
out in every possible manner.

Thanks
Chaitanya Sharma

On Jan 25, 2008 5:55 PM, Grant Ingersoll [EMAIL PROTECTED]  
wrote:



(Apologies for cross-posting)

The Lucene PMC is pleased to announce the creation of the Mahout
Machine Learning project, located at http://lucene.apache.org/mahout.

Mahout's goal is to create a suite of practical, scalable machine
learning libraries.  Our initial plan is to utilize Hadoop
(http://hadoop.apache.org) to implement a variety of algorithms including
naive bayes, neural networks, support vector machines and k-Means, among
others.  While
our initial focus is on these algorithms, we welcome other machine
learning ideas as well.

Naturally, we are looking for volunteers to help grow the community
and make the project successful.  So, if machine learning is your
thing, come on over and lend a hand!

Cheers,
Grant Ingersoll

http://lucene.apache.org/mahout


Re: Hadoop future?

2008-02-01 Thread Peter W.

Lukas,

I would like Hadoop and Cygwin to now come standard with
each edition of Windows Vista Business. But that's just me.

Bye,

Peter W.

Lukas Vlcek wrote:


...
Does anybody have any idea how this could specifically impact
Hadoop's future?

I know this is all speculation now...