Hadoop as Cloud Storage

2009-06-16 Thread W
Dear Hadoop Guru's,

After some googling I found information on using Hadoop as (long-term)
cloud storage.
I have to maintain a lot of data (around 50 TB), much of it
TV commercials (video files).

I know the best solution for long-term file archiving is tape
backup, but I'm just curious: can Hadoop
be used as a 'data archiving' platform?

Thanks!

Warm Regards,
Wildan
---
OpenThink Labs
http://openthink-labs.tobethink.com/

Making IT, Business and Education in Harmony

>> 087884599249

Y! : hawking_123
LinkedIn : http://www.linkedin.com/in/wildanmaulana


Re: ANN: Hadoop UI beta

2009-03-31 Thread W
+1 wow .., looks fantastic ... :)

The summary says it works only with 0.19. Just curious, does it
work with Hadoop trunk?

Thanks!

Best Regards,
Wildan

---
OpenThink Labs
www.tobethink.com

Aligning IT and Education

>> 021-99325243
Y! : hawking_123
LinkedIn : http://www.linkedin.com/in/wildanmaulana



On Tue, Mar 31, 2009 at 6:11 PM, Stefan Podkowinski  wrote:
> Hello,
>
> I'd like to invite you to take a look at the recently released first
> beta of Hadoop UI, a graphical Flex/Java based client for Hadoop Core.
> Hadoop UI currently includes an HDFS file explorer and basic job
> tracking features.
>
> Get it here:
> http://code.google.com/p/hadoop-ui/
>
> As this is the first release it may (and does) still contain bugs, but
> I'd like to give everyone the chance to send feedback as early as
> possible.
> Give it a try :)
>
> - Stefan
>


Re: hadoop-a small doubt

2009-03-30 Thread W
I already tried mountable HDFS, both the WebDAV and the FUSE approach; it
seems neither of them is
production ready ..

CMIIW

Best Regards,
Wildan

---
OpenThink Labs
www.tobethink.com

Aligning IT and Education

>> 021-99325243
Y! : hawking_123
LinkedIn : http://www.linkedin.com/in/wildanmaulana



On Sun, Mar 29, 2009 at 2:52 PM, Sagar Naik  wrote:
> Yes, you can.
> Java client:
> Copy the conf dir (the same as on the namenode/datanode) and the Hadoop jars should be
> in the classpath of the client.
> Non Java Client :
> http://wiki.apache.org/hadoop/MountableHDFS
>
>
>
> -Sagar
>
> deepya wrote:
>>
>> Hi,
>>   I am SreeDeepya, doing an MTech at IIIT. I am working on a project named "cost
>> effective and scalable storage server". I configured a small Hadoop cluster
>> with only two nodes, one namenode and one datanode. I am new to Hadoop.
>> I have a small doubt.
>>
>> Can a system not in the Hadoop cluster access the namenode or the
>> datanode? If yes, then can you please tell me the necessary
>> configurations
>> that have to be done.
>>
>> Thanks in advance.
>>
>> SreeDeepya
>>
>
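To illustrate Sagar's Java-client suggestion: a machine outside the cluster
only needs the Hadoop jars on its classpath plus a Configuration pointing at
the namenode (either by putting the cluster's conf dir on the classpath or by
setting fs.default.name directly). A minimal sketch; the namenode address and
the file path below are made up:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ExternalHdfsClient
   {
   public static void main(String[] args)throws Exception
  {
  Configuration conf=new Configuration();
  // only needed if the cluster's conf dir is not on the classpath
  conf.set("fs.default.name","hdfs://namenode.example.com:9000");

  FileSystem fs=FileSystem.get(conf);

  // list the root directory and read one file, just to prove connectivity
  for(FileStatus s:fs.listStatus(new Path("/")))
 System.out.println(s.getPath());

  BufferedReader r=new BufferedReader(
 new InputStreamReader(fs.open(new Path("/tmp/hello.txt"))));
  System.out.println(r.readLine());
  r.close();
  fs.close();
  }
   }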


Re: hadoop migration

2009-03-16 Thread W
Thanks for the quick response, Aman.

OK, I see the point now.

Currently I'm doing some research on creating a Google Books-like
application using HBase as
the backend for storing the files and Solr as the indexer. From this
prototype, maybe I can measure how fast
HBase is at serving data to clients ... (Google uses BigTable for
books.google.com, right?)
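
For reference, storing and fetching a page from a Java client might look
roughly like this. This is only a sketch against the newer HBase client API
(Put/Get); the 0.19/0.20-era API used BatchUpdate instead, and the table and
column names here are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class BookPageStore
   {
   public static void main(String[] args)throws Exception
  {
  Configuration conf=HBaseConfiguration.create();
  HTable table=new HTable(conf,"book_pages"); // hypothetical table

  // store one scanned page: row key = bookId/pageNo
  Put put=new Put(Bytes.toBytes("book123/0001"));
  put.add(Bytes.toBytes("content"),Bytes.toBytes("raw"),
 Bytes.toBytes("...page bytes..."));
  put.add(Bytes.toBytes("content"),Bytes.toBytes("text"),
 Bytes.toBytes("extracted text, also sent to Solr"));
  table.put(put);

  // fetch it back when serving a reader
  Result row=table.get(new Get(Bytes.toBytes("book123/0001")));
  byte[] text=row.getValue(Bytes.toBytes("content"),Bytes.toBytes("text"));
  System.out.println(Bytes.toString(text));

  table.close();
  }
   }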

Thanks!

Regards,
Wildan

On Tue, Mar 17, 2009 at 12:13 PM, Amandeep Khurana  wrote:
> Hypertable is not as mature as HBase yet. The next release of HBase, 0.20.0,
> includes some patches which reduce the latency of responses and make it
> suitable to be used as a backend for a webapp. However, the current release
> isn't optimized for this purpose.
>
> The idea behind Hadoop and the rest of the tools around it is more of a data
> processing system than a backend datastore for a website. The output of the
> processing that Hadoop does is typically taken into a MySQL cluster which
> feeds a website.
>
>
>
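
On the "Hadoop output into a MySQL cluster" pattern Amandeep mentions: the old
mapred API ships a DBOutputFormat (org.apache.hadoop.mapred.lib.db) that can
write reduce output straight into a JDBC table. A rough sketch, with the
table, columns and JDBC URL invented purely for illustration:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBOutputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

// one row of reduce output; DBOutputFormat writes the *key* and ignores the value
public class PageCountRecord implements Writable,DBWritable
   {
   String url;long hits;

   public void write(PreparedStatement st)throws SQLException
  {st.setString(1,url);st.setLong(2,hits);}
   public void readFields(ResultSet rs)throws SQLException
  {url=rs.getString(1);hits=rs.getLong(2);}

   public void write(DataOutput out)throws IOException
  {out.writeUTF(url);out.writeLong(hits);}
   public void readFields(DataInput in)throws IOException
  {url=in.readUTF();hits=in.readLong();}

   // job wiring (fragment)
   public static void configure(JobConf job)
  {
  DBConfiguration.configureDB(job,"com.mysql.jdbc.Driver",
 "jdbc:mysql://db.example.com/webstats","user","password");
  DBOutputFormat.setOutput(job,"page_counts","url","hits"); // table, then columns
  job.setOutputFormat(DBOutputFormat.class);
  }
   }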


-- 
---
OpenThink Labs
www.tobethink.com

Aligning IT and Education

>> 021-99325243
Y! : hawking_123
LinkedIn : http://www.linkedin.com/in/wildanmaulana


Re: hadoop migration

2009-03-16 Thread W
> Of course, There is a storage solution called HBase for Hadoop. But,
> In my experience, not applicable for online data access yet.
>

I see. How about Hypertable? Is it mature enough to be used in
production? I read that
Hypertable can be integrated with Hadoop. Or is there any
alternative other than HBase?

Thanks!

Regards,
Wildan

-- 
---
OpenThink Labs
www.tobethink.com

Aligning IT and Education

>> 021-99325243
Y! : hawking_123
LinkedIn : http://www.linkedin.com/in/wildanmaulana


single node Hbase

2008-03-17 Thread Peter W.

Hello,

Are there any Hadoop documentation resources showing
how to run the current version of Hbase on a single node?

Thanks,

Peter W.


Lucene reduce

2008-03-15 Thread Peter W .

Hello,

For those interested, you can filter and
search Lucene documents in the reduce.

code:

import java.io.*;
import java.util.*;

import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.Hits;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.*;

public class ql
   {
   /**
   * Query Lucene using keys.
   *
   * input:
   * java^this page is about java
   * ruby^site only mentions rails
   * php^another resource about php
   * java^ejb3 discussed and spring
   * eof^eof
   *
   * make docs,search,mapreduce
   *
   * output:
   * php^topic^another resource about php
   * java^topic^this page is about java
   ***/

   public static class M extends MapReduceBase implements Mapper
  {
  HashMap hm=new HashMap();
  Map group_m=Collections.synchronizedMap(hm);
  String ITEM_KEY,BATCH_KEY="";int batch=0;

  public void map(WritableComparable wc,Writable w,
 OutputCollector out,Reporter rep)throws IOException
 {
 String ln=((Text)w).toString();
 String[] parse_a=ln.split("\\^");

 if(batch>(100-1)) // new lucene document group
    {
    out.collect(new Text(BATCH_KEY),new BytesWritable(ob(group_m)));
    BATCH_KEY="BATCH_"+key_maker(String.valueOf(batch));
    batch=0;group_m.clear();
    }
 else if(parse_a[0].equals("eof"))
    out.collect(new Text(BATCH_KEY),new BytesWritable(ob(group_m)));

 ITEM_KEY="ITEM_"+key_maker(parse_a[0]);
 Document single_d=make_lucene_doc(parse_a[0],parse_a[1],ITEM_KEY);
 group_m.put(ITEM_KEY,single_d);
 batch++;
 }
  }

   public static class R extends MapReduceBase implements Reducer
  {
  public void reduce(WritableComparable wc,Iterator it,
 OutputCollector out,Reporter rep)throws IOException
 {
 while(it.hasNext())
{
try
   {
   Map m=(Map)bo(((BytesWritable)it.next()).get());
   if(m instanceof Map)
  {
  try
 {
 // build temp index
 Directory rd=new RAMDirectory();
 Analyzer sa=new StandardAnalyzer();
 IndexWriter iw=new IndexWriter(rd,sa,true);

 // unwrap,cast,send to mem
 List keys=new ArrayList(m.keySet());
 Iterator itr_u=keys.iterator();
 while(itr_u.hasNext())
{
Object k_u=itr_u.next();
Document dtmp=(Document)m.get(k_u);
iw.addDocument(dtmp);
}

 iw.optimize();iw.close();
 Searcher is=new IndexSearcher(rd);

 // simple doc filter
 Iterator itr_s=keys.iterator();
 while(itr_s.hasNext())
{
Object k_s=itr_s.next();
String tmp_topic=k_s.toString();
	TermQuery tq_i=new TermQuery(new Term("item",tmp_topic.trim()));

	// query term from key
	tmp_topic=tmp_topic.substring(tmp_topic.lastIndexOf("_")+1,tmp_topic.length());
	TermQuery tq_b=new TermQuery(new Term("body",tmp_topic));

	// search topic with inventory key
	BooleanQuery bq=new BooleanQuery();
	bq.add(tq_i,BooleanClause.Occur.MUST);
	bq.add(tq_b,BooleanClause.Occur.MUST);

	Hits h=is.search(bq);

	// emit matching docs as topic/body pairs, per the sample output in
	// the header comment (this loop body is an assumed completion; the
	// archived message is cut off at this point)
	for(int j=0;j<h.length();j++)
	   {
	   Document hd=h.doc(j);
	   out.collect(new Text(hd.get("topic")),
	      new Text("topic^"+hd.get("body")));
	   }
	} // end while(itr_s)
     } // end inner try
  catch(Exception e){rep.setStatus("index error: "+e);}
   } // end if
    } // end outer try
 catch(Exception e){rep.setStatus("reduce error: "+e);}
} // end while(it)
 } // end reduce
  } // end class R

   private static Document make_lucene_doc(String in_tpc,String in_bdy,String in_itm)
  {
  Document d=new Document();
  d.add(new Field("topic",in_tpc,Field.Store.YES,Field.Index.TOKENIZED));
  d.add(new Field("item",in_itm,Field.Store.NO,Field.Index.UN_TOKENIZED));
  d.add(new Field("body",in_bdy,Field.Store.YES,Field.Index.TOKENIZED));
  return d;
  }
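The ob(), bo() and key_maker() helpers used above, along with the JobConf
driver and the closing of class ql, are not part of the archived message. A
minimal sketch of those helpers, assuming plain java.io object serialization
(these are illustrative stand-ins, not the original code):

   // hypothetical versions of the elided helpers
   private static byte[] ob(Object o)throws IOException
  {
  ByteArrayOutputStream baos=new ByteArrayOutputStream();
  ObjectOutputStream oos=new ObjectOutputStream(baos);
  oos.writeObject(o);oos.close();
  return baos.toByteArray();
  }

   private static Object bo(byte[] b)throws IOException
  {
  try
 {
 ObjectInputStream ois=new ObjectInputStream(new ByteArrayInputStream(b));
 return ois.readObject();
 }
  catch(ClassNotFoundException e){throw new IOException(e.toString());}
  }

   // key normalization; the real version may differ
   private static String key_maker(String in)
  {
  return in.trim();
  }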

Re: retrieve ObjectWritable

2008-03-15 Thread Peter W.

Hi,

If you have a requirement to pass an object or collection
you can replace ObjectWritable with BytesWritable:

test code...

import java.io.*;
import java.util.*;

import org.apache.hadoop.io.BytesWritable;

public class bao
   {
   public static void main(String args[])
  {
  HashMap hm=new HashMap();
  Map bm=Collections.synchronizedMap(hm);
  bm.put("123",new Integer(123));
  bm.put("456",new Integer(456));

  try
 {
 BytesWritable bw=new BytesWritable(serialize_method(bm));  
 Map m=(Map)deserialize_method(bw.get());
 System.out.println("MAP_SIZE: "+ m.size()); // works
 }

  catch(Exception e)
 {
 System.out.println(e);
 }
  }

// static serialize,deserialize...
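
// a sketch of the omitted helpers (assumed here, using plain java.io
// object serialization; not part of the original post):

   private static byte[] serialize_method(Object o)throws Exception
  {
  ByteArrayOutputStream baos=new ByteArrayOutputStream();
  ObjectOutputStream oos=new ObjectOutputStream(baos);
  oos.writeObject(o);oos.close();
  return baos.toByteArray();
  }

   private static Object deserialize_method(byte[] b)throws Exception
  {
  ObjectInputStream ois=new ObjectInputStream(new ByteArrayInputStream(b));
  return ois.readObject();
  }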

/***

stuff like this can be sent to reduce,
but can't recast from ObjectWritable:

   private static MapWritable mw(Map m)
  {
  MapWritable rmw=new MapWritable();
  List l=new ArrayList(m.keySet());Iterator i=l.iterator();
  while(i.hasNext()){Object k=i.next();rmw.put(new Text(k.toString()),
 new ObjectWritable(serialize_method((Object)m.get(k))));}
  return rmw;
  }

***/

   }

Good Luck,

Peter W.




Peter W. wrote:


Hi,

After trying to pass objects to reduce using
ObjectWritable without success I learned the
class instead sends primitives such as float.

However, you can make it go as an object by
passing it as a byte[] with:

new ObjectWritable(serialize_method(obj))

but it's not easy to retrieve once inside reduce,
because ((ObjectWritable)values.next()).get()
returns an object, not an array, so there is no deserialize step.

Trying to cast this object to its original form
delivers a 'ClassCastException b[' error
(no ending square bracket) and an empty part file.

How do you retrieve ObjectWritable?

Regards,

Peter W.




Re: Searching email list

2008-03-12 Thread Daryl C. W. O'Shea
On 12/03/2008 4:18 PM, Cagdas Gerede wrote:
> Is there an easy way to search this email list?
> I couldn't find any web interface.
> 
> Please help.

http://wiki.apache.org/hadoop/MailingListArchives

Daryl



retrieve ObjectWritable

2008-02-25 Thread Peter W.

Hi,

After trying to pass objects to reduce using
ObjectWritable without success I learned the
class instead sends primitives such as float.

However, you can make it go as an object by
passing it as a byte[] with:

new ObjectWritable(serialize_method(obj))

but it's not easy to retrieve once inside reduce,
because ((ObjectWritable)values.next()).get()
returns an object, not an array, so there is no deserialize step.

Trying to cast this object to its original form
delivers a 'ClassCastException b[' error
(no ending square bracket) and an empty part file.

How do you retrieve ObjectWritable?

Regards,

Peter W.


Re: Yahoo's production webmap is now on Hadoop

2008-02-20 Thread Peter W .

Doug,

Correction duly noted. :)

Keep up the good work and congratulations on the progress
and accomplishments of the Hadoop project.

Kind Regards,

Peter W.




On Feb 19, 2008, at 2:39 PM, Doug Cutting wrote:


Peter W. wrote:
one trillion links=(10k million links/10 links per page)=1000  
million pages=one billion.


In English, a trillion usually means 10^12, not 10^10.

http://en.wikipedia.org/wiki/Trillion

Doug




Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Peter W.

Guys,

Thanks for the clarification and math explanations.

Such a number would then likely be 100x my original
estimate given that the web may have doubled for each
year since that blog post and is growing exponentially.

Index size was only a byproduct of trying to discern the
significance of 1 trillion links in an inverted web graph.

Hadoop has certainly arrived and become a valuable software
asset likely to power next-generation Internet computing.

Thanks again,

Peter W.


On Feb 19, 2008, at 5:33 PM, Eric Baldeschwieler wrote:

Search engine index size comparison is actually a very inexact
science.  Various 3rd parties comparing the major search engines
do not come to the same conclusions.  But ours is certainly world
class and well over the discussed sizes.


Here is an interesting bit of web history...  A blog from AUGUST  
08, 2005 discussing our index of over 19.2 billion web documents.   
It has only grown since then.


http://www.ysearchblog.com/archives/000172.html


On Feb 19, 2008, at 2:38 PM, Ted Dunning wrote:




Sorry to be picky about the math, but 1 trillion = 10^12 = a million
million.
At 10 links per page, this gives 100 x 10^9 pages, not 1 x 10^9.
At 100 links per page, this gives 10B pages.


On 2/19/08 2:25 PM, "Peter W." <[EMAIL PROTECTED]> wrote:


Amazing milestone,

Looks like Y! had approximately 1B documents in the WebMap:

one trillion links=(10k million links/10 links per page)=1000 million
pages=one billion.

If Google has 10B docs (indexed w/25 MR jobs) then Hadoop has
achieved one-tenth of its scale?

Good stuff,

Peter W.




On Feb 19, 2008, at 9:58 AM, Owen O'Malley wrote:


The link inversion and ranking algorithms for Yahoo Search are now
being generated on Hadoop:

http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-
largest-production-hadoop.html

Some Webmap size data:

* Number of links between pages in the index: roughly 1
trillion links
* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over  
10,000

* Raw disk used in the production cluster: over 5 Petabytes











Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Peter W.

Amazing milestone,

Looks like Y! had approximately 1B documents in the WebMap:

one trillion links=(10k million links/10 links per page)=1000 million  
pages=one billion.


If Google has 10B docs (indexed w/25 MR jobs) then Hadoop has
achieved one-tenth of its scale?


Good stuff,

Peter W.




On Feb 19, 2008, at 9:58 AM, Owen O'Malley wrote:

The link inversion and ranking algorithms for Yahoo Search are now  
being generated on Hadoop:


http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds- 
largest-production-hadoop.html


Some Webmap size data:

* Number of links between pages in the index: roughly 1  
trillion links

* Size of output: over 300 TB, compressed!
* Number of cores used to run a single Map-Reduce job: over 10,000
* Raw disk used in the production cluster: over 5 Petabytes





Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-07 Thread Peter W.

Howdy,

Your work is outstanding and will hopefully be adopted soon.

The HDFS-distributed Lucene index avoids many of the
dependencies introduced by achieving this another way, using
RMI, HTTP (serialized objects w/servlets) or Tomcat load balancing
with MySQL databases, schemas and connection pools.

Before this, other mixed options were available where Nutch
obtains documents, HTML and XML parsers extract data, Hadoop
reduces those results, and Lucene stores and indexes them.

Something like get document(Nutch), REST post as XML(Solr), XML to
data(ROME,Abdera), data to map(Hadoop), reduce to tables(Hadoop,HBase)
then reconstruct bytes to Lucene Document object for indexing.

Obviously, yours is cleaner and more scalable.

I'd also want the master to keep track of (task[id], [comp]leted, [prog]ress)
in ways kind of like tables where you could perform status updates:

+--+--+--+
| id   | comp | prog |
+--+--+--+

Also, maybe the following indexing pipeline...

index clients:
from remote app machine1,machine2,machine3 using hdfs

batch index lucene documents (hundred at a time)
place in single encapsulation object
connect to master
select task id where (completed=0) && (progress=0)
update progress=1
put object (hdfs)

master:
recreate collection from stream (in)
iterate object, cast items to Document
hash document key in the mapper, contents are IM
index Lucene documents in reducer allowing
Text object access for filtering purposes
return indexed # as integer (rpc response)

back on clients:
update progress=0,comp=1 when finished
send master confirmation info with heartbeat

Then add dates and logic for fixing extended race
conditions where (completed=0) && (progress=1) on
the master where clients can resubmit jobs using
confirmed keys received as inventory lists.

To update progress and completed tasks, somehow
check the size of part-files in each labeled out dir
or monitor Hadoop logs in appropriate temp dir.

Run new JobClients accordingly.

Sweet,

Peter W.




On Feb 6, 2008, at 10:59 AM, Ning Li wrote:


There have been several proposals for a Lucene-based distributed index
architecture.
 1) Doug Cutting's "Index Server Project Proposal" at
http://www.mail-archive.com/[EMAIL PROTECTED]/ 
msg00338.html

 2) Solr's "Distributed Search" at
http://wiki.apache.org/solr/DistributedSearch
 3) Mark Butler's "Distributed Lucene" at
http://wiki.apache.org/hadoop/DistributedLucene

We have also been working on a Lucene-based distributed index  
architecture.
Our design differs from the above proposals in the way it leverages  
Hadoop
as much as possible. In particular, HDFS is used to reliably store  
Lucene
instances, Map/Reduce is used to analyze documents and update  
Lucene instances
in parallel, and Hadoop's IPC framework is used. Our design is  
geared for
applications that require a highly scalable index and where batch
updates
to each Lucene instance are acceptable (versus finer-grained
document-at-a-time updates).

We have a working implementation of our design and are in the process
of evaluating its performance. An overview of our design is  
provided below.
We welcome feedback and would like to know if you are interested in  
working
on it. If so, we would be happy to make the code publicly  
available. At the
same time, we would like to collaborate with people working on  
existing

proposals and see if we can consolidate our efforts.

TERMINOLOGY
A distributed "index" is partitioned into "shards". Each shard  
corresponds to
a Lucene instance and contains a disjoint subset of the documents  
in the index.
Each shard is stored in HDFS and served by one or more "shard  
servers". Here
we only talk about a single distributed index, but in practice  
multiple indexes

can be supported.

A "master" keeps track of the shard servers and the shards being  
served by

them. An "application" updates and queries the global index through an
"index client". An index client communicates with the shard servers to
execute a query.

KEY RPC METHODS
This section lists the key RPC methods in our design. To simplify the
discussion, some of their parameters have been omitted.

  On the Shard Servers
// Execute a query on this shard server's Lucene instance.
// This method is called by an index client.
SearchResults search(Query query);

  On the Master
// Tell the master to update the shards, i.e., Lucene instances.
// This method is called by an index client.
boolean updateShards(Configuration conf);

// Ask the master where the shards are located.
// This method is called by an index client.
LocatedShards getShardLocations();

// Send a heartbeat to the master. This method is called by a
// shard server. In the response, the master informs the
// shard server when to switch to a newer version o
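
Pulled together, that RPC surface could be sketched as Hadoop IPC interfaces
along these lines. This is an illustrative sketch only; the placeholder result
types, the heartbeat parameter and the version IDs are assumptions, not part
of the proposal:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.VersionedProtocol;
import org.apache.lucene.search.Query;

public class DistributedIndexProtocols
   {
   // placeholder holder types; the real ones would be Writable
   public static class SearchResults {}
   public static class LocatedShards {}
   public static class HeartbeatResponse {}

   // implemented by each shard server, called by index clients
   public interface ShardServerProtocol extends VersionedProtocol
  {
  long versionID=1L;
  // Execute a query on this shard server's Lucene instance.
  SearchResults search(Query query)throws IOException;
  }

   // implemented by the master, called by index clients and shard servers
   public interface MasterProtocol extends VersionedProtocol
  {
  long versionID=1L;
  // Tell the master to update the shards, i.e. the Lucene instances.
  boolean updateShards(Configuration conf)throws IOException;
  // Ask the master where the shards are located.
  LocatedShards getShardLocations()throws IOException;
  // Shard server heartbeat; the response tells the server when to
  // switch to a newer index version.
  HeartbeatResponse heartbeat(String shardServerId)throws IOException;
  }
   }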

Re: pig user meeting, Friday, February 8, 2008

2008-02-06 Thread Peter W.

Hi,

Can a follow-up meeting be scheduled at the
same time of the Hadoop summit on March 25th?

As mapreduce becomes more prominent in corporate
environments the benefits of Pig as a Sawzall
alternative become obvious to that audience.

Also, please take a Flickr photo!

Later,

Peter W.


Otis Gospodnetic wrote:


...
Is anyone going to be capturing the Piglet meeting on video for the  
those of us living in other corners of the planet?


Thank you,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 

From: Stefan Groschupf <[EMAIL PROTECTED]>
Hi there,

a couple of people plan to meet and talk about apache pig next  
Friday in the Mountain View area.

(Event location is not yet sure).
If you are interested please RSVP asap, so we can plan what kind  
of location size we looking for.


http://upcoming.yahoo.com/event/420958/

Cheers,
Stefan


~~~
101tec Inc.
Menlo Park, California, USA
http://www.101tec.com


Re: Mahout Machine Learning Project Launches

2008-02-06 Thread Peter W.

Hello,

This Mahout project seems very interesting.

Any problem whose components are reducible with
MapReduce and can then be described as a
linear equation would be an excellent candidate.

Most Nutch developers probably don't need HMMs,
but rather the power method to iterate over
Markov chains, or Perron-Frobenius.

However, some of that work as it pertains to
the web has been patented so it would be more
productive for the Hadoop community to focus
on other areas such as adjacency matrices,
SALSA or bipartite graphs using Hbase.
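
As a toy illustration of the power method: repeatedly multiplying a
column-stochastic link matrix against a rank vector converges to its dominant
(Perron-Frobenius) eigenvector. A self-contained, in-memory sketch with a
made-up three-page graph (the real thing would of course be a MapReduce job):

public class PowerMethod
   {
   public static void main(String[] args)
  {
  // column-stochastic transition matrix for 3 pages:
  // column j holds page j's outlinks, each weighted 1/outdegree(j)
  double[][] m={
 {0.0,0.5,0.5},
 {1.0,0.0,0.5},
 {0.0,0.5,0.0}};

  double[] rank={1.0/3,1.0/3,1.0/3}; // start from a uniform vector

  for(int iter=0;iter<50;iter++)
 {
 double[] next=new double[rank.length];
 for(int i=0;i<m.length;i++)
for(int j=0;j<m[i].length;j++)
   next[i]+=m[i][j]*rank[j];
 rank=next; // repeated multiplication converges to the dominant eigenvector
 }

  for(int i=0;i<rank.length;i++)
 System.out.println("page "+i+" rank: "+rank[i]);
  }
   }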

Bye,

Peter W.


On Feb 2, 2008, at 3:43 AM, edward yoon wrote:

I thought of Hidden Markov Models (HMMs) as absolutely impossible on
the MR model.

If anyone have some information, please let me know.

Thanks.

On 2/2/08, edward yoon <[EMAIL PROTECTED]> wrote:

I read an interesting piece of information in that NIPS paper, and I
was implementing it, but ...

Now there are too many mailing lists for me to read:
Lucene, Core, Hbase, Pig, Solr, Mahout . :(

Too distributed.

On 2/2/08, gopi <[EMAIL PROTECTED]> wrote:
I'm definitely excited about Machine Learning Algorithms being  
implemented

into this project!
I'm currently a student studying Machine Learning, and would
love to help
out in any way possible.

Thanks
Chaitanya Sharma

On Jan 25, 2008 5:55 PM, Grant Ingersoll <[EMAIL PROTECTED]>  
wrote:



(Apologies for cross-posting)

The Lucene PMC is pleased to announce the creation of the Mahout
Machine Learning project, located at http://lucene.apache.org/ 
mahout.

Mahout's goal is to create a suite of practical, scalable machine
learning libraries.  Our initial plan is to utilize Hadoop (
http://hadoop.apache.org
) to implement a variety of algorithms including naive bayes,  
neural

networks, support vector machines and k-Means, among others.  While
our initial focus is on these algorithms, we welcome other machine
learning ideas as well.

Naturally, we are looking for volunteers to help grow the community
and make the project successful.  So, if machine learning is your
thing, come on over and lend a hand!

Cheers,
Grant Ingersoll

http://lucene.apache.org/mahout


Re: graph data representation for mapreduce

2008-02-01 Thread Peter W .

Cam,

Making a directed graph in Hadoop is not
very difficult, but traversing it live might be,
since the result is a separate file.

Basically, you kick out a destination node
as your key in the mapper and the from-nodes as
intermediate values. Concatenate the from-values in
the reducer, assigning weights to each edge.

Assigned edge scores come from a computation
done in the reducer or a number passed by key.

This gives a simple but weighted from/to
depiction that can be experimented with and
improved by subsequent passes, or by REST-style
calls in the mapper for MySQL-stored weights.
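
A bare-bones sketch of that mapper/reducer against the old mapred API,
assuming one "from to" edge per input line and a constant weight of 1 per
edge purely for illustration:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class EdgeList
   {
   // emit the destination node as key, the source node as intermediate value
   public static class M extends MapReduceBase
  implements Mapper<LongWritable,Text,Text,Text>
  {
  public void map(LongWritable off,Text line,
 OutputCollector<Text,Text> out,Reporter rep)throws IOException
 {
 String[] edge=line.toString().split("\\s+"); // "from to"
 if(edge.length==2)out.collect(new Text(edge[1]),new Text(edge[0]));
 }
  }

   // concatenate the from-nodes, attaching a weight to each edge
   public static class R extends MapReduceBase
  implements Reducer<Text,Text,Text,Text>
  {
  public void reduce(Text to,Iterator<Text> froms,
 OutputCollector<Text,Text> out,Reporter rep)throws IOException
 {
 StringBuilder sb=new StringBuilder();
 while(froms.hasNext())
sb.append(froms.next().toString()).append(":1 "); // from:weight
 out.collect(to,new Text(sb.toString().trim()));
 }
  }
   }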

Later,

Peter W.

Cam Bazz wrote:


Hello,

I have been long interested in storing graphs in databases, object
databases and Lucene-like indexes.


Has anyone done any work on storing and processing graphs with map  
reduce?
If I were to start, where would I start from? I am interested in  
finding

shortest paths in a large graph.




Re: Hadoop future?

2008-02-01 Thread Peter W.

Lukas,

I would like Hadoop and Cygwin to now come standard with
each edition of Windows Vista Business. But that's just me.

Bye,

Peter W.

Lukas Vlcek wrote:


...
Does anybody have any idea how this could specifically impact
Hadoop's future?

I know this is all just speculation for now...