Re: Hadoop and Matlab

2008-12-12 Thread Edward J. Yoon
Just FYI, see Hama (http://incubator.apache.org/hama/).

We are working on a parallel math project using Hadoop (we got a positive
answer from the ScaLAPACK/Matlab people).

On Sat, Dec 13, 2008 at 12:39 PM, Dmitry Pushkarev  wrote:
> Hi.
>
> Can anyone share experience of successfully parallelizing Matlab tasks
> using Hadoop?
>
> We have implemented this with Python (in the form of a simple module that
> takes a serialized function and a data array and runs the function on the
> cluster), but we really have no clue how to do that in Matlab.
>
> Ideally we want to use Matlab in the same way: write an .m file that takes
> a set of parameters and returns some value, specify a list of input
> parameters (like lists of variables to try for Gaussian kernels), and run
> it on the cluster in a somewhat failproof manner - that's the ideal
> situation.
>
> Has anyone tried that?
>
> ---
>
> Dmitry
>



-- 
Best Regards, Edward J. Yoon @ NHN, corp.
edwardy...@apache.org
http://blog.udanax.org


Hadoop and Matlab

2008-12-12 Thread Dmitry Pushkarev
Hi.

Can anyone share experience of successfully parallelizing Matlab tasks using
Hadoop?

We have implemented this with Python (in the form of a simple module that
takes a serialized function and a data array and runs the function on the
cluster), but we really have no clue how to do that in Matlab.

Ideally we want to use Matlab in the same way: write an .m file that takes a
set of parameters and returns some value, specify a list of input parameters
(like lists of variables to try for Gaussian kernels), and run it on the
cluster in a somewhat failproof manner - that's the ideal situation.

Has anyone tried that?

 

---

Dmitry
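
One possible approach (a rough sketch, not something proposed in the thread):
drive the parameter sweep with a mapper that shells out to MATLAB in batch
mode, one parameter set per input line. The run_model.m function and the
MATLAB flags below are illustrative assumptions; a Hadoop Streaming wrapper
script would work just as well. Hadoop's task retries then provide some of
the "failproof" behavior asked for.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MatlabParamSweepMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Each input line is one parameter set, e.g. "sigma=0.5 C=10".
    String params = value.toString().trim();
    if (params.length() == 0) return;

    // MATLAB must be installed on every node; run_model.m is a hypothetical
    // function that takes the parameter string and prints its result.
    ProcessBuilder pb = new ProcessBuilder(
        "matlab", "-nodisplay", "-nojvm", "-r",
        "run_model('" + params + "'); exit");
    pb.redirectErrorStream(true);

    Process proc = pb.start();
    BufferedReader in =
        new BufferedReader(new InputStreamReader(proc.getInputStream()));
    StringBuilder result = new StringBuilder();
    String line;
    while ((line = in.readLine()) != null) {
      result.append(line).append(' ');
      reporter.progress();   // keep the task alive during long MATLAB runs
    }
    in.close();

    output.collect(new Text(params), new Text(result.toString().trim()));
  }
}
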



Announcing Cloudera's One Day Hadoop Training

2008-12-12 Thread Christophe Bisciglia
Hadoop Fans,

I'm happy to announce that Cloudera, in addition to providing
commercial support for Hadoop, is now offering a one-day, professional
training course for Hadoop. It's open to anyone in the community, and
is focused on helping you get the most out of Hadoop and related
tools. Come spend a day working with us and other users facing similar
challenges to your own.

For full details, see our website: http://www.cloudera.com/hadoop-training

We primarily focus on the following themes:
   * What must our organization do differently to capture and
effectively use very-large scale data?
   * What tools help us analyze large-scale data and extract
meaningful results, and how do we use them?
   * How can we reorient our data generation and collection processes
to enable more powerful analysis later?

The morning is instructional; the afternoon is hands-on. We provide a
cluster with interesting data, and you are free to load your own as
well.

Cheers,
Christophe


Re: -libjars with multiple jars broken when client and cluster reside on different OSs?

2008-12-12 Thread Aaron Kimball
Hi Stuart,

Good sleuthing out that problem :) The correct way to submit patches is to
file a ticket on JIRA (https://issues.apache.org/jira/browse/HADOOP). Create
an account, create a new issue describing the bug, and then attach the patch
file. There'll be a discussion there and others can review your patch and
include it in the codebase.

Cheers,
- Aaron

On Fri, Dec 12, 2008 at 12:14 PM, Stuart White wrote:

> Ok, I'll answer my own question.
>
> This is caused by the fact that Hadoop uses
> System.getProperty("path.separator") as the delimiter in the list of
> jar files passed via -libjars.
>
> If your job spans platforms, System.getProperty("path.separator")
> returns a different delimiter on the different platforms.
>
> My solution is to use a comma as the delimiter, rather than the
> path.separator.
>
> I realize a comma is, perhaps, a poor choice for a delimiter because it
> is valid in filenames on both Windows and Linux, but -libjars already
> uses it as the delimiter when listing the additional required jars.  So, I
> figured that if it's already being used as a delimiter, it's
> reasonable to use it internally as well.
>
> I've attached a patch (against 0.19.0) that applies this change.
>
> Now, with this change, I can submit hadoop jobs (requiring multiple
> supporting jars) from my Windows laptop (via cygwin) to my 10-node
> Linux hadoop cluster.
>
> Any chance this change could be applied to the hadoop codebase?
>
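
As a side note, a tiny standalone sketch (separate from Stuart's patch) of why
the delimiter differs between the submitting machine and the cluster, and why a
fixed delimiter such as "," sidesteps the problem:

// Minimal demo: path.separator is ";" on Windows and ":" on Unix-like
// systems, so a jar list joined on the submitting machine with one value
// cannot be split reliably with the other on the cluster side.
public class PathSeparatorDemo {
    public static void main(String[] args) {
        String sep = System.getProperty("path.separator");
        System.out.println("path.separator here: '" + sep + "'");

        String[] jars = { "a.jar", "b.jar", "c.jar" };
        StringBuilder joined = new StringBuilder();
        for (int i = 0; i < jars.length; i++) {
            if (i > 0) joined.append(",");   // fixed, platform-independent delimiter
            joined.append(jars[i]);
        }
        System.out.println("comma-joined list: " + joined);
    }
}
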


Re: how do I know the datanode slave status

2008-12-12 Thread Aaron Kimball
Try bin/hadoop dfsadmin -report
Also, look at http://name-node-address:50070/

- Aaron

On Wed, Dec 10, 2008 at 11:09 PM, santu  wrote:

> Hi,
>
>  Is it possible to find out the status of a datanode slave? I want to know
> details like:
>- is the slave running?
>- when was the last data received?
>- are there any critical problems right now?
>
> With regards,
> R. SANTHANA GOPALAN.
>
>
> --
> Using Opera's revolutionary e-mail client: http://www.opera.com/m2/
>
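
For completeness, the same information is also reachable from Java. A rough
sketch (HDFS package names moved between releases, so the
org.apache.hadoop.hdfs imports below may need adjusting for older builds; the
DatanodeInfo accessors shown are the commonly available ones):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class DatanodeStatus {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up the cluster config from the classpath
    FileSystem fs = FileSystem.get(conf);
    if (!(fs instanceof DistributedFileSystem)) {
      System.err.println("Not talking to HDFS; check your configuration.");
      return;
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;
    // Same data that backs "hadoop dfsadmin -report": one entry per datanode.
    for (DatanodeInfo node : dfs.getDataNodeStats()) {
      System.out.println(node.getHostName()
          + "  remaining=" + node.getRemaining() + " bytes"
          + "  lastContact=" + new java.util.Date(node.getLastUpdate()));
    }
  }
}
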


Re: -libjars with multiple jars broken when client and cluster reside on different OSs?

2008-12-12 Thread Stuart White
Ok, I'll answer my own question.

This is caused by the fact that Hadoop uses
System.getProperty("path.separator") as the delimiter in the list of
jar files passed via -libjars.

If your job spans platforms, System.getProperty("path.separator")
returns a different delimiter on the different platforms.

My solution is to use a comma as the delimiter, rather than the path.separator.

I realize a comma is, perhaps, a poor choice for a delimiter because it
is valid in filenames on both Windows and Linux, but -libjars already uses
it as the delimiter when listing the additional required jars.  So, I
figured that if it's already being used as a delimiter, it's
reasonable to use it internally as well.

I've attached a patch (against 0.19.0) that applies this change.

Now, with this change, I can submit hadoop jobs (requiring multiple
supporting jars) from my Windows laptop (via cygwin) to my 10-node
Linux hadoop cluster.

Any chance this change could be applied to the hadoop codebase?
diff -ur src/core/org/apache/hadoop/filecache/DistributedCache.java src_working/core/org/apache/hadoop/filecache/DistributedCache.java
--- src/core/org/apache/hadoop/filecache/DistributedCache.java  2008-11-13 21:09:36.0 -0600
+++ src_working/core/org/apache/hadoop/filecache/DistributedCache.java  2008-12-12 14:07:48.865460800 -0600
@@ -710,7 +710,7 @@
 throws IOException {
 String classpath = conf.get("mapred.job.classpath.archives");
 conf.set("mapred.job.classpath.archives", classpath == null ? archive
- .toString() : classpath + System.getProperty("path.separator")
+ .toString() : classpath + ","
  + archive.toString());
 FileSystem fs = FileSystem.get(conf);
 URI uri = fs.makeQualified(archive).toUri();
@@ -727,8 +727,7 @@
 String classpath = conf.get("mapred.job.classpath.archives");
 if (classpath == null)
   return null;
-ArrayList list = Collections.list(new StringTokenizer(classpath, System
-  .getProperty("path.separator")));
+ArrayList list = Collections.list(new StringTokenizer(classpath, ","));
 Path[] paths = new Path[list.size()];
 for (int i = 0; i < list.size(); i++) {
   paths[i] = new Path((String) list.get(i));


Re: hadoop mapper 100% but cannot complete?

2008-12-12 Thread hc busy
ahhh, apologies for badmouthing hadoop...

So, I finally discovered one problem that may have caused this kind of
performance degradation. After growing the data set even larger, to 35gb,
Hadoop crashed with a disk full error. It would appear that the system will
actually continue to work when the disk is almost full, but there appears to
be something that causes it to slow down. Does HDFS juggle blocks around
when there isn't enough space on a slave machine? That would explain why it
was slowing down so much when the fs is almost full... Another part of this
is that I've updated my expectation of the speedup: accounting for the sort
that is happening, it is indeed faster.

I've upgraded to 0.18.2, and I now see the exception that is slowing down
the reducer near the end of the run (pasted below). Any suggestions on this
one?


2008-12-12 12:06:28,403 WARN /:
/mapOutput?job=job_200812121139_0001&map=attempt_200812121139_0001_m_000114_0&reduce=2:
java.lang.IllegalStateException: Committed
  at
org.mortbay.jetty.servlet.ServletHttpResponse.resetBuffer(ServletHttpResponse.java:212)
  at
org.mortbay.jetty.servlet.ServletHttpResponse.sendError(ServletHttpResponse.java:375)
  at
org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2504)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
  at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
  at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
  at
org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
  at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
  at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
  at
org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
  at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
  at org.mortbay.http.HttpServer.service(HttpServer.java:954)
  at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
  at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
  at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
  at
org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
  at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
  at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
2008-12-12 12:06:31,107 INFO org.apache.hadoop.mapred.TaskTracker:
attempt_200812121139_0001_r_07_0 0.29801327% reduce > copy (135 of 151
at 1.42 MB/s) >


On Thu, Dec 11, 2008 at 12:13 PM, hc busy  wrote:

> And aside from refusing to declare the task complete after everything is 100%,
> I also noticed that the mapper seems too slow. It's taking the same amount of
> time for 4 machines to read and write through the 30gb file as if I did it
> with a /bin/cat on one machine. Do you guys have any suggestions with
> regard to these two problems?
> On Wed, Dec 10, 2008 at 4:37 PM, hc busy  wrote:
>
>> Guys, I've just configured a hadoop cluster for the first time, and I'm
>> running a null map-reduction over the streaming interface. (/bin/cat for
>> both map and reducer). So I noticed that the mapper and reducer complete
>> 100% in the web ui within a reasonable amount of time, but the job does not
>> complete. On command line it displays
>>
>> ...INFO streaming.StreamJob: map 100% reduce 100%
>>
>> In the web ui, it shows the map completion graph at 100%, but does not display
>> a reduce completion graph. The four machines are well equipped to handle the
>> size of data (30gb). Looking at the task tracker on each of the machines, I
>> noticed that it is ticking through the percentages very, very slowly:
>>
>> 2008-12-10 16:18:55,265 INFO org.apache.hadoop.mapred.TaskTracker:
>> task_200812101532_0001_r_02_0 46.684883% Records R/W=149326846/149326834
>> > reduce
>> 2008-12-10 16:18:57,055 INFO org.apache.hadoop.mapred.TaskTracker:
>> task_200812101532_0001_r_06_0 47.566963% Records R/W=151739348/151739342
>> > reduce
>> 2008-12-10 16:18:58,268 INFO org.apache.hadoop.mapred.TaskTracker:
>> task_200812101532_0001_r_02_0 46.826576% Records R/W=149326846/149326834
>> > reduce
>> 2008-12-10 16:19:00,058 INFO org.apache.hadoop.mapred.TaskTracker:
>> task_200812101532_0001_r_06_0 47.741756% Records R/W=153377016/153376990
>> > reduce
>> 2008-12-10 16:19:01,271 INFO org.apache.hadoop.mapred.TaskTracker:
>> task_200812101532_0001_r_02_0 46.9636% Records R/W=149326846/149326834 >
>> reduce
>> 2008-12-10 16:19:03,061 INFO org.apache.hadoop.mapred.TaskTracker:
>> task_200812101532_0001_r_06_0 47.94259% Records R/W=153377016/153376990
>> > reduce
>> 2008-12-10 16:19:04,274 INFO org.apache.hadoop.mapred.TaskTracker:
>> task_200812101532_0001_r_02_0 47.110992% Records R/W=150960648/150960644
>> > reduce
>>
>> so it would continue like this for hours and hours. What buffer am I
>> setting too small, or what could possibly m

Re: problem in inputSplit

2008-12-12 Thread Zhengguo 'Mike' SUN
It seemed it was complaining about the default constructor of IndexDirSplit:
Hadoop instantiates splits by reflection, and a non-static inner class has no
usable no-arg constructor because its constructor carries a hidden reference to
the enclosing instance. Try changing "protected class IndexDirSplit" to
"static class IndexDirSplit"; a sketch follows below.
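
For reference, a minimal sketch of the split as a static nested class (class
and field names follow the original post; the write()/readFields() wire format
is illustrative). ReflectionUtils.newInstance() needs a no-arg constructor it
can call without an enclosing instance:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;

// Outer class reduced to a shell here; only the nested split matters.
public class IndexDirInputFormatSketch {

  // "static" gives the class a no-arg constructor that reflection can call
  // without an enclosing IndexDirInputFormatSketch instance.
  public static class IndexDirSplit implements InputSplit {

    private List<String> dbIndexPaths = new ArrayList<String>();

    public IndexDirSplit() { }  // what ReflectionUtils.newInstance() looks for

    public IndexDirSplit(List<String> dbIndexPaths) {
      this.dbIndexPaths = dbIndexPaths;
    }

    public List<String> getDbIndexPaths() {
      return dbIndexPaths;
    }

    public long getLength() {
      return dbIndexPaths.size();
    }

    public String[] getLocations() {
      return new String[0];
    }

    public void write(DataOutput out) throws IOException {
      out.writeInt(dbIndexPaths.size());
      for (String p : dbIndexPaths) {
        Text.writeString(out, p);
      }
    }

    public void readFields(DataInput in) throws IOException {
      int n = in.readInt();
      dbIndexPaths = new ArrayList<String>(n);
      for (int i = 0; i < n; i++) {
        dbIndexPaths.add(Text.readString(in));
      }
    }
  }
}
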





From: ZhiHong Fu 
To: core-user@hadoop.apache.org
Sent: Friday, December 12, 2008 2:26:36 AM
Subject: problem in inputSplit

Hello,

Now I have encountered a very weird problem with a custom split, in which I
define an IndexDirSplit containing a list of index directory paths.

I implemented it like this:

package zju.edu.tcmsearch.lucene.search.format;

import java.io.IOException;
import java.io.DataInput;
import java.io.DataOutput;
import java.util.List;
import java.util.ArrayList;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.FileSystem;



public class IndexDirInputFormat implements InputFormat<Text,Text>{
private static final Log LOG =
LogFactory.getLog(IndexDirInputFormat.class);
private static final String INVALID="invalid";

public static void configure(JobConf job){

}
public IndexDirReader getRecordReader(InputSplit split,
  JobConf job, Reporter reporter) throws IOException {
  reporter.setStatus(split.toString());
  return new IndexDirReader((IndexDirSplit)split,job,reporter);
}

public IndexDirSplit[] getSplits(JobConf job,int numSplits) throws
IOException {

  int numMaps=job.getNumMapTasks();
  LOG.info("tcm.search.indexDirs: "+job.get("tcm.search.indexDirs"));
  String[] indexDirs=job.get("tcm.search.indexDirs").split(",");
  FileSystem fs=FileSystem.get(job);

  int index=0;
  for(int i=0;i<indexDirs.length;i++){
  ...
  }
  ...

public class IndexDirReader implements RecordReader<Text,Text>{

  private IndexDirSplit split;
  private JobConf job;
  private Reporter reporter;
  private int offset=0;
  public IndexDirReader(IndexDirSplit split,JobConf job,Reporter reporter){
  this.split=split;
  this.job=job;
  this.reporter=reporter;
  }
  public Text createKey(){
  return new Text();
  }

  public Text createValue(){
  return new Text();
  }

  public boolean next(Text key,Text value) throws IOException{

  List<String> dbIndexPaths=split.getDbIndexPaths();

  if(offset>dbIndexPaths.size())
  return false;

  key.set("map"+offset);
  value.set(dbIndexPaths.get(offset));
  return true;

  }
  public float getProgress() throws IOException{
  return offset;
  }
  public long getPos() throws IOException{
  return offset;
  }

  public void close() throws IOException{

  }
}

protected class IndexDirSplit implements InputSplit{

  protected List<String> dbIndexPaths=new ArrayList<String>();
  protected int length=0;

  public IndexDirSplit(){ }

  public long getLength() throws IOException{
  return dbIndexPaths.size();
  }


  public List<String> getDbIndexPaths() {
  return dbIndexPaths;
  }

  public void add(String indexPath){
  this.dbIndexPaths.add(indexPath);
  length++;
  }
  public String[] getLocations() throws IOException{
  return new String[]{};
  }

  public IndexDirSplit(List<String> dbIndexPaths){
  this.dbIndexPaths=dbIndexPaths;
  this.length=dbIndexPaths.size();
  }

  public void readFields(DataInput in) throws IOException {
  // throw new IOException("readFields(DataInput in) method haven't been implemented!");
  length=in.readInt();
  List<String> tmpDirs=new ArrayList<String>();
  for(int i=0;i<length;i++){
  ...

But when I run the job, it fails with:

java.lang.RuntimeException: java.lang.NoSuchMethodException:
zju.edu.tcmsearch.lucene.search.format.IndexDirInputFormat$IndexDirSplit.<init>()
  at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:80)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:199)
  at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
Caused by: java.lang.NoSuchMethodException:
zju.edu.tcmsearch.lucene.search.format.IndexDirInputFormat$IndexDirSplit.<init>()
  at java.lang.Class.getConstructor0(Class.java:2678)
  at java.lang.Class.getDeclaredConstructor(Class.java:1953)
  at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:74)
  ... 2 more

I don't know why - I have implemented a custom InputFormat like this before,
and it worked successfully, so now I am very puzzled. Can anybody help me? Thanks!

Regards!



  

Problems While doing Distributed search

2008-12-12 Thread elangovan anbalahan
I am not able to perform a distributed search across two machines that have
indexed data.

I have crawled data on two machines, one on the server itself and the other
on another Linux laptop.

These are the changes I have made:

1) /etc/hosts
# /etc/hosts (for master AND slave)
192.168.1.106   master
192.168.1.105   slave

2) created a server folder inside TOMCAT_HOME

3) created a search-server.txt file inside it with the following content:
master 1234
slave  5678

4) modified nutch-site.xml:

<property>
  <name>searcher.dir</name>
  <value>/usr/share/tomcat6/server/search-server</value>
</property>

5) On the server I ran this command:

bin/nutch server 1234 /path/to/crawledDir

6) On the slave I ran this command:

bin/nutch server 5678 /path/to/crawledDir

7) I opened http://localhost:8080/nutch-0.9/ and performed a search.

But it is giving me zero results.

What am I doing wrong here? I have also attached the Tomcat logs.

Help!

I checked the Tomcat logs; this is what I have in them:
2008-12-12 04:44:30,922 INFO  PluginRepository - Plugins: looking in:
/var/lib/tomcat6/webapps/nutch-0.9/WEB-INF/classes/plugins
2008-12-12 04:44:31,072 INFO  PluginRepository - Plugin Auto-activation
mode: [true]
2008-12-12 04:44:31,072 INFO  PluginRepository - Registered Plugins:
2008-12-12 04:44:31,072 INFO  PluginRepository - the nutch core
extension points (nutch-extensionpoints)
2008-12-12 04:44:31,072 INFO  PluginRepository - Basic Query Filter
(query-basic)
2008-12-12 04:44:31,072 INFO  PluginRepository - Basic URL Normalizer
(urlnormalizer-basic)
2008-12-12 04:44:31,072 INFO  PluginRepository - Basic Indexing Filter
(index-basic)
2008-12-12 04:44:31,072 INFO  PluginRepository - Html Parse Plug-in
(parse-html)
2008-12-12 04:44:31,072 INFO  PluginRepository - Basic Summarizer
Plug-in (summary-basic)
2008-12-12 04:44:31,072 INFO  PluginRepository - Site Query Filter
(query-site)
2008-12-12 04:44:31,072 INFO  PluginRepository - HTTP Framework
(lib-http)
2008-12-12 04:44:31,072 INFO  PluginRepository - Text Parse Plug-in
(parse-text)
2008-12-12 04:44:31,072 INFO  PluginRepository - Regex URL Filter
(urlfilter-regex)
2008-12-12 04:44:31,072 INFO  PluginRepository - Pass-through URL
Normalizer (urlnormalizer-pass)
2008-12-12 04:44:31,072 INFO  PluginRepository - Http Protocol Plug-in
(protocol-http)
2008-12-12 04:44:31,072 INFO  PluginRepository - Regex URL Normalizer
(urlnormalizer-regex)
2008-12-12 04:44:31,072 INFO  PluginRepository - OPIC Scoring Plug-in
(scoring-opic)
2008-12-12 04:44:31,072 INFO  PluginRepository - CyberNeko HTML Parser
(lib-nekohtml)
2008-12-12 04:44:31,072 INFO  PluginRepository - JavaScript Parser
(parse-js)
2008-12-12 04:44:31,072 INFO  PluginRepository - URL Query Filter
(query-url)
2008-12-12 04:44:31,073 INFO  PluginRepository - Regex URL Filter
Framework (lib-regex-filter)
2008-12-12 04:44:31,073 INFO  PluginRepository - Registered
Extension-Points:
2008-12-12 04:44:31,073 INFO  PluginRepository - Nutch Summarizer
(org.apache.nutch.searcher.Summarizer)
2008-12-12 04:44:31,073 INFO  PluginRepository - Nutch URL Normalizer
(org.apache.nutch.net.URLNormalizer)
2008-12-12 04:44:31,073 INFO  PluginRepository - Nutch Protocol
(org.apache.nutch.protocol.Protocol)
2008-12-12 04:44:31,073 INFO  PluginRepository - Nutch Analysis
(org.apache.nutch.analysis.NutchAnalyzer)
2008-12-12 04:44:31,073 INFO  PluginRepository - Nutch URL Filter
(org.apache.nutch.net.URLFilter)
2008-12-12 04:44:31,073 INFO  PluginRepository - Nutch Indexing Filter
(org.apache.nutch.indexer.IndexingFilter)
2008-12-12 04:44:31,073 INFO  PluginRepository - Nutch Online Search
Results Clustering Plugin (org.apache.nutch.clustering.OnlineClusterer)
2008-12-12 04:44:31,073 INFO  PluginRepository - HTML Parse Filter
(org.apache.nutch.parse.HtmlParseFilter)
2008-12-12 04:44:31,073 INFO  PluginRepository - Nutch Content Parser
(org.apache.nutch.parse.Parser)
2008-12-12 04:44:31,073 INFO  PluginRepository - Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
2008-12-12 04:44:31,073 INFO  PluginRepository - Nutch Query Filter
(org.apache.nutch.searcher.QueryFilter)
2008-12-12 04:44:31,073 INFO  PluginRepository - Ontology Model Loader
(org.apache.nutch.ontology.Ontology)
2008-12-12 04:44:31,083 INFO  NutchBean - creating new bean
2008-12-12 04:44:31,106 INFO  NutchBean - opening indexes in
/usr/share/tomcat6/server/search-server/indexes
2008-12-12 04:44:31,167 INFO  Configuration - found resource
common-terms.utf8 at
file:/var/lib/tomcat6/webapps/nutch-0.9/WEB-INF/classes/common-terms.utf8
2008-12-12 04:44:31,175 INFO  NutchBean - opening segments in
/usr/share/tomcat6/server/search-server/segments
2008-12-12 04:44:31,189 INFO  SummarizerFactory - Using the first summarizer
extension found: Basic Summarizer
2008-12-12 04:44:31,189 INFO  NutchBean - opening linkdb in
/usr/share/tomcat6/server/search-server/linkdb
2008-12-12 04:44:31,199 INFO  NutchBean - query req

Re: JDBC input/output format

2008-12-12 Thread Fredrik Hedberg
I highly doubt using Hadoop for that would be the most efficient  
solution, unless you have a "sharded" database infrastructure and  
extend the Hadoop database input/output format accordingly.


 - Fredrik

On Dec 12, 2008, at 5:26 AM, Edward J. Yoon wrote:


Does anyone think about database-to-database migration using Hadoop?

On Tue, Dec 9, 2008 at 4:07 AM, Alex Loddengaard   
wrote:
Here are some useful links with regard to reading from and writing  
to MySQL:





Those two issues should answer your questions.

Alex

On Mon, Dec 8, 2008 at 9:10 AM, Edward Capriolo wrote:



Is anyone working on a JDBC RecordReader/InputFormat? I was thinking
this would be very useful for sending data into mappers. Writing data
to a relational database might be more application-dependent, but it is
still possible.







--
Best Regards, Edward J. Yoon @ NHN, corp.
edwardy...@apache.org
http://blog.udanax.org
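
For anyone who wants to experiment before a stock database InputFormat exists,
here is a rough sketch of a JDBC-backed RecordReader against the old
org.apache.hadoop.mapred API (the JDBC URL, query, and single-reader assumption
are illustrative; a usable version also needs an InputFormat/InputSplit that
partitions the query, e.g. by key range, so several mappers can read in
parallel):

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordReader;

public class JdbcRecordReader implements RecordReader<LongWritable, Text> {

  private Connection conn;
  private ResultSet rs;
  private int numCols;
  private long row = 0;

  public JdbcRecordReader(String jdbcUrl, String query) throws IOException {
    try {
      conn = DriverManager.getConnection(jdbcUrl);
      Statement st = conn.createStatement();
      rs = st.executeQuery(query);
      numCols = rs.getMetaData().getColumnCount();
    } catch (SQLException e) {
      throw new IOException("cannot open JDBC source: " + e);
    }
  }

  public LongWritable createKey() { return new LongWritable(); }

  public Text createValue() { return new Text(); }

  // Emits one row per call: key = row number, value = tab-separated columns.
  public boolean next(LongWritable key, Text value) throws IOException {
    try {
      if (!rs.next()) {
        return false;
      }
      StringBuilder sb = new StringBuilder();
      for (int i = 1; i <= numCols; i++) {
        if (i > 1) sb.append('\t');
        sb.append(rs.getString(i));
      }
      key.set(row++);
      value.set(sb.toString());
      return true;
    } catch (SQLException e) {
      throw new IOException("JDBC read failed: " + e);
    }
  }

  public long getPos() { return row; }

  public float getProgress() { return 0.0f; }  // total row count is unknown

  public void close() throws IOException {
    try {
      rs.close();
      conn.close();
    } catch (SQLException e) {
      throw new IOException("close failed: " + e);
    }
  }
}
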