streaming but no sorting

2009-04-28 Thread Dmitry Pushkarev
Hi.

 

I'm writing streaming-based tasks that involve running thousands of mappers.
Afterwards I want to put all their outputs into a small number (say 30) of
output files, mainly so that disk space is used more efficiently. The way I'm
doing it right now is using /bin/cat as the reducer and setting the number of
reducers to the desired count. This involves two steps that are highly
inefficient for this task - sorting and fetching. Is there a way to get around
that?

Ideally I'd want all mapper outputs to be written to one file, one record
per line.
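
For reference, the current job looks roughly like the sketch below - a small
Python driver around the streaming jar. The jar path, mapper script and HDFS
paths are placeholders, and -jobconf is the streaming option we use on 0.18:

import subprocess

# Identity reduce via /bin/cat; the reducer count controls how many output
# files come out, but it still costs a full sort and shuffle.
subprocess.check_call([
    "hadoop", "jar", "/path/to/hadoop-streaming.jar",
    "-input", "/user/dmitry/input",
    "-output", "/user/dmitry/merged",
    "-mapper", "./my_mapper.pl",
    "-reducer", "/bin/cat",
    "-jobconf", "mapred.reduce.tasks=30",
])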

 

Thanks. 

 

---

Dmitry Pushkarev

+1-650-644-8988

 



RE: CloudBurst: Hadoop for DNA Sequence Analysis

2009-04-08 Thread Dmitry Pushkarev
As a matter of fact it is nowhere near data-intensive; it does take gigabytes
of input data to process, but it is mostly RAM- and CPU-intensive.
Post-processing of the alignment files, though, is exactly where Hadoop
excels. At least as far as I understand, the majority of the time is spent on
DP alignment, whereas navigating the seed space and the N*log(N) sort require
only a fraction of that time - that was my experience applying a Hadoop
cluster to sequencing human genomes.



---
Dmitry Pushkarev
+1-650-644-8988

-Original Message-
From: michael.sch...@gmail.com [mailto:michael.sch...@gmail.com] On Behalf
Of Michael Schatz
Sent: Wednesday, April 08, 2009 9:19 PM
To: core-user@hadoop.apache.org
Subject: CloudBurst: Hadoop for DNA Sequence Analysis

Hadoop Users,

I just wanted to announce that my Hadoop application 'CloudBurst' is available
open source at:
http://cloudburst-bio.sourceforge.net

In a nutshell, it is an application for mapping millions of short DNA
sequences to a reference genome in order to, for example, map out differences
in one individual's genome compared to the reference. As you might imagine,
this is a very data-intensive problem, but Hadoop enables the application to
scale up linearly on large clusters.

A full description of the program is available in the journal
Bioinformatics:
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp236

I also wanted to take this opportunity to thank everyone on this mailing
list. The discussions posted were essential for navigating the ins and outs
of hadoop during the development of CloudBurst.

Thanks everyone!

Michael Schatz

http://www.cbcb.umd.edu/~mschatz



HDD benchmark/checking tool

2009-02-03 Thread Dmitry Pushkarev
Dear hadoop users,

 

Recently I have had a number of drive failures that slowed processes down a
lot until they were discovered. Is there any easy way, or tool, to check HDD
performance and see if there are any I/O errors?

Currently I have a simple script that looks at /var/log/messages and greps for
anything abnormal on /dev/sdaX. If you have a better solution, I'd appreciate
it if you shared it.
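
The script boils down to roughly the following (simplified sketch; the error
patterns are guesses tuned for our kernel, so treat them as placeholders):

import re
import sys

# Patterns that have matched real failures for us; adjust for your driver.
PATTERNS = [
    re.compile(r"I/O error.*dev sd[a-z]", re.I),
    re.compile(r"sd[a-z].*(medium error|unrecovered read error)", re.I),
]

def scan(path="/var/log/messages"):
    hits = []
    for line in open(path):
        if any(p.search(line) for p in PATTERNS):
            hits.append(line.rstrip())
    return hits

if __name__ == "__main__":
    for hit in scan(*sys.argv[1:2]):
        print hit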

 

---

Dmitry Pushkarev

+1-650-644-8988

 



RE: Is Hadoop Suitable for me?

2009-01-28 Thread Dmitry Pushkarev
Definitely not, 

You should be looking at expandable Ethernet storage that can be extended by
connecting additional SAS arrays (like Dell PowerVault and similar offerings
from other companies).

600 MB is just about 6 seconds over a gigabit network...

---
Dmitry Pushkarev


-Original Message-
From: Simon [mailto:sim...@bigair.net.au] 
Sent: Wednesday, January 28, 2009 10:02 PM
To: core-user@hadoop.apache.org
Subject: Is Hadoop Suitable for me?

Hi Hadoop Users,


I am trying to build a storage system for an office of about 20-30 users which
will store everything - from normal everyday documents to computer
configuration files to big files (600 MB) which are generated every hour.

 

Is Hadoop suitable for this kind of environment?

 

Regards,

Simon




streaming split sizes

2009-01-20 Thread Dmitry Pushkarev
Hi.

 

I'm running streaming on a relatively big (2 TB) dataset, which is being split
by Hadoop into 64 MB pieces. One of the problems with that is that my map
tasks take a very long time to initialize (they need to load a 3 GB database
into RAM) and then finish their 64 MB in 10 seconds.

 

So I'm wondering: is there any way to make Hadoop give larger splits to map
tasks? (The trivial way, of course, would be to split the dataset into N files
and feed one file at a time, but is there any standard solution for that?)
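
The closest knob I've found so far is mapred.min.split.size, along the lines
of the untested sketch below (jar path, scripts and HDFS paths are
placeholders), but I'm not sure it's the intended way:

import subprocess

subprocess.check_call([
    "hadoop", "jar", "/path/to/hadoop-streaming.jar",
    "-input", "/user/dmitry/reads",
    "-output", "/user/dmitry/aligned",
    "-mapper", "./align.pl",
    "-reducer", "NONE",                              # map-only job
    "-jobconf", "mapred.min.split.size=1073741824",  # ask for ~1 GB splits
])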

 

Thanks.

 

---

Dmitry Pushkarev

+1-650-644-8988

 



RE: streaming split sizes

2009-01-20 Thread Dmitry Pushkarev
Well, the database is specifically designed to fit into memory, and if it
doesn't, things slow down by a factor of hundreds. One simple hack I came up
with is to replace the map task with /bin/cat and then run 150 reducers that
keep the database constantly in memory. Parallelism is also not a problem,
since we're running a very small cluster (15 nodes, 120 cores) built
specifically for this task.
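
Roughly like this (paths and the reducer script are placeholders):

import subprocess

# Identity map; the real work happens in 150 reducers, each of which loads
# the 3 GB database exactly once and then streams records through it.
subprocess.check_call([
    "hadoop", "jar", "/path/to/hadoop-streaming.jar",
    "-input", "/user/dmitry/reads",
    "-output", "/user/dmitry/aligned",
    "-mapper", "/bin/cat",
    "-reducer", "./align_with_db.pl",
    "-jobconf", "mapred.reduce.tasks=150",
])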

---
Dmitry Pushkarev
+1-650-644-8988

-Original Message-
From: Delip Rao [mailto:delip...@gmail.com] 
Sent: Tuesday, January 20, 2009 6:19 PM
To: core-user@hadoop.apache.org
Subject: Re: streaming split sizes

Hi Dmitry,

Not a direct answer to your question but I think the right approach
would be to not load your database into memory during config() but
instead lookup the database from map() via Hbase or something similar.
That way you don't have to worry about the split sizes. In fact using
fewer splits would limit the parallelism you can achieve, given that
your maps are so fast.

- delip

On Tue, Jan 20, 2009 at 8:25 PM, Dmitry Pushkarev u...@stanford.edu wrote:
 Hi.



 I'm running streaming on relatively big (2Tb) dataset, which is  being
split
 by hadoop in 64mb pieces.  One of the problems I have with that is my map
 tasks take very long time to initialize (they need to load 3GB database
into
 RAM) and they are finishing these 64mb in 10 seconds.



 So I'm wondering if there is any way to make hadoop give larger datasets
to
 map jobs? (trivial way, of course would be to split dataset to N files and
 make it feed one file at a time, but is there any standard solution for
 that?)



 Thanks.



 ---

 Dmitry Pushkarev

 +1-650-644-8988







Hadoop and Matlab

2008-12-12 Thread Dmitry Pushkarev
Hi.

 

Can anyone share experience of successfully parallelizing Matlab tasks using
Hadoop?

 

We have implemented this in Python (in the form of a simple module that takes
a serialized function and a data array and runs that function on the cluster),
but we really have no clue how to do the same in Matlab.
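
For reference, the Python side boils down to a streaming mapper along these
lines (a simplified sketch - the real module has error handling, and the
module/function names and the input format here are just placeholders):

#!/usr/bin/env python
# Each input line is "module.function<TAB>base64(pickled argument tuple)".
# The named module must be importable on every node (e.g. shipped via -file).
import base64
import cPickle
import sys

for line in sys.stdin:
    target, blob = line.rstrip("\n").split("\t", 1)
    mod_name, func_name = target.rsplit(".", 1)
    module = __import__(mod_name, {}, {}, [func_name])
    args = cPickle.loads(base64.b64decode(blob))
    result = getattr(module, func_name)(*args)
    # key = the call spec, value = pickled result
    print "%s\t%s" % (target, base64.b64encode(cPickle.dumps(result)))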

 

Ideally we want to use Matlab in the same way: write an .m file that takes a
set of parameters and returns some value, specify a list of input parameters
(like lists of variables to try for Gaussian kernels), and run it on the
cluster in a somewhat failproof manner - that's the ideal situation.

 

Has anyone tried that? 

 

---

Dmitry



Hadoop and security.

2008-10-05 Thread Dmitry Pushkarev
Dear hadoop users, 

 

I'm lucky to work in an academic environment where information security is not
an issue. However, I'm sure that most Hadoop users aren't so lucky.

 

Here is the question: how secure is Hadoop? (or, let's say, how foolproof?)

 

Here is the answer: search Google for "Hadoop Map/Reduce Administration"
(http://www.google.com/search?q=Hadoop+Map/Reduce+Administration) - not quite.

 

What we're seeing here are open Hadoop clusters, where anyone capable of
installing Hadoop and changing his username to 'webcrawl' can use the cluster
and read its data, even though a firewall is properly installed and ports like
ssh are filtered from outsiders. After you've played with the data long
enough, you notice that you can submit jobs as well, and these jobs can
execute shell commands. Which is very, very sad.

 

In my view, this significantly limits distributed Hadoop deployments, where
part of your cluster may reside on EC2 or in another distant datacenter, since
you always need to keep certain ports open to a range of IP addresses (if your
instances are dynamic), which isn't acceptable if anyone in that IP range can
connect to your cluster.

 

Can we propose that the developers introduce some basic user management and
access controls, to help Hadoop take one more step towards being a
production-quality system?

 

And, by the way, add a robots.txt to the default distribution (though I doubt
it will help, as it takes less than a week to scan the whole internet for a
given port from a home DSL line...).

 

---

Dmitry

 



hadoop under windows.

2008-10-03 Thread Dmitry Pushkarev
Hi.

 

I have a strange problem with Hadoop when I run jobs under Windows (my laptop
runs XP, but all cluster machines including the namenode run Ubuntu). I run a
job (which runs perfectly under Linux, and all configs and Java versions are
the same); all mappers finish successfully, and so does the reducer, but when
it tries to copy the resulting file to the output directory I get things
like:

 

 

03.10.2008 21:47:24 *INFO * audit: ugi=Dmitry,mkpasswd,root,None,Administrators,Users
ip=/171.65.102.211 cmd=rename
src=/user/public/tmp/streaming-job12345/out48/_temporary/_attempt_200810032005_0013_r_00_0/part-0
dst=/user/public/tmp/streaming-job12345/out48/_temporary/_attempt_200810032005_0013_r_00_0/part-0
perm=Dmitry:supergroup:rw-r--r-- (FSNamesystem.java, line 94)

 

And then it deletes the file, and I get no output.

Why does it rename the file onto itself, and does it have anything to do with
Path.getParent()?

 

Thanks.



jython HBase map/red task

2008-09-17 Thread Dmitry Pushkarev
Hi. I'm writing a MapReduce task in Jython, and I can't launch ToolRunner.run;
Jython says "TypeError: integer required" on the ToolRunner.run line, and I
can't get a more detailed explanation. I guess the error is either in
ToolRunner or in setConf:

 

What am I doing wrong? :)

 

And can anyone share a sample working MapReduce Hadoop job in Jython
(preferably for HBase)?

 

class rowcounter(TableMap, Tool):

    def map(self, key, value, output, reporter):
        pass

    def setConf(self, c):
        pass

    def getConf(self):
        return Configuration()

    def run(self, args):
        print "Someday I'll be doing something"
        pass


def main(args):
    ToolRunner.run(Configuration, rowcounter(), args)
    sys.exit()


if __name__ == "__main__":
    main(sys.argv)
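
One thing I'm going to double-check myself, in case it matters (just a guess,
not a confirmed diagnosis): ToolRunner.run takes a Configuration instance, not
the Configuration class object, so maybe the call should look more like the
sketch below (import paths assumed from the 0.18-era API, and rowcounter is
the class above):

import sys
from org.apache.hadoop.conf import Configuration
from org.apache.hadoop.util import ToolRunner

def main(args):
    # pass a Configuration *instance*, not the class itself
    sys.exit(ToolRunner.run(Configuration(), rowcounter(), args))

if __name__ == "__main__":
    main(sys.argv)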

 

 



RE: Installing Hadoop on OS X 10.5 single node cluster (MacPro) posted to wiki

2008-09-14 Thread Dmitry Pushkarev
Awesome - wish I had it a couple of weeks ago.

By the way, can someone give me Jython code that interacts with HBase?
I want to learn to write simple mapreducers (for example, to go over all rows
in a given column, compute something, and put the result into another column).
If it works, I promise to write a detailed wiki page.

-Original Message-
From: Sandy [mailto:[EMAIL PROTECTED] 
Sent: Sunday, September 14, 2008 8:58 PM
To: core-user@hadoop.apache.org
Subject: Installing Hadoop on OS X 10.5 single node cluster (MacPro) posted
to wiki

Hi all,

I just posted a set of instructions on how to install hadoop on a Mac Pro
single node cluster. Please let me know if you see any changes to be made or
if you have any suggestions or comments!

It can be viewed here:
http://wiki.apache.org/hadoop/Running_Hadoop_On_OS_X_10.5_64-bit_(Single-Nod
e_Cluster)


Thanks,

-SM



RE: HDFS

2008-09-13 Thread Dmitry Pushkarev
Why not use HAR over HDFS? The idea being that if you don't do too much
writing, compacting the files into har archives (which will be stored in
64 MB slices) might be a good answer (a sketch of what I mean follows the two
questions below).

Thus the question for the hadoop developers: is hadoop har-aware?
In two senses:
1. Does it try to assign tasks close to the data (does it know to which piece
of the har file a given file belongs)?
2. If a har folder is fed to a map task, will hadoop give all the files to the
task together through the local datanode, or will getting each file result in
accessing the namenode, resolving, etc.?
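
For concreteness, the kind of workflow I have in mind (untested; this assumes
the 0.18 "hadoop archive -archiveName NAME <src>* <dest>" syntax, the paths
are placeholders, and the exact har:// URI form seems to vary between
releases):

import subprocess

# Pack everything under /user/dmitry/smallfiles into a single archive.
subprocess.check_call(["hadoop", "archive", "-archiveName", "files.har",
                       "/user/dmitry/smallfiles", "/user/dmitry/archives"])

# Files inside the archive are then addressed through the har filesystem.
subprocess.check_call(["hadoop", "fs", "-lsr",
                       "har:///user/dmitry/archives/files.har"])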

-Original Message-
From: Monchanin Eric [mailto:[EMAIL PROTECTED] 
Sent: Saturday, September 13, 2008 4:59 PM
To: core-user@hadoop.apache.org
Subject: Re: HDFS

Hi,

Thank you all for your answers; I am digging deep into various distributed
file systems.
So far I'm checking out MogileFS, KFS, ... which seem to better fit my needs.
I'll indeed be dealing with thousands of small files (a few kB to a few MB,
10 MB tops).
Our service is based on a web portal driven by PHP.
I'm having a hard time compiling any of them, though (MogileFS with Perl and
its dependencies, KFS with C++ and its dependencies).

If any one of you has any experience to share, I'll be glad to listen.

Cheers,

Eric

Robert Krüger wrote:
 Hi Eric,

 we are currently building a system for a very similar purpose (digital
 asset management) and we currently use HDFS for a volume of approx. 100 TB,
 with the option to scale into the PB range. Since we haven't gone into
 production yet, I cannot say it will work flawlessly, but so far everything
 has worked very well with really good performance (especially read
 performance, which is probably also the most important factor in your case).
 The most important thing you have to be aware of IMHO is that you will not
 have a real file system on the OS level. If you use tools which need that to
 process the data, you will need to do some copying (which we do in some
 cases). There is a project out there that makes HDFS available via FUSE, but
 it appears to be rather alpha, which is why we haven't dared to take a look
 at it for this project.

 Apart from the namenode, which you have to make redundant yourself (lots of
 posts in the archives on this topic), you can simply configure the level of
 redundancy (see the docs).

 Hope this helps,

 Robert


 Monchanin Eric wrote:
   
 Hello to all,

 I have been attracted by the Hadoop project while looking for a solution
 for my application.
 Basically, I have an application hosting user-generated content (images,
 sounds, videos) and I would like to have it available at all times for
 all my servers.
 Servers will basically add new content, and users can manipulate the
 existing content, make compositions, etc.

 We have a few servers (2 for now) dedicated to hosting content, and
 right now they are connected via sshfs on some folders, in order to
 shorten the transfer time between these content servers and the
 application servers.

 Would the Hadoop filesystem be useful in my case? Is it worth digging
 into it?

 In the case that it is doable, how redundant is the system? For instance, to
 store 1 MB of data, how much storage do I need (I guess at least 2 MB...)?

 I hope I made myself clear enough and will get encouraging answers,

 Bests to all,

 Eric

 


   




RE: namenode multithreaded

2008-09-12 Thread Dmitry Pushkarev
I have 15+ million small files I'd like to process and move around. Thus my
operations don't really involve the datanodes - they're idle when I, for
example, do FS operations (like sorting a bunch of new files written by the
tasktracker into the appropriate folders). I tried HADOOP_OPTS=-server and it
seems to help a little, but performance still isn't great.

Perhaps the problem is in the way I manipulate the files - it's a Perl script
over davfs2 over WebDAV, which uses the native API.

Can anyone give an example of a Jython or JRuby script that would recursively
go over an HDFS folder and move all files to a different folder? (My
programming skills are very modest...)
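
So far the closest I've gotten is the Jython fragment below, cobbled together
from the FileSystem javadoc (untested, no error handling, paths are
placeholders) - corrections welcome:

from org.apache.hadoop.conf import Configuration
from org.apache.hadoop.fs import FileSystem, Path

def move_all(fs, src_dir, dst_dir):
    # Recursively move every plain file under src_dir into the flat dst_dir.
    statuses = fs.listStatus(src_dir)
    if statuses is None:
        return
    for status in statuses:
        path = status.getPath()
        if status.isDir():
            move_all(fs, path, dst_dir)
        else:
            fs.rename(path, Path(dst_dir, path.getName()))

conf = Configuration()
fs = FileSystem.get(conf)
dst = Path("/user/dmitry/sorted")
if not fs.exists(dst):
    fs.mkdirs(dst)
move_all(fs, Path("/user/dmitry/incoming"), dst)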


-Original Message-
From: Raghu Angadi [mailto:[EMAIL PROTECTED] 
Sent: Friday, September 12, 2008 9:41 AM
To: core-user@hadoop.apache.org
Subject: Re: namenode multitreaded


The core of the namenode functionality happens in a single thread because of
a global lock, unfortunately. The other CPUs would still be used to some
extent by network IO and other threads. Usually we don't see just one CPU at
100% and nothing at all on the other CPUs.

What kind of load do you have?

Raghu.

Dmitry Pushkarev wrote:
 Hi.
 
  
 
 My namenode runs on a 8-core server with lots of RAM, but it only uses one
 core (100%).
 
 Is it possible to tell namenode to use all available cores?
 
  
 
 Thanks.
 
 



namenode multithreaded

2008-09-11 Thread Dmitry Pushkarev
Hi.

 

My namenode runs on an 8-core server with lots of RAM, but it only uses one
core (100%).

Is it possible to tell the namenode to use all available cores?

 

Thanks.



RE: Thinking about retrieving DFS metadata from datanodes!!!

2008-09-10 Thread Dmitry Pushkarev
This will effectively ruin the system at large scale, since you will have to
update all the blocks whenever you touch the metadata...

-Original Message-
From: 叶双明 [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 10, 2008 12:06 AM
To: core-user@hadoop.apache.org
Subject: Re: Thinking about retrieving DFS metadata from datanodes!!!

I think each block could carry three simple additional pieces of information
which aren't used in normal operation:
   1. which file it belongs to
   2. which block of that file it is
   3. how many blocks the file has
After the cluster has been destroyed, we can set up a new NameNode and then
rebuild the metadata from the information reported by the datanodes.

And the cost is a little disk space, indeed less than 1 KB per block I think.
I don't see it as a replacement for (or comparable to) multiple NameNodes, but
just as a possible mechanism to recover data - the point is 'recover'.

hehe~~ thanks.

2008/9/10 Raghu Angadi [EMAIL PROTECTED]


 The main problem is the complexity of maintaining the accuracy of the
 metadata. In other words, what do you think the cost is?

 Do you think writing the fsimage to multiple places helps with the
 terrorist-attack scenario? That is supported even now.

 Raghu.

 叶双明 wrote:

 Thanks for paying attention to my tentative idea!

 What I was thinking about isn't how to store the metadata, but a final
 (last-resort) way to recover valuable data in the cluster when the worst
 happens (something that destroys the metadata on all the NameNodes), e.g. a
 terrorist attack or natural disaster destroys half of the cluster nodes
 including all NameNodes. We could recover as much data as possible by this
 mechanism, and have a big chance of recovering all of the cluster's data
 because of the original replication.

 Any suggestion is appreciated!






-- 
Sorry for my english!! 明



number of tasks on a node.

2008-09-09 Thread Dmitry Pushkarev
Hi.

 

How can a node find out how many tasks are running on it at a given time?

I want tasktracker nodes (which are allocated from Amazon EC2) to shut
themselves down if nothing has been running for some period of time, but I
don't yet see the right way of implementing this.
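
The crude direction I'm considering is a per-node watchdog like the untested
sketch below. It assumes the task child JVMs run with main class
org.apache.hadoop.mapred.Child and that hadoop-daemon.sh is on the PATH, so
treat those as assumptions rather than facts:

import os
import time

IDLE_LIMIT = 30 * 60    # seconds of continuous idleness before shutdown
CHECK_EVERY = 60

def running_tasks():
    # Count task child JVMs spawned by the local tasktracker.
    out = os.popen("ps -eo command").read()
    return out.count("org.apache.hadoop.mapred.Child")

idle_since = None
while True:
    if running_tasks() > 0:
        idle_since = None
    elif idle_since is None:
        idle_since = time.time()
    elif time.time() - idle_since > IDLE_LIMIT:
        os.system("hadoop-daemon.sh stop tasktracker")
        os.system("sudo shutdown -h now")
        break
    time.sleep(CHECK_EVERY)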



RE: task assignment management.

2008-09-08 Thread Dmitry Pushkarev
How about just specifying the machines to run the task on? I haven't seen that
anywhere...

-Original Message-
From: Devaraj Das [mailto:[EMAIL PROTECTED] 
Sent: Sunday, September 07, 2008 9:55 PM
To: core-user@hadoop.apache.org
Subject: Re: task assignment management.

No, that is not possible today. However, you might want to look at the
TaskScheduler to see if you can implement a scheduler that provides this kind
of task scheduling.

In current Hadoop, one point regarding computationally intensive tasks is that
if a machine is not able to keep up with the rest of the machines (and the
task on that machine is running slower than the others), speculative
execution, if enabled, can help a lot. Also, implicitly, faster/better
machines get more work than slower machines.


On 9/8/08 3:27 AM, Dmitry Pushkarev [EMAIL PROTECTED] wrote:

 Dear Hadoop users,
 
  
 
 Is it possible without using java manage task assignment to implement some
 simple rules?  Like do not launch more that 1 instance of crawling task
on
 a machine, and do not run data intensive tasks on remote machines, and do
 not run computationally intensive tasks on single-core machines:etc.
 
  
 
 Now it's done by failing tasks that decided to run on a wrong machine, but
I
 hope to find some solution on jobtracker side..
 
  
 
 ---
 
 Dmitry
 




task assignment management.

2008-09-07 Thread Dmitry Pushkarev
Dear Hadoop users,

 

Is it possible, without using Java, to manage task assignment and implement
some simple rules? For example: do not launch more than one instance of the
crawling task on a machine, do not run data-intensive tasks on remote
machines, do not run computationally intensive tasks on single-core machines,
etc.

 

Right now this is done by failing tasks that happen to start on the wrong
machine, but I hope to find a solution on the jobtracker side...

 

---

Dmitry



RE: no output from job run on cluster

2008-09-04 Thread Dmitry Pushkarev
Hi,

I'd check the installed Java version - that was the problem in my case, and
surprisingly Hadoop gave no output either. If that turns out to be it, could
you file a bug report? :)

-Original Message-
From: Shirley Cohen [mailto:[EMAIL PROTECTED] 
Sent: Thursday, September 04, 2008 10:07 AM
To: core-user@hadoop.apache.org
Subject: no output from job run on cluster

Hi,

I'm running on hadoop-0.18.0. I have an m-r job that executes correctly in
standalone mode. However, when run on a cluster, the same job produces zero
output. It is very bizarre. I looked in the logs and couldn't find anything
unusual. All I see are the usual deprecated-filesystem-name warnings. Has this
ever happened to anyone? Do you have any suggestions on how I might go about
diagnosing the problem?

Thanks,

Shirley



RE: har/unhar utility

2008-09-03 Thread Dmitry Pushkarev
Not quite. I want to be able to create har archives on the local system and
then send them to HDFS, and back, since I work with many small files (10 KB)
and Hadoop seems to behave poorly with them.

Perhaps HBase is another option. Is anyone using it in production mode?
And do I really need to downgrade to 0.17.x to install it?

-Original Message-
From: Devaraj Das [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 03, 2008 3:35 AM
To: core-user@hadoop.apache.org
Subject: Re: har/unhar utility

Are you looking for user documentation on har? If so, here it is:
http://hadoop.apache.org/core/docs/r0.18.0/hadoop_archives.html


On 9/3/08 3:21 PM, Dmitry Pushkarev [EMAIL PROTECTED] wrote:

 Does anyone have har/unhar utility?
 
  Or at least format description: It looks pretty obvious though, but just
in
 case.
 
  
 
 Thanks
 
 
 
 




RE: har/unhar utility

2008-09-03 Thread Dmitry Pushkarev
Probably, but the current idea is to bypass writing small files to HDFS by
creating my own local har archive and uploading it (small files lower the
transfer speed from 40-70 MB/s to hundreds of kbps :( ).

-Original Message-
From: Devaraj Das [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 03, 2008 4:00 AM
To: core-user@hadoop.apache.org
Subject: Re: har/unhar utility

You could create a har archive of the small files and then pass the
corresponding har filesystem as input to your mapreduce job. Would that
work?


On 9/3/08 4:24 PM, Dmitry Pushkarev [EMAIL PROTECTED] wrote:

 Not quite, I want to be able to create har archives on local system and
then
 send them to HDFS, and back since I work with many small files (10kb) and
 hadoop seem to behave poorly with them.
 
 Perhaps HBASE is another option. Is anyone using it in production mode?
 And do I really need to downgrade to 17.x to install it?
 
 -Original Message-
 From: Devaraj Das [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, September 03, 2008 3:35 AM
 To: core-user@hadoop.apache.org
 Subject: Re: har/unhar utility
 
 Are you looking for user documentation on har? If so, here it is:
 http://hadoop.apache.org/core/docs/r0.18.0/hadoop_archives.html
 
 
 On 9/3/08 3:21 PM, Dmitry Pushkarev [EMAIL PROTECTED] wrote:
 
 Does anyone have har/unhar utility?
 
  Or at least format description: It looks pretty obvious though, but just
 in
 case.
 
  
 
 Thanks
 
 
 
 
 
 




datanodes in virtual networks.

2008-09-01 Thread Dmitry Pushkarev
Dear hadoop users,

 

Our lab is slowly switching from SGE to Hadoop; however, not everything seems
easy and obvious. We are in no way computer scientists - we're just
physicists, biologists and a couple of statisticians trying to solve our
computational problems - so please take this into consideration if the
questions look obvious to you...

Our setup:

1.   Data cluster - 4 RAIDed and Hadooped servers with 2 TB of storage each;
they all have real IP addresses, and one of them is reserved for the NameNode.

2.   Computational cluster - 100 dual-core servers running Sun Grid Engine;
they live on a virtual network (10.0.0.X) and can connect to the outside
world, but are not accessible from outside the cluster. On these we don't have
root access, and they are shared via SGE with other people, who get reasonably
nervous when they see idle reserved servers.

 

The basic idea is to create an on-demand computational cluster which, when
needed, will reserve servers from the second cluster, run jobs, and then
release them.

 

Currently this is done via a script that reserves one server for the namenode
and 25 servers for datanodes, copies data from the first cluster, runs the
job, sends the results back and releases the servers. I would still like to
make both clusters work together using one namenode.

 

After a week of playing with Hadoop I couldn't answer some of my questions via
thorough RTFM, so I'd really appreciate it if you could answer at least some
of them in our context:

 

1.   Is it possible to connect servers from the second cluster to the first
namenode? What worries me is the implementation of the data-transfer protocol,
because some of the nodes cannot be reached even though they can easily reach
any other node. Will Hadoop try to establish connections both ways to transfer
data between nodes?

 

2.   Is it possible to specify the reliability of a node - that is, to make a
replica on a node with RAID installed count as two replicas, since its
probability of failure is much lower?

 

3.   I also bumped into problems with decommissioning: after I add the hosts
to be freed to the dfs.hosts.exclude file and run refreshNodes, they are
marked as "Decommission in progress" for days, even though the data is removed
from them within the first several minutes. What I currently do is shut them
down after some delay, but I really hope to see "Decommissioned" one day. What
am I probably doing wrong?

 

4.   The same question about dead hosts. I do a simple exercise: I create 20
datanodes on an empty cluster, then I kill 15 of them and try to store a file
on HDFS; Hadoop fails because some nodes that it thinks are in service aren't
accessible. Is it possible to tell Hadoop to remove these nodes from the list
and not try to store data on them? My current solution is stopping and
restarting Hadoop via cron every hour.

 

5.   We also have some external secure storage that can be accessed via NFS
from the first (data) cluster, and it'd be great if I could somehow mount this
storage as an HDFS folder and tell Hadoop that data written to that folder
shouldn't be replicated but should instead go directly to NFS.

 

6.   Ironically, none of us who use the cluster knows Java, and most tasks are
launched via streaming with C++ programs/Perl scripts. The problem is how to
read/write HDFS files in this context; we currently use things like
-moveFromLocal, but that doesn't seem to be the right answer because it slows
things down a lot (see the sketch after question 9).

 

7.   On one of the data cluster machines we run a pretty large MySQL database,
and we're wondering whether it is possible to spread the database across the
cluster - has anyone tried that?

 

8.   fuse-hdfs works great, but we really hope to be able to write to HDFS
through it someday - how can that be enabled?

 

9.   And maybe someone can point out where to look for ways to specify how
data is partitioned for the map tasks. In some of our tasks, processing one
line of the input file takes several minutes; currently we split these files
into many one-line files and process them independently, but a simple
streaming-compatible way to tell Hadoop that, for example, we want each task
to take 10 lines, or to split a 10 KB input file into many map tasks, would
help us a lot!
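
To make question 6 concrete, the kind of pattern we are experimenting with is
sketched below (untested; it assumes "hadoop fs -put" can read from stdin via
"-", and the paths and the process.pl script are placeholders):

import subprocess

src = "/user/lab/input/part-00000"
dst = "/user/lab/output/result.txt"

# Stream an HDFS file through a local (non-Java) program and write the result
# straight back to HDFS without staging a full copy on local disk.
cat = subprocess.Popen(["hadoop", "fs", "-cat", src], stdout=subprocess.PIPE)
work = subprocess.Popen(["./process.pl"], stdin=cat.stdout,
                        stdout=subprocess.PIPE)
put = subprocess.Popen(["hadoop", "fs", "-put", "-", dst], stdin=work.stdout)
cat.stdout.close()
work.stdout.close()
if put.wait() != 0:
    raise RuntimeError("hadoop fs -put failed")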

 

 

 

Thanks in advance.