c++ problem

2011-03-16 Thread Manish Yadav
Please don't give me the word count example; I just want a simple C++
program to run on Hadoop.


Re: c++ problem

2011-03-16 Thread Harsh J
C++ programs run on whatever OS they were written for. Hadoop is meant to be
used as a platform that makes such programs work as part of a Map/Reduce
application.

On Wed, Mar 16, 2011 at 12:53 PM, Manish Yadav  wrote:
> please dont give me example of word count.i just want want a simple c++
> program to run on hadoop.
>



-- 
Harsh J
http://harshj.com


Iostat on Hadoop

2011-03-16 Thread Matthew John
Hi all,

Can someone give pointers on using iostat to account for I/O overheads
(disk reads/writes) in a MapReduce job?

Matthew John


Re: Iostat on Hadoop

2011-03-16 Thread Jérôme Thièvre INA
Hi Matthew,

You can use iostat -xm 2 to monitor disk usage.
Look at the %util column. When the numbers sit between 90 and 100% for some
devices, some processes start to fall into disk-sleep state and you may have
an excessive load.
Use htop to monitor disk-sleep processes: sort on the S column and watch for
the D status.

Jérôme Thièvre

2011/3/16 Matthew John 

> Hi all,
>
> Can someone give pointers on using Iostat to account for IO overheads
> (disk read/writes) in a MapReduce job.
>
> Matthew John
>


Re: DFSClient: Could not complete file

2011-03-16 Thread Chris Curtin
Thanks. I spent a lot of time looking at the logs, and there is nothing on the
reducers until they start complaining about 'could not complete'.

Found this in the jobtracker log file:

2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient:
DFSOutputStream ResponseProcessor exception  for block
blk_3829493505250917008_9959810java.io.IOException: Bad response 1 for block
blk_3829493505250917008_9959810 from datanode 10.120.41.103:50010
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2454)
2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_3829493505250917008_9959810 bad datanode[2]
10.120.41.103:50010
2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_3829493505250917008_9959810 in pipeline
10.120.41.105:50010, 10.120.41.102:50010, 10.120.41.103:50010: bad datanode
10.120.41.103:50010
2011-03-16 02:38:53,133 INFO org.apache.hadoop.hdfs.DFSClient: Could not
complete file
/var/hadoop/tmp/2_20110316_pmta_pipe_2_20_50351_2503122/_logs/history/hadnn01.atlis1_1299879680612_job_201103111641_0312_deliv_2_20110316_pmta_pipe*2_20110316_%5B%281%2F3%29+...QUEUED_T
retrying...

Looking at the logs from the various times this happens, the 'from datanode'
in the first message varies across all of the data nodes (failures are split
roughly equally among them), so I don't think one specific node is having
problems.
Any other ideas?

Thanks,

Chris
On Sun, Mar 13, 2011 at 3:45 AM, icebergs  wrote:

> You should check the bad reducers' logs carefully.There may be more
> information about it.
>
> 2011/3/10 Chris Curtin 
>
> > Hi,
> >
> > The last couple of days we have been seeing 10's of thousands of these
> > errors in the logs:
> >
> >  INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file
> >
> >
> /offline/working/3/aat/_temporary/_attempt_201103100812_0024_r_03_0/4129371_172307245/part-3
> > retrying...
> > When this is going on the reducer in question is always the last reducer
> in
> > a job.
> >
> > Sometimes the reducer recovers. Sometimes hadoop kills that reducer, runs
> > another and it succeeds. Sometimes hadoop kills the reducer and the new
> one
> > also fails, so it gets killed and the cluster goes into a loop of
> > kill/launch/kill.
> >
> > At first we thought it was related to the size of the data being
> evaluated
> > (4+GB), but we've seen it several times today on < 100 MB
> >
> > Searching here or online doesn't show a lot about what this error means
> and
> > how to fix it.
> >
> > We are running 0.20.2, r911707
> >
> > Any suggestions?
> >
> >
> > Thanks,
> >
> > Chris
> >
>


Re: DFSClient: Could not complete file

2011-03-16 Thread Chris Curtin
Caught something today I missed before:

11/03/16 09:32:49 INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink 10.120.41.105:50010
11/03/16 09:32:49 INFO hdfs.DFSClient: Abandoning block
blk_-517003810449127046_10039793
11/03/16 09:32:49 INFO hdfs.DFSClient: Waiting to find target node:
10.120.41.103:50010
11/03/16 09:34:04 INFO hdfs.DFSClient: Exception in createBlockOutputStream
java.net.SocketTimeoutException: 69000 millis timeout while waiting for
channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
local=/10.120.41.85:34323 remote=/10.120.41.105:50010]
11/03/16 09:34:04 INFO hdfs.DFSClient: Abandoning block
blk_2153189599588075377_10039793
11/03/16 09:34:04 INFO hdfs.DFSClient: Waiting to find target node:
10.120.41.105:50010
11/03/16 09:34:55 INFO hdfs.DFSClient: Could not complete file
/tmp/hadoop/mapred/system/job_201103160851_0014/job.jar retrying...



On Wed, Mar 16, 2011 at 9:00 AM, Chris Curtin wrote:

> Thanks. Spent a lot of time looking at logs and nothing on the reducers
> until they start complaining about 'could not complete'.
>
> Found this in the jobtracker log file:
>
> 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient:
> DFSOutputStream ResponseProcessor exception  for block
> blk_3829493505250917008_9959810java.io.IOException: Bad response 1 for block
> blk_3829493505250917008_9959810 from datanode 10.120.41.103:50010
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2454)
> 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_3829493505250917008_9959810 bad datanode[2]
> 10.120.41.103:50010
> 2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_3829493505250917008_9959810 in pipeline
> 10.120.41.105:50010, 10.120.41.102:50010, 10.120.41.103:50010: bad
> datanode 10.120.41.103:50010
> 2011-03-16 02:38:53,133 INFO org.apache.hadoop.hdfs.DFSClient: Could not
> complete file
> /var/hadoop/tmp/2_20110316_pmta_pipe_2_20_50351_2503122/_logs/history/hadnn01.atlis1_1299879680612_job_201103111641_0312_deliv_2_20110316_pmta_pipe*2_20110316_%5B%281%2F3%29+...QUEUED_T
> retrying...
>
> Looking at the logs from the various times this happens, the 'from
> datanode' in the first message is any of the data nodes (roughly equal in #
> of times it fails), so I don't think it is one specific node having
> problems.
> Any other ideas?
>
> Thanks,
>
> Chris
>   On Sun, Mar 13, 2011 at 3:45 AM, icebergs  wrote:
>
>> You should check the bad reducers' logs carefully.There may be more
>> information about it.
>>
>> 2011/3/10 Chris Curtin 
>>
>> > Hi,
>> >
>> > The last couple of days we have been seeing 10's of thousands of these
>> > errors in the logs:
>> >
>> >  INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file
>> >
>> >
>> /offline/working/3/aat/_temporary/_attempt_201103100812_0024_r_03_0/4129371_172307245/part-3
>> > retrying...
>> > When this is going on the reducer in question is always the last reducer
>> in
>> > a job.
>> >
>> > Sometimes the reducer recovers. Sometimes hadoop kills that reducer,
>> runs
>> > another and it succeeds. Sometimes hadoop kills the reducer and the new
>> one
>> > also fails, so it gets killed and the cluster goes into a loop of
>> > kill/launch/kill.
>> >
>> > At first we thought it was related to the size of the data being
>> evaluated
>> > (4+GB), but we've seen it several times today on < 100 MB
>> >
>> > Searching here or online doesn't show a lot about what this error means
>> and
>> > how to fix it.
>> >
>> > We are running 0.20.2, r911707
>> >
>> > Any suggestions?
>> >
>> >
>> > Thanks,
>> >
>> > Chris
>> >
>>
>
>


Re: c++ problem

2011-03-16 Thread Keith Wiley
Why don't you write up a typical Hello World in C++, then make it run as a
mapper on Hadoop Streaming (or Pipes)? If you send the "Hello World" to cout
(as opposed to cerr or a file or something like that), it will automatically be
interpreted as Hadoop output. Voila! Your first C++ Hadoop program.

On Mar 16, 2011, at 12:23 AM, Manish Yadav wrote:

> please dont give me example of word count.i just want want a simple c++ 
> program to run on hadoop.



Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

"Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy."
   --  Edwin A. Abbott, Flatland




Lost Task Tracker because of no heartbeat

2011-03-16 Thread Baran_Cakici

Hi Everyone,

I am working on a project with Hadoop MapReduce for my master's thesis, and I
have a strange problem on my system.

First of all, I use Hadoop 0.20.2 on Windows XP Pro with the Eclipse plug-in.
When I start a job with a big input (4 GB; maybe not that big, but the
algorithm requires some time), I lose my TaskTracker within a few minutes or
even seconds. That is, "Seconds since heartbeat" keeps increasing,
and after 600 seconds the TaskTracker is lost.

I read somewhere that this can be caused by a small limit on open files
(ulimit -n). I tried to increase this value, but the maximum I can set in
Cygwin is 3200 (ulimit -n 3200), and the default value is 256. Actually I
don't know whether it helps or not.

There are some errors in my jobtracker and tasktracker logs; I have posted
them below.

Jobtracker.log

-Call to localhost/127.0.0.1:9000 failed on local exception:
java.io.IOException: An existing connection was forcibly closed by the
remote host

another :
-
2011-03-15 12:13:30,718 INFO org.apache.hadoop.mapred.JobTracker:
attempt_201103151143_0002_m_91_0 is 97125 ms debug.
2011-03-15 12:16:50,718 INFO org.apache.hadoop.mapred.JobTracker:
attempt_201103151143_0002_m_91_0 is 297125 ms debug.
2011-03-15 12:20:10,718 INFO org.apache.hadoop.mapred.JobTracker:
attempt_201103151143_0002_m_91_0 is 497125 ms debug.
2011-03-15 12:23:30,718 INFO org.apache.hadoop.mapred.JobTracker:
attempt_201103151143_0002_m_91_0 is 697125 ms debug.

Error launching task
Lost tracker 'tracker_apple:localhost/127.0.0.1:2654'

My logs (jobtracker.log, tasktracker.log, ...) are in the attachment.

I really need help; I don't have much time left for my thesis.

Thanks a lot for your help,

Baran

http://old.nabble.com/file/p31164785/logs.rar logs.rar 



Re: Lost Task Tracker because of no heartbeat

2011-03-16 Thread Nitin Khandelwal
Hi,
Just call context.progress() at short intervals inside your map/reduce code;
that will do. If you are using the older API, you can use
reporter.progress() instead.
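
For illustration, here is a minimal sketch of what that looks like with the new
(org.apache.hadoop.mapreduce) API; the per-record work is just a placeholder:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Mapper that does long-running work per record and reports progress
    // periodically, so the TaskTracker keeps sending heartbeats for the task.
    public class LongRunningMapper extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        for (int step = 0; step < 1000; step++) {
          // ... expensive computation for this record (placeholder) ...
          if (step % 100 == 0) {
            context.progress();  // tell the framework the task is still alive
            context.setStatus("record " + key + ", step " + step);
          }
        }
        context.write(new Text(key.toString()), value);
      }
    }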

Thanks & Regards,
Nitin Khandelwal

On 16 March 2011 21:30, Baran_Cakici  wrote:

>
> Hi Everyone,
>
> I make a Project with Hadoop-MapRedeuce for my master-Thesis. I have a
> strange problem on my System.
>
> First of all, I use Hadoop-0.20.2 on Windows XP Pro with Eclipse Plug-In.
> When I start a job with big Input(4GB - it`s may be not to big, but
> algorithm require some time), then i lose my Task Tracker in several
> minutes
> or seconds. I mean, "Seconds since heartbeat" increase
> and then after 600 Seconds I lose TaskTracker.
>
> I read somewhere, that can be occured because of small number of open files
> (ulimit -n). I try to increase this value, but i can write as max value in
> Cygwin 3200.(ulimit -n 3200) and default value is 256. Actually I don`t
> know, is it helps or not.
>
> In my job and task tracker.log have I some Errors, I posted those to.
>
> Jobtracker.log
>
> -Call to localhost/127.0.0.1:9000 failed on local exception:
> java.io.IOException: An existing connection was forcibly closed by the
> remote host
>
> another :
> -
> 2011-03-15 12:13:30,718 INFO org.apache.hadoop.mapred.JobTracker:
> attempt_201103151143_0002_m_91_0 is 97125 ms debug.
> 2011-03-15 12:16:50,718 INFO org.apache.hadoop.mapred.JobTracker:
> attempt_201103151143_0002_m_91_0 is 297125 ms debug.
> 2011-03-15 12:20:10,718 INFO org.apache.hadoop.mapred.JobTracker:
> attempt_201103151143_0002_m_91_0 is 497125 ms debug.
> 2011-03-15 12:23:30,718 INFO org.apache.hadoop.mapred.JobTracker:
> attempt_201103151143_0002_m_91_0 is 697125 ms debug.
>
> Error launching task
> Lost tracker 'tracker_apple:localhost/127.0.0.1:2654'
>
> there are my logs(jobtracker.log, tasktracker.log ...) in attachment
>
> I need really Help, I don`t have so much time for my Thessis.
>
> Thanks a lot for your Helps,
>
> Baran
>
> http://old.nabble.com/file/p31164785/logs.rar logs.rar
> --
> View this message in context:
> http://old.nabble.com/Lost-Task-Tracker-because-of-no-heartbeat-tp31164785p31164785.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 


Nitin Khandelwal


Question on Master

2011-03-16 Thread Mark
I know the master node is responsible for namenode and job tracker, but 
other than that is there any data stored on that machine? Basically what 
I am asking is should there be an generous amount of free space on that 
machine?


So, for example, I have a large drive I want to swap out of my master and
put into another machine which will be used as a node. Before doing
this, is there anything I should back up from my master? Where is the
namenode data stored?


Thanks


Re: YYC/Calgary/Alberta Hadoop Users?

2011-03-16 Thread James Seigel
Hello again.

I am guessing from the lack of response that there are either no Hadoop people
in Calgary, or they are afraid to meet up :)

How about just speaking up if you use Hadoop in Calgary :)

Cheers
James.
On 2011-03-07, at 8:40 PM, James Seigel wrote:

> Hello,
> 
> Just wondering if there are any YYC hadoop users in the crowd and if
> there is any interest in a meetup of any sort?
> 
> Cheers
> James Seigel
> Captain Hammer
> Tynt



hadoop fs -rmr /*?

2011-03-16 Thread W.P. McNeill
On HDFS, anyone can run hadoop fs -rmr /* and delete everything.  The
permissions system minimizes the danger of accidental global deletion on
UNIX or NT because you're less likely to type an administrator password by
accident.  But HDFS has no such safeguard, and the typo corollary to
Murphy's Law guarantees that someone is going to accidentally do this at
some point.  From reading documentation and Googling around, it seems like
the mechanisms for protecting high-value HDFS data from accidental deletion
are:

   1. Set fs.trash.interval to a non-zero value.
   2. Write your own backup utility.

(1) is nice because it's built into HDFS, but it only works for shell
operations, and you may not have spare terabytes of Trash to catch the big
accidental deletes.  (2) seems like a roll-your-own distcp solution.

What are examples of systems people working with high-value HDFS data put in
place so that they can sleep at night?  Are there easy-to-use and reliable
backup utilities, either to another HDFS cluster or to DVD/tape?  Is there a
way to disable fs -rmr for certain directories?
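
One partial mitigation for the shell-only limitation of (1): application code
can route its own deletes through the trash as well, by using
org.apache.hadoop.fs.Trash instead of calling FileSystem.delete() directly.
A rough sketch (the class name and path argument are only illustrative), which
assumes fs.trash.interval has been set to a non-zero number of minutes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.Trash;

    public class TrashDelete {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path target = new Path(args[0]);  // e.g. a scratch directory to remove
        // Move the path into .Trash rather than deleting it outright; it
        // survives until the trash checkpoint expires (fs.trash.interval minutes).
        Trash trash = new Trash(fs, conf);
        if (!trash.moveToTrash(target)) {
          // false means the trash is disabled or the move did not happen, so
          // decide explicitly whether a permanent delete is really intended.
          System.err.println("Not deleting " + target + ": trash unavailable");
        }
      }
    }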


Re: Question on Master

2011-03-16 Thread Harsh J
NameNode and JobTracker do not require a lot of storage space by
themselves. The NameNode needs some space to store its edits and
fsimage, and both require logging space.

However, you may make use of multiple disks for the NameNode, in order to
have a redundant backup copy of the NN image available in case one of
the disks crashes. Other options for this second/third location include
an HA NFS mount or an externally attached disk.

NameNode data is stored in the ${dfs.name.dir} set of directories
[Defined in hdfs-site.xml]. Ensure its content is preserved perfectly
(including permissions) if you're planning to switch disks for your
NN.
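
If you want to double-check which directories those are on your master before
you move the disk, a small sketch like the one below (the hdfs-site.xml path
is only an example for your installation) will print the configured locations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class PrintNameDirs {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Point this at your cluster's config; the path below is only an example.
        conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
        // dfs.name.dir may be a comma-separated list of directories; the
        // NameNode writes its fsimage and edits redundantly to each of them.
        String[] dirs = conf.getStrings("dfs.name.dir");
        if (dirs == null) {
          System.out.println("dfs.name.dir not set (defaults under hadoop.tmp.dir)");
        } else {
          for (String dir : dirs) {
            System.out.println("NameNode metadata directory: " + dir);
          }
        }
      }
    }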

On Wed, Mar 16, 2011 at 10:04 PM, Mark  wrote:
> I know the master node is responsible for namenode and job tracker, but
> other than that is there any data stored on that machine? Basically what I
> am asking is should there be an generous amount of free space on that
> machine?
>
> So for example I have a large drive I want to swap out of my master and put
> into another machine which will be a used as a node. Before doing this, is
> there anything I should back up from my master? Where is the namenode data
> stored?
>
> Thanks
>



-- 
Harsh J
http://harshj.com


Re: Question on Master

2011-03-16 Thread Mark

OK, thanks for the clarification.

Just to be sure, though:

- The master will have the ${dfs.name.dir} but not ${dfs.data.dir}
- The nodes will have ${dfs.data.dir} but not ${dfs.name.dir}

Is that correct?

On 3/16/11 10:43 AM, Harsh J wrote:

NameNode and JobTracker do not require a lot of storage space by
themselves. The NameNode needs some space to store its edits and
fsimage, and both require logging space.

However, you may make use of multiple disks for NameNode, in order to
have a redundant backup copy of the NN image available in case one of
the disks crash. Other solutions to this second/third location include
storing to an HA-NFS mount, or an externally attached disk mount.

NameNode data is stored in the ${dfs.name.dir} set of directories
[Defined in hdfs-site.xml]. Ensure its content is preserved perfectly
(including permissions) if you're planning to switch disks for your
NN.

On Wed, Mar 16, 2011 at 10:04 PM, Mark  wrote:

I know the master node is responsible for namenode and job tracker, but
other than that is there any data stored on that machine? Basically what I
am asking is should there be an generous amount of free space on that
machine?

So for example I have a large drive I want to swap out of my master and put
into another machine which will be a used as a node. Before doing this, is
there anything I should back up from my master? Where is the namenode data
stored?

Thanks






Re: hadoop fs -rmr /*?

2011-03-16 Thread David Rosenstrauch

On 03/16/2011 01:35 PM, W.P. McNeill wrote:

On HDFS, anyone can run hadoop fs -rmr /* and delete everything.


Not sure how you have your installation set up, but on ours (we installed
Cloudera CDH), only user "hadoop" has full read/write access to HDFS.
Since we rarely either log in as user hadoop or run jobs as that user,
this forces us to explicitly chown and set permissions on directory trees in
HDFS so that only specific users can access them, thus enforcing file
read/write restrictions.
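
That kind of lock-down can also be scripted. A rough sketch in Java (the
directory, user, and group names are placeholders, and changing ownership has
to be done as the HDFS superuser), equivalent to hadoop fs -chown / -chmod:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class LockDownDir {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/data/billing");  // placeholder path
        fs.mkdirs(dir);
        // Hand the tree to one user/group and close it to everyone else, so
        // only that user (and the superuser) can write or delete under it.
        fs.setOwner(dir, "billinguser", "billinggroup");  // placeholder names
        fs.setPermission(dir, new FsPermission((short) 0750));
      }
    }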


HTH,

DR


Re: Question on Master

2011-03-16 Thread Harsh J
Yes, ${dfs.name.dir} is a property used by the NameNode, while the other
one is used by the DataNode.

On Wed, Mar 16, 2011 at 11:41 PM, Mark  wrote:
> Ok thanks for the clarification.
>
> Just to be sure though..
>
> - The master will have the ${dfs.name.dir} but not ${dfs.data.dir}
> - The nodes will have ${dfs.data.dir} but not ${dfs.name.dir}
>
> Is that correct?
>
> On 3/16/11 10:43 AM, Harsh J wrote:
>>
>> NameNode and JobTracker do not require a lot of storage space by
>> themselves. The NameNode needs some space to store its edits and
>> fsimage, and both require logging space.
>>
>> However, you may make use of multiple disks for NameNode, in order to
>> have a redundant backup copy of the NN image available in case one of
>> the disks crash. Other solutions to this second/third location include
>> storing to an HA-NFS mount, or an externally attached disk mount.
>>
>> NameNode data is stored in the ${dfs.name.dir} set of directories
>> [Defined in hdfs-site.xml]. Ensure its content is preserved perfectly
>> (including permissions) if you're planning to switch disks for your
>> NN.
>>
>> On Wed, Mar 16, 2011 at 10:04 PM, Mark  wrote:
>>>
>>> I know the master node is responsible for namenode and job tracker, but
>>> other than that is there any data stored on that machine? Basically what
>>> I
>>> am asking is should there be an generous amount of free space on that
>>> machine?
>>>
>>> So for example I have a large drive I want to swap out of my master and
>>> put
>>> into another machine which will be a used as a node. Before doing this,
>>> is
>>> there anything I should back up from my master? Where is the namenode
>>> data
>>> stored?
>>>
>>> Thanks
>>>
>>
>>
>



-- 
Harsh J
http://harshj.com


Anyone know where to get Hadoop production cluster logs?

2011-03-16 Thread He Chen
Hi all,

I am working on the Hadoop scheduler, but I do not know where to get logs from
Hadoop production clusters. Any suggestions?

Best,

Chen


Re: hadoop fs -rmr /*?

2011-03-16 Thread Ted Dunning
W.P. is correct, however, that standard techniques like snapshots, mirrors,
and point-in-time backups do not exist in standard Hadoop.

This requires a variety of creative workarounds if you use stock Hadoop.

It is not uncommon for people to have memories of either removing everything
or somebody close to them doing the same thing.

Few people have memories of doing it twice.

On Wed, Mar 16, 2011 at 11:20 AM, David Rosenstrauch wrote:

> On 03/16/2011 01:35 PM, W.P. McNeill wrote:
>
>> On HDFS, anyone can run hadoop fs -rmr /* and delete everything.
>>
>
> Not sure how you have your installation set but on ours (we installed
> Cloudera CDH), only user "hadoop" has full read/write access to HDFS. Since
> we rarely either login as user hadoop, or run jobs as that user, this forces
> us to explicitly set and chown directory trees in HDFS that only specific
> users can access, thus enforcing file read/write restrictions.
>
> HTH,
>
> DR
>



Re: Why hadoop is written in java?

2011-03-16 Thread baloodevil
See this for comment on java handling numeric calculations like sparse
matrices...
http://acs.lbl.gov/software/colt/





Re: hadoop fs -rmr /*?

2011-03-16 Thread Brian Bockelman
Hi W.P.,

Hadoop does apply permissions, based on the user identity taken from the shell.
So, if the directory is owned by user "brian" and user "ted" does a
"rmr /user/brian", then you get a permission-denied error.

By default, this does not safeguard against malicious users. A malicious user
can do whatever they want with the Hadoop cluster. It safeguards against
accidents by authorized users.

*However*, there is a security branch that uses Kerberos for authentication.  
This is considered secure.

In the end though, Disks Are Not Backups.  As someone who has accidentally
deleted over 100TB of data in a matter of minutes, I can assure you that
high-value data belongs backed up on tape, ejected and stored in a vault, with
the write-protection tab flipped on.

Hadoop IS NOT ARCHIVAL STORAGE.   That's an important fact.

Brian

On Mar 16, 2011, at 12:35 PM, W.P. McNeill wrote:

> On HDFS, anyone can run hadoop fs -rmr /* and delete everything.  The
> permissions system minimizes the danger of accidental global deletion on
> UNIX or NT because you're less likely to type an administrator password by
> accident.  But HDFS has no such safeguard, and the typo corollary to
> Murphy's Law guarantees that someone is going to accidentally do this at
> some point.  From reading documentation and Goolging around it seems like
> the mechanisms for protecting high-value HDFS data from accidental deletion
> are:
> 
>   1. Set fs.trash.interval to a non-zero value.
>   2. Write your own backup utility.
> 
> (1) is nice because it's built in to HDFS, but it only works for shell
> operations, and you may not have spare terrabytes of Trash to catch the big
> accidental deletes.  (2) seems like a roll-your-own distcp solution.
> 
> What are examples of systems people working with high-value HDFS data put in
> place so that they can sleep at night?  Are there easy-to-use and reliable
> backup utilities, either to another HDFS cluster or to DVD/tape?  Is there a
> way to disable fs -rmr for certain directories?





Re: Why hadoop is written in java?

2011-03-16 Thread Ted Dunning
Note that that comment is now 7 years old.

See Mahout for a more modern take on numerics using Hadoop (and other tools)
for scalable machine learning and data mining.

On Wed, Mar 16, 2011 at 10:43 AM, baloodevil  wrote:

> See this for comment on java handling numeric calculations like sparse
> matrices...
> http://acs.lbl.gov/software/colt/
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Why-hadoop-is-written-in-java-tp1673148p2688781.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
>


Re: hadoop fs -rmr /*?

2011-03-16 Thread Allen Wittenauer

On Mar 16, 2011, at 10:35 AM, W.P. McNeill wrote:

> On HDFS, anyone can run hadoop fs -rmr /* and delete everything.

In addition to what everyone else has said, I'm fairly certain that 
-rmr / is specifically safeguarded against.  But /* might have slipped through 
the cracks.

> What are examples of systems people working with high-value HDFS data put in
> place so that they can sleep at night?

I set in place crontabs where we randomly delete the entire file system 
to remind folks that HDFS is still immature.

:D

OK, not really.

In reality, we have basically a policy that everyone signs off on 
before getting an account where they understand that Hadoop should not be 
considered 'primary storage', is not a data warehouse, is not backed up, and 
could disappear at any moment.  But we also make sure that the base (ETL'd) 
data lives on multiple grids.  Any other data should be reproducible from that 
base data.  



How does sqoop distribute its data evenly across HDFS?

2011-03-16 Thread BeThere
The sqoop documentation seems to imply that it uses the key information 
provided to it on the command line to ensure that the SQL data is distributed 
evenly across the DFS. However, I cannot see any mechanism for achieving this
explicitly other than relying on the implicit distribution provided by default 
by HDFS. Is this correct or are there methods on some API that allow me to 
manage the distribution to ensure that it is balanced across all nodes in my 
cluster?

Thanks,

 Andy D



decommissioning node woes

2011-03-16 Thread Rita
Hello,

I have been struggling with decommissioning data nodes. I have a 50+ data
node cluster (no MR) with each server holding about 2TB of storage. I split
the nodes into 2 racks.


I edit the 'exclude' file and then do a -refreshNodes. I see the node
immediately under 'Decommissioned nodes', but I also see it as a 'live' node!
Even though I wait 24+ hours, it is still like this. I suspect it is a bug
in my version. The data node process is still running on the node I am
trying to decommission. So sometimes I kill -9 the process, and then I see
'under replicated' blocks... this can't be the normal procedure.

There were even times when I ended up with corrupt blocks because I was
impatient, even after waiting 24-34 hours.

I am using the 0.21.0 release (23 August, 2010).

Is this a known bug? Is there anything else I need to do to decommission a
node?







-- 
--- Get your facts first, then you can distort them as you please.--


Cloudera Flume

2011-03-16 Thread Mark
Sorry if this is not the correct list to post this on; it was the
closest I could find.


We are using a taildir('/var/log/foo/') source on all of our agents. If
this agent goes down and data cannot be sent to the collector for some
time, what happens when this agent becomes available again? Will the
agent tail the whole directory, starting from the beginning of all files,
thus adding duplicate data to our sink?


I've read that I could set the startFromEnd parameter to true. In that
case, however, if an agent goes down, we would lose any data that gets
written to our file until the agent comes back up. How do people handle
this? It seems like you either have to accept that you will have
duplicate or missing data.


Thanks


Re: Cloudera Flume

2011-03-16 Thread James Seigel
I believe, sir, there should be a Flume support group at Cloudera. I'm
guessing most of us here haven't used it and therefore aren't much
help.

This is vanilla hadoop land. :)

Cheers and good luck!
James

On a side note, how much data are you pumping through it?


Sent from my mobile. Please excuse the typos.

On 2011-03-16, at 7:53 PM, Mark  wrote:

> Sorry if this is not the correct list to post this on, it was the closest I 
> could find.
>
> We are using a taildir('/var/log/foo/') source on all of our agents. If this 
> agent goes down and data can not be sent to the collector for some time, what 
> happens when this agent becomes available again? Will the agent tail the 
> whole directory starting from the beginning of all files thus adding 
> duplicate data to our sink?
>
> I've read that I could set the startFromEnd parameter to true. In that case 
> however if an agent goes down then we would lose any data that gets written 
> to our file until the agent comes back up. How do people handle this? It 
> seems like you either have to deal with the fact that you will have duplicate 
> or missing data.
>
> Thanks||


how am I able to get output file names?

2011-03-16 Thread Jun Young Kim

hi,

after completing a job, I want to know the output file names, because I
used the MultipleOutputs class to generate several output files.


Do you know how I can get them?

thanks.

--
Junyoung Kim (juneng...@gmail.com)



Re: Cloudera Flume

2011-03-16 Thread Mark

Sorry about that.

FYI, about 1 GB/day across 4 collectors at the moment.

On 3/16/11 6:55 PM, James Seigel wrote:

I believe sir there should be a flume support group on cloudera. I'm
guessing most of us here haven't used it and therefore aren't  much
help.

This is vanilla hadoop land. :)

Cheers and good luck!
James

On a side note, how much data are you pumping through it?


Sent from my mobile. Please excuse the typos.

On 2011-03-16, at 7:53 PM, Mark  wrote:


Sorry if this is not the correct list to post this on, it was the closest I 
could find.

We are using a taildir('/var/log/foo/') source on all of our agents. If this 
agent goes down and data can not be sent to the collector for some time, what 
happens when this agent becomes available again? Will the agent tail the whole 
directory starting from the beginning of all files thus adding duplicate data 
to our sink?

I've read that I could set the startFromEnd parameter to true. In that case 
however if an agent goes down then we would lose any data that gets written to 
our file until the agent comes back up. How do people handle this? It seems 
like you either have to deal with the fact that you will have duplicate or 
missing data.

Thanks||


Re: how am I able to get output file names?

2011-03-16 Thread Harsh J
You could enable the counter feature in MultipleOutputs, and then read
each unique output name out of the group of counters it will have created
by the job's end.
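
A rough sketch against the old (org.apache.hadoop.mapred) API, which is where
MultipleOutputs lives on 0.20; the job setup is skeletal, and the counter group
name is assumed to be the MultipleOutputs class name:

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;
    import org.apache.hadoop.mapred.lib.MultipleOutputs;

    public class ListNamedOutputs {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ListNamedOutputs.class);
        // ... usual job setup: mapper/reducer, input/output paths,
        //     MultipleOutputs.addNamedOutput(...) calls, and so on ...
        MultipleOutputs.setCountersEnabled(conf, true);

        RunningJob job = JobClient.runJob(conf);

        // Each named output that actually received records shows up as a
        // counter (name -> record count) in the MultipleOutputs counter group.
        Counters counters = job.getCounters();
        for (Counters.Counter c : counters.getGroup(MultipleOutputs.class.getName())) {
          System.out.println("named output: " + c.getName()
              + ", records written: " + c.getValue());
        }
      }
    }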

On Thu, Mar 17, 2011 at 7:53 AM, Jun Young Kim  wrote:
> hi,
>
> after completing a job, I want to know the output file names because I used
> MultipleOutoutput class to generate several output files.
>
> do you know how I can get it?
>
> thanks.
>
> --
> Junyoung Kim (juneng...@gmail.com)
>
>



-- 
Harsh J
http://harshj.com


Re: How does sqoop distribute its data evenly across HDFS?

2011-03-16 Thread Harsh J
There's a balancer available to re-balance DNs across the HDFS cluster
in general. It is available in the $HADOOP_HOME/bin/ directory as
start-balancer.sh

But what I think sqoop implies is that your data is balanced due to
the map jobs it runs for imports (using a provided split factor
between maps), which should make it write chunks of data out to
different DataNodes.

I guess you could get more information on the Sqoop mailing list
sqoop-u...@cloudera.org,
https://groups.google.com/a/cloudera.org/group/sqoop-user/topics

On Thu, Mar 17, 2011 at 5:04 AM, BeThere  wrote:
> The sqoop documentation seems to imply that it uses the key information 
> provided to it on the command line to ensure that the SQL data is distributed 
> evenly across the DFS. However I cannot see any mechanism for achieving this 
> explicitly other than relying on the implicit distribution provided by 
> default by HDFS. Is this correct or are there methods on some API that allow 
> me to manage the distribution to ensure that it is balanced across all nodes 
> in my cluster?
>
> Thanks,
>
>         Andy D
>
>



-- 
Harsh J
http://harshj.com