Kerberos and Delegation Tokens

2012-03-17 Thread Praveen Sripati
Hi,

According to the 'Hadoop - The Definitive Guide'

 In a distributed system like HDFS or MapReduce, there are many
client-server interactions, each of which must be authenticated. For
example, an HDFS read operation will involve multiple calls to the namenode
and calls to one or more datanodes. Instead of using the three-step
Kerberos ticket exchange protocol to authenticate each call, which would
present a high load on the KDC on a busy cluster, Hadoop uses delegation
tokens to allow later authenticated access without having to contact the
KDC again.
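
For concreteness, a rough sketch of what this looks like from the client API,
assuming a Hadoop version where FileSystem#getDelegationToken(String renewer)
is available (the renewer name below is only illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.token.Token;

FileSystem fs = FileSystem.get(new Configuration());   // Kerberos, and hence the KDC, is involved here
Token<?> token = fs.getDelegationToken("jobtracker");   // issued and later verified by the NameNode itself
// The token can be handed to tasks or other processes, which then authenticate
// to the NameNode with it instead of going back to the KDC.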

Once authentication is established between the client and the NameNode, there
is no need to contact the KDC (Key Distribution Center) again for any NameNode
queries until the ticket expires. So I don't see how delegation tokens lower
the burden on the KDC by reducing the number of times it has to be contacted.

Could someone please explain to me how delegation tokens help?

Praveen


Re: Security at file level in Hadoop

2012-02-22 Thread Praveen Sripati
According to this (http://goo.gl/rfwy4)

 Prior to 0.22, Hadoop uses the 'whoami' and id commands to determine the
user and groups of the running process.

How does this work now?

Praveen

On Wed, Feb 22, 2012 at 6:03 PM, Joey Echeverria j...@cloudera.com wrote:

 HDFS supports POSIX style file and directory permissions (read, write,
 execute) for the owner, group and world. You can change the permissions
 with `hadoop fs -chmod <permissions> <path>`.

 -Joey
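
A minimal sketch of the same thing through the FileSystem API (the paths, user
and group names below are made up); note that permissions are metadata kept by
the NameNode, so block replication does not change them:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

FileSystem fs = FileSystem.get(new Configuration());
// File A: readable and writable only by its owner (user X)
fs.setOwner(new Path("/data/fileA"), "userX", "groupX");
fs.setPermission(new Path("/data/fileA"), new FsPermission((short) 0600));
// File B: owned by user X, with read access for a group containing X and Y
fs.setOwner(new Path("/data/fileB"), "userX", "groupXY");
fs.setPermission(new Path("/data/fileB"), new FsPermission((short) 0640));

Changing ownership normally requires superuser privileges; changing permissions
requires being the owner or the superuser.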


 On Feb 22, 2012, at 5:32, shreya@cognizant.com wrote:

  Hi
 
 
 
 
 
  I want to implement security at file level in Hadoop, essentially
  restricting certain data to certain users.
 
  Ex - File A can be accessed only by a user X
 
  File B can be accessed by only user X and user Y
 
 
 
  Is this possible in Hadoop, how do we do it? At what level are these
  permissions applied (before copying to HDFS or after putting in HDFS)?
 
  When the file gets replicated does it retain these permissions?
 
 
 
  Thanks
 
  Shreya
 
 



Re: Can't achieve load distribution

2012-02-02 Thread Praveen Sripati
 I have a simple MR job, and I want each Mapper to get one line from my
input file (which contains further instructions for lengthy processing).

Use the NLineInputFormat class.

http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/input/NLineInputFormat.html
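
A minimal driver-side sketch, assuming the new-API NLineInputFormat and its
setNumLinesPerSplit helper (the job name and input path below are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

Job job = new Job(new Configuration(), "one-line-per-mapper");
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 1);   // each map task gets exactly one input line
FileInputFormat.addInputPath(job, new Path("/input/instructions.txt"));

With one line per split, each line becomes its own map task, which is what
spreads the lengthy per-line processing across the cluster.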

Praveen

On Thu, Feb 2, 2012 at 9:43 AM, Mark Kerzner mark.kerz...@shmsoft.comwrote:

 Thanks!
 Mark

 On Wed, Feb 1, 2012 at 7:44 PM, Anil Gupta anilgupt...@gmail.com wrote:

   Yes, if your block size is 64 MB. BTW, block size is configurable in Hadoop.
 
  Best Regards,
  Anil
 
  On Feb 1, 2012, at 5:06 PM, Mark Kerzner mark.kerz...@shmsoft.com
 wrote:
 
   Anil,
  
   do you mean one block of HDFS, like 64MB?
  
   Mark
  
   On Wed, Feb 1, 2012 at 7:03 PM, Anil Gupta anilgupt...@gmail.com
  wrote:
  
    Do you have enough data to start more than one mapper?
    If the entire data set is smaller than one block, only one mapper will run.
  
   Best Regards,
   Anil
  
   On Feb 1, 2012, at 4:21 PM, Mark Kerzner mark.kerz...@shmsoft.com
  wrote:
  
   Hi,
  
   I have a simple MR job, and I want each Mapper to get one line from
 my
   input file (which contains further instructions for lengthy
  processing).
    Each line is 100 characters long, and I tell Hadoop to read only 100
    bytes,

    job.getConfiguration().setInt("mapreduce.input.linerecordreader.line.maxlength", 100);
  
   I see that this part works - it reads only one line at a time, and
 if I
   change this parameter, it listens.
  
    However, on a cluster only one node receives all the map tasks. Only one
    map task is started. The others never get anything, they just wait. I've
    added a 100-second wait to the mapper - no change!
  
   Any advice?
  
   Thank you. Sincerely,
   Mark
  
 



Re: Can't achieve load distribution

2012-02-02 Thread Praveen Sripati
Mark,

NLineInputFormat was not introduced in 0.21; I just sent the reference to the
0.21 URL, FYI. It's also in the 0.20.205, 1.0.0 and 0.23 releases.

Praveen

On Fri, Feb 3, 2012 at 1:25 AM, Mark Kerzner mark.kerz...@shmsoft.comwrote:

 Praveen,

 this seems just like the right thing, but it's API 0.21 (I googled about
 the problems with it), so I have to use either the next Cloudera release,
 or Hortonworks, or something, am I right?

 Mark

 On Thu, Feb 2, 2012 at 7:39 AM, Praveen Sripati praveensrip...@gmail.com
 wrote:

   I have a simple MR job, and I want each Mapper to get one line from my
  input file (which contains further instructions for lengthy processing).
 
  Use the NLineInputFormat class.
 
 
 
 http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/input/NLineInputFormat.html
 
  Praveen
 
  On Thu, Feb 2, 2012 at 9:43 AM, Mark Kerzner mark.kerz...@shmsoft.com
  wrote:
 
   Thanks!
   Mark
  
   On Wed, Feb 1, 2012 at 7:44 PM, Anil Gupta anilgupt...@gmail.com
  wrote:
  
Yes, if ur block size is 64mb. Btw, block size is configurable in
  Hadoop.
   
Best Regards,
Anil
   
On Feb 1, 2012, at 5:06 PM, Mark Kerzner mark.kerz...@shmsoft.com
   wrote:
   
 Anil,

 do you mean one block of HDFS, like 64MB?

 Mark

 On Wed, Feb 1, 2012 at 7:03 PM, Anil Gupta anilgupt...@gmail.com
wrote:

 Do u have enough data to start more than one mapper?
 If entire data is less than a block size then only 1 mapper will
  run.

 Best Regards,
 Anil

 On Feb 1, 2012, at 4:21 PM, Mark Kerzner 
 mark.kerz...@shmsoft.com
wrote:

 Hi,

 I have a simple MR job, and I want each Mapper to get one line
 from
   my
 input file (which contains further instructions for lengthy
processing).
 Each line is 100 characters long, and I tell Hadoop to read only
  100
 bytes,



   
  
 
  job.getConfiguration().setInt("mapreduce.input.linerecordreader.line.maxlength", 100);

 I see that this part works - it reads only one line at a time,
 and
   if I
 change this parameter, it listens.

 However, on a cluster only one node receives all the map tasks.
  Only
one
 map tasks is started. The others never get anything, they just
  wait.
I've
 added 100 seconds wait to the mapper - no change!

 Any advice?

 Thank you. Sincerely,
 Mark

   
  
 



Re: connection between slaves and master

2012-01-10 Thread Praveen Sripati
Mark,

 [mark@node67 ~]$ telnet node77

You need to specify the port number along with the server name like `telnet
node77 1234`.

 2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: localhost/127.0.0.1:12123. Already tried 0 time(s).

Slaves are not able to connect to the master. The configurations `
fs.default.name` and `mapred.job.tracker` should point to the master and
not to localhost when the master and slaves are on different machines.
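
For example (hostname and ports are illustrative), every node's core-site.xml
and mapred-site.xml should carry the master's hostname rather than localhost:

fs.default.name = hdfs://<master-hostname>:<namenode-port>
mapred.job.tracker = <master-hostname>:<jobtracker-port>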

Praveen

On Mon, Jan 9, 2012 at 11:41 PM, Mark question markq2...@gmail.com wrote:

 Hello guys,

  I'm requesting from a PBS scheduler a number of  machines to run Hadoop
 and even though all hadoop daemons start normally on the master and slaves,
 the slaves don't have worker tasks in them. Digging into that, there seems
 to be some blocking between nodes (?). I don't know how to describe it except
 that on a slave, if I telnet to the master node, it should be able to connect,
 but I get this error:

 [mark@node67 ~]$ telnet node77

 Trying 192.168.1.77...
 telnet: connect to address 192.168.1.77: Connection refused
 telnet: Unable to connect to remote host: Connection refused

 The log at the slave nodes shows the same thing, even though it has
 datanode and tasktracker started from the master (?):

 2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: localhost/127.0.0.1:12123. Already tried 0 time(s).
 2012-01-09 10:04:04,439 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: localhost/127.0.0.1:12123. Already tried 1 time(s).
 2012-01-09 10:04:05,442 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: localhost/127.0.0.1:12123. Already tried 2 time(s).
 2012-01-09 10:04:06,444 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: localhost/127.0.0.1:12123. Already tried 3 time(s).
 2012-01-09 10:04:07,446 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: localhost/127.0.0.1:12123. Already tried 4 time(s).
 2012-01-09 10:04:08,448 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: localhost/127.0.0.1:12123. Already tried 5 time(s).
 2012-01-09 10:04:09,450 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: localhost/127.0.0.1:12123. Already tried 6 time(s).
 2012-01-09 10:04:10,452 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: localhost/127.0.0.1:12123. Already tried 7 time(s).
 2012-01-09 10:04:11,454 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: localhost/127.0.0.1:12123. Already tried 8 time(s).
 2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
 to server: localhost/127.0.0.1:12123. Already tried 9 time(s).
 2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.RPC: Server at
 localhost/
 127.0.0.1:12123 not available yet, Z...

  Any suggestions of what I can do?

 Thanks,
 Mark



Re: Allowing multiple users to submit jobs in hadoop 0.20.205 ?

2012-01-03 Thread Praveen Sripati
By default `security.job.submission.protocol.acl` is set to * in the
hadoop-policy.xml, so it will allow any/multiple users to submit/query job
status.

Check this (1) for more details.

  <property>
    <name>security.job.submission.protocol.acl</name>
    <value>*</value>
    <description>ACL for JobSubmissionProtocol, used by job clients to
    communicate with the jobtracker for job submission, querying job status etc.
    The ACL is a comma-separated list of user and group names. The user and
    group list is separated by a blank. For e.g. "alice,bob users,wheel".
    A special value of "*" means all users are allowed.</description>
  </property>

(1) http://hadoop.apache.org/common/docs/r0.20.2/service_level_auth.html

Praveen

On Tue, Jan 3, 2012 at 10:46 AM, praveenesh kumar praveen...@gmail.comwrote:

 Hi,

 How can I allow multiple users to submit jobs in hadoop 0.20.205 ?

 Thanks,
 Praveenesh



Re: Hive starting error

2012-01-03 Thread Praveen Sripati
http://hive.apache.org/releases.html#21+June%2C+2011%3A+release+0.7.1+available

 21 June, 2011: release 0.7.1 available

 This release is the latest release of Hive and it works with Hadoop
0.20.1 and 0.20.2

I don't see the method named in the exception in 0.20.205.

Praveen

On Fri, Dec 30, 2011 at 3:09 PM, praveenesh kumar praveen...@gmail.comwrote:

 Hi,

 I am using Hive 0.7.1 on hadoop 0.20.205
 While running hive. its giving me following error :

 Exception in thread "main" java.lang.NoSuchMethodError:

 org.apache.hadoop.security.UserGroupInformation.login(Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/hadoop/security/UserGroupInformation;
at

 org.apache.hadoop.hive.shims.Hadoop20Shims.getUGIForConf(Hadoop20Shims.java:448)
at

 org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator.setConf(HadoopDefaultAuthenticator.java:51)
at
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
at

 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at

 org.apache.hadoop.hive.ql.metadata.HiveUtils.getAuthenticator(HiveUtils.java:222)
at
 org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:219)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:417)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

 Any Idea on how to resolve this issue ?

 Thanks,
 Praveenesh



Re: Remote access to namenode is not allowed despite the services are already started.

2012-01-02 Thread Praveen Sripati
Changing the VM settings won't help.

Change the value of `fs.default.name` from hdfs://localhost:9000 to
hdfs://106.77.211.187:9000 in core-site.xml for both the client and the
NameNode. Replace that IP address with the IP address (or hostname) of the
node on which the NameNode is running.
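
A minimal client-side sketch once the configuration points at the real
NameNode (the address below is the one from the original post and is only
illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration();                  // picks up fs.default.name from core-site.xml
conf.set("fs.default.name", "hdfs://192.168.1.109:9000");  // or set it explicitly on the client
FileSystem hdfs = FileSystem.get(conf);

Also make sure the NameNode binds to an address reachable from other machines;
a fs.default.name of localhost makes it listen on 127.0.0.1 only, which is what
the netstat output below shows for ports 9000 and 9001.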

Praveen

2012/1/2 Harsh J ha...@cloudera.com

 Woraphol,

 Yes you'd need to tweak some settings in your VMs such that they allow
 remote connections. Could also be a firewall running inside of your
 NameNode instance preventing this. Once you get the telnet working
 after troubleshooting your network settings (I don't know the bullseye
 spot here, sorry), it should be fine after-on.

 2012/1/1   s4510...@hotmail.com:
 
  Dear all,
  I successfully installed and ran Hadoop on a single machine whose IP
  is 192.168.1.109 (in fact it is actually an Ubuntu instance running on
  VirtualBox). Typing jps shows:

  2473 DataNode
  2765 TaskTracker
  3373 Jps
  2361 NameNode
  2588 SecondaryNameNode
  2655 JobTracker

  This should mean that Hadoop is up and running. Running commands
  like ./hadoop fs -ls is fine and produces the expected result.

  But if I try to connect to it from my Windows box whose IP is 192.168.1.80
  by writing Java code against the HDFS API as follows:

  Configuration conf = new Configuration();
  FileSystem hdfs = null;
  Path filenamePath = new Path(FILE_NAME);
  hdfs = FileSystem.get(conf);   <-- the problem occurred at this line

  when I run the code, the error displayed is as follows:

  11/12/07 20:37:24 INFO ipc.Client: Retrying connect to server: /192.168.1.109:9000. Already tried 0 time(s).
  11/12/07 20:37:26 INFO ipc.Client: Retrying connect to server: /192.168.1.109:9000. Already tried 1 time(s).
  11/12/07 20:37:28 INFO ipc.Client: Retrying connect to server: /192.168.1.109:9000. Already tried 2 time(s).
  11/12/07 20:37:30 INFO ipc.Client: Retrying connect to server: /192.168.1.109:9000. Already tried 3 time(s).
  11/12/07 20:37:32 INFO ipc.Client: Retrying connect to server: /192.168.1.109:9000. Already tried 4 time(s).
  11/12/07 20:37:33 INFO ipc.Client: Retrying connect to server: /192.168.1.109:9000. Already tried 5 time(s).
  11/12/07 20:37:35 INFO ipc.Client: Retrying connect to server: /192.168.1.109:9000. Already tried 6 time(s).
  11/12/07 20:37:37 INFO ipc.Client: Retrying connect to server: /192.168.1.109:9000. Already tried 7 time(s).
  11/12/07 20:37:39 INFO ipc.Client: Retrying connect to server: /192.168.1.109:9000. Already tried 8 time(s).
  11/12/07 20:37:41 INFO ipc.Client: Retrying connect to server: /192.168.1.109:9000. Already tried 9 time(s).
  java.net.ConnectException: Call to /192.168.1.109:9000 failed on connection exception: java.net.ConnectException: Connection refused: no further information

  To make sure the socket is already open and waiting for incoming connections
  on the Hadoop server, I ran netstat on the Ubuntu box; the result is as follows:

  tcp  0  0 localhost:51201  *:*  LISTEN  2765/java
  tcp  0  0 *:50020          *:*  LISTEN  2473/java
  tcp  0  0 localhost:9000   *:*  LISTEN  2361/java
  tcp  0  0 localhost:9001   *:*  LISTEN  2655/java
  tcp  0  0 *:mysql          *:*  LISTEN  -
  tcp  0  0 *:50090          *:*  LISTEN  2588/java
  tcp  0  0 *:11211          *:*  LISTEN  -
  tcp  0  0 *:40843          *:*  LISTEN  2473/java
  tcp  0  0 *:58699          *:*  LISTEN  -
  tcp  0  0 *:50060          *:*  LISTEN  2765/java
  tcp  0  0 *:50030          *:*  LISTEN  2655/java
  tcp  0  0 *:53966          *:*  LISTEN  2655/java
  tcp  0  0 *:www            *:*  LISTEN  -
  tcp  0  0 *:epmd           *:*  LISTEN  -
  tcp  0  0 *:55826          *:*  LISTEN  2588/java
  tcp  0  0 *:ftp            *:*  LISTEN  -
  tcp  0  0 *:50070          *:*  LISTEN  2361/java
  tcp  0  0 *:52822          *:*  LISTEN  2361/java
  tcp  0  0 *:ssh            *:*  LISTEN  -
  tcp  0  0 *:55672          *:*  LISTEN  -
  tcp  0  0 *:50010          *:*  LISTEN  2473/java
  tcp  0  0 *:50075          *:*  LISTEN  2473/java

  I noticed that if the local address column is something like localhost:9000 

Re: Hadoop MySQL database access

2011-12-29 Thread Praveen Sripati
Check the `mapreduce.job.reduce.slowstart.completedmaps` parameter. The
reducers cannot start processing the data from the mappers until all the map
tasks are complete, but the reducers can start fetching the data from the
nodes on which the map tasks have completed.
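
For example, to let the reducers start their shuffle/copy phase once half of
the maps have finished (a sketch; `job` is assumed to be the
org.apache.hadoop.mapreduce.Job being submitted, and on older releases the
equivalent key is mapred.reduce.slowstart.completed.maps):

job.getConfiguration().setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.50f);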

Praveen

On Thu, Dec 29, 2011 at 12:44 AM, Prashant Kommireddi
prash1...@gmail.comwrote:

 By design reduce would start only after all the maps finish. There is
 no way for the reduce to begin grouping/merging by key unless all the
 maps have finished.

 Sent from my iPhone

 On Dec 28, 2011, at 8:53 AM, JAGANADH G jagana...@gmail.com wrote:

  Hi All,
 
  I wrote a map reduce program to fetch data from MySQL and process the
  data(word count).
  The program executes successfully . But I noticed that the reduce task
  starts after finishing the map task only .
  Is there any way to run the map and reduce in parallel.
 
  The program fetches data from MySQL and writes the processed output to
  hdfs.
  I am using hadoop in pseduo-distributed mode .
  --
  **
  JAGANADH G
  http://jaganadhg.in
  *ILUGCBE*
  http://ilugcbe.org.in



Re: Automate Hadoop installation

2011-12-06 Thread Praveen Sripati
Also, check out Ambari (http://incubator.apache.org/ambari/), which is still
in Apache Incubator status. How do Ambari and Puppet compare?

Regards,
Praveen

On Tue, Dec 6, 2011 at 1:00 PM, alo alt wget.n...@googlemail.com wrote:

 Hi,

 to deploy software I suggest pulp:
 https://fedorahosted.org/pulp/wiki/HowTo

 For a package-based distro (Debian, Red Hat, CentOS) you can build Apache's
 Hadoop, package it and deploy it. Configs, as Cos says, go over Puppet. If you
 use Red Hat / CentOS, take a look at Spacewalk.

 best,
  Alex


 On Mon, Dec 5, 2011 at 8:20 PM, Konstantin Boudnik c...@apache.org wrote:

  There's that great project called BigTop (in the Apache Incubator) which
  provides for building the Hadoop stack.
 
  The part of what it provides is a set of Puppet recipes which will allow
  you
  to do exactly what you're looking for with perhaps some minor
 corrections.
 
  Seriously, look at Puppet - otherwise it will be like living through a
  nightmare of configuration mismanagement.
 
  Cos
 
  On Mon, Dec 05, 2011 at 04:02PM, praveenesh kumar wrote:
   Hi all,
  
   Can anyone guide me how to automate the hadoop
 installation/configuration
   process?
   I want to install hadoop on 10-20 nodes which may even exceed to 50-100
   nodes ?
   I know we can use some configuration tools like puppet/or
 shell-scripts ?
   Has anyone done it ?
  
    How can we do Hadoop installations on so many machines in parallel? What
    are the best practices for this?
  
   Thanks,
   Praveenesh
 



 --
 Alexander Lorenz
 http://mapredit.blogspot.com




Re: Multiple Mappers for Multiple Tables

2011-12-06 Thread Praveen Sripati
MultipleInputs takes multiple Paths (files) as input, not databases. As
mentioned earlier, export the tables into HDFS either using Sqoop or a native
DB export tool and then do the processing. Sqoop is configured to use the
native DB export tool whenever possible.
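
To illustrate why it is file-oriented, a minimal new-API sketch (the `job`
object, the paths and the TableAMapper/TableBMapper classes are made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// each input is an HDFS path (e.g. a Sqoop export of one table) with its own mapper
MultipleInputs.addInputPath(job, new Path("/exported/tableA"),
    TextInputFormat.class, TableAMapper.class);
MultipleInputs.addInputPath(job, new Path("/exported/tableB"),
    TextInputFormat.class, TableBMapper.class);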

Regards,
Praveen

On Tue, Dec 6, 2011 at 3:44 AM, Justin Vincent justi...@gmail.com wrote:

 Thanks Bejoy,
 I was looking at DBInputFormat with MultipleInputs. MultipleInputs takes a
 Path parameter. Are these paths just ignored here?

 On Mon, Dec 5, 2011 at 2:31 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

  Hi Justin,
 Just to add on to my response. If you need to fetch data from
  rdbms on your mapper using your custom mapreduce code you can use the
  DBInputFormat in your mapper class with MultipleInputs. You have to be
   careful in choosing the number of mappers for your application, as DBs are
   constrained by a limit on the maximum number of simultaneous connections.
   Also you need to ensure that the same query is not executed n times in n
   mappers all fetching the same data; that would just be a waste of network.
   Sqoop + Hive would be my recommendation and a good combination for such use
   cases. If you have Pig competency you can also look into Pig instead of
   Hive.
 
  Hope it helps!...
 
  Regards
  Bejoy.K.S
 
  On Tue, Dec 6, 2011 at 1:36 AM, Bejoy Ks bejoy.had...@gmail.com wrote:
 
   Justin
   If I get your requirement right you need to get in data from
   multiple rdbms sources and do a join on the same, also may be some more
   custom operations on top of this. For this you don't need to go in for
   writing your custom mapreduce code unless it is that required. You can
   achieve the same in two easy steps
   - Import data from RDBMS into Hive using SQOOP (Import)
   - Use hive to do some join and processing on this data
  
   Hope it helps!..
  
   Regards
   Bejoy.K.S
  
  
   On Tue, Dec 6, 2011 at 12:13 AM, Justin Vincent justi...@gmail.com
  wrote:
  
   I would like join some db tables, possibly from different databases,
 in
  a
   MR job.
  
   I would essentially like to use MultipleInputs, but that seems file
   oriented. I need a different mapper for each db table.
  
   Suggestions?
  
   Thanks!
  
   Justin Vincent
  
  
  
 



Re: Running a job continuously

2011-12-06 Thread Praveen Sripati
If the requirement is real-time data processing, using Flume alone
will not suffice, as there is a time lag between the collection of files
by Flume and the processing done by Hadoop. Consider frameworks like S4,
Storm (from Twitter), HStreaming etc., which suit real-time processing.

Regards,
Praveen

On Tue, Dec 6, 2011 at 10:39 AM, Ravi teja ch n v
raviteja.c...@huawei.comwrote:

 Hi Burak,

 Bejoy Ks, i have a continuous inflow of data but i think i need a near
 real-time system.

 Just to add to Bejoy's point,
 with Oozie, you can specify the data dependency for running your job.
 When specific amount of data is in, your can configure Oozie to run your
 job.
 I think this will suffice your requirement.

 Regards,
 Ravi Teja

 
 From: burakkk [burak.isi...@gmail.com]
 Sent: 06 December 2011 04:03:59
 To: mapreduce-u...@hadoop.apache.org
 Cc: common-user@hadoop.apache.org
 Subject: Re: Running a job continuously

 Athanasios Papaoikonomou, a cron job isn't useful for me, because I want to
 execute the MR job with the same algorithm but different files have different
 velocity.

 Both Storm and Facebook's Hadoop are designed for that. But I want to use the
 Apache distribution.

 Bejoy Ks, I have a continuous inflow of data but I think I need a near
 real-time system.

 Mike Spreitzer, both output and input are continuous. The output isn't relevant
 to the input. All I want is that all the incoming files are processed by
 the same job and the same algorithm.
 For example, think about the word count problem. When you want to run word
 count, you implement this:
 http://wiki.apache.org/hadoop/WordCount

 But when the program reaches the line job.waitForCompletion(true);, the job
 eventually ends. If you want to make it run continuously, what would you do in
 Hadoop without other tools?
 One more thing: assume that the input file's name is
 filename_timestamp (filename_20111206_0030)

 public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();
     Job job = new Job(conf, "wordcount");
     job.setOutputKeyClass(Text.class);
     job.setOutputValueClass(IntWritable.class);
     job.setMapperClass(Map.class);
     job.setReducerClass(Reduce.class);
     job.setInputFormatClass(TextInputFormat.class);
     job.setOutputFormatClass(TextOutputFormat.class);
     FileInputFormat.addInputPath(job, new Path(args[0]));
     FileOutputFormat.setOutputPath(job, new Path(args[1]));
     job.waitForCompletion(true);
 }

 On Mon, Dec 5, 2011 at 11:19 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

  Burak
  If you have a continuous inflow of data, you can choose Flume to
  aggregate the files into larger sequence files or so if they are small, and
  push that data on to HDFS when you have a substantial chunk of data (equal
  to the HDFS block size). Based on your SLAs you need to schedule your
  jobs using Oozie or a simple shell script. In very simple terms:
  - push input data (could be from flume collector) into a staging hdfs dir
  - before triggering the job(hadoop jar) copy the input from staging to
  main input dir
  - execute the job
  - archive the input and output into archive dirs(any other dirs).
 - the output archive dir could be source of output data
  - delete output dir and empty input dir
 
  Hope it helps!...
 
  Regards
  Bejoy.K.S
 
  On Tue, Dec 6, 2011 at 2:19 AM, burakkk burak.isi...@gmail.com wrote:
 
  Hi everyone,
  I want to run a MR job continuously. Because i have streaming data and i
  try to analyze it all the time in my way(algorithm). For example you
 want
  to solve wordcount problem. It's the simplest one :) If you have some
  multiple files and the new files are keep going, how do you handle it?
  You could execute a MR job per one file but you have to do it repeatly.
 So
  what do you think?
 
  Thanks
  Best regards...
 
  --
 
  BURAK ISIKLI | http://burakisikli.wordpress.com
 
 
 


 --

 BURAK ISIKLI | http://burakisikli.wordpress.com



Re: Availability of Job traces or logs

2011-12-04 Thread Praveen Sripati
Arun,

I want to control the split placements.

InputSplits are logical divisions of the input data, so there is no real
placement of the InputSplits themselves. InputSplits are calculated on the
client by the InputFormat class when a job is submitted, and the InputSplit
metadata is put in HDFS to be fetched later.

Each InputSplit is processed by a map task. The Hadoop framework makes sure
that the task and the InputSplit it processes are as close as possible, to
avoid any overhead.
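
A rough sketch of the only placement "control" a split carries, namely the
location hints returned by getLocations(); the path, length and host names
below are made up:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// a custom InputFormat can only *hint* where a split should run, via the hosts argument
FileSplit split = new FileSplit(new Path("/data/part-00000"), 0L, 64L * 1024 * 1024,
    new String[] { "node12", "node47" });

The scheduler treats those hosts as preferences, not guarantees.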

MAPREDUCE-207 is for moving the calculation of the InputSplits from the
client to the cluster, but I don't see any progress in it.

BTW, what is the new scheduler about?

Regards,
Praveen

On Sun, Dec 4, 2011 at 10:19 AM, ArunKumar arunk...@gmail.com wrote:

 Amar,

 I am attempting to write a new scheduler for Hadoop and test it using
 Mumak.

 1 I want to test its behaviour under different size of jobs traces(meaning
 number of jobs say 5,10,25,50,100) under different number of nodes.
 Till now i was using only the test/data given by mumak which has 19 jobs
 and 1529 node topology.
 I don' have many nodes with me to run some programs and collect logs and
 use Rumen to generate traces.

 2 I want to control the split placements so i need to modify preferred
 locations for task attempts in the trace but the trace for even 19 jobs is
 huge. So, I was thinking whether i can get a small, medium and large number
 of Job traces with corresponding topology trace so that modifying will be
 easier.


 Arun


 On Sat, Dec 3, 2011 at 1:15 PM, Amar Kamat [via Lucene] 
 ml-node+s472066n3556710...@n3.nabble.com wrote:

  Arun,
  You can very well run synthetic workloads like large scale sort,
 wordcount
  etc or more realistic workloads like PigMix (
  https://cwiki.apache.org/confluence/display/PIG/PigMix). On a decent
  enough cluster, these workloads work pretty well. Is there a specific
  reason why you want traces of varied sizes from various organizations?
 
   How can i make sure that the rumen generates only say 25 jobs,50 jobs
 or
  so
  Do you want to get 25/50 jobs based on some filtering criterion? I
  recently faced a similar situation where I wanted to extract jobs from a
  Rumen trace based on job ids. I will be happy to share these filtering
  tools.
 
  Amar
 
 
  On 12/1/11 8:48 AM, ArunKumar [hidden email] wrote:
 
  Hi guys !
 
  Apart from generating the job traces from RUMEN , can i get logs or job
  traces of varied sizes from some organizations.
 
  How can i make sure that the rumen generates only say 25 jobs,50 jobs or
  so
  ?
 
 
  Thanks,
  Arun
 
 
 
 
 
 





Re: How do I programmatically get total job execution time?

2011-12-02 Thread Praveen Sripati
Hi,

Ran a job using the new MR API in standalone mode on 0.21. Both
Job#getFinishTime and Job#getStartTime return 0. Not sure if this
is a bug.
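
The sort-and-pick workaround Harsh describes below can be sketched with the
old JobClient API (a rough sketch: the job id is made up, and I am assuming
TaskReport exposes getStartTime()/getFinishTime() in the version in use):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.TaskReport;

JobClient client = new JobClient(new JobConf());
JobID id = JobID.forName("job_201112021530_0001");        // made-up job id
long start = Long.MAX_VALUE, finish = 0L;
for (TaskReport[] reports : new TaskReport[][] {
        client.getMapTaskReports(id), client.getReduceTaskReports(id) }) {
    for (TaskReport report : reports) {
        start = Math.min(start, report.getStartTime());    // earliest task start
        finish = Math.max(finish, report.getFinishTime()); // latest task finish
    }
}
System.out.println("Approximate job wall-clock time (ms): " + (finish - start));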

Thanks,
Praveen

On Sat, Dec 3, 2011 at 6:14 AM, Raj V rajv...@yahoo.com wrote:

 As Harsh said, I don't think there is a simple way to find when the
 job ended, especially after the job has completed.

 But can't you just wait for your job to complete and log the time when the
 job completed?

 Raj



 
  From: Harsh J ha...@cloudera.com
 To: common-user@hadoop.apache.org
 Sent: Friday, December 2, 2011 12:53 PM
 Subject: Re: How do I programmatically get total job execution time?
 
 I remember hitting this once in 0.20 - seems like an API limitation. The
 resolution we took back then was to get a list of all tasks, and get the
 end time with the last ended task's completion time (sort and pick). There
 may be other ways though - others can comment on that perhaps (metrics?
 job-history?)
 
 On 02-Dec-2011, at 11:27 PM, W.P. McNeill wrote:
 
  After my Hadoop job has successfully completed I'd like to log the total
   amount of time it took. This is the "Finished in" statistic in the web
   UI.
  How do I get this number programmatically? Is there some way I can query
  the Job object? I didn't see anything in the API documentation.
 
 
 
 
 



Re: Binary content

2011-09-02 Thread Praveen Sripati
Mohit,

Hadoop: The Definitive Guide (Chapter 3 - Hadoop I/O) has a section on
SequenceFile and is worth reading.

http://oreilly.com/catalog/9780596521981

Thanks,
Praveen

On Thu, Sep 1, 2011 at 9:15 PM, Owen O'Malley o...@hortonworks.com wrote:

 On Thu, Sep 1, 2011 at 8:37 AM, Mohit Anchlia mohitanch...@gmail.com
 wrote:

 Thanks! Is there a specific tutorial I can focus on to see how it could be
  done?
 

 Take the word count example and change its output format to be
 SequenceFileOutputFormat.

 job.setOutputFormatClass(SequenceFileOutputFormat.class);

 and it will generate SequenceFiles instead of text. There is
 SequenceFileInputFormat for reading.
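
Spelling that out a little (a sketch; `job` and `downstreamJob` are assumed to
be the new-API Job objects of the producing and consuming jobs):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// word count writing binary <Text, IntWritable> records instead of text
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);

// a downstream job can read those files back with matching key/value types
downstreamJob.setInputFormatClass(SequenceFileInputFormat.class);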

 -- Owen



Eclipse Hadoop Plugin Error creating New Hadoop location ....

2011-05-30 Thread Praveen Sripati
Hi,

I am trying to run Hadoop from Eclipse using the Eclipse Hadoop Plugin and
stuck with the following problem.

First copied the hadoop-0.21.0-eclipse-plugin.jar to the Eclipse Plugin
folder, started eclipse and switched to the Map/Reduce perspective. In the
Map/Reduce Locations view, when I try to add a New Hadoop location,
the following error appears in the Eclipse Error Log. The version of Eclipse
is Helios Service Release 2.

Message : Unhandled event loop exception

Exception Stack Trace :

java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
at
org.apache.hadoop.eclipse.server.HadoopServer.init(HadoopServer.java:223)
at
org.apache.hadoop.eclipse.servers.HadoopLocationWizard.init(HadoopLocationWizard.java:88)
at
org.apache.hadoop.eclipse.actions.NewLocationAction$1.init(NewLocationAction.java:41)
at
org.apache.hadoop.eclipse.actions.NewLocationAction.run(NewLocationAction.java:40)
at org.eclipse.jface.action.Action.runWithEvent(Action.java:498)
at
org.eclipse.jface.action.ActionContributionItem.handleWidgetSelection(ActionContributionItem.java:584)
at
org.eclipse.jface.action.ActionContributionItem.access$2(ActionContributionItem.java:501)
at
org.eclipse.jface.action.ActionContributionItem$5.handleEvent(ActionContributionItem.java:411)
at org.eclipse.swt.widgets.EventTable.sendEvent(EventTable.java:84)
at org.eclipse.swt.widgets.Widget.sendEvent(Widget.java:1053)
at org.eclipse.swt.widgets.Display.runDeferredEvents(Display.java:4066)
at org.eclipse.swt.widgets.Display.readAndDispatch(Display.java:3657)
at org.eclipse.ui.internal.Workbench.runEventLoop(Workbench.java:2640)
at org.eclipse.ui.internal.Workbench.runUI(Workbench.java:2604)
at org.eclipse.ui.internal.Workbench.access$4(Workbench.java:2438)
at org.eclipse.ui.internal.Workbench$7.run(Workbench.java:671)
at
org.eclipse.core.databinding.observable.Realm.runWithDefault(Realm.java:332)
at
org.eclipse.ui.internal.Workbench.createAndRunWorkbench(Workbench.java:664)
at org.eclipse.ui.PlatformUI.createAndRunWorkbench(PlatformUI.java:149)
at
org.eclipse.ui.internal.ide.application.IDEApplication.start(IDEApplication.java:115)
at
org.eclipse.equinox.internal.app.EclipseAppHandle.run(EclipseAppHandle.java:196)
at
org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.runApplication(EclipseAppLauncher.java:110)
at
org.eclipse.core.runtime.internal.adaptor.EclipseAppLauncher.start(EclipseAppLauncher.java:79)
at
org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:369)
at
org.eclipse.core.runtime.adaptor.EclipseStarter.run(EclipseStarter.java:179)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.eclipse.equinox.launcher.Main.invokeFramework(Main.java:620)
at org.eclipse.equinox.launcher.Main.basicRun(Main.java:575)
at org.eclipse.equinox.launcher.Main.run(Main.java:1408)
Caused by: java.lang.ClassNotFoundException:
org.apache.hadoop.conf.Configuration
at
org.eclipse.osgi.internal.loader.BundleLoader.findClassInternal(BundleLoader.java:506)
at
org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:422)
at
org.eclipse.osgi.internal.loader.BundleLoader.findClass(BundleLoader.java:410)
at
org.eclipse.osgi.internal.baseadaptor.DefaultClassLoader.loadClass(DefaultClassLoader.java:107)
at java.lang.ClassLoader.loadClass(Unknown Source)
... 32 more

When I used the hadoop-0.20.2-eclipse-plugin.jar and
hadoop-eclipse-plugin-0.20.203.0.jar, none of the views appear in the
Map/Reduce perspective, and there is no corresponding view under
Window -> Show View -> Other either.

Thanks,
Praveen


Hadoop Jar Files

2011-05-30 Thread Praveen Sripati
Hi,

I have extracted the hadoop-0.20.2, hadoop-0.20.203.0 and hadoop-0.21.0
files.

The hadoop-0.21.0 folder contains the hadoop-hdfs-0.21.0.jar,
hadoop-mapred-0.21.0.jar and hadoop-common-0.21.0.jar files.
But in the hadoop-0.20.2 and hadoop-0.20.203.0 releases the same files
are missing.

Have the jar files been packaged differently in the 0.20.2 and 0.20.203.0
releases or should I get these jars from some other projects?

Thanks,
Praveen