Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

2012-05-02 Thread Nitin Pawar
I can think of the following options:

1) Write a simple get-and-put job that reads the data from the old DFS and
loads it into the new DFS (a rough sketch is below)
2) Check whether distcp between the two versions is compatible
3) This is what I did (my data was only a few hundred GB): a dfs
-copyToLocal on the old cluster, then a copyFromLocal on the new grid
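
For option 1, a minimal sketch of such a get-and-put client is below, assuming
the amount of data is modest. The namenode URIs and paths are placeholders,
and note that a single client may not be able to talk to both clusters if
their RPC versions differ (which is one reason distcp is usually preferred for
cross-version copies).

 import java.net.URI;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.FileUtil;
 import org.apache.hadoop.fs.Path;

 public class CopyBetweenClusters {
   public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();

     // Source: the old 0.20.205 cluster; destination: the new CDH3u3 cluster.
     // Namenode hosts/ports below are placeholders.
     FileSystem srcFs = FileSystem.get(URI.create("hdfs://old-nn:8020"), conf);
     FileSystem dstFs = FileSystem.get(URI.create("hdfs://new-nn:8020"), conf);

     Path src = new Path("/user/data");
     Path dst = new Path("/user/data");

     // FileUtil.copy streams the files through this client, so it is only
     // practical for modest amounts of data.
     FileUtil.copy(srcFs, src, dstFs, dst, false /* don't delete source */, conf);

     srcFs.close();
     dstFs.close();
   }
 }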

On Thu, May 3, 2012 at 11:41 AM, Austin Chungath  wrote:

> Hi,
> I am migrating from Apache hadoop 0.20.205 to CDH3u3.
> I don't want to lose the data that is in the HDFS of Apache hadoop
> 0.20.205.
> How do I migrate to CDH3u3 but keep the data that I have on 0.20.205.
> What is the best practice/ techniques to do this?
>
> Thanks & Regards,
> Austin
>



-- 
Nitin Pawar


Re: i can not send mail to common-user

2012-05-03 Thread Nitin Pawar
this one just came in :)

2012/5/3 JunYong Li 

> --
> Regards
> Junyong
>



-- 
Nitin Pawar


Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

2012-05-03 Thread Nitin Pawar
You could look at distcp:

http://hadoop.apache.org/common/docs/r0.20.0/distcp.html

but this assumes that you have two separate clusters available to do
the migration.

On Thu, May 3, 2012 at 12:51 PM, Austin Chungath  wrote:

> Thanks for the suggestions,
> My concerns are that I can't actually copyToLocal from the dfs because the
> data is huge.
>
> Say if my hadoop was 0.20 and I am upgrading to 0.20.205 I can do a
> namenode upgrade. I don't have to copy data out of dfs.
>
> But here I am having Apache hadoop 0.20.205 and I want to use CDH3 now,
> which is based on 0.20
> Now it is actually a downgrade as 0.20.205's namenode info has to be used
> by 0.20's namenode.
>
> Any idea how I can achieve what I am trying to do?
>
> Thanks.
>
> On Thu, May 3, 2012 at 12:23 PM, Nitin Pawar  >wrote:
>
> > i can think of following options
> >
> > 1) write a simple get and put code which gets the data from DFS and loads
> > it in dfs
> > 2) see if the distcp  between both versions are compatible
> > 3) this is what I had done (and my data was hardly few hundred GB) ..
> did a
> > dfs -copyToLocal and then in the new grid did a copyFromLocal
> >
> > On Thu, May 3, 2012 at 11:41 AM, Austin Chungath 
> > wrote:
> >
> > > Hi,
> > > I am migrating from Apache hadoop 0.20.205 to CDH3u3.
> > > I don't want to lose the data that is in the HDFS of Apache hadoop
> > > 0.20.205.
> > > How do I migrate to CDH3u3 but keep the data that I have on 0.20.205.
> > > What is the best practice/ techniques to do this?
> > >
> > > Thanks & Regards,
> > > Austin
> > >
> >
> >
> >
> > --
> > Nitin Pawar
> >
>



-- 
Nitin Pawar


Re: How to add debugging to map- red code

2012-05-04 Thread Nitin Pawar
Here is sample code from the log4j documentation.
If you want to write the log to a specific file, you can provide a log4j
properties file and add it to the classpath.

 import com.foo.Bar;

 // Import log4j classes.
 import org.apache.log4j.Logger;
 import org.apache.log4j.BasicConfigurator;

 public class MyApp {

   // Define a static logger variable so that it references the
   // Logger instance named "MyApp".
   static Logger logger = Logger.getLogger(MyApp.class);

   public static void main(String[] args) {

     // Set up a simple configuration that logs on the console.
     BasicConfigurator.configure();

     logger.info("Entering application.");
     Bar bar = new Bar();
     bar.doIt();
     logger.info("Exiting application.");
   }
 }
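
In a map or reduce task you normally do not need BasicConfigurator, since the
task JVM already has log4j configured; a more typical pattern for the question
in this thread is to set the logger level from a job property in setup(), as
Harsh describes in option (2) further down. A minimal sketch, with the
property name my.job.log.level chosen purely for illustration:

 import java.io.IOException;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.log4j.Level;
 import org.apache.log4j.Logger;

 public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

   private static final Logger LOG = Logger.getLogger(MyMapper.class);

   @Override
   protected void setup(Context context) throws IOException, InterruptedException {
     // Pick the level from a job property, e.g. -D my.job.log.level=DEBUG
     Configuration conf = context.getConfiguration();
     LOG.setLevel(Level.toLevel(conf.get("my.job.log.level", "INFO")));
   }

   @Override
   protected void map(LongWritable key, Text value, Context context)
       throws IOException, InterruptedException {
     LOG.debug("processing record at offset " + key);  // only logged when DEBUG is enabled
     context.write(new Text("k"), value);
   }
 }

This way you only change a -D option on the command line to turn debug output
on, instead of recompiling.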


On Sat, May 5, 2012 at 3:40 AM, Mapred Learn  wrote:

> Hi Harsh,
> Could you show one sample of how to do this ?
>
> I have not seen/written  any mapper code where people use log4j logger or
> log4j file to set the log level.
>
> Thanks in advance
> -JJ
>
> On Thu, May 3, 2012 at 4:32 PM, Harsh J  wrote:
>
> > Doing (ii) would be an isolated app-level config and wouldn't get
> > affected by the toggling of
> > (i). The feature from (i) is available already in CDH 4.0.0-b2 btw.
> >
> > On Fri, May 4, 2012 at 4:58 AM, Mapred Learn 
> > wrote:
> > > Hi Harsh,
> > >
> > > Does doing (ii) mess up with hadoop (i) level ?
> > >
> > > Or does it happen in both the options anyways ?
> > >
> > >
> > > Thanks,
> > > -JJ
> > >
> > > On Fri, Apr 20, 2012 at 8:28 AM, Harsh J  wrote:
> > >
> > >> Yes this is possible, and there's two ways to do this.
> > >>
> > >> 1. Use a distro/release that carries the
> > >> https://issues.apache.org/jira/browse/MAPREDUCE-336 fix. This will
> let
> > >> you avoid work (see 2, which is same as your idea)
> > >>
> > >> 2. Configure your implementation's logger object's level in the
> > >> setup/setConf methods of the task, by looking at some conf prop to
> > >> decide the level. This will work just as well - and will also avoid
> > >> changing Hadoop's own Child log levels, unlike the (1) method.
> > >>
> > >> On Fri, Apr 20, 2012 at 8:47 PM, Mapred Learn  >
> > >> wrote:
> > >> > Hi,
> > >> > I m trying to find out best way to add debugging in map- red code.
> > >> > I have System.out.println() statements that I keep on commenting and
> > >> uncommenting so as not to increase stdout size
> > >> >
> > >> > But problem is anytime I need debug, I Hv to re-compile.
> > >> >
> > >> > If there a way, I can define log levels using log4j in map-red code
> > and
> > >> define log level as conf option ?
> > >> >
> > >> > Thanks,
> > >> > JJ
> > >> >
> > >> > Sent from my iPhone
> > >>
> > >>
> > >>
> > >> --
> > >> Harsh J
> > >>
> >
> >
> >
> > --
> > Harsh J
> >
>



-- 
Nitin Pawar


Re: Best practice to migrate HDFS from 0.20.205 to CDH3u3

2012-05-07 Thread Nitin Pawar
> >> Mike Segel
> >>
> >> On May 3, 2012, at 11:26 AM, Suresh Srinivas 
> >> wrote:
> >>
> >> > This probably is a more relevant question in CDH mailing lists. That
> >> said,
> >> > what Edward is suggesting seems reasonable. Reduce replication factor,
> >> > decommission some of the nodes and create a new cluster with those
> nodes
> >> > and do distcp.
> >> >
> >> > Could you share with us the reasons you want to migrate from Apache
> 205?
> >> >
> >> > Regards,
> >> > Suresh
> >> >
> >> > On Thu, May 3, 2012 at 8:25 AM, Edward Capriolo <
> edlinuxg...@gmail.com
> >> >wrote:
> >> >
> >> >> Honestly that is a hassle, going from 205 to cdh3u3 is probably more
> >> >> or a cross-grade then an upgrade or downgrade. I would just stick it
> >> >> out. But yes like Michael said two clusters on the same gear and
> >> >> distcp. If you are using RF=3 you could also lower your replication
> to
> >> >> rf=2 'hadoop dfs -setrepl 2' to clear headroom as you are moving
> >> >> stuff.
> >> >>
> >> >>
> >> >> On Thu, May 3, 2012 at 7:25 AM, Michel Segel <
> >> michael_se...@hotmail.com>
> >> >> wrote:
> >> >>> Ok... When you get your new hardware...
> >> >>>
> >> >>> Set up one server as your new NN, JT, SN.
> >> >>> Set up the others as a DN.
> >> >>> (Cloudera CDH3u3)
> >> >>>
> >> >>> On your existing cluster...
> >> >>> Remove your old log files, temp files on HDFS anything you don't
> need.
> >> >>> This should give you some more space.
> >> >>> Start copying some of the directories/files to the new cluster.
> >> >>> As you gain space, decommission a node, rebalance, add node to new
> >> >> cluster...
> >> >>>
> >> >>> It's a slow process.
> >> >>>
> >> >>> Should I remind you to make sure you up you bandwidth setting, and
> to
> >> >> clean up the hdfs directories when you repurpose the nodes?
> >> >>>
> >> >>> Does this make sense?
> >> >>>
> >> >>> Sent from a remote device. Please excuse any typos...
> >> >>>
> >> >>> Mike Segel
> >> >>>
> >> >>> On May 3, 2012, at 5:46 AM, Austin Chungath 
> >> wrote:
> >> >>>
> >> >>>> Yeah I know :-)
> >> >>>> and this is not a production cluster ;-) and yes there is more
> >> hardware
> >> >>>> coming :-)
> >> >>>>
> >> >>>> On Thu, May 3, 2012 at 4:10 PM, Michel Segel <
> >> michael_se...@hotmail.com
> >> >>> wrote:
> >> >>>>
> >> >>>>> Well, you've kind of painted yourself in to a corner...
> >> >>>>> Not sure why you didn't get a response from the Cloudera lists,
> but
> >> >> it's a
> >> >>>>> generic question...
> >> >>>>>
> >> >>>>> 8 out of 10 TB. Are you talking effective storage or actual disks?
> >> >>>>> And please tell me you've already ordered more hardware.. Right?
> >> >>>>>
> >> >>>>> And please tell me this isn't your production cluster...
> >> >>>>>
> >> >>>>> (Strong hint to Strata and Cloudea... You really want to accept my
> >> >>>>> upcoming proposal talk... ;-)
> >> >>>>>
> >> >>>>>
> >> >>>>> Sent from a remote device. Please excuse any typos...
> >> >>>>>
> >> >>>>> Mike Segel
> >> >>>>>
> >> >>>>> On May 3, 2012, at 5:25 AM, Austin Chungath 
> >> >> wrote:
> >> >>>>>
> >> >>>>>> Yes. This was first posted on the cloudera mailing list. There
> >> were no
> >> >>>>>> responses.
> >> >>>>>>
> >> >>>>>> But this is not related to cloudera as such.
> >> >>>>>

Re: How to configure application for Eternal jar

2012-05-26 Thread Nitin Pawar
Maybe this is what you are looking for:

http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
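
One approach that post describes is passing the extra jars with -libjars,
which requires the driver to go through ToolRunner/GenericOptionsParser so the
option is parsed. A rough sketch of such a driver is below; the class names
are placeholders and the mapper/reducer setup is omitted:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.conf.Configured;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
 import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;

 public class MyDriver extends Configured implements Tool {

   public int run(String[] args) throws Exception {
     // getConf() already contains whatever -libjars / -D options were parsed.
     Job job = new Job(getConf(), "job-with-external-jars");
     job.setJarByClass(MyDriver.class);
     // mapper/reducer setup omitted for brevity
     FileInputFormat.addInputPath(job, new Path(args[0]));
     FileOutputFormat.setOutputPath(job, new Path(args[1]));
     return job.waitForCompletion(true) ? 0 : 1;
   }

   public static void main(String[] args) throws Exception {
     // ToolRunner lets you run, for example:
     //   hadoop jar myapp.jar MyDriver -libjars /path/to/hive-jdbc.jar in out
     // without touching the cluster's lib folder or restarting anything.
     System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
   }
 }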

On Sat, May 26, 2012 at 6:18 PM, samir das mohapatra <
samir.help...@gmail.com> wrote:

> Hi All,
>  How to configure the external jar , which is use by application
> internally.
>   For eample:
>  JDBC ,Hive Driver etc.
>
> Note:- I dont have  permission to start and stop the hadoop machine.
>  So  I need to configure application level (Not hadoop level )
>
>  If we will put jar inside the lib folder of the hadoop then i think we
> need to re-start the hadoop
>  without this,  is there any other way  to do so.
>
>
>
>
> Thanks
> samir
>



-- 
Nitin Pawar


Re: Splunk + Hadoop

2012-05-28 Thread Nitin Pawar
Hi Shreya,

If you are looking at data locality, you may or may not be able to use Hadoop
out of the box. It all depends on how you design the data layout on top of
HDFS and how you implement search for the customer queries.

A good approach might be to put a queryable hop-in database like MySQL in
between, where you store the results of the data processed on Hadoop, and
then use Solr for fast access and search.

Thanks,
Nitin

On Mon, May 28, 2012 at 12:41 PM,  wrote:

> Hi Abhishek,
>
> I am looking for a scenario where the customer representative needs to
> respond back to the customers on call.
> They need to search on huge data and then respond back in few seconds.
>
> Thanks and Regards,
> Shreya Pal
> Architect Technology
> Cognizant Technology Pvt Ltd
> Vnet - 205594
> Mobile - +91-9766310680
>
>
> -Original Message-
> From: Abhishek Pratap Singh [mailto:manu.i...@gmail.com]
> Sent: Tuesday, May 22, 2012 2:44 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Splunk + Hadoop
>
> I have used Hadoop and Splunk both. Can you please let me know what is
> your requirement?
> Real time processing with hadoop depends upon What defines "Real time" in
> particular scenario. Based on requirement, Real time (near real time) can
> be achieved.
>
> ~Abhishek
>
> On Fri, May 18, 2012 at 3:58 PM, Russell Jurney  >wrote:
>
> > Because that isn't Cube.
> >
> > Russell Jurney
> > twitter.com/rjurney
> > russell.jur...@gmail.com
> > datasyndrome.com
> >
> > On May 18, 2012, at 2:01 PM, Ravi Shankar Nair
> >  wrote:
> >
> > > Why not Hbase with Hadoop?
> > > It's a best bet.
> > > Rgds, Ravi
> > >
> > > Sent from my Beethoven
> > >
> > >
> > > On May 18, 2012, at 3:29 PM, Russell Jurney
> > > 
> > wrote:
> > >
> > >> I'm playing with using Hadoop and Pig to load MongoDB with data for
> > Cube to
> > >> consume. Cube <https://github.com/square/cube/wiki> is a realtime
> > tool...
> > >> but we'll be replaying events from the past.  Does that count?  It
> > >> is
> > nice
> > >> to batch backfill metrics into 'real-time' systems in bulk.
> > >>
> > >> On Fri, May 18, 2012 at 12:11 PM,  wrote:
> > >>
> > >>> Hi ,
> > >>>
> > >>> Has anyone used Hadoop and splunk, or any other real-time
> > >>> processing
> > tool
> > >>> over Hadoop?
> > >>>
> > >>> Regards,
> > >>> Shreya
> > >>>
> > >>>
> > >>>
> > >>> This e-mail and any files transmitted with it are for the sole use
> > >>> of
> > the
> > >>> intended recipient(s) and may contain confidential and privileged
> > >>> information. If you are not the intended recipient(s), please
> > >>> reply to
> > the
> > >>> sender and destroy all copies of the original message. Any
> > >>> unauthorized review, use, disclosure, dissemination, forwarding,
> > >>> printing or
> > copying of
> > >>> this email, and/or any action taken in reliance on the contents of
> > >>> this e-mail is strictly prohibited and may be unlawful.
> > >>>
> > >>
> > >> Russell Jurney twitter.com/rjurney russell.jur...@gmail.com
> > datasyndrome.com
> >
> This e-mail and any files transmitted with it are for the sole use of the
> intended recipient(s) and may contain confidential and privileged
> information. If you are not the intended recipient(s), please reply to the
> sender and destroy all copies of the original message. Any unauthorized
> review, use, disclosure, dissemination, forwarding, printing or copying of
> this email, and/or any action taken in reliance on the contents of this
> e-mail is strictly prohibited and may be unlawful.
>



-- 
Nitin Pawar


Re: Help with DFSClient Exception.

2012-05-28 Thread Nitin Pawar
What's the block size? Also, are you experiencing any slowness in the network?

I am guessing you are using EC2; these issues normally come with network
problems.

On Mon, May 28, 2012 at 3:57 PM, akshaymb  wrote:

>
> Hi,
>
> We are frequently observing the exception
> java.io.IOException: DFSClient_attempt_201205232329_28133_r_02_0 could
> not complete file
>
> /output/tmp/test/_temporary/_attempt_201205232329_28133_r_02_0/part-r-2.
> Giving up.
> on our cluster.  The exception occurs during writing a file.  We are using
> Hadoop 0.20.2. It’s ~250 nodes cluster and on average 1 box goes down every
> 3 days.
>
> Detailed stack trace :
> 12/05/27 23:26:54 INFO mapred.JobClient: Task Id :
> attempt_201205232329_28133_r_02_0, Status : FAILED
> java.io.IOException: DFSClient_attempt_201205232329_28133_r_02_0 could
> not complete file
>
> /output/tmp/test/_temporary/_attempt_201205232329_28133_r_02_0/part-r-2.
> Giving up.
>at
>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3331)
>at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3240)
>at
>
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
>at
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
>at
>
> org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:106)
>at
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
>at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> Our investigation:
> We have min replication factor set to 2.  As mentioned
> http://kazman.shidler.hawaii.edu/ArchDocDecomposition.html here  , “A call
> to complete() will not return true until all the file's blocks have been
> replicated the minimum number of times.  Thus, DataNode failures may cause
> a
> client to call complete() several times before succeeding”, we should retry
> complete() several times.
> The org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal() calls
> complete() function and retries it for 20 times.  But in spite of that file
> blocks are not replicated minimum number of times. The retry count is not
> configurable.  Changing min replication factor to 1 is also not a good idea
> since there are continuously jobs running on our cluster.
>
> Do we have any solution / workaround for this problem?
>
> What is min replication factor in general used in industry.
>
> Let me know if any further inputs required.
>
> Thanks,
> -Akshay
>
>
>
> --
> View this message in context:
> http://old.nabble.com/Help-with-DFSClient-Exception.-tp33918949p33918949.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
Nitin Pawar


Re: How to mapreduce in the scenario

2012-05-29 Thread Nitin Pawar
Hive is one approach (similar to a conventional database, but not exactly the
same).

If you are writing a MapReduce program, then use MultipleInputs:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/lib/input/MultipleInputs.html
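
A minimal sketch of a reduce-side join wired up with MultipleInputs is below.
The field layout (id as the first comma-separated column) follows the example
in the quoted mail; class names and output paths are otherwise made up:

 import java.io.IOException;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;
 import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 public class JoinDriver {

   // Tags each a.txt record as "A" and keys it by id.
   public static class AMapper extends Mapper<LongWritable, Text, Text, Text> {
     protected void map(LongWritable key, Text value, Context ctx)
         throws IOException, InterruptedException {
       String[] parts = value.toString().split(",", 2);   // id , rest (name,age,...)
       if (parts.length < 2) return;                       // skip malformed lines
       ctx.write(new Text(parts[0]), new Text("A\t" + parts[1]));
     }
   }

   // Tags each b.txt record as "B" and keys it by id.
   public static class BMapper extends Mapper<LongWritable, Text, Text, Text> {
     protected void map(LongWritable key, Text value, Context ctx)
         throws IOException, InterruptedException {
       String[] parts = value.toString().split(",", 2);   // id , rest (address,...)
       if (parts.length < 2) return;
       ctx.write(new Text(parts[0]), new Text("B\t" + parts[1]));
     }
   }

   // Joins the tagged values for one id into a single output line.
   public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
     protected void reduce(Text id, Iterable<Text> values, Context ctx)
         throws IOException, InterruptedException {
       String a = "", b = "";
       for (Text v : values) {
         String s = v.toString();
         if (s.startsWith("A\t")) a = s.substring(2); else b = s.substring(2);
       }
       ctx.write(id, new Text(a + "," + b));
     }
   }

   public static void main(String[] args) throws Exception {
     Job job = new Job(new Configuration(), "reduce-side-join");
     job.setJarByClass(JoinDriver.class);
     MultipleInputs.addInputPath(job, new Path("a.txt"), TextInputFormat.class, AMapper.class);
     MultipleInputs.addInputPath(job, new Path("b.txt"), TextInputFormat.class, BMapper.class);
     job.setReducerClass(JoinReducer.class);
     job.setOutputKeyClass(Text.class);
     job.setOutputValueClass(Text.class);
     FileOutputFormat.setOutputPath(job, new Path("c_out"));
     System.exit(job.waitForCompletion(true) ? 0 : 1);
   }
 }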



On Tue, May 29, 2012 at 4:02 PM, Michel Segel wrote:

> Hive?
> Sure Assuming you mean that the id is a FK common amongst the tables...
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On May 29, 2012, at 5:29 AM, "liuzhg"  wrote:
>
> > Hi,
> >
> > I wonder that if Hadoop can solve effectively the question as following:
> >
> > ==
> > input file: a.txt, b.txt
> > result: c.txt
> >
> > a.txt:
> > id1,name1,age1,...
> > id2,name2,age2,...
> > id3,name3,age3,...
> > id4,name4,age4,...
> >
> > b.txt:
> > id1,address1,...
> > id2,address2,...
> > id3,address3,...
> >
> > c.txt
> > id1,name1,age1,address1,...
> > id2,name2,age2,address2,...
> > 
> >
> > I know that it can be done well by database.
> > But I want to handle it with hadoop if possible.
> > Can hadoop meet the requirement?
> >
> > Any suggestion can help me. Thank you very much!
> >
> > Best Regards,
> >
> > Gump
> >
> >
> >
>



-- 
Nitin Pawar


Re: How to mapreduce in the scenario

2012-05-29 Thread Nitin Pawar
If you have a huge dataset (huge meaning terabytes, or at the very least a
few GB), then yes: Hadoop has the advantage of distributed processing and is
much faster.

But on a smaller set of records it is not as good as an RDBMS.

On Wed, May 30, 2012 at 6:53 AM, liuzhg  wrote:

> Hi,
>
> Mike, Nitin, Devaraj, Soumya, samir, Robert
>
> Thank you all for your suggestions.
>
> Actually, I want to know if hadoop has any advantage than routine database
> in performance for solving this kind of problem ( join data ).
>
>
>
> Best Regards,
>
> Gump
>
>
>
>
>
> On Tue, May 29, 2012 at 6:53 PM, Soumya Banerjee
>  wrote:
>
> Hi,
>
> You can also try to use the Hadoop Reduce Side Join functionality.
> Look into the contrib/datajoin/hadoop-datajoin-*.jar for the base MAP and
> Reduce classes to do the same.
>
> Regards,
> Soumya.
>
>
> On Tue, May 29, 2012 at 4:10 PM, Devaraj k  wrote:
>
> > Hi Gump,
> >
> >   Mapreduce fits well for solving these types(joins) of problem.
> >
> > I hope this will help you to solve the described problem..
> >
> > 1. Mapoutput key and value classes : Write a map out put key
> > class(Text.class), value class(CombinedValue.class). Here value class
> > should be able to hold the values from both the files(a.txt and b.txt) as
> > shown below.
> >
> > class CombinedValue implements WritableComparator
> > {
> >   String name;
> >   int age;
> >   String address;
> >   boolean isLeft; // flag to identify from which file
> > }
> >
> > 2. Mapper : Write a map() function which can parse from both the
> > files(a.txt, b.txt) and produces common output key and value class.
> >
> > 3. Partitioner : Write the partitioner in such a way that it will Send
> all
> > the (key, value) pairs to same reducer which are having same key.
> >
> > 4. Reducer : In the reduce() function, you will receive the records from
> > both the files and you can combine those easily.
> >
> >
> > Thanks
> > Devaraj
> >
> >
> > 
> > From: liuzhg [liu...@cernet.com]
> > Sent: Tuesday, May 29, 2012 3:45 PM
> > To: common-user@hadoop.apache.org
> > Subject: How to mapreduce in the scenario
> >
> > Hi,
> >
> > I wonder that if Hadoop can solve effectively the question as following:
> >
> > ==
> > input file: a.txt, b.txt
> > result: c.txt
> >
> > a.txt:
> > id1,name1,age1,...
> > id2,name2,age2,...
> > id3,name3,age3,...
> > id4,name4,age4,...
> >
> > b.txt:
> > id1,address1,...
> > id2,address2,...
> > id3,address3,...
> >
> > c.txt
> > id1,name1,age1,address1,...
> > id2,name2,age2,address2,...
> > 
> >
> > I know that it can be done well by database.
> > But I want to handle it with hadoop if possible.
> > Can hadoop meet the requirement?
> >
> > Any suggestion can help me. Thank you very much!
> >
> > Best Regards,
> >
> > Gump
> >
>
>
>
>


-- 
Nitin Pawar


Re: Hadoop cluster hardware configuration

2012-06-04 Thread Nitin Pawar
If you tell us the purpose of this cluster, it would be easier to say exactly
how good this configuration is.

On Mon, Jun 4, 2012 at 3:57 PM, praveenesh kumar wrote:

> Hello all,
>
> I am looking forward to build a 5 node hadoop cluster with the following
> configurations per machine.  --
>
> 1. Intel Xeon E5-2609 (2.40GHz/4-core)
> 2. 32 GB RAM (8GB 1Rx4 PC3)
> 3. 5 x 900GB 6G SAS 10K hard disk ( total 4.5 TB storage/machine)
> 4. Ethernet 1GbE connection
>
> I would like the experts to please review it and share if this sounds like
> a optimal/deserving hadoop hardware configuration or not ?
> I know without knowing the actual use case its not worth commenting, but
> still in general I would like to have the views. Also please suggest if I
> am missing something.
>
> Regards,
> Praveenesh
>



-- 
Nitin Pawar


Re: Hadoop cluster hardware configuration

2012-06-04 Thread Nitin Pawar
If you are doing computations with Hadoop on a small scale, yes, this
hardware is good enough.

Normally Hadoop clusters are already occupied with heavy loads, so they are
not shared for other usage unless your Hadoop utilization is on the lower
side and you want to reuse the hardware.



On Mon, Jun 4, 2012 at 5:52 PM, praveenesh kumar wrote:

> On a very high level... we would be utilizing cluster not only for hadoop
> but for other I/O bound or in-memory operations.
> That is the reason we are going for SAS hard disks. And we also need to
> perform lots of computational tasks for which we have RAM kept to 32 GB,
> which can be increased. So on a high level just wanted to know does these
> hardware specs make sense ?
>
> Regards,
> Praveenesh
>
> On Mon, Jun 4, 2012 at 5:46 PM, Nitin Pawar 
> wrote:
>
> > if you tell us the purpose of this cluster, then it would be helpful to
> > tell exactly how good it is
> >
> > On Mon, Jun 4, 2012 at 3:57 PM, praveenesh kumar  > >wrote:
> >
> > > Hello all,
> > >
> > > I am looking forward to build a 5 node hadoop cluster with the
> following
> > > configurations per machine.  --
> > >
> > > 1. Intel Xeon E5-2609 (2.40GHz/4-core)
> > > 2. 32 GB RAM (8GB 1Rx4 PC3)
> > > 3. 5 x 900GB 6G SAS 10K hard disk ( total 4.5 TB storage/machine)
> > > 4. Ethernet 1GbE connection
> > >
> > > I would like the experts to please review it and share if this sounds
> > like
> > > a optimal/deserving hadoop hardware configuration or not ?
> > > I know without knowing the actual use case its not worth commenting,
> but
> > > still in general I would like to have the views. Also please suggest
> if I
> > > am missing something.
> > >
> > > Regards,
> > > Praveenesh
> > >
> >
> >
> >
> > --
> > Nitin Pawar
> >
>



-- 
Nitin Pawar


Re: Web Service Interface for triggering a Hadoop Job

2012-06-05 Thread Nitin Pawar
>   per user, users with more preferences will be sampled down (default: 1000)
>   --minPrefsPerUser (-mp) minPrefsPerUser   ignore users with less preferences than this (default: 1)
>   --booleanData (-b) booleanData            Treat input as without pref values
>   --threshold (-tr) threshold               discard item pairs with a similarity value below this
>   --help (-h)                               Print out help
>   --tempDir tempDir                         Intermediate output directory
>   --startPhase startPhase                   First phase to run
>   --endPhase endPhase                       Last phase to run
> Why do I get the above output?
>
> Thank you in advance.
>
> Nick K.
>



-- 
Nitin Pawar


Re: Nutch hadoop integration

2012-06-08 Thread Nitin Pawar
Maybe this will help you, if you have not already checked it:

http://wiki.apache.org/nutch/NutchHadoopTutorial

On Fri, Jun 8, 2012 at 1:29 PM, abhishek tiwari <
abhishektiwari.u...@gmail.com> wrote:

> how can i integrate hadood and nutch ..anyone please brief me .
>



-- 
Nitin Pawar


Re: Can I remove a folder on HDFS when decommissioning a data node?

2012-06-26 Thread Nitin Pawar
Use -skipTrash while removing; that way it won't be stored in .Trash.

On Tue, Jun 26, 2012 at 8:21 PM, Michael Segel wrote:

> Hi,
>
> Yes you can remove a file while there is a node or node(s) being
> decommissioned.
>
> I wonder if there's a way to manually clear out the .trash which may also
> give you more space.
>
> On Jun 26, 2012, at 2:56 AM, Adrian Liu wrote:
>
> > Hi,
> >
> > I tried to decommission a datanode, and then found that the active nodes
> don't have enough space to store the replicated blocks on the
> decommissioning node.  So can I remove a big folder when the
> decommissioning is in progress?  Will this causes some fatal errors?
>  Thanks in advance.
> >
> > Adrian Liu
> > adri...@yahoo-inc.com
> >
> >
> >
> >
>
>


-- 
Nitin Pawar


Re: Problem to kill jobs with java

2012-06-27 Thread Nitin Pawar
When you fire a Hive query, it prints a "kill command" message along with the
job id. You can capture that and execute it as needed.
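
If you would rather do it from your Java client than shell out, a rough sketch
using the old JobClient API is below. The JobTracker address is a placeholder,
and obtaining the job id (for example by parsing the kill-command line Hive
prints) is left to your client code:

 import org.apache.hadoop.mapred.JobClient;
 import org.apache.hadoop.mapred.JobConf;
 import org.apache.hadoop.mapred.JobID;
 import org.apache.hadoop.mapred.RunningJob;

 public class KillHadoopJob {
   public static void main(String[] args) throws Exception {
     // Job id captured from Hive's kill-command line, e.g. job_201206271234_0001
     String jobId = args[0];

     JobConf conf = new JobConf();
     conf.set("mapred.job.tracker", "jobtracker-host:9001");  // placeholder address

     JobClient client = new JobClient(conf);
     RunningJob job = client.getJob(JobID.forName(jobId));
     if (job != null && !job.isComplete()) {
       job.killJob();                                         // ask the JobTracker to kill it
     }
     client.close();
   }
 }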

On Wed, Jun 27, 2012 at 8:13 PM, hadoop  wrote:

> Hi Folks,
>
>
>
> I m using java client to run queries on hive, suggest me some way so that I
> can kill the query whenever I need.
>
> Or how can I find the jobid to kill it.
>
>
>
> Regards
>
> Vikas Srivastava
>
>


-- 
Nitin Pawar


Re: Need example programs other then wordcount for hadoop

2012-06-29 Thread Nitin Pawar
One thing you can try is data validation and enrichment, on any format you
like: plain text, CSV, XML, and so on.
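
As a starting point, here is a minimal sketch of a CSV validation mapper for a
map-only job. The expected field count and the keying convention are
assumptions for illustration only:

 import java.io.IOException;

 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Mapper;

 // Splits each CSV line, counts records that do not have the expected number
 // of fields, and passes the valid ones through keyed by their first column.
 public class CsvValidationMapper extends Mapper<LongWritable, Text, Text, Text> {

   private static final int EXPECTED_FIELDS = 5;   // assumption for illustration

   enum Quality { VALID, INVALID }

   @Override
   protected void map(LongWritable offset, Text line, Context ctx)
       throws IOException, InterruptedException {
     String[] fields = line.toString().split(",", -1);
     if (fields.length == EXPECTED_FIELDS) {
       ctx.getCounter(Quality.VALID).increment(1);
       ctx.write(new Text(fields[0]), line);       // key by first column
     } else {
       ctx.getCounter(Quality.INVALID).increment(1);
     }
   }
 }

From there you can extend it to enrichment (looking up reference data in the
setup() method, rewriting fields, and so on).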

On Fri, Jun 29, 2012 at 10:16 PM, Saravanan Nagarajan <
saravanan.nagarajan...@gmail.com> wrote:

> HI all,
>
> I ran word count examples in hadoop and it's very good starting point for
> hadoop.But i am looking for more programs with advanced concept . If you
> have any programs or suggestion, please send to me at "
> saravanan.nagarajan...@gmail.com".
>
> If you have best practices,please share with me.
>
> Regards,
> Saravanan
>



-- 
Nitin Pawar


Re: Hadoop removal

2012-07-06 Thread Nitin Pawar
>> >
> > >> > Please help me on this issue?
> > >> >
> > >> > Thanks,
> > >> > Prabhu.
> > >> >
> > >> > DISCLAIMER
> > >> > ==
> > >> > This e-mail may contain privileged and confidential information
> > >> > which is the property of Persistent Systems Ltd. It is intended
> > >> > only for the use
> > >> of
> > >> > the individual or entity to which it is addressed. If you are not
> > >> > the intended recipient, you are not authorized to read, retain,
> > >> > copy, print, distribute or use this message. If you have received
> > >> > this communication
> > >> in
> > >> > error, please notify the sender and delete all copies of this
> message.
> > >> > Persistent Systems Ltd. does not accept any liability for virus
> > >> > infected mails.
> > >> >
> > >>
> > >
> > >
> >
> > DISCLAIMER
> > ==
> > This e-mail may contain privileged and confidential information which is
> > the property of Persistent Systems Ltd. It is intended only for the use
> of
> > the individual or entity to which it is addressed. If you are not the
> > intended recipient, you are not authorized to read, retain, copy, print,
> > distribute or use this message. If you have received this communication
> in
> > error, please notify the sender and delete all copies of this message.
> > Persistent Systems Ltd. does not accept any liability for virus infected
> > mails.
> >
>



-- 
Nitin Pawar


Re: Hadoop 1.0.3 setup

2012-07-09 Thread Nitin Pawar
> > Problem binding to md-trngpoc1/10.5.114.110:54310 : Address already in use
> > at org.apache.hadoop.ipc.Server.bind(Server.java:227)
> > at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:301)
> > at org.apache.hadoop.ipc.Server.<init>(Server.java:1483)
> > at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:545)
> > at org.apache.hadoop.ipc.RPC.getServer(RPC.java:506)
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:294)
> > at
> > org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
> > at
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
> > at
> > org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
> > Caused by: java.net.BindException: Address already in use
> > at sun.nio.ch.Net.bind(Native Method)
> > at
> > sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:126)
> > at
> sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
> > at org.apache.hadoop.ipc.Server.bind(Server.java:225)
> > ... 8 more
> >
> > 2012-07-09 17:05:43,908 INFO
> > org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
> > /
> > SHUTDOWN_MSG: Shutting down NameNode at md-trngpoc1/10.5.114.110
> > /
> >
> >
> > Please suggest on this issue. What i am doing wrong?
> >
> > Thanks,
> > Prabhu
> >
>



-- 
Nitin Pawar


Re: Hadoop Job Profiling Tool

2012-07-09 Thread Nitin Pawar
The job history has most of the details you are looking for, at the level of
individual tasks.

On Tue, Jul 10, 2012 at 4:07 AM, Mike S  wrote:

> Is there any Hadoop-specific tool that can profile a job's time? For
> instance, if a job took x amount of time, the tool can profile how
> this time was consumed at different stages, like time spent to read,
> to map, to reduce, to shuffle, and the other steps of a Hadoop M/R job?
>
> If not, how do you recommend finding the map/reduce job bottleneck?
>



-- 
Nitin Pawar


Re: hadoop dfs -ls

2012-07-16 Thread Nitin Pawar
I managed to solve this by moving the contents of hdfs-site.xml into
core-site.xml.


Thanks

On Fri, Jul 13, 2012 at 9:54 PM, Leo Leung  wrote:

> Hi Nitin,
>
>
>
> Normally your conf should reside in /etc/hadoop/conf (if you don't have
> one. Copy it from the namenode - and keep it sync)
>
>
>
> hadoop (script) by default depends on hadoop-setup.sh which depends on
> hadoop-env.sh in /etc/hadoop/conf
>
>
>
> Or during runtime specify the config dir
>
> i.e:
>
>
>
> [hdfs]$  hadoop [--config ] 
>
>
>
>
>
> P.S. Some useful links:
>
> http://wiki.apache.org/hadoop/FAQ
>
> http://wiki.apache.org/hadoop/FrontPage
>
> http://wiki.apache.org/hadoop/
>
> http://hadoop.apache.org/common/docs/r1.0.3/
>
>
>
>
>
>
>
>
>
> -Original Message-
> From: d...@paraliatech.com [mailto:d...@paraliatech.com] On Behalf Of
> Dave Beech
> Sent: Friday, July 13, 2012 6:18 AM
> To: common-user@hadoop.apache.org
> Subject: Re: hadoop dfs -ls
>
>
>
> Hi Nitin
>
>
>
> It's likely that your hadoop command isn't finding the right configuration.
>
> In particular it doesn't know where your namenode is
> (fs.default.namesetting in core-site.xml)
>
>
>
> Maybe you need to set the HADOOP_CONF_DIR environment variable to point to
> your conf directory.
>
>
>
> Dave
>
>
>
> On 13 July 2012 14:11, Nitin Pawar  nitinpawar...@gmail.com>> wrote:
>
>
>
> > Hi,
>
> >
>
> > I have done setup numerous times but this time i did after some break.
>
> >
>
> > I managed to get the cluster up and running fine but when I do  hadoop
>
> > dfs -ls /
>
> >
>
> > it actually shows me contents of linux file system
>
> >
>
> > I am using hadoop-1.0.3 on rhel5.6
>
> >
>
> > Can anyone suggest what I must have done wrong?
>
> >
>
> > --
>
> > Nitin Pawar
>
> >
>



-- 
Nitin Pawar


Re: performance on memory usage

2012-07-24 Thread Nitin Pawar
Hadoop will not use or hold on to memory unless it's needed.

Push load onto the cluster and the usage stats will grow automatically.

On Tue, Jul 24, 2012 at 2:52 PM, Kamil Rogoń
 wrote:
> Hello,
>
> Reading on the Internet best practices for selecting hardware for Hadoop I
> noticed there are always many RAM memory. On my Hadoop environment I have
> 16GB memory on all hardware, but I am worried about small utilization of it:
>
> $ free -m
>  total   used   free shared buffers cached
> Mem: 15997  15907 90  0 287  15064
> -/+ buffers/cache:555  15442
> Swap:15258150  15108
>
> $ free -m
>  total   used   free shared buffers cached
> Mem: 16029  15937 92  0 228  14320
> -/+ buffers/cache:   1388  14641
> Swap:15258   1017  14240
>
> As you see "buffers used" is below 10%. What options should I look closer? I
> changed "Heap Size" of Cluster, but utilization doesn't grow (Heap Size is
> 70.23 MB / 3.47 GB (1%)).
>
> Current config which can impact on memory:
>
> fs.inmemory.size.mb
> 200
>
> io.sort.mb
> 200
>
> io.file.buffer.size
> 131072
>
> dfs.block.size
> 134217728
>
> mapred.child.java.opts
> -Xmx1024M
>
> export HADOOP_HEAPSIZE=4000
>
>
> Thanks for your reply,
> K.R.
>



-- 
Nitin Pawar


Re: Hadoop 1.0.3 start-daemon.sh doesn't start all the expected daemons

2012-07-27 Thread Nitin Pawar
I wrote a script for a single-node setup. It is just a basic script to get
things into a working state on a single node.

Just download the files and run the script:
https://github.com/nitinpawar/hadoop/

On Fri, Jul 27, 2012 at 5:46 PM, Bejoy Ks  wrote:
> Hi Dinesh
>
> Try using $HADOOP_HOME/bin/start-all.sh . It starts all the hadoop
> daemons including TT and DN.
>
>
> Regards
> Bejoy KS



-- 
Nitin Pawar


Re: hi, I need to help: Hadoop

2012-08-14 Thread Nitin Pawar
You can also take a look at:

http://blog.rajeevsharma.in/2009/06/using-hdfs-in-java-0200.html
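
A minimal sketch of copying a file out of HDFS to the local machine with the
Java FileSystem API is below; the namenode URI and the paths are placeholders:

 import java.net.URI;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;

 public class HdfsToLocal {
   public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();   // picks up core-site.xml from the classpath

     // Explicit URI shown for clarity; placeholder namenode host/port.
     FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host:9000"), conf);

     // Copy /user/huong/data.txt out of HDFS into /tmp on the local disk.
     fs.copyToLocalFile(new Path("/user/huong/data.txt"), new Path("/tmp/data.txt"));

     fs.close();
   }
 }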

On Wed, Aug 15, 2012 at 12:12 PM, Harsh J  wrote:
> Hi Huong,
>
> See http://developer.yahoo.com/hadoop/tutorial/module2.html#programmatically
>
> On Wed, Aug 15, 2012 at 10:24 AM, huong hoang minh
>  wrote:
>> I am researching Hadoop  technology. And I don't know how to access
>> and copy data from HDFS to the local machine by Java. Can you help me
>> , step by step?
>>  Thank you very much.
>> --
>> Hoàng Minh Hương
>> GOYOH VIETNAM 44-Trần Cung-Hà Nội-Viet NAM
>> Tel: 0915318789
>
>
>
> --
> Harsh J



-- 
Nitin Pawar


Re: reg hadoop on AWS

2012-10-05 Thread Nitin Pawar
Hi Sudha,

The best way to use Hadoop on AWS is via Apache Whirr.

Thanks,
Nitin

On Fri, Oct 5, 2012 at 4:45 PM, sudha sadhasivam
 wrote:
> Sir
> We tried to setup hadoop on AWS. The procedure is given. We face problem with 
> the parameters needed for input and output files. Can somebody provide us 
> with a sample exercise with steps for working on hadoop in AWS?
> thanking you
> Dr G Sudha



-- 
Nitin Pawar


Re: REg - Hive

2012-10-19 Thread Nitin Pawar
Do you mean you want to run something like "select * from table where
<any column> like/equals 'New York'" and then see the column definition/name
where it occurs?



On Fri, Oct 19, 2012 at 3:43 PM, sudha sadhasivam
 wrote:
> Sir
>
> In our system, we need hive query to  retrieve the field name where a 
> particular data item belongs
> For example, if we have a table having location information, when queried for 
> New York, the query should retuen the field names where New York occurs like 
> "City", "Administrative division" etc.
>
> Kindly inform whether it is possible to retrieve meta-data information for a 
> particular field value in Hive
> Thanking you
> G Sudha



-- 
Nitin Pawar


Re: Multi-threaded map task

2013-01-13 Thread Nitin Pawar
That's because it's a distributed processing framework that runs over a
network.
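
That said, if a single map task really is CPU-bound and you want threads
inside it, Hadoop does ship MultithreadedMapper for exactly this case. A
minimal sketch is below; the inner mapper is a placeholder, and it only helps
when the map work is thread-safe and CPU-bound rather than I/O-bound:

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 public class ThreadedJob {

   // The real work; must be thread-safe because several threads share one task JVM.
   public static class WorkMapper extends Mapper<LongWritable, Text, Text, Text> {
     protected void map(LongWritable key, Text value, Context ctx)
         throws java.io.IOException, InterruptedException {
       ctx.write(new Text("k"), value);   // placeholder computation
     }
   }

   public static void main(String[] args) throws Exception {
     Job job = new Job(new Configuration(), "multithreaded-map");
     job.setJarByClass(ThreadedJob.class);

     // One map task per split, six threads pulling records from the same split.
     job.setMapperClass(MultithreadedMapper.class);
     MultithreadedMapper.setMapperClass(job, WorkMapper.class);
     MultithreadedMapper.setNumberOfThreads(job, 6);

     job.setNumReduceTasks(0);
     job.setOutputKeyClass(Text.class);
     job.setOutputValueClass(Text.class);
     FileInputFormat.addInputPath(job, new Path(args[0]));
     FileOutputFormat.setOutputPath(job, new Path(args[1]));
     System.exit(job.waitForCompletion(true) ? 0 : 1);
   }
 }

Whether this beats simply running more map tasks depends entirely on the
workload; for plain record processing the extra threads rarely help.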
On Jan 14, 2013 11:27 AM, "Mark Olimpiati"  wrote:

> Hi, this is a simple question, but why wasn't map or reduce tasks
> programmed to be multi-threaded ? ie. instead of spawning 6 map tasks for 6
> cores, run one map task with 6 parallel threads.
>
> In fact I tried this myself, but turns that threading is not helping as it
> would be in regular java programs for some reason .. any feedback on this
> topic?
>
> Thanks,
> Mark
>


Re: Error when connecting Hive

2014-03-07 Thread Nitin Pawar
It somehow looks like Hive is not able to find the Hadoop libraries on its
classpath.


On Fri, Mar 7, 2014 at 11:48 PM, Manish  wrote:

> Please look into the below issue & help.
>
>
>
>  Original Message 
> Subject:Error when connecting Hive
> Date:   Fri, 07 Mar 2014 20:51:25 +0530
> From:   Manish 
> Reply-To:   u...@hive.apache.org
> To: u...@hive.apache.org 
>
>
>
> I am getting below error when connecting to hive shell. I thought it is
> because of log directory issue. But even fixing the directory permission
> the error still exists.
>
> manish@localhost:/tmp/manish$ hive
> Logging initialized using configuration in
> file:/etc/hive/conf.dist/hive-log4j.properties
> Hive history
> file=/tmp/manish/hive_job_log_b53b2f66-a751-4a9c-9e4d-
> 6ee64156e21f_1189315217.txt
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/hadoop/security/authentication/util/KerberosName
> at
> org.apache.hadoop.security.UserGroupInformation.ensureInitialized(
> UserGroupInformation.java:214)
> at
> org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(
> UserGroupInformation.java:277)
> at
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(
> UserGroupInformation.java:668)
> at
> org.apache.hadoop.security.UserGroupInformation.getCurrentUser(
> UserGroupInformation.java:573)
> at
> org.apache.hadoop.hive.shims.HadoopShimsSecure.getUGIForConf(
> HadoopShimsSecure.java:520)
> at
> org.apache.hadoop.hive.ql.security.HadoopDefaultAuthenticator.setConf(
> HadoopDefaultAuthenticator.java:51)
> at
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
> at
> org.apache.hadoop.util.ReflectionUtils.newInstance(
> ReflectionUtils.java:133)
> at
> org.apache.hadoop.hive.ql.metadata.HiveUtils.getAuthenticator(HiveUtils.
> java:365)
> at
> org.apache.hadoop.hive.ql.session.SessionState.start(
> SessionState.java:304)
> at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:669)
> at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:613)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:
> 57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.hadoop.security.authentication.util.KerberosName
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 17 more
>
>
>
>


-- 
Nitin Pawar