Re: simple hadoop pseudo distr. mode instructions

2012-03-22 Thread Jay Vyas
Great, thanks Jagat!

On Fri, Mar 23, 2012 at 1:42 AM, Jagat  wrote:

> Hi Jay
>
> Just follow this to install
>
> http://jugnu-life.blogspot.in/2012/03/hadoop-installation-tutorial.html
>
> The official tutorial at link below is also useful
>
> http://hadoop.apache.org/common/docs/r1.0.1/single_node_setup.html
>
> Thanks
>
> Jagat
>
> On Fri, Mar 23, 2012 at 12:08 PM, Jay Vyas  wrote:
>
> > Hi guys : What the latest, simplest, best directions to get a tiny,
> > psuedodistributed hadoop setup running on my ubuntu machine ?
> >
> > On Wed, Mar 21, 2012 at 5:14 PM,  wrote:
> >
> > > Owen,
> > >
> > > Is there interest in reverting hadoop-2399 in 0.23.x ?
> > >
> > > - Milind
> > >
> > > ---
> > > Milind Bhandarkar
> > > Greenplum Labs, EMC
> > > (Disclaimer: Opinions expressed in this email are those of the author,
> > and
> > > do not necessarily represent the views of any organization, past or
> > > present, the author might be affiliated with.)
> > >
> > >
> > >
> > > On 3/19/12 11:20 PM, "Owen O'Malley"  wrote:
> > >
> > > >On Mon, Mar 19, 2012 at 11:05 PM, madhu phatak 
> > > >wrote:
> > > >
> > > >> Hi Owen O'Malley,
> > > >>  Thank you for that Instant reply. It's working now. Can you explain
> > me
> > > >> what you mean by "input to reducer is reused" in little detail?
> > > >
> > > >
> > > >Each time the statement "Text value = values.next();" is executed it
> > > >always
> > > >returns the same Text object with the contents of that object changed.
> > > >When
> > > >you add the Text to the list, you are adding a pointer to the same
> Text
> > > >object. At the end you have 6 copies of the same pointer instead of 6
> > > >different Text objects.
> > > >
> > > >The reason that I said it is my fault, is because I added the
> > optimization
> > > >that causes it. If you are interested in Hadoop archeology, it was
> > > >HADOOP-2399 that made the change. I also did HADOOP-3522 to improve
> the
> > > >documentation in the area.
> > > >
> > > >-- Owen
> > >
> > >
> >
> >
> > --
> > Jay Vyas
> > MMSB/UCHC
> >
>



-- 
Jay Vyas
MMSB/UCHC


Re: Very strange Java Collection behavior in Hadoop

2012-03-22 Thread Jagat
Hi Jay

Just follow this to install

http://jugnu-life.blogspot.in/2012/03/hadoop-installation-tutorial.html

The official tutorial at link below is also useful

http://hadoop.apache.org/common/docs/r1.0.1/single_node_setup.html

Thanks

Jagat

On Fri, Mar 23, 2012 at 12:08 PM, Jay Vyas  wrote:

> Hi guys : What the latest, simplest, best directions to get a tiny,
> psuedodistributed hadoop setup running on my ubuntu machine ?
>
> On Wed, Mar 21, 2012 at 5:14 PM,  wrote:
>
> > Owen,
> >
> > Is there interest in reverting hadoop-2399 in 0.23.x ?
> >
> > - Milind
> >
> > ---
> > Milind Bhandarkar
> > Greenplum Labs, EMC
> > (Disclaimer: Opinions expressed in this email are those of the author,
> and
> > do not necessarily represent the views of any organization, past or
> > present, the author might be affiliated with.)
> >
> >
> >
> > On 3/19/12 11:20 PM, "Owen O'Malley"  wrote:
> >
> > >On Mon, Mar 19, 2012 at 11:05 PM, madhu phatak 
> > >wrote:
> > >
> > >> Hi Owen O'Malley,
> > >>  Thank you for that Instant reply. It's working now. Can you explain
> me
> > >> what you mean by "input to reducer is reused" in little detail?
> > >
> > >
> > >Each time the statement "Text value = values.next();" is executed it
> > >always
> > >returns the same Text object with the contents of that object changed.
> > >When
> > >you add the Text to the list, you are adding a pointer to the same Text
> > >object. At the end you have 6 copies of the same pointer instead of 6
> > >different Text objects.
> > >
> > >The reason that I said it is my fault, is because I added the
> optimization
> > >that causes it. If you are interested in Hadoop archeology, it was
> > >HADOOP-2399 that made the change. I also did HADOOP-3522 to improve the
> > >documentation in the area.
> > >
> > >-- Owen
> >
> >
>
>
> --
> Jay Vyas
> MMSB/UCHC
>


Re: Very strange Java Collection behavior in Hadoop

2012-03-22 Thread Jay Vyas
Hi guys: what are the latest, simplest, best directions to get a tiny,
pseudo-distributed Hadoop setup running on my Ubuntu machine?

On Wed, Mar 21, 2012 at 5:14 PM,  wrote:

> Owen,
>
> Is there interest in reverting hadoop-2399 in 0.23.x ?
>
> - Milind
>
> ---
> Milind Bhandarkar
> Greenplum Labs, EMC
> (Disclaimer: Opinions expressed in this email are those of the author, and
> do not necessarily represent the views of any organization, past or
> present, the author might be affiliated with.)
>
>
>
> On 3/19/12 11:20 PM, "Owen O'Malley"  wrote:
>
> >On Mon, Mar 19, 2012 at 11:05 PM, madhu phatak 
> >wrote:
> >
> >> Hi Owen O'Malley,
> >>  Thank you for that instant reply. It's working now. Can you explain to
> >> me what you mean by "input to reducer is reused" in a little more detail?
> >
> >
> >Each time the statement "Text value = values.next();" is executed it
> >always
> >returns the same Text object with the contents of that object changed.
> >When
> >you add the Text to the list, you are adding a pointer to the same Text
> >object. At the end you have 6 copies of the same pointer instead of 6
> >different Text objects.
> >
> >The reason that I said it is my fault, is because I added the optimization
> >that causes it. If you are interested in Hadoop archeology, it was
> >HADOOP-2399 that made the change. I also did HADOOP-3522 to improve the
> >documentation in the area.
> >
> >-- Owen
>
>


-- 
Jay Vyas
MMSB/UCHC
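
For reference, here is a minimal sketch of the object-reuse behavior Owen
describes above, using the old org.apache.hadoop.mapred API (the class name is
illustrative, not from the original thread). The point is that the iterator
hands back the same Text instance every time, so anything you buffer has to be
copied first:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class BufferingReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    List<Text> buffered = new ArrayList<Text>();
    while (values.hasNext()) {
      Text value = values.next();       // Hadoop reuses this single Text instance
      // buffered.add(value);           // wrong: the list ends up holding N pointers to one object
      buffered.add(new Text(value));    // right: copy the current contents
    }
    for (Text t : buffered) {
      output.collect(key, t);
    }
  }
}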


IBM China Big Data team recruitment

2012-03-22 Thread lulynn_2008
Please send your resume to jian...@cn.ibm.com

Job Description:



Big Data processing is becoming an increasingly hot area in industry, and IBM is
investing significantly in it to gain a leadership position in the marketplace.

You will join the CDL InfoSphere Big Data (BigInsights) team, an energetic and
innovative team that is working with SVL to architect, design, and develop the
next-generation enterprise product in the Big Data area. This new initiative
includes a Hadoop-powered distributed parallel data processing system, big data
analytics, and management capability for business and IT, supporting structured,
semi-structured and unstructured data, and is designed for enterprise-class
analytics and performance. We are looking for technical leaders, developers and
QAs (including professional hires, campus hires and internal transfers) to bring
their unique expertise to build and expand this key initiative.

A strong candidate must be able to independently design, code, and test major
features, work jointly with other team members to deliver complex product
components, and mentor and lead in the design and implementation of large-scale
modules and systems.

Job Responsibility

·  Design and implement a scalable and reliable distributed data processing and
management infrastructure that spans multiple technologies, including Hadoop,
data warehousing, analytics, storage management, indexing, and extreme-volume
data movement and management, as well as optimization of hardware and software
configurations.

·  Design and implement system modules to support componentized and high 
performance parallel applications, including communications infrastructure, 
metadata services, administrative and user interfaces, and client APIs.

·  Enhance IBM Hadoop components and integration with IBM products and other 
popular products

·  Work with engineers, architects, managers, and quality assurance to design
and implement innovative solutions incorporating functionality, performance,
scalability, reliability, and adherence to agile development goals and
principles.

· Work with customers to propose solutions and help customers implement 
them.

· More responsibilities depending on emerging customer requirements and your
capabilities


Required Skill:



-   Excellent communication skills, including presentation, verbal and written 
skills in both English and Chinese

-   5 or more years of designing and implementing large scalable systems (for 
technical leaders)

-   3 or more years of leading the architecture, design and development of 
enterprise software (for technical leaders)

-   Strong Java development and object oriented programming skills including 
familiarity with J2EE/Applet/Servlet/JSP/Java/JSON/Python/REST/AJAX

-   Understanding of distributed systems, map-reduce algorithms, Hadoop, 
object-oriented programming, and performance optimization techniques. 
Hadoop/Hbase development/running experience is a big plus.

-   Database server development experience is a plus

-   Web application development experience is a plus

-   Data warehouse and analytics experience is a plus

-   NoSQL experience is a plus

-   Ability to work with customers, understand customer business requirements, 
and communicate them to the development organization

Qualifications:  Bachelor's degree or above in Computer Science or relevant areas

 

RE: hadoop on cygwin : tasktracker is throwing error : need help

2012-03-22 Thread Santosh Borse
I got rid of this error by installing version 0.20.2.

From: Santosh Borse [santosh_bo...@persistent.co.in]
Sent: Friday, March 23, 2012 7:52 AM
To: common-user@hadoop.apache.org
Subject: hadoop on cygwin : tasktracker is throwing error : need help

I have installed Hadoop on Cygwin to help me write MR code in Eclipse on
Windows.


2012-03-22 22:19:57,896 ERROR org.apache.hadoop.mapred.TaskTracker: Can not 
start task tracker because java.io.IOException: Failed to set permissions of 
path: \tmp\hadoop-uygwin\mapred\local\ttprivate to 0700
at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:682)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:655)
at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
at 
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:726)
at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:1457)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3716)

2012-03-22 22:19:57,897 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:



Config details
-
OS : Win 7
Hadoop :  hadoop-1.0.1


Please let me know if you can help.


-Santosh


DISCLAIMER
==
This e-mail may contain privileged and confidential information which is the 
property of Persistent Systems Ltd. It is intended only for the use of the 
individual or entity to which it is addressed. If you are not the intended 
recipient, you are not authorized to read, retain, copy, print, distribute or 
use this message. If you have received this communication in error, please 
notify the sender and delete all copies of this message. Persistent Systems 
Ltd. does not accept any liability for virus infected mails.
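
The "Failed to set permissions of path ... to 0700" failure is specific to
running the 1.0.x TaskTracker on Windows/Cygwin; downgrading to 0.20.2, as
Santosh did, avoids it. Another workaround that circulated at the time was to
rebuild hadoop-core with a relaxed FileUtil.checkReturnValue. The sketch below
is an unofficial, assumption-laden illustration of that patch (it assumes the
1.0.x FileUtil source layout and its existing LOG field), not a supported fix:

// Unofficial workaround sketch: inside org.apache.hadoop.fs.FileUtil of a
// 1.0.x source tree, replace the throw in checkReturnValue with a warning so
// chmod-style failures under Cygwin are tolerated instead of fatal.
private static void checkReturnValue(boolean rv, File p, FsPermission permission)
    throws IOException {
  if (!rv) {
    // The original code throws an IOException here ("Failed to set
    // permissions of path ..."); under Cygwin/Windows the permission call
    // cannot succeed, so only log a warning.
    LOG.warn("Failed to set permissions of path: " + p + " to " + permission);
  }
}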



hadoop on cygwin : tasktracker is throwing error : need help

2012-03-22 Thread Santosh Borse
I have installed Hadoop on Cygwin to help me write MR code in Eclipse on
Windows.


2012-03-22 22:19:57,896 ERROR org.apache.hadoop.mapred.TaskTracker: Can not 
start task tracker because java.io.IOException: Failed to set permissions of 
path: \tmp\hadoop-uygwin\mapred\local\ttprivate to 0700
at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:682)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:655)
at 
org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
at 
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:726)
at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:1457)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3716)

2012-03-22 22:19:57,897 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:



Config details
-
OS : Win 7
Hadoop :  hadoop-1.0.1


Please let me know if you can help.


-Santosh


DISCLAIMER
==
This e-mail may contain privileged and confidential information which is the 
property of Persistent Systems Ltd. It is intended only for the use of the 
individual or entity to which it is addressed. If you are not the intended 
recipient, you are not authorized to read, retain, copy, print, distribute or 
use this message. If you have received this communication in error, please 
notify the sender and delete all copies of this message. Persistent Systems 
Ltd. does not accept any liability for virus infected mails.


Re: number of partitions

2012-03-22 Thread Harsh J
This shouldn't be the case at all. Can you share your Partitioner code
and the job.xml of the job that showed this behavior?

In any case: How do you "set the numberOfReducer to 4"?

2012/3/23 Harun Raşit ER :
> I wrote a custom partitioner. But when I work as standalone or
> pseudo-distributed mode, the number of partitions is always 1. I set the
> numberOfReducer to 4, but the numOfPartitions parameter of custom
> partitioner is still 1 and all my four mappers' results are going to 1
> reducer. The other reducers yield empty files.
>
> How can i set the number of partitions in standalone or pseudo-distributed
> mode?
>
> thanks for your helps.



-- 
Harsh J
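
For anyone who finds this thread later, here is a minimal sketch of wiring a
custom partitioner to four reducers with the new org.apache.hadoop.mapreduce
API (the class and job names are illustrative, not taken from the thread).
Note, too, that plain standalone mode runs on the LocalJobRunner, which, if
memory serves, supports at most one reduce task, so numPartitions will stay 1
there regardless of what is configured:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // numPartitions is simply the configured number of reduce tasks
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

// In the driver, before submitting the job:
//   Job job = new Job(conf, "partition-test");
//   job.setNumReduceTasks(4);                  // this is what makes numPartitions == 4
//   job.setPartitionerClass(MyPartitioner.class);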


number of partitions

2012-03-22 Thread Harun Raşit ER
I wrote a custom partitioner, but when I run in standalone or
pseudo-distributed mode, the number of partitions is always 1. I set the
numberOfReducer to 4, but the numOfPartitions parameter of the custom
partitioner is still 1 and all four of my mappers' results go to 1
reducer. The other reducers yield empty files.

How can i set the number of partitions in standalone or pseudo-distributed
mode?

Thanks for your help.


Re: Number of retries

2012-03-22 Thread Bejoy KS
Hi Mohit
 To add on, duplicates won't be there if your output is written to an HDFS
file, because once one attempt of a task completes, only that attempt's output
file is copied to the final output destination; the files generated by other
task attempts that are killed are simply ignored.

Regards
Bejoy KS

Sent from handheld, please excuse typos.

-Original Message-
From: "Bejoy KS" 
Date: Thu, 22 Mar 2012 19:55:55 
To: 
Reply-To: bejoy.had...@gmail.com
Subject: Re: Number of retries

Mohit
  If you are writing to a db from a job in an atomic way, this would pop 
up. You can avoid this only by disabling speculative execution. 
Drilling down from web UI to a task level would get you the tasks where 
multiple attempts were there.

--Original Message--
From: Mohit Anchlia
To: common-user@hadoop.apache.org
ReplyTo: common-user@hadoop.apache.org
Subject: Number of retries
Sent: Mar 23, 2012 01:21

I am seeing a weird problem where I am seeing duplicate rows in the database.
I am wondering if this is because of some internal retries that might be
causing this. Is there a way to look at which tasks were retried? I am not
sure what else might cause it, because when I look at the output data I don't
see any duplicates in the file.



Regards
Bejoy KS

Sent from handheld, please excuse typos.
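
For completeness, here is a small sketch of switching off the speculative
execution Bejoy refers to, on the 1.x JobConf API (the class name is only
illustrative). With speculation off, only one attempt of each task runs, so an
external database cannot see a second, duplicate write from a speculative
attempt:

import org.apache.hadoop.mapred.JobConf;

public class NoSpeculationConf {
  public static JobConf configure(JobConf conf) {
    // mapred.map.tasks.speculative.execution
    conf.setMapSpeculativeExecution(false);
    // mapred.reduce.tasks.speculative.execution
    conf.setReduceSpeculativeExecution(false);
    return conf;
  }
}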

Re: Number of retries

2012-03-22 Thread Bejoy KS
Mohit
  If you are writing to a DB from a job in an atomic way, this can pop
up. You can avoid it only by disabling speculative execution.
Drilling down from the web UI to the task level will show you the tasks that
had multiple attempts.

--Original Message--
From: Mohit Anchlia
To: common-user@hadoop.apache.org
ReplyTo: common-user@hadoop.apache.org
Subject: Number of retries
Sent: Mar 23, 2012 01:21

I am seeing a weird problem where I am seeing duplicate rows in the database.
I am wondering if this is because of some internal retries that might be
causing this. Is there a way to look at which tasks were retried? I am not
sure what else might cause it, because when I look at the output data I don't
see any duplicates in the file.



Regards
Bejoy KS

Sent from handheld, please excuse typos.


Re: tasktracker/jobtracker.. expectation..

2012-03-22 Thread Bejoy Ks
Hi Patai
 The JobTracker automatically handles this situation by attempting the task
on different nodes. Could you verify the number of attempts that these
failed tasks made? Was it just one? If there were more, were all the
task attempts triggered on the same node or not? Did all of them fail
with the same error? You can get this information from the JobTracker web
UI: drill down to the task level and then further down into a failed task.

Regards
Bejoy

On Thu, Mar 22, 2012 at 11:25 PM, Patai Sangbutsarakum <
silvianhad...@gmail.com> wrote:

> Hi all,
>
> I had a job fail this morning because 2 tasks were trying to write
> to a disk that somehow turned read-only.
> Originally, I was thinking/dreaming that in this case those 2
> tasks would somehow be moved automatically
> to another dn/tt that also has the required data block, and wouldn't fail.
>
> I strongly believe that Hadoop can do that, but I just didn't know it
> well enough to enable it.
>
> /dev/sdj1 /hadoop10 ext3 ro,noatime,data=ordered 0 0
>
> Error initializing attempt_201203211854_2633_m_17_0: EROFS:
> Read-only file system at
> org.apache.hadoop.io.nativeio.NativeIO.chmod(Native Method) at
>
> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:496)
> at
> org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:319)
> at
> org.apache.hadoop.mapred.JobLocalizer.createLocalDirs(JobLocalizer.java:144)
> at
> org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:190)
> at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1199)
> at java.security.AccessController.doPrivileged(Native Method) at
> javax.security.auth.Subject.doAs(Subject.java:396) at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
> at
> org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1174)
> at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1089)
> at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:2257)
> at
> org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:2221)
>
> Hope this makes sense.
> Patai
>
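
As a side note on the retry behavior Bejoy describes above, the settings below
are the 1.x JobConf knobs that govern it; the values shown are the usual
defaults, and the class name is only illustrative:

import org.apache.hadoop.mapred.JobConf;

public class RetryConf {
  public static JobConf configure(JobConf conf) {
    // A task is retried up to this many times before the job fails; each
    // retry is normally scheduled on a different TaskTracker, which is what
    // should route the work around a node with a bad (read-only) disk.
    conf.setMaxMapAttempts(4);              // mapred.map.max.attempts
    conf.setMaxReduceAttempts(4);           // mapred.reduce.max.attempts
    // After this many task failures on one node, that TaskTracker is
    // blacklisted for the rest of the job.
    conf.setMaxTaskFailuresPerTracker(4);   // mapred.max.tracker.failures
    return conf;
  }
}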


Re: rack awareness and safemode

2012-03-22 Thread Patai Sangbutsarakum
Roger that

On Thu, Mar 22, 2012 at 10:40 AM, John Meagher  wrote:
> Make sure you run "hadoop fsck /".  It should report a lot of blocks
> with the replication policy violated.  In the short term it isn't
> anything to worry about and everything will work fine even with those
> errors.  Run the script I sent out earlier to fix those errors and
> bring everything into compliance with the new rack awareness setup.
>
>
> On Thu, Mar 22, 2012 at 13:36, Patai Sangbutsarakum
>  wrote:
>> I restarted the cluster yesterday with rack-awareness enable.
>> Things went well. confirm that there was no issues at all.
>>
>> Thanks you all again.
>>
>>
>> On Tue, Mar 20, 2012 at 4:19 PM, Patai Sangbutsarakum
>>  wrote:
>>> Thanks you all.
>>>
>>>
>>> On Tue, Mar 20, 2012 at 2:44 PM, Harsh J  wrote:
 John has already addressed your concern. I'd only like to add that
 fixing of replication violations does not require your NN to be in
 safe mode and it won't be. Your worry can hence be voided :)

 On Wed, Mar 21, 2012 at 2:08 AM, Patai Sangbutsarakum
  wrote:
> Thanks for your reply and script. Hopefully it still apply to 0.20.203
> As far as I play with test cluster. The balancer would take care of
> replica placement.
> I just don't want to fall into the situation that the hdfs sit in the
> safemode
> for hours and users can't use hadoop and start yelping.
>
> Let's hear from others.
>
>
> Thanks
> Patai
>
>
> On 3/20/12 1:27 PM, "John Meagher"  wrote:
>
>>Here's the script I used (all sorts of caveats about it assuming a
>>replication factor of 3 and no real error handling, etc)...
>>
>>for f in `hadoop fsck / | grep "Replica placement policy is violated"
>>| head -n8 | awk -F: '{print $1}'`; do
>>    hadoop fs -setrep -w 4 $f
>>    hadoop fs -setrep 3 $f
>>done
>>
>>
>



 --
 Harsh J


Re: rack awareness and safemode

2012-03-22 Thread John Meagher
Make sure you run "hadoop fsck /".  It should report a lot of blocks
with the replication policy violated.  In the short term it isn't
anything to worry about and everything will work fine even with those
errors.  Run the script I sent out earlier to fix those errors and
bring everything into compliance with the new rack awareness setup.


On Thu, Mar 22, 2012 at 13:36, Patai Sangbutsarakum
 wrote:
> I restarted the cluster yesterday with rack-awareness enable.
> Things went well. confirm that there was no issues at all.
>
> Thanks you all again.
>
>
> On Tue, Mar 20, 2012 at 4:19 PM, Patai Sangbutsarakum
>  wrote:
>> Thanks you all.
>>
>>
>> On Tue, Mar 20, 2012 at 2:44 PM, Harsh J  wrote:
>>> John has already addressed your concern. I'd only like to add that
>>> fixing of replication violations does not require your NN to be in
>>> safe mode and it won't be. Your worry can hence be voided :)
>>>
>>> On Wed, Mar 21, 2012 at 2:08 AM, Patai Sangbutsarakum
>>>  wrote:
 Thanks for your reply and script. Hopefully it still apply to 0.20.203
 As far as I play with test cluster. The balancer would take care of
 replica placement.
 I just don't want to fall into the situation that the hdfs sit in the
 safemode
 for hours and users can't use hadoop and start yelping.

 Let's hear from others.


 Thanks
 Patai


 On 3/20/12 1:27 PM, "John Meagher"  wrote:

>Here's the script I used (all sorts of caveats about it assuming a
>replication factor of 3 and no real error handling, etc)...
>
>for f in `hadoop fsck / | grep "Replica placement policy is violated"
>| head -n8 | awk -F: '{print $1}'`; do
>    hadoop fs -setrep -w 4 $f
>    hadoop fs -setrep 3 $f
>done
>
>

>>>
>>>
>>>
>>> --
>>> Harsh J


Re: rack awareness and safemode

2012-03-22 Thread Patai Sangbutsarakum
I restarted the cluster yesterday with rack-awareness enabled.
Things went well; I can confirm there were no issues at all.

Thank you all again.


On Tue, Mar 20, 2012 at 4:19 PM, Patai Sangbutsarakum
 wrote:
> Thanks you all.
>
>
> On Tue, Mar 20, 2012 at 2:44 PM, Harsh J  wrote:
>> John has already addressed your concern. I'd only like to add that
>> fixing of replication violations does not require your NN to be in
>> safe mode and it won't be. Your worry can hence be voided :)
>>
>> On Wed, Mar 21, 2012 at 2:08 AM, Patai Sangbutsarakum
>>  wrote:
>>> Thanks for your reply and script. Hopefully it still applies to 0.20.203.
>>> As far as I have played with the test cluster, the balancer takes care of
>>> replica placement.
>>> I just don't want to fall into a situation where HDFS sits in
>>> safemode
>>> for hours and users can't use Hadoop and start yelping.
>>>
>>> Let's hear from others.
>>>
>>>
>>> Thanks
>>> Patai
>>>
>>>
>>> On 3/20/12 1:27 PM, "John Meagher"  wrote:
>>>
Here's the script I used (all sorts of caveats about it assuming a
replication factor of 3 and no real error handling, etc)...

for f in `hadoop fsck / | grep "Replica placement policy is violated"
| head -n8 | awk -F: '{print $1}'`; do
    hadoop fs -setrep -w 4 $f
    hadoop fs -setrep 3 $f
done


>>>
>>
>>
>>
>> --
>> Harsh J


Re: setNumTasks

2012-03-22 Thread Shi Yu
If you want to control the number of input splits at fine granularity, 
you could customize the NLineInputFormat. You need to determine the 
number of lines per split.  To do that you first need to know the 
number of lines in your input data; for instance,


hadoop fs -text /input/dir/* | wc -l

will give you a number; let's assume it is N.

If you have K nodes, each with C cores, you can basically run K*C mapper 
tasks at once.  If you further assume each mapper should process 2 splits 
(so that tasks which finish earlier can pick up extra work), the optimal 
number of lines per split for NLineInputFormat is around


N/(2*K*C)

This might give you an optimal job balance.   Remember, NLineInputFormat 
usually takes longer to initialize than other input formats, and the line 
split only considers the number of lines; it is unaware of the content 
length of each line. So in sequence data analysis, if some lines are 
significantly longer than others, the mappers assigned the longer lines 
will be much slower than those assigned the shorter ones.  Randomly mixing 
short and long lines before splitting is therefore preferable.



Shi


On 3/22/2012 10:01 AM, Bejoy Ks wrote:

Hi Mohit
   The number of map tasks is determined by your number of input splits
and the Input Format used by your MR job. Setting this value won't help you
control the same. AFAIK it would get effective if the value in
mapred.map.tasks is greater than the no of tasks calculated by the Job
based on the splits and Input Format.

Regards
Bejoy KS

On Thu, Mar 22, 2012 at 8:28 PM, Mohit Anchlia wrote:


Sorry I meant *setNumMapTasks. *What is mapred.map.tasks for? It's
confusing as to what it's purpose is for? I tried setting it for my job
still I see more map tasks running than *mapred.map.tasks*

On Thu, Mar 22, 2012 at 7:53 AM, Harsh J  wrote:


There isn't such an API as "setNumTasks". There is however,
"setNumReduceTasks", which sets "mapred.reduce.tasks".

Does this answer your question?

On Thu, Mar 22, 2012 at 8:21 PM, Mohit Anchlia
wrote:

Could someone please help me answer this question?

On Wed, Mar 14, 2012 at 8:06 AM, Mohit Anchlia
What is the corresponding system property for setNumTasks? Can it be

used

explicitly as system property like "mapred.tasks."?



--
Harsh J
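
Tying Shi's N/(2*K*C) rule of thumb above to code: a hedged sketch using the
old org.apache.hadoop.mapred API, where the totalLines, nodes and coresPerNode
parameters are assumed to be measured or known beforehand (for example via the
wc -l count above), and the class name is only illustrative:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class NLineSetup {
  public static void configure(JobConf conf, long totalLines, int nodes, int coresPerNode) {
    // Aim for roughly two splits per concurrently running mapper (K*C of them).
    long linesPerSplit = Math.max(1L, totalLines / (2L * nodes * coresPerNode));
    conf.setInputFormat(NLineInputFormat.class);
    // Property read by the old-API NLineInputFormat.
    conf.setInt("mapred.line.input.format.linespermap", (int) linesPerSplit);
  }
}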





Re: hadoop permission guideline

2012-03-22 Thread Harsh J
Hi Michael,

Am moving your question to the scm-us...@cloudera.org group, which is
home to the community of Cloudera Manager users. You will get better
responses there.

In case you wish to browse or subscribe to this group, visit
https://groups.google.com/a/cloudera.org/forum/#!forum/scm-users

(BCC'd common-user@)

On Thu, Mar 22, 2012 at 8:21 PM, Michael Wang  wrote:
> I have installed Cloudera hadoop (CDH). I used its Cloudera Manager to
> install all needed packages. When it was installed, the root is used.  I
> found the installation created some users, such as hdfs, hive,
> mapred,hue,hbase...
> After the installation, should we change some permission or ownership of
> some directories/files? For example, to use HIVE. It works fine with root
> user, since the metastore directory belongs to root. But in order to let
> other user use HIVE, I have to change metastore ownership to a specific
> non-root user, then it works. Is it the best practice?
> Another example is the start-all.sh, stop-all.sh they all belong to
> root. Should I change them to other user? I guess there are more cases...
>
> Thanks,
>
>
>
> This electronic message, including any attachments, may contain
> proprietary, confidential or privileged information for the sole use of the
> intended recipient(s). You are hereby notified that any unauthorized
> disclosure, copying, distribution, or use of this message is prohibited. If
> you have received this message in error, please immediately notify the
> sender by reply e-mail and delete it.



--
Harsh J


Re: setNumTasks

2012-03-22 Thread Bejoy Ks
Hi Mohit
  The number of map tasks is determined by your number of input splits
and the InputFormat used by your MR job. Setting this value won't help you
control it. AFAIK it only takes effect if the value of
mapred.map.tasks is greater than the number of tasks calculated by the Job
based on the splits and InputFormat.

Regards
Bejoy KS

On Thu, Mar 22, 2012 at 8:28 PM, Mohit Anchlia wrote:

> Sorry I meant *setNumMapTasks. *What is mapred.map.tasks for? It's
> confusing as to what it's purpose is for? I tried setting it for my job
> still I see more map tasks running than *mapred.map.tasks*
>
> On Thu, Mar 22, 2012 at 7:53 AM, Harsh J  wrote:
>
> > There isn't such an API as "setNumTasks". There is however,
> > "setNumReduceTasks", which sets "mapred.reduce.tasks".
> >
> > Does this answer your question?
> >
> > On Thu, Mar 22, 2012 at 8:21 PM, Mohit Anchlia 
> > wrote:
> > > Could someone please help me answer this question?
> > >
> > > On Wed, Mar 14, 2012 at 8:06 AM, Mohit Anchlia  > >wrote:
> > >
> > >> What is the corresponding system property for setNumTasks? Can it be
> > used
> > >> explicitly as system property like "mapred.tasks."?
> >
> >
> >
> > --
> > Harsh J
> >
>
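
To make Bejoy's point concrete: setNumMapTasks / mapred.map.tasks is only a
hint, and the practical lever for reducing the number of map tasks is the
split size. A hedged sketch on the old API follows (the class name and the
256 MB figure are just illustrative values):

import org.apache.hadoop.mapred.JobConf;

public class MapTaskCountHints {
  public static void configure(JobConf conf) {
    // Only a hint: the actual count comes from the InputFormat's splits.
    conf.setNumMapTasks(10);                                    // mapred.map.tasks
    // Raising the minimum split size produces fewer, larger splits and
    // therefore fewer map tasks.
    conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);  // 256 MB
  }
}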


Re: setNumTasks

2012-03-22 Thread Mohit Anchlia
Sorry, I meant *setNumMapTasks*. What is mapred.map.tasks for? It's
confusing as to what its purpose is. I tried setting it for my job, but
I still see more map tasks running than *mapred.map.tasks*.

On Thu, Mar 22, 2012 at 7:53 AM, Harsh J  wrote:

> There isn't such an API as "setNumTasks". There is however,
> "setNumReduceTasks", which sets "mapred.reduce.tasks".
>
> Does this answer your question?
>
> On Thu, Mar 22, 2012 at 8:21 PM, Mohit Anchlia 
> wrote:
> > Could someone please help me answer this question?
> >
> > On Wed, Mar 14, 2012 at 8:06 AM, Mohit Anchlia  >wrote:
> >
> >> What is the corresponding system property for setNumTasks? Can it be
> used
> >> explicitly as system property like "mapred.tasks."?
>
>
>
> --
> Harsh J
>


Re: hadoop permission guideline

2012-03-22 Thread Suresh Srinivas
Can you please take this discussion to the CDH mailing list?

On Mar 22, 2012, at 7:51 AM, Michael Wang  wrote:

> I have installed Cloudera hadoop (CDH). I used its Cloudera Manager to 
> install all needed packages. When it was installed, the root is used.  I 
> found the installation created some users, such as hdfs, hive, 
> mapred,hue,hbase...
> After the installation, should we change some permission or ownership of some 
> directories/files? For example, to use HIVE. It works fine with root user, 
> since the metastore directory belongs to root. But in order to let other user 
> use HIVE, I have to change metastore ownership to a specific non-root user, 
> then it works. Is it the best practice?
> Another example is the start-all.sh, stop-all.sh they all belong to root. 
> Should I change them to other user? I guess there are more cases...
> 
> Thanks,
> 
> 
> 
> This electronic message, including any attachments, may contain proprietary, 
> confidential or privileged information for the sole use of the intended 
> recipient(s). You are hereby notified that any unauthorized disclosure, 
> copying, distribution, or use of this message is prohibited. If you have 
> received this message in error, please immediately notify the sender by reply 
> e-mail and delete it.


Re: setNumTasks

2012-03-22 Thread Harsh J
There isn't such an API as "setNumTasks". There is however,
"setNumReduceTasks", which sets "mapred.reduce.tasks".

Does this answer your question?

On Thu, Mar 22, 2012 at 8:21 PM, Mohit Anchlia  wrote:
> Could someone please help me answer this question?
>
> On Wed, Mar 14, 2012 at 8:06 AM, Mohit Anchlia wrote:
>
>> What is the corresponding system property for setNumTasks? Can it be used
>> explicitly as system property like "mapred.tasks."?



-- 
Harsh J


hadoop permission guideline

2012-03-22 Thread Michael Wang
I have installed Cloudera Hadoop (CDH). I used its Cloudera Manager to install 
all the needed packages. When it was installed, root was used.  I found the 
installation created some users, such as hdfs, hive, mapred, hue, hbase...
After the installation, should we change the permissions or ownership of some 
directories/files? For example, to use Hive: it works fine as the root user, 
since the metastore directory belongs to root, but in order to let other users 
use Hive, I have to change the metastore ownership to a specific non-root user, 
and then it works. Is that the best practice?
Another example is start-all.sh and stop-all.sh; they all belong to root. 
Should I change them to another user? I guess there are more cases...

Thanks,



This electronic message, including any attachments, may contain proprietary, 
confidential or privileged information for the sole use of the intended 
recipient(s). You are hereby notified that any unauthorized disclosure, 
copying, distribution, or use of this message is prohibited. If you have 
received this message in error, please immediately notify the sender by reply 
e-mail and delete it.


Re: setNumTasks

2012-03-22 Thread Mohit Anchlia
Could someone please help me answer this question?

On Wed, Mar 14, 2012 at 8:06 AM, Mohit Anchlia wrote:

> What is the corresponding system property for setNumTasks? Can it be used
> explicitly as system property like "mapred.tasks."?


Re: Snappy Error

2012-03-22 Thread Mohit Anchlia
Looks like org.apache.hadoop.io.compress.SnappyCodec is not in the
classpath?

On Thu, Mar 22, 2012 at 4:30 AM, hadoop hive  wrote:

> Hi Folks,
>
> I followed all the steps to build and install Snappy, and after creating
> the sequence table, when I INSERT OVERWRITE data into this table it
> throws this error.
>
>
> java.io.IOException: Cannot create an instance of InputFormat class
> org.apache.hadoop.mapred.TextInputFormat as specified in mapredWork!
>at
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getInputFormatFromCache(HiveInputFormat.java:197)
>at
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:236)
>at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
>at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>at org.apache.hadoop.mapred.Child.main(Child.java:170)
> Caused by: java.lang.RuntimeException: Error in configuring object
>at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
>at
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
>at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>at
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getInputFormatFromCache(HiveInputFormat.java:193)
>... 4 more
> Caused by: java.lang.reflect.InvocationTargetException
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:597)
>at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
>... 7 more
> Caused by: java.lang.IllegalArgumentException: Compression codec
> org.apache.hadoop.io.compress.SnappyCodec not found.
>at
> org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:96)
>at
> org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:134)
>at
> org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:41)
>... 12 more
> Caused by: java.lang.ClassNotFoundException:
> org.apache.hadoop.io.compress.SnappyCodec
>at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
>at java.security.AccessController.doPrivileged(Native Method)
>at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
>at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
>at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
>at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
>at java.lang.Class.forName0(Native Method)
>at java.lang.Class.forName(Class.java:247)
>at
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:762)
>at
> org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:89)
>... 14 more
>
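
As a quick way to check Mohit's classpath theory, the sketch below (class name
illustrative) simply asks the running Configuration for the codec class. If it
throws ClassNotFoundException, the hadoop-core jar the tasks run with either
predates SnappyCodec (it is not in 0.20.2, if I remember right) or is not on
their classpath; the codec also has to be listed in io.compression.codecs for
TextInputFormat to load it:

import org.apache.hadoop.conf.Configuration;

public class SnappyCodecCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Throws ClassNotFoundException if SnappyCodec is not on the classpath.
    Class<?> codec = conf.getClassByName("org.apache.hadoop.io.compress.SnappyCodec");
    System.out.println("Found " + codec.getName());
    // If the class is present but Hive/MapReduce still fails, check that
    // io.compression.codecs in core-site.xml lists
    // org.apache.hadoop.io.compress.SnappyCodec and that the native
    // libsnappy/libhadoop libraries are on java.library.path for the tasks.
  }
}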