Quick question

2010-10-06 Thread Maha A. Alabduljalil

Hi everyone,

  I've started up Hadoop (the HDFS NameNode and DataNodes, the JobTracker, and the TaskTrackers) using the Quick Start guide. The web views of the filesystem and the JobTracker suddenly started returning "can't be found" errors in Safari.

 Note that I'm actually accessing Hadoop via ssh to my school account.
Could that be the problem?


 Thank you,

  Maha



Quick question

2011-02-18 Thread maha
Hi all,

  I want to check whether the following statement is right:

 If I use TextInputFormat to process a text file with 2000 lines (each ending
with \n) using 20 mappers, then each map will receive a sequence of COMPLETE
lines.

In other words, the input is not split byte-wise but line-wise.

Is that right?


Thank you,
Maha

Re: Quick question

2010-10-06 Thread Asif Jan

Hi

Check whether the ports are open outside the school network; otherwise,
you will have to use ssh tunneling to reach the ports serving the web
pages (most likely they are not open by default).


Try something like

ssh -L50030:hadoop-host-address:50030 ur-usern...@cluster-head-node


Then open localhost:50030 to see the JobTracker page.



cheers



On Oct 6, 2010, at 9:14 PM, Maha A. Alabduljalil wrote:


Hi Every one,

 I've started up hadoop (hdfs data and name nodes, JobTracker and  
TaskTrakers), using the quick start guidance. The web view of the  
filesystem and jobtracker suddenly started to give can't be found by  
safari.


Notice I'm actually accessing hadoop via ssh to my school account.  
Could that be the problem?


Thank you,

 Maha

Re: Quick question

2010-10-06 Thread Maha A. Alabduljalil

Thanks Asif, it worked! :)

 Maha

Quoting Asif Jan :


Hi

check if the ports are open outside school network else

you will have to use ssh tunneling if you want to access ports  
serving the webpages (as it is more likely that these are not open  
by default)


try something like

ssh -L50030:hadoop-host-address:50030 ur-usern...@cluster-head-node


Then open  localhost:50030 to see job-tracker page.



cheers



On Oct 6, 2010, at 9:14 PM, Maha A. Alabduljalil wrote:


Hi Every one,

I've started up hadoop (hdfs data and name nodes, JobTracker and  
TaskTrakers), using the quick start guidance. The web view of the  
filesystem and jobtracker suddenly started to give can't be found  
by safari.


Notice I'm actually accessing hadoop via ssh to my school account.  
Could that be the problem?


   Thank you,

Maha

 Thank you,

  Maha Alabduljalil
  m...@cs.ucsb.edu



another quick question

2010-10-06 Thread Maha A. Alabduljalil

Hi again,

  I guess my questions are easy ones.

Since I'm installing Hadoop on my school machine, I have to view the
NameNode online via hdfs://host-name:50070 instead of the default link
provided by the Hadoop Quick Start (i.e. hdfs://localhost:50070).


  Do you think I should set my hadoop.tmp.dir to the machine I'm
currently working on, so that I can use the default address?



 Thank you,

  Maha



Re: Quick question

2011-02-18 Thread Ted Dunning
The input is effectively split by lines, but under the covers the actual
splits are by byte.  Each mapper cleverly scans from its specified start
point to the beginning of the next line.  At the end, it over-reads to the
end of the line that is at or after the end of its specified region.  This
can make the last split a bit smaller than the others and the first a bit
larger.

Practically speaking, however, your 2000-line file is extremely unlikely to
be split at all, because it is so small.
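[Editor's note: the boundary handling Ted describes can be sketched without Hadoop at all. The following is a plain-Python model of the logic in Hadoop's line record reader, not the actual implementation; the function name and split count are illustrative.]

```python
def read_records(data: bytes, start: int, end: int):
    """Model of how a line-oriented record reader handles a byte-range split:
    a split that does not begin at byte 0 first skips to the start of the
    next full line (the previous split owns the line straddling the
    boundary), then reads whole lines, over-reading past `end` to finish
    the last line that starts inside this split."""
    pos = start
    if start != 0:
        # Back up one byte so a split starting exactly on a line
        # boundary does not throw its first line away.
        pos = data.find(b"\n", start - 1) + 1
    records = []
    while pos < end:  # a line is consumed iff it *starts* before `end`
        nl = data.find(b"\n", pos)
        records.append(data[pos:nl + 1])
        pos = nl + 1
    return records

data = b"".join(b"line%02d\n" % i for i in range(20))  # 20 lines, 7 bytes each
n = 3
size = len(data)
splits = [(i * size // n, (i + 1) * size // n) for i in range(n)]  # byte-wise!

all_records = [r for s, e in splits for r in read_records(data, s, e)]
assert b"".join(all_records) == data                # nothing lost, nothing doubled
assert all(r.endswith(b"\n") for r in all_records)  # every record is a whole line
```

Even though the split boundaries fall mid-line, every mapper sees only complete lines, and every line is seen exactly once.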

On Fri, Feb 18, 2011 at 11:14 AM, maha  wrote:

> Hi all,
>
>  I want to check if the following statement is right:
>
>  If I use TextInputFormat to process a text file with 2000 lines (each
> ending with \n) with 20 mappers. Then each map will have a sequence of
> COMPLETE LINES .
>
> In other words,  the input is not split byte-wise but by lines.
>
> Is that right?
>
>
> Thank you,
> Maha


RE: Quick question

2011-02-18 Thread Jim Falgout
That's right. The TextInputFormat handles situations where records cross split 
boundaries. What your mapper will see is "whole" records. 

-Original Message-
From: maha [mailto:m...@umail.ucsb.edu] 
Sent: Friday, February 18, 2011 1:14 PM
To: common-user
Subject: Quick question

Hi all,

  I want to check if the following statement is right:

 If I use TextInputFormat to process a text file with 2000 lines (each ending 
with \n) with 20 mappers. Then each map will have a sequence of COMPLETE LINES 
. 

In other words,  the input is not split byte-wise but by lines. 

Is that right?


Thank you,
Maha



Re: Quick question

2011-02-18 Thread maha
Thanks Ted and Jim :)
Maha

On Feb 18, 2011, at 11:55 AM, Jim Falgout wrote:

> That's right. The TextInputFormat handles situations where records cross 
> split boundaries. What your mapper will see is "whole" records. 
> 
> -Original Message-
> From: maha [mailto:m...@umail.ucsb.edu] 
> Sent: Friday, February 18, 2011 1:14 PM
> To: common-user
> Subject: Quick question
> 
> Hi all,
> 
>  I want to check if the following statement is right:
> 
> If I use TextInputFormat to process a text file with 2000 lines (each ending 
> with \n) with 20 mappers. Then each map will have a sequence of COMPLETE 
> LINES . 
> 
> In other words,  the input is not split byte-wise but by lines. 
> 
> Is that right?
> 
> 
> Thank you,
> Maha
> 



Re: Quick question

2011-02-20 Thread maha
Hi again Jim and Ted,

 I understood that each mapper will get a block of lines... but even
though I had only 2 mappers for a 16-line input file with TextInputFormat
in use, the map function was invoked for each of those 16 lines!

I wanted a block of lines per map, i.e. something like map1 gets 8 lines and
map2 gets 8 lines.

So, first question: is there a difference between mappers and maps?

Second: does that mean I need to write my own InputFormat to make the
InputSplit equal to multiple lines?

Thank you,

Maha


On Feb 18, 2011, at 11:55 AM, Jim Falgout wrote:

> That's right. The TextInputFormat handles situations where records cross 
> split boundaries. What your mapper will see is "whole" records. 
> 
> -Original Message-
> From: maha [mailto:m...@umail.ucsb.edu] 
> Sent: Friday, February 18, 2011 1:14 PM
> To: common-user
> Subject: Quick question
> 
> Hi all,
> 
>  I want to check if the following statement is right:
> 
> If I use TextInputFormat to process a text file with 2000 lines (each ending 
> with \n) with 20 mappers. Then each map will have a sequence of COMPLETE 
> LINES . 
> 
> In other words,  the input is not split byte-wise but by lines. 
> 
> Is that right?
> 
> 
> Thank you,
> Maha
> 



Re: Quick question

2011-02-20 Thread maha
Actually, the following solved my problem... but I'm a little suspicious of the
side effects of doing this instead of writing my own InputSplit of 5 lines:

 conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class);
 conf.setInt("mapred.line.input.format.linespermap", 5); // 5 lines per map task

If you have any thoughts on whether the above is worse than writing my
own InputSplit of about 5 lines, let me know.

Thanks everyone !

Maha
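[Editor's note: the effect of mapred.line.input.format.linespermap can be modeled without Hadoop. The sketch below is a plain-Python picture of NLineInputFormat's grouping; the names and the 16-line file are illustrative, taken from this thread.]

```python
def nline_splits(lines, lines_per_split=5):
    """Model of NLineInputFormat's grouping: every N consecutive lines
    become one InputSplit, and each split is handed to one map task.
    The map function still runs once per line *within* a split."""
    return [lines[i:i + lines_per_split]
            for i in range(0, len(lines), lines_per_split)]

lines = ["line%d" % i for i in range(16)]  # the 16-line file from the thread
splits = nline_splits(lines, 5)
map_calls = sum(len(s) for s in splits)    # one map() call per line

print(len(splits), map_calls)  # 4 16
```

So with 16 lines and linespermap = 5 you get 4 map tasks, but still 16 map() invocations, which is exactly the behavior reported later in this thread.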

On Feb 20, 2011, at 11:47 AM, maha wrote:

> Hi again Jim and Ted,
> 
> I understood that each mapper will be getting a block of lines... but even 
> thought I had only 2 mappers for a 16 lines of input file and TextInputFormat 
> is used. A map-function is processed for each of those 16 lines!
> 
> I wanted a block of lines per map ... hence something like map1 has 8 lines 
> and map2 has 8 lines. 
> 
> So first question: is there a difference between Mappers and maps ?
> 
> Second: Does that mean I need to write my own inputFormat to make the 
> InputSplit equal to multipleLines ???
> 
> Thank you,
> 
> Maha
> 
> 
> 



Re: Quick question

2011-02-20 Thread maha
Yet the map function was still invoked 16 times, as described by
NLineInputFormat.  I want the map function to be called once for the whole
InputSplit of 5 lines, not once for each of the 16 lines.

Any ideas other than building my own InputFormat?

Thank you,

Maha
 
On Feb 20, 2011, at 11:59 AM, maha wrote:

> Actually the following solved my problem ... but I'm a little suspicious of 
> the side effect of doing the following instead of using my own InputSplit to 
> be 5 lines.
> 
> conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class); // 
> # of maps = # lines
> conf.setInt("mapred.line.input.format.linespermap", 5); //# of lines per 
> mapper = 5
> 
> If you have any thought of whether the upper solution is worst that writing 
> my own inputSplit to be about 5 lines, let me know.
> 
> Thanks everyone !
> 
> Maha
>   
> 



Re: Quick question

2011-02-20 Thread Ted Dunning
This is the most important thing that you have said. The map function
is called once per unit of input, but the mapper object persists across
many units of input.

You have a little bit of control over how many mapper objects there
are, how many machines they are created on, and how many pieces your
input is broken into.  That control is limited, however, unless you
build your own input format. The standard input formats are optimized
for very large inputs and may not give you the flexibility that you
want for your experiments. That is unfortunate for the purpose of
learning about Hadoop, but Hadoop is designed mostly for dealing with
very large data and isn't primarily designed to be easy to understand.
Where easy coincides with powerful, easy wins, but powerful isn't
always easy.
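[Editor's note: the mapper-object vs. map-function distinction can be shown with a toy model. This is plain Python, not the Hadoop API; the class and split contents are made up for illustration.]

```python
class Mapper:
    """One object per map task; map() is invoked once per input record."""
    instances = 0

    def __init__(self):
        Mapper.instances += 1
        self.calls = 0

    def map(self, key, value):
        self.calls += 1  # per-record work would go here

splits = [["a", "b", "c"], ["d", "e"]]  # two input splits
mappers = []
for split in splits:
    m = Mapper()                 # the framework builds one Mapper per task...
    for offset, line in enumerate(split):
        m.map(offset, line)      # ...but calls map() for every record in it
    mappers.append(m)

print(Mapper.instances, [m.calls for m in mappers])  # 2 [3, 2]
```

Two mapper objects, five map() calls: state set up in the object (Hadoop's configure()/close() hooks) lives across all the records of a split, which is what makes "one output per mapper" possible.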

On Sunday, February 20, 2011, maha  wrote:
> So first question: is there a difference between Mappers and maps ?


RE: Quick question

2011-02-21 Thread Jim Falgout
Your scenario matches the capability of NLineInputFormat exactly, so that
looks to be the best solution. If you wrote your own input format, it would
basically have to do what NLineInputFormat is already doing for you.

-Original Message-
From: maha [mailto:m...@umail.ucsb.edu] 
Sent: Sunday, February 20, 2011 2:00 PM
To: common-user@hadoop.apache.org
Subject: Re: Quick question

Actually the following solved my problem ... but I'm a little suspicious of the 
side effect of doing the following instead of using my own InputSplit to be 5 
lines.

 conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class); // # of maps = # lines
 conf.setInt("mapred.line.input.format.linespermap", 5); // # of lines per mapper = 5

If you have any thought of whether the upper solution is worst that writing my 
own inputSplit to be about 5 lines, let me know.

Thanks everyone !

Maha

On Feb 20, 2011, at 11:47 AM, maha wrote:

> Hi again Jim and Ted,
> 
> I understood that each mapper will be getting a block of lines... but even 
> thought I had only 2 mappers for a 16 lines of input file and TextInputFormat 
> is used. A map-function is processed for each of those 16 lines!
> 
> I wanted a block of lines per map ... hence something like map1 has 8 lines 
> and map2 has 8 lines. 
> 
> So first question: is there a difference between Mappers and maps ?
> 
> Second: Does that mean I need to write my own inputFormat to make the 
> InputSplit equal to multipleLines ???
> 
> Thank you,
> 
> Maha
> 
> 
> 




Re: Quick question

2011-02-21 Thread maha
Thanks for your answers Ted and Jim :)

Maha

On Feb 21, 2011, at 6:41 AM, Jim Falgout wrote:

> You're scenario matches the capability of NLineInputFormat exactly, so that 
> looks to be the best solution. If you wrote your own input format, it would 
> have to basically do what NLineInputFormat is already doing for you.
> 
> -Original Message-
> From: maha [mailto:m...@umail.ucsb.edu] 
> Sent: Sunday, February 20, 2011 2:00 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Quick question
> 
> Actually the following solved my problem ... but I'm a little suspicious of 
> the side effect of doing the following instead of using my own InputSplit to 
> be 5 lines.
> 
> conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class); // 
> # of maps = # lines  conf.setInt("mapred.line.input.format.linespermap", 5); 
> //# of lines per mapper = 5
> 
> If you have any thought of whether the upper solution is worst that writing 
> my own inputSplit to be about 5 lines, let me know.
> 
> Thanks everyone !
> 
> Maha
>   
> 



Re: Quick question

2011-02-21 Thread maha
How, then, can I produce one output file per mapper rather than per map task?

Thank you,
Maha

On Feb 20, 2011, at 10:22 PM, Ted Dunning wrote:

> This is the most important thing that you have said. The map function
> is called once per unit of input but the mapper object persists for
> many input units of input.
> 
> You have a little bit of control over how many mapper objects there
> are and how many machines they are created on and how many pieces your
> input is broken into.  That control is limited, however, unless you
> build your own input format. The standard input formats are optimized
> for very large inputs and may not give you the flexibility that you
> want for your experiments. That is unfortunate for the purpose of
> learning about hadoop but hadoop is designed mostly for dealing with
> very large data and isn't usually designed to be easy to understand.
> Where easy coincides with powerful then easy is good but powerful
> isn't always easy.
> 
> On Sunday, February 20, 2011, maha  wrote:
>> So first question: is there a difference between Mappers and maps ?



Re: another quick question

2010-10-06 Thread Asif Jan

Hi

The tmp directory is local to the machine running the Hadoop system,
so if your Hadoop is on a remote machine, the tmp directory has to be
on that machine.


Your question is not clear to me; e.g., what do you want to do?

asif


On Oct 6, 2010, at 9:55 PM, Maha A. Alabduljalil wrote:


Hi again,

 I guess my questions are easy..

Since I'm installing hadoop in my school machine I have to veiw  
namenode online via hdfs://host-name:50070 instead of the default  
link provided by Hadoop Quick Start. (ie.hdfs://localhost:50070).


 Do you think I should set my hadoop.tmp.dir to the machine I'm  
currenlty working  on and I can do the default way?



Thank you,

 Maha


Re: another quick question

2010-10-06 Thread Maha A. Alabduljalil

Sorry, I'm confused. The story is:

  I ssh into my school account from my home computer and installed
Hadoop in a school directory. I used to open the address from the
Hadoop Quick Start (hdfs://localhost:50070) in a browser on my home
computer and it showed me the filesystem. Yesterday, however, right
after I added the property hadoop.tmp.dir = /cs/student/maha to
core-site.xml, I could no longer open hdfs://localhost:50070 on my
home computer. I had to type hdfs://school-host:50070 to see it.


  I want to use the default link again from my home computer.

  Thanks,
  Maha





Quoting Asif Jan :


Hi

The tmp directory is local to the machine running the hadoop system,  
so if your hadoop is on a remote machine, tmp directory has to be on  
that machine


Your question is not clear to me e.g. what you want to do?

asif


On Oct 6, 2010, at 9:55 PM, Maha A. Alabduljalil wrote:


Hi again,

I guess my questions are easy..

Since I'm installing hadoop in my school machine I have to veiw  
namenode online via hdfs://host-name:50070 instead of the default  
link provided by Hadoop Quick Start. (ie.hdfs://localhost:50070).


Do you think I should set my hadoop.tmp.dir to the machine I'm  
currenlty working  on and I can do the default way?



   Thank you,

Maha


 Thank you,

  Maha Alabduljalil
  m...@cs.ucsb.edu



Re: another quick question

2010-10-06 Thread Jeff Zhang
Hi Maha ,

I don't think hadoop.tmp.dir is related to the web UI problem. The
web UI is bound to 0.0.0.0:50070, and localhost maps to 127.0.0.1 of
your home machine on the client side.
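[Editor's note: Jeff's point about 0.0.0.0 vs. localhost can be demonstrated with a small self-contained socket sketch. This is plain Python, and the helper name is made up for illustration.]

```python
import socket

def is_reachable(bind_addr, connect_addr):
    """Bind a listening TCP socket to bind_addr on an OS-chosen port,
    then report whether a connection to connect_addr on that port
    succeeds."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind((bind_addr, 0))   # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(2.0)
    try:
        client.connect((connect_addr, port))
        return True
    except OSError:
        return False
    finally:
        client.close()
        server.close()

# A socket bound to the wildcard address answers on every local
# interface, including loopback -- which is why a UI bound to
# 0.0.0.0:50070 is visible as localhost:50070 on the machine running
# Hadoop, but "localhost" on a *remote* machine points at that
# machine's own loopback, not the cluster.
print(is_reachable("0.0.0.0", "127.0.0.1"))  # True
```

This is why the ssh tunnel suggested earlier in the thread works: it makes the remote machine's port appear on the home machine's loopback.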

On Thu, Oct 7, 2010 at 4:22 AM, Maha A. Alabduljalil
 wrote:
> Sorry I'm confused. The story is:
>
>  I ssh into my school-account using my home-computer and installed hadoop in
> school directory. I used to open the browser of Hadoop-Quick-Start
> (hdfs://localhost:50070) from my home computer and it showed me the
> file-system. Yesterday, however, I only wrote the property hadoop.tmp.dir =
> /cs/student/maha into core-site.xml, then  I couldn't open
> hdfs://localhost:50070 on my home computer. I had to type
> hdfs://school-host:50070 to see it.
>
>  I want to use the default link again using my home computer.
>
>  Thanks,
>  Maha
>
>
>
>
>
>
>
>
>                     Thank you,
>
>  Maha Alabduljalil
>  m...@cs.ucsb.edu
>
>



-- 
Best Regards

Jeff Zhang


Re: another quick question

2010-10-06 Thread Maha A. Alabduljalil
Well, I went to check. Now that I'm using the school machine, the UI
from the Quick Start works fine, i.e. hdfs://localhost:50070. Looking at
my filesystem, the temporary directory that was created has a system
file in it:


/cs/student/maha/tmp/mapred/system

  Is that why I'm now able to use "localhost" instead of typing
   hdfs://school-host-machine:50070?

   I think I need a networks course.

   Maha


Quoting Jeff Zhang :


Hi Maha ,

I don't think the hadoop.tmp.dir relates the problem of web ui. The
web ui is bind to 0.0.0.0:50070
And localhost is mapped to 127.0.0.1 of your home machine in your  
client side.


On Thu, Oct 7, 2010 at 4:22 AM, Maha A. Alabduljalil
 wrote:

Sorry I'm confused. The story is:

 I ssh into my school-account using my home-computer and installed hadoop in
school directory. I used to open the browser of Hadoop-Quick-Start
(hdfs://localhost:50070) from my home computer and it showed me the
file-system. Yesterday, however, I only wrote the property hadoop.tmp.dir =
/cs/student/maha into core-site.xml, then  I couldn't open
hdfs://localhost:50070 on my home computer. I had to type
hdfs://school-host:50070 to see it.

 I want to use the default link again using my home computer.

 Thanks,
 Maha





                    Thank you,

 Maha Alabduljalil
 m...@cs.ucsb.edu






--
Best Regards

Jeff Zhang





 Thank you,

  Maha Alabduljalil
  m...@cs.ucsb.edu



quick question about Pipes CLI

2009-12-09 Thread Prakhar Sharma
Hi all,
In the Pipes CLI:
bin/hadoop pipes \
  [-inputformat class] \
  [-map class] \
  [-partitioner class] \
  [-reduce class] \
  [-writer class] \

does "class" in "-inputformat class" mean a Java class, i.e. a path to a .class file?
(I am a bit of a novice in Java)

Thanks,
Prakhar


Quick Question: LineSplit or BlockSplit

2011-02-07 Thread maha
Hi,

  I would appreciate your thoughts on whether there is an effect on
efficiency if:

  1) mappers were per line in a document,

  or

  2) mappers were per block of lines in a document.


 The obvious difference I can see is that (1) has more mappers. Does
that mean (1) will be slower because of scheduling time?

Thank you,
Maha
 

Re: quick question about Pipes CLI

2009-12-09 Thread Philip Zeyliger
I believe "class" would be something like
"org.apache.hadoop.mapred.TextInputFormat" or whatever.  I haven't had a
chance to try it to make sure, however.

-- Philip

On Wed, Dec 9, 2009 at 9:15 PM, Prakhar Sharma wrote:

> Hi all,
> In the Pipes CLI:
> bin/hadoop pipes \
>  [-inputformat class] \
>  [-map class] \
>  [-partitioner class] \
>  [-reduce class] \
>  [-writer class] \
>
> do class in "-inputFormat class" means a java class i.e., path to a .class
> file?
> (I am bit of a novice in Java)
>
> Thanks,
> Prakhar
>


Re: quick question about Pipes CLI

2009-12-09 Thread Prakhar Sharma
Thanks Philip, that worked.

Regards,
Prakhar

On Thu, Dec 10, 2009 at 12:25 AM, Philip Zeyliger  wrote:
> I believe "class" would be something like
> "org.apache.hadoop.mapred.TextInputFormat" or whatever.  I haven't had a
> chance to try it to make sure, however.
>
> -- Philip
>
> On Wed, Dec 9, 2009 at 9:15 PM, Prakhar Sharma 
> wrote:
>
>> Hi all,
>> In the Pipes CLI:
>> bin/hadoop pipes \
>>  [-inputformat class] \
>>  [-map class] \
>>  [-partitioner class] \
>>  [-reduce class] \
>>  [-writer class] \
>>
>> do class in "-inputFormat class" means a java class i.e., path to a .class
>> file?
>> (I am bit of a novice in Java)
>>
>> Thanks,
>> Prakhar
>>
>


Re: Quick Question: LineSplit or BlockSplit

2011-02-07 Thread Ted Dunning
Option (1) isn't the way that things normally work.  Besides, the map
function is called many times for each mapper that is constructed.

On Mon, Feb 7, 2011 at 3:38 PM, maha  wrote:

> Hi,
>
>  I would appreciate it if you could give me your thoughts if there is
> affect on efficiency if:
>
>  1) Mappers were per line in a document
>
>  or
>
>  2) Mappers were per block of lines in a document.
>
>
>  I know the obvious difference I can see is that (1) has more mappers. Does
> that mean (1) will be slower because of scheduling time ?
>
> Thank you,
> Maha
>


Re: Quick Question: LineSplit or BlockSplit

2011-02-07 Thread Mark Kerzner
Ted,

I am also interested in this answer.

I put the name of a zip file on a line in an input file, and I want one
mapper to read this line, and start working on it (since it now knows the
path in HDFS). Are you saying it's not doable?

Thank you,
Mark

On Mon, Feb 7, 2011 at 8:10 PM, Ted Dunning  wrote:

> Option (1) isn't the way that things normally work.  Besides, mappers are
> called many times for each construction of a mapper.
>
> On Mon, Feb 7, 2011 at 3:38 PM, maha  wrote:
>
> > Hi,
> >
> >  I would appreciate it if you could give me your thoughts if there is
> > affect on efficiency if:
> >
> >  1) Mappers were per line in a document
> >
> >  or
> >
> >  2) Mappers were per block of lines in a document.
> >
> >
> >  I know the obvious difference I can see is that (1) has more mappers.
> Does
> > that mean (1) will be slower because of scheduling time ?
> >
> > Thank you,
> > Maha
> >
>


Re: Quick Question: LineSplit or BlockSplit

2011-02-07 Thread Ted Dunning
That is quite doable.  One way to do it is to make the max split size quite
small.

On Mon, Feb 7, 2011 at 6:14 PM, Mark Kerzner  wrote:

> Ted,
>
> I am also interested in this answer.
>
> I put the name of a zip file on a line in an input file, and I want one
> mapper to read this line, and start working on it (since it now knows the
> path in HDFS). Are you saying it's not doable?
>
> Thank you,
> Mark
>
> On Mon, Feb 7, 2011 at 8:10 PM, Ted Dunning  wrote:
>
> > Option (1) isn't the way that things normally work.  Besides, mappers are
> > called many times for each construction of a mapper.
> >
> > On Mon, Feb 7, 2011 at 3:38 PM, maha  wrote:
> >
> > > Hi,
> > >
> > >  I would appreciate it if you could give me your thoughts if there is
> > > affect on efficiency if:
> > >
> > >  1) Mappers were per line in a document
> > >
> > >  or
> > >
> > >  2) Mappers were per block of lines in a document.
> > >
> > >
> > >  I know the obvious difference I can see is that (1) has more mappers.
> > Does
> > > that mean (1) will be slower because of scheduling time ?
> > >
> > > Thank you,
> > > Maha
> > >
> >
>


Re: Quick Question: LineSplit or BlockSplit

2011-02-07 Thread Mark Kerzner
Thanks!
Mark

On Mon, Feb 7, 2011 at 8:28 PM, Ted Dunning  wrote:

> That is quite doable.  One way to do it is to make the max split size quite
> small.
>
> On Mon, Feb 7, 2011 at 6:14 PM, Mark Kerzner 
> wrote:
>
> > Ted,
> >
> > I am also interested in this answer.
> >
> > I put the name of a zip file on a line in an input file, and I want one
> > mapper to read this line, and start working on it (since it now knows the
> > path in HDFS). Are you saying it's not doable?
> >
> > Thank you,
> > Mark
> >
>


Re: Quick Question: LineSplit or BlockSplit

2011-02-07 Thread maha
Thanks, Ted. Then I have to write my own InputFormat to read a block of lines
per mapper.

 NLineInputFormat didn't work for me; any working example of it would be
appreciated.

Thanks again,

Maha





On Feb 7, 2011, at 6:32 PM, Mark Kerzner wrote:

> Thanks!
> Mark
> 
> On Mon, Feb 7, 2011 at 8:28 PM, Ted Dunning  wrote:
> 
>> That is quite doable.  One way to do it is to make the max split size quite
>> small.
>> 
>> On Mon, Feb 7, 2011 at 6:14 PM, Mark Kerzner 
>> wrote:
>> 
>>> Ted,
>>> 
>>> I am also interested in this answer.
>>> 
>>> I put the name of a zip file on a line in an input file, and I want one
>>> mapper to read this line, and start working on it (since it now knows the
>>> path in HDFS). Are you saying it's not doable?
>>> 
>>> Thank you,
>>> Mark
>>> 