Re: DFSIO

2012-03-01 Thread madhu phatak
Hi Harsha,
 Sorry, I read DFSIO as DFS Input/Output, which I thought meant reading and
writing using the HDFS API :)

On Fri, Mar 2, 2012 at 12:32 PM, Harsh J  wrote:

> Madhu,
>
> That is incorrect. TestDFSIO is a MapReduce job and you need HDFS+MR
> setup to use it.
>
> On Fri, Mar 2, 2012 at 11:07 AM, madhu phatak 
> wrote:
> > Hi,
> >  Only HDFS should be enough.
> >
> > On Fri, Nov 25, 2011 at 1:45 AM, Thanh Do  wrote:
> >
> >> hi all,
> >>
> >> in order to run DFSIO in my cluster,
> >> do i need to run JobTracker, and TaskTracker,
> >> or just running HDFS is enough?
> >>
> >> Many thanks,
> >> Thanh
> >>
> >
> >
> >
> > --
> > Join me at http://hadoopworkshop.eventbrite.com/
>
>
>
> --
> Harsh J
>



-- 
Join me at http://hadoopworkshop.eventbrite.com/


Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-01 Thread Harsh J
On Fri, Mar 2, 2012 at 10:18 AM, Subir S  wrote:
> Hello Folks,
>
> Are there any pointers to such comparisons between Apache Pig and Hadoop
> Streaming Map Reduce jobs?

I do not see why you seek to compare these two. Pig offers a language
that lets you write data-flow operations and runs these statements as
a series of MR jobs for you automatically (making it a great tool for
getting data processing done really quickly, without bothering with
code), while streaming is something you use to write simple, non-Java
MR jobs. Both have their own purposes.
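
To make the distinction concrete, a rough sketch (jar, input and script names
below are only placeholders, not anything from this thread):

  # a streaming job: you write and ship the mapper/reducer yourself
  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
      -input /logs/in -output /logs/out \
      -mapper 'python parse.py' -reducer 'python sum.py' \
      -file parse.py -file sum.py

  # the Pig route: you describe the data flow and Pig plans the MR jobs for you
  pig -f report.pig

Comparing the two is really comparing your hand-written job against the plan
Pig generates for the same logic, which is why raw benchmarks vary so much by
workload.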

> Also there was a claim in our company that Pig performs better than Map
> Reduce jobs? Is this true? Are there any such benchmarks available

Pig _runs_ MR jobs. It does do job design (and some data)
optimizations based on your queries, which is what may give it an edge
over designing elaborate flows of plain MR jobs with tools like
Oozie/JobControl (which takes more time to do). But regardless, Pig
only makes it easier to do the same thing, via Pig Latin statements.

-- 
Harsh J


Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-01 Thread Jie Li
Considering Pig essentially translates scripts into MapReduce jobs, one
can always write MapReduce jobs that are at least as good as the ones Pig
generates. You can refer to the "Pig Experience" paper to see the overhead
Pig introduces, though it has been improving all the time.

Btw if you really care about the performance, how you configure Hadoop and
Pig can also play an important role.

Thanks,
Jie
--
Starfish is an intelligent performance tuning tool for Hadoop.
Homepage: www.cs.duke.edu/starfish/
Mailing list: http://groups.google.com/group/hadoop-starfish

On Thu, Mar 1, 2012 at 11:48 PM, Subir S  wrote:

> Hello Folks,
>
> Are there any pointers to such comparisons between Apache Pig and Hadoop
> Streaming Map Reduce jobs?
>
> Also there was a claim in our company that Pig performs better than Map
> Reduce jobs? Is this true? Are there any such benchmarks available
>
> Thanks, Subir
>


Re: DFSIO

2012-03-01 Thread Harsh J
Madhu,

That is incorrect. TestDFSIO is a MapReduce job, so you need an HDFS+MR
setup to use it.
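
For anyone finding this in the archive, a minimal sketch of a TestDFSIO run
(the test jar name varies by release, so treat the path as a placeholder):

  # write 10 files of 100 MB each, read them back, then clean up
  hadoop jar $HADOOP_HOME/hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 100
  hadoop jar $HADOOP_HOME/hadoop-*test*.jar TestDFSIO -read -nrFiles 10 -fileSize 100
  hadoop jar $HADOOP_HOME/hadoop-*test*.jar TestDFSIO -clean

The throughput numbers are appended to TestDFSIO_results.log in the local
working directory, and since it is an MR job it needs the JobTracker and
TaskTrackers running, as noted above.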

On Fri, Mar 2, 2012 at 11:07 AM, madhu phatak  wrote:
> Hi,
>  Only HDFS should be enough.
>
> On Fri, Nov 25, 2011 at 1:45 AM, Thanh Do  wrote:
>
>> hi all,
>>
>> in order to run DFSIO in my cluster,
>> do i need to run JobTracker, and TaskTracker,
>> or just running HDFS is enough?
>>
>> Many thanks,
>> Thanh
>>
>
>
>
> --
> Join me at http://hadoopworkshop.eventbrite.com/



-- 
Harsh J


Re: DFSIO

2012-03-01 Thread madhu phatak
Hi,
 Only HDFS should be enough.

On Fri, Nov 25, 2011 at 1:45 AM, Thanh Do  wrote:

> hi all,
>
> in order to run DFSIO in my cluster,
> do i need to run JobTracker, and TaskTracker,
> or just running HDFS is enough?
>
> Many thanks,
> Thanh
>



-- 
Join me at http://hadoopworkshop.eventbrite.com/


Re: Reducer NullPointerException

2012-03-01 Thread madhu phatak
Hi,
 It seems like you are trying to run only the reducer without a mapper. Can you
share the main() method code which you are trying to run?
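
In the meantime, a quick sanity check is to run the stock example that ships
with the distribution and see whether it hits the same NullPointerException
(the jar path below is only a guess for a cdh3 layout; adjust it to wherever
your packages install the examples jar):

  hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount /user/test/input /user/test/output

If the packaged example fails the same way, the problem is in the cluster
setup rather than in your driver code.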

On Mon, Jan 23, 2012 at 11:43 AM, burakkk  wrote:

> Hello everyone,
> I have 3 server(1 master, 2 slave) and I installed cdh3u2 on each
> server. I execute simple wordcount example but reducer had a
> NullPointerException. How can i solve this problem?
>
> The error log is that:
> Error: java.lang.NullPointerException
>   at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:
> 768)
>   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> $GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2806)
>   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> $GetMapEventsThread.run(ReduceTask.java:2733)
>
> Error: java.lang.NullPointerException
>   at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:
> 768)
>   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> $GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2806)
>   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> $GetMapEventsThread.run(ReduceTask.java:2733)
>
> Error: java.lang.NullPointerException
>   at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:
> 768)
>   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> $GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2806)
>   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> $GetMapEventsThread.run(ReduceTask.java:2733)
>
> Error: java.lang.NullPointerException
>   at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:
> 768)
>   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> $GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2806)
>   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier
> $GetMapEventsThread.run(ReduceTask.java:2733)
>
>
> Thanks
> Best Regards
>



-- 
Join me at http://hadoopworkshop.eventbrite.com/


Re: Where Is DataJoinMapperBase?

2012-03-01 Thread madhu phatak
Hi,
 Please look inside the $HADOOP_HOME/contrib/datajoin folder of the 0.20.2
release. You will find the jar there.
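
Once you have it, the jar just needs to be on the classpath at compile time and
shipped with the job; roughly like this (jar names differ slightly between
releases, and MyJoinJob is a placeholder):

  # compile against it
  javac -classpath $HADOOP_HOME/hadoop-*core*.jar:$HADOOP_HOME/contrib/datajoin/hadoop-*datajoin*.jar MyJoinJob.java

  # ship it alongside the job
  hadoop jar myjoinjob.jar MyJoinJob -libjars $HADOOP_HOME/contrib/datajoin/hadoop-*datajoin*.jar /in /out

Note that -libjars is only picked up if the driver goes through
ToolRunner/GenericOptionsParser; otherwise bundle the jar in your job jar's
lib/ directory or add it to HADOOP_CLASSPATH.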

On Sat, Feb 11, 2012 at 1:09 AM, Bing Li  wrote:

> Hi, all,
>
> I am starting to learn advanced Map/Reduce. However, I cannot find the
> class DataJoinMapperBase in my downloaded Hadoop 1.0.0 and 0.20.2. So I
> searched on the Web and get the following link.
>
> http://www.java2s.com/Code/Jar/h/Downloadhadoop0201datajoinjar.htm
>
> From the link I got the package, hadoop-0.20.1-datajoin.jar. My question is
> why the package is not included in Hadoop 1.0.0 and 0.20.2? Is this the correct
> way to get it?
>
> Thanks so much!
>
> Best regards,
> Bing
>



-- 
Join me at http://hadoopworkshop.eventbrite.com/


Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-01 Thread Subir S
Hello Folks,

Are there any pointers to such comparisons between Apache Pig and Hadoop
Streaming Map Reduce jobs?

Also, there was a claim in our company that Pig performs better than Map
Reduce jobs. Is this true? Are there any such benchmarks available?

Thanks, Subir


Re: LZO exception decompressing (returned -8)

2012-03-01 Thread Marc Sturlese
Absolutely. In case I don't find the root of the problem soon I'll definitely
try it.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3792530.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: LZO exception decompressing (returned -8)

2012-03-01 Thread Marc Sturlese
Absolutely. In case I don't find the root of the problem soon I'll definitely
try it.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3792531.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: LZO exception decompressing (returned -8)

2012-03-01 Thread Joey Echeverria
I know this doesn't fix lzo, but have you considered Snappy for the
intermediate output compression? It gets similar compression ratios
and compress/decompress speed, but arguably has better Hadoop
integration.

-Joey

On Thu, Mar 1, 2012 at 10:01 PM, Marc Sturlese  wrote:
> I used to have 2.05, but now, as I said, I installed 2.06
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3792511.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Re: LZO exception decompressing (returned -8)

2012-03-01 Thread Marc Sturlese
I used to have 2.05, but now, as I said, I installed 2.06

--
View this message in context: 
http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3792511.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: LZO exception decompressing (returned -8)

2012-03-01 Thread Marc Sturlese
Yes. The steps I followed were:
1- Installed lzo 2.06 on a machine with the same kernel as my nodes.
2- Compiled hadoop-lzo 0.4.15 there (in /lib I replaced the cdh3u3 jar with my
Hadoop 0.20.2 release).
3- Replaced hadoop-lzo-0.4.9.jar with the newly compiled hadoop-lzo-0.4.15.jar in
the hadoop lib directory of all my nodes and the master.
4- Put the generated native files in the native lib directory of all the nodes
and the master.
5- In my job jar, replaced the library hadoop-lzo-0.4.9.jar with
hadoop-lzo-0.4.15.jar.

And sometimes when a job is running I get (4 times, so the job gets killed):

...org.apache.hadoop.mapred.ReduceTask: Shuffling 3188320 bytes (1025174 raw
bytes) into RAM from attempt_201202291221_1501_m_000480_0
2012-03-02 02:32:55,496 INFO org.apache.hadoop.mapred.ReduceTask: Task
attempt_201202291221_1501_r_000105_0: Failed fetch #1 from
attempt_201202291221_1501_m_46_0
2012-03-02 02:32:55,496 WARN org.apache.hadoop.mapred.ReduceTask:
attempt_201202291221_1501_r_000105_0 adding host hadoop-01.backend to
penalty box, next contact in 4 seconds
2012-03-02 02:32:55,496 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201202291221_1501_r_000105_0: Got 1 map-outputs from previous
failures
2012-03-02 02:32:55,497 FATAL org.apache.hadoop.mapred.TaskRunner:
attempt_201202291221_1501_r_000105_0 : Map output copy failure :
java.lang.InternalError: lzo1x_decompress returned: -8
at 
com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native
Method)
at
com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:305)
at
org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:76)
at
org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1553)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1432)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1285)
at
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1216)

--
View this message in context: 
http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3792505.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: LZO exception decompressing (returned -8)

2012-03-01 Thread Harsh J
Marc,

Were the lzo libs on your server upgraded to a higher version recently?

Also, when you deployed a built copy of 0.4.15, did you ensure you
replaced the older native libs for hadoop-lzo as well?
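
For reference, deploying a rebuilt hadoop-lzo usually means copying both
artifacts out of the build tree on every node, along these lines (paths assume
a 64-bit Linux ant build; adjust for your platform):

  cp build/hadoop-lzo-0.4.15.jar $HADOOP_HOME/lib/
  cp build/native/Linux-amd64-64/lib/libgplcompression.* $HADOOP_HOME/lib/native/Linux-amd64-64/

A stale copy of libgplcompression.* on even one TaskTracker can be enough to
produce this kind of shuffle-time decompression failure, so it is worth
checking all of them.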

On Fri, Mar 2, 2012 at 9:05 AM, Marc Sturlese  wrote:
> Tried 0.4.15 but am still getting the error. Really lost with this.
> My hadoop release is 0.20.2 from more than a year ago. Could this be related
> to the problem?
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3792484.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



-- 
Harsh J


Re: LZO exception decompressing (returned -8)

2012-03-01 Thread Marc Sturlese
Tried 0.4.15 but am still getting the error. Really lost with this.
My hadoop release is 0.20.2 from more than a year ago. Could this be related
to the problem?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/LZO-exception-decompressing-returned-8-tp3783652p3792484.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: Adding nodes

2012-03-01 Thread George Datskos

Mohit,

New datanodes will connect to the namenode, so that's how the namenode 
knows.  Just make sure the datanodes have the correct {fs.default.name} 
in their core-site.xml and then start them.  The namenode can, however, 
choose to reject the datanode if you are using the {dfs.hosts} and 
{dfs.hosts.exclude} settings in the namenode's hdfs-site.xml.


The namenode doesn't actually care about the slaves file.  It's only 
used by the start/stop scripts.
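
So the flow looks roughly like this (the include file path is just an example;
it is whatever your dfs.hosts property points at):

  # on the namenode, only if you use dfs.hosts: add the host and refresh first
  echo new-node.example.com >> /etc/hadoop/conf/dfs.hosts.include
  hadoop dfsadmin -refreshNodes

  # on the new slave, with its configs pointing at the master:
  $HADOOP_HOME/bin/hadoop-daemon.sh start datanode
  $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker

  # optionally rebalance existing blocks onto the new node
  hadoop balancer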



On 2012/03/02 10:35, Mohit Anchlia wrote:

I actually meant to ask how does namenode/jobtracker know there is a new
node in the cluster. Is it initiated by namenode when slave file is edited?
Or is it initiated by tasktracker when tasktracker is started?






Re: Adding nodes

2012-03-01 Thread Mohit Anchlia
Thanks all for the answers!!

On Thu, Mar 1, 2012 at 5:52 PM, Arpit Gupta  wrote:

> It is initiated by the slave.
>
> If you have defined files to state which slaves can talk to the namenode
> (using config dfs.hosts) and which hosts cannot (using
> property dfs.hosts.exclude) then you would need to edit these files and
> issue the refresh command.
>
>
>  On Mar 1, 2012, at 5:35 PM, Mohit Anchlia wrote:
>
>  On Thu, Mar 1, 2012 at 4:57 PM, Joey Echeverria 
> wrote:
>
> Not quite. Datanodes get the namenode host from fs.default.name in
>
> core-site.xml. Task trackers find the job tracker from the
>
> mapred.job.tracker setting in mapred-site.xml.
>
>
>
> I actually meant to ask how does namenode/jobtracker know there is a new
> node in the cluster. Is it initiated by namenode when slave file is edited?
> Or is it initiated by tasktracker when tasktracker is started?
>
>
> Sent from my iPhone
>
>
> On Mar 1, 2012, at 18:49, Mohit Anchlia  wrote:
>
>
>  On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria 
>
> wrote:
>
>
>  You only have to refresh nodes if you're making use of an allows file.
>
>
>  Thanks does it mean that when tasktracker/datanode starts up it
>
>  communicates with namenode using master file?
>
>
>  Sent from my iPhone
>
>
>  On Mar 1, 2012, at 18:29, Mohit Anchlia  wrote:
>
>
>   Is this the right procedure to add nodes? I took some from hadoop wiki
>
>  FAQ:
>
>
>   http://wiki.apache.org/hadoop/FAQ
>
>
>   1. Update conf/slave
>
>   2. on the slave nodes start datanode and tasktracker
>
>   3. hadoop balancer
>
>
>   Do I also need to run dfsadmin -refreshnodes?
>
>
>
>
>
> --
> Arpit
> Hortonworks, Inc.
> email: ar...@hortonworks.com
>
> 
>  
> 
>


Re: Adding nodes

2012-03-01 Thread Arpit Gupta
It is initiated by the slave. If you have defined files to state which slaves
can talk to the namenode (using config dfs.hosts) and which hosts cannot (using
property dfs.hosts.exclude) then you would need to edit these files and issue
the refresh command.

On Mar 1, 2012, at 5:35 PM, Mohit Anchlia wrote:

On Thu, Mar 1, 2012 at 4:57 PM, Joey Echeverria wrote:
Not quite. Datanodes get the namenode host from fs.default.name in
core-site.xml. Task trackers find the job tracker from the
mapred.job.tracker setting in mapred-site.xml.

I actually meant to ask how does namenode/jobtracker know there is a new
node in the cluster. Is it initiated by namenode when slave file is edited?
Or is it initiated by tasktracker when tasktracker is started?

Sent from my iPhone

On Mar 1, 2012, at 18:49, Mohit Anchlia wrote:

On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria wrote:
You only have to refresh nodes if you're making use of an allows file.

Thanks does it mean that when tasktracker/datanode starts up it
communicates with namenode using master file?

Sent from my iPhone

On Mar 1, 2012, at 18:29, Mohit Anchlia wrote:

Is this the right procedure to add nodes? I took some from hadoop wiki FAQ:

http://wiki.apache.org/hadoop/FAQ

1. Update conf/slave
2. on the slave nodes start datanode and tasktracker
3. hadoop balancer

Do I also need to run dfsadmin -refreshnodes?

--
Arpit
Hortonworks, Inc.
email: ar...@hortonworks.com



Re: Adding nodes

2012-03-01 Thread Raj Vishwanathan
What Joey said is correct for both the Apache and Cloudera distros. The DN/TT
daemons will connect to the NN/JT using the config files. The master and slave
files are used for starting the correct daemons.



>
> From: anil gupta 
>To: common-user@hadoop.apache.org; Raj Vishwanathan  
>Sent: Thursday, March 1, 2012 5:42 PM
>Subject: Re: Adding nodes
> 
>Whatever Joey said is correct for Cloudera's distribution. I am not confident
>about the same for other distributions, as I haven't tried them.
>
>Thanks,
>Anil
>
>On Thu, Mar 1, 2012 at 5:10 PM, Raj Vishwanathan  wrote:
>
>> The master and slave files, if I remember correctly are used to start the
>> correct daemons on the correct nodes from the master node.
>>
>>
>> Raj
>>
>>
>> >
>> > From: Joey Echeverria 
>> >To: "common-user@hadoop.apache.org" 
>> >Cc: "common-user@hadoop.apache.org" 
>> >Sent: Thursday, March 1, 2012 4:57 PM
>> >Subject: Re: Adding nodes
>> >
>> >Not quite. Datanodes get the namenode host from fs.default.name in
>> core-site.xml. Task trackers find the job tracker from the
>> mapred.job.tracker setting in mapred-site.xml.
>> >
>> >Sent from my iPhone
>> >
>> >On Mar 1, 2012, at 18:49, Mohit Anchlia  wrote:
>> >
>> >> On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria 
>> wrote:
>> >>
>> >>> You only have to refresh nodes if you're making use of an allows file.
>> >>>
>> >>> Thanks does it mean that when tasktracker/datanode starts up it
>> >> communicates with namenode using master file?
>> >>
>> >> Sent from my iPhone
>> >>>
>> >>> On Mar 1, 2012, at 18:29, Mohit Anchlia 
>> wrote:
>> >>>
>>  Is this the right procedure to add nodes? I took some from hadoop wiki
>> >>> FAQ:
>> 
>>  http://wiki.apache.org/hadoop/FAQ
>> 
>>  1. Update conf/slave
>>  2. on the slave nodes start datanode and tasktracker
>>  3. hadoop balancer
>> 
>>  Do I also need to run dfsadmin -refreshnodes?
>> >>>
>> >
>> >
>> >
>>
>
>
>
>-- 
>Thanks & Regards,
>Anil Gupta
>
>
>

Re: Adding nodes

2012-03-01 Thread anil gupta
Whatever Joey said is correct for Cloudera's distribution. I am not confident
about the same for other distributions, as I haven't tried them.

Thanks,
Anil

On Thu, Mar 1, 2012 at 5:10 PM, Raj Vishwanathan  wrote:

> The master and slave files, if I remember correctly are used to start the
> correct daemons on the correct nodes from the master node.
>
>
> Raj
>
>
> >
> > From: Joey Echeverria 
> >To: "common-user@hadoop.apache.org" 
> >Cc: "common-user@hadoop.apache.org" 
> >Sent: Thursday, March 1, 2012 4:57 PM
> >Subject: Re: Adding nodes
> >
> >Not quite. Datanodes get the namenode host from fs.default.name in
> core-site.xml. Task trackers find the job tracker from the
> mapred.job.tracker setting in mapred-site.xml.
> >
> >Sent from my iPhone
> >
> >On Mar 1, 2012, at 18:49, Mohit Anchlia  wrote:
> >
> >> On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria 
> wrote:
> >>
> >>> You only have to refresh nodes if you're making use of an allows file.
> >>>
> >>> Thanks does it mean that when tasktracker/datanode starts up it
> >> communicates with namenode using master file?
> >>
> >> Sent from my iPhone
> >>>
> >>> On Mar 1, 2012, at 18:29, Mohit Anchlia 
> wrote:
> >>>
>  Is this the right procedure to add nodes? I took some from hadoop wiki
> >>> FAQ:
> 
>  http://wiki.apache.org/hadoop/FAQ
> 
>  1. Update conf/slave
>  2. on the slave nodes start datanode and tasktracker
>  3. hadoop balancer
> 
>  Do I also need to run dfsadmin -refreshnodes?
> >>>
> >
> >
> >
>



-- 
Thanks & Regards,
Anil Gupta


Re: Adding nodes

2012-03-01 Thread Mohit Anchlia
On Thu, Mar 1, 2012 at 4:57 PM, Joey Echeverria  wrote:

> Not quite. Datanodes get the namenode host from fs.default.name in
> core-site.xml. Task trackers find the job tracker from the
> mapred.job.tracker setting in mapred-site.xml.
>

I actually meant to ask how does namenode/jobtracker know there is a new
node in the cluster. Is it initiated by namenode when slave file is edited?
Or is it initiated by tasktracker when tasktracker is started?

>
> Sent from my iPhone
>
> On Mar 1, 2012, at 18:49, Mohit Anchlia  wrote:
>
> > On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria 
> wrote:
> >
> >> You only have to refresh nodes if you're making use of an allows file.
> >>
> >> Thanks does it mean that when tasktracker/datanode starts up it
> > communicates with namenode using master file?
> >
> > Sent from my iPhone
> >>
> >> On Mar 1, 2012, at 18:29, Mohit Anchlia  wrote:
> >>
> >>> Is this the right procedure to add nodes? I took some from hadoop wiki
> >> FAQ:
> >>>
> >>> http://wiki.apache.org/hadoop/FAQ
> >>>
> >>> 1. Update conf/slave
> >>> 2. on the slave nodes start datanode and tasktracker
> >>> 3. hadoop balancer
> >>>
> >>> Do I also need to run dfsadmin -refreshnodes?
> >>
>


Re: Adding nodes

2012-03-01 Thread Raj Vishwanathan
The master and slave files, if I remember correctly are used to start the 
correct daemons on the correct nodes from the master node.


Raj


>
> From: Joey Echeverria 
>To: "common-user@hadoop.apache.org"  
>Cc: "common-user@hadoop.apache.org"  
>Sent: Thursday, March 1, 2012 4:57 PM
>Subject: Re: Adding nodes
> 
>Not quite. Datanodes get the namenode host from fs.default.name in 
>core-site.xml. Task trackers find the job tracker from the mapred.job.tracker 
>setting in mapred-site.xml. 
>
>Sent from my iPhone
>
>On Mar 1, 2012, at 18:49, Mohit Anchlia  wrote:
>
>> On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria  wrote:
>> 
>>> You only have to refresh nodes if you're making use of an allows file.
>>> 
>>> Thanks does it mean that when tasktracker/datanode starts up it
>> communicates with namenode using master file?
>> 
>> Sent from my iPhone
>>> 
>>> On Mar 1, 2012, at 18:29, Mohit Anchlia  wrote:
>>> 
 Is this the right procedure to add nodes? I took some from hadoop wiki
>>> FAQ:
 
 http://wiki.apache.org/hadoop/FAQ
 
 1. Update conf/slave
 2. on the slave nodes start datanode and tasktracker
 3. hadoop balancer
 
 Do I also need to run dfsadmin -refreshnodes?
>>> 
>
>
>

Re: Adding nodes

2012-03-01 Thread Joey Echeverria
Not quite. Datanodes get the namenode host from fs.default.name in 
core-site.xml. Task trackers find the job tracker from the mapred.job.tracker 
setting in mapred-site.xml. 

Sent from my iPhone

On Mar 1, 2012, at 18:49, Mohit Anchlia  wrote:

> On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria  wrote:
> 
>> You only have to refresh nodes if you're making use of an allows file.
>> 
>> Thanks does it mean that when tasktracker/datanode starts up it
> communicates with namenode using master file?
> 
> Sent from my iPhone
>> 
>> On Mar 1, 2012, at 18:29, Mohit Anchlia  wrote:
>> 
>>> Is this the right procedure to add nodes? I took some from hadoop wiki
>> FAQ:
>>> 
>>> http://wiki.apache.org/hadoop/FAQ
>>> 
>>> 1. Update conf/slave
>>> 2. on the slave nodes start datanode and tasktracker
>>> 3. hadoop balancer
>>> 
>>> Do I also need to run dfsadmin -refreshnodes?
>> 


Re: Adding nodes

2012-03-01 Thread Mohit Anchlia
On Thu, Mar 1, 2012 at 4:46 PM, Joey Echeverria  wrote:

> You only have to refresh nodes if you're making use of an allows file.
>
> Thanks does it mean that when tasktracker/datanode starts up it
communicates with namenode using master file?

Sent from my iPhone
>
> On Mar 1, 2012, at 18:29, Mohit Anchlia  wrote:
>
> > Is this the right procedure to add nodes? I took some from hadoop wiki
> FAQ:
> >
> > http://wiki.apache.org/hadoop/FAQ
> >
> > 1. Update conf/slave
> > 2. on the slave nodes start datanode and tasktracker
> > 3. hadoop balancer
> >
> > Do I also need to run dfsadmin -refreshnodes?
>


Re: Adding nodes

2012-03-01 Thread Joey Echeverria
You only have to refresh nodes if you're making use of an allows file. 

Sent from my iPhone

On Mar 1, 2012, at 18:29, Mohit Anchlia  wrote:

> Is this the right procedure to add nodes? I took some from hadoop wiki FAQ:
> 
> http://wiki.apache.org/hadoop/FAQ
> 
> 1. Update conf/slave
> 2. on the slave nodes start datanode and tasktracker
> 3. hadoop balancer
> 
> Do I also need to run dfsadmin -refreshnodes?


Adding nodes

2012-03-01 Thread Mohit Anchlia
Is this the right procedure to add nodes? I took some from hadoop wiki FAQ:

http://wiki.apache.org/hadoop/FAQ

1. Update conf/slave
2. on the slave nodes start datanode and tasktracker
3. hadoop balancer

Do I also need to run dfsadmin -refreshnodes?


Re: High quality hadoop logo?

2012-03-01 Thread Keith Wiley
Excellent!

Thank you.

Sent from my phone, please excuse my brevity.
Keith Wiley, kwi...@keithwiley.com, http://keithwiley.com


Owen O'Malley  wrote:

On Thu, Mar 1, 2012 at 2:14 PM, Keith Wiley  wrote:
> Sorry, false alarm.  I was looking at the popup thumbnails in google image 
> search.  If I click all the way through, there are some high quality
> versions available.  Why is the version on the Apache site (and the Wikipedia 
> page) so poor?

The high resolution images are in subversion:

http://svn.apache.org/repos/asf/hadoop/logos/

-- Owen



Re: Streaming Hadoop using C

2012-03-01 Thread Mark question
Starfish worked great for wordcount .. I didn't run it on my application
because I have only map tasks.
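
For the archive, a map-only streaming run with a native mapper binary looks
roughly like this (the binary and paths are placeholders):

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
      -D mapred.reduce.tasks=0 \
      -input /data/in -output /data/out \
      -mapper ./dedup_mapper -file dedup_mapper

The native process gets its usual malloc()/free() behaviour, but it still runs
next to a child JVM on each task slot, so mapred.child.java.opts has to be
budgeted alongside the native memory when you are trying to avoid swapping.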

Mark

On Thu, Mar 1, 2012 at 4:34 AM, Charles Earl wrote:

> How was your experience of starfish?
> C
> On Mar 1, 2012, at 12:35 AM, Mark question wrote:
>
> > Thank you for your time and suggestions, I've already tried starfish, but
> > not jmap. I'll check it out.
> > Thanks again,
> > Mark
> >
> > On Wed, Feb 29, 2012 at 1:17 PM, Charles Earl  >wrote:
> >
> >> I assume you have also just tried running locally and using the jdk
> >> performance tools (e.g. jmap) to gain insight by configuring hadoop to
> run
> >> absolute minimum number of tasks?
> >> Perhaps the discussion
> >>
> >>
> http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task
> >> might be relevant?
> >> On Feb 29, 2012, at 3:53 PM, Mark question wrote:
> >>
> >>> I've used hadoop profiling (.prof) to show the stack trace but it was
> >> hard
> >>> to follow. jConsole locally since I couldn't find a way to set a port
> >>> number to child processes when running them remotely. Linux commands
> >>> (top,/proc), showed me that the virtual memory is almost twice as my
> >>> physical which means swapping is happening which is what I'm trying to
> >>> avoid.
> >>>
> >>> So basically, is there a way to assign a port to child processes to
> >> monitor
> >>> them remotely (asked before by Xun) or would you recommend another
> >>> monitoring tool?
> >>>
> >>> Thank you,
> >>> Mark
> >>>
> >>>
> >>> On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl <
> charles.ce...@gmail.com
> >>> wrote:
> >>>
>  Mark,
>  So if I understand, it is more the memory management that you are
>  interested in, rather than a need to run an existing C or C++
> >> application
>  in MapReduce platform?
>  Have you done profiling of the application?
>  C
>  On Feb 29, 2012, at 2:19 PM, Mark question wrote:
> 
> > Thanks Charles .. I'm running Hadoop for research to perform
> duplicate
> > detection methods. To go deeper, I need to understand what's slowing
> my
> > program, which usually starts with analyzing memory to predict best
> >> input
> > size for map task. So you're saying piping can help me control memory
>  even
> > though it's running on VM eventually?
> >
> > Thanks,
> > Mark
> >
> > On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl <
> >> charles.ce...@gmail.com
> > wrote:
> >
> >> Mark,
> >> Both streaming and pipes allow this, perhaps more so pipes at the
> >> level
>  of
> >> the mapreduce task. Can you provide more details on the application?
> >> On Feb 29, 2012, at 1:56 PM, Mark question wrote:
> >>
> >>> Hi guys, thought I should ask this before I use it ... will using C
>  over
> >>> Hadoop give me the usual C memory management? For example,
> malloc() ,
> >>> sizeof() ? My guess is no since this all will eventually be turned
> >> into
> >>> bytecode, but I need more control on memory which obviously is hard
> >> for
> >> me
> >>> to do with Java.
> >>>
> >>> Let me know of any advantages you know about streaming in C over
>  hadoop.
> >>> Thank you,
> >>> Mark
> >>
> >>
> 
> 
> >>
> >>
>
>


Re: High quality hadoop logo?

2012-03-01 Thread Owen O'Malley
On Thu, Mar 1, 2012 at 2:14 PM, Keith Wiley  wrote:
> Sorry, false alarm.  I was looking at the popup thumbnails in google image 
> search.  If I click all the way through, there are some high quality
> versions available.  Why is the version on the Apache site (and the Wikipedia 
> page) so poor?

The high resolution images are in subversion:

http://svn.apache.org/repos/asf/hadoop/logos/

-- Owen


Re: High quality hadoop logo?

2012-03-01 Thread Keith Wiley
Sorry, false alarm.  I was looking at the popup thumbnails in google image 
search.  If I click all the way through, there are some high quality versions 
available.  Why is the version on the Apache site (and the Wikipedia page) so 
poor?

On Mar 1, 2012, at 14:09 , Keith Wiley wrote:

> Is there a high quality version of the hadoop logo anywhere?  Even the 
> graphic presented on the Apache page itself suffers from dreadful jpeg 
> artifacting.  A google image search didn't inspire much hope on this issue 
> (they all have the same low-quality jpeg appearance).  I'm looking for good 
> graphics for slides, presentations, publications, etc.
> 
> Thanks.
> 
> 
> Keith Wiley kwi...@keithwiley.com keithwiley.com
> music.keithwiley.com
> 
> "You can scratch an itch, but you can't itch a scratch. Furthermore, an itch 
> can
> itch but a scratch can't scratch. Finally, a scratch can itch, but an itch 
> can't
> scratch. All together this implies: He scratched the itch from the scratch 
> that
> itched but would never itch the scratch from the itch that scratched."
>   --  Keith Wiley
> 
> 



Keith Wiley kwi...@keithwiley.com keithwiley.commusic.keithwiley.com

"Yet mark his perfect self-contentment, and hence learn his lesson, that to be
self-contented is to be vile and ignorant, and that to aspire is better than to
be blindly and impotently happy."
   --  Edwin A. Abbott, Flatland




High quality hadoop logo?

2012-03-01 Thread Keith Wiley
Is there a high quality version of the hadoop logo anywhere?  Even the graphic 
presented on the Apache page itself suffers from dreadful jpeg artifacting.  A 
google image search didn't inspire much hope on this issue (they all have the 
same low-quality jpeg appearance).  I'm looking for good graphics for slides, 
presentations, publications, etc.

Thanks.


Keith Wiley kwi...@keithwiley.com keithwiley.commusic.keithwiley.com

"You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched."
   --  Keith Wiley




kill -QUIT

2012-03-01 Thread Mohit Anchlia
When I try kill -QUIT for a job it doesn't send the stacktrace to the log
files. Does anyone know why or if I am doing something wrong?

I find the job using ps -ef|grep "attempt". I then go to
logs/userLogs/job/attempt/
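
For what it's worth, the thread dump from kill -QUIT goes to the task JVM's
stdout rather than to the syslog file, so with the default layout it should
show up somewhere like this (the job/attempt ids are placeholders):

  kill -QUIT <child-jvm-pid>
  less logs/userlogs/job_201203011200_0001/attempt_201203011200_0001_m_000000_0/stdout

  # or skip the signal and dump the threads directly
  jstack <child-jvm-pid> > /tmp/task-threads.txt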


Re: fairscheduler : group.name doesn't work, please help

2012-03-01 Thread Harsh J
The group.name scheduler support was introduced in
https://issues.apache.org/jira/browse/HADOOP-3892 but may have been
broken by the security changes present in 0.20.205. You'll need the
fix presented in  https://issues.apache.org/jira/browse/MAPREDUCE-2457
to have group.name support.

On Thu, Mar 1, 2012 at 6:42 PM, Austin Chungath  wrote:
>  I am running fair scheduler on hadoop 0.20.205.0
>
> http://hadoop.apache.org/common/docs/r0.20.205.0/fair_scheduler.html
> The above page talks about the following property
>
> *mapred.fairscheduler.poolnameproperty*
> **
> which I can set to *group.name*
> The default is user.name and when a user submits a job the fair scheduler
> assigns each user's job to a pool which has the name of the user.
> I am trying to change it to group.name so that the job is submitted to a
> pool which has the name of the user's linux group. Thus all jobs from any
> user from a specific group go to the same pool instead of an individual
> pool for every user.
> But *group.name* doesn't seem to work, has anyone tried this before?
>
> *user.name* and *mapred.job.queue.name* works. Is group.name supported in
> 0.20.205.0 because I don't see it mentioned in the docs?
>
> Thanks,
> Austin



-- 
Harsh J


Re: Hadoop fair scheduler doubt: allocate jobs to pool

2012-03-01 Thread Merto Mertek
I think that ${user.name} variable is obtained from system proprietes
class,
where I can not find the group.name propriety, so probably it is not
possible to create pools depending on the user group, despite in the
document is mentioned that is possible..

Correct me if I am wrong and let us know if you solve it..



On 1 March 2012 17:30, Austin Chungath  wrote:

> Hi,
> I tried what you had said. I added the following to mapred-site.xml:
>
>
> <property>
>   <name>mapred.fairscheduler.poolnameproperty</name>
>   <value>pool.name</value>
> </property>
>
> <property>
>   <name>pool.name</name>
>   <value>${mapreduce.job.group.name}</value>
> </property>
>
> Funny enough it created a pool with the name "${mapreduce.job.group.name}"
> so I tried ${mapred.job.group.name} and ${group.name} all to the same
> effect.
>
> But when I did ${user.name} it worked! and created a pool with the user
> name.
>
>
>
> On Thu, Mar 1, 2012 at 8:03 PM, Merto Mertek  wrote:
>
> > From the fairscheduler docs I assume the following should work:
> >
> > <property>
> >   <name>mapred.fairscheduler.poolnameproperty</name>
> >   <value>pool.name</value>
> > </property>
> >
> > <property>
> >   <name>pool.name</name>
> >   <value>${mapreduce.job.group.name}</value>
> > </property>
> >
> > which means that the default pool will be the group of the user that has
> > submitted the job. In your case I think that allocations.xml is correct.
> If
> > you want to explicitly define a job to specific pool from your
> > allocation.xml file you can define it as follows:
> >
> > Configuration conf3 = conf;
> > conf3.set("pool.name", "pool3"); // conf.set(propriety.name, value)
> >
> > Let me know if it works..
> >
> >
> > On 29 February 2012 14:18, Austin Chungath  wrote:
> >
> > > How can I set the fair scheduler such that all jobs submitted from a
> > > particular user group go to a pool with the group name?
> > >
> > > I have setup fair scheduler and I have two users: A and B (belonging to
> > the
> > > user group hadoop)
> > >
> > > When these users submit hadoop jobs, the jobs from A got to a pool
> named
> > A
> > > and the jobs from B go to a pool named B.
> > >  I want them to go to a pool with their group name, So I tried adding
> the
> > > following to mapred-site.xml:
> > >
> > > <property>
> > >   <name>mapred.fairscheduler.poolnameproperty</name>
> > >   <value>group.name</value>
> > > </property>
> > >
> > > But instead the jobs now go to the default pool.
> > > I want the jobs submitted by A and B to go to the pool named "hadoop".
> > How
> > > do I do that?
> > > also how can I explicity set a job to any specified pool?
> > >
> > > I have set the allocation file (fair-scheduler.xml) like this:
> > >
> > > 
> > >  
> > >1
> > >1
> > >3
> > >3
> > >  
> > >  5
> > > 
> > >
> > > Any help is greatly appreciated.
> > > Thanks,
> > > Austin
> > >
> >
>


Re: Hadoop fair scheduler doubt: allocate jobs to pool

2012-03-01 Thread Austin Chungath
Hi,
I tried what you had said. I added the following to mapred-site.xml:



<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>pool.name</value>
</property>

<property>
  <name>pool.name</name>
  <value>${mapreduce.job.group.name}</value>
</property>

Funnily enough, it created a pool with the literal name
"${mapreduce.job.group.name}", so I tried ${mapred.job.group.name} and
${group.name}, all to the same effect.

But when I did ${user.name} it worked! and created a pool with the user
name.



On Thu, Mar 1, 2012 at 8:03 PM, Merto Mertek  wrote:

> From the fairscheduler docs I assume the following should work:
>
> <property>
>   <name>mapred.fairscheduler.poolnameproperty</name>
>   <value>pool.name</value>
> </property>
>
> <property>
>   <name>pool.name</name>
>   <value>${mapreduce.job.group.name}</value>
> </property>
>
> which means that the default pool will be the group of the user that has
> submitted the job. In your case I think that allocations.xml is correct. If
> you want to explicitly define a job to specific pool from your
> allocation.xml file you can define it as follows:
>
> Configuration conf3 = conf;
> conf3.set("pool.name", "pool3"); // conf.set(propriety.name, value)
>
> Let me know if it works..
>
>
> On 29 February 2012 14:18, Austin Chungath  wrote:
>
> > How can I set the fair scheduler such that all jobs submitted from a
> > particular user group go to a pool with the group name?
> >
> > I have setup fair scheduler and I have two users: A and B (belonging to
> the
> > user group hadoop)
> >
> > When these users submit hadoop jobs, the jobs from A got to a pool named
> A
> > and the jobs from B go to a pool named B.
> >  I want them to go to a pool with their group name, So I tried adding the
> > following to mapred-site.xml:
> >
> > <property>
> >   <name>mapred.fairscheduler.poolnameproperty</name>
> >   <value>group.name</value>
> > </property>
> >
> > But instead the jobs now go to the default pool.
> > I want the jobs submitted by A and B to go to the pool named "hadoop".
> How
> > do I do that?
> > also how can I explicity set a job to any specified pool?
> >
> > I have set the allocation file (fair-scheduler.xml) like this:
> >
> > 
> >  
> >1
> >1
> >3
> >3
> >  
> >  5
> > 
> >
> > Any help is greatly appreciated.
> > Thanks,
> > Austin
> >
>


Re: Hadoop fair scheduler doubt: allocate jobs to pool

2012-03-01 Thread Austin Chungath
Thanks,
I will be trying the suggestions and will get back to you soon.

On Thu, Mar 1, 2012 at 8:09 PM, Dave Shine <
dave.sh...@channelintelligence.com> wrote:

> I've just started playing with the Fair Scheduler.  To specify the pool at
> job submission time you set the "mapred.fairscheduler.pool" property on the
> Job Conf to the name of the pool you want the job to use.
>
> Dave
>
>
> -Original Message-
> From: Merto Mertek [mailto:masmer...@gmail.com]
> Sent: Thursday, March 01, 2012 9:33 AM
> To: common-user@hadoop.apache.org
> Subject: Re: Hadoop fair scheduler doubt: allocate jobs to pool
>
> From the fairscheduler docs I assume the following should work:
>
> <property>
>   <name>mapred.fairscheduler.poolnameproperty</name>
>   <value>pool.name</value>
> </property>
>
> <property>
>   <name>pool.name</name>
>   <value>${mapreduce.job.group.name}</value>
> </property>
>
> which means that the default pool will be the group of the user that has
> submitted the job. In your case I think that allocations.xml is correct. If
> you want to explicitly define a job to specific pool from your
> allocation.xml file you can define it as follows:
>
> Configuration conf3 = conf;
> conf3.set("pool.name", "pool3"); // conf.set(propriety.name, value)
>
> Let me know if it works..
>
>
> On 29 February 2012 14:18, Austin Chungath  wrote:
>
> > How can I set the fair scheduler such that all jobs submitted from a
> > particular user group go to a pool with the group name?
> >
> > I have setup fair scheduler and I have two users: A and B (belonging
> > to the user group hadoop)
> >
> > When these users submit hadoop jobs, the jobs from A got to a pool
> > named A and the jobs from B go to a pool named B.
> >  I want them to go to a pool with their group name, So I tried adding
> > the following to mapred-site.xml:
> >
> > <property>
> >   <name>mapred.fairscheduler.poolnameproperty</name>
> >   <value>group.name</value>
> > </property>
> >
> > But instead the jobs now go to the default pool.
> > I want the jobs submitted by A and B to go to the pool named "hadoop".
> > How do I do that?
> > also how can I explicity set a job to any specified pool?
> >
> > I have set the allocation file (fair-scheduler.xml) like this:
> >
> > 
> >  
> >1
> >1
> >3
> >3
> >  
> >  5
> > 
> >
> > Any help is greatly appreciated.
> > Thanks,
> > Austin
> >
>
> The information contained in this email message is considered confidential
> and proprietary to the sender and is intended solely for review and use by
> the named recipient. Any unauthorized review, use or distribution is
> strictly prohibited. If you have received this message in error, please
> advise the sender by reply email and delete the message.
>


RE: Hadoop fair scheduler doubt: allocate jobs to pool

2012-03-01 Thread Dave Shine
I've just started playing with the Fair Scheduler.  To specify the pool at job 
submission time you set the "mapred.fairscheduler.pool" property on the Job 
Conf to the name of the pool you want the job to use.
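
For example, from the command line that is just (assuming the driver uses
ToolRunner so that -D options are honoured, and "hadoop" is the pool name you
want):

  hadoop jar myjob.jar MyDriver -D mapred.fairscheduler.pool=hadoop /in /out

or, equivalently, conf.set("mapred.fairscheduler.pool", "hadoop") in the
driver itself.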

Dave


-Original Message-
From: Merto Mertek [mailto:masmer...@gmail.com]
Sent: Thursday, March 01, 2012 9:33 AM
To: common-user@hadoop.apache.org
Subject: Re: Hadoop fair scheduler doubt: allocate jobs to pool

From the fairscheduler docs I assume the following should work:


<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>pool.name</value>
</property>

<property>
  <name>pool.name</name>
  <value>${mapreduce.job.group.name}</value>
</property>

which means that the default pool will be the group of the user that has 
submitted the job. In your case I think that allocations.xml is correct. If you 
want to explicitly define a job to specific pool from your allocation.xml file 
you can define it as follows:

Configuration conf3 = conf;
conf3.set("pool.name", "pool3"); // conf.set(propriety.name, value)

Let me know if it works..


On 29 February 2012 14:18, Austin Chungath  wrote:

> How can I set the fair scheduler such that all jobs submitted from a
> particular user group go to a pool with the group name?
>
> I have setup fair scheduler and I have two users: A and B (belonging
> to the user group hadoop)
>
> When these users submit hadoop jobs, the jobs from A got to a pool
> named A and the jobs from B go to a pool named B.
>  I want them to go to a pool with their group name, So I tried adding
> the following to mapred-site.xml:
>
> <property>
>   <name>mapred.fairscheduler.poolnameproperty</name>
>   <value>group.name</value>
> </property>
>
> But instead the jobs now go to the default pool.
> I want the jobs submitted by A and B to go to the pool named "hadoop".
> How do I do that?
> also how can I explicity set a job to any specified pool?
>
> I have set the allocation file (fair-scheduler.xml) like this:
>
> 
>  
>1
>1
>3
>3
>  
>  5
> 
>
> Any help is greatly appreciated.
> Thanks,
> Austin
>

The information contained in this email message is considered confidential and 
proprietary to the sender and is intended solely for review and use by the 
named recipient. Any unauthorized review, use or distribution is strictly 
prohibited. If you have received this message in error, please advise the 
sender by reply email and delete the message.


Re: Hadoop fair scheduler doubt: allocate jobs to pool

2012-03-01 Thread Merto Mertek
From the fairscheduler docs I assume the following should work:


<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>pool.name</value>
</property>

<property>
  <name>pool.name</name>
  <value>${mapreduce.job.group.name}</value>
</property>

which means that the default pool will be the group of the user that has
submitted the job. In your case I think that allocations.xml is correct. If
you want to explicitly define a job to specific pool from your
allocation.xml file you can define it as follows:

Configuration conf3 = conf;
conf3.set("pool.name", "pool3"); // conf.set(propriety.name, value)

Let me know if it works..


On 29 February 2012 14:18, Austin Chungath  wrote:

> How can I set the fair scheduler such that all jobs submitted from a
> particular user group go to a pool with the group name?
>
> I have setup fair scheduler and I have two users: A and B (belonging to the
> user group hadoop)
>
> When these users submit hadoop jobs, the jobs from A got to a pool named A
> and the jobs from B go to a pool named B.
>  I want them to go to a pool with their group name, So I tried adding the
> following to mapred-site.xml:
>
> <property>
>   <name>mapred.fairscheduler.poolnameproperty</name>
>   <value>group.name</value>
> </property>
>
> But instead the jobs now go to the default pool.
> I want the jobs submitted by A and B to go to the pool named "hadoop". How
> do I do that?
> also how can I explicity set a job to any specified pool?
>
> I have set the allocation file (fair-scheduler.xml) like this:
>
> 
>  
>1
>1
>3
>3
>  
>  5
> 
>
> Any help is greatly appreciated.
> Thanks,
> Austin
>


fairscheduler : group.name doesn't work, please help

2012-03-01 Thread Austin Chungath
 I am running fair scheduler on hadoop 0.20.205.0

http://hadoop.apache.org/common/docs/r0.20.205.0/fair_scheduler.html
The above page talks about the following property

*mapred.fairscheduler.poolnameproperty*
which I can set to *group.name*
The default is user.name and when a user submits a job the fair scheduler
assigns each user's job to a pool which has the name of the user.
I am trying to change it to group.name so that the job is submitted to a
pool which has the name of the user's linux group. Thus all jobs from any
user from a specific group go to the same pool instead of an individual
pool for every user.
But *group.name* doesn't seem to work, has anyone tried this before?

*user.name* and *mapred.job.queue.name* works. Is group.name supported in
0.20.205.0 because I don't see it mentioned in the docs?

Thanks,
Austin


Re: Should splittable Gzip be a "core" hadoop feature?

2012-03-01 Thread Michel Segel

 I do agree that a GitHub project is the way to go unless you could convince
Cloudera, HortonWorks or MapR to pick it up and support it.  They have enough
committers 

Is this potentially worthwhile? Maybe; it depends on how the cluster is
integrated into the overall environment. Companies that have standardized on
using gzip would find it useful.



Sent from a remote device. Please excuse any typos...

Mike Segel

On Feb 29, 2012, at 3:17 PM, Niels Basjes  wrote:

> Hi,
> 
> On Wed, Feb 29, 2012 at 19:13, Robert Evans  wrote:
> 
> 
>> What I really want to know is how well does this new CompressionCodec
>> perform in comparison to the regular gzip codec in
> 
> various different conditions and what type of impact does it have on
>> network traffic and datanode load.  My gut feeling is that
> 
> the speedup is going to be relatively small except when there is a lot of
>> computation happening in the mapper
> 
> 
> I agree, I made the same assessment.
> In the javadoc I wrote under "When is this useful?"
> *"Assume you have a heavy map phase for which the input is a 1GiB Apache
> httpd logfile. Now assume this map takes 60 minutes of CPU time to run."*
> 
> 
>> and the added load and network traffic outweighs the speedup in most
>> cases,
> 
> 
> No, the trick to solve that one is to upload the gzipped files with a HDFS
> blocksize equal (or 1 byte larger) than the filesize.
> This setting will help in speeding up Gzipped input files in any situation
> (no more network overhead).
> From there the HDFS file replication factor of the file dictates the
> optimal number of splits for this codec.
> 
> 
>> but like all performance on a complex system gut feelings are
> 
> almost worthless and hard numbers are what is needed to make a judgment
>> call.
> 
> 
> Yes
> 
> 
>> Niels, I assume you have tested this on your cluster(s).  Can you share
>> with us some of the numbers?
>> 
> 
> No I haven't tested it beyond a multiple core system.
> The simple reason for that is that when this was under review last summer
> the whole "Yarn" thing happened
> and I was unable to run it at all for a long time.
> I only got it running again last december when the restructuring of the
> source tree was mostly done.
> 
> At this moment I'm building a experimentation setup at work that can be
> used for various things.
> Given the current state of Hadoop 2.0 I think it's time to produce some
> actual results.
> 
> -- 
> Best regards / Met vriendelijke groeten,
> 
> Niels Basjes
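
The one-block upload trick described above is easy to try from the shell;
something along these lines (the block size has to be a multiple of the
checksum chunk size, so round the file size up):

  SIZE=$(stat -c%s access.log.gz)
  BLOCK=$(( (SIZE / 512 + 1) * 512 ))
  hadoop fs -D dfs.block.size=$BLOCK -put access.log.gz /logs/

That keeps each gzipped file in a single block, so the extra reads the
splittable codec makes stay on the nodes holding the replicas, which is the
point made above about the replication factor dictating the number of splits.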


Re: Streaming Hadoop using C

2012-03-01 Thread Charles Earl
How was your experience of starfish?
C
On Mar 1, 2012, at 12:35 AM, Mark question wrote:

> Thank you for your time and suggestions, I've already tried starfish, but
> not jmap. I'll check it out.
> Thanks again,
> Mark
> 
> On Wed, Feb 29, 2012 at 1:17 PM, Charles Earl wrote:
> 
>> I assume you have also just tried running locally and using the jdk
>> performance tools (e.g. jmap) to gain insight by configuring hadoop to run
>> absolute minimum number of tasks?
>> Perhaps the discussion
>> 
>> http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task
>> might be relevant?
>> On Feb 29, 2012, at 3:53 PM, Mark question wrote:
>> 
>>> I've used hadoop profiling (.prof) to show the stack trace but it was
>> hard
>>> to follow. jConsole locally since I couldn't find a way to set a port
>>> number to child processes when running them remotely. Linux commands
>>> (top,/proc), showed me that the virtual memory is almost twice as my
>>> physical which means swapping is happening which is what I'm trying to
>>> avoid.
>>> 
>>> So basically, is there a way to assign a port to child processes to
>> monitor
>>> them remotely (asked before by Xun) or would you recommend another
>>> monitoring tool?
>>> 
>>> Thank you,
>>> Mark
>>> 
>>> 
>>> On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl >> wrote:
>>> 
 Mark,
 So if I understand, it is more the memory management that you are
 interested in, rather than a need to run an existing C or C++
>> application
 in MapReduce platform?
 Have you done profiling of the application?
 C
 On Feb 29, 2012, at 2:19 PM, Mark question wrote:
 
> Thanks Charles .. I'm running Hadoop for research to perform duplicate
> detection methods. To go deeper, I need to understand what's slowing my
> program, which usually starts with analyzing memory to predict best
>> input
> size for map task. So you're saying piping can help me control memory
 even
> though it's running on VM eventually?
> 
> Thanks,
> Mark
> 
> On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl <
>> charles.ce...@gmail.com
> wrote:
> 
>> Mark,
>> Both streaming and pipes allow this, perhaps more so pipes at the
>> level
 of
>> the mapreduce task. Can you provide more details on the application?
>> On Feb 29, 2012, at 1:56 PM, Mark question wrote:
>> 
>>> Hi guys, thought I should ask this before I use it ... will using C
 over
>>> Hadoop give me the usual C memory management? For example, malloc() ,
>>> sizeof() ? My guess is no since this all will eventually be turned
>> into
>>> bytecode, but I need more control on memory which obviously is hard
>> for
>> me
>>> to do with Java.
>>> 
>>> Let me know of any advantages you know about streaming in C over
 hadoop.
>>> Thank you,
>>> Mark
>> 
>> 
 
 
>> 
>> 



Distributed Indexing on MapReduce

2012-03-01 Thread Frank Scholten
Hi all,

I am looking into reusing some existing code for distributed indexing
to test a Mahout tool I am working on
https://issues.apache.org/jira/browse/MAHOUT-944

What I want is to index the Apache Public Mail Archives dataset (200G)
via MapReduce on Hadoop.

I have been going through the Nutch and contrib/index code and from my
understanding I have to:

* Create an InputFormat / RecordReader / InputSplit class for
splitting the e-mails across mappers
* Create a Mapper which emits the e-mails as key value pairs
* Create a Reducer which indexes the e-mails on the local filesystem
(or straight to HDFS?)
* Copy these indexes from local filesystem to HDFS. In the same Reducer?

I am unsure about the final steps: how to get to the end result, a bunch of
index shards on HDFS. It seems that each Reducer needs to be aware of the
directory it eventually writes to on HDFS, and I don't see how to get each
reducer to copy its shard there.

How do I set this up?

Cheers,

Frank


Re: "Browse the filesystem" weblink broken after upgrade to 1.0.0: HTTP 404 "Problem accessing /browseDirectory.jsp"

2012-03-01 Thread madhu phatak
On Wed, Feb 29, 2012 at 11:34 PM, W.P. McNeill  wrote:

> I can do perform HDFS operations from the command line like "hadoop fs -ls
> /". Doesn't that meant that the datanode is up?
>

  No. That is just a metadata lookup, which is served by the Namenode. Try to
cat some file with "hadoop fs -cat ". If you are able to get data back, then
the datanode should be up. Also make sure that HDFS is not in safemode. To
turn off safemode, use the hdfs command "hadoop dfsadmin -safemode leave" and
then restart the jobtracker and tasktracker.
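
A couple of other quick checks that show whether the datanodes are actually
registered and serving blocks:

  hadoop dfsadmin -report          # live/dead datanodes as the namenode sees them
  hadoop dfsadmin -safemode get    # whether the namenode is still in safe mode
  hadoop fsck /                    # missing or under-replicated blocks, if any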



-- 
Join me at http://hadoopworkshop.eventbrite.com/


How to configure SWIM

2012-03-01 Thread Arvind
Hi all,
Can anybody help me to configure SWIM (Statistical Workload Injector for
MapReduce) on my Hadoop cluster?