Re: Reg HttpFS
Manoj, please look at http://hadoop.apache.org/docs/r2.4.0/hadoop-hdfs-httpfs/httpfs-default.html and check the 'httpfs.authentication.*' properties. Thanks. On Sun, May 4, 2014 at 5:27 AM, Manoj Babu manoj...@gmail.com wrote: Hi, How do I access files in HDFS using HttpFS when it is protected by Kerberos? Kerberos authentication works only where it is configured, e.g., on the edge node. If I am triggering a request from another system, how do I authenticate? Kindly advise. Cheers! Manoj. -- Alejandro
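One common answer to Manoj's question (an assumption here, not stated in the thread) is to fetch a delegation token once from a Kerberized node and then present that token from other systems. A minimal sketch of the URLs involved; the host, port, renewer, and token values are hypothetical, and only the URL construction is exercised:

```java
// Sketch: HttpFS/WebHDFS delegation-token flow for hosts without Kerberos.
// Step 1 (from a Kerberized node, via SPNEGO): fetch a delegation token.
// Step 2 (from any host): pass the token in the 'delegation' query parameter.
public class HttpFsUrls {
    static String getTokenUrl(String host, int port, String renewer) {
        return String.format("http://%s:%d/webhdfs/v1/?op=GETDELEGATIONTOKEN&renewer=%s",
                host, port, renewer);
    }

    static String openWithToken(String host, int port, String path, String token) {
        return String.format("http://%s:%d/webhdfs/v1%s?op=OPEN&delegation=%s",
                host, port, path, token);
    }

    public static void main(String[] args) {
        // 14000 is the default HttpFS port; host, user, and token are made up.
        System.out.println(getTokenUrl("httpfs.example.com", 14000, "manoj"));
        System.out.println(openWithToken("httpfs.example.com", 14000,
                "/user/manoj/file.txt", "ABC123"));
    }
}
```

Requests carrying a valid `delegation` parameter do not need Kerberos credentials on the calling host, which is why the token only has to be obtained once from the edge node.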
Re: hadoop 2.0 upgrade to 2.4
Motty, https://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/CDH5-Installation-Guide.html provides instructions to upgrade from CDH4 to CDH5 (which bundles Hadoop 2.3.0). If your intention is to use CDH5, that should help you. If you have further questions about it, the right alias to use is cdh-u...@cloudera.org. If your intention is to use Apache Hadoop 2.4.0, some of the CDH documentation above may still be relevant. Thanks. On Thu, Apr 10, 2014 at 12:20 PM, motty cruz motty.c...@gmail.com wrote: Hi All, I currently have a Hadoop 2.0 cluster in production and I want to upgrade to the latest release. Current version: [root@doop1 ~]# hadoop version Hadoop 2.0.0-cdh4.6.0 The cluster has the following services: hbase, hive, hue, impala, mapreduce, oozie, sqoop, zookeeper. Can someone point me to a howto for upgrading from Hadoop 2.0 to Hadoop 2.4.0? Thanks in advance, -- Alejandro
Re: Data Locality and WebHDFS
I may have expressed myself wrong. You don't need to run any test to see how locality works with files of multiple blocks. If you are accessing a file of more than one block over webhdfs, you only have assured locality for the first block of the file. Thanks. On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling rnowl...@gmail.com wrote: Thank you, Mingjiang and Alejandro. This is interesting. Since we will use the data locality information for scheduling, we could hack this to get the data locality information, at least for the first block. As Alejandro says, we'd have to test what happens for other data blocks -- e.g., what if, knowing the block sizes, we request the second or third block? Interesting food for thought! I see some experiments in my future! Thanks! On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur t...@cloudera.com wrote: well, this is for the first block of the file; the rest of the file (blocks being local or not) is streamed out by the same datanode. for small files (one block) you'll get locality, for large files only the first block, and by chance if other blocks are local to that datanode. Alejandro (phone typing) On Mar 16, 2014, at 18:53, Mingjiang Shi m...@gopivotal.com wrote: According to this page: http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/ *Data Locality*: The file read and file write calls are redirected to the corresponding datanodes. It uses the full bandwidth of the Hadoop cluster for streaming data. *An HDFS Built-in Component*: WebHDFS is a first class built-in component of HDFS. It runs inside Namenodes and Datanodes, therefore it can use all HDFS functionalities. It is a part of HDFS - there are no additional servers to install. So it looks like data locality is built into webhdfs; the client will be redirected to the data node automatically. 
On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling rnowl...@gmail.com wrote: Hi all, I'm writing up a Google Summer of Code proposal to add HDFS support to Disco, an Erlang MapReduce framework. We're interested in using WebHDFS. I have two questions: 1) Does WebHDFS allow querying data locality information? 2) If the data locality information is known, can data on specific data nodes be accessed via Web HDFS? Or do all Web HDFS requests have to go through a single server? Thanks, RJ -- em rnowl...@gmail.com c 954.496.2314 -- Cheers -MJ -- em rnowl...@gmail.com c 954.496.2314 -- Alejandro
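RJ's question about offsets can be made concrete with a little block arithmetic. A sketch; the host, port, and path are hypothetical, and 128 MB is the Hadoop 2.x default dfs.blocksize:

```java
// Sketch: which block a WebHDFS byte-range request starts in, and the OPEN
// URL that carries the offset. Per the thread, locality is only assured for
// the block the namenode's redirect targets.
public class WebHdfsOffsets {
    static long blockIndex(long offset, long blockSize) {
        return offset / blockSize; // block containing the first requested byte
    }

    static String openUrl(String host, int port, String path, long offset, long length) {
        return String.format("http://%s:%d/webhdfs/v1%s?op=OPEN&offset=%d&length=%d",
                host, port, path, offset, length);
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;
        System.out.println(blockIndex(0, blockSize));              // 0: first block
        System.out.println(blockIndex(blockSize + 1, blockSize));  // 1: starts in second block
        System.out.println(openUrl("nn.example.com", 50070, "/data/f.bin",
                blockSize + 1, 4096));
    }
}
```

The experiment the thread proposes is exactly this: issue an OPEN whose offset falls in the second block and observe which datanode the namenode redirects to.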
Re: Data Locality and WebHDFS
don't recall how skips are handled in webhdfs, but I would assume that you'll get to the first block as usual, and the skip is handled by the DN serving the file (as webhdfs does not know at open time that you'll skip). Alejandro (phone typing) On Mar 17, 2014, at 9:47, RJ Nowling rnowl...@gmail.com wrote: Hi Alejandro, The WebHDFS API allows specifying an offset and length for the request. If I specify an offset that starts in the second block of a file (thus skipping the first block altogether), will the namenode still direct me to a datanode with the first block, or will it direct me to a datanode with the second block? I.e., am I assured data locality only on the first block of the file (as you're saying) or on the first block I am accessing? If it is as you say, then I may want to reach out to the WebHDFS developers and see if they would be interested in the additional functionality. Thank you, RJ On Mon, Mar 17, 2014 at 2:40 AM, Alejandro Abdelnur t...@cloudera.com wrote: I may have expressed myself wrong. You don't need to run any test to see how locality works with files of multiple blocks. If you are accessing a file of more than one block over webhdfs, you only have assured locality for the first block of the file. Thanks. -- Alejandro
Re: Need YARN - Test job.
Gaurav, [BCCing user@h.a.o to move the thread out of it] It seems you are missing some step when reconfiguring your cluster in Cloudera Manager; you should not have to modify things by hand in your setup. Adding the Cloudera Manager user alias. Thanks. On Fri, Jan 3, 2014 at 7:11 AM, Gaurav Shankhdhar shankhdhar.gau...@gmail.com wrote: Yes, I am setting it but it still hangs there. Also there are no failure logs; it just hangs without erroring out. Any log location you want me to look at? On Friday, January 3, 2014 8:38:16 PM UTC+5:30, Gunnar Tapper wrote: Hi Gaurav, What are you setting HADOOP_CONF_DIR to? IME, you get the hang if you don't set it as: export HADOOP_CONF_DIR=/etc/hadoop/conf.cloudera.yarn1 Gunnar On Fri, Jan 3, 2014 at 6:47 AM, Gaurav Shankhdhar shankhdh...@gmail.com wrote: Folks, I am trying to run the teragen program using the YARN framework in CDH 4.5, but no luck. What I have figured out till now: 1. By default the Demo VM runs MR in MRv1, using the JobTracker and TaskTracker. 2. You can run either MRv1 or MRv2, but not both. 3. I disabled MRv1 and configured YARN by increasing the alternatives priority for YARN, and deployed the Client Configuration (using Cloudera Manager). 4. Submitted the teragen program and can see the YARN framework in action from the Web UIs. 5. But it hangs at Map 0% and Reduce 0%. Tried multiple options to no avail. Any step-by-step guide or doc to execute MR in MRv2 (YARN) mode in CDH 4.5? Attached doc with screenshots. Reference: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/latest/Cloudera-Manager-Managing-Clusters/cmmc_adding_YARN_MRv2.html Regards Gaurav On Friday, January 3, 2014 3:33:44 AM UTC+5:30, Dhanasekaran Anbalagan wrote: Hi Guys, we recently installed CDH5 on our test cluster. We need a test job for the YARN framework. We are able to run a mapreduce job successfully on YARN without code changes, but we need to test YARN job functionality. Can you please guide me? 
We tried https://github.com/hortonworks/simple-yarn-app but it didn't help us. tech@dvcloudlab231:~$ hadoop jar simple-yarn-app-1.0-SNAPSHOT.jar com.hortonworks.simpleyarnapp.Client /bin/date 2 /apps/simple/simple-yarn-app-1.0-SNAPSHOT.jar 14/01/02 16:49:05 INFO client.RMProxy: Connecting to ResourceManager at dvcloudlab231/192.168.70.231:8032 Submitting application application_1388687890867_0007 14/01/02 16:49:05 INFO impl.YarnClientImpl: Submitted application application_1388687890867_0007 to ResourceManager at dvcloudlab231/192.168.70.231:8032 Application application_1388687890867_0007 finished with state FAILED at 1388699348344 Note: On the Resource Manager node I don't see any container logs. -Dhanasekaran. Did I learn something today? If not, I wasted it. -- --- You received this message because you are subscribed to the Google Groups CDH Users group. To unsubscribe from this group and stop receiving emails from it, send an email to cdh-user+u...@cloudera.org. For more options, visit https://groups.google.com/a/cloudera.org/groups/opt_out. -- Thanks, Gunnar *If you think you can you can, if you think you can't you're right.* -- Alejandro
Re: Time taken for starting AMRMClientAsync
Hi Krishna, Are you starting all AMs from the same JVM? Mind sharing the code you are using for your time testing? Thx On Thu, Nov 21, 2013 at 6:11 AM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote: Hi Alejandro, I have modified the code in hadoop-2.2.0-src/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/src/main/java/org/apache/hadoop/yarn/applications/unmanagedamlauncher/UnmanagedAMLauncher.java to submit multiple application masters one after another, and I am still seeing 800 to 900 ms being taken by the start() call on AMRMClientAsync in all of those applications. Please suggest if you think I am missing something else. Thanks, Kishore On Tue, Nov 19, 2013 at 6:07 PM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote: Hi Alejandro, I don't know what managed and unmanaged AMs are; can you please explain the difference and how each of them is launched? I tried to google these terms and came across hadoop-yarn-applications-unmanaged-am-launcher-2.2.0.jar, is it related to that? Thanks, Kishore On Tue, Nov 19, 2013 at 12:15 AM, Alejandro Abdelnur t...@cloudera.com wrote: Kishore, Also, please specify if you are using managed or unmanaged AMs (the numbers I've mentioned before are using unmanaged AMs). thx On Sun, Nov 17, 2013 at 11:16 AM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: It is just creating a connection to the RM and shouldn't take that long. Can you please file a ticket so that we can look at it? JVM class loading overhead is one possibility but 1 sec is a bit too much. Thanks, +Vinod On Oct 21, 2013, at 7:16 AM, Krishna Kishore Bonagiri wrote: Hi, I am seeing the following call to start() on AMRMClientAsync taking from 0.9 to 1 second. Why does it take that long? Is there a way to reduce it, I mean does it depend on any of the interval parameters or so in the configuration files? 
I have tried reducing the value of the first argument below from 1000 to 100 milliseconds also, but that doesn't help. AMRMClientAsync.CallbackHandler allocListener = new RMCallbackHandler(); amRMClient = AMRMClientAsync.createAMRMClientAsync(1000, allocListener); amRMClient.init(conf); amRMClient.start(); Thanks, Kishore CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You. -- Alejandro
Re: Time taken for starting AMRMClientAsync
Krishna, Well, it all depends on your use case. In the case of Llama, Llama is a server that hosts multiple unmanaged AMs, thus all AMs run in the same process. Thanks. On Mon, Nov 25, 2013 at 6:40 PM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote: Hi Alejandro, I don't start all the AMs from the same JVM. How can I do that? Also, would that save me the time taken to get an AM started? That is also something in which I would like to see an improvement. And, would this also save me the time taken for connecting from the AM to the Resource Manager? Thanks, Kishore On Tue, Nov 26, 2013 at 3:45 AM, Alejandro Abdelnur t...@cloudera.com wrote: Hi Krishna, Are you starting all AMs from the same JVM? Mind sharing the code you are using for your time testing? Thx -- Alejandro
Re: Time taken for starting AMRMClientAsync
Kishore, Also, please specify if you are using managed or unmanaged AMs (the numbers I've mentioned before are using unmanaged AMs). thx On Sun, Nov 17, 2013 at 11:16 AM, Vinod Kumar Vavilapalli vino...@hortonworks.com wrote: It is just creating a connection to the RM and shouldn't take that long. Can you please file a ticket so that we can look at it? JVM class loading overhead is one possibility but 1 sec is a bit too much. Thanks, +Vinod On Oct 21, 2013, at 7:16 AM, Krishna Kishore Bonagiri wrote: Hi, I am seeing the following call to start() on AMRMClientAsync taking from 0.9 to 1 second. Why does it take that long? Is there a way to reduce it, I mean does it depend on any of the interval parameters or so in the configuration files? I have tried reducing the value of the first argument below from 1000 to 100 milliseconds also, but that doesn't help. AMRMClientAsync.CallbackHandler allocListener = new RMCallbackHandler(); amRMClient = AMRMClientAsync.createAMRMClientAsync(1000, allocListener); amRMClient.init(conf); amRMClient.start(); Thanks, Kishore -- Alejandro
Re: Time taken for starting AMRMClientAsync
Hi Krishna, Those 900 ms seem consistent with the numbers we found while doing some benchmarks in the context of Llama: http://cloudera.github.io/llama/ We found that the first application master created from a client process takes around 900 ms to be ready to submit resource requests. Subsequent application masters created from the same client process take a mean of 20 ms. The application master submission throughput (discarding the first submission) tops out at approximately 100 application masters per second. I believe there is room for improvement there. Cheers On Mon, Oct 21, 2013 at 7:16 AM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote: Hi, I am seeing the following call to start() on AMRMClientAsync taking from 0.9 to 1 second. Why does it take that long? Is there a way to reduce it, I mean does it depend on any of the interval parameters or so in the configuration files? I have tried reducing the value of the first argument below from 1000 to 100 milliseconds also, but that doesn't help. AMRMClientAsync.CallbackHandler allocListener = new RMCallbackHandler(); amRMClient = AMRMClientAsync.createAMRMClientAsync(1000, allocListener); amRMClient.init(conf); amRMClient.start(); Thanks, Kishore -- Alejandro
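A sketch of the kind of timing harness being asked about in this thread; the AMRMClientAsync calls are commented out since they need a live ResourceManager, so only the measurement helper runs here:

```java
import java.util.concurrent.TimeUnit;

// Sketch: measure AMRMClientAsync.start() latency across successive AMs in
// one JVM. The commented YARN calls mirror the snippet quoted in the thread;
// per the Llama numbers above, only the first start() should be slow.
public class StartTimer {
    static long measureMs(Runnable r) {
        long t0 = System.nanoTime();
        r.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - t0);
    }

    public static void main(String[] args) {
        long ms = measureMs(() -> {
            // amRMClient = AMRMClientAsync.createAMRMClientAsync(1000, allocListener);
            // amRMClient.init(conf);
            // amRMClient.start();
        });
        System.out.println("start() took " + ms + " ms");
    }
}
```

Wrapping each AM's create/init/start in measureMs and printing the per-AM numbers would show whether the 800-900 ms cost repeats or only hits the first AM in the process.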
Re: Cloudera links and Document
Satish, the right alias for Cloudera Manager questions is scm-us...@cloudera.org. Thanks On Thu, Jul 11, 2013 at 9:20 AM, Suresh Srinivas sur...@hortonworks.com wrote: Sathish, this mailing list is for Apache Hadoop related questions. Please post questions related to other distributions to the appropriate vendor's mailing list. On Thu, Jul 11, 2013 at 6:28 AM, Sathish Kumar sa848...@gmail.com wrote: Hi All, Can anyone point me to a link or document that explains the below? How does Cloudera Manager work and handle the clusters (Agent and Master Server)? How does the Cloudera Manager process flow work? Where can I locate the Cloudera configuration files, with a brief explanation? Regards Sathish -- http://hortonworks.com/download/ -- Alejandro
Re: Requesting containers on a specific host
John, This feature will be available in the upcoming 2.1.0-beta release. The first release candidate (RC) has been cut, but it seems a new RC will be needed. The exact release date is still not known, but it should be soon. thanks. On Thu, Jul 4, 2013 at 2:43 PM, John Lilley john.lil...@redpoint.net wrote: Arun, I don't know how to interpret the release schedule from the JIRA. It says that the patch targets 2.1.0 and is checked into the trunk; does that mean it is likely to be rolled into the first Hadoop 2 GA, or will it have to wait for another cycle? Thanks, John From: Arun C Murthy [mailto:a...@hortonworks.com] Sent: Thursday, July 04, 2013 6:28 AM To: user@hadoop.apache.org Subject: Re: Requesting containers on a specific host To guarantee containers on a specific node you need to use the whitelist feature we added recently: https://issues.apache.org/jira/browse/YARN-398 Arun On Jul 4, 2013, at 3:14 AM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote: I could get containers on specific nodes using addContainerRequest() on AMRMClient. But there are issues with it. I have two nodes, node1 and node2, in my cluster. And my Application Master is trying to get 3 containers on node1, and 3 containers on node2, in that order. While trying to request on node1, it sometimes gives me those on node2, and vice versa. When I get a container on a different node than the one I need, I release it and make a fresh request. I am having to do that forever to get a container on the node I need. Though the node I am requesting has enough resources, why does it keep giving me containers on the other node? How can I make sure I get a container on the node I want? Note: I am using the default scheduler, i.e. the Capacity Scheduler. Thanks, Kishore On Fri, Jun 21, 2013 at 7:25 PM, Arun C Murthy a...@hortonworks.com wrote: Check if the hostname you are setting is the same as in the RM logs... On Jun 21, 2013, at 2:15 AM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote: Hi, I am trying to get a container on a specific host, using the setHostName() call on ResourceRequest, but I could not get anything allocated, whereas it works fine when I change the node name to *. I am working on a single node cluster, and I am giving the name of the single node I have in my cluster. Is there any specific format that I need to use for setHostName()? Why is it not working? Thanks, Kishore -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/ -- Alejandro
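The whitelist feature referenced in this thread (YARN-398, in 2.1.0-beta and later) is exposed through the relaxLocality flag on ContainerRequest. A cluster-dependent sketch, not runnable standalone; the node name, container sizes, and the surrounding amRMClient instance are assumptions for illustration:

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

// With relaxLocality=false the scheduler may only place the container on the
// listed nodes, instead of falling back to other nodes as Kishore observed.
Resource capability = Resource.newInstance(1024, 1);  // 1 GB, 1 vcore (assumed sizes)
Priority priority = Priority.newInstance(0);
ContainerRequest req = new ContainerRequest(
        capability,
        new String[] { "node1" },  // node whitelist (hypothetical hostname)
        null,                      // no rack preference
        priority,
        false);                    // relaxLocality: do not fall back off-node
amRMClient.addContainerRequest(req);
```

This avoids the release-and-re-request loop described above: rather than rejecting misplaced containers, the request itself forbids off-node placement.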
Re: WebHDFS - Jetty
John, you should look at the AuthenticationHandler interface in the hadoop-auth module; there are 2 implementations, Pseudo and Kerberos. Hope this helps On Mon, Jul 1, 2013 at 9:15 PM, John Strange j...@strangeness.org wrote: We have an existing java module that integrates our authentication system and works with Jetty, but I can't find the configuration files for Jetty, so I'm assuming the configuration is being passed and built on the fly by the data/name nodes when they start. The goal is to provide a way to authenticate users via the webHDFS interface. If anyone has any docs or a good place to start, it would be of help. Thanks, - John -- Alejandro
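A skeleton of a custom handler against the hadoop-auth interface mentioned above. The method names follow the Hadoop 2.x AuthenticationHandler contract as I recall it (verify against your Hadoop version's sources), and the actual credential check is left as a stub:

```java
import java.io.IOException;
import java.util.Properties;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.hadoop.security.authentication.client.AuthenticationException;
import org.apache.hadoop.security.authentication.server.AuthenticationHandler;
import org.apache.hadoop.security.authentication.server.AuthenticationToken;

// Sketch of a custom hadoop-auth handler; the class name is then referenced
// from the HTTP authentication 'type' property instead of simple/kerberos.
public class MyAuthHandler implements AuthenticationHandler {
    public static final String TYPE = "myauth";

    @Override public String getType() { return TYPE; }
    @Override public void init(Properties config) throws ServletException { }
    @Override public void destroy() { }

    @Override
    public boolean managementOperation(AuthenticationToken token,
            HttpServletRequest request, HttpServletResponse response)
            throws IOException, AuthenticationException {
        return true; // no special management operations in this sketch
    }

    @Override
    public AuthenticationToken authenticate(HttpServletRequest request,
            HttpServletResponse response) throws IOException, AuthenticationException {
        String user = request.getParameter("user"); // stub: call your auth system here
        if (user == null) {
            response.setStatus(HttpServletResponse.SC_UNAUTHORIZED);
            return null; // challenge the client; no token issued
        }
        return new AuthenticationToken(user, user, TYPE);
    }
}
```

Returning a non-null AuthenticationToken is what establishes the authenticated user for subsequent WebHDFS calls, so an integration like John's mostly lives inside authenticate().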
Re: Job end notification does not always work (Hadoop 2.x)
Devaraj, If you don't run the HS, once your jobs finish you cannot retrieve status/counters from them, from the Java API or Web UI. So I'd say for any practical usage, you need it. thx On Mon, Jun 24, 2013 at 8:42 PM, Devaraj k devara...@huawei.com wrote: It is not mandatory to have a running HS in the cluster. The user can still submit a job without the HS in the cluster, and the user may expect the Job/App End Notification. Thanks Devaraj k From: Alejandro Abdelnur [mailto:t...@cloudera.com] Sent: 24 June 2013 21:42 To: user@hadoop.apache.org Cc: user@hadoop.apache.org Subject: Re: Job end notification does not always work (Hadoop 2.x) if we ought to do this in a yarn service it should be the RM or the HS. the RM is, IMO, the natural fit. the HS would be a good choice if we are concerned about the extra work this would cause in the RM. the problem with the current HS is that it is MR specific; we should generalize it for diff AM types. thx Alejandro (phone typing) On Jun 23, 2013, at 23:28, Devaraj k devara...@huawei.com wrote: Even if we handle all the failure cases in the AM for Job End Notification, we may miss cases like an abrupt kill of the AM when it is in its last retry. If we choose the NM to give the notification, the RM again needs to identify which NM should give the end-notification, as we don't have any direct protocol between the AM and NM. I feel it would be better to move the End-Notification responsibility to the RM as a Yarn service, because it ensures 100% notification and is also useful for other types of applications. Thanks Devaraj K From: Ravi Prakash [mailto:ravi...@ymail.com] Sent: 23 June 2013 19:01 To: user@hadoop.apache.org Subject: Re: Job end notification does not always work (Hadoop 2.x) Hi Alejandro, Thanks for your reply! I was thinking more along the lines Prashant suggested, i.e. a failure during init() should still trigger an attempt to notify (by the AM). But now that you mention it, maybe we would be better off including this as a YARN feature after all (especially with all the new AMs being written). We could let the NM of the AM handle the notification burden, so that the RM doesn't get unduly taxed. Thoughts? Thanks Ravi -- From: Alejandro Abdelnur t...@cloudera.com To: common-u...@hadoop.apache.org user@hadoop.apache.org Sent: Saturday, June 22, 2013 7:37 PM Subject: Re: Job end notification does not always work (Hadoop 2.x) If the AM fails before doing the job end notification, at any stage of the execution and for whatever reason, the job end notification will never be delivered. There is no way to fix this unless the notification is done by a Yarn service. The 2 'candidate' services for doing this would be the RM and the HS. The job notification URL is in the job conf. The RM never sees the job conf; that rules the RM out unless we add, at AM registration time, the possibility to specify a callback URL. The HS has access to the job conf, but the HS is currently a 'passive' service. thx On Sat, Jun 22, 2013 at 3:48 PM, Arun C Murthy a...@hortonworks.com wrote: Prashanth, Please file a jira. One thing to be aware of - AMs get restarted a certain number of times for fault-tolerance - which means we can't just assume that failure of a single AM is equivalent to failure of the job. Only the ResourceManager is in the appropriate position to judge failure of the AM v/s failure of the job. hth, Arun On Jun 22, 2013, at 2:44 PM, Prashant Kommireddi prash1...@gmail.com wrote: Thanks Ravi. Well, in this case its a no-effort :) A failure of AM init should be considered a failure of the job? I looked at the code and best-effort makes sense with respect to retry logic etc. You make a good point that there would be no notification in case the AM OOMs, but I do feel an AM init failure should send a notification by other means. On Sat, Jun 22, 2013 at 2:38 PM, Ravi Prakash ravi...@ymail.com wrote: Hi Prashant, I would tend to agree with you. Although job-end notification is only a best-effort mechanism (i.e. we cannot always guarantee notification, for example when the AM OOMs), I agree with you that we can do more. If you feel strongly about this, please create a JIRA and possibly upload a patch. Thanks Ravi -- From: Prashant Kommireddi prash1...@gmail.com To: user@hadoop.apache.org user@hadoop.apache.org Sent: Thursday, June 20, 2013 9:45 PM Subject: Job end notification does not always work (Hadoop 2.x) Hello, I came across an issue that occurs with the job notification callbacks
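For reference, the notification URL discussed throughout this thread is carried per job in the configuration. A typical Hadoop 2.x setup looks like this; the callback host and the retry value are hypothetical:

```xml
<!-- mapred-site.xml or per-job configuration -->
<property>
  <name>mapreduce.job.end-notification.url</name>
  <!-- $jobId and $jobStatus are substituted by the framework -->
  <value>http://callback.example.com/notify?jobId=$jobId&amp;status=$jobStatus</value>
</property>
<property>
  <name>mapreduce.job.end-notification.max.attempts</name>
  <value>5</value>
</property>
```

Because this URL lives in the job conf, Alejandro's point follows: the RM never sees it, which is what makes the RM-as-notifier proposal require a new registration-time callback mechanism.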
Re: Job end notification does not always work (Hadoop 2.x)
Devaraj, if a job can finish but you cannot determine it status after it ended, then the system is not usable. Thus, HS is a required component. thx On Tue, Jun 25, 2013 at 6:11 AM, Devaraj k devara...@huawei.com wrote: I agree, for getting status/counters we need HS. I mean Job can finish without HS also. ** ** Thanks Devaraj k ** ** *From:* Alejandro Abdelnur [mailto:t...@cloudera.com] *Sent:* 25 June 2013 18:05 *To:* common-u...@hadoop.apache.org *Subject:* Re: Job end notification does not always work (Hadoop 2.x) ** ** Devaraj, ** ** If you don't run the HS, once your jobs finished you cannot retrieve status/counters from it, from Java AP or Web UI. So I'd for any practical usage, you need it. ** ** thx ** ** On Mon, Jun 24, 2013 at 8:42 PM, Devaraj k devara...@huawei.com wrote:** ** It is not mandatory to have running HS in the cluster. Still the user can submit the job without HS in the cluster, and user may expect the Job/App End Notification. Thanks Devaraj k *From:* Alejandro Abdelnur [mailto:t...@cloudera.com] *Sent:* 24 June 2013 21:42 *To:* user@hadoop.apache.org *Cc:* user@hadoop.apache.org *Subject:* Re: Job end notification does not always work (Hadoop 2.x) if we ought to do this in a yarn service it should be the RM or the HS. the RM is, IMO, the natural fit. the HS, would be a good choice if we are concerned about the extra work this would cause in the RM. the problem with the current HS is that it is MR specific, we should generalize it for diff AM types. thx Alejandro (phone typing) On Jun 23, 2013, at 23:28, Devaraj k devara...@huawei.com wrote: Even if we handle all the failure cases in AM for Job End Notification, we may miss cases like abrupt kill of AM when it is in last retry. If we choose NM to give the notification, again RM needs to identify which NM should give the end-notification as we don't have any direct protocol between AM and NM. 
I feel it would be better to move the End-Notification responsibility to the RM as a Yarn Service because it ensures 100% notification and is also useful for other types of applications as well. Thanks Devaraj K *From:* Ravi Prakash [mailto:ravi...@ymail.com] *Sent:* 23 June 2013 19:01 *To:* user@hadoop.apache.org *Subject:* Re: Job end notification does not always work (Hadoop 2.x) Hi Alejandro, Thanks for your reply! I was thinking more along the lines Prashant suggested, i.e. a failure during init() should still trigger an attempt to notify (by the AM). But now that you mention it, maybe we would be better off including this as a YARN feature after all (especially with all the new AMs being written). We could let the NM of the AM handle the notification burden, so that the RM doesn't get unduly taxed. Thoughts? Thanks Ravi -- *From:* Alejandro Abdelnur t...@cloudera.com *To:* common-u...@hadoop.apache.org user@hadoop.apache.org *Sent:* Saturday, June 22, 2013 7:37 PM *Subject:* Re: Job end notification does not always work (Hadoop 2.x) If the AM fails before doing the job end notification, at any stage of the execution, for whatever reason, the job end notification will never be delivered. There is no way to fix this unless the notification is done by a Yarn service. The 2 'candidate' services for doing this would be the RM and the HS. The job notification URL is in the job conf. The RM never sees the job conf; that rules the RM out unless we add, at AM registration time, the possibility to specify a callback URL. The HS has access to the job conf, but the HS is currently a 'passive' service. thx On Sat, Jun 22, 2013 at 3:48 PM, Arun C Murthy a...@hortonworks.com wrote: Prashanth, Please file a jira. One thing to be aware of - AMs get restarted a certain number of times for fault-tolerance - which means we can't just assume that failure of a single AM is equivalent to failure of the job. 
Only the ResourceManager is in the appropriate position to judge failure of the AM v/s failure of the job. hth, Arun On Jun 22, 2013, at 2:44 PM, Prashant Kommireddi prash1...@gmail.com wrote: Thanks Ravi. Well, in this case it's a no-effort :) A failure of AM init should be considered as failure of the job? I looked at the code and best-effort makes sense with respect to retry logic etc. You make a good point that there would be no notification in case the AM OOMs, but I do feel an AM init failure should send a notification by other means. On Sat, Jun 22, 2013 at 2:38 PM, Ravi Prakash ravi...@ymail.com wrote: Hi Prashant, I would tend to agree with you. Although job-end
Re: Job end notification does not always work (Hadoop 2.x)
if we ought to do this in a yarn service it should be the RM or the HS. the RM is, IMO, the natural fit. the HS would be a good choice if we are concerned about the extra work this would cause in the RM. the problem with the current HS is that it is MR specific; we should generalize it for diff AM types. thx Alejandro (phone typing) On Jun 23, 2013, at 23:28, Devaraj k devara...@huawei.com wrote: Even if we handle all the failure cases in the AM for Job End Notification, we may miss cases like an abrupt kill of the AM when it is in its last retry. If we choose the NM to give the notification, again the RM needs to identify which NM should give the end-notification, as we don't have any direct protocol between the AM and NM. I feel it would be better to move the End-Notification responsibility to the RM as a Yarn Service because it ensures 100% notification and is also useful for other types of applications as well. Thanks Devaraj K From: Ravi Prakash [mailto:ravi...@ymail.com] Sent: 23 June 2013 19:01 To: user@hadoop.apache.org Subject: Re: Job end notification does not always work (Hadoop 2.x) Hi Alejandro, Thanks for your reply! I was thinking more along the lines Prashant suggested, i.e. a failure during init() should still trigger an attempt to notify (by the AM). But now that you mention it, maybe we would be better off including this as a YARN feature after all (especially with all the new AMs being written). We could let the NM of the AM handle the notification burden, so that the RM doesn't get unduly taxed. Thoughts? Thanks Ravi From: Alejandro Abdelnur t...@cloudera.com To: common-u...@hadoop.apache.org user@hadoop.apache.org Sent: Saturday, June 22, 2013 7:37 PM Subject: Re: Job end notification does not always work (Hadoop 2.x) If the AM fails before doing the job end notification, at any stage of the execution, for whatever reason, the job end notification will never be delivered. There is no way to fix this unless the notification is done by a Yarn service. 
The 2 'candidate' services for doing this would be the RM and the HS. The job notification URL is in the job conf. The RM never sees the job conf; that rules the RM out unless we add, at AM registration time, the possibility to specify a callback URL. The HS has access to the job conf, but the HS is currently a 'passive' service. thx On Sat, Jun 22, 2013 at 3:48 PM, Arun C Murthy a...@hortonworks.com wrote: Prashanth, Please file a jira. One thing to be aware of - AMs get restarted a certain number of times for fault-tolerance - which means we can't just assume that failure of a single AM is equivalent to failure of the job. Only the ResourceManager is in the appropriate position to judge failure of the AM v/s failure of the job. hth, Arun On Jun 22, 2013, at 2:44 PM, Prashant Kommireddi prash1...@gmail.com wrote: Thanks Ravi. Well, in this case it's a no-effort :) A failure of AM init should be considered as failure of the job? I looked at the code and best-effort makes sense with respect to retry logic etc. You make a good point that there would be no notification in case the AM OOMs, but I do feel an AM init failure should send a notification by other means. On Sat, Jun 22, 2013 at 2:38 PM, Ravi Prakash ravi...@ymail.com wrote: Hi Prashant, I would tend to agree with you. Although job-end notification is only a best-effort mechanism (i.e. we cannot always guarantee notification, for example when the AM OOMs), I agree with you that we can do more. If you feel strongly about this, please create a JIRA and possibly upload a patch. Thanks Ravi From: Prashant Kommireddi prash1...@gmail.com To: user@hadoop.apache.org user@hadoop.apache.org Sent: Thursday, June 20, 2013 9:45 PM Subject: Job end notification does not always work (Hadoop 2.x) Hello, I came across an issue that occurs with the job notification callbacks in MR2. It works fine if the Application Master has started, but does not send a callback if the initializing of the AM fails. Here is the code from MRAppMaster.java: ... 
// set job classloader if configured
MRApps.setJobClassLoader(conf);
initAndStartAppMaster(appMaster, conf, jobUserName);
} catch (Throwable t) {
  LOG.fatal("Error starting MRAppMaster", t);
  System.exit(1);
}
}

protected static void initAndStartAppMaster(final MRAppMaster appMaster,
    final YarnConfiguration conf, String jobUserName) throws IOException,
    InterruptedException {
  UserGroupInformation.setConfiguration(conf);
  UserGroupInformation appMasterUgi = UserGroupInformation
      .createRemoteUser(jobUserName);
  appMasterUgi.doAs(new PrivilegedExceptionAction<Object>() {
    @Override
    public Object run() throws Exception {
      appMaster.init(conf);
      appMaster.start();
      if (appMaster.errorHappenedShutDown
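For reference, the notification this thread discusses is driven by the job configuration property mapreduce.job.end-notification.url (job.end.notification.url in older releases): before issuing an HTTP GET, the AM substitutes the $jobId and $jobStatus placeholders in that URL. A minimal sketch of that substitution and best-effort callback; the callback server URL is purely illustrative:

```python
from urllib.request import urlopen

def build_notification_url(template, job_id, job_status):
    """Substitute the placeholders the MR AM replaces before the GET."""
    return (template.replace("$jobId", job_id)
                    .replace("$jobStatus", job_status))

def notify(template, job_id, job_status):
    # Best-effort, like the AM: a failure here must not fail the job.
    url = build_notification_url(template, job_id, job_status)
    try:
        urlopen(url, timeout=5).read()
    except OSError:
        pass  # notification is best-effort only

# Hypothetical callback endpoint, just to show the substitution:
template = "http://callback.example.com/notify?id=$jobId&status=$jobStatus"
print(build_notification_url(template, "job_201306200945_0001", "SUCCEEDED"))
```

The proposals in the thread (moving this into the RM or HS) would change who performs the GET, not the URL contract itself.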
Re: How to use WebHDFS?
please try adding the user.name parameter to the querystring. use your hdfs username as the value. thanks Alejandro (phone typing) On May 18, 2013, at 3:42, Mohammad Mustaqeem 3m.mustaq...@gmail.com wrote: Can anyone help me to explore WebHDFS? The link for it is here. I am unable to access it. When I run this command curl -i "http://127.0.0.1:50070/webhdfs/v1/input/listOfFiles?op=GETFILESTATUS" I get the following error - Sorry, you are not currently allowed to request http://127.0.0.1:50070/webhdfs/v1/input/listOfFiles? from this cache until you have authenticated yourself. Please help me. -- With regards --- Mohammad Mustaqeem, M.Tech (CSE) MNNIT Allahabad 9026604270
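Alejandro's suggestion, sketched: on a cluster without Kerberos, WebHDFS identifies the caller through the user.name query parameter. A small helper that builds the request URL (the host, port, and path are from the question above; the helper itself and the username are illustrative):

```python
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op, user=None, **params):
    """Build a WebHDFS v1 REST URL, adding user.name for pseudo auth."""
    query = {"op": op}
    if user:
        query["user.name"] = user  # hdfs-side username, not OS login
    query.update(params)
    return "http://%s:%d/webhdfs/v1%s?%s" % (host, port, path, urlencode(query))

# Hypothetical username "mustaqeem":
print(webhdfs_url("127.0.0.1", 50070, "/input/listOfFiles",
                  "GETFILESTATUS", user="mustaqeem"))
```

The printed URL can then be passed to curl exactly as in the original command.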
Re: Put file with WebHDFS, and can I put file by chunk with curl?
you could use WebHDFS/HttpFS and the APPEND operation. thx On Wed, Mar 27, 2013 at 1:25 AM, 小学园PHP xxy-...@qq.com wrote: I want to put a file to HDFS with curl, and moreover I need to put it in chunks. So, does somebody know if curl can upload a file by chunks? Or, who has worked with WebHDFS via httplib? TIA Levi -- Alejandro
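For context on the APPEND suggestion: WebHDFS append is a two-step protocol. A POST with op=APPEND to the NameNode returns a 307 redirect whose Location header points at a DataNode, and the data is then POSTed to that URL. A sketch of the chunked-upload loop, with the HTTP call abstracted behind a callable so the flow is testable without a cluster (all URLs are illustrative):

```python
def append_in_chunks(post, namenode_url, path, data, chunk_size=4096):
    """post(url, body) -> (status, location_header_or_None).
    WebHDFS append: POST op=APPEND to the NN, follow the 307 redirect
    to a DataNode, then send the chunk to the redirected URL."""
    sent = 0
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        status, location = post(
            "%s/webhdfs/v1%s?op=APPEND" % (namenode_url, path), None)
        assert status == 307 and location, "expected NN redirect to a DN"
        status, _ = post(location, chunk)
        assert status == 200, "DN append failed"
        sent += len(chunk)
    return sent

# Usage with a stub standing in for a real cluster:
def fake_post(url, body):
    if body is None:  # step 1: NameNode issues the redirect
        return 307, "http://dn1:50075/webhdfs/v1/f?op=APPEND"
    return 200, None  # step 2: DataNode accepts the chunk

print(append_in_chunks(fake_post, "http://nn:50070", "/f", b"x" * 10000))
```

With curl the same two steps are a `curl -X POST` against the NameNode followed by a `curl -X POST -T chunk` against the redirect target.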
Re: Submit RHadoop job using Oozie in Cloudera Manager
[moving thread to user@oozie.a.o, BCCing common-user@hadoop.a.o] The Oozie web UI is read only, it does not do job submissions. If you want to do that you should look at Hue. Thx On Fri, Mar 8, 2013 at 2:53 AM, rohit sarewar rohitsare...@gmail.com wrote: Hi, I have R and RHadoop packages installed on all the nodes. I can submit RMR jobs manually from the terminal. I just want to know how to submit RMR jobs from the Oozie web interface? -Rohit On Fri, Mar 8, 2013 at 4:18 PM, Jagat Singh jagatsi...@gmail.com wrote: Hi, Do you have the rmr and rhdfs packages installed on all nodes? For Hadoop it doesn't matter what type of job it is, as long as you have the libraries it needs to run in the cluster. Submitting any job would be fine. Thanks On Fri, Mar 8, 2013 at 9:46 PM, rohit sarewar rohitsare...@gmail.com wrote: Hi All, I am using Cloudera Manager 4.5. As of now I can submit MR jobs using Oozie. Can we submit RHadoop jobs using Oozie in Cloudera Manager? -- Alejandro
Re: Hadoop Distcp [ Incomplete HDFS URI ]
it seems you have an extra : before the first / in your URIs. thx Alejandro (phone typing) On Feb 15, 2013, at 8:22 AM, Dhanasekaran Anbalagan bugcy...@gmail.com wrote: Hi guys, we have two clusters running CDH4.0.1. I am trying to copy data from one cluster to another cluster. It says Incomplete HDFS URI:
tech@dvcliftonhera227:~$ hadoop distcp hdfs://172.16.30.122:8020:/user/thirumal/test2/part-0 hdfs://172.16.30.227:/user/tech
13/02/15 11:13:15 INFO tools.DistCp: srcPaths=[hdfs://172.16.30.122:8020:/user/thirumal/test2/part-0]
13/02/15 11:13:15 INFO tools.DistCp: destPath=hdfs://172.16.30.227:/user/tech
With failures, global counters are inaccurate; consider running with -i
Copy failed: java.io.IOException: Incomplete HDFS URI, no host: hdfs://172.16.30.122:8020:/user/thirumal/test2/part-0
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:118)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2150)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:80)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2184)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2166)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:302)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:194)
at org.apache.hadoop.tools.DistCp.checkSrcPath(DistCp.java:635)
at org.apache.hadoop.tools.DistCp.copy(DistCp.java:656)
at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
I tried to debug; the ports are listening. How do I resolve this? Please guide me.
$ telnet 172.16.30.122 8020
Trying 172.16.30.122...
Connected to 172.16.30.122.
Escape character is '^]'.
^] telnet quit
$ telnet 172.16.30.227 8020
Trying 172.16.30.227...
Connected to 172.16.30.227.
Escape character is '^]'.
^] telnet quit
Connection closed. 
# core-site.xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://dvcliftonhera227:8020</value>
</property>
-Dhanasekaran Did I learn something today? If not, I wasted it. --
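The malformed URIs in the command above are exactly what DistCp rejects: hdfs://172.16.30.122:8020:/user/... has a second colon after the port, and hdfs://172.16.30.227:/user/tech has a colon with an empty port. A quick sanity check one could run over URIs before launching the copy (an illustrative helper, not part of Hadoop):

```python
import re

# host, optional numeric port, then the path
HDFS_URI = re.compile(r"^hdfs://(?P<host>[^:/]+)(?::(?P<port>\d+))?(?P<path>/.*)?$")

def valid_hdfs_uri(uri):
    """True when the URI has a host, an optional numeric port, and a path."""
    return HDFS_URI.match(uri) is not None

print(valid_hdfs_uri("hdfs://172.16.30.122:8020/user/thirumal/test2/part-0"))   # True
print(valid_hdfs_uri("hdfs://172.16.30.122:8020:/user/thirumal/test2/part-0"))  # False: colon after port
print(valid_hdfs_uri("hdfs://172.16.30.227:/user/tech"))                        # False: empty port
```

Dropping the stray colons (hdfs://172.16.30.122:8020/user/... and hdfs://172.16.30.227:8020/user/tech, or no port at all to use fs.defaultFS) makes the distcp command parse.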
Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
Tony, I think the first step would be to verify if the S3 filesystem implementation's rename works as expected. Thx On Fri, Feb 1, 2013 at 7:12 AM, Tony Burton tbur...@sportingindex.com wrote: Thanks for the reply Alejandro. Using a temp output directory was my first guess as well. What’s the best way to proceed? I’ve come across FileSystem.rename but it’s consistently returning false for whatever Paths I provide. Specifically, I need to copy the following:
s3://path to data/tmp folder/object type 1/part-0 … s3://path to data/tmp folder/object type 1/part-n
s3://path to data/tmp folder/object type 2/part-0 … s3://path to data/tmp folder/object type 2/part-n
… s3://path to data/tmp folder/object type m/part-n
to
s3://path to data/object type 1/part-0 … s3://path to data/object type 1/part-n
s3://path to data/object type 2/part-0 … s3://path to data/object type 2/part-n
… s3://path to data/object type m/part-n
without doing a copyToLocal. Any tips? Are there any better alternatives to FileSystem.rename? Or would using the AWS Java SDK be a better solution? Thanks! Tony *From:* Alejandro Abdelnur [mailto:t...@cloudera.com] *Sent:* 31 January 2013 18:45 *To:* common-u...@hadoop.apache.org *Subject:* Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat Hi Tony, from what I understand your problem is not with MTOF but with you wanting to run 2 jobs using the same output directory; the second job will fail because the output dir already exists. My take would be tweaking your jobs to use a temp output dir, and moving the output to the required (final) location upon completion. thx On Thu, Jan 31, 2013 at 8:22 AM, Tony Burton tbur...@sportingindex.com wrote: Hi everyone, Some of you might recall this topic, which I worked on with the list's help back in August last year - see email trail below. 
Despite the initial success of the discovery, I had to shelve the approach as I ended up using a different solution (for reasons I forget!) for the implementation that was ultimately used for that particular project. I'm now in a position to be working on a similar new task, where I've successfully implemented the combination of LazyOutputFormat and MultipleOutputs using hadoop 1.0.3 to write out to multiple custom output locations. However, I've hit another snag which I'm hoping you might help me work through. I'm going to be running daily tasks to extract data from XML files (specifically, the data stored in certain nodes of the XML), stored on AWS S3 using object names with the following format: s3://inputbucket/data/2013/1/13/list of xml data files.bz2 I want to extract items from the XML and write out as follows: s3://outputbucket/path/xml node name/20130113/output from MR job For one day of data, this works fine. I pass in s3://inputbucket/data and s3://outputbucket/path as input and output arguments, along with my run date (20130113) which gets manipulated and appended where appropriate to form the precise read and write locations, for example FileInputFormat.setInputPaths(job, s3://inputbucket/data); FileOutputFormat.setOutputPath(job, s3://outputbucket/path); Then MultipleOutputs adds on my XML node names underneath s3://outputbucket/path automatically. However, for the next day's run, the job gets to FileOutputFormat.setOutputPath and sees that the output path (s3://outputbucket/path) already exists, and throws a FileAlreadyExistsException from FileOutputFormat.checkOutputSpecs() - even though my ultimate subdirectory, to be constructed by MultipleOutputs, does not already exist. Is there any way around this? 
I'm given hope by this, from http://hadoop.apache.org/docs/r1.0.3/api/org/apache/hadoop/fs/FileAlreadyExistsException.html: public class FileAlreadyExistsException extends IOException - Used when target file already exists for any operation *and is not configured to be overwritten* (my emphasis). Is it possible to deconfigure the overwrite protection? If not, I suppose one other way ahead is to create my own FileOutputFormat where the checkOutputSpecs() is a bit less fussy; another might be to write to a temp directory and programmatically move it to the desired output when the job completes successfully, although this is getting to feel a bit hacky to me. Thanks for any feedback! Tony From: Harsh J [ha...@cloudera.com] Sent: 31 August 2012 10:47 To: user@hadoop.apache.org Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat Good finding, that OF slipped my mind. We can mention on the MultipleOutputs javadocs for the new API to use
Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
Hi Tony, from what I understand your problem is not with MTOF but with you wanting to run 2 jobs using the same output directory; the second job will fail because the output dir already exists. My take would be tweaking your jobs to use a temp output dir, and moving the output to the required (final) location upon completion. thx On Thu, Jan 31, 2013 at 8:22 AM, Tony Burton tbur...@sportingindex.com wrote: Hi everyone, Some of you might recall this topic, which I worked on with the list's help back in August last year - see email trail below. Despite the initial success of the discovery, I had to shelve the approach as I ended up using a different solution (for reasons I forget!) for the implementation that was ultimately used for that particular project. I'm now in a position to be working on a similar new task, where I've successfully implemented the combination of LazyOutputFormat and MultipleOutputs using hadoop 1.0.3 to write out to multiple custom output locations. However, I've hit another snag which I'm hoping you might help me work through. I'm going to be running daily tasks to extract data from XML files (specifically, the data stored in certain nodes of the XML), stored on AWS S3 using object names with the following format: s3://inputbucket/data/2013/1/13/list of xml data files.bz2 I want to extract items from the XML and write out as follows: s3://outputbucket/path/xml node name/20130113/output from MR job For one day of data, this works fine. I pass in s3://inputbucket/data and s3://outputbucket/path as input and output arguments, along with my run date (20130113) which gets manipulated and appended where appropriate to form the precise read and write locations, for example FileInputFormat.setInputPaths(job, s3://inputbucket/data); FileOutputFormat.setOutputPath(job, s3://outputbucket/path); Then MultipleOutputs adds on my XML node names underneath s3://outputbucket/path automatically. 
However, for the next day's run, the job gets to FileOutputFormat.setOutputPath and sees that the output path (s3://outputbucket/path) already exists, and throws a FileAlreadyExistsException from FileOutputFormat.checkOutputSpecs() - even though my ultimate subdirectory, to be constructed by MultipleOutputs does not already exist. Is there any way around this? I'm given hope by this, from http://hadoop.apache.org/docs/r1.0.3/api/org/apache/hadoop/fs/FileAlreadyExistsException.html: public class FileAlreadyExistsException extends IOException - Used when target file already exists for any operation *and is not configured to be overwritten* (my emphasis). Is it possible to deconfigure the overwrite protection? If not, I suppose one other way ahead is to create my own FileOutputFormat where the checkOutputSpecs() is a bit less fussy; another might be to write to a temp directory and programmatically move it to the desired output when the job completes successfully, although this is getting to feel a bit hacky to me. Thanks for any feedback! Tony From: Harsh J [ha...@cloudera.com] Sent: 31 August 2012 10:47 To: user@hadoop.apache.org Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat Good finding, that OF slipped my mind. We can mention on the MultipleOutputs javadocs for the new API to use the LazyOutputFormat for the job-level config. Please file a JIRA for this under MAPREDUCE project on the Apache JIRA? On Fri, Aug 31, 2012 at 2:32 PM, Tony Burton tbur...@sportingindex.com wrote: Hi Harsh, I tried using NullOutputFormat as you suggested, however simply using job.setOutputFormatClass(NullOutputFormat.class); resulted in no output at all. Although I've not tried overriding getOutputCommitter in NullOutputFormat as you suggested, I discovered LazyOutputFormat which only writes when it has to, the output file is created only when the first record is emitted for a given partition (from Hadoop: The Definitive Guide). 
Instead of job.setOutputFormatClass(TextOutputFormat.class); use LazyOutputFormat like this: LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class); So now my unnamed MultipleOutputs are handling the segmented results, and LazyOutputFormat is suppressing the default output. Good job! Tony From: Harsh J [ha...@cloudera.com] Sent: 29 August 2012 17:05 To: user@hadoop.apache.org Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat Hi Tony, On Wed, Aug 29, 2012 at 9:30 PM, Tony Burton tbur...@sportingindex.com wrote: Success so far! I followed the example given by Tom on the link to the MultipleOutputs.html API you suggested. I implemented a WordCount MR job using hadoop 1.0.3 and segmented the output depending on word length: output to directory sml for less than 10 characters, med for between 10 and 20 characters, lrg otherwise.
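The temp-output-then-move approach suggested earlier in this thread can be sketched on a local filesystem (the Hadoop equivalent would use FileSystem.rename on HDFS paths; note that on S3 a "rename" is a copy-and-delete under the hood, which is why verifying its behaviour is the suggested first step):

```python
import os
import shutil
import tempfile

def run_job_with_temp_output(write_output, final_dir):
    """Write job output into a scratch dir, then promote it to the
    final location only after the job completed successfully."""
    tmp = tempfile.mkdtemp(prefix="job-tmp-")
    try:
        write_output(tmp)                 # the "job" writes part files here
        os.makedirs(os.path.dirname(final_dir) or ".", exist_ok=True)
        shutil.move(tmp, final_dir)       # move into the final location
    except Exception:
        shutil.rmtree(tmp, ignore_errors=True)  # never leave half-written output
        raise
    return final_dir

def fake_job(out_dir):
    # stand-in for the MR job's reducer output
    with open(os.path.join(out_dir, "part-00000"), "w") as f:
        f.write("k\tv\n")

dest = os.path.join(tempfile.mkdtemp(), "output", "20130113")
run_job_with_temp_output(fake_job, dest)
print(os.listdir(dest))  # ['part-00000']
```

Because the final path only appears once the job succeeds, the next day's run never trips over a pre-existing output directory.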
Re: Same Map/Reduce works on command line BUT it hangs through Oozie without any error msg
Yaotian,
* Oozie version?
* More details on what exactly your workflow action is (mapred, java, shell, etc)?
* What is in the task log of the oozie launcher job for that action?
Thx On Fri, Jan 25, 2013 at 10:43 PM, yaotian yaot...@gmail.com wrote: I manually ran it in Hadoop. It works. But when I run it as a job through Oozie, the map/reduce hangs without any error msg. I checked the master and datanode. === From the master, I saw:
2013-01-26 06:32:38,517 INFO org.apache.hadoop.mapred.JobTracker: Adding task (JOB_SETUP) 'attempt_201301251528_0014_r_04_0' to tip task_201301251528_0014_r_04, for tracker 'tracker_datanode1:localhost/127.0.0.1:45695'
2013-01-26 06:32:44,538 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201301251528_0014_r_04_0' has completed task_201301251528_0014_r_04 successfully.
2013-01-26 06:33:40,640 INFO org.apache.hadoop.mapred.JSPUtil: Loading Job History file job_201301090834_0089. Cache size is 0
2013-01-26 06:33:40,640 WARN org.mortbay.log: /jobtaskshistory.jsp: java.io.FileNotFoundException: File /data/hadoop-0.20.205.0/logs/history/done/version-1/master_1357720451772_/2013/01/11/00/job_201301090834_0089_1357872144366_hadoop_sorting+locations+per+user does not exist.
2013-01-26 06:35:17,008 INFO org.apache.hadoop.mapred.JSPUtil: Loading Job History file job_201301090834_0051. Cache size is 0
2013-01-26 06:35:17,009 WARN org.mortbay.log: /jobtaskshistory.jsp: java.io.FileNotFoundException: File /data/hadoop-0.20.205.0/logs/history/done/version-1/master_1357720451772_/2013/01/11/00/job_201301090834_0051_1357870007893_hadoop_oozie%3Alauncher%3AT%3Djava%3AW%3Dmap-reduce-wf%3AA%3Dsort%5Fuser%3A does not exist.
2013-01-26 06:40:04,251 INFO org.apache.hadoop.mapred.JSPUtil: Loading Job History file job_201301090834_0026. 
Cache size is 0 2013-01-26 06:40:04,251 WARN org.mortbay.log: /taskstatshistory.jsp: java.io.FileNotFoundException: File /data/hadoop-0.20.205.0/logs/history/done/version-1/master_1357720451772_/2013/01/09/00/job_201301090834_0026_1357722497582_hadoop_oozie%3Alauncher%3AT%3Djava%3AW%3Dmap-reduce-wf%3AA%3Dreport%5Fsta does not exist. === Form the datanode. Just saw 2013-01-26 06:39:11,143 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301251528_0013_m_00_0 0.0% hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5 2013-01-26 06:39:41,256 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301251528_0013_m_00_0 0.0% hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5 2013-01-26 06:40:11,371 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301251528_0013_m_00_0 0.0% hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5 2013-01-26 06:40:41,486 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301251528_0013_m_00_0 0.0% hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5 2013-01-26 06:41:11,604 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301251528_0013_m_00_0 0.0% hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5 2013-01-26 06:41:41,724 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301251528_0013_m_00_0 0.0% hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5 2013-01-26 06:42:11,845 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301251528_0013_m_00_0 0.0% hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5 2013-01-26 06:42:41,959 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301251528_0013_m_00_0 0.0% 
hdfs://master:9000/user/hadoop/oozie-hado/007-130125091701134-oozie-hado-W/sort_user--java/input/dummy.txt:0+5 -- Alejandro
Re: Filesystem closed exception
Hemanth, Is FS caching enabled or not in your cluster? A simple solution would be to modify your mapper code not to close the FS. It will go away when the task ends anyway. Thx On Thu, Jan 24, 2013 at 5:26 PM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Hi, We are noticing a problem where we get a filesystem closed exception when a map task is done and is finishing execution. By map task, I literally mean the MapTask class of the map reduce code. Debugging this we found that the mapper is getting a handle to the filesystem object and itself calling a close on it. Because filesystem objects are cached, I believe the behaviour is as expected in terms of the exception. I just wanted to confirm that: - if we do have a requirement to use a filesystem object in a mapper or reducer, we should either not close it ourselves - or (seems better to me) ask for a new version of the filesystem instance by setting the fs.hdfs.impl.disable.cache property to true in job configuration. Also, does anyone know if this behaviour was any different in Hadoop 0.20 ? For some context, this behaviour is actually seen in Oozie, which runs a launcher mapper for a simple java action. Hence, the java action could very well interact with a file system. I know this is probably better addressed in Oozie context, but wanted to get the map reduce view of things. Thanks, Hemanth -- Alejandro
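The reason closing the FS inside a mapper breaks the task, as discussed above, is the JVM-wide FileSystem cache: FileSystem.get() hands every caller the same cached instance per (scheme, authority, user), so one caller's close() invalidates everyone else's handle. A toy model of that sharing, illustrative rather than Hadoop's actual code:

```python
class CachedFS:
    """Toy stand-in for Hadoop's FileSystem cache."""
    _cache = {}

    def __init__(self):
        self.closed = False

    @classmethod
    def get(cls, uri):
        # Same URI -> same shared instance, like FileSystem.get(conf)
        if uri not in cls._cache:
            cls._cache[uri] = cls()
        return cls._cache[uri]

    def close(self):
        self.closed = True

    def read(self):
        if self.closed:
            raise IOError("Filesystem closed")  # what the map task sees
        return "data"

mapper_fs = CachedFS.get("hdfs://nn:8020")
framework_fs = CachedFS.get("hdfs://nn:8020")
mapper_fs.close()            # user code closes "its" handle...
try:
    framework_fs.read()      # ...and the framework's handle dies too
except IOError as e:
    print(e)
```

This is why the two remedies in the thread work: either never call close() in user code, or set fs.hdfs.impl.disable.cache to true so each caller gets a private, uncached instance.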
Re: Oozie workflow error - renewing token issue
Corbett, [Moving thread to user@oozie.a.o, BCCing common-user@hadoop.a.o]
* What version of Oozie are you using?
* Is the cluster a secure setup (Kerberos enabled)?
* Would you mind posting the complete launcher logs?
Thx On Wed, Jan 30, 2013 at 6:14 AM, Corbett Martin comar...@nhin.com wrote: Thanks for the tip. The sqoop command listed in the stdout log file is: sqoop import --driver org.apache.derby.jdbc.ClientDriver --connect jdbc:derby://test-server:1527/mondb --username monuser --password x --table MONITOR --split-by request_id --target-dir /mon/import --append --incremental append --check-column request_id --last-value 200 The following information is from the stderr and stdout log files. /var/log/hadoop-0.20-mapreduce/userlogs/job_201301231648_0029/attempt_201301231648_0029_m_00_0 # more stderr No such sqoop tool: sqoop. See 'sqoop help'. Intercepting System.exit(1) Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1] /var/log/hadoop-0.20-mapreduce/userlogs/job_201301231648_0029/attempt_201301231648_0029_m_00_0 # more stdout ... ... ... Invoking Sqoop command line now 1598 [main] WARN org.apache.sqoop.tool.SqoopTool - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration. Intercepting System.exit(1) Invocation of Main class completed Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SqoopMain], exit code [1] Oozie Launcher failed, finishing Hadoop job gracefully Oozie Launcher ends ~Corbett Martin Software Architect AbsoluteAR Accounts Receivable Services - An NHIN Solution -Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: Tuesday, January 29, 2013 10:11 PM To: user@hadoop.apache.org Subject: Re: Oozie workflow error - renewing token issue The job that Oozie launches for your action, which you are observing is failing - do its own logs (task logs) show any errors? 
On Wed, Jan 30, 2013 at 4:59 AM, Corbett Martin comar...@nhin.com wrote: Oozie question I'm trying to run an Oozie workflow (sqoop action) from the Hue console and it fails every time. No exception in the oozie log but I see this in the Job Tracker log file. Two primary issues seem to be 1. Client mapred tries to renew a token with renewer specified as mr token And 2. Cannot find class for token kind MAPREDUCE_DELEGATION_TOKEN Any ideas how to get past this? Full Stacktrace: 2013-01-29 17:11:28,860 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Creating password for identifier: owner=hdfs, renewer=mr token, realUser=oozie, issueDate=1359501088860, maxDate=136010560, sequenceNumber=75, masterKeyId=8 2013-01-29 17:11:28,871 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Creating password for identifier: owner=hdfs, renewer=mr token, realUser=oozie, issueDate=1359501088871, maxDate=136010571, sequenceNumber=76, masterKeyId=8 2013-01-29 17:11:29,202 INFO org.apache.hadoop.mapreduce.security.token.DelegationTokenRenewal: registering token for renewal for service =10.204.12.62:8021 and jobID = job_201301231648_0029 2013-01-29 17:11:29,211 INFO org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager: Token renewal requested for identifier: owner=hdfs, renewer=mr token, realUser=oozie, issueDate=1359501088871, maxDate=136010571, sequenceNumber=76, masterKeyId=8 2013-01-29 17:11:29,211 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:mapred (auth:SIMPLE) cause:org.apache.hadoop.security.AccessControlException: Client mapred tries to renew a token with renewer specified as mr token 2013-01-29 17:11:29,211 WARN org.apache.hadoop.security.token.Token: Cannot find class for token kind MAPREDUCE_DELEGATION_TOKEN 2013-01-29 17:11:29,211 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8021, call renewDelegationToken(Kind: 
MAPREDUCE_DELEGATION_TOKEN, Service: 10.204.12.62:8021, Ident: 00 04 68 64 66 73 08 6d 72 20 74 6f 6b 65 6e 05 6f 6f 7a 69 65 8a 01 3c 88 94 58 67 8a 01 3c ac a0 dc 67 4c 08), rpc version=2, client version=28, methodsFingerPrint=1830206421 from 10.204.12.62:9706: error: org.apache.hadoop.security.AccessControlException: Client mapred tries to
Re: Complex MapReduce applications with the streaming API
Using Oozie seems to be overkill for this application; besides, it doesn't support loops so the recursion can't really be implemented. Correct, Oozie does not support loops; this is a restriction by design (early prototypes supported loops). The idea was that you didn't want never-ending workflows. To this end, Coordinator Jobs address the recurrent run of workflow jobs. Still, if you want to do recursion in Oozie, you certainly can: a workflow invoking itself as a sub-workflow. Just make sure you define your exit condition properly. If you have additional questions, please move this thread to the u...@oozie.apache.org alias. Thx On Tue, Nov 27, 2012 at 4:03 AM, Zoltán Tóth-Czifra zoltan.tothczi...@softonic.com wrote: Hi everyone, Thanks in advance for the support. My problem is the following: I'm trying to develop a fairly complex MapReduce application using the streaming API (for demonstration purposes, so unfortunately the use Java answer doesn't work :-( ). I can get one single MapReduce phase running from the command line with no problem. The problem is when I want to add more MapReduce phases which use each other's output, and I maybe even want to do a recursion (feed its output to the same phase again) conditioned by a counter. The solution in Java MapReduce is trivial (i.e. creating multiple Job instances and monitoring counters) but with the streaming API not quite. What is the correct way to manage my application with its native code? (Python, PHP, Perl...) Calling shell commands from a controller script? How should I obtain counters?... Using Oozie seems to be overkill for this application; besides, it doesn't support loops so the recursion can't really be implemented. Thanks a lot! Zoltan -- Alejandro
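A self-calling workflow of the kind Alejandro describes would look roughly like this. This is a sketch against the Oozie workflow schema; the names, paths, and the EL expression in the decision node are placeholders, and the decision node is the exit condition that stops the recursion:

```xml
<workflow-app name="recursive-wf" xmlns="uri:oozie:workflow:0.2">
  <start to="do-work"/>
  <action name="do-work">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <!-- ... the real MR/streaming action configuration goes here ... -->
    </map-reduce>
    <ok to="check-done"/>
    <error to="fail"/>
  </action>
  <decision name="check-done">
    <switch>
      <!-- exit condition: stop recursing once the counter/flag says so -->
      <case to="recurse">${wf:conf('iterationsLeft') gt 0}</case>
      <default to="end"/>
    </switch>
  </decision>
  <action name="recurse">
    <sub-workflow>
      <!-- the workflow invokes itself as a sub-workflow -->
      <app-path>${wf:appPath()}</app-path>
      <propagate-configuration/>
    </sub-workflow>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Workflow failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

The calling side must decrement the counter it passes down (here the hypothetical iterationsLeft property) on each invocation, otherwise the sub-workflow chain never terminates.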
Re: Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError Java Heap Space
Eduard, Would you try using the following properties in your job invocation? -D mapreduce.map.java.opts=-Xmx768m -D mapreduce.reduce.java.opts=-Xmx768m -D mapreduce.map.memory.mb=2000 -D mapreduce.reduce.memory.mb=3000 Thx On Mon, Nov 5, 2012 at 7:43 AM, Kartashov, Andy andy.kartas...@mpac.ca wrote: Your error takes place during the reduce task, when temporary files are written to memory/disk. You are clearly running low on resources. Check your memory “$ free –m” and disk space “$ df –H” as well as “$ hadoop fs -df”. I remember it took me a couple of days to figure out why I was getting the heap size error and nothing worked! Because I tried to write a 7GB output file onto a disk (in pseudo-distributed mode) that only had 4GB of free space. p.s. Always test your jobs on small input first (a few lines of input). p.p.s. Follow your job execution through the web: http://fully-qualified-hostname-of-your-job-tracker:50030 From: Eduard Skaley [mailto:e.v.ska...@gmail.com] Sent: Monday, November 05, 2012 4:10 AM To: user@hadoop.apache.org Cc: Nitin Pawar Subject: Re: Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError Java Heap Space By the way it happens on Yarn, not on MRv1; each container gets 1GB at the moment. can you try increasing memory per reducer? 
On Wed, Oct 31, 2012 at 9:15 PM, Eduard Skaley e.v.ska...@gmail.com wrote: Hello, I'm getting this Error through job execution: 16:20:26 INFO [main] Job - map 100% reduce 46% 16:20:27 INFO [main] Job - map 100% reduce 51% 16:20:29 INFO [main] Job - map 100% reduce 62% 16:20:30 INFO [main] Job - map 100% reduce 64% 16:20:32 INFO [main] Job - Task Id : attempt_1351680008718_0018_r_06_0, Status : FAILED Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2 at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:123) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:371) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:152) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:147) Caused by: java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.io.BoundedByteArrayOutputStream.init(BoundedByteArrayOutputStream.java:58) at org.apache.hadoop.io.BoundedByteArrayOutputStream.init(BoundedByteArrayOutputStream.java:45) at org.apache.hadoop.mapreduce.task.reduce.MapOutput.init(MapOutput.java:97) at org.apache.hadoop.mapreduce.task.reduce.MergeManager.unconditionalReserve(MergeManager.java:286) at org.apache.hadoop.mapreduce.task.reduce.MergeManager.reserve(MergeManager.java:276) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:384) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:319) at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:179) 16:20:33 INFO [main] Job - map 100% reduce 65% 16:20:36 INFO [main] Job - map 100% reduce 67% 16:20:39 INFO [main] Job - map 100% reduce 69% 16:20:41 INFO [main] Job - map 100% reduce 70% 16:20:43 INFO [main] Job - map 100% reduce 71% I have no clue what the issue could be for 
this. I googled this issue and checked several sources of possible solutions but nothing fits. I saw this jira entry which could fit: https://issues.apache.org/jira/browse/MAPREDUCE-4655. Here somebody recommends increasing the value of the property dfs.datanode.max.xcievers / dfs.datanode.max.receiver.threads to 4096, but this is already the value for our cluster. http://yaseminavcular.blogspot.de/2011/04/common-hadoop-hdfs-exceptions-with.html The too-small-input-files issue doesn't fit, I think, because the map phase reads 137 files of ~130MB each. Block size is 128MB. The cluster uses version 2.0.0-cdh4.1.1, 581959ba23e4af85afd8db98b7687662fe9c5f20. Thx -- Nitin Pawar
Re: Loading Data to HDFS
I don't know what you mean by gateway but in order to have a rough idea of the time needed you need 3 values I believe Sumit's setup is a cluster within a firewall and hadoop client machines also within the firewall; the only way to access the cluster is to ssh from outside to one of the hadoop client machines and then submit your jobs. These hadoop client machines are often referred to as gateway machines. On Tue, Oct 30, 2012 at 4:10 AM, Bertrand Dechoux decho...@gmail.com wrote: I don't know what you mean by gateway but in order to have a rough idea of the time needed you need 3 values * amount of data you want to put on hadoop * hadoop bandwidth with regards to local storage (read/write) * bandwidth between where your data are stored and where the hadoop cluster is For the latter, for big volumes, physically moving the volumes is a viable solution. It will depend on your constraints of course: budget, speed... Bertrand On Tue, Oct 30, 2012 at 11:39 AM, sumit ghosh sumi...@yahoo.com wrote: Hi Bertrand, By physically moving the data do you mean that the data volume is connected to the gateway machine and the data is loaded from the local copy using copyFromLocal? Thanks, Sumit From: Bertrand Dechoux decho...@gmail.com To: common-user@hadoop.apache.org; sumit ghosh sumi...@yahoo.com Sent: Tuesday, 30 October 2012 3:46 PM Subject: Re: Loading Data to HDFS It might sound like a deprecated way but can't you move the data physically? From what I understand, it is one shot and not streaming, so it could be a good method if you have the access of course. Regards Bertrand On Tue, Oct 30, 2012 at 11:07 AM, sumit ghosh sumi...@yahoo.com wrote: Hi, I have data on a remote machine accessible over ssh. I have Hadoop CDH4 installed on RHEL. I am planning to load quite a few petabytes of data onto HDFS. Which will be the fastest method to use and are there any projects around Hadoop which can be used as well? I cannot install Hadoop-Client on the remote machine. 
Have a great Day Ahead! Sumit. --- Here I am attaching my previous discussion on CDH-user to avoid duplication. --- On Wed, Oct 24, 2012 at 9:29 PM, Alejandro Abdelnur t...@cloudera.com wrote: in addition to jarcec's suggestions, you could use httpfs. then you'd only need to poke a single host:port in your firewall as all the traffic goes thru it. thx Alejandro On Oct 24, 2012, at 8:28 AM, Jarek Jarcec Cecho jar...@cloudera.com wrote: Hi Sumit, there are plenty of ways to achieve that. Please find my feedback below: Does Sqoop support loading flat files to HDFS? No, Sqoop only supports moving data from external database and warehouse systems. Copying files is not supported at the moment. Can I use distcp? No. Distcp can be used only to copy data between HDFS filesystems. How do we use the core-site.xml file on the remote machine to use copyFromLocal? Yes, you can install the hadoop binaries on your machine (with no hadoop services running) and use the hadoop binary to upload data. The installation procedure is described in the CDH4 installation guide [1] (follow the client installation). Another way that I can think of is leveraging WebHDFS [2] or maybe hdfs-fuse [3]? Jarcec Links: 1: https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation 2: https://ccp.cloudera.com/display/CDH4DOC/Deploying+HDFS+on+a+Cluster#DeployingHDFSonaCluster-EnablingWebHDFS 3: https://ccp.cloudera.com/display/CDH4DOC/Mountable+HDFS On Wed, Oct 24, 2012 at 01:33:29AM -0700, Sumit Ghosh wrote: Hi, I have data on a remote machine accessible over ssh. What is the fastest way to load data onto HDFS? Does Sqoop support loading flat files to HDFS? Can I use distcp? How do we use the core-site.xml file on the remote machine to use copyFromLocal? Which will be the best to use and are there any other open source projects around Hadoop which can be used as well? Have a great Day Ahead! Sumit -- Bertrand Dechoux -- Alejandro
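The WebHDFS/HttpFS option works over plain HTTP, so no Hadoop client install is needed on the remote machine. A file upload is a two-step protocol: a PUT with op=CREATE sent to the NameNode (or HttpFS gateway) returns a 307 redirect whose Location header names the target datanode, and the client then PUTs the file bytes to that URL. A minimal sketch of the URL construction, stdlib only; the host, port, path, and user below are hypothetical (HttpFS conventionally listens on 14000):

```python
# Build WebHDFS/HttpFS REST URLs (host/port/user are made-up examples).
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op, user, **params):
    """Compose a WebHDFS v1 REST URL for the given operation."""
    query = {"op": op, "user.name": user, **params}
    return "http://%s:%d/webhdfs/v1%s?%s" % (host, port, path, urlencode(query))

create = webhdfs_url("httpfs-gw.example.com", 14000, "/user/sumit/data.csv",
                     "CREATE", "sumit", overwrite="true")
print(create)
# A client would then:
#   1) PUT this URL with an empty body, without following redirects;
#   2) PUT the file bytes to the Location header of the 307 response.
```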
Re: Detect when file is not being written by another process
AFAIK there is no way to determine if a file has been fully written or not. Oozie uses a feature of Hadoop which writes a _SUCCESS flag file in the output directory of a job. This _SUCCESS file is written at job completion time, thus ensuring all the output of the job is ready. This means that when Oozie is configured to look for a directory FOO/, in practice it looks for the existence of the FOO/_SUCCESS file. You can configure Oozie to look for the existence of FOO/ but this means you'll have to use a temp dir, i.e. FOO_TMP/, while writing data and do a rename to FOO/ once you've finished writing the data. Thx On Wed, Sep 26, 2012 at 1:52 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: Agree with Bejoy. The problem you've mentioned sounds like building something like a workflow, which is what Oozie is supposed to do. Thanks hemanth On Wed, Sep 26, 2012 at 12:22 AM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi Peter AFAIK oozie has a mechanism to achieve this. You can trigger your jobs as soon as the files are written to a certain hdfs directory. On Tue, Sep 25, 2012 at 10:23 PM, Peter Sheridan psheri...@millennialmedia.com wrote: These are log files being deposited by other processes, which we may not have control over. We don't want multiple processes to write to the same files — we just don't want to start our jobs until they have been completely written. Sorry for the lack of clarity; thanks for the response. --Pete From: Bertrand Dechoux decho...@gmail.com Reply-To: user@hadoop.apache.org user@hadoop.apache.org Date: Tuesday, September 25, 2012 12:33 PM To: user@hadoop.apache.org user@hadoop.apache.org Subject: Re: Detect when file is not being written by another process Hi, Multiple files and aggregation or something like hbase? Could you tell us more about your context? What are the volumes? Why do you want multiple processes to write to the same file? Regards Bertrand On Tue, Sep 25, 2012 at 6:28 PM, Peter Sheridan psheri...@millennialmedia.com wrote: Hi all. 
We're using Hadoop 1.0.3. We need to pick up a set of large (4+GB) files when they've finished being written to HDFS by a different process. There doesn't appear to be an API specifically for this. We had discovered through experimentation that the FileSystem.append() method can be used for this purpose — it will fail if another process is writing to the file. However: when running this on a multi-node cluster, using that API actually corrupts the file. Perhaps this is a known issue? Looking at the bug tracker I see https://issues.apache.org/jira/browse/HDFS-265 and a bunch of similar-sounding things. What's the right way to solve this problem? Thanks. --Pete -- Bertrand Dechoux -- Alejandro
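The _SUCCESS-flag convention described above avoids the append() probing altogether: the consumer simply waits for the completion marker before reading. A minimal local sketch of that pattern (plain Python on a local filesystem; against HDFS you would use FileSystem.exists() on the same path instead of os.path.exists):

```python
# Consumer-side sketch of the _SUCCESS-flag convention: the producer writes
# its output files, then creates the _SUCCESS marker last (MapReduce does
# this automatically at job completion); the consumer only starts once the
# marker exists.
import os, tempfile, time

def output_ready(directory, flag="_SUCCESS"):
    """True once the job-completion marker exists in the output directory."""
    return os.path.exists(os.path.join(directory, flag))

def wait_for_output(directory, timeout=5.0, poll=0.1):
    """Poll until the output is ready or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if output_ready(directory):
            return True
        time.sleep(poll)
    return False

# Local simulation of a producer finishing a job:
out = tempfile.mkdtemp()
assert not output_ready(out)                        # still being written
with open(os.path.join(out, "part-00000"), "w") as f:
    f.write("records\n")
open(os.path.join(out, "_SUCCESS"), "w").close()    # job completed
print(output_ready(out))                            # now safe to consume
```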
Re: Native not compiling in OS X Mountain Lion
Mohamed, Currently Hadoop native code does not compile/run in any flavor of OS X. Thanks. Alejandro On Mon, Aug 13, 2012 at 2:59 AM, J Mohamed Zahoor jmo...@gmail.com wrote: Hi I have problems compiling native's in OS X 10.8 for trunk. Especially in Yarn projects. Anyone faced similar problems? [exec] /usr/bin/gcc -g -Wall -O2 -D_GNU_SOURCE -D_REENTRANT -I/Users/zahoor/Development/OpenSource/reading/hadoop-common/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src -I/Users/zahoor/Development/OpenSource/reading/hadoop-common/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/native -I/Users/zahoor/Development/OpenSource/reading/hadoop-common/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor -I/Users/zahoor/Development/OpenSource/reading/hadoop-common/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl -o CMakeFiles/container-executor.dir/main/native/container-executor/impl/main.c.o -c /Users/zahoor/Development/OpenSource/reading/hadoop-common/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/main.c [exec] Undefined symbols for architecture x86_64: [exec] _fcloseall, referenced from: [exec] _launch_container_as_user in libcontainer.a(container-executor.c.o) [exec] _mkdirat, referenced from: [exec] _mkdirs in libcontainer.a(container-executor.c.o) [exec] _openat, referenced from: [exec] _mkdirs in libcontainer.a(container-executor.c.o) [exec] ld: symbol(s) not found for architecture x86_64 [exec] collect2: ld returned 1 exit status [exec] make[2]: *** [target/usr/local/bin/container-executor] Error 1 [exec] make[1]: *** [CMakeFiles/container-executor.dir/all] Error 2 [exec] make: *** [all] Error 2 ./Zahoor -- Alejandro
Re: MRv1 with CDH4
Antoine, This is the Apache Hadoop alias for mapreduce; your question is specific to the CDH4 distribution of Apache Hadoop. You should use the Cloudera alias for such questions. I'll follow up with you on the Cloudera alias where you cross-posted this message. Thanks. On Mon, Jul 30, 2012 at 6:20 AM, Antoine Boudot abou...@cyres.fr wrote: Hi everyone I don’t understand the steps to follow to implement MRv1 with CDH4: install service? run service? config service? mapred-site.xml? must I have the script retrieve and recreate the CDH3 mapred-site.xml and master? thanks Antoine -- Alejandro
Re: HBase mulit-user security
Tony, Sorry, missed this email earlier. This seems more appropriate for the HBase aliases. Thx. On Wed, Jul 11, 2012 at 8:41 AM, Tony Dean tony.d...@sas.com wrote: Hi, Looking at this further, it appears that when HBaseRPC is creating a proxy (e.g., SecureRpcEngine), it injects the current user: User.getCurrent(), which by default is the cached Kerberos TGT (kinit'ed user - using the hadoop-user-kerberos JAAS context). Since the server proxy always uses User.getCurrent(), how can an application inject the user it wants to use for authorization checks on the peer (region server)? And since SecureHadoopUser is a static class, how can you have more than 1 active user in the same application? What you have works for a single-user application like the hbase shell, but what about a multi-user application? Am I missing something? Thanks! -Tony -Original Message- From: Alejandro Abdelnur [mailto:t...@cloudera.com] Sent: Monday, July 02, 2012 11:40 AM To: common-user@hadoop.apache.org Subject: Re: hadoop security API (repost) Tony, If you are doing a server app that interacts with the cluster on behalf of different users (like Oozie, as you mentioned in your email), then you should use the proxyuser capabilities of Hadoop. * Configure user MYSERVERUSER as proxyuser in Hadoop core-site.xml (this requires 2 property settings, HOSTS and GROUPS). * Run your server app as MYSERVERUSER and have a Kerberos principal MYSERVERUSER/MYSERVERHOST * Initialize your server app loading the MYSERVERUSER/MYSERVERHOST keytab * Use UGI.doAs() to create JobClient/Filesystem instances as the user you want to act on behalf of * Keep in mind that all the users you need to act on behalf of should be valid Unix users in the cluster * If those users need direct access to the cluster, they'll have to also be defined in the KDC user database. Hope this helps. 
Thx On Mon, Jul 2, 2012 at 6:22 AM, Tony Dean tony.d...@sas.com wrote: Yes, but this will not work in a multi-tenant environment. I need to be able to create a Kerberos TGT per execution thread. I was hoping through JAAS that I could inject the name of the current principal and authenticate against it. I'm sure there is a best practice for hadoop/hbase client API authentication, just not sure what it is. Thank you for your comment. The solution may well be associated with the UserGroupInformation class. Hopefully, other ideas will come from this thread. Thanks. -Tony -Original Message- From: Ivan Frain [mailto:ivan.fr...@gmail.com] Sent: Monday, July 02, 2012 8:14 AM To: common-user@hadoop.apache.org Subject: Re: hadoop security API (repost) Hi Tony, I am currently working on this to access HDFS securely and programmatically. What I have found so far may help, even if I am not 100% sure this is the right way to proceed. If you have already obtained a TGT from the kinit command, the hadoop library will locate it automatically if the name of the ticket cache corresponds to the default location. On Linux it is located at /tmp/krb5cc_uid-number. For example, with my linux user hdfs, I get a TGT for hadoop user 'ivan', meaning you can impersonate ivan from the hdfs linux user: -- hdfs@mitkdc:~$ klist Ticket cache: FILE:/tmp/krb5cc_10003 Default principal: i...@hadoop.lan Valid starting / Expires / Service principal: 02/07/2012 13:59 / 02/07/2012 23:59 / krbtgt/hadoop@hadoop.lan renew until 03/07/2012 13:59 --- Then, you just have to set the right security options in your hadoop client in java and the identity will be i...@hadoop.lan for our example. 
In my tests, I only use HDFS, and here is a snippet of code to access a secure hdfs cluster assuming the previous TGT (ivan's impersonation): val conf: HdfsConfiguration = new HdfsConfiguration() conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION, "kerberos") conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHORIZATION, "true") conf.set(DFSConfigKeys.DFS_NAMENODE_USER_NAME_KEY, serverPrincipal) UserGroupInformation.setConfiguration(conf) val fs = FileSystem.get(new URI(hdfsUri), conf) Using this 'fs' is a handle to access hdfs securely as user 'ivan' even if ivan does not appear in the hadoop client code. Anyway, I also see two other options: * Setting the KRB5CCNAME environment variable to point to the right ticketCache file * Specifying the keytab file you want to use via the UserGroupInformation singleton API: UserGroupInformation.loginUserFromKeytab(user, keytabFile) If you want to understand the auth process and the different options to login, I guess you need to have a look
Re: hadoop security API (repost)
Tony, If you are doing a server app that interacts with the cluster on behalf of different users (like Oozie, as you mentioned in your email), then you should use the proxyuser capabilities of Hadoop. * Configure user MYSERVERUSER as proxyuser in Hadoop core-site.xml (this requires 2 property settings, HOSTS and GROUPS). * Run your server app as MYSERVERUSER and have a Kerberos principal MYSERVERUSER/MYSERVERHOST * Initialize your server app loading the MYSERVERUSER/MYSERVERHOST keytab * Use UGI.doAs() to create JobClient/Filesystem instances as the user you want to act on behalf of * Keep in mind that all the users you need to act on behalf of should be valid Unix users in the cluster * If those users need direct access to the cluster, they'll have to also be defined in the KDC user database. Hope this helps. Thx On Mon, Jul 2, 2012 at 6:22 AM, Tony Dean tony.d...@sas.com wrote: Yes, but this will not work in a multi-tenant environment. I need to be able to create a Kerberos TGT per execution thread. I was hoping through JAAS that I could inject the name of the current principal and authenticate against it. I'm sure there is a best practice for hadoop/hbase client API authentication, just not sure what it is. Thank you for your comment. The solution may well be associated with the UserGroupInformation class. Hopefully, other ideas will come from this thread. Thanks. -Tony -Original Message- From: Ivan Frain [mailto:ivan.fr...@gmail.com] Sent: Monday, July 02, 2012 8:14 AM To: common-user@hadoop.apache.org Subject: Re: hadoop security API (repost) Hi Tony, I am currently working on this to access HDFS securely and programmatically. What I have found so far may help, even if I am not 100% sure this is the right way to proceed. If you have already obtained a TGT from the kinit command, the hadoop library will locate it automatically if the name of the ticket cache corresponds to the default location. On Linux it is located at /tmp/krb5cc_uid-number. 
For example, with my linux user hdfs, I get a TGT for hadoop user 'ivan', meaning you can impersonate ivan from the hdfs linux user: -- hdfs@mitkdc:~$ klist Ticket cache: FILE:/tmp/krb5cc_10003 Default principal: i...@hadoop.lan Valid starting / Expires / Service principal: 02/07/2012 13:59 / 02/07/2012 23:59 / krbtgt/hadoop@hadoop.lan renew until 03/07/2012 13:59 --- Then, you just have to set the right security options in your hadoop client in java and the identity will be i...@hadoop.lan for our example. In my tests, I only use HDFS, and here is a snippet of code to access a secure hdfs cluster assuming the previous TGT (ivan's impersonation): val conf: HdfsConfiguration = new HdfsConfiguration() conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION, "kerberos") conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHORIZATION, "true") conf.set(DFSConfigKeys.DFS_NAMENODE_USER_NAME_KEY, serverPrincipal) UserGroupInformation.setConfiguration(conf) val fs = FileSystem.get(new URI(hdfsUri), conf) Using this 'fs' is a handle to access hdfs securely as user 'ivan' even if ivan does not appear in the hadoop client code. Anyway, I also see two other options: * Setting the KRB5CCNAME environment variable to point to the right ticketCache file * Specifying the keytab file you want to use via the UserGroupInformation singleton API: UserGroupInformation.loginUserFromKeytab(user, keytabFile) If you want to understand the auth process and the different options to login, I guess you need to have a look at the UserGroupInformation.java source code (release 0.23.1 link: http://bit.ly/NVzBKL). The private class HadoopConfiguration at line 347 is of major interest in our case. Another point is that I did not find any easy way to prompt the user for a password at runtime using the actual hadoop API. It appears to be somehow hardcoded in the UserGroupInformation singleton. 
I guess it could be nice to have a new function to give to the UserGroupInformation an authenticated 'Subject' which could override all default configurations. If someone have better ideas it could be nice to discuss on it as well. BR, Ivan 2012/7/1 Tony Dean tony.d...@sas.com Hi, The security documentation specifies how to test a secure cluster by using kinit and thus adding the Kerberos principal TGT to the ticket cache in which the hadoop client code uses to acquire service tickets for use in the cluster. What if I created an application that used the hadoop API to communicate with hdfs and/or mapred protocols, is there a programmatic way to inform hadoop to use a particular Kerberos principal name with a keytab that contains its password
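The proxyuser setup from Alejandro's bullets above boils down to two properties in the cluster's core-site.xml. A sketch of what they would look like; the user, host, and group values below are placeholders, not from the original thread:

```xml
<!-- core-site.xml on the cluster: allow the server app's user to impersonate -->
<property>
  <name>hadoop.proxyuser.myserveruser.hosts</name>
  <!-- hosts the server app is allowed to impersonate from -->
  <value>myserverhost.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.myserveruser.groups</name>
  <!-- groups whose members may be impersonated -->
  <value>users</value>
</property>
```

With these in place, the server app logs in from its keytab once and wraps per-user work in UGI.doAs(), as the bullets describe.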
Re: hadoop security API (repost)
On Mon, Jul 2, 2012 at 9:15 AM, Tony Dean tony.d...@sas.com wrote: Alejandro, Thanks for the reply. My intent is to also be able to scan/get/put hbase tables under a specified identity as well. What options do I have to perform the same multi-tenant authorization for these operations? I have posted this to the hbase users distribution list as well, but thought you might have insight. Since hbase security authentication is so dependent upon hadoop, it would be nice if your suggestion worked for hbase as well. Getting back to your suggestion... when configuring hadoop.proxyuser.myserveruser.hosts, host1 would be where I'm making the ugi.doAs() privileged call and host2 is the hadoop namenode? host1 in that case. Also, as another option, is there not a way for an application to pass hadoop/hbase authentication the name of a Kerberos principal to use? In this case, no proxy, just execute as the designated user. You could do that, but that means your app will have to have keytabs for all the users it wants to act as. Proxyuser will be much easier to manage. Maybe look into getting proxyuser support into hbase if it is not there yet. Thanks. -Tony -Original Message- From: Alejandro Abdelnur [mailto:t...@cloudera.com] Sent: Monday, July 02, 2012 11:40 AM To: common-user@hadoop.apache.org Subject: Re: hadoop security API (repost) Tony, If you are doing a server app that interacts with the cluster on behalf of different users (like Oozie, as you mentioned in your email), then you should use the proxyuser capabilities of Hadoop. * Configure user MYSERVERUSER as proxyuser in Hadoop core-site.xml (this requires 2 property settings, HOSTS and GROUPS). 
* Run your server app as MYSERVERUSER and have a Kerberos principal MYSERVERUSER/MYSERVERHOST * Initialize your server app loading the MYSERVERUSER/MYSERVERHOST keytab * Use the UGI.doAs() to create JobClient/Filesystem instances using the user you want to do something on behalf * Keep in mind that all the users you need to do something on behalf should be valid Unix users in the cluster * If those users need direct access to the cluster, they'll have to be also defined in in the KDC user database. Hope this helps. Thx On Mon, Jul 2, 2012 at 6:22 AM, Tony Dean tony.d...@sas.com wrote: Yes, but this will not work in a multi-tenant environment. I need to be able to create a Kerberos TGT per execution thread. I was hoping through JAAS that I could inject the name of the current principal and authenticate against it. I'm sure there is a best practice for hadoop/hbase client API authentication, just not sure what it is. Thank you for your comment. The solution may well be associated with the UserGroupInformation class. Hopefully, other ideas will come from this thread. Thanks. -Tony -Original Message- From: Ivan Frain [mailto:ivan.fr...@gmail.com] Sent: Monday, July 02, 2012 8:14 AM To: common-user@hadoop.apache.org Subject: Re: hadoop security API (repost) Hi Tony, I am currently working on this to access HDFS securely and programmaticaly. What I have found so far may help even if I am not 100% sure this is the right way to proceed. If you have already obtained a TGT from the kinit command, hadoop library will locate it automatically if the name of the ticket cache corresponds to default location. On Linux it is located /tmp/krb5cc_uid-number. 
For example, with my linux user hdfs, I get a TGT for hadoop user 'ivan' meaning you can impersonate ivan from hdfs linux user: -- hdfs@mitkdc:~$ klist Ticket cache: FILE:/tmp/krb5cc_10003 Default principal: i...@hadoop.lan Valid startingExpires Service principal 02/07/2012 13:59 02/07/2012 23:59 krbtgt/hadoop@hadoop.lan renew until 03/07/2012 13:59 --- Then, you just have to set the right security options in your hadoop client in java and the identity will be i...@hadoop.lan for our example. In my tests, I only use HDFS and here a snippet of code to have access to a secure hdfs cluster assuming the previous TGT (ivan's impersonation): val conf: HdfsConfiguration = new HdfsConfiguration() conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHENTICATION, kerberos) conf.set(CommonConfigurationKeysPublic.HADOOP_SECURITY_AUTHORIZATION, true) conf.set(DFSConfigKeys.DFS_NAMENODE_USER_NAME_KEY, serverPrincipal) UserGroupInformation.setConfiguration(conf) val fs = FileSystem.get(new URI(hdfsUri), conf) Using this 'fs' is a handler to access hdfs securely as user 'ivan' even if ivan does not appear in the hadoop client code. Anyway, I also see two other options: * Setting the KRB5CCNAME environment variable to point to the right ticketCache file * Specifying the keytab
Re: kerberos mapreduce question
If you provision your user/group information via LDAP to all your nodes it is not a nightmare. On Thu, Jun 7, 2012 at 7:49 AM, Koert Kuipers ko...@tresata.com wrote: thanks for your answer. so at a large place like say yahoo, or facebook, assuming they use kerberos, every analyst that uses hive has an account on every node of their large cluster? sounds like an admin nightmare to me On Thu, Jun 7, 2012 at 10:46 AM, Mapred Learn mapred.le...@gmail.com wrote: Yes, User submitting a job needs to have an account on all the nodes. Sent from my iPhone On Jun 7, 2012, at 6:20 AM, Koert Kuipers ko...@tresata.com wrote: with kerberos enabled a mapreduce job runs as the user that submitted it. does this mean the user that submitted the job needs to have linux accounts on all machines on the cluster? how does mapreduce do this (run jobs as the user)? do the tasktrackers use secure impersonation to run-as the user? thanks! koert -- Alejandro
Re: How can I configure oozie to submit different workflows from different users ?
Praveenesh, If I'm not mistaken 0.20.205 does not support wildcards for the proxyuser (hosts/groups) settings. You have to use explicit hosts/groups. Thxs. Alejandro PS: please follow up this thread in the oozie-us...@incubator.apache.org On Mon, Apr 2, 2012 at 2:15 PM, praveenesh kumar praveen...@gmail.com wrote: Hi all, I want to use oozie to submit different workflows from different users. These users are able to submit hadoop jobs. I am using hadoop 0.20.205 and oozie 3.1.3. I have a hadoop user as an oozie-user. I have set the following things:

conf/oozie-site.xml:

<property>
  <name>oozie.services.ext</name>
  <value>org.apache.oozie.service.HadoopAccessorService</value>
  <description>To add/replace services defined in 'oozie.services' with custom implementations. Class names must be separated by commas.</description>
</property>

conf/core-site.xml:

<property>
  <name>hadoop.proxyuser.hadoop.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.hadoop.groups</name>
  <value>*</value>
</property>

When I am submitting jobs as the hadoop user, I am able to run them properly. 
But when I submit the same workflow from a different user, who can submit simple MR jobs to my hadoop cluster, I am getting the following error: JA009: java.io.IOException: java.io.IOException: The username kumar obtained from the conf doesn't match the username hadoop the user authenticated as at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3943) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) Caused by: java.io.IOException: The username kumar obtained from the conf doesn't match the username hadoop the user authenticated as at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:426) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3941) ... 11 more
Re: How can I configure oozie to submit different workflows from different users ?
multiple value are comma separated. keep in mind that valid values for proxyuser groups, as the property name states are GROUPS, not USERS. thxs. Alejandro On Mon, Apr 2, 2012 at 2:27 PM, praveenesh kumar praveen...@gmail.comwrote: How can I specify multiple users /groups for proxy user setting ? Can I give comma separated values in these settings ? Thanks, Praveenesh On Mon, Apr 2, 2012 at 5:52 PM, Alejandro Abdelnur t...@cloudera.com wrote: Praveenesh, If I'm not mistaken 0.20.205 does not support wildcards for the proxyuser (hosts/groups) settings. You have to use explicit hosts/groups. Thxs. Alejandro PS: please follow up this thread in the oozie-us...@incubator.apache.org On Mon, Apr 2, 2012 at 2:15 PM, praveenesh kumar praveen...@gmail.com wrote: Hi all, I want to use oozie to submit different workflows from different users. These users are able to submit hadoop jobs. I am using hadoop 0.20.205 and oozie 3.1.3 I have a hadoop user as a oozie-user I have set the following things : conf/oozie-site.xml : property name oozie.services.ext /name value org.apache.oozie.service.HadoopAccessorService /value description To add/replace services defined in 'oozie.services' with custom implementations.Class names must be separated by commas. /description /property conf/core-site.xml property namehadoop.proxyuser.hadoop.hosts /name value* / value /property property namehadoop.proxyuser.hadoop.groups /name value* /value /property When I am submitting jobs as a hadoop user, I am able to run it properly. 
But when I submit the same workflow from a different user, who can submit simple MR jobs to my hadoop cluster, I am getting the following error:

JA009: java.io.IOException: java.io.IOException: The username kumar obtained from the conf doesn't match the username hadoop the user authenticated as
    at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3943)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
Caused by: java.io.IOException: The username kumar obtained from the conf doesn't match the username hadoop the user authenticated as
    at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:426)
    at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3941)
    ... 11 more
Re: How do I synchronize Hadoop jobs?
You can use Oozie for that: you can write a workflow job that forks A and B and then joins before C. Thanks. Alejandro On Wed, Feb 15, 2012 at 11:23 AM, W.P. McNeill bill...@gmail.com wrote: Say I have two Hadoop jobs, A and B, that can be run in parallel. I have another job, C, that takes the output of both A and B as input. I want to run A and B at the same time, wait until both have finished, and then launch C. What is the best way to do this? I know the answer if I've got a single Java client program that launches A, B, and C. But what if I don't have the option to launch all of them from a single Java program? (Say I've got a much more complicated system with many steps happening between A-B and C.) How do I synchronize between jobs and make sure there are no race conditions, etc.? Is this what Zookeeper is for?
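The fork/join workflow Alejandro describes might be sketched as below. This is a minimal sketch only; the workflow and action names (ab-then-c, job-a, job-b, job-c) are invented for illustration, and the map-reduce action bodies are elided.

```xml
<workflow-app name="ab-then-c" xmlns="uri:oozie:workflow:0.1">
  <start to="fork-ab"/>
  <!-- fork runs A and B concurrently -->
  <fork name="fork-ab">
    <path start="job-a"/>
    <path start="job-b"/>
  </fork>
  <action name="job-a">
    <map-reduce><!-- job A config --></map-reduce>
    <ok to="join-ab"/>
    <error to="fail"/>
  </action>
  <action name="job-b">
    <map-reduce><!-- job B config --></map-reduce>
    <ok to="join-ab"/>
    <error to="fail"/>
  </action>
  <!-- join waits for both A and B before moving on to C -->
  <join name="join-ab" to="job-c"/>
  <action name="job-c">
    <map-reduce><!-- job C config --></map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>A, B or C failed</message></kill>
  <end name="end"/>
</workflow-app>
```

The join node is what provides the synchronization the question asks about: Oozie will not transition to job-c until every forked path has reached the join.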
Re: Hybrid Hadoop with fork/join ?
Rob, Hadoop has a way to run Map tasks in multithreaded mode; look for MultithreadedMapRunner / MultithreadedMapper. Thanks. Alejandro. On Tue, Jan 31, 2012 at 7:51 AM, Rob Stewart robstewar...@gmail.com wrote: Hi, I'm investigating the feasibility of a hybrid approach to parallel programming, by fusing together the concurrent Java fork/join libraries with Hadoop... MapReduce, a paradigm suited for scalable execution over distributed memory, plus fork/join, a paradigm for optimal multi-threaded shared-memory execution. I am aware that, to a degree, Hadoop can take advantage of multiple cores on a compute node, by setting mapred.tasktracker.map/reduce.tasks.maximum to more than one. As it states in the Hadoop: The Definitive Guide book, each task runs in a separate JVM. So setting maximum map tasks to 3 will allow the possibility of 3 JVMs running on the operating system, right? As an alternative to this approach, I am looking to introduce fork/join into map tasks, and perhaps also throttle down the maximum number of map tasks per node to 1. I would implement a RecordReader that produces map tasks of coarser granularity, and fork/join would subdivide each map task using multiple threads - where the output of the join would be the output of the map task. The motivation for this is that threads are a lot more lightweight than initializing JVMs, which, as the Hadoop book points out, takes a second or so each time, unless mapred.job.reuse.jvm.num.tasks is set higher than 1 (the default is 1). So for example, opting for small granular maps for a particular job for a given input generates 1,000 map tasks, which will mean that 1,000 JVMs will be created on the Hadoop cluster within the execution of the program. Suppose instead I write a bespoke RecordReader to increase the granularity of each map, so much so that only 100 map tasks are created.
The embedded fork/join code would further split each map into 10 threads, to evaluate within one JVM concurrently. I would expect this latter approach of multiple threads to have better performance than the clunkier multiple-JVM approach. Has such a hybrid approach combining MapReduce with ForkJoin been investigated for feasibility, or have similar studies been published? Are there any important technical limitations that I should consider? Any thoughts on the proposed multi-threaded distributed-shared-memory architecture are much appreciated from the Hadoop community! -- Rob Stewart
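The JDK side of the proposal can be exercised on its own. The sketch below uses plain java.util.concurrent with no Hadoop involved; the class name and the summing workload are invented for illustration, standing in for "split one coarse map input across N threads, and the join's result is the map task's output".

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Sums a range of an array by recursively forking sub-tasks, the same
// divide-and-conquer shape the thread proposes embedding in a map task.
public class ForkJoinSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000;
    private final long[] data;
    private final int lo, hi;

    public ForkJoinSum(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {           // small enough: compute inline
            long s = 0;
            for (int i = lo; i < hi; i++) s += data[i];
            return s;
        }
        int mid = (lo + hi) >>> 1;
        ForkJoinSum left = new ForkJoinSum(data, lo, mid);
        ForkJoinSum right = new ForkJoinSum(data, mid, hi);
        left.fork();                          // run the left half concurrently
        return right.compute() + left.join(); // joined result = "map output"
    }

    public static long sum(long[] data) {
        return new ForkJoinPool().invoke(new ForkJoinSum(data, 0, data.length));
    }

    public static void main(String[] args) {
        long[] data = new long[10_000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        System.out.println(sum(data)); // 1 + 2 + ... + 10000 = 50005000
    }
}
```

All threads here share one JVM heap, which is exactly the lightweight-threads-versus-many-JVMs trade-off the message argues for.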
Re: Any samples of how to write a custom FileSystem
Steven, You could also look at HttpFSFileSystem in the hadoop-httpfs module; it is quite simple and self-contained. Cheers. Alejandro On Tue, Jan 31, 2012 at 8:37 PM, Harsh J ha...@cloudera.com wrote: To write a custom filesystem, extend the FileSystem class. Depending on the scheme it is supposed to serve, creating an entry fs.scheme.impl in core-site.xml and then loading it via the FileSystem.get(URI, conf) API will auto-load it for you, provided the URI you pass has the right scheme. So supposing I have a FS scheme foo, I'd register it in core-site.xml as:

<name>fs.foo.impl</name>
<value>com.myorg.FooFileSystem</value>

And then, with a URI object from a path that goes foo:///mypath/, I'd do FileSystem.get(URI, new Configuration()) to get a FooFileSystem instance. Similarly, if you want to overload the local filesystem with your class, override the fs.file.impl config with your derivative class, and that'd be used in your configuration-loaded programs in future. A good, not-so-complex impl. example to look at generally would be the S3FileSystem. On Wed, Feb 1, 2012 at 9:41 AM, Steve Lewis lordjoe2...@gmail.com wrote: Specifically how do I register a Custom FileSystem - any sample code? -- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com -- Harsh J Customer Ops. Engineer Cloudera | http://tiny.cloudera.com/about
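Putting Harsh's steps together, the client-side wiring might look like the sketch below. FooFileSystem and the foo scheme are the hypothetical names from his reply, not a real filesystem, and this fragment assumes Hadoop's client libraries are on the classpath.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration();
// programmatic equivalent of the core-site.xml entry shown above
conf.set("fs.foo.impl", "com.myorg.FooFileSystem");

// the URI scheme ("foo") selects the implementation registered above
FileSystem fs = FileSystem.get(URI.create("foo:///mypath/"), conf);
```

Setting the property in core-site.xml instead of in code has the advantage that every configuration-loading program on the cluster picks up the registration automatically.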
Re: Adding a soft-linked archive file to the distributed cache doesn't work as advertised
Bill, In addition you must call DistributedCache.createSymlink(configuration); that should do it. Thxs. Alejandro On Mon, Jan 9, 2012 at 10:30 AM, W.P. McNeill bill...@gmail.com wrote: I am trying to add a zip file to the distributed cache and have it unzipped on the task nodes, with a softlink to the unzipped directory placed in the working directory of my mapper process. I think I'm doing everything the way the documentation tells me to, but it's not working. On the client, in the run() function, while I'm creating the job I first call: fs.copyFromLocalFile(gate-app.zip, /tmp/gate-app.zip); As expected, this copies the archive file gate-app.zip to the HDFS directory /tmp. Then I call: DistributedCache.addCacheArchive(/tmp/gate-app.zip#gate-app, configuration); I expect this to add /tmp/gate-app.zip to the distributed cache and put a softlink to it called gate-app in the working directory of each task. However, when I call job.waitForCompletion(), I see the following error: Exception in thread main java.io.FileNotFoundException: File does not exist: /tmp/gate-app.zip#gate-app. It appears that the distributed cache mechanism is interpreting the entire URI as the literal name of the file, instead of treating the fragment as the name of the softlink. As far as I can tell, I'm doing this correctly according to the API documentation: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html . The full project in which I'm doing this is up on github: https://github.com/wpm/Hadoop-GATE. Can someone tell me what I'm doing wrong?
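Alejandro's fix, placed in context: a sketch of the client-side calls, using the paths from the original message and assuming Hadoop's old (org.apache.hadoop.filecache) API is on the classpath.

```java
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;

// archive URI with a '#' fragment naming the symlink to create in the task CWD
DistributedCache.addCacheArchive(
    new URI("/tmp/gate-app.zip#gate-app"), configuration);
// without this call the framework ignores the fragment and the '#' is
// treated as part of the literal path, producing the FileNotFoundException
DistributedCache.createSymlink(configuration);
```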
Re: Stop chained mapreduce.
Ilyal, The MR output file names follow the part-* pattern, and you'll have as many as your job had reducers. Since you know the output directory, you can do an fs.listStatus() of the output directory and check all the part-* files. Hope this helps. Thanks. Alejandro On Sun, Sep 11, 2011 at 4:52 AM, ilyal levin nipponil...@gmail.com wrote: Hi, I created a chained mapreduce program where each job creates a SequenceFile output. My stopping condition is simply to check if the last output file (type: SequenceFile) is empty. In order to do that I need to use the SequenceFile.Reader, and for it to read the data I need the path of the output file. The problem is that I don't know the name of the file; it usually depends on the number of the reducer. What can I do to solve this? Thanks.
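A sketch of the listing Alejandro suggests, assuming a FileSystem fs and a Path outputDir are already in scope (both names are placeholders):

```java
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// list every reducer output file (part-*) in the job's output directory
FileStatus[] parts = fs.listStatus(outputDir, new PathFilter() {
    public boolean accept(Path p) {
        return p.getName().startsWith("part-");
    }
});
for (FileStatus part : parts) {
    // part.getPath() is the path SequenceFile.Reader needs. Note that an
    // "empty" SequenceFile still contains a header, so test for records
    // (reader.next(...) returning false) rather than for zero length.
}
```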
Re: Timer jobs
[moving common-user@ to BCC] Oozie is not HA yet, but it would be relatively easy to make it so. It was designed with that in mind; we even did a prototype. Oozie consists of 2 services: a SQL database to store the Oozie jobs' state, and a servlet container where the Oozie app proper runs. The solution for HA for the database, well, is left to the database. This means you'll have to get an HA DB. The solution for HA for the Oozie app is deploying the servlet container with the Oozie app in more than one box (2 or 3) and fronting them with an HTTP load-balancer. The missing part is that the current Oozie lock-service is an in-memory implementation. This should be replaced with a Zookeeper implementation. Zookeeper could run externally or internally in all Oozie servers. This is what was prototyped long ago. Thanks. Alejandro On Thu, Sep 1, 2011 at 4:14 AM, Ronen Itkin ro...@taykey.com wrote: If I get you right you are asking about installing Oozie as a distributed and/or HA cluster?! In that case I am not familiar with an out-of-the-box solution by Oozie. But I think you can make up a solution of your own, for example: installing Oozie on two servers on the same partition, synchronized by DRBD. You can trigger a failover using Linux Heartbeat and that way maintain a virtual IP. On Thu, Sep 1, 2011 at 1:59 PM, Per Steffensen st...@designware.dk wrote: Hi, Thanks a lot for pointing me to Oozie. I have looked a little bit into Oozie and it seems like the component triggering jobs is called the Coordinator Application. But I really see nowhere that this Coordinator Application doesn't just run on a single machine, and that it will therefore not trigger anything if this machine is down. Can you confirm that the Coordinator Application role is distributed in a distributed Oozie setup, so that jobs get triggered even if one or two machines are down? Regards, Per Steffensen Ronen Itkin wrote: Hi, Try to use Oozie for job coordination and work flows.
On Thu, Sep 1, 2011 at 12:30 PM, Per Steffensen st...@designware.dk wrote: Hi, I use hadoop for a MapReduce job in my system. I would like to have the job run every 5th minute. Is there any distributed timer job stuff in hadoop? Of course I could set up a timer in an external timer framework (CRON or something like that) that invokes the MapReduce job. But CRON is only running on one particular machine, so if that machine goes down my job will not be triggered. Then I could set up the timer on all or many machines, but I would not like the job to be run in more than one instance every 5th minute, so then the timer jobs would need to coordinate who is actually starting the job this time and all the rest would just do nothing. I guess I could come up with a solution to that - e.g. writing some lock stuff using HDFS files or by using ZooKeeper. But I would really like it if someone had already solved the problem and provided some kind of distributed timer framework running in a cluster, so that I could just register a timer job with the cluster and then be sure that it is invoked every 5th minute, no matter if one or two particular machines in the cluster are down. Any suggestions are very welcome. Regards, Per Steffensen -- * Ronen Itkin* Taykey | www.taykey.com
Re: Oozie monitoring
Avi, For Oozie related questions, please subscribe and use the oozie-...@incubator.apache.org alias. Thanks. Alejandro On Tue, Aug 30, 2011 at 2:28 AM, Avi Vaknin avivakni...@gmail.com wrote: Hi All, First, I really enjoy writing you and I'm thankful for your help. I have Oozie installed on dedicated server and I want to monitor it using my Nagios server. Can you please suggest some important parameters to monitor or other tips regarding this issue. Thanks. Avi
Re: Oozie on the namenode server
[Moving thread to Oozie aliases and hadoop's alias to BCC] Avi, Currently you can have a cold-standby solution. An Oozie setup consists of 2 systems: a SQL DB (storing all Oozie jobs' state) and a servlet container (running Oozie proper). You need your DB to be highly available. You need to have a second installation of Oozie, configured identically to the running one, on standby. If the running Oozie fails, you start the second one. The idea (we once did a prototype using Zookeeper) is to eventually support a multithreaded Oozie server, which would give both high availability and horizontal scalability. Still, you'll need a highly available DB. Thanks. Alejandro On Mon, Aug 29, 2011 at 4:57 AM, Avi Vaknin avivakni...@gmail.com wrote: Hi, I have one more question about Oozie: is there any suggested high-availability solution for Oozie? Something like an active/passive or active/active solution. Thanks. Avi -----Original Message----- From: Harsh J [mailto:ha...@cloudera.com] Sent: Monday, August 29, 2011 2:20 PM To: common-user@hadoop.apache.org Subject: Re: Oozie on the namenode server Avi, Should be OK to do so at this stage. However, keep monitoring loads on the machines to determine when to move things out to their own dedicated boxes. On Mon, Aug 29, 2011 at 3:20 PM, Avi Vaknin avivakni...@gmail.com wrote: Hi All, I want to install Oozie and I wonder if it is OK to install it on the namenode or whether I need a dedicated server for it. I have a very small Hadoop cluster (4 datanodes + namenode + secondary namenode). Thanks for your help. Avi -- Harsh J
Hoop into 0.23 release
Hadoop developers, Arun will be cutting a branch for Hadoop 0.23 as soon as the trunk has a successful build. I'd like Hoop (https://issues.apache.org/jira/browse/HDFS-2178) to be part of 0.23 (Nicholas already looked at the code). In addition, the Jersey utils in Hoop will be handy for https://issues.apache.org/jira/browse/MAPREDUCE-2863. Most if not all of the remaining work is not development but Java package renaming (to be org.apache.hadoop..), Maven integration (sub-modules) and final packaging. The current blocker is deciding on the Maven sub-modules organization, https://issues.apache.org/jira/browse/HADOOP-7560. I'll drive the discussion for HADOOP-7560 and as soon as we have an agreement there I'll refactor Hoop accordingly. Does this sound reasonable? Thanks. Alejandro
Re: Hoop into 0.23 release
Nicholas, Thanks. In order to move forward with the integration of Hoop I need consensus on: * https://issues.apache.org/jira/browse/HDFS-2178?focusedCommentId=13089106page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13089106 * https://issues.apache.org/jira/browse/HADOOP-7560 Thanks. Alejandro On Mon, Aug 22, 2011 at 3:42 PM, Tsz Wo Sze szets...@yahoo.com wrote: +1 I believe HDFS-2178 is very close to being committed. Great work Alejandro! Nicholas From: Alejandro Abdelnur t...@cloudera.com To: common-user@hadoop.apache.org; hdfs-...@hadoop.apache.org Sent: Monday, August 22, 2011 2:16 PM Subject: Hoop into 0.23 release Hadoop developers, Arun will be cutting a branch for Hadoop 0.23 as soon as the trunk has a successful build. I'd like Hoop (https://issues.apache.org/jira/browse/HDFS-2178) to be part of 0.23 (Nicholas already looked at the code). In addition, the Jersey utils in Hoop will be handy for https://issues.apache.org/jira/browse/MAPREDUCE-2863. Most if not all of the remaining work is not development but Java package renaming (to be org.apache.hadoop..), Maven integration (sub-modules) and final packaging. The current blocker is deciding on the Maven sub-modules organization, https://issues.apache.org/jira/browse/HADOOP-7560. I'll drive the discussion for HADOOP-7560 and as soon as we have an agreement there I'll refactor Hoop accordingly. Does this sound reasonable? Thanks. Alejandro
Re: Can you access Distributed cache in custom output format ?
Mmmh, I've never used the -files option (I don't know if it will copy the files to HDFS for you, or whether you have to put them there first). My usage pattern with the DC is copying the files to HDFS, then using the DC API to add those files to the jobconf. Alejandro On Fri, Jul 29, 2011 at 10:56 AM, Mapred Learn mapred.le...@gmail.com wrote: I'm trying to access a file that I sent via the -files option in my hadoop jar command. In my outputformat, I am doing something like:

Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
String file1 = "";
String file2 = "";
Path pt = null;
for (Path p : cacheFiles) {
  if (p != null) {
    if (p.getName().endsWith(".ryp")) {
      file1 = p.getName();
    } else if (p.getName().endsWith(".cpt")) {
      file2 = p.getName();
      pt = p;
    }
  }
}

// then read the file, which gives a "file does not exist" exception:
Path pat = new Path(file2);
BufferedReader reader = null;
try {
  FileSystem fs = FileSystem.get(conf);
  reader = new BufferedReader(new InputStreamReader(fs.open(pat)));
  String line = null;
  while ((line = reader.readLine()) != null) {
    System.out.println("Now parsing the line: " + line);
  }
} catch (Exception e) {
  System.out.println("exception " + e.getMessage());
}

On Fri, Jul 29, 2011 at 10:50 AM, Alejandro Abdelnur t...@cloudera.com wrote: Where are you getting the error, in the client submitting the job or in the MR tasks? Are you trying to access a file or trying to set a JAR in the DistributedCache? How/when are you adding the file/JAR to the DC? How are you retrieving the file/JAR from your outputformat code? Thxs. Alejandro On Fri, Jul 29, 2011 at 10:43 AM, Mapred Learn mapred.le...@gmail.com wrote: I am trying to create a custom text outputformat where I want to access a distributed cache file. On Fri, Jul 29, 2011 at 10:42 AM, Harsh J ha...@cloudera.com wrote: Mapred, By outputformat, do you mean the frontend, submit-time run of OutputFormat?
Then no, it cannot access the distributed cache because it's not really set up at that point, and the front end doesn't need the distributed cache really when it can access those files directly. Could you describe in slightly more depth what you're attempting to do? On Fri, Jul 29, 2011 at 10:57 PM, Mapred Learn mapred.le...@gmail.com wrote: Hi, I am trying to access the distributed cache in my custom output format but it does not work, and the file open in the custom output format fails with file does not exist, even though it physically does. Looks like the distributed cache only works for Mappers and Reducers? Is there a way I can read the Distributed Cache in my custom output format? Thanks, -JJ -- Harsh J
Re: Can you access Distributed cache in custom output format ?
So, -files uses the '#' symlink feature, correct? If so, from the MR task JVM, doing a new File(FILENAME) would work, where FILENAME does not include the path of the file, correct? Thxs. Alejandro On Fri, Jul 29, 2011 at 11:20 AM, Brock Noland br...@cloudera.com wrote: With -files the file will be placed in the CWD of the map/reduce tasks. You should be able to open the file with FileInputStream. Brock On Fri, Jul 29, 2011 at 1:18 PM, Mapred Learn mapred.le...@gmail.com wrote: When you use the -files option, it copies into a .staging directory and all mappers can access it, but for the output format I see it is not able to access it. -files copies the cache file under: /user/id/.staging/job name/files/filename On Fri, Jul 29, 2011 at 11:14 AM, Alejandro Abdelnur t...@cloudera.com wrote: Mmmh, I've never used the -files option (I don't know if it will copy the files to HDFS for you, or whether you have to put them there first). My usage pattern with the DC is copying the files to HDFS, then using the DC API to add those files to the jobconf. Alejandro On Fri, Jul 29, 2011 at 10:56 AM, Mapred Learn mapred.le...@gmail.com wrote: I'm trying to access a file that I sent via the -files option in my hadoop jar command.
In my outputformat, I am doing something like:

Path[] cacheFiles = DistributedCache.getLocalCacheFiles(conf);
String file1 = "";
String file2 = "";
Path pt = null;
for (Path p : cacheFiles) {
  if (p != null) {
    if (p.getName().endsWith(".ryp")) {
      file1 = p.getName();
    } else if (p.getName().endsWith(".cpt")) {
      file2 = p.getName();
      pt = p;
    }
  }
}

// then read the file, which gives a "file does not exist" exception:
Path pat = new Path(file2);
BufferedReader reader = null;
try {
  FileSystem fs = FileSystem.get(conf);
  reader = new BufferedReader(new InputStreamReader(fs.open(pat)));
  String line = null;
  while ((line = reader.readLine()) != null) {
    System.out.println("Now parsing the line: " + line);
  }
} catch (Exception e) {
  System.out.println("exception " + e.getMessage());
}

On Fri, Jul 29, 2011 at 10:50 AM, Alejandro Abdelnur t...@cloudera.com wrote: Where are you getting the error, in the client submitting the job or in the MR tasks? Are you trying to access a file or trying to set a JAR in the DistributedCache? How/when are you adding the file/JAR to the DC? How are you retrieving the file/JAR from your outputformat code? Thxs. Alejandro On Fri, Jul 29, 2011 at 10:43 AM, Mapred Learn mapred.le...@gmail.com wrote: I am trying to create a custom text outputformat where I want to access a distributed cache file. On Fri, Jul 29, 2011 at 10:42 AM, Harsh J ha...@cloudera.com wrote: Mapred, By outputformat, do you mean the frontend, submit-time run of OutputFormat? Then no, it cannot access the distributed cache because it's not really set up at that point, and the front end doesn't need the distributed cache really when it can access those files directly. Could you describe in slightly more depth what you're attempting to do? On Fri, Jul 29, 2011 at 10:57 PM, Mapred Learn mapred.le...@gmail.com wrote: Hi, I am trying to access the distributed cache in my custom output format but it does not work, and the file open in the custom output format fails with file does not exist, even though it physically does.
Looks like distributed cache only works for Mappers and Reducers ? Is there a way I can read Distributed Cache in my custom output format ? Thanks, -JJ -- Harsh J
Re: Multiple Output Formats
Roger, Or you can take a look at Hadoop's MultipleOutputs class. Thanks. Alejandro On Tue, Jul 26, 2011 at 11:30 PM, Luca Pireddu pire...@crs4.it wrote: On July 26, 2011 06:11:33 PM Roger Chen wrote: Hi all, I am attempting to implement MultipleOutputFormat to write data to multiple files depending on the output keys and values. Can somebody provide a working example of how to implement this in Hadoop 0.20.2? Thanks! Hello, I have a working sample here: http://biodoop-seal.bzr.sourceforge.net/bzr/biodoop-seal/trunk/annotate/head%3A/src/it/crs4/seal/demux/DemuxOutputFormat.java It extends FileOutputFormat. -- Luca Pireddu CRS4 - Distributed Computing Group Loc. Pixina Manna Edificio 1 Pula 09010 (CA), Italy Tel: +39 0709250452
Re: Programming Multiple rounds of mapreduce
Thanks Matt. Arko, if you plan to use Oozie, you can have a simple coordinator job that does this, for example (the following schedules a WF every 5 mins that consumes the output produced by the previous run; you just have to have the initial data). Thxs. Alejandro

<coordinator-app name="coord-1" frequency="${coord:minutes(5)}"
                 start="${start}" end="${end}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1">
  <controls>
    <concurrency>1</concurrency>
  </controls>
  <datasets>
    <dataset name="data" frequency="${coord:minutes(5)}"
             initial-instance="${start}" timezone="UTC">
      <uri-template>${nameNode}/user/${coord:user()}/examples/${dataRoot}/${YEAR}-${MONTH}-${DAY}-${HOUR}-${MINUTE}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="data">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <output-events>
    <data-out name="output" dataset="data">
      <instance>${coord:current(1)}</instance>
    </data-out>
  </output-events>
  <action>
    <workflow>
      <app-path>${nameNode}/user/${coord:user()}/examples/apps/subwf-1</app-path>
      <configuration>
        <property>
          <name>jobTracker</name>
          <value>${jobTracker}</value>
        </property>
        <property>
          <name>nameNode</name>
          <value>${nameNode}</value>
        </property>
        <property>
          <name>queueName</name>
          <value>${queueName}</value>
        </property>
        <property>
          <name>examplesRoot</name>
          <value>${examplesRoot}</value>
        </property>
        <property>
          <name>inputDir</name>
          <value>${coord:dataIn('input')}</value>
        </property>
        <property>
          <name>outputDir</name>
          <value>${coord:dataOut('output')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>

-- On Mon, Jun 13, 2011 at 3:01 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote: If you know for certain that it needs to be split into multiple work units I would suggest looking into Oozie. Easy to install, light weight, low learning curve... for my purposes it's been very helpful so far. I am also fairly certain you can chain multiple job confs into the same run, but I have not actually tried that, so I can't promise it is easy or possible.
http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-oozie/ If you are not running CDH3u0 then you can also get the tarball and documentation directly here: https://ccp.cloudera.com/display/SUPPORT/CDH3+Downloadable+Tarballs Matt -----Original Message----- From: Marcos Ortiz [mailto:mlor...@uci.cu] Sent: Monday, June 13, 2011 4:57 PM To: mapreduce-user@hadoop.apache.org Cc: Arko Provo Mukherjee Subject: Re: Programming Multiple rounds of mapreduce Well, you can define a job for each round and then define the running workflow based on your implementation, chaining your jobs. On 6/13/2011 5:46 PM, Arko Provo Mukherjee wrote: Hello, I am trying to write a program where I need to write multiple rounds of map and reduce. The output of the last round of map-reduce must be fed into the input of the next round. Can anyone please guide me to any link / material that can teach me how I can achieve this. Thanks a lot in advance! Thanks & regards, Arko -- Marcos Luís Ortíz Valmaseda Software Engineer (UCI) http://marcosluis2186.posterous.com http://twitter.com/marcosluis2186
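Marcos's alternative, chaining the rounds from a single driver, might look like the sketch below: each round's output directory becomes the next round's input. This assumes Hadoop's new (org.apache.hadoop.mapreduce) API on the classpath; the class names Driver, RoundMapper and RoundReducer, the paths and numRounds are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = new Configuration();
Path input = new Path("/data/initial");           // initial data, as above
for (int round = 1; round <= numRounds; round++) {
    Path output = new Path("/data/round-" + round);
    Job job = new Job(conf, "round-" + round);
    job.setJarByClass(Driver.class);
    job.setMapperClass(RoundMapper.class);
    job.setReducerClass(RoundReducer.class);
    FileInputFormat.addInputPath(job, input);
    FileOutputFormat.setOutputPath(job, output);
    if (!job.waitForCompletion(true)) {
        break;                    // stop the chain if a round fails
    }
    input = output;               // feed this round's output forward
}
```

The Oozie coordinator shown earlier in the thread does the same chaining declaratively and survives driver-machine failures, which is why it was the first suggestion.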
Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
John, Do you have all the JARs used by your classes in Needed.jar in the DC classpath as well? Are you propagating the delegation token? Thxs. Alejandro On Wed, Jun 1, 2011 at 12:38 PM, John Armstrong john.armstr...@ccri.com wrote: On Tue, 31 May 2011 15:09:28 -0400, John Armstrong john.armstr...@ccri.com wrote: On Tue, 31 May 2011 12:02:28 -0700, Alejandro Abdelnur t...@cloudera.com wrote: What is exactly that does not work? In the hopes that more information can help, I've dug into the local filesystems on each of my four nodes and retrieved the job.xml and the locations of the files to show that everything shows up where it should. In this example I have one regular file (hdfs://node1:hdfsport/hdfs/path/to/file1.foo) added with DistributedCache.addCacheFile(). I also have a JAR (hdfs://node1:hdfsport/hdfs/path/to/needed.jar) added with DistributedCache.addFileToClassPath(). The needed JAR is also part of the classpath Oozie provides to my Java task. As you can see, both files (with correct filesizes and timestamps) are listed as cache files in job.xml, and the JAR is listed as a classpath file. Both files show up on each node; the JAR shows up twice on node 1 since that's where Oozie ran the Java task, and thus where Oozie placed the JAR with its own use of the distributed cache. And yet, when mapreduce actually tries to run the job my Java task launches, it immediately hits a ClassNotFoundException, claiming it can't find the class my.class.package.Needed which is contained in needed.jar.

JOB.XML
...
<property>
  <!--Loaded from Unknown-->
  <name>mapred.job.classpath.files</name>
  <value>hdfs://node1:hdfsport/hdfs/path/to/needed.jar</value>
</property>
...
<property>
  <!--Loaded from Unknown-->
  <name>mapred.cache.files</name>
  <value>hdfs://node1:hdfsport/hdfs/path/to/file1.foo,hdfs://node1:hdfsport/hdfs/path/to/needed.jar</value>
</property>
...
<property>
  <!--Loaded from Unknown-->
  <name>mapred.cache.files.filesizes</name>
  <value>61175,2257057</value>
</property>
...
<property>
  <!--Loaded from Unknown-->
  <name>mapred.cache.files.timestamps</name>
  <value>1306949104866,1306949371660</value>
</property>
...

NODE 1 LOCAL FILESYSTEM
/data/4/mapred/local/taskTracker/distcache/5181540010607464671_-132008737_1279047490/node1/hdfs/path/to/file1.foo
/data/1/mapred/local/taskTracker/distcache/6423795395825083633_-1942178119_1279314284/node1/hdfs/path/to/needed.jar
/data/3/mapred/local/taskTracker/distcache/2424191142954514770_1281905983_1269665052/node1/hdfs/path/to/needed.jar

NODE 2 LOCAL FILESYSTEM
/data/1/mapred/local/taskTracker/distcache/-1458632814086969626_-132008737_1279047490/node1/hdfs/path/to/file1.foo
/data/2/mapred/local/taskTracker/distcache/4434671176913378591_-1942178119_1279314284/node1/hdfs/path/to/needed.jar

NODE 3 LOCAL FILESYSTEM
/data/1/mapred/local/taskTracker/distcache/-6763452370915390695_-132008737_1279047490/node1/hdfs/path/to/file1.foo
/data/2/mapred/local/taskTracker/distcache/683838159704655_-1942178119_1279314284/node1/hdfs/path/to/needed.jar

NODE 4 LOCAL FILESYSTEM
/data/1/mapred/local/taskTracker/distcache/-1759547009148985681_-132008737_1279047490/node1/hdfs/path/to/file1.foo
/data/2/mapred/local/taskTracker/distcache/1998811135309473771_-1942178119_1279314284/node1/hdfs/path/to/needed.jar

SAMPLE MAPPER ATTEMPT LOG
2011-06-01 14:21:41,442 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2011-06-01 14:21:41,557 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /data/2/mapred/local/taskTracker/hdfs/jobcache/job_201106011430_0002/jars/job.jar - /data/2/mapred/local/taskTracker/hdfs/jobcache/job_201106011430_0002/attempt_201106011430_0002_m_09_0/work/./job.jar
2011-06-01 14:21:41,560 INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating symlink: /data/2/mapred/local/taskTracker/hdfs/jobcache/job_201106011430_0002/jars/.job.jar.crc -
/data/2/mapred/local/taskTracker/hdfs/jobcache/job_201106011430_0002/attempt_201106011430_0002_m_09_0/work/./.job.jar.crc
2011-06-01 14:21:41,563 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2011-06-01 14:21:41,660 WARN org.apache.hadoop.mapred.Child: Error running child
java.lang.RuntimeException: java.lang.ClassNotFoundException: my.class.package.Needed
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:973)
    at org.apache.hadoop.mapreduce.JobContext.getOutputFormatClass(JobContext.java:236)
    at org.apache.hadoop.mapred.Task.initialize(Task.java:484)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:298)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396
Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
John, Now I get what you are trying to do. My recommendation would be: * Use a Java action to do all the stuff prior to starting your MR job * Use a mapreduce action to start your MR job * If you need to propagate properties from the Java action to the MR action you can use the capture-output flag. If you still want to start your MR job from your Java action, then your Java action should do all the setup the MapReduceMain class does before starting the MR job (this will ensure the delegation tokens and the distributed cache are available to your MR job). Thanks. Alejandro On Mon, May 30, 2011 at 6:34 AM, John Armstrong john.armstr...@ccri.com wrote: On Fri, 27 May 2011 15:47:23 -0700, Alejandro Abdelnur t...@cloudera.com wrote: John, If you are using Oozie, dropping all the JARs your MR jobs need in the Oozie WF lib/ directory should suffice. Oozie will make sure all those JARs are in the distributed cache. That doesn't seem to work. I have this JAR in the WF /lib/ directory because the Java job that launches the MR job needs it. And yes, it's in the distributed cache for the wrapper MR job that Oozie uses to remotely run the Java job. The problem is it's not available for the MR job that the Java job launches. Thanks, though, for the suggestion.
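The layout recommended above can be sketched as an Oozie workflow. This is a hedged illustration, not code from the thread: the node names, main class, and the propagated property key are invented.

```xml
<!-- Sketch: a Java action does the setup and hands one property to a
     map-reduce action via capture-output. All names here are illustrative. -->
<workflow-app name="setup-then-mr" xmlns="uri:oozie:workflow:0.2">
  <start to="setup"/>
  <action name="setup">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <main-class>com.example.SetupMain</main-class>
      <capture-output/>
    </java>
    <ok to="mr-job"/>
    <error to="fail"/>
  </action>
  <action name="mr-job">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>some.prop</name>
          <!-- value produced by the Java action above -->
          <value>${wf:actionData('setup')['some.prop']}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail"><message>Job failed</message></kill>
  <end name="end"/>
</workflow-app>
```

With capture-output, the Java action writes the properties it wants to hand off to the file named by the oozie.action.output.properties system property, and wf:actionData() exposes them to later actions.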
Re: MultipleOutputFormat
Dmitriy, Have you checked MultipleOutputs instead? It provides similar functionality. Alejandro On Wed, Mar 30, 2011 at 11:39 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: Hi, I can't seem to be able to find either a jira or an implementation of MultipleOutputFormat in the new api in either the 0.21 or 0.22 branches. Are there any plans to port that to the new api as well? thanks in advance. -Dmitriy
Re: MultipleOutputFormat
You should be able to create partitions on the fly. Check the last example in the javadocs: http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html ... mos.write(key, new Text(value), generateFileName(key, new Text(value))); Hope this helps. Alejandro On Wed, Mar 30, 2011 at 12:02 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: yes.. but in my old code the file names are created on the fly (it basically creates partitions based on a time field). I don't think MultipleOutputs is suitable to create partitions on the fly. On Tue, Mar 29, 2011 at 8:56 PM, Alejandro Abdelnur t...@cloudera.com wrote: Dmitriy, Have you checked MultipleOutputs instead? It provides similar functionality. Alejandro On Wed, Mar 30, 2011 at 11:39 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: Hi, I can't seem to be able to find either a jira or an implementation of MultipleOutputFormat in the new api in either the 0.21 or 0.22 branches. Are there any plans to port that to the new api as well? thanks in advance. -Dmitriy
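For the time-based case described above, the third argument to mos.write() can be computed per record. A minimal sketch of such a helper; the partition scheme, class and method names are invented for illustration:

```java
// Hypothetical helper for MultipleOutputs-based partitioning by a time field.
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class PartitionNames {

    // Map a record's timestamp (millis since the epoch) to a relative
    // output path such as "1970/01/01/part"; UTC keeps it deterministic.
    static String generateFileName(long timestampMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy/MM/dd/'part'");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(timestampMillis));
    }

    public static void main(String[] args) {
        // epoch 0 falls on 1970-01-01 UTC
        System.out.println(generateFileName(0L)); // prints 1970/01/01/part
    }
}
```

In the reducer you would then call mos.write(key, value, generateFileName(ts)); MultipleOutputs creates each nested output file on demand, which is exactly the on-the-fly behavior asked about.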
Re: how to use hadoop apis with cloudera distribution ?
If you write your code within a Maven project (which you can open from Eclipse) then you should add the following to your pom.xml:

* Define the Cloudera repository:

<repository>
  <id>cdh.repo</id>
  <url>https://repository.cloudera.com/content/groups/cloudera-repos/</url>
  <name>Cloudera Repositories</name>
  <snapshots>
    <enabled>false</enabled>
  </snapshots>
</repository>

* Add the Hadoop dependency:

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>0.20.2-cdh3B4</version>
</dependency>

You can browse the Cloudera Maven repositories (for the available JARs and versions) at: https://repository.cloudera.com/ You'll find Hadoop/Pig/Hive/Sqoop/Zookeeper/Hbase/Flume/Oozie JARs (with all the *cdh3B4 versions of them working with each other). Then your app should run with the Cloudera releases of Hadoop/...etc. Hope this helps. Thanks. Alejandro On Wed, Mar 9, 2011 at 6:03 AM, Marcos Ortiz Valmaseda mlor...@uci.cu wrote: You can check the Cloudera Training Videos, where there is a screencast explaining how to develop for Hadoop using Eclipse. http://www.cloudera.com/presentations http://vimeo.com/cloudera Now, for working with the Hadoop APIs using Eclipse, for developing applications based on Hadoop, you can use the Karmasphere Plugin for Hadoop Development, or if you are a NetBeans user, they have a module for that. Regards. ----- Original Message ----- From: Mapred Learn mapred.le...@gmail.com To: Marcos Ortiz mlor...@uci.cu CC: mapreduce-user@hadoop.apache.org Sent: Tuesday, March 8, 2011 12:26:00 (GMT-0500) Auto-Detected Subject: Re: how to use hadoop apis with cloudera distribution ? Thanks Marcos! I was trying to use CDH3 with Eclipse and could not figure out why Eclipse complains about the import statements for the hadoop apis when cloudera already includes them. I did not understand how CDH3 works with Eclipse; does it download the hadoop apis when we add svn urls?
On Tue, Mar 8, 2011 at 7:22 AM, Marcos Ortiz mlor...@uci.cu wrote: On Tue, 2011-03-08 at 07:16 -0800, Mapred Learn wrote: Hi, I downloaded the CDH3 VM for hadoop but if I want to use something like: import org.apache.hadoop.conf.Configuration; in my java code, what else do I need to do ? You can see all the tutorials that Cloudera has on its site: http://www.cloudera.com/presentations http://www.cloudera.com/info/training http://www.cloudera.com/developers/learn-hadoop/ You can also check the CDH3 Official Documentation and the latest news about the new release: http://docs.cloudera.com http://www.cloudera.com/blog/category/cdh/ Do i need to download hadoop from apache ? No, CDH3 beta comes with all the required tools to work with Hadoop, and even more applications like HUE, Oozie, Zookeeper, Pig, Hive, Chukwa, HBase, Flume, etc. if yes, then what does cdh3 do ? The Cloudera colleagues have done excellent work packaging the most used applications with Hadoop on a single virtual machine for testing, making for an easier approach to using Hadoop. They have Red Hat and Ubuntu/Debian compatible packages to make the installation, configuration and use of Hadoop on these operating systems easier. Please read http://docs.cloudera.com if not, then where can i find hadoop code on cdh VM ? I am using above line in my java code in eclipse and eclipse is not able to find it. Did you set JAVA_HOME and HADOOP_HOME on your system? If you have any doubt with this, you can check the excellent DZone refcards, Getting Started with Hadoop and Deploying Hadoop, written by Eugene Ciurana (http://eugeneciurana.eu), VP of Technology at Badoo.com Regards, and I hope that this information could be useful for you.
-- Marcos Luís Ortíz Valmaseda Software Engineer Centro de Tecnologías de Gestión de Datos (DATEC) Universidad de las Ciencias Informáticas http://uncubanitolinuxero.blogspot.com http://www.linkedin.com/in/marcosluis2186
Re: EXT :Re: Problem running a Hadoop program with external libraries
Why don't you put your native library in HDFS and use the DistributedCache to make it available to the tasks? For example: copy 'foo.so' to 'hdfs://localhost:8020/tmp/foo.so', then add it to the job's distributed cache: DistributedCache.addCacheFile(new URI("hdfs://localhost:8020/tmp/foo.so#foo.so"), jobConf); DistributedCache.createSymlink(jobConf); Note that the '#foo.so' will create a softlink in the task running dir. And the task running dir is in the LD_LIBRARY_PATH of your task. Alejandro On Sat, Mar 5, 2011 at 7:19 AM, Lance Norskog goks...@gmail.com wrote: I have never heard of putting a native code shared library in a Java jar. I doubt that it works. But it's a cool idea! A Unix binary program loads shared libraries from the paths given in the environment variable LD_LIBRARY_PATH. This has to be set to the directory with the OpenCV .so file when you start Java. Lance On Mar 4, 2011, at 2:13 PM, Brian Bockelman wrote: Hi, Check your kernel's overcommit settings. This will prevent the JVM from allocating memory even when there's free RAM. Brian On Mar 4, 2011, at 3:55 PM, Ratner, Alan S (IS) wrote: Aaron, Thanks for the rapid responses. * ulimit -u unlimited is in .bashrc. * HADOOP_HEAPSIZE is set to 4000 MB in hadoop-env.sh * mapred.child.ulimit is set to 2048000 in mapred-site.xml * mapred.child.java.opts is set to -Xmx1536m in mapred-site.xml I take it you are suggesting that I change the java.opts setting to: mapred.child.java.opts = <value>-Xmx1536m -Djava.library.path=/path/to/native/libs</value> Alan Ratner Northrop Grumman Information Systems Manager of Large-Scale Computing 9020 Junction Drive Annapolis Junction, MD 20701 (410) 707-8605 (cell) From: Aaron Kimball [mailto:akimbal...@gmail.com] Sent: Friday, March 04, 2011 4:30 PM To: common-user@hadoop.apache.org Cc: Ratner, Alan S (IS) Subject: EXT :Re: Problem running a Hadoop program with external libraries Actually, I just misread your email and missed the difference between your 2nd and 3rd attempts.
Are you enforcing min/max JVM heap sizes on your tasks? Are you enforcing a ulimit (either through your shell configuration, or through Hadoop itself)? I don't know where these cannot allocate memory errors are coming from. If they're from the OS, could it be because it needs to fork() and momentarily exceed the ulimit before loading the native libs? - Aaron On Fri, Mar 4, 2011 at 1:26 PM, Aaron Kimball akimbal...@gmail.com wrote: I don't know if putting native-code .so files inside a jar works. A native-code .so is not classloaded in the same way .class files are. So the correct .so files probably need to exist in some physical directory on the worker machines. You may want to double-check that the correct directory on the worker machines is identified in the JVM property 'java.library.path' (instead of / in addition to $LD_LIBRARY_PATH). This can be manipulated in the Hadoop configuration setting mapred.child.java.opts (include '-Djava.library.path=/path/to/native/libs' in the string there.) Also, if you added your .so files to a directory that is already used by the tasktracker (like hadoop-0.21.0/lib/native/Linux-amd64-64/), you may need to restart the tasktracker instance for it to take effect. (This is true of .jar files in the $HADOOP_HOME/lib directory; I don't know if it is true for native libs as well.) - Aaron On Fri, Mar 4, 2011 at 12:53 PM, Ratner, Alan S (IS) alan.rat...@ngc.com wrote: We are having difficulties running a Hadoop program making calls to external libraries - but this occurs only when we run the program on our cluster and not from within Eclipse where we are apparently running in Hadoop's standalone mode. This program invokes the Open Computer Vision libraries (OpenCV and JavaCV). (I don't think there is a problem with our cluster - we've run many Hadoop jobs on it without difficulty.) 1.
I normally use Eclipse to create jar files for our Hadoop programs but I inadvertently hit the run as Java application button and the program ran fine, reading the input file from the eclipse workspace rather than HDFS and writing the output file to the same place. Hadoop's output appears below. (This occurred on the master Hadoop server.) 2. I then exported from Eclipse a runnable jar which extracted required libraries into the generated jar - presumably producing a jar file that incorporated all the required library functions. (The plain jar file for this program is 17 kB while the runnable jar is 30MB.) When I try to run this on my Hadoop cluster (including my master and slave servers) the program reports that it is unable to locate libopencv_highgui.so.2.2: cannot open shared object file: No such file or directory. Now, in addition to this library being incorporated inside
Re: use DistributedCache to add many files to class path
Lei Liu, You have a cut-and-paste error: the second addition should use 'tairJarPath' but it is using 'jeJarPath'. Hope this helps. Alejandro On Thu, Feb 17, 2011 at 11:50 AM, lei liu liulei...@gmail.com wrote: I use DistributedCache to add two files to the class path, for example the code below: String jeJarPath = "/group/aladdin/lib/je-4.1.7.jar"; DistributedCache.addFileToClassPath(new Path(jeJarPath), conf); String tairJarPath = "/group/aladdin/lib/tair-aladdin-2.3.1.jar"; DistributedCache.addFileToClassPath(new Path(jeJarPath), conf); when the map/reduce is executing, the /group/aladdin/lib/tair-aladdin-2.3.1.jar file is added to the class path, but the /group/aladdin/lib/je-4.1.7.jar file is not added to the class path. How can I add many files to the class path? Thanks, LiuLei
Re: Accessing Hadoop using Kerberos
Hadoop's UserGroupInformation class looks into the OS Kerberos cache for the user to use. I guess you'd have to do the Kerberos authentication making sure the Kerberos cache is used and then do a UserGroupInformation.loginUser(). And after doing that the Hadoop libraries should work. Alejandro On Thu, Jan 13, 2011 at 2:19 PM, Muruga Prabu M murugapra...@gmail.com wrote: Yeah, kinit is working fine. My scenario is as follows. I have a java program written by me to upload and download files to the HDFS cluster. I want to perform Kerberos authentication from the Java code itself. I have acquired the TicketGrantingTicket from the Authentication Server and a Service Ticket from the Ticket Granting Server. Now how do I authenticate myself with hadoop by sending the service ticket received from the Ticket Granting Server? Regards, Pikini On Wed, Jan 12, 2011 at 10:13 PM, Alejandro Abdelnur t...@cloudera.com wrote: If you kinit-ed successfully you are done. The hadoop libraries will do the trick of authenticating the user against Hadoop. Alejandro On Thu, Jan 13, 2011 at 12:46 PM, Muruga Prabu M murugapra...@gmail.com wrote: Hi, I have a Java program to upload and download files from the HDFS. I am using Hadoop with Kerberos. I am able to get a TGT (from the Authentication Server) and a service ticket (from the Ticket Granting Server) using kerberos successfully. But after that I don't know how to authenticate myself with Hadoop. Can anyone please help me with how to do it? Regards, Pikini
Re: how to run jobs every 30 minutes?
Ed, Actually Oozie is quite different from Cascading. * Cascading allows you to write 'queries' using a Java API and they get translated into MR jobs. * Oozie allows you to compose sequences of MR/Pig/Hive/Java/SSH jobs in a DAG (workflow jobs) and has timer+data dependency triggers (coordinator jobs). Regards. Alejandro On Tue, Dec 14, 2010 at 1:26 PM, edward choi mp2...@gmail.com wrote: Thanks for the tip. I took a look at it. Looks similar to Cascading I guess...? Anyway thanks for the info!! Ed 2010/12/8 Alejandro Abdelnur t...@cloudera.com Or, if you want to do it in a reliable way you could use an Oozie coordinator job. On Wed, Dec 8, 2010 at 1:53 PM, edward choi mp2...@gmail.com wrote: My mistake. Come to think about it, you are right, I can just make an infinite loop inside the Hadoop application. Thanks for the reply. 2010/12/7 Harsh J qwertyman...@gmail.com Hi, On Tue, Dec 7, 2010 at 2:25 PM, edward choi mp2...@gmail.com wrote: Hi, I'm planning to crawl a certain web site every 30 minutes. How would I get it done in Hadoop? In pure Java, I used the Thread.sleep() method, but I guess this won't work in Hadoop. Why wouldn't it? You need to manage your post-job logic mostly, but sleep and resubmission should work just fine. Or if it could work, could anyone show me an example? Ed. -- Harsh J www.harshj.com
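The coordinator-job route mentioned above can be sketched like this; the app name, paths, and dates are invented for illustration, and frequency is expressed in minutes:

```xml
<!-- Sketch of an Oozie coordinator that triggers a workflow every 30 minutes.
     All names, paths and dates are hypothetical. -->
<coordinator-app name="crawl-every-30min" frequency="30"
                 start="2010-12-15T00:00Z" end="2011-12-15T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
  <action>
    <workflow>
      <app-path>hdfs://namenode:8020/user/ed/apps/crawl-wf</app-path>
    </workflow>
  </action>
</coordinator-app>
```

The workflow at app-path would contain the actual crawl job; Oozie then handles the every-30-minutes scheduling, retries, and catch-up instead of a hand-rolled sleep loop.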
Re: how to run jobs every 30 minutes?
Or, if you want to do it in a reliable way you could use an Oozie coordinator job. On Wed, Dec 8, 2010 at 1:53 PM, edward choi mp2...@gmail.com wrote: My mistake. Come to think about it, you are right, I can just make an infinite loop inside the Hadoop application. Thanks for the reply. 2010/12/7 Harsh J qwertyman...@gmail.com Hi, On Tue, Dec 7, 2010 at 2:25 PM, edward choi mp2...@gmail.com wrote: Hi, I'm planning to crawl a certain web site every 30 minutes. How would I get it done in Hadoop? In pure Java, I used Thread.sleep() method, but I guess this won't work in Hadoop. Why wouldn't it? You need to manage your post-job logic mostly, but sleep and resubmission should work just fine. Or if it could work, could anyone show me an example? Ed. -- Harsh J www.harshj.com
Re: HDFS Rsync process??
The other approach, if the DR cluster is idle or has enough excess capacity, would be running all the jobs on the input data in both clusters and performing checksums on the outputs to ensure everything is consistent. And you could take advantage of this and distribute ad hoc queries between the 2 clusters. Alejandro On Tue, Nov 30, 2010 at 6:51 PM, Steve Loughran ste...@apache.org wrote: On 30/11/10 03:59, hadoopman wrote: We have two Hadoop clusters in two separate buildings. Both clusters are loading the same data from the same sources (the second cluster is for DR). We're looking at how we can recover the primary cluster and catch it back up again as new data will continue to feed into the DR cluster. It's been suggested we use rsync across the network, however my concern is that the amount of data we would have to copy over would take several days (at a minimum) to sync them even with our dual bonded 1 gig network cards. I'm curious if anyone has come up with a solution short of just loading the source logs into HDFS. Is there a way to even rsync two clusters and get them in sync? Been googling around. Haven't found anything of substance yet. you don't need all the files in the cluster in sync as a lot of them are intermediate and transient files. Instead use distcp to copy source files to the two clusters; this runs across the machines in the cluster and is also designed to work across hadoop versions, with some limitations.
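The output-checksum idea boils down to hashing both clusters' outputs and comparing. A local sketch with plain files standing in for the two clusters' job outputs (on a real cluster you would first stream the outputs out, e.g. with hadoop fs -cat, before hashing; the file names here are invented):

```shell
# Two stand-in output files, one per cluster (hypothetical data).
printf 'a\nb\nc\n' > out_primary.txt
printf 'a\nb\nc\n' > out_dr.txt

# Hash each output and compare the digests.
sum_primary=$(md5sum out_primary.txt | cut -d' ' -f1)
sum_dr=$(md5sum out_dr.txt | cut -d' ' -f1)
if [ "$sum_primary" = "$sum_dr" ]; then
  echo "outputs consistent"
else
  echo "outputs differ"
fi
```

A mismatch flags a run to investigate, which is cheaper than keeping every HDFS file rsynced between the buildings.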
Re: Too large class path for map reduce jobs
well, if the issue is a too-long classpath, the softlink thingy will give some room to breathe as the total CP length will be much smaller. A On Thu, Oct 7, 2010 at 3:43 PM, Henning Blohm henning.bl...@zfabrik.de wrote: So that's actually another issue, right? Besides splitting the classpath into those three groups, you want the TT to create soft-links on demand to simplify the computation of the classpath string. Is that right? But it's the TT that actually starts the job VM. Why does it matter what the string actually looks like, as long as it has the right content? Thanks, Henning On Thu, 2010-10-07 at 13:22 +0800, Alejandro Abdelnur wrote: [sent too soon] The first CP shown is how the CP of a task is today. If we change it to pick up all the job JARs from the current dir, then the classpath will be much shorter (second CP shown). We can easily achieve this by soft-linking the job JARs in the work dir of the task. Alejandro On Thu, Oct 7, 2010 at 1:02 PM, Alejandro Abdelnur t...@cloudera.com wrote: Fragmentation of Hadoop classpaths is another issue: hadoop should differentiate the CP in 3: 1* client CP: what is needed to submit a job (only the nachos) 2* server CP (JT/NN/TT/DD): what is needed to run the cluster (the whole enchilada) 3* job CP: what is needed to run a job (some of the enchilada) But I'm not trying to get into that here. What I'm suggesting is: - # Hadoop JARs: /Users/tucu/dev-apps/hadoop/conf /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/lib/tools.jar /Users/tucu/dev-apps/hadoop/bin/.. /Users/tucu/dev-apps/hadoop/bin/../hadoop-core-0.20.3-CDH3-SNAPSHOT.jar /Users/tucu/dev-apps/hadoop/bin/../lib/aspectjrt-1.6.5.jar .
(about 30 jars from hadoop lib/ ) /Users/tucu/dev-apps/hadoop/bin/../lib/jsp-2.1/jsp-api-2.1.jar # Job JARs (for a job with only 2 JARs): /Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/distcache/-2707763075630339038_639898034_1993697040/localhost/user/tucu/oozie-tucu/003-101004184132247-oozie-tucu-W/java-node--java/java-launcher.jar /Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/distcache/3613772770922728555_-588832047_1993624983/localhost/user/tucu/examples/apps/java-main/lib/oozie-examples-2.2.1-CDH3B3-SNAPSHOT.jar /Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/tucu/jobcache/job_201010041326_0058/attempt_201010041326_0058_m_00_0/work - What I'm suggesting is that the latter group, the job JARs, be soft-linked (by the TT) into the working directory; then their classpath is just: - java-launcher.jar oozie-examples-2.2.1-CDH3B3-SNAPSHOT.jar . - Alejandro On Wed, Oct 6, 2010 at 7:57 PM, Henning Blohm henning.bl...@zfabrik.de wrote: Hi Alejandro, yes, it can of course be done right (sorry if my wording seemed to imply otherwise). Just saying that I think that Hadoop M/R should not go into that class loader / module separation business. It's one Job, one VM, right? So the problem is to assign just the stuff needed to let the Job do its business without becoming an obstacle. Must admit I didn't understand your proposal 2. How would that remove (e.g.) jetty libs from the job's classpath? Thanks, Henning Am Mittwoch, den 06.10.2010, 18:28 +0800 schrieb Alejandro Abdelnur: 1. The classloader business can be done right. Actually it could be done as spec-ed for servlet web-apps. 2. If the issue is strictly 'too large classpath', then a simpler solution would be to soft-link all JARs to the current directory and create the classpath with the JAR names only (no path). Note that the soft-linking business is already supported by the DistributedCache. So the changes would be mostly in the TT to create the JAR-names-only classpath before starting the child.
Alejandro On Wed, Oct 6, 2010 at 5:57 PM, Henning Blohm henning.bl...@zfabrik.de wrote: Hi Tom, that's exactly it. Thanks! I don't think that I can comment on the issues in Jira so I will do it here. Tricking with class paths and deviating from the default class loading delegation has never been anything but a short-term relief. Fixing things by imposing a better order of stuff on the class path will not work when people actually use child loaders (as the parent wins) - like we do. Also it may easily lead to very confusing situations because the former part of the class path is not complete and gets other stuff from a latter part, etc. etc. - no good. Child loaders are good for module separation but should not be used to hide type visibility from the parent. Almost certainly leading to a Class Loader Constraint Violation - once you lose control (which is usually earlier than expected). The suggestion to reduce the Job class path to the required minimum is the most practical approach. There is some gray area there of course
Re: how to set different VM parameters for mappers and reducers?
Never used those myself, always used the global one, but I knew they were there. Which Hadoop API are you using, the old one or the new one? Alejandro On Fri, Oct 8, 2010 at 3:50 AM, Vitaliy Semochkin vitaliy...@gmail.com wrote: Hi, I tried using mapred.map.child.java.opts and mapred.reduce.child.java.opts but it looks like hadoop-0.20.2 ignores them. On which version have you seen them working? Regards, Vitaliy S On Tue, Oct 5, 2010 at 5:14 PM, Alejandro Abdelnur t...@cloudera.com wrote: The following 2 properties should work: mapred.map.child.java.opts mapred.reduce.child.java.opts Alejandro On Tue, Oct 5, 2010 at 9:02 PM, Michael Segel michael_se...@hotmail.com wrote: Hi, You don't say which version of Hadoop you are using. Going from memory, I believe in the CDH3 release from Cloudera there are some specific OPTs you can set in hadoop-env.sh. HTH -Mike Date: Tue, 5 Oct 2010 16:59:35 +0400 Subject: how to set different VM parameters for mappers and reducers? From: vitaliy...@gmail.com To: common-user@hadoop.apache.org Hello, I have mappers that do not need much ram but combiners and reducers need a lot. Is it possible to set different VM parameters for mappers and reducers? PS Often I face an interesting problem: on the same set of data I receive a java.lang.OutOfMemoryError: Java heap space in the combiner, but it happens not all the time. What could be the cause of such behavior? My personal opinion is that I have mapred.job.reuse.jvm.num.tasks=-1 and the jvm GC doesn't always start when it should. Thanks in Advance, Vitaliy S
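On releases that honor the per-type properties discussed above, the configuration would look something like this (heap sizes are illustrative only):

```xml
<!-- mapred-site.xml (or a per-job configuration); values are made up -->
<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx512m</value>
</property>
<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>
```

The per-type settings take the place of the single global mapred.child.java.opts, which is the right fit for the light-mappers/heavy-reducers case in the question.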
Re: Too large class path for map reduce jobs
1. The classloader business can be done right. Actually it could be done as spec-ed for servlet web-apps. 2. If the issue is strictly 'too large classpath', then a simpler solution would be to soft-link all JARs to the current directory and create the classpath with the JAR names only (no path). Note that the soft-linking business is already supported by the DistributedCache. So the changes would be mostly in the TT to create the JAR-names-only classpath before starting the child. Alejandro On Wed, Oct 6, 2010 at 5:57 PM, Henning Blohm henning.bl...@zfabrik.de wrote: Hi Tom, that's exactly it. Thanks! I don't think that I can comment on the issues in Jira so I will do it here. Tricking with class paths and deviating from the default class loading delegation has never been anything but a short-term relief. Fixing things by imposing a better order of stuff on the class path will not work when people actually use child loaders (as the parent wins) - like we do. Also it may easily lead to very confusing situations because the former part of the class path is not complete and gets other stuff from a latter part, etc. etc. - no good. Child loaders are good for module separation but should not be used to hide type visibility from the parent. Almost certainly leading to a Class Loader Constraint Violation - once you lose control (which is usually earlier than expected). The suggestion to reduce the Job class path to the required minimum is the most practical approach. There is some gray area there of course and it will not be feasible to reach the absolute minimal set of types there - but something reasonable, i.e. the hadoop core that suffices to run the job. Certainly jetty & co are not required for job execution (btw. I hacked 0.20.2 to remove anything in server/ from the classpath before setting the job class path). I would suggest to a) introduce some HADOOP_JOB_CLASSPATH var that, if set, is the additional classpath, added to the core classpath (as described above).
If not set, for compatibility, preserve today's behavior. b) not getting into custom child loaders for jobs as part of hadoop M/R. It's non-trivial to get it right and feels to be beyond scope. I wouldn't mind helping btw. Thanks, Henning On Tue, 2010-10-05 at 15:59 -0700, Tom White wrote: Hi Henning, I don't know if you've seen https://issues.apache.org/jira/browse/MAPREDUCE-1938 and https://issues.apache.org/jira/browse/MAPREDUCE-1700 which have discussion about this issue. Cheers Tom On Fri, Sep 24, 2010 at 3:41 AM, Henning Blohm henning.bl...@zfabrik.de wrote: Short update on the issue: I tried to find a way to separate class path configurations by modifying the scripts in HADOOP_HOME/bin but found that TaskRunner actually copies the class path setting from the parent process when starting a local task, so that I do not see a way of having less on a job's classpath without modifying Hadoop. As that will present a real issue when running our jobs on Hadoop, I would like to propose to change TaskRunner so that it sets a class path specifically for M/R tasks. That class path could be defined in the scripts (as for the other processes) using a particular environment variable (e.g. HADOOP_JOB_CLASSPATH). It could default to the current VM's class path, preserving today's behavior. Is it ok to enter this as an issue? Thanks, Henning Am Freitag, den 17.09.2010, 16:01 +0000 schrieb Allen Wittenauer: On Sep 17, 2010, at 4:56 AM, Henning Blohm wrote: When running map reduce tasks in Hadoop I run into classpath issues. Contrary to previous posts, my problem is not that I am missing classes on the Task's class path (we have a perfect solution for that) but rather that I find too many (e.g. ECJ classes or jetty). The fact that you mention: The libs in HADOOP_HOME/lib seem to contain everything needed to run anything in Hadoop which is, I assume, much more than is needed to run a map reduce task. hints that your perfect solution is to throw all your custom stuff in lib.
If so, that's a huge mistake. Use distributed cache instead.
Re: Too large class path for map reduce jobs
[sent too soon] The first CP shown is how the CP of a task is today. If we change it to pick up all the job JARs from the current dir, then the classpath will be much shorter (second CP shown). We can easily achieve this by soft-linking the job JARs in the work dir of the task. Alejandro On Thu, Oct 7, 2010 at 1:02 PM, Alejandro Abdelnur t...@cloudera.com wrote: Fragmentation of Hadoop classpaths is another issue: hadoop should differentiate the CP in 3: 1* client CP: what is needed to submit a job (only the nachos) 2* server CP (JT/NN/TT/DD): what is needed to run the cluster (the whole enchilada) 3* job CP: what is needed to run a job (some of the enchilada) But I'm not trying to get into that here. What I'm suggesting is: - # Hadoop JARs: /Users/tucu/dev-apps/hadoop/conf /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/lib/tools.jar /Users/tucu/dev-apps/hadoop/bin/.. /Users/tucu/dev-apps/hadoop/bin/../hadoop-core-0.20.3-CDH3-SNAPSHOT.jar /Users/tucu/dev-apps/hadoop/bin/../lib/aspectjrt-1.6.5.jar . (about 30 jars from hadoop lib/ ) /Users/tucu/dev-apps/hadoop/bin/../lib/jsp-2.1/jsp-api-2.1.jar # Job JARs (for a job with only 2 JARs): /Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/distcache/-2707763075630339038_639898034_1993697040/localhost/user/tucu/oozie-tucu/003-101004184132247-oozie-tucu-W/java-node--java/java-launcher.jar /Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/distcache/3613772770922728555_-588832047_1993624983/localhost/user/tucu/examples/apps/java-main/lib/oozie-examples-2.2.1-CDH3B3-SNAPSHOT.jar /Users/tucu/dev-apps/hadoop/dirs/mapred/taskTracker/tucu/jobcache/job_201010041326_0058/attempt_201010041326_0058_m_00_0/work - What I'm suggesting is that the latter group, the job JARs, be soft-linked (by the TT) into the working directory; then their classpath is just: - java-launcher.jar oozie-examples-2.2.1-CDH3B3-SNAPSHOT.jar .
- Alejandro On Wed, Oct 6, 2010 at 7:57 PM, Henning Blohm henning.bl...@zfabrik.de wrote: Hi Alejandro, yes, it can of course be done right (sorry if my wording seemed to imply otherwise). Just saying that I think that Hadoop M/R should not go into that class loader / module separation business. It's one Job, one VM, right? So the problem is to assign just the stuff needed to let the Job do its business without becoming an obstacle. Must admit I didn't understand your proposal 2. How would that remove (e.g.) jetty libs from the job's classpath? Thanks, Henning Am Mittwoch, den 06.10.2010, 18:28 +0800 schrieb Alejandro Abdelnur: 1. The classloader business can be done right. Actually it could be done as spec-ed for servlet web-apps. 2. If the issue is strictly 'too large classpath', then a simpler solution would be to soft-link all JARs to the current directory and create the classpath with the JAR names only (no path). Note that the soft-linking business is already supported by the DistributedCache. So the changes would be mostly in the TT to create the JAR-names-only classpath before starting the child. Alejandro On Wed, Oct 6, 2010 at 5:57 PM, Henning Blohm henning.bl...@zfabrik.de wrote: Hi Tom, that's exactly it. Thanks! I don't think that I can comment on the issues in Jira so I will do it here. Tricking with class paths and deviating from the default class loading delegation has never been anything but a short-term relief. Fixing things by imposing a better order of stuff on the class path will not work when people actually use child loaders (as the parent wins) - like we do. Also it may easily lead to very confusing situations because the former part of the class path is not complete and gets other stuff from a latter part, etc. etc. - no good. Child loaders are good for module separation but should not be used to hide type visibility from the parent.
Almost certainly leading to a Class Loader Constraint Violation - once you lose control (which is usually earlier than expected). The suggestion to reduce the Job class path to the required minimum is the most practical approach. There is some gray area there of course and it will not be feasible to reach the absolute minimal set of types there - but something reasonable, i.e. the hadoop core that suffices to run the job. Certainly jetty & co are not required for job execution (btw. I hacked 0.20.2 to remove anything in server/ from the classpath before setting the job class path). I would suggest to a) introduce some HADOOP_JOB_CLASSPATH var that, if set, is the additional classpath, added to the core classpath (as described above). If not set, for compatibility, preserve today's behavior. b) not getting into custom child loaders for jobs as part of hadoop M/R. It's non-trivial to get it right and feels to be beyond scope. I wouldn't mind helping btw. Thanks, Henning On Tue, 2010-10-05 at 15:59 -0700, Tom White
Re: is there no streaming.jar file in hadoop-0.21.0??
Edward, Yep, you should use the one from contrib/ Alejandro On Tue, Oct 5, 2010 at 1:55 PM, edward choi mp2...@gmail.com wrote: Thanks, Tom. Didn't expect the author of THE BOOK would answer my question. Very surprised and honored :-) Just one more question if you don't mind. I read on the Internet that in order to use Hadoop Streaming in Hadoop-0.21.0 you should go $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar args (Of course I don't see any hadoop-streaming.jar in $HADOOP_HOME) But according to your reply I should go $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-*-streaming.jar args I suppose the latter one is the way to go? Ed. 2010/10/5 Tom White t...@cloudera.com Hi Ed, The directory structure moved around as a result of the project splitting into three subprojects (Common, HDFS, MapReduce). The streaming jar is in mapred/contrib/streaming in the distribution. Cheers, Tom On Mon, Oct 4, 2010 at 8:03 PM, edward choi mp2...@gmail.com wrote: Hi, I've recently downloaded Hadoop-0.21.0. After the installation, I've noticed that there is no contrib directory that used to exist in Hadoop-0.20.2. So I was wondering if there is no hadoop-0.21.0-streaming.jar file in Hadoop-0.21.0. Anyone had any luck finding it? If the way to use streaming has changed in Hadoop-0.21.0, then please tell me how. Appreciate the help, thx. Ed.
Re: Re: Help!! The problem about Hadoop
Or you could try using MultiFileInputFormat for your MR job. http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapred/MultiFileInputFormat.html Alejandro On Tue, Oct 5, 2010 at 4:55 PM, Harsh J qwertyman...@gmail.com wrote: 500 small files comprising one gigabyte? Perhaps you should try concatenating them all into one big file and trying that; a mapper is supposed to run for at least a minute optimally. And small files don't make good use of the HDFS block feature. Have a read: http://www.cloudera.com/blog/2009/02/the-small-files-problem/ 2010/10/5 Jander 442950...@163.com: Hi Jeff, Thank you very much for your reply. I know hadoop has overhead, but is it too large in my problem? The 1GB text input has about 500 map tasks because the input is composed of little text files. And the time each map takes is from 8 seconds to 20 seconds. I use compression like conf.setCompressMapOutput(true). Thanks, Jander At 2010-10-05 16:28:55, Jeff Zhang zjf...@gmail.com wrote: Hi Jander, Hadoop has overhead compared to a single-machine solution. How many tasks did you get when you ran your hadoop job? And how much time is consumed by each map and reduce task? There are lots of tips for performance tuning of hadoop, such as compression and JVM reuse. 2010/10/5 Jander 442950...@163.com: Hi, all I am doing an application using hadoop. I take 1GB of text data as input; the results are as follows: (1) the cluster of 3 PCs: the time consumed is 1020 seconds. (2) the cluster of 4 PCs: the time is about 680 seconds. But the application before I used Hadoop took about 280 seconds, so at the speed above, I must use 8 PCs in order to have the same speed as before. Now the problem: is this correct? Jander, Thanks. -- Best Regards Jeff Zhang -- Harsh J www.harshj.com
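The numbers in the thread make Harsh's point concrete. Rough arithmetic, assuming an even spread of 1 GB over 500 files and a 64 MB block size:

```java
// Back-of-envelope math behind the small-files advice: 1 GB over ~500
// files means ~2 MB per file, far below a 64 MB HDFS block, so each map
// task gets a tiny input and per-task startup overhead dominates.
class SmallFiles {
    static long avgFileSizeMb(long totalMb, long files) {
        return totalMb / files;
    }

    // Map tasks if the input were concatenated into one file (ceil division).
    static long mapTasksIfConcatenated(long totalMb, long blockMb) {
        return (totalMb + blockMb - 1) / blockMb;
    }
}
```

So concatenation would drop the job from ~500 map tasks of a few seconds each to 16 tasks of a reasonable duration.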
Re: how to set different VM parameters for mappers and reducers?
The following 2 properties should work: mapred.map.child.java.opts mapred.reduce.child.java.opts Alejandro On Tue, Oct 5, 2010 at 9:02 PM, Michael Segel michael_se...@hotmail.com wrote: Hi, You don't say which version of Hadoop you are using. Going from memory, I believe in the CDH3 release from Cloudera, there are some specific OPTs you can set in hadoop-env.sh. HTH -Mike Date: Tue, 5 Oct 2010 16:59:35 +0400 Subject: how to set different VM parameters for mappers and reducers? From: vitaliy...@gmail.com To: common-user@hadoop.apache.org Hello, I have mappers that do not need much RAM but combiners and reducers need a lot. Is it possible to set different VM parameters for mappers and reducers? PS Often I face an interesting problem: on the same set of data I receive java.lang.OutOfMemoryError: Java heap space in the combiner, but it doesn't happen all the time. What could be the cause of such behavior? My personal opinion is that I have mapred.job.reuse.jvm.num.tasks=-1 and the JVM GC doesn't always start when it should. Thanks in Advance, Vitaliy S
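For reference, the two properties Alejandro names would be set along these lines in the job configuration (the -Xmx values are examples only, and availability depends on the Hadoop version - older releases only have the combined mapred.child.java.opts):

```xml
<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx256m</value>
</property>
<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```

Note that combiners run inside the map task's JVM, so the combiner heap issue in the question is governed by the map-side setting.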
Re: Classpath
Yes, you can do #1, but I wouldn't say it is practical. You can do #2 as well, as you suggest. But, IMO, the best way is copying the JARs into HDFS and using DistributedCache. A On Sun, Aug 29, 2010 at 1:29 PM, Mark static.void@gmail.com wrote: How can I add jars to Hadoop's classpath when running MapReduce jobs for the following situations? 1) Assuming that the jars are local to the nodes running the job. 2) The jars are only local to the client submitting the job. I'm assuming I can just jar up all required jars into the main job jar being submitted, but I was wondering if there was some other way. Thanks
Re: REST web service on top of Hadoop
In Oozie we are working on MR/Pig job submission over HTTP. On Thu, Jul 29, 2010 at 5:09 PM, Steve Loughran ste...@apache.org wrote: S. Venkatesh wrote: HDFS Proxy in contrib provides an HTTP interface over HDFS. It's not very RESTful but we are working on a new version which will have a REST API. AFAIK, Oozie will provide a REST API for launching MR jobs. I've done a webapp to do this (and some trickier things like creating the cluster when needed), some slides on it: http://www.slideshare.net/steve_l/long-haul-hadoop there are links on there to the various JIRA issues which have discussed the topic
Re: Hadoop multiple output files
Adam, Yes, you have to do some configuration in the JobConf before submitting the job. If the javadocs are not clear enough, check the testcase. A On Mon, Jun 28, 2010 at 8:42 PM, Adam Silberstein silbe...@yahoo-inc.com wrote: Hi Alejandro, Thanks for the tip. I tried the class, but got the following error: java.lang.IllegalArgumentException: Undefined named output 'text' at org.apache.hadoop.mapred.lib.MultipleOutputs.getCollector(MultipleOutputs.java:496) at org.apache.hadoop.mapred.lib.MultipleOutputs.getCollector(MultipleOutputs.java:476) at com.yahoo.shadoop.applications.SyntheticToHdfsTableAndIndex$SortReducer.reduce(Unknown Source) at com.yahoo.shadoop.applications.SyntheticToHdfsTableAndIndex$SortReducer.reduce(Unknown Source) at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463) at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411) at org.apache.hadoop.mapred.Child.main(Child.java:170) I am trying to write to an output file named 'text.' Do I need to initialize that file in some way? I tried making a directory with the name, but that didn't do anything. Thanks, Adam On 6/28/10 6:17 PM, Alejandro Abdelnur tuc...@gmail.com wrote: Check the MultipleOutputs class http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html On Mon, Jun 28, 2010 at 5:31 PM, Adam Silberstein silbe...@yahoo-inc.com wrote: Hi, I would like to run a hadoop job that writes to multiple output files. I see a class called MultipleOutputFormat that looks like what I want, but I have not been able to find any sample code showing how to use it. I see discussion of it in a JIRA, where the idea is to choose the output file based on key. If someone could point me to a sample, that would be great. Thanks, Adam
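The error in the thread is MultipleOutputs reporting that the name 'text' was never registered via addNamedOutput before job submission; creating a file or directory of that name does nothing. The registration rule can be modeled with a toy stand-in (these are illustrative classes, not Hadoop's implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of MultipleOutputs' contract: a named output must be
// declared up front in the driver; getCollector() on an undeclared
// name fails exactly like the 'Undefined named output' error above.
class NamedOutputsModel {
    private final Map<String, StringBuilder> declared = new HashMap<>();

    // ~ MultipleOutputs.addNamedOutput(conf, name, format, keyClass, valueClass)
    void addNamedOutput(String name) {
        declared.put(name, new StringBuilder());
    }

    // ~ mos.getCollector(name, reporter)
    StringBuilder getCollector(String name) {
        if (!declared.containsKey(name)) {
            throw new IllegalArgumentException("Undefined named output '" + name + "'");
        }
        return declared.get(name);
    }
}
```

So the fix is to register the 'text' output in the JobConf in the driver code, before the job runs, and only then call getCollector from the reducer.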
Re: preserve JobTracker information
Also you can configure the JobTracker to keep the RunningJob information for completed jobs (available via the Hadoop Java API). There is a config property that enables this, another that specifies the location (it can be HDFS or local), and another that specifies for how many hours you want to keep that information. HTH A On Wed, May 19, 2010 at 1:36 AM, Harsh J qwertyman...@gmail.com wrote: Preserved JobTracker history is already available at /jobhistory.jsp There is a link at the end of the /jobtracker.jsp page that leads to this. There's also free analysis to go with that! :) On Tue, May 18, 2010 at 11:00 PM, Alan Miller someb...@squareplanet.de wrote: Hi, Is there a way to preserve previous job information (Completed Jobs, Failed Jobs) when the hadoop cluster is restarted? Every time I start up my cluster (start-dfs.sh, start-mapred.sh) the JobTracker interface at http://myhost:50020/jobtracker.jsp is always empty. Thanks, Alan -- Harsh J www.harshj.com
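For reference, the three properties Alejandro describes look roughly like this (names as I recall them from 0.20-era releases; verify against your version's default configuration file, and treat the values as examples):

```xml
<property>
  <name>mapred.job.tracker.persist.jobstatus.active</name>
  <value>true</value>
</property>
<property>
  <name>mapred.job.tracker.persist.jobstatus.hours</name>
  <value>24</value>
</property>
<property>
  <name>mapred.job.tracker.persist.jobstatus.dir</name>
  <value>/jobtracker/jobsInfo</value>
</property>
```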
Re: MultipleTextOutputFormat splitting output into different directories.
Using MultipleOutputs ( http://hadoop.apache.org/common/docs/r0.19.0/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html ) you can split data into different files in the output dir. After your job finishes you can move the files to different directories. The benefit of doing this is that task failures and speculative execution will also take into account all these files and your data will be consistent. If you are writing to different directories then you have to handle this by hand. A On Tue, Sep 15, 2009 at 4:22 PM, Aviad sela sela.s...@gmail.com wrote: Is anybody interested in, or has addressed, such a problem? Or does it seem to be an esoteric usage? On Wed, Sep 9, 2009 at 7:06 PM, Aviad sela sela.s...@gmail.com wrote: I am using Hadoop 0.19.1 I attempt to split an input into multiple directories. I don't know in advance how many directories exist. I don't know in advance what the directory depth is. I expect that under each such directory a file exists with all available records having the same key permutation found in the job. If currently each reducer produces a single output, i.e. PART-0001, I would like to create as many directories as necessary, taking the following pattern: key1/key2/.../keyN/PART-0001 where the key? may have different values for each input record. A different record may result in a different path being requested: key1a/key2b/PART-0001 key1c/key2d/key3e/PART-0001 To keep it simple, during each job we may expect the same depth from each record. I assume that the input records imply that each reduce will produce several hundreds of such directories. (Indeed this strongly depends on the input record semantics). The MAP part reads a record and, following some logic, assigns a key like: KEY_A, KEY_B The MAP value is the original input line. For the reducer part I assign the IdentityReducer. However I have set:

    jobConf.setReducerClass(IdentityReducer.class);
    jobConf.setOutputFormat(MyTextOutputFormat.class);

Where MyTextOutputFormat extends MultipleTextOutputFormat, and implements:

    protected String generateFileNameForKeyValue(K key, V value, String name) {
        String[] keyParts = key.toString().split(",");
        Path finalPath = null;
        // Build the directory structure comprised of the key parts
        for (int i = 0; i < keyParts.length; i++) {
            String part = keyParts[i].trim();
            if (false == "".equals(part)) {
                if (null == finalPath) {
                    finalPath = new Path(part);
                } else {
                    finalPath = new Path(finalPath, part);
                }
            }
        } // end of for
        String fileName = generateLeafFileName(name);
        finalPath = new Path(finalPath, fileName);
        return finalPath.toString();
    } // generateFileNameForKeyValue

During execution I have seen the reduce attempts do create the following path under the output path: /user/hadoop/test_rep01/_temporary/_attempt_200909080349_0013_r_00_0/KEY_A/KEY_B/part-0 However, the file was empty. The job fails at the end with the following exceptions found in the task log: 2009-09-09 11:19:49,653 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 9.148.30.71:50010 2009-09-09 11:19:49,654 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-6138647338595590910_39383 2009-09-09 11:19:55,659 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block. at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2722) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183) 2009-09-09 11:19:55,659 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_-6138647338595590910_39383 bad datanode[1] nodes == null 2009-09-09 11:19:55,660 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations.
Source file /user/hadoop/test_rep01/_temporary/_attempt_200909080349_0013_r_02_0/KEY_A/KEY_B/part-2 - Aborting... 2009-09-09 11:19:55,686 WARN org.apache.hadoop.mapred.TaskTracker: Error running child java.io.IOException: Bad connect ack with firstBadLink 9.148.30.71:50010 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2780) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2703) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183) 2009-09-09 11:19:55,688 INFO org.apache.hadoop.mapred.TaskRunner: Runnning cleanup for
Re: Hadoop 0.19, Cascading 1.0 and MultipleOutputs problem
Mikhail, You are right, please open a Jira on this. Alejandro On Wed, Jan 28, 2009 at 9:23 PM, Mikhail Yakshin greycat.na@gmail.com wrote: Hi, We have a system based on Hadoop 0.18 / Cascading 0.8.1 and now I'm trying to port it to Hadoop 0.19 / Cascading 1.0. The first serious problem I've run into is that we're extensively using MultipleOutputs in our jobs dealing with sequence files that store Cascading's Tuples. Since Cascading 0.9, Tuples stopped being WritableComparable and implemented the generic Hadoop serialization interface and framework. However, in Hadoop 0.19, MultipleOutputs requires use of the older WritableComparable interface. Thus, trying to do something like: MultipleOutputs.addNamedOutput(conf, "output-name", MySpecialMultiSplitOutputFormat.class, Tuple.class, Tuple.class); mos = new MultipleOutputs(conf); ... mos.getCollector("output-name", reporter).collect(tuple1, tuple2); yields an error: java.lang.RuntimeException: java.lang.RuntimeException: class cascading.tuple.Tuple not org.apache.hadoop.io.WritableComparable at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:752) at org.apache.hadoop.mapred.lib.MultipleOutputs.getNamedOutputKeyClass(MultipleOutputs.java:252) at org.apache.hadoop.mapred.lib.MultipleOutputs$InternalFileOutputFormat.getRecordWriter(MultipleOutputs.java:556) at org.apache.hadoop.mapred.lib.MultipleOutputs.getRecordWriter(MultipleOutputs.java:425) at org.apache.hadoop.mapred.lib.MultipleOutputs.getCollector(MultipleOutputs.java:511) at org.apache.hadoop.mapred.lib.MultipleOutputs.getCollector(MultipleOutputs.java:476) at my.namespace.MyReducer.reduce(MyReducer.java:xxx) Is there any known workaround for that? Any progress going on to make MultipleOutputs use generic Hadoop serialization? -- WBR, Mikhail Yakshin
easy way to create IntelliJ project for Hadoop trunk?
With Ivy fetching JARs as part of the build, and several src dirs, creating an IDE project (IntelliJ in my case) becomes a pain. Any easy way of doing it? Thxs. Alejandro
Re: MultipleOutputFormat versus MultipleOutputs
Besides the usage pattern the key differences are: Different MultipleOutputs outputs can have different outputformats and key/value classes. With MultipleOutputFormat you can't. (and if I'm not mistaken) If using MultipleOutputFormat in a map you can't have a reduce phase. With MultipleOutputs you can. A On Thu, Aug 28, 2008 at 3:36 AM, Shirley Cohen [EMAIL PROTECTED] wrote: Hi, I would like the reducer to output to different files based upon the value of the key. I understand that both MultipleOutputs and MultipleOutputFormat can do this. Is that correct? However, I don't understand the differences between these two classes. Can someone explain the differences and provide an example to illustrate these differences? I found a snippet of code on how to use MultipleOutputs in the documentation, but could not find an example for using MultipleOutputFormat. Thanks in advance, Shirley
Re: MultipleOutputFormat versus MultipleOutputs
Shirley, Check the javadocs or the corresponding testcases. A On Thu, Aug 28, 2008 at 6:02 PM, Shirley Cohen [EMAIL PROTECTED] wrote: Thanks, Alejandro! Do you have an example that shows how to use MultipleOutputFormat? Shirley On Aug 28, 2008, at 2:41 AM, Alejandro Abdelnur wrote: Besides the usage pattern the key differences are: Different MultipleOutputs outputs can have different outputformats and key/value classes. With MultipleOutputFormat you can't. (and if I'm not mistaken) If using MultipleOutputFormat in a map you can't have a reduce phase. With MultipleOutputs you can. A On Thu, Aug 28, 2008 at 3:36 AM, Shirley Cohen [EMAIL PROTECTED] wrote: Hi, I would like the reducer to output to different files based upon the value of the key. I understand that both MultipleOutputs and MultipleOutputFormat can do this. Is that correct? However, I don't understand the differences between these two classes. Can someone explain the differences and provide an example to illustrate these differences? I found a snippet of code on how to use MultipleOutputs in the documentation, but could not find an example for using MultipleOutputFormat. Thanks in advance, Shirley
Re: How can I control Number of Mappers of a job?
When done, HADOOP-3387 would allow you to do that. In our implementation we can tell Hadoop the exact # of maps and it will group splits if necessary. On Fri, Aug 1, 2008 at 5:25 AM, Andreas Kostyrka [EMAIL PROTECTED] wrote: Well, the only way to reliably fix the number of map tasks that I've found is by using compressed input files, as that forces hadoop to assign one and only one file to a map task ;) Andreas On Thursday 31 July 2008 21:30:33 Gopal Gandhi wrote: Thank you, finally someone has interest in my questions =) My cluster contains more than one machine. Please don't get me wrong :-). I don't want to limit the total mappers in one node (by mapred.map.tasks). What I want is to limit the total mappers for one job. The motivation is that I have 2 jobs to run at the same time; they have the same input data in Hadoop. I found that one job has to wait until the other finishes its mapping. Because the 2 jobs are submitted by 2 different people, I don't want one job to starve. So I want to limit the first job's total mappers so that the 2 jobs will be launched simultaneously. - Original Message From: Goel, Ankur [EMAIL PROTECTED] To: core-user@hadoop.apache.org Cc: [EMAIL PROTECTED] Sent: Wednesday, July 30, 2008 10:17:53 PM Subject: RE: How can I control Number of Mappers of a job? How big is your cluster? Assuming you are running a single node cluster, hadoop-default.xml has a parameter 'mapred.map.tasks' that is set to 2. So by default, no matter how many map tasks are calculated by the framework, only 2 map tasks will execute on a single node cluster. -Original Message- From: Gopal Gandhi [mailto:[EMAIL PROTECTED] Sent: Thursday, July 31, 2008 4:38 AM To: core-user@hadoop.apache.org Cc: [EMAIL PROTECTED] Subject: How can I control Number of Mappers of a job? The motivation is to control the max # of mappers of a job. For example, the input data is 246MB, divided by 64M is 4. By default there will be 4 mappers launched on the 4 blocks. What I want is to set its max # of mappers to 2, so that 2 mappers are launched first, and when they complete on the first 2 blocks, another 2 mappers start on the remaining 2 blocks. Does Hadoop provide a way?
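The block arithmetic in the question checks out; by default the number of map tasks is driven by the number of input splits (ceiling division, assuming a 64 MB split size):

```java
// 246 MB of input with 64 MB splits -> ceil(246/64) = 4 map tasks,
// which is why 4 mappers are launched by default in the example above.
// mapred.map.tasks is only a hint; the split count is what decides.
class SplitCount {
    static long defaultMapTasks(long inputMb, long splitMb) {
        return (inputMb + splitMb - 1) / splitMb;   // ceiling division
    }
}
```

Capping the *concurrent* mappers of a job below the split count is exactly what HADOOP-3387-style split grouping (or a scheduler-level limit) is for, which plain split math cannot express.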
Re: multiple Output Collectors ?
Are you seeing them in the local FS or in HDFS? In the local FS you'll see them. In HDFS you should not see any (via hadoop dfs -ls). As far as I understand, check-sum files are empty when the corresponding files are empty. A On Mon, Jul 21, 2008 at 3:11 AM, Khanh Nguyen [EMAIL PROTECTED] wrote: Hello, Is there any reason that the check-sum file for a multipleoutput's collectors is empty? -k On Wed, Jul 16, 2008 at 4:26 AM, Alejandro Abdelnur [EMAIL PROTECTED] wrote: multiple mappers mean multiple jobs, which means you'll have to run 2 jobs on the same data; with MultipleOutputs and MultipleOutputFormat you can do that in one pass from a single Mapper. On Wed, Jul 16, 2008 at 3:26 AM, Khanh Nguyen [EMAIL PROTECTED] wrote: Thank you very much. Someone suggested I could just use multiple mappers. Would that work better (easier?)? -k On Mon, Jul 14, 2008 at 11:59 PM, Alejandro Abdelnur [EMAIL PROTECTED] wrote: check MultipleOutputFormat and MultipleOutputs (this has been committed to the trunk last week) On Mon, Jul 14, 2008 at 11:49 PM, Khanh Nguyen [EMAIL PROTECTED] wrote: Hello, Is it possible to have more than one output collector for one map? My input are records of html pages. I am mapping each url to its html-content and want to have two output collectors: one that maps url, html-content -- url, outlinks and another one that maps url, html-content to something else (difficult to explain). Please help. Thanks -k
Re: How to write one file per key as mapreduce output
On Thu, Jul 24, 2008 at 12:32 AM, Lincoln Ritter [EMAIL PROTECTED] wrote: Alejandro said: Take a look at the MultipleOutputFormat class or MultipleOutputs (in SVN tip) I'm muddling through both http://issues.apache.org/jira/browse/HADOOP-2906 and http://issues.apache.org/jira/browse/HADOOP-3149 trying to make sense of these. I'm a little confused by the way this works but it looks like I can define a number of named outputs which looks like it enables different output formats and I can also define some of these as multi, meaning that I can write to different targets (like files). Is this correct? Exactly. A couple of questions: - I needed to pass 'null' to the collect method so as to not write the key to the file. These files are meant to be consumable chunks of content so I want to control exactly what goes into them. Does this seem normal or am I missing something? Is there a downside to passing null here? Not sure what happens if you write NULL as key or value. - What is the 'part-0' file for? I have seen this in other places in the dfs. But it seems extraneous here. It's not super critical but if I can make it go away that would be great. This is the standard output of the M/R job: whatever is written to the OutputCollector you get in the reduce() call (or in the map() call when the number of reduces is 0) - What is the purpose of the '-r-0' suffix? Perhaps it is to help with collisions? Yes, files written from a map have '-m-', files written from a reduce have '-r-'. I guess it seems strange that I can't just say the output file should be called X and have an output file called X appear. Well, you need the map/reduce mask and the task number mask to avoid collisions.
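The naming convention discussed above (a part prefix, an m/r mask for the phase, and a zero-padded task number) can be reproduced with plain string formatting. This is a sketch of the convention as described in the thread, not Hadoop's actual code:

```java
// Output files follow "part-<m|r>-<zero-padded task id>": the m/r mask
// records which phase wrote the file, and the task number guarantees
// that concurrent tasks writing into the same output directory never
// collide on a filename.
class PartName {
    static String partName(boolean fromReduce, int taskId) {
        return String.format("part-%s-%05d", fromReduce ? "r" : "m", taskId);
    }
}
```

This is why a fixed user-chosen name X cannot work directly: two reduce tasks would both try to create X.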
Re: How to write one file per key as mapreduce output
Then you may want to look at MultipleOutputFormat; it can do what you need. On Tue, Jul 29, 2008 at 10:11 PM, Lincoln Ritter [EMAIL PROTECTED] wrote: Thanks for the info! Not sure what happens if you write NULL as key or value. Looking at the code, it doesn't seem to really make a difference, and the function in question (basically 'collect') looks to be robust to null - but I may be missing something! In my case, I basically want the key to be the output filename, and the data in the files to be directly consumable by my app. Having the key show up in the file complicates things on the app side so I'm trying to avoid this. Passing null seems to work for now. -lincoln -- lincolnritter.com On Tue, Jul 29, 2008 at 9:27 AM, Alejandro Abdelnur [EMAIL PROTECTED] wrote: On Thu, Jul 24, 2008 at 12:32 AM, Lincoln Ritter [EMAIL PROTECTED] wrote: Alejandro said: Take a look at the MultipleOutputFormat class or MultipleOutputs (in SVN tip) I'm muddling through both http://issues.apache.org/jira/browse/HADOOP-2906 and http://issues.apache.org/jira/browse/HADOOP-3149 trying to make sense of these. I'm a little confused by the way this works but it looks like I can define a number of named outputs which looks like it enables different output formats and I can also define some of these as multi, meaning that I can write to different targets (like files). Is this correct? Exactly. A couple of questions: - I needed to pass 'null' to the collect method so as to not write the key to the file. These files are meant to be consumable chunks of content so I want to control exactly what goes into them. Does this seem normal or am I missing something? Is there a downside to passing null here? Not sure what happens if you write NULL as key or value. - What is the 'part-0' file for? I have seen this in other places in the dfs. But it seems extraneous here. It's not super critical but if I can make it go away that would be great. This is the standard output of the M/R job: whatever is written to the OutputCollector you get in the reduce() call (or in the map() call when the number of reduces is 0) - What is the purpose of the '-r-0' suffix? Perhaps it is to help with collisions? Yes, files written from a map have '-m-', files written from a reduce have '-r-'. I guess it seems strange that I can't just say the output file should be called X and have an output file called X appear. Well, you need the map/reduce mask and the task number mask to avoid collisions.
Re: How to write one file per key as mapreduce output
Lincoln, Take a look at the MultipleOutputFormat class or MultipleOutputs (in SVN tip) A On Wed, Jul 23, 2008 at 5:34 AM, Lincoln Ritter [EMAIL PROTECTED] wrote: Greetings, I have what I think is a pretty straight-forward, noobie question. I would like to write one file per key in the reduce (or map) phase of a mapreduce job. I have looked at the documentation for FileOutputFormat and MultipleTextOutputFormat but am a bit unclear on how to use it/them. Can anybody give me a quick pointer? Thanks very much! -lincoln -- lincolnritter.com
[email recall] ROME 1.0 RC1 disted
Apologies for my previous email; somehow my email aliases got screwed up, this should have gone to the ROME alias. Alejandro On Wed, Jul 16, 2008 at 10:43 PM, Alejandro Abdelnur [EMAIL PROTECTED] wrote: I've updated the ROME wiki, uploaded the javadocs and dists, and am tagging CVS at the moment. Would somebody double check everything is OK, links and downloads? Also we should get the JAR in a maven repo. Who has the necessary permissions to do that? We can do a final 1.0 release in a couple of weeks, if no issues arise, retagging RC1 as 1.0. Also it would give time for any subproject (Fetcher?) that wants to go 1.0 in tandem. Cheers. A
Re: multiple Output Collectors ?
multiple mappers mean multiple jobs, which means you'll have to run 2 jobs on the same data; with MultipleOutputs and MultipleOutputFormat you can do that in one pass from a single Mapper. On Wed, Jul 16, 2008 at 3:26 AM, Khanh Nguyen [EMAIL PROTECTED] wrote: Thank you very much. Someone suggested I could just use multiple mappers. Would that work better (easier?)? -k On Mon, Jul 14, 2008 at 11:59 PM, Alejandro Abdelnur [EMAIL PROTECTED] wrote: check MultipleOutputFormat and MultipleOutputs (this has been committed to the trunk last week) On Mon, Jul 14, 2008 at 11:49 PM, Khanh Nguyen [EMAIL PROTECTED] wrote: Hello, Is it possible to have more than one output collector for one map? My input are records of html pages. I am mapping each url to its html-content and want to have two output collectors: one that maps url, html-content -- url, outlinks and another one that maps url, html-content to something else (difficult to explain). Please help. Thanks -k
ROME 1.0 RC1 disted
I've updated the ROME wiki, uploaded the javadocs and dists, and am tagging CVS at the moment. Would somebody double check everything is OK, links and downloads? Also we should get the JAR in a maven repo. Who has the necessary permissions to do that? We can do a final 1.0 release in a couple of weeks, if no issues arise, retagging RC1 as 1.0. Also it would give time for any subproject (Fetcher?) that wants to go 1.0 in tandem. Cheers. A
Re: Outputting to different paths from the same input file
You can use MultipleOutputFormat or MultipleOutputs (it has been committed to SVN a few days ago) for this. Then you can use a filter on your input dir for the next jobs so only files matching a given name/pattern are used. A On Fri, Jul 11, 2008 at 8:54 PM, Jason Venner [EMAIL PROTECTED] wrote: We open side effect files in our map and reduce jobs to 'tee' off additional data streams. We open them in the /configure/ method and close them in the /close/ method. The /configure/ method provides access to the /JobConf/. We create our files relative to the value of conf.get("mapred.output.dir"), in the map/reduce object instances. The files end up in the conf.getOutputPath() directory, and we move them out based on knowing the shape of the file names, after the job finishes. Then after the job is finished we move all of the files to another location using a file-name-based filter to select the files to move (from the job schnitzi wrote: Okay, I've found some similar discussions in the archive, but I'm still not clear on this. I'm new to Hadoop, so 'scuse my ignorance... I'm writing a Hadoop tool to read in an event log, and I want to produce two separate outputs as a result -- one for statistics, and one for budgeting. Because the event log I'm reading in can be massive, I would like to only process it once. But the outputs will each be read by further M/R processes, and will be significantly different from each other. I've looked at MultipleOutputFormat, but it seems to just want to partition data that looks basically the same into this file or that. What's the proper way to do this? Ideally, whatever solution I implement should be atomic, in that if any one of the writes fails, neither output will be produced. AdTHANKSvance, Mark -- Jason Venner Attributor - Program the Web http://www.attributor.com/ Attributor is hiring Hadoop Wranglers and coding wizards, contact if interested
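The 'filter on the next job's input' step can be sketched as a plain filename predicate (in Hadoop the equivalent is a PathFilter set on the input format; the stats-/budget- naming below is a hypothetical example for the two streams in the question):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Keep only the output files matching a naming pattern, so the
// downstream job reads e.g. the "stats-*" stream and ignores the
// "budget-*" files that live in the same output directory.
class NameFilter {
    static List<String> accept(List<String> names, String regex) {
        Pattern p = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String n : names) {
            if (p.matcher(n).matches()) {
                kept.add(n);
            }
        }
        return kept;
    }
}
```

Both named-output streams are written by the same task attempt, which is what preserves the atomicity schnitzi asks for: a failed attempt's files are discarded together.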
Re: Parameterized InputFormats
If your InputFormat implements Configurable you'll get access to the JobConf via the setConf(Configuration) method when Hadoop creates an instance of your class. On Mon, Jun 30, 2008 at 11:20 PM, Nathan Marz [EMAIL PROTECTED] wrote: Hello, Are there any plans to change the JobConf API so that it takes an instance of an InputFormat rather than the InputFormat class? I am finding the inability to properly parameterize my InputFormats to be very restricting. What's the reasoning behind having the class as a parameter rather than an instance? -Nathan Marz
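The injection pattern Alejandro describes - the framework instantiates the class by name, then hands it the configuration via setConf() - can be mimicked with stand-in types (illustrative interfaces, not Hadoop's actual classes):

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for Hadoop's Configurable contract: if the reflectively
// created object implements it, the framework calls setConf() with the
// job configuration before anything else, which is how a class-based
// API still lets you parameterize instances.
interface ConfigurableLike {
    void setConf(Map<String, String> conf);
}

class MyInputFormat implements ConfigurableLike {
    Map<String, String> conf;

    @Override
    public void setConf(Map<String, String> conf) {
        this.conf = conf;   // parameters arrive here instead of a constructor
    }
}

class FrameworkSketch {
    static Object newInstance(Class<?> cls, Map<String, String> conf) {
        try {
            Object o = cls.getDeclaredConstructor().newInstance();
            if (o instanceof ConfigurableLike) {
                ((ConfigurableLike) o).setConf(conf);
            }
            return o;
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}
```

So the InputFormat reads its parameters from the JobConf it receives, rather than from constructor arguments.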
Re: multiple Output Collectors ?
check MultipleOutputFormat and MultipleOutputs (this has been committed to the trunk last week) On Mon, Jul 14, 2008 at 11:49 PM, Khanh Nguyen [EMAIL PROTECTED] wrote: Hello, Is it possible to have more than one output collector for one map? My input are records of html pages. I am mapping each url to its html-content and want to have two output collectors: one that maps url, html-content -- url, outlinks and another one that maps url, html-content to something else (difficult to explain). Please help. Thanks -k
Re: Realtime Map Reduce = Supercomputing for the Masses?
Yes, you would have to do it with classloaders (not 'hello world' but not 'rocket science' either). You'll be limited in using native libraries, even if you use classloaders properly, as native libs can be loaded only once. You will have to ensure you get rid of the task classloader once the task is over (thus removing all singleton stuff that may be in it). You will have to put in place a security manager for the code running out of the task classloader. You'll end up doing something similar to servlet containers' webapp classloading model, with the extra burden of hot-loading for each task run. Which in the end may have a similar overhead to bootstrapping a JVM for the task; this should be measured to see what the time delta is and whether it is worth the effort. A On Mon, Jun 2, 2008 at 3:53 PM, Steve Loughran [EMAIL PROTECTED] wrote: Christophe Taton wrote: Actually Hadoop could be made more friendly to such realtime Map/Reduce jobs. For instance, we could consider running all tasks inside the task tracker JVM as separate threads, which could be implemented as another personality of the TaskRunner. I have been looking into this a couple of weeks ago... Would you be interested in such a feature? Why does that have benefits? So that you can share stuff via local data structures? Because you'd better be sharing classloaders if you are going to play that game. And that is very hard to get right (to the extent that I don't think any Apache project other than Felix does it well)
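The per-task classloader lifecycle Alejandro outlines (create per task, drop afterwards so singletons go with it) looks roughly like this with a plain URLClassLoader. This is a minimal sketch only; a real implementation would add child-first delegation, a security manager, and the native-library caveats noted above:

```java
import java.net.URL;
import java.net.URLClassLoader;

// Minimal per-task classloader lifecycle: build a loader over the
// task's JARs, install it as the thread context loader while the task
// runs, then restore and close it so the task's classes (and any
// statics/singletons they hold) become eligible for unloading.
class TaskRunnerSketch {
    static void runTask(URL[] taskJars, Runnable task) {
        try (URLClassLoader taskLoader =
                 new URLClassLoader(taskJars, TaskRunnerSketch.class.getClassLoader())) {
            Thread t = Thread.currentThread();
            ClassLoader previous = t.getContextClassLoader();
            t.setContextClassLoader(taskLoader);
            try {
                task.run();   // the task resolves classes via taskLoader
            } finally {
                t.setContextClassLoader(previous);
            }
        } catch (java.io.IOException e) {
            throw new RuntimeException(e);
        }
        // no references to taskLoader survive this method
    }
}
```

Whether this beats forking a child JVM per task is exactly the overhead question raised above, and would need measuring.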
Re: How do people keep their client configurations in sync with the remote cluster(s)
I'll keep an eye on that issue. I think a key problem right now is that clients take their config from the configuration file in the core jar and from their own settings. You need to keep the settings in sync somehow, and have to take what the core jar provides. Yes, exactly, that is the problem: there is no way to have default values set by the cluster unless you redistribute a hadoop-site.xml to all your clients. Regarding HADOOP-3287, there is some resistance to such a fix, so if you feel it makes sense please comment on it. Alejandro
Re: How do people keep their client configurations in sync with the remote cluster(s)
That would be an option too. On Mon, May 19, 2008 at 10:26 PM, Ted Dunning [EMAIL PROTECTED] wrote: I think it would be better to have the client retrieve the default configuration. Not all configuration settings are simple overrides. Some are read-modify-write operations. This also fits the current code better. On 5/19/08 6:38 AM, Steve Loughran [EMAIL PROTECTED] wrote: Alejandro Abdelnur wrote: A while ago I opened an issue related to this topic: https://issues.apache.org/jira/browse/HADOOP-3287 My take is a little different: when submitting a job, the clients should only send to the jobtracker the configuration they explicitly set; the jobtracker would then apply the defaults for all the other configuration. By doing this the cluster admin can modify things at any time, and changes to default values take effect for all clients without having to distribute a new configuration to all of them. IMO this approach was the intended behavior at some point, according to the Configuration.write(OutputStream) javadocs: 'Writes non-default properties in this configuration.' But as the write method also writes default properties, this is not happening. I'll keep an eye on that issue. I think a key problem right now is that clients take their config from the configuration file in the core jar and from their own settings. You need to keep the settings in sync somehow, and have to take what the core jar provides. This approach would also get rid of the separate mechanism (zookeeper, svn, etc.) to keep clients synchronized, as there would be no need to do so. zookeeper and similar are there to keep the cluster alive; they shouldn't be needed for clients, which should only need the URL of a jobtracker to talk to.
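To make the HADOOP-3287 proposal concrete, here is a minimal plain-Java sketch (not Hadoop's actual Configuration class; property names are illustrative): the client submits only its explicitly-set properties, and the jobtracker overlays them on the cluster's defaults.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposed behavior: defaults live on the cluster side,
// the client ships only its explicit overrides.
public class ConfigMergeDemo {

    // Overlay client overrides on top of cluster defaults; explicit settings win.
    public static Map<String, String> merge(Map<String, String> clusterDefaults,
                                            Map<String, String> clientOverrides) {
        Map<String, String> effective = new HashMap<>(clusterDefaults);
        effective.putAll(clientOverrides);
        return effective;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = new HashMap<>();
        defaults.put("mapred.reduce.tasks", "10");
        defaults.put("io.sort.mb", "100");

        Map<String, String> client = new HashMap<>();
        client.put("mapred.reduce.tasks", "2");   // the only explicit setting

        Map<String, String> effective = merge(defaults, client);
        System.out.println(effective.get("mapred.reduce.tasks")); // 2
        System.out.println(effective.get("io.sort.mb"));          // 100
    }
}
```

With this scheme an admin can change a cluster default at any time and every client picks it up on the next job submission, with no hadoop-site.xml redistribution.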
Re: Can reducer output multiple files?
Look at the MultipleOutputFormat class under mapred/lib. Or you can look at the HADOOP-3149 patch, which introduces a MultipleOutputs class; you could get this working easily on released Hadoop versions. HTH Alejandro On Thu, May 8, 2008 at 2:05 PM, Jeremy Chow [EMAIL PROTECTED] wrote: Hi list, I want to output my reduced results into several files according to the types the results belong to. How can I implement this? Thx, Jeremy -- My research interests are distributed systems, parallel computing and bytecode based virtual machine. http://coderplay.javaeye.com
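For reference, a minimal sketch of the MultipleOutputFormat route with the org.apache.hadoop.mapred.lib classes (using the reduce key as the output file name is an assumption for the example; any function of key/value works):

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Sketch: route each reduce output record to a file named after its type,
// assuming the reducer emits the record's "type" as the key (illustrative).
public class TypePartitionedOutput
        extends MultipleTextOutputFormat<Text, Text> {

    @Override
    protected String generateFileNameForKeyValue(Text key, Text value,
                                                 String name) {
        // one output directory per type, e.g. typeA/part-00000, typeB/part-00000
        return key.toString() + "/" + name;
    }
}
```

The driver then just sets `conf.setOutputFormat(TypePartitionedOutput.class)` on the JobConf.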
Re: small sized files - how to use MultiInputFileFormat
In principle I agree with you, Ted. However, in many cases we have multiple large jobs generating outputs that are not that big, and as a result the number of small files (significantly smaller than a Hadoop block) is large; the default splitting logic there triggers jobs with a large number of tasks that inefficiently clog the cluster. The MultipleFileInputFormat helps to avoid that, but it has a problem: if the file set is a mix of small and large files, the splits are uneven, and it does not split single large files. To address this we've written our own InputFormat (for Text and SequenceFiles) that collapses small files into splits of up to the block size and cuts big files at the block size. It has a twist: you can specify either the max number of maps you want or the block size to use for the splits. When a particular split contains multiple small files, the suggested hosts for the split are ordered by which host has most of the data for those files. We still have to do some cleanup on the code and then we'll submit it to Hadoop. A On Sat, Mar 29, 2008 at 10:20 PM, Ted Dunning [EMAIL PROTECTED] wrote: Small files are a bad idea for high throughput no matter what technology you use. The issue is that you need a larger file in order to avoid disk seeks. On 3/28/08 7:34 PM, Jason Curtes [EMAIL PROTECTED] wrote: Hello, I have been trying to run Hadoop on a set of small text files, not larger than 10k each. The total input size is 15MB. If I try to run the example word count application, it takes about 2000 seconds, more than half an hour, to complete. However, if I merge all the files into one large file, it takes much less than a minute. I think using MultiInputFileFormat can be helpful at this point. However, the API documentation is not really helpful.
I wonder if MultiInputFileFormat can really solve my problem, and if so, can you suggest a reference on how to use it, or a few lines to be added to the word count example to make things clearer? Thanks in advance. Regards, Jason Curtes
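The splitting policy Alejandro describes (pack small files together up to the block size, cut large files at block boundaries) can be sketched in plain Java; this is not the actual InputFormat code from the thread, just the packing logic, and the 100-unit block size in the usage is an arbitrary example value:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the split-packing policy: each split is a list of
// (fileIndex, offset, length) triples totaling at most blockSize bytes.
public class SplitPacker {

    public static List<List<long[]>> pack(long[] fileLengths, long blockSize) {
        List<List<long[]>> splits = new ArrayList<>();
        List<long[]> current = new ArrayList<>();
        long currentSize = 0;
        for (int f = 0; f < fileLengths.length; f++) {
            long offset = 0;
            long remaining = fileLengths[f];
            while (remaining > 0) {
                // Take as much of this file as fits in the current split.
                long chunk = Math.min(remaining, blockSize - currentSize);
                current.add(new long[] { f, offset, chunk });
                currentSize += chunk;
                offset += chunk;
                remaining -= chunk;
                if (currentSize == blockSize) {   // split is full, start a new one
                    splits.add(current);
                    current = new ArrayList<>();
                    currentSize = 0;
                }
            }
        }
        if (!current.isEmpty()) splits.add(current);   // last, partial split
        return splits;
    }

    public static void main(String[] args) {
        // Two small files and one large file, block size 100:
        // -> splits of 100, 100, and 30 bytes instead of one split per file.
        List<List<long[]>> splits = pack(new long[] { 10, 20, 200 }, 100);
        System.out.println(splits.size());   // 3
    }
}
```

A real InputFormat would additionally compute the suggested hosts per split from the block locations of the files it packed, as described in the thread.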