Re: Start/stop scripts - particularly start-dfs.sh - in Hortonworks Data Platform 2.3.X
OK, I will continue on the HDP list. I am already using the hdfs command for all of those individual commands but they are *not* a replacement for the single start-dfs.sh. 2015-10-24 9:48 GMT-07:00 Ted Yu <yuzhih...@gmail.com>: > See /usr/hdp/current/hadoop-hdfs-client/bin/hdfs which calls hdfs.distro > > At the top of hdfs.distro, you would see the usage: > > function print_usage(){ > echo "Usage: hdfs [--config confdir] COMMAND" > echo " where COMMAND is one of:" > echo " dfs run a filesystem command on the file > systems supported in Hadoop." > echo " namenode -format format the DFS filesystem" > echo " secondarynamenode run the DFS secondary namenode" > echo " namenode run the DFS namenode" > echo " journalnode run the DFS journalnode" > > BTW since this question is vendor specific, I suggest continuing on > vendor's forum. > > Cheers > > On Fri, Oct 23, 2015 at 7:06 AM, Stephen Boesch <java...@gmail.com> wrote: > >> >> We are setting up automated deployments on a headless system: so using >> the GUI is not an option here. When we search for those scripts under >> HDP they are not found: >> >> $ pwd >> /usr/hdp/current >> >> Which scripts exist in HDP ? >> >> [stack@s1-639016 current]$ find -L . -name \*.sh >> ... >> >> There are ZERO start/stop sh scripts.. >> >> In particular I am interested in the *start-dfs.sh* script that starts >> the namenode(s) , journalnode, and datanodes. >> >> >
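For reference, a rough sketch of starting the HDFS daemons individually on an HDP 2.3 node in lieu of start-dfs.sh. The hadoop-daemon.sh location and the /etc/hadoop/conf config directory are assumptions based on the usual /usr/hdp layout, not something verified in this thread, so check the paths on your own nodes before scripting against them:

# start the HDFS daemons one at a time as the hdfs service user
su -l hdfs -c "/usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start namenode"
su -l hdfs -c "/usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start journalnode"
su -l hdfs -c "/usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start datanode"
# use "stop" in place of "start" for shutdown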
Re: Start/stop scripts - particularly start-dfs.sh - in Hortonworks Data Platform 2.3.X
Artem, the first thing I did was a find /usr/hdp -name \*.sh . I have done numerous variations on that. 2015-10-24 8:35 GMT-07:00 Artem Ervits <artemerv...@gmail.com>: > Look in /usr/hdp/2.3 > On Oct 23, 2015 10:07 AM, "Stephen Boesch" <java...@gmail.com> wrote: > >> >> We are setting up automated deployments on a headless system: so using >> the GUI is not an option here. When we search for those scripts under >> HDP they are not found: >> >> $ pwd >> /usr/hdp/current >> >> Which scripts exist in HDP ? >> >> [stack@s1-639016 current]$ find -L . -name \*.sh >> ... >> >> There are ZERO start/stop sh scripts.. >> >> In particular I am interested in the *start-dfs.sh* script that starts >> the namenode(s) , journalnode, and datanodes. >> >>
Start/stop scripts - particularly start-dfs.sh - in Hortonworks Data Platform 2.3.X
We are setting up automated deployments on a headless system: so using the GUI is not an option here. When we search for those scripts under HDP they are not found: $ pwd /usr/hdp/current Which scripts exist in HDP ? [stack@s1-639016 current]$ find -L . -name \*.sh ... There are ZERO start/stop sh scripts.. In particular I am interested in the *start-dfs.sh* script that starts the namenode(s) , journalnode, and datanodes.
Re: Jr. to Mid Level Big Data jobs in Bay Area
Hi, This is not a job board. Thanks. 2015-05-17 16:00 GMT-07:00 Adam Pritchard apritchard...@gmail.com: Hi everyone, I was wondering if any of you know any openings looking to hire a big data dev in the Palo Alto area. Main thing I am looking for is to be on a team that will embrace having a Jr to Mid level big data developer, where I can grow my skill set and contribute. My skills are: 3 years Java 1.5 years Hadoop 1.5 years Hbase 1 year map reduce 1 year Apache Storm 1 year Apache Spark (did a Spark Streaming project in Scala) 5 years PHP 3 years iOS development 4 years Amazon ec2 experience Currently I am working in San Francisco as a big data developer, but the team I'm on is content leaving me work that I already knew how to do when I came to the team (web services) and I want to work with big data technologies at least 70% of the time. I am not a senior big data dev, but I am motivated to be and am just looking for an opportunity where I can work all day or most of the day with big data technologies, and contribute and learn from the project at hand. Thanks if anyone can share any information, Adam
Re: Jobs for Hadoop Admin in SFO Bay Area
Hi, There are plenty of job boards: please use them; this one is for technical items. 2015-02-17 17:50 GMT-08:00 Krish Donald gotomyp...@gmail.com: Hi, Does any of you aware of the job opening for entry level Hadoop Admin in SFO Bay Area ? If yes, please let me know. Thanks Krish
Re: Sr.Technical Architect/Technical Architect/Sr. Hadoop /Big Data Developer for CA, GA, NJ, NY, AZ Locations_(FTE)Full Time Employment
Please refrain from job postings on this mailing list. 2014-11-12 20:53 GMT-08:00 Larry McCay lmc...@hortonworks.com: Everyone should be aware that replying to this mail results in sending your papers to everyone on the list On Wed, Nov 12, 2014 at 8:17 PM, mark charts mcha...@yahoo.com wrote: Hello. I am interested. Attached are my cover letter and my resume. Mark Charts On Wednesday, November 12, 2014 2:45 PM, Amendra Singh Gangwar amen...@exlarate.com wrote: Hi, Please let me know if you are available for this FTE position for CA, GA, NJ, NY with good travel. Please forward latest resume along with Salary Expectations, Work Authorization Minimum joining time required. Job Descriptions: Positions: Technical Architect/Sr. Technical Architect/ Sr. Hadoop /Big Data Developer Location : CA, GA, NJ, NY Job Type : Full Time Employment Domain: BigData Requirement 1: Sr. Technical Architect: 12+ years of experience in the implementation role of high end software products in telecom/ financials/ healthcare/ hospitality domain. Requirement 2: Technical Architect: 9+ years of experience in the implementation role of high end software products in telecom/ financials/ healthcare/ hospitality domain. Requirement 3: Sr. Hadoop /Big Data Developer 7+ years of experience in the implementation role of high end software products in telecom/ financials/ healthcare/ hospitality domain. Education: Engineering Graduate, .MCA, Masters/Post Graduates (preferably IT/ CS) Primary Skills: 1. Expertise on Java/ J2EE and should still be hands on. 2. Implemented and in-depth knowledge of various java/ J2EE/ EAI patterns by using Open Source products. 3. Design/ Architected and implemented complex projects dealing with the considerable data size (GB/ PB) and with high complexity. 4. Sound knowledge of various Architectural concepts (Multi-tenancy, SOA, SCA etc) and capable of identifying and incorporating various NFR’s (performance, scalability, monitoring etc) 5. Good in database principles, SQL, and experience working with large databases (Oracle/ MySQL/ DB2). 6. Sound knowledge about the clustered deployment Architecture and should be capable of providing deployment solutions based on customer needs. 7. Sound knowledge about the Hardware (CPU, memory, disk, network, Firewalls etc) 8. Should have worked on open source products and also contributed towards it. 9. Capable of working as an individual contributor and within team too. 10. Experience in working in ODC model and capable of presenting the Design and Architecture to CTO’s, CEO’s at onsite 11. Should have experience/ knowledge on working with batch processing/ Real time systems using various Open source technologies lik Solr, hadoop, NoSQL DB’s, Storm, kafka etc. Role Responsibilities (Technical Architect/Sr. Technical Architect) •Anticipate on technological evolutions. •Coach the technical team in the development of the technical architecture. •Ensure the technical directions and choices. •Design/ Architect/ Implement various solutions arising out of the large data processing (GB’s/ PB’s) over various NoSQL, Hadoop and MPP based products. •Driving various Architecture and design calls with bigdata customers. •Working with offshore team and providing guidance on implementation details. •Conducting sessions/ writing whitepapers/ Case Studies pertaining to BigData •Responsible for Timely and quality deliveries. 
•Fulfill organization responsibilities – Sharing knowledge and experience within the other groups in the org., conducting various technical sessions and trainings. Role Responsibilities (Sr. Hadoop /Big Data Developer) •Implementation of various solutions arising out of the large data processing (GB’s/ PB’s) over various NoSQL, Hadoop and MPP based products •Active participation in the various Architecture and design calls with bigdata customers. •Working with Sr. Architects and providing implementation details to offshore. •Conducting sessions/ writing whitepapers/ Case Studies pertaining to BigData •Responsible for Timely and quality deliveries. •Fulfill organization responsibilities – Sharing knowledge and experience within the other groups in the org., conducting various technical sessions and trainings -- Thanks Regards, Amendra Singh Exlarate LLC Cell : 323-250-0583 E-mail :amen...@exlarate.com www.exlarate.com Under Bill s.1618 Title III passed by the 105th U.S. Congress this mail cannot be considered Spam as long as we include contact information and a remove link for removal from our mailing list. To be removed from our mailing list reply with Remove and include your original email address/addresses in the subject heading. Include complete address/addresses and/or domain to be removed. We
Re: Spark vs. Storm
Spark Streaming discretizes the stream by configurable intervals of no less than 500 milliseconds. Therefore it is not appropriate for true real-time processing. So if you need to capture events in the low 100s of milliseconds range or less, then stick with Storm (at least for now). If you can afford one second+ of latency then Spark provides advantages of interoperability with the other Spark components and capabilities. 2014-07-02 12:59 GMT-07:00 Shahab Yunus shahab.yu...@gmail.com: Not exactly. There are of course major implementation differences and then some subtle and high level ones too. My 2-cents: Spark is in-memory M/R and it simulated streaming or real-time distributed process for large datasets by micro-batching. The gain in speed and performance as opposed to batch paradigm is in-memory buffering or batching (and I am here being a bit naive/crude in explanation.) Storm on the other hand, supports stream processing even at a single record level (known as tuple in its lingo.) You can do micro-batching on top of it as well (using Trident API which is good for state maintenance too, if your BL requires that). This is more applicable where you want control to a single record level rather than set, collection or batch of records. Having said that, Spark Streaming is trying to simulate Storm's extreme granular approach but as far as I recall, it still is built on top of core Spark (basically another level of abstraction over core Spark constructs.) So given this, you can pick the framework which is more attuned to your needs. On Wed, Jul 2, 2014 at 3:31 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: Do these two projects do essentially the same thing? Is one better than the other?
Re: Big Data tech stack (was Spark vs. Storm)
You will not be arriving at a generic stack without oversimplifying to the point of serious deficiencies. There are as you say a multitude of options. You are attempting to boil them down to A vs B as opposed to A may work better under the following conditions .. 2014-07-02 13:25 GMT-07:00 Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com: You know what I’m really trying to do? I’m trying to come up with a best practice technology stack. There are so many freaking projects it is overwhelming. If I were to walk into an organization that had no Big Data capability, what mix of projects would be best to implement based on performance, scalability and easy of use/implementation? So far I’ve got: Ubuntu Hadoop Cassandra (Seems to be the highest performing NoSQL database out there.) Storm (maybe?) Python (Easier than Java. Maybe that shouldn’t be a concern.) Hive (For people to leverage their existing SQL skillset.) That would seem to cover transaction processing and warehouse storage and the capability to do batch and real time analysis. What am I leaving out or what do I have incorrect in my assumptions? B. *From:* Stephen Boesch java...@gmail.com *Sent:* Wednesday, July 02, 2014 3:07 PM *To:* user@hadoop.apache.org *Subject:* Re: Spark vs. Storm Spark Streaming discretizes the stream by configurable intervals of no less than 500Milliseconds. Therefore it is not appropriate for true real time processing.So if you need to capture events in the low 100's of milliseonds range or less than stick with Storm (at least for now). If you can afford one second+ of latency then spark provides advantages of interoperability with the other Spark components and capabilities. 2014-07-02 12:59 GMT-07:00 Shahab Yunus shahab.yu...@gmail.com: Not exactly. There are of course major implementation differences and then some subtle and high level ones too. My 2-cents: Spark is in-memory M/R and it simulated streaming or real-time distributed process for large datasets by micro-batching. The gain in speed and performance as opposed to batch paradigm is in-memory buffering or batching (and I am here being a bit naive/crude in explanation.) Storm on the other hand, supports stream processing even at a single record level (known as tuple in its lingo.) You can do micro-batching on top of it as well (using Trident API which is good for state maintenance too, if your BL requires that). This is more applicable where you want control to a single record level rather than set, collection or batch of records. Having said that, Spark Streaming is trying to simulate Storm's extreme granular approach but as far as I recall, it still is built on top of core Spark (basically another level of abstraction over core Spark constructs.) So given this, you can pick the framework which is more attuned to your needs. On Wed, Jul 2, 2014 at 3:31 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: Do these two projects do essentially the same thing? Is one better than the other?
Re: Cloudera VM
Hi Shashi, apologies, I was working on CDH5 enterprise (real deal) today and had misremembered the settings for the VM the VM uses */opt/cloudera/parcels/CDH/lib/*[hive|hadoop|hbase] 2014-06-13 22:51 GMT-07:00 Maisnam Ns maisnam...@gmail.com: Hi Stephen, Your last command ls -lrta /etc/hadoop worked , but when I tried this command I could not find hadoop|hive|hbase etc, where are they located, the actual hadoop directory [cloudera@localhost /]$ ls /usr/lib anaconda-runtime hue java-1.5.0 jvm-exports rpm bonobojavajava-1.6.0 jvm-private ruby ConsoleKitjava-1.3.1 java-1.7.0 locale sendmail cups java-1.4.0 java-ext lsb sendmail.postfix games java-1.4.1 jvm mozilla vmware-tools gcc java-1.4.2 jvm-commmon python2.6yum-plugins Thanks shashi On Sat, Jun 14, 2014 at 10:30 AM, Stephen Boesch java...@gmail.com wrote: Hi Shashidhar, They are under /etc/[hadoop|hbase|hive|etc]/conf as symlinks to /usr/lib/[hadoop|hbase|hive|etc] [cloudera@localhost CDH]$ ll -Ld /etc/h* drwxr-xr-x 2 root root 4096 Jun 8 16:37 /etc/hadoop-httpfs drwxr-xr-x. 3 root root 4096 Apr 4 10:36 /etc/hal drwxr-xr-x. 3 root root 4096 Jun 8 16:37 /etc/hbase drwxr-xr-x 2 root root 4096 Jun 8 16:37 /etc/hbase-solr drwxr-xr-x. 3 root root 4096 Jun 8 16:37 /etc/hive drwxr-xr-x 2 root root 4096 Jun 8 16:37 /etc/hive-hcatalog drwxr-xr-x 2 root root 4096 Jun 8 16:37 /etc/hive-webhcat -rw-r--r--. 1 root root9 Sep 23 2011 /etc/host.conf -rw-r--r--. 1 root root 46 Apr 4 10:36 /etc/hosts -rw-r--r--. 1 root root 370 Jan 12 2010 /etc/hosts.allow -rw-r--r--. 1 root root 460 Jan 12 2010 /etc/hosts.deny drwxr-xr-x 2 root root 4096 Jun 8 16:37 /etc/hue [cloudera@localhost CDH]$ *ls -lrta /etc/hadoop* total 16 drwxr-xr-x. 2 root root 4096 Apr 4 11:21 conf.cloudera.hdfs lrwxrwxrwx1 root root 29 Jun 8 16:37 *conf - /etc/alternatives/hadoop-conf* drwxr-xr-x. 4 root root 4096 Jun 8 16:37 . drwxr-xr-x. 2 cloudera cloudera 4096 Jun 8 19:14 conf.cloudera.yarn drwxr-xr-x. 113 root root 4096 Jun 13 21:57 .. 2014-06-13 21:42 GMT-07:00 Shashidhar Rao raoshashidhar...@gmail.com: Hi, I just installed cloudera vm 5 .x on vmplayer. Can somebody having experience with cloudera help me in finding where are the *-site.xml files so that I can configure the various settings. Thanks Shashi
Re: Cloudera VM
Hi Shashidhar, They are under /etc/[hadoop|hbase|hive|etc]/conf as symlinks to /usr/lib/[hadoop|hbase|hive|etc] [cloudera@localhost CDH]$ ll -Ld /etc/h* drwxr-xr-x 2 root root 4096 Jun 8 16:37 /etc/hadoop-httpfs drwxr-xr-x. 3 root root 4096 Apr 4 10:36 /etc/hal drwxr-xr-x. 3 root root 4096 Jun 8 16:37 /etc/hbase drwxr-xr-x 2 root root 4096 Jun 8 16:37 /etc/hbase-solr drwxr-xr-x. 3 root root 4096 Jun 8 16:37 /etc/hive drwxr-xr-x 2 root root 4096 Jun 8 16:37 /etc/hive-hcatalog drwxr-xr-x 2 root root 4096 Jun 8 16:37 /etc/hive-webhcat -rw-r--r--. 1 root root9 Sep 23 2011 /etc/host.conf -rw-r--r--. 1 root root 46 Apr 4 10:36 /etc/hosts -rw-r--r--. 1 root root 370 Jan 12 2010 /etc/hosts.allow -rw-r--r--. 1 root root 460 Jan 12 2010 /etc/hosts.deny drwxr-xr-x 2 root root 4096 Jun 8 16:37 /etc/hue [cloudera@localhost CDH]$ *ls -lrta /etc/hadoop* total 16 drwxr-xr-x. 2 root root 4096 Apr 4 11:21 conf.cloudera.hdfs lrwxrwxrwx1 root root 29 Jun 8 16:37 *conf - /etc/alternatives/hadoop-conf* drwxr-xr-x. 4 root root 4096 Jun 8 16:37 . drwxr-xr-x. 2 cloudera cloudera 4096 Jun 8 19:14 conf.cloudera.yarn drwxr-xr-x. 113 root root 4096 Jun 13 21:57 .. 2014-06-13 21:42 GMT-07:00 Shashidhar Rao raoshashidhar...@gmail.com: Hi, I just installed cloudera vm 5 .x on vmplayer. Can somebody having experience with cloudera help me in finding where are the *-site.xml files so that I can configure the various settings. Thanks Shashi
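A quick way to confirm which configuration directory is actually active on the VM (the commands below assume a CentOS-based image like the Cloudera demo VM; adjust if yours differs):

# resolve the conf symlink and list the site files that live there
readlink -f /etc/hadoop/conf
ls /etc/hadoop/conf/*-site.xml
# the symlink is managed by the alternatives system, which can also be inspected directly
sudo alternatives --display hadoop-conf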
Re: FW: Lambdoop - We are hiring! CTO-Founder2Be based in Madrid, Spain
Please refrain from making this a job posting site. thanks. 2014-01-29 Gursev Bajwa gba...@blackberry.com Gursev Bajwa BlackBerry Technical Support BlackBerry Care Office: (519) 888-7465 x39495 Mobile: (226) 339-2981 PIN: 2FFFC504 Always On, Always Connected Sent from my BlackBerry 10 Smartphone -Original Message- From: Info [mailto:i...@lambdoop.com] Sent: Wednesday, January 29, 2014 8:02 AM To: gene...@hadoop.apache.org Subject: Lambdoop - We are hiring! CTO-Founder2Be based in Madrid, Spain Dear all, At Lambdoop we are building a BigData middleware with a simple Java API for building Big Data applications using any processing paradigm: long-running (batch) tasks, streaming (real-time) processing and hybrid computation models (Lambda Architecture). No MapReduce coding, streaming topology processing or complex NoSQL management. No synchronization or aggregation issues. Just Lambdoop! More info to come at lambdoop.com Job description: Due to our next official launch, we are looking to hire a talented Chief Technology Officer with experience working in a start-up environment. Our engineering team has been hardly working on implementing our initial product and now we are ready to launch and grow a new innovative BigData company that will make BigData application development easier and faster. With our ultimate goal to make our customers have better use of data, simplifying the way they create valuable insights from they sparse data sources. You will be responsible for building and leading our engineering team, setting the product strategy and roadmap. While this will be the main activity, our next CTO should be willing to get his hands on the products, writing code, solving problems and prototyping new functionalities. You will be providing technical guidance and direction to our entire engineering team. Ideally this individual will propose innovative solutions and explore new ways of increase the value we provide to our customers. We are looking for a CTO- founder to be - based in Madrid, Spain. W e offer competitive salary and strong benefits, (equity included). We offer a very innovative work culture, within a passionate team willing to make the difference. If you are intersted, please drop us an email at in f...@lambdoop.com and tell us about you. Required Skills *PhD/MS Computer Science - Software Engineering *Strong OOP skills, ability to analyze requirements and prepare design *Strong background in Architecture Design, especially in parallel and distributed processing systems *Demonstrated track building SW solutions and products, preferably in the data analytics space, in another successful start-up or well-respected technology company *Hands on experience with common open-source and JEE technologies *Knowledge on cloud environments (AWS, Google, ...) *Team-oriented individual with excellent interpersonal, planning, coordination, and problem-finding skills. Ability to prioritize and assign tasks throughout the team. *High degree of initiative and the ability to work independently and follow-through on assignments. Big Data experience *Hadoop (HDFS, MapReduce, YARN) *Batch technologies (Hive, Pig, Cascading) *NoSQL (HBase, Cassandra, Redis) *Real-time processing (Storm, Trident) *Related tools (Avro, Sqoop, Flume) *Machine Learning (Mahout, SAMOA) Personal skills *Strong communication capabilities in Spanish and English (both written and verbal) *Result oriented capable to work under tight deadlines *Motivated *Open minded *A passion to learn and to make a difference Thanks . 
- The Lambdoop team lambdoop.com @infoLambdoop
Re: how to benchmark the whole hadoop cluster performance?
You are on the right track. TestDFSIO and TeraGen/Sort provide good characterization of IO and shuffle /sort performance. You would likely want to run/save dstat (/vmstat/iostat/..) info on the individual nodes as well. HiBench does provide additional useful characterizations such as mixed workloads using typical hadoop ecosystem tools. 2013/9/2 Ravi Kiran ravikiranmag...@gmail.com You can also look at a ) https://github.com/intel-hadoop/HiBench Regards Ravi Magham On Mon, Sep 2, 2013 at 12:26 PM, ch huang justlo...@gmail.com wrote: hi ,all: i want to evaluate my hadoop cluster performance ,what tool can i use? (TestDFSIO,nnbench?)
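For concreteness, typical invocations look roughly like the following (jar names and paths vary by distribution and version, so substitute whatever your cluster ships):

# HDFS I/O throughput: write then read 10 files of ~1 GB each
hadoop jar hadoop-*test*.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
hadoop jar hadoop-*test*.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
# shuffle/sort characterization: generate ~10 GB (100M rows x 100 bytes), sort it, validate the output
hadoop jar hadoop-*examples*.jar teragen 100000000 /bench/teragen
hadoop jar hadoop-*examples*.jar terasort /bench/teragen /bench/terasort
hadoop jar hadoop-*examples*.jar teravalidate /bench/terasort /bench/teravalidate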
Re: Collect, Spill and Merge phases insight
great questions, i am also looking forward to answers from expert(s) here. 2013/7/16 Felix.徐 ygnhz...@gmail.com Hi all, I am trying to understand the process of Collect, Spill and Merge in Map, I've referred to a few documentations but still have a few questions. Here is my understanding about the spill phase in map: 1.Collect function add a record into the buffer. 2.If the buffer exceeds a threshold (determined by parameters like io.sort.mb), spill phase begins. 3.Spill phase includes 3 actions : sort , combine and compression. 4.Spill may be performed multiple times thus a few spilled files will be generated. 5.If there are more than 1 spilled files, Merge phase begins and merge these files into a big one. If there is any miss understanding about these phases, please correct me ,thanks! And my questions are: 1.Where is the partition being calculated (in Collect or Spill) ? Does Collect simply append a record into the buffer and check whether we should spill the buffer? 2.At Merge phase, since the spilled files are compressed, does it need to uncompressed these files and compress them again? Since Merge may be performed more than 1 round, does it compress intermediate files? 3.Does the Merge phase at Map and Reduce side almost the same (External merge-sort combined with Min-Heap) ?
Hint on EOFException's on datanodes
On a smallish (10 node) cluster with only 2 mappers per node after a few minutes EOFExceptions are cropping up on the datanodes: an example is shown below. Any hint on what to tweak/change in hadoop / cluster settings to make this more happy? 2013-05-24 05:03:57,460 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (org.apache.hadoop.hdfs.server.datanode.DataXceiver@1b1accfc): writeBlock blk_7760450154173670997_48372 received exception java.io.EOFException: while trying to read 65557 bytes 2013-05-24 05:03:57,262 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (PacketResponder 0 for Block blk_-3990749197748165818_48331): PacketResponder 0 for block blk_-3990749197748165818_48331 terminating 2013-05-24 05:03:57,460 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode (org.apache.hadoop.hdfs.server.datanode.DataXceiver@1b1accfc): DatanodeRegistration(10.254.40.79:9200, storageID=DS-1106090267-10.254.40.79-9200-1369343833886, infoPort=9102, ipcPort=9201):DataXceiver java.io.EOFException: while trying to read 65557 bytes at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:268) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:312) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:376) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:406) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:112) at java.lang.Thread.run(Thread.java:662) 2013-05-24 05:03:57,261 INFO org.apache.hadoop.hdfs.server.datanode.Dat
Re: Reduce starts before map completes (at 23%)
Hi Sai, The first phase of the reducer is to copy/fetch the files from the Mapper machine(s) to the reducer. This can be started when some of the mappers have completed. You will notice that the reducer will not surpass 33%, though, since the next phase - sort - requires that all of the mappers be completed. Summary: that is fine. 2013/4/11 Sai Sai saigr...@yahoo.in I am running the wordcount from hadoop-examples, i am giving as input a bunch of test files, i have noticed in the output given below reduce starts when the map is at 23%, i was wondering if it is not right that reducers will start only after the complete mapping is done which mean when map is 100% then i thought the reducers will start. Why r the reducers starting when map is still at 23%. 13/04/11 21:10:32 INFO mapred.JobClient: map 0% reduce 0% 13/04/11 21:10:56 INFO mapred.JobClient: map 1% reduce 0% 13/04/11 21:10:59 INFO mapred.JobClient: map 2% reduce 0% 13/04/11 21:11:02 INFO mapred.JobClient: map 3% reduce 0% 13/04/11 21:11:05 INFO mapred.JobClient: map 4% reduce 0% 13/04/11 21:11:08 INFO mapred.JobClient: map 6% reduce 0% 13/04/11 21:11:11 INFO mapred.JobClient: map 7% reduce 0% 13/04/11 21:11:17 INFO mapred.JobClient: map 8% reduce 0% 13/04/11 21:11:23 INFO mapred.JobClient: map 10% reduce 0% 13/04/11 21:11:26 INFO mapred.JobClient: map 12% reduce 0% 13/04/11 21:11:32 INFO mapred.JobClient: map 14% reduce 0% 13/04/11 21:11:44 INFO mapred.JobClient: map 23% reduce 0% 13/04/11 21:11:50 INFO mapred.JobClient: map 23% reduce 1% 13/04/11 21:11:53 INFO mapred.JobClient: map 33% reduce 7% 13/04/11 21:12:02 INFO mapred.JobClient: map 42% reduce 7% Please pour some light. Thanks Sai
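If you do want to change this behavior, the point at which reducers are allowed to start the copy phase is controlled by the "slowstart" setting (by default roughly 5% of maps completed). A sketch, assuming your job's driver honors generic options via ToolRunner; on older releases the property is named mapred.reduce.slowstart.completed.maps rather than the name shown:

# delay reducer launch until every map task has finished
hadoop jar hadoop-examples.jar wordcount -Dmapreduce.job.reduce.slowstart.completedmaps=1.0 input output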
Re: Job: Bigdata Architect
Please let's refrain from recruiting on this DL, which is my understanding focused on hadoop questions/issues not jobs. Thanks, 2013/4/1 jessica k jessica.kudukisgr...@gmail.com We are a recruiting firm focused on The IT Service and Solutions industry. We were contracted by a top tier $7 Billion + consulting firm. I thought you may be interested , or know someone who may be, in the following position. ** ** *1. **Job Role:* Sr. Architect (C2) / Architect (C1) / Tech Lead (B3) – Big Data and Online Analytics *2. **Locations*: Mountain View, California or New York City, NY* 3. **Job Description: * We are looking for Architects for developing Big Data solutions. The candidates would be responsible for the following activities: - Meet clients, understand their needs and craft solutions meeting client needs Technology evaluation, architecture design, implementation, testing and technical reviews of highly scalable platforms, solutions and technology building blocks to address customer’s requirement. - Lead the overall technical solution implementation as part of customer’s project delivery team - Mentor and groom project team members in the core technology areas, usage of SDLC tools and Agile - Engage in external and internal forums to demonstrate thought leadership *4. Skills Required:* - 7+ years of overall work experience. - Strong hands-on work experience in architecting, designing and implementing big data solutions including one or more of the following skills: Programming Languages Core Java, J2EE, JDBC, Spring, Struts, Hibernate Scalable databases or object stores Cassandra, MongoDB, AWS DynamoDB, Hbase, OpenStack Swift, etc Caching memcached, BerkelyDB, Gigaspaces, Infinispan Distributed File Systems HDFS, Gluster, Lustre Distributed Data Processing Hadoop Map Reduce, Parallel Processing, Stream Processing, Flume, Splunk Other skills Machine Learning, Mathematical background NLP If you are interested , please respond with a current resume to * jess...@kudukisgroup.com*. I will give you a call to speak in further detail. We keep all information confidential. Feel free to reply with any questions. Thanks, ~ Jessica
Re: Job driver and 3rd party jars
*-Dmapreduce.task.classpath.user.precedence=true* * * I have also experienced these issues with -libjars not working and am following this thread with interest. Where is this particular option - as opposed to mapreduce.user.classpath.first which in version 1.0.3 is in the TaskRunner.java ? Any documentation / hints on this? 2013/3/7 刘晓文 lxw1...@qq.com try: hadoop jar *-Dmapreduce.task.classpath.user.precedence=true *-libjars your_jar -- Original -- *From: * Barak Yaishbarak.ya...@gmail.com; *Date: * Fri, Mar 8, 2013 03:06 PM *To: * useruser@hadoop.apache.org; ** *Subject: * Re: Job driver and 3rd party jars Yep, my typo, I'm using the later. I was also trying export HADOOP_CLASSPATH_USER_FIRST =true and export HADOOP_CLASSPATH=myjar before launching the hadoop jar, but I still getting the same exception. I'm running hadoop 1.0.4. On Mar 8, 2013 2:27 AM, Harsh J ha...@cloudera.com wrote: To be precise, did you use -libjar or -libjars? The latter is the right option. On Fri, Mar 8, 2013 at 12:18 AM, Barak Yaish barak.ya...@gmail.com wrote: Hi, I'm able to run M/R jobs where the mapper and reducer required to use 3rd party jars. I'm registering those jars in -libjar while invoking the hadoop jar command. I'm facing a strange problem, though, when the job driver itself ( extends Configured implements Tool ) required to run such code ( for example notify some remote service upon start and end). Is there a way to configure classpath when submitting jobs using hadoop jar? Seems like -libjar doesn't work for this case... Exception in thread main java.lang.NoClassDefFoundError: com/me/context/DefaultContext at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClassCond(ClassLoader.java:632) at java.lang.ClassLoader.defineClass(ClassLoader.java:616) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141) at java.net.URLClassLoader.defineClass(URLClassLoader.java:283) at java.net.URLClassLoader.access$000(URLClassLoader.java:58) at java.net.URLClassLoader$1.run(URLClassLoader.java:197) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at java.lang.ClassLoader.loadClass(ClassLoader.java:248) at com.peer39.bigdata.mr.pnm.PnmDataCruncher.run(PnmDataCruncher.java:50) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at com.me.mr.pnm.PnmMR.main(PnmDataCruncher.java:261) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Caused by: java.lang.ClassNotFoundException: com.me.context.DefaultContext at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at java.lang.ClassLoader.loadClass(ClassLoader.java:248) -- Harsh J
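For what it's worth, this is the shape of invocation I would expect to work (my_job.jar and com.example.MyDriver are placeholders, not anything from this thread): generic options such as -D and -libjars must appear immediately after the main class and before the job arguments, and since -libjars only affects the task classpath, the jar also has to be placed on HADOOP_CLASSPATH for code that runs in the driver JVM itself:

# make the 3rd-party jar visible to the driver JVM
export HADOOP_CLASSPATH=/path/to/thirdparty.jar
# ship it to the tasks with -libjars; optionally prefer user classes on the task classpath
hadoop jar my_job.jar com.example.MyDriver -Dmapreduce.user.classpath.first=true -libjars /path/to/thirdparty.jar input output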
Re: Tests to be done to check a library if it works on Hadoop MapReduce framework
check out mrunit http://mrunit.apache.org/ 2013/2/16 Varsha Raveendran varsha.raveend...@gmail.com Hello! As part of my graduate project I am trying to create a library to support Genetic Algorithms to work on Hadoop MapReduce. A couple of things I have not understood - How to test the library in Hadoop? I mean how to check if it is doing what it is supposed to do. What is the user expected to give as input? I am a newbie to both GA and Hadoop. I have been studying both for the past couple of weeks. So now have a fair idea about GAs but not able to understand how to implement a library. Never done this before. Any suggestions/help would be appreciated. Regards, Varsha
Re: Help with DataDrivenDBInputFormat: splits are created properly but zero records are sent to the mappers
It turns out to be an apparent problem in one of the two setInput() methods of DataDrivenDBInputFormat. The version I used does not work as shown: it needs to have a primary key column set somehow. But no information / documentation on how to set the pkcol that I could find. So I converted to using the other setInput() method as follows: DataDrivenDBInputFormat.setInput(job, DBTextWritable.class, APP_DETAILS_CRAWL_QUEUE_V, null, id, id); Now this is working. 2013/1/24 Stephen Boesch java...@gmail.com I have made an attempt to implement a job using DataDrivenDBInputFormat. The result is that the input splits are created successfully with 45K records apiece, but zero records are then actually sent to the mappers. If anyone can point to working example(s) of using DataDrivenDBInputFormat it would be much appreciated. Here are further details of my attempt: DBConfiguration.configureDB(job.getConfiguration(), props.getDriver(), props.getUrl(), props.getUser(), props.getPassword()); // Note: i also include code here to verify able to get java.sql.Connection using the above props.. DataDrivenDBInputFormat.setInput(job, DBLongWritable.class, "select id,status from app_detail_active_crawl_queue_v where " + DataDrivenDBInputFormat.SUBSTITUTE_TOKEN, "SELECT MIN(id),MAX(id) FROM app_detail_active_crawl_queue_v" ); // I verified by stepping with debugger that the input query was successfully applied by DataDrivenDBInputFormat to create two splits of 40K records each ); .. snip .. // Register a custom DBLongWritable class static { WritableComparator.define(DBLongWritable.class, new DBLongWritable.DBLongKeyComparator()); int x = 1; } Here is the job output. No rows were processed (even though 90K rows were identified in the InputSplits phase and divided into two 45K splits). So why were the input splits not processed?
[Thu Jan 24 12:19:59] Successfully connected to driver=com.mysql.jdbc.Driver url=jdbc:mysql://localhost:3306/classint user=stephenb [Thu Jan 24 12:19:59] select id,status from app_detail_active_crawl_queue_v where $CONDITIONS 13/01/24 12:20:03 INFO mapred.JobClient: Running job: job_201301102125_0069 13/01/24 12:20:05 INFO mapred.JobClient: map 0% reduce 0% 13/01/24 12:20:22 INFO mapred.JobClient: map 50% reduce 0% 13/01/24 12:20:25 INFO mapred.JobClient: map 100% reduce 0% 13/01/24 12:20:30 INFO mapred.JobClient: Job complete: job_201301102125_0069 13/01/24 12:20:30 INFO mapred.JobClient: Counters: 17 13/01/24 12:20:30 INFO mapred.JobClient: Job Counters 13/01/24 12:20:30 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=21181 13/01/24 12:20:30 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 13/01/24 12:20:30 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 13/01/24 12:20:30 INFO mapred.JobClient: Launched map tasks=2 13/01/24 12:20:30 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0 13/01/24 12:20:30 INFO mapred.JobClient: File Output Format Counters 13/01/24 12:20:30 INFO mapred.JobClient: Bytes Written=0 13/01/24 12:20:30 INFO mapred.JobClient: FileSystemCounters 13/01/24 12:20:30 INFO mapred.JobClient: HDFS_BYTES_READ=215 13/01/24 12:20:30 INFO mapred.JobClient: FILE_BYTES_WRITTEN=44010 13/01/24 12:20:30 INFO mapred.JobClient: File Input Format Counters 13/01/24 12:20:30 INFO mapred.JobClient: Bytes Read=0 13/01/24 12:20:30 INFO mapred.JobClient: Map-Reduce Framework 13/01/24 12:20:30 INFO mapred.JobClient: Map input records=0 13/01/24 12:20:30 INFO mapred.JobClient: Physical memory (bytes) snapshot=200056832 13/01/24 12:20:30 INFO mapred.JobClient: Spilled Records=0 13/01/24 12:20:30 INFO mapred.JobClient: CPU time spent (ms)=2960 13/01/24 12:20:30 INFO mapred.JobClient: Total committed heap usage (bytes)=247201792 13/01/24 12:20:30 INFO mapred.JobClient: Virtual memory (bytes) snapshot=4457689088 13/01/24 12:20:30 INFO mapred.JobClient: Map output records=0 13/01/24 12:20:30 INFO mapred.JobClient: SPLIT_RAW_BYTES=215
Getting hostname (or any environment variable) into *-site.xml files
Hi, We are using a library with HDFS that uses custom properties inside the *-site.xml files. Instead of (a) hard-coding or (b) writing a sed script to update to the local hostnames on each deployed node, is there a mechanism to use environment variables?

<property>
  <name>custom.property</name>
  <value>${HOSTNAME}</value>
</property>

I have tried this and out of the box the literal string ${HOSTNAME} is used (it is not expanded). Anyone have a recommendation/solution on this? thanks, stephen b
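One avenue that may be worth trying (hedging here, since it is not verified in this thread): as far as I recall, Hadoop's Configuration expands ${...} references against other configuration properties and Java system properties, not against shell environment variables. So injecting the hostname as a system property when the daemons start should let ${HOSTNAME} resolve. The variable names below are illustrative additions to hadoop-env.sh; adapt them to whichever daemon actually reads custom.property:

# pass the local hostname to the daemon JVMs as a system property
export HADOOP_NAMENODE_OPTS="-DHOSTNAME=$(hostname -f) $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-DHOSTNAME=$(hostname -f) $HADOOP_DATANODE_OPTS"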
Re: Anyone successfully run Hadoop in pseudo or cluster model under Cygwin?
Hi Tim, Running in cygwin one encounters more bugs and often will get less support (since less people running cygwin) than running on native linux distros. So if you have a choice I'd not recommend it. If you do not have a choice then of course please wait for other responses on this ML. stephenb 2012/4/12 Tim.Wu china.tim...@gmail.com If yes. Could u send me an email? Your prompt reply will be appreciated, because I asked two questions in this mailing list in March, but no one reply me. Questions are listed in http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201203.mbox/%3CCA+2n-oGLocSo3YMUiYzhbtOPzO=11g1rl_b45+y-tggyvzk...@mail.gmail.com%3E and http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201203.mbox/%3CCA+2n-oERYM15BKb4dc9KETpjNfucqzd-=coecf2sgclj5u7...@mail.gmail.com%3E -- Best, WU Pengcheng ( Tim )
Re: UDF compiling
Hi, try adding the directory containing WordMapper.class and SumReducer.class (here, the current directory) to the -classpath. 4/12 Barry, Sean F sean.f.ba...@intel.com I am trying to compiling a customized WordCount UDF but I get this cannot find symbol error when I compile. And I'm not sure how to resolve this issue. hduser@master:~ javac -classpath /usr/lib/hadoop/hadoop-core-0.20.2-cdh3u3.jar WordCount.java WordCount.java:24: error: cannot find symbol conf.setMapperClass(WordMapper.class); ^ symbol: class WordMapper location: class WordCount WordCount.java:25: error: cannot find symbol conf.setReducerClass(SumReducer.class); ^ symbol: class SumReducer location: class WordCount 2 errors hduser@master:~ ls SumReducer.class WordMapper.class SumReducer.java WordCount.java WordMapper.java hduser@master:~
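Concretely, something like the following should resolve the symbols, since javac will then find the already-compiled WordMapper and SumReducer classes in the working directory:

javac -classpath /usr/lib/hadoop/hadoop-core-0.20.2-cdh3u3.jar:. WordCount.java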
Re: Hadoop Oppurtunity
yes please, let's focus on the technical issues 2012/2/18 real great.. greatness.hardn...@gmail.com Could we actually create a separate mailing list for Hadoop related jobs? On Sun, Feb 19, 2012 at 11:40 AM, larry la...@pssclabs.com wrote: Hi: We are looking for someone to help install and support hadoop clusters. We are in Southern California. Thanks, Larry Lesser PSSC Labs (949) 380-7288 Tel. la...@pssclabs.com 20432 North Sea Circle Lake Forest, CA 92630 -- Regards, R.V.
Re: MRv2 DataNode problem: isBPServiceAlive invoked order of 200K times per second
Update on this: I've shut down all the servers multiple times. Also cleared the data directories and reformatted the namenode. Restarted it and the same results: 100% cpu and millions of these calls to isBPServiceAlive. 2011/11/29 Stephen Boesch java...@gmail.com I am just trying to get off the ground with MRv2. The first node (in pseudo distributed mode) is working fine - ran a couple of TeraSort's on it. The second node has a serious issue with its single DataNode: it consumes 100% of one of the CPU's. Looking at it through JVisualVM, there are over 8 million invocations of isBPServiceAlive in a matter of a minute or so and continually incrementing at a steady clip. A screenshot of the JvisualVM cpu profile - showing just shy of 8M invocations is attached. What kind of configuration error could lead to this? The conf/masters and conf/slaves simply say localhost. If need be I'll copy the *-site.xml's. They are boilerplate from the Cloudera page by Ahmed Radwan.
Re: MRv2 DataNode problem: isBPServiceAlive invoked order of 200K times per second
Hi Uma, I mentioned that I have restarted the datanode *many *times, and in fact the entire cluster more than ten times. 2011/11/29 Uma Maheswara Rao G mahesw...@huawei.com Looks you are getting HDFS-2553. The cause might be that, you cleared the datadirectories directly without DN restart. Workaround would be to restart DNs. Regards, Uma -- *From:* Stephen Boesch [java...@gmail.com] *Sent:* Tuesday, November 29, 2011 8:53 PM *To:* mapreduce-user@hadoop.apache.org *Subject:* Re: MRv2 DataNode problem: isBPServiceAlive invoked order of 200K times per second Update on this: I've shut down all the servers multiple times. Also cleared the data directories and reformatted the namenode. Restarted it and the same results: 100% cpu and millions of these calls to isBPServiceAlive. 2011/11/29 Stephen Boesch java...@gmail.com I am just trying to get off the ground with MRv2. The first node (in pseudo distributed mode) is working fine - ran a couple of TeraSort's on it. The second node has a serious issue with its single DataNode: it consumes 100% of one of the CPU's. Looking at it through JVisualVM, there are over 8 million invocations of isBPServiceAlive in a matter of a minute or so and continually incrementing at a steady clip. A screenshot of the JvisualVM cpu profile - showing just shy of 8M invocations is attached. What kind of configuration error could lead to this? The conf/masters and conf/slaves simply say localhost. If need be I'll copy the *-site.xml's. They are boilerplate from the Cloudera page by Ahmed Radwan.
Re: MRv2 DataNode problem: isBPServiceAlive invoked order of 200K times per second
I verified the DN was down via both jps and java. Anyways, it was enough to see via top since as mentioned DN was consuming 100% of one cpu when running. 2011/11/29 Stephen Boesch java...@gmail.com Hi Uma, I mentioned that I have restarted the datanode *many *times, and in fact the entire cluster more than ten times. 2011/11/29 Uma Maheswara Rao G mahesw...@huawei.com Looks you are getting HDFS-2553. The cause might be that, you cleared the datadirectories directly without DN restart. Workaround would be to restart DNs. Regards, Uma -- *From:* Stephen Boesch [java...@gmail.com] *Sent:* Tuesday, November 29, 2011 8:53 PM *To:* mapreduce-user@hadoop.apache.org *Subject:* Re: MRv2 DataNode problem: isBPServiceAlive invoked order of 200K times per second Update on this: I've shut down all the servers multiple times. Also cleared the data directories and reformatted the namenode. Restarted it and the same results: 100% cpu and millions of these calls to isBPServiceAlive. 2011/11/29 Stephen Boesch java...@gmail.com I am just trying to get off the ground with MRv2. The first node (in pseudo distributed mode) is working fine - ran a couple of TeraSort's on it. The second node has a serious issue with its single DataNode: it consumes 100% of one of the CPU's. Looking at it through JVisualVM, there are over 8 million invocations of isBPServiceAlive in a matter of a minute or so and continually incrementing at a steady clip. A screenshot of the JvisualVM cpu profile - showing just shy of 8M invocations is attached. What kind of configuration error could lead to this? The conf/masters and conf/slaves simply say localhost. If need be I'll copy the *-site.xml's. They are boilerplate from the Cloudera page by Ahmed Radwan.
ProtocolProvider errors On MRv2 Failed to use org.apache.hadoop.mapred.YarnClientProtocolProvider
Hi I set up a pseudo cluster according to the instructions here http://www.cloudera.com/blog/2011/11/building-and-deploying-mr2/. Initially the randomwriter example worked. But after a crash on the machine and restarting the services I am getting the errors shown below. Jps seems to think the processes are running properly: had@mithril:/shared/hadoop$ jps 7980 JobHistoryServer 7668 NameNode 7821 ResourceManager 7748 DataNode 8021 Jps 7902 NodeManager $ hadoop jar hadoop-mapreduce-examples-0.23.0.jar randomwriter - Dmapreduce.job.user.name=$USER -Dmapreduce.clientfactory.class.name=org.apache.hadoop.mapred.YarnClientFactory -Dmapreduce.randomwriter.bytespermap=1 -Ddfs.blocksize=64m -Ddfs.block.size=64m -libjars $YARN_HOME/modules/hadoop-mapreduce-client-jobclient-0.23.0.jar output 2011-11-28 10:23:56,102 WARN conf.Configuration (Configuration.java:set(629)) - mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used 2011-11-28 10:23:56,158 INFO ipc.YarnRPC (YarnRPC.java:create(47)) - Creating YarnRPC for org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC 2011-11-28 10:23:56,162 INFO mapred.ResourceMgrDelegate (ResourceMgrDelegate.java:init(95)) - Connecting to ResourceManager at / 0.0.0.0:8040 2011-11-28 10:23:56,163 INFO ipc.HadoopYarnRPC (HadoopYarnProtoRPC.java:getProxy(48)) - Creating a HadoopYarnProtoRpc proxy for protocol interface org.apache.hadoop.yarn.api.ClientRMProtocol 2011-11-28 10:23:56,203 INFO mapred.ResourceMgrDelegate (ResourceMgrDelegate.java:init(99)) - Connected to ResourceManager at / 0.0.0.0:8040 2011-11-28 10:23:56,248 INFO mapreduce.Cluster (Cluster.java:initialize(116)) - Failed to use org.apache.hadoop.mapred.YarnClientProtocolProvider due to error: java.lang.reflect.InvocationTargetException 2011-11-28 10:23:56,250 INFO mapreduce.Cluster (Cluster.java:initialize(111)) - Cannot pick org.apache.hadoop.mapred.LocalClientProtocolProvider as the ClientProtocolProvider - returned null protocol 2011-11-28 10:23:56,251 INFO mapreduce.Cluster (Cluster.java:initialize(111)) - Cannot pick org.apache.hadoop.mapred.JobTrackerClientProtocolProvider as the ClientProtocolProvider - returned null protocol java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. My *-site.xml files are precisely as shown on the instructions page. In any case copying here the one that is most germane - mapred-site.xml ?xml version=1.0? ?xml-stylesheet href=configuration.xsl? configuration property name mapreduce.framework.name/name valueyarn/value /property /configuration
Re: ProtocolProvider errors On MRv2 Failed to use org.apache.hadoop.mapred.YarnClientProtocolProvider
Here is complete output. 2011-11-28 16:34:27,606 WARN conf.Configuration (Configuration.java:set(629)) - mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used 2011-11-28 16:34:27,660 INFO ipc.YarnRPC (YarnRPC.java:create(47)) - Creating YarnRPC for org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC 2011-11-28 16:34:27,663 INFO mapred.ResourceMgrDelegate (ResourceMgrDelegate.java:init(95)) - Connecting to ResourceManager at / 0.0.0.0:8040 2011-11-28 16:34:27,664 INFO ipc.HadoopYarnRPC (HadoopYarnProtoRPC.java:getProxy(48)) - Creating a HadoopYarnProtoRpc proxy for protocol interface org.apache.hadoop.yarn.api.ClientRMProtocol 2011-11-28 16:34:27,700 INFO mapred.ResourceMgrDelegate (ResourceMgrDelegate.java:init(99)) - Connected to ResourceManager at / 0.0.0.0:8040 2011-11-28 16:34:27,734 INFO mapreduce.Cluster (Cluster.java:initialize(116)) - Failed to use org.apache.hadoop.mapred.YarnClientProtocolProvider due to error: java.lang.reflect.InvocationTargetException 2011-11-28 16:34:27,735 INFO mapreduce.Cluster (Cluster.java:initialize(111)) - Cannot pick org.apache.hadoop.mapred.LocalClientProtocolProvider as the ClientProtocolProvider - returned null protocol 2011-11-28 16:34:27,736 INFO mapreduce.Cluster (Cluster.java:initialize(111)) - Cannot pick org.apache.hadoop.mapred.JobTrackerClientProtocolProvider as the ClientProtocolProvider - returned null protocol java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses. at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:123) at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:85) at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:78) at org.apache.hadoop.mapred.JobClient.init(JobClient.java:460) at org.apache.hadoop.mapred.JobClient.init(JobClient.java:450) at org.apache.hadoop.examples.RandomWriter.run(RandomWriter.java:246) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69) at org.apache.hadoop.examples.RandomWriter.main(RandomWriter.java:294) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72) at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:189) 2011/11/28 Stephen Boesch java...@gmail.com I had mentioned in the original post that the configuration files were set up exactly as in the cloudera post. That includes the yarn-site.xml. but since there are questions about it, i'll go ahead and include those below. This setup DID work one time, just does not restart properly.. yarn-site.xml ?xml version=1.0? 
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>yarn.user</name>
    <value>had</value>
  </property>
</configuration>

hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>

mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

thx 2011/11/28 Stephen Boesch java...@gmail.com Yes I did both of those already. 2011/11/28 Marcos Luis Ortiz Valmaseda marcosluis2...@googlemail.com 2011/11/28 Stephen Boesch java...@gmail.com: Hi I set up a pseudo cluster according to the instructions here http://www.cloudera.com/blog/2011/11/building-and-deploying-mr2/. Initially the randomwriter example worked. But after a crash on the machine and restarting the services I am getting the errors shown below. Jps seems to think the processes are running properly: had@mithril:/shared/hadoop$ jps 7980 JobHistoryServer 7668 NameNode 7821
Re: MRv2 with other hadoop projects
thanks, I looked into it more. As you mentioned there are patches for PIG available. Hive is a WIP: Tom White and a couple of other core committers are getting close on the 0.23.0 branch. Hbase is farther out, with more work to do. stephenb 2011/11/27 Mahadev Konar maha...@hortonworks.com Stephen, Coming up PIG 0.9.2 and 0.10 are supposed to work with 0.23 (mrv2). Hive - I am not too sure of, you might want to check there mailing lists. As for HBase, there is some work needed (see: HBASE-4813 and MR-3169) to make it work with MRv2. Hope that helps. thanks mahadev On Nov 26, 2011, at 5:11 PM, Stephen Boesch wrote: Hi, Any work out there on using hbase, hive, pig with MRv2? thx! stephenb
MRv2 with other hadoop projects
Hi, Any work out there on using hbase, hive, pig with MRv2? thx! stephenb
Re: Gratuitous use of CHMOD in windows
Hi, you'll probably need to be running this under cygwin since windows native is not supported. 2011/11/21 Steve Lewis lordjoe2...@gmail.com I am running a job on a cluster launching from a windows box and fs.default.name to point the job to the cluster. Everything works until the last step where I say FileSystem fileSystem = FileSystem.get(config); // this is hdfs fileSystem.copyToLocalFile(src, dst); // local system at this point RawLocalFileSystem tries to exec chmod which is not on windows. This looks like a bug - it seems that there should be a fallback system or better yet a call to the Java standard File.canWrite command before attempting to alter permissions -- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com
Re: Matrix multiplication in Hadoop
Hi, there are two solutions suggested that take advantage of either (a) a vector x matrix (your CF / Mahout example ) or (b) a small matrix x large matrix (an earlier suggestion of putting the small matrix into the Distributed Cache). Not clear yet on good approaches of (c) large matrix x large matrix. 2011/11/19 bejoy.had...@gmail.com Hey Mike In mahout one place where matrix multiplication is used is in Collaborative Filtering distributed implementation. The recommendations here are generated by the multiplication of a cooccurence matrix with a user vector. This user vector is treated as a single column matrix and then the matrix multiplication takes place in there. Regards Bejoy K S -Original Message- From: Mike Spreitzer mspre...@us.ibm.com Date: Fri, 18 Nov 2011 14:52:05 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: RE: Matrix multiplication in Hadoop Well, this mismatch may tell me something interesting about Hadoop. Matrix multiplication has a lot of inherent parallelism, so from very crude considerations it is not obvious that there should be a mismatch. Why is matrix multiplication ill-suited for Hadoop? BTW, I looked into the Mahout documentation some, and did not find matrix multiplication there. It might be hidden inside one of the advertised algorithms; I looked at the documentation for a few, but did not notice mention of MM. Thanks, Mike From: Michael Segel michael_se...@hotmail.com To: common-user@hadoop.apache.org Date: 11/18/2011 01:49 PM Subject:RE: Matrix multiplication in Hadoop Ok Mike, First I admire that you are studying Hadoop. To answer your question... not well. Might I suggest that if you want to learn Hadoop, you try and find a problem which can easily be broken in to a series of parallel tasks where there is minimal communication requirements between each task? No offense, but if I could make a parallel... what you're asking is akin to taking a normalized relational model and trying to run it as is in HBase. Yes it can be done. But not the best use of resources. To: common-user@hadoop.apache.org CC: common-user@hadoop.apache.org Subject: Re: Matrix multiplication in Hadoop From: mspre...@us.ibm.com Date: Fri, 18 Nov 2011 12:39:00 -0500 That's also an interesting question, but right now I am studying Hadoop and want to know how well dense MM can be done in Hadoop. Thanks, Mike From: Michel Segel michael_se...@hotmail.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Date: 11/18/2011 12:34 PM Subject:Re: Matrix multiplication in Hadoop Is Hadoop the best tool for doing large matrix math. Sure you can do it, but, aren't there better tools for these types of problems? Sent from a remote device. Please excuse any typos... Mike Segel
Namenode in inconsistent state: how to reinitialize the storage directory
I am relatively new here and starting the CDH3u1 (on vmware). The nameserver is not coming up due to the following error: 2011-10-25 22:47:00,547 INFO org.apache.hadoop.hdfs.server.common.Storage: Cannot access storage directory /var/lib/hadoop-0.20/cache/hadoop/dfs/name 2011-10-25 22:47:00,549 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed. org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /var/lib/hadoop-0.20/cache/hadoop/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible. at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:305) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:99) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:358) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:327) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:271) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:465) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1224) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1233) Now, I first noticed there was a lock file. So I sudo rm'ed it and retried. But same error. Then, not knowing what files are required (if any) to restart, I moved the entire dir and created a new empty one. Here are both the new and the 'sav dirs cloudera@cloudera-demo:/usr/lib/hadoop/logs$ ll /var/lib/hadoop-0.20/cache/hadoop/dfs/name total 8 drwxr-xr-x 2 hdfs hdfs 4096 2011-10-25 23:11 . drwxr-xr-x 4 hdfs hadoop 4096 2011-10-25 23:11 .. cloudera@cloudera-demo:/usr/lib/hadoop/logs$ ll /var/lib/hadoop-0.20/cache/hadoop/dfs/name.sav total 20 drwxr-xr-x 2 hdfs hdfs 4096 2011-01-24 15:24 image drwxr-xr-x 2 hdfs hdfs 4096 2011-09-25 11:49 previous.checkpoint drwxr-xr-x 2 hdfs hdfs 4096 2011-10-25 21:01 current drwxr-xr-x 5 hdfs hdfs 4096 2011-10-25 23:02 . drwxr-xr-x 4 hdfs hadoop 4096 2011-10-25 23:11 .. So then, any recommendations on how to proceed? thanks
Re: Namenode in inconsistent state: how to reinitialize the storage directory
I found a suggestion to reformat the namenode. In order to do so, I found it necessary to set the dir to 777. After:

$ sudo chmod 777 /var/lib/hadoop-0.20/cache/hadoop/dfs/name
$ ./hadoop namenode -format (successful)
$ ./hadoop-daemon.sh --config $HADOOP/conf start namenode (success!)

So this leads to a related question: *what gives with these permissions?* Maybe this is *cloudera* specific. I am logged in as the cloudera user, but these directories have owners/groups with a mix of hadoop, mapred, hbase, hdfs, etc. When I look in /etc/passwd and /etc/group there is no clear indication that cloudera should be able to access files owned by members of those groups. Where is there more info about making the file permissions happy when running the various hadoop services as the cloudera user? I am on CDH3u1. thx

2011/10/25 Stephen Boesch java...@gmail.com

I am relatively new here and am starting up CDH3u1 (on VMware). The namenode is not coming up due to the following error:

2011-10-25 22:47:00,547 INFO org.apache.hadoop.hdfs.server.common.Storage: Cannot access storage directory /var/lib/hadoop-0.20/cache/hadoop/dfs/name
2011-10-25 22:47:00,549 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /var/lib/hadoop-0.20/cache/hadoop/dfs/name is in an inconsistent state: storage directory does not exist or is not accessible.
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:305)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:99)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:358)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:327)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:271)
at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:465)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1224)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1233)

Now, I first noticed there was a lock file, so I sudo rm'ed it and retried. Same error. Then, not knowing what files are required (if any) to restart, I moved the entire dir and created a new empty one. Here are both the new and the 'sav' dirs:

cloudera@cloudera-demo:/usr/lib/hadoop/logs$ ll /var/lib/hadoop-0.20/cache/hadoop/dfs/name
total 8
drwxr-xr-x 2 hdfs hdfs 4096 2011-10-25 23:11 .
drwxr-xr-x 4 hdfs hadoop 4096 2011-10-25 23:11 ..
cloudera@cloudera-demo:/usr/lib/hadoop/logs$ ll /var/lib/hadoop-0.20/cache/hadoop/dfs/name.sav
total 20
drwxr-xr-x 2 hdfs hdfs 4096 2011-01-24 15:24 image
drwxr-xr-x 2 hdfs hdfs 4096 2011-09-25 11:49 previous.checkpoint
drwxr-xr-x 2 hdfs hdfs 4096 2011-10-25 21:01 current
drwxr-xr-x 5 hdfs hdfs 4096 2011-10-25 23:02 .
drwxr-xr-x 4 hdfs hadoop 4096 2011-10-25 23:11 ..

So then, any recommendations on how to proceed? Thanks.
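As a side note on the permissions question: rather than opening the directory to 777, it may be cleaner to give ownership to the user the daemon actually runs as. The sketch below is based on the usual CDH3 layout (HDFS daemons run as the 'hdfs' user, MapReduce daemons as 'mapred', both in the 'hadoop' group), which is consistent with the hdfs/hadoop owners shown in the listings above; service and script names may differ on a particular VM, so treat it as a suggestion rather than a recipe.

# Untested sketch; adjust paths/users to your install.
# Give the namenode directory back to the hdfs user instead of chmod 777:
sudo chown -R hdfs:hdfs /var/lib/hadoop-0.20/cache/hadoop/dfs/name
sudo chmod 700 /var/lib/hadoop-0.20/cache/hadoop/dfs/name

# Format and start the namenode as the hdfs user rather than the cloudera user:
sudo -u hdfs hadoop namenode -format
sudo /etc/init.d/hadoop-0.20-namenode start   # init-script name per the CDH3 packages; adjust if yours differs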
Suggestions on hadoop using virtualization on local hardware but can't use Hypervisor
Hi, I have been struggling for a long time with how to set up a hadoop test environment with 6 to 10 nodes. I have some concerns about using S3/EC2 but have not ruled it out. Another direction: I have set up a higher-end box with the intent of using virtualization. The box is a Core i7 2600K with an MSI P67A-GD65 B3 motherboard. This one apparently does *not* support either Citrix XenServer or the VMware hypervisor (rats!!). That puts me back at the 'drawing board'. But I already have this really fast box just sitting there. Any suggestions for other ways to employ this machine to get the job done? thanks stephen
Re: Estimating Time required to compute M/Rjob
You could consider two scenarios / sets of requirements for your estimator:

1. Allow it to 'learn' from certain input data and then project running times of similar (or moderately dissimilar) workloads. So the first step could be to define a couple of relatively small control M/R jobs on a small-ish dataset and throw them at the unknown (cluster-under-test) hdfs/M/R cluster. Try to design the control M/R job in a way that it will be able to completely load down all of the available DataNodes in the cluster-under-test for at least a brief period of time. Then you will have obtained a decent signal on the capabilities of the cluster under test, which may allow a relatively high degree of predictive accuracy for even much larger jobs.

2. If instead it were your goal to drive the predictions off of a purely mathematical model - in your terms the application and base file system - and without any empirical data - then here is an alternative approach:
- Follow step (1) above against a variety of applications and base file systems - especially in configurations for which you wish your model to provide high-quality predictions.
- Save the results in structured data.
- Derive formulas for characterizing the curves of performance via those variables that you defined (application / base file system); a small illustration of such a fit appears after this message.

Now you have a trained model. When it is applied to a new set of applications / base file systems it can use the curves you have already determined to provide the result without any runtime requirements. Obviously the value of this second approach is limited by the degree of similarity of the training data to the applications you attempt to model. If all of your training data is on a 50-node cluster with machines using IDE drives, don't expect good results when asked to model a 1000-node cluster using SANs / RAID arrays / SCSI disks.

2011/4/16 Sonal Goyal sonalgoy...@gmail.com

What is your MR job doing? What is the amount of data it is processing? What kind of a cluster do you have? Would you be able to share some details about what you are trying to do?
If you are looking for metrics, you could look at the Terasort run.

Thanks and Regards,
Sonal
Hadoop ETL and Data Integration: https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal

On Sat, Apr 16, 2011 at 3:31 PM, real great.. greatness.hardn...@gmail.com wrote:

Hi,
As a part of my final year BE project I want to estimate the time required by an M/R job given an application and a base file system. Can you folks please help me by posting some thoughts on this issue or posting some links here.
--
Regards,
R.V.
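Regarding the curve-fitting step above, here is a tiny, self-contained sketch of fitting a linear runtime model t(n) = a + b*n to control-job measurements and using it to predict a larger job. The class name and the measurement values are illustrative assumptions, not real data; a serious model would likely need more variables (node count, reducer count, shuffle volume), but the mechanics are the same.

// Ordinary least-squares fit over (inputSize, observedSeconds) pairs; sketch only.
public class RuntimeModel {
  private double a; // fixed overhead (seconds)
  private double b; // seconds per unit of input (e.g. per GB)

  public void fit(double[] inputSizes, double[] observedSeconds) {
    int n = inputSizes.length;
    double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
    for (int i = 0; i < n; i++) {
      sumX += inputSizes[i];
      sumY += observedSeconds[i];
      sumXY += inputSizes[i] * observedSeconds[i];
      sumXX += inputSizes[i] * inputSizes[i];
    }
    b = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    a = (sumY - b * sumX) / n;
  }

  public double predictSeconds(double inputSize) {
    return a + b * inputSize;
  }

  public static void main(String[] args) {
    RuntimeModel model = new RuntimeModel();
    // Hypothetical control-job measurements: 10, 20 and 40 GB inputs.
    model.fit(new double[] {10, 20, 40}, new double[] {120, 210, 400});
    System.out.printf("predicted runtime for 100 GB: %.0f s%n",
        model.predictSeconds(100));
  }
}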
Re: Estimating Time required to compute M/Rjob
Some additional thoughts about the 'variables' involved in characterizing the M/R application itself:

- the configuration of the cluster for numbers of mappers vs reducers, compared to the characteristics (amount of work/processing) required in each of the map/shuffle/reduce stages (a short example of setting the reducer count follows below this message)
- is the application using multiple chained M/R stages? Multi-stage M/R's are more difficult to tune properly in terms of keeping all workers busy. That may be challenging to model.

2011/4/16 Stephen Boesch java...@gmail.com

You could consider two scenarios / sets of requirements for your estimator:

1. Allow it to 'learn' from certain input data and then project running times of similar (or moderately dissimilar) workloads. So the first step could be to define a couple of relatively small control M/R jobs on a small-ish dataset and throw them at the unknown (cluster-under-test) hdfs/M/R cluster. Try to design the control M/R job in a way that it will be able to completely load down all of the available DataNodes in the cluster-under-test for at least a brief period of time. Then you will have obtained a decent signal on the capabilities of the cluster under test, which may allow a relatively high degree of predictive accuracy for even much larger jobs.

2. If instead it were your goal to drive the predictions off of a purely mathematical model - in your terms the application and base file system - and without any empirical data - then here is an alternative approach:
- Follow step (1) above against a variety of applications and base file systems - especially in configurations for which you wish your model to provide high-quality predictions.
- Save the results in structured data.
- Derive formulas for characterizing the curves of performance via those variables that you defined (application / base file system).

Now you have a trained model. When it is applied to a new set of applications / base file systems it can use the curves you have already determined to provide the result without any runtime requirements. Obviously the value of this second approach is limited by the degree of similarity of the training data to the applications you attempt to model. If all of your training data is on a 50-node cluster with machines using IDE drives, don't expect good results when asked to model a 1000-node cluster using SANs / RAID arrays / SCSI disks.

2011/4/16 Sonal Goyal sonalgoy...@gmail.com

What is your MR job doing? What is the amount of data it is processing? What kind of a cluster do you have? Would you be able to share some details about what you are trying to do?
If you are looking for metrics, you could look at the Terasort run.

Thanks and Regards,
Sonal
Hadoop ETL and Data Integration: https://github.com/sonalgoyal/hiho
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal

On Sat, Apr 16, 2011 at 3:31 PM, real great.. greatness.hardn...@gmail.com wrote:

Hi,
As a part of my final year BE project I want to estimate the time required by an M/R job given an application and a base file system. Can you folks please help me by posting some thoughts on this issue or posting some links here.
--
Regards,
R.V.
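On the first bullet: the reducer count for a control job can be pinned explicitly, while the mapper count is driven by the number of input splits (input size and block size) rather than set directly. The snippet below is illustrative only; the job name and reducer count are arbitrary.

// Sketch of configuring a control job's reducer count (Hadoop "new" mapreduce API).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ControlJobSetup {
  public static Job configure(Configuration conf) throws Exception {
    Job job = new Job(conf, "control-job"); // hypothetical job name
    job.setNumReduceTasks(8);               // explicit reducer count
    // roughly equivalent config property: mapred.reduce.tasks=8
    return job;
  }
}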
Re: Birthday Calendar
Forum moderator: pls mark emails from this user as spam. 2011/4/10 Tiru Murugan veera.tirumurugan...@gmail.com Hi I am creating a birthday calendar of all my friends and family. Can you please click on the link below to enter your birthday for me? http://www.birthdayalarm.com/bd2/86124361a478169686b1536202358c505931142d1386 Thanks, Tiru
Re: running local hadoop job in windows
Another approach you may well have already considered, but might reconsider: use the (free version of) VMware Player running on your winXXX environment (host) and install a Linux distro of your choice as the guest O/S. You can spin up essentially any number of instances that way, and not be concerned about the configuration/behavioral discrepancies between cygwin and native Linux. If you wish, there are pre-packaged VMware distros for Cloudera, Apache, and Yahoo!.

2011/4/5 Tish Heyssel tish...@gmail.com

Mark,
Make sure you add the cygwin/bin to your global PATH variable in Windows too... and echo the PATH, if you're running in a command window, to make sure it shows up there... When running through eclipse, it should pick up the PATH variable.
Good luck. It's worth the trouble. This does work.
tish

On Mon, Apr 4, 2011 at 10:13 PM, Mark Kerzner markkerz...@gmail.com wrote:

I understand now, I need to install cygwin correctly, asking it for all the right options.
Thank you, Mark

On Mon, Apr 4, 2011 at 9:06 PM, Lance Norskog goks...@gmail.com wrote:

You're stuck with cygwin! Hadoop insists on running the 'chmod' program. You have to have a binary in your search path.
Lance

On Sat, Mar 19, 2011 at 9:15 PM, Mark Kerzner markkerz...@gmail.com wrote:

Now I AM running under cygwin, and I get the same error, as you can see from the attached screenshot.
Thank you, Mark

On Sat, Mar 19, 2011 at 9:16 PM, Simon gsmst...@gmail.com wrote:

As far as I know, currently hadoop can only run under *nix-like systems. Correct me if I am wrong. And if you want to run it under windows, you can try cygwin as the environment.
Thanks
Simon

On Fri, Mar 18, 2011 at 7:11 PM, Mark Kerzner markkerz...@gmail.com wrote:

No, I hoped that it is not absolutely necessary for that kind of use. I am not even issuing the hadoop -jar command, but it is pure java -jar. It is true though that my Ubuntu has Hadoop set up, so maybe it is doing a lot of magic behind my back. I did not want my inexperienced Windows users to have to install cygwin just for trying the package.
Thank you, Mark

On Fri, Mar 18, 2011 at 6:06 PM, Stephen Boesch java...@gmail.com wrote:

Presumably you ran this under cygwin?

2011/3/18 Mark Kerzner markkerz...@gmail.com

Hi, guys,
I want to give my users a sense of what my hadoop application can do, and I am trying to make it run in Windows, with this command:
java -jar dist\FreeEed.jar
This command runs my hadoop job locally, and it works in Linux. However, in Windows I get the error listed below. Since I am running completely locally, I don't see why it is trying to do what it does. Is there a workaround?
Thank you, Mark

Error:
11/03/18 17:57:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
java.io.IOException: Failed to set permissions of path: file:/tmp/hadoop-Mark/mapred/staging/Mark-1397630897/.staging to 0700
at org.apache.hadoop.fs.RawLocalFileSystem.checkReturnValue(RawLocalFileSystem.java:526)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:500)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:310)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:799)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:793)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:793)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:495)
at org.frd.main.FreeEedProcess.run(FreeEedProcess.java:66)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.frd.main.FreeEedProcess.main
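For anyone hitting this on Windows, here is one way to apply the PATH fix Tish describes above, so that Hadoop can find the chmod binary it insists on running. This assumes a default Cygwin install at C:\cygwin; adjust the path if yours lives elsewhere.

REM Append Cygwin's bin directory to the user PATH (assumed install location).
setx PATH "%PATH%;C:\cygwin\bin"

REM Open a NEW command window (setx does not affect the current one), then
REM verify that chmod is now resolvable:
where chmod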
location for mapreduce next generation branch
Looking under http://svn.apache.org/repos/asf/hadoop/mapreduce/branches/ it does not seem to be present. Pointers to the correct location would be appreciated.
Re: How to insert some print codes into Hadoop?
Keep in mind, if going in that direction, that AspectJ has limited support in some IDEs (e.g. IntelliJ), which can complicate development.

2011/3/23 Konstantin Boudnik c...@apache.org

[Moving to common-user@, Bcc'ing general@]
If you know where you need to have your print statements you can use AspectJ to do runtime injection of the needed java code into the desired spots. You don't even need to touch the source code for that - just instrument (weave) the jar file.
--
Take care,
Konstantin (Cos) Boudnik

On Tue, Mar 22, 2011 at 09:19, Bo Sang sampl...@gmail.com wrote:

Hi, guys:
I would like to make some minor modifications to Hadoop (just to insert some print statements into particular places). And I have the following questions:
1. It seems there are three parts of Hadoop: common, hdfs, mapred. And they are packed as three independent jar packages. Could I modify only one part (e.g. common) and pack a new jar package without modifying the other two?
2. I have tried to import the folder hadoop-0.21.0/common into eclipse as a project, but eclipse fails to recognize it as an existing project. If I import the folder hadoop-0.21.0 as an existing project, it works. However, I only want to modify the common part. How could I modify only the common part and export a new common jar package without touching the other two parts?
--
Best Regards!
Sincerely
Bo Sang
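For what it's worth, here is a minimal sketch of the AspectJ route Cos describes: print a line whenever a chosen Hadoop method is entered, without touching the Hadoop sources. The pointcut target below (FSDirectory.loadFSImage, a method visible in the stack traces earlier in this thread archive) is just an example; point it at whatever code path you want to trace, then weave the aspect into the corresponding jar with ajc or use load-time weaving.

// Sketch only: the aspect name and the traced method are illustrative choices.
public aspect PrintTracing {

  pointcut traced():
      execution(* org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(..));

  before(): traced() {
    // Prints the fully qualified signature of the intercepted method on entry.
    System.out.println("[trace] entering "
        + thisJoinPointStaticPart.getSignature());
  }
}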
Re: Hadoop Jar
Hola Miguel, wondering which Main method you are referring to that is "defined to run". Can you be specific?

2011/3/22 Miguel Costa miguel-co...@telecom.pt

I already fixed it; I had the Main class defined in my project. Even if I execute hadoop -jar myjar.jar mymainClass, mymainClass is never executed. The project cannot have a Main method defined to run if it has more than one and we want to run different Main methods.
Miguel

*From:* Miguel Costa [mailto:miguel-co...@telecom.pt]
*Sent:* Tuesday, March 22, 2011 16:18
*To:* common-user@hadoop.apache.org
*Subject:* Hadoop Jar

Hi,
I'm trying to execute a jar file from:
./bin/hadoop jar /mypath/myjar mymainclass
But the jar that gets executed is an old jar. If I execute the jar from java -cp myjar mymainclass it runs fine.
What am I doing wrong?
Thanks,
Miguel
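For reference, this matches how I understand Hadoop's RunJar behaves (class names below are placeholders): the class name on the command line is only used when the jar's manifest does not already declare a Main-Class; if the manifest does declare one, that class runs and the name you pass becomes just another argument, which would explain the symptom above.

# If myjar.jar was built WITHOUT a Main-Class manifest entry, the class named
# on the command line is the one that runs, so different entry points work:
./bin/hadoop jar /mypath/myjar.jar com.example.FirstMain arg1
./bin/hadoop jar /mypath/myjar.jar com.example.SecondMain arg1

# If the manifest DOES declare a Main-Class, `hadoop jar` runs that class and
# treats "com.example.SecondMain" as an ordinary argument instead.
# (Note also that the command is `hadoop jar`, with no dash before "jar".)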
Re: running local hadoop job in windows
Presumably you ran this under cygwin?

2011/3/18 Mark Kerzner markkerz...@gmail.com

Hi, guys,
I want to give my users a sense of what my hadoop application can do, and I am trying to make it run in Windows, with this command:
java -jar dist\FreeEed.jar
This command runs my hadoop job locally, and it works in Linux. However, in Windows I get the error listed below. Since I am running completely locally, I don't see why it is trying to do what it does. Is there a workaround?
Thank you, Mark

Error:
11/03/18 17:57:43 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
java.io.IOException: Failed to set permissions of path: file:/tmp/hadoop-Mark/mapred/staging/Mark-1397630897/.staging to 0700
at org.apache.hadoop.fs.RawLocalFileSystem.checkReturnValue(RawLocalFileSystem.java:526)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:500)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:310)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:799)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:793)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1063)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:793)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:465)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:495)
at org.frd.main.FreeEedProcess.run(FreeEedProcess.java:66)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.frd.main.FreeEedProcess.main(FreeEedProcess.java:71)
at org.frd.main.FreeEedMain.runProcessing(FreeEedMain.java:88)
at org.frd.main.FreeEedMain.processOptions(FreeEedMain.java:65)
at org.frd.main.FreeEedMain.main(FreeEedMain.java:31)