Re: Hive/Hdfs Connector
Sandeep, You can use the HiveDriver class for this... http://hive.apache.org/docs/r0.7.0/api/org/apache/hadoop/hive/jdbc/HiveDriver.html Regards, Srinivas On Thu, Jul 5, 2012 at 11:02 AM, Sandeep Reddy P sandeepreddy.3...@gmail.com wrote: Hi, We have an application which generates SQL queries and connects to an RDBMS using connectors like JDBC for MySQL. Now, if we generate HQL using our application, is there any way to connect to Hive/HDFS using connectors? I need help on which connectors I have to use. We don't want to pull data from Hive/HDFS to the RDBMS; instead we need our application to connect to Hive/HDFS directly. -- Thanks, sandeep -- -- Srinivas srini...@cloudwick.com
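To make the suggestion above concrete, here is a minimal sketch of a JDBC client using that driver. The host, port, and table name are placeholders, not values from this thread; HiveServer is assumed to be running on its default port 10000.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class HiveJdbcExample {
      public static void main(String[] args) throws Exception {
          // Register the JDBC driver shipped with Hive 0.7.x
          Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
          // Connect to a running HiveServer instance (placeholder host/port)
          Connection con = DriverManager.getConnection(
                  "jdbc:hive://localhost:10000/default", "", "");
          Statement stmt = con.createStatement();
          // Submit the HQL generated by the application
          ResultSet rs = stmt.executeQuery("SELECT * FROM sample_table LIMIT 10");
          while (rs.next()) {
              System.out.println(rs.getString(1));
          }
          con.close();
      }
  }

The same connection can be fed whatever HQL strings the application generates, exactly as it does today with the MySQL JDBC connector.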
Re: Hadoop with Sharded MySql
All, I'm trying to get data into HDFS directly from the sharded databases and expose it to the existing Hive infrastructure. (We are currently doing it this way: mysql -> staging server -> hdfs put commands -> hdfs, which is taking a lot of time.) If we have a way of running a single sqoop job across all shards for a single table, I believe it makes life easier in terms of monitoring and exception handling. Thanks, Srinivas On Fri, Jun 1, 2012 at 1:27 AM, anil gupta anilgupt...@gmail.com wrote: Hi Sujith, Srinivas is asking how to import data into HDFS using sqoop. I believe he must have thought it out well before designing the entire architecture/solution. He has not specified whether he would like to modify the data or not. Whether to use Hive or HBase is a different question altogether and depends on his use-case. Thanks, Anil On Thu, May 31, 2012 at 9:52 PM, Sujit Dhamale sujitdhamal...@gmail.com wrote: Hi, instead of pulling 70K tables from mysql into hdfs, take a dump of all 30 tables and put it into the HBase database. If you pull 70K tables from mysql into hdfs, you need to use Hive, but modification will not be possible in Hive :( *@ common-user:* please correct me if I am wrong. Kind Regards Sujit Dhamale (+91 9970086652) On Fri, Jun 1, 2012 at 5:42 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Maybe you can do some VIEWs or unions or merge tables on the mysql side to overcome the aspect of launching so many sqoop jobs. On Thu, May 31, 2012 at 6:02 PM, Srinivas Surasani hivehadooplearn...@gmail.com wrote: All, We are trying to implement sqoop in our environment, which has 30 sharded mysql database servers; each has around 30 databases with 150 tables in each database, all horizontally sharded (meaning the data is divided across all the tables in mysql). The problem is that we have a total of around 70K tables which need to be pulled from mysql into hdfs. So, my question is whether generating 70K sqoop commands and running them in parallel is feasible or not. Also, doing incremental updates is going to mean invoking another 70K sqoop jobs, which in turn kick off map-reduce jobs. The main problem is monitoring and managing this huge number of jobs. Can anyone suggest the best way of doing this, and is sqoop a good candidate for this type of scenario? Currently the same process is done by generating tsv files on the mysql server, dumping them onto a staging server, and from there generating hdfs put statements. Appreciate your suggestions !!! Thanks, Srinivas Surasani -- Thanks Regards, Anil Gupta -- Regards, -- Srinivas srini...@cloudwick.com
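As a rough sketch of what each of those per-table imports looks like (the connection string, credentials, paths, and parallelism below are placeholders, not values from this thread), a single sqoop command covers one table on one shard, which is why the job count multiplies so quickly:

  sqoop import \
    --connect jdbc:mysql://shard01.example.com/salesdb \
    --username etl -P \
    --table orders \
    --target-dir /user/etl/shard01/salesdb/orders \
    -m 4

Incremental loads add flags such as --incremental append with --check-column and --last-value, so every shard/table combination still needs its own invocation (or a wrapper script that generates them), which is the monitoring burden discussed above.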
Re: BZip2 Splittable?
@Daniel, If you want to process bz2 files in parallel (more than one mapper/reducer), you can go for Pig. See below. Pig has inbuilt support for processing .bz2 files in parallel (.gz support is coming soon). If the input file name extension is .bz2, Pig decompresses the file on the fly and passes the decompressed input stream to your load function. Regards, On Fri, Feb 24, 2012 at 2:59 PM, Rohit ro...@hortonworks.com wrote: Hi Daniel, Because your MapReduce jobs will not split bzip2 files, each entire bzip2 file will be processed by one Map task. Thus, if your job takes multiple bzip2 text files as the input, then you'll have as many Map tasks as you have files running in parallel. The Map tasks will be run by your TaskTrackers. Usually the cluster setup has the DataNode and the TaskTracker processes running on the same machines - so with 6 data nodes, you have 6 tasktrackers. Hope that answers your question. Rohit Bakhshi www.hortonworks.com (http://www.hortonworks.com/) On Friday, February 24, 2012 at 7:59 AM, Daniel Baptista wrote: Hi Rohit, thanks for the response, this is pretty much as I expected and hopefully adds weight to my other thoughts... Could this mean that all my datanodes are being sent all of the data, or that only one datanode is executing the job? Thanks again, Dan. -Original Message- From: Rohit Bakhshi [mailto:ro...@hortonworks.com] Sent: 24 February 2012 15:54 To: common-user@hadoop.apache.org (mailto:common-user@hadoop.apache.org) Subject: Re: BZip2 Splittable? Daniel, I just noticed your Hadoop version - 0.20.2. The JIRA fix below is for Hadoop 0.21.0, which is a different version. So it may not be supported on your version of Hadoop. -- Rohit Bakhshi www.hortonworks.com (http://www.hortonworks.com/) On Friday, February 24, 2012 at 7:49 AM, Rohit Bakhshi wrote: Hi Daniel, The bzip2 compression codec allows for splittable files. According to this Hadoop JIRA improvement, splitting of bzip2 compressed files in Hadoop jobs is supported: https://issues.apache.org/jira/browse/HADOOP-4012 -- Rohit Bakhshi www.hortonworks.com (http://www.hortonworks.com/) On Friday, February 24, 2012 at 7:43 AM, Daniel Baptista wrote: Hi All, I have a cluster of 6 datanodes, all running hadoop version 0.20.2, r911707, that take a series of bzip2 compressed text files as input. I have read conflicting articles regarding whether or not hadoop can split these bzip2 files; can anyone give me a definite answer? Thanks in advance, Dan.
-- Regards, -- Srinivas srini...@cloudwick.com
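To make the Pig suggestion at the top of this thread concrete, a minimal script like the following reads bzip2 input without any extra configuration (the path is a placeholder; Pig picks the decompression codec purely from the .bz2 extension):

  -- Pig decompresses .bz2 input on the fly before it reaches the load function
  logs   = LOAD '/data/logs/*.bz2' USING PigStorage('\t');
  grpall = GROUP logs ALL;
  cnt    = FOREACH grpall GENERATE COUNT(logs);
  DUMP cnt;

Whether each file is then split across multiple mappers still depends on the Hadoop version underneath, as Rohit notes for 0.20.2 versus 0.21.0.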
Re: Processing small xml files
Hi Mohit, You can use Pig for processing XML files. PiggyBank has a built-in load function to load XML files. Also, you can specify pig.maxCombinedSplitSize and pig.splitCombination for efficient processing. On Sat, Feb 18, 2012 at 1:18 AM, Mohit Anchlia mohitanch...@gmail.com wrote: On Tue, Feb 14, 2012 at 10:56 AM, W.P. McNeill bill...@gmail.com wrote: I'm not sure what you mean by flat format here. In my scenario, I have a file input.xml that looks like this: <myfile> <section> <value>value1</value> </section> <section> <value>value2</value> </section> </myfile> input.xml is a plain text file, not a sequence file. If I read it with the XMLInputFormat, my mapper gets called with (key, value) pairs that look like this: (, <section><value>value1</value></section>) (, <section><value>value2</value></section>) where the keys are numerical offsets into the file. I then use this information to write a sequence file with these (key, value) pairs. So my Hadoop job that uses XMLInputFormat takes a text file as input and produces a sequence file as output. I don't know a rule of thumb for how many small files is too many. Maybe someone else on the list can chime in. I just know that when your throughput gets slow, that's one possible cause to investigate. I need to install hadoop. Does this xml input format come as part of the install? Can you please give me some pointers that would help me install hadoop and xmlinputformat if necessary? -- -- Srinivas srini...@cloudwick.com
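As a sketch of the PiggyBank loader mentioned above (the jar path, input path, tag name, and split size are placeholders; XMLLoader hands each matching element to the script as one chararray field):

  REGISTER /path/to/piggybank.jar;
  -- combine many small XML files into fewer map inputs
  SET pig.splitCombination true;
  SET pig.maxCombinedSplitSize 134217728;
  raw = LOAD '/data/xml/*.xml'
        USING org.apache.pig.piggybank.storage.XMLLoader('section')
        AS (doc:chararray);
  DUMP raw;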
Re: memory of mappers and reducers
1) Yes, option 2 is enough. 2) The configuration variable mapred.child.ulimit can be used to control the maximum virtual memory of the child (map/reduce) processes. ** The value of mapred.child.ulimit should be greater than the value of mapred.child.java.opts. On Thu, Feb 16, 2012 at 12:38 AM, Mark question markq2...@gmail.com wrote: Thanks for the reply Srinivas, so option 2 will be enough; however, when I tried setting it to 512MB, I see through the system monitor that the map task is given 275MB of real memory!! Is that normal in hadoop, to go over the upper bound of memory given by the property mapred.child.java.opts? Mark On Wed, Feb 15, 2012 at 4:00 PM, Srinivas Surasani vas...@gmail.com wrote: Hey Mark, Yes, you can limit the memory for each task with the mapred.child.java.opts property. Set this to final if no developer has to change it. A little intro to mapred.task.default.maxvmem: This property has to be set on both the JobTracker, for making scheduling decisions, and on the TaskTracker nodes, for the sake of memory management. If a job doesn't specify its virtual memory requirement by setting mapred.task.maxvmem to -1, tasks are assured a memory limit set to this property. This property is set to -1 by default. This value should in general be less than the cluster-wide configuration mapred.task.limit.maxvmem. If not, or if it is not set, the TaskTracker's memory management will be disabled and a scheduler's memory-based scheduling decisions may be affected. On Wed, Feb 15, 2012 at 5:57 PM, Mark question markq2...@gmail.com wrote: Hi, My question is what's the difference between the following two settings: 1. mapred.task.default.maxvmem 2. mapred.child.java.opts The first one is used by the TT to monitor the memory usage of tasks, while the second one is the maximum heap space assigned for each task. I want to limit each task to use up to, say, 100MB of memory. Can I use only #2? Thank you, Mark -- -- Srinivas srini...@cloudwick.com -- -- Srinivas srini...@cloudwick.com
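To illustrate the relationship described above, the two properties might be set together like this in mapred-site.xml (the values are illustrative only; mapred.child.ulimit is expressed in kilobytes of virtual memory and should stay comfortably above the child heap):

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
  <property>
    <!-- virtual memory cap for each child JVM, in KB; 1048576 KB = 1 GB -->
    <name>mapred.child.ulimit</name>
    <value>1048576</value>
  </property>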
Re: ENOENT: No such file or directory
Sumanth, what is the value set to mapred.job.reuse.jvm.num.tasks property ? On Thu, Feb 16, 2012 at 9:25 PM, Sumanth V vsumant...@gmail.com wrote: Hi, We have a 20 node hadoop cluster running CDH3 U2. Some of our jobs are failing with the following errors. We noticed that we are consistently hitting this error condition when the total number of map tasks in a particular job exceeds the total map task capacity of the cluster. Other jobs where the number of map tasks are lower than the total map task capacity fares well. Here are the lines from Job Tracker log file - 2012-02-16 15:05:28,695 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201202161408_0004_m_000169_0: ENOENT: No such file or directory at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method) at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java: 172) at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:215) at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:288) at org.apache.hadoop.mapred.Child.main(Child.java:245) Here is the task tracker log - 2012-02-16 15:05:22,126 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201202161408_0004_m_1467721896 exited with exit code 0. Number of tasks it ran: 1 2012-02-16 15:05:22,127 WARN org.apache.hadoop.mapred.TaskLogsTruncater: Exception in truncateLogs while getting allLogsFileDetails(). Ignoring the truncation of logs of this process. java.io.FileNotFoundException: /usr/lib/hadoop-0.20/logs/userlogs/ job_201202161408_0004/attempt_201202161408_0004_m_000112_1/log.index (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.init(FileInputStream.java:120) at java.io.FileReader.init(FileReader.java:55) at org.apache.hadoop.mapred.TaskLog.getAllLogsFileDetails(TaskLog.java: 110) at org.apache.hadoop.mapred.TaskLogsTruncater.getAllLogsFileDetails(TaskLogsTr uncater.java: 353) at org.apache.hadoop.mapred.TaskLogsTruncater.shouldTruncateLogs(TaskLogsTrunc ater.java: 98) at org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.doJv mFinishedAction(UserLogManager.java: 163) at org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.proc essEvent(UserLogManager.java: 137) at org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.moni tor(UserLogManager.java: 132) at org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager $1.run(UserLogManager.java:66) 2012-02-16 15:05:22,228 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201202161408_0004_m_06_0 0.0% 2012-02-16 15:05:22,228 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201202161408_0004_m_53_0 0.0% 2012-02-16 15:05:22,329 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201202161408_0004_m_57_0 0.0% Any help in resolving this issue would be highly appreciated! Let me know if any other config info is needed. Thanks, Sumanth -- -- Srinivas srini...@cloudwick.com
Re: ENOENT: No such file or directory
Sumanth, I think Sreedhar is pointing to dfs.datanode.max.xceivers property in hdfs-site.xml. Try setting this property to higher value. On Thu, Feb 16, 2012 at 9:51 PM, Sumanth V vsumant...@gmail.com wrote: ulimit values are set to much higher values than the default values Here is the /etc/security/limits.conf contents - * - nofile 64000 hdfs - nproc 32768 hdfs - stack 10240 hbase - nproc 32768 hbase - stack 10240 mapred - nproc 32768 mapred - stack 10240 Sumanth On Thu, Feb 16, 2012 at 6:48 PM, Sree K quikre...@yahoo.com wrote: Sumanth, You may want to check ulimit setting for open files. Set it to a higher value if it is at default value of 1024. Regards, Sreedhar From: Sumanth V vsumant...@gmail.com To: common-user@hadoop.apache.org Sent: Thursday, February 16, 2012 6:25 PM Subject: ENOENT: No such file or directory Hi, We have a 20 node hadoop cluster running CDH3 U2. Some of our jobs are failing with the following errors. We noticed that we are consistently hitting this error condition when the total number of map tasks in a particular job exceeds the total map task capacity of the cluster. Other jobs where the number of map tasks are lower than the total map task capacity fares well. Here are the lines from Job Tracker log file - 2012-02-16 15:05:28,695 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201202161408_0004_m_000169_0: ENOENT: No such file or directory at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method) at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java: 172) at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:215) at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:288) at org.apache.hadoop.mapred.Child.main(Child.java:245) Here is the task tracker log - 2012-02-16 15:05:22,126 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201202161408_0004_m_1467721896 exited with exit code 0. Number of tasks it ran: 1 2012-02-16 15:05:22,127 WARN org.apache.hadoop.mapred.TaskLogsTruncater: Exception in truncateLogs while getting allLogsFileDetails(). Ignoring the truncation of logs of this process. java.io.FileNotFoundException: /usr/lib/hadoop-0.20/logs/userlogs/ job_201202161408_0004/attempt_201202161408_0004_m_000112_1/log.index (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.init(FileInputStream.java:120) at java.io.FileReader.init(FileReader.java:55) at org.apache.hadoop.mapred.TaskLog.getAllLogsFileDetails(TaskLog.java: 110) at org.apache.hadoop.mapred.TaskLogsTruncater.getAllLogsFileDetails(TaskLogsTr uncater.java: 353) at org.apache.hadoop.mapred.TaskLogsTruncater.shouldTruncateLogs(TaskLogsTrunc ater.java: 98) at org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.doJv mFinishedAction(UserLogManager.java: 163) at org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.proc essEvent(UserLogManager.java: 137) at org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.moni tor(UserLogManager.java: 132) at org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager $1.run(UserLogManager.java:66) 2012-02-16 15:05:22,228 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201202161408_0004_m_06_0 0.0% 2012-02-16 15:05:22,228 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201202161408_0004_m_53_0 0.0% 2012-02-16 15:05:22,329 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201202161408_0004_m_57_0 0.0% Any help in resolving this issue would be highly appreciated! 
Let me know if any other config info is needed. Thanks, Sumanth -- -- Srinivas srini...@cloudwick.com
Re: ENOENT: No such file or directory
Sumanth, For quick check, try setting this to much bigger value( 1M ), though this is not good practice( data-node may run into out of memory). On Thu, Feb 16, 2012 at 10:21 PM, Sumanth V vsumant...@gmail.com wrote: Hi Srinivas, The *dfs.datanode.max.xcievers* value is set to 4096 in hdfs-site.xml. Sumanth On Thu, Feb 16, 2012 at 7:11 PM, Srinivas Surasani vas...@gmail.com wrote: Sumanth, I think Sreedhar is pointing to dfs.datanode.max.xceivers property in hdfs-site.xml. Try setting this property to higher value. On Thu, Feb 16, 2012 at 9:51 PM, Sumanth V vsumant...@gmail.com wrote: ulimit values are set to much higher values than the default values Here is the /etc/security/limits.conf contents - * - nofile 64000 hdfs - nproc 32768 hdfs - stack 10240 hbase - nproc 32768 hbase - stack 10240 mapred - nproc 32768 mapred - stack 10240 Sumanth On Thu, Feb 16, 2012 at 6:48 PM, Sree K quikre...@yahoo.com wrote: Sumanth, You may want to check ulimit setting for open files. Set it to a higher value if it is at default value of 1024. Regards, Sreedhar From: Sumanth V vsumant...@gmail.com To: common-user@hadoop.apache.org Sent: Thursday, February 16, 2012 6:25 PM Subject: ENOENT: No such file or directory Hi, We have a 20 node hadoop cluster running CDH3 U2. Some of our jobs are failing with the following errors. We noticed that we are consistently hitting this error condition when the total number of map tasks in a particular job exceeds the total map task capacity of the cluster. Other jobs where the number of map tasks are lower than the total map task capacity fares well. Here are the lines from Job Tracker log file - 2012-02-16 15:05:28,695 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201202161408_0004_m_000169_0: ENOENT: No such file or directory at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method) at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java: 172) at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:215) at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:288) at org.apache.hadoop.mapred.Child.main(Child.java:245) Here is the task tracker log - 2012-02-16 15:05:22,126 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201202161408_0004_m_1467721896 exited with exit code 0. Number of tasks it ran: 1 2012-02-16 15:05:22,127 WARN org.apache.hadoop.mapred.TaskLogsTruncater: Exception in truncateLogs while getting allLogsFileDetails(). Ignoring the truncation of logs of this process. 
java.io.FileNotFoundException: /usr/lib/hadoop-0.20/logs/userlogs/ job_201202161408_0004/attempt_201202161408_0004_m_000112_1/log.index (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.init(FileInputStream.java:120) at java.io.FileReader.init(FileReader.java:55) at org.apache.hadoop.mapred.TaskLog.getAllLogsFileDetails(TaskLog.java: 110) at org.apache.hadoop.mapred.TaskLogsTruncater.getAllLogsFileDetails(TaskLogsTr uncater.java: 353) at org.apache.hadoop.mapred.TaskLogsTruncater.shouldTruncateLogs(TaskLogsTrunc ater.java: 98) at org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.doJv mFinishedAction(UserLogManager.java: 163) at org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.proc essEvent(UserLogManager.java: 137) at org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.moni tor(UserLogManager.java: 132) at org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager $1.run(UserLogManager.java:66) 2012-02-16 15:05:22,228 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201202161408_0004_m_06_0 0.0% 2012-02-16 15:05:22,228 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201202161408_0004_m_53_0 0.0% 2012-02-16 15:05:22,329 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201202161408_0004_m_57_0 0.0% Any help in resolving this issue would be highly appreciated! Let me know if any other config info is needed. Thanks, Sumanth -- -- Srinivas srini...@cloudwick.com -- -- Srinivas srini...@cloudwick.com
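If raising the limit beyond 4096 is the next experiment, the property (note the historical spelling, xcievers) lives in hdfs-site.xml on each datanode and needs a datanode restart. The value below is illustrative only, and as noted above, very large values trade file-descriptor and thread headroom against datanode memory:

  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>8192</value>
  </property>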
Re: memory of mappers and reducers
Hey Mark, Yes, you can limit the memory for each task with mapred.child.java.opts property. Set this to final if no developer has to change it . Little intro to mapred.task.default.maxvmem This property has to be set on both the JobTracker for making scheduling decisions and on the TaskTracker nodes for the sake of memory management. If a job doesn't specify its virtual memory requirement by setting mapred.task.maxvmem to -1, tasks are assured a memory limit set to this property. This property is set to -1 by default. This value should in general be less than the cluster-wide configuration mapred.task.limit.maxvmem. If not or if it is not set, TaskTracker's memory management will be disabled and a scheduler's memory based scheduling decisions may be affected. On Wed, Feb 15, 2012 at 5:57 PM, Mark question markq2...@gmail.com wrote: Hi, My question is what's the difference between the following two settings: 1. mapred.task.default.maxvmem 2. mapred.child.java.opts The first one is used by the TT to monitor the memory usage of tasks, while the second one is the maximum heap space assigned for each task. I want to limit each task to use upto say 100MB of memory. Can I use only #2 ?? Thank you, Mark -- -- Srinivas srini...@cloudwick.com
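A sketch of the "set this to final" suggestion in mapred-site.xml (the heap value is illustrative and matches the 100MB limit Mark asked about; marking the property final prevents individual jobs from overriding it):

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx100m</value>
    <final>true</final>
  </property>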
Re: How to estimate hadoop?
Hey, It completely depends on your data sizes and processing. You can have anything from a one-node cluster to thousands of nodes (or more) depending on your needs. The following link may help you: http://wiki.apache.org/hadoop/HardwareBenchmarks Regards, On Wed, Feb 15, 2012 at 10:17 PM, Jinyan Xu xxjjyy2...@gmail.com wrote: Hi all, I want to use the hadoop system, but I need overall system info about hadoop, for example, system performance, memory used, cpu utilization and so on. So does anyone have a system estimate for hadoop? Which tool can do this? yours rock -- -- Srinivas srini...@cloudwick.com
Re: Understanding fair schedulers
Praveenesh, You can try setting mapred.fairscheduler.pool to your pool name while running the job. By default, mapred.fairscheduler.poolnameproperty is set to user.name (each job a user runs is allocated to his named pool), and you can also change this property to group.name. Srinivas -- Also, you can set On Wed, Jan 25, 2012 at 6:24 AM, praveenesh kumar praveen...@gmail.com wrote: Understanding Fair Schedulers better. Can we create multiple pools in the Fair Scheduler? I guess yes. Please correct me. Suppose I have 2 pools in my fair-scheduler.xml: 1. Hadoop-users: Min map: 10, Max map: 50, Min Reduce: 10, Max Reduce: 50 2. Admin-users: Min map: 20, Max map: 80, Min Reduce: 20, Max Reduce: 80 I have 5 users who will be using these pools. How will I allocate specific pools to specific users? Suppose I want user1,user2 to use the Hadoop-users pool and user3,user4,user5 to use Admin-users. In http://hadoop.apache.org/common/docs/r0.20.205.0/fair_scheduler.html they have mentioned allocations something like this:

  <?xml version="1.0"?>
  <allocations>
    <pool name="sample_pool">
      <minMaps>5</minMaps>
      <minReduces>5</minReduces>
      <maxMaps>25</maxMaps>
      <maxReduces>25</maxReduces>
      <minSharePreemptionTimeout>300</minSharePreemptionTimeout>
    </pool>
    <user name="sample_user">
      <maxRunningJobs>6</maxRunningJobs>
    </user>
    <userMaxJobsDefault>3</userMaxJobsDefault>
    <fairSharePreemptionTimeout>600</fairSharePreemptionTimeout>
  </allocations>

I tried creating more pools and that works, but how do I allocate users to specific pools? Thanks, Praveenesh
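As a sketch of the first suggestion (the jar, class, paths, and pool name are placeholders; the driver is assumed to use ToolRunner/GenericOptionsParser so -D options are picked up, and the fair scheduler is assumed to be active on the JobTracker):

  hadoop jar my-job.jar com.example.MyJob \
      -Dmapred.fairscheduler.pool=Hadoop-users \
      /input/path /output/path

Alternatively, setting mapred.fairscheduler.poolnameproperty to group.name in mapred-site.xml maps users to pools by their Unix group, so user1/user2 can land in one pool and user3/user4/user5 in the other without per-job flags.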
Re: Parallel CSV loader
Edmon, Parallel databases (Teradata, Netezza, ...)? I believe if you use Sqoop (with plain JDBC) for loading you cannot achieve full parallelism, since the table can hit deadlocks when you specify more mappers. But you can use Sqoop plus a parallel database connector (you can find them on the Cloudera site) to get the native loading utilities; for example, you can use the Teradata FastLoad utility with Sqoop and the Teradata Connector. Srinivas -- On Tue, Jan 24, 2012 at 12:38 PM, Harsh J ha...@cloudera.com wrote: Agree. Apache Sqoop is what you're looking for: http://incubator.apache.org/sqoop/ On Tue, Jan 24, 2012 at 10:51 PM, Prashant Kommireddi prash1...@gmail.com wrote: I am assuming you want to move data between Hadoop and a database. Please take a look at Sqoop. Thanks, Prashant Sent from my iPhone On Jan 24, 2012, at 9:19 AM, Edmon Begoli ebeg...@gmail.com wrote: I am looking to use Hadoop for parallel loading of CSV files into a non-Hadoop, parallel database. Is there an existing utility that allows one to pick entries, row-by-row, synchronized and in parallel, and load them into a database? Thank you in advance, Edmon -- Harsh J Customer Ops. Engineer, Cloudera
Re: Does tunning requires re-formatting Namenode ?
Hi Praveenesh, You just need to restart the hadoop services. Reformatting the namenode causes the loss of metadata and thereby the loss of the data residing on the datanodes. Regards, Srinivas, Cloudwick Technologies. On Mon, Jan 9, 2012 at 12:26 AM, praveenesh kumar praveen...@gmail.com wrote: Hey Guys, Do I need to format the namenode again if I am changing some HDFS configurations like blocksize, checksum, compression codec etc., or is there any other way to enforce these new changes in the present cluster setup? Thanks, Praveenesh
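A sketch of what "restart the hadoop services" looks like with the stock Apache 0.20.x scripts (paths and script names vary by distribution; packaged installs such as CDH use service init scripts instead):

  # run on the master after editing conf/hdfs-site.xml or conf/mapred-site.xml
  bin/stop-mapred.sh
  bin/stop-dfs.sh
  bin/start-dfs.sh
  bin/start-mapred.sh

Note that per-file settings such as block size only apply to files written after the change; existing blocks keep the size they were written with.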
Lucene Example error
When I am trying to run lucene example from command line, I get the following exception. [root@localhost lucene-3.4.0]# echo $CLASSPATH /root/lucene-3.4.0/lucene-core-3.4.0.jar:/root/lucene-3.4.0/contrib/demo/lucene-demo-3.4.0.jar:/usr/java/jdk1.6.0_26/lib [root@localhost lucene-3.4.0]# export JAVA_HOME=/usr/java/jdk1.6.0_26/ [root@localhost lucene-3.4.0]# java org.apache.lucene.demo.IndexFiles -docs /root/lucene-3.4.0/contrib/demo/src/test/org/apache/lucene/demo/test-files/docs/ Indexing to directory 'index'... Exception in thread main java.lang.NoSuchMethodError: method java.lang.Class.isAnonymousClass with signature ()Z was not found. at org.apache.lucene.analysis.Analyzer.assertFinal(Analyzer.java:57) at org.apache.lucene.analysis.Analyzer.init(Analyzer.java:45) at org.apache.lucene.analysis.ReusableAnalyzerBase.init(ReusableAnalyzerBase.java:39) at org.apache.lucene.analysis.StopwordAnalyzerBase.init(StopwordAnalyzerBase.java:60) at org.apache.lucene.analysis.standard.StandardAnalyzer.init(StandardAnalyzer.java:72) at org.apache.lucene.analysis.standard.StandardAnalyzer.init(StandardAnalyzer.java:82) at org.apache.lucene.demo.IndexFiles.main(IndexFiles.java:88) I appreciate your help.
Re: I need help reducing reducer memory
Steve, I had a similar problem while loading data from HDFS to Teradata with a reducer. Adding the following switches may help: hadoop jar *.jar -Dmapred.child.java.opts="-Xmx1200m -XX:-UseGCOverheadLimit" i/p o/p (try a larger heap such as -Xmx2400m if 1200m is not enough), and you may also try -Dmapred.job.reuse.jvm.num.tasks=1. Regards, Srinivas On Tue, Oct 25, 2011 at 4:38 PM, Steve Lewis lordjoe2...@gmail.com wrote: I have problems with reduce tasks failing with GC overhead limit exceeded. My reduce job retains a small amount of data in memory while processing each key, discarding it after the key is handled. My *mapred.child.java.opts is *-Xmx1200m. I tried mapred.job.shuffle.input.buffer.percent = 0.20 and mapred.job.reduce.input.buffer.percent = 0.30. I really don't know what parameters I can set to lower the memory footprint of my reducer and could use help. I am only passing tens of thousands of keys with thousands of values - each value will be maybe 10KB -- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com
Question on mainframe files
Hi, I'm downloading mainframe files using FTP in binary mode onto the local file system. These files are in EBCDIC. The facts about these files are: (a) they are fixed-length (each field in a record has a fixed length); (b) each record is some number of KB (this size is fixed for every record). Right now I'm able to convert these EBCDIC files to ASCII files within the unix file system using the following command: dd if=INPUTFILENAME of=OUTPUTFILENAME conv=ascii,unblock cbs=150 (150 being the record size). Now I want this conversion to be done in Hadoop to leverage parallel processing. I was wondering whether there is any record reader available for this kind of file, and also how to convert packed decimal (COMP-3) data to ASCII. Any suggestions on how this can be done? Srinivas
Re: Question on mainframe files
Yes, we have files which are only EBCDIC and files containing EBCDIC + packed decimals. As a first step I started working on the EBCDIC-only files. The dd command is working fine, but the intention is to do this conversion within HDFS to leverage parallel processing. On Sat, Oct 15, 2011 at 10:58 AM, Michel Segel michael_se...@hotmail.com wrote: You may not want to do this... Does the data contain any packed or zoned decimals? If so, the dd conversion will corrupt your data. Sent from a remote device. Please excuse any typos... Mike Segel On Oct 15, 2011, at 3:51 AM, SRINIVAS SURASANI vas...@gmail.com wrote: Hi, I'm downloading mainframe files using FTP in binary mode onto the local file system. These files are in EBCDIC. The facts about these files are: (a) they are fixed-length (each field in a record has a fixed length); (b) each record is some number of KB (this size is fixed for every record). Right now I'm able to convert these EBCDIC files to ASCII files within the unix file system using the following command: dd if=INPUTFILENAME of=OUTPUTFILENAME conv=ascii,unblock cbs=150 (150 being the record size). Now I want this conversion to be done in Hadoop to leverage parallel processing. I was wondering whether there is any record reader available for this kind of file, and also how to convert packed decimal (COMP-3) data to ASCII. Any suggestions on how this can be done? Srinivas
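A minimal sketch of the map-side conversion being discussed, assuming fixed-length EBCDIC records have already been delivered to the mapper as BytesWritable values (for example from a SequenceFile or a custom fixed-length RecordReader, which is not shown here). The Cp037 code page, the 150-byte record length, and the class name are illustrative assumptions, and as Mike warns above, a plain charset conversion like this will corrupt packed or zoned decimal fields, which need their own decoding step:

  import java.io.IOException;
  import java.nio.charset.Charset;
  import org.apache.hadoop.io.BytesWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class EbcdicToAsciiMapper
          extends Mapper<LongWritable, BytesWritable, LongWritable, Text> {

      // One of several EBCDIC code pages supported by the JDK; pick the one your shop uses.
      private static final Charset EBCDIC = Charset.forName("Cp037");
      // Matches the cbs=150 record size used with dd above.
      private static final int RECORD_LENGTH = 150;

      @Override
      protected void map(LongWritable key, BytesWritable value, Context context)
              throws IOException, InterruptedException {
          // Copy exactly one record out of the (possibly larger) backing array.
          byte[] record = new byte[RECORD_LENGTH];
          System.arraycopy(value.getBytes(), 0, record, 0, RECORD_LENGTH);
          // Decode the display (character) fields from EBCDIC to Unicode text.
          context.write(key, new Text(new String(record, EBCDIC)));
      }
  }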