Re: Hive/Hdfs Connector

2012-07-05 Thread Srinivas Surasani
Sandeep,

You can use the HiveDriver class to achieve this:

http://hive.apache.org/docs/r0.7.0/api/org/apache/hadoop/hive/jdbc/HiveDriver.html
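
A minimal sketch of a JDBC connection through that driver (the host, port,
and database below are placeholders, and this assumes a Hive server
listening on port 10000):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
  public static void main(String[] args) throws Exception {
    // Register the Hive JDBC driver (Hive 0.7.x / HiveServer1).
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

    // "localhost:10000/default" is a placeholder for your Hive server.
    Connection con = DriverManager.getConnection(
        "jdbc:hive://localhost:10000/default", "", "");

    // Any HQL generated by your application can be submitted here.
    Statement stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery("SHOW TABLES");
    while (rs.next()) {
      System.out.println(rs.getString(1));
    }
    con.close();
  }
}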

Regards,
Srinivas

On Thu, Jul 5, 2012 at 11:02 AM, Sandeep Reddy P
sandeepreddy.3...@gmail.com wrote:
 Hi,
 We have an application which generates SQL queries and connects to an RDBMS
 using connectors like JDBC for MySQL. Now, if we generate HQL using our
 application, is there any way to connect to Hive/HDFS using connectors? I
 need help on which connectors to use.
 We don't want to pull data from Hive/HDFS into the RDBMS; instead we need our
 application to connect to Hive/HDFS directly.

 --
 Thanks,
 sandeep




-- 
-- Srinivas
srini...@cloudwick.com


Re: Hadoop with Sharded MySql

2012-06-01 Thread Srinivas Surasani
All,

I'm trying to get data into HDFS directly from a sharded database and expose
it to the existing Hive infrastructure.

(We are currently doing this as mysql -> staging server -> hdfs put
commands -> hdfs, which is taking a lot of time.)

If we had a way of running a single Sqoop job across all shards for a single
table, I believe it would make life easier in terms of monitoring and exception
handling.
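
As a rough sketch, one import per shard/table could look like the command
below (host, database, table, and paths are placeholders; in practice these
commands would be generated per shard, e.g. by a small wrapper script):

sqoop import \
  --connect jdbc:mysql://shard01.example.com/salesdb \
  --username etl -P \
  --table orders \
  --target-dir /user/etl/salesdb/shard01/orders \
  -m 4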

Thanks,
Srinivas

On Fri, Jun 1, 2012 at 1:27 AM, anil gupta anilgupt...@gmail.com wrote:

 Hi Sujith,

 Srinivas is asking how to import data into HDFS using Sqoop. I believe he
 must have thought it out well before designing the entire
 architecture/solution. He has not specified whether he would like to modify
 the data or not. Whether to use Hive or HBase is a different question
 altogether and depends on his use case.

 Thanks,
 Anil


 On Thu, May 31, 2012 at 9:52 PM, Sujit Dhamale sujitdhamal...@gmail.com
 wrote:

  Hi,
  Instead of pulling 70K tables from MySQL into HDFS,
  take a dump of all 30 tables and put them into the HBase database.
 
  If you pull 70K tables from MySQL into HDFS, you need to use Hive,
 but
  modification will not be possible in Hive :(
 
  *@ common-user:* please correct me if I am wrong.
 
  Kind Regards
  Sujit Dhamale
  (+91 9970086652)
  On Fri, Jun 1, 2012 at 5:42 AM, Edward Capriolo edlinuxg...@gmail.com
  wrote:
 
   Maybe you can do some VIEWs or unions or merge tables on the mysql
   side to overcome the aspect of launching so many sqoop jobs.
  
   On Thu, May 31, 2012 at 6:02 PM, Srinivas Surasani
   hivehadooplearn...@gmail.com wrote:
All,
   
 We are trying to implement Sqoop in our environment, which has 30 sharded
 MySQL
 databases; each of them has around 30 databases with
 150 tables in each database, all sharded
 (horizontally
 sharded, meaning the data is divided across all the tables in MySQL).
   
 The problem is that we have a total of around 70K tables which need
 to be pulled from MySQL into HDFS.
   
 So, my question is whether generating 70K Sqoop commands and running
 them
 in parallel is feasible or not.
   
 Also, doing incremental updates is going to mean invoking another 70K
 Sqoop jobs, which in turn kick off map-reduce jobs.
   
 The main problem is monitoring and managing this huge number of jobs.
   
 Can anyone suggest the best way of doing this, or whether Sqoop is a good
 candidate for this type of scenario?
   
 Currently the same process is done by generating TSV files from the MySQL
 server, dumping them onto a staging server, and from there generating
 hdfs put statements.
   
Appreciate your suggestions !!!
   
   
Thanks,
Srinivas Surasani
  
 



 --
 Thanks  Regards,
 Anil Gupta




-- 
Regards,
-- Srinivas
srini...@cloudwick.com


Re: BZip2 Splittable?

2012-02-24 Thread Srinivas Surasani
@Daniel,

If you want to process bz2 files in parallel (with more than one mapper/reducer),
you can go for Pig.

See below.

Pig has inbuilt support for processing .bz2 files in parallel (.gz support
is coming soon). If the input file name extension is .bz2, Pig decompresses
the file on the fly and passes the decompressed input stream to your load
function.
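
A minimal Pig sketch, assuming tab-delimited logs under a hypothetical
/data/logs path; because the extension is .bz2, Pig decompresses and splits
the input so the LOAD runs with multiple mappers:

raw  = LOAD '/data/logs/*.bz2' USING PigStorage('\t')
       AS (ts:chararray, level:chararray, msg:chararray);
errs = FILTER raw BY level == 'ERROR';
grp  = GROUP errs BY ts;
cnt  = FOREACH grp GENERATE group, COUNT(errs);
STORE cnt INTO '/data/logs/error_counts';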

Regards,


On Fri, Feb 24, 2012 at 2:59 PM, Rohit ro...@hortonworks.com wrote:

 Hi Daniel,

 Because your MapReduce jobs will not split bzip2 files, each entire bzip2
 file will be processed by one Map task. Thus, if your job takes multiple
 bzip2 text files as the input, then you'll have as many Map tasks as you
 have files running in parallel.

  The Map tasks will be run by your TaskTrackers. Usually the cluster setup
  has the DataNode and the TaskTracker processes running on the same
  machines - so with 6 data nodes, you have 6 TaskTrackers.

 Hope that answers your question.


 Rohit Bakhshi



 www.hortonworks.com (http://www.hortonworks.com/)



 On Friday, February 24, 2012 at 7:59 AM, Daniel Baptista wrote:
  Hi Rohit, thanks for the response, this is pretty much as I expected and
 hopefully adds weight to my other thoughts...
 
  Could this mean that all my datanodes are being sent all of the data, or
 that only one datanode is executing the job?
 
  Thanks again, Dan.
 
  -Original Message-
  From: Rohit Bakhshi [mailto:ro...@hortonworks.com]
  Sent: 24 February 2012 15:54
  To: common-user@hadoop.apache.org (mailto:common-user@hadoop.apache.org)
  Subject: Re: BZip2 Splittable?
 
  Daniel,
 
  I just noticed your Hadoop version - 0.20.2.
 
  The JIRA fix below is for Hadoop 0.21.0, which is a different version.
 So it may not be supported on your version of Hadoop.
 
  --
  Rohit Bakhshi
  www.hortonworks.com (http://www.hortonworks.com/)
 
 
 
 
  On Friday, February 24, 2012 at 7:49 AM, Rohit Bakhshi wrote:
 
   Hi Daniel,
  
   Bzip2 compression codec allows for splittable files.
  
   According to this Hadoop JIRA improvement, splitting of bzip2
 compressed files in Hadoop jobs is supported:
   https://issues.apache.org/jira/browse/HADOOP-4012
  
   --
   Rohit Bakhshi
   www.hortonworks.com (http://www.hortonworks.com/)
  
  
  
  
   On Friday, February 24, 2012 at 7:43 AM, Daniel Baptista wrote:
  
Hi All,
   
I have a cluster of 6 datanodes, all running hadoop version 0.20.2,
 r911707 that take a series of bzip2 compressed text files as input.
   
I have read conflicting articles regarding whether or not hadoop can
 split these bzip2 files, can anyone give me a definite answer?
   
 Thanks in advance, Dan.
 
 
  
 




-- 
Regards,
-- Srinivas
srini...@cloudwick.com


Re: Processing small xml files

2012-02-17 Thread Srinivas Surasani
Hi Mohit,

You can use Pig for processing XML files. PiggyBank has a built-in load
function (XMLLoader) to load XML files.
Also, you can set pig.maxCombinedSplitSize and
pig.splitCombination for efficient processing of many small files.
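
A minimal sketch using the PiggyBank XMLLoader (the piggybank.jar location,
input path, and the 'section' tag name are placeholders for your setup; the
SET lines just combine small files into roughly 128 MB splits):

REGISTER /path/to/piggybank.jar;

SET pig.splitCombination true;
SET pig.maxCombinedSplitSize 134217728;

docs = LOAD '/data/xml/*.xml'
       USING org.apache.pig.piggybank.storage.XMLLoader('section')
       AS (doc:chararray);
DUMP docs;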

On Sat, Feb 18, 2012 at 1:18 AM, Mohit Anchlia mohitanch...@gmail.com wrote:
 On Tue, Feb 14, 2012 at 10:56 AM, W.P. McNeill bill...@gmail.com wrote:

 I'm not sure what you mean by flat format here.

 In my scenario, I have an file input.xml that looks like this.

 <myfile>
   <section>
      <value>1</value>
   </section>
   <section>
      <value>2</value>
   </section>
 </myfile>

 input.xml is a plain text file. Not a sequence file. If I read it with the
 XMLInputFormat my mapper gets called with (key, value) pairs that look like
 this:

 (, <section><value>1</value></section>)
 (, <section><value>2</value></section>)

 Where the keys are numerical offsets into the file. I then use this
 information to write a sequence file with these (key, value) pairs. So my
 Hadoop job that uses XMLInputFormat takes a text file as input and produces
 a sequence file as output.

 I don't know a rule of thumb for how many small files is too many. Maybe
 someone else on the list can chime in. I just know that when your
 throughput gets slow that's one possible cause to investigate.


 I need to install Hadoop. Does this XML input format come as part of the
 install? Can you please give me some pointers that would help me install
 Hadoop, and the XMLInputFormat if necessary?



-- 
-- Srinivas
srini...@cloudwick.com


Re: memory of mappers and reducers

2012-02-16 Thread Srinivas Surasani
1) Yes, option 2 is enough.
2) The configuration variable mapred.child.ulimit can be used to control
the maximum virtual memory of the child (map/reduce) processes.

** The value of mapred.child.ulimit must be greater than the value of mapred.child.java.opts.
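
A minimal mapred-site.xml sketch along those lines (the 512 MB heap and the
1 GB ulimit are example numbers only; mapred.child.ulimit is expressed in
kilobytes and must stay above the -Xmx value):

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
  <final>true</final>
</property>
<property>
  <!-- maximum virtual memory of a child process, in KB -->
  <name>mapred.child.ulimit</name>
  <value>1048576</value>
</property>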

On Thu, Feb 16, 2012 at 12:38 AM, Mark question markq2...@gmail.com wrote:
 Thanks for the reply Srinivas. So option 2 will be enough; however, when I
 tried setting it to 512MB, I see through the system monitor that the map
 task is given 275MB of real memory!!
 Is that normal in Hadoop, to go over the upper bound of memory given by the
 property mapred.child.java.opts?

 Mark

 On Wed, Feb 15, 2012 at 4:00 PM, Srinivas Surasani vas...@gmail.com wrote:

 Hey Mark,

 Yes, you can limit the memory for each task with
 mapred.child.java.opts property. Set this to final if no developer
 has to change it .

 Little intro to mapred.task.default.maxvmem

 This property has to be set on both the JobTracker  for making
 scheduling decisions and on the TaskTracker nodes for the sake of
 memory management. If a job doesn't specify its virtual memory
 requirement by setting mapred.task.maxvmem to -1, tasks are assured a
 memory limit set to this property. This property is set to -1 by
 default. This value should in general be less than the cluster-wide
 configuration mapred.task.limit.maxvmem. If not or if it is not set,
 TaskTracker's memory management will be disabled and a scheduler's
 memory based scheduling decisions may be affected.

 On Wed, Feb 15, 2012 at 5:57 PM, Mark question markq2...@gmail.com
 wrote:
  Hi,
 
   My question is what's the difference between the following two settings:
 
  1. mapred.task.default.maxvmem
  2. mapred.child.java.opts
 
  The first one is used by the TT to monitor the memory usage of tasks,
 while
  the second one is the maximum heap space assigned for each task. I want
 to
  limit each task to use upto say 100MB of memory. Can I use only #2 ??
 
  Thank you,
  Mark



 --
 -- Srinivas
 srini...@cloudwick.com




-- 
-- Srinivas
srini...@cloudwick.com


Re: ENOENT: No such file or directory

2012-02-16 Thread Srinivas Surasani
Sumanth,

What is the value set for the mapred.job.reuse.jvm.num.tasks property?

On Thu, Feb 16, 2012 at 9:25 PM, Sumanth V vsumant...@gmail.com wrote:
 Hi,

 We have a 20 node hadoop cluster running CDH3 U2. Some of our jobs
 are failing with the following errors. We noticed that we are
 consistently hitting this error condition when the total number of map
 tasks in a particular job exceeds the total map task capacity of the
 cluster.
 Other jobs, where the number of map tasks is lower than the total map task
 capacity, fare well.

 Here are the lines from Job Tracker log file -

 2012-02-16 15:05:28,695 INFO org.apache.hadoop.mapred.TaskInProgress:
 Error from attempt_201202161408_0004_m_000169_0: ENOENT: No such file or
 directory
        at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method)
        at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:172)
        at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:215)
        at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:288)
        at org.apache.hadoop.mapred.Child.main(Child.java:245)

 Here is the task tracker log -

 2012-02-16 15:05:22,126 INFO org.apache.hadoop.mapred.JvmManager: JVM :
 jvm_201202161408_0004_m_1467721896 exited with exit code 0. Number of tasks
 it ran: 1
 2012-02-16 15:05:22,127 WARN org.apache.hadoop.mapred.TaskLogsTruncater:
 Exception in truncateLogs while getting allLogsFileDetails(). Ignoring the
 truncation of logs of this process.
 java.io.FileNotFoundException:
 /usr/lib/hadoop-0.20/logs/userlogs/job_201202161408_0004/attempt_201202161408_0004_m_000112_1/log.index
 (No such file or directory)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.<init>(FileInputStream.java:120)
        at java.io.FileReader.<init>(FileReader.java:55)
        at org.apache.hadoop.mapred.TaskLog.getAllLogsFileDetails(TaskLog.java:110)
        at org.apache.hadoop.mapred.TaskLogsTruncater.getAllLogsFileDetails(TaskLogsTruncater.java:353)
        at org.apache.hadoop.mapred.TaskLogsTruncater.shouldTruncateLogs(TaskLogsTruncater.java:98)
        at org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.doJvmFinishedAction(UserLogManager.java:163)
        at org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.processEvent(UserLogManager.java:137)
        at org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.monitor(UserLogManager.java:132)
        at org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager$1.run(UserLogManager.java:66)
 2012-02-16 15:05:22,228 INFO
 org.apache.hadoop.mapred.TaskTracker: attempt_201202161408_0004_m_06_0
 0.0%
 2012-02-16 15:05:22,228 INFO
 org.apache.hadoop.mapred.TaskTracker: attempt_201202161408_0004_m_53_0
 0.0%
 2012-02-16 15:05:22,329 INFO
 org.apache.hadoop.mapred.TaskTracker: attempt_201202161408_0004_m_57_0
 0.0%

 Any help in resolving this issue would be highly appreciated! Let me
 know if any other config info is needed.

 Thanks,
 Sumanth



-- 
-- Srinivas
srini...@cloudwick.com


Re: ENOENT: No such file or directory

2012-02-16 Thread Srinivas Surasani
Sumanth, I think Sreedhar is pointing to the dfs.datanode.max.xcievers
property in hdfs-site.xml. Try setting this property to a higher value.



On Thu, Feb 16, 2012 at 9:51 PM, Sumanth V vsumant...@gmail.com wrote:
 The ulimit values are set to much higher values than the defaults.
 Here are the /etc/security/limits.conf contents -
 *       -       nofile  64000
 hdfs    -       nproc   32768
 hdfs    -       stack   10240
 hbase   -       nproc   32768
 hbase   -       stack   10240
 mapred  -       nproc   32768
 mapred  -       stack   10240


 Sumanth



 On Thu, Feb 16, 2012 at 6:48 PM, Sree K quikre...@yahoo.com wrote:

 Sumanth,

 You may want to check ulimit setting for open files.


 Set it to a higher value if it is at default value of 1024.

 Regards,
 Sreedhar




 
  From: Sumanth V vsumant...@gmail.com
 To: common-user@hadoop.apache.org
 Sent: Thursday, February 16, 2012 6:25 PM
 Subject: ENOENT: No such file or directory

 Hi,

 We have a 20 node hadoop cluster running CDH3 U2. Some of our jobs
 are failing with the following errors. We noticed that we are
 consistently hitting this error condition when the total number of map
 tasks in a particular job exceeds the total map task capacity of the
 cluster.
 Other jobs where the number of map tasks are lower than the total map task
 capacity fares well.

 Here are the lines from Job Tracker log file -

 2012-02-16 15:05:28,695 INFO org.apache.hadoop.mapred.TaskInProgress:
 Error from attempt_201202161408_0004_m_000169_0: ENOENT: No such file or
 directory
         at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method)

 at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:
 172)

 at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:215)
         at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:288)
         at org.apache.hadoop.mapred.Child.main(Child.java:245)

 Here is the task tracker log -

 2012-02-16 15:05:22,126 INFO org.apache.hadoop.mapred.JvmManager: JVM :
 jvm_201202161408_0004_m_1467721896 exited with exit code 0. Number of tasks
 it ran: 1
 2012-02-16 15:05:22,127 WARN org.apache.hadoop.mapred.TaskLogsTruncater:
 Exception in truncateLogs while getting allLogsFileDetails(). Ignoring the
 truncation of logs of this process.
 java.io.FileNotFoundException:
 /usr/lib/hadoop-0.20/logs/userlogs/
 job_201202161408_0004/attempt_201202161408_0004_m_000112_1/log.index
 (No
 such file or directory)
         at java.io.FileInputStream.open(Native Method)
         at java.io.FileInputStream.init(FileInputStream.java:120)
         at java.io.FileReader.init(FileReader.java:55)

 at org.apache.hadoop.mapred.TaskLog.getAllLogsFileDetails(TaskLog.java:
 110)

 at
 org.apache.hadoop.mapred.TaskLogsTruncater.getAllLogsFileDetails(TaskLogsTr
 uncater.java: 353)

 at
 org.apache.hadoop.mapred.TaskLogsTruncater.shouldTruncateLogs(TaskLogsTrunc
 ater.java: 98)

 at
 org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.doJv
 mFinishedAction(UserLogManager.java: 163)

 at
 org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.proc
 essEvent(UserLogManager.java: 137)

 at
 org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.moni
 tor(UserLogManager.java: 132)

 at org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager
 $1.run(UserLogManager.java:66)
 2012-02-16 15:05:22,228 INFO
 org.apache.hadoop.mapred.TaskTracker: attempt_201202161408_0004_m_06_0
 0.0%
 2012-02-16 15:05:22,228 INFO
 org.apache.hadoop.mapred.TaskTracker: attempt_201202161408_0004_m_53_0
 0.0%
 2012-02-16 15:05:22,329 INFO
 org.apache.hadoop.mapred.TaskTracker: attempt_201202161408_0004_m_57_0
 0.0%

 Any help in resolving this issue would be highly appreciated! Let me
 know if any other config info is needed.

 Thanks,
 Sumanth




-- 
-- Srinivas
srini...@cloudwick.com


Re: ENOENT: No such file or directory

2012-02-16 Thread Srinivas Surasani
Sumanth,

For a quick check, try setting this to a much bigger value (1M), though
this is not good practice (the DataNode may run out of memory).
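
For reference, a hedged hdfs-site.xml sketch of the property being discussed
(the value is only an example; very large values mainly trade DataNode memory
for headroom):

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>8192</value>
</property>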

On Thu, Feb 16, 2012 at 10:21 PM, Sumanth V vsumant...@gmail.com wrote:
 Hi Srinivas,

 The *dfs.datanode.max.xcievers* value is set to 4096 in hdfs-site.xml.


 Sumanth



 On Thu, Feb 16, 2012 at 7:11 PM, Srinivas Surasani vas...@gmail.com wrote:

 Sumanth, I think Sreedhar is pointing to dfs.datanode.max.xceivers
 property in hdfs-site.xml. Try setting this property to higher value.



 On Thu, Feb 16, 2012 at 9:51 PM, Sumanth V vsumant...@gmail.com wrote:
  ulimit values are set to much higher values than the default values
  Here is the /etc/security/limits.conf contents -
  *       -       nofile  64000
  hdfs    -       nproc   32768
  hdfs    -       stack   10240
  hbase   -       nproc   32768
  hbase   -       stack   10240
  mapred  -       nproc   32768
  mapred  -       stack   10240
 
 
  Sumanth
 
 
 
  On Thu, Feb 16, 2012 at 6:48 PM, Sree K quikre...@yahoo.com wrote:
 
  Sumanth,
 
  You may want to check ulimit setting for open files.
 
 
  Set it to a higher value if it is at default value of 1024.
 
  Regards,
  Sreedhar
 
 
 
 
  
   From: Sumanth V vsumant...@gmail.com
  To: common-user@hadoop.apache.org
  Sent: Thursday, February 16, 2012 6:25 PM
  Subject: ENOENT: No such file or directory
 
  Hi,
 
  We have a 20 node hadoop cluster running CDH3 U2. Some of our jobs
  are failing with the following errors. We noticed that we are
  consistently hitting this error condition when the total number of map
  tasks in a particular job exceeds the total map task capacity of the
  cluster.
  Other jobs where the number of map tasks are lower than the total map
 task
  capacity fares well.
 
  Here are the lines from Job Tracker log file -
 
  2012-02-16 15:05:28,695 INFO org.apache.hadoop.mapred.TaskInProgress:
  Error from attempt_201202161408_0004_m_000169_0: ENOENT: No such file or
  directory
          at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method)
 
  at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:
  172)
 
  at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:215)
          at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:288)
          at org.apache.hadoop.mapred.Child.main(Child.java:245)
 
  Here is the task tracker log -
 
  2012-02-16 15:05:22,126 INFO org.apache.hadoop.mapred.JvmManager: JVM :
  jvm_201202161408_0004_m_1467721896 exited with exit code 0. Number of
 tasks
  it ran: 1
  2012-02-16 15:05:22,127 WARN org.apache.hadoop.mapred.TaskLogsTruncater:
  Exception in truncateLogs while getting allLogsFileDetails(). Ignoring
 the
  truncation of logs of this process.
  java.io.FileNotFoundException:
  /usr/lib/hadoop-0.20/logs/userlogs/
  job_201202161408_0004/attempt_201202161408_0004_m_000112_1/log.index
  (No
  such file or directory)
          at java.io.FileInputStream.open(Native Method)
          at java.io.FileInputStream.init(FileInputStream.java:120)
          at java.io.FileReader.init(FileReader.java:55)
 
  at org.apache.hadoop.mapred.TaskLog.getAllLogsFileDetails(TaskLog.java:
  110)
 
  at
 
 org.apache.hadoop.mapred.TaskLogsTruncater.getAllLogsFileDetails(TaskLogsTr
  uncater.java: 353)
 
  at
 
 org.apache.hadoop.mapred.TaskLogsTruncater.shouldTruncateLogs(TaskLogsTrunc
  ater.java: 98)
 
  at
 
 org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.doJv
  mFinishedAction(UserLogManager.java: 163)
 
  at
 
 org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.proc
  essEvent(UserLogManager.java: 137)
 
  at
 
 org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager.moni
  tor(UserLogManager.java: 132)
 
  at
 org.apache.hadoop.mapreduce.server.tasktracker.userlogs.UserLogManager
  $1.run(UserLogManager.java:66)
  2012-02-16 15:05:22,228 INFO
  org.apache.hadoop.mapred.TaskTracker:
 attempt_201202161408_0004_m_06_0
  0.0%
  2012-02-16 15:05:22,228 INFO
  org.apache.hadoop.mapred.TaskTracker:
 attempt_201202161408_0004_m_53_0
  0.0%
  2012-02-16 15:05:22,329 INFO
  org.apache.hadoop.mapred.TaskTracker:
 attempt_201202161408_0004_m_57_0
  0.0%
 
  Any help in resolving this issue would be highly appreciated! Let me
  know if any other config info is needed.
 
  Thanks,
  Sumanth
 



 --
 -- Srinivas
 srini...@cloudwick.com




-- 
-- Srinivas
srini...@cloudwick.com


Re: memory of mappers and reducers

2012-02-15 Thread Srinivas Surasani
Hey Mark,

Yes, you can limit the memory for each task with the
mapred.child.java.opts property. Mark it as final if developers should not
be able to change it.

A little intro to mapred.task.default.maxvmem:

This property has to be set on both the JobTracker, for making
scheduling decisions, and on the TaskTracker nodes, for the sake of
memory management. If a job doesn't specify its virtual memory
requirement (i.e., mapred.task.maxvmem is left at -1), tasks are assured a
memory limit set to this property. This property is set to -1 by
default. This value should in general be less than the cluster-wide
configuration mapred.task.limit.maxvmem. If it is not, or if it is not set,
the TaskTracker's memory management will be disabled and the scheduler's
memory-based scheduling decisions may be affected.
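
A hedged mapred-site.xml sketch of those cluster-wide defaults (the 2 GB and
4 GB figures are purely illustrative; both properties take a byte value, or
-1 to leave them unset):

<property>
  <!-- default per-task virtual memory limit -->
  <name>mapred.task.default.maxvmem</name>
  <value>2147483648</value>
</property>
<property>
  <!-- cluster-wide upper bound on per-task virtual memory -->
  <name>mapred.task.limit.maxvmem</name>
  <value>4294967296</value>
</property>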

On Wed, Feb 15, 2012 at 5:57 PM, Mark question markq2...@gmail.com wrote:
 Hi,

  My question is what's the difference between the following two settings:

 1. mapred.task.default.maxvmem
 2. mapred.child.java.opts

 The first one is used by the TT to monitor the memory usage of tasks, while
 the second one is the maximum heap space assigned for each task. I want to
 limit each task to use upto say 100MB of memory. Can I use only #2 ??

 Thank you,
 Mark



-- 
-- Srinivas
srini...@cloudwick.com


Re: How to estimate hadoop?

2012-02-15 Thread Srinivas Surasani
Hey,

It completely depends on your data sizes and processing. You can have
anything from a one-node cluster to thousands of nodes (or many more), depending on your needs.
The following link may help you:

http://wiki.apache.org/hadoop/HardwareBenchmarks

Regards,


On Wed, Feb 15, 2012 at 10:17 PM, Jinyan Xu xxjjyy2...@gmail.com wrote:
 Hi all,

 I want to use the Hadoop system, but I need overall system info about
 Hadoop, for example
 system performance, memory used, CPU utilization and so on. So does anyone have
 a system estimate for Hadoop? Which tool can do this?

 yours
 rock



-- 
-- Srinivas
srini...@cloudwick.com


Re: Understanding fair schedulers

2012-01-25 Thread Srinivas Surasani
Praveenesh,

You can try setting mapred.fairscheduler.pool to your pool name while
running the job. By default, mapred.fairscheduler.poolnameproperty is set to
user.name (each job run by a user goes to a pool named after that user), and you
can also change this property to group.name, or point it at a custom property,
as in the sketch below.
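
A rough sketch of both options (the jar name, class, paths, and pool names
below are placeholders, and passing -D options assumes the job goes through
ToolRunner/GenericOptionsParser):

  # submit a job straight into a named pool
  hadoop jar my-job.jar MyJob -Dmapred.fairscheduler.pool=Hadoop-users in out

Or point the scheduler at a property of your choosing in mapred-site.xml and
have users pass, e.g., -Dpool.name=Hadoop-users when they submit:

  <property>
    <name>mapred.fairscheduler.poolnameproperty</name>
    <value>pool.name</value>
  </property>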

Srinivas --

On Wed, Jan 25, 2012 at 6:24 AM, praveenesh kumar praveen...@gmail.com wrote:

 Understanding Fair Schedulers better.

 Can we create multiple pools in the Fair Scheduler? I guess yes. Please
 correct me if I'm wrong.

 Suppose I have 2 pools in my fair-scheduler.xml

 1. Hadoop-users : Min map : 10, Max map : 50, Min Reduce : 10, Max Reduce :
 50
 2. Admin-users: Min map : 20, Max map : 80, Min Reduce : 20, Max Reduce :
 80

 I have 5 users, who will be using these pools. How will I allocate specific
 pools to specific users ?

 Suppose I want user1,user2 to use Hadoop-users pool and user3,user4,user5
 to use Admin users

 In http://hadoop.apache.org/common/docs/r0.20.205.0/fair_scheduler.html
 they have mentioned allocations something like this.

 <?xml version="1.0"?>
 <allocations>
   <pool name="sample_pool">
     <minMaps>5</minMaps>
     <minReduces>5</minReduces>
     <maxMaps>25</maxMaps>
     <maxReduces>25</maxReduces>
     <minSharePreemptionTimeout>300</minSharePreemptionTimeout>
   </pool>
   <user name="sample_user">
     <maxRunningJobs>6</maxRunningJobs>
   </user>
   <userMaxJobsDefault>3</userMaxJobsDefault>
   <fairSharePreemptionTimeout>600</fairSharePreemptionTimeout>
 </allocations>

 I tried creating more pools and that works, but how do I allocate users to
 specific pools?

 Thanks,
 Praveenesh



Re: Parallel CSV loader

2012-01-24 Thread Srinivas Surasani
Edmon,

Parallel databases (Teradata, Netezza, ...)? I believe that if you use Sqoop
(with plain JDBC) for loading, you cannot achieve much parallelism, since the
table gets deadlocked when you specify more mappers. But you can use Sqoop plus
a parallel database connector (you can find them on the Cloudera site) to get
the databases' native loading utilities. For example, you can get Teradata's
FastLoad utility with Sqoop and the Teradata connector.
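
For illustration only, a generic Sqoop export from HDFS into a database over
JDBC (connection string, credentials, table, and paths are placeholders; with
a vendor connector the same job would go through the native bulk loader, e.g.
FastLoad, instead of plain JDBC inserts):

sqoop export \
  --connect jdbc:teradata://dw.example.com/DATABASE=sales \
  --username etl -P \
  --table ORDERS \
  --export-dir /user/etl/orders_csv \
  --input-fields-terminated-by ',' \
  -m 8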

Srinivas --

On Tue, Jan 24, 2012 at 12:38 PM, Harsh J ha...@cloudera.com wrote:

 Agree. Apache Sqoop is what you're looking for:
 http://incubator.apache.org/sqoop/

 On Tue, Jan 24, 2012 at 10:51 PM, Prashant Kommireddi
 prash1...@gmail.com wrote:
  I am assuming you want to move data between Hadoop and database.
  Please take a look at Sqoop.
 
  Thanks,
  Prashant
 
  Sent from my iPhone
 
  On Jan 24, 2012, at 9:19 AM, Edmon Begoli ebeg...@gmail.com wrote:
 
  I am looking to use Hadoop for parallel loading of CSV file into a
  non-Hadoop, parallel database.
 
  Is there an existing utility that allows one to pick entries,
  row-by-row, synchronized and in parallel and load into a database?
 
  Thank you in advance,
  Edmon



 --
 Harsh J
 Customer Ops. Engineer, Cloudera



Re: Does tuning require re-formatting Namenode?

2012-01-08 Thread Srinivas Surasani
Hi Praveenesh,

You just need to restart the Hadoop services. Reformatting the namenode
causes the loss of metadata and thereby the loss of the data residing in the datanodes.
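
For example, on a classic 0.20-style tarball install (packaged installs such
as CDH use service scripts instead), a sketch of the restart would be:

$HADOOP_HOME/bin/stop-mapred.sh
$HADOOP_HOME/bin/stop-dfs.sh
# edit hdfs-site.xml / mapred-site.xml here
$HADOOP_HOME/bin/start-dfs.sh
$HADOOP_HOME/bin/start-mapred.sh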

Regards,
Srinivas,
Cloudwick Technologies.

On Mon, Jan 9, 2012 at 12:26 AM, praveenesh kumar praveen...@gmail.com wrote:

 Hey Guys,

 Do I need to format the namenode again if I am changing some HDFS
 configurations like blocksize, checksum, compression codec etc or is there
 any other way to enforce these new changes in the present cluster setup ?

 Thanks,
 Praveenesh



Lucene Example error

2011-12-22 Thread Srinivas Surasani
When I try to run the Lucene demo example from the command line, I get the
following exception.

[root@localhost lucene-3.4.0]# echo $CLASSPATH
/root/lucene-3.4.0/lucene-core-3.4.0.jar:/root/lucene-3.4.0/contrib/demo/lucene-demo-3.4.0.jar:/usr/java/jdk1.6.0_26/lib
[root@localhost lucene-3.4.0]# export JAVA_HOME=/usr/java/jdk1.6.0_26/
[root@localhost lucene-3.4.0]# java org.apache.lucene.demo.IndexFiles -docs
/root/lucene-3.4.0/contrib/demo/src/test/org/apache/lucene/demo/test-files/docs/
Indexing to directory 'index'...
Exception in thread "main" java.lang.NoSuchMethodError: method
java.lang.Class.isAnonymousClass with signature ()Z was not found.
at org.apache.lucene.analysis.Analyzer.assertFinal(Analyzer.java:57)
at org.apache.lucene.analysis.Analyzer.<init>(Analyzer.java:45)
at org.apache.lucene.analysis.ReusableAnalyzerBase.<init>(ReusableAnalyzerBase.java:39)
at org.apache.lucene.analysis.StopwordAnalyzerBase.<init>(StopwordAnalyzerBase.java:60)
at org.apache.lucene.analysis.standard.StandardAnalyzer.<init>(StandardAnalyzer.java:72)
at org.apache.lucene.analysis.standard.StandardAnalyzer.<init>(StandardAnalyzer.java:82)
at org.apache.lucene.demo.IndexFiles.main(IndexFiles.java:88)


I appreciate your help.


Re: I need help reducing reducer memory

2011-10-25 Thread SRINIVAS SURASANI
Steve,

I had a similar problem while loading data from HDFS to Teradata with a reducer.
Adding the following switches may help:

hadoop jar *.jar -Dmapred.child.java.opts="-Xmx1200m -XX:-UseGCOverheadLimit" i/p o/p

(raise -Xmx to 2400m if 1200m is not enough), and you may also try
-Dmapred.job.reuse.jvm.num.tasks=1.

Regards,
Srinivas

On Tue, Oct 25, 2011 at 4:38 PM, Steve Lewis lordjoe2...@gmail.com wrote:

 I have problems with reduce tasks failing with GC overhead limit exceeded
 My reduce job retains a small amount of data in memory while processing
 each
 key discarding it after the
 key is handled
 My *mapred.child.java.opts is *-Xmx1200m
 I tried
  mapred.job.shuffle.input.buffer.percent = 0.20
  mapred.job.reduce.input.buffer.percent=0.30

 I really don't know what parameters I can set to lower the memory footprint
 of my reducer and could use help

 I am only passing tens of thousands of keys with thousands of values - each
 value will be maybe 10KB


 --
 Steven M. Lewis PhD
 4221 105th Ave NE
 Kirkland, WA 98033
 206-384-1340 (cell)
 Skype lordjoe_com



Question on mainframe files

2011-10-15 Thread SRINIVAS SURASANI
Hi,

I'm downloading mainframe files using FTP in binary mode onto the local file
system. These files are in EBCDIC. The key points about these
files are:
(a) they are fixed in length (each
field in a record has a fixed length);
(b) each record is of some fixed size in KB
(the same for every record).

Now, I'm able to convert these EBCDIC files to ASCII within the Unix
file system using the following command:
dd if=INPUTFILENAME of=OUTPUTFILENAME conv=ascii,unblock cbs=150
(150 being the record size).

I want this conversion to be done in Hadoop, to leverage
parallel processing. I was wondering whether there is any record reader available
for this kind of file, and also how to convert packed decimal COMP-3
fields to ASCII.

Any suggestions on how this can be done?


Srinivas


Re: Question on mainframe files

2011-10-15 Thread SRINIVAS SURASANI
Yes, we have files which are plain EBCDIC and files containing EBCDIC +
packed decimals. As a first step I started working on the plain EBCDIC files.
The dd command works fine, but the intention is to do this conversion within
HDFS to leverage parallel processing.
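
For what it's worth, a rough sketch of the per-record conversion in Java,
assuming the source code page is IBM EBCDIC Cp037 (the actual code page
depends on the mainframe) and a 150-byte fixed record. A custom
InputFormat/RecordReader that cuts splits on multiples of the record length
would still be needed, and COMP-3 (packed decimal) fields must be decoded
separately rather than charset-converted:

import java.nio.charset.Charset;

public class EbcdicRecordDecoder {
    // Fixed record size, matching cbs=150 in the dd command.
    private static final int RECORD_LEN = 150;
    // Assumed EBCDIC code page; adjust for your mainframe.
    private static final Charset EBCDIC = Charset.forName("Cp037");

    /** Convert one fixed-length EBCDIC record to a Java/ASCII string. */
    public static String decode(byte[] buf, int offset) {
        return new String(buf, offset, RECORD_LEN, EBCDIC);
    }
}

Inside a mapper this would be applied to each record's bytes; packed-decimal
columns would instead be sliced out and decoded a digit pair (two nibbles) at a time.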

On Sat, Oct 15, 2011 at 10:58 AM, Michel Segel michael_se...@hotmail.com wrote:

 You may not want to do this...
 Does the data contain any packed or zoned decimals?
 If so, the dd conversion will corrupt your data.


 Sent from a remote device. Please excuse any typos...

 Mike Segel

 On Oct 15, 2011, at 3:51 AM, SRINIVAS SURASANI vas...@gmail.com wrote:

  Hi,
 
  I'm downloading mainframe files using FTP in binary mode on to local file
  system. These files are now seen as EBCDIC. The information about these
  files are
 (a) fixed in length ( each
  field in record has fixed length).
 (b)each record is of some
 KB
  ( This KB is fixed for each record).
 
  Now here I'm able to convert this EBCDIC files to ASCII files with in
 unix
  file system using the following command.
  dd if INPUTFILENAME  of OUTPUTFILENAME conv=ascii,unblock cbs=150
  150 being  the record size.
 
  So, here I want this conversion to be done in Hadoop to leverage the use
 of
  parallel processing. I was wondering is there any record reader available
  for this kind of files and also about how to convert Packed Decimals
 COMP(3)
  files to ASCII.
 
  Any suggestions on how this can be done.
 
 
  Srinivas