Questions on hadoop configuration for heterogeneous cluster
Hi all, I'm a little confused about how to configure Hadoop in a heterogeneous cluster. For example, if I have one machine (m1) with a two-core processor and another (m2) with a four-core processor, and I'd like to use them as tasktracker nodes in a Hadoop cluster, how should I configure mapred.tasktracker.map/reduce.tasks.maximum? Could I set both parameters to 2 on m1 and to 4 on m2? Or do I have to set both to 2 on the JT node? In other words, among the many parameters in *-site.xml and environment variables in hadoop-env.sh, which ones can be set on each DN/TT with different values and still take effect? Thanks in advance, and I look forward to your reply. -- Best Regards, Li Yu
Please let me know in which scenarios the following exception can occur on the hadoop-datanode side:
java.io.IOException: Block blk_129_3380 is not valid
    at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:962)
Re: Questions on hadoop configuration for heterogeneous cluster
Hi, On Thu, Dec 16, 2010 at 2:48 PM, Yu Li car...@gmail.com wrote:

> Hi all, I'm a little confused about how to configure Hadoop in a heterogeneous cluster. For example, if I have one machine (m1) with a two-core processor and another (m2) with a four-core processor, and I'd like to use them as tasktracker nodes in a Hadoop cluster, how should I configure mapred.tasktracker.map/reduce.tasks.maximum? Could I set both parameters to 2 on m1 and to 4 on m2?

Yes, you can do this just fine. Note that the configuration property name says "tasktracker", meaning it is a tasktracker-specific setting and can vary for each one. It has nothing to do with the JobTracker.

> In other words, among the many parameters in *-site.xml and environment variables in hadoop-env.sh, which ones can be set on each DN/TT with different values and still take effect?

*.tasktracker.* and *.datanode.* properties are TT- and DN-specific and can be set individually on each of them. This follows from a naming convention used by Hadoop. -- Harsh J www.harshj.com
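For example, a minimal sketch of what mapred-site.xml on m2 could contain; the property names are the real per-tasktracker knobs, the values are just illustrative (m1 would keep 2/2):

    <!-- mapred-site.xml on m2 (four cores); illustrative values -->
    <configuration>
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>4</value>
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>4</value>
      </property>
    </configuration>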
Thrift Error
Hi all, I have googled a lot about the error below but wasn't able to find the root cause. I am selecting data from the Hive table website_master, but it results in the following error:

Hibernate: select website_ma0_.s_no as col_0_0_ from website_master1 website_ma0_
org.apache.thrift.TApplicationException: Invalid method name: 'getThriftSchema'
    at org.apache.thrift.TApplicationException.read(TApplicationException.java:107)
    at org.apache.hadoop.hive.service.ThriftHive$Client.recv_getThriftSchema(ThriftHive.java:247)
    at org.apache.hadoop.hive.service.ThriftHive$Client.getThriftSchema(ThriftHive.java:231)
    at org.apache.hadoop.hive.jdbc.HiveQueryResultSet.initDynamicSerde(HiveQueryResultSet.java:76)
    at org.apache.hadoop.hive.jdbc.HiveQueryResultSet.init(HiveQueryResultSet.java:57)
    at org.apache.hadoop.hive.jdbc.HiveQueryResultSet.init(HiveQueryResultSet.java:48)
    at org.apache.hadoop.hive.jdbc.HivePreparedStatement.executeImmediate(HivePreparedStatement.java:194)
    at org.apache.hadoop.hive.jdbc.HivePreparedStatement.executeQuery(HivePreparedStatement.java:151)
    at org.hibernate.jdbc.AbstractBatcher.getResultSet(AbstractBatcher.java:107)
    at org.hibernate.loader.Loader.getResultSet(Loader.java:1183)
    at org.hibernate.loader.hql.QueryLoader.iterate(QueryLoader.java:381)
    at org.hibernate.hql.ast.QueryTranslatorImpl.iterate(QueryTranslatorImpl.java:278)
    at org.hibernate.impl.SessionImpl.iterate(SessionImpl.java:865)
    at org.hibernate.impl.QueryImpl.iterate(QueryImpl.java:41)
    at SelectClauseExample.main(SelectClauseExample.java:25)
10/12/16 14:06:55 WARN jdbc.AbstractBatcher: exception clearing maxRows/queryTimeout
java.sql.SQLException: Method not supported
    at org.apache.hadoop.hive.jdbc.HivePreparedStatement.getQueryTimeout(HivePreparedStatement.java:926)
    at org.hibernate.jdbc.AbstractBatcher.closeQueryStatement(AbstractBatcher.java:185)
    at org.hibernate.jdbc.AbstractBatcher.closeQueryStatement(AbstractBatcher.java:123)
    at org.hibernate.loader.Loader.getResultSet(Loader.java:1191)
    at org.hibernate.loader.hql.QueryLoader.iterate(QueryLoader.java:381)
    at org.hibernate.hql.ast.QueryTranslatorImpl.iterate(QueryTranslatorImpl.java:278)
    at org.hibernate.impl.SessionImpl.iterate(SessionImpl.java:865)
    at org.hibernate.impl.QueryImpl.iterate(QueryImpl.java:41)
    at SelectClauseExample.main(SelectClauseExample.java:25)
10/12/16 14:06:55 WARN util.JDBCExceptionReporter: SQL Error: 0, SQLState: null
10/12/16 14:06:55 ERROR util.JDBCExceptionReporter: Could not create ResultSet: Invalid method name: 'getThriftSchema'
could not execute query using iterate

Can someone please tell me why this occurs and how to resolve it? Thanks & Regards, Adarsh Sharma
Re: test
tested :) -- Edson Ramiro Lucas Filho {skype, twitter, gtalk}: erlfilho http://www.inf.ufpr.br/erlf07/ On Thu, Dec 16, 2010 at 5:03 AM, sravankumar sravanku...@huawei.com wrote:
Read specific block of a file
Hi there, I want to ask whether the HDFS API supports reading just a specific block of a file (assuming, of course, the file exceeds the default block size). For example, is it possible to read/fetch just the first or the third block of a specific file in HDFS? Does the API support that?
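For reference, a sketch of one way this can be approached with the public FileSystem API (untested; the file path is hypothetical): look up the block boundaries with getFileBlockLocations, then do a positioned read at the offset of the block you want.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.*;

    public class ReadOneBlock {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/some/large/file");   // hypothetical path

        FileStatus stat = fs.getFileStatus(file);
        // One BlockLocation per block, each with its start offset and length.
        BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());

        BlockLocation third = blocks[2];            // third block, if present
        byte[] buf = new byte[(int) third.getLength()];

        FSDataInputStream in = fs.open(file);
        in.readFully(third.getOffset(), buf, 0, buf.length); // positioned read
        in.close();
      }
    }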
Question on utf-8 chars
This must be a simple question, but somehow I am not able to get it to work. I have a text file which has ISO Latin characters like Cancún. The mapper takes Text as the input value: public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException. But the Latin characters are not recognized correctly, and it throws a MalformedInputException when I try Text.validateUTF8(value.getBytes()); Any idea how to resolve this? Appreciate any help. -- Sheeba Ann George
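A likely cause is that the file is ISO-8859-1 (Latin-1) encoded while Text assumes UTF-8, so the raw bytes fail UTF-8 validation. A sketch of the usual workaround, decoding the mapper's raw bytes with the actual charset (this assumes the input really is Latin-1):

    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapred.*;

    public class Latin1Mapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output,
                      Reporter reporter) throws IOException {
        // Decode the raw bytes as Latin-1 instead of relying on Text's
        // UTF-8 assumption (value.toString() would mangle characters like ú).
        String line = new String(value.getBytes(), 0, value.getLength(), "ISO-8859-1");
        // ... process 'line', which now holds Cancún correctly ...
      }
    }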
NameNode question about lots of small files
Hi, during our research into the 'small files' issues we are having, I didn't find anything to explain what I see after a change.

Before: all files were stored in a structure like /source/year/month/day/, where we had dozens of files in each day's directory (and 500+ sources). We were using a lot more memory than we expected in the NameNode, so we redesigned the directory structure. Here is the 'before' summary:

1823121 files and directories, 1754612 blocks = 3577733 total. Heap Size is 1.94 GB / 1.94 GB (100%)

The heap size relative to the number of files was higher than we expected (using the 150 bytes/file rule of thumb from Cloudera), so we redesigned our approach.

After: simplified into /source/year_month/, and while there are a lot of files in each directory, the memory usage dropped significantly:

1824616 files and directories, 1754612 blocks = 3579228 total. Heap Size is 1.18 GB / 1.74 GB (67%)

This was a surprise, since we hadn't done the file compaction step and already saw a huge drop in memory usage. What I don't understand is why the memory usage changed. The old structure is still there (/source/year/month/day), but with no files at the tips; the reorg process only moved the files to the new structure, and a separate step is going to remove the empty directories. The 'before' number was taken after the cluster had been idle for 4+ hours, so I don't think it was a GC timing issue. I'm looking to understand what happened so I can make sure our capacity calculations based on the number of files and directories are correct. We're using: 0.20.2, r911707. Thanks, Chris
Needs a simple answer
Hi all, why do the following lines work in the main class (WordCount) and not in the Mapper, even though myconf is set in WordCount to point to the object returned by getConf()?

try {
    FileSystem hdfs = FileSystem.get(wc.WordCount.myconf);
    hdfs.copyFromLocalFile(new Path("/Users/file"), new Path("/tmp/file"));
} catch (Exception e) {
    System.err.print("\nError");
}

Also, the print statement will never print on the console unless it's in my run function. Appreciate it :) Maha
Re: Needs a simple answer
Maha, remember that the mapper is not running on the same machine as the main class, so local files aren't where you think. On Thu, Dec 16, 2010 at 1:06 PM, maha m...@umail.ucsb.edu wrote:

> Hi all, why do the following lines work in the main class (WordCount) and not in the Mapper, even though myconf is set in WordCount to point to the object returned by getConf()? [...]
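When a mapper needs a file, the usual pattern is to do the local-to-HDFS copy on the machine that actually has the file (the driver) and let Hadoop ship it to the tasks via DistributedCache, rather than calling copyFromLocalFile inside the mapper. A two-part sketch against the 0.20 mapred API (paths as in the original question; untested):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.*;
    import org.apache.hadoop.mapred.JobConf;

    // In the driver (the machine where /Users/file actually exists):
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);
    hdfs.copyFromLocalFile(new Path("/Users/file"), new Path("/tmp/file"));
    DistributedCache.addCacheFile(new URI("/tmp/file"), conf); // ship to tasks

    // In the Mapper, pick the file up from the task-local cache:
    public void configure(JobConf job) {
      try {
        Path[] cached = DistributedCache.getLocalCacheFiles(job);
        // cached[0] is a local path on the tasktracker node
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }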
Re: NameNode question about lots of small files
Hi Chris, to have a reasonable understanding of used heap, you need to trigger a full GC; otherwise, the heap number on the web UI doesn't actually tell you the live heap. With the default (non-CMS) collector, the collector will not run until it is manually triggered or the heap becomes full. You can use JConsole to connect and force a GC to get a good measurement of heap used. Keep in mind also that the total heap is more than just the inodes and blocks; other things like RPC buffers account for some usage as well. -Todd

On Thu, Dec 16, 2010 at 11:25 AM, Chris Curtin curtin.ch...@gmail.com wrote:

> Hi, during our research into the 'small files' issues we are having, I didn't find anything to explain what I see after a change. [...]

-- Todd Lipcon, Software Engineer, Cloudera
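For reference, what JConsole's "Perform GC" button gives you can be approximated in-process; a rough, illustrative snippet (System.gc() is only a request, but the default collector normally honors it with a full collection):

    Runtime rt = Runtime.getRuntime();
    System.gc(); // request a full collection
    long liveBytes = rt.totalMemory() - rt.freeMemory();
    System.out.printf("approx live heap: %.2f GB%n", liveBytes / 1e9);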
Re: Questions on hadoop configuration for heterogeneous cluster
Hi Harsh, thanks a lot for your reply, this really helps! On 16 December 2010 18:08, Harsh J qwertyman...@gmail.com wrote:

> *.tasktracker.* and *.datanode.* properties are TT- and DN-specific and can be set individually on each of them. [...]

-- Best Regards, Li Yu
Re: Thrift Error
Adarsh, Hive and Hadoop both ship with libthrift.jar and libfb303.jar; you should locate the ones shipped with Hadoop and move them to some other folder or rename them. For me the locations of these libraries were: libthrift.jar in /usr/lib/hadoop/lib/ and libfb303.jar in /usr/lib/hive/lib/. See if this solves the problem; I have faced this issue before when accessing Hive over a Thrift server. Thanks, Viral

On Thu, Dec 16, 2010 at 2:12 AM, Adarsh Sharma adarsh.sha...@orkash.com wrote:

> Hi all, I have googled a lot about the error below but wasn't able to find the root cause. I am selecting data from the Hive table website_master, but it results in: org.apache.thrift.TApplicationException: Invalid method name: 'getThriftSchema' [...]
Re: Thrift Error
Viral Bajaria wrote:

> Adarsh, Hive and Hadoop both ship with libthrift.jar and libfb303.jar; you should locate the ones shipped with Hadoop and move them to some other folder or rename them. [...]

Thanks a lot, Viral! -Adarsh
Regarding decommission progress status for datanode
Hi All, is there any way to know, while decommissioning is in progress, whether a given datanode has been decommissioned or not using the Java API? I need it because I want to automate this instead of relying on manual intervention. Right now we are checking manually in the NameNode UI (Live Nodes link) and using hadoop dfsadmin -report. Please let me know. Thanks, sandeep
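For what it's worth, a sketch of one way to poll this from Java; note that it leans on HDFS-internal client classes (DistributedFileSystem, DatanodeInfo) that are not a stable public API, so treat it as illustrative (method names are from the 0.20 codebase):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    public class DecommissionCheck {
      public static void main(String[] args) throws Exception {
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(new Configuration());
        // Same per-datanode info that backs the NameNode UI's Live Nodes page.
        for (DatanodeInfo node : dfs.getDataNodeStats()) {
          System.out.println(node.getName()
              + " decommissioned=" + node.isDecommissioned()
              + " inProgress=" + node.isDecommissionInProgress());
        }
      }
    }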
Please help with hadoop configuration parameter set and get
Hi, I am a newbie to Hadoop. Today I was struggling with a Hadoop problem for several hours. I initialize a parameter by setting the job configuration in main, e.g.:

Configuration con = new Configuration();
con.set("test", "1");
Job job = new Job(con);

Then in the mapper class, I want to set "test" to 2. I did it by:

context.getConfiguration().set("test", "2");

Finally, in the main method, after the job is finished, I check "test" again by:

job.getConfiguration().get("test");

However, the value of "test" is still 1. The reason I want to change the parameter inside the Mapper class is that I want to determine when to stop an iteration in the main method. For example, when doing breadth-first search, the iteration should stop when no new nodes are added for further expansion. Your help will be deeply appreciated. Thank you, Wei
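The job configuration is serialized and shipped to each task at submit time, so a set() inside the mapper only changes that task's local copy and never travels back to the driver. The standard channel for signaling something like "no new nodes this round" back to the driver is a counter. A sketch against the 0.20 new API; the group/counter names and the frontier test are made up for illustration:

    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;

    public class BfsMapper extends Mapper<LongWritable, Text, Text, Text> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Hypothetical marker for "this node was newly expanded this round".
        boolean expandedNewNode = value.toString().contains("FRONTIER");
        if (expandedNewNode) {
          context.getCounter("bfs", "NEW_NODES").increment(1);
        }
      }
    }

    // In the driver: read the counter after each iteration, stop at zero.
    long newNodes;
    do {
      Job job = new Job(conf);
      // ... job setup for this BFS round ...
      job.waitForCompletion(true);
      newNodes = job.getCounters().findCounter("bfs", "NEW_NODES").getValue();
    } while (newNodes > 0);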