Questions on hadoop configuration for heterogeneous cluster

2010-12-16 Thread Yu Li
Hi all,

I'm a little confused about how to configure Hadoop in a heterogeneous
cluster. For example, if I have one machine (m1) with a two-core
processor and another (m2) with a four-core processor, and I'd like to use them
as tasktracker nodes in a Hadoop cluster, how should I configure
mapred.tasktracker.map/reduce.tasks.maximum? Can I set both parameters to
2 on m1 and to 4 on m2? Or do I have to set both to 2 on the JT node? In
other words, among the many parameters in *-site.xml and the
environment variables in hadoop-env.sh, which ones can be set to
different values on each DN/TT and still take effect?

Thanks in advance and look forward to your reply.

-- 
Best Regards,
Li Yu


Please let me know in which scenarios the following exception can occur on the hadoop-datanode side:

2010-12-16 Thread sandeep
java.io.IOException: Block blk_129_3380 is not valid
   at org.apache.hadoop.hdfs.server.datanode.FSDataset.getBlockFile(FSDataset.java:962)

 

 



 



Re: Questions on hadoop configuration for heterogeneous cluster

2010-12-16 Thread Harsh J
Hi,

On Thu, Dec 16, 2010 at 2:48 PM, Yu Li car...@gmail.com wrote:
 Hi all,

 I'm a little confused about how to configure hadoop in a heterogeneous
 cluster. For example, if I have one machine(m1) with a two-core
 processor, another(m2) with a four-core processor, and I'd like to use them
 as tasktracker nodes in a hadoop cluster, how could I configure the
 mapred.tasktracker.map/reduce.tasks.maximum? Could I set both parameters to
 2 on m1, and set to 4 on m2?

Yes, you can do this just fine. Note that the configuration property
name says "tasktracker", meaning it is a tasktracker-specific setting
and can vary for each node. It has nothing to do with the JobTracker.

 In other words, among the many parameters in *-site.xml and
 environment variables in hadoop-env.sh, which ones can be set to
 different values on each DN/TT and still take effect?

*.tasktracker.* and *.datanode.* properties are TT and DN specific and
can be set individually for each of them. This is due to a naming
convention followed by Hadoop.
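
A quick way to convince yourself of this (just a sketch, assuming the 0.20
JobConf API): run the snippet below on each node. JobConf() loads that node's
own *-site.xml files from its local classpath, so m1 and m2 can report
different maximums.

    // Sketch: prints the slot maximums as read from the local node's config.
    import org.apache.hadoop.mapred.JobConf;

    public class LocalSlotCheck {
        public static void main(String[] args) {
            // JobConf() pulls in core-site.xml and mapred-site.xml from this node
            JobConf conf = new JobConf();
            System.out.println("map slots    = "
                + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
            System.out.println("reduce slots = "
                + conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2));
        }
    }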

-- 
Harsh J
www.harshj.com


Thrift Error

2010-12-16 Thread Adarsh Sharma

Hi all,

I have googled a lot about the error below but am not able to find the root
cause.


I am selecting data from the Hive table website_master but it results in
the error below:


Hibernate: select website_ma0_.s_no as col_0_0_ from website_master1 
website_ma0_
org.apache.thrift.TApplicationException: Invalid method name: 
'getThriftSchema'
   at 
org.apache.thrift.TApplicationException.read(TApplicationException.java:107)
   at 
org.apache.hadoop.hive.service.ThriftHive$Client.recv_getThriftSchema(ThriftHive.java:247)
   at 
org.apache.hadoop.hive.service.ThriftHive$Client.getThriftSchema(ThriftHive.java:231)
   at 
org.apache.hadoop.hive.jdbc.HiveQueryResultSet.initDynamicSerde(HiveQueryResultSet.java:76)
   at 
org.apache.hadoop.hive.jdbc.HiveQueryResultSet.init(HiveQueryResultSet.java:57)
   at 
org.apache.hadoop.hive.jdbc.HiveQueryResultSet.init(HiveQueryResultSet.java:48)
   at 
org.apache.hadoop.hive.jdbc.HivePreparedStatement.executeImmediate(HivePreparedStatement.java:194)
   at 
org.apache.hadoop.hive.jdbc.HivePreparedStatement.executeQuery(HivePreparedStatement.java:151)
   at 
org.hibernate.jdbc.AbstractBatcher.getResultSet(AbstractBatcher.java:107)

   at org.hibernate.loader.Loader.getResultSet(Loader.java:1183)
   at org.hibernate.loader.hql.QueryLoader.iterate(QueryLoader.java:381)
   at 
org.hibernate.hql.ast.QueryTranslatorImpl.iterate(QueryTranslatorImpl.java:278)

   at org.hibernate.impl.SessionImpl.iterate(SessionImpl.java:865)
   at org.hibernate.impl.QueryImpl.iterate(QueryImpl.java:41)
   at SelectClauseExample.main(SelectClauseExample.java:25)
10/12/16 14:06:55 WARN jdbc.AbstractBatcher: exception clearing 
maxRows/queryTimeout

java.sql.SQLException: Method not supported
   at 
org.apache.hadoop.hive.jdbc.HivePreparedStatement.getQueryTimeout(HivePreparedStatement.java:926)
   at 
org.hibernate.jdbc.AbstractBatcher.closeQueryStatement(AbstractBatcher.java:185)
   at 
org.hibernate.jdbc.AbstractBatcher.closeQueryStatement(AbstractBatcher.java:123)

   at org.hibernate.loader.Loader.getResultSet(Loader.java:1191)
   at org.hibernate.loader.hql.QueryLoader.iterate(QueryLoader.java:381)
   at 
org.hibernate.hql.ast.QueryTranslatorImpl.iterate(QueryTranslatorImpl.java:278)

   at org.hibernate.impl.SessionImpl.iterate(SessionImpl.java:865)
   at org.hibernate.impl.QueryImpl.iterate(QueryImpl.java:41)
   at SelectClauseExample.main(SelectClauseExample.java:25)
10/12/16 14:06:55 WARN util.JDBCExceptionReporter: SQL Error: 0, 
SQLState: null
10/12/16 14:06:55 ERROR util.JDBCExceptionReporter: Could not create 
ResultSet: Invalid method name: 'getThriftSchema'

could not execute query using iterate

Can someone please tell me why this occurs and how to resolve it?


Thanks & Regards

Adarsh Sharma


Re: test

2010-12-16 Thread Edson Ramiro
tested :)

--
Edson Ramiro Lucas Filho
{skype, twitter, gtalk}: erlfilho
http://www.inf.ufpr.br/erlf07/


On Thu, Dec 16, 2010 at 5:03 AM, sravankumar sravanku...@huawei.com wrote:






Read specific block of a file

2010-12-16 Thread Petrucci Andreas

Hi there, I want to ask if the HDFS API supports reading just a specific block of
a file (of course, only if the file exceeds the default block size). For example,
is it possible to read/fetch just the first or the third block of a specific file
in HDFS? Does the API support that?
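
For context, here is the only workaround I know of, as a rough sketch (the
public FileSystem API has no "read block N" call, but you can seek to the byte
range a block covers). The path and block index below are made up, and I am
assuming the 0.20-era API:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadOneBlock {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path p = new Path("/user/demo/bigfile");   // made-up path
            FileStatus st = fs.getFileStatus(p);
            long blockSize = st.getBlockSize();
            int blockIndex = 2;                        // e.g. "the third block"
            long start = blockIndex * blockSize;       // assumes the file has that many blocks
            long len = Math.min(blockSize, st.getLen() - start);

            FSDataInputStream in = fs.open(p);
            in.seek(start);                            // jump to where that block begins
            byte[] buf = new byte[(int) Math.min(len, 64 * 1024)];
            int read = in.read(buf);                   // bytes are served from that block
            System.out.println("read " + read + " bytes starting at offset " + start);
            in.close();
        }
    }

Is there a cleaner way to do this, or to target a block directly?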
  

Question on utf-8 chars

2010-12-16 Thread Sheeba
This must be a simple question, but somehow I am not able to get it to work.
I have a text file which has ISO Latin characters like Cancún.
The mapper is taking Text as the input value.
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable>
output, Reporter reporter) throws IOException
But the Latin characters are not recognized correctly and it throws a
MalformedInputException when I try

Text.validateUTF8(value.getBytes());

Any idea how to resolve this? Appreciate any help.
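
In case it helps, this is the workaround I have been experimenting with: I
suspect the file is really ISO-8859-1 (Latin-1) rather than UTF-8, so the raw
bytes copied into the Text value are not valid UTF-8 and validateUTF8 throws.
This is only a sketch, and the charset is a guess about my input file:

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // Decode the raw line bytes with the file's real charset instead of
        // letting Text treat them as UTF-8.
        String line = new String(value.getBytes(), 0, value.getLength(), "ISO-8859-1");
        // 'line' now holds "Cancún" correctly; new Text(line) re-encodes it as UTF-8.
        output.collect(new Text(line), new IntWritable(1));
    }

Is that the right way to handle non-UTF-8 input, or is there a cleaner option?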



-- 
Sheeba Ann George




NameNode question about lots of small files

2010-12-16 Thread Chris Curtin
Hi,

During our research into the 'small files' issues we are having, I didn't
find anything to explain what I see 'after' a change.

Before: all files were stored in a structure like /source/year/month/day/,
where we had dozens of files in each day's directory (and 500+ sources). We
were using a lot more memory than we expected in the NameNode, so we
redesigned the directory structure. Here is the 'before' summary:


1823121 files and directories, 1754612 blocks = 3577733 total. Heap Size is
1.94 GB / 1.94 GB (100%)

The heap size relative to the # of files was higher than we expected (using
the 150 bytes/file rule of thumb from Cloudera), so we redesigned our approach.



After: simplified into /source/year_month/ and while there are a lot of
files in the directory, the memory usage dropped significantly:

1824616 files and directories, 1754612 blocks = 3579228 total. Heap Size is
1.18 GB / 1.74 GB (67%)

This was a surprise, since we hadn't done the file compaction step yet and
already saw a huge drop in memory usage.



What I don't understand is why the memory usage changed. The old structure
is still there (/source/year/month/day) but with no files in the leaf
directories. The reorg process only moved the files to the new structure; a
separate step is going to remove the empty directories. The 'before' numbers
were taken after the cluster had been idle for 4+ hours, so I don't think it
was a GC timing issue.



I'm looking to understand what happened so I can make sure our capacity
calculations based on # of files and # of directories are correct. We're
using 0.20.2, r911707.



Thanks,



Chris


Needs a simple answer

2010-12-16 Thread maha
Hi all,

   Why would the following lines work in the main class (WordCount) and not in
the Mapper, even though myconf is set in WordCount to point to the object
returned by getConf()?

 try {
    FileSystem hdfs = FileSystem.get(wc.WordCount.myconf);
    hdfs.copyFromLocalFile(new Path("/Users/file"), new Path("/tmp/file"));
 } catch (Exception e) { System.err.print("\nError"); }

  
  Also, the print statement will never print to the console unless it's in my
run function.

  Appreciate it :)

Maha



Re: Needs a simple answer

2010-12-16 Thread Ted Dunning
Maha,

Remember that the mapper is not running on the same machine as the main
class.  Thus local files aren't where you think.
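
A sketch of what I mean (new-API style; the paths are just the ones from your
mail): you can still get an HDFS handle inside the task, but a local path like
/Users/file is resolved on whichever TaskTracker node runs the task, and a
static field such as wc.WordCount.myconf that was only set in main() is never
initialized in the task's JVM.

    // Inside the Mapper (runs in a separate JVM on a TaskTracker node):
    @Override
    protected void setup(Context context) throws IOException {
        // Use the configuration the framework hands you, not a static field
        // that was only populated in the submitting JVM.
        FileSystem hdfs = FileSystem.get(context.getConfiguration());
        // copyFromLocalFile(new Path("/Users/file"), ...) here would look for
        // /Users/file on *this* node's local disk, which is why it fails.
    }

The same reasoning explains the missing print: stdout from a map task goes to
that task's log on the TaskTracker, not to the console you launched the job
from.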

On Thu, Dec 16, 2010 at 1:06 PM, maha m...@umail.ucsb.edu wrote:

 Hi all,

   Why the following lines would work in the main class (WordCount) and not
 in Mapper ? even though  myconf  is set in WordCount to point to the
 getConf() returned object.

 try {
    FileSystem hdfs = FileSystem.get(wc.WordCount.myconf);
    hdfs.copyFromLocalFile(new Path("/Users/file"), new Path("/tmp/file"));
 } catch (Exception e) { System.err.print("\nError"); }


  Also, the print statement will never print on console unless it's in my
 run function..

  Appreciate it :)

Maha




Re: NameNode question about lots of small files

2010-12-16 Thread Todd Lipcon
Hi Chris,

To have a reasonable understanding of used heap, you need to trigger a
full GC. Otherwise, the heap number on the web UI doesn't actually
tell you the live heap.

With the default (non-CMS) collector, the collector will not run until
it is manually triggered or the heap becomes full.

You can use JConsole to connect and force a GC to get a good
measurement of heap used.

Keep in mind also that the total heap is more than just the inodes and
blocks. Other things like RPC buffers account for some usage as well.
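
If you'd rather script it than click around in JConsole, something along these
lines should also work (a sketch only: it assumes the NameNode JVM was started
with remote JMX enabled, and the host/port are placeholders):

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ForceNameNodeGC {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://namenode-host:8004/jmxrmi");
            JMXConnector c = JMXConnectorFactory.connect(url);
            MBeanServerConnection mbsc = c.getMBeanServerConnection();
            MemoryMXBean mem = ManagementFactory.newPlatformMXBeanProxy(
                mbsc, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
            mem.gc();  // request a full collection, then read the live heap
            System.out.println("heap used after GC = "
                + mem.getHeapMemoryUsage().getUsed() + " bytes");
            c.close();
        }
    }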

-Todd

On Thu, Dec 16, 2010 at 11:25 AM, Chris Curtin curtin.ch...@gmail.com wrote:
 Hi,

 During our research into the 'small files' issues we are having I didn't
 find anything to explain what I see 'after' a change.

 Before: all files were stored in a structure like /source/year/month/day/
 where we had dozens of files in each day's directory (and 500+ sources). We
 were using a lot more memory than we expected in the NameNode so we
 redesigned the directory structure. Here is the 'before' summary:


 1823121 files and directories, 1754612 blocks = 3577733 total. Heap Size is
 1.94 GB / 1.94 GB (100%)

 The Heap Size relative to the # of files was higher than we expected (Using
 150 byte/file rule of thumb from Cloudera)  so we redesigned our approach.



 After: simplified into /source/year_month/ and while there are a lot of
 files in the directory, the memory usage dropped significantly:

 1824616 files and directories, 1754612 blocks = 3579228 total. Heap Size is
 1.18 GB / 1.74 GB (67%)

 This was a surprise, since we hadn't done the file compaction step and
 already saw a huge drop in memory usage.



 What I don't understand is why the memory usage changed. The old structure
 is still there (/source/year/month/day) but with no files in the leaf
 directories. The reorg process only moved the files to the new structure; a
 separate step is going to remove the empty directories. The 'before' numbers
 were taken after the cluster had been idle for 4+ hours, so I don't think it
 was a GC timing issue.



 I'm looking to understand what happened so I can make sure our capacity
 calculations based on # of files and # of directories are correct. We're
 using 0.20.2, r911707.



 Thanks,



 Chris




-- 
Todd Lipcon
Software Engineer, Cloudera


Question on UTF-8

2010-12-16 Thread Sheeba George
This must be a simple question, but somehow I am not able to get it to
work.
I have a text file which has ISO Latin characters like Cancún.
The mapper is taking Text as the input value.

public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable>
output, Reporter reporter) throws IOException

But the Latin characters are not recognized correctly and it throws a
MalformedInputException when I try
Text.validateUTF8(value.getBytes());

Any idea how to resolve this? Appreciate any help.

Thanks
Sheeba


Re: Questions on hadoop configuration for heterogeneous cluster

2010-12-16 Thread Yu Li
Hi Harsh,

Thanks a lot for your reply, this really helps!

On 16 December 2010 18:08, Harsh J qwertyman...@gmail.com wrote:

 Hi,

 On Thu, Dec 16, 2010 at 2:48 PM, Yu Li car...@gmail.com wrote:
  Hi all,
 
  I'm a little confused about how to configure hadoop in a heterogeneous
  cluster. For example, if I have one machine(m1) with a two-core
  processor, another(m2) with a four-core processor, and I'd like to use
 them
  as tasktracker nodes in a hadoop cluster, how could I configure the
  mapred.tasktracker.map/reduce.tasks.maximum? Could I set both parameters
 to
  2 on m1, and set to 4 on m2?

 Yes, you can do this just fine. Note that the configuration property
 name says "tasktracker", meaning it is a tasktracker-specific setting
 and can vary for each node. It has nothing to do with the JobTracker.

  In other words, among the many parameters in *-site.xml and
  environment variables in hadoop-env.sh, which ones can be set to
  different values on each DN/TT and still take effect?

 *.tasktracker.* and *.datanode.* properties are TT and DN specific and
 can be set individually for each of them. This is due to a naming
 convention followed by Hadoop.

 --
 Harsh J
 www.harshj.com




-- 
Best Regards,
Li Yu


Re: Thrift Error

2010-12-16 Thread Viral Bajaria
Adarsh,

Hive and Hadoop both ship with libthrift.jar and libfb303.jar; you
should locate the ones shipped with Hadoop and move them to some other folder,
or rename them.

For me, the locations of these libraries were as follows:
libthrift.jar : /usr/lib/hadoop/lib/
libfb303.jar : /usr/lib/hive/lib/

See if this solves the problem. I have faced this issue earlier when
accessing hive over a thrift server.

Thanks,
Viral

On Thu, Dec 16, 2010 at 2:12 AM, Adarsh Sharma adarsh.sha...@orkash.com wrote:

 Hi all,

 I have googled a lot about the error below but am not able to find the root
 cause.

 I am selecting data from the Hive table website_master but it results in
 the error below:

 Hibernate: select website_ma0_.s_no as col_0_0_ from website_master1
 website_ma0_
 org.apache.thrift.TApplicationException: Invalid method name:
 'getThriftSchema'
   at
 org.apache.thrift.TApplicationException.read(TApplicationException.java:107)
   at
 org.apache.hadoop.hive.service.ThriftHive$Client.recv_getThriftSchema(ThriftHive.java:247)
   at
 org.apache.hadoop.hive.service.ThriftHive$Client.getThriftSchema(ThriftHive.java:231)
   at
 org.apache.hadoop.hive.jdbc.HiveQueryResultSet.initDynamicSerde(HiveQueryResultSet.java:76)
   at
 org.apache.hadoop.hive.jdbc.HiveQueryResultSet.init(HiveQueryResultSet.java:57)
   at
 org.apache.hadoop.hive.jdbc.HiveQueryResultSet.init(HiveQueryResultSet.java:48)
   at
 org.apache.hadoop.hive.jdbc.HivePreparedStatement.executeImmediate(HivePreparedStatement.java:194)
   at
 org.apache.hadoop.hive.jdbc.HivePreparedStatement.executeQuery(HivePreparedStatement.java:151)
   at
 org.hibernate.jdbc.AbstractBatcher.getResultSet(AbstractBatcher.java:107)
   at org.hibernate.loader.Loader.getResultSet(Loader.java:1183)
   at org.hibernate.loader.hql.QueryLoader.iterate(QueryLoader.java:381)
   at
 org.hibernate.hql.ast.QueryTranslatorImpl.iterate(QueryTranslatorImpl.java:278)
   at org.hibernate.impl.SessionImpl.iterate(SessionImpl.java:865)
   at org.hibernate.impl.QueryImpl.iterate(QueryImpl.java:41)
   at SelectClauseExample.main(SelectClauseExample.java:25)
 10/12/16 14:06:55 WARN jdbc.AbstractBatcher: exception clearing
 maxRows/queryTimeout
 java.sql.SQLException: Method not supported
   at
 org.apache.hadoop.hive.jdbc.HivePreparedStatement.getQueryTimeout(HivePreparedStatement.java:926)
   at
 org.hibernate.jdbc.AbstractBatcher.closeQueryStatement(AbstractBatcher.java:185)
   at
 org.hibernate.jdbc.AbstractBatcher.closeQueryStatement(AbstractBatcher.java:123)
   at org.hibernate.loader.Loader.getResultSet(Loader.java:1191)
   at org.hibernate.loader.hql.QueryLoader.iterate(QueryLoader.java:381)
   at
 org.hibernate.hql.ast.QueryTranslatorImpl.iterate(QueryTranslatorImpl.java:278)
   at org.hibernate.impl.SessionImpl.iterate(SessionImpl.java:865)
   at org.hibernate.impl.QueryImpl.iterate(QueryImpl.java:41)
   at SelectClauseExample.main(SelectClauseExample.java:25)
 10/12/16 14:06:55 WARN util.JDBCExceptionReporter: SQL Error: 0, SQLState:
 null
 10/12/16 14:06:55 ERROR util.JDBCExceptionReporter: Could not create
 ResultSet: Invalid method name: 'getThriftSchema'
 could not execute query using iterate

 Can someone please tell me why this occurs and how to resolve it?


 Thanks & Regards

 Adarsh Sharma



Re: Thrift Error

2010-12-16 Thread Adarsh Sharma

Viral Bajaria wrote:

Adarsh,

Hive and Hadoop both ship with libthrift.jar and libfb303.jar; you
should locate the ones shipped with Hadoop and move them to some other folder,
or rename them.

For me, the locations of these libraries were as follows:
libthrift.jar : /usr/lib/hadoop/lib/
libfb303.jar : /usr/lib/hive/lib/

See if this solves the problem. I have faced this issue earlier when
accessing hive over a thrift server.

Thanks,
Viral

On Thu, Dec 16, 2010 at 2:12 AM, Adarsh Sharma adarsh.sha...@orkash.com wrote:

  

Hi all,

I have googled a lot about the error below but am not able to find the root
cause.

I am selecting data from the Hive table website_master but it results in the
error below:

Hibernate: select website_ma0_.s_no as col_0_0_ from website_master1
website_ma0_
org.apache.thrift.TApplicationException: Invalid method name:
'getThriftSchema'
  at
org.apache.thrift.TApplicationException.read(TApplicationException.java:107)
  at
org.apache.hadoop.hive.service.ThriftHive$Client.recv_getThriftSchema(ThriftHive.java:247)
  at
org.apache.hadoop.hive.service.ThriftHive$Client.getThriftSchema(ThriftHive.java:231)
  at
org.apache.hadoop.hive.jdbc.HiveQueryResultSet.initDynamicSerde(HiveQueryResultSet.java:76)
  at
org.apache.hadoop.hive.jdbc.HiveQueryResultSet.init(HiveQueryResultSet.java:57)
  at
org.apache.hadoop.hive.jdbc.HiveQueryResultSet.init(HiveQueryResultSet.java:48)
  at
org.apache.hadoop.hive.jdbc.HivePreparedStatement.executeImmediate(HivePreparedStatement.java:194)
  at
org.apache.hadoop.hive.jdbc.HivePreparedStatement.executeQuery(HivePreparedStatement.java:151)
  at
org.hibernate.jdbc.AbstractBatcher.getResultSet(AbstractBatcher.java:107)
  at org.hibernate.loader.Loader.getResultSet(Loader.java:1183)
  at org.hibernate.loader.hql.QueryLoader.iterate(QueryLoader.java:381)
  at
org.hibernate.hql.ast.QueryTranslatorImpl.iterate(QueryTranslatorImpl.java:278)
  at org.hibernate.impl.SessionImpl.iterate(SessionImpl.java:865)
  at org.hibernate.impl.QueryImpl.iterate(QueryImpl.java:41)
  at SelectClauseExample.main(SelectClauseExample.java:25)
10/12/16 14:06:55 WARN jdbc.AbstractBatcher: exception clearing
maxRows/queryTimeout
java.sql.SQLException: Method not supported
  at
org.apache.hadoop.hive.jdbc.HivePreparedStatement.getQueryTimeout(HivePreparedStatement.java:926)
  at
org.hibernate.jdbc.AbstractBatcher.closeQueryStatement(AbstractBatcher.java:185)
  at
org.hibernate.jdbc.AbstractBatcher.closeQueryStatement(AbstractBatcher.java:123)
  at org.hibernate.loader.Loader.getResultSet(Loader.java:1191)
  at org.hibernate.loader.hql.QueryLoader.iterate(QueryLoader.java:381)
  at
org.hibernate.hql.ast.QueryTranslatorImpl.iterate(QueryTranslatorImpl.java:278)
  at org.hibernate.impl.SessionImpl.iterate(SessionImpl.java:865)
  at org.hibernate.impl.QueryImpl.iterate(QueryImpl.java:41)
  at SelectClauseExample.main(SelectClauseExample.java:25)
10/12/16 14:06:55 WARN util.JDBCExceptionReporter: SQL Error: 0, SQLState:
null
10/12/16 14:06:55 ERROR util.JDBCExceptionReporter: Could not create
ResultSet: Invalid method name: 'getThriftSchema'
could not execute query using iterate

Can someone please tell me why this occurs and how to resolve it?


Thanks & Regards

Adarsh Sharma




  

Thanks a lot, Viral!

-Adarsh


Regarding decommission progress status for datanode

2010-12-16 Thread sandeep
Hi All,

 

Is there any way to know, while decommissioning is in progress, whether a given
datanode has been decommissioned or not, using the Java API? I need it because I
want to automate this instead of relying on manual intervention.

 

Right now we are checking manually in the NameNode UI (Live Nodes link) and
using hadoop dfsadmin -report.
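
For reference, this is roughly the kind of check I would like to automate (a
sketch only; it leans on the DatanodeInfo/DistributedFileSystem classes from
0.20, and I am not sure this is the supported way, hence the question):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    public class DecommissionCheck {
        public static void main(String[] args) throws Exception {
            // Assumes fs.default.name points at the HDFS namenode.
            DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(new Configuration());
            for (DatanodeInfo dn : dfs.getDataNodeStats()) {
                System.out.println(dn.getHostName()
                    + " decommissionInProgress=" + dn.isDecommissionInProgress()
                    + " decommissioned=" + dn.isDecommissioned());
            }
        }
    }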

 

Please let me know

 

Thanks

sandeep

 



 



Please help with hadoop configuration parameter set and get

2010-12-16 Thread Peng, Wei
Hi,

 

I am a newbie to Hadoop.

Today I was struggling with a hadoop problem for several hours.

 

I initialize a parameter by setting the job configuration in main.

E.g. Configuration con = new Configuration();

con.set("test", "1");

Job job = new Job(con);

 

Then in the mapper class, I want to set test to 2. I did it by

context.getConfiguration().set("test", "2");

 

Finally in the main method, after the job is finished, I check the
test again by

job.getConfiguration().get("test");

 

However, the value of test is still 1.

 

The reason I want to change the parameter inside the Mapper class is
that I want to determine when to stop an iteration in the main method.
For example, when doing breadth-first search, if no new nodes
are added for further expansion, the search iteration should stop.
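
From reading around, I suspect the usual way to get such a signal back to the
driver is a counter rather than a Configuration value, since whatever a task
sets on its Configuration never travels back to the submitting JVM. Here is a
rough sketch of what I think that would look like (the group/counter names are
made up), though I am not sure it is the recommended approach:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class BfsStep {
        public static class BfsMapper
                extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                boolean expandedNewNode = value.toString().contains("GRAY"); // placeholder test
                if (expandedNewNode) {
                    // counters are aggregated and visible to the driver after the job
                    context.getCounter("bfs", "newNodes").increment(1);
                }
                context.write(new Text("node"), value);
            }
        }

        // In the driver, after job.waitForCompletion(true):
        static boolean shouldStop(Job job) throws IOException {
            long newNodes =
                job.getCounters().findCounter("bfs", "newNodes").getValue();
            return newNodes == 0;  // empty frontier -> stop iterating
        }
    }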

 

Your help will be deeply appreciated. Thank you

 

Wei