Input Sampler with Custom Key Type

2011-04-07 Thread Meena_86

Hi, 

I am a beginner in Hadoop Map Reduce. Please redirect me if I am not posting
in the correct forum. 
I have created my own Key Type which implements from WritableComparable. I
would like to use TotalOrderPartitioner with this Key and Text as Value. But
I keep encountering errors when the TotalOrderPartitioner reads from the
partition file.

java.lang.RuntimeException: Error in configuring object
at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
at
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at
org.apache.hadoop.mapred.MapTask$OldOutputCollector.(MapTask.java:448)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
... 6 more
Caused by: java.lang.IllegalArgumentException: Can't read partitions file
at
org.apache.hadoop.mapred.lib.TotalOrderPartitioner.configure(TotalOrderPartitioner.java:91)
... 11 more
Caused by: java.io.IOException: Split points are out of order
at
org.apache.hadoop.mapred.lib.TotalOrderPartitioner.configure(TotalOrderPartitioner.java:78)
... 11 more

Please do let me know if anybody would like to look at my code. I have also
implemented by own RawComparator with a custom compare() method.

Thanks,
Meena


-- 
View this message in context: 
http://old.nabble.com/Input-Sampler-with-Custom-Key-Type-tp31349008p31349008.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Fw: start-up with safe mode?

2011-04-07 Thread springring


> 
>> Hi,
>> 
>>   When I start up hadoop, the namenode log show "STATE* Safe mode ON" like 
>> that , how to set it off?
> I can set it off with command "hadoop fs -dfsadmin leave" after start up, 
> but how can I just start HDFS
> out of Safe mode? 
>>   Thanks.
>> 
>> Ring
>> 
>> the startup 
>> log
>> 
>> 2011-04-08 11:58:20,655 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
>> Initializing JVM Metrics with processName=NameNode, sessionId=null
>> 2011-04-08 11:58:20,657 INFO 
>> org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing 
>> NameNodeMeterics using context 
>> object:org.apache.hadoop.metrics.spi.NullContext
>> 2011-04-08 11:58:20,678 INFO org.apache.hadoop.hdfs.util.GSet: VM type   
>> = 32-bit
>> 2011-04-08 11:58:20,678 INFO org.apache.hadoop.hdfs.util.GSet: 2% max memory 
>> = 17.77875 MB
>> 2011-04-08 11:58:20,678 INFO org.apache.hadoop.hdfs.util.GSet: capacity  
>> = 2^22 = 4194304 entries
>> 2011-04-08 11:58:20,678 INFO org.apache.hadoop.hdfs.util.GSet: 
>> recommended=4194304, actual=4194304
>> 2011-04-08 11:58:20,697 INFO 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hdfs
>> 2011-04-08 11:58:20,697 INFO 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
>> 2011-04-08 11:58:20,697 INFO 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
>> 2011-04-08 11:58:20,701 INFO 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
>> dfs.block.invalidate.limit=1000
>> 2011-04-08 11:58:20,701 INFO 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
>> isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), 
>> accessTokenLifetime=0 min(s)
>> 2011-04-08 11:58:20,976 INFO 
>> org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: 
>> Initializing FSNamesystemMetrics using context 
>> object:org.apache.hadoop.metrics.spi.NullContext
>> 2011-04-08 11:58:21,001 INFO org.apache.hadoop.hdfs.server.common.Storage: 
>> Number of files = 17
>> 2011-04-08 11:58:21,007 INFO org.apache.hadoop.hdfs.server.common.Storage: 
>> Number of files under construction = 0
>> 2011-04-08 11:58:21,007 INFO org.apache.hadoop.hdfs.server.common.Storage: 
>> Image file of size 1529 loaded in 0 seconds.
>> 2011-04-08 11:58:21,007 INFO org.apache.hadoop.hdfs.server.common.Storage: 
>> Edits file /tmp/hadoop-hdfs/dfs/name/current/edits of size 4 edits # 0 
>> loaded in 0 seconds.
>> 2011-04-08 11:58:21,009 INFO org.apache.hadoop.hdfs.server.common.Storage: 
>> Image file of size 1529 saved in 0 seconds.
>> 2011-04-08 11:58:21,022 INFO org.apache.hadoop.hdfs.server.common.Storage: 
>> Image file of size 1529 saved in 0 seconds.
>> 2011-04-08 11:58:21,032 INFO 
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading 
>> FSImage in 339 msecs
>> 2011-04-08 11:58:21,036 INFO org.apache.hadoop.hdfs.StateChange: STATE* Safe 
>> mode ON.
>> The reported blocks 0 needs additional 2 blocks to reach the threshold 
>> 0.9990 of total blocks 3. Safe mode will be turned off automatically.
>>

RE: Configuring Hadoop With Eclipse Environment for C++ CDT Code

2011-04-07 Thread Sagar Kohli
Hi Adarsh,

Try this link
http://shuyo.wordpress.com/2011/03/08/hadoop-development-environment-with-eclipse/

regards
Sagar

From: Adarsh Sharma [adarsh.sha...@orkash.com]
Sent: Friday, April 08, 2011 9:45 AM
To: common-user@hadoop.apache.org
Subject: Configuring Hadoop With Eclipse Environment for C++ CDT Code

Dear all,

I am following the below links to configure Eclipse with hadoop
Environment But don't able to find the Map-Reduce Perspective in Open
Perspective > Other Option.

http://developer.yahoo.com/hadoop/tutorial/module3.html#eclipse

http://wiki.apache.org/hadoop/EclipseEnvironment

I copied hadoop-eclipse plugin.jar in plug-ins sub-directory of eclipse.
But don't know why there is no Map-reduce Option in New> Project Option.

Please let me know if there is any other useful link of doing this.



Thanks & best regards,
Adarsh Sharma



Are you exploring a Big Data Strategy ? Listen to this recorded webinar on 
Planning your Hadoop/ NoSQL projects for 2011 at 
www.impetus.com/featured_webinar?eventid=37

Follow us on www.twitter.com/impetuscalling or visit www.impetus.com to know 
more.


NOTE: This message may contain information that is confidential, proprietary, 
privileged or otherwise protected by law. The message is intended solely for 
the named addressee. If received in error, please destroy and notify the 
sender. Any use of this email is prohibited when received in error. Impetus 
does not represent, warrant and/or guarantee, that the integrity of this 
communication has been maintained nor that the communication is free of errors, 
virus, interception or interference.


How is hadoop going to handle the next generation disks?

2011-04-07 Thread Edward Capriolo
I have a 0.20.2 cluster. I notice that our nodes with 2 TB disks waste
tons of disk io doing a 'du -sk' of each data directory. Instead of
'du -sk' why not just do this with java.io.file? How is this going to
work with 4TB 8TB disks and up ? It seems like calculating used and
free disk space could be done a better way.

Edward


Configuring Hadoop With Eclipse Environment for C++ CDT Code

2011-04-07 Thread Adarsh Sharma

Dear all,

I am following the below links to configure Eclipse with hadoop 
Environment But don't able to find the Map-Reduce Perspective in Open 
Perspective > Other Option.


http://developer.yahoo.com/hadoop/tutorial/module3.html#eclipse

http://wiki.apache.org/hadoop/EclipseEnvironment

I copied hadoop-eclipse plugin.jar in plug-ins sub-directory of eclipse. 
But don't know why there is no Map-reduce Option in New> Project Option.


Please let me know if there is any other useful link of doing this.



Thanks & best regards,
Adarsh Sharma


0.21.0 - Java Class Error

2011-04-07 Thread Witold Januszewski
To Whom It May Concern,

When trying to run Hadoop 0.21 with JDK 1.6_23 I get an error:
java.lang.NoClassDefFoundError: org/apache/hadoop/util/PlatformName.
The full error log is in the attached .png
Can you help me? I'd be grateful.


Yours faithfully,
Witold Januszewski


HADOOP-189 and the context class loader

2011-04-07 Thread Benson Margulies
We have fairly good evidence that, as of 0.20.2, hadoop does not set
the thread context class loader to the class loader than includes all
the .jar files from the lib subdirectory of a job jar.

Code we wrote (which is sitting in the 'main' part of the job jar)
calls a class in Mahout (which is sitting in a jar in the 'lib'
directory). That code calls loadClass on
Thread.currentThread().getContextClassLoader() looking for the lucence
StandardAnalyzer. *That* class is sitting in the lucene-core jar,
which is also in the job jar's lib dir.

The result of a class not found exception.

I've checked that we didn't somehow mistakenly unpack lucene in to the
main area (thus ending up with two copies, but I'll check again).

So, the naive interpretation of this is that nothing calls
setContextClassLoader, but I haven't gone reading hadoop source to
check that; I'm asking here instead.


Research on Hadoop Usage & Deployments

2011-04-07 Thread David Menninger
We are in the late stages of a research project on the usage of Apache Hadoop 
for managing and analyzing large scale data. The purpose is to assess the state 
and maturity of the management of large-scale data to determine best practices 
and trends which would lead to further adoption of Hadoop and more effective 
deployments. The survey takes approximately 20-30 minutes and does not have to 
be completed all at once. A report of the findings is made available to all 
participants.  The survey will be closing soon and I wanted to make sure that 
members of Apache Hadoop community are represented in the survey findings.

Here is the survey link: http://www.ventanaresearch.com/him.

NOTE: most of the Hadoop specific questions are in the latter part of the 
survey.

Dave

David Menninger
Vice President & Research Director
Ventana Research



Re: Developing, Testing, Distributing

2011-04-07 Thread Chris K Wensel
> But when I tried to implement a real life project, things has become too 
> complicated to me, things didn't go the way I expected them to go, 
> I  had to implement it using plain map/redapi


as with any framework (in Java), it takes time to find the best practices. many 
of which are documented in the user guide.

that said, a number of projects on top of Cascading simplify development, 
namely the JRuby and Clojure integrations and query languages.
http://www.cascading.org/modules.html

and don't forget you can still run raw MR jobs in tandem with Cascading flows, 
so no need to rewrite working apps.

don't hesitate to ask questions on the list or IRC channel (#cascading)

chris

--
Chris K Wensel
ch...@concurrentinc.com
http://www.concurrentinc.com

-- Concurrent, Inc. offers mentoring, support for Cascading



RE: Developing, Testing, Distributing

2011-04-07 Thread Guy Doulberg
Thanks for your answers, 

I checked cascading for a while, 
It was easy to get started and to do the tutorial, I really liked the modeling 
of pipes, cogroups and so on...

But when I tried to implement a real life project, things has become too 
complicated to me, things didn't go the way I expected them to go, 
I  had to implement it using plain map/redapi

I think I should give it a try again. 



-Original Message-
From: Guy Doulberg [mailto:guy.doulb...@conduit.com] 
Sent: Thursday, April 07, 2011 10:40 AM
To: common-user@hadoop.apache.org
Subject: Developing, Testing, Distributing 

Hey,

I have been developing Map/Red jars for a while now, and I am still not 
comfortable with the developing environment I gathered for myself (and the team)

I am curious how other Hadoop developers out-there, are developing their jobs...

What IDE you are using,
What plugins to the IDE you are using
How do you test your code, which Unit test libraries your using, how do you run 
your automatic tests after you have finished the development?
Do you have test/qa/staging environments beside the dev and the production? How 
do you keep it similar to the production
Code reuse - how do you build components that can be used in other jobs, do you 
build generic map or reduce class?

I can tell you that I have no answer to the questions above,

I hope that this post is not too general, but I think the discussion here could 
be helpful for newbie and experienced developers all together

Thanks Guy


Question about merge multiple files within Hadoop

2011-04-07 Thread Xiaobo Gu
Hi,

Does the copyMerge method of class FileUtil can only merge files with
the exact size in bytes into one file, or the size of source files do
not matter?

Regards,

Xiaobo Gu


Re: Developing, Testing, Distributing

2011-04-07 Thread Chris K Wensel
> How do you test your code, which Unit test libraries your using, how do you 
> run your automatic tests after you have finished the development?
> Do you have test/qa/staging environments beside the dev and the production? 
> How do you keep it similar to the production
> Code reuse - how do you build components that can be used in other jobs, do 
> you build generic map or reduce class?


In all honesty you should take a look at Cascading. It was designed to simplify 
this, but keep in mind i'm the project lead so biased.

In Cascading, there are three distinct elements that can be tested 
independently.

- operations, things like functions and filters. that can typically be re-used 
in any cascading app.
- assemblies of operations that constitute a unit of work or some algorithmic 
process (this will become 1 or more MR jobs during runtime)
- taps, the things that talk to HDFS or external systems like HBase, CouchBase, 
MySQL, ElasticSearch, etc.

each of these can be unit tested individually or as a whole. and you can make 
libraries or frameworks usable by other developers on your teams.

the real value is that you no longer need to think in MapReduce when 
developing, just the problem domain. 

and you can test your processing app independently of making it work in staging 
or production just by swapping out taps.

http://www.cascading.org/

btw, I use IntelliJ for all my development. 

cheers,
chris

--
Chris K Wensel
ch...@concurrentinc.com
http://www.concurrentinc.com

-- Concurrent, Inc. offers mentoring, support for Cascading



Re: Developing, Testing, Distributing

2011-04-07 Thread David Rosenstrauch

On 04/07/2011 03:39 AM, Guy Doulberg wrote:

Hey,

I have been developing Map/Red jars for a while now, and I am still not 
comfortable with the developing environment I gathered for myself (and the team)

I am curious how other Hadoop developers out-there, are developing their jobs...

What IDE you are using,


Eclipse


What plugins to the IDE you are using


Um  subclipse.  (And findbugs sometimes.)


How do you test your code, which Unit test libraries your using, how do you run 
your automatic tests after you have finished the development?


JUnit.  Run the tests right inside eclipse using the IDE's built-in 
junit capabilities.



Do you have test/qa/staging environments beside the dev and the production? How 
do you keep it similar to the production


We have small dev and qa Hadoop clusters, in addition to the large 
production cluster.  We don't do anything particular to keep them 
similar.  If you want to run a test job, and require some data that's on 
the prod cluster, you have to port it yourself.



Code reuse - how do you build components that can be used in other jobs, do you 
build generic map or reduce class?


If you do Test Driven Development when you write your code, you wind up 
with components that you can test independently, and then plug into your 
M/R classes.



I can tell you that I have no answer to the questions above,

I hope that this post is not too general, but I think the discussion here could 
be helpful for newbie and experienced developers all together

Thanks Guy





Re: How to abort a job in a map task

2011-04-07 Thread David Rosenstrauch

On 04/06/2011 08:40 PM, Haruyasu Ueda wrote:

Hi all,

I'm writing M/R java program.

I want to abort a job itself in a map task, when the map task found
irregular data.

I have two idea to do so.
  1. execulte "bin/hadoop -kill jobID" in map task, from slave machine.
  2. raise an IOException to abort.

I want to know which is better way.
Or, whether there is  better/recommended programming idiom.

If you have any experience about this, please share your case.

  --HAL


I'd go with throwing the exception.  That way the cause of the job crash 
will get displayed right in the Hadoop GUI.


DR


Writing to Mapper Context from RecordReader

2011-04-07 Thread Adi
using 0.21.0. I have implemented a custom InputFormat. The RecordReader
extends org.apache.hadoop.mapreduce.RecordReader

The sample I looked at threw an IOException when there was incompatible
input line. But I am not sure who is supposed to catch and handle this
exception. The task just failed when this exception was thrown.
I changed the implementation to log an error instead of throwing an
IOException but the best thing would be to write to the output via context
and report this error.
But the RecordReader does not have a handle to the Mapper context.
Is there a way to get a handle to the current Mapper context and write a
message via the Mapper context from the RecordReader?
Any other suggestions on handling bad input data when implementing Custom
InputFormat?

Thanks.

-Adi


problem to creating file in HDFS

2011-04-07 Thread rahul_trigune

fs -copyFromLocal creating a empty file in the localhast ,when i was trying
to create my localhost as psuedo mode in hadoop
-- 
View this message in context: 
http://old.nabble.com/problem-to-creating-file-in-HDFS-tp31342064p31342064.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.


Developing, Testing, Distributing

2011-04-07 Thread Guy Doulberg
Hey,

I have been developing Map/Red jars for a while now, and I am still not 
comfortable with the developing environment I gathered for myself (and the team)

I am curious how other Hadoop developers out-there, are developing their jobs...

What IDE you are using,
What plugins to the IDE you are using
How do you test your code, which Unit test libraries your using, how do you run 
your automatic tests after you have finished the development?
Do you have test/qa/staging environments beside the dev and the production? How 
do you keep it similar to the production
Code reuse - how do you build components that can be used in other jobs, do you 
build generic map or reduce class?

I can tell you that I have no answer to the questions above,

I hope that this post is not too general, but I think the discussion here could 
be helpful for newbie and experienced developers all together

Thanks Guy


RE: Including Additional Jars

2011-04-07 Thread Guy Doulberg
Or to set the Main class in the manifest of the Jar, 



-Original Message-
From: Bill Graham [mailto:billgra...@gmail.com] 
Sent: Wednesday, April 06, 2011 11:17 PM
To: Shuja Rehman
Cc: common-user@hadoop.apache.org
Subject: Re: Including Additional Jars

You need to pass the mainClass after the jar:

http://hadoop.apache.org/common/docs/r0.21.0/commands_manual.html#jar

On Wed, Apr 6, 2011 at 11:31 AM, Shuja Rehman  wrote:
> i am using the following command
>
> hadoop jar myjar.jar -libjars /home/shuja/lib/mylib.jar  param1 param2
> param3
>
> but the program still giving the error and does not find the mylib.jar. can
> u confirm the syntax of command?
> thnx
>
>
>
> On Wed, Apr 6, 2011 at 8:29 PM, Bill Graham  wrote:
>>
>> If you could share more specifics regarding just how it's not working
>> (i.e., job specifics, stack traces, how you're invoking it, etc), you
>> might get more assistance in troubleshooting.
>>
>>
>> On Wed, Apr 6, 2011 at 1:44 AM, Shuja Rehman 
>> wrote:
>> > -libjars is not working nor distributed cache, any other
>> > solution??
>> >
>> > On Mon, Apr 4, 2011 at 11:40 PM, James Seigel  wrote:
>> >
>> >> James’ quick and dirty, get your job running guideline:
>> >>
>> >> -libjars <-- for jars you want accessible by the mappers and reducers
>> >> classpath or bundled in the main jar <-- for jars you want accessible
>> >> to
>> >> the runner
>> >>
>> >> Cheers
>> >> James.
>> >>
>> >>
>> >>
>> >> On 2011-04-04, at 12:31 PM, Shuja Rehman wrote:
>> >>
>> >> > well...i think to put in distributed cache is good idea. do u have
>> >> > any
>> >> > working example how to put extra jars in distributed cache and how to
>> >> make
>> >> > available these jars for job?
>> >> > Thanks
>> >> >
>> >> > On Mon, Apr 4, 2011 at 10:20 PM, Mark Kerzner 
>> >> wrote:
>> >> >
>> >> >> I think you can put them either in your jar or in distributed cache.
>> >> >>
>> >> >> As Allen pointed out, my idea of putting them into hadoop lib jar
>> >> >> was
>> >> >> wrong.
>> >> >>
>> >> >> Mark
>> >> >>
>> >> >> On Mon, Apr 4, 2011 at 12:16 PM, Marco Didonna
>> >> >> > >> >>> wrote:
>> >> >>
>> >> >>> On 04/04/2011 07:06 PM, Allen Wittenauer wrote:
>> >> >>>
>> >> 
>> >>  On Apr 4, 2011, at 8:06 AM, Shuja Rehman wrote:
>> >> 
>> >>  Hi All
>> >> >
>> >> > I have created a map reduce job and to run on it on the cluster,
>> >> > i
>> >> have
>> >> > bundled all jars(hadoop, hbase etc) into single jar which
>> >> > increases
>> >> the
>> >> > size
>> >> > of overall file. During the development process, i need to copy
>> >> > again
>> >> >> and
>> >> > again this complete file which is very time consuming so is there
>> >> > any
>> >> >> way
>> >> > that i just copy the program jar only and do not need to copy the
>> >> > lib
>> >> > files
>> >> > again and again. i am using net beans to develop the program.
>> >> >
>> >> > kindly let me know how to solve this issue?
>> >> >
>> >> 
>> >>        This was in the FAQ, but in a non-obvious place.  I've
>> >>  updated
>> >> it
>> >>  to be more visible (hopefully):
>> >> 
>> >> 
>> >> 
>> >> >>
>> >>
>> >> http://wiki.apache.org/hadoop/FAQ#How_do_I_submit_extra_content_.28jars.2C_static_files.2C_etc.29_for_my_job_to_use_during_runtime.3F
>> >> 
>> >> >>>
>> >> >>> Does the same apply to jar containing libraries? Let's suppose I
>> >> >>> need
>> >> >>> lucene-core.jar to run my project. Can I put my this jar into my
>> >> >>> job
>> >> jar
>> >> >> and
>> >> >>> have hadoop "see" lucene's classes? Or should I use distributed
>> >> >>> cache??
>> >> >>>
>> >> >>> MD
>> >> >>>
>> >> >>>
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Regards
>> >> > Shuja-ur-Rehman Baig
>> >> > 
>> >>
>> >>
>> >
>> >
>> > --
>> > Regards
>> > Shuja-ur-Rehman Baig
>> > 
>> >
>
>
>
> --
> Regards
> Shuja-ur-Rehman Baig
>
>
>