December 2011 SF Hadoop User Group

2011-11-16 Thread Aaron Kimball
After a month's hiatus for Hadoop World, we're back! The December Hadoop
meetup will be held Wednesday, December 14, from 6pm to 8pm. This meetup
will be hosted by Splunk at their office on Brannan St.

As usual, we will use the discussion-based "unconference" format. At the
beginning of the meetup we will collaboratively construct an agenda
consisting of several discussion breakout groups. All participants may
propose a topic and volunteer to facilitate a discussion. All
Hadoop-related topics are encouraged, and all members of the Hadoop
community are welcome.

Event schedule:

   - 6pm - Welcome
   - 6:30pm - Introductions; start creating agenda
   - Breakout sessions begin as soon as we're ready
   - 8pm - Conclusion


Food and refreshments will be provided, courtesy of Splunk.
Please RSVP at http://www.meetup.com/hadoopsf/events/41427512/

Regards,
- Aaron Kimball


October SF Hadoop Meetup

2011-09-30 Thread Aaron Kimball
The October SF Hadoop users meetup will be held Wednesday, October 12, from
7pm to 9pm. This meetup will be hosted by Twitter at their office on Folsom
St. *Please note that due to scheduling constraints, we will begin an hour
later than usual this month.*

As usual, we will use the discussion-based "unconference" format. At the
beginning of the meetup we will collaboratively construct an agenda
consisting of several discussion breakout groups. All participants may
propose a topic and volunteer to facilitate a discussion. All Hadoop-related
topics are encouraged, and all members of the Hadoop community are welcome.

Event schedule:

   - *7pm* - Welcome
   - 7:30pm - Introductions; start creating agenda
   - Breakout sessions begin as soon as we're ready
   - 9pm - Conclusion

Food and refreshments will be provided, courtesy of Twitter.

Please RSVP at http://www.meetup.com/hadoopsf/events/35650052/

Regards,

- Aaron Kimball


Meetup Announcement: July 2011 SF HUG (7/13/2011)

2011-06-15 Thread Aaron Kimball
Hadoop fans,

Last week we held another successful San Francisco Hadoop "unconference"
meetup at RichRelevance's office.

Discussion topics included:
* Working with multiple Hadoop clusters
* HBase
* Hive SerDes and evolving data formats
* Zookeeper and Leader Election
* Timeseries databases / statistics
* Hadoop internals
* Hadoop Pipes
* Hive performance with Joins
* Pig + Java
* Avro

We will hold next month's meetup on Wednesday July 13, from 6--8pm.
This meetup will be hosted once again by our friends at CBSi. Their office
is at 235 Second Street.

As usual, we will use the discussion-based "unconference" format. At the
beginning of the meetup we will collaboratively construct an agenda
consisting of several discussion breakout groups. All participants may
propose a topic and volunteer to facilitate a discussion. All Hadoop-related
topics are encouraged, and all members of the Hadoop community are welcome.
It's never too early to start brainstorming topics if you'd like to lead a
session -- send me an email off-list if you want to discuss an idea.

Event schedule:

* 6pm - Welcome
* 6:30pm - Introductions; start creating agenda
* Breakout sessions begin as soon as we're ready
* 8pm - Conclusion

Food and refreshments will be provided, courtesy of CBSi.

I hope to see you there! Please RSVP at http://bit.ly/kLpLQR so we can get
an accurate count for food and beverages.
Cheers,
- Aaron Kimball


Next SF HUG: June 8, at RichRelevance

2011-05-19 Thread Aaron Kimball
We had a great time after the Cloudera hackathon last week with our monthly
SF Hadoop User Group meetup. I'd like to thank Cloudera again for hosting
such a successful event.

Our next meetup will be held Wednesday, June 8, from 6pm to 8pm.

This meetup will be hosted by our friends at RichRelevance. Their office is
at 275 Battery St. #1150, though we'll be using their 9th floor conference
space.

As usual, we will use the discussion-based "unconference" format. At the
beginning of the meetup we will collaboratively construct an agenda
consisting of several discussion breakout groups. All participants may
propose a topic and volunteer to facilitate a discussion. All Hadoop-related
topics are encouraged, and all members of the Hadoop community are welcome.

Event schedule:

   - 6pm - Welcome
   - 6:30pm - Introductions; start creating agenda
   - Breakout sessions begin as soon as we're ready
   - 8pm - Conclusion


Food and refreshments will be provided, courtesy of RichRelevance.

If you're going to attend, please RSVP at http://bit.ly/kxaJqa.

Hope to see you all there!
- Aaron Kimball


April SFHUG recap, May SFHUG meetup announcement

2011-04-18 Thread Aaron Kimball
Hello Hadoop fans,

This last week we had a very successful meetup of the SF Hadoop User Group,
hosted by Twitter.
Breakout topics included:
* Log analysis
* Cluster resource management
* FlumeBase
* Reusable MapReduce scripts
* Languages for MapReduce programming
* Hadoop Roadmap
* Oozie Best Practices
* HBase how-to
* NameNode SPOF best practices
* Dirty Data

Notes for some of these breakout sessions are posted on the meetup group's
message board at: http://bit.ly/hIpU7G

I am also pleased to announce that the May Hadoop meetup will be held on
Wednesday, May 11, from 6pm to 8pm.

This meetup will be hosted by our friends at Cloudera, in their new SF
office.

As usual, we will use the discussion-based "unconference" format. At the
beginning of the meetup we will collaboratively construct an agenda
consisting of several discussion breakout groups. All participants may
propose a topic and volunteer to facilitate a discussion. All Hadoop-related
topics are encouraged, and all members of the Hadoop community are welcome.

Event schedule:

6pm - Welcome
6:30pm - Introductions; start creating agenda
Breakout sessions begin as soon as we're ready
8pm - Conclusion

Food and refreshments will be provided, courtesy of Cloudera. Please RSVP
at http://bit.ly/hwMCI2

Looking forward to seeing you there!
Regards,
- Aaron Kimball


Re: How does sqoop distribute it's data evenly across HDFS?

2011-03-17 Thread Aaron Kimball
Sqoop operates in parallel by spawning multiple map tasks that each read
different rows of the database, based on the split key. Each map task
submits a different SQL query which uses the split key column to get a
subset of the rows you want to read.  Each map task then writes an output
file to HDFS, covering its subset of the total rows. Hopefully, Sqoop's
partitioning heuristic has resulted in each of these output files being of
roughly the same size. If your row range is "lumpy" (e.g., you have a whole
lot of rows with a column ID=0...1, then a blank space, then a whole lot
more rows where 500 < ID < 600), you'll see a bunch of output files
where some may be empty, and one or two contain all the rows. If your row
range is more uniform (e.g., the range of the ID column is more or less
fully occupied between its maximum and minimum values), you'll get much
more even file sizes.
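
To make that concrete, a four-way import along these lines (a sketch -- the
connect string, table, and column names here are assumptions, not from this
thread):

  sqoop import --connect jdbc:mysql://db.example.com/shop \
      --table orders --split-by id -m 4

conceptually turns into four queries of the form "SELECT ... FROM orders
WHERE id >= lo AND id < hi", with the (lo, hi) ranges computed by dividing
the span between MIN(id) and MAX(id) into four pieces.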

But assuming the number of rows read in by each map task is more or less
the same, then the files will be distributed across the cluster using the
underlying platform. In practice, Sqoop relies on MapReduce and HDFS to make
things "just work out." For instance, by spawning 4 map tasks (Sqoop's
default), it is likely that your cluster will have four separate nodes where
there is a free task slot, and that these tasks will be allocated across
those four different nodes. HDFS' replica placement algorithm is "one
replica on the same node as the writer, and the other two replicas
elsewhere" (assuming a single rack -- if you've configured a multiple rack
topology, HDFS goes further and ensures allocation on at least two racks).
So the map tasks will "probably" be spread onto four different nodes, and
HDFS will "probably" put 3 replicas * 4 tasks' output on a reasonably
diverse set of machines in the cluster. Note that HDFS block placement is
actually on a per-block, not a per-file basis, so if each task is writing
multiple blocks of output, the number of datanodes which are candidates for
replica targets goes up substantially.

In theory, you are right: pathological cases can occur, where all Sqoop
tasks run serially on a single node, making that node hold replica #1 of
each task's output. The HDFS namenode could then pick the same two foreign
replica nodes for each of these four tasks' output and have only three nodes
in the cluster hold all the data from an import job. But this is a very
unlikely case, and not one worth worrying about in practice. If this
happens, it's likely because the majority of your nodes are already in a
disk-full condition, or are otherwise unavailable, and only a specific
subset of the nodes are capable of actually receiving the writes. But in
this regard, Sqoop is no different than any other custom MapReduce program
you might write; it's not particularly more or less resilient to any
pathological conditions of the underlying system which might arise.

Hope that helps.

Cheers,
- Aaron

On Thu, Mar 17, 2011 at 2:41 AM, Andy Doddington wrote:

> Ok, I understand about the balancer process which can be run manually, but
> the sqoop documentation seems to imply that it does balancing for you, based
> on the split key, as you note.
>
> But what causes the various sqoop data import map jobs to write to
> different data nodes? I.e. What stops them all writing to the same node, in
> the ultimate pathological case?
>
> Thanks,
>
>  Andy D
>
> On 17 Mar 2011, at 00:28, Harsh J  wrote:
>
> > There's a balancer available to re-balance DNs across the HDFS cluster
> > in general. It is available in the $HADOOP_HOME/bin/ directory as
> > start-balancer.sh
> >
> > But what I think sqoop implies is that your data is balanced due to
> > the map jobs it runs for imports (using a provided split factor
> > between maps), which should make it write chunks of data out to
> > different DataNodes.
> >
> > I guess you could get more information on the Sqoop mailing list
> > sqoop-u...@cloudera.org,
> > https://groups.google.com/a/cloudera.org/group/sqoop-user/topics
> >
> > On Thu, Mar 17, 2011 at 5:04 AM, BeThere  wrote:
> >> The sqoop documentation seems to imply that it uses the key information
> provided to it on the command line to ensure that the SQL data is
> distributed evenly across the DFS. However I cannot see any mechanism for
> achieving this explicitly other than relying on the implicit distribution
> provided by default by HDFS. Is this correct or are there methods on some
> API that allow me to manage the distribution to ensure that it is balanced
> across all nodes in my cluster?
> >>
> >> Thanks,
> >>
> >> Andy D
> >>
> >>
> >
> >
> >
> > --
> > Harsh J
> > http://harshj.com
>


SF Hadoop Meetup - March review and April announcement (April 13)

2011-03-11 Thread Aaron Kimball
Hadoop fans,

This week we had a great SF Hadoop meetup hosted by Yelp. As usual, we used
an unconference format to construct agenda topics, and broke out into
separate discussion groups. About 75 people attended, and we had a dozen
great discussions; some were business-focused, others very technical, and
some highly theoretical:

* Learning thrift
* Cluster capacity planning
* Selling Hadoop to your organization
* Alternate JVM languages and Hadoop
* Schema / metadata management
* Integration and Migration: Data warehousing, BI, and Hadoop
* Abstract game theory problems and massive data sets
* Hadoop monitoring / management best practices
* "Oinkers anonymous"; for Pig users
* Yahoo S4 and Cloudera Flume
* Configuration tuning and performance
* Production-level workflow tools

Notes for some topics will be available at our meetup site at
http://meetup.com/hadoopsf

If you haven't yet come to one of our meetups, please consider trying it out
next month! We need your voice to help contribute to the discussions. All
members of the Hadoop community, novice or experienced, are welcome.

Next month's meetup will be held Wednesday, April 13, from 6pm to 8pm. The
meetup will be hosted by our friends at Twitter. Their office is at 795
Folsom St. (4th and Folsom)

We will use the discussion-based "unconference" format. At the beginning of
the meetup we will collaboratively construct an agenda consisting of several
discussion breakout groups. All participants may propose a topic and
volunteer to facilitate a discussion. All Hadoop-related topics are
encouraged, and all members of the Hadoop community are welcome.

Please RSVP at http://bit.ly/gkIvqH

Event schedule:
* 6pm - Welcome
* 6:30pm - Introductions; start creating agenda
* Breakout sessions begin as soon as we're ready
* 8pm - Conclusion

Regards,
- Aaron Kimball


Re: Problem running a Hadoop program with external libraries

2011-03-04 Thread Aaron Kimball
Actually, I just misread your email and missed the difference between your
2nd and 3rd attempts.

Are you enforcing min/max JVM heap sizes on your tasks? Are you enforcing a
ulimit (either through your shell configuration, or through Hadoop itself)?
I don't know where these "cannot allocate memory" errors are coming from. If
they're from the OS, could it be because it needs to fork() and momentarily
exceed the ulimit before loading the native libs?

- Aaron

On Fri, Mar 4, 2011 at 1:26 PM, Aaron Kimball  wrote:

> I don't know if putting native-code .so files inside a jar works. A
> native-code .so is not "classloaded" in the same way .class files are.
>
> So the correct .so files probably need to exist in some physical directory
> on the worker machines. You may want to doublecheck that the correct
> directory on the worker machines is identified in the JVM property
> 'java.library.path' (instead of / in addition to $LD_LIBRARY_PATH). This can
> be manipulated in the Hadoop configuration setting mapred.child.java.opts
> (include '-Djava.library.path=/path/to/native/libs' in the string there.)
>
> Also, if you added your .so files to a directory that is already used by
> the tasktracker (like hadoop-0.21.0/lib/native/Linux-amd64-64/), you may
> need to restart the tasktracker instance for it to take effect. (This is
> true of .jar files in the $HADOOP_HOME/lib directory; I don't know if it is
> true for native libs as well.)
>
> - Aaron
>
>
> On Fri, Mar 4, 2011 at 12:53 PM, Ratner, Alan S (IS) 
> wrote:
>
>> We are having difficulties running a Hadoop program making calls to
>> external libraries - but this occurs only when we run the program on our
>> cluster and not from within Eclipse where we are apparently running in
>> Hadoop's standalone mode.  This program invokes the Open Computer Vision
>> libraries (OpenCV and JavaCV).  (I don't think there is a problem with our
>> cluster - we've run many Hadoop jobs on it without difficulty.)
>>
>> 1.  I normally use Eclipse to create jar files for our Hadoop programs
>> but I inadvertently hit the "run as Java application" button and the program
>> ran fine, reading the input file from the eclipse workspace rather than HDFS
>> and writing the output file to the same place.  Hadoop's output appears
>> below.  (This occurred on the master Hadoop server.)
>>
>> 2.  I then "exported" from Eclipse a "runnable jar" which "extracted
>> required libraries" into the generated jar - presumably producing a jar file
>> that incorporated all the required library functions. (The plain jar file
>> for this program is 17 kB while the runnable jar is 30MB.)  When I try to
>> run this on my Hadoop cluster (including my master and slave servers) the
>> program reports that it is unable to locate "libopencv_highgui.so.2.2:
>> cannot open shared object file: No such file or directory".  Now, in
>> addition to this library being incorporated inside the runnable jar file it
>> is also present on each of my servers at
>> hadoop-0.21.0/lib/native/Linux-amd64-64/ where we have loaded the same
>> libraries (to give Hadoop 2 shots at finding them).  These include:
>>  ...
>>  libopencv_highgui_pch_dephelp.a
>>  libopencv_highgui.so
>>  libopencv_highgui.so.2.2
>>  libopencv_highgui.so.2.2.0
>>  ...
>>
>>  When I poke around inside the runnable jar I find
>> javacv_linux-x86_64.jar which contains:
>>  com/googlecode/javacv/cpp/linux-x86_64/libjniopencv_highgui.so
>>
>> 3.  I then tried adding the following to mapred-site.xml as suggested
>> in Patch 2838 that's supposed to be included in hadoop 0.21
>> https://issues.apache.org/jira/browse/HADOOP-2838
>>  
>>mapred.child.env
>>
>>  
>> LD_LIBRARY_PATH=/home/ngc/hadoop-0.21.0/lib/native/Linux-amd64-64
>>  
>>  The log is included at the bottom of this email with Hadoop now
>> complaining about a different missing library with an out-of-memory error.
>>
>> Does anyone have any ideas as to what is going wrong here?  Any help would
>> be appreciated.  Thanks.
>>
>> Alan
>>
>>
>> BTW: Each of our servers has 4 hard drives and many of the errors below
>> refer to the 3 drives (/media/hd2 or hd3 or hd4) reserved exclusively for
>> HDFS and thus perhaps not a good place for Hadoop to be looking for a
>> library file.  My slaves have 24 GB RAM, the jar file is 30 MB, and the
>> sequence file being read is 400 KB - so I hope I am not running ou

Re: Problem running a Hadoop program with external libraries

2011-03-04 Thread Aaron Kimball
I don't know if putting native-code .so files inside a jar works. A
native-code .so is not "classloaded" in the same way .class files are.

So the correct .so files probably need to exist in some physical directory
on the worker machines. You may want to doublecheck that the correct
directory on the worker machines is identified in the JVM property
'java.library.path' (instead of / in addition to $LD_LIBRARY_PATH). This can
be manipulated in the Hadoop configuration setting mapred.child.java.opts
(include '-Djava.library.path=/path/to/native/libs' in the string there.)
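
For example, a mapred-site.xml entry along these lines (a sketch, not a
tested config; the path is the one from your message, and you'd want to keep
any heap settings you already pass in that property's value):

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx200m -Djava.library.path=/home/ngc/hadoop-0.21.0/lib/native/Linux-amd64-64</value>
  </property>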

Also, if you added your .so files to a directory that is already used by the
tasktracker (like hadoop-0.21.0/lib/native/Linux-amd64-64/), you may need to
restart the tasktracker instance for it to take effect. (This is true of
.jar files in the $HADOOP_HOME/lib directory; I don't know if it is true for
native libs as well.)

- Aaron


On Fri, Mar 4, 2011 at 12:53 PM, Ratner, Alan S (IS) wrote:

> We are having difficulties running a Hadoop program making calls to
> external libraries - but this occurs only when we run the program on our
> cluster and not from within Eclipse where we are apparently running in
> Hadoop's standalone mode.  This program invokes the Open Computer Vision
> libraries (OpenCV and JavaCV).  (I don't think there is a problem with our
> cluster - we've run many Hadoop jobs on it without difficulty.)
>
> 1.  I normally use Eclipse to create jar files for our Hadoop programs
> but I inadvertently hit the "run as Java application" button and the program
> ran fine, reading the input file from the eclipse workspace rather than HDFS
> and writing the output file to the same place.  Hadoop's output appears
> below.  (This occurred on the master Hadoop server.)
>
> 2.  I then "exported" from Eclipse a "runnable jar" which "extracted
> required libraries" into the generated jar - presumably producing a jar file
> that incorporated all the required library functions. (The plain jar file
> for this program is 17 kB while the runnable jar is 30MB.)  When I try to
> run this on my Hadoop cluster (including my master and slave servers) the
> program reports that it is unable to locate "libopencv_highgui.so.2.2:
> cannot open shared object file: No such file or directory".  Now, in
> addition to this library being incorporated inside the runnable jar file it
> is also present on each of my servers at
> hadoop-0.21.0/lib/native/Linux-amd64-64/ where we have loaded the same
> libraries (to give Hadoop 2 shots at finding them).  These include:
>  ...
>  libopencv_highgui_pch_dephelp.a
>  libopencv_highgui.so
>  libopencv_highgui.so.2.2
>  libopencv_highgui.so.2.2.0
>  ...
>
>  When I poke around inside the runnable jar I find
> javacv_linux-x86_64.jar which contains:
>  com/googlecode/javacv/cpp/linux-x86_64/libjniopencv_highgui.so
>
> 3.  I then tried adding the following to mapred-site.xml as suggested
> in Patch 2838 that's supposed to be included in hadoop 0.21
> https://issues.apache.org/jira/browse/HADOOP-2838
>  
>mapred.child.env
>
>  
> LD_LIBRARY_PATH=/home/ngc/hadoop-0.21.0/lib/native/Linux-amd64-64
>  
>  The log is included at the bottom of this email with Hadoop now
> complaining about a different missing library with an out-of-memory error.
>
> Does anyone have any ideas as to what is going wrong here?  Any help would
> be appreciated.  Thanks.
>
> Alan
>
>
> BTW: Each of our servers has 4 hard drives and many of the errors below
> refer to the 3 drives (/media/hd2 or hd3 or hd4) reserved exclusively for
> HDFS and thus perhaps not a good place for Hadoop to be looking for a
> library file.  My slaves have 24 GB RAM, the jar file is 30 MB, and the
> sequence file being read is 400 KB - so I hope I am not running out of
> memory.
>
>
> 1.  RUNNING DIRECTLY FROM ECLIPSE IN HADOOP'S STANDALONE MODE - SUCCESS
>
>  Running Face Program
> 11/03/04 12:44:10 INFO security.Groups: Group mapping
> impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
> cacheTimeout=30
> 11/03/04 12:44:10 INFO jvm.JvmMetrics: Initializing JVM Metrics with
> processName=JobTracker, sessionId=
> 11/03/04 12:44:10 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for
> parsing the arguments. Applications should implement Tool for the same.
> 11/03/04 12:44:10 WARN mapreduce.JobSubmitter: No job jar file set.  User
> classes may not be found. See Job or Job#setJar(String).
> 11/03/04 12:44:10 INFO mapred.FileInputFormat: Total input paths to process
> : 1
> 11/03/04 12:44:10 WARN conf.Configuration: mapred.map.tasks is deprecated.
> Instead, use mapreduce.job.maps
> 11/03/04 12:44:10 INFO mapreduce.JobSubmitter: number of splits:1
> 11/03/04 12:44:10 INFO mapreduce.JobSubmitter: adding the following
> namenodes' delegation tokens:null
> 11/03/04 12:44:10 WARN security.TokenCache: Overwriting existing token
> storage with # keys=0
> 11/03/04 12:44:10 INFO mapreduce.J

Reminder: SF Hadoop meetup in 1 week

2011-03-02 Thread Aaron Kimball
Hadoop fans,

As a reminder -- the third SF Hadoop meetup is one week away! We will meet
on March 9th, from 6pm to 8pm. (We will hopefully continue using the 2nd
Wednesday of the month for successive meetups).

This meetup will be hosted by Yelp, at their office at 706 Mission St, San
Francisco (3rd and Mission). Yelp has graciously offered to sponsor food &
drinks for the event as well.

Space is still available for more attendees! Yelp has asked that all
attendees RSVP in advance, to comply with their security policy. Please join
the meetup group and RSVP at http://www.meetup.com/hadoopsf/events/16678757/

As per our emerging tradition, we will continue to use the discussion-based
"unconference" format. At the beginning of the meetup we will
collaboratively construct an agenda consisting of several discussion
breakout groups. All participants may propose a topic and volunteer to
facilitate a discussion. All members of the Hadoop community are welcome to
attend. While all Hadoop-related subjects are on topic, this month's
discussion theme is "integration."

Regards,
- Aaron Kimball


March 2011 San Francisco Hadoop User Meetup ("integration")

2011-02-23 Thread Aaron Kimball
Hadoop fans,

I'm pleased to announce that the third SF Hadoop meetup will be held
Wednesday, March 9th, from 6pm to 8pm. (We will hopefully continue using the
2nd Wednesday of the month for successive meetups).

This meetup will be hosted by the good folks at Yelp. Their office is at 706
Mission St, San Francisco (3rd and Mission). We have ample space available
for a large gathering of Hadoop enthusiasts.

As per our emerging tradition, we will continue to use the discussion-based
"unconference" format. At the beginning of the meetup we will
collaboratively construct an agenda consisting of several discussion
breakout groups. All participants may propose a topic and volunteer to
facilitate a discussion. All members of the Hadoop community are welcome to
attend. While all Hadoop-related subjects are on topic, I'd like to seed
this month's discussion with the theme of "integration."

Yelp has asked that all attendees RSVP in advance, to comply with their
security policy. Please join the meetup group and RSVP at
http://www.meetup.com/hadoopsf/events/16678757/

Refreshments will be provided.

Regards,
- Aaron Kimball


SF Hadoop meetup report

2011-02-11 Thread Aaron Kimball
Hadoop fans,

This week we held the second SF Hadoop meetup. About 70 individuals
attended, and we again had several great discussions in parallel, using the
same unconference format as the first meetup in January.

The following break-out sessions were held:
* Hadoop and Workflow Systems
* Feeding Hadoop
* Contributing to Hadoop and Related Projects
* Hive Q&A
* Hadoop + RDBMS
* Hadoop and Predictive Analytics
* Hadoop in Amazon
* HBase: What is the sweet spot?
* Hardware for Hadoop
* Hadoop Security

For some discussions, summary notes are available; these will be posted at
http://www.meetup.com/hadoopsf/messages/boards/

Thanks again to our hosts at CBS Interactive for providing the space for
this meetup. I would also like to thank the discussion facilitators, and all
the attendees for contributing to the success of the discussions.

Stay tuned for the next SF Hadoop meetup announcement! Sign up at
http://www.meetup.com/hadoopsf/

Regards,
- Aaron Kimball


Re: Click Stream Data

2011-01-30 Thread Aaron Kimball
Start with the student's CS department's web server?

I believe the wikimedia foundation also makes the access logs to wikipedia
et al. available publicly. That is quite a lot of data though.
- Aaron

On Sun, Jan 30, 2011 at 10:54 AM, Bruce Williams
wrote:

> Does anyone know of a source of click stream data for a student research
> project?
>
> Bruce Williams
>
> Concepts, like individuals, have their histories and are just as incapable
> of withstanding the ravages of time as are individuals. But in and through
> all this they retain a kind of homesickness for the scenes of their
> childhood.
> Soren Kierkegaard
>


Re: How do I log from my map/reduce application?

2010-12-15 Thread Aaron Kimball
How is the central log4j file made available to the tasks? After you make
your changes to the configuration file, does it help if you restart the task
trackers?

You could also try setting the log level programmatically in your "void
setup(Context)" method:

@Override
protected void setup(Context context) {
  // Force this class's logger to DEBUG regardless of log4j.properties.
  logger.setLevel(Level.DEBUG);
}

- Aaron

On Wed, Dec 15, 2010 at 2:23 PM, W.P. McNeill  wrote:

> I'm running on a cluster.  I'm trying to write to the log files on the
> cluster machines, the ones that are visible through the jobtracker web
> interface.
>
> The log4j file I gave excerpts from is a central one for the cluster.
>
> On Wed, Dec 15, 2010 at 1:38 PM, Aaron Kimball 
> wrote:
>
> > W. P.,
> >
> > How are you running your Reducer? Is everything running in standalone
> mode
> > (all mappers/reducers in the same process as the launching application)?
> Or
> > are you running this in pseudo-distributed mode or on a remote cluster?
> >
> > Depending on the application's configuration, log4j configuration could
> be
> > read from one of many different places.
> >
> > Furthermore, where are you expecting your output? If you're running in
> > pseudo-distributed (or fully distributed) mode, mapper / reducer tasks
> will
> > not emit output back to the console of the launching application.  That
> > only
> > happens in local mode. In the distributed flavors, you'll see a different
> > file for each task attempt containing its log output, on the machine
> where
> > the task executed. These files can be accessed through the web UI at
> > http://jobtracker:50030/ -- click on the job, then the task, then the
> task
> > attempt, then "syslog" in the right-most column.
> >
> > - Aaron
> >
> > On Mon, Dec 13, 2010 at 10:05 AM, W.P. McNeill 
> wrote:
> >
> > > I would like to use Hadoop's Log4j infrastructure to do logging from my
> > > map/reduce application.  I think I've got everything set up correctly,
> > but
> > > I
> > > am still unable to specify the logging level I want.
> > >
> > > By default Hadoop is set up to log at level INFO.  The first line of
> its
> > > log4j.properties file looks like this:
> > >
> > > hadoop.root.logger=INFO,console
> > >
> > >
> > > I have an application whose reducer looks like this:
> > >
> > > package com.me;
> > >
> > > public class MyReducer<...> extends Reducer<...> {
> > >   private static Logger logger =
> > > Logger.getLogger(MyReducer.class.getName());
> > >
> > >   ...
> > >   protected void reduce(...) {
> > >   logger.debug("My message");
> > >   ...
> > >   }
> > > }
> > >
> > >
> > > I've added the following line to the Hadoop log4j.properties file:
> > >
> > > log4j.logger.com.me.MyReducer=DEBUG
> > >
> > >
> > > I expect the Hadoop system to log at level INFO, but my application to
> > log
> > > at level DEBUG, so that I see "My message" in the logs for the reducer
> > > task.
> > >  However, my application does not produce any log4j output.  If I
> change
> > > the
> > > line in my reducer to read logger.info("My message") the message does
> > get
> > > logged, so somehow I'm failing to specify that log level for this
> class.
> > >
> > > I've also tried changing the log4j line for my app to
> > > read log4j.logger.com.me.MyReducer=DEBUG,console and get the same
> result.
> > >
> > > I've been through the Hadoop and log4j documentation and I can't figure
> > out
> > > what I'm doing wrong.  Any suggestions?
> > >
> > > Thanks.
> > >
> >
>


Re: How do I log from my map/reduce application?

2010-12-15 Thread Aaron Kimball
W. P.,

How are you running your Reducer? Is everything running in standalone mode
(all mappers/reducers in the same process as the launching application)? Or
are you running this in pseudo-distributed mode or on a remote cluster?

Depending on the application's configuration, log4j configuration could be
read from one of many different places.

Furthermore, where are you expecting your output? If you're running in
pseudo-distributed (or fully distributed) mode, mapper / reducer tasks will
not emit output back to the console of the launching application.  That only
happens in local mode. In the distributed flavors, you'll see a different
file for each task attempt containing its log output, on the machine where
the task executed. These files can be accessed through the web UI at
http://jobtracker:50030/ -- click on the job, then the task, then the task
attempt, then "syslog" in the right-most column.

- Aaron

On Mon, Dec 13, 2010 at 10:05 AM, W.P. McNeill  wrote:

> I would like to use Hadoop's Log4j infrastructure to do logging from my
> map/reduce application.  I think I've got everything set up correctly, but
> I
> am still unable to specify the logging level I want.
>
> By default Hadoop is set up to log at level INFO.  The first line of its
> log4j.properties file looks like this:
>
> hadoop.root.logger=INFO,console
>
>
> I have an application whose reducer looks like this:
>
> package com.me;
>
> public class MyReducer<...> extends Reducer<...> {
>   private static Logger logger =
> Logger.getLogger(MyReducer.class.getName());
>
>   ...
>   protected void reduce(...) {
>   logger.debug("My message");
>   ...
>   }
> }
>
>
> I've added the following line to the Hadoop log4j.properties file:
>
> log4j.logger.com.me.MyReducer=DEBUG
>
>
> I expect the Hadoop system to log at level INFO, but my application to log
> at level DEBUG, so that I see "My message" in the logs for the reducer
> task.
>  However, my application does not produce any log4j output.  If I change
> the
> line in my reducer to read logger.info("My message") the message does get
> logged, so somehow I'm failing to specify that log level for this class.
>
> I've also tried changing the log4j line for my app to
> read log4j.logger.com.me.MyReducer=DEBUG,console and get the same result.
>
> I've been through the Hadoop and log4j documentation and I can't figure out
> what I'm doing wrong.  Any suggestions?
>
> Thanks.
>


San Francisco Hadoop meetup

2010-11-04 Thread Aaron Kimball
Hello Hadoop fans,

The Bay Area Hadoop User Group meetups are a long hike for those of us who
come from San Francisco. I'd like to gauge interest in SF-centric Hadoop
gatherings.

In contrast to the presentation-based format of the usual HUG meetings, I'm
interested in holding events that are more discussion-based: an environment
facilitating people who are working on similar problems to share tips for
handling common scenarios and identify common goals for development. Of
course, if you have a relevant presentation that would benefit the group as
a whole, such offers are gladly accepted.

All industries welcome. If you'd like to come, I'd encourage you to think of
a two minute summary about what you're working on with Hadoop, areas you're
interested in learning more about, and/or challenges that you're facing that
might be addressed by a larger Hadoop community. We can build the discussion
from there, and may break into smaller groups. All are welcome (developers,
users, or managers; experienced or newbies). Though I'd like to hold this in
San Francisco, folks from outside the city are of course welcome too.

If you're interested in joining us, please fill out the following:
* I've created a short survey to help understand days / times that would
work for the most people:  http://bit.ly/ajK26U
* Please also join the meetup group at http://meetup.com/hadoopsf -- We'll
use this to plan the event, RSVP information, etc.

I'm looking forward to meeting more of you!
- Aaron Kimball


Re: Partitioned Datasets Map/Reduce

2010-07-05 Thread Aaron Kimball
One possibility: write out all the partition numbers (one per line) to a
single file, then use the NLineInputFormat to make each line its own map
task. Each map task then receives a single partition number ("0", "1", "2",
etc.) as its input line, and can explicitly open /dataset1/part-(n) and
/dataset2/part-(n) itself.
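
A rough sketch of what that mapper could look like (old API, with the class
name, key/value types, and part-file naming all illustrative assumptions):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PartitionJoinMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private JobConf conf;

  @Override
  public void configure(JobConf job) {
    this.conf = job;
  }

  public void map(LongWritable offset, Text line,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // Under NLineInputFormat each line of the driver file is its own split,
    // so the partition number written out earlier arrives as the line text.
    int partition = Integer.parseInt(line.toString().trim());
    FileSystem fs = FileSystem.get(conf);
    Path left = new Path(String.format("/dataset1/part-%05d", partition));
    Path right = new Path(String.format("/dataset2/part-%05d", partition));
    BufferedReader l = new BufferedReader(new InputStreamReader(fs.open(left)));
    BufferedReader r = new BufferedReader(new InputStreamReader(fs.open(right)));
    try {
      // Replace this with the real join logic; here we just tag and emit lines.
      String rec;
      while ((rec = l.readLine()) != null) {
        output.collect(new Text("left-" + partition), new Text(rec));
      }
      while ((rec = r.readLine()) != null) {
        output.collect(new Text("right-" + partition), new Text(rec));
      }
    } finally {
      l.close();
      r.close();
    }
  }
}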

If you wanted to be more clever, it might be possible to subclass
MultiFileInputFormat to group together both datasets "file-number-wise" when
generating splits, but I don't have specific guidance here.

- Aaron

On Sat, Jul 3, 2010 at 9:35 AM, abc xyz  wrote:

>
>
> Hello everyone,
>
>
> I have written my custom partitioner for partitioning datasets. I want  to
> partition two datasets using the same partitioner and then in the  next
> mapreduce job, I want each mapper to handle the same partition from  the
> two
> sources and perform some function such as joining etc. How I  can I ensure
> that
> one mapper gets the split that corresponds to same  partition from both the
> sources?
>
>
> Any help would be highly appreciated.
>
>
>
>


Re: create error

2010-07-05 Thread Aaron Kimball
Is there a reason you're using that particular interface? That's very
low-level.

See http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample for the proper
API to use.

- Aaron

On Sat, Jul 3, 2010 at 1:36 AM, Vidur Goyal wrote:

> Hi,
>
> I am trying to create a file in hdfs . I am calling create from an
> instance of DFSClient. This is a part of code that i am using
>
> byte[] buf = new byte[65536];
>int len;
>while ((len = dis.available()) != 0) {
>if (len < buf.length) {
>break;
>} else {
>dis.read(buf, 0, buf.length);
>ds.write(buf, 0, buf.length);
>}
>}
>
> dis is DataInputStream for the local file system from which i am copying
> the file and ds is the DataOutputStream to hdfs.
>
> and i get these errors.
>
> 2010-07-03 13:45:07,480 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode:
> DatanodeRegistration(127.0.0.1:50010,
> storageID=DS-455297472-127.0.0.1-50010-1278144155322, infoPort=50075,
> ipcPort=50020):DataXceiver
> java.io.EOFException: while trying to read 65557 bytes
>at
>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:265)
>at
>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:309)
>at
>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:373)
>at
>
> org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:525)
>at
>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
>at
>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
>at java.lang.Thread.run(Thread.java:636)
>
>
> When i run the loop for number of times that is a multiple of block size ,
> the operation runs just fine. As soon as i change the buffer array size to
> a non block size , it starts giving errors.
> I am in middle of a project . Any help will be appreciated.
>
> thanks
> vidur
>
>
>


Re: Text files vs. SequenceFiles

2010-07-05 Thread Aaron Kimball
David,

I think you've more-or-less outlined the pros and cons of each format
(though do see Alex's important point regarding SequenceFiles and
compression). If everyone who worked with Hadoop clearly favored one or the
other, we probably wouldn't include support for both formats by default. :)
Neither format is "right" or "wrong" in the general case. The decision will
be application-specific.

I would point out, though, that you may be underestimating the processing
cost of parsing records. If you've got a really dead-simple problem like
"each record is just a set of integers", you could probably split a line of
text on commas/tabs/etc. into fields and then convert those to proper
integer values in a relatively efficient fashion. But if you may have
delimiters embedded in free-form strings, you'll need to build up a much
more complex DFA to process the data, and it's not too hard to find yourself
CPU-bound. (Java regular expressions can be very slow.) Yes, you can always
throw more nodes at the problem, but you may find that your manager is
unwilling to sign off on purchasing more nodes at some point :) Also,
writing/maintaining parser code is its own challenge.

If your data is essentially text in nature, you might just store it in text
files and be done with it for all the reasons you've stated.

But for complex record types, SequenceFiles will be faster. Especially if
you have to work with raw byte arrays at any point, escaping them into text
(e.g., Base64 encoding) and back again is hardly worth the trouble. Just
store it in a binary format and be done with it. Intermediate job data
should probably live as SequenceFiles all the time. They're only ever going
to be read by more MapReduce jobs, right? For data at either "edge" of your
problem--either input or final output data--you might want the greater
ubiquity of text-based files.
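
If you do go the SequenceFile route, the output-side setup is small. A sketch
using the new-API classes (and see Alex's note below about block compression
keeping gzipped output splittable):

import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SeqFileOutputConfig {
  // Configure a job to emit block-compressed SequenceFile output.
  public static void configure(Job job) {
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    // BLOCK compression keeps the files splittable even with gzip.
    SequenceFileOutputFormat.setOutputCompressionType(job,
        SequenceFile.CompressionType.BLOCK);
  }
}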

- Aaron

On Fri, Jul 2, 2010 at 3:35 PM, Joe Stein wrote:

> David,
>
> You can also set compression to occur of your data between your map &
> reduce
> tasks (this data can be large and often is quicker to compress and transfer
> than just transfer when the copy gets going).
>
> *mapred.compress.map.output*
>
> Setting this value to *true* should speed up the reducers copy greatly
> especially when working with large data sets.
>
>
> http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/
>
> When we load in our data we use the HDFS API and get the data in to begin
> with as SequenceFiles (compressed by block) and never look back from there.
>
> We have a custom SequenceFileLoader so we can still use Pig also against
> our
> SequenceFiles.  It is worth the little bit of engineering effort to save
> space.
>
> /*
> Joe Stein
> http://www.linkedin.com/in/charmalloc
> Twitter: @allthingshadoop
> */
>
> On Fri, Jul 2, 2010 at 6:14 PM, Alex Loddengaard 
> wrote:
>
> > Hi David,
> >
> > On Fri, Jul 2, 2010 at 2:54 PM, David Rosenstrauch  > >wrote:
> > >
> > > * We should use a SequenceFile (binary) format as it's faster for the
> > > machine to read than parsing text, and the files are smaller.
> > >
> > > * We should use a text file format as it's easier for humans to read,
> > > easier to change, text files can be compressed quite small, and a) if
> the
> > > text format is designed well and b) given the context of a distributed
> > > system like Hadoop where you can throw more nodes at a problem, the
> text
> > > parsing time will wind up being negligible/irrelevant in the overall
> > > processing time.
> > >
> >
> > SequenceFiles can also be compressed, either per record or per block.
>  This
> > is advantageous if you want to use gzip, because gzip isn't splittable.
>  A
> > SF compressed by blocks is therefore splittable, because each block is
> > gzipped vs. the entire file being gzipped.
> >
> > As for readability, "hadoop fs -text" is the same as "hadoop fs -cat" for
> > SequenceFiles.
> >
> > Lastly, I promise that eventually you'll run out of space in your cluster
> > and wish you did better compression.  Plus compression makes jobs faster.
> >
> > The general recommendation is to use SequenceFiles as early in your ETL
> as
> > possible.  Usually people get their data in as text, and after the first
> MR
> > pass they work with SequenceFiles from there on out.
> >
> > Alex
> >
>


Re: Is it possible ....!!!

2010-06-10 Thread Aaron Kimball
Hadoop has some classes for controlling how sockets are used. See
org.apache.hadoop.net.StandardSocketFactory, SocksSocketFactory.

The socket factory implementation chosen is controlled by the
hadoop.rpc.socket.factory.class.default configuration parameter. You could
probably write your own SocketFactory that gives back socket implementations
that tee the conversation to another port, or to a file, etc.
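
Wiring that in would look something like this in core-site.xml (the class
name here is purely hypothetical):

  <property>
    <name>hadoop.rpc.socket.factory.class.default</name>
    <value>com.example.TeeSocketFactory</value>
  </property>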

So, "it's possible," but I don't know that anyone's implemented this. I
think others may have examined Hadoop's protocols via wireshark or other
external tools, but those don't have much insight into Hadoop's internals.
(Neither, for that matter, would the socket factory. You'd probably need to
be pretty clever to introspect as to exactly what type of message is being
sent and actually do semantic analysis, etc.)

Allen's suggestion is probably more "correct," but might incur additional
work on your part.

Cheers,
- Aaron

On Thu, Jun 10, 2010 at 3:54 PM, Allen Wittenauer
wrote:

>
> On Jun 10, 2010, at 3:25 AM, Ahmad Shahzad wrote:
> > Reason for doing that is that i want all the communication to happen
> through
> > a communication library that resolves every communication problem that we
> > can have e.g firewalls, NAT, non routed paths, multi homing etc etc. By
> > using that library all the headache of communication will be gone. So, we
> > will be able to use hadoop quite easily and there will be no
> communication
> > problems.
>
> I know Owen pointed you towards using proxies, but anything remotely
> complex would probably be better in an interposer library, as then it is
> application agnostic.


Re: help on CombineFileInputFormat

2010-05-10 Thread Aaron Kimball
Zhenyu,

It's a bit complicated and involves some layers of
indirection. CombineFileRecordReader is a sort of shell RecordReader that
passes the actual work of reading records to another child record reader.
That's the class name provided in the third parameter. Instructing it to use
CombineFileRecordReader again as its child RR doesn't tell it to do anything
useful. You must give it the name of another RecordReader class that
actually understands how to parse your particular records.

Unfortunately, TextInputFormat's LineRecordReader and
SequenceFileInputFormat's SequenceFileRecordReader both require the
InputSplit to be a FileSplit. So you can't use them directly.
(CombineFileInputFormat will pass a CombineFileSplit to the
CombineFileRecordReader which is then passed along to the child RR that you
specify.)

In Sqoop I got around this by creating (another!) indirection class called
CombineShimRecordReader.

The export functionality of Sqoop uses CombineFileInputFormat to allow the
user to specify the number of map tasks; it then organizes a set of input
files into that many tasks. This instantiates a CombineFileRecordReader
configured to forward its InputSplit to CombineShimRecordReader.
CombineShimRecordReader then translates the CombineFileSplit into a regular
FileSplit and forwards that to LineRecordReader (for text) or
SequenceFileRecordReader (for SequenceFiles). The grandchild (LineRR or
SequenceFileRR) is determined on a file-by-file basis by
CombineShimRecordReader, by calling a static method of Sqoop's
ExportJobBase.
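
Boiled down to plain-text input, the shim pattern looks roughly like this (a
sketch against the old mapred API; the class names are made up and this is
not Sqoop's actual code):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

public class CombinedTextInputFormat
    extends CombineFileInputFormat<LongWritable, Text> {

  @SuppressWarnings("unchecked")
  public RecordReader<LongWritable, Text> getRecordReader(InputSplit split,
      JobConf job, Reporter reporter) throws IOException {
    // Hand each file inside the CombineFileSplit to the shim class below.
    return new CombineFileRecordReader<LongWritable, Text>(job,
        (CombineFileSplit) split, reporter, (Class) TextShim.class);
  }

  // CombineFileRecordReader instantiates this once per file in the split,
  // through a constructor with exactly this signature.
  public static class TextShim implements RecordReader<LongWritable, Text> {
    private final LineRecordReader delegate;

    public TextShim(CombineFileSplit split, Configuration conf,
        Reporter reporter, Integer index) throws IOException {
      // Translate the index-th file of the combined split into a plain
      // FileSplit that LineRecordReader knows how to read.
      FileSplit fileSplit = new FileSplit(split.getPath(index),
          split.getOffset(index), split.getLength(index), split.getLocations());
      delegate = new LineRecordReader(conf, fileSplit);
    }

    public boolean next(LongWritable key, Text value) throws IOException {
      return delegate.next(key, value);
    }
    public LongWritable createKey() { return delegate.createKey(); }
    public Text createValue() { return delegate.createValue(); }
    public long getPos() throws IOException { return delegate.getPos(); }
    public float getProgress() throws IOException { return delegate.getProgress(); }
    public void close() throws IOException { delegate.close(); }
  }
}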

You can take a look at the source of these classes here:
* http://github.com/cloudera/sqoop/blob/master/src/shims/common/org/apache/hadoop/sqoop/mapreduce/ExportInputFormat.java
* http://github.com/cloudera/sqoop/blob/master/src/shims/common/org/apache/hadoop/sqoop/mapreduce/CombineShimRecordReader.java
* http://github.com/cloudera/sqoop/blob/master/src/java/org/apache/hadoop/sqoop/mapreduce/ExportJobBase.java

(apologies for the lengthy URLs; you could also just download the whole
project's source at http://github.com/cloudera/sqoop) :)

Cheers,
- Aaron


On Thu, May 6, 2010 at 7:32 AM, Zhenyu Zhong wrote:

> Hi,
>
> I tried to use CombineFileInputFormat in 0.20.2. It seems I need to extend
> it because it is an abstract class.
> However, I need to implement getRecordReader method in the extended class.
>
> May I ask how to implement this getRecordReader method?
>
> I tried to do something like this:
>
> public RecordReader getRecordReader(InputSplit genericSplit, JobConf job,
>
> Reporter reporter) throws IOException {
>
> // TODO Auto-generated method stub
>
> reporter.setStatus(genericSplit.toString());
>
> return new CombineFileRecordReader(job, (CombineFileSplit) genericSplit,
> reporter, CombineFileRecordReader.class);
>
> }
>
> It doesn't seem to be working. I would be very appreciated if someone can
> shed a light on this.
>
> thanks
> zhenyu
>


Sqoop is moving to github!

2010-03-29 Thread Aaron Kimball
Hi Hadoop, Hive, and Sqoop users,

For the past year, the Apache Hadoop MapReduce project has played host to
Sqoop, a command-line tool that performs parallel imports and exports
between relational databases and HDFS. We've developed a lot of features and
gotten a lot of great feedback from users. While Sqoop was a contrib project
in Hadoop, it has been steadily improved and grown.

But the contrib directory is a home for new or small projects incubating
underneath Hadoop's umbrella. Sqoop is starting to look less like a small
project these days. In particular, a feature that has been growing in
importance for Sqoop is its ability to integrate with Hive. In order to
facilitate this integration from a compilation and testing standpoint, we've
pulled Sqoop out of contrib and into its own repository hosted on github.

You can download all the relevant bits here:
http://www.github.com/cloudera/sqoop

The code there will run in conjunction with the Apache Hadoop trunk source.
(Compatibility with other distributions/versions is forthcoming.)

While we've changed hosts, Sqoop will keep the same license -- future
improvements will continue to remain Apache 2.0-licensed. We welcome the
contributions of all in the open source community; there's a lot of exciting
work still to be done! If you'd like to help out but aren't sure where to
start, send me an email and I can recommend a few areas where improvements
would be appreciated.

Want some more information about Sqoop? An introduction is available here:
http://www.cloudera.com/sqoop
A ready-to-run release of Sqoop is included with Cloudera's Distribution for
Hadoop: http://archive.cloudera.com
And its reference manual is available for browsing at
http://archive.cloudera.com/docs/sqoop

If you have any questions about this move process, please ask me.

Regards,
- Aaron Kimball
Cloudera, Inc.


Re: Sqoop Installation on Apache Hadop 0.20.2

2010-03-17 Thread Aaron Kimball
Hi Utku,

Apache Hadoop 0.20 cannot support Sqoop as-is. Sqoop makes use of the
DataDrivenDBInputFormat (among other APIs) which are not shipped with
Apache's 0.20 release. In order to get Sqoop working on 20, you'd need to
apply a lengthy list of patches from the project source repository to your
copy of Hadoop and recompile. Or you could just download it all from
Cloudera, where we've done that work for you :)

So as it stands, Sqoop won't be able to run on 0.20 unless you choose to use
Cloudera's distribution.  Do note that your use of the term "fork" is a bit
strong here; with the exception of (minor) modifications to make it interact
in a more compatible manner with the external Linux environment, our
distribution only includes code that's available to the project at large.
But some of that code has not been rolled into a binary release from Apache
yet. If you choose to go with Cloudera's distribution, it just means that
you get publicly-available features (like Sqoop, MRUnit, etc.) a year or so
ahead of what Apache has formally released, but our codebase isn't radically
diverging; CDH is just somewhere ahead of the Apache 0.20 release, but
behind Apache's svn trunk. (All of Sqoop, MRUnit, etc. are available in the
Hadoop source repository on the trunk branch.)

If you install our distribution, then Sqoop will be installed in
/usr/lib/hadoop-0.20/contrib/sqoop and /usr/bin/sqoop for you. There isn't a
separate package to install Sqoop independent of the rest of CDH; thus no
extra download link on our site.

I hope this helps!

Good luck,
- Aaron


On Wed, Mar 17, 2010 at 4:30 AM, Reik Schatz  wrote:

> At least for MRUnit, I was not able to find it outside of the Cloudera
> distribution (CDH). What I did: installed CDH locally using apt (Ubuntu),
> searched for and copied the mrunit library into my local Maven repository,
> and removed CDH afterwards. I guess the same is possible for Sqoop.
>
> /Reik
>
>
> Utku Can Topçu wrote:
>
>> Dear All,
>>
>> I'm trying to run tests using MySQL as some kind of a datasource, so I
>> thought cloudera's sqoop would be a nice project to have in the
>> production.
>> However, I'm not using the cloudera's hadoop distribution right now, and
>> actually I'm not thinking of switching from a main project to a fork.
>>
>> I read the documentation on sqoop at
>> http://www.cloudera.com/developers/downloads/sqoop/ but there are
>> actually
>> no links for downloading the sqoop itself.
>>
>> Has anyone here know, and tried to use sqoop with the latest apache
>> hadoop?
>> If so can you give me some tips and tricks on it?
>>
>> Best Regards,
>> Utku
>>
>>
>
> --
>
> *Reik Schatz*
> Technical Lead, Platform
> P: +46 8 562 470 00
> M: +46 76 25 29 872
> F: +46 8 562 470 01
> E: reik.sch...@bwin.org 
> */bwin/* Games AB
> Klarabergsviadukten 82,
> 111 64 Stockholm, Sweden
>
>
>


Re: CombineFileInputFormat in 0.20.2 version

2010-03-16 Thread Aaron Kimball
The most obvious workaround is to use the old API (continue to use Mapper,
Reducer, etc. from org.apache.hadoop.mapred, not .mapreduce).

If you really want to use the new API, though, I unfortunately don't see a
super-easy path. You could try to apply the patch from MAPREDUCE-364 to your
version of Hadoop and recompile, but that might be tricky since the
filenames will most likely not line up (due to the project split).

- Aaron

On Tue, Mar 16, 2010 at 8:11 AM, Aleksandar Stupar <
stupar.aleksan...@yahoo.com> wrote:

> Hi all,
>
> I want to use CombineFileInputFormat in 0.20.2 version but it can't be used
> with Job class.
>
> Description:
> org.apache.hadoop.mapred.lib.CombineFileInputFormat can not be used with
> org.apache.hadoop.mapreduce.Job
> because Job.setInputFormat requires subclass of
>  org.apache.hadoop.mapreduce.InputFormat and CombineFileInputFormat
> extends org.apache.hadoop.mapred.FileInputFormat.
>
> Also CombineFileInputFormat uses deprecated classes.
>
>
> Are there any workarounds?
>
> Thanks,
> Aleksandar Stupar.
>
>
>
>


Re: Unexpected termination of a job

2010-03-03 Thread Aaron Kimball
If it's terminating before you even run a job, then you're in luck -- it's
all still running on the local machine. Try running it in Eclipse and use
the debugger to trace its execution.

- Aaron

On Wed, Mar 3, 2010 at 4:13 AM, Rakhi Khatwani  wrote:

> Hi,
>I am running a job which has lotta preprocessing involved. so whn i
> run my class from a jarfile, somehow it terminates after sometime without
> giving any exception,
> i have tried running the same program several times, and everytime it
> terminates at different locations in the code(during the preprocessing...
> haven't configured a job as yet). Probably it terminaits after a fixed
> interval).
> No idea why this is happeneing, Any Pointers??
> Regards,
> Raakhi Khatwani
>


Re: Separate mail list for streaming?

2010-03-03 Thread Aaron Kimball
We've already got a lot of mailing lists :) If you send questions to
mapreduce-user, are you not getting enough feedback?

- Aaron

On Wed, Mar 3, 2010 at 12:09 PM, Michael Kintzer
wrote:

> Hi,
>
> Was curious if anyone else thought it would be useful to have a separate
> mail list for discussion/issues specific to Hadoop Streaming?
>
> Thanks,
>
> Michael
>


Re: dataset

2010-03-03 Thread Aaron Kimball
Look at implementing your own Partitioner to control which records are sent
to which reduce shards.
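
A bare-bones sketch of that direction, as a starting point (the key type and
the "hot key" rule are made-up assumptions; substitute whatever skew you want
to produce):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class SkewedPartitioner extends Partitioner<IntWritable, Text> {
  @Override
  public int getPartition(IntWritable key, Text value, int numPartitions) {
    // Funnel a "hot" key range into reducer 0 to create skew; hash the rest
    // across the remaining reducers.
    if (numPartitions == 1 || key.get() < 100) {
      return 0;
    }
    return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
  }
}

You'd register it on the job with job.setPartitionerClass(SkewedPartitioner.class).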

- Aaron

On Wed, Mar 3, 2010 at 12:15 PM, Gang Luo  wrote:

> Hi all,
> I want to generate some datasets with data skew to test my mapreduce jobs.
> I am using TPC-DS but it seems I cannot control the data skew level. There
> is a suite from Microsoft that could generate skewed datasets based on
> TPC-D, but it only works on Windows. I haven't succeeded in making it
> compile on Linux yet. Please tell me how I can get some skewed datasets.
>
> Thanks.
> -Gang
>
>
>
>
>


Re: Why is $JAVA_HOME/lib/tools.jar in the classpath?

2010-02-17 Thread Aaron Kimball
Thomas,

What version of Hadoop are you building Debian packages for? If you're
taking Cloudera's existing debs and modifying them, these include a backport
of Sqoop (from Apache's trunk) which uses the rt tools.jar to compile
auto-generated code at runtime. Later versions of Sqoop (including the one
in the most recently-released CDH2: 0.20.1+169.56-1) include MAPREDUCE-1146
which eliminates that dependency.

- Aaron

On Tue, Feb 16, 2010 at 3:19 AM, Steve Loughran  wrote:

> Thomas Koch wrote:
>
>> Hi,
>>
>> I'm working on the Debian package for hadoop (the first version is already
>> in the new queue for Debian unstable).
>> Now I stumpled about $JAVA_HOME/lib/tools.jar in the classpath. Since
>> Debian supports different JAVA runtimes, it's not that easy to know, which
>> one the user currently uses and therefor I'd would make things easier if
>> this jar would not be necessary.
>> From searching and inspecting the SVN history I got the impression, that
>> this is an ancient legacy that's not necessary (anymore)?
>>
>>
> I don't think hadoop core/hdfs/mapred needs it. The only place where it
> would be needed is JSP->java->binary work, but as the JSPs are precompiled
> you can probably get away without it. Just add tests for all the JSPs to
> make sure they work.
>
>
> -steve
>


Re: mapred.system.dir

2010-02-12 Thread Aaron Kimball
To expand on Eric's comment: dfs.data.dir is the local filesystem directory
(or directories) that a particular datanode uses to store its slice of the
HDFS data blocks.

so dfs.data.dir might be "/home/hadoop/data/" on some machine; a bunch of
files with inscrutable names like blk_4546857325993894516 will be stored
there. These "blk" files represent chunks of "real" complete user-accessible
files in HDFS-proper.

mapred.system.dir is a filesystem path like "/system/mapred" which is served
by the HDFS, where files used by MapReduce appear. The purpose of the config
file comment is to let you know that you're free to pick a path name like
"/system/mapred" here even though your local Linux machine doesn't have a
path named "/system"; this HDFS path is in a separate (HDFS-specific)
namespace from "/home", "/etc", "/var" and the other various denizens of
your local machine.
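
Concretely, the two settings might look like this (values are examples only):

  <!-- hdfs-site.xml: a local disk path on each datanode -->
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/data</value>
  </property>

  <!-- mapred-site.xml: a path inside HDFS itself -->
  <property>
    <name>mapred.system.dir</name>
    <value>/system/mapred</value>
  </property>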

- Aaron

On Fri, Feb 12, 2010 at 6:23 AM, Eric Sammer  wrote:

> On 2/12/10 8:40 AM, Edson Ramiro wrote:
> > Hi all,
> >
> > I'm setting up a Hadoop Cluster and some doubts have
> >  arisen about hadoop configuration.
> >
> > The Hadoop Cluster Setup [1] says that the mapred.system.dir must
> > be in the HDFS and be accessible from both the server and clients.
> >
> > Where is the HDFS directory? is the dfs.data.dir?
> >
> > should I export by NFS or other protocol the mapred.system.dir to
> > leave it accessible from server and clients?
> >
> > Thanks in advance
> >
> > [1] http://hadoop.apache.org/common/docs/current/cluster_setup.html
> >
> > Edson Ramiro
> >
>
> Edson:
>
> An HDFS file system is a distributed global view controlled by the
> namenode. If a file is "in HDFS" all clients and servers that are
> pointed at the namenode will be able to see everything. This means that
> you don't need to do anything special to export or reveal the
> mapred.system.dir; that's what HDFS does. It's worth reading the HDFS
> Architecture paper on the Hadoop site or the Google GFS paper for
> details on how this all works and how it relates to map reduce.
>
> HTH.
> --
> Eric Sammer
> e...@lifeless.net
> http://esammer.blogspot.com
>


Re: Ubuntu Single Node Tutorial failure. No live or dead nodes.

2010-02-12 Thread Aaron Kimball
Sonal,

Can I ask why you're sleeping between starting hdfs and mapreduce? I've
never needed this in my own code. In general, Hadoop is pretty tolerant
about starting daemons "out of order."

If you need to wait for HDFS to be ready and come out of safe mode before
launching a job, that's another story, but you can accomplish that with:

$HADOOP_HOME/bin/hadoop dfsadmin -safemode wait

... which will block until HDFS is ready for user commands in read/write
mode.
- Aaron


On Fri, Feb 12, 2010 at 8:44 AM, Sonal Goyal  wrote:

> Hi
>
> I had faced a similar issue on Ubuntu and Hadoop 0.20 and modified the
> start-all script to introduce a sleep time :
>
> bin=`dirname "$0"`
> bin=`cd "$bin"; pwd`
>
> . "$bin"/hadoop-config.sh
>
> # start dfs daemons
> "$bin"/start-dfs.sh --config $HADOOP_CONF_DIR
> echo 'sleeping'
> sleep 60
> echo 'awake'
> # start mapred daemons
> "$bin"/start-mapred.sh --config $HADOOP_CONF_DIR
>
>
> This seems to work. Please see if this works for you.
> Thanks and Regards,
> Sonal
>
>
> On Thu, Feb 11, 2010 at 3:56 AM, E. Sammer  wrote:
>
> > On 2/10/10 5:19 PM, Nick Klosterman wrote:
> >
> >> @E.Sammer, no I don't *think* that it is part of another cluster. The
> >> tutorial is for a single node cluster just as a initial set up to see if
> >> you can get things up and running. I have reformatted the namenode
> >> several times in my effort to get hadoop to work.
> >>
> >
> > What I mean is that the data node, at some point, connected to your name
> > node. If you reformat the name node, the data node must be wiped clean;
> it's
> > effectively trying to join a name node that no longer exists.
> >
> >
> > --
> > Eric Sammer
> > e...@lifeless.net
> > http://esammer.blogspot.com
> >
>


Re: Identity Reducer

2010-02-11 Thread Aaron Kimball
Can you post the entire exception with its accompanying stack trace?
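
For reference, here is a sketch of an identity reducer against the old
(org.apache.hadoop.mapred) API -- the Text/IntWritable types are just
placeholders for whatever your job actually emits. Hadoop also ships
org.apache.hadoop.mapred.lib.IdentityReducer if you don't need custom behavior.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // identity: pass every value through unchanged
    while (values.hasNext()) {
      output.collect(key, values.next());
    }
  }
}
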
- Aaron

On Thu, Feb 11, 2010 at 5:26 PM, Prabhu Hari Dhanapal <
dragonzsn...@gmail.com> wrote:

> @ Jeff
> I seem to have used the Mapper you are pointing to ...
>
>
> import org.apache.hadoop.mapred.MapReduceBase;
> import org.apache.hadoop.mapred.Mapper;
>
> Will that affect the Reducer in any sense?
>
>
> On Thu, Feb 11, 2010 at 8:19 PM, Jeff Zhang  wrote:
>
> > I guess you are using org.apache.hadoop.mapreduce.Mapper which is a class
> > for hadoop new API. you can use the org.apache.hadoop.mapred.Mapper which
> > is
> > for old API
> >
> >
> >
> > On Thu, Feb 11, 2010 at 5:10 PM, Prabhu Hari Dhanapal <
> > dragonzsn...@gmail.com> wrote:
> >
> > > Hi all,
> > >
> > > I m trying to Write a program that performs some simple Datamining on a
> > > certain DataSet. I was told that an Identity Reducer should be written.
> > >
> > > public class Reduce extends MapReduceBase implements Reducer {
> > >
> > > public void reduce(Text key, Iterator
> > > values,OutputCollector output,
> > >  Reporter reporter) throws IOException {
> > >
> > > ===>
> > > It shows me the following exception  ..
> > >
> > > "The type Reducer cannot be a superinterface of Reduce; a
> superinterface
> > > must be an interface "
> > > Can I have some pointers on solving this?
> > >
> > >
> > > --
> > > Hari
> > >
> >
> >
> >
> > --
> > Best Regards
> >
> > Jeff Zhang
> >
>
>
>
> --
> Hari
>


Re: Any alternative for MultipleOutputs class in hadoop 0.18.3

2010-02-11 Thread Aaron Kimball
There's an older mechanism called MultipleOutputFormat which may do what you
need.
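
A rough sketch of how it's typically used (old API; the split-by-key policy
here is just an example):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Sends each record to a file named after its key, under the job output dir.
public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
  @Override
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    // 'name' is the default part-NNNNN leaf name; nest it under the key
    return key.toString() + "/" + name;
  }
}

// In the driver: conf.setOutputFormat(KeyBasedOutputFormat.class);
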
- Aaron

On Fri, Feb 5, 2010 at 10:13 AM, Udaya Lakshmi  wrote:

> Hi,
>  MultipleOutput class is not available in hadoop 0.18.3. Is there any
> alternative for this class? Please point me useful link.
>
> Thanks,
> Udaya.
>


Re: map side only behavior

2010-01-31 Thread Aaron Kimball
In a map-only job, map tasks will be connected directly to the OutputFormat.
So calling output.collect() / context.write() in the mapper will emit data
straight to files in HDFS without sorting the data. There is no sort buffer
involved. If you want exactly one output file, follow Nick's advice.
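
For the map-only case, the driver change is just this (sketch; "job" / "conf"
stand for whatever driver object you already have):

job.setNumReduceTasks(0);    // new API -- map output goes straight to the OutputFormat
// or, with the old API:
conf.setNumReduceTasks(0);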

- Aaron

On Fri, Jan 29, 2010 at 8:32 AM, Jones, Nick  wrote:

> A single unity reducer should enforce a merge and sort to generate one
> file.
>
> Nick Jones
>
> -Original Message-
> From: Jeff Zhang [mailto:zjf...@gmail.com]
> Sent: Friday, January 29, 2010 10:06 AM
> To: common-user@hadoop.apache.org
> Subject: Re: map side only behavior
>
> No, the merge and sort will not happen in mapper task. And each mapper task
> will generate one output file.
>
>
>
> 2010/1/29 Gang Luo 
>
> > Hi all,
> > If I only use map side to process my data (set # of reducers to 0 ), what
> > is the behavior of hadoop? Will it merge and sort each of the spills
> > generated by one mapper?
> >
> > -Gang
> >
> >
> > ----- Original Message -----
> > From: Gang Luo
> > To: common-user@hadoop.apache.org
> > Sent: Friday, 2010/1/29, 8:54:33 AM
> > Subject: Re: fine granularity operation on HDFS
> >
> > Yeah, I see how it works. Thanks Amogh.
> >
> >
> > -Gang
> >
> >
> >
> > ----- Original Message -----
> > From: Amogh Vasekar
> > To: "common-user@hadoop.apache.org"
> > Sent: Thursday, 2010/1/28, 10:00:22 AM
> > Subject: Re: fine granularity operation on HDFS
> >
> > Hi Gang,
> > Yes PathFilters work only on file paths. I meant you can include such
> type
> > of logic at split level.
> > The input format's getSplits() method is responsible for computing and
> > adding splits to a list container, for which JT initializes mapper tasks.
> > You can override the getSplits() method to add only a few , say, based on
> > the location or offset, to the list. Here's the reference :
> > while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
> >  int blkIndex = getBlockIndex(blkLocations,
> length-bytesRemaining);
> >  splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
> >   blkLocations[blkIndex].getHosts()));
> >  bytesRemaining -= splitSize;
> >}
> >
> >if (bytesRemaining != 0) {
> >  splits.add(new FileSplit(path, length-bytesRemaining,
> > bytesRemaining,
> > blkLocations[blkLocations.length-1].getHosts()));
> >
> > Before splits.add you can use your logic for discarding. However, you
> need
> > to ensure your record reader takes care of incomplete records at
> boundaries.
> >
> > To get the block locations to load separately, the FileSystem class APIs
> > expose few methods like getBlockLocations etc ..
> > Hope this helps.
> >
> > Amogh
> >
> > On 1/28/10 7:26 PM, "Gang Luo"  wrote:
> >
> > Thanks Amogh.
> >
> > For the second part of my question, I actually mean loading block
> > separately from HDFS. I don't know whether it is realistic. Anyway, for
> my
> > goal is to process different division of a file separately, to do that at
> > split level is OK. But even I can get the splits from inputformat, how to
> > "add only a few splits you need to mapper and discard the others"?
> > (pathfilters only works for file, but not block, I think).
> >
> > Thanks.
> > -Gang
> >
> >
> >
> >
> >
> >
> >
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: DBOutputFormat Speed Issues

2010-01-31 Thread Aaron Kimball
Nick,

I'm afraid that right now the only available OutputFormat for JDBC is that
one. You'll note that DBOutputFormat doesn't really include much support for
special-casing to MySQL or other targets.

Your best bet is to probably copy the code from DBOutputFormat and
DBConfiguration into some other class (e.g. MySQLDBOutputFormat) and modify
the code in the RecordWriter to generate PreparedStatements containing
batched insert statements.
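
As a very rough sketch of the batching idea (class and variable names and the
batch size are illustrative; with MySQL Connector/J you would likely also want
rewriteBatchedStatements=true on the JDBC URL so batches become multi-row
INSERTs):

import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class BatchedInsertWriter {
  private static final int BATCH_SIZE = 1000;       // tune for your setup
  private final PreparedStatement statement;
  private int pending = 0;

  public BatchedInsertWriter(Connection conn, String insertSql) throws SQLException {
    // e.g. "INSERT INTO mytable (a, b) VALUES (?, ?)"
    this.statement = conn.prepareStatement(insertSql);
  }

  public void write(DBWritable row) throws IOException {
    try {
      row.write(statement);       // DBWritable fills in the '?' parameters
      statement.addBatch();       // queue instead of executing per row
      if (++pending >= BATCH_SIZE) {
        flush();
      }
    } catch (SQLException e) {
      throw new IOException(e.toString());
    }
  }

  public void flush() throws IOException {
    try {
      statement.executeBatch();   // one round trip for the whole batch
      pending = 0;
    } catch (SQLException e) {
      throw new IOException(e.toString());
    }
  }
}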

If you arrive at a solution which is pretty general-purpose/robust, please
consider contributing it back to the Hadoop project :) If you do so, send me
an email off-list; I'm happy to help with advice on developing better DB
integration code, reviewing your work, etc.

Also on the input side, you should really be using DataDrivenDBInputFormat
instead of the older DBIF :) Sqoop (in src/contrib/sqoop on Apache 0.21 /
CDH 0.20) has pretty good support for parallel imports, and uses this
InputFormat instead.

- Aaron

On Thu, Jan 28, 2010 at 11:39 AM, Nick Jones  wrote:

> Hi all,
> I have a use case for collecting several rows from MySQL of
> compressed/unstructured data (n rows), expanding the data set, and storing
> the expanded results back into a MySQL DB (100,000n rows). DBInputFormat
> seems to perform reasonably well but DBOutputFormat is inserting rows
> one-by-one.  How can I take advantage of MySQL's support for generating fewer
> INSERT statements with more values in each one?
>
> Thanks.
> --
> Nick Jones
>
>


Re: hadoop under cygwin issue

2010-01-31 Thread Aaron Kimball
Brian, it looks like you missed a step in the instructions. You'll need to
format the hdfs filesystem instance before starting the NameNode server:

You need to run:

$ bin/hadoop namenode -format

.. then you can do bin/start-dfs.sh
Hope this helps,
- Aaron


On Sat, Jan 30, 2010 at 12:27 AM, Brian Wolf  wrote:

>
> Hi,
>
> I am trying to run Hadoop 0.19.2 under cygwin as per directions on the
> hadoop "quickstart" web page.
>
> I know sshd is running and I can "ssh localhost" without a password.
>
> This is from my hadoop-site.xml
>
> <configuration>
>   <property>
>     <name>hadoop.tmp.dir</name>
>     <value>/cygwin/tmp/hadoop-${user.name}</value>
>   </property>
>   <property>
>     <name>fs.default.name</name>
>     <value>hdfs://localhost:9000</value>
>   </property>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>localhost:9001</value>
>   </property>
>   <property>
>     <name>mapred.job.reuse.jvm.num.tasks</name>
>     <value>-1</value>
>   </property>
>   <property>
>     <name>dfs.replication</name>
>     <value>1</value>
>   </property>
>   <property>
>     <name>dfs.permissions</name>
>     <value>false</value>
>   </property>
>   <property>
>     <name>webinterface.private.actions</name>
>     <value>true</value>
>   </property>
> </configuration>
>
> These are errors from my log files:
>
>
> 2010-01-30 00:03:33,091 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
> Initializing RPC Metrics with hostName=NameNode, port=9000
> 2010-01-30 00:03:33,121 INFO
> org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: localhost/
> 127.0.0.1:9000
> 2010-01-30 00:03:33,161 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> Initializing JVM Metrics with processName=NameNode, sessionId=null
> 2010-01-30 00:03:33,181 INFO
> org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing
> NameNodeMeterics using context
> object:org.apache.hadoop.metrics.spi.NullContext
> 2010-01-30 00:03:34,603 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> fsOwner=brian,None,Administrators,Users
> 2010-01-30 00:03:34,603 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
> 2010-01-30 00:03:34,603 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> isPermissionEnabled=false
> 2010-01-30 00:03:34,653 INFO
> org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics:
> Initializing FSNamesystemMetrics using context
> object:org.apache.hadoop.metrics.spi.NullContext
> 2010-01-30 00:03:34,653 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
> FSNamesystemStatusMBean
> 2010-01-30 00:03:34,803 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Storage directory C:\cygwin\tmp\hadoop-brian\dfs\name does not exist.
> 2010-01-30 00:03:34,813 ERROR
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
> initialization failed.
> org.apache.hadoop.hdfs.server.common.InconsistentFSStateException:
> Directory C:\cygwin\tmp\hadoop-brian\dfs\name is in an inconsistent state:
> storage directory does not exist or is not accessible.
>   at
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:278)
>   at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
>   at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:309)
>   at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:288)
>   at
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163)
>   at
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:208)
>   at
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:194)
>   at
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:859)
>   at
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868)
> 2010-01-30 00:03:34,823 INFO org.apache.hadoop.ipc.Server: Stopping server
> on 9000
>
>
>
>
>
> =
>
> 2010-01-29 15:13:30,270 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: localhost/127.0.0.1:9000. Already tried 9 time(s).
> problem cleaning system directory: null
> java.net.ConnectException: Call to localhost/127.0.0.1:9000 failed on
> connection exception: java.net.ConnectException: Connection refused: no
> further information
>   at org.apache.hadoop.ipc.Client.wrapException(Client.java:724)
>   at org.apache.hadoop.ipc.Client.call(Client.java:700)
>   at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>   at $Proxy4.getProtocolVersion(Unknown Source)
>   at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:348)
>   at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>
>
>
> Thanks
> Brian
>
>


Re: build and use hadoop-git

2010-01-24 Thread Aaron Kimball
See http://wiki.apache.org/hadoop/HowToContribute for more step-by-step
instructions.
- Aaron

On Fri, Jan 22, 2010 at 7:36 PM, Kay Kay  wrote:

> Start with hadoop-common to start building .
>
> hadoop-hdfs / hadoop-mapred pull the dependencies from apache snapshot
> repository that contains the nightlies of last successful builds so in
> theory all 3 could be built independently because of the respective
> snapshots being present in apache snapshot repository.
>
> If you do want to make cross-project changes and test them -
> * create a local ivy resolver and
> * place it before the apache snapshot in the ivy settings .
> * publish the jar for a given project to the directory pointed by the ivy
> resolver in step 1
> * clear ivy cache
> * recompile.
>
>
>
>
>
> On 11/29/09 11:05 AM, Vasilis Liaskovitis wrote:
>
>> Hi,
>>
>> how can I build and use hadoop-git?
>> The project has recently been split into 3 repositories hadoop-common,
>> hadoop-hdfs and hadoop-mapred. It's not clear to me how to
>> build/compile and use the git/tip for the whole framework. E.g. would
>> building all jars from the 3 subprojects (and copying them under the
>> same directory) work?
>> Is there a guide/wiki page out there for this? Or perhaps there is
>> another repository which still has a centralized trunk for all
>> subprojects?
>> thanks in advance,
>>
>> - Vasilis
>>
>>
>
>


Re: Multiple file output

2010-01-07 Thread Aaron Kimball
Note that org.apache.hadoop.mapreduce.lib.output.MultipleOutputs is
scheduled for the next CDH 0.20 release -- ready "soon."
- Aaron

2010/1/6 Amareshwari Sri Ramadasu 

> No. It is part of branch 0.21 onwards. For 0.20*, people can use old api
> only, though JobConf is deprecated.
>
> -Amareshwari.
>
> On 1/6/10 11:52 AM, "Vijay"  wrote:
>
> org.apache.hadoop.mapreduce.lib.output.MultipleOutputs is not part of the
> released version of 0.20.1 right? Is this expected to be part of 0.20.2 or
> later?
>
>
> 2010/1/5 Amareshwari Sri Ramadasu 
>
> > In branch 0.21, You can get the functionality of both
> > org.apache.hadoop.mapred.lib.MultipleOutputs and
> > org.apache.hadop.mapred.lib.MultipleOutputFormat in
> > org.apache.hadoop.mapreduce.lib.output.MultipleOutputs. Please see
> > MAPREDUCE-370 for more details.
> >
> > Thanks
> > Amareshwari
> >
> > On 1/5/10 5:56 PM, "松柳"  wrote:
> >
> > I'm afraid you have to write it by yourself, since there are no
> equivalent
> > classes in new API.
> >
> > 2009/12/28 Huazhong Ning 
> >
> > > Hi all,
> > >
> > > I need your help on multiple file output. I have many big files and I
> > hope
> > > the processing result of each file is outputted to a separate file. I
> > know
> > > in the old Hadoop APIs, the class MultipleOutputFormat works for this
> > > propose. But I cannot find the same class in new APIs. Does anybody
> know
> > in
> > > the new APIs how to solve this problem?
> > > Thanks a lot.
> > >
> > > Ning, Huazhong
> > >
> > >
> > >
> >
> >
>
>


Re: Problems on configure FairScheduler

2009-12-11 Thread Aaron Kimball
You'll need to configure mapred.fairscheduler.allocation.file to point to
your fairscheduler.xml file; this file must contain at least the following:

<?xml version="1.0"?>
<allocations>
</allocations>

- Aaron

On Thu, Dec 10, 2009 at 10:34 PM, Rekha Joshi wrote:

>  What’s your hadoop version/distribution? In any case, to eliminate the
> easy suspects first, what do the hadoop logs say on restart? Did you provide
> the port on the JobTracker URL? Thanks!
>
>
> On 12/11/09 8:43 AM, "Jeff Zhang"  wrote:
>
> Hi all,
>
> I'd like to configure FairScheduler on hadoop. but seems it can not work.
>
> The following is what I did
> 1. add fairscheduler.jar to lib
>
> 2. add the following property to mapred-site.xml
>
>   <property>
>     <name>mapred.jobtracker.taskScheduler</name>
>     <value>org.apache.hadoop.mapred.FairScheduler</value>
>   </property>
>
> 3. restart hadoop cluster
>
> Although I did this work, I cannot open the page
> http://<jobtracker URL>/scheduler
>
> Did I miss something ?
>
> Thank you for any help.
>
>
> Jeff Zhang
>
>
>


Re: Re: Re: Re: Doubt in Hadoop

2009-11-30 Thread Aaron Kimball
You need to send a jar to the cluster so it can run your code there. Hadoop
doesn't magically know which jar is the one containing your main class, or
that of your mapper/reducer -- so you need to tell it via that call so it
knows which jar file to upload.
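
For example, a driver fragment (old JobConf API; "A" here stands in for whatever
your main class is):

JobConf jobConf = new JobConf(A.class);   // or: jobConf.setJarByClass(A.class);
jobConf.setMapperClass(Map.class);
jobConf.setReducerClass(Reduce.class);
// Hadoop now ships the jar containing A (and therefore Map/Reduce, which live
// in the same jar) to the cluster and uses it for classloading in the tasks.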

- Aaron

On Sun, Nov 29, 2009 at 7:54 AM,  wrote:

> Hi,
>   Actually, I just made the change suggested by Aaron and my code worked.
> But I
> still would like to know why does the setJarbyClass() method have to be
> called
> when the Main class and the Map and Reduce classes are in the same package
> ?
>
> Thank You
>
> Abhishek Agrawal
>
> SUNY- Buffalo
> (716-435-7122)
>
> On Sun 11/29/09 10:39 AM , aa...@buffalo.edu sent:
> > Hi,
> > I don't set job.setJarByClass(Map.class). But my main class, the map class
> > and the reduce class are all in the same package. Does this make any
> > difference, or do I still have to call it?
> >
> > Thank You
> >
> > Abhishek Agrawal
> >
> > SUNY- Buffalo
> > (716-435-7122)
> >
> > On Fri 11/27/09  1:42 PM , Aaron Kimball aa...@cloudera.com sent:
> > > When you set up the Job object, do you call
> > > job.setJarByClass(Map.class)? That will tell Hadoop which jar file to
> > > ship with the job and to use for classloading in your code.
> > >
> > > - Aaron
> > >
> > > On Thu, Nov 26, 2009 at 11:56 PM,  wrote:
> > > Hi,
> > >   I am running the job from command line. The job runs fine in the
> > > local mode but something happens when I try to run the job in the
> > > distributed mode.
> > >
> > > Abhishek Agrawal
> > > SUNY- Buffalo
> > > (716-435-7122)
> > >
> > > On Fri 11/27/09  2:31 AM , Jeff Zhang  sent:
> > > Do you run the map reduce job in command line or IDE? In map reduce
> > > mode, you should put the jar containing the map and reduce class in
> > > your classpath.
> > >
> > > Jeff Zhang
> > >
> > > On Fri, Nov 27, 2009 at 2:19 PM,  wrote:
> > > Hello Everybody,
> > >   I have a doubt in Hadoop and was wondering if anybody has faced a
> > > similar problem. I have a package called test. Inside that I have
> > > classes called A.java, Map.java, Reduce.java. In A.java I have the
> > > main method where I am trying to initialize the jobConf object. I have
> > > written jobConf.setMapperClass(Map.class) and similarly for the reduce
> > > class as well. The code works correctly when I run the code locally
> > > via jobConf.set("mapred.job.tracker","local") but I get an exception
> > > when I try to run this code on my cluster. The stack trace of the
> > > exception is as under. I cannot understand the problem. Any help would
> > > be appreciated.
> > >
> > > java.lang.RuntimeException: java.lang.RuntimeException:
> > > java.lang.ClassNotFoundException: test.Map
> > >     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:752)
> > >     at org.apache.hadoop.mapred.JobConf.getMapperClass(JobConf.java:690)
> > >     at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
> > >     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
> > >     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
> > >     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
> > >     at org.apache.hadoop.mapred.Child.main(Child.java:158)
> > > Caused by: java.lang.RuntimeException:
> > > java.lang.ClassNotFoundException: Markowitz.covarMatrixMap
> > >     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:720)
> > >     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:744)
> > >     ... 6 more
> > > Caused by: java.lang.ClassNotFoundException: test.Map
> > >     at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
> > >     at java.security.AccessController.doPrivileged(Native Method)
> > >     at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
> > >     at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> > >     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
> > >     at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
> > >     at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
> > >     at java.lang.Class.forName0(Native Method)
> > >     at java.lang.Class.forName(Class.java:247)
> > >     at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:673)
> > >     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:718)
> > >     ... 7 more
> > >
> > > Thank You
> > > > Abhishek Agrawal
> > > > SUNY- Buffalo
> > > > (716-435-7122)
> > > >
> > > >
> > >
> > >
> >
> >
> >
> >
> >
>
>


Re: Processing 10MB files in Hadoop

2009-11-27 Thread Aaron Kimball
By default you get at least one task per file; if any file is bigger than a
block, then that file is broken up into N tasks where each is one block
long. Not sure what you mean by "properly calculate" -- as long as you have
more tasks than you have cores, then you'll definitely have work for every
core to do; having more tasks with high granularity will also let nodes that
get "small" tasks to complete many of them while other cores are stuck with
the "heavier" tasks.

If you call setNumMapTasks() with a higher number of tasks than the
InputFormat creates (via the algorithm above), then it should create
additional tasks by dividing files up into smaller chunks (which may be
sub-block-sized).

As for where you should run your computation.. I don't know that the "map"
and "reduce" phases are really "optimized" for computation in any particular
way. It's just a data motion thing. (At the end of the day, it's your code
doing the processing on either side of the fence, which should dominate the
execution time.) If you use an identity mapper with a pseudo-random key to
spray the data into a bunch of reduce partitions, then you'll get a bunch of
reducers each working on a hopefully-evenly-sized slice of the data. So the
map tasks will quickly read from the original source data and forward the
workload along to the reducers which do the actual heavy lifting. The cost
of this approach is that you have to pay for the time taken to transfer the
data from the mapper nodes to the reducer nodes and sort by key when it gets
there. If you're only working with 600 MB of data, this is probably
negligible. The advantages of doing your computation in the reducers is

1) You can directly control the number of reducer tasks and set this equal
to the number of cores in your cluster.
2) You can tune your partitioning algorithm such that all reducers get
roughly equal workload assignments, if there appears to be some sort of skew
in the dataset.

The tradeoff is that you have to ship all the data to the reducers before
computation starts, which sacrifices data locality and involves an
"intermediate" data set of the same size as the input data set. If this is
in the range of hundreds of GB or north, then this can be very
time-consuming -- so it doesn't scale terribly well. Of course, by the time
you've got several hundred GB of data to work with, your current workload
imbalance issues should be moot anyway.

- Aaron


On Fri, Nov 27, 2009 at 4:33 PM, CubicDesign  wrote:

>
>
> Aaron Kimball wrote:
>
>> (Note: this is a tasktracker setting, not a job setting. you'll need to
>> set this on every
>> node, then restart the mapreduce cluster to take effect.)
>>
>>
> Ok. And here is my mistake. I set this to 16 only on the main node not also
> on data nodes. Thanks a lot!!
>
>  Of course, you need to have enough RAM to make sure that all these tasks
>> can
>> run concurrently without swapping.
>>
> No problem!
>
>
>  If your individual records require around a minute each to process as you
>> claimed earlier, you're
>> nowhere near in danger of hitting that particular performance bottleneck.
>>
>>
>>
> I was thinking that if I am under the recommended value of 64MB, Hadoop
> cannot properly calculate the number of tasks.
>


Re: Processing 10MB files in Hadoop

2009-11-27 Thread Aaron Kimball
More importantly: have you told Hadoop to use all your cores?

What is mapred.tasktracker.map.tasks.maximum set to? This defaults to 2. If
you've got 16 cores/node, you should set this to at least 15--16 so that all
your cores are being used. You may need to set this higher, like 20, to
ensure that cores aren't being starved. Measure with ganglia or top to make
sure your CPU utilization is up to where you're satisfied. (Note: this is a
tasktracker setting, not a job setting. you'll need to set this on every
node, then restart the mapreduce cluster to take effect.)
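
For example, in each node's mapred-site.xml (or hadoop-site.xml on older
releases) -- the value 16 is just an illustration for a 16-core box:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>16</value>
  </property>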

Of course, you need to have enough RAM to make sure that all these tasks can
run concurrently without swapping. Swapping will destroy your performance.
Then again, if you bought 16-way machines, presumably you didn't cheap out
in that department :)

100 tasks is not an absurd number. For large data sets (e.g., TB scale), I
have seen several tens of thousands of tasks.

In general, yes, running many tasks over small files is not a good fit for
Hadoop, but 100 is not "many small files" -- you might see some sort of
speed up by coalescing multiple files into a single task, but when you hear
problems with processing many small files, folks are frequently referring to
something like 10,000 files where each file is only a few MB, and the actual
processing per record is extremely cheap. In cases like this, task startup
times severely dominate actual computation time. If your individual records
require around a minute each to process as you claimed earlier, you're
nowhere near in danger of hitting that particular performance bottleneck.

- Aaron


On Thu, Nov 26, 2009 at 12:23 PM, CubicDesign  wrote:

>
>
>  Are the record processing steps bound by a local machine resource - cpu,
>> disk io or other?
>>
>>
> Some disk I/O. Not so much compared with the CPU. Basically it is a CPU
> bound. This is why each machine has 16 cores.
>
>  What I often do when I have lots of small files to handle is use the
>> NlineInputFormat,
>>
> Each file contains a complete/independent set of records. I cannot mix the
> data resulted from processing two different files.
>
>
> -
> Ok. I think I need to re-explain my problem :)
> While running jobs on these small files, the computation time was almost 5
> times longer than expected. It looks like the job was affected by the number
> of map task that I have (100). I don't know which are the best parameters in
> my case (10MB files).
>
> I have zero reduce tasks.
>
>
>


Re: Good idea to run NameNode and JobTracker on same machine?

2009-11-27 Thread Aaron Kimball
The real kicker is going to be memory consumption of one or both of these
services. The NN in particular uses a large amount of RAM to store the
filesystem image. I think that those who are suggesting a breakeven point of
<= 10 nodes are lowballing. In practice, unless your machines are really
thin on the RAM (e.g., 2--4 GB), I haven't seen any cases where these
services need to be separated below the 20-node mark; I've also seen several
clusters of 40 nodes running fine with these services colocated. It depends
on how many files are in HDFS and how frequently you're submitting a lot of
concurrent jobs to MapReduce.

If you're setting up a production environment that you plan to expand,
however, as a best practice you should configure the master node to have two
hostnames (e.g., "nn" and "jt") so that you can have separate hostnames in
fs.default.name and mapred.job.tracker; when the day comes that these
services are placed on different nodes, you'll then be able to just move one
of the hostnames over and not need to reconfigure all 20--40 other nodes.

- Aaron

On Thu, Nov 26, 2009 at 8:27 PM, Srigurunath Chakravarthi <
srig...@yahoo-inc.com> wrote:

> Raymond,
> Load wise, it should be very safe to run both JT and NN on a single node
> for small clusters (< 40 Task Trackers and/or Data Nodes). They don't use
> much CPU as such.
>
>  This may even work for larger clusters depending on the type of hardware
> you have and the Hadoop job mix. We usually observe < 5% CPU load with ~80
> DNs/TTs on an 8-core Intel processor based box with 16GB RAM.
>
>  It is best that you observe CPU & mem load on the JT+NN node to take a
> call on whether to separate them. iostat, top or sar should tell you.
>
> Regards,
> Sriguru
>
> >-Original Message-
> >From: John Martyniak [mailto:j...@beforedawnsolutions.com]
> >Sent: Friday, November 27, 2009 3:06 AM
> >To: common-user@hadoop.apache.org
> >Cc: 
> >Subject: Re: Good idea to run NameNode and JobTracker on same machine?
> >
> >I have a cluster of 4 machines plus one machine to run nn & jt.  I
> >have heard that 5 or 6 is the magic #.  I will see when I add the next
> >batch of machines.
> >
> >And it seems to running fine.
> >
> >-Jogn
> >
> >On Nov 26, 2009, at 11:38 AM, Yongqiang He 
> >wrote:
> >
> >> I think it is definitely not a good idea to combine these two in
> >> production
> >> environment.
> >>
> >> Thanks
> >> Yongqiang
> >> On 11/26/09 6:26 AM, "Raymond Jennings III" 
> >> wrote:
> >>
> >>> Do people normally combine these two processes onto one machine?
> >>> Currently I
> >>> have them on separate machines but I am wondering they use that
> >>> much CPU
> >>> processing time and maybe I should combine them and create another
> >>> DataNode.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
>


Re: part-00000.deflate as output

2009-11-27 Thread Aaron Kimball
You are always free to run with compression disabled. But in many production
situations, space or performance concerns dictate that all data sets are
stored compressed, so I think Tim was assuming that you might be operating
in such an environment -- in which case, you'd only need things to appear in
plaintext if a human operator is inspecting the output for debugging.

- Aaron

On Thu, Nov 26, 2009 at 4:59 PM, Mark Kerzner  wrote:

> It worked!
>
> But why is it "for testing"? I only have one job, so I need my output as
> text; can I use this fix all the time?
>
> Thank you,
> Mark
>
> On Thu, Nov 26, 2009 at 1:10 AM, Tim Kiefer  wrote:
>
> > For testing purposes you can also try to disable the compression:
> >
> > conf.setBoolean("mapred.output.compress", false);
> >
> > Then you can look at the output.
> >
> > - tim
> >
> >
> > Amogh Vasekar wrote:
> >
> >> Hi,
> >> ".deflate" is the default compression codec used when parameter to
> >> generate compressed output is true ( mapred.output.compress ).
> >> You may set the codec to be used via mapred.output.compression.codec,
> some
> >> commonly used are available in hadoop.io.compress package...
> >>
> >> Amogh
> >>
> >>
> >> On 11/26/09 11:03 AM, "Mark Kerzner"  wrote:
> >>
> >> Hi,
> >>
> >> I get this part-00000.deflate instead of part-00000.
> >>
> >> How do I get rid of the deflate option?
> >>
> >> Thank you,
> >> Mark
> >>
> >>
> >>
> >>
> >
>


Re: Re: Doubt in Hadoop

2009-11-27 Thread Aaron Kimball
When you set up the Job object, do you call job.setJarByClass(Map.class)?
That will tell Hadoop which jar file to ship with the job and to use for
classloading in your code.

- Aaron


On Thu, Nov 26, 2009 at 11:56 PM,  wrote:

> Hi,
>   I am running the job from command line. The job runs fine in the local
> mode
> but something happens when I try to run the job in the distributed mode.
>
>
> Abhishek Agrawal
>
> SUNY- Buffalo
> (716-435-7122)
>
> On Fri 11/27/09  2:31 AM , Jeff Zhang zjf...@gmail.com sent:
> > Do you run the map reduce job in command line or IDE?  in map reduce
> > mode, you should put the jar containing the map and reduce class in
> > your classpath
> > Jeff Zhang
> > On Fri, Nov 27, 2009 at 2:19 PM,   wrote:
> > Hello Everybody,
> >I have a doubt in Hadoop and was wondering if
> > anybody has faced a
> > similar problem. I have a package called test. Inside that I have
> > classes called
> > A.java, Map.java, Reduce.java. In A.java I have the main method
> > where I am trying
> > to initialize the jobConf object. I have written
> > jobConf.setMapperClass(Map.class) and similarly for the reduce class
> > as well. The
> > code works correctly when I run the code locally via
> > jobConf.set("mapred.job.tracker","local") but I get an exception
> > when I try to
> > run this code on my cluster. The stack trace of the exception is as
> > under. I
> > cannot understand the problem. Any help would be appreciated.
> > java.lang.RuntimeException: java.lang.RuntimeException:
> > java.lang.ClassNotFoundException: test.Map
> >at
> > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:752)
> >at
> > org.apache.hadoop.mapred.JobConf.getMapperClass(JobConf.java:690)
> >at
> > org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
> >at
> > org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
> >at
> >
> >
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
> >at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
> >at org.apache.hadoop.mapred.Child.main(Child.java:158)
> > Caused by: java.lang.RuntimeException:
> > java.lang.ClassNotFoundException:
> > Markowitz.covarMatrixMap
> >at
> > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:720)
> >at
> > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:744)
> >... 6 more
> > Caused by: java.lang.ClassNotFoundException: test.Map
> >at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
> >at java.security.AccessController.doPrivileged(Native
> > Method)
> >at
> > java.net.URLClassLoader.findClass(URLClassLoader.java:188)
> >at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> >at
> > sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
> >at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
> >at
> > java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
> >at java.lang.Class.forName0(Native Method)
> >at java.lang.Class.forName(Class.java:247)
> >at
> >
> >
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:673)
> >at
> > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:718)
> >... 7 more
> > Thank You
> > Abhishek Agrawal
> > SUNY- Buffalo
> > (716-435-7122)
> >
> >
>
>


Re: RE: please help in setting hadoop

2009-11-27 Thread Aaron Kimball
You've set hadoop.tmp.dir to /home/hadoop/hadoop-${user.name}.

This means that on every node, you're going to need a directory named (e.g.)
/home/hadoop/hadoop-root/, since it seems as though you're running things as
root (in general, not a good policy; but ok if you're on EC2 or something
like that).

mapred.local.dir defaults to ${hadoop.tmp.dir}/mapred/local. You've
confirmed that this exists on the machine named 'master' -- what about on
slave?

Then, what are the permissions of /home/hadoop/ on the slave node? Whichever
user started the Hadoop daemons (probably either 'root' or 'hadoop') will
need the ability to mkdir /home/hadoop/hadoop-root underneath of
/home/hadoop. If that directory doesn't exist, or is chown'd to someone
else, this will probably be the result.

- Aaron


On Thu, Nov 26, 2009 at 10:22 PM,  wrote:

> Hi,
>   There should be a folder called as logs in $HADOOP_HOME. Also try going
> through
>
> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
> .
>
>
> This is a pretty good tutorial
>
> Abhishek Agrawal
>
> SUNY- Buffalo
> (716-435-7122)
>
> On Fri 11/27/09  1:18 AM , "Krishna Kumar" krishna.ku...@nechclst.in sent:
> > I have tried, but didn't get any success. Btw, can you please tell me the
> > exact path of the log file I have to refer to.
> >
> >
> > Thanks and Best Regards,
> >
> > Krishna Kumar
> >
> > Senior Storage Engineer
> >
> > Why do we have to die? If we had to die, and everything is gone after
> that,
> > then nothing else matters on this earth - everything is temporary, at
> least
> > relative to me.
> >
> >
> >
> >
> > -Original Message-
> >
> > From: aa...@buffalo.edu [aa...@buffa
> > lo.edu]
> > Sent: Friday, November 27, 2009 10:56 AM
> >
> > To: common-user@hadoop.apache.org
> > Subject: Re: please help in setting hadoop
> >
> >
> >
> > Hi,
> >
> > Just a thought, but you do not need to setup the temp directory in
> >
> > conf/hadoop-site.xml especially if you are running basic examples. Give
> > that a
> > shot, maybe it will work out. Otherwise see if you can find additional
> info
> > in
> > the LOGS
> >
> >
> >
> > Thank You
> >
> >
> >
> > Abhishek Agrawal
> >
> >
> >
> > SUNY- Buffalo
> >
> > (716-435-7122)
> >
> >
> >
> > On Fri 11/27/09 12:20 AM , "Krishna Kumar" kri
> > shna.ku...@nechclst.in sent:
> > > Dear All,
> >
> > > Can anybody please help me in getting out from
> > these error messages:
> > > [ hadoop]# hadoop jar
> >
> > >
> > /usr/lib/hadoop/hadoop-0.18.3-14.cloudera.CH0_3-examples.jar
> > > wordcount
> >
> > > test test-op
> >
> > >
> >
> > > 09/11/26 17:15:45 INFO mapred.FileInputFormat:
> > Total input paths to
> > > process : 4
> >
> > >
> >
> > > 09/11/26 17:15:45 INFO mapred.FileInputFormat:
> > Total input paths to
> > > process : 4
> >
> > >
> >
> > > org.apache.hadoop.ipc.RemoteException:
> > java.io.IOException: No valid
> > > local directories in property: mapred.local.dir
> >
> > >
> >
> > > at
> >
> > >
> > org.apache.hadoop.conf.Configuration.getLocalPath(Configuration.java:730
> > > )
> >
> > >
> >
> > > at
> >
> > >
> > org.apache.hadoop.mapred.JobConf.getLocalPath(JobConf.java:222)
> > >
> >
> > > at
> >
> > >
> > org.apache.hadoop.mapred.JobInProgress.(JobInProgress.java:194)
> > >
> >
> > > at
> >
> > >
> > org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1557)
> > >
> >
> > > at
> > sun.reflect.NativeMethodAccessorImpl.invoke0(Native
> > > Method)
> >
> > >
> >
> > > at
> >
> > >
> > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.jav
> > > a:39)
> >
> > >
> >
> > > at
> >
> > >
> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessor
> > > Impl.java:25)
> >
> > >
> >
> > > at
> > java.lang.reflect.Method.invoke(Method.java:585)
> > >
> >
> > > at
> > org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
> > >
> >
> > > at
> > org.apache.hadoop.ipc.Server$Handler.run(Server.java:890)
> > > I am running the hadoop cluster as root user on two server nodes: master
> > > and slave.  My hadoop-site.xml file format is as follows:
> > >
> > >   <property>
> > >     <name>fs.default.name</name>
> > >     <value>hdfs://master:54310</value>
> > >   </property>
> > >   <property>
> > >     <name>dfs.permissions</name>
> > >     <value>false</value>
> > >   </property>
> > >   <property>
> > >     <name>dfs.name.dir</name>
> > >     <value>/home/hadoop/dfs/name</value>
> > >   </property>
> >
> > > Further the o/p of ls command is as follows:
> >
> > >
> >
> > > [ hadoop]# ls -l /home/hadoop/hadoop-root/
> >
> > >
> >
> > > total 8
> >
> > >
> >
> > > drwxr-xr-x 4 root root 4096 Nov 26 16:48 dfs
> >
> > >
> >
> > > drwxr-xr-x 3 root root 4096 Nov 26 16:49 mapred
> >
> > >
> >
> > > [ hadoop]#
> >
> > >
> >
> > > [ hadoop]#
> >
> > >
> >
> > > [ hadoop]# ls -l
> > /home/hadoop/hadoop-root/mapred/
> > >
> >
> > > total 4
> >
> > >
> >
> > > drwxr-xr-x 2 root root 4096 Nov 26 16:49 local
> >
> > >
> >
> > > [ hadoop]#
> >
> > >
> >
> > > [ hadoop]# ls -l
> > /home/hadoop/hadoop-root/mapred/local/
> > >
> >
> > > total 0
> >
> > > Thanks and Best Regards,
> >
> > >
> >
> > > Krishna Kumar
> >
> > >
> >
> > 

Re: error setting up hdfs?

2009-11-10 Thread Aaron Kimball
You don't "need" to specify a path. If you don't specify a path argument for
ls, then it uses your home directory in HDFS ("/user/<username>").
When you first started HDFS, /user/hadoop didn't exist, so 'hadoop fs -ls'
--> 'hadoop fs -ls /user/hadoop' --> directory not found. When you mkdir'd
'lol', you were actually effectively doing "mkdir -p /user/hadoop/lol", so
then it created your home directory underneath of that.

- Aaron

On Tue, Nov 10, 2009 at 1:30 PM, zenkalia  wrote:

> ok, things are working..  i must have forgotten what i did when first
> setting up hadoop...
>
> should these responses be considered inconsistent/an error?  hmm.
>
> hadoop dfs -ls
> error
> hadoop dfs -ls /
> irrelevant stuff about the path you're in
> hadoop dfs -mkdir lol
> works fine
> hadoop dfs -ls
> Found 1 items
> drwxr-xr-x   - hadoop supergroup  0 2009-11-10 05:28
> /user/hadoop/lol
>
> thanks stephen.
> -mike
>
> On Tue, Nov 10, 2009 at 1:19 PM, Stephen Watt  wrote:
>
> > You need to specify a path. Try  "bin/hadoop dfs -ls / "
> >
> > Steve Watt
> >
> >
> >
> > From:
> > zenkalia 
> > To:
> > core-u...@hadoop.apache.org
> > Date:
> > 11/10/2009 03:04 PM
> > Subject:
> > error setting up hdfs?
> >
> >
> >
> > had...@hadoop1:/usr/local/hadoop$ bin/hadoop dfs -ls
> > ls: Cannot access .: No such file or directory.
> >
> > anyone else get this one?  i started changing settings on my box to get
> > all
> > of my cores working, but immediately hit this error.  since then i
> started
> > from scratch and have hit this error again.  what am i missing?
> >
> >
> >
>


Re: How to build and deploy Hadoop 0.21 ?

2009-11-08 Thread Aaron Kimball
On Thu, Nov 5, 2009 at 2:34 AM, Andrei Dragomir  wrote:

> Hello everyone.
> We ran into a bunch of issues with building and deploying hadoop 0.21.
> It would be great to get some answers about how things should work, so
> we can try to fix them.
>
> 1. When checking out the repositories, each of them can be built by
> itself perfectly. BUT, if you look in hdfs it has mapreduce libraries,
> and in mapreduce it has hdfs libraries. That's kind of a cross-
> reference between projects.
>Q: Is this dependence necessary ? Can we get rid of it ?
>

Those are build-time dependencies. Ideally you'll ignore them post-build.


>Q: if it's necessary, how does one build the jars with the latest
> version of the source code ? how are the jars in the scm repository
> created  (hadoop-hdfs/lib/hadoop-mapred-0.21-dev.jar) as long as there
> is a cross-reference ?
> 2. There are issues with the jar files and the webapps (dfshealth.jsp,
> etc). Right now, the only way to have a hadoop functioning system is
> to: build hdfs and mapreduce; copy everything from hdfs/build and
> mapreduce/build to common/build.
>

Yup.



>Q: Is there a better way of doing this ? What needs to be fixed to
> have the webapps in the jar files (like on 0.20). Are there JIRA
> issues logged on this ?
>
>
I have created a Makefile and some associated scripts that will build
everything and squash it together for you; see
https://issues.apache.org/jira/browse/HADOOP-6342

There is also a longer-term effort to use Maven to coordinate the three
subprojects, and use a local repository for inter-project development on a
single machine; see https://issues.apache.org/jira/browse/HADOOP-5107 for
progress there.



> We would really appreciate some answers at least related to where
> hadoop is going with this build step, so we can help with patches /
> fixes.
>
> Thank you,
>   Andrei Dragomirt
>


Re: Can I install two different version of hadoop in the same cluster ?

2009-10-29 Thread Aaron Kimball
Also hadoop.tmp.dir and mapred.local.dir in your xml configuration, and the
environment variables HADOOP_LOG_DIR and HADOOP_PID_DIR in hadoop-env.sh.

- Aaron

On Thu, Oct 29, 2009 at 10:44 PM, Jeff Zhang  wrote:

> Hi all,
>
> I have installed hadoop 0.18.3 on my own cluster with 5 machines, now I
> want
> to install hadoop 0.20, but I do not run to uninstall the hadoop 0.18.3.
>
> So what things should I modify to eliminate conflict with hadoop 0.18.3 ? I
> think currently I should modify the data dir and name dir and the ports,
> what else should I take care ?
>
>
> Thank you
>
> Jeff zhang
>


Re: Which FileInputFormat to use for fixed length records?

2009-10-28 Thread Aaron Kimball
I think these would be good to add to mapreduce in the
{{org.apache.hadoop.mapreduce.lib.input}} package. Please file a JIRA and
apply a patch!
- Aaron

On Wed, Oct 28, 2009 at 11:15 AM, yz5od2 wrote:

> Hi all,
> I am working on writing a FixedLengthInputFormat class and a corresponding
> FixedLengthRecordReader.
>
> Would the Hadoop commons project have interest in these? Basically these
> are for reading inputs of textual record data, where each record is a fixed
> length, (no carriage returns or separators etc)
>
> thanks
>
>
>
> On Oct 20, 2009, at 11:00 PM, Aaron Kimball wrote:
>
>  You'll need to write your own, I'm afraid. You should subclass
>> FileInputFormat and go from there. You may want to look at TextInputFormat
>> /
>> LineRecordReader for an example of how an IF/RR gets put together, but
>> there
>> isn't an existing fixed-len record reader.
>>
>> - Aaron
>>
>> On Tue, Oct 20, 2009 at 12:59 PM, yz5od2 > >wrote:
>>
>>  Hi,
>>> I have input files, that contain NO carriage returns/line feeds. Each
>>> record is a fixed length (i.e. 202 bytes).
>>>
>>> Which FileInputFormat should I be using? so that each call to my Mapper
>>> receives one K,V pair, where the KEY is null or something (I don't care)
>>> and
>>> the VALUE is the 202 byte record?
>>>
>>> thanks!
>>>
>>>
>


Re: How to give consecutive numbers to output records?

2009-10-27 Thread Aaron Kimball
There is no in-MapReduce mechanism for cross-task synchronization. You'll
need to use something like Zookeeper for this, or another external database.
Note that this will greatly complicate your life.

If I were you, I'd try to either redesign my pipeline elsewhere to eliminate
this need, or maybe get really clever. For example, do your numbers need to
be sequential, or just unique?

If the latter, then take the byte offset into the reducer's current output
file and combine that with the reducer id (e.g., an identifier of the form
<reducer id, byte offset>) to guarantee that they're all
building unique sequences. If the former... rethink your pipeline? :)
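
A sketch of the "unique rather than sequential" idea (the bit widths and the
property lookup are illustrative):

import org.apache.hadoop.mapred.JobConf;

// Builds IDs that are unique across reducers by packing this reduce task's
// partition number into the high bits and a per-task counter into the low bits.
public class UniqueIdGenerator {
  private final long partition;
  private long counter = 0;

  public UniqueIdGenerator(JobConf job) {
    this.partition = job.getInt("mapred.task.partition", 0);
  }

  public long next() {
    return (partition << 40) | counter++;   // unique while counter stays below 2^40
  }
}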

- Aaron

On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner  wrote:

> Hi,
>
> I need to number all output records consecutively, like, 1,2,3...
>
> This is no problem with one reducer, making recordId an instance variable
> in
> the Reducer class, and setting conf.setNumReduceTasks(1)
>
> However, it is an architectural decision forced by processing need, where
> the reducer becomes a bottleneck. Can I have a global variable for all
> reducers, which would give each the next consecutive recordId? In the
> database scenario, this would be the unique autokey. How to do it in
> MapReduce?
>
> Thank you
>


Re: Can I have multiple reducers?

2009-10-22 Thread Aaron Kimball
If you need another shuffle after your first reduce pass, then you need a
second MapReduce job to run after the first one. Just use an IdentityMapper.

This is a reasonably common situation.
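
A sketch of the second job's driver (old API; paths and class names are
placeholders):

JobConf second = new JobConf(MyDriver.class);
second.setJobName("second-reduce-pass");
FileInputFormat.setInputPaths(second, new Path("/output/of/first/job"));
FileOutputFormat.setOutputPath(second, new Path("/final/output"));
second.setMapperClass(org.apache.hadoop.mapred.lib.IdentityMapper.class);
second.setReducerClass(SecondPassReducer.class);   // the "reduce further" step
second.setOutputKeyClass(Text.class);
second.setOutputValueClass(Text.class);
JobClient.runJob(second);
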
- Aaron

On Thu, Oct 22, 2009 at 4:17 PM, Forhadoop  wrote:

>
> Hello,
>
> In my application I need to reduce the original reducer output keys
> further.
>
> I was reading about Chainreducer and Chainmappers but looks like it is for
> :
> one or more mapper -> reducer -> 0 or more mappers
>
> I need something like:
> one or more mapper -> reducer -> reducer
>
> Please help me figure out the best way to achieve it. Currently, the only
> option seems to be to write another map reduce application and run it
> separately after the first map-reduce application. In this second
> application, the mapper will be dummy and won't do anything. The reducer
> will further club the first run outputs.
>
> Any other comments such as this is not a good programming practice are
> welcome, so that I know I am in the wrong direction..
> --
> View this message in context:
> http://www.nabble.com/Can-I-have-multiple-reducers--tp26018722p26018722.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>


Re: openssh - can't achieve passphraseless ssh

2009-10-21 Thread Aaron Kimball
Another sneaky permissions requirement is that ~/.ssh/ itself must be mode
0750 or more strict.

- Aaron

On Wed, Oct 21, 2009 at 2:47 PM, Edward Capriolo wrote:

> On Wed, Oct 21, 2009 at 5:39 PM, Dennis DiMaria
>  wrote:
> > I just downloaded and installed hadoop ver 0.200.1 and cygwin 1.5.25-15
> > and installed them (Windows XP.) I'm having trouble with ssh. When I
> > enter "ssh localhost" I'm prompted for a password. I can enter it and I
> > can log in successfully. So I ran these two commands:
> >
> >
> >
> > $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
> > $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
> >
> >
> >
> > But I'm still prompted for a password. Did I miss something when
> > configuring ssh? The files created in .ssh look ok.
> >
> >
> >
> > Btw, I am able to run one of the example hadoop applications in
> > Standalone mode and it works.
> >
> >
> >
> > I'm following the instructions in:
> >
> >
> >
> > http://hadoop.apache.org/common/docs/current/quickstart.html#Local
> >
> >
> >
> > Thanks.
> >
> >
> >
> > -Dennis
> >
> >
> More than likely this is the permissions of the authorized keys file.
> Make sure:
> chmod 600 ~/.ssh/authorized_keys
> Make sure:
> file is owned by proper user
> Make sure:
> drwx--.ssh
> You can tune up the verbosity of your ssh server to troubleshoot this more.
>
> There are lots of ssh key tutorials out there. Good hunting.
>


Re: streaming data from HDFS outside of hadoop

2009-10-20 Thread Aaron Kimball
You shouldn't directly instantiate and intialize FileSystem implementations;
there's a factory method you should use.

Do instead:

private void initHadoop(String ip, int port) throws IOException {
  Configuration conf = new Configuration();
  String fsUri = "hdfs://" + ip + ":" + port;
  conf.set("fs.default.name", fsUri); // magic config string to indicate what FS to use
  mHDFS = FileSystem.get(conf);
}

Cheers,
- Aaron

On Tue, Oct 20, 2009 at 5:55 PM, Stephane Brossier <
stephanebross...@gmail.com> wrote:

> I am trying to stream data from HDFS on a workstation outside of hadoop.
> I have a small method to initialize the DistributedFileSystem and i pass
> the
> IP and port of the namenode, but that fails with the following stack.
>
> Note that:
> . I tried to telnet to that ip/port and the connection works well.
> . The namenode is working well, i can access it through my browser
> . Hadoop is up and running, i can run MR jobs.
>
> Am i missing something in the code below? What can be wrong?
>
> Thanks,
>
> S.
>
>
> -
>private void initHadoop(String ip, int port) {
>   Configuration mConf  = new Configuration();
>
>   URI mUri =   URI.create("hdfs://" + ip + ":" + port);
>   mHDFS = new DistributedFileSystem();
>
>   try {
>   mHDFS.initialize(mUri, mConf);
>   } catch (IOException ioe) {
>   ioe.printStackTrace();
>   log.error("Failed to initialize Hadoop (Namenode) " +
> ioe.getMessage());
>   }
>   log.info("Initialized HDFS");
>}
>
> -
>
> 2633 [main] DEBUG org.apache.hadoop.security.UserGroupInformation  - Unix
> Login:
> stephane,staff,com.apple.sharepoint.group.1,_lpadmin,_appserveradm,com.apple.sharepoint.group.2,_appserverusr,admin
> 2662 [main] DEBUG org.apache.hadoop.ipc.Client  - The ping interval
> is6ms.
> 2792 [main] DEBUG org.apache.hadoop.ipc.Client  - Connecting to /
> 10.15.38.76:50070
> 2889 [main] DEBUG org.apache.hadoop.ipc.Client  - IPC Client (47)
> connection to /10.15.38.76:50070 from stephane sending #0
> 2891 [IPC Client (47) connection to /10.15.38.76:50070 from stephane]
> DEBUG org.apache.hadoop.ipc.Client  - IPC Client (47) connection to /
> 10.15.38.76:50070 from stephane: starting, having connections 1
> 2906 [IPC Client (47) connection to /10.15.38.76:50070 from stephane]
> DEBUG org.apache.hadoop.ipc.Client  - closing ipc connection to /
> 10.15.38.76:50070: null
> java.io.EOFException
>at java.io.DataInputStream.readInt(DataInputStream.java:375)
>at
> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:493)
>at org.apache.hadoop.ipc.Client$Connection.run(Client.java:438)
> 2908 [IPC Client (47) connection to /10.15.38.76:50070 from stephane]
> DEBUG org.apache.hadoop.ipc.Client  - IPC Client (47) connection to /
> 10.15.38.76:50070 from stephane: closed
> 2908 [IPC Client (47) connection to /10.15.38.76:50070 from stephane]
> DEBUG org.apache.hadoop.ipc.Client  - IPC Client (47) connection to /
> 10.15.38.76:50070 from stephane: stopped, remaining connections 0
> java.io.IOException: Call to /10.15.38.76:50070 failed on local exception:
> null
>at org.apache.hadoop.ipc.Client.call(Client.java:699)
>at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>at $Proxy0.getProtocolVersion(Unknown Source)
>at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
>at
> org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:177)
>at
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:74)
>at com.ning.viking.tools.VisitReader.initHadoop(VisitReader.java:44)
>at com.ning.viking.tools.VisitReader.(VisitReader.java:33)
>at com.ning.viking.tools.VisitReader.main(VisitReader.java:127)
> Caused by: java.io.EOFException
>at java.io.DataInputStream.readInt(DataInputStream.java:375)
>at
> org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:493)
>at org.apache.hadoop.ipc.Client$Connection.run(Client.java:438)


Re: Which FileInputFormat to use for fixed length records?

2009-10-20 Thread Aaron Kimball
You'll need to write your own, I'm afraid. You should subclass
FileInputFormat and go from there. You may want to look at TextInputFormat /
LineRecordReader for an example of how an IF/RR gets put together, but there
isn't an existing fixed-len record reader.
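
To give a feel for the shape of it, here is a rough old-API sketch. The config
key name is made up, plain uncompressed input is assumed, and error handling is
minimal -- treat it as a starting point, not a finished implementation:

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class FixedLengthInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

  public static final String RECORD_LENGTH_KEY = "example.fixedlength.record.length";

  @Override
  public RecordReader<LongWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new FixedLengthRecordReader((FileSplit) split, job,
        job.getInt(RECORD_LENGTH_KEY, 202));
  }

  static class FixedLengthRecordReader implements RecordReader<LongWritable, BytesWritable> {
    private final FSDataInputStream in;
    private final long start;   // first record boundary at or after the split start
    private final long end;     // records starting before this offset belong to us
    private final int recordLen;
    private long pos;

    FixedLengthRecordReader(FileSplit split, JobConf job, int recordLen) throws IOException {
      this.recordLen = recordLen;
      Path file = split.getPath();
      FileSystem fs = file.getFileSystem(job);
      this.in = fs.open(file);
      this.start = ((split.getStart() + recordLen - 1) / recordLen) * recordLen;
      this.end = split.getStart() + split.getLength();
      this.pos = start;
    }

    public boolean next(LongWritable key, BytesWritable value) throws IOException {
      if (pos >= end) {
        return false;             // the next record belongs to the following split
      }
      byte[] buf = new byte[recordLen];
      in.readFully(pos, buf);     // positioned read; may cross the split boundary
      key.set(pos);
      value.set(buf, 0, recordLen);
      pos += recordLen;
      return true;
    }

    public LongWritable createKey() { return new LongWritable(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() { return pos; }
    public void close() throws IOException { in.close(); }
    public float getProgress() {
      return end == start ? 0.0f : Math.min(1.0f, (pos - start) / (float) (end - start));
    }
  }
}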

- Aaron

On Tue, Oct 20, 2009 at 12:59 PM, yz5od2 wrote:

> Hi,
> I have input files, that contain NO carriage returns/line feeds. Each
> record is a fixed length (i.e. 202 bytes).
>
> Which FileInputFormat should I be using? so that each call to my Mapper
> receives one K,V pair, where the KEY is null or something (I don't care) and
> the VALUE is the 202 byte record?
>
> thanks!
>


Re: How can I run such a mapreduce program?

2009-10-17 Thread Aaron Kimball
If you're working with the Cloudera distribution, you can install CDH1
(0.18.3) and CDH2 (0.20.1) side-by-side on your development machine.

They'll install to /usr/lib/hadoop-0.18 and /usr/lib/hadoop-0.20; use
/usr/bin/hadoop-0.18 and /usr/bin/hadoop-0.20 to execute, etc.

See http://archive.cloudera.com/ for installation instructions.

- Aaron

2009/10/12 Amandeep Khurana 

> Won't work
>
> On 10/12/09, Jeff Zhang  wrote:
> > I do not think you can run jar compiled with hadoop 0.20.1 on hadoop
> 0.18.3
> >
> > They are not compatible.
> >
> >
> >
> > 2009/10/13 杨卓荦 
> >
> >> The developer's machine is Hadoop 0.20.1, Jar is compiled on the
> >> developer's machine.
> >> The server is Hadoop 0.18.3-cloudera. How can I run my mapreduce program
> >> on
> >> the server?
> >>
> >>
> >>
> >>
> >
>
>
> --
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>


Re: Error in FileSystem.get()

2009-10-15 Thread Aaron Kimball
Bhupesh: If you use FileSystem.newInstance(), does that return the correct
object type? This sidesteps CACHE.
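
For reference, the difference looks like this (assuming your Hadoop build has
the newInstance() call; the classes are org.apache.hadoop.conf.Configuration
and org.apache.hadoop.fs.FileSystem):

Configuration conf = new Configuration();

// Goes through FileSystem's internal cache; may return a shared instance
// keyed on scheme/authority (and user):
FileSystem cached = FileSystem.get(conf);

// Builds a brand-new instance, bypassing the cache entirely:
FileSystem fresh = FileSystem.newInstance(conf);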
- A

On Thu, Oct 15, 2009 at 3:07 PM, Bhupesh Bansal wrote:

> This code is not map/reduce code and run only on single machine and
> Also each node prints the right value for "fs.default.name" so it is
> Reading the right configuration file too .. The issue looks like use of
> CACHE in filesystem and someplace my code is setting up a wrong value
> If that is possible.
>
> Best
> Bhupesh
>
>
>
>
> On 10/15/09 1:46 PM, "Ashutosh Chauhan" 
> wrote:
>
> > Each node reads its own conf files (mapred-site.xml, hdfs-site.xml etc.)
> > Make sure your configs are consistent on all nodes across entire cluster
> and
> > are pointing to correct fs.
> >
> > Hope it helps,
> > Ashutosh
> >
> > On Thu, Oct 15, 2009 at 16:36, Bhupesh Bansal 
> wrote:
> >
> >> Hey Folks,
> >>
> >> I am seeing a very weird problem in FileSystem.get(Configuration).
> >>
> >> I want to get a FileSystem given the configuration, so I am using
> >>
> >>  Configuration conf = new Configuration();
> >>
> >> _fs = FileSystem.get(conf);
> >>
> >>
> >> The problem is I am getting LocalFileSystem on some machines and
> >> Distributed
> >> on others. I am printing conf.get("fs.default.name") at all places and
> >> It returns the right HDFS value 'hdfs://dummy:9000'
> >>
> >> My expectation is looking at fs.default.name if it is hdfs:// it should
> >> give
> >> me a DistributedFileSystem always.
> >>
> >> Best
> >> Bhupesh
> >>
> >>
> >>
> >>
> >>
> >>
>
>


Re: Locality when placing Map tasks

2009-10-06 Thread Aaron Kimball
Map tasks are generated based on InputSplits. An InputSplit is a logical
description of the input that a single map task should process. The array of
InputSplit objects is created on the client by the InputFormat.
org.apache.hadoop.mapreduce.InputSplit has an abstract method:

  /**
   * Get the list of nodes by name where the data for the split would be
local.
   * The locations do not need to be serialized.
   * @return a new array of the node nodes.
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract String[] getLocations() throws IOException, InterruptedException;


So the InputFormat is responsible, when it creates its list of work items, for
hinting where each one should go. If you take a look at FileInputFormat, you
can see how it does this: it stats the files, determines the block locations
for each one, and uses those as node hints. Other InputFormats may ignore this
entirely, in which case there is no locality.
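
If you want to see exactly what location hints look like for your files, a
quick stand-alone probe is something like this (a sketch, not the actual
FileInputFormat source; it just prints what FileInputFormat would use):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocationHints {
  // Print the datanodes hosting each block of a file -- the same information
  // an InputFormat can hand back from InputSplit.getLocations().
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);
    FileSystem fs = path.getFileSystem(conf);
    FileStatus stat = fs.getFileStatus(path);
    BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
    for (BlockLocation block : blocks) {
      System.out.println(block.getOffset() + "+" + block.getLength() + " -> "
          + java.util.Arrays.toString(block.getHosts()));
    }
  }
}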

The scheduler itself then does its "best job" of lining up tasks to nodes,
but it's usually pretty naive. Basically, tasktrackers send heartbeats back
to the JT wherein they may request another task. The scheduler then responds
with a task. If there's a local task available, it'll send that one. If not,
it'll send a non-local task instead.

- Aaron

On Fri, Oct 2, 2009 at 12:24 PM, Esteban Molina-Estolano <
eesto...@cs.ucsc.edu> wrote:

> Hi,
>
> I'm running Hadoop 0.19.1 on 19 nodes. I've been benchmarking a Hadoop
> workload with 115 Map tasks, on two different distributed filesystems (KFS
> and PVFS); in some tests, I also have a write-intensive non-Hadoop job
> running in the background (an HPC checkpointing benchmark). I've found that
> Hadoop sometimes makes most of the Map tasks data-local, and sometimes makes
> none of the Map tasks data-local; this depends both on which filesystem I
> use, and on whether the background task is running. (I never run multiple
> Hadoop jobs concurrently in these tests.)
>
> I'd like to learn how the Hadoop scheduler places Map tasks, and how
> locality is taken into account, so I can figure out why this is happening.
> (I'm using the default FIFO scheduler.) Is there some documentation
> available that would explain this?
>
> Thanks!
>


Re: FileSystem Caching in Hadoop

2009-10-06 Thread Aaron Kimball
Edward,

Interesting concept. I imagine that building a "CachedInputFormat" on top of
something like memcached would be the most straightforward implementation. You
could store 64MB chunks in memcached and try to retrieve
them from there, falling back to the filesystem on failure. One obvious
potential drawback of this is that a memcached cluster might store those
blocks on different servers than the file chunks themselves, leading to an
increased number of network transfers during the mapping phase. I don't know
if it's possible to "pin" the objects in memcached to particular nodes;
you'd want to do this for mapper locality reasons.

I would say, though, that 1 GB out of 8 GB on a datanode is somewhat
ambitious. It's been my observation that people tend to write memory-hungry
mappers. If you've got 8 cores in a node, and 1 GB each have already gone to
the OS, the datanode, and the tasktracker, that leaves only 5 GB for task
processes. Running 6 or 8 map tasks concurrently can easily gobble that up.
On a 16 GB datanode with 8 cores, you might get that much wiggle room
though.

- Aaron


On Tue, Oct 6, 2009 at 8:16 AM, Edward Capriolo wrote:

> After looking at the HBaseRegionServer and its functionality, I began
> wondering if there is a more general use case for memory caching of
> HDFS blocks/files. In many use cases people wish to store data on
> Hadoop indefinitely, however the last day,last week, last month, data
> is probably the most actively used. For some Hadoop clusters the
> amount of raw new data could be less then the RAM memory in the
> cluster.
>
> Also some data will be used repeatedly, the same source data may be
> used to generate multiple result sets, and those results may be used
> as the input to other processes.
>
> I am thinking an answer could be to dedicate an amount of physical
> memory on each DataNode, or on several dedicated node to a distributed
> memcache like layer. Managing this cache should be straight forward
> since hadoop blocks are pretty much static. (So say for a DataNode
> with 8 GB of memory dedicate 1GB to HadoopCacheServer.) If you had
> 1000 Nodes that cache would be quite large.
>
> Additionally we could create a new file system type cachedhdfs
> implemented as a facade, or possibly implement CachedInputFormat or
> CachedOutputFormat.
>
> I know that the underlying filesystems have cache, but I think Hadoop
> writing intermediate data is going to evict some of the data which
> "should be" semi-permanent.
>
> So has anyone looked into something like this? This was the closest
> thing I found.
>
> http://issues.apache.org/jira/browse/HADOOP-288
>
> My goal here is to keep recent data in memory so that tools like Hive
> can get a big boost on queries for new data.
>
> Does anyone have any ideas?
>


Re: Easiest way to pass dynamic variable to Map Class

2009-10-05 Thread Aaron Kimball
You can set these in the JobConf when you're creating the MapReduce job, and
then read them back in the configure() method of the Mapper class.
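
For example (old API; the property name "my.app.threshold" and the class names
are made up):

// In the driver:  conf.set("my.app.threshold", args[0]);
// In the mapper, read it back when the task starts up:
public static class MyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private String threshold;

  public void configure(JobConf job) {
    threshold = job.get("my.app.threshold", "some-default");
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // ...use 'threshold' here...
  }
}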

- Aaron

On Mon, Oct 5, 2009 at 4:50 PM, Pankil Doshi  wrote:

> Hello everyone,
>
> What will be easiest way to pass Dynamic value to map class??
> Dynamic value are arguments given at run time.
>
> Pankil
>


Re: Is it OK to run with no secondary namenode?

2009-10-05 Thread Aaron Kimball
Quite possible. :\
- A

On Thu, Oct 1, 2009 at 5:17 PM, Mayuran Yogarajah <
mayuran.yogara...@casalemedia.com> wrote:

> Aaron Kimball wrote:
>
>> If you want to run the 2NN on a different node than the NN, then you need
>> to
>> set dfs.http.address on the 2NN to point to the namenode's http server
>> address. See
>>
>> http://www.cloudera.com/blog/2009/02/10/multi-host-secondarynamenode-configuration/
>>
>> - Aaron
>>
>>
>>
>
> Uhh this wasn't obvious, I totally missed it.  So I'm guessing the 2NN
> hasn't been able
> to upload the merged image back to the NN?
>
> thanks,
> M
>


Re: Is it OK to run with no secondary namenode?

2009-10-01 Thread Aaron Kimball
If you want to run the 2NN on a different node than the NN, then you need to
set dfs.http.address on the 2NN to point to the namenode's http server
address. See
http://www.cloudera.com/blog/2009/02/10/multi-host-secondarynamenode-configuration/

- Aaron

On Mon, Sep 28, 2009 at 2:17 PM, Todd Lipcon  wrote:

> On Mon, Sep 28, 2009 at 11:10 AM, Mayuran Yogarajah <
> mayuran.yogara...@casalemedia.com> wrote:
>
> > Hey Todd,
> >
> >  I don't personally like to use the slaves/masters files for managing
> which
> >> daemons run on which nodes. But, if you'd like to, it looks like you
> >> should
> >> put it in the "masters" file, not the slaves file. Look at how
> >> start-dfs.sh
> >> works to understand how those files are used.
> >>
> >> -Todd
> >>
> >>
> >
> > DOH, I meant to say masters, not slaves =(
> > If I may ask, how are you managing the various daemons?
> >
> >
> Using Cloudera's distribution of Hadoop, you can simply use linux init
> scripts to manage which daemons run on which nodes. For a large cluster,
> you'll want to use something like kickstart, cfengine, puppet, etc, to
> manage your configuration, and that includes which init scripts are
> enabled.
>
> -Todd
>


Re: Prepare input data for Hadoop

2009-09-22 Thread Aaron Kimball
Use an external database (e.g., mysql) or some other transactional
bookkeeping system to record the state of all your datasets (STAGING,
UPLOADED, PROCESSED)

- Aaron


On Thu, Sep 17, 2009 at 7:17 PM, Huy Phan  wrote:

> Hi all,
>
> I have a question about strategy to prepare data for Hadoop to run their
> MapReduce job, we have to (somehow) copy input files from our local
> filesystem to HDFS, how can we make sure that one input file is not
> processed twice in different executions of the same MapReduce job (let's say
> my MapReduce job runs once each 30 mins) ?
> I don't want to delete my input files after finishing the MR job because I
> may want to re-use it later.
>
>
>
>


Re: SequenceFileAsBinaryOutputFormat for M/R

2009-09-22 Thread Aaron Kimball
In the 0.20 branch, the common best-practice is to use the old API and
ignore deprecation warnings. When you get to 0.22, you'll need to convert
all your code to use the new API.

There may be a new-API equivalent in org.apache.hadoop.mapreduce.lib.output
that you could use, if you convert your Mapper/Reducer instances to new-API
equivalents (org.apache.hadoop.mapreduce.Mapper/Reducer, as opposed to
old-API org.apache.hadoop.mapred.Mapper/Reducer).
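
Sticking with the old API looks roughly like this (a sketch -- just suppress or
ignore the deprecation warnings on JobConf; MyJob and the output path are
placeholders):

// org.apache.hadoop.mapred.* -- the "old" API
JobConf conf = new JobConf(MyJob.class);
conf.setOutputFormat(SequenceFileAsBinaryOutputFormat.class);
FileOutputFormat.setOutputPath(conf, new Path("/user/bill/output"));
conf.setOutputKeyClass(BytesWritable.class);    // reducer emits raw bytes
conf.setOutputValueClass(BytesWritable.class);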

- Aaron


On Mon, Sep 21, 2009 at 12:09 PM, Bill Habermaas  wrote:

> Referring to Hadoop 0.20.1 API.
>
>
>
> SequenceFileAsBinaryOutputFormat requires JobConf but JobConf is
> deprecated.
>
>
>
>
> Is there another OutputFormat I should be using ?
>
>
>
> Bill
>
>
>
>
>
>


Re: RandomAccessFile with HDFS

2009-09-22 Thread Aaron Kimball
Or maybe more pessimistically, the second "stable" append implementation.

It's not like HADOOP-1700 wasn't intended to work. It was just found not to
after the fact. Hopefully this reimplementation will succeed. If you're
running a cluster that contains mission-critical data that cannot tolerate
corruption or loss, you shouldn't jump on the new-feature bandwagon until
it's had time to prove itself in the wild.

But yes, we hope that appends will really-truly work in 0.21.
Experimental/R&D projects should be able to plan on having a working append
function in 0.21.

- Aaron

On Sun, Sep 20, 2009 at 3:58 PM, Stas Oskin  wrote:

> Hi.
>
> Just to understand the road-map, 0.21 will be the first stable "append"
> implementation?
>
> Regards.
>
> 2009/9/20 Owen O'Malley 
>
> >
> > On Sep 13, 2009, at 3:08 AM, Stas Oskin wrote:
> >
> >  Hi.
> >>
> >> Any idea when the "append" functionality is expected?
> >>
> >
> > A working append is a blocker on HDFS 0.21.0.
> >
> > The code for append is expected to be complete in a few weeks. Meanwhile,
> > the rest of Common, HDFS, and MapReduce have feature-frozen and need to
> be
> > stabilized and all of the critical bugs fixed. I'd expect the first
> releases
> > of 0.21.0 in early November.
> >
> > -- Owen
> >
>


Re: copy data (distcp) from local cluster to the EC2 cluster

2009-09-10 Thread Aaron Kimball
That's 99% correct. If you want/need to run different versions of HDFS on
the two different clusters, then you can't use the hdfs:// protocol to access
both of them in the same command. In this case, use *hftp*://bla/ for the
source fs and hdfs://bla2/ for the dest fs -- hftp is read-only, so run the
distcp from the destination cluster.

- Aaron

On Tue, Sep 8, 2009 at 12:45 AM, Anthony Urso wrote:

> Yes, just run something along the lines of:
>
> hadoop distcp hdfs://local-namenode/path hdfs://ec2-namenode/path
>
> on the job tracker of a MapReduce cluster.
>
> Make sure that your EC2 security group setup allows HDFS access from
> the local HDFS cluster and wherever you run MapReduce job from.  Also,
> I believe both HDFS setups still need to be running on the same
> version of Hadoop.
>
> More here:
>
> http://hadoop.apache.org/common/docs/r0.20.0/distcp.html
>
> Cheers,
> Anthony
>
> On Mon, Sep 7, 2009 at 10:37 PM, stchu wrote:
> > Hi,
> >
> > Does Distcp support to copy data from my local cluster (1 master+3
> slaves,
> > fs=hdfs) to the EC2 cluster (1master+2slaves, fs=hdfs)?
> > If it's supported, how can I do? I appreciate for any guide or
> suggestion.
> >
> > stchu
> >
>


Re: Testing Hadoop job

2009-08-28 Thread Aaron Kimball
Hi Nikhil,

MRUnit now supports the 0.20 API as of
https://issues.apache.org/jira/browse/MAPREDUCE-800. There are no plans to
involve partitioners in MRUnit; it is for mappers and reducers only, and not
for full jobs involving input/output formats, partitioners, etc. Use the
LocalJobRunner for that.

See www.cloudera.com/hadoop-mrunit for more (slightly out-of-date)
information.
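
To give a flavor of the scope, a mapper test in MRUnit looks roughly like this
(inside a JUnit test method; WordCountMapper is a made-up old-API mapper, and
the exact driver API may differ slightly between MRUnit versions):

MapDriver<LongWritable, Text, Text, IntWritable> driver =
    new MapDriver<LongWritable, Text, Text, IntWritable>();
driver.withMapper(new WordCountMapper())
      .withInput(new LongWritable(0), new Text("hadoop hadoop"))
      .withOutput(new Text("hadoop"), new IntWritable(1))
      .withOutput(new Text("hadoop"), new IntWritable(1))
      .runTest();   // fails the test if the actual (K,V) output doesn't match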
- Aaron

On Fri, Aug 28, 2009 at 9:15 AM, Nikhil Sawant wrote:

> hi
> thanks Jason, for prompt reply i will go through "pro hadoop". already on
> my to-do list
>
> any idea abt the MRUnit??? has anyone used it?
> i think it is useful as it allows dummy map and reduce drivers which
> accepts (K,V) pair/s and checks the o/p (K,V) pair/s with the expected
> (K,V)..it gives gr8 debugging capabilities while implementing complex
> logics. (all "System.outs" can be seen on console itself!!)
> i have used the basic functionalities of MRUnit testing framework
> i would like to know the limitations (e.g. i found out that MRUnit does not
> check the partitioner logic) and its feasibility with hadoop 0.20...
> No proper documentation i found ! :(
>
> cheers
> nikhil
>
>
> Jason Venner wrote:
>
>> I put together a framework for the Pro Hadoop book that I use quite a bit,
>> and has some documentation in the book examples ;)
>> I haven't tried it with 0.20.0 however.
>>
>> The nicest thing that I did with the framework was provide a way to run a
>> persistent mini virtual cluster for running multiple tests on.
>>
>> On Wed, Aug 26, 2009 at 4:50 PM, Nikhil Sawant > >wrote:
>>
>>
>>
>>> hi
>>>
>>> can u guys suggest some hadoop unit testing framework apart from
>>> MRUnit???
>>> i have used MRUnit but i m not sure abt its feasibilty and support to
>>> hadoop 0.20.
>>> i could not find a proper documentation for MRUnit, is it available
>>> anywhere?
>>>
>>> --
>>> cheers
>>> nikhil
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>


Re: Help.

2009-08-25 Thread Aaron Kimball
Are you trying to serve blocks from a shared directory e.g. NFS?

The storageID for a node is recorded in a file named "VERSION" in
${dfs.data.dir}/current. If one node claims that the storage directory is
already locked, and another node is reporting the first node's storageID, it
makes me think that you have multiple datanodes attempting to use the same
shared directory for storage.

This can't be done in Hadoop. Each datanode assumes it has sole access to a
storage directory. Besides, it defeats the point of using multiple datanodes
to distribute disk and network I/O :) If you're using NFS, reconfigure all
the datanodes to store blocks in local directories only (by moving
dfs.data.dir) and try again.

The mention of the namesecondary directory in there also makes me think that
you're trying to start redundant copies of the secondarynamenode (e.g., by
listing the same node twice in the 'conf/masters' file) or that the
fs.checkpoint.dir is the same as the dfs.data.dir. This also isn't allowed
-- dfs.data.dir, dfs.name.dir, and fs.checkpoint.dir must all refer to
distinct physical locations.

- Aaron

On Fri, Aug 21, 2009 at 7:19 AM, Sujith Vellat  wrote:

>
>
> Sent from my iPhone
>
>
> On Aug 21, 2009, at 9:25 AM, Jason Venner  wrote:
>
>  It may be that the individual datanodes get different names for their ip
>> addresses than the namenode does.
>> It may also be that some subset of your namenode/datanodes do not have
>> write
>> access to the hdfs storage directories.
>>
>>
>> On Mon, Aug 17, 2009 at 10:05 PM, qiu tian 
>> wrote:
>>
>>  Hi everyone.
>>> I installed hadoop among three pcs. When I ran the command
>>> 'start-all.sh',
>>> I only could start the jobtracker and tasktrackers. I use 192.*.*.x as
>>> master and use 192.*.*.y and 192.*.*.z as slaves.
>>>
>>> The namenode log from the master 192.*.*.x is following like this:
>>>
>>> 2009-08-18 10:48:44,543 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
>>> NameSystem.registerDatanode: node 192.*.*.y:50010 is replaced by
>>> 192.*.*.x:50010 with the same storageID
>>> DS-1120429845-127.0.0.1-50010-1246697164684
>>> 2009-08-18 10:48:44,543 INFO org.apache.hadoop.net.NetworkTopology:
>>> Removing a node: /default-rack/192.*.*.y:50010
>>> 2009-08-18 10:48:44,543 INFO org.apache.hadoop.net.NetworkTopology:
>>> Adding
>>> a new node: /default-rack/192.*.*.x:50010
>>> 2009-08-18 10:48:45,932 FATAL org.apache.hadoop.hdfs.StateChange: BLOCK*
>>> NameSystem.getDatanode: Data node 192.*.*.z:50010 is attempting to report
>>> storage ID DS-1120429845-127.0.0.1-50010-1246697164684. Node
>>> 192.*.*.x:50010
>>> is expected to serve this storage.
>>> 2009-08-18 10:48:45,932 INFO org.apache.hadoop.ipc.Server: IPC Server
>>> handler 8 on 9000, call blockReport(DatanodeRegistration(192.*.*.z:50010,
>>> storageID=DS-1120429845-127.0.0.1-50010-1246697164684, infoPort=50075,
>>> ipcPort=50020), [...@1b8ebe3) from 192.*.*.z:33177: error:
>>> org.apache.hadoop.hdfs.protocol.UnregisteredDatanodeException: Data node
>>> 192.*.*.z:50010 is attempting to report storage ID
>>> DS-1120429845-127.0.0.1-50010-1246697164684. Node 192.*.*.x:50010 is
>>> expected to serve this storage.
>>> org.apache.hadoop.hdfs.protocol.UnregisteredDatanodeException: Data node
>>> 192.*.*.z:50010 is attempting to report storage ID
>>> DS-1120429845-127.0.0.1-50010-1246697164684. Node 192.*.*.x:50010 is
>>> expected to serve this storage.
>>>   at
>>>
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDatanode(FSNamesystem.java:3800)
>>>   at
>>>
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.processReport(FSNamesystem.java:2771)
>>>   at
>>>
>>> org.apache.hadoop.hdfs.server.namenode.NameNode.blockReport(NameNode.java:636)
>>>   at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
>>>   at
>>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
>>>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
>>> 2009-08-18 10:48:46,398 FATAL org.apache.hadoop.hdfs.StateChange: BLOCK*
>>> NameSystem.getDatanode: Data node 192.*.*.y:50010 is attempting to report
>>> storage ID DS-1120429845-127.0.0.1-50010-1246697164684. Node
>>> 192.*.*.x:50010
>>> is expected to serve this storage.
>>> 2009-08-18 10:48:46,398 INFO org.apache.hadoop.ipc.Server: IPC Server
>>> handler 0 on 9000, call
>>> blockReport(DatanodeRegistration(192.9.200.y:50010,
>>> storageID=DS-1120429845-127.0.0.1-50010-1246697164684, infoPort=50075,
>>> ipcPort=50020), [...@186b634) from 192.*.*.y:47367: error:
>>> org.apache.hadoop.hdfs.protocol.UnregisteredDatanodeException: Data node
>>> 192.*.*.y:50010 is attempting to report storage ID
>>> DS-1120429845-127.0.0.1-50010-1246697164684. Node 192.*.*.x:50010 is
>>> expected to serve this storage.
>>> org.apache.hadoop.hdfs.protocol.UnregisteredDatanodeException: Data 

Re: localhost:9000 and ip:9000 is not the same ?

2009-08-24 Thread Aaron Kimball
Jeff,

Hadoop (HDFS in particular) is overly strict about machine names. The
filesystem's id is based on the DNS name used to access it. This needs to be
consistent across all nodes and all configurations in your cluster. You
should always use the fully-qualified domain name of the namenode in your
configuration.

- Aaron

On Mon, Aug 24, 2009 at 5:47 PM, zhang jianfeng  wrote:

> Hi all,
>
>
>
> I have two computers, and in the hadoop-site.xml, I define the
> fs.default.name as localhost:9000, then I cannot access the cluster with
> Java API from another machine
>
> But if I change it to its real IP  192.168.1.103:9000, then I can access
> the
> cluster with Java API from another machine.
>
> It’s so strange, are they any different ?
>
>
>
> Thank you.
>
> Jeff zhang
>


Re: How to speed up the copy phrase?

2009-08-24 Thread Aaron Kimball
If you've got 20 nodes, then you want to have 20-ish reduce tasks. Maybe 40
if you want it to run in two waves. (Assuming 1 core/node. Multiply by N for
N cores...) As it is, each node has 500-ish map tasks that it has to read
from and for each of these, it needs to generate 500 separate reduce task
output files. That's going to take Hadoop a long time to do. That many map
tasks is also a very large number for one job. Are you processing a lot of
little files? If so, try using MultiFileInputFormat or MultipleInputs to
group them together.

As is mentioned, also set mapred.reduce.parallel.copies to 20. (The default
of 5 is appropriate only for 1--5 nodes.)
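
Concretely, in the job driver that's something along these lines (old API;
the class name and exact numbers are placeholders for your setup):

JobConf conf = new JobConf(MyJob.class);
conf.setNumReduceTasks(40);                          // ~2 waves on 20 nodes
conf.setInt("mapred.reduce.parallel.copies", 20);    // default is 5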

- Aaron

On Mon, Aug 24, 2009 at 12:31 AM, Amogh Vasekar  wrote:

> Maybe look at mapred.reduce.parallel.copies property to speed it up...I
> don't see as to why transfer speed be configured via params, and I'm think
> hadoop wont be messing with that.
>
> Thanks,
> Amogh
>
> -Original Message-
> From: yang song [mailto:hadoop.ini...@gmail.com]
> Sent: Monday, August 24, 2009 12:20 PM
> To: common-user@hadoop.apache.org
> Subject: How to speed up the copy phrase?
>
> Hello, everyone
>
> When I submit a big job(e.g. maptasks:1, reducetasks:500), I find that
> the copy phrase will last for a long long time. From WebUI, the message
> "reduce > copy ( of 1 at 0.01 MB/s) >" tells me the transfer speed
> is just 0.01 MB/s. Is this a regular value? How can I solve it?
>
> Thank you!
>
> P.S. The hadoop version is 0.19.1. The cluster has 20 nodes. Heap size of
> JT
> is 6G while the others are default settings.
>


Re: Hadoop streaming: How is data distributed from mappers to reducers?

2009-08-24 Thread Aaron Kimball
Yes. It works just like Java-based MapReduce in that regard.
- Aaron

On Sun, Aug 23, 2009 at 5:09 AM, Nipun Saggar wrote:

> Hi all,
>
> I have recently started using Hadoop streaming. From the documentation, I
> understand that by default, each line output from a mapper up to the first
> tab becomes the key and rest of the line is the value. I wanted to know
> that
> between the mapper and reducer, is there a shuffling(sorting) phase? More
> specifically, Would it be correct to assume that output from all mappers
> with the same key will go to the same reducer?
>
> Thanks,
> Nipun
>


Re: Writing to a db with DBOutputFormat spits out IOException Error

2009-08-24 Thread Aaron Kimball
As a more general note -- any jars needed by your mappers and reducers need to
be either in the lib/ directory inside your job .jar file, or in
$HADOOP_HOME/lib/ on all tasktracker nodes where mappers and reducers get run.

- Aaron


On Fri, Aug 21, 2009 at 10:47 AM, ishwar ramani  wrote:

> For future reference.
>
> This is a class not found exception for the mysql driver.  The
> DBOutputFormat converts
> it into an IO exception gr.
>
> I had the mysql-connector in both $HADOOP/lib and $HADOOP_CLASSPATH.
> That did not help.
>
> I had to pkg the mysql jar into my map reduce jar to fix this problem.
>
> Hope that saves a day for some one!
>
> On Thu, Aug 20, 2009 at 4:52 PM, ishwar ramani wrote:
> > Hi,
> >
> > I am trying to run a simple map reduce that writes the result from the
> > reducer to a mysql db.
> >
> > I Keep getting
> >
> > 09/08/20 15:44:59 INFO mapred.JobClient: Task Id :
> > attempt_200908201210_0013_r_00_0, Status : FAILED
> > java.io.IOException: com.mysql.jdbc.Driver
> >at
> org.apache.hadoop.mapred.lib.db.DBOutputFormat.getRecordWriter(DBOutputFormat.java:162)
> >at
> org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:435)
> >at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:413)
> >at org.apache.hadoop.mapred.Child.main(Child.java:170)
> >
> > when the reducer is run.
> >
> > Here is my code. The user name and password are valid and works fine.
> > Is there any way get more info on this exception?
> >
> >
> >
> > static class MyWritable implements Writable, DBWritable {
> >  long id;
> >  String description;
> >
> >  MyWritable(long mid, String mdescription) {
> >id = mid;
> >description = mdescription;
> >  }
> >
> >  public void readFields(DataInput in) throws IOException {
> >this.id = in.readLong();
> >this.description = Text.readString(in);
> >  }
> >
> >  public void readFields(ResultSet resultSet)
> >  throws SQLException {
> >this.id = resultSet.getLong(1);
> >this.description = resultSet.getString(2);
> >  }
> >
> >  public void write(DataOutput out) throws IOException {
> >out.writeLong(this.id);
> >Text.writeString(out, this.description);
> >  }
> >
> >  public void write(PreparedStatement stmt) throws SQLException {
> >stmt.setLong(1, this.id);
> >stmt.setString(2, this.description);
> >  }
> > }
> >
> >
> >
> >
> >
> >
> > public static class Reduce extends MapReduceBase implements
> > Reducer<Text, IntWritable, MyWritable, IntWritable> {
> >  public void reduce(Text key, Iterator<IntWritable> values,
> > OutputCollector<MyWritable, IntWritable> output, Reporter reporter)
> > throws IOException {
> >int sum = 0;
> >while (values.hasNext()) {
> >  sum += values.next().get();
> >}
> >
> >output.collect(new MyWritable(sum, key.toString()), new
> IntWritable(sum));
> >  }
> > }
> >
> >
> >
> >
> >
> > public static void main(String[] args) throws Exception {
> >  JobConf conf = new JobConf(WordCount.class);
> >  conf.setJobName("wordcount");
> >
> >  conf.setMapperClass(Map.class);
> >
> >  conf.setReducerClass(Reduce.class);
> >
> >  DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
> > "jdbc:mysql://localhost:8100/testvmysqlsb", "dummy", "pass");
> >
> >
> >  String fields[] = {"id", "description"};
> >  DBOutputFormat.setOutput(conf, "funtable", fields);
> >
> >
> >
> >  conf.setNumMapTasks(1);
> >  conf.setNumReduceTasks(1);
> >
> >  conf.setMapOutputKeyClass(Text.class);
> >  conf.setMapOutputValueClass(IntWritable.class);
> >
> >
> >  conf.setOutputKeyClass(MyWritable.class);
> >  conf.setOutputValueClass(IntWritable.class);
> >
> >  conf.setInputFormat(TextInputFormat.class);
> >
> >
> >
> >
> >  FileInputFormat.setInputPaths(conf, new Path(args[0]));
> >
> >
> >  JobClient.runJob(conf);
> > }
> >
>


Re: NN memory consumption on 0.20/0.21 with compressed pointers/

2009-08-20 Thread Aaron Kimball
Compressed OOPs are available now in 1.6.0u14:
https://jdk6.dev.java.net/6uNea.html
- Aaron

On Thu, Aug 20, 2009 at 10:51 AM, Raghu Angadi wrote:

>
> Suresh had made an spreadsheet for memory consumption.. will check.
>
> A large portion of NN memory is taken by references. I would expect memory
> savings to be very substantial (same as going from 64bit to 32bit), could be
> on the order of 40%.
>
> The last I heard from Sun was that compressed pointers will be in very near
> future JVM (certainly JDK 1.6_x). It can use compressed pointers upto 32GB
> of heap.
>
> I would expect runtime over head on NN would be minimal in practice.
>
> Raghu.
>
>
> Steve Loughran wrote:
>
>>
>> does anyone have any up to date data on the memory consumption per
>> block/file on the NN on a 64-bit JVM with compressed pointers?
>>
>> The best documentation on consumption is
>> http://issues.apache.org/jira/browse/HADOOP-1687 -I'm just wondering if
>> anyone has looked at the memory footprint on the latest Hadoop releases, on
>> those latest JVMs? -and which JVM the numbers from HADOOP-1687 came from?
>>
>> Those compressed pointers (which BEA JRockit had for a while) save RAM
>> when the pointer references are within a couple of GB of the other refs, and
>> which are discussed in some papers
>> http://rappist.elis.ugent.be/~leeckhou/papers/cgo06.pdf
>> http://www.elis.ugent.be/~kvenster/papers/VenstermansKris_ORA.pdf
>>
>> sun's commentary is up here
>> http://wikis.sun.com/display/HotSpotInternals/CompressedOops
>>
>> I'm just not sure what it means for the NameNode, and as there is no
>> sizeof() operator in Java, something that will take a bit of effort to work
>> out. From what I read of the Sun wiki, when you go compressed, while your
>> heap is <3-4GB, there is no decompress operation; once you go above that
>> there is a shift and an add, which is probably faster than fetching another
>> 32 bits from $L2 or main RAM. The result could be -could be- that your NN
>> takes up much less space on 64 bit JVMs than it did before, but is no
>> slower.
>>
>> Has anyone worked out the numbers yet?
>>
>> -steve
>>
>
>


Re: Using Hadoop with executables and binary data

2009-08-20 Thread Aaron Kimball
Look into "typed bytes":
http://dumbotics.com/2009/02/24/hadoop-1722-and-typed-bytes/

On Thu, Aug 20, 2009 at 10:29 AM, Jaliya Ekanayake wrote:

> Hi Stefan,
>
>
>
> I am sorry, for the late reply. Somehow the response email has slipped my
> eyes.
>
> Could you explain a bit on how to use Hadoop streaming with binary data
> formats.
>
> I can see, explanations on using it with text data formats, but not for
> binary files.
>
>
> Thank you,
>
> Jaliya
>
> Stefan Podkowinski
> Mon, 10 Aug 2009 01:40:05 -0700
>
> Jaliya,
>
> did you consider Hadoop Streaming for your case?
> http://wiki.apache.org/hadoop/HadoopStreaming
>
>
> On Wed, Jul 29, 2009 at 8:35 AM, Jaliya
> Ekanayake wrote:
> > Dear Hadoop devs,
> >
> >
> >
> > Please help me to figure out a way to program the following problem using
> > Hadoop.
> >
> > I have a program which I need to invoke in parallel using Hadoop. The
> > program takes an input file(binary) and produce an output file (binary)
> >
> >
> >
> > Input.bin ->prog.exe-> output.bin
> >
> >
> >
> > The input data set is about 1TB in size. Each input data file is about
> 33MB
> > in size. (So I have about 31000 files)
> >
> > The output binary file is about 9KBs in size.
> >
> >
> >
> > I have implemented this program using Hadoop in the following way.
> >
> >
> >
> > I keep the input data in a shared parallel file system (Lustre File
> System).
> >
> > Then, I collect the input file names and write them to a collection of
> files
> > in HDFS (let's say hdfs_input_0.txt ..).
> >
> > Each hdfs_input file contains roughly the equal number of files URIs to
> the
> > original input file.
> >
> > The map task, simply take a string Value which is a URI to an original
> input
> > data file and execute the program as an external program.
> >
> > The output of the program is also written to the shared file system
> (Lustre
> > File System).
> >
> >
> >
> > The problem in this approach is I am not utilizing the true benefit of
> > MapReduce. The use of local disks.
> >
> > Could  you please suggest me a way to use local disks for the above
> > problem.?
> >
> >
> >
> > I thought, of the following way, but would like to verify from you if
> there
> > is a better way.
> >
> >
> >
> > 1.   Upload the original data files in HDFS
> >
> > 2.   In the map task, read the data file as an binary object.
> >
> > 3.   Save it in the local file system.
> >
> > 4.   Call the executable
> >
> > 5.   Push the output from the local file system to HDFS.
> >
> >
> >
> > Any suggestion is greatly appreciated.
> >
> >
> > Thank you,
> >
> > Jaliya
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
>
>
>
>


Re: Location of the source code for the fair scheduler

2009-08-20 Thread Aaron Kimball
What do you mean?
- Aaron

On Wed, Aug 19, 2009 at 8:35 PM, Mithila Nagendra  wrote:

> Thanks! But How do I know which version to work with?
> Mithila
>
>
> On Thu, Aug 20, 2009 at 2:30 AM, Ravi Phulari  >wrote:
>
> >  Currently Fairscheduler source is in
> >  hadoop-mapreduce/src/contrib/fairscheduler/
> >
> > Download mapreduce source from.
> > http://hadoop.apache.org/mapreduce/
> >
> > -
> > Ravi
> >
> >
> > On 8/19/09 2:48 PM, "Mithila Nagendra"  wrote:
> >
> > Hello
> >
> > I was wondering how I could locate the source code files for the fair
> > scheduler.
> >
> > Thanks
> > Mithila
> >
> >
> >
> >
>


Re: How does hadoop deal with hadoop-site.xml?

2009-08-20 Thread Aaron Kimball
On Wed, Aug 19, 2009 at 8:39 PM, yang song  wrote:

>Thank you, Aaron. I've benefited a lot. "per-node" means some settings
> associated with the node. e.g., "fs.default.name", "mapred.job.tracker",
> etc. "per-job" means some settings associated with the jobs which are
> submited from the node. e.g., "mapred.reduce.tasks". That means, if I set
> "per-job" properties on JobTracker, it will doesn't work. Is my
> understanding right?


It will work if you submit your job (run "hadoop jar ") from the
JobTracker node :) It won't if you submit your job from elsewhere.


>
>In addition, when I add some new properties, e.g.,
> "mapred.inifok.setting" on JobTracker, I can find it in every job.xml from
> WebUI. I think all jobs will use the new properties. Is it right?


If you set a property programmatically when configuring your job, that will
be available in the JobConf on all machines for that job only. If you set a
property in your hadoop-site.xml on the submitting machine, then I think
that will also be available for the job on all nodes.

- Aaron


>
>Thanks again.
>Inifok
>
> 2009/8/20 Aaron Kimball 
>
> > Hi Inifok,
> >
> > This is a confusing aspect of Hadoop, I'm afraid.
> >
> > Settings are divided into two categories: "per-job" and "per-node."
> > Unfortunately, which are which, isn't documented.
> >
> > Some settings are applied to the node that is being used. So for example,
> > if
> > you set fs.default.name on a node to be "hdfs://some.namenode:8020/",
> then
> > any FS connections you make from that node will go to some.namenode. If a
> > different machine in your cluster has fs.default.name set to
> > hdfs://other.namenode, then that machine will connect to a different
> > namenode.
> >
> > Another example of a per-machine setting is
> > mapred.tasktracker.map.tasks.maximum; this tells a tasktracker the
> maximum
> > number of tasks it should run in parallel. Each tasktracker is free to
> > configure this value differently. e.g., if you have some quad-core and
> some
> > eight-core machines. dfs.data.dir tells a datanode where its data
> > directories should be kept. Naturally, this can vary machine-to-machine.
> >
> > Other settings are applied to a job as a whole. These settings are
> > configured when you submit the job. So if you write
> > conf.set("mapred.reduce.parallel.copies", 20) in your code, this will be
> > the
> > setting for the job. Settings that you don't explicitly put in your code,
> > are drawn from the hadoop-site.xml file on the machine where the job is
> > submitted from.
> >
> > In general, I strongly recommend you save yourself some pain by keeping
> > your
> > configuration files as identical as possible :)
> > Good luck,
> > - Aaron
> >
> >
> > On Wed, Aug 19, 2009 at 7:21 AM, yang song 
> > wrote:
> >
> > > Hello, everybody
> > >I feel puzzled about setting properties in hadoop-site.xml.
> > >Suppose I submit the job from machine A, and JobTracker runs on
> > machine
> > > B. So there are two hadoop-site.xml files. Now, I increase
> > > "mapred.reduce.parallel.copies"(e.g. 10) on machine B since I want to
> > make
> > > copy phrase faster. However, "mapred.reduce.parallel.copies" from WebUI
> > is
> > > still 5. When I increase it on machine A, it changes. So, I feel very
> > > puzzled. Why does it doesn't work when I change it on B? What's more,
> > when
> > > I
> > > add some properties on B, the certain properties will be found on
> WebUI.
> > > And
> > > why I can't change properties through machine B? Does some certain
> > > properties must be changed through A and some others must be changed
> > > through
> > > B?
> > >Thank you!
> > >Inifok
> > >
> >
>


Re: RuntimeException: Not a host:port pair: local error when running bin/start-mapred.sh

2009-08-19 Thread Aaron Kimball
You need to put this in mapred-site.xml too, I think.
- Aaron

On Tue, Aug 18, 2009 at 10:31 PM, Daniel someone wrote:

> Hi everybody,
>
> I am after some help with the following error that I am getting when I try
> to start the start-mapred.sh. To me looked like it was getting "local"
> passed as the hostname:port pair but I have double checked my settings in
> "core-site.xml" and they look correct to me, I have tried changing the
> values of mapred.job.tracker to FQDN and IP but this had not changed the
> behaviour.
> I am running:
> hadoop 0.20.0
> Java 1.6.0._16
> solaris 9
>
> Any help appreciated
> Regards
> Dan
>
> Error seen in logs:
>
> 2009-08-18 18:34:07,102 INFO org.apache.hadoop.mapred.JobTracker:
> STARTUP_MSG:
> /
> STARTUP_MSG: Starting JobTracker
> STARTUP_MSG:   host = xx/xx
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 0.20.0
> STARTUP_MSG:   build =
> https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20 -r
> 763504;
> compiled by 'ndale
> y' on Thu Apr  9 05:18:40 UTC 2009
> /
> 2009-08-18 18:34:08,622 FATAL org.apache.hadoop.mapred.JobTracker:
> java.lang.RuntimeException: Not a host:port pair: l
> ocal
>at
> org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:136)
>at
> org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:123)
>at
> org.apache.hadoop.mapred.JobTracker.getAddress(JobTracker.java:1756)
>at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1521)
>at
> org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:174)
>at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3528)
>
> 2009-08-18 18:34:08,627 INFO org.apache.hadoop.mapred.JobTracker:
> SHUTDOWN_MSG:
> /
>


Re: Hadoop - flush() files

2009-08-19 Thread Aaron Kimball
sync doesn't work. See http://issues.apache.org/jira/browse/HDFS-200
- Aaron


On Tue, Aug 18, 2009 at 8:33 AM, Arvind Sharma  wrote:

> Just checking again :-)
>
> I have a setup where I am using Hadoop0-19.2 and the data files are kept
> open for a long time. I want them to be sync'd to the HDFS now and then, to
> avoid any data loss.
>
> In one of the last HUG, somebody mentioned  FSDataOutputStream.sync()
> method. But there were some known issues with that.
>
> Has anyone experienced any problem while using the sync() method ?
>
> Arvind
>
>
>
>
> 
>
>
> Hi,
>
> I was wondering if anyone here have stared using (or has been using) the
> newer Hadoop versions (0-20.1 ??? ) - which provides API for flushing out
> any open files on the HDFS ?
>
> Are there any known issues I should be aware of ?
>
> Thanks!
> Arvind
>
>
>
>


Re: Hadoop for Independant Tasks not using Map/Reduce?

2009-08-19 Thread Aaron Kimball
Hadoop Streaming still expects to be doing "MapReduce". But you can hack
that definition, e.g., by emitting no output data and disabling reducing.

The number of map tasks to run will be controlled by the number of
"InputSplits" -- divisions of some arbitrary piece of input -- that a job
contains. By default, InputSplits are created based on the number of files
used. This is controlled by the InputFormat you select. You might want to
look at NLineInputFormat. This lets you write out a file where each line of
the file is a separate split. So you write a file to HDFS with a line (maybe
containing some arguments) for each instance of your program you want to
run. When you launch your job, point it at this input, and it'll launch the
desired number of copies of your program on a bunch of randomly selected
nodes from your cluster.

I don't know of any specific examples of this in use to point you to, but
you can certainly make a start of this.
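
As a starting point, a bare-bones sketch of that pattern (old API; the class
names, script path, and args-file layout are all made up, and error handling
is omitted):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class TaskFarm {
  // One map task per line of the input file; each line holds the arguments
  // for one invocation of the external program.
  public static class LauncherMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, NullWritable, NullWritable> {
    public void map(LongWritable key, Text argLine,
        OutputCollector<NullWritable, NullWritable> out, Reporter reporter)
        throws IOException {
      Process p = Runtime.getRuntime().exec("/path/to/myprogram.sh " + argLine);
      try {
        p.waitFor();                      // block until the program finishes
      } catch (InterruptedException ie) {
        throw new IOException(ie.toString());
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(TaskFarm.class);
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 1);  // 1 line = 1 task
    conf.setMapperClass(LauncherMapper.class);
    conf.setNumReduceTasks(0);                               // no reduce phase
    conf.setOutputKeyClass(NullWritable.class);
    conf.setOutputValueClass(NullWritable.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));  // file of arg lines
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}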

- Aaron

On Wed, Aug 19, 2009 at 7:05 AM, Poole, Samuel [USA]
wrote:

> I am new to Hadoop (I have not yet installed/configured), and I want to
> make sure that I have the correct tool for the job.  I do not "currently"
> have a need for the Map/Reduce functionality, but I am interested in using
> Hadoop for task orchestration, task monitoring, etc. over numerous nodes in
> a computing cluster.  Our primary programs (written in C++ and launched via
> shell scripts) each run independantly on a single node, but are deployed to
> different nodes for load balancing.  I want to task/initiate these processes
> on different nodes through a Java program located on a central server.  I
> was hoping to use Hadoop as a foundation for this.
>
> I read the following in the FAQ section:
>
> "How do I use Hadoop Streaming to run an arbitrary set of
> (semi-)independent tasks?
>
> Often you do not need the full power of Map Reduce, but only need to run
> multiple instances of the same program - either on different parts of the
> data, or on the same data, but with different parameters. You can use Hadoop
> Streaming to do this. "
>
> So, two questions I guess.
>
> 1.  Can I use Hadoop for this purpose without using Map/Reduce
> functionality?
>
> 2.  Are there any examples available on how to implement this sort of
> configuration?
>
> Any help would be greatly appreciated.
>
> Sam
>
>
>
>
>
>
>


Re: How does hadoop deal with hadoop-site.xml?

2009-08-19 Thread Aaron Kimball
Hi Inifok,

This is a confusing aspect of Hadoop, I'm afraid.

Settings are divided into two categories: "per-job" and "per-node."
Unfortunately, which are which, isn't documented.

Some settings are applied to the node that is being used. So for example, if
you set fs.default.name on a node to be "hdfs://some.namenode:8020/", then
any FS connections you make from that node will go to some.namenode. If a
different machine in your cluster has fs.default.name set to
hdfs://other.namenode, then that machine will connect to a different
namenode.

Another example of a per-machine setting is
mapred.tasktracker.map.tasks.maximum; this tells a tasktracker the maximum
number of tasks it should run in parallel. Each tasktracker is free to
configure this value differently. e.g., if you have some quad-core and some
eight-core machines. dfs.data.dir tells a datanode where its data
directories should be kept. Naturally, this can vary machine-to-machine.

Other settings are applied to a job as a whole. These settings are
configured when you submit the job. So if you write
conf.set("mapred.reduce.parallel.copies", 20) in your code, this will be the
setting for the job. Settings that you don't explicitly put in your code,
are drawn from the hadoop-site.xml file on the machine where the job is
submitted from.
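
So the rule of thumb is: anything that must travel with a particular job, set
it in code when you build that job, e.g.:

JobConf conf = new JobConf(MyJob.class);
conf.setInt("mapred.reduce.parallel.copies", 20);  // per-job: shipped in job.xml

whereas things like fs.default.name, dfs.data.dir, and
mapred.tasktracker.map.tasks.maximum are read from each node's own
hadoop-site.xml.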

In general, I strongly recommend you save yourself some pain by keeping your
configuration files as identical as possible :)
Good luck,
- Aaron


On Wed, Aug 19, 2009 at 7:21 AM, yang song  wrote:

> Hello, everybody
>I feel puzzled about setting properties in hadoop-site.xml.
>Suppose I submit the job from machine A, and JobTracker runs on machine
> B. So there are two hadoop-site.xml files. Now, I increase
> "mapred.reduce.parallel.copies"(e.g. 10) on machine B since I want to make
> copy phrase faster. However, "mapred.reduce.parallel.copies" from WebUI is
> still 5. When I increase it on machine A, it changes. So, I feel very
> puzzled. Why does it doesn't work when I change it on B? What's more, when
> I
> add some properties on B, the certain properties will be found on WebUI.
> And
> why I can't change properties through machine B? Does some certain
> properties must be changed through A and some others must be changed
> through
> B?
>Thank you!
>Inifok
>


Re: Location of the source code for the fair scheduler

2009-08-19 Thread Aaron Kimball
Hi Mithila,

In the Mapreduce svn tree, it's under src/contrib/fairscheduler/
- Aaron

On Wed, Aug 19, 2009 at 2:48 PM, Mithila Nagendra  wrote:

> Hello
>
> I was wondering how I could locate the source code files for the fair
> scheduler.
>
> Thanks
> Mithila
>


Re: Running Cloudera's distribution without their support agreement - is that a bad idea?

2009-08-19 Thread Aaron Kimball
Hi Erik,

We built our distribution to make it easy for you to get started using
Hadoop -- not to force you into buying a support agreement. (Though if
you're running Hadoop in a production environment, we're certainly happy to
talk to you about that later! ;)

You're just as likely to get help from this mailing list with any Hadoop
questions, bugs, etc. here whether you're running our packages, or whether
you download a "stock" package from Apache. An advantage of our distribution
is that we tend to backport bugfixes more aggressively than Apache does. So
our release of 0.18.3 contains some bugfixes that aren't present in Apache's
0.18.3 tarball. (They're all contributed to 0.19 or 0.20, or on the
development trunk, though.) And if you report a bug on our distribution,
we'll fix it and you just 'apt-get upgrade hadoop' when it's ready. Rather
than get too high on my soapbox, I'll just point you toward
www.cloudera.com/hadoop if you're interested in learning more.

If you've got specific questions or bug reports about installing our
packages, like apt-get throws a strange message at you, we also have a
separate community support forum at
http://getsatisfaction.com/clouderawhere we'll help you resolve these
issues, support contract or no. This is
in a separate forum to prevent Cloudera-specific issues from becoming a
distraction on this list.

Welcome to Hadoop :)
Cheers,
- Aaron

On Wed, Aug 19, 2009 at 7:43 AM, Steve Loughran  wrote:

> Erik Forsberg wrote:
>
>> Hi!
>>
>> I'm currently evaluating different Hadoop versions for a new project.
>> I'm tempted by the Cloudera distribution, since it's neatly packaged
>> into .deb files, and is the stable distribution but with some patches
>> applied, for example the bzip2 support.
>>
>> I understand that I can get a support agreement from Cloudera to match
>> this distribution, but if that's not an option, will running the
>> Cloudera distribution put me in a position where I won't get any help
>> from the community because I'm not running an official Apache Hadoop
>> release?
>>
>
> -there is no official Apache deb so you will end up using someone elses deb
> if that is how you build your cluster up.
> -everyone welcomes bug reports, especially ones with stack traces.
> -regardless of whether you use an official vs external release, a common
> answer to any bugrep will be "does it go away on the latest release?" And
> then "does it go away on trunk?".
> -only you are going to be able to track down problems on your cluster,
> because your machines and network is different from everybody else's.
> -the act of checking out and building a release locally sets you up to
> adding diagnostics and fixes to the source, fixes you can turn into patches
> to get pushed in.
>
> what cloudera are selling, then, is not the packaging, so much as them
> taking over the work of fixing bugs for you. You are still free to track
> down and fix your own problems, on their releases and the Apache ones
> -because nobody else's network/cluster matches yours.
>
> I am in favour of adding lots more diagnostics to hadoop, most of the
> patches of mine that have gone in help with this debugging of which machines
> are playing up -and why. Anything we can do to help debug hadoop, or
> validate an installation, is a welcome improvement.
>
> -steve
>
>


Re: What will we encounter if we add a lot of nodes into the current cluster?

2009-08-12 Thread Aaron Kimball
Also, if you haven't yet configured rack awareness, now's a good time to
start :)
- Aaron

On Tue, Aug 11, 2009 at 11:27 PM, Ted Dunning  wrote:

> If you add these nodes, data will be put on them as you add data to the
> cluster.
>
> Soon after adding the nodes you should rebalance the storage to avoid age
> related surprises in how files are arranged in your cluster.
>
> Other than that, your addition should cause little in the way of surprises.
>
> On Tue, Aug 11, 2009 at 11:00 PM, yang song 
> wrote:
>
> > Dear all
> >I'm sorry to disturb you.
> >Our cluster has 200 nodes now. In order to improve its ability, we
> hope
> > to add 60 nodes into the current cluster. However, we all don't know what
> > will happen if we add so many nodes at the same time. Could you give me
> > some
> > tips and notes? During the process, which part shall we pay much
> attention
> > on?
> >Thank you!
> >
> >P.S. Our environment is hadoop-0.19.1, jdk1.6.0_06, linux redhat
> > enterprise 4.0
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>


Re: Need info about "mapred.input.format.skew"

2009-08-12 Thread Aaron Kimball
Cubic,

I don't see that string appear anywhere in the MapReduce trunk source.
Where'd you come across this?
- Aaron

On Tue, Aug 11, 2009 at 12:53 PM, CubicDesign  wrote:

> Hi.
>
> Can anybody point me to the Apache documentation page for
> "mapred.input.format.skew" ?
> I cannot find the documentation for this parameter.
>
> What does it mean?
>
> Thanks
>


Re: XML files in HDFS

2009-08-11 Thread Aaron Kimball
Wasim,

RecordReader implementations should never require that elements not be
spread across multiple blocks. The start and end offsets into a file in an
InputSplit are taken as soft limits, not hard ones. The RecordReader
implementations that come with Hadoop perform this way, and any that you
author should do the same. If a logical record continues past its end
offset, it will continue to read the data from the next block until it finds
the end of the record. Similarly, if a RecordReader has a start offset > 0,
then it scans forward until the first end-of-record followed by any
beginning-of-record marker, ignoring this data (as it was processed by the
previous inputsplit), and only then does it begin reading records into its
map task.
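
In pseudocode, the pattern a typical RecordReader follows is roughly this (the
helper names are invented; this is not the actual LineRecordReader source):

// Initialization, for a split covering byte range [start, end):
if (start != 0) {
  in.seek(start);
  skipToStartOfNextRecord(in);  // whatever gets skipped here was already
                                // handled by the previous split's reader
}
pos = in.getPos();

// Read loop: keep going as long as the record *starts* at or before 'end'.
while (pos <= end && readOneRecord(in, value)) {
  // A record that begins before 'end' may finish past it; the reader just
  // keeps pulling bytes from the next block until the record is complete.
  emit(value);
  pos = in.getPos();
}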

- Aaron


On Mon, Aug 10, 2009 at 12:07 PM, Joerg Rieger <
joerg.rie...@mni.fh-giessen.de> wrote:

> Hello,
>
> while flipping through the cloud9 collections, I came across an XML
> InputFormat class:
>
>
> http://www.umiacs.umd.edu/~jimmylin/cloud9/docs/api/edu/umd/cloud9/collection/XMLInputFormat.html
>
> I haven't used it myself, but It might be worth a try.
>
>
> Joerg
>
>
>
> On 30.07.2009, at 14:16, Hyunsik Choi wrote:
>
>  Hi,
>>
>> Actually, I don't know there exists any well-made XML InputFormat or
>> Record reader.
>> To the best of my knowledge, StreamXmlRecordReader (
>>
>> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/streaming/StreamXmlRecordReader.html
>> ) of Hadoop streaming is only solution.
>>
>> Good luck!
>>
>> --
>> Hyunsik Choi
>> Database & Information Systems Group, Korea University
>> http://diveintodata.org
>>
>>
>>
>> On Thu, Jul 30, 2009 at 5:30 PM, Wasim Bari wrote:
>>
>>>
>>>
>>>
>>> Hi All,
>>>
>>>  I am looking to store some real big xml files in HDFS and then
>>> process them using MapReduce.
>>>
>>>
>>>
>>> Do we have some utility which uploads the xml files to hdfs making sure
>>> split  up of file in block doen't brake an elemet ( mean half element on one
>>> block and half on someother ) ?
>>>
>>>
>>>
>>> Any suggestions to work thos out will  be appreciated greatly.
>>>
>>>
>>>
>>> Thanks
>>>
>>>
>>>
>>> Bari
>>>
>>>
> --
>
>
>
>


Re: newbie question

2009-08-10 Thread Aaron Kimball
You can set it on a per-file basis if you'd like the control. The data
structures associated with files allow these to be individually controlled.
But there's also a  create() call that only accepts the Path to open as an
argument. This uses the configuration file defaults. This use case is
considerably more common in user applications.
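
Both forms are in the FileSystem API; roughly (the paths and numbers below are
just for illustration):

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

// Cluster defaults: io.file.buffer.size and dfs.block.size from the config.
FSDataOutputStream plain = fs.create(new Path("/data/plain.bin"));

// Per-file overrides: buffer size, replication, and block size for this file.
FSDataOutputStream tuned = fs.create(new Path("/data/bigblocks.bin"),
    true,                   // overwrite if it exists
    64 * 1024,              // bufferSize
    (short) 3,              // replication
    128L * 1024 * 1024);    // blockSize (128 MB)

// (Remember to close() both streams when you're done writing.)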

- Aaron

On Mon, Aug 10, 2009 at 11:15 AM, hadooprcoks  wrote:

> Hi,I was looking at the FileSystem API and have couple of quick questions
> for experts.
> In the FileSystem.create() call two of the parameters are bufferSize and
> blockSize.
>
> I understand they correspond to io.file.buffer.size and dfs.block.size
> properties in the config files.
>
> My question is - do we expect that applications can provide different
> values
> for them on a per file basis or usually its a cluster wide setting ?
>
> Thanks
> -Vineet
>


Re: ~ Replacement for MapReduceBase ~

2009-08-10 Thread Aaron Kimball
Naga,

That's right. In the old API, Mapper and Reducer were just interfaces and
didn't provide default implementations of their code. Thus MapReduceBase.
Now Mapper and Reducer are classes to extend, so no MapReduceBase is needed.
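
So a minimal new-API mapper is just (sketch):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void setup(Context context) {
    // takes the place of the old configure(JobConf) from MapReduceBase
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // ...emit output with context.write(...)
  }
}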

- Aaron

On Fri, Aug 7, 2009 at 8:26 AM, Naga Vijayapuram  wrote:

> Appears we just need to extend the Mapper class and not use
> MapReduceBase anymore (in hadoop-0.20.0)
>
> If that is not the case, I would like to know the recommended approach
> in hadoop-0.20.0
>
> Thanks,
> Naga Vijayapuram
>
>
> On Fri, Aug 7, 2009 at 8:06 AM, Naga Vijayapuram
> wrote:
> > Hello,
> >
> > I am using hadoop-0.20.0
> >
> > What's the replacement for the deprecated MapReduceBase?
> >
> > Thanks,
> > Naga Vijayapuram
> >
>


Re: Maps running - how to increase?

2009-08-06 Thread Aaron Kimball
I don't know that your load-in speed is going to dramatically
increase. There's a number of parameters that adjust aspects of
MapReduce, but HDFS more or less works out of the box. You should run
some monitoring on your nodes (ganglia, nagios) or check out what
they're doing with top, iotop and iftop to see where you're
experiencing bottlenecks.
- Aaron

On Thu, Aug 6, 2009 at 11:41 AM, Zeev Milin wrote:
> Thanks Aaron,
>
> I changed the settings in hadoop-site.xml file on all the machines. BTW,
> some settings are only reflected on the job level when I change the
> hadoop-default file, not sure why hadoop-site is being ignored (ex:
> mapred.tasktracker.map.tasks.maximum).
>
> The files I am trying load are fairly small (~4MB on average). The
> configuration of each machine is: 2 dual cores (Xeon, 2.33Ghz), 8GB ram and
> a local SCSI hard drive. (total of 6 nodes)
>
> I will look into the article you mentioned, I understand that to load the
> files is going to be slow, was just wondering why the machines are not being
> utilized and mostly idle when more maps can be run in parallel. Maps running
> is always 6.
>
> Another option is to load one 20GB file but currently the speed is fairly
> slow in my opinion: 1GB in 1.5min. What kind of tuning can be done to
> speedup the load into hdfs? If you have any recommendation for specific
> parameters that might help it will be great.
>
> Thanks,
> Zeev
>


Re: Help in running hadoop from eclipse

2009-08-06 Thread Aaron Kimball
May I ask why you're trying to run the NameNode in Eclipse? This is
likely going to cause you lots of classpath headaches. I think your
current problem is that it can't find its config files, so it's not
able to read in the strings for what addresses it should listen on.

If you want to see what's happening inside the namenode, I'd run
"bin/hadoop-daemon.sh start namenode" to start it, then fire up
Eclipse and attach the debugger externally. You can then point it at
your copy of the Hadoop source code so you can set breakpoints, trace,
etc.

- A

On Wed, Aug 5, 2009 at 9:51 PM, ashish pareek  wrote:
>
> Hi Everybody,
>
>                  I am trying to run hadoop from eclipse... but when i run
> NameNode.java as a java application i get the following error. Please help in
> getting rid of this problem.
>
>
>
> 2009-08-05 23:42:00,760 INFO  dfs.NameNode
> (StringUtils.java:startupShutdownMessage(464)) - STARTUP_MSG:
>
> /
>
> STARTUP_MSG: Starting NameNode
>
> STARTUP_MSG:   host = datanode1/10.0.1.109
>
> STARTUP_MSG:   args = []
>
> STARTUP_MSG:   version = Unknown
>
> STARTUP_MSG:   build = Unknown -r Unknown; compiled by 'Unknown' on Unknown
>
> /
>
> 2009-08-05 23:42:00,806 ERROR dfs.NameNode (NameNode.java:main(843)) -
> java.lang.NullPointerException
>
>            at
> org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:130)
>
>            at org.apache.hadoop.dfs.NameNode.getAddress(NameNode.java:116)
>
>            at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:136)
>
>            at org.apache.hadoop.dfs.NameNode.(NameNode.java:193)
>
>            at org.apache.hadoop.dfs.NameNode.(NameNode.java:179)
>
>            at
> org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:830)
>
>            at org.apache.hadoop.dfs.NameNode.main(NameNode.java:839)
>
>
>
> 2009-08-05 23:42:00,807 INFO  dfs.NameNode (StringUtils.java:run(479)) -
> SHUTDOWN_MSG:
>
> /
>
> SHUTDOWN_MSG: Shutting down NameNode at datanode1/10.0.1.109
>
> /


Re: Maps running - how to increase?

2009-08-06 Thread Aaron Kimball
Is that setting in the hadoop-site.xml file on every node? Each tasktracker
reads in that file once and sets its max map tasks from that. There's no way
to control this setting on a per-job basis or from the client (submitting)
system. If you've changed hadoop-site.xml after starting the tasktracker,
you need to restart the tasktracker daemon on each node.
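
To make the distinction concrete, here's a rough sketch of what is and is
not controllable from the submitting client (the class name is made up;
property names are as of 0.19/0.20):

import org.apache.hadoop.mapred.JobConf;

public class JobSettings {
  public static JobConf configure() {
    JobConf conf = new JobConf(JobSettings.class);
    // Per-job and honored (as a hint): the framework still derives the
    // actual map count from the number of input splits.
    conf.setNumMapTasks(48);
    // Cluster-level only: each tasktracker reads this from its own
    // hadoop-site.xml when the daemon starts. Setting it here at job
    // submission time has no effect on concurrent maps per node.
    conf.setInt("mapred.tasktracker.map.tasks.maximum", 32);
    return conf;
  }
}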

Note that 32 maps/node is considered a *lot*. This will likely not provide
you with optimal throughput, since they'll be competing for cores, RAM, I/O,
etc. ...Unless you've got some really super-charged machines in your
datacenter :grin:

Also, in terms of optimizing your job -- do you really have 6,000 big files
worth reading? Or are you running a job over 6,000 small files (where small
means less than 100 MB or so)? If the latter, consider using
MultiFileInputFormat to allow each task to operate on multiple files. See
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/ for some
more detail. Even after all 6,000 map tasks run, you'll have to deal with
reassembling 6,000 intermediate data shards into 6 or 12 reduce tasks. This
will also be slow, unless you bunch up multiple files into a single task.

Cheers,
- Aaron


On Wed, Aug 5, 2009 at 5:06 PM, Zeev Milin  wrote:

> I now see that the mapred.tasktracker.map.tasks.maximum=32 on the job level
> and still only 6 maps running and 5000+ pending..
>
> Not sure how to force the cluster to run more maps.
>


Re: namenode -upgrade problem

2009-08-05 Thread Aaron Kimball
The only time you would need to upgrade is if you've increased the Hadoop
version but are retaining the same HDFS :) So, that's the normal case.

What does "netstat --listening --numeric --program" report?
- Aaron

On Wed, Aug 5, 2009 at 10:53 AM, bharath vissapragada <
bharathvissapragada1...@gmail.com> wrote:

> yes .. I have stopped all the daemons ... when i use jps ...i get only ...
> " Jps"
>
> Actually .. i upgraded the version from 18.2 to 19.x  on the same path of
> hdfs .. is it a problem?
>
>
> On Wed, Aug 5, 2009 at 11:02 PM, Aaron Kimball  wrote:
>
> > Are you sure you stopped all the daemons? Use 'sudo jps' to make sure :)
> > - Aaron
> >
> > On Mon, Aug 3, 2009 at 7:26 PM, bharath vissapragada <
> > bharathvissapragada1...@gmail.com> wrote:
> >
> > > Todd thanks for replying ..
> > >
> > > I stopped the cluster and issued the command
> > >
> > > "bin/hadoop namenode -upgrade" and iam getting this exception
> > >
> > > 09/08/04 07:52:39 ERROR namenode.NameNode: java.net.BindException:
> > Problem
> > > binding to master/10.2.24.21:54310 : Address already in use
> > >at org.apache.hadoop.ipc.Server.bind(Server.java:171)
> > >at org.apache.hadoop.ipc.Server$Listener.(Server.java:234)
> > >at org.apache.hadoop.ipc.Server.(Server.java:960)
> > >at org.apache.hadoop.ipc.RPC$Server.(RPC.java:465)
> > >at org.apache.hadoop.ipc.RPC.getServer(RPC.java:427)
> > >at
> > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:153)
> > > at
> > >
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:208)
> > >at
> > >
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:194)
> > >at
> > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:859)
> > >at
> > > org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868)
> > > Caused by: java.net.BindException: Address already in use
> > >at sun.nio.ch.Net.bind(Native Method)
> > >at
> > >
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:119)
> > >at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
> > >at org.apache.hadoop.ipc.Server.bind(Server.java:169)
> > >... 9 more
> > >
> > > any clue?
> > >
> > > On Tue, Aug 4, 2009 at 12:51 AM, Todd Lipcon 
> wrote:
> > >
> > > > On Mon, Aug 3, 2009 at 12:08 PM, bharath vissapragada <
> > > > bharathvissapragada1...@gmail.com> wrote:
> > > >
> > > > > Hi all ,
> > > > >
> > > > > I have noticed some problem in my cluster when i changed the hadoop
> > > > version
> > > > > on the same DFS directory .. The namenode log on the master says
> the
> > > > > following ..
> > > > >
> > > > >
> > > > > ile system image contains an old layout version -16.
> > > > > *An upgrade to version -18 is required.
> > > > > Please restart NameNode with -upgrade option.
> > > > > *
> > > >
> > > >
> > > > See bolded text above -- you need to run namenode -upgrade to upgrade
> > > your
> > > > metadata format to the current version.
> > > >
> > > > -Todd
> > > >
> > > >   at
> > > > >
> > > >
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:312)
> > > > >at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
> > > > >at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:309)
> > > > >at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:288)
> > > > >at
> > > > >
> > > > >
> > > >
> > >
> >
> org.apache.hado

Re: namenode -upgrade problem

2009-08-05 Thread Aaron Kimball
Are you sure you stopped all the daemons? Use 'sudo jps' to make sure :)
- Aaron

On Mon, Aug 3, 2009 at 7:26 PM, bharath vissapragada <
bharathvissapragada1...@gmail.com> wrote:

> Todd thanks for replying ..
>
> I stopped the cluster and issued the command
>
> "bin/hadoop namenode -upgrade" and iam getting this exception
>
> 09/08/04 07:52:39 ERROR namenode.NameNode: java.net.BindException: Problem
> binding to master/10.2.24.21:54310 : Address already in use
>at org.apache.hadoop.ipc.Server.bind(Server.java:171)
>at org.apache.hadoop.ipc.Server$Listener.(Server.java:234)
>at org.apache.hadoop.ipc.Server.(Server.java:960)
>at org.apache.hadoop.ipc.RPC$Server.(RPC.java:465)
>at org.apache.hadoop.ipc.RPC.getServer(RPC.java:427)
>at
>
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:153)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:208)
>at
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:194)
>at
>
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:859)
>at
> org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868)
> Caused by: java.net.BindException: Address already in use
>at sun.nio.ch.Net.bind(Native Method)
>at
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:119)
>at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
>at org.apache.hadoop.ipc.Server.bind(Server.java:169)
>... 9 more
>
> any clue?
>
> On Tue, Aug 4, 2009 at 12:51 AM, Todd Lipcon  wrote:
>
> > On Mon, Aug 3, 2009 at 12:08 PM, bharath vissapragada <
> > bharathvissapragada1...@gmail.com> wrote:
> >
> > > Hi all ,
> > >
> > > I have noticed some problem in my cluster when i changed the hadoop
> > version
> > > on the same DFS directory .. The namenode log on the master says the
> > > following ..
> > >
> > >
> > > ile system image contains an old layout version -16.
> > > *An upgrade to version -18 is required.
> > > Please restart NameNode with -upgrade option.
> > > *
> >
> >
> > See bolded text above -- you need to run namenode -upgrade to upgrade
> your
> > metadata format to the current version.
> >
> > -Todd
> >
> >   at
> > >
> >
> > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:312)
> > >at
> > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
> > >at
> > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:309)
> > >at
> > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:288)
> > >at
> > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163)
> > >at
> > >
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:208)
> > >at
> > >
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:194)
> > >at
> > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:859)
> > >at
> > > org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868)
> > > 2009-08-04 00:27:51,498 INFO org.apache.hadoop.ipc.Server: Stopping
> > server
> > > on 54310
> > > 2009-08-04 00:27:51,498 ERROR
> > > org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.IOException:
> > > File system image contains an old layout version -16.
> > > An upgrade to version -18 is required.
> > > Please restart NameNode with -upgrade option.
> > >at
> > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:312)
> > >at
> > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
> > >at
> > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:309)
> > >at
> > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:288)
> > >at
> > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163)
> > >at
> > >
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:208)
> > >at
> > >
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:194)
> > >at
> > >
> > >
> >
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:859)
> > >at
> > > org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868)
> > >
> > > 2009-08-04 00:27:51,499 INFO
> > > org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG
> > >
> > > Can anyone explain me the reason ... i googled it .. but those
> > explanations
> > > weren't quite useful
> > >
> > > Thanks
> > >
> >
>


Re: how to dump data from a mysql cluster to hdfs?

2009-08-05 Thread Aaron Kimball
mysqldump to local files on all 50 nodes, scp them to datanodes, and then
bin/hadoop fs -put?
- Aaron

On Mon, Aug 3, 2009 at 8:15 PM, Min Zhou  wrote:

> hi all,
>
> We need to dump data from a mysql cluster with about 50 nodes to a hdfs
> file. Considered about the issues on security , we can't use tools like
> sqoop, where all datanodes must hold a connection to mysql. any
> suggestions?
>
>
> Thanks,
> Min
> --
> My research interests are distributed systems, parallel computing and
> bytecode based virtual machine.
>
> My profile:
> http://www.linkedin.com/in/coderplay
> My blog:
> http://coderplay.javaeye.com
>


Re: questions about HDFS file access synchronization

2009-08-05 Thread Aaron Kimball
On Wed, Aug 5, 2009 at 6:09 AM, Zhang Bingjun (Eddy) wrote:

> Hi All,
>
> I am quite new to Hadoop. May I ask a simple question about HDFS file
> access
> synchronization?
>
> For some very typical scenarios below, how does HDFS respond? Is there a
> way
> to synchronize file access in HDFS?
>
> A tries to read a file currently being written by B.


There is no sync() call in HDFS. A will read whatever portion of B's data
has already been committed to disk by the datanode. It is unspecified how
much data this will contain. It may be variable depending on which replica
of the file A is reading. After B close()'s the file, all the data will be
available to A.


>
> A tries to write a file currently being written by B.


This will fail. HDFS does not allow multiple writers to a file. The
FileSystem.create() call used by A to open the file for write access will
throw IOException.


>
> A tries to write a file currently being read by B.


This will fail. HDFS does not allow file updates, so if the file already
exists and B is reading it, the FileSystem.create() call used by A will fail
with IOException.
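
As a rough sketch of what that looks like from the client side (the path
is a placeholder and exact behaviour can vary by version):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteExisting {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    Path p = new Path("/shared/data.txt");   // hypothetical shared file
    try {
      // overwrite=false: refuse to clobber a file that already exists.
      fs.create(p, false).close();
    } catch (IOException e) {
      // Thrown when the file already exists (e.g. B created it, or is
      // still writing it).
      System.err.println("create failed: " + e.getMessage());
    }
  }
}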


>
>
> We plan to put some shared data in HDFS so that multiple applications can
> share the data between them. The ideal case is that the underlying
> distributed file system (HDFS here) will provide file access
> synchronization
> so that applications know when they can or cannot operate on a certain
> file.
> Is this way of thinking correct? What is the typical design for this kind
> of
> application scenario?


You'll have to think carefully. You can't update files. There is also no
equivalent of flock(), so you can't use files as locks for exclusive access
to some part of a work flow. If that's what you need, you may want to look
at the ZooKeeper project and see if you can't integrate ZK into your system.
ZK is specifically designed to handle locking, mutual exclusion, and other
distributed synchronization problems.
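
For a flavor of what that looks like, here is a bare-bones exclusive-lock
sketch against ZooKeeper (connect string and znode paths are placeholders,
and it assumes the parent /locks znode already exists; a real recipe would
use sequential ephemeral nodes plus watches rather than this simple check):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SimpleLock {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("zkhost:2181", 10000, null);
    try {
      // Ephemeral node: it disappears automatically if this client dies.
      zk.create("/locks/shared-data", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      System.out.println("Got the lock; safe to work on the data.");
      // ... do work, then delete the node to release the lock ...
      zk.delete("/locks/shared-data", -1);
    } catch (KeeperException.NodeExistsException e) {
      System.out.println("Someone else holds the lock.");
    } finally {
      zk.close();
    }
  }
}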



>
>
> I am quite confused. Definitely need to read more about HDFS and other
> distributed file systems. But before that, I would appreciate very much the
> input from experts in the mailing list.


http://hadoop.apache.org/common/docs/r0.20.0/hdfs_user_guide.html and
http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html are good
places to start.


>
>
> Thanks a lot!
>
> Best regards,
> Zhang Bingjun (Eddy)
>
> E-mail: eddym...@gmail.com, bing...@nus.edu.sg, bing...@comp.nus.edu.sg
> Tel No: +65-96188110 (M)
>

Cheers,
- Aaron


Re: how to get out of Safe Mode?

2009-08-05 Thread Aaron Kimball
For future reference,
$ bin/hadoop dfsadmin -safemode leave
will also just cause HDFS to exit safemode forcibly.
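
If you need to do the same from Java code rather than the shell, something
along these lines should work against 0.20 (treat the exact enum location
as an assumption and double-check it against your version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.FSConstants;

public class LeaveSafeMode {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    if (fs instanceof DistributedFileSystem) {
      // Same effect as "hadoop dfsadmin -safemode leave".
      ((DistributedFileSystem) fs).setSafeMode(
          FSConstants.SafeModeAction.SAFEMODE_LEAVE);
    }
  }
}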

- Aaron

On Wed, Aug 5, 2009 at 1:04 AM, Amandeep Khurana  wrote:

> Two alternatives:
>
> 1. Do bin/hadoop namenode -format. That'll format the metadata and you can
> start afresh.
>
> 2. If that doesnt work, manually go and delete everything that resides in
> the directories to which you've pointed your Namenode and Datanodes to
> store
> their stuff in.
>
>
>
>
> On Tue, Aug 4, 2009 at 4:10 PM, Phil Whelan  wrote:
>
> > Hi,
> >
> > In setting up my cluster I brought a few machines up and down. I did
> > have some data, which I moved to Trash. Now that data is not 100%
> > available, which is fine, because I didn't want it.
> > But now I'm stuck in "Safe Mode", because it cannot find the data. I
> > cannot purge the Trash because it's in read-only due to Safe Mode.
> >
> >   Safe mode is ON.
> >   The ratio of reported blocks 0.9931 has not reached the threshold
> 0.9990.
> >   Safe mode will be turned off automatically.
> >   459 files and directories, 583 blocks = 1042 total. Heap Size is
> > 7.8 MB / 992.31 MB (0%)
> >
> > I just want to format the entire HDFS filesystem. I have nothing
> > I need in there. How can I do this?
> >
> > Phil
> >
>


Re: A few questions about Hadoop and hard-drive failure handling.

2009-07-24 Thread Aaron Kimball
On Fri, Jul 24, 2009 at 6:48 AM, Steve Loughran  wrote:

> Ryan Smith wrote:
>
>>  > but you dont want to be the one trying to write something just after
>> your
>> production cluster lost its namenode data.
>>
>> Steve,
>>
>> I wasnt planning on trying to solve something like this in production.  I
>> would assume everyone here is a professional and wouldn't even think of
>> something like this, but then again maybe not.  I was asking here so i
>> knew
>> the limitations before i started prototyping failure recovery logic.
>>
>> -Ryan
>>
>
>
> That's good to know. Just worrying, that's all
>
> the common failure mode people tend to hit is that their editLog, the list
> of pending operations, gets truncated when the NN runs out of disk space.
> When the NN comes back up, it tries to replay this, but the file is
> truncated and the replay fails. Which means the NN doesnt come back up.
>
> 1. Secondary namenodes help here.
>
> 2. We really do need Hadoop to recover from this more gracefully, perhaps
> by not crashing at this point, and instead halting when the replay finishes.
> You will lose some data, but dont end up having to manually edit the binary
> edit log to get to the same state. Code and tests would be valued
>

+1!


>
> -steve
>


Re: DiskChecker$DiskErrorException in TT logs

2009-07-24 Thread Aaron Kimball
Amandeep,

Does the job fail after that happens? Are there any WARN or ERROR lines in
the log nearby, or any exceptions?

Three possibilities I can think of:

You may have configured Hadoop to run under /tmp, and tmpwatch or another
cleanup utility like that decided to throw away a bunch of files in the temp
space while your job was running. In this case, you should consider moving
hadoop.tmp.dir and mapred.local.dir out from under the default /tmp.

You might be out of disk space?

mapred.local.dir or hadoop.tmp.dir might be set to paths that Hadoop doesn't
have the privileges to write to?

- A


On Thu, Jul 23, 2009 at 2:06 AM, Amandeep Khurana  wrote:

> Hi
>
> I get these messages in the TT log while running a job:
>
> 2009-07-23 02:03:59,091 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>
> taskTracker/jobcache/job_200907221738_0020/attempt_200907221738_0020_r_00_0/output/file.out
> in any of the configured local directories
>
> Whats the problem?
>
> Amandeep
>
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>


Re: hdfs question when replacing dead node...

2009-07-23 Thread Aaron Kimball
How fast did you re-run fsck after re-joining the node? fsck returns data
based on the latest block reports from datanodes -- these are scheduled to
run (I think) every 15 minutes, so the NameNode's state on block replication
may be as much as 15 minutes out of date.

- Aaron

On Thu, Jul 23, 2009 at 3:00 PM, Andy Sautins
wrote:

>
>   I recently had to replace a node on a hadoop 0.20.0 4-node cluster and I
> can't quite explain what happened.  If anyone has any insight I'd appreciate
> it.
>
>   When the node failed ( drive failure ) running the command 'hadoop fsck
> /' correctly showed the data nodes to now be 3 instead of 4 and showed the
> under replicated blocks to be replicated.  I assume that once the node was
> determined to be dead the blocks on the dead node were not considered in the
> replication factor and caused hdfs to replicate to the available nodes to
> meet the configured replication factor of 3.  All is good.  What I couldn't
> explain is that after re-building and re-starting the failed node I started
> the balancer ( bin/start-balancer.sh ) and re-ran 'hadoop fsck /'.  The
> number of nodes showed that the 4th node was now back in the cluster.  What
> struck me as strange is a large number of blocks ( > 2k ) were shown as
> under replicated.  The under replicated blocks were eventually re-replicated
> and all the data seems correct.
>
> Can someone explain why, after re-adding a node that had died, the
> replication factor would go from 3 to 2?  Is there something with the
> balancer.sh script that would show fsck that the blocks are under
> replicated?
>
>   Note that I'm still getting the process for replacing failed nodes down
> so it's possible that I was looking at things wrong for a bit.
>
>Any insight would be greatly appreciated.
>
>Thanks
>
>Andy
>


Re: Testing Mappers in Hadoop 0.20.0

2009-07-23 Thread Aaron Kimball
I hacked around this in MRUnit last night. MRUnit now has support for
the new API -- See MAPREDUCE-800.

You can, in fact, subclass Mapper.Context and Reducer.Context, since
they don't actually share any state with the outer class
Mapper/Reducer implementation, just the type signatures. But doing
this requires making a dummy "Mapper" subclass to wrap around your own
Context subclass, which is somewhat unsightly. C'est la vie.
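
With that in place, a test against a new-API mapper looks roughly like
this (method and package names from memory -- treat the exact MRUnit API
as an assumption and check the javadoc; the mapper class is hypothetical):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class LineLengthMapperTest {
  @Test
  public void testOneRecord() {
    // Push a single (key, value) pair through the mapper and assert on
    // the expected output -- no MiniMRCluster required.
    new MapDriver<LongWritable, Text, Text, IntWritable>()
        .withMapper(new LineLengthMapper())
        .withInput(new LongWritable(0), new Text("hello"))
        .withOutput(new Text("hello"), new IntWritable(5))
        .runTest();
  }
}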

- Aaron

On Wed, Jul 22, 2009 at 10:44 PM, David Hall wrote:
> For what it's worth, we ended up solving this problem (today) by using
> EasyMock with ClassExtension. It's an awful lot of magic, but it seems
> to work just fine for our purposes.  It would be great if doing
> bytecode weaving under the hood weren't necessary just to write test
> code, though.
>
> -- David
>
> On Wed, Jul 22, 2009 at 8:21 PM, Aaron Kimball wrote:
>> er, +CC mapreduce-dev...
>> - A
>>
>> On Wed, Jul 22, 2009 at 8:17 PM, Aaron Kimball wrote:
>>> +CC mapred-dev
>>>
>>> Hm.. Making this change is actually really difficult.
>>>
>>> After changing Mapper.java, I understand why this was made a
>>> non-static member.  By making Context non-static, it can inherit from
>>> MapContext and bind to the type qualifiers already
>>> specified in the class definition. So you can't make Context static
>>> because it needs to inherit the type parameters from the Mapper class.
>>> (Since Mapper has type qualifiers, it actually can't have static inner
>>> classes anyway.) But why even have Context in there anyway, given that
>>> it does nothing beyond MapContext's implementation?
>>>
>>> We can still change the map method's signature to use MapContext.
>>>
>>> As it is, you have to write:
>>> class MyMapper extends Mapper<W, X, Y, Z> {
>>>  void map(W k, X v, Context c) {
>>>     ...
>>>  }
>>> }
>>>
>>> To make the change, you would now need to write:
>>>
>>> class MyMapper extends Mapper<W, X, Y, Z> {
>>>  void map(W k, X v, MapContext<W, X, Y, Z> c) {
>>>     ...
>>>  }
>>> }
>>>
>>> So I think the primary reason for the non-static inner member is to
>>> save you some typing. That having been said, it makes adding mock
>>> implementations of Context really difficult.
>>>
>>> The real reason this is a tricky change, though, is that this is
>>> (surprisingly, to me) an incompatible change -- even though Context
>>> subclasses MapContext, so it's a type-safe API change to widen the set
>>> of inputs we accept in map(), anyone who specified @Override on the
>>> method and uses Context as an argument will get a compiler error :\
>>> (@Override, it seems, uses the literal names even if there's a
>>> type-safe promotion to a method in a superclass.)
>>>
>>> Can we still make incompatible API changes on the new API, or is it
>>> officially frozen? If incompatible changes are allowed, I'd like to
>>> see this in. I think that in the interest of better
>>> mockability/extensibility, it'd be cleaner to ditch the inner Context
>>> class in favor of explicit use of MapContext. We know Java's type
>>> system is verbose, but that doesn't mean we should try to hack around
>>> it, if that means losing functionality. (I think I know how to get
>>> around this in MRUnit, but it'd be cleaner to not have to.)
>>>
>>> - Aaron
>>>
>>>
>>>
>>> On Wed, Jul 22, 2009 at 7:22 PM, Aaron Kimball wrote:
>>>> Both of those are good points. I'll submit a patch.
>>>> - Aaron
>>>>
>>>> On Wed, Jul 22, 2009 at 6:24 PM, Ted Dunning wrote:
>>>>> To amplify David's point, why is the argument a Mapper.Context rather than
>>>>> MapContext?
>>>>>
>>>>> Also, why is the Mapper.Context not static?
>>>>>
>>>>> On Wed, Jul 22, 2009 at 5:29 PM, David Hall  wrote:
>>>>>
>>>>>> This is nice, but doesn't it suffer from the same problem? MRUnit uses
>>>>>> the mapred API, which is deprecated, and the new API doesn't use
>>>>>> OutputCollector, but a non-static inner class.
>>>>>>
>>>>>> -- David
>>>>>>
>>>>>> On Wed, Jul 22, 2009 at 4:52 PM, Aaron Kimball wrote:
>>>>>> > Hi David,
>>>>>> >
>>>>>> > I wrote 

Re: Remote access to cluster using user as hadoop

2009-07-23 Thread Aaron Kimball
The current best practice is to firewall off your cluster, configure a
SOCKS proxy/gateway, and only allow traffic to the cluster from the
gateway. Being able to SSH into the gateway provides authentication.

See 
http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/
for a description of how we accomplished this.
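
On the client side, the two relevant knobs are the RPC socket factory and
the SOCKS server address; a minimal sketch (host and port are placeholders,
e.g. a tunnel opened with "ssh -D 1080 gateway"):

import org.apache.hadoop.conf.Configuration;

public class GatewayClientConf {
  public static Configuration socksConf() {
    Configuration conf = new Configuration();
    // Route client RPC to the cluster through the SOCKS proxy.
    conf.set("hadoop.rpc.socket.factory.class.default",
             "org.apache.hadoop.net.SocksSocketFactory");
    conf.set("hadoop.socks.server", "localhost:1080");
    return conf;
  }
}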

- Aaron

On Thu, Jul 23, 2009 at 3:23 AM, Steve Loughran wrote:
> Ted Dunning wrote:
>>
>> Last I heard, the API could be suborned in this scenario.  Real credential
>> based identity would be needed to provide more than this.
>>
>> The hack would involve a changed hadoop library that lies about identity.
>> This would not be difficult to do.
>>
>> On Wed, Jul 22, 2009 at 11:45 PM, Mathias Herberts <
>> mathias.herbe...@gmail.com> wrote:
>>
>>> You can simply set up some bastion hosts which are trusted and from
>>> which jobs can be run.
>>>
>>> Then let users connect to these hosts using a secure mechanism such as
>>> SSH
>>> keys.
>>>
>>> You can then create users/groups on those bastion hosts and have
>>> permissions on your HDFS files that use those credentials.
>>>
>
> There's no wire security, nothing to stop me pushing in packets straight to
> a datanode, saying who I claim to be.
>
> Even if you lock down access to the cluster so that I don't have direct
> access to the nodes, if I can run an MR job in the cluster, I can gain full
> administrative rights, by virtue of the fact the cluster is running my Java
> code on one of its nodes, a node which must have direct access to the rest
> of the cluster.
>
> the details are left as an exercise for the reader.
>
>
>
>
>

