Re: Extracting data from HDFS and displaying stats to a webpage

2009-07-09 Thread Usman Waheed
Thanks Christophe, Amr and Ted for your recommendations. Cheers, Usman Hey Usman, your second approach is on the right track. You don't want to have your end users interacting directly with HDFS. The latency is too high, and it wasn't designed for this. OTOH, running a script (a mapreduce,

Why HBASE??

2009-07-09 Thread Sugandha Naolekar
hello! Can you precisely tell me about why to use HBASE? Also, as my data is going to increase day by day, will I be able to search for a particular file or folder present in HDFS efficiently and in a fast manner? I have written a small java code, using the FileSystem API of hadoop and in

RE: how to use hadoop in real life?

2009-07-09 Thread Shravan Mahankali
Hi Group, I have data to be analyzed and I would like to dump this data to Hadoop from machine.X, whereas Hadoop is running on machine.Y. After dumping this data to Hadoop I would like to initiate a job, get this data analyzed, and get the output information back to machine.X. I would like to do

Building Hadoop from source

2009-07-09 Thread Harish Mallipeddi
Hi, Are there any instructions on how to build Hadoop from source? Now that the project seems to have been split into separate projects (common, hdfs, and mapreduce), there are 3 separate repositories under svn. Information on this page is no longer correct:

Re: Building Hadoop from source

2009-07-09 Thread Mafish Liu
Use 'ant jar' if you want the jar file. 2009/7/9 Harish Mallipeddi harish.mallipe...@gmail.com: Hi, Are there any instructions on how to build Hadoop from source? Now that the project seems to have been split into separate projects (common, hdfs, and mapreduce), there are 3 separate repositories

Fwd: how to compress..!

2009-07-09 Thread Sugandha Naolekar
-- Forwarded message -- From: Sugandha Naolekar sugandha@gmail.com Date: Thu, Jul 9, 2009 at 1:41 PM Subject: how to compress..! To: core-u...@hadoop.apache.org Hello! How to compress data by using the hadoop APIs? I want to write a java code to compress the core files(the
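Hadoop's compression API (org.apache.hadoop.io.compress.CompressionCodec, whose createOutputStream/createInputStream methods wrap raw streams) follows the same stream-wrapping pattern as the JDK's GZIP classes. A minimal JDK-only sketch of that roundtrip, as a stand-in for the Hadoop codec calls (the class and method names below are illustrative, not from the thread):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressDemo {
    // Compress bytes by wrapping an output stream, the same pattern
    // CompressionCodec.createOutputStream() uses in Hadoop.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(data);
        }
        return bos.toByteArray();
    }

    // Decompress by wrapping an input stream, mirroring createInputStream().
    static byte[] gunzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                bos.write(buf, 0, n);
            }
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "hello hadoop".getBytes("UTF-8");
        byte[] packed = gzip(original);
        System.out.println(new String(gunzip(packed), "UTF-8"));
    }
}
```

With a Hadoop codec you would obtain the codec from the configuration and wrap the HDFS stream the same way, rather than constructing the GZIP stream directly.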

Re: Sort by value

2009-07-09 Thread jason hadoop
The simplest way is to swap the key and value in your mapper's output, then swap them back afterward. On Thu, Jul 9, 2009 at 7:52 AM, Marcus Herou marcus.he...@tailsweep.com wrote: Hi, many times I want to sort by value instead of key. For instance when counting the top used tags in blog posts
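Against the old (0.19-era) org.apache.hadoop.mapred API, the swap Jason describes is a one-line mapper. The type parameters below (Text tag, LongWritable count) are assumptions for the tag-counting example, not from the thread:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (count, tag) instead of (tag, count) so the shuffle sorts by count.
public class SwapMapper extends MapReduceBase
        implements Mapper<Text, LongWritable, LongWritable, Text> {
    public void map(Text key, LongWritable value,
                    OutputCollector<LongWritable, Text> output, Reporter reporter)
            throws IOException {
        output.collect(value, key);  // swap: the value becomes the sort key
    }
}
```

A symmetric mapper downstream (or a simple identity reducer reading the sorted output) swaps the pair back to (tag, count).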

Re: Merging many output files from reducer

2009-07-09 Thread Pankil Doshi
Thanks a lot Jason. My copy of that book is on the way, so soon I will be able to use it. Pankil On Thu, Jul 9, 2009 at 1:54 AM, jason hadoop jason.had...@gmail.com wrote: In the example code from Pro Hadoop, there is a sample map-reduce job that uses a map-side join to merge the files into a single

Re: Lucene index creation using Hadoop

2009-07-09 Thread Ted Dunning
You don't mention what size cluster you have, but we use a relatively small cluster and index hundreds of GB in an hour to a few hours (depending on the content and the size of the cluster). So your results are anomalous. However, we wrote our own indexer. The way it works is that documents are

Sort-Merge Join using Map Reduce.

2009-07-09 Thread Pankil Doshi
Hi, Does anyone have a hint on how to implement a SORT-MERGE JOIN using the map-reduce paradigm? I read an article about it on the Pig wiki but it wasn't clear, as it doesn't present the algorithm in terms of map and reduce. Pankil

RE: Sort by value

2009-07-09 Thread Ramakishore Yelamanchilli
This approach may not combine the results as per the key to reduce function. -Original Message- From: miles...@gmail.com [mailto:miles...@gmail.com] On Behalf Of Miles Osborne Sent: Thursday, July 09, 2009 10:03 AM To: common-user@hadoop.apache.org Subject: Re: Sort by value if you have

Re: Sort-Merge Join using Map Reduce.

2009-07-09 Thread Todd Lipcon
Hi Pankil, Basically there are two steps here - the first is to sort the two files. This can be done using a mapreduce where the mapper extracts the join column as a key. If you make sure you have the same number of reducers (and partition by the equijoin column) for both sorts, then you'll end
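The requirement Todd describes can be sketched against the old JobConf driver API. The job class, reducer count, and partitioner choice below are illustrative assumptions; the point is only that both sort jobs must agree on them:

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.HashPartitioner;

// Phase 1: run the same sort job over each input, keyed on the join column.
JobConf left = new JobConf(SortLeftJob.class);        // SortLeftJob is a placeholder
left.setNumReduceTasks(16);                           // must be identical for both inputs
left.setPartitionerClass(HashPartitioner.class);      // same partitioner for both inputs
// mapper: emit (joinKey, record); identity reducer -> sorted, co-partitioned files.
//
// Phase 2: part-00000 of the left sort lines up with part-00000 of the right
// sort, so each pair of part files can be merge-joined with a streaming scan.
```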

Re: how to compress..!

2009-07-09 Thread Alex Loddengaard
A few comments before I answer: 1) Each time you send an email, we receive two emails. Is your mail client misconfigured? 2) You already asked this question in another thread :). See my response there. Short answer:

Re: Accessing static variables in map function

2009-07-09 Thread smarthrish
Hey Ram. The problem is I initialize these variables in the run function after receiving the command-line arguments. I want to access the same vars in the map function. Is there a different way other than passing the variables through a Conf object? -Hrishi - Original Message From:

Re: Accessing static variables in map function

2009-07-09 Thread Amandeep Khurana
The only way to do it as of now is through the conf object. On Thu, Jul 9, 2009 at 11:35 AM, smarthr...@yahoo.co.in wrote: Hey Ram. The problem is I initialize these variables in the run function after receiving the command-line arguments. I want to access the same vars in the map function.

Re: Sort-Merge Join using Map Reduce.

2009-07-09 Thread Todd Lipcon
Hi Pankil, Simply use the normal FileSystem APIs to open the side input. You can construct a SequenceFile.Reader from a Path and use the normal methods inside that class to do the reading of the records. -Todd On Thu, Jul 9, 2009 at 11:12 AM, Pankil Doshi forpan...@gmail.com wrote: Dear Todd,
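Todd's suggestion, sketched against the 0.19-era SequenceFile API. The path and key/value types are assumptions for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Open a side input with the plain FileSystem / SequenceFile APIs.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path("/data/side.seq"), conf);
try {
    Text key = new Text();
    Text value = new Text();
    while (reader.next(key, value)) {
        // ... merge each record against the main input ...
    }
} finally {
    reader.close();
}
```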

Hadoop: Reduce exceeding 100% - a bug?

2009-07-09 Thread Prashant Ullegaddi
Hi Jothi, We are trying to index around 245GB compressed data (~1TB uncompressed) on a 9 node Hadoop cluster with 8 slaves and 1 master. In Map, we are just parsing the files, passing the same to reduce. In Reduce, we are indexing the parsed data, much as Nutch does. When we ran the job,

RE: Accessing static variables in map function

2009-07-09 Thread Patterson, Josh
I had the same issue with getting a static member filled out, so I used the JobConf object to get variables I had stored in the run() method from the command line. With the 0.19 api, the MapReduceBase class has a public void configure( JobConf job ) method to override that will be called before

Re: How to make data available in 10 minutes.

2009-07-09 Thread Ted Dunning
You are basically re-inventing lots of capabilities that others have solved before. The idea of building an index that refers to files which are constructed by progressive merging is very standard and very similar to the way that Lucene works. You don't say how much data you are moving, but I

Re: Accessing static variables in map function

2009-07-09 Thread Ted Dunning
Use the configuration object. Remember that the outer class is replicated all across the known universe. Your command line arguments only exist on your original machine. On Thu, Jul 9, 2009 at 11:35 AM, smarthr...@yahoo.co.in wrote: Hey Ram. The problem is i initialize these variables in the

Re: Sort by value

2009-07-09 Thread Marcus Herou
Yep, figured that. On Thu, Jul 9, 2009 at 7:09 PM, Owen O'Malley omal...@apache.org wrote: You need two jobs: 1. map: line -> (line, 1); combiner/reducer: sum values, sorted by line. 2. map: (line, count) -> (count, line); reducer: (count, line) -> (line, count). So job 1 looks like word count and job 2

Re: Hadoop: Reduce exceeding 100% - a bug?

2009-07-09 Thread Aaron Kimball
Reduce tasks which require more than twenty minutes are not a problem. But you must emit some data periodically to inform the rest of the system that each reducer is still alive. Emitting a (k, v) output pair to the collector will reset the timer. Similarly, calling Reporter.incrCounter() will
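Aaron's advice, sketched as a long-running reduce against the old mapred API. The indexDocument helper and the Counters enum are hypothetical placeholders for whatever slow work the reducer does:

```java
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public void reduce(Text key, Iterator<Text> values,
                   OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
    long seen = 0;
    while (values.hasNext()) {
        indexDocument(values.next());   // hypothetical slow per-record step
        if (++seen % 1000 == 0) {
            reporter.progress();        // tells the framework the task is alive
            reporter.incrCounter(Counters.DOCS_INDEXED, 1000);  // also resets the timeout
        }
    }
}
```

Either call on its own is enough; the periodic batching just avoids paying the reporting cost on every record.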

Re: Hadoop: Reduce exceeding 100% - a bug?

2009-07-09 Thread Peter Skomoroch
I've seen this behavior before with reduces going over 100% on big jobs. What version of Hadoop are you using? I think there are some old bugs filed for this if you search the Jira. On Thu, Jul 9, 2009 at 5:31 PM, Aaron Kimball aa...@cloudera.com wrote: Reduce tasks which require more than

Re: Custom input help/debug help

2009-07-09 Thread Aaron Kimball
Hi Matthew, You can set the heap size for child jobs by calling conf.set("mapred.child.java.opts", "-Xmx1024m") to get a gig of heap space. That should fix the OOM issue in IsolationRunner. You can also change the heap size used in Eclipse; if you go to Debug Configurations, create a new
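Aaron's suggestion as a driver-side fragment (property name per the pre-0.20 configuration; MyJob is a placeholder for your job class):

```java
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyJob.class);
// 1 GB of heap for each map/reduce child JVM spawned by the tasktrackers.
conf.set("mapred.child.java.opts", "-Xmx1024m");
```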

Re: Accessing static variables in map function

2009-07-09 Thread jason hadoop
To sum up what the other writers have said: store the values you wish to share with your map tasks in the JobConf object. In the configure method of your mapper class, unpack the variables and store them in class fields of the mapper class. Then use them as needed in the map method. On
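The configure-and-unpack pattern Jason describes, as a 0.19-era sketch; the "myjob.delimiter" key and the field split are invented for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    private String delimiter;   // shared value, unpacked once per task

    // Called once before any map() call; the JobConf carries whatever
    // the driver stored with conf.set(...) in run().
    public void configure(JobConf job) {
        delimiter = job.get("myjob.delimiter", ",");
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String[] fields = value.toString().split(delimiter);
        // ... use the unpacked value in the map logic ...
    }
}
```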

Re: how to use hadoop in real life?

2009-07-09 Thread Harish Mallipeddi
Hi Shravan, By Hadoop client, I think he means the hadoop command-line program available under $HADOOP_HOME/bin. You can either write a custom Java program which directly uses the Hadoop APIs or just write a bash/python script which will invoke this command-line app and delegate work to it. -
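One way to script Harish's second option from plain Java is to build the command-line invocation and hand it to a ProcessBuilder. The jar name, main class, and HDFS paths below are invented for illustration, and actually launching requires a configured hadoop client on the PATH:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HadoopLauncher {
    // Build the argv for: hadoop jar <jar> <mainClass> <args...>
    static List<String> hadoopCommand(String jar, String mainClass, String... args) {
        List<String> cmd = new ArrayList<String>(
                Arrays.asList("hadoop", "jar", jar, mainClass));
        cmd.addAll(Arrays.asList(args));
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        List<String> cmd = hadoopCommand("analysis.jar", "com.example.AnalyzeJob",
                                         "/user/x/input", "/user/x/output");
        System.out.println(cmd);
        // To actually submit from machine.X (needs the hadoop client installed):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```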