Re: Are hadoop fs commands serial or parallel

2011-05-18 Thread Mapred Learn
Thanks Harsh!
That means basically both the APIs and the hadoop client commands allow only 
serial writes.
I was wondering what could be other ways to write data in parallel to HDFS 
other than using multiple parallel threads.

Thanks,
JJ

Sent from my iPhone

On May 17, 2011, at 10:59 PM, Harsh J ha...@cloudera.com wrote:

 Hello,
 
 Adding to Joey's response, copyFromLocal's current implementation is serial
 given a list of files.
 
 On Wed, May 18, 2011 at 9:57 AM, Mapred Learn mapred.le...@gmail.com
 wrote:
 Thanks Joey !
 I will try to find out abt copyFromLocal. Looks like Hadoop Apis write
 serially as you pointed out.
 
 Thanks,
 -JJ
 
 On May 17, 2011, at 8:32 PM, Joey Echeverria j...@cloudera.com wrote:
 
 The sequence file writer definitely does it serially as you can only
 ever write to the end of a file in Hadoop.
 
 Doing copyFromLocal could write multiple files in parallel (I'm not
 sure if it does or not), but a single file would be written serially.
 
 -Joey
 
 On Tue, May 17, 2011 at 5:44 PM, Mapred Learn mapred.le...@gmail.com
 wrote:
 Hi,
 My question is when I run a command from hdfs client, for eg. hadoop fs
 -copyFromLocal or create a sequence file writer in java code and append
 key/values to it through Hadoop APIs, does it internally transfer/write
 data
 to HDFS serially or in parallel ?
 
 Thanks in advance,
 -JJ
 
 
 
 
 --
 Joseph Echeverria
 Cloudera, Inc.
 443.305.9434
 
 
 -- 
 Harsh J


Re: Error in starting tasktracker

2011-05-18 Thread Subhramanian, Deepak
Even though the log says the
folder file:/usr/lib/hadoop-0.20/logs/history/done is created, I cannot see
the folder in the directory.  Is that the root cause of the error? Any
thoughts?


2011-05-18 09:18:23,177 INFO org.apache.hadoop.http.HttpServer: Jetty bound
to port 50030
2011-05-18 09:18:23,177 INFO org.mortbay.log: jetty-6.1.26
2011-05-18 09:18:24,574 INFO org.mortbay.log: Started
SelectChannelConnector@0.0.0.0:50030
2011-05-18 09:18:24,575 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=JobTracker, sessionId=
2011-05-18 09:18:24,576 INFO org.apache.hadoop.mapred.JobTracker: JobTracker
up at: 8021
2011-05-18 09:18:24,576 INFO org.apache.hadoop.mapred.JobTracker: JobTracker
webserver: 50030
2011-05-18 09:18:25,970 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: localhost/127.0.0.1:8020. Already tried 0 time(s).
2011-05-18 09:18:27,013 INFO org.apache.hadoop.mapred.JobTracker: Cleaning
up the system directory
2011-05-18 09:18:27,355 INFO org.apache.hadoop.mapred.JobHistory: Creating
DONE folder at file:/usr/lib/hadoop-0.20/logs/history/done
2011-05-18 09:18:27,450 WARN org.apache.hadoop.util.NativeCodeLoader: Unable
to load native-hadoop library for your platform... using builtin-java
classes where applicable
2011-05-18 09:18:27,680 WARN org.apache.hadoop.mapred.JobTracker: Error
starting tracker: org.apache.hadoop.util.Shell$ExitCodeException: chmod:
cannot access `/var/log/hadoop-0.20/history/done': No such file or directory

at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
at org.apache.hadoop.util.Shell.run(Shell.java:182)
at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)




On 17 May 2011 18:32, Subhramanian, Deepak 
deepak.subhraman...@newsint.co.uk wrote:

 I reinstalled everything and am able to start everything other than the
 jobtracker. The jobtracker still gives a port-in-use error even though I verified
 with netstat that the port is not in use.

 ipedited:/usr/lib/hadoop-0.20/logs/history # /usr/java/jdk1.6.0_25/bin/jps
 7435 SecondaryNameNode
 7517 TaskTracker
 7361 NameNode
 7632 Jps
 1872 Bootstrap
 7221 DataNode


 2011-05-17 17:25:10,277 INFO org.apache.hadoop.mapred.JobTracker:
 STARTUP_MSG:
 /
 STARTUP_MSG: Starting JobTracker
 STARTUP_MSG:   host = ipedited
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 0.20.2-cdh3u0
 STARTUP_MSG:   build =  -r 81256ad0f2e4ab2bd34b04f53d25a6c23686dd14;
 compiled by 'hudson' on Fri Mar 25 20:19:33 PDT 2011
 /
 2011-05-17 17:25:10,895 INFO
 org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
 Updating the current master key for generating delegation tokens
 2011-05-17 17:25:10,897 INFO org.apache.hadoop.mapred.JobTracker: Scheduler
 configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT,
 limitMaxMemForMapTasks, limitMaxMemForReduceTasks) (-1, -1, -1, -1)
 2011-05-17 17:25:10,899 INFO org.apache.hadoop.util.HostsFileReader:
 Refreshing hosts (include/exclude) list
 2011-05-17 17:25:10,961 INFO
 org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
 Starting expired delegation token remover thread, tokenRemoverScan
 Interval=60 min(s)
 2011-05-17 17:25:10,961 INFO
 org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
 Updating the current master key for generating delegation tokens
 2011-05-17 17:25:11,106 INFO org.apache.hadoop.mapred.JobTracker: Starting
 jobtracker with owner as mapred
 2011-05-17 17:25:11,211 INFO org.apache.hadoop.ipc.Server: Starting Socket
 Reader #1 for port 8021
 2011-05-17 17:25:11,211 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
 Initializing RPC Metrics with hostName=JobTracker, port=8021
 2011-05-17 17:25:11,215 INFO
 org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC Metrics
 with hostName=JobTracker, port=8021
 2011-05-17 17:25:11,275 INFO org.mortbay.log: Logging to
 org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
 org.mortbay.log.Slf4jLog
 2011-05-17 17:25:11,407 INFO org.apache.hadoop.http.HttpServer: Added
 global filter 'safety'
 (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
 2011-05-17 17:25:11,439 INFO org.apache.hadoop.http.HttpServer: Port
 returned by webServer.getConnectors()[0].getLocalPort() before open() is -1.
 Opening the listener on 50030
 2011-05-17 17:25:11,440 INFO org.apache.hadoop.http.HttpServer:
 listener.getLocalPort() returned 50030
 webServer.getConnectors()[0].getLocalPort() returned 50030
 2011-05-17 17:25:11,440 INFO org.apache.hadoop.http.HttpServer: Jetty bound
 to port 50030
 2011-05-17 17:25:11,440 INFO org.mortbay.log: jetty-6.1.26
 2011-05-17 17:25:12,504 INFO org.mortbay.log: 

metrics2 ganglia monitoring

2011-05-18 Thread Eric Berkowitz
We have a 2 rack hadoop cluster with ganglia 3.0 monitoring on all stations 
both on the native os and within hadoop.

We want to upgrade to hadoop 0.20.203, but with the migration to metrics2 
we need help configuring the metrics to continue ganglia monitoring.

All tasktrackers/datanodes push unicast udp upstream to a central gmond daemon 
on their rack that is then polled by a single gmetad daemon for the cluster.

The current metrics files include entries similar to the following for all 
contexts:

# Configuration of the dfs context for ganglia
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=25
dfs.servers=xsrv1.cs.roosevelt.edu:8670

We have reviewed the package documentation for metrics2 but the examples and 
explanation are not helpful.  Any assistance in the proper configuration of 
hadoop-metrics2.properties file to support our current ganglia configuration 
would be appreciated.

Eric

Re: How do you run HPROF locally?

2011-05-18 Thread Brush,Ryan
Not familiar with this setup, but I assume this is using the local
runner, which simply launches the job in the same process as your program.
Therefore no new JVMs are spun up, so the hprof settings in the
configuration never apply.


The simplest way to fix this is probably to just set the -agentlib:...
directly on the JVM of your local program, which will include the
Map/Reduce processing in that process.

On 5/17/11 6:51 PM, Mark question markq2...@gmail.com wrote:

or conf.setBoolean("mapred.task.profile", true);

Mark

On Tue, May 17, 2011 at 4:49 PM, Mark question markq2...@gmail.com
wrote:

 I usually do this setting inside my java program (in run function) as
 follows:

 JobConf conf = new JobConf(this.getConf(), My.class);
 conf.set("mapred.task.profile", "true");

 then I'll see some output files in that same working directory.

 Hope that helps,
 Mark


 On Tue, May 17, 2011 at 4:07 PM, W.P. McNeill bill...@gmail.com wrote:

 I am running a Hadoop Java program in local single-JVM mode via an IDE
 (IntelliJ).  I want to do performance profiling of it.  Following the
 instructions in chapter 5 of *Hadoop: the Definitive Guide*, I added
the
 following properties to my job configuration file.


  <property>
    <name>mapred.task.profile</name>
    <value>true</value>
  </property>

  <property>
    <name>mapred.task.profile.params</name>
    <value>-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value>
  </property>

  <property>
    <name>mapred.task.profile.maps</name>
    <value>0-</value>
  </property>

  <property>
    <name>mapred.task.profile.reduces</name>
    <value>0-</value>
  </property>


 With these properties, the job runs as before, but I don't see any
 profiler
 output.

 I also tried simply setting


  <property>
    <name>mapred.child.java.opts</name>
    <value>-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s</value>
  </property>


 Again, no profiler output.

 I know I have HPROF installed because running java
-agentlib:hprof=help
 at
 the command prompt produces a result.

 Is it possible to run HPROF on a local Hadoop job?  Am I doing
something
 wrong?






Re: Error in starting tasktracker

2011-05-18 Thread Harsh J
Hello Deepak,

Since your problems appear to be more related to Cloudera's
distribution including Apache Hadoop, I'm moving the mail discussion
to the cdh-u...@cloudera.org list. I've bcc'd
common-user@hadoop.apache.org. Please continue the discussion on
cdh-u...@cloudera.org for this.

The folder issue might just be the reason the JT fails to start with
such a status (I believe I've seen it happen for a TT once). Ensure
that the permissions are set right for the logs folder (I think it
ought to be rwx-rwx-r-x). Is this a tarball or a package install?
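
For reference, a rough sketch of the fix on a package install (the path comes
from the chmod error in your log; the mapred:hadoop owner/group and the
/usr/lib/hadoop-0.20/logs symlink to /var/log/hadoop-0.20 are assumptions about
how the CDH packages lay things out, so adjust to your system):

sudo mkdir -p /var/log/hadoop-0.20/history/done
# assumption: the JobTracker runs as user 'mapred' in group 'hadoop'
sudo chown -R mapred:hadoop /var/log/hadoop-0.20/history
sudo chmod -R 775 /var/log/hadoop-0.20/history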

On Wed, May 18, 2011 at 3:14 PM, Subhramanian, Deepak
deepak.subhraman...@newsint.co.uk wrote:
 Even though the log says the
 folder file:/usr/lib/hadoop-0.20/logs/history/done is created, I cannot see
 the folder in the directory.  Is that the root cause of the  error . Any
 thoughts ?


 2011-05-18 09:18:23,177 INFO org.apache.hadoop.http.HttpServer: Jetty bound
 to port 50030
 2011-05-18 09:18:23,177 INFO org.mortbay.log: jetty-6.1.26
 2011-05-18 09:18:24,574 INFO org.mortbay.log: Started
 SelectChannelConnector@0.0.0.0:50030
 2011-05-18 09:18:24,575 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
 Initializing JVM Metrics with processName=JobTracker, sessionId=
 2011-05-18 09:18:24,576 INFO org.apache.hadoop.mapred.JobTracker: JobTracker
 up at: 8021
 2011-05-18 09:18:24,576 INFO org.apache.hadoop.mapred.JobTracker: JobTracker
 webserver: 50030
 2011-05-18 09:18:25,970 INFO org.apache.hadoop.ipc.Client: Retrying connect
 to server: localhost/127.0.0.1:8020. Already tried 0 time(s).
 2011-05-18 09:18:27,013 INFO org.apache.hadoop.mapred.JobTracker: Cleaning
 up the system directory
 2011-05-18 09:18:27,355 INFO org.apache.hadoop.mapred.JobHistory: Creating
 DONE folder at file:/usr/lib/hadoop-0.20/logs/history/done
 2011-05-18 09:18:27,450 WARN org.apache.hadoop.util.NativeCodeLoader: Unable
 to load native-hadoop library for your platform... using builtin-java
 classes where applicable
 2011-05-18 09:18:27,680 WARN org.apache.hadoop.mapred.JobTracker: Error
 starting tracker: org.apache.hadoop.util.Shell$ExitCodeException: chmod:
 cannot access `/var/log/hadoop-0
 .20/history/done': No such file or directory

        at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
        at org.apache.hadoop.util.Shell.run(Shell.java:182)
        at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
        at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)




 On 17 May 2011 18:32, Subhramanian, Deepak 
 deepak.subhraman...@newsint.co.uk wrote:

 I reinstalled everything and am able to start everything other than the
 jobtracker. Jobtracker still gives the port in use even though I verified
 that the port is not running using netstat.

 ipedited:/usr/lib/hadoop-0.20/logs/history # /usr/java/jdk1.6.0_25/bin/jps
 7435 SecondaryNameNode
 7517 TaskTracker
 7361 NameNode
 7632 Jps
 1872 Bootstrap
 7221 DataNode


 2011-05-17 17:25:10,277 INFO org.apache.hadoop.mapred.JobTracker:
 STARTUP_MSG:
 /
 STARTUP_MSG: Starting JobTracker
 STARTUP_MSG:   host = ipedited
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 0.20.2-cdh3u0
 STARTUP_MSG:   build =  -r 81256ad0f2e4ab2bd34b04f53d25a6c23686dd14;
 compiled by 'hudson' on Fri Mar 25 20:19:33 PDT 2011
 /
 2011-05-17 17:25:10,895 INFO
 org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
 Updating the current master key for generating delegation tokens
 2011-05-17 17:25:10,897 INFO org.apache.hadoop.mapred.JobTracker: Scheduler
 configured with (memSizeForMapSlotOnJT, memSizeForReduceSlotOnJT,
 limitMaxMemForMapTasks, limitMaxMem
 ForReduceTasks) (-1, -1, -1, -1)
 2011-05-17 17:25:10,899 INFO org.apache.hadoop.util.HostsFileReader:
 Refreshing hosts (include/exclude) list
 2011-05-17 17:25:10,961 INFO
 org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
 Starting expired delegation token remover thread, tokenRemoverScan
 Interval=60 min(s)
 2011-05-17 17:25:10,961 INFO
 org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager:
 Updating the current master key for generating delegation tokens
 2011-05-17 17:25:11,106 INFO org.apache.hadoop.mapred.JobTracker: Starting
 jobtracker with owner as mapred
 2011-05-17 17:25:11,211 INFO org.apache.hadoop.ipc.Server: Starting Socket
 Reader #1 for port 8021
 2011-05-17 17:25:11,211 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
 Initializing RPC Metrics with hostName=JobTracker, port=8021
 2011-05-17 17:25:11,215 INFO
 org.apache.hadoop.ipc.metrics.RpcDetailedMetrics: Initializing RPC Metrics
 with hostName=JobTracker, port=8021
 2011-05-17 17:25:11,275 INFO org.mortbay.log: Logging to
 org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
 org.mortbay.log.Slf4jLog
 2011-05-17 

Re: Exception in thread AWT-EventQueue-0 java.lang.NullPointerException

2011-05-18 Thread Lạc Trung
@Steve Loughran: That's the way I repaired my program.
@Robert Evans
@Harsh J: Thanks for your reply. I'm so happy.

Thanks so much.


Re: Are hadoop fs commands serial or parallel

2011-05-18 Thread Patrick Angeles
kinda clunky but you could do this via shell:

for FILE in $LIST_OF_FILES ; do
  hadoop fs -copyFromLocal $FILE $DEST_PATH &   # background each copy
done
wait  # block until all the copies finish

If doing this via the Java API, then, yes you will have to use multiple
threads.
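
For completeness, a rough Java sketch of the multi-threaded approach (the class
name, thread count and the /user/jj/dest path are made up; error handling is
kept minimal):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelCopy {
  public static void main(String[] args) throws Exception {
    final Configuration conf = new Configuration();
    final FileSystem fs = FileSystem.get(conf);
    // Each individual file is still written serially by HDFS; the parallelism
    // comes from copying several files at the same time.
    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (final String local : args) {
      pool.submit(new Runnable() {
        public void run() {
          try {
            fs.copyFromLocalFile(new Path(local), new Path("/user/jj/dest"));
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();  // previously submitted copies finish, then the pool exits
  }
}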

On Wed, May 18, 2011 at 1:04 AM, Mapred Learn mapred.le...@gmail.comwrote:

 Thanks harsh !
 That means basically both APIs as well as hadoop client commands allow only
 serial writes.
 I was wondering what could be other ways to write data in parallel to HDFS
 other than using multiple parallel threads.

 Thanks,
 JJ

 Sent from my iPhone

 On May 17, 2011, at 10:59 PM, Harsh J ha...@cloudera.com wrote:

  Hello,
 
  Adding to Joey's response, copyFromLocal's current implementation is
 serial
  given a list of files.
 
  On Wed, May 18, 2011 at 9:57 AM, Mapred Learn mapred.le...@gmail.com
  wrote:
  Thanks Joey !
  I will try to find out abt copyFromLocal. Looks like Hadoop Apis write
  serially as you pointed out.
 
  Thanks,
  -JJ
 
  On May 17, 2011, at 8:32 PM, Joey Echeverria j...@cloudera.com wrote:
 
  The sequence file writer definitely does it serially as you can only
  ever write to the end of a file in Hadoop.
 
  Doing copyFromLocal could write multiple files in parallel (I'm not
  sure if it does or not), but a single file would be written serially.
 
  -Joey
 
  On Tue, May 17, 2011 at 5:44 PM, Mapred Learn mapred.le...@gmail.com
  wrote:
  Hi,
  My question is when I run a command from hdfs client, for eg. hadoop
 fs
  -copyFromLocal or create a sequence file writer in java code and
 append
  key/values to it through Hadoop APIs, does it internally
 transfer/write
  data
  to HDFS serially or in parallel ?
 
  Thanks in advance,
  -JJ
 
 
 
 
  --
  Joseph Echeverria
  Cloudera, Inc.
  443.305.9434
 
 
  --
  Harsh J



Hadoop and WikiLeaks

2011-05-18 Thread Edward Capriolo
http://hadoop.apache.org/#What+Is+Apache%E2%84%A2+Hadoop%E2%84%A2%3F

March 2011 - Apache Hadoop takes top prize at Media Guardian Innovation
Awards

The Hadoop project won the "innovator of the year" award from the UK's
Guardian newspaper, where it was described as having "the potential as a
greater catalyst for innovation than other nominees including WikiLeaks and
the iPad".

Does this copy text bother anyone else? Sure winning any award is great but
does hadoop want to be associated with innovation like WikiLeaks?

Edward


Re: Are hadoop fs commands serial or parallel

2011-05-18 Thread Mapred Learn
Thanks Patrick !
This would work if a directory is to be uploaded, but for streaming data, I guess, this 
would not work.

Sent from my iPhone

On May 18, 2011, at 9:39 AM, Patrick Angeles patr...@cloudera.com wrote:

 kinda clunky but you could do this via shell:
 
 for $FILE in $LIST_OF_FILES ; do
  hadoop fs -copyFromLocal $FILE $DEST_PATH 
 done
 
 If doing this via the Java API, then, yes you will have to use multiple
 threads.
 
 On Wed, May 18, 2011 at 1:04 AM, Mapred Learn mapred.le...@gmail.comwrote:
 
 Thanks harsh !
 That means basically both APIs as well as hadoop client commands allow only
 serial writes.
 I was wondering what could be other ways to write data in parallel to HDFS
 other than using multiple parallel threads.
 
 Thanks,
 JJ
 
 Sent from my iPhone
 
 On May 17, 2011, at 10:59 PM, Harsh J ha...@cloudera.com wrote:
 
 Hello,
 
 Adding to Joey's response, copyFromLocal's current implementation is
 serial
 given a list of files.
 
 On Wed, May 18, 2011 at 9:57 AM, Mapred Learn mapred.le...@gmail.com
 wrote:
 Thanks Joey !
 I will try to find out abt copyFromLocal. Looks like Hadoop Apis write
 serially as you pointed out.
 
 Thanks,
 -JJ
 
 On May 17, 2011, at 8:32 PM, Joey Echeverria j...@cloudera.com wrote:
 
 The sequence file writer definitely does it serially as you can only
 ever write to the end of a file in Hadoop.
 
 Doing copyFromLocal could write multiple files in parallel (I'm not
 sure if it does or not), but a single file would be written serially.
 
 -Joey
 
 On Tue, May 17, 2011 at 5:44 PM, Mapred Learn mapred.le...@gmail.com
 wrote:
 Hi,
 My question is when I run a command from hdfs client, for eg. hadoop
 fs
 -copyFromLocal or create a sequence file writer in java code and
 append
 key/values to it through Hadoop APIs, does it internally
 transfer/write
 data
 to HDFS serially or in parallel ?
 
 Thanks in advance,
 -JJ
 
 
 
 
 --
 Joseph Echeverria
 Cloudera, Inc.
 443.305.9434
 
 
 --
 Harsh J
 


Re: Hadoop and WikiLeaks

2011-05-18 Thread javamann
Yes!

-Pete

 Edward Capriolo edlinuxg...@gmail.com wrote: 

=
http://hadoop.apache.org/#What+Is+Apache%E2%84%A2+Hadoop%E2%84%A2%3F

March 2011 - Apache Hadoop takes top prize at Media Guardian Innovation
Awards

The Hadoop project won the innovator of the yearaward from the UK's
Guardian newspaper, where it was described as had the potential as a
greater catalyst for innovation than other nominees including WikiLeaks and
the iPad.

Does this copy text bother anyone else? Sure winning any award is great but
does hadoop want to be associated with innovation like WikiLeaks?

Edward

--

1. If a man is standing in the middle of the forest talking, and there is no 
woman around to hear him, is he still wrong?

2. Behind every great woman... Is a man checking out her ass

3. I am not a member of any organized political party. I am a Democrat.*

4. Diplomacy is the art of saying Nice doggie until you can find a rock.*

5. A process is what you need when all your good people have left.


*Will Rogers




Re: Hadoop and WikiLeaks

2011-05-18 Thread Konstantin Boudnik
You are, perhaps, aware that now your name will be associated with
WikiLeaks too because this mailing list is archived and publicly
searchable? I think you are a hero, man!
--
  Take care,
Konstantin (Cos) Boudnik
2CAC 8312 4870 D885 8616  6115 220F 6980 1F27 E622

Disclaimer: Opinions expressed in this email are those of the author,
and do not necessarily represent the views of any company the author
might be affiliated with at the moment of writing.



On Wed, May 18, 2011 at 09:53, Edward Capriolo edlinuxg...@gmail.com wrote:
 http://hadoop.apache.org/#What+Is+Apache%E2%84%A2+Hadoop%E2%84%A2%3F

 March 2011 - Apache Hadoop takes top prize at Media Guardian Innovation
 Awards

 The Hadoop project won the innovator of the yearaward from the UK's
 Guardian newspaper, where it was described as had the potential as a
 greater catalyst for innovation than other nominees including WikiLeaks and
 the iPad.

 Does this copy text bother anyone else? Sure winning any award is great but
 does hadoop want to be associated with innovation like WikiLeaks?

 Edward



Re: How do you run HPROF locally?

2011-05-18 Thread W.P. McNeill
Ryan Brush had the right answer.  If I add the following VM parameter

-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=prof.output


a profiling file called prof.output gets created in my working directory.
 Because I'm running locally, both the mapred.task.profile* options and
mapred.child.java.opts are ignored.

This makes sense now.  Thanks.

On Wed, May 18, 2011 at 6:34 AM, Brush,Ryan rbr...@cerner.com wrote:

 Not familiar with this setup, but I assume this is using the local
 runner, which simply launches the job in the same process as your program.
 Therefore no new JVMs are spun up, so the hprof settings in the
 configuration never apply.


 The simplest way to fix this is probably to just set the -agentlib:...
 directly on the JVM of your local program, which will include the
 Map/Reduce processing in that process.

 On 5/17/11 6:51 PM, Mark question markq2...@gmail.com wrote:

 or conf.setBoolean(mapred.task.profile, true);
 
 Mark
 
 On Tue, May 17, 2011 at 4:49 PM, Mark question markq2...@gmail.com
 wrote:
 
  I usually do this setting inside my java program (in run function) as
  follows:
 
  JobConf conf = new JobConf(this.getConf(),My.class);
  conf.set(*mapred*.task.*profile*, true);
 
  then I'll see some output files in that same working directory.
 
  Hope that helps,
  Mark
 
 
  On Tue, May 17, 2011 at 4:07 PM, W.P. McNeill bill...@gmail.com
 wrote:
 
  I am running a Hadoop Java program in local single-JVM mode via an IDE
  (IntelliJ).  I want to do performance profiling of it.  Following the
  instructions in chapter 5 of *Hadoop: the Definitive Guide*, I added
 the
  following properties to my job configuration file.
 
 
   property
 namemapred.task.profile/name
 valuetrue/value
   /property
 
   property
 namemapred.task.profile.params/name
 
 
 
 value-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,v
 erbose=n,file=%s/value
   /property
 
   property
 namemapred.task.profile.maps/name
 value0-/value
   /property
 
   property
 namemapred.task.profile.reduces/name
 value0-/value
   /property
 
 
  With these properties, the job runs as before, but I don't see any
  profiler
  output.
 
  I also tried simply setting
 
 
   property
 namemapred.child.java.opts/name
 
 
 
 value-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,v
 erbose=n,file=%s/value
   /property
 
 
  Again, no profiler output.
 
  I know I have HPROF installed because running java
 -agentlib:hprof=help
  at
  the command prompt produces a result.
 
  Is is possible to run HPROF on a local Hadoop job?  Am I doing
 something
  wrong?
 
 
 




current line number as key?

2011-05-18 Thread bnonymous

Hello,

I'm trying to pick up certain lines of a text file (say the 1st and 110th lines of
a file with 10^10 lines). I need an InputFormat which gives the Mapper the line
number as the key. 

I tried to implement RecordReader, but I can't get line information from
InputSplit.

Any solution to this???

Thanks in advance!!!
-- 
View this message in context: 
http://old.nabble.com/current-line-number-as-key--tp31649694p31649694.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: current line number as key?

2011-05-18 Thread Alexandra Anghelescu
Hi,

It is hard to pick up certain lines of a text file - globally, I mean.
Remember that the file is split according to its size (byte boundaries), not
lines, so it is possible to keep track of the lines inside a split, but
globally for the whole file, assuming it is split among map tasks... I don't
think it is possible. I am new to hadoop, but that is my take on it.

Alexandra

On Wed, May 18, 2011 at 2:41 PM, bnonymous libei.t...@gmail.com wrote:


 Hello,

 I'm trying to pick up certain lines of a text file. (say 1st, 110th line of
 a file with 10^10 lines). I need a InputFormat which gives the Mapper line
 number as the key.

 I tried to implement RecordReader, but I can't get line information from
 InputSplit.

 Any solution to this???

 Thanks in advance!!!
 --
 View this message in context:
 http://old.nabble.com/current-line-number-as-key--tp31649694p31649694.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: metrics2 ganglia monitoring

2011-05-18 Thread Luke Lu
The Ganglia plugin is not yet ported to metrics v2 (because Y! doesn't use
Ganglia; see also the discussion links on HADOOP-6728). It shouldn't
be hard to do a port though, as the new sink interface is actually
simpler.
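
For what it's worth, once a Ganglia sink is written it would be wired up through
hadoop-metrics2.properties using the [prefix].sink.[instance].[option] syntax. A
minimal sketch using the built-in FileSink (the 'ganglia' lines are hypothetical
placeholders until the port exists; 'jobtracker' is just one example prefix):

# write JobTracker metrics to a local file every 25 seconds
*.period=25
jobtracker.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
jobtracker.sink.file.filename=jobtracker-metrics.out

# hypothetical, once a metrics2 Ganglia sink is available:
# jobtracker.sink.ganglia.class=<ganglia sink class>
# jobtracker.sink.ganglia.servers=xsrv1.cs.roosevelt.edu:8670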

On Wed, May 18, 2011 at 4:07 AM, Eric Berkowitz
eberkow...@roosevelt.edu wrote:
 We have a 2 rack hadoop cluster with ganglia 3.0 monitoring on all stations 
 both on the native os and within hadoop.

 We want to upgrade to the to hadoop 20.203 but with the migration to metrics2 
 we need help configuring the metrics to continue ganglia monitoring.

 All tasktrackers/datanodes push unicast udp upstream to a central gmond 
 daemon on their rack that is then polled by a single gmetad daemon for the 
 cluster.

 The current metrics files includes entries similar to the following for all 
 contexts:

 Configuration of the dfs context for ganglia
 dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
 dfs.period=25
 dfs.servers=xsrv1.cs.roosevelt.edu:8670

 We have reviewed the package documentation for metrics2 but the examples and 
 explanation are not helpful.  Any assistance in the proper configuration of
 hadoop-metrics2.properties file to support our current ganglia configuration 
 would be appreciated.

 Eric


Re: current line number as key?

2011-05-18 Thread Jonathan Coveney
To the best of my knowledge, the only way to do this is if you have
fixed-width records.

Think about it this way: as Alexandra mentioned, you only get byte
offsets. If you split 1 file among 50 mappers, they have the offset, but
they have no idea what that offset means with respect to the rest of the
file, as they do not know how many lines came before. Finding lines
inherently involves a full scan, unless a) the width is fixed or b) you do a
job beforehand to explicitly put the line number in the document.

I would think about what you want to do, and whether or not it is possible
to avoid making it line dependent, or if you can make each row a fixed
number of bytes...

2011/5/18 Alexandra Anghelescu axanghele...@gmail.com

 Hi,

 It is hard to pick up certain lines of a text file - globally I mean.
 Remember that the file is split according to its size (byte boundries) not
 lines.,, so, it is possible to keep track of the lines inside a split, but
 globally for the whole file, assuming it is split among map tasks... i
 don't
 think it is possible.. I am new to hadoop, but that is my take on it.

 Alexandra

 On Wed, May 18, 2011 at 2:41 PM, bnonymous libei.t...@gmail.com wrote:

 
  Hello,
 
  I'm trying to pick up certain lines of a text file. (say 1st, 110th line
 of
  a file with 10^10 lines). I need a InputFormat which gives the Mapper
 line
  number as the key.
 
  I tried to implement RecordReader, but I can't get line information from
  InputSplit.
 
  Any solution to this???
 
  Thanks in advance!!!
  --
  View this message in context:
 
 http://old.nabble.com/current-line-number-as-key--tp31649694p31649694.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 



Re: current line number as key?

2011-05-18 Thread Robert Evans
You are correct, that there is no easy and efficient way to do this.

You could create a new InputFormat that derives from FileInputFormat that makes 
it so the files do not split, and then have a RecordReader that keeps track of 
line numbers.  But then each file is read by only one mapper.

Alternatively you could assume that the split is going to be done 
deterministically and do two passes: one where you count the number of lines in 
each partition, and a second that then assigns the line numbers based off of the 
output from the first.  But that requires two map passes.
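
A rough sketch of the first option, using the old org.apache.hadoop.mapred API
that the rest of this thread assumes (class name made up, untested):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

public class LineNumberInputFormat extends TextInputFormat {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;  // one mapper per file, so the line count stays consistent
  }

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    final RecordReader<LongWritable, Text> base =
        super.getRecordReader(split, job, reporter);
    return new RecordReader<LongWritable, Text>() {
      private long lineNo = 0;

      public boolean next(LongWritable key, Text value) throws IOException {
        // The wrapped LineRecordReader sets the key to the byte offset;
        // overwrite it with a running line number instead.
        if (!base.next(key, value)) {
          return false;
        }
        key.set(++lineNo);
        return true;
      }

      public LongWritable createKey() { return base.createKey(); }
      public Text createValue() { return base.createValue(); }
      public long getPos() throws IOException { return base.getPos(); }
      public float getProgress() throws IOException { return base.getProgress(); }
      public void close() throws IOException { base.close(); }
    };
  }
}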

--Bobby Evans


On 5/18/11 1:53 PM, Alexandra Anghelescu axanghele...@gmail.com wrote:

Hi,

It is hard to pick up certain lines of a text file - globally I mean.
Remember that the file is split according to its size (byte boundries) not
lines.,, so, it is possible to keep track of the lines inside a split, but
globally for the whole file, assuming it is split among map tasks... i don't
think it is possible.. I am new to hadoop, but that is my take on it.

Alexandra

On Wed, May 18, 2011 at 2:41 PM, bnonymous libei.t...@gmail.com wrote:


 Hello,

 I'm trying to pick up certain lines of a text file. (say 1st, 110th line of
 a file with 10^10 lines). I need a InputFormat which gives the Mapper line
 number as the key.

 I tried to implement RecordReader, but I can't get line information from
 InputSplit.

 Any solution to this???

 Thanks in advance!!!
 --
 View this message in context:
 http://old.nabble.com/current-line-number-as-key--tp31649694p31649694.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.





Distcp failure - Server returned HTTP response code: 500

2011-05-18 Thread sonia gehlot
Hi Guys

I am trying to copy hadoop data from one cluster to another but I keep
getting this error: "Server returned HTTP response code: 500 for URL"
My distcp command is:
scripts/hadoop.sh distcp
hftp://c13-hadoop1-nn-r0-n1:50070/user/dwadmin/live/data/warehouse/facts/page_events/
*day=2011-05-17* hdfs://phx1-rb-dev40-pipe1.cnet.com:9000/user/sgehlot

In here I have day=2011-05-17 in my file path

I found this online:  https://issues.apache.org/jira/browse/HDFS-31

Does this issue still exist? Could this be the reason for my job
failure?

Job Error log:

2011-05-18 11:34:56,505 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=MAP, sessionId=
2011-05-18 11:34:56,713 INFO org.apache.hadoop.mapred.MapTask:
numReduceTasks: 0
2011-05-18 11:34:57,039 INFO org.apache.hadoop.tools.DistCp: FAIL
day=2011-05-17/_logs/history/c13-hadoop1-nn-r0-n1_1291919715221_job_201012091035_41977_dwadmin_CopyFactsToHive%3A+page_events+day%3D2011-05-17
: java.io.IOException: *Server returned HTTP response code: 500 for URL*:
http://c13-hadoop1-wkr-r10-n4.cnet.com:50075/streamFile?filename=/user/dwadmin/live/data/warehouse/facts/page_events/day=2011-05-17/_logs/history/c13-hadoop1-nn-r0-n1_1291919715221_job_201012091035_41977_dwadmin_CopyFactsToHive%253A+page_events+day%253D2011-05-17ugi=sgehlot,user
 at
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1313)
at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:157)
 at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:398)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:410)
 at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:537)
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:306)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

2011-05-18 11:35:06,118 WARN org.apache.hadoop.mapred.TaskTracker: Error
running child
java.io.IOException: Copied: 0 Skipped: 5 Failed: 1
at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:572)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
2011-05-18 11:35:06,124 INFO org.apache.hadoop.mapred.TaskRunner: Runnning
cleanup for the task

Any help is appreciated.

Thanks,
Sonia


Reducer granularity and starvation

2011-05-18 Thread W.P. McNeill
I'm working on a cluster with 360 reducer slots.  I've got a big job, so
when I launch it I follow the recommendations in the Hadoop documentation
and set mapred.reduce.tasks=350, i.e. slightly less than the available
number of slots.

The problem is that my reducers can still take a long time (2-4 hours) to
run.  So I end up grabbing a big slab of reducers and starving everybody
else out.  I've got my priority set to VERY_LOW and
mapred.reduce.slowstart.completed.maps to 0.9, so I think I've done
everything I can do on the job parameters front.  Currently there isn't a
way to make the individual reducers run faster, so I'm trying to figure out
the best way to run my job so that it plays nice with other users of the
cluster.  My rule of thumb has always been to not try and do any scheduling
myself, but let Hadoop handle it for me, but I don't think that works in
this scenario.

Questions:

   1. Am I correct in thinking that long reducer times just mess up Hadoop's
    scheduling granularity to a degree that it can't handle?  Is a 4-hour reducer
   outside the normal operating range of Hadoop?
   2. Is there any way to stagger task launches?  (Aside from manually.)
   3. What if I set mapred.reduce.tasks to be some value much, much larger
   than the number of available reducer slots, like 100,000.  Will that make
   the amount of work sent to each reducer smaller (hence increasing the
   scheduler granularity) or will it have no effect?
   4. In this scenario, do I just have to reconcile myself to the fact that
   my job is going to squat on a block of reducers no matter what and
set mapred.reduce.tasks
   to something much less than the available number of slots?


Re: Reducer granularity and starvation

2011-05-18 Thread James Seigel
W.P.,

Hard to help out without knowing more about the characteristics of your data.
How many keys are you expecting?  How many values per key?

Cheers
James.
On 2011-05-18, at 3:25 PM, W.P. McNeill wrote:

 I'm working on a cluster with 360 reducer slots.  I've got a big job, so
 when I launch it I follow the recommendations in the Hadoop documentation
 and set mapred.reduce.tasks=350, i.e. slightly less than the available
 number of slots.
 
 The problem is that my reducers can still take a long time (2-4 hours) to
 run.  So I end up grabbing a big slab of reducers and starving everybody
 else out.  I've got my priority set to VERY_LOW and
 mapred.reduce.slowstart.completed.maps to 0.9, so I think I've done
 everything I can do on the job parameters front.  Currently there isn't a
 way to make the individual reducers run faster, so I'm trying to figure out
 the best way to run my job so that it plays nice with other users of the
 cluster.  My rule of thumb has always been to not try and do any scheduling
 myself, but let Hadoop handle it for me, but I don't think that works in
 this scenario.
 
 Questions:
 
   1. Am I correct in thinking that long reducer times just mess up Hadoop's
   scheduling granularity to a degree that it can't handle?  Is 4-hour reducer
   outside the normal operating range of Hadoop?
   2. Is there any way to stagger task launches?  (Aside from manually.)
   3. What if I set mapred.reduce.tasks to be some value much, much larger
   than the number of available reducer slots, like 100,000.  Will that make
   the amount of work sent to each reducer smaller (hence increasing the
   scheduler granularity) or will it have no effect?
   4. In this scenario, do I just have to reconcile myself to the fact that
   my job is going to squat on a block of reducers no matter what and
 set mapred.reduce.tasks
   to something much less than the available number of slots?



Re: Distcp failure - Server returned HTTP response code: 500

2011-05-18 Thread Bill Graham
Are you able to distcp folders that don't have special characters?

What are the versions of the two clusters and are you running on the
destination cluster if there's a mis-match? If there is you'll need to use
hftp:

http://hadoop.apache.org/common/docs/current/distcp.html#cpver

On Wed, May 18, 2011 at 12:44 PM, sonia gehlot sonia.geh...@gmail.comwrote:

 Hi Guys

 I am trying to copy hadoop data from one cluster to another but I am keep
 on
 getting this error  *Server returned HTTP response code: 500 for URL*
 *
 *
 My distcp command is:
 scripts/hadoop.sh distcp

 hftp://c13-hadoop1-nn-r0-n1:50070/user/dwadmin/live/data/warehouse/facts/page_events/
 *day=2011-05-17* hdfs://phx1-rb-dev40-pipe1.cnet.com:9000/user/sgehlot

 In here I have *day=2011-05-17* in my file path

 I found this online:  https://issues.apache.org/jira/browse/HDFS-31

 Is this issue is still exists? Is this could be the reason of my job
 failure?

 Job Error log:

 2011-05-18 11:34:56,505 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
 Initializing JVM Metrics with processName=MAP, sessionId=
 2011-05-18 11:34:56,713 INFO org.apache.hadoop.mapred.MapTask:
 numReduceTasks: 0
 2011-05-18 11:34:57,039 INFO org.apache.hadoop.tools.DistCp: FAIL

 day=2011-05-17/_logs/history/c13-hadoop1-nn-r0-n1_1291919715221_job_201012091035_41977_dwadmin_CopyFactsToHive%3A+page_events+day%3D2011-05-17
 : java.io.IOException: *Server returned HTTP response code: 500 for URL*:

 http://c13-hadoop1-wkr-r10-n4.cnet.com:50075/streamFile?filename=/user/dwadmin/live/data/warehouse/facts/page_events/day=2011-05-17/_logs/history/c13-hadoop1-nn-r0-n1_1291919715221_job_201012091035_41977_dwadmin_CopyFactsToHive%253A+page_events+day%253D2011-05-17ugi=sgehlot,user
  at

 sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1313)
 at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:157)
  at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:398)
 at org.apache.hadoop.tools.DistCp$CopyFilesMapper.copy(DistCp.java:410)
  at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:537)
 at org.apache.hadoop.tools.DistCp$CopyFilesMapper.map(DistCp.java:306)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)

 2011-05-18 11:35:06,118 WARN org.apache.hadoop.mapred.TaskTracker: Error
 running child
 java.io.IOException: Copied: 0 Skipped: 5 Failed: 1
 at org.apache.hadoop.tools.DistCp$CopyFilesMapper.close(DistCp.java:572)
  at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)
 2011-05-18 11:35:06,124 INFO org.apache.hadoop.mapred.TaskRunner: Runnning
 cleanup for the task

 Any help is appreciated.

 Thanks,
 Sonia



Re: Reducer granularity and starvation

2011-05-18 Thread W.P. McNeill
Altogether my reducers are handling about 10^8 keys.  The number of values
per key varies, but ranges from 1-100.  I'd guess the mean and mode are
around 10, but I'm not sure.


Re: Reducer granularity and starvation

2011-05-18 Thread W.P. McNeill
Also, the values are much larger (maybe by a factor of 10^3) than the keys.  I
get the impression that this is unusual for Hadoop apps.  (It certainly
isn't true of word count in any event.)


Re: Reducer granularity and starvation

2011-05-18 Thread James Seigel
W.P,

Sounds like you are going to be taking a long time no matter what.  With a 
keyspace of about 10^7 that means that either hadoop is going to eventually 
allocate 10^7 reducers (if you set you reducer count to 10^7) or is going to 
re-use the ones you have 10^6 / (number of reducers you allocate) times.   It 
is probably just a big job :)

Look into fairscheduler or specify less reducers for this job and suffer a 
slight slowdown, but allow other jobs to get reducers when they need them.

You *might* get some efficiencies if you can reduce the number of keys, or 
ensure that very few keys are getting big lists of data (anti-parallel).  Make 
sure you are using a combiner if there is an opportunity to reduce the amount 
of data that goes through the shuffle.  That is always a good thing; IO is slow.

Also, see if you can break  your job up into smaller pieces so the more 
expensive operations are happening on less data volume.

Good luck!

Cheers
James.



On 2011-05-18, at 3:42 PM, W.P. McNeill wrote:

 Altogether my reducers are handling about 10^8 keys.  The number of values
 per key varies, but ranges from 1-100.  I'd guess the mean and mode is
 around 10, but I'm not sure.



Re: Reducer granularity and starvation

2011-05-18 Thread W.P. McNeill
I'm using fair scheduler and JVM reuse.  It is just plain a big job.

I'm not using a combiner right now, but that's something to look at.

What about bumping the mapred.reduce.tasks up to some huge number?  I think
that shouldn't make a difference, but I'm hearing conflicting information on
this.


Re: Reducer granularity and starvation

2011-05-18 Thread James Seigel
W.P,

Upping the reduce.tasks to a huge number just means that it will eventually 
spawn reducers equal to that huge number.  You still only have slots for 360 so 
there is no real advantage, UNLESS you are running into OOM errors, which we’ve 
seen with higher re-use on the smaller number of reducers.

Anyhoo, someone else can chime in and correct me if I am off base.

Does that make sense?

Cheers
James.
On 2011-05-18, at 4:04 PM, W.P. McNeill wrote:

 I'm using fair scheduler and JVM reuse.  It is just plain a big job.
 
 I'm not using a combiner right now, but that's something to look at.
 
 What about bumping the mapred.reduce.tasks up to some huge number?  I think
 that shouldn't make a difference, but I'm hearing conflicting information on
 this.



some guidance needed

2011-05-18 Thread Ioan Eugen Stan
Hello everybody,

I'm a GSoC student for this year and I will be working on James [1].
My project is to implement email storage over HDFS. I am quite new to
Hadoop and associates and I am looking for some hints as to get
started on the right track.

I have installed a single node Hadoop instance on my machine and
played around with it (ran some examples) but I am interested in
what you (more experienced people) think is the best way to approach
my problem.

I am a little puzzled by the fact that I read hadoop is best
used for large files, and emails aren't that large from what I know.
Another thing that crossed my mind is that since HDFS is a file
system, wouldn't it be possible to set it as a back-end for the
(existing) maildir and mailbox storage formats? (I think this question
is more suited on the James mailing list, but if you have some ideas
please speak your mind).

Also, any development resources to get me started are welcomed.


[1] http://james.apache.org/mailbox/
[2] https://issues.apache.org/jira/browse/MAILBOX-44

Regards,
-- 
Ioan Eugen Stan


Re: some guidance needed

2011-05-18 Thread Todd Lipcon
Hi Ioan,

I would encourage you to look at a system like HBase for your mail
backend. HDFS doesn't work well with lots of little files, and also
doesn't support random update, so existing formats like Maildir
wouldn't be a good fit.

-Todd

On Wed, May 18, 2011 at 4:02 PM, Ioan Eugen Stan stan.ieu...@gmail.com wrote:
 Hello everybody,

 I'm a GSoC student for this year and I will be working on James [1].
 My project is to implement email storage over HDFS. I am quite new to
 Hadoop and associates and I am looking for some hints as to get
 started on the right track.

 I have installed a single node Hadoop instance on my machine and
 played around with it (ran some examples) but I am interested into
 what you (more experienced people) think it's the best way to approach
 my problem.

 I am a little puzzled about the fact that that I read hadoop is best
 used for large files and email aren't that large from what I know.
 Another thing that crossed my mind is that since HDFS is a file
 system, wouldn't it be possible to set it as a back-end for the
 (existing) maildir and mailbox storage formats? (I think this question
 is more suited on the James mailing list, but if you have some ideas
 please speak your mind).

 Also, any development resources to get me started are welcomed.


 [1] http://james.apache.org/mailbox/
 [2] https://issues.apache.org/jira/browse/MAILBOX-44

 Regards,
 --
 Ioan Eugen Stan




-- 
Todd Lipcon
Software Engineer, Cloudera


Re: some guidance needed

2011-05-18 Thread Mark Kerzner
Ioan,

I second what Todd said. Even with FuseHDFS, which mounts HDFS as a regular file
system, it won't give you the immediate response about the file status that
you need. I believe Google implemented Gmail with HBase. Here is an example
of implementing a mail store with Cassandra:
http://ewh.ieee.org/r6/scv/computer/nfic/2009/IBM-Jun-Rao.pdf

Mark

On Wed, May 18, 2011 at 5:05 PM, Todd Lipcon t...@cloudera.com wrote:

 Hi Ioan,

 I would encourage you to look at a system like HBase for your mail
 backend. HDFS doesn't work well with lots of little files, and also
 doesn't support random update, so existing formats like Maildir
 wouldn't be a good fit.

 -Todd

 On Wed, May 18, 2011 at 4:02 PM, Ioan Eugen Stan stan.ieu...@gmail.com
 wrote:
  Hello everybody,
 
  I'm a GSoC student for this year and I will be working on James [1].
  My project is to implement email storage over HDFS. I am quite new to
  Hadoop and associates and I am looking for some hints as to get
  started on the right track.
 
  I have installed a single node Hadoop instance on my machine and
  played around with it (ran some examples) but I am interested into
  what you (more experienced people) think it's the best way to approach
  my problem.
 
  I am a little puzzled about the fact that that I read hadoop is best
  used for large files and email aren't that large from what I know.
  Another thing that crossed my mind is that since HDFS is a file
  system, wouldn't it be possible to set it as a back-end for the
  (existing) maildir and mailbox storage formats? (I think this question
  is more suited on the James mailing list, but if you have some ideas
  please speak your mind).
 
  Also, any development resources to get me started are welcomed.
 
 
  [1] http://james.apache.org/mailbox/
  [2] https://issues.apache.org/jira/browse/MAILBOX-44
 
  Regards,
  --
  Ioan Eugen Stan
 



 --
 Todd Lipcon
 Software Engineer, Cloudera



Can I use InputSampler.RandomSampler on data with non-Text keys?

2011-05-18 Thread W.P. McNeill
I want to do a total sort on some data whose key type is Writable but not
Text.  I wrote an InputSampler.RandomSampler object following the example in
the Total Sort section of *Hadoop: The Definitive Guide*.  When I
call InputSampler.writePartitionFile() I get a class cast exception because
my key type cannot be cast to Text.  Specifically the issue seems to be the
following section of InputSampler.getSample():

K key = reader.getCurrentKey();

Text keyCopy = WritableUtils.<Text>clone((Text) key,
job.getConfiguration());

From this source it does appear that you can only use a RandomSampler on
data with Text keys.  However, I'm confused because I don't see this
mentioned in any documentation, and I assume this wouldn't be the case
because InputSampler takes <Key, Value> generic specifications.

   1. Does InputSampler.RandomSampler only work on data with Text key
   values?
   2. If so, what is the easiest way to generate a random sample for data
   with non-Text key values?  Is there example code anywhere?


Re: Reducer granularity and starvation

2011-05-18 Thread Joey Echeverria
The one advantage you would get with a large number of reducers is
that the scheduler will be able to give open reduce slots to other
jobs without having to be preemptive.

This will reduce the risk of you losing a reducer 3 hours into a 4 hour run.

-Joey

On Wed, May 18, 2011 at 3:08 PM, James Seigel ja...@tynt.com wrote:
 W.P,

 Upping the reduce.tasks to a huge number just means that it will eventually 
 spawn reducers = to (that huge number).  You still only have slots for 360 so 
 there is no real advantage, UNLESS you are running into OOM errors, which 
 we’ve seen with higher re-use on the smaller number of reducers.

 Anyhoo, someone else can chime in and correct me if I am off base.

 Does that make sense?

 Cheers
 James.
 On 2011-05-18, at 4:04 PM, W.P. McNeill wrote:

 I'm using fair scheduler and JVM reuse.  It is just plain a big job.

 I'm not using a combiner right now, but that's something to look at.

 What about bumping the mapred.reduce.tasks up to some huge number?  I think
 that shouldn't make a difference, but I'm hearing conflicting information on
 this.





-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Re: Reducer granularity and starvation

2011-05-18 Thread W.P. McNeill
Here's a consequence that I see of having the values be much larger than the
keys: there's not much point in me adding a combiner.

My mapper emits pairs of the form:

<Key, Value>

where the size of value is much greater than the size of Key.  The reducer
then processes input of the form:

<Key, Iterator<Value>>

The reducer then looks at the set of values corresponding to a Key and
separates it into one of two bins.  I don't think this is particularly
CPU-intensive, however, the reducer needs access to the entire set of
Values.  The set can't be boiled down into some smaller sufficient statistic
the way, say, in a word count program we can combine the counts for a word
from different documents into a single number.  As a result, the only
combiner strategy I can see is to have the mapper emit a Value as a single
item list:

<Key, [Value]>

Have a combiner combine the lists:

<Key, [Value, Value...]>

and then the reducer would work on lists of lists.

<Key, Iterator<[Value, Value...]>>

This would save on redundant Key IO, but since Values are so much bigger
than Keys I don't think this would matter.


Re: Can I use InputSampler.RandomSampler on data with non-Text keys?

2011-05-18 Thread Joey Echeverria
That sounds like a bug to me.

I think the easiest way would be to modify InputSampler to handle non Text keys.
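
For example, the Text-specific clone quoted below could, in principle, become
something along these lines (just a sketch; it assumes the key type K is a
Writable, which the generic signature doesn't guarantee, hence the unchecked cast):

    K key = reader.getCurrentKey();
    @SuppressWarnings("unchecked")
    K keyCopy = (K) WritableUtils.clone((Writable) key, job.getConfiguration());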

-Joey

On Wed, May 18, 2011 at 4:24 PM, W.P. McNeill bill...@gmail.com wrote:
 I want to do a total sort on some data whose key type is Writable but not
 Text.  I wrote an InputSampler.RandomSampler object following the example in
 the Total Sort section of *Hadoop: The Definitive Guide*.  When I
 call InputSampler.writePartitionFile() I get a class cast exception because
 my key type cannot be cast to Text.  Specifically the issue seems to be the
 following section of InputSampler.getSample():

    K key = reader.getCurrentKey();
    
    Text keyCopy = WritableUtils.Textclone((Text)key,
 job.getConfiguration());

 From this source it does appear that you can only use a RandomSampler on
 data with Text keys.  However, I'm confused because I don't see this
 mentioned in any documentation, and I assume this wouldn't be the case
 because InputSampler takes Key, Value generic specifications.

   1. Does InputSampler.RandomSampler only work on data with Text key
   values?
   2. If so, what is the easiest way to generate a random sample for data
   with non-Text key values?  Is there example code anywhere?




-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434