Re: HADOOP-2536 supports Oracle too?
Enis ... I have tried JAR-ing with the library folder as you said, but still no luck. I keep getting the same ClassNotFoundException again and again. :,(

Enis Soztutar-2 wrote: There is nothing special about the JDBC driver library. I guess that you have added the jar from the IDE (NetBeans), but did not include the necessary libraries (the JDBC driver in this case) in TableAccess.jar. The standard way is to include the dependent jars in the project's jar under the lib directory. For example:
example.jar
  - META-INF
  - com/...
  - lib/postgres.jar
  - lib/abc.jar
If your classpath is correct, check whether you call DBConfiguration.configureDB() with the correct driver class and URL.

sandhiya wrote: Hi, I'm using PostgreSQL and the driver is not getting detected. How do you run it in the first place? I just typed bin/hadoop jar /root/sandy/netbeans/TableAccess/dist/TableAccess.jar at the terminal, without the quotes. I didn't copy any files from my local drives into the Hadoop file system. I get an error like this: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.postgresql.Driver and then the complete stack trace. Am I doing something wrong? I downloaded a jar file for PostgreSQL JDBC support and included it in my Libraries folder (I'm using NetBeans). Please help.

Fredrik Hedberg-3 wrote: Hi, although it's not MySQL, this might be of use: http://svn.apache.org/repos/asf/hadoop/core/trunk/src/examples/org/apache/hadoop/examples/DBCountPageView.java Fredrik

On Feb 16, 2009, at 8:33 AM, sandhiya wrote: @Amandeep Hi, I'm new to Hadoop and am trying to run a simple database connectivity program on it. Could you please tell me how you went about it? My mail id is sandys_cr...@yahoo.com. A copy of your code that successfully connected to MySQL would also be helpful. Thanks, Sandhiya

Enis Soztutar-2 wrote: From the exception java.io.IOException: ORA-00933: SQL command not properly ended I would broadly guess that the Oracle JDBC driver might be complaining that the statement does not end with ;, or something similar. You can: 1. download the latest source code of Hadoop, 2. add a print statement printing the query (probably in DBInputFormat:119), 3. build the Hadoop jar, 4. use the new Hadoop jar to see the actual SQL query, 5. run the query on Oracle to see if it gives an error. Enis

Amandeep Khurana wrote: Ok. I created the same database in a MySQL database and ran the same Hadoop job against it. It worked. So that means there is some Oracle-specific issue. It can't be an issue with the JDBC drivers since I am using the same drivers in a simple JDBC client. What could it be? Amandeep

On Wed, Feb 4, 2009 at 10:26 AM, Amandeep Khurana ama...@gmail.com wrote: Ok. I'm not sure if I got it correct. Are you saying I should test the statement that Hadoop creates directly against the database? Amandeep

On Wed, Feb 4, 2009 at 7:13 AM, Enis Soztutar enis@gmail.com wrote: Hadoop-2536 connects to the DB via JDBC, so in theory it should work with proper JDBC drivers. It has been tested against MySQL, HSQLDB, and PostgreSQL, but not Oracle. To answer your earlier question, the actual SQL statements might not be recognized by Oracle, so I suggest the best way to test this is to insert print statements and run the actual SQL statements against Oracle to see if the syntax is accepted. We would appreciate it if you publish your results. Enis

Amandeep Khurana wrote: Does the patch HADOOP-2536 support connecting to Oracle databases as well? Or is it just limited to MySQL? Amandeep Khurana, Computer Science Graduate Student, University of California, Santa Cruz

-- View this message in context: http://www.nabble.com/HADOOP-2536-supports-Oracle-too--tp21823199p22073986.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
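For readers hitting the same wall, here is a rough, self-contained sketch of the configureDB()/setInput() calls Enis refers to, using the PostgreSQL driver class from the exception above. The JDBC URL, database, table and column names are invented for illustration, and the record class is only a minimal stub, not anything taken from the thread:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.db.DBConfiguration;
    import org.apache.hadoop.mapred.lib.db.DBInputFormat;
    import org.apache.hadoop.mapred.lib.db.DBWritable;

    public class TableAccessSketch {

      // Minimal stub record; a real job would map its own columns here.
      static class MyRecord implements Writable, DBWritable {
        long id;
        String name;
        public void readFields(DataInput in) throws IOException { id = in.readLong(); name = Text.readString(in); }
        public void write(DataOutput out) throws IOException { out.writeLong(id); Text.writeString(out, name); }
        public void readFields(ResultSet rs) throws SQLException { id = rs.getLong(1); name = rs.getString(2); }
        public void write(PreparedStatement st) throws SQLException { st.setLong(1, id); st.setString(2, name); }
      }

      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TableAccessSketch.class);
        // The driver class and URL must match the JDBC jar that is actually on the task classpath.
        DBConfiguration.configureDB(conf,
            "org.postgresql.Driver",                   // driver class from the ClassNotFoundException
            "jdbc:postgresql://dbhost:5432/mydb");     // hypothetical connection URL
        DBInputFormat.setInput(conf, MyRecord.class, "mytable",
            null /* conditions */, "id" /* orderBy */, new String[] { "id", "name" });
        // ... set the mapper, output format etc., then submit with JobClient.runJob(conf).
      }
    }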
Re: HADOOP-2536 supports Oracle too?
It should either be in the jar or in the lib folder in the Hadoop installation. If neither of them works, check the jar that you are including.

Amandeep Khurana, Computer Science Graduate Student, University of California, Santa Cruz

On Wed, Feb 18, 2009 at 12:08 AM, sandhiya sandhiy...@gmail.com wrote: Enis ... I have tried JAR-ing with the library folder as you said, but still no luck. I keep getting the same ClassNotFoundException again and again.
Re: HADOOP-2536 supports Oracle too?
Thanks a million!!! It worked, but it's a little weird though: I have to put the library with the JDBC jars in BOTH the executable jar file AND the lib folder in $HADOOP_HOME. Do all of you do the same thing, or is it just my computer acting strange? Anyway, thanks for the help. :clap:

Amandeep Khurana wrote: It should either be in the jar or in the lib folder in the Hadoop installation. If neither of them works, check the jar that you are including.
Re: Allowing other system users to use Hadoop
Nicholas, like Matei said, there are two possibilities in terms of permissions (every permissions command works just like in Linux):

1. Create a directory for the user and make the user the owner of that directory: hadoop dfs -chown ... (assuming Hadoop doesn't need write access to any file outside the user's home directory).

2. Change the group ownership of all files in HDFS to a group that every user belongs to (hadoop dfs -chgrp -R groupname /), then give that group write access (hadoop dfs -chmod -R g+w /), again on all files. (Here, whenever a user runs jobs, Hadoop automatically creates a separate home directory for them.) This way is better for a development environment, I think.

Cheers, Rasit

2009/2/18 Matei Zaharia ma...@cloudera.com: Other users should be able to submit jobs using the same commands (bin/hadoop ...). Are there errors you ran into? One thing is that you'll need to grant them permissions over any files in HDFS that you want them to read. You can do it using bin/hadoop fs -chmod, which works like chmod on Linux. You may need to run this as the root user (sudo bin/hadoop fs -chmod). Also, I don't remember exactly, but you may need to create home directories for them in HDFS as well (again, create them as root, and then sudo bin/hadoop fs -chown them).

On Tue, Feb 17, 2009 at 10:48 AM, Nicholas Loulloudes loulloude...@cs.ucy.ac.cy wrote: Hi all, I just installed Hadoop (single node) on a Linux Ubuntu distribution as per the instructions found on the following website: http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster) I followed the instructions on the website to create a hadoop system user and group, and I was able to run a MapReduce job successfully. What I want to do now is to create more system users which will be able to use Hadoop for running MapReduce jobs. Is there any guide on how to achieve this? Any suggestions will be highly appreciated. Thanks in advance, Nicholas Loulloudes, High Performance Computing Systems Laboratory (HPCL), University of Cyprus, Nicosia, Cyprus

-- M. Raşit ÖZDAŞ
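If it is more convenient to do this from Java than from the shell, the same steps can be expressed with the FileSystem API. A small sketch, run as the HDFS superuser; the user name "alice" and the 0755 mode are just examples, not values from the thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;

    public class CreateUserHome {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // reads hadoop-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        Path home = new Path("/user/alice");             // hypothetical new user
        fs.mkdirs(home);                                 // like: hadoop fs -mkdir /user/alice
        fs.setOwner(home, "alice", "alice");             // like: hadoop fs -chown alice:alice /user/alice
        fs.setPermission(home, new FsPermission((short) 0755)); // like: hadoop fs -chmod 755 /user/alice
        fs.close();
      }
    }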
GenericOptionsParser warning
Hi All, I prepare my JobConf object in a Java class by calling various set APIs on the JobConf object. When I submit the JobConf object using JobClient.runJob(conf), I'm seeing the warning: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. From the Hadoop sources it looks like setting mapred.used.genericoptionsparser will prevent this warning, but if I set this flag to true, will it have any other side effects? Thanks, Sandhya
Re: Hadoop User Group UK Meetup - April 14th
Registrations for the next Hadoop User Group UK meetup have now opened: http://huguk.eventwax.com/hadoop-user-group-uk-2

The preliminary schedule:
10.00 – 10.15: Arriving and chatting
10.15 – 11.15: Practical MapReduce (Tom White, Cloudera)
11.15 – 12.15: Introducing Apache Mahout (Isabel Drost, ASF)
12.15 – 13.15: Lunch
13.15 – 14.15: Terrier (Iadh Ounis and Craig Macdonald, University of Glasgow)
14.15 – 15.15: Having Fun with PageRank and MapReduce (Paolo Castagna, HP)
15.15 – 16.15: Apache HBase (Michael Stack, Powerset)
16.15 – 17.00: General chat, perhaps lightning talks (powered by Sun beer)
17.00 – 00.00: Discussion continues at a nearby pub

The event is hosted by Sun in London, near Monument station. For more details see the event page or the blog: http://huguk.org/ /Johan

Johan Oskarsson wrote: I've started organizing the next Hadoop meetup in London, UK. The date is April 14th and the presentations so far include: Michael Stack (Powerset): Apache HBase; Isabel Drost (Neofonie): Introducing Apache Mahout; Iadh Ounis and Craig Macdonald (University of Glasgow): Terrier; Paolo Castagna (HP): Having Fun with PageRank and MapReduce. Keep an eye on the blog for updates: http://huguk.org/ Help in the form of sponsoring (venue, beer etc.) would be much appreciated. Also let me know if you want to present. Personally I'd love to see presentations from other Hadoop-related projects (Pig, Hive, Hama etc.). /Johan
Overriding mapred.tasktracker.map.tasks.maximum with -jobconf
I'm having trouble overriding the maximum number of map tasks that run on a given machine in my cluster. The default value of mapred.tasktracker.map.tasks.maximum is set to 2 in hadoop-default.xml. When running my job I passed -jobconf mapred.tasktracker.map.tasks.maximum=1 to limit map tasks to one per machine, but each machine was still allocated 2 map tasks (simultaneously). The only way I was able to guarantee a maximum of one map task per machine was to change the value of the property in hadoop-site.xml. This is unsatisfactory since I'll often be changing the maximum on a per-job basis. Any hints?

On a different note, when I attempt to pass params via -D I get a usage message; when I use -jobconf the command goes through (and works in the case of mapred.reduce.tasks=0, for example), but I get a deprecation warning.

Thanks, John
Re: GenericOptionsParser warning
Sandhya E wrote: Hi All, I prepare my JobConf object in a Java class by calling various set APIs on the JobConf object. When I submit the JobConf object using JobClient.runJob(conf), I'm seeing the warning: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. From the Hadoop sources it looks like setting mapred.used.genericoptionsparser will prevent this warning, but if I set this flag to true, will it have any other side effects? Thanks, Sandhya

I've seen this message too, and it annoys me; I haven't tracked it down.
Re: GenericOptionsParser warning
Hi, there is a JIRA issue about this problem, if I understand it correctly: https://issues.apache.org/jira/browse/HADOOP-3743

Strangely, when I searched all the source code, this check exists in only two places:

    if (!(job.getBoolean("mapred.used.genericoptionsparser", false))) {
      LOG.warn("Use GenericOptionsParser for parsing the arguments. " +
               "Applications should implement Tool for the same.");
    }

Just an if block for logging, no extra checks. Am I missing something? If your class implements Tool, then there shouldn't be a warning.

Cheers, Rasit

2009/2/18 Steve Loughran ste...@apache.org: I've seen this message too, and it annoys me; I haven't tracked it down.

-- M. Raşit ÖZDAŞ
Re: Overriding mapred.tasktracker.map.tasks.maximum with -jobconf
John, did you try the -D option instead of -jobconf? I had the -D option in my code; when I changed it to -jobconf, this is what I get:

...
Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-combiner <JavaClassName> Combiner has to be a Java class
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file
-inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
-outputformat TextOutputFormat(default)|JavaClassName Optional.
-partitioner JavaClassName Optional.
-numReduceTasks <num> Optional.
-inputreader <spec> Optional.
-cmdenv <n>=<v> Optional. Pass env.var to streaming commands
-mapdebug <path> Optional. To run this script when a map task fails
-reducedebug <path> Optional. To run this script when a reduce task fails
-verbose

Generic options supported are:
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines

The general command line syntax is: bin/hadoop command [genericOptions] [commandOptions]

For more details about these options, use: $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar -info

I think -jobconf is not used in v0.19.

2009/2/18 S D sd.codewarr...@gmail.com: I'm having trouble overriding the maximum number of map tasks that run on a given machine in my cluster.

-- M. Raşit ÖZDAŞ
Re: Overriding mapred.tasktracker.map.tasks.maximum with -jobconf
Thanks for your response, Rasit. You may have missed a portion of my post: "On a different note, when I attempt to pass params via -D I get a usage message; when I use -jobconf the command goes through (and works in the case of mapred.reduce.tasks=0, for example), but I get a deprecation warning." I'm using Hadoop 0.19.0 and -D is not working. Are you using version 0.19.0 as well?

John

On Wed, Feb 18, 2009 at 9:14 AM, Rasit OZDAS rasitoz...@gmail.com wrote: John, did you try the -D option instead of -jobconf? I think -jobconf is not used in v0.19.
Re: Finding small subset in very large dataset
Hi, the Bloom filter solution works great, but I still have to copy the data around sometimes. I'm still wondering if I can reduce the data associated with the keys to a reference or something small (the 100 KB of associated data is quite big), which I could then use to fetch the data later in the reduce step. In the past I was using HBase to store the associated data (but unfortunately HBase proved to be very unreliable in my case). I will probably also start to compress the data in the value store, which will probably increase sorting speed (as the data there is probably uncompressed). Is there something else I could do to speed this process up?

Thanks, Thibaut

-- View this message in context: http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p22081608.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Finding small subset in very large dataset
Just re-represent the associated data as a bit vector and a set of hash functions. You then just copy this around, rather than the raw items themselves.

Miles

2009/2/18 Thibaut_ tbr...@blue.lu: Hi, the Bloom filter solution works great, but I still have to copy the data around sometimes.

-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: Overriding mapred.tasktracker.map.tasks.maximum with -jobconf
The .maximum values are only loaded by the TaskTrackers at server start time at present, and any changes you make will be ignored.

2009/2/18 S D sd.codewarr...@gmail.com: I'm using Hadoop 0.19.0 and -D is not working. Are you using version 0.19.0 as well?
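To make Jason's point concrete, here is a small hedged sketch of a driver fragment (the class name is a placeholder): per-job settings on the JobConf are honoured at submission time, while the tasktracker slot maximum is only read from the config files when each TaskTracker daemon starts.

    import org.apache.hadoop.mapred.JobConf;

    public class SlotLimitExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf(SlotLimitExample.class);
        conf.setNumReduceTasks(0);   // per-job setting, honoured when the job is submitted
        // The next line compiles and runs, but the TaskTrackers never see it:
        // mapred.tasktracker.map.tasks.maximum is read once, at daemon startup.
        conf.set("mapred.tasktracker.map.tasks.maximum", "1");
      }
    }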
Re: Finding small subset in very large dataset
Hi Miles, I'm not following you. If I'm saving an associated hash or bit vector, how can I then quickly access the elements afterwards (the file with the data might be 100 GB big and is on the DFS)? I could also directly save the offset of the data in the data file as a reference, and then on each reducer read that big file only once. As all the keys are sorted, I can get all the needed values in one big read step (skipping the entries I don't need).

Thibaut

Miles Osborne wrote: Just re-represent the associated data as a bit vector and a set of hash functions. You then just copy this around, rather than the raw items themselves.

-- View this message in context: http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p22082598.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
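If the big value file were a SequenceFile on HDFS, the offset idea could look roughly like the sketch below: record reader.getPosition() for each key on the way in, ship only that long through the shuffle, and seek back to it in the reducer. The file name and key/value types here are assumptions, not details from the thread.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class OffsetLookup {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path values = new Path("/data/values.seq");   // hypothetical file holding the large values
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, values, conf);
        long savedOffset = Long.parseLong(args[0]);   // the small "reference" carried through the job
        Text key = new Text();
        BytesWritable value = new BytesWritable();
        reader.seek(savedOffset);                     // offsets must come from reader.getPosition()
        reader.next(key, value);                      // reads the record that starts at that offset
        reader.close();
      }
    }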
Re: Finding small subset in very large dataset
If I remember correctly you have two sets of data:
- set A, which is very big
- set B, which is small
and you want to find all elements of A which are in B, right?

Represent A using a variant of a Bloom filter which supports key-value pairs; a Bloomier filter will do this for you. Each mapper then loads up A (represented using the Bloomier filter) and works over B. Whenever an element of B is present in the representation, you look up the associated value and emit it.

If even using a Bloomier filter you still need too much memory, then you could store it once using Hypertable.

See here for an explanation of Bloomier filters applied to the task of storing lots of (string, probability) pairs: Randomized Language Models via Perfect Hash Functions http://aclweb.org/anthology-new/P/P08/P08-1058.pdf

Miles

2009/2/18 Thibaut_ tbr...@blue.lu: Hi Miles, I'm not following you. If I'm saving an associated hash or bit vector, how can I then quickly access the elements afterwards?

-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
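Hadoop itself only ships a plain membership Bloom filter (org.apache.hadoop.util.bloom), not the key-value Bloomier filter Miles describes, but the membership half of the idea looks roughly like this; the filter size, hash count and key strings are arbitrary examples:

    import org.apache.hadoop.util.bloom.BloomFilter;
    import org.apache.hadoop.util.bloom.Key;
    import org.apache.hadoop.util.hash.Hash;

    public class MembershipSketch {
      public static void main(String[] args) {
        // Build the filter over the small set's keys, then distribute it to the mappers
        // so records of the big set that cannot match are dropped without any lookup.
        BloomFilter filter = new BloomFilter(1 << 20, 5, Hash.MURMUR_HASH); // vector size and hash count are guesses
        filter.add(new Key("someKeyFromTheSmallSet".getBytes()));           // repeat for every key in the small set

        String candidate = "someKeyFromTheBigSet";
        boolean maybeInSmallSet = filter.membershipTest(new Key(candidate.getBytes()));
        // false => definitely not in the small set; true => probably there, verify before emitting.
        System.out.println(candidate + " possibly in small set: " + maybeInSmallSet);
      }
    }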
Re: Overriding mapred.tasktracker.map.tasks.maximum with -jobconf
Thanks Jason, that's useful information. Are you aware of plans to change this so that the maximum values can be changed without restarting the server?

John

2009/2/18 jason hadoop jason.had...@gmail.com: The .maximum values are only loaded by the TaskTrackers at server start time at present, and any changes you make will be ignored.
Re: Overriding mapred.tasktracker.map.tasks.maximum with -jobconf
I certainly hope it changes, but I am not aware that it is in the to-do queue at present.

2009/2/18 S D sd.codewarr...@gmail.com: Thanks Jason, that's useful information. Are you aware of plans to change this so that the maximum values can be changed without restarting the server?
Re: Overriding mapred.tasktracker.map.tasks.maximum with -jobconf
I see, John. I also use 0.19. Just to note, the -D option should come first, since it's one of the generic options. I use it without any errors.

Cheers, Rasit

2009/2/18 S D sd.codewarr...@gmail.com: I'm using Hadoop 0.19.0 and -D is not working. Are you using version 0.19.0 as well?

-- M. Raşit ÖZDAŞ
Getting Started with AIX machines
I am attempting my first steps learning Hadoop on top of an AIX machine. I have followed the installation description: http://hadoop.apache.org/core/docs/r0.19.0/quickstart.html The stand-alone mode worked just fine. However, I am failing when trying to execute the pseudo-distributed mode. I have carried out the following steps:

1. update conf/hadoop-site.xml
2. exec bin/hadoop namenode -format
3. exec bin/start-all.sh
4. exec bin/hadoop fs -put conf input

- The execution of step 2 (formatting the NameNode) was successful, corresponding to the expected result also shown in http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)

- The execution of step 3 (starting the single-node servers) seems to be OK, although the output is not similar to the one produced on Ubuntu Linux; it seems that the localhost shell is exited:

starting namenode, logging to /usr/hadoop/hadoop-0.19.0/bin/../logs/hadoop-hdpuser-namenode-rcc-hrl-lpar-020.haifa.ibm.com.out
localhost: starting datanode, logging to /usr/hadoop/hadoop-0.19.0/bin/../logs/hadoop-hdpuser-datanode-rcc-hrl-lpar-020.haifa.ibm.com.out
localhost: Hasta la vista, baby  <== IT SEEMS that the localhost shell exits
localhost: starting secondarynamenode, logging to /usr/hadoop/hadoop-0.19.0/bin/../logs/hadoop-hdpuser-secondarynamenode-rcc-hrl-lpar-020.haifa.ibm.com.out
localhost: Hasta la vista, baby  <== IT SEEMS that the localhost shell exits
starting jobtracker, logging to /usr/hadoop/hadoop-0.19.0/bin/../logs/hadoop-hdpuser-jobtracker-rcc-hrl-lpar-020.haifa.ibm.com.out
localhost: starting tasktracker, logging to /usr/hadoop/hadoop-0.19.0/bin/../logs/hadoop-hdpuser-tasktracker-rcc-hrl-lpar-020.haifa.ibm.com.out
localhost: Hasta la vista, baby  <== IT SEEMS that the localhost shell exits

- The execution of step 4 fails; no data is copied to the DFS input directory, and I receive this exception:

09/02/18 12:14:24 INFO hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/hdpuser/input/masters could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1270) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37) at java.lang.reflect.Method.invoke(Method.java:599) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892) at org.apache.hadoop.ipc.Client.call(Client.java:696) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at $Proxy0.addBlock(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:45) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37) at java.lang.reflect.Method.invoke(Method.java:599) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy0.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2697) at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183) 09/02/18 12:14:24 WARN hdfs.DFSClient: NotReplicatedYetException sleeping /user/hdpuser/input/masters retries left 4 . . 09/02/18 12:14:30 WARN hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null 09/02/18 12:14:30 WARN hdfs.DFSClient: Could not get block locations. Aborting... put: java.io.IOException: File /user/hdpuser/input/masters could only be replicated to 0 nodes, instead of 1 Exception closing file /user/hdpuser/input/masters java.io.IOException: Filesystem closed at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198) at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3084) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3053) at
RE: Getting Started with AIX machines
Refer to the following fix; Hadoop will not work under AIX without it. https://issues.apache.org/jira/browse/HADOOP-4546

Bill

-----Original Message----- From: work.av...@gmail.com [mailto:work.av...@gmail.com] On Behalf Of Aviad sela Sent: Wednesday, February 18, 2009 12:14 PM To: Hadoop Users Support Subject: Getting Started with AIX machines

I am attempting my first steps learning Hadoop on top of an AIX machine. I have followed the installation description: http://hadoop.apache.org/core/docs/r0.19.0/quickstart.html
Disabling Reporter Output?
I am currently trying Map/Reduce in Eclipse. The input comes from an HBase table. The performance of my jobs is terrible: even when run on only a single row, it takes around 10 seconds to complete the job. My current guess is that the reporting done to the Eclipse console might play a role here. I am looking for a way to disable the printing of status to the console, or of course any other ideas about what is going wrong here. This is a single-node cluster on pretty common desktop hardware, and writing to HBase is a breeze.

Thanks, Philipp
Re: Disabling Reporter Output?
There is a moderate amount of setup and teardown in any Hadoop job. It may be that your 10 seconds are primarily that.

On Wed, Feb 18, 2009 at 11:29 AM, Philipp Dobrigkeit pdobrigk...@gmx.de wrote: I am currently trying Map/Reduce in Eclipse. The input comes from an HBase table. The performance of my jobs is terrible: even when run on only a single row, it takes around 10 seconds to complete the job.
Re: Hadoop Write Performance
What is the Hadoop version? You could check the log on a datanode around that time and post any suspicious errors. For example, you can trace a particular block in the client and datanode logs. Most likely it is not a NameNode issue, but you can check the NameNode log as well. Raghu. Xavier Stevens wrote: Does anyone have an expected or experienced write speed to HDFS outside of Map/Reduce? Any recommendations on properties to tweak in hadoop-site.xml? Currently I have a multi-threaded writer where each thread is writing to a different file. But after a while I get this: java.io.IOException: Could not get block locations. Aborting... at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2081) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818) Which is perhaps indicating that the namenode is overwhelmed? Thanks, -Xavier
RE: Hadoop Write Performance
Raghu, I was using 0.17.2.1, but I installed 0.18.3 a couple of days ago. I also separated out my secondarynamenode and jobtracker to another machine. In addition, my network operations people had misconfigured some switches, which ended up being my bottleneck. After all of that, my writer and Hadoop are working great. -Xavier -----Original Message----- From: Raghu Angadi [mailto:rang...@yahoo-inc.com] Sent: Wednesday, February 18, 2009 11:49 AM To: core-user@hadoop.apache.org Subject: Re: Hadoop Write Performance What is the Hadoop version? You could check the log on a datanode around that time and post any suspicious errors. For example, you can trace a particular block in the client and datanode logs. Most likely it is not a NameNode issue, but you can check the NameNode log as well. Raghu. Xavier Stevens wrote: Does anyone have an expected or experienced write speed to HDFS outside of Map/Reduce? Any recommendations on properties to tweak in hadoop-site.xml? Currently I have a multi-threaded writer where each thread is writing to a different file. But after a while I get this: java.io.IOException: Could not get block locations. Aborting... at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2081) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1300(DFSClient.java:1702) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1818) Which is perhaps indicating that the namenode is overwhelmed? Thanks, -Xavier
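For anyone who wants to measure raw HDFS write throughput outside of MapReduce, a minimal sketch against the 0.18-era FileSystem API is below; the path and sizes are placeholders, and a multi-threaded writer would simply run the same loop per thread, each with its own file.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteTest {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();          // picks up hadoop-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args.length > 0 ? args[0] : "/tmp/write-test");
    byte[] buf = new byte[64 * 1024];
    long total = 256L * 1024 * 1024;                   // write 256 MB
    long start = System.currentTimeMillis();
    FSDataOutputStream stream = fs.create(out, true);  // overwrite if the file exists
    try {
      for (long written = 0; written < total; written += buf.length) {
        stream.write(buf);
      }
    } finally {
      stream.close();
    }
    long ms = System.currentTimeMillis() - start;
    System.out.println("Wrote " + total + " bytes in " + ms + " ms");
  }
}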
Problems getting Eclipse Hadoop plugin to work.
I'm using Eclipse 3.3.2 and want to view my remote cluster using the Hadoop plugin. Everything shows up and I can see the map/reduce perspective but when trying to connect to a location I get: Error: Call failed on local exception I've set the host to for example xx0, where xx0 is a remote machine accessible from the terminal, and the ports to 50020/50040 for M/R master and DFS master respectively. Is there anything I'm missing to set for remote access to the Hadoop cluster? Regards Erik
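One common cause of "Call failed on local exception" with the plugin is pointing it at the wrong ports: the DFS master field should use the NameNode RPC host/port from fs.default.name and the Map/Reduce master field the JobTracker host/port from mapred.job.tracker, rather than the 500xx daemon ports. As an illustration of what the cluster's conf/hadoop-site.xml might contain (the xx0 host and the port numbers are only placeholders):

<property>
  <name>fs.default.name</name>
  <value>hdfs://xx0:9000</value>    <!-- DFS master: host xx0, port 9000 -->
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>xx0:9001</value>           <!-- Map/Reduce master: host xx0, port 9001 -->
</property>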
Re: GenericOptionsParser warning
You should put this stub code in your program as the means to start your MapReduce job (imports added for completeness):

import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Foo extends Configured implements Tool {

  public int run(String[] args) throws IOException {
    // Build the JobConf from the Configuration that ToolRunner has already
    // populated from the generic options (-D, -fs, -jt, -files, ...).
    JobConf conf = new JobConf(getConf(), Foo.class);
    // run the job here.
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int ret = ToolRunner.run(new Foo(), args); // calls your run() method.
    System.exit(ret);
  }
}

On Wed, Feb 18, 2009 at 7:09 AM, Rasit OZDAS rasitoz...@gmail.com wrote: Hi, There is a JIRA issue about this problem, if I understand it correctly: https://issues.apache.org/jira/browse/HADOOP-3743 Strange that I searched all the source code, but there exists only this check, in 2 places:

if (!(job.getBoolean("mapred.used.genericoptionsparser", false))) {
  LOG.warn("Use GenericOptionsParser for parsing the arguments. " +
           "Applications should implement Tool for the same.");
}

Just an if block for logging, no extra controls. Am I missing something? If your class implements Tool, then there shouldn't be a warning. Cheers, Rasit 2009/2/18 Steve Loughran ste...@apache.org Sandhya E wrote: Hi All I prepare my JobConf object in a java class, by calling various set APIs on the JobConf object. When I submit the JobConf object using JobClient.runJob(conf), I'm seeing the warning: "Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same." From the Hadoop sources it looks like setting mapred.used.genericoptionsparser will prevent this warning. But if I set this flag to true, will it have some other side effects? Thanks Sandhya Seen this message too - and it annoys me; not tracked it down -- M. Raşit ÖZDAŞ
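Once the job goes through ToolRunner as above, GenericOptionsParser handles the generic options (-D, -conf, -fs, -jt, -files, -libjars) before your own arguments and the warning goes away. A hypothetical invocation, assuming the class above is packaged in foo.jar and that input/output are your job's own arguments, might look like:

bin/hadoop jar foo.jar Foo -D mapred.reduce.tasks=2 input output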
Persistent completed jobs status not showing in jobtracker UI
I have enabled persistent completed jobs status and can see them in HDFS. However, they are not listed in the jobtracker's UI after the jobtracker is restarted. I thought that the jobtracker would automatically look in HDFS if it does not find a job in its memory cache. What am I missing? How do I retrieve the persistent completed job status? Bill
the question about the common pc?
Hi: The documentation says that Hadoop distributes the data and processing across clusters of commonly available computers. But what does "commonly available computers" mean? A 1U server? Or the PCs that people use daily on Windows? -- View this message in context: http://www.nabble.com/the-question-about-the-common-pc--tp22092022p22092022.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: the question about the common pc?
And, are the nodes the PCs that people use daily on Windows, or 1U servers? buddha1021 wrote: Hi: The documentation says that Hadoop distributes the data and processing across clusters of commonly available computers. But what does "commonly available computers" mean? A 1U server? Or the PCs that people use daily on Windows? -- View this message in context: http://www.nabble.com/the-question-about-the-common-pc--tp22092022p22092038.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
7Zip compression in Hadoop
Hi all! I'm working on Sogou Corpus mining with Hadoop MapReduce. However, the files are compressed in 7zip format. Does Hadoop have built-in support for 7zip files, or do I need to write a codec? Regards Song Liu in Suzhou University, China.
RE: 7Zip compression in Hadoop
No, you will need to write one yourself. Zheng -----Original Message----- From: 柳松 [mailto:lamfeel...@126.com] Sent: Wednesday, February 18, 2009 6:19 PM To: core-user@hadoop.apache.org Subject: 7Zip compression in Hadoop Hi all! I'm working on Sogou Corpus mining with Hadoop MapReduce. However, the files are compressed in 7zip format. Does Hadoop have built-in support for 7zip files, or do I need to write a codec? Regards Song Liu in Suzhou University, China.
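If you do write a codec, the skeleton below sketches what an implementation of org.apache.hadoop.io.compress.CompressionCodec looked like around Hadoop 0.18/0.19; the SevenZipCodec name and package placement are hypothetical, the actual LZMA/7z stream handling still has to be supplied where marked, and the method set should be checked against the Hadoop version in use. Bear in mind that .7z is an archive container rather than a plain compression stream, so pre-extracting the corpus and re-compressing it with gzip before loading it into HDFS is often the simpler route. A custom codec is registered by adding its class name to the io.compression.codecs property.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.Decompressor;

// Hypothetical skeleton: only the structure is shown, not working 7z decoding.
public class SevenZipCodec implements CompressionCodec {

  public CompressionOutputStream createOutputStream(OutputStream out) throws IOException {
    throw new UnsupportedOperationException("7z compression not implemented");
  }

  public CompressionOutputStream createOutputStream(OutputStream out, Compressor compressor)
      throws IOException {
    return createOutputStream(out);
  }

  public Class<? extends Compressor> getCompressorType() {
    return null; // no reusable Compressor implementation
  }

  public Compressor createCompressor() {
    return null;
  }

  public CompressionInputStream createInputStream(InputStream in) throws IOException {
    // TODO: wrap 'in' with an LZMA/7z decoding stream and adapt it to CompressionInputStream.
    throw new UnsupportedOperationException("7z decompression not implemented yet");
  }

  public CompressionInputStream createInputStream(InputStream in, Decompressor decompressor)
      throws IOException {
    return createInputStream(in);
  }

  public Class<? extends Decompressor> getDecompressorType() {
    return null; // no reusable Decompressor implementation
  }

  public Decompressor createDecompressor() {
    return null;
  }

  public String getDefaultExtension() {
    return ".7z"; // lets input formats pick this codec by file extension
  }
}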
Sogou Corpus Decoder/Codec for Hadoop
Dear all! Can anyone provide me a decoder or codec for the Sogou Corpus? I'm analyzing the Sogou Corpus using Hadoop, but I cannot decode the .7z files. I have tried LZMA, but I don't know why it is not able to uncompress and decode the Sogou Corpus. If there is someone who, like me, is analysing this huge internet corpus, please let me know and help me figure out this problem! Thanks Song Liu in Suzhou University, China.
Re: Persistent completed jobs status not showing in jobtracker UI
Bill Au wrote: I have enabled persistent completed jobs status and can see them in HDFS. However, they are not listed in the jobtracker's UI after the jobtracker is restarted. I thought that the jobtracker would automatically look in HDFS if it does not find a job in its memory cache. What am I missing? How do I retrieve the persistent completed job status? Bill The JobTracker web UI doesn't look at persistent storage after a restart. You can access the old jobs from the job history. The History link is accessible from the web UI. -Amareshwari
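For programmatic access (rather than the web UI), the persisted status can also be fetched through the JobClient once the jobtracker is back up, since with persistence enabled the jobtracker is meant to fall back to the persisted store for jobs it no longer holds in memory. A sketch against the 0.19 API, with a made-up job ID, assuming mapred.job.tracker.persist.jobstatus.active was true when the job ran:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class CompletedJobStatus {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();                        // reads hadoop-site.xml from the classpath
    JobClient client = new JobClient(conf);
    // Placeholder job id; pass the real one as the first argument.
    JobID id = JobID.forName(args.length > 0 ? args[0] : "job_200902190000_0001");
    RunningJob job = client.getJob(id);
    if (job == null) {
      System.out.println("JobTracker has no record of " + id);
    } else {
      System.out.println(id + " complete=" + job.isComplete()
          + " successful=" + job.isSuccessful());
    }
  }
}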
Re: Overriding mapred.tasktracker.map.tasks.maximum with -jobconf
Yes. The configuration is read only when the taskTracker starts. You can see more discussion on jira HADOOP-5170 (http://issues.apache.org/jira/browse/HADOOP-5170) for making it per job. -Amareshwari jason hadoop wrote: I certainly hope it changes but I am unaware that it is in the todo queue at present. 2009/2/18 S D sd.codewarr...@gmail.com Thanks Jason. That's useful information. Are you aware of plans to change this so that the maximum values can be changed without restarting the server? John 2009/2/18 jason hadoop jason.had...@gmail.com The .maximum values are only loaded by the Tasktrackers at server start time at present, and any changes you make will be ignored. 2009/2/18 S D sd.codewarr...@gmail.com Thanks for your response Rasit. You may have missed a portion of my post. On a different note, when I attempt to pass params via -D I get a usage message; when I use -jobconf the command goes through (and works in the case of mapred.reduce.tasks=0 for example) but I get a deprecation warning. I'm using Hadoop 0.19.0 and -D is not working. Are you using version 0.19.0 as well? John On Wed, Feb 18, 2009 at 9:14 AM, Rasit OZDAS rasitoz...@gmail.com wrote: John, did you try the -D option instead of -jobconf? I had the -D option in my code, I changed it to -jobconf, and this is what I get: ... ...

Options:
  -input <path>                 DFS input file(s) for the Map step
  -output <path>                DFS output directory for the Reduce step
  -mapper <cmd|JavaClassName>   The streaming command to run
  -combiner <JavaClassName>     Combiner has to be a Java class
  -reducer <cmd|JavaClassName>  The streaming command to run
  -file <file>                  File/dir to be shipped in the Job jar file
  -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName  Optional.
  -outputformat TextOutputFormat(default)|JavaClassName  Optional.
  -partitioner JavaClassName    Optional.
  -numReduceTasks <num>         Optional.
  -inputreader <spec>           Optional.
  -cmdenv <n>=<v>               Optional. Pass env.var to streaming commands
  -mapdebug <path>              Optional. To run this script when a map task fails
  -reducedebug <path>           Optional. To run this script when a reduce task fails
  -verbose

Generic options supported are
  -conf <configuration file>    specify an application configuration file
  -D <property=value>           use value for given property
  -fs <local|namenode:port>     specify a namenode
  -jt <local|jobtracker:port>   specify a job tracker
  -files <comma separated list of files>        specify comma separated files to be copied to the map reduce cluster
  -libjars <comma separated list of jars>       specify comma separated jar files to include in the classpath.
  -archives <comma separated list of archives>  specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is bin/hadoop command [genericOptions] [commandOptions] For more details about these options: Use $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar -info

I think -jobconf is not used in v.0.19. 2009/2/18 S D sd.codewarr...@gmail.com I'm having trouble overriding the maximum number of map tasks that run on a given machine in my cluster. The default value of mapred.tasktracker.map.tasks.maximum is set to 2 in hadoop-default.xml. When running my job I passed -jobconf mapred.tasktracker.map.tasks.maximum=1 to limit map tasks to one per machine but each machine was still allocated 2 map tasks (simultaneously). The only way I was able to guarantee a maximum of one map task per machine was to change the value of the property in hadoop-site.xml. This is unsatisfactory since I'll often be changing the maximum on a per job basis. Any hints?
On a different note, when I attempt to pass params via -D I get a usage message; when I use -jobconf the command goes through (and works in the case of mapred.reduce.tasks=0 for example) but I get a deprecation warning. Thanks, John -- M. Raşit ÖZDAŞ
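Until HADOOP-5170 makes the slot counts configurable per job, the only place mapred.tasktracker.map.tasks.maximum is honoured is the hadoop-site.xml read by each tasktracker at daemon start-up. A sketch of the entry (the value 1 is just the example from this thread), followed by restarting the MapReduce daemons, e.g. with bin/stop-mapred.sh and bin/start-mapred.sh:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
  <description>Run at most one map task at a time on this tasktracker.</description>
</property>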
Re:Re: the question about the common pc?
Actually, there's a widespread misunderstanding of this "common PC". "Common PC" doesn't mean PCs which are used daily; it means that the performance of each node can be measured by a common PC's computing power. As a matter of fact, we don't use Gb Ethernet for daily PCs' communication, we don't use Linux for our document processing, and most importantly, Hadoop cannot run effectively on those daily PCs. Hadoop is designed for high-performance computing equipment, but is claimed to be fit for daily PCs. Hadoop for PCs? What a joke. -----Original Message----- From: buddha1021 buddha1...@yahoo.cn Sent: Thursday, 19 February 2009 To: core-user@hadoop.apache.org Cc: Subject: Re: the question about the common pc? And, are the nodes the PCs that people use daily on Windows, or 1U servers? buddha1021 wrote: Hi: The documentation says that Hadoop distributes the data and processing across clusters of commonly available computers. But what does "commonly available computers" mean? A 1U server? Or the PCs that people use daily on Windows? -- View this message in context: http://www.nabble.com/the-question-about-the-common-pc--tp22092022p22092038.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: the question about the common pc?
On Feb 18, 2009, at 11:43 PM, 柳松 wrote: Actually, there's a widespread misunderstanding of this "common PC". "Common PC" doesn't mean PCs which are used daily; it means that the performance of each node can be measured by a common PC's computing power. As a matter of fact, we don't use Gb Ethernet for daily PCs' communication, I certainly do. we don't use Linux for our document processing, I do. and most importantly, Hadoop cannot run effectively on those daily PCs. Maybe your PC is under-spec'd? Hadoop is designed for high-performance computing equipment, but is claimed to be fit for daily PCs. Our students run it on Pentium III's with 20GB HDDs. Try finding a new laptop with specs that low. Hadoop for PCs? What a joke. The truth is that Hadoop scales to the gear you have. If you throw a bunch of Windows desktops at it, it'll perform like a bunch of Windows desktops. If you run it on the student test cluster, it'll perform like Java on PIIIs. If you run it on a new high-performance cluster ... well, you get the point. If you want to run Hadoop for development work, I'd say you want to use your desktop. If you want to run Hadoop for production work, I'd recommend a production environment - decently powered 1U Linux servers with large disks (or whatever the recommendation is on the wiki). Brian -----Original Message----- From: buddha1021 buddha1...@yahoo.cn Sent: Thursday, 19 February 2009 To: core-user@hadoop.apache.org Cc: Subject: Re: the question about the common pc? And, are the nodes the PCs that people use daily on Windows, or 1U servers? buddha1021 wrote: Hi: The documentation says that Hadoop distributes the data and processing across clusters of commonly available computers. But what does "commonly available computers" mean? A 1U server? Or the PCs that people use daily on Windows? -- View this message in context: http://www.nabble.com/the-question-about-the-common-pc--tp22092022p22092038.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Re:Re: the question about the common pc?
On Thu, 2009-02-19 at 13:43 +0800, 柳松 wrote: Hadoop is designed for High performance computing equipment, but claimed to be fit for daily pcs. The phrase High Performance Computing equipment makes me think of infiniband, fibre all over the place etc. Hadoop doesn't need that, it runs well on standard pc hardware - i.e. no special hardware you couldn't find in a standard pc. That doesn't mean you should run it on pcs that are being used for other things though. I found that hadoop ran ok on fairly old hardware - a load of old power-pc macs (running linux) churned through some jobs quickly, and I've actually run it on people's office machines during the nights (not on Windows). I did end up having to add an extra switch in for the part of the network that was only 100 mbps to get the throughput though. Of course ideally you would be running it on a rack of 1u servers, but that's still normally standard pc hardware.
Re: Re:Re: the question about the common pc?
When I said "the PCs that people use daily on Windows", I wanted to specify the common hardware (not the OS); I don't mean Hadoop running on Windows! I mean Hadoop running on common PC hardware, certainly with Linux as the OS! Tim Wintle wrote: On Thu, 2009-02-19 at 13:43 +0800, 柳松 wrote: Hadoop is designed for high-performance computing equipment, but is claimed to be fit for daily PCs. The phrase "High Performance Computing equipment" makes me think of infiniband, fibre all over the place etc. Hadoop doesn't need that, it runs well on standard pc hardware - i.e. no special hardware you couldn't find in a standard pc. That doesn't mean you should run it on pcs that are being used for other things though. I found that hadoop ran ok on fairly old hardware - a load of old power-pc macs (running linux) churned through some jobs quickly, and I've actually run it on people's office machines during the nights (not on Windows). I did end up having to add an extra switch in for the part of the network that was only 100 mbps to get the throughput though. Of course ideally you would be running it on a rack of 1u servers, but that's still normally standard pc hardware. -- View this message in context: http://www.nabble.com/the-question-about-the-common-pc--tp22092022p22094601.html Sent from the Hadoop core-user mailing list archive at Nabble.com.