Re: How are intermediate key/value pairs materialized between map and reduce?

2010-02-24 Thread Ed Mazur
As you noticed, your map tasks are spilling three times as many
records as they are outputting. In general, if the map output buffer
is large enough to hold all records in memory, these values will be
equal. If there isn't enough room, as was the case with your job, the
map task makes additional intermediate spills to disk.

To fix this, you can try tuning the per-job configurables io.sort.mb
and io.sort.record.percent. Look at the counters of a few map tasks to
get an idea of how much data (io.sort.mb) and how many records
(io.sort.record.percent) they produce.
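
For reference: with ~300 GB of map output spread over 40 map tasks (~7.5 GB
each) and io.sort.mb = 100, each task spills dozens of times, and merging
those spills back together with io.sort.factor = 10 takes roughly two extra
passes over the data, so each record ends up written about three times, which
would account for the factor of 3 you are seeing. A minimal sketch of setting
these two properties per job (0.20-style property names; the values are only
illustrative starting points, not recommendations):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuningDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // In-memory sort buffer, in MB; larger means fewer spills per map task.
    conf.setInt("io.sort.mb", 200);
    // Fraction of that buffer reserved for per-record accounting information.
    conf.setFloat("io.sort.record.percent", 0.05f);

    Job job = new Job(conf, "spill-tuning-example");
    // ... set mapper, reducer, input/output paths as usual, then:
    // job.waitForCompletion(true);
  }
}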

Ed

On Wed, Feb 24, 2010 at 2:45 AM, Tim Kiefer tim-kie...@gmx.de wrote:
 Sure,
 I see:
 Map input records: 10,000
 Map output records: 600,000
 Map output bytes: 307,216,800,000  (each record is about 500 KB - that fits
 the application and is to be expected)

 Map spilled records: 1,802,965 (ahhh... now that you ask for it - here there
 also is a factor of 3 between output and spilled).

 So - the question now is: why are three times as many records spilled as
 are actually produced by the mappers?

 In my map function, I do not perform any additional file writing besides the
 context.write() for the intermediate records.

 Thanks, Tim

 Am 24.02.2010 05:28, schrieb Amogh Vasekar:

 Hi,
 Can you let us know what the values are for:
 Map input records
 Map spilled records
 Map output bytes
 Are any side-effect files written?

 Thanks,
 Amogh


 On 2/23/10 8:57 PM, Tim Kiefer tim-kie...@gmx.de wrote:

 No... 900GB is in the map column. Reduce adds another ~70GB of
 FILE_BYTES_WRITTEN and the total column consequently shows ~970GB.

 Am 23.02.2010 16:11, schrieb Ed Mazur:


 Hi Tim,

 I'm guessing a lot of these writes are happening on the reduce side.
 On the JT web interface, there are three columns: map, reduce,
 overall. Is the 900GB figure from the overall column? The value in the
 map column will probably be closer to what you were expecting. There
 are writes on the reduce side too during the shuffle and multi-pass
 merge.

 Ed

 2010/2/23 Tim Kiefer tim-kie...@gmx.de:



 Hi Gang,

 thanks for your reply.

 To clarify: I look at the statistics through the job tracker. In the
 web interface for my job I have columns for map, reduce and total. What I
 was referring to is map - i.e. I see FILE_BYTES_WRITTEN = 3 * Map
 Output Bytes in the map column.

 About the replication factor: I would expect the exact same thing -
 changing to 6 has no influence on FILE_BYTES_WRITTEN.

 About the sorting: I have io.sort.mb = 100 and io.sort.factor = 10.
 Furthermore, I have 40 mappers and map output data is ~300GB. I can't
 see how that ends up in a factor of 3?

 - tim

 Am 23.02.2010 14:39, schrieb Gang Luo:



 Hi Tim,
 the intermediate data is materialized to the local file system. Before it
 is available to reducers, mappers sort it. If the buffer
 (io.sort.mb) is too small for the intermediate data, multi-phase sorting
 happens, which means the same bytes are read and written more than once.

 Besides, are you looking at the statistics per mapper through the job
 tracker, or just the information output when a job finishes? If you look at
 the information given out at the end of the job, note that these are overall
 statistics which include sorting on the reduce side. They may also include
 the amount of data written to HDFS (I am not 100% sure).

 And, FILE_BYTES_WRITTEN has nothing to do with the replication
 factor. I think if you change the factor to 6, FILE_BYTES_WRITTEN will still
 be the same.

  -Gang


 Hi there,

 can anybody help me out with a (most likely) simple point of confusion.

 I am wondering how intermediate key/value pairs are materialized. I
 have a job where the map phase produces 600,000 records and map output 
 bytes
 is ~300GB. What I thought (up to now) is that these 600,000 records, i.e.,
 300GB, are materialized locally by the mappers and that later on reducers
 pull these records (based on the key).
 What I see (and cannot explain) is that the FILE_BYTES_WRITTEN counter
 is as high as ~900GB.

 So - where does the factor 3 come from between Map output bytes and
 FILE_BYTES_WRITTEN??? I thought about the replication factor of 3 in the
 file system - but that should be HDFS only?!

 Thanks
 - tim










Re: Who can tell the meaning from fsck

2010-02-24 Thread Ravi Phulari
Moving to common-u...@h.a.o
Answers inline.

On 2/24/10 12:25 AM, jian yi eyj...@gmail.com wrote:

Hi All,


Running hadoop fsck / will give you a summary of the current HDFS status,
including some useful information:

 Minimally replicated blocks:   51224 (100.0 %)
These are the blocks that meet the minimum replication factor; you can set the
minimum block replication factor in hdfs-site.xml. If there are any blocks which
are not minimally replicated, the NameNode will try to replicate them.

Over-replicated blocks:    0 (0.0 %)
These are blocks with excess replicas. Over-replicated blocks are harmless
and could have been created by running the Balancer.

 Under-replicated blocks:   0 (0.0 %)
Blocks which have fewer replicas than the minimum replication factor.

 Mis-replicated blocks: 7 (0.013665469 %)
Blocks that do not satisfy the block placement policy.

 Default replication factor:    3
Default block replication factor, set using:
<name>dfs.replication</name>
<value>3</value>

 Average block replication: 3.0
The average number of replicas per block across the file system.

 Missing replicas:  0 (0.0 %)
If there are any missing block replicas then that count is listed here.

Number of data-nodes:  83
Number of data nodes in your cluster

Number of racks:   6
Number of Racks in your cluster


You can configure block replication by using the following config settings in
hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Default block replication.
    The actual number of replications can be specified when the file is created.
    The default is used if replication is not specified in create time.
  </description>
</property>

<property>
  <name>dfs.replication.max</name>
  <value>512</value>
  <description>Maximal block replication.
  </description>
</property>

<property>
  <name>dfs.replication.min</name>
  <value>1</value>
  <description>Minimal block replication.
  </description>
</property>
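
Since the description above notes that the replication factor can also be set
per file, here is a minimal sketch of doing that from the Java FileSystem API
(the paths below are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Change the replication factor of an existing file.
    fs.setReplication(new Path("/user/example/data.seq"), (short) 2);

    // Or request a specific replication factor at create time:
    // create(path, overwrite, bufferSize, replication, blockSize)
    fs.create(new Path("/user/example/new-file"), true, 4096, (short) 3,
        64L * 1024 * 1024).close();
  }
}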

--
Ravi





Re: How are intermediate key/value pairs materialized between map and reduce?

2010-02-24 Thread Amogh Vasekar
Hi,
Map spilled records: 1,802,965 (ahhh... now that you ask for it - here there 
also is a factor of 3 between output and spilled).
Exactly what I suspected :)
Ed has already provided some pointers as to why this is the case. You should
try to minimize this number as much as possible, since it, along with the
Reduce Shuffle Bytes, degrades your job performance by a considerable amount.
To understand the internals and what Ed said, I would strongly recommend going
through
http://www.slideshare.net/gnap/berkeley-performance-tuning
by a few fellow Yahoos. There is a detailed explanation of why map-side spills
occur and how one can minimize them :)

Thanks,
Amogh

On 2/24/10 1:15 PM, Tim Kiefer tim-kie...@gmx.de wrote:

Sure,
I see:
Map input records: 10,000
Map output records: 600,000
Map output bytes: 307,216,800,000  (each record is about 500 KB - that
fits the application and is to be expected)

Map spilled records: 1,802,965 (ahhh... now that you ask for it - here
there also is a factor of 3 between output and spilled).

So - the question now is: why are three times as many records spilled as
are actually produced by the mappers?

In my map function, I do not perform any additional file writing besides
the context.write() for the intermediate records.

Thanks, Tim

Am 24.02.2010 05:28, schrieb Amogh Vasekar:
 Hi,
 Can you let us know what the values are for:
 Map input records
 Map spilled records
 Map output bytes
 Are any side-effect files written?

 Thanks,
 Amogh





Want to create custom inputformat to read from solr

2010-02-24 Thread Rakhi Khatwani
Hi,
Has anyone tried creating a custom InputFormat which reads from a
Solr index for processing using MapReduce? Is it possible, and
how?
Regards,
Raakhi


Re: Is it possible to run multiple mapreduce jobs from within the same application

2010-02-24 Thread Steve Loughran

Raymond Jennings III wrote:

In other words:  I have a situation where I want to feed the output from the first iteration of my mapreduce 
job to a second iteration and so on.  I have a for loop in my main method to setup the job 
parameters and to run it through all iterations but on about the third run the Hadoop processes lose their 
association with the 'jps' command and then weird things start happening.  I remember reading somewhere about 
chaining - is that what is needed?  I'm not sure what causes jps to not report the hadoop 
processes even though they are still active as can be seen with the ps command.  Thanks.  (This 
is on version 0.20.1)


  


Yes, here is something that does this to produce a PageRank-style ranking of
things:


http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/extras/citerank/src/org/smartfrog/services/hadoop/benchmark/citerank/CiteRank.java?revision=7728&view=markup
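
For the record, a bare-bones sketch of such a loop with the 0.20 mapreduce API,
feeding each iteration's output back in as the next iteration's input (class
names and paths below are placeholders, not taken from CiteRank):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path("/data/rank/iter0");          // placeholder paths

    for (int i = 1; i <= 5; i++) {
      Path output = new Path("/data/rank/iter" + i);

      Job job = new Job(conf, "rank-iteration-" + i);
      job.setJarByClass(IterativeDriver.class);
      // job.setMapperClass(...); job.setReducerClass(...);
      // job.setOutputKeyClass(...); job.setOutputValueClass(...);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, output);

      if (!job.waitForCompletion(true)) {
        throw new RuntimeException("Iteration " + i + " failed");
      }
      input = output;   // this run's output becomes the next run's input
    }
  }
}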


Re: Want to create custom inputformat to read from solr

2010-02-24 Thread Rekha Joshi
The last I heard, there were some discussions about instead creating the Solr
index using Hadoop MapReduce rather than pushing the Solr index into HDFS and
so on. SOLR-1045 and SOLR-1301 can provide you more info.
Cheers,
/R


On 2/24/10 4:23 PM, Rakhi Khatwani rkhatw...@gmail.com wrote:

Hi,
Has anyone tried creating a custom InputFormat which reads from a
Solr index for processing using MapReduce? Is it possible, and
how?
Regards,
Raakhi



Re: CDH2 or Apache Hadoop - Official Debian packages

2010-02-24 Thread Thomas Koch
Ananth,
 Just wanted to get the group's general feeling on what the preferred distro
 is and why? Obviously assuming one didn't have a service agreement with
 Cloudera.
There'll shortly be a third alternative: the Debian package of hadoop is in
the Debian NEW queue[1] and will hopefully pass through it in a couple of days
to enter Debian unstable. A preview is available from the unofficial repository
of the Debian Java Team.[2][3]
The Debian package took the Cloudera packaging as a model, with some slight
changes:

- no version namespace, everything is called just hadoop, not hadoop-0.18 
or hadoop-0.20 as in the cloudera package

- some contributions are missing due to lack of manpower or missing 
dependencies in Debian

- the native C++ hadoop code is not in the package due to lack of manpower

The advantage of the Debian package is a more standards-conformant integration
with Debian.

[1] http://ftp-master.debian.org/new.html
[2] put this in /etc/apt/sources.list:
deb http://pkg-java.alioth.debian.org unstable/all/
[3] http://wiki.debian.org/Teams/JavaPackaging

Best regards,

Thomas Koch, http://www.koch.ro


Re: Wrong FS

2010-02-24 Thread Marc Farnum Rendino
On Tue, Feb 23, 2010 at 9:38 AM, Edson Ramiro erlfi...@gmail.com wrote:

 Thanks Marc and Bill

 I solved this Wrong FS problem editing the /etc/hosts as Marc said.

 Now, the cluster is working ok  : ]


Great; 'preciate the confirmation!

- Marc


How do I get access to the Reporter within Mapper?

2010-02-24 Thread Raymond Jennings III
I am using the non-deprecated Mapper.  Can I obtain it from the Context 
somehow?  Anyone have an example of this?  Thanks.


  


Re: How do I get access to the Reporter within Mapper?

2010-02-24 Thread Jeff Zhang
You can use the Context for incrementing counters or for reporting progress
(heartbeat).
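
For example, with the new (0.20) API the Context passed to map() covers what
the old Reporter did. A minimal sketch (the counter group and name below are
made up):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Equivalent of Reporter.incrCounter(group, name, amount).
    context.getCounter("MyApp", "RECORDS_SEEN").increment(1);

    // Equivalent of Reporter.setStatus()/progress(); keeps long-running tasks alive.
    context.setStatus("processing record at offset " + key.get());
    context.progress();

    context.write(value, new LongWritable(1));
  }
}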


On Wed, Feb 24, 2010 at 7:31 AM, Raymond Jennings III raymondj...@yahoo.com
 wrote:

 I am using the non-deprecated Mapper.  Can I obtain it from the Context
 somehow?  Anyone have an example of this?  Thanks.






-- 
Best Regards

Jeff Zhang


Running C++ WordCount

2010-02-24 Thread Ratner, Alan S (IS)
I am trying to run the wordcount example in c/c++ given on
http://wiki.apache.org/hadoop/C%2B%2BWordCount with Hadoop 0.18.3 but
when I run Ant using the specified command ant -Dcompile.c++=yes
examples I get a BUILD FAILED error ... Cannot run program
c:\...\hadoop-0.18.3\src\c++\pipes\configure (in directory
...\hadoop-0.18.3\build\c++-build\Windows_XP-x86-32\pipes):
CreateProcess error=193, %1 is not a valid Win32 application

Question 1: Where in the directory path do I put the wordcount code and
should it get a .cpp extension or something else?

Question 2: Where should I be when I execute the Ant command?  Ant
complains that it cannot find build.xml unless I am in the
...\hadoop-0.18.3 directory.

Question 3: The include statements in the wordcount code are of the form
#include "hadoop/xxx.hh".  These include files reside both in
...\hadoop-0.18.3\c++\Linux-i386-32\include\hadoop and in
...\hadoop-0.18.3\c++\Linux-amd64-64\include\hadoop.  Does Ant produce
both a 32-bit and a 64-bit version of my compiled code?

Question 4: Where does Ant put the compiled code?


Thanks,
Alan Ratner




java.net.SocketException: Network is unreachable

2010-02-24 Thread neo anderson

While running an example program ('hadoop jar *example*jar pi 2 2'), I
encounter a 'Network is unreachable' problem (at
$HADOOP_HOME/logs/userlogs/.../stderr), as below:

Exception in thread main java.io.IOException: Call to /127.0.0.1:port
failed on local exception: java.net.SocketException: Network is unreachable
at org.apache.hadoop.ipc.Client.wrapException(Client.java:774)
at org.apache.hadoop.ipc.Client.call(Client.java:742)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at org.apache.hadoop.mapred.$Proxy0.getProtocolVersion(Unknown
Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:346)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:383)
at org.apache.hadoop.mapred.Child.main(Child.java:64)
Caused by: java.net.SocketException: Network is unreachable
at sun.nio.ch.Net.connect(Native Method)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
at
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:304)
at
org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:859)
at org.apache.hadoop.ipc.Client.call(Client.java:719)
... 6 more

Initially, it seemed to me to be a firewall issue, but after disabling
iptables the example program still cannot execute correctly.

Commands used to disable iptables:
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -X
iptables -F

When starting up the hadoop cluster (start-dfs.sh and start-mapred.sh), it looks
like the namenode was correctly started up, because the namenode log contains:

... org.apache.hadoop.net.NetworkTopology: Adding a new node:
/default-rack/111.222.333.5:10010
... org.apache.hadoop.net.NetworkTopology: Adding a new node:
/default-rack/111.222.333.4:10010
... org.apache.hadoop.net.NetworkTopology: Adding a new node:
/default-rack/111.222.333.3:10010

Also, in the datanode logs:
...
INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/111.222.333.4:34539, dest: /111.222.333.5:50010, bytes: 4, op: HDFS_WRITE,
...
INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/111.222.333.4:51610, dest: /111.222.333.3:50010, bytes: 118, op:
HDFS_WRITE, cliID: ...
...

The command 'hadoop fs -ls' can list the data uploaded to HDFS without a
problem, and jps shows that the necessary processes are running.

name node:
7710 SecondaryNameNode
7594 NameNode
8038 JobTracker

data nodes:
3181 TaskTracker
3000 DataNode

Environment: Debian squeeze, hadoop 0.20.1, jdk 1.6.x

I searched online and couldn't find the root cause. Is there anything
that might cause such an issue? Or any place where I might be able to
check for more detailed information?

Thanks for help.


-- 
View this message in context: 
http://old.nabble.com/java.net.SocketException%3A-Network-is-unreachable-tp27714253p27714253.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: java.net.SocketException: Network is unreachable

2010-02-24 Thread Todd Lipcon
Hi Neo,

See this bug:

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=560044

as well as the discussion here:

http://issues.apache.org/jira/browse/HADOOP-6056
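
The workaround those reports usually converge on (please verify against the
tickets above; this is only a sketch) is to force the JVM onto IPv4, e.g. in
conf/hadoop-env.sh on every node, or to relax Debian's IPv6-only socket default
that some squeeze installs of that era shipped with:

# conf/hadoop-env.sh: make Hadoop daemons and tasks prefer IPv4 sockets
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true $HADOOP_OPTS"

# alternatively, at the OS level:
# sysctl -w net.ipv6.bindv6only=0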

Thanks
-Todd

On Wed, Feb 24, 2010 at 9:16 AM, neo anderson
javadeveloper...@yahoo.co.uk wrote:

 While running an example program ('hadoop jar *example*jar pi 2 2'), I
 encounter a 'Network is unreachable' problem (at
 $HADOOP_HOME/logs/userlogs/.../stderr), as below:

 Exception in thread main java.io.IOException: Call to /127.0.0.1:port
 failed on local exception: java.net.SocketException: Network is unreachable
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:774)
        at org.apache.hadoop.ipc.Client.call(Client.java:742)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at org.apache.hadoop.mapred.$Proxy0.getProtocolVersion(Unknown
 Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:346)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:383)
        at org.apache.hadoop.mapred.Child.main(Child.java:64)
 Caused by: java.net.SocketException: Network is unreachable
        at sun.nio.ch.Net.connect(Native Method)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
        at
 org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
        at
 org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:304)
        at
 org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:859)
        at org.apache.hadoop.ipc.Client.call(Client.java:719)
        ... 6 more

 Initially, it seemed to me to be a firewall issue, but after disabling
 iptables the example program still cannot execute correctly.

 Commands used to disable iptables:
 iptables -P INPUT ACCEPT
 iptables -P FORWARD ACCEPT
 iptables -P OUTPUT ACCEPT
 iptables -X
 iptables -F

 When starting up hadoop cluster (start-dfs.sh and start-mapred.sh), it looks
 like the namenode was correctly started up because the log in name node
 contains information

 ... org.apache.hadoop.net.NetworkTopology: Adding a new node:
 /default-rack/111.222.333.5:10010
 ... org.apache.hadoop.net.NetworkTopology: Adding a new node:
 /default-rack/111.222.333.4:10010
 ... org.apache.hadoop.net.NetworkTopology: Adding a new node:
 /default-rack/111.222.333.3:10010

 Also, in datanode
 ...
 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
 /111.222.333.4:34539, dest: /111.222.333.5:50010, bytes: 4, op: HDFS_WRITE,
 ...
 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
 /111.222.333.4:51610, dest: /111.222.333.3:50010, bytes: 118, op:
 HDFS_WRITE, cliID: ...
 ...

 The command 'hadoop fs -ls' can list the data uploaded to the hdfs without a
 problem. And jps shows the necessary processes are running.

 name node:
 7710 SecondaryNameNode
 7594 NameNode
 8038 JobTracker

 data nodes:
 3181 TaskTracker
 3000 DataNode

 Environment: Debian squeeze, hadoop 0.20.1, jdk 1.6.x

 I searched online and couldn't find the root cause. Is there anything
 that might cause such an issue? Or any place where I might be able to
 check for more detailed information?

 Thanks for help.


 --
 View this message in context: 
 http://old.nabble.com/java.net.SocketException%3A-Network-is-unreachable-tp27714253p27714253.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: java.net.SocketException: Network is unreachable

2010-02-24 Thread Alvaro Cabrerizo
Hi:

Hope this helps: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=560142

Regards.

2010/2/24 neo anderson javadeveloper...@yahoo.co.uk


 While running an example program ('hadoop jar *example*jar pi 2 2'), I
 encounter a 'Network is unreachable' problem (at
 $HADOOP_HOME/logs/userlogs/.../stderr), as below:

 Exception in thread main java.io.IOException: Call to /127.0.0.1:port
 failed on local exception: java.net.SocketException: Network is unreachable
at org.apache.hadoop.ipc.Client.wrapException(Client.java:774)
at org.apache.hadoop.ipc.Client.call(Client.java:742)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at org.apache.hadoop.mapred.$Proxy0.getProtocolVersion(Unknown
 Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:346)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:383)
at org.apache.hadoop.mapred.Child.main(Child.java:64)
 Caused by: java.net.SocketException: Network is unreachable
at sun.nio.ch.Net.connect(Native Method)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
at

 org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:192)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
at
 org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:304)
at
 org.apache.hadoop.ipc.Client$Connection.access$1700(Client.java:176)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:859)
at org.apache.hadoop.ipc.Client.call(Client.java:719)
... 6 more

 Initially, it seemed to me to be a firewall issue, but after disabling
 iptables the example program still cannot execute correctly.

 Commands used to disable iptables:
 iptables -P INPUT ACCEPT
 iptables -P FORWARD ACCEPT
 iptables -P OUTPUT ACCEPT
 iptables -X
 iptables -F

 When starting up hadoop cluster (start-dfs.sh and start-mapred.sh), it
 looks
 like the namenode was correctly started up because the log in name node
 contains information

 ... org.apache.hadoop.net.NetworkTopology: Adding a new node:
 /default-rack/111.222.333.5:10010
 ... org.apache.hadoop.net.NetworkTopology: Adding a new node:
 /default-rack/111.222.333.4:10010
 ... org.apache.hadoop.net.NetworkTopology: Adding a new node:
 /default-rack/111.222.333.3:10010

 Also, in datanode
 ...
 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
 /111.222.333.4:34539, dest: /111.222.333.5:50010, bytes: 4, op: HDFS_WRITE,
 ...
 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
 /111.222.333.4:51610, dest: /111.222.333.3:50010, bytes: 118, op:
 HDFS_WRITE, cliID: ...
 ...

 The command 'hadoop fs -ls' can list the data uploaded to the hdfs without
 a
 problem. And jps shows the necessary processes are running.

 name node:
 7710 SecondaryNameNode
 7594 NameNode
 8038 JobTracker

 data nodes:
 3181 TaskTracker
 3000 DataNode

 Environment: Debian squeeze, hadoop 0.20.1, jdk 1.6.x

 I searched online and couldn't find the root cause. Is there anything
 that might cause such an issue? Or any place where I might be able to
 check for more detailed information?

 Thanks for help.


 --
 View this message in context:
 http://old.nabble.com/java.net.SocketException%3A-Network-is-unreachable-tp27714253p27714253.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: CDH2 or Apache Hadoop - Official Debian packages

2010-02-24 Thread Allen Wittenauer



On 2/24/10 4:45 AM, Thomas Koch tho...@koch.ro wrote:
 There'll shortly be a third alternative:

There are already three:

- Apache
- Cloudera
- Yahoo!

with several others in development.

For all intents and purposes, the Debian package sounds just like a
re-packaging of the Apache distribution in .deb form.


 - no version namespace, everything is called just hadoop, not hadoop-0.18
 or hadoop-0.20 as in the cloudera package

... and thus making upgrades really hard and not suitable for anything
real.



Hadoop key mismatch

2010-02-24 Thread Larry Homes
Hello,

I am trying to sort some values by using a simple map and reduce
without any processing, but I think I messed up my data types somehow.

Rather than try to paste code in an email, I have described the
problem and pasted all the code (nicely formatted) here:
http://www.coderanch.com/t/484435/Distributed-Java/java/Hadoop-key-mismatch

Thanks


Seattle Hadoop/Scalability/NoSQL Meetup Tonight!

2010-02-24 Thread Bradford Stephens
The Seattle Hadoop/Scalability/NoSQL (yeah, we vary the title) meetup
is tonight! We're going to have a guest speaker from MongoDB :)

As always, it's at the University of Washington, Allen Computer
Science building, Room 303 at 6:45pm. You can find a map here:
http://www.washington.edu/home/maps/southcentral.html?cse

If you can, please RSVP here (not required, but very nice):
http://www.meetup.com/Seattle-Hadoop-HBase-NoSQL-Meetup/

--
http://www.drawntoscalehq.com --  The intuitive, cloud-scale data
solution. Process, store, query, search, and serve all your data.

http://www.roadtofailure.com -- The Fringes of Scalability, Social
Media, and Computer Science