First, thank you very much for the reply!

So, this is my input:

a\tb
b\tc
c\ta

In other words, the map function initially receives the whole string a\tb as
its value, and it processes my input data correctly. (I verified this by
temporarily changing my reduce function to simply emit the merged pairs it
receives from the map stage.)

However, when I tried to cross-join the cases where I have both to_'s and
from_'s (for example, a reducer gets the pair <a, to_b ; from_c> ) by
splitting each value from the reducer's iterator with split("_"), it just
didn't work, even though without this additional logic the reducer DOES
output these values <a, to_b ; from_c>, so it GETS them. The same split
works just fine for keys in the reduce function, i.e. it discriminates a
composite key like "a_b" from a simple key like "a". My guess is that Hadoop
sorts the values for a reducer behind the scenes and this somehow messes up
the original character encoding. I'm using the Text class as a serializable
wrapper for my strings; I guess there is no other option for it?
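Just to rule out the parsing itself, here is a minimal, Hadoop-free sketch of the split logic I expect to hold (plain Java, no Text involved; the literal strings are taken from the examples above):

```java
public class SplitCheck {
    public static void main(String[] args) {
        // map-side split: input lines are tab-delimited arcs like "a\tb"
        String line = "a\tb";
        String[] arc = line.split("\t");

        // reduce-side split: values look like "to_b" or "from_a"
        String value = "to_b";
        String[] tokens = value.split("_");

        // prints: a b to b
        System.out.println(arc[0] + " " + arc[1] + " "
                + tokens[0] + " " + tokens[1]);
    }
}
```

If this behaves the same way inside the reducer, the problem is likely in what the iterator actually hands back rather than in split() itself.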

I want to try getting rid of the composite keys first (the last
output.collect in the map function) to make things a bit simpler, and then
test it again.


On Fri, Jul 16, 2010 at 9:16 AM, Jeff Bean <jwfb...@cloudera.com> wrote:

> Is the tab the delimiter between records or between keys and values on the
> input?
>
> in other words does the input file look like this:
>
> a\tb
> b\tc
> c\ta
>
> or does it look like this:
>
> a   b\tb   c\tc   a\t
>
> ?
>
> Jeff
>
> On Thu, Jul 15, 2010 at 6:18 PM, Nikolay Korovaiko <korovai...@gmail.com> wrote:
>
> > Hi everyone,
> >
> > I hope this is the right place for my question. If not, please feel free
> > to ignore it ;) and I'm sorry for any inconvenience caused :(
> >
> > I'm writing a simple program for enumerating triangles in directed graphs
> > for my project. First, for each input arc (e.g. a b, b c, c a; note: a tab
> > symbol serves as the delimiter) I want my map function to output the
> > following pairs ([a, to_b], [b, from_a], [a_b, -1]):
> >
> >  public void map(LongWritable key, Text value,
> >
> >                OutputCollector<Text, Text> output,
> >
> >                Reporter reporter) throws IOException {
> >
> >  String line = value.toString();
> >
> >  String[] tokens = line.split("\t"); // input is tab-delimited
> >
> >  output.collect(new Text(tokens[0]), new Text("to_"+tokens[1]));
> >
> >  output.collect(new Text(tokens[1]), new Text("from_"+tokens[0]));
> >
> >  output.collect(new Text(tokens[0]+"_"+tokens[1]), new Text("-1"));
> >
> > }
> >
> > Now my reduce function is supposed to cross-join all pairs that have both
> > to_'s and from_'s and to simply propagate any other pairs whose keys
> > contain "_".
> >
> >      public void reduce(Text key, Iterator<Text> values,
> >
> >                   OutputCollector<Text, Text> output,
> >
> >                   Reporter reporter) throws IOException {
> >
> >  String key_s = key.toString();
> >
> >  if (key_s.indexOf("_")>0)
> >
> >      output.collect(key, new Text("completed"));
> >
> >   else {
> >
> >          HashMap<String, ArrayList<String>> lists =
> >              new HashMap<String, ArrayList<String>>();
> >
> >          while (values.hasNext()) {
> >
> >              String line = values.next().toString();
> >
> >              String[] tokens = line.split("_");
> >
> >              if (!lists.containsKey(tokens[0])) {
> >
> >                   lists.put(tokens[0], new ArrayList<String>());
> >
> >              }
> >           lists.get(tokens[0]).add(tokens[1]);
> >
> >          }
> >
> >          // guard against keys that have only to_'s or only from_'s;
> >          // otherwise lists.get(...) returns null and the loop throws an NPE
> >          if (lists.containsKey("to") && lists.containsKey("from"))
> >
> >              for (String t : lists.get("to"))
> >
> >                  for (String f : lists.get("from"))
> >
> >                      output.collect(new Text(t+"_"+f), key);
> >
> >
> >  }
> >
> > }
> >
> > And this is where the most exciting stuff happens: tokens[1] yields an
> > ArrayIndexOutOfBoundsException. If you scroll up, you can see that by
> > this point the iterator should give values like "to_a", "from_b", "to_b",
> > etc. When I just output these values, everything looks OK and I get
> > "to_a", "from_b". But split() doesn't work at all; moreover,
> > line.length() is always 1 and indexOf("_") returns -1! The very same
> > indexOf WORKS PERFECTLY for the keys, where we have pairs whose keys
> > contain "_" and look like "a_b", "b_c".
> > I'm really puzzled by all this. MapReduce is supposed to save lives by
> > making everything simple. Instead, I spent several hours just to spot
> > this...
> >
> > I'd really appreciate your help, guys! Thanks in advance!
> >
>
