Hi,

You could also try:

String[] tokens = line.split("\\s+");

This too is just from eyeballing it. Do let us know.

Regards,
CVK
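For reference, here is a quick standalone check (plain Java, nothing Hadoop-specific; the class name SplitDemo is made up for the example) of how split(" "), split("\t"), and split("\\s+") each behave on a tab-delimited record like a\tb:

import java.util.Arrays;

public class SplitDemo {
    public static void main(String[] args) {
        String line = "a\tb";

        // A literal space does not match the tab, so the whole line stays
        // in tokens[0] and there is no tokens[1].
        System.out.println(Arrays.toString(line.split(" ")));    // one element

        // "\t" matches the tab and yields the two tokens you expect.
        System.out.println(Arrays.toString(line.split("\t")));   // [a, b]

        // The regex \s+ matches any run of whitespace (spaces or tabs),
        // so it also yields [a, b] and tolerates stray spaces.
        System.out.println(Arrays.toString(line.split("\\s+"))); // [a, b]
    }
}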
On Jul 16, 2010, at 1:33 PM, Jeff Bean wrote:

Whitespace characters are funny. You showed me this code in the mapper:

String[] tokens = line.split(" ");

which doesn't actually match a tab; that would be line.split("\t"). This would still execute, and the keys and values would look right going into the reducer, but you might not catch that value substrings are appended to the key because the split didn't happen where you expected. This is just from eyeballing the code. Let me know if I'm on the right track.

Jeff

On Fri, Jul 16, 2010 at 10:16 AM, Nikolay Korovaiko <korovai...@gmail.com> wrote:

> First, thank you very much for the reply!
>
> So, this is my input:
>
> a\tb
> b\tc
> c\ta
>
> In other words, a map function initially receives the whole string a\tb as
> its value, and it processes my input data correctly. I actually changed my
> reduce function to simply emit the merged pairs from the map's input to
> check this. However, when I tried to cross join the cases where I have both
> to_'s and from_'s (for example, a reducer gets the pair <a, to_b ; from_c>)
> by splitting each value from the reducer's iterator with split("_"), it just
> didn't work. Even though, without this additional logic, the reducer DOES
> output these values <a, to_b ; from_c>, so it GETS them. The same split
> works just fine for keys in the reduce function, i.e. it distinguishes
> cases with a composite key like "a_b" from a simple key like "a". My guess
> is that Hadoop sorts the values for a reducer behind the scenes and this
> somehow messes up the original character encoding. I'm using the Text class
> as a serializable wrapper for my strings. I guess there is no other option
> for that?)))
>
> I want to try getting rid of the composite keys first (the last
> output.collect in the map function) to make things a bit simpler and then
> test again.
>
> On Fri, Jul 16, 2010 at 9:16 AM, Jeff Bean <jwfb...@cloudera.com> wrote:
>
>> Is the tab the delimiter between records, or between keys and values
>> within the input?
>>
>> In other words, does the input file look like this:
>>
>> a\tb
>> b\tc
>> c\ta
>>
>> or does it look like this:
>>
>> a b\tb c\tc a\t
>>
>> ?
>>
>> Jeff
>>
>> On Thu, Jul 15, 2010 at 6:18 PM, Nikolay Korovaiko <korovai...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I hope this is the right place for my question. If not, please feel free
>>> to ignore it ;) and I'm sorry for any inconvenience :(
>>>
>>> I'm writing a simple program for enumerating triangles in directed graphs
>>> for my project. First, for each input arc (e.g. a b, b c, c a; note: a tab
>>> character serves as the delimiter) I want my map function to output the
>>> following pairs ([a, to_b], [b, from_a], [a_b, -1]):
>>>
>>> public void map(LongWritable key, Text value,
>>>                 OutputCollector<Text, Text> output,
>>>                 Reporter reporter) throws IOException {
>>>     String line = value.toString();
>>>     String[] tokens = line.split(" ");
>>>     output.collect(new Text(tokens[0]), new Text("to_" + tokens[1]));
>>>     output.collect(new Text(tokens[1]), new Text("from_" + tokens[0]));
>>>     output.collect(new Text(tokens[0] + "_" + tokens[1]), new Text("-1"));
>>> }
>>>
>>> Now my reduce function is supposed to cross join all pairs that have both
>>> to_'s and from_'s, and to simply propagate any other pairs whose keys
>>> contain "_".
>>>
>>> public void reduce(Text key, Iterator<Text> values,
>>>                    OutputCollector<Text, Text> output,
>>>                    Reporter reporter) throws IOException {
>>>     String key_s = key.toString();
>>>     if (key_s.indexOf("_") > 0)
>>>         output.collect(key, new Text("completed"));
>>>     else {
>>>         HashMap<String, ArrayList<String>> lists =
>>>             new HashMap<String, ArrayList<String>>();
>>>         while (values.hasNext()) {
>>>             String line = values.next().toString();
>>>             String[] tokens = line.split("_");
>>>             if (!lists.containsKey(tokens[0])) {
>>>                 lists.put(tokens[0], new ArrayList<String>());
>>>             }
>>>             lists.get(tokens[0]).add(tokens[1]);
>>>         }
>>>         for (String t : lists.get("to"))
>>>             for (String f : lists.get("from"))
>>>                 output.collect(new Text(t + "_" + f), key);
>>>     }
>>> }
>>>
>>> And this is where the most exciting stuff happens: tokens[1] throws an
>>> ArrayIndexOutOfBoundsException. If you scroll up, you can see that by this
>>> point the iterator should be giving values like "to_a", "from_b", "to_b",
>>> etc. When I just output these values, everything looks OK and I get "to_a",
>>> "from_b". But split() doesn't work at all; moreover, line.length() is
>>> always 1 and indexOf("_") returns -1! The very same indexOf WORKS PERFECTLY
>>> for the keys, where we have pairs whose keys contain "_" and look like
>>> "a_b", "b_c".
>>>
>>> I'm really puzzled by all this. MapReduce is supposed to save lives by
>>> making everything simple. Instead I spent several hours just to spot this...
>>>
>>> I'd really appreciate your help, guys!!! Thanks in advance!
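For what it's worth, one way to make the reducer fail loudly instead of throwing is to split each value with a limit of 2, count anything that has no underscore, and check for a missing "to" or "from" list before the cross join. The sketch below is only a more defensive variant of the reduce() quoted above (the class name TriangleReducer and the counter names are made up), not a confirmed fix for the underlying problem:

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch only: same structure as the reduce() above, with defensive checks
// so a malformed value shows up as a counter instead of an
// ArrayIndexOutOfBoundsException (or a NullPointerException when a key has
// only to_ or only from_ values).
public class TriangleReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output,
                       Reporter reporter) throws IOException {
        if (key.toString().indexOf("_") > 0) {
            output.collect(key, new Text("completed"));
            return;
        }

        HashMap<String, ArrayList<String>> lists =
                new HashMap<String, ArrayList<String>>();
        while (values.hasNext()) {
            // toString() copies the bytes out of the Text object, which
            // Hadoop may reuse between iterations.
            String value = values.next().toString();

            // Split into at most two parts: "to"/"from" and the vertex id.
            String[] tokens = value.split("_", 2);
            if (tokens.length < 2) {
                // Surface the bad record in the job counters and keep going.
                reporter.incrCounter("debug", "malformed-values", 1);
                continue;
            }
            if (!lists.containsKey(tokens[0])) {
                lists.put(tokens[0], new ArrayList<String>());
            }
            lists.get(tokens[0]).add(tokens[1]);
        }

        ArrayList<String> tos = lists.get("to");
        ArrayList<String> froms = lists.get("from");
        if (tos == null || froms == null) {
            return; // nothing to cross join for this key
        }
        for (String t : tos) {
            for (String f : froms) {
                output.collect(new Text(t + "_" + f), key);
            }
        }
    }
}

The limit of 2 in split("_", 2) also keeps a vertex id that itself contains an underscore in one piece instead of silently dropping part of it.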