Hi Robert, Thanks much for your valuable inputs....
This spaces and tabs problem is killing me in a way, it is pretty cumbersome to say the least.... Regards, Suvankar Roy "Robert Mah" <r...@pobox.com> Sent by: Robert Mah <robert....@gmail.com> 08/02/2009 10:52 PM To "'Suvankar Roy'" <suvankar....@tcs.com>, <pgsql-performance@postgresql.org> cc Subject RE: [PERFORM] Greenplum MapReduce Suvankar: Check your file for spaces vs tabs (one of them is bad and yes, it matters). And as an personal aside, this is yet another reason I hate YAML. Cheers, Rob From: pgsql-performance-ow...@postgresql.org [ mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Suvankar Roy Sent: Thursday, July 30, 2009 8:25 AM To: pgsql-performance@postgresql.org Subject: [PERFORM] Greenplum MapReduce Hi all, Has anybody worked on Greenplum MapReduce programming ? I am facing a problem while trying to execute the below Greenplum Mapreduce program written in YAML (in blue). The error is thrown in the 7th line as: Error: YAML syntax error - found character that cannot start any token while scanning for the next token, at line 7 (in red) If somebody can explain this and the potential solution %YAML 1.1 --- VERSION: 1.0.0.1 DATABASE: test_db1 USER: gpadmin DEFINE: - INPUT: NAME: doc TABLE: documents - INPUT: NAME: kw TABLE: keywords - MAP: NAME: doc_map LANGUAGE: python FUNCTION: | i = 0 terms = {} for term in data.lower().split(): i = i + 1 if term in terms: terms[term] += ','+str(i) else: terms[term] = str(i) for term in terms: yield([doc_id, term, terms[term]]) OPTIMIZE: STRICT IMMUTABLE PARAMETERS: - doc_id integer - data text RETURNS: - doc_id integer - term text - positions text - MAP: NAME: kw_map LANGUAGE: python FUNCTION: | i = 0 terms = {} for term in keyword.lower().split(): i = i + 1 if term in terms: terms[term] += ','+str(i) else: terms[term] = str(i) yield([keyword_id, i, term, terms[term]]) OPTIMIZE: STRICT IMMUTABLE PARAMETERS: - keyword_id integer - keyword text RETURNS: - keyword_id integer - nterms integer - term text - positions text - TASK: NAME: doc_prep SOURCE: doc MAP: doc_map - TASK: NAME: kw_prep SOURCE: kw MAP: kw_map - INPUT: NAME: term_join QUERY: | SELECT doc.doc_id, kw.keyword_id, kw.term, kw.nterms, doc.positions as doc_positions, kw.positions as kw_positions FROM doc_prep doc INNER JOIN kw_prep kw ON (doc.term = kw.term) - REDUCE: NAME: term_reducer TRANSITION: term_transition FINALIZE: term_finalizer - TRANSITION: NAME: term_transition LANGUAGE: python PARAMETERS: - state text - term text - nterms integer - doc_positions text - kw_positions text FUNCTION: | if state: kw_split = state.split(':') else: kw_split = [] for i in range(0,nterms): kw_split.append('') for kw_p in kw_positions.split(','): kw_split[int(kw_p)-1] = doc_positions outstate = kw_split[0] for s in kw_split[1:]: outstate = outstate + ':' + s return outstate - FINALIZE: NAME: term_finalizer LANGUAGE: python RETURNS: - count integer MODE: MULTI FUNCTION: | if not state: return 0 kw_split = state.split(':') previous = None for i in range(0,len(kw_split)): isplit = kw_split[i].split(',') if any(map(lambda(x): x == '', isplit)): return 0 adjusted = set(map(lambda(x): int(x)-i, isplit)) if (previous): previous = adjusted.intersection(previous) else: previous = adjusted if previous: return len(previous) return 0 - TASK: NAME: term_match SOURCE: term_join REDUCE: term_reducer - INPUT: NAME: final_output QUERY: | SELECT doc.*, kw.*, tm.count FROM documents doc, keywords kw, term_match tm WHERE doc.doc_id = tm.doc_id AND kw.keyword_id = tm.keyword_id AND tm.count > 0 EXECUTE: - RUN: SOURCE: final_output TARGET: STDOUT Regards, Suvankar Roy =====-----=====-----===== Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you ForwardSourceID:NT000058B6 =====-----=====-----===== Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you