Re: [PERFORM] Greenplum MapReduce

Suvankar Roy Mon, 03 Aug 2009 14:51:12 -0700

Hi Robert,

Thanks very much for your valuable input.


This spaces-versus-tabs problem is killing me; it is pretty 
cumbersome, to say the least.
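
One way to take the manual work out of this is to rewrite the indentation with a small script. The following is only a sketch: `expand_leading_tabs` and its `width` parameter are illustrative names, and the default of 8 columns per tab is an assumption, so choose whatever width gives consistent nesting in your file.

```python
def expand_leading_tabs(text, width=8):
    """Replace tabs in each line's leading whitespace with spaces.

    Only the indentation is rewritten, so tabs that appear later
    in a line (for example inside a value) are left untouched.
    """
    fixed = []
    for line in text.splitlines(keepends=True):
        body = line.lstrip(" \t")                  # everything after the indentation
        indent = line[: len(line) - len(body)]     # the leading spaces and tabs
        fixed.append(indent.expandtabs(width) + body)
    return "".join(fixed)
```

To use it, read the YAML file into a string, pass it through `expand_leading_tabs`, and write the result back. If the file mixes tabs and spaces within the same nesting level, the result may still need a manual check.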

Regards,

Suvankar Roy



"Robert Mah" <[email protected]> 
Sent: 08/02/2009 10:52 PM
To: "'Suvankar Roy'" <[email protected]>, <[email protected]>
Subject: RE: [PERFORM] Greenplum MapReduce






Suvankar:
 
Check your file for spaces vs. tabs. YAML does not allow tab 
characters for indentation, and yes, it matters.
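
A quick way to check is to scan the file for tab characters in the leading whitespace. The sketch below is illustrative only: `find_indent_tabs` is a made-up helper name, and the sample string stands in for the real file. Line 7 of the posted program is its first indented line, which is consistent with a tab in the indentation being the culprit.

```python
def find_indent_tabs(text):
    """Return (line_number, column) pairs for tabs in leading whitespace."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == "\t":
                hits.append((lineno, col))
            elif ch != " ":
                break  # reached the first non-whitespace character
    return hits


# Hypothetical sample: the second line is indented with a tab.
sample = "DEFINE:\n\t- INPUT:\n        NAME: doc\n"
print(find_indent_tabs(sample))
```

Any line reported here needs its indentation replaced with spaces before the YAML parser will accept it.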
 
And as a personal aside, this is yet another reason I hate YAML.
 
Cheers,
Rob
 
From: [email protected] [mailto:[email protected]] On Behalf Of Suvankar Roy
Sent: Thursday, July 30, 2009 8:25 AM
To: [email protected]
Subject: [PERFORM] Greenplum MapReduce
 

Hi all, 

Has anybody worked on Greenplum MapReduce programming? 

I am facing a problem while trying to execute the Greenplum 
MapReduce program below, written in YAML. 

The error is thrown on line 7: 
Error: YAML syntax error - found character that cannot start any token 
while scanning for the next token, at line 7 

Could somebody explain this error and suggest a possible solution? 

%YAML 1.1 
--- 
VERSION: 1.0.0.1 
DATABASE: test_db1 
USER: gpadmin 
DEFINE: 
        - INPUT: 
                NAME: doc 
                TABLE: documents 
        - INPUT: 
                NAME: kw 
                TABLE: keywords 
        - MAP: 
                NAME:                 doc_map 
                LANGUAGE:         python 
                FUNCTION:          | 
                        i = 0 
                        terms = {} 
                        for term in data.lower().split(): 
                                i = i + 1 
                                if term in terms: 
                                        terms[term] += ','+str(i) 
                                else: 
                                        terms[term] = str(i) 
                        for term in terms: 
                                yield([doc_id, term, terms[term]])   
                OPTIMIZE: STRICT IMMUTABLE 
                PARAMETERS: 
                        - doc_id integer 
                        - data text 
                RETURNS: 
                        - doc_id integer 
                        - term text 
                        - positions text 
        - MAP: 
                NAME:         kw_map 
                LANGUAGE:         python 
                FUNCTION:         | 
                        i = 0 
                        terms = {} 
                        for term in keyword.lower().split(): 
                                i = i + 1 
                                if term in terms: 
                                        terms[term] += ','+str(i) 
                                else: 
                                        terms[term] = str(i) 
                        for term in terms: 
                                yield([keyword_id, i, term, terms[term]]) 
                OPTIMIZE: STRICT IMMUTABLE 
                PARAMETERS: 
                        - keyword_id integer 
                        - keyword text 
                RETURNS: 
                        - keyword_id integer 
                        - nterms integer 
                        - term text 
                        - positions text           
        - TASK: 
                NAME: doc_prep 
                SOURCE: doc 
                MAP: doc_map 
        - TASK: 
                NAME: kw_prep 
                SOURCE: kw 
                MAP: kw_map           
        - INPUT: 
                NAME: term_join 
                QUERY: | 
                        SELECT doc.doc_id, kw.keyword_id, kw.term, kw.nterms, 
                                doc.positions as doc_positions, 
                                kw.positions as kw_positions 
                        FROM doc_prep doc INNER JOIN kw_prep kw ON (doc.term = kw.term) 
        - REDUCE: 
                NAME: term_reducer 
                TRANSITION: term_transition 
                FINALIZE: term_finalizer 
        - TRANSITION: 
                NAME: term_transition 
                LANGUAGE: python 
                PARAMETERS: 
                        - state text 
                        - term text 
                        - nterms integer 
                        - doc_positions text 
                        - kw_positions text 
                FUNCTION: | 
                        if state: 
                                kw_split = state.split(':') 
                        else: 
                                kw_split = [] 
                                for i in range(0,nterms): 
                                        kw_split.append('') 
                        for kw_p in kw_positions.split(','): 
                                kw_split[int(kw_p)-1] = doc_positions      

                        outstate = kw_split[0] 
                        for s in kw_split[1:]: 
                                outstate = outstate + ':' + s 
                        return outstate 
        - FINALIZE: 
                NAME: term_finalizer 
                LANGUAGE: python 
                RETURNS: 
                        - count integer 
                MODE: MULTI 
                FUNCTION: | 
                        if not state: 
                                return 0 
                        kw_split = state.split(':') 
                        previous = None 
                        for i in range(0,len(kw_split)): 
                                isplit = kw_split[i].split(',') 
                                if any(map(lambda x: x == '', isplit)): 
                                        return 0 
                                adjusted = set(map(lambda x: int(x)-i, isplit)) 
                                if (previous): 
                                        previous = adjusted.intersection(previous) 
                                else: 
                                        previous = adjusted 
                        if previous: 
                                return len(previous) 
                        return 0 
        - TASK: 
                NAME: term_match 
                SOURCE: term_join 
                REDUCE: term_reducer 
        - INPUT: 
                NAME: final_output 
                QUERY: | 
                        SELECT doc.*, kw.*, tm.count 
                        FROM documents doc, keywords kw, term_match tm 
                        WHERE doc.doc_id = tm.doc_id 
                          AND kw.keyword_id = tm.keyword_id 
                          AND tm.count > 0 
EXECUTE: 
        - RUN: 
                SOURCE: final_output 
                TARGET: STDOUT 
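
The Python embedded in the program above can be exercised outside Greenplum, which helps separate YAML syntax problems from logic problems. The following is a minimal sketch written for Python 3 as plain functions. The function names mirror the NAME entries in the YAML, but the standalone signatures and the sample inputs are assumptions made for illustration, not part of the Greenplum API.

```python
def doc_map(doc_id, data):
    """Yield [doc_id, term, positions] for each distinct term in data."""
    terms = {}
    for i, term in enumerate(data.lower().split(), start=1):
        if term in terms:
            terms[term] += "," + str(i)
        else:
            terms[term] = str(i)
    for term in terms:
        yield [doc_id, term, terms[term]]


def term_transition(state, term, nterms, doc_positions, kw_positions):
    """Store the document positions of a term in its keyword slot(s)."""
    if state:
        kw_split = state.split(":")
    else:
        kw_split = [""] * nterms          # one empty slot per keyword term
    for kw_p in kw_positions.split(","):
        kw_split[int(kw_p) - 1] = doc_positions
    return ":".join(kw_split)


def term_finalizer(state):
    """Count aligned occurrences of the keyword terms in the document."""
    if not state:
        return 0
    previous = None
    for i, slot in enumerate(state.split(":")):
        positions = slot.split(",")
        if any(p == "" for p in positions):
            return 0                      # some keyword term never matched
        adjusted = {int(p) - i for p in positions}
        # Mirrors the posted logic: intersect only while previous is non-empty.
        previous = adjusted & previous if previous else adjusted
    return len(previous) if previous else 0


# Hypothetical input: document "the quick brown fox", keyword "quick brown".
rows = list(doc_map(1, "the quick brown fox"))
state = term_transition(None, "quick", 2, "2", "1")
state = term_transition(state, "brown", 2, "3", "2")
print(rows, state, term_finalizer(state))
```

Each slot in the state string holds the document positions of one keyword term. Subtracting the slot index lines up consecutive terms, so a non-empty intersection means the keyword terms appear next to each other in the document.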



Regards, 

Suvankar Roy
=====-----=====-----=====
Notice: The information contained in this e-mail
message and/or attachments to it may contain 
confidential or privileged information. If you are 
not the intended recipient, any dissemination, use, 
review, distribution, printing or copying of the 
information contained in this e-mail message 
and/or attachments to it are strictly prohibited. If 
you have received this communication in error, 
please notify us by reply e-mail or telephone and 
immediately and permanently delete the message 
and any attachments. Thank you
 
 
