Here is what I have done to tokenize text in Pig.
**My Pig script**

    -- set the debug mode
    SET debug 'off';
    -- register the Python UDF
    REGISTER '/home/hema/phd/work1/coding/myudf.py' USING streaming_python AS myudf;

    RAWDATA = LOAD '/home/hema/temp' USING TextLoader() AS content;
    LOWERCASE_DATA = FOREACH RAWDATA GENERATE LOWER(content) AS con;
    TOKENIZED_DATA = FOREACH LOWERCASE_DATA GENERATE myudf.special_tokenize(con) AS conn;
    DUMP TOKENIZED_DATA;

**My Python UDF**

    from pig_util import outputSchema
    import nltk

    @outputSchema('word:chararray')
    def special_tokenize(text):
        # Tokenize the lowercased line with NLTK's word tokenizer.
        tokens = nltk.word_tokenize(text)
        return tokens
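
One thing I am not sure about is the output schema: `word:chararray` declares a single string, while `word_tokenize` returns a list. I have been wondering whether declaring the result as a bag of single-token tuples would serialize cleanly instead, along the lines of the untested sketch below (the schema string and function name here are only my guess, not something I have verified):

    from pig_util import outputSchema
    import nltk

    # Untested idea: declare the returned list as a bag of (token) tuples
    # instead of a single chararray.
    @outputSchema('tokens:{(token:chararray)}')
    def special_tokenize_bag(text):
        return nltk.word_tokenize(text)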

The code works fine, but the output is messy. How can I remove the unwanted
underscores and vertical bars? The output looks like this:

    (|{_|(_additionalcontext|)_|,_|(_in|)_|,_|(_namefinder|)_|}_)
    (|{_|(_is|)_|,_|(_there|)_|,_|(_any|)_|,_|(_possibility|)_|,_|(_to|)_|,_|(_use|)_|,_|(_additionalcontext|)_|,_|(_with|)_|,_|(_the|)_|,_|(_namefinderme.train|)_|,_|(_?|)_|,_|(_if|)_|,_|(_so|)_|,_|(_,|)_|,_|(_how|)_|,_|(_?|)_|,_|(_if|)_|,_|(_there|)_|,_|(_is|)_|,_|(_n't|)_|,_|(_maybe|)_|,_|(_this|)_|,_|(_should|)_|,_|(_be|)_|,_|(_an|)_|,_|(_issue|)_|,_|(_to|)_|,_|(_be|)_|,_|(_added|)_|,_|(_in|)_|,_|(_the|)_|,_|(_future|)_|,_|(_releases|)_|,_|(_?|)_|}_)
    (|{_|(_i|)_|,_|(_would|)_|,_|(_really|)_|,_|(_greatly|)_|,_|(_appreciate|)_|,_|(_if|)_|,_|(_someone|)_|,_|(_can|)_|,_|(_help|)_|,_|(_(|)_|,_|(_give|)_|,_|(_me|)_|,_|(_some|)_|,_|(_sample|)_|,_|(_code/show|)_|,_|(_me|)_|,_|(_)|)_|,_|(_how|)_|,_|(_to|)_|,_|(_add|)_|,_|(_pos|)_|,_|(_tag|)_|,_|(_features|)_|,_|(_while|)_|,_|(_training|)_|,_|(_and|)_|,_|(_testing|)_|,_|(_namefinder|)_|,_|(_.|)_|}_)
    (|{_|(_if|)_|,_|(_the|)_|,_|(_incoming|)_|,_|(_data|)_|,_|(_is|)_|,_|(_just|)_|,_|(_tokens|)_|,_|(_with|)_|,_|(_no|)_|,_|(_pos|)_|,_|(_tag|)_|,_|(_information|)_|,_|(_,|)_|,_|(_where|)_|,_|(_is|)_|,_|(_the|)_|,_|(_information|)_|,_|(_taken|)_|,_|(_then|)_|,_|(_?|)_|,_|(_a|)_|,_|(_new|)_|,_|(_file|)_|,_|(_?|)_|,_|(_run|)_|,_|(_a|)_|,_|(_pos|)_|,_|(_tagging|)_|,_|(_model|)_|,_|(_before|)_|,_|(_training|)_|,_|(_?|)_|,_|(_or|)_|,_|(_?|)_|}_)
    (|{_|(_and|)_|,_|(_what|)_|,_|(_is|)_|,_|(_the|)_|,_|(_purpose|)_|,_|(_of|)_|,_|(_the|)_|,_|(_resources|)_|,_|(_(|)_|,_|(_i.e|)_|,_|(_.|)_|,_|(_collection.|)_|,_|(_<|)_|,_|(_string|)_|,_|(_,|)_|,_|(_object|)_|,_|(_>|)_|,_|(_emptymap|)_|,_|(_(|)_|,_|(_)|)_|,_|(_)|)_|,_|(_in|)_|,_|(_the|)_|,_|(_namefinderme.train|)_|,_|(_method|)_|,_|(_?|)_|,_|(_what|)_|,_|(_should|)_|,_|(_be|)_|,_|(_ideally|)_|,_|(_included|)_|,_|(_in|)_|,_|(_there|)_|,_|(_?|)_|}_)
    (|{_|(_i|)_|,_|(_just|)_|,_|(_ca|)_|,_|(_n't|)_|,_|(_get|)_|,_|(_these|)_|,_|(_things|)_|,_|(_from|)_|,_|(_the|)_|,_|(_java|)_|,_|(_doc|)_|,_|(_api|)_|,_|(_.|)_|}_)
    (|{_|(_in|)_|,_|(_advance|)_|,_|(_!|)_|}_)
    (|{_|(_best|)_|,_|(_,|)_|}_)
    (|{_|(_svetoslav|)_|}_)

**Original data**

    AdditionalContext in NameFinder
    Is there any possibility to use additionalContext with the NameFinderME.train? If so, how? If there isn't maybe this should be an issue to be added in the future releases?
    I would REALLY greatly appreciate if someone can help (give me some sample code/show me)  how to add POS tag features while training and testing NameFinder.
    If the incoming data is just tokens with NO POS tag information, where is the information taken then? A new file? Run a POS tagging model before training? Or?
    And what is the purpose of the resources (i.e. Collection.<String,Object>emptyMap()) in the NameFinderME.train method? What should be ideally included in there?
    I just can't get these things from the Java doc API.
     in advance!
    Best,
    Svetoslav

I would like to have a tuple of tokens as my final output. I have attached
the log below. I can see that the output produced by the UDF itself is correct and
that it is the serializer that inserts the vertical bars and underscores. How can I
overcome this? Thanks in advance.
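
To make the goal concrete, for the first input line I would like the dumped record to look roughly like this (this is only how I imagine it should look, not output I have actually obtained):

    (additionalcontext,in,namefinder)

or at least a clean bag of tokens:

    ({(additionalcontext),(in),(namefinder)})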





*With Warm Regards,*
*R.Hemaa,*

**Attached log**

2016-07-26 16:48:30,550 INFO To reduce the amount of information being logged only a small subset of rows are logged at the INFO level.  Call udf_logging.set_log_level_debug in pig_util to see all rows being processed.
2016-07-26 16:48:30,855 INFO Row 1: Serialized Input: Cadditionalcontext in namefinder 
2016-07-26 16:48:30,856 INFO Row 1: Deserialized Input: [u'additionalcontext in namefinder ']
2016-07-26 16:48:30,889 INFO Row 1: UDF Output: [u'additionalcontext', u'in', u'namefinder']
2016-07-26 16:48:30,890 INFO Row 1: Serialized Output: |{_|(_additionalcontext|)_|,_|(_in|)_|,_|(_namefinder|)_|}_
2016-07-26 16:48:30,890 INFO Row 2: Serialized Input: Cis there any possibility to use additionalcontext with the namefinderme.train? if so, how? if there isn't maybe this should be an issue to be added in the future releases?
2016-07-26 16:48:30,891 INFO Row 2: Deserialized Input: [u"is there any possibility to use additionalcontext with the namefinderme.train? if so, how? if there isn't maybe this should be an issue to be added in the future releases?"]
2016-07-26 16:48:30,896 INFO Row 2: UDF Output: [u'is', u'there', u'any', u'possibility', u'to', u'use', u'additionalcontext', u'with', u'the', u'namefinderme.train', u'?', u'if', u'so', u',', u'how', u'?', u'if', u'there', u'is', u"n't", u'maybe', u'this', u'should', u'be', u'an', u'issue', u'to', u'be', u'added', u'in', u'the', u'future', u'releases', u'?']
2016-07-26 16:48:30,896 INFO Row 2: Serialized Output: |{_|(_is|)_|,_|(_there|)_|,_|(_any|)_|,_|(_possibility|)_|,_|(_to|)_|,_|(_use|)_|,_|(_additionalcontext|)_|,_|(_with|)_|,_|(_the|)_|,_|(_namefinderme.train|)_|,_|(_?|)_|,_|(_if|)_|,_|(_so|)_|,_|(_,|)_|,_|(_how|)_|,_|(_?|)_|,_|(_if|)_|,_|(_there|)_|,_|(_is|)_|,_|(_n't|)_|,_|(_maybe|)_|,_|(_this|)_|,_|(_should|)_|,_|(_be|)_|,_|(_an|)_|,_|(_issue|)_|,_|(_to|)_|,_|(_be|)_|,_|(_added|)_|,_|(_in|)_|,_|(_the|)_|,_|(_future|)_|,_|(_releases|)_|,_|(_?|)_|}_
2016-07-26 16:48:30,896 INFO Row 3: Serialized Input: Ci would really greatly appreciate if someone can help (give me some sample code/show me)  how to add pos tag features while training and testing namefinder.
2016-07-26 16:48:30,897 INFO Row 3: Deserialized Input: [u'i would really greatly appreciate if someone can help (give me some sample code/show me)  how to add pos tag features while training and testing namefinder.']
2016-07-26 16:48:30,897 INFO Row 3: UDF Output: [u'i', u'would', u'really', u'greatly', u'appreciate', u'if', u'someone', u'can', u'help', u'(', u'give', u'me', u'some', u'sample', u'code/show', u'me', u')', u'how', u'to', u'add', u'pos', u'tag', u'features', u'while', u'training', u'and', u'testing', u'namefinder', u'.']
2016-07-26 16:48:30,897 INFO Row 3: Serialized Output: |{_|(_i|)_|,_|(_would|)_|,_|(_really|)_|,_|(_greatly|)_|,_|(_appreciate|)_|,_|(_if|)_|,_|(_someone|)_|,_|(_can|)_|,_|(_help|)_|,_|(_(|)_|,_|(_give|)_|,_|(_me|)_|,_|(_some|)_|,_|(_sample|)_|,_|(_code/show|)_|,_|(_me|)_|,_|(_)|)_|,_|(_how|)_|,_|(_to|)_|,_|(_add|)_|,_|(_pos|)_|,_|(_tag|)_|,_|(_features|)_|,_|(_while|)_|,_|(_training|)_|,_|(_and|)_|,_|(_testing|)_|,_|(_namefinder|)_|,_|(_.|)_|}_
2016-07-26 16:48:30,898 INFO Row 4: Serialized Input: Cif the incoming data is just tokens with no pos tag information, where is the information taken then? a new file? run a pos tagging model before training? or?
2016-07-26 16:48:30,898 INFO Row 4: Deserialized Input: [u'if the incoming data is just tokens with no pos tag information, where is the information taken then? a new file? run a pos tagging model before training? or?']
2016-07-26 16:48:30,899 INFO Row 4: UDF Output: [u'if', u'the', u'incoming', u'data', u'is', u'just', u'tokens', u'with', u'no', u'pos', u'tag', u'information', u',', u'where', u'is', u'the', u'information', u'taken', u'then', u'?', u'a', u'new', u'file', u'?', u'run', u'a', u'pos', u'tagging', u'model', u'before', u'training', u'?', u'or', u'?']
2016-07-26 16:48:30,900 INFO Row 4: Serialized Output: |{_|(_if|)_|,_|(_the|)_|,_|(_incoming|)_|,_|(_data|)_|,_|(_is|)_|,_|(_just|)_|,_|(_tokens|)_|,_|(_with|)_|,_|(_no|)_|,_|(_pos|)_|,_|(_tag|)_|,_|(_information|)_|,_|(_,|)_|,_|(_where|)_|,_|(_is|)_|,_|(_the|)_|,_|(_information|)_|,_|(_taken|)_|,_|(_then|)_|,_|(_?|)_|,_|(_a|)_|,_|(_new|)_|,_|(_file|)_|,_|(_?|)_|,_|(_run|)_|,_|(_a|)_|,_|(_pos|)_|,_|(_tagging|)_|,_|(_model|)_|,_|(_before|)_|,_|(_training|)_|,_|(_?|)_|,_|(_or|)_|,_|(_?|)_|}_
2016-07-26 16:48:30,900 INFO Row 5: Serialized Input: Cand what is the purpose of the resources (i.e. collection.<string,object>emptymap()) in the namefinderme.train method? what should be ideally included in there?
2016-07-26 16:48:30,900 INFO Row 5: Deserialized Input: [u'and what is the purpose of the resources (i.e. collection.<string,object>emptymap()) in the namefinderme.train method? what should be ideally included in there?']
2016-07-26 16:48:30,901 INFO Row 5: UDF Output: [u'and', u'what', u'is', u'the', u'purpose', u'of', u'the', u'resources', u'(', u'i.e', u'.', u'collection.', u'<', u'string', u',', u'object', u'>', u'emptymap', u'(', u')', u')', u'in', u'the', u'namefinderme.train', u'method', u'?', u'what', u'should', u'be', u'ideally', u'included', u'in', u'there', u'?']
2016-07-26 16:48:30,902 INFO Row 5: Serialized Output: |{_|(_and|)_|,_|(_what|)_|,_|(_is|)_|,_|(_the|)_|,_|(_purpose|)_|,_|(_of|)_|,_|(_the|)_|,_|(_resources|)_|,_|(_(|)_|,_|(_i.e|)_|,_|(_.|)_|,_|(_collection.|)_|,_|(_<|)_|,_|(_string|)_|,_|(_,|)_|,_|(_object|)_|,_|(_>|)_|,_|(_emptymap|)_|,_|(_(|)_|,_|(_)|)_|,_|(_)|)_|,_|(_in|)_|,_|(_the|)_|,_|(_namefinderme.train|)_|,_|(_method|)_|,_|(_?|)_|,_|(_what|)_|,_|(_should|)_|,_|(_be|)_|,_|(_ideally|)_|,_|(_included|)_|,_|(_in|)_|,_|(_there|)_|,_|(_?|)_|}_
2016-07-26 16:48:30,902 INFO Row 6: Serialized Input: Ci just can't get these things from the java doc api.
2016-07-26 16:48:30,902 INFO Row 6: Deserialized Input: [u"i just can't get these things from the java doc api."]
2016-07-26 16:48:30,903 INFO Row 6: UDF Output: [u'i', u'just', u'ca', u"n't", u'get', u'these', u'things', u'from', u'the', u'java', u'doc', u'api', u'.']
2016-07-26 16:48:30,903 INFO Row 6: Serialized Output: |{_|(_i|)_|,_|(_just|)_|,_|(_ca|)_|,_|(_n't|)_|,_|(_get|)_|,_|(_these|)_|,_|(_things|)_|,_|(_from|)_|,_|(_the|)_|,_|(_java|)_|,_|(_doc|)_|,_|(_api|)_|,_|(_.|)_|}_
2016-07-26 16:48:30,903 INFO Row 7: Serialized Input: C in advance!
2016-07-26 16:48:30,904 INFO Row 7: Deserialized Input: [u' in advance!']
2016-07-26 16:48:30,904 INFO Row 7: UDF Output: [u'in', u'advance', u'!']
2016-07-26 16:48:30,904 INFO Row 7: Serialized Output: |{_|(_in|)_|,_|(_advance|)_|,_|(_!|)_|}_
2016-07-26 16:48:30,905 INFO Row 8: Serialized Input: Cbest,
2016-07-26 16:48:30,905 INFO Row 8: Deserialized Input: [u'best,']
2016-07-26 16:48:30,905 INFO Row 8: UDF Output: [u'best', u',']
2016-07-26 16:48:30,905 INFO Row 8: Serialized Output: |{_|(_best|)_|,_|(_,|)_|}_
2016-07-26 16:48:30,906 INFO Row 9: Serialized Input: Csvetoslav
2016-07-26 16:48:30,906 INFO Row 9: Deserialized Input: [u'svetoslav']
2016-07-26 16:48:30,907 INFO Row 9: UDF Output: [u'svetoslav']
2016-07-26 16:48:30,907 INFO Row 9: Serialized Output: |{_|(_svetoslav|)_|}_
2016-07-26 16:48:30,908 INFO Row 10: Serialized Input: C additionalcontext in namefinder
2016-07-26 16:48:30,908 INFO Row 10: Deserialized Input: [u' additionalcontext in namefinder']
2016-07-26 16:48:30,908 INFO Row 10: UDF Output: [u'additionalcontext', u'in', u'namefinder']
2016-07-26 16:48:30,908 INFO Row 10: Serialized Output: |{_|(_additionalcontext|)_|,_|(_in|)_|,_|(_namefinder|)_|}_
2016-07-26 16:48:30,909 INFO Row 11: Serialized Input: Cwell the additional context thing we never got right. its not really 
2016-07-26 16:48:30,909 INFO Row 11: Deserialized Input: [u'well the additional context thing we never got right. its not really ']
2016-07-26 16:48:30,910 INFO Row 11: UDF Output: [u'well', u'the', u'additional', u'context', u'thing', u'we', u'never', u'got', u'right', u'.', u'its', u'not', u'really']
2016-07-26 16:48:30,910 INFO Row 11: Serialized Output: |{_|(_well|)_|,_|(_the|)_|,_|(_additional|)_|,_|(_context|)_|,_|(_thing|)_|,_|(_we|)_|,_|(_never|)_|,_|(_got|)_|,_|(_right|)_|,_|(_.|)_|,_|(_its|)_|,_|(_not|)_|,_|(_really|)_|}_
2016-07-26 16:48:30,911 INFO Row 12: Serialized Input: Csupported
2016-07-26 16:48:30,911 INFO Row 12: Deserialized Input: [u'supported']
2016-07-26 16:48:30,911 INFO Row 12: UDF Output: [u'supported']
2016-07-26 16:48:30,912 INFO Row 12: Serialized Output: |{_|(_supported|)_|}_
2016-07-26 16:48:30,912 INFO Row 13: Serialized Input: Cduring training with the new tools and makes using the name finder a bit 
2016-07-26 16:48:30,912 INFO Row 13: Deserialized Input: [u'during training with the new tools and makes using the name finder a bit ']
2016-07-26 16:48:30,913 INFO Row 13: UDF Output: [u'during', u'training', u'with', u'the', u'new', u'tools', u'and', u'makes', u'using', u'the', u'name', u'finder', u'a', u'bit']
2016-07-26 16:48:30,913 INFO Row 13: Serialized Output: |{_|(_during|)_|,_|(_training|)_|,_|(_with|)_|,_|(_the|)_|,_|(_new|)_|,_|(_tools|)_|,_|(_and|)_|,_|(_makes|)_|,_|(_using|)_|,_|(_the|)_|,_|(_name|)_|,_|(_finder|)_|,_|(_a|)_|,_|(_bit|)_|}_
2016-07-26 16:48:30,914 INFO Row 14: Serialized Input: Cmore difficult
2016-07-26 16:48:30,914 INFO Row 14: Deserialized Input: [u'more difficult']
2016-07-26 16:48:30,914 INFO Row 14: UDF Output: [u'more', u'difficult']
2016-07-26 16:48:30,914 INFO Row 14: Serialized Output: |{_|(_more|)_|,_|(_difficult|)_|}_
2016-07-26 16:48:30,915 INFO Row 15: Serialized Input: Cbecause the additional context needs to be passed in on every call.
2016-07-26 16:48:30,915 INFO Row 15: Deserialized Input: [u'because the additional context needs to be passed in on every call.']
2016-07-26 16:48:30,915 INFO Row 15: UDF Output: [u'because', u'the', u'additional', u'context', u'needs', u'to', u'be', u'passed', u'in', u'on', u'every', u'call', u'.']
2016-07-26 16:48:30,915 INFO Row 15: Serialized Output: |{_|(_because|)_|,_|(_the|)_|,_|(_additional|)_|,_|(_context|)_|,_|(_needs|)_|,_|(_to|)_|,_|(_be|)_|,_|(_passed|)_|,_|(_in|)_|,_|(_on|)_|,_|(_every|)_|,_|(_call|)_|,_|(_.|)_|}_
