Hi,

Thanks for the response.
@moon soo Lee: The interpreter setting is the same in 0.7.0 and 0.7.1.
@Felix Cheung: The Python version is the same. The code is as follows:

*PYSPARK*

    def textPreProcessor(text):
        for w in text.split():
            regex = re.compile('[%s]' % re.escape(string.punctuation))
            no_punctuation = unicode(regex.sub(' ', w), 'utf8')
            tokens = word_tokenize(no_punctuation)
            lowercased = [t.lower() for t in tokens]
            no_stopwords = [w for w in lowercased if not w in stopwordsX]
            stemmed = [stemmerX.stem(w) for w in no_stopwords]
            return [w for w in stemmed if w]

    docs = sc.textFile(hdfs_path + training_data, use_unicode=False).repartition(96)
    docs.map(lambda features: sentimentObject.textPreProcessor(
        features.split(delimiter)[text_column])).count()

*Error:*
- UnicodeDecodeError: 'utf8' codec can't decode byte 0x9b in position 17: invalid start byte
- The same error occurs when use_unicode=False is not used.
- The error changes to 'ascii' codec can't decode byte 0x97 in position 3: ordinal not in range(128) when no_punctuation = regex.sub(' ', w) is used instead of no_punctuation = unicode(regex.sub(' ', w), 'utf8').
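For what it's worth, here is a minimal standalone sketch (not from the original thread, written in Python 3 syntax where bytes.decode plays the role of Python 2's unicode() call) of why the strict decode fails on a stray byte like 0x9b, and how an explicit errors policy avoids the hard failure when records are read as raw bytes with use_unicode=False:

```python
# A raw record as Spark would deliver it with use_unicode=False:
# valid UTF-8 text plus a stray 0x9b byte, as in the reported error.
raw = b"some text \x9b more text"

# Strict decoding reproduces the UnicodeDecodeError from the thread.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("strict decode failed:", e.reason)

# errors="replace" substitutes U+FFFD for undecodable bytes instead of
# raising, so downstream tokenizing/stemming can keep running.
safe = raw.decode("utf-8", errors="replace")
print(safe)

# errors="ignore" simply drops the undecodable bytes.
print(raw.decode("utf-8", errors="ignore"))
```

Whether replacing or dropping bad bytes is acceptable depends on the data, but it at least localizes the problem to the malformed records instead of failing the whole job.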
*Note:* In version 0.7.0 the code was running fine without use_unicode=False and without unicode(regex.sub(' ', w), 'utf8').

*PYTHON*

    def textPreProcessor(text_column):
        processed_text = []
        for text in text_column:
            for w in text.split():
                regex = re.compile('[%s]' % re.escape(string.punctuation))  # regex for punctuation
                no_punctuation = unicode(regex.sub(' ', text_), 'utf8')
                tokens = word_tokenize(no_punctuation)
                lowercased = [t.lower() for t in tokens]
                no_stopwords = [w for w in lowercased if not w in stopwordsX]
                stemmed = [stemmerX.stem(w) for w in no_stopwords]
                processed_text.append([w for w in stemmed if w])
        return processed_text

    new_training = pd.read_csv(training_data, header=None, delimiter=delimiter,
                               error_bad_lines=False,
                               usecols=[label_column, text_column],
                               names=['label', 'msg']).dropna()
    new_training['processed_msg'] = textPreProcessor(new_training['msg'])

This Python code is working and I am getting results. In version 0.7.0, I was getting output without using the unicode function. I hope the problem is clear now.

Regards,
Meethu Mathew

On Fri, Apr 21, 2017 at 3:07 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:

> And are they running with the same Python version? What is the Python
> version?
>
> _____________________________
> From: moon soo Lee <m...@apache.org>
> Sent: Thursday, April 20, 2017 11:53 AM
> Subject: Re: UnicodeDecodeError in zeppelin 0.7.1
> To: <users@zeppelin.apache.org>
>
> Hi,
>
> 0.7.1 didn't change any encoding type as far as I know.
> One difference is that the 0.7.1 official artifact has been built with
> JDK8 while 0.7.0 was built with JDK7 (we'll use JDK7 to build the upcoming
> 0.7.2 binary). But I'm not sure that can make pyspark and spark encoding
> types change.
>
> Do you have exactly the same interpreter setting in 0.7.1 and 0.7.0?
> Thanks,
> moon
>
> On Wed, Apr 19, 2017 at 5:30 AM Meethu Mathew <meethu.mat...@flytxt.com>
> wrote:
>
>> Hi,
>>
>> I just migrated from zeppelin 0.7.0 to zeppelin 0.7.1 and I am facing
>> this error while creating an RDD (in pyspark).
>>
>>> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0:
>>> invalid start byte
>>
>> I was able to create the RDD without any error after adding
>> use_unicode=False as follows:
>>
>>> sc.textFile("file.csv", use_unicode=False)
>>
>> But it fails when I try to stem the text. I am getting a similar error
>> when trying to apply stemming to the text using the python interpreter.
>>
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4:
>>> ordinal not in range(128)
>>
>> All this code is working in the 0.7.0 version. There is no change in the
>> dataset or code. Is there any change in the encoding type in the new
>> version of zeppelin?
>>
>> Regards,
>>
>> Meethu Mathew