Hi,

Thanks for the response.
@moon soo Lee: The interpreter setting is the same in 0.7.0 and 0.7.1.
@Felix Cheung: The Python version is the same. The code is as follows:

*PYSPARK*

    def textPreProcessor(text):
        for w in text.split():
            regex = re.compile('[%s]' % re.escape(string.punctuation))
            no_punctuation = unicode(regex.sub(' ', w), 'utf8')
            tokens = word_tokenize(no_punctuation)
            lowercased = [t.lower() for t in tokens]
            no_stopwords = [w for w in lowercased if not w in stopwordsX]
            stemmed = [stemmerX.stem(w) for w in no_stopwords]
            return [w for w in stemmed if w]

    docs = sc.textFile(hdfs_path + training_data, use_unicode=False).repartition(96)
    docs.map(lambda features: sentimentObject.textPreProcessor(
        features.split(delimiter)[text_column])).count()

*Error:*
- UnicodeDecodeError: 'utf8' codec can't decode byte 0x9b in position 17: invalid start byte
- The same error occurs when use_unicode=False is not used.
- The error changes to 'ascii' codec can't decode byte 0x97 in position 3: ordinal not in range(128) when no_punctuation = regex.sub(' ', w) is used instead of no_punctuation = unicode(regex.sub(' ', w), 'utf8').
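For what it's worth, here is a minimal standalone sketch (not from the original thread, written in Python 3 syntax where bytes.decode plays the role of Python 2's unicode() call) of why the strict decode fails on a stray byte like 0x9b, and how an explicit errors policy avoids the hard failure when records are read as raw bytes with use_unicode=False:

```python
# A raw record as Spark would deliver it with use_unicode=False:
# valid UTF-8 text plus a stray 0x9b byte, as in the reported error.
raw = b"some text \x9b more text"

# Strict decoding reproduces the UnicodeDecodeError from the thread.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("strict decode failed:", e.reason)

# errors="replace" substitutes U+FFFD for undecodable bytes instead of
# raising, so downstream tokenizing/stemming can keep running.
safe = raw.decode("utf-8", errors="replace")
print(safe)

# errors="ignore" simply drops the undecodable bytes.
print(raw.decode("utf-8", errors="ignore"))
```

Whether replacing or dropping bad bytes is acceptable depends on the data, but it at least localizes the problem to the malformed records instead of failing the whole job.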
*Note:* In version 0.7.0 the code was running fine without use_unicode=False and without unicode(regex.sub(' ', w), 'utf8').

*PYTHON*

    def textPreProcessor(text_column):
        processed_text = []
        for text in text_column:
            for w in text.split():
                regex = re.compile('[%s]' % re.escape(string.punctuation))  # regex for punctuation
                no_punctuation = unicode(regex.sub(' ', text_), 'utf8')
                tokens = word_tokenize(no_punctuation)
                lowercased = [t.lower() for t in tokens]
                no_stopwords = [w for w in lowercased if not w in stopwordsX]
                stemmed = [stemmerX.stem(w) for w in no_stopwords]
                processed_text.append([w for w in stemmed if w])
        return processed_text

    new_training = pd.read_csv(training_data, header=None, delimiter=delimiter,
                               error_bad_lines=False,
                               usecols=[label_column, text_column],
                               names=['label', 'msg']).dropna()
    new_training['processed_msg'] = textPreProcessor(new_training['msg'])

This Python code is working and I am getting results. In version 0.7.0, I was getting output without using the unicode function. I hope the problem is clear now.

Regards,
Meethu Mathew

On Fri, Apr 21, 2017 at 3:07 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:

> And are they running with the same Python version? What is the Python
> version?
>
> _____________________________
> From: moon soo Lee <m...@apache.org>
> Sent: Thursday, April 20, 2017 11:53 AM
> Subject: Re: UnicodeDecodeError in zeppelin 0.7.1
> To: <users@zeppelin.apache.org>
>
> Hi,
>
> 0.7.1 didn't change any encoding type as far as I know.
> One difference is that the 0.7.1 official artifact has been built with
> JDK8 while 0.7.0 was built with JDK7 (we'll use JDK7 to build the upcoming
> 0.7.2 binary). But I'm not sure that can make pyspark and spark encoding
> types change.
>
> Do you have exactly the same interpreter setting in 0.7.1 and 0.7.0?
> Thanks,
> moon
>
> On Wed, Apr 19, 2017 at 5:30 AM Meethu Mathew <meethu.mat...@flytxt.com>
> wrote:
>
>> Hi,
>>
>> I just migrated from zeppelin 0.7.0 to zeppelin 0.7.1 and I am facing
>> this error while creating an RDD (in pyspark).
>>
>>> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0:
>>> invalid start byte
>>
>> I was able to create the RDD without any error after adding
>> use_unicode=False as follows:
>>
>>> sc.textFile("file.csv", use_unicode=False)
>>
>> But it fails when I try to stem the text. I am getting a similar error
>> when trying to apply stemming to the text using the python interpreter.
>>
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4:
>>> ordinal not in range(128)
>>
>> All this code is working in the 0.7.0 version. There is no change in the
>> dataset or code. Is there any change in the encoding type in the new
>> version of zeppelin?
>>
>> Regards,
>>
>> Meethu Mathew