Yes. All we need to do is change one method. However, this is only a temporary fix; I will investigate further once I have more time.
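For reference, the gram-count logic of the reset() method quoted below can be tried out in isolation. This is only a minimal sketch, not the Hyracks code: the class GramCountSketch and its helpers charSize, countChars and totalGrams are hypothetical names, and it assumes standard, well-formed UTF-8 rather than whatever encoding the real tokenizer reads. It also compares the character count (not the byte length) against gramLength, which is an assumption about what the guard is meant to do.

```java
import java.nio.charset.StandardCharsets;

// Standalone sketch of the gram-count computation; all names here are hypothetical.
final class GramCountSketch {

    // Byte length of the UTF-8 character whose lead byte is data[pos].
    // Assumption: standard UTF-8 and well-formed input.
    static int charSize(byte[] data, int pos) {
        int lead = data[pos] & 0xFF;
        if (lead < 0x80) return 1;           // 0xxxxxxx
        if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
        return 4;                            // 11110xxx
    }

    // Counts characters (not bytes) in data[start .. start + byteLength).
    static int countChars(byte[] data, int start, int byteLength) {
        int numChars = 0;
        int pos = start;
        int end = start + byteLength;
        while (pos < end) {
            numChars++;
            pos += charSize(data, pos);
        }
        return numChars;
    }

    // Number of grams the tokenizer would emit; never negative.
    static int totalGrams(int numChars, int gramLength, boolean usePrePost) {
        if (usePrePost) {
            // Pre/post padding adds (gramLength - 1) extra grams.
            return numChars + gramLength - 1;
        }
        // Inputs shorter than one gram yield no grams at all.
        return numChars >= gramLength ? numChars - gramLength + 1 : 0;
    }

    public static void main(String[] args) {
        byte[] empty = "".getBytes(StandardCharsets.UTF_8);
        // Two characters that occupy four bytes in UTF-8.
        byte[] wide = "\u00B9\u00B9".getBytes(StandardCharsets.UTF_8);
        System.out.println(totalGrams(countChars(empty, 0, empty.length), 3, false));
        System.out.println(totalGrams(countChars(wide, 0, wide.length), 4, false));
    }
}
```

The second call in main shows why the choice of guard matters: the input has four bytes but only two characters, so a guard on the byte length would let the subtraction go negative for gramLength 4, while a guard on the character count returns 0.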
On Thu, Dec 3, 2015 at 08:34 Chen Li <[email protected]> wrote:

> Thanks, Taewoo. Do you think it's easier to apply these changes
> directly to Wenhai's "fuzzy branch"?
>
> On Thu, Dec 3, 2015 at 5:51 AM, Taewoo Kim <[email protected]> wrote:
>
> > @Wenhai:
> >
> > Replace NGramUTF8StringBinaryTokenizer.reset() with the following code as a
> > quick temporary fix. The general fix needs to move this tokenizer to the
> > Asterix level so that it can properly recognize the NULL type tag and
> > skip the token generation process.
> >
> > @Override
> > public void reset(byte[] sentenceData, int start, int length) {
> >     super.reset(sentenceData, start, length);
> >     gramNum = 0;
> >
> >     // Count the characters (not bytes) in the string.
> >     int numChars = 0;
> >     int pos = byteIndex;
> >     int end = pos + sentenceUtf8Length;
> >     while (pos < end) {
> >         numChars++;
> >         pos += UTF8StringUtil.charSize(sentenceData, pos);
> >     }
> >
> >     if (usePrePost) {
> >         totalGrams = numChars + gramLength - 1;
> >     } else {
> >         // Avoid a negative gram count for inputs shorter than one gram.
> >         if (length >= gramLength) {
> >             totalGrams = numChars - gramLength + 1;
> >         } else {
> >             totalGrams = 0;
> >         }
> >     }
> > }
> >
> > Best,
> > Taewoo
> >
> > On Tue, Dec 1, 2015 at 7:37 PM, Taewoo Kim (JIRA) <[email protected]> wrote:
> >
> >> [ https://issues.apache.org/jira/browse/ASTERIXDB-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035204#comment-15035204 ]
> >>
> >> Taewoo Kim commented on ASTERIXDB-1208:
> >> ---------------------------------------
> >>
> >> This error happens because the current tokenizer always assumes that it
> >> sees a UTF8 string. In this case, it sees a NULL value. We need to add
> >> logic to bypass tokenization when a NULL value is provided.
> >>
> >> > ngram tokenizer failure with negative length
> >> > --------------------------------------------
> >> >
> >> >                 Key: ASTERIXDB-1208
> >> >                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1208
> >> >             Project: Apache AsterixDB
> >> >          Issue Type: Bug
> >> >          Components: Hyracks Core
> >> >            Reporter: Wenhai
> >> >            Assignee: Taewoo Kim
> >> >
> >> > drop dataverse test if exists;
> >> > create dataverse test;
> >> > use dataverse test;
> >> > create type DBLPOpenType as open {
> >> >   id: int64,
> >> >   dblpid: string,
> >> >   authors: string,
> >> >   misc: string
> >> > }
> >> > create dataset DBLPOpen(DBLPOpenType) primary key id;
> >> > insert into dataset DBLPOpen { "id": 93, "dblpid": "journals/iandc/IbarraJCR91", "authors": "Some Classes of Languages in NC¹", "misc": "2006-04-25 86-106 Inf. Comput. January 1991 90 1 db/journals/iandc/iandc90.html#IbarraJCR91" }
> >> > use dataverse test;
> >> > set import-private-functions 'true'
> >> > for $d in dataset DBLPOpen
> >> > where similarity-jaccard(gram-tokens("",3,false),gram-tokens($d.title,3,false)) >= 0.5
> >> > return {"rec": $d}
> >>
> >> --
> >> This message was sent by Atlassian JIRA
> >> (v6.3.4#6332)
