Yingyi Bu created ASTERIXDB-1365:
------------------------------------

             Summary: word-tokens function gets malformed strings from the 
inverted index
                 Key: ASTERIXDB-1365
                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1365
             Project: Apache AsterixDB
          Issue Type: Bug
          Components: Functions - AQL
            Reporter: Yingyi Bu
            Assignee: Taewoo Kim
            Priority: Critical


[~wangsaeu],[~javierjia],

It seems there are two possible causes for the bug:
1. the inverted index generates malformed UTF8 strings;
2. UTF8StringUtil has some issue.

However, since UTF8StringUtil has been widely used elsewhere, it's very 
possible the issue is in inverted index. Thus, I assign this to Taewoo. Please 
re-assign owners if you think the assignment is not right.


This is the query:
{noformat}
use SocialNetworkData;

select distinct element message.message
from   GleambookMessages as message,
       "word-tokens"(message) as token,
       (
        select distinct element emp.organization
        from GleambookUsers as user,
             user.employment emp
       ) as org
where  org=token
       and message.send_time >= datetime('2000-06-07T12:05:32') and 
message.send_time < datetime('2000-06-08T12:05:32');
{noformat}


This is the stack trace:
{noformat}
Caused by: java.lang.IllegalArgumentException
        at 
org.apache.hyracks.util.string.UTF8StringUtil.charAt(UTF8StringUtil.java:60)
        at 
org.apache.hyracks.storage.am.lsm.invertedindex.tokenizers.DelimitedUTF8StringBinaryTokenizer.hasNext(DelimitedUTF8StringBinaryTokenizer.java:47)
        at 
org.apache.asterix.runtime.evaluators.common.WordTokensEvaluator.evaluate(WordTokensEvaluator.java:61)
        at 
org.apache.asterix.runtime.unnestingfunctions.std.ScanCollectionDescriptor$ScanCollectionUnnestingFunctionFactory$1.init(ScanCollectionDescriptor.java:88)
        at 
org.apache.hyracks.algebricks.runtime.operators.std.UnnestRuntimeFactory$1.nextFrame(UnnestRuntimeFactory.java:121)
        at 
org.apache.hyracks.dataflow.common.comm.io.AbstractFrameAppender.write(AbstractFrameAppender.java:93)
        at 
org.apache.hyracks.algebricks.runtime.operators.base.AbstractOneInputOneOutputOneFramePushRuntime.flushAndReset(AbstractOneInputOneOutputOneFramePushRuntime.java:63)
        at 
org.apache.hyracks.algebricks.runtime.operators.base.AbstractOneInputOneOutputOneFramePushRuntime.flushIfNotFailed(AbstractOneInputOneOutputOneFramePushRuntime.java:69)
        at 
org.apache.hyracks.algebricks.runtime.operators.base.AbstractOneInputOneOutputOneFramePushRuntime.close(AbstractOneInputOneOutputOneFramePushRuntime.java:55)
        at 
org.apache.hyracks.algebricks.runtime.operators.std.StreamSelectRuntimeFactory$1.close(StreamSelectRuntimeFactory.java:125)
        at 
org.apache.hyracks.algebricks.runtime.operators.base.AbstractOneInputOneOutputOneFramePushRuntime.close(AbstractOneInputOneOutputOneFramePushRuntime.java:57)
        at 
org.apache.hyracks.algebricks.runtime.operators.std.AssignRuntimeFactory$1.close(AssignRuntimeFactory.java:122)
        at 
org.apache.hyracks.algebricks.runtime.operators.base.AbstractOneInputOneOutputOneFramePushRuntime.close(AbstractOneInputOneOutputOneFramePushRuntime.java:57)
        at 
org.apache.hyracks.algebricks.runtime.operators.meta.AlgebricksMetaOperatorDescriptor$2.close(AlgebricksMetaOperatorDescriptor.java:153)
        at 
org.apache.hyracks.storage.am.common.dataflow.IndexSearchOperatorNodePushable.close(IndexSearchOperatorNodePushable.java:227)
        ... 9 more
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to