[ https://issues.apache.org/jira/browse/ASTERIXDB-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15601139#comment-15601139 ]
Wenhai commented on ASTERIXDB-1704:
-----------------------------------

All I did was apply the fuzzy branch onto master. Has your OOM branch been merged back onto the latest master? One of the reasons is the extra-expression computation. I have removed it in the newest fuzzy branch.

git fetch https://asterix-gerrit.ics.uci.edu/asterixdb refs/changes/76/1076/23

> Fuzzy-join query is slow
> ------------------------
>
>                 Key: ASTERIXDB-1704
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1704
>             Project: Apache AsterixDB
>          Issue Type: Bug
>            Reporter: Taewoo Kim
>
> I have an issue regarding the prefix-based fuzzy join (non-index based fuzzy
> join) on a small dataset. The following query runs forever even for a dataset
> with 200K records on 9 nodes. So, each node only has 20,000 records. Also,
> the record size is not that big.
> {code}
> count(
>   for $o in dataset AmazonReview
>   for $i in dataset AmazonReview
>   where similarity-jaccard(word-tokens($o.reviewText),
>         word-tokens($i.reviewText)) >= 0.2 and $o.id < $i.id
>   return {"oid":$o.reviewrID, "iid":$i.reviewID}
> );
> {code}
> An example record is as follows.
> {code}
> {
>   "reviewerID": "A2SUAM1J3GNN3B",
>   "asin": "0000013714",
>   "reviewerName": "J. McDonald",
>   "helpful": [2, 3],
>   "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
>   "overall": 5.0,
>   "summary": "Heavenly Highway Hymns",
>   "unixReviewTime": 1252800000,
>   "reviewTime": "09 13, 2009"
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
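
For reference, the predicate in the quoted query can be sketched outside AsterixDB roughly as below. This is only an illustration of what the self-join computes, not how AsterixDB evaluates it: the tokenizer (lowercase alphanumeric runs) and the set-based Jaccard are assumptions that may differ from `word-tokens` and `similarity-jaccard`, and the helper names and the `id` field on each record are hypothetical.

```python
import re
from itertools import combinations


def word_tokens(text):
    """Assumed tokenizer: lowercase the text and keep alphanumeric runs as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Set-based Jaccard similarity |a & b| / |a | b|; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def naive_self_join(records, threshold=0.2):
    """Compare every pair with oid < iid and keep pairs at or above the threshold.

    Each record is assumed to be a dict with hypothetical keys "id" and "reviewText".
    """
    tokens = {r["id"]: word_tokens(r["reviewText"]) for r in records}
    pairs = []
    # combinations over sorted ids yields each (oid, iid) with oid < iid exactly once
    for oid, iid in combinations(sorted(tokens), 2):
        if jaccard(tokens[oid], tokens[iid]) >= threshold:
            pairs.append((oid, iid))
    return pairs


sample = [
    {"id": 1, "reviewText": "old hymns for piano"},
    {"id": 2, "reviewText": "piano hymns book"},
    {"id": 3, "reviewText": "great purchase"},
]
matches = naive_self_join(sample)
print(matches)
```

Evaluated this naively, a self-join over 200K records touches roughly 2 × 10^10 candidate pairs, which is the work a prefix-based filter is meant to cut down before the similarity is verified.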