[ https://issues.apache.org/jira/browse/ASTERIXDB-1435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Taewoo Kim updated ASTERIXDB-1435:
----------------------------------
Comment: was deleted

(was: These two issues are actually the same issue.)

> Massive operators will induce a heap error that should not be.
> ---------------------------------------------------------------
>
>                 Key: ASTERIXDB-1435
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1435
>             Project: Apache AsterixDB
>          Issue Type: Improvement
>          Components: AsterixDB
>         Environment: Ubuntu 12.04 on a single machine with 2 X 12 partitions
>                      on 6 CPUs X 4 hard threads
>            Reporter: Wenhai
>            Assignee: Taewoo Kim
>
> While testing the fuzzy-join patch
> (https://asterix-gerrit.ics.uci.edu/#/c/531/) by joining a one-million-record
> ACM dataset with a half-million-record DBLP dataset and a two-million-record
> CITE dataset, we encountered a heap error that should not occur. We split the
> three-way prefix-based fuzzy join into two rounds as follows.
> Schema:
> {noformat}
> drop dataverse test if exists;
> create dataverse test;
> use dataverse test;
> create type PaperType as open {
>     tid: uuid,
>     title: string,
>     authors: string?,
>     year: int?,
>     conf: string?,
>     idx: string,
>     abstract: string?
> }
>
> use dataverse test;
> drop dataset ACM if exists;
> create dataset ACM(PaperType) primary key tid autogenerated;
> load dataset ACM
> using localfs
> (("path"="127.0.0.1:///home/hadoop/Downloads/doccorpus/reproduce/acm_split.aa,127.0.0.1:///home/hadoop/Downloads/doccorpus/reproduce/acm_split.ab,127.0.0.1:///home/hadoop/Downloads/doccorpus/reproduce/acm_split.ac,127.0.0.1:///home/hadoop/Downloads/doccorpus/reproduce/acm_split.ad,127.0.0.1:///home/hadoop/Downloads/doccorpus/reproduce/acm_split.ae"),("format"="delimited-text"),("delimiter"="#"),("quote"="\u0000"));
>
> use dataverse test;
> create dataset DBLP(PaperType) primary key tid autogenerated;
> load dataset DBLP
> using localfs
> (("path"="127.0.0.1:///home/hadoop/Downloads/doccorpus/reproduce/dblp_split.aa,127.0.0.1:///home/hadoop/Downloads/doccorpus/reproduce/dblp_split.ab,127.0.0.1:///home/hadoop/Downloads/doccorpus/reproduce/dblp_split.ac,127.0.0.1:///home/hadoop/Downloads/doccorpus/reproduce/dblp_split.ad,127.0.0.1:///home/hadoop/Downloads/doccorpus/reproduce/dblp_split.ae,127.0.0.1:///home/hadoop/Downloads/doccorpus/reproduce/dblp_split.af"),("format"="delimited-text"),("delimiter"="#"),("quote"="\u0000"));
>
> use dataverse test;
> drop dataset CITE if exists;
> create dataset CITE(PaperType) primary key tid autogenerated;
> load dataset CITE
> using localfs
> (("path"="127.0.0.1:///home/hadoop/Downloads/doccorpus/reproduce/citation_split.aa,127.0.0.1:///home/hadoop/Downloads/doccorpus/reproduce/citation_split.ab,127.0.0.1:///home/hadoop/Downloads/doccorpus/reproduce/citation_split.ac"),("format"="delimited-text"),("delimiter"="#"),("quote"="\u0000"));
>
> use dataverse test;
> drop index ACM.word_index if exists;
> create index word_index on ACM(title) type keyword;
>
> use dataverse test;
> drop index ACM.ngram_index if exists;
> create index ngram_index on ACM(title) type ngram(3);
> {noformat}
> First, we split the three-way join into two two-way joins, joining ACM with
> DBLP into a temporary dataset Stage1:
> {noformat}
> use dataverse test;
> drop dataset Stage1 if exists;
> create dataset Stage1(PaperType) primary key tid autogenerated;
> set import-private-functions 'true';
> set simthreshold '.9f';
> insert into dataset Stage1 (
>     for $t in dataset('ACM')
>     for $o in dataset('DBLP')
>     where word-tokens($o.authors) ~= word-tokens($t.authors)
>     return {"title": $t.title, "authors": $t.authors, "year": $t.year,
>             "conf": $t.conf, "idx": $t.idx, "abstract": $t.abstract});
> {noformat}
> Afterwards, we executed the second two-way join as follows:
> {noformat}
> use dataverse test;
> set import-private-functions 'true';
> set simthreshold '.9f';
> let $s := sum(
>     for $t in dataset('Stage1')
>     for $o in dataset('CITE')
>     where word-tokens($o.authors) ~= word-tokens($t.authors)
>     order by $o.id
>     return 1)
> return $s
> {noformat}
> This successfully generated the final 7 million results.
> Nevertheless, if we run the three-way join directly as a single query,
> {noformat}
> use dataverse test;
> set import-private-functions 'true';
> set simthreshold '.9f';
> let $s := sum(
>     for $t in dataset('ACM')
>     for $o in dataset('DBLP')
>     for $g in dataset('CITE')
>     where word-tokens($o.authors) ~= word-tokens($t.authors) and
>           word-tokens($t.authors) ~= word-tokens($g.authors)
>     order by $o.id
>     return 1)
> return $s
> {noformat}
> we get the following error:
> {noformat}
> Java heap error.
> {noformat}


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)