Hi Sam,

Thanks for your response! Could you try the pre-join on your test data set first? That way you can verify whether Kylin meets your requirements on the test data set. If the pre-join solution works, we can add a "pre-join" option to the cube definition and automate it in the cube build engine. Then you would be able to change the dimension data easily without impacting the cube build.
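A rough sketch of the idea in Hive, using the table and key names from your cube JSON quoted below (FACTS, KEYWORDS, KEYWORD_DIM_ID = DIM_ID); please adapt it to your real schema:

    -- Materialize the join once, then define the cube on the result
    -- instead of on FACTS plus the KEYWORDS lookup table.
    CREATE TABLE FACTS_PREJOINED AS
    SELECT f.*,
           k.PUBLISHER_GROUP_ID,
           k.PUBLISHER_CAMPAIGN_ID,
           k.PUBLISHER_ID
    FROM FACTS f
    LEFT JOIN KEYWORDS k ON f.KEYWORD_DIM_ID = k.DIM_ID;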
Thanks
Jiang Xu

------------------ Original Message ------------------
From: "Samuel Bock" <[email protected]>
Date: Sat, Feb 28, 2015, 2:23 PM
To: "Jiang Xu" <[email protected]>
Cc: "dev" <[email protected]>
Subject: Re: OutOfMemoryError on step #3 of Cube build

While that might be possible when putting together a test dataset, the
actual system will need to retain the ability to change dimension data
easily. A prejoined table would make that significantly harder (among
other things).

thanks,
Sam

On Wed, Feb 25, 2015 at 4:38 PM, Jiang Xu <[email protected]> wrote:

> As a workaround, could you prejoin the big dimension table with the fact
> table in Hive? Then you can run Kylin on the prejoined table.
>
> We will do the optimization on the big dimension table later.
>
> Thanks
> Jiang Xu
>
> ------------------ Original Message ------------------
> *From:* Samuel Bock <[email protected]>
> *Date:* 2015-02-26 03:28
> *To:* dev <[email protected]>
> *Subject:* Re: OutOfMemoryError on step #3 of Cube build
>
> Thank you for the follow-up.
>
> Our dimension table is 25 million rows for our test data set, and would be
> far larger in production. Given that, it sounds like our data doesn't fit
> the Kylin use case. I appreciate the assistance in tracking down the
> source of this issue.
>
> cheers,
> sam
>
> On Tue, Feb 24, 2015 at 7:28 PM, Shi, Shaofeng <[email protected]> wrote:
>
> > Hi Samuel,
> >
> > Kylin only supports the star schema: a single fact table joined with
> > multiple lookup tables. The lookup tables need to be small so that Kylin
> > can read them into memory for the join and the cube build. Also, as you
> > found, Kylin takes a snapshot of the lookup tables and persists them in
> > HBase; that is likely the problem. In your case, how many rows are there
> > in the KEYWORDS table?
> >
> > On 2/21/15, 2:12 AM, "Samuel Bock" <[email protected]> wrote:
> >
> > >Thank you for your response.
> > >
> > >I went into the code, and I'm fairly confident that I've isolated the
> > >problem. The OutOfMemoryError occurs during the dimension dictionary
> > >step, but is not actually related to the dictionary itself (since, as
> > >you mentioned, that is skipped when dictionary=false). The problem
> > >arises from the second half of that step, which builds the dimension
> > >table snapshot. Looking at the code, building the snapshot loads the
> > >entire table into memory as strings (SnapshotTable.takeSnapshot), then
> > >serializes that to an in-memory ByteArrayOutputStream
> > >(ResourceStore.putResource), and finally creates a copy of the stream's
> > >internal byte array in order to store it in HBase
> > >(HBaseResourceStore.checkAndPutResourceImpl). That means there needs to
> > >be space for three in-memory copies of the full dimension table. Given
> > >that even our test-subset dimension table is 25 million rows by 14
> > >columns, that becomes problematic. From experimentation, it breaks even
> > >with a 95 gig heap.
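> > >
> > >As a rough back-of-envelope (assuming, say, ~20 characters per cell --
> > >a guess, not a measurement): a java.lang.String of that size costs on
> > >the order of 80 bytes on the heap (object and char[] headers plus two
> > >bytes per char), so 25M rows x 14 columns x ~80 bytes is roughly 28 GB
> > >for the in-memory string form alone, before counting the two serialized
> > >copies on top of it.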
> > >
> > >For completeness, the log leading up to the crash (minus the pointless
> > >zk messages) is:
> > >
> > >- Start to execute command: -cubename foo -segmentname FULL_BUILD -input /tmp/kylin-7d2b7588-17c0-4d80-9962-14ca63929186/foo/fact_distinct_columns
> > >[QuartzScheduler_Worker-1]:[2015-02-19 22:59:01,284][INFO][com.kylinolap.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:57)] - Building snapshot of KEYWORDS
> > >[QuartzScheduler_Worker-2]:[2015-02-19 22:59:53,241][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)] - 0 pending jobs
> > >[QuartzScheduler_Worker-3]:[2015-02-19 23:00:53,252][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)] - 0 pending jobs
> > >[QuartzScheduler_Worker-1]:[2015-02-19 23:01:01,278][INFO][com.kylinolap.dict.lookup.FileTableReader.autoDetectDelim(FileTableReader.java:156)] - Auto detect delim to be ' ', split line to 14 columns -- 1020_18768_4_127200_4647593_group_341686994 group 19510703 0 18768 1020 341686994 4647593 371981 4 127200 CONTENT 2015-01-21 22:16:36.227246
> > >[http-bio-7070-exec-8]:[2015-02-19 23:02:07,980][DEBUG][com.kylinolap.rest.service.AdminService.getConfigAsString(AdminService.java:91)] - Get Kylin Runtime Config
> > >[QuartzScheduler_Worker-4]:[2015-02-19 23:02:53,934][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)] - 0 pending jobs
> > >[QuartzScheduler_Worker-1]:[2015-02-19 23:03:10,216][DEBUG][com.kylinolap.common.persistence.ResourceStore.putResource(ResourceStore.java:166)] - Saving resource /table_snapshot/part-00000.csv/f87954d5-fdfa-4903-9f82-771d85df6367.snapshot (Store kylin_metadata_qa@hbase)
> > >[QuartzScheduler_Worker-6]:[2015-02-19 23:04:53,230][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)] - 0 pending jobs
> > >java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> > >Dumping heap to java_pid3705.hprof ...
> > >
> > >(If I'm reading that last error right, "Requested array size exceeds VM
> > >limit" means a single allocation asked for an array above the JVM's
> > >per-array cap -- close to Integer.MAX_VALUE elements, i.e. about 2 GB
> > >for a byte[] -- which fits the ByteArrayOutputStream growing its
> > >internal buffer, rather than plain heap exhaustion.)
> > >
> > >The cube JSON is:
> > >
> > >{
> > >  "uuid": "ba6105ca-a18d-4839-bed0-c89b86817110",
> > >  "name": "foo",
> > >  "description": "",
> > >  "dimensions": [
> > >    {
> > >      "id": 1,
> > >      "name": "KEYWORDS_DERIVED",
> > >      "join": {
> > >        "type": "left",
> > >        "primary_key": [
> > >          "DIM_ID"
> > >        ],
> > >        "foreign_key": [
> > >          "KEYWORD_DIM_ID"
> > >        ]
> > >      },
> > >      "hierarchy": null,
> > >      "table": "KEYWORDS",
> > >      "column": "{FK}",
> > >      "datatype": null,
> > >      "derived": [
> > >        "PUBLISHER_GROUP_ID",
> > >        "PUBLISHER_CAMPAIGN_ID",
> > >        "PUBLISHER_ID"
> > >      ]
> > >    }
> > >  ],
> > >  "measures": [
> > >    {
> > >      "id": 1,
> > >      "name": "_COUNT_",
> > >      "function": {
> > >        "expression": "COUNT",
> > >        "parameter": {
> > >          "type": "constant",
> > >          "value": "1"
> > >        },
> > >        "returntype": "bigint"
> > >      },
> > >      "dependent_measure_ref": null
> > >    },
> > >    {
> > >      "id": 2,
> > >      "name": "CONVERSIONS",
> > >      "function": {
> > >        "expression": "SUM",
> > >        "parameter": {
> > >          "type": "column",
> > >          "value": "CONVERSIONS"
> > >        },
> > >        "returntype": "bigint"
> > >      },
> > >      "dependent_measure_ref": null
> > >    }
> > >  ],
> > >  "rowkey": {
> > >    "rowkey_columns": [
> > >      {
> > >        "column": "KEYWORD_DIM_ID",
> > >        "length": 0,
> > >        "dictionary": "false",
> > >        "mandatory": false
> > >      }
> > >    ],
> > >    "aggregation_groups": [
> > >      [
> > >        "KEYWORD_DIM_ID"
> > >      ]
> > >    ]
> > >  },
> > >  "signature": "T+aYTH/KlCwwmVAGRQR3hQ==",
> > >  "capacity": "LARGE",
> > >  "last_modified": 1424367558297,
> > >  "fact_table": "FACTS",
> > >  "null_string": null,
> > >  "filter_condition": "KEYWORDS.PUBLISHER_GROUP_ID=386784931",
> > >  "cube_partition_desc": {
> > >    "partition_date_column": null,
> > >    "partition_date_start": 0,
> > >    "cube_partition_type": "APPEND"
> > >  },
> > >  "hbase_mapping": {
> > >    "column_family": [
> > >      {
> > >        "name": "F1",
> > >        "columns": [
> > >          {
> > >            "qualifier": "M",
> > >            "measure_refs": [
> > >              "_COUNT_",
> > >              "CONVERSIONS"
> > >            ]
> > >          }
> > >        ]
> > >      }
> > >    ]
> > >  },
> > >  "notify_list": [
> > >    "sam"
> > >  ]
> > >}
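> > >
> > >For what it's worth, the shape of query this cube is meant to serve
> > >would be something like the following (an illustration based on the
> > >definition above, not a query from our actual workload):
> > >
> > >    SELECT KEYWORD_DIM_ID, COUNT(1) AS CNT, SUM(CONVERSIONS)
> > >    FROM FACTS
> > >    GROUP BY KEYWORD_DIM_ID;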
> > >
> > >Cheers,
> > >sam
> > >
> > >On Thu, Feb 19, 2015 at 9:49 PM, Hongbin Ma <[email protected]> wrote:
> > >
> > >> Also, since you set the dictionary to false, there should not be any
> > >> memory consumption while building the dictionary.
> > >> Can you also give us the JSON description of the cube? (In the cube
> > >> tab, click the corresponding cube, then click the JSON button.)
> > >>
> > >>
> > >> On Fri, Feb 20, 2015 at 1:39:15 PM Hongbin Ma <[email protected]>
> > >> wrote:
> > >>
> > >> > Hi, Samuel
> > >> >     Can you give us some detailed logs, so we can dig into the root
> > >> > cause?
> > >> >
> > >> > On Fri, Feb 20, 2015 at 2:44:32 AM Samuel Bock <[email protected]>
> > >> > wrote:
> > >> >
> > >> >> Hello all,
> > >> >>
> > >> >> We are in the process of evaluating Kylin for use as an OLAP
> > >> >> engine. To that end, we are trying to get a minimum viable setup
> > >> >> with a representative sample of our data in order to gather
> > >> >> performance metrics. We have Kylin running against a 10-node
> > >> >> cluster; the provided cubes build successfully and the system seems
> > >> >> functional. Attempting to build a simple cube against our data
> > >> >> results in an OutOfMemoryError in the Kylin server process (so far
> > >> >> we have given it up to a 46 gig heap). I was wondering if you could
> > >> >> give me some guidance as to likely causes and any configurations I
> > >> >> may have missed, before I start diving into the source. I have
> > >> >> changed the "dictionary" setting to false, as recommended for
> > >> >> high-cardinality dimensions, but have not changed the configuration
> > >> >> significantly apart from that.
> > >> >>
> > >> >> For reference, the sizes of the Hive tables we're building the
> > >> >> cubes from are:
> > >> >> dimension table: 25,399,061 rows
> > >> >> fact table: 270,940,921 rows
> > >> >>
> > >> >> (And as a note, there are no pertinent log messages, except to
> > >> >> indicate that it is in the Build Dimension Dictionary step.)
> > >> >>
> > >> >> Thank you,
> > >> >> sam bock
