As a workaround, could you prejoin the big dimension table with the fact table in Hive? Then you can run Kylin on the prejoined table.
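[Editor's sketch of the suggested workaround: the HiveQL below is built as a Java string so it is easy to inspect; the table and column names (FACTS, KEYWORDS, KEYWORD_DIM_ID, DIM_ID, and the derived PUBLISHER_* columns) are taken from the cube JSON later in this thread and are illustrative, not a confirmed recipe.]

```java
// Illustrative only: the HiveQL a prejoin of FACTS with the large KEYWORDS
// lookup table might use, so the cube can be built from one flat table and
// the former derived columns become plain fact-table columns. Names come
// from the cube JSON in this thread; adjust to your schema.
public class PrejoinSketch {
    public static String buildPrejoin() {
        return String.join("\n",
            "CREATE TABLE FACTS_PREJOINED AS",
            "SELECT f.*,",
            "       k.PUBLISHER_GROUP_ID,",
            "       k.PUBLISHER_CAMPAIGN_ID,",
            "       k.PUBLISHER_ID",
            "FROM FACTS f",
            "LEFT JOIN KEYWORDS k ON f.KEYWORD_DIM_ID = k.DIM_ID");
    }

    public static void main(String[] args) {
        System.out.println(buildPrejoin());
    }
}
```

With the join materialized in Hive, Kylin no longer needs a snapshot of KEYWORDS at all, which sidesteps the memory problem discussed below.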
We will do the optimization on the big dimension table later.

Thanks,
Jiang Xu

------------------ Original Message ------------------
From: Samuel Bock <[email protected]>
Date: 2015-02-26 03:28
To: dev <[email protected]>
Subject: Re: OutOfMemoryError on step #3 of Cube build

Thank you for the follow-up,

Our dimension table is 25 million rows for our test data set, and would be
far larger in production. Given that, it sounds like our data doesn't fit
the Kylin use case.

I appreciate the assistance in tracking down the source of this issue,

cheers,
sam

On Tue, Feb 24, 2015 at 7:28 PM, Shi, Shaofeng <[email protected]> wrote:

> Hi Samuel,
>
> Kylin only supports the star schema: only 1 fact table joined with
> multiple lookup tables. The lookup tables need to be small so that Kylin
> can read them into memory for the join and cube build. Also, as you found,
> Kylin takes a snapshot of the lookup tables and persists them in HBase;
> that should be the problem. In your case, how many rows are there in the
> KEYWORDS table?
>
> On 2/21/15, 2:12 AM, "Samuel Bock" <[email protected]> wrote:
>
> >Thank you for your response,
> >
> >I went into the code, and I'm fairly confident that I've isolated the
> >problem. The OutOfMemoryError is part of the dimension dictionary step,
> >but is not actually related to the dictionary itself (since, as you
> >mentioned, that is skipped when dictionary=false). The problem arises
> >from the second half of that step, in which it builds the dimension
> >table snapshot. Looking at the code, the process of building the
> >snapshot table loads the entire table into memory as strings
> >(SnapshotTable.takeSnapshot), then serializes that to an in-memory
> >ByteArrayOutputStream (ResourceStore.putResource), and finally creates
> >a copy of the internal byte array from the stream in order to store it
> >in HBase (HBaseResourceStore.checkAndPutResourceImpl).
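[Editor's note: the three-step path described above can be sketched as the following simplified Java illustration. This is not Kylin's actual code, only the allocation pattern it implies: one copy as strings, one in the output stream's buffer, and one more from `toByteArray()`.]

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Simplified illustration (not Kylin's actual code) of why the snapshot
// step holds roughly three copies of the lookup table in memory at once.
public class SnapshotMemorySketch {
    public static byte[] snapshot(List<String[]> tableRows) throws IOException {
        // Copy 1: the whole table held as String[] rows
        // (the role of SnapshotTable.takeSnapshot).
        List<String[]> inMemory = new ArrayList<>(tableRows);

        // Copy 2: the serialized form accumulating in a
        // ByteArrayOutputStream (the role of ResourceStore.putResource).
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(bos);
        for (String[] row : inMemory) {
            for (String cell : row) {
                dos.writeUTF(cell); // 2-byte length prefix + UTF-8 bytes
            }
        }
        dos.flush();

        // Copy 3: toByteArray() duplicates the stream's internal buffer
        // before the write to HBase (the role of
        // HBaseResourceStore.checkAndPutResourceImpl).
        return bos.toByteArray();
    }
}
```

All three copies are live at the same moment, so peak heap is roughly three times the in-memory table size, which matches the behaviour reported below.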
> >That means that there needs to be space for three in-memory copies of
> >the full dimension table. Given that even our test subset dimension
> >table is 25 million rows by 14 columns, that becomes problematic. From
> >experimentation, it breaks even with a 95 GB heap.
> >
> >For completeness, the log leading up to the crash (minus the pointless
> >zk messages) is:
> > - Start to execute command: -cubename foo -segmentname FULL_BUILD -input /tmp/kylin-7d2b7588-17c0-4d80-9962-14ca63929186/foo/fact_distinct_columns
> >[QuartzScheduler_Worker-1]:[2015-02-19 22:59:01,284][INFO][com.kylinolap.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:57)] - Building snapshot of KEYWORDS
> >[QuartzScheduler_Worker-2]:[2015-02-19 22:59:53,241][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)] - 0 pending jobs
> >[QuartzScheduler_Worker-3]:[2015-02-19 23:00:53,252][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)] - 0 pending jobs
> >[QuartzScheduler_Worker-1]:[2015-02-19 23:01:01,278][INFO][com.kylinolap.dict.lookup.FileTableReader.autoDetectDelim(FileTableReader.java:156)] - Auto detect delim to be ' ', split line to 14 columns -- 1020_18768_4_127200_4647593_group_341686994 group 19510703 0 18768 1020 341686994 4647593 371981 4 127200 CONTENT 2015-01-21 22:16:36.227246
> >[http-bio-7070-exec-8]:[2015-02-19 23:02:07,980][DEBUG][com.kylinolap.rest.service.AdminService.getConfigAsString(AdminService.java:91)] - Get Kylin Runtime Config
> >[QuartzScheduler_Worker-4]:[2015-02-19 23:02:53,934][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)] - 0 pending jobs
> >[QuartzScheduler_Worker-1]:[2015-02-19 23:03:10,216][DEBUG][com.kylinolap.common.persistence.ResourceStore.putResource(ResourceStore.java:166)] - Saving resource /table_snapshot/part-00000.csv/f87954d5-fdfa-4903-9f82-771d85df6367.snapshot (Store
kylin_metadata_qa@hbase)
> >[QuartzScheduler_Worker-6]:[2015-02-19 23:04:53,230][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)] - 0 pending jobs
> >java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> >Dumping heap to java_pid3705.hprof ...
> >
> >The cube JSON is:
> >
> >{
> >  "uuid": "ba6105ca-a18d-4839-bed0-c89b86817110",
> >  "name": "foo",
> >  "description": "",
> >  "dimensions": [
> >    {
> >      "id": 1,
> >      "name": "KEYWORDS_DERIVED",
> >      "join": {
> >        "type": "left",
> >        "primary_key": ["DIM_ID"],
> >        "foreign_key": ["KEYWORD_DIM_ID"]
> >      },
> >      "hierarchy": null,
> >      "table": "KEYWORDS",
> >      "column": "{FK}",
> >      "datatype": null,
> >      "derived": [
> >        "PUBLISHER_GROUP_ID",
> >        "PUBLISHER_CAMPAIGN_ID",
> >        "PUBLISHER_ID"
> >      ]
> >    }
> >  ],
> >  "measures": [
> >    {
> >      "id": 1,
> >      "name": "_COUNT_",
> >      "function": {
> >        "expression": "COUNT",
> >        "parameter": { "type": "constant", "value": "1" },
> >        "returntype": "bigint"
> >      },
> >      "dependent_measure_ref": null
> >    },
> >    {
> >      "id": 2,
> >      "name": "CONVERSIONS",
> >      "function": {
> >        "expression": "SUM",
> >        "parameter": { "type": "column", "value": "CONVERSIONS" },
> >        "returntype": "bigint"
> >      },
> >      "dependent_measure_ref": null
> >    }
> >  ],
> >  "rowkey": {
> >    "rowkey_columns": [
> >      {
> >        "column": "KEYWORD_DIM_ID",
> >        "length": 0,
> >        "dictionary": "false",
> >        "mandatory": false
> >      }
> >    ],
> >    "aggregation_groups": [
> >      ["KEYWORD_DIM_ID"]
> >    ]
> >  },
> >  "signature": "T+aYTH/KlCwwmVAGRQR3hQ==",
> >  "capacity": "LARGE",
> >  "last_modified": 1424367558297,
> >  "fact_table": "FACTS",
> >  "null_string": null,
> >  "filter_condition": "KEYWORDS.PUBLISHER_GROUP_ID=386784931",
> >  "cube_partition_desc": {
> >    "partition_date_column": null,
> >    "partition_date_start": 0,
> >    "cube_partition_type": "APPEND"
> >  },
> >  "hbase_mapping": {
> >    "column_family": [
> >      {
> >        "name": "F1",
> >        "columns": [
> >          {
> >            "qualifier": "M",
> >            "measure_refs": ["_COUNT_", "CONVERSIONS"]
> >          }
> >        ]
> >      }
> >    ]
> >  },
> >  "notify_list": ["sam"]
> >}
> >
> >Cheers,
> >sam
> >
> >On Thu, Feb 19, 2015 at 9:49 PM, <[email protected]> wrote:
> >
> >> Also, since you set the dictionary to false, there should not be any
> >> memory consumption while building the dictionary.
> >> So can you also give us the JSON description of the cube? (In the cube
> >> tab, click the corresponding cube, then click the JSON button.)
> >>
> >> On Fri Feb 20 2015 at 1:39:15 PM <[email protected]> wrote:
> >>
> >> > Hi, Samuel
> >> > Can you give us a more detailed log, so we can dig into the root
> >> > cause.
> >> >
> >> > On Fri Feb 20 2015 at 2:44:32 AM Samuel Bock <[email protected]>
> >> > wrote:
> >> >
> >> >> Hello all,
> >> >>
> >> >> We are in the process of evaluating Kylin for use as an OLAP
> >> >> engine. To that end, we are trying to get a minimum viable setup
> >> >> with a representative sample of our data in order to gather
> >> >> performance metrics. We have Kylin running against a 10-node
> >> >> cluster, the provided cubes build successfully, and the system
> >> >> seems functional. Attempting to build a simple cube against our
> >> >> data results in an OutOfMemoryError in the Kylin server process (so
> >> >> far we have given it up to a 46 GB heap). I was wondering if you
> >> >> could give me some guidance as to likely causes, and any
> >> >> configurations I'm likely to have missed, before I start diving
> >> >> into the source. I have changed the "dictionary" setting to false,
> >> >> as recommended for high-cardinality dimensions, but have not
> >> >> changed the configuration significantly apart from that.
> >> >>
> >> >> For reference, the sizes of the Hive tables we're building the
> >> >> cubes from:
> >> >> dimension table: 25,399,061 rows
> >> >> fact table: 270,940,921 rows
> >> >>
> >> >> (And as a note, there are no pertinent log messages except to
> >> >> indicate that it is in the Build Dimension Dictionary step.)
> >> >>
> >> >> Thank you,
> >> >> sam bock
> >> >>
> >> >
> >>
>
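[Editor's note: the heap sizes reported in this thread are roughly consistent with a back-of-envelope estimate based on the table sizes above. The bytes-per-cell figure below is an assumed average, not a measured value, so treat this only as an order-of-magnitude sketch.]

```java
// Rough, assumption-laden estimate of snapshot memory for the dimension
// table discussed in this thread: 25,399,061 rows x 14 columns. The
// ~60 bytes per String cell (object header, char[] overhead, data) is an
// assumed average, not something measured on this data set.
public class HeapEstimateSketch {
    static final long ROWS = 25_399_061L;
    static final long COLS = 14L;
    static final long BYTES_PER_CELL = 60L; // assumed average, incl. JVM overhead

    public static long estimatedBytesPerCopy() {
        return ROWS * COLS * BYTES_PER_CELL;
    }

    public static long estimatedBytesForThreeCopies() {
        // The thread identifies three concurrent copies: the String table,
        // the ByteArrayOutputStream buffer, and the byte[] sent to HBase.
        return 3L * estimatedBytesPerCopy();
    }

    public static void main(String[] args) {
        System.out.printf("one copy : ~%d GB%n", estimatedBytesPerCopy() >> 30);
        System.out.printf("3 copies : ~%d GB%n", estimatedBytesForThreeCopies() >> 30);
    }
}
```

Under these assumptions the three copies land in the tens of gigabytes, the same order of magnitude as the 46 GB heap that failed and the 95 GB heap that barely worked; the real footprint depends on actual field widths and JVM overheads.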
