Thank you for your response,

I went into the code, and I'm fairly confident that I've isolated the
problem. The OutOfMemoryError occurs during the dimension dictionary step,
but it is not actually related to the dictionary itself (since, as you
mentioned, that is skipped when dictionary=false). The problem arises from
the second half of that step, where the dimension table snapshot is built.
Looking at the code, building the snapshot loads the entire table into
memory as strings (SnapshotTable.takeSnapshot), then serializes that to an
in-memory ByteArrayOutputStream (ResourceStore.putResource), and finally
creates a copy of the stream's internal byte array in order to store it in
HBase (HBaseResourceStore.checkAndPutResourceImpl). That means there needs
to be room for three in-memory copies of the full dimension table. Given
that even our test-subset dimension table is 25 million rows by 14 columns,
that becomes problematic. In our experiments it fails even with a 95 GB heap.
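
To make the pattern concrete, here is a minimal, self-contained Java sketch
of the three-copy behaviour as I understand it. This is my own paraphrase,
not the Kylin source; only the three class/method names mentioned above come
from the actual code, everything else (class name, row counts, columns) is
purely illustrative:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Toy illustration of the three-copy pattern described above (not Kylin code).
public class SnapshotMemorySketch {
    public static void main(String[] args) throws IOException {
        // Copy 1: the whole lookup table materialized as Java strings,
        // analogous to SnapshotTable.takeSnapshot.
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            rows.add(new String[] { "dim_" + i, "some", "other", "columns" });
        }

        // Copy 2: the snapshot serialized into an in-memory buffer rather
        // than streamed to the store, analogous to ResourceStore.putResource.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (String[] row : rows) {
            for (String col : row) {
                out.writeUTF(col);
            }
        }
        out.flush();

        // Copy 3: toByteArray() duplicates the stream's internal buffer;
        // as I read it, this is the copy that ends up in the HBase Put in
        // HBaseResourceStore.checkAndPutResourceImpl.
        byte[] bytes = buf.toByteArray();

        System.out.println("rows in memory: " + rows.size()
                + ", serialized buffer: " + buf.size() + " bytes"
                + ", duplicated array: " + bytes.length + " bytes");
    }
}

At our scale each of those copies covers the full ~25M-row dimension table,
which is why raising the heap only delays the failure.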

For completeness, here is the log leading up to the crash (minus the
irrelevant ZooKeeper messages):
 - Start to execute command:
 -cubename foo -segmentname FULL_BUILD -input
/tmp/kylin-7d2b7588-17c0-4d80-9962-14ca63929186/foo/fact_distinct_columns
[QuartzScheduler_Worker-1]:[2015-02-19
22:59:01,284][INFO][com.kylinolap.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:57)]
- Building snapshot of KEYWORDS
[QuartzScheduler_Worker-2]:[2015-02-19
22:59:53,241][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)]
- 0 pending jobs
[QuartzScheduler_Worker-3]:[2015-02-19
23:00:53,252][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)]
- 0 pending jobs
[QuartzScheduler_Worker-1]:[2015-02-19
23:01:01,278][INFO][com.kylinolap.dict.lookup.FileTableReader.autoDetectDelim(FileTableReader.java:156)]
- Auto detect delim to be ' ', split line to 14 columns --
1020_18768_4_127200_4647593_group_341686994 group 19510703 0 18768 1020
341686994 4647593 371981 4 127200 CONTENT 2015-01-21 22:16:36.227246
[http-bio-7070-exec-8]:[2015-02-19
23:02:07,980][DEBUG][com.kylinolap.rest.service.AdminService.getConfigAsString(AdminService.java:91)]
- Get Kylin Runtime Config
[QuartzScheduler_Worker-4]:[2015-02-19
23:02:53,934][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)]
- 0 pending jobs
[QuartzScheduler_Worker-1]:[2015-02-19
23:03:10,216][DEBUG][com.kylinolap.common.persistence.ResourceStore.putResource(ResourceStore.java:166)]
- Saving resource
/table_snapshot/part-00000.csv/f87954d5-fdfa-4903-9f82-771d85df6367.snapshot
(Store kylin_metadata_qa@hbase)
[QuartzScheduler_Worker-6]:[2015-02-19
23:04:53,230][DEBUG][com.kylinolap.job.engine.JobFetcher.execute(JobFetcher.java:60)]
- 0 pending jobs
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
Dumping heap to java_pid3705.hprof ...


The cube JSON is:

{
  "uuid": "ba6105ca-a18d-4839-bed0-c89b86817110",
  "name": "foo",
  "description": "",
  "dimensions": [
    {
      "id": 1,
      "name": "KEYWORDS_DERIVED",
      "join": {
        "type": "left",
        "primary_key": [
          "DIM_ID"
        ],
        "foreign_key": [
          "KEYWORD_DIM_ID"
        ]
      },
      "hierarchy": null,
      "table": "KEYWORDS",
      "column": "{FK}",
      "datatype": null,
      "derived": [
        "PUBLISHER_GROUP_ID",
        "PUBLISHER_CAMPAIGN_ID",
        "PUBLISHER_ID"
      ]
    }
  ],
  "measures": [
    {
      "id": 1,
      "name": "_COUNT_",
      "function": {
        "expression": "COUNT",
        "parameter": {
          "type": "constant",
          "value": "1"
        },
        "returntype": "bigint"
      },
      "dependent_measure_ref": null
    },
    {
      "id": 2,
      "name": "CONVERSIONS",
      "function": {
        "expression": "SUM",
        "parameter": {
          "type": "column",
          "value": "CONVERSIONS"
        },
        "returntype": "bigint"
      },
      "dependent_measure_ref": null
    }
  ],
  "rowkey": {
    "rowkey_columns": [
      {
        "column": "KEYWORD_DIM_ID",
        "length": 0,
        "dictionary": "false",
        "mandatory": false
      }
    ],
    "aggregation_groups": [
      [
        "KEYWORD_DIM_ID"
      ]
    ]
  },
  "signature": "T+aYTH/KlCwwmVAGRQR3hQ==",
  "capacity": "LARGE",
  "last_modified": 1424367558297,
  "fact_table": "FACTS",
  "null_string": null,
  "filter_condition": "KEYWORDS.PUBLISHER_GROUP_ID=386784931",
  "cube_partition_desc": {
    "partition_date_column": null,
    "partition_date_start": 0,
    "cube_partition_type": "APPEND"
  },
  "hbase_mapping": {
    "column_family": [
      {
        "name": "F1",
        "columns": [
          {
            "qualifier": "M",
            "measure_refs": [
              "_COUNT_",
              "CONVERSIONS"
            ]
          }
        ]
      }
    ]
  },
  "notify_list": [
    "sam"
  ]
}


Cheers,
sam

On Thu, Feb 19, 2015 at 9:49 PM, 周千昊 <[email protected]> wrote:

> Also since you set the dictionary to false, there should not be any memory
> consuming while building dictionary.
> So can you also give us the json description of the cube?(in the cube tab,
> click the corresponding cube, click the json button)
>
>
> On Fri Feb 20 2015 at 1:39:15 PM 周千昊 <[email protected]> wrote:
>
> > Hi, Samuel
> >      Can you give us some detail log, so we can dig into the root cause
> >
> > On Fri Feb 20 2015 at 2:44:32 AM Samuel Bock <[email protected]>
> > wrote:
> >
> >> Hello all,
> >>
> >> We are in the process of evaluating Kylin for use as an OLAP engine. To
> >> that end, we are trying to get a minimum viable setup with a
> >> representative
> >> sample of our data in order to gather performance metrics. We have kylin
> >> running against a 10 node cluster, the provided cubes build successfully
> >> and the system seems functional. Attempting to build a simple cube
> against
> >> our data results in an OutOfMemoryError in the kylin server process (so
> >> far
> >> we have given it up to a 46 gig heap). I was wondering if you could give
> >> me
> >> some guidance as to likely causes, any configurations I'm likely to have
> >> missed before I start diving into the source. I have changed the
> >> "dictionary" setting to false, as recommended for high-cardinality
> >> dimensions, but have not changed configuration significantly apart from
> >> that.
> >>
> >> For reference, the sizes of the hive tables we're building the cubes
> from
> >> dimension table: 25,399,061 rows
> >> fact table: 270,940,921 rows
> >>
> >> (And as a note, there are no pertinent log messages except to indicate
> >> that
> >> it is in the Build Dimension Dictionary step)
> >>
> >> Thank you,
> >> sam bock
> >>
> >
>
