Hello, I am quite a novice when it comes to predictive modelling, so I would like to see where my particular problem lies in the spectrum of problems that you have collectively seen in your experience.

Background: I have been handed a piece of software that uses a Kohonen self-organising map (SOM) to analyse and predict data in which missing values are common, but I want to compare its results with other forms of modelling and prediction (e.g. multi-layer perceptrons, random forests?).

My data is a conglomeration of borehole data from hundreds of boreholes. Some measurements were made while the boreholes were being drilled (more or less continuous 'tool responses': geophysical well-logs), and some in the laboratory on discrete samples at scales from 10 cm up to a metre.

The data could be considered ordered series to some extent, though changes in rock types with depth can result in 'step' changes in tool responses.

My problem is not classifying the rocks, but modelling and predicting a physical attribute of them: thermal conductivity, which is a laboratory measurement and therefore expensive and hard to come by. I want to use the more common well-log responses to predict this attribute.

Different boreholes have different sets of well-log data, though. For example, one might have measurements from tools A and B, another from tools A, B, and C, and a third from tools B and C. I can construct a decent database of about 70,000 observations of a common set of 5 tool responses, with about 100 associated measurements of thermal conductivity. I am fairly confident that the relationship between the well-log responses and thermal conductivity is non-linear; linear regression has not proven accurate.
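To make the comparison concrete, this is the kind of non-linear model I have in mind as an alternative to the SOM. It is an untested sketch: 'logs', the tool-response columns toolA..toolE, and the conductivity column tc are invented names standing in for my real table.

## logs: ~70,000 rows of the 5 common tool responses, with tc
## (lab thermal conductivity) present on only ~100 of them
library(randomForest)

train <- subset(logs, !is.na(tc))   # the ~100 calibrated rows

set.seed(1)
fit <- randomForest(tc ~ toolA + toolB + toolC + toolD + toolE,
                    data = train, ntree = 500, importance = TRUE)

## predict thermal conductivity over the full logged intervals
logs$tc_hat <- predict(fit, newdata = logs)

## which tool responses carry most of the signal?
importance(fit)
varImpPlot(fit)

Would a variable-importance measure like this be a sensible way of ranking the tools?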

What 'sort' of problem is this?

Have you seen problems like this, and what did you use to solve them?

I have papers by people using other ANN-type techniques (MLPs in particular) to model and predict thermal conductivity, but I wondered if there was something else I could try.
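For reference, the MLP comparison I am picturing would be something like the sketch below, using the nnet package's single-hidden-layer network. The same invented names apply, and the size and decay values are placeholders that would need tuning.

library(nnet)

train <- subset(logs, !is.na(tc))
cols  <- c("toolA", "toolB", "toolC", "toolD", "toolE")

## nnet works best with scaled inputs
X <- scale(train[, cols])

set.seed(1)
mlp <- nnet(X, train$tc, size = 8, decay = 0.01,
            linout = TRUE, maxit = 1000)   # linout = TRUE for regression

## new data must be scaled with the same centre/spread as the training set
Xnew <- scale(logs[, cols],
              center = attr(X, "scaled:center"),
              scale  = attr(X, "scaled:scale"))
logs$tc_mlp <- predict(mlp, Xnew)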

Some other questions I would like a little guidance on:
Are 100 samples of the 'target' attribute enough for confident modelling and prediction?
How would I quantify the certainty of the modelling results? (A cross-validation sketch of what I mean follows this list.)
The well-log data is extensive, but if I look at the complete set of tool responses there is a LOT of missing data (because there is no common tool set). Is there a way I can still use the less common tool responses?
Is discretisation of the 100 measured thermal conductivities a silly idea? If not, how many 'bins' can I sensibly construct? (A second sketch follows.)
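On quantifying certainty: is leave-one-out cross-validation over the ~100 calibration points the right idea? I am imagining something like this (same invented names as above), with the held-out RMSE as the headline figure:

library(randomForest)

train <- subset(logs, !is.na(tc))
n    <- nrow(train)
pred <- numeric(n)

set.seed(1)
for (i in seq_len(n)) {
  fit <- randomForest(tc ~ toolA + toolB + toolC + toolD + toolE,
                      data = train[-i, ], ntree = 500)
  pred[i] <- predict(fit, newdata = train[i, ])
}

## error on points the model never saw
sqrt(mean((pred - train$tc)^2))
plot(train$tc, pred); abline(0, 1)   # observed vs predicted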
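And on the discretisation question, I assume quantile-based breaks would at least keep the classes balanced; with only ~100 values, more than 4 or 5 bins seems to leave very few observations per class:

## four quantile bins of the ~100 measured conductivities
tc_obs <- logs$tc[!is.na(logs$tc)]
breaks <- quantile(tc_obs, probs = seq(0, 1, length.out = 5))
tc_bin <- cut(tc_obs, breaks = breaks, include.lowest = TRUE)
table(tc_bin)   # roughly 25 observations per bin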

Thanks for reading!
Ben.
