DRILL-33 - Synthetic Log Generator
Project: http://git-wip-us.apache.org/repos/asf/incubator-drill/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-drill/commit/f04a0fd2
Tree: http://git-wip-us.apache.org/repos/asf/incubator-drill/tree/f04a0fd2
Diff: http://git-wip-us.apache.org/repos/asf/incubator-drill/diff/f04a0fd2

Branch: refs/heads/master
Commit: f04a0fd2228862e708754e7aa09360ca3b62468e
Parents: 61f09e7
Author: tdunning <[email protected]>
Authored: Mon Feb 4 16:42:43 2013 -0800
Committer: tdunning <[email protected]>
Committed: Mon Feb 4 16:42:43 2013 -0800

----------------------------------------------------------------------
 sandbox/prototype/contrib/synth-log/README.md | 35 +
 sandbox/prototype/contrib/synth-log/pom.xml | 24 +
 .../java/org/apache/drill/synth/LogGenerator.java | 57 +
 .../main/java/org/apache/drill/synth/LogLine.java | 37 +
 .../main/java/org/apache/drill/synth/LongTail.java | 48 +
 .../src/main/java/org/apache/drill/synth/Main.java | 41 +
 .../java/org/apache/drill/synth/TermGenerator.java | 39 +
 .../src/main/java/org/apache/drill/synth/User.java | 57 +
 .../java/org/apache/drill/synth/WordGenerator.java | 108 +
 .../contrib/synth-log/src/main/resources/geo-codes | 297 +
 .../synth-log/src/main/resources/other-words |117184 +++++++++++++++
 .../src/main/resources/word-frequency-seed | 5003 +
 .../org/apache/drill/synth/LogGeneratorTest.java | 20 +
 .../org/apache/drill/synth/TermGeneratorTest.java | 80 +
 .../org/apache/drill/synth/WordGeneratorTest.java | 22 +
 15 files changed, 123052 insertions(+), 0 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-drill/blob/f04a0fd2/sandbox/prototype/contrib/synth-log/README.md
----------------------------------------------------------------------
diff --git a/sandbox/prototype/contrib/synth-log/README.md b/sandbox/prototype/contrib/synth-log/README.md
new file mode 100644
index 0000000..b29fde7
--- /dev/null
+++ b/sandbox/prototype/contrib/synth-log/README.md
@@ -0,0 +1,35 @@
+log-synth
+=========
+
+The basic idea here is to have a random log generator build fairly realistic log files for analysis. The analyses specified here are fairly typical use cases for trying to figure out where the load on a web-site is coming from.
+
+The Data Source
+==============
+The data source here is a set of heavily biased random numbers used to generate traffic sources, response times and queries. In order to give a realistic long-tail experience, the data are generated using special random number generators available in the Mahout library.
+
+There are three basic entities involved in the random process that generates these logs: IP addresses, users and queries. Users have a basic traffic rate, and a variable number of users sit behind each IP address. Queries are composed of words which are generated somewhat differently by each user. The response time for each query is determined by the terms in the query, with a very few terms causing much longer response times than others. Each log line contains an IP address, a user cookie, a query and a response time.
+
+Logs of various sizes can be generated using the generator tools.
+
+The Queries
+==============
+The general goal of the queries is to find out what and/or who is causing long query times and where lots of traffic is coming from.
+
+The questions we would like to answer include:
+
+* What are the top IP addresses by request count?
+* What are the top IP addresses by unique user?
+* What are the most common search terms?
+* What are the most common search terms in the slowest 5% of the queries?
+* What is the daily number of searches, (approximate) number of unique users, (approximate) number of unique IP addresses and distribution of response times (average, min, max, 25th, 50th and 75th percentiles)?
+
+Methods
+========
+The general process for generating log lines is to select a user, possibly one we have not seen before. If the user is new, then we need to select an IP address for the user. Otherwise, we reuse the IP address that we remembered for that user.
+
+Queries have an overall frequency distribution that is long-tailed, but each user has a variation on that distribution. In order to model this, we sample each user's queries from a per-user Pitman-Yor process. In order to make users have similar query term distributions, each user's query term distribution is initialized from a Pitman-Yor process that has already been sampled a number of times.
+
+We also need to maintain an average response time per term. The response time for each query is exponentially distributed with a mean equal to the sum of the average response times for the terms. Response times for words are sampled either from an exponential distribution, from a log-gamma distribution or from a gamma distribution with a moderately low shape parameter so that we can have interestingly long tails for response time.
+
+Users are assigned to IP addresses using a Pitman-Yor process with a discount of 0.9. This gives a long-tailed distribution to the number of users per IP address. This results in 90% of all IP addresses having only a single user.
+

http://git-wip-us.apache.org/repos/asf/incubator-drill/blob/f04a0fd2/sandbox/prototype/contrib/synth-log/pom.xml
----------------------------------------------------------------------
diff --git a/sandbox/prototype/contrib/synth-log/pom.xml b/sandbox/prototype/contrib/synth-log/pom.xml
new file mode 100644
index 0000000..39aa680
--- /dev/null
+++ b/sandbox/prototype/contrib/synth-log/pom.xml
@@ -0,0 +1,24 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+    <modelVersion>4.0.0</modelVersion>
+
+    <groupId>log-synth</groupId>
+    <artifactId>log-synth</artifactId>
+    <version>0.1-SNAPSHOT</version>
+
+    <dependencies>
+        <dependency>
+            <groupId>org.apache.mahout</groupId>
+            <artifactId>mahout-math</artifactId>
+            <version>0.8-SNAPSHOT</version>
+        </dependency>
+        <dependency>
+            <groupId>junit</groupId>
+            <artifactId>junit</artifactId>
+            <version>4.8.2</version>
+        </dependency>
+    </dependencies>
+
+</project>
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/incubator-drill/blob/f04a0fd2/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/LogGenerator.java
----------------------------------------------------------------------
diff --git a/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/LogGenerator.java b/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/LogGenerator.java
new file mode 100644
index 0000000..38ce01c
--- /dev/null
+++ b/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/LogGenerator.java
@@ -0,0 +1,57 @@
+package org.apache.drill.synth;
+
+import org.apache.mahout.math.random.Sampler;
+
+import java.net.Inet4Address;
+import java.net.InetAddress;
+import java.net.UnknownHostException;
+import java.util.Random;
+
+/**
+ * Generates kind of realistic log lines consisting of a user id (a cookie), an IP address and a query.
+ */
+public class LogGenerator implements Sampler<LogLine> {
+    private LongTail<InetAddress> ipGenerator = new LongTail<InetAddress>(1, 0.5) {
+        Random gen = new Random();
+
+        @Override
+        protected InetAddress createThing() {
+            int address = gen.nextInt();
+            try {
+                return Inet4Address.getByAddress(new byte[]{
+                        (byte) (address >>> 24),
+                        (byte) (0xff & (address >>> 16)),
+                        (byte) (0xff & (address >>> 8)),
+                        (byte) (0xff & (address))
+                });
+            } catch (UnknownHostException e) {
+                throw new RuntimeException("Can't happen with numeric IP address", e);
+            }
+        }
+    };
+
+    private WordGenerator words = new WordGenerator("word-frequency-seed", "other-words");
+    private TermGenerator terms = new TermGenerator(words, 1, 0.8);
+    private TermGenerator geo = new TermGenerator(new WordGenerator(null, "geo-codes"), 10, 0
+    );
+
+    private LongTail<User> userGenerator = new LongTail<User>(50000, 0) {
+        @Override
+        protected User createThing() {
+            return new User(ipGenerator.sample(), geo, terms);
+        }
+    };
+
+    public Iterable<User> getUsers() {
+        return userGenerator.getThings();
+    }
+
+    public LogLine sample() {
+        // pick a user
+        return new LogLine(userGenerator.sample());
+    }
+
+    public int getUserCount() {
+        return userGenerator.getThings().size();
+    }
+}

http://git-wip-us.apache.org/repos/asf/incubator-drill/blob/f04a0fd2/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/LogLine.java
----------------------------------------------------------------------
diff --git a/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/LogLine.java b/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/LogLine.java
new file mode 100644
index 0000000..64bc8c7
--- /dev/null
+++ b/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/LogLine.java
@@ -0,0 +1,37 @@
+package org.apache.drill.synth;
+
+import java.net.InetAddress;
+import java.util.Formatter;
+import java.util.List;
+
+/**
+ * A log line contains a user id, an IP address and a query.
+ */
+public class LogLine {
+    private InetAddress ip;
+    private long cookie;
+    private List<String> query;
+
+    public LogLine(InetAddress ip, long cookie, List<String> query) {
+        this.cookie = cookie;
+        this.ip = ip;
+        this.query = query;
+    }
+
+    public LogLine(User user) {
+        this(user.getAddress(), user.getCookie(), user.getQuery());
+    }
+
+    @Override
+    public String toString() {
+        Formatter r = new Formatter();
+        r.format("{cookie:\"%08x\", ip:\"%s\", query:", cookie, ip.getHostAddress());
+        String sep = "[";
+        for (String term : query) {
+            r.format("%s\"%s\"", sep, term);
+            sep = ", ";
+        }
+        r.format("]}");
+        return r.toString();
+    }
+}

http://git-wip-us.apache.org/repos/asf/incubator-drill/blob/f04a0fd2/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/LongTail.java
----------------------------------------------------------------------
diff --git a/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/LongTail.java b/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/LongTail.java
new file mode 100644
index 0000000..1e0d2ef
--- /dev/null
+++ b/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/LongTail.java
@@ -0,0 +1,48 @@
+package org.apache.drill.synth;
+
+import com.google.common.collect.Lists;
+import org.apache.mahout.math.random.ChineseRestaurant;
+import org.apache.mahout.math.random.Sampler;
+
+import java.util.List;
+
+/**
+ * Created with IntelliJ IDEA.
+ * User: tdunning
+ * Date: 2/2/13
+ * Time: 6:05 PM
+ * To change this template use File | Settings | File Templates.
+ */
+public abstract class LongTail<T> implements Sampler<T> {
+    private ChineseRestaurant base;
+    private List<T> things = Lists.newArrayList();
+
+    protected LongTail(double alpha, double discount) {
+        base = new ChineseRestaurant(alpha, discount);
+    }
+
+    public T sample() {
+        int n = base.sample();
+        while (n >= things.size()) {
+            things.add(createThing());
+        }
+        return things.get(n);
+    }
+
+    public ChineseRestaurant getBaseDistribution() {
+        return base;
+    }
+
+    protected abstract T createThing();
+
+    public List<T> getThings() {
+        return things;
+    }
+
+    public void setThing(int i, T thing) {
+        while (things.size() <= i) {
+            things.add(null);
+        }
+        things.set(i, thing);
+    }
+}

http://git-wip-us.apache.org/repos/asf/incubator-drill/blob/f04a0fd2/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/Main.java
----------------------------------------------------------------------
diff --git a/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/Main.java b/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/Main.java
new file mode 100644
index 0000000..9260d36
--- /dev/null
+++ b/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/Main.java
@@ -0,0 +1,41 @@
+package org.apache.drill.synth;
+
+
+import com.google.common.base.Charsets;
+import com.google.common.io.Files;
+
+import java.io.BufferedWriter;
+import java.io.File;
+import java.io.IOException;
+
+/**
+ * Create a query log with a specified number of log lines and an associated user profile database.
+ *
+ * Command line args include number of log lines to generate, the name of the log file to generate and the
+ * name of the file to store the user profile database in.
+ *
+ * Log lines and user profile entries are single line JSON.
+ */
+public class Main {
+    public static void main(String[] args) throws IOException {
+        int n = Integer.parseInt(args[0]);
+
+        LogGenerator lg = new LogGenerator();
+        BufferedWriter log = Files.newWriter(new File(args[1]), Charsets.UTF_8);
+        for (int i = 0; i < n; i++) {
+            if (i % 10000 == 0) {
+                System.out.printf("%d %d\n", i, lg.getUserCount());
+            }
+            log.write(lg.sample().toString());
+            log.newLine();
+        }
+        log.close();
+
+        BufferedWriter profile = Files.newWriter(new File(args[2]), Charsets.UTF_8);
+        for (User user : lg.getUsers()) {
+            profile.write(user.toString());
+            profile.newLine();
+        }
+        profile.close();
+    }
+}

http://git-wip-us.apache.org/repos/asf/incubator-drill/blob/f04a0fd2/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/TermGenerator.java
----------------------------------------------------------------------
diff --git a/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/TermGenerator.java b/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/TermGenerator.java
new file mode 100644
index 0000000..85b9f54
--- /dev/null
+++ b/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/TermGenerator.java
@@ -0,0 +1,39 @@
+package org.apache.drill.synth;
+
+import org.apache.mahout.math.random.Sampler;
+
+/**
+ * Generate words at random from a specialized vocabulary.  Every term generator's
+ * frequency distribution has a common basis, but each will diverge after initialization.
+ */
+public class TermGenerator implements Sampler<String> {
+    // the word generator handles the problem of making up new words
+    // it also provides the seed frequencies
+    private WordGenerator words;
+
+    private LongTail<String> distribution;
+
+    public TermGenerator(WordGenerator words, final int alpha, final double discount) {
+        this.words = words;
+        distribution = new LongTail<String>(alpha, discount) {
+            private int count = TermGenerator.this.words.size();
+
+            @Override
+            protected String createThing() {
+                return TermGenerator.this.words.getString(count++);
+            }
+        };
+
+        int i = 0;
+        for (String word : this.words.getBaseWeights().keySet()) {
+            distribution.getBaseDistribution().setCount(i, this.words.getBaseWeights().get(word));
+            distribution.setThing(i, word);
+            i++;
+        }
+
+    }
+
+    public String sample() {
+        return distribution.sample();
+    }
+}

http://git-wip-us.apache.org/repos/asf/incubator-drill/blob/f04a0fd2/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/User.java
----------------------------------------------------------------------
diff --git a/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/User.java b/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/User.java
new file mode 100644
index 0000000..3c85c1f
--- /dev/null
+++ b/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/User.java
@@ -0,0 +1,57 @@
+package org.apache.drill.synth;
+
+import com.google.common.collect.Lists;
+import org.apache.mahout.common.RandomUtils;
+import org.apache.mahout.math.jet.random.Exponential;
+
+import java.net.InetAddress;
+import java.util.List;
+
+/**
+ * Created with IntelliJ IDEA.
+ * User: tdunning
+ * Date: 2/2/13
+ * Time: 6:15 PM
+ * To change this template use File | Settings | File Templates.
+ */
+public class User {
+    private Exponential queryLengthDistribution = new Exponential(0.4, RandomUtils.getRandom());
+
+    private long cookie = RandomUtils.getRandom().nextLong();
+
+    private TermGenerator terms;
+    private InetAddress address;
+    private String geoCode;
+
+    public User(InetAddress address, TermGenerator geoCoder, TermGenerator terms) {
+        this.terms = terms;
+        geoCode = geoCoder.sample();
+        this.address = address;
+    }
+
+    public InetAddress getAddress() {
+        return address;
+    }
+
+    public long getCookie() {
+        return cookie;
+    }
+
+    public List<String> getQuery() {
+        int n = queryLengthDistribution.nextInt() + 1;
+        List<String> r = Lists.newArrayList();
+        for (int i = 0; i < n; i++) {
+            r.add(terms.sample());
+        }
+        return r;
+    }
+
+    public String getGeoCode() {
+        return geoCode;
+    }
+
+    @Override
+    public String toString() {
+        return String.format("{ip:\"%s\", cookie:\"%08x\", geo:\"%s\"}", address, cookie, geoCode);
+    }
+}

http://git-wip-us.apache.org/repos/asf/incubator-drill/blob/f04a0fd2/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/WordGenerator.java
----------------------------------------------------------------------
diff --git a/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/WordGenerator.java b/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/WordGenerator.java
new file mode 100644
index 0000000..3f27d16
--- /dev/null
+++ b/sandbox/prototype/contrib/synth-log/src/main/java/org/apache/drill/synth/WordGenerator.java
@@ -0,0 +1,108 @@
+package org.apache.drill.synth;
+
+import com.google.common.base.Charsets;
+import com.google.common.base.Splitter;
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import com.google.common.io.LineProcessor;
+import com.google.common.io.Resources;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.net.URISyntaxException;
+import java.nio.file.Files;
+import java.nio.file.Paths;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Emulates an infinite list of words, a prefix of which are taken from lists of plausible words.  The first words
+ * are taken from a resource that has frequencies in it.  These frequencies can be used to initialize term
+ * generators to a common language.  The next batch of words are taken from a long list of words with no frequencies.
+ * After that, words are coined by using an integer count.
+ */
+public class WordGenerator {
+    private final Logger log = LoggerFactory.getLogger(WordGenerator.class);
+
+    private BufferedReader wordReader;
+    private final List<String> words = Lists.newArrayList();
+    private final Map<String, Integer> baseWeights = Maps.newLinkedHashMap();
+
+    public WordGenerator(String seed, String others) {
+        // read the common words
+        if (seed != null) {
+            try {
+                Resources.readLines(Resources.getResource(seed), Charsets.UTF_8,
+                        new LineProcessor<Object>() {
+                            private boolean header = true;
+                            private final Splitter onTabs = Splitter.on("\t");
+
+                            public boolean processLine(String s) throws IOException {
+                                if (!s.startsWith("#")) {
+                                    if (!header) {
+                                        Iterator<String> fields = onTabs.split(s).iterator();
+                                        fields.next();
+                                        String word = fields.next();
+                                        words.add(word);
+                                        int count = (int) Math.rint(Double.parseDouble(fields.next()));
+                                        baseWeights.put(word, count);
+                                    } else {
+                                        header = false;
+                                    }
+                                }
+                                return true;
+                            }
+
+                            public Object getResult() {
+                                return null;
+                            }
+                        });
+            } catch (IOException e) {
+                log.error("Can't read resource \"{}\", will continue without realistic words", seed);
+            }
+        }
+
+        try {
+            wordReader = Files.newBufferedReader(Paths.get(Resources.getResource(others).toURI()), Charsets.UTF_8);
+        } catch (IOException e) {
+            log.error("Can't read resource \"{}\", will continue without realistic words", others);
+            wordReader = null;
+        } catch (URISyntaxException e) {
+            log.error("Bad format for resource URI \"{}\", will continue without realistic words", others, e);
+            wordReader = null;
+        }
+
+    }
+
+    public String getString(int n) {
+        if (n >= words.size()) {
+            synchronized (this) {
+                while (n >= words.size()) {
+                    try {
+                        String w = wordReader.readLine();
+                        if (w != null) {
+                            words.add(w);
+                        } else {
+                            words.add("w-" + n);
+                        }
+                    } catch (IOException e) {
+                        log.error("Error reading other words resource", e);
+                        words.add("w-" + n);
+                    }
+                }
+            }
+        }
+        return words.get(n);
+    }
+
+    public Map<String, Integer> getBaseWeights() {
+        return baseWeights;
+    }
+
+    public int size() {
+        return words.size();
+    }
+}

http://git-wip-us.apache.org/repos/asf/incubator-drill/blob/f04a0fd2/sandbox/prototype/contrib/synth-log/src/main/resources/geo-codes
----------------------------------------------------------------------
diff --git a/sandbox/prototype/contrib/synth-log/src/main/resources/geo-codes b/sandbox/prototype/contrib/synth-log/src/main/resources/geo-codes
new file mode 100644
index 0000000..f08d1fc
--- /dev/null
+++ b/sandbox/prototype/contrib/synth-log/src/main/resources/geo-codes
@@ -0,0 +1,297 @@
+AL
+AK
+AZ
+AR
+CA
+CO
+CT
+DE
+FL
+GA
+HI
+ID
+IL
+IN
+IA
+KS
+KY
+LA
+ME
+MD
+MA
+MI
+MN
+MS
+MO
+MT
+NE
+NV
+NH
+NJ
+NM
+NY
+NC
+ND
+OH
+OK
+OR
+PA
+RI
+SC
+SD
+TN
+TX
+UT
+VT
+VA
+WA
+WV
+WI
+WY
+AS
+DC
+FM
+GU
+MH
+MP
+PW
+PR
+VI
+AF
+AL
+DZ
+AS
+AD
+AO
+AI
+AQ
+AG
+AR
+AM
+AW
+AU
+AT
+AZ
+BS
+BH
+BD
+BB
+BY
+BE
+BZ
+BJ
+BM
+BT
+BO
+BA
+BW
+BV
+BR
+IO
+BN
+BG
+BF
+BI
+KH
+CM
+CA
+CV
+KY
+CF
+TD
+CL
+CN
+CX
+CC
+CO
+KM
+CG
+CD
+CK
+CR
+CI
+HR
+CU
+CY
+CZ
+DK
+DJ
+DM
+DO
+TP
+EC
+EG
+SV
+GQ
+ER
+EE
+ET
+FK
+FO
+FJ
+FI
+FR
+FX
+GF
+PF
+TF
+GA
+GM
+GE
+DE
+GH
+GI
+GR
+GL
+GD
+GP
+GU
+GT
+GN
+GW
+GY
+HT
+HM
+VA
+HN
+HK
+HU
+IS
+IN
+ID
+IR
+IQ
+IE
+IL
+IT
+JM
+JP
+JO
+KZ
+KE
+KI
+KP
+KR
+KW
+KG
+LA
+LV
+LB
+LS
+LR
+LY
+LI
+LT
+LU
+MO
+MK
+MG
+MW
+MY
+MV
+ML
+MT
+MH
+MQ
+MR
+MU
+YT
+MX
+FM
+MD
+MC
+MN
+MS
+MA
+MZ
+MM
+NA
+NR
+NP
+NL
+AN
+NC
+NZ
+NI
+NE
+NG
+NU
+NF
+MP
+NO
+OM
+PK
+PW
+PA
+PG
+PY
+PE
+PH
+PN
+PL
+PT
+PR
+QA
+RE
+RO
+RU
+RW
+KN
+LC
+VC
+WS
+SM
+ST
+SA
+SN
+SC
+SL
+SG
+SK
+SI
+SB
+SO
+ZA
+GS
+ES
+LK
+SH
+PM
+SD
+SR
+SJ
+SZ
+SE
+CH
+SY
+TW
+TJ
+TZ
+TH
+TG
+TK
+TO
+TT
+TN
+TR
+TM
+TC
+TV
+UG
+UA
+AE
+GB
+US
+UM
+UY
+UZ
+VU
+VE
+VN
+VG
+VI
+WF
+EH
+YE
+ZM
+ZW
\ No newline at end of file
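
Note on the sampling scheme: the README's Methods section and `LongTail` both lean on Mahout's `ChineseRestaurant` sampler, whose source is not part of this commit. The following is a minimal, stand-alone sketch of a two-parameter (Pitman-Yor style) Chinese restaurant process, written only to illustrate how the `alpha` and `discount` arguments could shape the long tail. The class `PitmanYorSketch` and all of its methods are hypothetical names introduced here; this is not the Mahout implementation and may differ from it in detail.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Hypothetical illustration of a two-parameter Chinese restaurant process.
 * This is NOT Mahout's ChineseRestaurant; it only sketches the same idea.
 */
final class PitmanYorSketch {
    private final double alpha;     // strength parameter (assumed >= 0 here)
    private final double discount;  // discount parameter, in [0, 1)
    private final Random rng;
    private final List<Integer> counts = new ArrayList<>(); // customers per table
    private int total = 0;                                   // customers seated so far

    PitmanYorSketch(double alpha, double discount, long seed) {
        if (alpha < 0 || discount < 0 || discount >= 1) {
            throw new IllegalArgumentException("need alpha >= 0 and 0 <= discount < 1");
        }
        this.alpha = alpha;
        this.discount = discount;
        this.rng = new Random(seed);
    }

    /** Seats one customer and returns the table index; an index equal to the old table count is a new table. */
    int sample() {
        int k = counts.size();
        // Unnormalized weights: a new table gets alpha + discount * k, and table i
        // gets counts[i] - discount. Together they sum to alpha + total.
        double pNew = alpha + discount * k;
        double u = rng.nextDouble() * (alpha + total);
        if (k == 0 || u < pNew) {
            counts.add(1);
            total++;
            return k;
        }
        u -= pNew;
        for (int i = 0; i < k; i++) {
            u -= counts.get(i) - discount;
            if (u < 0) {
                return seatAt(i);
            }
        }
        return seatAt(k - 1); // guard against floating point round-off
    }

    private int seatAt(int i) {
        counts.set(i, counts.get(i) + 1);
        total++;
        return i;
    }

    int tableCount() {
        return counts.size();
    }

    List<Integer> counts() {
        return counts;
    }

    /** Fraction of tables that hold exactly one customer. */
    double singletonFraction() {
        int singles = 0;
        for (int c : counts) {
            if (c == 1) {
                singles++;
            }
        }
        return counts.isEmpty() ? 0.0 : (double) singles / counts.size();
    }

    public static void main(String[] args) {
        // Think of tables as IP addresses and customers as users; 0.9 is the
        // discount quoted in the README, the seed and sample size are arbitrary.
        PitmanYorSketch ips = new PitmanYorSketch(1.0, 0.9, 42L);
        for (int i = 0; i < 100000; i++) {
            ips.sample();
        }
        System.out.printf("tables=%d, singleton fraction=%.3f%n",
                ips.tableCount(), ips.singletonFraction());
    }
}
```

Running `main` with different discounts is one way to check the README's statement about the share of single-user IP addresses empirically. Note that `LogGenerator` in this commit constructs its IP generator with a discount of 0.5, not the 0.9 quoted in the README.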
