Hi,

We are evaluating the deployment of Spotlight on EC2 instances. We would like to know what performance we should expect, and which tweaks, if any, would improve the speed.
Our goal is to extract key concepts from various texts (English to start with).

So far we have followed the "Run from a JAR" installation process:
https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Run-from-a-JAR

We are running the 6.5 jar version. The server.properties file was taken from
https://raw.github.com/dbpedia-spotlight/dbpedia-spotlight/master/conf/server.properties

We query /rest/annotate with the following params:
  disambiguator=Document
  confidence=0.3
  support=10

And headers:
  Accept: application/json
  Content-Type: application/x-www-form-urlencoded

Requests are sent from another EC2 instance to minimize network latency. We have set the -Xmx value to (maxMem - 1GB).

From a batch of 1000 texts with an average length of 3000 chars, we get the following throughput:
- m1.large (7.5GB, 2 cores, 4 EC2 compute units): 0.7 texts/sec
- m1.xlarge (15GB, 4 cores, 8 EC2 compute units): 13.7 texts/sec
- m2.xlarge (17.1GB, 2 cores, 6.5 EC2 compute units): 8.8 texts/sec
- m2.2xlarge (34.2GB, 4 cores, 13 EC2 compute units): 16.7 texts/sec

Are these numbers in line with what should be expected?

From the output of our Spotlight server, it looks like the disambiguation step is the most time-consuming. Do you have any tips for speeding up disambiguation?
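For reference, the request and the texts/sec figure described above can be sketched roughly as follows. This is only an illustration, not our actual benchmark script: the helper names (build_annotate_request, texts_per_second, measure_throughput), the form field name "text", and the base URL (taken from the server.properties below) are assumptions, and requests are sent sequentially.

```python
import time
import urllib.parse
import urllib.request

# Assumed base URL, matching org.dbpedia.spotlight.web.rest.uri in server.properties.
BASE_URL = "http://localhost:2222/rest"


def build_annotate_request(text, base_url=BASE_URL):
    """Build a POST to /rest/annotate with the params and headers listed above."""
    body = urllib.parse.urlencode({
        "text": text,  # assumed name of the form field carrying the input text
        "disambiguator": "Document",
        "confidence": "0.3",
        "support": "10",
    }).encode("utf-8")
    headers = {
        "Accept": "application/json",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return urllib.request.Request(
        base_url + "/annotate", data=body, headers=headers, method="POST"
    )


def texts_per_second(n_texts, elapsed_seconds):
    """Throughput as reported in the list above: texts divided by wall-clock seconds."""
    if elapsed_seconds <= 0:
        return float("inf")
    return n_texts / elapsed_seconds


def _post(request):
    """Send one request and read the whole response body."""
    with urllib.request.urlopen(request) as response:
        return response.read()


def measure_throughput(texts, send=_post, base_url=BASE_URL):
    """Annotate each text in turn and return the overall texts/sec."""
    start = time.perf_counter()
    for text in texts:
        send(build_annotate_request(text, base_url))
    return texts_per_second(len(texts), time.perf_counter() - start)
```

Calling measure_throughput(list_of_texts) against a running server returns a single texts/sec value; passing a different send callable makes it possible to try the loop without a server.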
Thanks,
Marc

Here is our server.properties config:

#### server.properties ####
org.dbpedia.spotlight.web.rest.uri = http://localhost:2222/rest
org.dbpedia.spotlight.default_namespace = http://dbpedia.org/resource/
org.dbpedia.spotlight.default_ontology= http://dbpedia.org/ontology/
org.dbpedia.spotlight.language = English
org.dbpedia.spotlight.language_i18n_code = en
org.dbpedia.spotlight.data.stopWords.english = /data/spotlight/stopwords.en.list
org.dbpedia.spotlight.spot.spotters = LingPipeSpotter,WikiMarkupSpotter,AtLeastOneNounSelector,CoOccurrenceBasedSelector
# Path to serialized LingPipe dictionary used by LingPipeSpotter
org.dbpedia.spotlight.spot.dictionary = /data/spotlight/surface_forms-Wikipedia-TitRedDis.thresh3.spotterDictionary
org.dbpedia.spotlight.spot.allowOverlap = false
org.dbpedia.spotlight.spot.caseSensitive = false
# Configurations for the CoOccurrenceBasedSelector
org.dbpedia.spotlight.spot.cooccurrence.datasource = ukwac
org.dbpedia.spotlight.spot.cooccurrence.database.jdbcdriver = org.hsqldb.jdbcDriver
org.dbpedia.spotlight.spot.cooccurrence.database.connector = jdbc:hsqldb:file:/data/spotlight/spotsel/ukwac_candidate;shutdown=true&readonly=true
org.dbpedia.spotlight.spot.cooccurrence.database.user = sa
org.dbpedia.spotlight.spot.cooccurrence.database.password =
org.dbpedia.spotlight.spot.cooccurrence.classifier.unigram = /data/spotlight/spotsel/ukwac_unigram.model
org.dbpedia.spotlight.spot.cooccurrence.classifier.ngram = /data/spotlight/spotsel/ukwac_ngram.model
# Path to serialized HMM model for LingPipe-based POS tagging. Required by AtLeastOneNounSelector and CoOccurrenceBasedSelector
org.dbpedia.spotlight.tagging.hmm = /data/spotlight/pos-en-general-brown.HiddenMarkovModel
org.dbpedia.spotlight.spot.opennlp.dir = /data/spotlight/3.7/opennlp
org.dbpedia.spotlight.spot.opennlp.location=http://dbpedia.org/ontology/Place
# From http://spotlight.dbpedia.org/download/release-0.5/candidate-index-full.tgz
org.dbpedia.spotlight.candidateMap.dir = /data/spotlight/candidateIndexTitRedDis
org.dbpedia.spotlight.candidateMap.loadToMemory = true
# List of disambiguators to load: Document,Occurrences,CuttingEdge,Default
org.dbpedia.spotlight.disambiguate.disambiguators = Document
# Path to a directory containing Lucene index files. These can be downloaded from the website or created by org.dbpedia.spotlight.lucene.index.IndexMergedOccurrences
org.dbpedia.spotlight.index.dir =/data/spotlight/index-withSF-withTypes-compressed
org.dbpedia.spotlight.index.loadToMemory = false
org.dbpedia.spotlight.lucene.analyzer = org.apache.lucene.analysis.en.EnglishAnalyzer
org.dbpedia.spotlight.lucene.version = LUCENE_36
jcs.default.cacheattributes.MaxObjects = 5000
# Configuration for SparqlFilter
org.dbpedia.spotlight.sparql.endpoint = http://dbpedia.org/sparql
org.dbpedia.spotlight.sparql.graph = http://dbpedia.org
#######################

_______________________________________________
Dbp-spotlight-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dbp-spotlight-users
