[ https://issues.apache.org/jira/browse/SOLR-10317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16011667#comment-16011667 ]
Vivek Narang edited comment on SOLR-10317 at 5/16/17 3:08 AM:
--------------------------------------------------------------

Hello [~ichattopadhyaya],

I also found another dataset that can be used for some scenarios/features. Please see [https://www.kaggle.com/snap/amazon-fine-food-reviews]. This dataset is huge, with over half a million records and ten fields. There is a good mix of text and numeric fields. The indexing time currently observed is 1222 seconds on a standalone node. Please access the file (~250MB) here: [http://162.243.101.83/Reviews.csv]. I think this is awesome! Please let me know what you think. Thanks.

--- Data Structure Details ---
Id
ProductId - unique identifier for the product
UserId - unique identifier for the user
ProfileName
HelpfulnessNumerator - number of users who found the review helpful
HelpfulnessDenominator - number of users who indicated whether they found the review helpful
Score - rating between 1 and 5
Time - timestamp for the review
Summary - brief summary of the review
Text - text of the review
---
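As a rough illustration of the field list in the comment above, the following Python sketch reads rows with those ten column names and converts the count and rating fields to integers. It assumes the CSV has a header row using exactly the listed names, which the comment does not confirm; `read_reviews` and the sample row are hypothetical and not taken from the real file.

```python
import csv
import io

# Field names as listed in the comment; a matching header row is an assumption.
FIELDS = [
    "Id", "ProductId", "UserId", "ProfileName",
    "HelpfulnessNumerator", "HelpfulnessDenominator",
    "Score", "Time", "Summary", "Text",
]

# Fields described as counts or a 1-5 rating; converted to int here.
INT_FIELDS = ("HelpfulnessNumerator", "HelpfulnessDenominator", "Score")


def read_reviews(lines):
    """Yield one dict per review from an iterable of CSV lines."""
    reader = csv.DictReader(lines)
    missing = set(FIELDS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for row in reader:
        doc = {name: row[name] for name in FIELDS}
        for name in INT_FIELDS:
            doc[name] = int(doc[name])
        # Time stays a raw string: the comment does not state its format.
        yield doc


# Made-up sample row (not from the real dataset) to show the record shape.
SAMPLE = io.StringIO(
    "Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,"
    "HelpfulnessDenominator,Score,Time,Summary,Text\n"
    '1,P001,U001,alice,2,3,5,0,"Great","Tasty, would buy again"\n'
)
docs = list(read_reviews(SAMPLE))
```

For the real file, the same function could be fed an open file handle (opened with `newline=""`); the file's encoding is not stated in the comment and would need to be checked first.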
> Solr Nightly Benchmarks
> -----------------------
>
>                 Key: SOLR-10317
>                 URL: https://issues.apache.org/jira/browse/SOLR-10317
>             Project: Solr
>          Issue Type: Task
>            Reporter: Ishan Chattopadhyaya
>              Labels: gsoc2017, mentor
>         Attachments: changes-lucene-20160907.json,
> changes-solr-20160907.json, managed-schema,
> Narang-Vivek-SOLR-10317-Solr-Nightly-Benchmarks.docx,
> Narang-Vivek-SOLR-10317-Solr-Nightly-Benchmarks-FINAL-PROPOSAL.pdf,
> solrconfig.xml
>
>
> Solr needs nightly benchmarks reporting. Similar Lucene benchmarks can be
> found here, https://home.apache.org/~mikemccand/lucenebench/.
> Preferably, we need:
> # A suite of benchmarks that build Solr from a commit point, start Solr
> nodes, both in SolrCloud and standalone mode, and record timing information
> of various operations like indexing, querying, faceting, grouping,
> replication etc.
> # It should be possible to run them either as an independent suite or as a
> Jenkins job, and we should be able to report timings as graphs (Jenkins has
> some charting plugins).
> # The code should eventually be integrated in the Solr codebase, so that it
> never goes out of date.
> There is some prior work / discussion:
> # https://github.com/shalinmangar/solr-perf-tools (Shalin)
> # https://github.com/chatman/solr-upgrade-tests/blob/master/BENCHMARKS.md
> (Ishan/Vivek)
> # SOLR-2646 & SOLR-9863 (Mark Miller)
> # https://home.apache.org/~mikemccand/lucenebench/ (Mike McCandless)
> # https://github.com/lucidworks/solr-scale-tk (Tim Potter)
> There is support for building, starting, indexing/querying and stopping Solr
> in some of these frameworks above. However, the benchmarks run are very
> limited. Any of these can be a starting point, or a new framework can as well
> be used. The motivation is to be able to cover every functionality of Solr
> with a corresponding benchmark that is run every night.
> Proposing this as a GSoC 2017 project.
> I'm willing to mentor, and I'm sure
> [~shalinmangar] and [~markrmil...@gmail.com] would help here.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
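
The first requirement in the quoted issue description (record timing information for operations such as indexing and querying, per commit point and per mode) could be sketched roughly as below. This is only a minimal illustration in Python, not the project's actual design: `BenchmarkRecorder`, the commit id `abc1234`, and the `time.sleep` placeholders are all hypothetical, and no Solr API is called.

```python
import json
import time
from contextlib import contextmanager


class BenchmarkRecorder:
    """Collects wall-clock timings for named operations (hypothetical helper)."""

    def __init__(self, commit, mode):
        self.commit = commit   # commit point the build was made from
        self.mode = mode       # e.g. "standalone" or "solrcloud"
        self.timings = {}      # operation name -> elapsed seconds

    @contextmanager
    def measure(self, operation):
        """Time the enclosed block and store the result under `operation`."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[operation] = time.perf_counter() - start

    def to_json(self):
        """Serialize one run so a charting job could pick it up later."""
        return json.dumps(
            {"commit": self.commit, "mode": self.mode,
             "timings_sec": self.timings},
            sort_keys=True,
        )


# The sleeps stand in for real indexing/querying steps against a Solr node.
recorder = BenchmarkRecorder(commit="abc1234", mode="standalone")
with recorder.measure("indexing"):
    time.sleep(0.01)
with recorder.measure("querying"):
    time.sleep(0.01)
report = json.loads(recorder.to_json())
```

Writing one JSON record per nightly run keeps the measurement step independent of how the graphs are produced, whether by a Jenkins plugin or a standalone script.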