[ 
https://issues.apache.org/jira/browse/SOLR-10317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16011667#comment-16011667
 ] 

Vivek Narang edited comment on SOLR-10317 at 5/16/17 3:08 AM:
--------------------------------------------------------------

Hello [~ichattopadhyaya] I also found out another dataset which can be used for 
some scenarios/features. Please see 
[https://www.kaggle.com/snap/amazon-fine-food-reviews]. This data set is huge 
with over half a million records and ten fields. There is a good mix of text 
and numeric fields. The current indexing time as observed is 1222 seconds on a 
standalone node. Please access the file (~250MB) here: 
[http://162.243.101.83/Reviews.csv]. I think this is awesome! Please let me 
know what you think. Thanks. 

--- Data Structure Details ---
Id
ProductId - unique identifier for the product
UserId - unqiue identifier for the user
ProfileName
HelpfulnessNumerator - number of users who found the review helpful
HelpfulnessDenominator - number of users who indicated whether they found the 
review helpful
Score - rating between 1 and 5
Time - timestamp for the review
Summary - brief summary of the review
Text - text of the review
---


was (Author: vivek.nar...@uga.edu):
Hello [~ichattopadhyaya] I also found out another dataset which can be used for 
some scenarios/features. Please see 
[https://www.kaggle.com/snap/amazon-fine-food-reviews]. This data set is huge 
with over half a million records and ten fields. There is a good mix of text 
and numeric fields. The current indexing time as observed is 1222 seconds on a 
standalone node. Please access the file (~250MB) here: 
[http://162.243.101.83/Reviews.csv]. I think this is awesome! Please let me 
know what you think. Thanks. 

> Solr Nightly Benchmarks
> -----------------------
>
>                 Key: SOLR-10317
>                 URL: https://issues.apache.org/jira/browse/SOLR-10317
>             Project: Solr
>          Issue Type: Task
>            Reporter: Ishan Chattopadhyaya
>              Labels: gsoc2017, mentor
>         Attachments: changes-lucene-20160907.json, 
> changes-solr-20160907.json, managed-schema, 
> Narang-Vivek-SOLR-10317-Solr-Nightly-Benchmarks.docx, 
> Narang-Vivek-SOLR-10317-Solr-Nightly-Benchmarks-FINAL-PROPOSAL.pdf, 
> solrconfig.xml
>
>
> Solr needs nightly benchmarks reporting. Similar Lucene benchmarks can be 
> found here, https://home.apache.org/~mikemccand/lucenebench/.
> Preferably, we need:
> # A suite of benchmarks that build Solr from a commit point, start Solr 
> nodes, both in SolrCloud and standalone mode, and record timing information 
> of various operations like indexing, querying, faceting, grouping, 
> replication etc.
> # It should be possible to run them either as an independent suite or as a 
> Jenkins job, and we should be able to report timings as graphs (Jenkins has 
> some charting plugins).
> # The code should eventually be integrated in the Solr codebase, so that it 
> never goes out of date.
> There is some prior work / discussion:
> # https://github.com/shalinmangar/solr-perf-tools (Shalin)
> # https://github.com/chatman/solr-upgrade-tests/blob/master/BENCHMARKS.md 
> (Ishan/Vivek)
> # SOLR-2646 & SOLR-9863 (Mark Miller)
> # https://home.apache.org/~mikemccand/lucenebench/ (Mike McCandless)
> # https://github.com/lucidworks/solr-scale-tk (Tim Potter)
> There is support for building, starting, indexing/querying and stopping Solr 
> in some of these frameworks above. However, the benchmarks run are very 
> limited. Any of these can be a starting point, or a new framework can as well 
> be used. The motivation is to be able to cover every functionality of Solr 
> with a corresponding benchmark that is run every night.
> Proposing this as a GSoC 2017 project. I'm willing to mentor, and I'm sure 
> [~shalinmangar] and [~markrmil...@gmail.com] would help here.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to