Any good text Information Retrieval dataset may be a good candidate: https://github.com/harpribot/awesome-information-retrieval#datasets . These datasets also come with benchmarks and sample queries.
On Fri, Sep 4, 2020 at 11:26 AM David Smiley <dsmi...@apache.org> wrote:

It's tempting to accomplish two goals at once (tutorial & searchable ref guide), but I think the realities of making a *good* searchable ref guide may distract someone from learning as it tries to do both well. A searchable ref guide could very well be its own project that we point people to once they move beyond some of the very early basics.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Tue, Sep 1, 2020 at 1:23 PM Alexandre Rafalovitch <arafa...@gmail.com> wrote:

That Jeopardy set reads as very dubious: content that was collected by scraping and is available on various sharing sites (including Mega!). I would not feel comfortable working with that in our context.

There are other dataset sources. I like the ones that the Data Is Plural newsletter collects: https://tinyletter.com/data-is-plural (full list at: https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0 ). Again, copyright is important, and I think having a local copy is important too, at least for tutorial purposes.

But I wish we could figure out a way to include the Ref Guide. It is just so much more of a triple-bottom-line solution than just some random other dataset. We could build a graph of cross-references in the guide, figure out how to extract Java path references, etc.

Anyway, it is not something that is super urgent. I don't even know whether our new build processes can be augmented to do this. I guess it is a bit similar to how we run tests.

I just wanted to get a strong yay/nay on the idea. So far it feels like I got one strong yay, one caution and one soft nay.

Regards,
   Alex.

On Tue, 1 Sep 2020 at 12:28, Jan Høydahl <jan....@cominvent.com> wrote:

What about 200.000 Jeopardy questions in JSON format?
https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/
I downloaded the file in a few seconds, and it also has some structured content, e.g.

  {
    "category": "NOVELS",
    "air_date": "2005-01-27",
    "question": "'Even the epilogue is lengthy in this 1869 Tolstoy epic; it comes out in 2 parts &, in our copy, is 105 pages long'",
    "value": "$400",
    "answer": "War and Peace",
    "round": "Jeopardy!",
    "show_number": "4699"
  },
  {
    "category": "BRIGHT IDEAS",
    "air_date": "2005-01-27",
    "question": "'In 1948 scientists at Bristol-Meyers \"buffered\" this medicine for the first time'",
    "value": "$400",
    "answer": "aspirin",
    "round": "Jeopardy!",
    "show_number": "4699"
  },

Lots of docs. Enough free text to learn some analysis, enough metadata for some meaningful facets / filters…

As long as we only provide a URL and do not redistribute the content, licensing is less of a concern.

Jan

On 1 Sep 2020, at 15:59, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

I've thought of providing instructions. But for good indexing, we should use the adoc format as the source rather than the HTML (as Cassandra's presentation showed), which means the user needs build dependencies to get the Asciidoctor library, plus a way to get the content: either a git clone, or downloading the whole source, unpacking it and figuring out the directory locations. It feels messy.
Then it may as well be an external package, or even an external, independent project, and it would therefore lose value as shipped tutorial material.

We could also discuss actually shipping the Solr Reference Guide with Solr now that the release cycles align, but that would not help my sub-project much, again because of the adoc vs. HTML formats.

In terms of other datasets:
*) I could just stay with the limited full text in the one I am thinking of. The bulk download mode allows for fields such as Occupation, Company and Vehicle model, which are 2-7 words long. That's about the same length as the current examples we ship. It does not allow for a meaningful discussion of longer-text issues such as length normalization, but we don't have those now anyway.
*) I could use a public-domain book and break it into parts, from somewhere like https://standardebooks.org/ . But there is a question about licensing, and also whether we would be able to show interesting effects with that.
*) I was also told that there is Wikipedia, but again, would we just include a couple of articles at random? What's the license?
*) It is possible to index Stack Overflow questions, either from the feed (DIH was doing that) or as a download. I think the license was compatible.
*) I could augment the dataset with some mix of the above, like a "favourite quote" field with random book sentences. This feels like fun, but it is possibly a whole separate project of its own.

Anyway, I am open to further thoughts. It is quite likely I missed something.

Regards,
   Alex.

On Tue, 1 Sep 2020 at 03:10, Jan Høydahl <jan....@cominvent.com> wrote:

I'd rather ship a tutorial and tooling that explain how to index the ref guide than ship a binary index.
What other full-text datasets have you considered as candidates for getting-started examples?

Jan

On 1 Sep 2020, at 05:53, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

I did not say it was trivial, but I also did not quite mention the previous research:

https://github.com/arafalov/solr-refguide-indexing/blob/master/src/com/solrstart/refguide/Indexer.java

It uses the official AsciidoctorJ library directly. I am not sure whether that's just the JRuby version of Asciidoctor we currently use to build, but this should only affect the development process, not the final built package.

I think I am more trying to figure out what people think about shipping an actual core with the distribution. That is something I haven't seen done before, and it may have issues I did not think of.

Regards,
   Alex

On Mon, Aug 31, 2020, 10:11 p.m. Gus Heck <gus.h...@gmail.com> wrote:

Some background to consider before committing to that... it might not be as trivial as you think. (I've often thought it ironic that we don't have real search for our ref guide...)

https://www.youtube.com/watch?v=DixlnxAk08s

-Gus

On Mon, Aug 31, 2020 at 2:06 PM Ishan Chattopadhyaya <ichattopadhy...@gmail.com> wrote:

I love the idea of making the ref guide itself an example dataset. That way, we won't need to ship anything separately. Python's Beautiful Soup can extract text from the HTML pages. I'm sure there may be such things in Java too (can Tika do this?).
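For what it's worth, Tika can do this: it strips the HTML markup and returns the page's plain text. A minimal sketch in Java, assuming tika-core plus the standard parser module are on the classpath; the class name and the idea of passing the page path as a program argument are illustrative only, not part of any existing build:

    import java.io.File;
    import org.apache.tika.Tika;

    public class RefGuideTextExtractor {
        public static void main(String[] args) throws Exception {
            // The Tika facade auto-detects the content type (HTML here),
            // parses the page and returns its plain text, ready for indexing.
            Tika tika = new Tika();
            String text = tika.parseToString(new File(args[0]));
            System.out.println(text);
        }
    }

jsoup would be another Java option if finer control over which HTML elements to keep (headings, code blocks, etc.) turns out to matter.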
On Mon, 31 Aug 2020, 11:18 pm Alexandre Rafalovitch <arafa...@gmail.com> wrote:

Hi,
I need a sanity check.

I am in the planning stages for the new example datasets to ship with Solr 9. The one I am looking at is great for structured information, but it is quite light on full-text content. So I am thinking about how important that is and what other sources could be used.

One - only slightly - crazy idea is to use the Solr Reference Guide itself as a document source. I am not saying we need to include the guide with the Solr distribution, but:
1) I could include a couple of sample pages
2) I could index the whole guide (with custom Java code) during the final build, and we could ship the full index (with stored=false) with Solr, which then basically becomes a local search for the remote guide (with absolute URLs).

Either way would also let us explore what a good search configuration could look like for the Ref Guide, for when we are actually ready to move beyond its current "headings-only" JavaScript search. Actually, done right, the same or a similar tool could also feed subheadings into that JavaScript search.

Like I said, sanity check?

Regards,
   Alex.
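For concreteness, here is a minimal sketch of what the build-time indexing in option 2 above might look like with SolrJ. Everything specific in it is an assumption for illustration only: the core name ("refguide"), the field names, the example page URL, and the idea that the body field is marked stored=false in the schema so the shipped index stays small while search hits link back to the published guide.

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class RefGuideIndexer {
        public static void main(String[] args) throws Exception {
            // Hypothetical build-time step: index pages into a local core,
            // then ship the resulting index directory with the distribution.
            try (SolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/refguide").build()) {
                SolrInputDocument doc = new SolrInputDocument();
                // The absolute URL of the published page serves as the unique key,
                // so search results can point straight at the remote guide.
                doc.addField("id", "https://lucene.apache.org/solr/guide/about-this-guide.html");
                doc.addField("title", "About This Guide");
                // "body" would be indexed but stored=false in the schema:
                // searchable, without bloating the shipped index with full text.
                doc.addField("body", "Plain text extracted from the page, e.g. via Tika as sketched earlier.");
                solr.add(doc);
                solr.commit();
            }
        }
    }

A real implementation would of course walk every page of the guide and extract the title, headings and body text per page rather than hard-coding a single document.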