Any good text Information Retrieval dataset could be a good candidate:
https://github.com/harpribot/awesome-information-retrieval#datasets
These datasets also come with benchmarks and sample queries.

On Fri, Sep 4, 2020 at 11:26 AM David Smiley <dsmi...@apache.org> wrote:

> It's tempting to accomplish two goals at once (tutorial & searchable ref
> guide) but I think the realities of making a *good* searchable ref guide
> may distract someone from learning as it tries to do both well.  A
> searchable ref guide could very well be its own project that we point
> learners to once they move beyond some of the very early basics.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Tue, Sep 1, 2020 at 1:23 PM Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
>
>> That Jeopardy set looks very dubious: content that was collected by
>> scraping and is available on various sharing sites (including Mega!). I
>> would not feel comfortable working with that in our context.
>>
>> There are other dataset sources. I like the ones that the Data is Plural
>> newsletter collects: https://tinyletter.com/data-is-plural (full list
>> at:
>> https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0
>> ). Again, copyright is important, and I think having a local copy is
>> important too, at least for tutorial purposes.
>>
>> But I wish we could figure out a way to include the RefGuide. It is
>> just so much more of a triple-bottom-line solution than some random other
>> dataset. We could do a graph of cross-references in the guide, figure
>> out how to extract java path references, etc.
>>
>> Anyway, it is not something that is super-urgent. I don't even know
>> whether our new build processes can be augmented to do this. I guess
>> it is a bit similar to how we run tests.
>>
>> I just wanted to get a strong yay/nay on the idea. So far it feels
>> like I got one strong yay, one caution and one soft nay.
>>
>> Regards,
>>    Alex.
>>
>>
>>
>> On Tue, 1 Sep 2020 at 12:28, Jan Høydahl <jan....@cominvent.com> wrote:
>> >
>> > What about 200,000 Jeopardy questions in JSON format?
>> >
>> > https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/
>> > I downloaded the file in a few seconds, and it also has some structured
>> > content, e.g.:
>> >
>> >   {
>> >     "category": "NOVELS",
>> >     "air_date": "2005-01-27",
>> >     "question": "'Even the epilogue is lengthy in this 1869 Tolstoy
>> epic; it comes out in 2 parts &, in our copy, is 105 pages long'",
>> >     "value": "$400",
>> >     "answer": "War and Peace",
>> >     "round": "Jeopardy!",
>> >     "show_number": "4699"
>> >   },
>> >   {
>> >     "category": "BRIGHT IDEAS",
>> >     "air_date": "2005-01-27",
>> >     "question": "'In 1948 scientists at Bristol-Meyers \"buffered\"
>> this medicine for the first time'",
>> >     "value": "$400",
>> >     "answer": "aspirin",
>> >     "round": "Jeopardy!",
>> >     "show_number": "4699"
>> >   },
>> >
>> > Lots of docs. Enough free-text to learn some analysis, enough metadata
>> for some meaningful facets / filters…
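For a concrete feel, here is a minimal SolrJ sketch of indexing one of those records; the collection name "jeopardy" and the field names are illustrative assumptions, not part of any shipped example.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class JeopardyIndexer {
  public static void main(String[] args) throws Exception {
    // Assumes a local Solr on the default port with a "jeopardy" collection
    // whose schema (or schemaless mode) accepts these illustrative field names.
    try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "4699-novels-400");            // synthetic unique id
      doc.addField("category", "NOVELS");               // natural facet field
      doc.addField("air_date", "2005-01-27T00:00:00Z"); // date for range filters
      doc.addField("question", "Even the epilogue is lengthy in this 1869 Tolstoy epic...");
      doc.addField("answer", "War and Peace");
      doc.addField("round", "Jeopardy!");
      client.add("jeopardy", doc);
      client.commit("jeopardy");
    }
  }
}

The "category" and "air_date" fields are what would make faceting and filtering interesting; "question" carries the free text for analysis.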
>> >
>> > As long as we only provide a URL and do not redistribute the content,
>> > licensing is less of a concern.
>> >
>> > Jan
>> >
>> > On 1 Sep 2020 at 15:59, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>> >
>> > I've thought of providing instructions. But for good indexing, we
>> > should use the adoc format as the source, rather than html (as Cassandra's
>> > presentation showed), which means the user would need the asciidoctor
>> > library as a build dependency. There is also the question of how to get
>> > the content: either git clone or download the whole source, unpack it,
>> > and figure out the directory locations. It feels messy. At that point, it
>> > may as well be an external package or even an independent project, and
>> > it would therefore lose value as shipped tutorial material.
>> >
>> > We could also discuss shipping the Solr Reference Guide with Solr itself,
>> > now that the release cycles align, but that would not help my
>> > sub-project much, again because of the adoc vs. html formats.
>> >
>> > In terms of other datasets:
>> > *) I could just stay with limited full-text in the one I am thinking
>> > of. The bulk download mode allows for fields such as Occupation,
>> > Company and Vehicle model which are 2-7 words long. That's about the
>> > same length as current examples we ship. It does not allow for a
>> > meaningful discussion about longer-text issues such as
>> > length-normalization, but we don't have those now anyway.
>> > *) I could use a public domain book and break it into parts. From
>> > somewhere like https://standardebooks.org/ . But there is a question
>> > about licensing and also whether we will be able to show interesting
>> > effects with that.
>> > *) I was also told that there is Wikipedia, but again, would we just
>> > include a couple of articles at random? What's the license?
>> > *) It is possible to index Stack Overflow questions, either from the
>> > feed (DIH was doing that) or as a download. I think the license was
>> > compatible.
>> > *) I could augment the dataset with some mix of the above, like a
>> > "favourite quote" field with random book sentences. This feels like
>> > fun, but possibly a whole separate project of its own.
>> >
>> > Anyway, I am open to further thoughts. It is quite likely I missed
>> something.
>> >
>> > Regards,
>> >   Alex.
>> >
>> >
>> > On Tue, 1 Sep 2020 at 03:10, Jan Høydahl <jan....@cominvent.com> wrote:
>> >
>> >
>> > I’d rather ship a tutorial and tooling that explain how to index the
>> > ref guide than ship a binary index.
>> > What other full-text datasets have you considered as candidates for
>> getting-started examples?
>> >
>> > Jan
>> >
>> > On 1 Sep 2020 at 05:53, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>> >
>> > I did not say it was trivial, but I also did not really mention the
>> > previous research:
>> >
>> >
>> https://github.com/arafalov/solr-refguide-indexing/blob/master/src/com/solrstart/refguide/Indexer.java
>> >
>> > It uses the official AsciidoctorJ library directly. I am not sure whether that is
>> > just the JRuby version of Asciidoctor we currently use to build, but this should
>> > only affect the development process, not the final built package.
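For reference, a rough sketch of what that AsciidoctorJ-based extraction could look like; it assumes AsciidoctorJ is on the classpath, and the Solr indexing side is left out.

import java.io.File;
import java.util.Collections;
import org.asciidoctor.Asciidoctor;
import org.asciidoctor.ast.Document;
import org.asciidoctor.ast.StructuralNode;

public class AdocExtractor {
  public static void main(String[] args) {
    // Load one ref-guide .adoc page and print its title plus the title and
    // content of each top-level block. Each section could become its own
    // Solr document, which is what would enable subheading-level search.
    Asciidoctor asciidoctor = Asciidoctor.Factory.create();
    Document page = asciidoctor.loadFile(new File(args[0]), Collections.emptyMap());
    System.out.println("page title: " + page.getDoctitle());
    for (StructuralNode block : page.getBlocks()) {
      System.out.println(block.getTitle() + " -> " + block.getContent());
    }
  }
}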
>> >
>> > I think I am more trying to figure out what people think about shipping
>> > an actual core with the distribution. That is something I haven't seen done
>> > before, and it may have issues I have not thought of.
>> >
>> > Regards,
>> >    Alex
>> >
>> > On Mon., Aug. 31, 2020, 10:11 p.m. Gus Heck, <gus.h...@gmail.com>
>> wrote:
>> >
>> >
>> > Some background to consider before committing to that... it might not
>> be as trivial as you think. (I've often thought it ironic that we don't
>> have real search for our ref guide... )
>> >
>> > https://www.youtube.com/watch?v=DixlnxAk08s
>> >
>> > -Gus
>> >
>> > On Mon, Aug 31, 2020 at 2:06 PM Ishan Chattopadhyaya <
>> ichattopadhy...@gmail.com> wrote:
>> >
>> >
>> > I love the idea of using the ref guide itself as an example dataset.
>> > That way, we won't need to ship anything separately. Python's Beautiful
>> > Soup can extract text from the HTML pages. I'm sure there may be such
>> > things in Java too (can Tika do this?).
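Tika can do that. A minimal sketch, assuming tika-core plus the tika-parsers module (which supplies the HTML parser) are on the classpath:

import java.io.File;
import org.apache.tika.Tika;

public class HtmlToText {
  public static void main(String[] args) throws Exception {
    // Strip a ref-guide HTML page down to plain text for indexing.
    Tika tika = new Tika();
    String text = tika.parseToString(new File(args[0]));
    System.out.println(text);
  }
}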
>> >
>> > On Mon, 31 Aug, 2020, 11:18 pm Alexandre Rafalovitch, <
>> arafa...@gmail.com> wrote:
>> >
>> >
>> > Hi,
>> > I need a sanity check.
>> >
>> > I am in the planning stages for the new example datasets to ship with
>> > Solr 9. The one I am looking at is great for structured information,
>> > but is quite light on full-text content. So, I am thinking of how
>> > important that is and what other sources could be used.
>> >
>> > One - only slightly - crazy idea is to use the Solr Reference Guide itself
>> > as a document source. I am not saying we need to include the guide
>> > with the Solr distribution, but:
>> > 1) I could include a couple of sample pages
>> > 2) I could index the whole guide (with custom Java code) during the
>> > final build and we could ship the full index (with stored=false) with
>> > Solr, which then basically becomes a local search for the remote guide
>> > (with absolute URLs).
>> >
>> > Either way would allow us to also explore what a good search
>> > configuration could look like for the Ref Guide, for when we are
>> > actually ready to move beyond its current "headings-only" JavaScript
>> > search. Actually, done right, the same or a similar tool could also feed
>> > subheadings into the JavaScript search.
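As an illustration of that "local search for the remote guide" idea from the client side, a small SolrJ sketch; the core name "refguide" and the "title"/"url" field names are hypothetical.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class RefGuideSearch {
  public static void main(String[] args) throws Exception {
    try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      SolrQuery q = new SolrQuery(args.length > 0 ? args[0] : "faceting");
      q.setFields("title", "url"); // "url" would hold the absolute link to the hosted guide
      for (SolrDocument d : client.query("refguide", q).getResults()) {
        System.out.println(d.getFieldValue("title") + " -> " + d.getFieldValue("url"));
      }
    }
  }
}

Since the proposal is to ship the index with stored=false, presumably only small fields like a title and the absolute URL would need to be stored for display.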
>> >
>> > Like I said, sanity check?
>> >
>> > Regards,
>> >   Alex.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>> >
>> >
>> >
>> > --
>> > http://www.needhamsoftware.com (work)
>> > http://www.the111shift.com (play)
>> >
>> >
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: dev-h...@lucene.apache.org
>> >
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>>
