I've thought of providing instructions. But for good indexing, we
should use the adoc format as the source rather than the HTML (as
Cassandra's presentation showed), which means the user would need
extra build dependencies, namely the Asciidoctor library. They would
also need a way to get the content: either a git clone, or downloading
the whole source, unpacking it, and figuring out the directory
locations. It feels messy. At that point, it may as well be an
external package or even an independent external project, and it would
therefore lose its value as shipped tutorial material.
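
Just to make the dependency point concrete, here is roughly what the
user-run extraction step would involve. This is only a sketch, assuming
AsciidoctorJ's convertFile API; the class name and the crude tag
stripping are mine, not anything we would actually ship:

    import org.asciidoctor.Asciidoctor;
    import org.asciidoctor.Options;
    import org.asciidoctor.OptionsBuilder;
    import org.asciidoctor.SafeMode;

    import java.io.File;

    public class AdocTextSketch {
        public static void main(String[] args) {
            // The extra dependency the user has to pull in: AsciidoctorJ.
            Asciidoctor asciidoctor = Asciidoctor.Factory.create();
            // Convert one .adoc page to HTML in memory, no output file.
            Options options = OptionsBuilder.options()
                    .safe(SafeMode.SAFE)
                    .toFile(false)
                    .get();
            String html = asciidoctor.convertFile(new File(args[0]), options);
            // Crude tag stripping, just to show the shape of the text
            // that would end up in the index.
            String text = html.replaceAll("<[^>]+>", " ")
                    .replaceAll("\\s+", " ")
                    .trim();
            System.out.println(text);
            asciidoctor.shutdown();
        }
    }

Not much code, but it drags the whole AsciidoctorJ/JRuby stack onto the
user's machine, which is the part that feels heavy for tutorial material.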

We could also discuss actually shipping the Solr Reference Guide with
Solr now that the release cycles align, but that would not help my
sub-project much, again because of the adoc vs. HTML formats.

In terms of other datasets:
*) I could just stick with the limited full-text in the dataset I am
thinking of. The bulk download mode allows for fields such as
Occupation, Company and Vehicle model, which are 2-7 words long.
That's about the same length as the current examples we ship. It does
not allow for a meaningful discussion of longer-text issues such as
length normalization, but we don't have those now anyway.
*) I could use a public-domain book from somewhere like
https://standardebooks.org/ and break it into parts. But there are
questions about licensing and about whether we would be able to show
interesting effects with it.
*) I was also told that there is Wikipedia, but again, would we just
include a couple of articles at random? And what is the license?
*) It is possible to index Stack Overflow questions, either from the
feed (DIH was doing that) or as a download. I think the license was
compatible. (There is a rough sketch of the feed route right after
this list.)
*) I could augment the dataset with some mix of the above, like a
"favourite quote" field with random book sentences. This feels like
fun, but it is possibly a whole separate project of its own.
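
For the Stack Overflow option above, here is a very rough sketch of
what the feed route could look like with SolrJ. The feed URL, core
name and field names are placeholders I made up for illustration, not
a proposal:

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    import javax.xml.parsers.DocumentBuilderFactory;
    import java.net.URL;

    public class StackOverflowFeedSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder: the Atom feed of recent questions for one tag.
            String feedUrl = "https://stackoverflow.com/feeds/tag/solr";
            Document feed = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new URL(feedUrl).openStream());

            // Placeholder core name; any full-text-friendly schema would do.
            try (HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/questions").build()) {
                NodeList entries = feed.getElementsByTagName("entry");
                for (int i = 0; i < entries.getLength(); i++) {
                    Element entry = (Element) entries.item(i);
                    SolrInputDocument doc = new SolrInputDocument();
                    // Field names are made up for this sketch.
                    doc.addField("id", entry.getElementsByTagName("id")
                            .item(0).getTextContent());
                    doc.addField("title_txt", entry.getElementsByTagName("title")
                            .item(0).getTextContent());
                    // The Atom summary is HTML; good enough for a sketch.
                    doc.addField("summary_txt", entry.getElementsByTagName("summary")
                            .item(0).getTextContent());
                    solr.add(doc);
                }
                solr.commit();
            }
        }
    }

Whether it comes from the live feed or from a data dump, the indexing
side stays about this simple; it is the licensing and content curation
that need the real thought.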

Anyway, I am open to further thoughts. It is quite likely I missed something.

Regards,
   Alex.


On Tue, 1 Sep 2020 at 03:10, Jan Høydahl <jan....@cominvent.com> wrote:
>
> I’d rather ship a tutorial and tooling that explains how to index the
> ref guide than ship a binary index.
> What other full-text datasets have you considered as candidates for 
> getting-started examples?
>
> Jan
>
> 1. sep. 2020 kl. 05:53 skrev Alexandre Rafalovitch <arafa...@gmail.com>:
>
> I did not say it was trivial, but I had also not mentioned my previous
> research:
>
> https://github.com/arafalov/solr-refguide-indexing/blob/master/src/com/solrstart/refguide/Indexer.java
>
> It uses the official AsciidoctorJ library directly. I am not sure whether that
> is just the JRuby version of Asciidoctor we currently use to build, but this
> should only affect the development process, not the final built package.
>
> I think I am more trying to figure out what people think about shipping an
> actual core with the distribution. That is something I haven't seen done
> before, and it may have issues I have not thought of.
>
> Regards,
>     Alex
>
> On Mon., Aug. 31, 2020, 10:11 p.m. Gus Heck, <gus.h...@gmail.com> wrote:
>>
>> Some background to consider before committing to that... it might not be as 
>> trivial as you think. (I've often thought it ironic that we don't have real 
>> search for our ref guide... )
>>
>> https://www.youtube.com/watch?v=DixlnxAk08s
>>
>> -Gus
>>
>> On Mon, Aug 31, 2020 at 2:06 PM Ishan Chattopadhyaya 
>> <ichattopadhy...@gmail.com> wrote:
>>>
>>> I love the idea of making the ref guide itself an example dataset. That
>>> way, we won't need to ship anything separately. Python's Beautiful Soup can
>>> extract text from the HTML pages. I'm sure there may be such things in Java
>>> too (can Tika do this?).
>>>
>>> On Mon, 31 Aug, 2020, 11:18 pm Alexandre Rafalovitch, <arafa...@gmail.com> 
>>> wrote:
>>>>
>>>> Hi,
>>>> I need a sanity check.
>>>>
>>>> I am in the planning stages for the new example datasets to ship with
>>>> Solr 9. The one I am looking at is great for structured information,
>>>> but is quite light on full-text content. So, I am thinking of how
>>>> important that is and what other sources could be used.
>>>>
>>>> One - only slightly - crazy idea is to use the Solr Reference Guide
>>>> itself as a document source. I am not saying we need to include the
>>>> guide with the Solr distribution, but:
>>>> 1) I could include a couple of sample pages
>>>> 2) I could index the whole guide (with custom Java code) during the
>>>> final build, and we could ship the full index (with stored=false) with
>>>> Solr, which then basically becomes a local search for the remote guide
>>>> (with absolute URLs).
>>>>
>>>> Either way would also allow us to explore what a good search
>>>> configuration could look like for the Ref Guide, for when we are
>>>> actually ready to move beyond its current "headings-only" JavaScript
>>>> search. In fact, done right, the same or a similar tool could also feed
>>>> subheadings into the JavaScript search.
>>>>
>>>> Like I said, sanity check?
>>>>
>>>> Regards,
>>>>    Alex.
>>>>
>>
>>
>> --
>> http://www.needhamsoftware.com (work)
>> http://www.the111shift.com (play)
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
