OK, a lot of dialog while I was gone for two days! I read the whole thread, but I'm a newbie to Solr, so some of the dialog was Greek to me. I understand the words, of course, but applying it so I know exactly what to do without screwing something else up is the problem. After all, that is how I got into the mess in the first place. I'm glad I have good help to untangle the knots I've made!

I'd like to start over (option 1 below), but does this mean deleting all my config and reinstalling Solr? Maybe that is not a bad idea, but I will at least save off my data-config.xml, as that is the one thing that is probably working right. However, I did quite a bit of editing that I would have to do again. Please advise...

To be fair, I must answer Erick's question of how I created the data index in the first place, because this might be relevant...

The bulk of the data is read from 9000+ text files, where each file was manually typed. Before inserting into the database, I do a little bit of processing of the text using "sed" to delete the top few and bottom few lines, and to substitute each single-quote character with a pair of single-quotes (so PostgreSQL doesn't choke). Line-feed characters are preserved as ASCII 10 (hex 0A), but there shouldn't be (and I am not aware of) any characters aside from what is on the keyboard.
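For readers who want to reproduce the preprocessing described above without sed, here is a minimal sketch in Python. The function name and the number of lines trimmed from the top and bottom are assumptions for illustration; the actual counts used in the sed script are not stated here.

```python
def prepare_entry(raw_text, skip_top=3, skip_bottom=3):
    """Trim header/footer lines and double single quotes for a SQL literal.

    skip_top and skip_bottom are hypothetical defaults; adjust them to
    match however many lines the real files carry above and below the body.
    """
    lines = raw_text.split("\n")
    end = len(lines) - skip_bottom
    # Keep only the body; an entry shorter than the trim yields nothing.
    body = lines[skip_top:end] if end > skip_top else []
    # Line feeds (ASCII 10) are preserved; each ' becomes '' for PostgreSQL.
    return "\n".join(body).replace("'", "''")
```

If the insert were done through a database driver with bound parameters instead of a shell-interpolated string, the quote doubling would not be needed at all.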

Next, I insert it with this command:
psql -U awips -d OHRFC -c "INSERT INTO EventLogText VALUES('$postDate', '$user', '$postDate', '$entryText', '$postCatVal');"

In case you are wondering about my table, it is defined in this way:
CREATE TABLE eventlogtext (
  posttime timestamp without time zone NOT NULL, -- Timestamp of this entry's original posting
  username character varying(8), -- username (logname) of the original poster
  lastmodtime timestamp without time zone, -- Last time record was altered
  logtext text, -- text of the log entry
  category integer, -- bit-wise category value
  CONSTRAINT eventlogtext_pkey PRIMARY KEY (posttime)
)

To do the indexing, I merely use /dataimport?full-import; it knows what to do from my data-config.xml, which is here:

<dataConfig>
<dataSource driver="org.postgresql.Driver" url="jdbc:postgresql://dx1f/OHRFC" user="awips" />
    <document>
        <entity name="eventlogtext"
                query="SELECT posttime AS id, username, logtext, category FROM eventlogtext;"
                deltaQuery="SELECT posttime AS id FROM eventlogtext WHERE lastmodtime > '${dataimporter.last_index_time}';">
            <entity name="categorytypes"
                    query="SELECT catname FROM categorytypes WHERE catid='${eventlogtext.category}';">
            </entity>
        </entity>
    </document>
</dataConfig>
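As a side note, the DataImportHandler is normally driven through a `command` request parameter. A small sketch of how the full-import URL might be assembled follows. The base URL and the core name `eventlog` are assumptions, since neither is given in this thread.

```python
from urllib.parse import urlencode


def dataimport_url(base_url, core, command="full-import", **params):
    """Build a DataImportHandler request URL.

    base_url and core are placeholders; extra keyword arguments such as
    clean="true" are passed through as additional request parameters.
    """
    query = urlencode({"command": command, **params})
    return f"{base_url.rstrip('/')}/{core}/dataimport?{query}"


# Hypothetical host and core name, for illustration only.
print(dataimport_url("http://localhost:8983/solr", "eventlog", clean="true"))
```

Opening the resulting URL in a browser or with curl would start the import; `command=status` on the same handler reports progress.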

Hope this helps!

Thanks,
Mark

On 9/24/2015 10:57 AM, Erick Erickson wrote:
Geraint:

Good Catch! I totally missed that. So all of our focus on schema.xml has
been... totally irrelevant. Now that you pointed that out, there's also the
addition: add-unknown-fields-to-the-schema, which indicates you started
this up in "schemaless" mode.

In short, Solr is trying to guess what your field types should be and
guessing wrong (again and again and again). This is the classic weakness of
schemaless. It's great for indexing stuff fast, but if it guesses wrong
you're stuck.


So to the original problem: I'd start over and either
1> use the regular setup, not schemaless
or
2> use the _managed_ schema API to explicitly add fields and fieldTypes to
the managed schema
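For option 2, a rough sketch of what an explicit field definition through the Schema API could look like. The core name `eventlog`, the host, and the choice of `text_general` as the field type are assumptions; the request is only constructed here, not sent.

```python
import json
from urllib.request import Request


def add_field_request(base_url, core, name, field_type, stored=True, indexed=True):
    """Build (but do not send) a Schema API request that adds one field.

    base_url and core are placeholders. field_type must name a fieldType
    that already exists in the managed schema.
    """
    payload = {
        "add-field": {
            "name": name,
            "type": field_type,
            "stored": stored,
            "indexed": indexed,
        }
    }
    return Request(
        url=f"{base_url.rstrip('/')}/{core}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values: a tokenized text type for the logtext column.
req = add_field_request("http://localhost:8983/solr", "eventlog",
                        "logtext", "text_general")
print(req.full_url, req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would submit it. If schemaless mode has already auto-created the field with a wrong type, adding it again would presumably be rejected, which is one more reason to start from a clean core.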

Best,
Erick

On Thu, Sep 24, 2015 at 2:02 AM, Duck Geraint (ext) GBJH <
geraint.d...@syngenta.com> wrote:

Okay, so maybe I'm missing something here (I'm still relatively new to
Solr myself), but am I right in thinking the following is still in your
solrconfig.xml file:

   <schemaFactory class="ManagedIndexSchemaFactory">
     <bool name="mutable">true</bool>
     <str name="managedSchemaResourceName">managed-schema</str>
   </schemaFactory>

If so, wouldn't using a managed schema make several of your field
definitions inside the schema.xml file semi-redundant?

Regards,
Geraint


Geraint Duck
Data Scientist
Toxicology and Health Sciences
Syngenta UK
Email: geraint.d...@syngenta.com


-----Original Message-----
From: Alessandro Benedetti [mailto:benedetti.ale...@gmail.com]
Sent: 24 September 2015 09:23
To: solr-user@lucene.apache.org
Subject: Re: query parsing

I would focus on this :

"

5> now kick off the DIH job and look again.

Now it shows a histogram, but most of the "terms" are long -- the full
texts of (the table.column) eventlogtext.logtext, including the whitespace
(with %0A used for newline characters)...  So, it appears it is not being
tokenized properly, correct?"
Can you open the schema XML from your Solr UI and show us the snippets
for the field that seems not to be tokenised?
Can you also show us the related schema browser page (even a screenshot
is fine)?
Could it be an encoding problem?
Following Erick's details about the analysis page, what are your results?
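To make the symptom concrete, here is a toy illustration (plain Python, not Solr code) of the difference between an untokenized string-type field, which indexes the whole value as one term, and a field whose analyzer splits on whitespace. The function names and the sample text are made up.

```python
def terms_as_string_field(value):
    """An untokenized field yields the entire value as a single term."""
    return [value]


def terms_as_whitespace_tokenized(value):
    """A whitespace-style tokenizer yields one term per word."""
    return value.split()


sample = "River rising\nat gauge 12"
print(terms_as_string_field(sample))          # one long term, newline included
print(terms_as_whitespace_tokenized(sample))  # five short terms
```

Long "terms" containing newlines in the schema browser match the first behaviour, which suggests the field is being treated as a string type rather than a text type.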

Cheers

2015-09-24 8:04 GMT+01:00 Upayavira <u...@odoko.co.uk>:

Typically, the index dir is inside the data dir. Delete the index dir
and you should be good. If there is a tlog next to it, you might want
to delete that also.
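The cleanup described above can be sketched as follows, assuming Solr is stopped first and that the data directory path is known. The function name and the path in the usage line are placeholders.

```python
import shutil
from pathlib import Path


def clear_index(data_dir):
    """Remove the index and tlog subdirectories of a core's data dir.

    Returns the names of the subdirectories that were actually removed.
    Run this only while Solr is stopped.
    """
    removed = []
    for name in ("index", "tlog"):
        target = Path(data_dir) / name
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(name)
    return removed


# Placeholder path; substitute the core's real data directory.
# print(clear_index("/path/to/core/data"))
```

After restarting Solr, the schema browser should report no term info until the next import, which is a quick way to confirm the right directory was cleared.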

If you don't have a data dir, I wonder whether you set the data dir
when creating your core or collection. Typically the instance dir and
data dir aren't needed.

Upayavira

On Wed, Sep 23, 2015, at 10:46 PM, Erick Erickson wrote:
OK, this is bizarre. You'd have had to set up SolrCloud by
specifying the -zkRun command when you start Solr or the -zkHost;
highly unlikely. On the admin page there would be a "cloud" link on
the left side, I really doubt one's there.

You should have a data directory, it should be the parent of the
index and tlog directories. As a sanity check, try looking at the
analysis page. Type a bunch of words in the left hand side indexing
box and uncheck the verbose box. As you can tell I'm grasping at
straws. I'm still puzzled why you don't have a "data" directory
here, but that shouldn't really matter. How did you create this
index? I don't mean the data import handler; rather, how did you
create the core that you're indexing to?

Best,
Erick

On Wed, Sep 23, 2015 at 10:16 AM, Mark Fenbers
<mark.fenb...@noaa.gov>
wrote:

On 9/23/2015 12:30 PM, Erick Erickson wrote:

Then my next guess is you're not pointing at the index you think you
are when you 'rm -rf data'

Just ignore the Elall field for now I should think, although get rid
of it if you don't think you need it.

DIH should be irrelevant here.

So let's back up.
1> go ahead and "rm -fr data" (with Solr stopped).

I have no "data" dir.  Did you mean "index" dir?  I removed 3
index directories (2 for spelling):
cd /localapps/dev/eventLog; rm -rfv index solr/spFile solr/spIndex

2> start Solr
3> do NOT re-index.
4> look at your index via the schema-browser. Of course there should
be nothing there!

Correct!  It said "there is no term info :("

5> now kick off the DIH job and look again.

Now it shows a histogram, but most of the "terms" are long -- the
full texts of (the table.column) eventlogtext.logtext, including the
whitespace (with %0A used for newline characters)...  So, it appears
it is not being tokenized properly, correct?

Your logtext field should have only single tokens. The fact that you
have some very long tokens (presumably with whitespace) indicates
that you aren't really blowing the index away between indexing.

Well, I did this time for sure.  I verified that initially,
because it showed there was no term info until I DIH'd again.

Are you perhaps in Solr Cloud with more than one replica?

Not that I know of, but being new to Solr, there could be things
going on that I'm not aware of.  How can I tell?  I certainly didn't
set anything up for SolrCloud deliberately.

In that case you might be getting the index replicated on startup,
assuming you didn't blow away all replicas. If you are in SolrCloud,
I'd just delete the collection and start over, after ensuring that
you'd pushed the configset up to Zookeeper.

BTW, I always look at the schema.xml file from the Solr admin window
just as a sanity check in these situations.

Good idea!  But the one shown in the browser is identical to the one
I've been editing!  So that's not an issue.




--
--------------------------

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England