Hi,

This is an interesting discussion. Would you mind moving it to the JIRA or
its own DISCUSS thread?

Thanks!

-D...


On Thu, Nov 3, 2016 at 7:26 AM, zeo...@gmail.com <zeo...@gmail.com> wrote:

> To clarify, it only needs to truncate fields longer than 32766 that need
> full/exact string match searches (analyzed fields generally would not hit
> this limitation, but I guess in theory they could).  However, that's
> probably every field that can exceed 32766, because I'm assuming those
> will all be strings.
>
> I also think using the profiler to monitor the truncation action could be a
> useful default.
>
> Jon
>
> On Wed, Nov 2, 2016, 21:08 zeo...@gmail.com <zeo...@gmail.com> wrote:
>
> > That would break searching on uri entirely unless you queried and knew to
> > truncate at 32766 because it's not analyzed.  I don't like pushing that
> > complication to the end user.
> >
> > I would suggest truncation in the indexingBolt (not using stellar because
> > you'd want this across the board) for all fields > 32766 (how do we make
> > sure this gets updated if the limitation changes in Lucene?) and adding
> > metadata key-value pairs (pre-trunc length, hash, truncated bool, etc.).
> > In the URI scenario I would also suggest doing a multifield mapping by
> > default because of the way that data is useful (not sure which analyzer
> > to use though - maybe write or find a good URI analyzer?).  Since
> > timestamp is a required field for all messages (I'm pretty sure?) I'm ok
> > with timestamp and field value used as the UID, but would prefer
> > something better.
> >
> > Jon
> >
> > On Wed, Nov 2, 2016, 20:33 James Sirota <jsir...@apache.org> wrote:
> >
> > Jon,
> >
> > For METRON-517 would it suffice to have a stellar statement to take a URI
> > string and truncate it to length of 32766 in the ES writer?  But still
> > write the actual string to HDFS? You can then search against ES on the
> > truncated portion, but retrieve the actual timestamp from HDFS.  It's
> > easy to do because you know the timestamp from the original message.  So
> > you know which logs in HDFS to search through to find the data.
> >
> > 02.11.2016, 14:12, "zeo...@gmail.com" <zeo...@gmail.com>:
> > > I personally would like to see the following things done before things
> > > leave BETA:
> > > (1) Address data integrity concerns (Specifically thinking of
> > > METRON-370, METRON-517)
> > > (2) Make cluster tuning easier and more consistent (METRON-485,
> > > METRON-470, and the "[DISCUSS] moving parsers back to flux" which I
> > > can't find a JIRA for).
> > >
> > > I would also want to see the upgrade path (as opposed to rebuild) be
> > > more thoroughly and regularly tested once things leave BETA. From my
> > > perspective I think the project is very close but not yet ready.
> > >
> > > Jon
> > >
> > > On Wed, Nov 2, 2016 at 4:44 PM Casey Stella <ceste...@gmail.com> wrote:
> > >
> > > Hello Everyone,
> > >
> > > Now that the discussion around the next release has started, it has
> > > been proposed, and I think it's a good time, to discuss what to name
> > > this next release. Previously, we adopted the BETA suffix. I think it
> > > might be time to drop it and call the next release 0.2.2.
> > >
> > > Thoughts?
> > >
> > > Best,
> > >
> > > Casey
> > >
> > > --
> > >
> > > Jon
> >
> > -------------------
> > Thank you,
> >
> > James Sirota
> > PPMC- Apache Metron (Incubating)
> > jsirota AT apache DOT org
> >
> > --
> >
> > Jon
> >
> --
>
> Jon
>
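
The truncation idea raised in the thread (cap any string field at 32766 and attach metadata such as the pre-truncation length, a hash, and a truncated flag) could be sketched roughly as follows. This is only an illustration, not Metron's indexing bolt: the function names and the metadata key names are hypothetical, and it assumes the 32766 figure is Lucene's per-term limit measured in UTF-8 bytes. The limit is a parameter so it can be changed in one place if Lucene's limit changes.

```python
import hashlib

# Assumed limit: Lucene rejects a single indexed term longer than 32766 bytes.
MAX_TERM_BYTES = 32766


def truncate_field(message, field, max_bytes=MAX_TERM_BYTES):
    """Return a copy of `message` with `field` capped at `max_bytes` of UTF-8.

    Hypothetical helper; the metadata key names are illustrative only.
    """
    out = dict(message)
    value = out.get(field)
    if not isinstance(value, str):
        return out  # only string fields are candidates for truncation
    raw = value.encode("utf-8")
    if len(raw) <= max_bytes:
        return out
    # Cut on a byte boundary, then drop any partial trailing character.
    out[field] = raw[:max_bytes].decode("utf-8", errors="ignore")
    # Metadata so the original value can still be identified and verified.
    out[field + ":truncated"] = True
    out[field + ":original_length"] = len(raw)
    out[field + ":original_sha256"] = hashlib.sha256(raw).hexdigest()
    return out


def truncate_all(message, max_bytes=MAX_TERM_BYTES):
    """Apply truncate_field to every field present in the original message."""
    out = dict(message)
    for field in list(message):
        out = truncate_field(out, field, max_bytes)
    return out
```

Applying this across the board, rather than per field through a Stellar statement, matches the suggestion above; the untruncated value would still need to be written elsewhere (HDFS in the thread) for retrieval.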
