Thanks for this feedback, Mike! The Data Profiler has been handy for me to see whether a file has any unusually high or low frequencies of certain bytes, or whether what looks like a text file actually has some binary or unanticipated control characters embedded in it. The start offset, end offset, and length are editable, so users can choose different ranges and lengths if they want to. I just capped the max length at 1M bytes.
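In case it's useful to see the idea spelled out, the profile is conceptually just a range-bounded byte histogram. Here's a minimal sketch in Scala (the names and the cap are illustrative, not the extension's actual code):

```scala
import java.io.RandomAccessFile

// Minimal sketch of a range-bounded byte-frequency profile.
// Names and the cap are illustrative, not the extension's real code.
object ProfileSketch {
  val MaxProfileLength: Long = 1024L * 1024L // cap each request at 1M bytes

  def byteFrequencies(path: String, start: Long, requested: Long): Array[Long] = {
    val counts = new Array[Long](256) // one bin per possible byte value
    val raf = new RandomAccessFile(path, "r")
    try {
      val remaining = math.max(0L, raf.length() - start)
      val len = math.min(requested, math.min(remaining, MaxProfileLength)).toInt
      raf.seek(start)
      val buf = new Array[Byte](len)
      raf.readFully(buf)
      buf.foreach(b => counts(b & 0xff) += 1) // mask to index as unsigned 0..255
    } finally raf.close()
    counts
  }
}
```

Capping the read keeps the cost bounded no matter how large the file is, which is what keeps the UI responsive.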
I'm relieved to hear that searching for a matching pattern (which could be text or binary) and then offering First, Previous, Next, and Last is reasonable. Tools like `grep` can stream through a file and locate all the matches, and tools like `sed` can handle massive replace operations, at least on text files.

You brought up not allowing insert and delete. This is perfect, because we just finished implementing that feature for the upcoming v1.3.1 release. In this mode, we only allow existing bytes to be overwritten. Replace will overwrite the match with the replacement pattern rather than removing the match and inserting the replacement (as it does in Delete, Insert, and Overwrite mode). In overwrite-only mode, the file size stays constant. You can freely switch between the modes if you want to, though. Overwrite-only mode is not the default, but in a later release we're probably going to allow persisting the settings so the defaults can be whatever you like when you start up the editor.

Applying a stack of operations to a set of files in a directory is indeed something people do. In the past, I've used GNU Parallel on the command line to efficiently queue up and parallelize operations on large directories of files of a particular kind (something like `find . -name '*.dat' | parallel sed -i 's/foo/bar/'`). That isn't something that is presently in scope for the Data Editor feature of the extension, but we can discuss it further if you'd like. Today, Apache NiFi or Airflow can do this in spades if you have code or a script, and editing multiple files at the same time could be done with tmux.

I'm going to press on with the solutions presented for v1.3.1, so we're not broken by scale. We can tackle handling many transformations in a single transaction in a following release. Thanks again!
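P.S. If it helps to make the overwrite-only behavior concrete, here's a rough sketch in Scala of the two ideas together: finding the next match from a given offset (the interactive Next), and replacing it strictly in place. The names are hypothetical, not our actual gRPC API, and a real implementation would search the session's edited view in chunks rather than byte by byte:

```scala
import java.io.RandomAccessFile

// Hypothetical sketch, not the extension's actual API.
object OverwriteSearchSketch {

  // Naive forward scan for `pattern` starting at `from`; returns the
  // next match offset or -1. Illustrative only: a real implementation
  // would stream in chunks and use a smarter search algorithm.
  def findNext(raf: RandomAccessFile, pattern: Array[Byte], from: Long): Long = {
    val end = raf.length() - pattern.length
    val buf = new Array[Byte](pattern.length)
    var off = from
    while (off <= end) {
      raf.seek(off)
      raf.readFully(buf)
      if (buf.sameElements(pattern)) return off
      off += 1
    }
    -1L
  }

  // Overwrite-only replace: write `replacement` over the bytes at the
  // match offset. Nothing is inserted or deleted, so the file size is
  // constant and any stored lengths elsewhere in the file stay valid.
  def overwriteAt(raf: RandomAccessFile, offset: Long, replacement: Array[Byte]): Unit = {
    require(offset + replacement.length <= raf.length(), "must not grow the file")
    raf.seek(offset)
    raf.write(replacement)
  }
}
```

Driving `findNext` from the current cursor gives Next, and running it from offset 0 gives First; Previous and Last need a backward scan, but the in-place write is the same in every case.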
On Fri, Jul 28, 2023 at 11:41 AM Mike Beckerle <mbecke...@apache.org> wrote:

> I have a few opinions to offer.
>
> First, what is this data profile for? I know of only a few things one would use that sort of profile for, and those could all be handled by looking at the first Kbyte of the data (guessing natural language, guessing charset, LF vs. CRLF, guessing text vs. binary data, guessing binary integers vs. packed-decimal integers, compressed or not). If you randomly select another Kbyte of the file and you get a different profile result, that's also perhaps interesting, because it tells you the data is not consistent throughout the file. But all this is just to guess a few top-level data characteristics. I really don't know of anything that requires a byte histogram over an entire file, so I wouldn't even suggest having that operation, or I would default it to 1 Kbyte of data and let the user enlarge it.
>
> I agree that you can't just treat binary data files like text editing. To me, Search and Search/Replace are very unlikely to be used in bulk on binary data files. I deal with mostly binary data, and there I think there are so many opportunities for false matches that bulk operations (replace all, or search all to get a count) make little or no sense. Textual data is a different story; that's just big-data text editing. But many data formats, even mostly-textual ones, have stored length information. In those cases, editing the data in any way which changes the number of bytes is going to immediately break the whole file, so insert/delete makes basically no sense. I think an editor needs to have a mode where you can't accidentally insert or delete data. The user could modify data only, not insert/delete data. Once a user knows their data is of a kind that has stored lengths, they would want to toggle this on, so as to avoid corrupting the entire file by simple accident. Arguably, data editors should default to this mode, and require users to say it is OK to insert/delete bytes in order to access those features. Once a user sets these options, remembering them in some sort of sticky settings saved in the project folders is helpful.
>
> For data files of any size, doing a total search that tells you how many total matches there are to a pattern, at least for binary data, is fairly meaningless, so I would suggest that attempting to display both the first few matches and a count of how many total matches is not necessary. Just show the first match, and prefetch the second (or maybe a few) so the user gets some context, but I would not bother to do anything beyond that. The count of how many is just not that useful in binary data. Text data maybe, but for binary data I don't see the use case.
>
> Last thought. Often there is not one big data file, but a directory (or several) full of smaller data files. Operations on data should be able to span files or operate on a single file in a fairly transparent manner. There's very little conceptual difference between a file of binary records and a directory of files each containing one binary record, so a data editing environment should treat these as roughly equivalent. So searching for a particular byte pattern or bit pattern in one file, or across a directory of files, and moving from one match to the next, should be more or less the same operation to the user.
>
> On Thu, Jul 27, 2023 at 9:41 AM Davin Shearer <da...@apache.org> wrote:
>
> > In v1.3.1 we've added support for editing large files, but it has exposed some other challenges related to search, replace, and data profiling. I outline the problems and possible solutions to these problems in a discussion thread here (https://github.com/ctc-oss/daffodil-vscode/discussions/122).
> >
> > The bottom line up front is that for search and replace, I think we'll need to adopt an interactive approach rather than an all-at-once approach. For example, search will find the next match from where you are; click Next and it will find the next, and so on, instead of finding all the matches up front. Similarly, with replace, we find the next match, then you can either replace it or skip to the next match, and so on. These are departures from v1.3.0, but we need something that will scale.
> >
> > Data profiling is a new feature in v1.3.1 that creates a byte frequency graph and some statistics on all or part of the edited file. Right now I've allowed it to profile from the beginning to the end of the file, even if the file is multiple gigabytes in size. Currently, though, that could take longer than 5 seconds, especially if the file has many editing changes. After 5 seconds the request is timed out in the Scala gRPC server. I can bump up the timeout, but that's just a band-aid (what happens if someone wants to profile a 1+ TB file, for example?). I think a reasonable fix is to allow the user to select any offset in the file, and we profile up to X bytes from that offset, where X is perhaps something on the order of 1M. This ensures the UI is responsive and can scale.
> >
> > We expect to have a release candidate of v1.3.1 within two weeks from now, and I'm hoping to address these scale issues before then. Feedback welcome!
> >
> > Thank you.