Re: [OctDev] textread and textscan select wrong number of headerlines

Philip Nienhuis Fri, 08 Apr 2011 12:47:48 -0700

Hi there Brett

brett.t.stew...@exxonmobil.com wrote:

Both textscan() and textread() allow the specification of headerlines like
this:


textscan(fid,'%s','headerlines',N)

in which N is the number of lines to skip.  However, with the current
version, if you try to specify N, it always uses N = 2.


Thanks for the report.

You've sent it to the wrong forum, but you didn't know I guess.

octave-dev = for octave-forge packages, i.e. add-on packages that are not "core-octave" (not maintained by the octave developers sensu stricto but by other folks).

As both textread and textscan are in core octave, you'd rather ask for help in help-oct...@octave.org (the Help mailing list). Actually even that is not correct; the folks there rather want (you to add) an entry in the bug tracker.

I'll do that for you (later tonight); I already found & fixed the bug (same one in both functions) and besides, over the last months I have fixed a couple of other bugs in textread and friends.

You can help me by swapping the attached strread.m, textread.m and textscan.m into the io package in place of the old versions (first rename those to _textread.m and _textscan.m). You can do "which textscan.m" (w/o quotes) in octave to find out where they are located.


Please report back if the attached versions work OK or not.
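
A quick check could look like the sketch below. It is not part of the attached files: the temporary file and its contents are made up for illustration, and it assumes the replaced textread.m and textscan.m are the ones found on the load path.

```octave
## Hypothetical test file: 3 header lines followed by 2 data lines
fname = tempname ();
fid = fopen (fname, "w");
fprintf (fid, "header 1\nheader 2\nheader 3\n10\n20\n");
fclose (fid);

## textread: ask for 3 header lines to be skipped
a = textread (fname, "%f", "headerlines", 3);

## textscan: same request, but via a file id
fid = fopen (fname, "r");
c = textscan (fid, "%f", "headerlines", 3);
fclose (fid);
delete (fname);

## Both should hold only 10 and 20 if exactly 3 lines were skipped
disp (a);
disp (c{1});
```

If more or fewer than 3 lines get skipped, the displayed values will show it directly.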

Philip

## Copyright (C) 2009-2011 Eric Chassande-Mottin, CNRS (France)
##
## This file is part of Octave.
##
## Octave is free software; you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 3 of the License, or (at
## your option) any later version.
##
## Octave is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Octave; see the file COPYING.  If not, see
## <http://www.gnu.org/licenses/>.

## -*- texinfo -*-
## @deftypefn  {Function File} {[@var{a}, @dots{}] =} strread (@var{str})
## @deftypefnx {Function File} {[@var{a}, @dots{}] =} strread (@var{str}, @var{format})
## @deftypefnx {Function File} {[@var{a}, @dots{}] =} strread (@var{str}, @var{format}, @var{prop1}, @var{value1}, @dots{})
## Read data from a string.
##
## The string @var{str} is split into words that are repeatedly matched to the
## specifiers in @var{format}.  The first word is matched to the first
## specifier, the second to the second specifier and so forth.  If there are
## more words than specifiers, the process is repeated until all words have
## been processed.
##
## The string @var{format} describes how the words in @var{str} should be
## parsed.  It may contain any combination of the following specifiers:
## @table @code
## @item %s
## The word is parsed as a string.
##
## @item %d
## @itemx %f
## The word is parsed as a number.
##
## @item %*
## The word is skipped.
## @end table
##
## Parsed words corresponding to the first specifier are returned in the first
## output argument and likewise for the rest of the specifiers.
##
## By default, @var{format} is @t{"%f"}, meaning that numbers are read from
## @var{str}.
##
## For example, the string
##
## @example
## @group
## @var{str} = "\
## Bunny Bugs   5.5\n\
## Duck Daffy  -7.5e-5\n\
## Penguin Tux   6"
## @end group
## @end example
##
## @noindent
## can be read using
##
## @example
## [@var{a}, @var{b}, @var{c}] = strread (@var{str}, "%s %s %f");
## @end example
##
## The behavior of @code{strread} can be changed via property-value
## pairs.  The following properties are recognized:
##
## @table @asis
## @item "commentstyle"
## Parts of @var{str} are considered comments and will be skipped.
## @var{value} is the comment style and can be any of the following.
## @itemize
## @item "shell"
## Everything from @code{#} characters to the nearest end-line is skipped.
##
## @item "c"
## Everything between @code{/*} and @code{*/} is skipped.
##
## @item "c++"
## Everything from @code{//} characters to the nearest end-line is skipped.
##
## @item "matlab"
## Everything from @code{%} characters to the nearest end-line is skipped.
## @end itemize
##
## @item "delimiter"
## Any character in @var{value} will be used to split @var{str} into words 
## (default value = \"\n\").
##
## @item "whitespace"
## Any character in @var{value} will be interpreted as whitespace and
## trimmed; the string defining whitespace must be enclosed in double
## quotes for proper processing of special characters like \t.
##
## @item "emptyvalue"
## Parts of the output where no word is available are filled with @var{value}.
## @end table
##
## @seealso{textread, load, dlmread, fscanf}
## @end deftypefn

function varargout = strread (str, format = "%f", varargin)
  ## Check input
  if (nargin < 1)
    print_usage ();
  endif

  if (!ischar (str) || !ischar (format))
    error ("strread: STR and FORMAT arguments must be strings");
  endif

  ## Parse options
  comment_flag = false;
  numeric_fill_value = 0;
  white_spaces = " \n\r\t\b";
  delimiter_str = "";
  for n = 1:2:length (varargin)
    switch (lower (varargin {n}))
      case "commentstyle"
        comment_flag = true;
        switch (lower (varargin {n+1}))
          case "c"
            comment_specif = {"/*", "*/"};
          case "c++"
            comment_specif = {"//", "\n"};
          case "shell"
            comment_specif = {"#", "\n"};
          case "matlab"
            comment_specif = {"%", "\n"};
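          otherwise
            ## Unknown style: no comment_specif is set, so disable comment removal
            warning ("strread: unknown comment style '%s'", varargin{n+1});
            comment_flag = false;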
        endswitch
      case "delimiter"
        delimiter_str = varargin {n+1};
      case "emptyvalue"
        numeric_fill_value = varargin {n+1};
      case "bufsize"
        ## XXX: We could synthesize this, but that just seems weird...
        warning ("strread: property \"bufsize\" is not implemented");
      case "whitespace"
        white_spaces = varargin {n+1};
      case "expchars"
        warning ("strread: property \"expchars\" is not implemented");
      otherwise
        warning ("strread: unknown property \"%s\"", varargin {n});
    endswitch
  endfor
  if (isempty (delimiter_str))
    if (~isempty (white_spaces))
      delimiter_str = white_spaces;
    else
      ## Default delimiter = newline
      delimiter_str = "\n";
    endif
  endif

  ## Parse format string
  idx = strfind (format, "%")';
  specif = format ([idx, idx+1]);
  nspecif = length (idx);
  idx_star = strfind (format, "%*");
  nfields = length (idx) - length (idx_star);

  if (max (nargout, 1) != nfields)
    error ("strread: the number of output variables must match that specified by FORMAT");
  endif

  ## Remove comments
  if (comment_flag)
    cstart = strfind (str, comment_specif{1});
    cstop  = strfind (str, comment_specif{2});
    if (length (cstart) > 0)
      ## Ignore nested openers.
      [idx, cidx] = unique (lookup (cstop, cstart), "first");
      if (idx(end) == length (cstop))
        cidx(end) = []; # Drop the last one if orphaned.
      endif
      cstart = cstart(cidx);
    endif
    if (length (cstop) > 0)
      ## Ignore nested closers.
      [idx, cidx] = unique (lookup (cstart, cstop), "first");
      if (idx(1) == 0)
        cidx(1) = []; # Drop the first one if orphaned.
      endif
      cstop = cstop(cidx);
    endif
    len = length (str);
    c2len = length (comment_specif{2});
    str = cellslices (str, [1, cstop + c2len], [cstart - 1, len]);
    str = [str{:}];
  endif

  ## Determine the number of words per line
  format = strrep (format, "%", " %");
  [~, ~, ~, fmt_words] = regexp (format, "[^ ]+");

  num_words_per_line = numel (fmt_words);
  for m = 1:numel(fmt_words)
    ## Convert formats such as "%Ns" to "%s" (see the FIXME below)
    if (length (fmt_words{m}) > 2)
      if (strcmp (fmt_words{m}(1:2), "%*"))
        fmt_words{m} = "%*";
      elseif (fmt_words{m}(1) == "%")
        fmt_words{m} = fmt_words{m}([1, end]);
      endif
    endif
  endfor

  if (~isempty (white_spaces))
    ## Check for overlapping whitespaces and delimiters & trim whitespace
    [ovlp, iw, ~] = intersect (white_spaces, delimiter_str);
    if (~isempty (ovlp))
      ## Remove delimiter chars from white_spaces
      white_spaces = cell2mat (strsplit (white_spaces, white_spaces(iw)));
    endif
  endif

  if (~isempty (white_spaces))
    ## Remove repeated white_space chars. First find white_spaces positions
    idx = strchr (str, white_spaces);
    ## Find repeated white_spaces
    idx2 = ~(idx(2:end) - idx(1:end-1) - 1);
    ## Set all whitespace chars to spaces
    ## FIXME: this implies real spaces are always part of white_spaces
    str(idx(find (idx))) = ' ';
    ## Set all repeated white_space to \0
    str(idx(find (idx2))) = "\0";
    str = strsplit (str, "\0");
    ## Reconstruct trimmed str
    str = cell2mat (str);
  endif

  ## Split 'str' into words
  words = split_by (str, delimiter_str);
  if (~isempty (white_spaces))
    ## Trim leading and trailing white_spaces
    words = strtrim (words);
  endif
  num_words = numel (words);
  num_lines = ceil (num_words / num_words_per_line);

  ## For each specifier
  k = 1;
  for m = 1:num_words_per_line
    data = words (m:num_words_per_line:end);
    ## Map to format
    ## FIXME - add support for formats like "%4s" or "<%s>", "%[a-zA-Z]"
    ##         Someone with regexp experience is needed.
    switch fmt_words{m}
      case "%s"
        data (end+1:num_lines) = {""};
        varargout {k} = data';
        k++;
      case {"%d", "%f"}
        n = cellfun (@isempty, data);
        data = str2double (data);
        data(n) = numeric_fill_value;
        data (end+1:num_lines) = numeric_fill_value;
        varargout {k} = data.';
        k++;
      case {"%*", "%*s"}
        ## skip the word
      otherwise
        ## Ensure descriptive content is consistent
        if (numel (unique (data)) > 1
            || ! strcmpi (unique (data), fmt_words{m}))
          error ("strread: FORMAT does not match data");
        endif
    endswitch
  endfor
endfunction

function out = split_by (text, sep)
  sep = union (sep, "\n");  # Why would newline always have to be a separator?
  pat = sprintf ("[^%s]+", sep);
  [~, ~, ~, out] = regexp (text, pat);
  out(cellfun (@isempty, out)) = {""};
endfunction

%!test
%! [a, b] = strread ("1 2", "%f%f");
%! assert (a == 1 && b == 2);

%!test
%! str = "# comment\n# comment\n1 2 3";
%! [a, b] = strread (str, '%d %s', 'commentstyle', 'shell');
%! assert (a, [1; 3]);
%! assert (b, {"2"; ""});

%!test
%! str = '';
%! a = rand (10, 1);
%! b = char (round (65 + 20 * rand (10, 1)));
%! for k = 1:10
%!   str = sprintf ('%s %.6f %s\n', str, a (k), b (k));
%! endfor
%! [aa, bb] = strread (str, '%f %s');
%! assert (a, aa, 1e-5);
%! assert (cellstr (b), bb);

%!test
%! str = '';
%! a = rand (10, 1);
%! b = char (round (65 + 20 * rand (10, 1)));
%! for k = 1:10
%!   str = sprintf ('%s %.6f %s\n', str, a (k), b (k));
%! endfor
%! aa = strread (str, '%f %*s');
%! assert (a, aa, 1e-5);

%!test
%! str = sprintf ('/* this is\nacomment*/ 1 2 3');
%! a = strread (str, '%f', 'commentstyle', 'c');
%! assert (a, [1; 2; 3]);

%!test
%! str = sprintf ("Tom 100 miles/hr\nDick 90 miles/hr\nHarry 80 miles/hr");
%! fmt = "%s %f miles/hr";
%! c = cell (1, 2);
%! [c{:}] = strread (str, fmt);
%! assert (c{1}, {"Tom"; "Dick"; "Harry"})
%! assert (c{2}, [100; 90; 80])

%!test
%! a = strread ("a b c, d e, , f", "%s", "delimiter", ",");
%! assert (a, {"a b c"; "d e"; ""; "f"});

## Copyright (C) 2009-2011 Eric Chassande-Mottin, CNRS (France)
##
## This file is part of Octave.
##
## Octave is free software; you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 3 of the License, or (at
## your option) any later version.
##
## Octave is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Octave; see the file COPYING.  If not, see
## <http://www.gnu.org/licenses/>.

## -*- texinfo -*-
## @deftypefn  {Function File} {[@var{a}, @dots{}] =} textread (@var{filename})
## @deftypefnx {Function File} {[@var{a}, @dots{}] =} textread (@var{filename}, @var{format})
## @deftypefnx {Function File} {[@var{a}, @dots{}] =} textread (@var{filename}, @var{format}, @var{prop1}, @var{value1}, @dots{})
## Read data from a text file.
##
## The file @var{filename} is read and parsed according to @var{format}.  The
## function behaves like @code{strread} except it works by parsing a file
## instead of a string.  See the documentation of @code{strread} for details.
## In addition to the options supported by @code{strread}, this function
## supports one more:
## @itemize
## @item "headerlines":
## The first @var{value} number of lines of @var{filename} are skipped.
## @end itemize
## @seealso{strread, load, dlmread, fscanf}
## @end deftypefn

## Updates:
## Philip Nienhuis <prnienh...@users.sf.net>
## 2011-03-18 Fix default whitespace setting to same as ML
## 2011-04-08 Fix headerline processing

function varargout = textread (filename, format = "%f", varargin)
  ## Check input
  if (nargin < 1)
    print_usage ();
  endif

  if (!ischar (filename) || !ischar (format))
    error ("textread: first and second input arguments must be strings");
  endif

  ## Read file
  fid = fopen (filename, "r");
  if (fid == -1)
    error ("textread: could not open '%s' for reading", filename);
  endif

  ## Maybe skip header lines. Only first occurrence of keyword is used
  headerlines = find (strcmpi (varargin, "headerlines"), 1);
  if (! isempty (headerlines))
    h_lines = varargin{headerlines + 1};
    ## Beware of (possibly computed) zero value for headerline 
    if (h_lines > 0), fskipl (fid, h_lines); endif
    varargin(headerlines:headerlines+1) = [];
  endif

  str = fread (fid, "char=>char").';
  fclose (fid);
  
  ## If needed, set up default whitespace param value
  if (isempty (strmatch ('whitespace', tolower (strtrim (varargin)))))
    nargs = numel (varargin);
    varargin(nargs+1:nargs+2) = {'whitespace', " \b\t"};
  endif

  ## Call strread to make it do the real work
  [varargout{1:max (nargout, 1)}] = strread (str, format, varargin {:});

endfunction

## Copyright (C) 2010-2011 Ben Abbott <bpabb...@mac.com>
##
## This file is part of Octave.
##
## Octave is free software; you can redistribute it and/or modify it
## under the terms of the GNU General Public License as published by
## the Free Software Foundation; either version 3 of the License, or (at
## your option) any later version.
##
## Octave is distributed in the hope that it will be useful, but
## WITHOUT ANY WARRANTY; without even the implied warranty of
## MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
## General Public License for more details.
##
## You should have received a copy of the GNU General Public License
## along with Octave; see the file COPYING.  If not, see
## <http://www.gnu.org/licenses/>.

## -*- texinfo -*-
## @deftypefn  {Function File} {@var{C} =} textscan (@var{fid}, @var{format})
## @deftypefnx {Function File} {@var{C} =} textscan (@var{fid}, @var{format}, @
## @var{n})
## @deftypefnx {Function File} {@var{C} =} textscan (@var{fid}, @var{format}, @
## @var{param}, @var{value}, @dots{})
## @deftypefnx {Function File} {@var{C} =} textscan (@var{fid}, @var{format}, @
## @var{n}, @var{param}, @var{value}, @dots{})
## @deftypefnx {Function File} {@var{C} =} textscan (@var{str}, @dots{})
## @deftypefnx {Function File} {[@var{C}, @var{position}] =} textscan (@dots{})
## Read data from a text file.
##
## The file associated with @var{fid} is read and parsed according to
## @var{format}.  The function behaves like @code{strread} except it works by
## parsing a file instead of a string.  See the documentation of
## @code{strread} for details.  In addition to the options supported by
## @code{strread}, this function supports one more:
## @itemize
## @item "headerlines":
## The first @var{value} number of lines of the file associated with
## @var{fid} are skipped.
## @end itemize
##
## The optional input, @var{n}, specifies the number of lines to be read from
## the file associated with @var{fid}.
##
## The output, @var{C}, is a cell array whose length is given by the number
## of format specifiers.
##
## The second output, @var{position}, provides the position, in characters,
## from the beginning of the file.
##
## @seealso{dlmread, fscanf, load, strread, textread}
## @end deftypefn

## Updates:
## Philip Nienhuis <prnienhuis@users.sf.net>
## 2011-04-08 Fix headerline arg processing bug

function [C, p] = textscan (fid, format, varargin)

  ## Check input
  if (nargin < 1)
    print_usage ();
  elseif (nargin == 1 || isempty (format))
    format = "%f";
  endif

  if (nargin > 2 && isnumeric (varargin{1}))
    nlines = varargin{1};
    args = varargin(2:end);
  else
    nlines = Inf;
    args = varargin;
  endif

  if (! any (strcmpi (args, "emptyvalue")))
    ## Matlab returns NaNs for missing values
    args{end+1} = "emptyvalue";
    args{end+1} = NaN;
  endif

  if (isa (fid, "double") && fid > 0 || ischar (fid))
    if (ischar (format))
      if (ischar (fid))
        if (nargout == 2)
          error ("textscan: cannot provide position information for character input");
        endif
        str = fid;
      else
        ## Maybe skip header lines
        headerlines = find (strcmpi (args, "headerlines"), 1);
        if (! isempty (headerlines))
          ## Index into args, not varargin: the two differ when N is given
          h_lines = args{headerlines + 1};
          ## Beware of zero headerline value, fskipl will count lines to EOF then
          if (h_lines > 0), fskipl (fid, h_lines); endif
          args(headerlines:headerlines+1) = [];
        endif
        if (isfinite (nlines))
          str = "";
          for n = 1:nlines
            str = strcat (str, fgets (fid));
          endfor
        else
          str = fread (fid, "char=>char").';
        endif
      endif

      ## Determine the number of data fields
      num_fields = numel (strfind (format, "%")) - ...
                   numel (idx_star = strfind (format, "%*"));

      ## Call strread to make it do the real work
      C = cell (1, num_fields);
      [C{:}] = strread (str, format, args{:});

      if (ischar (fid) && isfinite (nlines))
        C = cellfun (@(x) x(1:nlines), C, "uniformoutput", false);
      endif

      if (nargout == 2)
        p = ftell (fid);
      endif

    else
      error ("textscan: FORMAT must be a valid specification");
    endif
  else
    error ("textscan: first argument must be a file id or character string");
  endif

endfunction

%!test
%! str = "1,  2,  3,  4\n 5,  ,  ,  8\n 9, 10, 11, 12";
%! fmtstr = "%f %d %f %s";
%! c = textscan (str, fmtstr, 2, "delimiter", ",", "emptyvalue", -Inf);
%! assert (isequal (c{1}, [1;5]))
%! assert (length (c{1}), 2);
%! assert (iscellstr (c{4}))
%! assert (isequal (c{3}, [3; -Inf]))

%!test
%! b = [10:10:100];
%! b = [b; 8*b/5];
%! str = sprintf ("%g miles/hr = %g kilometers/hr\n", b);
%! fmt = "%f miles/hr = %f kilometers/hr";
%! c = textscan (str, fmt);
%! assert (b(1,:)', c{1})
%! assert (b(2,:)', c{2})
