On Sat, Mar 23, 2013 at 3:34 PM, Kaushik Vaidyanathan <
kvaid...@andrew.cmu.edu> wrote:

> Hi Matt
>
> Thanks for your detailed explanation on how signature gets stored and
> interpreted.
>
> I was looking through the code in libclamav to see what data formats get
> used for string comparison. Some backtracking from cli_bm_scanbuff took me
> to str.c, where I see there is a function "cli_hex2str", which, if I
> understand correctly, maps two hex digits to one character (unsigned char).
> Would it be fair to speculate that this function is used by the ClamAV
> engine to map two hex digits read from a signature or scanned file into one
> char for string-matching purposes?
>
> Thank you..
>
>
> On Sat, Mar 23, 2013 at 11:02 AM, Matt Olney <mol...@sourcefire.com>
> wrote:
>
> > Well....data is data.  There is no difference (from a storage
> > perspective) from an executable with an "inc ecx" instruction or a text
> > document with an "A".  Both are represented by the value 0x41.  So from
> > Clam's perspective, a signature matching a single A would be identical to
> > a signature that detected a single "inc ecx" instruction.  Both would
> > look for 41.
> >
> > In short your statement "some files are hex and some are character-based"
> > isn't really accurate.  At the risk of painting with a broad brush, I
> > would say that all files are stored as a series of values, a series of
> > bytes.  How you display them is different.  When I used 010 Editor to
> > view a file as hex, I get a set of ascii-hex representations.  When I
> > look at a file with a web-browser I get ascii text.  But underlying all
> > of that is the same idea, a set of bytes.  And that is how ClamAV treats
> > all files.
> >
> > A signature with a 41 in it would be converted in memory to look for
> > 0x41, a single byte of value 0x41.  A signature written like that would
> > detect an executable or pdf or a flash or anything that has 0x41 in the
> > data.
> >
> > Hope that answers your question.
> >
> > Matt
> >
> >
> > On Fri, Mar 22, 2013 at 8:46 PM, Kaushik Vaidyanathan <
> > kvaid...@andrew.cmu.edu> wrote:
> >
> > > Hi
> > >
> > > I have a basic question. Most body-based signatures are hex based(lets
> > > focus on fixed string signatures alone for simplicity), whereas some of
> > > the files are hex(EXE) or character-based(HTML).
> > >
> > > In the code I see unsigned chars used predominantly to represent
> > > patterns and file contents. At the very core, do the string matching
> > > algorithms, mainly extended Boyer Moore, I would like to understand how
> > > the datatypes gets manipulated.
> > >
> > > 1) Do the character based files get translated to hex to compare with
> > > body based signatures?
> > >
> > > 2) Does the signature get treated as a string of chars?
> > > If yes,
> > > Does a toy signature "fe" gets treated as two chars(8 bits each) for
> > > "f" and "e" (or)
> > > Does the code read the signature "fe" and maps into one character based
> > > on the ASCII table (for example)?
> > >
> > > Thank you..
> > > _______________________________________________
> > > http://lurker.clamav.net/list/clamav-devel.html
> > > Please submit your patches to our Bugzilla: http://bugs.clamav.net
> > >

Read from a signature, yes. Read from a file, no. File contents are already
raw bytes, so the fastest comparison is against that in-file binary
representation; only the signature needs converting. More precisely,
cli_hex2str() converts the human-readable (text) representation of
hexadecimal values into their binary equivalent. For any byte pattern to be
matched, the signature-format equivalent takes twice as many bytes as the raw
binary value, since each byte is written as two hex digits.

Example: "Hex" in ASCII
Actual data is 3 bytes long. 1st byte: 0x48. 2nd byte: 0x65. 3rd byte: 0x78
Signature-format equivalent (486578) is 6 bytes long, one for each hex digit.

This is where the name of the function came from. Input and output are both
char arrays (i.e. strings). The function takes in the "hex"-format version
of the content [486578], and returns the content in a usable string format
[Hex]. Hence, from "hex" to string.
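
To make the conversion concrete, here is a minimal sketch of the idea in C. It
is not the actual cli_hex2str() source: the names hex_digit_value,
hex_to_bytes and buffer_contains are made up for illustration, and the naive
search only stands in for the real Boyer-Moore matcher. It assumes a plain
fixed-string signature with no wildcards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: numeric value of one hex digit, or -1 if invalid. */
int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Hypothetical decoder (illustration only, not ClamAV's implementation).
 * Turns a hex string such as "486578" into the raw bytes 0x48 0x65 0x78.
 * Returns a malloc'd, NUL-terminated buffer and stores the byte count in
 * *out_len. Returns NULL on odd length, a non-hex digit, or failed malloc.
 */
unsigned char *hex_to_bytes(const char *hex, size_t *out_len)
{
    assert(hex != NULL && out_len != NULL);

    size_t hex_len = strlen(hex);
    if (hex_len % 2 != 0)
        return NULL;

    size_t n = hex_len / 2;            /* two hex digits per output byte */
    unsigned char *bytes = malloc(n + 1);
    if (bytes == NULL)
        return NULL;

    for (size_t i = 0; i < n; i++) {
        int hi = hex_digit_value(hex[2 * i]);
        int lo = hex_digit_value(hex[2 * i + 1]);
        if (hi < 0 || lo < 0) {
            free(bytes);
            return NULL;
        }
        bytes[i] = (unsigned char)((hi << 4) | lo);
    }
    bytes[n] = '\0';
    *out_len = n;
    return bytes;
}

/*
 * Naive stand-in for the real matcher: returns 1 if the decoded pattern
 * occurs anywhere in the file buffer, comparing raw bytes to raw bytes.
 */
int buffer_contains(const unsigned char *buf, size_t buf_len,
                    const unsigned char *pat, size_t pat_len)
{
    if (pat_len == 0 || pat_len > buf_len)
        return 0;
    for (size_t i = 0; i + pat_len <= buf_len; i++) {
        if (memcmp(buf + i, pat, pat_len) == 0)
            return 1;
    }
    return 0;
}
```

With this sketch, hex_to_bytes("486578", &n) yields the 3 bytes of "Hex", and
the file data is searched as-is with no conversion on the file side.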

Dave R.

-- 
---
Dave Raynor
Sourcefire Vulnerability Research Team
dray...@sourcefire.com