Re: [Tutor] CVS File Opening

Christian Witts Wed, 27 May 2009 07:43:22 -0700

Paras K. wrote:

There are no headers for the log files, and there are multiple logfiles, so would that walk through the directory for all csv files? THANKS IN ADVANCE!!!

On Wed, May 27, 2009 at 10:18 AM, Christian Witts <cwi...@compuscan.co.za> wrote:


    Paras K. wrote:

        As requested - here is some example rows from the csv files:

117.86.68.157    BitTorrent Client Activity    1    5/21/2009 6:56
82.210.106.99    BitTorrent Client Activity    1    5/20/2009 12:39
81.132.134.83    BitTorrent Client Activity    1    5/21/2009 3:14


         The columns are: IP, Activity, Count, Date / Time. These are
        typical log files.


         On Tue, May 26, 2009 at 6:51 PM, Sander Sweers
        <sander.swe...@gmail.com> wrote:

           2009/5/26 Paras K. <para...@gmail.com>:

           > Hello,
           >
           > I have been working on this script / program all weekend. I
           > emailed this address before and got some great help. I hope
           > that I can get that again!
           >
           >
           > First to explain what I need to do:
           >
           > Have about 6 CSV files that I need to read. Then I need
           > to split based on a range of IP addresses and whether the
           > count number is larger than 75.
           >
           > I currently merge all the CSV files by using the command
           > line:
           >
           > C:Reports> copy *.csv merge.csv
           >
           > Then I run the dos command: for /F "TOKENS=* SKIP=1" %i in
           > ('find "." merge.csv ^| find /v "----"') do echo %i>> P2PMerge.csv
           >
           > From some of my friends they tell me that should remove that
           > last carriage return, which it does, however when it goes
           > through the python script it returns no values.

           Why would you need to strip off a carriage return? And why
           would you not process the csv files one after another? It
           would be easier to have some example data.

           > Now if I open the merge.csv and remove that carriage return
           > manually and save it as P2PMerge.csv the script runs just fine.
           >
           > Here is my source code:
           >
           > # P2P Report / Bitorrent Report
           > # Version 1.0
           > # Last Updated: May 26, 2009
           > # This script is designed to go through the cvs files and
           > # find the valid IP Address
           > # Then copies them all to a new file
           > import sys
           > import win32api
           > import win32ui
           > import shutil
           > import string
           > import os
           > import os.path
           > import csv

           You import csv but do not use it below?

           > #Global Variables
           > P2Pfiles = []
           > totalcount = 0
           > t = 0
           > #still in the development process -- where to get the
           > # files from
           > #right now the location is C:\P2P
           > def getp2preportdestion():
           >     win32ui.MessageBox('Welcome to P2P Reporting.\nThis
        program
           is designed
           > to aid in the P2P reporting. \n\nThe locations of P2P Reports
           should be in
           > C:\P2P \nWith no subdirectories.\n\nVersion 1.0 -
        \n\nPress "OK"
           to continue
           > with this program.')
           >     p2preport = 'C://P2P\\'
           >     return p2preport
           >
           >
           > #Main Program
           > #Get location of directories
           > p2ploc = getp2preportdestion()
           > #Checking to make sure directory is there.
           > if os.path.exists(p2ploc):
           >     if os.path.isfile(p2ploc +'/p2pmerge.csv'):
           >         win32ui.MessageBox('P2PMerge.csv file does
           exists.\n\nWill continue
           > with P2P Reporting.')
           >     else:
           >          win32ui.MessageBox('P2PMerge.csv files does not
        exists.
           \n\nPlease
           > run XXXXXXX.bat files first.')
           >          sys.exit()
           > else:
           >     win32ui.MessageBox('The C:\P2P directory does not
           exists.\n\nPlease
           > create and copy all the files there.\nThen re-run this
        script')
           >     sys.exit()
           > fh = open('C://P2P/P2PMerge.csv', "rb")
           > ff = open('C://P2P/P2PComplete.csv', "wb")
           > igot1 = fh.readlines()
           >
           > for line in igot1:

           You can also write the below and get rid of igot1.
           for line in fh.readlines():

           >     readline = line
           >     ipline = readline
           >     ctline = readline

           You are binding several names to the same object, and none of
           them are necessary. See the IDLE session below, which should
           show what I mean.

           >>> line = [1,2,3,4]
           >>> readline = line
           >>> ipline = readline
           >>> ctline = readline
           >>> line
           [1, 2, 3, 4]
           >>> line.append('This will be copied to readline, iplin and
        ctline')
           >>> readline
           [1, 2, 3, 4, 'This will be copied to readline, iplin and ctline']
           >>> ipline
           [1, 2, 3, 4, 'This will be copied to readline, iplin and ctline']
           >>> ctline
           [1, 2, 3, 4, 'This will be copied to readline, iplin and ctline']

           >     count = ctline.split(',')[2]
           >     count2 = int(count)
           >     print count2
           >     t = count2

           Again binding several names to the same object? And you really
           do not need t.

           >     ip = ipline.split(' ')[0]

           so all the above can be simplified like:
                 data = line.split(' ')
                 count = int(data[2])
                 ip = data[0]

           >     split_ip = ip.split('.')
           >     if ((split_ip[0] == '192') and (t >=75)):

           The above then would be:
                 if ip.startswith('192') and count >= 75:

           >         ff.write(readline)
           This will change as well:
                     ff.write(line)

           You can figure out the rest ;-)

           >         totalcount +=1
           >     elif ((split_ip[0] == '151') and (t >=75)):
           >         ff.write(readline)
           >         totalcount +=1
           >     elif (((split_ip[0] == '142') and (split_ip[1]) == '152')
           >     and (t >=75)):
           >           ff.write(readline)
           >           totalcount +=1
           >
           > tc = str(totalcount)
           > win32ui.MessageBox('Total Number of IPs in P2P Reporting:
        '+ tc)
           > fh.close()
           > ff.close()
           >
           >
           > What I am looking for is an working example of how to go
           > through the directory and read each csv file within that
           > directory or how to remove the carriage return at the end of
           > the csv file.

           You can avoid the removal of this carriage return, read below.
           But if you really need to, you can use line.rstrip('\r\n').

           > NOTE: This is not for a class - it is for work to assist
           > me in reading multiple csv files within a couple days.
           >
           > Any assistance is greatly appreciated.

           Use the glob module, which can easily find all csv files in a
           directory. In general I would loop over each file and do your
           processing. Like,

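           import glob

           totalcount = 0
           # glob returns file names, so each file has to be opened first
           for filename in glob.glob('inpath' + '*.csv'):
               f = open(filename)
               for line in f.readlines():
                   pass  # Your code comes here.
               f.close()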

           Greets
           Sander


        ------------------------------------------------------------------------

        _______________________________________________
        Tutor maillist  -  Tutor@python.org
        http://mail.python.org/mailman/listinfo/tutor

    If that's your log structure and all you want is to total the
    P2P activity per IP address, then what you could do is something
    similar to this:

    from glob import glob

    if __name__ == '__main__':
      IP_Addresses = dict()
      for filename in glob('*.csv'):
          fIn = open(filename, 'rb')
          for line in fIn:
              IP, Activity, Count, TimeDate = line.strip().split(' ')
              if IP in IP_Addresses:
                  IP_Addresses[IP] += int(Count)
              else:
                  IP_Addresses[IP] = int(Count)
      for IP, Cnt in IP_Addresses.items():
          if Cnt >= 75:
              if IP.split('.')[0] in ('192', '151'):
                  print IP, Cnt
              elif IP.split('.')[:2] == ['142', '152']:
                  print IP, Cnt

    Obviously if you want to keep the original log line then you will
    need to store that in your dictionary as well, but for the purpose
    of reporting how many 'offences' an IP Address has had this is
    simple enough.

--Kind Regards,

    Christian Witts

from glob import glob
for filename in glob('/path/to/your/files/*.csv'):
   print filename

That will iterate over the files in the folder, matching everything with a .csv extension, which is what you want. Then, for each file that matches, the application I wrote in the previous mail will iterate through each line in the file and split the contents of the log on spaces. It looks like tabs in your sample, though, so just change the .split(' ') to .split('\t'), which will break it up into IP, Activity, Count, DateTime.
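
For the splitting step, a minimal sketch could lean on the csv module that the original script already imports. It is written for Python 3, unlike the Python 2 snippets in this thread, and it assumes the fields really are tab-separated. The name read_rows and the sample rows are illustrative only, not part of the code posted earlier.

```python
import csv
import io


def read_rows(handle, delimiter="\t"):
    """Yield (ip, activity, count, when) tuples from an open log file.

    Assumes four delimiter-separated fields per line (hypothetical helper).
    """
    for row in csv.reader(handle, delimiter=delimiter):
        if not row:
            continue  # csv.reader yields an empty list for blank lines
        ip, activity, count, when = row
        yield ip, activity, int(count), when


# Illustrative in-memory data standing in for one log file.
sample = io.StringIO(
    "117.86.68.157\tBitTorrent Client Activity\t1\t5/21/2009 6:56\n"
    "82.210.106.99\tBitTorrent Client Activity\t1\t5/20/2009 12:39\n"
)
rows = list(read_rows(sample))
```

Splitting on tabs rather than spaces matters here because the Activity field itself contains spaces, so a split on ' ' would produce more than four pieces.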

It will add the IP address to a dictionary of IP addresses, with the count from that log line, if it is not there yet; any further lines from that IP will increment it by their count. Once all files have been processed, it will check the count for each address (you don't care about ones with fewer than 75 hits), then check what range it is in, and output those that match.
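
The accumulate-then-filter steps described above can be sketched roughly as follows. This is a Python 3 sketch under stated assumptions, not a drop-in replacement: it assumes tab-separated fields, the helper names total_counts and flagged are made up for illustration, and the path is the same placeholder used earlier.

```python
from glob import glob


def total_counts(lines, sep="\t"):
    """Sum the Count column per IP address (hypothetical helper)."""
    totals = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, such as a trailing newline
        ip, activity, count, when = line.split(sep)
        totals[ip] = totals.get(ip, 0) + int(count)
    return totals


def flagged(totals, threshold=75):
    """Return (ip, count) pairs at or above threshold in the watched ranges."""
    hits = []
    for ip, count in sorted(totals.items()):
        octets = ip.split(".")
        in_range = octets[0] in ("192", "151") or octets[:2] == ["142", "152"]
        if count >= threshold and in_range:
            hits.append((ip, count))
    return hits


if __name__ == "__main__":
    totals = {}
    # Placeholder path: point this at the directory holding the logs.
    for filename in glob("/path/to/your/files/*.csv"):
        with open(filename) as f:
            for ip, count in total_counts(f).items():
                totals[ip] = totals.get(ip, 0) + count
    for ip, count in flagged(totals):
        print(ip, count)
```

Keeping the counting and the range check in separate functions makes it possible to try them on a handful of lines before pointing the script at the real directory.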


Hope that helps.

--
Kind Regards,
Christian Witts


