[PHP] Read Through PHP Files

2006-11-10 Thread Kevin

Hi,

I am using fopen to open a Word document, loading the contents into a 
variable, and then using substr_count to count how many times a certain 
string occurs. This lets me search through the file and report how many 
times the word appears, and I can even use str_replace to highlight 
certain words. However, Microsoft Word seems to put a lot of rubbish in 
the header and footer. Is it possible to filter this rubbish out and get 
just the document text?
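
A minimal sketch of the counting and highlighting step described above, 
assuming the text has already been extracted as plain text (it will not 
give meaningful results on raw .doc or .pdf bytes). The function names 
count_term and highlight_term are hypothetical, and preg_replace is 
swapped in for str_replace so that the original casing of each match is 
kept:

```php
<?php
// Hypothetical helpers; they operate on plain text only.

function count_term(string $text, string $term): int
{
    // substr_count is case-sensitive, so lower-case both sides first.
    return substr_count(strtolower($text), strtolower($term));
}

function highlight_term(string $text, string $term): string
{
    // Escape first so the only markup in the output is the <b> tags added here.
    $safe = htmlspecialchars($text);
    $pattern = '/' . preg_quote(htmlspecialchars($term), '/') . '/i';
    // $0 is the matched text, so "PHP" and "php" keep their own casing.
    return preg_replace($pattern, '<b>$0</b>', $safe) ?? $safe;
}

$text = 'PHP can search text. Searching with php is simple.';
echo count_term($text, 'php'), "\n";
echo highlight_term($text, 'php'), "\n";
```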


I also tried using fopen to open a PDF file, but as PDF is stored 
differently the output was completely different: no words at all, just 
rubbish. Is there any way I can get the text using a simple fopen?


I am basically trying to create a search engine that can read inside 
files, similar to Google. The only problem left after all this would be 
weighting the search results. I would probably have to collect the 
results first and then go through them to weight them.
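
One simple way to weight results in the two-pass manner described above 
is to score each document by how often the term occurs relative to its 
length, then sort. This is only a sketch under that assumption, not how 
any particular search engine ranks; rank_documents and the file names 
are hypothetical, and the input is assumed to be already-extracted plain 
text:

```php
<?php
// $documents maps a (hypothetical) file name to its extracted plain text.
function rank_documents(array $documents, string $term): array
{
    $scores = [];
    foreach ($documents as $name => $text) {
        $hits = substr_count(strtolower($text), strtolower($term));
        if ($hits > 0) {
            // Divide by word count so long documents do not win on length alone.
            $words = max(1, str_word_count($text));
            $scores[$name] = $hits / $words;
        }
    }
    arsort($scores); // highest score first, file names kept as keys
    return $scores;
}

$ranked = rank_documents([
    'a.txt' => 'php php php other words here',
    'b.txt' => 'php is nice',
    'c.txt' => 'nothing relevant',
], 'php');
print_r(array_keys($ranked));
```

Documents that never mention the term are left out of the result 
entirely rather than being listed with a score of zero.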


Does anyone else have any experience in this or could help me out with 
any of the problems I am having?


Thanks

Kevin

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Read Through PHP Files

2006-11-10 Thread Thomas Munz
 You cannot just open those files. The things that you see are not
 'rubbish'. Those files are in a binary format. You need to
 understand the .doc format and the .pdf format. You can find this
 information by searching Google for 'binary Word format' and so on.
 Then you have to parse the raw bytes (the hex codes) of the file. This is
 pretty complex and I'm sure you don't want to do that :D. Maybe there is
 already a library in PHP that does it for you.
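
 To illustrate what "binary format" means in practice, here is a small
 sketch that only identifies the container type from the first bytes of
 a file. It does not extract any text. The function names sniff_format
 and sniff_file are hypothetical. The signatures are the usual ones: PDF
 files normally begin with "%PDF-", and legacy .doc files are OLE2
 compound files, a container also used by older .xls and .ppt files, so
 the second check cannot tell those apart:

```php
<?php
// Hypothetical helper: guess a container format from its leading bytes.
function sniff_format(string $head): string
{
    // PDF files normally begin with the ASCII marker "%PDF-".
    if (strncmp($head, '%PDF-', 5) === 0) {
        return 'pdf';
    }
    // OLE2 compound files (legacy .doc, .xls, .ppt) share this 8-byte signature.
    if (strncmp($head, "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1", 8) === 0) {
        return 'ole2';
    }
    return 'unknown';
}

// Hypothetical wrapper: read only the first 8 bytes in binary mode.
function sniff_file(string $path): string
{
    $fh = fopen($path, 'rb');
    if ($fh === false) {
        return 'unknown';
    }
    $head = (string) fread($fh, 8);
    fclose($fh);
    return sniff_format($head);
}
```

 Anything beyond this check (locating the text stream, decoding it)
 needs real knowledge of the format or an existing library.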

 But in general, you have to think in a different way. If you don't
 understand what binary formats are and how to parse them, it's pretty hard
 and it's better if you don't try it :)

 on Friday 10 November 2006 11:55, Kevin wrote:




Re: [PHP] Read Through PHP Files

2006-11-10 Thread Frank Arensmeier

There are search engines written in PHP available already, e.g. PHPdig.

http://www.phpdig.net/

PHPdig, for example, is able to index PDF and .doc files (I think; see
the docs). It might also be a good idea to have a look at the
source code.


/frank

On 10 Nov 2006 at 14:50, Thomas Munz wrote:



