On 4 November 2010 15:31, robert mena <robert.m...@gmail.com> wrote:
> Hi Richard,
> I am not top posting. I am just explaining other symptoms that may point to
> the cause, since they may be the same and this is happening with the same
> file. I'll try to get approval to release the file.
> Meanwhile, in your opinion, what would be the safest way to read and explode
> (using \t) a text file encoded in UTF-8?
>
> On Thu, Nov 4, 2010 at 11:22 AM, Richard Quadling <rquadl...@gmail.com>
> wrote:
>>
>> On 4 November 2010 15:11, robert mena <robert.m...@gmail.com> wrote:
>> > Hi,
>> > The core of the code is simply
>> > $fp = fopen('file.tab', 'rb');
>> > while(!feof($fp))
>> > {
>> >     $line = fgets($fp);
>> >     $data = explode("\t", $line);
>> >     ...
>> > }
>> > So I try to manipulate the $data[X]. For example, $data[0] is supposed
>> > to be numeric, so I do $n = (int) $data[0].
>> > One other thing: the second column should contain a string. If I check
>> > the string visually it is correct, but if( $data[1] == 'stringX') is
>> > false even though I can see this in the file (and print those two).
>> > I even did an md5 of both and they are different.
>> > It seems to be an encoding issue. Is it safe to use explode with utf8
>> > strings?
>> > I even tried this code but no match was found (just to replace the explode)
>> > $str = "abc 文字化け efg";
>> > $results = array();
>> > preg_match_all("/\t/u", $str, $results);
>> > var_dump($results[0]);
>> > On Thu, Nov 4, 2010 at 6:33 AM, Richard Quadling <rquadl...@gmail.com>
>> > wrote:
>> >>
>> >> On 3 November 2010 21:42, Alexander Holodny
>> >> <alexander.holo...@gmail.com> wrote:
>> >> > To exclude unexpected behavior in case of wrongly formatted input
>> >> > data, it would be much better to use such a type-casting method:
>> >> > intval(ltrim(trim($inStr), '0'))
>> >> >
>> >> > 2010/11/3, Nicholas Kell <n...@monkeyknight.com>:
>> >> >>
>> >> >> On Nov 3, 2010, at 4:22 PM, robert mena wrote:
>> >> >>
>> >> >>> Hi,
>> >> >>>
>> >> >>> I have a text file (utf-8 encoded) which contains lines with
>> >> >>> numbers and text separated by \t. I need to convert the numbers
>> >> >>> that contain 0 (at left) to integers.
>> >> >>>
>> >> >>> For some reason one line that contains 00000002 is cast to 0
>> >> >>> instead of 2.
>> >> >>> Below is the output of the cast (int) $field[0], where I get this
>> >> >>> from exploding each line.
>> >> >>>
>> >> >>> 0 00000002
>> >> >>> 4 00000004
>> >> >>
>> >> >> My first guess is wondering how you are grabbing the strings from
>> >> >> the file. Seems to me like it would just drop the zeros on the left
>> >> >> by default. Are you including the \t in the string by accident? If
>> >> >> so, that may be hosing it. Otherwise, have you tried ltrim on it?
>> >> >>
>> >> >> Ex:
>> >> >>
>> >> >> $_castableString = ltrim($_yourString, '0');
>> >> >>
>> >> >> // Now cast
>> >>
>> >> <?php
>> >> // Create test file.
>> >> $s_TabbedFilename = './test.tab';
>> >> file_put_contents($s_TabbedFilename, "0\t00000002" . PHP_EOL .
>> >>     "4\t00000004" . PHP_EOL);
>> >>
>> >> // Open test file.
>> >> $fp_TabbedFile = fopen($s_TabbedFilename, 'rt') or die("Could not open
>> >> {$s_TabbedFilename}\n");
>> >>
>> >> // Iterate file.
>> >> while(True)
>> >> {
>> >>     if (False !== ($a_Line = fgetcsv($fp_TabbedFile, 0, "\t")))
>> >>     {
>> >>         var_dump($a_Line);
>> >>         foreach($a_Line as $i_Index => $m_Value)
>> >>         {
>> >>             $a_Line[$i_Index] = intval($m_Value);
>> >>         }
>> >>         var_dump($a_Line);
>> >>     }
>> >>     else
>> >>     {
>> >>         break;
>> >>     }
>> >> }
>> >>
>> >> // Close the file.
>> >> fclose($fp_TabbedFile);
>> >>
>> >> // Delete the file.
>> >> unlink($s_TabbedFilename);
>> >>
>> >>
>> >> outputs ...
>> >>
>> >> array(2) {
>> >>   [0]=>
>> >>   string(1) "0"
>> >>   [1]=>
>> >>   string(8) "00000002"
>> >> }
>> >> array(2) {
>> >>   [0]=>
>> >>   int(0)
>> >>   [1]=>
>> >>   int(2)
>> >> }
>> >> array(2) {
>> >>   [0]=>
>> >>   string(1) "4"
>> >>   [1]=>
>> >>   string(8) "00000004"
>> >> }
>> >> array(2) {
>> >>   [0]=>
>> >>   int(4)
>> >>   [1]=>
>> >>   int(4)
>> >> }
>> >>
>> >> intval() operates as standard on base 10, so no need to worry about
>> >> leading zeros being thought of as base8/octal.
>> >>
>> >> What is your code? Can you reduce it to something as small as the
>> >> above to see if you can repeat the issue?
>>
>> Please don't top post.
>>
>>
>> With regards to utf-8 data, no, PHP is not unicode aware.
>>
>> If a multi-byte character is comprised of a 0x09 byte, then it will be
>> broken.
>>
>> Can you supply the file you are working on?
>>
>> b64encode it and drop it into a pastebin.
>>
>>
>> --
>> Richard Quadling
>> Twitter : EE : Zend
>> @RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY
>
>

I've not used it, but the mbstring extension has mb_split() (split a
multibyte string using a regular expression).

Whilst it probably isn't as performant as explode() or fgetcsv(), it
should work.

But I'm not a Unicode expert, and with the file in hand I could test this
mechanism easily enough.

I'd be interested in knowing what the code I posted outputs when it is run
against your data.

--
Richard Quadling
Twitter : EE : Zend
@RQuadling : e-e.com/M_248814.html : bit.ly/9O8vFY

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
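
The mb_split() suggestion above can be sketched roughly as follows, next to
plain explode() for comparison. This is only an illustration under stated
assumptions, not a diagnosis of the actual file: the helper name
readTabbedRows() is hypothetical, and the two suspected causes it handles are
guesses that the thread does not confirm. The first guess is that the file
starts with a UTF-8 byte order mark (bytes EF BB BF), which would make an
(int) cast of the first field give 0. The second is that fgets() leaves the
line ending on the last field, which would make a == comparison fail. As for
the tab itself: in valid UTF-8 every byte of a multi-byte character is 0x80
or above, so a 0x09 byte can only be a real tab and explode() should split
such lines correctly at the byte level.

```php
<?php
// Hypothetical helper (not from the thread): read tab-separated UTF-8 rows.
function readTabbedRows(string $filename, bool $useMbSplit = false): array
{
    $fp = fopen($filename, 'rb');
    if ($fp === false) {
        throw new RuntimeException("Could not open {$filename}");
    }

    $rows = [];
    $isFirstLine = true;
    while (($line = fgets($fp)) !== false) {
        // Assumption: the file may start with the UTF-8 BOM (bytes EF BB BF).
        if ($isFirstLine && strncmp($line, "\xEF\xBB\xBF", 3) === 0) {
            $line = (string) substr($line, 3);
        }
        $isFirstLine = false;

        // fgets() keeps the line ending, so remove it before splitting.
        $line = rtrim($line, "\r\n");

        if ($useMbSplit && function_exists('mb_split')) {
            // mb_split() needs mbstring built with regex support.
            mb_regex_encoding('UTF-8');
            $fields = mb_split("\t", $line);
        } else {
            $fields = explode("\t", $line);
        }
        $rows[] = $fields;
    }
    fclose($fp);

    return $rows;
}

// Demo data (made up): a BOM, leading zeros, and multi-byte text.
$tmp = tempnam(sys_get_temp_dir(), 'tab');
file_put_contents($tmp, "\xEF\xBB\xBF00000002\t文字化け\n00000004\tstringX\n");

$rows   = readTabbedRows($tmp);        // explode() based
$rowsMb = readTabbedRows($tmp, true);  // mb_split() based, if available
unlink($tmp);

var_dump((int) $rows[0][0]);           // cast of the first field, BOM removed
var_dump($rows[1][1] === 'stringX');   // comparison without the line ending
var_dump($rows === $rowsMb);           // do both splitters agree?
```

If the first field still casts to 0 after this, printing
bin2hex($rows[0][0]) would show which bytes are really in it.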