Re: R arrow package question

Angelo Casalan Wed, 08 Feb 2023 21:06:33 -0800

Hi Everyone,

Thanks for the responses. I hope you are all well.

Hi Dewey. Here is the error message for the problematic column:

Invalid: Could not open CSV input source 'folder/name.CSV': Invalid: In CSV
column #30: Row #5: CSV conversion error to int32: invalid value ''

I opened the CSV manually and saw that column 30 contains empty or blank
cells mixed in with integers. The same pattern appears in some other columns.

I tried setting these columns manually via schema() as utf8() or string()
(the equivalent of character in R), but I still get the same error message.

disk.frame read these columns, which mix integers with spaces/blanks, as
integers with no error messages at all. I think disk.frame read the
spaces/blanks as null values (NA) in RStudio.

I am scripting all of this in R Markdown, in case that is a factor.

Questions:
1. Is there a way in open_dataset() to automatically treat all blanks as null
values across the multiple CSVs that I am trying to load into R? Something
similar in logic to pandas.read_csv('test.csv',na_values=['nan']).

Manual re-encoding is not feasible because I am dealing with millions of data
points. I am also only a secondary user of this data, and my goal is to
automate this in R for my organization.

2. Are there other arrow functions/commands that can load multiple CSVs
from my local folder as an arrow object?
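
To make both questions concrete, here is a minimal sketch of what I have in
mind. It is untested against my real files and rests on assumptions: that
open_dataset() forwards a convert_options argument to the CSV reader, that
CsvConvertOptions$create(null_values = ...) and the na argument of
read_csv_arrow() can declare a single space as null, and that concat_tables()
can combine the per-file tables. The folder and file names are hypothetical
demo data, and nothing in this thread confirms that this resolves the error
in arrow 10.0.1.

```r
library(arrow)
library(dplyr)

# Hypothetical demo folder with two small CSVs; the second one has a blank
# (" ") cell in an otherwise integer column, mimicking the reported error.
csv_dir <- file.path(tempdir(), "arrow_blank_demo")
dir.create(csv_dir, showWarnings = FALSE)
writeLines(c("id,count", "1,10", "2,20"), file.path(csv_dir, "a.csv"))
writeLines(c("id,count", "3, ", "4,40"), file.path(csv_dir, "b.csv"))

# Assumption: treat "", " " and "NA" as null during type conversion.
blank_as_null <- CsvConvertOptions$create(null_values = c("", " ", "NA"))

# Question 1: a dataset over the whole folder, with the convert options
# passed through open_dataset() to the CSV reader.
ds <- open_dataset(csv_dir, format = "csv", convert_options = blank_as_null)
from_dataset <- ds %>% collect()

# Question 2: read each file into an Arrow Table, then concatenate them.
files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
tables <- lapply(files, read_csv_arrow,
                 na = c("", " ", "NA"), as_data_frame = FALSE)
from_tables <- do.call(concat_tables, tables)
```

If the assumptions hold, the blank cell should come back as NA in the count
column rather than raising a conversion error.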

Regards,

On Tue, Jan 31, 2023 at 8:50 AM Angelo Casalan <acasalan...@gmail.com>
wrote:

> Hi Jacob,
>
> Thanks. To provide some specifics on my query:
>
> 1. Which version of arrow are you running?
> - 10.0.1
>
> 2. The error message provides an exact col,row position, have you checked
> the value there?
> Yes. It is int64. This is after running open_dataset without specifying
> schema:
> '''
> arrow <- open_dataset(
>   sources = "location of csv files",
>   format = "csv"
> )
> '''
>
>  3. I have to correct the exact error message:
> CSV conversion error to int64:invalid value ' '
> I think arrow tells me the invalid value present is ' '
>
>  4. This reminds me of cases where scientific notation is used for
> integers
>  which causes an error but that usually shows the value e.g. "1e6".
> the invalid value is: ' '
>
> 5. I am really confused because, using disk.frame on the same CSVs, I
> did not encounter this problem with this column: it was read cleanly as a
> numeric variable.
>
> Regards,
>
>
>
> On Fri, Jan 27, 2023 at 9:43 AM Angelo Casalan <acasalan...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I hope you are well. I wish to ask how I can resolve this error:
>>
>> "CSV conversion error to int64: invalid value"
>>
>>
>> To give an idea of my dataset. I have 4 csvs all placed in a local folder.
>>
>>
>> The code below worked when importing:
>>
>>
>> arrow<-open_dataset(
>> sources="csv location",
>> format="csv")
>>
>>
>> However, when I run:
>>
>>
>> arrow %>% count(column) %>% collect()
>> nrow(arrow %>% collect)
>>
>> head(arrow %>% collect(),10 )
>>
>> I always get the same  error message: "Invalid: In CSV column #12: Row
>> #580. CSV conversion error to int64: invalid value"
>>
>> I tried going back to open_dataset(, schema()), with the column that is
>> giving me problems set as utf8 or sometimes str in the schema argument.
>>
>> schema(
>>   col = utf8(),
>>   # other nth columns
>> )
>>
>> But I still encounter the same problem.
>>
>> The code below does not work either.
>>
>> arrow2<-arrow_table(arrow)
>>
>> Thanks in advance if you can help me.
>>
>> --
>> Regards,
>>
>> Angelo Casalan
>> Statistical Methodology Unit
>>
>
>
> --
> Regards,
>
> Angelo Casalan
> Statistical Methodology Unit
>

-- 
Regards,

Angelo Casalan
