Re: R arrow package question

Nic Crane Tue, 31 Jan 2023 14:43:48 -0800

Hi Angelo,

The original code with just `open_dataset()` works because it creates a
dataset without actually pulling the data into your R session.  The
subsequent commands you tried (i.e. those involving `collect()`) read in the
files, and the error occurs when the data is read in.
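
To illustrate the difference, here is a minimal sketch using two small
made-up CSV files (the file names, column names and values are all
assumptions for illustration, not your data).  If the schema is inferred
from the first file, as I'd expect, opening the dataset should succeed and
any failure should only appear at `collect()`:

```r
library(arrow)
library(dplyr)

# Made-up data: the second file has a blank (' ') where an integer is expected.
csv_dir <- file.path(tempdir(), "arrow_lazy_demo")
dir.create(csv_dir, showWarnings = FALSE)
writeLines(c("id,x", "1,10", "2,20"), file.path(csv_dir, "a.csv"))
writeLines(c("id,x", "3, ", "4,40"), file.path(csv_dir, "b.csv"))

# Creates the Dataset object and infers a schema; the rows are not read yet.
ds <- open_dataset(sources = csv_dir, format = "csv")
print(ds$schema)

# The files are only read and converted here, so any conversion error
# surfaces at this point rather than at open_dataset().
result <- tryCatch(
  collect(ds),
  error = function(e) conditionMessage(e)
)
print(result)
```

Printing `ds$schema` before collecting is also a quick way to see which
type Arrow has chosen for each column.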

It looks like there's an invalid value in your dataset which is causing it
to fail to load.  From the error message you see there, it looks like it's
in the 12th column of your data, in row 580.  I think when Jacob asked "have
you checked the value there", he meant: have you manually checked the
contents of whichever CSV is causing the problem, at row 580 and column 12,
to see what value is actually there (rather than checking the data
type/value reported by Arrow)?
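
One way to do that manual check is sketched below in base R, reading every
column as text so that nothing gets converted along the way.  The function
names and the path in the usage comment are hypothetical, and I'm not
certain whether Arrow's row number counts the header line, so the sketch
shows a few rows either side of the reported one:

```r
# Show the raw text of one column around a given row.
# `csv_path`, `row` and `col` are supplied by you; nothing here is Arrow-specific.
inspect_cell <- function(csv_path, row, col, window = 2) {
  # colClasses = "character" keeps every field exactly as text.
  raw <- read.csv(csv_path, colClasses = "character")
  rows <- seq(max(1, row - window), min(nrow(raw), row + window))
  data.frame(row = rows, value = raw[rows, col], stringsAsFactors = FALSE)
}

# Positions of values that do not look like plain integers
# (blanks, spaces, "1e6", text and so on are all flagged).
find_non_integer <- function(values) {
  which(!grepl("^[+-]?[0-9]+$", values))
}

# Usage (hypothetical path):
# inspect_cell("path/to/one_file.csv", row = 580, col = 12)
# raw <- read.csv("path/to/one_file.csv", colClasses = "character")
# find_non_integer(raw[[12]])
```

Running `find_non_integer()` on column 12 of each of the 4 files should
also show which file contains the problem value.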

It's going to be tricky to help diagnose the issue without a reproducible
example. If I'm working with a larger dataset, I usually narrow down the
issue by dividing it into two smaller datasets and running the code on each
to see which one contains the problematic row, and then keep going until I
find the row which is failing to load.  If you can get to the point where
you can pinpoint the exact values which are causing problems, this will be
the quickest way we can help you.
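
The halving approach can be sketched as a small base R function.  This is
only an illustration under stated assumptions: `fails` is a hypothetical
function you would write yourself (for example, one that writes the given
rows to a temporary CSV and tries the read that currently errors), and the
values below are made up to stand in for a real column:

```r
# Return the earliest index in `idx` whose subset makes `fails()` return TRUE,
# or NA if the whole set loads fine. Assumes fails(subset) is TRUE whenever
# the subset contains a problematic row.
bisect_bad_row <- function(idx, fails) {
  if (!fails(idx)) return(NA_integer_)
  while (length(idx) > 1) {
    half <- seq_len(length(idx) %/% 2)
    first_half <- idx[half]
    # Keep whichever half still reproduces the failure.
    idx <- if (fails(first_half)) first_half else idx[-half]
  }
  idx
}

# Stand-in predicate for illustration only: "fails" if any value is not a
# plain integer. The values here are made up.
values <- c("1", "2", " ", "4", "5")
fails_demo <- function(idx) any(!grepl("^[+-]?[0-9]+$", values[idx]))

bad_row <- bisect_bad_row(seq_along(values), fails_demo)
print(values[bad_row])
```

With several files, the same idea applies one level up: first find which
file fails on its own, then bisect the rows within that file.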

Best wishes,

Nic

On Tue, 31 Jan 2023 at 00:52, Angelo Casalan <[email protected]> wrote:

> Hi Jacob,
>
> Thanks. To provide some specifics on my query:
>
> 1. Which version of arrow are you running?
> - 10.0.1
>
> 2. The error message provides an exact col,row position, have you checked
> the value there?
> Yes. It is int64. This is after running open_dataset without specifying
> schema:
> '''
> arrow<-open_dataset(
> sources="location of csv files",
> format="csv"
> )
> '''
>
>  3. I have to correct the exact error message:
> CSV conversion error to int64:invalid value ' '
> I think arrow tells me the invalid value present is ' '
>
>  4. This reminds me of cases where scientific notation is used for integers
>  which causes an error but that usually shows the value e.g. "1e6".
> the invalid value is: ' '
>
> 5. I am really confused because, using the disk.frame() function on the
> same CSVs, I did not encounter this problem with this column; it was
> cleanly encoded as a numeric variable.
>
> Regards,
>
>
>
> On Fri, Jan 27, 2023 at 9:43 AM Angelo Casalan <[email protected]>
> wrote:
>
> > Hi,
> >
> > I hope you are well. I wish to ask how I can resolve this error:
> >
> > "CSV conversion error to int64: invalid value"
> >
> >
> > To give an idea of my dataset: I have 4 CSVs, all placed in a local
> > folder.
> >
> >
> > The code below worked when importing:
> >
> >
> > arrow<-open_dataset(
> > sources="csv location",
> > format="csv")
> >
> >
> > However, when I run:
> >
> >
> > arrow %>% count(column) %>% collect()
> > nrow(arrow %>% collect)
> >
> > head(arrow %>% collect(),10 )
> >
> > I always get the same error message: "Invalid: In CSV column #12: Row
> > #580. CSV conversion error to int64: invalid value"
> >
> > I tried going back to open_dataset(, schema()), where the column that is
> > giving me problems is set as utf8, or sometimes str, in the schema
> > argument.
> >
> > schema(
> > col=utf8(),
> > other nth columns
> > )
> >
> > But I still encounter the same problem.
> >
> > The code below does not work either.
> >
> > arrow2<-arrow_table(arrow)
> >
> > Thanks in advance if you can help me.
> >
> > --
> > Regards,
> >
> > Angelo Casalan
> > Statistical Methodology Unit
> >
>
>
> --
> Regards,
>
> Angelo Casalan
> Statistical Methodology Unit
>
