Re: [R] no data when reading directory of csvs

Gary Clark Sat, 05 Sep 2020 04:42:11 -0700

Hi Neal,

Thanks for the help. What I meant by "default install" was just running
"install.packages('arrow')" without any special arguments. It turns out that I
just had an old version because of the mirror my setup was using. After
specifying a more up-to-date mirror, everything worked fine.


I'm now trying out 1.0.1, and I'm still struggling a bit with the
open_dataset() function. Again, maybe it is something that I am doing, or
possibly a bug. Here is what I am trying to do:

1. I have a list of 2 csv files in a directory: *fp = './file_path/'*
2. None of these csvs have headers
3. I create a schema for these csvs:
   *s = schema(col1 = int64(), col2 = timestamp())*
4. I use the open_dataset function:
   *df = open_dataset(sources = fp, schema = s, format = 'csv')*
5. Now I want to view stuff about the data:
    a. *df %>% names()* | this works!!
    b. *df$num_cols* | this works!!
    c. *df$num_rows* | yields... *NA, number of rows unknown*
    d. *df %>% head()* | yields... *Error in x[seq_len(n)] : object of type
       'environment' is not subsettable*
    e. *df %>% collect()* | yields... *CSV file contained multiple columns
       named 2006-12-20 00:00:00.000*

From 'a' it would seem as if my data has been read in with column names
that I defined in the schema (as intended), but then 'e' mentions multiple
column names with a timestamp that would be part of the data implying that
there are NOT column names..?
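
For reference, the message in 'e' is what one would expect if the first row
of a header-less file were being treated as a header. Below is a minimal
sketch of that effect in base R only (it deliberately does not use arrow,
and it says nothing about how open_dataset() itself should be configured).
The temp directory, file names and sample rows are made up for illustration.

```
# Hypothetical stand-in data: two header-less CSV files in a temp directory.
tmp <- file.path(tempdir(), "csv_check")
dir.create(tmp, showWarnings = FALSE)
writeLines(c("1,2006-12-20 00:00:00.000", "2,2006-12-21 00:00:00.000"),
           file.path(tmp, "a.csv"))
writeLines(c("3,2006-12-22 00:00:00.000", "4,2006-12-23 00:00:00.000"),
           file.path(tmp, "b.csv"))

files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# header = TRUE consumes the first data row as column names,
# so a timestamp value ends up as a column name and one row is lost.
with_header <- read.csv(files[1], header = TRUE, check.names = FALSE)
print(names(with_header))
print(nrow(with_header))

# header = FALSE with explicit names keeps every row in every file.
no_header <- do.call(rbind, lapply(files, read.csv, header = FALSE,
                                   col.names = c("col1", "col2"),
                                   stringsAsFactors = FALSE))
print(no_header)
```

If the real files behave the same way under read.csv(), that would point at
header handling rather than at the schema.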

Any help is appreciated. If need be I can raise a JIRA for this.

On Thu, Sep 3, 2020 at 1:55 PM Neal Richardson <neal.p.richard...@gmail.com>
wrote:

> Hi Gary,
> I'm not sure what you mean by "default install" (though I am curious), but
> CSV dataset support was added in 1.0.0 (
> https://arrow.apache.org/docs/r/news/index.html#datasets), so it is
> expected that format = "csv" would not work on 0.16.
>
> As for why you're getting no data, you may want to upgrade to 1.0.1, which
> was released last week. There were a couple of path-related issues that
> were fixed, and it's possible that one of those is causing what you're
> experiencing. If upgrading doesn't solve it, please open a JIRA and provide
> some more specifics about the files, and we can try to get to the bottom of
> it.
>
> Neal
>
> On Thu, Sep 3, 2020 at 11:47 AM Gary Clark <gclark...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm not sure if this is a bug or not, so I wanted to ask first and see if
>> I'm doing something wrong. I am trying to use open_dataset to read a
>> directory full of csv files. The code is:
>>
>> ```
>> path = './some_dir/'
>> dfs = open_dataset(path, format = 'csv')
>> ```
>> sessionInfo():
>>
>> R version 3.6.3 (2020-02-29)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>> Running under: Ubuntu 18.04.4 LTS
>>
>>
>> and my version of arrow: arrow_1.0.0
>>
>>
>> The default install of arrow gave me arrow_0.16.0 so I had to manually 
>> upgrade to 1.0.0. Additionally, the default version DID NOT accept *csv* as 
>> an argument to format. After upgrading it seems to accept it now, however, 
>> when I 'collect()' the data and perform an operation like 'count()' it 
>> returns nothing, i.e., an empty tibble. Am I missing something?
>>
>>
>> Thanks for the help.
>>
>>
>> --
>> Gary Clark
>> *Data Scientist & Data Engineer*
>> *B.S. Mechanical Engineering, Howard University '13*
>> +1 (717) 798-6916
>> gclark...@gmail.com
>>
>

-- 
Gary Clark
*Data Scientist & Data Engineer*
*B.S. Mechanical Engineering, Howard University '13*
+1 (717) 798-6916
gclark...@gmail.com
