Sounds like a bug or two in there if some things work as expected and others don't. My guess is there is a mismatch between how a dataset's schema can be specified and how the CSV parse options are set (i.e. whether you've specified column names).
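[Editor's note: a minimal sketch of the mismatch Neal suspects, for a single file rather than a dataset. This assumes the CSV reader treats the first data row as a header unless told otherwise; the file path and column names are illustrative, not from the thread.]

```r
library(arrow)

# For one headerless CSV, read_csv_arrow() lets you pass readr-style
# col_names explicitly, so the first data row is NOT consumed as a header.
# "data_0.csv" and the column names here are illustrative.
df <- read_csv_arrow(
  "data_0.csv",
  col_names = c("col1", "col2"),  # names supplied by us, not by the file
  as_data_frame = TRUE
)
```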
A JIRA with a minimal reproducible example would be great, thank you.

Neal

Sent from my iPhone

> On Sep 5, 2020, at 4:41 AM, Gary Clark <gclark...@gmail.com> wrote:
>
> Hi Neal,
>
> Thanks for the help. What I meant by "default install" was just running
> "install.packages('arrow')" without any special arguments. It turns out that I
> just had an old version due to the mirror my setup was using. After
> specifying a more up-to-date mirror, all worked out just fine.
>
> I'm currently trying out 1.0.1 at the moment, and I'm still struggling a bit
> with the open_dataset() function. Again, maybe it is something that I am
> doing, or possibly a bug. Here is what I am trying to do:
>
> 1. I have a list of 2 csv files in a directory: fp = './file_path/'
> 2. None of these csvs have headers
> 3. I create a schema for these csvs: s = schema(col1 = int64(), col2 = timestamp())
> 4. I use the open_dataset() function: df = open_dataset(sources = fp, schema = s, format = 'csv')
> 5. Now I want to view stuff about the data:
>    a. df %>% names()   | this works!!
>    b. df$num_cols      | this works!!
>    c. df$num_rows      | yields... NA, number of rows unknown
>    d. df %>% head()    | yields... Error in x[seq_len(n)] : object of type 'environment' is not subsettable
>    e. df %>% collect() | yields... CSV file contained multiple columns named 2006-12-20 00:00:00.000
>
> From 'a' it would seem as if my data has been read in with the column names that
> I defined in the schema (as intended), but then 'e' mentions multiple column
> names containing a timestamp that should be part of the data, implying that there
> are NOT column names..?
>
> Any help is appreciated. If need be I can raise a JIRA for this.
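[Editor's note: Gary's steps 1–4 above could be captured in a self-contained reproduction script along these lines. This is a sketch: the file contents and paths are illustrative, and timestamp(unit = "s") is assumed as the intended timestamp type, since step 3 does not specify a unit.]

```r
library(arrow)
library(dplyr)

# Write two headerless CSV files into a temporary directory (illustrative data)
fp <- file.path(tempdir(), "file_path")
dir.create(fp, showWarnings = FALSE)
writeLines(c("1,2006-12-20 00:00:00", "2,2006-12-21 00:00:00"),
           file.path(fp, "data_0.csv"))
writeLines(c("3,2006-12-22 00:00:00", "4,2006-12-23 00:00:00"),
           file.path(fp, "data_1.csv"))

# The schema supplies the column names the files themselves do not declare
s <- schema(col1 = int64(), col2 = timestamp(unit = "s"))

# Open the directory as a dataset; this is where the reported behavior appears
df <- open_dataset(sources = fp, schema = s, format = "csv")

df %>% names()     # step 5a: reportedly works
df %>% collect()   # step 5e: reportedly errors on duplicate column names
```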
>
>> On Thu, Sep 3, 2020 at 1:55 PM Neal Richardson <neal.p.richard...@gmail.com> wrote:
>> Hi Gary,
>>
>> I'm not sure what you mean by "default install" (though I am curious), but
>> CSV dataset support was added in 1.0.0
>> (https://arrow.apache.org/docs/r/news/index.html#datasets), so it is
>> expected that format = "csv" would not work on 0.16.
>>
>> As for why you're getting no data, you may want to upgrade to 1.0.1, which
>> was released last week. There were a couple of path-related issues that were
>> fixed, and it's possible that one of those is causing what you're
>> experiencing. If upgrading doesn't solve it, please open a JIRA and provide
>> some more specifics about the files, and we can try to get to the bottom of it.
>>
>> Neal
>>
>>> On Thu, Sep 3, 2020 at 11:47 AM Gary Clark <gclark...@gmail.com> wrote:
>>> Hi,
>>>
>>> I'm not sure if this is a bug or not, so I wanted to ask first and see if
>>> I'm doing something wrong. I am trying to use open_dataset() to read a
>>> directory full of csv files. The code is:
>>>
>>> ```
>>> path = './some_dir/'
>>> dfs = open_dataset(path, format = 'csv')
>>> ```
>>>
>>> sessionInfo():
>>> R version 3.6.3 (2020-02-29)
>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>> Running under: Ubuntu 18.04.4 LTS
>>>
>>> and my version of arrow: arrow_1.0.0
>>>
>>> The default install of arrow gave me arrow_0.16.0, so I had to manually
>>> upgrade to 1.0.0. Additionally, the default version DID NOT accept *csv* as
>>> an argument to format. After upgrading it seems to accept it now; however,
>>> when I collect() the data and perform an operation like count(), it
>>> returns nothing, i.e. an empty tibble. Am I missing something?
>>>
>>> Thanks for the help.
>>>
>>> --
>>> Gary Clark
>>> Data Scientist & Data Engineer
>>> B.S. Mechanical Engineering, Howard University '13
>>> +1 (717) 798-6916
>>> gclark...@gmail.com
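[Editor's note: since the original empty-tibble problem turned out to be a stale CRAN mirror serving arrow 0.16.0, here is a short sketch of checking the installed version and reinstalling from a current mirror. The repos URL shown is one common choice, not the only one.]

```r
# Check which arrow version is currently installed
packageVersion("arrow")

# Reinstall from an up-to-date CRAN mirror if the version predates 1.0.0
# (CSV support in open_dataset() requires 1.0.0 or later)
install.packages("arrow", repos = "https://cloud.r-project.org")
```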