Hi Neal,

Thanks for the help. What I meant by "default install" was just running "install.packages('arrow')" without any special arguments. It turns out that I just had an old version because of the mirror my setup was using. After specifying a more up-to-date mirror, everything worked fine.
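For anyone hitting the same stale-mirror problem, here is a minimal sketch of the fix in R. The mirror URL and both helper names (`is_older_than`, `upgrade_arrow_if_needed`) are assumptions for illustration; the thread does not say which mirror was actually used.

```r
# Assumption: the thread does not name the mirror; cloud.r-project.org is
# used here only as an example of an up-to-date CRAN mirror.
cran_mirror <- "https://cloud.r-project.org"

# Hypothetical helper: TRUE when version `installed` is older than `required`.
is_older_than <- function(installed, required) {
  utils::compareVersion(as.character(installed), required) < 0
}

# Hypothetical helper: install arrow from the given mirror only when it is
# missing or older than `required` (CSV dataset support arrived in 1.0.0).
upgrade_arrow_if_needed <- function(required = "1.0.0", repos = cran_mirror) {
  if (!requireNamespace("arrow", quietly = TRUE) ||
      is_older_than(utils::packageVersion("arrow"), required)) {
    install.packages("arrow", repos = repos)
  }
  invisible(utils::packageVersion("arrow"))
}
```

Calling `upgrade_arrow_if_needed()` in a fresh R session (network access required) and then checking `packageVersion("arrow")` should confirm which version ended up installed.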
I'm now trying out 1.0.1, and I'm still struggling a bit with the open_dataset() function. Again, maybe it is something that I am doing, or possibly a bug. Here is what I am trying to do:

1. I have 2 csv files in a directory: *fp = './file_path/'*
2. None of these csvs have headers
3. I create a schema for these csvs: *s = schema(col1 = int64(), col2 = timestamp())*
4. I use the open_dataset function: *df = open_dataset(sources = fp, schema = s, format = 'csv')*
5. Now I want to view stuff about the data:
   a. *df %>% names()* | this works!!
   b. *df$num_cols* | this works!!
   c. *df$num_rows* | yields... *NA, number of rows unknown*
   d. *df %>% head()* | yields... *Error in x[seq_len(n)] : object of type 'environment' is not subsettable*
   e. *df %>% collect()* | yields... *CSV file contained multiple columns named 2006-12-20 00:00:00.000*

From 'a' it would seem that my data has been read in with the column names that I defined in the schema (as intended), but then 'e' mentions multiple columns named with a timestamp that would be part of the data, implying that there are NOT column names..?

Any help is appreciated. If need be I can raise a JIRA for this.

On Thu, Sep 3, 2020 at 1:55 PM Neal Richardson <neal.p.richard...@gmail.com> wrote:

> Hi Gary,
> I'm not sure what you mean by "default install" (though I am curious), but
> CSV dataset support was added in 1.0.0 (
> https://arrow.apache.org/docs/r/news/index.html#datasets), so it is
> expected that format = "csv" would not work on 0.16.
>
> As for why you're getting no data, you may want to upgrade to 1.0.1, which
> was released last week. There were a couple of path-related issues that
> were fixed, and it's possible that one of those is causing what you're
> experiencing. If upgrading doesn't solve it, please open a JIRA and provide
> some more specifics about the files, and we can try to get to the bottom of
> it.
>
> Neal
>
> On Thu, Sep 3, 2020 at 11:47 AM Gary Clark <gclark...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm not sure if this is a bug or not, so I wanted to ask first and see if
>> I'm doing something wrong. I am trying to use open_dataset to read a
>> directory full of csv files. The code is:
>>
>> ```
>> path = './some_dir/'
>> dfs = open_dataset(path, format = 'csv')
>> ```
>> sessionInfo():
>>
>> R version 3.6.3 (2020-02-29)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>> Running under: Ubuntu 18.04.4 LTS
>>
>>
>> and my version of arrow: arrow_1.0.0
>>
>>
>> The default install of arrow gave me arrow_0.16.0 so I had to manually
>> upgrade to 1.0.0. Additionally, the default version DID NOT accept *csv* as
>> an argument to format. After upgrading it seems to accept it now, however,
>> when I 'collect()' the data and perform an operation like 'count()' it
>> returns nothing, i.e., an empty tibble. Am I missing something?
>>
>>
>> Thanks for the help.
>>
>>
>> --
>> Gary Clark
>> *Data Scientist & Data Engineer*
>> *B.S. Mechanical Engineering, Howard University '13*
>> +1 (717) 798-6916
>> gclark...@gmail.com
>>
>

--
Gary Clark
*Data Scientist & Data Engineer*
*B.S. Mechanical Engineering, Howard University '13*
+1 (717) 798-6916
gclark...@gmail.com