Sounds like there may be a bug or two in there if some things work as expected 
and others don't. My guess is that there's a mismatch between how a dataset's 
schema is specified and how the CSV parse options are set (i.e., whether 
you've specified column names). 

A JIRA with a minimal reproducible example would be great, thank you. 

Neal


> On Sep 5, 2020, at 4:41 AM, Gary Clark <gclark...@gmail.com> wrote:
> 
> 
> Hi Neal,
> 
> Thanks for the help. What I meant by "default install" was just running 
> "install.packages('arrow')" without any special arguments. It turns out that I 
> just had an old version due to the mirror my setup was using. After 
> specifying a more up-to-date mirror, all worked out just fine.
> 
> I'm trying out 1.0.1 at the moment, and I'm still struggling a bit with the 
> open_dataset() function. Again, maybe it is something that I am doing, or 
> possibly a bug. Here is what I am trying to do:
> 
> 1. I have a list of 2 csv files in a directory: fp = './file_path/'
> 2. None of these csvs have headers
> 3. I create a schema for these csvs: s = schema(col1 = int64(), col2 = 
> timestamp())
> 4. I use the open_dataset function: df = open_dataset(sources = fp, schema = 
> s, format = 'csv')
> 5. Now I want to view stuff about the data:
>     a. df %>% names() | this works!!
>     b. df$num_cols | this works!!
>     c. df$num_rows | yields... NA, number of rows unknown
>     d. df %>% head() | yields... Error in x[seq_len(n)] : object of type 
> 'environment' is not subsettable
>     e. df %>% collect() | yields... CSV file contained multiple columns named 
> 2006-12-20 00:00:00.000
> 
> From 'a' it would seem that my data has been read in with the column names I 
> defined in the schema (as intended), but then 'e' complains about multiple 
> columns named with a timestamp value that should be part of the data, which 
> implies that the column names were NOT applied? 
> 
> Any help is appreciated. If need be I can raise a JIRA for this.
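> In case it helps, here is roughly what I would put in the JIRA as a minimal 
> reproducible example (a sketch with made-up data; dplyr is loaded for %>% 
> and collect()):
> 
> ```
> library(arrow)
> library(dplyr)
> 
> # two small headerless CSVs in a temp directory, mirroring my setup
> tmp <- tempfile(); dir.create(tmp)
> writeLines(c('1,2006-12-20 00:00:00.000',
>              '2,2006-12-21 00:00:00.000'), file.path(tmp, 'a.csv'))
> writeLines(c('3,2006-12-22 00:00:00.000',
>              '4,2006-12-23 00:00:00.000'), file.path(tmp, 'b.csv'))
> 
> s <- schema(col1 = int64(), col2 = timestamp())
> df <- open_dataset(sources = tmp, schema = s, format = 'csv')
> df %>% collect()  # should hit the same errors as in steps d/e above
> ```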
> 
>> On Thu, Sep 3, 2020 at 1:55 PM Neal Richardson <neal.p.richard...@gmail.com> 
>> wrote:
>> Hi Gary,
>> I'm not sure what you mean by "default install" (though I am curious), but 
>> CSV dataset support was added in 1.0.0 
>> (https://arrow.apache.org/docs/r/news/index.html#datasets), so it is 
>> expected that format = "csv" would not work on 0.16.
>> 
>> As for why you're getting no data, you may want to upgrade to 1.0.1, which 
>> was released last week. There were a couple of path-related issues that were 
>> fixed, and it's possible that one of those is causing what you're 
>> experiencing. If upgrading doesn't solve it, please open a JIRA and provide 
>> some more specifics about the files, and we can try to get to the bottom of 
>> it.
>> 
>> Neal
>> 
>>> On Thu, Sep 3, 2020 at 11:47 AM Gary Clark <gclark...@gmail.com> wrote:
>>> Hi,
>>> 
>>> I'm not sure if this is a bug or not, so I wanted to ask first and see if 
>>> I'm doing something wrong. I am trying to use open_dataset to read a 
>>> directory full of csv files. The code is:
>>> 
>>> ```
>>> path = './some_dir/'
>>> dfs = open_dataset(path, format = 'csv')
>>> ```
>>> sessionInfo():
>>> R version 3.6.3 (2020-02-29)
>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>> Running under: Ubuntu 18.04.4 LTS
>>> 
>>> and my version of arrow: arrow_1.0.0
>>> 
>>> The default install of arrow gave me arrow_0.16.0, so I had to manually 
>>> upgrade to 1.0.0. Additionally, the default version DID NOT accept *csv* as 
>>> an argument to format. After upgrading it seems to accept it now; however, 
>>> when I collect() the data and perform an operation like count(), it 
>>> returns nothing, i.e. an empty tibble. Am I missing something?
>>> 
>>> Thanks for the help.
>>> 
>>> -- 
>>> Gary Clark
>>> Data Scientist & Data Engineer
>>> B.S. Mechanical Engineering, Howard University '13
>>> +1 (717) 798-6916
>>> gclark...@gmail.com
> 
> 
