[jira] [Commented] (ARROW-15474) [Python] Possibility of a table.drop_duplicates() function?

2022-11-09 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631279#comment-17631279
 ] 

Lance Dacey commented on ARROW-15474:
-

Nice, I was able to test it out and seemed to get the correct results. I have 
been using polars and duckdb to handle de-duplication for a while now so I used 
that as a comparison.

{code:java}
%%time

table = con.execute("select distinct on (forecast_group) * from scanner order 
by session_id, date").arrow()

CPU times: user 735 ms, sys: 45.7 ms, total: 780 ms
Wall time: 1.92 s
{code}


Your suggestion:

{code:java}
%%time 

table = scanner.to_table()

t1 = table.append_column('i', pa.array(np.arange(len(table))))
t2 = t1.group_by(['forecast_group']).aggregate([('i', 'min')]).column('i_min')
table = pc.take(table, t2)

CPU times: user 872 ms, sys: 60.9 ms, total: 933 ms
Wall time: 4.6 s
{code}


A bit slower than duckdb somehow, but for me it is acceptable and gives me an 
option to drop duplicates without requiring additional libraries, including 
pandas. Thanks!
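
In case it helps anyone else, here is a minimal sketch of wrapping that approach into a reusable helper. The function name, the "__idx" column, and the sort_by handling are my own additions rather than anything in pyarrow:

{code:python}
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc


def drop_duplicates(table, keys, sort_by=None):
    # Optionally sort first so the "first row per group" is the row to keep,
    # e.g. sort_by=[("date", "descending")] keeps the most recent row.
    if sort_by:
        table = table.sort_by(sort_by)
    # Same trick as above: tag each row with its index, keep the minimum
    # index per group, and take those rows back out of the table.
    indexed = table.append_column("__idx", pa.array(np.arange(len(table))))
    keep = indexed.group_by(keys).aggregate([("__idx", "min")]).column("__idx_min")
    return pc.take(indexed, keep).drop(["__idx"])
{code}

Calling it as drop_duplicates(table, ["forecast_group"], sort_by=[("session_id", "ascending"), ("date", "ascending")]) should roughly mirror the DISTINCT ON query above.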


> [Python] Possibility of a table.drop_duplicates() function?
> ---
>
> Key: ARROW-15474
> URL: https://issues.apache.org/jira/browse/ARROW-15474
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 6.0.1
>Reporter: Lance Dacey
>Priority: Major
>
> I noticed that there is a group_by() and sort_by() function in the 7.0.0 
> branch. Is it possible to include a drop_duplicates() function as well? 
> ||id||updated_at||
> |1|2022-01-01 04:23:57|
> |2|2022-01-01 07:19:21|
> |2|2022-01-10 22:14:01|
> Something like this which would return a table without the second row in the 
> example above would be great. 
> I usually am reading an append-only dataset and then I need to report on 
> latest version of each row. To drop duplicates, I am temporarily converting 
> the append-only table to a pandas DataFrame, and then I convert it back to a 
> table and save a separate "latest-version" dataset.
> {code:python}
> table.sort_by(sorting=[("id", "ascending"), ("updated_at", 
> "ascending")]).drop_duplicates(subset=["id"] keep="last")
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-15716) [Dataset][Python] Parse a list of fragment paths to gather filters

2022-11-09 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631108#comment-17631108
 ] 

Lance Dacey commented on ARROW-15716:
-

Yes, the ultimate goal is to create a single expression which would filter all 
unique partitions that had data written into them.

I added unique partitions there because it is possible for multiple file 
fragments to be written to the same partition (max_rows during write) - I never 
tested what happens if you run an expression that has duplicates though. Any 
idea if that would matter? For example, the filter expression for both of these 
fragments would be the same:

'path/to/data/section=a/part-0.parquet',
'path/to/data/section=a/part-1.parquet',

The example [~westonpace] provided would work great.
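
For what it is worth, a trivial way to collapse that duplicate-partition case before parsing anything is to reduce the fragment paths to their unique parent directories first. A small sketch using the section=a example above:

{code:python}
import os

fragment_paths = [
    'path/to/data/section=a/part-0.parquet',
    'path/to/data/section=a/part-1.parquet',
]

# Both fragments live in the same partition directory, so the set collapses
# them into a single path before any filter expression is built.
unique_partition_dirs = sorted({os.path.dirname(path) for path in fragment_paths})
# ['path/to/data/section=a']
{code}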




> [Dataset][Python] Parse a list of fragment paths to gather filters
> --
>
> Key: ARROW-15716
> URL: https://issues.apache.org/jira/browse/ARROW-15716
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 7.0.0
>Reporter: Lance Dacey
>Assignee: Vibhatha Lakmal Abeykoon
>Priority: Minor
>
> Is it possible for partitioning.parse() to be updated to parse a list of 
> paths instead of just a single path? 
> I am passing the .paths from file_visitor to downstream tasks to process data 
> which was recently saved, but I can run into problems with this if I 
> overwrite data with delete_matching in order to consolidate small files since 
> the paths won't exist. 
> Here is the output of my current approach to use filters instead of reading 
> the paths directly:
> {code:python}
> # Fragments saved during write_dataset 
> ['dev/dataset/fragments/date_id=20210813/data-0.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-2.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-1.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-0.parquet']
> # Run partitioning.parse() on each fragment 
> [<pyarrow.compute.Expression (date_id == 20210813)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>]
> # Format those expressions into a list of tuples
> [('date_id', 'in', [20210114, 20210813])]
> # Convert to an expression which is used as a filter in .to_table()
> is_in(date_id, {value_set=int64:[
>   20210114,
>   20210813
> ], skip_nulls=false})
> {code}
> My hope would be to do something like filt_exp = partitioning.parse(paths) 
> which would return a dataset expression.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-15716) [Dataset][Python] Parse a list of fragment paths to gather filters

2022-11-07 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17630175#comment-17630175
 ] 

Lance Dacey commented on ARROW-15716:
-

Yes, if I could easily retrieve a list of the unique partitions which were 
written to, that would be helpful. If I could then parse that list of partitions 
into a dataset expression (used for to_table(filter=expression)), that would be 
even better.

Right now I can get a list of the fragments, parse them into expressions, and 
from there I can determine the partitions using ds._get_partition_keys()

Full example below. I am essentially just looking for a potential shortcut, 
convenience method, or better approach.

Say these are the fragments which were written during dataset write:
{code:python}
['path/to/data/month_id=202105/v1-manual__2022-11-06T22:50:20.parquet',
 'path/to/data/month_id=202106/v1-manual__2022-11-06T22:50:20.parquet',
 'path/to/data/month_id=202107/v1-manual__2022-11-06T22:50:20.parquet']
{code}

My ultimate goal is for a downstream task to filter the dataset for those three 
partitions (not just the fragments since other files might exist).

{code:python}
partitioning = dataset.partitioning

# parse each fragment path to get a list of expressions
expressions = [partitioning.parse(file) for file in paths]

# get the partitions
filters = [ds._get_partition_keys(expression) for expression in expressions]
# [{'month_id': 202105}, {'month_id': 202106}, {'month_id': 202107}]

# convert to an expression
from pyarrow.parquet import filters_to_expression

filters_to_expression(filters)
{code}
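
The step that block glosses over is converting the partition-key dicts into the tuple format that filters_to_expression expects. A condensed sketch of that conversion, assuming a single partition column and reusing the names from the block above:

{code:python}
# filters == [{'month_id': 202105}, {'month_id': 202106}, {'month_id': 202107}]
values = sorted({keys["month_id"] for keys in filters})
filter_expression = filters_to_expression([("month_id", "in", values)])

table = dataset.to_table(filter=filter_expression)
{code}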




> [Dataset][Python] Parse a list of fragment paths to gather filters
> --
>
> Key: ARROW-15716
> URL: https://issues.apache.org/jira/browse/ARROW-15716
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 7.0.0
>Reporter: Lance Dacey
>Priority: Minor
>
> Is it possible for partitioning.parse() to be updated to parse a list of 
> paths instead of just a single path? 
> I am passing the .paths from file_visitor to downstream tasks to process data 
> which was recently saved, but I can run into problems with this if I 
> overwrite data with delete_matching in order to consolidate small files since 
> the paths won't exist. 
> Here is the output of my current approach to use filters instead of reading 
> the paths directly:
> {code:python}
> # Fragments saved during write_dataset 
> ['dev/dataset/fragments/date_id=20210813/data-0.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-2.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-1.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-0.parquet']
> # Run partitioning.parse() on each fragment 
> [<pyarrow.compute.Expression (date_id == 20210813)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>]
> # Format those expressions into a list of tuples
> [('date_id', 'in', [20210114, 20210813])]
> # Convert to an expression which is used as a filter in .to_table()
> is_in(date_id, {value_set=int64:[
>   20210114,
>   20210813
> ], skip_nulls=false})
> {code}
> My hope would be to do something like filt_exp = partitioning.parse(paths) 
> which would return a dataset expression.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-15716) [Dataset][Python] Parse a list of fragment paths to gather filters

2022-11-07 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17630046#comment-17630046
 ] 

Lance Dacey commented on ARROW-15716:
-

I wanted to check if this is something which might be possible eventually. It 
would reduce a lot of ugly custom code that I use to achieve the result that I 
am looking for.

Write dataset, collect the fragment paths:

{code:python}
collector = []
ds.write_dataset(
  table, 
  base_dir="dev/staging", 
  partitioning=["date"], 
  partitioning_flavor="hive", 
  file_visitor=lambda x: collector.append(x)
)
{code}

Next, my hope would be to parse those paths into a consolidated filter expression 
which I could use to query the original dataset. This ensures that I read in 
the entire partition, since it is possible that other files already existed 
before the write step above.

{code:python}
paths = [file.path for file in collector]
partitioning = ds.partitioning(flavor="hive")

# parse a list of paths, ideally using the "hive" shortcut
filter_expression = partitioning.parse(paths)

dataset = ds.dataset(source="dev/staging", partitioning=partitioning)
new_table = dataset.to_table(filter=filter_expression)
ds.write_dataset(
    new_table,
    base_dir="dev/final",
    existing_data_behavior="delete_matching",
)
{code}
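
Until something like that exists, a rough sketch of the helper I have in mind is below. The function name is mine, and it simply ORs the parsed expressions together, so duplicate partitions are harmless because the same clause ORed twice evaluates identically:

{code:python}
import functools
import operator


def parse_paths_to_filter(paths, partitioning):
    # partitioning must be a concrete Partitioning (e.g. dataset.partitioning
    # or ds.partitioning(schema, flavor="hive")), not a bare factory.
    expressions = [partitioning.parse(path) for path in paths]
    return functools.reduce(operator.or_, expressions)


# filter_expression = parse_paths_to_filter(paths, dataset.partitioning)
# new_table = dataset.to_table(filter=filter_expression)
{code}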


> [Dataset][Python] Parse a list of fragment paths to gather filters
> --
>
> Key: ARROW-15716
> URL: https://issues.apache.org/jira/browse/ARROW-15716
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 7.0.0
>Reporter: Lance Dacey
>Priority: Minor
>
> Is it possible for partitioning.parse() to be updated to parse a list of 
> paths instead of just a single path? 
> I am passing the .paths from file_visitor to downstream tasks to process data 
> which was recently saved, but I can run into problems with this if I 
> overwrite data with delete_matching in order to consolidate small files since 
> the paths won't exist. 
> Here is the output of my current approach to use filters instead of reading 
> the paths directly:
> {code:python}
> # Fragments saved during write_dataset 
> ['dev/dataset/fragments/date_id=20210813/data-0.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-2.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-1.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-0.parquet']
> # Run partitioning.parse() on each fragment 
> [<pyarrow.compute.Expression (date_id == 20210813)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>]
> # Format those expressions into a list of tuples
> [('date_id', 'in', [20210114, 20210813])]
> # Convert to an expression which is used as a filter in .to_table()
> is_in(date_id, {value_set=int64:[
>   20210114,
>   20210813
> ], skip_nulls=false})
> {code}
> My hope would be to do something like filt_exp = partitioning.parse(paths) 
> which would return a dataset expression.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-15716) [Dataset][Python] Parse a list of fragment paths to gather filters

2022-11-07 Thread Lance Dacey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Dacey updated ARROW-15716:

Description: 
Is it possible for partitioning.parse() to be updated to parse a list of paths 
instead of just a single path? 

I am passing the .paths from file_visitor to downstream tasks to process data 
which was recently saved, but I can run into problems with this if I overwrite 
data with delete_matching in order to consolidate small files since the paths 
won't exist. 

Here is the output of my current approach to use filters instead of reading the 
paths directly:

{code:python}
# Fragments saved during write_dataset 
['dev/dataset/fragments/date_id=20210813/data-0.parquet', 
'dev/dataset/fragments/date_id=20210114/data-2.parquet', 
'dev/dataset/fragments/date_id=20210114/data-1.parquet', 
'dev/dataset/fragments/date_id=20210114/data-0.parquet']

# Run partitioning.parse() on each fragment 
[<pyarrow.compute.Expression (date_id == 20210813)>, 
<pyarrow.compute.Expression (date_id == 20210114)>, 
<pyarrow.compute.Expression (date_id == 20210114)>, 
<pyarrow.compute.Expression (date_id == 20210114)>]

# Format those expressions into a list of tuples
[('date_id', 'in', [20210114, 20210813])]

# Convert to an expression which is used as a filter in .to_table()
is_in(date_id, {value_set=int64:[
  20210114,
  20210813
], skip_nulls=false})
{code}


My hope would be to do something like filt_exp = partitioning.parse(paths) 
which would return a dataset expression.


  was:
Is it possible for partitioning.parse() to be updated to parse a list of paths 
instead of just a single path? 

I am passing the .paths from file_visitor to downstream tasks to process data 
which was recently saved, but I can run into problems with this if I overwrite 
data with delete_matching in order to consolidate small files since the paths 
won't exist. 

Here is the output of my current approach to use filters instead of reading the 
paths directly:

{code:java}
# Fragments saved during write_dataset 
['dev/dataset/fragments/date_id=20210813/data-0.parquet', 
'dev/dataset/fragments/date_id=20210114/data-2.parquet', 
'dev/dataset/fragments/date_id=20210114/data-1.parquet', 
'dev/dataset/fragments/date_id=20210114/data-0.parquet']

# Run partitioning.parse() on each fragment 
[<pyarrow.compute.Expression (date_id == 20210813)>, 
<pyarrow.compute.Expression (date_id == 20210114)>, 
<pyarrow.compute.Expression (date_id == 20210114)>, 
<pyarrow.compute.Expression (date_id == 20210114)>]

# Format those expressions into a list of tuples
[('date_id', 'in', [20210114, 20210813])]

# Convert to an expression which is used as a filter in .to_table()
is_in(date_id, {value_set=int64:[
  20210114,
  20210813
], skip_nulls=false})
{code}

And here is how I am creating the filter from a list of .paths (perhaps there 
is a better way?):

{code:python}
partitioning = ds.HivePartitioning(partition_schema)
expressions = []
for file in paths:
    expressions.append(partitioning.parse(file))

values = []
filters = []
for expression in expressions:
    partitions = ds._get_partition_keys(expression)
    if len(partitions.keys()) > 1:
        element = [(k, "==", v) for k, v in partitions.items()]
        if element not in filters:
            filters.append(element)
    else:
        for k, v in partitions.items():
            if v not in values:
                values.append(v)
        filters = [(k, "in", sorted(values))]

filt_exp = pa.parquet._filters_to_expression(filters)
dataset.to_table(filter=filt_exp)
{code}


My hope would be to do something like filt_exp = partitioning.parse(paths) 
which would return a dataset expression.



> [Dataset][Python] Parse a list of fragment paths to gather filters
> --
>
> Key: ARROW-15716
> URL: https://issues.apache.org/jira/browse/ARROW-15716
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 7.0.0
>Reporter: Lance Dacey
>Priority: Minor
>
> Is it possible for partitioning.parse() to be updated to parse a list of 
> paths instead of just a single path? 
> I am passing the .paths from file_visitor to downstream tasks to process data 
> which was recently saved, but I can run into problems with this if I 
> overwrite data with delete_matching in order to consolidate small files since 
> the paths won't exist. 
> Here is the output of my current approach to use filters instead of reading 
> the paths directly:
> {code:python}
> # Fragments saved during write_dataset 
> ['dev/dataset/fragments/date_id=20210813/data-0.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-2.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-1.parquet', 
> 'dev/dataset/fragments/date_id=20210114/data-0.parquet']
> # Run partitioning.parse() on each fragment 
> [<pyarrow.compute.Expression (date_id == 20210813)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>, 
> <pyarrow.compute.Expression (date_id == 20210114)>]
> # Format those expressions into a list of tuples
> [('date_id', 'in', [20210114, 20210813])]
> # Convert to an expression which is used as a filter in .to_table()
> is_in(date_id, {value_set=int64:[
>   20210114,
>   20210813
>

[jira] [Commented] (ARROW-15474) [Python] Possibility of a table.drop_duplicates() function?

2022-10-24 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623362#comment-17623362
 ] 

Lance Dacey commented on ARROW-15474:
-

Nice - I will give that a shot, thanks. I have been using a library called 
`polars` to drop duplicates from a pyarrow table lately, but it would be nice 
to have a native-pyarrow way to do it.

Can we sort the data before adding the `cumulative_sum`? My concern is that the 
order of the raw data might be messed up and we might select the wrong row to 
keep.

> [Python] Possibility of a table.drop_duplicates() function?
> ---
>
> Key: ARROW-15474
> URL: https://issues.apache.org/jira/browse/ARROW-15474
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 6.0.1
>Reporter: Lance Dacey
>Priority: Major
>
> I noticed that there is a group_by() and sort_by() function in the 7.0.0 
> branch. Is it possible to include a drop_duplicates() function as well? 
> ||id||updated_at||
> |1|2022-01-01 04:23:57|
> |2|2022-01-01 07:19:21|
> |2|2022-01-10 22:14:01|
> Something like this which would return a table without the second row in the 
> example above would be great. 
> I usually am reading an append-only dataset and then I need to report on 
> latest version of each row. To drop duplicates, I am temporarily converting 
> the append-only table to a pandas DataFrame, and then I convert it back to a 
> table and save a separate "latest-version" dataset.
> {code:python}
> table.sort_by(sorting=[("id", "ascending"), ("updated_at", 
> "ascending")]).drop_duplicates(subset=["id"] keep="last")
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset

2022-04-22 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526487#comment-17526487
 ] 

Lance Dacey commented on ARROW-12358:
-

Nice, thanks. I can try to test with a nightly build this weekend.

> [C++][Python][R][Dataset] Control overwriting vs appending when writing to 
> existing dataset
> ---
>
> Key: ARROW-12358
> URL: https://issues.apache.org/jira/browse/ARROW-12358
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Weston Pace
>Priority: Major
>  Labels: dataset
> Fix For: 9.0.0
>
>
> Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}}) 
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when 
> you are writing to an existing dataset, you de facto overwrite previous data 
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by 
> ensuring the file names are unique (the user can specify the 
> {{basename_template}} to be something unique). There is also ARROW-7706 about 
> silently doubling data (so _not_ overwriting existing data) with the legacy 
> {{parquet.write_to_dataset}} implementation. 
> It could be good to have a "mode" when writing datasets that controls the 
> different possible behaviours. And erroring when there is pre-existing data 
> in the target directory is maybe the safest default, because both appending 
> vs overwriting silently can be surprising behaviour depending on your 
> expectations.
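
For later readers: the behaviours discussed in this thread map onto the existing_data_behavior argument of ds.write_dataset. A minimal sketch of the three values; the table and paths here are illustrative:

{code:python}
import pyarrow.dataset as ds

# existing_data_behavior options:
#   "error"               - raise if the target directory already contains data (default)
#   "overwrite_or_ignore" - write new files, only replacing files with the same name
#   "delete_matching"     - delete the contents of each partition being written to first
ds.write_dataset(
    table,
    base_dir="dev/final",
    partitioning=["date_id"],
    partitioning_flavor="hive",
    existing_data_behavior="delete_matching",
)
{code}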



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (ARROW-15474) [Python] Possibility of a table.drop_duplicates() function?

2022-04-20 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524991#comment-17524991
 ] 

Lance Dacey commented on ARROW-15474:
-

I'll keep this open since this is a major wish list item for me. If anyone has 
some sample functions they have implemented outside of core pyarrow to achieve 
this then I would be interested in seeing that as well.

> [Python] Possibility of a table.drop_duplicates() function?
> ---
>
> Key: ARROW-15474
> URL: https://issues.apache.org/jira/browse/ARROW-15474
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 6.0.1
>Reporter: Lance Dacey
>Priority: Major
> Fix For: 9.0.0
>
>
> I noticed that there is a group_by() and sort_by() function in the 7.0.0 
> branch. Is it possible to include a drop_duplicates() function as well? 
> ||id||updated_at||
> |1|2022-01-01 04:23:57|
> |2|2022-01-01 07:19:21|
> |2|2022-01-10 22:14:01|
> Something like this which would return a table without the second row in the 
> example above would be great. 
> I usually am reading an append-only dataset and then I need to report on 
> latest version of each row. To drop duplicates, I am temporarily converting 
> the append-only table to a pandas DataFrame, and then I convert it back to a 
> table and save a separate "latest-version" dataset.
> {code:python}
> table.sort_by(sorting=[("id", "ascending"), ("updated_at", 
> "ascending")]).drop_duplicates(subset=["id"] keep="last")
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (ARROW-16077) [Python] ArrowInvalid error on reading partitioned parquet files with fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path

2022-04-05 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517478#comment-17517478
 ] 

Lance Dacey edited comment on ARROW-16077 at 4/5/22 2:26 PM:
-

I am not sure about any public datasets. Locally, I use 
[azurite|https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azurite?tabs=visual-studio]
 for testing which can be installed or run as a Docker container. Note that I 
only use Azure Blob and not Azure Data Lake, so there might be some differences 
I am not aware of.

I use pyarrow ds.dataset() or pq.read_table() with a filesystem to read parquet 
data from Azure. I did a couple of tests with double slashes in the path. 
Perhaps I misunderstood what the original issue was, but it looks like I can 
read the data with pq.read_table and with pandas using fs.open() and 
storage_options. I pasted my quick tests below.



{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pytest
from adlfs import AzureBlobFileSystem
from pandas.testing import assert_frame_equal


URL = "http://127.0.0.1:1"
ACCOUNT_NAME = "devstoreaccount1"
KEY = "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw=="
CONN_STR = f"DefaultEndpointsProtocol=http;AccountName={ACCOUNT_NAME};AccountKey={KEY};BlobEndpoint={URL}/{ACCOUNT_NAME};"


@pytest.fixture
def example_data():
    return {
        "date_id": [20210114, 20210811],
        "id": [1, 2],
        "created_at": [
            "2021-01-14 16:45:18",
            "2021-08-11 15:10:00",
        ],
        "updated_at": [
            "2021-01-14 16:45:18",
            "2021-08-11 15:10:00",
        ],
        "category": ["cow", "sheep"],
        "value": [0, 99],
    }


def test_double_slashes(example_data):
    fs = AzureBlobFileSystem(account_name=ACCOUNT_NAME, connection_string=CONN_STR)
    fs.mkdir("resource")
    path = "resource/path/to//parquet/files/part-001.parquet"
    table = pa.table(example_data)
    pq.write_table(table, where=path, filesystem=fs)

    # use pq.read_table() with filesystem
    new = pq.read_table(source=path, filesystem=fs)
    assert new == table

    # use adlfs filesystem.open()
    df = pd.read_parquet(fs.open(path, mode="rb"))
    dataframe_table = pa.Table.from_pandas(df)
    assert table == dataframe_table

    # use abfs path with storage options
    df2 = pd.read_parquet(f"abfs://{path}", storage_options={"connection_string": CONN_STR})
    assert_frame_equal(df, df2)
{code}





was (Author: ldacey):
I am not sure about any public datasets. Locally, I use 
[azurite|https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azurite?tabs=visual-studio]
 for testing which can be installed or run as a Docker container.

I use pyarrow ds.dataset() or pq.read_table() with a filesystem to read parquet 
data from Azure. I did a couple of tests with double slashes in the path. 
Perhaps I misunderstood what the original issue was, but it looks like I can 
read the data with pq.read_table and with pandas using fs.open() and 
storage_options. I pasted my quick tests below.



{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pytest
from adlfs import AzureBlobFileSystem
from pandas.testing import assert_frame_equal


URL = "http://127.0.0.1:1"
ACCOUNT_NAME = "devstoreaccount1"
KEY = "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw=="
CONN_STR = f"DefaultEndpointsProtocol=http;AccountName={ACCOUNT_NAME};AccountKey={KEY};BlobEndpoint={URL}/{ACCOUNT_NAME};"


@pytest.fixture
def example_data():
    return {
        "date_id": [20210114, 20210811],
        "id": [1, 2],
        "created_at": [
            "2021-01-14 16:45:18",
            "2021-08-11 15:10:00",
        ],
        "updated_at": [
            "2021-01-14 16:45:18",
            "2021-08-11 15:10:00",
        ],
        "category": ["cow", "sheep"],
        "value": [0, 99],
    }


def test_double_slashes(example_data):
    fs = AzureBlobFileSystem(account_name=ACCOUNT_NAME, connection_string=CONN_STR)
    fs.mkdir("resource")
    path = "resource/path/to//parquet/files/part-001.parquet"
    table = pa.table(example_data)
    pq.write_table(table, where=path, filesystem=fs)

    # use pq.read_table() with filesystem
    new = pq.read_table(source=path, filesystem=fs)
    assert new == table

    # use adlfs filesystem.open()
    df = pd.read_parquet(fs.open(path, mode="rb"))
    dataframe_table = pa.Table.from_pandas(df)
    assert table == dataframe_table

    # use abfs path with storage options
    df2 = pd.read_parquet(f"abfs://{path}", storage_options={"connection_string": CONN_STR})
    assert_frame_equal(df, df2)
{code}




> [Python] ArrowInvalid error on reading partitioned parquet files with 
> fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path

[jira] [Commented] (ARROW-16077) [Python] ArrowInvalid error on reading partitioned parquet files with fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path

2022-04-05 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517478#comment-17517478
 ] 

Lance Dacey commented on ARROW-16077:
-

I am not sure about any public datasets. Locally, I use 
[azurite|https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azurite?tabs=visual-studio]
 for testing which can be installed or run as a Docker container.

I use pyarrow ds.dataset() or pq.read_table() with a filesystem to read parquet 
data from Azure. I did a couple of tests with double slashes in the path. 
Perhaps I misunderstood what the original issue was, but it looks like I can 
read the data with pq.read_table and with pandas using fs.open() and 
storage_options. I pasted my quick tests below.



{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pytest
from adlfs import AzureBlobFileSystem
from pandas.testing import assert_frame_equal


URL = "http://127.0.0.1:1"
ACCOUNT_NAME = "devstoreaccount1"
KEY = "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw=="
CONN_STR = f"DefaultEndpointsProtocol=http;AccountName={ACCOUNT_NAME};AccountKey={KEY};BlobEndpoint={URL}/{ACCOUNT_NAME};"


@pytest.fixture
def example_data():
    return {
        "date_id": [20210114, 20210811],
        "id": [1, 2],
        "created_at": [
            "2021-01-14 16:45:18",
            "2021-08-11 15:10:00",
        ],
        "updated_at": [
            "2021-01-14 16:45:18",
            "2021-08-11 15:10:00",
        ],
        "category": ["cow", "sheep"],
        "value": [0, 99],
    }


def test_double_slashes(example_data):
    fs = AzureBlobFileSystem(account_name=ACCOUNT_NAME, connection_string=CONN_STR)
    fs.mkdir("resource")
    path = "resource/path/to//parquet/files/part-001.parquet"
    table = pa.table(example_data)
    pq.write_table(table, where=path, filesystem=fs)

    # use pq.read_table() with filesystem
    new = pq.read_table(source=path, filesystem=fs)
    assert new == table

    # use adlfs filesystem.open()
    df = pd.read_parquet(fs.open(path, mode="rb"))
    dataframe_table = pa.Table.from_pandas(df)
    assert table == dataframe_table

    # use abfs path with storage options
    df2 = pd.read_parquet(f"abfs://{path}", storage_options={"connection_string": CONN_STR})
    assert_frame_equal(df, df2)
{code}




> [Python] ArrowInvalid error on reading partitioned parquet files with 
> fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path
> ---
>
> Key: ARROW-16077
> URL: https://issues.apache.org/jira/browse/ARROW-16077
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 7.0.0
>Reporter: Jon Rosenberg
>Priority: Major
>
> Reading a partitioned parquet from adlfs with pyarrow through pandas will 
> throw unnecessary exceptions on not matching forward slashes in the listed 
> files returned from adlfs, ie:
>  
> {code:python}
> import pandas as pd
> pd.read_parquet("adl://resource/path/to/parquet/files"){code}
> results in exception of the form
> {code:bash}
> pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 
> 'path/to/parquet/files/part-0001.parquet', which is outside base dir 
> '/path/to/parquet/files/'{code}
>  
> and testing with modifying the adlfs method to prepend slashes to all 
> returned files, we still end up with an error on file paths that would 
> otherwise be handled correctly where there is a double slash in a location 
> where there should be one, ie:
>  
> {code:python}
> import pandas as pd
> pd.read_parquet("adl://resource/path/to//parquet/files") {code}
> would throw
> {code:bash}
> pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 
> '/path/to/parquet/files/part-0001.parquet', which is outside base dir 
> '/path/to//parquet/files/' {code}
> In both cases the ls has returned correctly from adlfs, given it's discovered 
> the file part-0001.parquet but the pyarrow exception stops what could 
> otherwise be successful processing. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset

2022-04-05 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517431#comment-17517431
 ] 

Lance Dacey commented on ARROW-12358:
-

Is this on the radar to be fixed for the next release?

> [C++][Python][R][Dataset] Control overwriting vs appending when writing to 
> existing dataset
> ---
>
> Key: ARROW-12358
> URL: https://issues.apache.org/jira/browse/ARROW-12358
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Weston Pace
>Priority: Major
>  Labels: dataset
> Fix For: 8.0.0
>
>
> Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}}) 
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when 
> you are writing to an existing dataset, you de facto overwrite previous data 
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by 
> ensuring the file names are unique (the user can specify the 
> {{basename_template}} to be something unique). There is also ARROW-7706 about 
> silently doubling data (so _not_ overwriting existing data) with the legacy 
> {{parquet.write_to_dataset}} implementation. 
> It could be good to have a "mode" when writing datasets that controls the 
> different possible behaviours. And erroring when there is pre-existing data 
> in the target directory is maybe the safest default, because both appending 
> vs overwriting silently can be surprising behaviour depending on your 
> expectations.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset

2022-03-04 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501328#comment-17501328
 ] 

Lance Dacey commented on ARROW-12358:
-

Is this issue sufficient to track this? In the meantime, is there a more 
efficient way to create the partitions than using "overwrite_or_ignore" and 
then "delete_matching" when the first attempt fails?

> [C++][Python][R][Dataset] Control overwriting vs appending when writing to 
> existing dataset
> ---
>
> Key: ARROW-12358
> URL: https://issues.apache.org/jira/browse/ARROW-12358
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Weston Pace
>Priority: Major
>  Labels: dataset
> Fix For: 8.0.0
>
>
> Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}}) 
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when 
> you are writing to an existing dataset, you de facto overwrite previous data 
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by 
> ensuring the file names are unique (the user can specify the 
> {{basename_template}} to be something unique). There is also ARROW-7706 about 
> silently doubling data (so _not_ overwriting existing data) with the legacy 
> {{parquet.write_to_dataset}} implementation. 
> It could be good to have a "mode" when writing datasets that controls the 
> different possible behaviours. And erroring when there is pre-existing data 
> in the target directory is maybe the safest default, because both appending 
> vs overwriting silently can be surprising behaviour depending on your 
> expectations.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (ARROW-12365) [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()

2022-03-04 Thread Lance Dacey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Dacey closed ARROW-12365.
---
Fix Version/s: 6.0.0
   Resolution: Resolved

delete_matching option solves this issue

> [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()
> --
>
> Key: ARROW-12365
> URL: https://issues.apache.org/jira/browse/ARROW-12365
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 3.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: dataset, parquet, python
> Fix For: 6.0.0
>
>
> I need to use the legacy pq.write_to_dataset() in order to guarantee that a 
> file within a partition will have a specific name. 
> My use case is that I need to report on the final version of data and our 
> visualization tool connects directly to our parquet files on Azure Blob 
> (Power BI).
> 1) Download data every hour based on an updated_at timestamp (this data is 
> partitioned by date)
> 2) Transform the data which was just downloaded and save it into a "staging" 
> dataset (this has all versions of the data and there will be many files 
> within each partition. In this case, up to 24 files within a single date 
> partition since we download hourly)
> 3) Filter the transformed data and read a subset of columns, sort it by the 
> updated_at timestamp and drop duplicates on the unique constraint, then 
> partition and save it with partition_filename_cb. In the example below, if I 
> partition by the "date_id" column, then my dataset structure will be 
> "/date_id=202104123/20210413.parquet"
> {code:java}
> use_legacy_dataset=True, partition_filename_cb=lambda x: 
> str(x[-1]) + ".parquet",{code}
> Ultimately, I am sure that this final dataset has exactly one file per 
> partition and that I only have the latest version of each row based on the 
> maximum updated_at timestamp. My visualization tool can safely connect to and 
> incrementally refresh from this dataset.
>  
>  
> An alternative solution would be to allow us to overwrite anything in an 
> existing partition. I do not care about the file names so much as I want to 
> ensure that I am fully replacing any data which might already exist in my 
> partition, and I want to limit the number of physical files.
>  
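
For context, the fuller form of the legacy call referenced in the snippet above looks roughly like this; root_path and partition_cols are illustrative, and the lambda is the one from the snippet:

{code:python}
import pyarrow.parquet as pq

# Legacy writer: one file per partition, named after the partition value,
# e.g. date_id=20210413/20210413.parquet
pq.write_to_dataset(
    table,
    root_path="dev/final",
    partition_cols=["date_id"],
    use_legacy_dataset=True,
    partition_filename_cb=lambda keys: str(keys[-1]) + ".parquet",
)
{code}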



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15716) [Dataset][Python] Parse a list of fragment paths to gather filters

2022-02-17 Thread Lance Dacey (Jira)
Lance Dacey created ARROW-15716:
---

 Summary: [Dataset][Python] Parse a list of fragment paths to 
gather filters
 Key: ARROW-15716
 URL: https://issues.apache.org/jira/browse/ARROW-15716
 Project: Apache Arrow
  Issue Type: Wish
Affects Versions: 7.0.0
Reporter: Lance Dacey


Is it possible for partitioning.parse() to be updated to parse a list of paths 
instead of just a single path? 

I am passing the .paths from file_visitor to downstream tasks to process data 
which was recently saved, but I can run into problems with this if I overwrite 
data with delete_matching in order to consolidate small files since the paths 
won't exist. 

Here is the output of my current approach to use filters instead of reading the 
paths directly:

{code:java}
# Fragments saved during write_dataset 
['dev/dataset/fragments/date_id=20210813/data-0.parquet', 
'dev/dataset/fragments/date_id=20210114/data-2.parquet', 
'dev/dataset/fragments/date_id=20210114/data-1.parquet', 
'dev/dataset/fragments/date_id=20210114/data-0.parquet']

# Run partitioning.parse() on each fragment 
[<pyarrow.compute.Expression (date_id == 20210813)>, 
<pyarrow.compute.Expression (date_id == 20210114)>, 
<pyarrow.compute.Expression (date_id == 20210114)>, 
<pyarrow.compute.Expression (date_id == 20210114)>]

# Format those expressions into a list of tuples
[('date_id', 'in', [20210114, 20210813])]

# Convert to an expression which is used as a filter in .to_table()
is_in(date_id, {value_set=int64:[
  20210114,
  20210813
], skip_nulls=false})
{code}

And here is how I am creating the filter from a list of .paths (perhaps there 
is a better way?):

{code:python}
partitioning = ds.HivePartitioning(partition_schema)
expressions = []
for file in paths:
    expressions.append(partitioning.parse(file))

values = []
filters = []
for expression in expressions:
    partitions = ds._get_partition_keys(expression)
    if len(partitions.keys()) > 1:
        element = [(k, "==", v) for k, v in partitions.items()]
        if element not in filters:
            filters.append(element)
    else:
        for k, v in partitions.items():
            if v not in values:
                values.append(v)
        filters = [(k, "in", sorted(values))]

filt_exp = pa.parquet._filters_to_expression(filters)
dataset.to_table(filter=filt_exp)
{code}


My hope would be to do something like filt_exp = partitioning.parse(paths) 
which would return a dataset expression.




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset

2022-02-02 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485722#comment-17485722
 ] 

Lance Dacey commented on ARROW-12358:
-

Is this slated for a fix in 7.0.0? I am writing a dataset using 
"overwrite_or_ignore" and then writing it again with "delete_matching" whenever 
my initial save using "delete_matching" fails (FileNotFoundError).

> [C++][Python][R][Dataset] Control overwriting vs appending when writing to 
> existing dataset
> ---
>
> Key: ARROW-12358
> URL: https://issues.apache.org/jira/browse/ARROW-12358
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset
> Fix For: 8.0.0
>
>
> Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}}) 
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when 
> you are writing to an existing dataset, you de facto overwrite previous data 
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by 
> ensuring the file names are unique (the user can specify the 
> {{basename_template}} to be something unique). There is also ARROW-7706 about 
> silently doubling data (so _not_ overwriting existing data) with the legacy 
> {{parquet.write_to_dataset}} implementation. 
> It could be good to have a "mode" when writing datasets that controls the 
> different possible behaviours. And erroring when there is pre-existing data 
> in the target directory is maybe the safest default, because both appending 
> vs overwriting silently can be surprising behaviour depending on your 
> expectations.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15474) [Python] Possibility of a table.drop_duplicates() function?

2022-01-27 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483481#comment-17483481
 ] 

Lance Dacey commented on ARROW-15474:
-

Ahh, that would be great. Random is a bit risky for my use case since I 
generally care about the latest version.

I found [this 
repository|https://github.com/TomScheffers/pyarrow_ops/tree/main/pyarrow_ops] 
which has a method to drop duplicates that I might be able to adopt in the 
meantime. I would need to digest exactly what is happening down below a bit 
more, but I think there are some compute functions like `pc.sort_indices`,  
`pc.unique`, etc that could probably be used to replace some of the numpy code. 

{code:python}
def drop_duplicates(table, on=[], keep='first'):
    # Gather columns to arr
    arr = columns_to_array(table, (on if on else table.column_names))

    # Groupify
    dic, counts, sort_idxs, bgn_idxs = groupify_array(arr)

    # Gather idxs
    if keep == 'last':
        idxs = (np.array(bgn_idxs) - 1)[1:].tolist() + [len(sort_idxs) - 1]
    elif keep == 'first':
        idxs = bgn_idxs
    elif keep == 'drop':
        idxs = [i for i, c in zip(bgn_idxs, counts) if c == 1]
    return table.take(sort_idxs[idxs])

def groupify_array(arr):
    # Input: Pyarrow/Numpy array
    # Output:
    #   - 1. Unique values
    #   - 2. Sort index
    #   - 3. Count per unique
    #   - 4. Begin index per unique
    dic, counts = np.unique(arr, return_counts=True)
    sort_idx = np.argsort(arr)
    return dic, counts, sort_idx, [0] + np.cumsum(counts)[:-1].tolist()

def combine_column(table, name):
    return table.column(name).combine_chunks()

f = np.vectorize(hash)
def columns_to_array(table, columns):
    columns = ([columns] if isinstance(columns, str) else list(set(columns)))
    if len(columns) == 1:
        return f(combine_column(table, columns[0]).to_numpy(zero_copy_only=False))
    else:
        values = [c.to_numpy() for c in table.select(columns).itercolumns()]
        return np.array(list(map(hash, zip(*values))))
{code}
 

> [Python] Possibility of a table.drop_duplicates() function?
> ---
>
> Key: ARROW-15474
> URL: https://issues.apache.org/jira/browse/ARROW-15474
> Project: Apache Arrow
>  Issue Type: Wish
>Affects Versions: 6.0.1
>Reporter: Lance Dacey
>Priority: Major
> Fix For: 8.0.0
>
>
> I noticed that there is a group_by() and sort_by() function in the 7.0.0 
> branch. Is it possible to include a drop_duplicates() function as well? 
> ||id||updated_at||
> |1|2022-01-01 04:23:57|
> |2|2022-01-01 07:19:21|
> |2|2022-01-10 22:14:01|
> Something like this which would return a table without the second row in the 
> example above would be great. 
> I usually am reading an append-only dataset and then I need to report on 
> latest version of each row. To drop duplicates, I am temporarily converting 
> the append-only table to a pandas DataFrame, and then I convert it back to a 
> table and save a separate "latest-version" dataset.
> {code:python}
> table.sort_by(sorting=[("id", "ascending"), ("updated_at", 
> "ascending")]).drop_duplicates(subset=["id"] keep="last")
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-15474) [Python] Possibility of a table.drop_duplicates() function?

2022-01-27 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483114#comment-17483114
 ] 

Lance Dacey commented on ARROW-15474:
-

I would personally be okay with only having the first row retained since I 
could just sort the table before dropping duplicates to get the desired results.

Is it possible to get the first or nth values from a table groupby? In pandas, 
we can do this which I think has the desired behavior even with multiple 
columns (as long as we sort the data first). If we can get the indices of which 
rows to keep, then we could use table.take() to return a new table with the 
latest values.
{code:python}
df = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2, 2],
        "name": ["a", "a", "a", "b", "c", "c"],
        "updated_at": [
            "2021-01-01 00:02:19",
            "2022-01-04 12:13:10",
            "2022-01-06 04:10:52",
            "2022-01-02 17:32:21",
            "2022-01-06 01:27:14",
            "2022-01-06 23:09:56",
        ],
    }
)

df.sort_values(["id", "name", "updated_at"], ascending=[1, 1, 0]).groupby(["id", "name"]).nth(0).reset_index()
{code}

> [Python] Possibility of a table.drop_duplicates() function?
> ---
>
> Key: ARROW-15474
> URL: https://issues.apache.org/jira/browse/ARROW-15474
> Project: Apache Arrow
>  Issue Type: Wish
>Affects Versions: 6.0.1
>Reporter: Lance Dacey
>Priority: Major
> Fix For: 8.0.0
>
>
> I noticed that there is a group_by() and sort_by() function in the 7.0.0 
> branch. Is it possible to include a drop_duplicates() function as well? 
> ||id||updated_at||
> |1|2022-01-01 04:23:57|
> |2|2022-01-01 07:19:21|
> |2|2022-01-10 22:14:01|
> Something like this which would return a table without the second row in the 
> example above would be great. 
> I usually am reading an append-only dataset and then I need to report on 
> latest version of each row. To drop duplicates, I am temporarily converting 
> the append-only table to a pandas DataFrame, and then I convert it back to a 
> table and save a separate "latest-version" dataset.
> {code:python}
> table.sort_by(sorting=[("id", "ascending"), ("updated_at", 
> "ascending")]).drop_duplicates(subset=["id"] keep="last")
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (ARROW-15474) [Python] Possibility of a table.drop_duplicates() function?

2022-01-26 Thread Lance Dacey (Jira)
Lance Dacey created ARROW-15474:
---

 Summary: [Python] Possibility of a table.drop_duplicates() 
function?
 Key: ARROW-15474
 URL: https://issues.apache.org/jira/browse/ARROW-15474
 Project: Apache Arrow
  Issue Type: Wish
Affects Versions: 6.0.1
Reporter: Lance Dacey
 Fix For: 8.0.0


I noticed that there is a group_by() and sort_by() function in the 7.0.0 
branch. Is it possible to include a drop_duplicates() function as well? 

||id||updated_at||
|1|2022-01-01 04:23:57|
|2|2022-01-01 07:19:21|
|2|2022-01-10 22:14:01|

Something like this which would return a table without the second row in the 
example above would be great. 

I usually am reading an append-only dataset and then I need to report on latest 
version of each row. To drop duplicates, I am temporarily converting the 
append-only table to a pandas DataFrame, and then I convert it back to a table 
and save a separate "latest-version" dataset.

{code:python}
table.sort_by(sorting=[("id", "ascending"), ("updated_at", 
"ascending")]).drop_duplicates(subset=["id"] keep="last")
{code}








--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset

2022-01-14 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476120#comment-17476120
 ] 

Lance Dacey commented on ARROW-12358:
-

Ah, so it must be related to the filesystem. I am using adlfs / fsspec to save 
datasets on Azure Blob:


{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

print(type(fs))
tab = pa.Table.from_pydict({ 'part': [0, 0, 1, 1], 'value': [0, 1, 2, 3] })
ds.write_dataset(data=tab,
 base_dir='/dev/newdataset',
 partitioning_flavor='hive',
 partitioning=['part'],
 existing_data_behavior='delete_matching',
 format='parquet',
 filesystem=fs)
{code}

Output:


{code:python}


[2022-01-14 12:45:44,076] {api.py:76} WARNING - Given content is empty, 
stopping the process very early, returning empty utf_8 str match
[2022-01-14 12:45:44,090] {api.py:76} WARNING - Given content is empty, 
stopping the process very early, returning empty utf_8 str match
[2022-01-14 12:45:44,093] {api.py:76} WARNING - Given content is empty, 
stopping the process very early, returning empty utf_8 str match
[2022-01-14 12:45:44,109] {api.py:76} WARNING - Given content is empty, 
stopping the process very early, returning empty utf_8 str match
[2022-01-14 12:45:44,121] {api.py:76} WARNING - Given content is empty, 
stopping the process very early, returning empty utf_8 str match
[2022-01-14 12:45:44,124] {api.py:76} WARNING - Given content is empty, 
stopping the process very early, returning empty utf_8 str match
---
FileNotFoundError Traceback (most recent call last)
/tmp/ipykernel_47/3075266795.py in <module>
  4 print(type(fs))
  5 tab = pa.Table.from_pydict({ 'part': [0, 0, 1, 1], 'value': [0, 1, 2, 
3] })
> 6 ds.write_dataset(data=tab,
  7  base_dir='/dev/newdataset',
  8  partitioning_flavor='hive',

/opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/dataset.py in 
write_dataset(data, base_dir, basename_template, format, partitioning, 
partitioning_flavor, schema, filesystem, file_options, use_threads, 
max_partitions, file_visitor, existing_data_behavior)
876 scanner = data
877 
--> 878 _filesystemdataset_write(
879 scanner, base_dir, basename_template, filesystem, partitioning,
880 file_options, max_partitions, file_visitor, 
existing_data_behavior

/opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/_dataset.pyx in 
pyarrow._dataset._filesystemdataset_write()

/opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/_fs.pyx in 
pyarrow._fs._cb_delete_dir_contents()

/opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/fs.py in 
delete_dir_contents(self, path)
357 raise ValueError(
358 "delete_dir_contents called on path '", path, "'")
--> 359 self._delete_dir_contents(path)
360 
361 def delete_root_dir_contents(self):

/opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/fs.py in 
_delete_dir_contents(self, path)
347 
348 def _delete_dir_contents(self, path):
--> 349 for subpath in self.fs.listdir(path, detail=False):
350 if self.fs.isdir(subpath):
351 self.fs.rm(subpath, recursive=True)

/opt/conda/envs/airflow/lib/python3.9/site-packages/fsspec/spec.py in 
listdir(self, path, detail, **kwargs)
   1221 def listdir(self, path, detail=True, **kwargs):
   1222 """Alias of `AbstractFileSystem.ls`."""
-> 1223 return self.ls(path, detail=detail, **kwargs)
   1224 
   1225 def cp(self, path1, path2, **kwargs):

/opt/conda/envs/airflow/lib/python3.9/site-packages/adlfs/spec.py in ls(self, 
path, detail, invalidate_cache, delimiter, return_glob, **kwargs)
753 ):
754 
--> 755 files = sync(
756 self.loop,
757 self._ls,

/opt/conda/envs/airflow/lib/python3.9/site-packages/fsspec/asyn.py in 
sync(loop, func, timeout, *args, **kwargs)
 69 raise FSTimeoutError from return_result
 70 elif isinstance(return_result, BaseException):
---> 71 raise return_result
 72 else:
 73 return return_result

/opt/conda/envs/airflow/lib/python3.9/site-packages/fsspec/asyn.py in 
_runner(event, coro, result, timeout)
 23 coro = asyncio.wait_for(coro, timeout=timeout)
 24 try:
---> 25 result[0] = await coro
 26 except Exception as ex:
 27 result[0] = ex

/opt/conda/envs/airflow/lib/python3.9/site-packages/adlfs/spec.py in _ls(self, 
path, invalidate_cache, delimiter, return_glob, **kwargs)
875 if not finalblobs:
876 if not await self._exists(target_path):
--> 877

[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset

2022-01-13 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475363#comment-17475363
 ] 

Lance Dacey commented on ARROW-12358:
-

[~westonpace] Just wanted to check if this issue with "delete_matching" not 
creating the partition directory is still on the radar. I am currently using 
"overwrite_or_ignore", and then writing the same table again with 
"delete_matching" which is a bit redundant. 

> [C++][Python][R][Dataset] Control overwriting vs appending when writing to 
> existing dataset
> ---
>
> Key: ARROW-12358
> URL: https://issues.apache.org/jira/browse/ARROW-12358
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset
> Fix For: 8.0.0
>
>
> Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}}) 
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when 
> you are writing to an existing dataset, you de facto overwrite previous data 
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by 
> ensuring the file names are unique (the user can specify the 
> {{basename_template}} to be something unique). There is also ARROW-7706 about 
> silently doubling data (so _not_ overwriting existing data) with the legacy 
> {{parquet.write_to_dataset}} implementation. 
> It could be good to have a "mode" when writing datasets that controls the 
> different possible behaviours. And erroring when there is pre-existing data 
> in the target directory is maybe the safest default, because both appending 
> vs overwriting silently can be surprising behaviour depending on your 
> expectations.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset

2021-12-02 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450796#comment-17450796
 ] 

Lance Dacey edited comment on ARROW-12358 at 12/3/21, 3:04 AM:
---

I was not able to install 6.0.1 until the latest version of turbodbc supported 
it. Finally have it up and running and I see that the `existing_data_behavior` 
argument has been added.

 Is this the proper way to use the "delete_matching" feature? When I tried to 
set that as default, there was a FileNotFound error (because the base_dir did 
not exist at all).

EDIT - using the try, except does not really work. I need to save the dataset 
as "overwrite_or_ignore" first, then save the dataset again as "delete_matching"
 
{code:python}
try:
    ds.write_dataset(
        data=table,
        existing_data_behavior="error",
    )
except pa.lib.ArrowInvalid:
    ds.write_dataset(
        data=table,
        ...,
        existing_data_behavior="delete_matching",
    )
{code}


I created a dataset using my old method (`use_legacy_dataset` = True with a 
`partition_filename_cb` to overwrite partitions) and the output matched the new 
"delete_matching" dataset. I believe I can completely retire the 
use_legacy_dataset code now. Really amazing, thank you.



was (Author: ldacey):
I was not able to install 6.0.1 until the latest version of turbodbc supported 
it. Finally have it up and running and I see that the `existing_data_behavior` 
argument has been added.

 Is this the proper way to use the "delete_matching" feature? When I tried to 
set that as default, there was a FileNotFound error (because the base_dir did 
not exist at all).
 
{code:python}
try:
    ds.write_dataset(
        data=table,
        existing_data_behavior="error",
    )
except pa.lib.ArrowInvalid:
    ds.write_dataset(
        data=table,
        ...,
        existing_data_behavior="delete_matching",
    )
{code}


I created a dataset using my old method (`use_legacy_dataset` = True with a 
`partition_filename_cb` to overwrite partitions) and the output matched the new 
"delete_matching" dataset. I believe I can completely retire the 
use_legacy_dataset code now. Really amazing, thank you.


> [C++][Python][R][Dataset] Control overwriting vs appending when writing to 
> existing dataset
> ---
>
> Key: ARROW-12358
> URL: https://issues.apache.org/jira/browse/ARROW-12358
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset
> Fix For: 7.0.0
>
>
> Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}}) 
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when 
> you are writing to an existing dataset, you de facto overwrite previous data 
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by 
> ensuring the file names are unique (the user can specify the 
> {{basename_template}} to be something unique). There is also ARROW-7706 about 
> silently doubling data (so _not_ overwriting existing data) with the legacy 
> {{parquet.write_to_dataset}} implementation. 
> It could be good to have a "mode" when writing datasets that controls the 
> different possible behaviours. And erroring when there is pre-existing data 
> in the target directory is maybe the safest default, because both appending 
> vs overwriting silently can be surprising behaviour depending on your 
> expectations.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset

2021-12-02 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452649#comment-17452649
 ] 

Lance Dacey commented on ARROW-12358:
-

Any thoughts on "delete_matching" creating the partition if it does not exist 
already? 

> [C++][Python][R][Dataset] Control overwriting vs appending when writing to 
> existing dataset
> ---
>
> Key: ARROW-12358
> URL: https://issues.apache.org/jira/browse/ARROW-12358
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset
> Fix For: 7.0.0
>
>
> Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}}) 
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when 
> you are writing to an existing dataset, you de facto overwrite previous data 
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by 
> ensuring the file names are unique (the user can specify the 
> {{basename_template}} to be something unique). There is also ARROW-7706 about 
> silently doubling data (so _not_ overwriting existing data) with the legacy 
> {{parquet.write_to_dataset}} implementation. 
> It could be good to have a "mode" when writing datasets that controls the 
> different possible behaviours. And erroring when there is pre-existing data 
> in the target directory is maybe the safest default, because both appending 
> vs overwriting silently can be surprising behaviour depending on your 
> expectations.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-14938) Partition column dissappear when reading dataset

2021-12-01 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451813#comment-17451813
 ] 

Lance Dacey commented on ARROW-14938:
-

Sure - refer to this section: 
https://arrow.apache.org/docs/python/dataset.html#different-partitioning-schemes

"hive" is a shortcut which will infer the data type of the partition column 
when it gets added back to the table, but you can specify the schema of your 
partitioned columns too using ds.partitioning().
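As a concrete sketch (the path and the partition column type below are just for illustration):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# "hive" infers the type of the partition column from the directory names
dataset = ds.dataset("path/to/dataset", format="parquet", partitioning="hive")

# Or pin the partition column type yourself with a partitioning object
part = ds.partitioning(pa.schema([("code", pa.string())]), flavor="hive")
dataset = ds.dataset("path/to/dataset", format="parquet", partitioning=part)

print(dataset.to_table().to_pandas()["code"])
{code}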



> Partition column dissappear when reading dataset
> 
>
> Key: ARROW-14938
> URL: https://issues.apache.org/jira/browse/ARROW-14938
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 6.0.1
> Environment: Debian bullseye, python 3.9
>Reporter: Martin Gran
>Priority: Major
>
> Appending CSV to parquet dataset with partitioning on "code".
> {code:python}
> table = pa.Table.from_pandas(chunk)
>         pa.dataset.write_dataset(
>             table,
>             output_path,
>             basename_template=f"chunk_\{y}_\{{i}}",
>             format="parquet",
>             partitioning=["code"],
>             existing_data_behavior="overwrite_or_ignore",
>         )
> {code}
> Loading the dataset again and expecting code to be in the dataframe.
> {code:python}
> import pyarrow.dataset as ds
> dataset = ds.dataset("../data/interim/2020_elements_parquet/", 
> format="parquet",)
> df = dataset.to_table().to_pandas()
> >>>df["code"]
> {code}
> Trace
> {code:python}
> --- 
> KeyError Traceback (most recent call last) 
> ~/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py in 
> get_loc(self, key, method, tolerance)  3360 try: -> 3361 return 
> self._engine.get_loc(casted_key)  3362 except KeyError as err: 
> ~/.local/lib/python3.9/site-packages/pandas/_libs/index.pyx in 
> pandas._libs.index.IndexEngine.get_loc() 
> ~/.local/lib/python3.9/site-packages/pandas/_libs/index.pyx in 
> pandas._libs.index.IndexEngine.get_loc() 
> pandas/_libs/hashtable_class_helper.pxi in 
> pandas._libs.hashtable.PyObjectHashTable.get_item() 
> pandas/_libs/hashtable_class_helper.pxi in 
> pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'code' The 
> above exception was the direct cause of the following exception: KeyError 
> Traceback (most recent call last) /tmp/ipykernel_24875/4149106129.py in 
>  > 1 df["code"] 
> ~/.local/lib/python3.9/site-packages/pandas/core/frame.py in 
> __getitem__(self, key)  3456 if self.columns.nlevels > 1:  3457 return 
> self._getitem_multilevel(key) -> 3458 indexer = self.columns.get_loc(key)  
> 3459 if is_integer(indexer):  3460 indexer = [indexer] 
> ~/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py in 
> get_loc(self, key, method, tolerance)  3361 return 
> self._engine.get_loc(casted_key)  3362 except KeyError as err: -> 3363 raise 
> KeyError(key) from err  3364  3365 if is_scalar(key) and isna(key) and not 
> self.hasnans: KeyError: 'code'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-14938) Partition column dissappear when reading dataset

2021-12-01 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451752#comment-17451752
 ] 

Lance Dacey commented on ARROW-14938:
-

If you add the partitioning argument to ds.dataset(source, format, partitioning) that should fix it.

For example, partitioning="hive", or specify it with a partitioning object: partitioning=ds.partitioning(pa.schema([("code", pa.string())]), flavor="hive"). I used hive in these examples, but there is directory partitioning as well.

> Partition column dissappear when reading dataset
> 
>
> Key: ARROW-14938
> URL: https://issues.apache.org/jira/browse/ARROW-14938
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 6.0.1
> Environment: Debian bullseye, python 3.9
>Reporter: Martin Gran
>Priority: Major
>
> Appending CSV to parquet dataset with partitioning on "code".
> {code:python}
> table = pa.Table.from_pandas(chunk)
>         pa.dataset.write_dataset(
>             table,
>             output_path,
>             basename_template=f"chunk_\{y}_\{{i}}",
>             format="parquet",
>             partitioning=["code"],
>             existing_data_behavior="overwrite_or_ignore",
>         )
> {code}
> Loading the dataset again and expecting code to be in the dataframe.
> {code:python}
> import pyarrow.dataset as ds
> dataset = ds.dataset("../data/interim/2020_elements_parquet/", 
> format="parquet",)
> df = dataset.to_table().to_pandas()
> >>>df["code"]
> {code}
> Trace
> {code:python}
> --- 
> KeyError Traceback (most recent call last) 
> ~/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py in 
> get_loc(self, key, method, tolerance)  3360 try: -> 3361 return 
> self._engine.get_loc(casted_key)  3362 except KeyError as err: 
> ~/.local/lib/python3.9/site-packages/pandas/_libs/index.pyx in 
> pandas._libs.index.IndexEngine.get_loc() 
> ~/.local/lib/python3.9/site-packages/pandas/_libs/index.pyx in 
> pandas._libs.index.IndexEngine.get_loc() 
> pandas/_libs/hashtable_class_helper.pxi in 
> pandas._libs.hashtable.PyObjectHashTable.get_item() 
> pandas/_libs/hashtable_class_helper.pxi in 
> pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'code' The 
> above exception was the direct cause of the following exception: KeyError 
> Traceback (most recent call last) /tmp/ipykernel_24875/4149106129.py in 
>  > 1 df["code"] 
> ~/.local/lib/python3.9/site-packages/pandas/core/frame.py in 
> __getitem__(self, key)  3456 if self.columns.nlevels > 1:  3457 return 
> self._getitem_multilevel(key) -> 3458 indexer = self.columns.get_loc(key)  
> 3459 if is_integer(indexer):  3460 indexer = [indexer] 
> ~/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py in 
> get_loc(self, key, method, tolerance)  3361 return 
> self._engine.get_loc(casted_key)  3362 except KeyError as err: -> 3363 raise 
> KeyError(key) from err  3364  3365 if is_scalar(key) and isna(key) and not 
> self.hasnans: KeyError: 'code'
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset

2021-11-29 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450796#comment-17450796
 ] 

Lance Dacey commented on ARROW-12358:
-

I was not able to install 6.0.1 until the latest version of turbodbc supported 
it. Finally have it up and running and I see that the `existing_data_behavior` 
argument has been added.

 Is this the proper way to use the "delete_matching" feature? When I tried to 
set that as default, there was a FileNotFound error (because the base_dir did 
not exist at all).
 
{code:python}
try:
    ds.write_dataset(
        data=table,
        existing_data_behavior="error",
    )
except pa.lib.ArrowInvalid:
    ds.write_dataset(
        data=table,
        ...,
        existing_data_behavior="delete_matching",
    )
{code}


I created a dataset using my old method (`use_legacy_dataset` = True with a 
`partition_filename_cb` to overwrite partitions) and the output matched the new 
"delete_matching" dataset. I believe I can completely retire the 
use_legacy_dataset code now. Really amazing, thank you.


> [C++][Python][R][Dataset] Control overwriting vs appending when writing to 
> existing dataset
> ---
>
> Key: ARROW-12358
> URL: https://issues.apache.org/jira/browse/ARROW-12358
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset
> Fix For: 7.0.0
>
>
> Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}}) 
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when 
> you are writing to an existing dataset, you de facto overwrite previous data 
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by 
> ensuring the file names are unique (the user can specify the 
> {{basename_template}} to be something unique). There is also ARROW-7706 about 
> silently doubling data (so _not_ overwriting existing data) with the legacy 
> {{parquet.write_to_dataset}} implementation. 
> It could be good to have a "mode" when writing datasets that controls the 
> different possible behaviours. And erroring when there is pre-existing data 
> in the target directory is maybe the safest default, because both appending 
> vs overwriting silently can be surprising behaviour depending on your 
> expectations.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-14608) [Python] Provide access to hash_aggregate functions through a group_by method

2021-11-29 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-14608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450770#comment-17450770
 ] 

Lance Dacey commented on ARROW-14608:
-

If we can do group_by on the pyarrow table, then I should be able to drop_duplicates as well if it is combined with a filter, right? Sorting and dropping duplicates is one of the big reasons I still need to convert some pyarrow tables into a pandas DataFrame temporarily.
{code:java}
df.sort_values(['id', 'updated_at'], ascending=True).drop_duplicates(subset=['id'], keep='last')
{code}
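A rough sketch of how that could look once a table-level group_by lands (my own illustration, not the actual API; the final mask is built in plain Python, and ties on updated_at would keep both rows):

{code:python}
import pyarrow as pa

table = pa.table({
    "id": [1, 2, 2],
    "updated_at": [1, 2, 3],  # stand-in for a real timestamp column
    "value": ["a", "b", "c"],
})

# Hash-aggregate the latest updated_at per id
latest = table.group_by("id").aggregate([("updated_at", "max")])
latest_by_id = dict(zip(latest["id"].to_pylist(), latest["updated_at_max"].to_pylist()))

# Keep only the rows that carry the per-id maximum
mask = [
    updated == latest_by_id[key]
    for key, updated in zip(table["id"].to_pylist(), table["updated_at"].to_pylist())
]
print(table.filter(pa.array(mask)).to_pydict())
# {'id': [1, 2], 'updated_at': [1, 3], 'value': ['a', 'c']}
{code}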

> [Python] Provide access to hash_aggregate functions through a group_by method
> -
>
> Key: ARROW-14608
> URL: https://issues.apache.org/jira/browse/ARROW-14608
> Project: Apache Arrow
>  Issue Type: Sub-task
>  Components: Python
>Affects Versions: 6.0.0
>Reporter: Alessandro Molina
>Assignee: Alessandro Molina
>Priority: Major
>  Labels: pull-request-available
> Fix For: 7.0.0
>
>  Time Spent: 10h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset

2021-08-24 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403772#comment-17403772
 ] 

Lance Dacey commented on ARROW-12358:
-

kDeleteMatchingPartitions - so this only deletes the individual partitions and not the entire dataset, correct? If I save a dataset made up of hundreds of partitions but only 4 of them are written to, then only those 4 partitions will have their existing files cleared? If so, then yes, that should work for me.

 

 

> [C++][Python][R][Dataset] Control overwriting vs appending when writing to 
> existing dataset
> ---
>
> Key: ARROW-12358
> URL: https://issues.apache.org/jira/browse/ARROW-12358
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset
> Fix For: 6.0.0
>
>
> Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}} 
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when 
> you are writing to an existing dataset, you de facto overwrite previous data 
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by 
> ensuring the file names are unique (the user can specify the 
> {{basename_template}} to be something unique). There is also ARROW-7706 about 
> silently doubling data (so _not_ overwriting existing data) with the legacy 
> {{parquet.write_to_dataset}} implementation. 
> It could be good to have a "mode" when writing datasets that controls the 
> different possible behaviours. And erroring when there is pre-existing data 
> in the target directory is maybe the safest default, because both appending 
> vs overwriting silently can be surprising behaviour depending on your 
> expectations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12365) [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()

2021-08-24 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403767#comment-17403767
 ] 

Lance Dacey commented on ARROW-12365:
-

The metadata collector works great, but this issue is more related to 
https://issues.apache.org/jira/browse/ARROW-12358

I use the partition_filename_cb to guarantee that I overwrite partitions, which I do not think we can control with ds.write_dataset() because the \{i} counter may differ between runs and accidentally write a new file into an existing partition (I need to be sure that there are no duplicates in the data, since our Power BI tool connects directly to the parquet dataset).

> [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()
> --
>
> Key: ARROW-12365
> URL: https://issues.apache.org/jira/browse/ARROW-12365
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 3.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: dataset, parquet, python
>
> I need to use the legacy pq.write_to_dataset() in order to guarantee that a 
> file within a partition will have a specific name. 
> My use case is that I need to report on the final version of data and our 
> visualization tool connects directly to our parquet files on Azure Blob 
> (Power BI).
> 1) Download data every hour based on an updated_at timestamp (this data is 
> partitioned by date)
> 2) Transform the data which was just downloaded and save it into a "staging" 
> dataset (this has all versions of the data and there will be many files 
> within each partition. In this case, up to 24 files within a single date 
> partition since we download hourly)
> 3) Filter the transformed data and read a subset of columns, sort it by the 
> updated_at timestamp and drop duplicates on the unique constraint, then 
> partition and save it with partition_filename_cb. In the example below, if I 
> partition by the "date_id" column, then my dataset structure will be 
> "/date_id=202104123/20210413.parquet"
> {code:java}
> use_legacy_dataset=True, partition_filename_cb=lambda x: 
> str(x[-1]) + ".parquet",{code}
> Ultimately, I am sure that this final dataset has exactly one file per 
> partition and that I only have the latest version of each row based on the 
> maximum updated_at timestamp. My visualization tool can safely connect to and 
> incrementally refresh from this dataset.
>  
>  
> An alternative solution would be to allow us to overwrite anything in an 
> existing partition. I do not care about the file names so much as I want to 
> ensure that I am fully replacing any data which might already exist in my 
> partition, and I want to limit the number of physical files.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset

2021-08-13 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398635#comment-17398635
 ] 

Lance Dacey commented on ARROW-12358:
-


I do not clear my append dataset, but I need to add tasks to consolidate the 
small files someday. If I download a source every hour, I will have a minimum 
of 24 files in a single daily partition and some of them might be small. 

But yes, I am basically describing a materialized view. I cannot rely on an 
incremental refresh in many cases because I partition data based on the 
created_at date and not the updated_at date.

Here is an example where the data was all updated today, but there were some 
rows originally created days or even months ago.

{code:python}
table = pa.table(
    {
        "date_id": [20210114, 20210811, 20210812, 20210813],  # based on the created_at timestamp
        "created_at": ["2021-01-14 16:45:18", "2021-08-11 15:10:00", "2021-08-12 11:19:26", "2021-08-13 23:01:47"],
        "updated_at": ["2021-08-13 00:04:12", "2021-08-13 02:16:23", "2021-08-13 09:55:44", "2021-08-13 22:36:01"],
        "category": ["cow", "sheep", "dog", "cat"],
        "value": [0, 99, 17, 238],
    }
)
{code}

Partitioning this by date_id would save the following files in my "append" 
dataset. Note that this has one row which is from January, so I cannot do an 
incremental refresh from the minimum date because it would be too much data in 
a real world scenario. 

{code:python}
written_paths = [
"dev/test/date_id=20210812/test-20210813114024-2.parquet",
"dev/test/date_id=20210813/test-20210813114024-3.parquet",
"dev/test/date_id=20210811/test-20210813114024-1.parquet",
"dev/test/date_id=20210114/test-20210813114024-0.parquet",
]
{code}


During my next task, I create a new dataset from the written_paths above (so a 
dataset of only the new/changed data). Using .get_fragments() and partition 
expressions, I ultimately generate a filter expression:

{code:python}
fragments = ds.dataset(written_paths, filesystem=fs).get_fragments()
for frag in fragments:
    partitions = ds._get_partition_keys(frag.partition_expression)
    # ... other stuff
filter_expression = ...

{code}
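For reference, a sketch of how the "other stuff" could turn those partition keys into the single combined expression (my own construction; written_paths and fs come from the snippet above):

{code:python}
import functools
import operator
import pyarrow.dataset as ds

def combined_partition_filter(fragments):
    """OR together one AND-expression per unique set of partition keys."""
    unique_keys = []
    for frag in fragments:
        keys = ds._get_partition_keys(frag.partition_expression)
        if keys and keys not in unique_keys:
            unique_keys.append(keys)
    per_partition = [
        functools.reduce(operator.and_, (ds.field(k) == v for k, v in keys.items()))
        for keys in unique_keys
    ]
    return functools.reduce(operator.or_, per_partition)

filter_expression = combined_partition_filter(
    ds.dataset(written_paths, filesystem=fs).get_fragments()
)
{code}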

Finally, I use that filter to query my "append" dataset, which has all historical data, and read all of the data in each matching partition:

{code:python}
df = ds.dataset(source, filesystem=fs).to_table(filter=filter_expression).to_pandas()
{code}

I then sort and drop duplicates in pandas, convert back to a table, and save to my "final" dataset with partition_filename_cb to overwrite whatever was there. This means that if even a single row was updated within a partition, I will read all of the data in that partition and recompute the final version of each row. This also requires me to use the "use_legacy_dataset" flag to support overwriting the existing partitions.

I found a custom implementation of drop_duplicates 
(https://github.com/TomScheffers/pyarrow_ops/blob/main/pyarrow_ops/ops.py) 
using pyarrow Tables, but I am still just using pandas for now. I keep a close 
eye on the pyarrow.compute() docs and have been slowly replacing stuff I do 
with pandas directly in the pyarrow tables, which is great.

Your mention of the temporary staging area made me realize that I could replace my messy staging append dataset (many small files) with something temporary that I delete on each schedule, and then read from it and create a consolidated historical append-only dataset similar to what I am doing in the example above (one file per partition instead of potentially hundreds).






> [C++][Python][R][Dataset] Control overwriting vs appending when writing to 
> existing dataset
> ---
>
> Key: ARROW-12358
> URL: https://issues.apache.org/jira/browse/ARROW-12358
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset
> Fix For: 6.0.0
>
>
> Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}} 
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when 
> you are writing to an existing dataset, you de facto overwrite previous data 
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by 
> ensuring the file names are unique (the user can specify the 
> {{basename_template}} to be something unique). There is also ARROW-7706 about 
> silently doubling data (so _not_ overwriting existing data) with the legacy 
> {{parquet.write_to_dataset}} implementation. 
> It could be good to have a "mode" when writing datasets that controls the 
> different possible behaviours. And erroring when there is pre-existing data 
> in the t

[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset

2021-08-12 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398007#comment-17398007
 ] 

Lance Dacey commented on ARROW-12358:
-

What is the common workflow pattern for folks trying to imitate something 
similar to a view in a database?

 

In many of my sources I have a dataset which is append only (using UUIDs in the 
basename template), normally partitioned by date. If this data is downloaded 
frequently or is generated from multiple sources (for example, several 
endpoints or servers), then each partition might have many files. Most likely 
there are also different versions of each row (one ID will have a row for each 
time it was updated, for example).

 

I then write to a new dataset which is used for reporting and visualization. 
 # Get the list of files which were saved to the append-only dataset during the 
most recent schedule
 # Create a dataset from the list of paths which were just saved and use 
.get_fragments() and ds._get_partition_keys(fragment.partition_expression) to 
generate a filter expression (this allows me to query for *all* of the data in 
each relevant partition which was recently modified - so if only a single row 
was modified in the 2021-08-05 partition, then I still need to read all of the 
other data in that partition in order to finalize it)
 # Create a dataframe, sort the data and drop duplicates on a primary key, 
convert back to a table (it would be nice to be able to do this purely in a 
pyarrow table so I could leave out pandas!)
 # Use pq.write_to_dataset() with partition_filename_cb=lambda x: str(x[-1]) + 
".parquet" to write to a final dataset. This allows me to overwrite the 
relevant partitions because the filenames are the same. I can be certain that I 
only have the latest version of each row.

 

This is my approach to come close to what I would achieve with a view in a database. It works fine, but the storage is essentially doubled since I am maintaining two datasets (append-only and final). Our visualization tool connects directly to these parquet files, so there is some benefit in having fewer files (one per partition instead of potentially hundreds) as well.
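A condensed sketch of steps 3 and 4 above (assuming table holds the rows read back for the touched partitions; the column names and root_path are illustrative):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

df = table.to_pandas()
df = df.sort_values(["id", "updated_at"]).drop_duplicates(subset=["id"], keep="last")

pq.write_to_dataset(
    table=pa.Table.from_pandas(df, preserve_index=False),
    root_path="dev/final",
    partition_cols=["date_id"],
    use_legacy_dataset=True,
    # fixed file name per partition, so a re-run replaces the previous file
    partition_filename_cb=lambda keys: str(keys[-1]) + ".parquet",
)
{code}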

> [C++][Python][R][Dataset] Control overwriting vs appending when writing to 
> existing dataset
> ---
>
> Key: ARROW-12358
> URL: https://issues.apache.org/jira/browse/ARROW-12358
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset
> Fix For: 6.0.0
>
>
> Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}} 
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when 
> you are writing to an existing dataset, you de facto overwrite previous data 
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by 
> ensuring the file names are unique (the user can specify the 
> {{basename_template}} to be something unique). There is also ARROW-7706 about 
> silently doubling data (so _not_ overwriting existing data) with the legacy 
> {{parquet.write_to_dataset}} implementation. 
> It could be good to have a "mode" when writing datasets that controls the 
> different possible behaviours. And erroring when there is pre-existing data 
> in the target directory is maybe the safest default, because both appending 
> vs overwriting silently can be surprising behaviour depending on your 
> expectations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13074) [Python] Start with deprecating ParquetDataset custom attributes

2021-07-07 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376448#comment-17376448
 ] 

Lance Dacey commented on ARROW-13074:
-

Sure Joris, I posted it and then I read that you said to keep the discussion separate, so I tried to be sneaky and delete it before you noticed.

> [Python] Start with deprecating ParquetDataset custom attributes
> 
>
> Key: ARROW-13074
> URL: https://issues.apache.org/jira/browse/ARROW-13074
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> As a first step for ARROW-9720, we should start with deprecating 
> attributes/methods of {{pq.ParquetDataset}} that we would definitely not keep 
> / are conflicting with the "dataset API". 
> I am thinking of the {{pieces}} attribute (and the {{ParquetDatasetPiece}} 
> class), the {{partitions}} attribute (and the {{ParquetPartitions}} class). 
> In addition, some of the keywords are also exposed as properties (memory_map, 
> read_dictionary, buffer_size, fs), and could be deprecated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (ARROW-13074) [Python] Start with deprecating ParquetDataset custom attributes

2021-07-06 Thread Lance Dacey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-13074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Dacey updated ARROW-13074:

Comment: was deleted

(was: I have run into a few issues with basename_template:

 

1) If I run tasks in parallel (for example, Airflow downloads data from various 
SQL servers and writes to the same partitions), then there is a chance to 
overwrite existing data (part-0.parquet)

2) If I make the basename_template unique, then I can end up with duplicate 
data inside of my partitions because I am not overwriting what is already there.

 

The way I have been organizing this so far is to use two datasets:

 

*Dataset A*:
 * UUID filenames, so everything is unique. This most likely has duplicate 
values, and most certainly will have old versions of rows (based on an 
updated_at timestamp)
 * This normally has a lot of files per partition since I download data every 
30 minutes - 1 hour in many cases

*Dataset B:*
 * Reads from Dataset A, sorts, drops duplicates, and then resaves using partition_filename_cb

{code:java}
use_legacy_dataset=True, 
partition_filename_cb=lambda x: str(x[-1]) + ".parquet",{code}
 * I normally partition by date_id, so each partition is something like
{code:java}
path/date_id=20210706/20210706.parquet{code}

 * This allows me to have a single file per partition which has the final version of each row with no duplicates. Our visualization tool connects to these fragments directly (Power BI in this case).

 

I think that I might be able to use basename_template if I was careful and made 
sure that I did not write data in parallel, so the part-0.parquet file would be 
overwritten each time. Or perhaps I could list the files in that partition and 
delete them before saving new data (risky if another process might be using 
those files at that time).

 

 
 )

> [Python] Start with deprecating ParquetDataset custom attributes
> 
>
> Key: ARROW-13074
> URL: https://issues.apache.org/jira/browse/ARROW-13074
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> As a first step for ARROW-9720, we should start with deprecating 
> attributes/methods of {{pq.ParquetDataset}} that we would definitely not keep 
> / are conflicting with the "dataset API". 
> I am thinking of the {{pieces}} attribute (and the {{ParquetDatasetPiece}} 
> class), the {{partitions}} attribute (and the {{ParquetPartitions}} class). 
> In addition, some of the keywords are also exposed as properties (memory_map, 
> read_dictionary, buffer_size, fs), and could be deprecated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13074) [Python] Start with deprecating ParquetDataset custom attributes

2021-07-06 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17375901#comment-17375901
 ] 

Lance Dacey commented on ARROW-13074:
-

I have run into a few issues with basename_template:

 

1) If I run tasks in parallel (for example, Airflow downloads data from various 
SQL servers and writes to the same partitions), then there is a chance to 
overwrite existing data (part-0.parquet)

2) If I make the basename_template unique, then I can end up with duplicate 
data inside of my partitions because I am not overwriting what is already there.

 

The way I have been organizing this so far is to use two datasets:

 

*Dataset A*:
 * UUID filenames, so everything is unique. This most likely has duplicate 
values, and most certainly will have old versions of rows (based on an 
updated_at timestamp)
 * This normally has a lot of files per partition since I download data every 
30 minutes - 1 hour in many cases

*Dataset B:*
 * Reads from Dataset A, sorts, drops duplicates, and then resaves using partition_filename_cb

{code:java}
use_legacy_dataset=True, 
partition_filename_cb=lambda x: str(x[-1]) + ".parquet",{code}
 * I normally partition by date_id, so each partition is something like
{code:java}
path/date_id=20210706/20210706.parquet{code}

 * This allows me to have a single file per partition which has the final version of each row with no duplicates. Our visualization tool connects to these fragments directly (Power BI in this case).

 

I think that I might be able to use basename_template if I was careful and made 
sure that I did not write data in parallel, so the part-0.parquet file would be 
overwritten each time. Or perhaps I could list the files in that partition and 
delete them before saving new data (risky if another process might be using 
those files at that time).
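For completeness, a sketch of the unique-name write for Dataset A described above (the path, partition column, and uuid-based template are illustrative):

{code:python}
import uuid
import pyarrow.dataset as ds

ds.write_dataset(
    data=table,
    base_dir="dev/dataset-a",
    format="parquet",
    partitioning=["date_id"],
    # unique prefix per run; {i} is filled in for each file that gets written
    basename_template=f"download-{uuid.uuid4().hex}-{{i}}.parquet",
)
{code}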

 

 
 

> [Python] Start with deprecating ParquetDataset custom attributes
> 
>
> Key: ARROW-13074
> URL: https://issues.apache.org/jira/browse/ARROW-13074
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> As a first step for ARROW-9720, we should start with deprecating 
> attributes/methods of {{pq.ParquetDataset}} that we would definitely not keep 
> / are conflicting with the "dataset API". 
> I am thinking of the {{pieces}} attribute (and the {{ParquetDatasetPiece}} 
> class), the {{partitions}} attribute (and the {{ParquetPartitions}} class). 
> In addition, some of the keywords are also exposed as properties (memory_map, 
> read_dictionary, buffer_size, fs), and could be deprecated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-13074) [Python] Start with deprecating ParquetDataset custom attributes

2021-07-06 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-13074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17375518#comment-17375518
 ] 

Lance Dacey commented on ARROW-13074:
-

 Any idea if this includes the partition_filename_cb function? I am still using 
that pretty extensively to write my "final" datasets that Power BI connects to 
for visualization since it allows me to overwrite each partition.

> [Python] Start with deprecating ParquetDataset custom attributes
> 
>
> Key: ARROW-13074
> URL: https://issues.apache.org/jira/browse/ARROW-13074
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 5.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> As a first step for ARROW-9720, we should start with deprecating 
> attributes/methods of {{pq.ParquetDataset}} that we would definitely not keep 
> / are conflicting with the "dataset API". 
> I am thinking of the {{pieces}} attribute (and the {{ParquetDatasetPiece}} 
> class), the {{partitions}} attribute (and the {{ParquetPartitions}} class). 
> In addition, some of the keywords are also exposed as properties (memory_map, 
> read_dictionary, buffer_size, fs), and could be deprecated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-12364) [Python] [Dataset] Add metadata_collector option to ds.write_dataset()

2021-06-22 Thread Lance Dacey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Dacey closed ARROW-12364.
---
Fix Version/s: 5.0.0
   Resolution: Fixed

> [Python] [Dataset] Add metadata_collector option to ds.write_dataset()
> --
>
> Key: ARROW-12364
> URL: https://issues.apache.org/jira/browse/ARROW-12364
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Parquet, Python
>Affects Versions: 3.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: dataset, parquet, python
> Fix For: 5.0.0
>
>
> The legacy pq.write_to_dataset() has an option to save metadata to a list 
> when writing partitioned data.
> {code:python}
> collector = []
> pq.write_to_dataset(
> table=table,
> root_path=output_path,
> use_legacy_dataset=True,
> metadata_collector=collector,
> )
> fragments = []
> for piece in collector:
> files.append(filesystem.sep.join([output_path, 
> piece.row_group(0).column(0).file_path]))
> {code}
> This allows me to save a list of the specific parquet files which were 
> created when writing the partitions to storage. I use this when scheduling 
> tasks with Airflow.
> Task A downloads data and partitions it --> Task B reads the file fragments 
> which were just saved and transforms it --> Task C creates a list of dataset 
> filters from the file fragments I transformed, reads each filter to into a 
> table and then processes the data further (normally dropping duplicates or 
> selecting a subset of the columns) and saves it for visualization
> {code:java}
> fragments = 
> ['dev/date_id=20180111/transform-split-20210301013200-68.parquet', 
> 'dev/date_id=20180114/transform-split-20210301013200-69.parquet', 
> 'dev/date_id=20180128/transform-split-20210301013200-57.parquet', ]
> {code}
> I can use this list downstream to do two things:
>  1) I can read the list of fragments directly as a new dataset and transform 
> the data
> {code:java}
> ds.dataset(fragments)
> {code}
> 2) I can generate filters from the fragment paths which were saved using 
> ds._get_partition_keys(). This allows me to query the dataset and retrieve 
> all fragments within the partition. For example, if I partition by date and I 
> process data every 30 minutes I might have 48 individual file fragments 
> within a single partition. I need to know to query the *entire* partition 
> instead of reading a single fragment.
> {code:java}
> def consolidate_filters(fragments):
> """Retrieves the partition_expressions from a list of dataset fragments 
> to build a list of unique filters"""
> filters = []
> for frag in fragments:
> partitions = ds._get_partition_keys(frag.partition_expression)
> filter = [(k, "==", v) for k, v in partitions.items()]
> if filter not in filters:
> filters.append(filter)
> return filters
> filter_expression = pq._filters_to_expression(
> filters=consolidate_filters(fragments=fragments)
> )
> {code}
> My current problem is that when I use ds.write_dataset(), I do not have a 
> convenient method for generating a list of the file fragments I just saved. 
> My only choice is to use basename_template and fs.glob() to find a list of 
> the files based on the basename_template pattern. This is much slower and a 
> waste of listing files on blob storage. [Related stackoverflow question with 
> the basis of the approach I am using now 
> |https://stackoverflow.com/questions/66252660/pyarrow-identify-the-fragments-written-or-filters-used-when-writing-a-parquet/66266585#66266585]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12364) [Python] [Dataset] Add metadata_collector option to ds.write_dataset()

2021-06-22 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17367552#comment-17367552
 ] 

Lance Dacey commented on ARROW-12364:
-

I think this is taken care of by ARROW-10440

> [Python] [Dataset] Add metadata_collector option to ds.write_dataset()
> --
>
> Key: ARROW-12364
> URL: https://issues.apache.org/jira/browse/ARROW-12364
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Parquet, Python
>Affects Versions: 3.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: dataset, parquet, python
>
> The legacy pq.write_to_dataset() has an option to save metadata to a list 
> when writing partitioned data.
> {code:python}
> collector = []
> pq.write_to_dataset(
> table=table,
> root_path=output_path,
> use_legacy_dataset=True,
> metadata_collector=collector,
> )
> fragments = []
> for piece in collector:
> files.append(filesystem.sep.join([output_path, 
> piece.row_group(0).column(0).file_path]))
> {code}
> This allows me to save a list of the specific parquet files which were 
> created when writing the partitions to storage. I use this when scheduling 
> tasks with Airflow.
> Task A downloads data and partitions it --> Task B reads the file fragments 
> which were just saved and transforms it --> Task C creates a list of dataset 
> filters from the file fragments I transformed, reads each filter to into a 
> table and then processes the data further (normally dropping duplicates or 
> selecting a subset of the columns) and saves it for visualization
> {code:java}
> fragments = 
> ['dev/date_id=20180111/transform-split-20210301013200-68.parquet', 
> 'dev/date_id=20180114/transform-split-20210301013200-69.parquet', 
> 'dev/date_id=20180128/transform-split-20210301013200-57.parquet', ]
> {code}
> I can use this list downstream to do two things:
>  1) I can read the list of fragments directly as a new dataset and transform 
> the data
> {code:java}
> ds.dataset(fragments)
> {code}
> 2) I can generate filters from the fragment paths which were saved using 
> ds._get_partition_keys(). This allows me to query the dataset and retrieve 
> all fragments within the partition. For example, if I partition by date and I 
> process data every 30 minutes I might have 48 individual file fragments 
> within a single partition. I need to know to query the *entire* partition 
> instead of reading a single fragment.
> {code:java}
> def consolidate_filters(fragments):
> """Retrieves the partition_expressions from a list of dataset fragments 
> to build a list of unique filters"""
> filters = []
> for frag in fragments:
> partitions = ds._get_partition_keys(frag.partition_expression)
> filter = [(k, "==", v) for k, v in partitions.items()]
> if filter not in filters:
> filters.append(filter)
> return filters
> filter_expression = pq._filters_to_expression(
> filters=consolidate_filters(fragments=fragments)
> )
> {code}
> My current problem is that when I use ds.write_dataset(), I do not have a 
> convenient method for generating a list of the file fragments I just saved. 
> My only choice is to use basename_template and fs.glob() to find a list of 
> the files based on the basename_template pattern. This is much slower and a 
> waste of listing files on blob storage. [Related stackoverflow question with 
> the basis of the approach I am using now 
> |https://stackoverflow.com/questions/66252660/pyarrow-identify-the-fragments-written-or-filters-used-when-writing-a-parquet/66266585#66266585]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12364) [Python] [Dataset] Add metadata_collector option to ds.write_dataset()

2021-06-22 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17367199#comment-17367199
 ] 

Lance Dacey commented on ARROW-12364:
-

Hi @jorisvandenbossche, you asked me to create a separate issue for the 
metadata collector for ds.write_dataset. Just wanted to make sure that you had 
a chance to take a look.

I had to switch back to the legacy dataset writer for most projects. Using 
fs.glob() can be very slow on very large datasets with many thousands of files, 
and my workflow often depends on knowing which files were written during a 
previous Airflow task.

> [Python] [Dataset] Add metadata_collector option to ds.write_dataset()
> --
>
> Key: ARROW-12364
> URL: https://issues.apache.org/jira/browse/ARROW-12364
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Parquet, Python
>Affects Versions: 3.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: dataset, parquet, python
>
> The legacy pq.write_to_dataset() has an option to save metadata to a list 
> when writing partitioned data.
> {code:python}
> collector = []
> pq.write_to_dataset(
> table=table,
> root_path=output_path,
> use_legacy_dataset=True,
> metadata_collector=collector,
> )
> fragments = []
> for piece in collector:
> files.append(filesystem.sep.join([output_path, 
> piece.row_group(0).column(0).file_path]))
> {code}
> This allows me to save a list of the specific parquet files which were 
> created when writing the partitions to storage. I use this when scheduling 
> tasks with Airflow.
> Task A downloads data and partitions it --> Task B reads the file fragments 
> which were just saved and transforms it --> Task C creates a list of dataset 
> filters from the file fragments I transformed, reads each filter to into a 
> table and then processes the data further (normally dropping duplicates or 
> selecting a subset of the columns) and saves it for visualization
> {code:java}
> fragments = 
> ['dev/date_id=20180111/transform-split-20210301013200-68.parquet', 
> 'dev/date_id=20180114/transform-split-20210301013200-69.parquet', 
> 'dev/date_id=20180128/transform-split-20210301013200-57.parquet', ]
> {code}
> I can use this list downstream to do two things:
>  1) I can read the list of fragments directly as a new dataset and transform 
> the data
> {code:java}
> ds.dataset(fragments)
> {code}
> 2) I can generate filters from the fragment paths which were saved using 
> ds._get_partition_keys(). This allows me to query the dataset and retrieve 
> all fragments within the partition. For example, if I partition by date and I 
> process data every 30 minutes I might have 48 individual file fragments 
> within a single partition. I need to know to query the *entire* partition 
> instead of reading a single fragment.
> {code:java}
> def consolidate_filters(fragments):
> """Retrieves the partition_expressions from a list of dataset fragments 
> to build a list of unique filters"""
> filters = []
> for frag in fragments:
> partitions = ds._get_partition_keys(frag.partition_expression)
> filter = [(k, "==", v) for k, v in partitions.items()]
> if filter not in filters:
> filters.append(filter)
> return filters
> filter_expression = pq._filters_to_expression(
> filters=consolidate_filters(fragments=fragments)
> )
> {code}
> My current problem is that when I use ds.write_dataset(), I do not have a 
> convenient method for generating a list of the file fragments I just saved. 
> My only choice is to use basename_template and fs.glob() to find a list of 
> the files based on the basename_template pattern. This is much slower and a 
> waste of listing files on blob storage. [Related stackoverflow question with 
> the basis of the approach I am using now 
> |https://stackoverflow.com/questions/66252660/pyarrow-identify-the-fragments-written-or-filters-used-when-writing-a-parquet/66266585#66266585]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset

2021-05-17 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346110#comment-17346110
 ] 

Lance Dacey commented on ARROW-12358:
-

Being able to update and replace specific rows would be very powerful. For my use case, I am basically overwriting the entire partition in order to update a (sometimes tiny) subset of rows. That means that I need to read the existing data for that partition which was saved previously, plus the new data with updated or new rows. Then I need to sort and drop duplicates (I use pandas because there is no simple .drop_duplicates() for a pyarrow table, but adding a pandas step can sometimes complicate the data types), and then I need to overwrite the partition (I use partition_filename_cb to guarantee that the final file name for the partition stays the same).

> [C++][Python][R][Dataset] Control overwriting vs appending when writing to 
> existing dataset
> ---
>
> Key: ARROW-12358
> URL: https://issues.apache.org/jira/browse/ARROW-12358
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset
> Fix For: 5.0.0
>
>
> Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}} 
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when 
> you are writing to an existing dataset, you de facto overwrite previous data 
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by 
> ensuring the file names are unique (the user can specify the 
> {{basename_template}} to be something unique). There is also ARROW-7706 about 
> silently doubling data (so _not_ overwriting existing data) with the legacy 
> {{parquet.write_to_dataset}} implementation. 
> It could be good to have a "mode" when writing datasets that controls the 
> different possible behaviours. And erroring when there is pre-existing data 
> in the target directory is maybe the safest default, because both appending 
> vs overwriting silently can be surprising behaviour depending on your 
> expectations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-12365) [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()

2021-04-29 Thread Lance Dacey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-12365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Dacey closed ARROW-12365.
---
Fix Version/s: 5.0.0
   Resolution: Not A Problem

> [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()
> --
>
> Key: ARROW-12365
> URL: https://issues.apache.org/jira/browse/ARROW-12365
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 3.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: dataset, parquet, python
> Fix For: 5.0.0
>
>
> I need to use the legacy pq.write_to_dataset() in order to guarantee that a 
> file within a partition will have a specific name. 
> My use case is that I need to report on the final version of data and our 
> visualization tool connects directly to our parquet files on Azure Blob 
> (Power BI).
> 1) Download data every hour based on an updated_at timestamp (this data is 
> partitioned by date)
> 2) Transform the data which was just downloaded and save it into a "staging" 
> dataset (this has all versions of the data and there will be many files 
> within each partition. In this case, up to 24 files within a single date 
> partition since we download hourly)
> 3) Filter the transformed data and read a subset of columns, sort it by the 
> updated_at timestamp and drop duplicates on the unique constraint, then 
> partition and save it with partition_filename_cb. In the example below, if I 
> partition by the "date_id" column, then my dataset structure will be 
> "/date_id=202104123/20210413.parquet"
> {code:java}
> use_legacy_dataset=True, partition_filename_cb=lambda x: 
> str(x[-1]) + ".parquet",{code}
> Ultimately, I am sure that this final dataset has exactly one file per 
> partition and that I only have the latest version of each row based on the 
> maximum updated_at timestamp. My visualization tool can safely connect to and 
> incrementally refresh from this dataset.
>  
>  
> An alternative solution would be to allow us to overwrite anything in an 
> existing partition. I do not care about the file names so much as I want to 
> ensure that I am fully replacing any data which might already exist in my 
> partition, and I want to limit the number of physical files.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12365) [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()

2021-04-29 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335292#comment-17335292
 ] 

Lance Dacey commented on ARROW-12365:
-

@jorisvandenbossche I will close this issue in favor of an overwrite option for 
partitions since that is the only reason I use the partition_filename_cb

https://issues.apache.org/jira/browse/ARROW-12358

> [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()
> --
>
> Key: ARROW-12365
> URL: https://issues.apache.org/jira/browse/ARROW-12365
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: Python
>Affects Versions: 3.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: dataset, parquet, python
>
> I need to use the legacy pq.write_to_dataset() in order to guarantee that a 
> file within a partition will have a specific name. 
> My use case is that I need to report on the final version of data and our 
> visualization tool connects directly to our parquet files on Azure Blob 
> (Power BI).
> 1) Download data every hour based on an updated_at timestamp (this data is 
> partitioned by date)
> 2) Transform the data which was just downloaded and save it into a "staging" 
> dataset (this has all versions of the data and there will be many files 
> within each partition. In this case, up to 24 files within a single date 
> partition since we download hourly)
> 3) Filter the transformed data and read a subset of columns, sort it by the 
> updated_at timestamp and drop duplicates on the unique constraint, then 
> partition and save it with partition_filename_cb. In the example below, if I 
> partition by the "date_id" column, then my dataset structure will be 
> "/date_id=202104123/20210413.parquet"
> {code:java}
> use_legacy_dataset=True, partition_filename_cb=lambda x: 
> str(x[-1]) + ".parquet",{code}
> Ultimately, I am sure that this final dataset has exactly one file per 
> partition and that I only have the latest version of each row based on the 
> maximum updated_at timestamp. My visualization tool can safely connect to and 
> incrementally refresh from this dataset.
>  
>  
> An alternative solution would be to allow us to overwrite anything in an 
> existing partition. I do not care about the file names so much as I want to 
> ensure that I am fully replacing any data which might already exist in my 
> partition, and I want to limit the number of physical files.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-11250) [Python] Inconsistent behavior calling ds.dataset()

2021-04-16 Thread Lance Dacey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Dacey closed ARROW-11250.
---
Fix Version/s: (was: 5.0.0)
   3.0.0
   Resolution: Fixed

This was fixed with a new version of the adlfs library

> [Python] Inconsistent behavior calling ds.dataset()
> ---
>
> Key: ARROW-11250
> URL: https://issues.apache.org/jira/browse/ARROW-11250
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> adal  1.2.5  pyh9f0ad1d_0conda-forge
> adlfs 0.5.9  pyhd8ed1ab_0conda-forge
> apache-airflow1.10.14  pypi_0pypi
> azure-common  1.1.24 py_0conda-forge
> azure-core1.9.0  pyhd3deb0d_0conda-forge
> azure-datalake-store  0.0.51 pyh9f0ad1d_0conda-forge
> azure-identity1.5.0  pyhd8ed1ab_0conda-forge
> azure-nspkg   3.0.2  py_0conda-forge
> azure-storage-blob12.6.0 pyhd3deb0d_0conda-forge
> azure-storage-common  2.1.0py37hc8dfbb8_3conda-forge
> fsspec0.8.5  pyhd8ed1ab_0conda-forge
> jupyterlab_pygments   0.1.2  pyh9f0ad1d_0conda-forge
> pandas1.2.0py37ha9443f7_0
> pyarrow   2.0.0   py37h4935f41_6_cpuconda-forge
>Reporter: Lance Dacey
>Priority: Minor
>  Labels: azureblob, dataset,, python
> Fix For: 3.0.0
>
>
> In a Jupyter notebook, I have noticed that sometimes I am not able to read a 
> dataset which certainly exists on Azure Blob.
>  
> {code:java}
> fs = fsspec.filesystem(protocol="abfs", account_name, account_key)
> {code}
>  
> One example of this is reading a dataset in one cell:
>  
> {code:java}
> ds.dataset("dev/test-split", partitioning="hive", filesystem=fs){code}
>  
> Then in another cell I try to read the same dataset:
>  
> {code:java}
> ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
> ---
> FileNotFoundError Traceback (most recent call last)
>  in 
> > 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, 
> schema, format, filesystem, partitioning, partition_base_dir, 
> exclude_invalid_files, ignore_prefixes)
> 669 # TODO(kszucs): support InMemoryDataset for a table input
> 670 if _is_path_like(source):
> --> 671 return _filesystem_dataset(source, **kwargs)
> 672 elif isinstance(source, (tuple, list)):
> 673 if all(_is_path_like(elem) for elem in source):
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> _filesystem_dataset(source, schema, filesystem, partitioning, format, 
> partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
> 426 fs, paths_or_selector = _ensure_multiple_sources(source, 
> filesystem)
> 427 else:
> --> 428 fs, paths_or_selector = _ensure_single_source(source, 
> filesystem)
> 429 
> 430 options = FileSystemFactoryOptions(
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> _ensure_single_source(path, filesystem)
> 402 paths_or_selector = [path]
> 403 else:
> --> 404 raise FileNotFoundError(path)
> 405 
> 406 return filesystem, paths_or_selector
> FileNotFoundError: dev/test-split
> {code}
>  
> If I reset the kernel, it works again. It also works if I change the path 
> slightly, like adding a "/" at the end (so basically it just does not work if 
> I read the same dataset twice):
>  
> {code:java}
> ds.dataset("dev/test-split/", partitioning="hive", filesystem=fs)
> {code}
>  
>  
> The other strange behavior I have noticed is that if I read a dataset 
> inside of my Jupyter notebook,
>  
> {code:java}
> %%time
> dataset = ds.dataset("dev/test-split", 
> partitioning=ds.partitioning(pa.schema([("date", pa.date32())]), 
> flavor="hive"), 
> filesystem=fs,
> exclude_invalid_files=False)
> CPU times: user 1.98 s, sys: 0 ns, total: 1.98 s Wall time: 2.58 s{code}
>  
> Now, on the exact same server when I try to run the same code against the 
> same dataset in Airflow it takes over 3 minutes (comparing the timestamps in 
> my logs between right before I read the dataset, and immediately after the 
> dataset is available to filter):
> {code:java}
> [2021-01-14 03:52:04,011] INFO - Reading dev/test-split
> [2021-01-14 03:55:17,360] INFO - Processing dataset in batches
> {code}

[jira] [Closed] (ARROW-9682) [Python] Unable to specify the partition style with pq.write_to_dataset

2021-04-16 Thread Lance Dacey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-9682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Dacey closed ARROW-9682.
--
Resolution: Not A Problem

This works using ds.write_dataset()
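
For reference, a minimal sketch of the ds.write_dataset() equivalent with a directory-style partitioning (the table and paths below are purely illustrative):

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.dataset import DirectoryPartitioning

# tiny example table; real data would come from elsewhere
table = pa.table({
    "year": pa.array([2009, 2009], pa.int16()),
    "month": pa.array([11, 12], pa.int8()),
    "day": pa.array([3, 4], pa.int8()),
    "value": [1.0, 2.0],
})

part = DirectoryPartitioning(
    pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())])
)

# produces paths like dev/test-directory/2009/11/3/part-0.parquet
ds.write_dataset(
    data=table,
    base_dir="dev/test-directory",
    format="parquet",
    partitioning=part,
)
{code}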

> [Python] Unable to specify the partition style with pq.write_to_dataset
> ---
>
> Key: ARROW-9682
> URL: https://issues.apache.org/jira/browse/ARROW-9682
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 1.0.0
> Environment: Ubuntu 18.04
> Python 3.7
>Reporter: Lance Dacey
>Priority: Major
>  Labels: dataset-parquet-write, parquet, parquetWriter
>
> I am able to import and test DirectoryPartitioning but I am not able to 
> figure out a way to write a dataset using this feature. It seems like 
> write_to_dataset defaults to the "hive" style. Is there a way to test this?
> {code:java}
> from pyarrow.dataset import DirectoryPartitioning
> partitioning = DirectoryPartitioning(pa.schema([("year", pa.int16()), 
> ("month", pa.int8()), ("day", pa.int8())]))
> print(partitioning.parse("/2009/11/3"))
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset

2021-04-13 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320221#comment-17320221
 ] 

Lance Dacey commented on ARROW-12358:
-

I think that having an "overwrite" option would satisfy my need for 
partition_filename_cb (https://issues.apache.org/jira/browse/ARROW-12365) if we 
can replace _all_ data inside the partition. This would be great for file 
compaction as well - we could read a dataset with a lot of tiny file fragments 
and then overwrite it.

Overwriting a specific file is also useful. For example, my basename_template 
is usually f"\{task-id}-\{schedule-timestamp}-\{file-count}-\{i}.parquet". I 
am able to clear a task and overwrite a file which already exists. The only flaw 
here is that we cannot control the \{i} variable so I guess it is not 
guaranteed. I could live without this.

For "append", is it possible for the counter to be per partition instead 
(potential race conditions if multiple tasks write to the same partition in 
parallel perhaps, and it seems to be a more demanding step for large 
datasets..)? Or could the \{i} variable optionally be a uuid instead of the 
fragment count?

"error" makes sense. 

> [C++][Python][R][Dataset] Control overwriting vs appending when writing to 
> existing dataset
> ---
>
> Key: ARROW-12358
> URL: https://issues.apache.org/jira/browse/ARROW-12358
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset
> Fix For: 5.0.0
>
>
> Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}} 
> uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when 
> you are writing to an existing dataset, you de facto overwrite previous data 
> when using this default template.
> There is some discussion in ARROW-10695 about how the user can avoid this by 
> ensuring the file names are unique (the user can specify the 
> {{basename_template}} to be something unique). There is also ARROW-7706 about 
> silently doubling data (so _not_ overwriting existing data) with the legacy 
> {{parquet.write_to_dataset}} implementation. 
> It could be good to have a "mode" when writing datasets that controls the 
> different possible behaviours. And erroring when there is pre-existing data 
> in the target directory is maybe the safest default, because both appending 
> vs overwriting silently can be surprising behaviour depending on your 
> expectations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10695) [C++][Dataset] Allow to use a UUID in the basename_template when writing a dataset

2021-04-13 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320192#comment-17320192
 ] 

Lance Dacey commented on ARROW-10695:
-

[~jorisvandenbossche] 
partition_filename_cb: https://issues.apache.org/jira/browse/ARROW-12358
metadata_collector: https://issues.apache.org/jira/browse/ARROW-12365

> [C++][Dataset] Allow to use a UUID in the basename_template when writing a 
> dataset
> --
>
> Key: ARROW-10695
> URL: https://issues.apache.org/jira/browse/ARROW-10695
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Minor
>  Labels: dataset, dataset-parquet-write
> Fix For: 5.0.0
>
>
> Currently we allow the user to specify a {{basename_template}}, and this can 
> include a {{"\{i\}"}} part to replace it with an automatically incremented 
> integer (so each generated file written to a single partition is unique):
> https://github.com/apache/arrow/blob/master/python/pyarrow/dataset.py#L713-L717
> It _might_ be useful to also have the ability to use a UUID, to ensure the 
> file is unique in general (not only for a single write) and to mimic the 
> behaviour of the old {{write_to_dataset}} implementation.
> For example, we could look for a {{"\{uuid\}"}} in the template string, and 
> if present replace it for each file with a new UUID.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12365) [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()

2021-04-13 Thread Lance Dacey (Jira)
Lance Dacey created ARROW-12365:
---

 Summary: [Python] [Dataset] Add partition_filename_cb to 
ds.write_dataset()
 Key: ARROW-12365
 URL: https://issues.apache.org/jira/browse/ARROW-12365
 Project: Apache Arrow
  Issue Type: Wish
  Components: Python
Affects Versions: 3.0.0
 Environment: Ubuntu 18.04
Reporter: Lance Dacey


I need to use the legacy pq.write_to_dataset() in order to guarantee that a 
file within a partition will have a specific name. 

My use case is that I need to report on the final version of data and our 
visualization tool connects directly to our parquet files on Azure Blob (Power 
BI).

1) Download data every hour based on an updated_at timestamp (this data is 
partitioned by date)

2) Transform the data which was just downloaded and save it into a "staging" 
dataset (this has all versions of the data and there will be many files within 
each partition. In this case, up to 24 files within a single date partition 
since we download hourly)

3) Filter the transformed data and read a subset of columns, sort it by the 
updated_at timestamp and drop duplicates on the unique constraint, then 
partition and save it with partition_filename_cb. In the example below, if I 
partition by the "date_id" column, then my dataset structure will be 
"/date_id=202104123/20210413.parquet"
{code:java}
use_legacy_dataset=True, partition_filename_cb=lambda x: 
str(x[-1]) + ".parquet",{code}
Ultimately, I am sure that this final dataset has exactly one file per 
partition and that I only have the latest version of each row based on the 
maximum updated_at timestamp. My visualization tool can safely connect to and 
incrementally refresh from this dataset.
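
A minimal sketch of step 3 with the legacy writer (the table, column names, and root path below are only illustrative):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# pretend this is the de-duplicated "latest version" table from step 3
table = pa.table({"date_id": [20210413, 20210413, 20210414], "value": [1, 2, 3]})

pq.write_to_dataset(
    table=table,
    root_path="dev/latest-version",
    partition_cols=["date_id"],
    use_legacy_dataset=True,
    # exactly one predictable file per partition, e.g. date_id=20210413/20210413.parquet
    partition_filename_cb=lambda keys: str(keys[-1]) + ".parquet",
)
{code}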

 

 

An alternative solution would be to allow us to overwrite anything in an 
existing partition. I do not care about the file names so much as I want to 
ensure that I am fully replacing any data which might already exist in my 
partition, and I want to limit the number of physical files.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-12364) [Python] [Dataset] Add metadata_collector option to ds.write_dataset()

2021-04-13 Thread Lance Dacey (Jira)
Lance Dacey created ARROW-12364:
---

 Summary: [Python] [Dataset] Add metadata_collector option to 
ds.write_dataset()
 Key: ARROW-12364
 URL: https://issues.apache.org/jira/browse/ARROW-12364
 Project: Apache Arrow
  Issue Type: Wish
  Components: Parquet, Python
Affects Versions: 3.0.0
 Environment: Ubuntu 18.04
Reporter: Lance Dacey


The legacy pq.write_to_dataset() has an option to save metadata to a list when 
writing partitioned data.
{code:python}
collector = []
pq.write_to_dataset(
table=table,
root_path=output_path,
use_legacy_dataset=True,
metadata_collector=collector,
)
fragments = []
for piece in collector:
    fragments.append(
        filesystem.sep.join([output_path, piece.row_group(0).column(0).file_path])
    )
{code}
This allows me to save a list of the specific parquet files which were created 
when writing the partitions to storage. I use this when scheduling tasks with 
Airflow.

Task A downloads data and partitions it --> Task B reads the file fragments 
which were just saved and transforms it --> Task C creates a list of dataset 
filters from the file fragments I transformed, reads each filter to into a 
table and then processes the data further (normally dropping duplicates or 
selecting a subset of the columns) and saves it for visualization
{code:java}
fragments = ['dev/date_id=20180111/transform-split-20210301013200-68.parquet', 
'dev/date_id=20180114/transform-split-20210301013200-69.parquet', 
'dev/date_id=20180128/transform-split-20210301013200-57.parquet', ]
{code}
I can use this list downstream to do two things:
 1) I can read the list of fragments directly as a new dataset and transform 
the data
{code:java}
ds.dataset(fragments)
{code}
2) I can generate filters from the fragment paths which were saved using 
ds._get_partition_keys(). This allows me to query the dataset and retrieve all 
fragments within the partition. For example, if I partition by date and I 
process data every 30 minutes I might have 48 individual file fragments within 
a single partition. I need to know to query the *entire* partition instead of 
reading a single fragment.
{code:java}
def consolidate_filters(fragments):
    """Retrieve the partition_expressions from a list of dataset fragments
    to build a list of unique filters."""
    filters = []
    for frag in fragments:
        partitions = ds._get_partition_keys(frag.partition_expression)
        filter = [(k, "==", v) for k, v in partitions.items()]
        if filter not in filters:
            filters.append(filter)
    return filters

filter_expression = pq._filters_to_expression(
    filters=consolidate_filters(fragments=fragments)
)
{code}
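
The resulting expression can then be passed straight back to the dataset - a sketch, assuming `dataset` is the staging dataset opened with ds.dataset() and `filter_expression` is the result from above; the column names are illustrative:

{code:python}
# read back only the partitions that were just written
table = dataset.to_table(
    filter=filter_expression,
    columns=["id", "updated_at"],  # illustrative column names
)
{code}
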
My current problem is that when I use ds.write_dataset(), I do not have a 
convenient method for generating a list of the file fragments I just saved. My 
only choice is to use basename_template and fs.glob() to find a list of the 
files based on the basename_template pattern. This is much slower and wastes 
file-listing calls on blob storage. [Related stackoverflow question with the 
basis of the approach I am using now 
|https://stackoverflow.com/questions/66252660/pyarrow-identify-the-fragments-written-or-filters-used-when-writing-a-parquet/66266585#66266585]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10695) [C++][Dataset] Allow to use a UUID in the basename_template when writing a dataset

2021-04-13 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320167#comment-17320167
 ] 

Lance Dacey commented on ARROW-10695:
-

I have been creating my own basename_template with either a uuid or a name with 
the task+timestamp of when the data was processed and it has worked well. I 
like that approach better than the uuid filename actually. 
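
A minimal sketch of that workaround (the table, template pieces, and paths are illustrative):

{code:python}
import uuid
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"date": ["2021-04-13", "2021-04-13"], "value": [1, 2]})

# unique prefix per task run; Arrow still fills in {i} for each fragment
basename = f"transform-split-{uuid.uuid4().hex}" + "-{i}.parquet"

ds.write_dataset(
    data=table,
    base_dir="dev/staging",
    basename_template=basename,
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("date", pa.string())]), flavor="hive"),
)
{code}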

I think the remaining issue with the default part-{i} template is that it can 
also be a bit inconsistent when writing data in loops. Say I am processing a 
directory of files one by one in a loop and I partition the data on the "date" 
column. A lot of the files will just overwrite the part-0.parquet file, but you 
might also see part-11.parquet or another random filename. I suppose the 
surprising part is that write_dataset() does not always append new random files 
nor does it *always* overwrite what is there. This does not impact me now that 
I customize the basename_template though, but I think an "append" or "replace" 
flag would make a lot of sense

I'll open another issue with my use case for metadata_collector and 
partition_filename_cb which I am using heavily

> [C++][Dataset] Allow to use a UUID in the basename_template when writing a 
> dataset
> --
>
> Key: ARROW-10695
> URL: https://issues.apache.org/jira/browse/ARROW-10695
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Minor
>  Labels: dataset, dataset-parquet-write
> Fix For: 5.0.0
>
>
> Currently we allow the user to specify a {{basename_template}}, and this can 
> include a {{"\{i\}"}} part to replace it with an automatically incremented 
> integer (so each generated file written to a single partition is unique):
> https://github.com/apache/arrow/blob/master/python/pyarrow/dataset.py#L713-L717
> It _might_ be useful to also have the ability to use a UUID, to ensure the 
> file is unique in general (not only for a single write) and to mimic the 
> behaviour of the old {{write_to_dataset}} implementation.
> For example, we could look for a {{"\{uuid\}"}} in the template string, and 
> if present replace it for each file with a new UUID.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10695) [C++][Dataset] Allow to use a UUID in the basename_template when writing a dataset

2021-03-23 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17306996#comment-17306996
 ] 

Lance Dacey commented on ARROW-10695:
-

Sorry, did not see a notification for this. Hm - I am not sure how to provide a 
minimal example easily. The issue is when multiple machines are writing to the 
same dataset at the same time into the same partition. 

For example, machine A downloads data from server 1 and saves it to the dataset 
at the same time as machine B downloading data and saving data from server 2.

My workaround for now was to  ensure that the basename_template is a unique 
value. Initially, I was using a UUID filename as the basename_template, but I 
need to be able to use fs.glob() to get a list of all of the fragments which 
were just written to process them in downstream tasks. Unfortunately, there is 
no metadata_collector for ds.write_dataset() yet.

> [C++][Dataset] Allow to use a UUID in the basename_template when writing a 
> dataset
> --
>
> Key: ARROW-10695
> URL: https://issues.apache.org/jira/browse/ARROW-10695
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Minor
>  Labels: dataset, dataset-parquet-write
> Fix For: 5.0.0
>
>
> Currently we allow the user to specify a {{basename_template}}, and this can 
> include a {{"\{i\}"}} part to replace it with an automatically incremented 
> integer (so each generated file written to a single partition is unique):
> https://github.com/apache/arrow/blob/master/python/pyarrow/dataset.py#L713-L717
> It _might_ be useful to also have the ability to use a UUID, to ensure the 
> file is unique in general (not only for a single write) and to mimic the 
> behaviour of the old {{write_to_dataset}} implementation.
> For example, we could look for a {{"\{uuid\}"}} in the template string, and 
> if present replace it for each file with a new UUID.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10440) [C++][Dataset][Python] Add a callback to visit file writers just before Finish()

2021-03-11 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299754#comment-17299754
 ] 

Lance Dacey commented on ARROW-10440:
-

Can someone confirm if this issue would cover my use case or if I should add a 
separate feature request issue? My goal is to simply be able to retrieve the 
list of fragment paths which were saved using the ds.write_dataset() function.

I believe it does since I am using the metadata_collector argument to gather 
this information with the legacy dataset, but let me know if this is different. 
thanks!

> [C++][Dataset][Python] Add a callback to visit file writers just before 
> Finish()
> 
>
> Key: ARROW-10440
> URL: https://issues.apache.org/jira/browse/ARROW-10440
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 2.0.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 5.0.0
>
>
> This will fill the role of (for example) {{metadata_collector}} or allow 
> stats to be embedded in IPC file footer metadata.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition

2021-03-10 Thread Lance Dacey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Dacey closed ARROW-10694.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

https://github.com/dask/adlfs/pull/193

> [Python] ds.write_dataset() generates empty files for each final partition
> --
>
> Key: ARROW-10694
> URL: https://issues.apache.org/jira/browse/ARROW-10694
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> Python 3.8.6
> adlfs master branch
>Reporter: Lance Dacey
>Priority: Major
>  Labels: dataset
> Fix For: 3.0.0
>
>
> ds.write_dataset() is generating empty files for the final partition folder 
> which causes errors when reading the dataset or converting a dataset to a 
> table.
> I believe this may be caused by fs.mkdir(). Without the final slash in the 
> path, an empty file is created in the "dev" container:
>  
> {code:java}
> fs = fsspec.filesystem(protocol='abfs', account_name=base.login, 
> account_key=base.password)
> fs.mkdir("dev/test2")
> {code}
>  
> If the final slash is added, a proper folder is created:
> {code:java}
> fs.mkdir("dev/test2/"){code}
>  
> Here is a full example of what happens with ds.write_dataset:
> {code:java}
> schema = pa.schema(
> [
> ("year", pa.int16()),
> ("month", pa.int8()),
> ("day", pa.int8()),
> ("report_date", pa.date32()),
> ("employee_id", pa.string()),
> ("designation", pa.dictionary(index_type=pa.int16(), 
> value_type=pa.string())),
> ]
> )
> part = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", 
> pa.int8()), ("day", pa.int8())]))
> ds.write_dataset(data=table, 
>  base_dir="dev/test-dataset", 
>  basename_template="test-{i}.parquet", 
>  format="parquet",
>  partitioning=part, 
>  schema=schema,
>  filesystem=fs)
> dataset.files
> #sample printed below, note the empty files
> [
>  'dev/test-dataset/2018/1/1/test-0.parquet',
>  'dev/test-dataset/2018/10/1',
>  'dev/test-dataset/2018/10/1/test-27.parquet',
>  'dev/test-dataset/2018/3/1',
>  'dev/test-dataset/2018/3/1/test-6.parquet',
>  'dev/test-dataset/2020/1/1',
>  'dev/test-dataset/2020/1/1/test-2.parquet',
>  'dev/test-dataset/2020/10/1',
>  'dev/test-dataset/2020/10/1/test-29.parquet',
>  'dev/test-dataset/2020/11/1',
>  'dev/test-dataset/2020/11/1/test-32.parquet',
>  'dev/test-dataset/2020/2/1',
>  'dev/test-dataset/2020/2/1/test-5.parquet',
>  'dev/test-dataset/2020/7/1',
>  'dev/test-dataset/2020/7/1/test-20.parquet',
>  'dev/test-dataset/2020/8/1',
>  'dev/test-dataset/2020/8/1/test-23.parquet',
>  'dev/test-dataset/2020/9/1',
>  'dev/test-dataset/2020/9/1/test-26.parquet'
> ]{code}
> As you can see, there is an empty file for each "day" partition. I was not 
> even able to read the dataset at all until I manually deleted the first empty 
> file in the dataset (2018/1/1).
> I then get an error when I try to use the to_table() method:
> {code:java}
> OSError                                   Traceback (most recent call last)
> <ipython-input> in <module>
> ----> 1 dataset.to_table()
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()
> /opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
> /opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
> OSError: Could not open parquet input source 'dev/test-dataset/2018/10/1': Invalid: Parquet file size is 0 bytes
> {code}
> If I manually delete the empty file, I can then use the to_table() function:
> {code:java}
> dataset.to_table(filter=(ds.field("year") == 2020) & (ds.field("month") == 
> 10)).to_pandas()
> {code}
> Is this a bug with pyarrow, adlfs, or fsspec?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition

2021-03-10 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298828#comment-17298828
 ] 

Lance Dacey commented on ARROW-10694:
-

This is being worked on in the adlfs library so I will close this. There are 
working adlfs branches that I have tested, but they have unfortunately also 
included new problems. Hopefully there will be a final solution soon.
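
Until then, one possible workaround is to let dataset discovery skip the zero-byte marker blobs via exclude_invalid_files - a sketch only, with placeholder credentials:

{code:python}
import fsspec
import pyarrow as pa
import pyarrow.dataset as ds
from pyarrow.dataset import DirectoryPartitioning

fs = fsspec.filesystem("abfs", account_name="<account>", account_key="<key>")

part = DirectoryPartitioning(
    pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())])
)

# exclude_invalid_files makes discovery open each file and drop anything that
# is not valid parquet, which skips the 0-byte "directory" entries above
dataset = ds.dataset(
    "dev/test-dataset",
    format="parquet",
    partitioning=part,
    filesystem=fs,
    exclude_invalid_files=True,
)
{code}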

> [Python] ds.write_dataset() generates empty files for each final partition
> --
>
> Key: ARROW-10694
> URL: https://issues.apache.org/jira/browse/ARROW-10694
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> Python 3.8.6
> adlfs master branch
>Reporter: Lance Dacey
>Priority: Major
>  Labels: dataset
>
> ds.write_dataset() is generating empty files for the final partition folder 
> which causes errors when reading the dataset or converting a dataset to a 
> table.
> I believe this may be caused by fs.mkdir(). Without the final slash in the 
> path, an empty file is created in the "dev" container:
>  
> {code:java}
> fs = fsspec.filesystem(protocol='abfs', account_name=base.login, 
> account_key=base.password)
> fs.mkdir("dev/test2")
> {code}
>  
> If the final slash is added, a proper folder is created:
> {code:java}
> fs.mkdir("dev/test2/"){code}
>  
> Here is a full example of what happens with ds.write_dataset:
> {code:java}
> schema = pa.schema(
> [
> ("year", pa.int16()),
> ("month", pa.int8()),
> ("day", pa.int8()),
> ("report_date", pa.date32()),
> ("employee_id", pa.string()),
> ("designation", pa.dictionary(index_type=pa.int16(), 
> value_type=pa.string())),
> ]
> )
> part = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", 
> pa.int8()), ("day", pa.int8())]))
> ds.write_dataset(data=table, 
>  base_dir="dev/test-dataset", 
>  basename_template="test-{i}.parquet", 
>  format="parquet",
>  partitioning=part, 
>  schema=schema,
>  filesystem=fs)
> dataset.files
> #sample printed below, note the empty files
> [
>  'dev/test-dataset/2018/1/1/test-0.parquet',
>  'dev/test-dataset/2018/10/1',
>  'dev/test-dataset/2018/10/1/test-27.parquet',
>  'dev/test-dataset/2018/3/1',
>  'dev/test-dataset/2018/3/1/test-6.parquet',
>  'dev/test-dataset/2020/1/1',
>  'dev/test-dataset/2020/1/1/test-2.parquet',
>  'dev/test-dataset/2020/10/1',
>  'dev/test-dataset/2020/10/1/test-29.parquet',
>  'dev/test-dataset/2020/11/1',
>  'dev/test-dataset/2020/11/1/test-32.parquet',
>  'dev/test-dataset/2020/2/1',
>  'dev/test-dataset/2020/2/1/test-5.parquet',
>  'dev/test-dataset/2020/7/1',
>  'dev/test-dataset/2020/7/1/test-20.parquet',
>  'dev/test-dataset/2020/8/1',
>  'dev/test-dataset/2020/8/1/test-23.parquet',
>  'dev/test-dataset/2020/9/1',
>  'dev/test-dataset/2020/9/1/test-26.parquet'
> ]{code}
> As you can see, there is an empty file for each "day" partition. I was not 
> even able to read the dataset at all until I manually deleted the first empty 
> file in the dataset (2018/1/1).
> I then get an error when I try to use the to_table() method:
> {code:java}
> OSError   Traceback (most recent call last)
>  in 
> > 1 
> dataset.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx 
> in 
> pyarrow._dataset.Dataset.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx
>  in 
> pyarrow._dataset.Scanner.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi
>  in 
> pyarrow.lib.pyarrow_internal_check_status()/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()OSError: Could not open parquet input source 
> 'dev/test-dataset/2018/10/1': Invalid: Parquet file size is 0 bytes
> {code}
> If I manually delete the empty file, I can then use the to_table() function:
> {code:java}
> dataset.to_table(filter=(ds.field("year") == 2020) & (ds.field("month") == 
> 10)).to_pandas()
> {code}
> Is this a bug with pyarrow, adlfs, or fsspec?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10440) [C++][Dataset][Python] Add a callback to visit file writers just before Finish()

2021-03-03 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294650#comment-17294650
 ] 

Lance Dacey commented on ARROW-10440:
-

Will this change allow us to get a list of the blob paths which were saved as 
file fragments?

I am currently using fs.glob() to find a list of files which were just saved 
using a specific basename_template as a work around.


{code:java}
pattern = filesystem.sep.join([output_path, f"**{base_template}-*"])
files = filesystem.glob(
    pattern,
    details=False,
    invalidate_cache=True,
)
{code}

However, with the legacy write_to_dataset(), I am able to use the 
metadata_collector and then create a list of the file paths like this, which is 
more convenient (I do not have to worry about generating unique/predictable 
basename templates).


{code:java}
files = []
for piece in collector:
    files.append(
        filesystem.sep.join([output_path, piece.row_group(0).column(0).file_path])
    )
{code}


I need these lists of blobs to pass along to other Airflow tasks, which either 
read them as a dataset or generate a list of filters from the paths.
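
For the filter side, a rough sketch of one way to turn the saved paths into a single dataset filter (the paths and hive-style layout are illustrative):

{code:python}
import pyarrow.dataset as ds

files = [
    "dev/date_id=20180111/transform-split-20210301013200-68.parquet",
    "dev/date_id=20180114/transform-split-20210301013200-69.parquet",
]

def partition_filter(paths):
    """Combine one equality clause per hive-style partition directory into a
    single OR-ed dataset filter expression."""
    expr = None
    for path in paths:
        key, value = path.split("/")[1].split("=")
        clause = ds.field(key) == int(value)
        expr = clause if expr is None else expr | clause
    return expr

# dataset.to_table(filter=partition_filter(files)) then reads whole partitions
{code}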


> [C++][Dataset][Python] Add a callback to visit file writers just before 
> Finish()
> 
>
> Key: ARROW-10440
> URL: https://issues.apache.org/jira/browse/ARROW-10440
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 2.0.0
>Reporter: Ben Kietzman
>Assignee: Ben Kietzman
>Priority: Major
> Fix For: 4.0.0
>
>
> This will fill the role of (for example) {{metadata_collector}} or allow 
> stats to be embedded in IPC file footer metadata.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10695) [C++][Dataset] Allow to use a UUID in the basename_template when writing a dataset

2021-02-17 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286036#comment-17286036
 ] 

Lance Dacey commented on ARROW-10695:
-

Perhaps this has changed, but I was running into issues when writing to a 
dataset in parallel. 

For example, I use Airflow to extract data from 6 different servers in parallel 
(separate tasks are used to download data from each source "extract_cms_1", 
"extract_cms_2") using turbodbc which fetches the data in pyarrow tables --> 
this data is written to Azure Blob using ds.write_dataset()

I noticed that the part-{i} names were clashing when this happened. part-0 
would be replaced a few times for example, and it seemed random or hinted at 
race conditions. I have another Airflow DAG which is downloading from 74 
different REST APIs as well (the downloads can happen simultaneously but the 
source and credentials used are different per account).

Adding the guid() to the filenames solved that issue for me. 

Is there a separate issue open for the partition_filename_cb to be added to 
ds.write_dataset()? I have been using that feature to "repartition" Dataset A 
with many small files into Dataset B with one file per partition (larger 
physical file, less fragments).

> [C++][Dataset] Allow to use a UUID in the basename_template when writing a 
> dataset
> --
>
> Key: ARROW-10695
> URL: https://issues.apache.org/jira/browse/ARROW-10695
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Minor
>  Labels: dataset, dataset-parquet-write
> Fix For: 4.0.0
>
>
> Currently we allow the user to specify a {{basename_template}}, and this can 
> include a {{"\{i\}"}} part to replace it with an automatically incremented 
> integer (so each generated file written to a single partition is unique):
> https://github.com/apache/arrow/blob/master/python/pyarrow/dataset.py#L713-L717
> It _might_ be useful to also have the ability to use a UUID, to ensure the 
> file is unique in general (not only for a single write) and to mimic the 
> behaviour of the old {{write_to_dataset}} implementation.
> For example, we could look for a {{"\{uuid\}"}} in the template string, and 
> if present replace it for each file with a new UUID.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11453) [Python] [Dataset] Unable to use write_dataset() to Azure Blob with adlfs 0.6.0

2021-02-01 Thread Lance Dacey (Jira)
Lance Dacey created ARROW-11453:
---

 Summary: [Python] [Dataset] Unable to use write_dataset() to Azure 
Blob with adlfs 0.6.0
 Key: ARROW-11453
 URL: https://issues.apache.org/jira/browse/ARROW-11453
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 3.0.0
 Environment: This environment results in an error:

adlfs v0.6.0
fsspec 0.8.5
azure.storage.blob 12.6.0
adal 1.2.6
pandas 1.2.1
pyarrow 3.0.0
Reporter: Lance Dacey


https://github.com/dask/adlfs/issues/171

I am unable to save data to Azure Blob using ds.write_dataset() with pyarrow 
3.0 and adlfs 0.6.0. Reverting to 0.5.9 fixes the issue, but I am not sure what 
the cause is - posting this here in case there were filesystem changes in 
pyarrow recently which are incompatible with changes made in adlfs.



{code:java}
  File "pyarrow/_dataset.pyx", line 2343, in 
pyarrow._dataset._filesystemdataset_write
  File "pyarrow/_fs.pyx", line 1032, in pyarrow._fs._cb_create_dir
  File "/opt/conda/lib/python3.8/site-packages/pyarrow/fs.py", line 259, in 
create_dir
self.fs.mkdir(path, create_parents=recursive)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 121, in 
wrapper
return maybe_sync(func, self, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 100, in 
maybe_sync
return sync(loop, func, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync
raise exc.with_traceback(tb)
  File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 55, in f
result[0] = await future
  File "/opt/conda/lib/python3.8/site-packages/adlfs/spec.py", line 1033, in 
_mkdir
raise FileExistsError(
FileExistsError: Cannot overwrite existing Azure container -- dev already 
exists.  
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (ARROW-11390) [Python] pyarrow 3.0 issues with turbodbc

2021-01-27 Thread Lance Dacey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-11390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Dacey closed ARROW-11390.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

I reorganized my Dockerfile to ensure that pyarrow 3.0 was installed before 
turbodbc (there was a base image which was installing 2.0), and I believe that 
conda-forge was updated for turbodbc as well

> [Python] pyarrow 3.0 issues with turbodbc
> -
>
> Key: ARROW-11390
> URL: https://issues.apache.org/jira/browse/ARROW-11390
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
> Environment: pyarrow 3.0.0
> fsspec 0.8.4
> adlfs v0.5.9
> pandas 1.2.1
> numpy 1.19.5
> turbodbc 4.1.1
>Reporter: Lance Dacey
>Priority: Major
>  Labels: python, turbodbc
> Fix For: 3.0.0
>
>
> This is more of a turbodbc issue I think, but perhaps someone here would have 
> some idea of what changed to cause potential issues. 
> {code:java}
> cursor = connection.cursor()
> cursor.execute("select top 10 * from dbo.tickets")
> table = cursor.fetchallarrow(){code}
> I am able to run table.num_rows and it will print out 10.
> If I run table.to_pandas() or table.schema or try to write the table to a 
> dataset, my kernel dies with no explanation. I reverted back to pyarrow 2.0 
> and the same code works again.
> [https://github.com/blue-yonder/turbodbc/issues/289]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11390) [Python] pyarrow 3.0 issues with turbodbc

2021-01-27 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273199#comment-17273199
 ] 

Lance Dacey commented on ARROW-11390:
-

Everything seems to be all set now, thanks!

pyarrow 3.0.0
fsspec 0.8.4
adlfs v0.5.9
pandas 1.2.1
numpy 1.19.5
turbodbc 4.1.1

> [Python] pyarrow 3.0 issues with turbodbc
> -
>
> Key: ARROW-11390
> URL: https://issues.apache.org/jira/browse/ARROW-11390
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
> Environment: pyarrow 3.0.0
> fsspec 0.8.4
> adlfs v0.5.9
> pandas 1.2.1
> numpy 1.19.5
> turbodbc 4.1.1
>Reporter: Lance Dacey
>Priority: Major
>  Labels: python, turbodbc
>
> This is more of a turbodbc issue I think, but perhaps someone here would have 
> some idea of what changed to cause potential issues. 
> {code:java}
> cursor = connection.cursor()
> cursor.execute("select top 10 * from dbo.tickets")
> table = cursor.fetchallarrow(){code}
> I am able to run table.num_rows and it will print out 10.
> If I run table.to_pandas() or table.schema or try to write the table to a 
> dataset, my kernel dies with no explanation. I reverted back to pyarrow 2.0 
> and the same code works again.
> [https://github.com/blue-yonder/turbodbc/issues/289]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11390) [Python] pyarrow 3.0 issues with turbodbc

2021-01-27 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272708#comment-17272708
 ] 

Lance Dacey commented on ARROW-11390:
-

That makes sense. I checked further and the base image I was using is this:

https://github.com/jupyter/docker-stacks/blob/master/pyspark-notebook/Dockerfile

Which pins pyarrow at 2.0:
{code:java}
RUN conda install --quiet --yes --satisfied-skip-solve \
'pyarrow=2.0.*' && \
{code}

I'll try again now that 3.0 is on conda-forge


> [Python] pyarrow 3.0 issues with turbodbc
> -
>
> Key: ARROW-11390
> URL: https://issues.apache.org/jira/browse/ARROW-11390
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
> Environment: pyarrow 3.0.0
> fsspec 0.8.4
> adlfs v0.5.9
> pandas 1.2.1
> numpy 1.19.5
> turbodbc 4.1.1
>Reporter: Lance Dacey
>Priority: Major
>  Labels: python, turbodbc
>
> This is more of a turbodbc issue I think, but perhaps someone here would have 
> some idea of what changed to cause potential issues. 
> {code:java}
> cursor = connection.cursor()
> cursor.execute("select top 10 * from dbo.tickets")
> table = cursor.fetchallarrow(){code}
> I am able to run table.num_rows and it will print out 10.
> If I run table.to_pandas() or table.schema or try to write the table to a 
> dataset, my kernel dies with no explanation. I reverted back to pyarrow 2.0 
> and the same code works again.
> [https://github.com/blue-yonder/turbodbc/issues/289]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11390) [Python] pyarrow 3.0 issues with turbodbc

2021-01-26 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272271#comment-17272271
 ] 

Lance Dacey commented on ARROW-11390:
-

Actually, turbodbc would have been installed before pyarrow since version 3.0 
was not on conda-forge so I moved it down to the pip section. Do I need to 
reverse this installation process?


{code:java}
&& /opt/conda/bin/conda install -c conda-forge -yq \
pandas \
numpy \
pyodbc \
pybind11 \
turbodbc \
azure-storage-blob \
azure-storage-common \
xlrd \
openpyxl \
mysql-connector-python \ 
zeep \
xmltodict \
dask \
dask-labextension \
pymssql=2.1 \
sqlalchemy-redshift \
python-snappy \
seaborn \
python-gitlab \
pyxlsb \
humanfriendly \
jupyterlab \
notebook=6.1.4 \
pip \
&& /opt/conda/bin/pip install --no-cache-dir --upgrade pip \
smartsheet-python-sdk \
duo-client \
adlfs \
pyarrow \

"apache-airflow[postgres,redis,celery,crypto,ssh,password]==$AIRFLOW_VERSION" \
{code}


I have not been able to get turbodbc to work with pip, which is why I am using 
conda right now. I was just trying to get it working again with the CFLAGS 
argument "-D_GLIBCXX_USE_CXX11_ABI=0", but had no luck. I will keep trying and 
perhaps raise an issue on the turbodbc project.

Let me know if there is a proper way to install these libraries! (ideally with 
just plain pip, since my base image is from Airflow which does not use conda by 
default)







> [Python] pyarrow 3.0 issues with turbodbc
> -
>
> Key: ARROW-11390
> URL: https://issues.apache.org/jira/browse/ARROW-11390
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 3.0.0
> Environment: pyarrow 3.0.0
> fsspec 0.8.4
> adlfs v0.5.9
> pandas 1.2.1
> numpy 1.19.5
> turbodbc 4.1.1
>Reporter: Lance Dacey
>Priority: Major
>  Labels: python, turbodbc
>
> This is more of a turbodbc issue I think, but perhaps someone here would have 
> some idea of what changed to cause potential issues. 
> {code:java}
> cursor = connection.cursor()
> cursor.execute("select top 10 * from dbo.tickets")
> table = cursor.fetchallarrow(){code}
> I am able to run table.num_rows and it will print out 10.
> If I run table.to_pandas() or table.schema or try to write the table to a 
> dataset, my kernel dies with no explanation. I reverted back to pyarrow 2.0 
> and the same code works again.
> [https://github.com/blue-yonder/turbodbc/issues/289]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-11390) [Python] pyarrow 3.0 issues with turbodbc

2021-01-26 Thread Lance Dacey (Jira)
Lance Dacey created ARROW-11390:
---

 Summary: [Python] pyarrow 3.0 issues with turbodbc
 Key: ARROW-11390
 URL: https://issues.apache.org/jira/browse/ARROW-11390
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 3.0.0
 Environment: pyarrow 3.0.0
fsspec 0.8.4
adlfs v0.5.9
pandas 1.2.1
numpy 1.19.5
turbodbc 4.1.1
Reporter: Lance Dacey


This is more of a turbodbc issue I think, but perhaps someone here would have 
some idea of what changed to cause potential issues. 
{code:java}
cursor = connection.cursor()
cursor.execute("select top 10 * from dbo.tickets")
table = cursor.fetchallarrow(){code}
I am able to run table.num_rows and it will print out 10.

If I run table.to_pandas() or table.schema or try to write the table to a 
dataset, my kernel dies with no explanation. I reverted back to pyarrow 2.0 and 
the same code works again.

[https://github.com/blue-yonder/turbodbc/issues/289]

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11250) [Python] Inconsistent behavior calling ds.dataset()

2021-01-15 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17266151#comment-17266151
 ] 

Lance Dacey commented on ARROW-11250:
-

Good idea - I was able to list all of the files and print the info quickly. One 
interesting thing is that ds.dataset() failed right after, though, and the 
error message is a little different. 

 

My input path was "dev/case-history/" with the final slash. This shows that it 
took 8 seconds to get the len(fs.find()) which is about the same amount of time 
it takes to read ds.dataset() in Jupyter. This error message is different than 
usual though and it mentions something about a dircache:

 
{code:java}
[2021-01-15 15:51:47,158] INFO - Reading /dev/case-history/
[2021-01-15 15:51:55,607] INFO - 9682
[2021-01-15 15:51:55,892] INFO - {'name': '/dev/case-history', 'size': 0, 
'type': 'directory'}
[2021-01-15 15:51:55,893] {taskinstance.py:1150} ERROR - '/dev/case-history/'
Traceback (most recent call last):
...
  File "/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py", line 671, 
in dataset
return _filesystem_dataset(source, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py", line 428, 
in _filesystem_dataset
fs, paths_or_selector = _ensure_single_source(source, filesystem)
  File "/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py", line 395, 
in _ensure_single_source
file_info = filesystem.get_file_info([path])[0]
  File "pyarrow/_fs.pyx", line 434, in pyarrow._fs.FileSystem.get_file_info
  File "pyarrow/error.pxi", line 122, in 
pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/_fs.pyx", line 1012, in pyarrow._fs._cb_get_file_info_vector
  File "/opt/conda/lib/python3.7/site-packages/pyarrow/fs.py", line 195, in 
get_file_info
info = self.fs.info(path)
  File "/opt/conda/lib/python3.7/site-packages/adlfs/spec.py", line 522, in info
fetch_from_azure = (path and self._ls_from_cache(path) is None) or refresh
  File "/opt/conda/lib/python3.7/site-packages/fsspec/spec.py", line 336, in 
_ls_from_cache
return self.dircache[path]
  File "/opt/conda/lib/python3.7/site-packages/fsspec/dircache.py", line 62, in 
__getitem__
return self._cache[item]  # maybe raises KeyError
KeyError: '/dev/case-history/'
{code}
 

I edited my DAG and changed the input path to be "dev/case-history" with no 
final slash and the error was different (note that fs.info() always either 
removes or adds the final slash to the name of the path):
{code:java}
[2021-01-15 15:36:25,603] INFO - {'name': '/dev/case-history/', 'size': 0, 
'type': 'directory'}
[2021-01-15 15:36:25,604] ERROR - /dev/case-history
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py", line 671, 
in dataset
return _filesystem_dataset(source, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py", line 428, 
in _filesystem_dataset
fs, paths_or_selector = _ensure_single_source(source, filesystem)
  File "/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py", line 404, 
in _ensure_single_source
raise FileNotFoundError(path)
FileNotFoundError: /dev/case-history
{code}
 

Without any fs.info() or fs.find() it took 11 minutes to read the same 
dataset... from 17:45 to 17:56
{code:java}
[2021-01-14 17:45:10,470] INFO - Reading /dev/case-history/
[2021-01-14 17:56:58,307] INFO - Processing dataset in batches
{code}
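
For reference, the pre-listing before the dataset call looks roughly like this - a sketch with placeholder credentials, path, and partitioning; invalidate_cache() is only a guess at clearing the stale dircache:

{code:python}
import fsspec
import pyarrow.dataset as ds

fs = fsspec.filesystem("abfs", account_name="<account>", account_key="<key>")
input_path = "dev/case-history"

fs.invalidate_cache()            # drop any stale fsspec dircache entries
print(len(fs.find(input_path)))  # force a fresh listing before the dataset call
print(fs.info(input_path))

dataset = ds.dataset(
    input_path,
    format="parquet",
    partitioning="hive",
    filesystem=fs,
)
{code}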
 

> [Python] Inconsistent behavior calling ds.dataset()
> ---
>
> Key: ARROW-11250
> URL: https://issues.apache.org/jira/browse/ARROW-11250
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> adal  1.2.5  pyh9f0ad1d_0conda-forge
> adlfs 0.5.9  pyhd8ed1ab_0conda-forge
> apache-airflow1.10.14  pypi_0pypi
> azure-common  1.1.24 py_0conda-forge
> azure-core1.9.0  pyhd3deb0d_0conda-forge
> azure-datalake-store  0.0.51 pyh9f0ad1d_0conda-forge
> azure-identity1.5.0  pyhd8ed1ab_0conda-forge
> azure-nspkg   3.0.2  py_0conda-forge
> azure-storage-blob12.6.0 pyhd3deb0d_0conda-forge
> azure-storage-common  2.1.0py37hc8dfbb8_3conda-forge
> fsspec0.8.5  pyhd8ed1ab_0conda-forge
> jupyterlab_pygments   0.1.2  pyh9f0ad1d_0conda-forge
> pandas1.2.0py37ha9443f7_0
> pyarrow   2.0.0   py37h4935f41_6_cpuconda-forge
>Reporter: Lance Dacey
>Priority: Minor

[jira] [Comment Edited] (ARROW-10247) [C++][Dataset] Cannot write dataset with dictionary column as partition field

2021-01-15 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265928#comment-17265928
 ] 

Lance Dacey edited comment on ARROW-10247 at 1/15/21, 11:08 AM:


Nice - how would you generally go about finding the array of values? Would it 
be detected from the file paths, or would I need to store it externally somewhere 
(sometimes new categories could be added to the field without me being aware, 
so explicitly listing them in my code might be weird)?


was (Author: ldacey):
Nice - how would you general go about finding the array of values? Would it be 
detected from the file paths, or would I need  store it externally somewhere 
(sometimes new categories could be added into the field without me being aware 
so explicitly listing them in my code might be weird)?

> [C++][Dataset] Cannot write dataset with dictionary column as partition field
> -
>
> Key: ARROW-10247
> URL: https://issues.apache.org/jira/browse/ARROW-10247
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> When the column to use for partitioning is dictionary encoded, we get this 
> error:
> {code}
> In [9]: import pyarrow.dataset as ds
> In [10]: part = ["xxx"] * 3 + ["yyy"] * 3
> ...: table = pa.table([
> ...: pa.array(range(len(part))),
> ...: pa.array(part).dictionary_encode(),
> ...: ], names=['col', 'part'])
> In [11]: part = ds.partitioning(table.select(["part"]).schema)
> In [12]: ds.write_dataset(table, "test_dataset_dict_part", format="parquet", 
> partitioning=part)
> ---
> ArrowTypeErrorTraceback (most recent call last)
>  in 
> > 1 ds.write_dataset(table, "test_dataset_dict_part", format="parquet", 
> partitioning=part)
> ~/scipy/repos/arrow/python/pyarrow/dataset.py in write_dataset(data, 
> base_dir, basename_template, format, partitioning, schema, filesystem, 
> file_options, use_threads)
> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> --> 775 filesystem, partitioning, file_options, use_threads,
> 776 )
> ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowTypeError: scalar xxx (of type string) is invalid for part: 
> dictionary
> In ../src/arrow/dataset/filter.cc, line 1082, code: 
> VisitConjunctionMembers(*and_.left_operand(), visitor)
> In ../src/arrow/dataset/partition.cc, line 257, code: VisitKeys(expr, 
> [&](const std::string& name, const std::shared_ptr& value) { auto&& 
> _error_or_value28 = (FieldRef(name).FindOneOrNone(*schema_)); do { 
> ::arrow::Status __s = 
> ::arrow::internal::GenericToStatus((_error_or_value28).status()); do { if 
> ((__builtin_expect(!!(!__s.ok()), 0))) { ::arrow::Status _st = (__s); 
> _st.AddContextLine("../src/arrow/dataset/partition.cc", 257, 
> "(_error_or_value28).status()"); return _st; } } while (0); } while (false); 
> auto match = std::move(_error_or_value28).ValueUnsafe();;; if (match) { const 
> auto& field = schema_->field(match[0]); if 
> (!value->type->Equals(field->type())) { return Status::TypeError("scalar ", 
> value->ToString(), " (of type ", *value->type, ") is invalid for ", 
> field->ToString()); } values[match[0]] = value.get(); } return Status::OK(); 
> })
> In ../src/arrow/dataset/file_base.cc, line 321, code: 
> (_error_or_value24).status()
> In ../src/arrow/dataset/file_base.cc, line 354, code: task_group->Finish()
> {code}
> While this seems a quite normal use case, as this column will typically be 
> repeated many times (and we also support reading it as such with dictionary 
> type, so a roundtrip is currently not possible in that case)
> I tagged it for 2.0.0 for a moment in case it's possible today, but I didn't 
> yet look into how easy it would be to fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10247) [C++][Dataset] Cannot write dataset with dictionary column as partition field

2021-01-15 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265928#comment-17265928
 ] 

Lance Dacey commented on ARROW-10247:
-

Nice - how would you generally go about finding the array of values? Would it be 
detected from the file paths, or would I need to store it externally somewhere 
(sometimes new categories could be added to the field without me being aware, 
so explicitly listing them in my code might be weird)?
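
To make the question concrete, this is roughly what I imagine the two options would look like - a sketch only, assuming the dictionaries=/infer_dictionary options from the linked fix, with the column name and values taken from the issue example:

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# (a) let discovery infer the dictionary values from the partition directories
part_inferred = ds.DirectoryPartitioning.discover(
    field_names=["part"], infer_dictionary=True
)

# (b) list the values explicitly up front
part_explicit = ds.DirectoryPartitioning(
    pa.schema([("part", pa.dictionary(pa.int32(), pa.string()))]),
    dictionaries={"part": pa.array(["xxx", "yyy"])},
)

dataset = ds.dataset("test_dataset_dict_part", partitioning=part_inferred)
{code}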

> [C++][Dataset] Cannot write dataset with dictionary column as partition field
> -
>
> Key: ARROW-10247
> URL: https://issues.apache.org/jira/browse/ARROW-10247
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> When the column to use for partitioning is dictionary encoded, we get this 
> error:
> {code}
> In [9]: import pyarrow.dataset as ds
> In [10]: part = ["xxx"] * 3 + ["yyy"] * 3
> ...: table = pa.table([
> ...: pa.array(range(len(part))),
> ...: pa.array(part).dictionary_encode(),
> ...: ], names=['col', 'part'])
> In [11]: part = ds.partitioning(table.select(["part"]).schema)
> In [12]: ds.write_dataset(table, "test_dataset_dict_part", format="parquet", 
> partitioning=part)
> ---
> ArrowTypeErrorTraceback (most recent call last)
>  in 
> > 1 ds.write_dataset(table, "test_dataset_dict_part", format="parquet", 
> partitioning=part)
> ~/scipy/repos/arrow/python/pyarrow/dataset.py in write_dataset(data, 
> base_dir, basename_template, format, partitioning, schema, filesystem, 
> file_options, use_threads)
> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> --> 775 filesystem, partitioning, file_options, use_threads,
> 776 )
> ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowTypeError: scalar xxx (of type string) is invalid for part: 
> dictionary
> In ../src/arrow/dataset/filter.cc, line 1082, code: 
> VisitConjunctionMembers(*and_.left_operand(), visitor)
> In ../src/arrow/dataset/partition.cc, line 257, code: VisitKeys(expr, 
> [&](const std::string& name, const std::shared_ptr& value) { auto&& 
> _error_or_value28 = (FieldRef(name).FindOneOrNone(*schema_)); do { 
> ::arrow::Status __s = 
> ::arrow::internal::GenericToStatus((_error_or_value28).status()); do { if 
> ((__builtin_expect(!!(!__s.ok()), 0))) { ::arrow::Status _st = (__s); 
> _st.AddContextLine("../src/arrow/dataset/partition.cc", 257, 
> "(_error_or_value28).status()"); return _st; } } while (0); } while (false); 
> auto match = std::move(_error_or_value28).ValueUnsafe();;; if (match) { const 
> auto& field = schema_->field(match[0]); if 
> (!value->type->Equals(field->type())) { return Status::TypeError("scalar ", 
> value->ToString(), " (of type ", *value->type, ") is invalid for ", 
> field->ToString()); } values[match[0]] = value.get(); } return Status::OK(); 
> })
> In ../src/arrow/dataset/file_base.cc, line 321, code: 
> (_error_or_value24).status()
> In ../src/arrow/dataset/file_base.cc, line 354, code: task_group->Finish()
> {code}
> While this seems a quite normal use case, as this column will typically be 
> repeated many times (and we also support reading it as such with dictionary 
> type, so a roundtrip is currently not possible in that case)
> I tagged it for 2.0.0 for a moment in case it's possible today, but I didn't 
> yet look into how easy it would be to fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-11250) [Python] Inconsistent behavior calling ds.dataset()

2021-01-15 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265922#comment-17265922
 ] 

Lance Dacey commented on ARROW-11250:
-

Do you have any idea at all what could also be causing my Airflow scheduler to 
take SO long to read the same dataset that I am able to read in under 10 
seconds on Jupyter? Could it be an overlay network or something? I have ensured 
that my tasks calling ds.dataset() are running on the same node that my 
Jupyterhub is running on. All software between the environments seems to be 
identical as well (same requirements.txt).

 

11 minutes on the latest Airflow run and 9 seconds if I run it in a notebook... 
is there a way to narrow down my troubleshooting scope for this?
{code:java}
dataset = ds.dataset(
    source=input_path,
    format="parquet",
    partitioning=partitioning,
    filesystem=fs,
){code}
 

> [Python] Inconsistent behavior calling ds.dataset()
> ---
>
> Key: ARROW-11250
> URL: https://issues.apache.org/jira/browse/ARROW-11250
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> adal  1.2.5  pyh9f0ad1d_0conda-forge
> adlfs 0.5.9  pyhd8ed1ab_0conda-forge
> apache-airflow1.10.14  pypi_0pypi
> azure-common  1.1.24 py_0conda-forge
> azure-core1.9.0  pyhd3deb0d_0conda-forge
> azure-datalake-store  0.0.51 pyh9f0ad1d_0conda-forge
> azure-identity1.5.0  pyhd8ed1ab_0conda-forge
> azure-nspkg   3.0.2  py_0conda-forge
> azure-storage-blob12.6.0 pyhd3deb0d_0conda-forge
> azure-storage-common  2.1.0py37hc8dfbb8_3conda-forge
> fsspec0.8.5  pyhd8ed1ab_0conda-forge
> jupyterlab_pygments   0.1.2  pyh9f0ad1d_0conda-forge
> pandas1.2.0py37ha9443f7_0
> pyarrow   2.0.0   py37h4935f41_6_cpuconda-forge
>Reporter: Lance Dacey
>Priority: Minor
>  Labels: azureblob, dataset,, python
> Fix For: 4.0.0
>
>
> In a Jupyter notebook, I have noticed that sometimes I am not able to read a 
> dataset which certainly exists on Azure Blob.
>  
> {code:java}
> fs = fsspec.filesystem(protocol="abfs", account_name, account_key)
> {code}
>  
> One example of this is reading a dataset in one cell:
>  
> {code:java}
> ds.dataset("dev/test-split", partitioning="hive", filesystem=fs){code}
>  
> Then in another cell I try to read the same dataset:
>  
> {code:java}
> ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
> ---
> FileNotFoundError Traceback (most recent call last)
>  in 
> > 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, 
> schema, format, filesystem, partitioning, partition_base_dir, 
> exclude_invalid_files, ignore_prefixes)
> 669 # TODO(kszucs): support InMemoryDataset for a table input
> 670 if _is_path_like(source):
> --> 671 return _filesystem_dataset(source, **kwargs)
> 672 elif isinstance(source, (tuple, list)):
> 673 if all(_is_path_like(elem) for elem in source):
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> _filesystem_dataset(source, schema, filesystem, partitioning, format, 
> partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
> 426 fs, paths_or_selector = _ensure_multiple_sources(source, 
> filesystem)
> 427 else:
> --> 428 fs, paths_or_selector = _ensure_single_source(source, 
> filesystem)
> 429 
> 430 options = FileSystemFactoryOptions(
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> _ensure_single_source(path, filesystem)
> 402 paths_or_selector = [path]
> 403 else:
> --> 404 raise FileNotFoundError(path)
> 405 
> 406 return filesystem, paths_or_selector
> FileNotFoundError: dev/test-split
> {code}
>  
> If I reset the kernel, it works again. It also works if I change the path 
> slightly, like adding a "/" at the end (so basically it just does not work if I 
> read the same dataset twice):
>  
> {code:java}
> ds.dataset("dev/test-split/", partitioning="hive", filesystem=fs)
> {code}
>  
>  
> The other strange behavior I have noticed is that if I read a dataset 
> inside of my Jupyter notebook,
>  
> {code:java}
> %%time
> dataset = ds.dataset("

[jira] [Commented] (ARROW-11250) [Python] Inconsistent behavior calling ds.dataset()

2021-01-15 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265909#comment-17265909
 ] 

Lance Dacey commented on ARROW-11250:
-

Sure, I can raise an issue there.

 
{code:java}
fs_pa.get_file_info("dev/test-split")

<FileInfo for 'dev/test-split': type=FileType.NotFound>
{code}
 

I had to tweak the code you provided a bit to get it to run for the 
FileSelector:
{code:java}
fs_pa.get_file_info(FileSelector("dev/test-split", recursive=True))

[<FileInfo for 'dev/test-split/...'>,
 <FileInfo for 'dev/test-split/...'>,
 ...
]{code}
 

FYI - if I add an ending slash to the path I get type=Directory instead of 
NotFound:
{code:java}
fs_pa.get_file_info("dev/test-split/")

<FileInfo for 'dev/test-split/': type=FileType.Directory>
{code}
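
For reference, fs_pa here is presumably just the fsspec filesystem wrapped as a pyarrow filesystem; a minimal sketch of how that wrapper can be constructed, in case anyone wants to reproduce this:

{code:python}
from pyarrow.fs import FileSelector, FSSpecHandler, PyFileSystem

# wrap the existing fsspec/adlfs filesystem so the pyarrow.fs API can be used directly
fs_pa = PyFileSystem(FSSpecHandler(fs))

# single-path lookup vs. recursive selector, as in the outputs above
print(fs_pa.get_file_info("dev/test-split"))
print(fs_pa.get_file_info(FileSelector("dev/test-split", recursive=True)))
{code}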

> [Python] Inconsistent behavior calling ds.dataset()
> ---
>
> Key: ARROW-11250
> URL: https://issues.apache.org/jira/browse/ARROW-11250
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> adal  1.2.5  pyh9f0ad1d_0conda-forge
> adlfs 0.5.9  pyhd8ed1ab_0conda-forge
> apache-airflow1.10.14  pypi_0pypi
> azure-common  1.1.24 py_0conda-forge
> azure-core1.9.0  pyhd3deb0d_0conda-forge
> azure-datalake-store  0.0.51 pyh9f0ad1d_0conda-forge
> azure-identity1.5.0  pyhd8ed1ab_0conda-forge
> azure-nspkg   3.0.2  py_0conda-forge
> azure-storage-blob12.6.0 pyhd3deb0d_0conda-forge
> azure-storage-common  2.1.0py37hc8dfbb8_3conda-forge
> fsspec0.8.5  pyhd8ed1ab_0conda-forge
> jupyterlab_pygments   0.1.2  pyh9f0ad1d_0conda-forge
> pandas1.2.0py37ha9443f7_0
> pyarrow   2.0.0   py37h4935f41_6_cpuconda-forge
>Reporter: Lance Dacey
>Priority: Minor
>  Labels: azureblob, dataset,, python
> Fix For: 4.0.0
>
>
> In a Jupyter notebook, I have noticed that sometimes I am not able to read a 
> dataset which certainly exists on Azure Blob.
>  
> {code:java}
> fs = fsspec.filesystem(protocol="abfs", account_name, account_key)
> {code}
>  
> One example of this is reading a dataset in one cell:
>  
> {code:java}
> ds.dataset("dev/test-split", partitioning="hive", filesystem=fs){code}
>  
> Then in another cell I try to read the same dataset:
>  
> {code:java}
> ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
> ---
> FileNotFoundError Traceback (most recent call last)
>  in 
> > 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, 
> schema, format, filesystem, partitioning, partition_base_dir, 
> exclude_invalid_files, ignore_prefixes)
> 669 # TODO(kszucs): support InMemoryDataset for a table input
> 670 if _is_path_like(source):
> --> 671 return _filesystem_dataset(source, **kwargs)
> 672 elif isinstance(source, (tuple, list)):
> 673 if all(_is_path_like(elem) for elem in source):
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> _filesystem_dataset(source, schema, filesystem, partitioning, format, 
> partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
> 426 fs, paths_or_selector = _ensure_multiple_sources(source, 
> filesystem)
> 427 else:
> --> 428 fs, paths_or_selector = _ensure_single_source(source, 
> filesystem)
> 429 
> 430 options = FileSystemFactoryOptions(
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> _ensure_single_source(path, filesystem)
> 402 paths_or_selector = [path]
> 403 else:
> --> 404 raise FileNotFoundError(path)
> 405 
> 406 return filesystem, paths_or_selector
> FileNotFoundError: dev/test-split
> {code}
>  
> If I reset the kernel, it works again. It also works if I change the path 
> slightly, like adding a "/" at the end (so basically it just does not work if I 
> read the same dataset twice):
>  
> {code:java}
> ds.dataset("dev/test-split/", partitioning="hive", filesystem=fs)
> {code}
>  
>  
> The other strange behavior I have noticed is that if I read a dataset 
> inside of my Jupyter notebook,
>  
> {code:java}
> %%time
> dataset = ds.dataset("dev/test-split", 
> partitioning=ds.partitioning(pa.schema([("date", pa.date32())]), 
> flavor="hive"), 
> filesystem=fs,
> exclude_invalid_files=False)
> CPU times: user 1.98 s, sys: 0 ns, total: 1.98 s Wall time: 2.58 s{code}
>  
> Now, on the exact same se

[jira] [Comment Edited] (ARROW-11250) [Python] Inconsistent behavior calling ds.dataset()

2021-01-15 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265869#comment-17265869
 ] 

Lance Dacey edited comment on ARROW-11250 at 1/15/21, 10:19 AM:


{code:java}
selected_files1 = fs.find("dev/test-split", maxdepth=None, withdirs=True, 
detail=True)
selected_files2 = fs.find("dev/test-split", maxdepth=None, withdirs=True, 
detail=True)
selected_files1 == selected_files2

True{code}
I am able to run the above cell over and over again.

 

Now when I use fs.info() without a final slash:
{code:java}
fs.info("dev/test-split")
{'name': 'dev/test-split/', 'size': 0, 'type': 'directory'}{code}
If I add a slash to the folder name, the slash is removed in the fs.info() 
return - will this impact anything? 
{code:java}
fs.info("dev/test-split/")
{'name': 'dev/test-split', 'size': 0, 'type': 'directory'}
{code}
 
{code:java}
selected_files3 = fs.info("dev/test-split")
selected_files4 = fs.info("dev/test-split/")
selected_files3 == selected_files4

False{code}
 

 

Edit: running fs.info() on the same path fails if I call it more than once 
without either changing the path (by adding a slash) or resetting my kernel. Even if I 
delete the fs variable and create a new filesystem, it does not work.
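
If this turns out to be the adlfs directory cache going stale, one possible workaround (a sketch, not verified against this adlfs version) would be to clear the cache before calling fs.info() again:

{code:python}
# drop any cached directory listings so the next call hits the service again
fs.invalidate_cache()

# or only for the affected prefix
fs.invalidate_cache("dev/test-split")

print(fs.info("dev/test-split"))
{code}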


was (Author: ldacey):
{code:java}
selected_files1 = fs.find("dev/test-split", maxdepth=None, withdirs=True, 
detail=True)
selected_files2 = fs.find("dev/test-split", maxdepth=None, withdirs=True, 
detail=True)
selected_files1 == selected_files2

True{code}
I am able to run the above cell over and over again.

 

Now when I use fs.info() without a final slash:
{code:java}
fs.info("dev/test-split")
{'name': 'dev/test-split/', 'size': 0, 'type': 'directory'}{code}
If I add a slash to the folder name, the slash is removed in the fs.info() 
return - will this impact anything? 
{code:java}
fs.info("dev/test-split/")
{'name': 'dev/test-split', 'size': 0, 'type': 'directory'}
{code}
 
{code:java}
selected_files3 = fs.info("dev/test-split")
selected_files4 = fs.info("dev/test-split/")
selected_files3 == selected_files4

False{code}

> [Python] Inconsistent behavior calling ds.dataset()
> ---
>
> Key: ARROW-11250
> URL: https://issues.apache.org/jira/browse/ARROW-11250
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> adal  1.2.5  pyh9f0ad1d_0conda-forge
> adlfs 0.5.9  pyhd8ed1ab_0conda-forge
> apache-airflow1.10.14  pypi_0pypi
> azure-common  1.1.24 py_0conda-forge
> azure-core1.9.0  pyhd3deb0d_0conda-forge
> azure-datalake-store  0.0.51 pyh9f0ad1d_0conda-forge
> azure-identity1.5.0  pyhd8ed1ab_0conda-forge
> azure-nspkg   3.0.2  py_0conda-forge
> azure-storage-blob12.6.0 pyhd3deb0d_0conda-forge
> azure-storage-common  2.1.0py37hc8dfbb8_3conda-forge
> fsspec0.8.5  pyhd8ed1ab_0conda-forge
> jupyterlab_pygments   0.1.2  pyh9f0ad1d_0conda-forge
> pandas1.2.0py37ha9443f7_0
> pyarrow   2.0.0   py37h4935f41_6_cpuconda-forge
>Reporter: Lance Dacey
>Priority: Minor
>  Labels: azureblob, dataset,, python
> Fix For: 4.0.0
>
>
> In a Jupyter notebook, I have noticed that sometimes I am not able to read a 
> dataset which certainly exists on Azure Blob.
>  
> {code:java}
> fs = fsspec.filesystem(protocol="abfs", account_name, account_key)
> {code}
>  
> One example of this is reading a dataset in one cell:
>  
> {code:java}
> ds.dataset("dev/test-split", partitioning="hive", filesystem=fs){code}
>  
> Then in another cell I try to read the same dataset:
>  
> {code:java}
> ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
> ---
> FileNotFoundError Traceback (most recent call last)
>  in 
> > 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, 
> schema, format, filesystem, partitioning, partition_base_dir, 
> exclude_invalid_files, ignore_prefixes)
> 669 # TODO(kszucs): support InMemoryDataset for a table input
> 670 if _is_path_like(source):
> --> 671 return _filesystem_dataset(source, **kwargs)
> 672 elif isinstance(source, (tuple, list)):
> 673 if all(_is_path_like(elem) for elem in source):
> /opt/conda/lib

[jira] [Commented] (ARROW-11250) [Python] Inconsistent behavior calling ds.dataset()

2021-01-15 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265869#comment-17265869
 ] 

Lance Dacey commented on ARROW-11250:
-

{code:java}
selected_files1 = fs.find("dev/test-split", maxdepth=None, withdirs=True, 
detail=True)
selected_files2 = fs.find("dev/test-split", maxdepth=None, withdirs=True, 
detail=True)
selected_files1 == selected_files2

True{code}
I am able to run the above cell over and over again.

 

Now when I use fs.info() without a final slash:
{code:java}
fs.info("dev/test-split")
{'name': 'dev/test-split/', 'size': 0, 'type': 'directory'}{code}
If I add a slash to the folder name, the slash is removed in the fs.info() 
return - will this impact anything? 
{code:java}
fs.info("dev/test-split/")
{'name': 'dev/test-split', 'size': 0, 'type': 'directory'}
{code}
 
{code:java}
selected_files3 = fs.info("dev/test-split")
selected_files4 = fs.info("dev/test-split/")
selected_files3 == selected_files4

False{code}

> [Python] Inconsistent behavior calling ds.dataset()
> ---
>
> Key: ARROW-11250
> URL: https://issues.apache.org/jira/browse/ARROW-11250
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> adal  1.2.5  pyh9f0ad1d_0conda-forge
> adlfs 0.5.9  pyhd8ed1ab_0conda-forge
> apache-airflow1.10.14  pypi_0pypi
> azure-common  1.1.24 py_0conda-forge
> azure-core1.9.0  pyhd3deb0d_0conda-forge
> azure-datalake-store  0.0.51 pyh9f0ad1d_0conda-forge
> azure-identity1.5.0  pyhd8ed1ab_0conda-forge
> azure-nspkg   3.0.2  py_0conda-forge
> azure-storage-blob12.6.0 pyhd3deb0d_0conda-forge
> azure-storage-common  2.1.0py37hc8dfbb8_3conda-forge
> fsspec0.8.5  pyhd8ed1ab_0conda-forge
> jupyterlab_pygments   0.1.2  pyh9f0ad1d_0conda-forge
> pandas1.2.0py37ha9443f7_0
> pyarrow   2.0.0   py37h4935f41_6_cpuconda-forge
>Reporter: Lance Dacey
>Priority: Minor
>  Labels: azureblob, dataset,, python
> Fix For: 4.0.0
>
>
> In a Jupyter notebook, I have noticed that sometimes I am not able to read a 
> dataset which certainly exists on Azure Blob.
>  
> {code:java}
> fs = fsspec.filesystem(protocol="abfs", account_name, account_key)
> {code}
>  
> One example of this is reading a dataset in one cell:
>  
> {code:java}
> ds.dataset("dev/test-split", partitioning="hive", filesystem=fs){code}
>  
> Then in another cell I try to read the same dataset:
>  
> {code:java}
> ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
> ---
> FileNotFoundError Traceback (most recent call last)
>  in 
> > 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, 
> schema, format, filesystem, partitioning, partition_base_dir, 
> exclude_invalid_files, ignore_prefixes)
> 669 # TODO(kszucs): support InMemoryDataset for a table input
> 670 if _is_path_like(source):
> --> 671 return _filesystem_dataset(source, **kwargs)
> 672 elif isinstance(source, (tuple, list)):
> 673 if all(_is_path_like(elem) for elem in source):
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> _filesystem_dataset(source, schema, filesystem, partitioning, format, 
> partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
> 426 fs, paths_or_selector = _ensure_multiple_sources(source, 
> filesystem)
> 427 else:
> --> 428 fs, paths_or_selector = _ensure_single_source(source, 
> filesystem)
> 429 
> 430 options = FileSystemFactoryOptions(
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> _ensure_single_source(path, filesystem)
> 402 paths_or_selector = [path]
> 403 else:
> --> 404 raise FileNotFoundError(path)
> 405 
> 406 return filesystem, paths_or_selector
> FileNotFoundError: dev/test-split
> {code}
>  
> If I reset the kernel, it works again. It also works if I change the path 
> slightly, like adding a "/" at the end (so basically it just does not work if I 
> read the same dataset twice):
>  
> {code:java}
> ds.dataset("dev/test-split/", partitioning="hive", filesystem=fs)
> {code}
>  
>  
> The other strange behavior I have noticed that th

[jira] [Created] (ARROW-11250) [Python] Inconsistent behavior calling ds.dataset()

2021-01-14 Thread Lance Dacey (Jira)
Lance Dacey created ARROW-11250:
---

 Summary: [Python] Inconsistent behavior calling ds.dataset()
 Key: ARROW-11250
 URL: https://issues.apache.org/jira/browse/ARROW-11250
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 2.0.0
 Environment: Ubuntu 18.04

adal  1.2.5  pyh9f0ad1d_0conda-forge
adlfs 0.5.9  pyhd8ed1ab_0conda-forge
apache-airflow1.10.14  pypi_0pypi
azure-common  1.1.24 py_0conda-forge
azure-core1.9.0  pyhd3deb0d_0conda-forge
azure-datalake-store  0.0.51 pyh9f0ad1d_0conda-forge
azure-identity1.5.0  pyhd8ed1ab_0conda-forge
azure-nspkg   3.0.2  py_0conda-forge
azure-storage-blob12.6.0 pyhd3deb0d_0conda-forge
azure-storage-common  2.1.0py37hc8dfbb8_3conda-forge
fsspec0.8.5  pyhd8ed1ab_0conda-forge
jupyterlab_pygments   0.1.2  pyh9f0ad1d_0conda-forge
pandas1.2.0py37ha9443f7_0
pyarrow   2.0.0   py37h4935f41_6_cpuconda-forge
Reporter: Lance Dacey
 Fix For: 3.0.0


In a Jupyter notebook, I have noticed that sometimes I am not able to read a 
dataset which certainly exists on Azure Blob.

 
{code:java}
fs = fsspec.filesystem(protocol="abfs", account_name, account_key)
{code}
 
One example of this is reading a dataset in one cell:

 
{code:java}
ds.dataset("dev/test-split", partitioning="hive", filesystem=fs){code}
 

Then in another cell I try to read the same dataset:

 
{code:java}
ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)


---
FileNotFoundError Traceback (most recent call last)
 in 
> 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs)

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, 
schema, format, filesystem, partitioning, partition_base_dir, 
exclude_invalid_files, ignore_prefixes)
669 # TODO(kszucs): support InMemoryDataset for a table input
670 if _is_path_like(source):
--> 671 return _filesystem_dataset(source, **kwargs)
672 elif isinstance(source, (tuple, list)):
673 if all(_is_path_like(elem) for elem in source):

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
_filesystem_dataset(source, schema, filesystem, partitioning, format, 
partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
426 fs, paths_or_selector = _ensure_multiple_sources(source, 
filesystem)
427 else:
--> 428 fs, paths_or_selector = _ensure_single_source(source, 
filesystem)
429 
430 options = FileSystemFactoryOptions(

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
_ensure_single_source(path, filesystem)
402 paths_or_selector = [path]
403 else:
--> 404 raise FileNotFoundError(path)
405 
406 return filesystem, paths_or_selector

FileNotFoundError: dev/test-split
{code}
 

If I reset the kernel, it works again. It also works if I change the path 
slightly, like adding a "/" at the end (so basically it just does not work if I read 
the same dataset twice):

 
{code:java}
ds.dataset("dev/test-split/", partitioning="hive", filesystem=fs)
{code}
 

 

The other strange behavior I have noticed is that if I read a dataset inside 
of my Jupyter notebook,

 
{code:java}
%%time
dataset = ds.dataset("dev/test-split", 
partitioning=ds.partitioning(pa.schema([("date", pa.date32())]), 
flavor="hive"), 
filesystem=fs,
exclude_invalid_files=False)

CPU times: user 1.98 s, sys: 0 ns, total: 1.98 s Wall time: 2.58 s{code}
 

Now, on the exact same server when I try to run the same code against the same 
dataset in Airflow it takes over 3 minutes (comparing the timestamps in my logs 
between right before I read the dataset, and immediately after the dataset is 
available to filter):
{code:java}
[2021-01-14 03:52:04,011] INFO - Reading dev/test-split
[2021-01-14 03:55:17,360] INFO - Processing dataset in batches
{code}
This is probably not a pyarrow issue, but what are some potential causes that I 
can look into? I have one example where it is 9 seconds to read the dataset in 
Jupyter, but then 11 *minutes* in Airflow. I don't know what to really 
investigate - as I mentioned, the Jupyter notebook and Airflow are on the same 
server and both are deployed using Docker. Airflow is using the CeleryExecutor.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (ARROW-10247) [C++][Dataset] Cannot write dataset with dictionary column as partition field

2021-01-09 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17261996#comment-17261996
 ] 

Lance Dacey edited comment on ARROW-10247 at 1/10/21, 3:27 AM:
---

What is the best workaround for this issue right now? I was playing around with 
making a new partition schema if a dictionary type was found in my partition 
columns:

 
{code:java}
partitioning = None
part_schema = t.select(["project", "date"]).schema
fields = []
for part in part_schema:
    if pa.types.is_dictionary(part.type):
        fields.append(pa.field(part.name, part.type.value_type))
    else:
        fields.append(pa.field(part.name, part.type))
new_schema = pa.schema(fields)
partitioning = ds.partitioning(new_schema, flavor="hive")
{code}
This seems to work for me. My only issue is if I have multiple partition 
columns with different types.

This would return an error when I read the dataset with ds.dataset():
{code:java}
partitioning = ds.partitioning(pa.schema([('date', pa.date32()), ("project", 
pa.dictionary(index_type=pa.int32(), value_type=pa.string()))]), 
flavor="hive"){code}
ArrowInvalid: No dictionary provided for dictionary field project: 
dictionary

 

And this returns dictionaries for both partitions (instead of date being 
pa.date32()) which is not ideal:
{code:java}
partitioning=ds.HivePartitioning.discover(infer_dictionary=True){code}
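
A related workaround (a rough sketch, not tested here) would be to decode the dictionary columns on the table itself before writing, so the written data matches the non-dictionary partitioning schema:

{code:python}
import pyarrow as pa


def decoded_schema(schema: pa.Schema) -> pa.Schema:
    """Replace dictionary fields with their plain value types."""
    fields = [
        pa.field(f.name, f.type.value_type) if pa.types.is_dictionary(f.type) else f
        for f in schema
    ]
    return pa.schema(fields)


# cast the table so both the data and the partitioning use non-dictionary types
table = table.cast(decoded_schema(table.schema))
{code}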


was (Author: ldacey):
What is the best workaround for this issue right now? If a column in the 
partition columns is_dictionary(), then convert it to pa.string() to save the 
dataset and then use ds.HivePartitioning.discover(infer_dictionary=True) to 
read the dataset later?

> [C++][Dataset] Cannot write dataset with dictionary column as partition field
> -
>
> Key: ARROW-10247
> URL: https://issues.apache.org/jira/browse/ARROW-10247
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When the column to use for partitioning is dictionary encoded, we get this 
> error:
> {code}
> In [9]: import pyarrow.dataset as ds
> In [10]: part = ["xxx"] * 3 + ["yyy"] * 3
> ...: table = pa.table([
> ...: pa.array(range(len(part))),
> ...: pa.array(part).dictionary_encode(),
> ...: ], names=['col', 'part'])
> In [11]: part = ds.partitioning(table.select(["part"]).schema)
> In [12]: ds.write_dataset(table, "test_dataset_dict_part", format="parquet", 
> partitioning=part)
> ---
> ArrowTypeErrorTraceback (most recent call last)
>  in 
> > 1 ds.write_dataset(table, "test_dataset_dict_part", format="parquet", 
> partitioning=part)
> ~/scipy/repos/arrow/python/pyarrow/dataset.py in write_dataset(data, 
> base_dir, basename_template, format, partitioning, schema, filesystem, 
> file_options, use_threads)
> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> --> 775 filesystem, partitioning, file_options, use_threads,
> 776 )
> ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowTypeError: scalar xxx (of type string) is invalid for part: 
> dictionary
> In ../src/arrow/dataset/filter.cc, line 1082, code: 
> VisitConjunctionMembers(*and_.left_operand(), visitor)
> In ../src/arrow/dataset/partition.cc, line 257, code: VisitKeys(expr, 
> [&](const std::string& name, const std::shared_ptr& value) { auto&& 
> _error_or_value28 = (FieldRef(name).FindOneOrNone(*schema_)); do { 
> ::arrow::Status __s = 
> ::arrow::internal::GenericToStatus((_error_or_value28).status()); do { if 
> ((__builtin_expect(!!(!__s.ok()), 0))) { ::arrow::Status _st = (__s); 
> _st.AddContextLine("../src/arrow/dataset/partition.cc", 257, 
> "(_error_or_value28).status()"); return _st; } } while (0); } while (false); 
> auto match = std::move(_error_or_value28).ValueUnsafe();;; if (match) { const 
> auto& field = schema_->field(match[0]); if 
> (!value->type->Equals(field->type())) { return Status::TypeError("scalar ", 
> value->ToString(), " (of type ", *value->type, ") is invalid for ", 
> field->ToString()); } values[match[0]] = value.get(); } return Status::OK(); 
> })
> In ../src/arrow/dataset/file_base.cc, line 321, code: 
> (_error_or_value24).status()
> In ../src/arrow/dataset/file_base.cc, line 354, code: task_group->Finish()
> {code}
> While this seems a quite normal use case, as thi

[jira] [Commented] (ARROW-10247) [C++][Dataset] Cannot write dataset with dictionary column as partition field

2021-01-09 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17261996#comment-17261996
 ] 

Lance Dacey commented on ARROW-10247:
-

What is the best workaround for this issue right now? If a column in the 
partition columns is_dictionary(), then convert it to pa.string() to save the 
dataset and then use ds.HivePartitioning.discover(infer_dictionary=True) to 
read the dataset later?

> [C++][Dataset] Cannot write dataset with dictionary column as partition field
> -
>
> Key: ARROW-10247
> URL: https://issues.apache.org/jira/browse/ARROW-10247
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Assignee: Ben Kietzman
>Priority: Major
>  Labels: dataset, pull-request-available
> Fix For: 3.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When the column to use for partitioning is dictionary encoded, we get this 
> error:
> {code}
> In [9]: import pyarrow.dataset as ds
> In [10]: part = ["xxx"] * 3 + ["yyy"] * 3
> ...: table = pa.table([
> ...: pa.array(range(len(part))),
> ...: pa.array(part).dictionary_encode(),
> ...: ], names=['col', 'part'])
> In [11]: part = ds.partitioning(table.select(["part"]).schema)
> In [12]: ds.write_dataset(table, "test_dataset_dict_part", format="parquet", 
> partitioning=part)
> ---
> ArrowTypeErrorTraceback (most recent call last)
>  in 
> > 1 ds.write_dataset(table, "test_dataset_dict_part", format="parquet", 
> partitioning=part)
> ~/scipy/repos/arrow/python/pyarrow/dataset.py in write_dataset(data, 
> base_dir, basename_template, format, partitioning, schema, filesystem, 
> file_options, use_threads)
> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> --> 775 filesystem, partitioning, file_options, use_threads,
> 776 )
> ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
> ArrowTypeError: scalar xxx (of type string) is invalid for part: 
> dictionary
> In ../src/arrow/dataset/filter.cc, line 1082, code: 
> VisitConjunctionMembers(*and_.left_operand(), visitor)
> In ../src/arrow/dataset/partition.cc, line 257, code: VisitKeys(expr, 
> [&](const std::string& name, const std::shared_ptr& value) { auto&& 
> _error_or_value28 = (FieldRef(name).FindOneOrNone(*schema_)); do { 
> ::arrow::Status __s = 
> ::arrow::internal::GenericToStatus((_error_or_value28).status()); do { if 
> ((__builtin_expect(!!(!__s.ok()), 0))) { ::arrow::Status _st = (__s); 
> _st.AddContextLine("../src/arrow/dataset/partition.cc", 257, 
> "(_error_or_value28).status()"); return _st; } } while (0); } while (false); 
> auto match = std::move(_error_or_value28).ValueUnsafe();;; if (match) { const 
> auto& field = schema_->field(match[0]); if 
> (!value->type->Equals(field->type())) { return Status::TypeError("scalar ", 
> value->ToString(), " (of type ", *value->type, ") is invalid for ", 
> field->ToString()); } values[match[0]] = value.get(); } return Status::OK(); 
> })
> In ../src/arrow/dataset/file_base.cc, line 321, code: 
> (_error_or_value24).status()
> In ../src/arrow/dataset/file_base.cc, line 354, code: task_group->Finish()
> {code}
> While this seems a quite normal use case, as this column will typically be 
> repeated many times (and we also support reading it as such with dictionary 
> type, so a roundtrip is currently not possible in that case)
> I tagged it for 2.0.0 for a moment in case it's possible today, but I didn't 
> yet look into how easy it would be to fix.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10523) [Python] Pandas timestamps are inferred to have only microsecond precision

2021-01-07 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260878#comment-17260878
 ] 

Lance Dacey commented on ARROW-10523:
-

I noticed that even explicitly using (unit="ns") would not work when using 
write_to_dataset() with the legacy dataset.

I would print table.schema right before saving the dataset to Azure Blob (it 
would show "ns"), and when I read the dataset.schema afterwards the unit was 
the "us". In the end, I explicitly wrote the data using unit="us" and also 
added the coerce_timestamps="us" write option.
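
For anyone hitting the same thing, this is roughly the pattern (a sketch with toy column names, writing locally for simplicity):

{code:python}
import datetime

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# toy frame; in practice this comes from the upstream extract
df = pd.DataFrame(
    {
        "report_date": [datetime.date(2020, 11, 8)],
        "updated_at": [pd.Timestamp("2020-11-08 12:34:56.123456")],
    }
)

# declare microsecond precision up front instead of relying on type inference
schema = pa.schema([("report_date", pa.date32()), ("updated_at", pa.timestamp("us"))])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)

# and coerce on write as well, so the unit survives the round trip
pq.write_to_dataset(
    table=table,
    root_path="example-timestamps",
    partition_cols=["report_date"],
    coerce_timestamps="us",
)
{code}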

> [Python] Pandas timestamps are inferred to have only microsecond precision
> --
>
> Key: ARROW-10523
> URL: https://issues.apache.org/jira/browse/ARROW-10523
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 2.0.0
>Reporter: David Li
>Priority: Minor
>
> {code:java}
> import pyarrow as pa
> import pandas as pd
> arr = pa.array([pd.Timestamp(year=2020, month=1, day=1, nanosecond=999)])
> print(arr)
> print(arr.type) {code}
> This gives:
> {noformat}
> [
>   2020-01-01 00:00:00.00
> ]
> timestamp[us]
> {noformat}
> However, Pandas Timestamps have nanosecond precision, which would be nice to 
> preserve in inference.
> The reason is that TypeInferrer [hardcodes 
> microseconds|https://github.com/apache/arrow/blob/apache-arrow-2.0.0/cpp/src/arrow/python/inference.cc#L466]
>  as it only knows about the standard library datetime, so I'm treating this 
> as a feature request and not quite a bug. Of course, this can be worked 
> around easily by specifying an explicit type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10695) [C++][Dataset] Allow to use a UUID in the basename_template when writing a dataset

2020-12-04 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17244232#comment-17244232
 ] 

Lance Dacey commented on ARROW-10695:
-

FYI, I think this might be necessary for some use cases. For example, I have 
Airflow extract data from dozens of APIs in parallel and write to the same 
target partitioned dataset (partitioned based on the Airflow scheduled date, so 
all files belong in the same batch folder) - this causes the part-0.parquet 
file to be overwritten each time which results in lost data instead of there 
being dozens of files.

In the meantime, I added the code below. It seems I need to keep the \{i} or I 
get an error:
{code:python}
if self.create_uuid_filename:
    basename_template = guid() + "-{i}.parquet"
else:
    basename_template = "part-{i}.parquet"
{code}
guid() is imported from pyarrow.util
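
A plain uuid4 works as well if you'd rather not rely on the pyarrow helper (a sketch; table, output_path, partitioning and fs are assumed to be defined elsewhere, and the \{i} placeholder is still required by write_dataset):

{code:python}
import uuid

import pyarrow.dataset as ds

# a unique template per writer/run; {i} still gets substituted per file
basename_template = f"{uuid.uuid4().hex}-{{i}}.parquet"

ds.write_dataset(
    data=table,
    base_dir=output_path,
    format="parquet",
    partitioning=partitioning,
    basename_template=basename_template,
    filesystem=fs,
)
{code}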

> [C++][Dataset] Allow to use a UUID in the basename_template when writing a 
> dataset
> --
>
> Key: ARROW-10695
> URL: https://issues.apache.org/jira/browse/ARROW-10695
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
>  Labels: dataset, dataset-parquet-write
> Fix For: 3.0.0
>
>
> Currently we allow the user to specify a {{basename_template}}, and this can 
> include a {{"\{i\}"}} part to replace it with an automatically incremented 
> integer (so each generated file written to a single partition is unique):
> https://github.com/apache/arrow/blob/master/python/pyarrow/dataset.py#L713-L717
> It _might_ be useful to also have the ability to use a UUID, to ensure the 
> file is unique in general (not only for a single write) and to mimic the 
> behaviour of the old {{write_to_dataset}} implementation.
> For example, we could look for a {{"\{uuid\}"}} in the template string, and 
> if present replace it for each file with a new UUID.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-12-04 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243963#comment-17243963
 ] 

Lance Dacey commented on ARROW-10517:
-

Yes, I think the uuid specifier would work fine for my purposes. Generally, I 
have had pyarrow create the resulting filenames with the partition_filename_cb 
function, but you are right - I could probably generate the filenames directly 
since I am dictating which filters to use in the first place (and each filter 
becomes a file).
 
{code:python}
import datetime

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# fs (fsspec filesystem) and output_path are defined elsewhere

d1 = {
    "id": [1, 2, 3, 4, 5],
    "created_at": [
        datetime.date(2020, 5, 7),
        datetime.date(2020, 6, 19),
        datetime.date(2020, 9, 14),
        datetime.date(2020, 11, 22),
        datetime.date(2020, 12, 2),
    ],
    "updated_at": [
        datetime.date(2020, 12, 2),
        datetime.date(2020, 12, 2),
        datetime.date(2020, 12, 2),
        datetime.date(2020, 12, 2),
        datetime.date(2020, 12, 2),
    ],
}
df = pd.DataFrame(data=d1)
table = pa.Table.from_pandas(df)

# historical dataset which has all history of each ID each time it gets updated.
# each created_at partition would have a sub-partition for updated_at since historical
# data can change - this can generate many small files depending on how often my
# schedule runs to download data.
# I use pa.string() as the partition data type here because I have had issues using
# pa.date32(); sometimes I will get an error that we cannot convert a string to
# date32() but using a date works perfectly fine
ds.write_dataset(
    data=table,
    base_dir=output_path,
    format="parquet",
    partitioning=ds.partitioning(
        pa.schema([("created_at", pa.string()), ("updated_at", pa.string())]),
        flavor="hive",
    ),
    schema=table.schema,
    filesystem=fs,
)

# the next task would read the dataset and filter for the created_at partition
# (ignoring the updated_at partition)
dataset = ds.dataset(
    source=output_path,
    format="parquet",
    partitioning="hive",
    filesystem=fs,
)

# I save the unique filters (each created_at value) externally and build the dataset
# filter expression
filter_expression = pq._filters_to_expression(
    filters=[
        [("created_at", "==", "2020-05-07")],
        [("created_at", "==", "2020-06-19")],
        [("created_at", "==", "2020-09-14")],
        [("created_at", "==", "2020-11-22")],
        [("created_at", "==", "2020-12-02")],
    ]
)

table = dataset.to_table(filter=filter_expression)

# Turn the table into a pandas dataframe to remove duplicates and retain the latest
# row for each ID
df = (
    table.to_pandas(self_destruct=True)
    .sort_values(["id", "updated_at"], ascending=True)
    .drop_duplicates(["id"], keep="last")
)
table = pa.Table.from_pandas(df)

# this writes the final dataset.
# There would be one file per created_at partition:
# "container/created_at=2020-05-07/2020-05-07.parquet"
# our visualization tool connects directly to these parquet files so we can report on
# the latest status of each ticket (not much attention is paid to the historical changes)
pq.write_to_dataset(
    table=table,
    root_path=output_path,
    partition_cols=["created_at"],
    partition_filename_cb=lambda x: str(x[-1]) + ".parquet",
    filesystem=fs,
)
{code}

Note regarding the filters I use: I am using code similar to something I found in 
the pyarrow write_to_dataset function (pasted below) to generate these filters. I 
could probably generate filenames directly instead and use write_table like you 
mentioned (a sketch of that follows the snippet below).

{code:python}
for keys, subgroup in data_df.groupby(partition_keys):
    if not isinstance(keys, tuple):
        keys = (keys,)
    subdir = "/".join(
        ["{colname}={value}".format(colname=name, value=val)
         for name, val in zip(partition_cols, keys)]
    )
{code}
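
Roughly what generating the filenames directly with write_table could look like (a sketch; df is the de-duplicated dataframe, and output_path/fs are assumed to be defined as in the code above):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# one output file per created_at value, named after the partition value itself
for value, group in df.groupby("created_at"):
    path = f"{output_path}/created_at={value}/{value}.parquet"
    table_part = pa.Table.from_pandas(group, preserve_index=False)
    pq.write_table(table_part, path, filesystem=fs)
{code}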


> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
> 
>
> Key: ARROW-10517
> URL: https://issues.apache.org/jira/browse/ARROW-10517
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: azureblob, dataset, dataset-parquet-read, 
> dataset-parquet-write, fsspec
> Fix For: 2.0.0
>
> Attachments: ss.PNG, ss2.PNG
>
>
>  
> {code:python}
> # adal==1.2.5
> # adlfs==0.2.5
> # fsspec==0.7.4
> # pandas==1.1.3
> # pyarrow==2.0.0
> # azure-storage-blob==2.1.0
> # azure-storage-common==2.1.0
> import pyarrow.dataset as ds
> import fsspec
> from pyarrow.dataset import DirectoryPartitioning
> fs = fsspec.filesystem(protocol='abfs', 
>account_name=base.login, 
>account_key=base.password)
> ds.write_dataset(data=table, 
>  base_dir="dev/test7", 
>

[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-12-02 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242425#comment-17242425
 ] 

Lance Dacey commented on ARROW-10517:
-

FYI, it seems like the "part-\{i}" basename_template does not work well if 
schedules run in parallel. For example, I ran 30 schedules (in parallel) which 
read separate JSON files and output the data into the same partitioned parquet 
dataset. Only part-0.parquet was being overwritten each time. For now, I 
imported the guid() function from pyarrow.utils to ensure that all files are 
written.

> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
> 
>
> Key: ARROW-10517
> URL: https://issues.apache.org/jira/browse/ARROW-10517
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: azureblob, dataset, dataset-parquet-read, 
> dataset-parquet-write, fsspec
> Fix For: 2.0.0
>
> Attachments: ss.PNG, ss2.PNG
>
>
>  
> {code:python}
> # adal==1.2.5
> # adlfs==0.2.5
> # fsspec==0.7.4
> # pandas==1.1.3
> # pyarrow==2.0.0
> # azure-storage-blob==2.1.0
> # azure-storage-common==2.1.0
> import pyarrow.dataset as ds
> import fsspec
> from pyarrow.dataset import DirectoryPartitioning
> fs = fsspec.filesystem(protocol='abfs', 
>account_name=base.login, 
>account_key=base.password)
> ds.write_dataset(data=table, 
>  base_dir="dev/test7", 
>  basename_template=None, 
>  format="parquet",
>  partitioning=DirectoryPartitioning(pa.schema([("year", 
> pa.string()), ("month", pa.string()), ("day", pa.string())])), 
>  schema=table.schema,
>  filesystem=fs, 
> )
> {code}
>  I think this is due to early versions of adlfs having mkdir(). Although I 
> use write_to_dataset and write_table all of the time, so I am not sure why 
> this would be an issue.
> {code:python}
> ---
> RuntimeError  Traceback (most recent call last)
>  in 
>  13 
>  14 
> ---> 15 ds.write_dataset(data=table, 
>  16  base_dir="dev/test7",
>  17  basename_template=None,
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> write_dataset(data, base_dir, basename_template, format, partitioning, 
> schema, filesystem, file_options, use_threads)
> 771 filesystem, _ = _ensure_fs(filesystem)
> 772 
> --> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> 775 filesystem, partitioning, file_options, use_threads,
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
> pyarrow._fs._cb_create_dir()
> /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, 
> path, recursive)
> 226 def create_dir(self, path, recursive):
> 227 # mkdir also raises FileNotFoundError when base directory is 
> not found
> --> 228 self.fs.mkdir(path, create_parents=recursive)
> 229 
> 230 def delete_dir(self, path):
> /opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, 
> delimiter, exists_ok, **kwargs)
> 561 else:
> 562 ## everything else
> --> 563 raise RuntimeError(f"Cannot create 
> {container_name}{delimiter}{path}.")
> 564 else:
> 565 if container_name in self.ls("") and path:
> RuntimeError: Cannot create dev/test7/2020/01/28.
> {code}
>  
> Next, if I try to read a dataset (keep in mind that this works with 
> read_table and ParquetDataset):
> {code:python}
> ds.dataset(source="dev/staging/evaluations", 
>format="parquet", 
>partitioning="hive",
>exclude_invalid_files=False,
>filesystem=fs
>   )
> {code}
>  
> This doesn't seem to respect the filesystem connected to Azure Blob.
> {code:python}
> ---
> FileNotFoundError Traceback (most recent call last)
>  in 
> > 1 ds.dataset(source="dev/staging/evaluations", 
>   2format="parquet",
>   3partitioning="hive",
>   4exclude_invalid_files=False,
>   5filesystem=fs
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, 
> schema, format, filesystem, partitioning, parti

[jira] [Comment Edited] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition

2020-12-02 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242213#comment-17242213
 ] 

Lance Dacey edited comment on ARROW-10694 at 12/2/20, 9:57 AM:
---

I am simply listing and deleting blobs without ".parquet" as a workaround for 
now. I think this is still an issue that should be resolved since this can 
delete _common_metadata and _metadata files unless I specifically ignore them
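
Roughly the cleanup I mean (a sketch; fs is the fsspec filesystem and the path is illustrative):

{code:python}
# delete the zero-byte directory-marker blobs left behind by write_dataset,
# while keeping the real parquet files and the _metadata/_common_metadata sidecars
for path in fs.find("dev/test-dataset"):
    name = path.rsplit("/", 1)[-1]
    if not name.endswith(".parquet") and not name.startswith("_"):
        fs.rm(path)
{code}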


was (Author: ldacey):
I am simply listing and deleting blobs with ".parquet" as a workaround for now. 
I think this is still an issue that should be resolved since this can delete 
_common_metadata and _metadata files unless I specifically ignore them

> [Python] ds.write_dataset() generates empty files for each final partition
> --
>
> Key: ARROW-10694
> URL: https://issues.apache.org/jira/browse/ARROW-10694
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> Python 3.8.6
> adlfs master branch
>Reporter: Lance Dacey
>Priority: Major
>
> ds.write_dataset() is generating empty files for the final partition folder 
> which causes errors when reading the dataset or converting a dataset to a 
> table.
> I believe this may be caused by fs.mkdir(). Without the final slash in the 
> path, an empty file is created in the "dev" container:
>  
> {code:java}
> fs = fsspec.filesystem(protocol='abfs', account_name=base.login, 
> account_key=base.password)
> fs.mkdir("dev/test2")
> {code}
>  
> If the final slash is added, a proper folder is created:
> {code:java}
> fs.mkdir("dev/test2/"){code}
>  
> Here is a full example of what happens with ds.write_dataset:
> {code:java}
> schema = pa.schema(
> [
> ("year", pa.int16()),
> ("month", pa.int8()),
> ("day", pa.int8()),
> ("report_date", pa.date32()),
> ("employee_id", pa.string()),
> ("designation", pa.dictionary(index_type=pa.int16(), 
> value_type=pa.string())),
> ]
> )
> part = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", 
> pa.int8()), ("day", pa.int8())]))
> ds.write_dataset(data=table, 
>  base_dir="dev/test-dataset", 
>  basename_template="test-{i}.parquet", 
>  format="parquet",
>  partitioning=part, 
>  schema=schema,
>  filesystem=fs)
> dataset.files
> #sample printed below, note the empty files
> [
>  'dev/test-dataset/2018/1/1/test-0.parquet',
>  'dev/test-dataset/2018/10/1',
>  'dev/test-dataset/2018/10/1/test-27.parquet',
>  'dev/test-dataset/2018/3/1',
>  'dev/test-dataset/2018/3/1/test-6.parquet',
>  'dev/test-dataset/2020/1/1',
>  'dev/test-dataset/2020/1/1/test-2.parquet',
>  'dev/test-dataset/2020/10/1',
>  'dev/test-dataset/2020/10/1/test-29.parquet',
>  'dev/test-dataset/2020/11/1',
>  'dev/test-dataset/2020/11/1/test-32.parquet',
>  'dev/test-dataset/2020/2/1',
>  'dev/test-dataset/2020/2/1/test-5.parquet',
>  'dev/test-dataset/2020/7/1',
>  'dev/test-dataset/2020/7/1/test-20.parquet',
>  'dev/test-dataset/2020/8/1',
>  'dev/test-dataset/2020/8/1/test-23.parquet',
>  'dev/test-dataset/2020/9/1',
>  'dev/test-dataset/2020/9/1/test-26.parquet'
> ]{code}
> As you can see, there is an empty file for each "day" partition. I was not 
> even able to read the dataset at all until I manually deleted the first empty 
> file in the dataset (2018/1/1).
> I then get an error when I try to use the to_table() method:
> {code:java}
> OSError   Traceback (most recent call last)
>  in 
> > 1 
> dataset.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx 
> in 
> pyarrow._dataset.Dataset.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx
>  in 
> pyarrow._dataset.Scanner.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi
>  in 
> pyarrow.lib.pyarrow_internal_check_status()/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()OSError: Could not open parquet input source 
> 'dev/test-dataset/2018/10/1': Invalid: Parquet file size is 0 bytes
> {code}
> If I manually delete the empty file, I can then use the to_table() function:
> {code:java}
> dataset.to_table(filter=(ds.field("year") == 2020) & (ds.field("month") == 
> 10)).to_pandas()
> {code}
> Is this a bug with pyarrow, adlfs, or fsspec?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition

2020-12-02 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242213#comment-17242213
 ] 

Lance Dacey commented on ARROW-10694:
-

I am simply listing and deleting blobs with ".parquet" as a workaround for now. 
I think this is still an issue that should be resolved since this can delete 
_common_metadata and _metadata files unless I specifically ignore them

> [Python] ds.write_dataset() generates empty files for each final partition
> --
>
> Key: ARROW-10694
> URL: https://issues.apache.org/jira/browse/ARROW-10694
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> Python 3.8.6
> adlfs master branch
>Reporter: Lance Dacey
>Priority: Major
>
> ds.write_dataset() is generating empty files for the final partition folder 
> which causes errors when reading the dataset or converting a dataset to a 
> table.
> I believe this may be caused by fs.mkdir(). Without the final slash in the 
> path, an empty file is created in the "dev" container:
>  
> {code:java}
> fs = fsspec.filesystem(protocol='abfs', account_name=base.login, 
> account_key=base.password)
> fs.mkdir("dev/test2")
> {code}
>  
> If the final slash is added, a proper folder is created:
> {code:java}
> fs.mkdir("dev/test2/"){code}
>  
> Here is a full example of what happens with ds.write_dataset:
> {code:java}
> schema = pa.schema(
> [
> ("year", pa.int16()),
> ("month", pa.int8()),
> ("day", pa.int8()),
> ("report_date", pa.date32()),
> ("employee_id", pa.string()),
> ("designation", pa.dictionary(index_type=pa.int16(), 
> value_type=pa.string())),
> ]
> )
> part = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", 
> pa.int8()), ("day", pa.int8())]))
> ds.write_dataset(data=table, 
>  base_dir="dev/test-dataset", 
>  basename_template="test-{i}.parquet", 
>  format="parquet",
>  partitioning=part, 
>  schema=schema,
>  filesystem=fs)
> dataset.files
> #sample printed below, note the empty files
> [
>  'dev/test-dataset/2018/1/1/test-0.parquet',
>  'dev/test-dataset/2018/10/1',
>  'dev/test-dataset/2018/10/1/test-27.parquet',
>  'dev/test-dataset/2018/3/1',
>  'dev/test-dataset/2018/3/1/test-6.parquet',
>  'dev/test-dataset/2020/1/1',
>  'dev/test-dataset/2020/1/1/test-2.parquet',
>  'dev/test-dataset/2020/10/1',
>  'dev/test-dataset/2020/10/1/test-29.parquet',
>  'dev/test-dataset/2020/11/1',
>  'dev/test-dataset/2020/11/1/test-32.parquet',
>  'dev/test-dataset/2020/2/1',
>  'dev/test-dataset/2020/2/1/test-5.parquet',
>  'dev/test-dataset/2020/7/1',
>  'dev/test-dataset/2020/7/1/test-20.parquet',
>  'dev/test-dataset/2020/8/1',
>  'dev/test-dataset/2020/8/1/test-23.parquet',
>  'dev/test-dataset/2020/9/1',
>  'dev/test-dataset/2020/9/1/test-26.parquet'
> ]{code}
> As you can see, there is an empty file for each "day" partition. I was not 
> even able to read the dataset at all until I manually deleted the first empty 
> file in the dataset (2018/1/1).
> I then get an error when I try to use the to_table() method:
> {code:java}
> OSError   Traceback (most recent call last)
>  in 
> > 1 
> dataset.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx 
> in 
> pyarrow._dataset.Dataset.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx
>  in 
> pyarrow._dataset.Scanner.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi
>  in 
> pyarrow.lib.pyarrow_internal_check_status()/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()OSError: Could not open parquet input source 
> 'dev/test-dataset/2018/10/1': Invalid: Parquet file size is 0 bytes
> {code}
> If I manually delete the empty file, I can then use the to_table() function:
> {code:java}
> dataset.to_table(filter=(ds.field("year") == 2020) & (ds.field("month") == 
> 10)).to_pandas()
> {code}
> Is this a bug with pyarrow, adlfs, or fsspec?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-12-01 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241833#comment-17241833
 ] 

Lance Dacey commented on ARROW-10517:
-

Thanks - since the \{i} increments each time a new file is written, I am not 
sure if this can work for my use case unless I am designing this incorrectly.

I am using the partition_filename_cb similar to how I would create a 
materialized view in a database to ensure that there is only one row per unique 
ID based on the latest update timestamp. I can then connect this parquet 
dataset to our visualization tool, or I can export it to CSV format and email 
it to another team, etc.

 
{code:java}
# the historical dataset includes all rows; the number of files will depend on the
# frequency of scheduled downloads. it is possible to have multiple rows per unique ID
historical_dataset = [
    'dev/test/report_date=2018-01-01/part-0.parquet',
    'dev/test/report_date=2018-01-01/part-1.parquet',
    'dev/test/report_date=2018-01-01/part-2.parquet',
    'dev/test/report_date=2018-01-01/part-3.parquet',
    'dev/test/report_date=2018-01-01/part-4.parquet',
    'dev/test/report_date=2018-01-01/part-5.parquet',
]
# read the historical dataset and filter for the partition. in this case,
# report_date = 2018-01-01, so all data from that date is read into a table.
# convert to pandas dataframe, sort based on "id" and "updated_at" fields,
# drop duplicates based on "id" field, retaining the latest version,
# write to a new dataset which is just the latest version of each "id". The 6 parts
# are now in a single file which will be continuously overwritten if any new data is
# added to the historical_dataset. Our visualization tool connects to these finalized
# files, and sometimes I send the data through email for reporting purposes
latest_dataset = [
    'dev/test/report_date=2018-01-01/2018-01-01.parquet',
]
{code}
 

Perhaps there is a better way to go about this? With a database, I would just 
create a view which selects distinct on the ID column based on the latest update 
timestamp. This seems to be a common use case, so I am not sure how people 
would go about it with Parquet.
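
For completeness, the pandas step that stands in for that "distinct on" view today (a sketch, assuming the table has id and updated_at columns):

{code:python}
import pyarrow as pa

# keep only the latest version of each id, equivalent to
# DISTINCT ON (id) ... ORDER BY updated_at DESC in a database view
df = (
    table.to_pandas()
    .sort_values(["id", "updated_at"])
    .drop_duplicates(["id"], keep="last")
)
latest = pa.Table.from_pandas(df, preserve_index=False)
{code}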

 

> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
> 
>
> Key: ARROW-10517
> URL: https://issues.apache.org/jira/browse/ARROW-10517
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: azureblob, dataset, dataset-parquet-read, 
> dataset-parquet-write, fsspec
> Fix For: 2.0.0
>
> Attachments: ss.PNG, ss2.PNG
>
>
>  
> {code:python}
> # adal==1.2.5
> # adlfs==0.2.5
> # fsspec==0.7.4
> # pandas==1.1.3
> # pyarrow==2.0.0
> # azure-storage-blob==2.1.0
> # azure-storage-common==2.1.0
> import pyarrow.dataset as ds
> import fsspec
> from pyarrow.dataset import DirectoryPartitioning
> fs = fsspec.filesystem(protocol='abfs', 
>account_name=base.login, 
>account_key=base.password)
> ds.write_dataset(data=table, 
>  base_dir="dev/test7", 
>  basename_template=None, 
>  format="parquet",
>  partitioning=DirectoryPartitioning(pa.schema([("year", 
> pa.string()), ("month", pa.string()), ("day", pa.string())])), 
>  schema=table.schema,
>  filesystem=fs, 
> )
> {code}
>  I think this is due to early versions of adlfs having mkdir(). Although I 
> use write_to_dataset and write_table all of the time, so I am not sure why 
> this would be an issue.
> {code:python}
> ---
> RuntimeError  Traceback (most recent call last)
>  in 
>  13 
>  14 
> ---> 15 ds.write_dataset(data=table, 
>  16  base_dir="dev/test7",
>  17  basename_template=None,
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> write_dataset(data, base_dir, basename_template, format, partitioning, 
> schema, filesystem, file_options, use_threads)
> 771 filesystem, _ = _ensure_fs(filesystem)
> 772 
> --> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> 775 filesystem, partitioning, file_options, use_threads,
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
> pyarrow._fs._cb_create_dir()
> /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, 
> path, recursive)
> 226 def create_dir(self, path, recursive):
>   

[jira] [Commented] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition

2020-11-23 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237341#comment-17237341
 ] 

Lance Dacey commented on ARROW-10694:
-

FYI, I tested HivePartitioning as well, but faced the same issue. 

 
{code:java}
from pyarrow.dataset import HivePartitioning 

partition = HivePartitioning(pa.schema([("year", pa.int16()), ("month", 
pa.int8()), ("day", pa.int8())]))

FileNotFoundError: dev/test-dataset2/year=2018/month=1/day=1{code}

> [Python] ds.write_dataset() generates empty files for each final partition
> --
>
> Key: ARROW-10694
> URL: https://issues.apache.org/jira/browse/ARROW-10694
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> Python 3.8.6
> adlfs master branch
>Reporter: Lance Dacey
>Priority: Major
>
> ds.write_dataset() is generating empty files for the final partition folder 
> which causes errors when reading the dataset or converting a dataset to a 
> table.
> I believe this may be caused by fs.mkdir(). Without the final slash in the 
> path, an empty file is created in the "dev" container:
>  
> {code:java}
> fs = fsspec.filesystem(protocol='abfs', account_name=base.login, 
> account_key=base.password)
> fs.mkdir("dev/test2")
> {code}
>  
> If the final slash is added, a proper folder is created:
> {code:java}
> fs.mkdir("dev/test2/"){code}
>  
> Here is a full example of what happens with ds.write_dataset:
> {code:java}
> schema = pa.schema(
> [
> ("year", pa.int16()),
> ("month", pa.int8()),
> ("day", pa.int8()),
> ("report_date", pa.date32()),
> ("employee_id", pa.string()),
> ("designation", pa.dictionary(index_type=pa.int16(), 
> value_type=pa.string())),
> ]
> )
> part = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", 
> pa.int8()), ("day", pa.int8())]))
> ds.write_dataset(data=table, 
>  base_dir="dev/test-dataset", 
>  basename_template="test-{i}.parquet", 
>  format="parquet",
>  partitioning=part, 
>  schema=schema,
>  filesystem=fs)
> dataset.files
> #sample printed below, note the empty files
> [
>  'dev/test-dataset/2018/1/1/test-0.parquet',
>  'dev/test-dataset/2018/10/1',
>  'dev/test-dataset/2018/10/1/test-27.parquet',
>  'dev/test-dataset/2018/3/1',
>  'dev/test-dataset/2018/3/1/test-6.parquet',
>  'dev/test-dataset/2020/1/1',
>  'dev/test-dataset/2020/1/1/test-2.parquet',
>  'dev/test-dataset/2020/10/1',
>  'dev/test-dataset/2020/10/1/test-29.parquet',
>  'dev/test-dataset/2020/11/1',
>  'dev/test-dataset/2020/11/1/test-32.parquet',
>  'dev/test-dataset/2020/2/1',
>  'dev/test-dataset/2020/2/1/test-5.parquet',
>  'dev/test-dataset/2020/7/1',
>  'dev/test-dataset/2020/7/1/test-20.parquet',
>  'dev/test-dataset/2020/8/1',
>  'dev/test-dataset/2020/8/1/test-23.parquet',
>  'dev/test-dataset/2020/9/1',
>  'dev/test-dataset/2020/9/1/test-26.parquet'
> ]{code}
> As you can see, there is an empty file for each "day" partition. I was not 
> even able to read the dataset at all until I manually deleted the first empty 
> file in the dataset (2018/1/1).
> I then get an error when I try to use the to_table() method:
> {code:java}
> OSError   Traceback (most recent call last)
>  in 
> > 1 
> dataset.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx 
> in 
> pyarrow._dataset.Dataset.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx
>  in 
> pyarrow._dataset.Scanner.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi
>  in 
> pyarrow.lib.pyarrow_internal_check_status()/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()OSError: Could not open parquet input source 
> 'dev/test-dataset/2018/10/1': Invalid: Parquet file size is 0 bytes
> {code}
> If I manually delete the empty file, I can then use the to_table() function:
> {code:java}
> dataset.to_table(filter=(ds.field("year") == 2020) & (ds.field("month") == 
> 10)).to_pandas()
> {code}
> Is this a bug with pyarrow, adlfs, or fsspec?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition

2020-11-23 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237316#comment-17237316
 ] 

Lance Dacey commented on ARROW-10694:
-

{code:java}
print(fs.isfile("dev/test-dataset/2018/1/1"))
print(fs.info("dev/test-dataset/2018/1/1", detail=True)){code}
False
{'name': 'dev/test-dataset/2018/1/1/', 'size': 0, 'type': 'directory'}
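
For what it's worth, a minimal cleanup sketch (assuming an fsspec-compatible filesystem object fs like the one above, and that the zero-byte entries are plain placeholder blobs rather than real directories) could drop the markers before building the dataset; the path is the same test dataset used above:

{code:python}
# Sketch only: remove zero-byte placeholder blobs so the dataset factory
# does not try to parse them as Parquet files.
for path, info in fs.find("dev/test-dataset", detail=True).items():
    # keep real data files; remove entries with no bytes and no .parquet suffix
    if info.get("size", 0) == 0 and not path.endswith(".parquet"):
        fs.rm(path)
{code}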

 

> [Python] ds.write_dataset() generates empty files for each final partition
> --
>
> Key: ARROW-10694
> URL: https://issues.apache.org/jira/browse/ARROW-10694
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> Python 3.8.6
> adlfs master branch
>Reporter: Lance Dacey
>Priority: Major
>
> ds.write_dataset() is generating empty files for the final partition folder 
> which causes errors when reading the dataset or converting a dataset to a 
> table.
> I believe this may be caused by fs.mkdir(). Without the final slash in the 
> path, an empty file is created in the "dev" container:
>  
> {code:java}
> fs = fsspec.filesystem(protocol='abfs', account_name=base.login, 
> account_key=base.password)
> fs.mkdir("dev/test2")
> {code}
>  
> If the final slash is added, a proper folder is created:
> {code:java}
> fs.mkdir("dev/test2/"){code}
>  
> Here is a full example of what happens with ds.write_dataset:
> {code:java}
> schema = pa.schema(
> [
> ("year", pa.int16()),
> ("month", pa.int8()),
> ("day", pa.int8()),
> ("report_date", pa.date32()),
> ("employee_id", pa.string()),
> ("designation", pa.dictionary(index_type=pa.int16(), 
> value_type=pa.string())),
> ]
> )
> part = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", 
> pa.int8()), ("day", pa.int8())]))
> ds.write_dataset(data=table, 
>  base_dir="dev/test-dataset", 
>  basename_template="test-{i}.parquet", 
>  format="parquet",
>  partitioning=part, 
>  schema=schema,
>  filesystem=fs)
> dataset.files
> #sample printed below, note the empty files
> [
>  'dev/test-dataset/2018/1/1/test-0.parquet',
>  'dev/test-dataset/2018/10/1',
>  'dev/test-dataset/2018/10/1/test-27.parquet',
>  'dev/test-dataset/2018/3/1',
>  'dev/test-dataset/2018/3/1/test-6.parquet',
>  'dev/test-dataset/2020/1/1',
>  'dev/test-dataset/2020/1/1/test-2.parquet',
>  'dev/test-dataset/2020/10/1',
>  'dev/test-dataset/2020/10/1/test-29.parquet',
>  'dev/test-dataset/2020/11/1',
>  'dev/test-dataset/2020/11/1/test-32.parquet',
>  'dev/test-dataset/2020/2/1',
>  'dev/test-dataset/2020/2/1/test-5.parquet',
>  'dev/test-dataset/2020/7/1',
>  'dev/test-dataset/2020/7/1/test-20.parquet',
>  'dev/test-dataset/2020/8/1',
>  'dev/test-dataset/2020/8/1/test-23.parquet',
>  'dev/test-dataset/2020/9/1',
>  'dev/test-dataset/2020/9/1/test-26.parquet'
> ]{code}
> As you can see, there is an empty file for each "day" partition. I was not 
> even able to read the dataset at all until I manually deleted the first empty 
> file in the dataset (2018/1/1).
> I then get an error when I try to use the to_table() method:
> {code:java}
> OSError   Traceback (most recent call last)
>  in 
> > 1 
> dataset.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx 
> in 
> pyarrow._dataset.Dataset.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx
>  in 
> pyarrow._dataset.Scanner.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi
>  in 
> pyarrow.lib.pyarrow_internal_check_status()/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi
>  in pyarrow.lib.check_status()OSError: Could not open parquet input source 
> 'dev/test-dataset/2018/10/1': Invalid: Parquet file size is 0 bytes
> {code}
> If I manually delete the empty file, I can then use the to_table() function:
> {code:java}
> dataset.to_table(filter=(ds.field("year") == 2020) & (ds.field("month") == 
> 10)).to_pandas()
> {code}
> Is this a bug with pyarrow, adlfs, or fsspec?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition

2020-11-23 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237299#comment-17237299
 ] 

Lance Dacey commented on ARROW-10694:
-

Sure. https://github.com/dask/adlfs/issues/137

I tried the exclude_invalid_files argument but ran into an error:

 
{code:java}
dataset = ds.dataset(source="dev/test-dataset", 
 format="parquet", 
 partitioning=partition,
 exclude_invalid_files=True,
 filesystem=fs)

---
FileNotFoundError Traceback (most recent call last)
 in 
> 1 dataset = ds.dataset(source="dev/test-dataset", 
  2  format="parquet",
  3  partitioning=partition,
  4  exclude_invalid_files=True,
  5  filesystem=fs)

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, 
schema, format, filesystem, partitioning, partition_base_dir, 
exclude_invalid_files, ignore_prefixes)
669 # TODO(kszucs): support InMemoryDataset for a table input
670 if _is_path_like(source):
--> 671 return _filesystem_dataset(source, **kwargs)
672 elif isinstance(source, (tuple, list)):
673 if all(_is_path_like(elem) for elem in source):

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
_filesystem_dataset(source, schema, filesystem, partitioning, format, 
partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
434 selector_ignore_prefixes=selector_ignore_prefixes
435 )
--> 436 factory = FileSystemDatasetFactory(fs, paths_or_selector, format, 
options)
437 
438 return factory.finish(schema)

/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
pyarrow._dataset.FileSystemDatasetFactory.__init__()

/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in 
pyarrow.lib.pyarrow_internal_check_status()

/opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
pyarrow._fs._cb_open_input_file()

/opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in open_input_file(self, 
path)
274 
275 if not self.fs.isfile(path):
--> 276 raise FileNotFoundError(path)
277 
278 return PythonFile(self.fs.open(path, mode="rb"), mode="r")

FileNotFoundError: dev/test-dataset/2018/1/1
{code}
That folder and the empty file exist, though:
{code:java}
for file in fs.find("dev/test-dataset"):
    print(file)

dev/test-dataset/2018/1/1
dev/test-dataset/2018/1/1/test-0.parquet
dev/test-dataset/2018/10/1
dev/test-dataset/2018/10/1/test-27.parquet
dev/test-dataset/2018/11/1
dev/test-dataset/2018/11/1/test-30.parquet
dev/test-dataset/2018/12/1
dev/test-dataset/2018/12/1/test-33.parquet
dev/test-dataset/2018/2/1
dev/test-dataset/2018/2/1/test-3.parquet

{code}
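
One workaround sketch (an assumption on my part, not something tested here): build the dataset from an explicit list of .parquet paths so the zero-byte entries are never visited, passing partition_base_dir so the partition fields can still be parsed from the file paths. fs and partition are the same objects used above:

{code:python}
# Sketch only: pass the real data files explicitly instead of relying on
# exclude_invalid_files, which stumbles over the zero-byte placeholder blobs.
paths = [p for p in fs.find("dev/test-dataset") if p.endswith(".parquet")]

dataset = ds.dataset(source=paths,
                     format="parquet",
                     partitioning=partition,
                     partition_base_dir="dev/test-dataset",
                     filesystem=fs)
{code}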
 

> [Python] ds.write_dataset() generates empty files for each final partition
> --
>
> Key: ARROW-10694
> URL: https://issues.apache.org/jira/browse/ARROW-10694
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> Python 3.8.6
> adlfs master branch
>Reporter: Lance Dacey
>Priority: Major
>
> ds.write_dataset() is generating empty files for the final partition folder 
> which causes errors when reading the dataset or converting a dataset to a 
> table.
> I believe this may be caused by fs.mkdir(). Without the final slash in the 
> path, an empty file is created in the "dev" container:
>  
> {code:java}
> fs = fsspec.filesystem(protocol='abfs', account_name=base.login, 
> account_key=base.password)
> fs.mkdir("dev/test2")
> {code}
>  
> If the final slash is added, a proper folder is created:
> {code:java}
> fs.mkdir("dev/test2/"){code}
>  
> Here is a full example of what happens with ds.write_dataset:
> {code:java}
> schema = pa.schema(
> [
> ("year", pa.int16()),
> ("month", pa.int8()),
> ("day", pa.int8()),
> ("report_date", pa.date32()),
> ("employee_id", pa.string()),
> ("designation", pa.dictionary(index_type=pa.int16(), 
> value_type=pa.string())),
> ]
> )
> part = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", 
> pa.int8()), ("day", pa.int8())]))
> ds.write_dataset(data=table, 
>  base_dir="dev/test-dataset", 
>  basename_template="test-{i}.parquet", 
>  format="parquet",
>  partitioning=part, 
>  schema=schema,
>  filesystem=fs)
> dataset.files
> #sample printed below, note the empty files
> [
>  'dev/test-dataset/2018/1/1/test-0.parquet',
>  'dev/test-dataset/2018/10

[jira] [Created] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition

2020-11-23 Thread Lance Dacey (Jira)
Lance Dacey created ARROW-10694:
---

 Summary: [Python] ds.write_dataset() generates empty files for 
each final partition
 Key: ARROW-10694
 URL: https://issues.apache.org/jira/browse/ARROW-10694
 Project: Apache Arrow
  Issue Type: Bug
Affects Versions: 2.0.0
 Environment: Ubuntu 18.04
Python 3.8.6
adlfs master branch
Reporter: Lance Dacey


ds.write_dataset() is generating empty files for the final partition folder 
which causes errors when reading the dataset or converting a dataset to a table.

I believe this may be caused by fs.mkdir(). Without the final slash in the 
path, an empty file is created in the "dev" container:

 
{code:java}
fs = fsspec.filesystem(protocol='abfs', account_name=base.login, 
account_key=base.password)
fs.mkdir("dev/test2")
{code}
 

If the final slash is added, a proper folder is created:
{code:java}
fs.mkdir("dev/test2/"){code}
 

Here is a full example of what happens with ds.write_dataset:
{code:java}
schema = pa.schema(
[
("year", pa.int16()),
("month", pa.int8()),
("day", pa.int8()),
("report_date", pa.date32()),
("employee_id", pa.string()),
("designation", pa.dictionary(index_type=pa.int16(), 
value_type=pa.string())),
]
)

part = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", 
pa.int8()), ("day", pa.int8())]))

ds.write_dataset(data=table, 
 base_dir="dev/test-dataset", 
 basename_template="test-{i}.parquet", 
 format="parquet",
 partitioning=part, 
 schema=schema,
 filesystem=fs)

dataset.files

#sample printed below, note the empty files
[
 'dev/test-dataset/2018/1/1/test-0.parquet',
 'dev/test-dataset/2018/10/1',
 'dev/test-dataset/2018/10/1/test-27.parquet',
 'dev/test-dataset/2018/3/1',
 'dev/test-dataset/2018/3/1/test-6.parquet',
 'dev/test-dataset/2020/1/1',
 'dev/test-dataset/2020/1/1/test-2.parquet',
 'dev/test-dataset/2020/10/1',
 'dev/test-dataset/2020/10/1/test-29.parquet',
 'dev/test-dataset/2020/11/1',
 'dev/test-dataset/2020/11/1/test-32.parquet',
 'dev/test-dataset/2020/2/1',
 'dev/test-dataset/2020/2/1/test-5.parquet',
 'dev/test-dataset/2020/7/1',
 'dev/test-dataset/2020/7/1/test-20.parquet',
 'dev/test-dataset/2020/8/1',
 'dev/test-dataset/2020/8/1/test-23.parquet',
 'dev/test-dataset/2020/9/1',
 'dev/test-dataset/2020/9/1/test-26.parquet'
]{code}
As you can see, there is an empty file for each "day" partition. I was not even 
able to read the dataset at all until I manually deleted the first empty file 
in the dataset (2018/1/1).

I then get an error when I try to use the to_table() method:
{code:java}
OSError   Traceback (most recent call last)
 in 
> 1 
dataset.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx 
in 
pyarrow._dataset.Dataset.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx
 in 
pyarrow._dataset.Scanner.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi
 in 
pyarrow.lib.pyarrow_internal_check_status()/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi
 in pyarrow.lib.check_status()OSError: Could not open parquet input source 
'dev/test-dataset/2018/10/1': Invalid: Parquet file size is 0 bytes
{code}
If I manually delete the empty file, I can then use the to_table() function:
{code:java}
dataset.to_table(filter=(ds.field("year") == 2020) & (ds.field("month") == 
10)).to_pandas()
{code}
Is this a bug with pyarrow, adlfs, or fsspec?

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-11-21 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236702#comment-17236702
 ] 

Lance Dacey commented on ARROW-10517:
-

Regarding partition_filename_cb, a common use for me is to build a full date-based 
filename from the partition folder values.
{code:java}
year=2020/month=8/day=4
partition_filename_cb=lambda x: "-".join(str(y).zfill(2) for y in x) + 
".parquet"
2020-08-04.parquet{code}
I am doing this to address a "many small files" situation in a few scenarios, 
though perhaps there is a better way to go about it that would make this 
unnecessary.

 

Scenario 1):
 * I use turbodbc to query 6 different SQL servers every 30 minutes (48 
schedules per date * 6), loading the results directly into pyarrow tables which 
I then write to a partitioned dataset.
 * This creates a lot of small files, which I then filter and rewrite to a 
separate dataset with partition_filename_cb to consolidate the data into a 
single daily file.

Scenario 2):
 * I query for data every hour from some REST APIs (Zendesk and ServiceNow) for 
any tickets which have changed since my last query (based on the latest 
updated_at timestamp)
 * I partition this data based on the created_at date, so we end up with a lot 
of small files due to the download frequency, and a single download might 
include tickets which were created_at in the past. That is at least 24 files * 
the number of unique dates which were updated.
 * So again, I filter for any created_at partition which was changed in the 
last hour and then rewrite a "final" consolidated version of the data in a 
separate dataset using partition_filename_cb; that consolidated dataset is used 
for downstream tasks and transformation.
 * Ultimately, I need to ensure that our visualizations/reports only display 
the latest version of each ticket even if it was updated a dozen times, so this 
step generally involves sorting the data and dropping duplicates on some unique 
constraints.

 

Both scenarios produce tiny files at each download interval (depending on how I 
partition the data) but are fairly large overall: scenario 1 is over 500 
million rows, and scenario 2 is over 70 million rows since March of this year. 
Maybe partition_filename_cb is not strictly required; it just seemed faster and 
more organized (under 300 ms to read a single file compared to over 1 minute to 
filter for a day spread across 96 UUID-named files).

 

Any best practices here to avoid the need to use the partition_filename_cb 
function?
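
For reference, one consolidation approach that skips partition_filename_cb entirely would be to rewrite each changed day with a fixed basename_template via the newer ds.write_dataset API. This is only a sketch under that assumption (the paths, field names, and date are illustrative), since the legacy write_to_dataset does not expose basename_template:

{code:python}
import pyarrow as pa
import pyarrow.dataset as ds

# Read the small staging files for one day (hive-style partitions, as above).
staging = ds.dataset("dev/staging/evaluations", format="parquet",
                     partitioning="hive", filesystem=fs)
day = staging.to_table(filter=(ds.field("year") == 2020) &
                              (ds.field("month") == 8) &
                              (ds.field("day") == 4))

# Rewrite the whole day under a predictable basename; partition on the same
# year/month/day columns, reusing their existing types from the table schema.
ds.write_dataset(day,
                 base_dir="dev/final/evaluations",
                 basename_template="2020-08-04-{i}.parquet",
                 format="parquet",
                 partitioning=ds.partitioning(
                     day.select(["year", "month", "day"]).schema,
                     flavor="hive"),
                 filesystem=fs)
{code}

Since the filtered day is written from a single in-memory table, each partition directory ends up with one consolidated file, which keeps reads fast without the callback.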

> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
> 
>
> Key: ARROW-10517
> URL: https://issues.apache.org/jira/browse/ARROW-10517
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: azureblob, dataset, dataset-parquet-read, 
> dataset-parquet-write, fsspec
> Fix For: 2.0.0
>
> Attachments: ss.PNG, ss2.PNG
>
>
>  
> {code:python}
> # adal==1.2.5
> # adlfs==0.2.5
> # fsspec==0.7.4
> # pandas==1.1.3
> # pyarrow==2.0.0
> # azure-storage-blob==2.1.0
> # azure-storage-common==2.1.0
> import pyarrow.dataset as ds
> import fsspec
> from pyarrow.dataset import DirectoryPartitioning
> fs = fsspec.filesystem(protocol='abfs', 
>account_name=base.login, 
>account_key=base.password)
> ds.write_dataset(data=table, 
>  base_dir="dev/test7", 
>  basename_template=None, 
>  format="parquet",
>  partitioning=DirectoryPartitioning(pa.schema([("year", 
> pa.string()), ("month", pa.string()), ("day", pa.string())])), 
>  schema=table.schema,
>  filesystem=fs, 
> )
> {code}
>  I think this is due to early versions of adlfs having mkdir(). Although I 
> use write_to_dataset and write_table all of the time, so I am not sure why 
> this would be an issue.
> {code:python}
> ---
> RuntimeError  Traceback (most recent call last)
>  in 
>  13 
>  14 
> ---> 15 ds.write_dataset(data=table, 
>  16  base_dir="dev/test7",
>  17  basename_template=None,
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> write_dataset(data, base_dir, basename_template, format, partitioning, 
> schema, filesystem, file_options, use_threads)
> 771 filesystem, _ = _ensure_fs(filesystem)
> 772 
> --> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> 775 filesystem, partitioning, file_options, use_threads,
> /opt/conda/lib/pytho

[jira] [Closed] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-11-20 Thread Lance Dacey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Dacey closed ARROW-10517.
---
Fix Version/s: 2.0.0
   Resolution: Later

My issue is caused by another library (adlfs). Once this is fixed, this issue 
will not be relevant.

https://github.com/dask/adlfs/issues/135

> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
> 
>
> Key: ARROW-10517
> URL: https://issues.apache.org/jira/browse/ARROW-10517
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: azureblob, dataset, dataset-parquet-read, 
> dataset-parquet-write, fsspec
> Fix For: 2.0.0
>
> Attachments: ss.PNG, ss2.PNG
>
>
>  
> {code:python}
> # adal==1.2.5
> # adlfs==0.2.5
> # fsspec==0.7.4
> # pandas==1.1.3
> # pyarrow==2.0.0
> # azure-storage-blob==2.1.0
> # azure-storage-common==2.1.0
> import pyarrow.dataset as ds
> import fsspec
> from pyarrow.dataset import DirectoryPartitioning
> fs = fsspec.filesystem(protocol='abfs', 
>account_name=base.login, 
>account_key=base.password)
> ds.write_dataset(data=table, 
>  base_dir="dev/test7", 
>  basename_template=None, 
>  format="parquet",
>  partitioning=DirectoryPartitioning(pa.schema([("year", 
> pa.string()), ("month", pa.string()), ("day", pa.string())])), 
>  schema=table.schema,
>  filesystem=fs, 
> )
> {code}
>  I think this is due to early versions of adlfs having mkdir(). Although I 
> use write_to_dataset and write_table all of the time, so I am not sure why 
> this would be an issue.
> {code:python}
> ---
> RuntimeError  Traceback (most recent call last)
>  in 
>  13 
>  14 
> ---> 15 ds.write_dataset(data=table, 
>  16  base_dir="dev/test7",
>  17  basename_template=None,
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> write_dataset(data, base_dir, basename_template, format, partitioning, 
> schema, filesystem, file_options, use_threads)
> 771 filesystem, _ = _ensure_fs(filesystem)
> 772 
> --> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> 775 filesystem, partitioning, file_options, use_threads,
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
> pyarrow._fs._cb_create_dir()
> /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, 
> path, recursive)
> 226 def create_dir(self, path, recursive):
> 227 # mkdir also raises FileNotFoundError when base directory is 
> not found
> --> 228 self.fs.mkdir(path, create_parents=recursive)
> 229 
> 230 def delete_dir(self, path):
> /opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, 
> delimiter, exists_ok, **kwargs)
> 561 else:
> 562 ## everything else
> --> 563 raise RuntimeError(f"Cannot create 
> {container_name}{delimiter}{path}.")
> 564 else:
> 565 if container_name in self.ls("") and path:
> RuntimeError: Cannot create dev/test7/2020/01/28.
> {code}
>  
> Next, if I try to read a dataset (keep in mind that this works with 
> read_table and ParquetDataset):
> {code:python}
> ds.dataset(source="dev/staging/evaluations", 
>format="parquet", 
>partitioning="hive",
>exclude_invalid_files=False,
>filesystem=fs
>   )
> {code}
>  
> This doesn't seem to respect the filesystem connected to Azure Blob.
> {code:python}
> ---
> FileNotFoundError Traceback (most recent call last)
>  in 
> > 1 ds.dataset(source="dev/staging/evaluations", 
>   2format="parquet",
>   3partitioning="hive",
>   4exclude_invalid_files=False,
>   5filesystem=fs
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, 
> schema, format, filesystem, partitioning, partition_base_dir, 
> exclude_invalid_files, ignore_prefixes)
> 669 # TODO(kszucs): support InMemoryDataset for a table input
> 670 if _is_path_like(source):
> --> 671 return _filesystem_dataset(source, **kwargs)
> 672 elif isinstance(

[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-11-20 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236357#comment-17236357
 ] 

Lance Dacey commented on ARROW-10517:
-

Thanks for your help. By adding **kwargs to the adlfs find() return, I was able 
to get ds.dataset features to work (read and write) with the latest version of 
adlfs. I am sure the library will be updated soon.

 

Since I am stuck with azure-storage-blob SDK v2 in production, I have been 
using an old version of adlfs (0.2.5). I am unable to use write_dataset, but I 
am able to use write_to_dataset() with the legacy system. This error leads back 
to adlfs core.py in the mkdir function.

I think I will close this issue now since write_to_dataset() works for my needs 
right now and it supports the _partition_filename_cb_ which I find useful. I 
will just wait until I can safely upgrade to the latest version of adlfs where 
I know it will work fine.

 
{code:java}
ds.write_dataset(data=table, 
 base_dir="dev/test-write", 
 format="parquet",
 
partitioning=ds.DirectoryPartitioning(pyarrow.schema([("report_date", 
pyarrow.date32())])),
 filesystem=fs)

---
RuntimeError  Traceback (most recent call last)
 in 
> 1 ds.write_dataset(data=table, 
  2  base_dir="dev/test-write",
  3  format="parquet",
  4  
partitioning=ds.DirectoryPartitioning(pyarrow.schema([("report_date", 
pyarrow.date32())])),
  5  filesystem=fs)

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
write_dataset(data, base_dir, basename_template, format, partitioning, schema, 
filesystem, file_options, use_threads)
771 filesystem, _ = _ensure_fs(filesystem)
772 
--> 773 _filesystemdataset_write(
774 data, base_dir, basename_template, schema,
775 filesystem, partitioning, file_options, use_threads,

/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
pyarrow._dataset._filesystemdataset_write()

/opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
pyarrow._fs._cb_create_dir()

/opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, path, 
recursive)
226 def create_dir(self, path, recursive):
227 # mkdir also raises FileNotFoundError when base directory is 
not found
--> 228 self.fs.mkdir(path, create_parents=recursive)
229 
230 def delete_dir(self, path):

/opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, 
delimiter, exists_ok, **kwargs)
561 else:
562 ## everything else
--> 563 raise RuntimeError(f"Cannot create 
{container_name}{delimiter}{path}.")
564 else:
565 if container_name in self.ls("") and path:

RuntimeError: Cannot create dev/test-write/2018-03-01.
{code}

> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
> 
>
> Key: ARROW-10517
> URL: https://issues.apache.org/jira/browse/ARROW-10517
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: azureblob, dataset, dataset-parquet-read, 
> dataset-parquet-write, fsspec
> Attachments: ss.PNG, ss2.PNG
>
>
>  
> {code:python}
> # adal==1.2.5
> # adlfs==0.2.5
> # fsspec==0.7.4
> # pandas==1.1.3
> # pyarrow==2.0.0
> # azure-storage-blob==2.1.0
> # azure-storage-common==2.1.0
> import pyarrow.dataset as ds
> import fsspec
> from pyarrow.dataset import DirectoryPartitioning
> fs = fsspec.filesystem(protocol='abfs', 
>account_name=base.login, 
>account_key=base.password)
> ds.write_dataset(data=table, 
>  base_dir="dev/test7", 
>  basename_template=None, 
>  format="parquet",
>  partitioning=DirectoryPartitioning(pa.schema([("year", 
> pa.string()), ("month", pa.string()), ("day", pa.string())])), 
>  schema=table.schema,
>  filesystem=fs, 
> )
> {code}
>  I think this is due to early versions of adlfs having mkdir(). Although I 
> use write_to_dataset and write_table all of the time, so I am not sure why 
> this would be an issue.
> {code:python}
> ---
> RuntimeError  Traceback (most recent call last)
>  in 
>  13 
>  14 
> ---> 15 ds.write_dataset(data=table, 
>  16  base_dir=

[jira] [Comment Edited] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-11-20 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236357#comment-17236357
 ] 

Lance Dacey edited comment on ARROW-10517 at 11/20/20, 6:24 PM:


Thanks for your help. By adding **kwargs to the adlfs find() return, I was able 
to get ds.dataset features to work (read and write) with the latest version of 
adlfs. I am sure the library will be updated soon: 
https://github.com/dask/adlfs/issues/135.

 

Since I am stuck with azure-storage-blob SDK v2 in production, I have been 
using an old version of adlfs (0.2.5). I am unable to use write_dataset, but I 
am able to use write_to_dataset() with the legacy system. This error leads back 
to adlfs core.py in the mkdir function.

I think I will close this issue now since write_to_dataset() works for my needs 
right now and it supports the _partition_filename_cb_ which I find useful. I 
will just wait until I can safely upgrade to the latest version of adlfs where 
I know it will work fine.

 
{code:java}
ds.write_dataset(data=table, 
 base_dir="dev/test-write", 
 format="parquet",
 
partitioning=ds.DirectoryPartitioning(pyarrow.schema([("report_date", 
pyarrow.date32())])),
 filesystem=fs)

---
RuntimeError  Traceback (most recent call last)
 in 
> 1 ds.write_dataset(data=table, 
  2  base_dir="dev/test-write",
  3  format="parquet",
  4  
partitioning=ds.DirectoryPartitioning(pyarrow.schema([("report_date", 
pyarrow.date32())])),
  5  filesystem=fs)

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
write_dataset(data, base_dir, basename_template, format, partitioning, schema, 
filesystem, file_options, use_threads)
771 filesystem, _ = _ensure_fs(filesystem)
772 
--> 773 _filesystemdataset_write(
774 data, base_dir, basename_template, schema,
775 filesystem, partitioning, file_options, use_threads,

/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
pyarrow._dataset._filesystemdataset_write()

/opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
pyarrow._fs._cb_create_dir()

/opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, path, 
recursive)
226 def create_dir(self, path, recursive):
227 # mkdir also raises FileNotFoundError when base directory is 
not found
--> 228 self.fs.mkdir(path, create_parents=recursive)
229 
230 def delete_dir(self, path):

/opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, 
delimiter, exists_ok, **kwargs)
561 else:
562 ## everything else
--> 563 raise RuntimeError(f"Cannot create 
{container_name}{delimiter}{path}.")
564 else:
565 if container_name in self.ls("") and path:

RuntimeError: Cannot create dev/test-write/2018-03-01.
{code}


was (Author: ldacey):
Thanks for your help. By adding **kwargs to the adlfs find() return, I was able 
to get ds.dataset features to work (read and write) with the latest version of 
adlfs. I am sure the library will be updated soon.

 

Since I am stuck with azure-storage-blob SDK v2 in production, I have been 
using an old version of adlfs (0.2.5). I am unable to use write_dataset, but I 
am able to use write_to_dataset() with the legacy system. This error leads back 
to adlfs core.py in the mkdir function.

I think I will close this issue now since write_to_dataset() works for my needs 
right now and it supports the _partition_filename_cb_ which I find useful. I 
will just wait until I can safely upgrade to the latest version of adlfs where 
I know it will work fine.

 
{code:java}
ds.write_dataset(data=table, 
 base_dir="dev/test-write", 
 format="parquet",
 
partitioning=ds.DirectoryPartitioning(pyarrow.schema([("report_date", 
pyarrow.date32())])),
 filesystem=fs)

---
RuntimeError  Traceback (most recent call last)
 in 
> 1 ds.write_dataset(data=table, 
  2  base_dir="dev/test-write",
  3  format="parquet",
  4  
partitioning=ds.DirectoryPartitioning(pyarrow.schema([("report_date", 
pyarrow.date32())])),
  5  filesystem=fs)

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
write_dataset(data, base_dir, basename_template, format, partitioning, schema, 
filesystem, file_options, use_threads)
771 filesystem, _ = _ensure_fs(filesystem)
772 
--> 773 _filesystemdataset_write(
774 data,

[jira] [Comment Edited] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-11-20 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235477#comment-17235477
 ] 

Lance Dacey edited comment on ARROW-10517 at 11/20/20, 8:26 AM:


Yeah, I can open an issue there. https://github.com/dask/adlfs/issues/135

I think that this might be the major issue I am facing with v12 Azure Blob SDK. 
I cannot read a dataset because I get a list of files returned instead of a 
dictionary (but I am able to write a dataset).

I think I might have to open some fsspec issues as well because mkdir is 
creating those empty files instead of a directory which doesn't seem right. 
Also ran into an issue with read_table(use_legacy_dataset=True) where data was 
trying to be read from the wrong partition with a similar name "domain=tnt" and 
"domain=tntplus". So it looks like perhaps only the prefix was being used to 
list the files.

 

 

edit:
  
{code:java}
fs.info("dev/testing10/evaluations")                       
{'name': 'dev/testing10/evaluations/', 'size': 0, 'type': 'directory'}   {code}
 


was (Author: ldacey):
Yeah, I can open an issue there.

I hopefully am not using an old version. I installed miniconda and then used 
the environment files to make sure that adlfs is the recent version. And I 
print the module versions in the script so everything should be aligned.

I think I might have to open some fsspec issues as well because mkdir is 
creating those empty files instead of a directory which doesn't seem right. 
Also ran into an issue with read_table(use_legacy_dataset=True) where data was 
trying to be read from the wrong partition with a similar name "domain=tnt" and 
"domain=tntplus". So it looks like perhaps only the prefix was being used to 
list the files.

 

 

edit:
 
{code:java}
fs.info("dev/testing10/evaluations")                       
{'name': 'dev/testing10/evaluations/', 'size': 0, 'type': 'directory'}   {code}
 

> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
> 
>
> Key: ARROW-10517
> URL: https://issues.apache.org/jira/browse/ARROW-10517
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: azureblob, dataset, dataset-parquet-read, 
> dataset-parquet-write, fsspec
> Attachments: ss.PNG, ss2.PNG
>
>
>  
> {code:python}
> # adal==1.2.5
> # adlfs==0.2.5
> # fsspec==0.7.4
> # pandas==1.1.3
> # pyarrow==2.0.0
> # azure-storage-blob==2.1.0
> # azure-storage-common==2.1.0
> import pyarrow.dataset as ds
> import fsspec
> from pyarrow.dataset import DirectoryPartitioning
> fs = fsspec.filesystem(protocol='abfs', 
>account_name=base.login, 
>account_key=base.password)
> ds.write_dataset(data=table, 
>  base_dir="dev/test7", 
>  basename_template=None, 
>  format="parquet",
>  partitioning=DirectoryPartitioning(pa.schema([("year", 
> pa.string()), ("month", pa.string()), ("day", pa.string())])), 
>  schema=table.schema,
>  filesystem=fs, 
> )
> {code}
>  I think this is due to early versions of adlfs having mkdir(). Although I 
> use write_to_dataset and write_table all of the time, so I am not sure why 
> this would be an issue.
> {code:python}
> ---
> RuntimeError  Traceback (most recent call last)
>  in 
>  13 
>  14 
> ---> 15 ds.write_dataset(data=table, 
>  16  base_dir="dev/test7",
>  17  basename_template=None,
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> write_dataset(data, base_dir, basename_template, format, partitioning, 
> schema, filesystem, file_options, use_threads)
> 771 filesystem, _ = _ensure_fs(filesystem)
> 772 
> --> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> 775 filesystem, partitioning, file_options, use_threads,
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
> pyarrow._fs._cb_create_dir()
> /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, 
> path, recursive)
> 226 def create_dir(self, path, recursive):
> 227 # mkdir also raises FileNotFoundError when base directory is 
> not found
> --> 228 self.fs.mkdir(path, create_parents=recursive)
> 229 
> 230 def delete_dir(self, path):
> /opt/conda

[jira] [Comment Edited] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-11-19 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235477#comment-17235477
 ] 

Lance Dacey edited comment on ARROW-10517 at 11/19/20, 1:48 PM:


Yeah, I can open an issue there.

I hopefully am not using an old version. I installed miniconda and then used 
the environment files to make sure that adlfs is the recent version. And I 
print the module versions in the script so everything should be aligned.

I think I might have to open some fsspec issues as well because mkdir is 
creating those empty files instead of a directory which doesn't seem right. 
Also ran into an issue with read_table(use_legacy_dataset=True) where data was 
trying to be read from the wrong partition with a similar name "domain=tnt" and 
"domain=tntplus". So it looks like perhaps only the prefix was being used to 
list the files.

 

 

edit:
 
{code:java}
fs.info("dev/testing10/evaluations")                       
{'name': 'dev/testing10/evaluations/', 'size': 0, 'type': 'directory'}   {code}
 


was (Author: ldacey):
Yeah, I can open an issue there.

I hopefully am not using an old version. I installed miniconda and then used 
the environment files to make sure that adlfs is the recent version. And I 
print the module versions in the script so everything should be aligned.

I think I might have to open some fsspec issues as well because mkdir is 
creating those empty files instead of a directory which doesn't seem right. 
Also ran into an issue with read_table(use_legacy_dataset=True) where data was 
trying to be read from the wrong partition with a similar name "domain=tnt" and 
"domain=tntplus". So it looks like perhaps only the prefix was being used to 
list the files.

 

> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
> 
>
> Key: ARROW-10517
> URL: https://issues.apache.org/jira/browse/ARROW-10517
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: azureblob, dataset, dataset-parquet-read, 
> dataset-parquet-write, fsspec
> Attachments: ss.PNG, ss2.PNG
>
>
>  
> {code:python}
> # adal==1.2.5
> # adlfs==0.2.5
> # fsspec==0.7.4
> # pandas==1.1.3
> # pyarrow==2.0.0
> # azure-storage-blob==2.1.0
> # azure-storage-common==2.1.0
> import pyarrow.dataset as ds
> import fsspec
> from pyarrow.dataset import DirectoryPartitioning
> fs = fsspec.filesystem(protocol='abfs', 
>account_name=base.login, 
>account_key=base.password)
> ds.write_dataset(data=table, 
>  base_dir="dev/test7", 
>  basename_template=None, 
>  format="parquet",
>  partitioning=DirectoryPartitioning(pa.schema([("year", 
> pa.string()), ("month", pa.string()), ("day", pa.string())])), 
>  schema=table.schema,
>  filesystem=fs, 
> )
> {code}
>  I think this is due to early versions of adlfs having mkdir(). Although I 
> use write_to_dataset and write_table all of the time, so I am not sure why 
> this would be an issue.
> {code:python}
> ---
> RuntimeError  Traceback (most recent call last)
>  in 
>  13 
>  14 
> ---> 15 ds.write_dataset(data=table, 
>  16  base_dir="dev/test7",
>  17  basename_template=None,
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> write_dataset(data, base_dir, basename_template, format, partitioning, 
> schema, filesystem, file_options, use_threads)
> 771 filesystem, _ = _ensure_fs(filesystem)
> 772 
> --> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> 775 filesystem, partitioning, file_options, use_threads,
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
> pyarrow._fs._cb_create_dir()
> /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, 
> path, recursive)
> 226 def create_dir(self, path, recursive):
> 227 # mkdir also raises FileNotFoundError when base directory is 
> not found
> --> 228 self.fs.mkdir(path, create_parents=recursive)
> 229 
> 230 def delete_dir(self, path):
> /opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, 
> delimiter, exists_ok, **kwargs)
> 561 else:
> 562 ## everything else
> --> 563 

[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-11-19 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235477#comment-17235477
 ] 

Lance Dacey commented on ARROW-10517:
-

Yeah, I can open an issue there.

I hopefully am not using an old version. I installed miniconda and then used 
the environment files to make sure that adlfs is the recent version. And I 
print the module versions in the script so everything should be aligned.

I think I might have to open some fsspec issues as well because mkdir is 
creating those empty files instead of a directory which doesn't seem right. 
Also ran into an issue with read_table(use_legacy_dataset=True) where data was 
trying to be read from the wrong partition with a similar name "domain=tnt" and 
"domain=tntplus". So it looks like perhaps only the prefix was being used to 
list the files.

 

> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
> 
>
> Key: ARROW-10517
> URL: https://issues.apache.org/jira/browse/ARROW-10517
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: azureblob, dataset, dataset-parquet-read, 
> dataset-parquet-write, fsspec
> Attachments: ss.PNG, ss2.PNG
>
>
>  
> {code:python}
> # adal==1.2.5
> # adlfs==0.2.5
> # fsspec==0.7.4
> # pandas==1.1.3
> # pyarrow==2.0.0
> # azure-storage-blob==2.1.0
> # azure-storage-common==2.1.0
> import pyarrow.dataset as ds
> import fsspec
> from pyarrow.dataset import DirectoryPartitioning
> fs = fsspec.filesystem(protocol='abfs', 
>account_name=base.login, 
>account_key=base.password)
> ds.write_dataset(data=table, 
>  base_dir="dev/test7", 
>  basename_template=None, 
>  format="parquet",
>  partitioning=DirectoryPartitioning(pa.schema([("year", 
> pa.string()), ("month", pa.string()), ("day", pa.string())])), 
>  schema=table.schema,
>  filesystem=fs, 
> )
> {code}
>  I think this is due to early versions of adlfs having mkdir(). Although I 
> use write_to_dataset and write_table all of the time, so I am not sure why 
> this would be an issue.
> {code:python}
> ---
> RuntimeError  Traceback (most recent call last)
>  in 
>  13 
>  14 
> ---> 15 ds.write_dataset(data=table, 
>  16  base_dir="dev/test7",
>  17  basename_template=None,
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> write_dataset(data, base_dir, basename_template, format, partitioning, 
> schema, filesystem, file_options, use_threads)
> 771 filesystem, _ = _ensure_fs(filesystem)
> 772 
> --> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> 775 filesystem, partitioning, file_options, use_threads,
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
> pyarrow._fs._cb_create_dir()
> /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, 
> path, recursive)
> 226 def create_dir(self, path, recursive):
> 227 # mkdir also raises FileNotFoundError when base directory is 
> not found
> --> 228 self.fs.mkdir(path, create_parents=recursive)
> 229 
> 230 def delete_dir(self, path):
> /opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, 
> delimiter, exists_ok, **kwargs)
> 561 else:
> 562 ## everything else
> --> 563 raise RuntimeError(f"Cannot create 
> {container_name}{delimiter}{path}.")
> 564 else:
> 565 if container_name in self.ls("") and path:
> RuntimeError: Cannot create dev/test7/2020/01/28.
> {code}
>  
> Next, if I try to read a dataset (keep in mind that this works with 
> read_table and ParquetDataset):
> {code:python}
> ds.dataset(source="dev/staging/evaluations", 
>format="parquet", 
>partitioning="hive",
>exclude_invalid_files=False,
>filesystem=fs
>   )
> {code}
>  
> This doesn't seem to respect the filesystem connected to Azure Blob.
> {code:python}
> ---
> FileNotFoundError Traceback (most recent call last)
>  in 
> > 1 ds.dataset(source="dev/staging/evaluations", 
>   2format="parquet",
>   3

[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-11-19 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235388#comment-17235388
 ] 

Lance Dacey commented on ARROW-10517:
-

Latest adlfs (0.5.5):

 

This really creates the test.parquet file as well, not just the directory:
{code:java}
fs.mkdir("dev/test999/2020/01/28/test.parquet", create_parents=True)
{code}
And if I try to run the same line again, it fails because the partition already 
exists:
{code:python}
---
StorageErrorException: Operation returned an invalid status 'The specified blob 
already exists.'

During handling of the above exception, another exception occurred:

ResourceExistsError   Traceback (most recent call last)
/c/airflow/test.py in 
> 6 fs.mkdir("dev/test999/2020/01/28/test.parquet", 
create_parents=True)

~/miniconda3/envs/airflow/lib/python3.8/site-packages/adlfs/spec.py in 
mkdir(self, path, delimiter, exist_ok, **kwargs)
880 
881 def mkdir(self, path, delimiter="/", exist_ok=False, **kwargs):
--> 882 maybe_sync(self._mkdir, self, path, delimiter, exist_ok)
883 
884 async def _mkdir(self, path, delimiter="/", exist_ok=False, 
**kwargs):

~/miniconda3/envs/airflow/lib/python3.8/site-packages/fsspec/asyn.py in 
maybe_sync(func, self, *args, **kwargs)
 98 if inspect.iscoroutinefunction(func):
 99 # run the awaitable on the loop
--> 100 return sync(loop, func, *args, **kwargs)
101 else:
102 # just call the blocking function

~/miniconda3/envs/airflow/lib/python3.8/site-packages/fsspec/asyn.py in 
sync(loop, func, callback_timeout, *args, **kwargs)
 69 if error[0]:
 70 typ, exc, tb = error[0]
---> 71 raise exc.with_traceback(tb)
 72 else:
 73 return result[0]

~/miniconda3/envs/airflow/lib/python3.8/site-packages/fsspec/asyn.py in f()
 53 if callback_timeout is not None:
 54 future = asyncio.wait_for(future, callback_timeout)
---> 55 result[0] = await future
 56 except Exception:
 57 error[0] = sys.exc_info()

~/miniconda3/envs/airflow/lib/python3.8/site-packages/adlfs/spec.py in 
_mkdir(self, path, delimiter, exist_ok, **kwargs)
918 container=container_name
919 )
--> 920 await container_client.upload_blob(name=path, data="")
921 else:
922 ## everything else

~/miniconda3/envs/airflow/lib/python3.8/site-packages/azure/core/tracing/decorator_async.py
 in wrapper_use_tracer(*args, **kwargs)
 72 span_impl_type = settings.tracing_implementation()
 73 if span_impl_type is None:
---> 74 return await func(*args, **kwargs)
 75 
 76 # Merge span is parameter is set, but only if no explicit 
parent are passed

~/miniconda3/envs/airflow/lib/python3.8/site-packages/azure/storage/blob/aio/_container_client_async.py
 in upload_blob(self, name, data, blob_type, length, metadata, **kwargs)
715 timeout = kwargs.pop('timeout', None)
716 encoding = kwargs.pop('encoding', 'UTF-8')
--> 717 await blob.upload_blob(
718 data,
719 blob_type=blob_type,

~/miniconda3/envs/airflow/lib/python3.8/site-packages/azure/core/tracing/decorator_async.py
 in wrapper_use_tracer(*args, **kwargs)
 72 span_impl_type = settings.tracing_implementation()
 73 if span_impl_type is None:
---> 74 return await func(*args, **kwargs)
 75 
 76 # Merge span is parameter is set, but only if no explicit 
parent are passed

~/miniconda3/envs/airflow/lib/python3.8/site-packages/azure/storage/blob/aio/_blob_client_async.py
 in upload_blob(self, data, blob_type, length, metadata, **kwargs)
267 **kwargs)
268 if blob_type == BlobType.BlockBlob:
--> 269 return await upload_block_blob(**options)
270 if blob_type == BlobType.PageBlob:
271 return await upload_page_blob(**options)

~/miniconda3/envs/airflow/lib/python3.8/site-packages/azure/storage/blob/aio/_upload_helpers.py
 in upload_block_blob(client, data, stream, length, overwrite, headers, 
validate_content, max_concurrency, blob_settings, encryption_options, **kwargs)
131 except StorageErrorException as error:
132 try:
--> 133 process_storage_error(error)
134 except ResourceModifiedError as mod_error:
135 if not overwrite:

~/miniconda3/envs/airflow/lib/python3.8/site-packages/azure/storage/blob/_shared/response_handlers.py
 in process_storage_error(storage_error)
145 error.error_code = error_code
146 error

[jira] [Updated] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-11-19 Thread Lance Dacey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Dacey updated ARROW-10517:

Attachment: ss2.PNG

> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
> 
>
> Key: ARROW-10517
> URL: https://issues.apache.org/jira/browse/ARROW-10517
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: azureblob, dataset, dataset-parquet-read, 
> dataset-parquet-write, fsspec
> Attachments: ss.PNG, ss2.PNG
>
>
>  
> {code:python}
> # adal==1.2.5
> # adlfs==0.2.5
> # fsspec==0.7.4
> # pandas==1.1.3
> # pyarrow==2.0.0
> # azure-storage-blob==2.1.0
> # azure-storage-common==2.1.0
> import pyarrow.dataset as ds
> import fsspec
> from pyarrow.dataset import DirectoryPartitioning
> fs = fsspec.filesystem(protocol='abfs', 
>account_name=base.login, 
>account_key=base.password)
> ds.write_dataset(data=table, 
>  base_dir="dev/test7", 
>  basename_template=None, 
>  format="parquet",
>  partitioning=DirectoryPartitioning(pa.schema([("year", 
> pa.string()), ("month", pa.string()), ("day", pa.string())])), 
>  schema=table.schema,
>  filesystem=fs, 
> )
> {code}
>  I think this is due to early versions of adlfs having mkdir(). Although I 
> use write_to_dataset and write_table all of the time, so I am not sure why 
> this would be an issue.
> {code:python}
> ---
> RuntimeError  Traceback (most recent call last)
>  in 
>  13 
>  14 
> ---> 15 ds.write_dataset(data=table, 
>  16  base_dir="dev/test7",
>  17  basename_template=None,
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> write_dataset(data, base_dir, basename_template, format, partitioning, 
> schema, filesystem, file_options, use_threads)
> 771 filesystem, _ = _ensure_fs(filesystem)
> 772 
> --> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> 775 filesystem, partitioning, file_options, use_threads,
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
> pyarrow._fs._cb_create_dir()
> /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, 
> path, recursive)
> 226 def create_dir(self, path, recursive):
> 227 # mkdir also raises FileNotFoundError when base directory is 
> not found
> --> 228 self.fs.mkdir(path, create_parents=recursive)
> 229 
> 230 def delete_dir(self, path):
> /opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, 
> delimiter, exists_ok, **kwargs)
> 561 else:
> 562 ## everything else
> --> 563 raise RuntimeError(f"Cannot create 
> {container_name}{delimiter}{path}.")
> 564 else:
> 565 if container_name in self.ls("") and path:
> RuntimeError: Cannot create dev/test7/2020/01/28.
> {code}
>  
> Next, if I try to read a dataset (keep in mind that this works with 
> read_table and ParquetDataset):
> {code:python}
> ds.dataset(source="dev/staging/evaluations", 
>format="parquet", 
>partitioning="hive",
>exclude_invalid_files=False,
>filesystem=fs
>   )
> {code}
>  
> This doesn't seem to respect the filesystem connected to Azure Blob.
> {code:python}
> ---
> FileNotFoundError Traceback (most recent call last)
>  in 
> > 1 ds.dataset(source="dev/staging/evaluations", 
>   2format="parquet",
>   3partitioning="hive",
>   4exclude_invalid_files=False,
>   5filesystem=fs
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, 
> schema, format, filesystem, partitioning, partition_base_dir, 
> exclude_invalid_files, ignore_prefixes)
> 669 # TODO(kszucs): support InMemoryDataset for a table input
> 670 if _is_path_like(source):
> --> 671 return _filesystem_dataset(source, **kwargs)
> 672 elif isinstance(source, (tuple, list)):
> 673 if all(_is_path_like(elem) for elem in source):
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> _filesystem_dataset(source, schema, filesys

[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-11-19 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235306#comment-17235306
 ] 

Lance Dacey commented on ARROW-10517:
-

 !ss.PNG! 


I added a screenshot of the results of the mkdir command. I am not sure why it 
created a file for the 28 partition, but it looks like that is what happened.

mkdir is failing in my production environment because I am stuck using old 
versions of adlfs and fsspec (bound to the azure-storage-blob v2 SDK; I cannot 
use v12 due to Airflow dependencies, and Airflow is what runs all of my pyarrow 
tasks in the first place).

What I don't understand is why I can use write_to_dataset (the legacy version) 
without any issues while the write_dataset method fails. Is the filesystem 
implementation different? Both should be using adlfs and fsspec in my case on 
Azure Blob, so it seems odd that one method successfully creates the 
directories and partitions while the other fails (which is why I raised this as 
a pyarrow issue).

> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
> 
>
> Key: ARROW-10517
> URL: https://issues.apache.org/jira/browse/ARROW-10517
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: azureblob, dataset, dataset-parquet-read, 
> dataset-parquet-write, fsspec
> Attachments: ss.PNG
>
>
>  
> {code:python}
> # adal==1.2.5
> # adlfs==0.2.5
> # fsspec==0.7.4
> # pandas==1.1.3
> # pyarrow==2.0.0
> # azure-storage-blob==2.1.0
> # azure-storage-common==2.1.0
> import pyarrow.dataset as ds
> import fsspec
> from pyarrow.dataset import DirectoryPartitioning
> fs = fsspec.filesystem(protocol='abfs', 
>account_name=base.login, 
>account_key=base.password)
> ds.write_dataset(data=table, 
>  base_dir="dev/test7", 
>  basename_template=None, 
>  format="parquet",
>  partitioning=DirectoryPartitioning(pa.schema([("year", 
> pa.string()), ("month", pa.string()), ("day", pa.string())])), 
>  schema=table.schema,
>  filesystem=fs, 
> )
> {code}
>  I think this is due to early versions of adlfs having mkdir(). Although I 
> use write_to_dataset and write_table all of the time, so I am not sure why 
> this would be an issue.
> {code:python}
> ---
> RuntimeError  Traceback (most recent call last)
>  in 
>  13 
>  14 
> ---> 15 ds.write_dataset(data=table, 
>  16  base_dir="dev/test7",
>  17  basename_template=None,
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> write_dataset(data, base_dir, basename_template, format, partitioning, 
> schema, filesystem, file_options, use_threads)
> 771 filesystem, _ = _ensure_fs(filesystem)
> 772 
> --> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> 775 filesystem, partitioning, file_options, use_threads,
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
> pyarrow._fs._cb_create_dir()
> /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, 
> path, recursive)
> 226 def create_dir(self, path, recursive):
> 227 # mkdir also raises FileNotFoundError when base directory is 
> not found
> --> 228 self.fs.mkdir(path, create_parents=recursive)
> 229 
> 230 def delete_dir(self, path):
> /opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, 
> delimiter, exists_ok, **kwargs)
> 561 else:
> 562 ## everything else
> --> 563 raise RuntimeError(f"Cannot create 
> {container_name}{delimiter}{path}.")
> 564 else:
> 565 if container_name in self.ls("") and path:
> RuntimeError: Cannot create dev/test7/2020/01/28.
> {code}
>  
> Next, if I try to read a dataset (keep in mind that this works with 
> read_table and ParquetDataset):
> {code:python}
> ds.dataset(source="dev/staging/evaluations", 
>format="parquet", 
>partitioning="hive",
>exclude_invalid_files=False,
>filesystem=fs
>   )
> {code}
>  
> This doesn't seem to respect the filesystem connected to Azure Blob.
> {code:python}
> ---
> FileNotFoundError 

[jira] [Updated] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-11-19 Thread Lance Dacey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Dacey updated ARROW-10517:

Attachment: ss.PNG

> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
> 
>
> Key: ARROW-10517
> URL: https://issues.apache.org/jira/browse/ARROW-10517
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: azureblob, dataset, dataset-parquet-read, 
> dataset-parquet-write, fsspec
> Attachments: ss.PNG
>
>
>  
> {code:python}
> # adal==1.2.5
> # adlfs==0.2.5
> # fsspec==0.7.4
> # pandas==1.1.3
> # pyarrow==2.0.0
> # azure-storage-blob==2.1.0
> # azure-storage-common==2.1.0
> import pyarrow.dataset as ds
> import fsspec
> from pyarrow.dataset import DirectoryPartitioning
> fs = fsspec.filesystem(protocol='abfs', 
>account_name=base.login, 
>account_key=base.password)
> ds.write_dataset(data=table, 
>  base_dir="dev/test7", 
>  basename_template=None, 
>  format="parquet",
>  partitioning=DirectoryPartitioning(pa.schema([("year", 
> pa.string()), ("month", pa.string()), ("day", pa.string())])), 
>  schema=table.schema,
>  filesystem=fs, 
> )
> {code}
>  I think this is due to early versions of adlfs having mkdir(). Although I 
> use write_to_dataset and write_table all of the time, so I am not sure why 
> this would be an issue.
> {code:python}
> ---
> RuntimeError  Traceback (most recent call last)
>  in 
>  13 
>  14 
> ---> 15 ds.write_dataset(data=table, 
>  16  base_dir="dev/test7",
>  17  basename_template=None,
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> write_dataset(data, base_dir, basename_template, format, partitioning, 
> schema, filesystem, file_options, use_threads)
> 771 filesystem, _ = _ensure_fs(filesystem)
> 772 
> --> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> 775 filesystem, partitioning, file_options, use_threads,
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
> pyarrow._fs._cb_create_dir()
> /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, 
> path, recursive)
> 226 def create_dir(self, path, recursive):
> 227 # mkdir also raises FileNotFoundError when base directory is 
> not found
> --> 228 self.fs.mkdir(path, create_parents=recursive)
> 229 
> 230 def delete_dir(self, path):
> /opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, 
> delimiter, exists_ok, **kwargs)
> 561 else:
> 562 ## everything else
> --> 563 raise RuntimeError(f"Cannot create 
> {container_name}{delimiter}{path}.")
> 564 else:
> 565 if container_name in self.ls("") and path:
> RuntimeError: Cannot create dev/test7/2020/01/28.
> {code}
>  
> Next, if I try to read a dataset (keep in mind that this works with 
> read_table and ParquetDataset):
> {code:python}
> ds.dataset(source="dev/staging/evaluations", 
>format="parquet", 
>partitioning="hive",
>exclude_invalid_files=False,
>filesystem=fs
>   )
> {code}
>  
> This doesn't seem to respect the filesystem connected to Azure Blob.
> {code:python}
> ---
> FileNotFoundError Traceback (most recent call last)
>  in 
> > 1 ds.dataset(source="dev/staging/evaluations", 
>   2format="parquet",
>   3partitioning="hive",
>   4exclude_invalid_files=False,
>   5filesystem=fs
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, 
> schema, format, filesystem, partitioning, partition_base_dir, 
> exclude_invalid_files, ignore_prefixes)
> 669 # TODO(kszucs): support InMemoryDataset for a table input
> 670 if _is_path_like(source):
> --> 671 return _filesystem_dataset(source, **kwargs)
> 672 elif isinstance(source, (tuple, list)):
> 673 if all(_is_path_like(elem) for elem in source):
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> _filesystem_dataset(source, schema, filesystem, parti

[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-11-19 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235277#comment-17235277
 ] 

Lance Dacey commented on ARROW-10517:
-

This works in my local conda environment (dependencies posted in my last edit, 
using the latest versions of fsspec and adlfs). The "28" partition was a file 
instead of a folder in this case.

{code:python}
fs.mkdir("dev/test7/2020/01/28", create_parents=True)
{code}

If I run the same code in my production environment, it fails, even though I use 
that environment with read_table and write_to_dataset all the time.


{code:python}
name: old
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.8
  - azure-storage-blob=2
  - pandas=1.1
  - pyarrow=2
  - pip=20.2
  - pip:
  - adlfs==0.2.5
  - fsspec==0.7.4

---
RuntimeError  Traceback (most recent call last)
 in 
> 1 fs.mkdir("dev/test8/2020/01/28", create_parents=True)

/opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, 
delimiter, exists_ok, **kwargs)
561 else:
562 ## everything else
--> 563 raise RuntimeError(f"Cannot create 
{container_name}{delimiter}{path}.")
564 else:
565 if container_name in self.ls("") and path:

RuntimeError: Cannot create dev/test8/2020/01/28.

{code}

However, the dataset read function now works and supports row-level filtering, 
which is great (the dataset below is over 65 million rows, and I can filter for 
specific IDs across multiple files in under 2 seconds):


{code:java}
dataset = ds.dataset(source=ds_path, 
 format="parquet", 
 partitioning="hive",
 exclude_invalid_files=False,
 filesystem=fs)

len(dataset.files)
1050

table = dataset.to_table(columns=None, filter=
 (ds.field("year") == "2020") & 
 (ds.field("month") == "11") & 
 (ds.field("day") > "10") &
 (ds.field("id") == "102648"))
{code}

But I cannot use write_dataset (along with the new partitioning features), 
unfortunately.
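
One rough, untested idea for a workaround (the subclass name below is just for 
illustration, not part of adlfs): since Azure Blob has no real directories, the 
mkdir call that pyarrow's create_dir makes could simply be skipped on the old 
adlfs, letting write_dataset create the blob paths directly:

{code:python}
from adlfs import AzureBlobFileSystem


class NoMkdirAzureBlobFileSystem(AzureBlobFileSystem):
    # Hypothetical workaround for adlfs 0.2.5: treat mkdir as a no-op.
    def mkdir(self, path, *args, **kwargs):
        # Blob storage has no real directories; the parquet writes themselves
        # create the full blob paths, so there is nothing to do here.
        return None


fs = NoMkdirAzureBlobFileSystem(account_name=base.login,
                                account_key=base.password)
{code}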


> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
> 
>
> Key: ARROW-10517
> URL: https://issues.apache.org/jira/browse/ARROW-10517
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: azureblob, dataset, dataset-parquet-read, 
> dataset-parquet-write, fsspec
>
>  
> {code:python}
> # adal==1.2.5
> # adlfs==0.2.5
> # fsspec==0.7.4
> # pandas==1.1.3
> # pyarrow==2.0.0
> # azure-storage-blob==2.1.0
> # azure-storage-common==2.1.0
> import pyarrow.dataset as ds
> import fsspec
> from pyarrow.dataset import DirectoryPartitioning
> fs = fsspec.filesystem(protocol='abfs', 
>account_name=base.login, 
>account_key=base.password)
> ds.write_dataset(data=table, 
>  base_dir="dev/test7", 
>  basename_template=None, 
>  format="parquet",
>  partitioning=DirectoryPartitioning(pa.schema([("year", 
> pa.string()), ("month", pa.string()), ("day", pa.string())])), 
>  schema=table.schema,
>  filesystem=fs, 
> )
> {code}
>  I think this is due to early versions of adlfs having mkdir(). Although I 
> use write_to_dataset and write_table all of the time, so I am not sure why 
> this would be an issue.
> {code:python}
> ---
> RuntimeError  Traceback (most recent call last)
>  in 
>  13 
>  14 
> ---> 15 ds.write_dataset(data=table, 
>  16  base_dir="dev/test7",
>  17  basename_template=None,
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> write_dataset(data, base_dir, basename_template, format, partitioning, 
> schema, filesystem, file_options, use_threads)
> 771 filesystem, _ = _ensure_fs(filesystem)
> 772 
> --> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> 775 filesystem, partitioning, file_options, use_threads,
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
> pyarrow._fs._cb_create_dir()
> /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, 
> path, recursive)
> 226 def create_dir(self, path, recursive):
> 227  

[jira] [Comment Edited] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-11-18 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235084#comment-17235084
 ] 

Lance Dacey edited comment on ARROW-10517 at 11/19/20, 7:44 AM:


Added an edit with the results of pure fsspec and adlfs find() commands against 
a dataset I created with pyarrow. For some reason, a list is returned even 
though I am using the latest version of each library. 

I checked the versions with conda list, and then inside the notebook I ran:

{code:java}
print('\n'.join(f'{m.__name__} {m.__version__}' for m in globals().values() if 
getattr(m, '__version__', None)))
{code}


A separate attempt on my laptop locally using a fresh env file:

{code:java}
name: airflow
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.8
  - azure-storage-blob=12
  - pandas=1.1
  - pyarrow=2
  - adlfs=0.5


~/miniconda3/envs/airflow/lib/python3.8/site-packages/pyarrow/dataset.py in 
_filesystem_dataset(source, schema, filesystem, partitioning, format, 
partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
434 selector_ignore_prefixes=selector_ignore_prefixes
435 )
--> 436 factory = FileSystemDatasetFactory(fs, paths_or_selector, format, 
options)
437 
438 return factory.finish(schema)

~/miniconda3/envs/airflow/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
pyarrow._dataset.FileSystemDatasetFactory.__init__()

~/miniconda3/envs/airflow/lib/python3.8/site-packages/pyarrow/error.pxi in 
pyarrow.lib.pyarrow_internal_check_status()

~/miniconda3/envs/airflow/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
pyarrow._fs._cb_get_file_info_selector()

~/miniconda3/envs/airflow/lib/python3.8/site-packages/pyarrow/fs.py in 
get_file_info_selector(self, selector)
219 selector.base_dir, maxdepth=maxdepth, withdirs=True, 
detail=True
220 )
--> 221 for path, info in selected_files.items():
222 infos.append(self._create_file_info(path, info))
223 

AttributeError: 'list' object has no attribute 'items'
{code}
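
A quick way to confirm what the installed adlfs actually returns for the 
selector call pyarrow makes (same arguments as in the pyarrow/fs.py frame 
above); on a working combination this should be a dict of {path: info}, not a 
plain list of paths:

{code:python}
# mirrors pyarrow's get_file_info_selector call shown in the traceback
out = fs.find("dev/staging/evaluations", maxdepth=None, withdirs=True,
              detail=True)
print(type(out))  # expect dict; adlfs 0.5.x returned a list here
{code}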




was (Author: ldacey):
Added an edit with the results of pure fsspec and adlfs find() commands against 
a dataset I created with pyarrow. For some reason, a list is being output 
although I am using the latest version of each library. 

I checked the versions by doing a conda list, and then inside of the notebook I 
ran:

{code:java}
print('\n'.join(f'{m.__name__} {m.__version__}' for m in globals().values() if 
getattr(m, '__version__', None)))
{code}


> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
> 
>
> Key: ARROW-10517
> URL: https://issues.apache.org/jira/browse/ARROW-10517
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: azureblob, dataset, dataset-parquet-read, 
> dataset-parquet-write, fsspec
>
>  
> {code:python}
> # adal==1.2.5
> # adlfs==0.2.5
> # fsspec==0.7.4
> # pandas==1.1.3
> # pyarrow==2.0.0
> # azure-storage-blob==2.1.0
> # azure-storage-common==2.1.0
> import pyarrow.dataset as ds
> import fsspec
> from pyarrow.dataset import DirectoryPartitioning
> fs = fsspec.filesystem(protocol='abfs', 
>account_name=base.login, 
>account_key=base.password)
> ds.write_dataset(data=table, 
>  base_dir="dev/test7", 
>  basename_template=None, 
>  format="parquet",
>  partitioning=DirectoryPartitioning(pa.schema([("year", 
> pa.string()), ("month", pa.string()), ("day", pa.string())])), 
>  schema=table.schema,
>  filesystem=fs, 
> )
> {code}
>  I think this is due to early versions of adlfs having mkdir(). Although I 
> use write_to_dataset and write_table all of the time, so I am not sure why 
> this would be an issue.
> {code:python}
> ---
> RuntimeError  Traceback (most recent call last)
>  in 
>  13 
>  14 
> ---> 15 ds.write_dataset(data=table, 
>  16  base_dir="dev/test7",
>  17  basename_template=None,
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> write_dataset(data, base_dir, basename_template, format, partitioning, 
> schema, filesystem, file_options, use_threads)
> 771 filesystem, _ = _ensure_fs(filesystem)
> 772 
> --> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> 775 filesystem, partitioning, file_options, use_threads,

[jira] [Updated] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-11-18 Thread Lance Dacey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Dacey updated ARROW-10517:

Description: 
 
{code:python}
# adal==1.2.5
# adlfs==0.2.5
# fsspec==0.7.4
# pandas==1.1.3
# pyarrow==2.0.0
# azure-storage-blob==2.1.0
# azure-storage-common==2.1.0

import pyarrow.dataset as ds
import fsspec
from pyarrow.dataset import DirectoryPartitioning

fs = fsspec.filesystem(protocol='abfs', 
   account_name=base.login, 
   account_key=base.password)


ds.write_dataset(data=table, 
 base_dir="dev/test7", 
 basename_template=None, 
 format="parquet",
 partitioning=DirectoryPartitioning(pa.schema([("year", 
pa.string()), ("month", pa.string()), ("day", pa.string())])), 
 schema=table.schema,
 filesystem=fs, 
)
{code}
 I think this is due to early versions of adlfs having mkdir(). Although I use 
write_to_dataset and write_table all of the time, so I am not sure why this 
would be an issue.
{code:python}
---
RuntimeError  Traceback (most recent call last)
 in 
 13 
 14 
---> 15 ds.write_dataset(data=table, 
 16  base_dir="dev/test7",
 17  basename_template=None,

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
write_dataset(data, base_dir, basename_template, format, partitioning, schema, 
filesystem, file_options, use_threads)
771 filesystem, _ = _ensure_fs(filesystem)
772 
--> 773 _filesystemdataset_write(
774 data, base_dir, basename_template, schema,
775 filesystem, partitioning, file_options, use_threads,

/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
pyarrow._dataset._filesystemdataset_write()

/opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
pyarrow._fs._cb_create_dir()

/opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, path, 
recursive)
226 def create_dir(self, path, recursive):
227 # mkdir also raises FileNotFoundError when base directory is 
not found
--> 228 self.fs.mkdir(path, create_parents=recursive)
229 
230 def delete_dir(self, path):

/opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, 
delimiter, exists_ok, **kwargs)
561 else:
562 ## everything else
--> 563 raise RuntimeError(f"Cannot create 
{container_name}{delimiter}{path}.")
564 else:
565 if container_name in self.ls("") and path:

RuntimeError: Cannot create dev/test7/2020/01/28.
{code}
 
Next, if I try to read a dataset (keep in mind that this works with read_table 
and ParquetDataset):

{code:python}
ds.dataset(source="dev/staging/evaluations", 
   format="parquet", 
   partitioning="hive",
   exclude_invalid_files=False,
   filesystem=fs
  )
{code}
 
This doesn't seem to respect the filesystem connected to Azure Blob.
{code:python}
---
FileNotFoundError Traceback (most recent call last)
 in 
> 1 ds.dataset(source="dev/staging/evaluations", 
  2format="parquet",
  3partitioning="hive",
  4exclude_invalid_files=False,
  5filesystem=fs

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, 
schema, format, filesystem, partitioning, partition_base_dir, 
exclude_invalid_files, ignore_prefixes)
669 # TODO(kszucs): support InMemoryDataset for a table input
670 if _is_path_like(source):
--> 671 return _filesystem_dataset(source, **kwargs)
672 elif isinstance(source, (tuple, list)):
673 if all(_is_path_like(elem) for elem in source):

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
_filesystem_dataset(source, schema, filesystem, partitioning, format, 
partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
426 fs, paths_or_selector = _ensure_multiple_sources(source, 
filesystem)
427 else:
--> 428 fs, paths_or_selector = _ensure_single_source(source, 
filesystem)
429 
430 options = FileSystemFactoryOptions(

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
_ensure_single_source(path, filesystem)
402 paths_or_selector = [path]
403 else:
--> 404 raise FileNotFoundError(path)
405 
406 return filesystem, paths_or_selector

FileNotFoundError: dev/staging/evaluations
{code}

This *does* work though when I list the blobs before passing them to ds.dataset:

{code:python}
blobs = wasb.list_blobs(container_name="dev", prefix="stag

[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-11-18 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235084#comment-17235084
 ] 

Lance Dacey commented on ARROW-10517:
-

Added an edit with the results of pure fsspec and adlfs find() commands against 
a dataset I created with pyarrow. For some reason, a list is being output 
although I am using the latest version of each library. 

I checked the versions by doing a conda list, and then inside of the notebook I 
ran:

{code:java}
print('\n'.join(f'{m.__name__} {m.__version__}' for m in globals().values() if 
getattr(m, '__version__', None)))
{code}


> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
> 
>
> Key: ARROW-10517
> URL: https://issues.apache.org/jira/browse/ARROW-10517
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: azureblob, dataset, dataset-parquet-read, 
> dataset-parquet-write, fsspec
>
>  
> {code:python}
> # adal==1.2.5
> # adlfs==0.2.5
> # fsspec==0.7.4
> # pandas==1.1.3
> # pyarrow==2.0.0
> # azure-storage-blob==2.1.0
> # azure-storage-common==2.1.0
> import pyarrow.dataset as ds
> import fsspec
> from pyarrow.dataset import DirectoryPartitioning
> fs = fsspec.filesystem(protocol='abfs', 
>account_name=base.login, 
>account_key=base.password)
> ds.write_dataset(data=table, 
>  base_dir="dev/test7", 
>  basename_template=None, 
>  format="parquet",
>  partitioning=DirectoryPartitioning(pa.schema([("year", 
> pa.string()), ("month", pa.string()), ("day", pa.string())])), 
>  schema=table.schema,
>  filesystem=fs, 
> )
> {code}
>  I think this is due to early versions of adlfs having mkdir(). Although I 
> use write_to_dataset and write_table all of the time, so I am not sure why 
> this would be an issue.
> {code:python}
> ---
> RuntimeError  Traceback (most recent call last)
>  in 
>  13 
>  14 
> ---> 15 ds.write_dataset(data=table, 
>  16  base_dir="dev/test7",
>  17  basename_template=None,
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> write_dataset(data, base_dir, basename_template, format, partitioning, 
> schema, filesystem, file_options, use_threads)
> 771 filesystem, _ = _ensure_fs(filesystem)
> 772 
> --> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> 775 filesystem, partitioning, file_options, use_threads,
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
> pyarrow._fs._cb_create_dir()
> /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, 
> path, recursive)
> 226 def create_dir(self, path, recursive):
> 227 # mkdir also raises FileNotFoundError when base directory is 
> not found
> --> 228 self.fs.mkdir(path, create_parents=recursive)
> 229 
> 230 def delete_dir(self, path):
> /opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, 
> delimiter, exists_ok, **kwargs)
> 561 else:
> 562 ## everything else
> --> 563 raise RuntimeError(f"Cannot create 
> {container_name}{delimiter}{path}.")
> 564 else:
> 565 if container_name in self.ls("") and path:
> RuntimeError: Cannot create dev/test7/2020/01/28.
> {code}
>  
> Next, if I try to read a dataset (keep in mind that this works with 
> read_table and ParquetDataset):
> {code:python}
> ds.dataset(source="dev/staging/evaluations", 
>format="parquet", 
>partitioning="hive",
>exclude_invalid_files=False,
>filesystem=fs
>   )
> {code}
>  
> This doesn't seem to respect the filesystem connected to Azure Blob.
> {code:python}
> ---
> FileNotFoundError Traceback (most recent call last)
>  in 
> > 1 ds.dataset(source="dev/staging/evaluations", 
>   2format="parquet",
>   3partitioning="hive",
>   4exclude_invalid_files=False,
>   5filesystem=fs
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, 
> schema, format, filesystem, partitioning, partition_base_dir, 
> exclude_invalid_

[jira] [Updated] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-11-18 Thread Lance Dacey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Dacey updated ARROW-10517:

Description: 
 
{code:python}
# adal==1.2.5
# adlfs==0.2.5
# fsspec==0.7.4
# pandas==1.1.3
# pyarrow==2.0.0
# azure-storage-blob==2.1.0
# azure-storage-common==2.1.0

import pyarrow.dataset as ds
import fsspec
from pyarrow.dataset import DirectoryPartitioning

fs = fsspec.filesystem(protocol='abfs', 
   account_name=base.login, 
   account_key=base.password)


ds.write_dataset(data=table, 
 base_dir="dev/test7", 
 basename_template=None, 
 format="parquet",
 partitioning=DirectoryPartitioning(pa.schema([("year", 
pa.string()), ("month", pa.string()), ("day", pa.string())])), 
 schema=table.schema,
 filesystem=fs, 
)
{code}
 I think this is due to early versions of adlfs having mkdir(). Although I use 
write_to_dataset and write_table all of the time, so I am not sure why this 
would be an issue.
{code:python}
---
RuntimeError  Traceback (most recent call last)
 in 
 13 
 14 
---> 15 ds.write_dataset(data=table, 
 16  base_dir="dev/test7",
 17  basename_template=None,

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
write_dataset(data, base_dir, basename_template, format, partitioning, schema, 
filesystem, file_options, use_threads)
771 filesystem, _ = _ensure_fs(filesystem)
772 
--> 773 _filesystemdataset_write(
774 data, base_dir, basename_template, schema,
775 filesystem, partitioning, file_options, use_threads,

/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
pyarrow._dataset._filesystemdataset_write()

/opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
pyarrow._fs._cb_create_dir()

/opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, path, 
recursive)
226 def create_dir(self, path, recursive):
227 # mkdir also raises FileNotFoundError when base directory is 
not found
--> 228 self.fs.mkdir(path, create_parents=recursive)
229 
230 def delete_dir(self, path):

/opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, 
delimiter, exists_ok, **kwargs)
561 else:
562 ## everything else
--> 563 raise RuntimeError(f"Cannot create 
{container_name}{delimiter}{path}.")
564 else:
565 if container_name in self.ls("") and path:

RuntimeError: Cannot create dev/test7/2020/01/28.
{code}
 
Next, if I try to read a dataset (keep in mind that this works with read_table 
and ParquetDataset):

{code:python}
ds.dataset(source="dev/staging/evaluations", 
   format="parquet", 
   partitioning="hive",
   exclude_invalid_files=False,
   filesystem=fs
  )
{code}
 
This doesn't seem to respect the filesystem connected to Azure Blob.
{code:python}
---
FileNotFoundError Traceback (most recent call last)
 in 
> 1 ds.dataset(source="dev/staging/evaluations", 
  2format="parquet",
  3partitioning="hive",
  4exclude_invalid_files=False,
  5filesystem=fs

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, 
schema, format, filesystem, partitioning, partition_base_dir, 
exclude_invalid_files, ignore_prefixes)
669 # TODO(kszucs): support InMemoryDataset for a table input
670 if _is_path_like(source):
--> 671 return _filesystem_dataset(source, **kwargs)
672 elif isinstance(source, (tuple, list)):
673 if all(_is_path_like(elem) for elem in source):

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
_filesystem_dataset(source, schema, filesystem, partitioning, format, 
partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
426 fs, paths_or_selector = _ensure_multiple_sources(source, 
filesystem)
427 else:
--> 428 fs, paths_or_selector = _ensure_single_source(source, 
filesystem)
429 
430 options = FileSystemFactoryOptions(

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
_ensure_single_source(path, filesystem)
402 paths_or_selector = [path]
403 else:
--> 404 raise FileNotFoundError(path)
405 
406 return filesystem, paths_or_selector

FileNotFoundError: dev/staging/evaluations
{code}

This *does* work though when I list the blobs before passing them to ds.dataset:

{code:python}
blobs = wasb.list_blobs(container_name="dev", prefix="stag

[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-11-13 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231903#comment-17231903
 ] 

Lance Dacey commented on ARROW-10517:
-

Hello - let me know if my edit covers it.

Previously I did have some tests with the azure-storage-blob v12 SDK, but I 
cannot use that in production right now anyway (apache-airflow requirements), 
so I think I am stuck with adlfs 0.2.5.

> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
> 
>
> Key: ARROW-10517
> URL: https://issues.apache.org/jira/browse/ARROW-10517
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: azureblob, dataset, dataset-parquet-read, 
> dataset-parquet-write, fsspec
>
>  
> {code:python}
> # adal==1.2.5
> # adlfs==0.2.5
> # fsspec==0.7.4
> # pandas==1.1.3
> # pyarrow==2.0.0
> # azure-storage-blob==2.1.0
> # azure-storage-common==2.1.0
> import pyarrow.dataset as ds
> import fsspec
> from pyarrow.dataset import DirectoryPartitioning
> fs = fsspec.filesystem(protocol='abfs', 
>account_name=base.login, 
>account_key=base.password)
> ds.write_dataset(data=table, 
>  base_dir="dev/test7", 
>  basename_template=None, 
>  format="parquet",
>  partitioning=DirectoryPartitioning(pa.schema([("year", 
> pa.string()), ("month", pa.string()), ("day", pa.string())])), 
>  schema=table.schema,
>  filesystem=fs, 
> )
> {code}
>  I think this is due to early versions of adlfs having mkdir(). Although I 
> use write_to_dataset and write_table all of the time, so I am not sure why 
> this would be an issue.
> {code:python}
> ---
> RuntimeError  Traceback (most recent call last)
>  in 
>  13 
>  14 
> ---> 15 ds.write_dataset(data=table, 
>  16  base_dir="dev/test7",
>  17  basename_template=None,
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
> write_dataset(data, base_dir, basename_template, format, partitioning, 
> schema, filesystem, file_options, use_threads)
> 771 filesystem, _ = _ensure_fs(filesystem)
> 772 
> --> 773 _filesystemdataset_write(
> 774 data, base_dir, basename_template, schema,
> 775 filesystem, partitioning, file_options, use_threads,
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
> pyarrow._dataset._filesystemdataset_write()
> /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
> pyarrow._fs._cb_create_dir()
> /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, 
> path, recursive)
> 226 def create_dir(self, path, recursive):
> 227 # mkdir also raises FileNotFoundError when base directory is 
> not found
> --> 228 self.fs.mkdir(path, create_parents=recursive)
> 229 
> 230 def delete_dir(self, path):
> /opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, 
> delimiter, exists_ok, **kwargs)
> 561 else:
> 562 ## everything else
> --> 563 raise RuntimeError(f"Cannot create 
> {container_name}{delimiter}{path}.")
> 564 else:
> 565 if container_name in self.ls("") and path:
> RuntimeError: Cannot create dev/test7/2020/01/28.
> {code}
>  
> Next, if I try to read a dataset (keep in mind that this works with 
> read_table and ParquetDataset):
> {code:python}
> ds.dataset(source="dev/staging/evaluations", 
>format="parquet", 
>partitioning="hive",
>exclude_invalid_files=False,
>filesystem=fs
>   )
> {code}
>  
> This doesn't seem to respect the filesystem connected to Azure Blob.
> {code:python}
> ---
> FileNotFoundError Traceback (most recent call last)
>  in 
> > 1 ds.dataset(source="dev/staging/evaluations", 
>   2format="parquet",
>   3partitioning="hive",
>   4exclude_invalid_files=False,
>   5filesystem=fs
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, 
> schema, format, filesystem, partitioning, partition_base_dir, 
> exclude_invalid_files, ignore_prefixes)
> 669 # TODO(kszucs): support InMemoryDataset for a table input
> 670 if _is_path_like(source):
> --> 671 return _filesystem_dataset(source, **kwargs)
> 672

[jira] [Updated] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-11-13 Thread Lance Dacey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Dacey updated ARROW-10517:

Description: 
 

 

If I downgrade adlfs to 0.2.5 and azure-blob-storage to 2.1, and then upgrade 
fsspec (0.6.2 has errors with a detail kwarg, so I need to upgrade it):

 
{code:python}
# adal==1.2.5
# adlfs==0.2.5
# fsspec==0.7.4
# pandas==1.1.3
# pyarrow==2.0.0
# azure-storage-blob==2.1.0
# azure-storage-common==2.1.0

import pyarrow.dataset as ds
import fsspec
from pyarrow.dataset import DirectoryPartitioning

fs = fsspec.filesystem(protocol='abfs', 
   account_name=base.login, 
   account_key=base.password)


ds.write_dataset(data=table, 
 base_dir="dev/test7", 
 basename_template=None, 
 format="parquet",
 partitioning=DirectoryPartitioning(pa.schema([("year", 
pa.string()), ("month", pa.string()), ("day", pa.string())])), 
 schema=table.schema,
 filesystem=fs, 
)
{code}
 I think this is due to early versions of adlfs having mkdir(). Although I use 
write_to_dataset and write_table all of the time, so I am not sure why this 
would be an issue.
{code:python}
---
RuntimeError  Traceback (most recent call last)
 in 
 13 
 14 
---> 15 ds.write_dataset(data=table, 
 16  base_dir="dev/test7",
 17  basename_template=None,

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
write_dataset(data, base_dir, basename_template, format, partitioning, schema, 
filesystem, file_options, use_threads)
771 filesystem, _ = _ensure_fs(filesystem)
772 
--> 773 _filesystemdataset_write(
774 data, base_dir, basename_template, schema,
775 filesystem, partitioning, file_options, use_threads,

/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
pyarrow._dataset._filesystemdataset_write()

/opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
pyarrow._fs._cb_create_dir()

/opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, path, 
recursive)
226 def create_dir(self, path, recursive):
227 # mkdir also raises FileNotFoundError when base directory is 
not found
--> 228 self.fs.mkdir(path, create_parents=recursive)
229 
230 def delete_dir(self, path):

/opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, 
delimiter, exists_ok, **kwargs)
561 else:
562 ## everything else
--> 563 raise RuntimeError(f"Cannot create 
{container_name}{delimiter}{path}.")
564 else:
565 if container_name in self.ls("") and path:

RuntimeError: Cannot create dev/test7/2020/01/28.
{code}
 
Next, if I try to read a dataset (keep in mind that this works with read_table 
and ParquetDataset):

{code:python}
ds.dataset(source="dev/staging/evaluations", 
   format="parquet", 
   partitioning="hive",
   exclude_invalid_files=False,
   filesystem=fs
  )
{code}
 
This doesn't seem to respect the filesystem connected to Azure Blob.
{code:python}
---
FileNotFoundError Traceback (most recent call last)
 in 
> 1 ds.dataset(source="dev/staging/evaluations", 
  2format="parquet",
  3partitioning="hive",
  4exclude_invalid_files=False,
  5filesystem=fs

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, 
schema, format, filesystem, partitioning, partition_base_dir, 
exclude_invalid_files, ignore_prefixes)
669 # TODO(kszucs): support InMemoryDataset for a table input
670 if _is_path_like(source):
--> 671 return _filesystem_dataset(source, **kwargs)
672 elif isinstance(source, (tuple, list)):
673 if all(_is_path_like(elem) for elem in source):

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
_filesystem_dataset(source, schema, filesystem, partitioning, format, 
partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
426 fs, paths_or_selector = _ensure_multiple_sources(source, 
filesystem)
427 else:
--> 428 fs, paths_or_selector = _ensure_single_source(source, 
filesystem)
429 
430 options = FileSystemFactoryOptions(

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
_ensure_single_source(path, filesystem)
402 paths_or_selector = [path]
403 else:
--> 404 raise FileNotFoundError(path)
405 
406 return filesystem, paths_or_selector

FileNotFoundError: dev/staging/evaluations
{cod

[jira] [Updated] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-11-13 Thread Lance Dacey (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lance Dacey updated ARROW-10517:

Description: 

 
{code:python}
# adal==1.2.5
# adlfs==0.2.5
# fsspec==0.7.4
# pandas==1.1.3
# pyarrow==2.0.0
# azure-storage-blob==2.1.0
# azure-storage-common==2.1.0

import pyarrow.dataset as ds
import fsspec
from pyarrow.dataset import DirectoryPartitioning

fs = fsspec.filesystem(protocol='abfs', 
   account_name=base.login, 
   account_key=base.password)


ds.write_dataset(data=table, 
 base_dir="dev/test7", 
 basename_template=None, 
 format="parquet",
 partitioning=DirectoryPartitioning(pa.schema([("year", 
pa.string()), ("month", pa.string()), ("day", pa.string())])), 
 schema=table.schema,
 filesystem=fs, 
)
{code}
 I think this is due to early versions of adlfs having mkdir(). Although I use 
write_to_dataset and write_table all of the time, so I am not sure why this 
would be an issue.
{code:python}
---
RuntimeError  Traceback (most recent call last)
 in 
 13 
 14 
---> 15 ds.write_dataset(data=table, 
 16  base_dir="dev/test7",
 17  basename_template=None,

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
write_dataset(data, base_dir, basename_template, format, partitioning, schema, 
filesystem, file_options, use_threads)
771 filesystem, _ = _ensure_fs(filesystem)
772 
--> 773 _filesystemdataset_write(
774 data, base_dir, basename_template, schema,
775 filesystem, partitioning, file_options, use_threads,

/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in 
pyarrow._dataset._filesystemdataset_write()

/opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in 
pyarrow._fs._cb_create_dir()

/opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, path, 
recursive)
226 def create_dir(self, path, recursive):
227 # mkdir also raises FileNotFoundError when base directory is 
not found
--> 228 self.fs.mkdir(path, create_parents=recursive)
229 
230 def delete_dir(self, path):

/opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, 
delimiter, exists_ok, **kwargs)
561 else:
562 ## everything else
--> 563 raise RuntimeError(f"Cannot create 
{container_name}{delimiter}{path}.")
564 else:
565 if container_name in self.ls("") and path:

RuntimeError: Cannot create dev/test7/2020/01/28.
{code}
 
Next, if I try to read a dataset (keep in mind that this works with read_table 
and ParquetDataset):

{code:python}
ds.dataset(source="dev/staging/evaluations", 
   format="parquet", 
   partitioning="hive",
   exclude_invalid_files=False,
   filesystem=fs
  )
{code}
 
This doesn't seem to respect the filesystem connected to Azure Blob.
{code:python}
---
FileNotFoundError Traceback (most recent call last)
 in 
> 1 ds.dataset(source="dev/staging/evaluations", 
  2format="parquet",
  3partitioning="hive",
  4exclude_invalid_files=False,
  5filesystem=fs

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, 
schema, format, filesystem, partitioning, partition_base_dir, 
exclude_invalid_files, ignore_prefixes)
669 # TODO(kszucs): support InMemoryDataset for a table input
670 if _is_path_like(source):
--> 671 return _filesystem_dataset(source, **kwargs)
672 elif isinstance(source, (tuple, list)):
673 if all(_is_path_like(elem) for elem in source):

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
_filesystem_dataset(source, schema, filesystem, partitioning, format, 
partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
426 fs, paths_or_selector = _ensure_multiple_sources(source, 
filesystem)
427 else:
--> 428 fs, paths_or_selector = _ensure_single_source(source, 
filesystem)
429 
430 options = FileSystemFactoryOptions(

/opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in 
_ensure_single_source(path, filesystem)
402 paths_or_selector = [path]
403 else:
--> 404 raise FileNotFoundError(path)
405 
406 return filesystem, paths_or_selector

FileNotFoundError: dev/staging/evaluations
{code}

This *does* work though when I list the blobs before passing them to ds.dataset:

{code:python}
blobs = wasb.list_blobs(container_name="dev", prefix="sta

[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob

2020-11-08 Thread Lance Dacey (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228257#comment-17228257
 ] 

Lance Dacey commented on ARROW-10517:
-

+ [~mdurant] and [~jorisvandenbossche]

You guys helped me with a similar issue before. There seems to be some 
incompatibility with fsspec and the new pyarrow.dataset feature. If I upgrade 
adlfs and the azure blob SDK, then it looks like fs.find() is returning a 
list instead of a dictionary like pyarrow expects. If I downgrade adlfs to use 
SDK v2.1, then I get the correct dictionary that pyarrow expects, but there 
does not seem to be a method for mkdir (which is required). Is there a way for 
me to get this to work? I tried tweaking the installed versions of fsspec, 
adlfs, and azure-storage-blob but I could not find a combination that worked.
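
For reference, this is roughly the shape pyarrow expects back from 
find(..., detail=True), a {path: info} mapping, together with a hypothetical 
shim (the helper name is mine, not part of either library) that rebuilds that 
mapping when only a list of paths comes back:

{code:python}
def find_with_details(fs, base_dir):
    # pyarrow iterates over the result with .items(), so it needs {path: info}
    found = fs.find(base_dir, maxdepth=None, withdirs=True, detail=True)
    if isinstance(found, dict):
        return found
    # newer adlfs returned a plain list of paths; rebuild the mapping per path
    return {path: fs.info(path) for path in found}


selected = find_with_details(fs, "dev/staging/evaluations")
{code}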

> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
> 
>
> Key: ARROW-10517
> URL: https://issues.apache.org/jira/browse/ARROW-10517
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
>Reporter: Lance Dacey
>Priority: Major
>  Labels: azureblob, dataset, dataset-parquet-read, 
> dataset-parquet-write, fsspec
>
>  
>  
> If I downgrade adlfs to 0.2.5 and azure-blob-storage to 2.1, and then upgrade 
> fsspec (0.6.2 has errors with a detail kwarg, so I need to upgrade it):
>  
> {code:java}
> pa.dataset.write_dataset(data=table, 
>  base_dir="test/test7", 
>  basename_template=None, 
>  format="parquet",
>  partitioning=DirectoryPartitioning(pa.schema([("year", pa.int64()), 
> ("month", pa.int16()), ("day", pa.int16())])), 
>  schema=table.schema,
>  filesystem=blob_fs){code}
>  
> {code:java}
> 226 def create_dir(self, path, recursive):  
> 227 # mkdir also raises FileNotFoundError when base directory is not found 
> --> 228 self.fs.mkdir(path, create_parents=recursive){code}
>  
> It does not look like there is a mkdir option. However, the output of 
> fs.find() returns a dictionary as expected:
> {code:java}
> selected_files = blob_fs.find(
>  "test/test6", maxdepth=None, withdirs=True, detail=True
> ){code}
>  
> Now if I install the latest version of adlfs it upgrades my blob SDK to 12.5 
> (unfortunately, I cannot use this in production since Airflow requires 2.1, 
> so this is only for testing purposes):
> {code:java}
> Successfully installed adlfs-0.5.5 azure-storage-blob-12.5.0{code}
>  
> Now fs.find() returns a list, but I am able to use fs.mkdir().
> {code:java}
> ['test/test6/year=2020',
>  'test/test6/year=2020/month=11',
>  'test/test6/year=2020/month=11/day=1',
>  
> 'test/test6/year=2020/month=11/day=1/8ee6c66320ca47908c37f112f0cffd6c.parquet',
>  
> 'test/test6/year=2020/month=11/day=1/ef753f016efc44b7b0f0800c35d084fc.parquet',]{code}
>  
> This causes issues later when I try to read a dataset (the code is expecting 
> a dictionary still):
> {code:java}
> dataset = ds.dataset("test/test5", filesystem=blob_fs, format="parquet"){code}
> {code:java}
> --> 
> 221 for path, info in selected_files.items():  
> 222 infos.append(self._create_file_info(path, info))  
> 223 AttributeError: 'list' object has no attribute 'items'{code}
>  
> I am still able to read individual files:
> {code:java}
> dataset = ds.dataset("test/test4/year=2020/month=11/2020-11.parquet", 
> filesystem=blob_fs, format="parquet"){code}
>  And I can read the dataset if I pass in a list of blob names "manually":
>  
> {code:java}
> blobs = wasb.list_blobs(container_name="test", prefix="test4")
> dataset = ds.dataset(source=["test/" + blob.name for blob in blobs], 
>  format="parquet", 
>  partitioning="hive",
>  filesystem=blob_fs)
> {code}
>  
> For all of my examples, blob_fs is defined by:
> {code:java}
> blob_fs = fsspec.filesystem(
>  protocol="abfs", account_name=base.login, account_key=base.password
>  ){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

