[jira] [Commented] (ARROW-15474) [Python] Possibility of a table.drop_duplicates() function?
[ https://issues.apache.org/jira/browse/ARROW-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631279#comment-17631279 ] Lance Dacey commented on ARROW-15474: - Nice, I was able to test it out and it seems to give the correct results. I have been using polars and duckdb to handle de-duplication for a while now, so I used those as a comparison.
{code:python}
%%time
table = con.execute("select distinct on (forecast_group) * from scanner order by session_id, date").arrow()

CPU times: user 735 ms, sys: 45.7 ms, total: 780 ms
Wall time: 1.92 s
{code}
Your suggestion:
{code:python}
%%time
table = scanner.to_table()
t1 = table.append_column('i', pa.array(np.arange(len(table))))
t2 = t1.group_by(['forecast_group']).aggregate([('i', 'min')]).column('i_min')
table = pc.take(table, t2)

CPU times: user 872 ms, sys: 60.9 ms, total: 933 ms
Wall time: 4.6 s
{code}
A bit slower than duckdb somehow, but for me it is acceptable and gives me an option to drop duplicates without requiring additional libraries, including pandas. Thanks! > [Python] Possibility of a table.drop_duplicates() function? > --- > > Key: ARROW-15474 > URL: https://issues.apache.org/jira/browse/ARROW-15474 > Project: Apache Arrow > Issue Type: Wish > Components: Python >Affects Versions: 6.0.1 >Reporter: Lance Dacey >Priority: Major > > I noticed that there is a group_by() and sort_by() function in the 7.0.0 > branch. Is it possible to include a drop_duplicates() function as well? > ||id||updated_at|| > |1|2022-01-01 04:23:57| > |2|2022-01-01 07:19:21| > |2|2022-01-10 22:14:01| > Something like this which would return a table without the second row in the > example above would be great. > I usually am reading an append-only dataset and then I need to report on > latest version of each row. To drop duplicates, I am temporarily converting > the append-only table to a pandas DataFrame, and then I convert it back to a > table and save a separate "latest-version" dataset. 
> {code:python} > table.sort_by(sorting=[("id", "ascending"), ("updated_at", > "ascending")]).drop_duplicates(subset=["id"], keep="last") > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15716) [Dataset][Python] Parse a list of fragment paths to gather filters
[ https://issues.apache.org/jira/browse/ARROW-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631108#comment-17631108 ] Lance Dacey commented on ARROW-15716: - Yes, ultimate goal is to create a single expression which would filter all unique partitions that had data written into them. I added unique partitions there because it is possible for multiple file fragments to be written to the same partition (max_rows during write) - I never tested what happens if you run an expression that has duplicates though. Any idea if that would matter? For example, the filter expression for both of these fragments would be the same: 'path/to/data/section=a/part-0.parquet', 'path/to/data/section=a/part-1.parquet', The example [~westonpace] provided would work great. > [Dataset][Python] Parse a list of fragment paths to gather filters > -- > > Key: ARROW-15716 > URL: https://issues.apache.org/jira/browse/ARROW-15716 > Project: Apache Arrow > Issue Type: Wish > Components: Python >Affects Versions: 7.0.0 >Reporter: Lance Dacey >Assignee: Vibhatha Lakmal Abeykoon >Priority: Minor > > Is it possible for partitioning.parse() to be updated to parse a list of > paths instead of just a single path? > I am passing the .paths from file_visitor to downstream tasks to process data > which was recently saved, but I can run into problems with this if I > overwrite data with delete_matching in order to consolidate small files since > the paths won't exist. 
> Here is the output of my current approach to use filters instead of reading > the paths directly: > {code:python} > # Fragments saved during write_dataset > ['dev/dataset/fragments/date_id=20210813/data-0.parquet', > 'dev/dataset/fragments/date_id=20210114/data-2.parquet', > 'dev/dataset/fragments/date_id=20210114/data-1.parquet', > 'dev/dataset/fragments/date_id=20210114/data-0.parquet'] > # Run partitioning.parse() on each fragment > [<pyarrow.dataset.Expression (date_id == 20210813)>, > <pyarrow.dataset.Expression (date_id == 20210114)>, > <pyarrow.dataset.Expression (date_id == 20210114)>, > <pyarrow.dataset.Expression (date_id == 20210114)>] > # Format those expressions into a list of tuples > [('date_id', 'in', [20210114, 20210813])] > # Convert to an expression which is used as a filter in .to_table() > is_in(date_id, {value_set=int64:[ > 20210114, > 20210813 > ], skip_nulls=false}) > {code} > My hope would be to do something like filt_exp = partitioning.parse(paths) > which would return a dataset expression. -- This message was sent by Atlassian Jira (v8.20.10#820010)
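For illustration, the paths-to-tuples step in the quoted description can be sketched with the stdlib alone (the helper name paths_to_filters is made up, and this only handles hive-style key=value segments, treating integer-looking values as ints):

```python
from collections import defaultdict

def paths_to_filters(paths):
    """Collapse hive-style `key=value` path segments into DNF filter tuples."""
    values = defaultdict(set)
    for path in paths:
        for segment in path.split("/"):
            if "=" in segment:
                key, _, value = segment.partition("=")
                values[key].add(int(value) if value.isdigit() else value)
    return [(key, "in", sorted(vals)) for key, vals in sorted(values.items())]

paths = [
    "dev/dataset/fragments/date_id=20210813/data-0.parquet",
    "dev/dataset/fragments/date_id=20210114/data-2.parquet",
    "dev/dataset/fragments/date_id=20210114/data-1.parquet",
    "dev/dataset/fragments/date_id=20210114/data-0.parquet",
]
filters = paths_to_filters(paths)  # [('date_id', 'in', [20210114, 20210813])]
```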
[jira] [Commented] (ARROW-15716) [Dataset][Python] Parse a list of fragment paths to gather filters
[ https://issues.apache.org/jira/browse/ARROW-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17630175#comment-17630175 ] Lance Dacey commented on ARROW-15716: - Yes, if I could easily retrieve a list of the unique partitions which were written to, that would be helpful. If I could then parse the list of partitions into a dataset expression (used for table(filter=expression)), that would be even better. Right now I can get a list of the fragments, parse them into expressions, and from there I can determine the partitions using ds._get_partition_keys(). Full example below. I am essentially just looking for a potential shortcut, convenience method, or better approach. Say these are the fragments which were written during dataset write: {code:python}
['path/to/data/month_id=202105/v1-manual__2022-11-06T22:50:20.parquet',
 'path/to/data/month_id=202106/v1-manual__2022-11-06T22:50:20.parquet',
 'path/to/data/month_id=202107/v1-manual__2022-11-06T22:50:20.parquet']
{code} My ultimate goal is for a downstream task to filter the dataset for those three partitions (not just the fragments since other files might exist). 
{code:python}
partitioning = dataset.partitioning

# Parse each fragment path to get a list of expressions
expressions = [partitioning.parse(file) for file in paths]

# Get the partitions
filters = [ds._get_partition_keys(expression) for expression in expressions]
[{'month_id': 202105}, {'month_id': 202106}, {'month_id': 202107}]

# Convert to an expression
from pyarrow.parquet import filters_to_expression
filters_to_expression(filters)
{code} > [Dataset][Python] Parse a list of fragment paths to gather filters > -- > > Key: ARROW-15716 > URL: https://issues.apache.org/jira/browse/ARROW-15716 > Project: Apache Arrow > Issue Type: Wish > Components: Python >Affects Versions: 7.0.0 >Reporter: Lance Dacey >Priority: Minor > > Is it possible for partitioning.parse() to be updated to parse a list of > paths instead of just a single path? > I am passing the .paths from file_visitor to downstream tasks to process data > which was recently saved, but I can run into problems with this if I > overwrite data with delete_matching in order to consolidate small files since > the paths won't exist. > Here is the output of my current approach to use filters instead of reading > the paths directly: > {code:python} > # Fragments saved during write_dataset > ['dev/dataset/fragments/date_id=20210813/data-0.parquet', > 'dev/dataset/fragments/date_id=20210114/data-2.parquet', > 'dev/dataset/fragments/date_id=20210114/data-1.parquet', > 'dev/dataset/fragments/date_id=20210114/data-0.parquet'] > # Run partitioning.parse() on each fragment > [<pyarrow.dataset.Expression (date_id == 20210813)>, > <pyarrow.dataset.Expression (date_id == 20210114)>, > <pyarrow.dataset.Expression (date_id == 20210114)>, > <pyarrow.dataset.Expression (date_id == 20210114)>] > # Format those expressions into a list of tuples > [('date_id', 'in', [20210114, 20210813])] > # Convert to an expression which is used as a filter in .to_table() > is_in(date_id, {value_set=int64:[ > 20210114, > 20210813 > ], skip_nulls=false}) > {code} > My hope would be to do something like filt_exp = partitioning.parse(paths) > which would return a dataset expression. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-15716) [Dataset][Python] Parse a list of fragment paths to gather filters
[ https://issues.apache.org/jira/browse/ARROW-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17630046#comment-17630046 ] Lance Dacey commented on ARROW-15716: - I wanted to check if this is something which might be possible eventually. It would reduce a lot of ugly custom code that I use to achieve the result that I am looking for. Write the dataset and collect the fragment paths: {code:python}
collector = []
ds.write_dataset(
    table,
    base_dir="dev/staging",
    partitioning=["date"],
    partitioning_flavor="hive",
    file_visitor=lambda x: collector.append(x)
)
{code} Next, my hope would be to parse those paths into a consolidated filter expression which I could use to query the original dataset. This ensures that I read in the entire partition, since it is possible that other files already existed before the write step above. {code:python}
paths = [file.path for file in collector]

partitioning = ds.partitioning(flavor="hive")
filter_expression = partitioning.parse(paths)  # parse a list of paths, ideally using the "hive" shortcut

dataset = ds.dataset(source="dev/staging", partitioning=partitioning)
new_table = dataset.to_table(filter=filter_expression)
ds.write_dataset(new_table, base_dir="dev/final", existing_data_behavior="delete_matching")
{code} > [Dataset][Python] Parse a list of fragment paths to gather filters > -- > > Key: ARROW-15716 > URL: https://issues.apache.org/jira/browse/ARROW-15716 > Project: Apache Arrow > Issue Type: Wish > Components: Python >Affects Versions: 7.0.0 >Reporter: Lance Dacey >Priority: Minor > > Is it possible for partitioning.parse() to be updated to parse a list of > paths instead of just a single path? > I am passing the .paths from file_visitor to downstream tasks to process data > which was recently saved, but I can run into problems with this if I > overwrite data with delete_matching in order to consolidate small files since > the paths won't exist. 
> Here is the output of my current approach to use filters instead of reading > the paths directly: > {code:python} > # Fragments saved during write_dataset > ['dev/dataset/fragments/date_id=20210813/data-0.parquet', > 'dev/dataset/fragments/date_id=20210114/data-2.parquet', > 'dev/dataset/fragments/date_id=20210114/data-1.parquet', > 'dev/dataset/fragments/date_id=20210114/data-0.parquet'] > # Run partitioning.parse() on each fragment > [<pyarrow.dataset.Expression (date_id == 20210813)>, > <pyarrow.dataset.Expression (date_id == 20210114)>, > <pyarrow.dataset.Expression (date_id == 20210114)>, > <pyarrow.dataset.Expression (date_id == 20210114)>] > # Format those expressions into a list of tuples > [('date_id', 'in', [20210114, 20210813])] > # Convert to an expression which is used as a filter in .to_table() > is_in(date_id, {value_set=int64:[ > 20210114, > 20210813 > ], skip_nulls=false}) > {code} > My hope would be to do something like filt_exp = partitioning.parse(paths) > which would return a dataset expression. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-15716) [Dataset][Python] Parse a list of fragment paths to gather filters
[ https://issues.apache.org/jira/browse/ARROW-15716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lance Dacey updated ARROW-15716: Description: Is it possible for partitioning.parse() to be updated to parse a list of paths instead of just a single path? I am passing the .paths from file_visitor to downstream tasks to process data which was recently saved, but I can run into problems with this if I overwrite data with delete_matching in order to consolidate small files since the paths won't exist. Here is the output of my current approach to use filters instead of reading the paths directly: {code:python} # Fragments saved during write_dataset ['dev/dataset/fragments/date_id=20210813/data-0.parquet', 'dev/dataset/fragments/date_id=20210114/data-2.parquet', 'dev/dataset/fragments/date_id=20210114/data-1.parquet', 'dev/dataset/fragments/date_id=20210114/data-0.parquet'] # Run partitioning.parse() on each fragment [<pyarrow.dataset.Expression (date_id == 20210813)>, <pyarrow.dataset.Expression (date_id == 20210114)>, <pyarrow.dataset.Expression (date_id == 20210114)>, <pyarrow.dataset.Expression (date_id == 20210114)>] # Format those expressions into a list of tuples [('date_id', 'in', [20210114, 20210813])] # Convert to an expression which is used as a filter in .to_table() is_in(date_id, {value_set=int64:[ 20210114, 20210813 ], skip_nulls=false}) {code} My hope would be to do something like filt_exp = partitioning.parse(paths) which would return a dataset expression. was: Is it possible for partitioning.parse() to be updated to parse a list of paths instead of just a single path? I am passing the .paths from file_visitor to downstream tasks to process data which was recently saved, but I can run into problems with this if I overwrite data with delete_matching in order to consolidate small files since the paths won't exist. 
Here is the output of my current approach to use filters instead of reading the paths directly: {code:java} # Fragments saved during write_dataset ['dev/dataset/fragments/date_id=20210813/data-0.parquet', 'dev/dataset/fragments/date_id=20210114/data-2.parquet', 'dev/dataset/fragments/date_id=20210114/data-1.parquet', 'dev/dataset/fragments/date_id=20210114/data-0.parquet'] # Run partitioning.parse() on each fragment [<pyarrow.dataset.Expression (date_id == 20210813)>, <pyarrow.dataset.Expression (date_id == 20210114)>, <pyarrow.dataset.Expression (date_id == 20210114)>, <pyarrow.dataset.Expression (date_id == 20210114)>] # Format those expressions into a list of tuples [('date_id', 'in', [20210114, 20210813])] # Convert to an expression which is used as a filter in .to_table() is_in(date_id, {value_set=int64:[ 20210114, 20210813 ], skip_nulls=false}) {code} And here is how I am creating the filter from a list of .paths (perhaps there is a better way?): {code:python}
partitioning = ds.HivePartitioning(partition_schema)

expressions = []
for file in paths:
    expressions.append(partitioning.parse(file))

values = []
filters = []
for expression in expressions:
    partitions = ds._get_partition_keys(expression)
    if len(partitions.keys()) > 1:
        element = [(k, "==", v) for k, v in partitions.items()]
        if element not in filters:
            filters.append(element)
    else:
        for k, v in partitions.items():
            if v not in values:
                values.append(v)
        filters = [(k, "in", sorted(values))]

filt_exp = pa.parquet._filters_to_expression(filters)
dataset.to_table(filter=filt_exp)
{code} My hope would be to do something like filt_exp = partitioning.parse(paths) which would return a dataset expression. > [Dataset][Python] Parse a list of fragment paths to gather filters > -- > > Key: ARROW-15716 > URL: https://issues.apache.org/jira/browse/ARROW-15716 > Project: Apache Arrow > Issue Type: Wish > Components: Python >Affects Versions: 7.0.0 >Reporter: Lance Dacey >Priority: Minor > > Is it possible for partitioning.parse() to be updated to parse a list of > paths instead of just a single path? 
> I am passing the .paths from file_visitor to downstream tasks to process data > which was recently saved, but I can run into problems with this if I > overwrite data with delete_matching in order to consolidate small files since > the paths won't exist. > Here is the output of my current approach to use filters instead of reading > the paths directly: > {code:python} > # Fragments saved during write_dataset > ['dev/dataset/fragments/date_id=20210813/data-0.parquet', > 'dev/dataset/fragments/date_id=20210114/data-2.parquet', > 'dev/dataset/fragments/date_id=20210114/data-1.parquet', > 'dev/dataset/fragments/date_id=20210114/data-0.parquet'] > # Run partitioning.parse() on each fragment > [<pyarrow.dataset.Expression (date_id == 20210813)>, > <pyarrow.dataset.Expression (date_id == 20210114)>, > <pyarrow.dataset.Expression (date_id == 20210114)>, > <pyarrow.dataset.Expression (date_id == 20210114)>] > # Format those expressions into a list of tuples > [('date_id', 'in', [20210114, 20210813])] > # Convert to an expression which is used as a filter in .to_table() > is_in(date_id, {value_set=int64:[ > 20210114, > 20210813 > ], skip_nulls=false}) > {code}
[jira] [Commented] (ARROW-15474) [Python] Possibility of a table.drop_duplicates() function?
[ https://issues.apache.org/jira/browse/ARROW-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623362#comment-17623362 ] Lance Dacey commented on ARROW-15474: - Nice - I will give that a shot, thanks. I have been using a library called `polars` to drop duplicates from a pyarrow table lately, but it would be nice to have a native-pyarrow way to do it. Can we sort the data before adding the `cumulative_sum`? My concern is that the order of the raw data might be messed up and we might select the wrong row to keep. > [Python] Possibility of a table.drop_duplicates() function? > --- > > Key: ARROW-15474 > URL: https://issues.apache.org/jira/browse/ARROW-15474 > Project: Apache Arrow > Issue Type: Wish > Components: Python >Affects Versions: 6.0.1 >Reporter: Lance Dacey >Priority: Major > > I noticed that there is a group_by() and sort_by() function in the 7.0.0 > branch. Is it possible to include a drop_duplicates() function as well? > ||id||updated_at|| > |1|2022-01-01 04:23:57| > |2|2022-01-01 07:19:21| > |2|2022-01-10 22:14:01| > Something like this which would return a table without the second row in the > example above would be great. > I usually am reading an append-only dataset and then I need to report on > latest version of each row. To drop duplicates, I am temporarily converting > the append-only table to a pandas DataFrame, and then I convert it back to a > table and save a separate "latest-version" dataset. > {code:python} > table.sort_by(sorting=[("id", "ascending"), ("updated_at", > "ascending")]).drop_duplicates(subset=["id"], keep="last") > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset
[ https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526487#comment-17526487 ] Lance Dacey commented on ARROW-12358: - Nice, thanks. I can try to test with a nightly build this weekend. > [C++][Python][R][Dataset] Control overwriting vs appending when writing to > existing dataset > --- > > Key: ARROW-12358 > URL: https://issues.apache.org/jira/browse/ARROW-12358 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Weston Pace >Priority: Major > Labels: dataset > Fix For: 9.0.0 > > > Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}}) > uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when > you are writing to an existing dataset, you de facto overwrite previous data > when using this default template. > There is some discussion in ARROW-10695 about how the user can avoid this by > ensuring the file names are unique (the user can specify the > {{basename_template}} to be something unique). There is also ARROW-7706 about > silently doubling data (so _not_ overwriting existing data) with the legacy > {{parquet.write_to_dataset}} implementation. > It could be good to have a "mode" when writing datasets that controls the > different possible behaviours. And erroring when there is pre-existing data > in the target directory is maybe the safest default, because both appending > vs overwriting silently can be surprising behaviour depending on your > expectations. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-15474) [Python] Possibility of a table.drop_duplicates() function?
[ https://issues.apache.org/jira/browse/ARROW-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524991#comment-17524991 ] Lance Dacey commented on ARROW-15474: - I'll keep this open since this is a major wish list item for me. If anyone has some sample functions they have implemented outside of core pyarrow to achieve this then I would be interested in seeing that as well. > [Python] Possibility of a table.drop_duplicates() function? > --- > > Key: ARROW-15474 > URL: https://issues.apache.org/jira/browse/ARROW-15474 > Project: Apache Arrow > Issue Type: Wish > Components: Python >Affects Versions: 6.0.1 >Reporter: Lance Dacey >Priority: Major > Fix For: 9.0.0 > > > I noticed that there is a group_by() and sort_by() function in the 7.0.0 > branch. Is it possible to include a drop_duplicates() function as well? > ||id||updated_at|| > |1|2022-01-01 04:23:57| > |2|2022-01-01 07:19:21| > |2|2022-01-10 22:14:01| > Something like this which would return a table without the second row in the > example above would be great. > I usually am reading an append-only dataset and then I need to report on > latest version of each row. To drop duplicates, I am temporarily converting > the append-only table to a pandas DataFrame, and then I convert it back to a > table and save a separate "latest-version" dataset. > {code:python} > table.sort_by(sorting=[("id", "ascending"), ("updated_at", > "ascending")]).drop_duplicates(subset=["id"], keep="last") > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Comment Edited] (ARROW-16077) [Python] ArrowInvalid error on reading partitioned parquet files with fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path
[ https://issues.apache.org/jira/browse/ARROW-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517478#comment-17517478 ] Lance Dacey edited comment on ARROW-16077 at 4/5/22 2:26 PM: - I am not sure about any public datasets. Locally, I use [azurite|https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azurite?tabs=visual-studio] for testing which can be installed or run as a Docker container. Note that I only use Azure Blob and not Azure Data Lake, so there might be some differences I am not aware of. I use pyarrow ds.dataset() or pq.read_table() with a filesystem to read parquet data from Azure. I did a couple of tests with double slashes in the path. Perhaps I misunderstood what the original issue was, but it looks like I can read the data with pq.read_table and with pandas using fs.open() and storage_options. I pasted my quick tests below. {code:python} import pandas as pd import pyarrow as pa import pyarrow.parquet as pq import pytest from adlfs import AzureBlobFileSystem from pandas.testing import assert_frame_equal URL = "http://127.0.0.1:1"; ACCOUNT_NAME = "devstoreaccount1" KEY = "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==" CONN_STR = f"DefaultEndpointsProtocol=http;AccountName={ACCOUNT_NAME};AccountKey={KEY};BlobEndpoint={URL}/{ACCOUNT_NAME};" @pytest.fixture def example_data(): return { "date_id": [20210114, 20210811], "id": [1, 2], "created_at": [ "2021-01-14 16:45:18", "2021-08-11 15:10:00", ], "updated_at": [ "2021-01-14 16:45:18", "2021-08-11 15:10:00", ], "category": ["cow", "sheep"], "value": [0, 99], } def test_double_slashes(example_data): fs = AzureBlobFileSystem(account_name=ACCOUNT_NAME, connection_string=CONN_STR) fs.mkdir("resource") path = "resource/path/to//parquet/files/part-001.parquet" table = pa.table(example_data) pq.write_table(table, where=path, filesystem=fs) # use pq.read_table() with filesystem new = pq.read_table(source=path, 
filesystem=fs) assert new == table # use adlfs filesystem.open() df = pd.read_parquet(fs.open(path, mode="rb")) dataframe_table = pa.Table.from_pandas(df) assert table == dataframe_table # use abfs path with storage options df2 = pd.read_parquet(f"abfs://{path}", storage_options={"connection_string": CONN_STR}) assert_frame_equal(df, df2) {code} was (Author: ldacey): I am not sure about any public datasets. Locally, I use [azurite|https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azurite?tabs=visual-studio] for testing which can be installed or run as a Docker container. I use pyarrow ds.dataset() or pq.read_table() with a filesystem to read parquet data from Azure. I did a couple of tests with double slashes in the path. Perhaps I misunderstood what the original issue was, but it looks like I can read the data with pq.read_table and with pandas using fs.open() and storage_options. I pasted my quick tests below. {code:python} import pandas as pd import pyarrow as pa import pyarrow.parquet as pq import pytest from adlfs import AzureBlobFileSystem from pandas.testing import assert_frame_equal URL = "http://127.0.0.1:1"; ACCOUNT_NAME = "devstoreaccount1" KEY = "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==" CONN_STR = f"DefaultEndpointsProtocol=http;AccountName={ACCOUNT_NAME};AccountKey={KEY};BlobEndpoint={URL}/{ACCOUNT_NAME};" @pytest.fixture def example_data(): return { "date_id": [20210114, 20210811], "id": [1, 2], "created_at": [ "2021-01-14 16:45:18", "2021-08-11 15:10:00", ], "updated_at": [ "2021-01-14 16:45:18", "2021-08-11 15:10:00", ], "category": ["cow", "sheep"], "value": [0, 99], } def test_double_slashes(example_data): fs = AzureBlobFileSystem(account_name=ACCOUNT_NAME, connection_string=CONN_STR) fs.mkdir("resource") path = "resource/path/to//parquet/files/part-001.parquet" table = pa.table(example_data) pq.write_table(table, where=path, filesystem=fs) # use pq.read_table() with filesystem new = 
pq.read_table(source=path, filesystem=fs) assert new == table # use adlfs filesystem.open() df = pd.read_parquet(fs.open(path, mode="rb")) dataframe_table = pa.Table.from_pandas(df) assert table == dataframe_table # use abfs path with storage options df2 = pd.read_parquet(f"abfs://{path}", storage_options={"connection_string": CONN_STR}) assert_frame_equal(df, df2) {code} > [Python] ArrowInvalid error on reading partitioned parquet files with > fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in th
[jira] [Commented] (ARROW-16077) [Python] ArrowInvalid error on reading partitioned parquet files with fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path
[ https://issues.apache.org/jira/browse/ARROW-16077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517478#comment-17517478 ] Lance Dacey commented on ARROW-16077: - I am not sure about any public datasets. Locally, I use [azurite|https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azurite?tabs=visual-studio] for testing which can be installed or run as a Docker container. I use pyarrow ds.dataset() or pq.read_table() with a filesystem to read parquet data from Azure. I did a couple of tests with double slashes in the path. Perhaps I misunderstood what the original issue was, but it looks like I can read the data with pq.read_table and with pandas using fs.open() and storage_options. I pasted my quick tests below. {code:python} import pandas as pd import pyarrow as pa import pyarrow.parquet as pq import pytest from adlfs import AzureBlobFileSystem from pandas.testing import assert_frame_equal URL = "http://127.0.0.1:1"; ACCOUNT_NAME = "devstoreaccount1" KEY = "Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==" CONN_STR = f"DefaultEndpointsProtocol=http;AccountName={ACCOUNT_NAME};AccountKey={KEY};BlobEndpoint={URL}/{ACCOUNT_NAME};" @pytest.fixture def example_data(): return { "date_id": [20210114, 20210811], "id": [1, 2], "created_at": [ "2021-01-14 16:45:18", "2021-08-11 15:10:00", ], "updated_at": [ "2021-01-14 16:45:18", "2021-08-11 15:10:00", ], "category": ["cow", "sheep"], "value": [0, 99], } def test_double_slashes(example_data): fs = AzureBlobFileSystem(account_name=ACCOUNT_NAME, connection_string=CONN_STR) fs.mkdir("resource") path = "resource/path/to//parquet/files/part-001.parquet" table = pa.table(example_data) pq.write_table(table, where=path, filesystem=fs) # use pq.read_table() with filesystem new = pq.read_table(source=path, filesystem=fs) assert new == table # use adlfs filesystem.open() df = pd.read_parquet(fs.open(path, mode="rb")) dataframe_table = 
pa.Table.from_pandas(df) assert table == dataframe_table # use abfs path with storage options df2 = pd.read_parquet(f"abfs://{path}", storage_options={"connection_string": CONN_STR}) assert_frame_equal(df, df2) {code} > [Python] ArrowInvalid error on reading partitioned parquet files with > fsspec.adlfs (pyarrow-7.0.0) due to removed '/' in the ls of path > --- > > Key: ARROW-16077 > URL: https://issues.apache.org/jira/browse/ARROW-16077 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 7.0.0 >Reporter: Jon Rosenberg >Priority: Major > > Reading a partitioned parquet from adlfs with pyarrow through pandas will > throw unnecessary exceptions on not matching forward slashes in the listed > files returned from adlfs, ie: > > {code:python} > import pandas as pd > pd.read_parquet("adl://resource/path/to/parquet/files"){code} > results in exception of the form > {code:bash} > pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path > 'path/to/parquet/files/part-0001.parquet', which is outside base dir > '/path/to/parquet/files/'{code} > > and testing with modifying the adlfs method to prepend slashes to all > returned files, we still end up with an error on file paths that would > otherwise be handled correctly where there is a double slash in a location > where there should be one, ie: > > {code:python} > import pandas as pd > pd.read_parquet("adl://resource/path/to//parquet/files") {code} > would throw > {code:bash} > pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path > '/path/to/parquet/files/part-0001.parquet', which is outside base dir > '/path/to//parquet/files/' {code} > In both cases the ls has returned correctly from adlfs, given it's discovered > the file part-0001.parquet but the pyarrow exception stops what could > otherwise be successful processing. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset
[ https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17517431#comment-17517431 ] Lance Dacey commented on ARROW-12358: - Is this on the radar to be fixed for the next release? > [C++][Python][R][Dataset] Control overwriting vs appending when writing to > existing dataset > --- > > Key: ARROW-12358 > URL: https://issues.apache.org/jira/browse/ARROW-12358 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Weston Pace >Priority: Major > Labels: dataset > Fix For: 8.0.0 > > > Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}}) > uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when > you are writing to an existing dataset, you de facto overwrite previous data > when using this default template. > There is some discussion in ARROW-10695 about how the user can avoid this by > ensuring the file names are unique (the user can specify the > {{basename_template}} to be something unique). There is also ARROW-7706 about > silently doubling data (so _not_ overwriting existing data) with the legacy > {{parquet.write_to_dataset}} implementation. > It could be good to have a "mode" when writing datasets that controls the > different possible behaviours. And erroring when there is pre-existing data > in the target directory is maybe the safest default, because both appending > vs overwriting silently can be surprising behaviour depending on your > expectations. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset
[ https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17501328#comment-17501328 ] Lance Dacey commented on ARROW-12358: - Is this issue sufficient to track this? In the meantime, is there a more efficient way to create the partitions instead of using "overwrite_or_ignore" and then "delete_matching" if the first attempt failed? > [C++][Python][R][Dataset] Control overwriting vs appending when writing to > existing dataset > --- > > Key: ARROW-12358 > URL: https://issues.apache.org/jira/browse/ARROW-12358 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Weston Pace >Priority: Major > Labels: dataset > Fix For: 8.0.0 > > > Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}}) > uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when > you are writing to an existing dataset, you de facto overwrite previous data > when using this default template. > There is some discussion in ARROW-10695 about how the user can avoid this by > ensuring the file names are unique (the user can specify the > {{basename_template}} to be something unique). There is also ARROW-7706 about > silently doubling data (so _not_ overwriting existing data) with the legacy > {{parquet.write_to_dataset}} implementation. > It could be good to have a "mode" when writing datasets that controls the > different possible behaviours. And erroring when there is pre-existing data > in the target directory is maybe the safest default, because both appending > vs overwriting silently can be surprising behaviour depending on your > expectations. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (ARROW-12365) [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()
[ https://issues.apache.org/jira/browse/ARROW-12365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lance Dacey closed ARROW-12365. --- Fix Version/s: 6.0.0 Resolution: Resolved The delete_matching option solves this issue. > [Python] [Dataset] Add partition_filename_cb to ds.write_dataset() > -- > > Key: ARROW-12365 > URL: https://issues.apache.org/jira/browse/ARROW-12365 > Project: Apache Arrow > Issue Type: Wish > Components: Python >Affects Versions: 3.0.0 > Environment: Ubuntu 18.04 >Reporter: Lance Dacey >Priority: Major > Labels: dataset, parquet, python > Fix For: 6.0.0 > > > I need to use the legacy pq.write_to_dataset() in order to guarantee that a > file within a partition will have a specific name. > My use case is that I need to report on the final version of data and our > visualization tool connects directly to our parquet files on Azure Blob > (Power BI). > 1) Download data every hour based on an updated_at timestamp (this data is > partitioned by date) > 2) Transform the data which was just downloaded and save it into a "staging" > dataset (this has all versions of the data and there will be many files > within each partition. In this case, up to 24 files within a single date > partition since we download hourly) > 3) Filter the transformed data and read a subset of columns, sort it by the > updated_at timestamp and drop duplicates on the unique constraint, then > partition and save it with partition_filename_cb. In the example below, if I > partition by the "date_id" column, then my dataset structure will be > "/date_id=20210413/20210413.parquet" > {code:java} > use_legacy_dataset=True, partition_filename_cb=lambda x: > str(x[-1]) + ".parquet",{code} > Ultimately, I am sure that this final dataset has exactly one file per > partition and that I only have the latest version of each row based on the > maximum updated_at timestamp. My visualization tool can safely connect to and > incrementally refresh from this dataset. 
> > > An alternative solution would be to allow us to overwrite anything in an > existing partition. I do not care about the file names so much as I want to > ensure that I am fully replacing any data which might already exist in my > partition, and I want to limit the number of physical files. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15716) [Dataset][Python] Parse a list of fragment paths to gather filters
Lance Dacey created ARROW-15716: --- Summary: [Dataset][Python] Parse a list of fragment paths to gather filters Key: ARROW-15716 URL: https://issues.apache.org/jira/browse/ARROW-15716 Project: Apache Arrow Issue Type: Wish Affects Versions: 7.0.0 Reporter: Lance Dacey Is it possible for partitioning.parse() to be updated to parse a list of paths instead of just a single path? I am passing the .paths from file_visitor to downstream tasks to process data which was recently saved, but I can run into problems with this if I overwrite data with delete_matching in order to consolidate small files, since the paths won't exist. Here is the output of my current approach to use filters instead of reading the paths directly: {code:java} # Fragments saved during write_dataset ['dev/dataset/fragments/date_id=20210813/data-0.parquet', 'dev/dataset/fragments/date_id=20210114/data-2.parquet', 'dev/dataset/fragments/date_id=20210114/data-1.parquet', 'dev/dataset/fragments/date_id=20210114/data-0.parquet'] # Run partitioning.parse() on each fragment [<Expression (date_id == 20210813)>, <Expression (date_id == 20210114)>, <Expression (date_id == 20210114)>, <Expression (date_id == 20210114)>] # Format those expressions into a list of tuples [('date_id', 'in', [20210114, 20210813])] # Convert to an expression which is used as a filter in .to_table() is_in(date_id, {value_set=int64:[ 20210114, 20210813 ], skip_nulls=false}) {code} And here is how I am creating the filter from a list of .paths (perhaps there is a better way?): {code:python} partitioning = ds.HivePartitioning(partition_schema) expressions = [] for file in paths: expressions.append(partitioning.parse(file)) values = [] filters = [] for expression in expressions: partitions = ds._get_partition_keys(expression) if len(partitions.keys()) > 1: element = [(k, "==", v) for k, v in partitions.items()] if element not in filters: filters.append(element) else: for k, v in partitions.items(): if v not in values: values.append(v) filters = [(k, "in", sorted(values))] filt_exp = pa.parquet._filters_to_expression(filters) dataset.to_table(filter=filt_exp) {code} My 
hope would be to do something like filt_exp = partitioning.parse(paths) which would return a dataset expression. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset
[ https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17485722#comment-17485722 ] Lance Dacey commented on ARROW-12358: - Is this slated for a fix in 7.0.0? I am writing a dataset using "overwrite_or_ignore" and then again with "delete_matching" if my initial save fails (FileNotFoundError). > [C++][Python][R][Dataset] Control overwriting vs appending when writing to > existing dataset > --- > > Key: ARROW-12358 > URL: https://issues.apache.org/jira/browse/ARROW-12358 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Labels: dataset > Fix For: 8.0.0 > > > Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}}) > uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when > you are writing to an existing dataset, you de facto overwrite previous data > when using this default template. > There is some discussion in ARROW-10695 about how the user can avoid this by > ensuring the file names are unique (the user can specify the > {{basename_template}} to be something unique). There is also ARROW-7706 about > silently doubling data (so _not_ overwriting existing data) with the legacy > {{parquet.write_to_dataset}} implementation. > It could be good to have a "mode" when writing datasets that controls the > different possible behaviours. And erroring when there is pre-existing data > in the target directory is maybe the safest default, because both appending > vs overwriting silently can be surprising behaviour depending on your > expectations. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15474) [Python] Possibility of a table.drop_duplicates() function?
[ https://issues.apache.org/jira/browse/ARROW-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483481#comment-17483481 ] Lance Dacey commented on ARROW-15474: - Ahh, that would be great. Random is a bit risky for my use case since I generally care about the latest version. I found [this repository|https://github.com/TomScheffers/pyarrow_ops/tree/main/pyarrow_ops] which has a method to drop duplicates that I might be able to adopt in the meantime. I would need to digest exactly what is happening below a bit more, but I think there are some compute functions like `pc.sort_indices`, `pc.unique`, etc. that could probably be used to replace some of the numpy code. {code:python} def drop_duplicates(table, on=[], keep='first'): # Gather columns to arr arr = columns_to_array(table, (on if on else table.column_names)) # Groupify dic, counts, sort_idxs, bgn_idxs = groupify_array(arr) # Gather idxs if keep == 'last': idxs = (np.array(bgn_idxs) - 1)[1:].tolist() + [len(sort_idxs) - 1] elif keep == 'first': idxs = bgn_idxs elif keep == 'drop': idxs = [i for i, c in zip(bgn_idxs, counts) if c == 1] return table.take(sort_idxs[idxs]) def groupify_array(arr): # Input: Pyarrow/Numpy array # Output: # - 1. Unique values # - 2. Count per unique # - 3. Sort index # - 4. Begin index per unique dic, counts = np.unique(arr, return_counts=True) sort_idx = np.argsort(arr) return dic, counts, sort_idx, [0] + np.cumsum(counts)[:-1].tolist() def combine_column(table, name): return table.column(name).combine_chunks() f = np.vectorize(hash) def columns_to_array(table, columns): columns = ([columns] if isinstance(columns, str) else list(set(columns))) if len(columns) == 1: return f(combine_column(table, columns[0]).to_numpy(zero_copy_only=False)) else: values = [c.to_numpy() for c in table.select(columns).itercolumns()] return np.array(list(map(hash, zip(*values)))) {code} > [Python] Possibility of a table.drop_duplicates() function? 
> --- > > Key: ARROW-15474 > URL: https://issues.apache.org/jira/browse/ARROW-15474 > Project: Apache Arrow > Issue Type: Wish >Affects Versions: 6.0.1 >Reporter: Lance Dacey >Priority: Major > Fix For: 8.0.0 > > > I noticed that there is a group_by() and sort_by() function in the 7.0.0 > branch. Is it possible to include a drop_duplicates() function as well? > ||id||updated_at|| > |1|2022-01-01 04:23:57| > |2|2022-01-01 07:19:21| > |2|2022-01-10 22:14:01| > Something like this which would return a table without the second row in the > example above would be great. > I usually am reading an append-only dataset and then I need to report on > latest version of each row. To drop duplicates, I am temporarily converting > the append-only table to a pandas DataFrame, and then I convert it back to a > table and save a separate "latest-version" dataset. > {code:python} > table.sort_by(sorting=[("id", "ascending"), ("updated_at", > "ascending")]).drop_duplicates(subset=["id"], keep="last") > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-15474) [Python] Possibility of a table.drop_duplicates() function?
[ https://issues.apache.org/jira/browse/ARROW-15474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483114#comment-17483114 ] Lance Dacey commented on ARROW-15474: - I would personally be okay with only having the first row retained since I could just sort the table before dropping duplicates to get the desired results. Is it possible to get the first or nth values from a table groupby? In pandas, we can do this which I think has the desired behavior even with multiple columns (as long as we sort the data first). If we can get the indices of which rows to keep, then we could use table.take() to return a new table with the latest values. {code:python} df = pd.DataFrame( { "id": [1, 1, 1, 2, 2, 2], "name": ["a", "a", "a", "b", "c", "c"], "updated_at": [ "2021-01-01 00:02:19", "2022-01-04 12:13:10", "2022-01-06 04:10:52", "2022-01-02 17:32:21", "2022-01-06 01:27:14", "2022-01-06 23:09:56", ], } ) df.sort_values(["id", "name", "updated_at"], ascending=[1, 1, 0]).groupby(["id", "name"]).nth(0).reset_index() {code} > [Python] Possibility of a table.drop_duplicates() function? > --- > > Key: ARROW-15474 > URL: https://issues.apache.org/jira/browse/ARROW-15474 > Project: Apache Arrow > Issue Type: Wish >Affects Versions: 6.0.1 >Reporter: Lance Dacey >Priority: Major > Fix For: 8.0.0 > > > I noticed that there is a group_by() and sort_by() function in the 7.0.0 > branch. Is it possible to include a drop_duplicates() function as well? > ||id||updated_at|| > |1|2022-01-01 04:23:57| > |2|2022-01-01 07:19:21| > |2|2022-01-10 22:14:01| > Something like this which would return a table without the second row in the > example above would be great. > I usually am reading an append-only dataset and then I need to report on > latest version of each row. To drop duplicates, I am temporarily converting > the append-only table to a pandas DataFrame, and then I convert it back to a > table and save a separate "latest-version" dataset. 
> {code:python} > table.sort_by(sorting=[("id", "ascending"), ("updated_at", > "ascending")]).drop_duplicates(subset=["id"], keep="last") > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15474) [Python] Possibility of a table.drop_duplicates() function?
Lance Dacey created ARROW-15474: --- Summary: [Python] Possibility of a table.drop_duplicates() function? Key: ARROW-15474 URL: https://issues.apache.org/jira/browse/ARROW-15474 Project: Apache Arrow Issue Type: Wish Affects Versions: 6.0.1 Reporter: Lance Dacey Fix For: 8.0.0 I noticed that there is a group_by() and sort_by() function in the 7.0.0 branch. Is it possible to include a drop_duplicates() function as well? ||id||updated_at|| |1|2022-01-01 04:23:57| |2|2022-01-01 07:19:21| |2|2022-01-10 22:14:01| Something like this which would return a table without the second row in the example above would be great. I usually am reading an append-only dataset and then I need to report on latest version of each row. To drop duplicates, I am temporarily converting the append-only table to a pandas DataFrame, and then I convert it back to a table and save a separate "latest-version" dataset. {code:python} table.sort_by(sorting=[("id", "ascending"), ("updated_at", "ascending")]).drop_duplicates(subset=["id"], keep="last") {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset
[ https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476120#comment-17476120 ] Lance Dacey commented on ARROW-12358: - Ah, so it must be related to the filesystem. I am using adlfs / fsspec to save datasets on Azure Blob: {code:python} import pyarrow as pa import pyarrow.dataset as ds print(type(fs)) tab = pa.Table.from_pydict({ 'part': [0, 0, 1, 1], 'value': [0, 1, 2, 3] }) ds.write_dataset(data=tab, base_dir='/dev/newdataset', partitioning_flavor='hive', partitioning=['part'], existing_data_behavior='delete_matching', format='parquet', filesystem=fs) {code} Output: {code:python} [2022-01-14 12:45:44,076] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match [2022-01-14 12:45:44,090] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match [2022-01-14 12:45:44,093] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match [2022-01-14 12:45:44,109] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match [2022-01-14 12:45:44,121] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match [2022-01-14 12:45:44,124] {api.py:76} WARNING - Given content is empty, stopping the process very early, returning empty utf_8 str match --- FileNotFoundError Traceback (most recent call last) /tmp/ipykernel_47/3075266795.py in 4 print(type(fs)) 5 tab = pa.Table.from_pydict({ 'part': [0, 0, 1, 1], 'value': [0, 1, 2, 3] }) > 6 ds.write_dataset(data=tab, 7 base_dir='/dev/newdataset', 8 partitioning_flavor='hive', /opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/dataset.py in write_dataset(data, base_dir, basename_template, format, partitioning, partitioning_flavor, schema, filesystem, file_options, use_threads, 
max_partitions, file_visitor, existing_data_behavior) 876 scanner = data 877 --> 878 _filesystemdataset_write( 879 scanner, base_dir, basename_template, filesystem, partitioning, 880 file_options, max_partitions, file_visitor, existing_data_behavior /opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset._filesystemdataset_write() /opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_delete_dir_contents() /opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/fs.py in delete_dir_contents(self, path) 357 raise ValueError( 358 "delete_dir_contents called on path '", path, "'") --> 359 self._delete_dir_contents(path) 360 361 def delete_root_dir_contents(self): /opt/conda/envs/airflow/lib/python3.9/site-packages/pyarrow/fs.py in _delete_dir_contents(self, path) 347 348 def _delete_dir_contents(self, path): --> 349 for subpath in self.fs.listdir(path, detail=False): 350 if self.fs.isdir(subpath): 351 self.fs.rm(subpath, recursive=True) /opt/conda/envs/airflow/lib/python3.9/site-packages/fsspec/spec.py in listdir(self, path, detail, **kwargs) 1221 def listdir(self, path, detail=True, **kwargs): 1222 """Alias of `AbstractFileSystem.ls`.""" -> 1223 return self.ls(path, detail=detail, **kwargs) 1224 1225 def cp(self, path1, path2, **kwargs): /opt/conda/envs/airflow/lib/python3.9/site-packages/adlfs/spec.py in ls(self, path, detail, invalidate_cache, delimiter, return_glob, **kwargs) 753 ): 754 --> 755 files = sync( 756 self.loop, 757 self._ls, /opt/conda/envs/airflow/lib/python3.9/site-packages/fsspec/asyn.py in sync(loop, func, timeout, *args, **kwargs) 69 raise FSTimeoutError from return_result 70 elif isinstance(return_result, BaseException): ---> 71 raise return_result 72 else: 73 return return_result /opt/conda/envs/airflow/lib/python3.9/site-packages/fsspec/asyn.py in _runner(event, coro, result, timeout) 23 coro = asyncio.wait_for(coro, timeout=timeout) 24 try: ---> 25 result[0] = await coro 
26 except Exception as ex: 27 result[0] = ex /opt/conda/envs/airflow/lib/python3.9/site-packages/adlfs/spec.py in _ls(self, path, invalidate_cache, delimiter, return_glob, **kwargs) 875 if not finalblobs: 876 if not await self._exists(target_path): --> 877 {code}
[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset
[ https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17475363#comment-17475363 ] Lance Dacey commented on ARROW-12358: - [~westonpace] Just wanted to check if this issue with "delete_matching" not creating the partition directory is still on the radar. I am currently using "overwrite_or_ignore", and then writing the same table again with "delete_matching" which is a bit redundant. > [C++][Python][R][Dataset] Control overwriting vs appending when writing to > existing dataset > --- > > Key: ARROW-12358 > URL: https://issues.apache.org/jira/browse/ARROW-12358 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Labels: dataset > Fix For: 8.0.0 > > > Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}}) > uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when > you are writing to an existing dataset, you de facto overwrite previous data > when using this default template. > There is some discussion in ARROW-10695 about how the user can avoid this by > ensuring the file names are unique (the user can specify the > {{basename_template}} to be something unique). There is also ARROW-7706 about > silently doubling data (so _not_ overwriting existing data) with the legacy > {{parquet.write_to_dataset}} implementation. > It could be good to have a "mode" when writing datasets that controls the > different possible behaviours. And erroring when there is pre-existing data > in the target directory is maybe the safest default, because both appending > vs overwriting silently can be surprising behaviour depending on your > expectations. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset
[ https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450796#comment-17450796 ] Lance Dacey edited comment on ARROW-12358 at 12/3/21, 3:04 AM: --- I was not able to install 6.0.1 until the latest version of turbodbc supported it. Finally have it up and running and I see that the `existing_data_behavior` argument has been added. Is this the proper way to use the "delete_matching" feature? When I tried to set that as default, there was a FileNotFound error (because the base_dir did not exist at all). EDIT - using the try, except does not really work. I need to save the dataset as "overwrite_or_ignore" first, then save the dataset again as "delete_matching" {code:python} try: ds.write_dataset( data=table, existing_data_behavior="error", ) except pa.lib.ArrowInvalid: ds.write_dataset( data=table, ..., existing_data_behavior="delete_matching", ) {code} I created a dataset using my old method (`use_legacy_dataset` = True with a `partition_filename_cb` to overwrite partitions) and the output matched the new "delete_matching" dataset. I believe I can completely retire the use_legacy_dataset code now. Really amazing, thank you. was (Author: ldacey): I was not able to install 6.0.1 until the latest version of turbodbc supported it. Finally have it up and running and I see that the `existing_data_behavior` argument has been added. Is this the proper way to use the "delete_matching" feature? When I tried to set that as default, there was a FileNotFound error (because the base_dir did not exist at all). {code:python} try: ds.write_dataset( data=table, existing_data_behavior="error", ) except pa.lib.ArrowInvalid: ds.write_dataset( data=table, ..., existing_data_behavior="delete_matching", ) {code} I created a dataset using my old method (`use_legacy_dataset` = True with a `partition_filename_cb` to overwrite partitions) and the output matched the new "delete_matching" dataset. 
I believe I can completely retire the use_legacy_dataset code now. Really amazing, thank you. > [C++][Python][R][Dataset] Control overwriting vs appending when writing to > existing dataset > --- > > Key: ARROW-12358 > URL: https://issues.apache.org/jira/browse/ARROW-12358 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Labels: dataset > Fix For: 7.0.0 > > > Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}}) > uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when > you are writing to an existing dataset, you de facto overwrite previous data > when using this default template. > There is some discussion in ARROW-10695 about how the user can avoid this by > ensuring the file names are unique (the user can specify the > {{basename_template}} to be something unique). There is also ARROW-7706 about > silently doubling data (so _not_ overwriting existing data) with the legacy > {{parquet.write_to_dataset}} implementation. > It could be good to have a "mode" when writing datasets that controls the > different possible behaviours. And erroring when there is pre-existing data > in the target directory is maybe the safest default, because both appending > vs overwriting silently can be surprising behaviour depending on your > expectations. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset
[ https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452649#comment-17452649 ] Lance Dacey commented on ARROW-12358: - Any thoughts on "delete_matching" creating the partition if it does not exist already? > [C++][Python][R][Dataset] Control overwriting vs appending when writing to > existing dataset > --- > > Key: ARROW-12358 > URL: https://issues.apache.org/jira/browse/ARROW-12358 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Labels: dataset > Fix For: 7.0.0 > > > Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}}) > uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when > you are writing to an existing dataset, you de facto overwrite previous data > when using this default template. > There is some discussion in ARROW-10695 about how the user can avoid this by > ensuring the file names are unique (the user can specify the > {{basename_template}} to be something unique). There is also ARROW-7706 about > silently doubling data (so _not_ overwriting existing data) with the legacy > {{parquet.write_to_dataset}} implementation. > It could be good to have a "mode" when writing datasets that controls the > different possible behaviours. And erroring when there is pre-existing data > in the target directory is maybe the safest default, because both appending > vs overwriting silently can be surprising behaviour depending on your > expectations. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14938) Partition column dissappear when reading dataset
[ https://issues.apache.org/jira/browse/ARROW-14938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451813#comment-17451813 ] Lance Dacey commented on ARROW-14938: - Sure - refer to this section: https://arrow.apache.org/docs/python/dataset.html#different-partitioning-schemes "hive" is a shortcut which will infer the data type of the partition column when it gets added back to the table, but you can specify the schema of your partitioned columns too using ds.partitioning(). > Partition column dissappear when reading dataset > > > Key: ARROW-14938 > URL: https://issues.apache.org/jira/browse/ARROW-14938 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 6.0.1 > Environment: Debian bullseye, python 3.9 >Reporter: Martin Gran >Priority: Major > > Appending CSV to parquet dataset with partitioning on "code". > {code:python} > table = pa.Table.from_pandas(chunk) > pa.dataset.write_dataset( > table, > output_path, > basename_template=f"chunk_\{y}_\{{i}}", > format="parquet", > partitioning=["code"], > existing_data_behavior="overwrite_or_ignore", > ) > {code} > Loading the dataset again and expecting code to be in the dataframe. 
> {code:python} > import pyarrow.dataset as ds > dataset = ds.dataset("../data/interim/2020_elements_parquet/", > format="parquet",) > df = dataset.to_table().to_pandas() > >>>df["code"] > {code} > Trace > {code:python} > --- > KeyError Traceback (most recent call last) > ~/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py in > get_loc(self, key, method, tolerance) 3360 try: -> 3361 return > self._engine.get_loc(casted_key) 3362 except KeyError as err: > ~/.local/lib/python3.9/site-packages/pandas/_libs/index.pyx in > pandas._libs.index.IndexEngine.get_loc() > ~/.local/lib/python3.9/site-packages/pandas/_libs/index.pyx in > pandas._libs.index.IndexEngine.get_loc() > pandas/_libs/hashtable_class_helper.pxi in > pandas._libs.hashtable.PyObjectHashTable.get_item() > pandas/_libs/hashtable_class_helper.pxi in > pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'code' The > above exception was the direct cause of the following exception: KeyError > Traceback (most recent call last) /tmp/ipykernel_24875/4149106129.py in > > 1 df["code"] > ~/.local/lib/python3.9/site-packages/pandas/core/frame.py in > __getitem__(self, key) 3456 if self.columns.nlevels > 1: 3457 return > self._getitem_multilevel(key) -> 3458 indexer = self.columns.get_loc(key) > 3459 if is_integer(indexer): 3460 indexer = [indexer] > ~/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py in > get_loc(self, key, method, tolerance) 3361 return > self._engine.get_loc(casted_key) 3362 except KeyError as err: -> 3363 raise > KeyError(key) from err 3364 3365 if is_scalar(key) and isna(key) and not > self.hasnans: KeyError: 'code' > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14938) Partition column dissappear when reading dataset
[ https://issues.apache.org/jira/browse/ARROW-14938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17451752#comment-17451752 ] Lance Dacey commented on ARROW-14938: - If you add the partitioning argument to ds.dataset(source, format, partitioning), that should fix it. For example, partitioning="hive", or specify it with a partitioning object: partitioning=ds.partitioning(pa.schema([("code", pa.string())]), flavor="hive"). I used hive in those examples but there is directory partitioning as well. > Partition column dissappear when reading dataset > > > Key: ARROW-14938 > URL: https://issues.apache.org/jira/browse/ARROW-14938 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 6.0.1 > Environment: Debian bullseye, python 3.9 >Reporter: Martin Gran >Priority: Major > > Appending CSV to parquet dataset with partitioning on "code". > {code:python} > table = pa.Table.from_pandas(chunk) > pa.dataset.write_dataset( > table, > output_path, > basename_template=f"chunk_\{y}_\{{i}}", > format="parquet", > partitioning=["code"], > existing_data_behavior="overwrite_or_ignore", > ) > {code} > Loading the dataset again and expecting code to be in the dataframe. 
> {code:python} > import pyarrow.dataset as ds > dataset = ds.dataset("../data/interim/2020_elements_parquet/", > format="parquet",) > df = dataset.to_table().to_pandas() > >>>df["code"] > {code} > Trace > {code:python} > --- > KeyError Traceback (most recent call last) > ~/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py in > get_loc(self, key, method, tolerance) 3360 try: -> 3361 return > self._engine.get_loc(casted_key) 3362 except KeyError as err: > ~/.local/lib/python3.9/site-packages/pandas/_libs/index.pyx in > pandas._libs.index.IndexEngine.get_loc() > ~/.local/lib/python3.9/site-packages/pandas/_libs/index.pyx in > pandas._libs.index.IndexEngine.get_loc() > pandas/_libs/hashtable_class_helper.pxi in > pandas._libs.hashtable.PyObjectHashTable.get_item() > pandas/_libs/hashtable_class_helper.pxi in > pandas._libs.hashtable.PyObjectHashTable.get_item() KeyError: 'code' The > above exception was the direct cause of the following exception: KeyError > Traceback (most recent call last) /tmp/ipykernel_24875/4149106129.py in > > 1 df["code"] > ~/.local/lib/python3.9/site-packages/pandas/core/frame.py in > __getitem__(self, key) 3456 if self.columns.nlevels > 1: 3457 return > self._getitem_multilevel(key) -> 3458 indexer = self.columns.get_loc(key) > 3459 if is_integer(indexer): 3460 indexer = [indexer] > ~/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py in > get_loc(self, key, method, tolerance) 3361 return > self._engine.get_loc(casted_key) 3362 except KeyError as err: -> 3363 raise > KeyError(key) from err 3364 3365 if is_scalar(key) and isna(key) and not > self.hasnans: KeyError: 'code' > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset
[ https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450796#comment-17450796 ] Lance Dacey commented on ARROW-12358: - I was not able to install 6.0.1 until the latest version of turbodbc supported it. Finally have it up and running and I see that the `existing_data_behavior` argument has been added. Is this the proper way to use the "delete_matching" feature? When I tried to set that as default, there was a FileNotFound error (because the base_dir did not exist at all). {code:python} try: ds.write_dataset( data=table, existing_data_behavior="error", ) except pa.lib.ArrowInvalid: ds.write_dataset( data=table, ..., existing_data_behavior="delete_matching", ) {code} I created a dataset using my old method (`use_legacy_dataset` = True with a `partition_filename_cb` to overwrite partitions) and the output matched the new "delete_matching" dataset. I believe I can completely retire the use_legacy_dataset code now. Really amazing, thank you. > [C++][Python][R][Dataset] Control overwriting vs appending when writing to > existing dataset > --- > > Key: ARROW-12358 > URL: https://issues.apache.org/jira/browse/ARROW-12358 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Labels: dataset > Fix For: 7.0.0 > > > Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}}) > uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when > you are writing to an existing dataset, you de facto overwrite previous data > when using this default template. > There is some discussion in ARROW-10695 about how the user can avoid this by > ensuring the file names are unique (the user can specify the > {{basename_template}} to be something unique). There is also ARROW-7706 about > silently doubling data (so _not_ overwriting existing data) with the legacy > {{parquet.write_to_dataset}} implementation. 
> It could be good to have a "mode" when writing datasets that controls the > different possible behaviours. And erroring when there is pre-existing data > in the target directory is maybe the safest default, because both appending > vs overwriting silently can be surprising behaviour depending on your > expectations. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-14608) [Python] Provide access to hash_aggregate functions through a group_by method
[ https://issues.apache.org/jira/browse/ARROW-14608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450770#comment-17450770 ] Lance Dacey commented on ARROW-14608: - If we can do group_by using the pyarrow table, then I should be able to drop_duplicates as well if it is combined with a filter right? Sorting and dropping duplicates is one of the big reasons I still need to convert some pyarrow tables into a pandas DataFrame temporarily. {code:java} df.sort_values(['id', 'updated_at'], ascending=True).drop_duplicates(subset=['id'], keep='last'){code} > [Python] Provide access to hash_aggregate functions through a group_by method > - > > Key: ARROW-14608 > URL: https://issues.apache.org/jira/browse/ARROW-14608 > Project: Apache Arrow > Issue Type: Sub-task > Components: Python >Affects Versions: 6.0.0 >Reporter: Alessandro Molina >Assignee: Alessandro Molina >Priority: Major > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 10h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset
[ https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403772#comment-17403772 ] Lance Dacey commented on ARROW-12358: - kDeleteMatchingPartitions - So this only deletes the individual partitions and not the entire dataset correct? So if I save a dataset made up of hundreds of partitions but only 4 of them are written to, then only those 4 partitions will have their existing files cleared? If so, then yes that should work for me. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12365) [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()
[ https://issues.apache.org/jira/browse/ARROW-12365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403767#comment-17403767 ] Lance Dacey commented on ARROW-12365: - The metadata collector works great, but this issue is more related to https://issues.apache.org/jira/browse/ARROW-12358. I use the partition_filename_cb to guarantee that I overwrite partitions, which I do not think we can control with ds.write_dataset() because the \{i} counter may differ between runs and accidentally write a new file into an existing partition (I need to be sure that there are no duplicates in the data, since our Power BI tool connects directly to the parquet dataset). > [Python] [Dataset] Add partition_filename_cb to ds.write_dataset() > -- > > Key: ARROW-12365 > URL: https://issues.apache.org/jira/browse/ARROW-12365 > Project: Apache Arrow > Issue Type: Wish > Components: Python >Affects Versions: 3.0.0 > Environment: Ubuntu 18.04 >Reporter: Lance Dacey >Priority: Major > Labels: dataset, parquet, python > > I need to use the legacy pq.write_to_dataset() in order to guarantee that a > file within a partition will have a specific name. > My use case is that I need to report on the final version of data and our > visualization tool connects directly to our parquet files on Azure Blob > (Power BI). > 1) Download data every hour based on an updated_at timestamp (this data is > partitioned by date) > 2) Transform the data which was just downloaded and save it into a "staging" > dataset (this has all versions of the data and there will be many files > within each partition. In this case, up to 24 files within a single date > partition since we download hourly) > 3) Filter the transformed data and read a subset of columns, sort it by the > updated_at timestamp and drop duplicates on the unique constraint, then > partition and save it with partition_filename_cb.
In the example below, if I > partition by the "date_id" column, then my dataset structure will be > "/date_id=20210413/20210413.parquet" > {code:java} > use_legacy_dataset=True, partition_filename_cb=lambda x: > str(x[-1]) + ".parquet",{code} > Ultimately, I am sure that this final dataset has exactly one file per > partition and that I only have the latest version of each row based on the > maximum updated_at timestamp. My visualization tool can safely connect to and > incrementally refresh from this dataset. > > > An alternative solution would be to allow us to overwrite anything in an > existing partition. I do not care about the file names so much as I want to > ensure that I am fully replacing any data which might already exist in my > partition, and I want to limit the number of physical files. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset
[ https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398635#comment-17398635 ] Lance Dacey commented on ARROW-12358: - I do not clear my append dataset, but I need to add tasks to consolidate the small files someday. If I download a source every hour, I will have a minimum of 24 files in a single daily partition and some of them might be small. But yes, I am basically describing a materialized view. I cannot rely on an incremental refresh in many cases because I partition data based on the created_at date and not the updated_at date. Here is an example where the data was all updated today, but there were some rows originally created days or even months ago. {code:python} table = pa.table( { "date_id": [20210114, 20210811, 20210812, 20210813],#based on the created_at timestamp "created_at": ["2021-01-14 16:45:18", "2021-08-11 15:10:00", "2021-08-12 11:19:26", "2021-08-13 23:01:47"], "updated_at": ["2021-08-13 00:04:12", "2021-08-13 02:16:23", "2021-08-13 09:55:44", "2021-08-13 22:36:01"], "category": ["cow", "sheep", "dog", "cat"], "value": [0, 99, 17, 238], } ) {code} Partitioning this by date_id would save the following files in my "append" dataset. Note that this has one row which is from January, so I cannot do an incremental refresh from the minimum date because it would be too much data in a real world scenario. {code:python} written_paths = [ "dev/test/date_id=20210812/test-20210813114024-2.parquet", "dev/test/date_id=20210813/test-20210813114024-3.parquet", "dev/test/date_id=20210811/test-20210813114024-1.parquet", "dev/test/date_id=20210114/test-20210813114024-0.parquet", ] {code} During my next task, I create a new dataset from the written_paths above (so a dataset of only the new/changed data). 
Using .get_fragments() and partition expressions, I ultimately generate a filter expression:
{code:python}
fragments = ds.dataset(written_paths, fs).get_fragments()
filters = []
for frag in fragments:
    partitions = ds._get_partition_keys(frag.partition_expression)
    # ... other stuff
    filter = [(k, "==", v) for k, v in partitions.items()]
    if filter not in filters:
        filters.append(filter)
filter_expression = pq._filters_to_expression(filters)
{code}
Finally, I use that filter to query my "append" dataset which has all historical data, so I read all of the data in each relevant partition:
{code:python}
df = ds.dataset(source, fs).to_table(filter=filter_expression).to_pandas()
{code}
Then I sort and drop duplicates in pandas, convert back to a table, and save to my "final" dataset with partition_filename_cb to overwrite whatever was there. This means that if even a single row was updated within a partition, I will read all of the data in that partition and recompute the final version of each row. This also requires me to use the "use_legacy_dataset" flag to support overwriting the existing partitions. I found a custom implementation of drop_duplicates (https://github.com/TomScheffers/pyarrow_ops/blob/main/pyarrow_ops/ops.py) using pyarrow Tables, but I am still just using pandas for now. I keep a close eye on the pyarrow.compute() docs and have been slowly replacing stuff I do with pandas directly in the pyarrow tables, which is great.
Your mentioning the temporary staging area made me realize that I could replace my messy staging append dataset (many small files) with something temporary that I delete each schedule, and then read from it to create a consolidated historical append-only dataset similar to what I am doing in the example above (one file per partition instead of potentially hundreds).
[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset
[ https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398007#comment-17398007 ] Lance Dacey commented on ARROW-12358: - What is the common workflow pattern for folks trying to imitate something similar to a view in a database? In many of my sources I have a dataset which is append only (using UUIDs in the basename template), normally partitioned by date. If this data is downloaded frequently or is generated from multiple sources (for example, several endpoints or servers), then each partition might have many files. Most likely there are also different versions of each row (one ID will have a row for each time it was updated, for example). I then write to a new dataset which is used for reporting and visualization. # Get the list of files which were saved to the append-only dataset during the most recent schedule # Create a dataset from the list of paths which were just saved and use .get_fragments() and ds._get_partition_keys(fragment.partition_expression) to generate a filter expression (this allows me to query for *all* of the data in each relevant partition which was recently modified - so if only a single row was modified in the 2021-08-05 partition, then I still need to read all of the other data in that partition in order to finalize it) # Create a dataframe, sort the data and drop duplicates on a primary key, convert back to a table (it would be nice to be able to do this purely in a pyarrow table so I could leave out pandas!) # Use pq.write_to_dataset() with partition_filename_cb=lambda x: str(x[-1]) + ".parquet" to write to a final dataset. This allows me to overwrite the relevant partitions because the filenames are the same. I can be certain that I only have the latest version of each row. This is my approach to come close to what I would achieve with a view in the database. 
It works fine, but the storage is essentially doubled since I am maintaining two datasets (append-only and final). Our visualization tool connects directly to these parquet files, so there is some benefit in having fewer files (one per partition instead of potentially hundreds) as well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13074) [Python] Start with deprecating ParquetDataset custom attributes
[ https://issues.apache.org/jira/browse/ARROW-13074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376448#comment-17376448 ] Lance Dacey commented on ARROW-13074: - Sure Joris, I posted it and then I read that you said to keep the discussion separate so I tried to be sneaky and delete it before you noticed > [Python] Start with deprecating ParquetDataset custom attributes > > > Key: ARROW-13074 > URL: https://issues.apache.org/jira/browse/ARROW-13074 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 5.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > As a first step for ARROW-9720, we should start with deprecating > attributes/methods of {{pq.ParquetDataset}} that we would definitely not keep > / are conflicting with the "dataset API". > I am thinking of the {{pieces}} attribute (and the {{ParquetDatasetPiece}} > class), the {{partitions}} attribute (and the {{ParquetPartitions}} class). > In addition, some of the keywords are also exposed as properties (memory_map, > read_dictionary, buffer_size, fs), and could be deprecated. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Issue Comment Deleted] (ARROW-13074) [Python] Start with deprecating ParquetDataset custom attributes
[ https://issues.apache.org/jira/browse/ARROW-13074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lance Dacey updated ARROW-13074: Comment: was deleted (was: I have run into a few issues with basename_template: 1) If I run tasks in parallel (for example, Airflow downloads data from various SQL servers and writes to the same partitions), then there is a chance to overwrite existing data (part-0.parquet) 2) If I make the basename_template unique, then I can end up with duplicate data inside of my partitions because I am not overwriting what is already there. The way I have been organizing this so far is to have use two datasets: *Dataset A*: * UUID filenames, so everything is unique. This most likely has duplicate values, and most certainly will have old versions of rows (based on an updated_at timestamp) * This normally has a lot of files per partition since I download data every 30 minutes - 1 hour in many cases *Dataset B:* * Reads from Dataset A, sorts, drop duplicates, and then resave using a partition_filename_cb {code:java} use_legacy_dataset=True, partition_filename_cb=lambda x: str(x[-1]) + ".parquet",{code} * I normally partition by date_id, so each partition is something like {code:java} path/date_id=20210706/20210706.parquet{code} * This allows me to have a single file per partition which has the final version of the each row with no duplicates. Our visualization tool connects to these fragments directly (Power BI in this case) I think that I might be able to use basename_template if I was careful and made sure that I did not write data in parallel, so the part-0.parquet file would be overwritten each time. Or perhaps I could list the files in that partition and delete them before saving new data (risky if another process might be using those files at that time). 
) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13074) [Python] Start with deprecating ParquetDataset custom attributes
[ https://issues.apache.org/jira/browse/ARROW-13074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17375901#comment-17375901 ] Lance Dacey commented on ARROW-13074: - I have run into a few issues with basename_template: 1) If I run tasks in parallel (for example, Airflow downloads data from various SQL servers and writes to the same partitions), then there is a chance to overwrite existing data (part-0.parquet) 2) If I make the basename_template unique, then I can end up with duplicate data inside of my partitions because I am not overwriting what is already there. The way I have been organizing this so far is to use two datasets: *Dataset A*: * UUID filenames, so everything is unique. This most likely has duplicate values, and most certainly will have old versions of rows (based on an updated_at timestamp) * This normally has a lot of files per partition since I download data every 30 minutes - 1 hour in many cases *Dataset B:* * Reads from Dataset A, sorts, drops duplicates, and then resaves using a partition_filename_cb {code:java} use_legacy_dataset=True, partition_filename_cb=lambda x: str(x[-1]) + ".parquet",{code} * I normally partition by date_id, so each partition is something like {code:java} path/date_id=20210706/20210706.parquet{code} * This allows me to have a single file per partition which has the final version of each row with no duplicates. Our visualization tool connects to these fragments directly (Power BI in this case) I think that I might be able to use basename_template if I was careful and made sure that I did not write data in parallel, so the part-0.parquet file would be overwritten each time. Or perhaps I could list the files in that partition and delete them before saving new data (risky if another process might be using those files at that time).
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-13074) [Python] Start with deprecating ParquetDataset custom attributes
[ https://issues.apache.org/jira/browse/ARROW-13074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17375518#comment-17375518 ] Lance Dacey commented on ARROW-13074: - Any idea if this includes the partition_filename_cb function? I am still using that pretty extensively to write my "final" datasets that Power BI connects to for visualization since it allows me to overwrite each partition. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-12364) [Python] [Dataset] Add metadata_collector option to ds.write_dataset()
[ https://issues.apache.org/jira/browse/ARROW-12364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lance Dacey closed ARROW-12364. --- Fix Version/s: 5.0.0 Resolution: Fixed > [Python] [Dataset] Add metadata_collector option to ds.write_dataset() > -- > > Key: ARROW-12364 > URL: https://issues.apache.org/jira/browse/ARROW-12364 > Project: Apache Arrow > Issue Type: Wish > Components: Parquet, Python >Affects Versions: 3.0.0 > Environment: Ubuntu 18.04 >Reporter: Lance Dacey >Priority: Major > Labels: dataset, parquet, python > Fix For: 5.0.0 > > > The legacy pq.write_to_dataset() has an option to save metadata to a list > when writing partitioned data. > {code:python} > collector = [] > pq.write_to_dataset( > table=table, > root_path=output_path, > use_legacy_dataset=True, > metadata_collector=collector, > ) > fragments = [] > for piece in collector: > fragments.append(filesystem.sep.join([output_path, > piece.row_group(0).column(0).file_path])) > {code} > This allows me to save a list of the specific parquet files which were > created when writing the partitions to storage. I use this when scheduling > tasks with Airflow.
> Task A downloads data and partitions it --> Task B reads the file fragments > which were just saved and transforms it --> Task C creates a list of dataset > filters from the file fragments I transformed, reads each filter into a > table and then processes the data further (normally dropping duplicates or > selecting a subset of the columns) and saves it for visualization > {code:java} > fragments = > ['dev/date_id=20180111/transform-split-20210301013200-68.parquet', > 'dev/date_id=20180114/transform-split-20210301013200-69.parquet', > 'dev/date_id=20180128/transform-split-20210301013200-57.parquet', ] > {code} > I can use this list downstream to do two things: > 1) I can read the list of fragments directly as a new dataset and transform > the data > {code:java} > ds.dataset(fragments) > {code} > 2) I can generate filters from the fragment paths which were saved using > ds._get_partition_keys(). This allows me to query the dataset and retrieve > all fragments within the partition. For example, if I partition by date and I > process data every 30 minutes I might have 48 individual file fragments > within a single partition. I need to query the *entire* partition > instead of reading a single fragment. > {code:java} > def consolidate_filters(fragments): > """Retrieves the partition_expressions from a list of dataset fragments > to build a list of unique filters""" > filters = [] > for frag in fragments: > partitions = ds._get_partition_keys(frag.partition_expression) > filter = [(k, "==", v) for k, v in partitions.items()] > if filter not in filters: > filters.append(filter) > return filters > filter_expression = pq._filters_to_expression( > filters=consolidate_filters(fragments=fragments) > ) > {code} > My current problem is that when I use ds.write_dataset(), I do not have a > convenient method for generating a list of the file fragments I just saved.
> My only choice is to use basename_template and fs.glob() to find a list of > the files based on the basename_template pattern. This is much slower and a > waste of listing files on blob storage. [Related stackoverflow question with > the basis of the approach I am using now > |https://stackoverflow.com/questions/66252660/pyarrow-identify-the-fragments-written-or-filters-used-when-writing-a-parquet/66266585#66266585] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12364) [Python] [Dataset] Add metadata_collector option to ds.write_dataset()
[ https://issues.apache.org/jira/browse/ARROW-12364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17367552#comment-17367552 ] Lance Dacey commented on ARROW-12364: - I think this is taken care of by ARROW-10440 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12364) [Python] [Dataset] Add metadata_collector option to ds.write_dataset()
[ https://issues.apache.org/jira/browse/ARROW-12364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17367199#comment-17367199 ] Lance Dacey commented on ARROW-12364: - Hi @jorisvandenbossche, you asked me to create a separate issue for the metadata collector for ds.write_dataset. Just wanted to make sure that you had a chance to take a look. I had to switch back to the legacy dataset writer for most projects. Using fs.glob() can be very slow on very large datasets with many thousands of files, and my workflow often depends on knowing which files were written during a previous Airflow task. > [Python] [Dataset] Add metadata_collector option to ds.write_dataset() > -- > > Key: ARROW-12364 > URL: https://issues.apache.org/jira/browse/ARROW-12364 > Project: Apache Arrow > Issue Type: Wish > Components: Parquet, Python >Affects Versions: 3.0.0 > Environment: Ubuntu 18.04 >Reporter: Lance Dacey >Priority: Major > Labels: dataset, parquet, python > > The legacy pq.write_to_dataset() has an option to save metadata to a list > when writing partitioned data. > {code:python} > collector = [] > pq.write_to_dataset( > table=table, > root_path=output_path, > use_legacy_dataset=True, > metadata_collector=collector, > ) > fragments = [] > for piece in collector: > files.append(filesystem.sep.join([output_path, > piece.row_group(0).column(0).file_path])) > {code} > This allows me to save a list of the specific parquet files which were > created when writing the partitions to storage. I use this when scheduling > tasks with Airflow. 
> Task A downloads data and partitions it --> Task B reads the file fragments > which were just saved and transforms it --> Task C creates a list of dataset > filters from the file fragments I transformed, reads each filter into a > table and then processes the data further (normally dropping duplicates or > selecting a subset of the columns) and saves it for visualization > {code:java} > fragments = > ['dev/date_id=20180111/transform-split-20210301013200-68.parquet', > 'dev/date_id=20180114/transform-split-20210301013200-69.parquet', > 'dev/date_id=20180128/transform-split-20210301013200-57.parquet', ] > {code} > I can use this list downstream to do two things: > 1) I can read the list of fragments directly as a new dataset and transform > the data > {code:java} > ds.dataset(fragments) > {code} > 2) I can generate filters from the fragment paths which were saved using > ds._get_partition_keys(). This allows me to query the dataset and retrieve > all fragments within the partition. For example, if I partition by date and I > process data every 30 minutes I might have 48 individual file fragments > within a single partition. I need to know to query the *entire* partition > instead of reading a single fragment. > {code:java} > def consolidate_filters(fragments): > """Retrieves the partition_expressions from a list of dataset fragments > to build a list of unique filters""" > filters = [] > for frag in fragments: > partitions = ds._get_partition_keys(frag.partition_expression) > filter = [(k, "==", v) for k, v in partitions.items()] > if filter not in filters: > filters.append(filter) > return filters > filter_expression = pq._filters_to_expression( > filters=consolidate_filters(fragments=fragments) > ) > {code} > My current problem is that when I use ds.write_dataset(), I do not have a > convenient method for generating a list of the file fragments I just saved. 
> My only choice is to use basename_template and fs.glob() to find a list of > the files based on the basename_template pattern. This is much slower and a > waste of listing files on blob storage. [Related stackoverflow question with > the basis of the approach I am using now > |https://stackoverflow.com/questions/66252660/pyarrow-identify-the-fragments-written-or-filters-used-when-writing-a-parquet/66266585#66266585] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset
[ https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346110#comment-17346110 ] Lance Dacey commented on ARROW-12358: - Being able to update and replace specific rows would be very powerful. For my use case, I am basically overwriting the entire partition in order to update a (sometimes tiny) subset of rows. That means that I need to read the existing data for that partition which was saved previously, and the new data with updated or new rows. Then I need to sort and drop duplicates (I use pandas because there is no simple .drop_duplicates() for a pyarrow table, but adding a step with pandas can add some complication sometimes with data types), then I need to overwrite the partition (I use the partition_filename_cb to guarantee that the final file for the partition is the same). > [C++][Python][R][Dataset] Control overwriting vs appending when writing to > existing dataset > --- > > Key: ARROW-12358 > URL: https://issues.apache.org/jira/browse/ARROW-12358 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Labels: dataset > Fix For: 5.0.0 > > > Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}} > uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when > you are writing to an existing dataset, you de facto overwrite previous data > when using this default template. > There is some discussion in ARROW-10695 about how the user can avoid this by > ensuring the file names are unique (the user can specify the > {{basename_template}} to be something unique). There is also ARROW-7706 about > silently doubling data (so _not_ overwriting existing data) with the legacy > {{parquet.write_to_dataset}} implementation. > It could be good to have a "mode" when writing datasets that controls the > different possible behaviours. 
And erroring when there is pre-existing data > in the target directory is maybe the safest default, because both appending > vs overwriting silently can be surprising behaviour depending on your > expectations. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-12365) [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()
[ https://issues.apache.org/jira/browse/ARROW-12365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lance Dacey closed ARROW-12365. --- Fix Version/s: 5.0.0 Resolution: Not A Problem > [Python] [Dataset] Add partition_filename_cb to ds.write_dataset() > -- > > Key: ARROW-12365 > URL: https://issues.apache.org/jira/browse/ARROW-12365 > Project: Apache Arrow > Issue Type: Wish > Components: Python >Affects Versions: 3.0.0 > Environment: Ubuntu 18.04 >Reporter: Lance Dacey >Priority: Major > Labels: dataset, parquet, python > Fix For: 5.0.0 > > > I need to use the legacy pq.write_to_dataset() in order to guarantee that a > file within a partition will have a specific name. > My use case is that I need to report on the final version of data and our > visualization tool connects directly to our parquet files on Azure Blob > (Power BI). > 1) Download data every hour based on an updated_at timestamp (this data is > partitioned by date) > 2) Transform the data which was just downloaded and save it into a "staging" > dataset (this has all versions of the data and there will be many files > within each partition. In this case, up to 24 files within a single date > partition since we download hourly) > 3) Filter the transformed data and read a subset of columns, sort it by the > updated_at timestamp and drop duplicates on the unique constraint, then > partition and save it with partition_filename_cb. In the example below, if I > partition by the "date_id" column, then my dataset structure will be > "/date_id=202104123/20210413.parquet" > {code:java} > use_legacy_dataset=True, partition_filename_cb=lambda x: > str(x[-1]) + ".parquet",{code} > Ultimately, I am sure that this final dataset has exactly one file per > partition and that I only have the latest version of each row based on the > maximum updated_at timestamp. My visualization tool can safely connect to and > incrementally refresh from this dataset. 
> > > An alternative solution would be to allow us to overwrite anything in an > existing partition. I do not care about the file names so much as I want to > ensure that I am fully replacing any data which might already exist in my > partition, and I want to limit the number of physical files. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12365) [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()
[ https://issues.apache.org/jira/browse/ARROW-12365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17335292#comment-17335292 ] Lance Dacey commented on ARROW-12365: - @jorisvandenbossche I will close this issue in favor of an overwrite option for partitions since that is the only reason I use the partition_filename_cb https://issues.apache.org/jira/browse/ARROW-12358 > [Python] [Dataset] Add partition_filename_cb to ds.write_dataset() > -- > > Key: ARROW-12365 > URL: https://issues.apache.org/jira/browse/ARROW-12365 > Project: Apache Arrow > Issue Type: Wish > Components: Python >Affects Versions: 3.0.0 > Environment: Ubuntu 18.04 >Reporter: Lance Dacey >Priority: Major > Labels: dataset, parquet, python > > I need to use the legacy pq.write_to_dataset() in order to guarantee that a > file within a partition will have a specific name. > My use case is that I need to report on the final version of data and our > visualization tool connects directly to our parquet files on Azure Blob > (Power BI). > 1) Download data every hour based on an updated_at timestamp (this data is > partitioned by date) > 2) Transform the data which was just downloaded and save it into a "staging" > dataset (this has all versions of the data and there will be many files > within each partition. In this case, up to 24 files within a single date > partition since we download hourly) > 3) Filter the transformed data and read a subset of columns, sort it by the > updated_at timestamp and drop duplicates on the unique constraint, then > partition and save it with partition_filename_cb. 
In the example below, if I > partition by the "date_id" column, then my dataset structure will be > "/date_id=202104123/20210413.parquet" > {code:java} > use_legacy_dataset=True, partition_filename_cb=lambda x: > str(x[-1]) + ".parquet",{code} > Ultimately, I am sure that this final dataset has exactly one file per > partition and that I only have the latest version of each row based on the > maximum updated_at timestamp. My visualization tool can safely connect to and > incrementally refresh from this dataset. > > > An alternative solution would be to allow us to overwrite anything in an > existing partition. I do not care about the file names so much as I want to > ensure that I am fully replacing any data which might already exist in my > partition, and I want to limit the number of physical files. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-11250) [Python] Inconsistent behavior calling ds.dataset()
[ https://issues.apache.org/jira/browse/ARROW-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lance Dacey closed ARROW-11250. --- Fix Version/s: (was: 5.0.0) 3.0.0 Resolution: Fixed This was fixed with a new version of the adlfs library > [Python] Inconsistent behavior calling ds.dataset() > --- > > Key: ARROW-11250 > URL: https://issues.apache.org/jira/browse/ARROW-11250 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0 > Environment: Ubuntu 18.04 > adal 1.2.5 pyh9f0ad1d_0conda-forge > adlfs 0.5.9 pyhd8ed1ab_0conda-forge > apache-airflow1.10.14 pypi_0pypi > azure-common 1.1.24 py_0conda-forge > azure-core1.9.0 pyhd3deb0d_0conda-forge > azure-datalake-store 0.0.51 pyh9f0ad1d_0conda-forge > azure-identity1.5.0 pyhd8ed1ab_0conda-forge > azure-nspkg 3.0.2 py_0conda-forge > azure-storage-blob12.6.0 pyhd3deb0d_0conda-forge > azure-storage-common 2.1.0py37hc8dfbb8_3conda-forge > fsspec0.8.5 pyhd8ed1ab_0conda-forge > jupyterlab_pygments 0.1.2 pyh9f0ad1d_0conda-forge > pandas1.2.0py37ha9443f7_0 > pyarrow 2.0.0 py37h4935f41_6_cpuconda-forge >Reporter: Lance Dacey >Priority: Minor > Labels: azureblob, dataset,, python > Fix For: 3.0.0 > > > In a Jupyter notebook, I have noticed that sometimes I am not able to read a > dataset which certainly exists on Azure Blob. 
> > {code:java} > fs = fsspec.filesystem(protocol="abfs", account_name, account_key) > {code} > > One example of this is reading a dataset in one cell: > > {code:java} > ds.dataset("dev/test-split", partitioning="hive", filesystem=fs){code} > > Then in another cell I try to read the same dataset: > > {code:java} > ds.dataset("dev/test-split", partitioning="hive", filesystem=fs) > --- > FileNotFoundError Traceback (most recent call last) > in > > 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs) > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, > schema, format, filesystem, partitioning, partition_base_dir, > exclude_invalid_files, ignore_prefixes) > 669 # TODO(kszucs): support InMemoryDataset for a table input > 670 if _is_path_like(source): > --> 671 return _filesystem_dataset(source, **kwargs) > 672 elif isinstance(source, (tuple, list)): > 673 if all(_is_path_like(elem) for elem in source): > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in > _filesystem_dataset(source, schema, filesystem, partitioning, format, > partition_base_dir, exclude_invalid_files, selector_ignore_prefixes) > 426 fs, paths_or_selector = _ensure_multiple_sources(source, > filesystem) > 427 else: > --> 428 fs, paths_or_selector = _ensure_single_source(source, > filesystem) > 429 > 430 options = FileSystemFactoryOptions( > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in > _ensure_single_source(path, filesystem) > 402 paths_or_selector = [path] > 403 else: > --> 404 raise FileNotFoundError(path) > 405 > 406 return filesystem, paths_or_selector > FileNotFoundError: dev/test-split > {code} > > If I reset the kernel, it works again. 
It also works if I change the path > slightly, like adding a "/" at the end (so basically it just not work if I > read the same dataset twice): > > {code:java} > ds.dataset("dev/test-split/", partitioning="hive", filesystem=fs) > {code} > > > The other strange behavior I have noticed that that if I read a dataset > inside of my Jupyter notebook, > > {code:java} > %%time > dataset = ds.dataset("dev/test-split", > partitioning=ds.partitioning(pa.schema([("date", pa.date32())]), > flavor="hive"), > filesystem=fs, > exclude_invalid_files=False) > CPU times: user 1.98 s, sys: 0 ns, total: 1.98 s Wall time: 2.58 s{code} > > Now, on the exact same server when I try to run the same code against the > same dataset in Airflow it takes over 3 minutes (comparing the timestamps in > my logs between right before I read the dataset, and immediately after the > dataset is available to filter): > {code:java} > [2021-01-14 03:52:04,011] INFO - Reading dev/test-split > [2021-01-14 03:55:17,360] INFO - Processing dat
[jira] [Closed] (ARROW-9682) [Python] Unable to specify the partition style with pq.write_to_dataset
[ https://issues.apache.org/jira/browse/ARROW-9682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lance Dacey closed ARROW-9682. -- Resolution: Not A Problem This works using ds.write_dataset() > [Python] Unable to specify the partition style with pq.write_to_dataset > --- > > Key: ARROW-9682 > URL: https://issues.apache.org/jira/browse/ARROW-9682 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 1.0.0 > Environment: Ubuntu 18.04 > Python 3.7 >Reporter: Lance Dacey >Priority: Major > Labels: dataset-parquet-write, parquet, parquetWriter > > I am able to import and test DirectoryPartitioning but I am not able to > figure out a way to write a dataset using this feature. It seems like > write_to_dataset defaults to the "hive" style. Is there a way to test this? > {code:java} > from pyarrow.dataset import DirectoryPartitioning > partitioning = DirectoryPartitioning(pa.schema([("year", pa.int16()), > ("month", pa.int8()), ("day", pa.int8())])) > print(partitioning.parse("/2009/11/3")) > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-12358) [C++][Python][R][Dataset] Control overwriting vs appending when writing to existing dataset
[ https://issues.apache.org/jira/browse/ARROW-12358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320221#comment-17320221 ] Lance Dacey commented on ARROW-12358: - I think that having an "overwrite" option would satisfy my need for the partition_filename_cb https://issues.apache.org/jira/browse/ARROW-12365 if we can replace _all_ data inside the partition. This would be great for file compaction as well - we could read a dataset with a lot of tiny file fragments and then overwrite it. Overwriting a specific file is also useful. For example, my basename_template is usually f"\{task-id}-\{schedule-timestamp}-\{file-count}-\{i}.parquet". I am able to clear a task and overwrite a file which already exists. The only flaw here is that we cannot control the \{i} variable, so I guess it is not guaranteed. I could live without this. For "append", is it possible for the counter to be per partition instead (there are potential race conditions if multiple tasks write to the same partition in parallel, and it seems to be a more demanding step for large datasets)? Or could the \{i} variable optionally be a uuid instead of the fragment count? "error" makes sense. > [C++][Python][R][Dataset] Control overwriting vs appending when writing to > existing dataset > --- > > Key: ARROW-12358 > URL: https://issues.apache.org/jira/browse/ARROW-12358 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Labels: dataset > Fix For: 5.0.0 > > > Currently, the dataset writing (eg with {{pyarrow.dataset.write_dataset}}) > uses a fixed filename template ({{"part\{i\}.ext"}}). This means that when > you are writing to an existing dataset, you de facto overwrite previous data > when using this default template. 
> There is some discussion in ARROW-10695 about how the user can avoid this by > ensuring the file names are unique (the user can specify the > {{basename_template}} to be something unique). There is also ARROW-7706 about > silently doubling data (so _not_ overwriting existing data) with the legacy > {{parquet.write_to_dataset}} implementation. > It could be good to have a "mode" when writing datasets that controls the > different possible behaviours. And erroring when there is pre-existing data > in the target directory is maybe the safest default, because both appending > vs overwriting silently can be surprising behaviour depending on your > expectations. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10695) [C++][Dataset] Allow to use a UUID in the basename_template when writing a dataset
[ https://issues.apache.org/jira/browse/ARROW-10695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320192#comment-17320192 ] Lance Dacey commented on ARROW-10695: - [~jorisvandenbossche] partition_filename_cb: https://issues.apache.org/jira/browse/ARROW-12358 metadata_collector: https://issues.apache.org/jira/browse/ARROW-12365 > [C++][Dataset] Allow to use a UUID in the basename_template when writing a > dataset > -- > > Key: ARROW-10695 > URL: https://issues.apache.org/jira/browse/ARROW-10695 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Minor > Labels: dataset, dataset-parquet-write > Fix For: 5.0.0 > > > Currently we allow the user to specify a {{basename_template}}, and this can > include a {{"\{i\}"}} part to replace it with an automatically incremented > integer (so each generated file written to a single partition is unique): > https://github.com/apache/arrow/blob/master/python/pyarrow/dataset.py#L713-L717 > It _might_ be useful to also have the ability to use a UUID, to ensure the > file is unique in general (not only for a single write) and to mimic the > behaviour of the old {{write_to_dataset}} implementation. > For example, we could look for a {{"\{uuid\}"}} in the template string, and > if present replace it for each file with a new UUID. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12365) [Python] [Dataset] Add partition_filename_cb to ds.write_dataset()
Lance Dacey created ARROW-12365: --- Summary: [Python] [Dataset] Add partition_filename_cb to ds.write_dataset() Key: ARROW-12365 URL: https://issues.apache.org/jira/browse/ARROW-12365 Project: Apache Arrow Issue Type: Wish Components: Python Affects Versions: 3.0.0 Environment: Ubuntu 18.04 Reporter: Lance Dacey I need to use the legacy pq.write_to_dataset() in order to guarantee that a file within a partition will have a specific name. My use case is that I need to report on the final version of data and our visualization tool connects directly to our parquet files on Azure Blob (Power BI). 1) Download data every hour based on an updated_at timestamp (this data is partitioned by date) 2) Transform the data which was just downloaded and save it into a "staging" dataset (this has all versions of the data and there will be many files within each partition. In this case, up to 24 files within a single date partition since we download hourly) 3) Filter the transformed data and read a subset of columns, sort it by the updated_at timestamp and drop duplicates on the unique constraint, then partition and save it with partition_filename_cb. In the example below, if I partition by the "date_id" column, then my dataset structure will be "/date_id=202104123/20210413.parquet" {code:java} use_legacy_dataset=True, partition_filename_cb=lambda x: str(x[-1]) + ".parquet",{code} Ultimately, I am sure that this final dataset has exactly one file per partition and that I only have the latest version of each row based on the maximum updated_at timestamp. My visualization tool can safely connect to and incrementally refresh from this dataset. An alternative solution would be to allow us to overwrite anything in an existing partition. I do not care about the file names so much as I want to ensure that I am fully replacing any data which might already exist in my partition, and I want to limit the number of physical files. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-12364) [Python] [Dataset] Add metadata_collector option to ds.write_dataset()
Lance Dacey created ARROW-12364: --- Summary: [Python] [Dataset] Add metadata_collector option to ds.write_dataset() Key: ARROW-12364 URL: https://issues.apache.org/jira/browse/ARROW-12364 Project: Apache Arrow Issue Type: Wish Components: Parquet, Python Affects Versions: 3.0.0 Environment: Ubuntu 18.04 Reporter: Lance Dacey The legacy pq.write_to_dataset() has an option to save metadata to a list when writing partitioned data. {code:python} collector = [] pq.write_to_dataset( table=table, root_path=output_path, use_legacy_dataset=True, metadata_collector=collector, ) fragments = [] for piece in collector: files.append(filesystem.sep.join([output_path, piece.row_group(0).column(0).file_path])) {code} This allows me to save a list of the specific parquet files which were created when writing the partitions to storage. I use this when scheduling tasks with Airflow. Task A downloads data and partitions it --> Task B reads the file fragments which were just saved and transforms it --> Task C creates a list of dataset filters from the file fragments I transformed, reads each filter to into a table and then processes the data further (normally dropping duplicates or selecting a subset of the columns) and saves it for visualization {code:java} fragments = ['dev/date_id=20180111/transform-split-20210301013200-68.parquet', 'dev/date_id=20180114/transform-split-20210301013200-69.parquet', 'dev/date_id=20180128/transform-split-20210301013200-57.parquet', ] {code} I can use this list downstream to do two things: 1) I can read the list of fragments directly as a new dataset and transform the data {code:java} ds.dataset(fragments) {code} 2) I can generate filters from the fragment paths which were saved using ds._get_partition_keys(). This allows me to query the dataset and retrieve all fragments within the partition. For example, if I partition by date and I process data every 30 minutes I might have 48 individual file fragments within a single partition. 
I need to know to query the *entire* partition instead of reading a single fragment. {code:java} def consolidate_filters(fragments): """Retrieves the partition_expressions from a list of dataset fragments to build a list of unique filters""" filters = [] for frag in fragments: partitions = ds._get_partition_keys(frag.partition_expression) filter = [(k, "==", v) for k, v in partitions.items()] if filter not in filters: filters.append(filter) return filters filter_expression = pq._filters_to_expression( filters=consolidate_filters(fragments=fragments) ) {code} My current problem is that when I use ds.write_dataset(), I do not have a convenient method for generating a list of the file fragments I just saved. My only choice is to use basename_template and fs.glob() to find a list of the files based on the basename_template pattern. This is much slower and a waste of listing files on blob storage. [Related stackoverflow question with the basis of the approach I am using now |https://stackoverflow.com/questions/66252660/pyarrow-identify-the-fragments-written-or-filters-used-when-writing-a-parquet/66266585#66266585] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10695) [C++][Dataset] Allow to use a UUID in the basename_template when writing a dataset
[ https://issues.apache.org/jira/browse/ARROW-10695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320167#comment-17320167 ] Lance Dacey commented on ARROW-10695: - I have been creating my own basename_template with either a uuid or a name with the task+timestamp of when the data was processed and it has worked well. I like that approach better than the uuid filename actually. I think the remaining issue with the default part-{i} template is that it can also be a bit inconsistent when writing data in loops. Say I am processing a directory of files one by one in a loop and I partition the data on the "date" column. A lot of the files will just overwrite the part-0.parquet file, but you might also see part-11.parquet or another random filename. I suppose the surprising part is that write_dataset() does not always append new random files nor does it *always* overwrite what is there. This does not impact me now that I customize the basename_template though, but I think an "append" or "replace" flag would make a lot of sense I'll open another issue with my use case for metadata_collector and partition_filename_cb which I am using heavily > [C++][Dataset] Allow to use a UUID in the basename_template when writing a > dataset > -- > > Key: ARROW-10695 > URL: https://issues.apache.org/jira/browse/ARROW-10695 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Minor > Labels: dataset, dataset-parquet-write > Fix For: 5.0.0 > > > Currently we allow the user to specify a {{basename_template}}, and this can > include a {{"\{i\}"}} part to replace it with an automatically incremented > integer (so each generated file written to a single partition is unique): > https://github.com/apache/arrow/blob/master/python/pyarrow/dataset.py#L713-L717 > It _might_ be useful to also have the ability to use a UUID, to ensure the > file is unique in general (not only for a single write) and to 
mimic the > behaviour of the old {{write_to_dataset}} implementation. > For example, we could look for a {{"\{uuid\}"}} in the template string, and > if present replace it for each file with a new UUID. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10695) [C++][Dataset] Allow to use a UUID in the basename_template when writing a dataset
[ https://issues.apache.org/jira/browse/ARROW-10695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17306996#comment-17306996 ] Lance Dacey commented on ARROW-10695: - Sorry, did not see a notification for this. Hm - I am not sure how to provide a minimal example easily. The issue is when multiple machines are writing to the same dataset at the same time into the same partition. For example, machine A downloads data from server 1 and saves it to the dataset at the same time as machine B downloading data and saving data from server 2. My workaround for now was to ensure that the basename_template is a unique value. Initially, I was using a UUID filename as the basename_template, but I need to be able to use fs.glob() to get a list of all of the fragments which were just written to process them in downstream tasks. Unfortunately, there is no metadata_collector for ds.write_dataset() yet. > [C++][Dataset] Allow to use a UUID in the basename_template when writing a > dataset > -- > > Key: ARROW-10695 > URL: https://issues.apache.org/jira/browse/ARROW-10695 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Minor > Labels: dataset, dataset-parquet-write > Fix For: 5.0.0 > > > Currently we allow the user to specify a {{basename_template}}, and this can > include a {{"\{i\}"}} part to replace it with an automatically incremented > integer (so each generated file written to a single partition is unique): > https://github.com/apache/arrow/blob/master/python/pyarrow/dataset.py#L713-L717 > It _might_ be useful to also have the ability to use a UUID, to ensure the > file is unique in general (not only for a single write) and to mimic the > behaviour of the old {{write_to_dataset}} implementation. > For example, we could look for a {{"\{uuid\}"}} in the template string, and > if present replace it for each file with a new UUID. 
[jira] [Commented] (ARROW-10440) [C++][Dataset][Python] Add a callback to visit file writers just before Finish()
[ https://issues.apache.org/jira/browse/ARROW-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17299754#comment-17299754 ] Lance Dacey commented on ARROW-10440: - Can someone confirm if this issue would cover my use case or if I should add a separate feature request issue? My goal is to simply be able to retrieve the list of fragment paths which were saved using the ds.write_dataset() function. I believe it does since I am using the metadata_collector argument to gather this information with the legacy dataset, but let me know if this is different. thanks! > [C++][Dataset][Python] Add a callback to visit file writers just before > Finish() > > > Key: ARROW-10440 > URL: https://issues.apache.org/jira/browse/ARROW-10440 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 2.0.0 >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Major > Fix For: 5.0.0 > > > This will fill the role of (for example) {{metadata_collector}} or allow > stats to be embedded in IPC file footer metadata. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition
[ https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lance Dacey closed ARROW-10694. --- Fix Version/s: 3.0.0 Resolution: Fixed https://github.com/dask/adlfs/pull/193 > [Python] ds.write_dataset() generates empty files for each final partition > -- > > Key: ARROW-10694 > URL: https://issues.apache.org/jira/browse/ARROW-10694 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 2.0.0 > Environment: Ubuntu 18.04 > Python 3.8.6 > adlfs master branch >Reporter: Lance Dacey >Priority: Major > Labels: dataset > Fix For: 3.0.0 > > > ds.write_dataset() is generating empty files for the final partition folder > which causes errors when reading the dataset or converting a dataset to a > table. > I believe this may be caused by fs.mkdir(). Without the final slash in the > path, an empty file is created in the "dev" container: > > {code:java} > fs = fsspec.filesystem(protocol='abfs', account_name=base.login, > account_key=base.password) > fs.mkdir("dev/test2") > {code} > > If the final slash is added, a proper folder is created: > {code:java} > fs.mkdir("dev/test2/"){code} > > Here is a full example of what happens with ds.write_dataset: > {code:java} > schema = pa.schema( > [ > ("year", pa.int16()), > ("month", pa.int8()), > ("day", pa.int8()), > ("report_date", pa.date32()), > ("employee_id", pa.string()), > ("designation", pa.dictionary(index_type=pa.int16(), > value_type=pa.string())), > ] > ) > part = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", > pa.int8()), ("day", pa.int8())])) > ds.write_dataset(data=table, > base_dir="dev/test-dataset", > basename_template="test-{i}.parquet", > format="parquet", > partitioning=part, > schema=schema, > filesystem=fs) > dataset.files > #sample printed below, note the empty files > [ > 'dev/test-dataset/2018/1/1/test-0.parquet', > 'dev/test-dataset/2018/10/1', > 'dev/test-dataset/2018/10/1/test-27.parquet', > 
'dev/test-dataset/2018/3/1', > 'dev/test-dataset/2018/3/1/test-6.parquet', > 'dev/test-dataset/2020/1/1', > 'dev/test-dataset/2020/1/1/test-2.parquet', > 'dev/test-dataset/2020/10/1', > 'dev/test-dataset/2020/10/1/test-29.parquet', > 'dev/test-dataset/2020/11/1', > 'dev/test-dataset/2020/11/1/test-32.parquet', > 'dev/test-dataset/2020/2/1', > 'dev/test-dataset/2020/2/1/test-5.parquet', > 'dev/test-dataset/2020/7/1', > 'dev/test-dataset/2020/7/1/test-20.parquet', > 'dev/test-dataset/2020/8/1', > 'dev/test-dataset/2020/8/1/test-23.parquet', > 'dev/test-dataset/2020/9/1', > 'dev/test-dataset/2020/9/1/test-26.parquet' > ]{code} > As you can see, there is an empty file for each "day" partition. I was not > even able to read the dataset at all until I manually deleted the first empty > file in the dataset (2018/1/1). > I then get an error when I try to use the to_table() method: > {code:java} > OSError Traceback (most recent call last) > in > > 1 > dataset.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx > in > pyarrow._dataset.Dataset.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx > in > pyarrow._dataset.Scanner.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi > in > pyarrow.lib.pyarrow_internal_check_status()/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi > in pyarrow.lib.check_status()OSError: Could not open parquet input source > 'dev/test-dataset/2018/10/1': Invalid: Parquet file size is 0 bytes > {code} > If I manually delete the empty file, I can then use the to_table() function: > {code:java} > dataset.to_table(filter=(ds.field("year") == 2020) & (ds.field("month") == > 10)).to_pandas() > {code} > Is this a bug with pyarrow, adlfs, or fsspec? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
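Until the filesystem fix lands, one workaround for the report above is to filter the zero-byte "directory marker" blobs out of the listing and pass only real parquet paths to ds.dataset(). A minimal sketch of just the filtering step, assuming the dict-of-sizes shape that fsspec's fs.find(path, detail=True) roughly provides (the helper name and sample listing are mine, not part of any library):

```python
def non_empty_parquet_paths(listing):
    """Keep only paths that look like real parquet files with content.

    `listing` maps path -> size in bytes, mimicking (an assumption) the
    information available from fsspec's fs.find(path, detail=True).
    """
    return sorted(
        path for path, size in listing.items()
        if size > 0 and path.endswith(".parquet")
    )

# Simulated listing including the empty marker blobs from the report above.
listing = {
    "dev/test-dataset/2018/1/1/test-0.parquet": 1024,
    "dev/test-dataset/2018/10/1": 0,  # zero-byte marker blob
    "dev/test-dataset/2018/10/1/test-27.parquet": 2048,
}
files = non_empty_parquet_paths(listing)
# `files` could then be passed to ds.dataset(files, ...) so the reader
# never touches the zero-byte blobs that trigger the Parquet size error.
```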
[jira] [Commented] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition
[ https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17298828#comment-17298828 ] Lance Dacey commented on ARROW-10694: - This is being worked on in the adlfs library so I will close this. There are working adlfs branches that I have tested, but they have unfortunately also introduced new problems. Hopefully there will be a final solution soon. > [Python] ds.write_dataset() generates empty files for each final partition > -- > > Key: ARROW-10694 > URL: https://issues.apache.org/jira/browse/ARROW-10694 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Python >Affects Versions: 2.0.0 > Environment: Ubuntu 18.04 > Python 3.8.6 > adlfs master branch >Reporter: Lance Dacey >Priority: Major > Labels: dataset > > ds.write_dataset() is generating empty files for the final partition folder > which causes errors when reading the dataset or converting a dataset to a > table. > I believe this may be caused by fs.mkdir(). 
Without the final slash in the > path, an empty file is created in the "dev" container: > > {code:java} > fs = fsspec.filesystem(protocol='abfs', account_name=base.login, > account_key=base.password) > fs.mkdir("dev/test2") > {code} > > If the final slash is added, a proper folder is created: > {code:java} > fs.mkdir("dev/test2/"){code} > > Here is a full example of what happens with ds.write_dataset: > {code:java} > schema = pa.schema( > [ > ("year", pa.int16()), > ("month", pa.int8()), > ("day", pa.int8()), > ("report_date", pa.date32()), > ("employee_id", pa.string()), > ("designation", pa.dictionary(index_type=pa.int16(), > value_type=pa.string())), > ] > ) > part = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", > pa.int8()), ("day", pa.int8())])) > ds.write_dataset(data=table, > base_dir="dev/test-dataset", > basename_template="test-{i}.parquet", > format="parquet", > partitioning=part, > schema=schema, > filesystem=fs) > dataset.files > #sample printed below, note the empty files > [ > 'dev/test-dataset/2018/1/1/test-0.parquet', > 'dev/test-dataset/2018/10/1', > 'dev/test-dataset/2018/10/1/test-27.parquet', > 'dev/test-dataset/2018/3/1', > 'dev/test-dataset/2018/3/1/test-6.parquet', > 'dev/test-dataset/2020/1/1', > 'dev/test-dataset/2020/1/1/test-2.parquet', > 'dev/test-dataset/2020/10/1', > 'dev/test-dataset/2020/10/1/test-29.parquet', > 'dev/test-dataset/2020/11/1', > 'dev/test-dataset/2020/11/1/test-32.parquet', > 'dev/test-dataset/2020/2/1', > 'dev/test-dataset/2020/2/1/test-5.parquet', > 'dev/test-dataset/2020/7/1', > 'dev/test-dataset/2020/7/1/test-20.parquet', > 'dev/test-dataset/2020/8/1', > 'dev/test-dataset/2020/8/1/test-23.parquet', > 'dev/test-dataset/2020/9/1', > 'dev/test-dataset/2020/9/1/test-26.parquet' > ]{code} > As you can see, there is an empty file for each "day" partition. I was not > even able to read the dataset at all until I manually deleted the first empty > file in the dataset (2018/1/1). 
> I then get an error when I try to use the to_table() method: > {code:java} > OSError Traceback (most recent call last) > in > > 1 > dataset.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx > in > pyarrow._dataset.Dataset.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx > in > pyarrow._dataset.Scanner.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi > in > pyarrow.lib.pyarrow_internal_check_status()/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi > in pyarrow.lib.check_status()OSError: Could not open parquet input source > 'dev/test-dataset/2018/10/1': Invalid: Parquet file size is 0 bytes > {code} > If I manually delete the empty file, I can then use the to_table() function: > {code:java} > dataset.to_table(filter=(ds.field("year") == 2020) & (ds.field("month") == > 10)).to_pandas() > {code} > Is this a bug with pyarrow, adlfs, or fsspec? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
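Since the report traces the empty files to fs.mkdir() treating a slash-less path as a blob, a defensive workaround is to normalize directory paths before handing them to the filesystem. A sketch of that normalization; the helper name is mine, and the behaviour it encodes is taken from the fs.mkdir("dev/test2") vs fs.mkdir("dev/test2/") observation above:

```python
def as_directory_path(path: str) -> str:
    """Ensure a path ends with '/' so an object store that uses zero-byte
    blobs as directory markers treats it as a folder rather than a file.

    Encodes the observation above: fs.mkdir("dev/test2") created an empty
    blob, while fs.mkdir("dev/test2/") created a proper folder.
    """
    return path if path.endswith("/") else path + "/"

# fs.mkdir(as_directory_path(some_path)) would then always receive the
# folder-creating spelling, whichever form the caller passed in.
print(as_directory_path("dev/test2"))   # dev/test2/
print(as_directory_path("dev/test2/"))  # dev/test2/
```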
[jira] [Commented] (ARROW-10440) [C++][Dataset][Python] Add a callback to visit file writers just before Finish()
[ https://issues.apache.org/jira/browse/ARROW-10440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294650#comment-17294650 ] Lance Dacey commented on ARROW-10440: - Will this change allow us to get a list of the blob paths which were saved as file fragments? I am currently using fs.glob() to find a list of files which were just saved using a specific basename_template as a work around. {code:java} pattern = filesystem.sep.join([output_path, f"**{base_template}-*"]) files = filesystem.glob( pattern, details=False, invalidate_cache=True, ) {code} However, with the legacy write_to_dataset(), I am able to use the metadata_collector and then create a list of the file paths like this, which is more convenient (I do not have to worry about generating unique/predictable basename templates). {code:java} files = [] for piece in collector: files.append(filesystem.sep.join([output_path, piece.row_group(0).column(0).file_path])) {code} I require the lists of blobs to pass along to other Airflow tasks to either read as a dataset, or I generate a list of filters from the paths. > [C++][Dataset][Python] Add a callback to visit file writers just before > Finish() > > > Key: ARROW-10440 > URL: https://issues.apache.org/jira/browse/ARROW-10440 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 2.0.0 >Reporter: Ben Kietzman >Assignee: Ben Kietzman >Priority: Major > Fix For: 4.0.0 > > > This will fill the role of (for example) {{metadata_collector}} or allow > stats to be embedded in IPC file footer metadata. -- This message was sent by Atlassian Jira (v8.3.4#803005)
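The glob workaround above hinges on the basename_template being unique per write. The matching step can also be done against an already-fetched listing with stdlib fnmatch instead of a second fs.glob() round trip; a sketch with invented paths and a simplified single-`*` pattern (the original comment uses `**` because dataset files sit in partition subdirectories):

```python
import fnmatch

def files_for_template(all_paths, output_path, base_template, sep="/"):
    """Select the files produced by one write, identified by its unique
    basename_template prefix -- mirrors the fs.glob() workaround above,
    but filters a listing that was already retrieved."""
    pattern = sep.join([output_path, "*" + base_template + "-*"])
    # fnmatch's '*' matches across '/' too, so partition subdirectories
    # between output_path and the file name do not break the match.
    return sorted(p for p in all_paths if fnmatch.fnmatch(p, pattern))

paths = [
    "dev/out/run-abc-0.parquet",
    "dev/out/run-abc-1.parquet",
    "dev/out/run-xyz-0.parquet",
]
print(files_for_template(paths, "dev/out", "run-abc"))
```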
[jira] [Commented] (ARROW-10695) [C++][Dataset] Allow to use a UUID in the basename_template when writing a dataset
[ https://issues.apache.org/jira/browse/ARROW-10695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286036#comment-17286036 ] Lance Dacey commented on ARROW-10695: - Perhaps this has changed, but I was running into issues when writing to a dataset in parallel. For example, I use Airflow to extract data from 6 different servers in parallel (separate tasks are used to download data from each source, "extract_cms_1", "extract_cms_2") using turbodbc, which fetches the data into pyarrow tables --> this data is written to Azure Blob using ds.write_dataset(). I noticed that the part-{i} names were clashing when this happened: part-0 would be overwritten a few times, for example, and it seemed random or hinted at a race condition. I have another Airflow DAG which is downloading from 74 different REST APIs as well (the downloads can happen simultaneously but the source and credentials used are different per account). Adding the guid() to the filenames solved that issue for me. Is there a separate issue open for the partition_filename_cb to be added to ds.write_dataset()? I have been using that feature to "repartition" Dataset A with many small files into Dataset B with one file per partition (larger physical files, fewer fragments). 
> [C++][Dataset] Allow to use a UUID in the basename_template when writing a > dataset > -- > > Key: ARROW-10695 > URL: https://issues.apache.org/jira/browse/ARROW-10695 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Minor > Labels: dataset, dataset-parquet-write > Fix For: 4.0.0 > > > Currently we allow the user to specify a {{basename_template}}, and this can > include a {{"\{i\}"}} part to replace it with an automatically incremented > integer (so each generated file written to a single partition is unique): > https://github.com/apache/arrow/blob/master/python/pyarrow/dataset.py#L713-L717 > It _might_ be useful to also have the ability to use a UUID, to ensure the > file is unique in general (not only for a single write) and to mimic the > behaviour of the old {{write_to_dataset}} implementation. > For example, we could look for a {{"\{uuid\}"}} in the template string, and > if present replace it for each file with a new UUID. -- This message was sent by Atlassian Jira (v8.3.4#803005)
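Pending a built-in `{uuid}` placeholder, the per-write uniqueness described above can be approximated today by baking a UUID into the basename_template string itself, keeping the literal `{i}` for pyarrow to substitute. A sketch; the helper name is mine, not part of pyarrow:

```python
import uuid

def unique_basename_template(prefix: str = "part") -> str:
    """Build a basename_template unique to this write, so parallel
    ds.write_dataset() calls cannot clash on identical part-{i} names.
    The literal "{i}" placeholder is left intact for pyarrow to fill."""
    return f"{prefix}-{uuid.uuid4().hex}-{{i}}.parquet"

t1 = unique_basename_template()
t2 = unique_basename_template()
# Each call yields a distinct template such as 'part-<32 hex chars>-{i}.parquet',
# which could be passed as basename_template= to ds.write_dataset().
```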
[jira] [Created] (ARROW-11453) [Python] [Dataset] Unable to use write_dataset() to Azure Blob with adlfs 0.6.0
Lance Dacey created ARROW-11453: --- Summary: [Python] [Dataset] Unable to use write_dataset() to Azure Blob with adlfs 0.6.0 Key: ARROW-11453 URL: https://issues.apache.org/jira/browse/ARROW-11453 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 3.0.0 Environment: This environment results in an error: adlfs v0.6.0 fsspec 0.8.5 azure.storage.blob 12.6.0 adal 1.2.6 pandas 1.2.1 pyarrow 3.0.0 Reporter: Lance Dacey https://github.com/dask/adlfs/issues/171 I am unable to save data to Azure Blob using ds.write_dataset() with pyarrow 3.0 and adlfs 0.6.0. Reverting to 0.5.9 fixes the issue, but I am not sure what the cause is - posting this here in case there were filesystem changes in pyarrow recently which are incompatible with changes made in adlfs. {code:java} File "pyarrow/_dataset.pyx", line 2343, in pyarrow._dataset._filesystemdataset_write File "pyarrow/_fs.pyx", line 1032, in pyarrow._fs._cb_create_dir File "/opt/conda/lib/python3.8/site-packages/pyarrow/fs.py", line 259, in create_dir self.fs.mkdir(path, create_parents=recursive) File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 121, in wrapper return maybe_sync(func, self, *args, **kwargs) File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 100, in maybe_sync return sync(loop, func, *args, **kwargs) File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 71, in sync raise exc.with_traceback(tb) File "/opt/conda/lib/python3.8/site-packages/fsspec/asyn.py", line 55, in f result[0] = await future File "/opt/conda/lib/python3.8/site-packages/adlfs/spec.py", line 1033, in _mkdir raise FileExistsError( FileExistsError: Cannot overwrite existing Azure container -- dev already exists. {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
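The traceback above shows adlfs 0.6.0 raising FileExistsError from _mkdir when pyarrow's recursive create_dir reaches the already-existing container. The behaviour the writer expects is idempotent directory creation; a local-filesystem sketch of that pattern, with os.makedirs standing in for fs.mkdir (assumption: this mirrors, not reproduces, the Azure code path):

```python
import os
import tempfile

def ensure_dir(path: str) -> None:
    """Create a directory, treating 'already exists' as success --
    the semantics pyarrow's dataset writer expects from create_dir."""
    try:
        os.makedirs(path)
    except FileExistsError:
        pass  # directory (or container) is already there; nothing to do

base = tempfile.mkdtemp()
target = os.path.join(base, "dev", "test-dataset")
ensure_dir(target)
ensure_dir(target)  # second call must not raise, unlike the adlfs 0.6.0 bug
```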
[jira] [Closed] (ARROW-11390) [Python] pyarrow 3.0 issues with turbodbc
[ https://issues.apache.org/jira/browse/ARROW-11390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lance Dacey closed ARROW-11390. --- Fix Version/s: 3.0.0 Resolution: Fixed I reorganized my Dockerfile to ensure that pyarrow 3.0 was installed before turbodbc (there was a base image which was installing 2.0), and I believe that conda-forge was updated for turbodbc as well > [Python] pyarrow 3.0 issues with turbodbc > - > > Key: ARROW-11390 > URL: https://issues.apache.org/jira/browse/ARROW-11390 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 3.0.0 > Environment: pyarrow 3.0.0 > fsspec 0.8.4 > adlfs v0.5.9 > pandas 1.2.1 > numpy 1.19.5 > turbodbc 4.1.1 >Reporter: Lance Dacey >Priority: Major > Labels: python, turbodbc > Fix For: 3.0.0 > > > This is more of a turbodbc issue I think, but perhaps someone here would have > some idea of what changed to cause potential issues. > {code:java} > cursor = connection.cursor() > cursor.execute("select top 10 * from dbo.tickets") > table = cursor.fetchallarrow(){code} > I am able to run table.num_rows and it will print out 10. > If I run table.to_pandas() or table.schema or try to write the table to a > dataset, my kernel dies with no explanation. I reverted back to pyarrow 2.0 > and the same code works again. > [https://github.com/blue-yonder/turbodbc/issues/289] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11390) [Python] pyarrow 3.0 issues with turbodbc
[ https://issues.apache.org/jira/browse/ARROW-11390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273199#comment-17273199 ] Lance Dacey commented on ARROW-11390: - Everything seems to be all set now, thanks! pyarrow 3.0.0 fsspec 0.8.4 adlfs v0.5.9 pandas 1.2.1 numpy 1.19.5 turbodbc 4.1.1 > [Python] pyarrow 3.0 issues with turbodbc > - > > Key: ARROW-11390 > URL: https://issues.apache.org/jira/browse/ARROW-11390 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 3.0.0 > Environment: pyarrow 3.0.0 > fsspec 0.8.4 > adlfs v0.5.9 > pandas 1.2.1 > numpy 1.19.5 > turbodbc 4.1.1 >Reporter: Lance Dacey >Priority: Major > Labels: python, turbodbc > > This is more of a turbodbc issue I think, but perhaps someone here would have > some idea of what changed to cause potential issues. > {code:java} > cursor = connection.cursor() > cursor.execute("select top 10 * from dbo.tickets") > table = cursor.fetchallarrow(){code} > I am able to run table.num_rows and it will print out 10. > If I run table.to_pandas() or table.schema or try to write the table to a > dataset, my kernel dies with no explanation. I reverted back to pyarrow 2.0 > and the same code works again. > [https://github.com/blue-yonder/turbodbc/issues/289] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11390) [Python] pyarrow 3.0 issues with turbodbc
[ https://issues.apache.org/jira/browse/ARROW-11390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272708#comment-17272708 ] Lance Dacey commented on ARROW-11390: - That makes sense. I checked further and the base image I was using is this: https://github.com/jupyter/docker-stacks/blob/master/pyspark-notebook/Dockerfile Which pins pyarrow at 2.0: {code:java} RUN conda install --quiet --yes --satisfied-skip-solve \ 'pyarrow=2.0.*' && \ {code} I'll try again now that 3.0 is on conda-forge > [Python] pyarrow 3.0 issues with turbodbc > - > > Key: ARROW-11390 > URL: https://issues.apache.org/jira/browse/ARROW-11390 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 3.0.0 > Environment: pyarrow 3.0.0 > fsspec 0.8.4 > adlfs v0.5.9 > pandas 1.2.1 > numpy 1.19.5 > turbodbc 4.1.1 >Reporter: Lance Dacey >Priority: Major > Labels: python, turbodbc > > This is more of a turbodbc issue I think, but perhaps someone here would have > some idea of what changed to cause potential issues. > {code:java} > cursor = connection.cursor() > cursor.execute("select top 10 * from dbo.tickets") > table = cursor.fetchallarrow(){code} > I am able to run table.num_rows and it will print out 10. > If I run table.to_pandas() or table.schema or try to write the table to a > dataset, my kernel dies with no explanation. I reverted back to pyarrow 2.0 > and the same code works again. > [https://github.com/blue-yonder/turbodbc/issues/289] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11390) [Python] pyarrow 3.0 issues with turbodbc
[ https://issues.apache.org/jira/browse/ARROW-11390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272271#comment-17272271 ] Lance Dacey commented on ARROW-11390: - Actually, turbodbc would have been installed before pyarrow since version 3.0 was not on conda-forge so I moved it down to the pip section. Do I need to reverse this installation process? {code:java} && /opt/conda/bin/conda install -c conda-forge -yq \ pandas \ numpy \ pyodbc \ pybind11 \ turbodbc \ azure-storage-blob \ azure-storage-common \ xlrd \ openpyxl \ mysql-connector-python \ zeep \ xmltodict \ dask \ dask-labextension \ pymssql=2.1 \ sqlalchemy-redshift \ python-snappy \ seaborn \ python-gitlab \ pyxlsb \ humanfriendly \ jupyterlab \ notebook=6.1.4 \ pip \ && /opt/conda/bin/pip install --no-cache-dir --upgrade pip \ smartsheet-python-sdk \ duo-client \ adlfs \ pyarrow \ "apache-airflow[postgres,redis,celery,crypto,ssh,password]==$AIRFLOW_VERSION" \ {code} I have not been able to get turbodbc to work with pip which is why I am using conda right now. Actually I was just trying to get it to work again using a CFLAGS argument "-D_GLIBCXX_USE_CXX11_ABI=0", but had no luck. I will attempt some more and perhaps raise an issue on the turbodbc project though. Let me know if there is a proper way to install these libraries! (ideally with just plain pip, since my base image is from Airflow which does not use conda by default) > [Python] pyarrow 3.0 issues with turbodbc > - > > Key: ARROW-11390 > URL: https://issues.apache.org/jira/browse/ARROW-11390 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 3.0.0 > Environment: pyarrow 3.0.0 > fsspec 0.8.4 > adlfs v0.5.9 > pandas 1.2.1 > numpy 1.19.5 > turbodbc 4.1.1 >Reporter: Lance Dacey >Priority: Major > Labels: python, turbodbc > > This is more of a turbodbc issue I think, but perhaps someone here would have > some idea of what changed to cause potential issues. 
> {code:java} > cursor = connection.cursor() > cursor.execute("select top 10 * from dbo.tickets") > table = cursor.fetchallarrow(){code} > I am able to run table.num_rows and it will print out 10. > If I run table.to_pandas() or table.schema or try to write the table to a > dataset, my kernel dies with no explanation. I reverted back to pyarrow 2.0 > and the same code works again. > [https://github.com/blue-yonder/turbodbc/issues/289] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (ARROW-11390) [Python] pyarrow 3.0 issues with turbodbc
Lance Dacey created ARROW-11390: --- Summary: [Python] pyarrow 3.0 issues with turbodbc Key: ARROW-11390 URL: https://issues.apache.org/jira/browse/ARROW-11390 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 3.0.0 Environment: pyarrow 3.0.0 fsspec 0.8.4 adlfs v0.5.9 pandas 1.2.1 numpy 1.19.5 turbodbc 4.1.1 Reporter: Lance Dacey This is more of a turbodbc issue I think, but perhaps someone here would have some idea of what changed to cause potential issues. {code:java} cursor = connection.cursor() cursor.execute("select top 10 * from dbo.tickets") table = cursor.fetchallarrow(){code} I am able to run table.num_rows and it will print out 10. If I run table.to_pandas() or table.schema or try to write the table to a dataset, my kernel dies with no explanation. I reverted back to pyarrow 2.0 and the same code works again. [https://github.com/blue-yonder/turbodbc/issues/289] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-11250) [Python] Inconsistent behavior calling ds.dataset()
[ https://issues.apache.org/jira/browse/ARROW-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17266151#comment-17266151 ] Lance Dacey commented on ARROW-11250: - Good idea - I was able to list all of the files and print the info quickly. One interesting thing is that the ds.dataset() call failed right after, though, and the error message is a little different. My input path was "dev/case-history/" with the final slash. This shows that it took 8 seconds to get len(fs.find()), which is about the same amount of time it takes to read ds.dataset() in Jupyter. This error message is different than usual, though, and it mentions something about a dircache: {code:java} [2021-01-15 15:51:47,158] INFO - Reading /dev/case-history/ [2021-01-15 15:51:55,607] INFO - 9682 [2021-01-15 15:51:55,892] INFO - {'name': '/dev/case-history', 'size': 0, 'type': 'directory'} [2021-01-15 15:51:55,893] {taskinstance.py:1150} ERROR - '/dev/case-history/' Traceback (most recent call last): ... 
File "/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py", line 671, in dataset return _filesystem_dataset(source, **kwargs) File "/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py", line 428, in _filesystem_dataset fs, paths_or_selector = _ensure_single_source(source, filesystem) File "/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py", line 395, in _ensure_single_source file_info = filesystem.get_file_info([path])[0] File "pyarrow/_fs.pyx", line 434, in pyarrow._fs.FileSystem.get_file_info File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status File "pyarrow/_fs.pyx", line 1012, in pyarrow._fs._cb_get_file_info_vector File "/opt/conda/lib/python3.7/site-packages/pyarrow/fs.py", line 195, in get_file_info info = self.fs.info(path) File "/opt/conda/lib/python3.7/site-packages/adlfs/spec.py", line 522, in info fetch_from_azure = (path and self._ls_from_cache(path) is None) or refresh File "/opt/conda/lib/python3.7/site-packages/fsspec/spec.py", line 336, in _ls_from_cache return self.dircache[path] File "/opt/conda/lib/python3.7/site-packages/fsspec/dircache.py", line 62, in __getitem__ return self._cache[item] # maybe raises KeyError KeyError: '/dev/case-history/' {code} I edited my DAG and changed the input path to be "dev/case-history" with no final slash and the error was different (note that fs.info() always either removes or adds the final slash to the name of the path): {code:java} [2021-01-15 15:36:25,603] INFO - {'name': '/dev/case-history/', 'size': 0, 'type': 'directory'} [2021-01-15 15:36:25,604] ERROR - /dev/case-history Traceback (most recent call last): File "/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py", line 671, in dataset return _filesystem_dataset(source, **kwargs) File "/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py", line 428, in _filesystem_dataset fs, paths_or_selector = _ensure_single_source(source, filesystem) File 
"/opt/conda/lib/python3.7/site-packages/pyarrow/dataset.py", line 404, in _ensure_single_source raise FileNotFoundError(path) FileNotFoundError: /dev/case-history {code} Without any fs.info() or fs.find() it took 11 minutes to read the same dataset... from 17:45 to 17:56 {code:java} [2021-01-14 17:45:10,470] INFO - Reading /dev/case-history/ [2021-01-14 17:56:58,307] INFO - Processing dataset in batches {code} > [Python] Inconsistent behavior calling ds.dataset() > --- > > Key: ARROW-11250 > URL: https://issues.apache.org/jira/browse/ARROW-11250 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0 > Environment: Ubuntu 18.04 > adal 1.2.5 pyh9f0ad1d_0conda-forge > adlfs 0.5.9 pyhd8ed1ab_0conda-forge > apache-airflow1.10.14 pypi_0pypi > azure-common 1.1.24 py_0conda-forge > azure-core1.9.0 pyhd3deb0d_0conda-forge > azure-datalake-store 0.0.51 pyh9f0ad1d_0conda-forge > azure-identity1.5.0 pyhd8ed1ab_0conda-forge > azure-nspkg 3.0.2 py_0conda-forge > azure-storage-blob12.6.0 pyhd3deb0d_0conda-forge > azure-storage-common 2.1.0py37hc8dfbb8_3conda-forge > fsspec0.8.5 pyhd8ed1ab_0conda-forge > jupyterlab_pygments 0.1.2 pyh9f0ad1d_0conda-forge > pandas1.2.0py37ha9443f7_0 > pyarrow 2.0.0 py37h4935f41_6_cpuconda-forge >Reporter: Lance Dacey >Priority: Minor
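The two failures above differ only in the trailing slash, and the tracebacks show fsspec's dircache being keyed on the exact string (fs.info() returned '/dev/case-history' while the cache was keyed on the slashed form). Until that is fixed upstream, normalizing the path before every call sidesteps both errors. A stdlib sketch, with a plain dict standing in for fsspec's dircache:

```python
def normalize_blob_path(path: str) -> str:
    """Strip leading and trailing '/' so 'dev/case-history',
    '/dev/case-history/' and '/dev/case-history' all map to one key --
    the mismatch behind the KeyError and FileNotFoundError above."""
    return path.strip("/")

# A dict standing in for the fsspec dircache keyed on normalized paths.
dircache = {normalize_blob_path("/dev/case-history/"): ["<listing>"]}

# Every spelling now hits the same entry instead of raising KeyError.
for spelling in ("dev/case-history", "dev/case-history/", "/dev/case-history"):
    assert normalize_blob_path(spelling) in dircache
```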
[jira] [Comment Edited] (ARROW-10247) [C++][Dataset] Cannot write dataset with dictionary column as partition field
[ https://issues.apache.org/jira/browse/ARROW-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265928#comment-17265928 ] Lance Dacey edited comment on ARROW-10247 at 1/15/21, 11:08 AM: Nice - how would you generally go about finding the array of values? Would it be detected from the file paths, or would I need store it externally somewhere (sometimes new categories could be added into the field without me being aware so explicitly listing them in my code might be weird)? was (Author: ldacey): Nice - how would you general go about finding the array of values? Would it be detected from the file paths, or would I need store it externally somewhere (sometimes new categories could be added into the field without me being aware so explicitly listing them in my code might be weird)? > [C++][Dataset] Cannot write dataset with dictionary column as partition field > - > > Key: ARROW-10247 > URL: https://issues.apache.org/jira/browse/ARROW-10247 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Ben Kietzman >Priority: Major > Labels: dataset, pull-request-available > Fix For: 3.0.0 > > Time Spent: 5h 40m > Remaining Estimate: 0h > > When the column to use for partitioning is dictionary encoded, we get this > error: > {code} > In [9]: import pyarrow.dataset as ds > In [10]: part = ["xxx"] * 3 + ["yyy"] * 3 > ...: table = pa.table([ > ...: pa.array(range(len(part))), > ...: pa.array(part).dictionary_encode(), > ...: ], names=['col', 'part']) > In [11]: part = ds.partitioning(table.select(["part"]).schema) > In [12]: ds.write_dataset(table, "test_dataset_dict_part", format="parquet", > partitioning=part) > --- > ArrowTypeErrorTraceback (most recent call last) > in > > 1 ds.write_dataset(table, "test_dataset_dict_part", format="parquet", > partitioning=part) > ~/scipy/repos/arrow/python/pyarrow/dataset.py in write_dataset(data, > base_dir, basename_template, format, partitioning, 
schema, filesystem, > file_options, use_threads) > 773 _filesystemdataset_write( > 774 data, base_dir, basename_template, schema, > --> 775 filesystem, partitioning, file_options, use_threads, > 776 ) > ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in > pyarrow._dataset._filesystemdataset_write() > ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status() > ArrowTypeError: scalar xxx (of type string) is invalid for part: > dictionary > In ../src/arrow/dataset/filter.cc, line 1082, code: > VisitConjunctionMembers(*and_.left_operand(), visitor) > In ../src/arrow/dataset/partition.cc, line 257, code: VisitKeys(expr, > [&](const std::string& name, const std::shared_ptr& value) { auto&& > _error_or_value28 = (FieldRef(name).FindOneOrNone(*schema_)); do { > ::arrow::Status __s = > ::arrow::internal::GenericToStatus((_error_or_value28).status()); do { if > ((__builtin_expect(!!(!__s.ok()), 0))) { ::arrow::Status _st = (__s); > _st.AddContextLine("../src/arrow/dataset/partition.cc", 257, > "(_error_or_value28).status()"); return _st; } } while (0); } while (false); > auto match = std::move(_error_or_value28).ValueUnsafe();;; if (match) { const > auto& field = schema_->field(match[0]); if > (!value->type->Equals(field->type())) { return Status::TypeError("scalar ", > value->ToString(), " (of type ", *value->type, ") is invalid for ", > field->ToString()); } values[match[0]] = value.get(); } return Status::OK(); > }) > In ../src/arrow/dataset/file_base.cc, line 321, code: > (_error_or_value24).status() > In ../src/arrow/dataset/file_base.cc, line 354, code: task_group->Finish() > {code} > While this seems a quit normal use case, as this column will typically be > repeated many times (and we also support reading it as such with dictionary > type, so a roundtrip is currently not possible in that case) > I tagged it for 2.0.0 for a moment in case it's possible today, but I didn't > yet look into how easy it would be to fix. 
[jira] [Commented] (ARROW-10247) [C++][Dataset] Cannot write dataset with dictionary column as partition field
[ https://issues.apache.org/jira/browse/ARROW-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265928#comment-17265928 ] Lance Dacey commented on ARROW-10247: - Nice - how would you general go about finding the array of values? Would it be detected from the file paths, or would I need store it externally somewhere (sometimes new categories could be added into the field without me being aware so explicitly listing them in my code might be weird)? > [C++][Dataset] Cannot write dataset with dictionary column as partition field > - > > Key: ARROW-10247 > URL: https://issues.apache.org/jira/browse/ARROW-10247 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Ben Kietzman >Priority: Major > Labels: dataset, pull-request-available > Fix For: 3.0.0 > > Time Spent: 5h 40m > Remaining Estimate: 0h > > When the column to use for partitioning is dictionary encoded, we get this > error: > {code} > In [9]: import pyarrow.dataset as ds > In [10]: part = ["xxx"] * 3 + ["yyy"] * 3 > ...: table = pa.table([ > ...: pa.array(range(len(part))), > ...: pa.array(part).dictionary_encode(), > ...: ], names=['col', 'part']) > In [11]: part = ds.partitioning(table.select(["part"]).schema) > In [12]: ds.write_dataset(table, "test_dataset_dict_part", format="parquet", > partitioning=part) > --- > ArrowTypeErrorTraceback (most recent call last) > in > > 1 ds.write_dataset(table, "test_dataset_dict_part", format="parquet", > partitioning=part) > ~/scipy/repos/arrow/python/pyarrow/dataset.py in write_dataset(data, > base_dir, basename_template, format, partitioning, schema, filesystem, > file_options, use_threads) > 773 _filesystemdataset_write( > 774 data, base_dir, basename_template, schema, > --> 775 filesystem, partitioning, file_options, use_threads, > 776 ) > ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in > pyarrow._dataset._filesystemdataset_write() > 
~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status() > ArrowTypeError: scalar xxx (of type string) is invalid for part: > dictionary > In ../src/arrow/dataset/filter.cc, line 1082, code: > VisitConjunctionMembers(*and_.left_operand(), visitor) > In ../src/arrow/dataset/partition.cc, line 257, code: VisitKeys(expr, > [&](const std::string& name, const std::shared_ptr& value) { auto&& > _error_or_value28 = (FieldRef(name).FindOneOrNone(*schema_)); do { > ::arrow::Status __s = > ::arrow::internal::GenericToStatus((_error_or_value28).status()); do { if > ((__builtin_expect(!!(!__s.ok()), 0))) { ::arrow::Status _st = (__s); > _st.AddContextLine("../src/arrow/dataset/partition.cc", 257, > "(_error_or_value28).status()"); return _st; } } while (0); } while (false); > auto match = std::move(_error_or_value28).ValueUnsafe();;; if (match) { const > auto& field = schema_->field(match[0]); if > (!value->type->Equals(field->type())) { return Status::TypeError("scalar ", > value->ToString(), " (of type ", *value->type, ") is invalid for ", > field->ToString()); } values[match[0]] = value.get(); } return Status::OK(); > }) > In ../src/arrow/dataset/file_base.cc, line 321, code: > (_error_or_value24).status() > In ../src/arrow/dataset/file_base.cc, line 354, code: task_group->Finish() > {code} > While this seems a quit normal use case, as this column will typically be > repeated many times (and we also support reading it as such with dictionary > type, so a roundtrip is currently not possible in that case) > I tagged it for 2.0.0 for a moment in case it's possible today, but I didn't > yet look into how easy it would be to fix. -- This message was sent by Atlassian Jira (v8.3.4#803005)
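On the question of where the array of partition values would come from: one option that avoids listing categories in code is to derive them from the partition directories already present in the dataset's file paths. A hedged sketch (the helper and the path layout are mine, based on the `test_dataset_dict_part` example above):

```python
def partition_values(paths, level=0, sep="/", base_depth=1):
    """Collect the distinct values of one partition level from a list of
    data-file paths, e.g. level=0 -> the first directory after the base.
    `base_depth` is how many leading components belong to the base_dir."""
    values = set()
    for p in paths:
        parts = p.split(sep)
        # Skip the base_dir components, then take the requested level;
        # the final component is the file itself, so it never counts.
        partition_dirs = parts[base_depth:-1]
        if level < len(partition_dirs):
            values.add(partition_dirs[level])
    return sorted(values)

paths = [
    "test_dataset_dict_part/xxx/part-0.parquet",
    "test_dataset_dict_part/yyy/part-0.parquet",
]
print(partition_values(paths))  # ['xxx', 'yyy']
```

The resulting list could then be fed to whatever explicit-dictionary mechanism the partitioning API ends up taking, without hard-coding category values that may grow over time.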
[jira] [Commented] (ARROW-11250) [Python] Inconsistent behavior calling ds.dataset()
[ https://issues.apache.org/jira/browse/ARROW-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265922#comment-17265922 ] Lance Dacey commented on ARROW-11250: - Do you have any idea at all what could also be causing my Airflow scheduler to take SO long to read the same dataset that I am able to read in under 10 seconds on Jupyter? Could it be an overlay network or something? I have ensured that my tasks calling ds.dataset() are running on the same node that my Jupyterhub is running on. All software between the environments seems to be identical as well (same requirements.txt). 11 minutes on the latest airflow run and 9 seconds if I run it in a notebook.. is there a way to narrow down my troubleshooting scope for this? {code:java} dataset = ds.dataset( source=input_path, format="parquet", partitioning=partitioning, filesystem=fs, ){code} > [Python] Inconsistent behavior calling ds.dataset() > --- > > Key: ARROW-11250 > URL: https://issues.apache.org/jira/browse/ARROW-11250 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0 > Environment: Ubuntu 18.04 > adal 1.2.5 pyh9f0ad1d_0conda-forge > adlfs 0.5.9 pyhd8ed1ab_0conda-forge > apache-airflow1.10.14 pypi_0pypi > azure-common 1.1.24 py_0conda-forge > azure-core1.9.0 pyhd3deb0d_0conda-forge > azure-datalake-store 0.0.51 pyh9f0ad1d_0conda-forge > azure-identity1.5.0 pyhd8ed1ab_0conda-forge > azure-nspkg 3.0.2 py_0conda-forge > azure-storage-blob12.6.0 pyhd3deb0d_0conda-forge > azure-storage-common 2.1.0py37hc8dfbb8_3conda-forge > fsspec0.8.5 pyhd8ed1ab_0conda-forge > jupyterlab_pygments 0.1.2 pyh9f0ad1d_0conda-forge > pandas1.2.0py37ha9443f7_0 > pyarrow 2.0.0 py37h4935f41_6_cpuconda-forge >Reporter: Lance Dacey >Priority: Minor > Labels: azureblob, dataset,, python > Fix For: 4.0.0 > > > In a Jupyter notebook, I have noticed that sometimes I am not able to read a > dataset which certainly exists on Azure Blob. 
> > {code:java} > fs = fsspec.filesystem(protocol="abfs", account_name, account_key) > {code} > > One example of this is reading a dataset in one cell: > > {code:java} > ds.dataset("dev/test-split", partitioning="hive", filesystem=fs){code} > > Then in another cell I try to read the same dataset: > > {code:java} > ds.dataset("dev/test-split", partitioning="hive", filesystem=fs) > --- > FileNotFoundError Traceback (most recent call last) > in > > 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs) > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, > schema, format, filesystem, partitioning, partition_base_dir, > exclude_invalid_files, ignore_prefixes) > 669 # TODO(kszucs): support InMemoryDataset for a table input > 670 if _is_path_like(source): > --> 671 return _filesystem_dataset(source, **kwargs) > 672 elif isinstance(source, (tuple, list)): > 673 if all(_is_path_like(elem) for elem in source): > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in > _filesystem_dataset(source, schema, filesystem, partitioning, format, > partition_base_dir, exclude_invalid_files, selector_ignore_prefixes) > 426 fs, paths_or_selector = _ensure_multiple_sources(source, > filesystem) > 427 else: > --> 428 fs, paths_or_selector = _ensure_single_source(source, > filesystem) > 429 > 430 options = FileSystemFactoryOptions( > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in > _ensure_single_source(path, filesystem) > 402 paths_or_selector = [path] > 403 else: > --> 404 raise FileNotFoundError(path) > 405 > 406 return filesystem, paths_or_selector > FileNotFoundError: dev/test-split > {code} > > If I reset the kernel, it works again. 
It also works if I change the path > slightly, like adding a "/" at the end (so basically it just does not work if I > read the same dataset twice): > > {code:java} > ds.dataset("dev/test-split/", partitioning="hive", filesystem=fs) > {code} > > > The other strange behavior I have noticed is that if I read a dataset > inside of my Jupyter notebook, > > {code:java} > %%time > dataset = ds.dataset("
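One way to narrow the Jupyter-vs-Airflow gap asked about above is to time each stage of discovery separately (the filesystem listing vs. the ds.dataset() call) in both environments. A minimal, library-agnostic timing helper — the label strings and wrapped callables are placeholders:

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn, print its wall time under label, and pass the result through."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result

# Usage sketch against the objects from the comment (hypothetical names):
#   files = timed("fs.find", fs.find, "dev/test-split")
#   dataset = timed("ds.dataset", ds.dataset, input_path, filesystem=fs)
names = timed("sorted", sorted, ["b", "a"])  # stand-in callable for the demo
```

Comparing the per-stage numbers between the notebook and the Airflow worker would show whether the slowdown is in the blob listing or in fragment discovery.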
[jira] [Commented] (ARROW-11250) [Python] Inconsistent behavior calling ds.dataset()
[ https://issues.apache.org/jira/browse/ARROW-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265909#comment-17265909 ] Lance Dacey commented on ARROW-11250: - Sure, I can raise an issue there. {code:java} fs_pa.get_file_info("dev/test-split") {code} I had to tweak the code you provided a bit to get it to run for the FileSelector: {code:java} fs_pa.get_file_info(FileSelector("dev/test-split", recursive=True)) [, , , , , , , , ... ]{code} FYI - if I add an ending slash to the path I get type=Directory instead of NotFound: {code:java} fs_pa.get_file_info("dev/test-split/") {code} > [Python] Inconsistent behavior calling ds.dataset() > --- > > Key: ARROW-11250 > URL: https://issues.apache.org/jira/browse/ARROW-11250 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0 > Environment: Ubuntu 18.04 > adal 1.2.5 pyh9f0ad1d_0conda-forge > adlfs 0.5.9 pyhd8ed1ab_0conda-forge > apache-airflow1.10.14 pypi_0pypi > azure-common 1.1.24 py_0conda-forge > azure-core1.9.0 pyhd3deb0d_0conda-forge > azure-datalake-store 0.0.51 pyh9f0ad1d_0conda-forge > azure-identity1.5.0 pyhd8ed1ab_0conda-forge > azure-nspkg 3.0.2 py_0conda-forge > azure-storage-blob12.6.0 pyhd3deb0d_0conda-forge > azure-storage-common 2.1.0py37hc8dfbb8_3conda-forge > fsspec0.8.5 pyhd8ed1ab_0conda-forge > jupyterlab_pygments 0.1.2 pyh9f0ad1d_0conda-forge > pandas1.2.0py37ha9443f7_0 > pyarrow 2.0.0 py37h4935f41_6_cpuconda-forge >Reporter: Lance Dacey >Priority: Minor > Labels: azureblob, dataset,, python > Fix For: 4.0.0 > > > In a Jupyter notebook, I have noticed that sometimes I am not able to read a > dataset which certainly exists on Azure Blob. 
> > {code:java} > fs = fsspec.filesystem(protocol="abfs", account_name, account_key) > {code} > > One example of this is reading a dataset in one cell: > > {code:java} > ds.dataset("dev/test-split", partitioning="hive", filesystem=fs){code} > > Then in another cell I try to read the same dataset: > > {code:java} > ds.dataset("dev/test-split", partitioning="hive", filesystem=fs) > --- > FileNotFoundError Traceback (most recent call last) > in > > 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs) > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, > schema, format, filesystem, partitioning, partition_base_dir, > exclude_invalid_files, ignore_prefixes) > 669 # TODO(kszucs): support InMemoryDataset for a table input > 670 if _is_path_like(source): > --> 671 return _filesystem_dataset(source, **kwargs) > 672 elif isinstance(source, (tuple, list)): > 673 if all(_is_path_like(elem) for elem in source): > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in > _filesystem_dataset(source, schema, filesystem, partitioning, format, > partition_base_dir, exclude_invalid_files, selector_ignore_prefixes) > 426 fs, paths_or_selector = _ensure_multiple_sources(source, > filesystem) > 427 else: > --> 428 fs, paths_or_selector = _ensure_single_source(source, > filesystem) > 429 > 430 options = FileSystemFactoryOptions( > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in > _ensure_single_source(path, filesystem) > 402 paths_or_selector = [path] > 403 else: > --> 404 raise FileNotFoundError(path) > 405 > 406 return filesystem, paths_or_selector > FileNotFoundError: dev/test-split > {code} > > If I reset the kernel, it works again. 
It also works if I change the path > slightly, like adding a "/" at the end (so basically it just does not work if I > read the same dataset twice): > > {code:java} > ds.dataset("dev/test-split/", partitioning="hive", filesystem=fs) > {code} > > > The other strange behavior I have noticed is that if I read a dataset > inside of my Jupyter notebook, > > {code:java} > %%time > dataset = ds.dataset("dev/test-split", > partitioning=ds.partitioning(pa.schema([("date", pa.date32())]), > flavor="hive"), > filesystem=fs, > exclude_invalid_files=False) > CPU times: user 1.98 s, sys: 0 ns, total: 1.98 s Wall time: 2.58 s{code} > > Now, on the exact same se
[jira] [Comment Edited] (ARROW-11250) [Python] Inconsistent behavior calling ds.dataset()
[ https://issues.apache.org/jira/browse/ARROW-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265869#comment-17265869 ] Lance Dacey edited comment on ARROW-11250 at 1/15/21, 10:19 AM: {code:java} selected_files1 = fs.find("dev/test-split", maxdepth=None, withdirs=True, detail=True) selected_files2 = fs.find("dev/test-split", maxdepth=None, withdirs=True, detail=True) selected_files1 == selected_files2 True{code} I am able to run the above cell over and over again. Now when I use fs.info() without a final slash: {code:java} fs.info("dev/test-split") {'name': 'dev/test-split/', 'size': 0, 'type': 'directory'}{code} If I add a slash to the folder name, the slash is removed in the fs.info() return - will this impact anything? {code:java} fs.info("dev/test-split/") {'name': 'dev/test-split', 'size': 0, 'type': 'directory'} {code} {code:java} selected_files3 = fs.info("dev/test-split") selected_files4 = fs.info("dev/test-split/") selected_files3 == selected_files4 False{code} Edit - running fs.info() on the same path fails if I do it more than once without changing the name by adding a slash, or resetting my kernel. Even if I delete the fs variable and create a new filesystem, it does not work. was (Author: ldacey): {code:java} selected_files1 = fs.find("dev/test-split", maxdepth=None, withdirs=True, detail=True) selected_files2 = fs.find("dev/test-split", maxdepth=None, withdirs=True, detail=True) selected_files1 == selected_files2 True{code} I am able to run the above cell over and over again. Now when I use fs.info() without a final slash: {code:java} fs.info("dev/test-split") {'name': 'dev/test-split/', 'size': 0, 'type': 'directory'}{code} If I add a slash to the folder name, the slash is removed in the fs.info() return - will this impact anything? 
{code:java} fs.info("dev/test-split/") {'name': 'dev/test-split', 'size': 0, 'type': 'directory'} {code} {code:java} selected_files3 = fs.info("dev/test-split") selected_files4 = fs.info("dev/test-split/") selected_files3 == selected_files4 False{code} > [Python] Inconsistent behavior calling ds.dataset() > --- > > Key: ARROW-11250 > URL: https://issues.apache.org/jira/browse/ARROW-11250 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0 > Environment: Ubuntu 18.04 > adal 1.2.5 pyh9f0ad1d_0conda-forge > adlfs 0.5.9 pyhd8ed1ab_0conda-forge > apache-airflow1.10.14 pypi_0pypi > azure-common 1.1.24 py_0conda-forge > azure-core1.9.0 pyhd3deb0d_0conda-forge > azure-datalake-store 0.0.51 pyh9f0ad1d_0conda-forge > azure-identity1.5.0 pyhd8ed1ab_0conda-forge > azure-nspkg 3.0.2 py_0conda-forge > azure-storage-blob12.6.0 pyhd3deb0d_0conda-forge > azure-storage-common 2.1.0py37hc8dfbb8_3conda-forge > fsspec0.8.5 pyhd8ed1ab_0conda-forge > jupyterlab_pygments 0.1.2 pyh9f0ad1d_0conda-forge > pandas1.2.0py37ha9443f7_0 > pyarrow 2.0.0 py37h4935f41_6_cpuconda-forge >Reporter: Lance Dacey >Priority: Minor > Labels: azureblob, dataset,, python > Fix For: 4.0.0 > > > In a Jupyter notebook, I have noticed that sometimes I am not able to read a > dataset which certainly exists on Azure Blob. 
> > {code:java} > fs = fsspec.filesystem(protocol="abfs", account_name, account_key) > {code} > > One example of this is reading a dataset in one cell: > > {code:java} > ds.dataset("dev/test-split", partitioning="hive", filesystem=fs){code} > > Then in another cell I try to read the same dataset: > > {code:java} > ds.dataset("dev/test-split", partitioning="hive", filesystem=fs) > --- > FileNotFoundError Traceback (most recent call last) > in > > 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs) > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, > schema, format, filesystem, partitioning, partition_base_dir, > exclude_invalid_files, ignore_prefixes) > 669 # TODO(kszucs): support InMemoryDataset for a table input > 670 if _is_path_like(source): > --> 671 return _filesystem_dataset(source, **kwargs) > 672 elif isinstance(source, (tuple, list)): > 673 if all(_is_path_like(elem) for elem in source): > /opt/conda/lib
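The slash-dependent fs.info() results described above suggest the filesystem's listing cache is keyed on the raw path string. A defensive workaround (a sketch, not adlfs's actual implementation) is to normalize trailing slashes before every call; with fsspec filesystems, fs.invalidate_cache() can also be tried to clear stale listings:

```python
def normalize_dir(path):
    """Strip trailing slashes so 'dev/test-split/' and 'dev/test-split'
    hit the same cache key."""
    return path.rstrip("/")

print(normalize_dir("dev/test-split/"))  # dev/test-split

# With an fsspec filesystem, a stale directory-listing cache can also be
# cleared explicitly before retrying (behavior depends on the backend):
#   fs.invalidate_cache()
#   fs.info(normalize_dir("dev/test-split"))
```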
[jira] [Commented] (ARROW-11250) [Python] Inconsistent behavior calling ds.dataset()
[ https://issues.apache.org/jira/browse/ARROW-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17265869#comment-17265869 ] Lance Dacey commented on ARROW-11250: - {code:java} selected_files1 = fs.find("dev/test-split", maxdepth=None, withdirs=True, detail=True) selected_files2 = fs.find("dev/test-split", maxdepth=None, withdirs=True, detail=True) selected_files1 == selected_files2 True{code} I am able to run the above cell over and over again. Now when I use fs.info() without a final slash: {code:java} fs.info("dev/test-split") {'name': 'dev/test-split/', 'size': 0, 'type': 'directory'}{code} If I add a slash to the folder name, the slash is removed in the fs.info() return - will this impact anything? {code:java} fs.info("dev/test-split/") {'name': 'dev/test-split', 'size': 0, 'type': 'directory'} {code} {code:java} selected_files3 = fs.info("dev/test-split") selected_files4 = fs.info("dev/test-split/") selected_files3 == selected_files4 False{code} > [Python] Inconsistent behavior calling ds.dataset() > --- > > Key: ARROW-11250 > URL: https://issues.apache.org/jira/browse/ARROW-11250 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0 > Environment: Ubuntu 18.04 > adal 1.2.5 pyh9f0ad1d_0conda-forge > adlfs 0.5.9 pyhd8ed1ab_0conda-forge > apache-airflow1.10.14 pypi_0pypi > azure-common 1.1.24 py_0conda-forge > azure-core1.9.0 pyhd3deb0d_0conda-forge > azure-datalake-store 0.0.51 pyh9f0ad1d_0conda-forge > azure-identity1.5.0 pyhd8ed1ab_0conda-forge > azure-nspkg 3.0.2 py_0conda-forge > azure-storage-blob12.6.0 pyhd3deb0d_0conda-forge > azure-storage-common 2.1.0py37hc8dfbb8_3conda-forge > fsspec0.8.5 pyhd8ed1ab_0conda-forge > jupyterlab_pygments 0.1.2 pyh9f0ad1d_0conda-forge > pandas1.2.0py37ha9443f7_0 > pyarrow 2.0.0 py37h4935f41_6_cpuconda-forge >Reporter: Lance Dacey >Priority: Minor > Labels: azureblob, dataset,, python > Fix For: 4.0.0 > > > In a Jupyter notebook, I have noticed 
that sometimes I am not able to read a > dataset which certainly exists on Azure Blob. > > {code:java} > fs = fsspec.filesystem(protocol="abfs", account_name, account_key) > {code} > > One example of this is reading a dataset in one cell: > > {code:java} > ds.dataset("dev/test-split", partitioning="hive", filesystem=fs){code} > > Then in another cell I try to read the same dataset: > > {code:java} > ds.dataset("dev/test-split", partitioning="hive", filesystem=fs) > --- > FileNotFoundError Traceback (most recent call last) > in > > 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs) > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, > schema, format, filesystem, partitioning, partition_base_dir, > exclude_invalid_files, ignore_prefixes) > 669 # TODO(kszucs): support InMemoryDataset for a table input > 670 if _is_path_like(source): > --> 671 return _filesystem_dataset(source, **kwargs) > 672 elif isinstance(source, (tuple, list)): > 673 if all(_is_path_like(elem) for elem in source): > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in > _filesystem_dataset(source, schema, filesystem, partitioning, format, > partition_base_dir, exclude_invalid_files, selector_ignore_prefixes) > 426 fs, paths_or_selector = _ensure_multiple_sources(source, > filesystem) > 427 else: > --> 428 fs, paths_or_selector = _ensure_single_source(source, > filesystem) > 429 > 430 options = FileSystemFactoryOptions( > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in > _ensure_single_source(path, filesystem) > 402 paths_or_selector = [path] > 403 else: > --> 404 raise FileNotFoundError(path) > 405 > 406 return filesystem, paths_or_selector > FileNotFoundError: dev/test-split > {code} > > If I reset the kernel, it works again. 
It also works if I change the path > slightly, like adding a "/" at the end (so basically it just does not work if I > read the same dataset twice): > > {code:java} > ds.dataset("dev/test-split/", partitioning="hive", filesystem=fs) > {code} > > > The other strange behavior I have noticed that th
[jira] [Created] (ARROW-11250) [Python] Inconsistent behavior calling ds.dataset()
Lance Dacey created ARROW-11250: --- Summary: [Python] Inconsistent behavior calling ds.dataset() Key: ARROW-11250 URL: https://issues.apache.org/jira/browse/ARROW-11250 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 2.0.0 Environment: Ubuntu 18.04 adal 1.2.5 pyh9f0ad1d_0conda-forge adlfs 0.5.9 pyhd8ed1ab_0conda-forge apache-airflow1.10.14 pypi_0pypi azure-common 1.1.24 py_0conda-forge azure-core1.9.0 pyhd3deb0d_0conda-forge azure-datalake-store 0.0.51 pyh9f0ad1d_0conda-forge azure-identity1.5.0 pyhd8ed1ab_0conda-forge azure-nspkg 3.0.2 py_0conda-forge azure-storage-blob12.6.0 pyhd3deb0d_0conda-forge azure-storage-common 2.1.0py37hc8dfbb8_3conda-forge fsspec0.8.5 pyhd8ed1ab_0conda-forge jupyterlab_pygments 0.1.2 pyh9f0ad1d_0conda-forge pandas1.2.0py37ha9443f7_0 pyarrow 2.0.0 py37h4935f41_6_cpuconda-forge Reporter: Lance Dacey Fix For: 3.0.0 In a Jupyter notebook, I have noticed that sometimes I am not able to read a dataset which certainly exists on Azure Blob. 
{code:java} fs = fsspec.filesystem(protocol="abfs", account_name, account_key) {code} One example of this is reading a dataset in one cell: {code:java} ds.dataset("dev/test-split", partitioning="hive", filesystem=fs){code} Then in another cell I try to read the same dataset: {code:java} ds.dataset("dev/test-split", partitioning="hive", filesystem=fs) --- FileNotFoundError Traceback (most recent call last) in > 1 ds.dataset("dev/test-split", partitioning="hive", filesystem=fs) /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes) 669 # TODO(kszucs): support InMemoryDataset for a table input 670 if _is_path_like(source): --> 671 return _filesystem_dataset(source, **kwargs) 672 elif isinstance(source, (tuple, list)): 673 if all(_is_path_like(elem) for elem in source): /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes) 426 fs, paths_or_selector = _ensure_multiple_sources(source, filesystem) 427 else: --> 428 fs, paths_or_selector = _ensure_single_source(source, filesystem) 429 430 options = FileSystemFactoryOptions( /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _ensure_single_source(path, filesystem) 402 paths_or_selector = [path] 403 else: --> 404 raise FileNotFoundError(path) 405 406 return filesystem, paths_or_selector FileNotFoundError: dev/test-split {code} If I reset the kernel, it works again. 
It also works if I change the path slightly, like adding a "/" at the end (so basically it just does not work if I read the same dataset twice): {code:java} ds.dataset("dev/test-split/", partitioning="hive", filesystem=fs) {code} The other strange behavior I have noticed is that if I read a dataset inside of my Jupyter notebook, {code:java} %%time dataset = ds.dataset("dev/test-split", partitioning=ds.partitioning(pa.schema([("date", pa.date32())]), flavor="hive"), filesystem=fs, exclude_invalid_files=False) CPU times: user 1.98 s, sys: 0 ns, total: 1.98 s Wall time: 2.58 s{code} Now, on the exact same server when I try to run the same code against the same dataset in Airflow it takes over 3 minutes (comparing the timestamps in my logs between right before I read the dataset, and immediately after the dataset is available to filter): {code:java} [2021-01-14 03:52:04,011] INFO - Reading dev/test-split [2021-01-14 03:55:17,360] INFO - Processing dataset in batches {code} This is probably not a pyarrow issue, but what are some potential causes that I can look into? I have one example where it is 9 seconds to read the dataset in Jupyter, but then 11 *minutes* in Airflow. I don't know what to investigate - as I mentioned, the Jupyter notebook and Airflow are on the same server and both are deployed using Docker. Airflow is using the CeleryExecutor. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (ARROW-10247) [C++][Dataset] Cannot write dataset with dictionary column as partition field
[ https://issues.apache.org/jira/browse/ARROW-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17261996#comment-17261996 ] Lance Dacey edited comment on ARROW-10247 at 1/10/21, 3:27 AM: --- What is the best workaround for this issue right now? I was playing around with making a new partition schema if a dictionary type was found in my partition columns: {code:java} partitioning = None part_schema = t.select(["project", "date"]).schema fields = [] for part in part_schema: if pa.types.is_dictionary(part.type): fields.append(pa.field(part.name, part.type.value_type)) else: fields.append(pa.field(part.name, part.type)) new_schema = pa.schema(fields) partitioning = ds.partitioning(new_schema, flavor="hive") {code} This seems to work for me. My only issue is if I have multiple partition columns with different types. This would return an error when I read the dataset with ds.dataset(): {code:java} partitioning = ds.partitioning(pa.schema([('date', pa.date32()), ("project", pa.dictionary(index_type=pa.int32(), value_type=pa.string()))]), flavor="hive"){code} ArrowInvalid: No dictionary provided for dictionary field project: dictionary And this returns dictionaries for both partitions (instead of date being pa.date32()) which is not ideal: {code:java} partitioning=ds.HivePartitioning.discover(infer_dictionary=True){code} was (Author: ldacey): What is the best workaround for this issue right now? If a column in the partition columns is_dictionary(), then convert it to pa.string() to save the dataset and then use ds.HivePartitioning.discover(infer_dictionary=True) to read the dataset later? 
> [C++][Dataset] Cannot write dataset with dictionary column as partition field > - > > Key: ARROW-10247 > URL: https://issues.apache.org/jira/browse/ARROW-10247 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Ben Kietzman >Priority: Major > Labels: dataset, pull-request-available > Fix For: 3.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > When the column to use for partitioning is dictionary encoded, we get this > error: > {code} > In [9]: import pyarrow.dataset as ds > In [10]: part = ["xxx"] * 3 + ["yyy"] * 3 > ...: table = pa.table([ > ...: pa.array(range(len(part))), > ...: pa.array(part).dictionary_encode(), > ...: ], names=['col', 'part']) > In [11]: part = ds.partitioning(table.select(["part"]).schema) > In [12]: ds.write_dataset(table, "test_dataset_dict_part", format="parquet", > partitioning=part) > --- > ArrowTypeErrorTraceback (most recent call last) > in > > 1 ds.write_dataset(table, "test_dataset_dict_part", format="parquet", > partitioning=part) > ~/scipy/repos/arrow/python/pyarrow/dataset.py in write_dataset(data, > base_dir, basename_template, format, partitioning, schema, filesystem, > file_options, use_threads) > 773 _filesystemdataset_write( > 774 data, base_dir, basename_template, schema, > --> 775 filesystem, partitioning, file_options, use_threads, > 776 ) > ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in > pyarrow._dataset._filesystemdataset_write() > ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status() > ArrowTypeError: scalar xxx (of type string) is invalid for part: > dictionary > In ../src/arrow/dataset/filter.cc, line 1082, code: > VisitConjunctionMembers(*and_.left_operand(), visitor) > In ../src/arrow/dataset/partition.cc, line 257, code: VisitKeys(expr, > [&](const std::string& name, const std::shared_ptr& value) { auto&& > _error_or_value28 = (FieldRef(name).FindOneOrNone(*schema_)); do { > ::arrow::Status __s = > 
::arrow::internal::GenericToStatus((_error_or_value28).status()); do { if > ((__builtin_expect(!!(!__s.ok()), 0))) { ::arrow::Status _st = (__s); > _st.AddContextLine("../src/arrow/dataset/partition.cc", 257, > "(_error_or_value28).status()"); return _st; } } while (0); } while (false); > auto match = std::move(_error_or_value28).ValueUnsafe();;; if (match) { const > auto& field = schema_->field(match[0]); if > (!value->type->Equals(field->type())) { return Status::TypeError("scalar ", > value->ToString(), " (of type ", *value->type, ") is invalid for ", > field->ToString()); } values[match[0]] = value.get(); } return Status::OK(); > }) > In ../src/arrow/dataset/file_base.cc, line 321, code: > (_error_or_value24).status() > In ../src/arrow/dataset/file_base.cc, line 354, code: task_group->Finish() > {code} > While this seems a quit normal use case, as thi
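The workaround quoted above rewrites the partition schema so dictionary-encoded fields fall back to their underlying value type. A pure-Python analogue of that rewrite (the pyarrow calls in the comment are the real mechanism; the dict-based "schema" here is only an illustrative stand-in):

```python
# Illustrative stand-in for the pyarrow schema rewrite: each field maps
# to (kind, value_type); dictionary fields collapse to their value type,
# everything else keeps its original type.
def strip_dictionary(fields):
    resolved = {}
    for name, (kind, value_type) in fields.items():
        resolved[name] = value_type if kind == "dictionary" else kind
    return resolved

fields = {"project": ("dictionary", "string"), "date": ("date32", None)}
print(strip_dictionary(fields))  # {'project': 'string', 'date': 'date32'}
```

In the real code this is the loop over `part_schema` that appends `pa.field(part.name, part.type.value_type)` for dictionary fields before calling `ds.partitioning(new_schema, flavor="hive")`.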
[jira] [Commented] (ARROW-10247) [C++][Dataset] Cannot write dataset with dictionary column as partition field
[ https://issues.apache.org/jira/browse/ARROW-10247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17261996#comment-17261996 ] Lance Dacey commented on ARROW-10247: - What is the best workaround for this issue right now? If a column in the partition columns is_dictionary(), then convert it to pa.string() to save the dataset and then use ds.HivePartitioning.discover(infer_dictionary=True) to read the dataset later? > [C++][Dataset] Cannot write dataset with dictionary column as partition field > - > > Key: ARROW-10247 > URL: https://issues.apache.org/jira/browse/ARROW-10247 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Joris Van den Bossche >Assignee: Ben Kietzman >Priority: Major > Labels: dataset, pull-request-available > Fix For: 3.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > When the column to use for partitioning is dictionary encoded, we get this > error: > {code} > In [9]: import pyarrow.dataset as ds > In [10]: part = ["xxx"] * 3 + ["yyy"] * 3 > ...: table = pa.table([ > ...: pa.array(range(len(part))), > ...: pa.array(part).dictionary_encode(), > ...: ], names=['col', 'part']) > In [11]: part = ds.partitioning(table.select(["part"]).schema) > In [12]: ds.write_dataset(table, "test_dataset_dict_part", format="parquet", > partitioning=part) > --- > ArrowTypeErrorTraceback (most recent call last) > in > > 1 ds.write_dataset(table, "test_dataset_dict_part", format="parquet", > partitioning=part) > ~/scipy/repos/arrow/python/pyarrow/dataset.py in write_dataset(data, > base_dir, basename_template, format, partitioning, schema, filesystem, > file_options, use_threads) > 773 _filesystemdataset_write( > 774 data, base_dir, basename_template, schema, > --> 775 filesystem, partitioning, file_options, use_threads, > 776 ) > ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in > pyarrow._dataset._filesystemdataset_write() > ~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status() > 
ArrowTypeError: scalar xxx (of type string) is invalid for part: > dictionary > In ../src/arrow/dataset/filter.cc, line 1082, code: > VisitConjunctionMembers(*and_.left_operand(), visitor) > In ../src/arrow/dataset/partition.cc, line 257, code: VisitKeys(expr, > [&](const std::string& name, const std::shared_ptr& value) { auto&& > _error_or_value28 = (FieldRef(name).FindOneOrNone(*schema_)); do { > ::arrow::Status __s = > ::arrow::internal::GenericToStatus((_error_or_value28).status()); do { if > ((__builtin_expect(!!(!__s.ok()), 0))) { ::arrow::Status _st = (__s); > _st.AddContextLine("../src/arrow/dataset/partition.cc", 257, > "(_error_or_value28).status()"); return _st; } } while (0); } while (false); > auto match = std::move(_error_or_value28).ValueUnsafe();;; if (match) { const > auto& field = schema_->field(match[0]); if > (!value->type->Equals(field->type())) { return Status::TypeError("scalar ", > value->ToString(), " (of type ", *value->type, ") is invalid for ", > field->ToString()); } values[match[0]] = value.get(); } return Status::OK(); > }) > In ../src/arrow/dataset/file_base.cc, line 321, code: > (_error_or_value24).status() > In ../src/arrow/dataset/file_base.cc, line 354, code: task_group->Finish() > {code} > While this seems a quit normal use case, as this column will typically be > repeated many times (and we also support reading it as such with dictionary > type, so a roundtrip is currently not possible in that case) > I tagged it for 2.0.0 for a moment in case it's possible today, but I didn't > yet look into how easy it would be to fix. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10523) [Python] Pandas timestamps are inferred to have only microsecond precision
[ https://issues.apache.org/jira/browse/ARROW-10523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17260878#comment-17260878 ] Lance Dacey commented on ARROW-10523: - I noticed that even explicitly using (unit="ns") would not work when using write_to_dataset() with the legacy dataset. I would print table.schema right before saving the dataset to Azure Blob (it would show "ns"), and when I read the dataset.schema afterwards the unit was "us". In the end, I explicitly wrote the data using unit="us" and also added the coerce_timestamps="us" write option. > [Python] Pandas timestamps are inferred to have only microsecond precision > -- > > Key: ARROW-10523 > URL: https://issues.apache.org/jira/browse/ARROW-10523 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 2.0.0 >Reporter: David Li >Priority: Minor > > {code:java} > import pyarrow as pa > import pandas as pd > arr = pa.array([pd.Timestamp(year=2020, month=1, day=1, nanosecond=999)]) > print(arr) > print(arr.type) {code} > This gives: > {noformat} > [ > 2020-01-01 00:00:00.00 > ] > timestamp[us] > {noformat} > However, Pandas Timestamps have nanosecond precision, which would be nice to > preserve in inference. > The reason is that TypeInferrer [hardcodes > microseconds|https://github.com/apache/arrow/blob/apache-arrow-2.0.0/cpp/src/arrow/python/inference.cc#L466] > as it only knows about the standard library datetime, so I'm treating this > as a feature request and not quite a bug. Of course, this can be worked > around easily by specifying an explicit type. -- This message was sent by Atlassian Jira (v8.3.4#803005)
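The precision loss discussed above is easy to demonstrate with plain integers: coercing an epoch-nanosecond timestamp to microseconds silently drops the sub-microsecond component, which is why writing with unit="us" plus coerce_timestamps="us" gives a consistent (if lossy) round trip.

```python
# 2020-01-01 00:00:00 UTC plus 999 nanoseconds, as epoch nanoseconds
# (matching the pd.Timestamp(..., nanosecond=999) in the issue).
ns = 1_577_836_800_000_000_999

us = ns // 1000          # coerce to microsecond precision (floor division)
restored = us * 1000     # best possible reconstruction from microseconds

print(ns - restored)     # 999 -- the trailing nanoseconds are gone
```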
[jira] [Commented] (ARROW-10695) [C++][Dataset] Allow to use a UUID in the basename_template when writing a dataset
[ https://issues.apache.org/jira/browse/ARROW-10695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17244232#comment-17244232 ] Lance Dacey commented on ARROW-10695: - FYI, I think this might be necessary for some use cases. For example, I have Airflow extract data from dozens of APIs in parallel and write to the same target partitioned dataset (partitioned based on the Airflow scheduled date, so all files belong in the same batch folder) - this causes the part-0.parquet file to be overwritten each time, which results in lost data instead of there being dozens of files. In the meantime, I added the code below. It seems I need to keep the \{i} or I get an error: {code:python} if self.create_uuid_filename: basename_template = guid() + "-{i}.parquet" else: basename_template = "part-{i}.parquet" {code} guid() is imported from pyarrow.util > [C++][Dataset] Allow to use a UUID in the basename_template when writing a > dataset > -- > > Key: ARROW-10695 > URL: https://issues.apache.org/jira/browse/ARROW-10695 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Labels: dataset, dataset-parquet-write > Fix For: 3.0.0 > > > Currently we allow the user to specify a {{basename_template}}, and this can > include a {{"\{i\}"}} part to replace it with an automatically incremented > integer (so each generated file written to a single partition is unique): > https://github.com/apache/arrow/blob/master/python/pyarrow/dataset.py#L713-L717 > It _might_ be useful to also have the ability to use a UUID, to ensure the > file is unique in general (not only for a single write) and to mimic the > behaviour of the old {{write_to_dataset}} implementation. > For example, we could look for a {{"\{uuid\}"}} in the template string, and > if present replace it for each file with a new UUID. -- This message was sent by Atlassian Jira (v8.3.4#803005)
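The workaround in the comment can also be written without pyarrow's guid() helper by using the standard library's uuid module. The "{i}" placeholder must survive literally, because the dataset writer substitutes the per-partition file counter itself; the function name below is a hypothetical sketch.

```python
import uuid

def unique_basename_template(create_uuid_filename=True):
    """Build a basename_template for ds.write_dataset(); "{i}" is kept
    literal so the writer can number files within each partition."""
    if create_uuid_filename:
        return uuid.uuid4().hex + "-{i}.parquet"
    return "part-{i}.parquet"

template = unique_basename_template()
# The writer would later render e.g. "<32-hex-chars>-0.parquet" for the
# first file, so parallel writers no longer clobber part-0.parquet.
print(template.endswith("-{i}.parquet"))  # True
```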
[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243963#comment-17243963 ] Lance Dacey commented on ARROW-10517: - Yes, I think the uuid specifier would work fine for my purposes. Generally, I have had pyarrow create the resulting filenames with the partition_filename_cb function, but you are right - I could probably generate the filenames directly since I am dictating which filters to use in the first place (and each filter becomes a file). {code:python} d1 = { "id": [1, 2, 3, 4, 5], "created_at": [ datetime.date(2020, 5, 7), datetime.date(2020, 6, 19), datetime.date(2020, 9, 14), datetime.date(2020, 11, 22), datetime.date(2020, 12, 2), ], "updated_at": [ datetime.date(2020, 12, 2), datetime.date(2020, 12, 2), datetime.date(2020, 12, 2), datetime.date(2020, 12, 2), datetime.date(2020, 12, 2), ], } df = pd.DataFrame(data=d1) table = pa.Table.from_pandas(df) #historical dataset which has all history of each ID each time it gets updated #each created_at partition would have a sub-partition for updated_at since historical data can change - this can generate many small files depending on how often my schedule runs to download data #I use pa.string() as the partition data type here because I have had issues using pa.date32(), sometimes I will get an error that we cannot convert a string to date32() but using a date works perfectly fine ds.write_dataset( data=table, base_dir=output_path, format="parquet", partitioning=ds.partitioning(pa.schema([("created_at", pa.string()), ("updated_at", pa.string())]), flavor="hive"), schema=table.schema, filesystem=fs, ) #the next task would read the dataset and filter for the created_at partition (ignoring the updated_at partition) dataset = ds.dataset( source=output_path, format="parquet", partitioning="hive", filesystem=fs, ) #I save the unique filters (each created_at value) externally and build the dataset filter expression 
filter_expression = pq._filters_to_expression(
    filters=[
        [('created_at', '==', '2020-05-07')],
        [('created_at', '==', '2020-06-19')],
        [('created_at', '==', '2020-09-14')],
        [('created_at', '==', '2020-11-22')],
        [('created_at', '==', '2020-12-02')],
    ]
)
table = dataset.to_table(filter=filter_expression)

# Turn the table into a pandas dataframe to remove duplicates and retain the latest row for each ID
df = (
    table.to_pandas(self_destruct=True)
    .sort_values(["id", "updated_at"], ascending=True)
    .drop_duplicates(["id"], keep="last")
)
table = pa.Table.from_pandas(df)

# This writes the final dataset. There would be one file per created_at partition:
# "container/created_at=2020-05-07/2020-05-07.parquet"
# Our visualization tool connects directly to these parquet files so we can report
# on the latest status of each ticket (not much attention is paid to historical changes)
pq.write_to_dataset(
    table=table,
    root_path=output_path,
    partition_cols=["created_at"],
    partition_filename_cb=lambda x: str(x[-1]) + '.parquet',
    filesystem=fs,
)
{code} ***Note regarding the filters I use: I am using code similar to something I found in the pyarrow write_to_dataset function (pasted below) to generate these filters. I could probably generate filenames instead, though, and use write_table like you mentioned. 
{code:python}
for keys, subgroup in data_df.groupby(partition_keys):
    if not isinstance(keys, tuple):
        keys = (keys,)
    subdir = '/'.join(
        ['{colname}={value}'.format(colname=name, value=val)
         for name, val in zip(partition_cols, keys)])
{code}
> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob > > > Key: ARROW-10517 > URL: https://issues.apache.org/jira/browse/ARROW-10517 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0 > Environment: Ubuntu 18.04 >Reporter: Lance Dacey >Priority: Major > Labels: azureblob, dataset, dataset-parquet-read, > dataset-parquet-write, fsspec > Fix For: 2.0.0 > > Attachments: ss.PNG, ss2.PNG > > > > {code:python} > # adal==1.2.5 > # adlfs==0.2.5 > # fsspec==0.7.4 > # pandas==1.1.3 > # pyarrow==2.0.0 > # azure-storage-blob==2.1.0 > # azure-storage-common==2.1.0 > import pyarrow.dataset as ds > import fsspec > from pyarrow.dataset import DirectoryPartitioning > fs = fsspec.filesystem(protocol='abfs', >account_name=base.login, >account_key=base.password) > ds.write_dataset(data=table, > base_dir="dev/test7", >
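The snippet above is the core of how hive-style partition paths get built. A pure-Python sketch of the same logic, without pandas (the function name is mine, not pyarrow's):

```python
def hive_subdir(partition_cols, keys):
    # One "col=value" segment per partition key, joined into a
    # relative directory path such as "created_at=2020-05-07".
    if not isinstance(keys, tuple):
        keys = (keys,)
    return "/".join(
        "{colname}={value}".format(colname=name, value=val)
        for name, val in zip(partition_cols, keys)
    )

print(hive_subdir(["created_at"], "2020-05-07"))  # created_at=2020-05-07
print(hive_subdir(["year", "month"], (2020, 5)))  # year=2020/month=5
```

Generating these paths directly (and writing each group with write_table) would sidestep partition_filename_cb entirely, as suggested in the comment.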
[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242425#comment-17242425 ] Lance Dacey commented on ARROW-10517: - FYI, it seems like the "part-\{i}" basename_template does not work well if schedules run in parallel. For example, I ran 30 schedules (in parallel) which read separate JSON files and output the data into the same partitioned parquet dataset. Only part-0.parquet was being overwritten each time. For now, I imported the guid() function from pyarrow.utils to ensure that all files are written. > [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob > > > Key: ARROW-10517 > URL: https://issues.apache.org/jira/browse/ARROW-10517 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0 > Environment: Ubuntu 18.04 >Reporter: Lance Dacey >Priority: Major > Labels: azureblob, dataset, dataset-parquet-read, > dataset-parquet-write, fsspec > Fix For: 2.0.0 > > Attachments: ss.PNG, ss2.PNG > > > > {code:python} > # adal==1.2.5 > # adlfs==0.2.5 > # fsspec==0.7.4 > # pandas==1.1.3 > # pyarrow==2.0.0 > # azure-storage-blob==2.1.0 > # azure-storage-common==2.1.0 > import pyarrow.dataset as ds > import fsspec > from pyarrow.dataset import DirectoryPartitioning > fs = fsspec.filesystem(protocol='abfs', >account_name=base.login, >account_key=base.password) > ds.write_dataset(data=table, > base_dir="dev/test7", > basename_template=None, > format="parquet", > partitioning=DirectoryPartitioning(pa.schema([("year", > pa.string()), ("month", pa.string()), ("day", pa.string())])), > schema=table.schema, > filesystem=fs, > ) > {code} > I think this is due to early versions of adlfs having mkdir(). Although I > use write_to_dataset and write_table all of the time, so I am not sure why > this would be an issue. 
> {code:python} > --- > RuntimeError Traceback (most recent call last) > in > 13 > 14 > ---> 15 ds.write_dataset(data=table, > 16 base_dir="dev/test7", > 17 basename_template=None, > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in > write_dataset(data, base_dir, basename_template, format, partitioning, > schema, filesystem, file_options, use_threads) > 771 filesystem, _ = _ensure_fs(filesystem) > 772 > --> 773 _filesystemdataset_write( > 774 data, base_dir, basename_template, schema, > 775 filesystem, partitioning, file_options, use_threads, > /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in > pyarrow._dataset._filesystemdataset_write() > /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in > pyarrow._fs._cb_create_dir() > /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, > path, recursive) > 226 def create_dir(self, path, recursive): > 227 # mkdir also raises FileNotFoundError when base directory is > not found > --> 228 self.fs.mkdir(path, create_parents=recursive) > 229 > 230 def delete_dir(self, path): > /opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, > delimiter, exists_ok, **kwargs) > 561 else: > 562 ## everything else > --> 563 raise RuntimeError(f"Cannot create > {container_name}{delimiter}{path}.") > 564 else: > 565 if container_name in self.ls("") and path: > RuntimeError: Cannot create dev/test7/2020/01/28. > {code} > > Next, if I try to read a dataset (keep in mind that this works with > read_table and ParquetDataset): > {code:python} > ds.dataset(source="dev/staging/evaluations", >format="parquet", >partitioning="hive", >exclude_invalid_files=False, >filesystem=fs > ) > {code} > > This doesn't seem to respect the filesystem connected to Azure Blob. 
> {code:python} > --- > FileNotFoundError Traceback (most recent call last) > in > > 1 ds.dataset(source="dev/staging/evaluations", > 2format="parquet", > 3partitioning="hive", > 4exclude_invalid_files=False, > 5filesystem=fs > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, > schema, format, filesystem, partitioning, parti
[jira] [Comment Edited] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition
[ https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242213#comment-17242213 ] Lance Dacey edited comment on ARROW-10694 at 12/2/20, 9:57 AM: --- I am simply listing and deleting blobs without ".parquet" as a workaround for now. I think this is still an issue that should be resolved since this can delete _common_metadata and _metadata files unless I specifically ignore them was (Author: ldacey): I am simply listing and deleting blobs with ".parquet" as a workaround for now. I think this is still an issue that should be resolved since this can delete _common_metadata and _metadata files unless I specifically ignore them > [Python] ds.write_dataset() generates empty files for each final partition > -- > > Key: ARROW-10694 > URL: https://issues.apache.org/jira/browse/ARROW-10694 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: Ubuntu 18.04 > Python 3.8.6 > adlfs master branch >Reporter: Lance Dacey >Priority: Major > > ds.write_dataset() is generating empty files for the final partition folder > which causes errors when reading the dataset or converting a dataset to a > table. > I believe this may be caused by fs.mkdir(). 
Without the final slash in the > path, an empty file is created in the "dev" container: > > {code:java} > fs = fsspec.filesystem(protocol='abfs', account_name=base.login, > account_key=base.password) > fs.mkdir("dev/test2") > {code} > > If the final slash is added, a proper folder is created: > {code:java} > fs.mkdir("dev/test2/"){code} > > Here is a full example of what happens with ds.write_dataset: > {code:java} > schema = pa.schema( > [ > ("year", pa.int16()), > ("month", pa.int8()), > ("day", pa.int8()), > ("report_date", pa.date32()), > ("employee_id", pa.string()), > ("designation", pa.dictionary(index_type=pa.int16(), > value_type=pa.string())), > ] > ) > part = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", > pa.int8()), ("day", pa.int8())])) > ds.write_dataset(data=table, > base_dir="dev/test-dataset", > basename_template="test-{i}.parquet", > format="parquet", > partitioning=part, > schema=schema, > filesystem=fs) > dataset.files > #sample printed below, note the empty files > [ > 'dev/test-dataset/2018/1/1/test-0.parquet', > 'dev/test-dataset/2018/10/1', > 'dev/test-dataset/2018/10/1/test-27.parquet', > 'dev/test-dataset/2018/3/1', > 'dev/test-dataset/2018/3/1/test-6.parquet', > 'dev/test-dataset/2020/1/1', > 'dev/test-dataset/2020/1/1/test-2.parquet', > 'dev/test-dataset/2020/10/1', > 'dev/test-dataset/2020/10/1/test-29.parquet', > 'dev/test-dataset/2020/11/1', > 'dev/test-dataset/2020/11/1/test-32.parquet', > 'dev/test-dataset/2020/2/1', > 'dev/test-dataset/2020/2/1/test-5.parquet', > 'dev/test-dataset/2020/7/1', > 'dev/test-dataset/2020/7/1/test-20.parquet', > 'dev/test-dataset/2020/8/1', > 'dev/test-dataset/2020/8/1/test-23.parquet', > 'dev/test-dataset/2020/9/1', > 'dev/test-dataset/2020/9/1/test-26.parquet' > ]{code} > As you can see, there is an empty file for each "day" partition. I was not > even able to read the dataset at all until I manually deleted the first empty > file in the dataset (2018/1/1). 
> I then get an error when I try to use the to_table() method: > {code:java} > OSError Traceback (most recent call last) > in > > 1 > dataset.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx > in > pyarrow._dataset.Dataset.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx > in > pyarrow._dataset.Scanner.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi > in > pyarrow.lib.pyarrow_internal_check_status()/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi > in pyarrow.lib.check_status()OSError: Could not open parquet input source > 'dev/test-dataset/2018/10/1': Invalid: Parquet file size is 0 bytes > {code} > If I manually delete the empty file, I can then use the to_table() function: > {code:java} > dataset.to_table(filter=(ds.field("year") == 2020) & (ds.field("month") == > 10)).to_pandas() > {code} > Is this a bug with pyarrow, adlfs, or fsspec? > -- This message was sent by Atlassian Jira (v8.3.4#803005)
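The cleanup workaround described in this thread (deleting the zero-byte directory-marker blobs while sparing the _metadata sidecars) amounts to a filter over an fsspec-style listing. A sketch with illustrative names and a fabricated listing:

```python
def data_files(paths_with_sizes, keep_metadata=True):
    # Keep real data files: drop zero-byte directory markers and anything
    # that is not *.parquet, optionally preserving the metadata sidecars
    # that a blanket ".parquet" filter would delete.
    keep = []
    for path, size in paths_with_sizes:
        name = path.rsplit("/", 1)[-1]
        if keep_metadata and name in ("_metadata", "_common_metadata"):
            keep.append(path)
        elif name.endswith(".parquet") and size > 0:
            keep.append(path)
    return keep

listing = [
    ("dev/test-dataset/2018/1/1", 0),                    # empty marker blob
    ("dev/test-dataset/2018/1/1/test-0.parquet", 1234),  # real data file
    ("dev/test-dataset/_common_metadata", 88),           # sidecar to preserve
]
print(data_files(listing))
```

In practice the (path, size) pairs would come from something like fs.find(..., detail=True); the exact fsspec call depends on the filesystem implementation.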
[jira] [Commented] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition
[ https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242213#comment-17242213 ] Lance Dacey commented on ARROW-10694: - I am simply listing and deleting blobs with ".parquet" as a workaround for now. I think this is still an issue that should be resolved since this can delete _common_metadata and _metadata files unless I specifically ignore them -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241833#comment-17241833 ] Lance Dacey commented on ARROW-10517: - Thanks - since the \{i} increments each time a new file is written, I am not sure if this can work for my use case unless I am designing this incorrectly. I am using the partition_filename_cb similar to how I would create a materialized view in a database to ensure that there is only one row per unique ID based on the latest update timestamp. I can then connect this parquet dataset to our visualization tool, or I can export it to CSV format and email it to another team, etc. {code:java} #the historical dataset includes all rows, the number of files will depend on the frequency of scheduled downloads. it is possible to have multiple rows per unique ID historical_dataset = [ 'dev/test/report_date=2018-01-01/part-0.parquet', 'dev/test/report_date=2018-01-01/part-1.parquet', 'dev/test/report_date=2018-01-01/part-2.parquet', 'dev/test/report_date=2018-01-01/part-3.parquet', 'dev/test/report_date=2018-01-01/part-4.parquet', 'dev/test/report_date=2018-01-01/part-5.parquet', ] #read the historical dataset and filter for the partition. in this case, report_date = 2018-01-01, so all data from that date is read into a table #convert to pandas dataframe, sort based on "id" and "updated_at" fields #drop duplicates based on "id" field, retaining the latest version #write to a new dataset which is just the latest version of each "id". The 6 parts are now in a single file which will be continuously overwritten if any new data is added to the historical_dataset. Our visualization tool connects to these finalized files, and sometimes I send the data through email for reporting purposes latest_dataset = [ 'dev/test/report_date=2018-01-01/2018-01-01.parquet', ] {code} Perhaps there is a better way to go about this? 
With a database, I would just create a view which selects distinct on the ID column based on the latest update timestamp. This seems to be a common use case, so I am not sure how people would go about it with Parquet.
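The "materialized view" described above boils down to a DISTINCT ON: keep only the newest row per ID. A plain-Python sketch of that logic with made-up sample rows (newer pyarrow releases can express the same thing natively via Table.group_by, as discussed elsewhere in this thread):

```python
def latest_per_id(rows):
    # rows: iterable of dicts with "id" and "updated_at" keys.
    # Keep the row with the greatest updated_at for each id, mirroring
    # SELECT DISTINCT ON (id) ... ORDER BY id, updated_at DESC.
    latest = {}
    for row in rows:
        current = latest.get(row["id"])
        if current is None or row["updated_at"] > current["updated_at"]:
            latest[row["id"]] = row
    return list(latest.values())

rows = [
    {"id": 1, "updated_at": "2022-01-01"},
    {"id": 2, "updated_at": "2022-01-01"},
    {"id": 2, "updated_at": "2022-01-10"},
]
print(latest_per_id(rows))  # id 1 keeps its only row; id 2 keeps 2022-01-10
```

String comparison works here because the dates are ISO-formatted; real timestamps would compare the same way as datetime objects.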
[jira] [Commented] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition
[ https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237341#comment-17237341 ] Lance Dacey commented on ARROW-10694: - FYI, I tested HivePartitioning as well, but faced the same issue.
{code:java}
from pyarrow.dataset import HivePartitioning

partition = HivePartitioning(pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())]))

FileNotFoundError: dev/test-dataset2/year=2018/month=1/day=1{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition
[ https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237316#comment-17237316 ] Lance Dacey commented on ARROW-10694: -
{code:java}
print(fs.isfile("dev/test-dataset/2018/1/1"))
print(fs.info("dev/test-dataset/2018/1/1", detail=True)){code}
False
{'name': 'dev/test-dataset/2018/1/1/', 'size': 0, 'type': 'directory'}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition
[ https://issues.apache.org/jira/browse/ARROW-10694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17237299#comment-17237299 ] Lance Dacey commented on ARROW-10694: - Sure. https://github.com/dask/adlfs/issues/137 I tried the exclude_invalid_files argument but ran into an error: {code:java} dataset = ds.dataset(source="dev/test-dataset", format="parquet", partitioning=partition, exclude_invalid_files=True, filesystem=fs) --- FileNotFoundError Traceback (most recent call last) in > 1 dataset = ds.dataset(source="dev/test-dataset", 2 format="parquet", 3 partitioning=partition, 4 exclude_invalid_files=True, 5 filesystem=fs) /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes) 669 # TODO(kszucs): support InMemoryDataset for a table input 670 if _is_path_like(source): --> 671 return _filesystem_dataset(source, **kwargs) 672 elif isinstance(source, (tuple, list)): 673 if all(_is_path_like(elem) for elem in source): /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes) 434 selector_ignore_prefixes=selector_ignore_prefixes 435 ) --> 436 factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options) 437 438 return factory.finish(schema) /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.FileSystemDatasetFactory.__init__() /opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status() /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_open_input_file() /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in open_input_file(self, path) 274 275 if not self.fs.isfile(path): --> 276 raise FileNotFoundError(path) 277 278 return PythonFile(self.fs.open(path, mode="rb"), 
mode="r") FileNotFoundError: dev/test-dataset/2018/1/1 {code} That folder and the empty file exists though: {code:java} for file in fs.find("dev/test-dataset"): print(file) dev/test-dataset/2018/1/1 dev/test-dataset/2018/1/1/test-0.parquet dev/test-dataset/2018/10/1 dev/test-dataset/2018/10/1/test-27.parquet dev/test-dataset/2018/11/1 dev/test-dataset/2018/11/1/test-30.parquet dev/test-dataset/2018/12/1 dev/test-dataset/2018/12/1/test-33.parquet dev/test-dataset/2018/2/1 dev/test-dataset/2018/2/1/test-3.parquet {code} > [Python] ds.write_dataset() generates empty files for each final partition > -- > > Key: ARROW-10694 > URL: https://issues.apache.org/jira/browse/ARROW-10694 > Project: Apache Arrow > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: Ubuntu 18.04 > Python 3.8.6 > adlfs master branch >Reporter: Lance Dacey >Priority: Major > > ds.write_dataset() is generating empty files for the final partition folder > which causes errors when reading the dataset or converting a dataset to a > table. > I believe this may be caused by fs.mkdir(). 
Without the final slash in the > path, an empty file is created in the "dev" container: > > {code:java} > fs = fsspec.filesystem(protocol='abfs', account_name=base.login, > account_key=base.password) > fs.mkdir("dev/test2") > {code} > > If the final slash is added, a proper folder is created: > {code:java} > fs.mkdir("dev/test2/"){code} > > Here is a full example of what happens with ds.write_dataset: > {code:java} > schema = pa.schema( > [ > ("year", pa.int16()), > ("month", pa.int8()), > ("day", pa.int8()), > ("report_date", pa.date32()), > ("employee_id", pa.string()), > ("designation", pa.dictionary(index_type=pa.int16(), > value_type=pa.string())), > ] > ) > part = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", > pa.int8()), ("day", pa.int8())])) > ds.write_dataset(data=table, > base_dir="dev/test-dataset", > basename_template="test-{i}.parquet", > format="parquet", > partitioning=part, > schema=schema, > filesystem=fs) > dataset.files > #sample printed below, note the empty files > [ > 'dev/test-dataset/2018/1/1/test-0.parquet', > 'dev/test-dataset/2018/10
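One workaround for these zero-byte placeholders, until exclude_invalid_files handles them, is to build the file list yourself and keep only non-empty .parquet files; ds.dataset() also accepts a list of paths instead of a directory. A minimal sketch, using a local temp directory to stand in for the abfs filesystem (the helper name list_valid_parquet is mine, not a pyarrow API):

```python
import os
import tempfile

def list_valid_parquet(root):
    """Walk root and keep only non-empty files ending in .parquet,
    skipping the zero-byte artifacts that trip up the dataset reader."""
    keep = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if name.endswith(".parquet") and os.path.getsize(path) > 0:
                keep.append(path)
    return sorted(keep)

# Simulate a partition folder containing one real file and one empty one
root = tempfile.mkdtemp()
part = os.path.join(root, "2018", "1", "1")
os.makedirs(part)
with open(os.path.join(part, "test-0.parquet"), "wb") as f:
    f.write(b"PAR1...PAR1")  # stand-in bytes for a real parquet file
open(os.path.join(part, "empty.parquet"), "wb").close()  # zero-byte file

print(list_valid_parquet(root))  # only .../test-0.parquet survives
```

The resulting list should be usable as the source argument of ds.dataset(..., filesystem=fs), sidestepping the empty files entirely.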
[jira] [Created] (ARROW-10694) [Python] ds.write_dataset() generates empty files for each final partition
Lance Dacey created ARROW-10694: --- Summary: [Python] ds.write_dataset() generates empty files for each final partition Key: ARROW-10694 URL: https://issues.apache.org/jira/browse/ARROW-10694 Project: Apache Arrow Issue Type: Bug Affects Versions: 2.0.0 Environment: Ubuntu 18.04 Python 3.8.6 adlfs master branch Reporter: Lance Dacey ds.write_dataset() is generating empty files for the final partition folder which causes errors when reading the dataset or converting a dataset to a table. I believe this may be caused by fs.mkdir(). Without the final slash in the path, an empty file is created in the "dev" container: {code:java} fs = fsspec.filesystem(protocol='abfs', account_name=base.login, account_key=base.password) fs.mkdir("dev/test2") {code} If the final slash is added, a proper folder is created: {code:java} fs.mkdir("dev/test2/"){code} Here is a full example of what happens with ds.write_dataset: {code:java} schema = pa.schema( [ ("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8()), ("report_date", pa.date32()), ("employee_id", pa.string()), ("designation", pa.dictionary(index_type=pa.int16(), value_type=pa.string())), ] ) part = DirectoryPartitioning(pa.schema([("year", pa.int16()), ("month", pa.int8()), ("day", pa.int8())])) ds.write_dataset(data=table, base_dir="dev/test-dataset", basename_template="test-{i}.parquet", format="parquet", partitioning=part, schema=schema, filesystem=fs) dataset.files #sample printed below, note the empty files [ 'dev/test-dataset/2018/1/1/test-0.parquet', 'dev/test-dataset/2018/10/1', 'dev/test-dataset/2018/10/1/test-27.parquet', 'dev/test-dataset/2018/3/1', 'dev/test-dataset/2018/3/1/test-6.parquet', 'dev/test-dataset/2020/1/1', 'dev/test-dataset/2020/1/1/test-2.parquet', 'dev/test-dataset/2020/10/1', 'dev/test-dataset/2020/10/1/test-29.parquet', 'dev/test-dataset/2020/11/1', 'dev/test-dataset/2020/11/1/test-32.parquet', 'dev/test-dataset/2020/2/1', 'dev/test-dataset/2020/2/1/test-5.parquet', 
'dev/test-dataset/2020/7/1', 'dev/test-dataset/2020/7/1/test-20.parquet', 'dev/test-dataset/2020/8/1', 'dev/test-dataset/2020/8/1/test-23.parquet', 'dev/test-dataset/2020/9/1', 'dev/test-dataset/2020/9/1/test-26.parquet' ]{code} As you can see, there is an empty file for each "day" partition. I was not even able to read the dataset at all until I manually deleted the first empty file in the dataset (2018/1/1). I then get an error when I try to use the to_table() method: {code:java} OSError Traceback (most recent call last) in > 1 dataset.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.to_table()/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()/opt/conda/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()OSError: Could not open parquet input source 'dev/test-dataset/2018/10/1': Invalid: Parquet file size is 0 bytes {code} If I manually delete the empty file, I can then use the to_table() function: {code:java} dataset.to_table(filter=(ds.field("year") == 2020) & (ds.field("month") == 10)).to_pandas() {code} Is this a bug with pyarrow, adlfs, or fsspec? -- This message was sent by Atlassian Jira (v8.3.4#803005)
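The placeholder entries in the listing above can also be detected purely from the blob names: each empty "file" shares its path with a directory, i.e. some other blob extends it past a "/". A hedged sketch (drop_placeholder_blobs is a hypothetical helper, not part of pyarrow or adlfs):

```python
def drop_placeholder_blobs(paths):
    """Remove blob names that are really directories: any path that another
    path extends past a '/' is a zero-byte placeholder, not a data file."""
    paths = sorted(paths)
    return [p for p in paths
            if not any(q.startswith(p + "/") for q in paths)]

# A slice of the dataset.files output from the issue description
listing = [
    "dev/test-dataset/2018/1/1/test-0.parquet",
    "dev/test-dataset/2018/10/1",                 # empty placeholder
    "dev/test-dataset/2018/10/1/test-27.parquet",
]
print(drop_placeholder_blobs(listing))
```

The quadratic scan is fine for a few thousand paths; for very large datasets a single pass over the sorted list comparing neighbors would scale better.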
[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236702#comment-17236702 ] Lance Dacey commented on ARROW-10517: - Regarding partition_filename_cb, a common use in my pipelines is building a full date name from the partition folders:
{code:java}
year=2020/month=8/day=4
partition_filename_cb=lambda x: "-".join(str(y).zfill(2) for y in x) + ".parquet"
2020-08-04.parquet
{code}
I am doing this to address a "many small files" situation in a few scenarios. Perhaps there is a better way to go about it where this would not be necessary.

Scenario 1:
* I use turbodbc to query 6 different SQL servers every 30 minutes (48 schedules per date * 6) directly into pyarrow tables, which I then write to a partitioned dataset.
* This creates a lot of small files, which I then filter for and write to a separate dataset with partition_filename_cb to consolidate the data into a single daily file.

Scenario 2:
* I query for data every hour from some REST APIs (Zendesk and ServiceNow) for any tickets which have changed since my last query (based on the latest updated_at timestamp).
* I partition this data based on the created_at date. So we have a lot of small files due to the frequency of downloads, and a single download might contain tickets which were created_at in the past: at least 24 files * the number of unique dates which were updated.
* So again, I filter for any created_at partition which was changed in the last hour and rewrite a "final" consolidated version of the data in a separate dataset using partition_filename_cb, which is then used for downstream tasks and transformation. 
* Ultimately, I need to ensure that our visualizations/reports only display the latest version of each ticket even if it was updated a dozen times, so this step generally includes sorting the data and dropping duplicates on some unique constraints.

Both scenarios have tiny files each download interval or based on how I partition the data, but are pretty large overall (scenario 1 is over 500 million rows, and scenario 2 is over 70 million rows since March of this year). Maybe partition_filename_cb is not strictly required; it just seemed faster and more organized (under 300ms to read a single file compared to over 1 minute to filter for a day with 96 UUID filenames). Any best practices here to avoid the need for the partition_filename_cb function? > [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob > > > Key: ARROW-10517 > URL: https://issues.apache.org/jira/browse/ARROW-10517 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0 > Environment: Ubuntu 18.04 >Reporter: Lance Dacey >Priority: Major > Labels: azureblob, dataset, dataset-parquet-read, > dataset-parquet-write, fsspec > Fix For: 2.0.0 > > Attachments: ss.PNG, ss2.PNG > > > > {code:python} > # adal==1.2.5 > # adlfs==0.2.5 > # fsspec==0.7.4 > # pandas==1.1.3 > # pyarrow==2.0.0 > # azure-storage-blob==2.1.0 > # azure-storage-common==2.1.0 > import pyarrow.dataset as ds > import fsspec > from pyarrow.dataset import DirectoryPartitioning > fs = fsspec.filesystem(protocol='abfs', >account_name=base.login, >account_key=base.password) > ds.write_dataset(data=table, > base_dir="dev/test7", > basename_template=None, > format="parquet", > partitioning=DirectoryPartitioning(pa.schema([("year", > pa.string()), ("month", pa.string()), ("day", pa.string())])), > schema=table.schema, > filesystem=fs, > ) > {code} > I think this is due to early versions of adlfs having mkdir(). 
Although I > use write_to_dataset and write_table all of the time, so I am not sure why > this would be an issue. > {code:python} > --- > RuntimeError Traceback (most recent call last) > in > 13 > 14 > ---> 15 ds.write_dataset(data=table, > 16 base_dir="dev/test7", > 17 basename_template=None, > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in > write_dataset(data, base_dir, basename_template, format, partitioning, > schema, filesystem, file_options, use_threads) > 771 filesystem, _ = _ensure_fs(filesystem) > 772 > --> 773 _filesystemdataset_write( > 774 data, base_dir, basename_template, schema, > 775 filesystem, partitioning, file_options, use_threads, > /opt/conda/lib/pytho
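The callback above is easy to sketch in isolation: the legacy pq.write_to_dataset() passes the tuple of partition values to partition_filename_cb and uses the returned string as the file's basename. A standalone version of the lambda from this comment:

```python
def daily_filename(partition_keys):
    """Map partition values like (2020, 8, 4) to '2020-08-04.parquet',
    zero-padding month and day so filenames sort lexicographically."""
    return "-".join(str(k).zfill(2) for k in partition_keys) + ".parquet"

print(daily_filename((2020, 8, 4)))   # -> 2020-08-04.parquet
```

If I understand the legacy API correctly, it would be wired up as pq.write_to_dataset(table, root_path, partition_cols=["year", "month", "day"], partition_filename_cb=daily_filename), giving each day's partition one consolidated file.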
[jira] [Closed] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lance Dacey closed ARROW-10517. --- Fix Version/s: 2.0.0 Resolution: Later My issue is caused by another library (adlfs). Once this is fixed, this issue will not be relevant. https://github.com/dask/adlfs/issues/135 > [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob > > > Key: ARROW-10517 > URL: https://issues.apache.org/jira/browse/ARROW-10517 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0 > Environment: Ubuntu 18.04 >Reporter: Lance Dacey >Priority: Major > Labels: azureblob, dataset, dataset-parquet-read, > dataset-parquet-write, fsspec > Fix For: 2.0.0 > > Attachments: ss.PNG, ss2.PNG > > > > {code:python} > # adal==1.2.5 > # adlfs==0.2.5 > # fsspec==0.7.4 > # pandas==1.1.3 > # pyarrow==2.0.0 > # azure-storage-blob==2.1.0 > # azure-storage-common==2.1.0 > import pyarrow.dataset as ds > import fsspec > from pyarrow.dataset import DirectoryPartitioning > fs = fsspec.filesystem(protocol='abfs', >account_name=base.login, >account_key=base.password) > ds.write_dataset(data=table, > base_dir="dev/test7", > basename_template=None, > format="parquet", > partitioning=DirectoryPartitioning(pa.schema([("year", > pa.string()), ("month", pa.string()), ("day", pa.string())])), > schema=table.schema, > filesystem=fs, > ) > {code} > I think this is due to early versions of adlfs having mkdir(). Although I > use write_to_dataset and write_table all of the time, so I am not sure why > this would be an issue. 
> {code:python} > --- > RuntimeError Traceback (most recent call last) > in > 13 > 14 > ---> 15 ds.write_dataset(data=table, > 16 base_dir="dev/test7", > 17 basename_template=None, > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in > write_dataset(data, base_dir, basename_template, format, partitioning, > schema, filesystem, file_options, use_threads) > 771 filesystem, _ = _ensure_fs(filesystem) > 772 > --> 773 _filesystemdataset_write( > 774 data, base_dir, basename_template, schema, > 775 filesystem, partitioning, file_options, use_threads, > /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in > pyarrow._dataset._filesystemdataset_write() > /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in > pyarrow._fs._cb_create_dir() > /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, > path, recursive) > 226 def create_dir(self, path, recursive): > 227 # mkdir also raises FileNotFoundError when base directory is > not found > --> 228 self.fs.mkdir(path, create_parents=recursive) > 229 > 230 def delete_dir(self, path): > /opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, > delimiter, exists_ok, **kwargs) > 561 else: > 562 ## everything else > --> 563 raise RuntimeError(f"Cannot create > {container_name}{delimiter}{path}.") > 564 else: > 565 if container_name in self.ls("") and path: > RuntimeError: Cannot create dev/test7/2020/01/28. > {code} > > Next, if I try to read a dataset (keep in mind that this works with > read_table and ParquetDataset): > {code:python} > ds.dataset(source="dev/staging/evaluations", >format="parquet", >partitioning="hive", >exclude_invalid_files=False, >filesystem=fs > ) > {code} > > This doesn't seem to respect the filesystem connected to Azure Blob. 
> {code:python} > --- > FileNotFoundError Traceback (most recent call last) > in > > 1 ds.dataset(source="dev/staging/evaluations", > 2format="parquet", > 3partitioning="hive", > 4exclude_invalid_files=False, > 5filesystem=fs > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, > schema, format, filesystem, partitioning, partition_base_dir, > exclude_invalid_files, ignore_prefixes) > 669 # TODO(kszucs): support InMemoryDataset for a table input > 670 if _is_path_like(source): > --> 671 return _filesystem_dataset(source, **kwargs) > 672 elif isinstance(
[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236357#comment-17236357 ] Lance Dacey commented on ARROW-10517: - Thanks for your help. By adding **kwargs to the adlfs find() return, I was able to get ds.dataset features to work (read and write) with the latest version of adlfs. I am sure the library will be updated soon. Since I am stuck with azure-storage-blob SDK v2 in production, I have been using an old version of adlfs (0.2.5). I am unable to use write_dataset, but I am able to use write_to_dataset() with the legacy system. This error leads back to adlfs core.py in the mkdir function. I think I will close this issue now since write_to_dataset() works for my needs right now and it supports the _partition_filename_cb_ which I find useful. I will just wait until I can safely upgrade to the latest version of adlfs where I know it will work fine. {code:java} ds.write_dataset(data=table, base_dir="dev/test-write", format="parquet", partitioning=ds.DirectoryPartitioning(pyarrow.schema([("report_date", pyarrow.date32())])), filesystem=fs) --- RuntimeError Traceback (most recent call last) in > 1 ds.write_dataset(data=table, 2 base_dir="dev/test-write", 3 format="parquet", 4 partitioning=ds.DirectoryPartitioning(pyarrow.schema([("report_date", pyarrow.date32())])), 5 filesystem=fs) /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in write_dataset(data, base_dir, basename_template, format, partitioning, schema, filesystem, file_options, use_threads) 771 filesystem, _ = _ensure_fs(filesystem) 772 --> 773 _filesystemdataset_write( 774 data, base_dir, basename_template, schema, 775 filesystem, partitioning, file_options, use_threads, /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset._filesystemdataset_write() /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_create_dir() 
/opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, path, recursive) 226 def create_dir(self, path, recursive): 227 # mkdir also raises FileNotFoundError when base directory is not found --> 228 self.fs.mkdir(path, create_parents=recursive) 229 230 def delete_dir(self, path): /opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, delimiter, exists_ok, **kwargs) 561 else: 562 ## everything else --> 563 raise RuntimeError(f"Cannot create {container_name}{delimiter}{path}.") 564 else: 565 if container_name in self.ls("") and path: RuntimeError: Cannot create dev/test-write/2018-03-01. {code} > [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob > > > Key: ARROW-10517 > URL: https://issues.apache.org/jira/browse/ARROW-10517 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0 > Environment: Ubuntu 18.04 >Reporter: Lance Dacey >Priority: Major > Labels: azureblob, dataset, dataset-parquet-read, > dataset-parquet-write, fsspec > Attachments: ss.PNG, ss2.PNG > > > > {code:python} > # adal==1.2.5 > # adlfs==0.2.5 > # fsspec==0.7.4 > # pandas==1.1.3 > # pyarrow==2.0.0 > # azure-storage-blob==2.1.0 > # azure-storage-common==2.1.0 > import pyarrow.dataset as ds > import fsspec > from pyarrow.dataset import DirectoryPartitioning > fs = fsspec.filesystem(protocol='abfs', >account_name=base.login, >account_key=base.password) > ds.write_dataset(data=table, > base_dir="dev/test7", > basename_template=None, > format="parquet", > partitioning=DirectoryPartitioning(pa.schema([("year", > pa.string()), ("month", pa.string()), ("day", pa.string())])), > schema=table.schema, > filesystem=fs, > ) > {code} > I think this is due to early versions of adlfs having mkdir(). Although I > use write_to_dataset and write_table all of the time, so I am not sure why > this would be an issue. 
> {code:python} > --- > RuntimeError Traceback (most recent call last) > in > 13 > 14 > ---> 15 ds.write_dataset(data=table, > 16 base_dir=
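Given the mkdir() behavior described in ARROW-10694 (a path without a trailing slash becomes an empty blob, while one with a trailing slash becomes a real directory), normalizing paths before calling fs.mkdir() may sidestep errors like the one above; ensure_dir_path is a hypothetical helper, not part of adlfs:

```python
def ensure_dir_path(path):
    """Append the trailing slash that adlfs reportedly needs to create a
    real directory instead of a zero-byte blob."""
    return path if path.endswith("/") else path + "/"

print(ensure_dir_path("dev/test2"))   # -> dev/test2/
print(ensure_dir_path("dev/test2/"))  # unchanged
```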
[jira] [Comment Edited] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17236357#comment-17236357 ] Lance Dacey edited comment on ARROW-10517 at 11/20/20, 6:24 PM: Thanks for your help. By adding **kwargs to the adlfs find() return, I was able to get ds.dataset features to work (read and write) with the latest version of adlfs. I am sure the library will be updated soon: https://github.com/dask/adlfs/issues/135. Since I am stuck with azure-storage-blob SDK v2 in production, I have been using an old version of adlfs (0.2.5). I am unable to use write_dataset, but I am able to use write_to_dataset() with the legacy system. This error leads back to adlfs core.py in the mkdir function. I think I will close this issue now since write_to_dataset() works for my needs right now and it supports the _partition_filename_cb_ which I find useful. I will just wait until I can safely upgrade to the latest version of adlfs where I know it will work fine. 
{code:java} ds.write_dataset(data=table, base_dir="dev/test-write", format="parquet", partitioning=ds.DirectoryPartitioning(pyarrow.schema([("report_date", pyarrow.date32())])), filesystem=fs) --- RuntimeError Traceback (most recent call last) in > 1 ds.write_dataset(data=table, 2 base_dir="dev/test-write", 3 format="parquet", 4 partitioning=ds.DirectoryPartitioning(pyarrow.schema([("report_date", pyarrow.date32())])), 5 filesystem=fs) /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in write_dataset(data, base_dir, basename_template, format, partitioning, schema, filesystem, file_options, use_threads) 771 filesystem, _ = _ensure_fs(filesystem) 772 --> 773 _filesystemdataset_write( 774 data, base_dir, basename_template, schema, 775 filesystem, partitioning, file_options, use_threads, /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset._filesystemdataset_write() /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_create_dir() /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, path, recursive) 226 def create_dir(self, path, recursive): 227 # mkdir also raises FileNotFoundError when base directory is not found --> 228 self.fs.mkdir(path, create_parents=recursive) 229 230 def delete_dir(self, path): /opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, delimiter, exists_ok, **kwargs) 561 else: 562 ## everything else --> 563 raise RuntimeError(f"Cannot create {container_name}{delimiter}{path}.") 564 else: 565 if container_name in self.ls("") and path: RuntimeError: Cannot create dev/test-write/2018-03-01. {code}
[jira] [Comment Edited] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235477#comment-17235477 ] Lance Dacey edited comment on ARROW-10517 at 11/20/20, 8:26 AM: Yeah, I can open an issue there. https://github.com/dask/adlfs/issues/135 I think that this might be the major issue I am facing with v12 Azure Blob SDK. I cannot read a dataset because I get a list of files returned instead of a dictionary (but I am able to write a dataset). I think I might have to open some fsspec issues as well because mkdir is creating those empty files instead of a directory which doesn't seem right. Also ran into an issue with read_table(use_legacy_dataset=True) where data was trying to be read from the wrong partition with a similar name "domain=tnt" and "domain=tntplus". So it looks like perhaps only the prefix was being used to list the files. edit: {code:java} fs.info("dev/testing10/evaluations") {'name': 'dev/testing10/evaluations/', 'size': 0, 'type': 'directory'} {code}
> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob > > > Key: ARROW-10517 > URL: https://issues.apache.org/jira/browse/ARROW-10517 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0 > Environment: Ubuntu 18.04 >Reporter: Lance Dacey >Priority: Major > Labels: azureblob, dataset, dataset-parquet-read, > dataset-parquet-write, fsspec > Attachments: ss.PNG, ss2.PNG > > > > {code:python} > # adal==1.2.5 > # adlfs==0.2.5 > # fsspec==0.7.4 > # pandas==1.1.3 > # pyarrow==2.0.0 > # azure-storage-blob==2.1.0 > # azure-storage-common==2.1.0 > import pyarrow.dataset as ds > import fsspec > from pyarrow.dataset import DirectoryPartitioning > fs = fsspec.filesystem(protocol='abfs', >account_name=base.login, >account_key=base.password) > ds.write_dataset(data=table, > base_dir="dev/test7", > basename_template=None, > format="parquet", > partitioning=DirectoryPartitioning(pa.schema([("year", > pa.string()), ("month", pa.string()), ("day", pa.string())])), > schema=table.schema, > filesystem=fs, > ) > {code} > I think this is due to early versions of adlfs having mkdir(). Although I > use write_to_dataset and write_table all of the time, so I am not sure why > this would be an issue. 
> {code:python} > --- > RuntimeError Traceback (most recent call last) > in > 13 > 14 > ---> 15 ds.write_dataset(data=table, > 16 base_dir="dev/test7", > 17 basename_template=None, > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in > write_dataset(data, base_dir, basename_template, format, partitioning, > schema, filesystem, file_options, use_threads) > 771 filesystem, _ = _ensure_fs(filesystem) > 772 > --> 773 _filesystemdataset_write( > 774 data, base_dir, basename_template, schema, > 775 filesystem, partitioning, file_options, use_threads, > /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in > pyarrow._dataset._filesystemdataset_write() > /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in > pyarrow._fs._cb_create_dir() > /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, > path, recursive) > 226 def create_dir(self, path, recursive): > 227 # mkdir also raises FileNotFoundError when base directory is > not found > --> 228 self.fs.mkdir(path, create_parents=recursive) > 229 > 230 def delete_dir(self, path): > /opt/conda
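The "domain=tnt" vs "domain=tntplus" mix-up described in the comment above is a classic prefix-matching bug: filtering blob names with a bare startswith check also matches sibling partitions that share the prefix. Anchoring the match on the path delimiter avoids it. A self-contained sketch (paths are illustrative):

```python
files = [
    "dev/ds/domain=tnt/part-0.parquet",
    "dev/ds/domain=tntplus/part-0.parquet",
]

prefix = "dev/ds/domain=tnt"

# Naive prefix match: wrongly pulls in domain=tntplus as well
naive = [f for f in files if f.startswith(prefix)]

# Requiring the delimiter restricts the match to the exact partition
exact = [f for f in files if f.startswith(prefix.rstrip("/") + "/")]

print(naive)  # both files
print(exact)  # only the domain=tnt file
```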
[jira] [Comment Edited] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235477#comment-17235477 ] Lance Dacey edited comment on ARROW-10517 at 11/19/20, 1:48 PM: Yeah, I can open an issue there. I hopefully am not using an old version. I installed miniconda and then used the environment files to make sure that adlfs is the recent version. And I print the module versions in the script so everything should be aligned. I think I might have to open some fsspec issues as well because mkdir is creating those empty files instead of a directory which doesn't seem right. Also ran into an issue with read_table(use_legacy_dataset=True) where data was trying to be read from the wrong partition with a similar name "domain=tnt" and "domain=tntplus". So it looks like perhaps only the prefix was being used to list the files. edit: {code:java} fs.info("dev/testing10/evaluations") {'name': 'dev/testing10/evaluations/', 'size': 0, 'type': 'directory'} {code} 
> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob > > > Key: ARROW-10517 > URL: https://issues.apache.org/jira/browse/ARROW-10517 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0 > Environment: Ubuntu 18.04 >Reporter: Lance Dacey >Priority: Major > Labels: azureblob, dataset, dataset-parquet-read, > dataset-parquet-write, fsspec > Attachments: ss.PNG, ss2.PNG > > > > {code:python} > # adal==1.2.5 > # adlfs==0.2.5 > # fsspec==0.7.4 > # pandas==1.1.3 > # pyarrow==2.0.0 > # azure-storage-blob==2.1.0 > # azure-storage-common==2.1.0 > import pyarrow.dataset as ds > import fsspec > from pyarrow.dataset import DirectoryPartitioning > fs = fsspec.filesystem(protocol='abfs', >account_name=base.login, >account_key=base.password) > ds.write_dataset(data=table, > base_dir="dev/test7", > basename_template=None, > format="parquet", > partitioning=DirectoryPartitioning(pa.schema([("year", > pa.string()), ("month", pa.string()), ("day", pa.string())])), > schema=table.schema, > filesystem=fs, > ) > {code} > I think this is due to early versions of adlfs having mkdir(). Although I > use write_to_dataset and write_table all of the time, so I am not sure why > this would be an issue. 
> {code:python} > --- > RuntimeError Traceback (most recent call last) > in > 13 > 14 > ---> 15 ds.write_dataset(data=table, > 16 base_dir="dev/test7", > 17 basename_template=None, > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in > write_dataset(data, base_dir, basename_template, format, partitioning, > schema, filesystem, file_options, use_threads) > 771 filesystem, _ = _ensure_fs(filesystem) > 772 > --> 773 _filesystemdataset_write( > 774 data, base_dir, basename_template, schema, > 775 filesystem, partitioning, file_options, use_threads, > /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in > pyarrow._dataset._filesystemdataset_write() > /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in > pyarrow._fs._cb_create_dir() > /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, > path, recursive) > 226 def create_dir(self, path, recursive): > 227 # mkdir also raises FileNotFoundError when base directory is > not found > --> 228 self.fs.mkdir(path, create_parents=recursive) > 229 > 230 def delete_dir(self, path): > /opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, > delimiter, exists_ok, **kwargs) > 561 else: > 562 ## everything else > --> 563
[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235477#comment-17235477 ] Lance Dacey commented on ARROW-10517: - Yeah, I can open an issue there. I hopefully am not using an old version. I installed miniconda and then used the environment files to make sure that adlfs is the recent version. And I print the module versions in the script so everything should be aligned. I think I might have to open some fsspec issues as well because mkdir is creating those empty files instead of a directory which doesn't seem right. Also ran into an issue with read_table(use_legacy_dataset=True) where data was trying to be read from the wrong partition with a similar name "domain=tnt" and "domain=tntplus". So it looks like perhaps only the prefix was being used to list the files. > [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob > > > Key: ARROW-10517 > URL: https://issues.apache.org/jira/browse/ARROW-10517 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0 > Environment: Ubuntu 18.04 >Reporter: Lance Dacey >Priority: Major > Labels: azureblob, dataset, dataset-parquet-read, > dataset-parquet-write, fsspec > Attachments: ss.PNG, ss2.PNG > > > > {code:python} > # adal==1.2.5 > # adlfs==0.2.5 > # fsspec==0.7.4 > # pandas==1.1.3 > # pyarrow==2.0.0 > # azure-storage-blob==2.1.0 > # azure-storage-common==2.1.0 > import pyarrow.dataset as ds > import fsspec > from pyarrow.dataset import DirectoryPartitioning > fs = fsspec.filesystem(protocol='abfs', >account_name=base.login, >account_key=base.password) > ds.write_dataset(data=table, > base_dir="dev/test7", > basename_template=None, > format="parquet", > partitioning=DirectoryPartitioning(pa.schema([("year", > pa.string()), ("month", pa.string()), ("day", pa.string())])), > schema=table.schema, > filesystem=fs, > ) > {code} > I think this is due to early versions of adlfs having mkdir(). 
Although I > use write_to_dataset and write_table all of the time, so I am not sure why > this would be an issue. > {code:python} > --- > RuntimeError Traceback (most recent call last) > in > 13 > 14 > ---> 15 ds.write_dataset(data=table, > 16 base_dir="dev/test7", > 17 basename_template=None, > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in > write_dataset(data, base_dir, basename_template, format, partitioning, > schema, filesystem, file_options, use_threads) > 771 filesystem, _ = _ensure_fs(filesystem) > 772 > --> 773 _filesystemdataset_write( > 774 data, base_dir, basename_template, schema, > 775 filesystem, partitioning, file_options, use_threads, > /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in > pyarrow._dataset._filesystemdataset_write() > /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in > pyarrow._fs._cb_create_dir() > /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, > path, recursive) > 226 def create_dir(self, path, recursive): > 227 # mkdir also raises FileNotFoundError when base directory is > not found > --> 228 self.fs.mkdir(path, create_parents=recursive) > 229 > 230 def delete_dir(self, path): > /opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, > delimiter, exists_ok, **kwargs) > 561 else: > 562 ## everything else > --> 563 raise RuntimeError(f"Cannot create > {container_name}{delimiter}{path}.") > 564 else: > 565 if container_name in self.ls("") and path: > RuntimeError: Cannot create dev/test7/2020/01/28. > {code} > > Next, if I try to read a dataset (keep in mind that this works with > read_table and ParquetDataset): > {code:python} > ds.dataset(source="dev/staging/evaluations", >format="parquet", >partitioning="hive", >exclude_invalid_files=False, >filesystem=fs > ) > {code} > > This doesn't seem to respect the filesystem connected to Azure Blob. 
> {code:python} > --- > FileNotFoundError Traceback (most recent call last) > in > > 1 ds.dataset(source="dev/staging/evaluations", > 2 format="parquet", > 3 partitioning="hive", > 4 exclude_invalid_files=False, > 5 filesystem=fs > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, > schema, format, filesystem, partitioning, partition_base_dir, > exclude_invalid_files, ignore_prefixes) > 669 # TODO(kszucs): support InMemoryDataset for a table input > 670 if _is_path_like(source): > --> 671 return _filesystem_dataset(source, **kwargs) > 672 elif isinstance(source, (tuple, list)): > 673 if all(_is_path_like(elem) for elem in source): > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in > _filesystem_dataset(source, schema, filesystem, partitioning, format, > partition_base_dir, exclude_invalid_files, selector_ignore_prefixes) > 426 fs, paths_or_selector = _ensure_multiple_sources(source, filesystem) > 427 else: > --> 428 fs, paths_or_selector = _ensure_single_source(source, filesystem) > 429 > 430 options = FileSystemFactoryOptions( > /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _ensure_single_source(path, filesystem) > 402 paths_or_selector = [path] > 403 else: > --> 404 raise FileNotFoundError(path) > 405 > 406 return filesystem, paths_or_selector > FileNotFoundError: dev/staging/evaluations > {code}
[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235388#comment-17235388 ] Lance Dacey commented on ARROW-10517: - Latest adlfs (0.5.5): This actually creates the test.parquet file as well, not just the directory: {code:java} fs.mkdir("dev/test999/2020/01/28/test.parquet", create_parents=True) {code} And if I try to run the same line again, it fails because the partition exists: {code:python} --- StorageErrorException: Operation returned an invalid status 'The specified blob already exists.' During handling of the above exception, another exception occurred: ResourceExistsError Traceback (most recent call last) /c/airflow/test.py in > 6 fs.mkdir("dev/test999/2020/01/28/test.parquet", create_parents=True) ~/miniconda3/envs/airflow/lib/python3.8/site-packages/adlfs/spec.py in mkdir(self, path, delimiter, exist_ok, **kwargs) 880 881 def mkdir(self, path, delimiter="/", exist_ok=False, **kwargs): --> 882 maybe_sync(self._mkdir, self, path, delimiter, exist_ok) 883 884 async def _mkdir(self, path, delimiter="/", exist_ok=False, **kwargs): ~/miniconda3/envs/airflow/lib/python3.8/site-packages/fsspec/asyn.py in maybe_sync(func, self, *args, **kwargs) 98 if inspect.iscoroutinefunction(func): 99 # run the awaitable on the loop --> 100 return sync(loop, func, *args, **kwargs) 101 else: 102 # just call the blocking function ~/miniconda3/envs/airflow/lib/python3.8/site-packages/fsspec/asyn.py in sync(loop, func, callback_timeout, *args, **kwargs) 69 if error[0]: 70 typ, exc, tb = error[0] ---> 71 raise exc.with_traceback(tb) 72 else: 73 return result[0] ~/miniconda3/envs/airflow/lib/python3.8/site-packages/fsspec/asyn.py in f() 53 if callback_timeout is not None: 54 future = asyncio.wait_for(future, callback_timeout) ---> 55 result[0] = await future 56 except Exception: 57 error[0] = sys.exc_info() ~/miniconda3/envs/airflow/lib/python3.8/site-packages/adlfs/spec.py in _mkdir(self, path, 
delimiter, exist_ok, **kwargs) 918 container=container_name 919 ) --> 920 await container_client.upload_blob(name=path, data="") 921 else: 922 ## everything else ~/miniconda3/envs/airflow/lib/python3.8/site-packages/azure/core/tracing/decorator_async.py in wrapper_use_tracer(*args, **kwargs) 72 span_impl_type = settings.tracing_implementation() 73 if span_impl_type is None: ---> 74 return await func(*args, **kwargs) 75 76 # Merge span is parameter is set, but only if no explicit parent are passed ~/miniconda3/envs/airflow/lib/python3.8/site-packages/azure/storage/blob/aio/_container_client_async.py in upload_blob(self, name, data, blob_type, length, metadata, **kwargs) 715 timeout = kwargs.pop('timeout', None) 716 encoding = kwargs.pop('encoding', 'UTF-8') --> 717 await blob.upload_blob( 718 data, 719 blob_type=blob_type, ~/miniconda3/envs/airflow/lib/python3.8/site-packages/azure/core/tracing/decorator_async.py in wrapper_use_tracer(*args, **kwargs) 72 span_impl_type = settings.tracing_implementation() 73 if span_impl_type is None: ---> 74 return await func(*args, **kwargs) 75 76 # Merge span is parameter is set, but only if no explicit parent are passed ~/miniconda3/envs/airflow/lib/python3.8/site-packages/azure/storage/blob/aio/_blob_client_async.py in upload_blob(self, data, blob_type, length, metadata, **kwargs) 267 **kwargs) 268 if blob_type == BlobType.BlockBlob: --> 269 return await upload_block_blob(**options) 270 if blob_type == BlobType.PageBlob: 271 return await upload_page_blob(**options) ~/miniconda3/envs/airflow/lib/python3.8/site-packages/azure/storage/blob/aio/_upload_helpers.py in upload_block_blob(client, data, stream, length, overwrite, headers, validate_content, max_concurrency, blob_settings, encryption_options, **kwargs) 131 except StorageErrorException as error: 132 try: --> 133 process_storage_error(error) 134 except ResourceModifiedError as mod_error: 135 if not overwrite: 
~/miniconda3/envs/airflow/lib/python3.8/site-packages/azure/storage/blob/_shared/response_handlers.py in process_storage_error(storage_error) 145 error.error_code = error_code 146 error
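The exists-then-create guard below is a minimal sketch of how to sidestep the ResourceExistsError above once the placeholder blob is already there. It is demonstrated on fsspec's in-memory filesystem so it runs anywhere; applying the same `exists()`/`makedirs()` calls to an `abfs` filesystem built with real credentials is an assumption, not something tested here.

```python
import fsspec

# Sketch: adlfs around 0.5.x implements mkdir() by uploading an empty
# placeholder blob, so repeating the call raises ResourceExistsError.
# Guarding with exists() makes directory creation idempotent.
fs = fsspec.filesystem("memory")  # stand-in for an "abfs" filesystem

path = "/dev/test999/2020/01/28"
if not fs.exists(path):
    fs.makedirs(path)

# Running the same guard again is now a no-op instead of an error.
if not fs.exists(path):
    fs.makedirs(path)
```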
[jira] [Updated] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lance Dacey updated ARROW-10517: Attachment: ss2.PNG
[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235306#comment-17235306 ] Lance Dacey commented on ARROW-10517: - !ss.PNG! Added a screenshot of the results of the mkdir command. I am not sure why it created a file for the 28 partition, but it looks like that is what happened. mkdir is failing on my production environment because I am stuck on old versions of adlfs and fsspec (bound to the azure-storage-blob v2 SDK; I cannot upgrade to v12 because of dependencies in Airflow, which is what runs all of my pyarrow tasks in the first place). What I don't understand is why write_to_dataset (the legacy version) works without any issues while the write_dataset method fails. Is the filesystem implementation different? Both should be using adlfs and fsspec in my case on Azure Blob, so it seems strange that one method successfully creates the directories and partitions while the other fails (which is why I raised this as a pyarrow issue). 
[jira] [Updated] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lance Dacey updated ARROW-10517: Attachment: ss.PNG
[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235277#comment-17235277 ] Lance Dacey commented on ARROW-10517: - This works on my local conda environment (dependencies posted on my last edit, using the latest version of fsspec and adlfs). The "28" partition was a file instead of a folder in this case. {code:python} fs.mkdir("dev/test7/2020/01/28", create_parents=True) {code} If I run the same code on my production environment it fails. I am using this environment with read_table and write_to_dataset often though. {code:python} name: old channels: - conda-forge - defaults dependencies: - python=3.8 - azure-storage-blob=2 - pandas=1.1 - pyarrow=2 - pip=20.2 - pip: - adlfs==0.2.5 - fsspec==0.7.4 --- RuntimeError Traceback (most recent call last) in > 1 fs.mkdir("dev/test8/2020/01/28", create_parents=True) /opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, delimiter, exists_ok, **kwargs) 561 else: 562 ## everything else --> 563 raise RuntimeError(f"Cannot create {container_name}{delimiter}{path}.") 564 else: 565 if container_name in self.ls("") and path: RuntimeError: Cannot create dev/test8/2020/01/28. {code} However, the dataset read function now works and it supports the row level filtering which is great (the dataset below is over 65 million rows and I am able to filter quickly for specific IDs across multiple files in under 2 seconds): {code:java} dataset = ds.dataset(source=ds_path, format="parquet", partitioning="hive", exclude_invalid_files=False, filesystem=fs) len(dataset.files) 1050 table = dataset.to_table(columns=None, filter= (ds.field("year") == "2020") & (ds.field("month") == "11") & (ds.field("day") > "10") & (ds.field("id") == "102648")) {code} But I cannot use write_dataset (along with the new partitioning features), unfortunately. 
[jira] [Comment Edited] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235084#comment-17235084 ] Lance Dacey edited comment on ARROW-10517 at 11/19/20, 7:44 AM: Added an edit with the results of pure fsspec and adlfs find() commands against a dataset I created with pyarrow. For some reason, a list is being output although I am using the latest version of each library. I checked the versions by doing a conda list, and then inside of the notebook I ran: {code:java} print('\n'.join(f'{m.__name__} {m.__version__}' for m in globals().values() if getattr(m, '__version__', None))) {code} A separate attempt on my laptop locally using a fresh env file: {code:java} name: airflow channels: - conda-forge - defaults dependencies: - python=3.8 - azure-storage-blob=12 - pandas=1.1 - pyarrow=2 - adlfs=0.5 ~/miniconda3/envs/airflow/lib/python3.8/site-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes) 434 selector_ignore_prefixes=selector_ignore_prefixes 435 ) --> 436 factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options) 437 438 return factory.finish(schema) ~/miniconda3/envs/airflow/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.FileSystemDatasetFactory.__init__() ~/miniconda3/envs/airflow/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status() ~/miniconda3/envs/airflow/lib/python3.8/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_get_file_info_selector() ~/miniconda3/envs/airflow/lib/python3.8/site-packages/pyarrow/fs.py in get_file_info_selector(self, selector) 219 selector.base_dir, maxdepth=maxdepth, withdirs=True, detail=True 220 ) --> 221 for path, info in selected_files.items(): 222 infos.append(self._create_file_info(path, info)) 223 AttributeError: 'list' object has no attribute 'items' {code} was (Author: 
ldacey): Added an edit with the results of pure fsspec and adlfs find() commands against a dataset I created with pyarrow. For some reason, a list is being output although I am using the latest version of each library. I checked the versions by doing a conda list, and then inside of the notebook I ran: {code:java} print('\n'.join(f'{m.__name__} {m.__version__}' for m in globals().values() if getattr(m, '__version__', None))) {code}
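The AttributeError above indicates that pyarrow's fsspec handler expects `find(..., detail=True)` to return a `{path: info}` mapping, while this adlfs build returned a bare list of paths. A purely illustrative shim (the helper name and the minimal info fields are my own, not part of either library) shows the shape pyarrow iterates over:

```python
def normalize_find(selected_files):
    """Coerce find() output into the {path: info} mapping that
    pyarrow's get_file_info_selector() iterates with .items()."""
    if isinstance(selected_files, dict):
        return selected_files
    # Fabricate minimal info records for bare path strings.
    return {
        path: {"name": path, "type": "file", "size": None}
        for path in selected_files
    }

# A list (the buggy shape) and a dict (the expected shape) both normalize.
assert normalize_find(["a/b.parquet"]) == {
    "a/b.parquet": {"name": "a/b.parquet", "type": "file", "size": None}
}
assert normalize_find({"x": {"type": "file"}}) == {"x": {"type": "file"}}
```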
[jira] [Updated] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lance Dacey updated ARROW-10517: Description: {code:python} # adal==1.2.5 # adlfs==0.2.5 # fsspec==0.7.4 # pandas==1.1.3 # pyarrow==2.0.0 # azure-storage-blob==2.1.0 # azure-storage-common==2.1.0 import pyarrow.dataset as ds import fsspec from pyarrow.dataset import DirectoryPartitioning fs = fsspec.filesystem(protocol='abfs', account_name=base.login, account_key=base.password) ds.write_dataset(data=table, base_dir="dev/test7", basename_template=None, format="parquet", partitioning=DirectoryPartitioning(pa.schema([("year", pa.string()), ("month", pa.string()), ("day", pa.string())])), schema=table.schema, filesystem=fs, ) {code} I think this is due to early versions of adlfs having mkdir(). Although I use write_to_dataset and write_table all of the time, so I am not sure why this would be an issue. {code:python} --- RuntimeError Traceback (most recent call last) in 13 14 ---> 15 ds.write_dataset(data=table, 16 base_dir="dev/test7", 17 basename_template=None, /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in write_dataset(data, base_dir, basename_template, format, partitioning, schema, filesystem, file_options, use_threads) 771 filesystem, _ = _ensure_fs(filesystem) 772 --> 773 _filesystemdataset_write( 774 data, base_dir, basename_template, schema, 775 filesystem, partitioning, file_options, use_threads, /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset._filesystemdataset_write() /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_create_dir() /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, path, recursive) 226 def create_dir(self, path, recursive): 227 # mkdir also raises FileNotFoundError when base directory is not found --> 228 self.fs.mkdir(path, create_parents=recursive) 229 230 def delete_dir(self, path): 
/opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, delimiter, exists_ok, **kwargs) 561 else: 562 ## everything else --> 563 raise RuntimeError(f"Cannot create {container_name}{delimiter}{path}.") 564 else: 565 if container_name in self.ls("") and path: RuntimeError: Cannot create dev/test7/2020/01/28. {code} Next, if I try to read a dataset (keep in mind that this works with read_table and ParquetDataset): {code:python} ds.dataset(source="dev/staging/evaluations", format="parquet", partitioning="hive", exclude_invalid_files=False, filesystem=fs ) {code} This doesn't seem to respect the filesystem connected to Azure Blob. {code:python} --- FileNotFoundError Traceback (most recent call last) in > 1 ds.dataset(source="dev/staging/evaluations", 2format="parquet", 3partitioning="hive", 4exclude_invalid_files=False, 5filesystem=fs /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes) 669 # TODO(kszucs): support InMemoryDataset for a table input 670 if _is_path_like(source): --> 671 return _filesystem_dataset(source, **kwargs) 672 elif isinstance(source, (tuple, list)): 673 if all(_is_path_like(elem) for elem in source): /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes) 426 fs, paths_or_selector = _ensure_multiple_sources(source, filesystem) 427 else: --> 428 fs, paths_or_selector = _ensure_single_source(source, filesystem) 429 430 options = FileSystemFactoryOptions( /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _ensure_single_source(path, filesystem) 402 paths_or_selector = [path] 403 else: --> 404 raise FileNotFoundError(path) 405 406 return filesystem, paths_or_selector FileNotFoundError: dev/staging/evaluations {code} This *does* work though when I list the 
blobs before passing them to ds.dataset: {code:python} blobs = wasb.list_blobs(container_name="dev", prefix="stag
[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17235084#comment-17235084 ] Lance Dacey commented on ARROW-10517: - Added an edit with the results of pure fsspec and adlfs find() commands against a dataset I created with pyarrow. For some reason, a list is being output although I am using the latest version of each library. I checked the versions by doing a conda list, and then inside of the notebook I ran: {code:java} print('\n'.join(f'{m.__name__} {m.__version__}' for m in globals().values() if getattr(m, '__version__', None))) {code} > [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob > > > Key: ARROW-10517 > URL: https://issues.apache.org/jira/browse/ARROW-10517 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 2.0.0 > Environment: Ubuntu 18.04 >Reporter: Lance Dacey >Priority: Major > Labels: azureblob, dataset, dataset-parquet-read, > dataset-parquet-write, fsspec > > > {code:python} > # adal==1.2.5 > # adlfs==0.2.5 > # fsspec==0.7.4 > # pandas==1.1.3 > # pyarrow==2.0.0 > # azure-storage-blob==2.1.0 > # azure-storage-common==2.1.0 > import pyarrow.dataset as ds > import fsspec > from pyarrow.dataset import DirectoryPartitioning > fs = fsspec.filesystem(protocol='abfs', >account_name=base.login, >account_key=base.password) > ds.write_dataset(data=table, > base_dir="dev/test7", > basename_template=None, > format="parquet", > partitioning=DirectoryPartitioning(pa.schema([("year", > pa.string()), ("month", pa.string()), ("day", pa.string())])), > schema=table.schema, > filesystem=fs, > ) > {code} > I think this is due to early versions of adlfs having mkdir(). Although I > use write_to_dataset and write_table all of the time, so I am not sure why > this would be an issue. 
> {code:python}
> ---------------------------------------------------------------------------
> RuntimeError                              Traceback (most recent call last)
> in
>      13
>      14
> ---> 15 ds.write_dataset(data=table,
>      16                  base_dir="dev/test7",
>      17                  basename_template=None,
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in write_dataset(data, base_dir, basename_template, format, partitioning, schema, filesystem, file_options, use_threads)
>     771     filesystem, _ = _ensure_fs(filesystem)
>     772
> --> 773     _filesystemdataset_write(
>     774         data, base_dir, basename_template, schema,
>     775         filesystem, partitioning, file_options, use_threads,
> /opt/conda/lib/python3.8/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset._filesystemdataset_write()
> /opt/conda/lib/python3.8/site-packages/pyarrow/_fs.pyx in pyarrow._fs._cb_create_dir()
> /opt/conda/lib/python3.8/site-packages/pyarrow/fs.py in create_dir(self, path, recursive)
>     226     def create_dir(self, path, recursive):
>     227         # mkdir also raises FileNotFoundError when base directory is not found
> --> 228         self.fs.mkdir(path, create_parents=recursive)
>     229
>     230     def delete_dir(self, path):
> /opt/conda/lib/python3.8/site-packages/adlfs/core.py in mkdir(self, path, delimiter, exists_ok, **kwargs)
>     561         else:
>     562             ## everything else
> --> 563             raise RuntimeError(f"Cannot create {container_name}{delimiter}{path}.")
>     564         else:
>     565             if container_name in self.ls("") and path:
> RuntimeError: Cannot create dev/test7/2020/01/28.
> {code}
>
> Next, if I try to read a dataset (keep in mind that this works with read_table and ParquetDataset):
> {code:python}
> ds.dataset(source="dev/staging/evaluations",
>            format="parquet",
>            partitioning="hive",
>            exclude_invalid_files=False,
>            filesystem=fs
> )
> {code}
>
> This doesn't seem to respect the filesystem connected to Azure Blob.
> {code:python}
> ---------------------------------------------------------------------------
> FileNotFoundError                         Traceback (most recent call last)
> in
> ----> 1 ds.dataset(source="dev/staging/evaluations",
>       2            format="parquet",
>       3            partitioning="hive",
>       4            exclude_invalid_files=False,
>       5            filesystem=fs
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in dataset(source, schema, format, filesystem, partitioning, partition_base_dir, exclude_invalid_files, ignore_prefixes)
>     669     # TODO(kszucs): support InMemoryDataset for a table input
>     670     if _is_path_like(source):
> --> 671         return _filesystem_dataset(source, **kwargs)
>     672     elif isinstance(source, (tuple, list)):
>     673         if all(_is_path_like(elem) for elem in source):
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _filesystem_dataset(source, schema, filesystem, partitioning, format, partition_base_dir, exclude_invalid_files, selector_ignore_prefixes)
>     426         fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
>     427     else:
> --> 428         fs, paths_or_selector = _ensure_single_source(source, filesystem)
>     429
>     430     options = FileSystemFactoryOptions(
> /opt/conda/lib/python3.8/site-packages/pyarrow/dataset.py in _ensure_single_source(path, filesystem)
>     402         paths_or_selector = [path]
>     403     else:
> --> 404         raise FileNotFoundError(path)
>     405
>     406     return filesystem, paths_or_selector
> FileNotFoundError: dev/staging/evaluations
> {code}
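Azure Blob, like most object stores, has no real directories, so adlfs's mkdir() can fail even though nothing actually needs to be created. One possible workaround, sketched here as a hypothetical shim (not part of adlfs, fsspec, or pyarrow), is to make mkdir tolerant so that pyarrow's create_dir callback does not abort the write:

```python
def tolerate_mkdir(fs):
    """Wrap an fsspec-style filesystem so that mkdir failures are ignored.

    Hypothetical shim: blob stores do not need directory markers, so a
    failed mkdir during a dataset write is usually harmless.
    """
    original = fs.mkdir

    def safe_mkdir(path, create_parents=True, **kwargs):
        try:
            original(path, create_parents=create_parents, **kwargs)
        except (RuntimeError, FileExistsError):
            pass  # treat directory creation as a no-op on blob storage

    fs.mkdir = safe_mkdir
    return fs
```

One would call tolerate_mkdir(fs) before ds.write_dataset(...); whether this is enough depends on the adlfs version, since later releases changed the mkdir semantics.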
[jira] [Updated] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lance Dacey updated ARROW-10517: Description: (same description and tracebacks as quoted above)
[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231903#comment-17231903 ] Lance Dacey commented on ARROW-10517: - Hello - let me know if my edit covers it. Previously I did have some tests for the azure-blob v12 SDK, but I cannot use that in production anyway right now (apache-airflow requirements), so I am stuck with adlfs 0.2.5 I think.
[jira] [Updated] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lance Dacey updated ARROW-10517: Description: If I downgrade adlfs to 0.2.5 and azure-blob-storage to 2.1, and then upgrade fsspec (0.6.2 has errors with a detail kwarg, so I need to upgrade it), I hit the same mkdir() RuntimeError and ds.dataset() FileNotFoundError quoted above.
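Pinning down exactly which adlfs/fsspec/azure-storage-blob combination is installed matters when reproducing this. A small stdlib-only helper (hypothetical, just for illustration) that reports versions without relying on what happens to be in globals():

```python
import importlib

def report_versions(module_names):
    """Return 'name version' lines for each module, noting missing ones.

    Hypothetical helper for pinning down environment details.
    """
    lines = []
    for name in module_names:
        try:
            mod = importlib.import_module(name)
            lines.append(f"{name} {getattr(mod, '__version__', 'unknown')}")
        except ImportError:
            lines.append(f"{name} not installed")
    return lines

# e.g. report_versions(["fsspec", "adlfs", "pyarrow", "azure.storage.blob"])
```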
[jira] [Updated] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lance Dacey updated ARROW-10517: Description: (same description, tracebacks, and blob-listing workaround as quoted above)
[jira] [Commented] (ARROW-10517) [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
[ https://issues.apache.org/jira/browse/ARROW-10517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17228257#comment-17228257 ] Lance Dacey commented on ARROW-10517: - + [~mdurant] and [~jorisvandenbossche] You guys helped me with a similar issue before. There seems to be some incompatibility between fsspec and the new pyarrow.dataset feature. If I upgrade adlfs and the azure blob SDK, then it looks like fs.find() is returning a list instead of the dictionary that pyarrow expects. If I downgrade adlfs to use SDK v2.1, then I get the correct dictionary that pyarrow expects, but there does not seem to be a method for mkdir (which is required). Is there a way for me to get this to work? I tried tweaking the installed versions of fsspec, adlfs, and azure-storage-blob but I could not find a combination that worked.
> [Python] Unable to read/write Parquet datasets with fsspec on Azure Blob
>
> Key: ARROW-10517
> URL: https://issues.apache.org/jira/browse/ARROW-10517
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 2.0.0
> Environment: Ubuntu 18.04
> Reporter: Lance Dacey
> Priority: Major
> Labels: azureblob, dataset, dataset-parquet-read, dataset-parquet-write, fsspec
>
> If I downgrade adlfs to 0.2.5 and azure-blob-storage to 2.1, and then upgrade fsspec (0.6.2 has errors with a detail kwarg, so I need to upgrade it):
>
> {code:python}
> pa.dataset.write_dataset(data=table,
>                          base_dir="test/test7",
>                          basename_template=None,
>                          format="parquet",
>                          partitioning=DirectoryPartitioning(pa.schema([("year", pa.int64()), ("month", pa.int16()), ("day", pa.int16())])),
>                          schema=table.schema,
>                          filesystem=blob_fs){code}
>
> {code:python}
> 226 def create_dir(self, path, recursive):
> 227     # mkdir also raises FileNotFoundError when base directory is not found
> --> 228 self.fs.mkdir(path, create_parents=recursive){code}
>
> It does not look like there is a mkdir option.
However, the output of > fs.find() returns a dictionary as expected: > {code:java} > selected_files = blob_fs.find( > "test/test6", maxdepth=None, withdirs=True, detail=True > ){code} > > Now if I install the latest version of adlfs it upgrades my blob SDK to 12.5 > (unfortunately, I cannot use this in production since Airflow requires 2.1, > so this is only for testing purposes): > {code:java} > Successfully installed adlfs-0.5.5 azure-storage-blob-12.5.0{code} > > Now fs.find() returns a list, but I am able to use fs.mkdir(). > {code:java} > ['test/test6/year=2020', > 'test/test6/year=2020/month=11', > 'test/test6/year=2020/month=11/day=1', > > 'test/test6/year=2020/month=11/day=1/8ee6c66320ca47908c37f112f0cffd6c.parquet', > > 'test/test6/year=2020/month=11/day=1/ef753f016efc44b7b0f0800c35d084fc.parquet',]{code} > > This causes issues later when I try to read a dataset (the code is expecting > a dictionary still): > {code:java} > dataset = ds.dataset("test/test5", filesystem=blob_fs, format="parquet"){code} > {code:java} > --> > 221 for path, info in selected_files.items(): > 222 infos.append(self._create_file_info(path, info)) > 223 AttributeError: 'list' object has no attribute 'items'{code} > > I am still able to read individual files: > {code:java} > dataset = ds.dataset("test/test4/year=2020/month=11/2020-11.parquet", > filesystem=blob_fs, format="parquet"){code} > And I can read the dataset if I pass in a list of blob names "manually": > > {code:java} > blobs = wasb.list_blobs(container_name="test", prefix="test4") > dataset = ds.dataset(source=["test/" + blob.name for blob in blobs], > format="parquet", > partitioning="hive", > filesystem=blob_fs) > {code} > > For all of my examples, blob_fs is defined by: > {code:java} > blob_fs = fsspec.filesystem( > protocol="abfs", account_name=base.login, account_key=base.password > ){code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
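The list-versus-dict mismatch described above could be bridged by coercing find() output into the {path: info} mapping that pyarrow 2.0's handler iterates with .items(). This helper is a sketch under that assumption, not part of fsspec, adlfs, or pyarrow:

```python
def normalize_find(result):
    """Coerce fsspec find(..., detail=True) output to a {path: info} dict.

    Hypothetical helper: some adlfs releases returned a plain list of
    paths even with detail=True, while pyarrow 2.0 expects a mapping
    it can call .items() on.
    """
    if isinstance(result, dict):
        return result
    # Minimal info dicts; real entries would also carry size, etag, etc.
    return {path: {"name": path, "type": "file"} for path in result}
```

With this, `for path, info in normalize_find(selected_files).items(): ...` works regardless of which shape the installed adlfs returns.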