[jira] [Commented] (ARROW-2358) API for Writing to Multiple Feather Files
[ https://issues.apache.org/jira/browse/ARROW-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511574#comment-16511574 ]

Dhruv Madeka commented on ARROW-2358:

Got it! I think I can handle that.

> API for Writing to Multiple Feather Files
>
> Key: ARROW-2358
> URL: https://issues.apache.org/jira/browse/ARROW-2358
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C, C++, Python
> Affects Versions: 0.9.0
> Reporter: Dhruv Madeka
> Priority: Minor
>
> It would be really great to have an API which can write a Table to a
> `FeatherDataset`. Essentially, given a name for a file, it would split the
> table into N equal parts (where N could be determined by the user or the code)
> and then write the data to N files with a suffix (which is `_part` by default
> but could be user-specified).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (ARROW-2667) [C++/Python] Add pandas-like take method to Array/Column/ChunkedArray
[ https://issues.apache.org/jira/browse/ARROW-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511292#comment-16511292 ]

Dhruv Madeka commented on ARROW-2667:

Would this be the right issue for adding a `slice` function to `Table`? It would be useful for writing multiple files.

> [C++/Python] Add pandas-like take method to Array/Column/ChunkedArray
>
> Key: ARROW-2667
> URL: https://issues.apache.org/jira/browse/ARROW-2667
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Uwe L. Korn
> Priority: Major
>
> We should add a {{take}} method to {{Array/ChunkedArray/Column}} that takes a
> list of indices and returns a reordered array.
> For reference, see Pandas' interface:
> https://github.com/pandas-dev/pandas/blob/2cbdd9a2cd19501c98582490e35c5402ae6de941/pandas/core/arrays/base.py#L466
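For readers unfamiliar with the operation, the `take` semantics described in the quoted issue (a list of indices in, a reordered array out) can be sketched in plain Python. This is only an illustration on Python sequences; the function below is hypothetical and is not the Arrow or pandas implementation.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def take(values: Sequence[T], indices: Sequence[int]) -> List[T]:
    """Return the elements of `values` at `indices`, in the order given.

    Hypothetical helper for illustration only. Indices may repeat, and
    negative indices count from the end, as with ordinary Python indexing.
    """
    size = len(values)
    for index in indices:
        # Reject out-of-range positions up front with a clear message.
        if not -size <= index < size:
            raise IndexError(f"index {index} is out of bounds for length {size}")
    return [values[index] for index in indices]
```

For example, `take(["a", "b", "c", "d"], [3, 0, 0])` selects the last element once and the first element twice.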
[jira] [Commented] (ARROW-2358) API for Writing to Multiple Feather Files
[ https://issues.apache.org/jira/browse/ARROW-2358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16510611#comment-16510611 ]

Dhruv Madeka commented on ARROW-2358:

[~wesmckinn] So I'm good to submit a PR. It's just not obvious how to do this without a `slice` function for a table. Would you advise that I implement that first and then the FeatherDataset writer?

> API for Writing to Multiple Feather Files
>
> Key: ARROW-2358
> URL: https://issues.apache.org/jira/browse/ARROW-2358
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C, C++, Python
> Affects Versions: 0.9.0
> Reporter: Dhruv Madeka
> Priority: Minor
>
> It would be really great to have an API which can write a Table to a
> `FeatherDataset`. Essentially, given a name for a file, it would split the
> table into N equal parts (where N could be determined by the user or the code)
> and then write the data to N files with a suffix (which is `_part` by default
> but could be user-specified).
[jira] [Created] (ARROW-2358) API for Writing to Multiple Feather Files
Dhruv Madeka created ARROW-2358:

Summary: API for Writing to Multiple Feather Files
Key: ARROW-2358
URL: https://issues.apache.org/jira/browse/ARROW-2358
Project: Apache Arrow
Issue Type: New Feature
Components: C, C++, Python
Affects Versions: 0.9.0
Reporter: Dhruv Madeka
Fix For: 0.10.0

It would be really great to have an API which can write a Table to a `FeatherDataset`. Essentially, given a name for a file, it would split the table into N equal parts (where N could be determined by the user or the code) and then write the data to N files with a suffix (which is `_part` by default but could be user-specified).
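The splitting and naming scheme proposed above can be sketched without any Arrow dependency. The following is a minimal sketch under stated assumptions: `split_ranges` and `part_paths` are hypothetical helper names, the part index is appended directly after the suffix, and when the row count is not divisible by N the leading parts receive one extra row. Each `(offset, length)` pair is what a table `slice` function would consume.

```python
from pathlib import Path
from typing import List, Tuple


def split_ranges(num_rows: int, num_parts: int) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that cover num_rows in num_parts slices.

    Hypothetical helper: parts are as equal as possible, with the first
    `num_rows % num_parts` parts taking one additional row each.
    """
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    base, extra = divmod(num_rows, num_parts)
    ranges = []
    offset = 0
    for i in range(num_parts):
        length = base + (1 if i < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


def part_paths(name: str, num_parts: int, suffix: str = "_part") -> List[str]:
    """Build one output path per part, keeping the original file extension.

    The `<stem><suffix><index><extension>` layout is an assumption; the
    issue only says the suffix defaults to `_part`.
    """
    path = Path(name)
    return [
        str(path.with_name(f"{path.stem}{suffix}{i}{path.suffix}"))
        for i in range(num_parts)
    ]
```

A writer would then pair each range with its path, slice the table by `(offset, length)`, and write each slice to the corresponding file.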
[jira] [Commented] (ARROW-2332) [Python] Provide API for reading multiple Feather files
[ https://issues.apache.org/jira/browse/ARROW-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413444#comment-16413444 ]

Dhruv Madeka commented on ARROW-2332:

[~wesmckinn] - let me know if that sounds like a good plan, and I'll try to make a PR.

> [Python] Provide API for reading multiple Feather files
>
> Key: ARROW-2332
> URL: https://issues.apache.org/jira/browse/ARROW-2332
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 0.10.0
>
> See discussion in
> https://github.com/wesm/feather/issues/273#issuecomment-374093374
[jira] [Comment Edited] (ARROW-2332) [Python] Provide API for reading multiple Feather files
[ https://issues.apache.org/jira/browse/ARROW-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16408384#comment-16408384 ]

Dhruv Madeka edited comment on ARROW-2332 at 3/21/18 6:32 PM:

As mentioned in the GitHub issue, IMHO this requires a few steps:
* In `FeatherReader`, separate the extraction of the table from the call that converts it with `to_pandas`
* Create a `FeatherDataset` class which takes a list of feather files and creates a table for each one
* Validate that the schemas of the files match; the `validate_schema` used for `ParquetDataset` seems to work. Maybe we can create an abstract dataset class and inherit from there
* Call `concat_tables` on the extracted tables and return the result as a pandas DataFrame

was (Author: madeka):
As mentioned in the Github issue - IMHO this requires a few steps.
* In `FeatherReader`, separate the extraction of the table from the call to convert it `to_pandas`
* Create a `FeatherDataset` class which takes a list of featherfiles and creates a table for each one
* Validate that the schemas for each of the files match, the `validate_schema` for the `ParquetDataset` seems to work. Maybe we can create an abstract dataset class and inherit from there

> [Python] Provide API for reading multiple Feather files
>
> Key: ARROW-2332
> URL: https://issues.apache.org/jira/browse/ARROW-2332
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 0.10.0
>
> See discussion in
> https://github.com/wesm/feather/issues/273#issuecomment-374093374
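The validate-then-concatenate part of the plan above can be illustrated with a small stand-in. This is a sketch under loud assumptions: a "table" here is just a dict mapping column names to lists, a "schema" is just the ordered column names, and `validate_schemas` and `concat_validated` are hypothetical names. None of this is the Arrow API; it only shows the order of operations a `FeatherDataset` reader might follow after loading one table per file.

```python
from typing import Dict, List, Sequence

# Stand-in for an Arrow table: column name -> column values (assumption).
Table = Dict[str, list]


def validate_schemas(tables: Sequence[Table]) -> None:
    """Raise ValueError if any table's columns differ from the first table's."""
    expected = list(tables[0])
    for index, table in enumerate(tables[1:], start=1):
        found = list(table)
        if found != expected:
            raise ValueError(
                f"table {index} has columns {found}, expected {expected}"
            )


def concat_validated(tables: Sequence[Table]) -> Table:
    """Check that all tables share one schema, then append them row-wise."""
    if not tables:
        raise ValueError("at least one table is required")
    validate_schemas(tables)
    return {
        name: [value for table in tables for value in table[name]]
        for name in tables[0]
    }
```

Validating before concatenating means a mismatched file is reported by its position in the input list rather than surfacing later as a confusing concatenation error.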
[jira] [Commented] (ARROW-2332) [Python] Provide API for reading multiple Feather files
[ https://issues.apache.org/jira/browse/ARROW-2332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16408384#comment-16408384 ]

Dhruv Madeka commented on ARROW-2332:

As mentioned in the GitHub issue, IMHO this requires a few steps:
* In `FeatherReader`, separate the extraction of the table from the call that converts it with `to_pandas`
* Create a `FeatherDataset` class which takes a list of feather files and creates a table for each one
* Validate that the schemas of the files match; the `validate_schema` used for `ParquetDataset` seems to work. Maybe we can create an abstract dataset class and inherit from there

> [Python] Provide API for reading multiple Feather files
>
> Key: ARROW-2332
> URL: https://issues.apache.org/jira/browse/ARROW-2332
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 0.10.0
>
> See discussion in
> https://github.com/wesm/feather/issues/273#issuecomment-374093374