[jira] [Commented] (ARROW-18371) [C++] Expose *FromJSON helpers
[ https://issues.apache.org/jira/browse/ARROW-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636754#comment-17636754 ] Li Jin commented on ARROW-18371: > Definitely not. These are functions generating ad hoc data tailored for > specific tests, with little consistency. To clarify, do you know the \{Array,{{Exec,Record}Batch}FromJSON or BatchesWithSchema/MakeBasicBatches > [C++] Expose *FromJSON helpers > -- > > Key: ARROW-18371 > URL: https://issues.apache.org/jira/browse/ARROW-18371 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Rok Mihevc >Priority: Major > Labels: testing > > {Array,{{Exec,Record}Batch}FromJSON helper functions would be useful when > testing in projects that use Arrow. BatchesWithSchema and MakeBasicBatches > could be considered as well. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-18371) [C++] Expose *FromJSON helpers
[ https://issues.apache.org/jira/browse/ARROW-18371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17636754#comment-17636754 ] Li Jin edited comment on ARROW-18371 at 11/21/22 4:08 PM: -- > Definitely not. These are functions generating ad hoc data tailored for > specific tests, with little consistency. To clarify, do you mean the \{Array,{{Exec,Record}Batch}FromJSON or BatchesWithSchema/MakeBasicBatches was (Author: icexelloss): > Definitely not. These are functions generating ad hoc data tailored for > specific tests, with little consistency. To clarify, do you know the \{Array,{{Exec,Record}Batch}FromJSON or BatchesWithSchema/MakeBasicBatches > [C++] Expose *FromJSON helpers > -- > > Key: ARROW-18371 > URL: https://issues.apache.org/jira/browse/ARROW-18371 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Rok Mihevc >Priority: Major > Labels: testing > > {Array,{{Exec,Record}Batch}FromJSON helper functions would be useful when > testing in projects that use Arrow. BatchesWithSchema and MakeBasicBatches > could be considered as well. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-18063) [C++][Python] Custom streaming data providers in {{run_query}}
[ https://issues.apache.org/jira/browse/ARROW-18063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17618950#comment-17618950 ] Li Jin edited comment on ARROW-18063 at 10/17/22 3:11 PM: -- {quote}It might be slightly nicer to throw an error when setting the default named table provider if it has already been set. There are more complex alternatives such as a named table provider registry or a chain of named table providers but I'm not sure they are needed in this case. {quote} I think either override or raise error is fine. In practice I don't see our application would need to invoke the initialization of custom registration more than once. {quote}Another alternative, which might be a more long term solution, is to create a new Substrait extension which defines a new {{read_type}} (e.g. {{{}ExtensionTable{}}}) which contains the needed information (e.g. URL). We would then need to make it possible to construct custom sources from {{ExtensionTable}} though which probably puts us in roughly the same boat :). We would need an {{ExtensionTableProvider}} and we would probably want the default to be configurable. {quote} I have the same thinking as well. Long term we should allow user to register custom ExtensionTableProvider as well and ideally with the similar way of how to extend ExecFactoryRegistry and NamedTableProvider. was (Author: icexelloss): >It might be slightly nicer to throw an error when setting the default named >table provider if it has already been set. There are more complex alternatives >such as a named table provider registry or a chain of named table providers >but I'm not sure they are needed in this case. I think either override or raise error is fine. In practice I don't see our application would need to invoke the initialization of custom registration more than once. >Another alternative, which might be a more long term solution, is to create a >new Substrait extension which defines a new {{read_type}} (e.g. 
>{{{}ExtensionTable{}}}) which contains the needed information (e.g. URL). >We would then need to make it possible to construct custom sources from >{{ExtensionTable}} though which probably puts us in roughly the same boat :). >We would need an {{ExtensionTableProvider}} and we would probably want the >default to be configurable. I have the same thinking as well. Long term we should allow user to register custom ExtensionTableProvider as well and ideally with the similar way of how to extend ExecFactoryRegistry and NamedTableProvider. > [C++][Python] Custom streaming data providers in {{run_query}} > -- > > Key: ARROW-18063 > URL: https://issues.apache.org/jira/browse/ARROW-18063 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ben Kietzman >Priority: Major > > [Mailing list > thread|https://lists.apache.org/thread/r484sqrd6xjdd058prbrcwh3t5vg91so] > The goal is to: > - generate a substrait plan in Python using Ibis > - ... wherein tables are specified using custom URLs > - use the python API {{run_query}} to execute the plan > - ... against source data which is *streamed* from those URLs rather than > pulled fully into local memory > The obstacles include: > - The API for constructing a data stream from the custom URLs is only > available in c++ > - The python {{run_query}} function requires tables as input and cannot > accept a RecordBatchReader even if one could be constructed from a custom URL > - Writing custom cython is not preferred > Some potential solutions: > - Use ExecuteSerializedPlan() directly usable from c++ so that construction > of data sources need not be handled in python. 
Passing a buffer from > python/ibis down to C++ is much simpler and can be navigated without writing > cython > - Refactor NamedTableProvider from a lambda mapping {{names -> data source}} > into a registry so that data source factories can be added from c++ then > referenced by name from python > - Extend {{run_query}} to support non-Table sources and require the user to > write a python mapping from URLs to {{pa.RecordBatchReader}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (ARROW-18063) [C++][Python] Custom streaming data providers in {{run_query}}
[ https://issues.apache.org/jira/browse/ARROW-18063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17618950#comment-17618950 ] Li Jin edited comment on ARROW-18063 at 10/17/22 3:10 PM: -- >It might be slightly nicer to throw an error when setting the default named >table provider if it has already been set. There are more complex alternatives >such as a named table provider registry or a chain of named table providers >but I'm not sure they are needed in this case. I think either override or raise error is fine. In practice I don't see our application would need to invoke the initialization of custom registration more than once. >Another alternative, which might be a more long term solution, is to create a >new Substrait extension which defines a new {{read_type}} (e.g. >{{{}ExtensionTable{}}}) which contains the needed information (e.g. URL). >We would then need to make it possible to construct custom sources from >{{ExtensionTable}} though which probably puts us in roughly the same boat :). >We would need an {{ExtensionTableProvider}} and we would probably want the >default to be configurable. I have the same thinking as well. Long term we should allow user to register custom ExtensionTableProvider as well and ideally with the similar way of how to extend ExecFactoryRegistry and NamedTableProvider. was (Author: icexelloss): >It might be slightly nicer to throw an error when setting the default named >table provider if it has already been set. There are more complex alternatives >such as a named table provider registry or a chain of named table providers >but I'm not sure they are needed in this case. I think either override or raise error is fine. In practice I don't see our application would need to invoke the initialization of custom registration more than once. >Another alternative, which might be a more long term solution, is to create a >new Substrait extension which defines a new {{read_type}} (e.g. 
>{{{}ExtensionTable{}}}) which contains the needed information (e.g. URL). We would then need to make it possible to construct custom sources from {{ExtensionTable}} though which probably puts us in roughly the same boat :). We would need an {{ExtensionTableProvider}} and we would probably want the default to be configurable. I have the same thinking as well. Long term we should allow user to register custom ExtensionTableProvider as well and ideally with the similar way of how to extend ExecFactoryRegistry and NamedTableProvider. > [C++][Python] Custom streaming data providers in {{run_query}} > -- > > Key: ARROW-18063 > URL: https://issues.apache.org/jira/browse/ARROW-18063 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ben Kietzman >Priority: Major > > [Mailing list > thread|https://lists.apache.org/thread/r484sqrd6xjdd058prbrcwh3t5vg91so] > The goal is to: > - generate a substrait plan in Python using Ibis > - ... wherein tables are specified using custom URLs > - use the python API {{run_query}} to execute the plan > - ... against source data which is *streamed* from those URLs rather than > pulled fully into local memory > The obstacles include: > - The API for constructing a data stream from the custom URLs is only > available in c++ > - The python {{run_query}} function requires tables as input and cannot > accept a RecordBatchReader even if one could be constructed from a custom URL > - Writing custom cython is not preferred > Some potential solutions: > - Use ExecuteSerializedPlan() directly usable from c++ so that construction > of data sources need not be handled in python. 
Passing a buffer from > python/ibis down to C++ is much simpler and can be navigated without writing > cython > - Refactor NamedTableProvider from a lambda mapping {{names -> data source}} > into a registry so that data source factories can be added from c++ then > referenced by name from python > - Extend {{run_query}} to support non-Table sources and require the user to > write a python mapping from URLs to {{pa.RecordBatchReader}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (ARROW-18063) [C++][Python] Custom streaming data providers in {{run_query}}
[ https://issues.apache.org/jira/browse/ARROW-18063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17618950#comment-17618950 ] Li Jin commented on ARROW-18063: >It might be slightly nicer to throw an error when setting the default named >table provider if it has already been set. There are more complex alternatives >such as a named table provider registry or a chain of named table providers >but I'm not sure they are needed in this case. I think either override or raise error is fine. In practice I don't see our application would need to invoke the initialization of custom registration more than once. >Another alternative, which might be a more long term solution, is to create a >new Substrait extension which defines a new {{read_type}} (e.g. >{{{}ExtensionTable{}}}) which contains the needed information (e.g. URL). We would then need to make it possible to construct custom sources from {{ExtensionTable}} though which probably puts us in roughly the same boat :). We would need an {{ExtensionTableProvider}} and we would probably want the default to be configurable. I have the same thinking as well. Long term we should allow user to register custom ExtensionTableProvider as well and ideally with the similar way of how to extend ExecFactoryRegistry and NamedTableProvider. > [C++][Python] Custom streaming data providers in {{run_query}} > -- > > Key: ARROW-18063 > URL: https://issues.apache.org/jira/browse/ARROW-18063 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Ben Kietzman >Priority: Major > > [Mailing list > thread|https://lists.apache.org/thread/r484sqrd6xjdd058prbrcwh3t5vg91so] > The goal is to: > - generate a substrait plan in Python using Ibis > - ... wherein tables are specified using custom URLs > - use the python API {{run_query}} to execute the plan > - ... 
against source data which is *streamed* from those URLs rather than > pulled fully into local memory > The obstacles include: > - The API for constructing a data stream from the custom URLs is only > available in c++ > - The python {{run_query}} function requires tables as input and cannot > accept a RecordBatchReader even if one could be constructed from a custom URL > - Writing custom cython is not preferred > Some potential solutions: > - Use ExecuteSerializedPlan() directly usable from c++ so that construction > of data sources need not be handled in python. Passing a buffer from > python/ibis down to C++ is much simpler and can be navigated without writing > cython > - Refactor NamedTableProvider from a lambda mapping {{names -> data source}} > into a registry so that data source factories can be added from c++ then > referenced by name from python > - Extend {{run_query}} to support non-Table sources and require the user to > write a python mapping from URLs to {{pa.RecordBatchReader}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
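The registry idea discussed above (data source factories registered under a name, with an error raised if a name is already taken) can be sketched roughly as follows. This is only an illustration in plain Python: NamedTableRegistry, register, open, and example_source are hypothetical names invented for this sketch and are not part of the Arrow or Substrait APIs.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical stand-ins: a "batch" is a list of row dicts, and a source
# factory returns an iterator that yields batches lazily.
Batch = List[dict]
SourceFactory = Callable[[], Iterator[Batch]]


class NamedTableRegistry:
    """Illustrative registry mapping table names to data source factories."""

    def __init__(self) -> None:
        self._factories: Dict[str, SourceFactory] = {}

    def register(self, name: str, factory: SourceFactory, *, override: bool = False) -> None:
        # Raise if the name is already taken, unless the caller opts in to overriding.
        if name in self._factories and not override:
            raise KeyError(f"table provider already registered: {name!r}")
        self._factories[name] = factory

    def open(self, name: str) -> Iterator[Batch]:
        # Look up the factory by name and create a fresh stream of batches.
        if name not in self._factories:
            raise KeyError(f"no table provider registered for {name!r}")
        return self._factories[name]()


def example_source() -> Iterator[Batch]:
    # Yields three small batches instead of materializing the whole table.
    for start in range(0, 6, 2):
        yield [{"x": start}, {"x": start + 1}]


registry = NamedTableRegistry()
registry.register("my_table", example_source)
batches = list(registry.open("my_table"))
```

Whether a second registration should override or raise is exactly the choice debated in the comment; the override flag here just makes both behaviours available.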
[jira] [Commented] (ARROW-17253) [Python] pyarrow.array() crashes the interpreter when given a generator that raises while iterating
[ https://issues.apache.org/jira/browse/ARROW-17253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17574271#comment-17574271 ] Li Jin commented on ARROW-17253: Thanks [~apitrou] ! > [Python] pyarrow.array() crashes the interpreter when given a generator that > raises while iterating > --- > > Key: ARROW-17253 > URL: https://issues.apache.org/jira/browse/ARROW-17253 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 8.0.0 >Reporter: Li Jin >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 10.0.0, 9.0.1 > > Time Spent: 40m > Remaining Estimate: 0h > > {code:java} > pa.array((1 // 0 for x in range(10)), size=10){code} > This would crash the python interpreter -- This message was sent by Atlassian Jira (v8.20.10#820010)
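For versions without the fix, one possible workaround (an assumption on my part, not something proposed in the issue) is to consume the generator eagerly in plain Python, so that any exception is raised before pyarrow is involved. The helper names below are hypothetical, and the pa.array call is left as a comment so the sketch runs without pyarrow installed.

```python
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def materialize(values: Iterable[T]) -> List[T]:
    """Consume an iterable eagerly so any exception surfaces in plain Python."""
    return list(values)


def failing_values():
    # Same shape as the reproducer: evaluating any element raises ZeroDivisionError.
    return (1 // 0 for x in range(10))


data: Optional[List[int]] = None
error: Optional[str] = None
try:
    data = materialize(failing_values())
except ZeroDivisionError as exc:
    # The error is an ordinary Python exception at this point.
    error = repr(exc)

# With pyarrow installed, the list (not the generator) would then be passed on:
# arr = pa.array(data)
```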
[jira] [Comment Edited] (ARROW-17253) Pyarrow array crashes the interpreter when encounter 0 division error
[ https://issues.apache.org/jira/browse/ARROW-17253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573021#comment-17573021 ] Li Jin edited comment on ARROW-17253 at 7/29/22 3:02 PM: - I think in general, any exception raised by the generator would crash the python interpreter when passed to pa.array was (Author: icexelloss): I think in general, any exception raised by the generator would crash the python interpreter when passing to pa.array > Pyarrow array crashes the interpreter when encounter 0 division error > --- > > Key: ARROW-17253 > URL: https://issues.apache.org/jira/browse/ARROW-17253 > Project: Apache Arrow > Issue Type: Bug >Reporter: Li Jin >Priority: Major > > {code:java} > pa.array((1 // 0 for x in range(10)), size=10){code} > This would crash the python interpreter -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (ARROW-17253) Pyarrow array crashes the interpreter when encounter 0 division error
[ https://issues.apache.org/jira/browse/ARROW-17253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated ARROW-17253: --- Description: {code:java} pa.array((1 // 0 for x in range(10)), size=10){code} This would crash the python interpreter was: {code:java} pa.array(1 // 0 for x in range(10), size=10){code} This would crash the python interpreter > Pyarrow array crashes the interpreter when encounter 0 division error > --- > > Key: ARROW-17253 > URL: https://issues.apache.org/jira/browse/ARROW-17253 > Project: Apache Arrow > Issue Type: Bug >Reporter: Li Jin >Priority: Major > > {code:java} > pa.array((1 // 0 for x in range(10)), size=10){code} > This would crash the python interpreter -- This message was sent by Atlassian Jira (v8.20.10#820010)
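The description change matters because the original snippet is not valid Python: a generator expression must be parenthesized when it is not the sole call argument. The check below uses only the built-in compile(); is_valid_python is a hypothetical helper name, and nothing is executed, so pyarrow is not required.

```python
def is_valid_python(source: str) -> bool:
    """Return True if the expression compiles, False on SyntaxError."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


original = "pa.array(1 // 0 for x in range(10), size=10)"
corrected = "pa.array((1 // 0 for x in range(10)), size=10)"

# Only compiled, never evaluated, so the name `pa` does not need to exist.
original_ok = is_valid_python(original)
corrected_ok = is_valid_python(corrected)
```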
[jira] [Commented] (ARROW-17253) Pyarrow array crashes the interpreter when encounter 0 division error
[ https://issues.apache.org/jira/browse/ARROW-17253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573021#comment-17573021 ] Li Jin commented on ARROW-17253: I think in general, any exception raised by the generator would crash the python interpreter when passing to pa.array > Pyarrow array crashes the interpreter when encounter 0 division error > --- > > Key: ARROW-17253 > URL: https://issues.apache.org/jira/browse/ARROW-17253 > Project: Apache Arrow > Issue Type: Bug >Reporter: Li Jin >Priority: Major > > {code:java} > pa.array(1 // 0 for x in range(10), size=10){code} > This would crash the python interpreter -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (ARROW-17253) Pyarrow array crashes the interpreter when encounter 0 division error
Li Jin created ARROW-17253: -- Summary: Pyarrow array crashes the interpreter when encounter 0 division error Key: ARROW-17253 URL: https://issues.apache.org/jira/browse/ARROW-17253 Project: Apache Arrow Issue Type: Bug Reporter: Li Jin {code:java} pa.array(1 // 0 for x in range(10), size=10){code} This would crash the python interpreter
[jira] [Updated] (ARROW-16716) [Benchmarks] Create Projection benchmark for Acero
[ https://issues.apache.org/jira/browse/ARROW-16716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated ARROW-16716: --- Component/s: Benchmarking > [Benchmarks] Create Projection benchmark for Acero > -- > > Key: ARROW-16716 > URL: https://issues.apache.org/jira/browse/ARROW-16716 > Project: Apache Arrow > Issue Type: Improvement > Components: Benchmarking >Reporter: Li Jin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (ARROW-16716) [Benchmarks] Create Projection benchmark for Acero
[ https://issues.apache.org/jira/browse/ARROW-16716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17545087#comment-17545087 ] Li Jin commented on ARROW-16716: [~ichauster] is an intern who is going to work on Acero benchmarks. I talked to @Weston Pace offline, and it seems Projection is a good place to start. Pasting [~westonpace]'s comments on this: " * Presumably, for a complex expression, with a large enough batch size, the majority of time will be spent in the kernel functions. * How far can you shrink the exec batch before the overhead of the node impacts runtime? * How complex does the expression need to be? * I expect the results to be very similar to ExecuteScalarExpressionOverhead in expression_benchmark.cc. Is it? If not, what is the difference? * What is the data rate of the project node (in bytes/second) for all of the above? * For all of the above, run with both 1 thread and 1 thread per core. The ExecuteScalarExpressionOverhead benchmarks would be a good existing example that should be pretty similar to how we would benchmark the project node. " > [Benchmarks] Create Projection benchmark for Acero > -- > > Key: ARROW-16716 > URL: https://issues.apache.org/jira/browse/ARROW-16716 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Li Jin >Priority: Major >
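To make the batch-size and data-rate questions above concrete, here is a toy sketch in plain Python. It is not an Acero benchmark: toy_project stands in for a project node, BYTES_PER_VALUE assumes 8-byte integers, and all names are hypothetical. It only shows how bytes/second can be derived while sweeping the batch size.

```python
import time
from typing import Callable, List, Sequence


def data_rate(total_bytes: int, elapsed_seconds: float) -> float:
    """Bytes processed per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return total_bytes / elapsed_seconds


def time_batches(project: Callable[[Sequence[int]], List[int]],
                 values: Sequence[int], batch_size: int) -> float:
    """Run `project` over `values` one batch at a time; return elapsed seconds."""
    start = time.perf_counter()
    for offset in range(0, len(values), batch_size):
        project(values[offset:offset + batch_size])
    return time.perf_counter() - start


def toy_project(batch: Sequence[int]) -> List[int]:
    # Stand-in for evaluating the expression x * 2 + 1 on one batch.
    return [x * 2 + 1 for x in batch]


BYTES_PER_VALUE = 8  # assumption: values are treated as int64
values = list(range(100_000))

for batch_size in (10, 1_000, 100_000):
    elapsed = time_batches(toy_project, values, batch_size)
    rate = data_rate(len(values) * BYTES_PER_VALUE, elapsed)
    print(f"batch_size={batch_size:>7}  {rate / 1e6:10.1f} MB/s")
```

Smaller batches mean more per-batch overhead for the same total work, which is the effect the second bullet asks about.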
[jira] [Created] (ARROW-16716) [Benchmarks] Create Projection benchmark for Acero
Li Jin created ARROW-16716: -- Summary: [Benchmarks] Create Projection benchmark for Acero Key: ARROW-16716 URL: https://issues.apache.org/jira/browse/ARROW-16716 Project: Apache Arrow Issue Type: Improvement Reporter: Li Jin
[jira] [Commented] (ARROW-15901) [C++] Support Substrait projection with custom output field names
[ https://issues.apache.org/jira/browse/ARROW-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17532270#comment-17532270 ] Li Jin commented on ARROW-15901: [~rtpsw] Can you please explain the issue here in a bit more detail? > [C++] Support Substrait projection with custom output field names > - > > Key: ARROW-15901 > URL: https://issues.apache.org/jira/browse/ARROW-15901 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Yaron Gvili >Assignee: Yaron Gvili >Priority: Major > Labels: pull-request-available > Time Spent: 2.5h > Remaining Estimate: 0h > > Currently, Arrow Substrait does not support a plan with custom output field > names. The proposal is to add support for projection only at this time.
[jira] [Commented] (ARROW-16083) [C++] Implement AsofJoin execution node
[ https://issues.apache.org/jira/browse/ARROW-16083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17515366#comment-17515366 ] Li Jin commented on ARROW-16083: Created this Jira because I started to work on implementing an AsofJoin execution node. For background, asof join is a particularly useful join operation in time series data analysis: it assumes that data arrives in sorted time order and performs a join with inexact timestamp matches. See [https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/api/pandas.merge_asof.html] for details. > [C++] Implement AsofJoin execution node > --- > > Key: ARROW-16083 > URL: https://issues.apache.org/jira/browse/ARROW-16083 > Project: Apache Arrow > Issue Type: New Feature >Reporter: Li Jin >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
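As a rough illustration of the asof join semantics described above, here is a minimal pure-Python sketch. It matches each left row to the most recent right row at or before its timestamp, which mirrors the default "backward" direction of pandas merge_asof. It is not the Acero node's implementation; asof_join, trades, and quotes are hypothetical names, and both inputs are assumed to be sorted by time.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def asof_join(
    left: List[Tuple[int, str]],
    right: List[Tuple[int, float]],
    tolerance: Optional[int] = None,
) -> List[Tuple[int, str, Optional[float]]]:
    """Backward asof join of two time-sorted lists of (time, value) rows."""
    right_times = [t for t, _ in right]
    joined = []
    for t, payload in left:
        # Index of the last right row whose time is <= t (exact matches allowed).
        idx = bisect_right(right_times, t) - 1
        match = None
        if idx >= 0 and (tolerance is None or t - right_times[idx] <= tolerance):
            match = right[idx][1]
        joined.append((t, payload, match))
    return joined


trades = [(1, "a"), (5, "b"), (10, "c")]
quotes = [(2, 100.0), (3, 101.0), (7, 102.0)]

joined = asof_join(trades, quotes)
# → [(1, 'a', None), (5, 'b', 101.0), (10, 'c', 102.0)]

joined_tol = asof_join(trades, quotes, tolerance=2)
# → [(1, 'a', None), (5, 'b', 101.0), (10, 'c', None)]
```

Because both inputs are sorted, a streaming implementation could advance a cursor over the right side instead of doing a binary search per row; the bisect version is used here only for brevity.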
[jira] [Created] (ARROW-16083) [C++] Implement AsofJoin execution node
Li Jin created ARROW-16083: -- Summary: [C++] Implement AsofJoin execution node Key: ARROW-16083 URL: https://issues.apache.org/jira/browse/ARROW-16083 Project: Apache Arrow Issue Type: New Feature Reporter: Li Jin
[jira] [Resolved] (ARROW-15532) [C++] Fix unused warning for StringClassifyDoc
[ https://issues.apache.org/jira/browse/ARROW-15532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin resolved ARROW-15532. Fix Version/s: 7.0.0 Resolution: Fixed Issue resolved by pull request 12321 [https://github.com/apache/arrow/pull/12321] > [C++] Fix unused warning for StringClassifyDoc > -- > > Key: ARROW-15532 > URL: https://issues.apache.org/jira/browse/ARROW-15532 > Project: Apache Arrow > Issue Type: Task > Components: C++ >Reporter: Li Jin >Assignee: Li Jin >Priority: Minor > Labels: pull-request-available > Fix For: 7.0.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (ARROW-15532) [C++] Fix unused warning for StringClassifyDoc
[ https://issues.apache.org/jira/browse/ARROW-15532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated ARROW-15532: --- Summary: [C++] Fix unused warning for StringClassifyDoc (was: [C++] Remove unused warning for StringClassifyDoc) > [C++] Fix unused warning for StringClassifyDoc > -- > > Key: ARROW-15532 > URL: https://issues.apache.org/jira/browse/ARROW-15532 > Project: Apache Arrow > Issue Type: Task >Reporter: Li Jin >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (ARROW-15532) [C++] Remove unused warning for StringClassifyDoc
Li Jin created ARROW-15532: -- Summary: [C++] Remove unused warning for StringClassifyDoc Key: ARROW-15532 URL: https://issues.apache.org/jira/browse/ARROW-15532 Project: Apache Arrow Issue Type: Task Reporter: Li Jin