[jira] [Commented] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755808#comment-16755808 ] Antoine Pitrou commented on ARROW-4313:

"Is it a mistake to make `cpu.cpu_model_name` unique? I mean, are the LX cache levels, core counts, or any other attribute ever different for the same CPU model string?"

The overclocked frequency may vary (which we could also call "actual frequency"), the rest should be the same.

> Define general benchmark database schema
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
> Issue Type: New Feature
> Components: Benchmarking
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 0.13.0
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
> Some possible attributes that the benchmark database should track, to permit heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755807#comment-16755807 ] Antoine Pitrou commented on ARROW-4313:

For the record, IBM POWER CPUs support little-endian mode on Linux: https://www.ibm.com/developerworks/library/l-power-little-endian-faq-trs/index.html
So the lack of big-endian support in Arrow would probably not be a roadblock.
[jira] [Commented] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store
[ https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755804#comment-16755804 ] Zhijun Fu commented on ARROW-4418:

In addition to the benefits that Robert mentioned above, asio also provides opportunities for performance improvements through its io service, thread pool, etc. In our internal testing, which uses 10+ actors on a single machine, I found that 50% of the plasma store's CPU time is spent receiving messages from plasma clients over UNIX domain sockets. I'm thinking that one way to improve performance is this:
* Use a pool of threads to receive messages from clients. To ensure correct behavior, we can bind a boost::strand to each client, so that all the messages from a given client arrive in order. As this part is CPU-intensive, using multiple threads should help.
* After this, the messages are posted to the io service of the main thread, which calls ProcessMessages for each of them in order.
* After this, the replies are posted to a pool of threads, again using a boost::strand per plasma client to ensure correct ordering.
I think this would help in cases where multiple workers use the plasma store on the same machine, which should be very common. Implementing this would be hard without the functionality asio provides. Thoughts?

> [Plasma] replace event loop with boost::asio for plasma store
>
> Key: ARROW-4418
> URL: https://issues.apache.org/jira/browse/ARROW-4418
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Plasma (C++)
> Reporter: Zhijun Fu
> Priority: Major
> Labels: pull-request-available
> Time Spent: 50m
> Remaining Estimate: 0h
>
> Original text:
> It would be nice to move the plasma store from its current event loop to boost::asio to modernize the code and, more importantly, to benefit from the functionality provided by asio, which I think also provides opportunities for performance improvement.
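The per-client ordering idea in the comment above can be illustrated with a minimal sketch. This is not asio and not Plasma code: it is a Python stand-in that emulates the guarantee a strand gives (tasks posted to one strand run one at a time, in posting order) on top of a shared thread pool. The names `Strand` and `run_demo` are hypothetical, and the main-thread ProcessMessages stage is left out.

```python
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class Strand:
    """Runs posted callables one at a time, in posting order, on a shared pool."""

    def __init__(self, pool):
        self._pool = pool
        self._queue = deque()
        self._lock = threading.Lock()
        self._running = False

    def post(self, fn, *args):
        with self._lock:
            self._queue.append((fn, args))
            if self._running:
                return  # a drain task is already active and will pick this up
            self._running = True
        self._pool.submit(self._drain)

    def _drain(self):
        # Only one drain task per strand is active at a time, so tasks of the
        # same strand never overlap, while different strands run in parallel.
        while True:
            with self._lock:
                if not self._queue:
                    self._running = False
                    return
                fn, args = self._queue.popleft()
            fn(*args)


def run_demo(num_clients=4, messages_per_client=50):
    """Post interleaved messages for several clients; return what each 'received'."""
    received = {c: [] for c in range(num_clients)}
    with ThreadPoolExecutor(max_workers=4) as pool:
        strands = {c: Strand(pool) for c in range(num_clients)}
        for seq in range(messages_per_client):
            for c in range(num_clients):
                strands[c].post(received[c].append, seq)
    # Leaving the "with" block waits for all submitted drain tasks to finish.
    return received
```

Calling `run_demo()` returns, for each client, the sequence numbers in the order they were handled; with one strand per client that order matches the posting order even though four worker threads share the work. A task that raises would stall its strand in this sketch, so real code would need error handling.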
[jira] [Updated] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store
[ https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-4418:
Labels: pull-request-available (was: )

> [Plasma] replace event loop with boost::asio for plasma store
>
> Key: ARROW-4418
> URL: https://issues.apache.org/jira/browse/ARROW-4418
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Plasma (C++)
> Reporter: Zhijun Fu
> Priority: Major
> Labels: pull-request-available
> Time Spent: 40m
> Remaining Estimate: 0h
[jira] [Commented] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755570#comment-16755570 ] Tanya Schlusser commented on ARROW-4313:

I think part of this was to allow anybody to contribute benchmarks from their own machine. While dedicated benchmarking machines like the ones you will set up will have all parameters set for optimal benchmarking, benchmarks run on other machines may give different results. Collecting details about the machine that might explain those differences (in case someone cares to explore the dataset) is part of the goal of the data model.

One concern, of course, is that people get wildly different results than a benchmark reports, and may say "Oh boo–the representative person from the company made fake results that I can't replicate on my machine". With details about a system recorded, performance differences can perhaps be traced back to differences in setup.

Not all fields need to be filled out all the time. My priorities are:
# Identifying which fields are flat-out wrong
# Differentiating between necessary columns and extraneous ones that can be left null
To me, it is not a big deal to have an extra column dangling around that almost nobody uses. No harm. (Unless it's mislabeled or otherwise wrong; that's what I'm interested in getting out of the discussion here.)
[jira] [Commented] (ARROW-4422) [Plasma] Enforce memory limit in plasma, rather than relying on dlmalloc_set_footprint_limit
[ https://issues.apache.org/jira/browse/ARROW-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755526#comment-16755526 ] Anurag Khandelwal commented on ARROW-4422:

cc [~pcmoritz]

> [Plasma] Enforce memory limit in plasma, rather than relying on dlmalloc_set_footprint_limit
>
> Key: ARROW-4422
> URL: https://issues.apache.org/jira/browse/ARROW-4422
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Plasma (C++)
> Affects Versions: 0.12.0
> Reporter: Anurag Khandelwal
> Assignee: Anurag Khandelwal
> Priority: Minor
> Fix For: 0.13.0
[jira] [Created] (ARROW-4422) [Plasma] Enforce memory limit in plasma, rather than relying on dlmalloc_set_footprint_limit
Anurag Khandelwal created ARROW-4422:

Summary: [Plasma] Enforce memory limit in plasma, rather than relying on dlmalloc_set_footprint_limit
Key: ARROW-4422
URL: https://issues.apache.org/jira/browse/ARROW-4422
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Plasma (C++)
Affects Versions: 0.12.0
Reporter: Anurag Khandelwal
Assignee: Anurag Khandelwal
Fix For: 0.13.0

Currently, Plasma relies on dlmalloc_set_footprint_limit to limit the memory utilization of the Plasma Store. This is restrictive because:
* It restricts Plasma to dlmalloc, which supports limiting the memory footprint, as opposed to other, potentially more performant malloc implementations (e.g., jemalloc)
* dlmalloc_set_footprint_limit does not guarantee that the limit it sets is the amount of _usable_ memory. As such, we might trigger evictions much earlier than hitting this limit, e.g., due to fragmentation or metadata overheads.

To overcome this, we can impose the memory limit in Plasma by tracking the number of bytes allocated and freed through malloc and free calls. Whenever the allocation reaches the set limit, we fail any subsequent allocations (i.e., return NULL from malloc). This allows Plasma to not be tied to dlmalloc, and also provides more accurate tracking of memory allocation/capacity.

Caveat: We will need to make sure that the mmapped files live on a file system that is a bit larger (depending on the malloc implementation) than the Plasma memory limit, to account for the extra memory required due to fragmentation/metadata overheads.
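The tracking scheme proposed above might look roughly like the following. This is a minimal sketch under stated assumptions, not Plasma code: it is written in Python as a stand-in for a C++ wrapper around malloc/free, the class name `LimitedAllocator` and the integer handles are hypothetical, and it rejects any request that would push the tracked total past the limit, which is one possible reading of "reaches the set limit".

```python
class LimitedAllocator:
    """Tracks bytes handed out and refuses requests that would exceed a limit."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.allocated_bytes = 0
        self._blocks = {}        # handle -> backing buffer (stand-in for a pointer)
        self._next_handle = 1

    def malloc(self, size):
        """Return a handle, or None (the analogue of NULL) if over the limit."""
        if size < 0:
            raise ValueError("size must be non-negative")
        if self.allocated_bytes + size > self.limit_bytes:
            return None
        handle = self._next_handle
        self._next_handle += 1
        self._blocks[handle] = bytearray(size)
        self.allocated_bytes += size
        return handle

    def free(self, handle):
        """Release a block and subtract its size from the tracked total."""
        block = self._blocks.pop(handle)
        self.allocated_bytes -= len(block)
```

Because the counter only sees requested sizes, it says nothing about fragmentation or allocator metadata, which is why the caveat about a slightly larger backing file system still applies.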
[jira] [Commented] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755506#comment-16755506 ] Areg Melik-Adamyan commented on ARROW-4313:

[~tanya] why do we need 'overclock_freq_HZ'? What is the practical usage model for this field?
[jira] [Commented] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755513#comment-16755513 ] Areg Melik-Adamyan commented on ARROW-4313:

Got it. I think those numbers are mostly never used, because you always run benchmarks at a fixed frequency to get consistent results over time. So they can easily be determined from the model name or cpuid, just for informational purposes, but will never be used in serial benchmarking. In serial benchmarking everything should be fixed, nailed down and unchanged, except the variable you are measuring, which is the Arrow code measured through the benchmark code.
[jira] [Commented] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755509#comment-16755509 ] Tanya Schlusser commented on ARROW-4313:

[~aregm] I do not know. I am depending on the other people commenting here to make sure the hardware tables make sense because honestly I don't ever pay attention to hardware because my use cases never stress my system. At one point Wes suggested it. I am glad there is a debate.
[jira] [Updated] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanya Schlusser updated ARROW-4313:
Attachment: benchmark-data-model.erdplus
[jira] [Updated] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanya Schlusser updated ARROW-4313:
Attachment: (was: benchmark-data-model.png)
[jira] [Commented] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755504#comment-16755504 ] Tanya Schlusser commented on ARROW-4313:

Thank you very much for everyone's detailed feedback. I absolutely need guidance with the Machine / CPU / GPU specs. I have updated the [^benchmark-data-model.png] and the [^benchmark-data-model.erdplus], and added all of the recommended columns.

*Summary of changes:*
* All the dimension tables have been renamed to exclude the `_dim`. (It was there to distinguish dimension tables from fact tables.)
* `cpu`
** Added a `cpu_thread_count`.
** Changed `cpu.speed_Hz` to two columns, `frequency_max_Hz` and `frequency_min_Hz`, and also added a column `machine.overclock_frequency_Hz` to the `machine` table to allow for overclocking, as Wes mentioned in the beginning.
* `os`
** Added both `os.architecture_name` and `os.architecture_bits`, the latter forced to be in {32, 64} and pulled from the architecture name (maybe it will become just a computed column in the joined view...). I think it's a good idea.
* `project`
** Added a `project.project_name` (an oversight before).
* `benchmark_language`
** Split `language` into `language_name` and `language_version`, because people may want to compare between versions (e.g. Python 2.7, 3.5+).
* `environment`
** Removed the foreign key for `machine_id`; that should be in the benchmark report separately. Many machines will have the same environment.
* `benchmark`
** Added a foreign key for `benchmark_language_id`: a benchmark with the same name may exist for different languages.
** Added a foreign key for `project_id`, moved from table `benchmark_result`.
* `benchmark_result`
** Added a foreign key for `machine_id` (removed from `environment`).
** Deleted the foreign key for `project_id`, placing it in `benchmark` (as stated above).

*Questions*
* `cpu` and `gpu` dimension
** Is it a mistake to make `cpu.cpu_model_name` unique? I mean, are the LX cache levels, core counts, or any other attribute ever different for the same CPU model string?
** The same for GPU.
** I have commented the columns to say that `cpu_thread_count` corresponds to `sysctl -n hw.logicalcpu` and `cpu_core_count` corresponds to `sysctl -n hw.physicalcpu`; corrections gratefully accepted.
** Would it be less confusing to make the column names exactly match the `sysctl` keys their values come from, e.g. change `cpu.cpu_model_name` to `cpu.cpu_brand_string` to correspond to the output of `sysctl -n machdep.cpu.brand_string`?
** On that note, is CPU RAM the same thing as `sysctl -n machdep.cpu.cache.size`?
* `environment`
** I'm worried I'm doing something inelegant with the dependency list. It will hold everything: Conda / virtualenv; versions of Numpy; all permutations of the various dependencies in what ASV calls the dependency matrix.
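A rough sketch of how the tables summarized in the comment above could be declared, using Python's built-in sqlite3 module purely for illustration. Column names that appear in the comment are kept as written. Everything else is an assumption and not taken from the attached data model: the surrogate `*_id` keys, `machine_name`, `benchmark_name`, `run_timestamp`, `git_commit_hash`, `time_value`, `mean`, `stddev`, and the choice of SQLite itself. The `gpu` and `environment` tables are omitted for brevity.

```python
import sqlite3

SCHEMA = """
CREATE TABLE cpu (
    cpu_id INTEGER PRIMARY KEY,
    cpu_model_name TEXT NOT NULL UNIQUE,   -- uniqueness is the open question
    cpu_core_count INTEGER,                -- sysctl -n hw.physicalcpu
    cpu_thread_count INTEGER,              -- sysctl -n hw.logicalcpu
    frequency_max_Hz INTEGER,
    frequency_min_Hz INTEGER
);
CREATE TABLE machine (
    machine_id INTEGER PRIMARY KEY,
    machine_name TEXT NOT NULL UNIQUE,
    cpu_id INTEGER NOT NULL REFERENCES cpu(cpu_id),
    overclock_frequency_Hz INTEGER         -- nullable: often unknown
);
CREATE TABLE os (
    os_id INTEGER PRIMARY KEY,
    architecture_name TEXT NOT NULL,
    architecture_bits INTEGER NOT NULL CHECK (architecture_bits IN (32, 64))
);
CREATE TABLE project (
    project_id INTEGER PRIMARY KEY,
    project_name TEXT NOT NULL UNIQUE
);
CREATE TABLE benchmark_language (
    benchmark_language_id INTEGER PRIMARY KEY,
    language_name TEXT NOT NULL,
    language_version TEXT NOT NULL,
    UNIQUE (language_name, language_version)
);
CREATE TABLE benchmark (
    benchmark_id INTEGER PRIMARY KEY,
    benchmark_name TEXT NOT NULL,
    benchmark_language_id INTEGER NOT NULL
        REFERENCES benchmark_language(benchmark_language_id),
    project_id INTEGER NOT NULL REFERENCES project(project_id),
    UNIQUE (benchmark_name, benchmark_language_id)
);
CREATE TABLE benchmark_result (
    benchmark_result_id INTEGER PRIMARY KEY,
    benchmark_id INTEGER NOT NULL REFERENCES benchmark(benchmark_id),
    machine_id INTEGER NOT NULL REFERENCES machine(machine_id),
    run_timestamp TEXT NOT NULL,
    git_commit_hash TEXT NOT NULL,
    time_value REAL NOT NULL,
    mean REAL,                             -- NULL when not available
    stddev REAL                            -- NULL when not available
);
"""


def create_schema(conn):
    """Create the sketch tables on an open sqlite3 connection."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)
```

With this layout the same benchmark name can exist once per language, and a result row ties a benchmark to the machine it ran on, matching the foreign-key moves described in the summary.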
[jira] [Updated] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanya Schlusser updated ARROW-4313:
Attachment: (was: benchmark-data-model.erdplus)
[jira] [Updated] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tanya Schlusser updated ARROW-4313:
Attachment: benchmark-data-model.png
[jira] [Created] (ARROW-4421) [Flight][C++] Handle large Flight data messages
Wes McKinney created ARROW-4421:

Summary: [Flight][C++] Handle large Flight data messages
Key: ARROW-4421
URL: https://issues.apache.org/jira/browse/ARROW-4421
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: Wes McKinney
Fix For: 0.13.0

I believe the message payloads are currently limited to 4MB by default, see one developer's discussion here: https://nanxiao.me/en/message-length-setting-in-grpc/

While it is a good idea to break large messages into smaller ones, we will need to address how to gracefully send larger payloads that may be provided by a user's server implementation. Either we can increase the limit or break up the record batches into smaller chunks in the Flight server base (or both, of course)
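The "break up the record batches into smaller chunks" option can be sketched as a small planning step. This is an illustration only and does not use any Flight or gRPC API: the function name `chunk_row_ranges` is hypothetical, rows are assumed to have a fixed serialized width, and the 4 MiB default mirrors the limit mentioned above.

```python
def chunk_row_ranges(num_rows, bytes_per_row, max_payload_bytes=4 * 1024 * 1024):
    """Return (offset, length) row slices whose payloads fit under the limit.

    Assumes every row serializes to exactly bytes_per_row bytes, which is a
    simplification; variable-width columns would need a size estimate per slice.
    """
    if bytes_per_row <= 0:
        raise ValueError("bytes_per_row must be positive")
    rows_per_chunk = max_payload_bytes // bytes_per_row
    if rows_per_chunk == 0:
        raise ValueError("a single row exceeds the payload limit")
    return [
        (offset, min(rows_per_chunk, num_rows - offset))
        for offset in range(0, num_rows, rows_per_chunk)
    ]
```

For example, `chunk_row_ranges(10, 4, 16)` returns `[(0, 4), (4, 4), (8, 2)]`. A server base class could slice each outgoing batch along such ranges before writing; a row that is itself larger than the limit still requires raising the limit, which is why the issue mentions doing both.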
[jira] [Closed] (ARROW-3328) [Flight] Allow for optional unique flight identifier to be sent with FlightGetInfo
[ https://issues.apache.org/jira/browse/ARROW-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney closed ARROW-3328.
Resolution: Won't Fix

As discussed, let's leave this to be handled using serialized tickets.

> [Flight] Allow for optional unique flight identifier to be sent with FlightGetInfo
>
> Key: ARROW-3328
> URL: https://issues.apache.org/jira/browse/ARROW-3328
> Project: Apache Arrow
> Issue Type: Improvement
> Components: FlightRPC
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 0.13.0
>
> There could either be
> * A global identifier for the entire flight
> * Endpoint-specific identifiers
> A client could use these unique identifiers to perform other kinds of actions. An example would be retrieving logs or statistics about a get: you could see time spent writing the dataset to gRPC or time spent constructing the dataset before handing off to the gRPC write layer.
[jira] [Resolved] (ARROW-4345) [C++] Add Apache 2.0 license file to the Parquet-testing repository
[ https://issues.apache.org/jira/browse/ARROW-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-4345.
Resolution: Fixed
Assignee: Wes McKinney

Done in https://github.com/apache/parquet-testing/commit/8991d0b58d5a59925c87dd2a0bdb59a5a4a16bd4

> [C++] Add Apache 2.0 license file to the Parquet-testing repository
>
> Key: ARROW-4345
> URL: https://issues.apache.org/jira/browse/ARROW-4345
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, cpp
> Affects Versions: 0.12.0
> Reporter: Rylan Dmello
> Assignee: Wes McKinney
> Priority: Minor
>
> The parquet-testing repository is used as a git submodule in the Apache Arrow repository, but doesn't currently have a license file:
> [https://github.com/apache/arrow/tree/master/cpp/submodules]
[jira] [Updated] (ARROW-4345) [C++] Add Apache 2.0 license file to the Parquet-testing repository
[ https://issues.apache.org/jira/browse/ARROW-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4345:
Fix Version/s: 0.13.0
[jira] [Assigned] (ARROW-3289) [C++] Implement DoPut command for Flight on client and server side
[ https://issues.apache.org/jira/browse/ARROW-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-3289: --- Assignee: David Li > [C++] Implement DoPut command for Flight on client and server side > > > Key: ARROW-3289 > URL: https://issues.apache.org/jira/browse/ARROW-3289 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: David Li >Priority: Major > Labels: flight, pull-request-available > Fix For: 0.13.0 > > Time Spent: 10m > Remaining Estimate: 0h > > This was omitted from ARROW-3146 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755433#comment-16755433 ] Wes McKinney commented on ARROW-4313: - I recall an e-mail thread some time back about IBM POWER support -- some of us (myself, [~kou]) were given access to Power Z -based CI infrastructure for testing but we have yet to try it. I doubt that the project works on big endian right now (Arrow is currently little-endian, even running on big-endian hardware) > Define general benchmark database schema > > > Key: ARROW-4313 > URL: https://issues.apache.org/jira/browse/ARROW-4313 > Project: Apache Arrow > Issue Type: New Feature > Components: Benchmarking >Reporter: Wes McKinney >Priority: Major > Fix For: 0.13.0 > > Attachments: benchmark-data-model.erdplus, benchmark-data-model.png > > > Some possible attributes that the benchmark database should track, to permit > heterogeneity of hardware and programming languages > * Timestamp of benchmark run > * Git commit hash of codebase > * Machine unique name (sort of the "user id") > * CPU identification for machine, and clock frequency (in case of > overclocking) > * CPU cache sizes (L1/L2/L3) > * Whether or not CPU throttling is enabled (if it can be easily determined) > * RAM size > * GPU identification (if any) > * Benchmark unique name > * Programming language(s) associated with benchmark (e.g. a benchmark > may involve both C++ and Python) > * Benchmark time, plus mean and standard deviation if available, else NULL > see discussion on mailing list > https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4345) [C++] Add Apache 2.0 license file to the Parquet-testing repository
[ https://issues.apache.org/jira/browse/ARROW-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755426#comment-16755426 ] Rylan Dmello commented on ARROW-4345: - Hi Uwe, I don't think I can get to this until the next week or so. If you have the bandwidth for this, please feel free to take ownership of this story. > [C++] Add Apache 2.0 license file to the Parquet-testing repository > --- > > Key: ARROW-4345 > URL: https://issues.apache.org/jira/browse/ARROW-4345 > Project: Apache Arrow > Issue Type: Improvement > Components: C++, cpp >Affects Versions: 0.12.0 >Reporter: Rylan Dmello >Priority: Minor > > The parquet-testing repository is used as a git submodule in the Apache Arrow > repository, but doesn't currently have a license file: > [https://github.com/apache/arrow/tree/master/cpp/submodules] > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4420) [INTEGRATION] Pin spark's version to the recently released arrow 0.12 patch
Krisztian Szucs created ARROW-4420: -- Summary: [INTEGRATION] Pin spark's version to the recently released arrow 0.12 patch Key: ARROW-4420 URL: https://issues.apache.org/jira/browse/ARROW-4420 Project: Apache Arrow Issue Type: Bug Components: Integration Reporter: Krisztian Szucs Assignee: Krisztian Szucs As discussed in https://github.com/apache/arrow/pull/3300#discussion_r252026108 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-4296) [Plasma] Starting Plasma store with use_one_memory_mapped_file enabled crashes due to improper memory alignment
[ https://issues.apache.org/jira/browse/ARROW-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Nishihara resolved ARROW-4296. - Resolution: Fixed Issue resolved by pull request 3490 [https://github.com/apache/arrow/pull/3490] > [Plasma] Starting Plasma store with use_one_memory_mapped_file enabled > crashes due to improper memory alignment > --- > > Key: ARROW-4296 > URL: https://issues.apache.org/jira/browse/ARROW-4296 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Plasma (C++) >Affects Versions: 0.11.1 >Reporter: Anurag Khandelwal >Priority: Minor > Labels: pull-request-available > Fix For: 0.13.0 > > Time Spent: 5h > Remaining Estimate: 0h > > Starting Plasma with use_one_memory_mapped_file (-f flag) causes a crash, > most likely due to improper memory alignment. This can be resolved by > changing the dlmemalign call during initialization to use slightly smaller > memory (by ~8KB). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755383#comment-16755383 ] Antoine Pitrou commented on ARROW-4313: --- Multiple CPUs would go under the core_count IMO. As for mainframes, no, but AFAIK there are regular Linux-based (or AIX-based) POWER servers. > Define general benchmark database schema > > > Key: ARROW-4313 > URL: https://issues.apache.org/jira/browse/ARROW-4313 > Project: Apache Arrow > Issue Type: New Feature > Components: Benchmarking >Reporter: Wes McKinney >Priority: Major > Fix For: 0.13.0 > > Attachments: benchmark-data-model.erdplus, benchmark-data-model.png > > > Some possible attributes that the benchmark database should track, to permit > heterogeneity of hardware and programming languages > * Timestamp of benchmark run > * Git commit hash of codebase > * Machine unique name (sort of the "user id") > * CPU identification for machine, and clock frequency (in case of > overclocking) > * CPU cache sizes (L1/L2/L3) > * Whether or not CPU throttling is enabled (if it can be easily determined) > * RAM size > * GPU identification (if any) > * Benchmark unique name > * Programming language(s) associated with benchmark (e.g. a benchmark > may involve both C++ and Python) > * Benchmark time, plus mean and standard deviation if available, else NULL > see discussion on mailing list > https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store
[ https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755381#comment-16755381 ] Robert Nishihara commented on ARROW-4418: - If preferable, there is also a non-boost version of asio (https://think-async.com/Asio/AsioAndBoostAsio.html). I also remember thinking that asio is moving into the C++ standard library, though I can't seem to find a reference for that at the moment. The benefits of using asio are pretty big (Windows support for the Plasma store as well as using a more standard C++ approach than what we are currently doing). In terms of alternatives, I know that Philipp has looked into gRPC. Maybe he could elaborate on the pros/cons there? > [Plasma] replace event loop with boost::asio for plasma store > - > > Key: ARROW-4418 > URL: https://issues.apache.org/jira/browse/ARROW-4418 > Project: Apache Arrow > Issue Type: Improvement > Components: Plasma (C++) >Reporter: Zhijun Fu >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > Original text: > It would be nice to move plasma store from current event loop to boost::asio > to modernize the code, and more importantly to benefit from the > functionalities provided by asio, which I think also provides opportunities > for performance improvement. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755360#comment-16755360 ] Areg Melik-Adamyan commented on ARROW-4313: --- Ok, if we want to add them, then it should be named 'smt_thread_count' or 'threads_per_core'. And there is also a case for multiple CPUs. Do you anticipate using Arrow on mainframes? I would say that most likely FPGA usage will precede Power usage. > Define general benchmark database schema > > > Key: ARROW-4313 > URL: https://issues.apache.org/jira/browse/ARROW-4313 > Project: Apache Arrow > Issue Type: New Feature > Components: Benchmarking >Reporter: Wes McKinney >Priority: Major > Fix For: 0.13.0 > > Attachments: benchmark-data-model.erdplus, benchmark-data-model.png > > > Some possible attributes that the benchmark database should track, to permit > heterogeneity of hardware and programming languages > * Timestamp of benchmark run > * Git commit hash of codebase > * Machine unique name (sort of the "user id") > * CPU identification for machine, and clock frequency (in case of > overclocking) > * CPU cache sizes (L1/L2/L3) > * Whether or not CPU throttling is enabled (if it can be easily determined) > * RAM size > * GPU identification (if any) > * Benchmark unique name > * Programming language(s) associated with benchmark (e.g. a benchmark > may involve both C++ and Python) > * Benchmark time, plus mean and standard deviation if available, else NULL > see discussion on mailing list > https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755360#comment-16755360 ] Areg Melik-Adamyan edited comment on ARROW-4313 at 1/29/19 8:23 PM: Ok, if we want to add them, then it should be named 'smt_thread_count' or 'threads_per_core'. And there is a case for multiple CPUs also. Do you anticipate using Arrow on mainframes? I would say that most likely FPGA usage will precede Power usage. was (Author: aregm): Ok, if we want to add them, then it should be named 'smt_thread_count' or 'threads_per_core'. And there is also there is a case for multiple CPUs. Do you anticipate using Arrow on mainframes? I would say that most likely FPGA usage will precede Power usage. > Define general benchmark database schema > > > Key: ARROW-4313 > URL: https://issues.apache.org/jira/browse/ARROW-4313 > Project: Apache Arrow > Issue Type: New Feature > Components: Benchmarking >Reporter: Wes McKinney >Priority: Major > Fix For: 0.13.0 > > Attachments: benchmark-data-model.erdplus, benchmark-data-model.png > > > Some possible attributes that the benchmark database should track, to permit > heterogeneity of hardware and programming languages > * Timestamp of benchmark run > * Git commit hash of codebase > * Machine unique name (sort of the "user id") > * CPU identification for machine, and clock frequency (in case of > overclocking) > * CPU cache sizes (L1/L2/L3) > * Whether or not CPU throttling is enabled (if it can be easily determined) > * RAM size > * GPU identification (if any) > * Benchmark unique name > * Programming language(s) associated with benchmark (e.g. a benchmark > may involve both C++ and Python) > * Benchmark time, plus mean and standard deviation if available, else NULL > see discussion on mailing list > https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-3289) [C++] Implement DoPut command for Flight on client and server side
[ https://issues.apache.org/jira/browse/ARROW-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-3289: -- Labels: flight pull-request-available (was: flight) > [C++] Implement DoPut command for Flight on client and server side > > > Key: ARROW-3289 > URL: https://issues.apache.org/jira/browse/ARROW-3289 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: flight, pull-request-available > Fix For: 0.13.0 > > > This was omitted from ARROW-3146 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755350#comment-16755350 ] Antoine Pitrou commented on ARROW-4313: --- > there is a 'core_count', for IA it is better to have HT flag, for others > threads=cores Not really, for example IBM POWER CPUs can have 2, 4 or 8 threads per core. > Define general benchmark database schema > > > Key: ARROW-4313 > URL: https://issues.apache.org/jira/browse/ARROW-4313 > Project: Apache Arrow > Issue Type: New Feature > Components: Benchmarking >Reporter: Wes McKinney >Priority: Major > Fix For: 0.13.0 > > Attachments: benchmark-data-model.erdplus, benchmark-data-model.png > > > Some possible attributes that the benchmark database should track, to permit > heterogeneity of hardware and programming languages > * Timestamp of benchmark run > * Git commit hash of codebase > * Machine unique name (sort of the "user id") > * CPU identification for machine, and clock frequency (in case of > overclocking) > * CPU cache sizes (L1/L2/L3) > * Whether or not CPU throttling is enabled (if it can be easily determined) > * RAM size > * GPU identification (if any) > * Benchmark unique name > * Programming language(s) associated with benchmark (e.g. a benchmark > may involve both C++ and Python) > * Benchmark time, plus mean and standard deviation if available, else NULL > see discussion on mailing list > https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755291#comment-16755291 ] Areg Melik-Adamyan commented on ARROW-4313: --- * in `cpu_dim`, perhaps add a `cpu_thread_count` (the CPU's number of hardware threads, which can be a multiple of the number of distinct cores) ** there is a 'core_count', for IA it is better to have HT flag, for others threads=cores * either in `machine_dim` or `os_dim`, store the bitness? (usually 64-bit I suppose, though perhaps some people will want to benchmark on 32-bit). Or, more generally perhaps, the architecture name (such as "x86-64" or "ARMv8" or "AArch64"). ** short uname -i should be enough. * not sure why tables are suffixed with `_dim`? ** I guess those are conditional names and not necessarily the resulting. > Define general benchmark database schema > > > Key: ARROW-4313 > URL: https://issues.apache.org/jira/browse/ARROW-4313 > Project: Apache Arrow > Issue Type: New Feature > Components: Benchmarking >Reporter: Wes McKinney >Priority: Major > Fix For: 0.13.0 > > Attachments: benchmark-data-model.erdplus, benchmark-data-model.png > > > Some possible attributes that the benchmark database should track, to permit > heterogeneity of hardware and programming languages > * Timestamp of benchmark run > * Git commit hash of codebase > * Machine unique name (sort of the "user id") > * CPU identification for machine, and clock frequency (in case of > overclocking) > * CPU cache sizes (L1/L2/L3) > * Whether or not CPU throttling is enabled (if it can be easily determined) > * RAM size > * GPU identification (if any) > * Benchmark unique name > * Programming language(s) associated with benchmark (e.g. a benchmark > may involve both C++ and Python) > * Benchmark time, plus mean and standard deviation if available, else NULL > see discussion on mailing list > https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005)
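The column suggestions in this ARROW-4313 comment can be pictured with a small table definition. What follows is only a sketch, assuming SQLite through Python's standard `sqlite3` module. The table name `cpu_dim` and the columns `cpu_model_name`, `core_count` and `threads_per_core` are taken from the discussion; `cpu_id`, `architecture_name`, both helper functions and the sample row are hypothetical and not part of any agreed schema.

```python
import sqlite3


def create_cpu_dim(conn):
    """Create a hypothetical cpu_dim table following the thread's suggestions."""
    conn.execute(
        """
        CREATE TABLE cpu_dim (
            cpu_id            INTEGER PRIMARY KEY,   -- assumption: surrogate key
            cpu_model_name    TEXT NOT NULL UNIQUE,  -- unique, as asked in the thread
            architecture_name TEXT NOT NULL,         -- assumption: e.g. "x86-64", "AArch64"
            core_count        INTEGER NOT NULL CHECK (core_count > 0),
            threads_per_core  INTEGER NOT NULL CHECK (threads_per_core > 0)
        )
        """
    )


def hardware_thread_count(conn, cpu_model_name):
    """Return core_count * threads_per_core, or None for an unknown model."""
    row = conn.execute(
        "SELECT core_count * threads_per_core FROM cpu_dim WHERE cpu_model_name = ?",
        (cpu_model_name,),
    ).fetchone()
    return None if row is None else row[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_cpu_dim(conn)
    # Hypothetical row: an 8-core CPU with 4 hardware threads per core.
    conn.execute(
        "INSERT INTO cpu_dim (cpu_model_name, architecture_name, core_count,"
        " threads_per_core) VALUES (?, ?, ?, ?)",
        ("example-cpu", "ppc64le", 8, 4),
    )
    print(hardware_thread_count(conn, "example-cpu"))  # 32
```

Storing `threads_per_core` as an integer rather than a hyper-threading flag keeps the 2, 4 or 8 threads per core mentioned for IBM POWER representable, and the total thread count can still be derived instead of stored.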
[jira] [Updated] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store
[ https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-4418: -- Priority: Major (was: Minor) > [Plasma] replace event loop with boost::asio for plasma store > - > > Key: ARROW-4418 > URL: https://issues.apache.org/jira/browse/ARROW-4418 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Zhijun Fu >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > Original text: > It would be nice to move plasma store from current event loop to boost::asio > to modernize the code, and more importantly to benefit from the > functionalities provided by asio, which I think also provides opportunities > for performance improvement. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store
[ https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755252#comment-16755252 ] Antoine Pitrou commented on ARROW-4418: --- I think we should be careful to evaluate the cost of the boost::asio dependency. Handling boost dependencies is always delicate, especially when they come with a compiled library (i.e. the library isn't header-only). > [Plasma] replace event loop with boost::asio for plasma store > - > > Key: ARROW-4418 > URL: https://issues.apache.org/jira/browse/ARROW-4418 > Project: Apache Arrow > Issue Type: Improvement > Components: Plasma (C++) >Reporter: Zhijun Fu >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > Original text: > It would be nice to move plasma store from current event loop to boost::asio > to modernize the code, and more importantly to benefit from the > functionalities provided by asio, which I think also provides opportunities > for performance improvement. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store
[ https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-4418: -- Component/s: Plasma (C++) > [Plasma] replace event loop with boost::asio for plasma store > - > > Key: ARROW-4418 > URL: https://issues.apache.org/jira/browse/ARROW-4418 > Project: Apache Arrow > Issue Type: Improvement > Components: Plasma (C++) >Reporter: Zhijun Fu >Priority: Major > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > Original text: > It would be nice to move plasma store from current event loop to boost::asio > to modernize the code, and more importantly to benefit from the > functionalities provided by asio, which I think also provides opportunities > for performance improvement. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store
[ https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-4418: -- Labels: (was: pull-request-available) > [Plasma] replace event loop with boost::asio for plasma store > - > > Key: ARROW-4418 > URL: https://issues.apache.org/jira/browse/ARROW-4418 > Project: Apache Arrow > Issue Type: Improvement > Components: Plasma (C++) >Reporter: Zhijun Fu >Priority: Major > Time Spent: 40m > Remaining Estimate: 0h > > Original text: > It would be nice to move plasma store from current event loop to boost::asio > to modernize the code, and more importantly to benefit from the > functionalities provided by asio, which I think also provides opportunities > for performance improvement. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store
[ https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-4418: -- Description: Original text: It would be nice to move plasma store from current event loop to boost::asio to modernize the code, and more importantly to benefit from the functionalities provided by asio, which I think also provides opportunities for performance improvement. was:https://issues.apache.org/jira/browse/ARROW-4418 > [Plasma] replace event loop with boost::asio for plasma store > - > > Key: ARROW-4418 > URL: https://issues.apache.org/jira/browse/ARROW-4418 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Zhijun Fu >Priority: Minor > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > Original text: > It would be nice to move plasma store from current event loop to boost::asio > to modernize the code, and more importantly to benefit from the > functionalities provided by asio, which I think also provides opportunities > for performance improvement. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store
[ https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou updated ARROW-4418: -- Description: https://issues.apache.org/jira/browse/ARROW-4418 > [Plasma] replace event loop with boost::asio for plasma store > - > > Key: ARROW-4418 > URL: https://issues.apache.org/jira/browse/ARROW-4418 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Zhijun Fu >Priority: Minor > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > https://issues.apache.org/jira/browse/ARROW-4418 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4407) [C++] ExternalProject_Add does not capture CC/CXX correctly
[ https://issues.apache.org/jira/browse/ARROW-4407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755200#comment-16755200 ] Wes McKinney commented on ARROW-4407: - While it's a good idea to fix this particular issue, I think we should be careful about spending very much time on having the CMake build preserve the environment in which it was first invoked. Instead, we should encourage developers to keep their environment variables consistent (through the use of files that are sourced on each shell initialization, here is mine: https://github.com/wesm/dev-toolchain/blob/master/toolchain/arrow-toolchain.sh) while they are developing. > [C++] ExternalProject_Add does not capture CC/CXX correctly > --- > > Key: ARROW-4407 > URL: https://issues.apache.org/jira/browse/ARROW-4407 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Affects Versions: 0.12.0 >Reporter: Francois Saint-Jacques >Assignee: Francois Saint-Jacques >Priority: Minor > Labels: pull-request-available > Fix For: 0.13.0 > > Time Spent: 40m > Remaining Estimate: 0h > > The issue is that CC/CXX environment variables are captured on the first > invocation of the builder (e.g. make or ninja) instead of when CMake is > invoked in the build directory. This can lead to compilation errors (notably > when compiling with clang in the top directory due to the addition of the > `-Qunused-arguments` option). > This leads to an issue where I have a script that prepares the build directory > and exports CXX within the script. When I jump in the build folder, there's a > mismatch between the external gbenchmark (and all deps if conda is not used) > compiler and the build. > To reproduce: > # Create a new build directory with clang as compiler, don't build yet > # In a new shell (without the compiler environment variable), go into the > directory and invoke make/ninja -- This message was sent by Atlassian JIRA (v7.6.3#76005)
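The mismatch described in ARROW-4407 comes down to two snapshots of `CC`/`CXX` taken at different times. The sketch below is a hypothetical illustration in Python rather than anything from the Arrow build system: `snapshot_compilers` and `compiler_mismatches` are invented names, and the code only shows how a configure-time value can be compared with the environment seen later by make or ninja.

```python
import os

# The two variables named in the issue title.
COMPILER_VARS = ("CC", "CXX")


def snapshot_compilers(env):
    """Record the compiler variables visible in the given environment mapping."""
    return {name: env.get(name) for name in COMPILER_VARS}


def compiler_mismatches(configured, current_env):
    """Return {name: (configured_value, current_value)} for variables that differ."""
    mismatches = {}
    for name in COMPILER_VARS:
        current = current_env.get(name)
        if configured.get(name) != current:
            mismatches[name] = (configured.get(name), current)
    return mismatches


if __name__ == "__main__":
    # Pretend the build directory was configured with clang ...
    configured = snapshot_compilers({"CC": "clang", "CXX": "clang++"})
    # ... and compare against whatever the current shell exports.
    for name, (then, now) in compiler_mismatches(configured, os.environ).items():
        print(f"{name}: configured with {then!r}, current shell has {now!r}")
```

A new shell without the compiler variables, as in the reproduction steps, would report both `CC` and `CXX` as changed, which is the situation where externally built dependencies end up with a different compiler than the main build.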
[jira] [Resolved] (ARROW-4213) [Flight] C++ and Java implementations are incompatible
[ https://issues.apache.org/jira/browse/ARROW-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-4213. - Resolution: Fixed Issue resolved by pull request 3477 [https://github.com/apache/arrow/pull/3477] > [Flight] C++ and Java implementations are incompatible > -- > > Key: ARROW-4213 > URL: https://issues.apache.org/jira/browse/ARROW-4213 > Project: Apache Arrow > Issue Type: Bug > Components: FlightRPC >Reporter: David Li >Priority: Major > Labels: flight, pull-request-available > Fix For: 0.13.0 > > Time Spent: 5h 10m > Remaining Estimate: 0h > > A C++ client cannot request streams from a Java service, nor can it decode > the schema from GetFlightInfo. > Schema: in Java, GetFlightInfo encodes the schema directly via flatbuffers. > C++ expects it to be encoded as an IPC message. This isn't a problem in Java > as a method exists to decode such schemas, but in C++ the API for reading > such a schema isn't really exposed. I'm willing to submit a patch for this, > but it's not clear to me which scheme is preferred. > Streams: in Java, DoGet starts with an ArrowMessage containing a schema. C++ > does not expect this and segfaults when it tries to decode the message as a > record batch. Based on the presentations I've seen, I think C++ is in the > wrong here; I have a patch to fix this that I could clean up and submit. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store
[ https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-4418: -- Labels: pull-request-available (was: ) > [Plasma] replace event loop with boost::asio for plasma store > - > > Key: ARROW-4418 > URL: https://issues.apache.org/jira/browse/ARROW-4418 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Zhijun Fu >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-4213) [Flight] C++ and Java implementations are incompatible
[ https://issues.apache.org/jira/browse/ARROW-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney reassigned ARROW-4213: --- Assignee: David Li > [Flight] C++ and Java implementations are incompatible > -- > > Key: ARROW-4213 > URL: https://issues.apache.org/jira/browse/ARROW-4213 > Project: Apache Arrow > Issue Type: Bug > Components: FlightRPC >Reporter: David Li >Assignee: David Li >Priority: Major > Labels: flight, pull-request-available > Fix For: 0.13.0 > > Time Spent: 5h 20m > Remaining Estimate: 0h > > A C++ client cannot request streams from a Java service, nor can it decode > the schema from GetFlightInfo. > Schema: in Java, GetFlightInfo encodes the schema directly via flatbuffers. > C++ expects it to be encoded as an IPC message. This isn't a problem in Java > as a method exists to decode such schemas, but in C++ the API for reading > such a schema isn't really exposed. I'm willing to submit a patch for this, > but it's not clear to me which scheme is preferred. > Streams: in Java, DoGet starts with an ArrowMessage containing a schema. C++ > does not expect this and segfaults when it tries to decode the message as a > record batch. Based on the presentations I've seen, I think C++ is in the > wrong here; I have a patch to fix this that I could clean up and submit. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
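The stream layout described for the Java side of ARROW-4213 (a schema message first, record batches after it) can be sketched as a small reader. This is a hypothetical Python model, not the Flight API: messages are represented as `(kind, payload)` tuples and `split_stream` is an invented helper.

```python
def split_stream(messages):
    """Split a schema-first stream into (schema, [record batches]).

    Each message is a hypothetical (kind, payload) tuple where kind is
    "schema" or "record_batch"; real Flight messages are not tuples.
    """
    iterator = iter(messages)
    try:
        kind, payload = next(iterator)
    except StopIteration:
        raise ValueError("empty stream: expected a schema message first") from None
    if kind != "schema":
        # A reader that decodes the first message as a record batch would
        # misread the stream here; this sketch fails loudly instead.
        raise ValueError(f"expected a schema message first, got {kind!r}")
    schema = payload
    batches = []
    for kind, payload in iterator:
        if kind != "record_batch":
            raise ValueError(f"unexpected message kind after schema: {kind!r}")
        batches.append(payload)
    return schema, batches


if __name__ == "__main__":
    stream = [("schema", "s"), ("record_batch", "b0"), ("record_batch", "b1")]
    print(split_stream(stream))
```

Checking the kind of the first message before decoding turns the mismatch between the two implementations into an explicit error rather than a crash.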
[jira] [Commented] (ARROW-4419) [Flight] Deal with body buffers in FlightData
[ https://issues.apache.org/jira/browse/ARROW-4419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755194#comment-16755194 ] Wes McKinney commented on ARROW-4419: - It might be useful to write an ultra-minimal pure Python Flight server and client (using the generated Python grpc bindings) so that we can more easily test this kind of thing e.g. see https://github.com/apache/arrow/blob/master/java/flight/README.md#python-example-usage > [Flight] Deal with body buffers in FlightData > - > > Key: ARROW-4419 > URL: https://issues.apache.org/jira/browse/ARROW-4419 > Project: Apache Arrow > Issue Type: Improvement > Components: FlightRPC >Reporter: David Li >Priority: Minor > Labels: flight > > The Java implementation will fail to decode a schema message if the message > also contains (empty) body buffers (see ArrowMessage.asSchema's precondition > checks). However, clients using default Protobuf serialization will likely > write an empty body buffer by default. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4419) [Flight] Deal with body buffers in FlightData
[ https://issues.apache.org/jira/browse/ARROW-4419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4419: Summary: [Flight] Deal with body buffers in FlightData (was: Deal with body buffers in FlightData) > [Flight] Deal with body buffers in FlightData > - > > Key: ARROW-4419 > URL: https://issues.apache.org/jira/browse/ARROW-4419 > Project: Apache Arrow > Issue Type: Improvement > Components: FlightRPC >Reporter: David Li >Priority: Minor > Labels: flight > > The Java implementation will fail to decode a schema message if the message > also contains (empty) body buffers (see ArrowMessage.asSchema's precondition > checks). However, clients using default Protobuf serialization will likely > write an empty body buffer by default. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4419) Deal with body buffers in FlightData
David Li created ARROW-4419: --- Summary: Deal with body buffers in FlightData Key: ARROW-4419 URL: https://issues.apache.org/jira/browse/ARROW-4419 Project: Apache Arrow Issue Type: Improvement Components: FlightRPC Reporter: David Li The Java implementation will fail to decode a schema message if the message also contains (empty) body buffers (see ArrowMessage.asSchema's precondition checks). However, clients using default Protobuf serialization will likely write an empty body buffer by default. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
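One way to read the ARROW-4419 report is that a schema check could treat an empty body buffer the same as a missing one. The sketch below illustrates that idea in Python under stated assumptions: `FlightDataSketch` is a hypothetical stand-in for a decoded message, with field names borrowed from the Flight protobuf (`data_header`, `data_body`), and `schema_header` is an invented helper. It is not the Java `ArrowMessage.asSchema` implementation and not necessarily the fix that was adopted.

```python
from dataclasses import dataclass


@dataclass
class FlightDataSketch:
    """Hypothetical stand-in for a decoded FlightData message."""
    data_header: bytes = b""
    data_body: bytes = b""  # default Protobuf serialization may yield empty bytes


def schema_header(message):
    """Return the header of a schema message, tolerating an empty body."""
    if not message.data_header:
        raise ValueError("schema message has no header")
    # Only a non-empty body is rejected; b"" counts as "no body".
    if len(message.data_body) > 0:
        raise ValueError("schema message must not carry body data")
    return message.data_header


if __name__ == "__main__":
    print(schema_header(FlightDataSketch(data_header=b"schema-bytes")))
```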
[jira] [Resolved] (ARROW-4395) ts-node throws type error running `bin/arrow2csv.js`
[ https://issues.apache.org/jira/browse/ARROW-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs resolved ARROW-4395. Resolution: Fixed Fix Version/s: (was: 0.4.0) 0.13.0 Issue resolved by pull request 3504 [https://github.com/apache/arrow/pull/3504] > ts-node throws type error running `bin/arrow2csv.js` > > > Key: ARROW-4395 > URL: https://issues.apache.org/jira/browse/ARROW-4395 > Project: Apache Arrow > Issue Type: Bug > Components: JavaScript >Affects Versions: 0.4.0 >Reporter: Paul Taylor >Assignee: Paul Taylor >Priority: Major > Labels: pull-request-available > Fix For: 0.13.0 > > Time Spent: 40m > Remaining Estimate: 0h > > ts-node is being too strict, throws this (inaccurate) error JIT'ing the TS > source: > {code:none} > $ cat test/data/cpp/stream/simple.arrow | ./bin/arrow2csv.js > /home/ptaylor/dev/arrow/js/node_modules/ts-node/src/index.ts:228 > return new TSError(diagnosticText, diagnosticCodes) >^ > TSError: ⨯ Unable to compile TypeScript: > src/vector/map.ts(25,57): error TS2345: Argument of type 'Field number | symbol]>[]' is not assignable to parameter of type 'Field T]>[]'. > Type 'Field' is not assignable to type > 'Field'. > Type 'T[string] | T[number] | T[symbol]' is not assignable to type > 'T[keyof T]'. > Type 'T[symbol]' is not assignable to type 'T[keyof T]'. > Type 'DataType' is not assignable to type 'T[keyof T]'. > Type 'symbol' is not assignable to type 'keyof T'. > Type 'symbol' is not assignable to type 'string | number'. > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store
Zhijun Fu created ARROW-4418: Summary: [Plasma] replace event loop with boost::asio for plasma store Key: ARROW-4418 URL: https://issues.apache.org/jira/browse/ARROW-4418 Project: Apache Arrow Issue Type: Improvement Reporter: Zhijun Fu -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4413) [Python] pyarrow.hdfs.connect() failing
[ https://issues.apache.org/jira/browse/ARROW-4413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-4413: Fix Version/s: 0.13.0 > [Python] pyarrow.hdfs.connect() failing > --- > > Key: ARROW-4413 > URL: https://issues.apache.org/jira/browse/ARROW-4413 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.0 > Environment: Python 2.7 > Hadoop distribution: Amazon 2.7.3 > Hive 2.1.1 > Spark 2.1.1 > Tez 0.8.4 > Linux 4.4.35-33.55.amzn1.x86_64 >Reporter: Bradley Grantham >Priority: Major > Fix For: 0.13.0 > > > Trying to connect to hdfs using the below snippet. Using {{hadoop-libhdfs}}. > This error appears in {{v0.12.0}}. It doesn't appear in {{v0.11.1}}. (I used > the same environment when testing that it still worked on {{v0.11.1}}) > > {code:java} > In [1]: import pyarrow as pa > In [2]: fs = pa.hdfs.connect() > --- > TypeError Traceback (most recent call last) > in () > > 1 fs = pa.hdfs.connect() > /usr/local/lib64/python2.7/site-packages/pyarrow/hdfs.pyc in connect(host, > port, user, kerb_ticket, driver, extra_conf) > 205 fs = HadoopFileSystem(host=host, port=port, user=user, > 206 kerb_ticket=kerb_ticket, driver=driver, > --> 207 extra_conf=extra_conf) > 208 return fs > /usr/local/lib64/python2.7/site-packages/pyarrow/hdfs.pyc in __init__(self, > host, port, user, kerb_ticket, driver, extra_conf) > 36 _maybe_set_hadoop_classpath() > 37 > ---> 38 self._connect(host, port, user, kerb_ticket, driver, > extra_conf) > 39 > 40 def __reduce__(self): > /usr/local/lib64/python2.7/site-packages/pyarrow/io-hdfs.pxi in > pyarrow.lib.HadoopFileSystem._connect() > 72 if host is not None: > 73 conf.host = tobytes(host) > ---> 74 self.host = host > 75 > 76 conf.port = port > TypeError: Expected unicode, got str > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4412) [DOCUMENTATION] Add explicit version numbers to the arrow specification documents.
[ https://issues.apache.org/jira/browse/ARROW-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755108#comment-16755108 ] Wes McKinney commented on ARROW-4412: - The version number is in the top left of http://arrow.apache.org/docs/format/README.html (it is a dev version number right now... we should fix that) > [DOCUMENTATION] Add explicit version numbers to the arrow specification > documents. > -- > > Key: ARROW-4412 > URL: https://issues.apache.org/jira/browse/ARROW-4412 > Project: Apache Arrow > Issue Type: Improvement > Components: Documentation >Reporter: Micah Kornfield >Priority: Minor > > Based on conversation on the mailing list it might pay to include > version/revision numbers on the specification document. One way is to > include the "release" version, another might be to only update versioning on > changes to the document. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4313) Define general benchmark database schema
[ https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754973#comment-16754973 ] Antoine Pitrou commented on ARROW-4313: --- Some thoughts: * in `cpu_dim`, perhaps add a `cpu_thread_count` (the CPU's number of hardware threads, which can be a multiple of the number of distinct cores) * either in `machine_dim` or `os_dim`, store the bitness? (usually 64-bit I suppose, though perhaps some people will want to benchmark on 32-bit). Or, more generally perhaps, the architecture name (such as "x86-64" or "ARMv8" or "AArch64"). * not sure why tables are suffixed with `_dim`? > Define general benchmark database schema > > > Key: ARROW-4313 > URL: https://issues.apache.org/jira/browse/ARROW-4313 > Project: Apache Arrow > Issue Type: New Feature > Components: Benchmarking >Reporter: Wes McKinney >Priority: Major > Fix For: 0.13.0 > > Attachments: benchmark-data-model.erdplus, benchmark-data-model.png > > > Some possible attributes that the benchmark database should track, to permit > heterogeneity of hardware and programming languages > * Timestamp of benchmark run > * Git commit hash of codebase > * Machine unique name (sort of the "user id") > * CPU identification for machine, and clock frequency (in case of > overclocking) > * CPU cache sizes (L1/L2/L3) > * Whether or not CPU throttling is enabled (if it can be easily determined) > * RAM size > * GPU identification (if any) > * Benchmark unique name > * Programming language(s) associated with benchmark (e.g. a benchmark > may involve both C++ and Python) > * Benchmark time, plus mean and standard deviation if available, else NULL > see discussion on mailing list > https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E -- This message was sent by Atlassian JIRA (v7.6.3#76005)
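The suggestions in the comment above (a `cpu_thread_count` column in `cpu_dim`, and an architecture name stored with the machine) can be sketched roughly as follows. This is only an illustration using Python's built-in `sqlite3` module, not the schema from the attached data model: apart from `cpu_dim`, `machine_dim`, `cpu_thread_count` and `cpu_model_name`, every table layout, column name and helper function here is an assumption.

```python
import sqlite3

# Hypothetical, trimmed-down version of two dimension tables.
# Only cpu_dim, machine_dim, cpu_model_name and cpu_thread_count are
# names taken from the discussion; everything else is illustrative.
SCHEMA = """
CREATE TABLE cpu_dim (
    cpu_id INTEGER PRIMARY KEY,
    cpu_model_name TEXT NOT NULL,
    cpu_core_count INTEGER NOT NULL,
    -- number of hardware threads; never smaller than the core count
    cpu_thread_count INTEGER NOT NULL,
    CHECK (cpu_thread_count >= cpu_core_count)
);
CREATE TABLE machine_dim (
    machine_id INTEGER PRIMARY KEY,
    machine_name TEXT NOT NULL UNIQUE,
    -- e.g. "x86-64", "ARMv8" or "AArch64"; this also implies the bitness
    architecture_name TEXT NOT NULL,
    cpu_id INTEGER NOT NULL REFERENCES cpu_dim (cpu_id)
);
"""


def create_schema(conn):
    """Create the two illustrative tables on an open connection."""
    conn.executescript(SCHEMA)


def add_machine(conn, machine_name, architecture_name,
                cpu_model_name, core_count, thread_count):
    """Insert a CPU row and a machine row that refers to it; return the CPU id."""
    cur = conn.execute(
        "INSERT INTO cpu_dim (cpu_model_name, cpu_core_count, cpu_thread_count) "
        "VALUES (?, ?, ?)",
        (cpu_model_name, core_count, thread_count),
    )
    cpu_id = cur.lastrowid
    conn.execute(
        "INSERT INTO machine_dim (machine_name, architecture_name, cpu_id) "
        "VALUES (?, ?, ?)",
        (machine_name, architecture_name, cpu_id),
    )
    return cpu_id
```

Storing the architecture name instead of a separate bitness flag follows the "more generally perhaps" option in the comment; a plain 32/64 column would work just as well if only bitness matters.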
[jira] [Resolved] (ARROW-4417) [C++] Doc build broken
[ https://issues.apache.org/jira/browse/ARROW-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-4417. --- Resolution: Fixed Fix Version/s: 0.13.0 Issue resolved by pull request 3521 [https://github.com/apache/arrow/pull/3521] > [C++] Doc build broken > -- > > Key: ARROW-4417 > URL: https://issues.apache.org/jira/browse/ARROW-4417 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration, Documentation >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > Fix For: 0.13.0 > > Time Spent: 20m > Remaining Estimate: 0h > > See https://travis-ci.org/apache/arrow/jobs/485716603#L4746 > {code} > /home/travis/build/apache/arrow/cpp/src/arrow/compute/kernel.h:170: error: > The following parameters of arrow::compute::UnaryKernel::Call(FunctionContext > *ctx, const Datum &input, Datum *out)=0 are not documented: > parameter 'ctx' > parameter 'input' (warning treated as error, aborting now) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-4414) [C++] Stop using cmake COMMAND_EXPAND_LISTS because it breaks package builds for older distros
[ https://issues.apache.org/jira/browse/ARROW-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Szucs reassigned ARROW-4414: -- Assignee: Krisztian Szucs > [C++] Stop using cmake COMMAND_EXPAND_LISTS because it breaks package builds > for older distros > -- > > Key: ARROW-4414 > URL: https://issues.apache.org/jira/browse/ARROW-4414 > Project: Apache Arrow > Issue Type: Bug >Reporter: Krisztian Szucs >Assignee: Krisztian Szucs >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > The COMMAND_EXPAND_LISTS option of add_custom_command is too new for the CMake versions shipped with Ubuntu Xenial > and Debian stretch. It has only been available since CMake 3.8: > https://cmake.org/cmake/help/v3.8/command/add_custom_command.html > We need to stop using it in cpp/src/gandiva/precompiled/CMakeLists.txt > Also, we should pin CMake to version 3.5 in Travis builds (Xenial ships CMake > 3.5). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4414) [C++] Stop using cmake COMMAND_EXPAND_LISTS because it breaks package builds for older distros
[ https://issues.apache.org/jira/browse/ARROW-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-4414: -- Labels: pull-request-available (was: ) > [C++] Stop using cmake COMMAND_EXPAND_LISTS because it breaks package builds > for older distros > -- > > Key: ARROW-4414 > URL: https://issues.apache.org/jira/browse/ARROW-4414 > Project: Apache Arrow > Issue Type: Bug >Reporter: Krisztian Szucs >Priority: Major > Labels: pull-request-available > > The COMMAND_EXPAND_LISTS option of add_custom_command is too new for the CMake versions shipped with Ubuntu Xenial > and Debian stretch. It has only been available since CMake 3.8: > https://cmake.org/cmake/help/v3.8/command/add_custom_command.html > We need to stop using it in cpp/src/gandiva/precompiled/CMakeLists.txt > Also, we should pin CMake to version 3.5 in Travis builds (Xenial ships CMake > 3.5). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4417) [C++] Doc build broken
[ https://issues.apache.org/jira/browse/ARROW-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-4417: -- Labels: pull-request-available (was: ) > [C++] Doc build broken > -- > > Key: ARROW-4417 > URL: https://issues.apache.org/jira/browse/ARROW-4417 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration, Documentation >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > Labels: pull-request-available > > See https://travis-ci.org/apache/arrow/jobs/485716603#L4746 > {code} > /home/travis/build/apache/arrow/cpp/src/arrow/compute/kernel.h:170: error: > The following parameters of arrow::compute::UnaryKernel::Call(FunctionContext > *ctx, const Datum &input, Datum *out)=0 are not documented: > parameter 'ctx' > parameter 'input' (warning treated as error, aborting now) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4416) [CI] Build gandiva in cpp docker image
Krisztian Szucs created ARROW-4416: -- Summary: [CI] Build gandiva in cpp docker image Key: ARROW-4416 URL: https://issues.apache.org/jira/browse/ARROW-4416 Project: Apache Arrow Issue Type: Bug Reporter: Krisztian Szucs Currently Gandiva is not built in the cpp docker image; for the sake of completeness, enable it by default. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4417) [C++] Doc build broken
Antoine Pitrou created ARROW-4417: - Summary: [C++] Doc build broken Key: ARROW-4417 URL: https://issues.apache.org/jira/browse/ARROW-4417 Project: Apache Arrow Issue Type: Bug Components: C++, Continuous Integration, Documentation Reporter: Antoine Pitrou See https://travis-ci.org/apache/arrow/jobs/485716603#L4746 {code} /home/travis/build/apache/arrow/cpp/src/arrow/compute/kernel.h:170: error: The following parameters of arrow::compute::UnaryKernel::Call(FunctionContext *ctx, const Datum &input, Datum *out)=0 are not documented: parameter 'ctx' parameter 'input' (warning treated as error, aborting now) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (ARROW-4417) [C++] Doc build broken
[ https://issues.apache.org/jira/browse/ARROW-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-4417: - Assignee: Antoine Pitrou > [C++] Doc build broken > -- > > Key: ARROW-4417 > URL: https://issues.apache.org/jira/browse/ARROW-4417 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration, Documentation >Reporter: Antoine Pitrou >Assignee: Antoine Pitrou >Priority: Major > > See https://travis-ci.org/apache/arrow/jobs/485716603#L4746 > {code} > /home/travis/build/apache/arrow/cpp/src/arrow/compute/kernel.h:170: error: > The following parameters of arrow::compute::UnaryKernel::Call(FunctionContext > *ctx, const Datum &input, Datum *out)=0 are not documented: > parameter 'ctx' > parameter 'input' (warning treated as error, aborting now) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4414) [C++] Stop using cmake COMMAND_EXPAND_LISTS because it breaks package builds for older distros
[ https://issues.apache.org/jira/browse/ARROW-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754864#comment-16754864 ] Uwe L. Korn commented on ARROW-4414: Args. I don't know how else to pass the parameters to the Gandiva command in the manylinux1 build. It would be good if someone else could have a look at it. > [C++] Stop using cmake COMMAND_EXPAND_LISTS because it breaks package builds > for older distros > -- > > Key: ARROW-4414 > URL: https://issues.apache.org/jira/browse/ARROW-4414 > Project: Apache Arrow > Issue Type: Bug >Reporter: Krisztian Szucs >Priority: Major > > The COMMAND_EXPAND_LISTS option of add_custom_command is too new for the CMake versions shipped with Ubuntu Xenial > and Debian stretch. It has only been available since CMake 3.8: > https://cmake.org/cmake/help/v3.8/command/add_custom_command.html > We need to stop using it in cpp/src/gandiva/precompiled/CMakeLists.txt > Also, we should pin CMake to version 3.5 in Travis builds (Xenial ships CMake > 3.5). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4415) [Doc] Port run_site docker to the new compose setup
Krisztian Szucs created ARROW-4415: -- Summary: [Doc] Port run_site docker to the new compose setup Key: ARROW-4415 URL: https://issues.apache.org/jira/browse/ARROW-4415 Project: Apache Arrow Issue Type: Bug Components: Documentation Reporter: Krisztian Szucs Assignee: Krisztian Szucs Eventually all docker related code under https://github.com/apache/arrow/tree/master/dev should be moved to the new docker-compose setup defined in the top-level docker-compose.yml -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4414) [C++] Stop using cmake COMMAND_EXPAND_LISTS because it breaks package builds for older distros
Krisztian Szucs created ARROW-4414: -- Summary: [C++] Stop using cmake COMMAND_EXPAND_LISTS because it breaks package builds for older distros Key: ARROW-4414 URL: https://issues.apache.org/jira/browse/ARROW-4414 Project: Apache Arrow Issue Type: Bug Reporter: Krisztian Szucs The COMMAND_EXPAND_LISTS option of add_custom_command is too new for the CMake versions shipped with Ubuntu Xenial and Debian stretch. It has only been available since CMake 3.8: https://cmake.org/cmake/help/v3.8/command/add_custom_command.html We need to stop using it in cpp/src/gandiva/precompiled/CMakeLists.txt Also, we should pin CMake to version 3.5 in Travis builds (Xenial ships CMake 3.5). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
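The version constraint described in this issue can be made concrete with a small check. The sketch below is only an illustration in Python, not part of the Arrow build: the helper names are hypothetical, and the only facts taken from the issue are that COMMAND_EXPAND_LISTS requires CMake 3.8 and that Xenial ships CMake 3.5.

```python
# First CMake release that understands COMMAND_EXPAND_LISTS, per the issue text.
MIN_CMAKE_FOR_EXPAND_LISTS = (3, 8)


def parse_cmake_version(version):
    """Turn a version string such as "3.5" or "3.10.2" into (major, minor)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def supports_command_expand_lists(version):
    """Return True if this CMake version accepts COMMAND_EXPAND_LISTS."""
    # Tuples compare element by element, so (3, 10) correctly sorts after (3, 8).
    return parse_cmake_version(version) >= MIN_CMAKE_FOR_EXPAND_LISTS
```

With the pin proposed above, `supports_command_expand_lists("3.5")` is false, so a Travis build using CMake 3.5 would catch any reintroduction of the option before it reaches the package builds.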
[jira] [Commented] (ARROW-3954) [Rust] Add Slice to Array and ArrayData
[ https://issues.apache.org/jira/browse/ARROW-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754831#comment-16754831 ] Neville Dipale commented on ARROW-3954: --- Thanks, Chao. I might be out of my depth with this one; I mainly struggled with ArrayData's buffers before I gave up. > [Rust] Add Slice to Array and ArrayData > --- > > Key: ARROW-3954 > URL: https://issues.apache.org/jira/browse/ARROW-3954 > Project: Apache Arrow > Issue Type: Improvement > Components: Rust >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 0.13.0 > > > Similar to C++, we should be able to construct a zero-copy slice from {{Array}} > and {{ArrayData}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-4413) [Python] pyarrow.hdfs.connect() failing
Bradley Grantham created ARROW-4413: --- Summary: [Python] pyarrow.hdfs.connect() failing Key: ARROW-4413 URL: https://issues.apache.org/jira/browse/ARROW-4413 Project: Apache Arrow Issue Type: Bug Components: Python Affects Versions: 0.12.0 Environment: Python 2.7 Hadoop distribution: Amazon 2.7.3 Hive 2.1.1 Spark 2.1.1 Tez 0.8.4 Linux 4.4.35-33.55.amzn1.x86_64 Reporter: Bradley Grantham Trying to connect to hdfs using the below snippet. Using {{hadoop-libhdfs}}. This error appears in {{v0.12.0}}. It doesn't appear in {{v0.11.1}}. (I used the same environment when testing that it still worked on {{v0.11.1}}) {code:java} In [1]: import pyarrow as pa In [2]: fs = pa.hdfs.connect() --- TypeError Traceback (most recent call last) in () > 1 fs = pa.hdfs.connect() /usr/local/lib64/python2.7/site-packages/pyarrow/hdfs.pyc in connect(host, port, user, kerb_ticket, driver, extra_conf) 205 fs = HadoopFileSystem(host=host, port=port, user=user, 206 kerb_ticket=kerb_ticket, driver=driver, --> 207 extra_conf=extra_conf) 208 return fs /usr/local/lib64/python2.7/site-packages/pyarrow/hdfs.pyc in __init__(self, host, port, user, kerb_ticket, driver, extra_conf) 36 _maybe_set_hadoop_classpath() 37 ---> 38 self._connect(host, port, user, kerb_ticket, driver, extra_conf) 39 40 def __reduce__(self): /usr/local/lib64/python2.7/site-packages/pyarrow/io-hdfs.pxi in pyarrow.lib.HadoopFileSystem._connect() 72 if host is not None: 73 conf.host = tobytes(host) ---> 74 self.host = host 75 76 conf.port = port TypeError: Expected unicode, got str {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
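The traceback ends in `TypeError: Expected unicode, got str`, raised when a Python 2 byte string is assigned to `self.host`. One common way to avoid this class of error is to coerce the value to text before the assignment. The helper below is a minimal sketch of that idea; it is a hypothetical function written for illustration, not pyarrow code and not the patch from the gist linked in the comments on this issue.

```python
def ensure_text(value, encoding="utf-8"):
    """Return value as a text string, decoding it first if it is bytes.

    On Python 2, bytes is an alias of str, so a plain str literal is
    decoded to unicode; on Python 3, str values pass through unchanged.
    Non-string values such as None are returned as they are.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```

Under this assumption, an assignment such as `self.host = ensure_text(host)` would accept both byte strings and text strings on either Python version.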
[jira] [Comment Edited] (ARROW-4413) [Python] pyarrow.hdfs.connect() failing
[ https://issues.apache.org/jira/browse/ARROW-4413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754796#comment-16754796 ] Antoine Pitrou edited comment on ARROW-4413 at 1/29/19 9:45 AM: The following patch would probably work, but I don't know how to test it: https://gist.github.com/pitrou/1ee2e1b04543cddead11a146938d9e80 was (Author: pitrou): The following patch would probably work, but I don't know how test it: https://gist.github.com/pitrou/1ee2e1b04543cddead11a146938d9e80 > [Python] pyarrow.hdfs.connect() failing > --- > > Key: ARROW-4413 > URL: https://issues.apache.org/jira/browse/ARROW-4413 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.0 > Environment: Python 2.7 > Hadoop distribution: Amazon 2.7.3 > Hive 2.1.1 > Spark 2.1.1 > Tez 0.8.4 > Linux 4.4.35-33.55.amzn1.x86_64 >Reporter: Bradley Grantham >Priority: Major > > Trying to connect to hdfs using the below snippet. Using {{hadoop-libhdfs}}. > This error appears in {{v0.12.0}}. It doesn't appear in {{v0.11.1}}. (I used > the same environment when testing that it still worked on {{v0.11.1}}) > > {code:java} > In [1]: import pyarrow as pa > In [2]: fs = pa.hdfs.connect() > --- > TypeError Traceback (most recent call last) > in () > > 1 fs = pa.hdfs.connect() > /usr/local/lib64/python2.7/site-packages/pyarrow/hdfs.pyc in connect(host, > port, user, kerb_ticket, driver, extra_conf) > 205 fs = HadoopFileSystem(host=host, port=port, user=user, > 206 kerb_ticket=kerb_ticket, driver=driver, > --> 207 extra_conf=extra_conf) > 208 return fs > /usr/local/lib64/python2.7/site-packages/pyarrow/hdfs.pyc in __init__(self, > host, port, user, kerb_ticket, driver, extra_conf) > 36 _maybe_set_hadoop_classpath() > 37 > ---> 38 self._connect(host, port, user, kerb_ticket, driver, > extra_conf) > 39 > 40 def __reduce__(self): > /usr/local/lib64/python2.7/site-packages/pyarrow/io-hdfs.pxi in > pyarrow.lib.HadoopFileSystem._connect() > 72 if host is not None: > 73 conf.host = tobytes(host) > ---> 74 self.host = host > 75 > 76 conf.port = port > TypeError: Expected unicode, got str > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4413) [Python] pyarrow.hdfs.connect() failing
[ https://issues.apache.org/jira/browse/ARROW-4413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754796#comment-16754796 ] Antoine Pitrou commented on ARROW-4413: --- The following patch would probably work, but I don't know how to test it: https://gist.github.com/pitrou/1ee2e1b04543cddead11a146938d9e80 > [Python] pyarrow.hdfs.connect() failing > --- > > Key: ARROW-4413 > URL: https://issues.apache.org/jira/browse/ARROW-4413 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.0 > Environment: Python 2.7 > Hadoop distribution: Amazon 2.7.3 > Hive 2.1.1 > Spark 2.1.1 > Tez 0.8.4 > Linux 4.4.35-33.55.amzn1.x86_64 >Reporter: Bradley Grantham >Priority: Major > > Trying to connect to hdfs using the below snippet. Using {{hadoop-libhdfs}}. > This error appears in {{v0.12.0}}. It doesn't appear in {{v0.11.1}}. (I used > the same environment when testing that it still worked on {{v0.11.1}}) > > {code:java} > In [1]: import pyarrow as pa > In [2]: fs = pa.hdfs.connect() > --- > TypeError Traceback (most recent call last) > in () > > 1 fs = pa.hdfs.connect() > /usr/local/lib64/python2.7/site-packages/pyarrow/hdfs.pyc in connect(host, > port, user, kerb_ticket, driver, extra_conf) > 205 fs = HadoopFileSystem(host=host, port=port, user=user, > 206 kerb_ticket=kerb_ticket, driver=driver, > --> 207 extra_conf=extra_conf) > 208 return fs > /usr/local/lib64/python2.7/site-packages/pyarrow/hdfs.pyc in __init__(self, > host, port, user, kerb_ticket, driver, extra_conf) > 36 _maybe_set_hadoop_classpath() > 37 > ---> 38 self._connect(host, port, user, kerb_ticket, driver, > extra_conf) > 39 > 40 def __reduce__(self): > /usr/local/lib64/python2.7/site-packages/pyarrow/io-hdfs.pxi in > pyarrow.lib.HadoopFileSystem._connect() > 72 if host is not None: > 73 conf.host = tobytes(host) > ---> 74 self.host = host > 75 > 76 conf.port = port > TypeError: Expected unicode, got str > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)