[jira] [Commented] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755808#comment-16755808
 ] 

Antoine Pitrou commented on ARROW-4313:
---

"Is it a mistake to make `cpu.cpu_model_name` unique? I mean, are the LX cache 
levels, core counts, or any other attribute ever different for the same CPU 
model string?"

The overclocked frequency (which we could also call the "actual frequency") 
may vary; the rest should be the same.

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755807#comment-16755807
 ] 

Antoine Pitrou commented on ARROW-4313:
---

For the record, IBM POWER CPUs support little-endian mode on Linux:
https://www.ibm.com/developerworks/library/l-power-little-endian-faq-trs/index.html

So the lack of big-endian support in Arrow would probably not be a roadblock.



[jira] [Commented] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store

2019-01-29 Thread Zhijun Fu (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755804#comment-16755804
 ] 

Zhijun Fu commented on ARROW-4418:
--

In addition to the benefits that Robert mentioned above, asio also provides 
opportunities for performance improvements through facilities such as the io 
service and thread pools.

In our internal testing, which uses 10+ actors on a single machine, I found that 
50% of the plasma store's CPU time is spent receiving messages from plasma 
clients over UNIX domain sockets.

I'm thinking that one way to improve performance is like this: 
 * Use a pool of threads to receive messages from clients. To ensure correct 
behavior, we can bind a boost::strand to each client, so that all the 
messages from a given client arrive in order. As this part is CPU-intensive, 
using multiple threads is going to help.
 * After this, the messages are posted to the io service of the main thread, 
which calls ProcessMessages for each of them in order.
 * After this, post the replies to a pool of threads, again using a 
boost::strand for each plasma client to ensure correct order. 

I think this would probably help in cases where multiple workers use the plasma 
store on the same machine, which should be very common. And it seems that 
implementing this would be hard without the functionality asio provides.
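
The per-client ordering idea above can be illustrated with a minimal Python 
sketch. This is not boost::asio and not Plasma code: the `Strand` class and 
`demo` function are hypothetical names invented here, and the sketch only shows 
the general technique of running each client's tasks serially, in submission 
order, while different clients share one thread pool. Error handling is 
omitted (a task that raises would stall its strand).

```python
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor


class Strand:
    """Runs posted tasks one at a time, in posting order, on a shared pool."""

    def __init__(self, pool):
        self._pool = pool
        self._tasks = deque()
        self._lock = threading.Lock()
        self._running = False  # True while a drain loop owns this strand

    def post(self, fn, *args):
        with self._lock:
            self._tasks.append((fn, args))
            if self._running:
                return  # the active drain loop will pick this task up
            self._running = True
        self._pool.submit(self._drain)

    def _drain(self):
        # Only one drain loop per strand is ever active, so tasks never overlap.
        while True:
            with self._lock:
                if not self._tasks:
                    self._running = False
                    return
                fn, args = self._tasks.popleft()
            fn(*args)


def demo(num_clients=3, messages_per_client=50):
    """Post interleaved messages for several clients; return what each received."""
    received = {client: [] for client in range(num_clients)}
    with ThreadPoolExecutor(max_workers=4) as pool:
        strands = {client: Strand(pool) for client in range(num_clients)}
        for seq in range(messages_per_client):
            for client in range(num_clients):
                strands[client].post(received[client].append, seq)
    # Leaving the "with" block waits for all submitted work to finish.
    return received
```

Each client's list ends up in sequence order even though four worker threads 
process the clients concurrently, which is the property the proposal relies on.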

Thoughts?

> [Plasma] replace event loop with boost::asio for plasma store
> -
>
> Key: ARROW-4418
> URL: https://issues.apache.org/jira/browse/ARROW-4418
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Reporter: Zhijun Fu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Original text:
> It would be nice to move plasma store from current event loop to boost::asio 
> to modernize the code, and more importantly to benefit from the 
> functionalities provided by asio, which I think also provides opportunities 
> for performance improvement.





[jira] [Updated] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store

2019-01-29 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-4418:
--
Labels: pull-request-available  (was: )

> [Plasma] replace event loop with boost::asio for plasma store
> -
>
> Key: ARROW-4418
> URL: https://issues.apache.org/jira/browse/ARROW-4418
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Reporter: Zhijun Fu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Original text:
> It would be nice to move plasma store from current event loop to boost::asio 
> to modernize the code, and more importantly to benefit from the 
> functionalities provided by asio, which I think also provides opportunities 
> for performance improvement.





[jira] [Commented] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755570#comment-16755570
 ] 

Tanya Schlusser commented on ARROW-4313:


I think part of this was to allow anybody to contribute benchmarks from their 
own machine. And while dedicated benchmarking machines like the ones you will 
set up will have all parameters set for optimal benchmarking, benchmarks run on 
other machines may give different results. Collecting details about the machine 
that might explain those differences (in case someone cares to explore the 
dataset) is part of the goal of the data model.

One concern, of course, is that people get wildly different results from what a 
benchmark reports, and may say "Oh boo–the representative person from the company 
made fake results that I can't replicate on my machine" ... whereas with details 
about a system recorded, performance differences can maybe be traced back to 
differences in setup.

Not all fields need to be filled out all the time. My priorities are:
 # Identifying which fields are flat-out wrong
 # Differentiating between necessary columns and extraneous ones that can be 
left null


To me, it is not a big deal to have an extra column dangling around that almost 
nobody uses. No harm. (Unless it's mislabeled or otherwise wrong; that's what 
I'm interested in getting out of the discussion here.)



[jira] [Commented] (ARROW-4422) [Plasma] Enforce memory limit in plasma, rather than relying on dlmalloc_set_footprint_limit

2019-01-29 Thread Anurag Khandelwal (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755526#comment-16755526
 ] 

Anurag Khandelwal commented on ARROW-4422:
--

cc [~pcmoritz]

> [Plasma] Enforce memory limit in plasma, rather than relying on 
> dlmalloc_set_footprint_limit
> 
>
> Key: ARROW-4422
> URL: https://issues.apache.org/jira/browse/ARROW-4422
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Plasma (C++)
>Affects Versions: 0.12.0
>Reporter: Anurag Khandelwal
>Assignee: Anurag Khandelwal
>Priority: Minor
> Fix For: 0.13.0
>
>
> Currently, Plasma relies on dlmalloc_set_footprint_limit to limit the memory 
> utilization for Plasma Store. This is restrictive because:
>  * It restricts Plasma to dlmalloc, which supports limiting memory footprint, 
> as opposed to other, potentially more performant malloc implementations 
> (e.g., jemalloc)
>  * dlmalloc_set_footprint_limit does not guarantee that the limit it sets is 
> the amount of _usable_ memory. As such, we might trigger evictions much 
> earlier than hitting this limit, e.g., due to fragmentation or metadata 
> overheads.
> To overcome this, we can impose the memory limit at Plasma by tracking the 
> number of bytes allocated and freed using malloc and free calls. Whenever the 
> allocation reaches the set limit, we fail any subsequent allocations (i.e., 
> return NULL from malloc). This allows Plasma to not be tied to dlmalloc, and 
> also provides more accurate tracking of memory allocation/capacity. 
> Caveat: We will need to make sure that the mmaped files are living on a file 
> system that is a bit larger (depending on malloc implementation) than the 
> Plasma memory limit to account for the extra memory required due to 
> fragmentation/metadata overheads.





[jira] [Created] (ARROW-4422) [Plasma] Enforce memory limit in plasma, rather than relying on dlmalloc_set_footprint_limit

2019-01-29 Thread Anurag Khandelwal (JIRA)
Anurag Khandelwal created ARROW-4422:


 Summary: [Plasma] Enforce memory limit in plasma, rather than 
relying on dlmalloc_set_footprint_limit
 Key: ARROW-4422
 URL: https://issues.apache.org/jira/browse/ARROW-4422
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Plasma (C++)
Affects Versions: 0.12.0
Reporter: Anurag Khandelwal
Assignee: Anurag Khandelwal
 Fix For: 0.13.0


Currently, Plasma relies on dlmalloc_set_footprint_limit to limit the memory 
utilization for Plasma Store. This is restrictive because:
 * It restricts Plasma to dlmalloc, which supports limiting memory footprint, 
as opposed to other, potentially more performant malloc implementations (e.g., 
jemalloc)
 * dlmalloc_set_footprint_limit does not guarantee that the limit it sets is the 
amount of _usable_ memory. As such, we might trigger evictions much earlier 
than hitting this limit, e.g., due to fragmentation or metadata overheads.

To overcome this, we can impose the memory limit at Plasma by tracking the 
number of bytes allocated and freed using malloc and free calls. Whenever the 
allocation reaches the set limit, we fail any subsequent allocations (i.e., 
return NULL from malloc). This allows Plasma to not be tied to dlmalloc, and 
also provides more accurate tracking of memory allocation/capacity. 

Caveat: We will need to make sure that the mmaped files are living on a file 
system that is a bit larger (depending on malloc implementation) than the 
Plasma memory limit to account for the extra memory required due to 
fragmentation/metadata overheads.
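
The byte-counting approach described above can be sketched in Python as a 
simulation. This is not Plasma's C++ code: `LimitedAllocator` and its integer 
handles are hypothetical stand-ins for a wrapper around malloc and free, and 
the exact rule (reject an allocation that would push the total past the limit) 
is an assumption about how the limit would be applied.

```python
class LimitedAllocator:
    """Enforces a byte limit by counting allocations, independent of the allocator."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.allocated_bytes = 0
        self._sizes = {}        # handle -> size, so free() knows what to subtract
        self._next_handle = 1   # hypothetical stand-in for a real pointer

    def malloc(self, size):
        # Assumed policy: fail if this request would exceed the limit.
        if self.allocated_bytes + size > self.limit_bytes:
            return None  # mirrors "return NULL from malloc"
        handle = self._next_handle
        self._next_handle += 1
        self._sizes[handle] = size
        self.allocated_bytes += size
        return handle

    def free(self, handle):
        # Subtract exactly what was recorded for this handle.
        self.allocated_bytes -= self._sizes.pop(handle)
```

Because the counter tracks only requested bytes, fragmentation and allocator 
metadata are not included, which is why the caveat above asks for a backing 
file system somewhat larger than the limit.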





[jira] [Commented] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Areg Melik-Adamyan (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755506#comment-16755506
 ] 

Areg Melik-Adamyan commented on ARROW-4313:
---

[~tanya] why do we need 'overclock_freq_HZ'? What is the practical usage model 
for this field?



[jira] [Commented] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Areg Melik-Adamyan (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755513#comment-16755513
 ] 

Areg Melik-Adamyan commented on ARROW-4313:
---

Got it. I think those numbers are mostly never used, because you always run 
benchmarks at a fixed frequency to get consistent results over time. So they 
can be easily determined from the model name or cpuid, just for informational 
purposes, but will never be used in serial benchmarking. In serial 
benchmarking everything should be fixed, nailed down and unchanged, except the 
variable you are measuring, which is the Arrow code measured through the 
benchmark code. 



[jira] [Commented] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755509#comment-16755509
 ] 

Tanya Schlusser commented on ARROW-4313:


[~aregm] I do not know. I am depending on the other people commenting here to 
make sure the hardware tables make sense; honestly, I never pay attention to 
hardware because my use cases never stress my system. At one point Wes 
suggested it. I am glad there is a debate.



[jira] [Updated] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4313:
---
Attachment: benchmark-data-model.erdplus



[jira] [Updated] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4313:
---
Attachment: (was: benchmark-data-model.png)



[jira] [Commented] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Tanya Schlusser (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755504#comment-16755504
 ] 

Tanya Schlusser commented on ARROW-4313:


Thank you very much for everyone's detailed feedback. I absolutely need 
guidance with the Machine / CPU / GPU specs. I have updated the 
[^benchmark-data-model.png] and the [^benchmark-data-model.erdplus], and added 
all of the recommended columns.

 

*Summary of changes:*
 * All the dimension tables have been renamed to exclude the `_dim`. (It was to 
distinguish dimension vs. fact tables.)

 * `cpu`
 ** Added a `cpu_thread_count`. 
 ** Changed `cpu.speed_Hz` to two columns: `frequency_max_Hz` and 
`frequency_min_Hz` and also added a column `machine.overclock_frequency_Hz` to 
the `machine` table to allow for overclocking like Wes mentioned in the 
beginning.

 * `os`
 ** Added both `os.architecture_name` and `os.architecture_bits`, the latter 
forced to be in \{32, 64}, and pulled from the architecture name (maybe it will 
become just a computed column in the joined view...). I think it's a good idea.

 * `project`
 ** Added a `project.project_name` (oversight before)

 * `benchmark_language`
 ** Split out `language` to `language_name` and `language_version` because 
maybe people will want to compare between them (e.g. Python 2.7, 3.5+)

 * `environment`
 ** Removed foreign key for `machine_id` — that should be in the benchmark 
report separately. Many machines will have the same environment.

 * `benchmark`
 ** Added foreign key for `benchmark_language_id`—a benchmark with the same 
name may exist for different languages.
 ** Added foreign key for `project_id`—moved it from table `benchmark_result`

 * `benchmark_result`
 ** Added foreign key for `machine_id` (was removed from `environment`)
 ** Deleted foreign key for `project_id`, placing it in `benchmark` (as stated 
above)

*Questions*
 * `cpu` and `gpu` dimension
 ** Is it a mistake to make `cpu.cpu_model_name` unique? I mean, are the LX 
cache levels, core counts, or any other attribute ever different for the same 
CPU model string?
 ** The same for GPU.
 ** I have commented the columns to say that `cpu_thread_count` corresponds to 
`sysctl -n hw.logicalcpu` and `cpu_core_count` corresponds to `sysctl -n 
hw.physicalcpu`; corrections gratefully accepted.
 ** Would it be less confusing to make the column names exactly match the 
`sysctl` keys their values come from, e.g. change `cpu.cpu_model_name` to 
`cpu.cpu_brand_string` to correspond to the output of `sysctl -n 
machdep.cpu.brand_string`?
 ** On that note, is CPU RAM the same thing as `sysctl -n 
machdep.cpu.cache.size`?
 * `environment`
 ** I'm worried I'm doing something inelegant with the dependency list. It will 
hold everything – Conda / virtualenv; versions of Numpy; all permutations of 
the various dependencies in what in ASV is the dependency matrix.
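
To make the table relationships in the summary above easier to review, here is 
a rough sketch of part of the model using Python's built-in `sqlite3`. It is 
not the attached data model: only the columns named in this comment come from 
the discussion, while `machine_name`, `benchmark_name`, `mean_seconds`, the 
surrogate `*_id` keys, and the uniqueness constraints other than 
`cpu.cpu_model_name` are assumptions for illustration. The `os`, `gpu`, and 
`environment` tables are left out.

```python
import sqlite3

SCHEMA = """
CREATE TABLE cpu (
    cpu_id            INTEGER PRIMARY KEY,
    cpu_model_name    TEXT NOT NULL UNIQUE,
    cpu_core_count    INTEGER,
    cpu_thread_count  INTEGER,
    frequency_max_Hz  INTEGER,
    frequency_min_Hz  INTEGER
);
CREATE TABLE machine (
    machine_id              INTEGER PRIMARY KEY,
    machine_name            TEXT NOT NULL UNIQUE,  -- assumed column
    cpu_id                  INTEGER NOT NULL REFERENCES cpu(cpu_id),
    overclock_frequency_Hz  INTEGER                -- NULL when not overclocked
);
CREATE TABLE project (
    project_id    INTEGER PRIMARY KEY,
    project_name  TEXT NOT NULL UNIQUE             -- uniqueness is assumed
);
CREATE TABLE benchmark_language (
    benchmark_language_id  INTEGER PRIMARY KEY,
    language_name          TEXT NOT NULL,
    language_version       TEXT NOT NULL,
    UNIQUE (language_name, language_version)
);
CREATE TABLE benchmark (
    benchmark_id           INTEGER PRIMARY KEY,
    benchmark_name         TEXT NOT NULL,          -- assumed column
    benchmark_language_id  INTEGER NOT NULL
        REFERENCES benchmark_language(benchmark_language_id),
    project_id             INTEGER NOT NULL REFERENCES project(project_id),
    UNIQUE (benchmark_name, benchmark_language_id)
);
CREATE TABLE benchmark_result (
    benchmark_result_id  INTEGER PRIMARY KEY,
    benchmark_id         INTEGER NOT NULL REFERENCES benchmark(benchmark_id),
    machine_id           INTEGER NOT NULL REFERENCES machine(machine_id),
    mean_seconds         REAL                      -- assumed column
);
"""


def build_database():
    """Create an in-memory database with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

With `cpu_model_name` unique, a second row for the same model string is 
rejected, so any per-machine variation (such as the overclock frequency) has to 
live on `machine`, as the summary proposes.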



[jira] [Updated] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4313:
---
Attachment: (was: benchmark-data-model.erdplus)



[jira] [Updated] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Tanya Schlusser (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanya Schlusser updated ARROW-4313:
---
Attachment: benchmark-data-model.png



[jira] [Created] (ARROW-4421) [Flight][C++] Handle large Flight data messages

2019-01-29 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4421:
---

 Summary: [Flight][C++] Handle large Flight data messages
 Key: ARROW-4421
 URL: https://issues.apache.org/jira/browse/ARROW-4421
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.13.0


I believe the message payloads are currently limited to 4MB by default in gRPC; 
see one developer's discussion here:

https://nanxiao.me/en/message-length-setting-in-grpc/

While it is a good idea to break large messages into smaller ones, we will need 
to address how to gracefully send larger payloads that may be provided by a 
user's server implementation. We can either increase the limit or break up the 
record batches into smaller chunks in the Flight server base (or both, of 
course).
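
As a rough illustration of the chunking option, the Python sketch below 
computes row slices that each fit under a size limit. It does not use any 
Flight or gRPC API: `split_row_ranges` is a hypothetical helper, and it 
assumes a fixed number of bytes per row, which ignores variable-width columns 
and message metadata. The 4 MB default is the figure mentioned above.

```python
def split_row_ranges(num_rows, bytes_per_row, max_message_bytes=4 * 1024 * 1024):
    """Return (offset, length) row slices whose estimated size fits the limit.

    Assumes every row takes exactly bytes_per_row bytes; a row larger than the
    limit still gets a slice of its own, so the caller must handle that case.
    """
    if bytes_per_row <= 0:
        raise ValueError("bytes_per_row must be positive")
    rows_per_chunk = max(1, max_message_bytes // bytes_per_row)
    return [
        (offset, min(rows_per_chunk, num_rows - offset))
        for offset in range(0, num_rows, rows_per_chunk)
    ]
```

For example, ten rows of 1 MiB each would be split into slices of four, four, 
and two rows under the default limit.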





[jira] [Closed] (ARROW-3328) [Flight] Allow for optional unique flight identifier to be sent with FlightGetInfo

2019-01-29 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney closed ARROW-3328.
---
Resolution: Won't Fix

As discussed, let's leave this to be handled using serialized tickets.

> [Flight] Allow for optional unique flight identifier to be sent with 
> FlightGetInfo
> --
>
> Key: ARROW-3328
> URL: https://issues.apache.org/jira/browse/ARROW-3328
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
>
> There could either be
> * A global identifier for the entire flight
> * Endpoint-specific identifiers
> A client could use these unique identifiers to perform other kinds of actions. 
> An example would be retrieving logs or statistics about a get -- you could 
> see time spent writing the dataset to gRPC or time spent constructing the 
> dataset before handing off to the gRPC write layer.





[jira] [Resolved] (ARROW-4345) [C++] Add Apache 2.0 license file to the Parquet-testing repository

2019-01-29 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-4345.
-
Resolution: Fixed
  Assignee: Wes McKinney

Done in 
https://github.com/apache/parquet-testing/commit/8991d0b58d5a59925c87dd2a0bdb59a5a4a16bd4

> [C++] Add Apache 2.0 license file to the Parquet-testing repository
> ---
>
> Key: ARROW-4345
> URL: https://issues.apache.org/jira/browse/ARROW-4345
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, cpp
>Affects Versions: 0.12.0
>Reporter: Rylan Dmello
>Assignee: Wes McKinney
>Priority: Minor
>
> The parquet-testing repository is used as a git submodule in the Apache Arrow 
> repository, but doesn't currently have a license file:
>     [https://github.com/apache/arrow/tree/master/cpp/submodules]
>  





[jira] [Updated] (ARROW-4345) [C++] Add Apache 2.0 license file to the Parquet-testing repository

2019-01-29 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4345:

Fix Version/s: 0.13.0

> [C++] Add Apache 2.0 license file to the Parquet-testing repository
> ---
>
> Key: ARROW-4345
> URL: https://issues.apache.org/jira/browse/ARROW-4345
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, cpp
>Affects Versions: 0.12.0
>Reporter: Rylan Dmello
>Assignee: Wes McKinney
>Priority: Minor
> Fix For: 0.13.0
>
>
> The parquet-testing repository is used as a git submodule in the Apache Arrow 
> repository, but doesn't currently have a license file:
>     [https://github.com/apache/arrow/tree/master/cpp/submodules]
>  





[jira] [Assigned] (ARROW-3289) [C++] Implement DoPut command for Flight on client and server side

2019-01-29 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-3289:
---

Assignee: David Li

> [C++] Implement DoPut command for Flight on client and server side  
> 
>
> Key: ARROW-3289
> URL: https://issues.apache.org/jira/browse/ARROW-3289
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: David Li
>Priority: Major
>  Labels: flight, pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This was omitted from ARROW-3146





[jira] [Commented] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755433#comment-16755433
 ] 

Wes McKinney commented on ARROW-4313:
-

I recall an e-mail thread some time back about IBM POWER support -- some of us 
(myself, [~kou]) were given access to Power Z-based CI infrastructure for 
testing but we have yet to try it. I doubt that the project works on big-endian 
hardware right now (Arrow is currently little-endian, even when running on 
big-endian hardware).

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E





[jira] [Commented] (ARROW-4345) [C++] Add Apache 2.0 license file to the Parquet-testing repository

2019-01-29 Thread Rylan Dmello (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755426#comment-16755426
 ] 

Rylan Dmello commented on ARROW-4345:
-

Hi Uwe, I don't think I can get to this until the next week or so. If you have 
the bandwidth for this, please feel free to take ownership of this story.

> [C++] Add Apache 2.0 license file to the Parquet-testing repository
> ---
>
> Key: ARROW-4345
> URL: https://issues.apache.org/jira/browse/ARROW-4345
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, cpp
>Affects Versions: 0.12.0
>Reporter: Rylan Dmello
>Priority: Minor
>
> The parquet-testing repository is used as a git submodule in the Apache Arrow 
> repository, but doesn't currently have a license file:
>     [https://github.com/apache/arrow/tree/master/cpp/submodules]
>  





[jira] [Created] (ARROW-4420) [INTEGRATION] Pin spark's version to the recently released arrow 0.12 patch

2019-01-29 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-4420:
--

 Summary: [INTEGRATION] Pin spark's version to the recently 
released arrow 0.12 patch
 Key: ARROW-4420
 URL: https://issues.apache.org/jira/browse/ARROW-4420
 Project: Apache Arrow
  Issue Type: Bug
  Components: Integration
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs


As discussed in https://github.com/apache/arrow/pull/3300#discussion_r252026108





[jira] [Resolved] (ARROW-4296) [Plasma] Starting Plasma store with use_one_memory_mapped_file enabled crashes due to improper memory alignment

2019-01-29 Thread Robert Nishihara (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Nishihara resolved ARROW-4296.
-
Resolution: Fixed

Issue resolved by pull request 3490
[https://github.com/apache/arrow/pull/3490]

> [Plasma] Starting Plasma store with use_one_memory_mapped_file enabled 
> crashes due to improper memory alignment
> ---
>
> Key: ARROW-4296
> URL: https://issues.apache.org/jira/browse/ARROW-4296
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Plasma (C++)
>Affects Versions: 0.11.1
>Reporter: Anurag Khandelwal
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 5h
>  Remaining Estimate: 0h
>
> Starting Plasma with use_one_memory_mapped_file (-f flag) causes a crash, 
> most likely due to improper memory alignment. This can be resolved by 
> changing the dlmemalign call during initialization to use slightly smaller 
> memory (by ~8KB).





[jira] [Commented] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755383#comment-16755383
 ] 

Antoine Pitrou commented on ARROW-4313:
---

Multiple CPUs would go under the core_count IMO.

As for mainframes, no, but AFAIK there are regular Linux-based (or AIX-based) 
POWER servers.

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E





[jira] [Commented] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store

2019-01-29 Thread Robert Nishihara (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755381#comment-16755381
 ] 

Robert Nishihara commented on ARROW-4418:
-

If preferable, there is also a non-Boost version of asio: 
https://think-async.com/Asio/AsioAndBoostAsio.html

 

I also remember thinking that asio is moving into the C++ standard library, 
though I can't seem to find a reference for that at the moment.

 

The benefits of using asio are pretty big (Windows support for the Plasma store 
as well as using a more standard C++ approach than what we are currently doing).

 

In terms of alternatives, I know that Philipp has looked into gRPC. Maybe he 
could elaborate on the pros/cons there?

> [Plasma] replace event loop with boost::asio for plasma store
> -
>
> Key: ARROW-4418
> URL: https://issues.apache.org/jira/browse/ARROW-4418
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Reporter: Zhijun Fu
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Original text:
> It would be nice to move plasma store from current event loop to boost::asio 
> to modernize the code, and more importantly to benefit from the 
> functionalities provided by asio, which I think also provides opportunities 
> for performance improvement.





[jira] [Commented] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Areg Melik-Adamyan (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755360#comment-16755360
 ] 

Areg Melik-Adamyan commented on ARROW-4313:
---

Ok, if we want to add them, then it should be named 'smt_thread_count' or 
'threads_per_core'. And there is also a case for multiple CPUs. Do you 
anticipate using Arrow on mainframes? I would say that most likely FPGA usage 
will precede Power usage. 
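
As a rough illustration of the naming above, a per-core thread count can sit 
next to `core_count` and the total can be derived from the two. This is only 
a sketch: the `CpuDim` class and its fields are hypothetical and are not 
taken from the attached data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuDim:
    # Field names are illustrative; they follow the suggestions in this thread.
    cpu_model_name: str
    core_count: int
    threads_per_core: int = 1

    @property
    def thread_count(self) -> int:
        """Total hardware threads across all cores."""
        return self.core_count * self.threads_per_core


# Made-up example machines: 2 threads per core and 8 threads per core.
x86 = CpuDim("example x86-64 CPU", core_count=4, threads_per_core=2)
power = CpuDim("example POWER CPU", core_count=4, threads_per_core=8)
```

Storing the per-core value rather than a yes/no flag covers CPUs where the 
multiplier is not 2.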

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E





[jira] [Comment Edited] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Areg Melik-Adamyan (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755360#comment-16755360
 ] 

Areg Melik-Adamyan edited comment on ARROW-4313 at 1/29/19 8:23 PM:


Ok, if we want to add them, then it should be named 'smt_thread_count' or 
'threads_per_core'. And there is a case for multiple CPUs also. Do you 
anticipate using Arrow on mainframes? I would say that most likely FPGA usage 
will precede Power usage. 


was (Author: aregm):
Ok, if we want to add them, then it should be named 'smt_thread_count' or 
'threads_per_core'. And there is also a case for multiple CPUs. Do you 
anticipate using Arrow on mainframes? I would say that most likely FPGA usage 
will precede Power usage. 

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E





[jira] [Updated] (ARROW-3289) [C++] Implement DoPut command for Flight on client and server side

2019-01-29 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-3289:
--
Labels: flight pull-request-available  (was: flight)

> [C++] Implement DoPut command for Flight on client and server side  
> 
>
> Key: ARROW-3289
> URL: https://issues.apache.org/jira/browse/ARROW-3289
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: flight, pull-request-available
> Fix For: 0.13.0
>
>
> This was omitted from ARROW-3146





[jira] [Commented] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755350#comment-16755350
 ] 

Antoine Pitrou commented on ARROW-4313:
---

> there is a 'core_count', for IA it is better to have HT flag, for others 
> threads=cores

Not really, for example IBM POWER CPUs can have 2, 4 or 8 threads per core.


> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E





[jira] [Commented] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Areg Melik-Adamyan (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755291#comment-16755291
 ] 

Areg Melik-Adamyan commented on ARROW-4313:
---

* in `cpu_dim`, perhaps add a `cpu_thread_count` (the CPU's number of hardware 
threads, which can be a multiple of the number of distinct cores)
** there is a 'core_count', for IA it is better to have HT flag, for others 
threads=cores
* either in `machine_dim` or `os_dim`, store the bitness? (usually 64-bit I 
suppose, though perhaps some people will want to benchmark on 32-bit). Or, more 
generally perhaps, the architecture name (such as "x86-64" or "ARMv8" or 
"AArch64").
** the short output of `uname -i` should be enough. 
* not sure why tables are suffixed with `_dim`?
** I guess those are provisional names and not necessarily the final ones. 
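
To make the points above easier to discuss, here is a small sketch of what 
the two tables could look like, using Python's built-in sqlite3 module. The 
table names `cpu_dim` and `machine_dim` and the columns `cpu_model_name`, 
`core_count` and `cpu_thread_count` come from this discussion; every other 
column, the CHECK constraint and the sample values are assumptions made for 
the example. `platform.machine()` is used as a stand-in for the `uname` 
output.

```python
import platform
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cpu_dim (
        cpu_id INTEGER PRIMARY KEY,
        cpu_model_name TEXT NOT NULL,
        core_count INTEGER NOT NULL,
        cpu_thread_count INTEGER NOT NULL,
        -- assumption: hardware threads are never fewer than cores
        CHECK (cpu_thread_count >= core_count)
    );
    CREATE TABLE machine_dim (
        machine_id INTEGER PRIMARY KEY,
        machine_name TEXT NOT NULL UNIQUE,
        architecture_name TEXT NOT NULL,
        cpu_id INTEGER NOT NULL REFERENCES cpu_dim (cpu_id)
    );
    """
)

# Sample values are made up for illustration.
cur = conn.execute(
    "INSERT INTO cpu_dim (cpu_model_name, core_count, cpu_thread_count) "
    "VALUES (?, ?, ?)",
    ("example CPU", 4, 8),
)
conn.execute(
    "INSERT INTO machine_dim (machine_name, architecture_name, cpu_id) "
    "VALUES (?, ?, ?)",
    ("example-machine", platform.machine() or "unknown", cur.lastrowid),
)

row = conn.execute(
    "SELECT m.machine_name, c.core_count, c.cpu_thread_count "
    "FROM machine_dim AS m JOIN cpu_dim AS c USING (cpu_id)"
).fetchone()
```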

> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E





[jira] [Updated] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store

2019-01-29 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4418:
--
Priority: Major  (was: Minor)

> [Plasma] replace event loop with boost::asio for plasma store
> -
>
> Key: ARROW-4418
> URL: https://issues.apache.org/jira/browse/ARROW-4418
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Zhijun Fu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Original text:
> It would be nice to move plasma store from current event loop to boost::asio 
> to modernize the code, and more importantly to benefit from the 
> functionalities provided by asio, which I think also provides opportunities 
> for performance improvement.





[jira] [Commented] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store

2019-01-29 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755252#comment-16755252
 ] 

Antoine Pitrou commented on ARROW-4418:
---

I think we should be careful to evaluate the cost of the boost::asio 
dependency. Handling boost dependencies is always delicate, especially when 
they come with a compiled library (i.e. the library isn't header-only).

> [Plasma] replace event loop with boost::asio for plasma store
> -
>
> Key: ARROW-4418
> URL: https://issues.apache.org/jira/browse/ARROW-4418
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Reporter: Zhijun Fu
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Original text:
> It would be nice to move plasma store from current event loop to boost::asio 
> to modernize the code, and more importantly to benefit from the 
> functionalities provided by asio, which I think also provides opportunities 
> for performance improvement.





[jira] [Updated] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store

2019-01-29 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4418:
--
Component/s: Plasma (C++)

> [Plasma] replace event loop with boost::asio for plasma store
> -
>
> Key: ARROW-4418
> URL: https://issues.apache.org/jira/browse/ARROW-4418
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Reporter: Zhijun Fu
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Original text:
> It would be nice to move plasma store from current event loop to boost::asio 
> to modernize the code, and more importantly to benefit from the 
> functionalities provided by asio, which I think also provides opportunities 
> for performance improvement.





[jira] [Updated] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store

2019-01-29 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4418:
--
Labels:   (was: pull-request-available)

> [Plasma] replace event loop with boost::asio for plasma store
> -
>
> Key: ARROW-4418
> URL: https://issues.apache.org/jira/browse/ARROW-4418
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Plasma (C++)
>Reporter: Zhijun Fu
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Original text:
> It would be nice to move plasma store from current event loop to boost::asio 
> to modernize the code, and more importantly to benefit from the 
> functionalities provided by asio, which I think also provides opportunities 
> for performance improvement.





[jira] [Updated] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store

2019-01-29 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4418:
--
Description: 
Original text:

It would be nice to move plasma store from current event loop to boost::asio to 
modernize the code, and more importantly to benefit from the functionalities 
provided by asio, which I think also provides opportunities for performance 
improvement.

  was:https://issues.apache.org/jira/browse/ARROW-4418


> [Plasma] replace event loop with boost::asio for plasma store
> -
>
> Key: ARROW-4418
> URL: https://issues.apache.org/jira/browse/ARROW-4418
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Zhijun Fu
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Original text:
> It would be nice to move plasma store from current event loop to boost::asio 
> to modernize the code, and more importantly to benefit from the 
> functionalities provided by asio, which I think also provides opportunities 
> for performance improvement.





[jira] [Updated] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store

2019-01-29 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou updated ARROW-4418:
--
Description: https://issues.apache.org/jira/browse/ARROW-4418

> [Plasma] replace event loop with boost::asio for plasma store
> -
>
> Key: ARROW-4418
> URL: https://issues.apache.org/jira/browse/ARROW-4418
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Zhijun Fu
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> https://issues.apache.org/jira/browse/ARROW-4418





[jira] [Commented] (ARROW-4407) [C++] ExternalProject_Add does not capture CC/CXX correctly

2019-01-29 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755200#comment-16755200
 ] 

Wes McKinney commented on ARROW-4407:
-

While it's a good idea to fix this particular issue, I think we should be 
careful about spending very much time on having the CMake build preserve the 
environment of where it was first invoked. Instead, we should encourage 
developers to keep their environment variables consistent (through the use of 
files that are sourced on each shell initialization, here is mine: 
https://github.com/wesm/dev-toolchain/blob/master/toolchain/arrow-toolchain.sh) 
while they are developing
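
For developers who want a quick check that their shell still matches the 
build directory, something like the following could work. It is only a 
sketch and not part of the Arrow build: it relies on CMakeCache.txt entries 
having the form KEY:TYPE=VALUE, compares compilers by basename only, and the 
function names are made up for this example.

```python
import os
import re
from typing import Dict, Mapping, Tuple

# CMakeCache.txt entries have the form KEY:TYPE=VALUE; comment lines start
# with '#' or '//' and do not match this pattern.
_CACHE_LINE = re.compile(r"^([A-Za-z0-9_.+-]+):[A-Z]+=(.*)$")


def parse_cmake_cache(text: str) -> Dict[str, str]:
    """Return the KEY -> VALUE entries found in CMakeCache.txt content."""
    entries: Dict[str, str] = {}
    for line in text.splitlines():
        match = _CACHE_LINE.match(line.strip())
        if match:
            entries[match.group(1)] = match.group(2)
    return entries


def compiler_mismatches(
    cache: Mapping[str, str], env: Mapping[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Return {env var: (env value, cached value)} for compilers that differ."""
    pairs = {"CC": "CMAKE_C_COMPILER", "CXX": "CMAKE_CXX_COMPILER"}
    result: Dict[str, Tuple[str, str]] = {}
    for var, key in pairs.items():
        wanted, cached = env.get(var), cache.get(key)
        # Compare basenames so "clang++" matches "/usr/bin/clang++".
        if wanted and cached and os.path.basename(wanted) != os.path.basename(cached):
            result[var] = (wanted, cached)
    return result
```

Typical use would be `compiler_mismatches(parse_cmake_cache(text), os.environ)` 
with `text` read from the build directory's CMakeCache.txt; an empty result 
means no difference was detected.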

> [C++] ExternalProject_Add does not capture CC/CXX correctly
> ---
>
> Key: ARROW-4407
> URL: https://issues.apache.org/jira/browse/ARROW-4407
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Affects Versions: 0.12.0
>Reporter: Francois Saint-Jacques
>Assignee: Francois Saint-Jacques
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> The issue is that CC/CXX environment variables are captured on the first 
> invocation of the builder (e.g. make or ninja) instead of when CMake is 
> invoked in the build directory. This can lead to compilation errors (notably 
> when compiling with clang in the top directory due to the addition of the 
> `-Qunused-arguments` option).
> This leads to an issue where I have a script that prepares the build 
> directory and exports CXX within the script. When I jump into the build 
> folder, there's a mismatch between the compiler used for the external 
> gbenchmark (and all deps if conda is not used) and the one used for the build.
> To reproduce:
> # Create a new build directory with clang as compiler, don't build yet
> # In a new shell (without the compiler environment variable), go into the 
> directory and invoke make/ninja





[jira] [Resolved] (ARROW-4213) [Flight] C++ and Java implementations are incompatible

2019-01-29 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-4213.
-
Resolution: Fixed

Issue resolved by pull request 3477
[https://github.com/apache/arrow/pull/3477]

> [Flight] C++ and Java implementations are incompatible
> --
>
> Key: ARROW-4213
> URL: https://issues.apache.org/jira/browse/ARROW-4213
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC
>Reporter: David Li
>Priority: Major
>  Labels: flight, pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 5h 10m
>  Remaining Estimate: 0h
>
> A C++ client cannot request streams from a Java service, nor can it decode 
> the schema from GetFlightInfo.
> Schema: in Java, GetFlightInfo encodes the schema directly via flatbuffers. 
> C++ expects it to be encoded as an IPC message. This isn't a problem in Java 
> as a method exists to decode such schemas, but in C++ the API for reading 
> such a schema isn't really exposed. I'm willing to submit a patch for this, 
> but it's not clear to me which scheme is preferred.
> Streams: in Java, DoGet starts with an ArrowMessage containing a schema. C++ 
> does not expect this and segfaults when it tries to decode the message as a 
> record batch. Based on the presentations I've seen, I think C++ is in the 
> wrong here; I have a patch to fix this that I could clean up and submit.
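
The stream layout described above (a schema message first, record batches 
after it) can be sketched in a few lines. This is an illustration only: 
`Message` and `read_stream` are hypothetical stand-ins and not the Arrow C++ 
or Java API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Message:
    # Hypothetical stand-in for a Flight/IPC message; not the real Arrow type.
    kind: str  # "schema" or "record_batch"
    payload: bytes


def read_stream(messages: Iterable[Message]) -> Tuple[bytes, List[bytes]]:
    """Return (schema payload, record batch payloads) from a DoGet-like stream."""
    it = iter(messages)
    first = next(it, None)
    # The reader insists on a leading schema instead of assuming a batch.
    if first is None or first.kind != "schema":
        raise ValueError("stream must start with a schema message")
    batches: List[bytes] = []
    for msg in it:
        if msg.kind != "record_batch":
            raise ValueError(f"unexpected {msg.kind!r} message after schema")
        batches.append(msg.payload)
    return first.payload, batches
```

Raising an error on an unexpected first message is the point of the sketch: 
a reader that checks the message kind fails cleanly instead of crashing.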





[jira] [Updated] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store

2019-01-29 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-4418:
--
Labels: pull-request-available  (was: )

> [Plasma] replace event loop with boost::asio for plasma store
> -
>
> Key: ARROW-4418
> URL: https://issues.apache.org/jira/browse/ARROW-4418
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Zhijun Fu
>Priority: Minor
>  Labels: pull-request-available
>






[jira] [Assigned] (ARROW-4213) [Flight] C++ and Java implementations are incompatible

2019-01-29 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney reassigned ARROW-4213:
---

Assignee: David Li

> [Flight] C++ and Java implementations are incompatible
> --
>
> Key: ARROW-4213
> URL: https://issues.apache.org/jira/browse/ARROW-4213
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: FlightRPC
>Reporter: David Li
>Assignee: David Li
>Priority: Major
>  Labels: flight, pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> A C++ client cannot request streams from a Java service, nor can it decode 
> the schema from GetFlightInfo.
> Schema: in Java, GetFlightInfo encodes the schema directly via flatbuffers. 
> C++ expects it to be encoded as an IPC message. This isn't a problem in Java 
> as a method exists to decode such schemas, but in C++ the API for reading 
> such a schema isn't really exposed. I'm willing to submit a patch for this, 
> but it's not clear to me which scheme is preferred.
> Streams: in Java, DoGet starts with an ArrowMessage containing a schema. C++ 
> does not expect this and segfaults when it tries to decode the message as a 
> record batch. Based on the presentations I've seen, I think C++ is in the 
> wrong here; I have a patch to fix this that I could clean up and submit.





[jira] [Commented] (ARROW-4419) [Flight] Deal with body buffers in FlightData

2019-01-29 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755194#comment-16755194
 ] 

Wes McKinney commented on ARROW-4419:
-

It might be useful to write an ultra-minimal pure-Python Flight server and 
client (using the generated Python gRPC bindings) so that we can more easily 
test this kind of thing.

e.g. see

https://github.com/apache/arrow/blob/master/java/flight/README.md#python-example-usage

> [Flight] Deal with body buffers in FlightData
> -
>
> Key: ARROW-4419
> URL: https://issues.apache.org/jira/browse/ARROW-4419
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC
>Reporter: David Li
>Priority: Minor
>  Labels: flight
>
> The Java implementation will fail to decode a schema message if the message 
> also contains (empty) body buffers (see ArrowMessage.asSchema's precondition 
> checks). However, clients using default Protobuf serialization will likely 
> write an empty body buffer by default.





[jira] [Updated] (ARROW-4419) [Flight] Deal with body buffers in FlightData

2019-01-29 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4419:

Summary: [Flight] Deal with body buffers in FlightData  (was: Deal with 
body buffers in FlightData)

> [Flight] Deal with body buffers in FlightData
> -
>
> Key: ARROW-4419
> URL: https://issues.apache.org/jira/browse/ARROW-4419
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: FlightRPC
>Reporter: David Li
>Priority: Minor
>  Labels: flight
>
> The Java implementation will fail to decode a schema message if the message 
> also contains (empty) body buffers (see ArrowMessage.asSchema's precondition 
> checks). However, clients using default Protobuf serialization will likely 
> write an empty body buffer by default.





[jira] [Created] (ARROW-4419) Deal with body buffers in FlightData

2019-01-29 Thread David Li (JIRA)
David Li created ARROW-4419:
---

 Summary: Deal with body buffers in FlightData
 Key: ARROW-4419
 URL: https://issues.apache.org/jira/browse/ARROW-4419
 Project: Apache Arrow
  Issue Type: Improvement
  Components: FlightRPC
Reporter: David Li


The Java implementation will fail to decode a schema message if the message 
also contains (empty) body buffers (see ArrowMessage.asSchema's precondition 
checks). However, clients using default Protobuf serialization will likely 
write an empty body buffer by default.
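
A minimal sketch of a more lenient precondition, assuming the check only needs 
to distinguish "no body" from "non-empty body". `FlightDataStub`, 
`has_real_body` and `check_schema_message` are hypothetical names used for 
illustration; this is not the Java `ArrowMessage` code, only a model of the idea 
that a zero-length body could be treated like an absent one.

```python
from dataclasses import dataclass


@dataclass
class FlightDataStub:
    """Hypothetical stand-in for a decoded FlightData message."""
    data_header: bytes = b""
    data_body: bytes = b""  # an unset bytes field reads back as empty bytes


def has_real_body(message):
    """Treat a zero-length body the same as a missing body."""
    return len(message.data_body) > 0


def check_schema_message(message):
    """Return the header if the message looks like a schema message.

    Assumption: a schema message must carry a header and must not carry
    a non-empty body; an empty body is tolerated.
    """
    if not message.data_header:
        raise ValueError("schema message has no header")
    if has_real_body(message):
        raise ValueError("schema message unexpectedly carries a body")
    return message.data_header
```

With this shape, a message whose body is `b""` passes the check, while a 
message with actual body bytes is still rejected.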





[jira] [Resolved] (ARROW-4395) ts-node throws type error running `bin/arrow2csv.js`

2019-01-29 Thread Krisztian Szucs (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs resolved ARROW-4395.

   Resolution: Fixed
Fix Version/s: (was: 0.4.0)
   0.13.0

Issue resolved by pull request 3504
[https://github.com/apache/arrow/pull/3504]

> ts-node throws type error running `bin/arrow2csv.js`
> 
>
> Key: ARROW-4395
> URL: https://issues.apache.org/jira/browse/ARROW-4395
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: JavaScript
>Affects Versions: 0.4.0
>Reporter: Paul Taylor
>Assignee: Paul Taylor
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> ts-node is being too strict and throws this (inaccurate) error when 
> JIT-compiling the TS source:
> {code:none}
> $ cat test/data/cpp/stream/simple.arrow | ./bin/arrow2csv.js 
> /home/ptaylor/dev/arrow/js/node_modules/ts-node/src/index.ts:228
> return new TSError(diagnosticText, diagnosticCodes)
>^
> TSError: ⨯ Unable to compile TypeScript:
> src/vector/map.ts(25,57): error TS2345: Argument of type 'Field number | symbol]>[]' is not assignable to parameter of type 'Field T]>[]'.
>   Type 'Field' is not assignable to type 
> 'Field'.
> Type 'T[string] | T[number] | T[symbol]' is not assignable to type 
> 'T[keyof T]'.
>   Type 'T[symbol]' is not assignable to type 'T[keyof T]'.
> Type 'DataType' is not assignable to type 'T[keyof T]'.
>   Type 'symbol' is not assignable to type 'keyof T'.
> Type 'symbol' is not assignable to type 'string | number'.
> {code}





[jira] [Created] (ARROW-4418) [Plasma] replace event loop with boost::asio for plasma store

2019-01-29 Thread Zhijun Fu (JIRA)
Zhijun Fu created ARROW-4418:


 Summary: [Plasma] replace event loop with boost::asio for plasma 
store
 Key: ARROW-4418
 URL: https://issues.apache.org/jira/browse/ARROW-4418
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Zhijun Fu








[jira] [Updated] (ARROW-4413) [Python] pyarrow.hdfs.connect() failing

2019-01-29 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-4413:

Fix Version/s: 0.13.0

> [Python] pyarrow.hdfs.connect() failing
> ---
>
> Key: ARROW-4413
> URL: https://issues.apache.org/jira/browse/ARROW-4413
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0
> Environment: Python 2.7
> Hadoop distribution: Amazon 2.7.3
> Hive 2.1.1 
> Spark 2.1.1
> Tez 0.8.4
> Linux 4.4.35-33.55.amzn1.x86_64
>Reporter: Bradley Grantham
>Priority: Major
> Fix For: 0.13.0
>
>
> Trying to connect to hdfs using the below snippet. Using {{hadoop-libhdfs}}.
> This error appears in {{v0.12.0}}. It doesn't appear in {{v0.11.1}}. (I used 
> the same environment when testing that it still worked on {{v0.11.1}})
>  
> {code:java}
> In [1]: import pyarrow as pa
> In [2]: fs = pa.hdfs.connect()
> ---
> TypeError Traceback (most recent call last)
>  in ()
> ----> 1 fs = pa.hdfs.connect()
> /usr/local/lib64/python2.7/site-packages/pyarrow/hdfs.pyc in connect(host, 
> port, user, kerb_ticket, driver, extra_conf)
> 205 fs = HadoopFileSystem(host=host, port=port, user=user,
> 206   kerb_ticket=kerb_ticket, driver=driver,
> --> 207   extra_conf=extra_conf)
> 208 return fs
> /usr/local/lib64/python2.7/site-packages/pyarrow/hdfs.pyc in __init__(self, 
> host, port, user, kerb_ticket, driver, extra_conf)
>  36 _maybe_set_hadoop_classpath()
>  37 
> ---> 38 self._connect(host, port, user, kerb_ticket, driver, 
> extra_conf)
>  39 
>  40 def __reduce__(self):
> /usr/local/lib64/python2.7/site-packages/pyarrow/io-hdfs.pxi in 
> pyarrow.lib.HadoopFileSystem._connect()
>  72 if host is not None:
>  73 conf.host = tobytes(host)
> ---> 74 self.host = host
>  75 
>  76 conf.port = port
> TypeError: Expected unicode, got str
> {code}





[jira] [Commented] (ARROW-4412) [DOCUMENTATION] Add explicit version numbers to the arrow specification documents.

2019-01-29 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16755108#comment-16755108
 ] 

Wes McKinney commented on ARROW-4412:
-

The version number is in the top left of 

http://arrow.apache.org/docs/format/README.html

(it is a dev version number right now... we should fix that)

> [DOCUMENTATION] Add explicit version numbers to the arrow specification 
> documents.
> --
>
> Key: ARROW-4412
> URL: https://issues.apache.org/jira/browse/ARROW-4412
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Micah Kornfield
>Priority: Minor
>
> Based on a conversation on the mailing list, it might pay to include 
> version/revision numbers in the specification documents. One way is to 
> include the "release" version; another might be to update the version only 
> when the document changes.





[jira] [Commented] (ARROW-4313) Define general benchmark database schema

2019-01-29 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754973#comment-16754973
 ] 

Antoine Pitrou commented on ARROW-4313:
---

Some thoughts:

* in `cpu_dim`, perhaps add a `cpu_thread_count` (the CPU's number of hardware 
threads, which can be a multiple of the number of distinct cores)
* either in `machine_dim` or `os_dim`, store the bitness? (usually 64-bit I 
suppose, though perhaps some people will want to benchmark on 32-bit). Or, more 
generally perhaps, the architecture name (such as "x86-64" or "ARMv8" or 
"AArch64").
* not sure why tables are suffixed with `_dim`?
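
A minimal sketch of how a benchmark runner might collect the values suggested 
above, using only the Python standard library. The function name 
`describe_machine` and the keys `architecture` and `bitness` are assumptions 
for illustration; only `cpu_thread_count` is a name proposed in this comment.

```python
import os
import platform
import struct


def describe_machine():
    """Collect a few machine attributes for a benchmark record."""
    return {
        # Number of logical CPUs (hardware threads) as seen by the OS;
        # os.cpu_count() returns None if it cannot be determined.
        "cpu_thread_count": os.cpu_count(),
        # Architecture name as reported by the platform, e.g. "x86_64";
        # may be an empty string if it cannot be determined.
        "architecture": platform.machine(),
        # Pointer size of the running interpreter in bits. Note that this
        # describes the Python build, which may differ from the OS bitness.
        "bitness": struct.calcsize("P") * 8,
    }


if __name__ == "__main__":
    print(describe_machine())
```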


> Define general benchmark database schema
> 
>
> Key: ARROW-4313
> URL: https://issues.apache.org/jira/browse/ARROW-4313
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Benchmarking
>Reporter: Wes McKinney
>Priority: Major
> Fix For: 0.13.0
>
> Attachments: benchmark-data-model.erdplus, benchmark-data-model.png
>
>
> Some possible attributes that the benchmark database should track, to permit 
> heterogeneity of hardware and programming languages
> * Timestamp of benchmark run
> * Git commit hash of codebase
> * Machine unique name (sort of the "user id")
> * CPU identification for machine, and clock frequency (in case of 
> overclocking)
> * CPU cache sizes (L1/L2/L3)
> * Whether or not CPU throttling is enabled (if it can be easily determined)
> * RAM size
> * GPU identification (if any)
> * Benchmark unique name
> * Programming language(s) associated with benchmark (e.g. a benchmark
> may involve both C++ and Python)
> * Benchmark time, plus mean and standard deviation if available, else NULL
> see discussion on mailing list 
> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E





[jira] [Resolved] (ARROW-4417) [C++] Doc build broken

2019-01-29 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-4417.
---
   Resolution: Fixed
Fix Version/s: 0.13.0

Issue resolved by pull request 3521
[https://github.com/apache/arrow/pull/3521]

> [C++] Doc build broken
> --
>
> Key: ARROW-4417
> URL: https://issues.apache.org/jira/browse/ARROW-4417
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration, Documentation
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.13.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> See https://travis-ci.org/apache/arrow/jobs/485716603#L4746
> {code}
> /home/travis/build/apache/arrow/cpp/src/arrow/compute/kernel.h:170: error: 
> The following parameters of arrow::compute::UnaryKernel::Call(FunctionContext 
> *ctx, const Datum &input, Datum *out)=0 are not documented:
>   parameter 'ctx'
>   parameter 'input' (warning treated as error, aborting now)
> {code}





[jira] [Assigned] (ARROW-4414) [C++] Stop using cmake COMMAND_EXPAND_LISTS because it breaks package builds for older distros

2019-01-29 Thread Krisztian Szucs (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Szucs reassigned ARROW-4414:
--

Assignee: Krisztian Szucs

> [C++] Stop using cmake COMMAND_EXPAND_LISTS because it breaks package builds 
> for older distros
> --
>
> Key: ARROW-4414
> URL: https://issues.apache.org/jira/browse/ARROW-4414
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Krisztian Szucs
>Assignee: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The COMMAND_EXPAND_LISTS option of add_custom_command is too new for Ubuntu 
> Xenial and Debian stretch. It has only been available since CMake 3.8: 
> https://cmake.org/cmake/help/v3.8/command/add_custom_command.html
> We need to stop using it in cpp/src/gandiva/precompiled/CMakeLists.txt.
> Also, we should pin CMake to version 3.5 in the Travis builds (Xenial ships 
> CMake 3.5).





[jira] [Updated] (ARROW-4414) [C++] Stop using cmake COMMAND_EXPAND_LISTS because it breaks package builds for older distros

2019-01-29 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-4414:
--
Labels: pull-request-available  (was: )

> [C++] Stop using cmake COMMAND_EXPAND_LISTS because it breaks package builds 
> for older distros
> --
>
> Key: ARROW-4414
> URL: https://issues.apache.org/jira/browse/ARROW-4414
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Krisztian Szucs
>Priority: Major
>  Labels: pull-request-available
>
> The COMMAND_EXPAND_LISTS option of add_custom_command is too new for Ubuntu 
> Xenial and Debian stretch. It has only been available since CMake 3.8: 
> https://cmake.org/cmake/help/v3.8/command/add_custom_command.html
> We need to stop using it in cpp/src/gandiva/precompiled/CMakeLists.txt.
> Also, we should pin CMake to version 3.5 in the Travis builds (Xenial ships 
> CMake 3.5).





[jira] [Updated] (ARROW-4417) [C++] Doc build broken

2019-01-29 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-4417:
--
Labels: pull-request-available  (was: )

> [C++] Doc build broken
> --
>
> Key: ARROW-4417
> URL: https://issues.apache.org/jira/browse/ARROW-4417
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration, Documentation
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>  Labels: pull-request-available
>
> See https://travis-ci.org/apache/arrow/jobs/485716603#L4746
> {code}
> /home/travis/build/apache/arrow/cpp/src/arrow/compute/kernel.h:170: error: 
> The following parameters of arrow::compute::UnaryKernel::Call(FunctionContext 
> *ctx, const Datum &input, Datum *out)=0 are not documented:
>   parameter 'ctx'
>   parameter 'input' (warning treated as error, aborting now)
> {code}





[jira] [Created] (ARROW-4416) [CI] Build gandiva in cpp docker image

2019-01-29 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-4416:
--

 Summary: [CI] Build gandiva in cpp docker image
 Key: ARROW-4416
 URL: https://issues.apache.org/jira/browse/ARROW-4416
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Krisztian Szucs


Currently Gandiva is not built. For the sake of completeness, enable it by 
default.





[jira] [Created] (ARROW-4417) [C++] Doc build broken

2019-01-29 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-4417:
-

 Summary: [C++] Doc build broken
 Key: ARROW-4417
 URL: https://issues.apache.org/jira/browse/ARROW-4417
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Continuous Integration, Documentation
Reporter: Antoine Pitrou


See https://travis-ci.org/apache/arrow/jobs/485716603#L4746

{code}
/home/travis/build/apache/arrow/cpp/src/arrow/compute/kernel.h:170: error: The 
following parameters of arrow::compute::UnaryKernel::Call(FunctionContext *ctx, 
const Datum &input, Datum *out)=0 are not documented:
  parameter 'ctx'
  parameter 'input' (warning treated as error, aborting now)
{code}






[jira] [Assigned] (ARROW-4417) [C++] Doc build broken

2019-01-29 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-4417:
-

Assignee: Antoine Pitrou

> [C++] Doc build broken
> --
>
> Key: ARROW-4417
> URL: https://issues.apache.org/jira/browse/ARROW-4417
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration, Documentation
>Reporter: Antoine Pitrou
>Assignee: Antoine Pitrou
>Priority: Major
>
> See https://travis-ci.org/apache/arrow/jobs/485716603#L4746
> {code}
> /home/travis/build/apache/arrow/cpp/src/arrow/compute/kernel.h:170: error: 
> The following parameters of arrow::compute::UnaryKernel::Call(FunctionContext 
> *ctx, const Datum &input, Datum *out)=0 are not documented:
>   parameter 'ctx'
>   parameter 'input' (warning treated as error, aborting now)
> {code}





[jira] [Commented] (ARROW-4414) [C++] Stop using cmake COMMAND_EXPAND_LISTS because it breaks package builds for older distros

2019-01-29 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754864#comment-16754864
 ] 

Uwe L. Korn commented on ARROW-4414:


Argh. I don't know how else to pass the parameters to the Gandiva command 
in the manylinux1 build. It would be good if someone else could have a look at it.

> [C++] Stop using cmake COMMAND_EXPAND_LISTS because it breaks package builds 
> for older distros
> --
>
> Key: ARROW-4414
> URL: https://issues.apache.org/jira/browse/ARROW-4414
> Project: Apache Arrow
>  Issue Type: Bug
>Reporter: Krisztian Szucs
>Priority: Major
>
> The COMMAND_EXPAND_LISTS option of add_custom_command is too new for Ubuntu 
> Xenial and Debian stretch. It has only been available since CMake 3.8: 
> https://cmake.org/cmake/help/v3.8/command/add_custom_command.html
> We need to stop using it in cpp/src/gandiva/precompiled/CMakeLists.txt.
> Also, we should pin CMake to version 3.5 in the Travis builds (Xenial ships 
> CMake 3.5).





[jira] [Created] (ARROW-4415) [Doc] Port run_site docker to the new compose setup

2019-01-29 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-4415:
--

 Summary: [Doc] Port run_site docker to the new compose setup
 Key: ARROW-4415
 URL: https://issues.apache.org/jira/browse/ARROW-4415
 Project: Apache Arrow
  Issue Type: Bug
  Components: Documentation
Reporter: Krisztian Szucs
Assignee: Krisztian Szucs


Eventually, all Docker-related code under 
https://github.com/apache/arrow/tree/master/dev should be moved to the new 
docker-compose setup defined in the top-level docker-compose.yml.





[jira] [Created] (ARROW-4414) [C++] Stop using cmake COMMAND_EXPAND_LISTS because it breaks package builds for older distros

2019-01-29 Thread Krisztian Szucs (JIRA)
Krisztian Szucs created ARROW-4414:
--

 Summary: [C++] Stop using cmake COMMAND_EXPAND_LISTS because it 
breaks package builds for older distros
 Key: ARROW-4414
 URL: https://issues.apache.org/jira/browse/ARROW-4414
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Krisztian Szucs


The COMMAND_EXPAND_LISTS option of add_custom_command is too new for Ubuntu 
Xenial and Debian stretch. It has only been available since CMake 3.8: 
https://cmake.org/cmake/help/v3.8/command/add_custom_command.html
We need to stop using it in cpp/src/gandiva/precompiled/CMakeLists.txt.

Also, we should pin CMake to version 3.5 in the Travis builds (Xenial ships 
CMake 3.5).
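
A minimal sketch of the version constraint described above, assuming the CMake 
version is available as text (for example from the output of `cmake --version`). 
The helper names `parse_cmake_version` and `supports_command_expand_lists` are 
hypothetical; the only fact taken from this issue is that 
COMMAND_EXPAND_LISTS requires CMake 3.8 or newer.

```python
import re

# COMMAND_EXPAND_LISTS is available since CMake 3.8 (per this issue).
COMMAND_EXPAND_LISTS_MIN_VERSION = (3, 8)


def parse_cmake_version(text):
    """Extract (major, minor, patch) from text such as 'cmake version 3.5.1'."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError("no version number found in %r" % (text,))
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def supports_command_expand_lists(version_text):
    """Return True if the given CMake version is at least 3.8."""
    major_minor = parse_cmake_version(version_text)[:2]
    return major_minor >= COMMAND_EXPAND_LISTS_MIN_VERSION
```

Comparing integer tuples rather than strings matters here: as text, "3.10" 
would sort before "3.8".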





[jira] [Commented] (ARROW-3954) [Rust] Add Slice to Array and ArrayData

2019-01-29 Thread Neville Dipale (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754831#comment-16754831
 ] 

Neville Dipale commented on ARROW-3954:
---

Thanks, Chao. I might be out of my depth with this one; I mainly struggled with 
ArrayData's buffers before I gave up.

> [Rust] Add Slice to Array and ArrayData
> ---
>
> Key: ARROW-3954
> URL: https://issues.apache.org/jira/browse/ARROW-3954
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Rust
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 0.13.0
>
>
> Similar to C++, we should be able to construct a zero-copy slice from {{Array}} 
> and {{ArrayData}}.





[jira] [Created] (ARROW-4413) [Python] pyarrow.hdfs.connect() failing

2019-01-29 Thread Bradley Grantham (JIRA)
Bradley Grantham created ARROW-4413:
---

 Summary: [Python] pyarrow.hdfs.connect() failing
 Key: ARROW-4413
 URL: https://issues.apache.org/jira/browse/ARROW-4413
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.12.0
 Environment: Python 2.7
Hadoop distribution: Amazon 2.7.3
Hive 2.1.1 
Spark 2.1.1
Tez 0.8.4
Linux 4.4.35-33.55.amzn1.x86_64
Reporter: Bradley Grantham


Trying to connect to hdfs using the below snippet. Using {{hadoop-libhdfs}}.
This error appears in {{v0.12.0}}. It doesn't appear in {{v0.11.1}}. (I used 
the same environment when testing that it still worked on {{v0.11.1}})

 
{code:java}
In [1]: import pyarrow as pa

In [2]: fs = pa.hdfs.connect()

---
TypeError Traceback (most recent call last)
 in ()
----> 1 fs = pa.hdfs.connect()

/usr/local/lib64/python2.7/site-packages/pyarrow/hdfs.pyc in connect(host, 
port, user, kerb_ticket, driver, extra_conf)
205 fs = HadoopFileSystem(host=host, port=port, user=user,
206   kerb_ticket=kerb_ticket, driver=driver,
--> 207   extra_conf=extra_conf)
208 return fs

/usr/local/lib64/python2.7/site-packages/pyarrow/hdfs.pyc in __init__(self, 
host, port, user, kerb_ticket, driver, extra_conf)
 36 _maybe_set_hadoop_classpath()
 37 
---> 38 self._connect(host, port, user, kerb_ticket, driver, extra_conf)
 39 
 40 def __reduce__(self):

/usr/local/lib64/python2.7/site-packages/pyarrow/io-hdfs.pxi in 
pyarrow.lib.HadoopFileSystem._connect()
 72 if host is not None:
 73 conf.host = tobytes(host)
---> 74 self.host = host
 75 
 76 conf.port = port

TypeError: Expected unicode, got str
{code}
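
The traceback ends at an attribute assignment that appears to require a unicode 
value but receives a Python 2 `str` (bytes). A minimal sketch of the kind of 
coercion that would avoid this, assuming a hypothetical helper named 
`ensure_text` and a UTF-8 default encoding; this is not the pyarrow code, only 
an illustration of the idea.

```python
def ensure_text(value, encoding="utf-8"):
    """Return value as a text (unicode) string, decoding bytes if needed.

    On Python 2, a plain str is a bytes object and is decoded to unicode.
    On Python 3, only bytes are decoded; str is returned unchanged.
    """
    if value is None:
        return None
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```

Applied before the assignment shown at line 74 of the traceback, such a helper 
would hand the attribute a unicode value regardless of which string type the 
caller passed.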





[jira] [Comment Edited] (ARROW-4413) [Python] pyarrow.hdfs.connect() failing

2019-01-29 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754796#comment-16754796
 ] 

Antoine Pitrou edited comment on ARROW-4413 at 1/29/19 9:45 AM:


The following patch would probably work, but I don't know how to test it:
https://gist.github.com/pitrou/1ee2e1b04543cddead11a146938d9e80



was (Author: pitrou):
The following patch would probably work, but I don't know how test it:
https://gist.github.com/pitrou/1ee2e1b04543cddead11a146938d9e80


> [Python] pyarrow.hdfs.connect() failing
> ---
>
> Key: ARROW-4413
> URL: https://issues.apache.org/jira/browse/ARROW-4413
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0
> Environment: Python 2.7
> Hadoop distribution: Amazon 2.7.3
> Hive 2.1.1 
> Spark 2.1.1
> Tez 0.8.4
> Linux 4.4.35-33.55.amzn1.x86_64
>Reporter: Bradley Grantham
>Priority: Major
>
> Trying to connect to hdfs using the below snippet. Using {{hadoop-libhdfs}}.
> This error appears in {{v0.12.0}}. It doesn't appear in {{v0.11.1}}. (I used 
> the same environment when testing that it still worked on {{v0.11.1}})
>  
> {code:java}
> In [1]: import pyarrow as pa
> In [2]: fs = pa.hdfs.connect()
> ---
> TypeError Traceback (most recent call last)
>  in ()
> ----> 1 fs = pa.hdfs.connect()
> /usr/local/lib64/python2.7/site-packages/pyarrow/hdfs.pyc in connect(host, 
> port, user, kerb_ticket, driver, extra_conf)
> 205 fs = HadoopFileSystem(host=host, port=port, user=user,
> 206   kerb_ticket=kerb_ticket, driver=driver,
> --> 207   extra_conf=extra_conf)
> 208 return fs
> /usr/local/lib64/python2.7/site-packages/pyarrow/hdfs.pyc in __init__(self, 
> host, port, user, kerb_ticket, driver, extra_conf)
>  36 _maybe_set_hadoop_classpath()
>  37 
> ---> 38 self._connect(host, port, user, kerb_ticket, driver, 
> extra_conf)
>  39 
>  40 def __reduce__(self):
> /usr/local/lib64/python2.7/site-packages/pyarrow/io-hdfs.pxi in 
> pyarrow.lib.HadoopFileSystem._connect()
>  72 if host is not None:
>  73 conf.host = tobytes(host)
> ---> 74 self.host = host
>  75 
>  76 conf.port = port
> TypeError: Expected unicode, got str
> {code}





[jira] [Commented] (ARROW-4413) [Python] pyarrow.hdfs.connect() failing

2019-01-29 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754796#comment-16754796
 ] 

Antoine Pitrou commented on ARROW-4413:
---

The following patch would probably work, but I don't know how to test it:
https://gist.github.com/pitrou/1ee2e1b04543cddead11a146938d9e80


> [Python] pyarrow.hdfs.connect() failing
> ---
>
> Key: ARROW-4413
> URL: https://issues.apache.org/jira/browse/ARROW-4413
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0
> Environment: Python 2.7
> Hadoop distribution: Amazon 2.7.3
> Hive 2.1.1 
> Spark 2.1.1
> Tez 0.8.4
> Linux 4.4.35-33.55.amzn1.x86_64
>Reporter: Bradley Grantham
>Priority: Major
>
> Trying to connect to hdfs using the below snippet. Using {{hadoop-libhdfs}}.
> This error appears in {{v0.12.0}}. It doesn't appear in {{v0.11.1}}. (I used 
> the same environment when testing that it still worked on {{v0.11.1}})
>  
> {code:java}
> In [1]: import pyarrow as pa
> In [2]: fs = pa.hdfs.connect()
> ---
> TypeError Traceback (most recent call last)
>  in ()
> ----> 1 fs = pa.hdfs.connect()
> /usr/local/lib64/python2.7/site-packages/pyarrow/hdfs.pyc in connect(host, 
> port, user, kerb_ticket, driver, extra_conf)
> 205 fs = HadoopFileSystem(host=host, port=port, user=user,
> 206   kerb_ticket=kerb_ticket, driver=driver,
> --> 207   extra_conf=extra_conf)
> 208 return fs
> /usr/local/lib64/python2.7/site-packages/pyarrow/hdfs.pyc in __init__(self, 
> host, port, user, kerb_ticket, driver, extra_conf)
>  36 _maybe_set_hadoop_classpath()
>  37 
> ---> 38 self._connect(host, port, user, kerb_ticket, driver, 
> extra_conf)
>  39 
>  40 def __reduce__(self):
> /usr/local/lib64/python2.7/site-packages/pyarrow/io-hdfs.pxi in 
> pyarrow.lib.HadoopFileSystem._connect()
>  72 if host is not None:
>  73 conf.host = tobytes(host)
> ---> 74 self.host = host
>  75 
>  76 conf.port = port
> TypeError: Expected unicode, got str
> {code}


