Re: Access Gandiva filter result by array index

2018-12-14 Thread Suvayu Ali
Hi Ravindra,

On Fri, Dec 14, 2018 at 01:11:02PM +0530, Ravindra Pindikura wrote:
> > 
> >   But I can't access the elements of the selection vector!  Since it is
> >   declared as std::shared_ptr, the Value(..) method isn't found.  I had
> >   filled it with SelectionVector::MakeInt16(..), so I tried downcasting to
> >   arrow::NumericArray, but that fails!
> 
> This should work:
> 
>   auto array =
>       std::dynamic_pointer_cast<arrow::NumericArray<arrow::UInt16Type>>(selected->ToArray());
>   printf("%d %d\n", array->Value(0), array->Value(1));

Silly of me to not try the unsigned type in the first place!  Thanks a lot :)
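
For anyone reading along who wonders why the unsigned type matters: a 16-bit
selection vector holds the positions of the rows that passed the filter, and
positions are never negative, so the values are unsigned 16-bit integers. Below
is a minimal Python sketch of that idea only. It uses the standard-library
array module, not Gandiva or Arrow, and build_selection_vector is a
hypothetical helper name chosen for illustration.

```python
from array import array


def build_selection_vector(values, predicate):
    """Return the indices of values that satisfy predicate, as unsigned shorts."""
    # Typecode "H" is an unsigned short (at least 16 bits); row indices are never negative.
    selection = array("H")
    for index, value in enumerate(values):
        if predicate(value):
            selection.append(index)
    return selection


values = [5, 12, 7, 30, 1]
selected = build_selection_vector(values, lambda v: v > 6)

# Access by position, analogous to array->Value(0) and array->Value(1) above.
print(selected[0], selected[1])
```

Reading the same buffer as signed 16-bit values would only agree with this for
indices below 32768, which is presumably why the element type has to match.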

Cheers,

-- 
Suvayu

Open source is the future. It sets us free.


Re: C++ documentation overhaul

2018-12-14 Thread Antonio Cavallo
Hi Antoine,
I'm trying to learn about Arrow. Would it be possible for me to help with the
documentation?

Do you have a repository I can contribute to?
Thanks

On Wed, 12 Dec 2018 at 09:13, Antoine Pitrou  wrote:

>
> Hello,
>
> We are doing a refactor of the C++ documentation which will appear in
> 0.12.0.
>
> Currently, the main entry point of the C++ documentation is a
> Doxygen-generated API documentation in the traditional format, together
> with a couple MarkDown pages covering some example use cases.
>
> The rewrite integrates the C++ API documentation in a larger Sphinx
> documentation also holding the format specification and Python docs.
> This allows us to add cross-references very easily and make the whole
> documentation more cohesive.
>
> To accompany this transformation, I have started writing some prose
> documentation about fundamental concepts in the C++ API.  I have
> uploaded a snapshot build of this work-in-progress here:
> https://pitrou.net/arrowdevdoc/cpp/index.html
>
> Comments and suggestions are welcome.
>
> Regards
>
> Antoine.
>


Re: C++ documentation overhaul

2018-12-14 Thread Antoine Pitrou


Hi Antonio,

Everything is done in the main Arrow repository in a regular fashion
(e.g. you can open Pull Requests there).  Help on the documentation is
welcome, as many aspects are missing currently.

Feel free to ask any questions!

Regards

Antoine.




[jira] [Created] (ARROW-4029) [C++] Define and document naming convention for internal / private header files not to be installed

2018-12-14 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4029:
---

             Summary: [C++] Define and document naming convention for internal / private header files not to be installed
                 Key: ARROW-4029
                 URL: https://issues.apache.org/jira/browse/ARROW-4029
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Wes McKinney
             Fix For: 0.12.0


The purpose of this is so that a common {{ARROW_INSTALL_PUBLIC_HEADERS}} can 
recognize and exclude any file that is non-public from installation.

see discussion on https://github.com/apache/arrow/pull/3172



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4030) [CI] Use travis_terminate to halt builds when a step fails

2018-12-14 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4030:
---

             Summary: [CI] Use travis_terminate to halt builds when a step fails
                 Key: ARROW-4030
                 URL: https://issues.apache.org/jira/browse/ARROW-4030
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Continuous Integration
            Reporter: Wes McKinney
             Fix For: 0.12.0


I noticed that Travis CI will soldier onward if a step in its {{script:}} block 
fails. This wastes build time when there is an error somewhere early on in the 
testing process.

For example, in the main C++ build, if {{travis_script_cpp.sh}} fails, then the 
subsequent steps will continue.

It seems the way to deal with this is to add {{|| travis_terminate 1}} to lines 
that can fail. See

https://medium.com/@manjula.cse/how-to-stop-the-execution-of-travis-pipeline-if-script-exits-with-an-error-f0e5a43206bf

I also found this discussion:

https://github.com/travis-ci/travis-ci/issues/1066
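
To make the difference concrete, here is a small Python model of the two
behaviors described above. It is only a sketch of the control flow, not Travis
CI code: run_steps and the step names "python" and "docs" are made up for
illustration, and each step is a plain callable that returns an exit code.

```python
def run_steps(steps, halt_on_failure):
    """Run (name, callable) steps in order; each callable returns an exit code."""
    executed = []
    build_status = 0
    for name, step in steps:
        executed.append(name)
        exit_code = step()
        if exit_code != 0:
            build_status = 1
            if halt_on_failure:
                # Models appending `|| travis_terminate 1` to the failing line.
                break
    return build_status, executed


steps = [
    ("cpp", lambda: 1),     # stands in for a failing travis_script_cpp.sh
    ("python", lambda: 0),  # hypothetical later step
    ("docs", lambda: 0),    # hypothetical later step
]

print(run_steps(steps, halt_on_failure=False))
print(run_steps(steps, halt_on_failure=True))
```

In both cases the build ends up marked as failed; the only difference is
whether the later steps consume build time first.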





Re: Arrow JS 0.4.0 Release

2018-12-14 Thread Wes McKinney
hi Paul,

On Thu, Dec 13, 2018 at 8:59 PM Paul Taylor  wrote:
>
> Another update: all the existing features and unit tests are working
> again except for the Table/RecordBatch streaming toString()
> implementations (and the `arrow2csv` utility), which I'll update later
> tonight.
>
> On JS release cadence, I think Brian's right that the current setup is
> working counter to our original intent. I am used to (and prefer) a
> faster-paced release cycle, essentially releasing early and as often as
> bugs are fixed or features are added. Indeed, Graphistry maintains a
> repo  with the
> latest version of the library that we can build against, which I update
> when I fix any bugs or add features.
>

It is common for software vendors to have "downstream" releases, so
this is reasonable as long as this work is not promoted as Apache
releases.

> The JS project is young, and sometimes has to move at a rapid pace. I've
> felt the turnaround time involved in the vote/prepare/verify/publish
> release process is slower than would be helpful to me. I'm used to
> publishing patch releases to npm as soon as possible, possibly multiple
> times a day.

Well, surely the recent security problems with NPM demonstrate that
there is value in giving the community opportunity to vet a package
before it is published for the world to use, and that GPG-signing
packages is an important security measure to ensure that production
code is coming from a network of trust. It is different if you are
publishing packages for your own personal or corporate use.

>
> None of the PMCs contribute to or use the JS version (if that's wrong,
> hit me up!) so there's been no release pressure from there. None of the
> JS contributors are PMCs so even if we want to do releases, we have to
> wait for a PMC. My take is that everyone on the project (especially
> PMCs) are probably ungodly busy people, and since not releasing to npm
> hasn't been blocking me, I opt not to bother folks.

I am happy to help release the JS package as often as you like, up to
multiple times per month. I stated this early on in the process, but
there has not seemed to be much desire to release. Brian's recent
request to release caught me at a bad time at the end of the year, but
there are other active PMCs who should be able to help. If you do
decide you want to release in the next week or two, please let me know
and I will make the time to help.

The lack of PMCs with an interest in JavaScript is a bit of a
self-perpetuating issue. One of the responsibilities of PMC members
(and what will enable a committer to become a PMC) is to promote the
growth and development of a healthy community. This includes making
sure that the project releases. The JS developer community hasn't
grown much, though. My approach to such a problem is to act as a
"community of one" until it changes -- drive a project forward and
ensure a steady cadence of releases.

- Wes

>
>
> On 12/13/18 11:52 AM, Wes McKinney wrote:
> > +1 for synchronizing to the main releases when possible. In the 0.12
> > thread we have discussed moving to time-based releases (e.g. every 2
> > months). Time-based releases are helpful to create urgency around
> > getting work completed, and making sure that the project is always
> > ready to release.
> > On Thu, Dec 13, 2018 at 10:39 AM Brian Hulette  wrote:
> >> Sounds great Paul! Really excited that this refactor is wrapping up. My
> >> only concern with including this in 0.4.0 is that I'm not going to have the
> >> time to thoroughly review it for a few weeks, so gating on that would
> >> really delay it. But I can just manually test with some use-cases I care
> >> about in lieu of a thorough review in the interest of time.
> >>
> >> I think in the future (after 0.12?) it may behoove us to tie back in to the
> >> main Arrow release cycle. The idea with the separate JS release was to
> >> allow us to release faster, but in practice it has done the opposite. Since
> >> the fall of 2017 we've cut two major JS releases (0.2, 0.3) while there
> >> were four major main releases (0.8 - 0.11). Not to mention the disjoint
> >> version numbers can be confusing to users - perhaps not as much of a
> >> concern now that the format is pretty stable, but it can still be a
> >> friction point. And finally selfishly - if we had been on the main release
> >> cycle, the contributions I made in the summer would have been released in
> >> either 0.10 or 0.11 by now.
> >>
> >> Brian
> >>
> >> On Thu, Dec 13, 2018 at 3:29 AM Paul Taylor  wrote:
> >>
> >>> The ongoing JS refactor/upgrade branch
> >>>  is just
> >>> about done. It's passing all the integration tests, as well as a hundred
> >>> or so new unit tests. I have to update existing tests where the APIs
> >>> changed, battle with closure-compiler a bit, then it'll be ready to
> >>> merge in and ship out. I think I'll be able to wrap

[jira] [Created] (ARROW-4031) [C++] Refactor ArrayBuilder bitmap logic into TypedBufferBuilder

2018-12-14 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-4031:


             Summary: [C++] Refactor ArrayBuilder bitmap logic into TypedBufferBuilder
                 Key: ARROW-4031
                 URL: https://issues.apache.org/jira/browse/ARROW-4031
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Benjamin Kietzman


It would be useful to have a specialization of TypedBufferBuilder to simplify 
building buffers of bits. This could then be utilized by ArrayBuilder (for the 
null bitmap) and BooleanBuilder (for values).
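
As a rough illustration of what "building buffers of bits" involves, here is a
minimal Python sketch of a builder that packs booleans eight to a byte,
least-significant bit first (the bit order Arrow uses for validity bitmaps).
BitBufferBuilder and its methods are hypothetical names for this sketch; this
is not the proposed C++ interface.

```python
class BitBufferBuilder:
    """Hypothetical sketch: append booleans one at a time into packed bytes."""

    def __init__(self):
        self._bytes = bytearray()
        self._length = 0  # number of bits appended so far

    def append(self, bit):
        byte_index, bit_index = divmod(self._length, 8)
        if bit_index == 0:
            # Starting a new byte; all of its bits begin cleared.
            self._bytes.append(0)
        if bit:
            # Least-significant-bit-first ordering within each byte.
            self._bytes[byte_index] |= 1 << bit_index
        self._length += 1

    def finish(self):
        """Return the packed bytes and the number of valid bits."""
        return bytes(self._bytes), self._length


validity = BitBufferBuilder()
for is_valid in [True, False, True, True, False, False, False, False, True]:
    validity.append(is_valid)

buffer, length = validity.finish()
print(buffer.hex(), length)
```

The same append logic would serve both uses named above: a null bitmap (one
bit per slot, set when the value is present) and boolean values (one bit per
value).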





[jira] [Created] (ARROW-4032) [Python] New pyarrow.Table.from_pydict() function

2018-12-14 Thread David Lee (JIRA)
David Lee created ARROW-4032:


             Summary: [Python] New pyarrow.Table.from_pydict() function
                 Key: ARROW-4032
                 URL: https://issues.apache.org/jira/browse/ARROW-4032
             Project: Apache Arrow
          Issue Type: Task
          Components: Python
            Reporter: David Lee


Here's a proposal to create a pyarrow.Table.from_pydict() function.

Right now only pyarrow.Table.from_pandas() exists, and there are inherent problems 
using Pandas with NULL support for Int(s) and Boolean(s). See the section 
"{{NaN}}, Integer {{NA}} values and {{NA}} type promotions" of the Pandas gotchas page:

[http://pandas.pydata.org/pandas-docs/version/0.23.4/gotchas.html]

Sample Python code showing how this would work:

 
{code:python}
import pyarrow as pa
from datetime import datetime

pylist = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7, "birthday": datetime.now()}
]

def from_pydict(pylist, columns):
    # Build one Arrow array per requested column; rows lacking the key get None (null).
    arrow_columns = list()
    for column in columns:
        arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist]))
    arrow_table = pa.Table.from_arrays(arrow_columns, columns)
    return arrow_table

test = from_pydict(pylist, ['name', 'age', 'city', 'birthday', 'dummy'])
{code}
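
To show what the list comprehension in the sample hands to pa.array, here is a
minimal pure-Python sketch of the same row-to-column pivot. It deliberately
avoids pyarrow so it runs anywhere, and rows_to_columns is a hypothetical
helper name, not part of the proposal.

```python
def rows_to_columns(rows, columns):
    """Pivot a list of dicts into {column: [values]}, using None for missing keys."""
    return {column: [row.get(column) for row in rows] for column in columns}


rows = [
    {"name": "Tom", "age": 10},
    {"name": "Mark", "age": 5, "city": "San Francisco"},
    {"name": "Pam", "age": 7},
]

columns = rows_to_columns(rows, ["name", "age", "city", "dummy"])
for column, values in columns.items():
    print(column, values)
```

A column that no row supplies, such as "dummy", comes out as all None, and
"age" stays a list of plain integers with no float promotion along the way.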
 





Re: Arrow JS 0.4.0 Release

2018-12-14 Thread Paul Taylor

Wes,

I didn't mean to sound like I was criticizing you, the project, or the 
release process. You're doing an outstanding job as a project lead, and 
it's a fine release process that helps ensure quality and security. Nor 
was I passive-aggressively expressing desire to be a PMC -- I'm 
overworked as it is and don't have the bandwidth to take on that 
responsibility. If I was and did, I'd be much more explicit about taking 
on those responsibilities regardless of PMC status :-)


I was only attempting to describe some of the reasons I (and perhaps 
others) haven't/don't push to release the JS package more often, and 
compare reality with the original intent behind having JS on a separate 
release track.


I also don't mean to criticize when I say I think a reason we don't 
release often might be because none of the JS users or maintainers are 
PMCs -- only trying to acknowledge the maintenance and release cycle is 
an attention-driven process. Since most of us contribute in conjunction 
with our other professional responsibilities, it's totally reasonable 
that if JS isn't part of a PMC's day-to-day, it'd be left to us to drive 
it forward.


I have been curious if there isn't a middle ground between the full 
RC/GM release process, and releasing what are essentially nightlies. npm 
has a feature to publish tagged releases that aren't considered mainline 
releases yet are still accessible to CI/auditing services. As long as 
the list of npm users authorized to publish the packages are Arrow 
contributors (and we force npm 2FA), we could have a lane for rapid 
iteration and release while we work out the kinks.


And lastly, an update on the refactor branch: all the features are 
working again, now just fixing the last few issues in the build scripts. 
I'm especially pleased that `cat ./some-gigantic-table.arrow | npx 
arrow2csv | less` doesn't stream the entire table to less and terminate 
with a broken-pipe error anymore :-)


Paul



npmjs.com account to release Apache Arrow JavaScript

2018-12-14 Thread Kouhei Sutou
Hi Brian,

I read this change:

  
https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=87298036&selectedPageVersions=46&selectedPageVersions=45

Can you add me as a collaborator on apache-arrow? I may
release the apache-arrow npm package as a PMC member.

Here is my account on npmjs.com:

  https://www.npmjs.com/~kou


Thanks,
--
kou


[jira] [Created] (ARROW-4033) [C++] thirdparty/download_dependencies.sh uses tools or options not available in older Linuxes

2018-12-14 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-4033:
---

             Summary: [C++] thirdparty/download_dependencies.sh uses tools or options not available in older Linuxes
                 Key: ARROW-4033
                 URL: https://issues.apache.org/jira/browse/ARROW-4033
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Wes McKinney
             Fix For: 0.12.0


I found I had to install the {{realpath}} apt package on Ubuntu 14.04. Also, 
{{wget 1.15}} does not have the {{--show-progress}} option.





[jira] [Created] (ARROW-4034) red-arrow interface for FileOutputStream doesn't respect append=True

2018-12-14 Thread Ian Murray (JIRA)
Ian Murray created ARROW-4034:
-

             Summary: red-arrow interface for FileOutputStream doesn't respect append=True
                 Key: ARROW-4034
                 URL: https://issues.apache.org/jira/browse/ARROW-4034
             Project: Apache Arrow
          Issue Type: Bug
          Components: Ruby
         Environment: macOS High Sierra version 10.13.4; ruby 2.4.1; gtk-doc, gobject-introspection, boost, Arrow C++ & Parquet C++, Arrow GLib all installed via homebrew
            Reporter: Ian Murray


It seems that the PR (#1978) that resolved Issue #2018 has not cascaded down 
through the existing ruby interface.

I've been experimenting with variations of the `write-file.rb` examples, but 
passing in the append flag as true 
(`Arrow::FileOutputStream.open("/tmp/file.arrow", true)`) still results in 
overwriting the file, and trying the newer interface using truncate and append 
flags throws `ArgumentError: wrong number of arguments (3 for 2)`.
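
For clarity about the behavior I expected, here is a minimal Python sketch of
append versus truncate semantics using ordinary file modes. It is not red-arrow
or Arrow code, write_chunk is a hypothetical helper, and it only illustrates
the stream semantics, not whether the resulting bytes form a valid Arrow file.

```python
import os
import tempfile


def write_chunk(path, data, append):
    """Write data to path, either appending to it or truncating it first."""
    mode = "ab" if append else "wb"
    with open(path, mode) as stream:
        stream.write(data)


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "file.bin")
    write_chunk(path, b"first", append=False)
    write_chunk(path, b"second", append=True)   # expected: earlier bytes are kept
    with open(path, "rb") as stream:
        appended = stream.read()
    write_chunk(path, b"third", append=False)   # truncates, like the overwrite reported above
    with open(path, "rb") as stream:
        truncated = stream.read()

print(appended, truncated)
```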





Re: npmjs.com account to release Apache Arrow JavaScript

2018-12-14 Thread Paul Taylor

Hi Kouhei,

I've added you as a maintainer of the apache-arrow top level package, as 
well as an owner on the @apache-arrow organization on npm.


Paul
