Re: Julia implementation and integration with main apache arrow repository

2020-09-14 Thread Kenta Murata
Hi Jacob,

I'm very excited to see that the Julia implementation of Arrow has been
restarted.

Pkg.jl now seems to support packages that live in subdirectories.
I believe the feature was added by
https://github.com/JuliaLang/Pkg.jl/pull/1766 and
https://github.com/JuliaRegistries/RegistryTools.jl/pull/31.
According to these pull requests, you can tell Pkg.jl the location of the
Julia package directory via the `subdir` parameter in Project.toml.
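For illustration only (the exact field names and registry layout here are my assumption based on RegistryTools.jl PR #31, not verified against the General registry), a registry entry for a package living in a monorepo subdirectory might look roughly like:

```toml
# Hypothetical registry Package.toml entry for a package that lives in a
# subdirectory of a monorepo; the uuid and subdir values are placeholders.
name = "Arrow"
uuid = "00000000-0000-0000-0000-000000000000"
repo = "https://github.com/apache/arrow.git"
subdir = "julia/Arrow"
```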

Mon, Sep 14, 2020 at 4:33 Jacob Quinn :
>
> Hello all,
>
> Hopefully this email works (I'm not super familiar with using mailing lists
> like this).
>
> Over the past few weeks, I've been working on a pure Julia implementation
> to support serializing/deserializing the arrow format for Julia. The code
> in its current state can be found here:
> https://github.com/JuliaData/Arrow.jl.
>
> I believe the code has reached initial beta-level quality, and I have just
> finished writing the arrow <-> json integration testing code that archery
> expects. I haven't worked on actual archery integration yet, but it should
> just be a matter of adding a tester_julia.py file that knows how to invoke
> the test/integrationtest.jl file with arguments similar to those of the
> tester_go.py file.
>
> This email has a couple of purposes:
> * Signal that the Julia code is somewhat ready to be used/integrated in the
> main repo
> * Ask for advice/direction on actually integrating with the apache arrow
> github repository
>
> For the latter, in particular, I imagine keeping an initial PR as minimal
> as possible is desirable. I need to follow up with the core pkg devs for
> Julia, but I've been told it's possible/not hard to have a Julia package
> "live" inside a monorepo; I just haven't figured out the details of what
> that means on the Julia General package registry side of things. I'm happy
> to figure that out, though, and it shouldn't really affect the merging of
> Julia code into the apache arrow github.
>
> So my plan is roughly:
> * Fork/make a branch of the apache arrow repo
> * Add in the Julia code from the link I mentioned above
> * Add necessary files/integration in archery to run Julia integration tests
> alongside other languages
> * Do initial merge into apache arrow?
>
> If there are other initial requirements core devs would expect, just let me
> know, but I imagine that updating the implementation matrix, for example,
> can be done afterwards as a follow-up.
>
> Excited to have Julia more officially integrated here!
>
> Cheers,
>
> -Jacob
> https://github.com/quinnj
> https://twitter.com/quinn_jacobd



-- 
Regards,
Kenta Murata


Re: [DISCUSS][C++] Group by operation for RecordBatch and Table

2020-08-10 Thread Kenta Murata
Hi Wes,

Thank you very much for the detailed explanation of your thoughts.

I'll need the knowledge of the state-of-the-art query engines you pointed
out, whether I contribute to the C++ query engine or just write bindings
for it.  I'm studying the articles and the code.

Regards,
Kenta Murata

On Thu, Aug 6, 2020 at 4:17 Wes McKinney  wrote:

> I see there's a bunch of additional aggregation code in Dremio that
> might serve as inspiration (some of which is related to distributed
> aggregation, so may not be relevant)
>
>
> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/sabot/op/aggregate
>
> Maybe Andy or one of the other active Rust DataFusion developers can
> comment on the approach taken for hash aggs there
>
> On Wed, Aug 5, 2020 at 1:52 PM Wes McKinney  wrote:
> >
> > hi Kenta,
> >
> > Yes, I think it only makes sense to implement this in the context of
> > the query engine project. Here's a list of assorted thoughts about it:
> >
> > * I have been mentally planning to follow the Vectorwise-type query
> > engine architecture that's discussed in [1] [2] and many other
> > academic papers. I believe this is how some other current generation
> > open source columnar query engines work, such as Dremio [3] and DuckDB
> > [4][5].
> > * Hash (aka "group") aggregations need to be able to process arbitrary
> > expressions, not only a plain input column. So it's not enough to be
> > able to compute "sum(x) group by y" where "x" and "y" are fields in a
> > RecordBatch, we need to be able to compute "$AGG_FUNC($EXPR) GROUP BY
> > $GROUP_EXPR_1, $GROUP_EXPR_2, ..." where $EXPR / $GROUP_EXPR_1 / ...
> > are any column expressions computed from the input relations (keep in
> > mind that an aggregation could apply to a stream of record batches
> > produced by a join). In any case, expression evaluation is a
> > closely-related task and should be implemented ASAP.
> > * Hash aggregation functions themselves should probably be introduced
> > as a new Function type in arrow::compute. I don't think it would be
> > appropriate to use the existing "SCALAR_AGGREGATE" functions, instead
> > we should introduce a new HASH_AGGREGATE function type that accepts
> > input data to be aggregated along with an array of pre-computed bucket
> > ids (which are computed by probing the HT). So rather than
> > Update(state, args) like we have for scalar aggregate, the primary
> > interface for group aggregation is Update(state, bucket_ids, args)
> > * The HashAggregation operator should be able to process an arbitrary
> > iterator of record batches
> > * We will probably want to adapt an existing or implement a new
> > concurrent hash table so that aggregations can be performed in
> > parallel without requiring a post-aggregation merge step
> > * There's some general support machinery for hashing multiple fields
> > and then doing efficient vectorized hash table probes (to assign
> > aggregation bucket id's to each row position)
> >
> > I think it is worth investing the effort to build something that is
> > reasonably consistent with the "state of the art" in database systems
> > (at least according to what we are able to build with our current
> > resources) rather than building something more crude that has to be
> > replaced with a new implementation later.
> >
> > I'd like to help personally with this work (particularly since the
> > natural next step with my recent work in arrow/compute is to implement
> > expression evaluation) but I won't have significant bandwidth for it
> > until later this month or early September. If someone feels that they
> > sufficiently understand the state of the art for this type of workload
> > and wants to help with laying down the abstract C++ APIs for
> > Volcano-style query execution and an implementation of hash
> > aggregation, that sounds great.
> >
> > Thanks,
> > Wes
> >
> > [1]: https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf
> > [2]: https://github.com/TimoKersten/db-engine-paradigms
> > [3]:
> https://github.com/dremio/dremio-oss/tree/master/sabot/kernel/src/main/java/com/dremio/sabot/op/aggregate/hash
> > [4]:
> https://github.com/cwida/duckdb/blob/master/src/include/duckdb/execution/aggregate_hashtable.hpp
> > [5]:
> https://github.com/cwida/duckdb/blob/master/src/execution/aggregate_hashtable.cpp
> >
> > On Wed, Aug 5, 2020 at 10:23 AM Kenta Murata  wrote:
> > >
> > > Hi folks,
> > >
> > > Red Arrow, the Ruby binding of Arrow GLib, implements group
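To make Wes's proposed Update(state, bucket_ids, args) interface concrete, here is a minimal sketch (my own illustration in plain C++, not the actual arrow::compute API) of a grouped sum whose update step consumes pre-computed hash-bucket ids instead of the scalar-aggregate Update(state, args):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One accumulator per hash bucket; bucket ids are assumed to have been
// assigned by probing a hash table over the GROUP BY expressions.
struct GroupedSumState {
  std::vector<int64_t> sums;
};

// Update(state, bucket_ids, args): accumulate each input value into the
// accumulator of the bucket its row was assigned to.
void Update(GroupedSumState* state, const std::vector<uint32_t>& bucket_ids,
            const std::vector<int64_t>& values) {
  for (std::size_t i = 0; i < values.size(); ++i) {
    const uint32_t bucket = bucket_ids[i];
    if (bucket >= state->sums.size()) state->sums.resize(bucket + 1, 0);
    state->sums[bucket] += values[i];
  }
}
```

Because the state is addressed by bucket id, an operator can keep feeding this function an arbitrary stream of record batches and only materialize results at the end.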

[DISCUSS][C++] Group by operation for RecordBatch and Table

2020-08-05 Thread Kenta Murata
Hi folks,

Red Arrow, the Ruby binding of Arrow GLib, implements grouped aggregation
features for RecordBatch and Table.  Because these features are written in
Ruby, they are too slow for large data.  We need to make them much
faster.

To improve their calculation speed, they should be written in C++, and
should be put in Arrow C++ instead of Red Arrow.

Is anyone working on implementing group-by operation for RecordBatch and
Table in Arrow C++?  If no one has worked on it, I would like to try it.

By the way, I found that the grouped aggregation feature is mentioned in
the design document of the Arrow C++ Query Engine.  Is the Query Engine,
rather than Arrow C++ Core, the suitable place to implement the group-by
operation?


Re: [DISCUSS][C++] MakeBuilder with a DictionaryType ignores the bit-width of the index type

2020-08-04 Thread Kenta Murata
Agreed.

I filed ARROW-9642 and opened its pull request:
https://github.com/apache/arrow/pull/7898

Tue, Aug 4, 2020 at 6:32 Wes McKinney :

>
> It seems useful to use the index type to set the starting bit width of
> the builder. I guess we can preserve the behavior of expanding to the
> next bit width when overflowing the smaller integer types.
>
> On Sun, Aug 2, 2020 at 9:32 PM Kenta Murata  wrote:
> >
> > Hi folks,
> >
> > The arrow::MakeBuilder function, given a dictionary type, creates a
> > dictionary builder backed by AdaptiveIntBuilder, ignoring the bit-width
> > of the DictionaryType's index type.
> > I'd like to know whether this behavior is intentional or not.
> >
> > This feature is useful when I want to use a dictionary builder
> > with AdaptiveIntBuilder.
> > But the result of the following code is a little surprising.
> >
> > ```cpp
> > #include <arrow/api.h>
> > #include <arrow/util/logging.h>
> > #include <iostream>
> >
> > int
> > main(int argc, char **argv)
> > {
> >   auto dict_type = arrow::dictionary(arrow::int32(), arrow::utf8());
> >   std::unique_ptr<arrow::ArrayBuilder> out;
> >   ARROW_CHECK_OK(arrow::MakeBuilder(arrow::default_memory_pool(),
> > dict_type, &out));
> >   std::cout << "type: " << out->type()->ToString() << std::endl;
> >   return 0;
> > }
> > ```
> >
> > Executing this code prints the message below:
> >
> > type: dictionary<values=string, indices=int8, ordered=0>
> >
> > I got `indices=int8` from a dictionary type with an int32 index type.
> > I guess most people would expect to get `indices=int32` here.
> >
> > --
> > Kenta Murata



--
Regards,
Kenta Murata


[DISCUSS][C++] MakeBuilder with a DictionaryType ignores the bit-width of the index type

2020-08-02 Thread Kenta Murata
Hi folks,

The arrow::MakeBuilder function, given a dictionary type, creates a
dictionary builder backed by AdaptiveIntBuilder, ignoring the bit-width
of the DictionaryType's index type.
I'd like to know whether this behavior is intentional or not.

This feature is useful when I want to use a dictionary builder
with AdaptiveIntBuilder.
But the result of the following code is a little surprising.

```cpp
#include <arrow/api.h>
#include <arrow/util/logging.h>
#include <iostream>

int
main(int argc, char **argv)
{
  auto dict_type = arrow::dictionary(arrow::int32(), arrow::utf8());
  std::unique_ptr<arrow::ArrayBuilder> out;
  ARROW_CHECK_OK(arrow::MakeBuilder(arrow::default_memory_pool(),
dict_type, &out));
  std::cout << "type: " << out->type()->ToString() << std::endl;
  return 0;
}
```

Executing this code prints the message below:

type: dictionary<values=string, indices=int8, ordered=0>

I got `indices=int8` from a dictionary type with an int32 index type.
I guess most people would expect to get `indices=int32` here.

-- 
Kenta Murata


Re: 0.17 release blog post: help needed

2020-04-19 Thread Kenta Murata
I've edited the Ruby and C GLib parts.
Kou and Shiro will check them later.

Mon, Apr 20, 2020 at 11:09 Wes McKinney :
>
> I made a pass through the changelog and added a bunch of TODOs related
> to C++. In general, as a reminder: since the releases are growing large,
> in these blog posts we should try to present as compact a high-level
> summary as possible to convey some of the highlights of our labors (so
> there's likely no need to write out any JIRA numbers; people can look at
> the changelog for that). I'll spend some more time on the blog post after
> others have had a chance to take a pass through.
>
> On Sat, Apr 18, 2020 at 12:13 PM Neal Richardson
>  wrote:
> >
> > Hi all,
> > Since it looks like we're close to releasing 0.17, we need to fill in the
> > details for our blog post announcement. I've started a document here:
> > https://docs.google.com/document/d/16UKZtvL49o8nCDN8JU3Ut6y76Y9d8-4qXv5vFv7aNvs/edit#heading=h.kqqacbm2lpv8
> >
> > Please fill in the details for the parts of the project you're close to.
> > I'll handle wrapping this up in the usual boilerplate when we're done.
> >
> > Thanks,
> > Neal



-- 
Regards,
Kenta Murata


[jira] [Created] (ARROW-8343) [GLib] Add GArrowRecordBatchIterator

2020-04-06 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-8343:
---

 Summary: [GLib] Add GArrowRecordBatchIterator
 Key: ARROW-8343
 URL: https://issues.apache.org/jira/browse/ARROW-8343
 Project: Apache Arrow
  Issue Type: New Feature
  Components: GLib
Reporter: Kenta Murata
Assignee: Kenta Murata






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8073) [GLib] Add binding of arrow::fs::PathForest

2020-03-11 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-8073:
---

 Summary: [GLib] Add binding of arrow::fs::PathForest
 Key: ARROW-8073
 URL: https://issues.apache.org/jira/browse/ARROW-8073
 Project: Apache Arrow
  Issue Type: New Feature
  Components: GLib
Reporter: Kenta Murata
Assignee: Kenta Murata


[jira] [Created] (ARROW-7739) [GLib] Use placement new to initialize shared_ptr object in private structs

2020-02-01 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7739:
---

 Summary: [GLib] Use placement new to initialize shared_ptr object 
in private structs
 Key: ARROW-7739
 URL: https://issues.apache.org/jira/browse/ARROW-7739
 Project: Apache Arrow
  Issue Type: Task
  Components: GLib
Reporter: Kenta Murata


[jira] [Created] (ARROW-7730) [GLib] Add Duration type support

2020-01-30 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7730:
---

 Summary: [GLib] Add Duration type support
 Key: ARROW-7730
 URL: https://issues.apache.org/jira/browse/ARROW-7730
 Project: Apache Arrow
  Issue Type: New Feature
  Components: GLib
Reporter: Kenta Murata


[jira] [Created] (ARROW-7698) [Format][C++] Add tensor and sparse tensor supports in File metadata

2020-01-27 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7698:
---

 Summary: [Format][C++] Add tensor and sparse tensor supports in 
File metadata
 Key: ARROW-7698
 URL: https://issues.apache.org/jira/browse/ARROW-7698
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Format
Reporter: Kenta Murata


[jira] [Created] (ARROW-7515) [C++] Rename nonexistent and non_existent to not_found

2020-01-08 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7515:
---

 Summary: [C++] Rename nonexistent and non_existent to not_found
 Key: ARROW-7515
 URL: https://issues.apache.org/jira/browse/ARROW-7515
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Kenta Murata
Assignee: Kenta Murata


[jira] [Created] (ARROW-7504) [GLib] Introduce value-returning garrow::check

2020-01-06 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7504:
---

 Summary: [GLib] Introduce value-returning garrow::check
 Key: ARROW-7504
 URL: https://issues.apache.org/jira/browse/ARROW-7504
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Kenta Murata
Assignee: Kenta Murata


Follow this discussion:
https://github.com/apache/arrow/pull/6066/files#r363367450


[jira] [Created] (ARROW-7445) [GLib] Add HadoopFileSystem support

2019-12-19 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7445:
---

 Summary: [GLib] Add HadoopFileSystem support
 Key: ARROW-7445
 URL: https://issues.apache.org/jira/browse/ARROW-7445
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: GLib
Reporter: Kenta Murata


[jira] [Created] (ARROW-7444) [GLib] Add LocalFileSystem support

2019-12-19 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7444:
---

 Summary: [GLib] Add LocalFileSystem support
 Key: ARROW-7444
 URL: https://issues.apache.org/jira/browse/ARROW-7444
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: GLib
Reporter: Kenta Murata


[jira] [Created] (ARROW-7443) [GLib] Add binding of arrow::fs

2019-12-19 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7443:
---

 Summary: [GLib] Add binding of arrow::fs
 Key: ARROW-7443
 URL: https://issues.apache.org/jira/browse/ARROW-7443
 Project: Apache Arrow
  Issue Type: New Feature
  Components: GLib
Reporter: Kenta Murata


[jira] [Created] (ARROW-7421) [C++] Support creating SparseCSRMatrix and SparseCSCMatrix from 0d and 1d Tensors

2019-12-17 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7421:
---

 Summary: [C++] Support creating SparseCSRMatrix and 
SparseCSCMatrix from 0d and 1d Tensors
 Key: ARROW-7421
 URL: https://issues.apache.org/jira/browse/ARROW-7421
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kenta Murata
Assignee: Kenta Murata


[jira] [Created] (ARROW-7420) [C++] Migrate internal functions of SparseTensor to Result-returning version

2019-12-17 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7420:
---

 Summary: [C++] Migrate internal functions of SparseTensor to 
Result-returning version
 Key: ARROW-7420
 URL: https://issues.apache.org/jira/browse/ARROW-7420
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Kenta Murata
Assignee: Kenta Murata


[jira] [Created] (ARROW-7419) [Python] Support SparseCSCMatrix

2019-12-17 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7419:
---

 Summary: [Python] Support SparseCSCMatrix
 Key: ARROW-7419
 URL: https://issues.apache.org/jira/browse/ARROW-7419
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Kenta Murata


[jira] [Created] (ARROW-7371) [GLib] Add Datasets binding

2019-12-11 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7371:
---

 Summary: [GLib] Add Datasets binding
 Key: ARROW-7371
 URL: https://issues.apache.org/jira/browse/ARROW-7371
 Project: Apache Arrow
  Issue Type: New Feature
  Components: GLib
Reporter: Kenta Murata
Assignee: Kenta Murata


[jira] [Created] (ARROW-7369) [GLib] Add garrow_table_combine_chunks

2019-12-10 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7369:
---

 Summary: [GLib] Add garrow_table_combine_chunks
 Key: ARROW-7369
 URL: https://issues.apache.org/jira/browse/ARROW-7369
 Project: Apache Arrow
  Issue Type: New Feature
  Components: GLib
Reporter: Kenta Murata
Assignee: Kenta Murata


[jira] [Created] (ARROW-7306) [C++] Add Result-returning version of FileSystemFromUri

2019-12-03 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7306:
---

 Summary: [C++] Add Result-returning version of FileSystemFromUri
 Key: ARROW-7306
 URL: https://issues.apache.org/jira/browse/ARROW-7306
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Kenta Murata
Assignee: Kenta Murata


[jira] [Created] (ARROW-7297) [C++] Add value accessor in sparse tensor class

2019-12-02 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7297:
---

 Summary: [C++] Add value accessor in sparse tensor class
 Key: ARROW-7297
 URL: https://issues.apache.org/jira/browse/ARROW-7297
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kenta Murata


{{SparseTensor}} could have a value accessor like {{Tensor::Value}}.


[jira] [Created] (ARROW-7291) [Dev] Fix FORMAT_DIR in update-flatbuffers.sh

2019-12-01 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7291:
---

 Summary: [Dev] Fix FORMAT_DIR in update-flatbuffers.sh
 Key: ARROW-7291
 URL: https://issues.apache.org/jira/browse/ARROW-7291
 Project: Apache Arrow
  Issue Type: Bug
  Components: Developer Tools
Reporter: Kenta Murata
Assignee: Kenta Murata


[jira] [Created] (ARROW-7037) [C++] Compile error on the combination of protobuf >= 3.9 and clang

2019-10-30 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7037:
---

 Summary: [C++] Compile error on the combination of protobuf >= 
3.9 and clang
 Key: ARROW-7037
 URL: https://issues.apache.org/jira/browse/ARROW-7037
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Kenta Murata
Assignee: Kenta Murata


I encountered the following compile error on the combination of protobuf 3.10.0 
and clang (Xcode 11).

{noformat}
[13/26] Building CXX object 
c++/src/CMakeFiles/orc.dir/wrap/orc-proto-wrapper.cc.o
FAILED: c++/src/CMakeFiles/orc.dir/wrap/orc-proto-wrapper.cc.o
/Applications/Xcode_11.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
   -Ic++/include 
-I/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/include
 
-I/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src
 -Ic++/src -isystem c++/libs/thirdparty/zlib_ep-install/include -isystem 
c++/libs/thirdparty/lz4_ep-install/include -Qunused-arguments 
-fcolor-diagnostics -ggdb -O0 -g -fPIC  -Wno-zero-as-null-pointer-constant 
-Wno-inconsistent-missing-destructor-override -Wno-error=undef -std=c++11 
-Weverything -Wno-c++98-compat -Wno-missing-prototypes 
-Wno-c++98-compat-pedantic -Wno-padded -Wno-covered-switch-default 
-Wno-missing-noreturn -Wno-unknown-pragmas 
-Wno-gnu-zero-variadic-macro-arguments -Wconversion -Wno-c++2a-compat -Werror 
-std=c++11 -Weverything -Wno-c++98-compat -Wno-missing-prototypes 
-Wno-c++98-compat-pedantic -Wno-padded -Wno-covered-switch-default 
-Wno-missing-noreturn -Wno-unknown-pragmas 
-Wno-gnu-zero-variadic-macro-arguments -Wconversion -Wno-c++2a-compat -Werror 
-O0 -g -MD -MT c++/src/CMakeFiles/orc.dir/wrap/orc-proto-wrapper.cc.o -MF 
c++/src/CMakeFiles/orc.dir/wrap/orc-proto-wrapper.cc.o.d -o 
c++/src/CMakeFiles/orc.dir/wrap/orc-proto-wrapper.cc.o -c 
/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src/wrap/orc-proto-wrapper.cc
In file included from 
/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src/wrap/orc-proto-wrapper.cc:44:
c++/src/orc_proto.pb.cc:959:145: error: possible misuse of comma operator here 
[-Werror,-Wcomma]
static bool dynamic_init_dummy_orc_5fproto_2eproto = (  
::PROTOBUF_NAMESPACE_ID::internal::AddDescriptors(_table_orc_5fproto_2eproto),
 true);

^
c++/src/orc_proto.pb.cc:959:57: note: cast expression to void to silence warning
static bool dynamic_init_dummy_orc_5fproto_2eproto = (  
::PROTOBUF_NAMESPACE_ID::internal::AddDescriptors(_table_orc_5fproto_2eproto),
 true);

^~~~
static_cast(  
)
1 error generated.
{noformat}

This may be due to a bug in protobuf, filed as
https://github.com/protocolbuffers/protobuf/issues/6619.
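For reference, here is a minimal standalone illustration (my own simplification, not the actual protobuf-generated code) of the pattern that clang's -Wcomma flags, together with the static_cast<void> fix the compiler note suggests:

```cpp
// A registration function whose return value is deliberately discarded
// via the comma operator, mirroring the generated descriptor-init code.
static int AddDescriptors() { return 1; }

// Under -Werror,-Wcomma clang flags this line: discarding the first
// operand of the comma operator looks like a possible mistake.
static bool init_flagged = (AddDescriptors(), true);

// Casting the discarded expression to void makes the intent explicit
// and silences the warning.
static bool init_silenced = (static_cast<void>(AddDescriptors()), true);
```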





[jira] [Created] (ARROW-7036) [C++] Version up ORC to avoid compile errors

2019-10-30 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-7036:
---

 Summary: [C++] Version up ORC to avoid compile errors
 Key: ARROW-7036
 URL: https://issues.apache.org/jira/browse/ARROW-7036
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Kenta Murata
Assignee: Kenta Murata


I encountered compile errors due to {{-Wshadow-field}}, like the ones below:

{noformat}
[1/4] Building CXX object c++/src/CMakeFiles/orc.dir/Vector.cc.o
FAILED: c++/src/CMakeFiles/orc.dir/Vector.cc.o
/Applications/Xcode_11.1.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/c++
   -Ic++/include 
-I/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/include
 -I/Users/mrkn/src/github.com/apa
che/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src -Ic++/src -isystem 
c++/libs/thirdparty/zlib_ep-install/include -isystem 
c++/libs/thirdparty/lz4_ep-install/include -Qunused-arguments 
-fcolor-diagnostics -ggdb -O0 -g -fPIC  -Wno-z
ero-as-null-pointer-constant -Wno-inconsistent-missing-destructor-override 
-Wno-error=undef -std=c++11 -Weverything -Wno-c++98-compat 
-Wno-missing-prototypes -Wno-c++98-compat-pedantic -Wno-padded 
-Wno-covered-switch-default -Wno-missing-n
oreturn -Wno-unknown-pragmas -Wno-gnu-zero-variadic-macro-arguments 
-Wconversion -Werror -std=c++11 -Weverything -Wno-c++98-compat 
-Wno-missing-prototypes -Wno-c++98-compat-pedantic -Wno-padded 
-Wno-covered-switch-default -Wno-missing-nore
turn -Wno-unknown-pragmas -Wno-gnu-zero-variadic-macro-arguments -Wconversion 
-Werror -O0 -g -MD -MT c++/src/CMakeFiles/orc.dir/Vector.cc.o -MF 
c++/src/CMakeFiles/orc.dir/Vector.cc.o.d -o 
c++/src/CMakeFiles/orc.dir/Vector.cc.o -c /Users/mr
kn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src/Vector.cc
/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src/Vector.cc:59:45:
 error: parameter 'capacity' shadows member inherited from type 
'ColumnVectorBatch' [-Werror,-Wshadow-field]
  LongVectorBatch::LongVectorBatch(uint64_t capacity, MemoryPool& pool
^
/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/include/orc/Vector.hh:46:14:
 note: declared here
uint64_t capacity;
 ^
/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src/Vector.cc:87:49:
 error: parameter 'capacity' shadows member inherited from type 
'ColumnVectorBatch' [-Werror,-Wshadow-field]
  DoubleVectorBatch::DoubleVectorBatch(uint64_t capacity, MemoryPool& pool
^
/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/include/orc/Vector.hh:46:14:
 note: declared here
uint64_t capacity;
 ^
/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src/Vector.cc:115:49:
 error: parameter 'capacity' shadows member inherited from type 
'ColumnVectorBatch' [-Werror,-Wshadow-field]
  StringVectorBatch::StringVectorBatch(uint64_t capacity, MemoryPool& pool
^
/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/include/orc/Vector.hh:46:14:
 note: declared here
uint64_t capacity;
 ^
/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/src/Vector.cc:407:55:
 error: parameter 'capacity' shadows member inherited from type 
'ColumnVectorBatch' [-Werror,-Wshadow-field]
  TimestampVectorBatch::TimestampVectorBatch(uint64_t capacity,
  ^
/Users/mrkn/src/github.com/apache/arrow/cpp/build.debug/orc_ep-prefix/src/orc_ep/c++/include/orc/Vector.hh:46:14:
 note: declared here
uint64_t capacity;
 ^
4 errors generated.
{noformat}

Upgrading ORC to 1.5.7 will fix these errors.

I used Xcode 11.1 on macOS Mojave.
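A minimal standalone reproduction (my own simplification of ORC's Vector.hh/Vector.cc, for illustration) of what -Wshadow-field complains about:

```cpp
#include <cstdint>

struct ColumnVectorBatch {
  uint64_t capacity;  // the inherited member that gets shadowed
};

struct LongVectorBatch : ColumnVectorBatch {
  // Under -Werror,-Wshadow-field, naming this parameter `capacity` would
  // shadow the member inherited from ColumnVectorBatch; giving it a
  // distinct name avoids the diagnostic.
  explicit LongVectorBatch(uint64_t cap) { capacity = cap; }
};
```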





[jira] [Created] (ARROW-6814) [C++] Resolve compiler warnings occurred on release build

2019-10-07 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-6814:
---

 Summary: [C++] Resolve compiler warnings occurred on release build
 Key: ARROW-6814
 URL: https://issues.apache.org/jira/browse/ARROW-6814
 Project: Apache Arrow
  Issue Type: Task
  Components: C++, C++ - Gandiva
Reporter: Kenta Murata
Assignee: Kenta Murata


I encountered some compiler warnings on a release build when I used gcc
version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1).

[https://gist.github.com/mrkn/f7739edb301988a24e9d6066410b0625]


[jira] [Created] (ARROW-6508) [C++] Add Tensor and SparseTensor factory function with validations

2019-09-10 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-6508:
---

 Summary: [C++] Add Tensor and SparseTensor factory function with 
validations
 Key: ARROW-6508
 URL: https://issues.apache.org/jira/browse/ARROW-6508
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kenta Murata


Tensor and SparseTensor currently only have constructors, not factory
functions that validate their parameters.
We need such factory functions for creating Tensor and SparseTensor from
parameters given by an external source.


[jira] [Created] (ARROW-6505) [Website] Add new committers

2019-09-10 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-6505:
---

 Summary: [Website] Add new committers
 Key: ARROW-6505
 URL: https://issues.apache.org/jira/browse/ARROW-6505
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Website
Reporter: Kenta Murata
Assignee: Kenta Murata


I'd like to add the new committers to the committer list.


[jira] [Created] (ARROW-6503) [C++] Add an argument of memory pool object to SparseTensorConverter

2019-09-09 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-6503:
---

 Summary: [C++] Add an argument of memory pool object to 
SparseTensorConverter
 Key: ARROW-6503
 URL: https://issues.apache.org/jira/browse/ARROW-6503
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kenta Murata
Assignee: Kenta Murata


According to the comment
https://github.com/apache/arrow/pull/5290#discussion_r322244745, we need
variants of some functions that accept a memory pool object for the
SparseTensorConverter function.


[jira] [Created] (ARROW-6501) [Format][C++] Remove non_zero_length field from SparseIndex

2019-09-09 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-6501:
---

 Summary: [Format][C++] Remove non_zero_length field from 
SparseIndex
 Key: ARROW-6501
 URL: https://issues.apache.org/jira/browse/ARROW-6501
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Format
Reporter: Kenta Murata
Assignee: Kenta Murata


We can remove the non_zero_length field from SparseIndex because it can be
derived from the shape of the indices tensor.


[jira] [Created] (ARROW-6489) [Developer][Documentation] Fix merge script and readme

2019-09-08 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-6489:
---

 Summary: [Developer][Documentation] Fix merge script and readme
 Key: ARROW-6489
 URL: https://issues.apache.org/jira/browse/ARROW-6489
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Kenta Murata
Assignee: Kenta Murata


The following things should be fixed:

- merge_arrow_pr.py shouldn't be affected by git's merge.ff value.
- The README should describe the APACHE_JIRA_USERNAME and
APACHE_JIRA_PASSWORD environment variables.
- The README should state that users need to install the requests and jira
libraries before running merge_arrow_pr.py.


Re: [DISCUSS][FORMAT] Concerning about character encoding of binary string data

2019-09-06 Thread Kenta Murata
Thanks for responding.

I understand that ExtensionType is suitable for handling character encoding.
I'll try to draft and propose a specification and an implementation of such
an extension type.
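Micah's suggestion quoted below amounts to two key-value metadata entries on the field. As a rough sketch (illustrative only; the key names follow the Arrow extension-type metadata convention, and the JSON shape is taken from Micah's example):

```cpp
#include <map>
#include <string>

// Build the field-level metadata for a hypothetical NonUtf8String
// extension type; the character set is carried as a JSON-encoded string
// under the standard ARROW:extension:metadata key.
std::map<std::string, std::string> MakeNonUtf8StringMetadata(
    const std::string& charset) {
  return {
      {"ARROW:extension:name", "NonUtf8String"},
      {"ARROW:extension:metadata",
       std::string("{\"iso-charset\": \"") + charset + "\"}"},
  };
}
```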

Regards,
Kenta Murata

Thu, Sep 5, 2019 at 7:56 Wes McKinney :

>
> I opened https://issues.apache.org/jira/browse/ARROW-6455. It might
> make sense to define a common ExtensionType metadata in case multiple
> implementations decide they need this
>
> On Tue, Sep 3, 2019 at 10:35 PM Micah Kornfield  wrote:
> >
> > This might be bike-shedding, but I agree we should attempt to use extension
> > types for this use case.  I would expect something like:
> > ARROW:extension:name=NonUtf8String
> > ARROW:extension:metadata = "{\"iso-charset\":  "ISO-8859-10"}"
> >
> > The latter's value is a JSON-encoded string, which captures the
> > character set.
> >
> > Thanks,
> > Micah
> >
> >
> > On Tue, Sep 3, 2019 at 6:59 PM Sutou Kouhei  wrote:
> >
> > > Hi,
> > >
> > > > If people can constrain to use UTF-8 for all the string data,
> > > > StringArray is enough for them. But if they cannot unify the character
> > > > encoding of string data in UTF-8, should Apache Arrow provides the
> > > > standard way of the character encoding management?
> > >
> > > I think that Apache Arrow users should convert their string
> > > data to UTF-8 in their application. If Apache Arrow only
> > > supports UTF-8 string, Apache Arrow users can process string
> > > data without converting encoding between multiple systems. I
> > > think no conversion (zero-copy) use is Apache Arrow way.
> > >
> > > > My opinion is that Apache Arrow must have the standard way in both its
> > > > format and its API.  The reason is below:
> > > >
> > > > (1) Currently, when we use MySQL or PostgreSQL as the data source of
> > > > record batch streams, we will lose the information of character
> > > > encodings the original data have
> > >
> > > Both MySQL and PostgreSQL provide encoding conversion
> > > feature. So we can convert the original data to UTF-8.
> > >
> > > MySQL:
> > >
> > >   CONVERT function
> > >
> > > https://dev.mysql.com/doc/refman/8.0/en/cast-functions.html#function_convert
> > >
> > > PostgreSQL:
> > >
> > >   convert_to function
> > >
> > > https://www.postgresql.org/docs/11/functions-string.html#id-1.5.8.9.7.2.2.8.1.1
> > >
> > >
> > >
> > > If we need to support non UTF-8 encodings, I like
> > > NonUTF8String or something extension type and metadata
> > > approach. I prefer "ARROW:encoding" rather than
> > > "ARROW:charset" for metadata key too.
> > >
> > >
> > > Thanks,
> > > --
> > > kou
> > >
> > > In 
> > >   "[DISCUSS][FORMAT] Concerning about character encoding of binary string
> > > data" on Mon, 2 Sep 2019 17:39:22 +0900,
> > >   Kenta Murata  wrote:
> > >
> > > > [Abstract]
> > > > When we have a string data encoded in a character encoding other than
> > > > UTF-8, we must use a BinaryArray for the data.  But Apache Arrow
> > > > doesn’t provide the way to specify what a character encoding used in a
> > > > BinaryArray.  In this mail, I’d like to discuss how Apache Arrow
> > > > provides the way to manage a character encoding in a BinaryArray.
> > > >
> > > > I’d appreciate any comments or suggestions.
> > > >
> > > > [Long description]
> > > > Apache Arrow has the specialized type for UTF-8 encoded string but
> > > > doesn’t have types for other character encodings, such as ISO-8859-x
> > > > and Shift_JIS. We need to manage what a character encoding is used in
> > > > a binary string array, in the outside of the arrays such as metadata.
> > > >
> > > > In Datasets project, one of the goals is to support database
> > > > protocols.  Some databases support a lot of character encodings in
> > > > each manner.  For example, PostgreSQL supports to specify what a
> > > > character encoding is used for each database, and MySQL allows us to
> > > > specify character encodings separately for each level: database,
> > > > table, and column.
> > > >
> > > > I have a concern about how does Apache Arrow provide the way to
> > > > specify character encodings for values in arrays.

Re: [ANNOUNCE] New committers: Ben Kietzman, Kenta Murata, and Neal Richardson

2019-09-06 Thread Kenta Murata
Thank you very much everyone!
I'm very happy to join this community.

Fri, Sep 6, 2019 12:39 Micah Kornfield :

>
> Congrats everyone.
>
> On Thu, Sep 5, 2019 at 7:06 PM Ji Liu  wrote:
>
> > Congratulations!
> >
> > Thanks,
> > Ji Liu
> >
> >
> > --
> > From:Fan Liya 
> > Send Time: Fri, Sep 6, 2019 09:28
> > To:dev 
> > Subject:Re: [ANNOUNCE] New committers: Ben Kietzman, Kenta Murata, and
> > Neal Richardson
> >
> > Big congratulations to Ben, Kenta and Neal!
> >
> > Best,
> > Liya Fan
> >
> > On Fri, Sep 6, 2019 at 5:33 AM Wes McKinney  wrote:
> >
> > > hi all,
> > >
> > > on behalf of the Arrow PMC, I'm pleased to announce that Ben, Kenta,
> > > and Neal have accepted invitations to become Arrow committers. Welcome
> > > and thank you for all your contributions!
> > >
> >



--
Kenta Murata
OpenPGP FP = 1D69 ADDE 081C 9CC2 2E54  98C1 CEFE 8AFB 6081 B062

I wrote a book!!
『Ruby 逆引きレシピ』 http://www.amazon.co.jp/dp/4798119881/mrkn-22

E-mail: m...@mrkn.jp
twitter: http://twitter.com/mrkn/
blog: http://d.hatena.ne.jp/mrkn/


[DISCUSS][FORMAT] Concerning about character encoding of binary string data

2019-09-02 Thread Kenta Murata
[Abstract]
When we have string data encoded in a character encoding other than
UTF-8, we must use a BinaryArray for the data.  But Apache Arrow
doesn’t provide a way to specify which character encoding is used in a
BinaryArray.  In this mail, I’d like to discuss how Apache Arrow could
provide a way to manage a character encoding in a BinaryArray.

I’d appreciate any comments or suggestions.

[Long description]
Apache Arrow has a specialized type for UTF-8 encoded strings but
doesn’t have types for other character encodings, such as ISO-8859-x
and Shift_JIS. We currently have to track which character encoding is
used in a binary string array outside the array itself, e.g. in metadata.

In the Datasets project, one of the goals is to support database
protocols.  Some databases support many character encodings, each in
its own manner.  For example, PostgreSQL lets us specify which
character encoding is used for each database, and MySQL allows us to
specify character encodings separately at each level: database,
table, and column.

I have a concern about how Apache Arrow should provide a way to
specify character encodings for values in arrays.

If people can constrain themselves to UTF-8 for all their string data,
StringArray is enough for them. But if they cannot unify the character
encoding of their string data as UTF-8, should Apache Arrow provide a
standard way to manage character encodings?

An example use of Apache Arrow in such a case is the internal data of
an OR-mapper library, such as ActiveRecord in Ruby on Rails.

My opinion is that Apache Arrow must have a standard way in both its
format and its API.  The reasons are below:

(1) Currently, when we use MySQL or PostgreSQL as the data source of
record batch streams, we lose the information about which character
encodings the original data used

(2) Without a standard way of managing character encodings, we would
have to handle encodings separately for each combination of systems,
which does not fit Apache Arrow’s philosophy

(3) We cannot support character-encoding handling at the
language-binding level if Apache Arrow doesn’t provide standard APIs
for character-encoding management

There are two options for managing a character encoding in a BinaryArray.
The first is introducing an optional character_encoding field in
BinaryType.  The second is using the custom_metadata field to supply
the character encoding name.
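As a rough sketch of the custom_metadata option (plain Python; the "charset" key and the dict-based metadata are illustrative, not actual Arrow APIs), a consumer could resolve the encoding like this:

```python
# Hypothetical consumer: resolve a binary column's character encoding
# from field-level custom metadata, falling back to UTF-8 when no key is
# present.  The "charset" key is the candidate discussed here, not an
# Arrow standard.
def decode_column(raw_values, custom_metadata):
    encoding = custom_metadata.get("charset", "UTF-8")
    return [v.decode(encoding) for v in raw_values]

# Bytes encoded in Windows-31J (Python codec name: cp932).
raw = ["日本語".encode("cp932")]
assert decode_column(raw, {"charset": "cp932"}) == ["日本語"]
assert decode_column(["abc".encode("utf-8")], {}) == ["abc"]
```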

If we use custom_metadata, we should decide on the key for this
information.  I guess “charset” is a good candidate for the key because
it is widely used for specifying which character encoding is in use.
The value must be the name of a character encoding, such as “UTF-8”
or “Windows-31J”.  It would be better if we could decide on canonical
encoding names, but I guess that is hard because many systems use the
same name for different encodings.  For example, “Shift_JIS” means
either IANA’s Shift_JIS or Windows-31J; they use the same coding rules
but the corresponding character sets are slightly different.  See the
spreadsheet [1] for the correspondence of character encoding names
between MySQL, PostgreSQL, Ruby, Python, IANA [3], and the Encoding
standard of WHATWG [4].
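The Shift_JIS/Windows-31J ambiguity can be observed directly with Python's codecs, where 'shift_jis' follows the strict JIS X 0208 definition and 'cp932' corresponds to Windows-31J:

```python
# Windows-31J (cp932) defines NEC special characters such as '①' at byte
# pair 0x87 0x40, which strict Shift_JIS does not, even though the two
# encodings share the same basic coding rules.
data = b"\x87\x40"

assert data.decode("cp932") == "①"  # Windows-31J decodes it
try:
    data.decode("shift_jis")         # strict Shift_JIS rejects it
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True
assert strict_rejects
```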

If we introduce a new optional field in BinaryType for the character
encoding, I recommend making this field a string that holds the name of
a character encoding.  It would also be possible to make the field an
integer holding an enum value, but I don’t know of a good standard for
enum values of character encodings.  IANA manages MIBenum [2], though
I think the registered character encodings [3] are not enough for our
requirements.

I prefer the second way because the first can supply the
character-encoding information only to a Field, not to an individual
BinaryArray.

[1] 
https://docs.google.com/spreadsheets/d/1D0xlI5r2wJUV45aTY1q2TwqD__v7acmd8FOfr8xSOVQ/edit?usp=sharing
[2] https://tools.ietf.org/html/rfc3808
[3] https://www.iana.org/assignments/character-sets/character-sets.xhtml
[4] https://encoding.spec.whatwg.org/

-- 
Kenta Murata


Re: [DISCUSS][Format][C++] Improvement of sparse tensor format and implementation

2019-09-02 Thread Kenta Murata
Wed, Aug 28, 2019 8:57 Rok Mihevc :
>
> On Wed, Aug 28, 2019 at 1:18 AM Wes McKinney  wrote:
>
> > null/NA. But, as far as I'm aware, this component of pandas is
> > relatively unique and was never intended as an alternatives to sparse
> > matrix libraries.
> >
>
> Another example is
> https://sparse.pydata.org/en/latest/generated/sparse.SparseArray.html?highlight=fill%20value#sparse.SparseArray.fill_value,
> but it might have been influenced by Pandas.

pydata/sparse's COO tensor also has a fill_value property, and its
to_scipy_sparse method raises a ValueError when the tensor has a
non-zero fill value.

So I think we should support fill values someday.

> I'm ok with dropping this for now.

Yes, we can advance without it and support it later.
And I think supporting a fill value is not difficult.
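A fill value generalizes the implicit zero of sparse formats. A minimal plain-Python sketch (mirroring pydata/sparse's fill_value idea, not an Arrow API):

```python
def to_dense(shape, indices, values, fill_value=0.0):
    """Expand a COO-style sparse matrix; unspecified positions take
    fill_value (0.0 reproduces the usual sparse behavior)."""
    dense = [[fill_value] * shape[1] for _ in range(shape[0])]
    for (i, j), v in zip(indices, values):
        dense[i][j] = v
    return dense

# A matrix whose "background" value is 1.0 rather than 0.0.
m = to_dense((2, 2), [(0, 1)], [5.0], fill_value=1.0)
assert m == [[1.0, 5.0], [1.0, 1.0]]
```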

-- 
Kenta Murata


Re: [DISCUSS][Format][C++] Improvement of sparse tensor format and implementation

2019-09-02 Thread Kenta Murata
Wed, Aug 28, 2019 6:05 Wes McKinney :
> I'm also OK with these changes. Since we have not established a
> versioning or compatibility policy with regards to "Other" data
> structures like Tensor and SparseTensor, I don't know that a vote is
> needed, just a pull request.

I didn't realize that Tensor and SparseTensor aren't restricted by a
versioning and compatibility policy.

OK, I'll send some pull-requests.

-- 
Kenta Murata


[jira] [Created] (ARROW-6393) [C++] Add EqualOptions support in SparseTensor::Equals

2019-08-29 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-6393:
---

 Summary: [C++] Add EqualOptions support in SparseTensor::Equals
 Key: ARROW-6393
 URL: https://issues.apache.org/jira/browse/ARROW-6393
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kenta Murata
Assignee: Kenta Murata


SparseTensor::Equals should take an EqualOptions argument as Tensor::Equals does.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[DISCUSSION] Automatically adding the URL of the corresponding JIRA ticket as a comment in GitHub pull-request

2019-08-23 Thread Kenta Murata
I frequently go through the following slightly bothersome steps to
open the corresponding JIRA ticket when I look at a GitHub pull-request:

1. Select the "ARROW-" text in the title and copy it
2. Open JIRA if I haven't opened it yet
3. Select a ticket to open it
4. Alter the URL by pasting the text copied at step 1
5. Hit the enter key

I think it would be better if these steps were easier.

We already have a mechanism to inject a GitHub pull-request URL into
the corresponding JIRA ticket. How about making a similar mechanism
for the reverse link?  I guess it is possible, using GitHub Actions, to
automatically add a comment with the JIRA ticket URL to the
pull-request when the "ARROW-" text appears in the title field.
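The core of such an automation is only a few lines: extract the ticket ID from the PR title and build the JIRA URL (a sketch; the GitHub Actions wiring is separate):

```python
import re

def jira_url_from_pr_title(title):
    """Return the JIRA ticket URL for an "ARROW-NNNN: ..." pull-request
    title, or None when the title carries no ticket ID."""
    m = re.match(r"(ARROW-\d+)", title)
    if m is None:
        return None
    return "https://issues.apache.org/jira/browse/" + m.group(1)

assert (jira_url_from_pr_title("ARROW-6489: [Developer] Fix merge script")
        == "https://issues.apache.org/jira/browse/ARROW-6489")
assert jira_url_from_pr_title("Unrelated title") is None
```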

I asked Kou about this idea, and he said ursabot may be the appropriate
place to implement such a feature.  He also encouraged me to ask
Krisztian about it.

Krisztian, what do you think about this automation?

Regards,
Kenta Murata


[jira] [Created] (ARROW-6319) [C++] Extract the core of NumericTensor::Value as Tensor::Value

2019-08-21 Thread Kenta Murata (Jira)
Kenta Murata created ARROW-6319:
---

 Summary: [C++] Extract the core of NumericTensor::Value as 
Tensor::Value
 Key: ARROW-6319
 URL: https://issues.apache.org/jira/browse/ARROW-6319
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kenta Murata
Assignee: Kenta Murata


I'd like to enable element-wise access in Tensor class.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[DISCUSS][Format][C++] Improvement of sparse tensor format and implementation

2019-08-19 Thread Kenta Murata
Hi,

I’d like to propose the following improvements to the sparse tensor
format and implementation.

(1) To make variable bit-width indices available.

The main purpose of the first part of the proposal is making 32-bit
indices available.  This allows us to serialize scipy.sparse.csr_matrix
objects etc. with 32-bit indices without converting the index arrays
to 64-bit values.  As Jed said in the previous discussion [1] on this
ML, 32-bit indices have the advantage of a smaller memory footprint,
so I strongly believe this change is necessary for sparse tensor
support in Apache Arrow.  Doing this requires adding both a type field
to each sparse index format and a stride field to the SparseCOOIndex
format.
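The footprint argument can be made concrete. In CSR, each stored non-zero carries one value and one column index (the per-row indptr entry is amortized away), so with 4-byte float values:

```python
def csr_bytes_per_nonzero(value_size, index_size):
    # One value plus one column index per stored non-zero; the per-row
    # indptr contribution is negligible for matrices with many
    # non-zeros per row.
    return value_size + index_size

assert csr_bytes_per_nonzero(4, 8) == 12  # float32 values, int64 indices
assert csr_bytes_per_nonzero(4, 4) == 8   # float32 values, int32 indices
```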

(2) Adding the new COO format with separated row and column indices

scipy.sparse.coo_matrix manages the row and column indices in
separate numpy arrays.  That is enough for representing a sparse
matrix.  On the other hand, to support sparse tensors of arbitrary
rank, Arrow's SparseCOOIndex manages COO indices as one matrix.
Hence we need to make a copy of the indices to convert a
scipy.sparse.coo_matrix to Arrow’s SparseTensor.  Introducing a new
COO format with separated row and column indices can resolve this
issue.

(3) Adding SparseCSCIndex

The CSC format of sparse matrices has the advantage of faster scanning
in the columnar direction, while the CSR format is faster for row-wise
scans.  Because the strengths of CSC differ from those of CSR, I want
to support CSC before releasing Arrow 1.0.
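The reason column scans favor CSC: a column of a CSC matrix is one contiguous slice, delimited by indptr. A small sketch with plain-list structures (illustrative, not Arrow's representation):

```python
# Matrix: [[1, 0],
#          [0, 2],
#          [3, 0]]
csc = {
    "indptr":  [0, 2, 3],    # column j spans indptr[j]..indptr[j+1]
    "indices": [0, 2, 1],    # row index of each stored value
    "data":    [1.0, 3.0, 2.0],
}

def csc_column(csc, j):
    start, end = csc["indptr"][j], csc["indptr"][j + 1]
    return list(zip(csc["indices"][start:end], csc["data"][start:end]))

# Each column is read as one contiguous slice; CSR would need a search
# inside every row to do the same.
assert csc_column(csc, 0) == [(0, 1.0), (2, 3.0)]
assert csc_column(csc, 1) == [(1, 2.0)]
```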

There is a work-in-progress branch [2] for (1) above.  I’d appreciate
any comments or suggestions.

[1] 
http://mail-archives.apache.org/mod_mbox/arrow-dev/201903.mbox/%3c87pnqz70rg@jedbrown.org%3e

[2] https://github.com/mrkn/arrow/tree/sparse_tensor_index_value_type

Regards,
Kenta Murata


[jira] [Created] (ARROW-5830) [C++] Stop using memcmp in TensorEquals

2019-07-02 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-5830:
---

 Summary: [C++] Stop using memcmp in TensorEquals
 Key: ARROW-5830
 URL: https://issues.apache.org/jira/browse/ARROW-5830
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kenta Murata


Because memcmp is problematic for comparing floating-point values, such as NaNs.
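The problem is easy to demonstrate: bitwise (memcmp-style) comparison and floating-point comparison disagree on NaNs and on signed zeros. A Python sketch using struct to view the raw bytes:

```python
import struct

def float_bytes(x):
    # Raw little-endian IEEE-754 double bytes, as memcmp would see them.
    return struct.pack("<d", x)

nan = float("nan")
# memcmp would call two identical NaNs equal even though nan != nan ...
assert float_bytes(nan) == float_bytes(nan)
assert not (nan == nan)
# ... and would call -0.0 and 0.0 different even though they compare equal.
assert float_bytes(-0.0) != float_bytes(0.0)
assert -0.0 == 0.0
```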



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Release Apache Arrow 0.14.0 - RC0

2019-07-02 Thread Kenta Murata
I tried on Ubuntu Bionic and got build errors in grpc_ep (version
1.20.0).  The error log is shown at the end of this mail.
The errors are (1) the absence of the php_generator.h header file, and
(2) the absence of the has_ruby_package function in the
google::protobuf::FileOptions class.
PHP support was introduced in protobuf 3.3.0, and the has_ruby_package
function was added in 3.6.0, so the minimum protobuf version for
building grpc 1.20.0 should be 3.6.0.

The error log I got is below:

  
/tmp/arrow-0.14.0.dJDu3/apache-arrow-0.14.0/cpp/build/grpc_ep-prefix/src/grpc_ep/src/compiler/php_generator.cc:21:10:
fatal error: google/protobuf/compiler/php/php_generator.h: No such
file or directory
   #include 
^~
  compilation terminated.
  make[5]: *** 
[CMakeFiles/grpc_plugin_support.dir/src/compiler/php_generator.cc.o]
Error 1
  make[5]: *** Waiting for unfinished jobs
  
/tmp/arrow-0.14.0.dJDu3/apache-arrow-0.14.0/cpp/build/grpc_ep-prefix/src/grpc_ep/src/compiler/ruby_generator.cc:
In function ‘grpc::string grpc_ruby_generator::GetServices(const
FileDescriptor*)’:
  
/tmp/arrow-0.14.0.dJDu3/apache-arrow-0.14.0/cpp/build/grpc_ep-prefix/src/grpc_ep/src/compiler/ruby_generator.cc:165:25:
error: ‘const class google::protobuf::FileOptions’ has no member named
‘has_ruby_package’; did you mean ‘has_java_package’?
   if (file->options().has_ruby_package()) {
   ^~~~
   has_java_package
  
/tmp/arrow-0.14.0.dJDu3/apache-arrow-0.14.0/cpp/build/grpc_ep-prefix/src/grpc_ep/src/compiler/ruby_generator.cc:166:38:
error: ‘const class google::protobuf::FileOptions’ has no member named
‘ruby_package’; did you mean ‘java_package’?
 package_name = file->options().ruby_package();
^~~~
java_package
  make[5]: *** 
[CMakeFiles/grpc_plugin_support.dir/src/compiler/ruby_generator.cc.o]
Error 1
  make[4]: *** [CMakeFiles/grpc_plugin_support.dir/all] Error 2
  make[4]: *** Waiting for unfinished jobs
  make[3]: *** [all] Error 2


Regards,
Kenta Murata

Wed, Jul 3, 2019 3:00 Yosuke Shiro :
>
> Ran dev/release/verify-release-candidate.sh source 0.14.0 0 on macOS Mojave.
> I got the following error, but it may be specific to my environment.
>
> """
> [ERROR] Failed to execute goal 
> pl.project13.maven:git-commit-id-plugin:2.2.2:revision (for-jars) on project 
> arrow-java-root: Could not complete Mojo execution...: Error: Could not get 
> HEAD Ref, are you sure you have set the dotGitDirectory property of this 
> plugin to a valid path? -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please 
> read the following articles:
> [ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
> + cleanup
> + '[' no = yes ']'
> + echo 'Failed to verify release candidate. See 
> /var/folders/8t/lw8gghw13hscdt7rr9j8kqnmgn/T/arrow-0.14.0.X.1bxYZ8hL 
> for details.'
> Failed to verify release candidate. See 
> /var/folders/8t/lw8gghw13hscdt7rr9j8kqnmgn/T/arrow-0.14.0.X.1bxYZ8hL 
> for details.
> “""
>
> """
> $ java --version
> openjdk 11.0.2 2019-01-15
> OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
> OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)
> “""
>
> """
> $ mvn --version
> Apache Maven 3.6.0 (97c98ec64a1fdfee7767ce5ffb20918da4f719f3; 
> 2018-10-25T03:41:47+09:00)
> Maven home: /usr/local/Cellar/maven/3.6.0/libexec
> Java version: 11.0.2, vendor: Oracle Corporation, runtime: 
> /Library/Java/JavaVirtualMachines/jdk-11.0.2.jdk/Contents/Home
> Default locale: en_JP, platform encoding: UTF-8
> OS name: "mac os x", version: "10.14.4", arch: "x86_64", family: "mac"
> “""
>
>  C# verification ran fine to me.
>
>
> > On Jul 2, 2019, at 18:57, Antoine Pitrou  wrote:
> >
> >
> > +1 for this RC0 anyway (binding).
> >
> >
> > Le 02/07/2019 à 11:36, Antoine Pitrou a écrit :
> >>
> >> I tried again (Ubuntu 18.04):
> >>
> >> * binaries verification succeeded
> >>
> >> * source verification failed in gRPC configure step:
> >>
> >> CMake Error at cmake/cares.cmake:38 (find_package):
> >>  Could not find a package configuration file provided by "c-ares" with any
> >>  of the following names:
> >>
> >>c-aresConfig.cmake

[jira] [Created] (ARROW-5813) [C++] Support checking the equality of the different contiguous tensors

2019-06-30 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-5813:
---

 Summary: [C++] Support checking the equality of the different 
contiguous tensors
 Key: ARROW-5813
 URL: https://issues.apache.org/jira/browse/ARROW-5813
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kenta Murata
Assignee: Kenta Murata


The current TensorEquals function cannot check the equality of tensors that 
are contiguous in different orders (e.g. row-major vs. column-major).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5754) [C++] Missing override for ~GrpcStreamWriter?

2019-06-26 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-5754:
---

 Summary: [C++] Missing override for ~GrpcStreamWriter?
 Key: ARROW-5754
 URL: https://issues.apache.org/jira/browse/ARROW-5754
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kenta Murata


I encountered the following compile error:

{{../src/arrow/flight/client.cc:244:3: error: '~GrpcStreamWriter' overrides a 
destructor but is not marked 'override' 
[-Werror,-Winconsistent-missing-destructor-override]
  ~GrpcStreamWriter() = default;
  ^
../src/arrow/flight/client.h:86:27: note: overridden virtual function is here
class ARROW_FLIGHT_EXPORT FlightStreamWriter : public ipc::RecordBatchWriter {
  ^}}

Adding the override modifier resolves this problem.
I'll make a pull-request for the change.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5736) [Format] Support small bit-width indices of sparse tensor

2019-06-25 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-5736:
---

 Summary: [Format] Support small bit-width indices of sparse tensor
 Key: ARROW-5736
 URL: https://issues.apache.org/jira/browse/ARROW-5736
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Format
Reporter: Kenta Murata
Assignee: Kenta Murata


Adding 32-bit sparse index support is necessary for zero-copy data sharing 
with existing systems such as SciPy.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5704) [C++] Stop using ARROW_TEMPLATE_EXPORT for SparseTensorImpl class

2019-06-24 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-5704:
---

 Summary: [C++] Stop using ARROW_TEMPLATE_EXPORT for 
SparseTensorImpl class
 Key: ARROW-5704
 URL: https://issues.apache.org/jira/browse/ARROW-5704
 Project: Apache Arrow
  Issue Type: Task
  Components: C++
Reporter: Kenta Murata
Assignee: Kenta Murata


I'd like to stop using ARROW_TEMPLATE_EXPORT for the SparseTensorImpl class so 
that it can be wrapped in the Arrow GLib library on the MinGW platform.

 

This relates to ARROW-4399.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5486) [GLib] Add binding of gandiva::FunctionRegistry and related things

2019-06-03 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-5486:
---

 Summary: [GLib] Add binding of gandiva::FunctionRegistry and 
related things
 Key: ARROW-5486
 URL: https://issues.apache.org/jira/browse/ARROW-5486
 Project: Apache Arrow
  Issue Type: New Feature
  Components: GLib
Reporter: Kenta Murata
Assignee: Kenta Murata


I'd like to add a support of gandiva::FunctionRegistry and the related things 
in gandiva-glib.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5320) [C++] Undefined symbol errors occur when linking parquet executables

2019-05-14 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-5320:
---

 Summary: [C++] Undefined symbol errors occur when linking 
parquet executables
 Key: ARROW-5320
 URL: https://issues.apache.org/jira/browse/ARROW-5320
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
 Environment: Xcode 10.2 on macOS Mojave 10.14.4
Reporter: Kenta Murata


Undefined symbol errors occurred when linking debug/parquet-reader, 
debug/parquet-file-deserialize-test, and debug/parquet-scan.  The unresolved 
symbols are Boost.Regex symbols referenced from libparquet.a.

I tried to build the commit 608e846a9f825a30a0faa651bc0a3eebba20e7db with Xcode 
10.2 on macOS Mojave.

I specified -DARROW_BOOST_VENDORED=ON to avoid the problem related to the 
latest boost in Homebrew (See [https://github.com/boostorg/process/issues/55]).

The complete build log is available here:
[https://gist.github.com/mrkn/e5489140c9a782ca13a1b4bb8dd33111]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5155) [GLib][Ruby] MakeDense and MakeSparse in UnionArray should accept a vector of Field

2019-04-09 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-5155:
---

 Summary: [GLib][Ruby] MakeDense and MakeSparse in UnionArray 
should accept a vector of Field
 Key: ARROW-5155
 URL: https://issues.apache.org/jira/browse/ARROW-5155
 Project: Apache Arrow
  Issue Type: New Feature
  Components: GLib, Ruby
Reporter: Kenta Murata
Assignee: Kenta Murata


This is a derivative issue of https://issues.apache.org/jira/browse/ARROW-4622



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5150) [Ruby] Add Arrow::Table#raw_records

2019-04-09 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-5150:
---

 Summary: [Ruby] Add Arrow::Table#raw_records
 Key: ARROW-5150
 URL: https://issues.apache.org/jira/browse/ARROW-5150
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Ruby
Reporter: Kenta Murata
Assignee: Kenta Murata






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5050) [C++] cares_ep should build before grpc_ep

2019-03-27 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-5050:
---

 Summary: [C++] cares_ep should build before grpc_ep
 Key: ARROW-5050
 URL: https://issues.apache.org/jira/browse/ARROW-5050
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Kenta Murata
Assignee: Kenta Murata


I found that grpc_ep can fail to find cares_ep because grpc_ep may be built 
before cares_ep.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5032) [C++] Headers in vendored/datetime directory aren't installed

2019-03-27 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-5032:
---

 Summary: [C++] Headers in vendored/datetime directory aren't 
installed
 Key: ARROW-5032
 URL: https://issues.apache.org/jira/browse/ARROW-5032
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Kenta Murata


I found that the header files in the vendored/datetime directory are not 
installed even though vendored/datetime.h is installed.

vendored/datetime.h depends on the files in the vendored/datetime directory, 
so they should be installed as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4942) [Ruby] Remove needless omits

2019-03-18 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4942:
---

 Summary: [Ruby] Remove needless omits
 Key: ARROW-4942
 URL: https://issues.apache.org/jira/browse/ARROW-4942
 Project: Apache Arrow
  Issue Type: Test
  Components: Ruby
Reporter: Kenta Murata
Assignee: Kenta Murata






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4938) [GLib] Undefined symbols error occurs while the GIR file is being generated.

2019-03-17 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4938:
---

 Summary: [GLib] Undefined symbols error occurs while the GIR file is 
being generated.
 Key: ARROW-4938
 URL: https://issues.apache.org/jira/browse/ARROW-4938
 Project: Apache Arrow
  Issue Type: Bug
  Components: GLib
Reporter: Kenta Murata


When there are old arrow-glib.*dylib files in the installation directory and 
these libraries don't have enough symbols, an "undefined symbols" error 
occurs while the GIR file is being generated.

When I encountered this error, removing the old libraries resolved the problem.

I extracted the build log related to this problem in this gist:
https://gist.github.com/mrkn/6c14d5cae2bebca4609ed9c3ef8e5bbf



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4932) [GLib] Use G_DECLARE_DERIVABLE_TYPE macro

2019-03-17 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4932:
---

 Summary: [GLib] Use G_DECLARE_DERIVABLE_TYPE macro
 Key: ARROW-4932
 URL: https://issues.apache.org/jira/browse/ARROW-4932
 Project: Apache Arrow
  Issue Type: Task
  Components: GLib
Reporter: Kenta Murata
Assignee: Kenta Murata






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4906) [Format] Fix document to describe that SparseMatrixIndexCSR assumes indptr is sorted for each row

2019-03-15 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4906:
---

 Summary: [Format] Fix document to describe that 
SparseMatrixIndexCSR assumes indptr is sorted for each row
 Key: ARROW-4906
 URL: https://issues.apache.org/jira/browse/ARROW-4906
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Kenta Murata
Assignee: Kenta Murata






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Sparse matrix formats

2019-03-13 Thread Kenta Murata
Hi Jed,

I'd like to describe the current status of the SparseTensor implementation.
I hope the following explanation will help you.

First of all, I designed the current SparseTensor format as a first,
interim implementation, using scipy.sparse as a reference.

The reason why I started with the two formats, COO and CSR, is that I
thought they were commonly used and appropriate as the first
implementation of SparseTensor.

For 1:
The current SparseTensor uses int64_t for its index value type.
If the "long" you mentioned is the one in SparseTensor.fbs, that
"long" is Flatbuffers' 64-bit signed integer type.

I hope we can improve the current SparseTensor so that users can
specify whether to use int32_t or int64_t as the index data type.
Then we can use int64_t to handle larger matrices, and 32-bit indices
to improve performance.  Moreover, we could share not-too-large
canonicalized scipy sparse matrices without conversion.

For 2:
There are two reasons why the current COO format is limited to sorted indices:

a) I wanted to start with a simple implementation
b) I thought that Arrow would often be used to exchange sparse
matrices that have been canonicalized for computation

I think it would be better for Arrow to support unsorted indices, and
even better to be able to handle sparse matrices with duplicated
entries.  Since scipy.sparse can handle such data, I think supporting
non-canonicalized formats is necessary for adding scipy integration.

I forgot to note that CSR indices are supposed to be sorted.  I'll
open a PR to fix that.

If SparseTensor supports unsorted indices or duplicated values, it
would be nice to add the ability to canonicalize them as well.
I am busy preparing for RubyKaigi in mid-April now, but after
RubyKaigi I will return to improving the sparse matrix implementation.
Until then, I would like to help as much as I can if someone else is
working on it.
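Canonicalization itself is straightforward to sketch in plain Python: sort the COO entries lexicographically by index and sum duplicates (the form scipy.sparse's sum_duplicates() produces):

```python
def canonicalize_coo(indices, values):
    """Sort COO entries lexicographically by index and sum duplicated
    entries, yielding a canonical sorted, duplicate-free form."""
    merged = {}
    for idx, v in zip(indices, values):
        key = tuple(idx)
        merged[key] = merged.get(key, 0.0) + v
    keys = sorted(merged)
    return [list(k) for k in keys], [merged[k] for k in keys]

idx, vals = canonicalize_coo(
    [[1, 0], [0, 1], [1, 0]],  # unsorted, with one duplicated coordinate
    [2.0, 5.0, 3.0],
)
assert idx == [[0, 1], [1, 0]]
assert vals == [5.0, 5.0]
```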

For 3:
Since I have only used sparse matrices to create and process
bag-of-words matrices in natural language processing, I have no
experience with partitioning sparse matrices for distributed computing.
I think it would be great for Arrow to handle such a use case, so I
would like to cooperate if there is something I can do toward its
realization.

Thanks,
Kenta Murata

Tue, Mar 12, 2019 2:17 Jed Brown :

>
> Thanks.  I'm new to the Arrow community so was hoping to get feedback if
> any of these are controversial or subject to constraints that I'm likely
> not familiar with.  Point 2 is likely simplest and I can start with
> that.
>
> Point 3 isn't coherent as a PR concept, but is a potential audience
> whose relation to Arrow I don't understand (could be an explicit
> non-goal for all I know).
>
> Wes McKinney  writes:
>
> > hi Jed,
> >
> > Would you like to submit a pull request to propose the changes or
> > additions you are escribing?
> >
> > Thanks
> > Wes
> >
> > On Sat, Mar 9, 2019 at 11:32 PM Jed Brown  wrote:
> >>
> >> Wes asked me to bring this discussion here.  I'm a developer of PETSc
> >> and, with Arrow is getting into the sparse representation space, would
> >> like for it to interoperate as well as possible.
> >>
> >> 1. Please guarantee support for 64-bit offsets and indices.  The current
> >> spec uses "long", which is 32-bit on some common platforms (notably
> >> Windows).  Specifying int64_t would bring that size support to LLP64
> >> architectures.
> >>
> >> Meanwhile, there is a significant performance benefit to using 32-bit
> >> indices when sizes allow it.  If using 4-byte floats, a CSR format costs
> >> 12 bytes per nonzero with int64, versus 8 bytes per nonzero with int32.
> >> Sparse matrix operations are typically dominated by bandwidth, so this
> >> is a substantial performance impact.
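The storage arithmetic above can be checked with a stdlib-only Python sketch (an illustration, not Arrow code; `csr_bytes_per_nonzero` is a hypothetical helper, and the struct format codes 'f', 'i', 'q' stand in for float32, int32, int64):

```python
import struct

def csr_bytes_per_nonzero(value_fmt, index_fmt):
    # One stored value plus one column index per nonzero; the
    # (n_rows + 1)-entry row-pointer array is ignored as amortized.
    return struct.calcsize(value_fmt) + struct.calcsize(index_fmt)

# float32 values: int64 column indices cost 12 bytes per nonzero,
# while int32 indices cost only 8 -- a 1.5x bandwidth difference.
assert csr_bytes_per_nonzero('f', 'q') == 12
assert csr_bytes_per_nonzero('f', 'i') == 8
```

Since sparse kernels are bandwidth-bound, that 1.5x in bytes per nonzero translates almost directly into runtime.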
> >>
> >> 2. This relates to sorting of indices for CSR.  Many algorithms need to
> >> operate on upper vs lower-triangular parts of matrices, which is much
> >> easier if the CSR column indices are sorted within rows.  Typically, one
> >> finds the diagonal (storing its location in each row, if it exists).
> >> Given that the current spec says COO entries are sorted, it would be
> >> simple to specify this also for CSR.
> >>
> >>   https://github.com/apache/arrow/blob/master/format/SparseTensor.fbs
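To illustrate why sorted column indices matter, here is a minimal Python sketch (a hypothetical helper, not part of Arrow or PETSc) that locates the diagonal entry of each CSR row with a binary search — something that only works when indices are sorted within each row:

```python
from bisect import bisect_left

def diagonal_positions(indptr, indices):
    # For each row i of a CSR matrix whose column indices are sorted
    # within rows, return the position of the diagonal entry in
    # `indices`, or -1 if the diagonal is structurally zero.
    diag = []
    for i in range(len(indptr) - 1):
        lo, hi = indptr[i], indptr[i + 1]
        j = lo + bisect_left(indices[lo:hi], i)
        diag.append(j if j < hi and indices[j] == i else -1)
    return diag

# 3x3 matrix, rows: {(0,2)}, {(1)}, {(0)}; row 2 has no diagonal entry
assert diagonal_positions([0, 2, 3, 4], [0, 2, 1, 0]) == [0, 2, -1]
```

Storing these positions once per row is the usual trick for cheap upper/lower-triangular traversals.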
> >>
> >>
> >> 3. This is more nebulous and relates to partitioned representations of
> >> matrices, such as arise when using distributed memory or when optimizing
> >> for locality when using threads.  Some packages store "global" indices
> >> in the CSR representation (in which case yo

[jira] [Created] (ARROW-4775) [Website] Site navbar cannot be expanded

2019-03-05 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4775:
---

 Summary: [Website] Site navbar cannot be expanded
 Key: ARROW-4775
 URL: https://issues.apache.org/jira/browse/ARROW-4775
 Project: Apache Arrow
  Issue Type: Bug
  Components: Website
Reporter: Kenta Murata
Assignee: Kenta Murata


I found that the navbar at the top of the page cannot be expanded when the page 
is narrow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-4671) [C++] MakeBuilder must handle Type::DICTIONARY

2019-02-24 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4671:
---

 Summary: [C++] MakeBuilder must handle Type::DICTIONARY
 Key: ARROW-4671
 URL: https://issues.apache.org/jira/browse/ARROW-4671
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Kenta Murata


Currently, we cannot create a builder for DictionaryArray using MakeBuilder.

When we pass a DictionaryType to MakeBuilder, it fails with a message like:

MakeBuilder: cannot construct builder for type dictionary





[jira] [Created] (ARROW-4662) [Python] Add type_codes property in UnionType

2019-02-22 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4662:
---

 Summary: [Python] Add type_codes property in UnionType
 Key: ARROW-4662
 URL: https://issues.apache.org/jira/browse/ARROW-4662
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Python
Reporter: Kenta Murata
Assignee: Kenta Murata








[jira] [Created] (ARROW-4632) [Ruby] Add BigDecimal#to_arrow

2019-02-19 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4632:
---

 Summary: [Ruby] Add BigDecimal#to_arrow
 Key: ARROW-4632
 URL: https://issues.apache.org/jira/browse/ARROW-4632
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Ruby
Reporter: Kenta Murata
Assignee: Kenta Murata


It may be better for BigDecimal to have a to_arrow instance method that
converts itself to Arrow::Decimal128.





[jira] [Created] (ARROW-4622) MakeDense and MakeSparse in UnionArray should accept a vector of Field

2019-02-19 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4622:
---

 Summary: MakeDense and MakeSparse in UnionArray should accept a 
vector of Field
 Key: ARROW-4622
 URL: https://issues.apache.org/jira/browse/ARROW-4622
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, GLib, Python
Reporter: Kenta Murata
Assignee: Kenta Murata


Currently, MakeDense and MakeSparse of UnionArray cannot create a
UnionArray with user-specified field names.  This is a bug in these
functions.

To fix them, optional std::vector arguments should be added.

The GLib and Python bindings should be fixed together with this.





[jira] [Created] (ARROW-4600) [Ruby] Arrow::DictionaryArray#[] should return the item in the indices array

2019-02-17 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4600:
---

 Summary: [Ruby] Arrow::DictionaryArray#[] should return the item
in the indices array
 Key: ARROW-4600
 URL: https://issues.apache.org/jira/browse/ARROW-4600
 Project: Apache Arrow
  Issue Type: Bug
  Components: Ruby
Reporter: Kenta Murata


Arrow::DictionaryArray#[] should return the item in the indices array.
However, it currently raises an error like the one below:

Traceback (most recent call last):
        5: from test.rb:4:in `'
        4: from test.rb:4:in `new'
        3: from /Users/mrkn/src/github.com/apache/arrow/ruby/red-arrow/lib/arrow/dictionary-data-type.rb:103:in `initialize'
        2: from /Users/mrkn/.rbenv/versions/2.6.0/lib/ruby/gems/2.6.0/gems/gobject-introspection-3.3.1/lib/gobject-introspection/loader.rb:328:in `block in load_constructor_infos'
        1: from /Users/mrkn/.rbenv/versions/2.6.0/lib/ruby/gems/2.6.0/gems/gobject-introspection-3.3.1/lib/gobject-introspection/loader.rb:317:in `block (2 levels) in load_constructor_infos'
/Users/mrkn/.rbenv/versions/2.6.0/lib/ruby/gems/2.6.0/gems/gobject-introspection-3.3.1/lib/gobject-introspection/loader.rb:317:in `invoke': invalid argument Array (expect #) (ArgumentError)





[jira] [Created] (ARROW-4537) [CI] Suppress shell warning on travis-ci

2019-02-12 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4537:
---

 Summary: [CI] Suppress shell warning on travis-ci
 Key: ARROW-4537
 URL: https://issues.apache.org/jira/browse/ARROW-4537
 Project: Apache Arrow
  Issue Type: Task
  Components: Continuous Integration
Reporter: Kenta Murata


Suppress shell warnings like:

+'[' == 1 ']'
/home/travis/build/apache/arrow/ci/travis_before_script_cpp.sh: line 81: [: ==: unary operator expected





[jira] [Created] (ARROW-4536) Add data_type argument in garrow_list_array_new

2019-02-11 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4536:
---

 Summary: Add data_type argument in garrow_list_array_new
 Key: ARROW-4536
 URL: https://issues.apache.org/jira/browse/ARROW-4536
 Project: Apache Arrow
  Issue Type: Bug
  Components: GLib
Reporter: Kenta Murata


This issue corresponds to GitHub pull request
https://github.com/apache/arrow/pull/3621





[jira] [Created] (ARROW-4535) [C++] Fix MakeBuilder to preserve ListType's field name

2019-02-11 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4535:
---

 Summary: [C++] Fix MakeBuilder to preserve ListType's field name
 Key: ARROW-4535
 URL: https://issues.apache.org/jira/browse/ARROW-4535
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Kenta Murata


MakeBuilder doesn't preserve the field name in the given ListType.
I think this is a bug.





[jira] [Created] (ARROW-4506) [Ruby] Add Arrow::RecordBatch#raw_records

2019-02-07 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4506:
---

 Summary: [Ruby] Add Arrow::RecordBatch#raw_records
 Key: ARROW-4506
 URL: https://issues.apache.org/jira/browse/ARROW-4506
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Ruby
Reporter: Kenta Murata
Assignee: Kenta Murata


I want to add an Arrow::RecordBatch#raw_records method that converts a
record batch object into a nested array.





[jira] [Created] (ARROW-4397) [C++] dim_names in Tensor and SparseTensor

2019-01-27 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4397:
---

 Summary: [C++] dim_names in Tensor and SparseTensor
 Key: ARROW-4397
 URL: https://issues.apache.org/jira/browse/ARROW-4397
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Kenta Murata


Along with ARROW-4388, it would be useful to introduce dim_names in Tensor and 
SparseTensor of C++ library.





[jira] [Created] (ARROW-4320) [C++] Add tests for non-contiguous tensors

2019-01-22 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4320:
---

 Summary: [C++] Add tests for non-contiguous tensors
 Key: ARROW-4320
 URL: https://issues.apache.org/jira/browse/ARROW-4320
 Project: Apache Arrow
  Issue Type: Test
Reporter: Kenta Murata
Assignee: Kenta Murata


I would like to add some test cases for tensors with non-contiguous strides.





[jira] [Created] (ARROW-4318) [C++] Add Tensor::CountNonZero

2019-01-21 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4318:
---

 Summary: [C++] Add Tensor::CountNonZero
 Key: ARROW-4318
 URL: https://issues.apache.org/jira/browse/ARROW-4318
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Kenta Murata
Assignee: Kenta Murata


I would like to move CountNonZero, currently defined in
SparseTensorConverter, into the Tensor class, and add tests for this
function.

The pull request is [https://github.com/apache/arrow/pull/3452].
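As a sketch of the quantity such a method computes, counting nonzeros over a dense buffer is a one-liner (plain Python for illustration, not the C++ API; `count_nonzero` here is a hypothetical helper):

```python
def count_nonzero(values):
    # Number of entries that would become stored elements
    # in a sparse conversion of this dense buffer.
    return sum(1 for v in values if v != 0)

# 2x3 tensor flattened row-major: three structural nonzeros
assert count_nonzero([0.0, 1.5, 0.0, 2.0, 0.0, 3.0]) == 3
```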





[jira] [Created] (ARROW-4226) [C++] Add CSF sparse tensor support

2019-01-09 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4226:
---

 Summary: [C++] Add CSF sparse tensor support
 Key: ARROW-4226
 URL: https://issues.apache.org/jira/browse/ARROW-4226
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Kenta Murata


[https://github.com/apache/arrow/pull/2546#pullrequestreview-156064172]
{quote}Perhaps in the future, if zero-copy and future-proof-ness is really what 
we want, we might want to add the CSF (compressed sparse fiber) format, a 
generalisation of CSR/CSC. I'm currently working on adding it to PyData/Sparse, 
and I plan to make it the preferred format (COO will still be around though).
{quote}





[jira] [Created] (ARROW-4225) [C++] Add CSC sparse matrix support

2019-01-09 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4225:
---

 Summary: [C++] Add CSC sparse matrix support
 Key: ARROW-4225
 URL: https://issues.apache.org/jira/browse/ARROW-4225
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Kenta Murata


CSC sparse matrix is necessary for integration with existing sparse matrix 
libraries (umfpack, superlu). 
https://github.com/apache/arrow/pull/2546#issuecomment-422135645





[jira] [Created] (ARROW-4224) [Python] Support integration with pydata/sparse library

2019-01-09 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4224:
---

 Summary: [Python] Support integration with pydata/sparse library
 Key: ARROW-4224
 URL: https://issues.apache.org/jira/browse/ARROW-4224
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Kenta Murata


It would be great to support integration with pydata/sparse library.





[jira] [Created] (ARROW-4223) [Python] Support scipy.sparse integration

2019-01-09 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4223:
---

 Summary: [Python] Support scipy.sparse integration
 Key: ARROW-4223
 URL: https://issues.apache.org/jira/browse/ARROW-4223
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Kenta Murata


It would be great to support integration with scipy.sparse.
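For reference, the CSR layout that scipy.sparse uses (indptr, indices, data) can be built from a dense matrix in a few lines; a stdlib-only sketch, where `dense_to_csr` is a hypothetical helper and not a scipy or Arrow function:

```python
def dense_to_csr(rows):
    # Build CSR arrays from a dense row-major matrix given as a
    # list of lists: indptr[i]..indptr[i+1] delimits row i's
    # entries in the indices/data arrays.
    indptr, indices, data = [0], [], []
    for row in rows:
        for j, v in enumerate(row):
            if v != 0:
                indices.append(j)
                data.append(v)
        indptr.append(len(indices))
    return indptr, indices, data

assert dense_to_csr([[1, 0], [0, 2]]) == ([0, 1, 2], [0, 1], [1, 2])
```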





[jira] [Created] (ARROW-4221) [Format] Add canonical flag in COO sparse index

2019-01-09 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4221:
---

 Summary: [Format] Add canonical flag in COO sparse index
 Key: ARROW-4221
 URL: https://issues.apache.org/jira/browse/ARROW-4221
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Kenta Murata


To support integration with scipy.sparse.coo_matrix, it is necessary to
add a flag to SparseCOOIndex.  This flag denotes whether the elements of
a COO sparse tensor are sorted lexicographically or not.
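The property the proposed flag would record can be sketched as a simple predicate (a hypothetical helper, not the actual SparseCOOIndex API): coordinates sorted lexicographically with no duplicates:

```python
def is_canonical_coo(coords):
    # `coords` is a list of index tuples, one per stored nonzero.
    # Strict tuple comparison enforces both lexicographic order
    # and the absence of duplicate coordinates.
    return all(a < b for a, b in zip(coords, coords[1:]))

assert is_canonical_coo([(0, 0), (0, 2), (1, 1)])
assert not is_canonical_coo([(0, 2), (0, 0)])   # out of order
assert not is_canonical_coo([(0, 1), (0, 1)])   # duplicate entry
```

Knowing this holds lets a consumer skip an O(nnz log nnz) sort before conversion.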





[jira] [Created] (ARROW-4222) [C++] Support equality comparison between COO and CSR sparse tensors in SparseTensorEquals

2019-01-09 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-4222:
---

 Summary: [C++] Support equality comparison between COO and CSR 
sparse tensors in SparseTensorEquals
 Key: ARROW-4222
 URL: https://issues.apache.org/jira/browse/ARROW-4222
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Kenta Murata


Currently, SparseTensorEquals always returns false when given a COO
sparse tensor and a CSR sparse tensor.

It should instead support comparing the items stored in the two sparse
tensors.
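One way such a cross-format comparison can work is to normalize both formats to a common coordinate-to-value mapping before comparing; a minimal Python sketch (hypothetical helpers, not the C++ implementation):

```python
def coo_to_dict(coords, values):
    # COO: one (row, col) tuple per stored value.
    return {tuple(c): v for c, v in zip(coords, values)}

def csr_to_dict(indptr, indices, data):
    # CSR: row i's entries live at positions indptr[i]..indptr[i+1].
    out = {}
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            out[(i, indices[k])] = data[k]
    return out

# the same logical matrix [[1, 0], [0, 2]] in both formats
coo = coo_to_dict([(0, 0), (1, 1)], [1, 2])
csr = csr_to_dict([0, 1, 2], [0, 1], [1, 2])
assert coo == csr
```

A real implementation would avoid materializing the maps, but the equivalence being tested is the same.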





[jira] [Created] (ARROW-3518) Detect HOMEBREW_PREFIX automatically

2018-10-15 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-3518:
---

 Summary: Detect HOMEBREW_PREFIX automatically
 Key: ARROW-3518
 URL: https://issues.apache.org/jira/browse/ARROW-3518
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Kenta Murata


It can be detected by executing "brew --prefix" when Homebrew is available.





[jira] [Created] (ARROW-3515) Introduce NumericTensor class

2018-10-15 Thread Kenta Murata (JIRA)
Kenta Murata created ARROW-3515:
---

 Summary: Introduce NumericTensor class
 Key: ARROW-3515
 URL: https://issues.apache.org/jira/browse/ARROW-3515
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Kenta Murata


[https://github.com/apache/arrow/pull/2759]

This commit defines the new NumericTensor class as a subclass of Tensor class. 
NumericTensor extends Tensor class by adding a member function to access 
element values in a tensor.

I want to use this new feature for writing tests of SparseTensor in 
[#2546|https://github.com/apache/arrow/pull/2546].





Re: [DISCUSS] Concerns about the Arrow Slack channel

2018-06-21 Thread Kenta Murata
Hi everyone,

I heard from Kou that you're discussing whether to stop using Slack,
so I'd like to propose an alternative: Discourse.

On 2018/06/21 18:46:54, Dhruv Madeka  wrote:
> The issue with discourse is that you either have to host it or pay for them
> to host it

Discourse provides a free hosting plan for community-friendly open-source projects.
See this article for the details:
<https://blog.discourse.org/2016/03/free-discourse-forum-hosting-for-community-friendly-github-projects/>

> but still +1 for discourse, its a really nice format (I actually +1'ed the
> PyTorch forum on this thread too)

I'm also +1 for Discourse, because I'm managing
https://discourse.ruby-data.org/ under this plan.


Regards,
Kenta Murata