[GitHub] [arrow-nanoarrow] pitrou opened a new issue, #27: Tune down issue notifications

2022-08-19 Thread GitBox


pitrou opened a new issue, #27:
URL: https://github.com/apache/arrow-nanoarrow/issues/27

   Currently, every comment on every PR and issue is forwarded to the Arrow 
issues mailing list. It would be good to tune that down to sending notifications 
only for issue and PR creation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow-nanoarrow] pitrou opened a new issue, #28: Allow namespacing

2022-08-19 Thread GitBox


pitrou opened a new issue, #28:
URL: https://github.com/apache/arrow-nanoarrow/issues/28

   Since one of the selling points of nanoarrow is easier embedding and 
vendoring, we should probably make it possible to avoid conflicts between 
different nanoarrow versions loaded in the same process.
   
   See for example a similar configuration option offered by xxhash:
   
https://github.com/Cyan4973/xxHash/blob/c4359b17db082888fdc18371eba918b957a6baaa/xxhash.h#L210-L225
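   For reference, the xxhash pattern linked above works by letting the embedder 
define a namespace macro that the header uses to prefix every public symbol via 
token pasting. A minimal sketch of the same idea (the `MY_NANOARROW_NS` macro 
and the use of `ArrowSchemaInit` as the renamed symbol are illustrative 
assumptions, not nanoarrow's actual configuration):
   ```c
   #include <stdio.h>

   /* The embedder defines a namespace prefix before including the header. */
   #define MY_NANOARROW_NS mylib_

   /* Inside the (hypothetical) vendored header: token-pasting helpers that
    * rename each public symbol when a namespace is defined. */
   #define NS_CAT2(a, b) a##b
   #define NS_CAT(a, b) NS_CAT2(a, b)

   #ifdef MY_NANOARROW_NS
   #define ArrowSchemaInit NS_CAT(MY_NANOARROW_NS, ArrowSchemaInit)
   #endif

   /* The vendored implementation now compiles as mylib_ArrowSchemaInit, so
    * two copies built with different namespaces cannot collide at link time. */
   int ArrowSchemaInit(void) { return 42; }

   int main(void) {
     /* Both spellings resolve to the same prefixed symbol. */
     printf("%d %d\n", ArrowSchemaInit(), mylib_ArrowSchemaInit());
     return 0;
   }
   ```
   Two vendored nanoarrow copies given different prefixes would then export 
disjoint symbol sets within the same process.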
   





[GitHub] [arrow-nanoarrow] paleolimbot commented on issue #27: Tune down issue notifications

2022-08-19 Thread GitBox


paleolimbot commented on issue #27:
URL: https://github.com/apache/arrow-nanoarrow/issues/27#issuecomment-1220958811

   I have no idea why that happens or how to stop it!





[GitHub] [arrow-nanoarrow] paleolimbot commented on issue #28: Allow namespacing

2022-08-19 Thread GitBox


paleolimbot commented on issue #28:
URL: https://github.com/apache/arrow-nanoarrow/issues/28#issuecomment-1220963652

   Definitely!





[GitHub] [arrow-nanoarrow] lidavidm commented on issue #28: Allow namespacing

2022-08-19 Thread GitBox


lidavidm commented on issue #28:
URL: https://github.com/apache/arrow-nanoarrow/issues/28#issuecomment-1220966008

   Duplicate of #21?





[GitHub] [arrow-adbc] lidavidm opened a new issue, #71: [C] Research ConnectorX/pgeon for optimizing libpq driver

2022-08-19 Thread GitBox


lidavidm opened a new issue, #71:
URL: https://github.com/apache/arrow-adbc/issues/71

   Pgeon: https://github.com/0x0L/pgeon
   ConnectorX: https://sfu-db.github.io/connector-x/intro.html





[GitHub] [arrow-adbc] lidavidm opened a new issue, #72: [C] Research Turbodbc/Arrowdantic for developing ODBC-wrapping driver

2022-08-19 Thread GitBox


lidavidm opened a new issue, #72:
URL: https://github.com/apache/arrow-adbc/issues/72

   Arrowdantic: https://github.com/jorgecarleitao/arrowdantic/
   Turbodbc: https://github.com/blue-yonder/turbodbc/





[GitHub] [arrow-adbc] paleolimbot commented on issue #70: [Python] Try using PyCapsule for handles to C structs

2022-08-19 Thread GitBox


paleolimbot commented on issue #70:
URL: https://github.com/apache/arrow-adbc/issues/70#issuecomment-1220998984

   Nothing! reticulate doesn't handle them. In this case the R package would 
implement `py_to_r.some.qualified.python.type.schema()` and return an external 
pointer classed as `nanoarrow_schema` (for example). My point was that the 
semantics should be exactly the same as if the transformation were automatic (at 
least in this case).





[GitHub] [arrow-nanoarrow] paleolimbot commented on pull request #26: Implement getters

2022-08-19 Thread GitBox


paleolimbot commented on PR #26:
URL: https://github.com/apache/arrow-nanoarrow/pull/26#issuecomment-1221041550

   (I don't think there's much of a point with "safe" variants of these unless 
there's any objection)





[GitHub] [arrow-nanoarrow] wesm commented on issue #27: Tune down issue notifications

2022-08-22 Thread GitBox


wesm commented on issue #27:
URL: https://github.com/apache/arrow-nanoarrow/issues/27#issuecomment-1222122597

   Here's Arrow's .asf.yaml
   
   https://github.com/apache/arrow/blob/master/.asf.yaml
   
   and this repo's
   
   https://github.com/apache/arrow-nanoarrow/blob/main/.asf.yaml
   
   I suggest copying over the e-mail settings from the main Arrow repository.





[GitHub] [arrow-nanoarrow] wesm commented on issue #27: Tune down issue notifications

2022-08-22 Thread GitBox


wesm commented on issue #27:
URL: https://github.com/apache/arrow-nanoarrow/issues/27#issuecomment-1222122949

   ```
   notifications:
     commits:      comm...@arrow.apache.org
     issues:       git...@arrow.apache.org
     pullrequests: git...@arrow.apache.org
   ```





[GitHub] [arrow-nanoarrow] paleolimbot opened a new pull request, #29: Tune down notifications

2022-08-22 Thread GitBox


paleolimbot opened a new pull request, #29:
URL: https://github.com/apache/arrow-nanoarrow/pull/29

   Fixes #27.





[GitHub] [arrow-nanoarrow] paleolimbot merged pull request #29: Tune down notifications

2022-08-22 Thread GitBox


paleolimbot merged PR #29:
URL: https://github.com/apache/arrow-nanoarrow/pull/29





[GitHub] [arrow-nanoarrow] paleolimbot closed issue #27: Tune down issue notifications

2022-08-22 Thread GitBox


paleolimbot closed issue #27: Tune down issue notifications
URL: https://github.com/apache/arrow-nanoarrow/issues/27





[GitHub] [arrow-adbc] lidavidm merged pull request #73: MINOR: Make issue notifications less noisy

2022-08-22 Thread GitBox


lidavidm merged PR #73:
URL: https://github.com/apache/arrow-adbc/pull/73





[GitHub] [arrow-julia] ericphanson opened a new issue, #332: Difficulties trying to serialize to Union types

2022-08-24 Thread GitBox


ericphanson opened a new issue, #332:
URL: https://github.com/apache/arrow-julia/issues/332

   Setup:
   ```julia
   using Arrow
   
   struct A0 end
   
   struct A1
       x::Int
   end
   
   struct A2
       x::Int
       y::Float64
   end
   
   ArrowTypes.arrowname(::Type{A0}) = :A0
   ArrowTypes.JuliaType(::Val{:A0}) = A0
   
   ArrowTypes.arrowname(::Type{A1}) = :A1
   ArrowTypes.JuliaType(::Val{:A1}) = A1
   
   ArrowTypes.arrowname(::Type{A2}) = :A2
   ArrowTypes.JuliaType(::Val{:A2}) = A2
   
   struct MyUnion{T <: Tuple}
       elts::T
   end
   
   ArrowTypes.arrowname(::Type{<:MyUnion}) = :MyUnion
   ArrowTypes.JuliaType(::Val{:MyUnion}) = MyUnion
   ArrowTypes.ArrowType(::Type{<:MyUnion}) = ArrowTypes.UnionKind()
   ArrowTypes.toarrow(u::MyUnion{T}) where {T} = collect(Union{T.parameters...}, u.elts)
   ArrowTypes.fromarrow(::Type{<:MyUnion}, args...) = MyUnion(args)
   ```
   
   Then:
   ```julia
   julia> u = MyUnion((A0(), A1(1), A2(1, 2.0)))
   MyUnion{Tuple{A0, A1, A2}}((A0(), A1(1), A2(1, 2.0)))
   
   julia> ArrowTypes.toarrow(u)
   3-element Vector{Union{A0, A1, A2}}:
    A0()
    A1(1)
    A2(1, 2.0)
   
   julia> tbl = (; col = [u]);
   
   julia> Arrow.Table(Arrow.tobuffer(tbl)).col[1]
   ERROR: MethodError: no method matching isstringtype(::ArrowTypes.StructKind)
   Closest candidates are:
     isstringtype(::ArrowTypes.ListKind{stringtype}) where stringtype at ~/.julia/packages/ArrowTypes/dkiHE/src/ArrowTypes.jl:196
     isstringtype(::Type{ArrowTypes.ListKind{stringtype}}) where stringtype at ~/.julia/packages/ArrowTypes/dkiHE/src/ArrowTypes.jl:197
   Stacktrace:
    [1] getindex(l::Arrow.List{MyUnion, Int32, Arrow.DenseUnion{Union{Missing, A0, A1, A2}, Arrow.UnionT{Arrow.Flatbuf.UnionModes.Dense, nothing, Tuple{Union{Missing, A0, A1}, A2}}, Tuple{Arrow.DenseUnion{Union{Missing, A0, A1}, Arrow.UnionT{Arrow.Flatbuf.UnionModes.Dense, nothing, Tuple{Union{Missing, A0}, A1}}, Tuple{Arrow.Struct{Union{Missing, A0}, Tuple{}}, Arrow.Struct{A1, Tuple{Arrow.Primitive{Int64, Vector{Int64}}, Arrow.Struct{A2, Tuple{Arrow.Primitive{Int64, Vector{Int64}}, Arrow.Primitive{Float64, Vector{Float64}}}, i::Int64)
      @ Arrow ~/.julia/packages/Arrow/ZlMFU/src/arraytypes/list.jl:52
    [2] top-level scope
      @ REPL[25]:1
   ```
   
   I took a guess and added
   ```julia
   ArrowTypes.isstringtype(::ArrowTypes.StructKind) = false
   ```
   
   which seems to fix it
   ```julia
   julia> Arrow.Table(Arrow.tobuffer(tbl)).col[1]
   MyUnion{Tuple{SubArray{Union{Missing, A0, A1, A2}, 1, Arrow.DenseUnion{Union{Missing, A0, A1, A2}, Arrow.UnionT{Arrow.Flatbuf.UnionModes.Dense, nothing, Tuple{Union{Missing, A0, A1}, A2}}, Tuple{Arrow.DenseUnion{Union{Missing, A0, A1}, Arrow.UnionT{Arrow.Flatbuf.UnionModes.Dense, nothing, Tuple{Union{Missing, A0}, A1}}, Tuple{Arrow.Struct{Union{Missing, A0}, Tuple{}}, Arrow.Struct{A1, Tuple{Arrow.Primitive{Int64, Vector{Int64}}, Arrow.Struct{A2, Tuple{Arrow.Primitive{Int64, Vector{Int64}}, Arrow.Primitive{Float64, Vector{Float64}}, Tuple{UnitRange{Int64}}, true}}}((Union{Missing, A0, A1, A2}[A0(), A1(1), A2(1, 2.0)],))
   ```
   
   However, in my real code, I was using this macro to define the methods:
   ```julia
   macro arrow_record(T1)
       T = esc(T1)
       name = :(Symbol("JuliaLang.", string(parentmodule($T)), '.', string(nameof($T))))
       return quote
           ArrowTypes.arrowname(::Type{$T}) = $name
           ArrowTypes.ArrowType(::Type{$T}) = fieldtypes($T)
           ArrowTypes.toarrow(obj::$T) = ntuple(i -> getfield(obj, i), fieldcount($T))
           ArrowTypes.JuliaType(::Val{$name}, ::Any) = $T
           ArrowTypes.fromarrow(::Type{$T}, args) = $T(args...)
           ArrowTypes.fromarrow(::Type{$T}, arg::$T) = arg
       end
   end
   ```
   and had a second `A2`-style struct,
   ```julia
   struct A22
       x::Int
       y::Float64
   end
   ```
   If I do
   ```julia
   @arrow_record A2
   @arrow_record A22
   ```
   and define
   ```julia
   julia> u1 = MyUnion((A2(1, 2.0), A22(2, 3.0)))
   MyUnion{Tuple{A2, A22}}((A2(1, 2.0), A22(2, 3.0)))
   
   julia> u2 = MyUnion((A22(1, 2.0), A2(2, 3.0)))
   MyUnion{Tuple{A22, A2}}((A22(1, 2.0), A2(2, 3.0)))
   
   julia> tbl = (; col = [u1, u2]);
   ```
   
   Then I get:
   ```julia
   julia> Arrow.Table(Arrow.tobuffer(tbl)).col[1]
   ERROR: TypeError: in Union, expected Type, got a value of type Tuple{DataType, DataType}
   Stacktrace:
    [1] ArrowType(#unused#::Type{Union{Missing, A2}})
      @ ArrowTypes ~/.julia/packages/ArrowTypes/dkiHE/src/ArrowTypes.jl:71
    [2] ArrowTypes.ToArrow(x::Vector{Union{Missing, A2}})
      @ ArrowTypes ~/.julia/packages/ArrowTypes/dkiHE/src/ArrowTypes.jl:338
    [3] arrowvector(x::Vector{Union{Missing, A2}}, i::Int64, nl::Int64, fi::Int64, de::Dict{Int64, Any}, ded::Vector{Arrow.DictEncoding}, meta::Nothing; dictencoding::Bool, dictencode::
   ```

[GitHub] [arrow-julia] jariji opened a new issue, #333: Can't round-trip integer CategoricalArrays

2022-08-31 Thread GitBox


jariji opened a new issue, #333:
URL: https://github.com/apache/arrow-julia/issues/333

   ```jl
   julia> Arrow.write("/tmp/my.arrow", DataFrame(x=CategoricalArray([1,2,3])))
   "/tmp/my.arrow"
   
   julia> DataFrame(Arrow.Table("/tmp/my.arrow")).x |> eltype
   Int64
   
   [69666777] Arrow v2.3.0
   ```





[GitHub] [arrow-julia] bilelomrani1 opened a new issue, #334: Streaming: Pyarrow is 15 times faster than Arrow.jl

2022-09-04 Thread GitBox


bilelomrani1 opened a new issue, #334:
URL: https://github.com/apache/arrow-julia/issues/334

   I have an `.arrow` file generated with `pyarrow` whose schema is the following:
   ```
   input: struct<open: fixed_size_list<item: float>[512], high: fixed_size_list<item: float>[512], low: fixed_size_list<item: float>[512], close: fixed_size_list<item: float>[512]> not null
     child 0, open: fixed_size_list<item: float>[512]
         child 0, item: float
     child 1, high: fixed_size_list<item: float>[512]
         child 0, item: float
     child 2, low: fixed_size_list<item: float>[512]
         child 0, item: float
     child 3, close: fixed_size_list<item: float>[512]
         child 0, item: float
   ```
   
   With `pyarrow`, I load and iterate over records with the following:
   ```python
   import pyarrow as pa
   
   with pa.memory_map('arraydata.arrow', 'r') as source:
       loaded_arrays = pa.ipc.open_file(source).read_all()
   
       a = 0
       for batch in loaded_arrays.to_batches():
           for input_candles in batch["input"]:
               a += 1
   ```
   Iterating over my example file (~10,000 lines) takes 210 ms.
   
   In Julia, I load and iterate over the same file with the following:
   
   ```julia
   using Arrow, BenchmarkTools
   
   stream = Arrow.Stream("./arraydata.arrow")
   
   function bench_iteration(stream)
       a = 0
       for batch in stream
           for sample in batch.input
               a += 1
           end
       end
   end
   
   @btime bench_iteration($stream)
   ```
   
   ```
   3.169 s (25272097 allocations: 1.70 GiB)
   ```
   
   Iterating over the records takes about 15 times longer with `Arrow.jl`. Am I 
doing something wrong?





[GitHub] [arrow-julia] svilupp opened a new issue, #335: Inconsistent handling of eltype Decimals.Decimal (with silent errors?)

2022-09-05 Thread GitBox


svilupp opened a new issue, #335:
URL: https://github.com/apache/arrow-julia/issues/335

   First of all, thank you for the amazing package! I have noticed unexpected 
behaviour that I wanted to point out.
   
   **Expected behaviour:** rational numbers like 1.0 and 0.1 will be 
represented as Float; they can be saved and loaded again.
   
   **Actual behaviour:** 
   When writing column with eltype Decimals.Decimal, `Arrow.write(filename,df)` 
will give a method error (see below) and 
`Arrow.write(filename,df;compress=:lz4)` will complete without an error, but 
the resulting table is wrong when re-read (see MWE below).
   
   I've had a quick look at the code base and I cannot see any type checks - 
are those left to the user / MethodErrors?
   
   MWE:
   ```julia
   using Decimals
   using DataFrames, Arrow
   
   df = DataFrame(:a => [Decimal(2.0)])
   
   # this will fail with an error that Decimal cannot be saved
   Arrow.write("test.feather", df)
   # nested task error: MethodError: no method matching write(::IOBuffer, ::Decimals.Decimal)
   
   # this will succeed
   Arrow.write("test.feather", df; compress=:lz4)
   
   # but the loaded dataframe will be rubbish
   df2 = Arrow.Table("test.feather") |> DataFrame
   # 1×1 DataFrame
   #  Row │ a
   #      │ Float64
   # ─────┼─────────────
   #    1 │ 2.1509e-314
   ```
   
   Error stack trace from Arrow.write() without a keyword argument:
   > ERROR: TaskFailedException
   Stacktrace:
    [1] wait
      @ ./task.jl:345 [inlined]
    [2] close(writer::Arrow.Writer{IOStream})
      @ Arrow ~/.julia/packages/Arrow/ZlMFU/src/write.jl:230
    [3] open(::Arrow.var"#120#121"{DataFrame}, ::Type, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Bool, Tuple{Symbol}, NamedTuple{(:file,), Tuple{Bool}}})
      @ Base ./io.jl:386
    [4] #write#119
      @ ~/.julia/packages/Arrow/ZlMFU/src/write.jl:57 [inlined]
    [5] write(file_path::String, tbl::DataFrame)
      @ Arrow ~/.julia/packages/Arrow/ZlMFU/src/write.jl:56
    [6] top-level scope
      @ REPL[14]:1
   > nested task error: MethodError: no method matching write(::IOBuffer, ::Decimals.Decimal)
   Closest candidates are:
     write(::IO, ::Any) at io.jl:672
     write(::IO, ::Any, ::Any...) at io.jl:673
     write(::Base.GenericIOBuffer, ::UInt8) at iobuffer.jl:442
     ...
   Stacktrace:
    [1] write(io::IOBuffer, x::Decimals.Decimal)
      @ Base ./io.jl:672
    [2] writearray(io::IOStream, #unused#::Type{Decimals.Decimal}, col::Vector{Union{Missing, Decimals.Decimal}})
      @ Arrow ~/.julia/packages/Arrow/ZlMFU/src/utils.jl:50
    [3] writebuffer(io::IOStream, col::Arrow.Primitive{Union{Missing, Decimals.Decimal}, Vector{Union{Missing, Decimals.Decimal}}}, alignment::Int64)
      @ Arrow ~/.julia/packages/Arrow/ZlMFU/src/arraytypes/primitive.jl:102
    [4] write(io::IOStream, msg::Arrow.Message, blocks::Tuple{Vector{Arrow.Block}, Vector{Arrow.Block}}, sch::Base.RefValue{Tables.Schema}, alignment::Int64)
      @ Arrow ~/.julia/packages/Arrow/ZlMFU/src/write.jl:365
    [5] macro expansion
      @ ~/.julia/packages/Arrow/ZlMFU/src/write.jl:149 [inlined]
    [6] (::Arrow.var"#122#124"{IOStream, Int64, Tuple{Vector{Arrow.Block}, Vector{Arrow.Block}}, Base.RefValue{Tables.Schema}, Arrow.OrderedChannel{Arrow.Message}})()
      @ Arrow ./threadingconstructs.jl:258
   
   
   **Package version**
 [69666777] Arrow v2.3.0
 [a93c6f00] DataFrames v1.3.4
 [194296ae] LibPQ v1.14.0
   
   **versioninfo()** (but it was the same on 1.7)
   Julia Version 1.8.0
   Commit 5544a0fab76 (2022-08-17 13:38 UTC)
   Platform Info:
   OS: macOS (arm64-apple-darwin21.3.0)
   CPU: 8 × Apple M1 Pro
   WORD_SIZE: 64
   LIBM: libopenlibm
   LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
   Threads: 6 on 6 virtual cores





[GitHub] [arrow-julia] TanookiToad opened a new issue, #336: Invalid argument error

2022-09-07 Thread GitBox


TanookiToad opened a new issue, #336:
URL: https://github.com/apache/arrow-julia/issues/336

   If you try to save a loaded table back into the same file, it leads to an 
invalid argument error.
   
   Seems like it's caused by mmap on Windows. See JuliaData/CSV.jl#70.
   
   ```jl
   using Arrow
   using DataFrames
   
   df = DataFrame(rand(100, 100), :auto)
   Arrow.write("test.arrow", df)
   
   df = Arrow.Table("test.arrow")
   Arrow.write("test.arrow", df)
   ```
   
   The last line will raise an error. 
   
   ```jl
   ERROR: SystemError: opening file "test.arrow": Invalid argument
   Stacktrace:
 [1] systemerror(p::String, errno::Int32; extrainfo::Nothing)
   @ Base .\error.jl:174
 [2] #systemerror#68
   @ .\error.jl:173 [inlined]
 [3] systemerror
   @ .\error.jl:173 [inlined]
  [4] open(fname::String; lock::Bool, read::Nothing, write::Nothing, create::Nothing, truncate::Bool, append::Nothing)
    @ Base .\iostream.jl:293
 [5] open(fname::String, mode::String; lock::Bool)
   @ Base .\iostream.jl:355
 [6] open(fname::String, mode::String)
   @ Base .\iostream.jl:355
  [7] open(::Arrow.var"#116#117"{Nothing, Nothing, Bool, Nothing, Bool, Bool, Bool, Int64, Int64, Float64, Bool, Arrow.Table}, ::String, ::Vararg{String}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Base .\io.jl:328
 [8] open(::Function, ::String, ::String)
   @ Base .\io.jl:328
 [9] #write#115
   @ C:\Users\R9000K\.julia\packages\Arrow\SFb8h\src\write.jl:57 [inlined]
[10] write(file_path::String, tbl::Arrow.Table)
   @ Arrow C:\Users\R9000K\.julia\packages\Arrow\SFb8h\src\write.jl:57
[11] top-level scope
   @ Untitled-1:8
   ```
   
   However, it works when saving to a file name different from the original one.
   
   ```jl
   Arrow.write("test1.arrow", df)
   ```
   
   
   





[GitHub] [arrow-testing] westonpace closed pull request #74: ARROW-15425: [Integration] Add delta dictionaries in file format to integration tests

2022-09-16 Thread GitBox


westonpace closed pull request #74: ARROW-15425: [Integration] Add delta 
dictionaries in file format to integration tests
URL: https://github.com/apache/arrow-testing/pull/74





[GitHub] [arrow-julia] bkamins opened a new issue, #337: Support DataAPI.jl metadata API

2022-09-20 Thread GitBox


bkamins opened a new issue, #337:
URL: https://github.com/apache/arrow-julia/issues/337

   Hi @quinnj - could you please add support for 
https://github.com/JuliaData/DataAPI.jl/pull/48 to the Arrow.jl release plan, 
for the created Arrow tables.
   Only read methods need to be implemented for Arrow tables:
   * `DataAPI.metadata`
   * `DataAPI.metadatakeys`
   * `DataAPI.colmetadata`
   * `DataAPI.colmetadatakeys`
   





[GitHub] [arrow-julia] Moelf opened a new issue, #340: Feather file with compression and larger than RAM

2022-10-06 Thread GitBox


Moelf opened a new issue, #340:
URL: https://github.com/apache/arrow-julia/issues/340

   Last time I checked, `mmap` breaks down for files with compression. This is 
understandable because the compressed buffers clearly can't be reinterpreted 
without inflation.
   
   But the larger a file is, the more likely it is to be compressed. Can we 
decompress only a single "row group" (and only the relevant columns, of course) 
on the fly yet?





[GitHub] [arrow-julia] quinnj closed issue #295: Order of record batches from "arrow file" format files (i.e. `Arrow.Table`) not preserved

2022-10-06 Thread GitBox


quinnj closed issue #295: Order of record batches from "arrow file" format 
files (i.e. `Arrow.Table`) not preserved
URL: https://github.com/apache/arrow-julia/issues/295





[GitHub] [arrow-julia] quinnj closed issue #324: filtering DataFrame loaded from feather file triggers `deleteat!` error

2022-10-07 Thread GitBox


quinnj closed issue #324: filtering DataFrame loaded from feather file triggers 
`deleteat!` error
URL: https://github.com/apache/arrow-julia/issues/324





[GitHub] [arrow-julia] quinnj opened a new issue, #342: Need to improve code review/release process and reduce developer productivity friction

2022-10-07 Thread GitBox


quinnj opened a new issue, #342:
URL: https://github.com/apache/arrow-julia/issues/342

   In https://github.com/apache/arrow-julia/issues/284, I originally raised 
some concerns about the health and long-term maintainability of the package 
under the apache organization.
   
   Having let that sit for a while, I'm again raising concerns around how the 
package is managed. In particular, I have 3 main complaints:
   
   1. Inability for meaningful contributors to approve pull requests (only 
Arrow PMC members are able to approve PRs to be merged)
   2. Inability for meaningful contributors to approve new releases (same as 
above)
   3. Slowness of getting fixes merged and new releases made (a combination of 
the Arrow PMC approval requirement from the above two and the current 72-hour 
release window)
   
   On point 1, it's unfortunate because only Arrow PMC members (only @kou so 
far) can approve PRs/releases in a meaningful way, yet these members, no 
disrespect intended, don't have the skills/context/code abilities to actually 
evaluate code changes. It would be much more helpful if @jrevels, @omus, 
@ericphanson, @nickrobinson251, @bkamins, and @baumgold had the necessary 
permissions to approve pull requests and new releases.
   
   On point 3, the current 72-hour window is really long, especially when it's 
idiomatic for Julia packages to merge a single pull request with a small fix and 
immediately issue a patch release. I think ideally we'd have 12- or 24-hour 
release windows, which would make things much more manageable.
   
   Thoughts?





[GitHub] [arrow-julia] Moelf opened a new issue, #344: error earlier when number of entries don't match across all fields

2022-10-11 Thread GitBox


Moelf opened a new issue, #344:
URL: https://github.com/apache/arrow-julia/issues/344

   Right now it throws a confusing message:
   ```
   julia> Arrow.write(tempname(), df)
   ERROR: UndefRefError: access to undefined reference
   Stacktrace:
    [1] getindex
      @ ./array.jl:924 [inlined]
    [2] iterate
      @ ~/.julia/packages/Arrow/SFb8h/src/arraytypes/list.jl:171 [inlined]
    [3] Arrow.ToList(input::Arrow.ToList{Vector{Bool}, false, Vector{Vector{Bool}}, Int32}; largelists::Bool)
      @ Arrow ~/.julia/packages/Arrow/SFb8h/src/arraytypes/list.jl:103
    [4] arrowvector(::ArrowTypes.ListKind{false}, x::Arrow.ToList{Vector{Bool}, false, Vector{Vector{Bool}}, Int32}, i::Int64, nl::Int64, fi::Int64, de::Dict{Int64, Any}, ded::Vector{Arrow.DictEncoding}, meta::Nothing; largelists::Bool, kw::Base.Pairs{Symbol, Union{Nothing, Integer}, NTuple{6, Symbol}, NamedTuple{(:dictencode, :maxdepth, :lareglists, :compression, :denseunions, :dictencodenested), Tuple{Bool, Int64, Bool, Nothing, Bool, Bool}}})
   ```





[GitHub] [arrow-testing] zeroshade opened a new pull request, #81: ARROW-18031: [C++][Parquet] Undefined behavior in boolean RLE decoder

2022-10-13 Thread GitBox


zeroshade opened a new pull request, #81:
URL: https://github.com/apache/arrow-testing/pull/81

   Corresponding fix for this issue found in 
https://github.com/apache/arrow/pull/14407





[GitHub] [arrow-testing] pitrou merged pull request #81: ARROW-18031: [C++][Parquet] Undefined behavior in boolean RLE decoder

2022-10-13 Thread GitBox


pitrou merged PR #81:
URL: https://github.com/apache/arrow-testing/pull/81





[GitHub] [arrow-julia] palday opened a new issue, #345: Tests fail on Apple silicon on Julia 1.8

2022-10-18 Thread GitBox


palday opened a new issue, #345:
URL: https://github.com/apache/arrow-julia/issues/345

   ```julia
   
   ArgumentError: unsafe_wrap: pointer 0x14858d048 is not properly aligned to 
16 bytes
 Stacktrace:
   [1] #unsafe_wrap#102
 @ ./pointer.jl:89 [inlined]
   [2] unsafe_wrap
 @ ./pointer.jl:87 [inlined]
   [3] reinterp(#unused#::Type{Arrow.Decimal{2, 2, Int128}}, 
batch::Arrow.Batch, buf::Arrow.Flatbuf.Buffer, compression::Nothing)
 @ Arrow ~/Code/arrow-julia/src/table.jl:507
   [4] build(f::Arrow.Flatbuf.Field, #unused#::Arrow.Flatbuf.Decimal, 
batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64, 
Arrow.DictEncoding}, nodeidx::Int64, bufferidx::Int64, convert::Bool)
   
   ```
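   The arithmetic of the failure is easy to check: per the error text, `unsafe_wrap` requires the buffer pointer to be 16-byte aligned for `Int128`-backed `Decimal` elements, and the reported address fails that check. A minimal sketch (the pointer value is taken from the error message above):

```python
# Address from the error message; 16-byte alignment requires the low
# four bits of the address to be zero.
ptr = 0x14858D048

assert ptr % 8 == 0    # the buffer is 8-byte aligned...
assert ptr % 16 == 8   # ...but not 16-byte aligned, hence the ArgumentError
```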
   
   
Full test output 
   
   ```julia
   (Arrow) pkg> test
Testing Arrow
 Status 
`/private/var/folders/yy/nyj87tsn7093bb7d84rl64rhgp/T/jl_xRGYNK/Project.toml`
 [69666777] Arrow v2.3.0 `~/Code/arrow-julia`
   ⌅ [31f734f8] ArrowTypes v1.2.1
 [c3b6d118] BitIntegers v0.2.6
 [324d7699] CategoricalArrays v0.10.7
 [5ba52731] CodecLz4 v0.4.0
 [6b39b394] CodecZstd v0.7.2
 [9a962f9c] DataAPI v1.12.0
 [48062228] FilePathsBase v0.9.20
 [0f8b85d8] JSON3 v1.10.0
 [2dfb63ee] PooledArrays v1.4.2
 [91c51154] SentinelArrays v1.3.16
 [856f2bd8] StructTypes v1.10.0
 [bd369af6] Tables v1.10.0
 [f269a46b] TimeZones v1.9.0
 [76eceee3] WorkerUtilities v1.1.0
 [ade2ca70] Dates `@stdlib/Dates`
 [a63ad114] Mmap `@stdlib/Mmap`
 [9a3f8284] Random `@stdlib/Random`
 [8dfed614] Test `@stdlib/Test`
 [cf7118a7] UUIDs `@stdlib/UUIDs`
 Status 
`/private/var/folders/yy/nyj87tsn7093bb7d84rl64rhgp/T/jl_xRGYNK/Manifest.toml`
 [69666777] Arrow v2.3.0 `~/Code/arrow-julia`
   ⌅ [31f734f8] ArrowTypes v1.2.1
 [c3b6d118] BitIntegers v0.2.6
 [fa961155] CEnum v0.4.2
 [324d7699] CategoricalArrays v0.10.7
 [5ba52731] CodecLz4 v0.4.0
 [6b39b394] CodecZstd v0.7.2
   ⌅ [34da2185] Compat v3.46.0
 [9a962f9c] DataAPI v1.12.0
 [e2d170a0] DataValueInterfaces v1.0.0
 [e2ba6199] ExprTools v0.1.8
 [48062228] FilePathsBase v0.9.20
 [842dd82b] InlineStrings v1.2.2
 [82899510] IteratorInterfaceExtensions v1.0.0
 [692b3bcd] JLLWrappers v1.4.1
 [0f8b85d8] JSON3 v1.10.0
 [e1d29d7a] Missings v1.0.2
 [78c3b35d] Mocking v0.7.3
 [bac558e1] OrderedCollections v1.4.1
 [69de0a69] Parsers v2.4.2
 [2dfb63ee] PooledArrays v1.4.2
 [21216c6a] Preferences v1.3.0
 [3cdcf5f2] RecipesBase v1.3.1
 [ae029012] Requires v1.3.0
 [6c6a2e73] Scratch v1.1.1
 [91c51154] SentinelArrays v1.3.16
 [66db9d55] SnoopPrecompile v1.0.1
 [856f2bd8] StructTypes v1.10.0
 [3783bdb8] TableTraits v1.0.1
 [bd369af6] Tables v1.10.0
 [f269a46b] TimeZones v1.9.0
 [3bb67fe8] TranscodingStreams v0.9.9
 [76eceee3] WorkerUtilities v1.1.0
 [5ced341a] Lz4_jll v1.9.3+0
 [3161d3a3] Zstd_jll v1.5.2+0
 [0dad84c5] ArgTools v1.1.1 `@stdlib/ArgTools`
 [56f22d72] Artifacts `@stdlib/Artifacts`
 [2a0f44e3] Base64 `@stdlib/Base64`
 [ade2ca70] Dates `@stdlib/Dates`
 [8bb1440f] DelimitedFiles `@stdlib/DelimitedFiles`
 [8ba89e20] Distributed `@stdlib/Distributed`
 [f43a241f] Downloads v1.6.0 `@stdlib/Downloads`
 [7b1f6079] FileWatching `@stdlib/FileWatching`
 [9fa8497b] Future `@stdlib/Future`
 [b77e0a4c] InteractiveUtils `@stdlib/InteractiveUtils`
 [4af54fe1] LazyArtifacts `@stdlib/LazyArtifacts`
 [b27032c2] LibCURL v0.6.3 `@stdlib/LibCURL`
 [76f85450] LibGit2 `@stdlib/LibGit2`
 [8f399da3] Libdl `@stdlib/Libdl`
 [37e2e46d] LinearAlgebra `@stdlib/LinearAlgebra`
 [56ddb016] Logging `@stdlib/Logging`
 [d6f4376e] Markdown `@stdlib/Markdown`
 [a63ad114] Mmap `@stdlib/Mmap`
 [ca575930] NetworkOptions v1.2.0 `@stdlib/NetworkOptions`
 [44cfe95a] Pkg v1.8.0 `@stdlib/Pkg`
 [de0858da] Printf `@stdlib/Printf`
 [3fa0cd96] REPL `@stdlib/REPL`
 [9a3f8284] Random `@stdlib/Random`
 [ea8e919c] SHA v0.7.0 `@stdlib/SHA`
 [9e88b42a] Serialization `@stdlib/Serialization`
 [1a1011a3] SharedArrays `@stdlib/SharedArrays`
 [6462fe0b] Sockets `@stdlib/Sockets`
 [2f01184e] SparseArrays `@stdlib/SparseArrays`
 [10745b16] Statistics `@stdlib/Statistics`
 [fa267f1f] TOML v1.0.0 `@stdlib/TOML`
 [a4e569a6] Tar v1.10.1 `@stdlib/Tar`
 [8dfed614] Test `@stdlib/Test`
 [cf7118a7] UUIDs `@stdlib/UUIDs`
 [4ec0a83e] Unicode `@stdlib/Unicode`
 [e66e0078] CompilerSupportLibraries_jll v0.5.2+0 
`@stdlib/CompilerSupportLibraries_jll`
 [deac9b47] LibCURL_jll v7.84.0+0 `@stdlib/LibCURL_jll`
 [29816b5a] LibSSH2_jll v1.10.2+0 `@stdlib/LibSSH2_jll`
 [c8ffd9c3] MbedTLS_jll v2.28.0+0 `@stdlib/MbedTLS_jll`
 [14a3606d] MozillaCACerts_jll v2022.2.1 `@stdlib/MozillaCACerts_jll`
 [4536629a] OpenBLAS_jll v0.3.20+0 `@stdlib/O

[GitHub] [arrow-julia] Moelf closed issue #344: error earlier when number of entries don't match across all fields

2022-10-22 Thread GitBox


Moelf closed issue #344: error earlier when number of entries don't match 
across all fields
URL: https://github.com/apache/arrow-julia/issues/344





[GitHub] [arrow-julia] ericphanson opened a new issue, #348: Install Registrator.jl github app

2022-10-28 Thread GitBox


ericphanson opened a new issue, #348:
URL: https://github.com/apache/arrow-julia/issues/348

   Can we install the Julia Registrator github app on this repository? 
https://github.com/JuliaRegistries/Registrator.jl#install-registrator
   https://user-images.githubusercontent.com/5846501/198585934-972f8963-84c1-429a-acea-deca5ba872c6.png
   
   I can request it to be installed, but I guess someone will have to approve 
it. I have not requested it yet in case it is a breach of etiquette and there 
is a different process to be followed.
   
   Why?
   
   This enables us to easily register packages in Julia's package registry. It 
only requires minimal read-only permissions. This is the standard workflow for 
registering Julia packages.





[GitHub] [arrow-julia] bkamins opened a new issue, #352: Allow appending record batches to an existing Arrow file

2022-10-30 Thread GitBox


bkamins opened a new issue, #352:
URL: https://github.com/apache/arrow-julia/issues/352

   @quinnj - following my comment on Julia Slack. Would it be possible to add an 
option to append record batches to an existing Arrow file? (I assume such an 
append method should check whether the appended table has the same schema as the 
existing Arrow data.)





[GitHub] [arrow-julia] bkamins closed issue #352: Allow appending record batches to an existing Arrow file

2022-10-30 Thread GitBox


bkamins closed issue #352: Allow appending record batches to an existing Arrow 
file
URL: https://github.com/apache/arrow-julia/issues/352





[GitHub] [arrow-julia] bkamins opened a new issue, #353: Add an indexable variant of Arrow.Stream

2022-10-30 Thread GitBox


bkamins opened a new issue, #353:
URL: https://github.com/apache/arrow-julia/issues/353

   In a distributed computing context it would be nice to have a vector variant 
of the `Arrow.Stream` iterator. The idea is to be able to split processing of a 
single large Arrow file with multiple record batches across multiple worker 
processes. Looking at the source code, this should be possible in a relatively 
efficient way.
   
   @quinnj - what do you think?





[GitHub] [arrow-julia] bkamins opened a new issue, #354: Arrow.append to non-existent file

2022-10-30 Thread GitBox


bkamins opened a new issue, #354:
URL: https://github.com/apache/arrow-julia/issues/354

   Maybe we could consider allowing creation of a new Arrow file with 
`Arrow.append`? Currently it fails, so one needs to write:
   ```
   for i in 1:10
   if isfile("out.arrow")
   Arrow.append("out.arrow", DataFrame(i=i))
   else
   Arrow.write("out.arrow", DataFrame(i=i))
   end
   end
   ```
   which could be just:
   ```
   for i in 1:10
   Arrow.append("out.arrow", DataFrame(i=i))
   end
   ```
   But maybe there is a reason for the current design?





[GitHub] [arrow-julia] quinnj closed issue #345: Tests fail on Apple silicon on Julia 1.8

2022-11-03 Thread GitBox


quinnj closed issue #345: Tests fail on Apple silicon on Julia 1.8
URL: https://github.com/apache/arrow-julia/issues/345





[GitHub] [arrow-julia] alex-s-gardner opened a new issue, #359: Arrow changes data type from input in unexpected ways

2022-11-07 Thread GitBox


alex-s-gardner opened a new issue, #359:
URL: https://github.com/apache/arrow-julia/issues/359

   In this MWE, the input is unrecognizable in the output (the path to the Zarr 
file is public, so it can be run locally):
   
   ```
   dc = 
Zarr.zopen("http://its-live-data.s3.amazonaws.com/datacubes/v02/N20E100/ITS_LIVE_vel_EPSG32647_G0120_X65_Y325.zarr")
   C = dc["satellite_img1"][:]
   input = DataFrame([C,C],:auto)
   Arrow.write("test.arrow", input)
   output = Arrow.Table("test.arrow")
   ```
   
   `input.x1` looks like this:
   ```
   1460-element Vector{Zarr.MaxLengthStrings.MaxLengthString{2, UInt32}}:
"1A"
⋮
"8."
   ```
   
   while `output.x1` looks like this:
   ```
   1460-element Arrow.List{String, Int32, Vector{UInt8}}:
"1\0"
⋮
"\0\0"
   ```
   





[GitHub] [arrow-julia] dmbates opened a new issue, #360: Creating DictEncoded in the presence of missing values

2022-11-09 Thread GitBox


dmbates opened a new issue, #360:
URL: https://github.com/apache/arrow-julia/issues/360

   When a column that contains missing values (e.g. a `PooledArray` column) is 
converted to `DictEncoded`, the dictionary is based on the result of 
`DataAPI.refpool`, which includes `missing`. As a result, both the dictionary and 
the index vector contain missing values, which confuses Pandas. The missing value 
in the dictionary can be skipped because it is never referenced in the index 
vector.
   
   ```julia
   julia> using Arrow, DataAPI, PooledArrays
   
   julia> tbl = (; a = PooledArray([missing, "a", "b", "a"]))
   (a = Union{Missing, String}[missing, "a", "b", "a"],)
   
   julia> DataAPI.refarray(tbl.a)
   4-element Vector{UInt32}:
0x0001
0x0002
0x0003
0x0002
   
   julia> DataAPI.refpool(tbl.a)
   3-element Vector{Union{Missing, String}}:
missing
"a"
"b"
   
   julia> Arrow.write("tbl.arrow", tbl)
   "tbl.arrow"
   ```
   
   In the `read_table` result we see that there is a `null` in the dictionary 
at Python index 0 that is never referenced in the indices vector.
   
   ```python
   $ python
   Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:35:26) [GCC 
10.4.0] on linux
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import pyarrow.feather as fea
   >>> fea.read_table("tbl.arrow")
   pyarrow.Table
   a: dictionary
   
   a: [  -- dictionary:
   [null,"a","b"]  -- indices:
   [null,1,2,1]]
   >>> fea.read_feather('nyc_mv_collisions_202201.arrow')
   Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
 File 
"/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/feather.py", 
line 231, in read_feather
   return (read_table(
 File "pyarrow/array.pxi", line 823, in 
pyarrow.lib._PandasConvertible.to_pandas
 File "pyarrow/table.pxi", line 3913, in pyarrow.lib.Table._to_pandas
 File 
"/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py",
 line 818, in table_to_blockmanager
   blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
 File 
"/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py",
 line 1170, in _table_to_blocks
   return [_reconstruct_block(item, columns, extension_columns)
 File 
"/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py",
 line 1170, in <listcomp>
   return [_reconstruct_block(item, columns, extension_columns)
 File 
"/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py",
 line 757, in _reconstruct_block
   cat = _pandas_api.categorical_type.from_codes(
 File 
"/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/arrays/categorical.py",
 line 687, in from_codes
   dtype = CategoricalDtype._from_values_or_dtype(
 File 
"/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py",
 line 299, in _from_values_or_dtype
   dtype = CategoricalDtype(categories, ordered)
 File 
"/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py",
 line 186, in __init__
   self._finalize(categories, ordered, fastpath=False)
 File 
"/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py",
 line 340, in _finalize
   categories = self.validate_categories(categories, fastpath=fastpath)
 File 
"/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py",
 line 534, in validate_categories
   raise ValueError("Categorical categories cannot be null")
   ValueError: Categorical categories cannot be null
   ```
   
   One possible approach is to check for `missing` in the refpool, find its 
index in the refpool, delete it from the refpool, and rewrite the refarray to 
replace that index with `missing`.
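   That approach can be sketched in plain Python over the refpool/refarray pair (a hypothetical helper; the actual fix would operate on Julia's `DataAPI.refpool`/`refarray` results, whose refs are 1-based as shown above):

```python
def strip_missing_from_pool(refpool, refarray):
    """Drop the missing entry (None) from the refpool and remap the
    1-based refarray so missing is encoded directly as None."""
    if None not in refpool:
        return list(refpool), list(refarray)
    miss = refpool.index(None) + 1                    # 1-based slot of missing
    new_pool = [v for v in refpool if v is not None]
    new_refs = [None if r == miss else (r - 1 if r > miss else r)
                for r in refarray]
    return new_pool, new_refs

# Mirrors the example above: refpool [missing, "a", "b"], refarray [1, 2, 3, 2]
pool, refs = strip_missing_from_pool([None, "a", "b"], [1, 2, 3, 2])
```

   With that remapping the dictionary written to the file contains only `["a", "b"]`, and missing values live solely in the (0-based, on the Arrow side) index vector, which is what pandas expects.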





[GitHub] [arrow-testing] zeroshade merged pull request #82: Add Go Parquet DeltaBitPacking test data

2022-11-22 Thread GitBox


zeroshade merged PR #82:
URL: https://github.com/apache/arrow-testing/pull/82





[GitHub] [arrow-julia] MrHenning opened a new issue, #363: Error when open files with MICROSECOND in Date.Time

2022-11-24 Thread GitBox


MrHenning opened a new issue, #363:
URL: https://github.com/apache/arrow-julia/issues/363

   When opening a table that contains a `Time` object with `MICROSECOND` 
precision (which I think is the default with `python`/`pandas`), I get an error.
   
   Example:
   1. Create `python`/`pandas` dataframe:
   ```python
   pd.DataFrame(dict(
 i=range(0,10),
 time=[datetime.time(hour=i) for i in range(0,10)]
   )).to_feather('~/python_time_df.arrow')
   ```
   1. read file in `julia`:
   ```julia
   Arrow.Table("~/python_time_df.arrow", convert=true)
   ```
   yields the error
   ```
   Failed to show value:
   MethodError: no method matching 
Int64(::Arrow.Time{Arrow.Flatbuf.TimeUnits.MICROSECOND, Int64})
   Closest candidates are:
   (::Type{T})(!Matched::AbstractChar) where T<:Union{Int32, Int64} at 
char.jl:51
   (::Type{T})(!Matched::AbstractChar) where T<:Union{AbstractChar, Number} at 
char.jl:50
   (::Type{T})(!Matched::BigInt) where T<:Union{Int128, Int16, Int32, Int64, 
Int8} at gmp.jl:359
   ...
   
   - Dates.Time(::Arrow.Time{Arrow.Flatbuf.TimeUnits.MICROSECOND, Int64}, 
::Int64, ::Int64, ::Int64, ::Int64, ::Int64, ::Dates.AMPM)@types.jl:412
   - fromarrow(::Type{Dates.Time}, 
::Arrow.Time{Arrow.Flatbuf.TimeUnits.MICROSECOND, Int64})@ArrowTypes.jl:157
   - fromarrow(::Type{Union{Missing, Dates.Time}}, 
::Arrow.Time{Arrow.Flatbuf.TimeUnits.MICROSECOND, Int64})@ArrowTypes.jl:161
   - getindex@primitive.jl:46[inlined]
   - _getindex@abstractarray.jl:1274[inlined]
   - getindex@abstractarray.jl:1241[inlined]
   - isassigned(::Arrow.Primitive{Union{Missing, Dates.Time}, 
Vector{Arrow.Time{Arrow.Flatbuf.TimeUnits.MICROSECOND, Int64}}}, ::Int64, 
::Int64)@abstractarray.jl:565
   - alignment(::IOContext{IOBuffer}, ::AbstractVecOrMat, ::Vector{Int64}, 
::Vector{Int64}, ::Int64, ::Int64, ::Int64, ::Int64)@arrayshow.jl:68
   - _print_matrix(::IOContext{IOBuffer}, ::AbstractVecOrMat, ::String, 
::String, ::String, ::String, ::String, ::String, ::Int64, ::Int64, 
::UnitRange{Int64}, ::UnitRange{Int64})@arrayshow.jl:207
   - print_matrix(::IOContext{IOBuffer}, ::Arrow.Primitive{Union{Missing, 
Dates.Time}, Vector{Arrow.Time{Arrow.Flatbuf.TimeUnits.MICROSECOND, Int64}}}, 
::String, ::String, ::String, ::String, ::String, ::String, ::Int64, 
::Int64)@arrayshow.jl:171
   - print_array@arrayshow.jl:358[inlined]
   - show(::IOContext{IOBuffer}, ::MIME{Symbol("text/plain")}, 
::Arrow.Primitive{Union{Missing, Dates.Time}, 
Vector{Arrow.Time{Arrow.Flatbuf.TimeUnits.MICROSECOND, 
Int64}}})@arrayshow.jl:399
   - show_richest(::IOContext{IOBuffer}, ::Any)@PlutoRunner.jl:1157
   - show_richest_withreturned@PlutoRunner.jl:1095[inlined]
   - format_output_default(::Any, ::Any)@PlutoRunner.jl:995
   - var"#format_output#60"(::IOContext{Base.DevNull}, 
::typeof(Main.PlutoRunner.format_output), ::Any)@PlutoRunner.jl:1012
   - formatted_result_of(::Base.UUID, ::Base.UUID, ::Bool, 
::Vector{String}, ::Nothing, ::Module)@PlutoRunner.jl:905
   - top-level scope@WorkspaceManager.jl:476
   ```
   
   Defining
   ```julia
   ArrowTypes.fromarrow(::Type{Dates.Time}, 
x::Arrow.Time{Arrow.Flatbuf.TimeUnits.MICROSECOND, Int64}) = 
convert(Dates.Time, x)
   ```
   seems to fix the error.
   
   





[GitHub] [arrow-julia] TanookiToad opened a new issue, #364: Values in PooledArray are incorrectly saved

2022-11-27 Thread GitBox


TanookiToad opened a new issue, #364:
URL: https://github.com/apache/arrow-julia/issues/364

   If a float-type element in `PooledVector{Real, UInt32, Vector{UInt32}}` is 
replaced by an integer, Arrow will incorrectly save this value.
   
   ```
   using Arrow, DataFrames, PooledArrays
   
   df = DataFrame(x = PooledArray(Vector{Real}([1.0])))
   df[1, 1] = 2
   Arrow.write("test.arrow", df)
   Arrow.Table("test.arrow") |> DataFrame
   ```
   
   The value `2` is incorrectly saved as `1.0e-323`, as the following output 
shows.
   
   ```
   1×1 DataFrame
Row │ x
│ Float64
   ─┼──
  1 │ 1.0e-323
   ```
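   The corrupted value is consistent with a bit-reinterpretation bug: reading the 64 bits of the integer `2` back as an IEEE-754 `Float64` yields exactly the subnormal shown above. This is an observation about the arithmetic, not a confirmed diagnosis of the writer:

```python
import struct

# Pack the 64-bit integer 2, then reinterpret those bytes as a double.
bits = struct.pack("<q", 2)
val = struct.unpack("<d", bits)[0]

assert val == 1.0e-323   # the subnormal 2 * 2.0**-1074
```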





[GitHub] [arrow-julia] ericphanson closed issue #232: Serializing `Dict{String,Real}` result in garbage values

2022-11-28 Thread GitBox


ericphanson closed issue #232: Serializing `Dict{String,Real}` result in 
garbage values 
URL: https://github.com/apache/arrow-julia/issues/232





[GitHub] [arrow-julia] quinnj closed issue #364: PooledArray are incorrectly saved

2022-11-28 Thread GitBox


quinnj closed issue #364: PooledArray are incorrectly saved
URL: https://github.com/apache/arrow-julia/issues/364





[GitHub] [arrow] kou opened a new issue, #14816: [Release] Make dev/release/06-java-upload.sh reusable from other projects

2022-12-01 Thread GitBox


kou opened a new issue, #14816:
URL: https://github.com/apache/arrow/issues/14816

   ### Describe the enhancement requested
   
   https://github.com/apache/arrow-adbc is one use case.
   
   See also: 
https://github.com/apache/arrow-adbc/pull/174#discussion_r1037547584
   
   ### Component(s)
   
   Packaging





[GitHub] [arrow] ycyang-26 closed issue #14798: [parquet go] write parquet data code sample

2022-12-01 Thread GitBox


ycyang-26 closed issue #14798: [parquet go] write parquet data code sample
URL: https://github.com/apache/arrow/issues/14798





[GitHub] [arrow] kou opened a new issue, #14819: [CI][RPM] CentOS 9 Stream nightly CI failed

2022-12-01 Thread GitBox


kou opened a new issue, #14819:
URL: https://github.com/apache/arrow/issues/14819

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   
https://github.com/ursacomputing/crossbow/actions/runs/3590866755/jobs/6044735050#step:6:1864
   
   ```text
   CMake Error at /usr/share/cmake/Modules/CMakeTestCCompiler.cmake:66 
(message):
 The C compiler
   
   "/opt/rh/gcc-toolset-12/root/usr/bin/gcc"
   
 is not able to compile a simple test program.
   
 It fails with the following output:
   
   Change Dir: 
/root/rpmbuild/BUILD/apache-arrow-11.0.0.dev189/cpp/redhat-linux-build/CMakeFiles/CMakeTmp
   
   Run Build Command(s):/usr/bin/gmake -f Makefile cmTC_427a1/fast && 
/usr/bin/gmake  -f CMakeFiles/cmTC_427a1.dir/build.make 
CMakeFiles/cmTC_427a1.dir/build
   gmake[1]: Entering directory 
'/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev189/cpp/redhat-linux-build/CMakeFiles/CMakeTmp'
   Building C object CMakeFiles/cmTC_427a1.dir/testCCompiler.c.o
   -- Check for working C compiler: /opt/rh/gcc-toolset-12/root/usr/bin/gcc - 
broken
   /opt/rh/gcc-toolset-12/root/usr/bin/gcc   -O2 -flto=auto 
-ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall 
-Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS 
-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong 
-specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -m64 -march=x86-64-v2 
-mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection 
-fcf-protection  -o CMakeFiles/cmTC_427a1.dir/testCCompiler.c.o -c 
/root/rpmbuild/BUILD/apache-arrow-11.0.0.dev189/cpp/redhat-linux-build/CMakeFiles/CMakeTmp/testCCompiler.c
   cc1: fatal error: inaccessible plugin file 
/opt/rh/gcc-toolset-12/root/usr/lib/gcc/x86_64-redhat-linux/12/plugin/gcc-annobin.so
 expanded from short plugin name gcc-annobin: No such file or directory
   ```
   
   ### Component(s)
   
   Continuous Integration, Packaging





[GitHub] [arrow] kou closed issue #14784: [Dev] Allow users to assign issues with a comment on the issue

2022-12-01 Thread GitBox


kou closed issue #14784: [Dev] Allow users to assign issues with a comment on 
the issue
URL: https://github.com/apache/arrow/issues/14784





[GitHub] [arrow] AlenkaF opened a new issue, #14822: [Docs] Update the documentation to include the new GitHub issue workflow

2022-12-02 Thread GitBox


AlenkaF opened a new issue, #14822:
URL: https://github.com/apache/arrow/issues/14822

   ### Describe the enhancement requested
   
   As we are moving towards GitHub issues for Arrow issue reports, the 
documentation needs to be updated so that the contributors have a place to look 
if they need the information about the workflow we are/will be having.
   
   The issue will be divided into multiple subtasks (parts of the 
documentation) so that the review process will be easier:
   
   - [ ] https://github.com/apache/arrow README
   - [ ] https://github.com/apache/arrow/tree/master/r README
   - [ ] https://arrow.apache.org/community/ 
   - [ ] 
https://arrow.apache.org/docs/dev/developers/guide/step_by_step/finding_issues.html
   - [ ] https://arrow.apache.org/docs/dev/developers/guide/communication.html
   - [ ] 
https://arrow.apache.org/docs/dev/developers/bug_reports.html#bug-reports 
   - [ ] Version specific project docs warnings https://arrow.apache.org/docs/
 Tracked here: https://issues.apache.org/jira/browse/ARROW-18363
 https://github.com/apache/arrow-site/pull/275
 https://github.com/apache/arrow/pull/14687
   
   ### Component(s)
   
   Documentation





[GitHub] [arrow] assignUser opened a new issue, #14824: [CI] r-binary-packages should only upload artifacts if all tests succeed

2022-12-02 Thread GitBox


assignUser opened a new issue, #14824:
URL: https://github.com/apache/arrow/issues/14824

   ### Describe the enhancement requested
   
   Currently the upload step is missing a dependency on the centos binary test.
   
   ### Component(s)
   
   Continuous Integration





[GitHub] [arrow] lwhite1 opened a new issue, #14825: [Java] [Doc] Improve documentation for streaming file handling in VectorSchemaRoot

2022-12-02 Thread GitBox


lwhite1 opened a new issue, #14825:
URL: https://github.com/apache/arrow/issues/14825

   ### Describe the enhancement requested
   
   See comments in issue #14812
   
   ### Component(s)
   
   Documentation, Java





[GitHub] [arrow] LucyMcGowan opened a new issue, #14826: write_dataset is crashing on my machine

2022-12-02 Thread GitBox


LucyMcGowan opened a new issue, #14826:
URL: https://github.com/apache/arrow/issues/14826

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   When I run the following example from the documentation, it seems to crash. 
   
   ```
   library(arrow)
   one_level_tree <- tempfile()
   write_dataset(mtcars, one_level_tree, partitioning = "cyl")
   ```
   
   This reprex appears to crash R.
   See standard output and standard error for more details.
   
    Standard output and error
   
   ``` sh
   ✖ Install the styler package in order to use `style = TRUE`.
   
*** caught illegal operation ***
   address 0x11746ec1f, cause 'illegal opcode'
   
   Traceback:
1: ExecPlan_Write(self, node, 
prepare_key_value_metadata(node$final_metadata()), ...)
2: plan$Write(final_node, options, path_and_fs$fs, path_and_fs$path, 
partitioning, basename_template, existing_data_behavior, max_partitions, 
max_open_files, max_rows_per_file, min_rows_per_group, max_rows_per_group)
3: write_dataset(mtcars, one_level_tree, partitioning = "cyl")
4: eval(expr, envir, enclos)
5: eval(expr, envir, enclos)
6: eval_with_user_handlers(expr, envir, enclos, user_handlers)
7: withVisible(eval_with_user_handlers(expr, envir, enclos, user_handlers))
8: withCallingHandlers(withVisible(eval_with_user_handlers(expr, envir, 
enclos, user_handlers)), warning = wHandler, error = eHandler, message = 
mHandler)
9: doTryCatch(return(expr), name, parentenv, handler)
   10: tryCatchOne(expr, names, parentenv, handlers[[1L]])
   11: tryCatchList(expr, classes, parentenv, handlers)
   12: tryCatch(expr, error = function(e) {call <- conditionCall(e)if 
(!is.null(call)) {if (identical(call[[1L]], quote(doTryCatch))) 
call <- sys.call(-4L)dcall <- deparse(call, nlines = 1L)
prefix <- paste("Error in", dcall, ": ")LONG <- 75Lsm <- 
strsplit(conditionMessage(e), "\n")[[1L]]w <- 14L + nchar(dcall, type = 
"w") + nchar(sm[1L], type = "w")if (is.na(w)) w <- 14L + 
nchar(dcall, type = "b") + nchar(sm[1L], type = "b")if 
(w > LONG) prefix <- paste0(prefix, "\n  ")}else prefix <- 
"Error : "msg <- paste0(prefix, conditionMessage(e), "\n")
.Internal(seterrmessage(msg[1L]))if (!silent && 
isTRUE(getOption("show.error.messages"))) {cat(msg, file = outFile) 
   .Internal(printDeferredWarnings())}invisible(structure(msg, class = 
"try-error", condition = e))})
   13: try(f, silent = TRUE)
   14: handle(ev <- 
withCallingHandlers(withVisible(eval_with_user_handlers(expr, envir, 
enclos, user_handlers)), warning = wHandler, error = eHandler, message = 
mHandler))
   15: timing_fn(handle(ev <- 
withCallingHandlers(withVisible(eval_with_user_handlers(expr, envir, 
enclos, user_handlers)), warning = wHandler, error = eHandler, message = 
mHandler)))
   16: evaluate_call(expr, parsed$src[[i]], envir = envir, enclos = enclos, 
debug = debug, last = i == length(out), use_try = stop_on_error != 2L, 
keep_warning = keep_warning, keep_message = keep_message, output_handler = 
output_handler, include_timing = include_timing)
   17: evaluate::evaluate(...)
   18: evaluate(code, envir = env, new_device = FALSE, keep_warning = 
!isFALSE(options$warning), keep_message = !isFALSE(options$message), 
stop_on_error = if (is.numeric(options$error)) options$error else {if 
(options$error && options$include) 0Lelse 2L}, 
output_handler = knit_handlers(options$render, options))
   19: in_dir(input_dir(), evaluate(code, envir = env, new_device = FALSE, 
keep_warning = !isFALSE(options$warning), keep_message = 
!isFALSE(options$message), stop_on_error = if (is.numeric(options$error)) 
options$error else {if (options$error && options$include) 
0Lelse 2L}, output_handler = knit_handlers(options$render, 
options)))
   20: eng_r(options)
   21: block_exec(params)
   22: call_block(x)
   23: process_group.block(group)
   24: process_group(group)
   25: withCallingHandlers(if (tangle) process_tangle(group) else 
process_group(group), error = function(e) {setwd(wd)
cat(res, sep = "\n", file = output %n% "")message("Quitting from lines 
", paste(current_lines(i), collapse = "-"), " (", 
knit_concord$get("infile"), ") ")})
   26: process_file(text, output)
   27: knitr::knit(knit_input, knit_output, envir = envir, quiet = quiet)
   28: rmarkdown::render(input, quiet = TRUE, envir = globalenv(), encoding = 
"UTF-8")
   29: (function (input) {rmarkdown::render(input, quiet = TRUE, envir = 
globalenv(), encoding = "UTF-8")})(input = 
base::quote("front-ray_reprex.R"))
   30: (function (what, args, quote = FALSE, envir = parent.frame()) {if 
(!is.list(arg

[GitHub] [arrow] paleolimbot closed issue #14813: [R] Install from local source on MacOS fails build after Abseil build step

2022-12-02 Thread GitBox


paleolimbot closed issue #14813: [R] Install from local source on MacOS fails 
build after Abseil build step
URL: https://github.com/apache/arrow/issues/14813


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [arrow] kou closed issue #14801: [C++] CMake config files for Dataset are not copied to the install directory

2022-12-02 Thread GitBox


kou closed issue #14801: [C++] CMake config files for Dataset are not copied to 
the install directory 
URL: https://github.com/apache/arrow/issues/14801





[GitHub] [arrow] kou opened a new issue, #14828: [CI][Conda] Nightly CI jobs aren't maintained

2022-12-02 Thread GitBox


kou opened a new issue, #14828:
URL: https://github.com/apache/arrow/issues/14828

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   Nightly CI jobs for Conda were fixed by @h-vetinari and @xhochy in GH-14102, 
but most of them have been failing again for the past month.
   
   Can we maintain them? If we can't, can we remove them? Are they useful for 
maintaining https://github.com/conda-forge/arrow-cpp-feedstock?
   
   http://crossbow.voltrondata.com/
   
   Task Name | Since Last Successful Build | Last Successful Commit | Last Successful Build | First Failure | 9 Days Ago | 8 Days Ago | 7 Days Ago | 6 Days Ago | 5 Days Ago | 4 Days Ago | 3 Days Ago | 2 Days Ago | Most Recent Failure
   -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
   conda-win-vs2019-py37-r40 | 78 days | 5e6da78 | pending | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-win-vs2019-py38 | 65 days | 60c9383 | pending | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-linux-gcc-py310-arm64 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-linux-gcc-py310-cpu | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-linux-gcc-py37-arm64 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-linux-gcc-py37-cpu-r40 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-linux-gcc-py37-cpu-r41 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-linux-gcc-py37-cuda | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-linux-gcc-py37-ppc64le | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-linux-gcc-py38-arm64 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-linux-gcc-py38-cpu | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-linux-gcc-py38-cuda | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-linux-gcc-py38-ppc64le | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-linux-gcc-py39-arm64 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-linux-gcc-py39-cpu | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-linux-gcc-py39-cuda | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-linux-gcc-py39-ppc64le | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-osx-arm64-clang-py38 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-osx-arm64-clang-py39 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-osx-clang-py37-r40 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-osx-clang-py37-r41 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-osx-clang-py38 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-osx-clang-py39 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-linux-gcc-py310-cuda | 15 days | 501b799 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-linux-gcc-py310-ppc64le | 15 days | 501b799 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-osx-arm64-clang-py310 | 15 days | 501b799 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   conda-osx-clang-py310 | 15 days | 501b799 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   
   
   
   ### Component(s)
   
   Continuous Integration, Packaging





[GitHub] [arrow] kou opened a new issue, #14829: [CI][R][Homebrew] Nightly CI jobs aren't maintained

2022-12-02 Thread GitBox


kou opened a new issue, #14829:
URL: https://github.com/apache/arrow/issues/14829

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   `homebrew-r-autobrew` and `homebrew-r-brew` nightly CI jobs have been failing 
for the past 3 months.
   
   Can we maintain them? If we can't, can we remove them? Are they useful for 
maintaining anything?
   
   http://crossbow.voltrondata.com/
   
   Task Name | Since Last Successful Build | Last Successful Commit | Last Successful Build | First Failure | 9 Days Ago | 8 Days Ago | 7 Days Ago | 6 Days Ago | 5 Days Ago | 4 Days Ago | 3 Days Ago | 2 Days Ago | Most Recent Failure
   -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
   homebrew-r-autobrew | 94 days | cf27001 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   homebrew-r-brew | 83 days | a63e60b | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
   
   
   
   ### Component(s)
   
   Continuous Integration, R





[GitHub] [arrow] kou closed issue #14819: [CI][RPM] CentOS 9 Stream nightly CI is failed

2022-12-02 Thread GitBox


kou closed issue #14819: [CI][RPM] CentOS 9 Stream nightly CI is failed
URL: https://github.com/apache/arrow/issues/14819





[GitHub] [arrow] kou closed issue #14816: [Release] Make dev/release/06-java-upload.sh reusable from other project

2022-12-02 Thread GitBox


kou closed issue #14816: [Release] Make dev/release/06-java-upload.sh reusable 
from other project
URL: https://github.com/apache/arrow/issues/14816





[GitHub] [arrow] phpsxg opened a new issue, #14834: write_dataset how to add and update data

2022-12-04 Thread GitBox


phpsxg opened a new issue, #14834:
URL: https://github.com/apache/arrow/issues/14834

   ### Describe the usage question you have. Please include as many useful 
details as  possible.
   
   
   First, I save a Parquet dataset containing 5 rows of data:
   ```python
   dataset_name = 'test_update'
   df = pd.DataFrame({'one': [-1, 3, 2.5, 2.5, 2.5],
                      'two': ['foo', 'bar', 'baz', 'foo', 'foo'],
                      'three': [True, False, True, False, False]})
   table = pa.Table.from_pandas(df)
   ds.write_dataset(table, dataset_name,
                    existing_data_behavior='overwrite_or_ignore',
                    format="parquet")
   ```
   
   Then I want to add two new rows, for a total of 7 rows. The new data is as 
follows:
   ```python
   df = pd.DataFrame({'one': [1, 2],
                      'two': ['foo-insert1', 'foo-insert2'],
                      'three': [True, False]})

   table = pa.Table.from_pandas(df)
   ds.write_dataset(table, dataset_name,
                    # existing_data_behavior='delete_matching',
                    existing_data_behavior='overwrite_or_ignore',
                    format="parquet")
   ```
   1. **But this overwrites the original data, leaving only the two new rows. How 
can I append new data on top of the original?**
   
   2. **I have another question: if I want to update data that matches certain 
conditions, how do I do that? For example:**
   
   > Update three to False for the row where one=-1 and two='foo'
   
   
   - python=3.10
   - pyarrow=10.0.0
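   For question 1, one way to get append-like behavior is to give each write a 
unique `basename_template`, so later writes never reuse the default 
`part-{i}.parquet` file names that `overwrite_or_ignore` would otherwise clobber. 
A minimal sketch (assuming a recent pyarrow; `append_to_dataset` is a hypothetical 
helper, not a pyarrow API):

   ```python
   import uuid

   import pyarrow as pa
   import pyarrow.dataset as ds


   def append_to_dataset(table: pa.Table, base_dir: str) -> None:
       """Write `table` into `base_dir` without touching existing part files."""
       # A fresh file-name template per call means new part files never collide
       # with the default "part-{i}.parquet" names from earlier writes, so
       # 'overwrite_or_ignore' leaves the old data in place.
       template = "part-{{i}}-{}.parquet".format(uuid.uuid4().hex)
       ds.write_dataset(table, base_dir, format="parquet",
                        basename_template=template,
                        existing_data_behavior="overwrite_or_ignore")
   ```

   After two such calls, `ds.dataset(base_dir).to_table()` contains the rows from 
both writes. Question 2 is a different matter: Parquet files are immutable, so an 
in-place update generally means reading, modifying, and rewriting the affected data.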
   
   
   
   ### Component(s)
   
   Parquet, Python





[GitHub] [arrow] Qwertzi01 opened a new issue, #14835: [Python] Import/usage of pyarrow results in 'Invalid machine command'

2022-12-04 Thread GitBox


Qwertzi01 opened a new issue, #14835:
URL: https://github.com/apache/arrow/issues/14835

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   Hello and good morning, 
   
   I get the error message 'Invalid machine command' if I try to use/import 
pyarrow v10 (also older versions).
   
   * Operating system: Debian 11 (Bullseye)
   * Kernel Version: `Linux 5.10.0-15-amd64 #1 SMP Debian 5.10.120-1 
(2022-06-09) x86_64 GNU/Linux`
   * Python Version: 3.9.2
   
   Console output after installing pyarrow with latest pip:
   ```
   Python 3.9.2 (default, Feb 28 2021, 17:03:44)
   [GCC 10.2.1 20210110] on linux
   Type "help", "copyright", "credits" or "license" for more information.
   >>> import pyarrow
   Invalid machine command
   ```
   
   Kernel log if I try to use Freqtrade which need pyarrow v10:
   
   `Dec 3 17:01:08 server kernel: [617405.716600] traps: freqtrade[584823] trap 
invalid opcode ip:7f848859ae22 sp:7ffee55e2750 error:0 in 
libarrow.so.1000[7f84884a2000+189e000]`
   
   If it helps, here the original issue regarding Freqtrade with Pyarrow: 
https://github.com/freqtrade/freqtrade/issues/7839
   
   ### Component(s)
   
   Python





[GitHub] [arrow] phpsxg opened a new issue, #14837: ds.write_dataset how to implement new data?

2022-12-05 Thread GitBox


phpsxg opened a new issue, #14837:
URL: https://github.com/apache/arrow/issues/14837

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   
   When I use pq.write_to_dataset(existing_data_behavior='overwrite_or_ignore'), 
it is able to append data.
   Why does ds.write_dataset(existing_data_behavior='overwrite_or_ignore') 
overwrite the existing data instead?
   
   
   
   ```python
   df = pd.DataFrame({'one': [-1, 3, 2.5, 2.5, 2.5],
                      'two': ['foo', 'bar', 'baz', 'foo', 'foo'],
                      'three': [True, False, True, False, True],
                      'four': [datetime.date(2021, 1, 3), datetime.date(2021, 3, 1),
                               datetime.date(2021, 1, 1), datetime.date(2021, 3, 11),
                               datetime.date(2021, 4, 1)]},
                     index=list('abcde'))


   table = pa.Table.from_pandas(df, preserve_index=True)
   ```
   
   **pq.write_to_dataset**
   ```python
   pq.write_to_dataset(table, root_path=root_path,
                       # existing_data_behavior='delete_matching',
                       existing_data_behavior='overwrite_or_ignore',
                       use_legacy_dataset=False)
   ```
   
   **ds.write_dataset**
   ```python
   ds.write_dataset(table, root_path,
                    # existing_data_behavior='delete_matching',
                    existing_data_behavior='overwrite_or_ignore',
                    format="parquet")
   ```
   
   ### Component(s)
   
   Parquet, Python





[GitHub] [arrow] sayantikabanik opened a new issue, #14840: Color coding for warnings [Building documentation]

2022-12-05 Thread GitBox


sayantikabanik opened a new issue, #14840:
URL: https://github.com/apache/arrow/issues/14840

   ### Describe the enhancement requested
   
   ### Short description
   
   While building the documentation, I noticed that warnings appearing on the 
console are color-coded `red`, which I assumed meant errors (I was a bit alarmed).
   
   ### Screenshot for reference 
   
   https://user-images.githubusercontent.com/17350312/205615808-a5e5b757-f757-439b-9ef7-0c73693413d0.png
   
   ### Expectation
   
   Usually warnings are printed on the console in `yellow`/`orange`, 
and errors are highlighted or color-coded `red`.
   
   
   ### Component(s)
   
   Documentation





[GitHub] [arrow] assignUser closed issue #14837: ds.write_dataset how to implement new data?

2022-12-05 Thread GitBox


assignUser closed issue #14837: ds.write_dataset how to implement new data?
URL: https://github.com/apache/arrow/issues/14837





[GitHub] [arrow] benibus opened a new issue, #14842: [C++] Propagate some errors in JSON chunker

2022-12-05 Thread GitBox


benibus opened a new issue, #14842:
URL: https://github.com/apache/arrow/issues/14842

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   The current JSON `Chunker` defers all error reporting to a later parsing 
stage when `ParseOptions::newlines_in_values = true`. However, this poses 
issues when sequentially chunking buffers. Consider this:
   ```cpp
   std::shared_ptr<Buffer> whole, rest;
   // Trailing right-bracket
   chunker->Process(Buffer::FromString("{\"a\":0}}"), &whole, &rest);
   ```
   Here, `whole` will be `{"a":0}` but `rest` will be `}`, which doesn't start 
a valid JSON block. As such, for the next buffer, you can't then call 
`ProcessWithPartial` with `rest` as its `partial` argument without crashing 
(via DCHECK, if enabled). This effectively prevents us from handling the error 
at all.
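   To make the failure mode concrete, here is a naive Python sketch of a 
brace-matching chunker (an illustration of the behavior described above, not 
Arrow's actual implementation; string escapes are ignored for brevity): 
everything up to the last complete top-level value goes into `whole`, and the 
trailing bytes land in `rest`, even when they can never start a valid JSON value.

   ```python
   def process(buffer: str) -> tuple[str, str]:
       """Split `buffer` into complete top-level JSON objects and a remainder."""
       depth = 0
       last_complete = 0  # index one past the last complete top-level object
       for i, ch in enumerate(buffer):
           if ch == "{":
               depth += 1
           elif ch == "}":
               depth -= 1
               if depth == 0:
                   last_complete = i + 1
       return buffer[:last_complete], buffer[last_complete:]


   whole, rest = process('{"a":0}}')
   # whole is '{"a":0}' but rest is '}', which cannot begin a JSON value,
   # so feeding `rest` into the next round of chunking has no valid outcome.
   ```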
   
   ### Component(s)
   
   C++





[GitHub] [arrow] markjschreiber opened a new issue, #14844: VectorValueComparator should skip the null test for NonNullable FieldTypes types

2022-12-05 Thread GitBox


markjschreiber opened a new issue, #14844:
URL: https://github.com/apache/arrow/issues/14844

   ### Describe the enhancement requested
   
   I recently ran a profiler on a simple implementation of a sort and noticed 
that the VectorValueComparator spends quite a lot of time checking the values 
to be compared for `null`. In this case I had already declared the 
`FieldType` to be `notNullable`. Unless I am misunderstanding something, the 
values cannot be `null`, so the comparator is making unnecessary and expensive 
comparisons.
   
   As a test, I made a comparator that skips the null tests; it runs ~12% faster 
as a result. Based on this, I think the `VectorValueComparator` should skip the 
`null` test for non-nullable `FieldType`s.
   
   I'd be happy to contribute a change to do this if you think it would be 
reasonable.
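   The proposed optimization can be illustrated with a small language-neutral 
sketch in Python (this is not Arrow Java code; the names are made up for 
illustration): the null test is hoisted out of the hot loop by choosing a 
comparator variant once, based on the field's declared nullability.

   ```python
   def make_comparator(nullable: bool):
       """Return a compare(a, b) -> -1/0/1 function, chosen once up front."""

       def compare_nullable(a, b):
           # Per-call null test: nulls sort before non-null values.
           if a is None or b is None:
               return (b is None) - (a is None)
           return (a > b) - (a < b)

       def compare_non_null(a, b):
           # For a field declared non-nullable, the null branch is dropped
           # entirely instead of being evaluated on every comparison.
           return (a > b) - (a < b)

       return compare_nullable if nullable else compare_non_null
   ```

   For non-null data both variants order values identically (e.g. via 
`functools.cmp_to_key`); only the per-element branch differs, which is where the 
profiled overhead comes from.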
   
   
   ### Component(s)
   
   Java





[GitHub] [arrow] pitrou closed issue #14755: Expose QuotingStyle to Python

2022-12-05 Thread GitBox


pitrou closed issue #14755: Expose QuotingStyle to Python
URL: https://github.com/apache/arrow/issues/14755





[GitHub] [arrow] pitrou closed issue #14842: [C++] Propagate some errors in JSON chunker

2022-12-05 Thread GitBox


pitrou closed issue #14842: [C++] Propagate some errors in JSON chunker
URL: https://github.com/apache/arrow/issues/14842





[GitHub] [arrow] kou closed issue #14824: [CI] r-binary-packages should only upload artifacts if all tests succeed

2022-12-05 Thread GitBox


kou closed issue #14824: [CI] r-binary-packages should only upload artifacts if 
all tests succeed
URL: https://github.com/apache/arrow/issues/14824





[GitHub] [arrow] lidavidm opened a new issue, #14846: [Dev] Update download_rc_binaries to be able to fetch from GitHub Releases

2022-12-05 Thread GitBox


lidavidm opened a new issue, #14846:
URL: https://github.com/apache/arrow/issues/14846

   ### Describe the enhancement requested
   
   ADBC is using GitHub Releases instead of Artifactory; it would be nice to 
share a little bit of this infrastructure instead of having to replicate it 
all. (And eventually Arrow may be able to publish binaries on GitHub as well; 
we already do with Crossbow.)
   
   See https://github.com/apache/arrow-adbc/pull/215#discussion_r1040123283
   
   ### Component(s)
   
   Developer Tools





[GitHub] [arrow] jorisvandenbossche closed issue #14840: [Docs] Color coding for warnings [Building documentation]

2022-12-06 Thread GitBox


jorisvandenbossche closed issue #14840: [Docs] Color coding for warnings 
[Building documentation]
URL: https://github.com/apache/arrow/issues/14840





[GitHub] [arrow] assignUser opened a new issue, #14849: [CI] R install-local builds sometimes fail because sccache times out

2022-12-06 Thread GitBox


assignUser opened a new issue, #14849:
URL: https://github.com/apache/arrow/issues/14849

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   The sccache server times out while starting, which causes the build to fail, 
e.g. 
https://github.com/ursacomputing/crossbow/actions/runs/3625242046/jobs/6113064050#step:7:1134
   
   ### Component(s)
   
   Continuous Integration





[GitHub] [arrow] lidavidm closed issue #14835: [Python] Import/usage of pyarrow results in 'Invalid machine command'

2022-12-06 Thread GitBox


lidavidm closed issue #14835: [Python] Import/usage of pyarrow results in 
'Invalid machine command'
URL: https://github.com/apache/arrow/issues/14835





[GitHub] [arrow] youngfn opened a new issue, #14853: [C++][Streaming execution] can't write data after hash_distinct

2022-12-06 Thread GitBox


youngfn opened a new issue, #14853:
URL: https://github.com/apache/arrow/issues/14853

   ### Describe the usage question you have. Please include as many useful 
details as  possible.
   
   
   Hi, when I test the streaming execution engine, I always get an error 
like "Unsupported Type:list" in the write node. This happens when I 
use hash_distinct in the aggregate node (**but it succeeds with 
hash_count_distinct or hash_count**). 
   Is anything wrong with my demo? Is this a bug? I'm not sure. Thanks for any 
hint!
   ```cpp
   // demo
   cp::Declaration::Sequence({
       {"scan", scan_node_options},
       {"filter", cp::FilterNodeOptions{filter_opt}},
       {"project", cp::ProjectNodeOptions{
           {cp::field_ref("id"), cp::field_ref("class_id"), cp::field_ref("gender"),
            cp::field_ref("age"), cp::field_ref("term"), expr},
           {"id", "class_id", "name", "gender", "age", "score", "term"}}},
       {"aggregate", cp::AggregateNodeOptions{
           /*aggregates=*/{{"hash_distinct", nullptr, "id", "distinct(id)"}}, {"id"}}},
       {"write", write_node_options}
   }).AddToPlan(plan.get());

   if (!plan->Validate().ok()) {
     std::cout << "plan is not valid" << std::endl;
     return;
   }

   std::cout << "Execution Plan Created : " << plan->ToString() << std::endl;
   // start the ExecPlan
   plan->StartProducing();
   auto future = plan->finished();
   future.status();
   future.Wait();
   ```
   
   
   Error print:
   
   ```
   arrow error:Invalid: Unsupported Type:list
   arrow error:Invalid: Unsupported Type:list
   /tmp/tmp.GwaQRyi1BD/src/arrow/csv/writer.cc:454  
MakePopulator(*schema->field(col), end_chars, options.delimiter, null_string, 
options.quoting_style, options.io_context.pool())
   arrow error:Invalid: Unsupported Type:list
   /tmp/tmp.GwaQRyi1BD/src/arrow/csv/writer.cc:454  
MakePopulator(*schema->field(col), end_chars, options.delimiter, null_string, 
options.quoting_style, options.io_context.pool())
   /tmp/tmp.GwaQRyi1BD/src/arrow/dataset/file_csv.cc:335  
csv::MakeCSVWriter(destination, schema, *csv_options->write_options)
   ```
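   The write node's error can be reproduced in isolation (a sketch assuming 
pyarrow is installed): hash_distinct emits one list of distinct values per group, 
and the CSV writer does not support nested types such as list columns.

   ```python
   import io

   import pyarrow as pa
   from pyarrow import csv

   # A table shaped like a hash_distinct result: one list<int64> column.
   table = pa.table(
       {"distinct(id)": pa.array([[1, 2], [3]], type=pa.list_(pa.int64()))})

   try:
       # The CSV writer has no representation for nested values, so this raises.
       csv.write_csv(table, io.BytesIO())
   except pa.ArrowInvalid as exc:
       print("CSV writer rejected nested type:", exc)
   ```

   Writing the plan's output to Parquet (which supports nested types), or 
projecting the list column away before the write node, are the usual workarounds.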
   
   
   ### Component(s)
   
   C++





[GitHub] [arrow] lidavidm closed issue #14846: [Dev] Update download_rc_binaries to be able to fetch from GitHub Releases

2022-12-06 Thread GitBox


lidavidm closed issue #14846: [Dev] Update download_rc_binaries to be able to 
fetch from GitHub Releases
URL: https://github.com/apache/arrow/issues/14846





[GitHub] [arrow] AlenkaF opened a new issue, #14854: [Docs] Make changes to arrow/ and arrow/r/README.md

2022-12-06 Thread GitBox


AlenkaF opened a new issue, #14854:
URL: https://github.com/apache/arrow/issues/14854

   ### Describe the enhancement requested
   
   Make changes to `arrow/` and `arrow/r/README.md` to reflect the change in the 
issue tracking workflow.
   
   ### Component(s)
   
   Documentation





[GitHub] [arrow] lidavidm opened a new issue, #14855: [C++] Zero-case union can't be imported via C Data Interface

2022-12-06 Thread GitBox


lidavidm opened a new issue, #14855:
URL: https://github.com/apache/arrow/issues/14855

   The zero-case union is apparently not supported by Arrow C++'s C 
Data interface. I get:
   
   ```
   'arrow_type' failed with Invalid: Invalid or unsupported format string: 
'+us:'
   ```
   
   Reproducer for Python:
   
   ```python
   import pyarrow as pa
   from pyarrow.cffi import ffi
   empty_union = pa.sparse_union([])
   ptr = ffi.new("struct ArrowSchema*")
   empty_union._export_to_c(int(ffi.cast("uintptr_t", ptr)))
   pa.DataType._import_from_c(int(ffi.cast("uintptr_t", ptr)))
   # Traceback (most recent call last):
   #   File "<stdin>", line 1, in <module>
   #   File "pyarrow/types.pxi", line 248, in pyarrow.lib.DataType._import_from_c
   #   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
   #   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
   # pyarrow.lib.ArrowInvalid: Invalid or unsupported format string: '+us:'
   ```
   
   _Originally posted by @paleolimbot in 
https://github.com/apache/arrow-nanoarrow/pull/81#discussion_r1041055778_
 





[GitHub] [arrow] assignUser opened a new issue, #14856: [CI] Azure builds fail with docker permission error

2022-12-06 Thread GitBox


assignUser opened a new issue, #14856:
URL: https://github.com/apache/arrow/issues/14856

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   Several of our nightlies fail due to an issue with the docker install task 
used on azure: 
https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=40924&view=logs&j=50a69d0a-7972-5459-cdae-135ee6ebe312&t=13df7b5c-76db-5c26-6592-75581a9ed64a&l=3093
   
   
   
   ### Component(s)
   
   Continuous Integration





[GitHub] [arrow] workingnbar opened a new issue, #14860: Is there a way to call a custom compute function on a table.group_by aggregation?

2022-12-06 Thread GitBox


workingnbar opened a new issue, #14860:
URL: https://github.com/apache/arrow/issues/14860

   ### Describe the usage question you have. Please include as many useful 
details as  possible.
   
   
   Is there a way to call a custom compute function on a table.group_by 
aggregation? If so, what should the custom function return?
   
   I do not see an example in the documentation.
   
   ### Component(s)
   
   Python





[GitHub] [arrow] toddfarmer opened a new issue, #14861: MIGRATION: Update project documentation to point to GitHub issues

2022-12-06 Thread GitBox


toddfarmer opened a new issue, #14861:
URL: https://github.com/apache/arrow/issues/14861

   ### Describe the enhancement requested
   
   The Apache Arrow project documentation references Jira in number of places. 
These references should be updated to point to GitHub issues. Additionally, a 
best practices document should be started to establish emerging GitHub 
processes and policy.
   
   ### Component(s)
   
   Documentation





[GitHub] [arrow] toddfarmer opened a new issue, #14862: Update Apache Arrow website references to Jira

2022-12-06 Thread GitBox


toddfarmer opened a new issue, #14862:
URL: https://github.com/apache/arrow/issues/14862

   ### Describe the enhancement requested
   
   The Apache Arrow website has references to Jira which should be updated to 
point to GitHub.
   
   ### Component(s)
   
   Website





[GitHub] [arrow] pitrou opened a new issue, #14863: [C++] Add `Append(std::optional...)` convenience methods to builders

2022-12-06 Thread GitBox


pitrou opened a new issue, #14863:
URL: https://github.com/apache/arrow/issues/14863

   ### Describe the enhancement requested
   
   When you have a `std::optional` of the right value type, it would be 
convenient to append it directly to a concrete `ArrayBuilder` subclass instead 
of having to query whether it has a value:
   ```c++
   template <typename T>
   Status Append(const std::optional<T>& value) {
     return (value) ? Append(*value) : AppendNull();
   }

   template <typename T>
   void UnsafeAppend(const std::optional<T>& value) {
     if (value) {
       UnsafeAppend(*value);
     } else {
       UnsafeAppendNull();
     }
   }
   ```
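A Python analogue of the convenience being asked for, using a hypothetical toy builder (real Arrow builders track a validity bitmap similarly; names here are illustrative only):

```python
# Hypothetical toy builder: appending an optional value either appends the
# value or a null, so callers no longer have to branch themselves.
class Int64Builder:
    def __init__(self):
        self.values = []     # value slots (0 used as placeholder for null)
        self.validity = []   # True = valid, False = null

    def append(self, value):
        self.values.append(value)
        self.validity.append(True)

    def append_null(self):
        self.values.append(0)
        self.validity.append(False)

    def append_optional(self, value):
        # The analogue of Append(const std::optional<T>&).
        if value is None:
            self.append_null()
        else:
            self.append(value)

b = Int64Builder()
for v in [1, None, 3]:
    b.append_optional(v)
print(b.values, b.validity)  # [1, 0, 3] [True, False, True]
```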
   
   ### Component(s)
   
   C++





[GitHub] [arrow] lidavidm opened a new issue, #14864: [C++] Refactor string matching kernel to be usable outside of compute

2022-12-06 Thread GitBox


lidavidm opened a new issue, #14864:
URL: https://github.com/apache/arrow/issues/14864

   ### Describe the enhancement requested
   
   From https://github.com/apache/arrow/pull/14082/files#r1041353323
   
   It can be useful to use some of the string matching kernel functionality 
outside of a kernel context, e.g. to evaluate filters in Flight SQL/ADBC. While 
we can call the kernel on a single scalar, that has overhead (and requires 
ARROW_COMPUTE); we can instead refactor the string matching utilities into 
`arrow/util`. (Though this will still require ARROW_WITH_RE2.)
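A stdlib-only sketch of the kind of helper such a refactor would expose, e.g. for evaluating a SQL LIKE filter outside a kernel context (Arrow C++'s real kernels use RE2; the names here are illustrative, not the proposed API):

```python
import re

# Hypothetical sketch: translate a SQL LIKE pattern into a regex and match
# with it, with no dependency on the compute layer.
def like_to_regex(pattern: str) -> str:
    out = []
    for ch in pattern:
        if ch == "%":
            out.append(".*")   # LIKE '%' matches any run of characters
        elif ch == "_":
            out.append(".")    # LIKE '_' matches exactly one character
        else:
            out.append(re.escape(ch))
    return "".join(out)

def sql_like(value: str, pattern: str) -> bool:
    return re.fullmatch(like_to_regex(pattern), value) is not None

print(sql_like("flight_sql", "flight%"))  # True
print(sql_like("adbc", "a_bc"))           # True
print(sql_like("arrow", "%x%"))           # False
```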
   
   ### Component(s)
   
   C++





[GitHub] [arrow] kou closed issue #14856: [CI] Azure builds fail with docker permission error

2022-12-06 Thread GitBox


kou closed issue #14856: [CI] Azure builds fail with docker permission error
URL: https://github.com/apache/arrow/issues/14856





[GitHub] [arrow-adbc] dhirschfeld opened a new issue, #224: [Feature Request] Use AnyIO for Python Async

2022-12-06 Thread GitBox


dhirschfeld opened a new issue, #224:
URL: https://github.com/apache/arrow-adbc/issues/224

   > *In particular, it can interleave I/O and conversion*
   
   If you're implementing an async interface, as a 
[`trio`](https://trio.readthedocs.io/en/stable/) user, it would be great if you 
could use [`anyio`](https://github.com/agronholm/anyio) rather than native 
`asyncio` features. This will enable the code to be used with any async library.
   
   Perhaps the most prominent Python library to support AnyIO is 
[`fastapi`](https://fastapi.tiangolo.com/async/#write-your-own-async-code), and 
that's where I'd (eventually) like to make use of `adbc` - asynchronously 
connecting to databases for displaying data in FastAPI dashboards.
   
   _Originally posted by @dhirschfeld in 
https://github.com/apache/arrow-adbc/issues/71#issuecomment-1340130033_
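A stdlib-only sketch of the backend-agnostic shape being requested: offload a blocking database call so any event loop can drive it. AnyIO offers the same pattern via `anyio.to_thread.run_sync`, which works under both asyncio and trio (the function names below are hypothetical stand-ins):

```python
import asyncio

# Stand-in for a synchronous ADBC call, e.g. cursor.execute(...).fetchall().
def blocking_fetch(query: str) -> list:
    return [("row", query)]

# Offloading to a worker thread keeps the event loop responsive; with AnyIO
# the equivalent is: await anyio.to_thread.run_sync(blocking_fetch, query)
async def fetch(query: str) -> list:
    return await asyncio.to_thread(blocking_fetch, query)

print(asyncio.run(fetch("SELECT 1")))  # [('row', 'SELECT 1')]
```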
 





[GitHub] [arrow] necro351 opened a new issue, #14865: pqarrow.WriteArrowToColumn leaks memory from its memory.Allocator

2022-12-06 Thread GitBox


necro351 opened a new issue, #14865:
URL: https://github.com/apache/arrow/issues/14865

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   I am testing a benchmark with different parquet libraries to write an Arrow 
buffer to file. My benchmark checks the allocator is empty at the end of its 
run. I found that pqarrow retains a memory.Buffer if maybeParentNulls is true, 
and never releases it. I looked through the API to find a Release() function or 
some functionality that would release this buffer but did not find anything.
   
   I noticed the unit tests do not check the allocator is zeroed out.
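A test-side check of that invariant could be sketched with a counting allocator, in the spirit of the reporter's debug allocator (hypothetical, stdlib-only Python rather than Go):

```python
# Hypothetical checking allocator: tracks outstanding bytes so a test can
# assert that everything allocated was released.
class CheckedAllocator:
    def __init__(self):
        self.outstanding = 0

    def allocate(self, size: int) -> bytearray:
        self.outstanding += size
        return bytearray(size)

    def free(self, buf: bytearray) -> None:
        self.outstanding -= len(buf)

alloc = CheckedAllocator()
buf = alloc.allocate(128)
# A leak like the one reported: Resize() allocates but nothing ever frees.
leaked = alloc.allocate(64)
alloc.free(buf)
print(alloc.outstanding)  # 64 -- a test asserting == 0 would catch the leak
```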
   
   This is the if-statement I am concerned about:
   ```go
   // WriteArrowToColumn writes apache arrow columnar data directly to a ColumnWriter.
   // Returns non-nil error if the array data type is not compatible with the concrete
   // writer type.
   //
   // leafArr is always a primitive (possibly dictionary encoded type).
   // leafFieldNullable indicates whether the leaf array is considered nullable
   // according to its schema in a Table or its parent array.
   func WriteArrowToColumn(ctx context.Context, cw file.ColumnChunkWriter, leafArr arrow.Array, defLevels, repLevels []int16, leafFieldNullable bool) error {
       // Leaf nulls are canonical when there is only a single null element after a list
       // and it is at the leaf.
       colLevelInfo := cw.LevelInfo()
       singleNullable := (colLevelInfo.DefLevel == colLevelInfo.RepeatedAncestorDefLevel+1) && leafFieldNullable
       maybeParentNulls := colLevelInfo.HasNullableValues() && !singleNullable

       if maybeParentNulls {
           buf := memory.NewResizableBuffer(cw.Properties().Allocator())
           // ---NON-RELEASED ALLOC HERE--->
           buf.Resize(int(bitutil.BytesForBits(cw.Properties().WriteBatchSize())))
           cw.SetBitsBuffer(buf)
       }
       // ...
   ```
   
   This is the suspicious allocation (I added a PrintStack call in my own 
custom debug allocator to print this):
   ```
   goroutine 19 [running]:
   runtime/debug.Stack()
           /usr/local/go/src/runtime/debug/stack.go:24 +0x65
   runtime/debug.PrintStack()
           /usr/local/go/src/runtime/debug/stack.go:16 +0x19
   gitlab.eng.vmware.com/taurus/data-mesh.git/compact-lake/rows.(*VerboseAllocator).Allocate(0xc0002d5da0, 0x80)
           /home/rick/data-mesh/compact-lake/rows/buffer_test.go:145 +0x6a
   github.com/apache/arrow/go/v11/arrow/memory.(*Buffer).Reserve(0xc00011ee10, 0xc0001596b0?)
           /home/rick/go/pkg/mod/github.com/apache/arrow/go/v11@v11.0.0-20221206133351-50a164ec7f64/arrow/memory/buffer.go:110 +0x5b
   github.com/apache/arrow/go/v11/arrow/memory.(*Buffer).resize(0xc00011ee10, 0x80, 0xf0?)
           /home/rick/go/pkg/mod/github.com/apache/arrow/go/v11@v11.0.0-20

[GitHub] [arrow] westonpace opened a new issue, #14866: [C++] Remove internal GroupBy implementation

2022-12-07 Thread GitBox


westonpace opened a new issue, #14866:
URL: https://github.com/apache/arrow/issues/14866

   ### Describe the enhancement requested
   
   Currently there are two ways to compute a group by.  The supported way is to 
use an aggregate node in an exec plan.  The second (internal) way is to use the 
internal function `arrow::internal::GroupBy`.  This internal function 
simulates, but does not actually use, an aggregate node.
   
   The internal implementation has caused issues in the past where we did not 
notice an error in the aggregate node's invocation of aggregate kernels since 
we use the internal function for testing aggregates and it behaves slightly 
differently.  The internal implementation also requires maintenance and 
significantly complicated #14352 .
   
   I would like to remove the internal implementation.  Unfortunately, the 
internal implementation is used by tests, benchmarks, and pyarrow.  However, we 
should be able to update those bindings to a friendly wrapper around exec plans.
   
   ### Component(s)
   
   C++





[GitHub] [arrow] youngfn closed issue #14853: [C++][Streaming execution] can't write data after hash_distinct

2022-12-07 Thread GitBox


youngfn closed issue #14853: [C++][Streaming execution] can't write data after 
hash_distinct
URL: https://github.com/apache/arrow/issues/14853





[GitHub] [arrow] lukester1975 opened a new issue, #14869: [C++] arrow.pc should have -DARROW_STATIC for Windows static builds

2022-12-07 Thread GitBox


lukester1975 opened a new issue, #14869:
URL: https://github.com/apache/arrow/issues/14869

   ### Describe the enhancement requested
   
   Without it, the generated .pc file is insufficient (at least without "manually"
defining ARROW_STATIC, which is unpleasant).
   
   Quick hack fix: 
https://github.com/lukester1975/arrow/commit/2a8efd9c0bf69fe1b466e157bd69e83a757c926e
   
   * Cflags.private is quite new 
(https://gitlab.freedesktop.org/pkg-config/pkg-config/-/merge_requests/13), so 
this approach might be unpalatable.
   * pkgconfiglite is too old to include that (no commits since 2016??). It 
does quietly ignore the field, though.
   * vcpkg does its own merging of Cflags.private into Cflags, so not an issue 
if pkg-config doesn't understand Cflags.private there (my case).
   * Obviously this is applying to all platforms, not just Windows, but should 
do no harm...?
   
   Seems like there should be some sort of fix here rather than asking vcpkg to 
patch it!
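For illustration, the proposed change amounts to a one-line addition to the generated arrow.pc (a sketch of the relevant fragment, per the linked commit, not the full file):

```
# Cflags.private is merged into Cflags only for static linking (vcpkg does
# this merging itself), so the define reaches exactly the consumers that need it.
Libs: -L${libdir} -larrow
Cflags: -I${includedir}
Cflags.private: -DARROW_STATIC
```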
   
   Regards
   
   
   ### Component(s)
   
   C++





[GitHub] [arrow] pitrou opened a new issue, #14870: [C++][Parquet] Support min_value and max_value Statistics

2022-12-07 Thread GitBox


pitrou opened a new issue, #14870:
URL: https://github.com/apache/arrow/issues/14870

   ### Describe the enhancement requested
   
   The `Statistics` structure in Parquet files provides two ways of specifying 
lower and upper bounds for a data page:
   * `min` and `max` are legacy fields for compatibility with older writers, 
with ill-defined comparison semantics in most cases except for signed integers
   * `min_value` and `max_value` are "new" fields (introduced in 2017! - see 
https://github.com/apache/parquet-format/commit/041708da1af52e7cb9288c331b542aa25b68a2b6
 and 
https://github.com/apache/parquet-format/commit/bef5438990116725af041cdd8ced2bca0ed2608a)
 with well-defined comparison semantics depending on the logical type
   
   Currently Parquet C++ supports only the legacy fields `min` and `max`. We 
should add support for reading and writing the newer ones, with the appropriate 
semantics on the write path.
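The comparison-semantics difference can be seen with BYTE_ARRAY data: `min_value`/`max_value` are defined to use unsigned lexicographic order, while a legacy writer may have compared bytes as signed. A small illustrative sketch (not Parquet code):

```python
# Illustrative only: Python bytes comparison is unsigned lexicographic,
# matching the well-defined ordering of the new min_value/max_value fields.
def unsigned_min_max(values: list) -> tuple:
    return min(values), max(values)

vals = [b"\x80abc", b"\x01abc"]
print(unsigned_min_max(vals))  # (b'\x01abc', b'\x80abc')

# A signed-byte interpretation (the ill-defined legacy behavior of some
# writers) would instead order 0x80 (= -128) first:
signed_key = lambda b: [x - 256 if x >= 128 else x for x in b]
print(min(vals, key=signed_key))  # b'\x80abc'
```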
   
   
   ### Component(s)
   
   Parquet





[GitHub] [arrow] pitrou closed issue #14870: [C++][Parquet] Support min_value and max_value Statistics

2022-12-07 Thread GitBox


pitrou closed issue #14870: [C++][Parquet] Support min_value and max_value 
Statistics
URL: https://github.com/apache/arrow/issues/14870





[GitHub] [arrow] jandom opened a new issue, #14871: pq.ParquetDataset usage with moto3 mocks?

2022-12-07 Thread GitBox


jandom opened a new issue, #14871:
URL: https://github.com/apache/arrow/issues/14871

   ### Describe the usage question you have. Please include as many useful 
details as  possible.
   
   
   Hi there, 
   
   I'm trying to mock some S3 objects, to write a test exercising a pd.Dataset, 
this is using boto3, moto3 and pytest
   
   ```
   @mock_s3
   def test_ignore_moto3():
   s3 = boto3.resource("s3", region_name="us-east-1")
   s3.create_bucket(Bucket="fake-bucket")
   parquet_object = s3.Object("fake-bucket", "dummy.parquet")
   buffer = io.BytesIO()
   df = pd.DataFrame([{"foo": 123, "bar": 123}])
   df.to_parquet(buffer, index=False)
   parquet_object.put(Body=buffer.getvalue())
   
   
   s3 = boto3.resource('s3')
   obj = s3.Object('fake-bucket', 'dummy.parquet')
   print(obj.get()['Body'].read())
   
   ds = pq.ParquetDataset("s3://fake-bucket/dummy.parquet", use_legacy_dataset=False)
   ```
   
   But unexpectedly this tests is dying 
   
   ```
   >   ds = pq.ParquetDataset("s3://fake-bucket/dummy.parquet", 
use_legacy_dataset=False)
   
   tests/virtual_screening/integration/test_results.py:46: 
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
/opt/micromamba/envs/main/lib/python3.9/site-packages/pyarrow/parquet/core.py:1724:
 in __new__
   return _ParquetDatasetV2(
   
/opt/micromamba/envs/main/lib/python3.9/site-packages/pyarrow/parquet/core.py:2401:
 in __init__
   if filesystem.get_file_info(path_or_paths).is_file:
   pyarrow/_fs.pyx:564: in pyarrow._fs.FileSystem.get_file_info
   ???
   pyarrow/error.pxi:144: in pyarrow.lib.pyarrow_internal_check_status
   ???
   _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
   
   >   ???
   E   OSError: When getting information for key 'dummy.parquet' in bucket 
'fake-bucket': AWS Error ACCESS_DENIED during HeadObject operation: No response 
body.
   ```
   
   But the test output includes the file contents (so there is no typo or 
misconfiguration of the mock)
   
   ```
   

   --------------------------- Captured stdout call ---------------------------
   
b'PAR1\x15\x04\x15\x10\x15\x14L\x15\x02\x15\x00\x12\x00\x00\x08\x1c{\x00\x00\x00\x00\x00\x00\x00\x15\x00\x15\x12\x15\x16,\x15\x02\x15\x10\x15\x06\x15\x06\x1c\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x16\x00(\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\t
 
\x02\x00\x00\x00\x02\x01\x01\x02\x00&\xc8\x01\x1c\x15\x04\x195\x10\x00\x06\x19\x18\x03foo\x15\x02\x16\x02\x16\xb8\x01\x16\xc0\x01&8&\x08\x1c\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x16\x00(\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x00\x19,\x15\x04\x15\x00\x15\x02\x00\x15\x00\x15\x10\x15\x02\x00\x00\x00\x15\x04\x15\x10\x15\x14L\x15\x02\x15\x00\x12\x00\x00\x08\x1c{\x00\x00\x00\x00\x00\x00\x00\x15\x00\x15\x12\x15\x16,\x15\x02\x15\x10\x15\x06\x15\x06\x1c\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x16\x00(\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00
 \x00\x00\x00\t 
\x02\x00\x00\x00\x02\x01\x01\x02\x00&\xc2\x04\x1c\x15\x04\x195\x10\x00\x06\x19\x18\x03bar\x15\x02\x16\x02\x16\xb8\x01\x16\xc0\x01&\xb2\x03&\x82\x03\x1c\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x16\x00(\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x00\x19,\x15\x04\x15\x00\x15\x02\x00\x15\x00\x15\x10\x15\x02\x00\x00\x00\x15\x04\x19<5\x00\x18\x06schema\x15\x04\x00\x15\x04%\x02\x18\x03foo\x00\x15\x04%\x02\x18\x03bar\x00\x16\x02\x19\x1c\x19,&\xc8\x01\x1c\x15\x04\x195\x10\x00\x06\x19\x18\x03foo\x15\x02\x16\x02\x16\xb8\x01\x16\xc0\x01&8&\x08\x1c\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x16\x00(\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x00\x19,\x15\x04\x15\x00\x15\x02\x00\x15\x00\x15\x10\x15\x02\x00\x00\x00&\xc2\x04\x1c\x15\x04\x195\x10\x00\x06\x19\x18\x03bar\x15\x02\x16\x02\x16\xb8\x01\x16\xc0\x01&\xb2\x03&\x82\x03\x1c\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x
 
18\x08{\x00\x00\x00\x00\x00\x00\x00\x16\x00(\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x00\x19,\x15\x04\x15\x00\x15\x02\x00\x15\x00\x15\x10\x15\x02\x00\x00\x00\x16\xf0\x02\x16\x02&\x08\x16\x80\x03\x14\x00\x00\x19,\x18\x06pandas\x18\xd9\x02{"index_columns":
 [], "column_indexes": [], "columns": [{"name": "foo", "field_name": "foo

[GitHub] [arrow] jandom closed issue #14871: pq.ParquetDataset usage with moto3 mocks?

2022-12-07 Thread GitBox


jandom closed issue #14871: pq.ParquetDataset usage with moto3 mocks?
URL: https://github.com/apache/arrow/issues/14871





[GitHub] [arrow] DavZim opened a new issue, #14872: [R] arrow returns wrong variable content when multiple group_by/summarise statements are used

2022-12-07 Thread GitBox


DavZim opened a new issue, #14872:
URL: https://github.com/apache/arrow/issues/14872

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   When collecting a query with multiple group_by + summarise statements, one 
variable gets wrongly assigned values from another variable. When an ungroup is 
inserted, everything works fine again.
   
   To reproduce, consider the following:
   In the examples below, the variable `gender` should be `F`, or `M` and not 
`Group X`.
   When the `ungroup()` is inserted (second part), gender is again F/M and not 
Group X.
   
   ``` r
   library(dplyr)
   library(arrow)
   
   # Create sample dataset
   N <- 1000
   set.seed(123)
   orig_data <- tibble(
 code_group = sample(paste("Group", 1:2), N, replace = TRUE),
 year = sample(2015:2016, N, replace = TRUE),
 gender = sample(c("F", "M"), N, replace = TRUE),
 value = runif(N, 0, 10)
   )
   write_dataset(orig_data, "example")
   
   # Query and replicate the error
   (ds <- open_dataset("example/"))
   #> FileSystemDataset with 1 Parquet file
   #> code_group: string
   #> year: int32
   #> gender: string
   #> value: double
   
   ds |>
 group_by(year, code_group, gender) |>
 summarise(value = sum(value)) |>
 group_by(code_group, gender) |>
 summarise(value = max(value), NN = n()) |>
 collect()
   #> # A tibble: 2 × 4
   #> # Groups:   code_group [2]
    #>   code_group gender   value    NN
    #>   <chr>      <chr>    <dbl> <int>
    #> 1 Group 1    Group 1   724.     4
    #> 2 Group 2    Group 2   661.     4
   ```
   
   **ERROR** the gender variable is replaced by the values of the group variable
   
   ``` r
   ds |>
 group_by(year, code_group, gender) |>
 summarise(value = sum(value)) |>
 ungroup() |> #< Added this 
line...
 group_by(code_group, gender) |>
 summarise(value = max(value), NN = n()) |>
 collect()
   #> # A tibble: 4 × 4
   #> # Groups:   code_group [2]
    #>   code_group gender value    NN
    #>   <chr>      <chr>  <dbl> <int>
    #> 1 Group 1    F       724.     2
    #> 2 Group 2    M       627.     2
    #> 3 Group 1    M       658.     2
    #> 4 Group 2    F       661.     2
   ```
   
   **Note** now after inserting the `ungroup()` between the group-by - 
summarise calls, gender is not replaced
   
   
   Quick look at the query (note Node 4 where `"gender": code_group`)
   
   ``` r
   ds |>
 group_by(year, code_group, gender) |>
 summarise(value = sum(value)) |>
 group_by(code_group, gender) |>
 summarise(value = max(value), NN = n()) |> 
 show_query()
   #> ExecPlan with 8 nodes:
   #> 7:SinkNode{}
   #>   6:ProjectNode{projection=[code_group, gender, value, NN]}
   #> 5:GroupByNode{keys=["code_group", "gender"], aggregates=[
   #>  hash_max(value, {skip_nulls=false, min_count=0}),
   #>  hash_sum(NN, {skip_nulls=true, min_count=1}),
   #> ]}
   #>   4:ProjectNode{projection=[value, "NN": 1, code_group, "gender": 
code_group]}   #< gender is wrongfully mapped to code_group! 
   #> 3:ProjectNode{projection=[year, code_group, gender, value]}
   #>   2:GroupByNode{keys=["year", "code_group", "gender"], 
aggregates=[
   #>  hash_sum(value, {skip_nulls=false, min_count=0}),
   #>   ]}
   #> 1:ProjectNode{projection=[value, year, code_group, gender]}
   #>   0:SourceNode{}
   ```
   
   Note that this was also asked [here on 
SO](https://stackoverflow.com/q/74710844/3048453)
   
   ### Component(s)
   
   R





[GitHub] [arrow] gf2121 opened a new issue, #14873: [Java] DictionaryEncoder can decode without building a DictionaryHashTable

2022-12-07 Thread GitBox


gf2121 opened a new issue, #14873:
URL: https://github.com/apache/arrow/issues/14873

   ### Describe the enhancement requested
   
   Today DictionaryEncoder always forces the building of a DictionaryHashTable 
in the constructor. It can be avoided in scenarios where only decoding is 
required.
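The point can be sketched in a couple of lines: decoding is a plain positional lookup into the dictionary, whereas the hash table (value to index) is only needed for encoding (illustrative Python, not the Java API):

```python
# Decoding a dictionary-encoded column is just index lookup -- no
# DictionaryHashTable is required for this direction.
def decode(indices: list, dictionary: list) -> list:
    return [dictionary[i] for i in indices]

print(decode([2, 0, 2, 1], ["apple", "banana", "cherry"]))
# ['cherry', 'apple', 'cherry', 'banana']
```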
   
   ### Component(s)
   
   Java





[GitHub] [arrow] zeroshade opened a new issue, #14875: [Python][C++] C Data Interface incorrect validate failures

2022-12-07 Thread GitBox


zeroshade opened a new issue, #14875:
URL: https://github.com/apache/arrow/issues/14875

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   Spinning off from #14814: 
   
   When testing round trips of empty arrays between Python and Go using the C 
Data Interface, I found an issue with the binary and string data type arrays.
   
   The data types: `pa.binary()`, `pa.large_binary()`, `pa.string()`, 
`pa.large_string()` all throw an error when calling `validate(full=True)` after 
the `_import_from_c` that contained a null value data buffer:
   
   ```
   Traceback (most recent call last):
 File 
"/home/zeroshade/Projects/GitHub/arrow/go/arrow/cdata/test/test_export_to_cgo.py",
 line 218, in test
   b.validate(full=True)
 File "pyarrow/array.pxi", line 1501, in pyarrow.lib.Array.validate
 File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
   pyarrow.lib.ArrowInvalid: Value data buffer is null
   ```
   
   Following up from #14805, which clarified that buffers can be null in a 
0-length array: my guess is that rather than the offsets buffer, the issue is 
the second data buffer, which would contain the actual binary/utf-8 data if the 
array had a length > 0. But that's just a theory, I haven't confirmed it.
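The import-side rule being pointed at could be sketched like this (a hypothetical validator, not Arrow's actual code): a null data buffer should be acceptable when the array length is zero, because there is nothing for the offsets to point into.

```python
# Hypothetical validator for a binary/string array's buffers.
def validate_binary_buffers(length, offsets, data):
    if offsets is None:
        raise ValueError("offsets buffer is null")
    if data is None and length > 0:
        raise ValueError("Value data buffer is null")
    # length == 0 with data is None is fine: nothing to point into.

validate_binary_buffers(0, offsets=[0], data=None)  # ok: empty array
try:
    validate_binary_buffers(1, offsets=[0, 3], data=None)
except ValueError as e:
    print(e)  # Value data buffer is null
```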
   
   ### Component(s)
   
   C++, Python





[GitHub] [arrow] zeroshade opened a new issue, #14876: [Go] Address Crashes for empty C Data arrays with nil buffers

2022-12-07 Thread GitBox


zeroshade opened a new issue, #14876:
URL: https://github.com/apache/arrow/issues/14876

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   Following up from #14805: Go's `cdata` package needs to address handling nil 
data buffers for 0 length empty arrays.
   
   ### Component(s)
   
   Go





[GitHub] [arrow] mattwarkentin opened a new issue, #14880: Best practices for handling larger than memory data

2022-12-07 Thread GitBox


mattwarkentin opened a new issue, #14880:
URL: https://github.com/apache/arrow/issues/14880

   ### Describe the usage question you have. Please include as many useful 
details as  possible.
   
   
   Hi,
   
   I am wondering if someone from the Arrow team could offer some guidance on 
best practices for handling very large data in an optimal way (such as if 
partitioning is even the answer). The specific data is a TSV file that is 26Gb 
on disk and ~50Gb in-memory when read into R. The data frame is ~500K rows and 
~14K columns. It is prohibitively slow/memory intensive to read the full data 
across each of several projects when, typically, only a small subset of the 
data (either subset of rows or columns) is relevant for any given project. 
   
   However, the filtering conditions for which subset changes project to 
project, So I don't see an obvious column to use for grouping and partitioning. 
Does it ever make sense to randomly chunk/partition the data into smaller sets 
of 5000-10000 observations? My understanding was that much of the memory gain 
would occur if you chunked on a sensible variable (e.g., `year`) and then when 
you `filter()` a certain year, some of the data sets won't even be 
touched/loaded. Is there any way random chunking of observations offers any 
time/memory advantage?
   
   Most commonly, most/all rows but only a very small set of columns are 
needed. I had hoped that something like the following would work, where `...` 
is just a small set of column names:
   ```r
   ds <- arrow::open_dataset('data.tsv', format = 'tsv')
   df <- ds |> dplyr::select(...) |> dplyr::collect()
   ```
   But this is seemingly just as slow as loading the full table. I had thought 
only `...` columns would be read into memory so there would be a time savings. 
   
   Anyway, any suggestions? Am I fundamentally misunderstanding how to handle 
larger-than-memory data with `arrow`?
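One detail worth illustrating here: a TSV is row-oriented, so selecting a few columns cannot skip any I/O, since every byte of every row must still be parsed; a columnar format (Parquet/Feather) stores columns contiguously and can read only the requested ones. A stdlib-only sketch of the row-oriented case:

```python
import csv, io

# Even keeping only column "a", the reader still parses every row in full --
# that is why select() + collect() on a TSV is as slow as reading everything.
tsv = "a\tb\tc\n1\tx\tfoo\n2\ty\tbar\n"
wanted = {"a"}
rows = csv.DictReader(io.StringIO(tsv), delimiter="\t")
kept = [{k: v for k, v in row.items() if k in wanted} for row in rows]
print(kept)  # [{'a': '1'}, {'a': '2'}] -- but the whole file was still parsed
```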
   
   ### Component(s)
   
   R





[GitHub] [arrow] mattwarkentin closed issue #14880: Best practices for handling larger than memory data

2022-12-07 Thread GitBox


mattwarkentin closed issue #14880: Best practices for handling larger than 
memory data
URL: https://github.com/apache/arrow/issues/14880





[GitHub] [arrow-julia] quinnj closed issue #327: DST ambiguities in ZonedDateTime not supported

2022-12-07 Thread GitBox


quinnj closed issue #327: DST ambiguities in ZonedDateTime not supported
URL: https://github.com/apache/arrow-julia/issues/327





[GitHub] [arrow] code1704 opened a new issue, #14882: How to do arrow table group by and split?

2022-12-07 Thread GitBox


code1704 opened a new issue, #14882:
URL: https://github.com/apache/arrow/issues/14882

   ### Describe the usage question you have. Please include as many useful 
details as  possible.
   
   
   How to group arrow table items and split into tables?
   
   ```
   
   g = table.group_by("a")
   for x in g:
       do_something(x)
   
   ```
   
   
   ### Component(s)
   
   Python




