[GitHub] [arrow-nanoarrow] pitrou opened a new issue, #27: Tune down issue notifications
pitrou opened a new issue, #27: URL: https://github.com/apache/arrow-nanoarrow/issues/27

Currently, every comment on every PR and issue is forwarded to the Arrow issues ML. It would be nice to tune that down to perhaps just sending notifications of issue and PR creations.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-nanoarrow] pitrou opened a new issue, #28: Allow namespacing
pitrou opened a new issue, #28: URL: https://github.com/apache/arrow-nanoarrow/issues/28

Since one of the selling points of nanoarrow is easier embedding and vendoring, we should probably make it possible to avoid conflicts between different nanoarrow versions loaded in the same process. See for example a similar configuration option offered by xxhash: https://github.com/Cyan4973/xxHash/blob/c4359b17db082888fdc18371eba918b957a6baaa/xxhash.h#L210-L225
[GitHub] [arrow-nanoarrow] paleolimbot commented on issue #27: Tune down issue notifications
paleolimbot commented on issue #27: URL: https://github.com/apache/arrow-nanoarrow/issues/27#issuecomment-1220958811 I have no idea why that happens or how to stop it!
[GitHub] [arrow-nanoarrow] paleolimbot commented on issue #28: Allow namespacing
paleolimbot commented on issue #28: URL: https://github.com/apache/arrow-nanoarrow/issues/28#issuecomment-1220963652 Definitely!
[GitHub] [arrow-nanoarrow] lidavidm commented on issue #28: Allow namespacing
lidavidm commented on issue #28: URL: https://github.com/apache/arrow-nanoarrow/issues/28#issuecomment-1220966008 Duplicate of #21?
[GitHub] [arrow-adbc] lidavidm opened a new issue, #71: [C] Research ConnectorX/pgeon for optimizing libpq driver
lidavidm opened a new issue, #71: URL: https://github.com/apache/arrow-adbc/issues/71

Pgeon: https://github.com/0x0L/pgeon
ConnectorX: https://sfu-db.github.io/connector-x/intro.html
[GitHub] [arrow-adbc] lidavidm opened a new issue, #72: [C] Research Turbodbc/Arrowdantic for developing ODBC-wrapping driver
lidavidm opened a new issue, #72: URL: https://github.com/apache/arrow-adbc/issues/72

Arrowdantic: https://github.com/jorgecarleitao/arrowdantic/
Turbodbc: https://github.com/blue-yonder/turbodbc/
[GitHub] [arrow-adbc] paleolimbot commented on issue #70: [Python] Try using PyCapsule for handles to C structs
paleolimbot commented on issue #70: URL: https://github.com/apache/arrow-adbc/issues/70#issuecomment-1220998984 Nothing! reticulate doesn't handle them. In this case the R package would implement `py_to_r.some.qualified.python.type.schema()` and return an external pointer classed as `nanoarrow_schema` (for example). My point was that the semantics should be exactly the same as if the transformation was automatic (at least in this case).
[GitHub] [arrow-nanoarrow] paleolimbot commented on pull request #26: Implement getters
paleolimbot commented on PR #26: URL: https://github.com/apache/arrow-nanoarrow/pull/26#issuecomment-1221041550 (I don't think there's much of a point with "safe" variants of these unless there's any objection)
[GitHub] [arrow-nanoarrow] wesm commented on issue #27: Tune down issue notifications
wesm commented on issue #27: URL: https://github.com/apache/arrow-nanoarrow/issues/27#issuecomment-1222122597 Here's Arrow's .asf.yaml https://github.com/apache/arrow/blob/master/.asf.yaml and this repo's https://github.com/apache/arrow-nanoarrow/blob/main/.asf.yaml I suggest copying over the e-mail settings from the main Arrow repository
[GitHub] [arrow-nanoarrow] wesm commented on issue #27: Tune down issue notifications
wesm commented on issue #27: URL: https://github.com/apache/arrow-nanoarrow/issues/27#issuecomment-1222122949

```
notifications:
  commits: comm...@arrow.apache.org
  issues: git...@arrow.apache.org
  pullrequests: git...@arrow.apache.org
```
[GitHub] [arrow-nanoarrow] paleolimbot opened a new pull request, #29: Tune down notifications
paleolimbot opened a new pull request, #29: URL: https://github.com/apache/arrow-nanoarrow/pull/29 Fixes #27.
[GitHub] [arrow-nanoarrow] paleolimbot merged pull request #29: Tune down notifications
paleolimbot merged PR #29: URL: https://github.com/apache/arrow-nanoarrow/pull/29
[GitHub] [arrow-nanoarrow] paleolimbot closed issue #27: Tune down issue notifications
paleolimbot closed issue #27: Tune down issue notifications URL: https://github.com/apache/arrow-nanoarrow/issues/27
[GitHub] [arrow-adbc] lidavidm merged pull request #73: MINOR: Make issue notifications less noisy
lidavidm merged PR #73: URL: https://github.com/apache/arrow-adbc/pull/73
[GitHub] [arrow-julia] ericphanson opened a new issue, #332: Difficulties trying to serialize to Union types
ericphanson opened a new issue, #332: URL: https://github.com/apache/arrow-julia/issues/332

Setup:

```julia
using Arrow

struct A0 end

struct A1
    x::Int
end

struct A2
    x::Int
    y::Float64
end

ArrowTypes.arrowname(::Type{A0}) = :A0
ArrowTypes.JuliaType(::Val{:A0}) = A0
ArrowTypes.arrowname(::Type{A1}) = :A1
ArrowTypes.JuliaType(::Val{:A1}) = A1
ArrowTypes.arrowname(::Type{A2}) = :A2
ArrowTypes.JuliaType(::Val{:A2}) = A2

struct MyUnion{T<:Tuple}
    elts::T
end

ArrowTypes.arrowname(::Type{<:MyUnion}) = :MyUnion
ArrowTypes.JuliaType(::Val{:MyUnion}) = MyUnion
ArrowTypes.ArrowType(::Type{<:MyUnion}) = ArrowTypes.UnionKind()
ArrowTypes.toarrow(u::MyUnion{T}) where {T} = collect(Union{T.parameters...}, u.elts)
ArrowTypes.fromarrow(::Type{<:MyUnion}, args...) = MyUnion(args)
```

Then:

```julia
julia> u = MyUnion((A0(), A1(1), A2(1, 2.0)))
MyUnion{Tuple{A0, A1, A2}}((A0(), A1(1), A2(1, 2.0)))

julia> ArrowTypes.toarrow(u)
3-element Vector{Union{A0, A1, A2}}:
 A0()
 A1(1)
 A2(1, 2.0)

julia> tbl = (; col = [u]);

julia> Arrow.Table(Arrow.tobuffer(tbl)).col[1]
ERROR: MethodError: no method matching isstringtype(::ArrowTypes.StructKind)
Closest candidates are:
  isstringtype(::ArrowTypes.ListKind{stringtype}) where stringtype at ~/.julia/packages/ArrowTypes/dkiHE/src/ArrowTypes.jl:196
  isstringtype(::Type{ArrowTypes.ListKind{stringtype}}) where stringtype at ~/.julia/packages/ArrowTypes/dkiHE/src/ArrowTypes.jl:197
Stacktrace:
 [1] getindex(l::Arrow.List{MyUnion, Int32, Arrow.DenseUnion{Union{Missing, A0, A1, A2}, Arrow.UnionT{Arrow.Flatbuf.UnionModes.Dense, nothing, Tuple{Union{Missing, A0, A1}, A2}}, Tuple{Arrow.DenseUnion{Union{Missing, A0, A1}, Arrow.UnionT{Arrow.Flatbuf.UnionModes.Dense, nothing, Tuple{Union{Missing, A0}, A1}}, Tuple{Arrow.Struct{Union{Missing, A0}, Tuple{}}, Arrow.Struct{A1, Tuple{Arrow.Primitive{Int64, Vector{Int64}}, Arrow.Struct{A2, Tuple{Arrow.Primitive{Int64, Vector{Int64}}, Arrow.Primitive{Float64, Vector{Float64}}}, i::Int64)
   @ Arrow ~/.julia/packages/Arrow/ZlMFU/src/arraytypes/list.jl:52
 [2] top-level scope
   @ REPL[25]:1
```

I took a guess and added

```julia
ArrowTypes.isstringtype(::ArrowTypes.StructKind) = false
```

which seems to fix it:

```julia
julia> Arrow.Table(Arrow.tobuffer(tbl)).col[1]
MyUnion{Tuple{SubArray{Union{Missing, A0, A1, A2}, 1, Arrow.DenseUnion{Union{Missing, A0, A1, A2}, Arrow.UnionT{Arrow.Flatbuf.UnionModes.Dense, nothing, Tuple{Union{Missing, A0, A1}, A2}}, Tuple{Arrow.DenseUnion{Union{Missing, A0, A1}, Arrow.UnionT{Arrow.Flatbuf.UnionModes.Dense, nothing, Tuple{Union{Missing, A0}, A1}}, Tuple{Arrow.Struct{Union{Missing, A0}, Tuple{}}, Arrow.Struct{A1, Tuple{Arrow.Primitive{Int64, Vector{Int64}}, Arrow.Struct{A2, Tuple{Arrow.Primitive{Int64, Vector{Int64}}, Arrow.Primitive{Float64, Vector{Float64}}, Tuple{UnitRange{Int64}}, true}}}((Union{Missing, A0, A1, A2}[A0(), A1(1), A2(1, 2.0)],))
```

However, in my real code, I was using this macro to define the methods:

```julia
macro arrow_record(T1)
    T = esc(T1)
    name = :(Symbol("JuliaLang.", string(parentmodule($T)), '.', string(nameof($T))))
    return quote
        ArrowTypes.arrowname(::Type{$T}) = $name
        ArrowTypes.ArrowType(::Type{$T}) = fieldtypes($T)
        ArrowTypes.toarrow(obj::$T) = ntuple(i -> getfield(obj, i), fieldcount($T))
        ArrowTypes.JuliaType(::Val{$name}, ::Any) = $T
        ArrowTypes.fromarrow(::Type{$T}, args) = $T(args...)
        ArrowTypes.fromarrow(::Type{$T}, arg::$T) = arg
    end
end
```

and had a second `A2`-style struct,

```julia
struct A22
    x::Int
    y::Float64
end
```

If I do

```julia
@arrow_record A2
@arrow_record A22
```

and define

```julia
julia> u1 = MyUnion((A2(1, 2.0), A22(2, 3.0)))
MyUnion{Tuple{A2, A22}}((A2(1, 2.0), A22(2, 3.0)))

julia> u2 = MyUnion((A22(1, 2.0), A2(2, 3.0)))
MyUnion{Tuple{A22, A2}}((A22(1, 2.0), A2(2, 3.0)))

julia> tbl = (; col = [u1, u2]);
```

Then I get

```julia
julia> Arrow.Table(Arrow.tobuffer(tbl)).col[1]
ERROR: TypeError: in Union, expected Type, got a value of type Tuple{DataType, DataType}
Stacktrace:
 [1] ArrowType(#unused#::Type{Union{Missing, A2}})
   @ ArrowTypes ~/.julia/packages/ArrowTypes/dkiHE/src/ArrowTypes.jl:71
 [2] ArrowTypes.ToArrow(x::Vector{Union{Missing, A2}})
   @ ArrowTypes ~/.julia/packages/ArrowTypes/dkiHE/src/ArrowTypes.jl:338
 [3] arrowvector(x::Vector{Union{Missing, A2}}, i::Int64, nl::Int64, fi::Int64, de::Dict{Int64, Any}, ded::Vector{Arrow.DictEncoding}, meta::Nothing; dictencoding::Bool, dictencode::
```
[GitHub] [arrow-julia] jariji opened a new issue, #333: Can't round-trip integer CategoricalArrays
jariji opened a new issue, #333: URL: https://github.com/apache/arrow-julia/issues/333

```jl
julia> Arrow.write("/tmp/my.arrow", DataFrame(x=CategoricalArray([1,2,3])))
"/tmp/my.arrow"

julia> DataFrame(Arrow.Table("/tmp/my.arrow")).x |> eltype
Int64
```

[69666777] Arrow v2.3.0
[GitHub] [arrow-julia] bilelomrani1 opened a new issue, #334: Streaming: Pyarrow is 15 times faster than Arrow.jl
bilelomrani1 opened a new issue, #334: URL: https://github.com/apache/arrow-julia/issues/334

I have an `.arrow` file generated with `pyarrow` whose schema is the following:

```
input: struct<open: fixed_size_list<item: float>[512], high: fixed_size_list<item: float>[512], low: fixed_size_list<item: float>[512], close: fixed_size_list<item: float>[512]> not null
  child 0, open: fixed_size_list<item: float>[512]
      child 0, item: float
  child 1, high: fixed_size_list<item: float>[512]
      child 0, item: float
  child 2, low: fixed_size_list<item: float>[512]
      child 0, item: float
  child 3, close: fixed_size_list<item: float>[512]
      child 0, item: float
```

With `pyarrow`, I load and iterate over records with the following:

```python
with pa.memory_map('arraydata.arrow', 'r') as source:
    loaded_arrays = pa.ipc.open_file(source).read_all()

a = 0
for batch in loaded_arrays.to_batches():
    for input_candles in batch["input"]:
        a += 1
```

Iterating over my example file (~10,000 lines) takes 210 ms.

In julia, I load and iterate over the same file with the following:

```julia
stream = Arrow.Stream("./arraydata.arrow")

function bench_iteration(stream)
    a = 0
    for batch in stream
        for sample in batch.input
            a += 1
        end
    end
end

@btime bench_iteration($stream)
```

```
3.169 s (25272097 allocations: 1.70 GiB)
```

Iterating over the same records takes about 15 times longer with `Arrow.jl`. Am I doing something wrong?
[GitHub] [arrow-julia] svilupp opened a new issue, #335: Inconsistent handling of eltype Decimals.Decimal (with silent errors?)
svilupp opened a new issue, #335: URL: https://github.com/apache/arrow-julia/issues/335

First of all, thank you for the amazing package! I have noticed unexpected behaviour that I wanted to point out.

**Expected behaviour:** rational numbers like 1.0 and 0.1 will be represented as Float; they can be saved and loaded again.

**Actual behaviour:** When writing a column with eltype Decimals.Decimal, `Arrow.write(filename,df)` will give a method error (see below) and `Arrow.write(filename,df;compress=:lz4)` will complete without an error, but the resulting table is wrong when re-read (see MWE below). I've had a quick look at the code base and I cannot see any type checks - are those left to the user / MethodErrors?

MWE:

```
using Decimals
using DataFrames, Arrow

df = DataFrame(:a => [Decimal(2.0)])

# this will fail with error that Decimal cannot be saved
Arrow.write("test.feather", df)
# nested task error: MethodError: no method matching write(::IOBuffer, ::Decimals.Decimal)

# this will succeed
Arrow.write("test.feather", df; compress=:lz4)

# but the loaded dataframe will be rubbish
df2 = Arrow.Table("test.feather") |> DataFrame
# 1×1 DataFrame
#  Row │ a
#      │ Float64
# ─────┼─────────────
#    1 │ 2.1509e-314
```

Error stack trace from Arrow.write() without a keyword argument:

> ERROR: TaskFailedException
> Stacktrace:
>  [1] wait
>    @ ./task.jl:345 [inlined]
>  [2] close(writer::Arrow.Writer{IOStream})
>    @ Arrow ~/.julia/packages/Arrow/ZlMFU/src/write.jl:230
>  [3] open(::Arrow.var"#120#121"{DataFrame}, ::Type, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Bool, Tuple{Symbol}, NamedTuple{(:file,), Tuple{Bool}}})
>    @ Base ./io.jl:386
>  [4] #write#119
>    @ ~/.julia/packages/Arrow/ZlMFU/src/write.jl:57 [inlined]
>  [5] write(file_path::String, tbl::DataFrame)
>    @ Arrow ~/.julia/packages/Arrow/ZlMFU/src/write.jl:56
>  [6] top-level scope
>    @ REPL[14]:1
>
> nested task error: MethodError: no method matching write(::IOBuffer, ::Decimals.Decimal)
> Closest candidates are:
>   write(::IO, ::Any) at io.jl:672
>   write(::IO, ::Any, ::Any...) at io.jl:673
>   write(::Base.GenericIOBuffer, ::UInt8) at iobuffer.jl:442
>   ...
> Stacktrace:
>  [1] write(io::IOBuffer, x::Decimals.Decimal)
>    @ Base ./io.jl:672
>  [2] writearray(io::IOStream, #unused#::Type{Decimals.Decimal}, col::Vector{Union{Missing, Decimals.Decimal}})
>    @ Arrow ~/.julia/packages/Arrow/ZlMFU/src/utils.jl:50
>  [3] writebuffer(io::IOStream, col::Arrow.Primitive{Union{Missing, Decimals.Decimal}, Vector{Union{Missing, Decimals.Decimal}}}, alignment::Int64)
>    @ Arrow ~/.julia/packages/Arrow/ZlMFU/src/arraytypes/primitive.jl:102
>  [4] write(io::IOStream, msg::Arrow.Message, blocks::Tuple{Vector{Arrow.Block}, Vector{Arrow.Block}}, sch::Base.RefValue{Tables.Schema}, alignment::Int64)
>    @ Arrow ~/.julia/packages/Arrow/ZlMFU/src/write.jl:365
>  [5] macro expansion
>    @ ~/.julia/packages/Arrow/ZlMFU/src/write.jl:149 [inlined]
>  [6] (::Arrow.var"#122#124"{IOStream, Int64, Tuple{Vector{Arrow.Block}, Vector{Arrow.Block}}, Base.RefValue{Tables.Schema}, Arrow.OrderedChannel{Arrow.Message}})()
>    @ Arrow ./threadingconstructs.jl:258

**Package version**
[69666777] Arrow v2.3.0
[a93c6f00] DataFrames v1.3.4
[194296ae] LibPQ v1.14.0

**versioninfo()** (but it was the same on 1.7)
Julia Version 1.8.0
Commit 5544a0fab76 (2022-08-17 13:38 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.3.0)
  CPU: 8 × Apple M1 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 6 on 6 virtual cores
[GitHub] [arrow-julia] TanookiToad opened a new issue, #336: Invalid argument error
TanookiToad opened a new issue, #336: URL: https://github.com/apache/arrow-julia/issues/336

If you try to save a loaded table into the same file, it will lead to an invalid argument error. It seems to be caused by mmap on Windows. See JuliaData/CSV.jl#70.

```jl
using Arrow
using DataFrames

df = DataFrame(rand(100, 100), :auto)
Arrow.write("test.arrow", df)
df = Arrow.Table("test.arrow")
Arrow.write("test.arrow", df)
```

The last line will raise an error.

```jl
ERROR: SystemError: opening file "test.arrow": Invalid argument
Stacktrace:
  [1] systemerror(p::String, errno::Int32; extrainfo::Nothing)
    @ Base .\error.jl:174
  [2] #systemerror#68
    @ .\error.jl:173 [inlined]
  [3] systemerror
    @ .\error.jl:173 [inlined]
  [4] open(fname::String; lock::Bool, read::Nothing, write::Nothing, create::Nothing, truncate::Bool, append::Nothing)
    @ Base .\iostream.jl:293
  [5] open(fname::String, mode::String; lock::Bool)
    @ Base .\iostream.jl:355
  [6] open(fname::String, mode::String)
    @ Base .\iostream.jl:355
  [7] open(::Arrow.var"#116#117"{Nothing, Nothing, Bool, Nothing, Bool, Bool, Bool, Int64, Int64, Float64, Bool, Arrow.Table}, ::String, ::Vararg{String}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Base .\io.jl:328
  [8] open(::Function, ::String, ::String)
    @ Base .\io.jl:328
  [9] #write#115
    @ C:\Users\R9000K\.julia\packages\Arrow\SFb8h\src\write.jl:57 [inlined]
 [10] write(file_path::String, tbl::Arrow.Table)
    @ Arrow C:\Users\R9000K\.julia\packages\Arrow\SFb8h\src\write.jl:57
 [11] top-level scope
    @ Untitled-1:8
```

However, it works when saved to a different file name other than the original one.

```jl
Arrow.write("test1.arrow", df)
```
[GitHub] [arrow-testing] westonpace closed pull request #74: ARROW-15425: [Integration] Add delta dictionaries in file format to integration tests
westonpace closed pull request #74: ARROW-15425: [Integration] Add delta dictionaries in file format to integration tests URL: https://github.com/apache/arrow-testing/pull/74
[GitHub] [arrow-julia] bkamins opened a new issue, #337: Support DataAPI.jl metadata API
bkamins opened a new issue, #337: URL: https://github.com/apache/arrow-julia/issues/337

Hi @quinnj - could you please add to the release plan of Arrow.jl support for https://github.com/JuliaData/DataAPI.jl/pull/48 for the created Arrow tables. Only read methods need to be implemented for Arrow tables:
* `DataAPI.metadata`
* `DataAPI.metadatakeys`
* `DataAPI.colmetadata`
* `DataAPI.colmetadatakeys`
[GitHub] [arrow-julia] Moelf opened a new issue, #340: Feather file with compression and larger than RAM
Moelf opened a new issue, #340: URL: https://github.com/apache/arrow-julia/issues/340

Last time I checked, `mmap` breaks down for files with compression. This is understandable because the compressed buffers clearly can't be re-interpreted without inflation. But the larger the file is, the more likely it's compressed. Can we decompress only a single "row group" (and only the relevant columns, of course) on the fly yet?
[GitHub] [arrow-julia] quinnj closed issue #295: Order of record batches from "arrow file" format files (i.e. `Arrow.Table`) not preserved
quinnj closed issue #295: Order of record batches from "arrow file" format files (i.e. `Arrow.Table`) not preserved URL: https://github.com/apache/arrow-julia/issues/295
[GitHub] [arrow-julia] quinnj closed issue #324: filtering DataFrame loaded from feather file triggers `deleteat!` error
quinnj closed issue #324: filtering DataFrame loaded from feather file triggers `deleteat!` error URL: https://github.com/apache/arrow-julia/issues/324
[GitHub] [arrow-julia] quinnj opened a new issue, #342: Need to improve code review/release process and reduce developer productivity friction
quinnj opened a new issue, #342: URL: https://github.com/apache/arrow-julia/issues/342

In https://github.com/apache/arrow-julia/issues/284, I originally raised some concerns about the health and long-term maintainability of the package under the apache organization. Having let that sit for a while, I'm again raising concerns around how the package is managed. In particular, I have 3 main complaints:

1. Inability for meaningful contributors to approve pull requests (only Arrow PMC members are able to approve PRs to be merged)
2. Inability for meaningful contributors to approve new releases (same as above)
3. Slowness of getting fixes merged and new releases made (combination of requiring Arrow PMC approvals from the above 2 and the current 72-hour release window)

On point 1, it's unfortunate because only Arrow PMC members (only @kou so far) can approve PRs/releases in a meaningful way, yet these members, no disrespect intended, don't have the skills/context/code abilities to actually evaluate code changes. It would be much more helpful if @jrevels, @omus, @ericphanson, @nickrobinson251, @bkamins, and @baumgold had the necessary permissions to approve pull requests and new releases.

On point 3, the current 72-hour window is really long, especially since it's idiomatic in Julia packages to merge a single pull request with a small fix and immediately issue a patch release. I think ideally we'd be able to have at least 12- or 24-hour release windows, which would make things much more manageable.

Thoughts?
[GitHub] [arrow-julia] Moelf opened a new issue, #344: error earlier when number of entries don't match across all fields
Moelf opened a new issue, #344: URL: https://github.com/apache/arrow-julia/issues/344

Right now it throws confusing messages:

```
julia> Arrow.write(tempname(), df)
ERROR: UndefRefError: access to undefined reference
Stacktrace:
 [1] getindex
   @ ./array.jl:924 [inlined]
 [2] iterate
   @ ~/.julia/packages/Arrow/SFb8h/src/arraytypes/list.jl:171 [inlined]
 [3] Arrow.ToList(input::Arrow.ToList{Vector{Bool}, false, Vector{Vector{Bool}}, Int32}; largelists::Bool)
   @ Arrow ~/.julia/packages/Arrow/SFb8h/src/arraytypes/list.jl:103
 [4] arrowvector(::ArrowTypes.ListKind{false}, x::Arrow.ToList{Vector{Bool}, false, Vector{Vector{Bool}}, Int32}, i::Int64, nl::Int64, fi::Int64, de::Dict{Int64, Any}, ded::Vector{Arrow.DictEncoding}, meta::Nothing; largelists::Bool, kw::Base.Pairs{Symbol, Union{Nothing, Integer}, NTuple{6, Symbol}, NamedTuple{(:dictencode, :maxdepth, :lareglists, :compression, :denseunions, :dictencodenested), Tuple{Bool, Int64, Bool, Nothing, Bool, Bool}}})
```
[GitHub] [arrow-testing] zeroshade opened a new pull request, #81: ARROW-18031: [C++][Parquet] Undefined behavior in boolean RLE decoder
zeroshade opened a new pull request, #81: URL: https://github.com/apache/arrow-testing/pull/81 Corresponding fix for this issue found in https://github.com/apache/arrow/pull/14407
[GitHub] [arrow-testing] pitrou merged pull request #81: ARROW-18031: [C++][Parquet] Undefined behavior in boolean RLE decoder
pitrou merged PR #81: URL: https://github.com/apache/arrow-testing/pull/81
[GitHub] [arrow-julia] palday opened a new issue, #345: Tests fail on Apple silicon on Julia 1.8
palday opened a new issue, #345: URL: https://github.com/apache/arrow-julia/issues/345

```julia
ArgumentError: unsafe_wrap: pointer 0x14858d048 is not properly aligned to 16 bytes
Stacktrace:
 [1] #unsafe_wrap#102
   @ ./pointer.jl:89 [inlined]
 [2] unsafe_wrap
   @ ./pointer.jl:87 [inlined]
 [3] reinterp(#unused#::Type{Arrow.Decimal{2, 2, Int128}}, batch::Arrow.Batch, buf::Arrow.Flatbuf.Buffer, compression::Nothing)
   @ Arrow ~/Code/arrow-julia/src/table.jl:507
 [4] build(f::Arrow.Flatbuf.Field, #unused#::Arrow.Flatbuf.Decimal, batch::Arrow.Batch, rb::Arrow.Flatbuf.RecordBatch, de::Dict{Int64, Arrow.DictEncoding}, nodeidx::Int64, bufferidx::Int64, convert::Bool)
```

Full test output

```julia
(Arrow) pkg> test
     Testing Arrow
      Status `/private/var/folders/yy/nyj87tsn7093bb7d84rl64rhgp/T/jl_xRGYNK/Project.toml`
  [69666777] Arrow v2.3.0 `~/Code/arrow-julia`
⌅ [31f734f8] ArrowTypes v1.2.1
  [c3b6d118] BitIntegers v0.2.6
  [324d7699] CategoricalArrays v0.10.7
  [5ba52731] CodecLz4 v0.4.0
  [6b39b394] CodecZstd v0.7.2
  [9a962f9c] DataAPI v1.12.0
  [48062228] FilePathsBase v0.9.20
  [0f8b85d8] JSON3 v1.10.0
  [2dfb63ee] PooledArrays v1.4.2
  [91c51154] SentinelArrays v1.3.16
  [856f2bd8] StructTypes v1.10.0
  [bd369af6] Tables v1.10.0
  [f269a46b] TimeZones v1.9.0
  [76eceee3] WorkerUtilities v1.1.0
  [ade2ca70] Dates `@stdlib/Dates`
  [a63ad114] Mmap `@stdlib/Mmap`
  [9a3f8284] Random `@stdlib/Random`
  [8dfed614] Test `@stdlib/Test`
  [cf7118a7] UUIDs `@stdlib/UUIDs`
      Status `/private/var/folders/yy/nyj87tsn7093bb7d84rl64rhgp/T/jl_xRGYNK/Manifest.toml`
  [69666777] Arrow v2.3.0 `~/Code/arrow-julia`
⌅ [31f734f8] ArrowTypes v1.2.1
  [c3b6d118] BitIntegers v0.2.6
  [fa961155] CEnum v0.4.2
  [324d7699] CategoricalArrays v0.10.7
  [5ba52731] CodecLz4 v0.4.0
  [6b39b394] CodecZstd v0.7.2
⌅ [34da2185] Compat v3.46.0
  [9a962f9c] DataAPI v1.12.0
  [e2d170a0] DataValueInterfaces v1.0.0
  [e2ba6199] ExprTools v0.1.8
  [48062228] FilePathsBase v0.9.20
  [842dd82b] InlineStrings v1.2.2
  [82899510] IteratorInterfaceExtensions v1.0.0
  [692b3bcd] JLLWrappers v1.4.1
  [0f8b85d8] JSON3 v1.10.0
  [e1d29d7a] Missings v1.0.2
  [78c3b35d] Mocking v0.7.3
  [bac558e1] OrderedCollections v1.4.1
  [69de0a69] Parsers v2.4.2
  [2dfb63ee] PooledArrays v1.4.2
  [21216c6a] Preferences v1.3.0
  [3cdcf5f2] RecipesBase v1.3.1
  [ae029012] Requires v1.3.0
  [6c6a2e73] Scratch v1.1.1
  [91c51154] SentinelArrays v1.3.16
  [66db9d55] SnoopPrecompile v1.0.1
  [856f2bd8] StructTypes v1.10.0
  [3783bdb8] TableTraits v1.0.1
  [bd369af6] Tables v1.10.0
  [f269a46b] TimeZones v1.9.0
  [3bb67fe8] TranscodingStreams v0.9.9
  [76eceee3] WorkerUtilities v1.1.0
  [5ced341a] Lz4_jll v1.9.3+0
  [3161d3a3] Zstd_jll v1.5.2+0
  [0dad84c5] ArgTools v1.1.1 `@stdlib/ArgTools`
  [56f22d72] Artifacts `@stdlib/Artifacts`
  [2a0f44e3] Base64 `@stdlib/Base64`
  [ade2ca70] Dates `@stdlib/Dates`
  [8bb1440f] DelimitedFiles `@stdlib/DelimitedFiles`
  [8ba89e20] Distributed `@stdlib/Distributed`
  [f43a241f] Downloads v1.6.0 `@stdlib/Downloads`
  [7b1f6079] FileWatching `@stdlib/FileWatching`
  [9fa8497b] Future `@stdlib/Future`
  [b77e0a4c] InteractiveUtils `@stdlib/InteractiveUtils`
  [4af54fe1] LazyArtifacts `@stdlib/LazyArtifacts`
  [b27032c2] LibCURL v0.6.3 `@stdlib/LibCURL`
  [76f85450] LibGit2 `@stdlib/LibGit2`
  [8f399da3] Libdl `@stdlib/Libdl`
  [37e2e46d] LinearAlgebra `@stdlib/LinearAlgebra`
  [56ddb016] Logging `@stdlib/Logging`
  [d6f4376e] Markdown `@stdlib/Markdown`
  [a63ad114] Mmap `@stdlib/Mmap`
  [ca575930] NetworkOptions v1.2.0 `@stdlib/NetworkOptions`
  [44cfe95a] Pkg v1.8.0 `@stdlib/Pkg`
  [de0858da] Printf `@stdlib/Printf`
  [3fa0cd96] REPL `@stdlib/REPL`
  [9a3f8284] Random `@stdlib/Random`
  [ea8e919c] SHA v0.7.0 `@stdlib/SHA`
  [9e88b42a] Serialization `@stdlib/Serialization`
  [1a1011a3] SharedArrays `@stdlib/SharedArrays`
  [6462fe0b] Sockets `@stdlib/Sockets`
  [2f01184e] SparseArrays `@stdlib/SparseArrays`
  [10745b16] Statistics `@stdlib/Statistics`
  [fa267f1f] TOML v1.0.0 `@stdlib/TOML`
  [a4e569a6] Tar v1.10.1 `@stdlib/Tar`
  [8dfed614] Test `@stdlib/Test`
  [cf7118a7] UUIDs `@stdlib/UUIDs`
  [4ec0a83e] Unicode `@stdlib/Unicode`
  [e66e0078]
```
CompilerSupportLibraries_jll v0.5.2+0 `@stdlib/CompilerSupportLibraries_jll` [deac9b47] LibCURL_jll v7.84.0+0 `@stdlib/LibCURL_jll` [29816b5a] LibSSH2_jll v1.10.2+0 `@stdlib/LibSSH2_jll` [c8ffd9c3] MbedTLS_jll v2.28.0+0 `@stdlib/MbedTLS_jll` [14a3606d] MozillaCACerts_jll v2022.2.1 `@stdlib/MozillaCACerts_jll` [4536629a] OpenBLAS_jll v0.3.20+0 `@stdlib/O
[GitHub] [arrow-julia] Moelf closed issue #344: error earlier when number of entries don't match across all fields
Moelf closed issue #344: error earlier when number of entries don't match across all fields URL: https://github.com/apache/arrow-julia/issues/344
[GitHub] [arrow-julia] ericphanson opened a new issue, #348: Install Registrator.jl github app
ericphanson opened a new issue, #348: URL: https://github.com/apache/arrow-julia/issues/348 Can we install the Julia Registrator GitHub app on this repository? https://github.com/JuliaRegistries/Registrator.jl#install-registrator https://user-images.githubusercontent.com/5846501/198585934-972f8963-84c1-429a-acea-deca5ba872c6.png I can request it to be installed, but I guess someone will have to approve it. I have not requested it yet in case it is a breach of etiquette and there is a different process to be followed. Why? This enables us to easily register packages in Julia's package registry. It only requires minimal read-only permissions. This is the standard workflow for registering Julia packages.
[GitHub] [arrow-julia] bkamins opened a new issue, #352: Allow appending record batches to an existing Arrow file
bkamins opened a new issue, #352: URL: https://github.com/apache/arrow-julia/issues/352 @quinnj - following my comment on Julia Slack: would it be possible to add an option to append record batches to an existing Arrow file? (I assume such an append method should check that the appended table has the same schema as the existing Arrow data.)
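For context, Arrow.jl does ship an `Arrow.append` with roughly the requested semantics, including a schema check. A minimal usage sketch, noting that appending requires the IPC *stream* format rather than the footer-terminated file format (the `file=false` keyword follows the package docs and may vary by version):

```julia
using Arrow

# Write the initial record batch in IPC *stream* format (file=false);
# the Arrow "file" format ends with a footer and cannot be appended to.
Arrow.write("data.arrows", (a = [1, 2], b = ["x", "y"]); file=false)

# Append another record batch; Arrow.append errors if the schema differs.
Arrow.append("data.arrows", (a = [3], b = ["z"]))
```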
[GitHub] [arrow-julia] bkamins closed issue #352: Allow appending record batches to an existing Arrow file
bkamins closed issue #352: Allow appending record batches to an existing Arrow file URL: https://github.com/apache/arrow-julia/issues/352
[GitHub] [arrow-julia] bkamins opened a new issue, #353: Add an indexable variant of Arrow.Stream
bkamins opened a new issue, #353: URL: https://github.com/apache/arrow-julia/issues/353 In a distributed computing context it would be nice to have a vector variant of the `Arrow.Stream` iterator. The idea is to be able to split the processing of a single large arrow file with multiple record batches across multiple worker processes. Looking at the source code, this should be doable relatively efficiently. @quinnj - what do you think?
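A vector-like view can already be approximated by materializing the iterator, at the cost of eager iteration over the whole file — a rough sketch, not the lazily indexable variant the issue asks for:

```julia
using Arrow

# Eagerly collect the record batches of a multi-batch stream; each element
# is a table for one batch. A true indexable variant would instead record
# the batch offsets and seek lazily, so workers could read only their batch.
batches = collect(Arrow.Stream("big.arrows"))
tbl3 = batches[3]   # hand batch 3 to one worker process
```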
[GitHub] [arrow-julia] bkamins opened a new issue, #354: Arrow.append to non-existent file
bkamins opened a new issue, #354: URL: https://github.com/apache/arrow-julia/issues/354 Maybe we could consider allowing creation of a new Arrow file with `Arrow.append`? Currently it fails, so one needs to write:
```
for i in 1:10
    if isfile("out.arrow")
        Arrow.append("out.arrow", DataFrame(i=i))
    else
        Arrow.write("out.arrow", DataFrame(i=i))
    end
end
```
which could be just:
```
for i in 1:10
    Arrow.append("out.arrow", DataFrame(i=i))
end
```
But maybe there is a reason for the current design?
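While the design question is open, the guard in the loop above can be factored into a small helper — a sketch, with `append_or_write` being a hypothetical name:

```julia
using Arrow

# Hypothetical convenience wrapper: create the file on the first call,
# append afterwards. Writes stream format (file=false), since that is
# the format Arrow.append requires.
function append_or_write(path, tbl)
    isfile(path) ? Arrow.append(path, tbl) : Arrow.write(path, tbl; file=false)
end
```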
[GitHub] [arrow-julia] quinnj closed issue #345: Tests fail on Apple silicon on Julia 1.8
quinnj closed issue #345: Tests fail on Apple silicon on Julia 1.8 URL: https://github.com/apache/arrow-julia/issues/345
[GitHub] [arrow-julia] alex-s-gardner opened a new issue, #359: Arrow changes data type from input in unexpected ways
alex-s-gardner opened a new issue, #359: URL: https://github.com/apache/arrow-julia/issues/359 In this MWE the input is unrecognizable in the output (the path to the Zarr file is public, so the example can be run locally):
```
dc = Zarr.zopen("http://its-live-data.s3.amazonaws.com/datacubes/v02/N20E100/ITS_LIVE_vel_EPSG32647_G0120_X65_Y325.zarr")
C = dc["satellite_img1"][:]
input = DataFrame([C,C],:auto)
Arrow.write("test.arrow", input)
output = Arrow.Table("test.arrow")
```
`input.x1` looks like this:
```
1460-element Vector{Zarr.MaxLengthStrings.MaxLengthString{2, UInt32}}:
 "1A"
 ⋮
 "8."
```
while `output.x1` looks like this:
```
1460-element Arrow.List{String, Int32, Vector{UInt8}}:
 "1\0"
 ⋮
 "\0\0"
```
[GitHub] [arrow-julia] dmbates opened a new issue, #360: Creating DictEncoded in the presence of missing values
dmbates opened a new issue, #360: URL: https://github.com/apache/arrow-julia/issues/360 When, e.g., a `PooledArray` column that contains missing values is converted to `DictEncoded`, the dictionary is based on the result of `DataAPI.refpool`, which includes `missing`. As a result both the dictionary and the index vector contain missing values, which confuses Pandas. The missing value in the dictionary can be skipped because it is never referenced in the index vector.
```julia
julia> using Arrow, DataAPI, PooledArrays

julia> tbl = (; a = PooledArray([missing, "a", "b", "a"]))
(a = Union{Missing, String}[missing, "a", "b", "a"],)

julia> DataAPI.refarray(tbl.a)
4-element Vector{UInt32}:
 0x0001
 0x0002
 0x0003
 0x0002

julia> DataAPI.refpool(tbl.a)
3-element Vector{Union{Missing, String}}:
 missing
 "a"
 "b"

julia> Arrow.write("tbl.arrow", tbl)
"tbl.arrow"
```
In the `read_table` result we see that there is a `null` in the dictionary at Python index 0 that is never referenced in the indices vector.
```python
$ python
Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:35:26) [GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.feather as fea
>>> fea.read_table("tbl.arrow")
pyarrow.Table
a: dictionary
a: [
  -- dictionary: [null,"a","b"]
  -- indices: [null,1,2,1]]
>>> fea.read_feather('nyc_mv_collisions_202201.arrow')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/feather.py", line 231, in read_feather
    return (read_table(
  File "pyarrow/array.pxi", line 823, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 3913, in pyarrow.lib.Table._to_pandas
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 818, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 1170, in _table_to_blocks
    return [_reconstruct_block(item, columns, extension_columns)
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 1170, in <listcomp>
    return [_reconstruct_block(item, columns, extension_columns)
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 757, in _reconstruct_block
    cat = _pandas_api.categorical_type.from_codes(
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 687, in from_codes
    dtype = CategoricalDtype._from_values_or_dtype(
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 299, in _from_values_or_dtype
    dtype = CategoricalDtype(categories, ordered)
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 186, in __init__
    self._finalize(categories, ordered, fastpath=False)
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 340, in _finalize
    categories = self.validate_categories(categories, fastpath=fastpath)
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 534, in validate_categories
    raise ValueError("Categorical categories cannot be null")
ValueError: Categorical categories cannot be null
```
One possible approach is to check for `missing` in the refpool, find its index, delete it from the refpool, and rewrite the refarray to replace that index by `missing`.
[GitHub] [arrow-testing] zeroshade merged pull request #82: Add Go Parquet DeltaBitPacking test data
zeroshade merged PR #82: URL: https://github.com/apache/arrow-testing/pull/82
[GitHub] [arrow-julia] MrHenning opened a new issue, #363: Error when opening files with MICROSECOND in Dates.Time
MrHenning opened a new issue, #363: URL: https://github.com/apache/arrow-julia/issues/363 When opening a table that contains a `Time` object with `MICROSECOND` resolution (which I think is the default with `python`/`pandas`) I get an error. Example:

1. Create a `python`/`pandas` dataframe:
```python
pd.DataFrame(dict(
    i=range(0, 10),
    time=[datetime.time(hour=i) for i in range(0, 10)]
)).to_feather('~/python_time_df.arrow')
```
2. Read the file in `julia`:
```julia
Arrow.Table("~/python_time_df.arrow", convert=true)
```
yields the error
```
Failed to show value: MethodError: no method matching Int64(::Arrow.Time{Arrow.Flatbuf.TimeUnits.MICROSECOND, Int64})
Closest candidates are:
  (::Type{T})(!Matched::AbstractChar) where T<:Union{Int32, Int64} at char.jl:51
  (::Type{T})(!Matched::AbstractChar) where T<:Union{AbstractChar, Number} at char.jl:50
  (::Type{T})(!Matched::BigInt) where T<:Union{Int128, Int16, Int32, Int64, Int8} at gmp.jl:359
  ...
- Dates.Time(::Arrow.Time{Arrow.Flatbuf.TimeUnits.MICROSECOND, Int64}, ::Int64, ::Int64, ::Int64, ::Int64, ::Int64, ::Dates.AMPM)@types.jl:412
- fromarrow(::Type{Dates.Time}, ::Arrow.Time{Arrow.Flatbuf.TimeUnits.MICROSECOND, Int64})@ArrowTypes.jl:157
- fromarrow(::Type{Union{Missing, Dates.Time}}, ::Arrow.Time{Arrow.Flatbuf.TimeUnits.MICROSECOND, Int64})@ArrowTypes.jl:161
- getin...@primitive.jl:46[inlined]
- _getin...@abstractarray.jl:1274[inlined]
- getin...@abstractarray.jl:1241[inlined]
- isassigned(::Arrow.Primitive{Union{Missing, Dates.Time}, Vector{Arrow.Time{Arrow.Flatbuf.TimeUnits.MICROSECOND, Int64}}}, ::Int64, ::Int64)@abstractarray.jl:565
- alignment(::IOContext{IOBuffer}, ::AbstractVecOrMat, ::Vector{Int64}, ::Vector{Int64}, ::Int64, ::Int64, ::Int64, ::Int64)@arrayshow.jl:68
- _print_matrix(::IOContext{IOBuffer}, ::AbstractVecOrMat, ::String, ::String, ::String, ::String, ::String, ::String, ::Int64, ::Int64, ::UnitRange{Int64}, ::UnitRange{Int64})@arrayshow.jl:207
- print_matrix(::IOContext{IOBuffer}, ::Arrow.Primitive{Union{Missing, Dates.Time}, Vector{Arrow.Time{Arrow.Flatbuf.TimeUnits.MICROSECOND, Int64}}}, ::String, ::String, ::String, ::String, ::String, ::String, ::Int64, ::Int64)@arrayshow.jl:171
- print_ar...@arrayshow.jl:358[inlined]
- show(::IOContext{IOBuffer}, ::MIME{Symbol("text/plain")}, ::Arrow.Primitive{Union{Missing, Dates.Time}, Vector{Arrow.Time{Arrow.Flatbuf.TimeUnits.MICROSECOND, Int64}}})@arrayshow.jl:399
- show_richest(::IOContext{IOBuffer}, ::Any)@PlutoRunner.jl:1157
- show_richest_withretur...@plutorunner.jl:1095[inlined]
- format_output_default(::Any, ::Any)@PlutoRunner.jl:995
- var"#format_output#60"(::IOContext{Base.DevNull}, ::typeof(Main.PlutoRunner.format_output), ::Any)@PlutoRunner.jl:1012
- formatted_result_of(::Base.UUID, ::Base.UUID, ::Bool, ::Vector{String}, ::Nothing, ::Module)@PlutoRunner.jl:905
- top-level sc...@workspacemanager.jl:476
```
Defining
```julia
ArrowTypes.fromarrow(::Type{Dates.Time}, x::Arrow.Time{Arrow.Flatbuf.TimeUnits.MICROSECOND, Int64}) = convert(Dates.Time, x)
```
seems to fix the error.
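The missing method amounts to a unit conversion: an Arrow `time64[us]` value is a count of microseconds since midnight, which maps to `Dates.Time` once scaled to nanoseconds. A sketch of that arithmetic (the microsecond count is illustrative):

```julia
using Dates

# 01:00:00 expressed as microseconds since midnight
us = 3_600_000_000

# Dates.Time accepts a Nanosecond count since midnight
t = Dates.Time(Dates.Nanosecond(us * 1_000))
# t == Dates.Time(1), i.e. 01:00:00
```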
[GitHub] [arrow-julia] TanookiToad opened a new issue, #364: Values in PooledArray are incorrectly saved
TanookiToad opened a new issue, #364: URL: https://github.com/apache/arrow-julia/issues/364 If a float-type element in `PooledVector{Real, UInt32, Vector{UInt32}}` is replaced by an integer, Arrow will incorrectly save this value.
```
using Arrow, DataFrames, PooledArrays
df = DataFrame(x = PooledArray(Vector{Real}([1.0])))
df[1, 1] = 2
Arrow.write("test.arrow", df)
Arrow.Table("test.arrow") |> DataFrame
```
The code above incorrectly saves `2` as `1.0e-323`:
```
1×1 DataFrame
 Row │ x
     │ Float64
─────┼──────────
   1 │ 1.0e-323
```
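The corrupted value is consistent with a bit-level reinterpretation: the `Int64` value `2` stored in the heterogeneous `Real` pool is read back as a `Float64` bit pattern, which is a tiny subnormal. A quick stdlib check of that hypothesis:

```python
import struct

# Pack the integer 2 as a little-endian 64-bit integer, then read the
# same 8 bytes back as an IEEE-754 double.
bits = struct.pack("<q", 2)
val = struct.unpack("<d", bits)[0]
print(val)  # 1e-323 (printed as 1.0e-323 by Julia)
```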
[GitHub] [arrow-julia] ericphanson closed issue #232: Serializing `Dict{String,Real}` results in garbage values
ericphanson closed issue #232: Serializing `Dict{String,Real}` results in garbage values URL: https://github.com/apache/arrow-julia/issues/232
[GitHub] [arrow-julia] quinnj closed issue #364: PooledArray are incorrectly saved
quinnj closed issue #364: PooledArray are incorrectly saved URL: https://github.com/apache/arrow-julia/issues/364
[GitHub] [arrow] kou opened a new issue, #14816: [Release] Make dev/release/06-java-upload.sh reusable from other projects
kou opened a new issue, #14816: URL: https://github.com/apache/arrow/issues/14816

### Describe the enhancement requested

https://github.com/apache/arrow-adbc is one use case. See also: https://github.com/apache/arrow-adbc/pull/174#discussion_r1037547584

### Component(s)

Packaging
[GitHub] [arrow] ycyang-26 closed issue #14798: [parquet go] write parquet data code sample
ycyang-26 closed issue #14798: [parquet go] write parquet data code sample URL: https://github.com/apache/arrow/issues/14798
[GitHub] [arrow] kou opened a new issue, #14819: [CI][RPM] CentOS 9 Stream nightly CI is failing
kou opened a new issue, #14819: URL: https://github.com/apache/arrow/issues/14819

### Describe the bug, including details regarding any error messages, version, and platform.

https://github.com/ursacomputing/crossbow/actions/runs/3590866755/jobs/6044735050#step:6:1864

```text
CMake Error at /usr/share/cmake/Modules/CMakeTestCCompiler.cmake:66 (message):
  The C compiler "/opt/rh/gcc-toolset-12/root/usr/bin/gcc" is not able to compile a simple test program.
  It fails with the following output:
  Change Dir: /root/rpmbuild/BUILD/apache-arrow-11.0.0.dev189/cpp/redhat-linux-build/CMakeFiles/CMakeTmp
  Run Build Command(s):/usr/bin/gmake -f Makefile cmTC_427a1/fast && /usr/bin/gmake -f CMakeFiles/cmTC_427a1.dir/build.make CMakeFiles/cmTC_427a1.dir/build
  gmake[1]: Entering directory '/build/rpmbuild/BUILD/apache-arrow-11.0.0.dev189/cpp/redhat-linux-build/CMakeFiles/CMakeTmp'
  Building C object CMakeFiles/cmTC_427a1.dir/testCCompiler.c.o
-- Check for working C compiler: /opt/rh/gcc-toolset-12/root/usr/bin/gcc - broken
  /opt/rh/gcc-toolset-12/root/usr/bin/gcc -O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -o CMakeFiles/cmTC_427a1.dir/testCCompiler.c.o -c /root/rpmbuild/BUILD/apache-arrow-11.0.0.dev189/cpp/redhat-linux-build/CMakeFiles/CMakeTmp/testCCompiler.c
  cc1: fatal error: inaccessible plugin file /opt/rh/gcc-toolset-12/root/usr/lib/gcc/x86_64-redhat-linux/12/plugin/gcc-annobin.so expanded from short plugin name gcc-annobin: No such file or directory
```

### Component(s)

Continuous Integration, Packaging
[GitHub] [arrow] kou closed issue #14784: [Dev] Allow users to assign issues with a comment on the issue
kou closed issue #14784: [Dev] Allow users to assign issues with a comment on the issue URL: https://github.com/apache/arrow/issues/14784
[GitHub] [arrow] AlenkaF opened a new issue, #14822: [Docs] Update the documentation to include the new GitHub issue workflow
AlenkaF opened a new issue, #14822: URL: https://github.com/apache/arrow/issues/14822

### Describe the enhancement requested

As we move toward GitHub issues for Arrow issue reports, the documentation needs to be updated so that contributors have a place to look for information about the new workflow. The issue will be divided into multiple subtasks (parts of the documentation) to make review easier:

- [ ] https://github.com/apache/arrow README
- [ ] https://github.com/apache/arrow/tree/master/r README
- [ ] https://arrow.apache.org/community/
- [ ] https://arrow.apache.org/docs/dev/developers/guide/step_by_step/finding_issues.html
- [ ] https://arrow.apache.org/docs/dev/developers/guide/communication.html
- [ ] https://arrow.apache.org/docs/dev/developers/bug_reports.html#bug-reports
- [ ] Version specific project docs warnings https://arrow.apache.org/docs/

Tracked here: https://issues.apache.org/jira/browse/ARROW-18363 https://github.com/apache/arrow-site/pull/275 https://github.com/apache/arrow/pull/14687

### Component(s)

Documentation
[GitHub] [arrow] assignUser opened a new issue, #14824: [CI] r-binary-packages should only upload artifacts if all tests succeed
assignUser opened a new issue, #14824: URL: https://github.com/apache/arrow/issues/14824

### Describe the enhancement requested

Currently the upload step is missing a dependency on the CentOS binary test.

### Component(s)

Continuous Integration
[GitHub] [arrow] lwhite1 opened a new issue, #14825: [Java] [Doc] Improve documentation for streaming file handling in VectorSchemaRoot
lwhite1 opened a new issue, #14825: URL: https://github.com/apache/arrow/issues/14825

### Describe the enhancement requested

See comments in issue #14812.

### Component(s)

Documentation, Java
[GitHub] [arrow] LucyMcGowan opened a new issue, #14826: write_dataset is crashing on my machine
LucyMcGowan opened a new issue, #14826: URL: https://github.com/apache/arrow/issues/14826

### Describe the bug, including details regarding any error messages, version, and platform.

When I run the following example from the documentation, it seems to crash.

```
library(arrow)
one_level_tree <- tempfile()
write_dataset(mtcars, one_level_tree, partitioning = "cyl")
```

This reprex appears to crash R. See standard output and standard error for more details.

Standard output and error

```sh
✖ Install the styler package in order to use `style = TRUE`.

 *** caught illegal operation ***
address 0x11746ec1f, cause 'illegal opcode'

Traceback:
 1: ExecPlan_Write(self, node, prepare_key_value_metadata(node$final_metadata()), ...)
 2: plan$Write(final_node, options, path_and_fs$fs, path_and_fs$path, partitioning, basename_template, existing_data_behavior, max_partitions, max_open_files, max_rows_per_file, min_rows_per_group, max_rows_per_group)
 3: write_dataset(mtcars, one_level_tree, partitioning = "cyl")
 4: eval(expr, envir, enclos)
 5: eval(expr, envir, enclos)
 6: eval_with_user_handlers(expr, envir, enclos, user_handlers)
 7: withVisible(eval_with_user_handlers(expr, envir, enclos, user_handlers))
 8: withCallingHandlers(withVisible(eval_with_user_handlers(expr, envir, enclos, user_handlers)), warning = wHandler, error = eHandler, message = mHandler)
 9: doTryCatch(return(expr), name, parentenv, handler)
10: tryCatchOne(expr, names, parentenv, handlers[[1L]])
11: tryCatchList(expr, classes, parentenv, handlers)
12: tryCatch(expr, error = function(e) {call <- conditionCall(e)if (!is.null(call)) {if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L)dcall <- deparse(call, nlines = 1L) prefix <- paste("Error in", dcall, ": ")LONG <- 75Lsm <- strsplit(conditionMessage(e), "\n")[[1L]]w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")if (is.na(w)) w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b")if (w > LONG) prefix <- paste0(prefix, "\n ")}else prefix <- "Error : "msg <- paste0(prefix, conditionMessage(e), "\n") .Internal(seterrmessage(msg[1L]))if (!silent && isTRUE(getOption("show.error.messages"))) {cat(msg, file = outFile) .Internal(printDeferredWarnings())}invisible(structure(msg, class = "try-error", condition = e))})
13: try(f, silent = TRUE)
14: handle(ev <- withCallingHandlers(withVisible(eval_with_user_handlers(expr, envir, enclos, user_handlers)), warning = wHandler, error = eHandler, message = mHandler))
15: timing_fn(handle(ev <- withCallingHandlers(withVisible(eval_with_user_handlers(expr, envir, enclos, user_handlers)), warning = wHandler, error = eHandler, message = mHandler)))
16: evaluate_call(expr, parsed$src[[i]], envir = envir, enclos = enclos, debug = debug, last = i == length(out), use_try = stop_on_error != 2L, keep_warning = keep_warning, keep_message = keep_message, output_handler = output_handler, include_timing = include_timing)
17: evaluate::evaluate(...)
18: evaluate(code, envir = env, new_device = FALSE, keep_warning = !isFALSE(options$warning), keep_message = !isFALSE(options$message), stop_on_error = if (is.numeric(options$error)) options$error else {if (options$error && options$include) 0L else 2L}, output_handler = knit_handlers(options$render, options))
19: in_dir(input_dir(), evaluate(code, envir = env, new_device = FALSE, keep_warning = !isFALSE(options$warning), keep_message = !isFALSE(options$message), stop_on_error = if (is.numeric(options$error)) options$error else {if (options$error && options$include) 0L else 2L}, output_handler = knit_handlers(options$render, options)))
20: eng_r(options)
21: block_exec(params)
22: call_block(x)
23: process_group.block(group)
24: process_group(group)
25: withCallingHandlers(if (tangle) process_tangle(group) else process_group(group), error = function(e) {setwd(wd) cat(res, sep = "\n", file = output %n% "")message("Quitting from lines ", paste(current_lines(i), collapse = "-"), " (", knit_concord$get("infile"), ") ")})
26: process_file(text, output)
27: knitr::knit(knit_input, knit_output, envir = envir, quiet = quiet)
28: rmarkdown::render(input, quiet = TRUE, envir = globalenv(), encoding = "UTF-8")
29: (function (input) {rmarkdown::render(input, quiet = TRUE, envir = globalenv(), encoding = "UTF-8")})(input = base::quote("front-ray_reprex.R"))
30: (function (what, args, quote = FALSE, envir = parent.frame()) {if (!is.list(arg
```
[GitHub] [arrow] paleolimbot closed issue #14813: [R] Install from local source on MacOS fails build after Abseil build step
paleolimbot closed issue #14813: [R] Install from local source on MacOS fails build after Abseil build step URL: https://github.com/apache/arrow/issues/14813
[GitHub] [arrow] kou closed issue #14801: [C++] CMake config files for Dataset are not copied to the install directory
kou closed issue #14801: [C++] CMake config files for Dataset are not copied to the install directory URL: https://github.com/apache/arrow/issues/14801
[GitHub] [arrow] kou opened a new issue, #14828: [CI][Conda] Nightly CI jobs aren't maintained
kou opened a new issue, #14828: URL: https://github.com/apache/arrow/issues/14828 ### Describe the bug, including details regarding any error messages, version, and platform. Nightly CI jobs for Conda were fixed by @h-vetinari and @xhochy in GH-14102, but most nightly CI jobs have been failing again for the past month. Can we maintain them? If we can't maintain them, can we remove them? Are they useful for maintaining https://github.com/conda-forge/arrow-cpp-feedstock ? http://crossbow.voltrondata.com/

Task Name | Since Last Successful Build | Last Successful Commit | Last Successful Build | First Failure | 9 Days Ago | 8 Days Ago | 7 Days Ago | 6 Days Ago | 5 Days Ago | 4 Days Ago | 3 Days Ago | 2 Days Ago | Most Recent Failure
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
conda-win-vs2019-py37-r40 | 78 days | 5e6da78 | pending | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-win-vs2019-py38 | 65 days | 60c9383 | pending | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-linux-gcc-py310-arm64 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-linux-gcc-py310-cpu | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-linux-gcc-py37-arm64 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-linux-gcc-py37-cpu-r40 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-linux-gcc-py37-cpu-r41 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-linux-gcc-py37-cuda | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-linux-gcc-py37-ppc64le | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-linux-gcc-py38-arm64 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-linux-gcc-py38-cpu | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-linux-gcc-py38-cuda | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-linux-gcc-py38-ppc64le | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-linux-gcc-py39-arm64 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-linux-gcc-py39-cpu | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-linux-gcc-py39-cuda | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-linux-gcc-py39-ppc64le | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-osx-arm64-clang-py38 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-osx-arm64-clang-py39 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-osx-clang-py37-r40 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-osx-clang-py37-r41 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-osx-clang-py38 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-osx-clang-py39 | 29 days | 8e3a1e1 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-linux-gcc-py310-cuda | 15 days | 501b799 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-linux-gcc-py310-ppc64le | 15 days | 501b799 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-osx-arm64-clang-py310 | 15 days | 501b799 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
conda-osx-clang-py310 | 15 days | 501b799 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail

### Component(s) Continuous Integration, Packaging -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kou opened a new issue, #14829: [CI][R][Homebrew] Nightly CI jobs aren't maintained
kou opened a new issue, #14829: URL: https://github.com/apache/arrow/issues/14829 ### Describe the bug, including details regarding any error messages, version, and platform. The `homebrew-r-autobrew` and `homebrew-r-brew` nightly CI jobs have been failing for the past 3 months. Can we maintain them? If we can't maintain them, can we remove them? Are they useful for maintaining something? http://crossbow.voltrondata.com/

Task Name | Since Last Successful Build | Last Successful Commit | Last Successful Build | First Failure | 9 Days Ago | 8 Days Ago | 7 Days Ago | 6 Days Ago | 5 Days Ago | 4 Days Ago | 3 Days Ago | 2 Days Ago | Most Recent Failure
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
homebrew-r-autobrew | 94 days | cf27001 | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail
homebrew-r-brew | 83 days | a63e60b | pass | fail | fail | fail | fail | fail | fail | fail | fail | fail | fail

### Component(s) Continuous Integration, R -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kou closed issue #14819: [CI][RPM] CentOS 9 Stream nightly CI is failed
kou closed issue #14819: [CI][RPM] CentOS 9 Stream nightly CI is failed URL: https://github.com/apache/arrow/issues/14819 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kou closed issue #14816: [Release] Make dev/release/06-java-upload.sh reusable from other project
kou closed issue #14816: [Release] Make dev/release/06-java-upload.sh reusable from other project URL: https://github.com/apache/arrow/issues/14816 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] phpsxg opened a new issue, #14834: write_dataset how to add and update data
phpsxg opened a new issue, #14834: URL: https://github.com/apache/arrow/issues/14834 ### Describe the usage question you have. Please include as many useful details as possible. First, save a parquet dataset containing 5 rows:

```
dataset_name = 'test_update'
df = pd.DataFrame({'one': [-1, 3, 2.5, 2.5, 2.5],
                   'two': ['foo', 'bar', 'baz', 'foo', 'foo'],
                   'three': [True, False, True, False, False]})
table = pa.Table.from_pandas(df)
ds.write_dataset(table, dataset_name,
                 existing_data_behavior='overwrite_or_ignore',
                 format="parquet")
```

Then I want to add two new rows, so that reading the dataset back yields 7 rows in total. The new data is written as follows:

```
df = pd.DataFrame({'one': [1, 2],
                   'two': ['foo-insert1', 'foo-insert2'],
                   'three': [True, False]})
table = pa.Table.from_pandas(df)
ds.write_dataset(table, dataset_name,
                 # existing_data_behavior='delete_matching',
                 existing_data_behavior='overwrite_or_ignore',
                 format="parquet")
```

1. **But this overwrites the original data, leaving only the two new rows. How can I append new data on top of the existing data?**
2. **A second question: how can I update data matching a condition? For example:**
> Update the rows where one=-1 and two='foo', setting three to False
- python=3.10
- pyarrow=10.0.0
### Component(s) Parquet, Python -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] Qwertzi01 opened a new issue, #14835: [Python] Import/usage of pyarrow results in 'Invalid machine command'
Qwertzi01 opened a new issue, #14835: URL: https://github.com/apache/arrow/issues/14835 ### Describe the bug, including details regarding any error messages, version, and platform. Hello and good morning, I get the error message 'Invalid machine command' when I try to use/import pyarrow v10 (and also older versions). * Operating system: Debian 11 (Bullseye) * Kernel version: `Linux 5.10.0-15-amd64 #1 SMP Debian 5.10.120-1 (2022-06-09) x86_64 GNU/Linux` * Python version: 3.9.2 Console output after installing pyarrow with the latest pip:

```
Python 3.9.2 (default, Feb 28 2021, 17:03:44)
[GCC 10.2.1 20210110] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
Invalid machine command
```

Kernel log when I try to use Freqtrade, which needs pyarrow v10: `Dec 3 17:01:08 server kernel: [617405.716600] traps: freqtrade[584823] trap invalid opcode ip:7f848859ae22 sp:7ffee55e2750 error:0 in libarrow.so.1000[7f84884a2000+189e000]` If it helps, here is the original Freqtrade issue regarding pyarrow: https://github.com/freqtrade/freqtrade/issues/7839 ### Component(s) Python -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
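An editorial note on diagnosing this class of failure: the kernel log above ("trap invalid opcode") shows the process died from an illegal-instruction trap (SIGILL), which for a prebuilt libarrow usually means the host CPU (or VM) lacks an instruction-set extension the wheel was compiled for. A small diagnostic sketch follows; the flag list is an assumption (SSE4.2/POPCNT are commonly cited baselines for pyarrow binary wheels), not taken from the issue:

```python
def missing_cpu_flags(cpuinfo_text, required=("sse4_2", "popcnt", "avx")):
    """Return the required CPU flags absent from a /proc/cpuinfo dump."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # e.g. "flags\t\t: fpu vme de pse ... sse4_2 popcnt avx"
            flags |= set(line.split(":", 1)[1].split())
    return [f for f in required if f not in flags]

# On Linux, check the real machine with:
#   missing_cpu_flags(open("/proc/cpuinfo").read())
sample = "processor : 0\nflags\t\t: fpu sse sse2 popcnt\n"
print(missing_cpu_flags(sample))  # → ['sse4_2', 'avx']
```

If any required flag is reported missing, building pyarrow from source on the target machine (or running on newer hardware) is the usual remedy.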
[GitHub] [arrow] phpsxg opened a new issue, #14837: ds.write_dataset how to implement new data?
phpsxg opened a new issue, #14837: URL: https://github.com/apache/arrow/issues/14837 ### Describe the bug, including details regarding any error messages, version, and platform. When I use pq.write_to_dataset(existing_data_behavior='overwrite_or_ignore') I am able to add data. Why does ds.write_dataset(existing_data_behavior='overwrite_or_ignore') overwrite the existing data instead?

```
df = pd.DataFrame({'one': [-1, 3, 2.5, 2.5, 2.5],
                   'two': ['foo', 'bar', 'baz', 'foo', 'foo'],
                   'three': [True, False, True, False, True],
                   'four': [datetime.date(2021, 1, 3), datetime.date(2021, 3, 1),
                            datetime.date(2021, 1, 1), datetime.date(2021, 3, 11),
                            datetime.date(2021, 4, 1)]
                   }, index=list('abcde'))
table = pa.Table.from_pandas(df, preserve_index=True)
```

**pq.write_to_dataset**

```
pq.write_to_dataset(table, root_path=root_path,
                    # existing_data_behavior='delete_matching',
                    existing_data_behavior='overwrite_or_ignore',
                    use_legacy_dataset=False)
```

**ds.write_dataset**

```
ds.write_dataset(table, root_path,
                 # existing_data_behavior='delete_matching',
                 existing_data_behavior='overwrite_or_ignore',
                 format="parquet")
```

### Component(s) Parquet, Python -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] sayantikabanik opened a new issue, #14840: Color coding for warnings [Building documentation]
sayantikabanik opened a new issue, #14840: URL: https://github.com/apache/arrow/issues/14840 ### Describe the enhancement requested ### Short description While building the documentation, I noticed that the warnings appearing on the console are color coded `red`, which I assumed meant errors (I was a bit alarmed). ### Screenshot for reference https://user-images.githubusercontent.com/17350312/205615808-a5e5b757-f757-439b-9ef7-0c73693413d0.png ### Expectation I have usually seen warnings printed in `yellow`/`orange` on the console, with errors highlighted or color coded `red`. ### Component(s) Documentation -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] assignUser closed issue #14837: ds.write_dataset how to implement new data?
assignUser closed issue #14837: ds.write_dataset how to implement new data? URL: https://github.com/apache/arrow/issues/14837 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] benibus opened a new issue, #14842: [C++] Propagate some errors in JSON chunker
benibus opened a new issue, #14842: URL: https://github.com/apache/arrow/issues/14842 ### Describe the bug, including details regarding any error messages, version, and platform. The current JSON `Chunker` defers all error reporting to a later parsing stage when `ParseOptions::newlines_in_values = true`. However, this poses issues when sequentially chunking buffers. Consider this:

```
std::shared_ptr<Buffer> whole, rest;
// Trailing right-bracket
chunker->Process(Buffer::FromString("{\"a\":0}}"), &whole, &rest);
```

Here, `whole` will be `{"a":0}` but `rest` will be `}`, which doesn't start a valid JSON block. As such, for the next buffer, you can't then call `ProcessWithPartial` with `rest` as its `partial` argument without crashing (via DCHECK, if enabled). This effectively prevents us from handling the error at all. ### Component(s) C++ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] markjschreiber opened a new issue, #14844: VectorValueComparator should skip the null test for NonNullable FieldTypes types
markjschreiber opened a new issue, #14844: URL: https://github.com/apache/arrow/issues/14844 ### Describe the enhancement requested I recently ran a profiler on a simple sort implementation and noticed that the VectorValueComparator spends quite a lot of time checking the values to be compared for `null` values. In this case I had already declared the `FieldType` to be `notNullable`. Unless I am misunderstanding something, the values cannot be `null`, so the comparator is making unnecessary and expensive comparisons. As a test I made a comparator that skips the null tests; it runs ~12% faster as a result. Based on this I think the `VectorValueComparator` should skip the `null` test for non-nullable `FieldType`s. I'd be happy to contribute a change to do this if you think it would be reasonable. ### Component(s) Java -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou closed issue #14755: Expose QuotingStyle to Python
pitrou closed issue #14755: Expose QuotingStyle to Python URL: https://github.com/apache/arrow/issues/14755 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou closed issue #14842: [C++] Propagate some errors in JSON chunker
pitrou closed issue #14842: [C++] Propagate some errors in JSON chunker URL: https://github.com/apache/arrow/issues/14842 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kou closed issue #14824: [CI] r-binary-packages should only upload artifacts if all tests succeed
kou closed issue #14824: [CI] r-binary-packages should only upload artifacts if all tests succeed URL: https://github.com/apache/arrow/issues/14824 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm opened a new issue, #14846: [Dev] Update download_rc_binaries to be able to fetch from GitHub Releases
lidavidm opened a new issue, #14846: URL: https://github.com/apache/arrow/issues/14846 ### Describe the enhancement requested ADBC is using GitHub Releases instead of Artifactory; it would be nice to share a little bit of this infrastructure instead of having to replicate it all. (And eventually Arrow may be able to publish binaries on GitHub as well; we already do with Crossbow.) See https://github.com/apache/arrow-adbc/pull/215#discussion_r1040123283 ### Component(s) Developer Tools -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] jorisvandenbossche closed issue #14840: [Docs] Color coding for warnings [Building documentation]
jorisvandenbossche closed issue #14840: [Docs] Color coding for warnings [Building documentation] URL: https://github.com/apache/arrow/issues/14840 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] assignUser opened a new issue, #14849: [CI] R install-local builds sometimes fail because sccache times out
assignUser opened a new issue, #14849: URL: https://github.com/apache/arrow/issues/14849 ### Describe the bug, including details regarding any error messages, version, and platform. The sccache server times out while starting, which causes the build to fail, e.g. https://github.com/ursacomputing/crossbow/actions/runs/3625242046/jobs/6113064050#step:7:1134 ### Component(s) Continuous Integration -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm closed issue #14835: [Python] Import/usage of pyarrow results in 'Invalid machine command'
lidavidm closed issue #14835: [Python] Import/usage of pyarrow results in 'Invalid machine command' URL: https://github.com/apache/arrow/issues/14835 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] youngfn opened a new issue, #14853: [C++][Streaming execution] can't write data after hash_distinct
youngfn opened a new issue, #14853: URL: https://github.com/apache/arrow/issues/14853 ### Describe the usage question you have. Please include as many useful details as possible. Hi, when I test the streaming execution engine, I always get an error like "Unsupported Type:list" in the write node. This happens when I use hash_distinct in the aggregate node (**but it succeeds with hash_count_distinct or hash_count**). Is anything wrong with my demo? Is this a bug? I'm not sure. Thanks for any hint!

```
// demo
cp::Declaration::Sequence({
    {"scan", scan_node_options},
    {"filter", cp::FilterNodeOptions{filter_opt}},
    {"project", cp::ProjectNodeOptions{
        {cp::field_ref("id"), cp::field_ref("class_id"), cp::field_ref("gender"),
         cp::field_ref("age"), cp::field_ref("term"), expr},
        {"id", "class_id", "name", "gender", "age", "score", "term"}}},
    {"aggregate", cp::AggregateNodeOptions{
        /*aggregates=*/{{"hash_distinct", nullptr, "id", "distinct(id)"}}, {"id"}}},
    {"write", write_node_options}
}).AddToPlan(plan.get());

if (!plan->Validate().ok()) {
  std::cout << "plan is not validate" << std::endl;
  return;
}
std::cout << "Execution Plan Created : " << plan->ToString() << std::endl;

// start the ExecPlan
plan->StartProducing();
auto future = plan->finished();
future.status();
future.Wait();
```

Error print:

```
arrow error:Invalid: Unsupported Type:list
arrow error:Invalid: Unsupported Type:list
/tmp/tmp.GwaQRyi1BD/src/arrow/csv/writer.cc:454  MakePopulator(*schema->field(col), end_chars, options.delimiter, null_string, options.quoting_style, options.io_context.pool())
arrow error:Invalid: Unsupported Type:list
/tmp/tmp.GwaQRyi1BD/src/arrow/csv/writer.cc:454  MakePopulator(*schema->field(col), end_chars, options.delimiter, null_string, options.quoting_style, options.io_context.pool())
/tmp/tmp.GwaQRyi1BD/src/arrow/dataset/file_csv.cc:335  csv::MakeCSVWriter(destination, schema, *csv_options->write_options)
```

### Component(s) C++ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm closed issue #14846: [Dev] Update download_rc_binaries to be able to fetch from GitHub Releases
lidavidm closed issue #14846: [Dev] Update download_rc_binaries to be able to fetch from GitHub Releases URL: https://github.com/apache/arrow/issues/14846 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] AlenkaF opened a new issue, #14854: [Docs] Make changes to arrow/ and arrow/r/README.md
AlenkaF opened a new issue, #14854: URL: https://github.com/apache/arrow/issues/14854 ### Describe the enhancement requested Make changes to `arrow/` and `arrow/r/README.md` to reflect the change in the issue tracking workflow. ### Component(s) Documentation -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm opened a new issue, #14855: [C++] Zero-case union can't be imported via C Data Interface
lidavidm opened a new issue, #14855: URL: https://github.com/apache/arrow/issues/14855 The zero-case union is apparently not supported by Arrow C++'s C Data interface. I get:

```
'arrow_type' failed with Invalid: Invalid or unsupported format string: '+us:'
```

Reproducer for Python:

```python
import pyarrow as pa
from pyarrow.cffi import ffi

empty_union = pa.sparse_union([])
ptr = ffi.new("struct ArrowSchema*")
empty_union._export_to_c(int(ffi.cast("uintptr_t", ptr)))
pa.DataType._import_from_c(int(ffi.cast("uintptr_t", ptr)))
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "pyarrow/types.pxi", line 248, in pyarrow.lib.DataType._import_from_c
#   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
#   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
# pyarrow.lib.ArrowInvalid: Invalid or unsupported format string: '+us:'
```

_Originally posted by @paleolimbot in https://github.com/apache/arrow-nanoarrow/pull/81#discussion_r1041055778_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] assignUser opened a new issue, #14856: [CI] Azure builds fail with docker permission error
assignUser opened a new issue, #14856: URL: https://github.com/apache/arrow/issues/14856 ### Describe the bug, including details regarding any error messages, version, and platform. Several of our nightlies fail due to an issue with the docker install task used on azure: https://dev.azure.com/ursacomputing/crossbow/_build/results?buildId=40924&view=logs&j=50a69d0a-7972-5459-cdae-135ee6ebe312&t=13df7b5c-76db-5c26-6592-75581a9ed64a&l=3093 ### Component(s) Continuous Integration -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] workingnbar opened a new issue, #14860: Is there a way to call a custom compute function on a table.group_by aggregation?
workingnbar opened a new issue, #14860: URL: https://github.com/apache/arrow/issues/14860 ### Describe the usage question you have. Please include as many useful details as possible. Is there a way to call a custom compute function on a table.group_by aggregation? If so, what should the custom function return? I do not see an example in the documentation. ### Component(s) Python -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] toddfarmer opened a new issue, #14861: MIGRATION: Update project documentation to point to GitHub issues
toddfarmer opened a new issue, #14861: URL: https://github.com/apache/arrow/issues/14861 ### Describe the enhancement requested The Apache Arrow project documentation references Jira in a number of places. These references should be updated to point to GitHub issues. Additionally, a best-practices document should be started to establish emerging GitHub processes and policy. ### Component(s) Documentation -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] toddfarmer opened a new issue, #14862: Update Apache Arrow website references to Jira
toddfarmer opened a new issue, #14862: URL: https://github.com/apache/arrow/issues/14862 ### Describe the enhancement requested The Apache Arrow website has references to Jira which should be updated to point to GitHub. ### Component(s) Website -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] pitrou opened a new issue, #14863: [C++] Add `Append(std::optional...)` convenience methods to builders
pitrou opened a new issue, #14863: URL: https://github.com/apache/arrow/issues/14863 ### Describe the enhancement requested When you have a `std::optional<T>` of the right value type, it would be convenient to append it directly to a concrete `ArrayBuilder` subclass instead of having to query whether it has a value:

```c++
template <typename T>
Status Append(const std::optional<T>& value) {
  return (value) ? Append(*value) : AppendNull();
}

template <typename T>
void UnsafeAppend(const std::optional<T>& value) {
  if (value) {
    UnsafeAppend(*value);
  } else {
    UnsafeAppendNull();
  }
}
```

### Component(s) C++ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] lidavidm opened a new issue, #14864: [C++] Refactor string matching kernel to be usable outside of compute
lidavidm opened a new issue, #14864: URL: https://github.com/apache/arrow/issues/14864 ### Describe the enhancement requested From https://github.com/apache/arrow/pull/14082/files#r1041353323 It can be useful to use some of the string matching kernel functionality outside of a kernel context, e.g. to evaluate filters in Flight SQL/ADBC. While we can call the kernel on a single scalar, that has overhead (and requires ARROW_COMPUTE); we can instead refactor the string matching utilities into `arrow/util`. (Though this will still require ARROW_WITH_RE2.) ### Component(s) C++ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] kou closed issue #14856: [CI] Azure builds fail with docker permission error
kou closed issue #14856: [CI] Azure builds fail with docker permission error URL: https://github.com/apache/arrow/issues/14856 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow-adbc] dhirschfeld opened a new issue, #224: [Feature Request] Use AnyIO for Python Async
dhirschfeld opened a new issue, #224: URL: https://github.com/apache/arrow-adbc/issues/224 > *In particular, it can interleave I/O and conversion* If you're implementing an async interface, as a [`trio`](https://trio.readthedocs.io/en/stable/) user, it would be great if you could use [`anyio`](https://github.com/agronholm/anyio) rather than native `asyncio` features. This would enable the code to be used with any async library. Perhaps the most prominent Python library to support AnyIO is [`fastapi`](https://fastapi.tiangolo.com/async/#write-your-own-async-code), and that's where I'd (eventually) like to make use of `adbc` - asynchronously connecting to databases for displaying data in FastAPI dashboards. _Originally posted by @dhirschfeld in https://github.com/apache/arrow-adbc/issues/71#issuecomment-1340130033_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] necro351 opened a new issue, #14865: pqarrow.WriteArrowToColumn leaks memory from its memory.Allocator
necro351 opened a new issue, #14865: URL: https://github.com/apache/arrow/issues/14865 ### Describe the bug, including details regarding any error messages, version, and platform. I am benchmarking different Parquet libraries writing an Arrow buffer to file. My benchmark checks that the allocator is empty at the end of its run. I found that pqarrow retains a memory.Buffer when maybeParentNulls is true, and never releases it. I looked through the API for a Release() function or some other way to release this buffer but did not find anything. I noticed the unit tests do not check that the allocator is zeroed out. This is the if-statement I am concerned about:

```go
// WriteArrowToColumn writes apache arrow columnar data directly to a ColumnWriter.
// Returns non-nil error if the array data type is not compatible with the concrete
// writer type.
//
// leafArr is always a primitive (possibly dictionary encoded type).
// leafFieldNullable indicates whether the leaf array is considered nullable
// according to its schema in a Table or its parent array.
func WriteArrowToColumn(ctx context.Context, cw file.ColumnChunkWriter, leafArr arrow.Array, defLevels, repLevels []int16, leafFieldNullable bool) error {
	// Leaf nulls are canonical when there is only a single null element after a list
	// and it is at the leaf.
	colLevelInfo := cw.LevelInfo()
	singleNullable := (colLevelInfo.DefLevel == colLevelInfo.RepeatedAncestorDefLevel+1) && leafFieldNullable
	maybeParentNulls := colLevelInfo.HasNullableValues() && !singleNullable
	if maybeParentNulls {
		buf := memory.NewResizableBuffer(cw.Properties().Allocator()) // <--- NON-RELEASED ALLOC HERE
		buf.Resize(int(bitutil.BytesForBits(cw.Properties().WriteBatchSize())))
		cw.SetBitsBuffer(buf)
	}
	...
```

This is the suspicious allocation (I added a PrintStack call in my own custom debug allocator to print this):

```
goroutine 19 [running]:
runtime/debug.Stack()
	/usr/local/go/src/runtime/debug/stack.go:24 +0x65
runtime/debug.PrintStack()
	/usr/local/go/src/runtime/debug/stack.go:16 +0x19
gitlab.eng.vmware.com/taurus/data-mesh.git/compact-lake/rows.(*VerboseAllocator).Allocate(0xc0002d5da0, 0x80)
	/home/rick/data-mesh/compact-lake/rows/buffer_test.go:145 +0x6a
github.com/apache/arrow/go/v11/arrow/memory.(*Buffer).Reserve(0xc00011ee10, 0xc0001596b0?)
	/home/rick/go/pkg/mod/github.com/apache/arrow/go/v11@v11.0.0-20221206133351-50a164ec7f64/arrow/memory/buffer.go:110 +0x5b
github.com/apache/arrow/go/v11/arrow/memory.(*Buffer).resize(0xc00011ee10, 0x80, 0xf0?)
	/home/rick/go/pkg/mod/github.com/apache/arrow/go/v11@v11.0.0-20
```
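The "custom debug allocator" approach described above amounts to wrapping an allocator and counting outstanding bytes. The interface and types below only mirror the shape of arrow's `memory.Allocator`; this is a standalone sketch, not the real arrow/go API (which ships `memory.CheckedAllocator` for the same purpose).

```go
package main

import "fmt"

// Allocator mirrors the shape of arrow's memory.Allocator interface
// (standalone sketch; not the real API).
type Allocator interface {
	Allocate(size int) []byte
	Free(b []byte)
}

type goAllocator struct{}

func (goAllocator) Allocate(size int) []byte { return make([]byte, size) }
func (goAllocator) Free(b []byte)            {}

// countingAllocator wraps another Allocator and tracks outstanding bytes,
// the same idea as the custom debug allocator used to catch the leak.
type countingAllocator struct {
	wrapped Allocator
	live    int
}

func (c *countingAllocator) Allocate(size int) []byte {
	c.live += size
	return c.wrapped.Allocate(size)
}

func (c *countingAllocator) Free(b []byte) {
	c.live -= len(b)
	c.wrapped.Free(b)
}

func main() {
	alloc := &countingAllocator{wrapped: goAllocator{}}
	buf := alloc.Allocate(128)
	// A buffer that is never released leaves live > 0 at the end of the run.
	fmt.Println("outstanding bytes:", alloc.live)
	alloc.Free(buf)
	fmt.Println("outstanding bytes:", alloc.live)
}
```

Asserting `live == 0` at the end of a benchmark is exactly the check that surfaced the retained buffer.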
[GitHub] [arrow] westonpace opened a new issue, #14866: [C++] Remove internal GroupBy implementation
westonpace opened a new issue, #14866: URL: https://github.com/apache/arrow/issues/14866 ### Describe the enhancement requested Currently there are two ways to compute a group by. The supported way is to use an aggregate node in an exec plan. The second (internal) way is to use the internal function `arrow::internal::GroupBy`. This internal function simulates, but does not actually use, an aggregate node. The internal implementation has caused issues in the past: because we use the internal function for testing aggregates, and it behaves slightly differently, we did not notice an error in the aggregate node's invocation of aggregate kernels. The internal implementation also requires maintenance and significantly complicated #14352. I would like to remove it. Unfortunately, the internal implementation is used by tests, benchmarks, and pyarrow. However, we should be able to update those to use a friendly wrapper around exec plans. ### Component(s) C++
[GitHub] [arrow] youngfn closed issue #14853: [C++][Streaming execution] can't write data after hash_distinct
youngfn closed issue #14853: [C++][Streaming execution] can't write data after hash_distinct URL: https://github.com/apache/arrow/issues/14853
[GitHub] [arrow] lukester1975 opened a new issue, #14869: [C++] arrow.pc should have -DARROW_STATIC for Windows static builds
lukester1975 opened a new issue, #14869: URL: https://github.com/apache/arrow/issues/14869 ### Describe the enhancement requested Without it, the generated .pc file is insufficient (at least without "manually" defining ARROW_STATIC, which is unpleasant). Quick hack fix: https://github.com/lukester1975/arrow/commit/2a8efd9c0bf69fe1b466e157bd69e83a757c926e

* Cflags.private is quite new (https://gitlab.freedesktop.org/pkg-config/pkg-config/-/merge_requests/13), so this approach might be unpalatable.
* pkg-config-lite is too old to include it (no commits since 2016??). It does quietly ignore the field, though.
* vcpkg does its own merging of Cflags.private into Cflags, so it is not an issue if pkg-config doesn't understand Cflags.private there (my case).
* Obviously this applies to all platforms, not just Windows, but it should do no harm...?

It seems like there should be some sort of fix here rather than asking vcpkg to patch it! Regards ### Component(s) C++
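For context, the proposed change amounts to adding a `Cflags.private` line to the generated file. The fragment below is a hand-written sketch of what such an `arrow.pc` might contain; the actual field values depend on the build and this is not copied from the real template:

```
# arrow.pc (sketch; exact paths, version, and fields depend on the build)
prefix=/usr/local
libdir=${prefix}/lib
includedir=${prefix}/include

Name: Apache Arrow
Version: 10.0.0
Libs: -L${libdir} -larrow
Cflags: -I${includedir}
# Cflags.private is merged into Cflags when linking statically
# (pkg-config --static); consumers of a static build need the define:
Cflags.private: -DARROW_STATIC
```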
[GitHub] [arrow] pitrou opened a new issue, #14870: [C++][Parquet] Support min_value and max_value Statistics
pitrou opened a new issue, #14870: URL: https://github.com/apache/arrow/issues/14870 ### Describe the enhancement requested The `Statistics` structure in Parquet files provides two ways of specifying lower and upper bounds for a data page:

* `min` and `max` are legacy fields kept for compatibility with older writers, with ill-defined comparison semantics in most cases except for signed integers
* `min_value` and `max_value` are "new" fields (introduced in 2017! - see https://github.com/apache/parquet-format/commit/041708da1af52e7cb9288c331b542aa25b68a2b6 and https://github.com/apache/parquet-format/commit/bef5438990116725af041cdd8ced2bca0ed2608a) with well-defined comparison semantics depending on the logical type

Currently Parquet C++ supports only the legacy fields `min` and `max`. We should add support for reading and writing the newer ones, with the appropriate semantics on the write path. ### Component(s) Parquet
[GitHub] [arrow] pitrou closed issue #14870: [C++][Parquet] Support min_value and max_value Statistics
pitrou closed issue #14870: [C++][Parquet] Support min_value and max_value Statistics URL: https://github.com/apache/arrow/issues/14870
[GitHub] [arrow] jandom opened a new issue, #14871: pq.ParquetDataset usage with moto3 mocks?
jandom opened a new issue, #14871: URL: https://github.com/apache/arrow/issues/14871 ### Describe the usage question you have. Please include as many useful details as possible. Hi there, I'm trying to mock some S3 objects to write a test exercising a `pq.ParquetDataset`; this uses boto3, moto and pytest:

```python
import io

import boto3
import pandas as pd
import pyarrow.parquet as pq
from moto import mock_s3


@mock_s3
def test_ignore_moto3():
    s3 = boto3.resource("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="fake-bucket")
    parquet_object = s3.Object("fake-bucket", "dummy.parquet")

    buffer = io.BytesIO()
    df = pd.DataFrame([{"foo": 123, "bar": 123}])
    df.to_parquet(buffer, index=False)
    parquet_object.put(Body=buffer.getvalue())

    s3 = boto3.resource("s3")
    obj = s3.Object("fake-bucket", "dummy.parquet")
    print(obj.get()["Body"].read())

    ds = pq.ParquetDataset("s3://fake-bucket/dummy.parquet", use_legacy_dataset=False)
```

But unexpectedly this test dies:

```
>       ds = pq.ParquetDataset("s3://fake-bucket/dummy.parquet", use_legacy_dataset=False)
tests/virtual_screening/integration/test_results.py:46:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/opt/micromamba/envs/main/lib/python3.9/site-packages/pyarrow/parquet/core.py:1724: in __new__
    return _ParquetDatasetV2(
/opt/micromamba/envs/main/lib/python3.9/site-packages/pyarrow/parquet/core.py:2401: in __init__
    if filesystem.get_file_info(path_or_paths).is_file:
pyarrow/_fs.pyx:564: in pyarrow._fs.FileSystem.get_file_info
    ???
pyarrow/error.pxi:144: in pyarrow.lib.pyarrow_internal_check_status
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>       ???
E OSError: When getting information for key 'dummy.parquet' in bucket 'fake-bucket': AWS Error ACCESS_DENIED during HeadObject operation: No response body. ``` But the test output includes the file contents (so there is no typo or misconfiguration of the mock) ``` Captured stdout call - b'PAR1\x15\x04\x15\x10\x15\x14L\x15\x02\x15\x00\x12\x00\x00\x08\x1c{\x00\x00\x00\x00\x00\x00\x00\x15\x00\x15\x12\x15\x16,\x15\x02\x15\x10\x15\x06\x15\x06\x1c\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x16\x00(\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\t \x02\x00\x00\x00\x02\x01\x01\x02\x00&\xc8\x01\x1c\x15\x04\x195\x10\x00\x06\x19\x18\x03foo\x15\x02\x16\x02\x16\xb8\x01\x16\xc0\x01&8&\x08\x1c\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x16\x00(\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x00\x19,\x15\x04\x15\x00\x15\x02\x00\x15\x00\x15\x10\x15\x02\x00\x00\x00\x15\x04\x15\x10\x15\x14L\x15\x02\x15\x00\x12\x00\x00\x08\x1c{\x00\x00\x00\x00\x00\x00\x00\x15\x00\x15\x12\x15\x16,\x15\x02\x15\x10\x15\x06\x15\x06\x1c\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x16\x00(\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00 \x00\x00\x00\t 
\x02\x00\x00\x00\x02\x01\x01\x02\x00&\xc2\x04\x1c\x15\x04\x195\x10\x00\x06\x19\x18\x03bar\x15\x02\x16\x02\x16\xb8\x01\x16\xc0\x01&\xb2\x03&\x82\x03\x1c\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x16\x00(\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x00\x19,\x15\x04\x15\x00\x15\x02\x00\x15\x00\x15\x10\x15\x02\x00\x00\x00\x15\x04\x19<5\x00\x18\x06schema\x15\x04\x00\x15\x04%\x02\x18\x03foo\x00\x15\x04%\x02\x18\x03bar\x00\x16\x02\x19\x1c\x19,&\xc8\x01\x1c\x15\x04\x195\x10\x00\x06\x19\x18\x03foo\x15\x02\x16\x02\x16\xb8\x01\x16\xc0\x01&8&\x08\x1c\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x16\x00(\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x00\x19,\x15\x04\x15\x00\x15\x02\x00\x15\x00\x15\x10\x15\x02\x00\x00\x00&\xc2\x04\x1c\x15\x04\x195\x10\x00\x06\x19\x18\x03bar\x15\x02\x16\x02\x16\xb8\x01\x16\xc0\x01&\xb2\x03&\x82\x03\x1c\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x 18\x08{\x00\x00\x00\x00\x00\x00\x00\x16\x00(\x08{\x00\x00\x00\x00\x00\x00\x00\x18\x08{\x00\x00\x00\x00\x00\x00\x00\x00\x19,\x15\x04\x15\x00\x15\x02\x00\x15\x00\x15\x10\x15\x02\x00\x00\x00\x16\xf0\x02\x16\x02&\x08\x16\x80\x03\x14\x00\x00\x19,\x18\x06pandas\x18\xd9\x02{"index_columns": [], "column_indexes": [], "columns": [{"name": "foo", "field_name": "foo
[GitHub] [arrow] jandom closed issue #14871: pq.ParquetDataset usage with moto3 mocks?
jandom closed issue #14871: pq.ParquetDataset usage with moto3 mocks? URL: https://github.com/apache/arrow/issues/14871
[GitHub] [arrow] DavZim opened a new issue, #14872: [R] arrow returns wrong variable content when multiple group_by/summarise statements are used
DavZim opened a new issue, #14872: URL: https://github.com/apache/arrow/issues/14872 ### Describe the bug, including details regarding any error messages, version, and platform. When collecting a query with multiple group_by + summarise statements, one variable gets wrongly assigned the values of another variable. When an ungroup is inserted, everything works fine again. To reproduce, consider the following: in the examples below, the variable `gender` should be `F` or `M`, not `Group X`. When the `ungroup()` is inserted (second part), gender is again F/M and not Group X.

``` r
library(dplyr)
library(arrow)

# Create sample dataset
N <- 1000
set.seed(123)
orig_data <- tibble(
  code_group = sample(paste("Group", 1:2), N, replace = TRUE),
  year = sample(2015:2016, N, replace = TRUE),
  gender = sample(c("F", "M"), N, replace = TRUE),
  value = runif(N, 0, 10)
)
write_dataset(orig_data, "example")

# Query and replicate the error
(ds <- open_dataset("example/"))
#> FileSystemDataset with 1 Parquet file
#> code_group: string
#> year: int32
#> gender: string
#> value: double

ds |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value)) |>
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n()) |>
  collect()
#> # A tibble: 2 × 4
#> # Groups:   code_group [2]
#>   code_group gender  value    NN
#> 1 Group 1    Group 1  724.     4
#> 2 Group 2    Group 2  661.     4
```

**ERROR**: the gender variable is replaced by the values of the group variable.

``` r
ds |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value)) |>
  ungroup() |> #< Added this line...
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n()) |>
  collect()
#> # A tibble: 4 × 4
#> # Groups:   code_group [2]
#>   code_group gender value    NN
#> 1 Group 1    F       724.     2
#> 2 Group 2    M       627.     2
#> 3 Group 1    M       658.     2
#> 4 Group 2    F       661.     2
```

**Note**: after inserting the `ungroup()` between the group_by and summarise calls, gender is not replaced.

A quick look at the query (note Node 4, where `"gender": code_group`):

``` r
ds |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value)) |>
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n()) |>
  show_query()
#> ExecPlan with 8 nodes:
#> 7:SinkNode{}
#> 6:ProjectNode{projection=[code_group, gender, value, NN]}
#> 5:GroupByNode{keys=["code_group", "gender"], aggregates=[
#>   hash_max(value, {skip_nulls=false, min_count=0}),
#>   hash_sum(NN, {skip_nulls=true, min_count=1}),
#> ]}
#> 4:ProjectNode{projection=[value, "NN": 1, code_group, "gender": code_group]}  #< gender is wrongfully mapped to code_group!
#> 3:ProjectNode{projection=[year, code_group, gender, value]}
#> 2:GroupByNode{keys=["year", "code_group", "gender"], aggregates=[
#>   hash_sum(value, {skip_nulls=false, min_count=0}),
#> ]}
#> 1:ProjectNode{projection=[value, year, code_group, gender]}
#> 0:SourceNode{}
```

Note that this was also asked [here on SO](https://stackoverflow.com/q/74710844/3048453). ### Component(s) R
[GitHub] [arrow] gf2121 opened a new issue, #14873: [Java] DictionaryEncoder can decode without building a DictionaryHashTable
gf2121 opened a new issue, #14873: URL: https://github.com/apache/arrow/issues/14873 ### Describe the enhancement requested Today DictionaryEncoder always forces the building of a DictionaryHashTable in its constructor. This can be avoided in scenarios where only decoding is required. ### Component(s) Java
[GitHub] [arrow] zeroshade opened a new issue, #14875: [Python][C++] C Data Interface incorrect validate failures
zeroshade opened a new issue, #14875: URL: https://github.com/apache/arrow/issues/14875 ### Describe the bug, including details regarding any error messages, version, and platform. Spinning off from #14814: when testing round trips of empty arrays between Python and Go using the C Data Interface, I found an issue with the binary and string data type arrays. The data types `pa.binary()`, `pa.large_binary()`, `pa.string()` and `pa.large_string()` all throw an error when calling `validate(full=True)` after an `_import_from_c` of an array with a null value data buffer:

```
Traceback (most recent call last):
  File "/home/zeroshade/Projects/GitHub/arrow/go/arrow/cdata/test/test_export_to_cgo.py", line 218, in test
    b.validate(full=True)
  File "pyarrow/array.pxi", line 1501, in pyarrow.lib.Array.validate
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Value data buffer is null
```

This follows up on #14805, which clarified that buffers can be null in a 0-length array. My guess is that the issue is not the offsets buffer but the second, data buffer, which would contain the actual binary/UTF-8 data if the array had a length > 0. But that's just a theory; I haven't confirmed it. ### Component(s) C++, Python
[GitHub] [arrow] zeroshade opened a new issue, #14876: [Go] Address Crashes for empty C Data arrays with nil buffers
zeroshade opened a new issue, #14876: URL: https://github.com/apache/arrow/issues/14876 ### Describe the bug, including details regarding any error messages, version, and platform. Following up from #14805: Go's `cdata` package needs to handle nil data buffers for zero-length arrays. ### Component(s) Go
[GitHub] [arrow] mattwarkentin opened a new issue, #14880: Best practices for handling larger than memory data
mattwarkentin opened a new issue, #14880: URL: https://github.com/apache/arrow/issues/14880 ### Describe the usage question you have. Please include as many useful details as possible. Hi, I am wondering if someone from the Arrow team could offer some guidance on best practices for handling very large data optimally (such as whether partitioning is even the answer). The specific data is a TSV file that is 26 GB on disk and ~50 GB in memory when read into R. The data frame is ~500K rows and ~14K columns. It is prohibitively slow/memory intensive to read the full data across each of several projects when, typically, only a small subset of the data (either a subset of rows or columns) is relevant for any given project. However, the filtering conditions change from project to project, so I don't see an obvious column to use for grouping and partitioning. Does it ever make sense to randomly chunk/partition the data into smaller sets of 5000-1 observations? My understanding was that much of the memory gain comes from chunking on a sensible variable (e.g., `year`): then when you `filter()` on a certain year, some of the data sets won't even be touched/loaded. Is there any way random chunking of observations offers a time/memory advantage? Most commonly, most/all rows but only a very small set of columns are needed. I had hoped that something like the following would work, where `...` is just a small set of column names:

```r
ds <- arrow::open_dataset('data.tsv', format = 'tsv')
df <- ds |>
  dplyr::select(...) |>
  dplyr::collect()
```

But this is seemingly just as slow as loading the full table. I had thought only the `...` columns would be read into memory, so there would be a time savings. Anyway, any suggestions? Am I fundamentally misunderstanding how to handle larger-than-memory data with `arrow`? ### Component(s) R
[GitHub] [arrow] mattwarkentin closed issue #14880: Best practices for handling larger than memory data
mattwarkentin closed issue #14880: Best practices for handling larger than memory data URL: https://github.com/apache/arrow/issues/14880
[GitHub] [arrow-julia] quinnj closed issue #327: DST ambiguities in ZonedDateTime not supported
quinnj closed issue #327: DST ambiguities in ZonedDateTime not supported URL: https://github.com/apache/arrow-julia/issues/327
[GitHub] [arrow] code1704 opened a new issue, #14882: How to do arrow table group by and split?
code1704 opened a new issue, #14882: URL: https://github.com/apache/arrow/issues/14882 ### Describe the usage question you have. Please include as many useful details as possible. How do I group an Arrow table's rows and split it into one table per group?

```python
g = table.group_by("a")
for x in g:
    do_something(x)
```

### Component(s) Python