I got a backtrace when the problem occurred. It happened after the training
program had read the training set data and then started reading the validation
set data. A deadlock appeared to occur while the first data file of the
validation set was being read. The training set and the validation set are
read through different dataset objects.
#0 futex_wait_cancelable (private=<optimized out>, expected=0,
futex_word=0x7f7010115e58) at ../sysdeps/nptl/futex-internal.h:183
#1 __pthread_cond_wait_common (abstime=0x0, clockid=0,
mutex=0x7f7010115e08, cond=0x7f7010115e30) at pthread_cond_wait.c:508
#2 __pthread_cond_wait (cond=0x7f7010115e30, mutex=0x7f7010115e08) at
pthread_cond_wait.c:647
#3 0x00007f74d8bfbe30 in
std::condition_variable::wait(std::unique_lock<std::mutex>&) () from
/usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007f70380a4152 in
std::condition_variable::wait<arrow::ConcreteFutureImpl::DoWait()::{lambda()#1}>(std::unique_lock<std::mutex>&,
arrow::ConcreteFutureImpl::DoWait()::{lambda()#1}) (this=0x7f7010115e30,
__lock=..., __p=...) at /usr/include/c++/9/condition_variable:101
#5 0x00007f70380a347c in arrow::ConcreteFutureImpl::DoWait
(this=0x7f7010115dc0) at external/arrow/cpp/src/arrow/util/future.cc:348
#6 0x00007f70380a0114 in arrow::FutureImpl::Wait (this=0x7f7010115dc0) at
external/arrow/cpp/src/arrow/util/future.cc:393
#7 0x00007f7037e678c6 in arrow::Future<std::shared_ptr<arrow::Buffer>
>::Wait (this=0x7f679fffd640) at
external/arrow/cpp/src/arrow/util/future.h:435
#8 0x00007f7037e64fec in arrow::Future<std::shared_ptr<arrow::Buffer>
>::result() const & (this=0x7f679fffd640) at
external/arrow/cpp/src/arrow/util/future.h:405
#9 0x00007f7037f7f4b2 in arrow::io::internal::ReadRangeCache::Impl::Read
(this=0x7f701011b340, range=...) at
external/arrow/cpp/src/arrow/io/caching.cc:206
#10 0x00007f7037f7ebfb in arrow::io::internal::ReadRangeCache::Read
(this=0x7f70101cc040, range=...) at
external/arrow/cpp/src/arrow/io/caching.cc:310
#11 0x00007f7037ace885 in parquet::SerializedRowGroup::GetColumnPageReader
(this=0x7f70101007c0, i=121) at
external/arrow/cpp/src/parquet/file_reader.cc:203
#12 0x00007f7037aca426 in parquet::RowGroupReader::GetColumnPageReader
(this=0x7f70101ab4b0, i=121) at
external/arrow/cpp/src/parquet/file_reader.cc:126
#13 0x00007f7037d9f631 in parquet::arrow::FileColumnIterator::NextChunk
(this=0x7f7010183f90) at
external/arrow/cpp/src/parquet/arrow/reader_internal.h:80
#14 0x00007f7037d8dccd in parquet::arrow::(anonymous
namespace)::LeafReader::NextRowGroup (this=0x7f70100d5300) at
external/arrow/cpp/src/parquet/arrow/reader.cc:502
#15 0x00007f7037d8d7bf in parquet::arrow::(anonymous
namespace)::LeafReader::LeafReader (this=0x7f70100d5300,
ctx=std::shared_ptr<parquet::arrow::ReaderContext> (empty) =
{...}, field=std::shared_ptr<arrow::Field> (empty) = {...},
input=std::unique_ptr<parquet::arrow::FileColumnIterator> =
{...}, leaf_info=...) at external/arrow/cpp/src/parquet/arrow/reader.cc:452
#16 0x00007f7037d8fd77 in parquet::arrow::(anonymous namespace)::GetReader
(field=..., arrow_field=std::shared_ptr<arrow::Field> (use count 3, weak
count 0) = {...},
ctx=std::shared_ptr<parquet::arrow::ReaderContext> (use count
2, weak count 0) = {...}, out=0x7f679fffdd40) at
external/arrow/cpp/src/parquet/arrow/reader.cc:845
#17 0x00007f7037d914b9 in parquet::arrow::(anonymous namespace)::GetReader
(field=...,
ctx=std::shared_ptr<parquet::arrow::ReaderContext> (use count
2, weak count 0) = {...}, out=0x7f679fffdd40) at
external/arrow/cpp/src/parquet/arrow/reader.cc:957
#18 0x00007f7037d8fe57 in parquet::arrow::(anonymous namespace)::GetReader
(field=..., arrow_field=std::shared_ptr<arrow::Field> (use count 2, weak
count 0) = {...},
ctx=std::shared_ptr<parquet::arrow::ReaderContext> (use count
2, weak count 0) = {...}, out=0x7f679fffdfc0) at
external/arrow/cpp/src/parquet/arrow/reader.cc:852
#19 0x00007f7037d914b9 in parquet::arrow::(anonymous namespace)::GetReader
(field=...,
ctx=std::shared_ptr<parquet::arrow::ReaderContext> (use count
2, weak count 0) = {...}, out=0x7f679fffdfc0) at
external/arrow/cpp/src/parquet/arrow/reader.cc:957
#20 0x00007f7037d8bbda in parquet::arrow::(anonymous
namespace)::FileReaderImpl::GetFieldReader (this=0x7f70101099a0, i=121,
included_leaves=std::shared_ptr<std::unordered_set<int,
std::hash<int>, std::equal_to<int>, std::allocator<int> >> (use
count 4, weak count 0) = {...},
row_groups=std::vector of length 2, capacity 2 = {...},
out=0x7f679fffdfc0) at external/arrow/cpp/src/parquet/arrow/reader.cc:212
#21 0x00007f7037d8be1f in parquet::arrow::(anonymous
namespace)::FileReaderImpl::GetFieldReaders (this=0x7f70101099a0,
column_indices=std::vector of length 154, capacity 256 = {...},
row_groups=std::vector of length 2, capacity 2 = {...},
out=0x7f679fffe0d0,
out_schema=0x7f679fffe0b0) at
external/arrow/cpp/src/parquet/arrow/reader.cc:230
#22 0x00007f7037d93d2e in parquet::arrow::(anonymous
namespace)::FileReaderImpl::DecodeRowGroups (this=0x7f70101099a0,
self=std::shared_ptr<parquet::arrow::(anonymous
namespace)::FileReaderImpl> (empty) = {...}, row_groups=std::vector of
length 2, capacity 2 = {...},
column_indices=std::vector of length 154, capacity 256 = {...},
cpu_executor=0x0) at external/arrow/cpp/src/parquet/arrow/reader.cc:1228
#23 0x00007f7037d93496 in parquet::arrow::(anonymous
namespace)::FileReaderImpl::ReadRowGroups (this=0x7f70101099a0,
row_groups=std::vector of length 2, capacity 2 = {...},
column_indices=std::vector of length 154, capacity 256 = {...},
out=0x7f679fffe3c0)
at external/arrow/cpp/src/parquet/arrow/reader.cc:1216
#24 0x00007f7037d8ba2d in parquet::arrow::(anonymous
namespace)::FileReaderImpl::ReadTable (this=0x7f70101099a0,
indices=std::vector of length 154, capacity 256 = {...},
out=0x7f679fffe3c0) at external/arrow/cpp/src/parquet/arrow/reader.cc:199
#25 0x00007f70378c998c in
tensorflow::data::ArrowS3DatasetOp::Dataset::Iterator::ReadFile
(this=0x3a47b620, file_index=0, background=false)
at tensorflow_io/core/kernels/arrow/arrow_dataset_ops.cc:1237
#26 0x00007f70378c8a3d in
tensorflow::data::ArrowS3DatasetOp::Dataset::Iterator::SetupStreamsLocked
(this=0x3a47b620, env=0x1e17cd0)
at tensorflow_io/core/kernels/arrow/arrow_dataset_ops.cc:1110
#27 0x00007f70378d7758 in
tensorflow::data::ArrowDatasetBase::ArrowBaseIterator<tensorflow::data::ArrowS3DatasetOp::Dataset>::GetNextInternal
(this=0x3a47b620,
ctx=0x7f7008448c10, out_tensors=0x7f679fffeac0,
end_of_sequence=0x7f7010275ea8) at
tensorflow_io/core/kernels/arrow/arrow_dataset_ops.cc:110
#28 0x00007f73fcf42aa4 in
tensorflow::data::DatasetBaseIterator::GetNext(tensorflow::data::IteratorContext*,
std::vector<tensorflow::Tensor, std::allocator<tensorflow::Tensor> >*,
bool*) () from
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../libtensorflow_framework.so.2
#29 0x00007f74174288f6 in
tensorflow::data::ParallelMapDatasetOp::Dataset::Iterator::CallFunction(std::shared_ptr<tensorflow::data::IteratorContext>
const&,
std::shared_ptr<tensorflow::data::ParallelMapDatasetOp::Dataset::Iterator::InvocationResult>
const&) ()
from
/usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#30 0x00007f741742e022 in
tensorflow::data::ParallelMapDatasetOp::Dataset::Iterator::RunnerThread(std::shared_ptr<tensorflow::data::IteratorContext>
const&) ()
from
/usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#31 0x00007f7416003145 in tensorflow::data::(anonymous
namespace)::WorkQueueFunc(std::function<void ()> const&,
std::shared_ptr<tensorflow::Notification>) ()
from
/usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#32 0x00007f7416003f5d in std::_Function_handler<void (), std::_Bind<void
(*(std::function<void ()>,
std::shared_ptr<tensorflow::Notification>))(std::function<void ()>
const&, std::shared_ptr<tensorflow::Notification>)>
>::_M_invoke(std::_Any_data const&) ()
from
/usr/local/lib/python3.8/dist-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#33 0x00007f73fd6eb791 in tensorflow::UnboundedWorkQueue::PooledThreadFunc() ()
from
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../libtensorflow_framework.so.2
#34 0x00007f73fd6f25d8 in tensorflow::(anonymous
namespace)::PThread::ThreadFn(void*) ()
from
/usr/local/lib/python3.8/dist-packages/tensorflow/python/../libtensorflow_framework.so.2
#35 0x00007f74e4cc6609 in start_thread (arg=<optimized out>) at
pthread_create.c:477
#36 0x00007f74e4e00133 in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:95
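Frames #1 through #9 above show the reader thread blocked in
arrow::Future::Wait with no timeout (abstime=0x0 in frame #1), so if the
pending read never completes the thread waits forever. Below is a minimal
standard-library sketch of the same pattern with a bounded wait, so that a
stall shows up as a failure instead of a hang. WaitWithTimeout is a
hypothetical helper, not an Arrow API, and std::future only stands in for
arrow::Future here for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <utility>

// Hypothetical helper (not part of Arrow): wait for a result, but give up
// after `timeout` so a stalled producer is reported instead of blocking
// the calling thread indefinitely.
// Returns true and fills *out when the value arrived in time.
template <typename T>
bool WaitWithTimeout(std::future<T>& fut, std::chrono::milliseconds timeout,
                     T* out) {
  assert(fut.valid());
  assert(out != nullptr);
  if (fut.wait_for(timeout) != std::future_status::ready) {
    return false;  // The producer did not complete within the deadline.
  }
  *out = fut.get();
  return true;
}
```

With a promise that is never fulfilled, the call returns false after the
timeout. That is the point where one could log which file and byte range was
being read before deciding how to recover.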
1057445597
[email protected]
------------------ Original Message ------------------
From: "user" <[email protected]>;
Sent: Thursday, March 23, 2023, 4:01
To: "user" <[email protected]>;
Subject: Re: How to troubleshoot curlCode 18 errors
Your error looks very similar to one already reported [1] that had to do with
using a non-AWS S3 compatible storage provider (R2 in this case), though a
solution was never provided. Are you seeing this error using AWS S3 or another
provider?
[1] https://github.com/apache/arrow/issues/33275
On Wed, Mar 22, 2023 at 5:42 AM 1057445597 <[email protected]> wrote:
My code is here:
https://github.com/tensorflow/io/pull/1720/files#diff-7133d540dc86c9bb9e552655025061798314e226695c00b4e1d8cecb178a2920R1181
(arrow_dataset_ops.cc:1181)
I read the Parquet file by column from S3 storage. The error is very rare and
cannot be reproduced reliably. I would like to ask what could be causing it.
1057445597
[email protected]
------------------ Original Message ------------------
From: "user" <[email protected]>;
Sent: Wednesday, March 22, 2023, 9:23
To: "user" <[email protected]>;
Subject: Re: How to troubleshoot curlCode 18 errors
Can you give a bit more detail about what you were doing that caused this
error? (And ideally a reproducible code example.)
On Wed, 22 Mar 2023 at 14:14, 1057445597 <[email protected]> wrote:
The error message is as follows.
ErrorType: 99 Message: curlCode: 18, Transferred a partial file ExceptionName
This error is very unusual.
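curlCode 18 is libcurl's CURLE_PARTIAL_FILE, meaning a transfer ended before
the expected number of bytes arrived. Since the failure is rare and
intermittent, one possible workaround while the cause is investigated is to
retry the failed read a few times. The sketch below is only an illustration
under that assumption: RetryWithBackoff is a hypothetical helper, and the
callable passed to it stands in for whatever performs the actual read. It
does not identify the root cause.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: call `attempt` until it returns true or
// `max_attempts` tries have been made, doubling the delay between tries.
// Returns the number of attempts used on success, or 0 if every try failed.
inline int RetryWithBackoff(const std::function<bool()>& attempt,
                            int max_attempts,
                            std::chrono::milliseconds initial_delay) {
  assert(max_attempts > 0);
  std::chrono::milliseconds delay = initial_delay;
  for (int i = 1; i <= max_attempts; ++i) {
    if (attempt()) {
      return i;  // The read succeeded on try number i.
    }
    if (i < max_attempts) {
      std::this_thread::sleep_for(delay);  // Back off before the next try.
      delay *= 2;
    }
  }
  return 0;
}
```

In practice the callable would wrap the read and return false only for errors
that are believed to be transient. Logging each failed attempt would also help
show how often the partial transfer actually happens.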
1057445597
[email protected]