Solved this by moving flink-avro to lib and reverting to
`classloader.resolve-order: parent-first`.  I still don't know why, but I
guess that if you're reading Avro both from files and from Kafka in the same
pipeline, inverted classloader delegation will not work. Thanks, Vino, for
your help!
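For anyone hitting the same thing, the working combination is sketched below (the exact jar name depends on your Flink release; this assumes a standard Flink distribution layout with flink-avro moved from the user jar into Flink's lib/ directory):

```yaml
# flink-conf.yaml -- revert from Flink's default child-first (inverted)
# classloading back to parent-first delegation
classloader.resolve-order: parent-first
```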

On Tue, Aug 21, 2018 at 8:02 AM Cliff Resnick <cre...@gmail.com> wrote:

> Hi Aljoscha,
>
> We need flink-shaded-hadoop2-uber.jar because there is no Hadoop distro on
> the instance the Flink session/jobs are managed from, and the process that
> launches Flink is not a Java process, but execs a process that calls the
> flink script.
>
> -Cliff
>
> On Tue, Aug 21, 2018 at 5:11 AM Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
>> Hi Cliff,
>>
>> Do you actually need the flink-shaded-hadoop2-uber.jar in lib? If you're
>> running on YARN, you should be able to just remove it, because with YARN
>> you will have Hadoop on the classpath anyway.
>>
>> Aljoscha
>>
>> On 21. Aug 2018, at 03:45, vino yang <yanghua1...@gmail.com> wrote:
>>
>> Hi Cliff,
>>
>> If so, you can explicitly exclude Avro from the related dependencies
>> (using Maven's <exclusions>) and then directly declare a dependency on
>> the Avro version you need.
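>> For example, a minimal pom.xml sketch (`some-library` is a placeholder
>> for whatever third-party artifact pulls in the conflicting Avro; 1.8.2
>> is the version flink-formats/avro pulls in, as noted below):
>>
>> ```xml
>> <dependency>
>>   <groupId>com.example</groupId>
>>   <artifactId>some-library</artifactId>
>>   <version>1.0</version>
>>   <exclusions>
>>     <!-- keep this artifact's transitive Avro off the classpath -->
>>     <exclusion>
>>       <groupId>org.apache.avro</groupId>
>>       <artifactId>avro</artifactId>
>>     </exclusion>
>>   </exclusions>
>> </dependency>
>> <!-- then pin the Avro version you actually want -->
>> <dependency>
>>   <groupId>org.apache.avro</groupId>
>>   <artifactId>avro</artifactId>
>>   <version>1.8.2</version>
>> </dependency>
>> ```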
>>
>> Thanks, vino.
>>
>> Cliff Resnick <cre...@gmail.com> 于2018年8月21日周二 上午5:13写道:
>>
>>> Hi Vino,
>>>
>>> Unfortunately, I'm still stuck here. By moving the Avro dependency chain
>>> to lib (and removing it from the user jar), my OCFs decode, but I get the
>>> error described here:
>>>
>>> https://github.com/confluentinc/schema-registry/pull/509
>>>
>>> However, the fix described in the PR above was to move the Avro
>>> dependency to the user jar. Since I'm using YARN, I'm required to
>>> have flink-shaded-hadoop2-uber.jar loaded from lib -- and that has
>>> Avro bundled unshaded. So I'm back to the original problem...
>>>
>>> Any advice is welcome!
>>>
>>> -Cliff
>>>
>>>
>>> On Mon, Aug 20, 2018 at 1:42 PM Cliff Resnick <cre...@gmail.com> wrote:
>>>
>>>> Hi Vino,
>>>>
>>>> You were right in your assumption -- unshaded Avro was being added to
>>>> our application jar via a third-party dependency. Excluding it in
>>>> packaging fixed the issue. For the record, it looks like flink-avro must
>>>> be loaded from lib or there will be errors in checkpoint restores.
>>>>
>>>> On Mon, Aug 20, 2018 at 8:43 AM Cliff Resnick <cre...@gmail.com> wrote:
>>>>
>>>>> Hi Vino,
>>>>>
>>>>> Thanks for the explanation, but the job only ever uses the Avro
>>>>> (1.8.2) pulled in by flink-formats/avro, so it's not a class version
>>>>> conflict there.
>>>>>
>>>>> I'm using default child-first loading. It might be a further
>>>>> transitive dependency, though it's not clear from the stack trace or
>>>>> from stepping through the process. When I get a chance I'll look further
>>>>> into it, but in case anyone is experiencing similar problems, what is
>>>>> clear is that classloader order does matter with Avro.
>>>>>
>>>>> On Sun, Aug 19, 2018, 11:36 PM vino yang <yanghua1...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Cliff,
>>>>>>
>>>>>> My personal guess is that this may be caused by a conflict between
>>>>>> your job's Avro and the Avro that the Flink framework itself relies on.
>>>>>> Flink provides some configuration parameters that allow you to
>>>>>> determine the order of the classloaders yourself. [1]
>>>>>> Alternatively, you can consult the documentation on debugging
>>>>>> classloading. [2]
>>>>>>
>>>>>> [1]:
>>>>>> https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/config.html
>>>>>> [2]:
>>>>>> https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/debugging_classloading.html
>>>>>>
>>>>>> Thanks, vino.
>>>>>>
>>>>>> Cliff Resnick <cre...@gmail.com> 于2018年8月20日周一 上午10:40写道:
>>>>>>
>>>>>>> Our Flink/YARN pipeline has been reading Avro from Kafka for a while
>>>>>>> now. We just introduced a source of Avro OCF (Object Container Files)
>>>>>>> read from S3. The Kafka Avro continued to decode without incident, but
>>>>>>> the OCF files failed 100% with anomalous parse errors in the decoding
>>>>>>> phase after the schema and codec were successfully read from them. The
>>>>>>> pipeline would work on my laptop, and when I submitted a test Main
>>>>>>> program to the Flink Session in YARN, that would also successfully
>>>>>>> decode. Only the actual pipeline run from the TaskManager failed. At
>>>>>>> one point I even remote debugged the TaskManager process and stepped
>>>>>>> through what looked like a normal Avro decode (if you can describe
>>>>>>> Avro code as normal!) -- until it abruptly failed with an int decode
>>>>>>> or what-have-you.
>>>>>>>
>>>>>>> This stumped me for a while, but I finally tried moving
>>>>>>> flink-avro.jar from lib to the application jar, and that fixed it.
>>>>>>> I'm not sure why this is, especially since there were no typical
>>>>>>> classloader-type errors. This issue was observed on both Flink 1.5
>>>>>>> and 1.6 in FLIP-6 mode.
>>>>>>>
>>>>>>> -Cliff
