Handling imperfect data

2020-04-07 Thread Cameron Bateman
I am trying to create a pipeline that takes in PDF files, parses them with Tika, and processes the resulting data. One problem is that Tika sometimes fails to convert certain pieces of text correctly. I can detect this and would like to fork the output of my pipeline: for correctly
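
The forking idea in this message can be illustrated with a minimal plain-Python sketch. It is not the poster's code and does not use any particular pipeline framework. The check `looks_garbled` and the function `fork_by_quality` are hypothetical names introduced here, and the rule used (empty text or a Unicode replacement character) is only an assumed stand-in for whatever detection the poster already has.

```python
from typing import Iterable, List, Tuple

# U+FFFD is the Unicode replacement character; treating it as a sign of a
# failed conversion is an assumption made for this sketch.
REPLACEMENT_CHAR = "\ufffd"


def looks_garbled(text: str) -> bool:
    """Hypothetical check: flag blank text or text containing U+FFFD."""
    return not text.strip() or REPLACEMENT_CHAR in text


def fork_by_quality(texts: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split parsed texts into (clean, suspect) lists, preserving order."""
    clean: List[str] = []
    suspect: List[str] = []
    for text in texts:
        # Route each record to exactly one of the two outputs.
        (suspect if looks_garbled(text) else clean).append(text)
    return clean, suspect
```

Each output can then be handled by its own downstream step, for example normal processing for `clean` and manual review or a retry for `suspect`.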

Re: Using self-signed root CA for HTTPS connection in ElasticsearchIO

2020-04-07 Thread Kenneth Knowles
Hi Mohil, Thanks for the detailed report. I think most people are at reduced capacity right now. Filing a Jira would be helpful for tracking this. Since I am writing, I will add a quick guess, but we should move to Jira. It seems this has more to do with Dataflow than Elasticsearch. The default for

Side input of size around 50Mb causing long GC pause

2020-04-07 Thread Kiran Hurakadli
I am facing an issue related to side inputs and have described it at this link: https://stackoverflow.com/questions/60900937/side-input-of-size-around-50mb-causing-long-gc-pause Any help would be appreciated. -- Regards, *Kiran M Hurakadli.*