Hi Rohini,

Unfortunately, we were not able to find it. The investigation is currently on hold, but we do plan to pursue it further.
-Kurt

On 6/1/16, 10:55 AM, "Rohini Palaniswamy" <rohini.adi...@gmail.com> wrote:

>Kurt,
>    Did you find the problem?
>
>Regards,
>Rohini
>
>On Thu, May 5, 2016 at 1:41 PM, Kurt Muehlner <kmuehl...@connexity.com>
>wrote:
>
>> Hello all,
>>
>> I posted this issue in the Tez user group earlier today, where it was
>> suggested I also post it here. We have a Pig/Tez application exhibiting
>> data discrepancies which occur only when there is a difference between
>> the requested parallelism (via SET_DEFAULT_PARALLEL) and the number of
>> containers YARN is able to allocate to the application.
>>
>> Has anyone seen this sort of problem, or have any suspicions as to what
>> may be going wrong?
>>
>> Thanks,
>> Kurt
>>
>> Original message to the Tez group:
>>
>> Hello,
>>
>> We have a Pig/Tez application which is exhibiting a strange problem. The
>> application was recently migrated from Pig/MR to Pig/Tez. During QA we
>> carefully verified that the MR and Tez versions produced identical
>> results. After deploying to production, however, we noticed that the
>> results are occasionally not the same (either compared to the MR
>> results, or to the results of Tez processing the same data on a QA
>> cluster).
>>
>> We're still looking into the root cause, but I'd like to reach out to
>> the user group in case anyone has seen anything similar, or has
>> suggestions on what might be wrong and what to investigate.
>>
>> *** What we know so far ***
>>
>> The results discrepancy occurs ONLY when the number of containers given
>> to the application by YARN is less than the number requested (we have
>> disabled auto-parallelism, and are using SET_DEFAULT_PARALLEL=50 in all
>> Pig scripts). When this occurs, we also see a corresponding discrepancy
>> in the file system counters HDFS_READ_OPS and HDFS_BYTES_READ (both are
>> lower when the number of containers is low), despite the fact that in
>> all cases the number of records processed is identical.
>>
>> Thus, when the production cluster is very busy, we get invalid results.
>> We have kept a separate instance of the Pig/Tez application running on
>> another cluster where it never competes for resources, so we have been
>> able to compare the results of each run of the application, which has
>> allowed us to diagnose the problem this far. By comparing results on
>> these two clusters, we also know that the ratio
>> (actual HDFS_READ_OPS)/(expected HDFS_READ_OPS) correlates with the
>> ratio (actual containers)/(requested containers). Likewise, the ratio
>> of actual to expected HDFS_BYTES_READ shows the same correlation with
>> the container ratio.
>>
>> Below are some relevant counters. For each counter, the first line is
>> the value from the production cluster showing the problem, and the
>> second line is the value from the QA cluster running on the same data.
>>
>> Any hints/suggestions/questions are most welcome.
>>
>> Thanks,
>> Kurt
>>
>> org.apache.tez.common.counters.DAGCounter
>>
>> NUM_SUCCEEDED_TASKS=950
>> NUM_SUCCEEDED_TASKS=950
>>
>> TOTAL_LAUNCHED_TASKS=950
>> TOTAL_LAUNCHED_TASKS=950
>>
>> File System Counters
>>
>> FILE_BYTES_READ=7745801982
>> FILE_BYTES_READ=8003771938
>>
>> FILE_BYTES_WRITTEN=9725468612
>> FILE_BYTES_WRITTEN=9675253887
>>
>> *HDFS_BYTES_READ=9487600888 (when the number of containers equals the
>> number requested, this counter is the same between the two clusters)
>> *HDFS_BYTES_READ=17996466110
>>
>> *HDFS_READ_OPS=3080 (when the number of containers equals the number
>> requested, this counter is the same between the two clusters)
>> *HDFS_READ_OPS=3600
>>
>> HDFS_WRITE_OPS=900
>> HDFS_WRITE_OPS=900
>>
>> org.apache.tez.common.counters.TaskCounter
>>
>> INPUT_RECORDS_PROCESSED=28729671
>> INPUT_RECORDS_PROCESSED=28729671
>>
>> OUTPUT_RECORDS=33655895
>> OUTPUT_RECORDS=33655895
>>
>> OUTPUT_BYTES=28290888628
>> OUTPUT_BYTES=28294000270
>>
>> Input(s):
>> Successfully read 2254733 records (1632743360 bytes) from: "input1"
>> Successfully read 2254733 records (1632743360 bytes) from: "input1"
>>
>> Output(s):
>> Successfully stored 0 records in: "output1"
>> Successfully stored 0 records in: "output1"
>>
>> Successfully stored 56019 records (10437069 bytes) in: "output2"
>> Successfully stored 56019 records (10437069 bytes) in: "output2"
>>
>> Successfully stored 2254733 records (1651936175 bytes) in: "output3"
>> Successfully stored 2254733 records (1651936175 bytes) in: "output3"
>>
>> Successfully stored 1160599 records (823479742 bytes) in: "output4"
>> Successfully stored 1160599 records (823480450 bytes) in: "output4"
>>
>> Successfully stored 28605 records (21176320 bytes) in: "output5"
>> Successfully stored 28605 records (21177552 bytes) in: "output5"
>>
>> Successfully stored 6574 records (4442933 bytes) in: "output6"
>> Successfully stored 6574 records (4442933 bytes) in: "output6"
>>
>> Successfully stored 111416 records (164375858 bytes) in: "output7"
>> Successfully stored 111416 records (164379800 bytes) in: "output7"
>>
>> Successfully stored 542 records (387761 bytes) in: "output8"
>> Successfully stored 542 records (387762 bytes) in: "output8"
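[Editor's note: the parallelism settings described in the quoted report (fixed parallelism of 50 with auto-parallelism disabled) are typically expressed in a Pig script roughly as follows. This is a sketch, not the poster's actual script; the property name `pig.tez.auto.parallelism` assumes a Pig-on-Tez deployment and should be verified against your Pig version.]

```pig
-- Sketch of the configuration described above (names are assumptions,
-- not taken from the poster's scripts).
SET pig.tez.auto.parallelism false; -- disable Tez auto-parallelism
SET default_parallel 50;            -- request 50 reduce-side tasks for every job
```

With these settings Pig asks YARN for 50 containers per reduce stage; the reported discrepancy appears only when YARN grants fewer containers than that.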