What version are you running with? thanks — Hitesh
> On May 5, 2016, at 10:31 AM, Kurt Muehlner <kmuehl...@connexity.com> wrote:
>
> Hello,
>
> We have a Pig/Tez application which is exhibiting a strange problem. This
> application was recently migrated from Pig/MR to Pig/Tez. We carefully
> vetted during QA that both MR and Tez versions produced identical results.
> However, after deploying to production, we noticed that occasionally, results
> are not the same (either as compared to MR results, or results of Tez
> processing the same data on a QA cluster).
>
> We’re still looking into the root cause, but I’d like to reach out to the
> user group in case anyone has seen anything similar, or has suggestions on
> what might be wrong/what to investigate.
>
> *** What we know so far ***
> Results discrepancy occurs ONLY when the number of containers given to the
> application by YARN is less than the number requested (we have disabled
> auto-parallelism, and are using SET_DEFAULT_PARALLEL=50 in all pig scripts).
> When this occurs, we also see a corresponding discrepancy in the file
> system counters HDFS_READ_OPS and HDFS_BYTES_READ (lower when number of
> containers is low), despite the fact that in all cases number of records
> processed is identical.
>
> Thus, when the production cluster is very busy, we get invalid results. We
> have kept a separate instance of the Pig/Tez application running on another
> cluster where it never competes for resources, so we have been able to
> compare results for each run of the application, which has allowed us to
> diagnose the problem this far. By comparing results on these two clusters,
> we also know that the ratio (actual HDFS_READ_OPS)/(expected HDFS_READ_OPS)
> correlates with the ratio (actual containers)/(requested containers).
> Likewise, we see the same correlation between hdfs ops ratio and container
> ratio.
>
> Below are some relevant counters. For each counter, the first line is the
> value from the production cluster showing the problem, and the second line is
> the value from the QA cluster running on the same data.
>
> Any hints/suggestions/questions are most welcome.
>
> Thanks,
> Kurt
>
> org.apache.tez.common.counters.DAGCounter
>
> NUM_SUCCEEDED_TASKS=950
> NUM_SUCCEEDED_TASKS=950
>
> TOTAL_LAUNCHED_TASKS=950
> TOTAL_LAUNCHED_TASKS=950
>
> File System Counters
>
> FILE_BYTES_READ=7745801982
> FILE_BYTES_READ=8003771938
>
> FILE_BYTES_WRITTEN=9725468612
> FILE_BYTES_WRITTEN=9675253887
>
> *HDFS_BYTES_READ=9487600888 (when number of containers equals the number
> requested, this counter is the same between the two clusters)
> *HDFS_BYTES_READ=17996466110
>
> *HDFS_READ_OPS=3080 (when number of containers equals the number requested,
> this counter is the same between the two clusters)
> *HDFS_READ_OPS=3600
>
> HDFS_WRITE_OPS=900
> HDFS_WRITE_OPS=900
>
> org.apache.tez.common.counters.TaskCounter
> INPUT_RECORDS_PROCESSED=28729671
> INPUT_RECORDS_PROCESSED=28729671
>
> OUTPUT_RECORDS=33655895
> OUTPUT_RECORDS=33655895
>
> OUTPUT_BYTES=28290888628
> OUTPUT_BYTES=28294000270
>
> Input(s):
> Successfully read 2254733 records (1632743360 bytes) from: "input1"
> Successfully read 2254733 records (1632743360 bytes) from: "input1"
>
> Output(s):
> Successfully stored 0 records in: "output1"
> Successfully stored 0 records in: "output1"
>
> Successfully stored 56019 records (10437069 bytes) in: "output2"
> Successfully stored 56019 records (10437069 bytes) in: "output2"
>
> Successfully stored 2254733 records (1651936175 bytes) in: "output3"
> Successfully stored 2254733 records (1651936175 bytes) in: "output3"
>
> Successfully stored 1160599 records (823479742 bytes) in: "output4"
> Successfully stored 1160599 records (823480450 bytes) in: "output4"
>
> Successfully stored 28605 records (21176320 bytes) in: "output5"
> Successfully stored 28605 records (21177552 bytes) in: "output5"
>
> Successfully stored 6574 records (4442933 bytes) in: "output6"
> Successfully stored 6574 records (4442933 bytes) in: "output6"
>
> Successfully stored 111416 records (164375858 bytes) in: "output7"
> Successfully stored 111416 records (164379800 bytes) in: "output7"
>
> Successfully stored 542 records (387761 bytes) in: "output8"
> Successfully stored 542 records (387762 bytes) in: "output8"
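
For anyone comparing the paired counters quoted above, here is a minimal Python sketch that computes the production/QA ratio for each counter and lists the ones that diverge. The values are copied from the quoted message; the function names and the 5% tolerance are assumptions chosen for illustration, not part of Tez or Pig.

```python
# Counter pairs copied from the quoted message: (production, QA).
counters = {
    "NUM_SUCCEEDED_TASKS": (950, 950),
    "TOTAL_LAUNCHED_TASKS": (950, 950),
    "FILE_BYTES_READ": (7745801982, 8003771938),
    "FILE_BYTES_WRITTEN": (9725468612, 9675253887),
    "HDFS_BYTES_READ": (9487600888, 17996466110),
    "HDFS_READ_OPS": (3080, 3600),
    "HDFS_WRITE_OPS": (900, 900),
    "INPUT_RECORDS_PROCESSED": (28729671, 28729671),
    "OUTPUT_RECORDS": (33655895, 33655895),
    "OUTPUT_BYTES": (28290888628, 28294000270),
}


def counter_ratios(pairs):
    """Return production/QA ratio for each counter (hypothetical helper)."""
    return {name: prod / qa for name, (prod, qa) in pairs.items()}


def flag_discrepancies(pairs, tolerance=0.05):
    """Names of counters whose ratio differs from 1.0 by more than
    `tolerance` (the 5% default is an arbitrary, assumed threshold)."""
    ratios = counter_ratios(pairs)
    return sorted(n for n, r in ratios.items() if abs(r - 1.0) > tolerance)


if __name__ == "__main__":
    for name, ratio in counter_ratios(counters).items():
        print(f"{name:26s} {ratio:.4f}")
    print("diverging:", flag_discrepancies(counters))
```

With the quoted numbers, only the two HDFS read counters fall outside the assumed tolerance, which matches the observation in the message. Adding the actual and requested container counts for each run would let the same ratio be compared against the container ratio.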