Re: data discrepancies related to parallelism

2016-05-05 Thread Kurt Muehlner
Rohini:

We are still looking into that.  The file I named ‘output7’ in this thread is 
used as input to the next DAG.  We’re still analyzing how it may differ 
in the two environments, if at all. In that DAG, although the number of input 
records is the same, the number of output records diverges.  We’re looking into 
why that is.
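One way to check whether the two environments’ copies of that file really differ, independent of record order (part files can legitimately order records differently when parallelism differs), is to compare a record count plus an order-insensitive digest. A rough sketch, using hypothetical record data rather than our actual output format:

```python
import hashlib

def summarize(records):
    """Return (record_count, order-insensitive digest) for one output."""
    count = 0
    digest = 0
    for rec in records:
        count += 1
        # XOR of per-record hashes ignores ordering, so a re-partitioned
        # but otherwise identical output compares equal.
        digest ^= int.from_bytes(hashlib.md5(rec.encode()).digest(), "big")
    return count, digest

# Hypothetical records: the same multiset in a different order compares equal.
a = summarize(["r1\t10", "r2\t20", "r3\t30"])
b = summarize(["r3\t30", "r1\t10", "r2\t20"])
print(a == b)  # True
```

If the counts match but the digests differ, the files contain different records, not just a different partitioning.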

Thanks,
Kurt

From: Rohini Palaniswamy <rohini.adi...@gmail.com>
Reply-To: "user@tez.apache.org" <user@tez.apache.org>
Date: Thursday, May 5, 2016 at 11:46 AM
To: "user@tez.apache.org" <user@tez.apache.org>
Subject: Re: data discrepancies related to parallelism

Haven't seen this before. Pig's stats output counters seem to be exactly 
the same for records. In which output do you see the data being incorrect?

On Thu, May 5, 2016 at 11:23 AM, Hitesh Shah <hit...@apache.org> wrote:
Thanks for the info, Kurt. You may wish to post this question to the Pig lists 
too to see if anyone has seen this.

— Hitesh


> On May 5, 2016, at 11:05 AM, Kurt Muehlner <kmuehl...@connexity.com> wrote:
>
> Hi Hitesh,
>
> We are using Pig 0.15.0 and Tez 0.8.2.
>
> Thanks,
> Kurt
>
>
>
> On 5/5/16, 11:00 AM, "Hitesh Shah" <hit...@apache.org> wrote:
>
>> What version are you running with?
>>
>> thanks
>> — Hitesh





Re: data discrepancies related to parallelism

2016-05-05 Thread Hitesh Shah
What version are you running with? 

thanks
— Hitesh 

> On May 5, 2016, at 10:31 AM, Kurt Muehlner  wrote:
> 
> Hello,
> 
> We have a Pig/Tez application which is exhibiting a strange problem.  This 
> application was recently migrated from Pig/MR to Pig/Tez.  We carefully 
> vetted during QA that both MR and Tez versions produced identical results.  
> However, after deploying to production, we noticed that occasionally, results 
> are not the same (either as compared to MR results, or results of Tez 
> processing the same data on a QA cluster).
> 
> We’re still looking into the root cause, but I’d like to reach out to the 
> user group in case anyone has seen anything similar, or has suggestions on 
> what might be wrong/what to investigate.
> 
> *** What we know so far ***
> The results discrepancy occurs ONLY when the number of containers given to the 
> application by YARN is less than the number requested (we have disabled 
> auto-parallelism, and are using SET_DEFAULT_PARALLEL=50 in all pig scripts).  
> When this occurs, we also see a corresponding discrepancy in the file 
> system counters HDFS_READ_OPS and HDFS_BYTES_READ (both lower when the number 
> of containers is low), despite the fact that in all cases the number of 
> records processed is identical.
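For reference, the parallelism settings described above would look roughly like this at the top of each script. This is a sketch of the configuration, not an exact copy of the scripts; the property names are an assumption based on Pig 0.15 running on Tez:

```pig
-- Assumed settings: fixed parallelism, Tez auto-parallelism disabled
SET pig.tez.auto.parallelism false;
SET default_parallel 50;
```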
> 
> Thus, when the production cluster is very busy, we get invalid results.  We 
> have kept a separate instance of the Pig/Tez application running on another 
> cluster where it never competes for resources, so we have been able to 
> compare results for each run of the application, which has allowed us to 
> diagnose the problem this far.  By comparing results on these two clusters, 
> we also know that the ratio (actual HDFS_READ_OPS)/(expected HDFS_READ_OPS) 
> correlates with the ratio (actual containers)/(requested containers).  
> Likewise, the HDFS bytes-read ratio shows the same correlation with the 
> container ratio.
> 
> Below are some relevant counters.  For each counter, the first line is the 
> value from the production cluster showing the problem, and the second line is 
> the value from the QA cluster running on the same data.
> 
> Any hints/suggestions/questions are most welcome.
> 
> Thanks,
> Kurt
> 
> org.apache.tez.common.counters.DAGCounter
> 
>  NUM_SUCCEEDED_TASKS=950
>  NUM_SUCCEEDED_TASKS=950
> 
>  TOTAL_LAUNCHED_TASKS=950
>  TOTAL_LAUNCHED_TASKS=950
> 
> File System Counters
> 
>  FILE_BYTES_READ=7745801982
>  FILE_BYTES_READ=8003771938
> 
>  FILE_BYTES_WRITTEN=9725468612
>  FILE_BYTES_WRITTEN=9675253887
> 
>  *HDFS_BYTES_READ=9487600888  (when number of containers equals the number 
> requested, this counter is the same between the two clusters)
>  *HDFS_BYTES_READ=17996466110
> 
>  *HDFS_READ_OPS=3080  (when number of containers equals the number requested, 
> this counter is the same between the two clusters)
>  *HDFS_READ_OPS=3600
> 
>  HDFS_WRITE_OPS=900
>  HDFS_WRITE_OPS=900
> 
> org.apache.tez.common.counters.TaskCounter
>  INPUT_RECORDS_PROCESSED=28729671
>  INPUT_RECORDS_PROCESSED=28729671
> 
> 
>  OUTPUT_RECORDS=33655895
>  OUTPUT_RECORDS=33655895
> 
>  OUTPUT_BYTES=28290888628
>  OUTPUT_BYTES=28294000270
> 
> Input(s):
> Successfully read 2254733 records (1632743360 bytes) from: "input1"
> Successfully read 2254733 records (1632743360 bytes) from: "input1"
> 
> 
> Output(s):
> Successfully stored 0 records in: "output1"
> Successfully stored 0 records in: "output1"
> 
> Successfully stored 56019 records (10437069 bytes) in: "output2"
> Successfully stored 56019 records (10437069 bytes) in: "output2"
> 
> Successfully stored 2254733 records (1651936175 bytes) in: "output3"
> Successfully stored 2254733 records (1651936175 bytes) in: "output3"
> 
> Successfully stored 1160599 records (823479742 bytes) in: "output4"
> Successfully stored 1160599 records (823480450 bytes) in: "output4"
> 
> Successfully stored 28605 records (21176320 bytes) in: "output5"
> Successfully stored 28605 records (21177552 bytes) in: "output5"
> 
> Successfully stored 6574 records (4442933 bytes) in: "output6"
> Successfully stored 6574 records (4442933 bytes) in: "output6"
> 
> Successfully stored 111416 records (164375858 bytes) in: "output7"
> Successfully stored 111416 records (164379800 bytes) in: "output7"
> 
> Successfully stored 542 records (387761 bytes) in: "output8"
> Successfully stored 542 records (387762 bytes) in: "output8"
> 
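As a quick sanity check on the correlation described above, the counter ratios can be computed directly from the quoted values (the container ratio itself is not part of the counter dump, so only the counter side is shown; the QA run is treated as the expected value):

```python
# Counter values quoted above: production (problem) run vs. QA run.
prod = {"HDFS_READ_OPS": 3080, "HDFS_BYTES_READ": 9_487_600_888}
qa = {"HDFS_READ_OPS": 3600, "HDFS_BYTES_READ": 17_996_466_110}

# Ratio of actual (production) to expected (QA) for each counter.
for name in prod:
    print(f"{name}: {prod[name] / qa[name]:.3f}")
# HDFS_READ_OPS: 0.856
# HDFS_BYTES_READ: 0.527
```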