Hi all,
I couldn't send the Pig script earlier because I needed to ask my colleagues whether I was allowed to share it. Please help, as my problem is still unresolved.
Thanks in advance

On Wed, Jul 8, 2015 at 11:42 AM, Sachin Sabbarwal <[email protected]> wrote:
> Hi Rohini,
> The attached zip contains main.pig and the other macros used in this script. Please let me know if anything else is missing or required.
>
> Thanks
>
> On Wed, Jul 8, 2015 at 4:47 AM, Rohini Palaniswamy <[email protected]> wrote:
>> Sachin,
>> Can you attach your Pig script and Pig client log as well, as I asked earlier?
>>
>> Regards,
>> Rohini
>>
>> On Tue, Jul 7, 2015 at 2:43 AM, Sachin Sabbarwal <[email protected]> wrote:
>>> Hi all,
>>> I'm using Apache Pig 0.14.0 (r1640057) and Tez 0.5.3.
>>> I am running a Pig script in the following two scenarios:
>>> 1. Without any data: in Tez mode the script runs within 1 minute; with MapReduce it takes around 7 minutes.
>>> 2. With a little data: in Tez mode the same script takes around 10 minutes; with MapReduce it takes around 14-15 minutes.
>>>
>>> I had sent a mail to [email protected] and learned that the problem is that Pig's auto-reduce parallelism isn't kicking in for a few of my scopes, so a lot of tasks get created, which makes the Tez run slow. As suggested, I tried setting pig.exec.reducers.max to a smaller value (150, 25 and 10). With this property set to 10, my script ran in under 4 minutes. The logs also show that the auto-reducer is set to false for a few scopes. For example:
>>>
>>> 2015-07-06 14:11:35,123 INFO [AsyncDispatcher event handler] vertexmanager.ShuffleVertexManager: Shuffle Vertex Manager: settings minFrac:0.25 maxFrac:0.75 *auto:false* desiredTaskIput:104857600 minTasks:1
>>> 2015-07-06 14:11:35,123 INFO [AsyncDispatcher event handler] impl.VertexImpl: Creating 999 for vertex:
>>>
>>> The problem is that I have set pig.tez.auto.parallelism=true in my pig.properties, and I have also set this property to true in my Pig script, yet it is still being set to false for a few scopes.
>>>
>>> You can find my conversation with Rajesh from the Tez list below. Please let me know what I am missing, and whether any other information is required.
>>>
>>> Thanks in advance
>>>
>>> ---------- Forwarded message ----------
>>> From: Rajesh Balamohan <[email protected]>
>>> Date: Tue, Jul 7, 2015 at 2:12 PM
>>> Subject: Re: Same pig script running slower with Tez as compared with run in Mapred mode
>>> To: [email protected]
>>>
>>> Hi Sachin,
>>>
>>> That was just a temporary workaround to confirm the issue; ideally the user does not need to set this parameter. The real issue is why the auto-reducer is set to false for certain vertices in Pig on Tez. I will wait for the Pig folks to chime in.
>>>
>>> For docs and tutorials, you can start with the following:
>>> - http://tez.apache.org/talks.html
>>> - A couple of YouTube videos are available from Hadoop Summits and meetups.
>>> - http://hortonworks.com/blog/apache-tez-a-new-chapter-in-hadoop-data-processing/ (this is pretty old)
>>> - Pig on Tez: http://www.slideshare.net/Hadoop_Summit/pig-on-tez-low-latency-data-processing-with-big-data
>>>
>>> ~Rajesh.B
>>>
>>> On Tue, Jul 7, 2015 at 1:54 PM, Sachin Sabbarwal <[email protected]> wrote:
>>>> Hi Rajesh,
>>>> Thanks for your response. *This seems to be working for me.*
>>>> By setting pig.exec.reducers.max to 10 I am able to complete my run in under 4 minutes (initially it was taking 14-15 minutes).
>>>> I'm new to the Pig/Tez/Hadoop world. Do you write any blogs about Pig/Tez/Hadoop? Can you suggest any tutorials or links to read about Tez? I need to understand concepts like scopes, DAGs, parallelism, etc. I have only a very basic understanding of Tez; once I understand these concepts I'll be able to tune my job.
>>>>
>>>> Thanks
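For reference, both properties discussed above can be set in pig.properties or directly at the top of the script with SET. A minimal sketch (the values shown are illustrative, not recommendations):

    -- ask Pig on Tez to adjust reducer counts at runtime
    set pig.tez.auto.parallelism true;
    -- cap the number of reducers Pig requests at compile time (illustrative value)
    set pig.exec.reducers.max 10;

Capping pig.exec.reducers.max is the workaround referred to above; as Rajesh notes, it should not be necessary once the auto-parallelism issue is understood.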
>>>> On Tue, Jul 7, 2015 at 1:23 PM, Rajesh Balamohan <[email protected]> wrote:
>>>>> Forgot to add the following. Ideally the auto-reduce implementation should have kicked in and, on an as-needed basis, decreased the number of reducers. However, for the vertices of concern (scope-2037 and scope-2162), the auto-reducer has been turned off in the configuration by Pig, while it is turned on for the rest of the vertices.
>>>>>
>>>>> The Pig folks would be able to explain why auto-reduce parallelism is turned off for certain vertices.
>>>>>
>>>>> 2015-07-06 14:11:35,109 INFO [AsyncDispatcher event handler] impl.VertexImpl: Setting vertexManager to ShuffleVertexManager for vertex_1436152736518_0210_1_28 *[scope-2037]*
>>>>> 2015-07-06 14:11:35,123 INFO [AsyncDispatcher event handler] vertexmanager.ShuffleVertexManager: Shuffle Vertex Manager: settings minFrac:0.25 maxFrac:0.75 *auto:false* desiredTaskIput:104857600 minTasks:1
>>>>> 2015-07-06 14:11:35,123 INFO [AsyncDispatcher event handler] impl.VertexImpl: Creating 999 for vertex: vertex_1436152736518_0210_1_28 [scope-2037]
>>>>> ....
>>>>>
>>>>> 2015-07-06 14:11:35,245 INFO [AsyncDispatcher event handler] impl.VertexImpl: Setting vertexManager to ShuffleVertexManager for vertex_1436152736518_0210_1_39 *[scope-2162]*
>>>>> 2015-07-06 14:11:35,257 INFO [AsyncDispatcher event handler] vertexmanager.ShuffleVertexManager: Shuffle Vertex Manager: settings minFrac:0.25 maxFrac:0.75 *auto:false* desiredTaskIput:104857600 minTasks:1
>>>>> 2015-07-06 14:11:35,257 INFO [AsyncDispatcher event handler] impl.VertexImpl: Creating 999 for vertex: vertex_1436152736518_0210_1_39 [scope-2162]
>>>>> ....
>>>>>
>>>>> 2015-07-06 14:11:35,417 INFO [AsyncDispatcher event handler] impl.VertexImpl: Setting user vertex manager plugin: org.apache.tez.dag.library.vertexmanager.ShuffleVertexManager on vertex: *scope-2185*
>>>>> 2015-07-06 14:11:35,419 INFO [AsyncDispatcher event handler] vertexmanager.ShuffleVertexManager: Shuffle Vertex Manager: settings minFrac:0.25 maxFrac:0.75 *auto:true* desiredTaskIput:104857600 minTasks:1
>>>>> ...
>>>>>
>>>>> On Tue, Jul 7, 2015 at 12:47 PM, Rajesh Balamohan <[email protected]> wrote:
>>>>>> Attaching the DAG and the swimlane for the job.
>>>>>>
>>>>>> scope-2052, which had to feed data to the other vertices, slowed down (~150-180 seconds) due to multiple spills and NumberFormatExceptions in the data.
>>>>>> You might want to try setting "tez.task.scale.memory.additional-reservation.fraction.max='PARTITIONED_UNSORTED_OUTPUT:12,UNSORTED_INPUT:1,UNSORTED_OUTPUT:12,SORTED_OUTPUT:12,SORTED_MERGED_INPUT:1,PROCESSOR:1,OTHER:1'" to allocate more memory for unordered outputs.
>>>>>>
>>>>>> Details for this scope:
>>>>>> - attempt_1436152736518_0210_1_31_000000_0, PigLatin:dmwith1tapin.pig-0_scope-0, VertexName: scope-2052, VertexParallelism: 1, TaskAttemptID: attempt_1436152736518_0210_1_31_000000_0
>>>>>> - numInputs=1, numOutputs=4, JVM.maxFree=734527488
>>>>>> - 2015-07-06 14:11:40,047 INFO [TezChild] resources.MemoryDistributor: Informing: INPUT, scope-546, org.apache.tez.mapreduce.input.MRInput: requested=0, allocated=0
>>>>>> - The small allocation of ~7 MB to the unordered outputs led to multiple spills.
>>>>>> - 2015-07-06 14:11:40,047 INFO [TezChild] resources.MemoryDistributor: Informing: OUTPUTPUT_RECORDS, scope-2117, org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput: requested=268435456, allocated=222303401
>>>>>> - 2015-07-06 14:11:40,047 INFO [TezChild] resources.MemoryDistributor: Informing: OUTPUT, scope-2251, org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput: requested=268435456, allocated=222303401
>>>>>> - 2015-07-06 14:11:40,048 INFO [TezChild] resources.MemoryDistributor: Informing: OUTPUT, scope-2063, org.apache.tez.runtime.library.output.UnorderedPartitionedKVOutput: requested=104857600, allocated=7236438
>>>>>> - 2015-07-06 14:11:40,048 INFO [TezChild] resources.MemoryDistributor: Informing: OUTPUT, scope-2068, org.apache.tez.runtime.library.output.UnorderedPartitionedKVOutput: requested=104857600, allocated=7236438
>>>>>> - A large number of records hit a NumberFormatException, producing a large volume of logs and dragging out the runtime of this task, e.g. "mapReduceLayer.PigHadoopLogger: java.lang.Class(FIELD_DISCARDED_TYPE_CONVERSION_FAILED): Unable to interpret value in field being converted to long, caught NumberFormatException <empty String> field discarded"
>>>>>>
>>>>>> - scope-2037 and scope-2162 had their vertex parallelism set to "999", affecting subsequent execution.
>>>>>> - VertexName: scope-2037, VertexParallelism: 999, vertex: vertex_1436152736518_0210_1_28 finished in *410* seconds. The tasks themselves were small, but the large number of tasks that had to be executed in small containers (pretty much the same container was reused to execute them) took time.
>>>>>> - VertexName: scope-2162, VertexParallelism: 999, vertex: vertex_1436152736518_0210_1_39 finished in *697* seconds. Same observation as for the previous vertex.
>>>>>> *The above two vertices have caused the entire job to slow down.*
>>>>>>
>>>>>> "999" is set as the reducer parallelism at compile time by Pig; this is not driven by the input. I am not sure how Pig sets the parallelism at compile time; the Pig folks would be in a better position to explain that. You could try setting "pig.exec.reducers.max=50" in your case and see how it goes.
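Regarding the FIELD_DISCARDED_TYPE_CONVERSION_FAILED warnings above: they come from casting values (here, empty strings) that cannot be parsed as long. A minimal sketch of one way to clean the field before the cast so such rows do not flood the logs; the file, relation and field names are hypothetical, and the tab delimiter is an assumption:

    -- load the numeric column as chararray, drop empty values, then cast
    -- 'input_data', id and amount are placeholder names
    raw     = LOAD 'input_data' USING PigStorage('\t') AS (id:chararray, amount:chararray);
    cleaned = FILTER raw BY amount IS NOT NULL AND TRIM(amount) != '';
    typed   = FOREACH cleaned GENERATE id, (long)amount AS amount;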
>>>>>> On Tue, Jul 7, 2015 at 11:22 AM, Sachin Sabbarwal <[email protected]> wrote:
>>>>>>> logs.gz
>>>>>>> <https://drive.google.com/file/d/0B-RFcYxUIHzzUVJpRzVDZXB5TUk/view?usp=drive_web>
>>>>>>>
>>>>>>> Hi Rajesh,
>>>>>>>
>>>>>>> PFA the gzipped logs. FYI, it's a single file; when you gunzip it, it will be around 1.5 GB in size.
>>>>>>> One more thing which you might find useful: in the dmOutputTez file I can see the following line, which suggests that Tez created a total of 7660 tasks. This is surprising, as my data is only a few MB (10-15 MB at most). How is this number of tasks decided? Is there any property to tune it?
>>>>>>>
>>>>>>> 2015-07-07 05:37:02,647 [Timer-0] INFO org.apache.pig.backend.hadoop.executionengine.tez.TezJob - DAG Status: status=RUNNING, progress=TotalTasks: 7660 Succeeded: 0 Running: 0 Failed: 0 Killed: 0, diagnostics=
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Mon, Jul 6, 2015 at 8:34 PM, Rajesh Balamohan <[email protected]> wrote:
>>>>>>>> yarn logs -applicationId application_1436152736518_0210
>>>>>>>>
>>>>>>>> You can send the output to a log file, gzip it and post it.
>>>>>>>>
>>>>>>>> ~Rajesh.B
>>>>>>>>
>>>>>>>> On Mon, Jul 6, 2015 at 8:12 PM, Sachin Sabbarwal <[email protected]> wrote:
>>>>>>>>> Hi,
>>>>>>>>> Thanks for the reply. My tez-site.xml contains only the following:
>>>>>>>>>
>>>>>>>>> <configuration>
>>>>>>>>>   <property>
>>>>>>>>>     <name>tez.lib.uris</name>
>>>>>>>>>     <value>${fs.defaultFS}/apps/tez-0.5/tez-0.5.3.tar.gz,${fs.defaultFS}/apps/tez-0.5/*,${fs.defaultFS}/apps/tez-0.5/lib/*</value>
>>>>>>>>>   </property>
>>>>>>>>> </configuration>
>>>>>>>>>
>>>>>>>>> PFA the application logs. Here is the version information:
>>>>>>>>> 1. Hadoop: 2.5.0-cdh5.3.1
>>>>>>>>> 2. Pig: Apache Pig 0.14.0 (r1640057)
>>>>>>>>> 3. Tez: 0.5.3
>>>>>>>>>
>>>>>>>>> Let me know if anything else is needed.
>>>>>>>>>
>>>>>>>>> Thanks in advance
>>>>>>>>>
>>>>>>>>> On Mon, Jul 6, 2015 at 7:07 PM, Rajesh Balamohan <[email protected]> wrote:
>>>>>>>>>> Can you post the application logs, tez-site.xml and also the version details?
>>>>>>>>>>
>>>>>>>>>> ~Rajesh.B
>>>>>>>>>>
>>>>>>>>>> On Mon, Jul 6, 2015 at 6:38 PM, Sachin Sabbarwal <[email protected]> wrote:
>>>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>>>> From: Sachin Sabbarwal <[email protected]>
>>>>>>>>>>> Date: Mon, Jul 6, 2015 at 5:34 PM
>>>>>>>>>>> Subject: Same pig script running slower with Tez as compared with run in Mapred mode
>>>>>>>>>>> To: [email protected]
>>>>>>>>>>>
>>>>>>>>>>> Hello all,
>>>>>>>>>>> I am trying out Apache Tez and have set Pig up to run in Tez mode.
>>>>>>>>>>> I'm running a Pig script against i) no data and ii) some data.
>>>>>>>>>>> In case i), when I run the script with Pig in Tez mode it completes in ~40 seconds, whereas with MapReduce it takes around 7-8 minutes.
>>>>>>>>>>> In case ii), with Tez the same script takes around 14-15 minutes, but with MapReduce it takes around 10 minutes.
>>>>>>>>>>> When I run the same Pig script with production data (which is much more than the data used for cases i) and ii)), the job takes hours to complete; that is why I am trying Tez, to run my Pig job faster. I'm not really sure what I might be missing here. Please help, and ask for any further information if required.
>>>>>>>>>>>
>>>>>>>>>>> Thanks

--
Sachin Sabbarwal
Linkedin: https://www.linkedin.com/profile?viewProfile=&key=95777265
Facebook: facebook.com/sachinsabbarwal
Quora: http://www.quora.com/Sachin-Sabbarwal
Blog: http://sachinsabbarwal.tumblr.com/
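A closing note on the earlier question of how the task counts (the 7660 total tasks and the "999" reducer parallelism) are decided: map-side task counts generally come from the input splits, while reduce-side parallelism is chosen by Pig at compile time unless it is set explicitly. A minimal sketch of pinning reducer parallelism by hand while the auto-parallelism issue is being investigated (relation names and values are illustrative):

    -- script-wide default for reduce-side parallelism (illustrative value)
    set default_parallel 20;

    -- override for one specific heavy operator; 'events' and 'userid' are placeholder names
    by_user = GROUP events BY userid PARALLEL 10;

PARALLEL and default_parallel only affect reduce-side vertices; they do not change the number of load/map tasks.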
