Re: How do I parallelize Spark Jobs at Executor Level.
Yes, it's also possible. Just pass in the sequence files you want to process as a comma-separated list to sc.sequenceFile().

-Deng
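A minimal sketch of the suggestion above, assuming Hadoop-style comma-separated input paths are accepted by sc.sequenceFile() as Deng describes. The HDFS paths, app name, and the count() step are placeholders, not code from this thread:

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class MultiSequenceFileJob {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("MultiSequenceFileJob");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // One RDD backed by several sequence files; Spark/Hadoop splits the
            // input into partitions, so the load itself runs on the executors.
            String paths = "hdfs:///logs/part-00000.seq,hdfs:///logs/part-00001.seq";
            JavaPairRDD<IntWritable, BytesWritable> content =
                sc.sequenceFile(paths, IntWritable.class, BytesWritable.class);

            System.out.println("Total records: " + content.count());
            sc.stop();
        }
    }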
Re: How do I parallelize Spark Jobs at Executor Level.
Hi Deng,

Thanks for the response. Is it possible to load the sequence files in parallel and process each of them in parallel as well?

Regards
Vinoth Sankar
Re: How do I parallelize Spark Jobs at Executor Level.
Hi,

You seem to be creating a new RDD for each element of your files RDD, which won't work: RDDs can only be created and acted on from the driver, not from inside an executor-side closure. What I would suggest is to load and process only one sequence file per Spark job, then execute multiple jobs, one for each sequence file.

With regard to your question of where to view the logs inside the closures: those run on the executors, so you should be able to see them in the executor logs (via the Spark UI, on the Executors page).

HTH,
Deng
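A hedged sketch of the one-file-per-job pattern described above: the application takes a single sequence file path as an argument, so each file is handled by its own job (for example, one spark-submit invocation per path). The class name, argument layout, byte-extraction step, and output path are illustrative placeholders loosely adapted from Vinoth's snippet further down the thread:

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.Function;
    import scala.Tuple2;

    public class SingleSequenceFileJob {
        public static void main(String[] args) {
            String hdfsPath = args[0];   // one sequence file per job
            String outputPath = args[1]; // where to save the processed records

            SparkConf conf = new SparkConf().setAppName("SingleSequenceFileJob");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaPairRDD<IntWritable, BytesWritable> content =
                sc.sequenceFile(hdfsPath, IntWritable.class, BytesWritable.class);

            // Extract the raw bytes of each record; this transformation runs on the executors.
            JavaRDD<byte[]> records = content.map(
                new Function<Tuple2<IntWritable, BytesWritable>, byte[]>() {
                    public byte[] call(Tuple2<IntWritable, BytesWritable> tuple) {
                        return tuple._2().copyBytes();
                    }
                });

            records.saveAsObjectFile(outputPath);
            sc.stop();
        }
    }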
Re: How do I parallelize Spark Jobs at Executor Level.
Hi Adrian,

Yes, I need to load all the files and process them in parallel. The following code doesn't seem to work (I used map here, and also tried foreach). I'm just downloading the files from HDFS to the local system and printing the log count for each file. It doesn't throw any exceptions, but it doesn't work either: the files are not downloaded, and I never see the LOGGER print. The same code works if I iterate over the files on the driver, but then it isn't parallelized. How do I get this code parallelized and working?

    JavaRDD<String> files = sparkContext.parallelize(fileList);

    files.map(new Function<String, Void>()
    {
        public static final long serialVersionUID = 1L;

        @Override
        public Void call(String hdfsPath) throws Exception
        {
            JavaPairRDD<IntWritable, BytesWritable> hdfsContent =
                sc.sequenceFile(hdfsPath, IntWritable.class, BytesWritable.class);
            JavaRDD<Message> logs = hdfsContent.map(
                new Function<Tuple2<IntWritable, BytesWritable>, Message>()
            {
                public Message call(Tuple2<IntWritable, BytesWritable> tuple2) throws Exception
                {
                    BytesWritable value = tuple2._2();
                    BytesWritable tmp = new BytesWritable();
                    tmp.setCapacity(value.getLength());
                    tmp.set(value);
                    return (Message) getProtos(1, tmp.getBytes());
                }
            });

            String path = "/home/sas/logsfilterapp/temp/" + hdfsPath.substring(26);

            Thread.sleep(2000);
            logs.saveAsObjectFile(path);

            LOGGER.log(Level.INFO, "HDFS Path : {0} Count : {1}",
                new Object[] { hdfsPath, logs.count() });
            return null;
        }
    });

Note: in another scenario as well, I didn't get the log statements that sit inside map/filter closures, while logs outside those closures are printed as usual. If I can't get logger output from inside these closures, how do I debug them?

Thanks
Vinoth Sankar
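For context on why the snippet above stays silent: the outer map is a lazy transformation with no action called on its result, so the closure may never run at all, and even when it does, sc cannot be used inside it, because a SparkContext exists only on the driver and nested RDD creation on an executor is not supported. One alternative, shown here only as a sketch under the assumption of Hadoop 2.x and the same sparkContext and fileList variables as above, with a placeholder processing step, is to read each file on the executor directly through the Hadoop SequenceFile API:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.VoidFunction;

    JavaRDD<String> files = sparkContext.parallelize(fileList);

    // Runs on the executors: each path is opened with plain Hadoop I/O,
    // so no SparkContext is needed inside the closure.
    files.foreach(new VoidFunction<String>() {
        public void call(String hdfsPath) throws IOException {
            Configuration conf = new Configuration();
            SequenceFile.Reader reader = new SequenceFile.Reader(
                conf, SequenceFile.Reader.file(new Path(hdfsPath)));
            IntWritable key = new IntWritable();
            BytesWritable value = new BytesWritable();
            long count = 0;
            while (reader.next(key, value)) {
                // process value.copyBytes() here, e.g. parse it into a Message
                count++;
            }
            reader.close();
            // shows up in the executor stdout log, viewable from the Spark UI
            System.out.println(hdfsPath + " : " + count + " records");
        }
    });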
Re: How do I parallelize Spark Jobs at Executor Level.
The first line is distributing your fileList variable across the cluster as an RDD, partitioned using the default partitioner settings (e.g. the number of cores in your cluster).

Each of your workers would get one or more slices of the data (depending on how many cores each executor has); that abstraction is called a partition.

What is your use case? If you want to load the files and continue processing in parallel, then a simple .map should work. If you want to execute arbitrary code based on the list of files that each executor received, then you need to use .foreach, which will be executed for each of the entries, on the workers.

-adrian
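An illustrative sketch of the .map versus .foreach distinction described above, assuming an existing javaSparkContext; the file names and the processing stubs are hypothetical:

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function;
    import org.apache.spark.api.java.function.VoidFunction;

    List<String> fileList = Arrays.asList("hdfs:///in/a.seq", "hdfs:///in/b.seq");
    JavaRDD<String> files = javaSparkContext.parallelize(fileList);

    // .map: a transformation that produces a new RDD; the function runs on the
    // executors once an action (here, collect) is triggered.
    JavaRDD<Integer> pathLengths = files.map(new Function<String, Integer>() {
        public Integer call(String path) {
            return path.length();
        }
    });
    System.out.println(pathLengths.collect());

    // .foreach: an action that runs arbitrary side-effecting code on the
    // executors, once per element, and returns nothing to the driver.
    files.foreach(new VoidFunction<String>() {
        public void call(String path) {
            System.out.println("Processing " + path + " on an executor");
        }
    });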
How do I parallelize Spark Jobs at Executor Level.
Hi,

I'm reading and filtering a large number of files using Spark, but the work is getting parallelized only at the Spark driver level. How do I make it parallelize down to the executor (worker) level? Refer to the following sample. Is there any way to iterate the localIterator in parallel?

Note: I use Java 1.7.

    JavaRDD<String> files = javaSparkContext.parallelize(fileList);
    Iterator<String> localIterator = files.toLocalIterator();

Regards
Vinoth Sankar
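For reference on why the snippet above runs only on the driver: toLocalIterator() streams each partition back to the driver, so any loop over localIterator executes in the driver JVM. A hedged alternative sketch, assuming the same javaSparkContext and fileList and a hypothetical processing stub, keeps the iteration on the executors by using foreachPartition instead:

    import java.util.Iterator;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.VoidFunction;

    JavaRDD<String> files = javaSparkContext.parallelize(fileList);

    // Each executor iterates over its own partition of file paths in parallel;
    // nothing is shipped back to the driver.
    files.foreachPartition(new VoidFunction<Iterator<String>>() {
        public void call(Iterator<String> paths) {
            while (paths.hasNext()) {
                String hdfsPath = paths.next();
                // read and filter the file here with plain HDFS/Hadoop I/O
                System.out.println("Handling " + hdfsPath + " on an executor");
            }
        }
    });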