Re: MapOutputTracker.getMapSizesByExecutorId and mutation on the driver?
Hi,

I think the comment [1] is only correct for "getStatistics", which is called on the driver side. It was most likely added to "getMapSizesByExecutorId" by mistake.

Jacek Laskowski wrote:

> Hi,
>
> I've been reviewing how MapOutputTracker works and can't understand
> the comment [1]:
>
> // Synchronize on the returned array because, on the driver, it gets
> mutated in place
>
> How is this possible since "the returned array" is a local value? I'm
> stuck and would appreciate help. Thanks!
>
> (It also says "Called from executors" [2], so how could the driver be
> involved?!)
>
> [1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L145
> [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L133
>
> Regards,
> Jacek Laskowski
>
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
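For anyone else tripping over the same wording: a local val can still point at a shared mutable object. Below is a simplified, self-contained Scala sketch (not Spark's actual MapOutputTracker code; all names are made up) of why a caller that receives such a reference still has to synchronize on it.

// Simplified illustration, NOT Spark's actual code: a getter that returns a
// reference to an internal mutable array hands the caller the SAME object the
// owner may later mutate in place, so both sides must synchronize on it.
class StatusRegistry {
  private val statuses: Array[String] = Array.fill(4)("pending")

  // Returns a reference to the shared array, not a copy.
  def getStatuses: Array[String] = statuses

  // The owner (think: the driver) mutates the array in place.
  def setStatus(i: Int, value: String): Unit = statuses.synchronized {
    statuses(i) = value
  }
}

object SharedArrayDemo extends App {
  val registry = new StatusRegistry
  val seen = registry.getStatuses      // a "local" val, but a shared object
  seen.synchronized {                  // without this, a concurrent update
    println(seen.mkString(", "))       // could be observed halfway through
  }
  registry.setStatus(0, "done")
  println(seen(0))                     // prints "done": it is the same array
}

That matches Liang-Chi's point: only on the driver, where the tracker owns the live array, can something else mutate what the getter just returned.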
Re: Negative number of active tasks
Could you share pseudo-code for what you are doing?

Cheers!
C Khatri.

On Fri, Dec 23, 2016 at 4:33 PM, Andy Dang wrote:

> Hi all,
>
> Today I hit a weird bug in Spark 2.0.2 (vanilla Spark) - the executor tab
> shows a negative number of active tasks.
>
> I have about 25 jobs, each with 20k tasks, so the numbers are not that
> crazy.
>
> What could possibly be the cause of this bug? This is the first time I've
> seen it, and the only special thing I'm doing is saving multiple datasets
> at the same time to HDFS from different threads.
>
> Thanks,
> Andy
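Not Andy's actual code, but a minimal Scala sketch of the pattern he describes (several datasets written to HDFS concurrently from different threads of the same driver); the dataset names and output paths below are made up.

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: write several datasets to HDFS in parallel from the
// same SparkSession. Each write becomes its own Spark job, so jobs overlap.
object ConcurrentSaves extends App {
  val spark = SparkSession.builder().appName("concurrent-saves").getOrCreate()
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

  val datasets = Seq("a", "b", "c").map(name => name -> spark.range(1000000).toDF("id"))

  val writes = datasets.map { case (name, df) =>
    Future {
      // made-up output path
      df.write.mode("overwrite").parquet(s"hdfs:///tmp/out/$name")
    }
  }

  Await.result(Future.sequence(writes), Duration.Inf)
  spark.stop()
}

Whether this actually reproduces the negative active-task counter is unclear; it only mirrors the scenario described in the report.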
Re: Approach: Incremental data load from HBASE
Ted - correct. In my case I want an incremental import from HBase and an incremental load into Hive. Both approaches discussed earlier with indexing seem reasonable to me. But just as Sqoop supports incremental import and load for RDBMS sources, is there any tool that supports incremental import from HBase?

On Wed, Dec 21, 2016 at 10:04 PM, Ted Yu wrote:

> Incremental load traditionally means generating hfiles and using
> org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load the
> data into hbase.
>
> For your use case, the producer needs to find rows where the flag is 0 or 1.
> After such rows are obtained, it is up to you how the result of processing
> is delivered to hbase.
>
> Cheers
>
> On Wed, Dec 21, 2016 at 8:00 AM, Chetan Khatri <chetan.opensou...@gmail.com> wrote:
>
>> Ok, sure, I will ask.
>>
>> But what would be a generic best-practice solution for incremental load
>> from HBase?
>>
>> On Wed, Dec 21, 2016 at 8:42 PM, Ted Yu wrote:
>>
>>> I haven't used Gobblin.
>>> You can consider asking the Gobblin mailing list about the first option.
>>>
>>> The second option would work.
>>>
>>> On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri <chetan.opensou...@gmail.com> wrote:
>>>
>>>> Hello Guys,
>>>>
>>>> I would like to understand the different approaches for a distributed
>>>> incremental load from HBase. Is there any *tool / incubator project*
>>>> which satisfies this requirement?
>>>>
>>>> *Approach 1:*
>>>> Write a Kafka producer, manually maintain a flag column for events, and
>>>> ingest them with LinkedIn Gobblin to HDFS / S3.
>>>>
>>>> *Approach 2:*
>>>> Run a scheduled Spark job - read from HBase, do the transformations, and
>>>> maintain the flag column at the HBase level.
>>>>
>>>> In both approaches I need to maintain column-level flags, such as
>>>> 0 - default, 1 - sent, 2 - sent and acknowledged, so that next time the
>>>> producer can take another batch of 1000 rows where the flag is 0 or 1.
>>>>
>>>> I am looking for a best-practice approach with any distributed tool.
>>>>
>>>> Thanks. - Chetan Khatri
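A rough Scala sketch of Approach 2 from the quoted thread (a scheduled Spark job scanning HBase and picking rows whose flag is 0 or 1), using the plain TableInputFormat; the table name, column family, and qualifier below are assumptions, not anything from the thread.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: scan an HBase table and keep only rows whose flag
// column is 0 or 1, i.e. rows not yet acknowledged, as the incremental batch.
object IncrementalHBaseScan extends App {
  val spark = SparkSession.builder().appName("hbase-incremental").getOrCreate()
  val sc = spark.sparkContext

  val conf = HBaseConfiguration.create()
  conf.set(TableInputFormat.INPUT_TABLE, "events")   // assumed table name

  val hbaseRdd = sc.newAPIHadoopRDD(
    conf,
    classOf[TableInputFormat],
    classOf[ImmutableBytesWritable],
    classOf[Result])

  val pending = hbaseRdd
    .map { case (_, result) =>
      val rowKey = Bytes.toString(result.getRow)
      // assumed column family "meta" and qualifier "flag" holding "0"/"1"/"2";
      // a row with no flag cell is simply excluded from the batch
      val flag = Option(result.getValue(Bytes.toBytes("meta"), Bytes.toBytes("flag")))
        .map(bytes => Bytes.toString(bytes))
        .getOrElse("")
      (rowKey, flag)
    }
    .filter { case (_, flag) => flag == "0" || flag == "1" }

  println(s"rows to process in this batch: ${pending.count()}")
  spark.stop()
}

The write-back of the updated flag (2 - sent and acknowledged) is left out on purpose; as Ted notes, how the result of processing is delivered back to HBase or Hive is up to the job.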
Re: Best Practice for Spark Job Jar Generation
Correct - so there is the approach you suggested and the uber-jar approach. I think the uber-jar approach is the better practice, because environment migration becomes easy, and performance-wise I would also expect the uber-jar approach to be better optimised than the uber-less approach.

Thanks.

On Fri, Dec 23, 2016 at 11:41 PM, Andy Dang wrote:

> We remodel Spark's dependencies and ours together and chuck them under the
> /jars path. There are other ways to do it, but we want the classpath to be
> strictly as close to development as possible.
>
> ---
> Regards,
> Andy
>
> On Fri, Dec 23, 2016 at 6:00 PM, Chetan Khatri <chetan.opensou...@gmail.com> wrote:
>
>> Andy, thanks for the reply.
>>
>> If we download all the dependencies to a separate location and link them
>> with the Spark job jar on the Spark cluster, is that the best way to run
>> a Spark job?
>>
>> Thanks.
>>
>> On Fri, Dec 23, 2016 at 8:34 PM, Andy Dang wrote:
>>
>>> I used to use an uber jar in Spark 1.x because of classpath issues (we
>>> couldn't re-model our dependencies based on our code, and thus the
>>> cluster's runtime dependencies could be very different from running Spark
>>> directly in the IDE). We had to use the userClasspathFirst "hack" to work
>>> around this.
>>>
>>> With Spark 2, it's easier to replace dependencies (say, Guava) than
>>> before. We moved away from deploying a superjar and just pass the
>>> libraries as part of the Spark jars (we still can't use Guava v19 or later
>>> because Spark uses a deprecated method that's no longer available, but
>>> that's not a big issue for us).
>>>
>>> ---
>>> Regards,
>>> Andy
>>>
>>> On Fri, Dec 23, 2016 at 6:44 AM, Chetan Khatri <chetan.opensou...@gmail.com> wrote:
>>>
>>>> Hello Spark Community,
>>>>
>>>> For Spark job creation I use SBT assembly to build an uber ("super") jar
>>>> and then submit it with spark-submit. For example:
>>>>
>>>> bin/spark-submit --class hbase.spark.chetan.com.SparkHbaseJob
>>>> /home/chetan/hbase-spark/SparkMSAPoc-assembly-1.0.jar
>>>>
>>>> But other folks have argued for an uber-less jar. Guys, can you please
>>>> explain the industry-standard best practice for this?
>>>>
>>>> Thanks,
>>>>
>>>> Chetan Khatri.
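For reference, a minimal build.sbt sketch of the uber-jar route with the sbt-assembly plugin; the versions and merge strategy below are assumptions for illustration, not a vetted configuration. Spark itself is marked "provided" so that only application dependencies end up in the fat jar.

// build.sbt - hypothetical sketch of the uber-jar approach with sbt-assembly
// (project/plugins.sbt would also need:
//   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3"))
name := "spark-hbase-job"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // Spark is provided by the cluster, so keep it out of the assembly
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.0.2" % "provided"
  // application-only dependencies go here and get bundled
)

// resolve duplicate files pulled in by transitive dependencies
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}

The single jar this produces is what gets handed to spark-submit, as in the example quoted above.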
Re: Best Practice for Spark Job Jar Generation
We remodel Spark's dependencies and ours together and chuck them under the /jars path. There are other ways to do it, but we want the classpath to be strictly as close to development as possible.

---
Regards,
Andy

On Fri, Dec 23, 2016 at 6:00 PM, Chetan Khatri wrote:

> Andy, thanks for the reply.
>
> If we download all the dependencies to a separate location and link them
> with the Spark job jar on the Spark cluster, is that the best way to run
> a Spark job?
>
> Thanks.
>
> On Fri, Dec 23, 2016 at 8:34 PM, Andy Dang wrote:
>
>> I used to use an uber jar in Spark 1.x because of classpath issues (we
>> couldn't re-model our dependencies based on our code, and thus the
>> cluster's runtime dependencies could be very different from running Spark
>> directly in the IDE). We had to use the userClasspathFirst "hack" to work
>> around this.
>>
>> With Spark 2, it's easier to replace dependencies (say, Guava) than
>> before. We moved away from deploying a superjar and just pass the
>> libraries as part of the Spark jars (we still can't use Guava v19 or later
>> because Spark uses a deprecated method that's no longer available, but
>> that's not a big issue for us).
>>
>> ---
>> Regards,
>> Andy
>>
>> On Fri, Dec 23, 2016 at 6:44 AM, Chetan Khatri <chetan.opensou...@gmail.com> wrote:
>>
>>> Hello Spark Community,
>>>
>>> For Spark job creation I use SBT assembly to build an uber ("super") jar
>>> and then submit it with spark-submit. For example:
>>>
>>> bin/spark-submit --class hbase.spark.chetan.com.SparkHbaseJob
>>> /home/chetan/hbase-spark/SparkMSAPoc-assembly-1.0.jar
>>>
>>> But other folks have argued for an uber-less jar. Guys, can you please
>>> explain the industry-standard best practice for this?
>>>
>>> Thanks,
>>>
>>> Chetan Khatri.
Re: Best Practice for Spark Job Jar Generation
Andy, thanks for the reply.

If we download all the dependencies to a separate location and link them with the Spark job jar on the Spark cluster, is that the best way to run a Spark job?

Thanks.

On Fri, Dec 23, 2016 at 8:34 PM, Andy Dang wrote:

> I used to use an uber jar in Spark 1.x because of classpath issues (we
> couldn't re-model our dependencies based on our code, and thus the
> cluster's runtime dependencies could be very different from running Spark
> directly in the IDE). We had to use the userClasspathFirst "hack" to work
> around this.
>
> With Spark 2, it's easier to replace dependencies (say, Guava) than
> before. We moved away from deploying a superjar and just pass the
> libraries as part of the Spark jars (we still can't use Guava v19 or later
> because Spark uses a deprecated method that's no longer available, but
> that's not a big issue for us).
>
> ---
> Regards,
> Andy
>
> On Fri, Dec 23, 2016 at 6:44 AM, Chetan Khatri <chetan.opensou...@gmail.com> wrote:
>
>> Hello Spark Community,
>>
>> For Spark job creation I use SBT assembly to build an uber ("super") jar
>> and then submit it with spark-submit. For example:
>>
>> bin/spark-submit --class hbase.spark.chetan.com.SparkHbaseJob
>> /home/chetan/hbase-spark/SparkMSAPoc-assembly-1.0.jar
>>
>> But other folks have argued for an uber-less jar. Guys, can you please
>> explain the industry-standard best practice for this?
>>
>> Thanks,
>>
>> Chetan Khatri.
Re: Best Practice for Spark Job Jar Generation
I used to use an uber jar in Spark 1.x because of classpath issues (we couldn't re-model our dependencies based on our code, and thus the cluster's runtime dependencies could be very different from running Spark directly in the IDE). We had to use the userClasspathFirst "hack" to work around this.

With Spark 2, it's easier to replace dependencies (say, Guava) than before. We moved away from deploying a superjar and just pass the libraries as part of the Spark jars (we still can't use Guava v19 or later because Spark uses a deprecated method that's no longer available, but that's not a big issue for us).

---
Regards,
Andy

On Fri, Dec 23, 2016 at 6:44 AM, Chetan Khatri wrote:

> Hello Spark Community,
>
> For Spark job creation I use SBT assembly to build an uber ("super") jar
> and then submit it with spark-submit. For example:
>
> bin/spark-submit --class hbase.spark.chetan.com.SparkHbaseJob
> /home/chetan/hbase-spark/SparkMSAPoc-assembly-1.0.jar
>
> But other folks have argued for an uber-less jar. Guys, can you please
> explain the industry-standard best practice for this?
>
> Thanks,
>
> Chetan Khatri.
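On the Guava point: one workaround that keeps the uber jar, rather than swapping libraries the way Andy describes, is to shade (relocate) Guava inside the assembly so the application's copy can never collide with the one Spark ships. This is a hedged sketch assuming the sbt-assembly plugin's shading support, with a made-up target package name.

// build.sbt fragment - hypothetical shading of Guava with sbt-assembly:
// the application's Guava classes are relocated under a private package,
// so Spark keeps using its own Guava untouched.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "myapp.shaded.guava.@1").inAll
)

Whether that is worth the extra build complexity, versus simply staying on a Guava version compatible with Spark as Andy does, is a judgment call.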
MapOutputTracker.getMapSizesByExecutorId and mutation on the driver?
Hi,

I've been reviewing how MapOutputTracker works and can't understand the comment [1]:

// Synchronize on the returned array because, on the driver, it gets mutated in place

How is this possible since "the returned array" is a local value? I'm stuck and would appreciate help. Thanks!

(It also says "Called from executors" [2], so how could the driver be involved?!)

[1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L145
[2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L133

Regards,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
Negative number of active tasks
Hi all,

Today I hit a weird bug in Spark 2.0.2 (vanilla Spark) - the executor tab shows a negative number of active tasks.

I have about 25 jobs, each with 20k tasks, so the numbers are not that crazy.

What could possibly be the cause of this bug? This is the first time I've seen it, and the only special thing I'm doing is saving multiple datasets at the same time to HDFS from different threads.

Thanks,
Andy
Dependency Injection and Microservice development with Spark
Hello Community,

My current approach for Spark job development is Scala + SBT, building an uber jar, with a yml properties file to pass configuration parameters. But if I would like to use dependency injection and microservice-style development (like the Spring Boot features) in Scala, what would be the standard approach?

Thanks
Chetan
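Not a claim about what the standard is, but one lightweight pattern that needs no framework is plain constructor injection in Scala: the SparkSession and the values read from the yml properties file are passed into small service classes. All names in the sketch below are made up for illustration.

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical sketch: constructor injection without any DI framework.
// Configuration values (e.g. parsed from the yml properties file) and the
// SparkSession are passed in, so each piece can be tested in isolation.
final case class JobConfig(inputPath: String, outputPath: String)

class EventReader(spark: SparkSession, config: JobConfig) {
  def read(): DataFrame = spark.read.parquet(config.inputPath)
}

class EventWriter(config: JobConfig) {
  def write(df: DataFrame): Unit = df.write.mode("overwrite").parquet(config.outputPath)
}

object EventJob extends App {
  val spark  = SparkSession.builder().appName("event-job").getOrCreate()
  val config = JobConfig(inputPath = "hdfs:///data/in", outputPath = "hdfs:///data/out")

  val reader = new EventReader(spark, config)
  val writer = new EventWriter(config)
  writer.write(reader.read())

  spark.stop()
}

Heavier frameworks (Guice, Spring, MacWire, and so on) can wire the same object graph; which of them counts as the standard for Spark jobs is exactly the open question above.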