streaming on yarn

2016-06-24 Thread Alex Dzhagriev
Hello, Can someone please share their opinions on the options available for running Spark streaming jobs on YARN? The first thing that comes to my mind is to use Slider. Googling for such experience didn't give me much. From my experience running the same jobs on Mesos, I have two concerns: automatic

--jars for mesos cluster

2016-05-03 Thread Alex Dzhagriev
Hello all, In the Mesos-related Spark docs ( http://spark.apache.org/docs/1.6.0/running-on-mesos.html#cluster-mode) I found this statement: > Note that jars or python files that are passed to spark-submit should be URIs reachable by Mesos slaves, as the Spark driver doesn't automatically
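
For illustration only, here is what that constraint typically means in practice: in Mesos cluster mode the driver runs on a slave, so any jar passed with --jars must live at a URI the slaves can fetch (e.g. HDFS or HTTP), not at a path local to the submitting machine. All hosts, paths, and class names below are hypothetical.

```
# Sketch: reference dependencies by slave-reachable URIs in cluster mode.
spark-submit \
  --master mesos://dispatcher-host:7077 \
  --deploy-mode cluster \
  --class com.example.Main \
  --jars hdfs:///libs/dependency.jar \
  hdfs:///apps/my-app.jar
```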

Re: Enabling spark_shuffle service without restarting YARN Node Manager

2016-03-16 Thread Alex Dzhagriev
Hi Vinay, I believe it's not possible, as the spark-shuffle code has to run in the same JVM process as the Node Manager. I haven't heard anything about on-the-fly bytecode loading in the Node Manager. Thanks, Alex. On Wed, Mar 16, 2016 at 10:12 AM, Vinay Kashyap wrote: > Hi
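
For context, the shuffle service is wired in as a YARN auxiliary service via yarn-site.xml, which the Node Manager reads only at startup; a sketch of the standard configuration from the Spark-on-YARN docs:

```xml
<!-- yarn-site.xml: Spark's external shuffle service runs as an auxiliary
     service inside the Node Manager JVM, so it is loaded only at startup. -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```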

Re: Sorting the RDD

2016-03-03 Thread Alex Dzhagriev
Hi Angel, Your x() function returns Any, so there is no Ordering[Any] defined in scope, and it doesn't make sense to define one. Basically it's the same as trying to order java Objects, which don't have any fields. So the problem is with your x() function: make sure it returns something
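
A minimal sketch of the problem (the sample data is made up; x is the function from the thread): sortBy resolves an implicit Ordering for the key type, which exists for concrete types like Int but not for Any.

```scala
val rdd = sc.parallelize(Seq("b", "aa", "ccc"))

def x(s: String): Any = s.length
// rdd.sortBy(x)  // won't compile: no implicit Ordering[Any] in scope

def xTyped(s: String): Int = s.length
rdd.sortBy(xTyped) // compiles: Ordering[Int] is provided by the library
```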

Re: Spark Integration Patterns

2016-02-29 Thread Alex Dzhagriev
<moshir.mik...@gmail.com> wrote: > Hi Alex, thanks for the link. Will check it. Does someone know of a more streamlined approach? > On Mon, Feb 29, 2016 at 10:28, Alex Dzhagriev <dzh...@gmail.com> wrote: >> Hi Moshir,

Re: Spark Integration Patterns

2016-02-29 Thread Alex Dzhagriev
Hi Moshir, I think you can use the REST API provided with Spark: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala Unfortunately, I haven't found any documentation, but it looks fine. Thanks, Alex. On Sun, Feb 28, 2016 at 3:25
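
As an illustration, RestSubmissionServer speaks JSON over HTTP (port 6066 by default on a standalone master); a hedged sketch of a create-submission request, with hypothetical host, jar, and class names:

```
# Sketch: submit an app through the undocumented REST submission endpoint.
curl -X POST http://master-host:6066/v1/submissions/create \
  --header "Content-Type: application/json" \
  --data '{
    "action": "CreateSubmissionRequest",
    "clientSparkVersion": "1.6.0",
    "appResource": "hdfs:///apps/my-app.jar",
    "mainClass": "com.example.Main",
    "appArgs": [],
    "environmentVariables": {"SPARK_ENV_LOADED": "1"},
    "sparkProperties": {
      "spark.app.name": "my-app",
      "spark.master": "spark://master-host:7077",
      "spark.submit.deployMode": "cluster"
    }
  }'
```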

Re: reasonable number of executors

2016-02-24 Thread Alex Dzhagriev
there is a section that is connected to your question. > On 23 February 2016 at 16:49, Alex Dzhagriev <dzh...@gmail.com> wrote: >> Hello all, Can someone please advise me on the pros and cons on how to allocate the resources: many small heap machin

reasonable number of executors

2016-02-23 Thread Alex Dzhagriev
Hello all, Can someone please advise me on the pros and cons of how to allocate the resources: many machines with small heaps and one core each, or a few machines with big heaps and many cores? I'm sure it depends on the data flow and there is no best-practice solution. E.g. with a bigger heap I can perform
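
For concreteness, the two extremes map onto the usual submit flags; a sketch with made-up numbers (both variants total ~50 cores and ~100 GB):

```
# Many small executors: one core and a small heap each
spark-submit --num-executors 50 --executor-cores 1 --executor-memory 2g ...

# Few large executors: many cores sharing one big heap
spark-submit --num-executors 5 --executor-cores 10 --executor-memory 20g ...
```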

Re: Can we load csv partitioned data into one DF?

2016-02-22 Thread Alex Dzhagriev
Hi Saif, You can put your files into one directory and read them as text. Another option is to read them separately and then union the datasets. Thanks, Alex. On Mon, Feb 22, 2016 at 4:25 PM, wrote: > Hello all, I am facing a silly data question. If I have +100
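
Both options in a minimal Scala sketch (the paths are hypothetical):

```scala
// Option 1: one textFile call over a directory or glob covering all parts
val all = sc.textFile("/data/csv/2016-02-*")

// Option 2: read the parts separately, then union them
val part1 = sc.textFile("/data/csv/part1")
val part2 = sc.textFile("/data/csv/part2")
val combined = part1.union(part2)
```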

an OOM while persist as DISK_ONLY

2016-02-22 Thread Alex Dzhagriev
Hello all, I'm using Spark 1.6 and trying to cache a dataset which is 1.5 TB. I have only ~800 GB of RAM in total, so I'm choosing the DISK_ONLY storage level. Unfortunately, I'm running into the overhead memory limit: Container killed by YARN for exceeding memory limits. 27.0 GB of 27 GB
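
For anyone hitting the same message: even with DISK_ONLY, blocks pass through executor memory on their way to disk, and the off-heap portion counts against YARN's container limit. On Spark 1.6 the usual first knob is the executor memory overhead setting; the value below is purely illustrative:

```
# Give the container more headroom beyond the JVM heap (value in MB).
spark-submit --conf spark.yarn.executor.memoryOverhead=4096 ...
```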

Re: Importing csv files into Hive ORC target table

2016-02-18 Thread Alex Dzhagriev
that one column is missing: > scala> ttt.first > res81: Invoice = Invoice(360,10/02/2014,"?2,500.00",?0.00) > it seems that I am missing the last column here! I suspect the cause of the problem is the "," used in "?2,500.00", which is
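
One common way around embedded commas, shown as a hedged sketch: split on commas only when they fall outside double-quoted fields (the sample line loosely mirrors the thread's Invoice record):

```scala
val line = """360,10/02/2014,"2,500.00",0.00"""

// A comma is a separator only if an even number of quotes follows it,
// i.e. it is not inside a quoted field.
val fields = line.split(""",(?=(?:[^"]*"[^"]*")*[^"]*$)""")
// fields: Array(360, 10/02/2014, "2,500.00", 0.00) -- four columns
```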

Re: Importing csv files into Hive ORC target table

2016-02-17 Thread Alex Dzhagriev
Hi Mich, You can use data frames ( http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes) to achieve that. val sqlContext = new HiveContext(sc) var rdd = sc.textFile("/data/stg/table2") //... //perform your business logic, cleanups, etc. //...
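
A hedged completion of that truncated snippet, roughly as it might continue on Spark 1.6 (the case class, column split, and table name are all made up):

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

case class Invoice(id: Int, date: String, amount: String)

val df = sc.textFile("/data/stg/table2")
  .map(_.split(","))
  .map(a => Invoice(a(0).toInt, a(1), a(2)))
  .toDF()

// Write the cleaned data into the ORC-backed Hive target table.
df.write.format("orc").mode(SaveMode.Overwrite).saveAsTable("target_orc")
```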

cartesian with Dataset

2016-02-17 Thread Alex Dzhagriev
Hello all, Is anybody aware of any plans to support cartesian for Datasets? Are there any ways to work around this issue without switching to RDDs? Thanks, Alex.
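
One possible workaround, sketched against the 1.6 API and untested: joinWith on an always-true condition should plan as a cross product while keeping Dataset typing; the RDD round-trip is shown as the fallback the question hoped to avoid.

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.lit
import sqlContext.implicits._

val left  = Seq(1, 2, 3).toDS()
val right = Seq("a", "b").toDS()

// A trivially-true join condition yields all pairs, typed as tuples.
val crossed: Dataset[(Int, String)] = left.joinWith(right, lit(true))

// Fallback: cartesian on the underlying RDDs, then back to a Dataset.
val viaRdd = left.rdd.cartesian(right.rdd).toDS()
```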