Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Chawla,Sumit
Tim, We will try to run the application in coarse grain mode, and share the findings with you. Regards Sumit Chawla On Mon, Dec 19, 2016 at 3:11 PM, Timothy Chen wrote: > Dynamic allocation works with Coarse grain mode only, we weren't aware > of a need for Fine grain mode

Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Timothy Chen
Dynamic allocation works with Coarse grain mode only; we weren't aware of a need for Fine grain mode after we enabled dynamic allocation support on the coarse grain mode. What's the reason you're running fine grain mode instead of coarse grain + dynamic allocation? Tim On Mon, Dec 19, 2016 at 2:45

Loading a class from a dependency jar

2016-12-19 Thread viraj
Hi, I am currently using the kite library (https://github.com/kite-sdk/kite) to persist to HBase from my Spark job. All this happens in the driver. I am on Spark version 1.6.1. The problem I am facing is that a particular class in one of the dependency jars is not found by kite when it uses
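The archive does not record a resolution for this one. For what it's worth, a common first step for driver-side class-loading problems on Spark 1.6 is to put the jar on the driver classpath explicitly; the sketch below is an illustration under stated assumptions (placeholder paths, and the userClassPathFirst flag only matters if Spark bundles a conflicting copy of the class), not the thread's confirmed fix.

```python
from pyspark import SparkConf, SparkContext

# Illustrative only -- not a confirmed fix for this thread.
# "/path/to/dependency.jar" is a placeholder for the real kite dependency.
conf = (SparkConf()
        .setAppName("kite-hbase-job")
        # Ship the jar with the application and expose it to the driver:
        .set("spark.jars", "/path/to/dependency.jar")
        .set("spark.driver.extraClassPath", "/path/to/dependency.jar")
        # Prefer the user's jars over Spark's bundled copies on conflict:
        .set("spark.driver.userClassPathFirst", "true"))

sc = SparkContext(conf=conf)
```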

Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Mehdi Meziane
We would be interested in the results if you give Dynamic allocation with Mesos a try! - Original Mail - From: "Michael Gummelt" To: "Sumit Chawla" Cc: u...@mesos.apache.org, d...@mesos.apache.org, "User" ,

Re: PySpark: [Errno 8] nodename nor servname provided, or not known

2016-12-19 Thread Jain, Nishit
Found it. Somehow my host mapping was messing it up. Changing it to point to localhost worked: /etc/hosts #127.0.0.1 XX.com 127.0.0.1 localhost From: "Jain, Nishit" > Date: Monday, December 19, 2016 at 2:54 PM To:

Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Michael Gummelt
> Is this problem of idle executors sticking around solved in Dynamic Resource Allocation? Is there some timeout after which idle executors can just shut down and clean up their resources? Yes, that's exactly what dynamic allocation does. But again I have no idea what the state of dynamic

Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Chawla,Sumit
Great. Makes much better sense now. What would be the reason to have spark.mesos.mesosExecutor.cores greater than 1, as this number doesn't include the number of cores for tasks? So in my case it seems like 30 CPUs are allocated to executors. And there are 48 tasks, so 48 + 30 = 78 CPUs. And I am

PySpark: [Errno 8] nodename nor servname provided, or not known

2016-12-19 Thread Jain, Nishit
Hi, I am using the pre-built 'spark-2.0.1-bin-hadoop2.7' and when I try to start pyspark, I get the following message. Any ideas what could be wrong? I tried using python3 and setting SPARK_LOCAL_IP to 127.0.0.1, but I get the same error. ~ -> cd /Applications/spark-2.0.1-bin-hadoop2.7/bin/

Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Michael Gummelt
> I should presume that the number of executors should be less than the number of tasks. No. Each executor runs 0 or more tasks. Each executor consumes 1 CPU, and each task running on that executor consumes another CPU. You can customize this via spark.mesos.mesosExecutor.cores (
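To make that accounting concrete, here is a small sketch using the 30-executor/48-task figures quoted elsewhere in this thread; the config keys are real Spark settings, and the numbers are just the thread's example:

```python
from pyspark import SparkConf

# Fine-grained mode accounting as described above: each executor
# permanently holds spark.mesos.mesosExecutor.cores CPUs (default 1),
# and every running task holds one more CPU on top of that.
conf = (SparkConf()
        .set("spark.mesos.coarse", "false")            # fine-grained mode
        .set("spark.mesos.mesosExecutor.cores", "1"))  # CPUs per executor itself

executors, running_tasks, executor_cores = 30, 48, 1
total_cpus = executors * executor_cores + running_tasks
print(total_cpus)  # 78 -- the count observed in this thread
```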

Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Chawla,Sumit
Ah, thanks. Looks like I skipped reading this: *"Neither will executors terminate when they’re idle."* So in my job scenario, I should presume that the number of executors should be less than the number of tasks. Ideally one executor should execute 1 or more tasks. But I am observing something strange

Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Timothy Chen
Hi Chawla, One possible reason is that Mesos fine grain mode also takes up cores to run the executor on each host, so if you have 20 agents running fine-grained executors, they will take up 20 cores while they're still running. Tim On Fri, Dec 16, 2016 at 8:41 AM, Chawla,Sumit

Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Michael Gummelt
Yea, the idea is to use dynamic allocation. I can't speak to how well it works with Mesos, though. On Mon, Dec 19, 2016 at 11:01 AM, Mehdi Meziane wrote: > I think that what you are looking for is Dynamic resource allocation: >

Re: [Spark SQL] Task failed while writing rows

2016-12-19 Thread Michael Stratton
I don't think the issue is an empty partition, but given the premature EOF exception it may not hurt to try a repartition prior to writing, just to rule it out. On Mon, Dec 19, 2016 at 1:53 PM, Joseph Naegele wrote: > Thanks Michael, hdfs dfsadmin -report tells me:
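A minimal sketch of that experiment; the partition count and output path are placeholders, and `df` stands in for the DataFrame whose write fails:

```python
# Repartition before writing to rule out empty/skewed partitions.
# 200 partitions and the output path are illustrative values only.
df = spark.range(0, 1000000)  # placeholder for the failing DataFrame
df.repartition(200).write.mode("overwrite").orc("hdfs:///tmp/orc_out")
```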

Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Mehdi Meziane
I think that what you are looking for is Dynamic resource allocation: http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your

Re: Adding Hive support to existing SparkSession (or starting PySpark with Hive support)

2016-12-19 Thread Sergey B.
I have asked a similar question here: http://stackoverflow.com/questions/40701518/spark-2-0-redefining-sparksession-params-through-getorcreate-and-not-seeing-cha Please see the answer; it basically states that it's impossible to change a session's config once it has been initialized. On Mon, Dec 19,
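The practical consequence for the Hive-support question in this thread: since getOrCreate() hands back the already-initialized session unchanged, the existing session has to be stopped before options like enableHiveSupport() can take effect. A sketch only, not guaranteed against every PySpark 2.0 shell quirk:

```python
from pyspark.sql import SparkSession

# Stop the session the pyspark shell pre-created for us...
spark.stop()

# ...then build a fresh one; only now does the new option apply.
spark = (SparkSession.builder
         .enableHiveSupport()
         .getOrCreate())
```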

RE: [Spark SQL] Task failed while writing rows

2016-12-19 Thread Joseph Naegele
Thanks Michael, hdfs dfsadmin -report tells me: Configured Capacity: 7999424823296 (7.28 TB) Present Capacity: 7997657774971 (7.27 TB) DFS Remaining: 7959091768187 (7.24 TB) DFS Used: 38566006784 (35.92 GB) DFS Used%: 0.48% Under replicated blocks: 0 Blocks with corrupt replicas: 0

Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Chawla,Sumit
But coarse-grained mode does exactly what I am trying to avoid here: in exchange for lower startup overhead, it keeps the resources reserved for the entire duration of the job. Regards Sumit Chawla On Mon, Dec 19, 2016 at 10:06 AM, Michael Gummelt wrote: > Hi > > I

Pivot in Spark with Case and when

2016-12-19 Thread KhajaAsmath Mohammed
Hi, I am trying to convert a sample of Hive code into Spark SQL for better performance. Below is the part of the Hive query that needs to be converted to Spark SQL. All the data is grouped on a particular column (id), the max value (value column) is taken for that grouped column (id), and pivoted
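The query itself is cut off in the archive, so here is a hedged sketch of the pattern being described; the schema (id, name, value) and the pivoted literals are assumptions:

```python
# Assumed schema standing in for the truncated Hive query.
df = spark.createDataFrame(
    [(1, "a", 10), (1, "b", 20), (2, "a", 5)], ["id", "name", "value"])
df.createOrReplaceTempView("t")

# Hive-style pivot: group on id, take MAX(value) per pivoted column.
spark.sql("""
    SELECT id,
           MAX(CASE WHEN name = 'a' THEN value END) AS a,
           MAX(CASE WHEN name = 'b' THEN value END) AS b
    FROM t
    GROUP BY id
""").show()

# Same result with the DataFrame API's built-in pivot:
df.groupBy("id").pivot("name").max("value").show()
```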

Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-19 Thread Michael Gummelt
Hi, I don't have a lot of experience with the fine-grained scheduler. It's deprecated and fairly old now. CPUs should be relinquished as tasks complete, so I'm not sure why you're seeing what you're seeing. There have been a few discussions on the Spark list regarding deprecating the

Re: Adding Hive support to existing SparkSession (or starting PySpark with Hive support)

2016-12-19 Thread Venkata Naidu
We can create a link in the Spark conf directory pointing to the hive.conf file of the Hive installation, I believe. Thanks, Venkat. On Mon, Dec 19, 2016, 10:58 AM apu wrote: > This is for Spark 2.0: > > If I wanted Hive support on a new SparkSession, I would build it with: > >

Adding Hive support to existing SparkSession (or starting PySpark with Hive support)

2016-12-19 Thread apu
This is for Spark 2.0: If I wanted Hive support on a new SparkSession, I would build it with: spark = SparkSession \ .builder \ .enableHiveSupport() \ .getOrCreate() However, PySpark already creates a SparkSession for me, which appears to lack Hive support. How can I either: (a) Add

Re: Reference External Variables in Map Function (Inner class)

2016-12-19 Thread mbayebabacar
Hello Marcelo, What was the solution in the end? I am facing the same problem. Thank you -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Reference-External-Variables-in-Map-Function-Inner-class-tp11990p28237.html Sent from the Apache Spark User List mailing list

Re: [Spark SQL] Task failed while writing rows

2016-12-19 Thread Michael Stratton
It seems like an issue w/ Hadoop. What do you get when you run hdfs dfsadmin -report? Anecdotally (and w/o specifics, as it has been a while), I've generally used Parquet instead of ORC, as I've gotten a bunch of random problems reading and writing ORC w/ Spark... but given ORC performs a lot better

Re: Spark SQL Syntax

2016-12-19 Thread A Shaikh
I use pyspark on Spark 2. I used Oracle and Postgres syntax, just to get back an "unhappy response". I do get some of it resolved after some searching, but that consumes a lot of my time; having a platform to test my SQL syntax and its results would be very helpful. On 19 December 2016 at 14:00,

stratified sampling scales poorly

2016-12-19 Thread Martin Le
Hi all, I perform sampling on a DStream by taking samples from RDDs in the DStream. I have used two sampling mechanisms: simple random sampling and stratified sampling. Simple random sampling: inputStream.transform(x => x.sample(false, fraction)). Stratified sampling: inputStream.transform(x =>
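For context, a self-contained sketch of the two mechanisms; the socket source, key extraction, and fractions are assumptions, not the poster's code:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "sampling-sketch")
ssc = StreamingContext(sc, 1)  # 1-second micro-batches

# Assumed source: comma-separated records whose first field is the key.
inputStream = (ssc.socketTextStream("localhost", 9999)
                  .map(lambda line: (line.split(",")[0], line)))

# Simple random sampling, one cheap pass per micro-batch:
simple = inputStream.transform(lambda rdd: rdd.sample(False, 0.1))

# Stratified sampling with per-key fractions (keys assumed). Note the
# "exact" variant of stratified sampling makes additional passes and
# is considerably more expensive per batch.
fractions = {"keyA": 0.1, "keyB": 0.5}
stratified = inputStream.transform(
    lambda rdd: rdd.sampleByKey(False, fractions))
```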

Spark SQL Syntax

2016-12-19 Thread A Shaikh
Hi, I keep getting invalid Spark SQL syntax errors, especially for date/timestamp manipulation. What's the best way to test that SQL syntax for a Spark DataFrame is valid? Is there any online site to test or run a demo SQL? Thanks, Afzal
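One low-friction option, as a sketch: a local pyspark shell validates syntax immediately, and Spark's built-in date functions cover most of what the Oracle/Postgres idioms do (the function names below are standard Spark SQL):

```python
# Quick syntax check in a local pyspark shell. Note Spark SQL uses
# e.g. date_format rather than Oracle's to_char.
spark.sql("""
    SELECT current_date()                                 AS today,
           date_add(current_date(), 7)                    AS next_week,
           date_format(current_timestamp(), 'yyyy-MM-dd') AS formatted
""").show()
```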

Re: How to set NameSpace while storing from Spark to HBase using saveAsNewAPIHadoopDataSet

2016-12-19 Thread Rabin Banerjee
Thanks, it worked! On Mon, Dec 19, 2016 at 5:55 PM, Dhaval Modi wrote: > > Replace with ":" > > Regards, > Dhaval Modi > > On 19 December 2016 at 13:10, Rabin Banerjee > wrote: > >> HI All, >> >> I am trying to save data from Spark

Re: How to set NameSpace while storing from Spark to HBase using saveAsNewAPIHadoopDataSet

2016-12-19 Thread Dhaval Modi
Replace with ":" Regards, Dhaval Modi On 19 December 2016 at 13:10, Rabin Banerjee wrote: > HI All, > > I am trying to save data from Spark into HBase using saveHadoopDataSet > API . Please refer the below code . Code is working fine .But the table is > getting

How to set NameSpace while storing from Spark to HBase using saveAsNewAPIHadoopDataSet

2016-12-19 Thread Rabin Banerjee
Hi All, I am trying to save data from Spark into HBase using the saveAsNewAPIHadoopDataSet API. Please refer to the code below. The code is working fine, but the table is getting stored in the default namespace. How do I set the namespace in the code below? wordCounts.foreachRDD ( rdd => { val conf =
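Per the answer earlier in this thread, the fix is to prefix the table name with its namespace, separated by ":". The thread's own code is Scala and truncated, so here is a hedged PySpark sketch of where that string goes; the converter classes ship with the Spark examples, and everything except the "namespace:table" value is a placeholder:

```python
# Sketch only -- the essential part is the "my_namespace:wordcounts"
# value for the output table; all other values are placeholders.
conf = {
    "hbase.zookeeper.quorum": "zk1.example.com",
    "hbase.mapred.outputtable": "my_namespace:wordcounts",  # namespace:table
    "mapreduce.outputformat.class":
        "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
    "mapreduce.job.output.key.class":
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "mapreduce.job.output.value.class":
        "org.apache.hadoop.io.Writable",
}
keyConv = ("org.apache.spark.examples.pythonconverters."
           "StringToImmutableBytesWritableConverter")
valueConv = ("org.apache.spark.examples.pythonconverters."
             "StringListToPutConverter")

# rdd stands for each RDD handed to wordCounts.foreachRDD in the thread.
rdd.map(lambda kv: (kv[0], [kv[0], "cf", "count", str(kv[1])])) \
   .saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv,
                              valueConverter=valueConv)
```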

Re: What is the deployment model for Spark Streaming? A specific example.

2016-12-19 Thread Eike von Seggern
Hi, are you using Spark 2.0.*? Then it might be related to https://issues.apache.org/jira/browse/SPARK-18281 . Best Eike 2016-12-18 6:21 GMT+01:00 Russell Jurney : > Anyone? This is for a book, so I need to figure this out. > > On Fri, Dec 16, 2016 at 12:53 AM

Re: Reading xls and xlsx files

2016-12-19 Thread Jörn Franke
I am currently developing one: https://github.com/ZuInnoTe/hadoopoffice It contains working source code, but a release will likely come only at the beginning of the year (it will include a Spark data source, but the existing source code can be used without issues in a Spark application). > On 19 Dec 2016,

Reading xls and xlsx files

2016-12-19 Thread Selvam Raman
Hi, Is there a way to read xls and xlsx files using Spark? Is there any Hadoop InputFormat available to read xls and xlsx files which could be used in Spark? -- Selvam Raman "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"

Re: How to perform Join operation using JAVARDD

2016-12-19 Thread ayan guha
What's your desired output? On Sat., 17 Dec. 2016 at 9:50 pm, Sree Eedupuganti wrote: > I tried like this, > > *CrashData_1.csv:* > > *CRASH_KEY CRASH_NUMBER CRASH_DATE CRASH_MONTH* > *2016899114 2016899114 01/02/2016 12:00:00 > AM

Re: How to get recent value in spark dataframe

2016-12-19 Thread ayan guha
You have 2 parts to it: 1. Do a subquery where, for each primary key, you derive the latest value among flag=1 records. Ensure you get exactly 1 record per primary key value. Here you can use rank() over (partition by primary key order by year desc). 2. Join your original dataset with the above on primary
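A sketch of those two steps with the DataFrame API; the column names (pk, year, flag, value) are assumptions, and row_number() is used instead of rank() to guarantee exactly one row per key even when years tie:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# df is the original dataset from the question (assumed columns).
w = Window.partitionBy("pk").orderBy(F.col("year").desc())

# Step 1: latest flag=1 value per primary key.
latest = (df.filter(F.col("flag") == 1)
            .withColumn("rn", F.row_number().over(w))
            .filter(F.col("rn") == 1)
            .select("pk", F.col("value").alias("latest_value")))

# Step 2: join back to the original dataset on the primary key.
result = df.join(latest, on="pk", how="left")
```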

Re: Question about Spark and filesystems

2016-12-19 Thread Calvin Jia
Hi, If you are concerned with the performance of the alternative filesystems (i.e. needing a caching client), you can use Alluxio on top of any of NFS, Ceph