Re: Failed to connect to master ...

2017-03-07 Thread Shixiong(Ryan) Zhu
The Spark master may bind to a different address. Take a look at this page to find the correct URL: http://VM_IPAddress:8080/ On Tue, Mar 7, 2017 at 10:13 PM, Mina Aslani wrote: > Master and worker processes are running! > > On Wed, Mar 8, 2017 at 12:38 AM, ayan guha
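A minimal sketch of pointing the application at whatever master URL the standalone web UI reports (the host and port below are placeholders; the standalone master advertises its exact spark://... URL at the top of that page):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: copy the exact spark://host:port value shown at
// http://VM_IPAddress:8080/ rather than guessing the hostname or port.
val spark = SparkSession.builder()
  .appName("connect-test")
  .master("spark://VM_IPAddress:7077")
  .getOrCreate()

spark.range(10).count()  // trivial action just to confirm the connection
spark.stop()
```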

Re: Failed to connect to master ...

2017-03-07 Thread Mina Aslani
Master and worker processes are running! On Wed, Mar 8, 2017 at 12:38 AM, ayan guha wrote: > You need to start Master and worker processes before connecting to them. > > On Wed, Mar 8, 2017 at 3:33 PM, Mina Aslani wrote: > >> Hi, >> >> I am writing a

Re: Failed to connect to master ...

2017-03-07 Thread ayan guha
You need to start Master and worker processes before connecting to them. On Wed, Mar 8, 2017 at 3:33 PM, Mina Aslani wrote: > Hi, > > I am writing a spark Transformer in intelliJ in Java and trying to connect > to the spark in a VM using setMaster. I get "Failed to connect

Failed to connect to master ...

2017-03-07 Thread Mina Aslani
Hi, I am writing a Spark Transformer in IntelliJ in Java and trying to connect to Spark in a VM using setMaster. I get "Failed to connect to master ...": 17/03/07 16:20:55 WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master VM_IPAddress:7077

Re: Huge partitioning job takes longer to close after all tasks finished

2017-03-07 Thread cht liu
Did you enable the Spark fault tolerance mechanism (RDD checkpointing)? At the end of the job it will start a separate job to write the checkpoint data to the file system for durable, highly available persistence. 2017-03-08 2:45 GMT+08:00 Swapnil Shinde : > Hello all >I have

made spark job to throw exception still going under finished succeeded status in yarn

2017-03-07 Thread nancy henry
Hi Team, I wrote the code below to throw an exception. How can I make it throw an exception under some condition so that the job goes to failed status in YARN, while still closing the Spark context and releasing resources? object Demo { def main(args: Array[String]) = { var a = 0; var c = 0;
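A minimal sketch of one common way to get that behavior (the validation condition and names below are invented, not taken from the post): do the cleanup in a finally block and let the exception escape main, so that in yarn-cluster mode the application should be reported as FAILED while the SparkContext is still stopped cleanly. Exact behavior can vary by Spark version and deploy mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Demo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Demo"))
    try {
      val count = sc.parallelize(1 to 100).count()
      // Hypothetical failure condition, purely for illustration.
      if (count < 1000) {
        throw new RuntimeException(s"Validation failed: only $count rows")
      }
    } finally {
      // Resources are released whether or not the job failed.
      sc.stop()
    }
    // Because no catch block swallows the exception, it escapes main after
    // sc.stop() runs, and YARN should mark the application FAILED.
  }
}
```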

PySpark Serialization/Deserialization (Pickling) Overhead

2017-03-07 Thread Yeoul Na
Hi all, I am trying to analyze PySpark performance overhead. People just say PySpark is slower than Scala due to the Serialization/Deserialization overhead. I tried with the example in this post: https://0x0fff.com/spark-dataframes-are-faster-arent-they/. This and many articles say

Re: finding Spark Master

2017-03-07 Thread Yong Zhang
This website explains it very clearly, if you are using YARN. https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_running_spark_on_yarn.html Running Spark Applications on YARN -

RE: finding Spark Master

2017-03-07 Thread Adaryl Wakefield
Ah so I see setMaster(‘yarn-client’). Hmm. What I was ultimately trying to do was develop with Eclipse on my windows box and have the code point to my cluster so it executes there instead of my local windows machine. Perhaps I’m going about this wrong. Adaryl "Bob" Wakefield, MBA Principal
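A sketch of what that can look like when launched from an IDE (assumptions: the pre-2.0 "yarn-client" master string, which Spark 2.x still accepts with a deprecation warning, and HADOOP_CONF_DIR/YARN_CONF_DIR on the development machine pointing at the cluster's client configs so the driver can find the ResourceManager and HDFS):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only. On Spark 2.x the usual route is to omit the master here and
// pass --master yarn (plus --deploy-mode client) to spark-submit instead.
val spark = SparkSession.builder()
  .appName("from-my-ide")
  .master("yarn-client")
  .getOrCreate()
```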

Re: finding Spark Master

2017-03-07 Thread Koert Kuipers
assuming this is running on yarn, there is really no spark-master. every job creates its own "master" within a yarn application. On Tue, Mar 7, 2017 at 6:27 PM, Adaryl Wakefield < adaryl.wakefi...@hotmail.com> wrote: > I’m running a three node cluster along with Spark along with Hadoop as > part of

Spark job stopping abruptly

2017-03-07 Thread Divya Gehlot
Hi, I have a Spark standalone cluster on AWS EC2 and recently my Spark streaming jobs have been stopping abruptly. When I check the logs I find this: 17/03/07 06:09:39 INFO ProtocolStateActor: No response from remote. Handshake timed out or transport failure detector triggered. 17/03/07 06:09:39 ERROR

RE: finding Spark Master

2017-03-07 Thread Adaryl Wakefield
I’m sorry I don’t understand. Is that a question or the answer? Adaryl "Bob" Wakefield, MBA Principal Mass Street Analytics, LLC 913.938.6685 www.massstreet.net www.linkedin.com/in/bobwakefieldmba Twitter: @BobLovesData From:

Re: finding Spark Master

2017-03-07 Thread ayan guha
yarn-client or yarn-cluster On Wed, 8 Mar 2017 at 10:28 am, Adaryl Wakefield < adaryl.wakefi...@hotmail.com> wrote: > I’m running a three node cluster along with Spark along with Hadoop as > part of a HDP stack. How do I find my Spark Master? I’m just seeing the > clients. I’m trying to figure

finding Spark Master

2017-03-07 Thread Adaryl Wakefield
I'm running a three-node cluster with Spark and Hadoop as part of an HDP stack. How do I find my Spark Master? I'm just seeing the clients. I'm trying to figure out what goes in setMaster() aside from local[*]. Adaryl "Bob" Wakefield, MBA Principal Mass Street Analytics, LLC

Re: How to unit test spark streaming?

2017-03-07 Thread kant kodali
Agreed with the statement in quotes below; whether one wants to do unit tests or not, it is a good practice to write code that way. But I think the more painful and tedious task is to mock/emulate all the nodes such as spark workers/master/hdfs/input source stream and all that. I wish there is

Re: Structured Streaming - Kafka

2017-03-07 Thread Bowden, Chris
https://issues.apache.org/jira/browse/SPARK-19853, pr by eow From: Shixiong(Ryan) Zhu Sent: Tuesday, March 7, 2017 2:04:45 PM To: Bowden, Chris Cc: user; Gudenkauf, Jack Subject: Re: Structured Streaming - Kafka Good catch. Could you

Re: Structured Streaming - Kafka

2017-03-07 Thread Shixiong(Ryan) Zhu
Good catch. Could you create a ticket? You can also submit a PR to fix it if you have time :) On Tue, Mar 7, 2017 at 1:52 PM, Bowden, Chris wrote: > Potential bug when using startingOffsets = SpecificOffsets with Kafka > topics containing uppercase characters? > >

Structured Streaming - Kafka

2017-03-07 Thread Bowden, Chris
Potential bug when using startingOffsets = SpecificOffsets with Kafka topics containing uppercase characters? KafkaSourceProvider#L80/86: val startingOffsets = caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) match { case Some("latest") => LatestOffsets
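A sketch of how the reported bug would surface (broker, topic, and offset below are invented, and an existing SparkSession named spark is assumed): because the whole startingOffsets value is lower-cased before the JSON is parsed, per-topic offsets for a topic containing uppercase characters stop matching the real topic name.

```scala
// Illustration only: "MyTopic" and the offset are made up. After the
// provider lower-cases the option value, the JSON key becomes "mytopic",
// which no longer matches the subscribed topic "MyTopic".
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "MyTopic")
  .option("startingOffsets", """{"MyTopic":{"0":1234}}""")
  .load()
```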

Issues: Generate JSON with null values in Spark 2.0.x

2017-03-07 Thread Chetan Khatri
Hello Dev / Users, I am working on migrating PySpark code to Scala. With Python, iterating Spark with a dictionary and generating JSON with null is possible with json.dumps(), which will be converted to SparkSQL[Row], but in Scala how can we generate JSON with null values from a DataFrame? Thanks.
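One possible workaround, sketched below (not from the thread, and it stringifies every column for brevity): build the JSON per row with json4s, which ships with Spark, so that null columns are emitted as explicit JSON nulls instead of being dropped the way df.toJSON behaves in 2.0.x. The function name is invented.

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import org.json4s.{JNull, JObject, JString, JValue}
import org.json4s.jackson.JsonMethods.{compact, render}

// Sketch: serialize each row by hand so null columns appear as JSON null.
// Every non-null value is rendered as a string here for simplicity.
def toJsonWithNulls(df: DataFrame): Dataset[String] = {
  val spark = df.sparkSession
  import spark.implicits._
  val fieldNames = df.schema.fieldNames
  df.map { row =>
    val fields: List[(String, JValue)] = fieldNames.toList.map { name =>
      val i = row.fieldIndex(name)
      name -> (if (row.isNullAt(i)) JNull else JString(row.get(i).toString))
    }
    compact(render(JObject(fields)))
  }
}
```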

Does anybody use spark.rpc.io.mode=epoll?

2017-03-07 Thread Steven Ruppert
The epoll mode definitely exists in spark, but the official documentation does not mention it, nor any of the other settings that appear to be unofficially documented in: https://github.com/jaceklaskowski/mastering-apache-spark-book/blob/master/spark-rpc-netty.adoc I don't seem to have any
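If you do want to experiment with it, the knob would presumably be set like any other Spark conf entry (a sketch; the property name comes from the unofficial write-up linked above, the default transport is NIO, and epoll only applies on Linux where Netty's native transport is available):

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: spark.rpc.io.mode is an undocumented/internal setting, so
// treat its behavior and stability across versions with caution.
val spark = SparkSession.builder()
  .appName("epoll-test")
  .config("spark.rpc.io.mode", "epoll")
  .getOrCreate()
```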

Re: How to unit test spark streaming?

2017-03-07 Thread Michael Armbrust
> > Basically you abstract your transformations to take in a dataframe and > return one, then you assert on the returned df > +1 to this suggestion. This is why we wanted streaming and batch dataframes to share the same API.

Re: Spark JDBC reads

2017-03-07 Thread El-Hassan Wanas
I was kind of hoping that I would use Spark in this instance to generate that intermediate SQL as part of its workflow strategy. Sort of as a database independent way of doing my preprocessing. Is there any way that allows me to capture the generated SQL from catalyst? If so I would just use

Huge partitioning job takes longer to close after all tasks finished

2017-03-07 Thread Swapnil Shinde
Hello all, I have a Spark job that reads parquet data and partitions it based on one of the columns. I made sure the partitions are equally distributed and not skewed. My code looks like this - datasetA.write.partitionBy("column1").parquet(outputPath) Execution plan - [inline image] All

RE: using spark to load a data warehouse in real time

2017-03-07 Thread Adaryl Wakefield
Hi Henry, I didn’t catch your email until now. When you wrote to the database, how did you enforce the schema? Did the data frames just spit everything out with the necessary keys? Adaryl "Bob" Wakefield, MBA Principal Mass Street Analytics, LLC 913.938.6685

Re: (python) Spark .textFile(s3://…) access denied 403 with valid credentials

2017-03-07 Thread Amjad ALSHABANI
Hi Jonhy, what master are you using with spark-submit? I've had this problem before because Spark (unlike the CLI and boto3) was running in YARN distributed mode (--master yarn), so the keys were not copied to all the executors' nodes and I had to submit my spark job as
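One way to make the keys visible to the executors (a sketch, written in Scala for consistency with the other examples here; it assumes the s3a connector from hadoop-aws and uses placeholder credentials) is to put them into the Hadoop configuration, or equivalently pass them as spark.hadoop.fs.s3a.* conf entries to spark-submit, instead of relying on environment variables exported only on the submitting machine:

```scala
// Sketch only: placeholder credentials; fs.s3a.* assumes the s3a connector.
// The Hadoop configuration is shipped with the job, so executors see the
// keys too, which plain shell environment variables on the driver do not.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
hadoopConf.set("fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")

val lines = spark.sparkContext.textFile("s3a://mybucket/some/prefix/")
lines.take(5).foreach(println)
```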

(python) Spark .textFile(s3://…) access denied 403 with valid credentials

2017-03-07 Thread Jonhy Stack
In order to access my S3 bucket I have exported my creds: export AWS_SECRET_ACCESS_KEY= export AWS_ACCESSS_ACCESS_KEY= I can verify that everything works by doing aws s3 ls mybucket. I can also verify with boto3 that it works in Python: resource = boto3.resource("s3",

Re: How to unit test spark streaming?

2017-03-07 Thread Jörn Franke
This depends on your target setup! For my open source libraries, for example, I run Spark integration tests (in a dedicated folder alongside the unit tests) against a local Spark master, but also use a minidfs cluster (to simulate HDFS on a node) and sometimes also a miniyarn cluster (see

Re: Spark JDBC reads

2017-03-07 Thread Subhash Sriram
Could you create a view of the table on your JDBC data source and just query that from Spark? Thanks, Subhash Sent from my iPhone > On Mar 7, 2017, at 6:37 AM, El-Hassan Wanas wrote: > > As an example, this is basically what I'm doing: > > val myDF =
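A related sketch (URL, table, column names, and credentials below are invented): instead of a database-side view, the JDBC source can be handed a subquery as its dbtable, so the CASE/encoding work runs inside the database and only the reduced columns travel over JDBC.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-pushdown-sketch").getOrCreate()

// Sketch only: the inner SELECT executes on the database itself.
val pushedDown =
  """(SELECT id,
    |        CASE WHEN col1 = 'foobar'    THEN 0
    |             WHEN col1 = 'foobarbaz' THEN 1
    |        END AS col1_encoded
    |   FROM big_table) AS t""".stripMargin

val reduced = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", pushedDown)
  .option("user", "me")
  .option("password", "secret")
  .load()
```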

Re: How to unit test spark streaming?

2017-03-07 Thread Sam Elamin
Hey Kant, you can use Holden's spark-testing-base. Have a look at some of the specs I wrote here to give you an idea: https://github.com/samelamin/spark-bigquery/blob/master/src/test/scala/com/samelamin/spark/bigquery/BigQuerySchemaSpecs.scala Basically you abstract your transformations to take in a
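A minimal sketch of that pattern (names, columns, and values below are invented, and it uses a plain local SparkSession plus ScalaTest rather than any particular test library; spark-testing-base layers shared-session base traits and DataFrame equality asserts on top of the same idea):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.upper
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Code under test: a pure DataFrame-in / DataFrame-out transformation,
// so it can be exercised without Kafka, HDFS, or a real cluster.
object Transformations {
  def upperCaseNames(df: DataFrame): DataFrame =
    df.withColumn("name", upper(df("name")))
}

class TransformationsSpec extends FunSuite with BeforeAndAfterAll {
  @transient lazy val spark: SparkSession =
    SparkSession.builder().master("local[2]").appName("unit-test").getOrCreate()

  override def afterAll(): Unit = spark.stop()

  test("names are upper-cased") {
    import spark.implicits._
    val input  = Seq("alice", "bob").toDF("name")
    val result = Transformations.upperCaseNames(input)
    assert(result.as[String].collect().toSeq == Seq("ALICE", "BOB"))
  }
}
```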

How to unit test spark streaming?

2017-03-07 Thread kant kodali
Hi All, how do we unit test Spark Streaming, or Spark in general? How do I test the results of my transformations? Also, more importantly, don't we need to spawn master and worker JVMs, either on one or multiple nodes? Thanks! kant

Re: Spark JDBC reads

2017-03-07 Thread El-Hassan Wanas
As an example, this is basically what I'm doing: val myDF = originalDataFrame.select(col(columnName).when(col(columnName) === "foobar", 0).when(col(columnName) === "foobarbaz", 1)) Except there are many more columns and many more conditionals. The generated Spark workflow starts with an

Re: Spark JDBC reads

2017-03-07 Thread Jörn Franke
Can you provide some source code? I am not sure I understood the problem. If you want to do preprocessing at the JDBC datasource then you can write your own data source. Additionally you may want to modify the sql statement to extract the data in the right format and push some preprocessing

Spark JDBC reads

2017-03-07 Thread El-Hassan Wanas
Hello, There is, as usual, a big table lying on some JDBC data source. I am doing some data processing on that data from Spark, however, in order to speed up my analysis, I use reduced encodings and minimize the general size of the data before processing. Spark has been doing a great job at

Re: Check if dataframe is empty

2017-03-07 Thread Deepak Sharma
On Tue, Mar 7, 2017 at 2:37 PM, Nick Pentreath wrote: > df.take(1).isEmpty should work My bad. It will return an empty array: emptydf.take(1) res0: Array[org.apache.spark.sql.Row] = Array() and applying isEmpty would return a boolean emptydf.take(1).isEmpty res2:

Re: Check if dataframe is empty

2017-03-07 Thread Nick Pentreath
I believe take on an empty dataset will return an empty Array rather than throw an exception. df.take(1).isEmpty should work On Tue, 7 Mar 2017 at 07:42, Deepak Sharma wrote: > If the df is empty , the .take would return > java.util.NoSuchElementException. > This can be

Re: FPGrowth Model is taking too long to generate frequent item sets

2017-03-07 Thread Eli Super
Hi, it's a broad area of knowledge; you will need to read about it online for several hours. What is your programming language? Try searching online: "machine learning binning %my_programing_langauge%" and "machine learning feature engineering %my_programing_langauge%" On Tue, Mar 7, 2017 at 3:39 AM, Raju