Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
Thanks to TD, the savior! Shall look into it. On Thu, Mar 15, 2018 at 1:04 AM, Tathagata Das wrote: > Relevant: https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html > > This is true stream-stream join which will

Re: How to start practicing Python Spark Streaming in Linux?

2018-03-14 Thread Felix Cheung
It’s best to start with Structured Streaming https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#tab_python_0 https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#tab_python_0 _ From: Aakash Basu

Spark Conf

2018-03-14 Thread Vinyas Shetty
Hi, I am trying to understand the Spark internals, so I was looking at the Spark code flow. Now in a scenario where I do a spark-submit in YARN cluster mode with --executor-memory 8g via the command line, how does Spark know about this executor memory value, since in SparkContext I see:
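For context on the question above: `--executor-memory` is shorthand for the `spark.executor.memory` configuration key; spark-submit writes the flag into the SparkConf, which SparkContext later reads. A sketch of the two equivalent invocations (the app file name is a placeholder):

```
# These two submissions are equivalent: spark-submit translates the
# --executor-memory flag into the spark.executor.memory conf key.
spark-submit --master yarn --deploy-mode cluster \
  --executor-memory 8g \
  my_app.py

spark-submit --master yarn --deploy-mode cluster \
  --conf spark.executor.memory=8g \
  my_app.py
```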

Re: retention policy for spark structured streaming dataset

2018-03-14 Thread Lian Jiang
It is already partitioned by timestamp. But is the right retention process to stop the streaming job, trim the parquet files and restart the streaming job? Thanks. On Wed, Mar 14, 2018 at 12:51 PM, Sunil Parmar wrote: > Can you use partitioning ( by day ) ? That will

Re: retention policy for spark structured streaming dataset

2018-03-14 Thread Sunil Parmar
Can you use partitioning (by day)? That will make it easier to drop data older than x days outside the streaming job. Sunil Parmar On Wed, Mar 14, 2018 at 11:36 AM, Lian Jiang wrote: > I have a spark structured streaming job which dumps data into a parquet > file. To
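To make the suggestion above concrete: if the sink is partitioned by day (directories like `date=2018-03-14`), retention becomes deleting old partition directories from a separate scheduled job, without ever stopping the streaming query. A minimal pure-Python sketch for a local filesystem (the directory naming and the cutoff are assumptions; on HDFS or S3 the same idea applies via `hdfs dfs -rm -r` or bucket lifecycle rules):

```python
import os
import shutil
from datetime import date, datetime, timedelta

def prune_old_partitions(root, max_age_days, today=None):
    """Delete date=YYYY-MM-DD partition directories older than max_age_days."""
    today = today or date.today()
    cutoff = today - timedelta(days=max_age_days)
    removed = []
    for name in os.listdir(root):
        if not name.startswith("date="):
            continue  # skip non-partition entries such as _SUCCESS markers
        part_date = datetime.strptime(name[len("date="):], "%Y-%m-%d").date()
        if part_date < cutoff:
            shutil.rmtree(os.path.join(root, name))
            removed.append(name)
    return sorted(removed)
```

Run it from cron (or any scheduler) alongside the always-on streaming job.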

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Tathagata Das
Relevant: https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html This is a true stream-stream join which will automatically buffer delayed data and appropriately join it with SQL join semantics. Please check it out :) TD On Wed, Mar 14, 2018 at 12:07
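The blog post linked above describes the Spark 2.3 behavior: each side of the join buffers past input so late-arriving rows can still match, and a watermark bounds how long that state is kept. A toy pure-Python model of the buffering semantics (this is not the Spark API; the single join key and the eviction policy are simplifying assumptions):

```python
from collections import defaultdict

class ToyStreamStreamJoin:
    """Toy model of a watermarked stream-stream inner join on one key."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_event_time = 0
        self.buffers = {"left": defaultdict(list), "right": defaultdict(list)}

    def feed(self, side, key, event_time, value):
        other = "right" if side == "left" else "left"
        # Buffer this row so late rows from the other side can still match it.
        self.buffers[side][key].append((event_time, value))
        # Emit joined pairs against everything the other side has buffered.
        matches = [(value, v) if side == "left" else (v, value)
                   for (_, v) in self.buffers[other][key]]
        # Advance the watermark and evict state that can no longer match.
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.max_delay
        for buf in self.buffers.values():
            for k in list(buf):
                buf[k] = [(t, v) for (t, v) in buf[k] if t >= watermark]
        return matches
```

The point of the watermark is the eviction step: without it, both buffers would grow forever, which is exactly the unbounded-state problem the 2.3 release addresses.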

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
Do I need to set SPARK_DIST_CLASSPATH or SPARK_CLASSPATH? The latest version of Spark (2.3) only has SPARK_CLASSPATH. On Wed, Mar 14, 2018 at 11:37 AM, kant kodali wrote: > Hi, > > I am not using emr. And yes I restarted several times. > > On Wed, Mar 14, 2018 at 6:35 AM,
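For reference on the question above: SPARK_DIST_CLASSPATH is the variable that Spark's "Hadoop-free" builds read to locate the Hadoop jars. A common way to set it, assuming the `hadoop` command is on the PATH, is:

```
# In conf/spark-env.sh of a "without-hadoop" Spark distribution
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
```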

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Dylan Guedes
I misread it, and thought that your question was whether pyspark supports Kafka lol. Sorry! On Wed, Mar 14, 2018 at 3:58 PM, Aakash Basu wrote: > Hey Dylan, > > Great! > > Can you revert back to my initial and also the latest mail? > > Thanks, > Aakash. > > On 15-Mar-2018

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
Hey Dylan, Great! Can you revert back to my initial and also the latest mail? Thanks, Aakash. On 15-Mar-2018 12:27 AM, "Dylan Guedes" wrote: > Hi, > > I've been using the Kafka with pyspark since 2.1. > > On Wed, Mar 14, 2018 at 3:49 PM, Aakash Basu

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Dylan Guedes
Hi, I've been using Kafka with pyspark since 2.1. On Wed, Mar 14, 2018 at 3:49 PM, Aakash Basu wrote: > Hi, > > I'm yet to. > > Just want to know, when does Spark 2.3 with 0.10 Kafka Spark Package > allows Python? I read somewhere, as of now Scala and Java are

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
Hi, I'm yet to. Just want to know, when does Spark 2.3 with the 0.10 Kafka Spark package allow Python? I read somewhere that, as of now, Scala and Java are the languages to be used. Please correct me if I am wrong. Thanks, Aakash. On 14-Mar-2018 8:24 PM, "Georg Heiler" wrote:

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
Hi, I am not using emr. And yes I restarted several times. On Wed, Mar 14, 2018 at 6:35 AM, Anthony, Olufemi < olufemi.anth...@capitalone.com> wrote: > After you updated your yarn-site.xml file, did you restart the YARN > resource manager ? > > > >

retention policy for spark structured streaming dataset

2018-03-14 Thread Lian Jiang
I have a spark structured streaming job which dumps data into a parquet file. To avoid the parquet file growing infinitely, I want to discard 3-month-old data. Does spark streaming support this? Or do I need to stop the streaming job, trim the parquet file and restart the streaming job? Thanks for any

Re: Spark Job Server application compilation issue

2018-03-14 Thread sujeet jog
Thanks for pointing that out. On Wed, Mar 14, 2018 at 11:19 PM, Vadim Semenov wrote: > This question should be directed to the `spark-jobserver` group: > https://github.com/spark-jobserver/spark-jobserver#contact > > They also have a gitter chat. > > Also include the errors you

Re: Spark Job Server application compilation issue

2018-03-14 Thread Vadim Semenov
This question should be directed to the `spark-jobserver` group: https://github.com/spark-jobserver/spark-jobserver#contact They also have a gitter chat. Also include the errors you get when you ask them the question. On Wed, Mar 14, 2018 at 1:37 PM, sujeet jog

Spark Job Server application compilation issue

2018-03-14 Thread sujeet jog
Input is a json request, which would be decoded in myJob() & processed further. Not sure what is wrong with the code below; it emits errors about unimplemented methods (runJob/validate). Any pointers on this would be helpful. jobserver-0.8.0 object MyJobServer extends SparkSessionJob { type JobData

Bisecting Kmeans Linkage Matrix Output (Cluster Indices)

2018-03-14 Thread GabeChurch
I have been working on a project to return a Linkage Matrix output from the Spark Bisecting Kmeans Algorithm output so that it is possible to plot the selection steps in a dendrogram. I am having trouble returning valid indices when I use more than 3-4 clusters in the algorithm and am hoping

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Georg Heiler
Did you try Spark 2.3 with structured streaming? There, watermarking and plain SQL might be really interesting for you. Aakash Basu wrote on Wed, 14 Mar 2018 at 14:57: > Hi, > > > > *Info (Using):Spark Streaming Kafka 0.8 package* > > *Spark 2.2.1* > *Kafka 1.0.1* >

Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
Hi, *Info (Using): Spark Streaming Kafka 0.8 package* *Spark 2.2.1* *Kafka 1.0.1* As of now, I am feeding paragraphs into the Kafka console producer, and my Spark job, which is acting as a receiver, is printing the flattened words, which is a complete RDD operation. *My motive is to read two tables
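The "flattened words" step described above is a flatMap from lines to words; a plain-Python equivalent of the transformation (the streaming job itself would express this with `DStream.flatMap` over the Kafka messages):

```python
def flatten_words(lines):
    """Split each incoming line into words, as flatMap(lambda l: l.split()) would."""
    return [word for line in lines for word in line.split()]
```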

Re: How to run spark shell using YARN

2018-03-14 Thread Anthony, Olufemi
After you updated your yarn-site.xml file, did you restart the YARN resource manager ? https://aws.amazon.com/premiumsupport/knowledge-center/restart-service-emr/ Femi From: kant kodali Date: Wednesday, March 14, 2018 at 6:16 AM To: Femi Anthony Cc:

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
16GB RAM. AWS m4.xlarge. It's a three-node cluster and I only have YARN and HDFS running. Resources are barely used; however, I believe there is something in my config that is preventing YARN from seeing that I have a good amount of resources (that's my guess; I never worked with YARN before). My

Re: Insufficient memory for Java Runtime

2018-03-14 Thread Femi Anthony
Try specifying executor memory. On Tue, Mar 13, 2018 at 5:15 PM, Shiyuan wrote: > Hi Spark-Users, > I encountered the problem of "insufficient memory". The error is logged > in the file with a name " hs_err_pid86252.log"(attached in the end of this > email). > > I launched

Re: How to run spark shell using YARN

2018-03-14 Thread Femi Anthony
What's the hardware configuration of the box you're running on, i.e. how much memory does it have? Femi On Wed, Mar 14, 2018 at 5:32 AM, kant kodali wrote: > Tried this > > ./spark-shell --master yarn --deploy-mode client --executor-memory 4g > > > Same issue. Keeps going

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
Tried this ./spark-shell --master yarn --deploy-mode client --executor-memory 4g Same issue. Keeps going forever.. 18/03/14 09:31:25 INFO Client: client token: N/A diagnostics: N/A ApplicationMaster host: N/A ApplicationMaster RPC port: -1 queue: default start time: 1521019884656

Re: How to run spark shell using YARN

2018-03-14 Thread Femi Anthony
Make sure you have enough memory allocated for Spark workers; try specifying executor memory by passing --executor-memory to spark-submit. On Wed, Mar 14, 2018 at 3:25 AM, kant kodali wrote: > I am using spark 2.3.0 and hadoop 2.7.3. > > Also I have done the following

Re: [EXT] Debugging a local spark executor in pycharm

2018-03-14 Thread Vitaliy Pisarev
Actually, I stumbled on this SO page. While it is not straightforward, it is a fairly simple solution. In short: - I made sure there is only one executing task at a time by calling repartition(1) - this

Re: Spark Application stuck

2018-03-14 Thread Femi Anthony
Have you taken a look at the EMR UI? What does your Spark setup look like? I assume you're on EMR on AWS. The various UI urls and ports are listed here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html On Wed, Mar 14, 2018 at 4:23 AM, Mukund Big Data

Spark Application stuck

2018-03-14 Thread Mukund Big Data
Hi I am executing the following recommendation engine using Spark ML https://aws.amazon.com/blogs/big-data/building-a-recommendation-engine-with-spark-ml-on-amazon-emr-using-zeppelin/ When I am trying to save the model, the application hangs and doesn't respond. Any pointers to find where the

How to start practicing Python Spark Streaming in Linux?

2018-03-14 Thread Aakash Basu
Hi all, Any guide on how to kick-start learning PySpark Streaming on an Ubuntu standalone system? Step-wise, practical hands-on would be great. Also, connecting Kafka with Spark and getting real-time data and processing it in micro-batches... Any help? Thanks, Aakash.

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
I am using spark 2.3.0 and hadoop 2.7.3. Also I have done the following and restarted all. But I still see ACCEPTED: waiting for AM container to be allocated, launched and register with RM. And I am unable to spawn spark-shell. Editing $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml and change
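An application stuck in ACCEPTED ("waiting for AM container to be allocated") is often a scheduler resource limit. One commonly adjusted knob in capacity-scheduler.xml is the fraction of cluster resources that ApplicationMasters may use; a hedged sketch of such a change (the 0.5 value is an illustration, not a recommendation, and restarting the ResourceManager afterwards is required):

```xml
<!-- $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml -->
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <!-- default is 0.1; raising it lets more or larger AM containers start -->
  <value>0.5</value>
</property>
```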

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
any idea? On Wed, Mar 14, 2018 at 12:12 AM, kant kodali wrote: > I set core-site.xml, hdfs-site.xml, yarn-site.xml as per this website > and these are the > only three files I changed. Do I need to set or change

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
I set core-site.xml, hdfs-site.xml and yarn-site.xml as per this website, and these are the only three files I changed. Do I need to set or change anything in mapred-site.xml (as of now I have not touched mapred-site.xml)? When I do yarn -node