Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
Thanks to TD, the savior!

Shall look into it.

On Thu, Mar 15, 2018 at 1:04 AM, Tathagata Das 
wrote:

> Relevant: https://databricks.com/blog/2018/03/13/
> introducing-stream-stream-joins-in-apache-spark-2-3.html
>
> This is true stream-stream join which will automatically buffer delayed
> data and appropriately join stuff with SQL join semantics. Please check it
> out :)
>
> TD
>
>
>
> On Wed, Mar 14, 2018 at 12:07 PM, Dylan Guedes 
> wrote:
>
>> I misread it, and thought that you question was if pyspark supports kafka
>> lol. Sorry!
>>
>> On Wed, Mar 14, 2018 at 3:58 PM, Aakash Basu 
>> wrote:
>>
>>> Hey Dylan,
>>>
>>> Great!
>>>
>>> Can you revert back to my initial and also the latest mail?
>>>
>>> Thanks,
>>> Aakash.
>>>
>>> On 15-Mar-2018 12:27 AM, "Dylan Guedes"  wrote:
>>>
 Hi,

 I've been using the Kafka with pyspark since 2.1.

 On Wed, Mar 14, 2018 at 3:49 PM, Aakash Basu <
 aakash.spark@gmail.com> wrote:

> Hi,
>
> I'm yet to.
>
> Just want to know, when does Spark 2.3 with 0.10 Kafka Spark Package
> allows Python? I read somewhere, as of now Scala and Java are the 
> languages
> to be used.
>
> Please correct me if am wrong.
>
> Thanks,
> Aakash.
>
> On 14-Mar-2018 8:24 PM, "Georg Heiler" 
> wrote:
>
>> Did you try spark 2.3 with structured streaming? There watermarking
>> and plain sql might be really interesting for you.
>> Aakash Basu  schrieb am Mi. 14. März
>> 2018 um 14:57:
>>
>>> Hi,
>>>
>>>
>>>
>>> *Info (Using):Spark Streaming Kafka 0.8 package*
>>>
>>> *Spark 2.2.1*
>>> *Kafka 1.0.1*
>>>
>>> As of now, I am feeding paragraphs in Kafka console producer and my
>>> Spark, which is acting as a receiver is printing the flattened words, 
>>> which
>>> is a complete RDD operation.
>>>
>>> *My motive is to read two tables continuously (being updated) as two
>>> distinct Kafka topics being read as two Spark Dataframes and join them
>>> based on a key and produce the output. *(I am from Spark-SQL
>>> background, pardon my Spark-SQL-ish writing)
>>>
>>> *It may happen, the first topic is receiving new data 15 mins prior
>>> to the second topic, in that scenario, how to proceed? I should not lose
>>> any data.*
>>>
>>> As of now, I want to simply pass paragraphs, read them as RDD,
>>> convert to DF and then join to get the common keys as the output. (Just 
>>> for
>>> R).
>>>
>>> Started using Spark Streaming and Kafka today itself.
>>>
>>> Please help!
>>>
>>> Thanks,
>>> Aakash.
>>>
>>

>>
>


Re: How to start practicing Python Spark Streaming in Linux?

2018-03-14 Thread Felix Cheung
It’s best to start with Structured Streaming

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#tab_python_0

https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#tab_python_0
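To make the pointers above concrete, here is a minimal PySpark sketch of a Structured Streaming word count fed from Kafka. The broker address and topic name are placeholders, and the job assumes the spark-sql-kafka-0-10 package is on the classpath (e.g. via --packages):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("KafkaWordCount").getOrCreate()

# Read one Kafka topic as an unbounded streaming DataFrame.
lines = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "paragraphs")
         .load()
         .selectExpr("CAST(value AS STRING) AS line"))

# Flatten each record into words and keep a running count.
words = lines.select(explode(split(lines.line, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print each micro-batch's updated counts to the console.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()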

_
From: Aakash Basu 
Sent: Wednesday, March 14, 2018 1:09 AM
Subject: How to start practicing Python Spark Streaming in Linux?
To: user 


Hi all,

Any guide on how to kick-start learning PySpark Streaming on a standalone Ubuntu
system? Step-wise, practical, hands-on would be great.

Also, connecting Kafka with Spark, getting real-time data, and processing it
in micro-batches...

Any help?

Thanks,
Aakash.




Spark Conf

2018-03-14 Thread Vinyas Shetty
Hi,

I am trying to understand the Spark internals, so I was looking at the Spark
code flow. In a scenario where I do a spark-submit in yarn cluster mode
with --executor-memory 8g on the command line, how does Spark learn about
this executor memory value? In SparkContext I see:

_executorMemory = _conf.getOption("spark.executor.memory")
  .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
  .orElse(Option(System.getenv("SPARK_MEM")))


Now, SparkConf loads its defaults from Java system properties, but I did not
find where the command-line value is added to the Java system properties
(sys.props) in yarn cluster mode, i.e. I did not see a call to
Utils.loadDefaultSparkProperties. How is this command-line value reaching the
SparkConf that is part of SparkContext?

Regards,
Vinyas
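As a point of reference for the question, the flag is ultimately visible as an ordinary conf entry: spark-submit translates --executor-memory into spark.executor.memory in the application's SparkConf before the driver starts, so it can be read back at runtime. A minimal PySpark sketch (the "8g" in the comment assumes the submit command above):

from pyspark.sql import SparkSession

# Read back the conf that spark-submit assembled for this application.
spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext.getConf()

# Prints e.g. "8g" when the job was submitted with --executor-memory 8g.
print(conf.get("spark.executor.memory", "not set"))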


Re: retention policy for spark structured streaming dataset

2018-03-14 Thread Lian Jiang
It is already partitioned by timestamp. But is the right retention process to
stop the streaming job, trim the parquet files, and restart the streaming
job? Thanks.

On Wed, Mar 14, 2018 at 12:51 PM, Sunil Parmar 
wrote:

> Can you use partitioning ( by day ) ? That will  make it easier to drop
> data older than x days outside streaming job.
>
> Sunil Parmar
>
> On Wed, Mar 14, 2018 at 11:36 AM, Lian Jiang 
> wrote:
>
>> I have a spark structured streaming job which dump data into a parquet
>> file. To avoid the parquet file grows infinitely, I want to discard 3 month
>> old data. Does spark streaming supports this? Or I need to stop the
>> streaming job, trim the parquet file and restart the streaming job? Thanks
>> for any hints.
>>
>
>


Re: retention policy for spark structured streaming dataset

2018-03-14 Thread Sunil Parmar
Can you use partitioning (by day)? That will make it easier to drop data
older than x days outside the streaming job.

Sunil Parmar
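A minimal sketch of this partition-by-day idea for a Structured Streaming parquet sink (the broker, topic, and paths are placeholders); retention then becomes a matter of deleting old day= directories outside the streaming query:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col

spark = SparkSession.builder.getOrCreate()

# Placeholder source: a Kafka topic whose records carry an event timestamp.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS payload", "timestamp AS ts")
          .withColumn("day", to_date(col("ts"))))

# Write parquet partitioned by day, so each day lands in its own directory
# (e.g. /data/events/day=2018-03-14).
query = (events.writeStream
         .format("parquet")
         .option("path", "/data/events")
         .option("checkpointLocation", "/data/events_chk")
         .partitionBy("day")
         .outputMode("append")
         .start())

# A separate scheduled job (outside the streaming query) can then drop
# directories older than the retention window, e.g.:
#   hdfs dfs -rm -r /data/events/day=2017-12-*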

On Wed, Mar 14, 2018 at 11:36 AM, Lian Jiang  wrote:

> I have a spark structured streaming job which dump data into a parquet
> file. To avoid the parquet file grows infinitely, I want to discard 3 month
> old data. Does spark streaming supports this? Or I need to stop the
> streaming job, trim the parquet file and restart the streaming job? Thanks
> for any hints.
>


Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Tathagata Das
Relevant:
https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html


This is a true stream-stream join which will automatically buffer delayed
data and join it appropriately with SQL join semantics. Please check it
out :)

TD
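For readers of the blog post above, a minimal PySpark sketch of such a stream-stream join between two Kafka topics on Spark 2.3+. The broker address, topic names, and the 30-minute watermark/range condition are placeholders, chosen here to cover the 15-minute lag described later in this thread:

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()

def read_topic(topic):
    # Broker address and topic names are placeholders.
    return (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", topic)
            .load()
            .selectExpr("CAST(key AS STRING) AS k",
                        "CAST(value AS STRING) AS v",
                        "timestamp"))

a = (read_topic("table_a")
     .selectExpr("k AS a_key", "v AS a_row", "timestamp AS a_time")
     .withWatermark("a_time", "30 minutes"))

b = (read_topic("table_b")
     .selectExpr("k AS b_key", "v AS b_row", "timestamp AS b_time")
     .withWatermark("b_time", "30 minutes"))

# Rows from either side are buffered as state (bounded by the watermarks and
# the time-range condition) until a match arrives on the other side, so one
# topic running ~15 minutes ahead of the other does not lose data.
joined = a.join(
    b,
    expr("a_key = b_key AND "
         "b_time BETWEEN a_time - interval 30 minutes "
         "AND a_time + interval 30 minutes"))

query = (joined.writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination()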



On Wed, Mar 14, 2018 at 12:07 PM, Dylan Guedes  wrote:

> I misread it, and thought that you question was if pyspark supports kafka
> lol. Sorry!
>
> On Wed, Mar 14, 2018 at 3:58 PM, Aakash Basu 
> wrote:
>
>> Hey Dylan,
>>
>> Great!
>>
>> Can you revert back to my initial and also the latest mail?
>>
>> Thanks,
>> Aakash.
>>
>> On 15-Mar-2018 12:27 AM, "Dylan Guedes"  wrote:
>>
>>> Hi,
>>>
>>> I've been using the Kafka with pyspark since 2.1.
>>>
>>> On Wed, Mar 14, 2018 at 3:49 PM, Aakash Basu >> > wrote:
>>>
 Hi,

 I'm yet to.

 Just want to know, when does Spark 2.3 with 0.10 Kafka Spark Package
 allows Python? I read somewhere, as of now Scala and Java are the languages
 to be used.

 Please correct me if am wrong.

 Thanks,
 Aakash.

 On 14-Mar-2018 8:24 PM, "Georg Heiler" 
 wrote:

> Did you try spark 2.3 with structured streaming? There watermarking
> and plain sql might be really interesting for you.
> Aakash Basu  schrieb am Mi. 14. März 2018
> um 14:57:
>
>> Hi,
>>
>>
>>
>> *Info (Using):Spark Streaming Kafka 0.8 package*
>>
>> *Spark 2.2.1*
>> *Kafka 1.0.1*
>>
>> As of now, I am feeding paragraphs in Kafka console producer and my
>> Spark, which is acting as a receiver is printing the flattened words, 
>> which
>> is a complete RDD operation.
>>
>> *My motive is to read two tables continuously (being updated) as two
>> distinct Kafka topics being read as two Spark Dataframes and join them
>> based on a key and produce the output. *(I am from Spark-SQL
>> background, pardon my Spark-SQL-ish writing)
>>
>> *It may happen, the first topic is receiving new data 15 mins prior
>> to the second topic, in that scenario, how to proceed? I should not lose
>> any data.*
>>
>> As of now, I want to simply pass paragraphs, read them as RDD,
>> convert to DF and then join to get the common keys as the output. (Just 
>> for
>> R).
>>
>> Started using Spark Streaming and Kafka today itself.
>>
>> Please help!
>>
>> Thanks,
>> Aakash.
>>
>
>>>
>


Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
Do I need to set SPARK_DIST_CLASSPATH or SPARK_CLASSPATH ? The latest
version of spark (2.3) only has SPARK_CLASSPATH.

On Wed, Mar 14, 2018 at 11:37 AM, kant kodali  wrote:

> Hi,
>
> I am not using emr. And yes I restarted several times.
>
> On Wed, Mar 14, 2018 at 6:35 AM, Anthony, Olufemi <
> olufemi.anth...@capitalone.com> wrote:
>
>> After you updated your yarn-site.xml  file, did you restart the YARN
>> resource manager ?
>>
>>
>>
>> https://aws.amazon.com/premiumsupport/knowledge-center/
>> restart-service-emr/
>>
>>
>>
>> Femi
>>
>>
>>
>> *From: *kant kodali 
>> *Date: *Wednesday, March 14, 2018 at 6:16 AM
>> *To: *Femi Anthony 
>> *Cc: *vermanurag , "user @spark" <
>> user@spark.apache.org>
>> *Subject: *Re: How to run spark shell using YARN
>>
>>
>>
>> 16GB RAM.  AWS m4.xlarge. It's a three node cluster and I only have YARN
>> and  HDFS running. Resources are barely used however I believe there is
>> something in my config that is preventing YARN to see that I have good
>> amount of resources I think (thats my guess I never worked with YARN
>> before). My mapred-site.xml is empty. Do I even need this? if so, what
>> should I set it to?
>>
>>
>>
>> On Wed, Mar 14, 2018 at 2:46 AM, Femi Anthony  wrote:
>>
>> What's the hardware configuration of the box you're running on i.e. how
>> much memory does it have ?
>>
>>
>>
>> Femi
>>
>>
>>
>> On Wed, Mar 14, 2018 at 5:32 AM, kant kodali  wrote:
>>
>> Tried this
>>
>>
>>
>>  ./spark-shell --master yarn --deploy-mode client --executor-memory 4g
>>
>>
>>
>> Same issue. Keeps going forever..
>>
>>
>>
>> 18/03/14 09:31:25 INFO Client:
>>
>> client token: N/A
>>
>> diagnostics: N/A
>>
>> ApplicationMaster host: N/A
>>
>> ApplicationMaster RPC port: -1
>>
>> queue: default
>>
>> start time: 1521019884656
>>
>> final status: UNDEFINED
>>
>> tracking URL: http://ip-172-31-0-54:8088/proxy/application_1521014458020_
>> 0004/
>> 
>>
>> user: centos
>>
>>
>>
>> 18/03/14 09:30:08 INFO Client: Application report for
>> application_1521014458020_0003 (state: ACCEPTED)
>>
>> 18/03/14 09:30:09 INFO Client: Application report for
>> application_1521014458020_0003 (state: ACCEPTED)
>>
>> 18/03/14 09:30:10 INFO Client: Application report for
>> application_1521014458020_0003 (state: ACCEPTED)
>>
>> 18/03/14 09:30:11 INFO Client: Application report for
>> application_1521014458020_0003 (state: ACCEPTED)
>>
>> 18/03/14 09:30:12 INFO Client: Application report for
>> application_1521014458020_0003 (state: ACCEPTED)
>>
>> 18/03/14 09:30:13 INFO Client: Application report for
>> application_1521014458020_0003 (state: ACCEPTED)
>>
>> 18/03/14 09:30:14 INFO Client: Application report for
>> application_1521014458020_0003 (state: ACCEPTED)
>>
>> 18/03/14 09:30:15 INFO Client: Application report for
>> application_1521014458020_0003 (state: ACCEPTED)
>>
>>
>>
>> On Wed, Mar 14, 2018 at 2:03 AM, Femi Anthony  wrote:
>>
>> Make sure you have enough memory allocated for Spark workers, try
>> specifying executor memory as follows:
>>
>> --executor-memory 
>>
>> to spark-submit.
>>
>>
>>
>> On Wed, Mar 14, 2018 at 3:25 AM, kant kodali  wrote:
>>
>> I am using spark 2.3.0 and hadoop 2.7.3.
>>
>>
>>
>> Also I have done the following and restarted all. But I still
>> see ACCEPTED: waiting for AM container to be allocated, launched and
>> register with RM. And i am unable to spawn spark-shell.
>>
>>
>>
>> editing $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml and change the
>> following property value from 0.1 to something higher. I changed to 0.5
>> (50%)
>>
>> <property>
>>   <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
>>   <value>0.5</value>
>>   <description>Maximum percent of resources in the cluster which can be used to run 
>>   application masters i.e. controls number of concurrent running applications.</description>
>> </property>
>>
>> You may have to allocate more memory to YARN by editing yarn-site.xml by
>> updating the following property:
>>
>> <property>
>>   <name>yarn.nodemanager.resource.memory-mb</name>
>>   <value>8192</value>
>> </property>
>>
>> https://stackoverflow.com/questions/45687607/waiting-for-am-
>> container-to-be-allocated-launched-and-register-with-rm
>> 

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Dylan Guedes
I misread it and thought that your question was whether pyspark supports
Kafka, lol. Sorry!

On Wed, Mar 14, 2018 at 3:58 PM, Aakash Basu 
wrote:

> Hey Dylan,
>
> Great!
>
> Can you revert back to my initial and also the latest mail?
>
> Thanks,
> Aakash.
>
> On 15-Mar-2018 12:27 AM, "Dylan Guedes"  wrote:
>
>> Hi,
>>
>> I've been using the Kafka with pyspark since 2.1.
>>
>> On Wed, Mar 14, 2018 at 3:49 PM, Aakash Basu 
>> wrote:
>>
>>> Hi,
>>>
>>> I'm yet to.
>>>
>>> Just want to know, when does Spark 2.3 with 0.10 Kafka Spark Package
>>> allows Python? I read somewhere, as of now Scala and Java are the languages
>>> to be used.
>>>
>>> Please correct me if am wrong.
>>>
>>> Thanks,
>>> Aakash.
>>>
>>> On 14-Mar-2018 8:24 PM, "Georg Heiler" 
>>> wrote:
>>>
 Did you try spark 2.3 with structured streaming? There watermarking and
 plain sql might be really interesting for you.
 Aakash Basu  schrieb am Mi. 14. März 2018
 um 14:57:

> Hi,
>
>
>
> *Info (Using):Spark Streaming Kafka 0.8 package*
>
> *Spark 2.2.1*
> *Kafka 1.0.1*
>
> As of now, I am feeding paragraphs in Kafka console producer and my
> Spark, which is acting as a receiver is printing the flattened words, 
> which
> is a complete RDD operation.
>
> *My motive is to read two tables continuously (being updated) as two
> distinct Kafka topics being read as two Spark Dataframes and join them
> based on a key and produce the output. *(I am from Spark-SQL
> background, pardon my Spark-SQL-ish writing)
>
> *It may happen, the first topic is receiving new data 15 mins prior to
> the second topic, in that scenario, how to proceed? I should not lose any
> data.*
>
> As of now, I want to simply pass paragraphs, read them as RDD, convert
> to DF and then join to get the common keys as the output. (Just for R).
>
> Started using Spark Streaming and Kafka today itself.
>
> Please help!
>
> Thanks,
> Aakash.
>

>>


Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
Hey Dylan,

Great!

Could you get back to me on my initial mail and also the latest one?

Thanks,
Aakash.

On 15-Mar-2018 12:27 AM, "Dylan Guedes"  wrote:

> Hi,
>
> I've been using the Kafka with pyspark since 2.1.
>
> On Wed, Mar 14, 2018 at 3:49 PM, Aakash Basu 
> wrote:
>
>> Hi,
>>
>> I'm yet to.
>>
>> Just want to know, when does Spark 2.3 with 0.10 Kafka Spark Package
>> allows Python? I read somewhere, as of now Scala and Java are the languages
>> to be used.
>>
>> Please correct me if am wrong.
>>
>> Thanks,
>> Aakash.
>>
>> On 14-Mar-2018 8:24 PM, "Georg Heiler"  wrote:
>>
>>> Did you try spark 2.3 with structured streaming? There watermarking and
>>> plain sql might be really interesting for you.
>>> Aakash Basu  schrieb am Mi. 14. März 2018
>>> um 14:57:
>>>
 Hi,



 *Info (Using):Spark Streaming Kafka 0.8 package*

 *Spark 2.2.1*
 *Kafka 1.0.1*

 As of now, I am feeding paragraphs in Kafka console producer and my
 Spark, which is acting as a receiver is printing the flattened words, which
 is a complete RDD operation.

 *My motive is to read two tables continuously (being updated) as two
 distinct Kafka topics being read as two Spark Dataframes and join them
 based on a key and produce the output. *(I am from Spark-SQL
 background, pardon my Spark-SQL-ish writing)

 *It may happen, the first topic is receiving new data 15 mins prior to
 the second topic, in that scenario, how to proceed? I should not lose any
 data.*

 As of now, I want to simply pass paragraphs, read them as RDD, convert
 to DF and then join to get the common keys as the output. (Just for R).

 Started using Spark Streaming and Kafka today itself.

 Please help!

 Thanks,
 Aakash.

>>>
>


Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Dylan Guedes
Hi,

I've been using Kafka with pyspark since 2.1.

On Wed, Mar 14, 2018 at 3:49 PM, Aakash Basu 
wrote:

> Hi,
>
> I'm yet to.
>
> Just want to know, when does Spark 2.3 with 0.10 Kafka Spark Package
> allows Python? I read somewhere, as of now Scala and Java are the languages
> to be used.
>
> Please correct me if am wrong.
>
> Thanks,
> Aakash.
>
> On 14-Mar-2018 8:24 PM, "Georg Heiler"  wrote:
>
>> Did you try spark 2.3 with structured streaming? There watermarking and
>> plain sql might be really interesting for you.
>> Aakash Basu  schrieb am Mi. 14. März 2018 um
>> 14:57:
>>
>>> Hi,
>>>
>>>
>>>
>>> *Info (Using):Spark Streaming Kafka 0.8 package*
>>>
>>> *Spark 2.2.1*
>>> *Kafka 1.0.1*
>>>
>>> As of now, I am feeding paragraphs in Kafka console producer and my
>>> Spark, which is acting as a receiver is printing the flattened words, which
>>> is a complete RDD operation.
>>>
>>> *My motive is to read two tables continuously (being updated) as two
>>> distinct Kafka topics being read as two Spark Dataframes and join them
>>> based on a key and produce the output. *(I am from Spark-SQL
>>> background, pardon my Spark-SQL-ish writing)
>>>
>>> *It may happen, the first topic is receiving new data 15 mins prior to
>>> the second topic, in that scenario, how to proceed? I should not lose any
>>> data.*
>>>
>>> As of now, I want to simply pass paragraphs, read them as RDD, convert
>>> to DF and then join to get the common keys as the output. (Just for R).
>>>
>>> Started using Spark Streaming and Kafka today itself.
>>>
>>> Please help!
>>>
>>> Thanks,
>>> Aakash.
>>>
>>


Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
Hi,

I'm yet to.

Just want to know: when will Spark 2.3 with the 0.10 Kafka Spark package allow
Python? I read somewhere that, as of now, Scala and Java are the languages to be
used.

Please correct me if I am wrong.

Thanks,
Aakash.

On 14-Mar-2018 8:24 PM, "Georg Heiler"  wrote:

> Did you try spark 2.3 with structured streaming? There watermarking and
> plain sql might be really interesting for you.
> Aakash Basu  schrieb am Mi. 14. März 2018 um
> 14:57:
>
>> Hi,
>>
>>
>>
>> *Info (Using):Spark Streaming Kafka 0.8 package*
>>
>> *Spark 2.2.1*
>> *Kafka 1.0.1*
>>
>> As of now, I am feeding paragraphs in Kafka console producer and my
>> Spark, which is acting as a receiver is printing the flattened words, which
>> is a complete RDD operation.
>>
>> *My motive is to read two tables continuously (being updated) as two
>> distinct Kafka topics being read as two Spark Dataframes and join them
>> based on a key and produce the output. *(I am from Spark-SQL background,
>> pardon my Spark-SQL-ish writing)
>>
>> *It may happen, the first topic is receiving new data 15 mins prior to
>> the second topic, in that scenario, how to proceed? I should not lose any
>> data.*
>>
>> As of now, I want to simply pass paragraphs, read them as RDD, convert to
>> DF and then join to get the common keys as the output. (Just for R).
>>
>> Started using Spark Streaming and Kafka today itself.
>>
>> Please help!
>>
>> Thanks,
>> Aakash.
>>
>


Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
Hi,

I am not using emr. And yes I restarted several times.

On Wed, Mar 14, 2018 at 6:35 AM, Anthony, Olufemi <
olufemi.anth...@capitalone.com> wrote:

> After you updated your yarn-site.xml  file, did you restart the YARN
> resource manager ?
>
>
>
> https://aws.amazon.com/premiumsupport/knowledge-
> center/restart-service-emr/
>
>
>
> Femi
>
>
>
> *From: *kant kodali 
> *Date: *Wednesday, March 14, 2018 at 6:16 AM
> *To: *Femi Anthony 
> *Cc: *vermanurag , "user @spark" <
> user@spark.apache.org>
> *Subject: *Re: How to run spark shell using YARN
>
>
>
> 16GB RAM.  AWS m4.xlarge. It's a three node cluster and I only have YARN
> and  HDFS running. Resources are barely used however I believe there is
> something in my config that is preventing YARN to see that I have good
> amount of resources I think (thats my guess I never worked with YARN
> before). My mapred-site.xml is empty. Do I even need this? if so, what
> should I set it to?
>
>
>
> On Wed, Mar 14, 2018 at 2:46 AM, Femi Anthony  wrote:
>
> What's the hardware configuration of the box you're running on i.e. how
> much memory does it have ?
>
>
>
> Femi
>
>
>
> On Wed, Mar 14, 2018 at 5:32 AM, kant kodali  wrote:
>
> Tried this
>
>
>
>  ./spark-shell --master yarn --deploy-mode client --executor-memory 4g
>
>
>
> Same issue. Keeps going forever..
>
>
>
> 18/03/14 09:31:25 INFO Client:
>
> client token: N/A
>
> diagnostics: N/A
>
> ApplicationMaster host: N/A
>
> ApplicationMaster RPC port: -1
>
> queue: default
>
> start time: 1521019884656
>
> final status: UNDEFINED
>
> tracking URL: http://ip-172-31-0-54:8088/proxy/application_
> 1521014458020_0004/
> 
>
> user: centos
>
>
>
> 18/03/14 09:30:08 INFO Client: Application report for
> application_1521014458020_0003 (state: ACCEPTED)
>
> 18/03/14 09:30:09 INFO Client: Application report for
> application_1521014458020_0003 (state: ACCEPTED)
>
> 18/03/14 09:30:10 INFO Client: Application report for
> application_1521014458020_0003 (state: ACCEPTED)
>
> 18/03/14 09:30:11 INFO Client: Application report for
> application_1521014458020_0003 (state: ACCEPTED)
>
> 18/03/14 09:30:12 INFO Client: Application report for
> application_1521014458020_0003 (state: ACCEPTED)
>
> 18/03/14 09:30:13 INFO Client: Application report for
> application_1521014458020_0003 (state: ACCEPTED)
>
> 18/03/14 09:30:14 INFO Client: Application report for
> application_1521014458020_0003 (state: ACCEPTED)
>
> 18/03/14 09:30:15 INFO Client: Application report for
> application_1521014458020_0003 (state: ACCEPTED)
>
>
>
> On Wed, Mar 14, 2018 at 2:03 AM, Femi Anthony  wrote:
>
> Make sure you have enough memory allocated for Spark workers, try
> specifying executor memory as follows:
>
> --executor-memory 
>
> to spark-submit.
>
>
>
> On Wed, Mar 14, 2018 at 3:25 AM, kant kodali  wrote:
>
> I am using spark 2.3.0 and hadoop 2.7.3.
>
>
>
> Also I have done the following and restarted all. But I still
> see ACCEPTED: waiting for AM container to be allocated, launched and
> register with RM. And i am unable to spawn spark-shell.
>
>
>
> editing $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml and change the
> following property value from 0.1 to something higher. I changed to 0.5
> (50%)
>
> <property>
>   <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
>   <value>0.5</value>
>   <description>Maximum percent of resources in the cluster which can be used to run 
>   application masters i.e. controls number of concurrent running applications.</description>
> </property>
>
> You may have to allocate more memory to YARN by editing yarn-site.xml by
> updating the following property:
>
> <property>
>   <name>yarn.nodemanager.resource.memory-mb</name>
>   <value>8192</value>
> </property>
>
> https://stackoverflow.com/questions/45687607/waiting-
> for-am-container-to-be-allocated-launched-and-register-with-rm
> 
>
>
>
>
>
>
>
> On Wed, Mar 14, 2018 at 12:12 AM, kant kodali  wrote:
>
> any idea?
>
>
>
> On Wed, Mar 14, 2018 at 12:12 AM, kant kodali  wrote:
>
> I set core-site.xml, hdfs-site.xml, yarn-site.xml  as per this website
> 

retention policy for spark structured streaming dataset

2018-03-14 Thread Lian Jiang
I have a Spark Structured Streaming job which dumps data into a parquet
file. To keep the parquet data from growing infinitely, I want to discard data
older than 3 months. Does Spark Streaming support this? Or do I need to stop the
streaming job, trim the parquet files, and restart the streaming job? Thanks
for any hints.


Re: Spark Job Server application compilation issue

2018-03-14 Thread sujeet jog
Thanks for pointing that out.


On Wed, Mar 14, 2018 at 11:19 PM, Vadim Semenov  wrote:

> This question should be directed to the `spark-jobserver` group:
> https://github.com/spark-jobserver/spark-jobserver#contact
>
> They also have a gitter chat.
>
> Also include the errors you get once you're going to be asking them a
> question
>
> On Wed, Mar 14, 2018 at 1:37 PM, sujeet jog  wrote:
>
>>
>> Input is a json request, which would be decoded in myJob() & processed
>> further.
>>
>> Not sure what is wrong with below code, it emits errors as unimplemented
>> methods (runJob/validate),
>> any pointers on this would be helpful,
>>
>> jobserver-0.8.0
>>
>> object MyJobServer extends SparkSessionJob {
>>
>>   type JobData = String
>>   type JobOutput = Seq[String]
>>
>>   def myJob(a : String)  = {
>> }
>>
>>   def runJob(sc: SparkContext, runtime: JobEnvironment, data: JobData):
>> JobOutput = {
>>myJob(a)
>>}
>>
>>  def validate(sc: SparkContext, runtime: JobEnvironment, config: Config):
>> JobData Or Every[ValidationProblem] = {
>>Good(config.root().render())
>>  }
>>
>>
>
>
> --
> Sent from my iPhone
>


Re: Spark Job Server application compilation issue

2018-03-14 Thread Vadim Semenov
This question should be directed to the `spark-jobserver` group:
https://github.com/spark-jobserver/spark-jobserver#contact

They also have a gitter chat.

Also include the errors you get when you ask them the question.

On Wed, Mar 14, 2018 at 1:37 PM, sujeet jog  wrote:

>
> Input is a json request, which would be decoded in myJob() & processed
> further.
>
> Not sure what is wrong with below code, it emits errors as unimplemented
> methods (runJob/validate),
> any pointers on this would be helpful,
>
> jobserver-0.8.0
>
> object MyJobServer extends SparkSessionJob {
>
>   type JobData = String
>   type JobOutput = Seq[String]
>
>   def myJob(a : String)  = {
> }
>
>   def runJob(sc: SparkContext, runtime: JobEnvironment, data: JobData):
> JobOutput = {
>myJob(a)
>}
>
>  def validate(sc: SparkContext, runtime: JobEnvironment, config: Config):
> JobData Or Every[ValidationProblem] = {
>Good(config.root().render())
>  }
>
>


-- 
Sent from my iPhone


Spark Job Server application compilation issue

2018-03-14 Thread sujeet jog
The input is a JSON request, which would be decoded in myJob() and processed
further.

Not sure what is wrong with the code below; it emits errors about unimplemented
methods (runJob/validate).
Any pointers on this would be helpful.

jobserver-0.8.0

object MyJobServer extends SparkSessionJob {

  type JobData = String
  type JobOutput = Seq[String]

  def myJob(a : String)  = {
}

  def runJob(sc: SparkContext, runtime: JobEnvironment, data: JobData):
JobOutput = {
   myJob(a)
   }

 def validate(sc: SparkContext, runtime: JobEnvironment, config: Config):
JobData Or Every[ValidationProblem] = {
   Good(config.root().render())
 }


Bisecting Kmeans Linkage Matrix Output (Cluster Indices)

2018-03-14 Thread GabeChurch
I have been working on a project to return a Linkage Matrix output from the
Spark Bisecting Kmeans Algorithm output so that it is possible to plot the
selection steps in a dendrogram. I am having trouble returning valid indices
when I use more than 3-4 clusters in the algorithm, and am hoping someone
else might have the time/interest to take a look.

To achieve this I made some modifications to the Bisecting Kmeans algorithm
to produce a z-linkage matrix based on yu-iskw's work. I also made some
modifications to provide more information about the selection steps in the
Bisecting Kmeans Algorithm to the log at run-time.

Test outputs using the Iris Dataset with both k = 3 and k = 10 clusters can
be seen on my Stack Overflow post.

The project so far (with a simple sbt build and the compiled jars) can also
be seen on my GitHub repo and is also detailed in the aforementioned Stack
Overflow post.




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Georg Heiler
Did you try Spark 2.3 with Structured Streaming? There, watermarking and
plain SQL might be really interesting for you.
Aakash Basu  schrieb am Mi. 14. März 2018 um
14:57:

> Hi,
>
>
>
> *Info (Using):Spark Streaming Kafka 0.8 package*
>
> *Spark 2.2.1*
> *Kafka 1.0.1*
>
> As of now, I am feeding paragraphs in Kafka console producer and my Spark,
> which is acting as a receiver is printing the flattened words, which is a
> complete RDD operation.
>
> *My motive is to read two tables continuously (being updated) as two
> distinct Kafka topics being read as two Spark Dataframes and join them
> based on a key and produce the output. *(I am from Spark-SQL background,
> pardon my Spark-SQL-ish writing)
>
> *It may happen, the first topic is receiving new data 15 mins prior to the
> second topic, in that scenario, how to proceed? I should not lose any data.*
>
> As of now, I want to simply pass paragraphs, read them as RDD, convert to
> DF and then join to get the common keys as the output. (Just for R).
>
> Started using Spark Streaming and Kafka today itself.
>
> Please help!
>
> Thanks,
> Aakash.
>


Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
Hi,



*Info (Using): Spark Streaming Kafka 0.8 package*

*Spark 2.2.1*
*Kafka 1.0.1*

As of now, I am feeding paragraphs into the Kafka console producer, and my Spark
job, which is acting as a receiver, is printing the flattened words, which is a
complete RDD operation.

*My motive is to read two continuously updated tables as two distinct Kafka
topics, read them as two Spark Dataframes, join them based on a key, and
produce the output. *(I am from a Spark-SQL background, pardon my
Spark-SQL-ish writing.)

*It may happen that the first topic receives new data 15 mins prior to the
second topic; in that scenario, how should I proceed? I should not lose any data.*

As of now, I want to simply pass paragraphs, read them as an RDD, convert to a
DF, and then join to get the common keys as the output. (Just for R).

I started using Spark Streaming and Kafka just today.

Please help!

Thanks,
Aakash.
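For context, a minimal sketch of the DStream-style flow described above, using the 0.8 Kafka integration available to PySpark in Spark 2.2. The broker, topic names, and the "id,rest-of-row" record format are assumptions made only for illustration:

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

spark = SparkSession.builder.appName("TwoTopicJoin").getOrCreate()
ssc = StreamingContext(spark.sparkContext, 10)  # 10-second micro-batches

params = {"metadata.broker.list": "localhost:9092"}  # placeholder broker

# Assume each Kafka record is "id,rest-of-row" text; turn it into (id, row) pairs.
def keyed(stream):
    return stream.map(lambda kv: kv[1]).map(lambda row: (row.split(",", 1)[0], row))

a = keyed(KafkaUtils.createDirectStream(ssc, ["table_a"], params))
b = keyed(KafkaUtils.createDirectStream(ssc, ["table_b"], params))

# Per-batch inner join on the key. Note it only matches records that land in
# the same micro-batch, which is why one topic running 15 minutes ahead is a
# problem here and why the Spark 2.3 stream-stream join (see earlier in this
# digest) is the better fit.
joined = a.join(b)

def show_batch(time, rdd):
    if not rdd.isEmpty():
        df = spark.createDataFrame(
            rdd.map(lambda kv: (kv[0], kv[1][0], kv[1][1])),
            ["id", "row_a", "row_b"])
        df.show(truncate=False)

joined.foreachRDD(show_batch)
ssc.start()
ssc.awaitTermination()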


Re: How to run spark shell using YARN

2018-03-14 Thread Anthony, Olufemi
After you updated your yarn-site.xml  file, did you restart the YARN resource 
manager ?

https://aws.amazon.com/premiumsupport/knowledge-center/restart-service-emr/

Femi

From: kant kodali 
Date: Wednesday, March 14, 2018 at 6:16 AM
To: Femi Anthony 
Cc: vermanurag , "user @spark" 

Subject: Re: How to run spark shell using YARN

16GB RAM.  AWS m4.xlarge. It's a three node cluster and I only have YARN and  
HDFS running. Resources are barely used however I believe there is something in 
my config that is preventing YARN to see that I have good amount of resources I 
think (thats my guess I never worked with YARN before). My mapred-site.xml is 
empty. Do I even need this? if so, what should I set it to?

On Wed, Mar 14, 2018 at 2:46 AM, Femi Anthony 
> wrote:
What's the hardware configuration of the box you're running on i.e. how much 
memory does it have ?

Femi

On Wed, Mar 14, 2018 at 5:32 AM, kant kodali 
> wrote:
Tried this


 ./spark-shell --master yarn --deploy-mode client --executor-memory 4g



Same issue. Keeps going forever..



18/03/14 09:31:25 INFO Client:

client token: N/A

diagnostics: N/A

ApplicationMaster host: N/A

ApplicationMaster RPC port: -1

queue: default

start time: 1521019884656

final status: UNDEFINED

tracking URL: 
http://ip-172-31-0-54:8088/proxy/application_1521014458020_0004/

user: centos



18/03/14 09:30:08 INFO Client: Application report for 
application_1521014458020_0003 (state: ACCEPTED)

18/03/14 09:30:09 INFO Client: Application report for 
application_1521014458020_0003 (state: ACCEPTED)

18/03/14 09:30:10 INFO Client: Application report for 
application_1521014458020_0003 (state: ACCEPTED)

18/03/14 09:30:11 INFO Client: Application report for 
application_1521014458020_0003 (state: ACCEPTED)

18/03/14 09:30:12 INFO Client: Application report for 
application_1521014458020_0003 (state: ACCEPTED)

18/03/14 09:30:13 INFO Client: Application report for 
application_1521014458020_0003 (state: ACCEPTED)

18/03/14 09:30:14 INFO Client: Application report for 
application_1521014458020_0003 (state: ACCEPTED)

18/03/14 09:30:15 INFO Client: Application report for 
application_1521014458020_0003 (state: ACCEPTED)

On Wed, Mar 14, 2018 at 2:03 AM, Femi Anthony 
> wrote:
Make sure you have enough memory allocated for Spark workers, try specifying 
executor memory as follows:

--executor-memory <executor memory>

to spark-submit.

On Wed, Mar 14, 2018 at 3:25 AM, kant kodali 
> wrote:
I am using spark 2.3.0 and hadoop 2.7.3.

Also I have done the following and restarted all. But I still see ACCEPTED: 
waiting for AM container to be allocated, launched and register with RM. And i 
am unable to spawn spark-shell.


editing $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml and change the following 
property value from 0.1 to something higher. I changed to 0.5 (50%)

<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
  <description>Maximum percent of resources in the cluster which can be used to run 
  application masters i.e. controls number of concurrent running applications.</description>
</property>

You may have to allocate more memory to YARN by editing yarn-site.xml by 
updating the following property:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>

https://stackoverflow.com/questions/45687607/waiting-for-am-container-to-be-allocated-launched-and-register-with-rm



On Wed, Mar 14, 2018 at 12:12 AM, kant kodali 
> wrote:
any idea?

On Wed, Mar 14, 2018 at 12:12 AM, kant kodali 
> wrote:
I set core-site.xml, hdfs-site.xml, yarn-site.xml  as per this 
website
 and these are the only three files I changed Do I 

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
16GB RAM, AWS m4.xlarge. It's a three-node cluster and I only have YARN
and HDFS running. Resources are barely used; however, I believe there is
something in my config that is preventing YARN from seeing that I have a good
amount of resources (that's my guess; I have never worked with YARN
before). My mapred-site.xml is empty. Do I even need it? If so, what
should I set it to?

On Wed, Mar 14, 2018 at 2:46 AM, Femi Anthony  wrote:

> What's the hardware configuration of the box you're running on i.e. how
> much memory does it have ?
>
> Femi
>
> On Wed, Mar 14, 2018 at 5:32 AM, kant kodali  wrote:
>
>> Tried this
>>
>>  ./spark-shell --master yarn --deploy-mode client --executor-memory 4g
>>
>>
>> Same issue. Keeps going forever..
>>
>>
>> 18/03/14 09:31:25 INFO Client:
>>
>> client token: N/A
>>
>> diagnostics: N/A
>>
>> ApplicationMaster host: N/A
>>
>> ApplicationMaster RPC port: -1
>>
>> queue: default
>>
>> start time: 1521019884656
>>
>> final status: UNDEFINED
>>
>> tracking URL: http://ip-172-31-0-54:8088/proxy/application_1521014458020_
>> 0004/
>>
>> user: centos
>>
>>
>> 18/03/14 09:30:08 INFO Client: Application report for
>> application_1521014458020_0003 (state: ACCEPTED)
>>
>> 18/03/14 09:30:09 INFO Client: Application report for
>> application_1521014458020_0003 (state: ACCEPTED)
>>
>> 18/03/14 09:30:10 INFO Client: Application report for
>> application_1521014458020_0003 (state: ACCEPTED)
>>
>> 18/03/14 09:30:11 INFO Client: Application report for
>> application_1521014458020_0003 (state: ACCEPTED)
>>
>> 18/03/14 09:30:12 INFO Client: Application report for
>> application_1521014458020_0003 (state: ACCEPTED)
>>
>> 18/03/14 09:30:13 INFO Client: Application report for
>> application_1521014458020_0003 (state: ACCEPTED)
>>
>> 18/03/14 09:30:14 INFO Client: Application report for
>> application_1521014458020_0003 (state: ACCEPTED)
>>
>> 18/03/14 09:30:15 INFO Client: Application report for
>> application_1521014458020_0003 (state: ACCEPTED)
>>
>> On Wed, Mar 14, 2018 at 2:03 AM, Femi Anthony  wrote:
>>
>>> Make sure you have enough memory allocated for Spark workers, try
>>> specifying executor memory as follows:
>>>
>>> --executor-memory 
>>>
>>> to spark-submit.
>>>
>>> On Wed, Mar 14, 2018 at 3:25 AM, kant kodali  wrote:
>>>
 I am using spark 2.3.0 and hadoop 2.7.3.

 Also I have done the following and restarted all. But I still
 see ACCEPTED: waiting for AM container to be allocated, launched and
 register with RM. And i am unable to spawn spark-shell.

 editing $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml and change the
 following property value from 0.1 to something higher. I changed to 0.5
 (50%)

 
 yarn.scheduler.capacity.maximum-am-resource-percent
 0.5
 
 Maximum percent of resources in the cluster which can be used to 
 run application masters i.e. controls number of concurrent running 
 applications.
 
 

 You may have to allocate more memory to YARN by editing yarn-site.xml
 by updating the following property:

 
 yarn.nodemanager.resource.memory-mb
 8192
 

 https://stackoverflow.com/questions/45687607/waiting-for-am-
 container-to-be-allocated-launched-and-register-with-rm



 On Wed, Mar 14, 2018 at 12:12 AM, kant kodali 
 wrote:

> any idea?
>
> On Wed, Mar 14, 2018 at 12:12 AM, kant kodali 
> wrote:
>
>> I set core-site.xml, hdfs-site.xml, yarn-site.xml  as per this
>> website  and
>> these are the only three files I changed Do I need to set or change
>> anything in mapred-site.xml (As of now I have not touched 
>> mapred-site.xml)?
>>
>> when I do yarn -node -list -all I can see both node manager and
>> resource managers are running fine.
>>
>> But when I run spark-shell --master yarn --deploy-mode client
>>
>>
>> it just keeps looping forever and never stops with the following
>> messages
>>
>> 18/03/14 07:07:47 INFO Client: Application report for
>> application_1521011212550_0001 (state: ACCEPTED)
>> 18/03/14 07:07:48 INFO Client: Application report for
>> application_1521011212550_0001 (state: ACCEPTED)
>> 18/03/14 07:07:49 INFO Client: Application report for
>> application_1521011212550_0001 (state: ACCEPTED)
>> 18/03/14 07:07:50 INFO Client: Application report for
>> application_1521011212550_0001 (state: ACCEPTED)
>> 18/03/14 07:07:51 INFO Client: Application report for
>> application_1521011212550_0001 (state: ACCEPTED)
>> 18/03/14 07:07:52 INFO Client: Application report for
>> application_1521011212550_0001 (state: ACCEPTED)
>>
>> when I go to RM UI I see this
>>
>> ACCEPTED: waiting for AM container to be allocated, launched and register with RM.

Re: Insufficient memory for Java Runtime

2018-03-14 Thread Femi Anthony
Try specifying executor memory.

On Tue, Mar 13, 2018 at 5:15 PM, Shiyuan  wrote:

> Hi Spark-Users,
>   I encountered the problem of "insufficient memory". The error is logged
> in the file with a name " hs_err_pid86252.log"(attached in the end of this
> email).
>
> I launched the spark job by " spark-submit --driver-memory 40g --master
> yarn --deploy-mode client".  The spark session was created with 10
> executors each with 60g memory. The data access pattern is pretty simple, I
> keep reading some spark dataframe from hdfs one by one, filter, join with
> another dataframe,  and then append the results to an dataframe:
> for i= 1,2,3
> df1 = spark.read.parquet(file_i)
> df_r = df1.filter(...). join(df2)
> df_all = df_all.union(df_r)
>
> each file_i is quite small, only a few GB, but there are a lot of such
> files. after filtering and join, each df_r is also quite small. When the
> program failed, df_all had only 10k rows which should be around 10GB.  Each
> machine in the cluster has round 80GB memory and 1TB disk space and  only
> one user was using the cluster when it failed due to insufficient memory.
> My questions are:
> i).  The log file showed that it failed to allocate 8G committing memory.
> But how could that happen when the driver and executors have more than 40g
> free memory. In fact, only transformations but no actions had run when the
> program failed.  As I understand, only DAG and book-keeping work is done
> during dataframe transformation, no data is brought into the memory.  Why
> spark still tries to allocate such large memory?
> ii). Could manually running garbage collection help?
> iii). Did I mis-specify some runtime parameter for jvm, yarn, or spark?
>
>
> Any help or references are appreciated!
>
> The content of hs_err_pid86252,log:
>
> # There is insufficient memory for the Java Runtime Environment to
> continue.
> # Native memory allocation (mmap) failed to map 8663334912
> <(866)%20333-4912> bytes(~8G) for committing reserved memory.
> # Possible reasons:
> #   The system is out of physical RAM or swap space
> #   In 32 bit mode, the process size limit was hit
> # Possible solutions:
> #   Reduce memory load on the system
> #   Increase physical memory or swap space
> #   Check if swap backing store is full
> #   Use 64 bit Java on a 64 bit OS
> #   Decrease Java heap size (-Xmx/-Xms)
> #   Decrease number of Java threads
> #   Decrease Java thread stack sizes (-Xss)
> #   Set larger code cache with -XX:ReservedCodeCacheSize=
> # This output file may be truncated or incomplete.
> #
> #  Out of Memory Error (os_linux.cpp:2643), pid=86252,
> tid=0x7fd69e683700
> #
> # JRE version: OpenJDK Runtime Environment (8.0_151-b12) (build
> 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
> # Java VM: OpenJDK 64-Bit Server VM (25.151-b12 mixed mode linux-amd64 )
> # Failed to write core dump. Core dumps have been disabled. To enable core
> dumping, try "ulimit -c unlimited" before starting Java again
> #
>
> ---  T H R E A D  ---
>
> Current thread (0x7fe0bc08c000):  VMThread [stack: 
> 0x7fd69e583000,0x7fd69e684000]
> [id=86295]
>
>


-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.


Re: How to run spark shell using YARN

2018-03-14 Thread Femi Anthony
What's the hardware configuration of the box you're running on i.e. how
much memory does it have ?

Femi

On Wed, Mar 14, 2018 at 5:32 AM, kant kodali  wrote:

> Tried this
>
>  ./spark-shell --master yarn --deploy-mode client --executor-memory 4g
>
>
> Same issue. Keeps going forever..
>
>
> 18/03/14 09:31:25 INFO Client:
>
> client token: N/A
>
> diagnostics: N/A
>
> ApplicationMaster host: N/A
>
> ApplicationMaster RPC port: -1
>
> queue: default
>
> start time: 1521019884656
>
> final status: UNDEFINED
>
> tracking URL: http://ip-172-31-0-54:8088/proxy/application_
> 1521014458020_0004/
>
> user: centos
>
>
> 18/03/14 09:30:08 INFO Client: Application report for
> application_1521014458020_0003 (state: ACCEPTED)
>
> 18/03/14 09:30:09 INFO Client: Application report for
> application_1521014458020_0003 (state: ACCEPTED)
>
> 18/03/14 09:30:10 INFO Client: Application report for
> application_1521014458020_0003 (state: ACCEPTED)
>
> 18/03/14 09:30:11 INFO Client: Application report for
> application_1521014458020_0003 (state: ACCEPTED)
>
> 18/03/14 09:30:12 INFO Client: Application report for
> application_1521014458020_0003 (state: ACCEPTED)
>
> 18/03/14 09:30:13 INFO Client: Application report for
> application_1521014458020_0003 (state: ACCEPTED)
>
> 18/03/14 09:30:14 INFO Client: Application report for
> application_1521014458020_0003 (state: ACCEPTED)
>
> 18/03/14 09:30:15 INFO Client: Application report for
> application_1521014458020_0003 (state: ACCEPTED)
>
> On Wed, Mar 14, 2018 at 2:03 AM, Femi Anthony  wrote:
>
>> Make sure you have enough memory allocated for Spark workers, try
>> specifying executor memory as follows:
>>
>> --executor-memory 
>>
>> to spark-submit.
>>
>> On Wed, Mar 14, 2018 at 3:25 AM, kant kodali  wrote:
>>
>>> I am using spark 2.3.0 and hadoop 2.7.3.
>>>
>>> Also I have done the following and restarted all. But I still
>>> see ACCEPTED: waiting for AM container to be allocated, launched and
>>> register with RM. And i am unable to spawn spark-shell.
>>>
>>> editing $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml and change the
>>> following property value from 0.1 to something higher. I changed to 0.5
>>> (50%)
>>>
>>> 
>>> yarn.scheduler.capacity.maximum-am-resource-percent
>>> 0.5
>>> 
>>> Maximum percent of resources in the cluster which can be used to 
>>> run application masters i.e. controls number of concurrent running 
>>> applications.
>>> 
>>> 
>>>
>>> You may have to allocate more memory to YARN by editing yarn-site.xml by
>>> updating the following property:
>>>
>>> 
>>> yarn.nodemanager.resource.memory-mb
>>> 8192
>>> 
>>>
>>> https://stackoverflow.com/questions/45687607/waiting-for-am-
>>> container-to-be-allocated-launched-and-register-with-rm
>>>
>>>
>>>
>>> On Wed, Mar 14, 2018 at 12:12 AM, kant kodali 
>>> wrote:
>>>
 any idea?

 On Wed, Mar 14, 2018 at 12:12 AM, kant kodali 
 wrote:

> I set core-site.xml, hdfs-site.xml, yarn-site.xml  as per this website
>  and these are
> the only three files I changed Do I need to set or change anything in
> mapred-site.xml (As of now I have not touched mapred-site.xml)?
>
> when I do yarn -node -list -all I can see both node manager and
> resource managers are running fine.
>
> But when I run spark-shell --master yarn --deploy-mode client
>
>
> it just keeps looping forever and never stops with the following
> messages
>
> 18/03/14 07:07:47 INFO Client: Application report for
> application_1521011212550_0001 (state: ACCEPTED)
> 18/03/14 07:07:48 INFO Client: Application report for
> application_1521011212550_0001 (state: ACCEPTED)
> 18/03/14 07:07:49 INFO Client: Application report for
> application_1521011212550_0001 (state: ACCEPTED)
> 18/03/14 07:07:50 INFO Client: Application report for
> application_1521011212550_0001 (state: ACCEPTED)
> 18/03/14 07:07:51 INFO Client: Application report for
> application_1521011212550_0001 (state: ACCEPTED)
> 18/03/14 07:07:52 INFO Client: Application report for
> application_1521011212550_0001 (state: ACCEPTED)
>
> when I go to RM UI I see this
>
> ACCEPTED: waiting for AM container to be allocated, launched and
> register with RM.
>
>
>
>
> On Mon, Mar 12, 2018 at 7:16 PM, vermanurag <
> anurag.ve...@fnmathlogic.com> wrote:
>
>> This does not look like Spark error. Looks like yarn has not been
>> able to
>> allocate resources for spark driver. If you check resource manager UI
>> you
>> are likely to see this as spark application waiting for resources. Try
>> reducing the driver node memory and/ or other bottlenecks based on
>> what you
>> see in the resource manager UI.
>>
>>
>>

Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
Tried this

 ./spark-shell --master yarn --deploy-mode client --executor-memory 4g


Same issue. Keeps going forever..


18/03/14 09:31:25 INFO Client:

client token: N/A

diagnostics: N/A

ApplicationMaster host: N/A

ApplicationMaster RPC port: -1

queue: default

start time: 1521019884656

final status: UNDEFINED

tracking URL:
http://ip-172-31-0-54:8088/proxy/application_1521014458020_0004/

user: centos


18/03/14 09:30:08 INFO Client: Application report for
application_1521014458020_0003 (state: ACCEPTED)

18/03/14 09:30:09 INFO Client: Application report for
application_1521014458020_0003 (state: ACCEPTED)

18/03/14 09:30:10 INFO Client: Application report for
application_1521014458020_0003 (state: ACCEPTED)

18/03/14 09:30:11 INFO Client: Application report for
application_1521014458020_0003 (state: ACCEPTED)

18/03/14 09:30:12 INFO Client: Application report for
application_1521014458020_0003 (state: ACCEPTED)

18/03/14 09:30:13 INFO Client: Application report for
application_1521014458020_0003 (state: ACCEPTED)

18/03/14 09:30:14 INFO Client: Application report for
application_1521014458020_0003 (state: ACCEPTED)

18/03/14 09:30:15 INFO Client: Application report for
application_1521014458020_0003 (state: ACCEPTED)

On Wed, Mar 14, 2018 at 2:03 AM, Femi Anthony  wrote:

> Make sure you have enough memory allocated for Spark workers, try
> specifying executor memory as follows:
>
> --executor-memory 
>
> to spark-submit.
>
> On Wed, Mar 14, 2018 at 3:25 AM, kant kodali  wrote:
>
>> I am using spark 2.3.0 and hadoop 2.7.3.
>>
>> Also I have done the following and restarted all. But I still
>> see ACCEPTED: waiting for AM container to be allocated, launched and
>> register with RM. And i am unable to spawn spark-shell.
>>
>> editing $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml and change the
>> following property value from 0.1 to something higher. I changed to 0.5
>> (50%)
>>
>> <property>
>>   <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
>>   <value>0.5</value>
>>   <description>Maximum percent of resources in the cluster which can be used to run 
>>   application masters i.e. controls number of concurrent running applications.</description>
>> </property>
>>
>> You may have to allocate more memory to YARN by editing yarn-site.xml by
>> updating the following property:
>>
>> <property>
>>   <name>yarn.nodemanager.resource.memory-mb</name>
>>   <value>8192</value>
>> </property>
>>
>> https://stackoverflow.com/questions/45687607/waiting-for-am-
>> container-to-be-allocated-launched-and-register-with-rm
>>
>>
>>
>> On Wed, Mar 14, 2018 at 12:12 AM, kant kodali  wrote:
>>
>>> any idea?
>>>
>>> On Wed, Mar 14, 2018 at 12:12 AM, kant kodali 
>>> wrote:
>>>
 I set core-site.xml, hdfs-site.xml, yarn-site.xml  as per this website
  and these are
 the only three files I changed Do I need to set or change anything in
 mapred-site.xml (As of now I have not touched mapred-site.xml)?

 when I do yarn -node -list -all I can see both node manager and
 resource managers are running fine.

 But when I run spark-shell --master yarn --deploy-mode client


 it just keeps looping forever and never stops with the following
 messages

 18/03/14 07:07:47 INFO Client: Application report for
 application_1521011212550_0001 (state: ACCEPTED)
 18/03/14 07:07:48 INFO Client: Application report for
 application_1521011212550_0001 (state: ACCEPTED)
 18/03/14 07:07:49 INFO Client: Application report for
 application_1521011212550_0001 (state: ACCEPTED)
 18/03/14 07:07:50 INFO Client: Application report for
 application_1521011212550_0001 (state: ACCEPTED)
 18/03/14 07:07:51 INFO Client: Application report for
 application_1521011212550_0001 (state: ACCEPTED)
 18/03/14 07:07:52 INFO Client: Application report for
 application_1521011212550_0001 (state: ACCEPTED)

 when I go to RM UI I see this

 ACCEPTED: waiting for AM container to be allocated, launched and
 register with RM.




 On Mon, Mar 12, 2018 at 7:16 PM, vermanurag <
 anurag.ve...@fnmathlogic.com> wrote:

> This does not look like Spark error. Looks like yarn has not been able
> to
> allocate resources for spark driver. If you check resource manager UI
> you
> are likely to see this as spark application waiting for resources. Try
> reducing the driver node memory and/ or other bottlenecks based on
> what you
> see in the resource manager UI.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

>>>
>>
>
>
> --
> http://www.femibyte.com/twiki5/bin/view/Tech/
> http://www.nextmatrix.com
> "Great spirits have always encountered violent opposition from mediocre
> minds." - Albert Einstein.

Re: How to run spark shell using YARN

2018-03-14 Thread Femi Anthony
Make sure you have enough memory allocated for the Spark workers; try
specifying executor memory as follows:

--executor-memory <executor memory>

to spark-submit.

On Wed, Mar 14, 2018 at 3:25 AM, kant kodali  wrote:

> I am using spark 2.3.0 and hadoop 2.7.3.
>
> Also I have done the following and restarted all. But I still
> see ACCEPTED: waiting for AM container to be allocated, launched and
> register with RM. And i am unable to spawn spark-shell.
>
> editing $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml and change the
> following property value from 0.1 to something higher. I changed to 0.5
> (50%)
>
> <property>
>   <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
>   <value>0.5</value>
>   <description>Maximum percent of resources in the cluster which can be used to run 
>   application masters i.e. controls number of concurrent running applications.</description>
> </property>
>
> You may have to allocate more memory to YARN by editing yarn-site.xml by
> updating the following property:
>
> <property>
>   <name>yarn.nodemanager.resource.memory-mb</name>
>   <value>8192</value>
> </property>
>
> https://stackoverflow.com/questions/45687607/waiting-
> for-am-container-to-be-allocated-launched-and-register-with-rm
>
>
>
> On Wed, Mar 14, 2018 at 12:12 AM, kant kodali  wrote:
>
>> any idea?
>>
>> On Wed, Mar 14, 2018 at 12:12 AM, kant kodali  wrote:
>>
>>> I set core-site.xml, hdfs-site.xml, yarn-site.xml  as per this website
>>>  and these are
>>> the only three files I changed Do I need to set or change anything in
>>> mapred-site.xml (As of now I have not touched mapred-site.xml)?
>>>
>>> when I do yarn -node -list -all I can see both node manager and resource
>>> managers are running fine.
>>>
>>> But when I run spark-shell --master yarn --deploy-mode client
>>>
>>>
>>> it just keeps looping forever and never stops with the following messages
>>>
>>> 18/03/14 07:07:47 INFO Client: Application report for
>>> application_1521011212550_0001 (state: ACCEPTED)
>>> 18/03/14 07:07:48 INFO Client: Application report for
>>> application_1521011212550_0001 (state: ACCEPTED)
>>> 18/03/14 07:07:49 INFO Client: Application report for
>>> application_1521011212550_0001 (state: ACCEPTED)
>>> 18/03/14 07:07:50 INFO Client: Application report for
>>> application_1521011212550_0001 (state: ACCEPTED)
>>> 18/03/14 07:07:51 INFO Client: Application report for
>>> application_1521011212550_0001 (state: ACCEPTED)
>>> 18/03/14 07:07:52 INFO Client: Application report for
>>> application_1521011212550_0001 (state: ACCEPTED)
>>>
>>> when I go to RM UI I see this
>>>
>>> ACCEPTED: waiting for AM container to be allocated, launched and
>>> register with RM.
>>>
>>>
>>>
>>>
>>> On Mon, Mar 12, 2018 at 7:16 PM, vermanurag <
>>> anurag.ve...@fnmathlogic.com> wrote:
>>>
 This does not look like Spark error. Looks like yarn has not been able
 to
 allocate resources for spark driver. If you check resource manager UI
 you
 are likely to see this as spark application waiting for resources. Try
 reducing the driver node memory and/ or other bottlenecks based on what
 you
 see in the resource manager UI.



 --
 Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org


>>>
>>
>


-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.


Re: [EXT] Debugging a local spark executor in pycharm

2018-03-14 Thread Vitaliy Pisarev
Actually, I stumbled on this SO page.
While it is not straightforward, it is a fairly simple solution.

In short:


   - I made sure there is only one executing task at a time by calling
   repartition(1) - this made it easy to locate the one and only Spark daemon
   - I set a BP wherever I needed to
   - In order to "catch" the BP, I set a print-out and a time.sleep(15)
   right before it. The print-out gives me a notice that the daemon is up and
   running, and the sleep gives me time to push a few buttons so I can attach
   to the process (a short sketch of this follows below)

It worked fairly well, and I was able to debug the executor. I did notice
two strange things: sometimes I got a strange error and the debugger didn't
actually attach. It was not deterministic.

Other times I noticed a big gap between the point I got the notification
and attached to the process until the execution was resumed and I could
actually step through (by big gap I mean a gap that is considerably bigger
than the sleep period, usually about 1 minute).

Not perfect but worked most of the time.
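A minimal sketch of the workaround described above; here `dataset` and `process` are placeholders for the actual DataFrame and per-row logic:

import time

def f(rows):
    # Print a marker so it is obvious when the single executor daemon is up,
    # then sleep long enough to attach the PyCharm debugger to the pyspark
    # daemon process before execution continues.
    print("executor up - attach the debugger now")
    time.sleep(15)
    for row in rows:          # put the breakpoint inside this loop
        process(row)          # 'process' stands in for the real per-row work

# repartition(1) leaves exactly one task, hence one daemon process to attach to.
dataset.repartition(1).foreachPartition(f)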



On Wed, Mar 14, 2018 at 12:07 AM, Michael Mansour <
michael_mans...@symantec.com> wrote:

> Vitaliy,
>
>
>
> From what I understand, this is not possible to do.  However, let me share
> my workaround with you.
>
>
>
> Assuming you have your debugger up and running on PyCharm, set a
> breakpoint at this line, Take|collect|sample  your data (could also
> consider doing a glom if its critical the data remain partitioned, then the
> take/collect), and pass it into the function directly (direct python, no
> spark).  Use the debugger to step through there on that small sample.
>
>
>
> Alternatively, you can open up the PyCharm execution module.  In the
> execution module, do the same as above with the RDD, and pass it into the
> function.  This alleviates the need to write debugging code etc.  I find
> this model useful and a bit more fast, but it does not offer the
> step-through capability.
>
>
>
> Best of luck!
>
> M
>
> --
>
> Michael Mansour
>
> Data Scientist
>
> Symantec CASB
>
> *From: *Vitaliy Pisarev 
> *Date: *Sunday, March 11, 2018 at 8:46 AM
> *To: *"user@spark.apache.org" 
> *Subject: *[EXT] Debugging a local spark executor in pycharm
>
>
>
> I want to step through the work of a spark executor running locally on my
> machine, from Pycharm.
>
> I am running explicit functionality, in the form of
> dataset.foreachPartition(f) and I want to see what is going on inside f.
>
> Is there a straightforward way to do it or do I need to resort to remote
> debugging?
>
> p.s
>
>
>
> Posted this on SO
> 
> as well.
>


Re: Spark Application stuck

2018-03-14 Thread Femi Anthony
Have you taken a look at the EMR UI? What does your Spark setup look like?
I assume you're on EMR on AWS.
The various UI URLs and ports are listed here:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html


On Wed, Mar 14, 2018 at 4:23 AM, Mukund Big Data 
wrote:

> Hi
>
> I am executing the following recommendation engine using Spark ML
>
> https://aws.amazon.com/blogs/big-data/building-a-
> recommendation-engine-with-spark-ml-on-amazon-emr-using-zeppelin/
>
> When I am trying to save the model, the application hangs and doesn't
> respond.
>
> Any pointers to find where the problem is?
> Is there any blog/doc which guides us to efficiently debug Spark
> applications?
>
> Thanks & Regards
> Mukund
>



-- 
http://www.femibyte.com/twiki5/bin/view/Tech/
http://www.nextmatrix.com
"Great spirits have always encountered violent opposition from mediocre
minds." - Albert Einstein.


Spark Application stuck

2018-03-14 Thread Mukund Big Data
Hi

I am executing the following recommendation engine using Spark ML

https://aws.amazon.com/blogs/big-data/building-a-recommendation-engine-with-spark-ml-on-amazon-emr-using-zeppelin/

When I am trying to save the model, the application hangs and doesn't
respond.

Any pointers to find where the problem is?
Is there any blog/doc which guides us to efficiently debug Spark
applications?

Thanks & Regards
Mukund


How to start practicing Python Spark Streaming in Linux?

2018-03-14 Thread Aakash Basu
Hi all,

Any guide on how to kick-start learning PySpark Streaming on an Ubuntu
standalone system? A step-wise, practical, hands-on guide would be great.

Also, connecting Kafka with Spark, getting real-time data and processing
it in micro-batches...

Any help?

Thanks,
Aakash.
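
For anyone picking this up, a minimal sketch of reading a Kafka topic with
PySpark Structured Streaming and printing the micro-batches to the console.
It assumes Spark 2.3 with the spark-sql-kafka-0-10 package on the classpath;
the broker address and topic name are assumptions for illustration:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kafka-structured-streaming-sketch")
         .getOrCreate())

# Read a Kafka topic as a streaming DataFrame.
lines = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "test-topic")
         .load()
         .selectExpr("CAST(value AS STRING) AS line"))  # Kafka values arrive as bytes

# Write each micro-batch to the console as it arrives.
query = (lines.writeStream
         .outputMode("append")
         .format("console")
         .start())

query.awaitTermination()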


Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
I am using Spark 2.3.0 and Hadoop 2.7.3.

Also, I have done the following and restarted everything, but I still see
"ACCEPTED: waiting for AM container to be allocated, launched and register
with RM." and I am unable to spawn spark-shell.

I edited $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml and changed the
following property value from 0.1 to something higher. I changed it to 0.5
(50%):


<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
  <description>
    Maximum percent of resources in the cluster which can be used
    to run application masters i.e. controls number of concurrent running
    applications.
  </description>
</property>


You may have to allocate more memory to YARN by updating the following
property in yarn-site.xml:


<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>

https://stackoverflow.com/questions/45687607/waiting-for-am-container-to-be-allocated-launched-and-register-with-rm
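
As a rough back-of-the-envelope check of why these two values matter
(assuming a single NodeManager and the default ApplicationMaster container
size of about 1 GB): with yarn.nodemanager.resource.memory-mb = 8192 and
maximum-am-resource-percent = 0.1, only about 8192 * 0.1 = 819 MB of the
cluster may be used for application masters, so a ~1 GB AM container can
never be allocated and the application sits in ACCEPTED; with 0.5 the limit
becomes 8192 * 0.5 = 4096 MB, which is enough for the AM to start.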



On Wed, Mar 14, 2018 at 12:12 AM, kant kodali  wrote:

> any idea?
>
> On Wed, Mar 14, 2018 at 12:12 AM, kant kodali  wrote:
>
>> I set core-site.xml, hdfs-site.xml, yarn-site.xml  as per this website
>>  and these are
>> the only three files I changed Do I need to set or change anything in
>> mapred-site.xml (As of now I have not touched mapred-site.xml)?
>>
>> when I do yarn -node -list -all I can see both node manager and resource
>> managers are running fine.
>>
>> But when I run spark-shell --master yarn --deploy-mode client
>>
>>
>> it just keeps looping forever and never stops with the following messages
>>
>> 18/03/14 07:07:47 INFO Client: Application report for
>> application_1521011212550_0001 (state: ACCEPTED)
>> 18/03/14 07:07:48 INFO Client: Application report for
>> application_1521011212550_0001 (state: ACCEPTED)
>> 18/03/14 07:07:49 INFO Client: Application report for
>> application_1521011212550_0001 (state: ACCEPTED)
>> 18/03/14 07:07:50 INFO Client: Application report for
>> application_1521011212550_0001 (state: ACCEPTED)
>> 18/03/14 07:07:51 INFO Client: Application report for
>> application_1521011212550_0001 (state: ACCEPTED)
>> 18/03/14 07:07:52 INFO Client: Application report for
>> application_1521011212550_0001 (state: ACCEPTED)
>>
>> when I go to RM UI I see this
>>
>> ACCEPTED: waiting for AM container to be allocated, launched and register
>> with RM.
>>
>>
>>
>>
>> On Mon, Mar 12, 2018 at 7:16 PM, vermanurag > > wrote:
>>
>>> This does not look like Spark error. Looks like yarn has not been able to
>>> allocate resources for spark driver. If you check resource manager UI you
>>> are likely to see this as spark application waiting for resources. Try
>>> reducing the driver node memory and/ or other bottlenecks based on what
>>> you
>>> see in the resource manager UI.
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>


Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
any idea?

On Wed, Mar 14, 2018 at 12:12 AM, kant kodali  wrote:

> I set core-site.xml, hdfs-site.xml, yarn-site.xml  as per this website
>  and these are the
> only three files I changed Do I need to set or change anything in
> mapred-site.xml (As of now I have not touched mapred-site.xml)?
>
> when I do yarn -node -list -all I can see both node manager and resource
> managers are running fine.
>
> But when I run spark-shell --master yarn --deploy-mode client
>
>
> it just keeps looping forever and never stops with the following messages
>
> 18/03/14 07:07:47 INFO Client: Application report for
> application_1521011212550_0001 (state: ACCEPTED)
> 18/03/14 07:07:48 INFO Client: Application report for
> application_1521011212550_0001 (state: ACCEPTED)
> 18/03/14 07:07:49 INFO Client: Application report for
> application_1521011212550_0001 (state: ACCEPTED)
> 18/03/14 07:07:50 INFO Client: Application report for
> application_1521011212550_0001 (state: ACCEPTED)
> 18/03/14 07:07:51 INFO Client: Application report for
> application_1521011212550_0001 (state: ACCEPTED)
> 18/03/14 07:07:52 INFO Client: Application report for
> application_1521011212550_0001 (state: ACCEPTED)
>
> when I go to RM UI I see this
>
> ACCEPTED: waiting for AM container to be allocated, launched and register
> with RM.
>
>
>
>
> On Mon, Mar 12, 2018 at 7:16 PM, vermanurag 
> wrote:
>
>> This does not look like Spark error. Looks like yarn has not been able to
>> allocate resources for spark driver. If you check resource manager UI you
>> are likely to see this as spark application waiting for resources. Try
>> reducing the driver node memory and/ or other bottlenecks based on what
>> you
>> see in the resource manager UI.
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: How to run spark shell using YARN

2018-03-14 Thread kant kodali
I set core-site.xml, hdfs-site.xml, and yarn-site.xml as per this website,
and these are the only three files I changed. Do I need to set or change
anything in mapred-site.xml? (As of now I have not touched mapred-site.xml.)

When I run yarn node -list -all I can see that both the node manager and the
resource manager are running fine.

But when I run spark-shell --master yarn --deploy-mode client, it just keeps
looping forever and never stops, printing the following messages:

18/03/14 07:07:47 INFO Client: Application report for
application_1521011212550_0001 (state: ACCEPTED)
18/03/14 07:07:48 INFO Client: Application report for
application_1521011212550_0001 (state: ACCEPTED)
18/03/14 07:07:49 INFO Client: Application report for
application_1521011212550_0001 (state: ACCEPTED)
18/03/14 07:07:50 INFO Client: Application report for
application_1521011212550_0001 (state: ACCEPTED)
18/03/14 07:07:51 INFO Client: Application report for
application_1521011212550_0001 (state: ACCEPTED)
18/03/14 07:07:52 INFO Client: Application report for
application_1521011212550_0001 (state: ACCEPTED)

When I go to the RM UI I see this:

ACCEPTED: waiting for AM container to be allocated, launched and register
with RM.




On Mon, Mar 12, 2018 at 7:16 PM, vermanurag 
wrote:

> This does not look like Spark error. Looks like yarn has not been able to
> allocate resources for spark driver. If you check resource manager UI you
> are likely to see this as spark application waiting for resources. Try
> reducing the driver node memory and/ or other bottlenecks based on what you
> see in the resource manager UI.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>