Re: Run Multiple Spark jobs. Reduce Execution time.

2018-02-14 Thread akshay naidu
A small hint would be very helpful.

On Wed, Feb 14, 2018 at 5:17 PM, akshay naidu 
wrote:

> Hello Siva,
> Thanks for your reply.
>
> Actually, I'm trying to generate online reports for my clients. For this I
> want the jobs to be executed faster, without putting any job in the QUEUE,
> irrespective of the number of jobs different clients are executing from
> different locations.
> Currently, a job processing 17GB of data takes more than 20 minutes to
> execute. Also, only 6 jobs run simultaneously and the remaining ones are in
> the WAITING stage.
>
> Thanks
>
> On Wed, Feb 14, 2018 at 4:32 PM, Siva Gudavalli 
> wrote:
>
>>
>> Hello Akshay,
>>
>> I see there are 6 slaves * 1 Spark instance each * 5 cores per instance
>> => 30 cores in total.
>> Do you have any other pools configured? Running 8 jobs in parallel should
>> be possible with the number of cores you have.
>>
>> For your long-running job, did you have a chance to look at the tasks that
>> are being triggered?
>>
>> I would recommend configuring the slow-running job in a separate pool.
>>
>> Regards
>> Shiv
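
To illustrate the recommendation above, a minimal sketch, assuming the report
jobs share a single SparkContext: Spark's FAIR scheduler pools (configured via
spark.scheduler.allocation.file) can keep a slow job from starving the others.
The pool names, weights, and the sample job below are illustrative assumptions,
not taken from this thread. (If the 8 jobs are separate applications, they are
arbitrated by YARN's scheduler queues instead; the pool file only affects jobs
inside each application.)

// fairscheduler.xml (illustrative): two pools so slow jobs cannot starve fast ones
//   <allocations>
//     <pool name="reports"> <schedulingMode>FAIR</schedulingMode> <weight>2</weight> <minShare>4</minShare> </pool>
//     <pool name="batch">   <schedulingMode>FAIR</schedulingMode> <weight>1</weight> <minShare>1</minShare> </pool>
//   </allocations>

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("report-job").getOrCreate()

// Route this job's stages to the "reports" pool; a slow job would set "batch" instead.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "reports")

// ... run the report query in this pool ...
val events = spark.read.parquet("/data/events")           // assumption: example input path
events.groupBy("client_id").count()
  .write.mode("overwrite").parquet("/data/report-out")    // assumption: example output path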
>>
>> On Feb 14, 2018, at 5:44 AM, akshay naidu 
>> wrote:
>>
>> 
>> **
>> yarn-site.xml
>>
>> <property>
>>   <name>yarn.scheduler.fair.preemption.cluster-utilization-threshold</name>
>>   <value>0.8</value>
>> </property>
>>
>> <property>
>>   <name>yarn.scheduler.minimum-allocation-mb</name>
>>   <value>3584</value>
>> </property>
>>
>> <property>
>>   <name>yarn.scheduler.maximum-allocation-mb</name>
>>   <value>10752</value>
>> </property>
>>
>> <property>
>>   <name>yarn.nodemanager.resource.memory-mb</name>
>>   <value>10752</value>
>> </property>
>> **
>> spark-defaults.conf
>>
>> spark.master                        yarn
>> spark.driver.memory                 9g
>> spark.executor.memory               1024m
>> spark.yarn.executor.memoryOverhead  1024m
>> spark.eventLog.enabled              true
>> spark.eventLog.dir                  hdfs://tech-master:54310/spark-logs
>>
>> spark.history.provider              org.apache.spark.deploy.history.FsHistoryProvider
>> spark.history.fs.logDirectory       hdfs://tech-master:54310/spark-logs
>> spark.history.fs.update.interval    10s
>> spark.history.ui.port               18080
>>
>> spark.ui.enabled                    true
>> spark.ui.port                       4040
>> spark.ui.killEnabled                true
>> spark.ui.retainedDeadExecutors      100
>>
>> spark.scheduler.mode                FAIR
>> spark.scheduler.allocation.file     /usr/local/spark/current/conf/fairscheduler.xml
>>
>> #spark.submit.deployMode            cluster
>> spark.default.parallelism           30
>>
>> SPARK_WORKER_MEMORY 10g
>> SPARK_WORKER_INSTANCES 1
>> SPARK_WORKER_CORES 5
>>
>> SPARK_DRIVER_MEMORY 9g
>> SPARK_DRIVER_CORES 5
>>
>> SPARK_MASTER_IP Tech-master
>> SPARK_MASTER_PORT 7077
>>
>> On Tue, Feb 13, 2018 at 4:43 PM, akshay naidu 
>> wrote:
>>
>>> Hello,
>>> I'm trying to run multiple Spark jobs on a cluster running on YARN.
>>> The master is a 24GB server with 6 slaves of 12GB each.
>>>
>>> fairscheduler.xml settings are -
>>>
>>> <schedulingMode>FAIR</schedulingMode>
>>> <weight>10</weight>
>>> <minShare>2</minShare>
>>>
>>>
>>> I am running 8 jobs simultaneously; the jobs run in parallel, but not
>>> all of them.
>>> At a time only 7 of them run simultaneously, while the 8th one sits in
>>> the queue, WAITING for a job to stop.
>>>
>>> Also, out of the 7 running jobs, 4 run comparatively much faster than the
>>> remaining three (maybe resources are not distributed properly).
>>>
>>> I want to run n jobs at a time and make them run faster.
>>> Right now, one job takes more than three minutes while processing at most
>>> 1GB of data.
>>>
>>> Kindly assist me. What am I missing?
>>>
>>> Thanks.
>>>
>>
>>
>>
>


stdout: org.apache.spark.sql.AnalysisException: nondeterministic expressions are only allowed in

2018-02-14 Thread kant kodali
Hi All,

I get an AnalysisException when I run the following query

spark.sql("select current_timestamp() as tsp, count(*) from table group by window(tsp, '5 minutes')")

I just want to create a processing-time column and run a simple stateful
query like the one above. I understand current_timestamp is non-deterministic;
if so, how can I add a processing-time column and use group by to do stateful
aggregation?
Thanks!


Re: [Structured Streaming] Avoiding multiple streaming queries

2018-02-14 Thread Tathagata Das
Of course, you can write to multiple Kafka topics from a single query. If
the dataframe you want to write has a column named "topic" (along with
"key" and "value" columns), each row is written to the topic named in that
row's "topic" column. This works automatically, so the only thing you need
to figure out is how to generate the value of that column.

This is documented -
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka
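
For reference, a minimal sketch of what is described above; the broker
address, topic names, and the routing rule are illustrative assumptions, not
from this thread:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("multi-topic-sink").getOrCreate()
import spark.implicits._

val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // assumption: broker address
  .option("subscribe", "source-topic")                 // assumption: source topic
  .load()

// Build "topic", "key" and "value" columns; the Kafka sink routes each row
// to the topic named in its "topic" column when no "topic" option is set.
val routed = input
  .selectExpr("CAST(value AS STRING) AS value")
  .withColumn("topic",
    when($"value".contains("typeA"), lit("topic-a")).otherwise(lit("topic-b")))
  .withColumn("key", lit(null).cast("string"))

val query = routed.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("checkpointLocation", "/tmp/checkpoints/multi-topic")  // assumption
  .start()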

Or am I misunderstanding the problem?

TD




On Tue, Feb 13, 2018 at 10:45 AM, Yogesh Mahajan 
wrote:

> I had a similar issue, and I think that's where the structured streaming
> design falls short.
> Seems like Question #2 in your email is a viable workaround for you.
>
> In my case, I have a custom Sink backed by an efficient in-memory column
> store suited for fast ingestion.
>
> I have a Kafka stream coming from one topic, and I need to classify the
> stream based on schema.
> For example, a Kafka topic can have three different types of schema
> messages, and I would like to ingest them into three different column
> tables (having different schemas) using my custom Sink implementation.
>
> Right now the only(?) option I have is to create three streaming queries
> reading the same topic and ingesting into the respective column tables
> using their Sink implementations.
> These three streaming queries create three underlying IncrementalExecutions
> and three KafkaSources, i.e. three queries reading the same data from the
> same Kafka topic.
> Even with CachedKafkaConsumers at the partition level, this is not an
> efficient way to handle a simple streaming use case.
>
> One workaround to overcome this limitation is to have the same schema for
> all the messages in a Kafka partition; unfortunately this is not in our
> control, and customers cannot change it due to their dependencies on other
> subsystems.
>
> Thanks,
> http://www.snappydata.io/blog 
>
> On Mon, Feb 12, 2018 at 5:54 PM, Priyank Shrivastava <
> priy...@asperasoft.com> wrote:
>
>> I have a structured streaming query which sinks to Kafka.  This query has
>> complex aggregation logic.
>>
>>
>> I would like to sink the output DF of this query to multiple Kafka topics
>> each partitioned on a different ‘key’ column.  I don’t want to have
>> multiple Kafka sinks for each of the different Kafka topics because that
>> would mean running multiple streaming queries - one for each Kafka topic,
>> especially since my aggregation logic is complex.
>>
>>
>> Questions:
>>
>> 1.  Is there a way to output the results of a structured streaming query
>> to multiple Kafka topics each with a different key column but without
>> having to execute multiple streaming queries?
>>
>>
>> 2.  If not, would it be efficient to cascade multiple queries, such that
>> the first query does the complex aggregation and writes its output to
>> Kafka, and the other queries just read the output of the first query and
>> write to their topics, thus avoiding doing the complex aggregation again?
>>
>>
>> Thanks in advance for any help.
>>
>>
>> Priyank
>>
>>
>>
>


Re: Spark structured streaming: periodically refresh static data frame

2018-02-14 Thread Tathagata Das
1. Just loop like this (a fuller sketch follows after point 2):


def startQuery(): StreamingQuery = {
   // Define the dataframes and start the query
}

// call this on the main thread
while (notShutdown) {
   val query = startQuery()
   query.awaitTermination(refreshIntervalMs)
   query.stop()
   // refresh static data
}


2. Yes, stream-stream joins in 2.3.0, soon to be released. RC3 is available
if you want to test it right now -
https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc3-bin/.
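
Fleshing out point 1, a minimal sketch of the restart loop; the input source,
paths, join key, and refresh interval are illustrative assumptions, not from
this thread:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.StreamingQuery

val spark = SparkSession.builder.appName("periodic-refresh").getOrCreate()
val refreshIntervalMs = 15 * 60 * 1000L            // assumption: refresh every 15 minutes

def startQuery(): StreamingQuery = {
  // Re-read the static data so the join sees the latest snapshot.
  val staticDf: DataFrame = spark.read.parquet("/data/static-lookup")    // assumption

  val streamDf = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")                   // assumption
    .option("subscribe", "events")                                       // assumption
    .load()
    .selectExpr("CAST(value AS STRING) AS id")                           // assumption: value holds the join key

  streamDf.join(staticDf, Seq("id"))                                     // assumption: join on "id"
    .writeStream
    .format("parquet")
    .option("path", "/data/out")                                         // assumption
    .option("checkpointLocation", "/data/checkpoints/refresh-join")      // same checkpoint across restarts
    .start()
}

var notShutdown = true   // flip to false (e.g. from a shutdown hook) to exit
while (notShutdown) {
  val query = startQuery()
  query.awaitTermination(refreshIntervalMs)   // returns after the timeout if still running
  query.stop()
}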



On Wed, Feb 14, 2018 at 3:34 AM, Appu K  wrote:

> TD,
>
> Thanks a lot for the quick reply :)
>
>
> Did I understand it right that in the main thread, to wait for the
> termination of the context I'll not be able to use
>  outStream.awaitTermination()  -  [ since i'll be closing in inside another
> thread ]
>
> What would be a good approach to keep the main app long running if I’ve to
> restart queries?
>
> Should i just wait for 2.3 where i'll be able to join two structured
> streams ( if the release is just a few weeks away )
>
> Appreciate all the help!
>
> thanks
> App
>
>
>
> On 14 February 2018 at 4:41:52 PM, Tathagata Das (
> tathagata.das1...@gmail.com) wrote:
>
> Let me fix my mistake :)
> What I suggested in that earlier thread does not work. The streaming query
> that joins a streaming dataset with a batch view, does not correctly pick
> up when the view is updated. It works only when you restart the query. That
> is,
> - stop the query
> - recreate the dataframes,
> - start the query on the new dataframe using the same checkpoint location
> as the previous query
>
> Note that you dont need to restart the whole process/cluster/application,
> just restart the query in the same process/cluster/application. This should
> be very fast (within a few seconds). So, unless you have latency SLAs of 1
> second, you can periodically restart the query without restarting the
> process.
>
> Apologies for my misdirections in that earlier thread. Hope this helps.
>
> TD
>
> On Wed, Feb 14, 2018 at 2:57 AM, Appu K  wrote:
>
>> More specifically,
>>
>> Quoting TD from the previous thread
>> "Any streaming query that joins a streaming dataframe with the view will
>> automatically start using the most updated data as soon as the view is
>> updated”
>>
>> Wondering if I’m doing something wrong in  https://gist.github.com/anony
>> mous/90dac8efadca3a69571e619943ddb2f6
>>
>> My streaming dataframe is not using the updated data, even though the
>> view is updated!
>>
>> Thank you
>>
>>
>> On 14 February 2018 at 2:54:48 PM, Appu K (kut...@gmail.com) wrote:
>>
>> Hi,
>>
>> I had followed the instructions from the thread https://mail-archives.a
>> pache.org/mod_mbox/spark-user/201704.mbox/%3CD1315D33-41CD-
>> 4ba3-8b77-0879f3669...@qvantel.com%3E while trying to reload a static
>> data frame periodically that gets joined to a structured streaming query.
>>
>> However, the streaming query results does not reflect the data from the
>> refreshed static data frame.
>>
>> Code is here https://gist.github.com/anonymous/90dac8efadca3a69571e6
>> 19943ddb2f6
>>
>> I’m using spark 2.2.1 . Any pointers would be highly helpful
>>
>> Thanks a lot
>>
>> Appu
>>
>>
>


Re: Why python cluster mode is not supported in standalone cluster?

2018-02-14 Thread Ashwin Sai Shankar
+dev mailing list (since I didn't get a response from the user DL)

On Tue, Feb 13, 2018 at 12:20 PM, Ashwin Sai Shankar 
wrote:

> Hi Spark users!
> I noticed that spark doesn't allow python apps to run in cluster mode in
> spark standalone cluster. Does anyone know the reason? I checked jira but
> couldn't find anything relevant.
>
> Thanks,
> Ashwin
>


[Spark-Core]port opened by the SparkDriver is vulnerable to flooding attacks

2018-02-14 Thread sandeep-katta
SparkSubmit opens a port to communicate with the Application Master and
executors.

This port does not close IDLE connections, so it is vulnerable to DoS
attacks. I tested with telnet against the IP and port, and the connection is
never closed.

To fix this I tried handling it in *userEventTriggered* of
*TransportChannelHandler.java*; to my surprise, the Application Master
connection is also IDLE when no job is submitted, so closing idle connections
would terminate the Application Master connection as well.

Has anyone come across this type of problem, and is there any other way to
fix this issue?



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SparkR test script issue: unable to run run-tests.h on spark 2.2

2018-02-14 Thread chandan prakash
Thanks a lot Hyukjin & Felix.
It was helpful.
Going to an older version worked.

Regards,
Chandan

On Wed, Feb 14, 2018 at 3:28 PM, Felix Cheung 
wrote:

> Yes, it is an issue with the newer release of testthat.
>
> As a workaround, could you install an earlier version with devtools? Will
> follow up with a fix.
>
> _
> From: Hyukjin Kwon 
> Sent: Wednesday, February 14, 2018 6:49 PM
> Subject: Re: SparkR test script issue: unable to run run-tests.h on spark
> 2.2
> To: chandan prakash 
> Cc: user @spark 
>
>
>
> From a very quick look, I think it is a testthat version issue with SparkR.
>
> I had to fix that version to 1.x before in AppVeyor. There are few details
> in https://github.com/apache/spark/pull/20003
>
> Can you check and lower testthat version?
>
>
> On 14 Feb 2018 6:09 pm, "chandan prakash" 
> wrote:
>
>> Hi All,
>> I am trying to run test script of R under ./R/run-tests.sh but hitting
>> same ERROR everytime.
>> I tried running on mac as well as centos machine, same issue coming up.
>> I am using spark 2.2 (branch-2.2)
>> I followed from apache doc and followed the steps:
>> 1. installed R
>> 2. installed packages like testthat as mentioned in doc
>> 3. run run-tests.h
>>
>>
>> Every time I am getting this error line:
>>
>> Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
>>   object 'run_tests' not found
>> Calls: ::: -> get
>> Execution halted
>>
>>
>> Any Help?
>>
>> --
>> Chandan Prakash
>>
>>
>
>


-- 
Chandan Prakash


Re: Run Multiple Spark jobs. Reduce Execution time.

2018-02-14 Thread akshay naidu
Hello Siva,
Thanks for your reply.

Actually, I'm trying to generate online reports for my clients. For this I
want the jobs to be executed faster, without putting any job in the QUEUE,
irrespective of the number of jobs different clients are executing from
different locations.
Currently, a job processing 17GB of data takes more than 20 minutes to
execute. Also, only 6 jobs run simultaneously and the remaining ones are in
the WAITING stage.

Thanks

On Wed, Feb 14, 2018 at 4:32 PM, Siva Gudavalli 
wrote:

>
> Hello Akshay,
>
> I see there are 6 slaves * 1 Spark instance each * 5 cores per instance
> => 30 cores in total.
> Do you have any other pools configured? Running 8 jobs in parallel should
> be possible with the number of cores you have.
>
> For your long-running job, did you have a chance to look at the tasks that
> are being triggered?
>
> I would recommend configuring the slow-running job in a separate pool.
>
> Regards
> Shiv
>
> On Feb 14, 2018, at 5:44 AM, akshay naidu  wrote:
>
> 
> **
> yarn-site.xml
>
> <property>
>   <name>yarn.scheduler.fair.preemption.cluster-utilization-threshold</name>
>   <value>0.8</value>
> </property>
>
> <property>
>   <name>yarn.scheduler.minimum-allocation-mb</name>
>   <value>3584</value>
> </property>
>
> <property>
>   <name>yarn.scheduler.maximum-allocation-mb</name>
>   <value>10752</value>
> </property>
>
> <property>
>   <name>yarn.nodemanager.resource.memory-mb</name>
>   <value>10752</value>
> </property>
> **
> spark-defaults.conf
>
> spark.master                        yarn
> spark.driver.memory                 9g
> spark.executor.memory               1024m
> spark.yarn.executor.memoryOverhead  1024m
> spark.eventLog.enabled              true
> spark.eventLog.dir                  hdfs://tech-master:54310/spark-logs
>
> spark.history.provider              org.apache.spark.deploy.history.FsHistoryProvider
> spark.history.fs.logDirectory       hdfs://tech-master:54310/spark-logs
> spark.history.fs.update.interval    10s
> spark.history.ui.port               18080
>
> spark.ui.enabled                    true
> spark.ui.port                       4040
> spark.ui.killEnabled                true
> spark.ui.retainedDeadExecutors      100
>
> spark.scheduler.mode                FAIR
> spark.scheduler.allocation.file     /usr/local/spark/current/conf/fairscheduler.xml
>
> #spark.submit.deployMode            cluster
> spark.default.parallelism           30
>
> SPARK_WORKER_MEMORY 10g
> SPARK_WORKER_INSTANCES 1
> SPARK_WORKER_CORES 5
>
> SPARK_DRIVER_MEMORY 9g
> SPARK_DRIVER_CORES 5
>
> SPARK_MASTER_IP Tech-master
> SPARK_MASTER_PORT 7077
>
> On Tue, Feb 13, 2018 at 4:43 PM, akshay naidu 
> wrote:
>
>> Hello,
>> I'm trying to run multiple Spark jobs on a cluster running on YARN.
>> The master is a 24GB server with 6 slaves of 12GB each.
>>
>> fairscheduler.xml settings are -
>>
>> <schedulingMode>FAIR</schedulingMode>
>> <weight>10</weight>
>> <minShare>2</minShare>
>>
>> I am running 8 jobs simultaneously; the jobs run in parallel, but not
>> all of them.
>> At a time only 7 of them run simultaneously, while the 8th one sits in
>> the queue, WAITING for a job to stop.
>>
>> Also, out of the 7 running jobs, 4 run comparatively much faster than the
>> remaining three (maybe resources are not distributed properly).
>>
>> I want to run n jobs at a time and make them run faster.
>> Right now, one job takes more than three minutes while processing at most
>> 1GB of data.
>>
>> Kindly assist me. What am I missing?
>>
>> Thanks.
>>
>
>
>


Re: Spark structured streaming: periodically refresh static data frame

2018-02-14 Thread Appu K
TD,

Thanks a lot for the quick reply :)


Did I understand it right that, in the main thread, to wait for the
termination of the context I'll not be able to use
outStream.awaitTermination() [ since I'll be closing it inside another
thread ]?

What would be a good approach to keep the main app long-running if I have to
restart queries?

Should I just wait for 2.3, where I'll be able to join two structured
streams (if the release is just a few weeks away)?

Appreciate all the help!

thanks
App



On 14 February 2018 at 4:41:52 PM, Tathagata Das (
tathagata.das1...@gmail.com) wrote:

Let me fix my mistake :)
What I suggested in that earlier thread does not work. The streaming query
that joins a streaming dataset with a batch view, does not correctly pick
up when the view is updated. It works only when you restart the query. That
is,
- stop the query
- recreate the dataframes,
- start the query on the new dataframe using the same checkpoint location
as the previous query

Note that you dont need to restart the whole process/cluster/application,
just restart the query in the same process/cluster/application. This should
be very fast (within a few seconds). So, unless you have latency SLAs of 1
second, you can periodically restart the query without restarting the
process.

Apologies for my misdirections in that earlier thread. Hope this helps.

TD

On Wed, Feb 14, 2018 at 2:57 AM, Appu K  wrote:

> More specifically,
>
> Quoting TD from the previous thread
> "Any streaming query that joins a streaming dataframe with the view will
> automatically start using the most updated data as soon as the view is
> updated”
>
> Wondering if I’m doing something wrong in  https://gist.github.com/
> anonymous/90dac8efadca3a69571e619943ddb2f6
>
> My streaming dataframe is not using the updated data, even though the view
> is updated!
>
> Thank you
>
>
> On 14 February 2018 at 2:54:48 PM, Appu K (kut...@gmail.com) wrote:
>
> Hi,
>
> I had followed the instructions from the thread https://mail-archives.
> apache.org/mod_mbox/spark-user/201704.mbox/%3CD1315D33-
> 41cd-4ba3-8b77-0879f3669...@qvantel.com%3E while trying to reload a
> static data frame periodically that gets joined to a structured streaming
> query.
>
> However, the streaming query results does not reflect the data from the
> refreshed static data frame.
>
> Code is here https://gist.github.com/anonymous/
> 90dac8efadca3a69571e619943ddb2f6
>
> I’m using spark 2.2.1 . Any pointers would be highly helpful
>
> Thanks a lot
>
> Appu
>
>


Re: Spark structured streaming: periodically refresh static data frame

2018-02-14 Thread Tathagata Das
Let me fix my mistake :)
What I suggested in that earlier thread does not work. The streaming query
that joins a streaming dataset with a batch view does not correctly pick up
the updated view. It works only when you restart the query. That is,
- stop the query
- recreate the dataframes,
- start the query on the new dataframe using the same checkpoint location
as the previous query

Note that you don't need to restart the whole process/cluster/application,
just restart the query in the same process/cluster/application. This should
be very fast (within a few seconds). So, unless you have latency SLAs of 1
second, you can periodically restart the query without restarting the
process.

Apologies for my misdirections in that earlier thread. Hope this helps.

TD

On Wed, Feb 14, 2018 at 2:57 AM, Appu K  wrote:

> More specifically,
>
> Quoting TD from the previous thread
> "Any streaming query that joins a streaming dataframe with the view will
> automatically start using the most updated data as soon as the view is
> updated”
>
> Wondering if I’m doing something wrong in  https://gist.github.com/
> anonymous/90dac8efadca3a69571e619943ddb2f6
>
> My streaming dataframe is not using the updated data, even though the view
> is updated!
>
> Thank you
>
>
> On 14 February 2018 at 2:54:48 PM, Appu K (kut...@gmail.com) wrote:
>
> Hi,
>
> I had followed the instructions from the thread https://mail-archives.
> apache.org/mod_mbox/spark-user/201704.mbox/%3CD1315D33-
> 41cd-4ba3-8b77-0879f3669...@qvantel.com%3E while trying to reload a
> static data frame periodically that gets joined to a structured streaming
> query.
>
> However, the streaming query results does not reflect the data from the
> refreshed static data frame.
>
> Code is here https://gist.github.com/anonymous/
> 90dac8efadca3a69571e619943ddb2f6
>
> I’m using spark 2.2.1 . Any pointers would be highly helpful
>
> Thanks a lot
>
> Appu
>
>


Re: Spark structured streaming: periodically refresh static data frame

2018-02-14 Thread Appu K
More specifically,

Quoting TD from the previous thread
"Any streaming query that joins a streaming dataframe with the view will
automatically start using the most updated data as soon as the view is
updated”

Wondering if I’m doing something wrong in
https://gist.github.com/anonymous/90dac8efadca3a69571e619943ddb2f6

My streaming dataframe is not using the updated data, even though the view
is updated!

Thank you


On 14 February 2018 at 2:54:48 PM, Appu K (kut...@gmail.com) wrote:

Hi,

I had followed the instructions from the thread
https://mail-archives.apache.org/mod_mbox/spark-user/201704.mbox/%3cd1315d33-41cd-4ba3-8b77-0879f3669...@qvantel.com%3E
while
trying to reload a static data frame periodically that gets joined to a
structured streaming query.

However, the streaming query results does not reflect the data from the
refreshed static data frame.

Code is here
https://gist.github.com/anonymous/90dac8efadca3a69571e619943ddb2f6

I’m using spark 2.2.1 . Any pointers would be highly helpful

Thanks a lot

Appu


Re: Run Multiple Spark jobs. Reduce Execution time.

2018-02-14 Thread akshay naidu
**
yarn-site.xml

<property>
  <name>yarn.scheduler.fair.preemption.cluster-utilization-threshold</name>
  <value>0.8</value>
</property>

<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>3584</value>
</property>

<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>10752</value>
</property>

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>10752</value>
</property>
**
spark-defaults.conf

spark.master                        yarn
spark.driver.memory                 9g
spark.executor.memory               1024m
spark.yarn.executor.memoryOverhead  1024m
spark.eventLog.enabled              true
spark.eventLog.dir                  hdfs://tech-master:54310/spark-logs

spark.history.provider              org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory       hdfs://tech-master:54310/spark-logs
spark.history.fs.update.interval    10s
spark.history.ui.port               18080

spark.ui.enabled                    true
spark.ui.port                       4040
spark.ui.killEnabled                true
spark.ui.retainedDeadExecutors      100

spark.scheduler.mode                FAIR
spark.scheduler.allocation.file     /usr/local/spark/current/conf/fairscheduler.xml

#spark.submit.deployMode            cluster
spark.default.parallelism           30

SPARK_WORKER_MEMORY 10g
SPARK_WORKER_INSTANCES 1
SPARK_WORKER_CORES 5

SPARK_DRIVER_MEMORY 9g
SPARK_DRIVER_CORES 5

SPARK_MASTER_IP Tech-master
SPARK_MASTER_PORT 7077
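
A back-of-the-envelope capacity check for the settings above; it assumes YARN
rounds every container request below the minimum allocation up to
yarn.scheduler.minimum-allocation-mb, and the executor count per job is an
assumption since it is not stated in this thread:

  per-node memory for containers  = 10752 MB   (yarn.nodemanager.resource.memory-mb)
  minimum container size          = 3584 MB    (yarn.scheduler.minimum-allocation-mb)
  containers per node             = 10752 / 3584 = 3
  containers in the cluster       = 3 * 6 slaves = 18

Each executor request of 1024m + 1024m overhead (2048 MB) is rounded up to
3584 MB, and every application also needs one ApplicationMaster container. If
each job uses the default of 2 executors, every application holds 3
containers, so only 18 / 3 = 6 applications can run at once, which matches the
6 simultaneous jobs reported earlier in the thread.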

On Tue, Feb 13, 2018 at 4:43 PM, akshay naidu 
wrote:

> Hello,
> I'm trying to run multiple Spark jobs on a cluster running on YARN.
> The master is a 24GB server with 6 slaves of 12GB each.
>
> fairscheduler.xml settings are -
>
> <schedulingMode>FAIR</schedulingMode>
> <weight>10</weight>
> <minShare>2</minShare>
>
> I am running 8 jobs simultaneously; the jobs run in parallel, but not
> all of them.
> At a time only 7 of them run simultaneously, while the 8th one sits in
> the queue, WAITING for a job to stop.
>
> Also, out of the 7 running jobs, 4 run comparatively much faster than the
> remaining three (maybe resources are not distributed properly).
>
> I want to run n jobs at a time and make them run faster.
> Right now, one job takes more than three minutes while processing at most
> 1GB of data.
>
> Kindly assist me. What am I missing?
>
> Thanks.
>


Re: Run Multiple Spark jobs. Reduce Execution time.

2018-02-14 Thread akshay naidu
On Tue, Feb 13, 2018 at 4:43 PM, akshay naidu 
wrote:

> Hello,
> I'm trying to run multiple Spark jobs on a cluster running on YARN.
> The master is a 24GB server with 6 slaves of 12GB each.
>
> fairscheduler.xml settings are -
>
> <schedulingMode>FAIR</schedulingMode>
> <weight>10</weight>
> <minShare>2</minShare>
>
> I am running 8 jobs simultaneously; the jobs run in parallel, but not
> all of them.
> At a time only 7 of them run simultaneously, while the 8th one sits in
> the queue, WAITING for a job to stop.
>
> Also, out of the 7 running jobs, 4 run comparatively much faster than the
> remaining three (maybe resources are not distributed properly).
>
> I want to run n jobs at a time and make them run faster.
> Right now, one job takes more than three minutes while processing at most
> 1GB of data.
>
> Kindly assist me. What am I missing?
>
> Thanks.
>


Re: SparkR test script issue: unable to run run-tests.h on spark 2.2

2018-02-14 Thread Felix Cheung
Yes, it is an issue with the newer release of testthat.

As a workaround, could you install an earlier version with devtools? Will
follow up with a fix.
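
(For reference, one way to do that with devtools is
devtools::install_version("testthat", version = "1.0.2"); the 1.0.2 version
number is an assumption based on the 1.x pin mentioned in the PR linked
below, and any 1.x release should work similarly.)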

_
From: Hyukjin Kwon 
Sent: Wednesday, February 14, 2018 6:49 PM
Subject: Re: SparkR test script issue: unable to run run-tests.h on spark 2.2
To: chandan prakash 
Cc: user @spark 


From a very quick look, I think it is a testthat version issue with SparkR.

I had to fix that version to 1.x before in AppVeyor. There are few details in 
https://github.com/apache/spark/pull/20003

Can you check and lower testthat version?


On 14 Feb 2018 6:09 pm, "chandan prakash" 
> wrote:
Hi All,
I am trying to run test script of R under ./R/run-tests.sh but hitting same 
ERROR everytime.
I tried running on mac as well as centos machine, same issue coming up.
I am using spark 2.2 (branch-2.2)
I followed from apache doc and followed the steps:
1. installed R
2. installed packages like testthat as mentioned in doc
3. run run-tests.h


Every time I am getting this error line:

Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
  object 'run_tests' not found
Calls: ::: -> get
Execution halted


Any Help?

--
Chandan Prakash





Re: SparkR test script issue: unable to run run-tests.h on spark 2.2

2018-02-14 Thread Hyukjin Kwon
From a very quick look, I think it is a testthat version issue with SparkR.

I had to fix that version to 1.x before in AppVeyor. There are few details
in https://github.com/apache/spark/pull/20003

Can you check and lower testthat version?


On 14 Feb 2018 6:09 pm, "chandan prakash"  wrote:

> Hi All,
> I am trying to run test script of R under ./R/run-tests.sh but hitting
> same ERROR everytime.
> I tried running on mac as well as centos machine, same issue coming up.
> I am using spark 2.2 (branch-2.2)
> I followed from apache doc and followed the steps:
> 1. installed R
> 2. installed packages like testthat as mentioned in doc
> 3. run run-tests.h
>
>
> Every time I am getting this error line:
>
> Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
>   object 'run_tests' not found
> Calls: ::: -> get
> Execution halted
>
>
> Any Help?
>
> --
> Chandan Prakash
>
>


Spark structured streaming: periodically refresh static data frame

2018-02-14 Thread Appu K
Hi,

I had followed the instructions from the thread
https://mail-archives.apache.org/mod_mbox/spark-user/201704.mbox/%3cd1315d33-41cd-4ba3-8b77-0879f3669...@qvantel.com%3E
while
trying to reload a static data frame periodically that gets joined to a
structured streaming query.

However, the streaming query results do not reflect the data from the
refreshed static data frame.

Code is here
https://gist.github.com/anonymous/90dac8efadca3a69571e619943ddb2f6

I’m using Spark 2.2.1. Any pointers would be highly helpful.

Thanks a lot

Appu


SparkR test script issue: unable to run run-tests.h on spark 2.2

2018-02-14 Thread chandan prakash
Hi All,
I am trying to run the R test script under ./R/run-tests.sh but hit the same
ERROR every time.
I tried running on a mac as well as a centos machine; the same issue comes up.
I am using spark 2.2 (branch-2.2).
I followed the steps from the apache doc:
1. installed R
2. installed packages like testthat as mentioned in the doc
3. ran run-tests.sh


Every time I am getting this error line:

Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
  object 'run_tests' not found
Calls: ::: -> get
Execution halted


Any Help?

-- 
Chandan Prakash