Exception in Shutdown-thread, bad file descriptor

2017-12-20 Thread Noorul Islam Kamal Malmiyoda
Hi all,

We are getting the following exception and this somehow blocks the parent
thread from proceeding further.

17/11/14 16:50:09 SPARK_APP WARN NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
17/11/14 16:50:17 SPARK_APP WARN SparkContext: Use an existing
SparkContext, some configuration may not take effect.
[Stage 4:> (13 + 1) / 200][Stage 5:> (1 + 0) / 8][Stage 8:> (0 + 0) / 8]
Exception in thread "Shutdown-checker" java.lang.RuntimeException: eventfd_write() failed: Bad file descriptor
    at io.netty.channel.epoll.Native.eventFdWrite(Native Method)
    at io.netty.channel.epoll.EpollEventLoop.wakeup(EpollEventLoop.java:106)
    at io.netty.util.concurrent.SingleThreadEventExecutor.shutdownGracefully(SingleThreadEventExecutor.java:538)
    at io.netty.util.concurrent.MultithreadEventExecutorGroup.shutdownGracefully(MultithreadEventExecutorGroup.java:146)
    at io.netty.util.concurrent.AbstractEventExecutorGroup.shutdownGracefully(AbstractEventExecutorGroup.java:69)
    at com.datastax.driver.core.NettyOptions.onClusterClose(NettyOptions.java:193)
    at com.datastax.driver.core.Connection$Factory.shutdown(Connection.java:902)
    at com.datastax.driver.core.Cluster$Manager$ClusterCloseFuture$1.run(Cluster.java:2539)

This is very hard to replicate, so I have not been able to come up with a
reproducible recipe. Has anyone faced a similar issue? Any help is
appreciated.

Spark Version: 2.0.1


Thanks and Regards
Noorul


Controlling number of spark partitions in dataframes

2017-10-26 Thread Noorul Islam Kamal Malmiyoda
Hi all,

I have the following spark configuration

spark.app.name=Test
spark.cassandra.connection.host=127.0.0.1
spark.cassandra.connection.keep_alive_ms=5000
spark.cassandra.connection.port=1
spark.cassandra.connection.timeout_ms=3
spark.cleaner.ttl=3600
spark.default.parallelism=4
spark.master=local[2]
spark.ui.enabled=false
spark.ui.showConsoleProgress=false

Because I am setting spark.default.parallelism to 4, I was expecting
only 4 Spark partitions, but that does not seem to be the case.

When I do the following

df.foreachPartition { partition =>
  val groupedPartition = partition.toList.grouped(3).toList
  println("Grouped partition " + groupedPartition)
}

There are many print statements showing an empty list at the top; only
the partitions that actually contain data appear at the bottom. Is
there a way to control the number of partitions?
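
For reference, a minimal sketch of the two standard knobs for this in
Spark 2.x (the numbers are illustrative; spark.default.parallelism mainly
affects RDD operations, while DataFrame shuffles default to
spark.sql.shuffle.partitions = 200, which would explain the many empty
partitions):

  // cap the number of partitions produced by DataFrame shuffles
  // (assumes `spark` is the SparkSession)
  spark.conf.set("spark.sql.shuffle.partitions", "4")

  // or repartition explicitly before the action
  val df4 = df.repartition(4)   // exactly 4 partitions, incurs a shuffle
  val dfC = df.coalesce(4)      // at most 4 partitions, avoids a full shuffle

  df4.foreachPartition { partition =>
    println("Grouped partition " + partition.toList.grouped(3).toList)
  }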

Regards,
Noorul

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



What is the equivalent of foreachRDD in DataFrames?

2017-10-26 Thread Noorul Islam Kamal Malmiyoda
Hi all,

I have a DataFrame with 1000 records. I want to split them into batches
of 100 each and post each batch to a REST API.

If this were a DStream of RDDs, I could use something like this:

myDStream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // post the records in this partition to the REST API in batches
  }
}

This ensures that the code runs on the executors and not on the driver.

Is there a similar approach that we can take for DataFrames? I see
examples on Stack Overflow that use collect(), which will bring the
whole data set to the driver.
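
A minimal sketch of the DataFrame-side equivalent, assuming Spark 2.x
Dataset.foreachPartition (postBatch is a placeholder for the actual REST
call, not a real API):

  import org.apache.spark.sql.Row

  // placeholder for the real REST client; here it only logs the batch size
  def postBatch(batch: Seq[Row]): Unit = println(s"posting ${batch.size} records")

  // the closure runs on the executors; nothing is collected back to the driver
  df.foreachPartition { rows: Iterator[Row] =>
    rows.grouped(100).foreach(postBatch)   // batches of 100 records each
  }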

Thanks and Regards
Noorul

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-29 Thread Noorul Islam Kamal Malmiyoda
I think a better place would be an in-memory cache for real-time use.
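
For example, with Structured Streaming the latest aggregates can be kept
in an in-memory table that a dashboard backend polls; a minimal sketch,
assuming a socket source for illustration (a Redis or similar external
cache would be the more robust production choice):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("DashboardCache").master("local[2]").getOrCreate()
  import spark.implicits._

  // illustrative source: lines read from a socket; any streaming source works
  val events = spark.readStream
    .format("socket")
    .option("host", "localhost")
    .option("port", "9999")
    .load()

  // keep the latest counts in an in-memory table the dashboard layer can poll
  val query = events
    .groupBy($"value".as("metric"))
    .count()
    .writeStream
    .outputMode("complete")
    .format("memory")               // results live in driver memory, queryable via SQL
    .queryName("dashboard_counts")
    .start()

  // the dashboard backend then polls, e.g.:
  spark.sql("SELECT * FROM dashboard_counts").show()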

Regards,
Noorul

On Thu, Mar 30, 2017 at 10:31 AM, Gaurav1809  wrote:
> I am getting streaming data and want to show it on dashboards in real
> time.
> How best can we handle this streaming data? Where should it be stored (DB,
> HDFS, or something else)?
> I want to give users a real-time analytics experience.
>
> Please suggest possible ways. Thanks.
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/How-best-we-can-store-streaming-data-on-dashboards-for-real-time-user-experience-tp28548.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Application kill from UI does not propagate exception

2017-03-24 Thread Noorul Islam Kamal Malmiyoda
Hi all,

I am trying to trap the UI kill event of a Spark application from the
driver. Somehow the exception thrown is not propagated to the driver's
main program. See the spark-shell example below.

Is there a way to get hold of this event and shut down the driver program?
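
A minimal sketch of one idea, assuming a SparkListener registered on the
driver (whether onApplicationEnd actually fires for this kill path is
exactly the open question):

  import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

  // register on the driver's SparkContext; exit the JVM when the application ends
  sc.addSparkListener(new SparkListener {
    override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
      System.err.println(s"Application ended at ${end.time}, shutting down driver")
      sys.exit(1)
    }
  })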

Regards,
Noorul


spark@spark1:~/spark-2.1.0/sbin$ spark-shell --master spark://10.29.83.162:7077
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
17/03/23 15:16:47 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where
applicable
17/03/23 15:16:53 WARN ObjectStore: Failed to get database
global_temp, returning NoSuchObjectException
Spark context Web UI available at http://10.29.83.162:4040
Spark context available as 'sc' (master = spark://10.29.83.162:7077,
app id = app-20170323151648-0002).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 17/03/23 15:17:28 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: Master removed our application: KILLED
17/03/23 15:17:28 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Exiting due to error from cluster scheduler: Master removed our application: KILLED
    at org.apache.spark.scheduler.TaskSchedulerImpl.error(TaskSchedulerImpl.scala:459)
    at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.dead(StandaloneSchedulerBackend.scala:139)
    at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint.markDead(StandaloneAppClient.scala:254)
    at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$receive$1.applyOrElse(StandaloneAppClient.scala:168)
    at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
    at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)


scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@25b8f9d2

scala>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How do I deal with ever growing application log

2017-03-05 Thread Noorul Islam Kamal Malmiyoda
Or you could use sinks like elasticsearch.
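
For the rotation approach Dev describes below, a minimal log4j sketch
(assuming the log4j 1.x that ships with Spark; the path and sizes are
illustrative, and this covers driver/executor logs rather than the event
log under /var/log/spark/apps, which is governed by the spark.eventLog.*
settings):

  log4j.rootLogger=INFO, rolling
  log4j.appender.rolling=org.apache.log4j.RollingFileAppender
  log4j.appender.rolling.File=/var/log/spark/app.log
  log4j.appender.rolling.MaxFileSize=128MB
  log4j.appender.rolling.MaxBackupIndex=10
  log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
  log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n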

Regards,
Noorul

On Mon, Mar 6, 2017 at 10:52 AM, devjyoti patra  wrote:
> Timothy, why are you writing application logs to HDFS? In case you want to
> analyze these logs later, you can write to local storage on your slave nodes
> and later rotate those files to a suitable location. If they are only going
> to be useful for debugging the application, you can always remove them
> periodically.
> Thanks,
> Dev
>
> On Mar 6, 2017 9:48 AM, "Timothy Chan"  wrote:
>>
>> I'm running a single worker EMR cluster for a Structured Streaming job.
>> How do I deal with my application log filling up HDFS?
>>
>> /var/log/spark/apps/application_1487823545416_0021_1.inprogress
>>
>> is currently 21.8 GB
>>
>> Sent with Shift

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Testing --supervise flag

2016-08-02 Thread Noorul Islam Kamal Malmiyoda
Widening to dev@spark

On Mon, Aug 1, 2016 at 4:21 PM, Noorul Islam K M  wrote:
>
> Hi all,
>
> I was trying to test the --supervise flag of spark-submit.
>
> The documentation [1] says that the flag helps restart your
> application automatically if it exited with a non-zero exit code.
>
> I am looking for some clarification on that documentation. In this
> context, does application mean the driver?
>
> Will the driver be re-launched if an exception is thrown by the
> application? I tested this scenario and the driver is not re-launched.
>
> ~/spark-1.6.1/bin/spark-submit --deploy-mode cluster --master 
> spark://10.29.83.162:6066 --class 
> org.apache.spark.examples.ExceptionHandlingTest 
> /home/spark/spark-1.6.1/lib/spark-examples-1.6.1-hadoop2.6.0.jar
>
> I killed the driver java process using the 'kill -9' command, and the driver
> was re-launched.
>
> Is this the only scenario where the driver will be re-launched? Is there a
> way to simulate a non-zero exit code and test the --supervise flag?
>
> Regards,
> Noorul
>
> [1] 
> http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications
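
(On the last question quoted above: a minimal driver that exits non-zero,
as a sketch of how one might simulate it; whether --supervise actually
restarts the driver on this exit path is exactly what needs to be
verified.)

  import org.apache.spark.{SparkConf, SparkContext}

  // a throwaway driver that exits with a non-zero code, to exercise --supervise
  object NonZeroExitTest {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("NonZeroExitTest"))
      sc.parallelize(1 to 10).count()   // run a trivial job first
      sc.stop()
      sys.exit(1)                       // simulate a failure exit code
    }
  }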

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Application not showing in Spark History

2016-08-02 Thread Noorul Islam Kamal Malmiyoda
Have you tried https://github.com/spark-jobserver/spark-jobserver

On Tue, Aug 2, 2016 at 2:23 PM, Rychnovsky, Dusan
 wrote:
> Hi,
>
>
> I am trying to launch my Spark application from within my Java application
> via the SparkSubmit class, like this:
>
>
>
> List<String> args = new ArrayList<>();
>
> args.add("--verbose");
> args.add("--deploy-mode=cluster");
> args.add("--master=yarn");
> ...
>
>
> SparkSubmit.main(args.toArray(new String[args.size()]));
>
>
>
> This works fine, with one catch - the application does not appear in Spark
> History after it's finished.
>
>
> If, however, I run the application using `spark-submit.sh`, like this:
>
>
>
> spark-submit \
>
>   --verbose \
>
>   --deploy-mode=cluster \
>
>   --master=yarn \
>
>   ...
>
>
>
> the application appears in Spark History correctly.
>
>
> What am I missing?
>
>
> Also, is this a good way to launch a Spark application from within a Java
> application or is there a better way?
>
> Thanks,
>
> Dusan
>
>
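
(On the "better way" question above: a minimal sketch using the launcher
API, org.apache.spark.launcher.SparkLauncher, which spawns a real
spark-submit process, so the usual spark-defaults.conf, including any
event-log settings, should apply; the paths and class names below are
placeholders.)

  import org.apache.spark.launcher.SparkLauncher

  // launch a Spark app from another JVM via a real spark-submit process
  val process = new SparkLauncher()
    .setAppResource("/path/to/my-app.jar")    // placeholder jar
    .setMainClass("com.example.MySparkJob")   // placeholder main class
    .setMaster("yarn")
    .setDeployMode("cluster")
    .setVerbose(true)
    .launch()
  val exitCode = process.waitFor()
  println(s"spark-submit exited with code $exitCode")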

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: When worker is killed driver continues to run causing issues in supervise mode

2016-07-13 Thread Noorul Islam Kamal Malmiyoda
Adding dev list
On Jul 13, 2016 5:38 PM, "Noorul Islam K M"  wrote:

>
> Spark version: 1.6.1
> Cluster Manager: Standalone
>
> I am experimenting with cluster mode deployment along with supervise for
> high availability of streaming applications.
>
> 1. Submit a streaming job in cluster mode with supervise
> 2. Say that the driver is scheduled on worker1. The app starts
>    successfully.
> 3. Kill the worker1 java process. This does not kill the driver process,
>    and hence the application (context) is still alive.
> 4. Because of the supervise flag, the driver gets scheduled on a new
>    worker, worker2, and a new context is created, making it a duplicate.
>
> I think this is a bug.
>
> Regards,
> Noorul
>


Cassandra read throughput using DataStax connector in Spark

2015-12-26 Thread Noorul Islam Kamal Malmiyoda
Hello all,

I am using the DataStax connector to read data from one Cassandra
cluster and write to another. The infrastructure is on Amazon. I have a
three-node cluster with a replication factor of 3 on both sides.

But the throughput seems to be very low: it takes 7 minutes to transfer
around 2.5 GB per node. I think the bottleneck is on the read side, as
I can see that the Spark node (independent of the two clusters) is
lightly loaded in terms of memory and CPU.

I tried tweaking some of the parameters from
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#cassandra-connection-parameters

Do you have any idea whether there is any parameter that I can tweak
to get better throughput?
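
For reference, a minimal sketch of the kind of read/write tuning the
reference document above describes (parameter names are from that
document; the values are illustrative, not validated):

  import org.apache.spark.SparkConf

  // illustrative DataStax connector tuning; adjust to the workload
  val conf = new SparkConf()
    .set("spark.cassandra.connection.host", "10.0.0.1")       // placeholder host
    .set("spark.cassandra.input.split.size_in_mb", "64")      // smaller splits -> more read tasks
    .set("spark.cassandra.input.fetch.size_in_rows", "5000")  // rows fetched per round trip
    .set("spark.cassandra.output.concurrent.writes", "8")     // parallel async writes per task
    .set("spark.cassandra.output.batch.size.rows", "200")     // rows per write batch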

Regards,
Noorul

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org