RE: Spark UI not coming up in EMR

2017-01-11 Thread Saurabh Malviya (samalviy)
Any clue on this?

Jobs are running fine, but I am not able to access the Spark UI in EMR on YARN.

Where can I see statistics such as the number of events per second and rows processed 
for streaming in the log files (if the UI is not working)?
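
If the UI stays unavailable, one option is to log the same per-batch figures from the 
driver itself. A minimal sketch, assuming a plain StreamingContext named ssc (the 
listener class name and the fields printed are illustrative, not from the original job):

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    // Logs record counts and processing time for every completed batch, so the
    // statistics the streaming tab of the UI would show end up in the driver log.
    class BatchStatsLogger extends StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        println(s"batch=${info.batchTime} records=${info.numRecords} " +
          s"processingMs=${info.processingDelay.getOrElse(-1L)}")
      }
    }

    ssc.addStreamingListener(new BatchStatsLogger)  // ssc: the application's StreamingContext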

-Saurabh

From: Saurabh Malviya (samalviy)
Sent: Monday, January 09, 2017 10:59 AM
To: user@spark.apache.org
Subject: Spark UI not coming up in EMR

The Spark web UI for detailed monitoring of streaming jobs stops rendering after about 
2 weeks; it just keeps looping while trying to fetch the page. Is there any way I can 
still get that page, or logs where I can see how many events come into Spark in each interval?

-Saurabh


Spark UI not coming up in EMR

2017-01-09 Thread Saurabh Malviya (samalviy)
The Spark web UI for detailed monitoring of streaming jobs stops rendering after about 
2 weeks; it just keeps looping while trying to fetch the page. Is there any way I can 
still get that page, or logs where I can see how many events come into Spark in each interval?

-Saurabh


streaming deployment on yarn -emr

2016-12-05 Thread Saurabh Malviya (samalviy)
Hi,

We are using EMR, and right now we deploy the streaming job with an Oozie workflow. 
I just want to know the best practice for deploying a streaming job. (On Mesos we deploy 
using Marathon, but what is the best approach on YARN to enforce that only one instance 
runs and to restart it if it fails for any reason?)

I have noticed that sometimes YARN launches two instances of the streaming job. It 
seems that when the driver is unhealthy, YARN starts a second attempt, and in the 
meantime the old driver recovers. (That is why we need to check how to enforce that 
only one instance of the streaming job is running on YARN.)
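
On the restart side, the usual knobs are YARN's application-master attempt settings 
rather than an external scheduler. A rough sketch with illustrative values (both keys 
are standard Spark-on-YARN settings; this bounds and spaces out restarts, but does not 
by itself guarantee a single running instance):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("streaming-job")
      // Bound how many times YARN restarts the driver (application master) on failure.
      .set("spark.yarn.maxAppAttempts", "4")
      // Reset the attempt counter after an hour of healthy running, so a long-lived
      // streaming job is not eventually killed for failures spread over weeks.
      .set("spark.yarn.am.attemptFailuresValidityInterval", "1h")
    // In yarn-cluster mode these are normally passed at submit time instead,
    // e.g. spark-submit --conf spark.yarn.maxAppAttempts=4 ...
    val ssc = new StreamingContext(conf, Seconds(10))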

-Saurabh




Spark on yarn environment var

2016-09-30 Thread Saurabh Malviya (samalviy)
Hi,

I am running Spark on YARN using Oozie.

When I submit through the command line using spark-submit, Spark is able to read the 
environment variable. But when I submit through Oozie, it is not able to get the 
environment variable, and I don't see the driver log.

Is there any way to specify an environment variable in the Oozie Spark action?
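
One workaround, instead of relying on a shell environment that the Oozie launcher may 
not propagate, is to pass the value through Spark's own environment-variable settings 
(for example via --conf entries in the action's <spark-opts>). A sketch assuming 
yarn-cluster mode; MY_ENV is a placeholder variable name:

    object EnvCheck {
      // Passed at submission time, e.g. in the Oozie action's <spark-opts>:
      //   --conf spark.yarn.appMasterEnv.MY_ENV=prod --conf spark.executorEnv.MY_ENV=prod
      // spark.yarn.appMasterEnv.* sets the variable for the driver in yarn-cluster mode,
      // spark.executorEnv.* sets it for the executors.
      def main(args: Array[String]): Unit = {
        val myEnv: String = sys.env.getOrElse("MY_ENV", "dev")  // safe fallback if not set
        println(s"running against environment: $myEnv")
      }
    }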

Saurabh


Get profile from sbt

2016-09-19 Thread Saurabh Malviya (samalviy)
Hi,

Is there anything in sbt equivalent to Maven profiles? I want the Spark build to pick 
up endpoints based on the environment the jar is built for.

In build.sbt we are ingesting a variable (dev, stage, etc.) and picking up all 
dependencies accordingly. In a similar way, I need to pick up configuration for 
external dependencies such as endpoints.

Alternatively, is there any way I can access a variable defined in build.sbt from Scala code?
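
One option (an assumption, not something the build necessarily uses already) is the 
sbt-buildinfo plugin, which generates a Scala object from sbt settings so application 
code can read the value chosen at build time. A sketch; the plugin version, package 
name, and the "env" key are illustrative:

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-buildinfo" % "0.6.1")

    // build.sbt
    lazy val root = (project in file("."))
      .enablePlugins(BuildInfoPlugin)
      .settings(
        buildInfoPackage := "com.example.build",
        buildInfoKeys := Seq[BuildInfoKey](name, version),
        // Pick the environment from a JVM property, e.g. `sbt -Denv=stage assembly`.
        buildInfoKeys += BuildInfoKey.action("env") { sys.props.getOrElse("env", "dev") }
      )

    // Application code can then branch on the generated object, for example:
    //   val endpoint = if (com.example.build.BuildInfo.env == "stage") stageEndpoint else devEndpoint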

-Saurabh


EMR for spark job - instance type suggestion

2016-08-26 Thread Saurabh Malviya (samalviy)
We are going to use an EMR cluster for Spark jobs in AWS. Any suggestion on which 
instance type to use?

m3.xlarge or r3.xlarge?

Details:

1)  We are going to run a couple of streaming jobs, so we need an on-demand 
instance type.

2)  There is no data on HDFS/S3; all data is pulled from Kafka or Elasticsearch.


-Saurabh


JDBCRdd issue

2015-09-21 Thread Saurabh Malviya (samalviy)
Hi,


While using a reference inside JdbcRDD, it throws a serialization exception. Does 
JdbcRDD not accept references from other parts of the code?
    confMap = ConfFactory.getConf(ParquetStreaming)

    val jdbcRDD = new JdbcRDD(sc, () => {
        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")
        DriverManager.getConnection(confMap(PHOENIX_URL))   // <-- throws the exception below
        // DriverManager.getConnection(ConfFactory.getConf(ParquetStreaming)(PHOENIX_URL))  // <-- this works
      },
      s"SELECT tenant_id, data_source_id, mne_id, device_type1_key " +
        s" FROM XYZ_TYPE1_TEST WHERE DEVICE_TYPE1_KEY >= ? and DEVICE_TYPE1_KEY <= ? and TENANT_ID in ($tenantIds) " +
        s" AND DATA_SOURCE_ID in ($dataSourceIds) AND ISDELETED = false",
      minKey, maxKey, 10,
      row => DeviceDel(row.getString(1), row.getString(2), row.getLong(3), row.getLong(4))).cache()

It throws a runtime exception. However, DriverManager.getConnection("jdbc:phoenix:10.20.87.1:2181") works fine.

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
        - object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@5bb273b4)
        - field (class: advance_reporting.transformations.DeviceDelETL$$anonfun$main$1, name: sc$1, type: class org.apache.spark.SparkContext)
        - object (class advance_reporting.transformations.DeviceDelETL$$anonfun$main$1, )
        - field (class: advance_reporting.transformations.DeviceDelETL$$anonfun$main$1$$anonfun$6, name: $outer, type: class $$anonfun$main$1)
        - object (class advance_reporting.transformations.DeviceDelETL$$anonfun$main$1$$anonfun$6, )
        - field (class: org.apache.spark.rdd.JdbcRDD, name: org$apache$spark$rdd$JdbcRDD$$getConnection, type: interface scala.Function0)
        - object (class org.apache.spark.rdd.JdbcRDD, JdbcRDD[15] at JdbcRDD at DeviceDelETL.scala:91)
        - field (class: scala.Tuple2, name: _1, type: class java.lang.Object)
        - object (class scala.Tuple2, (JdbcRDD[15] at JdbcRDD at DeviceDelETL.scala:91,))
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:878)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:815)
        at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:799)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1426)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Sep 18, 2015 12:22:59 PM INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5

Any idea?
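
For what it's worth, the serialization stack above points at the closure rather than at 
JdbcRDD itself: the () => getConnection function captures its enclosing anonymous 
function (the sc$1 field in the trace), which drags the non-serializable SparkContext 
along with it. A minimal sketch of one way around it, reusing the identifiers from the 
snippet above and resolving the URL into a local String first so the closure captures 
only that value:

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    // Resolve the URL up front so the connection closure captures only this String,
    // not the surrounding scope (and with it the SparkContext).
    val phoenixUrl: String = ConfFactory.getConf(ParquetStreaming)(PHOENIX_URL)

    val jdbcRDD = new JdbcRDD(sc, () => {
        Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")
        DriverManager.getConnection(phoenixUrl)
      },
      s"SELECT tenant_id, data_source_id, mne_id, device_type1_key " +
        s" FROM XYZ_TYPE1_TEST WHERE DEVICE_TYPE1_KEY >= ? and DEVICE_TYPE1_KEY <= ? and TENANT_ID in ($tenantIds) " +
        s" AND DATA_SOURCE_ID in ($dataSourceIds) AND ISDELETED = false",
      minKey, maxKey, 10,
      row => DeviceDel(row.getString(1), row.getString(2), row.getLong(3), row.getLong(4))).cache()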



spark-submit chronos issue

2015-09-16 Thread Saurabh Malviya (samalviy)
Hi,

I am facing a strange issue while using Chronos: the job is not able to find the main 
class when invoking spark-submit through Chronos.

The issue I identified is the colon (":") in the task name.

Environment: a Chronos-scheduled job on Mesos.

/tmp/mesos/slaves/20150911-070325-218147008-5050-30275-S4/frameworks/20150911-070325-218147008-5050-30275-0483/executors/ct:144244800:0:hbaseConnTest:/runs/6f9ddfc8-944c-4648-b29e-d4c9182f2292/spark-1.4.1-bin-custom-spark'


/tmp/mesos/slaves/20150911-070325-218147008-5050-30275-S4/frameworks/20150911-070325-218147008-5050-30275-0483/executors/ct:144244800:0:hbaseConnTest:/runs/6f9ddfc8-944c-4648-b29e-d4c9182f2292/spark-1.4.1-bin-custom-spark/bin/spark-submit


Error: Could not find or load main class org.apache.spark.launcher.Main



I found that if I rename the folder above and remove the ":" characters, it works.



Has anyone faced a similar problem using Chronos? Let me know a quick workaround, or 
whether there is another underlying issue.





Thanks,

Saurabh