Re: Structured Streaming on Kubernetes

2018-04-16 Thread Krishna Kalyan
Thank you so much TD, Matt, Anirudh and Oz,
Really appreciate this.

On Fri, Apr 13, 2018 at 9:54 PM, Oz Ben-Ami  wrote:

> I can confirm that Structured Streaming works on Kubernetes, though we're
> not quite on production with that yet. Issues we're looking at are:
> - Submission through spark-submit works, but is a bit clunky with a
> Kubernetes-centered workflow. Spark Operator
> <https://github.com/GoogleCloudPlatform/spark-on-k8s-operator> is
> promising, but still in alpha (e.g., we ran into this
> <https://github.com/kubernetes/kubernetes/issues/56018>). Even better
> would be something that runs the driver as a Deployment / StatefulSet, so
> that long-running streaming jobs can be restarted automatically (see the
> first sketch after this list)
> - Dynamic allocation: works with the spark-on-k8s fork, but not with plain
> Spark 2.3, due to reliance on the shuffle service, which hasn't been merged
> yet. The ideal implementation would be able to connect to a
> PersistentVolume independently of a node, but that's a bit more complicated
> - Checkpointing: We checkpoint to a separate HDFS (Dataproc) cluster,
> which works well for us with both the old Spark Streaming and Structured
> Streaming. We've successfully experimented with HDFS on Kubernetes
> <https://github.com/apache-spark-on-k8s/kubernetes-HDFS/tree/master>, but
> again not in production
> - UI: Unfortunately Structured Streaming does not yet have a comprehensive
> UI like the old Spark Streaming, but it does show the basic information
> (jobs, stages, queries, executors), and other information is generally
> available in the logs and metrics
> - Monitoring / Logging: this is a strength of Kubernetes, in that it's all
> centralized by the cluster. We use Splunk, but it would also be possible
> to hook up Spark's Dropwizard Metrics library to Prometheus
> <https://github.com/dhatim/dropwizard-prometheus>, and read logs with
> fluentd or Stackdriver.
> - Side note: Kafka support in Spark and Structured Streaming is very good,
> but as of Spark 2.3 there are still a couple of missing features, notably
> transparent avro support (UDFs are needed; see the second sketch below)
> and taking advantage of transactional processing (introduced to Kafka
> last year) for better exactly-once guarantees
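>
> For reference, a minimal sketch of driving submission programmatically
> through Spark's SparkLauncher API (which wraps spark-submit); the main
> class, jar path, and image name here are hypothetical, not a recommended
> setup:
>
>   import org.apache.spark.launcher.SparkLauncher
>
>   // Submit a streaming job against a Kubernetes master (Spark 2.3+).
>   // All names below are placeholders.
>   val handle = new SparkLauncher()
>     .setMaster("k8s://https://kubernetes.default.svc")
>     .setDeployMode("cluster")
>     .setAppResource("local:///opt/app/streaming-job.jar")
>     .setMainClass("com.example.StreamingJob")
>     .setConf("spark.kubernetes.container.image", "example/spark:2.3.0")
>     .startApplication()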
>
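> And a rough sketch of the kind of avro-decoding UDF currently needed when
> reading from Kafka (assumes an existing SparkSession `spark` and the
> spark-sql-kafka package on the classpath; the schema, topic, and broker
> are made up):
>
>   import org.apache.avro.Schema
>   import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
>   import org.apache.avro.io.DecoderFactory
>   import org.apache.spark.sql.functions.{col, udf}
>
>   // Toy schema; a real job would load this from a schema registry or file.
>   val schemaJson =
>     """{"type":"record","name":"Event","fields":[{"name":"id","type":"string"}]}"""
>
>   // Deserialize the raw Kafka value bytes with plain avro, pull out a field.
>   val decodeId = udf { (bytes: Array[Byte]) =>
>     val schema = new Schema.Parser().parse(schemaJson)
>     val reader = new GenericDatumReader[GenericRecord](schema)
>     val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
>     reader.read(null, decoder).get("id").toString
>   }
>
>   val events = spark.readStream
>     .format("kafka")
>     .option("kafka.bootstrap.servers", "broker:9092")
>     .option("subscribe", "events")
>     .load()
>     .select(decodeId(col("value")).as("id"))
>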
> On Fri, Apr 13, 2018 at 3:08 PM, Anirudh Ramanathan <
> ramanath...@google.com> wrote:
>
>> +ozzieba, who was experimenting with streaming workloads recently. +1 to
>> what Matt said. Checkpointing and driver recovery are future work.
>> Structured streaming is important, and it would be good to get some
>> production experience here and try to improve the feature's support on
>> K8s for 2.4/3.0.
>>
>>
>> On Fri, Apr 13, 2018 at 11:55 AM Matt Cheah  wrote:
>>
>>> We don’t provide any Kubernetes-specific mechanisms for streaming, such
>>> as checkpointing to persistent volumes. But as long as a job doesn’t
>>> require persisting to the executor’s local disk, streaming ought to work
>>> out of the box. E.g. you can checkpoint to HDFS, but not to the pod’s
>>> local directories.
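>>>
>>> As an illustration, a minimal sketch of a query checkpointing to HDFS
>>> rather than to local disk (assumes a streaming DataFrame `events`; the
>>> paths are made up):
>>>
>>>   val query = events.writeStream
>>>     .format("parquet")
>>>     .option("path", "hdfs://namenode:8020/data/events")
>>>     .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/events")
>>>     .start()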
>>>
>>>
>>>
>>> However, I’m unaware of any specific use of streaming with the Spark on
>>> Kubernetes integration right now. Would be curious to get feedback on the
>>> failover behavior.
>>>
>>>
>>>
>>> -Matt Cheah
>>>
>>>
>>>
>>> *From: *Tathagata Das 
>>> *Date: *Friday, April 13, 2018 at 1:27 AM
>>> *To: *Krishna Kalyan 
>>> *Cc: *user 
>>> *Subject: *Re: Structured Streaming on Kubernetes
>>>
>>>
>>>
>>> Structured streaming is stable in production! At Databricks, we and our
>>> customers collectively process hundreds of billions of records per day
>>> using SS. However, we are not using Kubernetes :)
>>>
>>>
>>>
>>> Though I don't think it will matter much as long as Kubernetes is
>>> correctly provisioned and configured, and you are checkpointing to HDFS
>>> (for fault-tolerance guarantees).
>>>
>>>
>>>
>>> TD
>>>
>>>
>>>
>>> On Fri, Apr 13, 2018, 12:28 AM Krishna Kalyan 
>>> wrote:
>>>
>>> Hello All,
>>>
>>> We were evaluating Spark Structured Streaming on Kubernetes (running on
>>> GCP). It would be awesome if the Spark community could share their
>>> experience around this. I would like to know more about your production
>>> experience and the monitoring tools you are using.
>>>
>>>
>>>
>>> Since Spark on Kubernetes is a relatively new addition to Spark, I was
>>> wondering whether Structured Streaming is stable in production. We were
>>> also evaluating Apache Beam with Flink.
>>>
>>>
>>>
>>> Regards,
>>>
>>> Krishna
>>>
>>>
>>>
>>>
>>>
>>>
>>
>> --
>> Anirudh Ramanathan
>>
>
>


Structured Streaming on Kubernetes

2018-04-13 Thread Krishna Kalyan
Hello All,
We were evaluating Spark Structured Streaming on Kubernetes (running on
GCP). It would be awesome if the Spark community could share their
experience around this. I would like to know more about your production
experience and the monitoring tools you are using.

Since Spark on Kubernetes is a relatively new addition to Spark, I was
wondering whether Structured Streaming is stable in production. We were
also evaluating Apache Beam with Flink.

Regards,
Krishna


Unable to build spark documentation

2017-01-11 Thread Krishna Kalyan
Hello,
I have been trying to build the Spark documentation, following the
instructions at the link below:
https://github.com/apache/spark/blob/master/docs/README.md

My Jekyll build fails with the errors in the gists below:
https://gist.github.com/krishnakalyan3/d0e38852efe97d7899d737b83b8d8702
and
https://gist.github.com/krishnakalyan3/08f00f49a943e43600cbc6b21f307228

Could someone please advise on how to resolve this error?

Regards,
Krishna


Contributing to PySpark

2016-10-18 Thread Krishna Kalyan
Hello,
I am a masters student. Could someone please let me know how to set up my
dev environment to contribute to PySpark?
Questions I had were:
a) Should I use IntelliJ IDEA or PyCharm?
b) How do I test my changes?

Regards,
Krishna


Re: Error Running SparkPi.scala Example

2016-06-17 Thread Krishna Kalyan
Hi Jacek,

Maven build output
*mvn clean install*

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 30:12 min
[INFO] Finished at: 2016-06-17T15:15:46+02:00
[INFO] Final Memory: 82M/1253M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal
org.scalatest:scalatest-maven-plugin:1.0:test (test) on project
spark-core_2.11: There are test failures -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions,
please read the following articles:
[ERROR] [Help 1]
http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the
command
[ERROR]   mvn <goals> -rf :spark-core_2.11


and the error
- handles standalone cluster mode *** FAILED ***
  Map("spark.driver.memory" -> "4g", "SPARK_SUBMIT" -> "true",
"spark.driver.cores" -> "5", "spark.ui.enabled" -> "false",
"spark.driver.supervise" -> "true", "spark.app.name" -> "org.SomeClass",
"spark.jars" -> "file:/Users/krishna/Experiment/spark/core/thejar.jar",
"spark.submit.deployMode" -> "cluster", "spark.executor.extraClassPath" ->
"~/mysql-connector-java-5.1.12.jar", "spark.master" -> "spark://h:p",
"spark.driver.extraClassPath" -> "~/mysql-connector-java-5.1.12.jar") had
size 11 instead of expected size 9 (SparkSubmitSuite.scala:294)
- handles legacy standalone cluster mode *** FAILED ***
  Map("spark.driver.memory" -> "4g", "SPARK_SUBMIT" -> "true",
"spark.driver.cores" -> "5", "spark.ui.enabled" -> "false",
"spark.driver.supervise" -> "true", "spark.app.name" -> "org.SomeClass",
"spark.jars" -> "file:/Users/krishna/Experiment/spark/core/thejar.jar",
"spark.submit.deployMode" -> "cluster", "spark.executor.extraClassPath" ->
"~/mysql-connector-java-5.1.12.jar", "spark.master" -> "spark://h:p",
"spark.driver.extraClassPath" -> "~/mysql-connector-java-5.1.12.jar") had
size 11 instead of expected size 9 (SparkSubmitSuite.scala:294)


On Thu, Jun 16, 2016 at 1:57 PM, Jacek Laskowski  wrote:

> Hi,
>
> Before you try to do it inside another environment like an IDE, could
> you build Spark using mvn or sbt and, only when successful, try to run
> SparkPi using the run-example script (which wraps spark-submit)? With
> that working, you could then try to get a complete environment inside
> your beloved IDE (and I'm very glad to hear it's IDEA :))
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Thu, Jun 16, 2016 at 1:37 AM, Krishna Kalyan
>  wrote:
> > Hello,
> > I am faced with problems when I try to run SparkPi.scala.
> > I took the following steps below:
> > a) git pull https://github.com/apache/spark
> > b) Import the project in Intellij as a maven project
> > c) Run 'SparkPi'
> >
> > Error Below:
> > Information:16/06/16 01:34 - Compilation completed with 10 errors and 5
> > warnings in 5s 843ms
> > Warning:scalac: Class org.jboss.netty.channel.ChannelFactory not found -
> > continuing with a stub.
> > Warning:scalac: Class org.jboss.netty.channel.ChannelPipelineFactory not
> > found - continuing with a stub.
> > Warning:scalac: Class org.jboss.netty.handler.execution.ExecutionHandler
> not
> > found - continuing with a stub.
> > Warning:scalac: Class org.jboss.netty.channel.group.ChannelGroup not
> found -
> > continuing with a stub.
> > Warning:scalac: Class com.google.common.collect.ImmutableMap not found -
> > continuing with a stub.
> >
> /Users/krishna/Experiment/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkAvroCallbackHandler.scala
> > Error:(45, 66) not found: type SparkFlumeProtocol
> >   val transactionTimeout: Int, val backOffInterval: Int) extends
> > SparkFlumeProtocol with Logging {
> >  ^
> > Error:(70, 39) not found: type EventBatch

Error Running SparkPi.scala Example

2016-06-15 Thread Krishna Kalyan
Hello,
I am faced with problems when I try to run SparkPi.scala.
I took the following steps below:
a) git pull https://github.com/apache/spark
b) Import the project in Intellij as a maven project
c) Run 'SparkPi'

Error Below:
Information:16/06/16 01:34 - Compilation completed with 10 errors and 5
warnings in 5s 843ms
Warning:scalac: Class org.jboss.netty.channel.ChannelFactory not found -
continuing with a stub.
Warning:scalac: Class org.jboss.netty.channel.ChannelPipelineFactory not
found - continuing with a stub.
Warning:scalac: Class org.jboss.netty.handler.execution.ExecutionHandler
not found - continuing with a stub.
Warning:scalac: Class org.jboss.netty.channel.group.ChannelGroup not found
- continuing with a stub.
Warning:scalac: Class com.google.common.collect.ImmutableMap not found -
continuing with a stub.
/Users/krishna/Experiment/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkAvroCallbackHandler.scala
Error:(45, 66) not found: type SparkFlumeProtocol
  val transactionTimeout: Int, val backOffInterval: Int) extends
SparkFlumeProtocol with Logging {
 ^
Error:(70, 39) not found: type EventBatch
  override def getEventBatch(n: Int): EventBatch = {
  ^
Error:(85, 13) not found: type EventBatch
new EventBatch("Spark sink has been stopped!", "",
java.util.Collections.emptyList())
^
/Users/krishna/Experiment/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/TransactionProcessor.scala
Error:(80, 22) not found: type EventBatch
  def getEventBatch: EventBatch = {
 ^
Error:(48, 37) not found: type EventBatch
  @volatile private var eventBatch: EventBatch = new EventBatch("Unknown
Error", "",
^
Error:(48, 54) not found: type EventBatch
  @volatile private var eventBatch: EventBatch = new EventBatch("Unknown
Error", "",
 ^
Error:(115, 41) not found: type SparkSinkEvent
val events = new util.ArrayList[SparkSinkEvent](maxBatchSize)
^
Error:(146, 28) not found: type EventBatch
  eventBatch = new EventBatch("", seqNum, events)
   ^
/Users/krishna/Experiment/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkSinkUtils.scala
Error:(25, 27) not found: type EventBatch
  def isErrorBatch(batch: EventBatch): Boolean = {
  ^
/Users/krishna/Experiment/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkSink.scala
Error:(86, 51) not found: type SparkFlumeProtocol
val responder = new SpecificResponder(classOf[SparkFlumeProtocol],
handler.get)

Thanks,
Krishna


Re: RBM in mllib

2016-06-14 Thread Krishna Kalyan
Hi Robert,
According to the JIRA, the resolution is "Won't Fix". The pull request was
closed as it did not merge cleanly with master.
(https://github.com/apache/spark/pull/3222)

On Tue, Jun 14, 2016 at 4:23 PM, Roberto Pagliari  wrote:

> Is RBM being developed?
>
> This one is marked as resolved, but it is not
>
> https://issues.apache.org/jira/browse/SPARK-4251
> 
>
>
>