Re: Amazon Elastic Cache + Spark Streaming

2017-09-22 Thread ayan guha
AWS ElastiCache supports Memcached and Redis. Spark has a Redis connector
which I believe you can use to connect to ElastiCache.
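
For what it's worth, here is a minimal sketch of one way to push DStream output
into a Redis-backed ElastiCache node from Spark Streaming, using the plain Jedis
client inside foreachPartition. The endpoint, port, source stream and key names
below are placeholders, not anything ElastiCache-specific:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import redis.clients.jedis.Jedis

object RedisSinkSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("elasticache-sketch"), Seconds(10))

    // Hypothetical source; replace with your Kafka/socket/etc. input stream.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+")).map((_, 1L)).reduceByKey(_ + _)

    counts.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One connection per partition; the endpoint below is a placeholder.
        val jedis = new Jedis("my-cluster.xxxxxx.cache.amazonaws.com", 6379)
        try {
          records.foreach { case (word, count) => jedis.hincrBy("word_counts", word, count) }
        } finally {
          jedis.close()
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}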

On Sat, Sep 23, 2017 at 5:08 AM, Saravanan Nagarajan 
wrote:

> Hello,
>
> Has anybody tried Amazon ElastiCache? Just give me some pointers. Thanks!
>



-- 
Best Regards,
Ayan Guha


Re: What are factors need to Be considered when upgrading to Spark 2.1.0 from Spark 1.6.0

2017-09-22 Thread Gokula Krishnan D
Thanks for the reply. I forgot to mention that our batch ETL jobs are in
core Spark.


> On Sep 22, 2017, at 3:13 PM, Vadim Semenov  
> wrote:
> 
> 1. 40s is pretty negligible unless you run your job very frequently; there
> can be many factors that influence it.
> 
> 2. Try to compare the CPU time instead of the wall-clock time
> 
> 3. Check the stages that got slower and compare the DAGs
> 
> 4. Test with dynamic allocation disabled
> 
> On Fri, Sep 22, 2017 at 2:39 PM, Gokula Krishnan D wrote:
> Hello All, 
> 
> Currently our batch ETL jobs are on Spark 1.6.0 and we are planning to upgrade
> to Spark 2.1.0.
> 
> With minor code changes (like configuration and SparkSession usage) we were
> able to execute the existing job on Spark 2.1.0.
> 
> But we noticed that job completion times are much better in Spark 1.6.0 than
> in Spark 2.1.0.
> 
> For instance, Job A completed in 50s in Spark 1.6.0.
> 
> With the same input, Job A completed in 1.5 minutes in Spark 2.1.0.
> 
> Are there any specific factors that need to be considered when switching to
> Spark 2.1.0 from Spark 1.6.0?
> 
> 
> 
> Thanks & Regards, 
> Gokula Krishnan (Gokul)
> 



Re: [Structured Streaming] How to replay data and overwrite using FileSink

2017-09-22 Thread Michael Armbrust
There is no automated way to do this today, but you are on the right track
for a hack.  If you delete both the entries in _spark_metadata and the
corresponding entries from the checkpoint/offsets of the streaming query,
it will reprocess the corresponding section of the Kafka stream.
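
For anyone wanting to script that hack, a rough sketch using the Hadoop FileSystem
API is below. It assumes the default layout where each batch N leaves a file named
N under checkpoint/offsets, checkpoint/commits and <output>/_spark_metadata; it does
not touch compacted metadata files or the actual output files, and the query should
be stopped first:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical helper: drop the bookkeeping for every batch >= fromBatch so the
// restarted query reprocesses that section of the topic.
def dropBatchesFrom(checkpointDir: String, outputDir: String, fromBatch: Long): Unit = {
  val fs = FileSystem.get(new Configuration())
  val dirs = Seq(s"$checkpointDir/offsets", s"$checkpointDir/commits", s"$outputDir/_spark_metadata")
  for (dir <- dirs; status <- fs.listStatus(new Path(dir))) {
    val name = status.getPath.getName
    // Batch files are plainly named "0", "1", "2", ...; skip ".compact" files and the like.
    if (name.nonEmpty && name.forall(_.isDigit) && name.toLong >= fromBatch) {
      fs.delete(status.getPath, false)
    }
  }
}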

On Wed, Sep 20, 2017 at 6:20 PM, Bandish Chheda 
wrote:

> Hi,
>
> We are using StructuredStreaming (Spark 2.2.0) for processing data from
> Kafka. We read from a Kafka topic, do some conversions, computation and
> then use FileSink to store data to partitioned path in HDFS. We have
> enabled checkpoint (using a dir in HDFS).
>
> For cases when there is a bad code push to the streaming job, we want to
> replay data from Kafka (I was able to do that using a custom starting
> offset). During replay, how do I make FileSink overwrite the existing data?
> From the code (https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala#L99)
> it looks like it checks for the latest batchId and skips the processing. Any
> recommended way to avoid that? I am thinking of deleting files and the
> corresponding entries in _spark_metadata based on last modified time (and the
> time from which I want to replay).
>
> Any other better solutions?
>
> Thank you
>
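
For anyone finding this thread later: the "custom starting offset" replay mentioned
above is normally done with the Kafka source's startingOffsets option. A small
sketch (spark is an existing SparkSession; the broker, topic name and offsets are
placeholders):

// Replay from explicit per-partition offsets when starting a fresh query/checkpoint.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .option("startingOffsets", """{"events":{"0":12345,"1":67890}}""")
  .load()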


Re: What are factors need to Be considered when upgrading to Spark 2.1.0 from Spark 1.6.0

2017-09-22 Thread Vadim Semenov
1. 40s is pretty negligible unless you run your job very frequently; there
can be many factors that influence it.

2. Try to compare the CPU time instead of the wall-clock time

3. Check the stages that got slower and compare the DAGs

4. Test with dynamic allocation disabled
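
For point 4, one quick way to get a comparable run is to pin a fixed number of
executors with dynamic allocation explicitly off; a sketch (the app name and
executor count are just examples):

val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("job-a-no-dynalloc")
  .config("spark.dynamicAllocation.enabled", "false")
  .config("spark.executor.instances", "10")
  .getOrCreate()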

On Fri, Sep 22, 2017 at 2:39 PM, Gokula Krishnan D 
wrote:

> Hello All,
>
> Currently our batch ETL jobs are on Spark 1.6.0 and we are planning to
> upgrade to Spark 2.1.0.
>
> With minor code changes (like configuration and SparkSession usage) we were
> able to execute the existing job on Spark 2.1.0.
>
> But we noticed that job completion times are much better in Spark 1.6.0
> than in Spark 2.1.0.
>
> For instance, Job A completed in 50s in Spark 1.6.0.
>
> With the same input, Job A completed in 1.5 minutes in Spark 2.1.0.
>
> Are there any specific factors that need to be considered when switching to
> Spark 2.1.0 from Spark 1.6.0?
>
>
>
> Thanks & Regards,
> Gokula Krishnan (Gokul)
>


Amazon Elastic Cache + Spark Streaming

2017-09-22 Thread Saravanan Nagarajan
Hello,

Has anybody tried Amazon ElastiCache? Just give me some pointers. Thanks!


What are factors need to Be considered when upgrading to Spark 2.1.0 from Spark 1.6.0

2017-09-22 Thread Gokula Krishnan D
Hello All,

Currently our batch ETL jobs are on Spark 1.6.0 and we are planning to
upgrade to Spark 2.1.0.

With minor code changes (like configuration and SparkSession usage) we were
able to execute the existing job on Spark 2.1.0.
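
(For anyone making the same move, the entry-point change is roughly the following;
names are illustrative, and the old contexts remain reachable from the session.)

// Spark 1.6 style
// val sc = new SparkContext(new SparkConf().setAppName("job-a"))
// val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Spark 2.x style: SparkSession wraps both.
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("job-a")
  .getOrCreate()
val sc = spark.sparkContext   // for existing RDD (core Spark) code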

But we noticed that job completion times are much better in Spark 1.6.0
than in Spark 2.1.0.

For instance, Job A completed in 50s in Spark 1.6.0.

With the same input, Job A completed in 1.5 minutes in Spark 2.1.0.

Are there any specific factors that need to be considered when switching to
Spark 2.1.0 from Spark 1.6.0?



Thanks & Regards,
Gokula Krishnan (Gokul)


RE: plotting/resampling timeseries data

2017-09-22 Thread Brian Wylie
@vermanuraq

Great, thanks, just what I needed. I knew I was missing something simple.

Cheers,
-brian






Re: graphframes on cluster

2017-09-22 Thread Imran Rajjad
Sorry for posting without complete information.

I am connecting to the Spark cluster with the driver program acting as the
backend of a web application. This is intended to listen to job progress and
do some other work. Below is how I am connecting to the cluster:

sparkConf = new SparkConf().setAppName("isolated test")
    .setMaster("spark://master:7077")
    .set("spark.executor.memory", "6g")
    .set("spark.driver.memory", "6g")
    .set("spark.driver.maxResultSize", "2g")
    .set("spark.executor.extraJavaOptions", "-Xmx8g")
    .set("spark.jars.packages", "graphframes:graphframes:0.5.0-spark2.1-s_2.11")
    .set("spark.jars", "/home/usr/jobs.jar"); // shared location on the Linux machines; has the required Java classes

the crash occurs at

gFrame.connectedComponents().setBroadcastThreshold(2).run();

with exception

Exception in thread "main" org.apache.spark.SparkException: Job aborted due
to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 0.0 (TID 5, 10.112.29.80):
java.lang.ClassCastException: cannot assign instance of
scala.collection.immutable.List$SerializationProxy to field
org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type
scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2024)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2018)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1942)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1808)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1353)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:373)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:71)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

After googling around, this appears to be related to dependencies, but I
don't have many dependencies apart from a few POJOs, which have been included
through the context.
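
In case it helps anyone searching later, the spark-submit route Felix suggests
in the quoted reply below would look roughly like this (the class name and jar
path are placeholders):

spark-submit \
  --master spark://master:7077 \
  --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11 \
  --class com.example.GraphJob \
  /home/usr/jobs.jar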

regards,
Imran




On Wed, Sep 20, 2017 at 9:00 PM, Felix Cheung 
wrote:

> Could you include the code where it fails?
> Generally the best way to use GraphFrames is to use the --packages option
> with the spark-submit command
>
> --
> *From:* Imran Rajjad 
> *Sent:* Wednesday, September 20, 2017 5:47:27 AM
> *To:* user @spark
> *Subject:* graphframes on cluster
>
> Trying to run GraphFrames on a Spark cluster. Do I need to include the
> package in the Spark context settings, or is only the driver program
> supposed to have the GraphFrames libraries on its classpath? Currently the
> job crashes when an action function is invoked on GraphFrames classes.
>
> regards,
> Imran
>
> --
> I.R
>



-- 
I.R


Re: Checkpoints not cleaned using Spark streaming + watermarking + kafka

2017-09-22 Thread MathieuP
The expected setting to clean these files is:
- spark.sql.streaming.minBatchesToRetain

More info on Structured Streaming settings:
https://github.com/jaceklaskowski/spark-structured-streaming-book/blob/master/spark-sql-streaming-properties.adoc
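
A sketch of setting it when building the session (the value is illustrative; it
controls how many recent batches of streaming metadata Spark keeps recoverable,
so smaller values leave fewer old files around):

val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("my-streaming-app")
  .config("spark.sql.streaming.minBatchesToRetain", "100")
  .getOrCreate()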




