Using RDDs requires some 'low level' optimization techniques, while using
DataFrames / Spark SQL allows you to leverage existing code.
If you can share some more of your use case, that would help other people
provide suggestions.
Thanks
> On May 6, 2016, at 6:57 PM, HARSH TAKKAR
Try to use Dataframe instead of RDD.
Here's an introduction to Dataframe:
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
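For example, a minimal sketch (assuming Spark 1.6 with an existing SparkContext sc; the case class and values are illustrative):

import org.apache.spark.sql.SQLContext

case class Sale(item: String, price: Double)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val rdd = sc.parallelize(Seq(Sale("a", 1.0), Sale("a", 3.0), Sale("b", 2.0)))
val df = rdd.toDF()
df.groupBy("item").avg("price").show() // the Catalyst optimizer plans this for you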
2016-05-06 21:52 GMT+07:00 pratik gawande :
> Thanks Shao for quick reply. I will look
Hi Ted
I am aware that RDDs are immutable, but in my use case I need to update the
same data set after each iteration.
Following are the approaches I was exploring:
1. Generating a new RDD in each iteration (it might use a lot of memory; see
the sketch below).
2. Using Hive tables and updating the same table after each
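A minimal sketch of option 1 (all names illustrative; assumes a SparkContext sc): since RDDs are immutable, "updating" means deriving a new RDD each iteration and reassigning the reference, checkpointing periodically so the lineage doesn't grow without bound.

sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical directory

var current = sc.parallelize(1 to 1000000).map(_.toDouble)
for (i <- 1 to 20) {
  current = current.map(v => v * 0.9 + 1.0) // per-iteration update goes here
  if (i % 5 == 0) {
    current.checkpoint()  // truncate the lineage
    current.count()       // force the checkpoint to materialize
  }
}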
Thanks for the clarification Michael and good luck with Spark 2.0. It
really looks promising.
I am especially interested in the ad-hoc queries aspect. Probably that is what
is being referred to as Continuous SQL in the slides. What is the timeframe
for availability of this functionality?
regards
Sunita
Agreed.
Just sharing what I saw,
http://www.slideshare.net/databricks/realtime-spark-from-interactive-queries-to-streaming
http://www.slideshare.net/rxin/the-future-of-realtime-in-spark?next_slideshow=3
It claims to support Kafka, files and databases. However, continuous SQL
will be available in
That is a forward-looking design doc and not all of it has been implemented
yet. With Spark 2.0 the main sources and sinks will be file-based, though
we hope to quickly expand that now that a lot of infrastructure is in place.
On Fri, May 6, 2016 at 2:11 PM, Ted Yu wrote:
I was
reading StructuredStreamingProgrammingAbstractionSemanticsandAPIs-ApacheJIRA.pdf
attached to SPARK-8360.
On page 12, there is a mention of .format("kafka"), but I searched the
codebase and didn't find any occurrence.
FYI
On Fri, May 6, 2016 at 1:06 PM, Michael Malak <
Hi,
I have this code that filters out those prices that are over 99.8 within
the sliding window. The code works OK as shown below.
Now I need to work out min(price), max(price) and avg(price) in the sliding
window. What I need is a counter and a method of getting these values.
Any
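One possible approach, sketched under the assumption that prices is a DStream[Double] and that the window and slide durations below are multiples of the batch interval:

case class Stats(min: Double, max: Double, sum: Double, count: Long)

val stats = prices
  .map(p => Stats(p, p, p, 1L))
  .reduceByWindow((a, b) =>
    Stats(math.min(a.min, b.min), math.max(a.max, b.max),
          a.sum + b.sum, a.count + b.count),
    Seconds(300), Seconds(30))

stats.map(s => s"min=${s.min} max=${s.max} avg=${s.sum / s.count}").print()

A single (min, max, sum, count) accumulator gives all three values in one pass over the window.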
Thank you Anthony. I am clearer on yarn-cluster and yarn-client now.
On Fri, May 6, 2016 at 1:05 PM, Anthony May wrote:
> Making the master yarn-cluster means that the driver is then running on
> YARN not just the executor nodes. It's then independent of your application
>
At first glance, it looks like the only streaming data sources available out of
the box from the GitHub master branch are
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala
and
Making the master yarn-cluster means that the driver is then running on
YARN, not just the executor nodes. It's then independent of your application
and can only be killed via YARN commands, or when a batch job completes.
The simplest way to tie the driver to your app is to pass in yarn-client as
Hi Anthony,
I am passing:
--master yarn-cluster
--name pysparkexample
--executor-memory 1G
--driver-memory 1G
--conf
Greetings Satish,
What are the arguments you're passing in?
On Fri, 6 May 2016 at 12:50 satish saley wrote:
> Hello,
>
> I am submitting a spark job using SparkSubmit. When I kill my application,
> it does not kill the corresponding spark job. How would I kill the
>
Hello,
I am submitting a spark job using SparkSubmit. When I kill my application,
it does not kill the corresponding spark job. How would I kill the
corresponding Spark job? I know one way is to use SparkSubmit again with
appropriate options. Is there any way through which I can tell SparkSubmit
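One option, assuming the job runs on YARN in cluster mode, is to kill it through the YARN CLI instead of SparkSubmit:

yarn application -list                  # find the application id
yarn application -kill <applicationId>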
For anyone interested, the problem ended up being that in some rare cases,
the value from the pair RDD on the right side of the left outer join was
Java's null. The Spark optionToOptional method attempted to apply Some()
to null, which caused the NPE to be thrown.
The lesson is to filter out any
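A sketch of that filter (names illustrative), applied to the right side before the join:

val cleanedRight = rightRdd.filter { case (_, v) => v != null }
val joined = leftRdd.leftOuterJoin(cleanedRight)

That way optionToOptional never sees a Java null to wrap in Some().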
Yeah, there isn't even an RC yet and no documentation, but you can work off
the code base and test suites:
https://github.com/apache/spark
And this might help:
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/streaming/DataFrameReaderWriterSuite.scala
On Fri,
Hi folks,
I'm getting the following exception:
Exception in thread "main" java.io.IOException: Cannot run program
"E:\Software\spark-1.6.1\bin\spark-submit.cmd": CreateProcess error=5,
Access is denied
    at java.lang.ProcessBuilder.start(Unknown Source)
    at
Spark 2.0 is yet to come out for public release.
I am waiting to get my hands on it as well.
Please do let me know if I can download the source and build Spark 2.0 from
GitHub.
Thanks
Deepak
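For reference, a sketch of building from source with the Maven wrapper that ships in the repo:

git clone https://github.com/apache/spark.git
cd spark
build/mvn -DskipTests clean package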
On Fri, May 6, 2016 at 9:51 PM, Sunita Arvind wrote:
> Hi All,
>
> We are evaluating a
Hi All,
We are evaluating a few real-time streaming query engines and Spark is my
personal choice. The addition of ad-hoc queries is what is getting me
further excited about it; however, the talks I have heard so far only
mention it but do not provide details. I need to build a prototype to
Hi Ted,
I am working on replicating the problem on a smaller scale.
I saw that Spark 2.0 is moving to Java 8 Optional instead of Guava
Optional, but in the meantime I'm stuck with 1.6.1.
-Adam
On Fri, May 6, 2016 at 9:40 AM, Ted Yu wrote:
> Is it possible to write a
Look carefully at the error message; the types you're passing in don't
match. For instance, you're passing in a message handler that returns
a tuple, but the RDD return type you're specifying (the 5th type
argument) is just String.
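A sketch of a call where the types line up (assumes Spark 1.6's spark-streaming-kafka, an existing ssc and kafkaParams; topic and offset values are hypothetical). The handler returns (String, String), so the 5th type argument is (String, String) rather than String:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val fromOffsets = Map(TopicAndPartition("mytopic", 0) -> 0L)

val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))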
On Fri, May 6, 2016 at 9:49 AM, Eric Friedman
Hi Matthias,
Say you have the following:
The "batch interval" is the basic interval at which the system will receive
the data in batches.
val ssc = new StreamingContext(sparkConf, Seconds(n))
// window length - the duration of the window below, which must be a
// multiple of the batch interval n
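For instance, a sketch where stream stands for any DStream built on this ssc:

// both the window length and the slide interval must be multiples of n
val windowed = stream.window(Seconds(3 * n), Seconds(2 * n))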
Is this a limit of Spark shuffle blocks currently?
On Tue, May 3, 2016 at 11:18 AM, Nirav Patel wrote:
> Hi,
>
> My Spark application is getting killed abruptly during a groupBy operation
> where shuffle happens. All shuffles happen with PROCESS_LOCAL locality. I
> see
Hi,
If I want a sliding average over 10 minutes for some keys, I can
do something like
groupBy(window(…), "my-key").avg("some-values") in Spark 2.0.
I am trying to implement this sliding average using Spark 1.6.x:
I tried reduceByKeyAndWindow but did not find a solution. IMO I
have to
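One way that works in 1.6.x is to carry (sum, count) per key through the window and divide at the end, since averages themselves don't combine. A sketch (names illustrative; assumes pairs is a DStream[(String, Double)]):

import org.apache.spark.streaming.Minutes

val avgByKey = pairs
  .map { case (k, v) => (k, (v, 1L)) }
  .reduceByKeyAndWindow(
    (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2),
    Minutes(10), // window length
    Minutes(1))  // slide interval
  .mapValues { case (sum, count) => sum / count }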
Thanks Shao for the quick reply. I will look into how PySpark jobs are executed.
Any suggestions or references to docs on how to tune PySpark jobs?
On Thu, May 5, 2016 at 10:12 PM -0700, "Saisai Shao" wrote:
Writing RDD based application
My build dependencies:
compile 'org.scala-lang:scala-library:2.10.4'
compile 'org.apache.spark:spark-core_2.10:1.6.1'
compile 'org.apache.spark:spark-sql_2.10:1.6.1'
compile 'org.apache.spark:spark-hive_2.10:1.6.1'
compile
Hello,
I've been using createDirectStream with Kafka and now need to switch to the
version of that API that lets me supply offsets for my topics. I'm unable
to get this to compile for some reason, even if I lift the very same usage
from the Spark test suite.
I'm calling it like this:
val
Hello everyone:
I'm running an experiment in a Spark cluster where some of the machines are
highly loaded with CPU-, memory- and network-consuming processes (let's call
them straggler machines).
Obviously the tasks on these machines take longer to execute than on other
nodes of the cluster.
Is it possible to write a short test which exhibits this problem?
For Spark 2.0, this part of the code has changed:
[SPARK-4819] Remove Guava's "Optional" from public API
FYI
On Fri, May 6, 2016 at 6:57 AM, Adam Westerman wrote:
> Hi,
>
> I’m attempting to do a left outer
Yeah, so that means the driver talked to Kafka and Kafka told it the
highest available offset was 2723431. Then when the executor tried to
consume messages, it stopped getting messages before reaching that
offset. That almost certainly means something's wrong with Kafka. Have
you looked at your
Hi all,
I have a Spark application running to which I submit jobs continuously.
These jobs use different instances of SQLContext, so the web UI of the
application fills up more and more with these instances.
Is there any way to prevent this? I don't want to see every created SQL
context in the web
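One thing worth trying (a sketch, assuming Spark 1.6): reuse the singleton SQLContext instead of constructing a new one per job.

val sqlContext = org.apache.spark.sql.SQLContext.getOrCreate(sc)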
Hi,
I’m attempting to do a left outer join in Spark, and I’m getting an NPE
that appears to be due to some Spark Java API bug. (I’m running Spark 1.6.0
in local mode on a Mac).
For a little background, the left outer join returns all keys from the left
side of the join regardless of whether or
Please see the doc at the beginning of RDD class:
 * A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable,
 * partitioned collection of elements that can be operated on in parallel. This class contains the
 * basic operations available on all RDDs, such
Hello All, I am trying to read text files from Amazon S3. Any solution for this
error ASAP:
Exception in thread "main" com.fasterxml.jackson.databind.JsonMappingException:
Could not find creator property with name 'id' (in class
org.apache.spark.rdd.RDDOperationScope)
at [Source:
Hi
Is there a way I can modify an RDD in a for-each loop?
Basically, I have a use case in which I need to perform multiple iterations
over the data and modify a few values in each iteration.
Please help.
Hi All,
I tried to run a simple Spark program to find out the metrics
collected while executing the program. What I observed is that I'm able to get
TaskMetrics.inputMetrics data like records read, bytes read, etc., but I do
not get any metrics about the output.
I ran the below code in
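For reference, a minimal listener sketch (assuming Spark 1.6; output metrics are only populated by tasks that actually write output, e.g. saveAsTextFile):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

sc.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // outputMetrics is an Option in 1.6; present only for writing tasks
    for (m <- Option(taskEnd.taskMetrics); out <- m.outputMetrics) {
      println(s"bytesWritten=${out.bytesWritten} recordsWritten=${out.recordsWritten}")
    }
  }
})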
Hi,
I am using Spark 1.6.1 for my streaming jobs with the Kinesis connector. Without
any extra configuration, the jobs run fine but they only show one line
in the UI.
I used to see the actual line number in the Python script in the previous
version. Please see the screenshot to understand what I mean.
Hi everybody! This code:
DataFrame df = sqlContext.read().json(FILE_NAME);
DataFrame profiles = df.select(
    column("_id"),
    struct(
        column("name.first").as("first_name"),
        column("name.last").as("last_name"),
        column("friends")
Hi,
I just stumbled upon a data quality check package for Spark:
https://github.com/FRosner/drunken-data-quality
Has anybody used it?
Would really appreciate any feedback.
Thanks,
Divya
Hi Madhukara,
What I understood from the code is that whenever runBatch returns, they
trigger constructBatch, so whatever the processing time for a batch is will
be your batch time if you don't specify a trigger.
One flaw I see in this is that if your processing time keeps increasing
with the amount of data
This is the complete error.
2016-05-06 11:18:05,424 [task-result-getter-0] INFO
org.apache.spark.scheduler.TaskSetManager - Finished task 5.0 in stage
13.0 (TID 60) in 11692 ms on xx (6/8)
2016-05-06 11:18:08,978 [task-result-getter-1] WARN
org.apache.spark.scheduler.TaskSetManager - Lost
With Structured Streaming, Spark would provide APIs over the Spark SQL engine.
It's like once you have the structured stream and a DataFrame created out of
it, you can do ad-hoc querying on the DF, which means you are actually
querying the stream without having to store or transform it.
I have not used
I think it's a Kafka error, but I'm starting to think it could
be something about Elasticsearch, since I have seen more people with the
same error using Elasticsearch. I have no idea.
2016-05-06 11:05 GMT+02:00 Guillermo Ortiz :
> I'm trying to read data from Spark and
I'm trying to read data from Spark and index it into ES with its library
(es-hadoop version 2.2.1).
It was working fine for a while, but now this error has started to happen.
I have deleted the checkpoint and even the Kafka topic and restarted all
the machines with Kafka and ZooKeeper, but it didn't
Hi,
As I was playing with the new Structured Streaming API, I noticed that Spark
starts processing as and when the data appears. It no longer seems like
micro-batch processing. Will Spark Structured Streaming be event-based
processing?
--
Regards,
Madhukara Phatak
http://datamantra.io/
I had the same issue while using Storm. Then I found that the number of Storm
spout instances should not be greater than the number of partitions;
if you increase it beyond that, the numbers won't match. Maybe you can check
something similar for Spark.
Regards,
Nirav
On May 5, 2016 9:48 PM, "Jerry"