Hi Gurus,
Please help.
But please don't tell me to use updateStateByKey, because I need a
global variable (something like the clock time) across the micro
batches that does not depend on key. For my case, it is not acceptable to
maintain a state for each key, since keys arrive at different times.
Hello Sparkers,
I would like to understand the difference between these storage levels for
the portion of an RDD that doesn't fit in memory.
It seems like in both storage levels, whatever portion doesn't fit in
memory will be spilled to disk. Is there any difference between them?
Thanks,
Harsha
MEMORY_ONLY will not spill: partitions that don't fit in memory are left
uncached and recomputed when needed, whereas MEMORY_AND_DISK will spill
them to disk.
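A minimal sketch of the two levels in use (assuming an existing SparkContext
sc and hypothetical input paths):

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs:///data/input")  // hypothetical path
    lines.persist(StorageLevel.MEMORY_ONLY)        // partitions that don't fit stay uncached and are recomputed
    val other = sc.textFile("hdfs:///data/other")
    other.persist(StorageLevel.MEMORY_AND_DISK)    // partitions that don't fit spill to disk
    lines.count()                                  // the first action actually materializes the cache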
Regards
Sab
On Tue, Aug 18, 2015 at 12:45 PM, Harsha HN 99harsha.h@gmail.com
wrote:
Hello Sparkers,
I would like to understand the difference between these storage levels for
the portion of an RDD that doesn't
I think you are mixing the notion of a job from the Hadoop MapReduce world
with Spark's. In Spark, RDDs are immutable and transformations are lazy, so
the first time an RDD actually fills up memory is when you run the first
action. After that, it stays in memory until either the application
is stopped
Hi all,
I'm using Spark SQL using data from Openstack Swift.
I'm trying to load parquet files with partition discovery, but I can't do
it when the partitions don't match between two objects.
For example, container which contains:
/zone=0/object2
/zone=0/area=0/object1
Won't load, and will
Hi,
The following is copied from the Spark event timeline UI. I don't understand
why there is overlap between tasks.
I think they should run sequentially, one by one, in one executor (there is
one core per executor).
The blue part of each task is the scheduler delay time. Does it mean it is the
On Tue, Aug 18, 2015 at 1:16 PM, Dawid Wysakowicz
wysakowicz.da...@gmail.com wrote:
No, the data is not stored between two jobs. But it is stored for the
lifetime of a job. A job can have multiple actions run.
I too thought so but wanted to confirm. Thanks.
For a matter of sharing an rdd
See if SparkContext.accumulator helps.
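A minimal sketch of that suggestion (assuming a DStream named stream; the
accumulator is a driver-side global that survives across micro batches):

    val clock = sc.accumulator(0L)   // hypothetical global counter, not keyed
    stream.foreachRDD { rdd =>
      clock += 1                     // updated on the driver once per batch
      val now = clock.value          // read back on the driver
    }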
On Tue, Aug 18, 2015 at 2:27 PM, Joanne Contact joannenetw...@gmail.com
wrote:
Hi Gurus,
Please help.
But please don't tell me to use updateStateByKey, because I need a
global variable (something like the clock time) across the micro
batches that does not
num-executors only works in YARN mode. In standalone mode, I have to set
--total-executor-cores and --executor-cores. Isn't this unintuitive? Is
there any reason for that?
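For example, a submission to a standalone master might look like this
(hypothetical master URL, class, and jar):

    spark-submit --master spark://master:7077 \
      --total-executor-cores 8 \
      --executor-cores 2 \
      --class com.example.MyApp myapp.jar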
One Spark application can have many jobs, e.g., first call rdd.count and then
call rdd.collect.
At 2015-08-18 15:37:14, Hemant Bhanawat hemant9...@gmail.com wrote:
It is still in memory for future rdd transformations and actions.
This is interesting. You mean Spark holds the data in memory
When I do an rdd.collect(), does the data move back to the driver, or is it
still held in memory across the executors?
Take a look at the doc for the method:
/**
* Applies a schema to an RDD of Java Beans.
*
* WARNING: Since there is no guaranteed ordering for fields in a Java Bean,
* SELECT * queries will return the columns in an undefined order.
* @group dataframes
* @since
No, the data is not stored between two jobs. But it is stored for the
lifetime of a job. A job can have multiple actions run.
For the matter of sharing an RDD between jobs you can have a look at Spark
Job Server (spark-jobserver https://github.com/ooyala/spark-jobserver) or
some in-memory storages:
It is still in memory for future rdd transformations and actions.
This is interesting. You mean Spark holds the data in memory between two
job executions. How does the second job get the handle of the data in
memory? I am interested in knowing more about it. Can you forward me a
spark article or
The solution how to share offsetRanges after DirectKafkaInputStream is
transformed is in:
https://github.com/apache/spark/blob/master/external/kafka/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala
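The pattern used there looks roughly like this (a sketch, assuming a direct
stream named directStream):

    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    var offsetRanges = Array.empty[OffsetRange]
    directStream.transform { rdd =>
      // grab the ranges while the RDD still has its Kafka-specific type
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.foreachRDD { rdd =>
      offsetRanges.foreach { o =>
        println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
      }
    }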
It is still in memory for future rdd transformations and actions. What you
get in the driver is a copy of the data.
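A small sketch of the distinction (assuming an already-built RDD named rdd):

    val cached = rdd.cache()        // mark for in-memory storage
    cached.count()                  // materializes the cached blocks on the executors
    val local = cached.collect()    // ships a copy to the driver; executor blocks stay cached
    cached.map(_.toString).count()  // reuses the in-memory blocks, no recomputation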
Regards
Sab
On Tue, Aug 18, 2015 at 12:02 PM, praveen S mylogi...@gmail.com wrote:
When I do an rdd.collect(), does the data move back to the driver, or is it
still held in memory across the
Please check the logs in your Hadoop YARN cluster; you will find the precise
error or exception there.
It is definitely not the case for Spark SQL. A temporary table (much like a
DataFrame) is just a logical plan with a name, and it is not evaluated
unless a query is fired on it.
I am not sure that using rdd.take in the Python code to verify the schema is
the right approach, as it creates a Spark job.
BTW, why
I have an RDD which I am using to create a DataFrame based on a POJO, but
when the DataFrame is created, the column order gets changed.
DataFrame df = sqlCtx.createDataFrame(rdd, Pojo.class);
String[] columns = df.columns();
// columns here are in a different order from what was defined in
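Given the scaladoc warning quoted earlier (bean field order is not
guaranteed), a workaround is to pin the order explicitly with a select. A
sketch in Scala (hypothetical column names):

    val ordered = df.select("id", "name", "salary")  // columns now come back in this order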
Refer: http://spark.apache.org/docs/latest/hadoop-provided.html
Specifically if you want to refer to s3a paths. Please edit spark-env.sh and
add the following lines at the end:
SPARK_DIST_CLASSPATH=$(/path/to/hadoop/hadoop-2.7.1/bin/hadoop classpath)
export
By default, standalone mode creates one executor on every worker machine
per application.
The overall number of cores is configured with --total-executor-cores,
so in general if you specify --total-executor-cores=1 then there will
be only one core on some executor and you'll get what you want.
On the other
Still not sure what you are trying to achieve. If you could post some code that
doesn’t work the community can help you understand where the error (syntactic
or conceptual) is.
On 17 Aug 2015, at 17:42, dianweih001 [via Apache Spark User List]
ml-node+s1001560n24299...@n3.nabble.com wrote:
My company is interested in building a real-time time-series querying solution
using Spark and Cassandra. Specifically, we're interested in setting up a
Spark system against Cassandra running a hive thrift server. We need to be
able to perform real-time queries on time-series data - things
Do you mind providing a bit more information ?
release of Spark
code snippet of your app
version of Java
Thanks
On Tue, Aug 18, 2015 at 8:57 AM, unk1102 umesh.ka...@gmail.com wrote:
Hi, this GC overhead limit error is making me crazy. I have 20 executors
using 25 GB each. I don't understand
Hi Guru,
Thanks! Great to hear that someone tried it in production. How do you like
it so far?
Best Regards,
Jerry
On Tue, Aug 18, 2015 at 11:38 AM, Guru Medasani gdm...@gmail.com wrote:
Hi Jerry,
Yes. I’ve seen customers using this in production for data science work.
I’m currently
But KafkaRDD[K, V, U, T, R] is not a subclass of RDD[R]: Java does not
support this kind of generic inheritance, so a derived class cannot return a
differently parameterized generic subclass from an overridden method.
On Wed, Aug 19, 2015 at 12:02 AM, Cody Koeninger c...@koeninger.org wrote:
Option is covariant and
Nope. The count action did not help it choose a broadcast join.
All of my tables are Hive external tables, so I tried to trigger compute
statistics from sqlContext.sql. It gives me an error saying "no such table".
I am not sure whether that is due to the following bug in 1.4.1
Option is covariant and KafkaRDD is a subclass of RDD
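A minimal illustration of the covariance point (Option[+A] is covariant, so
an Option of a subclass conforms to an Option of the superclass):

    class Fruit
    class Apple extends Fruit

    val a: Option[Apple] = Some(new Apple)
    val f: Option[Fruit] = a  // compiles in Scala; Java generics cannot express this

    // likewise Option[KafkaRDD[K, V, U, T, R]] conforms to Option[RDD[R]],
    // which is why the compute override type-checks in Scala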
On Tue, Aug 18, 2015 at 1:12 PM, Shushant Arora shushantaror...@gmail.com
wrote:
Is it that in Scala a derived class is allowed to have a different return
type? And the streaming jar is originally written in Scala, so it's allowed
for
Refer this post
http://blog.prabeeshk.com/blog/2015/06/19/pyspark-notebook-with-docker/
Spark + Jupyter + Docker
On 18 August 2015 at 21:29, Jerry Lam chiling...@gmail.com wrote:
Hi Guru,
Thanks! Great to hear that someone tried it in production. How do you like
it so far?
Best Regards,
Normally people would establish a Maven project with Spark dependencies, or
use sbt.
Can you go with either approach ?
Cheers
On Tue, Aug 18, 2015 at 10:28 AM, Jerry jerry.c...@gmail.com wrote:
Hello,
So I set up Spark to run on my local machine to see if I can reproduce the
issue I'm having
Is there a way to store all the results in one file and keep the file
roll-over separate from the Spark Streaming batch interval?
On Mon, Aug 17, 2015 at 2:39 AM, UMESH CHAUDHARY umesh9...@gmail.com
wrote:
In Spark Streaming you can simply check whether your RDD contains any
records or not and
Hi
I am trying to compute stats on a lookup table from Spark; the table resides
in Hive. I am invoking the Spark API as follows, and it gives me a
NoSuchTableException. The table is double-verified, and the subsequent
statement sqlContext.sql("select * from cpatext.lkup") picks up the table
correctly. I am
just looking at the thread dump from your original email, the 3 executor
threads are all trying to load classes. (One thread is actually loading
some class, and the others are blocked waiting to load a class, most likely
trying to load the same thing.) That is really weird, definitely not
Hello,
So I set up Spark to run on my local machine to see if I can reproduce the
issue I'm having with data frames, but I'm running into issues with the
compiler.
Here's what I got:
$ echo $CLASSPATH
Hi, Imran:
Thanks for your reply. I am not sure what you mean by repl. Can you give more
detail about that?
This only happens when Spark 1.2.2 tries to scan a big data set; I cannot
reproduce it when it scans a smaller dataset.
FYI, I have to build and deploy Spark 1.3.1 on our production cluster.
Hi, thank you for the further assistance.
You can reproduce this by simply running
5 match { case java.math.BigDecimal => 2 }
In my personal case, I am applying a map action to a Seq[Any], so the
elements inside are of type Any, to which I need to apply a proper
.asInstanceOf[WhoYouShouldBe].
Saif
Hi Canan,
This is mainly for legacy reasons. The default behavior in standalone mode
is that the application grabs all available resources in the cluster.
This effectively means we want one executor per worker, where each executor
grabs all the available cores and memory on that worker. In
Hi,
First you need to make your SLAs clear. It does not sound to me like they are
well defined, or that your solution is necessary for the scenario. I also
find it hard to believe that one customer has 100 million transactions per
month.
Time series data is easy to precalculate - you do not
Hi Jorn,
Of course we're planning on doing a proof of concept here - the difficulty is
that our timeline is short, so we cannot afford too many PoCs before we have
to make a decision. We also need to figure out *which* databases to include
in a proof of concept.
Note that one tricky aspect of our problem
Under the covers we use Jackson's Streaming API as of Spark 1.4.
On Tue, Aug 18, 2015 at 1:12 PM, Udit Mehta ume...@groupon.com wrote:
Hi,
I was wondering what JSON serde Spark SQL uses. I created a JsonRDD
out of a JSON file and then registered it as a temp table to query. I can
then
Hi Muhammad,
On a high level, in hash-based shuffle each mapper M writes R shuffle
files, one for each reducer where R is the number of reduce partitions.
This results in M * R shuffle files. Since it is not uncommon for M and R
to be O(1000), this quickly becomes expensive: 1000 mappers and 1000 reducers
already mean a million shuffle files. An optimization with
Hi,
I was wondering what JSON serde Spark SQL uses. I created a JsonRDD out
of a JSON file and then registered it as a temp table to query. I can then
query the table using dot notation for nested structs/arrays. I was
wondering how Spark SQL deserializes the JSON data based on the query.
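For reference, a minimal sketch of that dot-notation querying (hypothetical
file and field names):

    val df = sqlContext.read.json("events.json")          // schema is inferred, nested structs included
    df.registerTempTable("events")
    sqlContext.sql("SELECT payload.user.id FROM events")  // dot notation into a nested struct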
Hi Axel,
You can try setting `spark.deploy.spreadOut` to false (through your
conf/spark-defaults.conf file). What this does is essentially try to
schedule as many cores on one worker as possible before spilling over to
other workers. Note that you *must* restart the cluster through the sbin
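A sketch of the setting (in conf/spark-defaults.conf on the standalone
master; as noted above, the cluster must then be restarted):

    spark.deploy.spreadOut  false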
On Tue, Aug 18, 2015 at 12:59 PM, saif.a.ell...@wellsfargo.com wrote:
5 match { case java.math.BigDecimal => 2 }
5 match { case _: java.math.BigDecimal => 2 }
--
Marcelo
On Tue, Aug 18, 2015 at 1:19 PM, saif.a.ell...@wellsfargo.com wrote:
Hi, Can you please elaborate? I am confused :-)
You did note that the two pieces of code are different, right?
See http://docs.scala-lang.org/tutorials/tour/pattern-matching.html
for how to match things in Scala, especially
Hi Satish,
The problem is that `--jars` accepts a comma-delimited list of jars! E.g.
spark-submit ... --jars lib1.jar,lib2.jar,lib3.jar main.jar
where main.jar is your main application jar (the one that starts a
SparkContext), and lib*.jar refer to additional libraries that your main
Are you using the Flume polling stream or the older stream?
Such problems of binding used to occur in the older push-based approach,
hence we built the polling stream (pull-based).
On Tue, Aug 18, 2015 at 4:45 AM, diplomatic Guru diplomaticg...@gmail.com
wrote:
I'm testing the Flume + Spark
sorry, by repl I mean spark-shell, I guess I'm used to them being used
interchangeably. From that thread dump, the one thread that isn't stuck is
trying to get classes specifically related to the shell / repl:
java.lang.Thread.State: RUNNABLE
at
Could you share your pattern matching expression that is failing?
On Tue, Aug 18, 2015, 3:38 PM saif.a.ell...@wellsfargo.com wrote:
Hi all,
I am trying to run a Spark job in which I receive *java.math.BigDecimal*
objects,
instead of the Scala equivalents, and I am trying to convert them
Hi All,
Please let me know if there are any arguments to be passed on the CLI to
retrieve the FULL STACK TRACE in Apache Spark.
I am stuck on an issue for which it would be helpful to analyze the full
stack trace.
Regards,
Satish Chandra
If your error is on the executors, you need to check the executor logs for
the full stack trace.
On Tue, Aug 18, 2015 at 10:01 PM, satish chandra j jsatishchan...@gmail.com
wrote:
Hi All,
Please let me know if there are any arguments to be passed on the CLI to
retrieve the FULL STACK TRACE in Apache Spark.
I am stuck on
Hi, Can you please elaborate? I am confused :-)
Saif
-Original Message-
From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: Tuesday, August 18, 2015 5:15 PM
To: Ellafi, Saif A.
Cc: wrbri...@gmail.com; user@spark.apache.org
Subject: Re: Scala: How to match a java object
On Tue,
Hi all,
I was trying to use GraphX to compute PageRank and found that the PageRank
value for several vertices is NaN.
I am using Spark 1.3. Any idea how to fix that?
--
Thanks,
-Khaled
Hi Andreas,
I believe the distinction is not between standalone and YARN mode, but
between client and cluster mode.
In client mode, your Spark submit JVM runs your driver code. In cluster
mode, one of the workers (or NodeManagers if you're using YARN) in the
cluster runs your driver code. In the
Hi all,
I'm trying to run a spark job (written in scala) that uses addFile to
download some small files to each node. However, one of the downloaded
files has an incorrect size (the other ones are ok), which causes an error
when using it in the code.
I have looked more into the issue and
Hi Saif,
Would this work?
import scala.collection.JavaConversions._
new java.math.BigDecimal(5) match { case x: java.math.BigDecimal =>
x.doubleValue }
It gives me, on the Scala console:
res9: Double = 5.0
Assuming you had a stream of BigDecimals, you could just call map on it.
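For example, a minimal sketch on a plain Seq:

    val bigDecimals = Seq(new java.math.BigDecimal(1), new java.math.BigDecimal(2))
    val doubles = bigDecimals.map(_.doubleValue)  // Seq[Double]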
Hi All,
Why am I getting ExecutorLostFailure, with executors completely lost for the
rest of the processing? Eventually it makes the job fail. One thing is for
sure: a lot of shuffling happens across executors in my program.
Is there a way to understand and debug ExecutorLostFailure? Any pointers
I think I found the answer.
On the UI, the recorded time of each task is when it is put into the thread
pool. Then the UI makes sense.
At 2015-08-18 17:40:07, Todd bit1...@163.com wrote:
Hi,
The following is copied from the Spark event timeline UI. I don't understand
why there is overlap
Thanks. I tried. The problem is that I have to use updateStateByKey to
maintain other states related to keys.
I am not sure where to pass this accumulator variable into updateStateByKey.
On Tue, Aug 18, 2015 at 2:17 AM, Hemant Bhanawat hemant9...@gmail.com wrote:
See if SparkContext.accumulator helps.
On
Hi everyone,
I have two questions regarding the random forest implementation in MLlib.
1- maxBins: Say the value of a feature is between [0,100]. In my dataset
there are a lot of data points between [0,10], one data point at 100, and
nothing between (10, 100). I am wondering how does the
All of you are right.
I was trying to create too many producers. My idea was to create a pool (for
now the pool contains only one producer) shared by all the executors.
Then I realized it was related to the serializability issues (though I did
not find clear clues in the source code to indicate the
Hi Prabeesh,
That's even better!
Thanks for sharing
Jerry
On Tue, Aug 18, 2015 at 1:31 PM, Prabeesh K. prabsma...@gmail.com wrote:
Refer this post
http://blog.prabeeshk.com/blog/2015/06/19/pyspark-notebook-with-docker/
Spark + Jupyter + Docker
On 18 August 2015 at 21:29, Jerry Lam
Thank you Tathagata for your response. Yes, I'm using the push model on Spark
1.2. For my scenario I do prefer the push model. Is this the case in the
later version 1.4 too?
I think I can find a workaround for this issue but only if I know how to
obtain the worker(executor) ID. I can get the detail
Usually more information as to the cause of this will be found down in your
logs. I generally see this happen when an out of memory exception has
occurred for one reason or another on an executor. It's possible your
memory settings are too small per executor or the concurrent number of
tasks you
I don't think there is a super clean way of doing this. Here is an idea:
run a dummy job with a large number of partitions/tasks, which will access
SparkEnv.get.blockManager().blockManagerId().host() and return it.
sc.makeRDD(1 to 100, 100).map { _ =>
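A sketch of how that dummy job might be completed (assumption: collecting the
distinct executor hosts back on the driver):

    import org.apache.spark.SparkEnv

    val hosts = sc.makeRDD(1 to 100, 100).map { _ =>
      SparkEnv.get.blockManager.blockManagerId.host
    }.collect().distinct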
Why are you even trying to broadcast a producer? A broadcast variable is
some immutable piece of serializable DATA that can be used for processing
on the executors. A Kafka producer is neither DATA nor immutable, and
definitely not serializable.
The right way to do this is to create the producer
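A minimal sketch of that idea: keep the producer in a lazily initialized
holder object, so each executor JVM builds its own instance and nothing is
serialized (hypothetical broker, topic, and DStream names):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    object ProducerHolder {
      lazy val producer = {  // created once per executor JVM on first use
        val props = new Properties()
        props.put("bootstrap.servers", "broker:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        new KafkaProducer[String, String](props)
      }
    }

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        records.foreach { r =>
          ProducerHolder.producer.send(new ProducerRecord[String, String]("out-topic", r.toString))
        }
      }
    }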
For python it is really great.
There is some work in progress in bringing Scala support to Jupyter as well.
https://github.com/hohonuuli/sparknotebook
https://github.com/alexarchambault/jupyter-scala
Hey,
Actually, for Scala, you'd be better off using
https://github.com/andypetrella/spark-notebook/
It's deployed at several places like *Alibaba*, *EBI*, *Cray* and is
supported by both the Scala community and the company Data Fellas.
For instance, it was part of the Big Scala Pipeline training given
Hi,
I see the following error in my Spark Job even after using like 100 cores
and 16G memory. Did any of you experience the same problem earlier?
15/08/18 21:51:23 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block
input-0-1439959114400, and will not retry (0 retries)
Of course, Java or Scala can do that:
1) Create a FileWriter with an append or roll-over option
2) For each RDD, create a StringBuilder after applying your filters
3) Write this StringBuilder to the file when you want to write (the duration
can be defined as a condition)
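A minimal sketch of these steps (assuming a DStream of strings named lines
and a hypothetical output path; note that collect() pulls each batch to the
driver):

    import java.io.FileWriter

    lines.foreachRDD { rdd =>
      val batch = rdd.filter(_.nonEmpty).collect()             // apply your filters
      if (batch.nonEmpty) {
        val writer = new FileWriter("/tmp/results.txt", true)  // true = append across batches
        try batch.foreach(line => writer.write(line + "\n"))
        finally writer.close()
      }
    }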
On Tue, Aug 18, 2015 at 11:05 PM,
As long as the Kafka producer is thread-safe you don't need any pool at all.
Just share a single producer on every executor. Please look at my blog post
for more details. http://allegro.tech/spark-kafka-integration.html
On 19 Aug 2015 at 2:00 AM, Shenghua(Daniel) Wan wansheng...@gmail.com wrote:
All of
Hi,
Does anyone have an example of how to create a DataFrame in SparkR which
specifies the column names? The csv files I have do not have column names
in the first row. I can read a csv nicely with
com.databricks:spark-csv_2.10:1.0.3, but I end up with column names C1, C2,
C3, etc.
thanks
I am using Spark 1.4.1 , in stand-alone mode, on a cluster of 3 nodes.
Using Spark sql and Hive Context, I am trying to run a simple scan query on
an existing Hive table (which is an external table consisting of rows in
text files stored in HDFS - it is NOT parquet, ORC or any other richer
It's a cool blog post! Tweeted it!
Broadcasting the configuration necessary for lazily instantiating the
producer is a good idea.
Nitpick: The first code example has an extra `}` ;)
On Tue, Aug 18, 2015 at 10:49 PM, Marcin Kuthan marcin.kut...@gmail.com
wrote:
As long as the Kafka producer is
The superclass method in DStream is defined as returning an Option[RDD[T]]
On Tue, Aug 18, 2015 at 8:07 AM, Shushant Arora shushantaror...@gmail.com
wrote:
Getting a compilation error while overriding the compute method of
DirectKafkaInputDStream.
[ERROR]
Looks like Scala 2.11.6 and Java 1.7.0_79.
✔ ~
09:17 $ scala
Welcome to Scala version 2.11.6 (Java HotSpot(TM) 64-Bit Server VM, Java
1.7.0_79).
Type in expressions to have them evaluated.
Type :help for more information.
scala
✔ ~
09:26 $ echo $JAVA_HOME
Python version has been available since 1.4. It should be close to feature
parity with the jvm version in 1.5
On Tue, Aug 18, 2015 at 9:36 AM, ayan guha guha.a...@gmail.com wrote:
Hi Cody
A non-related question. Any idea when Python-version of direct receiver is
expected? Me personally
looking at source code of
org.apache.spark.streaming.kafka.DirectKafkaInputDStream
override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = {
val untilOffsets = clamp(latestLeaderOffsets(maxRetries))
val rdd = KafkaRDD[K, V, U, T, R](
context.sparkContext,
Hi spark users and developers,
Has anyone deployed IPython Notebook (Jupyter) in production, using Spark as
the computational engine?
I know Databricks Cloud provides similar features with deeper integration
with Spark. However, Databricks Cloud has to be hosted by Databricks so we
The solution you found is also in the docs:
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
Java uses an atomic reference because Java doesn't allow you to close over
non-final references.
I'm not clear on your other question.
On Tue, Aug 18, 2015 at 3:43 AM, Petr Novak
Hi
Is there a way to execute spark jobs with Java 8 lambdas instead of
using anonymous inner classes as seen in the examples?
I think I remember seeing real lambdas in the examples before and in
articles [1]?
Cheers,
-Kristoffer
[1]
Hi,
Currently, I have my data in an Elasticsearch cluster and I am trying to use
Spark to analyze it.
The Elasticsearch cluster and the Spark cluster are two different clusters,
and I use a Hadoop input format (es-hadoop) to read the data in ES.
I am wondering how this environment affects
Yes, it should Just Work. Lambdas can be used for any method that
takes an instance of an interface with one method, and that describes
Function, PairFunction, etc.
On Tue, Aug 18, 2015 at 3:23 PM, Kristoffer Sjögren sto...@gmail.com wrote:
Hi
Is there a way to execute spark jobs with Java 8
Hi all,
I am trying to run a Spark job in which I receive java.math.BigDecimal
objects instead of the Scala equivalents, and I am trying to convert them
into Doubles.
If I try to match-case this object class, I get: error: object
java.math.BigDecimal is not a value
How could I get around
Hi,
Did anyone see java.util.ConcurrentModificationException when using
broadcast variables?
I encountered this exception when wrapping a Kafka producer like this in
the spark streaming driver.
Here is what I did.
KafkaProducer<String, String> producer = new KafkaProducer<String, String>(properties);
I wouldn't expect a kafka producer to be serializable at all... among other
things, it has a background thread
On Tue, Aug 18, 2015 at 4:55 PM, Shenghua(Daniel) Wan wansheng...@gmail.com
wrote:
Hi,
Did anyone see java.util.ConcurrentModificationException when using
broadcast variables?
I
Hi,
I'd like to know where I can find more information related to the
deprecation of the actor system in Spark (from 1.4.x).
I'm interested in the reasons for this decision.
Cheers
Hi Jerry,
Yes. I’ve seen customers using this in production for data science work. I’m
currently using this for one of my projects on a cluster as well.
Also, here is a blog that describes how to configure this.
Hi, this GC overhead limit error is making me crazy. I have 20 executors
using 25 GB each. I don't understand at all how it can throw a GC overhead
error; my datasets are not even that big. Once this GC error occurs in an
executor it will get lost, and slowly other executors get lost because of
IOException,