You can use persist(StorageLevel.MEMORY_AND_DISK) if you do not have
sufficient memory to cache everything.
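For example, a minimal sketch (inputRdd is just a placeholder for whatever you are caching):

import org.apache.spark.storage.StorageLevel

val cached = inputRdd.persist(StorageLevel.MEMORY_AND_DISK)
cached.count()  // partitions that don't fit in memory spill to disk instead of being recomputed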
Thanks
Best Regards
On Fri, Feb 27, 2015 at 7:20 PM, Siddharth Ubale
siddharth.ub...@syncoms.com wrote:
Hi,
How do we manage putting partial data into memory and partial into
Dear all,
We mainly do large-scale computer vision tasks (image classification,
retrieval, ...). The pipeline is really great stuff for that. We're trying
to reproduce the tutorial given on that topic during the latest Spark
Summit (
Hi, All
I was trying to run the Spark SQL CLI on Windows 8 for debugging purposes;
however, it seems JLine hangs waiting for input after the ENTER key. I didn't see
that under Linux. Has anybody met the same issue?
The call stack is as below:
main prio=6 tid=0x02548800 nid=0x17cc runnable
The coarsest level at which you can parallelize is the topic. Topics are
all but unrelated to each other, so they can be consumed independently. But
you can parallelize within the context of a topic too.
A Kafka group ID defines a consumer group. One consumer in a group
receives each message sent to the topic
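For illustration, a rough sketch of per-topic parallelism with the receiver-based API (ssc, zkQuorum and the topic names are placeholders):

import org.apache.spark.streaming.kafka.KafkaUtils

// one receiver per topic, all in the same consumer group, then union the streams
val topics = Seq("topicA", "topicB")
val streams = topics.map(t => KafkaUtils.createStream(ssc, zkQuorum, "my-group", Map(t -> 1)))
val unified = ssc.union(streams)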
This was what I was thinking but wanted to verify. Thanks Sean!
On Fri, Feb 27, 2015 at 9:56 PM, Sean Owen so...@cloudera.com wrote:
The coarsest level at which you can parallelize is the topic. Topics are
all but unrelated to each other, so they can be consumed independently. But
you can parallelize
It works after adding the -Djline.terminal=jline.UnsupportedTerminal
-Original Message-
From: Cheng, Hao [mailto:hao.ch...@intel.com]
Sent: Saturday, February 28, 2015 10:24 AM
To: user@spark.apache.org
Subject: JLine hangs under Windows8
Hi, All
I was trying to run the Spark SQL CLI
Hi,
I ran a Spark job. Each executor is allocated a chunk of input data. For
the executor with a small chunk of input data, the performance is reasonably
good. But for the executor with a large chunk of input data, the
performance is not good. How can I tune Spark configuration parameters to
This seems like a job for userClassPathFirst. Or could be. It's
definitely an issue of visibility between where the serializer is and
where the user class is.
At the top you said, Pat, that you didn't try this, but why not?
On Fri, Feb 27, 2015 at 10:11 PM, Pat Ferrel p...@occamsmachete.com wrote:
RDD is not thread-safe. You should not use it in multiple threads.
Best Regards,
Shixiong Zhu
2015-02-27 23:14 GMT+08:00 rok rokros...@gmail.com:
I'm seeing this "java.util.NoSuchElementException: key not found:" exception
pop up sometimes when I run operations on an RDD from multiple threads
I was actually just able to reproduce the issue. I do wonder if this is a
bug -- the docs say "When not configured by the hive-site.xml, the context
automatically creates metastore_db and warehouse in the current directory."
But as you can see from the message, warehouse is not in the current
Hi,
I know that Spark on YARN has a configuration parameter (executor-cores NUM) to
specify the number of cores per executor.
How about Spark standalone? I can specify the total cores, but how could I know
how many cores each executor will take (presuming one node, one executor)?
No, it should not be that slow. On my Mac, it took 1.4 minutes to do
`rdd.count()` on a 4.3 GB text file (roughly 25 MB/s per CPU).
Could you turn on profiling in PySpark to see what is happening in the Python process?
spark.python.profile = true
On Fri, Feb 27, 2015 at 4:14 PM, Guillaume Guy
Please post your logs; you can find the logs as shown below:
# An error report file with more information is saved as:
# /Users/anupamajoshi/spark-1.2.0-bin-hadoop2.4/bin/hs_err_pid4709.log
Well, that would just show the JVM bug. This isn't a Spark issue. The JVM
crashes, and not because of some native code used by Spark.
On Feb 28, 2015 2:04 AM, amoners amon...@lwjendure.com wrote:
Please post your logs; you can find the logs as shown below:
# An error report file with more information is
I'm streaming data from a Kafka topic using KafkaUtils, doing some
computation, and writing records to HBase.
The storage level is memory-and-disk-ser.
On 27 Feb 2015 16:20, Akhil Das ak...@sigmoidanalytics.com wrote:
You could be hitting this issue
https://issues.apache.org/jira/browse/SPARK-4516
How many machines are on the cluster?
And what is the configuration of those machines (Cores/RAM)?
"Small cluster" is a very subjective statement.
Guillaume Guy wrote:
Dear Spark users:
I want to see if anyone has an idea of the performance for a small
cluster.
Also, my job is map-only, so there is no shuffle/reduce phase.
On Fri, Feb 27, 2015 at 7:10 PM, Mukesh Jha me.mukesh@gmail.com wrote:
I'm streaming data from a Kafka topic using KafkaUtils, doing some
computation, and writing records to HBase.
The storage level is memory-and-disk-ser.
On 27 Feb
Currently, if you use accumulators inside actions (like foreach) you have
a guarantee that, even if a partition is recalculated, the values will be
correct. The same thing does NOT apply to transformations, and you cannot
rely 100% on the values.
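A minimal sketch of the distinction (rdd here is a placeholder):

val processed = sc.accumulator(0L)

// inside an action: the increment is counted once per record, even if a partition is re-run
rdd.foreach { _ => processed += 1L }

// inside a transformation: the increment may be applied again on recomputation,
// so you cannot fully rely on the final value
val counted = rdd.map { x => processed += 1L; x }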
Pawel Szulc
Fri, 27 Feb 2015, 4:54 PM Darin McBeath
Hey Darin,
Record count metrics are coming in Spark 1.3. Can you wait until it is
released? Or do you need a solution in older versions of Spark?
Kostas
On Friday, February 27, 2015, Darin McBeath ddmcbe...@yahoo.com.invalid
wrote:
I have a fairly large Spark job where I'm essentially
Thanks for your quick reply. Yes, that would be fine. I would rather wait/use
the optimal approach as opposed to hacking some one-off solution.
Darin.
From: Kostas Sakellis kos...@cloudera.com
To: Darin McBeath ddmcbe...@yahoo.com
Cc: User
Thanks for coming back to the list with a response!
Fri, 27 Feb 2015, 3:16 PM Himanish Kushary, user himan...@gmail.com
wrote:
Hi,
I was able to solve the issue. Putting down the settings that worked for
me.
1) It was happening due to the large number of partitions. I coalesce'd
Yes, I used both.
The discussion on this seems to be at github now:
https://github.com/apache/spark/pull/4780
I am using more classes from the same package from which Spark uses HyperLogLog.
So we are both including the jar file, but Spark is excluding the dependent
package that is required.
Available in GML --
http://x10-lang.org/x10-community/applications/global-matrix-library.html
We are exploring how to make it available within Spark. Any ideas would
be much appreciated.
On 2/27/15 7:01 AM, shahab wrote:
Hi,
I just wonder if there is any Sparse Matrix implementation
Hi,
I was able to solve the issue. Putting down the settings that worked for me.
1) It was happening due to the large number of partitions. I coalesce'd
the RDD as early as possible in my code into far fewer partitions (used
.coalesce(1) to bring it down from 500K to 10k; a small sketch follows below).
2) Increased the
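A minimal sketch of the coalesce step from point 1 (hugeRdd and the target count are placeholders):

// shrink the partition count as early as possible; coalesce avoids a full shuffle by default
val fewer = hugeRdd.coalesce(10000)  // e.g. from ~500K partitions down to ~10K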
Thanks a lot Vijay, let me see how it performs.
Best
Shahab
On Friday, February 27, 2015, Vijay Saraswat vi...@saraswat.org wrote:
Available in GML --
http://x10-lang.org/x10-community/applications/global-matrix-library.html
We are exploring how to make it available within Spark. Any ideas
That's very slow, and there are a lot of possible explanations. The
first one that comes to mind is: I assume your YARN and HDFS are on
the same machines, but are you running executors on all HDFS nodes
when you run this? If not, a lot of these reads could be remote.
You have 6 executor slots,
Thanks,
But do you know if access to CoordinateMatrix elements is almost as fast
as access to a normal matrix, or does it have access times similar to an RDD (relatively slow)?
I am looking for a fast-access sparse matrix data structure.
On Friday, February 27, 2015, Peter Rudenko petro.rude...@gmail.com
Try using Breeze (a Scala linear algebra library).
On Fri, Feb 27, 2015 at 5:56 PM, shahab shahab.mok...@gmail.com wrote:
Thanks a lot Vijay, let me see how it performs.
Best
Shahab
On Friday, February 27, 2015, Vijay Saraswat vi...@saraswat.org wrote:
Available in GML --
Hi Kundan,
Sorry, I am also facing a similar issue today. How did you resolve
this issue?
Regards,
Sandeep.v
On Thu, Feb 26, 2015 at 2:25 AM, Michael Armbrust mich...@databricks.com
wrote:
It looks like that is getting interpreted as a local path. Are you
missing a core-site.xml file
Dear Spark users:
I want to see if anyone has an idea of the performance for a small cluster.
Reading from HDFS, what should be the performance of a count() operation
on a 10GB RDD with 100M rows using PySpark? I looked at the CPU usage;
all 6 are at 100%.
Details:
- master yarn-client
Hi,
How do we manage putting partial data into memory and partial onto disk where the
data resides in a Hive table?
We have tried using the available documentation but are unable to go ahead with the
above approach; we are only able to cache the entire table or uncache it.
Thanks,
Siddharth Ubale,
Somehow my posts are not getting accepted, and replies are not visible here.
But I got the following reply from Zhan.
From Zhan Zhang's reply, yes, I still get Parquet's advantage.
My next question is, if I operate on a SchemaRDD, will I get the advantage of
Spark SQL's in-memory columnar store
Why would you want to use Spark to sequentially process your entire data
set? The entire purpose is to let you do distributed processing -- which
means letting partitions get processed simultaneously by different cores /
nodes.
That being said, occasionally in a bigger pipeline with a lot of
Thanks Michael! It worked. Somehow my mails are not getting accepted by the Spark
user mailing list. :(
From: mich...@databricks.com
Date: Thu, 26 Feb 2015 17:49:43 -0800
Subject: Re: group by order by fails
To: tridib.sama...@live.com
CC: ak...@sigmoidanalytics.com; user@spark.apache.org
Assign
As you suggested, I tried saving the grouped RDD and persisting it in
memory before the iterations begin. The performance seems to be much better
now.
My previous comment that the run times doubled was based on a wrong observation.
Thanks.
On Fri, Feb 27, 2015 at 10:27 AM, Vijayasarathy Kannan
Thanks.
I tried persist() on the RDD. The runtimes appear to have doubled now
(without persist() it was ~7s per iteration and now it's ~15s). I am running
standalone Spark on an 8-core machine.
Any thoughts on why the runtime increased?
On Thu, Feb 26, 2015 at 4:27 PM, Imran Rashid
Hi Yana,
I have removed hive-site.xml from the spark/conf directory but am still getting
the same errors. Any other way to work around this?
Regards,
Sandeep
On Fri, Feb 27, 2015 at 9:38 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
I think you're mixing two things: the docs say "When *not* configured
Hi Sparkers,
I am using Hive version 0.13, copied hive-site.xml into spark/conf,
and am using the default Derby local metastore.
While creating a table in the spark shell I am getting the following error. Can
anyone please take a look and give a solution ASAP?
sqlContext.sql("CREATE TABLE IF NOT EXISTS
I'm seeing this "java.util.NoSuchElementException: key not found:" exception
pop up sometimes when I run operations on an RDD from multiple threads in a
Python application. It ends up shutting down the SparkContext, so I'm
assuming this is a bug -- from what I understand, I should be able to run
I think you're mixing two things: the docs say "When *not* configured by
the hive-site.xml, the context automatically creates metastore_db and
warehouse in the current directory." AFAIK, if you want a local metastore,
you don't put hive-site.xml anywhere. You only need the file if you're
going to
Hi,
I am trying to do this in spark-shell:
val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
val listTables = hiveCtx.hql("show tables")
The second line fails to execute with this message:
warning: there were 1 deprecation warning(s); re-run with -deprecation for
details
I have a fairly large Spark job where I'm essentially creating quite a few
RDDs, doing several types of joins using these RDDs, and producing a final RDD which
I write back to S3.
Along the way, I would like to capture record counts for some of these RDDs. My
initial approach was to use the count
Hi Joe, you might increase spark.yarn.executor.memoryOverhead to see if it
fixes the problem. Please take a look at this report:
https://issues.apache.org/jira/browse/SPARK-4996
Hope this helps.
On Tue, Feb 24, 2015 at 2:05 PM, Yiannis Gkoufas johngou...@gmail.com
wrote:
No problem, Joe. There
Hi Darin, you might increase spark.yarn.executor.memoryOverhead to see if
it fixes the problem. Please take a look at this report:
https://issues.apache.org/jira/browse/SPARK-4996
On Fri, Feb 27, 2015 at 12:38 AM, Arush Kharbanda
ar...@sigmoidanalytics.com wrote:
Can you share what error you
My program in pseudocode looks like this:
val conf = new SparkConf().setAppName("Test")
  .set("spark.storage.memoryFraction", "0.2")   // default 0.6
  .set("spark.shuffle.memoryFraction", "0.12")  // default 0.2
  .set("spark.shuffle.manager", "SORT")         // preferred setting for optimized joins
Are you sure the multiple invocations are not from previous runs of the
program?
TD
On Fri, Feb 27, 2015 at 12:16 PM, Nastooh Avessta (navesta)
nave...@cisco.com wrote:
Hi
Under Spark 1.0.0, standalone, client mode, I am trying to invoke a 3rd-party
UDP traffic generator from the streaming
It wasn't clear from the snippet what's going on. Can you provide the
whole Receiver code?
TD
On Fri, Feb 27, 2015 at 12:37 PM, Nastooh Avessta (navesta)
nave...@cisco.com wrote:
I am, as I issue killall -9 Prog, prior to testing.
Cheers,
Hello Everyone,
I'm having some issues launching (non-Spark) applications via the
spark-submit command. The common error I am getting is copied below. I am
able to submit a Spark Streaming/Kafka application, but can't start a
DynamoDB Java app. The common error is related to joda-time.
1) I
I understand that I need to supply Guava to Spark. The HashBiMap is created in
the client and broadcast to the workers. So it is needed in both. To achieve
this there is a deps.jar with Guava (and Scopt but that is only for the
client). Scopt is found so I know the jar is fine for the client.
Hi
Under Spark 1.0.0, standalone, client mode, I am trying to invoke a 3rd-party UDP
traffic generator from the streaming thread. The excerpt is as follows:
...
do {
  try {
    p = Runtime.getRuntime().exec("Prog ");
    socket.receive(packet);
Thank you for your time and effort. Here is the code:
---
public final class Multinode extends Receiver<Output> {
    String host = null;
    int portRx = -1;
    int portTx = -1;
    private final Semaphore sem = new
Hi,
I have a Spark application that hangs on just one task (the rest of the 200-300
tasks get completed in reasonable time).
I can see in the thread dump which function gets stuck; however, I don't
have a clue as to what value is causing that behaviour.
Also, logging the inputs before the function is
Ah, I see. That makes a lot of sense now.
You might be running into some weird class loader visibility issue.
I've seen some bugs in JIRA about this in the past; maybe you're
hitting one of them.
Until I have some time to investigate (or if you're curious, feel free
to scavenge JIRA), a
I am, as I issue killall -9 Prog, prior to testing.
Cheers,
Nastooh Avessta
ENGINEER.SOFTWARE ENGINEERING
nave...@cisco.com
Phone: +1 604 647 1527
Cisco Systems Limited
595 Burrard Street, Suite 2123 Three Bentall Centre, PO
Hi,
Not sure if it can help, but `StorageLevel.MEMORY_AND_DISK_SER` generates
many small objects that lead to very long GC times, causing the "executor
lost", "heartbeat not received", and "GC overhead limit exceeded" messages.
Could you try using `StorageLevel.MEMORY_AND_DISK` instead? You can also
Hi.
I have had a similar issue. I had to pull the JavaSerializer source into my
own project, just so I got the classloading of this class under control.
This must be a class loader issue with Spark.
-E
On Fri, Feb 27, 2015 at 8:52 PM, Pat Ferrel p...@occamsmachete.com wrote:
I understand
On Fri, Feb 27, 2015 at 1:30 PM, Pat Ferrel p...@occamsmachete.com wrote:
@Marcelo, do you mean modifying spark.executor.extraClassPath on all
workers? That didn't seem to work.
That's an app configuration, not a worker configuration, so if you're
trying to set it on the worker configuration
Hi All,
I am currently trying to build out a Spark job that would basically convert
a CSV file into Parquet. From what I have seen, it looks like Spark SQL is
the way to go, and how I would go about this would be to load the CSV file
into an RDD and convert it into a SchemaRDD by injecting in
Hi Jason:
Thanks for your feedback.
Besides the information I mentioned above, there are 3 machines in the
cluster.
1st one: Driver + a bunch of Hadoop services. 32GB of RAM, 8 cores (2
used)
2nd + 3rd: 16GB of RAM, 4 cores (2 used each)
I hope this helps clarify.
Thx.
GG
Best,
I don't use spark-submit; I have a standalone app.
So I guess you want me to add that key/value to the conf in my code and make
sure it exists on the workers.
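For example, a rough sketch of doing that in a standalone app (the jar path is made up and has to exist at that location on every worker):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyStandaloneApp")
  .set("spark.executor.extraClassPath", "/opt/deps/deps.jar")  // must resolve on each worker
val sc = new SparkContext(conf)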
On Feb 27, 2015, at 1:47 PM, Marcelo Vanzin van...@cloudera.com wrote:
On Fri, Feb 27, 2015 at 1:42 PM, Pat Ferrel p...@occamsmachete.com
@Erlend hah, we were trying to merge your PR and ran into this—small world. You
actually compile the JavaSerializer source in your project?
@Marcelo, do you mean modifying spark.executor.extraClassPath on all workers?
That didn't seem to work.
On Feb 27, 2015, at 1:23 PM, Erlend Hamnaberg
On Fri, Feb 27, 2015 at 1:42 PM, Pat Ferrel p...@occamsmachete.com wrote:
I changed it in the Spark master conf, which is also the only worker. I added a
path to the jar that has Guava in it. It still can't find the class.
Sorry, I'm still confused about what config you're changing. I'm
suggesting
Hi,
I am trying to do this in spark-shell:
val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
val listTables = hiveCtx.hql("show tables")
The second line fails to execute with this message:
warning: there were 1 deprecation warning(s); re-run with -deprecation for
details
Hi Sean:
Thanks for your feedback. Scala is much faster. The count is performed in
~1 minute (vs. 17 min). I would expect Scala to be 2-5X faster, but this gap
seems to be more than that. Is that also your conclusion?
Thanks.
Best,
Guillaume Guy
* +1 919 - 972 - 8750*
On Fri, Feb 27, 2015 at
Thanks! that worked.
On Feb 27, 2015, at 1:50 PM, Pat Ferrel p...@occamsmachete.com wrote:
I don't use spark-submit; I have a standalone app.
So I guess you want me to add that key/value to the conf in my code and make
sure it exists on workers.
On Feb 27, 2015, at 1:47 PM, Marcelo Vanzin
I'll try to find a JIRA for it. I hope a fix is in 1.3.
On Feb 27, 2015, at 1:59 PM, Pat Ferrel p...@occamsmachete.com wrote:
Thanks! that worked.
On Feb 27, 2015, at 1:50 PM, Pat Ferrel p...@occamsmachete.com wrote:
I don't use spark-submit; I have a standalone app.
So I guess you want me to
I changed it in the Spark master conf, which is also the only worker. I added a
path to the jar that has Guava in it. It still can't find the class.
Trying Erlend's idea next.
On Feb 27, 2015, at 1:35 PM, Marcelo Vanzin van...@cloudera.com wrote:
On Fri, Feb 27, 2015 at 1:30 PM, Pat Ferrel
http://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema
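Following that guide, a rough sketch of the whole flow in Spark 1.2 style (file paths and the column layout are made up; sc is the usual SparkContext):

import org.apache.spark.sql.SQLContext

case class Record(id: Int, name: String, amount: Double)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicitly converts RDD[Record] to a SchemaRDD

val records = sc.textFile("hdfs:///input/data.csv")
  .map(_.split(","))
  .map(f => Record(f(0).toInt, f(1), f(2).toDouble))

records.saveAsParquetFile("hdfs:///output/data.parquet")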
On Fri, Feb 27, 2015 at 1:39 PM, kpeng1 kpe...@gmail.com wrote:
Hi All,
I am currently trying to build out a Spark job that would basically convert
a CSV file into Parquet. From what I have
I think we just need to update the docs; it is a bit unclear right
now. At the time, we worded it fairly sternly because we really
wanted people to use --jars when we deprecated SPARK_CLASSPATH. But
there are other types of deployments where there is a legitimate need
to augment the classpath
Can you share what error you are getting when the job fails?
On Thu, Feb 26, 2015 at 4:32 AM, Darin McBeath ddmcbe...@yahoo.com.invalid
wrote:
I'm using Spark 1.2, stand-alone cluster on EC2. I have a cluster of 8
r3.8xlarge machines but limit the job to only 128 cores. I have also tried
Hi,
I have four single-core machines as slaves in my cluster. I set
spark.default.parallelism to 4 and ran the SparkTC example. It took
around 26 sec.
Now I increased spark.default.parallelism to 8, but my performance
deteriorates: the same application takes 32 sec now.
I have
Passing RDDs around is not a good idea. RDDs are immutable and can't be
changed inside functions. Have you considered taking a different approach?
On Thu, Feb 26, 2015 at 3:42 AM, dritanbleco dritan.bl...@gmail.com wrote:
Hello
I am trying to pass as a parameter an org.apache.spark.rdd.RDD
Yes, spark.yarn.historyServer.address is used to access the Spark history
server from YARN; it is not needed if you use only the YARN history server.
It may be possible to have both history servers running, but I have not tried
that yet. Besides, as far as I have understood, YARN and Spark
You could be hitting this issue
https://issues.apache.org/jira/browse/SPARK-4516
Apart from that, a little more information about your job would be helpful.
Thanks
Best Regards
On Wed, Feb 25, 2015 at 11:34 AM, Mukesh Jha me.mukesh@gmail.com
wrote:
Hi Experts,
My Spark Job is failing with
I have three tables with the following schema:
case class date_d(WID: Int, CALENDAR_DATE: java.sql.Timestamp,
DATE_STRING: String, DAY_OF_WEEK: String, DAY_OF_MONTH: Int, DAY_OF_YEAR:
Int, END_OF_MONTH_FLAG: String, YEARWEEK: Int, CALENDAR_MONTH: String,
MONTH_NUM: Int, YEARMONTH: Int, QUARTER:
Hey Anish,
machine learning models that are updated with incoming data are commonly
known as online learning systems. Matrix factorization is one way to
implement recommender systems, but not the only one. There are papers about
how to do online matrix factorization, but you will likely have to
I have three tables with the following schema:
case class date_d(WID: Int, CALENDAR_DATE: java.sql.Timestamp,
DATE_STRING: String, DAY_OF_WEEK: String, DAY_OF_MONTH: Int, DAY_OF_YEAR:
Int, END_OF_MONTH_FLAG: String, YEARWEEK: Int, CALENDAR_MONTH: String,
MONTH_NUM: Int, YEARMONTH: Int, QUARTER:
Hi,
we are using RDD#mapPartitions() to achieve the same.
Are there advantages/disadvantages of using one method over the other?
Regards,
Jeff
2015-02-26 20:02 GMT+01:00 Mark Hamstra m...@clearstorydata.com:
rdd.map(foo).filter(bar) and rdd.filter(bar).map(foo) will each already be
pipelined
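For illustration, the two forms side by side (rdd, foo and bar are placeholders):

// chained map/filter is already pipelined within each partition
val a = rdd.map(foo).filter(bar)

// mapPartitions makes the per-partition iteration explicit, which mainly helps when you
// need per-partition setup/teardown (connections, buffers, ...)
val b = rdd.mapPartitions(iter => iter.map(foo).filter(bar))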
Is machine 1 the only one running an HDFS data node? You describe it as one
running Hadoop services.
On Feb 27, 2015 9:44 PM, Guillaume Guy guillaume.c@gmail.com wrote:
Hi Jason:
Thanks for your feedback.
Besides the information I mentioned above, there are 3 machines in the
cluster.
Hi Michael,
Would you help me understand the apparent difference here?
The Spark 1.2.1 programming guide indicates:
Note that if you call schemaRDD.cache() rather than
sqlContext.cacheTable(...), tables will *not* be cached using the in-memory
columnar format, and therefore
It is a simple text file.
I'm not using SQL, just doing an rdd.count() on it. Does the bug affect it?
On Friday, February 27, 2015, Davies Liu dav...@databricks.com wrote:
What is this dataset? text file or parquet file?
There is an issue with serialization in Spark SQL, which will make it
You can specify these jars (joda-time-2.7.jar, joda-convert-1.7.jar) either
as part of your build and assembly or via the --jars option to spark-submit.
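For example, a rough sketch of the spark-submit invocation (the class and application jar names are made up):

spark-submit --class com.example.DynamoApp \
  --jars joda-time-2.7.jar,joda-convert-1.7.jar \
  dynamo-app.jar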
HTH.
On Fri, Feb 27, 2015 at 2:48 PM, Su She suhsheka...@gmail.com wrote:
Hello Everyone,
I'm having some issues launching (non-spark)
Do you have a hive-site.xml file or a core-site.xml file? Perhaps
something is misconfigured there?
On Fri, Feb 27, 2015 at 7:17 AM, Anusha Shamanur anushas...@gmail.com
wrote:
Hi,
I am trying to do this in spark-shell:
val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc) val
What is this dataset? A text file or a Parquet file?
There is an issue with serialization in Spark SQL which will make it
very slow (see https://issues.apache.org/jira/browse/SPARK-6055); it will
be fixed very soon.
Davies
On Fri, Feb 27, 2015 at 1:59 PM, Guillaume Guy
guillaume.c@gmail.com wrote:
String query = "select s.name, count(s.name) as tally from sample s group by
s.name order by tally";
Looking at [1], it seems to recommend pulling from multiple Kafka topics in
order to parallelize data received from Kafka over multiple nodes. I notice
in [2], however, that one of the createConsumer() functions takes a
groupId. So am I understanding correctly that creating multiple DStreams
with the
Thanks for your reply.
But your code snippet uses `collect`, which is not feasible for me.
My algorithm involves a large amount of data and I do not want to transmit
it.
Wush
2015-02-27 16:27 GMT+08:00 Yanbo Liang yblia...@gmail.com:
Actually, sortBy will return an ordered RDD.
Your
If you read the streaming programming guide, you'll notice that Spark does
not do real streaming but emulates it with a so-called mini-batching
approach. Let's say you want to work with a continuous stream of incoming
events from a computing centre:
Batch interval:
That's the basic heartbeat of
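A minimal sketch of setting that heartbeat (conf here is a placeholder SparkConf; 5 seconds is an arbitrary choice):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// incoming events are grouped into one RDD per batch interval and processed as a mini-batch
val ssc = new StreamingContext(conf, Seconds(5))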
Hi,
I got a NoSuchElementException when I tried to iterate through a Map which
contains some elements (not null, not empty). When I debug my code
(below), it seems the first part of the code, which fills the Map, is
executed after the second part, which iterates over the Map. The 1st part and
2nd part
Most likely that particular executor is stuck on a GC pause. What operation are
you performing? You can try increasing the parallelism if you see that only 1
executor is doing the task.
Thanks
Best Regards
On Fri, Feb 27, 2015 at 11:39 AM, twinkle sachdeva
twinkle.sachd...@gmail.com wrote:
Hi,
I am
I don't have an idea, but perhaps a little more context would be helpful.
What is the source of your streaming data? What's the storage level you're
using?
What are you doing? Some kind of window operations?
Regards,
Jeff
2015-02-26 18:59 GMT+01:00 Mukesh Jha me.mukesh@gmail.com:
On
http://apache-spark-user-list.1001560.n3.nabble.com/file/n21853/pythonpath.jpg
Here's the PYTHONPATH. It points to the correct location.
From Zhan Zhang's reply, yes, I still get Parquet's advantage.
You will need to at least use SQL or the DataFrame API (coming in Spark
1.3) to specify the columns that you want in order to get the parquet
benefits. The rest of your operations can be standard Spark.
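For example, a rough sketch (the path, table and column names are made up):

val events = sqlContext.parquetFile("hdfs:///data/events.parquet")
events.registerTempTable("events")
// only the selected columns need to be read from the Parquet files
val projected = sqlContext.sql("SELECT user_id, event_time FROM events")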
My next question is,
Has anyone seen this -
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x000110fcd9cd, pid=4709, tid=11011
#
# JRE version: Java(TM) SE Runtime Environment (8.0_25-b17) (build
1.8.0_25-b17)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.25-b02
I do.
What tags should I change in this?
I changed the value of hive.exec.scratchdir to /tmp/hive.
What else?
On Fri, Feb 27, 2015 at 2:14 PM, Michael Armbrust mich...@databricks.com
wrote:
Do you have a hive-site.xml file or a core-site.xml file? Perhaps
something is misconfigured there?
Dear List,
I'm investigating some problems related to native code integration
with Spark, and while picking through BlockManager I noticed that data
(de)serialization currently issues lots of array copies.
Specifically:
- Deserialization: BlockManager marshals all deserialized bytes
through a
Hi,
I just wonder if there is any sparse matrix implementation available in
Spark, so it can be used in a Spark application?
best,
/Shahab
Yes, it's called CoordinateMatrix
(http://spark.apache.org/docs/latest/mllib-data-types.html#coordinatematrix).
You need to fill it with elements of type MatrixEntry((Long, Long, Double)).
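For example, a minimal sketch (the entries are made up):

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// only the non-zero cells are stored, one MatrixEntry(row, col, value) each
val entries = sc.parallelize(Seq(MatrixEntry(0, 1, 3.0), MatrixEntry(2, 0, 5.5)))
val sparse = new CoordinateMatrix(entries)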
Thanks,
Peter Rudenko
On 2015-02-27 14:01, shahab wrote:
Hi,
I just wonder if there is any Sparse