This looks like a bug to me. It happens because we serialize the code
that starts the receiver and send it across, and since we have not
registered the classes of the Akka library, it does not work. I have not
tried it myself, but maybe by including something like chill-akka (
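(The message is cut off here; for reference, a minimal sketch of how such a
registration might be wired up, assuming Kryo serialization. The registrator
class name is hypothetical and the chill-akka specifics are unverified:)

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyRegistrator") // hypothetical

// Hypothetical registrator that would register the Akka classes, e.g.
// using whatever serializers chill-akka provides (names not verified):
class MyRegistrator extends org.apache.spark.serializer.KryoRegistrator {
  override def registerClasses(kryo: com.esotericsoftware.kryo.Kryo) {
    // kryo.register(classOf[akka.actor.ActorRef], ...)
  }
}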
Hi All,
The data size of my task is about 30mb. It runs smoothly in local mode.
However, when I submit it to the cluster, it throws the titled error (Please
see below for the complete output).
Actually, my output is almost the same as
Hi Haiyang,
you are right, YARN takes over the resource management, but I constantly
got a ConnectionRefused exception on the mentioned port. So I suppose some
Spark-internal communications are done via this port... but I don't know
what exactly, or how I can change it...
Thank you,
Konstantin
We have a 6-node cluster; each node has 64GB of memory.
here is the command:
./bin/spark-submit --class
org.apache.spark.examples.graphx.LiveJournalPageRank
examples/target/scala-2.10/spark-examples-1.0.1-hadoop1.0.4.jar
hdfs://dataset/twitter --tol=0.01 --numEPart=144 --numIter=10
But it ran out
Hi Konstantin,
Could you please post your first container's stderr log here, which is
always the AM log? As far as I know, ports other than 8020, 8080, 8081,
50070, and 50071 are all random socket ports determined by each job, so
33007 may just be a temporary port for data transfer. The deeper reason for
As shown here:
2 - Why Is My Spark Job so Slow and Only Using a Single Thread?
http://engineering.sharethrough.com/blog/2013/09/13/top-3-troubleshooting-tips-to-keep-you-sparking/
object JSONParser {
  def parse(raw: String): String = ...
}
object MyFirstSparkJob {
  def
Hi everyone, I have the following configuration. I am currently running my app
in local mode.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("ApproxStrMatch")
  .set("spark.executor.memory", "3g")
  .set("spark.storage.memoryFraction", "0.1")
I am getting the following error. I tried setting up
I am running Spark 0.9.1 and Shark 0.9.1. Sorry I didn't include that.
On Thu, Jul 31, 2014 at 9:50 AM, William Cox william@distilnetworks.com
wrote:
*The Shark-specific group appears to be in moderation pause, so I'm asking
here.*
I'm running Shark/Spark on EC2. I am using Shark to
Ah, thanks for the help! That worked great.
On Wed, Jul 30, 2014 at 10:31 AM, Zongheng Yang zonghen...@gmail.com
wrote:
To add to this: for this many (>= 20) machines I usually use at least
--wait 600.
On Wed, Jul 30, 2014 at 9:10 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
I’m working with a dataset where each row is stored as a single-line flat JSON
object. I want to leverage Spark SQL to run relational queries on this data.
Many of the object keys in this dataset have dots in them, e.g.:
{ "key.number1": "value1", "key.number2": "value2", … }
I can successfully
Hi,
I was wondering what is the best way to store off dstreams in hdfs or
casandra.
Could somebody provide an example?
Thanks,
Ali
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/store-spark-streaming-dstream-in-hdfs-or-cassandra-tp11064.html
Sent from
I still see the same “Unresolved attributes” error when using hql + backticks.
Here’s a code snippet that replicates this behavior:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val sampleRDD = sc.parallelize(Array("""{"key.one": "value1", "key.two":
"value2"}"""))
val sampleTable =
I have created https://issues.apache.org/jira/browse/SPARK-2775 to track it.
On Thu, Jul 31, 2014 at 11:47 AM, Budde, Adam bu...@amazon.com wrote:
I still see the same “Unresolved attributes” error when using hql +
backticks.
Here’s a code snippet that replicates this behavior:
val
Ok, I set the number of Spark worker instances to 2 (below is my startup
command). But this essentially had the effect of increasing my number of
workers from 3 to 6 (which was good), while also reducing my number of cores
per worker from 8 to 4 (which was not so good). In the end, I would
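(If the goal is more worker JVMs without halving the cores each one gets, a
hedged sketch of the relevant spark-env.sh settings; the values are examples
and assume each node actually has enough cores for instances x cores:)

SPARK_WORKER_INSTANCES=2   # worker JVMs per node
SPARK_WORKER_CORES=8       # cores per worker instance; set explicitly
SPARK_WORKER_MEMORY=30g    # memory cap per worker instance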
Off the top of my head, you can use the ForEachDStream to which you pass
in the code that writes to Hadoop, and then register that as an output
stream, so the function you pass in is periodically executed and causes
the data to be written to HDFS. If you are ok with the data being in
text
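(In the public API this is what DStream.foreachRDD gives you; a minimal
sketch, assuming a DStream[String] named lines and a placeholder HDFS path:)

lines.foreachRDD { (rdd, time) =>
  // runs once per batch interval; each batch lands in its own directory
  rdd.saveAsTextFile("hdfs://namenode:8020/out/batch-" + time.milliseconds)
}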
I'm looking to write a select statement to get a distinct count on userId
grouped by keyword column on a parquet file SchemaRDD equivalent of:
SELECT keyword, count(distinct(userId)) from table group by keyword
How to write it using the chained select().groupBy() operations?
Thanks!
--
I got the number from the Hadoop admin. It's 1M actually. I suspect the
consolidation didn't work as expected? Any other reason?
On Thu, Jul 31, 2014 at 11:01 AM, Shao, Saisai saisai.s...@intel.com
wrote:
I don’t think it’s a bug in consolidated shuffle; it’s a Linux
configuration problem.
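(For reference, a hedged sketch of the two knobs usually involved here; the
property name is from the 0.9/1.0 docs, and the limit value is only an example:)

// in the application:
conf.set("spark.shuffle.consolidateFiles", "true")
// on each worker, raise the open-file limit, e.g. in /etc/security/limits.conf:
//   * soft nofile 1000000
//   * hard nofile 1000000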
Hi,
I'm learning Spark and I am confused about when to use the many different
operations on RDDs. Does anyone have any examples which show sample inputs
and resulting outputs for the various RDD operations, and, if the operation
takes a function, a simple example of the code?
For example,
So if I try this again but in the Scala shell (as opposed to the Python
one), this is what I get:
scala> val a = sc.textFile("s3n://some-path/*.json",
  minPartitions = sc.defaultParallelism * 3).cache()
a: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12
scala>
I would check out the source examples on Spark's Github:
https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples
And, Zhen He put together a great web page with summaries and examples of
each function:
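(A minimal illustration of that input/output style, runnable in the shell:)

val nums = sc.parallelize(Seq(1, 2, 3, 4))     // input: 1, 2, 3, 4
nums.map(_ * 2).collect()                      // output: Array(2, 4, 6, 8)
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
pairs.reduceByKey(_ + _).collect()             // output: Array((a,4), (b,2))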
countDistinct was recently added and is in 1.0.2. If you are using that
or the master branch, you could try something like:
r.select('keyword, countDistinct('userId)).groupBy('keyword)
On Thu, Jul 31, 2014 at 12:27 PM, buntu buntu...@gmail.com wrote:
I'm looking to write a select statement
To read/write from/to Cassandra I recommend you use the Spark-Cassandra
connector at [1].
Using it, saving a Spark Streaming RDD to Cassandra is fairly easy:
sparkConfig.set("CassandraConnectionHost", cassandraHost)
val sc = new SparkContext(sparkConfig)
val ssc = new StreamingContext(sc,
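(The snippet above is cut off; for completeness, a hedged sketch of the save
step using the connector's streaming API, with placeholder keyspace, table,
and column names:)

import org.apache.spark.streaming.dstream.DStream
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
val counts: DStream[(String, Int)] = ...  // some stream of rows
counts.saveToCassandra("my_keyspace", "my_table", SomeColumns("word", "count"))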
Davies,
That was it. Removing the call to cache() let the job run successfully, but
this challenges my understanding of how Spark handles caching data.
I thought it was safe to cache data sets larger than the cluster could hold
in memory. What Spark would do is cache as much as it could and
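(cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); a minimal sketch
of the spill-to-disk alternative, with a placeholder path:)

import org.apache.spark.storage.StorageLevel
val data = sc.textFile("hdfs://namenode:8020/big-input")
  .persist(StorageLevel.MEMORY_AND_DISK)  // keeps what fits in memory,
                                          // spills the rest to local disk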
Looking at what this patch [1] has to do to achieve it, I am not sure
if you can do the same thing in 1.0.0 using DSL only. Just curious,
why don't you use the hql() / sql() methods and pass a query string
in?
[1] https://github.com/apache/spark/pull/1211/files
On Thu, Jul 31, 2014 at 2:20 PM,
I was not sure if registerAsTable() and then querying against that table
has an additional performance impact, and if the DSL eliminates that.
On Thu, Jul 31, 2014 at 2:33 PM, Zongheng Yang zonghen...@gmail.com wrote:
Looking at what this patch [1] has to do to achieve it, I am not sure
if you can do
The performance should be the same using the DSL or SQL strings.
On Thu, Jul 31, 2014 at 2:36 PM, Buntu Dev buntu...@gmail.com wrote:
I was not sure if registerAsTable() and then querying against that table
has an additional performance impact, and if the DSL eliminates that.
On Thu, Jul 31, 2014 at
Thanks Michael for confirming!
On Thu, Jul 31, 2014 at 2:43 PM, Michael Armbrust mich...@databricks.com
wrote:
The performance should be the same using the DSL or SQL strings.
On Thu, Jul 31, 2014 at 2:36 PM, Buntu Dev buntu...@gmail.com wrote:
I was not sure if registerAsTable() and then
Hi Rahul,
I am not sure about bootstrapping while creating the cluster, but we
downloaded the tarball, extracted it, and configured it accordingly, and it
worked fine.
I believe you would want to write a custom script which does all these
things and add it as a bootstrap action.
Thanks,
Sai
On Jul 31, 2014
Hey all,
I was able to spawn up a cluster, but when I'm trying to submit a simple
jar via spark-submit it fails to run. I am trying to run the simple
Standalone Application from the quickstart.
Oddly enough, I could get another application running through the
spark-shell. What am I doing wrong
It seems to me that the way makeLinkRDDs works is by taking advantage of the
fact that partition IDs happen to coincide with what we get from
userPartitioner, since the HashPartitioner in
*val grouped = ratingsByUserBlock.partitionBy(new
HashPartitioner(numUserBlocks))*
is actually preserving
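(Why the IDs coincide, in a minimal sketch: HashPartitioner maps a key to
key.hashCode modulo the number of partitions, so for small non-negative
integer user IDs the partition ID is just the ID mod numUserBlocks:)

val p = new org.apache.spark.HashPartitioner(4)
p.getPartition(10)   // 10.hashCode % 4 == 2
p.getPartition(7)    // 7.hashCode % 4 == 3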
Hi,
My spark job finishes with this output:
14/07/31 16:33:25 INFO SparkContext: Job finished: count at
RetrieveData.scala:18, took 0.013189 s
However, the command line doesn't go back to normal and instead just hangs.
This is my first time running a spark job - is this normal? If not, how do I
Have you tried flag --spark-version of spark-ec2 ?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Installing-Spark-0-9-1-on-EMR-Cluster-tp11084p11096.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
When are you guys getting the error? When the SparkContext is created, or
when it is being shut down?
If this error is being thrown when the SparkContext is created, then
one possible reason may be conflicting versions of Akka. Spark depends
on a version of Akka which is different from that of Scala,
Whoa! That worked! I was half afraid it wouldn't, since I hadn't tried it myself.
TD
On Wed, Jul 30, 2014 at 8:32 PM, liuwei stupi...@126.com wrote:
Hi, Tathagata Das:
I followed your advice and solved this problem, thank you very much!
On Jul 31, 2014, at 3:07 AM, Tathagata Das
I filed a JIRA for this task for future reference.
https://issues.apache.org/jira/browse/SPARK-2780
On Thu, Jul 31, 2014 at 5:37 PM, Tathagata Das
tathagata.das1...@gmail.com wrote:
Whoa! That worked! I was half afraid it wouldn't, since I hadn't tried it myself.
TD
On Wed, Jul 30, 2014 at 8:32
Hi Tathagata,
I didn't mean to say this was an error. According to the other thread I
linked, right now there shouldn't be any conflicts, so I wanted to use
streaming in the shell for easy testing.
I thought I had to create my own project in which I'd add streaming as a
dependency, but if I can
Hi All,
I am using the spark-submit command to submit my jar to a standalone cluster
with two executors.
When I use spark-submit, it deploys the application twice, and I see two
application entries in the master UI.
The master logs, shown below, also indicate that submit tries to deploy the
Hey Simon,
The stuff you are trying to show - logs, contents of spark-env.sh,
etc. - is missing from the email. At least I am not able to see it
(viewing through Gmail). Are you pasting screenshots? Those might get
blocked out somehow!
TD
On Thu, Jul 31, 2014 at 6:55 PM, durin
Hey Dean! Thanks!
Did you try running this on a local environment or one generated by the
spark-ec2 script?
The environment I am running on is a 4-data-node, 1-master Spark cluster
generated by the spark-ec2 script. I haven't modified anything in the
environment except for adding data to the
cacheTable uses a special columnar caching technique that is optimized for
SchemaRDDs. It is something similar to MEMORY_ONLY_SER, but not quite. You
can specify the persistence level on the SchemaRDD itself and register that
as a temporary table; however, it is likely you will not get as good
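(A hedged sketch of the two options being contrasted; the table name and
query are placeholders, and registerAsTable is the 1.0-era name used
elsewhere in this thread:)

// columnar, SchemaRDD-aware caching:
sqlContext.cacheTable("events")
// versus choosing the storage level yourself and registering the result:
import org.apache.spark.storage.StorageLevel
val cached = sqlContext.sql("SELECT * FROM events")
cached.persist(StorageLevel.MEMORY_ONLY_SER)
cached.registerAsTable("events_cached")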
Could you enable the HistoryServer and provide the properties and CLASSPATH
for the spark-shell? Also run 'env' to list your environment variables.
By the way, what do the Spark logs say? Enable debug mode to see what's
going on in spark-shell when it tries to interact with and init the HiveContext.
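(Enabling debug output is typically done through conf/log4j.properties; a
minimal sketch:)

# conf/log4j.properties (copied from log4j.properties.template)
log4j.rootCategory=DEBUG, console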
Glad to help you
On Fri, Aug 1, 2014 at 11:28 AM, Bin wubin_phi...@126.com wrote:
Hi Haiyang,
Thanks, it really is the reason.
Best,
Bin
On 2014-07-31 08:05:34, Haiyang Fu haiyangfu...@gmail.com wrote:
Have you tried to increase the driver memory?
On Thu, Jul 31, 2014 at 3:54 PM, Bin
Hi,
here are two tips for you:
1. Increase the parallelism level.
2. Increase the driver memory.
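(A hedged sketch of where each tip is applied; the values are examples, not
recommendations:)

// tip 1: raise the default parallelism used for shuffles and joins
val conf = new SparkConf().set("spark.default.parallelism", "200")
// tip 2: driver memory must be set before the driver JVM starts, e.g. on
// the spark-submit command line rather than in SparkConf:
//   ./bin/spark-submit --driver-memory 4g ...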
On Fri, Aug 1, 2014 at 12:58 AM, Sameer Tilak ssti...@live.com wrote:
Hi everyone,
I have the following configuration. I am currently running my app in local
mode.
val conf = new