Dear Spark users,
I've got a program streaming a folder: when a new file is created in this
folder, I count the words that appear in the document and update the running
counts (I used StatefulNetworkWordCount to do it). It works like a charm. However, I
would now like to know the current top 10 words at any given time.
Hello Akhil,
I changed my Kafka dependency to 2.10 (which is the version of Kafka that
was on 10.0.1.232). I am getting a slightly different error, but at the
same place as the previous error (pasted below).
FYI, when I make these changes to the pom file, I do mvn clean package
then cp the new
Add this jar as a dependency:
http://mvnrepository.com/artifact/com.yammer.metrics/metrics-core/2.2.0
Thanks
Best Regards
On Mon, Dec 29, 2014 at 1:31 PM, Suhas Shekar suhsheka...@gmail.com wrote:
Hello Akhil,
I changed my Kafka dependency to 2.10 (which is the version of Kafka that
was on
You can use reduceByKeyAndWindow for that. Here's a pretty clean example
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala
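A rough sketch of the same idea applied to your word counts (a sketch only; the folder path, window sizes, and variable names are placeholders, not from your program):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("TopWindowedWords")
val ssc = new StreamingContext(sparkConf, Seconds(10))
val lines = ssc.textFileStream("/path/to/folder")
// Count words over a sliding 60-second window, recomputed every 10 seconds.
val windowedCounts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))
// Sort each window's counts descending and print the current top 10.
windowedCounts.map { case (word, count) => (count, word) }
  .transform(_.sortByKey(ascending = false))
  .foreachRDD(rdd => rdd.take(10).foreach(println))
ssc.start()
ssc.awaitTermination()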
Thanks
Best Regards
On Mon, Dec 29, 2014 at 1:30 PM, Hoai-Thu Vuong thuv...@gmail.com wrote:
I'm very close! So I added that and then I added this:
http://mvnrepository.com/artifact/org.apache.kafka/kafka-clients/0.8.2-beta
and it seems as though the stream is working, as it says "Stream 0 received 1
or 2 blocks" as I enter messages on my Kafka producer. However, the
Receiver seems to
If you want to stop the streaming after 10 seconds, then use
ssc.awaitTermination(10000). Make sure you push some data to Kafka for the
streaming to consume within the 10 seconds.
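For reference, a minimal sketch of that shutdown pattern (the app name, batch interval, and the stream setup are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("KafkaTenSeconds")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// ... create the Kafka stream and its output operations here ...
ssc.start()
// Block for at most 10 seconds (10000 ms), then fall through and stop.
ssc.awaitTermination(10000)
ssc.stop()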
Thanks
Best Regards
On Mon, Dec 29, 2014 at 1:53 PM, Suhas Shekar suhsheka...@gmail.com wrote:
I'm very close! So
Hmmm... so I added 10000 (10,000 ms) to jssc.awaitTermination, however it does
not stop. When I am not pushing in any data it gives me this:
14/12/29 08:35:12 INFO ReceiverTracker: Stream 0 received 0 blocks
14/12/29 08:35:12 INFO JobScheduler: Added jobs for time 1419860112000 ms
14/12/29 08:35:14
Now, add these lines to get rid of those logs:
import org.apache.log4j.Logger
import org.apache.log4j.Level
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
Thanks
Best Regards
On Mon, Dec 29, 2014 at 2:09 PM, Suhas Shekar
How many cores are you allocating / seeing in the web UI? (It usually runs on
8080; for Cloudera I think it's 18080.) Most likely the job is being
allocated 1 core (it should be >= 2 cores), and that's why the count never
happens.
Thanks
Best Regards
On Mon, Dec 29, 2014 at 2:22 PM, Suhas Shekar
You don't submit it like that :/
You use local[*]-style masters when you run the job in local mode, whereas here
you are running it in standalone cluster mode.
You can try either of these:
1.
/opt/cloudera/parcels/CDH-5.2.1-1.cdh5.2.10.12/lib/spark/bin/spark-submit
--class SimpleApp --master
I thought I was running it in local mode as
http://spark.apache.org/docs/1.1.1/submitting-applications.html says that
if I don't include --deploy-mode cluster then it will run in local mode?
I tried both of the scripts above and they gave the same result as the
script I was running before.
Also,
Thanks Ted. Adding a dependency on spark-network-yarn would allow resolution
of YarnShuffleService, which the docs suggest runs on the Node Manager
and perhaps isn't useful while submitting jobs programmatically. I think
what I need is a dependency on the spark-yarn module so that classes like
It wasn't so much the cogroup that was optimized here, but what is
done to the result of cogroup. Yes, it was a matter of not
materializing the entire result of a flatMap-like function after the
cogroup, since this will accept just an Iterator (actually,
TraversableOnce).
I'd say that wherever
Thanks Tsuyoshi and Shixiong for the info. More documentation about the
feature would be awesome!
I was afraid that the node manager needed reconfiguration (and restart). Any
idea how many resources the shuffle service will take on the node
manager? In a multi-tenant Hadoop cluster environment, it
I see that there is already a request to add wildcard support to the
SQLContext.parquetFile function
https://issues.apache.org/jira/browse/SPARK-3928.
What seems like a useful thing for our use case is to associate the
directory structure with certain columns in the table, but it does not seem
Given (label, terms) you can just transform the values to a TF vector,
then TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can
make a LabeledPoint from (label, vector) pairs. Is that what you're
looking for?
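A hedged sketch of that flow, assuming docs is an RDD[(Double, Seq[String])] of (label, terms):

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint

val labels = docs.map(_._1)
// Hash each document's terms into a term-frequency vector.
val tf = new HashingTF().transform(docs.map(_._2))
tf.cache()
// Fit IDF on the corpus, then rescale the TF vectors.
val tfidf = new IDF().fit(tf).transform(tf)
// Pair the labels back up with the TF-IDF vectors.
val training = labels.zip(tfidf).map { case (label, vector) => LabeledPoint(label, vector) }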
On Mon, Dec 29, 2014 at 3:37 AM, Yao y...@ford.com wrote:
I found the TF-IDF
How do I make the spark-ec2 script install Hive and Spark SQL on EC2? When I
run the spark-ec2 script, go to bin, run ./spark-sql, and execute a query,
I get connection refused on master:9000. What else has to be configured
for this?
So what should be the value of --hadoop-major-version for the following Hadoop
versions?
Hadoop 1.x is 1
CDH4
Hadoop2.3
Hadoop2.4
MapR 3.x
MapR 4.x
Hi,
I wish to cluster a set of textual documents into an undefined number of
classes. The clustering algorithm provided in MLlib, i.e. K-means, requires me
to give a pre-defined number of classes.
Is there any algorithm which is intelligent enough to identify how many
classes should be made based on
Hi Josh
Is there documentation available for the status API? I would like to use it.
Thanks,
Aniket
On Sun Dec 28 2014 at 02:37:32 Josh Rosen rosenvi...@gmail.com wrote:
The console progress bars are implemented on top of a new stable status
API that was added in Spark 1.2. It's possible to
You can try several values of k, apply some evaluation metric to the
clustering, and then use that to decide what k is best, or at least
pretty good. If it's a completely unsupervised problem, the metrics
you can use tend to be some function of the inter-cluster and
intra-cluster distances (good
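A sketch of that loop, assuming data is an RDD[Vector] and using the within-cluster sum of squared distances as the score (computed by hand here purely for illustration):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def score(data: RDD[Vector], k: Int): Double = {
  val model = KMeans.train(data, k, 20)
  // Sum of squared distances from each point to its assigned cluster center.
  data.map { p =>
    val c = model.clusterCenters(model.predict(p))
    p.toArray.zip(c.toArray).map { case (a, b) => (a - b) * (a - b) }.sum
  }.sum()
}

// Try a range of k and look for the point of diminishing returns (the "elbow").
(2 to 20).foreach(k => println(s"k=$k score=${score(data, k)}"))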
Hi,
It seems that spark-defaults.conf is not read by spark-sql. Is it used only by
spark-shell?
Thanks,
Chirag
I believe if you use spark-shell or spark-submit, then it will pick up the
conf from spark-defaults.conf. If you are running an independent application,
then you can set all those confs while creating the SparkContext.
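For a standalone application that looks roughly like this (the settings shown are just examples):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.executor.memory", "2g")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)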
Thanks
Best Regards
On Mon, Dec 29, 2014 at 5:40 PM, Chirag Aggarwal
Hi All,
I am trying to do a comparison, by building the model locally using R and
on cluster using spark.
There is some difference in the results.
Any idea what the internal implementation of Decision Tree in Spark
MLlib is (ID3, C4.5, C5.0, or CART)?
Thanks,
AnoopShiralige
14/12/29 18:10:56 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 2,
nj09mhf0730.mhf.mhc, PROCESS_LOCAL, 1246 bytes)
14/12/29 18:10:56 INFO TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) on
executor nj09mhf0730.mhf.mhc: org.apache.spark.SparkException (
Error from python worker:
Here is what I did for this case : https://github.com/andypetrella/tf-idf
On Mon, Dec 29, 2014 at 11:31, Sean Owen so...@cloudera.com wrote:
Given (label, terms) you can just transform the values to a TF vector,
then TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can
make a
/bin/spark-submit --class org.apache.spark.examples.SparkPi --master
yarn-cluster --num-executors 3 --driver-memory 1g --executor-memory 1g
--executor-cores 1 --queue thequeue lib/spark-examples-1.2.0-hadoop2.6.0.jar 10
I got the same error with the above command; I think I missed the jar
It wasn't so much the cogroup that was optimized here, but what is
done to the result of cogroup.
Right.
Yes, it was a matter of not materializing the entire result of a
flatMap-like function after the cogroup, since this will accept just
an Iterator (actually, TraversableOnce).
Yeah...I
Hi,
I want to find the time taken to replicate an RDD in a Spark cluster, along
with the computation time on the replicated RDD.
Can someone please suggest a suitable Spark profiler?
Thank you
On Mon, Dec 29, 2014 at 2:11 PM, Stephen Haberman
stephen.haber...@gmail.com wrote:
Yeah...I was trying to poke around, are the Iterables that Spark passes
into cogroup already materialized (e.g. the bug was making a copy of
an already-in-memory list) or are the Iterables streaming?
The result
The Iterable from cogroup is CompactBuffer, which is already materialized.
It's not a lazy Iterable. So currently Spark cannot handle skewed data where
some key has too many values to fit in memory.
It's accessed through the `statusTracker` field on SparkContext.
*Scala*:
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkStatusTracker
*Java*:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaSparkStatusTracker.html
Don't create new
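A small sketch of polling it from the driver (the method names come from the status API linked above; the rest is illustrative, with sc as the SparkContext):

val tracker = sc.statusTracker
for (stageId <- tracker.getActiveStageIds) {
  tracker.getStageInfo(stageId).foreach { stage =>
    println(s"stage $stageId: ${stage.numCompletedTasks}/${stage.numTasks} tasks complete")
  }
}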
Thanks Josh. Looks promising. I will give it a try.
Thanks,
Aniket
On Mon, Dec 29, 2014, 9:55 PM Josh Rosen rosenvi...@gmail.com wrote:
It's accessed through the `statusTracker` field on SparkContext.
*Scala*:
Hi Shixiong,
The Iterable from cogroup is CompactBuffer, which is already
materialized. It's not a lazy Iterable. So currently Spark cannot handle
skewed data where some key has too many values to fit in
memory.
Cool, thanks for the confirmation.
- Stephen
Resolved.
I changed to the Apache Hadoop 2.4.0 + Apache Spark 1.2.0 combination, and all
works fine.
It must be because the 1.2.0 version of Spark was compiled against Hadoop 2.4.0.
Here's the Streaming KMeans from Spark 1.2:
http://spark.apache.org/docs/latest/mllib-clustering.html#examples-1
Streaming KMeans still needs an initial 'k' to be specified; it then progresses
to come up with an optimal 'k' IIRC.
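A hedged sketch of its use, following the 1.2 docs linked above (the dimensionality, k, and the input DStreams are placeholders):

import org.apache.spark.mllib.clustering.StreamingKMeans

val numDimensions = 3 // placeholder
val model = new StreamingKMeans()
  .setK(10)
  .setDecayFactor(1.0)
  .setRandomCenters(numDimensions, 0.0)

model.trainOn(trainingData)                     // trainingData: DStream[Vector]
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()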
From: Sean Owen so...@cloudera.com
To: jatinpreet
I'm running the master branch.
Finally I got it to work by changing all occurrences of the *public_dns_name*
property to *private_ip_address* in the spark_ec2.py script.
My VPC instances always have a null value in the *public_dns_name* property.
Now my script only works for VPC instances.
Regards
It would be very helpful if there were such a tool, but the distributed
nature may be difficult to capture.
I had been trying to run a task where merging the accumulators was taking
an inordinately long time and was not reflected in the standalone
cluster's web UI.
What I think will be useful is
Hi Mukesh,
Based on your spark-submit command, it looks like you're only running with
2 executors on YARN. Also, how many cores does each machine have?
-Sandy
On Mon, Dec 29, 2014 at 4:36 AM, Mukesh Jha me.mukesh@gmail.com wrote:
Hello Experts,
I'm bench-marking Spark on YARN (
Jatin,
One approach for determining K would be to sample the data set and run PCA.
Then evaluate how many of the resulting eigenvalue/eigenvector pairs
to use before you reach diminishing returns on cumulative error. That
number provides a reasonably good value for K to use in KMeans.
With
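A rough sketch of that sampling + PCA step, assuming data is an RDD[Vector] (the sample fraction and the number of components are placeholders):

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val sample = data.sample(withReplacement = false, fraction = 0.1)
val mat = new RowMatrix(sample)
// The singular values stand in for the eigenvalue spectrum here.
val svd = mat.computeSVD(20, computeU = false)
val sq = svd.s.toArray.map(x => x * x)
val total = sq.sum
// Cumulative fraction of variance explained by the first i components.
sq.scanLeft(0.0)(_ + _).tail.map(_ / total).zipWithIndex.foreach {
  case (frac, i) => println(s"k=${i + 1}: ${"%.3f".format(frac)}")
}
// Pick K near where this curve flattens out.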
Sorry Sandy, the command is just for reference, but I can confirm that there
are 4 executors and a driver, as shown in the Spark UI page.
Each of these machines is an 8-core box with ~15G of RAM.
On Mon, Dec 29, 2014 at 11:23 PM, Sandy Ryza sandy.r...@cloudera.com
wrote:
Hi Mukesh,
Based on
And this is with spark version 1.2.0.
On Mon, Dec 29, 2014 at 11:43 PM, Mukesh Jha me.mukesh@gmail.com
wrote:
Sorry Sandy, the command is just for reference, but I can confirm that
there are 4 executors and a driver, as shown in the Spark UI page.
Each of these machines is an 8-core box
On Mon, Dec 29, 2014 at 12:00 AM, Eric Friedman eric.d.fried...@gmail.com
wrote:
I'm not seeing RDDs or SRDDs cached in the Spark UI. That page remains
empty despite my calling cache().
Just a small note on this, and perhaps you already know: Calling cache() is
not enough to cache something and
Are you setting --num-executors to 8?
On Mon, Dec 29, 2014 at 10:13 AM, Mukesh Jha me.mukesh@gmail.com
wrote:
Sorry Sandy, the command is just for reference, but I can confirm that
there are 4 executors and a driver, as shown in the Spark UI page.
Each of these machines is an 8-core box
*oops, I mean are you setting --executor-cores to 8
On Mon, Dec 29, 2014 at 10:15 AM, Sandy Ryza sandy.r...@cloudera.com
wrote:
Are you setting --num-executors to 8?
On Mon, Dec 29, 2014 at 10:13 AM, Mukesh Jha me.mukesh@gmail.com
wrote:
Sorry Sandy, The command is just for reference
I'm running PySpark on YARN, and I'm reading in SequenceFiles for which I
have a custom KeyConverter class. My KeyConverter needs to have some
configuration options passed to it, but I am unable to find a way to get the
options to that class without modifying the Spark source. Is there a
Nope, I am setting 5 executors with 2 cores each. Below is the command
that I'm using to submit in YARN mode. This starts up 5 executor nodes and
a driver, as per the Spark application master UI.
spark-submit --master yarn-cluster --num-executors 5 --driver-memory 1024m
--executor-memory 1024m
When running in standalone mode, each executor will be able to use all 8
cores on the box. When running on YARN, each executor will only have
access to 2 cores. So the comparison doesn't seem fair, no?
-Sandy
On Mon, Dec 29, 2014 at 10:22 AM, Mukesh Jha me.mukesh@gmail.com
wrote:
Nope, I
Makes sense. I've also tried it in standalone mode, where all 3 workers and the
driver were running on the same 8-core box, and the results were similar.
Anyway, I will share the results in YARN mode with 8-core YARN containers.
On Mon, Dec 29, 2014 at 11:58 PM, Sandy Ryza sandy.r...@cloudera.com
wrote:
Hey Eric, sounds like you are running into several issues, but thanks for
reporting them. Just to comment on a few of these:
I'm not seeing RDDs or SRDDs cached in the Spark UI. That page remains empty
despite my calling cache().
This is expected until you compute the RDDs the first time
Were you able to fix the issue? I'm facing a similar issue when trying to use
the yarn client from spark-jobserver.
Hello All,
I need to clean up the app folders (including the downloaded app jars) under
Spark's work folder.
I have tried setting the configuration below in spark-env, but it is not working
as expected.
SPARK_WORKER_OPTS=-Dspark.worker.cleanup.enabled=true
-Dspark.worker.cleanup.interval=10
I cannot clear
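One guess (an assumption on my part, not confirmed anywhere in this thread): besides the interval, the worker only deletes application directories older than spark.worker.cleanup.appDataTtl, which defaults to 7 days, so recent app folders may survive. Something like the following, with placeholder values, would check every 30 minutes and expire directories after an hour:
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=1800 -Dspark.worker.cleanup.appDataTtl=3600"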
I'm also interested in the solution to this.
Thanks,
Mike
On Mon, Dec 29, 2014 at 12:01 PM, hutashan [via Apache Spark User List]
ml-node+s1001560n20889...@n3.nabble.com wrote:
Hello All,
I need to clean up the app folders (including the downloaded app jars) under
Spark's work folder.
I have tried
I want to have a SparkContext inside of a web application running in Jetty
that I can use to submit jobs to a cluster of Spark executors. I am running
YARN.
Ultimately, I would love it if I could just use something like
SparkSubmit.main() to allocate a bunch of resources in YARN when the webapp
I am facing the same issue in the spark-1.1.0 version.
14/12/29 20:44:31 INFO scheduler.TaskSetManager: Starting task 5.0 in stage
1.1 (TID 1373, X.X.X.X, ANY, 2185 bytes)
14/12/29 20:44:31 WARN scheduler.TaskSetManager: Lost task 6.0 in stage 3.0
(TID 1367, X.X.X.X): java.io.IOException: failed to
On Dec 29, 2014, at 1:52 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Hey Eric, sounds like you are running into several issues, but thanks for
reporting them. Just to comment on a few of these:
I'm not seeing RDDs or SRDDs cached in the Spark UI. That page remains empty
despite
Was your spark assembly jarred with Java 7? There's a known issue with jar
files made with that version. It prevents them from being used on PYTHONPATH.
You can rejar with Java 6 for better results.
Eric Friedman
On Dec 29, 2014, at 8:01 AM, Naveen Kumar Pokala npok...@spcapitaliq.com
I am trying to do the sbt assembly for spark 1.2
sbt/sbt -Pmapr4 -Phive -Phive-thriftserver assembly
and I am getting the errors below. Any thoughts? Thanks in advance!
[warn] ::
[warn] :: FAILED DOWNLOADS::
[warn] :: ^ see
Let's say I have an RDD which gets cached and has two children which do
something with it:
val rdd1 = ...cache()
rdd1.saveAsSequenceFile()
rdd1.groupBy().saveAsSequenceFile()
If I were to submit both calls to saveAsSequenceFile() in threads to take
advantage of concurrency (where
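For what it's worth, a minimal sketch of submitting the two actions from separate threads with Scala Futures (sc, the paths, and the pair-RDD shape are placeholders; Spark's scheduler does accept jobs from multiple threads):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val rdd1 = sc.textFile("input").map(line => (line, 1)).cache()

// Each Future triggers an independent Spark job on the shared, cached parent.
val f1 = Future { rdd1.saveAsSequenceFile("out/raw") }
val f2 = Future { rdd1.reduceByKey(_ + _).saveAsSequenceFile("out/counts") }

Await.result(Future.sequence(Seq(f1, f2)), Duration.Inf)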
Got it to work...thanks a lot for the help! I started a new cluster where
Spark has Yarn as a dependency. I ran it with the script with local[2] and
it worked (this same script did not work with Spark in standalone mode).
A follow up question...I have seen this question posted around the internet
I got same error when specifying -Pmapr4.
For the following command:
sbt/sbt -Pyarn -Phive -Phive-thriftserver assembly
I got:
How can getCommonSuperclass() do its job if different class symbols get the
same bytecode-level internal name:
org/apache/spark/sql/catalyst/dsl/package$ScalaUdfBuilder
Hi All,
I have tried to pass the properties via SparkContext.setLocalProperty and
HiveContext.setConf; both failed. Based on the results (I haven't had a chance to
look into the code yet), HiveContext will try to initiate the JDBC connection
right away, so I couldn't set other properties
Hopefully the new pipeline API addresses this problem. We have a code
example here:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala
-Xiangrui
On Mon, Dec 29, 2014 at 5:22 AM, andy petrella
A follow-up on the hive-site.xml: if you
1. Specify it in spark/conf, then you can NOT also apply it via the
--driver-class-path option; otherwise, you will get the following exception
when initializing the SparkContext.
org.apache.spark.SparkException: Found both spark.driver.extraClassPath
Yeah, this looks like a regression in the API due to the addition of
arbitrary decimal support. Can you open a JIRA?
On Sun, Dec 28, 2014 at 12:23 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Hi Zigen,
Looks like they missed it.
Thanks
Best Regards
On Sat, Dec 27, 2014 at 12:43 PM,
Could you post your code? It sounds like a bug. One thing to check is
whether you set regType, which is None by default. -Xiangrui
On Tue, Dec 23, 2014 at 3:36 PM, Thomas Kwan thomas.k...@manage.com wrote:
Hi there
We are on mllib 1.1.1, and trying different regularization parameters. We
b0c1, did you apply model.predict to a DStream? Maybe it would help
understand your question better if you can post your code. -Xiangrui
On Tue, Dec 23, 2014 at 11:54 AM, boci boci.b...@gmail.com wrote:
Xiangrui: Hi, I want to use this with streaming and with a job too. I'm using
kafka
There is an SVD++ implementation in GraphX. It would be nice if you
could compare its performance vs. Mahout. -Xiangrui
On Wed, Dec 24, 2014 at 6:46 AM, Prafulla Wani prafulla.w...@gmail.com wrote:
hi ,
Is there any plan to add SVDPlusPlus based recommender to MLLib ? It is
implemented in
Hi
Could Spark-SQL be used from within a custom actor that acts as a receiver
for a streaming application? If yes, what is the recommended way of passing
the SparkContext to the actor?
Thanks for your help.
- Ranga
I was able to fix it by adding the jars (in the Spark distribution) to the
classpath. In my sbt file I changed the scope to provided.
Let me know if you need more details.
Hi all,
The size of the shuffle write shown in the Spark web UI is much different when I
execute the same Spark job on the same input data (100GB) in both Spark 1.1 and
Spark 1.2.
At the same sortBy stage, the size of the shuffle write is 39.7GB in Spark 1.1
but 91.0GB in Spark 1.2.
I set spark.shuffle.manager
If I have 2 RDDs which depend on the same RDD like the following:
val rdd1 = ...
val rdd2 = rdd1.groupBy()...
val rdd3 = rdd1.groupBy()...
If I don't cache rdd1, will its lineage be calculated twice (once for rdd2
and once for rdd3)?
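For context, a minimal sketch of the cached variant in the same pseudocode style:

val rdd1 = ... .cache()
val rdd2 = rdd1.groupBy(...)
val rdd3 = rdd1.groupBy(...)
// The first action that touches rdd1 materializes and caches its partitions;
// later actions on rdd2 and rdd3 then reuse the cached data instead of
// recomputing rdd1's lineage.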
-dev +user
In general you cannot create new RDDs inside closures that run on the
executors (which is what sql inside of a foreach is doing).
I think what you want here is something like:
sqlContext.parquetFile("Data\\Test\\Parquet\\2").registerTempTable("temp2")
sql("SELECT col1, col2 FROM
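Filling that in as a hedged sketch (the path, table name, and columns come from the truncated snippet above; the collect-to-driver step is my assumption about the intent):

val parquetData = sqlContext.parquetFile("Data\\Test\\Parquet\\2")
parquetData.registerTempTable("temp2")
// Run one SQL query against the registered table; any per-row work then
// happens on the driver, instead of calling sql() inside a foreach.
val rows = sqlContext.sql("SELECT col1, col2 FROM temp2").collect()
rows.foreach(println)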
Is there a way to get these set by default in the spark-sql shell?
Thanks,
Chirag
From: Akhil Das ak...@sigmoidanalytics.com
Date: Monday, 29 December 2014 5:53 PM
To: Chirag Aggarwal
chirag.aggar...@guavus.com
Cc:
You can't do this now without writing a bunch of custom logic (see here for
an example:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala
)
I would like to make this easier as part of improvements to the datasources
api that we are
I would expect this to work. Can you run the standard spark-shell?
On Mon, Dec 29, 2014 at 2:34 AM, critikaled isasmani@gmail.com wrote:
How to make the spark ec2 script to install hive and spark sql on ec2 when
I
run the spark ec2 script and go to bin and run ./spark-sql and execute