Took me just about all night (it's 3am here in EST) but I finally figured
out how to get this working. I pushed up my example code for others who may
be struggling with this same problem. It really took an understanding of
how the classpath needs to be configured both in YARN and in the client
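For anyone wanting a concrete starting point, a minimal sketch of the client-side half (the paths are placeholders, and the exact keys depend on your Spark and YARN versions; the cluster side is governed by yarn.application.classpath in yarn-site.xml):

# conf/spark-defaults.conf -- example entries; /opt/deps/* is a placeholder path
spark.driver.extraClassPath     /opt/deps/*
spark.executor.extraClassPath   /opt/deps/*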
Yes, it is easy to simply start a new factorization from the current model
solution. It works well. That's more like incremental *batch* rebuilding of
the model. That is not in MLlib but fairly trivial to add.
You can certainly 'fold in' new data to approximately update with one new
datum too,
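To illustrate the fold-in idea: for a single new user you can solve a small ridge regression against the existing item factors instead of re-running the factorization. A minimal sketch assuming Breeze for the linear algebra (foldInUser and its argument shapes are my own naming, not an MLlib API):

import breeze.linalg.{DenseMatrix, DenseVector}

// Approximate a new user's latent vector u from the factors Y of the items
// the user rated (one k-dim factor per item) and the ratings r, by solving
// u = (Y * Y^T + lambda * I)^-1 * Y * r.
def foldInUser(itemFactors: Array[Array[Double]],
               ratings: Array[Double],
               lambda: Double): Array[Double] = {
  val k = itemFactors.head.length  // rank of the factorization
  // k x n matrix with one column per rated item (data is column-major)
  val y = new DenseMatrix(k, itemFactors.length, itemFactors.flatten)
  val r = DenseVector(ratings)
  val gram = y * y.t + DenseMatrix.eye[Double](k) * lambda
  (gram \ (y * r)).toArray
}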
Hi,
You can turn off these messages using log4j.properties.
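For example, a minimal conf/log4j.properties along these lines (patterned on the log4j.properties.template that ships with Spark) raises the console threshold to WARN:

log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n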
On Fri, Jan 2, 2015 at 1:51 PM, Robineast robin.e...@xense.co.uk wrote:
Do you have some example code of what you are trying to do?
Robin
Hi experts!
I am currently working on Spark Streaming with Kafka. I have a couple of
questions related to this task.
1) Is there a way to find the number of partitions for a given topic name?
2) Is there a way to detect whether the Kafka server is running or not?
Thanks
Do you know a place where I could find a sample or tutorial for this?
I'm still very new at this. And struggling a bit...
Thanks in advance
Wouter
Sent from my iPhone.
On 03 Jan 2015, at 10:36, Sean Owen so...@cloudera.com wrote:
Yes, it is easy to simply start a new factorization from
Hi Sean,
Initially I thought along the lines of bit1...@163.com, but I just changed how
I connect to the internet. I ran sc.parallelize(1 to 1000).count() and it
seemed to work.
Another quick question on the development workflow. What is the best way to
rebuild once I make a few modifications to the
This indicates a network problem in getting third party artifacts. Is there
a proxy you need to go through?
On Jan 3, 2015 11:17 AM, Manoj Kumar manojkumarsivaraj...@gmail.com
wrote:
Hello,
I tried to build Spark from source using this command (all dependencies
installed)
but it fails with this
The error hints that the Maven module scala-compiler can't be fetched from
repo1.maven.org. Should some repository URLs be added to the Maven settings
file?
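If a proxy turns out to be the culprit (as suggested elsewhere in the thread), the usual place to declare it is ~/.m2/settings.xml; a minimal sketch with placeholder host and port:

<settings>
  <proxies>
    <proxy>
      <id>example-proxy</id>
      <active>true</active>
      <protocol>http</protocol>
      <host>proxy.example.com</host>
      <port>8080</port>
    </proxy>
  </proxies>
</settings>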
bit1...@163.com
From: Manoj Kumar
Date: 2015-01-03 18:46
To: user
Subject: Unable to build spark from source
Hello,
I tried to
Is this through a streaming app? I've done this before by publishing results
out to a queue or message bus, with a web app listening on the other end. If
it's just batch or infrequent, you could save the results out to a file.
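As a rough sketch of the queue approach from a streaming job (results, QueueClient, and publish are stand-ins for your DStream and whatever bus client you use, not a real library):

results.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // open one connection per partition so the client is created on the
    // executor rather than serialized with the closure
    val client = QueueClient.connect("bus.example.com")  // hypothetical client
    records.foreach(r => client.publish("results", r.toString))
    client.close()
  }
}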
From:
Hello,
I tried to build Spark from source using this command (all dependencies
installed)
but it fails with this error. Any help would be appreciated.
mvn -DskipTests clean package
[INFO] Spark Project Parent POM .......................... FAILURE [28:14.408s]
[INFO] Spark Project Networking
This is just noise, please ignore it.
I figured out what happened...
bit1...@163.com
From: bit1...@163.com
Date: 2015-01-03 19:03
To: user
Subject: sqlContext is undefined in the Spark Shell
Hi,
In the spark shell, I do the following two things:
1. scala> val cxt = new
If you can paste the code here, I can certainly help.
Also, please confirm the version of Spark you are using.
Regards
Pankaj
Infoshore Software
India
Hi,
In the spark shell, I do the following two things:
1. scala> val cxt = new org.apache.spark.sql.SQLContext(sc);
2. scala> import sqlContext._
The 1st one succeeds while the 2nd one fails with the following error,
<console>:10: error: not found: value sqlContext
import sqlContext._
Is there
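The likely cause: the SQLContext was bound to the name cxt, so the import has to reference that name (or the val has to be named sqlContext). A minimal fix:

scala> val cxt = new org.apache.spark.sql.SQLContext(sc)
scala> import cxt._  // import from the val actually defined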
that's great. i tried this once and gave up after a few hours.
On Sat, Jan 3, 2015 at 2:59 AM, Corey Nolet cjno...@gmail.com wrote:
Took me just about all night (it's 3am here in EST) but I finally figured
out how to get this working. I pushed up my example code for others who may
be
No, that is part of every Maven build by default. The repo is fine and I
(and I assume everyone else) can reach it.
How can you run Spark if you can't build it? Are you running something else
or did it succeed?
The error hints that the maven module scala-compiler can't be fetched from
hey Ted,
i am aware of the upgrade efforts for akka. however if spark 1.2 forces me
to upgrade all our usage of akka to 2.3.x while spark 1.0 and 1.1 force me
to use akka 2.2.x then we cannot build one application that runs on all
spark 1.x versions, which i would consider a major incompatibility.
You can use the low-level consumer
http://github.com/dibbhatt/kafka-spark-consumer for this; it has an API
call
https://github.com/dibbhatt/kafka-spark-consumer/blob/master/src/main/java/consumer/kafka/DynamicBrokersReader.java#L81
to retrieve the number of partitions for a topic.
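If you'd rather not pull in a library, a minimal sketch using the Kafka 0.8 simple-consumer API can answer both questions: it counts partitions via a metadata request, and the connection attempt doubles as a liveness check (host, port, and clientId are placeholders):

import kafka.javaapi.TopicMetadataRequest
import kafka.javaapi.consumer.SimpleConsumer
import scala.collection.JavaConverters._

def partitionCount(host: String, port: Int, topic: String): Int = {
  // throws if the broker is unreachable, i.e. the server is not running
  val consumer = new SimpleConsumer(host, port, 10000, 64 * 1024, "metadata-probe")
  try {
    val response = consumer.send(new TopicMetadataRequest(List(topic).asJava))
    response.topicsMetadata.asScala.head.partitionsMetadata.size
  } finally {
    consumer.close()
  }
}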
Easiest way
@laila Based on the error you mentioned in the nabble link below, it seems like
there are no permissions to write to HDFS. So this is possibly why
saveAsTextFile is failing.
From: Pankaj Narang pankajnaran...@gmail.com
To: user@spark.apache.org
Sent: Saturday, January 3, 2015 4:07 AM
Hi,
I entered Kaggle's CTR challenge using the scikit-learn Python framework.
Although it gave me a reasonable score, I am just wondering about exploring
Spark MLlib, which I haven't used before. I tried Vowpal Wabbit as well.
Can someone who has already worked with MLlib help me if Spark MLlib
My question was: once I make changes to a file in the source code, do I
rebuild using any other command that picks up only the changes (because a
full rebuild takes a lot of time)?
On Sat, Jan 3, 2015 at 10:40 PM, Manoj Kumar manojkumarsivaraj...@gmail.com
wrote:
Yes, I've built spark
Yes, I've built Spark successfully, using the same command
mvn -DskipTests clean package
but it only built because I no longer work behind a proxy.
Thanks.
Best Regards,
Sujeevan. N
You can use the same build commands, but it's well worth setting up a zinc
server if you're doing a lot of builds. That will allow incremental scala
builds, which speeds up the process significantly.
SPARK-4501 might be of interest too.
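For example (a sketch; the zinc setup and flags vary by Spark version, so check the building-spark docs for yours):

zinc -start                       # start the incremental compile server once
mvn -DskipTests -pl core package  # then rebuild only the module you changed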
Simon
On 3 Jan 2015, at 17:27, Manoj Kumar
I have two pair RDDs in Spark, like this:
rdd1 = (1 -> [4,5,6,7])
(2 -> [4,5])
(3 -> [6,7])
rdd2 = (4 -> [1001,1000,1002,1003])
(5 -> [1004,1001,1006,1007])
(6 -> [1007,1009,1005,1008])
(7 -> [1011,1012,1013,1010])
I would like to combine them to look like this:
joinedRdd = (1 ->
Hi Experts,
Like saveAsParquetFile on SchemaRDD, is there an equivalent to store in an
ORC file?
I am using Spark 1.2.0.
As per the link below, it looks like it's not part of 1.2.0, so any recent
update would be great.
https://issues.apache.org/jira/browse/SPARK-2883
Till the next release, is there a
This is my design. Now let me try and code it in Spark.
rdd1.txt
1~4,5,6,7
2~4,5
3~6,7
rdd2.txt
4~1001,1000,1002,1003
5~1004,1001,1006,1007
6~1007,1009,1005,1008
7~1011,1012,1013,1010
TRANSFORM 1 === map each value to key (like an inverted index)
4~1
5~1
6~1
7~1
5~2
4~2
6~3
7~3
Hello everyone,
I'm using Spark SQL and would like to understand how I can determine the right
value for the spark.sql.shuffle.partitions parameter. For example, if I'm
joining two RDDs where the first has 10 partitions and the second 60, how big
should this parameter be?
Thank you,
Yuri
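From my understanding, spark.sql.shuffle.partitions sets the number of post-shuffle partitions for SQL joins and aggregations (default 200) independently of the inputs' partition counts, so it is usually tuned to the shuffled data volume rather than derived from the 10 or 60. It can be set per context; 120 below is an arbitrary example value:

sqlContext.setConf("spark.sql.shuffle.partitions", "120")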
Which version of Spark are you using? It seems like the issue here is that
the map output statuses are too large to fit in the Akka frame size. This
issue has been fixed in Spark 1.2 by using a different encoding for map
outputs for jobs with many reducers (
I have a hack to gather custom application metrics in a Streaming job, but
I wanted to know if there is any better way of doing this.
My hack consists of this singleton:
object Metriker extends Serializable {
  @transient lazy val mr: MetricRegistry = {
    val metricRegistry = new
In the middle of doing the architecture for a new project, which has
various machine learning and related components, including:
recommender systems, search engines and sequence [common intersection]
matching.
Usually I use: MongoDB (as db), Redis (as cache) and celery (as queue,
backed by
One way is to use Spark SQL.
scala> sqlContext.sql("create table orc_table(key INT, value STRING) stored as orc")
scala> sqlContext.sql("insert into table orc_table select * from schema_rdd_temp_table")
scala> sqlContext.sql("FROM orc_table select *")
On 4 January 2015 at 00:57, SamyaMaiti
Hi,
I think there's a StreamingSource in Spark Streaming which exposes the Spark
Streaming running status to the metrics sink; you can connect it with the
Graphite sink to expose metrics to Graphite. I'm not sure if this is what you
want. Besides, you can customize the Source and Sink of the
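For reference, the Graphite sink is wired up in conf/metrics.properties, roughly like this (keys follow Spark's metrics.properties.template; host and port are placeholders):

*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds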
I am running into a similar problem. Have you found any resolution to this
issue?
- Thanks, the build succeeded.
- But where is the built zip file? I can't find a finished .zip or .tar.gz
package.
2014-12-31 19:22 GMT+08:00 xhudik [via Apache Spark User List]
ml-node+s1001560n20927...@n3.nabble.com:
Hi J_soft,
for me it is working; I didn't use -Dscala-2.10 -X
hi, take a look at the code here I wrote:
https://raw.githubusercontent.com/sanjaysubramanian/msfx_scala/master/src/main/scala/org/medicalsidefx/common/utils/PairRddJoin.scala
/*rdd1.txt
1~4,5,6,7
2~4,5
3~6,7
rdd2.txt
4~1001,1000,1002,1003
5~1004,1001,1006,1007
call `map(_.toList)` to convert `CompactBuffer` to `List`
Best Regards,
Shixiong Zhu
2015-01-04 12:08 GMT+08:00 Sanjay Subramanian
sanjaysubraman...@yahoo.com.invalid:
hi
Take a look at the code here I wrote
it should be under
ls assembly/target/scala-2.10/*
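If you want an actual .tgz distribution rather than the assembly jar, the make-distribution.sh script in the repo root can build one; a sketch (flags vary by version, see the script's --help):

./make-distribution.sh --tgz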
On Sat, Jan 3, 2015 at 10:11 PM, j_soft zsof...@gmail.com wrote:
- Thanks, the build succeeded.
- But where is the built zip file? I can't find a finished .zip or .tar.gz
package.
2014-12-31 19:22 GMT+08:00 xhudik [via Apache Spark
so I changed the code to
rdd1InvIndex.join(rdd2Pair).map(str => str._2).groupByKey().map(str =>
(str._1, str._2.toList)).collect().foreach(println)
Now it prints. Don't worry, I will work on this so it does not output as List(...), but I
am hoping that the JOIN question that @Dilip asked is hopefully
Alec,
Good questions. Suggestions:
1. Refactor the problem into layers viz. DFS, Data Store, DB, SQL Layer,
Cache, Queue, App Server, App (Interface), App (backend ML) et al.
2. Then slot in the appropriate technologies - maybe even multiple
technologies for the same layer and
I use the 1.2 version.
On Sun, Jan 4, 2015 at 3:01 AM, Josh Rosen rosenvi...@gmail.com wrote:
Which version of Spark are you using? It seems like the issue here is
that the map output statuses are too large to fit in the Akka frame size.
This issue has been fixed in Spark 1.2 by using a
Thanks Sanjay. I will give it a try.
Thanks
Dilip
On Sat, Jan 3, 2015 at 11:25 PM, Sanjay Subramanian
sanjaysubraman...@yahoo.com wrote:
so I changed the code to
rdd1InvIndex.join(rdd2Pair).map(str => str._2).groupByKey().map(str =>
(str._1, str._2.toList)).collect().foreach(println)
Now
so it appears that I need to be on the same network, which is fine. Now I
would like some advice on the best way to use the shell: is running the
shell from the master or a worker fine, or should I create a new EC2
instance?
Bobby
Hi,
I compiled using sbt and it takes less time. Thanks for the tip. I'm able
to run the examples at
https://spark.apache.org/docs/latest/mllib-linear-methods.html related to
MLlib in the pyspark shell.
However I got some errors related to Spark SQL while compiling. Is that a
reason to worry?
Hi Pro,
I have a question regarding calling cache()/persist() on an RDD. All RDDs in
Spark are lazily evaluated, but does calling cache()/persist() on an RDD trigger
its immediate evaluation?
My spark app is something like this:
val rdd = sc.textFile().map()
rdd.persist()
while(true){
val