Re: Submitting spark jobs through yarn-client

2015-01-03 Thread Corey Nolet
Took me just about all night (it's 3am here in EST) but I finally figured out how to get this working. I pushed up my example code for others who may be struggling with this same problem. It really took an understanding of how the classpath needs to be configured both in YARN and in the client

Re: Is it possible to do incremental training using ALSModel (MLlib)?

2015-01-03 Thread Sean Owen
Yes, it is easy to simply start a new factorization from the current model solution. It works well. That's more like incremental *batch* rebuilding of the model. That is not in MLlib but fairly trivial to add. You can certainly 'fold in' new data to approximately update with one new datum too,
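For readers looking for a concrete starting point, here is a minimal sketch of the incremental *batch* rebuild described above, assuming Spark 1.x MLlib in the spark-shell; the paths, rank and iteration counts are illustrative, not from this thread:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

def parse(line: String): Rating = {
  val Array(user, product, rating) = line.split(",")
  Rating(user.toInt, product.toInt, rating.toDouble)
}

val oldRatings = sc.textFile("hdfs:///ratings/existing").map(parse)
val newRatings = sc.textFile("hdfs:///ratings/new").map(parse)

// Retrain on the combined data. MLlib's public ALS.train always starts the
// factorization from scratch, so a true warm start from the previous model
// solution would need a custom variant of ALS.
val model = ALS.train(oldRatings.union(newRatings), 10, 10, 0.01)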

Re: DAG info

2015-01-03 Thread madhu phatak
Hi, You can turn off these messages using log4j.properties. On Fri, Jan 2, 2015 at 1:51 PM, Robineast robin.e...@xense.co.uk wrote: Do you have some example code of what you are trying to do? Robin -- View this message in context:
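For reference, a minimal sketch of conf/log4j.properties (copied and adjusted from conf/log4j.properties.template) that hides the INFO-level scheduler/DAG messages:

# Raise the root logging level so INFO messages from the DAG scheduler are suppressed
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n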

getting number of partition per topic in kafka

2015-01-03 Thread Hafiz Mujadid
Hi experts! I am currently working on Spark Streaming with Kafka. I have a couple of questions related to this task. 1) Is there a way to find the number of partitions for a given topic name? 2) Is there a way to detect whether the Kafka server is running or not? Thanks -- View this message in context:

Re: Is it possible to do incremental training using ALSModel (MLlib)?

2015-01-03 Thread Wouter Samaey
Do you know a place where I could find a sample or tutorial for this? I'm still very new at this. And struggling a bit... Thanks in advance Wouter Sent from my iPhone. On 03 Jan 2015, at 10:36, Sean Owen so...@cloudera.com wrote: Yes, it is easy to simply start a new factorization from

Re: Unable to build spark from source

2015-01-03 Thread Manoj Kumar
Hi Sean, Initially I thought along the lines of bit1...@163.com, but I just changed how I connect to the internet. I ran sc.parallelize(1 to 1000).count() and it seemed to work. Another quick question on the development workflow: what is the best way to rebuild once I make a few modifications to the

Re: Unable to build spark from source

2015-01-03 Thread Sean Owen
This indicates a network problem in getting third party artifacts. Is there a proxy you need to go through? On Jan 3, 2015 11:17 AM, Manoj Kumar manojkumarsivaraj...@gmail.com wrote: Hello, I tried to build Spark from source using this command (all dependencies installed) but it fails this

Re: Unable to build spark from source

2015-01-03 Thread bit1...@163.com
The error hints that the Maven module scala-compiler can't be fetched from repo1.maven.org. Should some repository URLs be added to Maven's settings file? bit1...@163.com From: Manoj Kumar Date: 2015-01-03 18:46 To: user Subject: Unable to build spark from source Hello, I tried to

RE: Publishing streaming results to web interface

2015-01-03 Thread Silvio Fiorito
Is this through a streaming app? I've done this before by publishing results out to a queue or message bus, with a web app listening on the other end. If it's just batch or infrequent you could save the results out to a file. From:
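As an illustration of the queue/message-bus approach, here is a minimal sketch; publishToBus is a placeholder for whatever client (Kafka producer, AMQP, etc.) the web app listens to, not a real API:

import org.apache.spark.streaming.dstream.DStream

def publishToBus(batch: Seq[(String, Long)]): Unit = {
  // placeholder: send the aggregated results to your queue / message bus here
}

def publishResults(results: DStream[(String, Long)]): Unit = {
  results.foreachRDD { rdd =>
    // collect only small, already-aggregated results on the driver before publishing
    publishToBus(rdd.collect().toSeq)
  }
}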

Unable to build spark from source

2015-01-03 Thread Manoj Kumar
Hello, I tried to build Spark from source using this command (all dependencies installed) but it fails with this error. Any help would be appreciated. mvn -DskipTests clean package [INFO] Spark Project Parent POM .. FAILURE [28:14.408s] [INFO] Spark Project Networking

Re: sqlContext is undefined in the Spark Shell

2015-01-03 Thread bit1...@163.com
This is noise, please ignore: I figured out what happened. bit1...@163.com From: bit1...@163.com Date: 2015-01-03 19:03 To: user Subject: sqlContext is undefined in the Spark Shell Hi, In the spark shell, I do the following two things: 1. scala> val cxt = new

Re: saveAsTextFile

2015-01-03 Thread Pankaj Narang
If you can paste the code here I can certainly help. Also confirm the version of spark you are using Regards Pankaj Infoshore Software India -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp20951p20953.html Sent from the Apache Spark

sqlContext is undefined in the Spark Shell

2015-01-03 Thread bit1...@163.com
Hi, In the spark shell, I do the following two things: 1. scala> val cxt = new org.apache.spark.sql.SQLContext(sc); 2. scala> import sqlContext._ The 1st one succeeds while the 2nd one fails with the following error: console:10: error: not found: value sqlContext import sqlContext._ Is there
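The import fails because the value created in step 1 is named cxt, while step 2 imports from a value named sqlContext. A minimal sketch of two ways that should work:

// either name the value sqlContext before importing its members...
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._

// ...or import from the name actually used in step 1
import cxt._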

Re: Submitting spark jobs through yarn-client

2015-01-03 Thread Koert Kuipers
that's great. i tried this once and gave up after a few hours. On Sat, Jan 3, 2015 at 2:59 AM, Corey Nolet cjno...@gmail.com wrote: Took me just about all night (it's 3am here in EST) but I finally figured out how to get this working. I pushed up my example code for others who may be

Re: Unable to build spark from source

2015-01-03 Thread Sean Owen
No, that is part of every Maven build by default. The repo is fine and I (and I assume everyone else) can reach it. How can you run Spark if you can't build it? Are you running something else or did it succeed? The error hints that the maven module scala-compiler can't be fetched from

Re: different akka versions and spark

2015-01-03 Thread Koert Kuipers
hey Ted, i am aware of the upgrade efforts for akka. however if spark 1.2 forces me to upgrade all our usage of akka to 2.3.x while spark 1.0 and 1.1 force me to use akka 2.2.x then we cannot build one application that runs on all spark 1.x versions, which i would consider a major incompatibility.

Re: getting number of partition per topic in kafka

2015-01-03 Thread Akhil Das
You can use the low-level consumer http://github.com/dibbhatt/kafka-spark-consumer for this; it has an API call https://github.com/dibbhatt/kafka-spark-consumer/blob/master/src/main/java/consumer/kafka/DynamicBrokersReader.java#L81 to retrieve the number of partitions for a topic. Easiest way
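If you would rather not pull in another consumer library, a minimal sketch that reads the same information straight from ZooKeeper, assuming Kafka 0.8.x (which keeps one znode per partition under /brokers/topics/<topic>/partitions and one ephemeral znode per live broker under /brokers/ids); host and topic name are placeholders:

import org.I0Itec.zkclient.ZkClient

val zkClient = new ZkClient("zkhost:2181", 10000, 10000)

// 1) number of partitions for a topic
val topic = "mytopic"
val numPartitions = zkClient.countChildren(s"/brokers/topics/$topic/partitions")

// 2) crude liveness check: is any broker currently registered?
val kafkaIsUp = zkClient.countChildren("/brokers/ids") > 0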

Re: saveAsTextFile

2015-01-03 Thread Sanjay Subramanian
@laila Based on the error you mentioned in the nabble link below, it seems like there are no permissions to write to HDFS, so this is possibly why saveAsTextFile is failing. From: Pankaj Narang pankajnaran...@gmail.com To: user@spark.apache.org Sent: Saturday, January 3, 2015 4:07 AM

Spark MLIB for Kaggle's CTR challenge

2015-01-03 Thread Maisnam Ns
Hi, I entered Kaggle's CTR challenge using the scikit-learn Python framework. Although it gave me a reasonable score, I want to explore Spark MLlib, which I haven't used before. I also tried Vowpal Wabbit. Can someone who has already worked with MLlib help me if Spark MLlib

Re: Unable to build spark from source

2015-01-03 Thread Manoj Kumar
My question was whether, once I make changes to a file in the source code, I can rebuild with some other command that picks up only the changes (because a full build takes a lot of time)? On Sat, Jan 3, 2015 at 10:40 PM, Manoj Kumar manojkumarsivaraj...@gmail.com wrote: Yes, I've built spark

Re: Unable to build spark from source

2015-01-03 Thread Manoj Kumar
Yes, I've built Spark successfully, using the same command mvn -DskipTests clean package, but it worked because I no longer connect through a proxy. Thanks.

[no subject]

2015-01-03 Thread Sujeevan
Best Regards, Sujeevan. N

Re: Unable to build spark from source

2015-01-03 Thread Simon Elliston Ball
You can use the same build commands, but it's well worth setting up a zinc server if you're doing a lot of builds. That will allow incremental scala builds, which speeds up the process significantly. SPARK-4501 might be of interest too. Simon On 3 Jan 2015, at 17:27, Manoj Kumar
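A sketch of the corresponding commands, assuming zinc is installed separately and you are on the Spark 1.2 source tree; the module name and flags below are illustrative:

# start the zinc compile server once; Spark's scala-maven-plugin will pick it up
zinc -start

# rebuild only the module you changed (add -amd to also rebuild its dependents)
mvn -DskipTests -pl mllib package

# alternatively, sbt gives incremental compilation out of the box
sbt/sbt mllib/compile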

Joining by values

2015-01-03 Thread dcmovva
I have two pair RDDs in Spark like this
rdd1 =
(1 -> [4,5,6,7])
(2 -> [4,5])
(3 -> [6,7])
rdd2 =
(4 -> [1001,1000,1002,1003])
(5 -> [1004,1001,1006,1007])
(6 -> [1007,1009,1005,1008])
(7 -> [1011,1012,1013,1010])
I would like to combine them to look like this. joinedRdd = (1 ->

save rdd to ORC file

2015-01-03 Thread SamyaMaiti
Hi Experts, Like saveAsParquetFile on SchemaRDD, is there an equivalent to store to an ORC file? I am using Spark 1.2.0. As per the link below, it looks like it is not part of 1.2.0, so any update on the latest status would be great. https://issues.apache.org/jira/browse/SPARK-2883 Till the next release, is there a

Re: Joining by values

2015-01-03 Thread Sanjay Subramanian
This is my design. Now let me try and code it in Spark.
rdd1.txt =
1~4,5,6,7
2~4,5
3~6,7
rdd2.txt =
4~1001,1000,1002,1003
5~1004,1001,1006,1007
6~1007,1009,1005,1008
7~1011,1012,1013,1010
TRANSFORM 1 === map each value to key (like an inverted index)
4~1
5~1
6~1
7~1
5~2
4~2
6~3
7~3

spark.sql.shuffle.partitions parameter

2015-01-03 Thread Yuri Makhno
Hello everyone, I'm using Spark SQL and would like to understand how I can determine the right value for the spark.sql.shuffle.partitions parameter. For example, if I'm joining two RDDs where the first has 10 partitions and the second 60, how big should this parameter be? Thank you, Yuri
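For context, a minimal sketch of how the parameter is set (assuming a Spark 1.x SQLContext). The default is 200, and the inputs' partition counts (10 and 60 here) do not constrain it: it is simply the number of reduce-side partitions Spark SQL uses for shuffles in joins and aggregations; 120 below is only an example value to tune from.

// set programmatically on the SQLContext...
sqlContext.setConf("spark.sql.shuffle.partitions", "120")

// ...or at submit time
// spark-submit --conf spark.sql.shuffle.partitions=120 ...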

Re: spark.akka.frameSize limit error

2015-01-03 Thread Josh Rosen
Which version of Spark are you using? It seems like the issue here is that the map output statuses are too large to fit in the Akka frame size. This issue has been fixed in Spark 1.2 by using a different encoding for map outputs for jobs with many reducers (
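If upgrading is not immediately possible, one common workaround is to raise the frame size; the value is in MB and the default in 1.x is 10. A minimal sketch (128 is only an example value):

val conf = new org.apache.spark.SparkConf()
  .set("spark.akka.frameSize", "128")

// or at submit time:
// spark-submit --conf spark.akka.frameSize=128 ...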

Better way of measuring custom application metrics

2015-01-03 Thread Enno Shioji
I have a hack to gather custom application metrics in a Streaming job, but I wanted to know if there is any better way of doing this. My hack consists of this singleton: object Metriker extends Serializable { @transient lazy val mr: MetricRegistry = { val metricRegistry = new
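For readers trying the same hack, a hedged completion of such a singleton, assuming the Dropwizard metrics and metrics-graphite libraries are on the classpath; the host, port and prefix are placeholders:

import java.net.InetSocketAddress
import java.util.concurrent.TimeUnit
import com.codahale.metrics.MetricRegistry
import com.codahale.metrics.graphite.{Graphite, GraphiteReporter}

object Metriker extends Serializable {
  @transient lazy val mr: MetricRegistry = {
    val metricRegistry = new MetricRegistry()
    // report everything registered here to Graphite every 10 seconds
    val reporter = GraphiteReporter.forRegistry(metricRegistry)
      .prefixedWith("myapp.streaming")
      .build(new Graphite(new InetSocketAddress("graphite-host", 2003)))
    reporter.start(10, TimeUnit.SECONDS)
    metricRegistry
  }
}

// usage inside a transformation or foreachRDD, executed on each executor:
// Metriker.mr.meter("records.processed").mark(n)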

Spark for core business-logic? - Replacing: MongoDB?

2015-01-03 Thread Alec Taylor
In the middle of doing the architecture for a new project, which has various machine learning and related components, including: recommender systems, search engines and sequence [common intersection] matching. Usually I use: MongoDB (as db), Redis (as cache) and celery (as queue, backed by

Re: save rdd to ORC file

2015-01-03 Thread Manku Timma
One way is to use sparkSQL. scala sqlContext.sql(create table orc_table(key INT, value STRING) stored as orc)scala sqlContext.sql(insert into table orc_table select * from schema_rdd_temp_table)scala sqlContext.sql(FROM orc_table select *) On 4 January 2015 at 00:57, SamyaMaiti

RE: Better way of measuring custom application metrics

2015-01-03 Thread Shao, Saisai
Hi, I think there’s a StreamingSource in Spark Streaming which exposes the Spark Streaming running status to the metrics sink; you can connect it with the Graphite sink to expose metrics to Graphite. I’m not sure if this is what you want. Besides, you can customize the Source and Sink of the
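For reference, wiring the built-in Graphite sink up is a matter of conf/metrics.properties; a minimal sketch where host, port and prefix are placeholders:

*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite-host
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=myapp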

Re: Elastic allocation(spark.dynamicAllocation.enabled) results in task never being executed.

2015-01-03 Thread firemonk9
I am running into similar problem. Have you found any resolution to this issue ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Elastic-allocation-spark-dynamicAllocation-enabled-results-in-task-never-being-executed-tp18969p20957.html Sent from the Apache

Re: building spark1.2 meet error

2015-01-03 Thread j_soft
Thanks, the build succeeded, but where is the built zip file? I cannot find a finished .zip or .tar.gz package. 2014-12-31 19:22 GMT+08:00 xhudik [via Apache Spark User List] ml-node+s1001560n20927...@n3.nabble.com: Hi J_soft, for me it is working, I didn't put -Dscala-2.10 -X

Re: Joining by values

2015-01-03 Thread Sanjay Subramanian
Hi, take a look at the code I wrote here: https://raw.githubusercontent.com/sanjaysubramanian/msfx_scala/master/src/main/scala/org/medicalsidefx/common/utils/PairRddJoin.scala
/* rdd1.txt
1~4,5,6,7
2~4,5
3~6,7
rdd2.txt
4~1001,1000,1002,1003
5~1004,1001,1006,1007

Re: Joining by values

2015-01-03 Thread Shixiong Zhu
call `map(_.toList)` to convert `CompactBuffer` to `List` Best Regards, Shixiong Zhu 2015-01-04 12:08 GMT+08:00 Sanjay Subramanian sanjaysubraman...@yahoo.com.invalid: hi Take a look at the code here I wrote

Re: building spark1.2 meet error

2015-01-03 Thread Boromir Widas
It should be under assembly/target/scala-2.10/ (check with ls assembly/target/scala-2.10/*). On Sat, Jan 3, 2015 at 10:11 PM, j_soft zsof...@gmail.com wrote: - thanks, it is success builded - .but where is builded zip file? I not find finished .zip or .tar.gz package 2014-12-31 19:22 GMT+08:00 xhudik [via Apache Spark

Re: Joining by values

2015-01-03 Thread Sanjay Subramanian
So I changed the code to
rdd1InvIndex.join(rdd2Pair).map(str => str._2).groupByKey().map(str => (str._1, str._2.toList)).collect().foreach(println)
Now it prints. Don't worry, I will work on this so it does not output as List(...). But I am hoping that the JOIN question @Dilip asked is hopefully

Re: Spark for core business-logic? - Replacing: MongoDB?

2015-01-03 Thread Krishna Sankar
Alec, Good questions. Suggestions: 1. Refactor the problem into layers viz. DFS, Data Store, DB, SQL Layer, Cache, Queue, App Server, App (Interface), App (backend ML) et al. 2. Then slot-in the appropriate technologies - may be even multiple technologies for the same layer and

Re: spark.akka.frameSize limit error

2015-01-03 Thread Saeed Shahrivari
I use the 1.2 version. On Sun, Jan 4, 2015 at 3:01 AM, Josh Rosen rosenvi...@gmail.com wrote: Which version of Spark are you using? It seems like the issue here is that the map output statuses are too large to fit in the Akka frame size. This issue has been fixed in Spark 1.2 by using a

Re: Joining by values

2015-01-03 Thread Dilip Movva
Thanks Sanjay. I will give it a try. Thanks Dilip On Sat, Jan 3, 2015 at 11:25 PM, Sanjay Subramanian sanjaysubraman...@yahoo.com wrote: so I changed the code to rdd1InvIndex.join(rdd2Pair).map(str => str._2).groupByKey().map(str => (str._1, str._2.toList)).collect().foreach(println) Now

Re: simple job + futures timeout

2015-01-03 Thread brichards
So it appears that I need to be on the same network, which is fine. Now I would like some advice on the best way to use the shell: is running the shell from the master or a worker fine, or should I create a new EC2 instance? Bobby -- View this message in context:

Re: Unable to build spark from source

2015-01-03 Thread Manoj Kumar
Hi, I compiled using sbt and it takes less time. Thanks for the tip. I'm able to run these examples (https://spark.apache.org/docs/latest/mllib-linear-methods.html) related to MLlib in the pyspark shell. However, I got some errors related to Spark SQL while compiling. Is that a reason to worry?

does calling cache()/persist() on a RDD trigger its immediate evaluation?

2015-01-03 Thread Pengcheng YIN
Hi Pros, I have a question regarding calling cache()/persist() on an RDD. All RDDs in Spark are lazily evaluated, but does calling cache()/persist() on an RDD trigger its immediate evaluation? My Spark app is something like this:
val rdd = sc.textFile().map()
rdd.persist()
while(true){ val
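For what it's worth, a minimal sketch illustrating the usual answer: persist()/cache() only mark the RDD for storage, and nothing is computed until the first action. The path and transformation below are placeholders, not from the original mail.

val rdd = sc.textFile("hdfs:///input/data.txt").map(_.split(","))
rdd.persist()            // only marks the RDD for caching; no evaluation yet
val first = rdd.count()  // first action: computes the RDD and materializes the cache
val second = rdd.count() // later actions reuse the cached partitions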