Re: Unable to start Spark 1.3 after building: java.lang.NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2014-12-18 Thread Sean Owen
Adding a hadoop-2.6 profile is not necessary. Use hadoop-2.4, which already exists and is intended for 2.4+. In fact this declaration is missing things that Hadoop 2 needs. On Thu, Dec 18, 2014 at 3:46 AM, Kyle Lin kylelin2...@gmail.com wrote: Hi there The following is my steps. And got the

Can Spark 1.0.2 run on CDH-4.3.0 with YARN? And will Spark 1.2.0 support CDH5.1.2 with YARN?

2014-12-18 Thread Canoe
I did not manage to compile Spark 1.1.0 source code on CDH4.3.0 with YARN. Does it support CDH4.3.0 with YARN? And will Spark 1.2.0 support CDH5.1.2?

Re: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-18 Thread Sean Owen
Well, it's always a good idea to use matched binary versions. Here it is more acutely necessary. You can use a pre-built binary -- if you use it to compile and also run. Why does it not make sense to publish artifacts? Not sure what you mean about core vs assembly, as the assembly contains all

Re: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-18 Thread Shixiong Zhu
@Rui do you mean the spark-core jar in the Maven central repo is incompatible with the same version of the official pre-built Spark binary? That's really weird. I thought they should have used the same code. Best Regards, Shixiong Zhu 2014-12-18 17:22 GMT+08:00 Sean Owen

Re: Can Spark 1.0.2 run on CDH-4.3.0 with YARN? And will Spark 1.2.0 support CDH5.1.2 with YARN?

2014-12-18 Thread Sean Owen
The question is really: will Spark 1.1 work with a particular version of YARN? Many, but not all, versions of YARN are supported. The stable versions (2.2.x+) are; before that, support is patchier, and it has in fact been removed in Spark 1.3. The yarn profile supports stable YARN, which is about

Semantics of foreachPartition()

2014-12-18 Thread Tobias Pfeiffer
Hi, I have the following code in my application: tmpRdd.foreach(item => { println("abc: " + item) }) tmpRdd.foreachPartition(iter => { iter.map(item => { println("xyz: " + item) }) }) In the output, I see only the "abc" prints

Re: Can Spark 1.0.2 run on CDH-4.3.0 with YARN? And will Spark 1.2.0 support CDH5.1.2 with YARN?

2014-12-18 Thread Zhihang Fan
Hi, Sean Thank you for your reply. I will try to use Spark 1.1 and 1.2 on CDH5.X. :) 2014-12-18 17:38 GMT+08:00 Sean Owen so...@cloudera.com: The question is really: will Spark 1.1 work with a particular version of YARN? Many, but not all, versions of YARN are supported. The stable

Re: weird bytecode incompatibility issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-18 Thread Sean Owen
Have a look at https://issues.apache.org/jira/browse/SPARK-2075 It's not quite that the API is different, but indeed building different 'flavors' of the same version (hadoop1 vs 2) can strangely lead to this problem, even though the public API is identical and in theory the API is completely

Re: Semantics of foreachPartition()

2014-12-18 Thread Tobias Pfeiffer
Hi again, On Thu, Dec 18, 2014 at 6:43 PM, Tobias Pfeiffer t...@preferred.jp wrote: tmpRdd.foreachPartition(iter => { iter.map(item => { println("xyz: " + item) }) }) Uh, with iter.foreach(...) it works... the reason being apparently that
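The behavior Tobias ran into can be reproduced without Spark at all, since it is plain Scala semantics: Iterator.map is lazy and only runs its function when the resulting iterator is consumed, while foreach is eager. A minimal sketch (no Spark involved):

```scala
// Iterator.map is lazy: it wraps the source iterator and runs the function
// only when the result is consumed. foreach consumes eagerly.
val log = scala.collection.mutable.ArrayBuffer.empty[String]

Iterator("a", "b").map(item => log += ("map: " + item))         // result discarded: nothing runs
Iterator("a", "b").foreach(item => log += ("foreach: " + item)) // runs immediately

println(log.mkString(", "))  // foreach: a, foreach: b
```

Inside foreachPartition the same rule applies: draining the iterator (e.g. with iter.foreach) is what makes the side effects happen.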

Re: java.io.NotSerializableException: org.apache.avro.mapred.AvroKey using spark with avro

2014-12-18 Thread M. Dale
I did not encounter this with my Avro records using Spark 1.1.0 (see https://github.com/medale/spark-mail/blob/master/analytics/src/main/scala/com/uebercomputing/analytics/basic/UniqueSenderCounter.scala). I do use the default Java serialization, but all the fields in my Avro object are

Re: Providing query dsl to Elasticsearch for Spark (2.1.0.Beta3)

2014-12-18 Thread Ian Wilkinson
Quick follow-up: this works sweetly with spark-1.1.1-bin-hadoop2.4. On Dec 3, 2014, at 3:31 PM, Ian Wilkinson ia...@me.com wrote: Hi, I'm trying the Elasticsearch support for Spark (2.1.0.Beta3). In the following I provide the query (as query dsl): import org.elasticsearch.spark._

Re: SPARK-2243 Support multiple SparkContexts in the same JVM

2014-12-18 Thread Sean Owen
Yes, although once you have multiple ClassLoaders, you are operating as if in multiple JVMs for most intents and purposes. I think the request for this kind of functionality comes from use cases where multiple ClassLoaders wouldn't work, like, wanting to have one app (in one ClassLoader) managing

Incorrect results when calling collect() ?

2014-12-18 Thread Tristan Blakers
Hi, I’m getting some seemingly invalid results when I collect an RDD. This is happening in both Spark 1.1.0 and 1.2.0, using Java 8 on Mac. See the following code snippet: JavaRDD<Thing> rdd = pairRDD.values(); rdd.foreach(e -> System.out.println("RDD Foreach: " + e)); rdd.collect().forEach(e ->

Can we specify driver running on a specific machine of the cluster on yarn-cluster mode?

2014-12-18 Thread LinQili
Hi all, In yarn-cluster mode, can we make the driver run on a specific machine that we choose in the cluster? Or even on a machine not in the cluster?

Re: Implementing a spark version of Haskell's partition

2014-12-18 Thread andy petrella
NP man, The thing is that since you're in a dist env, it'd be cumbersome to do that. Remember that Spark works basically on block/partition, they are the unit of distribution and parallelization. That means that actions have to be run against it **after having been scheduled on the cluster**. The

Re: Incorrect results when calling collect() ?

2014-12-18 Thread Sean Owen
It sounds a lot like your values are mutable classes and you are mutating or reusing them somewhere? It might work until you actually try to materialize them all and find many point to the same object. On Thu, Dec 18, 2014 at 10:06 AM, Tristan Blakers tris...@blackfrog.org wrote: Hi, I’m

Re: Implementing a spark version of Haskell's partition

2014-12-18 Thread Juan Rodríguez Hortalá
Hi Andy, Thanks again for your thoughts on this. I haven't found much information about the internals of Spark, so I find these kinds of explanations of its low-level mechanisms very useful and interesting. It's also nice to know that the two-pass approach is a viable solution. Regards, Juan
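The two-pass approach discussed in this thread can be sketched on an ordinary collection; on an RDD the same two filter calls apply, each triggering its own pass over the data. This is a sketch of the idea, not a Spark API:

```scala
// Haskell's partition p xs = (filter p xs, filter (not . p) xs), expressed as
// two passes -- the shape that maps naturally onto two rdd.filter calls.
def partitionTwoPass[A](xs: List[A])(p: A => Boolean): (List[A], List[A]) =
  (xs.filter(p), xs.filterNot(p))

val (evens, odds) = partitionTwoPass(List(1, 2, 3, 4, 5))(_ % 2 == 0)
println(evens)  // List(2, 4)
println(odds)   // List(1, 3, 5)
```

On an RDD, caching the input before the two filters avoids recomputing it for the second pass.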

Re: Help with updateStateByKey

2014-12-18 Thread Tathagata Das
A good place to start playing with updateStateByKey is the StatefulNetworkWordCount example. See the streaming examples directory in the Spark repository. TD On Thu, Dec 18, 2014 at 6:07 AM, Pierce Lamb richard.pierce.l...@gmail.com wrote: I am trying to run stateful Spark Streaming

Re: Spark Streaming Python APIs?

2014-12-18 Thread Tathagata Das
A more updated version of the streaming programming guide is here http://people.apache.org/~tdas/spark-1.2-temp/streaming-programming-guide.html Please refer to this until we make the official release of Spark 1.2 TD On Tue, Dec 16, 2014 at 3:50 PM, smallmonkey...@hotmail.com

Re: Incorrect results when calling collect() ?

2014-12-18 Thread Tristan Blakers
Suspected the same thing, but because the underlying data classes are deserialised by Avro I think they have to be mutable as you need to provide the no-args constructor with settable fields. Nothing is being cached in my code anywhere, and this can be reproduced using data directly out of the

Re: Incorrect results when calling collect() ?

2014-12-18 Thread Sean Owen
Being mutable is fine; reusing and mutating the objects is the issue. And yes the objects you get back from Hadoop are reused by Hadoop InputFormats. You should just map the objects to a clone before using them where you need them to exist all independently at once, like before a collect(). (That
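The aliasing Sean describes can be shown with a plain mutable class standing in for a reused Hadoop Writable or Avro record (Record here is a hypothetical stand-in, not a Spark or Hadoop class):

```scala
// A mutable record, reused the way Hadoop InputFormats reuse key/value objects.
class Record(var value: Int)

val shared = new Record(0)
val rows = List(1, 2, 3)

// Aliasing: every element is the same instance, so after the loop they all
// show the last value written -- the symptom seen after collect().
val aliased = rows.map { v => shared.value = v; shared }

// Defensive copy: each element is independent, as Sean recommends.
val copied = rows.map { v => shared.value = v; new Record(shared.value) }

println(aliased.map(_.value))  // List(3, 3, 3)
println(copied.map(_.value))   // List(1, 2, 3)
```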

create table in yarn-cluster mode vs yarn-client mode

2014-12-18 Thread Chirag Aggarwal
Hi, I have a simple app, where I am trying to create a table. I am able to create the table on running app in yarn-client mode, but not with yarn-cluster mode. Is this some known issue? Has this already been fixed? Please note that I am using spark-1.1 over hadoop-2.4.0 App: - import

RE: weird bytecode incompatability issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-18 Thread Sun, Rui
Owen, Since we have individual module jars published into the central Maven repo for an official release, we need to make sure the official Spark assembly jar is assembled exactly from these jars, so there will be no binary compatibility issue. We can also publish the official

Re: java.io.NotSerializableException: org.apache.avro.mapred.AvroKey using spark with avro

2014-12-18 Thread Anish Haldiya
Hi, I had the same problem. One option (starting with Spark 1.2, which is currently in preview) is to use the Avro library for Spark SQL. The other is Kryo serialization: by default Spark uses Java serialization, but you can specify Kryo serialization when creating the Spark context. val conf = new
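The truncated snippet above can be completed as a hedged sketch. Assumptions: spark-core is on the classpath, MyAvroRecord is a placeholder for your own Avro-generated class, and registerKryoClasses is available (it was added in Spark 1.2; on 1.1 you would set spark.kryo.registrator instead):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Switch from the default Java serialization to Kryo and register the
// Avro classes that will cross the wire.
val conf = new SparkConf()
  .setAppName("avro-job")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyAvroRecord])) // placeholder class
val sc = new SparkContext(conf)
```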

Re: Incorrect results when calling collect() ?

2014-12-18 Thread Tristan Blakers
Recording the outcome here for the record. Based on Sean’s advice I’ve confirmed that making defensive copies of records that will be collected avoids this problem - it does seem like Avro is being a bit too aggressive when deciding it’s safe to reuse an object for a new record. On 18 December

Re: No disk single pass RDD aggregation

2014-12-18 Thread Jim Carroll
Hi, This was all my fault. It turned out I had a line of code buried in a library that did a repartition. I used this library to wrap an RDD to present it to legacy code as a different interface. That's what was causing the data to spill to disk. The really stupid thing is it took me the better

pyspark 1.1.1 on windows saveAsTextFile - NullPointerException

2014-12-18 Thread mj
Hi, I'm trying to use pyspark to save a simple RDD to a text file (code below), but it keeps throwing an error. - Python Code - items = ["Hello", "world"] items2 = sc.parallelize(items) items2.coalesce(1).saveAsTextFile('c:/tmp/python_out.csv') - Error

Re: pyspark 1.1.1 on windows saveAsTextFile - NullPointerException

2014-12-18 Thread Akhil Das
It seems you are missing HADOOP_HOME in the environment. As it says: java.io.IOException: Could not locate executable *null*\bin\winutils.exe in the Hadoop binaries. That null is supposed to be your HADOOP_HOME. Thanks Best Regards On Thu, Dec 18, 2014 at 7:10 PM, mj jone...@gmail.com wrote:

Downloads from S3 exceedingly slow when running on spark-ec2

2014-12-18 Thread Jon Chase
I'm running a very simple Spark application that downloads files from S3, does a bit of mapping, then uploads new files. Each file is roughly 2MB and is gzip'd. I was running the same code on Amazon's EMR w/Spark and not having any download speed issues (Amazon's EMR provides a custom

Re: Help with updateStateByKey

2014-12-18 Thread Silvio Fiorito
Hi Pierce, You shouldn’t have to use groupByKey, because updateStateByKey will already get a Seq of all the values for that key. I used that for realtime sessionization as well. What I did was key my incoming events, then send them to updateStateByKey. The updateStateByKey function then
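The update function Silvio describes has the plain signature (Seq[V], Option[S]) => Option[S] and can be written and tested without a streaming context; it would then be passed to updateStateByKey. A minimal sketch where the state is a per-key event count and returning None expires the key (the timeout policy here is illustrative, not Spark's):

```scala
// Shape expected by updateStateByKey: the new values for a key in this batch,
// plus the previous state, producing the next state. None drops the key's state.
def updateCount(newEvents: Seq[Int], state: Option[Int]): Option[Int] =
  (newEvents, state) match {
    case (Seq(), None) => None                         // nothing to track: expire
    case _             => Some(state.getOrElse(0) + newEvents.sum)
  }

println(updateCount(Seq(1, 1, 1), None))  // Some(3)
println(updateCount(Seq(2), Some(3)))     // Some(5)
println(updateCount(Seq(), None))         // None
```

In the streaming job this would be wired up as keyedStream.updateStateByKey(updateCount _).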

Spark 1.2 Release Date

2014-12-18 Thread Al M
Is there a planned release date for Spark 1.2? I saw on the Spark Wiki https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage that we are already in the latter part of the release window.

Re: Spark 1.2 Release Date

2014-12-18 Thread Silvio Fiorito
It’s on Maven Central already http://search.maven.org/#browse%7C717101892 On 12/18/14, 2:09 PM, Al M alasdair.mcbr...@gmail.com wrote: Is there a planned release date for Spark 1.2? I saw on the Spark Wiki https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage that we are

EC2 VPC script

2014-12-18 Thread Eduardo Cusa
Hi guys. I ran the following command to launch a new cluster: ./spark-ec2 -k test -i test.pem -s 1 --vpc-id vpc-X --subnet-id subnet-X launch vpc_spark The instances started ok, but the command never ends, with the following output: Setting up security groups... Searching for existing

Re: Spark 1.2 Release Date

2014-12-18 Thread nitin
Soon enough :) http://apache-spark-developers-list.1001551.n3.nabble.com/RESULT-VOTE-Release-Apache-Spark-1-2-0-RC2-td9815.html

Effects problems in logistic regression

2014-12-18 Thread Franco Barrientos
Hi all! I have a problem with LogisticRegressionWithSGD: when I train a data set with one variable (which is an amount of an item) plus an intercept, I get weights of (-0.4021, -207.1749) for the two features, respectively. This doesn't make sense to me, because I ran a logistic regression for the same

Standalone Spark program

2014-12-18 Thread Akshat Aranya
Hi, I am building a Spark-based service which requires initialization of a SparkContext in a main(): def main(args: Array[String]) { val conf = new SparkConf(false) .setMaster("spark://foo.example.com:7077") .setAppName("foobar") val sc = new SparkContext(conf) val rdd =

Re: Spark 1.2 Release Date

2014-12-18 Thread Al M
Awesome. Thanks!

Re: Effects problems in logistic regression

2014-12-18 Thread Sean Owen
Are you sure this is an apples-to-apples comparison? for example does your SAS process normalize or otherwise transform the data first? Is the optimization configured similarly in both cases -- same regularization, etc.? Are you sure you are pulling out the intercept correctly? It is a separate

Re: Standalone Spark program

2014-12-18 Thread Akhil Das
You can build a jar of your project and add it to the SparkContext (sc.addJar("/path/to/your/project.jar")); then it will get shipped to the workers, and hence no ClassNotFoundException! Thanks Best Regards On Thu, Dec 18, 2014 at 10:06 PM, Akshat Aranya aara...@gmail.com wrote: Hi, I am building

RE: Effects problems in logistic regression

2014-12-18 Thread Franco Barrientos
Thanks, I will try. From: DB Tsai [mailto:dbt...@dbtsai.com] Sent: Thursday, December 18, 2014 16:24 To: Franco Barrientos Cc: Sean Owen; user@spark.apache.org Subject: Re: Effects problems in logistic regression Can you try LogisticRegressionWithLBFGS? I verified that this will

Re: When will Spark SQL support building DB index natively?

2014-12-18 Thread Michael Armbrust
It is implemented in the same way as Hive and interoperates with the Hive metastore. In 1.2 we are considering adding partitioning to the Spark SQL data source API as well. However, for now, you should create a Hive context and a partitioned table. Spark SQL will automatically select partitions

does spark sql support columnar compression with encoding when caching tables

2014-12-18 Thread Sadhan Sood
Hi All, Wondering if, when caching a table backed by lzo-compressed Parquet data, Spark also compresses it (using lzo/gzip/snappy) along with column-level encoding, or just does the column-level encoding, when *spark.sql.inMemoryColumnarStorage.compressed* is set to true. This is because when I

Re: Help with updateStateByKey

2014-12-18 Thread Pierce Lamb
This produces the expected output, thank you! On Thu, Dec 18, 2014 at 12:11 PM, Silvio Fiorito silvio.fior...@granturing.com wrote: Ok, I have a better idea of what you’re trying to do now. I think the prob might be the map. The first time the function runs, currentValue will be None. Using

Re: Help with updateStateByKey

2014-12-18 Thread Silvio Fiorito
Great, glad it worked out! Just keep an eye on memory usage as you roll it out. Like I said before, if you’ll be running this 24/7 consider cleaning up sessions by returning None after some sort of timeout. On 12/18/14, 8:25 PM, Pierce Lamb richard.pierce.l...@gmail.com wrote: This produces

UNION two RDDs

2014-12-18 Thread Jerry Lam
Hi Spark users, I wonder if val resultRDD = RDDA.union(RDDB) will always have records in RDDA before records in RDDB. Also, will resultRDD.coalesce(1) change this ordering? Best Regards, Jerry

Re: Standalone Spark program

2014-12-18 Thread Andrew Or
Hey Akshat, What is the class that is not found, is it a Spark class or classes that you define in your own application? If the latter, then Akhil's solution should work (alternatively you can also pass the jar through the --jars command line option in spark-submit). If it's a Spark class,

Spark GraphX question.

2014-12-18 Thread Tae-Hyuk Ahn
Hi All, I am wondering what is the best way to remove transitive edges with a maximum spanning tree. For example, Edges: 1 - 2 (30) 2 - 3 (30) 1 - 3 (25) where the value in parentheses is the weight of each edge. Then, I'd like to get the reduced edge graph after transitive reduction, considering the

Creating a smaller, derivative RDD from an RDD

2014-12-18 Thread bethesda
We have a very large RDD and I need to create a new RDD whose values are derived from each record of the original RDD, and we only retain the few new records that meet a criteria. I want to avoid creating a second large RDD and then filtering it since I believe this could tax system resources

RE: Control default partition when load a RDD from HDFS

2014-12-18 Thread Shuai Zheng
Hmmm, how to do that? You mean create an RDD for each file? Then I will have tons of RDDs. And my calculation needs to rely on other input, not just the file itself. Can you show some pseudo code for that logic? Regards, Shuai From: Diego García Valverde [mailto:dgarci...@agbar.es]

Re: hello

2014-12-18 Thread Harihar Nahak
You mean the Spark User List? It's pretty easy; check the first email, it has all the instructions. On 18 December 2014 at 21:56, csjtx1021 [via Apache Spark User List] ml-node+s1001560n20759...@n3.nabble.com wrote: i want to join you

Re: Spark GraphX question.

2014-12-18 Thread Harihar Nahak
Hi Ted, I've no idea what transitive reduction is, but you can achieve the expected result with the graph.subgraph(graph.edges.filter()) syntax, which filters edges by weight and gives you a new graph per your condition. On 19 December 2014 at 11:11, Tae-Hyuk Ahn [via Apache Spark User List]

How to increase parallelism in Yarn

2014-12-18 Thread Suman Somasundar
Hi, I am using Spark 1.1.1 on Yarn. When I try to run K-Means, I see from the Yarn dashboard that only 3 containers are being used. How do I increase the number of containers used? P.S: When I run K-Means on Mahout with the same settings, I see that there are 25-30 containers being

Re: Creating a smaller, derivative RDD from an RDD

2014-12-18 Thread Sean Owen
I don't think you can avoid examining each element of the RDD, if that's what you mean. Your approach is basically the best you can do in general. You're not making a second RDD here, and even if you did this in two steps, the second RDD is really more bookkeeping than a second huge data
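The single-pass derive-and-keep pattern Sean endorses can be sketched on a List; rdd.flatMap has the same shape, and emitting zero or one elements per record means no intermediate large dataset is materialized first and filtered second (derive and keep are hypothetical placeholders):

```scala
// One-pass derive-and-keep: flatMap emits zero or one derived element per
// input record, so nothing resembling a "second large RDD" is ever built.
val records = List(1, 2, 3, 4, 5, 6)

def derive(r: Int): Int = r * 10          // hypothetical derivation
def keep(d: Int): Boolean = d % 20 == 0   // hypothetical criterion

val small = records.flatMap { r =>
  val d = derive(r)
  if (keep(d)) Some(d) else None
}
println(small)  // List(20, 40, 60)
```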

Re: How to increase parallelism in Yarn

2014-12-18 Thread Andrew Or
Hi Suman, I'll assume that you are using spark submit to run your application. You can pass the --num-executors flag to ask for more containers. If you want to allocate more memory for each executor, you may also pass in the --executor-memory flag (this accepts a string in the format 1g, 512m

Re: MLLib /ALS : java.lang.OutOfMemoryError: Java heap space

2014-12-18 Thread Xiangrui Meng
Hi Jay, Please try increasing executor memory (if the available memory is more than 2GB) and reduce numBlocks in ALS. The current implementation stores all subproblems in memory and hence the memory requirement is significant when k is large. You can also try reducing k and see whether the

RE: Do I need to applied feature scaling via StandardScaler for LBFGS for Linear Regression?

2014-12-18 Thread Bui, Tri
Thanks dbtsai for the info. Are you using the case class for: case (response, vec) => ? Also, what library do I need to import to use .toBreeze? Thanks, tri -Original Message- From: dbt...@dbtsai.com [mailto:dbt...@dbtsai.com] Sent: Friday, December 12, 2014 3:27 PM To: Bui,

Sharing sqlContext between Akka router and routee actors ...

2014-12-18 Thread Manoj Samel
Hi, An Akka router creates a sqlContext and a bunch of routee actors with the sqlContext as a parameter. The actors then execute queries on that sqlContext. Would this pattern be an issue? Is there any other way a sparkContext etc. should be shared cleanly in Akka routers/routees? Thanks,

Re: does spark sql support columnar compression with encoding when caching tables

2014-12-18 Thread Michael Armbrust
There is only column level encoding (run length encoding, delta encoding, dictionary encoding) and no generic compression. On Thu, Dec 18, 2014 at 12:07 PM, Sadhan Sood sadhan.s...@gmail.com wrote: Hi All, Wondering if when caching a table backed by lzo compressed parquet data, if spark also

Re: Sharing sqlContext between Akka router and routee actors ...

2014-12-18 Thread Soumya Simanta
Why do you need a router? I mean, can't you do it with just one actor which has the SQLContext inside it? On Thu, Dec 18, 2014 at 9:45 PM, Manoj Samel manojsamelt...@gmail.com wrote: Hi, Akka router creates a sqlContext and creates a bunch of routees actors with sqlContext as parameter. The

java.lang.ExceptionInInitializerError/Unable to load YARN support

2014-12-18 Thread maven
All, I just built Spark-1.2 on my enterprise server (which has Hadoop 2.3 with YARN). Here're the steps I followed for the build: $ mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package $ export SPARK_HOME=/path/to/spark/folder $ export HADOOP_CONF_DIR=/etc/hadoop/conf

When will spark 1.2 released?

2014-12-18 Thread vboylin1...@gmail.com
Hi, Does anyone know when Spark 1.2 will be released? 1.2 has many great features that we can't wait for ;-) Sincerely, Lin Wukang Sent from NetEase Mail Master

Re: SchemaRDD.sample problem

2014-12-18 Thread madhu phatak
Hi, Can you clean up the code a little? It's hard to read what's going on. You can use pastebin or gist to share the code. On Wed, Dec 17, 2014 at 3:58 PM, Hao Ren inv...@gmail.com wrote: Hi, I am using Spark SQL on the 1.2.1 branch. The problem comes from the following 4-line code: *val

Re: When will spark 1.2 released?

2014-12-18 Thread madhu phatak
It’s on Maven Central already http://search.maven.org/#browse%7C717101892 On Fri, Dec 19, 2014 at 11:17 AM, vboylin1...@gmail.com vboylin1...@gmail.com wrote: Hi, Dose any know when will spark 1.2 released? 1.2 has many great feature that we can't wait now ,-) Sincely Lin wukang

Re: When will spark 1.2 released?

2014-12-18 Thread Ted Yu
Interesting, the maven artifacts were dated Dec 10th. However vote for RC2 closed recently: http://search-hadoop.com/m/JW1q5K8onk2/Patrick+spark+1.2.0subj=Re+VOTE+Release+Apache+Spark+1+2+0+RC2+ Cheers On Dec 18, 2014, at 10:02 PM, madhu phatak phatak@gmail.com wrote: It’s on Maven

Re: When will spark 1.2 released?

2014-12-18 Thread Andrew Ash
Patrick is working on the release as we speak -- I expect it'll be out later tonight (US west coast) or tomorrow at the latest. On Fri, Dec 19, 2014 at 1:09 AM, Ted Yu yuzhih...@gmail.com wrote: Interesting, the maven artifacts were dated Dec 10th. However vote for RC2 closed recently:

Re: When will spark 1.2 released?

2014-12-18 Thread Matei Zaharia
Yup, as he posted before, An Apache infrastructure issue prevented me from pushing this last night. The issue was resolved today and I should be able to push the final release artifacts tonight. On Dec 18, 2014, at 10:14 PM, Andrew Ash and...@andrewash.com wrote: Patrick is working on the