Re: Submitting Spark application through code

2014-10-31 Thread sivarani
I tried running it but it didn't work public static final SparkConf batchConf = new SparkConf(); String master = "spark://sivarani:7077"; String spark_home = "/home/sivarani/spark-1.0.2-bin-hadoop2/"; String jar = "/home/sivarani/build/Test.jar"; public static final JavaSparkContext batchSparkContext = new

Re: Scaladoc

2014-10-31 Thread Kamal Banga
In IntelliJ, Tools > Generate Scaladoc. Kamal On Fri, Oct 31, 2014 at 5:35 AM, Alessandro Baretta alexbare...@gmail.com wrote: How do I build the scaladoc html files from the spark source distribution? Alex Baretta

Re: Submitting Spark application through code

2014-10-31 Thread Sonal Goyal
What do your worker logs say? Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Oct 31, 2014 at 11:44 AM, sivarani whitefeathers...@gmail.com wrote: I tried running it but it didn't work public static final SparkConf batchConf = new

Re: Manipulating RDDs within a DStream

2014-10-31 Thread lalit1303
Hi, Since the Cassandra object is not serializable, you can't open the connection at the driver level and access the object inside foreachRDD (i.e. at the worker level). You have to open the connection inside foreachRDD only, perform the operation and then close the connection. For example:
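For illustration, a sketch of that pattern; createConnection, insert and close are placeholders for whatever Cassandra client is in use (not a real connector API), and opening one connection per partition via foreachPartition is a common refinement:

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val connection = createConnection()                     // placeholder: opened on the worker, not the driver
        try {
          records.foreach(record => connection.insert(record))  // placeholder write call
        } finally {
          connection.close()                                    // close before the task finishes
        }
      }
    }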

Re: NonSerializable Exception in foreachRDD

2014-10-31 Thread Akhil Das
Are you expecting something like this? val data = ssc.textFileStream("hdfs://akhldz:9000/input/") val rdd = ssc.sparkContext.parallelize(Seq("foo", "bar")) val sample = data.foreachRDD(x => { val new_rdd = x.union(rdd) new_rdd.saveAsTextFile("hdfs://akhldz:9000/output/") })

Re: Spark Streaming Issue not running 24/7

2014-10-31 Thread Akhil Das
It says 478548 on host 172.18.152.36: java.lang.ArrayIndexOutOfBoundsException. Can you try putting a try { } catch block around all those operations that you are doing on the DStream? That way it will not stop the entire application due to corrupt data/fields etc. Thanks Best Regards On Fri, Oct 31,
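For illustration, a sketch of that suggestion; process and the comma split are placeholders for the real per-record work:

    dstream.foreachRDD { rdd =>
      rdd.foreach { line =>
        try {
          val fields = line.split(",")   // parsing like this is what can throw ArrayIndexOutOfBoundsException
          process(fields(3))             // placeholder for the real per-record logic
        } catch {
          case e: Exception =>           // log and skip the corrupt record instead of failing the whole job
        }
      }
    }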

different behaviour of the same code

2014-10-31 Thread lieyan
I am trying to write some sample code under IntelliJ IDEA. I start with a non-sbt Scala project. To get the program to compile, I add *spark-assembly-1.1.0-hadoop2.4.0.jar* from the *spark/lib* directory as an external library of the IDEA project.

Re: Using a Database to persist and load data from

2014-10-31 Thread Kamal Banga
You can also use PairRDDFunctions' saveAsNewAPIHadoopFile that takes an OutputFormat class. So you will have to write a custom OutputFormat class that extends OutputFormat. In this class, you will have to implement a getRecordWriter which returns a custom RecordWriter. So you will also have to
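For illustration, a skeleton of that approach in Scala; MyOutputFormat, MyRecordWriter and the String key/value types are hypothetical placeholders, and the real record writer would hold your database connection:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapreduce.{JobContext, OutputCommitter, OutputFormat, RecordWriter, TaskAttemptContext}
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat

    class MyRecordWriter extends RecordWriter[String, String] {
      // open a connection to the target store here (one writer is created per task)
      override def write(key: String, value: String): Unit = {
        // issue the insert/update for this record
      }
      override def close(context: TaskAttemptContext): Unit = {
        // flush and close the connection
      }
    }

    class MyOutputFormat extends OutputFormat[String, String] {
      override def getRecordWriter(context: TaskAttemptContext): RecordWriter[String, String] =
        new MyRecordWriter
      override def checkOutputSpecs(context: JobContext): Unit = {}   // nothing to validate in this sketch
      override def getOutputCommitter(context: TaskAttemptContext): OutputCommitter =
        new NullOutputFormat[String, String]().getOutputCommitter(context)   // no-op committer
    }

    // usage from a pair RDD (the path is required by the API even if the writer ignores it):
    // pairRdd.saveAsNewAPIHadoopFile("/tmp/unused", classOf[String], classOf[String],
    //   classOf[MyOutputFormat], new Configuration())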

about aggregateByKey and standard deviation

2014-10-31 Thread qinwei
Hi, everyone. I have an RDD filled with data like (k1, v11) (k1, v12) (k1, v13) (k2, v21) (k2, v22) (k2, v23) ... I want to calculate the average and standard deviation of (v11, v12, v13) and (v21, v22, v23), grouped by their keys, for
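For illustration, a sketch of the aggregateByKey approach, tracking (count, sum, sum of squares) per key; the sample values are stand-ins for the real data:

    val data = sc.parallelize(Seq(
      ("k1", 1.0), ("k1", 2.0), ("k1", 3.0),
      ("k2", 10.0), ("k2", 20.0), ("k2", 30.0)))

    val stats = data.aggregateByKey((0L, 0.0, 0.0))(
      // merge one value into a partition-local accumulator
      (acc, v) => (acc._1 + 1, acc._2 + v, acc._3 + v * v),
      // merge two accumulators coming from different partitions
      (a, b)   => (a._1 + b._1, a._2 + b._2, a._3 + b._3)
    ).mapValues { case (n, sum, sumSq) =>
      val mean = sum / n
      val stddev = math.sqrt(sumSq / n - mean * mean)   // population standard deviation
      (mean, stddev)
    }

    stats.collect().foreach(println)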

Re: Using a Database to persist and load data from

2014-10-31 Thread Sonal Goyal
I think you can try to use the Hadoop DBOutputFormat Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Oct 31, 2014 at 1:00 PM, Kamal Banga ka...@sigmoidanalytics.com wrote: You can also use PairRDDFunctions' saveAsNewAPIHadoopFile

Re: SparkContext UI

2014-10-31 Thread Sean Owen
No, empty parens do not matter when calling no-arg methods in Scala. This invocation should work as-is and should result in the RDD showing in Storage. I see that when I run it right now. Since it really does/should work, I'd look at other possibilities -- is it maybe taking a short time to start

Re: Doing RDD.count in parallel , at at least parallelize it as much as possible?

2014-10-31 Thread Sean Owen
cache() won't speed up a single operation on an RDD, since it is computed the same way before it is persisted. On Thu, Oct 30, 2014 at 7:15 PM, Sameer Farooqui same...@databricks.com wrote: By the way, in case you haven't done so, do try to .cache() the RDD before running a .count() on it as

ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId not found

2014-10-31 Thread Dai, Kevin
Hi, all. My job failed and there are a lot of "ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId not found" messages in the log. Can anyone tell me what's wrong and how to fix it? Best Regards, Kevin.

Spark SQL on Cassandra

2014-10-31 Thread cis
I am dealing with a Lambda Architecture. This means I have Hadoop on the batch layer, Storm on the speed layer and I'm storing the precomputed views from both layers in Cassandra. I understand that Spark is a substitute for Hadoop but at the moment I would like not to change the batch layer. I

Repartitioning by partition size, not by number of partitions.

2014-10-31 Thread jan.zikes
Hi, my input data consists of many very small files, each containing one .json. For performance reasons (I use PySpark) I have to do repartitioning; currently I do: sc.textFile(files).coalesce(100). The problem is that I have to guess the number of partitions in such a way that it's as fast as

Re: how idf is calculated

2014-10-31 Thread Andrejs Abele
I found my problem. I assumed, based on the TF-IDF article on Wikipedia, that log base 10 is used, but as I found in this discussion https://groups.google.com/forum/#!topic/scala-language/K5tbYSYqQc8, in Scala it is actually ln (the natural logarithm). Regards, Andrejs On Thu, Oct 30, 2014 at 10:49 PM, Ashic

Re: how idf is calculated

2014-10-31 Thread Sean Owen
Yes, here the base doesn't matter as it just multiplies all results by a constant factor. Math libraries tend to have ln, not log10 or log2. ln is often the more, er, natural base for several computations. So I would assume that log = ln in the context of ML. On Fri, Oct 31, 2014 at 11:31 AM,
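A quick way to convince yourself of that, assuming nothing beyond the change-of-base identity:

    val x = 42.0
    math.abs(math.log10(x) - math.log(x) / math.log(10)) < 1e-12   // true: log10(x) = ln(x) / ln(10), a constant factor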

Re: Executor and BlockManager memory size

2014-10-31 Thread Gen
Hi, I am hitting the same problem in the context of Spark on YARN. When I open pyspark with the following command: spark/bin/pyspark --master yarn-client --num-executors 1 --executor-memory 2500m it turns out *INFO storage.BlockManagerMasterActor: Registering block manager
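For what it's worth, a back-of-the-envelope check, assuming the Spark 1.x defaults of spark.storage.memoryFraction = 0.6 and a 0.9 safety fraction, which would explain a registered block manager size well below the requested 2500m:

    val executorMemoryMB = 2500
    val storageMB = executorMemoryMB * 0.6 * 0.9   // roughly 1350 MB available to the block manager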

Re: sbt/sbt compile error [FATAL]

2014-10-31 Thread HansPeterS
Hi, thanks. Branch 1.1 did not work but 1.0 worked. Why could that be? Regards Hans-Peter -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sbt-sbt-compile-error-FATAL-tp17629p17817.html Sent from the Apache Spark User List mailing list archive at

SQL COUNT DISTINCT

2014-10-31 Thread Bojan Kostic
While testing Spark SQL I noticed that COUNT DISTINCT works really slowly. The map partitions phase finishes fast, but the collect phase is slow; it only runs on a single executor. Should it run this way? And here is the simple code which I use for testing: val sqlContext = new

RE: Repartitioning by partition size, not by number of partitions.

2014-10-31 Thread Ganelin, Ilya
Hi Jan. I've actually written a function recently to do precisely that using the RDD.randomSplit function. You just need to calculate how big each element of your data is, then how many elements can fit in each RDD to populate the input to randomSplit. Unfortunately, in my case I wind up

unsubscribe

2014-10-31 Thread Hongbin Liu
Apologies for having to send this to all. I am highly interested in Spark and would like to stay on this mailing list, but the email address I signed up with is not the right one. The link below to unsubscribe does not seem to work. https://spark.apache.org/community.html Can anyone help?

Re: unsubscribe

2014-10-31 Thread Corey Nolet
Hongbin, Please send an email to user-unsubscr...@spark.apache.org in order to unsubscribe from the user list. On Fri, Oct 31, 2014 at 9:05 AM, Hongbin Liu hongbin@theice.com wrote: Apology for having to send to all. I am highly interested in spark, would like to stay in this mailing

Too many files open with Spark 1.1 and CDH 5.1

2014-10-31 Thread Bill Q
Hi, I am trying to make Spark SQL 1.1 work to replace part of our ETL processes that are currently done by Hive 0.12. A common problem that I have encountered is the "Too many files open" error. Once that happened, the query just failed. I started the spark-shell by using ulimit -n 4096

Re: CANNOT FIND ADDRESS

2014-10-31 Thread akhandeshi
Thanks for the pointers! I did try, but it didn't seem to help... In my latest try, I am doing spark-submit with local mode, but I see the same message in the Spark app UI (4040): localhost CANNOT FIND ADDRESS. In the logs, I see a lot of spilling of the in-memory map to disk. I don't understand why that is the case.

Re: Too many files open with Spark 1.1 and CDH 5.1

2014-10-31 Thread Sean Owen
It's almost surely the workers, not the driver (shell) that have too many files open. You can change their ulimit. But it's probably better to see why it happened -- a very big shuffle? -- and repartition or design differently to avoid it. The new sort-based shuffle might help in this regard. On
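For reference, a minimal sketch of opting into the sort-based shuffle in Spark 1.1, where the hash-based shuffle is still the default:

    val conf = new org.apache.spark.SparkConf()
      .setAppName("shuffle-heavy-job")            // illustrative name
      .set("spark.shuffle.manager", "sort")       // hash-based is the 1.1 default
    // or from the command line: spark-shell --conf spark.shuffle.manager=sort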

SparkContext.stop() ?

2014-10-31 Thread ll
what is it for? when do we call it? thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-stop-tp17826.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Too many files open with Spark 1.1 and CDH 5.1

2014-10-31 Thread Bill Q
Hi Sean, Thanks for the reply. I think both driver and worker have the problem. You are right that the ulimit fixed the driver-side "too many files open" error. And there is a very big shuffle. My maybe naive thought is to migrate the HQL scripts directly from Hive to Spark SQL and make them work.

Re: does updateStateByKey accept a state that is a tuple?

2014-10-31 Thread spr
Based on execution on small test cases, it appears that the construction below does what I intend. (Yes, all those Tuple1()s were superfluous.) var lines = ssc.textFileStream(dirArg) var linesArray = lines.map( line => line.split("\t") ) var newState = linesArray.map( lineArray =>
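To the original question: the state in updateStateByKey can be any Scala type, tuples included. A sketch reusing the textFileStream shape above (the column positions and the running (count, sum) state are assumptions, not from the original post):

    val pairs = ssc.textFileStream(dirArg).map { line =>
      val a = line.split("\t")
      (a(0), a(1).toDouble)                     // assumes key in column 0, numeric value in column 1
    }

    val updateFn = (values: Seq[Double], state: Option[(Long, Double)]) => {
      val (count, sum) = state.getOrElse((0L, 0.0))
      Some((count + values.size, sum + values.sum))
    }

    val runningTotals = pairs.updateStateByKey(updateFn)   // DStream[(String, (Long, Double))]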

Re: Out of memory with Spark Streaming

2014-10-31 Thread Aniket Bhatnagar
Thanks Chris for looking at this. I was putting data at roughly the same 50 records per batch max. This issue was purely because of a bug in my persistence logic that was leaking memory. Overall, I haven't seen a lot of lag with kinesis + spark setup and I am able to process records at roughly

RE: Repartitioning by partition size, not by number of partitions.

2014-10-31 Thread jan.zikes
Hi Ilya, This seems to me a quite complicated solution. I'm thinking that an easier (though not optimal) solution might be, for example, to use heuristically something like RDD.coalesce(RDD.getNumPartitions() / N), but it makes me wonder why Spark does not have something like
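For illustration, a sketch of that heuristic, with N as a tunable shrink factor rather than anything Spark provides out of the box:

    val N = 10                                   // assumption: how many small input partitions to merge into one
    val rdd = sc.textFile(files)
    val merged = rdd.coalesce(math.max(1, rdd.partitions.size / N))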

Re: How to set Spark to perform only one map at once at each cluster node

2014-10-31 Thread jan.zikes
Yes, as you say, I would expect setting executor-cores to 1 to work, but it seems to me that when I do use executor-cores=1 it actually performs more than one job on each of the machines at a time (at least based on what top says).

Re: spark streaming - saving kafka DStream into hadoop throws exception

2014-10-31 Thread Sean Owen
Hm, now I am also seeing this problem. The essence of my code is: final JavaSparkContext sparkContext = new JavaSparkContext(sparkConf); JavaStreamingContextFactory streamingContextFactory = new JavaStreamingContextFactory() { @Override public JavaStreamingContext create() {

A Spark Design Problem

2014-10-31 Thread Steve Lewis
The original problem is in biology but the following captures the CS issues. Assume I have a large number of locks and a large number of keys. There is a scoring function between keys and locks, and a key that fits a lock will have a high score. There may be many keys fitting one lock and a key

Re: Too many files open with Spark 1.1 and CDH 5.1

2014-10-31 Thread Nicholas Chammas
As Sean suggested, try out the new sort-based shuffle in 1.1 if you know you're triggering large shuffles. That should help a lot. On Friday, October 31, 2014, Bill Q bill.q@gmail.com wrote: Hi Sean, Thanks for the reply. I think both driver and worker have the problem. You are right that the

Re: Measuring Performance in Spark

2014-10-31 Thread mahsa
Are there any tools like Ganglia that I can use to get performance data on Spark, or do I need to do it myself? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Measuring-Performance-in-Spark-tp17376p17836.html Sent from the Apache Spark User List mailing

Re: Measuring Performance in Spark

2014-10-31 Thread Otis Gospodnetic
Hi Mahsa, Use SPM http://sematext.com/spm/. See http://blog.sematext.com/2014/10/07/apache-spark-monitoring/ . Otis -- Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr Elasticsearch Support * http://sematext.com/ On Fri, Oct 31, 2014 at 1:00 PM, mahsa

properties file on a spark cluster

2014-10-31 Thread Daniel Takabayashi
Hi guys, I'm trying to execute a Spark job using Python, running on a YARN cluster (managed by Cloudera Manager). The Python script uses a set of Python programs installed on each member of the cluster. This set of programs needs a property file found via a local filesystem path. My problem is:

Re: Measuring Performance in Spark

2014-10-31 Thread mahsa
Oh this is Awesome! exactly what I needed! Thank you Otis! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Measuring-Performance-in-Spark-tp17376p17839.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: SparkContext.stop() ?

2014-10-31 Thread Daniel Siegmann
It is used to shut down the context when you're done with it, but if you're using a context for the lifetime of your application I don't think it matters. I use this in my unit tests, because they start up local contexts and you can't have multiple local contexts open so each test must stop its

Re: SQL COUNT DISTINCT

2014-10-31 Thread Nicholas Chammas
The only thing in your code that cannot be parallelized is the collect() because -- by definition -- it collects all the results to the driver node. This has nothing to do with the DISTINCT in your query. What do you want to do with the results after you collect them? How many results do you have

Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread shahab
Hi, I am using the latest Cassandra-Spark Connector to access Cassandra tables from Spark. While I successfully managed to connect to Cassandra using CassandraRDD, the similar SparkSQL approach does not work. Here is my code for both methods: import com.datastax.spark.connector._ import

Re: Manipulating RDDs within a DStream

2014-10-31 Thread Helena Edelson
Hi Harold, Can you include the versions of spark and spark-cassandra-connector you are using? Thanks! Helena @helenaedelson On Oct 30, 2014, at 12:58 PM, Harold Nguyen har...@nexgate.com wrote: Hi all, I'd like to be able to modify values in a DStream, and then send it off to an

Re: Manipulating RDDs within a DStream

2014-10-31 Thread Harold Nguyen
Thanks Lalit, and Helena, What I'd like to do is manipulate the values within a DStream like this: DStream.foreachRDD( rdd => { val arr = record.toArray } I'd then like to be able to insert results from the arr back into Cassandra, after I've manipulated the arr array. However, for all

Re: Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread Helena Edelson
Hi Shahab, I’m just curious, are you explicitly needing to use thrift? Just using the connector with spark does not require any thrift dependencies. Simply: "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1" But to your question, you declare the keyspace but also unnecessarily

Re: A Spark Design Problem

2014-10-31 Thread francois . garillot
Hi Steve, Are you talking about sequence alignment? — FG On Fri, Oct 31, 2014 at 5:44 PM, Steve Lewis lordjoe2...@gmail.com wrote: The original problem is in biology but the following captures the CS issues, Assume I have a large number of locks and a large number of keys. There is a

Re: Manipulating RDDs within a DStream

2014-10-31 Thread Helena Edelson
Hi Harold, Yes, that is the problem :) Sorry for the confusion, I will make this clear in the docs ;) since master is work for the next version. All you need to do is use Spark 1.1.0 as you have it already, "com.datastax.spark" %% "spark-cassandra-connector" % "1.1.0-beta1", and assembly - not from

Re: Manipulating RDDs within a DStream

2014-10-31 Thread Helena Edelson
Hi Harold, This is a great use case, and here is how you could do it, for example, with Spark Streaming: Using a Kafka stream: https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrweather/KafkaStreamingActor.scala#L50 Save raw data to

Re: Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread shahab
Thanks Helena. I tried setting the keyspace, but I got the same result. I also removed other Cassandra dependencies, but still the same exception! I also tried to see if this setting appears in the CassandraSQLContext or not, so I printed out the configuration val cc = new

Re: SparkContext.stop() ?

2014-10-31 Thread Matei Zaharia
You don't have to call it if you just exit your application, but it's useful for example in unit tests if you want to create and shut down a separate SparkContext for each test. Matei On Oct 31, 2014, at 10:39 AM, Evan R. Sparks evan.spa...@gmail.com wrote: In cluster settings if you don't
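A minimal sketch of that unit-test pattern (a local context created and stopped per test):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
    try {
      assert(sc.parallelize(1 to 10).count() == 10)
    } finally {
      sc.stop()   // frees the local context for the next test and finalizes the event log
    }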

Re: Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread Helena Edelson
Hi Shahab, The apache cassandra version looks great. I think that doing cc.setKeyspace("mydb") cc.sql("SELECT * FROM mytable") versus cc.setKeyspace("mydb") cc.sql("select * from mydb.mytable") Is the problem? And if not, would you mind creating a ticket off-list for us to help further?

LinearRegression and model prediction threshold

2014-10-31 Thread Sameer Tilak
Hi All, I am using LinearRegression and have a question about the details of the model.predict method. Basically it is predicting variable y given an input vector x. However, can someone point me to the documentation about what threshold is used in the predict method? Can that be changed? I am

Re: Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread shahab
OK, I created an issue. Hopefully it will be resolved soon. Again thanks, best, /Shahab On Fri, Oct 31, 2014 at 7:05 PM, Helena Edelson helena.edel...@datastax.com wrote: Hi Shahab, The apache cassandra version looks great. I think that doing cc.setKeyspace(mydb) cc.sql(SELECT * FROM

Re: SparkContext.stop() ?

2014-10-31 Thread Marcelo Vanzin
Actually, if you don't call SparkContext.stop(), the event log information that is used by the history server will be incomplete, and your application will never show up in the history server's UI. If you don't use that functionality, then you're probably ok not calling it as long as your

Re: LinearRegression and model prediction threshold

2014-10-31 Thread Sean Owen
It sounds like you are asking about logistic regression, not linear regression. If so, yes that's just what it does. The default would be 0.5 in logistic regression. If you 'clear' the threshold you get the raw margin out of this and other linear classifiers. On Fri, Oct 31, 2014 at 7:18 PM,
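A sketch of the distinction, using MLlib's logistic regression; trainingData (an RDD[LabeledPoint]) and features (a Vector) are assumed to exist already:

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

    val model = LogisticRegressionWithSGD.train(trainingData, 100)   // 100 iterations
    model.predict(features)     // 0.0 or 1.0, thresholded at the default 0.5
    model.clearThreshold()
    model.predict(features)     // now returns the raw score instead of a 0/1 label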

Re: A Spark Design Problem

2014-10-31 Thread Sonal Goyal
Does the following help? JavaPairRDD&lt;bin, key&gt; join with JavaPairRDD&lt;bin, lock&gt; If you partition both RDDs by the bin id, I think you should be able to get what you want. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Fri, Oct 31, 2014 at
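For illustration, a sketch of that idea in Scala; the Key/Lock case classes, the bin ids and score() are placeholders, not from the original post:

    case class Key(id: Long)
    case class Lock(id: Long)
    def score(k: Key, l: Lock): Double = 0.0                  // placeholder scoring function

    val keysByBin  = sc.parallelize(Seq.empty[(Int, Key)])    // (binId, key)
    val locksByBin = sc.parallelize(Seq.empty[(Int, Lock)])   // (binId, lock)

    val partitioner = new org.apache.spark.HashPartitioner(100)
    val scored = keysByBin.partitionBy(partitioner)
      .join(locksByBin.partitionBy(partitioner))              // co-partitioned join on the shared bin id
      .map { case (bin, (key, lock)) => ((key, lock), score(key, lock)) }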

Re: LinearRegression and model prediction threshold

2014-10-31 Thread Sonal Goyal
You can serialize the model to a local/hdfs file system and use it later when you want. Best Regards, Sonal Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Sat, Nov 1, 2014 at 12:02 AM, Sean Owen so...@cloudera.com wrote: It sounds like you are asking about

Example of Fold

2014-10-31 Thread Ron Ayoub
I want to fold an RDD into a smaller RDD with max elements. I have simple bean objects with 4 properties. I want to group by 3 of the properties and then select the max of the 4th, so I believe fold is the appropriate method for this. My question is: is there a good fold example out there?

Spark Build

2014-10-31 Thread Terry Siu
I am synced up to the Spark master branch as of commit 23468e7e96. I have Maven 3.0.5, Scala 2.10.3, and SBT 0.13.1. I’ve built the master branch successfully previously and am trying to rebuild again to take advantage of the new Hive 0.13.1 profile. I execute the following command: $ mvn

Re: Spark Build

2014-10-31 Thread Shivaram Venkataraman
Yeah looks like https://github.com/apache/spark/pull/2744 broke the build. We will fix it soon On Fri, Oct 31, 2014 at 12:21 PM, Terry Siu terry@smartfocus.com wrote: I am synced up to the Spark master branch as of commit 23468e7e96. I have Maven 3.0.5, Scala 2.10.3, and SBT 0.13.1. I’ve

Re: Spark Build

2014-10-31 Thread Terry Siu
Thanks for the update, Shivaram. -Terry On 10/31/14, 12:37 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Yeah looks like https://github.com/apache/spark/pull/2744 broke the build. We will fix it soon On Fri, Oct 31, 2014 at 12:21 PM, Terry Siu terry@smartfocus.com wrote: I

Spark Standalone on cluster stops

2014-10-31 Thread TJ Klein
Hi, I have an issue with running Spark in standalone mode on a cluster. Everything seems to run fine for a couple of minutes until Spark stops executing the tasks. Any idea? Would appreciate some help. Thanks in advance, Tassilo I get errors like that at the end: 14/10/31 16:16:59 INFO

Re: Example of Fold

2014-10-31 Thread Daniil Osipov
You should look at how fold is used in scala in general to help. Here is a blog post that may also give some guidance: http://blog.madhukaraphatak.com/spark-rdd-fold The zero value should be your bean, with the 4th parameter set to the minimum value. Your fold function should compare the 4th
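For illustration, a sketch of that advice applied to the grouped-max question above; the Bean case class and its field names are made up:

    case class Bean(a: String, b: String, c: String, value: Double)

    val beans = sc.parallelize(Seq(
      Bean("x", "y", "z", 1.0), Bean("x", "y", "z", 5.0), Bean("p", "q", "r", 2.0)))

    val zero = Bean("", "", "", Double.MinValue)              // the "minimum value" zero element

    val maxPerGroup = beans
      .keyBy(bn => (bn.a, bn.b, bn.c))                        // group by the first three properties
      .foldByKey(zero)((x, y) => if (x.value >= y.value) x else y)   // keep the record with the larger 4th property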

Re: Help with error initializing SparkR.

2014-10-31 Thread tongzzz
Try running this code: sudo -E R CMD javareconf and then start Spark. Basically, it syncs R's Java configuration with your Java configuration. Good luck! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-error-initializing-SparkR-tp4495p17871.html

hadoop_conf_dir when running spark on yarn

2014-10-31 Thread ameyc
How do I set up hadoop_conf_dir correctly when I'm running my Spark job on YARN? My YARN environment has the correct hadoop_conf_dir settings, but the configuration that I pull from sc.hadoopConfiguration() is incorrect. -- View this message in context:

SparkSQL performance

2014-10-31 Thread Soumya Simanta
I was really surprised to see the results here, esp. SparkSQL not completing http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style I was under the impression that SparkSQL performs really well because it can optimize the RDD operations and load only the columns that are required.

Re: SparkSQL performance

2014-10-31 Thread Du Li
We have seen all kinds of results published that often contradict each other. My take is that the authors often know more tricks about how to tune their own/familiar products than the others. So the product on focus is tuned for ideal performance while the competitors are not. The authors are

Spark Meetup in Singapore

2014-10-31 Thread Social Marketing
Dear Sir/Madam, We want to become organisers of a Singapore meetup to promote the regional Spark and big data community in the ASEAN area. My name is Songtao; I am a big data consultant in Singapore and have great passion for Spark technologies. Thanks, Songtao

Re: SparkContext UI

2014-10-31 Thread Stuart Horsman
Hi Sean/Sameer, It seems you're both right. In the Python shell I need to explicitly call the empty parens, data.cache(), then run an action, and it appears in the Storage tab. Using the Scala shell I can just call data.cache without the parens, run an action, and that works. Thanks for your help.

Re: SparkSQL performance

2014-10-31 Thread Soumya Simanta
I agree. My personal experience with Spark core is that it performs really well once you tune it properly. As far as I understand, SparkSQL under the hood performs many of these optimizations (ordering of Spark operations) and uses a more efficient storage format. Is this assumption correct? Has anyone