Re: jar changed on src filesystem

2014-07-17 Thread Chester@work
Since you are running in yarn-cluster mode and you are supplying the spark assembly jar file, there is no need to install spark on each node. Is it possible the two spark jars have different versions? Chester Sent from my iPad On Jul 16, 2014, at 22:49, cmti95035 cmti95...@gmail.com wrote: Hi,

Using RDD in RDD transformation

2014-07-17 Thread tbin
I implemented a simple KNN classifier. I can run it successfully on a single sample, but it throws an error when it is run on an RDD of test samples. I attached the source code as an attachment. Looking forward to your reply! Best wishes to you! The following is the source code. import math from pyspark

Re: can we insert and update with spark sql

2014-07-17 Thread Akhil Das
Is this what you are looking for? https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/sql/parquet/InsertIntoParquetTable.html According to the doc, it says "Operator that acts as a sink for queries on RDDs and can be used to store the output inside a directory of Parquet files." This

Re: jar changed on src filesystem

2014-07-17 Thread cmti95035
They're all the same version. Actually even without the --jars parameter it got the same error. Looks like it needs to copy the assembly jar for running the example jar anyway during the staging.

Re: Errors accessing hdfs while in local mode

2014-07-17 Thread Akhil Das
You can try the following in the spark-shell: 1. Run it in Cluster mode by going inside the spark directory: $ SPARK_MASTER=spark://masterip:7077 ./bin/spark-shell val textFile = sc.textFile("hdfs://masterip/data/blah.csv") textFile.take(10).foreach(println) 2. Now try running in Local mode:

preservesPartitioning

2014-07-17 Thread Kamal Banga
Hi All, The function mapPartitions in RDD.scala https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala takes a boolean parameter preservesPartitioning. It seems if that parameter is passed as false, the passed function f will operate on the data only

Re: preservesPartitioning

2014-07-17 Thread Matei Zaharia
Hi Kamal, This is not what preservesPartitioning does -- actually what it means is that if the RDD has a Partitioner set (which means it's an RDD of key-value pairs and the keys are grouped into a known way, e.g. hashed or range-partitioned), your map function is not changing the partition of
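A minimal Scala sketch of Matei's point, assuming a pair RDD that already has a HashPartitioner set; because the function below never touches the keys, passing preservesPartitioning = true keeps the existing partitioner:

    import org.apache.spark.SparkContext._
    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2))).partitionBy(new HashPartitioner(4))
    // Only values change, never keys, so the known partitioning remains valid.
    val doubled = pairs.mapPartitions(
      iter => iter.map { case (k, v) => (k, v * 2) },
      preservesPartitioning = true)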

Re: Kmeans

2014-07-17 Thread Xiangrui Meng
Yes, both run in parallel. Random is a baseline implementation of initialization, which may ignore small clusters. k-means++ improves random initialization by adding weights to points far away from the current candidates. You can view k-means|| as a more scalable version of k-means++. We don't
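As a rough illustration (assuming Spark 1.0 MLlib; the file path and parameters are placeholders), the initialization mode is just the last argument to KMeans.train:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.textFile("kmeans_data.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()
    // KMeans.RANDOM is the simple baseline; KMeans.K_MEANS_PARALLEL ("k-means||")
    // is the scalable k-means++ variant.
    val model = KMeans.train(data, 10, 20, 1, KMeans.K_MEANS_PARALLEL)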

class after join

2014-07-17 Thread Luis Guerra
Hi all, I am a newbie Spark user with many doubts, so sorry if this is a silly question. I am dealing with tabular data formatted as text files, so when I first load the data, my code is like this: case class data_class( V1: String, V2: String, V3: String, V4: String, V5: String,
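For reference, such a load typically looks roughly like this (the delimiter and field names are only illustrative):

    case class DataClass(v1: String, v2: String, v3: String, v4: String, v5: String)

    val rows = sc.textFile("table.txt")
      .map(_.split("\t"))
      .map(a => DataClass(a(0), a(1), a(2), a(3), a(4)))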

Re: MLLib - Regularized logistic regression in python

2014-07-17 Thread Xiangrui Meng
1) This is a miss, unfortunately ... We will add support for regularization and intercept in the coming v1.1. (JIRA: https://issues.apache.org/jira/browse/SPARK-2550) 2) It has overflow problems in Python but not in Scala. We can stabilize the computation by ensuring exp only takes a negative

Re: Kyro deserialisation error

2014-07-17 Thread Tathagata Das
Seems like there is some sort of stream corruption, causing the Kryo read to read a weird class name from the stream (the name "arl Fridtjof Rode" in the exception cannot be a class!). Not sure how to debug this. @Patrick: Any idea? On Wed, Jul 16, 2014 at 10:14 PM, Hao Wang wh.s...@gmail.com wrote:

Re: Kyro deserialisation error

2014-07-17 Thread Sean Owen
Not sure if this helps, but it does seem to be part of a name in a Wikipedia article, and Wikipedia is the data set. So something is reading this class name from the data. http://en.wikipedia.org/wiki/Carl_Fridtjof_Rode On Thu, Jul 17, 2014 at 9:40 AM, Tathagata Das tathagata.das1...@gmail.com

Re: class after join

2014-07-17 Thread Luis Guerra
Thank you for your fast reply. We are considering this Map[String, String] solution, but there are some details that we do not control yet. What would happen if we have different data types for different fields? Also, with this solution, we have to repeat the field names for every row that we

Re: Pysparkshell are not listing in the web UI while running

2014-07-17 Thread Akhil Das
Hi Neethu, Your application is running in local mode and that's the reason why you are not seeing the driver app in the 8080 webUI. You can pass the Master IP to your pyspark and get it running in cluster mode. eg: IPYTHON_OPTS="notebook --pylab inline" $SPARK_HOME/bin/pyspark --master

Bad Digest error while doing aws s3 put

2014-07-17 Thread lmk
Hi, I am getting the following error while trying to save a large dataset to S3 using the saveAsHadoopFile command with Apache Spark 1.0. org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 PUT failed for

Apache kafka + spark + Parquet

2014-07-17 Thread Mahebub Sayyed
Hi All, Currently we are reading (multiple) topics from Apache Kafka and storing that in HBase (multiple tables) using Twitter Storm (1 tuple is stored in 4 different tables), but we are facing some performance issues with HBase. So we are replacing HBase with Parquet files and Storm with

Spark scheduling with Capacity scheduler

2014-07-17 Thread Konstantin Kudryavtsev
Hi all, I'm using HDP 2.0, YARN. I'm running both MapReduce and Spark jobs on this cluster. Is it possible to somehow use the Capacity Scheduler for Spark job management as well as MR jobs? I mean, I'm able to send an MR job to a specific queue; may I do the same with a Spark job? Thank you in advance. Thank

Re: Apache kafka + spark + Parquet

2014-07-17 Thread Tathagata Das
1. You can put multiple Kafka topics in the same Kafka input stream. See the example KafkaWordCount https://github.com/apache/spark/blob/68f28dabe9c7679be82e684385be216319beb610/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala. However they will all be read
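A minimal sketch of one Kafka input stream reading several topics (the ZooKeeper address, consumer group and topic names are placeholders, and ssc is an existing StreamingContext):

    import org.apache.spark.streaming.kafka.KafkaUtils

    // the Int per topic is the number of consumer threads inside this single receiver
    val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1)
    val kafkaStream = KafkaUtils.createStream(ssc, "zkhost:2181", "my-consumer-group", topics)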

Re: Speeding up K-Means Clustering

2014-07-17 Thread Xiangrui Meng
Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g and repartition your data to match the number of CPU cores such that the data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to store the data. Make sure there is enough memory for caching. -Xiangrui On Thu, Jul 17, 2014 at

Re: Kyro deserialisation error

2014-07-17 Thread Hao Wang
Hi all, Yes, it's the name of a Wikipedia article. I am running the WikipediaPageRank example of Spark Bagel. I am wondering whether there is any relation to the buffer size of Kryo. The PageRank sometimes finishes successfully, and sometimes not because this kind of Kryo exception happens too many times, which

Re: Pysparkshell are not listing in the web UI while running

2014-07-17 Thread MEETHU MATHEW
Hi Akhil, That fixed the problem... Thanks   Thanks & Regards, Meethu M On Thursday, 17 July 2014 2:26 PM, Akhil Das ak...@sigmoidanalytics.com wrote: Hi Neethu, Your application is running in local mode and that's the reason why you are not seeing the driver app in the 8080 webUI. You

GraphX Pragel implementation

2014-07-17 Thread Arun Kumar
Hi, I am trying to implement the belief propagation algorithm in GraphX using the Pregel API. def pregel[A](initialMsg: A, maxIter: Int = Int.MaxValue, activeDir: EdgeDirection = EdgeDirection.Out)(vprog: (VertexId, VD, A) => VD,

Re: Apache kafka + spark + Parquet

2014-07-17 Thread Mahebub Sayyed
Hi, To migrate data from HBase to Parquet we used the following query through Impala: INSERT INTO table PARQUET_HASHTAGS( key, city_name, country_name, hashtag_date, hashtag_text, hashtag_source, hashtag_month, posted_time, hashtag_time, tweet_id, user_id, user_name, hashtag_year )

Re: Speeding up K-Means Clustering

2014-07-17 Thread Ravishankar Rajagopalan
Hi Xiangrui, Yes I am using Spark v0.9 and am not running it in local mode. I did the memory setting using export SPARK_MEM=4G before starting the Spark instance. Also previously, I was starting it with -c 1 but changed it to -c 12 since it is a 12 core machine. It did bring down the time

Re: Getting pyspark.resultiterable.ResultIterable at xxxxxx in local shell

2014-07-17 Thread newbee88
Could someone please help me resolve the "This post has NOT been accepted by the mailing list yet." issue? I registered and subscribed to the mailing list many days ago but my post is still in an unaccepted state.

Re: Spark Streaming timing considerations

2014-07-17 Thread Laeeq Ahmed
Hi TD, I have been able to filter the first WindowedRDD, but I am not sure how to make a generic filter. The larger window is 8 seconds and I want to fetch 4 seconds based on the application timestamp. I have seen an earlier post which suggests a timeStampBasedwindow but I am not sure how to make

Equivalent functions for NVL() and CASE expressions in Spark SQL

2014-07-17 Thread pandees waran
Do we have any equivalent Scala functions available for NVL() and CASE expressions to use in Spark SQL?

Re: Simple record matching using Spark SQL

2014-07-17 Thread Sarath Chandra
Added the below 2 lines just before the sql query line - ... file1_schema.count; file2_schema.count; ... and it started working. But I couldn't figure out the reason. Can someone please explain it to me? What was happening earlier and what is happening with the addition of these 2 lines? ~Sarath On Thu,

Re: Simple record matching using Spark SQL

2014-07-17 Thread Michael Armbrust
What version are you running? Could you provide a jstack of the driver and executor when it is hanging? On Thu, Jul 17, 2014 at 10:55 AM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote: Added below 2 lines just before the sql query line - ... file1_schema.count;

Re: class after join

2014-07-17 Thread Michael Armbrust
If you intern the string it will be more efficient, but still significantly more expensive than the class based approach. ** VERY EXPERIMENTAL ** We are working with EPFL on a lightweight syntax for naming the results of spark transformations in scala (and are going to make it interoperate with

Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-17 Thread Yan Fang
Hi guys, I am sure some of you have a similar use case and I want to know how you deal with it. In our application, we want to check the previous state of some keys and compare it with their current state. AFAIK, Spark Streaming does not have key-value access. So currently what I am doing is storing the previous

Re: Spark scheduling with Capacity scheduler

2014-07-17 Thread Matei Zaharia
It's possible using the --queue argument of spark-submit. Unfortunately this is not documented on http://spark.apache.org/docs/latest/running-on-yarn.html but it appears if you just type spark-submit --help or spark-submit with no arguments. Matei On Jul 17, 2014, at 2:33 AM, Konstantin

Re: Spark scheduling with Capacity scheduler

2014-07-17 Thread Derek Schoettle
unsubscribe From: Matei Zaharia matei.zaha...@gmail.com To: user@spark.apache.org Date: 07/17/2014 12:41 PM Subject:Re: Spark scheduling with Capacity scheduler It's possible using the --queue argument of spark-submit. Unfortunately this is not documented on

Error while running example/scala application using spark-submit

2014-07-17 Thread ShanxT
Hi, I am receiving the below error while submitting any Spark example or Scala application. Really appreciate any help. spark version = 1.0.0 hadoop version = 2.4.0 Windows/Standalone mode 14/07/17 22:13:19 INFO TaskSchedulerImpl: Cancelling stage 0 Exception in thread "main"

Need help on Spark UDF (Join) Performance tuning .

2014-07-17 Thread S Malligarjunan
Hello Experts, I am facing a performance problem when I use the UDF function call. Please help me to tune the query. Please find the details below. shark> select count(*) from table1; OK 151096 Time taken: 7.242 seconds shark> select count(*) from table2; OK 938 Time taken: 1.273 seconds Without

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-17 Thread Marcelo Vanzin
On Wed, Jul 16, 2014 at 12:36 PM, Matt Work Coarr mattcoarr.w...@gmail.com wrote: Thanks Marcelo, I'm not seeing anything in the logs that clearly explains what's causing this to break. One interesting point that we just discovered is that if we run the driver and the slave (worker) on the

Re: Speeding up K-Means Clustering

2014-07-17 Thread Xiangrui Meng
Please try val parsedData3 = data3.repartition(12).map(_.split("\t")).map(_.toDouble).cache() and check the storage and driver/executor memory in the WebUI. Make sure the data is fully cached. -Xiangrui On Thu, Jul 17, 2014 at 5:09 AM, Ravishankar Rajagopalan viora...@gmail.com wrote: Hi

Re: Spark Streaming Json file groupby function

2014-07-17 Thread srinivas
Hi TD, Thanks for the solutions for my previous post... I am running into another issue. I am getting data from a json file and I am trying to parse it and map it to a record given below: val jsonf

Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-17 Thread Tathagata Das
For accessing the previous version, I would do it the same way. :) 1. Can you elaborate on what you mean by that with an example? What do you mean by accessing keys? 2. Yeah, that is hard to do without the ability to do point lookups into an RDD, which we don't support yet. You could try embedding the

Re: Spark Streaming Json file groupby function

2014-07-17 Thread Tathagata Das
This is a basic Scala problem. You cannot apply toInt to Any. Try doing toString.toInt. For such Scala issues, I recommend trying it out in the Scala shell. For example, you could have tried this out as the following. [tdas @ Xion streaming] scala Welcome to Scala version 2.10.3 (Java HotSpot(TM)
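A tiny transcript of the suggestion (calling toInt directly on the Any value would not compile, since toInt is not a member of Any):

    scala> val v: Any = "5"
    v: Any = 5

    scala> v.toString.toInt + 1
    res0: Int = 6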

Re: Error while running example/scala application using spark-submit

2014-07-17 Thread ShanxT
Thanks Sean, 1) Yes, I am trying to run locally without Hadoop. 2) I also see the error in the provided link while launching spark-shell, but post launch I am able to execute the same code I have in the sample application: read any local file and perform some reduction operation. But not through

Re: Spark Streaming timing considerations

2014-07-17 Thread Tathagata Das
You have to define what range of records needs to be filtered out in every windowed RDD, right? For example, when the DStream.window has data from times 0 - 8 seconds by DStream time, you only want to filter out data that falls into, say, 4 - 8 seconds by application time. This latter
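One possible sketch of that idea, assuming each record carries its own timestamp and extractAppTime is a placeholder for however that timestamp is parsed:

    import org.apache.spark.streaming.{Seconds, Time}

    val windowed = stream.window(Seconds(8), Seconds(4))
    // keep only records whose application timestamp falls within the last 4 seconds
    // relative to the end of the current batch
    val filtered = windowed.transform((rdd, batchTime: Time) =>
      rdd.filter(r => extractAppTime(r) > batchTime.milliseconds - 4000))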

Re: Error while running example/scala application using spark-submit

2014-07-17 Thread Stephen Boesch
Hi Sean RE: Windows and hadoop 2.4.x HortonWorks - all the hype aside - only supports Windows Server 2008/2012. So this general concept of supporting Windows is bunk. Given that - and since the vast majority of Windows users do not happen to have Windows Server on their laptop - do you have any

Re: GraphX Pragel implementation

2014-07-17 Thread Ankur Dave
If your sendMsg function needs to know the incoming messages as well as the vertex value, you could define VD to be a tuple of the vertex value and the last received message. The vprog function would then store the incoming messages into the tuple, allowing sendMsg to access them. For example, if
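A toy sketch of that idea (the numbers and update rule are placeholders, not belief propagation itself): the vertex attribute becomes a (value, lastMessage) pair, vprog stores the incoming message, and sendMsg can then read it from srcAttr:

    import org.apache.spark.graphx._

    val graph: Graph[(Double, Double), Int] =
      GraphLoader.edgeListFile(sc, "edges.txt").mapVertices((_, _) => (1.0, 0.0))

    val result = graph.pregel(0.0, 10)(
      (id, vd, msg) => (vd._1 + msg, msg),                             // vprog: keep the last message
      triplet => Iterator((triplet.dstId, triplet.srcAttr._2 * 0.5)),  // sendMsg: read it from srcAttr
      (a, b) => a + b)                                                 // mergeMsg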

Custom Metrics Sink

2014-07-17 Thread jjaffe
What is the preferred way of adding a custom metrics sink to Spark? I noticed that the Sink trait has been private since April, so I cannot simply extend Sink in an outside package, but I would like to avoid having to create a custom build of Spark. Is this possible?

Re: Error while running example/scala application using spark-submit

2014-07-17 Thread Sean Owen
I am probably the wrong person to ask as I never use Hadoop on Windows. But from looking at the code just now, it is clearly trying to accommodate Windows shell commands. Yes, I would not be surprised if it still needs Cygwin. A slightly broader point is that ideally it doesn't matter whether Hadoop

Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-17 Thread Yan Fang
Hi TD, Thank you for the quick reply and for backing my approach. :) 1) The example is this: 1. In the first 2-second interval, after updateStateByKey, I get a few keys and their states, say, (a -> 1, b -> 2, c -> 3). 2. In the following 2-second interval, I only receive c and d and their values. But

Re: Difference among batchDuration, windowDuration, slideDuration

2014-07-17 Thread hsy...@gmail.com
Thanks Tathagata, so can I say the RDD size (from the stream) is the window size, and the overlap between 2 adjacent RDDs is the sliding size? But I still don't understand what batch size is; why do we need it since data processing is RDD by RDD, right? And does spark chop the data into RDDs at the very

Re: Error: No space left on device

2014-07-17 Thread Chris DuBois
Hi Xiangrui, Thanks. I have taken your advice and set all 5 of my slaves to be c3.4xlarge. In this case /mnt and /mnt2 have plenty of space by default. I now do sc.textFile(blah).repartition(N).map(...).cache() with N=80 and spark.executor.memory to be 20gb and --driver-memory 20g. So far things

Re: using multiple dstreams together (spark streaming)

2014-07-17 Thread Walrus theCat
Thanks! On Wed, Jul 16, 2014 at 6:34 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Have you taken a look at DStream.transformWith( ... ) . That allows you apply arbitrary transformation between RDDs (of the same timestamp) of two different streams. So you can do something like this.

Re: Supported SQL syntax in Spark SQL

2014-07-17 Thread Nicholas Chammas
FYI: I've created SPARK-2560 https://issues.apache.org/jira/browse/SPARK-2560 to track creating SQL reference docs for Spark SQL. On Mon, Jul 14, 2014 at 2:06 PM, Michael Armbrust mich...@databricks.com wrote: You can find the parser here:

replacement for SPARK_LIBRARY_PATH ?

2014-07-17 Thread Eric Friedman
I used to use SPARK_LIBRARY_PATH to specify the location of native libs for lzo compression when using spark 0.9.0. The references to that environment variable have disappeared from the docs for spark 1.0.1 and it's not clear how to specify the location for lzo. Any guidance?

Re: Error: No space left on device

2014-07-17 Thread Bill Jay
Hi, I also have some issues with repartition. In my program, I consume data from Kafka. After I consume data, I use repartition(N). However, although I set N to be 120, there are around 18 executors allocated for my reduce stage. I am not sure how the repartition command works to ensure the

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-17 Thread Matt Work Coarr
Thanks Marcelo! This is a huge help!! Looking at the executor logs (in a vanilla spark install, I'm finding them in $SPARK_HOME/work/*)... It launches the executor, but it looks like the CoarseGrainedExecutorBackend is having trouble talking to the driver (exactly what you said!!!). Do you

Re: replacement for SPARK_LIBRARY_PATH ?

2014-07-17 Thread Zongheng Yang
One way is to set this in your conf/spark-defaults.conf: spark.executor.extraLibraryPath /path/to/native/lib The key is documented here: http://spark.apache.org/docs/latest/configuration.html On Thu, Jul 17, 2014 at 1:25 PM, Eric Friedman eric.d.fried...@gmail.com wrote: I used to use

unserializable object in Spark Streaming context

2014-07-17 Thread Yan Fang
Hi guys, I need some help with this problem. In our use case, we need to continuously insert values into the database. So our approach is to create the jdbc object in the main method and then do the inserting operation in the DStream foreachRDD operation. Is this approach reasonable? Then the

Re: unserializable object in Spark Streaming context

2014-07-17 Thread Marcelo Vanzin
Could you share some code (or pseudo-code)? Sounds like you're instantiating the JDBC connection in the driver, and using it inside a closure that would be run in a remote executor. That means that the connection object would need to be serializable. If that sounds like what you're doing, it

Re: Include permalinks in mail footer

2014-07-17 Thread Matei Zaharia
Good question.. I'll ask INFRA because I haven't seen other Apache mailing lists provide this. It would indeed be helpful. Matei On Jul 17, 2014, at 12:59 PM, Nick Chammas nicholas.cham...@gmail.com wrote: Can we modify the mailing list to include permalinks to the thread in the footer of

Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-17 Thread Tathagata Das
The updateFunction given in updateStateByKey should be called on ALL the keys that are in the state, even if there is no new data in the batch for some key. Is that not the behavior you see? What do you mean by "show all the existing states"? You have access to the latest state RDD by doing
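A small sketch of such an update function, assuming pairs is a DStream[(String, Int)]; note that it receives an empty Seq for keys with no new data in the batch, so it can still keep (or expire) their state:

    def updateFunc(newValues: Seq[Int], state: Option[Int]): Option[Int] = {
      if (newValues.isEmpty) state                      // no new data for this key: keep the old state
      else Some(state.getOrElse(0) + newValues.sum)     // fold new data into the running total
    }

    val stateDStream = pairs.updateStateByKey[Int](updateFunc _)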

Re: unserializable object in Spark Streaming context

2014-07-17 Thread Tathagata Das
And if Marcelo's guess is correct, then the right way to do this would be to lazily / dynamically create the jdbc connection server as a singleton in the workers/executors and use that. Something like this: dstream.foreachRDD(rdd => { rdd.foreachPartition((iterator: Iterator[...]) => {
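Filled out, the pattern might look like the sketch below; ConnectionPool and insert are hypothetical helpers and the JDBC URL is a placeholder. The connection is created lazily on each executor JVM and reused across batches:

    object ConnectionPool {
      // one connection per executor JVM, created on first use on that executor
      lazy val connection = java.sql.DriverManager.getConnection("jdbc:...", "user", "pass")
    }

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { iterator =>
        val conn = ConnectionPool.connection        // resolved on the executor, not the driver
        iterator.foreach(record => insert(conn, record))
      }
    }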

Re: Spark Streaming timestamps

2014-07-17 Thread Bill Jay
Hi Tathagata, Thanks for your answer. Please see my further question below: On Wed, Jul 16, 2014 at 6:57 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Answers inline. On Wed, Jul 16, 2014 at 5:39 PM, Bill Jay bill.jaypeter...@gmail.com wrote: Hi all, I am currently using Spark

how to pass extra Java opts to workers for spark streaming jobs

2014-07-17 Thread Chen Song
I am using spark 0.9.0 and I am able to submit job to YARN, https://spark.apache.org/docs/0.9.0/running-on-yarn.html. I am trying to turn on gc logging on executors but could not find a way to set extra Java opts for workers. I tried to set spark.executor.extraJavaOptions but that did not work.

Re: spark streaming rate limiting from kafka

2014-07-17 Thread Chen Song
Thanks Luis and Tobias. On Tue, Jul 1, 2014 at 11:39 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Wed, Jul 2, 2014 at 1:57 AM, Chen Song chen.song...@gmail.com wrote: * Is there a way to control how far Kafka Dstream can read on topic-partition (via offset for example). By setting

Re: Apache kafka + spark + Parquet

2014-07-17 Thread Tathagata Das
val kafkaStream = KafkaUtils.createStream(... ) // see the example in my previous post val transformedStream = kafkaStream.map ... // whatever transformation you want to do transformedStream.foreachRDD((rdd: RDD[...], time: Time) => { // save the rdd to parquet file, using time as the file
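Completing that sketch under the assumption that the records are a case class (so the createSchemaRDD implicit of a Spark SQL 1.0 sqlContext applies); names and paths are placeholders:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.Time
    import sqlContext.createSchemaRDD   // implicit conversion RDD[case class] -> SchemaRDD

    case class Record(key: String, value: String)

    transformedStream.foreachRDD((rdd: RDD[Record], time: Time) => {
      // one Parquet directory per batch, keyed by the batch time
      rdd.saveAsParquetFile("hdfs:///output/records-" + time.milliseconds)
    })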

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-17 Thread Marcelo Vanzin
Hi Matt, I'm not very familiar with setup on ec2; the closest I can point you at is to look at the launch_cluster in ec2/spark_ec2.py, where the ports seem to be configured. On Thu, Jul 17, 2014 at 1:29 PM, Matt Work Coarr mattcoarr.w...@gmail.com wrote: Thanks Marcelo! This is a huge help!!

Re: unserializable object in Spark Streaming context

2014-07-17 Thread Yan Fang
Hi Marcelo and TD, Thank you for the help. If I use TD's approach, it works and there is no exception. The only drawback is that it will create many connections to the DB, which I was trying to avoid. Here is a snapshot of my code; I marked the important code in red. What I was thinking is that, if

Re: Apache kafka + spark + Parquet

2014-07-17 Thread Michael Armbrust
We don't have support for partitioned parquet yet. There is a JIRA here: https://issues.apache.org/jira/browse/SPARK-2406 On Thu, Jul 17, 2014 at 5:00 PM, Tathagata Das tathagata.das1...@gmail.com wrote: val kafkaStream = KafkaUtils.createStream(... ) // see the example in my previous post

Re: Retrieve dataset of Big Data Benchmark

2014-07-17 Thread Tom
Hi Burak, I tried running it through the Spark shell, but I still ended with the same error message as in Hadoop: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the

Error with spark-submit

2014-07-17 Thread ranjanp
Hi, I am new to Spark and trying it out with a stand-alone, 3-node (1 master, 2 workers) cluster. From the Web UI at the master, I see that the workers are registered. But when I try running the SparkPi example from the master node, I get the following message and then an exception. 14/07/17 01:20:36

Large scale ranked recommendation

2014-07-17 Thread m3.sharma
Hi, I am trying to develop a recommender system for about 1 million users and 10 thousand items. Currently it's a simple regression-based model where for every (user, item) pair in the dataset we generate some features and learn a model from it. Till training and evaluation everything is fine; the

Re: Release date for new pyspark

2014-07-17 Thread Paul Wais
Thanks all! (And thanks Matei for the developer link!) I was able to build using maven[1] but `./sbt/sbt assembly` results in build errors. (Not familiar enough with the build to know why; in the past sbt worked for me and maven did not). I was able to run the master version of pyspark, which

Error with spark-submit (formatting corrected)

2014-07-17 Thread ranjanp
Hi, I am new to Spark and trying it out with a stand-alone, 3-node (1 master, 2 workers) cluster. From the Web UI at the master, I see that the workers are registered. But when I try running the SparkPi example from the master node, I get the following message and then an exception. 14/07/17

Re: Large scale ranked recommendation

2014-07-17 Thread m3.sharma
We are using RegressionModels that comes with *mllib* package in SPARK. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Large-scale-ranked-recommendation-tp10098p10103.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Large scale ranked recommendation

2014-07-17 Thread Shuo Xiang
Hi, Are you suggesting that taking simple vector dot products or sigmoid function on 10K * 1M data takes 5hrs? On Thu, Jul 17, 2014 at 3:59 PM, m3.sharma sharm...@umn.edu wrote: We are using RegressionModels that comes with *mllib* package in SPARK. -- View this message in context:

Re: spark streaming rate limiting from kafka

2014-07-17 Thread Bill Jay
I also have an issue consuming from Kafka. When I consume from Kafka, there is always a single executor working on this job. Even if I use repartition, it seems that there is still a single executor. Does anyone have an idea how to add parallelism to this job? On Thu, Jul 17, 2014 at 2:06 PM, Chen

Re: Large scale ranked recommendation

2014-07-17 Thread m3.sharma
Yes, that's what prediction should be doing: taking dot products or applying the sigmoid function for each (user, item) pair. For 1 million users and 10K items there are 10 billion pairs.

Spark Streaming

2014-07-17 Thread Guangle Fan
Hi All, When I run spark streaming, in one of the flatMap stages I want to access a database. The code looks like: stream.flatMap( new FlatMapFunction { call () { //access database cluster } } ) Since I don't want to create a database connection every time call() is called, where

Re: unserializable object in Spark Streaming context

2014-07-17 Thread Yan Fang
Hi Sean, Thank you. I see your point. What I was thinking was to do the computation in a distributed fashion and do the storing from a single place. But you are right, having multiple DB connections actually is fine. Thanks for answering my questions. That helps me understand the system. Cheers,

Hive From Spark

2014-07-17 Thread JiajiaJing
Hello Spark Users, I am new to Spark SQL and now trying to first get the HiveFromSpark example working. However, I got the following error when running HiveFromSpark.scala program. May I get some help on this please? ERROR MESSAGE: org.apache.thrift.TApplicationException: Invalid method name:

Re: replacement for SPARK_LIBRARY_PATH ?

2014-07-17 Thread Koert Kuipers
but be aware that spark-defaults.conf is only used if you use spark-submit On Jul 17, 2014 4:29 PM, Zongheng Yang zonghen...@gmail.com wrote: One way is to set this in your conf/spark-defaults.conf: spark.executor.extraLibraryPath /path/to/native/lib The key is documented here:

Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-17 Thread Yan Fang
Hi TD, Thank you. Yes, it behaves as you described. Sorry for missing this point. Then my only concern is on the performance side - since Spark Streaming operates on all the keys every time a new batch comes, I think it is fine when the state size is small. When the state size becomes big, say, a

Re: can't get jobs to run on cluster (enough memory and cpus are available on worker)

2014-07-17 Thread Andrew Or
Hi Matt, The security group shouldn't be an issue; the ports listed in `spark_ec2.py` are only for communication with the outside world. How did you launch your application? I notice you did not launch your driver from your Master node. What happens if you did? Another thing is that there seems

Re: Include permalinks in mail footer

2014-07-17 Thread Tobias Pfeiffer
On Jul 17, 2014, at 12:59 PM, Nick Chammas nicholas.cham...@gmail.com wrote: I often find myself wanting to reference one thread from another, or from a JIRA issue. Right now I have to google the thread subject and find the link that way. +1

Re: spark streaming rate limiting from kafka

2014-07-17 Thread Tobias Pfeiffer
Bill, are you saying, after repartition(400), you have 400 partitions on one host and the other hosts receive nothing of the data? Tobias On Fri, Jul 18, 2014 at 8:11 AM, Bill Jay bill.jaypeter...@gmail.com wrote: I also have an issue consuming from Kafka. When I consume from Kafka, there

Re: how to pass extra Java opts to workers for spark streaming jobs

2014-07-17 Thread Tathagata Das
Can you check in the environment tab of Spark web ui to see whether this configuration parameter is in effect? TD On Thu, Jul 17, 2014 at 2:05 PM, Chen Song chen.song...@gmail.com wrote: I am using spark 0.9.0 and I am able to submit job to YARN,

Re: spark streaming rate limiting from kafka

2014-07-17 Thread Tathagata Das
You can create multiple Kafka streams to partition your topics across them, which will run multiple receivers on multiple executors. This is covered in the Spark streaming guide. http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving And for the
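Roughly, following that section of the guide (the ZooKeeper address, group and topic names are placeholders):

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numStreams = 4
    val kafkaStreams = (1 to numStreams).map { _ =>
      KafkaUtils.createStream(ssc, "zkhost:2181", "my-consumer-group", Map("mytopic" -> 1))
    }
    // one receiver per stream; union them and repartition before the heavy processing
    val unified = ssc.union(kafkaStreams)
    val repartitioned = unified.repartition(16)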

Re: Error with spark-submit (formatting corrected)

2014-07-17 Thread Andrew Or
Hi ranjanp, If you go to the master UI (masterIP:8080), what does the first line say? Verify that this is the same as what you expect. Another thing is that --master in spark submit overwrites whatever you set MASTER to, so the environment variable won't actually take effect. Another obvious

Re: how to pass extra Java opts to workers for spark streaming jobs

2014-07-17 Thread Andrew Or
Hi Chen, spark.executor.extraJavaOptions is introduced in Spark 1.0, not in Spark 0.9. You need to export SPARK_JAVA_OPTS="-Dspark.config1=value1 -Dspark.config2=value2" in conf/spark-env.sh. Let me know if that works. Andrew 2014-07-17 18:15 GMT-07:00 Tathagata Das

Re: jar changed on src filesystem

2014-07-17 Thread Andrew Or
Hi Jian, In yarn-cluster mode, Spark submit automatically uploads the assembly jar to a distributed cache that all executor containers read from, so there is no need to manually copy the assembly jar to all nodes (or pass it through --jars). It seems there are two versions of the same jar in your

Re: Spark Streaming

2014-07-17 Thread Tathagata Das
Step 1: use rdd.mapPartitions(). That's equivalent to the mapper of MapReduce. You can open a connection, get all the data and buffer it, close the connection, and return an iterator to the buffer. Step 2: Make step 1 better by making it reuse connections. You can use singletons / static vars to lazily
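A sketch of step 1 (lookup and the JDBC URL are placeholders); note the buffering with toList so the connection can be closed before the iterator is returned:

    val enriched = stream.mapPartitions { iter =>
      val conn = java.sql.DriverManager.getConnection("jdbc:...")     // one connection per partition
      val buffered = iter.map(record => lookup(conn, record)).toList  // force evaluation while open
      conn.close()
      buffered.iterator
    }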

Re: Errors accessing hdfs while in local mode

2014-07-17 Thread Andrew Or
Hi Chris, Did you ever figure this out? It should just work provided that your HDFS is set up correctly. If you don't call setMaster, it actually uses the spark://[master-node-ip]:7077 by default (this is configured in your conf/spark-env.sh). However, even if you use a local master, it should

Re: Is there a way to get previous/other keys' state in Spark Streaming?

2014-07-17 Thread Tathagata Das
Yes, this is the limitation of the current implementation. But this will be improved a lot when we have IndexedRDD https://github.com/apache/spark/pull/1297 in Spark, which allows faster single-value updates to a key-value RDD (within each partition, without processing the entire partition.

Re: Error with spark-submit (formatting corrected)

2014-07-17 Thread Jay Vyas
I think I know what is happening to you. I've looked into this a bit just this week, so it's fresh in my brain :) Hope this helps. When no workers are known to the master, IIRC, you get this message. I think this is how it works: 1) You start your master. 2) You start a slave, and give it

iScala or Scala-notebook

2014-07-17 Thread ericjohnston1989
Hey everyone, I know this was asked before but I'm wondering if there have since been any updates. Are there any plans to integrate iScala/Scala-notebook with spark in the near future? This seems like something a lot of people would find very useful, so I was just wondering if anyone has started

Cannot connect to hive metastore

2014-07-17 Thread linkpatrickliu
Seems like the mysql connector jar is not included in the classpath. Where can I add the jar to the classpath? hive-site.xml: <property> <name>javax.jdo.option.ConnectionURL</name> <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8</value>

Re: how to pass extra Java opts to workers for spark streaming jobs

2014-07-17 Thread Chen Song
Thanks Andrew. Say that I want to turn on CMS GC for each worker. All I need to do is add the following line to conf/spark-env.sh on the node where I submit the application: -XX:+UseConcMarkSweepGC Is that correct? Will this option be propagated to each worker in YARN? On Thu, Jul 17, 2014 at

Re: Spark Streaming Json file groupby function

2014-07-17 Thread srinivas
Hi TD, It Worked...Thank you so much for all your help. Thanks, -Srinivas. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Json-file-groupby-function-tp9618p10132.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: how to pass extra Java opts to workers for spark streaming jobs

2014-07-17 Thread Andrew Or
You will need to include that in the SPARK_JAVA_OPTS environment variable, so add the following line to spark-env.sh: export SPARK_JAVA_OPTS="-XX:+UseConcMarkSweepGC" This should propagate to the executors. (Though you should double check, since 0.9 is a little old and I could be forgetting

Last step of processing is using too much memory.

2014-07-17 Thread Roch Denis
Hello, I have an issue where my spark code is using too much memory in the final step (a count for testing purposes; it will write the result to a db when it works). I'm really not sure how I can break down the last step to use less RAM. So basically my data is log lines, and each log line

spark1.0.1 spark sql error java.lang.NoClassDefFoundError: Could not initialize class $line11.$read$

2014-07-17 Thread Victor Sheng
When I run a query on a hadoop file: mobile.registerAsTable("mobile") val count = sqlContext.sql("select count(1) from mobile") res5: org.apache.spark.sql.SchemaRDD = SchemaRDD[21] at RDD at SchemaRDD.scala:100 == Query Plan == ExistingRdd [data_date#0,mobile#1,create_time#2], MapPartitionsRDD[4] at
