RDD decouple store implementations

2014-11-27 Thread Guy Doulberg
Hi guys, I am playing with Spark, and I was wondering if there is a way to share an RDD across multiple implementations in a decoupled way, i.e. assuming I have an RDD that comes from a stream in Spark Streaming, I want to be able to store the same stream in two different S3 folders using two
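A minimal sketch of one decoupled approach: persist each batch RDD once inside foreachRDD, then write it to two independent S3 prefixes. The bucket and folder names are made up for illustration.

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream

def storeTwice(stream: DStream[String]): Unit = {
  stream.foreachRDD { (rdd, time) =>
    // Persist so the batch is computed once, not once per output.
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
    rdd.saveAsTextFile(s"s3n://my-bucket/folder-a/${time.milliseconds}")
    rdd.saveAsTextFile(s"s3n://my-bucket/folder-b/${time.milliseconds}")
    rdd.unpersist()
  }
}
```

Each writer only depends on the DStream, so the two destinations stay decoupled from each other.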

Re: RMSE in MovieLensALS increases or stays stable as iterations increase.

2014-11-27 Thread Sean Owen
Ah of course. Great explanation. So I suppose you should see desired results with lambda = 0, although you don't generally want to set this to 0. On Wed, Nov 26, 2014 at 7:53 PM, Xiangrui Meng men...@gmail.com wrote: The training RMSE may increase due to regularization. Squared loss only
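To see the effect Sean describes, the regularization parameter is the lambda argument of ALS.train; a sketch (parameter values are illustrative):

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

def trainWithoutRegularization(ratings: RDD[Rating]) = {
  val rank = 10
  val numIterations = 20
  // lambda = 0.0 removes the regularization term, so the training RMSE
  // should decrease with iterations -- useful for inspection only;
  // in practice you want a nonzero lambda to avoid overfitting.
  val lambda = 0.0
  ALS.train(ratings, rank, numIterations, lambda)
}
```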

Re: Unable to generate assembly jar which includes jdbc-thrift server

2014-11-27 Thread vdiwakar.malladi
Hi, I set up a Maven environment on a Linux machine and was able to build the POM file in the Spark home directory. Each module was refreshed with a corresponding target directory containing jar files. What do I need to do to include all the libraries on the classpath? Earlier, I used a single assembly jar file to

Spark 1.1.1 released but not available on maven repositories

2014-11-27 Thread Luis Ángel Vicente Sánchez
I have just read on the website that Spark 1.1.1 has been released, but when I upgraded my project to use 1.1.1 I discovered that the artefacts are not on Maven yet. [info] Resolving org.apache.spark#spark-streaming-kafka_2.10;1.1.1 ... [warn] module not found:

Re: RMSE in MovieLensALS increases or stays stable as iterations increase.

2014-11-27 Thread Kostas Kloudas
Thanks a lot for your time guys and your quick replies! On Nov 26, 2014, at 7:53 PM, Xiangrui Meng men...@gmail.com wrote: The training RMSE may increase due to regularization. Squared loss only represents part of the global loss. If you watch the sum of the squared loss and the

Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Gerard Maas
Hi TD, We also struggled with this error for a long while. The recurring scenario is when the job takes longer to compute than the job interval and a backlog starts to pile up. Hint: check the DStream storage level. If it is set to MEMORY_ONLY_SER and memory runs out, then you will get a 'Cannot
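One way to act on this hint is to pass an explicit storage level when creating the input DStream, so blocks spill to disk instead of being dropped when memory runs out. A sketch from the shell (where sc is predefined; host, port, and interval are illustrative):

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(60))
// MEMORY_AND_DISK_SER lets received blocks spill to disk rather than be
// evicted, which avoids the "Cannot compute split" failure under pressure.
val lines = ssc.socketTextStream("localhost", 9999,
  StorageLevel.MEMORY_AND_DISK_SER)
```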

Re: Accessing posterior probability of Naive Baye's prediction

2014-11-27 Thread jatinpreet
Hi, I have been running into some trouble while converting the code to Java. I have done the matrix operations as directed and tried to find the maximum score for each category, but the predicted category is mostly different from the prediction done by MLlib. I am fetching iterators of the

Exception while starting thrift server

2014-11-27 Thread vdiwakar.malladi
Hi, When I start the thrift server, I get the following exception. Could anyone help me with this? I placed hive-site.xml in the $SPARK_HOME/conf folder and set the property hive.metastore.sasl.enabled to 'false'. org.apache.hive.service.ServiceException: Unable to login to kerberos with

[graphx] failed to submit an application with java.lang.ClassNotFoundException

2014-11-27 Thread Yifan LI
Hi, I just tried to submit an application from graphx examples directory, but it failed: yifan2:bin yifanli$ MASTER=local[*] ./run-example graphx.PPR_hubs java.lang.ClassNotFoundException: org.apache.spark.examples.graphx.PPR_hubs at

Best way to do a lookup in Spark

2014-11-27 Thread Ashic Mahtab
Hi, I'm looking to do an iterative algorithm implementation with data coming in from Cassandra. This might be a use case for GraphX, however the ids are non-integral, and I would like to avoid a mapping (for now). I'm doing a simple hubs and authorities HITS implementation, and the current

Re: Accessing posterior probability of Naive Baye's prediction

2014-11-27 Thread Sean Owen
No, the feature vector is not converted. It contains count n_i of how often each term t_i occurs (or a TF-IDF transformation of those). You are finding the class c such that P(c) * P(t_1|c)^n_1 * ... is maximized. In log space it's log(P(c)) + n_1*log(P(t_1|c)) + ... So your n_1 counts (or
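The scoring rule Sean describes can be written out directly; a small self-contained sketch with made-up priors, conditional probabilities, and term counts:

```scala
// Two classes, two terms. log P(c) for each class:
val logPrior = Array(math.log(0.6), math.log(0.4))
// log P(t_i | c) for each class c and term i:
val logTheta = Array(
  Array(math.log(0.7), math.log(0.3)),
  Array(math.log(0.2), math.log(0.8)))
// Term counts n_i of the document being classified (not 0/1 indicators):
val counts = Array(3.0, 1.0)

// Score each class as log P(c) + sum_i n_i * log P(t_i | c)
val scores = logPrior.indices.map { c =>
  logPrior(c) + counts.indices.map(i => counts(i) * logTheta(c)(i)).sum
}
val predicted = scores.indexOf(scores.max)  // class with the highest score
```

The key point from the thread: the counts multiply the log-probabilities; they are not converted to a binary presence vector.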

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-11-27 Thread Jianshi Huang
Hi Hao, I'm using an inner join, as broadcast join didn't work for left joins (thanks for the links to the latest improvements). I'm using HiveContext, and it worked in a previous build (10/12) when joining 15 dimension tables. Jianshi On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao

Mesos killing Spark Driver

2014-11-27 Thread Gerard Maas
Hi, We are currently running our Spark + Spark Streaming jobs on Mesos, submitting our jobs through Marathon. We see with some regularity that the Spark Streaming driver gets killed by Mesos and then restarted on some other node by Marathon. I've no clue why Mesos is killing the driver and

Percentile

2014-11-27 Thread Franco Barrientos
Hi folks! Does anyone know how I can calculate the percentile of each element of a variable in an RDD? I tried to calculate it through Spark SQL with subqueries, but I think that is impossible in Spark SQL. Any idea will be welcome. Thanks in advance, Franco Barrientos Data Scientist Málaga
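One possible approach without Spark SQL: sort the RDD and derive each element's percentile from its rank. A sketch (assumes at least two elements; the helper name is made up):

```scala
import org.apache.spark.rdd.RDD

// Pair each value with its percentile in [0, 100], computed from its rank
// in the sorted order.
def percentiles(values: RDD[Double]): RDD[(Double, Double)] = {
  val n = values.count()
  values.sortBy(identity)
    .zipWithIndex()                          // (value, rank) with rank 0..n-1
    .map { case (v, rank) => (v, 100.0 * rank / (n - 1)) }
}
```

Note this gives ties distinct ranks; if equal values should share a percentile, group by value first.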

Using Breeze in the Scala Shell

2014-11-27 Thread Dean Jones
Hi, I'm trying to use the breeze library in the spark scala shell, but I'm running into the same problem documented here: http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-commons-math3-random-RandomGenerator-issue-td15748.html As I'm using the shell, I don't have a pom.xml, so the

Re: Using Breeze in the Scala Shell

2014-11-27 Thread Debasish Das
I have used breeze fine with scala shell: scala -cp ./target/spark-mllib_2.10-1.3.0-SNAPSHOT.
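Since the question was about the Spark shell specifically, an alternative to `scala -cp` is passing the jars at shell startup; a sketch (jar names and versions are illustrative, adjust to your installation):

```shell
# Add breeze (and its natives, if used) to both driver and executors.
./bin/spark-shell --jars breeze_2.10-0.10.jar,breeze-natives_2.10-0.10.jar
```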

ALS failure with size Integer.MAX_VALUE

2014-11-27 Thread Bharath Ravi Kumar
We're training a recommender with ALS in mllib 1.1 against a dataset of 150M users and 4.5K items, with the total number of training records being 1.2 Billion (~30GB data). The input data is spread across 1200 partitions on HDFS. For the training, rank=10, and we've configured {number of user data

Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Bill Jay
Gerard, That is a good observation. However, the strange thing I am seeing is that if I use MEMORY_AND_DISK_SER, the job fails even earlier. In my case, it takes 10 seconds to process the data of each batch, whose interval is one minute. It fails after 10 hours with the cannot compute split error. Bill On Thu,

Re: SchemaRDD.saveAsTable() when schema contains arrays and was loaded from a JSON file using schema auto-detection

2014-11-27 Thread Kelly, Jonathan
Yeah, only a few hours after I sent my message I saw some correspondence on this other thread: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-complex-types-like-map-lt-string-map-lt-string-int-gt-gt-in-spark-sql-td19603.html, which is the exact same issue. Glad to find that

Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Tathagata Das
If it regularly fails after 8 hours then could you get me the log4j logs? To limit the size, set default log level to Warn and the level of logs for all classes in package o.a.s.streaming to Debug. Then I can take a look. On Nov 27, 2014 11:01 AM, Bill Jay bill.jaypeter...@gmail.com wrote:
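TD's suggestion corresponds to a conf/log4j.properties along these lines (a sketch; the console appender comes from Spark's bundled template):

```properties
# Default to WARN to keep the log small...
log4j.rootCategory=WARN, console
# ...but capture full detail from the streaming package.
log4j.logger.org.apache.spark.streaming=DEBUG
```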

Re: Is Spark? or GraphX runs fast? a performance comparison on Page Rank

2014-11-27 Thread Harihar Nahak
Thanks Ankur, it's really helpful. I have a few queries on optimization techniques. Currently I used the RandomVertexCut partition strategy, but what partition strategy should be used if: 1. the number of edges in the edgeList file is very large, like 50,000,000, and there are many multiple edges to the same pair of vertices 2. No

Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Harihar Nahak
When new data comes in on a stream, Spark uses the streaming classes to convert it into an RDD, and as you mention this is followed by transformations and finally an action. As far as I have experienced, all RDDs remain in memory until the user destroys them or as long as the application is alive. On 26 November 2014 at

SVD Plus Plus in GraphX

2014-11-27 Thread Deep Pradhan
Hi, I was just going through two codes in GraphX, namely SVDPlusPlus and TriangleCount. In the first I see an RDD as an input to run, i.e. run(edges: RDD[Edge[Double]],...), and in the other I see run(VD:..., ED:...). Can anyone explain the difference between these two to me? In fact, SVDPlusPlus is the

RE: Auto BroadcastJoin optimization failed in latest Spark

2014-11-27 Thread Cheng, Hao
Hi Jianshi, I couldn't reproduce that with the latest MASTER, and I can always get the BroadcastHashJoin for managed tables (in .csv files) in my testing; are there any external tables in your case? In general, there are probably a couple of things you can try first (with HiveContext): 1) ANALYZE TABLE xxx
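For reference, a sketch of collecting table statistics so the planner can consider a broadcast join for small tables (table and column names are made up):

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// Collect size statistics; the planner broadcasts tables below the
// autoBroadcastJoinThreshold when their size is known.
hiveContext.sql("ANALYZE TABLE small_dim COMPUTE STATISTICS noscan")
hiveContext.sql("SELECT * FROM facts f JOIN small_dim d ON f.dim_id = d.id")
```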

Creating a SchemaRDD from an existing API

2014-11-27 Thread Niranda Perera
Hi, I am evaluating Spark for an analytic component where we do batch processing of data using SQL. So, I am particularly interested in Spark SQL and in creating a SchemaRDD from an existing API [1]. This API exposes elements in a database as datasources. Using the methods allowed by this data

Re: read both local path and HDFS path

2014-11-27 Thread Prannoy
Hi, The configuration you provide is just to access HDFS when you give an HDFS path. When you provide an HDFS path with the HDFS nameservice, like hmaster155:9000 in your case, it goes into HDFS to look for the file. For accessing a local file, just give the local path of the file. Go to the
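Explicit URI schemes make the intent unambiguous regardless of the default filesystem in the Hadoop configuration; a sketch (file paths are made up, the nameservice is the one from the thread):

```scala
// Read from the local filesystem of each worker node:
val localLines = sc.textFile("file:///home/user/data/input.txt")
// Read from HDFS via the nameservice:
val hdfsLines  = sc.textFile("hdfs://hmaster155:9000/user/data/input.txt")
```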

Unable to compile spark 1.1.0 on windows 8.1

2014-11-27 Thread Ishwardeep Singh
Hi, I am trying to compile Spark 1.1.0 on Windows 8.1, but I get the following exception. [info] Compiling 3 Scala sources to D:\myworkplace\software\spark-1.1.0\project\target\scala-2.10\sbt0.13\classes... [error] D:\myworkplace\software\spark-1.1.0\project\SparkBuild.scala:26: object sbt is