Hi,
I am trying to compile Spark 1.1.0 on Windows 8.1, but I get the following
exception.
[info] Compiling 3 Scala sources to
D:\myworkplace\software\spark-1.1.0\project\target\scala-2.10\sbt0.13\classes...
[error] D:\myworkplace\software\spark-1.1.0\project\SparkBuild.scala:26:
object sbt is not
Hi,
The configuration you provided is only used to access HDFS when you give an
HDFS path. When you provide an HDFS path with the HDFS nameservice, like
hmaster155:9000 in your case, it goes into HDFS to look for the file. For
accessing a local file, just give the local path of the file. Go to the
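For illustration, a minimal sketch of the two kinds of paths (the file and folder names here are made up):

val local = sc.textFile("file:///tmp/input.txt") // read straight from the local filesystem
val onHdfs = sc.textFile("hdfs://hmaster155:9000/user/input") // resolved through the HDFS namenode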
Hi Franco,
The Hive percentile UDAF has been added in the master branch. You can have a
look at it. I think it would work like "select percentile(col_name, 1) from
sigmoid_logs".
Thanks
Best Regards
On Thu, Nov 27, 2014 at 8:58 PM, Franco Barrientos <
franco.barrien...@exalitica.com> wrote:
> Hi folks!,
>
Hi,
I am evaluating Spark for an analytic component where we do batch
processing of data using SQL.
So, I am particularly interested in Spark SQL and in creating a SchemaRDD
from an existing API [1].
This API exposes elements in a database as datasources. Using the methods
allowed by this data s
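In case it helps, a minimal sketch of building a SchemaRDD programmatically; rowsFromApi below is a hypothetical stand-in for rows pulled from such an external API, and the column names are invented:

import org.apache.spark.sql._
val sqlContext = new SQLContext(sc)
// Describe the columns the datasource exposes
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))
// Stand-in for data coming out of the external API
val rowsFromApi = sc.parallelize(Seq((1, "a"), (2, "b")))
val rowRDD = rowsFromApi.map { case (id, name) => Row(id, name) }
val schemaRDD = sqlContext.applySchema(rowRDD, schema)
schemaRDD.registerTempTable("elements")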
Hi Jianshi,
I couldn't reproduce that with the latest master, and I always get the
BroadcastHashJoin for managed tables (in .csv files) in my testing. Are there
any external tables in your case?
In general, a couple of things you can try first (with HiveContext; a rough
sketch follows below):
1) ANALYZE TABLE xxx
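Something along these lines (a sketch; "xxx" stands for the actual table name and hiveContext for your HiveContext instance):

// Collect rough table sizes so the planner can pick a broadcast join for the small side
hiveContext.sql("ANALYZE TABLE xxx COMPUTE STATISTICS noscan")
// Optionally raise the size (in bytes) below which a table is broadcast
hiveContext.sql("SET spark.sql.autoBroadcastJoinThreshold=104857600")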
Hi,
I was just going through two pieces of code in GraphX, namely SVDPlusPlus and
TriangleCount. In the first I see an RDD as an input to run, i.e. run(edges:
RDD[Edge[Double]],...), and in the other I see run(VD:..., ED:...).
Can anyone explain to me the difference between these two? In fact, SVDPlusPlus
is the
When new data comes in on a stream, Spark uses its streaming classes to
convert it into RDDs, and as you mention that is followed by transformations and
finally an action. As far as I have experienced, all the RDDs remain in memory
until the user destroys them or for as long as the application is alive.
On 26 November 2014 at 20:05
Thanks Ankur, it's really helpful. I have a few queries on optimization
techniques (a small sketch of the relevant calls follows below). For the
current run I used the RandomVertexCut partitioning.
But what partitioning should be used if:
1. The number of edges in the edgeList file is very large, like 50,000,000,
and there are many parallel edges between the same pair of vertices
2. No
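Not an authoritative answer, but an illustrative sketch of the knobs involved (the path is a placeholder; GraphLoader gives each edge an attribute of 1):

import org.apache.spark.graphx._
val graph = GraphLoader.edgeListFile(sc, "hdfs:///edgeList.txt")
// EdgePartition2D tends to bound vertex replication better on very large edge lists
val partitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D)
// Collapse the many parallel edges between the same vertex pair, keeping a count of them
val deduped = partitioned.groupEdges((a, b) => a + b)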
If it regularly fails after 8 hours, could you get me the log4j logs?
To limit their size, set the default log level to WARN and the level of logs for
all classes in the package o.a.s.streaming to DEBUG. Then I can take a look.
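If editing log4j.properties is awkward, the same can be done programmatically at the top of the driver (a sketch using the log4j API directly):

import org.apache.log4j.{Level, Logger}
Logger.getRootLogger.setLevel(Level.WARN) // keep everything else quiet
Logger.getLogger("org.apache.spark.streaming").setLevel(Level.DEBUG) // but keep streaming verbose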
On Nov 27, 2014 11:01 AM, "Bill Jay" wrote:
> Gerard,
>
> That is a good ob
Yeah, only a few hours after I sent my message I saw some correspondence on
this other thread:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-complex-types-like-map-lt-string-map-lt-string-int-gt-gt-in-spark-sql-td19603.html,
which is the exact same issue. Glad to find that t
Gerard,
That is a good observation. However, the strange thing I see is that if I use
"MEMORY_AND_DISK_SER", the job fails even earlier. In my case, it takes 10
seconds to process the data of each batch, whose interval is one minute. It
fails after 10 hours with the "cannot compute split" error.
Bill
On Thu,
We're training a recommender with ALS in MLlib 1.1 against a dataset of
150M users and 4.5K items, with the total number of training records being
1.2 billion (~30GB of data). The input data is spread across 1200 partitions
on HDFS. For the training, rank=10, and we've configured {number of user
data
I have used Breeze fine with the Scala shell:
scala -cp ./target/spark-mllib_2.10-1.3.0-SNAPSHOT.jar:/Users/v606014/.m2/repository/com/github/fommil/netlib/core/1.1.2/core-1.1.2.jar:/Users/v606014/.m2/repository/org/jblas/jblas/1.2.3/jblas-1.2.3.jar:/Users/v606014/.m2/repository/org/scalanlp/breeze_2
Hi,
I'm trying to use the Breeze library in the Spark Scala shell, but I'm
running into the same problem documented here:
http://apache-spark-user-list.1001560.n3.nabble.com/org-apache-commons-math3-random-RandomGenerator-issue-td15748.html
As I'm using the shell, I don't have a pom.xml, so the s
Hi folks!,
Anyone know how I can calculate, for each element of a variable in an RDD,
its percentile? I tried to calculate it through Spark SQL with subqueries, but I
think that is impossible in Spark SQL. Any idea will be welcome.
Thanks in advance,
Franco Barrientos
Data Scientist
Málaga #1
Hi,
We are currently running our Spark + Spark Streaming jobs on Mesos,
submitting our jobs through Marathon.
We see with some regularity that the Spark Streaming driver gets killed by
Mesos and then restarted on some other node by Marathon.
I've no clue why Mesos is killing the driver and lookin
Hi Hao,
I'm using an inner join since the broadcast join didn't work for left joins (thanks
for the links to the latest improvements).
And I'm using HiveContext, and it worked in a previous build (10/12) when
joining 15 dimension tables.
Jianshi
On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao wrote:
> Are
No, the feature vector is not converted. It contains a count n_i of how
often each term t_i occurs (or a TF-IDF transformation of those). You
are finding the class c such that P(c) * P(t_1|c)^n_1 * ... is
maximized.
In log space it's log(P(c)) + n_1*log(P(t_1|c)) + ...
So your n_1 counts (or TF-IDF
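As a tiny self-contained illustration of that arithmetic (plain Scala, not the MLlib internals):

// logPrior(c) = log(P(c)); logTheta(c)(i) = log(P(t_i | c)); counts(i) = n_i for the document
def predict(logPrior: Array[Double], logTheta: Array[Array[Double]], counts: Array[Double]): Int = {
  val scores = logPrior.indices.map { c =>
    logPrior(c) + counts.indices.map(i => counts(i) * logTheta(c)(i)).sum
  }
  scores.indexOf(scores.max) // class with the largest log-space score
}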
Running with lambda=0 fails the ALS code since the matrices no longer stay
positive definite and the Cholesky factorization fails...
Run with a very low lambda (I tested with 1e-4) and you should see the
decrease in RMSE as you expect...
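Something like this (a sketch; ratings is your RDD[Rating], and the rank/iteration values are whatever you were already using):

import org.apache.spark.mllib.recommendation.{ALS, Rating}
// A tiny but non-zero lambda keeps the normal equations well conditioned
val model = ALS.train(ratings, 10, 20, 1e-4) // rank = 10, iterations = 20, lambda = 1e-4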
On Thu, Nov 27, 2014 at 3:04 AM, Kostas Kloudas wrote:
> Thanks a lot for your
Hi,
When I run the program below, I see two files in HDFS because the
number of partitions is 2. But one of the files is empty. Why is that? Is
the work not distributed equally across all the tasks?
textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
Hi,
I'm looking to do an iterative algorithm implementation with data coming in
from Cassandra. This might be a use case for GraphX; however, the IDs are
non-integral, and I would like to avoid a mapping (for now). I'm doing a simple
hubs and authorities HITS implementation, and the current imple
Hi,
I just tried to submit an application from graphx examples directory, but it
failed:
yifan2:bin yifanli$ MASTER=local[*] ./run-example graphx.PPR_hubs
java.lang.ClassNotFoundException: org.apache.spark.examples.graphx.PPR_hubs
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
Hi,
When I start the Thrift server, I get the following exception. Could
anyone help me with this?
I placed hive-site.xml in the $SPARK_HOME/conf folder and the property
hive.metastore.sasl.enabled is set to 'false'.
org.apache.hive.service.ServiceException: Unable to login to kerberos with
g
Hi,
I have been running into some trouble while converting the code to Java.
I have done the matrix operations as directed and tried to find the maximum
score for each category. But the predicted category is mostly different from
the prediction done by MLlib.
I am fetching iterators of the pi
Hi TD,
We also struggled with this error for a long while. The recurring scenario
is when the job takes longer to compute than the job interval and a backlog
starts to pile up.
Hint: Check
If the DStream storage level is set to "MEMORY_ONLY_SER" and memory runs
out, then you will get a 'Cannot c
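For what it's worth, the storage level can be switched per stream, e.g. (a sketch; lines stands for whichever DStream is being persisted):

import org.apache.spark.storage.StorageLevel
// Spill serialized blocks to disk rather than dropping them when executor memory fills up
lines.persist(StorageLevel.MEMORY_AND_DISK_SER)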
Actually all components are merged into the assembly jar under
assembly/target/scala-2.10. You don’t need to configure the classpath
unless you need to include some customized jars (e.g. customized Hive
SerDes). Existing scripts can compute the classpath correctly (if they
can’t, it should be a
Thanks a lot for your time guys and your quick replies!
> On Nov 26, 2014, at 7:53 PM, Xiangrui Meng wrote:
>
> The training RMSE may increase due to regularization. Squared loss
> only represents part of the global loss. If you watch the sum of the
> squared loss and the regularization, it shou
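In other words, the quantity being minimized is roughly (a sketch, with \lambda the regularization parameter and \Omega the observed ratings):

\min_{U,V} \sum_{(i,j)\in\Omega} \left(r_{ij} - u_i^\top v_j\right)^2 + \lambda\Big(\sum_i \|u_i\|^2 + \sum_j \|v_j\|^2\Big)

so the squared-error part on its own can rise while this total still falls.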
I have just read on the website that Spark 1.1.1 has been released, but when
I upgraded my project to use 1.1.1 I discovered that the artifacts are not
on Maven yet.
[info] Resolving org.apache.spark#spark-streaming-kafka_2.10;1.1.1 ...
>
> [warn] module not found: org.apache.spark#spark-streaming-
Hi,
I set up a Maven environment on a Linux machine and was able to build the POM
file in the Spark home directory. Each module was refreshed with a corresponding
target directory containing jar files.
What do I need to do in order to include all the libraries on the classpath?
Earlier, I used a single assembly jar file to inc
Ah of course. Great explanation. So I suppose you should see desired
results with lambda = 0, although you don't generally want to set this
to 0.
On Wed, Nov 26, 2014 at 7:53 PM, Xiangrui Meng wrote:
> The training RMSE may increase due to regularization. Squared loss
> only represents part of th
Hi guys
I am playing with Spark, and I was wondering if there is a way to share an RDD
across multiple implementations in a decoupled way,
i.e. assuming I have an RDD that comes from a stream in Spark Streaming, I want to
be able to store the same stream in two different S3 folders using two
differe
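One decoupled-ish sketch, assuming the goal is simply to materialize each batch in two places (stream stands for the DStream; bucket and folder names are placeholders):

stream.foreachRDD { (rdd, time) =>
  rdd.cache() // compute the batch once, write it twice
  rdd.saveAsTextFile(s"s3n://bucket/folderA/batch-${time.milliseconds}")
  rdd.saveAsTextFile(s"s3n://bucket/folderB/batch-${time.milliseconds}")
  rdd.unpersist()
}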