RE: Spark cluster set up on EC2 customization

2015-02-26 Thread Sameer Tilak
On Thu, Feb 26, 2015 at 7:08 AM, Sameer Tilak wrote: Hi, I was looking at the documentation for deploying a Spark cluster on EC2. http://spark.apache.org/docs/latest/ec2-scripts.html We are using Pig to build the data pipeline and then use MLLib for analytics. I was wondering if someone has

Spark cluster set up on EC2 customization

2015-02-25 Thread Sameer Tilak
Hi, I was looking at the documentation for deploying a Spark cluster on EC2. http://spark.apache.org/docs/latest/ec2-scripts.html We are using Pig to build the data pipeline and then use MLLib for analytics. I was wondering if someone has any experience to include additional tools/services such

RE: Interpreting MLLib's linear regression o/p

2014-12-22 Thread Sameer Tilak
ssti...@live.com > CC: user@spark.apache.org > > Did you check the indices in the LIBSVM data and the master file? Do > they match? -Xiangrui > > On Sat, Dec 20, 2014 at 8:13 AM, Sameer Tilak wrote: > > Hi All, > > I use LIBSVM format to specify my input feature vector,

Interpreting MLLib's linear regression o/p

2014-12-20 Thread Sameer Tilak
Hi All, I use LIBSVM format to specify my input feature vector, which uses 1-based indices. When I run regression the output is 0-indexed. I have a master lookup file that maps these indices back to what they stand for. However, I need to add an offset of 2 and not 1 to the regression outcome during
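
For reference, a minimal sketch of the usual index bookkeeping, assuming a trained MLlib 1.x model named model in spark-shell: weight i in the 0-indexed weights vector corresponds to the 1-based LIBSVM feature i + 1.

    // model is a trained LinearRegressionModel; its weights are 0-indexed.
    val indexedWeights = model.weights.toArray.zipWithIndex.map {
      case (w, i) => (i + 1, w) // map back to the 1-based LIBSVM index
    }
    indexedWeights.take(5).foreach(println)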

MLLib: Saving and loading a model

2014-12-15 Thread Sameer Tilak
Hi All, Resending this: I am using LinearRegressionWithSGD and then I save the model weights and intercept. The file that contains the weights has this format: 1.20455 0.1356 0.000456 .. The intercept is 0 since I am using train without setting the intercept, so it can be ignored for the moment. I would now like
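
A minimal sketch of one way to rebuild the model from such a file, assuming a Spark 1.x spark-shell session; the file path and one-line whitespace-separated layout are illustrative, and the public LinearRegressionModel constructor is used:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LinearRegressionModel

    // Read the saved weights back from a local file (illustrative path).
    val savedWeights = scala.io.Source.fromFile("/tmp/weights.txt")
      .mkString.trim.split("\\s+").map(_.toDouble)
    // The intercept was 0 in the original training run.
    val restored = new LinearRegressionModel(Vectors.dense(savedWeights), 0.0)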

MLlib: Libsvm: Loss was due to java.lang.ArrayIndexOutOfBoundsException

2014-12-10 Thread Sameer Tilak
Hi All, When I am running LinearRegressionWithSGD, I get the following error. Any help on how to debug this further will be highly appreciated. 14/12/10 20:26:02 WARN TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException java.lang.ArrayIndexOutOfBoundsException: 150323 at breez

MLLIb: Linear regression: Loss was due to java.lang.ArrayIndexOutOfBoundsException

2014-12-08 Thread Sameer Tilak
Hi All, I was able to run LinearRegressionWithSGD for a larger dataset (> 2GB sparse). I have now filtered the data and I am running regression on a subset of it (~200 MB). I see this error, which is strange since it was running fine with the superset data. Is this a formatting issue

MLLib: loading saved model

2014-12-03 Thread Sameer Tilak
Hi All, I am using LinearRegressionWithSGD and then I save the model weights and intercept. The file that contains the weights has this format: 1.20455 0.1356 0.000456 .. The intercept is 0 since I am using train without setting the intercept, so it can be ignored for the moment. I would now like to initialize a

RE: Model characterization

2014-11-04 Thread Sameer Tilak
Excellent, many thanks. Really appreciate your help. Sent via the Samsung GALAXY S®4, an AT&T 4G LTE smartphone Original message From: Xiangrui Meng Date:11/03/2014 9:04 PM (GMT-08:00) To: Sameer Tilak Cc: user@spark.apache.org Subject: Re: Model characterization

Model characterization

2014-11-03 Thread Sameer Tilak
Hi All, I have been using the LinearRegression model of MLLib and am very pleased with its scalability and robustness. Right now, we are just calculating the MSE of our model. We would like to characterize the performance of our model. I was wondering about adding support for computing things such as Confidence

LinearRegression and model prediction threshold

2014-10-31 Thread Sameer Tilak
Hi All, I am using LinearRegression and have a question about the details of the model.predict method. Basically it is predicting variable y given an input vector x. However, can someone point me to the documentation about the threshold used in the predict method? Can that be changed? I am

MLLib: libsvm - default value initialization

2014-10-29 Thread Sameer Tilak
Hi All, I have my sparse data in libsvm format. val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt") I am running Linear regression. Let us say that my data has the following entry: 1 1:0 4:1 I think it will assume 0 for indices 2 and 3, right? I would li
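
A minimal sketch to check that behavior in spark-shell, assuming an MLlib 1.x session; unlisted indices come back as implicit zeros in the resulting SparseVector:

    import org.apache.spark.mllib.util.MLUtils

    val examples = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
    val first = examples.first()
    // A SparseVector stores only the listed indices; apply returns 0.0 for the rest.
    println(first.features(1)) // 0.0 when 1-based index 2 was not listed in the file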

RE: MLLib libsvm format

2014-10-21 Thread Sameer Tilak
Great, I will sort them. Sent via the Samsung GALAXY S®4, an AT&T 4G LTE smartphone Original message From: Xiangrui Meng Date:10/21/2014 3:29 PM (GMT-08:00) To: Sameer Tilak Cc: user@spark.apache.org Subject: Re: MLLib libsvm format Yes. "where the indice

MLLib libsvm format

2014-10-21 Thread Sameer Tilak
Hi All, I have a question regarding the ordering of indices. The document says that the indices are one-based and in ascending order. However, do the indices within a row need to be sorted in ascending order? Sparse data: It is very common in practice to have sparse training data. MLlib s
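
A minimal sketch of sorting each row's index:value pairs before handing the file to MLlib, assuming a spark-shell session; the paths are illustrative:

    // Sort the index:value pairs of each LIBSVM line ascending by index.
    val sorted = sc.textFile("hdfs:///data/input.libsvm").map { line =>
      val parts = line.split(" ")
      val label = parts.head
      val pairs = parts.tail
        .map { kv => val Array(i, v) = kv.split(":"); (i.toInt, v) }
        .sortBy(_._1)
        .map { case (i, v) => s"$i:$v" }
      (label +: pairs).mkString(" ")
    }
    sorted.saveAsTextFile("hdfs:///data/input-sorted")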

RE: MLLib Linear regression

2014-10-08 Thread Sameer Tilak
you test that > combination? Are there any linear dependency between A's columns and > D's columns? -Xiangrui > > On Tue, Oct 7, 2014 at 1:56 PM, Sameer Tilak wrote: > > BTW, one detail: > > > > When number of iterations is 100 all weights are zero or below

RE: MLLib Linear regression

2014-10-07 Thread Sameer Tilak
BTW, one detail: When the number of iterations is 100, all weights are zero or below and the indices are only from set A. When the number of iterations is 150, I see 30+ non-zero weights (when sorted by weight) and the indices are distributed across all sets. However MSE is high (5.xxx) and the result does no

MLLib Linear regression

2014-10-07 Thread Sameer Tilak
Hi All, I have the following classes of features: class A: 15000 features, class B: 170 features, class C: 900 features, class D: 6000 features. I use linear regression (over sparse data). I get excellent results with low RMSE (~0.06) for the following combinations of classes: 1. A + B + C 2. B + C + D 3. A

RE: MLLib: Missing value imputation

2014-10-01 Thread Sameer Tilak
no one is committed to work on this feature. For now, you can filter out examples containing missing values and use the rest for training. -Xiangrui On Tue, Sep 30, 2014 at 11:26 AM, Sameer Tilak wrote: > Hi All, > Can someone please point me to the documentation that describes how missing value
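
A minimal sketch of that workaround, assuming examples is an RDD[LabeledPoint] and missing values are encoded as NaN (an assumption):

    // Drop any example whose feature vector contains a NaN placeholder,
    // then train on what remains.
    val cleaned = examples.filter(p => !p.features.toArray.exists(_.isNaN))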

MLLib: Missing value imputation

2014-09-30 Thread Sameer Tilak
Hi All, Can someone please point me to the documentation that describes how missing value imputation is done in MLLib. Also, any information on how this fits into the overall roadmap would be great.

RE: MLUtils.loadLibSVMFile error

2014-09-25 Thread Sameer Tilak
features may create a large number of temp objects, which may also cause GC to happen. Hope this helps! Liquan On Wed, Sep 24, 2014 at 9:50 PM, Sameer Tilak wrote: Hi All, I was able to solve this formatting issue. However, I have another question. When I do the following, val examples: RDD

RE: MLUtils.loadLibSVMFile error

2014-09-24 Thread Sameer Tilak
n't say > anything more. > > On Wed, Sep 24, 2014 at 11:02 PM, Sameer Tilak wrote: > > Hi All, > > > > > > When I try to load dataset using MLUtils.loadLibSVMFile, I have the > > following problem. Any help will be greatly appreciated! > > >

MLUtils.loadLibSVMFile error

2014-09-24 Thread Sameer Tilak
Hi All, When I try to load a dataset using MLUtils.loadLibSVMFile, I have the following problem. Any help will be greatly appreciated! Code snippet: import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.util.MLUtils import org.apache.spark.rdd.RDD import org.ap

MLLib regression model weights

2014-09-18 Thread Sameer Tilak
Hi All, I am able to run LinearRegressionWithSGD on a small sample dataset (~60MB Libsvm file of sparse data) with 6700 features. val model = LinearRegressionWithSGD.train(examples, numIterations) At the end I get a model: model.weights.size res6: Int = 6699 I am assuming each entry in the mo

RE: MLLib: LIBSVM issue

2014-09-18 Thread Sameer Tilak
lace(")", "") } On Wed, Sep 17, 2014 at 9:11 PM, Burak Yavuz wrote: Hi, The spacing between the inputs should be a single space, not a tab. I feel like your inputs have tabs between them instead of a single space. Therefore the parser cannot parse the input. B

RE: MLLib: LIBSVM issue

2014-09-18 Thread Sameer Tilak
> The spacing between the inputs should be a single space, not a tab. I feel > like your inputs have tabs between them instead of a single space. Therefore > the parser > cannot parse the input. > > Best, > Burak > > - Original Message - > From: "S
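
A minimal sketch of the suggested fix, assuming the data lives at an illustrative HDFS path: rewrite tabs (and any repeated whitespace) as single spaces before loading:

    // Normalize whitespace so MLUtils.loadLibSVMFile can parse each line.
    val fixed = sc.textFile("hdfs:///data/input.libsvm")
      .map(_.replaceAll("\\s+", " ").trim)
    fixed.saveAsTextFile("hdfs:///data/input-fixed")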

MLLib: LIBSVM issue

2014-09-17 Thread Sameer Tilak
Hi All, We have a fairly large amount of sparse data. I was following these instructions in the manual: Sparse data: It is very common in practice to have sparse training data. MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and L

RDD projection and sorting

2014-09-16 Thread Sameer Tilak
Hi All, I have data in the following format: the 1st column is userid and the second column onward are class ids for various products. I want to save this in Libsvm format, and an intermediate step is to sort the class ids (in ascending order). For example: I/P uid1 12433580 2670122
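
A minimal sketch of that projection, assuming whitespace-separated input rows of the form "uid classId classId ..." at an illustrative path:

    // Sort each user's class ids ascending and emit a LIBSVM-style line.
    val libsvmLines = sc.textFile("hdfs:///data/users.tsv").map { line =>
      val fields = line.split("\\s+")
      val uid = fields.head
      val sortedIds = fields.tail.map(_.toLong).sorted
      uid + " " + sortedIds.map(id => s"$id:1").mkString(" ")
    }
    libsvmLines.saveAsTextFile("hdfs:///data/users-libsvm")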

MLLib sparse vector

2014-09-15 Thread Sameer Tilak
Hi All, I have transformed the data into the following format: the first column is the user id, and all the other columns are class ids. For a user, only the class ids that appear in this row have value 1 and the others are 0. I need to create a sparse vector from this. Does the API for creating a sparse vector
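
A minimal sketch with MLlib's sparse vector factory, assuming Spark 1.x; the size and ids are illustrative:

    import org.apache.spark.mllib.linalg.Vectors

    val numClasses = 10000              // assumed total number of class ids
    val activeIds = Array(12, 433, 580) // class ids present for this user, sorted
    // Vectors.sparse takes the vector size, sorted indices, and matching values.
    val v = Vectors.sparse(numClasses, activeIds, Array.fill(activeIds.length)(1.0))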

RE: MLLib decision tree: Weights

2014-09-03 Thread Sameer Tilak
> become 1. -Xiangrui > > On Tue, Sep 2, 2014 at 1:05 PM, Sameer Tilak wrote: > > Hi everyone, > > > > > > We are looking to apply a weight to each training example; this weight > > should be used when computing the penalty of a misclassified example. For

MLLib decision tree: Weights

2014-09-02 Thread Sameer Tilak
Hi everyone, We are looking to apply a weight to each training example; this weight should be used when computing the penalty of a misclassified example. For instance, without weighting, each example is penalized 1 point when evaluating the model of a classifier, such as a decision tree

MLBase status

2014-08-27 Thread Sameer Tilak
Hi All, I was wondering if someone could please tell me the status of MLbase and its roadmap in terms of software releases. We are very interested in exploring it for our applications.

RE: Execution time increasing with increase of cluster size

2014-08-27 Thread Sameer Tilak
Can you tell which nodes were doing the computation in each case? Date: Wed, 27 Aug 2014 20:29:38 +0530 Subject: Execution time increasing with increase of cluster size From: sarathchandra.jos...@algofusiontech.com To: user@spark.apache.org Hi, I've written a simple scala program which reads a fi

RE: Amplab: big-data-benchmark

2014-08-27 Thread Sameer Tilak
kings in lower case after the format and size you want them > in. > They should be there. > > Best, > Burak > > - Original Message - > From: "Sameer Tilak" > To: user@spark.apache.org > Sent: Wednesday, August 27, 2014 11:42:28 AM > Subject:

Amplab: big-data-benchmark

2014-08-27 Thread Sameer Tilak
Hi All, I am planning to run the amplab benchmark suite to evaluate the performance of our cluster. I looked at: https://amplab.cs.berkeley.edu/benchmark/ and it mentions data availability at: s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix] where /tiny/, /1n

MLlib: issue with increasing maximum depth of the decision tree

2014-08-21 Thread Sameer Tilak
Resending this: Hi All, My dataset is fairly small -- a CSV file with around half a million rows and 600 features. Everything works when I set the maximum depth of the decision tree to 5 or 6. However, I get this error for larger values of that parameter -- for example, when I set it to 10.

RE: Decision tree: categorical variables

2014-08-20 Thread Sameer Tilak
Take a look at the docs of DecisionTree.trainClassifier and > trainRegressor: > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L360 > > -Xiangrui > > On Tue, Aug 19, 2014 at 4:24 PM, Sameer Tilak wrote: > > H

FW: Decision tree: categorical variables

2014-08-20 Thread Sameer Tilak
cala#L360 > > -Xiangrui > > On Tue, Aug 19, 2014 at 4:24 PM, Sameer Tilak wrote: > > Hi All, > > > > Is there any example of MLlib decision tree handling categorical variables? > > My dataset includes few categorical variables (20 out of 100 features) so >

RE: How to set KryoRegistrator class in spark-shell

2014-08-20 Thread Sameer Tilak
Hi Wang, Have you tried doing this in your application? conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") conf.set("spark.kryo.registrator", "yourpackage.MyKryoRegistrator") You then don't need to specify it via the command line. Date: Wed, 20 Aug 2014 12:25:14 -0
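
A minimal sketch of the registrator class those settings point at, assuming the Spark 1.x Kryo API; the class and record names are illustrative, and the class should live in the package named by spark.kryo.registrator:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    // Placeholder domain class; register whatever you actually serialize.
    case class MyRecord(id: Int, text: String)

    class MyKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[MyRecord])
      }
    }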

MLlib: issue with increasing maximum depth of the decision tree

2014-08-20 Thread Sameer Tilak
Hi All, My dataset is fairly small -- a CSV file with around half a million rows and 600 features. Everything works when I set the maximum depth of the decision tree to 5 or 6. However, I get this error for larger values of that parameter -- for example when I set it to 10. Have others encountered a s

Decision tree: categorical variables

2014-08-19 Thread Sameer Tilak
Hi All, Is there any example of an MLlib decision tree handling categorical variables? My dataset includes a few categorical variables (20 out of 100 features), so I was interested in knowing how I can use the current version of the decision tree implementation to handle this situation. I looked at the Labe
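
A minimal sketch along the lines of the DecisionTree.trainClassifier API pointed to above, assuming Spark 1.1+ and an RDD[LabeledPoint] named parsedData; the category counts are illustrative:

    import org.apache.spark.mllib.tree.DecisionTree

    val numClasses = 2
    // Feature 0 has 4 categories, feature 3 has 2 (illustrative values);
    // all other features are treated as continuous.
    val categoricalFeaturesInfo = Map(0 -> 4, 3 -> 2)
    val impurity = "gini"
    val maxDepth = 5
    val maxBins = 32
    val model = DecisionTree.trainClassifier(parsedData, numClasses,
      categoricalFeaturesInfo, impurity, maxDepth, maxBins)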

RE: spark kryo serilizable exception

2014-08-18 Thread Sameer Tilak
Hi, I was able to set this parameter in my application to resolve this issue: set("spark.kryoserializer.buffer.mb", "256") Please let me know if this helps. Date: Mon, 18 Aug 2014 21:50:02 +0800 From: dujinh...@hzduozhun.com To: user@spark.apache.org Subject: spark kryo serilizable exception
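
In context, a minimal sketch of where that setting lives, assuming a Spark 1.x application (spark.kryoserializer.buffer.mb was the 1.x-era property name); the app name is illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("KryoBufferExample")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.mb", "256") // larger buffer for big objects
    val sc = new SparkContext(conf)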

mlib model viewing and saving

2014-08-15 Thread Sameer Tilak
Hi All, I have an MLlib model: val model = DecisionTree.train(parsedData, Regression, Variance, maxDepth) I see the model has the following methods: algo, asInstanceOf, isInstanceOf, predict, toString, topNode. model.topNode outputs: org.apache.spark.mllib.tree.model.Node = id = 0, isLea

Mlib model: viewing and saving

2014-08-14 Thread Sameer Tilak
I have an MLlib model: val model = DecisionTree.train(parsedData, Regression, Variance, maxDepth) I see the model has the following methods: algo, asInstanceOf, isInstanceOf, predict, toString, topNode. model.topNode outputs: org.apache.spark.mllib.tree.model.Node = id = 0, isLeaf = fal
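
A minimal sketch for viewing the whole tree rather than just the root, assuming the Spark 1.x Node API with optional left/right children:

    import org.apache.spark.mllib.tree.model.Node

    // Walk the learned tree from the root, printing each node with indentation.
    def printTree(node: Node, indent: String = ""): Unit = {
      println(indent + node)
      node.leftNode.foreach(printTree(_, indent + "  "))
      node.rightNode.foreach(printTree(_, indent + "  "))
    }
    printTree(model.topNode)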

RE: java.lang.UnknownError: no bin was found for continuous variable.

2014-08-14 Thread Sameer Tilak
...@gmail.com To: ssti...@live.com Can you supply the detailed code and data you used? From the log, it looks like it can not find the bin for a specific feature. The bin for a continuous feature is a unit that covers a specific range of the feature. 2014-08-14 7:43 GMT+08:00 Sameer Tilak : Hi All, I am

java.lang.UnknownError: no bin was found for continuous variable.

2014-08-13 Thread Sameer Tilak
Hi All, I am using the decision tree algorithm and I get the following error. Any help would be great! java.lang.UnknownError: no bin was found for continuous variable. at org.apache.spark.mllib.tree.DecisionTree$.findBin$1(DecisionTree.scala:492) at org.apache.spark.mllib.tree.DecisionT

Random Forest implementation in MLib

2014-08-11 Thread Sameer Tilak
Hi All, I read on the mailing list that a random forest implementation was on the roadmap. I wanted to check on its status. We are currently using Weka and would like to move over to MLlib for performance.

Help with debugging a performance issue

2014-08-06 Thread Sameer Tilak
Hi All, I am running my spark job using spark-submit (yarn-client mode). In this PoC phase, I am loading a ~200MB TSV file and then doing computation over strings. I generate a few small files (in KB). The goal is then to load a ~250GB input file rather than 200 MB and run the analytics. In

java.lang.IllegalArgumentException: Unable to create serializer "com.esotericsoftware.kryo.serializers.FieldSerializer"

2014-08-05 Thread Sameer Tilak
Hi All, I am trying to move away from spark-shell to spark-submit and have been making some code changes. However, I am now having a problem with serialization. It used to work fine before the code update. Not sure what I did wrong. However, here is the code: JaccardScore.scala package approxs

java.lang.IllegalArgumentException: Unable to create serializer "com.esotericsoftware.kryo.serializers.FieldSerializer"

2014-08-04 Thread Sameer Tilak
Hi All, I am trying to move away from spark-shell to spark-submit and have been making some code changes. However, I am now having a problem with serialization. It used to work fine before the code update. Not sure what I did wrong. However, here is the code: JaccardScore.scala packa

java.lang.OutOfMemoryError: Java heap space

2014-07-31 Thread Sameer Tilak
Hi everyone, I have the following configuration. I am currently running my app in local mode. val conf = new SparkConf().setMaster("local[2]").setAppName("ApproxStrMatch").set("spark.executor.memory", "3g").set("spark.storage.memoryFraction", "0.1") I am getting the following error. I tried set

Spark partition

2014-07-30 Thread Sameer Tilak
Hi All, From the documentation, RDDs are already partitioned and distributed. However, there is a way to repartition a given RDD using the following function. Can someone please point out the best practices for using this? I have a 10 GB TSV file stored in HDFS and I have a 4 node cluster with 1 ma
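
A minimal sketch, assuming an illustrative HDFS path; a common rule of thumb is 2-4 partitions per CPU core in the cluster:

    val lines = sc.textFile("hdfs:///data/input.tsv")
    // Shuffle the data into more partitions to spread work across the cluster.
    val repartitioned = lines.repartition(32)
    println(repartitioned.partitions.length)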

RE: Spark SQL and Hive tables

2014-07-25 Thread Sameer Tilak
t to guide you how to read data from Hive via Spark SQL. So don't worry too much about the blog post. The programming guide I referred to demonstrates how to read data from Hive using Spark SQL. It is a good starting point. Best Regards, Jerry On Fri, Jul 25, 2014 at 5:38 PM, Sameer T

RE: Spark SQL and Hive tables

2014-07-25 Thread Sameer Tilak
ing tables in the MetaStore and writing queries using HiveQL. conf/ is a top-level directory in the Spark distribution that you downloaded. On Fri, Jul 25, 2014 at 2:35 PM, Sameer Tilak wrote: Hi Jerry, Thanks for your reply. I was following the steps in this programming guide. It d

RE: Spark SQL and Hive tables

2014-07-25 Thread Sameer Tilak
Hive via Spark SQL. So don't worry too much about the blog post. The programming guide I referred to demonstrates how to read data from Hive using Spark SQL. It is a good starting point. Best Regards, Jerry On Fri, Jul 25, 2014 at 5:38 PM, Sameer Tilak wrote: Hi Michael, Thanks

RE: Spark SQL and Hive tables

2014-07-25 Thread Sameer Tilak
-guide.html#hive-tables Best Regards, Jerry On Fri, Jul 25, 2014 at 5:25 PM, Sameer Tilak wrote: Hi All,I am trying to load data from Hive tables using Spark SQL. I am using spark-shell. Here is what I see: val trainingDataTable = sql("""SELECT prod.prod_num, demo

RE: Spark SQL and Hive tables

2014-07-25 Thread Sameer Tilak
Subject: Re: Spark SQL and Hive tables From: chiling...@gmail.com To: user@spark.apache.org Hi Sameer, Maybe this page will help you: https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables Best Regards, Jerry On Fri, Jul 25, 2014 at 5:25 PM, Sameer Tilak wrote: Hi All, I am

Spark SQL and Hive tables

2014-07-25 Thread Sameer Tilak
Hi All, I am trying to load data from Hive tables using Spark SQL. I am using spark-shell. Here is what I see: val trainingDataTable = sql("""SELECT prod.prod_num, demographics.gender, demographics.birth_year, demographics.income_group FROM prod p JOIN demographics d ON d.user_id = p.user_id"""
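
A minimal sketch of the Spark 1.0-era pattern the replies point toward, run inside spark-shell against tables registered in the Hive MetaStore; the query is abbreviated from the one above:

    import org.apache.spark.sql.hive.HiveContext

    // conf/ must contain the Hive configuration for the MetaStore lookup to work.
    val hiveContext = new HiveContext(sc)
    import hiveContext._
    val rows = hql("SELECT p.prod_num, d.gender FROM prod p JOIN demographics d ON d.user_id = p.user_id")
    rows.take(5).foreach(println)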

RE: error: bad symbolic reference. A signature in SparkContext.class refers to term io in package org.apache.hadoop which is not available

2014-07-23 Thread Sameer Tilak
to your > classpath. > > It may be much easier to manage this as a project with SBT or Maven > and let it sort out dependencies. > > On Wed, Jul 23, 2014 at 6:01 PM, Sameer Tilak wrote: > > Hi everyone, > > I was using Spark1.0 from Apache site and I was able to com

RE: error: bad symbolic reference. A signature in SparkContext.class refers to term io in package org.apache.hadoop which is not available

2014-07-23 Thread Sameer Tilak
much easier to manage this as a project with SBT or Maven > and let it sort out dependencies. > > On Wed, Jul 23, 2014 at 6:01 PM, Sameer Tilak wrote: > > Hi everyone, > > I was using Spark1.0 from Apache site and I was able to compile my code > > successfully using: >

error: bad symbolic reference. A signature in SparkContext.class refers to term io in package org.apache.hadoop which is not available

2014-07-23 Thread Sameer Tilak
Hi everyone, I was using Spark 1.0 from the Apache site and I was able to compile my code successfully using: scalac -classpath /apps/software/secondstring/secondstring/dist/lib/secondstring-20140630.jar:/apps/software/spark-1.0.0-bin-hadoop1/lib/datanucleus-api-jdo-3.2.1.jar:/apps/software/spark-1.0.0

CoarseGrainedExecutorBackend: Driver Disassociated‏

2014-07-09 Thread Sameer Tilak
Hi, This time, instead of manually starting the worker node using ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT, I used the start-slaves script on the master node. I also enabled -v (verbose flag) in ssh. Here is the o/p that I see. The log file for the worker node was not

RE: How should I add a jar?

2014-07-09 Thread Sameer Tilak
Hi Nicholas, I am using Spark 1.0 and I use this method to specify additional jars. The first jar is the dependency and the second one is my application. Hope this works for you. ./spark-shell --jars /apps/software/secondstring/secondstring/dist/lib/secondstring-20140630.jar,/apps/software

RE: CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-08 Thread Sameer Tilak
Tue, Jul 8, 2014 at 11:52 AM, Sameer Tilak wrote: Dear All, When I look inside the following directory on my worker node: $SPARK_HOME/work/app-20140708110707-0001/3 I see the following error message: log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration

CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-08 Thread Sameer Tilak
Dear All, When I look inside the following directory on my worker node: $SPARK_HOME/work/app-20140708110707-0001/3 I see the following error message: log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration). log4j:WARN Please initialize the log4j system properly. log

Further details on spark cluster set up

2014-07-08 Thread Sameer Tilak
Hi All, I used ip addresses in my scripts (spark-env.sh), and the slaves file contains the ip addresses of the master and slave nodes respectively. However, I still have no luck. Here is the relevant log file snippet: Master node log: 14/07/08 10:56:19 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster

RE: Spark: All masters are unresponsive!

2014-07-08 Thread Sameer Tilak
vm2018 7077 from the machines where you are running the spark shell. Thanks, Best Regards On Tue, Jul 8, 2014 at 12:21 PM, Sameer Tilak wrote: Hi All, I am having a few issues with stability and scheduling. When I use spark shell to submit my application, I get the following error message and

Spark: All masters are unresponsive!

2014-07-07 Thread Sameer Tilak
Hi All, I am having a few issues with stability and scheduling. When I use spark shell to submit my application, I get the following error message and spark shell crashes. I have a small 4-node cluster for PoC. I tried both manual and script-based cluster set up. I tried using FQDNs as well for

Spark shell error messages and app exit issues

2014-07-07 Thread Sameer Tilak
Hi All, When I run my application, it runs for a while and gives me part of the output correctly. I then get the following error and then spark shell exits. 14/07/07 13:54:53 INFO SendingConnection: Initiating connection to [localhost.localdomain/127.0.0.1:57423] 14/07/07 13:54:53 INFO ConnectionM

Error while launching spark cluster manaually

2014-07-07 Thread Sameer Tilak
Hi All, I am having the following issue -- maybe an fqdn/ip resolution issue, but I am not sure; any help with this will be great! On the master node I get the following error: I start the master using ./start-master.sh starting org.apache.spark.deploy.master.Master, logging to /apps/software/spark-1.0.0-bin-

RDD join: composite keys

2014-07-02 Thread Sameer Tilak
Hi everyone, Is it possible to join RDDs using composite keys? I would like to join these two RDDs with RDD1.id1 = RDD2.id1 and RDD1.id2 = RDD2.id2: RDD1 (id1, id2, scoretype1), RDD2 (id1, id2, scoretype2). I want the result to be ResultRDD = (id1, id2, (score1, score2)). Would really appreciate it if you
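
A minimal sketch of one way to do this in spark-shell: key both RDDs by the (id1, id2) tuple and use the pair-RDD join (values are illustrative):

    import org.apache.spark.SparkContext._ // pair-RDD functions in Spark 1.x

    val rdd1 = sc.parallelize(Seq(("u1", "p1", 0.9), ("u2", "p2", 0.4)))
    val rdd2 = sc.parallelize(Seq(("u1", "p1", 0.7), ("u2", "p2", 0.8)))
    val byKey1 = rdd1.map { case (id1, id2, s1) => ((id1, id2), s1) }
    val byKey2 = rdd2.map { case (id1, id2, s2) => ((id1, id2), s2) }
    // joined: RDD[((id1, id2), (score1, score2))]
    val joined = byKey1.join(byKey2)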

RE: Serialization of objects

2014-06-30 Thread Sameer Tilak
Hi everyone, I was able to solve this issue. For now I changed the library code and added the following to the class com.wcohen.ss.BasicStringWrapper: public class BasicStringWrapper implements Serializable However, I am still curious to know how to get around the issue when you don't have acces
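
For the case where the library source can't be changed, a common workaround is sketched below: hold the non-serializable object in a @transient lazy val so it is rebuilt on each executor instead of being shipped with the closure (the class names follow the thread; the wrapper is illustrative):

    import com.wcohen.ss.Jaccard // third-party class that is not Serializable

    class JaccardHolder extends Serializable {
      // Not serialized with the closure; re-created lazily on each JVM.
      @transient lazy val jaccard = new Jaccard()
    }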

Serialization of objects

2014-06-26 Thread Sameer Tilak
Hi everyone, Aaron, thanks for your help so far. I am trying to serialize objects that I instantiate from a 3rd party library, namely instances of com.wcohen.ss.Jaccard and com.wcohen.ss.BasicStringWrapper. However, I am having problems with serialization. I am using (or at least trying to use) Kryo fo

Worker nodes: Error messages

2014-06-25 Thread Sameer Tilak
Hi All, I see the following error messages on my worker nodes. Are they due to improper cleanup or wrong configuration? Any help with this would be great! 14/06/25 12:30:55 INFO SecurityManager: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 14/06/25 12:30:55 INFO

RE: Basic Scala and Spark questions

2014-06-24 Thread Sameer Tilak
org.apache.spark.rdd.RDD From: Sameer Tilak [mailto:ssti...@live.com] Sent: Monday, June 23, 2014 10:38 AM To: u...@spark.incubator.apache.org Subject: Basic Scala and Spark questions Hi All, I am new to Scala and Spark. I have a basic question. I have the following import statements in my Scala program

RE: DAGScheduler: Failed to run foreach

2014-06-24 Thread Sameer Tilak
of the job's closure. In Java, class methods have an implicit reference to "this", so it tried to serialize the class CalculateScore, which is presumably not marked as Serializable.) On Mon, Jun 23, 2014 at 5:45 PM, Sameer Tilak wrote: The subject should be: org.apache.spar
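
A minimal sketch of the fix that explanation suggests: move the method out of the enclosing class into a standalone serializable object so the closure no longer drags `this` along. Here dataset is assumed to be an RDD[String], and the object and method names are illustrative:

    // Standalone object: no implicit reference to an enclosing class.
    object Scorer extends Serializable {
      def printScore(statement: String): Unit = {
        println(statement.length) // placeholder for the real scoring logic
      }
    }
    dataset.foreach(s => Scorer.printScore(s))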

RE: DAGScheduler: Failed to run foreach

2014-06-23 Thread Sameer Tilak
The subject should be: org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: and not DAGScheduler: Failed to run foreach If I call printScoreCanndedString with a hard-coded string and identical 2nd parameter, it works fine.

DAGScheduler: Failed to run foreach

2014-06-23 Thread Sameer Tilak
Hi All, I am using spark for text analysis. I have a source file that has a few thousand sentences and a dataset of tens of millions of statements. I want to compare each statement from the sourceFile with each statement from the dataset and generate a score. I am having the following problem. I would

RE: Basic Scala and Spark questions

2014-06-23 Thread Sameer Tilak
Hi All, I was able to solve both these issues. Thanks! Just FYI: For 1: import org.apache.spark.rdd; import org.apache.spark.rdd.RDD; For 2: rdd.map(x => jc_.score(str1, new StringWrapper(x))) From: ssti...@live.com To: u...@spark.incubator.apache.org Subject: Basic Scala

Basic Scala and Spark questions

2014-06-23 Thread Sameer Tilak
Hi All, I am new to Scala and Spark. I have a basic question. I have the following import statements in my Scala program. I want to pass my function (printScore) to Spark. It will compare a string import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import o

Running Spark alongside Hadoop

2014-06-20 Thread Sameer Tilak
Dear Spark users, I have a small 4 node Hadoop cluster. Each node is a VM -- 4 virtual cores, 8GB memory and 500GB disk. I am currently running Hadoop on it. I would like to run Spark (in standalone mode) alongside Hadoop on the same nodes. Given the configuration of my nodes, will that work? D

Spark and Hadoop cluster

2014-03-21 Thread Sameer Tilak
Hi everyone, We are planning to set up Spark. The documentation mentions that it is possible to run Spark in standalone mode on a Hadoop cluster. Does anyone have any comments on the stability and performance of this mode?

RE: Pig on Spark

2014-03-10 Thread Sameer Tilak
ark To: user@spark.apache.org Hi Sameer, Did you make any progress on this? My team is also trying it out and would love to know some details on progress. Mayur Rustagi, Ph: +1 (760) 203 3257, http://www.sigmoidanalytics.com, @mayur_rustagi On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak wrote:

RE: Pig on Spark

2014-03-06 Thread Sameer Tilak
Thursday, March 6, 2014 3:11 PM, Sameer Tilak wrote: Hi everyone, We are using to Pig to build our data pipeline. I came across Spork -- Pig on Spark at: https://github.com/dvryaboy/pig and not sure if it is still active. Can someone please let me know the status of Spork or any other eff

Pig on Spark

2014-03-06 Thread Sameer Tilak
Hi everyone, We are using Pig to build our data pipeline. I came across Spork -- Pig on Spark at: https://github.com/dvryaboy/pig and am not sure if it is still active. Can someone please let me know the status of Spork or any other effort that will let us run Pig on Spark? We can significantl