On Thu, Feb 26, 2015 at 7:08 AM, Sameer Tilak ssti...@live.com wrote:
Hi,
I was looking at the documentation for deploying Spark cluster on EC2.
http://spark.apache.org/docs/latest/ec2-scripts.html
We are using Pig to build the data pipeline and then use MLlib for analytics. I
was wondering
Hi,
I was looking at the documentation for deploying Spark cluster on EC2.
http://spark.apache.org/docs/latest/ec2-scripts.html
We are using Pig to build the data pipeline and then use MLlib for analytics. I
was wondering if someone has any experience including additional
tools/services
Did you check the indices in the LIBSVM data and the master file? Do
they match? -Xiangrui
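One rough way to run that check in spark-shell (a sketch; the path is a placeholder and sc is the shell's SparkContext):

import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.linalg.SparseVector

val data = MLUtils.loadLibSVMFile(sc, "data.libsvm")   // placeholder path
// Highest populated (0-based) feature index versus the dimension loadLibSVMFile inferred.
val maxIndex = data.map(_.features match {
  case sv: SparseVector => if (sv.indices.isEmpty) -1 else sv.indices.max
  case v => v.size - 1
}).max()
println(s"max feature index: $maxIndex, vector size: ${data.first().features.size}")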
On Sat, Dec 20, 2014 at 8:13 AM, Sameer Tilak ssti...@live.com wrote:
Hi All,
I use the LIBSVM format to specify my input feature vector, which uses 1-based
indexing. When I run regression, the output
Hi All,
When I am running LinearRegressionWithSGD, I get the following error.
Any help on how to debug this further will be highly appreciated.
14/12/10 20:26:02 WARN TaskSetManager: Loss was due to
java.lang.ArrayIndexOutOfBoundsException: 150323 at
Hi All,
I was able to run LinearRegressionWithSGD for a larger dataset (2GB sparse).
I have now filtered the data and I am running regression on a subset of it (~
200 MB). I see this error, which is strange since it was running fine with the
superset data. Is this a formatting issue
Hi All,
I am using LinearRegressionWithSGD and then I save the model weights and
intercept. The file that contains the weights has this format:
1.20455
0.1356
0.000456
..
The intercept is 0 since I am using train without setting the intercept, so it can be
ignored for the moment. I would now like to initialize
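A minimal sketch of one way to do that warm start, assuming the weights were written one value per line (model, trainingData, numIterations and stepSize are taken from context; paths are placeholders):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// Save the learned weights, one value per line.
sc.parallelize(model.weights.toArray.map(_.toString), 1).saveAsTextFile("weights_out")

// Later: reload them and use them as the initial weights for a new run.
val initialWeights = Vectors.dense(sc.textFile("weights_out").map(_.toDouble).collect())
val warmStart = LinearRegressionWithSGD.train(trainingData, numIterations, stepSize, 1.0, initialWeights)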
Hi All,
I have been using the LinearRegression model of MLlib and am very pleased with its
scalability and robustness. Right now, we are just calculating the MSE of our
model. We would like to characterize the performance of our model. I was
wondering about adding support for computing things such as Confidence
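For reference, the MSE mentioned above can be computed along these lines (a sketch following the MLlib guide; model and data are assumed from context):

val labelsAndPreds = data.map { point =>
  (point.label, model.predict(point.features))
}
val mse = labelsAndPreds.map { case (label, pred) => math.pow(label - pred, 2) }.mean()
println(s"training MSE = $mse")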
Hi All,
I am using LinearRegression and have a question about the details of the
model.predict method. Basically it is predicting a variable y given an input
vector x. However, can someone point me to the documentation about the
threshold used in the predict method? Can that be changed? I am
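For context, a sketch of the distinction being asked about (testPoint is assumed; the classifier lines are illustrative only):

// Linear regression: predict returns the raw linear score w·x + intercept; no threshold is applied.
val yHat = model.predict(testPoint.features)

// Thresholds belong to classification models; e.g. on a logistic regression model:
// lrModel.setThreshold(0.7)   // label 1 when the score is >= 0.7
// lrModel.clearThreshold()    // predict then returns raw scores instead of 0/1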
Hi All,
I have my sparse data in LIBSVM format.
val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc,
"mllib/data/sample_libsvm_data.txt")
I am running linear regression. Let us say that my data has the following entry:
1 1:0 4:1
I think it will assume 0 for indices 2 and 3, right? I would
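A quick way to confirm what gets read in spark-shell (a sketch; the sample path is taken from the question):

import org.apache.spark.mllib.util.MLUtils

val examples = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
val first = examples.first()
// The 1-based LIBSVM indices are shifted to 0-based; indices that are absent
// from a line (2 and 3 in "1 1:0 4:1") are simply not stored, i.e. treated as 0.0.
println(first.label)
println(first.features)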
Hi All,
I have a question regarding the ordering of indices. The document says
that the indices are one-based and in ascending order. However, do the
indices within a row need to be sorted in ascending order?
Sparse data
It is very common in practice to have sparse training data. MLlib
Great, I will sort them.
Sent via the Samsung GALAXY S®4, an AT&T 4G LTE smartphone
-------- Original message --------
From: Xiangrui Meng men...@gmail.com
Date: 10/21/2014 3:29 PM (GMT-08:00)
To: Sameer Tilak ssti...@live.com
Cc: user@spark.apache.org
dependency between A's columns and
D's columns? -Xiangrui
On Tue, Oct 7, 2014 at 1:56 PM, Sameer Tilak ssti...@live.com wrote:
BTW, one detail:
When the number of iterations is 100, all weights are zero or below and the
indices are only from set A.
When the number of iterations is 150 I see
Hi All,
I have the following classes of features:
class A: 15000 features
class B: 170 features
class C: 900 features
class D: 6000 features
I use linear regression (over sparse data). I get excellent results with low
RMSE (~0.06) for the following combinations of classes:
1. A + B + C
2. B + C + D
3.
BTW, one detail:
When the number of iterations is 100, all weights are zero or below and the indices
are only from set A.
When the number of iterations is 150, I see 30+ non-zero weights (when sorted by
weight) and the indices are distributed across all sets. However, MSE is high (5.xxx)
and the result does
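One way to look at the effect described above is a small sweep over the iteration count (a sketch; trainingData and stepSize are assumed from context):

import org.apache.spark.mllib.regression.LinearRegressionWithSGD

for (iters <- Seq(100, 150, 200)) {
  val m = LinearRegressionWithSGD.train(trainingData, iters, stepSize)
  val mse = trainingData.map(p => math.pow(p.label - m.predict(p.features), 2)).mean()
  val nonZero = m.weights.toArray.count(_ != 0.0)
  println(s"iterations=$iters  MSE=$mse  non-zero weights=$nonZero")
}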
. But no one is committed to working on this feature. For now, you
can filter out examples containing missing values and use the rest for
training. -Xiangrui
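A minimal sketch of that filtering step, assuming comma-separated input where a missing value shows up as an empty field (path and delimiter are assumptions):

val raw = sc.textFile("training_data.csv")
// Keep only rows where every field is present; train on the rest.
val complete = raw
  .map(_.split(",", -1))
  .filter(fields => fields.forall(_.trim.nonEmpty))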
On Tue, Sep 30, 2014 at 11:26 AM, Sameer Tilak ssti...@live.com wrote:
Hi All,
Can someone please point me to the documentation that describes
Hi All,
Can someone please point me to the documentation that describes how missing
value imputation is done in MLlib? Also, any information on how this fits into
the overall roadmap would be great.
features may create a large number of temp objects, which
may also cause GC to happen.
Hope this helps!
Liquan
On Wed, Sep 24, 2014 at 9:50 PM, Sameer Tilak ssti...@live.com wrote:
Hi All,
I was able to solve this formatting issue. However, I have another
question. When I do the following,
val
Hi All,
When I try to load a dataset using MLUtils.loadLibSVMFile, I have the following
problem. Any help will be greatly appreciated!
Code snippet:
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import
, 2014 at 11:02 PM, Sameer Tilak ssti...@live.com wrote:
Hi All,
When I try to load a dataset using MLUtils.loadLibSVMFile, I have the
following problem. Any help will be greatly appreciated!
Code snippet:
import org.apache.spark.mllib.regression.LabeledPoint
import
should be a single space, not a tab. I feel
like your inputs have tabs between them instead of a single space. Therefore
the parser
cannot parse the input.
Best,
Burak
- Original Message -
From: Sameer Tilak ssti...@live.com
To: user@spark.apache.org
Sent: Wednesday, September 17
Yavuz bya...@stanford.edu wrote:
Hi,
The spacing between the inputs should be a single space, not a tab. I feel like
your inputs have tabs between them instead of a single space. Therefore the
parser
cannot parse the input.
Best,
Burak
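One way to act on that advice before loading the file (a sketch; paths are placeholders):

import org.apache.spark.mllib.util.MLUtils

// Replace tab delimiters with single spaces, write the cleaned copy, then load it.
sc.textFile("data_with_tabs.libsvm")
  .map(_.replaceAll("\t", " "))
  .saveAsTextFile("data_single_space")
val examples = MLUtils.loadLibSVMFile(sc, "data_single_space")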
- Original Message -
From: Sameer Tilak
Hi All,
I am able to run LinearRegressionWithSGD on a small sample dataset (~60MB
LIBSVM file of sparse data) with 6700 features.
val model = LinearRegressionWithSGD.train(examples, numIterations)
At the end I get a model such that
model.weights.size
res6: Int = 6699
I am assuming each entry in the
Hi All,
We have a fairly large amount of sparse data. I was following these
instructions in the manual:
Sparse data
It is very common in practice to have sparse training data. MLlib
supports reading training examples stored in LIBSVM format, which is the
default format used by LIBSVM and
Hi All,
I have data in the following format:
The 1st column is the user id and the columns from the second onward are class ids for
various products. I want to save this in LIBSVM format, and an intermediate step is to
sort the class ids in ascending order. For example, I/P: uid1 12433580
2670122
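A sketch of that conversion, assuming whitespace-separated input and a constant label of 1 (both assumptions; paths are placeholders):

val libsvmLines = sc.textFile("user_classes.txt").map { line =>
  val fields = line.split("\\s+")
  val classIds = fields.drop(1).map(_.toInt).sorted   // LIBSVM wants ascending indices
  "1 " + classIds.map(id => s"$id:1").mkString(" ")
}
libsvmLines.saveAsTextFile("user_classes_libsvm")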
Hi All,
I have transformed the data into the following format: the first column is the user
id, and all the other columns are class ids. For a user, only the class ids
that appear in this row have value 1 and the others are 0. I need to create a
sparse vector from this. Does the API for creating a sparse
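For reference, Vectors.sparse covers this case (a minimal sketch; numClasses and the example ids are assumptions):

import org.apache.spark.mllib.linalg.Vectors

val numClasses = 10000                        // assumed total number of class ids
val idsForUser = Array(12, 433, 580).sorted   // example class ids present for one user
val sparseVec = Vectors.sparse(numClasses, idsForUser, Array.fill(idsForUser.length)(1.0))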
:05 PM, Sameer Tilak ssti...@live.com wrote:
Hi everyone,
We are looking to apply a weight to each training example; this weight
should be used when computing the penalty of a misclassified example. For
instance, without weighting, each example is penalized 1 point when
evaluating
Hi everyone,
We are looking to apply a weight to each training example; this weight should
be used when computing the penalty of a misclassified example. For instance,
without weighting, each example is penalized 1 point when evaluating the model
of a classifier, such as a decision
Hi All,
I am planning to run the AMPLab benchmark suite to evaluate the performance of our
cluster. I looked at https://amplab.cs.berkeley.edu/benchmark/ and it mentions
data availability at:
s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix]
where /tiny/,
be there.
Best,
Burak
- Original Message -
From: Sameer Tilak ssti...@live.com
To: user@spark.apache.org
Sent: Wednesday, August 27, 2014 11:42:28 AM
Subject: Amplab: big-data-benchmark
Hi All,
I am planning to run the AMPLab benchmark suite to evaluate the performance of
our
Can you tell which nodes were doing the computation in each case?
Date: Wed, 27 Aug 2014 20:29:38 +0530
Subject: Execution time increasing with increase of cluster size
From: sarathchandra.jos...@algofusiontech.com
To: user@spark.apache.org
Hi,
I've written a simple Scala program which reads a
Hi All,
I was wondering if someone could please tell me the status of MLbase and its
roadmap in terms of a software release. We are very interested in exploring it
for our applications.
Hi All,
My dataset is fairly small -- a CSV file with around half a million rows
and 600 features. Everything works when I set the maximum depth of the decision
tree to 5 or 6. However, I get this error for larger values of that parameter
-- for example, when I set it to 10. Have others encountered a
Hi Wang,
Have you tried doing this in your application?
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryo.registrator", "yourpackage.MyKryoRegistrator")
You then don't need to specify it on the command line.
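For completeness, a minimal registrator matching those settings (a sketch; register whichever classes your job actually ships around -- LabeledPoint here is only an example):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[org.apache.spark.mllib.regression.LabeledPoint])
  }
}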
Date: Wed, 20 Aug 2014 12:25:14 -0700
. But if you want to try something now, you can
take a look at the docs of DecisionTree.trainClassifier and
trainRegressor:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L360
-Xiangrui
On Tue, Aug 19, 2014 at 4:24 PM, Sameer Tilak ssti
:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L360
-Xiangrui
On Tue, Aug 19, 2014 at 4:24 PM, Sameer Tilak ssti...@live.com wrote:
Hi All,
Is there any example of MLlib decision tree handling categorical variables?
My
Hi All,
Is there any example of the MLlib decision tree handling categorical variables? My
dataset includes a few categorical variables (20 out of 100 features), so I was
interested in knowing how I can use the current version of the decision tree
implementation to handle this situation. I looked at the
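A sketch of the categoricalFeaturesInfo mechanism the docs describe (feature indices and arities here are made up; trainingData is assumed to be an RDD[LabeledPoint] with the categories encoded as 0, 1, 2, ...):

import org.apache.spark.mllib.tree.DecisionTree

// e.g. feature 0 has 4 categories and feature 7 has 3.
val categoricalFeaturesInfo = Map(0 -> 4, 7 -> 3)
val model = DecisionTree.trainClassifier(trainingData, 2, categoricalFeaturesInfo,
  "gini", 5, 32)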
Hi,
I was able to set this parameter in my application to resolve this issue:
set("spark.kryoserializer.buffer.mb", "256")
Please let me know if this helps.
Date: Mon, 18 Aug 2014 21:50:02 +0800
From: dujinh...@hzduozhun.com
To: user@spark.apache.org
Subject: spark kryo serializable exception
Hi All,
I have an MLlib model:
val model = DecisionTree.train(parsedData, Regression, Variance, maxDepth)
I see the model has the following methods: algo, asInstanceOf, isInstanceOf,
predict, toString, topNode
model.topNode outputs: org.apache.spark.mllib.tree.model.Node = id = 0,
...@gmail.com
To: ssti...@live.com
Can you supply the detailed code and data you used? From the log, it looks like
it cannot find the bin for a specific feature. The bin for a continuous feature is a
unit that covers a specific range of that feature.
2014-08-14 7:43 GMT+08:00 Sameer Tilak ssti...@live.com
I have an MLlib model:
val model = DecisionTree.train(parsedData, Regression, Variance, maxDepth)
I see the model has the following methods: algo, asInstanceOf, isInstanceOf,
predict, toString, topNode
model.topNode outputs: org.apache.spark.mllib.tree.model.Node = id = 0, isLeaf =
Hi All,
I am using the decision tree algorithm and I get the following error. Any help
would be great!
java.lang.UnknownError: no bin was found for continuous variable. at
org.apache.spark.mllib.tree.DecisionTree$.findBin$1(DecisionTree.scala:492) at
Hi All,
I read on the mailing list that a random forest implementation was on the
roadmap. I wanted to check on its status. We are currently using Weka and
would like to move over to MLlib for performance.
Hi All,
I am trying to move away from spark-shell to spark-submit and have been making
some code changes. However, I am now having a problem with serialization. It
used to work fine before the code update. Not sure what I did wrong. However,
here is the code
JaccardScore.scala
package
Hi All,
I am trying to move away from spark-shell to spark-submit and have been making
some code changes. However, I am now having a problem with serialization. It
used to work fine before the code update. Not sure what I did wrong. However,
here is the code
JaccardScore.scala
Hi everyone,
I have the following configuration. I am currently running my app
in local mode.
val conf = new SparkConf().setMaster("local[2]").setAppName("ApproxStrMatch")
  .set("spark.executor.memory", "3g").set("spark.storage.memoryFraction", "0.1")
I am getting the following error. I tried setting up
Hi All,
I am trying to load data from Hive tables using Spark SQL. I am using
spark-shell. Here is what I see:
val trainingDataTable = sql("""SELECT prod.prod_num, demographics.gender,
  demographics.birth_year, demographics.income_group FROM prod p JOIN
  demographics d ON d.user_id = p.user_id""")
From: chiling...@gmail.com
To: user@spark.apache.org
Hi Sameer,
Maybe this page will help you:
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
Best Regards,
Jerry
On Fri, Jul 25, 2014 at 5:25 PM, Sameer Tilak ssti...@live.com wrote:
Hi All,
I am trying to load data
via Spark SQL. So
don't worry too much about the blog post.
The programming guide I referred to demonstrates how to read data from Hive
using Spark SQL. It is a good starting point.
Best Regards,
Jerry
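The pattern in that guide, roughly, for Spark 1.0.x (a sketch; the table and column names come from the question and are assumed to exist in the Hive metastore -- in later releases hql was folded into sql):

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._

val trainingDataTable = hql(
  "SELECT p.prod_num, d.gender, d.birth_year, d.income_group " +
  "FROM prod p JOIN demographics d ON d.user_id = p.user_id")
trainingDataTable.take(5).foreach(println)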
On Fri, Jul 25, 2014 at 5:38 PM, Sameer Tilak ssti...@live.com wrote:
Hi Michael,
Thanks
Hi everyone,
I was using Spark 1.0 from the Apache site and I was able to compile my
code successfully using:
scalac -classpath
.
On Wed, Jul 23, 2014 at 6:01 PM, Sameer Tilak ssti...@live.com wrote:
Hi everyone,
I was using Spark 1.0 from the Apache site and I was able to compile my code
successfully using:
scalac -classpath
/apps/software/secondstring/secondstring/dist/lib/secondstring-20140630.jar:/apps/software/spark
or Maven
and let it sort out dependencies.
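Along those lines, a minimal build.sbt (versions are assumptions for the Spark 1.0 era; unmanaged jars such as the secondstring build can go under lib/):

name := "approx-str-match"

version := "0.1"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.0.2" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.0.2" % "provided"
)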
On Wed, Jul 23, 2014 at 6:01 PM, Sameer Tilak ssti...@live.com wrote:
Hi everyone,
I was using Spark 1.0 from the Apache site and I was able to compile my code
successfully using:
scalac -classpath
/apps/software/secondstring/secondstring/dist/lib
Hi Nicholas,
I am using Spark 1.0 and I use this method to specify the additional jars.
The first jar is the dependency and the second one is my application. Hope this
will work for you.
./spark-shell --jars
Hi,
This time, instead of manually starting the worker node using
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
I used the start-slaves script on the master node. I also enabled -v (the verbose flag)
in ssh. Here is the output that I see. The log file for the worker node was not
Hi All,
I am having a few issues with stability and scheduling. When I use spark-shell
to submit my application, I get the following error message and spark-shell
crashes. I have a small 4-node cluster for a PoC. I tried both manual and
script-based cluster setup. I tried using FQDNs as well for
Hi All,
I used IP addresses in my scripts (spark-env.sh), and slaves contains the IP
addresses of the master and slave nodes respectively. However, I still have no
luck. Here is the relevant log file snippet:
Master node log:
14/07/08 10:56:19 ERROR EndpointWriter: AssociationError
Dear All,
When I look inside the following directory on my worker
node: $SPARK_HOME/work/app-20140708110707-0001/3
I see the following error message:
log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration).
log4j:WARN Please initialize the log4j
system
?
On Tue, Jul 8, 2014 at 11:52 AM, Sameer Tilak ssti...@live.com wrote:
Dear All,
When I look inside the following directory on my worker
node: $SPARK_HOME/work/app-20140708110707-0001/3
I see the following error message:
log4j:WARN No appenders could be found for logger
Hi All,
I am having the following issue -- maybe an FQDN/IP resolution issue, but I am
not sure; any help with this will be great!
On the master node I get the following error. I start the master using
./start-master.sh
starting org.apache.spark.deploy.master.Master, logging to
Hi everyone,
Is it possible to join RDDs using composite keys? I would like to join these
two RDDs with RDD1.id1 = RDD2.id1 and RDD1.id2 = RDD2.id2, where
RDD1: (id1, id2, scoretype1) and RDD2: (id1, id2, scoretype2).
I want the result to be ResultRDD = (id1, id2, (score1, score2)).
Would really appreciate it if you
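A sketch of the usual composite-key approach -- key both RDDs by the (id1, id2) pair, then join (element types are assumed):

import org.apache.spark.SparkContext._   // pair-RDD functions (already imported in spark-shell)

// rdd1: RDD[(String, String, Double)], rdd2: RDD[(String, String, Double)]
val keyed1 = rdd1.map { case (id1, id2, score1) => ((id1, id2), score1) }
val keyed2 = rdd2.map { case (id1, id2, score2) => ((id1, id2), score2) }
val resultRDD = keyed1.join(keyed2)
  .map { case ((id1, id2), (score1, score2)) => (id1, id2, (score1, score2)) }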
Hi everyone,
I was able to solve this issue. For now I changed the library code
and added the following to the class com.wcohen.ss.BasicStringWrapper:
public class BasicStringWrapper implements Serializable
However, I am still curious to know how to get around the issue when you don't
have
Hi everyone,
Aaron, thanks for your help so far. I am trying to serialize objects that I
instantiate from a 3rd-party library, namely instances of com.wcohen.ss.Jaccard
and com.wcohen.ss.BasicStringWrapper. However, I am having problems with
serialization. I am (at least trying to be) using Kryo
Hi All,
I see the following error messages on my worker nodes. Are they due to improper
cleanup or wrong configuration? Any help with this would be great!
14/06/25 12:30:55 INFO SecurityManager: Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
14/06/25 12:30:55 INFO
were trying to pass printScoreCanndedString as part of the job's
closure. In Java, class methods have an implicit reference to this, so it
tried to serialize the class CalculateScore, which is presumably not marked as
Serializable.)
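A sketch of the workaround that usually follows from that diagnosis: keep the scoring helper out of the enclosing class (or copy what the closure needs into a local value) so Spark does not try to serialize the whole class (names here are assumptions, not the original code):

// Standalone object: calling Scoring.score from a closure does not drag in CalculateScore.
object Scoring {
  def score(line: String, query: String): Double =
    if (line.contains(query)) 1.0 else 0.0        // stand-in for the real scoring logic
}

class CalculateScore(sc: org.apache.spark.SparkContext) {
  def scoreAll(data: org.apache.spark.rdd.RDD[String], query: String): Unit = {
    // query is a plain local value, so the closure stays small and serializable.
    data.foreach(line => println(line + " -> " + Scoring.score(line, query)))
  }
}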
On Mon, Jun 23, 2014 at 5:45 PM, Sameer Tilak ssti...@live.com
org.apache.spark.rdd.RDD
From: Sameer Tilak [mailto:ssti...@live.com]
Sent: Monday, June 23, 2014 10:38 AM
To: u...@spark.incubator.apache.org
Subject: Basic Scala and Spark questions
Hi All,
I am new to Scala and Spark. I have a basic question. I have the following
import statements in my Scala program
Hi All,
I am new to Scala and Spark. I have a basic question. I have the
following import statements in my Scala program. I want to pass my function
(printScore) to Spark. It will compare a string
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import
Hi All,
I am using Spark for text analysis. I have a source file that has a few thousand
sentences and a dataset of tens of millions of statements. I want to compare
each statement from the sourceFile with each statement from the dataset and
generate a score. I am having the following problem. I
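A minimal sketch of one way to structure that comparison, broadcasting the small source file rather than taking a full cartesian product (paths and the similarity function are placeholders):

val sourceSentences = sc.textFile("source.txt").collect()
val sourceBc = sc.broadcast(sourceSentences)

def tokenSim(a: String, b: String): Double = {          // toy stand-in for the real score
  val (ta, tb) = (a.split("\\s+").toSet, b.split("\\s+").toSet)
  if (ta.isEmpty || tb.isEmpty) 0.0 else ta.intersect(tb).size.toDouble / ta.union(tb).size
}

val scores = sc.textFile("dataset.txt").flatMap { stmt =>
  sourceBc.value.map(src => (src, stmt, tokenSim(src, stmt)))
}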
The subject should be: org.apache.spark.SparkException: Job aborted due to
stage failure: Task not serializable: java.io.NotSerializableException: and
not DAGScheduler: Failed to run foreach
If I call printScoreCanndedString with a hard-coded string and identical 2nd
parameter, it works fine.
Dear Spark users,
I have a small 4-node Hadoop cluster. Each node is a VM -- 4 virtual cores, 8GB
memory and 500GB disk. I am currently running Hadoop on it. I would like to run
Spark (in standalone mode) along side Hadoop on the same nodes. Given the
configuration of my nodes, will that work?
Hi everyone,
We are planning to set up Spark. The documentation mentions that it
is possible to run Spark in standalone mode on a Hadoop cluster. Does anyone
have any comments on the stability and performance of this mode?
To: user@spark.apache.org
Hi Sameer,
Did you make any progress on this? My team is also trying it out and
would love to know some details on progress.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak ssti...@live.com
Hi everyone,
We are using Pig to build our data pipeline. I came across Spork -- Pig on
Spark -- at https://github.com/dvryaboy/pig and am not sure if it is still active.
Can someone please let me know the status of Spork, or of any other effort that
will let us run Pig on Spark? We can
On Thursday, March 6, 2014 3:11 PM, Sameer Tilak ssti...@live.com wrote:
Hi everyone,
We are using Pig to build our data pipeline. I came across Spork -- Pig on
Spark -- at https://github.com/dvryaboy/pig and am not sure if it is still active.
Can someone please let me know