I am running the sample multinomial regression code given in the Spark docs
(version 2.2.0).
LogisticRegression lr = new LogisticRegression()
    .setMaxIter(100)
    .setRegParam(0.3)
    .setElasticNetParam(0.8);
LogisticRegressionModel lrModel = lr.fit(training);
But in the dataset I am adding a constant field
predictions.select("prediction", "label", "features").show(5)
I have labels as line numbers, but they are getting printed in exponential
format. Is there a way to print them in normal double notation?
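One purely cosmetic option, sketched here in Scala, is to format the numeric columns as strings before calling show. A minimal sketch, assuming predictions is the transformed DataFrame; "%.0f" is an arbitrary precision choice:

import org.apache.spark.sql.functions.{col, format_string}

// Minimal sketch: render prediction and label as plain decimals instead of
// the default exponential notation. "%.0f" is an arbitrary precision choice.
predictions
  .select(
    format_string("%.0f", col("prediction")).as("prediction"),
    format_string("%.0f", col("label")).as("label"),
    col("features"))
  .show(5)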
Kundan
I am using:
Dataset<Row> result = model.transform(testData)
    .select("probability", "label", "features");
result.show(1000, false);
In this case the feature vector is printed in the output. Is there a way to
have my original raw data printed instead of the feature vector, or is there
a way to reverse
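Since transform() appends its output columns to the input Dataset, one option is simply to select the raw input columns alongside the probability. A minimal Scala sketch; rawCol1 and rawCol2 are hypothetical names standing in for the original raw fields:

// Minimal sketch: transform() keeps the input columns, so the raw fields can
// be selected directly instead of the assembled "features" vector.
// "rawCol1" / "rawCol2" are hypothetical names for the original raw columns.
val result = model.transform(testData)
  .select("probability", "label", "rawCol1", "rawCol2")
result.show(1000, false)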
Sorry, my bad.
The issue got resolved.
Thanks,
Kundan
On Mon, Sep 5, 2016 at 3:58 PM, kundan kumar <iitr.kun...@gmail.com> wrote:
> Hi,
>
> I am unable to get the raw probabilities despite clearing the
> threshold. It's still printing the predicted label.
>
>
Hi,
I am unable to get the raw probabilities despite clearing the threshold.
It's still printing the predicted label.
Can someone help resolve this issue?
Here is the code snippet:
LogisticRegressionWithSGD lrLearner = new LogisticRegressionWithSGD();
LogisticRegressionModel model =
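For reference, a rough Scala sketch of how clearing the threshold is expected to behave with the old mllib API; this is not the exact code from the thread, and training/point are placeholders:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Minimal sketch: after clearThreshold(), predict() returns the raw
// probability instead of the thresholded 0/1 label.
def rawProbability(training: RDD[LabeledPoint], point: LabeledPoint): Double = {
  val model = new LogisticRegressionWithSGD().run(training)
  model.clearThreshold()
  model.predict(point.features) // raw probability, not the predicted label
}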
> …number of feature values, but maybe that's what you have. It's more
> problematic the smaller your hash space is.
>
> On Tue, Jul 12, 2016 at 10:21 AM, kundan kumar <iitr.kun...@gmail.com>
> wrote:
> > Hi ,
> >
> > I am trying to use StreamingLogisticRegre
Hi,
I am trying to use StreamingLogisticRegressionWithSGD to build a CTR
prediction model.
The document
http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
mentions that numFeatures should be *constant*.
The problem that I am facing is: since most
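In case it helps frame the question: the usual way to keep numFeatures constant when the raw attributes vary is to hash them into a fixed-size vector, which is what the reply above alludes to. A minimal Scala sketch under that assumption; the stream name, label encoding and hash-space size are placeholders:

import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.dstream.DStream

// Minimal sketch: hash the raw categorical fields into a fixed-width vector
// so numFeatures stays constant across batches.
val numFeatures = 1 << 18               // fixed hash space (placeholder size)
val hashingTF = new HashingTF(numFeatures)

// rawStream: (label, raw categorical tokens) per impression
def train(rawStream: DStream[(Double, Seq[String])]): Unit = {
  val labeled = rawStream.map { case (label, tokens) =>
    LabeledPoint(label, hashingTF.transform(tokens))
  }
  val model = new StreamingLogisticRegressionWithSGD()
    .setInitialWeights(Vectors.zeros(numFeatures))
  model.trainOn(labeled)
}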
Hi,
Do we have a streaming version of Logistic Regression in Spark? I can see
it is there for Linear Regression.
Has anyone used logistic regression on streaming data? It would be really
helpful if you could share your insights on how to train on the incoming data.
In my use case I am trying to use
Hi Guys,
Has anyone faced this issue with Spark?
Why is it that in Spark Streaming the executors are still shown on the UI
even when the worker has been killed and is no longer in the cluster?
This severely impacts my running jobs, which take much longer, with stages
failing with the exception
Hi TD/Cody,
Why is it that in Spark Streaming the executors are still shown on the UI
even when the worker has been killed and is no longer in the cluster?
This severely impacts my running jobs, which take much longer, with stages
failing with the exception
java.io.IOException: Failed to connect
Hi Cody,
My use case is something like follows :
My application dies at X time and I write the offsets to a DB.
Now my application starts at time Y (a few minutes later) and Spark
Streaming reads the latest offsets using the createDirectStream method. Here
I want to get the exact offset that
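In case it is useful, the pattern usually suggested for recovering the exact per-partition offsets is to cast each RDD produced by createDirectStream to HasOffsetRanges. A minimal Scala sketch; directStream stands in for the stream created by createDirectStream, and the cast must be done on the RDD before any other transformation:

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// Minimal sketch: read the per-partition offsets of each micro-batch so they
// can be persisted to the DB alongside the results.
directStream.foreachRDD { rdd =>
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    // topic, partition, fromOffset and untilOffset are what you would store
    println(s"${r.topic} ${r.partition} ${r.fromOffset} ${r.untilOffset}")
  }
}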
Hi,
I am using the Spark Streaming checkpointing mechanism and reading the data
from Kafka. The window duration for my application is 2 hours with a sliding
interval of 15 minutes.
So, my batches run at following intervals...
- 09:45
- 10:00
- 10:15
- 10:30
- and so on
When my job is
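For reference, a minimal Scala sketch of the setup being described (15-minute batches, a 2-hour window sliding every 15 minutes, recovery from the checkpoint directory); the checkpoint path and the stream body are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val checkpointDir = "hdfs:///tmp/checkpoint"     // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("windowed-job")
  val ssc = new StreamingContext(conf, Minutes(15))   // batch interval
  ssc.checkpoint(checkpointDir)
  // val stream = ...  e.g. the Kafka input stream
  // stream.window(Minutes(120), Minutes(15)).foreachRDD { rdd => ... }
  ssc
}

// On restart, getOrCreate rebuilds the context from the checkpoint if present.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()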
Sure
Thanks !!
On Sun, Nov 15, 2015 at 9:13 PM, Cody Koeninger <c...@koeninger.org> wrote:
> Not sure on that, maybe someone else can chime in
>
> On Sat, Nov 14, 2015 at 4:51 AM, kundan kumar <iitr.kun...@gmail.com>
> wrote:
…not time of processing.
>
> On Fri, Nov 13, 2015 at 4:36 AM, kundan kumar <iitr.kun...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am using spark streaming check-pointing mechanism and reading the data
>> from kafka. The window duration for my application is 2 hrs
Hi,
I am using the Spark Streaming checkpointing mechanism and reading the data
from Kafka. The window duration for my application is 2 hours with a sliding
interval of 15 minutes.
So, my batches run at following intervals...
09:45
10:00
10:15
10:30 and so on
Suppose my running batch dies at 09:55
Hi,
Below is my code snippet where I am using the checkpointing feature of Spark
Streaming. The SPARK_DURATION that I am using is 5 minutes and the batch
duration is 15 minutes. I am checkpointing the data at every SPARK_DURATION
(5 minutes). When I kill the job and start the next batch it takes
Hi,
I am running a Spark Streaming job. I was testing fault tolerance by killing
one of the workers with the kill -9 command.
My understanding is that when I kill a worker the job should not die and
should resume execution.
But I am getting the following error and my process is halted.
I have set spark.sql.shuffle.partitions=1000, but it is still failing.
On Tue, Aug 25, 2015 at 11:36 AM, Raghavendra Pandey
raghavendra.pan...@gmail.com wrote:
Did you try increasing sql partitions?
On Tue, Aug 25, 2015 at 11:06 AM, kundan kumar iitr.kun...@gmail.com
wrote:
I am running
Hi,
I am trying to implement equal-depth and equal-height binning methods in
Spark.
Any insights or existing code for this would be really helpful.
Thanks,
Kundan
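A minimal Scala sketch of how equal-depth (equal-frequency) binning could be done for a single numeric column, assuming the values are in an RDD[Double]; purely illustrative and unoptimized:

import org.apache.spark.rdd.RDD

// Minimal sketch: sort the column, index it, and map each index to one of
// numBins equally populated bins. Returns (value, bin) pairs.
def equalDepthBins(values: RDD[Double], numBins: Int): RDD[(Double, Int)] = {
  val n = values.count()
  values
    .sortBy(identity)
    .zipWithIndex()
    .map { case (v, idx) => (v, ((idx * numBins) / n).toInt) }
}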
Hi,
I need to store terabytes of data which will be used by BI tools like
QlikView.
The queries can filter on any column.
Currently we are using Redshift for this purpose.
I am trying to explore options other than Redshift.
Is it possible to gain better performance in
…But the major challenge I faced there was that secondary indexing was not
supported for the bulk-loading process.
Only the sequential loading process supported secondary indexes, which took
a longer time.
Any comments on this?
On Thu, Mar 26, 2015 at 5:59 PM, kundan kumar iitr.kun...@gmail.com wrote:
I
org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:file:/user/hive/warehouse/src is not a directory or
unable to create one)
Could you verify that you (the user you are running under) has the rights
to create the necessary folders within HDFS?
On Tue, Feb 24, 2015 at 9:06 PM kundan kumar
Hi,
I have placed my hive-site.xml inside spark/conf and I am trying to execute
some Hive queries given in the documentation.
Can you please suggest what I am doing wrong here?
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext:
I want to create/access Hive tables from Spark.
I have placed the hive-site.xml inside the spark/conf directory. Even so, it
creates a local metastore in the directory where I run the spark shell and
exits with an error.
I am getting this error when I try to create a new Hive table. Even
Hi,
I am getting the following error when I try to query a Hive table from the
spark shell.
I have placed my hive-site.xml in the spark/conf directory.
Please suggest how to resolve this error.
scala> sqlContext.sql("select count(*) from offers_new").collect().foreach(println)
15/02/11 01:48:01
…mode.
Regards,
Kundan
On Thu, Feb 5, 2015 at 12:49 PM, Felix C felixcheun...@hotmail.com wrote:
Is YARN_CONF_DIR set?
--- Original Message ---
From: Aniket Bhatnagar aniket.bhatna...@gmail.com
Sent: February 4, 2015 6:16 AM
To: kundan kumar iitr.kun...@gmail.com, spark users
user
Hi,
I am trying to execute my code on a YARN cluster.
The command which I am using is:
$SPARK_HOME/bin/spark-submit --class EDDApp
target/scala-2.10/edd-application_2.10-1.0.jar --master yarn-cluster
--num-executors 3 --driver-memory 6g --executor-memory 7g outpuPath
But, I can see that this
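One thing that may be worth double-checking here: spark-submit treats everything after the application jar as arguments to the application's main class, so in the command above the --master, --num-executors, --driver-memory and --executor-memory flags would not reach spark-submit itself, and the job may fall back to whatever default master is configured (often local). The conventional ordering would be roughly:

$SPARK_HOME/bin/spark-submit --class EDDApp --master yarn-cluster \
  --num-executors 3 --driver-memory 6g --executor-memory 7g \
  target/scala-2.10/edd-application_2.10-1.0.jar outpuPath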
I have an RDD which is of type
org.apache.spark.rdd.RDD[(String, (Array[String], Option[Array[String]]))]
I want to write it as a csv file.
Please suggest how this can be done.
myrdd.map(line => (line._1 + "," + line._2._1.mkString(",") + "," +
line._2._2.mkString(","))).saveAsTextFile("hdfs://...")
…]] = ???
optArrStr.map(arr => arr.mkString(",")).getOrElse("") // empty string or
whatever default value you have for this.
kr, Gerard.
On Tue, Feb 3, 2015 at 2:09 PM, kundan kumar iitr.kun...@gmail.com
wrote:
I have a RDD which is of type
org.apache.spark.rdd.RDD[(String, (Array[String], Option
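Putting the two snippets above together, a minimal Scala sketch of the full line, with the Option handled via map/getOrElse so a missing second array becomes an empty field (the hdfs path is the placeholder from the original):

// Minimal sketch: build one comma-separated line per record and write it out.
myrdd
  .map { case (key, (arr, optArr)) =>
    key + "," + arr.mkString(",") + "," + optArr.map(_.mkString(",")).getOrElse("")
  }
  .saveAsTextFile("hdfs://...")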
Hi,
Whenever I start spark shell I get this warning.
WARN NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
What does this mean, and how can it impact the execution of my Spark jobs?
Please suggest how I can fix
Is there any built-in function for calculating percentiles over a dataset?
I want to calculate the percentiles for each column in my data.
Regards,
Kundan
I have an array of the form
val array: Array[(Int, (String, Int))] = Array(
(idx1,(word1,count1)),
(idx2,(word2,count2)),
(idx1,(word1,count1)),
(idx3,(word3,count1)),
(idx4,(word4,count4)))
I want to get the top 10 and bottom 10 elements from this array for each
index
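A minimal Scala sketch of one way to do this, assuming the array has the declared type (the idx/word/count entries above are placeholders): group by index, sort each group by count, and take both ends.

// Minimal sketch: per index, the 10 highest- and 10 lowest-count pairs.
// Works on a local Array; the same idea applies to an RDD with groupByKey.
val topAndBottom = array
  .groupBy(_._1)
  .map { case (idx, entries) =>
    val sorted = entries.map(_._2).sortBy(p => -p._2)   // descending by count
    idx -> (sorted.take(10), sorted.takeRight(10).reverse)
  }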
Can someone help me understand the usage of the foreachActive function
introduced for Vectors?
I am trying to understand its usage in the MultivariateOnlineSummarizer class
for summary statistics.
sample.foreachActive { (index, value) =>
  if (value != 0.0) {
    if (currMax(index)
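For what it is worth, foreachActive(f) calls f(index, value) only for the stored (active) entries of a vector, so for a sparse vector the implicit zeros are never passed to the closure; the summarizer still checks value != 0.0 because stored entries can be explicit zeros. A small illustrative Scala sketch (foreachActive is public on Vector in more recent Spark versions):

import org.apache.spark.mllib.linalg.Vectors

// Illustrative sketch: only the stored entries (indices 1 and 3) are visited.
val sv = Vectors.sparse(5, Array(1, 3), Array(2.0, 4.0))
sv.foreachActive { (index, value) =>
  println(s"index=$index value=$value")
}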
Hi,
Is there something like the summary function in Spark, like the one in R?
The summary calculation which comes with Spark (MultivariateStatisticalSummary)
operates only on numeric types.
I am interested in getting the results for string types as well, e.g. the
first four most frequently occurring strings (group by
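For the string part, a minimal Scala sketch of one way to get the top-k most frequent values of a string column, assuming it is available as an RDD[String]:

import org.apache.spark.rdd.RDD

// Minimal sketch: count occurrences and keep the k most frequent values,
// which the numeric-only summary does not cover.
def topStrings(column: RDD[String], k: Int = 4): Array[(String, Long)] =
  column
    .map(s => (s, 1L))
    .reduceByKey(_ + _)
    .top(k)(Ordering.by[(String, Long), Long](_._2))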