Error running multinomial regression on a dataset with a field having constant value

2018-03-11 Thread kundan kumar
I am running the sample multinomial regression code given in the Spark docs (version 2.2.0): LogisticRegression lr = new LogisticRegression().setMaxIter(100).setRegParam(0.3).setElasticNetParam(0.8); LogisticRegressionModel lrModel = lr.fit(training); But in the dataset I am adding a constant field
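
A common culprit for such failures is a feature with zero variance. If that matches this case, a minimal Scala sketch like the one below (assuming a DataFrame named `training`, as in the snippet; the check itself is generic) can flag constant columns so they can be dropped before the features are assembled:

```scala
// Hypothetical check (not from the original post): a field is constant if it
// has exactly one distinct value across the whole training set.
val constantCols = training.columns.filter { c =>
  training.select(c).distinct().count() == 1
}
println(s"Constant columns to consider dropping: ${constantCols.mkString(", ")}")
```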

Output of select in non exponential form.

2017-06-08 Thread kundan kumar
predictions.select("prediction", "label", "features").show(5) I have labels as line numbers but they are getting printed in exponential format. Is there a way to print it in normal double notation. Kundan

Re: Convert the feature vector to raw data

2017-06-07 Thread kundan kumar
e); On Wed, Jun 7, 2017 at 5:00 PM, kundan kumar <iitr.kun...@gmail.com> wrote: I am using Dataset result = model.transform(testData).select("probability", "label","features"); res

Convert the feature vector to raw data

2017-06-07 Thread kundan kumar
I am using Dataset result = model.transform(testData).select("probability", "label","features"); result.show(1000, false); In this case the feature vector is being printed as output. Is there a way that my original raw data gets printed instead of the feature vector, or is there a way to reverse
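
Since transform() keeps the input columns of testData, one sketch (the raw column names "userId" and "campaign" below are placeholders, not from the original post) is simply to select the original raw columns instead of the assembled vector:

```scala
// Sketch: the raw input columns survive model.transform(), so select them
// alongside probability and label instead of "features".
val result = model.transform(testData)
  .select("probability", "label", "userId", "campaign")
result.show(1000, false)
```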

Re: Unable to get raw probabilities after clearing model threshold

2016-09-05 Thread kundan kumar
Sorry, my bad. The issue got resolved. Thanks, Kundan On Mon, Sep 5, 2016 at 3:58 PM, kundan kumar <iitr.kun...@gmail.com> wrote: Hi, I am unable to get the raw probabilities despite clearing the threshold. It's still printing the predicted label.

Unable to get raw probabilities after clearing model threshold

2016-09-05 Thread kundan kumar
Hi, I am unable to get the raw probabilities despite clearing the threshold. It's still printing the predicted label. Can someone help resolve this issue? Here is the code snippet. LogisticRegressionWithSGD lrLearner = new LogisticRegressionWithSGD(); LogisticRegressionModel model =
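
A minimal Scala sketch of the usual pattern, assuming trainingData and testData are RDD[LabeledPoint] (names are placeholders): clearThreshold() has to be called on the trained model, after which predict() returns the raw score rather than the 0/1 label.

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.rdd.RDD

// Sketch: train, clear the threshold on the *model*, then predict.
val model = LogisticRegressionWithSGD.train(trainingData, 100)
model.clearThreshold()
val probabilities: RDD[Double] = testData.map(p => model.predict(p.features))
```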

Re: Handling categorical variables in StreamingLogisticRegressionWithSGD

2016-07-13 Thread kundan kumar
umber of feature values, but maybe that's what you have. It's more problematic the smaller your hash space is. On Tue, Jul 12, 2016 at 10:21 AM, kundan kumar <iitr.kun...@gmail.com> wrote: Hi, I am trying to use StreamingLogisticRegre

Handling categorical variables in StreamingLogisticRegressionWithSGD

2016-07-12 Thread kundan kumar
Hi, I am trying to use StreamingLogisticRegressionWithSGD to build a CTR prediction model. The document http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression mentions that numFeatures should be *constant*. The problem that I am facing is: since most
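
One way to keep numFeatures constant with open-ended categorical values is the hashing trick, sketched below; the "field=value" tokens are invented examples, and the feature-space size is an arbitrary choice.

```scala
import org.apache.spark.mllib.feature.HashingTF

// Sketch: hash each categorical "field=value" token into a fixed-size space,
// so the feature dimension stays the same no matter which categories appear.
val hashingTF = new HashingTF(numFeatures = 1 << 14)
val features = hashingTF.transform(Seq("country=IN", "device=mobile", "adSlot=top"))
```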

Re: Logistic Regression in Spark Streaming

2016-05-27 Thread kundan kumar
sig_campaign=external_links> 2016-05-27 9:09 GMT+02:00 kundan kumar <iitr.kun...@gmail.com>: Hi, Do we have a streaming version of Logistic Regression in Spark? I can see it's there for Linear Regression. Has anyone u

Logistic Regression in Spark Streaming

2016-05-27 Thread kundan kumar
Hi, Do we have a streaming version of Logistic Regression in Spark? I can see it's there for Linear Regression. Has anyone used logistic regression on streaming data? It would be really helpful if you shared your insights on how to train on the incoming data. In my use case I am trying to use
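
For reference, MLlib does ship a streaming logistic regression. A minimal sketch, where trainingStream and numFeatures are assumed to come from the surrounding application:

```scala
import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.streaming.dstream.DStream

// Sketch: the model is updated incrementally on every micro-batch.
def train(trainingStream: DStream[LabeledPoint], numFeatures: Int): Unit = {
  val model = new StreamingLogisticRegressionWithSGD()
    .setInitialWeights(Vectors.zeros(numFeatures))
  model.trainOn(trainingStream)
}
```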

Executor still on the UI even if the worker is dead

2016-04-22 Thread kundan kumar
Hi guys, has anyone faced this issue with Spark? Why does it happen in Spark Streaming that the executors are still shown on the UI even when the worker is killed and no longer in the cluster? This severely impacts my running jobs, which take too long, with the stages failing with the exception

Executor still on the UI even if the worker is dead

2016-04-20 Thread kundan kumar
Hi TD/Cody, why does it happen in Spark Streaming that the executors are still shown on the UI even when the worker is killed and no longer in the cluster? This severely impacts my running jobs, which take too long, with the stages failing with the exception java.io.IOException: Failed to connect

Re: Getting kafka offsets at beginning of spark streaming application

2016-01-11 Thread kundan kumar
Hi Cody, my use case is as follows: my application dies at time X and I write the offsets to a DB. Now my application starts at time Y (a few minutes later) and Spark Streaming reads the latest offsets using the createDirectStream method. Here I want to get the exact offset that
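
A minimal sketch of the usual offset-tracking pattern with the Kafka 0.8 direct stream API, assuming `stream` is the DStream returned by createDirectStream; these are the offsets one would persist to the DB on shutdown and pass back as fromOffsets on restart.

```scala
import org.apache.spark.streaming.kafka.HasOffsetRanges

// Sketch: capture the exact offset range each batch consumed.
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} partition ${o.partition}: ${o.fromOffset} -> ${o.untilOffset}")
  }
}
```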

ReduceByKeyAndWindow does repartitioning twice on recovering from checkpoint

2015-11-15 Thread kundan kumar
Hi, I am using the Spark Streaming checkpointing mechanism and reading the data from Kafka. The window duration for my application is 2 hrs with a sliding interval of 15 minutes. So, my batches run at the following intervals: 09:45, 10:00, 10:15, 10:30, and so on. When my job is
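
A sketch of the windowing described above (2-hour window, 15-minute slide), assuming `pairs` is a DStream of (key, count) pairs; the second function is the inverse reduce used by the checkpointed, incremental form of the window.

```scala
import org.apache.spark.streaming.Minutes

// Sketch: incremental windowed counts with an inverse function.
val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Long, b: Long) => a + b,   // values entering the window
  (a: Long, b: Long) => a - b,   // values leaving the window
  Minutes(120),
  Minutes(15))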

Re: Kafka Offsets after application is restarted using Spark Streaming Checkpointing

2015-11-15 Thread kundan kumar
Sure Thanks !! On Sun, Nov 15, 2015 at 9:13 PM, Cody Koeninger <c...@koeninger.org> wrote: > Not sure on that, maybe someone else can chime in > > On Sat, Nov 14, 2015 at 4:51 AM, kundan kumar <iitr.kun...@gmail.com> > wrote: > >> Hi Cody , >> >> Tha

Re: Kafka Offsets after application is restarted using Spark Streaming Checkpointing

2015-11-14 Thread kundan kumar
not time of processing. On Fri, Nov 13, 2015 at 4:36 AM, kundan kumar <iitr.kun...@gmail.com> wrote: Hi, I am using the Spark Streaming checkpointing mechanism and reading the data from Kafka. The window duration for my application is 2 hrs

Kafka Offsets after application is restarted using Spark Streaming Checkpointing

2015-11-13 Thread kundan kumar
Hi, I am using the Spark Streaming checkpointing mechanism and reading the data from Kafka. The window duration for my application is 2 hrs with a sliding interval of 15 minutes. So, my batches run at the following intervals: 09:45, 10:00, 10:15, 10:30, and so on. Suppose my running batch dies at 09:55

Batch Recovering from Checkpoint is taking longer runtime than usual

2015-11-09 Thread kundan kumar
Hi, Below is my code snippet where I am using the checkpointing feature of Spark Streaming. The SPARK_DURATION that I am using is 5 minutes and the batch duration is 15 minutes. I am checkpointing the data at each SPARK_DURATION (5 minutes). When I kill the job and start the next batch, it takes
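
For context, a sketch of the standard checkpoint-recovery pattern such a job typically uses; the checkpoint path, app name, and batch duration below are placeholders, not the poster's values.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-app")
  val ssc = new StreamingContext(conf, Seconds(300))     // e.g. 5-minute batches
  ssc.checkpoint(checkpointDir)
  // ... DStream definitions go here ...
  ssc
}

// Rebuilds the context from the checkpoint directory if one exists.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
```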

org.apache.spark.shuffle.FetchFailedException: Failed to connect to ..... on worker failure

2015-10-28 Thread kundan kumar
Hi, I am running a Spark Streaming job. I was testing the fault tolerance by killing one of the workers using the kill -9 command. What I understand is that when I kill a worker, the job should not die and should resume execution. But I am getting the following error and my process is halted.

Re: org.apache.spark.shuffle.FetchFailedException

2015-08-25 Thread kundan kumar
I have set spark.sql.shuffle.partitions=1000, but it is still failing. On Tue, Aug 25, 2015 at 11:36 AM, Raghavendra Pandey raghavendra.pan...@gmail.com wrote: Did you try increasing sql partitions? On Tue, Aug 25, 2015 at 11:06 AM, kundan kumar iitr.kun...@gmail.com wrote: I am running

Equal Height and Depth Binning in Spark

2015-04-29 Thread kundan kumar
Hi, I am trying to implement equal-depth and equal-height binning methods in Spark. Any insights or existing code for this would be really helpful. Thanks, Kundan
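
A minimal sketch of equal-depth (equal-frequency) binning for a single numeric column, assuming the column is available as an RDD[Double]; this is one possible approach, not an existing library routine.

```scala
import org.apache.spark.rdd.RDD

// Sketch: sort, index by rank, and assign each value a bin so that every bin
// holds roughly the same number of records.
def equalDepthBins(values: RDD[Double], numBins: Int): RDD[(Double, Int)] = {
  val n = values.count()
  values.sortBy(identity).zipWithIndex().map { case (v, rank) =>
    (v, ((rank * numBins) / n).toInt)
  }
}
```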

Handling Big data for interactive BI tools

2015-03-26 Thread kundan kumar
Hi, I need to store terabytes of data which will be used by BI tools like QlikView. The queries can filter on any column. Currently, we are using Redshift for this purpose. I am trying to explore options other than Redshift. Is it possible to gain better performance in

Re: Handling Big data for interactive BI tools

2015-03-26 Thread kundan kumar
technology On 26 March 2015 at 11:27, kundan kumar iitr.kun...@gmail.com wrote: Hi, I need to store terabytes of data which will be used by BI tools like QlikView. The queries can filter on any column. Currently, we are using Redshift for this purpose. I am trying to explore

Re: Handling Big data for interactive BI tools

2015-03-26 Thread kundan kumar
. But the major challenge I faced there was that secondary indexing was not supported for the bulk loading process. Only the sequential loading process supported secondary indexes, which took a longer time. Any comments on this? On Thu, Mar 26, 2015 at 5:59 PM, kundan kumar iitr.kun...@gmail.com wrote: I

Re: Unable to run hive queries inside spark

2015-02-24 Thread kundan kumar
org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:file:/user/hive/warehouse/src is not a directory or unable to create one) Could you verify that you (the user you are running under) have the rights to create the necessary folders within HDFS? On Tue, Feb 24, 2015 at 9:06 PM kundan kumar

Unable to run hive queries inside spark

2015-02-24 Thread kundan kumar
Hi, I have placed my hive-site.xml inside spark/conf and I am trying to execute some Hive queries given in the documentation. Can you please suggest what I am doing wrong here? scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) hiveContext:
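
For reference, a minimal sketch of the Spark 1.x flow from the documentation, assuming hive-site.xml is picked up from spark/conf and the shell user can write to the Hive warehouse directory (the "not a directory or unable to create one" error quoted in the reply above usually points at warehouse permissions).

```scala
// Sketch of the documented example; "src" and the sample file follow the Spark docs.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
hiveContext.sql("SELECT COUNT(*) FROM src").collect().foreach(println)
```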

Unable to query hive tables from spark

2015-02-11 Thread kundan kumar
I want to create/access the Hive tables from Spark. I have placed the hive-site.xml inside the spark/conf directory. Even so, it creates a local metastore in the directory where I run the spark shell and exits with an error. I am getting this error when I try to create a new Hive table. Even

Error while querying hive table from spark shell

2015-02-10 Thread kundan kumar
Hi, I am getting the following error when I am trying to query a Hive table from the spark shell. I have placed my hive-site.xml in the spark/conf directory. Please suggest how to resolve this error. scala> sqlContext.sql("select count(*) from offers_new").collect().foreach(println) 15/02/11 01:48:01

Re: Spark Job running on localhost on yarn cluster

2015-02-05 Thread kundan kumar
mode. Regards, Kundan On Thu, Feb 5, 2015 at 12:49 PM, Felix C felixcheun...@hotmail.com wrote: Is YARN_CONF_DIR set? --- Original Message --- From: Aniket Bhatnagar aniket.bhatna...@gmail.com Sent: February 4, 2015 6:16 AM To: kundan kumar iitr.kun...@gmail.com, spark users user

Spark Job running on localhost on yarn cluster

2015-02-04 Thread kundan kumar
Hi, I am trying to execute my code on a YARN cluster. The command which I am using is: $SPARK_HOME/bin/spark-submit --class EDDApp target/scala-2.10/edd-application_2.10-1.0.jar --master yarn-cluster --num-executors 3 --driver-memory 6g --executor-memory 7g outpuPath But I can see that this

Writing RDD to a csv file

2015-02-03 Thread kundan kumar
I have an RDD which is of type org.apache.spark.rdd.RDD[(String, (Array[String], Option[Array[String]]))] I want to write it as a CSV file. Please suggest how this can be done. myrdd.map(line => (line._1 + "," + line._2._1.mkString(",") + "," + line._2._2.mkString(','))).saveAsTextFile("hdfs://...")
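
A sketch of the fix suggested in the reply below: unwrap the Option explicitly so a None becomes an empty field instead of failing on mkString. The output path stays elided as in the original post.

```scala
// Sketch: handle the Option[Array[String]] before joining the fields.
val csvLines = myrdd.map { case (key, (left, maybeRight)) =>
  val right = maybeRight.map(_.mkString(",")).getOrElse("")
  Seq(key, left.mkString(","), right).mkString(",")
}
csvLines.saveAsTextFile("hdfs://...")
```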

Re: Writing RDD to a csv file

2015-02-03 Thread kundan kumar
]] = ??? optArrStr.map(arr => arr.mkString(",")).getOrElse("") // empty string or whatever default value you have for this. kr, Gerard. On Tue, Feb 3, 2015 at 2:09 PM, kundan kumar iitr.kun...@gmail.com wrote: I have an RDD which is of type org.apache.spark.rdd.RDD[(String, (Array[String], Option

WARN NativeCodeLoader warning in spark shell

2015-01-30 Thread kundan kumar
Hi, Whenever I start the spark shell I get this warning: WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable What does this mean, and how can it impact the execution of my Spark jobs? Please suggest how I can fix

Percentile Calculation

2015-01-28 Thread kundan kumar
Is there any inbuilt function for calculating percentiles over a dataset? I want to calculate the percentiles for each column in my data. Regards, Kundan
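
One possible sketch: an exact p-th percentile of a single numeric column held as an RDD[Double], computed by sorting and picking the value at the corresponding rank. Run it once per column; for very large data an approximate method would be preferable.

```scala
import org.apache.spark.rdd.RDD

// Sketch: exact percentile by sorting and indexing.
def percentile(data: RDD[Double], p: Double): Double = {
  val indexed = data.sortBy(identity).zipWithIndex().map { case (v, i) => (i, v) }
  val n = indexed.count()
  val rank = math.min(n - 1, math.max(0L, math.ceil(p * n).toLong - 1))
  indexed.lookup(rank).head
}
```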

Index-wise most frequently occurring element

2015-01-27 Thread kundan kumar
I have an array of the form val array: Array[(Int, (String, Int))] = Array( (idx1,(word1,count1)), (idx2,(word2,count2)), (idx1,(word1,count1)), (idx3,(word3,count1)), (idx4,(word4,count4))) I want to get the top 10 and bottom 10 elements from this array for each index
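
A minimal sketch of one way to do this, assuming the array above is parallelized into an RDD of (index, (word, count)) pairs via an existing SparkContext `sc`:

```scala
// Sketch: group by index, then sort each group's (word, count) pairs by count.
val byIndex = sc.parallelize(array).groupByKey()
val top10    = byIndex.mapValues(_.toSeq.sortBy(-_._2).take(10))
val bottom10 = byIndex.mapValues(_.toSeq.sortBy(_._2).take(10))
```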

foreachActive functionality

2015-01-25 Thread kundan kumar
Can someone help me understand the usage of the foreachActive function introduced for Vectors? I am trying to understand its usage in the MultivariateOnlineSummarizer class for summary statistics. sample.foreachActive { (index, value) => if (value != 0.0) { if (currMax(index)
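
A small illustrative sketch (the vector values are made up): foreachActive visits only the stored (index, value) entries, so for a sparse vector the zeros that were never stored are skipped entirely, which is why the summarizer can update its statistics cheaply.

```scala
import org.apache.spark.mllib.linalg.Vectors

// Sketch: only indices 1 and 3 are stored, so only they are visited.
val v = Vectors.sparse(5, Array(1, 3), Array(2.0, 4.0))
v.foreachActive { (index, value) =>
  println(s"active entry: index=$index value=$value")
}
```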

summary for all columns (numeric, strings) in a dataset

2015-01-24 Thread kundan kumar
Hi, Is there something like the summary function in Spark, like the one in R? The summary calculation which comes with Spark (MultivariateStatisticalSummary) operates only on numeric types. I am interested in getting the results for string types also, like the first four most frequently occurring strings (groupby
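
A sketch of the "most frequent strings" part using the (newer) DataFrame API; the DataFrame `df` and the column name "city" are placeholders, and the same group-and-count can be repeated per string column.

```scala
import org.apache.spark.sql.functions.desc

// Sketch: top 4 most frequent values of one string column.
df.groupBy("city").count().orderBy(desc("count")).show(4)
```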