Re: Extracting p values in Logistic regression using mllib scala

2016-01-24 Thread Yanbo Liang
Hi Chandan, MLlib only supports getting p-values and t-values from the Linear Regression model; other models such as the Logistic Regression model are not supported currently. This feature is under development and will be released in the next version (Spark 2.0). Thanks Yanbo 2016-01-18 16:45 GMT+08:00 Chandan Verma
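
For the supported case, a minimal Scala sketch, assuming Spark 1.6's spark.ml API and a training DataFrame named training (hypothetical); the summary statistics require the normal-equation solver:

    import org.apache.spark.ml.regression.LinearRegression

    // p-values and t-values are exposed only by the linear model's
    // training summary, and only when the "normal" solver is used.
    val lr = new LinearRegression().setSolver("normal")
    val model = lr.fit(training)
    println(model.summary.pValues.mkString(", "))
    println(model.summary.tValues.mkString(", "))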

how to save Matrix type result to hdfs file using java

2016-01-24 Thread zhangjp
Hi all, I have calculated a covariance; it's a Matrix type. Now I want to save the result to HDFS. How can I do it? thx

Re: [Streaming-Kafka] How to start from topic offset when streamcontext is using checkpoint

2016-01-24 Thread Raju Bairishetti
Hi Yash, Basically, my question is how to avoid storing the Kafka offsets in the Spark checkpoint directory. The streaming context is getting built from the checkpoint directory and proceeding with the offsets in the checkpointed RDD. I want to consume data from Kafka from specific offsets along with the
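
For a fresh start (no checkpoint), the direct Kafka stream accepts explicit starting offsets; a sketch assuming Spark 1.x's spark-streaming-kafka API, with ssc and kafkaParams in scope and the topic, partition, and offset as placeholders:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Start each partition from an explicitly chosen offset.
    val fromOffsets = Map(TopicAndPartition("mytopic", 0) -> 12345L)
    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

Note this only takes effect when the StreamingContext is built fresh; when it is recovered from a checkpoint, the checkpointed offsets win, which is exactly the behavior being asked about.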

high CPU usage for acceptor and qtp threads

2016-01-24 Thread alberto.scolari
Hi everybody, since I am new to Spark, I am familiarizing myself with it by writing CPU-intensive applications like k-means and kNN. However, I observe some threads other than the worker threads using a lot of CPU. In particular, in jvisualvm, I observe the Acceptor and qtp threads to show such behavior,

Trouble dropping columns from a DataFrame that has other columns with dots in their names

2016-01-24 Thread JoshuaTaylor
(Apologies if this has arrived more than once. I've subscribed to the list, and tried posting via email with no success. This is an attempt through the Nabble interface.) I've been having lots of trouble with DataFrames whose columns have dots in their names today. I know that in many places,

Re: How to query data in tachyon with spark-sql

2016-01-24 Thread Gene Pang
Hi, You should be able to point Hive to Tachyon instead of HDFS, and that should allow Hive to access data in Tachyon. If Spark SQL was pointing to an HDFS file, you could instead point it to a Tachyon file, and that should work too. Hope that helps, Gene On Wed, Jan 20, 2016 at 2:06 AM, Sea
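
A sketch of the Spark SQL side of that suggestion, assuming a Parquet file already stored in Tachyon; host, port, and path are placeholders:

    // Tachyon is addressed through its own filesystem scheme.
    val df = sqlContext.read.parquet("tachyon://master:19998/data/table.parquet")
    df.registerTempTable("t")
    sqlContext.sql("SELECT COUNT(*) FROM t").show()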

Group by Dynamically

2016-01-24 Thread Divya Gehlot
Hi, I have two files. File1 holds the group-by conditions (Field1: Y, Field2: N, Field3: Y). File2 is the data file having field1, field2, field3, etc.:

field1 field2 field3 field4 field5
data1  data2  data3  data4  data5
data11 data22 data33 data44 data55

Now my requirement is to group
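
One way to express this, sketched in Scala under the assumption that the flags from File1 have already been read into a list of column names (groupCols and dataDF below are hypothetical):

    import org.apache.spark.sql.functions.col

    // Only the fields flagged "Y" in File1 participate in the grouping.
    val groupCols = Seq("field1", "field3")
    val grouped = dataDF.groupBy(groupCols.map(col): _*).count()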

Re: Spark Cassandra clusters

2016-01-24 Thread vivek.meghanathan
Thanks Mohammed and Ted. I will try out the options and let you all know the progress. I had also posted in the Spark Cassandra connector community and got a similar response. Regards Vivek On Sat, Jan 23, 2016 at 11:37 am, Mohammed Guller > wrote:

Re: has any one implemented TF_IDF using ML transformers?

2016-01-24 Thread Yanbo Liang
Hi Andy, I will take a look at your code after you share it. Thanks! Yanbo 2016-01-23 0:18 GMT+08:00 Andy Davidson : > Hi Yanbo > > I recently coded up the trivial example from >
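
For reference, a minimal TF-IDF pipeline with ML transformers, assuming a DataFrame df with a string column named text:

    import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val words = tokenizer.transform(df)

    // Hash terms into a fixed-size feature vector, then reweight by IDF.
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
    val featurized = hashingTF.transform(words)

    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
    val tfidf = idf.fit(featurized).transform(featurized)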

Re: Clarification on Data Frames joins

2016-01-24 Thread Madabhattula Rajesh Kumar
Hi, Any suggestions on this approach? Regards, Rajesh On Sat, Jan 23, 2016 at 11:24 PM, Madabhattula Rajesh Kumar < mrajaf...@gmail.com> wrote: > Hi, > > I have a big database table (1 million plus records) in Oracle. I need to > query records based on input numbers. For this use case, I am
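
A sketch of the approach being discussed, with the JDBC URL, table, and credentials as placeholders and inputNumbers a hypothetical Seq of the lookup keys:

    // Load the Oracle table through JDBC and join it against the input
    // numbers, so only matching records survive the join.
    val oracleDF = sqlContext.read.format("jdbc").options(Map(
      "url"      -> "jdbc:oracle:thin:@//host:1521/service",
      "dbtable"  -> "MY_TABLE",
      "user"     -> "user",
      "password" -> "pass")).load()

    val inputDF = sqlContext.createDataFrame(inputNumbers.map(Tuple1(_))).toDF("id")
    val result  = oracleDF.join(inputDF, "id")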

Re: Spark SQL . How to enlarge output rows ?

2016-01-24 Thread Eli Super
Unfortunately I am still getting an error when using .show() with `false` or `False` or `FALSE`: Py4JError: An error occurred while calling o153.showString. Trace: py4j.Py4JException: Method showString([class java.lang.String, class java.lang.Boolean]) does not exist at
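
The trace shows showString being handed a java.lang.String where the JVM method expects an int, so the row count is most likely being passed as a string. In the Scala API the call is:

    // numRows is an Int, truncate a Boolean.
    df.show(100, false)

The PySpark equivalent is df.show(100, False), keeping the first argument an integer; the truncate flag itself only exists in newer 1.x releases, so a version mismatch is another possibility.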

Re: can we create dummy variables from categorical variables, using sparkR

2016-01-24 Thread Yanbo Liang
Hi Devesh, RFormula will encode categorical variables (columns of string type) as dummy variables automatically. You do not need to do the dummy transform explicitly if you want to train a machine learning model using SparkR, although SparkR only supports a limited set of ML algorithms (GLM) currently. Thanks
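
A sketch of the same behavior from the Scala side (column names hypothetical); SparkR's glm() goes through RFormula under the hood in the same way:

    import org.apache.spark.ml.feature.RFormula

    // String (categorical) columns on the right-hand side are one-hot
    // encoded into dummy variables automatically.
    val formula = new RFormula().setFormula("label ~ city + income")
    val encoded = formula.fit(df).transform(df)  // adds "features" and "label"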

Re: how to save Matrix type result to hdfs file using java

2016-01-24 Thread Yanbo Liang
A Matrix can be saved as a column of type MatrixUDT.
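
A minimal sketch of that suggestion, assuming cov is the computed mllib Matrix and the HDFS path is a placeholder:

    // mllib's Matrix carries a MatrixUDT, so it can sit in a DataFrame
    // column and be written out like any other column.
    val df = sqlContext.createDataFrame(Seq(Tuple1(cov))).toDF("cov")
    df.write.parquet("hdfs:///tmp/covariance")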

Worker's BlockManager Folder not getting cleared

2016-01-24 Thread Abhishek Anand
Hi All, How long are the shuffle files and data files stored in the block manager folder of the workers? I have a Spark Streaming job with a window duration of 2 hours and a slide interval of 15 minutes. When I execute the following command in my block manager path: find . -type f -cmin +150 -name

Re: how to save Matrix type result to hdfs file using java

2016-01-24 Thread zhangjp
Hi Yanbo, I'm using the Java language and the environment is Spark 1.4.1. Can you tell me how to do it in more detail? The following is my code; how can I save the cov to an HDFS file? " RowMatrix mat = new RowMatrix(rows.rdd()); Matrix cov = mat.computeCovariance(); "
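
If a plain text file is acceptable, one sketch (shown in Scala; the Java version is analogous) is to flatten the local Matrix into lines and let Spark write them out. The HDFS path is a placeholder:

    // One comma-separated line per matrix row.
    val lines = (0 until cov.numRows).map { i =>
      (0 until cov.numCols).map(j => cov(i, j)).mkString(",")
    }
    sc.parallelize(lines, 1).saveAsTextFile("hdfs:///tmp/covariance-txt")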

NA value handling in sparkR

2016-01-24 Thread Devesh Raj Singh
Hi, I have applied the following code on the airquality dataset available in R, which has some missing values. I want to omit the rows which have NAs. library(SparkR) Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"') sc <-
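
Recent SparkR releases expose dropna() for exactly this; for comparison, the same operation from the Scala side:

    // Drop every row that contains a null/NA in any column,
    // the Scala equivalent of R's na.omit.
    val cleaned = df.na.drop()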

Re: How to efficiently Scan (not filter nor lookup) part of Paird RDD or Ordered RDD

2016-01-24 Thread Ilya Ganelin
The solution I normally use is to zipWithIndex() and then use the filter operation. Filter is an O(m) operation where m is the size of your partition, not an O(N) operation. -Ilya Ganelin On Sat, Jan 23, 2016 at 5:48 AM, Nirav Patel wrote: > Problem is I have RDD of
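
A sketch of that pattern, with start and end as placeholder bounds:

    // zipWithIndex assigns a stable position to each element (one extra
    // pass over the data), after which the scan is a cheap filter.
    val indexed = rdd.zipWithIndex()
    val slice = indexed
      .filter { case (_, i) => i >= start && i < end }
      .map { case (v, _) => v }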

Re: How to efficiently Scan (not filter nor lookup) part of Paird RDD or Ordered RDD

2016-01-24 Thread Sonal Goyal
One thing you can also look at is to save your data in a way that can be accessed through file patterns, e.g. by hour, zone, etc., so that you only load what you need. On Jan 24, 2016 10:00 PM, "Ilya Ganelin" wrote: > The solution I normally use is to zipWithIndex() and then use
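
A sketch of that layout with DataFrames; paths and column names are placeholders:

    // Partition on write so reads can prune whole directories.
    df.write.partitionBy("hour", "zone").parquet("hdfs:///data/events")

    // Only the matching partition directories are scanned.
    val subset = sqlContext.read.parquet("hdfs:///data/events")
      .where("hour = 10 AND zone = 'us-east'")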

Re: 10hrs of Scheduler Delay

2016-01-24 Thread Sanders, Isaac B
I am not getting anywhere with any of the suggestions so far. :( Trying some more outlets; I will share any solution I find. - Isaac On Jan 23, 2016, at 1:48 AM, Renu Yadav > wrote: If you turn spark.speculation on then that might help. It worked
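
For reference, the speculation suggestion from the quoted reply as a configuration sketch:

    // Re-launch suspiciously slow tasks on other executors.
    val conf = new org.apache.spark.SparkConf().set("spark.speculation", "true")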

Spark master takes more time with local[8] than local[1]

2016-01-24 Thread jimitkr
Hi All, I have a machine with the following configuration: 32 GB RAM, 500 GB HDD, 8 CPUs. Following are the parameters I'm starting my Spark context with: val conf = new SparkConf().setAppName("MasterApp").setMaster("local[1]").set("spark.executor.memory", "20g") I'm reading a 4.3 GB file and
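
A sketch of the two settings that matter here (note the code above sets local[1] while the subject says local[8]; presumably both were tried). In local mode the executor runs inside the driver JVM, so spark.executor.memory is not what governs the heap:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("MasterApp")
      .setMaster("local[8]") // 8 worker threads in a single JVM
    // In local mode, size the driver JVM instead, e.g. on submission:
    //   spark-submit --driver-memory 20g ...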

Re: Spark master takes more time with local[8] than local[1]

2016-01-24 Thread Ted Yu
bq. I'm reading a 4.3 GB file The contents of the file can be held in one executor. Can you try files of much larger size? Cheers On Sun, Jan 24, 2016 at 12:11 PM, jimitkr wrote: > Hi All, > > I have a machine with the following configuration: > 32 GB RAM > 500 GB HDD

Trouble dropping columns from a DataFrame that has other columns with dots in their names

2016-01-24 Thread Joshua TAYLOR
I've been having lots of trouble with DataFrames whose columns have dots in their names today. I know that in many places, backticks can be used to quote column names, but the problem I'm running into now is that I can't drop a column that has *no* dots in its name when there are *other* columns
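
A workaround sketch: rebuild the projection yourself, backtick-quoting every surviving name so dots are treated as literal characters rather than as struct-field access (colToDrop is a placeholder):

    import org.apache.spark.sql.functions.col

    val keep = df.columns.filter(_ != "colToDrop").map(c => col(s"`$c`"))
    val result = df.select(keep: _*)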