Re: How to read a Json file with a specific format?

2015-07-29 Thread mélanie gallois
Can you give an example with my extract?

Mélanie Gallois

2015-07-29 16:55 GMT+02:00 Young, Matthew T :

> The built-in Spark JSON support cannot read a normal JSON array. The
> format it expects is newline-delimited JSON: individual JSON objects with
> no outer array syntax, one complete JSON object per line of the input file.
>
> AFAIK your options are to read the JSON in the driver and parallelize it
> out to the workers or to fix your input file to match the spec.
>
> For one-off conversions I usually use a combination of jq and
> regex-replaces to get the source file in the right format.
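>
> A rough sketch of that first option in Scala, assuming the file fits in
> driver memory and using json4s (on the Spark classpath) to split the outer
> array into one object per record ("namefile.json" stands in for your path):
>
> import org.json4s._
> import org.json4s.jackson.JsonMethods._
>
> // Read the whole array-wrapped file on the driver.
> val raw = scala.io.Source.fromFile("namefile.json").mkString
>
> // Parse the outer array and re-serialize each element as one compact,
> // single-line JSON object.
> val records = parse(raw) match {
>   case JArray(elems) => elems.map(e => compact(render(e)))
>   case other         => Seq(compact(render(other)))
> }
>
> // Parallelize the per-record strings and let Spark infer the schema.
> val df = sqlContext.read.json(sc.parallelize(records))
> df.printSchema()   // IFAM, KTM, COL and DATA should now be recognized
> df.show()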
>
> 
> From: SparknewUser [melanie.galloi...@gmail.com]
> Sent: Wednesday, July 29, 2015 6:37 AM
> To: user@spark.apache.org
> Subject: How to read a Json file with a specific format?
>
> I'm trying to read a Json file which is like :
> [
>
> {"IFAM":"EQR","KTM":143000640,"COL":21,"DATA":[{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ]}
>
> ,{"IFAM":"EQR","KTM":143000640,"COL":22,"DATA":[{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ]}
> ]
>
> I've tried the command:
> val df = sqlContext.read.json("namefile")
> df.show()
>
>
> But this does not work: my columns are not recognized.
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-a-Json-file-with-a-specific-format-tp24061.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
*Mélanie*


Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?

2015-05-29 Thread mélanie gallois
When will Spark 1.4 be available exactly?
In response to "Model selection can be achieved through a high lambda,
resulting in lots of zeros in the coefficients": do you mean that setting a
high lambda as a parameter of the logistic regression keeps only a few
significant variables and "deletes" the others by driving their coefficients
to zero? What counts as a high lambda for you?
Is lambda a parameter available only in Spark 1.4, or can I already use it in
Spark 1.3?

2015-05-23 0:04 GMT+02:00 Joseph Bradley :

> If you want to select specific variable combinations by hand, then you
> will need to modify the dataset before passing it to the ML algorithm.  The
> DataFrame API should make that easy to do.
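>
> For example (just a sketch, with made-up column names), selecting a
> variable combination by hand and converting it to the RDD[LabeledPoint]
> form that LogisticRegressionWithLBFGS/SGD expect could look like this:
>
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.mllib.regression.LabeledPoint
>
> // Toy DataFrame standing in for your real data.
> val data = sqlContext.createDataFrame(Seq(
>   (1.0, 0.2, 3.1, 0.7),
>   (0.0, 1.5, 0.4, 2.2)
> )).toDF("label", "x1", "x2", "x3")
>
> // Keep only the label and the explanatory variables for this combination.
> val subset = data.select("label", "x1", "x3")
>
> // Convert each row to a LabeledPoint for the MLlib classifiers.
> val points = subset.map { row =>
>   LabeledPoint(row.getDouble(0), Vectors.dense(row.getDouble(1), row.getDouble(2)))
> }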
>
> If you want to have an ML algorithm select variables automatically, then I
> would recommend using L1 regularization for now and possibly elastic net
> after 1.4 is released, per DB's suggestion.
>
> If you want detailed model statistics similar to what R provides, I've
> created a JIRA for discussing how we should add that functionality to
> MLlib.  Those types of stats will be added incrementally, but feedback
> would be great for prioritization:
> https://issues.apache.org/jira/browse/SPARK-7674
>
> To answer your question: "How are the weights calculated: is there a
> correlation calculation with the variable of interest?"
> --> Weights are calculated as with all logistic regression algorithms, by
> using convex optimization to minimize a regularized log loss.
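>
> (Concretely, the weights w are chosen to minimize something of the form
>   (1/n) * sum_i log(1 + exp(-y_i * w.x_i)) + lambda * R(w),
> where y_i is the +1/-1 label of example i, w.x_i is the dot product of the
> weights with its features, and R(w) is the regularization term, so there is
> no per-variable correlation step involved.)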
>
> Good luck!
> Joseph
>
> On Fri, May 22, 2015 at 1:07 PM, DB Tsai  wrote:
>
>> In Spark 1.4, logistic regression with elastic net is implemented in the ML
>> pipeline framework. Model selection can be achieved through a high lambda,
>> resulting in lots of zeros in the coefficients.
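>>
>> As a sketch of what that looks like with the new API (toy data; the
>> regParam and elasticNetParam values are arbitrary, not recommendations):
>>
>> import org.apache.spark.ml.classification.LogisticRegression
>> import org.apache.spark.mllib.linalg.Vectors
>>
>> // Toy label/features DataFrame just so the example runs end to end.
>> val training = sqlContext.createDataFrame(Seq(
>>   (1.0, Vectors.dense(0.0, 1.1, 0.1)),
>>   (0.0, Vectors.dense(2.0, 1.0, -1.0)),
>>   (0.0, Vectors.dense(2.0, 1.3, 1.0)),
>>   (1.0, Vectors.dense(0.0, 1.2, -0.5))
>> )).toDF("label", "features")
>>
>> // elasticNetParam = 1.0 is pure L1 (lasso); raising regParam (the lambda)
>> // pushes more coefficients to exactly zero, i.e. drops more variables.
>> val lr = new LogisticRegression()
>>   .setMaxIter(100)
>>   .setRegParam(0.3)
>>   .setElasticNetParam(1.0)
>>
>> val model = lr.fit(training)
>> println(model.weights)  // zero entries are the variables the model dropped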
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> Blog: https://www.dbtsai.com
>>
>>
>> On Fri, May 22, 2015 at 1:19 AM, SparknewUser
>>  wrote:
>> > I am new to MLlib and to Spark (I use Scala).
>> >
>> > I'm trying to understand how LogisticRegressionWithLBFGS and
>> > LogisticRegressionWithSGD work.
>> > I usually use R to do logistic regressions, but now I do it in Spark to
>> > be able to analyze big data.
>> >
>> > The model only returns weights and an intercept. My problem is that I
>> > have no information about which variables are significant and which
>> > variables I should delete to improve my model. I only have the confusion
>> > matrix and the AUC to evaluate the performance.
>> >
>> > Is there any way to get information about the variables I put in my
>> > model? How can I try different variable combinations: do I have to
>> > modify the dataset of origin (e.g. delete one or several columns)?
>> > How are the weights calculated: is there a correlation calculation with
>> > the variable of interest?
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-how-to-get-the-best-model-with-only-the-most-significant-explanatory-variables-in-LogisticRegr-tp22993.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


-- 
*Mélanie*


How to get the best performance with LogisticRegressionWithSGD?

2015-05-27 Thread mélanie gallois
I'm new to Spark and I'm getting bad performance with the classification
methods in Spark MLlib (a worse AUC than I get with R).
I am trying to set my own parameters rather than use the defaults.
Here is the method I want to use:

train(RDD<LabeledPoint> input,
      int numIterations,
      double stepSize,
      double miniBatchFraction,
      Vector initialWeights)

(RDD: https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/rdd/RDD.html,
LabeledPoint: https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/mllib/regression/LabeledPoint.html,
Vector: https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/mllib/linalg/Vector.html)
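
In Scala, the full call looks roughly like this (toy data and placeholder
parameter values only):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Tiny toy training set; replace with your own RDD[LabeledPoint].
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.0))
))

val numIterations = 200          // placeholder
val stepSize = 1.0               // placeholder (the default)
val miniBatchFraction = 1.0      // placeholder (the default)
val initialWeights = Vectors.zeros(training.first().features.size)

val model = LogisticRegressionWithSGD.train(
  training, numIterations, stepSize, miniBatchFraction, initialWeights)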

How should I choose "numIterations" and "stepSize"?
What does miniBatchFraction mean?
Is initialWeights necessary to get a good model? If so, how should I choose
them?


Regards,

Mélanie Gallois