Re: How to read a Json file with a specific format?
Can you give an example with my extract?

Mélanie Gallois

2015-07-29 16:55 GMT+02:00 Young, Matthew T:

> The built-in Spark JSON functionality cannot read normal JSON arrays. The
> format it expects is a series of individual JSON objects, without any outer
> array syntax, with one complete JSON object per line of the input file.
>
> AFAIK your options are to read the JSON in the driver and parallelize it
> out to the workers, or to fix your input file to match the spec.
>
> For one-off conversions I usually use a combination of jq and
> regex replaces to get the source file into the right format.
>
> From: SparknewUser [melanie.galloi...@gmail.com]
> Sent: Wednesday, July 29, 2015 6:37 AM
> To: user@spark.apache.org
> Subject: How to read a Json file with a specific format?
>
> I'm trying to read a JSON file that looks like this:
>
> [
> {"IFAM":"EQR","KTM":143000640,"COL":21,"DATA":[{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ]}
> ,{"IFAM":"EQR","KTM":143000640,"COL":22,"DATA":[{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ,{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}
> ]}
> ]
>
> I've tried:
>
> val df = sqlContext.read.json("namefile")
> df.show()
>
> But this does not work: my columns are not recognized.

--
Mélanie
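A minimal sketch of the driver-side approach described above, assuming Spark 1.4's DataFrameReader (on 1.3, use sqlContext.jsonRDD instead) and that the whole array fits comfortably in driver memory. The file name "namefile" comes from the question; everything else is illustrative. It uses json4s, which ships with Spark, to turn the outer array into one compact JSON object per line:

import org.json4s._
import org.json4s.jackson.JsonMethods._

// Read the entire file in the driver (fine for small, one-off inputs).
val raw = scala.io.Source.fromFile("namefile").mkString

// Parse the outer JSON array and re-serialize each element as one compact line.
val lines: Seq[String] = parse(raw) match {
  case JArray(items) => items.map(item => compact(render(item)))
  case single        => Seq(compact(render(single)))
}

// Ship the per-line JSON strings to the workers and let Spark infer the schema.
val df = sqlContext.read.json(sc.parallelize(lines))
df.printSchema()
df.show()

For the jq route, jq -c '.[]' namefile does the same conversion on the command line, printing one compact object per line; its output can then be read directly with sqlContext.read.json. With the sample above, schema inference should produce columns COL, DATA, IFAM and KTM, with DATA as an array of structs.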
Re: MLlib: how to get the best model with only the most significant explanatory variables in LogisticRegressionWithLBFGS or LogisticRegressionWithSGD ?
When will Spark 1.4 be available, exactly?

To answer "Model selection can be achieved through high lambda resulting in lots of zeros in the coefficients": do you mean that setting a high lambda as a parameter of the logistic regression keeps only a few significant variables and "deletes" the others by giving them zero coefficients? What is a high lambda for you?

Is lambda a parameter available only in Spark 1.4, or can I use it in Spark 1.3?

2015-05-23 0:04 GMT+02:00 Joseph Bradley:

> If you want to select specific variable combinations by hand, then you
> will need to modify the dataset before passing it to the ML algorithm. The
> DataFrame API should make that easy to do.
>
> If you want an ML algorithm to select variables automatically, then I
> would recommend using L1 regularization for now, and possibly elastic net
> after 1.4 is released, per DB's suggestion.
>
> If you want detailed model statistics similar to what R provides, I've
> created a JIRA for discussing how we should add that functionality to
> MLlib. Those types of stats will be added incrementally, but feedback
> would be great for prioritization:
> https://issues.apache.org/jira/browse/SPARK-7674
>
> To answer your question: "How are the weights calculated: is there a
> correlation calculation with the variable of interest?"
> --> Weights are calculated as with all logistic regression algorithms, by
> using convex optimization to minimize a regularized log loss.
>
> Good luck!
> Joseph
>
> On Fri, May 22, 2015 at 1:07 PM, DB Tsai wrote:
>
>> In Spark 1.4, logistic regression with elastic net is implemented in the
>> ML pipeline framework. Model selection can be achieved through a high
>> lambda resulting in lots of zeros in the coefficients.
>>
>> Sincerely,
>>
>> DB Tsai
>> Blog: https://www.dbtsai.com
>>
>> On Fri, May 22, 2015 at 1:19 AM, SparknewUser wrote:
>>
>> > I am new to MLlib and to Spark (I use Scala).
>> >
>> > I'm trying to understand how LogisticRegressionWithLBFGS and
>> > LogisticRegressionWithSGD work. I usually use R for logistic
>> > regressions, but now I do it in Spark to be able to analyze big data.
>> >
>> > The model only returns weights and an intercept. My problem is that I
>> > have no information about which variables are significant and which
>> > variables I should delete to improve my model. I only have the
>> > confusion matrix and the AUC to evaluate the performance.
>> >
>> > Is there any way to get information about the variables I put in my
>> > model? How can I try different variable combinations: do I have to
>> > modify the original dataset (e.g. delete one or several columns)?
>> > How are the weights calculated: is there a correlation calculation
>> > with the variable of interest?

--
Mélanie
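For concreteness, here is a minimal sketch of the Spark 1.4 ML pipeline API that DB mentions; the toy data and parameter values are illustrative, not recommendations from this thread. regParam is the lambda in question, and with elasticNetParam = 1.0 the penalty is pure L1 (lasso):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import sqlContext.implicits._ // in spark-shell, where sqlContext and sc exist

// Tiny illustrative dataset; toDF() yields "label" and "features" columns.
val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1, 0.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0, -1.0)),
  LabeledPoint(1.0, Vectors.dense(0.0, 1.3, 1.0)))).toDF()

val lr = new LogisticRegression()
  .setMaxIter(100)
  .setRegParam(0.3)        // lambda: overall regularization strength
  .setElasticNetParam(1.0) // 1.0 = pure L1 (lasso), 0.0 = pure L2 (ridge)

val model = lr.fit(training)

// Under L1, a sufficiently high lambda drives many coefficients to exactly
// zero; the variables with non-zero weights are the ones the model kept.
println(s"Weights: ${model.weights}  Intercept: ${model.intercept}")

There is no universal "high" lambda: it depends on the scale of the features and on the data, so in practice you sweep regParam (e.g. with cross-validation) and watch how many coefficients stay non-zero. In Spark 1.3 you can already get L1 through spark.mllib by setting an L1Updater and a regParam on the optimizer of LogisticRegressionWithSGD, but elastic net arrives with 1.4.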
How to get the best performance with LogisticRegressionWithSGD?
I'm new to Spark and I'm getting poor performance with the classification methods in Spark MLlib (worse than R in terms of AUC). I am trying to set my own parameters rather than using the defaults. Here is the method I want to use:

train(RDD<LabeledPoint> input, int numIterations, double stepSize, double miniBatchFraction, Vector initialWeights)

(API docs:
https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/rdd/RDD.html
https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/mllib/regression/LabeledPoint.html
https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/mllib/linalg/Vector.html)

How should I choose "numIterations" and "stepSize"? What does "miniBatchFraction" mean? Is "initialWeights" necessary to get a good model, and if so, how should I choose it?

Regards,

Mélanie Gallois
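Not a definitive recipe, but a sketch of how those parameters plug into the API, with a toy dataset and commonly used starting values (all numbers here are assumptions, not advice from this list):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Tiny illustrative dataset; in practice this is your real RDD[LabeledPoint].
val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(0.1, 1.3)))).cache()

val numIterations = 200     // more iterations -> closer to convergence, but slower
val stepSize = 1.0          // SGD learning rate: too large diverges, too small crawls
val miniBatchFraction = 1.0 // fraction of data sampled per iteration (1.0 = full batch)

// Zero weights are the usual starting point; initialWeights mainly matters
// for warm starts from a previously trained model.
val initialWeights = Vectors.zeros(training.first().features.size)

val model = LogisticRegressionWithSGD.train(
  training, numIterations, stepSize, miniBatchFraction, initialWeights)

miniBatchFraction controls the stochastic part: at 1.0 every iteration uses the full dataset (plain gradient descent), while smaller values trade gradient accuracy for cheaper iterations. There is no universal numIterations/stepSize pair; grid-search them and compare AUC on a held-out set. It is also worth trying LogisticRegressionWithLBFGS, which typically converges in far fewer iterations, has no stepSize to tune, and often comes much closer to R's results.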