Re: Classic logistic regression missing !!! (Generalized linear models)

2018-10-11 Thread Stephen Boesch
So the LogisticRegression with regParam and elasticNetParam set to 0 is not what you are looking for? https://spark.apache.org/docs/2.3.0/ml-classification-regression.html#logistic-regression .setRegParam(0.0) .setElasticNetParam(0.0) On Thu, 11 Oct 2018 at 15:46, pikufolgado wrote:
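
Put together, a minimal sketch of an unpenalized fit; the DataFrame name "training" and its default "label"/"features" columns are illustrative assumptions:

    import org.apache.spark.ml.classification.LogisticRegression

    // regParam = 0.0 disables the penalty entirely, so this is a plain
    // maximum-likelihood logistic regression; with no penalty term,
    // elasticNetParam has no effect.
    val lr = new LogisticRegression()
      .setRegParam(0.0)
      .setElasticNetParam(0.0)
    val model = lr.fit(training)  // training: DataFrame with label/features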

Classic logistic regression missing !!! (Generalized linear models)

2018-10-11 Thread pikufolgado
Hi, I would like to carry out a classic logistic regression analysis. In other words, without using penalised regression ("glmnet" in R). I have read the documentation and am not able to find this kind of model. Is it possible to estimate this? In R the name of the function is "glm". Best
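
For reference, the closest analogue to R's glm in Spark ML is GeneralizedLinearRegression with a binomial family. A minimal sketch, assuming a DataFrame "df" with "label" and "features" columns (names are illustrative):

    import org.apache.spark.ml.regression.GeneralizedLinearRegression

    // Binomial family with logit link, comparable to
    // glm(..., family = binomial) in R.
    val glr = new GeneralizedLinearRegression()
      .setFamily("binomial")
      .setLink("logit")
    val model = glr.fit(df)
    model.summary  // deviance, AIC, coefficient standard errors, p-values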

[Spark Structured Streaming] Running out of disk quota due to /work/tmp

2018-10-11 Thread subramgr
We have a Spark Structured Streaming job which runs out of disk quota after some days. The primary reason is that a bunch of empty folders are being created in the /work/tmp directory. Any idea how to prune them? -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
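
One workaround (a sketch of a periodic cleanup, not a fix for the root cause) is a job that deletes the empty subdirectories; the path and the use of Hadoop's FileSystem API here are assumptions:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical cleanup: delete empty subdirectories under /work/tmp.
    // Assumes the directory is reachable through Hadoop's FileSystem API
    // (use a "file:///..." URI if it is local disk on each worker).
    val fs = FileSystem.get(new Configuration())
    val tmp = new Path("/work/tmp")
    fs.listStatus(tmp)
      .filter(s => s.isDirectory && fs.listStatus(s.getPath).isEmpty)
      .foreach(s => fs.delete(s.getPath, false))  // non-recursive delete

If the job runs on the standalone cluster manager, spark.worker.cleanup.enabled and spark.worker.cleanup.appDataTtl may also help, though they target application work directories rather than arbitrary temp paths.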

re: yarn resource overcommit: cpu / vcores

2018-10-11 Thread Peter Liu
Hi there, is there any best practice guideline on yarn resource overcommit with cpu / vcores, such as yarn config options, candidate cases ideal for overcommitting vcores, etc.? The slide below (from 2016) seems to address the memory overcommit topic and hints at a "future" topic on cpu overcommit:
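
Not a full guideline, but the basic knob is to have the NodeManager advertise more vcores than the machine physically has. A hedged sketch (the value is illustrative):

    <!-- yarn-site.xml: advertise more vcores than the node's physical cores -->
    <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <!-- illustrative: 2x overcommit on a 16-core machine -->
      <value>32</value>
    </property>

Keep in mind that the CapacityScheduler's default DefaultResourceCalculator accounts for memory only, so vcore settings (and hence vcore overcommit) only take effect under the DominantResourceCalculator, and actual CPU isolation additionally requires cgroup enforcement on the NodeManagers.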

Re: Process Million Binary Files

2018-10-11 Thread Nicolas PARIS
Hi Joel I built such a pipeline to transform pdf -> text https://github.com/EDS-APHP/SparkPdfExtractor You can take a look. It transforms 20M pdfs in 2 hours on a 5-node spark cluster. On 2018-10-10 23:56, Joel D wrote: > Hi, > > I need to process millions of PDFs in hdfs using spark. First I’m
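
The general shape of such a pipeline (not the linked project's actual code, just a sketch using sc.binaryFiles and Apache PDFBox; paths are illustrative):

    import org.apache.pdfbox.pdmodel.PDDocument
    import org.apache.pdfbox.text.PDFTextStripper

    // Read each PDF as (path, bytes) and extract its text with PDFBox.
    val texts = sc.binaryFiles("hdfs:///data/pdfs")
      .map { case (path, stream) =>
        val doc = PDDocument.load(stream.toArray)
        try { (path, new PDFTextStripper().getText(doc)) }
        finally { doc.close() }
      }
    texts.saveAsTextFile("hdfs:///data/pdf-text")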

Re: Process Million Binary Files

2018-10-11 Thread Jörn Franke
I believe your use case would be better covered by a custom data source that reads PDF files. On Big Data platforms in general you have the issue that individual PDF files are very small and there are a lot of them - this is not very efficient for those platforms. That could also be one source of your
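
One common mitigation for the small-files problem (a suggestion of mine, not something proposed in this thread) is to pack the PDFs once into a container format such as a Hadoop SequenceFile, so later jobs read a few large files instead of millions of tiny ones; paths are illustrative:

    // Pack many small PDFs into SequenceFiles of (path -> raw bytes).
    val packed = sc.binaryFiles("hdfs:///data/pdfs")
      .mapValues(_.toArray)  // PortableDataStream -> Array[Byte]
    packed.saveAsSequenceFile("hdfs:///data/pdfs-packed")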