Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Rudenko
/ShortCircuitLocalReads.html) to use a unix socket for local communication, or just directly read a part from another JVM's shuffle file. But yes, it's not available in spark out of the box. Thanks, Peter Rudenko Fri, 19 Oct 2018 at 16:54 Peter Liu wrote: > Hi Peter, > > thank you for the reply and detailed informati

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Rudenko
to either non-present pages or mapping changes. So if you have an RDMA capable NIC (or you can try on Azure cloud https://azure.microsoft.com/en-us/blog/introducing-the-new-hb-and-hc-azure-vm-sizes-for-hpc/ ), have a try. For network intensive apps you should get better performance. Thanks, Peter
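For reference, a minimal sketch of how a pluggable RDMA shuffle manager is usually wired in - the class name and jar paths below are assumptions based on the Mellanox SparkRDMA plugin, not something confirmed in this thread:

    // Sketch: point Spark at an external RDMA shuffle manager (class name assumed).
    val conf = new org.apache.spark.SparkConf()
      .set("spark.shuffle.manager", "org.apache.spark.shuffle.rdma.RdmaShuffleManager")
      .set("spark.driver.extraClassPath", "/path/to/spark-rdma.jar")     // hypothetical path
      .set("spark.executor.extraClassPath", "/path/to/spark-rdma.jar")   // hypothetical path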

Re: [Yarn] Spark AMs dead lock

2016-04-06 Thread Peter Rudenko
It doesn't matter - just an example. Imagine a yarn cluster with 100GB of ram and I simultaneously submit a lot of jobs in a loop. Thanks, Peter Rudenko On 4/6/16 7:22 PM, Ted Yu wrote: Which hadoop release are you using? bq. yarn cluster with 2GB RAM I assume 2GB is per node. Isn't this too

[Yarn] Spark AMs dead lock

2016-04-06 Thread Peter Rudenko
for a while. Is it possible to set some sort of timeout for acquiring executors, and otherwise kill the application? Thanks, Peter Rudenko
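Spark has no setting that kills a stuck application outright; the closest knobs (an assumption on my part, not an answer from this thread) only control how long the driver waits for executors before it starts scheduling, and on the YARN side yarn.scheduler.capacity.maximum-am-resource-percent caps how much of a queue the AMs themselves may occupy:

    // Sketch: wait for a fraction of the requested executors, but only up to a bounded time.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.scheduler.minRegisteredResourcesRatio", "0.8")
      .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "120s")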

Re: spark.ml : eval model outside sparkContext

2016-03-16 Thread Peter Rudenko
Hi Emmanuel, I'm looking for a similar solution. For now I've found only: https://github.com/truecar/mleap Thanks, Peter Rudenko On 3/16/16 12:47 AM, Emmanuel wrote: Hello, In MLLib with Spark 1.4, I was able to eval a model by loading it and using `predict` on a vector of features. I would train

Re: [Yarn] Executor cores isolation

2015-11-10 Thread Peter Rudenko
As I've tried cgroups - it seems the isolation is done by percentage, not by core count. E.g. I've set min share to 256 - I still see all 8 cores, but I can only load 20% of each core. Thanks, Peter Rudenko On 2015-11-10 15:52, Saisai Shao wrote: From my understanding, it depends

[Yarn] Executor cores isolation

2015-11-10 Thread Peter Rudenko
all 8 cores? Thanks, Peter Rudenko

[Yarn] How to set user in ContainerLaunchContext?

2015-11-02 Thread Peter Rudenko
tputBuffer(); credentials.writeTokenStorageToStream(dob); ByteBuffer.wrap(dob.getData(),0, dob.getLength()).duplicate(); } val cCLC = Records.newRecord(classOf[ContainerLaunchContext]) cCLC.setCommands(List("spark-submit --master yarn ...")) cCLC.setTokens(setupTokens(user)) Thanks, Peter Rudenko
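A fuller, self-contained sketch of what the truncated code above appears to do - the proxy-user step and the user name are assumptions; the rest is the standard YARN ContainerLaunchContext API:

    import java.nio.ByteBuffer
    import scala.collection.JavaConverters._
    import org.apache.hadoop.io.DataOutputBuffer
    import org.apache.hadoop.security.UserGroupInformation
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext
    import org.apache.hadoop.yarn.util.Records

    // Serialize the credentials of the user the container should run as.
    def setupTokens(user: String): ByteBuffer = {
      val ugi = UserGroupInformation.createProxyUser(user, UserGroupInformation.getCurrentUser)
      val dob = new DataOutputBuffer()
      ugi.getCredentials.writeTokenStorageToStream(dob)
      ByteBuffer.wrap(dob.getData, 0, dob.getLength).duplicate()
    }

    val cCLC = Records.newRecord(classOf[ContainerLaunchContext])
    cCLC.setCommands(List("spark-submit --master yarn ...").asJava)
    cCLC.setTokens(setupTokens("someUser"))   // "someUser" is a placeholder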

input file from tar.gz

2015-09-29 Thread Peter Rudenko
Hi, I have a huge tar.gz file on dfs. This file contains several files, but I want to use only one of them as input. Is it possible to somehow filter within a tar.gz archive, something like this: sc.textFile("hdfs:///data/huge.tar.gz#input.txt") Thanks, Peter
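sc.textFile can't look inside a tar archive, so one possible workaround (a sketch, not an answer from this thread) is to read the archive as a single binary stream and extract the wanted entry with commons-compress:

    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
    import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
    import scala.io.Source

    // Stream the whole tar.gz and keep only the lines of the entry we care about.
    val lines = sc.binaryFiles("hdfs:///data/huge.tar.gz").flatMap { case (_, pds) =>
      val tar = new TarArchiveInputStream(new GzipCompressorInputStream(pds.open()))
      Iterator.continually(tar.getNextTarEntry)
        .takeWhile(_ != null)
        .filter(_.getName == "input.txt")                        // the entry we want
        .flatMap(_ => Source.fromInputStream(tar).getLines().toList)
    }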

Re: Input size increasing every iteration of gradient boosted trees [1.4]

2015-09-03 Thread Peter Rudenko
Cache(true) boostingStrategy.treeStrategy.setCategoricalFeaturesInfo( mapAsJavaMap(categoricalFeatures).asInstanceOf[java.util.Map[java.lang.Integer, java.lang.Integer]]) val model = GradientBoostedTrees.train(instances, boostingStrategy) Thanks, Peter Rudenko On 2015-08-14 00:33, Sean Owen wrote: Not that I have

Re: StringIndexer + VectorAssembler equivalent to HashingTF?

2015-08-07 Thread Peter Rudenko
(SI1, SI2).setOutputCol(features) - the assembled features come out as distinct index pairs (0 0, 1 1, 0 1, 2 2), while HashingTF.setNumFeatures(2).setInputCol(COL1).setOutputCol(HT1) puts a, a, b into one bucket and c into the other, so the HT1 column shows hash collisions (3, 3, 3, 1). Thanks, Peter Rudenko On 2015-08-07 09:55, praveen S wrote: Is StringIndexer + VectorAssembler equivalent
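For comparison, a small sketch of both encodings (column names are made up for illustration; the ml HashingTF expects a tokenized/array column as input):

    import org.apache.spark.ml.feature.{HashingTF, StringIndexer, VectorAssembler}

    // Index each categorical column, then stitch the indices into one feature vector.
    val si1 = new StringIndexer().setInputCol("COL1").setOutputCol("SI1")
    val si2 = new StringIndexer().setInputCol("COL2").setOutputCol("SI2")
    val assembler = new VectorAssembler().setInputCols(Array("SI1", "SI2")).setOutputCol("features")

    // HashingTF instead hashes values into a fixed number of buckets, so with
    // setNumFeatures(2) distinct categories can collide in the same bucket.
    val ht = new HashingTF().setNumFeatures(2).setInputCol("COL1_tokens").setOutputCol("HT1")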

Re: Delete NA in a dataframe

2015-08-04 Thread Peter Rudenko
this: val rv = allyears2k.filter("COLUMN != 'NA'") Thanks, Peter Rudenko On 2015-08-04 15:03, clark djilo kuissu wrote: Hello, I try to manage NA in this dataset. I import my dataset with the com.databricks.spark.csv package. When I do this: allyears2k.na.drop() I have no result. Can you help me
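Another option, assuming the file is loaded with the com.databricks.spark.csv package mentioned in the question and that the reader version supports the nullValue option, is to turn the literal "NA" strings into real nulls at read time so that na.drop() behaves as expected:

    // Sketch: map the string "NA" to null while reading, then drop the null rows.
    val allyears2k = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("nullValue", "NA")
      .load("allyears2k.csv")          // hypothetical path

    val cleaned = allyears2k.na.drop()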

Re: what is metadata in StructField ?

2015-07-15 Thread Peter Rudenko
/attributes.scala Take a look at how I'm using metadata to get summary statistics from h2o: https://github.com/h2oai/sparkling-water/pull/17/files Let me know if you have questions. Thanks, Peter Rudenko On 2015-07-15 12:48, matd wrote: I see in StructField that we can provide metadata
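A small sketch of attaching and reading back column metadata (the key names and the df DataFrame are assumptions for illustration):

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.MetadataBuilder

    // Attach summary statistics to a column; the metadata travels with the schema.
    val stats = new MetadataBuilder().putDouble("min", 0.0).putDouble("max", 100.0).build()
    val withMeta = df.withColumn("age", col("age").as("age", stats))

    // Read it back from the StructField.
    val max = withMeta.schema("age").metadata.getDouble("max")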

Re: How to restrict disk space for spark caches on yarn?

2015-07-13 Thread Peter Rudenko
application correctly terminates (using sc.stop()). But in my case, when it filled all disk space it was stuck and couldn't stop correctly. After I restarted yarn, I don't know how to easily trigger cache cleanup except manually on all the nodes. Thanks, Peter Rudenko On 2015-07-10 20:07, Andrew

How to restrict disk space for spark caches on yarn?

2015-07-10 Thread Peter Rudenko
understood is of APPLICATION type. Is it possible to restrict disk space for a spark application? Will spark fail if it isn't able to persist on disk (StorageLevel.MEMORY_AND_DISK_SER), or will it recompute from the data source? Thanks, Peter Rudenko

Re: MLLib- Probabilities with LogisticRegression

2015-06-30 Thread Peter Rudenko
Hi Klaus, you can use the new ml api with dataframes: val model = new LogisticRegression().setFeaturesCol("features").setProbabilityCol("probability").setPredictionCol("prediction").fit(data) Thanks, Peter Rudenko On 2015-06-30 14:00, Klaus Schaefers wrote: Hello, is there a way to get the probabilities during the predict
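Once the model is fit, the per-class probabilities come back as an extra column from transform; a minimal usage sketch (column names as configured above):

    // The fitted model appends a probability vector and a prediction column.
    val scored = model.transform(data)
    scored.select("probability", "prediction").show(5)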

Re: Using Spark on Azure Blob Storage

2015-06-25 Thread Peter Rudenko
Thanks, Peter Rudenko On 2015-06-25 20:37, Daniel Haviv wrote: Hi, I'm trying to use spark over Azure's HDInsight but the spark-shell fails when starting: java.io.IOException: No FileSystem for scheme: wasb at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584
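The reply text is truncated here; for reference, a hand-wired sketch of the wasb:// filesystem (an assumption, not necessarily the fix from this thread - it also needs the hadoop-azure and azure-storage jars on the classpath):

    // Register the wasb:// scheme and the storage account key, then read as usual.
    sc.hadoopConfiguration.set("fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
    sc.hadoopConfiguration.set(
      "fs.azure.account.key.<youraccount>.blob.core.windows.net",   // placeholder account
      "<account-key>")
    val rdd = sc.textFile("wasb://<container>@<youraccount>.blob.core.windows.net/path")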

Re: Parallel parameter tuning: distributed execution of MLlib algorithms

2015-06-17 Thread Peter Rudenko
techniques than grid search (random search cross-validator, bayesian optimization CV, etc.). Thanks, Peter Rudenko On 2015-06-18 01:58, Xiangrui Meng wrote: On Fri, May 22, 2015 at 6:15 AM, Hugo Ferreira h...@inesctec.pt wrote: Hi, I am currently experimenting with linear regression (SGD) (Spark
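For context, the grid-search baseline under discussion looks roughly like this (a sketch; the training DataFrame is assumed, and the parallelism knob only exists in newer Spark releases):

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val lr = new LogisticRegression()
    val grid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.01, 0.1))
      .addGrid(lr.elasticNetParam, Array(0.0, 0.5))
      .build()

    val cv = new CrossValidator()
      .setEstimator(lr)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)
      // .setParallelism(4)            // Spark 2.3+ only: evaluate settings in parallel

    val cvModel = cv.fit(training)     // `training` is an assumed label/features DataFrame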

Re: Embedding your own transformer in Spark.ml Pipleline

2015-06-04 Thread Peter Rudenko
Hi Brandon, they are available, but private to the ml package. They are public as of 1.4. For 1.3.1 you can define your transformer in the org.apache.spark.ml package - then you can use these traits. Thanks, Peter Rudenko On 2015-06-04 20:28, Brandon Plaster wrote: Is HasInputCol and HasOutputCol
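As an illustration, a minimal custom transformer built on the public UnaryTransformer helper, which carries inputCol/outputCol for you, so the class can live in your own package on newer Spark versions (the names here are made up):

    import org.apache.spark.ml.UnaryTransformer
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.types.{DataType, StringType}

    // Toy transformer: upper-cases a string column. setInputCol/setOutputCol are inherited.
    class UpperCaser(override val uid: String)
        extends UnaryTransformer[String, String, UpperCaser] {

      def this() = this(Identifiable.randomUID("upperCaser"))

      override protected def createTransformFunc: String => String = _.toUpperCase

      override protected def outputDataType: DataType = StringType
    }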

Re: Embedding your own transformer in Spark.ml Pipleline

2015-06-02 Thread Peter Rudenko
Hi Dimple, take a look at the existing transformers: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala

Re: Dataframe random permutation?

2015-06-01 Thread Peter Rudenko
Hi Cesar, try to do: hc.createDataFrame(df.rdd.coalesce(NUM_PARTITIONS, shuffle = true), df.schema) It's a bit inefficient, but should shuffle the whole dataframe. Thanks, Peter Rudenko On 2015-06-01 22:49, Cesar Flores wrote: I would like to know what will be the best approach to randomly
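An alternative sketch that also randomizes row order (functions.rand is available from Spark 1.4 on):

    import org.apache.spark.sql.functions.rand

    // Order by a random key: a full shuffle/sort, but a one-liner.
    val permuted = df.orderBy(rand())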

Re: [SQL][Dataframe] Change data source after saveAsParquetFile

2015-05-08 Thread Peter Rudenko
Hm, thanks. Do you know what this setting means: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L1178 ? Thanks, Peter Rudenko On 2015-05-08 17:48, ayan guha wrote: From S3. As the dependency of df will be on s3. And because rdds

[SQL][Dataframe] Change data source after saveAsParquetFile

2015-05-08 Thread Peter Rudenko
Hi, I have the following question: val data = sc.textFile("s3:///"); val df = data.toDF; df.saveAsParquetFile("hdfs://"); df.someAction(...) If some workers die during someAction, would recomputation download the files from s3 or read the hdfs parquet? Thanks, Peter Rudenko
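If the goal is to make any recomputation start from the Parquet copy rather than S3, one option (a sketch built on the question's own pipeline, with an extra re-read step; paths are placeholders) is:

    val data = sc.textFile("s3://...")                 // placeholder S3 path
    val df = data.toDF()
    df.saveAsParquetFile("hdfs:///data/out.parquet")   // placeholder HDFS path

    // Re-read the Parquet output so downstream lineage starts at HDFS, not S3.
    val dfFromParquet = sqlContext.parquetFile("hdfs:///data/out.parquet")
    dfFromParquet.count()                              // later actions recompute from Parquet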

[Ml][Dataframe] Ml pipeline dataframe repartitioning

2015-04-24 Thread Peter Rudenko
practice to handle partitions in dataframes with a lot of columns? Should I repartition manually after adding columns? What's better/faster: applying 30 transformers, one per numeric column, or combining these columns into 1 vector column and applying 1 transformer? Thanks, Peter Rudenko

Reading files from http server

2015-04-13 Thread Peter Rudenko
downloading them first to hdfs? Something like this: sc.textFile("http://azuremlsampleexperiments.blob.core.windows.net/criteo/day_{0-23}.gz"), so it will have 24 partitions. Thanks, Peter Rudenko
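sc.textFile has no HTTP support, so one possible approach (an assumption, not an answer from this thread) is to parallelize the list of URLs and let each task stream and decompress one file:

    import java.net.URL
    import java.util.zip.GZIPInputStream
    import scala.io.Source

    // One URL per partition; each task streams and gunzips its file on an executor.
    val urls = (0 to 23).map(i =>
      s"http://azuremlsampleexperiments.blob.core.windows.net/criteo/day_$i.gz")

    val lines = sc.parallelize(urls, urls.size).flatMap { u =>
      val in = new GZIPInputStream(new URL(u).openStream())
      Source.fromInputStream(in).getLines()
    }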

Re: From DataFrame to LabeledPoint

2015-04-02 Thread Peter Rudenko
Hi, try the following code: val labeledPoints: RDD[LabeledPoint] = features.zip(labels).map { case Row(feature1, feature2, ..., label) => LabeledPoint(label, Vectors.dense(feature1, feature2, ...)) } Thanks, Peter Rudenko On 2015-04-02 17:17, drarse wrote: Hello!, I have a question since days ago

Re: Spark ML Pipeline inaccessible types

2015-03-25 Thread Peter Rudenko
this: StructType(vectorTypeColumn, SparkVector.VectorUDT, false) Thanks, Peter Rudenko On 2015-03-25 13:14, zapletal-mar...@email.cz wrote: Sean, thanks for your response. I am familiar with NoSuchMethodException in general, but I think it is not the case this time. The code actually attempts

Re: ML Pipeline question about caching

2015-03-17 Thread Peter Rudenko
of combinations (number of parameters for the transformer × number of parameters for the estimator × number of folds). Thanks, Peter Rudenko On 2015-03-18 00:26, Cesar Flores wrote: Hello all: I am using the ML Pipeline, which I consider very powerful. I have the next use case: * I have three transformers, which I

Re: Workflow layer for Spark

2015-03-13 Thread Peter Rudenko
Take a look at the new spark ml api http://spark.apache.org/docs/latest/ml-guide.html with Pipeline functionality, and also at spark dataflow https://github.com/cloudera/spark-dataflow - a Google Cloud Dataflow API implementation on top of spark. Thanks, Peter Rudenko On 2015-03-13 17:46

Re: Is there any Sparse Matrix implementation in Spark/MLib?

2015-02-27 Thread Peter Rudenko
Yes, it's called CoordinateMatrix (http://spark.apache.org/docs/latest/mllib-data-types.html#coordinatematrix); you need to fill it with elements of type MatrixEntry (Long, Long, Double). Thanks, Peter Rudenko On 2015-02-27 14:01, shahab wrote: Hi, I just wonder if there is any Sparse
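A minimal sketch of building one (the values are made up):

    import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

    // Each MatrixEntry is (row index, column index, value); absent entries are implicitly zero.
    val entries = sc.parallelize(Seq(
      MatrixEntry(0L, 1L, 3.0),
      MatrixEntry(2L, 0L, 1.5)))

    val sparse = new CoordinateMatrix(entries)
    println(s"${sparse.numRows()} x ${sparse.numCols()}")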

Re: ML Transformer

2015-02-19 Thread Peter Rudenko
Hi Cesar, these methods will stay private until the new ml api stabilizes (approx. in spark 1.4). My solution for the same issue was to create an org.apache.spark.ml package in my project and extend/implement everything there. Thanks, Peter Rudenko On 2015-02-18 22:17, Cesar Flores wrote: I