broadcast() multiple times the same df. Is it cached ?

2017-06-12 Thread matd
Hi Spark folks, In our application, we have to join a dataframe with several other dfs (not always on the same joining column). This left-hand side df is not very large, so a broadcast hint may be beneficial. My questions: - if the same df gets broadcast multiple times, will the transfer occur once

Kerberos impersonation of a Spark Context at runtime

2017-05-04 Thread matd
Hi folks, I have a Spark application executing various jobs for different users simultaneously, via several Spark sessions on several threads. My customer would like to kerberize his Hadoop cluster. I wonder if there is a way to configure impersonation such that each of these jobs would be run

Online learning of LDA model in Spark (update an existing model)

2017-03-13 Thread matd
Hi folks, I would like to train an LDA model in an online fashion, i.e. be able to update the resulting model with new documents as they become available. I understand that, under the hood, an online algo is implemented in OnlineLDAOptimizer, but I don't understand from the API how I can update an

spark 2.0 bloom filters

2016-07-06 Thread matd
A question for Spark developers: I see that Bloom filters have been integrated in Spark 2.0. Hadoop already has some Bloom filter implementations, especially a dynamic one

Get both feature importance and ROC curve from a random forest classifier

2016-06-15 Thread matd
Hi ML folks! I'm using a Random Forest for binary classification. I'm interested in getting both the ROC *curve* and the feature importance from the trained model. If I'm not missing something obvious, the ROC curve is only available in the old mllib world, via BinaryClassificationMetrics. In

spark w/ scala 2.11 and PackratParsers

2016-05-04 Thread matd
Hi folks, Our project is a mess of Scala 2.10 and 2.11, so I tried to switch everything to 2.11. I had some exasperating errors like this: java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/DDLParser at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:208) at

fp growth - clean up repetitions in input

2016-01-06 Thread matd
Hi folks, I'm interested in using FP-growth to identify sequence patterns. Unfortunately, my input sequences have cycles: ...1,2,4,1,2,5... And this is not supported by FP-growth (I get a SparkException: Items in a transaction must be unique but got WrappedArray). Do you know a way to identify

Handle null/NaN values in mllib classifier

2015-09-25 Thread matd
Hi folks, I have a set of categorical columns (strings) that I'm parsing and converting into Vectors of features to pass to an mllib classifier (random forest). In my input data, some columns have null values. Say, in one of those columns, I have p values + a null value: How should I build my

S3n, parallelism, partitions

2015-08-17 Thread matd
Hello, I would like to understand how the work is parallelized across a Spark cluster (and what is left to the driver) when I read several files from a single folder in S3: s3n://bucket_xyz/some_folder_having_many_files_in_it/ How are files (or file parts) mapped to partitions? Thanks, Mathieu

what is metadata in StructField ?

2015-07-15 Thread matd
I see in StructField that we can provide metadata. What is it meant for? How is it used by Spark later on? Are there any rules on what we can/cannot do with it? I'm building some DataFrame processing, and I need to maintain a set of (meta)data along with the DF. I was wondering if I can use

spark ec2 as non-root / any plan to improve that in the future ?

2015-07-09 Thread matd
Hi, Spark ec2 scripts are useful, but they install everything as root. AFAIK, it's not a good practice ;-) Why is that? Should these scripts be reserved for test/demo purposes, and not be used for a production system? Is it planned in some roadmap to improve that, or to replace ec2-scripts