broadcast() the same df multiple times. Is it cached?

2017-06-12 Thread matd
Hi spark folks,

In our application, we have to join a dataframe with several other
dataframes (not always on the same join column).

This left-hand side df is not very large, so a broadcast hint may be
beneficial.
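
For reference, a minimal sketch of the setup (frame and column names made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

// Hypothetical frames for illustration only.
val spark = SparkSession.builder().appName("broadcast-joins").getOrCreate()
import spark.implicits._

val small = Seq((1, "a"), (2, "b")).toDF("id", "label") // the small lhs df
val big1  = Seq((1, 10.0), (2, 20.0)).toDF("id", "x")
val big2  = Seq(("a", true), ("b", false)).toDF("label", "flag")

// The same small df, broadcast-hinted in two joins on different columns.
val joined1 = broadcast(small).join(big1, "id")
val joined2 = broadcast(small).join(big2, "label")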

My questions:
- if the same df is broadcast multiple times (as in the sketch above), does
the transfer occur only once (i.e. is the broadcast data somehow cached on
the executors), or does it happen each time?
- if the joins concern different columns, is it cached just the same?

Thanks for your insights
Mathieu




Kerberos impersonation of a Spark Context at runtime

2017-05-04 Thread matd
Hi folks,

I have a Spark application executing various jobs for different users
simultaneously, via several Spark sessions on several threads.

My customer would like to kerberize his Hadoop cluster. I wonder if there is
a way to configure impersonation so that each of these jobs runs as a
different proxy user. From what I see in the Spark configuration and code,
it's not possible to do that at runtime for a specific context, but I'm not
familiar with Kerberos nor with this part of Spark.
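
For illustration, this is the kind of per-job wrapping I have in mind (a
sketch with a hypothetical user name; it covers direct Hadoop access via a
proxy UGI, but as far as I can tell it doesn't re-credential an
already-created SparkContext):

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// Impersonate "alice" on top of the service's Kerberos login
// (requires hadoop.proxyuser.* rules on the cluster side).
val realUser = UserGroupInformation.getLoginUser
val proxyUgi = UserGroupInformation.createProxyUser("alice", realUser)

proxyUgi.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    // Direct HDFS calls here run as "alice", but an existing
    // SparkContext/SparkSession keeps the credentials it started with.
  }
})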

Can anyone confirm or refute this?

Mathieu

(also on S.O
http://stackoverflow.com/questions/43765044/kerberos-impersonation-of-a-spark-context-at-runtime)



Online learning of LDA model in Spark (update an existing model)

2017-03-13 Thread matd
Hi folks,

I would like to train an LDA model in an online fashion, i.e. be able to
update the resulting model with new documents as they become available.

I understand that, under the hood, an online algorithm is implemented in
OnlineLDAOptimizer, but I don't see from the API how I can update an
existing model with a new batch of docs.
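
Here is where I am, as a sketch (parameters made up): I can select the
online optimizer, but run() takes the whole corpus in one go, and I don't
see a way to feed a later batch into the returned model:

import org.apache.spark.mllib.clustering.{LDA, LDAModel, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// corpus: (docId, termCountVector) pairs, built elsewhere.
def train(corpus: RDD[(Long, Vector)]): LDAModel =
  new LDA()
    .setK(20)
    .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.05))
    .run(corpus)
// run() consumes the full corpus; I see no "update(newBatch)" afterwards.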

Is it possible? Any hint or code sample would be appreciated.

Thanks
Mat



spark 2.0 bloom filters

2016-07-06 Thread matd
A question for Spark developers

I see that Bloom filters have been integrated into Spark 2.0.
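
For reference, the new API I mean, as a minimal sketch (DataFrame, column
name, and sizing made up):

import org.apache.spark.sql.DataFrame

// Build a filter over a column: ~1M expected items, 3% false-positive rate.
def sketch(df: DataFrame): Boolean = {
  val bf = df.stat.bloomFilter("user_id", 1000000L, 0.03)
  bf.mightContain("someUser") // probabilistic membership test
}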

Hadoop already has some Bloom filter implementations, notably a dynamic one,
which is very interesting when the number of keys largely exceeds what was
initially anticipated.

Is there any rationale (performance, implementation details...) for
re-implementing this in Spark instead of reusing the one from Hadoop?

Thanks!



Get both feature importance and ROC curve from a random forest classifier

2016-06-15 Thread matd
Hi ml folks!

I'm using a Random Forest for binary classification.
I'm interested in getting both the ROC *curve* and the feature importances
from the trained model.

If I'm not missing something obvious, the ROC curve is only available in the
old mllib world, via BinaryClassificationMetrics. In the new ml package,
only the areaUnderROC and areaUnderPR are available through
BinaryClassificationEvaluator.

The feature importances are only available in the ml package, through
RandomForestClassificationModel.
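
The only workaround I can think of is to bridge the two worlds by hand,
along these lines (a sketch, assuming the ml default column names; on 1.6
the probability column holds an mllib.linalg.Vector):

import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.DataFrame

def rocAndImportances(model: RandomForestClassificationModel,
                      test: DataFrame) = {
  val scored = model.transform(test) // adds "probability" next to "label"
  val scoreAndLabel = scored.select("probability", "label").rdd.map { row =>
    (row.getAs[Vector](0)(1), row.getDouble(1)) // (P(class=1), label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabel)
  (metrics.roc(), model.featureImportances) // the full curve + importances
}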

Any cleaner way to get both?

Mathieu



spark w/ scala 2.11 and PackratParsers

2016-05-04 Thread matd
Hi folks,

Our project is a mix of Scala 2.10 and 2.11, so I tried to switch
everything to 2.11.

I had some exasperating errors like this:

java.lang.NoClassDefFoundError:
org/apache/spark/sql/execution/datasources/DDLParser
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:208)
at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:77)
at org.apache.spark.sql.SQLContext$.getOrCreate(SQLContext.scala:1295)

... that I was unable to fix, until I figured out that this error came
first:

java.lang.NoClassDefFoundError: scala/util/parsing/combinator/PackratParsers
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

...which I finally managed to fix by adding this dependency:
"org.scala-lang.modules" %% "scala-parser-combinators" % "1.0.4"

As this is not documented anywhere, I'd like to know whether it's just
missing documentation somewhere, or whether it's hiding another problem that
will jump out at me at some point.
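
For context, the relevant part of my build.sbt now looks like this (a
sketch; versions are approximately what I use):

// In Scala 2.11 the parser combinators left the standard library and
// moved to a separate module, hence the explicit dependency.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark"       %% "spark-sql"                % "1.6.1" % "provided",
  "org.scala-lang.modules" %% "scala-parser-combinators" % "1.0.4"
)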

Mathieu




fp growth - clean up repetitions in input

2016-01-06 Thread matd
Hi folks,

I'm interested in using FP growth to identify sequence patterns.

Unfortunately, my input sequences contain cycles:
...1,2,4,1,2,5...

And this is not supported by FP-growth (I get a SparkException: "Items in a
transaction must be unique but got WrappedArray(...)").

Do you know a way to identify and clean up cycles before feeding them to
FP-growth? The naive cleanup I can think of is just deduplicating each
transaction (sketched below), but maybe there is something better.
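
Something like this is what I mean, as a minimal sketch (parameters made
up):

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// raw transactions, e.g. Array("1","2","4","1","2","5")
def mine(raw: RDD[Array[String]]) = {
  val unique = raw.map(_.distinct) // keep the first occurrence of each item
  new FPGrowth()
    .setMinSupport(0.2)
    .setNumPartitions(10)
    .run(unique)
}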

Thanks for your input.
Mat



Handle null/NaN values in mllib classifier

2015-09-25 Thread matd
Hi folks,

I have a set of categorical columns (strings) that I'm parsing and
converting into feature Vectors to pass to an mllib classifier (random
forest).

In my input data, some columns have null values. Say that, in one of those
columns, I have p distinct values plus a null value. How should I build my
feature Vectors and the classifier's categoricalFeaturesInfo map?
* option 1: declare p values in categoricalFeaturesInfo, and use Double.NaN
in my input Vectors? [How are NaNs handled by classifiers?]
* option 2: treat null as a value of its own, so declare (p+1) values in
categoricalFeaturesInfo, and map nulls to some integer? (sketched below)
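
To make option 2 concrete, here is a sketch (the value mapping and column
index are made up):

// Treat null as its own category: p real values plus one code for null.
val categories = Map("red" -> 0.0, "green" -> 1.0, "blue" -> 2.0) // p = 3
val nullCode = categories.size.toDouble // null gets index p

def encode(v: String): Double =
  if (v == null) nullCode else categories(v)

// Feature 0 then has p + 1 distinct values:
val categoricalFeaturesInfo = Map(0 -> (categories.size + 1))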


Thanks for your help.

Mathieu

(PS: I know about the new DataFrame + Pipeline + VectorIndexer API, but for
various reasons it doesn't fit my needs well, so I need to do this myself)





S3n, parallelism, partitions

2015-08-17 Thread matd
Hello,

I would like to understand how the work is parallelized across a Spark
cluster (and what is left to the driver) when I read several files from a
single folder in S3: s3n://bucket_xyz/some_folder_having_many_files_in_it/
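
For reference, the call in question (sc is an existing SparkContext), plus
the minPartitions hint whose effect here is part of what I'm asking about:

val rdd  = sc.textFile("s3n://bucket_xyz/some_folder_having_many_files_in_it/")
// textFile also accepts a minimum-partition hint:
val rdd2 = sc.textFile("s3n://bucket_xyz/some_folder_having_many_files_in_it/",
                       minPartitions = 64)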

How are files (or file parts) mapped to partitions?

Thanks 
Mathieu



What is metadata in StructField?

2015-07-15 Thread matd
I see in StructField that we can provide metadata.

What is it meant for? How is it used by Spark later on?
Are there any rules on what we can/cannot do with it?

I'm building some DataFrame processing, and I need to maintain a set of
(meta)data along with the DF. I was wondering if I can use
StructField.metadata for this purpose, or if I should build my own structure.
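
For instance, something like this sketch (the keys are made up):

import org.apache.spark.sql.types.{Metadata, MetadataBuilder, StringType, StructField}

val meta: Metadata = new MetadataBuilder()
  .putString("source", "crm")
  .putLong("version", 2L)
  .build()

// Attach it to a field; it travels with the schema.
val field = StructField("customer_id", StringType, nullable = true, metadata = meta)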

Mathieu



spark ec2 as non-root / any plan to improve that in the future?

2015-07-09 Thread matd
Hi,

The Spark ec2 scripts are useful, but they install everything as root.
AFAIK, that's not a good practice ;-)

Why is that?
Should these scripts be reserved for test/demo purposes, and not be used for
a production system?
Is it planned on some roadmap to improve this, or to replace the ec2 scripts
with something else?

Would it be difficult to change them to use a sudoer instead?


