unit testing in spark

2016-12-08 Thread pseudo oduesp
Can someone tell me how I can write unit tests for PySpark? (book, tutorial, ...)
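A minimal sketch of one common pattern, using the standard library's unittest with a local[2] SparkContext shared across the test class (the word-count job under test is purely illustrative; the spark-testing-base package is another option worth searching for):

    import unittest

    from pyspark import SparkContext

    class WordCountTest(unittest.TestCase):
        @classmethod
        def setUpClass(cls):
            # one small local context for the whole test class
            cls.sc = SparkContext("local[2]", "unit-test")

        @classmethod
        def tearDownClass(cls):
            cls.sc.stop()

        def test_word_count(self):
            rdd = self.sc.parallelize(["a b", "b c"])
            counts = dict(rdd.flatMap(lambda s: s.split())
                             .map(lambda w: (w, 1))
                             .reduceByKey(lambda a, b: a + b)
                             .collect())
            self.assertEqual(counts["b"], 2)

    if __name__ == "__main__":
        unittest.main()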

create new spark context from ipython or jupyter

2016-12-07 Thread pseudo oduesp
Hi, how can we create a new SparkContext from an IPython or Jupyter session? I mean, if I use the current SparkContext and run sc.stop(), how can I launch a new one from IPython without restarting the IPython session by refreshing the browser? This comes up when I code some functions and figure out I forgot something inside.
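A minimal sketch: once sc.stop() has run, a new context can be built in the same kernel without touching the browser; contexts derived from it (SQLContext, HiveContext) have to be rebuilt as well:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    sc.stop()                                      # stop the shell's context
    conf = SparkConf().setAppName("fresh-context") # set master/conf as needed
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)                    # rebuild derived contexts too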

add jars like spark-csv to ipython notebook with pyspark

2016-09-09 Thread pseudo oduesp
Hi, how can I add a jar to an IPython notebook? I tried PYSPARK_SUBMIT_ARGS without success. Thanks
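A sketch of the detail that usually bites here: PYSPARK_SUBMIT_ARGS must be set before the SparkContext starts, and since Spark 1.4 it has to end with pyspark-shell; the jar paths below are examples:

    import os

    os.environ["PYSPARK_SUBMIT_ARGS"] = (
        "--jars /path/to/spark-csv_2.10-1.4.0.jar,/path/to/commons-csv-1.1.jar "
        "pyspark-shell"
    )
    # ...now create the SparkContext. Alternatively, let Maven resolve them:
    # os.environ["PYSPARK_SUBMIT_ARGS"] = \
    #     "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"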

pyspark 1.5.0 broadcast join

2016-09-08 Thread pseudo oduesp
Hi, can someone show me an example of a broadcast join with DataFrames in PySpark 1.5.0? Thanks
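A hedged sketch for 1.5.0: the explicit broadcast() hint only reached the Python API in later releases, but Spark broadcasts the smaller side automatically when it is under spark.sql.autoBroadcastJoinThreshold (in bytes); the paths and key name are examples:

    # raise the threshold so the small table qualifies (100 MB here)
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
                       str(100 * 1024 * 1024))

    small_df = sqlContext.read.parquet("/path/small")
    big_df = sqlContext.read.parquet("/path/big")
    joined = big_df.join(small_df, big_df["key"] == small_df["key"], "inner")
    joined.explain()   # look for BroadcastHashJoin in the physical plan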

long lineage

2016-08-16 Thread pseudo oduesp
Hi, how can we deal with a StackOverflowError triggered by a long lineage? I mean, I have this error: how can I resolve it without creating a new session? Thanks
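A sketch of the usual remedy: truncate the lineage periodically with checkpointing instead of restarting the session; the path and the every-50-steps cadence are illustrative:

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")   # any reliable storage

    rdd = sc.parallelize(range(1000))
    for i in range(200):                   # long iterative pipeline
        rdd = rdd.map(lambda x: x + 1)
        if i % 50 == 0:
            rdd.checkpoint()               # cut the lineage here
            rdd.count()                    # an action materializes it
    # For DataFrames, a common equivalent is writing to parquet and
    # reading the result back, which also starts a fresh, short lineage.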

java.lang.UnsupportedOperationException: Cannot evaluate expression: fun_nm(input[0, string, true])

2016-08-16 Thread pseudo oduesp
Hi, I create new columns with a UDF; afterwards, when I try to filter on these columns, I get this error. Why? java.lang.UnsupportedOperationException: Cannot evaluate expression: fun_nm(input[0, string, true]) at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:221) at
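A hedged sketch of a workaround that is often reported for this error: Catalyst pushes the filter on the Python UDF column somewhere it cannot be evaluated, and materializing the frame (cache plus an action) before filtering tends to sidestep it; all names here are illustrative:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    fun_nm = udf(lambda s: s.upper() if s else None, StringType())

    df = sqlContext.createDataFrame([("a",), ("b",)], ["raw"])
    df2 = df.withColumn("norm", fun_nm(df["raw"]))
    df2.cache()
    df2.count()                              # compute the UDF column as-is
    filtered = df2.filter(df2["norm"] == "A")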

zip for pyspark

2016-08-08 Thread pseudo oduesp
Hi, how can I export a whole project as a zip from a local session to the cluster and deploy it with spark-submit? I mean, I have a large project with all its dependencies, and I want to create a zip containing all of the dependencies and deploy it on the cluster.
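A sketch, assuming the project is an importable package under ./myproject: build one zip of the dependencies and ship it either at submit time or from a running session (module and file names are examples):

    # shell: package the sources, then submit with the archive attached
    #   cd myproject && zip -r ../deps.zip .
    #   spark-submit --py-files deps.zip main.py

    # or from a running session:
    sc.addPyFile("deps.zip")   # modules become importable on the executors
    import mymodule            # hypothetical module inside deps.zip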

Re: pyspark on pycharm on WINDOWS

2016-08-05 Thread pseudo oduesp
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857) at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269) Process finished with exit code 1 2016-08-05 15:35 GMT+02:00 pseudo oduesp <pseudo20...@gmail.com>: > HI, > > i configured th pycha

pyspark on pycharm on WINDOWS

2016-08-05 Thread pseudo oduesp
Hi, I configured PyCharm as described on Stack Overflow, with SPARK_HOME and HADOOP_CONF_DIR, and downloaded winutils to use with a prebuilt version of Spark 2.0 (pyspark 2.0), and I get this error. If you can help me find a solution, thanks
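A sketch of one setup that is commonly described for this combination; every path is an example, and the py4j zip version suffix varies by Spark release:

    import os, sys

    os.environ["SPARK_HOME"] = r"C:\spark-2.0.0-bin-hadoop2.7"
    os.environ["HADOOP_HOME"] = r"C:\hadoop"   # winutils.exe in C:\hadoop\bin
    sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python"))
    sys.path.insert(0, os.path.join(os.environ["SPARK_HOME"], "python",
                                    "lib", "py4j-0.10.1-src.zip"))

    from pyspark import SparkContext
    sc = SparkContext("local[2]", "pycharm-test")
    print(sc.parallelize([1, 2, 3]).sum())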

WindowsError: [Error 2] The system cannot find the file specified

2016-08-04 Thread pseudo oduesp
Hi, with pyspark 2.0 I get this error: WindowsError: [Error 2] The system cannot find the file specified. Can someone help me find a solution? Thanks

Re: WindowsError: [Error 2] The system cannot find the file specified

2016-08-04 Thread pseudo oduesp
da2\lib\subprocess.py", line 711, in __init__ errread, errwrite) File "C:\Users\AppData\Local\Continuum\Anaconda2\lib\subprocess.py", line 959, in _execute_child startupinfo) WindowsError: [Error 2] The system cannot find the file specified ("Le fichier spécifié est introuvable" in the French locale) Process finished with exit code 1 201

pycharm and pyspark on windows

2016-08-04 Thread pseudo oduesp
Hi, what is a good configuration for pyspark and PyCharm on Windows? Thanks

describe function limit of columns

2016-08-02 Thread pseudo oduesp
Hi, in Spark 1.5.0 I used the describe function with more than 100 columns. Can someone tell me whether any limit exists now? Thanks
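A small sketch: describe() takes an explicit column list, so a wide frame can be summarized in batches; the three-column frame and the batch size of 50 are stand-ins:

    df = sqlContext.createDataFrame([(1.0, 2.0, 3.0)], ["a", "b", "c"])
    cols = df.columns
    for i in range(0, len(cols), 50):
        df.describe(*cols[i:i + 50]).show()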

Re: java.net.UnknownHostException

2016-08-02 Thread pseudo oduesp
Can someone help me, please? 2016-08-01 11:51 GMT+02:00 pseudo oduesp <pseudo20...@gmail.com>: > hi > i get the following errors when i try using pyspark 2.0 with ipython on > yarn > someone can help me please . > java.lang.IllegalArgumentException: java.net.Unknown

java.net.UnknownHostException

2016-08-01 Thread pseudo oduesp
Hi, I get the following errors when I try using pyspark 2.0 with IPython on YARN. Can someone help me, please? java.lang.IllegalArgumentException: java.net.UnknownHostException: s001.bigdata.;s003.bigdata;s008bigdata. at

estimation of necessary time of execution

2016-07-29 Thread pseudo oduesp
Hi, Hive has an awesome function for estimating the execution time before launch. In Spark, can we find any function to estimate the execution time of a Spark DAG lineage? Thanks
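To my knowledge Spark ships no wall-clock estimator comparable to Hive's; the closest built-in is inspecting the plan before running, as in this sketch (the input path and column name are examples):

    df = sqlContext.read.parquet("/path/table")
    result = df.groupBy("key").count()
    result.explain(True)   # logical + physical plan: scans, shuffles, joins
    # Shuffle count and input size give a rough feel for cost; real timings
    # still come from the Spark UI after the job runs.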

sparse vector to dense vector in pyspark

2016-07-26 Thread pseudo oduesp
Hi, with StandardScaler we get a sparse vector. How can I transform it to a list or a dense vector without losing the sparse values? Thanks
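A minimal sketch: a SparseVector stores the zeros implicitly, so toArray() and DenseVector reproduce every value with nothing lost:

    from pyspark.mllib.linalg import DenseVector, SparseVector

    sv = SparseVector(5, {0: 1.0, 3: 4.0})
    dense = DenseVector(sv.toArray())   # [1.0, 0.0, 0.0, 4.0, 0.0]
    as_list = sv.toArray().tolist()     # plain Python list, same values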

Re: PCA machine learning

2016-07-26 Thread pseudo oduesp
for each value in the feature, the name of the variable. How can I identify the names of the principal components in the second vector? 2016-07-26 10:39 GMT+02:00 pseudo oduesp <pseudo20...@gmail.com>: > Hi, > when i perform PCA reduction dimension i get dense vector with length of > number of prin

PCA machine learning

2016-07-26 Thread pseudo oduesp
Hi, when I perform PCA dimensionality reduction I get a dense vector whose length is the number of principal components. My questions: how do I get the names of the features behind these vectors? Are the values inside the result vectors the projections of all the features onto these components? How do I use it?
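A hedged sketch of the usual answer: components have no names of their own; each is a linear combination of the input features, and the loading matrix (rows = original features, columns = components) says how much each feature contributes. Newer PySpark releases (around 2.0) expose it as model.pc; the tiny frame below is a stand-in, and before 2.0 the Vectors import comes from pyspark.mllib.linalg instead:

    from pyspark.ml.feature import PCA
    from pyspark.ml.linalg import Vectors

    df = sqlContext.createDataFrame(
        [(Vectors.dense([1.0, 0.0, 7.0]),),
         (Vectors.dense([2.0, 1.0, 5.0]),)],
        ["features"])
    model = PCA(k=2, inputCol="features", outputCol="pca").fit(df)
    print(model.pc)   # loading matrix: feature-by-component weights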

Re: add spark-csv jar to ipython notebook without packages flags

2016-07-25 Thread pseudo oduesp
PYSPARK_SUBMIT_ARGS = --jars spark-csv_2.10-1.4.0.jar,commons-csv-1.1.jar without success. Thanks. 2016-07-25 13:27 GMT+02:00 pseudo oduesp <pseudo20...@gmail.com>: > Hi , > someone can telle me how i can add jars to ipython i try spark > > >

add spark-csv jar to ipython notebook without packages flags

2016-07-25 Thread pseudo oduesp
Hi, can someone tell me how I can add jars to IPython? I tried spark

spark and plot data

2016-07-21 Thread pseudo oduesp
Hi, I know Spark is an engine to compute large data sets, but I work with pyspark and it is a very wonderful machine. My question: we don't have tools for plotting data; each time we have to switch back to Python to use plot. But when you have a large result, a scatter plot or ROC curve

RandomForestClassifier

2016-07-20 Thread pseudo oduesp
Hi, we have parameters named labelCol="label" and featuresCol="features". If I set the values here (label and features) and train my model on a data frame with other columns, does the algorithm choose only the label column and the features column? Thanks

lift coefficient

2016-07-20 Thread pseudo oduesp
Hi, how can we calculate the lift coefficient from pyspark prediction results? Thanks

which one spark ml or spark mllib

2016-07-19 Thread pseudo oduesp
Hi, I don't have any idea why we have two libraries, ML and MLlib. You can use ml with data frames and mllib with RDDs, but ml has some gaps, like saving models, most important if you want to create a web API with scoring. My question: why don't we have all the MLlib features in ML? (I use pyspark 1.5.0

pyspark 1.5.0 save model ?

2016-07-18 Thread pseudo oduesp
Hi, how can I save a model under pyspark 1.5.0? I use RandomForestClassifier(). Thanks in advance.
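A hedged sketch: in 1.5 the DataFrame-based RandomForestClassifier has no Python save() (that arrived around 2.0), but the RDD-based mllib model can be saved and reloaded, so training with mllib is one workable route when persistence matters (paths are examples):

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import RandomForest, RandomForestModel

    data = sc.parallelize([LabeledPoint(0.0, [0.0, 1.0]),
                           LabeledPoint(1.0, [1.0, 0.0])])
    model = RandomForest.trainClassifier(data, numClasses=2,
                                         categoricalFeaturesInfo={},
                                         numTrees=5)
    model.save(sc, "hdfs:///models/rf")
    same_model = RandomForestModel.load(sc, "hdfs:///models/rf")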

Feature importance in random forest

2016-07-12 Thread pseudo oduesp
Hi, I use pyspark 1.5.0. Can I ask how I can get the feature importance for a random forest algorithm in pyspark? Please give me an example. Thanks in advance.
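A hedged sketch: the RDD-based mllib forest in 1.5 does not expose importances from Python; the DataFrame API does from around Spark 2.0, via featureImportances (the two-row frame is a stand-in):

    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.linalg import Vectors

    df = sqlContext.createDataFrame(
        [(0.0, Vectors.dense([0.0, 1.0])),
         (1.0, Vectors.dense([1.0, 0.0]))],
        ["label", "features"])
    model = RandomForestClassifier(numTrees=10).fit(df)
    print(model.featureImportances)   # one weight per feature position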

categoricalFeaturesInfo

2016-07-07 Thread pseudo oduesp
Hi, how can I use this option in Random Forest? When I transform my vector (100 features), 20 categorical features are included. If I understand categoricalFeaturesInfo, I should pass the positions of my 20 categorical features inside the vector containing 100, with a map { position of feature
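A small sketch of the expected shape: the dict maps position in the feature vector to the number of distinct categories, and unlisted positions are treated as continuous; the indices below are examples for a 100-feature vector:

    categoricalFeaturesInfo = {
        3: 4,    # the feature at index 3 has 4 categories, encoded 0..3
        17: 2,   # the feature at index 17 is binary
        # ... one entry for each of the 20 categorical positions
    }
    # The categorical values must already be encoded as 0 .. k-1
    # inside the vector before training.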

remove row from data frame

2016-07-05 Thread pseudo oduesp
Hi, how can I remove rows from a data frame that satisfy some condition on some columns? Thanks
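A minimal sketch: filter()/where() keep the matching rows, so removing rows means keeping the negation of the condition (columns and values are examples):

    from pyspark.sql.functions import col

    df = sqlContext.createDataFrame([(1, "a"), (2, "b"), (3, "a")],
                                    ["id", "tag"])
    kept = df.filter(~((col("tag") == "a") & (col("id") > 1)))
    kept.show()   # the rows matching the condition are gone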

alter table with hive context

2016-06-26 Thread pseudo oduesp
Hi, how can I alter a table by adding new columns to it in HiveContext?
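A sketch, assuming the table is Hive-managed: HiveContext executes HiveQL, so the standard DDL goes straight through (names and the type are examples):

    sqlContext.sql(
        "ALTER TABLE my_db.my_table ADD COLUMNS (new_col STRING)")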

add multiple columns

2016-06-26 Thread pseudo oduesp
Hi, how can I add multiple columns to a data frame? withColumn allows adding one column, but when I have multiple, do I have to loop over each column? Thanks
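A small sketch of the two usual patterns: a withColumn loop, or a single select that appends all the new expressions in one projection (names are examples):

    from pyspark.sql.functions import col, lit

    df = sqlContext.createDataFrame([(1,), (2,)], ["x"])

    # pattern 1: loop over (name, expression) pairs
    for name, expr in [("x2", col("x") * 2), ("flag", lit(1))]:
        df = df.withColumn(name, expr)

    # pattern 2: one select, one projection
    df2 = df.select("*", (col("x") * 3).alias("x3"))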

Re: categoricalFeaturesInfo

2016-06-24 Thread pseudo oduesp
,f_index)) like that I keep the order of the variables; in this order I have all f_index from 517 to 824, but when I create the LabeledPoint I lose this order and I lose the int type. 2016-06-24 9:40 GMT+02:00 pseudo oduesp <pseudo20...@gmail.com>: > Hi, > how i can keep type of my variable like int

categoricalFeaturesInfo

2016-06-24 Thread pseudo oduesp
Hi, how can I keep the type of my variable as int? Because I get this error when I call the random forest algorithm with model = RandomForest.trainClassifier(rdf, numClasses=2, categoricalFeaturesInfo=d,

categoricalFeaturesInfo

2016-06-23 Thread pseudo oduesp
Hi, I am a pyspark user and I want to test the Random Forest algorithms. I found the parameter categoricalFeaturesInfo; how can I build it from a list of categorical variables? Thanks.

feature importance or variable importance

2016-06-21 Thread pseudo oduesp
Hi, I am a pyspark user and I want to extract the variable importance from a random forest model for plotting. How can I do that? Thanks

LabeledPoint

2016-06-21 Thread pseudo oduesp
Hi, I am a pyspark user and I want to test Random Forest. I have a data frame with 100 columns; should I give an RDD or a data frame to the algorithm? I transformed my data frame to only two columns, label and features: df.label df.features 0 (517,(0,1,2,333,56 ... 1
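A sketch of the conversion the RDD-based mllib algorithms expect, an RDD of LabeledPoint built from the two columns (the tiny 517-dimensional frame is a stand-in):

    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint

    df = sqlContext.createDataFrame(
        [(0.0, Vectors.sparse(517, {0: 1.0})),
         (1.0, Vectors.dense([0.0] * 517))],
        ["label", "features"])
    train = df.rdd.map(lambda row: LabeledPoint(row.label, row.features))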

)

2016-06-21 Thread pseudo oduesp
Hi, please help me resolve this issue.

cast only some columns

2016-06-21 Thread pseudo oduesp
Hi, with fillna we can choose some columns and replace some values by passing a dict {column: value}, but how can I do the same with cast? I have a data frame with 300 columns, and I want to cast just 4 from a list of columns, but with a select query like that:
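A small sketch: there is no dict form like fillna's, but a withColumn loop casts only the listed columns and leaves the others untouched (names are examples):

    df = sqlContext.createDataFrame([("1", "2", "x")], ["c1", "c2", "other"])
    for c in ["c1", "c2"]:                 # only these columns change type
        df = df.withColumn(c, df[c].cast("double"))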

read.parquet or read.load

2016-06-21 Thread pseudo oduesp
Hi, I am really angry about parquet files: each time I get an error like Could not read footer: java.lang.RuntimeException: or an error occurring in o127.load. Why do we have so many issues with this format? Thanks

Unable to acquire bytes of memory

2016-06-20 Thread pseudo oduesp
Hi, I have no idea why I get this error: Py4JJavaError: An error occurred while calling o69143.parquet. : org.apache.spark.SparkException: Job aborted. at

plot important variables in pyspark

2016-06-19 Thread pseudo oduesp
Hi, how can I get a score for each row from classification algorithms, and how can I plot the feature importance of the variables, like scikit-learn? Thanks.

binding two data frames

2016-06-17 Thread pseudo oduesp
Hi, in R we have functions named cbind and rbind for data frames; how can I reproduce these functions in pyspark? Given df1.col1 df1.col2 df1.col3 and df2.col1 df2.col2 df2.col3, the final result is a new data frame: df1.col1 df1.col2 df1.col3 df2.col1 df2.col2 df2.col3. Thanks
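A hedged sketch: rbind maps to unionAll (renamed union in 2.0+, schemas must match), while cbind has no direct equivalent because Spark rows carry no order; the common trick is indexing both frames with zipWithIndex and joining, with the caveat that the pairing is only as stable as the upstream partitioning:

    from pyspark.sql import Row

    df1 = sqlContext.createDataFrame([Row(a=1), Row(a=2)])
    df2 = sqlContext.createDataFrame([Row(b="x"), Row(b="y")])

    stacked = df1.unionAll(df1)            # rbind: stack identically-shaped rows

    def with_index(df):                    # cbind: index both sides, then join
        return df.rdd.zipWithIndex().map(
            lambda r: Row(idx=r[1], **r[0].asDict())).toDF()

    side_by_side = with_index(df1).join(with_index(df2), "idx").drop("idx")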

update data frame inside function

2016-06-17 Thread pseudo oduesp
Hi, how can I update a data frame inside a function? Why? I have to apply StringIndexer multiple times. I tried Pipeline, but it is still extremely slow for 84 columns to be string-indexed, each with 10 modalities, on a data frame with 21 million rows: I need 15 hours of processing. Now I want to try

Stringindexers on multiple columns >1000

2016-06-17 Thread pseudo oduesp
Hi, I want to apply string indexers on multiple columns, but when I use StringIndexer and Pipeline it takes a long time. indexer = StringIndexer(inputCol="Feature1", outputCol="indexed1") is practical for one, two, or ten lines, but when you have more than 1000 lines, how can you do it? Thanks
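A sketch of the usual way to avoid the 1000 hand-written lines: build the indexers in a comprehension and hand the whole list to one Pipeline (the three columns stand in for the real list); it removes the boilerplate, though not the cost of fitting that many indexers:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer

    cat_cols = ["Feature1", "Feature2", "Feature3"]   # really 1000+ names
    df = sqlContext.createDataFrame([("a", "x", "k"), ("b", "y", "k")],
                                    cat_cols)
    indexers = [StringIndexer(inputCol=c, outputCol=c + "_indexed")
                for c in cat_cols]
    indexed = Pipeline(stages=indexers).fit(df).transform(df)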

difference between DataFrame and DataFrameWriter

2016-06-16 Thread pseudo oduesp
Hi, what is the difference between DataFrame and DataFrameWriter?
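A minimal sketch of the distinction: a DataFrame is the data plus its query plan, while df.write hands back a DataFrameWriter, a companion object that only configures and performs output:

    df = sqlContext.createDataFrame([(1, "a")], ["id", "tag"])
    writer = df.write                  # a DataFrameWriter, not a DataFrame
    writer.mode("overwrite").parquet("/tmp/out")   # example path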

Re: advise please

2016-06-16 Thread pseudo oduesp
Hi, I use pyspark 1.5.0 on a YARN cluster with 19 nodes, 200 GB and 4 cores each (including the driver). 2016-06-16 15:42 GMT+02:00 pseudo oduesp <pseudo20...@gmail.com>: > Hi , > who i can dummies large set of columns with STRINGindexer fast ? > becasue i tested with 89 values an

advise please

2016-06-16 Thread pseudo oduesp
Hi, how can I dummy-encode a large set of columns with StringIndexer fast? Because I tested with 89 variables, each with at most 10 distinct values, and it takes a lot of time. Thanks

cache dataframe

2016-06-16 Thread pseudo oduesp
Hi, if I cache a data frame and then transform it and add columns, should I cache it a second time? df.cache() -> transformation, add new columns -> df.cache() ?

String indexer

2016-06-16 Thread pseudo oduesp
Hi, what is the limit on the number of modalities in StringIndexer? If I have a column with 1000 modalities, is it good to use StringIndexer, or should I try another function, and which one, please? Thanks

StringIndexer

2016-06-16 Thread pseudo oduesp
Hi, I have a data frame with 1000 columns to dummy-encode with StringIndexer. When I apply a pipeline it takes a long time when I want to merge the result with the other data frame, I mean: the original data frame + the columns indexed by StringIndexer. Problem: the save stage is long. Why? code: indexers =

vectors inside columns

2016-06-15 Thread pseudo oduesp
Hi, I want to ask a question about vectors (dense or sparse): imagine I have a data frame where one of the columns contains vectors. My question: can I give this column to machine learning algorithms as one value? df.col1 | df.col2 | 1 | (1,[2],[3] ,[] ...[6]) 2 | (1,[5],[3]

MatchError: StringType

2016-06-14 Thread pseudo oduesp
Hello, why do I get this error when using assembler = VectorAssembler(inputCols=l_CDMVT, outputCol="aev"+"CODEM"); output = assembler.transform(df_aev); l_CDMVT is a list of columns. Thanks
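A hedged sketch of the likely cause: VectorAssembler accepts only numeric, boolean, and vector input columns, so a StringType column in inputCols raises exactly this scala.MatchError; cast numeric strings and index categorical ones first (names are examples):

    from pyspark.ml.feature import StringIndexer, VectorAssembler

    df = sqlContext.createDataFrame([("1.5", "a"), ("2.0", "b")],
                                    ["num_s", "cat"])
    df = df.withColumn("num", df["num_s"].cast("double"))
    df = StringIndexer(inputCol="cat",
                       outputCol="cat_ix").fit(df).transform(df)

    assembler = VectorAssembler(inputCols=["num", "cat_ix"],
                                outputCol="features")
    output = assembler.transform(df)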

data frame or RDD for machine learning

2016-06-09 Thread pseudo oduesp
Hi, since Spark 1.3 we have data frames (thank goodness) instead of just RDDs. In machine learning algorithms, should we give them an RDD or a data frame? I mean, when I build a model: model = algorithm(rdd) or model = algorithm(df)? If you have an example with data frames, I prefer to work with

oozie and spark on yarn

2016-06-08 Thread pseudo oduesp
Hi, I want to ask if someone has used Oozie with Spark. If you can, give me an example: how can we configure it on YARN? Thanks

np.unique and collect

2016-06-03 Thread pseudo oduesp
Hi, why does np.unique return an array instead of a list in this function? def unique_item_df(df,list_var): l = df.select(list_var).distinct().collect() return np.unique(l) (df is a data frame and list_var is a list of variables; pyspark code). Thanks.
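A small sketch of the explanation: np.unique always returns a (flattened, sorted) NumPy ndarray, never a list, whatever it is fed; appending .tolist() gives the plain list back:

    import numpy as np

    def unique_item_df(df, list_var):
        rows = df.select(list_var).distinct().collect()
        return np.unique(rows).tolist()   # ndarray -> plain Python list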

hivecontext and date format

2016-06-01 Thread pseudo oduesp
Hi, can I ask how we can convert a string like dd/mm/yyyy to a date type in HiveContext? I tried with unix_timestamp and with date format, but I get null. Thank you.
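A sketch of the pattern that usually works here: unix_timestamp with an explicit pattern, then a cast to date; note the uppercase MM, since lowercase mm means minutes and is a classic source of the nulls:

    from pyspark.sql.functions import from_unixtime, unix_timestamp

    df = sqlContext.createDataFrame([("25/12/2015",)], ["ds"])
    df = df.withColumn(
        "d",
        from_unixtime(unix_timestamp(df["ds"], "dd/MM/yyyy")).cast("date"))
    df.show()   # 2015-12-25 as a proper date column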

equivalent between SQL join and data frame join

2016-05-30 Thread pseudo oduesp
Hi guys, is it the same thing to do sqlContext.sql("select * from t1 join t2 on condition") and df1.join(df2, condition, 'inner')? PS: df1.registerTempTable('t1'); df2.registerTempTable('t2'). Thanks
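A minimal sketch of the two spellings side by side; for the same condition they produce the same plan:

    df1 = sqlContext.createDataFrame([(1, "a")], ["id", "x"])
    df2 = sqlContext.createDataFrame([(1, "b")], ["id", "y"])

    df1.registerTempTable("t1")
    df2.registerTempTable("t2")
    via_sql = sqlContext.sql("SELECT * FROM t1 JOIN t2 ON t1.id = t2.id")
    via_api = df1.join(df2, df1["id"] == df2["id"], "inner")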

never understand

2016-05-25 Thread pseudo oduesp
Hi guys, I get these errors with pyspark 1.5.0 under Cloudera CDH 5.5 (YARN). I use YARN to deploy jobs on the cluster. I use HiveContext and parquet files to save my data. The container limit is 16 GB; the executor memory I tested before is 12 GB. I tried to increase the number of

origin of error

2016-05-15 Thread pseudo oduesp
Can someone help me with this issue? py4j.protocol.Py4JJavaError: An error occurred while calling o126.parquet. : org.apache.spark.SparkException: Job aborted. at

Py4JJavaError: An error occurred while calling o115.parquet. _metadata is not a Parquet file (too small)

2016-04-13 Thread pseudo oduesp
Hi guys, I have this error after 5 hours of processing. I make a lot of joins, 14 left joins with small tables. I saw in the Spark UI and console log that everything was OK, but when it saves the last join I get this error: Py4JJavaError: An error occurred while calling o115.parquet. _metadata is not a Parquet

multiple tables for join

2016-03-24 Thread pseudo oduesp
Hi, I spent two months of my time making 10 joins with the following tables: 1 GB table 1, 3 GB table 2, 500 MB table 3, 400 MB table 4, 20 MB table 5, 100 MB table 6, 30 MB table 7, 40 MB table 8, 700 MB table 9, 800 MB table 10. I use hiveContext.sql("select * from table1