Re: Spark 2.1.1: A bug in org.apache.spark.ml.linalg.* when using VectorAssembler.scala

2017-07-13 Thread Yan Facai
Hi, junjie. As Nick said, spark.ml indeed contains Vector, Vectors and VectorUDT by itself; see mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala:36: `sealed trait Vector extends Serializable`. So, which bug did you find with VectorAssembler? Could you give more details?

Re: [ML] Stop conditions for RandomForest

2017-06-28 Thread Yan Facai
It seems that splitting will always stop when the count of nodes is less than max(X, Y). Hence, are they different? On Tue, Jun 27, 2017 at 11:07 PM, OBones wrote: > Hello, > > Reading around on the theory behind tree based regression, I concluded > that there are various reasons to
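For reference, a minimal sketch of the stop-condition parameters that spark.ml tree learners expose (the values are illustrative, not recommendations):

```scala
import org.apache.spark.ml.regression.RandomForestRegressor

// A minimal sketch; values are illustrative.
val rf = new RandomForestRegressor()
  .setMaxDepth(10)            // stop splitting once a node reaches this depth
  .setMinInstancesPerNode(5)  // a split must leave at least 5 rows in each child
  .setMinInfoGain(0.0)        // a split must improve impurity by at least this much
```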

Re: Best alternative for Category Type in Spark Dataframe

2017-06-17 Thread Yan Facai
, then > having transformers and set it into pipeline stages ,would be better option > . > > Regards > Pralabh Kumar > > On Sun, Jun 18, 2017 at 4:23 AM, 颜发才(Yan Facai) <facai@gmail.com> > wrote: > >> To filter data, how about using sql? >> >> df.cre

Re: Best alternative for Category Type in Spark Dataframe

2017-06-17 Thread Yan Facai
;> >>> >>> On 16-Jun-2017 7:40 PM, "Saatvik Shah" <saatvikshah1...@gmail.com> >>> wrote: >>> >>>> Hi Yan, >>>> >>>> Basically the reason I was looking for the categorical datatype is as >>&

Re: Best alternative for Category Type in Spark Dataframe

2017-06-15 Thread Yan Facai
You can use some Transformers to handle categorical data. For example, StringIndexer encodes a string column of labels to a column of label indices: http://spark.apache.org/docs/latest/ml-features.html#stringindexer On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994
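A minimal sketch of the StringIndexer usage mentioned above (the DataFrame and column names are hypothetical):

```scala
import org.apache.spark.ml.feature.StringIndexer

// Encode a string column of labels into a column of label indices.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
val indexed = indexer.fit(df).transform(df)
```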

Re: [How-To] Migrating from mllib.tree.DecisionTree to ml.regression.DecisionTreeRegressor

2017-06-15 Thread Yan Facai
Hi, OBones. 1. which columns are features? For ml, use `setFeaturesCol` and `setLabelCol` to assign the input columns: https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.ml.classification.DecisionTreeClassifier 2. which ones are categorical? For ml, use a Transformer to create
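For illustration, a minimal sketch of assigning the input columns (column names and the training DataFrame are hypothetical):

```scala
import org.apache.spark.ml.regression.DecisionTreeRegressor

// Point the estimator at the feature vector column and the label column.
val dt = new DecisionTreeRegressor()
  .setFeaturesCol("features")
  .setLabelCol("label")
val model = dt.fit(trainingData)  // trainingData: a DataFrame with those columns
```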

Re: LibSVM should have just one input file

2017-06-11 Thread Yan Facai
Hi, yaphet. It seems that the code you pasted should be located in LibSVM, rather than SVM. Do I misunderstand? For LibSVMDataSource: 1. if numFeatures is unspecified, only one file is a valid input. val df = spark.read.format("libsvm") .load("data/mllib/sample_libsvm_data.txt") 2. otherwise,
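A minimal sketch of the two cases above (the numFeatures value and paths are illustrative):

```scala
// 1. numFeatures unspecified: a single input file is expected; the dimension
//    is inferred from the data by an extra pass.
val df1 = spark.read.format("libsvm")
  .load("data/mllib/sample_libsvm_data.txt")

// 2. numFeatures specified: multiple files (e.g. a directory) can be loaded.
val df2 = spark.read.format("libsvm")
  .option("numFeatures", "780")
  .load("data/mllib/")
```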

Re: Convert the feature vector to raw data

2017-06-07 Thread Yan Facai
Hi, kumar. How about removing the `select` in your code? namely, Dataset result = model.transform(testData); result.show(1000, false); On Wed, Jun 7, 2017 at 5:00 PM, kundan kumar wrote: > I am using > > Dataset result =

Re: Adding header to an rdd before saving to text file

2017-06-05 Thread Yan Facai
Hi, upendra. It will be easier to use DataFrame to read/save a CSV file with a header, if you'd like. On Tue, Jun 6, 2017 at 5:15 AM, upendra 1991 wrote: > I am reading a CSV(file has headers header 1st,header2) and generating > rdd, > After few transformations I
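A minimal sketch of the DataFrame approach (paths are hypothetical):

```scala
// Read a CSV with a header row, then write it back out with the header preserved.
val df = spark.read.option("header", "true").csv("input.csv")
df.write.option("header", "true").csv("output_dir")
```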

Re: Sharing my DataFrame (DataSet) cheat sheet.

2017-05-25 Thread Yan Facai
Thanks, Yuhao. Similarly, I wrote a "10 minutes to Spark DataFrame" guide to share the code snippets I collected: + https://github.com/facaiy/Spark-for-the-Impatient/blob/master/doc/10_minuters_to_spark_dataframe.md + https://facaiy.github.io/misc/2017/05/24/collection-of-spark-doc.html I hope

Re: Could any one please tell me why this takes forever to finish?

2017-05-01 Thread Yan Facai
Hi, 10.x.x.x is a private network address, see https://en.wikipedia.org/wiki/IP_address. You should use the public IP of your AWS instance. On Sat, Apr 29, 2017 at 6:35 AM, Yuan Fang wrote: > > object SparkPi { > private val logger = Logger(this.getClass) > > val sparkConf = new

Re: one hot encode a column of vector

2017-04-24 Thread Yan Facai
How about using CountVectorizer? http://spark.apache.org/docs/latest/ml-features.html#countvectorizer On Tue, Apr 25, 2017 at 9:31 AM, Zeming Yu wrote: > how do I do one hot encode on a column of array? e.g. ['TG', 'CA'] > > > FYI here's my code for one hot encoding
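A minimal sketch of CountVectorizer on an array-of-strings column (column names are hypothetical):

```scala
import org.apache.spark.ml.feature.CountVectorizer

// Encode e.g. ['TG', 'CA'] into a sparse vector of token counts.
val cv = new CountVectorizer()
  .setInputCol("codes")
  .setOutputCol("codeVec")
val encoded = cv.fit(df).transform(df)
```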

Re: how to add new column using regular expression within pyspark dataframe

2017-04-24 Thread Yan Facai
; > flight2 = (flight2.withColumn('stop_duration1', > udfGetMinutes(flight2.stop_duration1)) > ) > > > > On Sat, Apr 22, 2017 at 8:51 PM, 颜发才(Yan Facai) <facai@gmail.com> > wrote: > >> Hi, Zeming. >> >> I prefer to convert String to DateTime, like this:

Re: Spark-shell's performance

2017-04-20 Thread Yan Facai
Hi, Hanson. Perhaps I’m digressing here. If I'm wrong or mistaken, please correct me. SPARK_WORKER_* is the configuration for the whole cluster, and it's fine to write those global variables in spark-env.sh. However, SPARK_DRIVER_* and SPARK_EXECUTOR_* are the configuration for the application (your code),
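For illustration, a minimal sketch of per-application settings (values are illustrative); the worker-level SPARK_WORKER_* variables stay in spark-env.sh:

```scala
import org.apache.spark.SparkConf

// A sketch: per-application executor settings go into SparkConf or spark-submit --conf,
// rather than spark-env.sh. (Driver memory is usually passed to spark-submit, since the
// driver JVM is already running by the time this SparkConf is read.)
val conf = new SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.executor.cores", "2")
```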

Re: how to add new column using regular expression within pyspark dataframe

2017-04-19 Thread Yan Facai
How about using `withColumn` and UDF? example: + https://gist.github.com/zoltanctoth/2deccd69e3d1cde1dd78 + https://ragrawal.wordpress.com/2015/10/02/spark-custom-udf-example/ On Mon, Apr 17, 2017 at 8:25 PM, Zeming Yu

Re: How to convert Spark MLlib vector to ML Vector?

2017-04-09 Thread Yan Facai
-classification-regression.html. On Mon, Apr 10, 2017 at 1:45 PM, 颜发才(Yan Facai) <facai@gmail.com> wrote: > how about using > > val dataset = spark.read.format("libsvm") > .option("numFeatures", "780") > .load("data/mll

Re: How to convert Spark MLlib vector to ML Vector?

2017-04-09 Thread Yan Facai
how about using val dataset = spark.read.format("libsvm") .option("numFeatures", "780") .load("data/mllib/sample_libsvm_data.txt") instead of val dataset = MLUtils.loadLibSVMFile(spark.sparkContext, "data/mnist.bz2") On Mon, Apr 10, 2017 at 11:19 AM, Ryan wrote:

Re: Returning DataFrame for text file

2017-04-06 Thread Yan Facai
SparkSession.read returns a DataFrameReader. DataFrameReader supports a series of formats, such as csv, json, and text, as you mentioned. Check the API for more details: + http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SparkSession +
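A minimal sketch of the formats mentioned above (paths are hypothetical):

```scala
val csvDF  = spark.read.option("header", "true").csv("data.csv")
val jsonDF = spark.read.json("data.json")
val textDF = spark.read.text("data.txt")  // a single string column named "value"
```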

Re: Master-Worker communication on Standalone cluster issues

2017-04-06 Thread Yan Facai
1. For worker and master: spark.worker.timeout 60s see: http://spark.apache.org/docs/latest/spark-standalone.html 2. For executor and driver: spark.executor.heartbeatInterval 10s see: http://spark.apache.org/docs/latest/configuration.html Please correct me if I'm wrong. On Thu, Apr 6, 2017
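For reference, a minimal sketch of how the two settings might appear in spark-defaults.conf on the cluster (the values shown are the defaults quoted above):

```
spark.worker.timeout              60
spark.executor.heartbeatInterval  10s
```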

Re: Why chinese character gash appear when i use spark textFile?

2017-04-05 Thread Yan Facai
Perhaps your file is not utf-8. I cannot reproduce it. ### HADOOP: ~/Downloads ❯❯❯ hdfs -cat hdfs:///test.txt 17/04/06 13:43:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 1.0 862910025238798

Re: Read file and represent rows as Vectors

2017-04-05 Thread Yan Facai
You can try the `mapPartitions` method. An example is here: http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#mapPartitions On Mon, Apr 3, 2017 at 8:05 PM, Old-School wrote: > I have a dataset that contains DocID, WordID and frequency (count) as
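A minimal sketch of `mapPartitions` for this use case; it assumes `lines` is an RDD[String] with a whitespace-separated (docId, wordId, count) layout, which is my assumption:

```scala
// Any per-partition setup cost is paid once per partition, not once per record.
val triples = lines.mapPartitions { iter =>
  iter.map { line =>
    val Array(docId, wordId, count) = line.split("\\s+")
    (docId.toInt, wordId.toInt, count.toLong)
  }
}
```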

Re: Spark data frame map problem

2017-03-23 Thread Yan Facai
Could you give more details of your code? On Wed, Mar 22, 2017 at 2:40 AM, Shashank Mandil wrote: > Hi All, > > I have a spark data frame which has 992 rows inside it. > When I run a map on this data frame I expect that the map should work for > all the 992 rows. >

Re: How does preprocessing fit into Spark MLlib pipeline

2017-03-20 Thread Yan Facai
SQLTransformer is a good solution if all operations can be expressed in SQL. By the way, if you'd like to get your hands dirty, writing a Transformer in Scala is not hard, and multiple output columns are valid in such a case. On Fri, Mar 17, 2017 at 9:10 PM, Yanbo Liang wrote: > Hi
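A minimal sketch of SQLTransformer (column names are hypothetical); `__THIS__` stands for the input DataFrame, so the stage fits into a Pipeline:

```scala
import org.apache.spark.ml.feature.SQLTransformer

val sqlTrans = new SQLTransformer()
  .setStatement("SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
val transformed = sqlTrans.transform(df)
```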

Re: how to retain part of the features in LogisticRegressionModel (spark2.0)

2017-03-20 Thread Yan Facai
Hi, jinhong. Do you use `setRegParam`, which is 0.0 by default? Both elasticNetParam and regParam are required if regularization is needed. val regParamL1 = $(elasticNetParam) * $(regParam) val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam) On Mon, Mar 20, 2017 at 6:31 PM, Yanbo Liang
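A minimal sketch of setting both parameters (the values are illustrative):

```scala
import org.apache.spark.ml.classification.LogisticRegression

// regParam sets the overall regularization strength;
// elasticNetParam mixes L1 (1.0) and L2 (0.0), per the formulas above.
val lr = new LogisticRegression()
  .setRegParam(0.1)
  .setElasticNetParam(1.0)  // pure L1, which can drive some coefficients to zero
```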

Re: How to improve performance of saveAsTextFile()

2017-03-11 Thread Yan Facai
How about increasing the RDD's number of partitions / rebalancing the data? On Sat, Mar 11, 2017 at 2:33 PM, Parsian, Mahmoud wrote: > How to improve performance of JavaRDD.saveAsTextFile(“hdfs://…“). > This is taking over 30 minutes on a cluster of 10 nodes. > Running Spark on YARN. > >
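A minimal sketch (the partition count and path are illustrative):

```scala
// Spread the write across more tasks so no single executor writes most of the data.
rdd.repartition(100).saveAsTextFile("hdfs:///path/to/output")
```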

Re: org.apache.spark.SparkException: Task not serializable

2017-03-11 Thread Yan Facai
For Scala, make your class Serializable, like this: ``` class YourClass extends Serializable {} ``` On Sat, Mar 11, 2017 at 3:51 PM, 萝卜丝炒饭 <1427357...@qq.com> wrote: > hi mina, > > can you paste your new code here please? > i meet this issue too but do not get Ankur's idea. > > thanks >

Re: Sharing my DataFrame (DataSet) cheat sheet.

2017-03-05 Thread Yan Facai
Thanks, very useful! On Sun, Mar 5, 2017 at 4:55 AM, Yuhao Yang wrote: > > Sharing some snippets I accumulated during developing with Apache Spark > DataFrame (DataSet). Hope it can help you in some way. > > https://github.com/hhbyyh/DataFrameCheatSheet. > > [image: inline image 1] >

Re: attempting to map Dataset[Row]

2017-02-27 Thread Yan Facai
Hi, Fletcher. A case class can help construct a complex structure; RDD, StructType and StructField are also helpful if you need them. However, the code is a little confusing: source.map{ row => { val key = row(0) val buff = new ArrayBuffer[Row]() buff += row (key,buff)

Re: Spark runs out of memory with small file

2017-02-26 Thread Yan Facai
Hi, Tremblay. Your file is in .gz format, which is not splittable for Hadoop. Perhaps the file is loaded by only one executor. How many executors did you start? Perhaps the repartition method could solve it, I guess. On Sun, Feb 26, 2017 at 3:33 AM, Henry Tremblay wrote: > I

Re: question on SPARK_WORKER_CORES

2017-02-18 Thread Yan Facai
Hi, kodali. SPARK_WORKER_CORES is designed for the cluster resource manager; see http://spark.apache.org/docs/latest/cluster-overview.html if interested. For standalone mode, you should use the following 3 arguments to allocate resources for normal Spark tasks: - --executor-memory -

Re: Query data in subdirectories in Hive Partitions using Spark SQL

2017-02-17 Thread Yan Facai
Hi, Abdelfatah, How do you read these files? spark.read.parquet or spark.sql? Could you show some code? On Wed, Feb 15, 2017 at 8:47 PM, Ahmed Kamal Abdelfatah < ahmed.abdelfa...@careem.com> wrote: > Hi folks, > > > > How can I force spark sql to recursively get data stored in parquet format >

Re: How to specify default value for StructField?

2017-02-17 Thread Yan Facai
I agree with Yong Zhang; perhaps Spark SQL with Hive could solve the problem: http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables On Thu, Feb 16, 2017 at 12:42 AM, Yong Zhang wrote: > If it works under hive, do you try just create the DF from

Re: How to convert RDD to DF for this case -

2017-02-17 Thread Yan Facai
Hi, Basu, if all columns are separated by the delimiter "\t", the CSV parser might be a better choice. For example: ```scala spark.read .option("sep", "\t") .option("header", false) .option("inferSchema", true) .csv("/user/root/spark_demo/scala/data/Stations.txt") ```

Re: is dataframe thread safe?

2017-02-12 Thread Yan Facai
DataFrame is immutable, so it should be thread safe, right? On Sun, Feb 12, 2017 at 6:45 PM, Sean Owen wrote: > No this use case is perfectly sensible. Yes it is thread safe. > > > On Sun, Feb 12, 2017, 10:30 Jörn Franke wrote: > >> I think you should

Re: Kryo On Spark 1.6.0

2017-01-14 Thread Yan Facai
For Scala, you could fix it by using: conf.registerKryoClasses(Array(Class.forName("scala.collection.mutable.WrappedArray$ofRef"))) By the way, if the class is an array of a Java primitive class, say byte[], then use: Class.forName("[B") If it is an array of another class, say

Re: Spark 2.0.2, KyroSerializer, double[] is not registered.

2017-01-08 Thread Yan Facai
Hi, Owen, it was fixed after registering manually: conf.registerKryoClasses(Array(Class.forName("[D"))) I believe that Kryo (latest version) has supported double[]: addDefaultSerializer(double[].class, DoubleArraySerializer.class);

Spark 2.0.2, KyroSerializer, double[] is not registered.

2017-01-07 Thread Yan Facai
Hi, all. I enabled Kryo in Spark with spark-defaults.conf: spark.serializer org.apache.spark.serializer.KryoSerializer spark.kryo.registrationRequired true A KryoException is raised when a logistic regression algorithm is running. Note: To register this class use:

Re: Best practice for preprocessing feature with DataFrame

2016-11-22 Thread Yan Facai
('gender === 1, "male").otherwise(when('gender === 2, > > "female").otherwise("unknown")) as "gender" > > ) > > modifiedRows.show > > > > +---+---+ > > |age| gender| > > +---+---+ > > | 90| male|
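For readability, a minimal sketch of the when/otherwise approach quoted above (it assumes `import spark.implicits._` for the symbol column syntax, and a DataFrame `df` with the columns from the thread):

```scala
import org.apache.spark.sql.functions.when

// Map the numeric gender codes to labels in place.
val modifiedRows = df.select(
  'age,
  when('gender === 1, "male")
    .otherwise(when('gender === 2, "female").otherwise("unknown")) as "gender"
)
modifiedRows.show()
```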

Re: Best practice for preprocessing feature with DataFrame

2016-11-17 Thread Yan Facai
g/apache/ > spark/sql/functions.html > > Hope this helps . > > Thanks, > Divya > > > On 17 November 2016 at 12:08, 颜发才(Yan Facai) <yaf...@gmail.com> wrote: > >> Hi, >> I have a sample, like: >> +---+--++ >> |age|gender|

Best practice for preprocessing feature with DataFrame

2016-11-16 Thread Yan Facai
Hi, I have a sample, like:
+---+------+--------------------+
|age|gender|             city_id|
+---+------+--------------------+
|   |     1|1042015:city_2044...|
|90s|     2|1042015:city_2035...|
|80s|     2|1042015:city_2061...|
+---+------+--------------------+
and expectation is: "age": 90s

Re: Spark joins using row id

2016-11-13 Thread Yan Facai
A pair RDD can use (hash) partitioner information to do some optimizations when joined, while I am not sure whether a Dataset could. On Sat, Nov 12, 2016 at 7:11 PM, Rohit Verma wrote: > For datasets structured as > > ds1 > rowN col1 > 1 A > 2 B > 3 C > 4
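A minimal sketch of the pair-RDD optimization (the RDD names and partition count are hypothetical):

```scala
import org.apache.spark.HashPartitioner

// Co-partitioning both pair RDDs with the same partitioner lets the join
// reuse the partitioning instead of shuffling both sides again.
val partitioner = new HashPartitioner(8)
val left  = leftRDD.partitionBy(partitioner).cache()
val right = rightRDD.partitionBy(partitioner).cache()
val joined = left.join(right)
```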

Vector is not found in case class after import

2016-11-04 Thread Yan Facai
Hi, My spark-shell version is 2.0.1. I import Vector and hope to use it in a case class, but spark-shell throws an error: not found. scala> import org.apache.spark.ml.linalg.{Vector => OldVector} import org.apache.spark.ml.linalg.{Vector=>OldVector} scala> case class Movie(vec: OldVector)

Re: How to return a case class in map function?

2016-11-03 Thread Yan Facai
2.0.1 has fixed the bug. Thanks very much. On Thu, Nov 3, 2016 at 6:22 PM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote: > Thanks, Armbrust. > I'm using 2.0.0. > Does 2.0.1 stable version fix it? > > On Thu, Nov 3, 2016 at 2:01 AM, Michael Armbrust <mich...@databricks.com>

Re: How to return a case class in map function?

2016-11-03 Thread Yan Facai
Thanks, Armbrust. I'm using 2.0.0. Does 2.0.1 stable version fix it? On Thu, Nov 3, 2016 at 2:01 AM, Michael Armbrust <mich...@databricks.com> wrote: > Thats a bug. Which version of Spark are you running? Have you tried > 2.0.2? > > On Wed, Nov 2, 2016 at 12:01 AM, 颜

How to return a case class in map function?

2016-11-02 Thread Yan Facai
Hi, all. When I use a case class as the return value in a map function, Spark always raises a ClassCastException. I wrote a demo, like: scala> case class Record(key: Int, value: String) scala> case class ID(key: Int) scala> val df = Seq(Record(1, "a"), Record(2, "b")).toDF scala> df.map{x =>

Addition of two SparseVector

2016-11-01 Thread Yan Facai
Hi, all. How can I add a Vector to another one? scala> val a = Vectors.sparse(20, Seq((1,1.0), (2,2.0))) a: org.apache.spark.ml.linalg.Vector = (20,[1,2],[1.0,2.0]) scala> val b = Vectors.sparse(20, Seq((2,2.0), (3,3.0))) b: org.apache.spark.ml.linalg.Vector = (20,[2,3],[2.0,3.0]) scala> a + b

Re: How to iterate the element of an array in DataFrame?

2016-10-24 Thread Yan Facai
dFunction = UserDefinedFunction(,StringType,Some(List(ArrayType(StructType(StructField(category,StringType,true), StructField(weight,StringType,true)),true What is wrong with the udf function `testUDF`? On Tue, Oct 25, 2016 at 10:41 AM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote: >

Re: How to iterate the element of an array in DataFrame?

2016-10-24 Thread Yan Facai
(:27) ... Where did I go wrong? On Sat, Oct 22, 2016 at 6:37 AM, Cheng Lian <l...@databricks.com> wrote: > You may either use SQL function "array" and "named_struct" or define a > case class with expected field names. > > Cheng > > On 10/21/16 2:

Re: Re: How to iterate the element of an array in DataFrame?

2016-10-21 Thread Yan Facai
ue) > > 2016-10-21 > ---------- > lk_spark > -- > > *From:* 颜发才(Yan Facai) <yaf...@gmail.com> > *Sent:* 2016-10-21 15:35 > *Subject:* Re: How to iterate the element of an array in DataFrame? > *To:* "user.spark"<user@spark.apache.org> &

Re: How to iterate the element of an array in DataFrame?

2016-10-21 Thread Yan Facai
truct<_1:string,_2:string>>] How to express `struct<string, string>` ? On Thu, Oct 20, 2016 at 4:34 PM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote: > Hi, I want to extract the attribute `weight` of an array, and combine them > to construct a sparse ve

How to iterate the element of an array in DataFrame?

2016-10-20 Thread Yan Facai
Hi, I want to extract the attribute `weight` of an array, and combine them to construct a sparse vector. ### My data is like this: scala> mblog_tags.printSchema
root
 |-- category.firstCategory: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- category:

Re: Dataframe, Java: How to convert String to Vector ?

2016-10-02 Thread Yan Facai
, Peter Figliozzi <pete.figlio...@gmail.com> wrote: > I'm sure there's another way to do it; I hope someone can show us. I > couldn't figure out how to use `map` either. > > On Wed, Sep 21, 2016 at 3:32 AM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote: > >> Thanks,

Re: Spark ML Decision Trees Algorithm

2016-10-02 Thread Yan Facai
Perhaps the best way is to read the code. The decision tree is implemented as a 1-tree random forest, whose entry point is the `run` method: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L88 I'm not familiar with the so-called

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-21 Thread Yan Facai
oks like a Scala wildcard when we refer to > it in code > > val myConvert = (x: String) => { Vectors.parse(x) } > val myConvertUDF = udf(myConvert) > > val newDf = df.withColumn("parsed", myConvertUDF(col("_2"))) > > newDf.show() > > On Mon, Sep 19, 2016 at

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-19 Thread Yan Facai
umn. > > See here > <http://stackoverflow.com/questions/35227568/applying-function-to-spark-dataframe-column> > on how to apply a function to a Column to make a new column. > > On Tue, Sep 6, 2016 at 1:56 AM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote: > >> Hi, >&g

study materials for operators on Dataframe

2016-09-18 Thread Yan Facai
Hi, I am a newbie, and the official documentation of Spark is too concise for me, especially the introduction of operators on DataFrame. For Python, pandas gives very detailed documentation: [Pandas](http://pandas.pydata.org/pandas-docs/stable/index.html) So, does anyone know some sites or cookbooks

Re: Does it run distributed if class not Serializable

2016-09-10 Thread Yan Facai
I believe that Serializable is necessary for distributed execution. On Fri, Sep 9, 2016 at 7:10 PM, Gourav Sengupta wrote: > And you are using JAVA? > > AND WHY? > > Regards, > Gourav > > On Fri, Sep 9, 2016 at 11:47 AM, Yusuf Can Gürkan > wrote: >

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-08 Thread Yan Facai
o. I found it very helpful to study the Scala language > without Spark. The documentation found here > <http://docs.scala-lang.org/index.html> is excellent. > > Pete > > On Wed, Sep 7, 2016 at 1:39 AM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote: > >> Hi Peter, &

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-07 Thread Yan Facai
ood place to try things out. > > You can make a function out of this transformation and apply it to your > features column to make a new column. Then add this with > Dataset.withColumn. > > See here > <http://stackoverflow.com/questions/35227568/applying-function-to-spark-datafram

How to convert String to Vector ?

2016-09-06 Thread Yan Facai
Hi, I have a csv file like: uid mid features label 1235231[0, 1, 3, ...]True Both "features" and "label" columns are used for GBTClassifier. However, when I read the file: Dataset samples = sparkSession.read().csv(file); The type of samples.select("features") is

Dataframe, Java: How to convert String to Vector ?

2016-09-06 Thread Yan Facai
Hi, I have a csv file like: uid mid features label 1235231[0, 1, 3, ...]True Both "features" and "label" columns are used for GBTClassifier. However, when I read the file: Dataset samples = sparkSession.read().csv(file); The type of samples.select("features") is

[Spark 2.0] ClassNotFoundException is thrown when using Hive

2016-08-18 Thread Yan Facai
Hi, all. I copied hdfs-site.xml, core-site.xml and hive-site.xml to $SPARK_HOME/conf. And spark-submit is used to submit the task to YARN, running in **client** mode. However, a ClassNotFoundException is thrown. Some details of the logs are listed below: ``` 16/08/12 17:07:32 INFO hive.HiveUtils:

[Spark 2.0] spark.sql.hive.metastore.jars doesn't work

2016-08-12 Thread Yan Facai
Hi, everyone. According to the official guide, I copied hdfs-site.xml, core-site.xml and hive-site.xml to $SPARK_HOME/conf, and wrote code as below: ```Java SparkSession spark = SparkSession .builder() .appName("Test Hive for Spark")