Re: Best alternative for Category Type in Spark Dataframe

2017-06-15 Thread Yan Facai
You can use some Transformers to handle categorical data. For example, StringIndexer encodes a string column of labels to a column of label indices: http://spark.apache.org/docs/latest/ml-features.html#stringindexer On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994
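A minimal sketch of that suggestion (assuming a DataFrame df with a string column named "category"; the column names are illustrative):

    import org.apache.spark.ml.feature.StringIndexer

    // Encode the string column into numeric label indices.
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
    val indexed = indexer.fit(df).transform(df)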

Re: [How-To] Migrating from mllib.tree.DecisionTree to ml.regression.DecisionTreeRegressor

2017-06-15 Thread Yan Facai
Hi, OBones. 1. which columns are features? For ml, use `setFeaturesCol` and `setLabelCol` to assign input columns: https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.ml.classification.DecisionTreeClassifier 2. which ones are categorical? For ml, use Transformer to create
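A hedged sketch of the ml-style wiring described above, assuming a DataFrame df with feature columns x1, x2 and a label column named "label" (all names are made up for illustration):

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.regression.DecisionTreeRegressor

    // ml estimators expect the features assembled into a single vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("x1", "x2"))
      .setOutputCol("features")

    val dt = new DecisionTreeRegressor()
      .setFeaturesCol("features")
      .setLabelCol("label")

    val model = dt.fit(assembler.transform(df))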

Re: Re: Re: how to call udf with parameters

2017-06-15 Thread lk_spark
thanks Kumar, that is really helpful!! 2017-06-16 lk_spark From: Pralabh Kumar Sent: 2017-06-16 18:30 Subject: Re: Re: how to call udf with parameters To: "lk_spark" Cc: "user.spark" val getlength=udf((idx1:Int,idx2:Int, data :

Re: Re: how to call udf with parameters

2017-06-15 Thread Pralabh Kumar
val getlength = udf((idx1: Int, idx2: Int, data: String) => data.substring(idx1, idx2)) data.select(getlength(lit(1), lit(2), data("col1"))).collect On Fri, Jun 16, 2017 at 10:22 AM, Pralabh Kumar wrote: > Use lit, give me some time, I'll provide an example > > On 16-Jun-2017
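A self-contained version of that snippet, as a sketch (the column name col1 comes from the thread; the sample data is made up):

    import org.apache.spark.sql.functions.{lit, udf}
    import spark.implicits._  // spark is the active SparkSession

    val data = Seq("hello", "world").toDF("col1")

    // Constant arguments must be wrapped in lit() so that they become Columns.
    val getlength = udf((idx1: Int, idx2: Int, s: String) => s.substring(idx1, idx2))
    data.select(getlength(lit(1), lit(2), data("col1"))).collect()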

Re: Re: how to call udf with parameters

2017-06-15 Thread Pralabh Kumar
Use lit, give me some time, I'll provide an example. On 16-Jun-2017 10:15 AM, "lk_spark" wrote: > thanks Kumar, I want to know how to call a udf with multiple parameters, > maybe a udf to make a substr function; how can I pass parameters with begin > and end index? I tried it

Re: Re: how to call udf with parameters

2017-06-15 Thread lk_spark
thanks Kumar, I want to know how to call a udf with multiple parameters, maybe a udf to make a substr function; how can I pass parameters with begin and end index? I tried it and got errors. Can the udf parameters only be of Column type? 2017-06-16 lk_spark From: Pralabh Kumar

Re: how to call udf with parameters

2017-06-15 Thread Pralabh Kumar
Sample UDF: val getlength = udf((data: String) => data.length()) data.select(getlength(data("col1"))) On Fri, Jun 16, 2017 at 9:21 AM, lk_spark wrote: > hi, all > I define a udf with multiple parameters, but I don't know how to > call it with a DataFrame > > UDF: > > def ssplit2

how to call udf with parameters

2017-06-15 Thread lk_spark
hi, all I define a udf with multiple parameters, but I don't know how to call it with a DataFrame. UDF: def ssplit2 = udf { (sentence: String, delNum: Boolean, delEn: Boolean, minTermLen: Int) => val terms = HanLP.segment(sentence).asScala . Call: scala> val output =

spark-submit: file not found exception occurs

2017-06-15 Thread Shupeng Geng
Hi, everyone, An annoying problem occurs to me. When submitting a Spark job, a jar-file-not-found exception is thrown as follows: Exception in thread "main" java.io.FileNotFoundException: File file:/home/algo/shupeng/eeop_bridger/EeopBridger-1.0-SNAPSHOT.jar does not exist at

Re: featureSubsetStrategy parameter for GradientBoostedTreesModel

2017-06-15 Thread Pralabh Kumar
Hi everyone, Currently GBT doesn't expose featureSubsetStrategy the way Random Forest does. GradientBoostedTrees in Spark has the feature subset strategy hardcoded to "all" when calling random forest in DecisionTreeRegressor.scala: val trees = RandomForest.run(data, oldStrategy, numTrees = 1,
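For contrast, a sketch of the setter that Random Forest does expose in the ml API (GBT, at the time of this thread, has no equivalent):

    import org.apache.spark.ml.regression.RandomForestRegressor

    val rf = new RandomForestRegressor()
      .setFeatureSubsetStrategy("sqrt")  // also "all", "onethird", "log2", ...
    // GradientBoostedTrees offers no such setter here and effectively uses "all".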

Re: the dependence length of RDD, can its size be greater than 1 please?

2017-06-15 Thread 萝卜丝炒饭
Hi Owen, More questions about this topic. Is two the upper limit of the number of dependencies? In the code, the firstParent RDD is used to get partitions. Why is the first parent RDD used? If the first parent RDD has 5 partitions and the second has 6, is it reasonable to use the first? thanks

Re: Nested "struct" fonction call creates a compilation error in Spark SQL

2017-06-15 Thread Michael Armbrust
You might also try with a newer version. Several instances of code-generation failures have been fixed since 2.0. On Thu, Jun 15, 2017 at 1:15 PM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > Hi Michael, > Spark 2.0.2 - but I have a very interesting test case actually > The

Re: fetching and joining data from two different clusters

2017-06-15 Thread Jörn Franke
On HDFS you have storage policies where you can define SSD etc.: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html Not sure if this is similar to the offering you refer to. OpenStack Swift is similar to S3, but for your own data center

Re: fetching and joining data from two different clusters

2017-06-15 Thread Mich Talebzadeh
In Isilon etc. you have SSD, a middle layer, and an archive layer where data is moved. Can that be implemented in HDFS itself, Jörn? What is Swift? Is that a low-level archive disk? thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: fetching and joining data from two different clusters

2017-06-15 Thread Jörn Franke
Well, this also happens if you use Amazon EMR - most data will be stored on S3, where you also have no data locality. You can move it temporarily to HDFS or in-memory (Ignite), and you can use sampling etc. to avoid the need to process all the data. In fact, that is done in Spark machine learning

Re: fetching and joining data from two different clusters

2017-06-15 Thread Mich Talebzadeh
Thanks Jörn. If the idea is to separate compute from data using Isilon etc., then one is going to lose data locality. Also, the argument is that we would like to run queries/reports against two independent clusters simultaneously, so do this: 1. Use Isilon OneFS

Re: Nested "struct" fonction call creates a compilation error in Spark SQL

2017-06-15 Thread Michael Armbrust
Which version of Spark? If it's recent, I'd open a JIRA. On Thu, Jun 15, 2017 at 6:04 AM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: > Hi everyone, > when we create recursive calls to "struct" (up to 5 levels) for extending > a complex datastructure we end up with the following

Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-15 Thread Michael Armbrust
Continuous processing is still a work in progress. I would really like to at least have a basic version in Spark 2.3. The announcement about 2.2 is that we are planning to remove the experimental tag from Structured Streaming. On Thu, Jun 15, 2017 at 11:53 AM, kant kodali

Re: What is the real difference between Kafka streaming and Spark Streaming?

2017-06-15 Thread kant kodali
Wow! You caught the 007! Is continuous processing mode available in 2.2? The ticket says the target version is 2.3 but the talk in the video says 2.2 and beyond, so I am just curious if it is available in 2.2 or should I try it from the latest build? Thanks! On Wed, Jun 14, 2017 at 5:32 PM,

Re: [SparkSQL] Escaping a query for a dataframe query

2017-06-15 Thread Gourav Sengupta
It might be something that I am saying wrong, but sometimes it may just make sense to see the difference between ” and ": <”> is decimal 8221, hex 201d, octal 20035; <"> is decimal 34, hex 22, octal 042. Regards, Gourav On Thu, Jun 15, 2017 at 6:45 PM, Michael Mior wrote: > Assuming the

access a broadcasted variable from within ForeachPartitionFunction Java API

2017-06-15 Thread Anton Kravchenko
How would one access a broadcast variable from within a ForeachPartitionFunction in the Spark (2.0.1) Java API? Integer _bcv = 123; Broadcast bcv = spark.sparkContext().broadcast(_bcv); Dataset df_sql = spark.sql("select * from atable"); df_sql.foreachPartition(new ForeachPartitionFunction() {
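The question is about the Java API; for reference, a minimal Scala sketch of the usual pattern (capture the Broadcast handle in the closure and read .value on the executor), assuming a table named atable exists:

    import org.apache.spark.sql.{Row, SparkSession}

    val spark = SparkSession.builder().getOrCreate()
    val bcv = spark.sparkContext.broadcast(123)

    val df_sql = spark.sql("select * from atable")
    df_sql.foreachPartition { rows: Iterator[Row] =>
      val v = bcv.value  // read the broadcast value on the executor
      rows.foreach(r => println(s"$v -> $r"))
    }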

Re: [SparkSQL] Escaping a query for a dataframe query

2017-06-15 Thread Michael Mior
Assuming the parameter to your UDF should be start"end (with a quote in the middle) then you need to insert a backslash into the query (which must also be escaped in your code). So just add two extra backslashes before the quote inside the string. sqlContext.sql("SELECT * FROM mytable WHERE
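A sketch of what that looks like in the Scala source, using the table and UDF names from the original question; the doubled backslashes in the source become a single backslash in the SQL text, which the parser then reads as an escaped quote inside the literal:

    // Scala source literal:      \"start\\\"end\"
    // SQL text seen by Spark:    "start\"end"
    // String passed to the UDF:  start"end
    val query =
      "SELECT * FROM mytable WHERE (mycolumn BETWEEN 1 AND 2) AND myudfsearchfor(\"start\\\"end\")"
    sqlContext.sql(query)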

Serialization of fastutil reference collections

2017-06-15 Thread Leonid Toshchev
Hi all, As part of a memory optimization we are moving almost every collection to fastutil. It works fine with primitive-to-primitive collections (like Int2LongOpenHashMap, Long2LongOpenHashMap or IntArrayList), but as soon as we start to use Int2ReferenceOpenHashMap we get an error:

[SparkSQL] Escaping a query for a dataframe query

2017-06-15 Thread mark.jenki...@baesystems.com
Hi, I have a query sqlContext.sql("SELECT * FROM mytable WHERE (mycolumn BETWEEN 1 AND 2) AND (myudfsearchfor(\"start\"end\"))" How should I escape the double quote so that it successfully parses? I know I can use single quotes but I do not want to since I may need to search for a single

Re: fetching and joining data from two different clusters

2017-06-15 Thread Jörn Franke
It does not matter to Spark; you just put the HDFS URL of the namenode there. Of course, the issue is that you lose data locality, but this would also be the case for Oracle. > On 15. Jun 2017, at 18:03, Mich Talebzadeh wrote: > > Hi, > > With Spark how easy is it
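A sketch of what that looks like in practice (the namenode hostnames, ports and paths are made up):

    // Read from two different HDFS clusters by fully qualifying the namenode in each URL.
    val df1 = spark.read.parquet("hdfs://namenode-a:8020/warehouse/tableA")
    val df2 = spark.read.parquet("hdfs://namenode-b:8020/warehouse/tableB")

    // The join runs in one Spark cluster; locality for the remote cluster's blocks is lost.
    val joined = df1.join(df2, Seq("id"))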

fetching and joining data from two different clusters

2017-06-15 Thread Mich Talebzadeh
Hi, With Spark, how easy is it to fetch data from two different clusters and do a join? I can join two tables from two different Oracle instances in Spark by using two JDBC connections, creating two DataFrames and joining them together. Would that be possible for data residing on

Re: [How-To] Migrating from mllib.tree.DecisionTree to ml.regression.DecisionTreeRegressor

2017-06-15 Thread OBones
OBones wrote: So, I tried to rewrite my sample code using the ml package and it is much easier to use; there is no need for the LabeledPoint transformation. Here is the code I came up with: val dt = new DecisionTreeRegressor() .setPredictionCol("Y") .setImpurity("variance")

Best alternative for Category Type in Spark Dataframe

2017-06-15 Thread saatvikshah1994
Hi, I'm trying to convert a Pandas -> Spark dataframe. One of the columns I have is of the Category type in Pandas. But there does not seem to be support for this same type in Spark. What is the best alternative?

Nested "struct" fonction call creates a compilation error in Spark SQL

2017-06-15 Thread Olivier Girardot
Hi everyone, when we create recursive calls to "struct" (up to 5 levels) for extending a complex datastructure we end up with the following compilation error : org.codehaus.janino.JaninoRuntimeException: Code of method "(I[Lscala/collection/Iterator;)V" of class

Spark doesn't run all code when submitted in yarn-cluster mode.

2017-06-15 Thread Cosmin Posteuca
Hi, I have the following problem: after the SparkSession is initialized, I create a task: val task = new Runnable { } in which I make a REST API call, and from its response I read some data from the internet/ES/Hive. This task runs every 5 seconds with the Akka scheduler: scheduler.schedule( Duration(0,

Re: [Spark Sql/ UDFs] Spark and Hive UDFs parity

2017-06-15 Thread Georg Heiler
What about using mapPartitions instead? RD wrote on Thu., 15 June 2017 at 06:52: > Hi Spark folks, > > Is there any plan to support the richer UDF API that Hive supports for > Spark UDFs? Hive supports the GenericUDF API which has, among others, > methods like

Re: Create dataset from dataframe with missing columns

2017-06-15 Thread Riccardo Ferrari
Hi Jason, Is there a reason why you are not adding the desired column before mapping it to a Dataset[CC]? You could just do something like: df = df.withColumn('f2', ) and then: df.as(CC). Of course, your default value can be null: lit(None).cast(to-some-type). Best, On Thu, Jun 15, 2017 at
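Putting that suggestion together, a sketch using the case class from the thread (assumes spark and df are in scope; the null literal cast to string surfaces as None in the Option field):

    import org.apache.spark.sql.functions.lit
    import spark.implicits._

    case class CC(f1: String, f2: Option[String] = None)

    val ds = df
      .withColumn("f2", lit(null).cast("string"))  // placeholder value for the missing column
      .as[CC]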

[How-To] Migrating from mllib.tree.DecisionTree to ml.regression.DecisionTreeRegressor

2017-06-15 Thread OBones
Hello, I have written the following Scala code to train a regression tree, based on mllib: val conf = new SparkConf().setAppName("DecisionTreeRegressionExample") val sc = new SparkContext(conf) val spark = new SparkSession.Builder().getOrCreate() val sourceData =

Create dataset from dataframe with fewer columns

2017-06-15 Thread tokeman24
Is it possible to concisely create a dataset from a dataframe with fewer columns? Specifically, suppose I create a dataframe with: val df: DataFrame = Seq(("v1"),("v2")).toDF("f1") Then, I have a case class for a dataset defined as: case class CC(f1: String, f2: Option[String] = None) I’d like

Re: [How-To] Custom file format as source

2017-06-15 Thread OBones
Thanks to both of you, this should get me started.

Repartition vs PartitionBy Help/Understanding needed

2017-06-15 Thread Aakash Basu
Hi all, Everybody explains the difference between coalesce and repartition, but nowhere have I found the difference between partitionBy and repartition. My question is: is it better to write a data set to Parquet partitioned by a column and then read the respective directories to work on that column
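To make the two operations being compared concrete, a small sketch (paths and the column name are illustrative):

    import org.apache.spark.sql.functions.col

    // (a) Write-time partitioning: one sub-directory per distinct value of "country",
    //     which lets later reads prune directories for that column.
    df.write.partitionBy("country").parquet("/data/events_by_country")

    // (b) repartition: redistributes rows across tasks by "country" in memory,
    //     without changing the on-disk layout.
    val byCountry = df.repartition(col("country"))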

Re: the dependence length of RDD, can its size be greater than 1 please?

2017-06-15 Thread Sean Owen
Yes. Imagine an RDD that results from a union of other RDDs. On Thu, Jun 15, 2017, 09:11 萝卜丝炒饭 <1427357...@qq.com> wrote: > Hi all, > > The RDD code keeps a member as below: > dependencies_ : seq[Dependency[_]] > > It is a seq, that means it can keep more than one dependency. > > I have an issue
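A quick sketch that checks this in the shell:

    // A union has one dependency per parent RDD, so dependencies.size is 2 here.
    val a = sc.parallelize(1 to 10)
    val b = sc.parallelize(11 to 20)
    val u = a.union(b)
    println(u.dependencies.size)  // 2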

Re: the dependence length of RDD, can its size be greater than 1 please?

2017-06-15 Thread Reynold Xin
A join? On Thu, Jun 15, 2017 at 1:11 AM 萝卜丝炒饭 <1427357...@qq.com> wrote: > Hi all, > > The RDD code keeps a member as below: > dependencies_ : seq[Dependency[_]] > > It is a seq, that means it can keep more than one dependency. > > I have an issue about this. > Is it possible that its size is

the dependence length of RDD, can its size be greater than 1 please?

2017-06-15 Thread 萝卜丝炒饭
Hi all, The RDD code keeps a member as below: dependencies_ : seq[Dependency[_]] It is a seq, which means it can keep more than one dependency. I have a question about this: is it possible for its size to be greater than one? If yes, how can that be produced? Would you like to show me some