Re: Model abstract class in spark ml

2016-08-30 Thread Mohit Jaggi
I think I figured it out. There is indeed "something deeper in Scala" :-) abstract class A { def a: this.type } class AA(i: Int) extends A { def a = this } The above works OK. But if you return anything other than "this", you will get a compile error. abstract class A { def a: this.type
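A minimal sketch of the behavior described above (illustrative, not from the original mail):

    // this.type only accepts the literal `this` reference, so a method
    // declared to return this.type cannot return a new instance.
    abstract class A { def a: this.type }

    class AA(i: Int) extends A {
      def a = this          // compiles: `this` conforms to this.type
      // def a = new AA(i)  // would not compile: new AA(i) is not this.type
    }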

Re: Model abstract class in spark ml

2016-08-30 Thread Mohit Jaggi
Thanks Sean. I am cross-posting on dev to see why the code was written that way. Perhaps this.type doesn't do what is needed. Mohit Jaggi Founder, Data Orchard LLC www.dataorchardllc.com On Aug 30, 2016, at 2:08 PM, Sean Owen wrote: I think it's imitating, for example,

Iterative mapWithState

2016-08-30 Thread Matt Smith
Is it possible to use mapWithState iteratively? In other words, I would like to keep calling mapWithState with the output from the last mapWithState until there is no output. For a given minibatch, mapWithState could be called anywhere from 1..200ish times depending on the input/current state.

Re: Random Forest Classification

2016-08-30 Thread Bahubali Jain
Hi Bryan, Thanks for the reply. I am indexing 5 columns, then using these indexed columns to generate the "feature" column through the vector assembler. Which essentially means that I cannot use *fit()* directly on the "completeDataset" dataframe since it will have neither the "feature" column nor the 5

Best way to share state in a streaming cluster

2016-08-30 Thread C. Josephson
We have a timestamped input stream and we need to share the latest processed timestamp across the Spark master and slaves. This will be monotonically increasing over time. What is the easiest way to share state across Spark machines? An accumulator is very close to what we need, but since only the

Re: How to use custom class in DataSet

2016-08-30 Thread Jakob Odersky
Implementing custom encoders is unfortunately not well supported at the moment (IIRC there are plans to eventually add an API for user-defined encoders). That being said, there are a couple of encoders that can work with generic, serializable data types: "javaSerialization" and "kryo", found here
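A minimal sketch of the kryo route mentioned above (the class and names are hypothetical):

    import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

    // Hypothetical custom class with no built-in encoder.
    class MyClass(val id: Int, val name: String) extends Serializable

    val spark = SparkSession.builder().master("local[*]").appName("kryo-demo").getOrCreate()

    // Fall back to kryo serialization for the whole object; the Dataset then
    // stores opaque binary blobs rather than typed columns.
    implicit val myEncoder: Encoder[MyClass] = Encoders.kryo[MyClass]
    val ds = spark.createDataset(Seq(new MyClass(1, "a"), new MyClass(2, "b")))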

Re: Random Forest Classification

2016-08-30 Thread Bryan Cutler
You need to first fit just the VectorIndexer, which returns the model, then add the model to the pipeline where it will only transform.
val featureVectorIndexer = new VectorIndexer()
  .setInputCol("feature")
  .setOutputCol("indexedfeature")
  .setMaxCategories(180)
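Continuing Bryan's snippet, a sketch of the full sequence (completeDataset, trainingData, and randomForest are hypothetical names):

    import org.apache.spark.ml.Pipeline

    // Fit the indexer once on the complete dataset so every category is seen,
    // then put the fitted model (a Transformer) into the pipeline stages:
    // there it will only transform, never re-fit.
    val indexerModel = featureVectorIndexer.fit(completeDataset)
    val pipeline = new Pipeline().setStages(Array(indexerModel, randomForest))
    val pipelineModel = pipeline.fit(trainingData)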

Re: Controlling access to hive/db-tables while using SparkSQL

2016-08-30 Thread ayan guha
Given Record Service is yet to be added to the main distributions, I believe the only available solution now is to use HDFS ACLs to restrict access for Spark. On 31 Aug 2016 03:07, "Mich Talebzadeh" wrote: > Have you checked using views in Hive to restrict user access to

Re: reuse the Spark SQL internal metrics

2016-08-30 Thread Jacek Laskowski
Hi, If the stats are in the web UI, they should be flying over the wire, and so you can catch the events by implementing SparkListener [1] -- a developer API for custom Spark listeners. That's how the web UI and the History Server get the data. I think the stats are sent as accumulator updates in
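A minimal sketch of such a listener (the forwarding target is left as a comment; names are illustrative):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    class MetricsForwarder extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val metrics = taskEnd.taskMetrics
        if (metrics != null) {
          // Ship whatever you need to your own backend (Graphite, etc.) here.
          println(s"stage=${taskEnd.stageId} recordsRead=${metrics.inputMetrics.recordsRead}")
        }
      }
    }

    // Register it with: spark.sparkContext.addSparkListener(new MetricsForwarder)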

reuse the Spark SQL internal metrics

2016-08-30 Thread Ai Deng
Hi there, I think the metrics inside the different SparkPlans (like "numOutputRows" in FilterExec) are useful for building a dev dashboard or business monitoring. Is there an easy way or existing solution to expose and persist these metrics outside the Spark UI (e.g. send them to Graphite)? Currently they are
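For the general metrics registry (not the per-plan SQL metrics, which travel as accumulator updates, as noted in the reply above), Spark can ship to Graphite via a sink in conf/metrics.properties; a sketch with a placeholder host and example values:

    *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
    *.sink.graphite.host=<graphite-host>
    *.sink.graphite.port=2003
    *.sink.graphite.period=10
    *.sink.graphite.unit=seconds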

Re: Model abstract class in spark ml

2016-08-30 Thread Sean Owen
I think it's imitating, for example, how Enum is declared in Java: abstract class Enum<E extends Enum<E>>. This is done so that Enum can refer to the actual type of the derived enum class when declaring things like public final int compareTo(E o) to implement Comparable. The type is redundant in a sense, because
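A small Scala sketch of the same F-bounded pattern (illustrative only, not the actual spark.ml code):

    // M is threaded through so the base class can declare methods that
    // return the concrete subtype rather than the bare base type.
    abstract class Model[M <: Model[M]] {
      def copy(): M
    }

    class LinearModel extends Model[LinearModel] {
      def copy(): LinearModel = new LinearModel
    }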

Model abstract class in spark ml

2016-08-30 Thread Mohit Jaggi
Folks, I am having a bit of trouble understanding the following: abstract class Model[M <: Model[M]] Why is M <: Model[M]? Cheers, Mohit.

Re: Dynamic Allocation & Spark Streaming

2016-08-30 Thread Liren Ding
It has been a while since the last update on this thread. Now that Spark 2.0 is available, do you guys know if there's any progress on Dynamic Allocation & Spark Streaming? On Mon, Oct 19, 2015 at 1:13 PM, robert towne wrote: > I have watched a few videos from

Spark build 1.6.2 error

2016-08-30 Thread Diwakar Dhanuskodi
Hi, While building Spark 1.6.2, I am getting the below error in spark-sql. Any help is much appreciated. [ERROR] missing or invalid dependency detected while loading class file 'WebUI.class'. Could not access term eclipse in package org, because it (or its dependencies) are missing. Check your build

Re: Issues with Spark On Hbase Connector and versions

2016-08-30 Thread Weiqing Yang
The PR will be reviewed soon. Thanks, Weiqing From: Sachin Jain Date: Sunday, August 28, 2016 at 11:12 PM To: spats Cc: user

newlines inside csv quoted values

2016-08-30 Thread Koert Kuipers
i noticed much to my surprise that spark csv supports newlines inside quoted values. ok that's cool, but how does this work with splitting files when reading? i assume splitting is still simply done on newlines or something similar. wouldn't that potentially split in the middle of a record?

Re: S3A + EMR failure when writing Parquet?

2016-08-30 Thread Steve Loughran
On 29 Aug 2016, at 18:18, Everett Anderson wrote: Okay, I don't think it's really just an S3A issue anymore. I can run the job using fs.s3.impl/spark.hadoop.fs.s3.impl set to the S3A impl as a --conf param from the EMR console

Re: Spark 2.0.0 - What all access is needed to save model to S3?

2016-08-30 Thread Steve Loughran
On 30 Aug 2016, at 06:20, Aseem Bansal wrote: So what all access is needed? Asking this as I need to ask someone to give me appropriate access, and I cannot just ask them to give me all access to the bucket. Commented on the JIRA in

How to convert an ArrayType to DenseVector within DataFrame?

2016-08-30 Thread evanzamir
I have a DataFrame with a column containing a list of numeric features to be used for a regression. When I run the regression, I get the following error: *pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7
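A sketch of one common fix, converting the array column to an ml Vector with a UDF (shown in Scala; the DataFrame df and column names are hypothetical):

    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.functions.{col, udf}

    // Wrap each Array[Double] row into a DenseVector so ml estimators accept it.
    val toVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))
    val withVec = df.withColumn("features_vec", toVector(col("features")))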

Does Spark on YARN inherit or replace the Hadoop/YARN configs?

2016-08-30 Thread Everett Anderson
Hi, I've had a bit of trouble getting Spark on YARN to work. When executing in this mode and submitting from outside the cluster, one must set HADOOP_CONF_DIR or YARN_CONF_DIR, from which spark-submit can find the params it needs to

Re: Controlling access to hive/db-tables while using SparkSQL

2016-08-30 Thread Mich Talebzadeh
Have you checked using views in Hive to restrict user access to certain tables and columns only? Have a look at this link. HTH Dr Mich Talebzadeh LinkedIn

Re: Random Forest Classification

2016-08-30 Thread Bahubali Jain
Hi, I had run into a similar exception: "java.util.NoSuchElementException: key not found: ". After further investigation I realized it happens because the vectorindexer is executed on the training dataset and not on the entire dataset. In the dataframe I have 5 categories; each of these has to go

Re: Controlling access to hive/db-tables while using SparkSQL

2016-08-30 Thread Deepak Sharma
Is it possible to execute any query using SQLContext even if the DB is secured using roles or tools such as Sentry? Thanks Deepak On Tue, Aug 30, 2016 at 7:52 PM, Rajani, Arpan wrote: > Hi All, > > In our YARN cluster, we have setup spark 1.6.1 , we plan to give

ApacheCon Seville CFP closes September 9th

2016-08-30 Thread Rich Bowen
It's traditional. We wait for the last minute to get our talk proposals in for conferences. Well, the last minute has arrived. The CFP for ApacheCon Seville closes on September 9th, which is less than 2 weeks away. It's time to get your talks in, so that we can make this the best ApacheCon yet.

Re: Does it has a way to config limit in query on STS by default?

2016-08-30 Thread Chen Song
I tried both of the following with STS but neither works for me: starting STS with --hiveconf hive.limit.optimize.fetch.max=50, and setting common.max_count in Zeppelin. Without setting such limits, a query that outputs lots of rows could cause the driver to OOM and make STS unusable. Any

Controlling access to hive/db-tables while using SparkSQL

2016-08-30 Thread Rajani, Arpan
Hi All, In our YARN cluster we have set up Spark 1.6.1, and we plan to give access to all the end users/developers/BI users, etc. But we learnt that any valid user, after getting their own Kerberos TGT, can get hold of a sqlContext (in a program or in the shell) and can run any query against any secure

Re: broadcast fails on join

2016-08-30 Thread Takeshi Yamamuro
Hi, How about making the value of `spark.sql.broadcastTimeout` bigger? The value is 300 (seconds) by default. // maropu On Tue, Aug 30, 2016 at 9:09 PM, AssafMendelson wrote: > Hi, > > I am seeing a broadcast failure when doing a join as follows: > > Assume I have a dataframe
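For reference, a sketch of raising it (value in seconds):

    // Raise the broadcast build timeout from the default 300s.
    spark.conf.set("spark.sql.broadcastTimeout", "1200")
    // or at submit time: --conf spark.sql.broadcastTimeout=1200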

Re: ApplicationMaster + Fair Scheduler + Dynamic resource allocation

2016-08-30 Thread 梅西0247
1) Is that what you want? spark.yarn.am.memory when yarn-client; spark.driver.memory when yarn-cluster. 2) I think you need to set these configs in spark-defaults.conf: spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors. 3) It's not about the fair
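A sketch of the relevant spark-defaults.conf entries (values are examples; note dynamic allocation also requires the external shuffle service):

    spark.dynamicAllocation.enabled        true
    spark.dynamicAllocation.minExecutors   1
    spark.dynamicAllocation.maxExecutors   10
    spark.shuffle.service.enabled          true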

Hi, guys, does anyone use Spark in finance market?

2016-08-30 Thread Taotao.Li
Hi, guys, I'm a quant engineer in China, and I believe it's very promising to use Spark in the financial market. But I haven't found cases that combine Spark and finance. So here I wanna do a small survey: - do you guys use Spark in financial-market-related projects? - if yes,

Re: Spark metrics when running with YARN?

2016-08-30 Thread Vijay Kiran
Hi Otis, Did you check the REST API as documented in http://spark.apache.org/docs/latest/monitoring.html Regards, Vijay > On 30 Aug 2016, at 14:43, Otis Gospodnetić wrote: > > Hi Mich and Vijay, > > Thanks! I forgot to include an important bit - I'm looking

Re: Spark metrics when running with YARN?

2016-08-30 Thread Otis Gospodnetić
Hi Mich and Vijay, Thanks! I forgot to include an important bit - I'm looking for a *programmatic* way to get Spark metrics when running Spark under YARN - so JMX or API of some kind. Thanks, Otis -- Monitoring - Log Management - Alerting - Anomaly Detection Solr & Elasticsearch Consulting

Re: Design patterns involving Spark

2016-08-30 Thread Todd Nist
Have not tried this, but it looks quite useful if one is using Druid: https://github.com/implydata/pivot - an interactive data exploration UI for Druid. On Tue, Aug 30, 2016 at 4:10 AM, Alonso Isidoro Roman wrote: > Thanks Mitch, I will check it. > > Cheers > > > Alonso

broadcast fails on join

2016-08-30 Thread AssafMendelson
Hi, I am seeing a broadcast failure when doing a join as follows: Assume I have a dataframe df with ~80 million records. I do:
df2 = df.filter(cond) # reduces to ~50 million records
grouped = broadcast(df.groupby(df2.colA).count())
total = df2.join(grouped, df2.colA == grouped.colA, "inner")

Re: Writing to Hbase table from Spark

2016-08-30 Thread Todd Nist
Have you looked at spark-packages.org? There are several different HBase connectors there; not sure if any meet your need or not. https://spark-packages.org/?q=hbase HTH, -Todd On Tue, Aug 30, 2016 at 5:23 AM, ayan guha wrote: > You can use rdd level new hadoop format

ApplicationMaster + Fair Scheduler + Dynamic resource allocation

2016-08-30 Thread Cleosson José Pirani de Souza
Hi, I am using Spark 1.6.2 and Hadoop 2.7.2 in a single-node cluster (Pseudo-Distributed Operation settings, for testing purposes). For every Spark application that I submit I get: - an ApplicationMaster with 1024 MB of RAM and 1 vcore - and one container with 1024 MB of RAM and 1 vcore. I have

Re: Spark metrics when running with YARN?

2016-08-30 Thread Mich Talebzadeh
The Spark UI, regardless of deployment mode (standalone, YARN, etc.), runs on port 4040 by default and can be accessed directly. Otherwise one can specify a specific port with --conf "spark.ui.port=" followed by the port number. HTH Dr Mich Talebzadeh LinkedIn

Re: Spark metrics when running with YARN?

2016-08-30 Thread Vijay Kiran
From the YARN RM UI, find the Spark application ID, and in the application details you can click on the "Tracking URL", which should give you the Spark UI. ./Vijay > On 30 Aug 2016, at 07:53, Otis Gospodnetić wrote: > > Hi, > > When Spark is run on top of YARN,

Re: Spark 2.0 - Join statement compile error

2016-08-30 Thread shengshanzhang
Hi, try this way:
val df = sales_demand.join(product_master, sales_demand("INVENTORY_ITEM_ID") === product_master("INVENTORY_ITEM_ID"), "inner")
> On Aug 30, 2016, at 5:52 PM, Jacek Laskowski wrote: > > Hi Mich, > > This is the first time I've been told about $ for string interpolation (as

Re: Spark 2.0 - Join statement compile error

2016-08-30 Thread Jacek Laskowski
Hi Mich, This is the first time I've been told about $ for string interpolation (as the function, not the placeholder). Thanks for letting me know about it! What is often used is s"whatever you want to reference inside the string with a $-prefix, unless it is a complex expression", i.e. scala> s"I'm
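A couple of REPL examples of the s-interpolator for reference:

    scala> val chars = "This is Scala"
    scala> println(s"$chars")                   // simple reference
    This is Scala
    scala> println(s"length = ${chars.length}") // complex expressions need braces
    length = 13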

Re: Writing to Hbase table from Spark

2016-08-30 Thread ayan guha
You can use the RDD-level new Hadoop format API and pass in the appropriate classes. On 30 Aug 2016 19:13, "Mich Talebzadeh" wrote: > Hi, > > Is there an existing interface to read from and write to an HBase table in > Spark. > > Similar to below for Parquet > > val s =
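A sketch of that route, assuming an existing HBase table "mytable" with column family "cf", an rdd: RDD[(String, String)] of key/value pairs, the HBase 1.x client API, and sc as the SparkContext:

    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapreduce.Job

    val conf = sc.hadoopConfiguration
    conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable")
    val job = Job.getInstance(conf)
    job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

    rdd.map { case (rowKey, value) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"), Bytes.toBytes(value))
      (new ImmutableBytesWritable, put)  // key is ignored by TableOutputFormat
    }.saveAsNewAPIHadoopDataset(job.getConfiguration)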

Writing to Hbase table from Spark

2016-08-30 Thread Mich Talebzadeh
Hi, Is there an existing interface to read from and write to an HBase table in Spark, similar to the below for Parquet?
val s = spark.read.parquet("oraclehadoop.sales2")
s.write.mode("overwrite").parquet("oraclehadoop.sales4")
Or does one need to write to a Hive table which is already defined over HBase? Thanks

Re: How to convert List into json object / json Array

2016-08-30 Thread Sivakumaran S
Look at scala.util.parsing.json or the Jackson library for JSON manipulation. Also read http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets Regards, Sivakumaran S

How to convert List into json object / json Array

2016-08-30 Thread Sree Eedupuganti
Here is the snippet of my code:
Dataset<Row> rows_salaries = spark.read().json("/Users/Macbook/Downloads/rows_salaries.json");
rows_salaries.createOrReplaceTempView("salaries");
List<Row> df = spark.sql("select * from salaries").collectAsList();
I need to read the json data from 'List<Row> df =
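If the goal is JSON strings rather than Row objects, Dataset.toJSON is one route; a Scala sketch of the idea:

    // Each row of the query result becomes one JSON string.
    val json = spark.sql("select * from salaries").toJSON.collect()
    json.foreach(println)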

Re: Spark 2.0 - Join statement compile error

2016-08-30 Thread Mich Talebzadeh
Actually I double-checked this 's' String Interpolator in Scala:
scala> val chars = "This is Scala"
chars: String = This is Scala
scala> println($"$chars")
This is Scala
OK so far fine. In shell (ksh) I can do:
chars="This is Scala"
print "$chars"
This is Scala
In shell, print "$chars" and it is

Re: Design patterns involving Spark

2016-08-30 Thread Alonso Isidoro Roman
Thanks Mitch, I will check it. Cheers Alonso Isidoro Roman about.me/alonso.isidoro.roman 2016-08-30 9:52 GMT+02:00 Mich Talebzadeh: >

Re: Design patterns involving Spark

2016-08-30 Thread Mich Talebzadeh
You can use HBase for building real-time dashboards. Check this link. HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Design patterns involving Spark

2016-08-30 Thread Alonso Isidoro Roman
HBase for real-time queries? HBase was designed with batch in mind. Impala would be a better choice, but I do not know what Druid can do. Cheers Alonso Isidoro Roman about.me/alonso.isidoro.roman

How to convert List into json object / json Array

2016-08-30 Thread Sree Eedupuganti
Any suggestions, please.

Re: Design patterns involving Spark

2016-08-30 Thread Mich Talebzadeh
Hi Chanh, Druid sounds like a good choice. But again the point is: what else does Druid bring on top of HBase? Unless one decides to use Druid for both historical data and real-time data in place of HBase! Is it easier to write an API against Druid than HBase? You still want a UI dashboard?

Re: Spark metrics when running with YARN?

2016-08-30 Thread Mich Talebzadeh
Have you checked the Spark UI, on HOST:4040 by default? HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw