Re: Accessing Hbase tables through Spark, this seems to work

2016-10-17 Thread Mich Talebzadeh
Yes, the Hive external table is partitioned on a daily basis (datestamp below). CREATE EXTERNAL TABLE IF NOT EXISTS ${DATABASE}.externalMarketData ( KEY string , SECURITY string , TIMECREATED string , PRICE float ) COMMENT 'From prices Kafka delivered by Flume location by day' ROW FORMAT

Re: Help in generating unique Id in spark row

2016-10-17 Thread Saurav Sinha
Can anyone help me out? On Mon, Oct 17, 2016 at 7:27 PM, Saurav Sinha wrote: > Hi, > > I am in a situation where I want to generate a unique Id for each row. > > I have used monotonicallyIncreasingId but it gives increasing values > and starts generating from the start if it

Re: mllib model in production web API

2016-10-17 Thread Aseem Bansal
@Nicolas No, ours is different. We required predictions within a 10 ms time frame, so we needed much less latency than that. Every algorithm has some parameters, correct? We took the parameters from mllib and used them to create the ml package's model. The ml package's model's prediction time was much

Re: Did anybody come across this random-forest issue with spark 2.0.1.

2016-10-17 Thread Yanbo Liang
Please increase the value of "maxMemoryInMB" of your RandomForestClassifier or RandomForestRegressor. It's a warning which will not affect the result but may make your training slower. Thanks Yanbo On Mon, Oct 17, 2016 at 8:21 PM, 张建鑫(市场部) wrote: > Hi Xi Shen >
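A minimal sketch of that suggestion (Spark 2.0 ml API, Scala); the column names, tree count, and the 1024 MB value are illustrative assumptions, not from the thread:

import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(100)
  .setMaxMemoryInMB(1024)  // raise from the 256 MB default so node statistics fit in memory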

Re: [Spark 2.0.0] error when unioning to an empty dataset

2016-10-17 Thread Efe Selcuk
Bump! On Thu, Oct 13, 2016 at 8:25 PM Efe Selcuk wrote: > I have a use case where I want to build a dataset based off of > conditionally available data. I thought I'd do something like this: > > case class SomeData( ... ) // parameters are basic encodable types like >

Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Mike Metzger
I've not done this in Scala yet, but in PySpark I've run into a similar issue where having too many dataframes cached does cause memory issues. Unpersist by itself did not clear the memory usage, but rather setting the variable equal to None allowed all the references to be cleared and the memory

Re: Did anybody come across this random-forest issue with spark 2.0.1.

2016-10-17 Thread 市场部
Hi Xi Shen The warning message wasn't removed after I upgraded my Java to v8, but anyway I appreciate your kind help. Since it's just a WARN, I suppose I can bear with it and nothing bad would really happen. Am I right? 16/10/18 11:12:42 WARN RandomForest: Tree learning is using

Fwd: jdbcRDD for data ingestion from RDBMS

2016-10-17 Thread Ninad Shringarpure
Hi Team, One of my client teams is trying to see if they can use Spark to source data from an RDBMS instead of Sqoop. Data would be substantially large, in the order of billions of records. Reading the documentation, I am not sure whether jdbcRDD is by design going to be able to scale well for this

Re: previous stage results are not saved?

2016-10-17 Thread Mark Hamstra
There is no need to do that if 1) the stage that you are concerned with either made use of or produced MapOutputs/shuffle files; 2) reuse of those shuffle files (which may very well be in the OS buffer cache of the worker nodes) is sufficient for your needs; 3) the relevant Stage objects haven't

Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Mungeol Heo
First of all, thank you for your comments. Actually, what I mean by "update" is generating a new data frame with modified data. The more detailed while loop will be something like below. var continue = 1 var dfA = "a data frame" dfA.persist while (continue > 0) { val temp = "modified dfA"
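A minimal sketch of that loop with an explicit unpersist so old cached copies do not pile up; modifyA, stopCondition, and the input path are hypothetical placeholders, and a spark-shell style SparkSession named spark is assumed:

var dfA = spark.read.parquet("/path/to/start")  // hypothetical source
dfA.persist()
var done = false
while (!done) {
  val next = modifyA(dfA)   // returns a new DataFrame with the modified data
  next.persist()
  next.count()              // materialize the new copy before dropping the old one
  dfA.unpersist()
  dfA = next
  done = stopCondition(dfA)
}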

Re: previous stage results are not saved?

2016-10-17 Thread ayan guha
You can use cache or persist. On Tue, Oct 18, 2016 at 10:11 AM, Yang wrote: > I'm trying out 2.0, and ran a long job with 10 stages, in spark-shell > > it seems that after all 10 finished successfully, if I run the last, or > the 9th again, > spark reruns all the previous
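A minimal sketch of that advice, assuming a spark-shell session; the input path and transformations are placeholders:

import spark.implicits._

val expensive = spark.read.parquet("/big/input").groupBy("key").count()
expensive.cache()
expensive.count()                        // first action runs all stages and fills the cache
expensive.filter($"count" > 100).show()  // later actions reuse the cached result instead of rerunning earlier stages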

Re: Accessing Hbase tables through Spark, this seems to work

2016-10-17 Thread ayan guha
I do not see a rationale to have HBase in this scheme of things... maybe I am missing something? If data is delivered in HDFS, why not just add a partition to an existing Hive table? On Tue, Oct 18, 2016 at 8:23 AM, Mich Talebzadeh wrote: > Thanks Mike, > > My test

Re: K-Mean retrieving Cluster Members

2016-10-17 Thread Reth RM
I think I got it: parsedData.foreach(new VoidFunction<Vector>() { @Override public void call(Vector vector) throws Exception { System.out.println(clusters.predict(vector)); } });

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
It's a quasi-columnar store. Sort of a hybrid approach. On Oct 17, 2016, at 4:30 PM, Mich Talebzadeh wrote: I assume that HBase is more of a columnar data store by virtue of it storing column data together. Many interpretations of

previous stage results are not saved?

2016-10-17 Thread Yang
I'm trying out 2.0, and ran a long job with 10 stages in spark-shell. It seems that after all 10 finished successfully, if I run the last, or the 9th again, Spark reruns all the previous stages from scratch, instead of utilizing the partial results. This is quite serious since I can't experiment

Re: Consuming parquet files built with version 1.8.1

2016-10-17 Thread Cheng Lian
Hi Dinesh, Thanks for reporting. This is kinda weird and I can't reproduce this. Were you doing the experiments using a cleanly compiled Spark master branch? And I don't think you have to use parquet-mr 1.8.1 to read Parquet files generated using parquet-mr 1.8.1, unless you are using something not

Broadcasting Complex Custom Objects

2016-10-17 Thread Pedro Tuero
Hi guys, I'm trying to do a job with Spark, using Java. The thing is I need to have an index of words of about 3 GB on each machine, so I'm trying to broadcast custom objects to represent the index and the interface with it. I'm using Java standard serialization, so I tried to implement
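A minimal Scala sketch of the broadcast pattern (the original post is in Java, but the shape is the same); WordIndex is a purely hypothetical stand-in for the 3 GB index, and Kryo is shown as one commonly suggested alternative to Java serialization for large custom objects:

import org.apache.spark.{SparkConf, SparkContext}

case class WordIndex(entries: Map[String, Long]) extends Serializable

val conf = new SparkConf()
  .setAppName("broadcast-index")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[WordIndex]))
val sc = new SparkContext(conf)

val index = WordIndex(Map("spark" -> 1L))          // built once on the driver
val bIndex = sc.broadcast(index)                   // shipped to each executor once
val lookups = sc.parallelize(Seq("spark", "hbase"))
  .map(word => bIndex.value.entries.getOrElse(word, -1L))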

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Mich Talebzadeh
I assume that HBase is more of a columnar data store by virtue of it storing column data together. Many interpretations of this are all over the place. However, it is not columnar in the sense of a column-based (as opposed to row-based) implementation of the relational model.

Re: Accessing Hbase tables through Spark, this seems to work

2016-10-17 Thread Mich Talebzadeh
Thanks Mike, My test csv data comes as UUID, ticker, timecreated, price a2c844ed-137f-4820-aa6e-c49739e46fa6, S01, 2016-10-17T22:02:09, 53.36665625650533484995 a912b65e-b6bc-41d4-9e10-d6a44ea1a2b0, S02, 2016-10-17T22:02:09,

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Jörn Franke
An OLTP use case scenario does not necessarily mean traditional OLTP. See also Apache HAWQ etc.; they can indeed fit some use cases, and some others less so. > On 17 Oct 2016, at 23:02, Michael Segel wrote: > > You really don’t want to do OLTP on a distributed NoSQL

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
You really don’t want to do OLTP on a distributed NoSQL engine. Remember, Big Data isn’t relational; it’s more of a hierarchical model or record model. Think IMS or Pick (Dick Pick’s Revelation, U2, UniVerse, etc …) On Oct 17, 2016, at 3:45 PM, Jörn Franke

PostgresSql queries vs spark sql

2016-10-17 Thread Selvam Raman
Hi, Please share some ideas if you have worked on this earlier. How can I implement the Postgres CROSSTAB function in Spark? Postgres Example Example 1: SELECT mthreport.* FROM crosstab('SELECT i.item_name::text As row_name, to_char(if.action_date, ''mon'')::text As bucket,
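A minimal sketch of one way to get CROSSTAB-like output in Spark: the DataFrame pivot API (available since 1.6). The table and column names loosely mirror the Postgres example and are assumptions:

import spark.implicits._
import org.apache.spark.sql.functions._

val mthreport = spark.table("item_actions")
  .withColumn("month", date_format($"action_date", "MMM"))
  .groupBy("item_name")
  .pivot("month", Seq("Jan", "Feb", "Mar"))   // listing the values avoids an extra pass to discover them
  .agg(sum("amount"))
mthreport.show()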

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
Skip Phoenix. On Oct 17, 2016, at 2:20 PM, Thakrar, Jayesh wrote: Ben, Also look at Phoenix (Apache project) which provides a better (one of the best) SQL/JDBC layer on top of HBase. http://phoenix.apache.org/ Cheers, Jayesh

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
@Mitch You don’t have a schema in HBase other than the table name and the list of associated column families. So you can’t really infer a schema easily… On Oct 17, 2016, at 2:17 PM, Mich Talebzadeh wrote: How about this method of

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
You forgot to mention that if you roll your own… you can toss your own level of security on top of it. For most, that’s not important. For those working with PII type of information… kinda important, especially when the rules can get convoluted. On Oct 17, 2016, at 12:14 PM, vincent

Re: Accessing Hbase tables through Spark, this seems to work

2016-10-17 Thread Michael Segel
Mitch, Short answer… no, it doesn’t scale. Longer answer… You are using a UUID as the row key? Why? (My guess is that you want to avoid hot spotting.) So you’re going to have to pull in all of the data… meaning a full table scan… and then perform a sort order transformation, dropping the

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Jörn Franke
It has some implications because it imposes the SQL model on HBase. Internally it translates the SQL queries into custom HBase processors. Keep also in mind that HBase needs a proper key design, and how Phoenix designs those keys to get the best performance out of it. I think for OLTP it is a

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Benjamin Kim
This will give me an opportunity to start using Structured Streaming. Then, I can try adding more functionality. If all goes well, then we could transition off of HBase to a more in-memory data solution that can “spill over” data for us. > On Oct 17, 2016, at 11:53 AM, vincent gromakowski

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread ayan guha
Hi, Any reason not to recommend Phoenix? I haven't used it myself so am curious about the pros and cons of using it. On 18 Oct 2016 03:17, "Michael Segel" wrote: > Guys, > Sorry for jumping in late to the game… > > If memory serves (which may not be a good thing…) :

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Mich Talebzadeh
Ben, "Also look at Phoenix (Apache project) which provides a better (one of the best) SQL/JDBC layer on top of HBase. http://phoenix.apache.org/" I am afraid this does not work with Spark 2!

Re: Indexing w spark joins?

2016-10-17 Thread Mich Talebzadeh
Hi Michael, just to clarify, are you referring to inverted indexes here? Predicate push-down is supported by Hive ORC tables that Spark can operate on. With regard to your point "Break down the number and types of accidents by car manufacturer ,
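A minimal sketch of leaning on ORC predicate push-down rather than a secondary index, assuming a spark-shell session; the table and column names are hypothetical:

import spark.implicits._

spark.conf.set("spark.sql.orc.filterPushdown", "true")
val byMaker = spark.table("accidents_orc")
  .filter($"manufacturer" === "acme" && $"accident_type" === "rear_end")
  .groupBy($"manufacturer", $"accident_type")
  .count()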

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Thakrar, Jayesh
Ben, Also look at Phoenix (Apache project) which provides a better (one of the best) SQL/JDBC layer on top of HBase. http://phoenix.apache.org/ Cheers, Jayesh From: vincent gromakowski Date: Monday, October 17, 2016 at 1:53 PM To: Benjamin Kim

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Mich Talebzadeh
How about this method of creating DataFrames on HBase tables directly? I define an RDD for each column in the column family as below. In this case column trade_info:ticker //create rdd val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
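A minimal sketch completing that pattern: read one HBase column via TableInputFormat and turn it into a DataFrame. The table name is an assumption; the trade_info:ticker column comes from the message above:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "marketDataHbase")   // hypothetical table name
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

import spark.implicits._
val tickers = hBaseRDD.map { case (key, result) =>
  (Bytes.toString(key.get()),
   Bytes.toString(result.getValue(Bytes.toBytes("trade_info"), Bytes.toBytes("ticker"))))
}.toDF("rowkey", "ticker")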

Re: Aggregate UDF (UDAF) in Python

2016-10-17 Thread Tobi Bosede
Thanks Assaf. Yes, please provide an example of how to wrap the code for Python. I am leaning towards Scala. On Mon, Oct 17, 2016 at 1:50 PM, Mendelson, Assaf wrote: > A possible (bad) workaround would be to use the collect_list function. > This will give you all the values

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread vincent gromakowski
Instead of (or in addition to) saving results somewhere, you just start a thriftserver that exposes the Spark tables of the SQLContext (or SparkSession now). That means you can implement any logic (and maybe use Structured Streaming) to expose your data. Today, using the thriftserver means reading

K-Mean retrieving Cluster Members

2016-10-17 Thread Reth RM
Could you please point me to sample code to retrieve the cluster members of K-means? The code below prints cluster centers; I need the cluster members belonging to each center. val clusters = KMeans.train(parsedData, numClusters, numIterations) clusters.clusterCenters.foreach(println)
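A minimal sketch of getting the members of each cluster: predict a cluster id for every training point and group by it (parsedData and clusters are the same values as above):

val membersByCluster = parsedData
  .map(point => (clusters.predict(point), point))   // (clusterId, point)
  .groupByKey()
membersByCluster.collect().foreach { case (id, points) =>
  println(s"cluster $id has ${points.size} members")
}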

RE: Aggregate UDF (UDAF) in Python

2016-10-17 Thread Mendelson, Assaf
A possible (bad) workaround would be to use the collect_list function. This will give you all the values in an array (list) and you can then create a UDF to do the aggregation yourself. This would be very slow and cost a lot of memory but it would work if your cluster can handle it. This is the
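A minimal Scala sketch of that workaround (the thread is about Python, but the shape is the same): collect each group's values into an array with collect_list, then apply an ordinary UDF to the array. df and the column names are assumptions:

import spark.implicits._
import org.apache.spark.sql.functions._

val myAgg = udf { values: Seq[Double] =>
  if (values.isEmpty) 0.0 else values.max - values.min   // example "custom" aggregation: range per group
}
val result = df.groupBy("key")
  .agg(collect_list($"value").as("values"))
  .withColumn("value_range", myAgg($"values"))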

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Benjamin Kim
Is this technique similar to what Kinesis is offering or what Structured Streaming is going to have eventually? Just curious. Cheers, Ben > On Oct 17, 2016, at 10:14 AM, vincent gromakowski wrote: > > I would suggest to code your own Spark thriftserver

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread vincent gromakowski
I would suggest to code your own Spark thriftserver which seems to be very easy. http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server I am starting to test it. The big advantage is that you can implement any logic because it's a spark job and then
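A minimal sketch of the approach from that link: a Spark job registers cached temp views and then starts the Thrift server inside the same application, so JDBC/ODBC clients query them live. The path and view name are assumptions, and the spark-hive-thriftserver module must be on the classpath:

import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val df = spark.read.parquet("/data/events")
df.cache()
df.createOrReplaceTempView("events")
HiveThriftServer2.startWithContext(spark.sqlContext)   // exposes the "events" view over JDBC/ODBC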

question on the structured DataSet API join

2016-10-17 Thread Yang
I'm trying to use the joinWith() method instead of join(), since the former provides a type-checked result while the latter is a straight DataFrame. The signature is Dataset[(T, U)] joinWith(other: Dataset[U], col: Column). Here the second arg, col: Column, is normally provided by
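A minimal sketch of joinWith keeping both sides typed; the case classes and join key are assumptions:

import spark.implicits._

case class User(id: Long, name: String)
case class Order(userId: Long, amount: Double)

val users  = Seq(User(1, "a"), User(2, "b")).toDS()
val orders = Seq(Order(1, 10.0), Order(1, 5.0)).toDS()

// result is a Dataset[(User, Order)] rather than an untyped DataFrame
val joined = users.joinWith(orders, users("id") === orders("userId"), "inner")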

Indexing w spark joins?

2016-10-17 Thread Michael Segel
Hi, Apologies if I’ve asked this question before but I didn’t see it in the list and I’m certain that my last surviving brain cell has gone on strike over my attempt to reduce my caffeine intake… Posting this to both user and dev because I think the question / topic jumps in to both camps.

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
Guys, Sorry for jumping in late to the game… If memory serves (which may not be a good thing…) : You can use HiveServer2 as a connection point to HBase. While this doesn’t perform well, its probably the cleanest solution. I’m not keen on Phoenix… wouldn’t recommend it…. The issue is that

Re: Possible memory leak after closing spark context in v2.0.1

2016-10-17 Thread Lev Katzav
I don't have any object broadcasting in my code. I do have broadcast join hints (df1.join(broadcast(df2))). I tried starting and stopping the spark context for every test (and not once per suite), and it did stop the OOM errors, so I guess that there is no leakage after the context is stopped.

Substitute Certain Rows a data Frame using SparkR

2016-10-17 Thread shilp
I have a SparkR data frame and I want to replace certain rows of a column which satisfy a certain condition with some value. If it were a simple R data frame then I would do something as follows: df$Column1[df$Column1 == "Value"] = "NewValue". How would I perform a similar operation on a SparkR data
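A minimal sketch of the equivalent conditional replacement through the Scala DataFrame API (SparkR has analogous when/ifelse functions); df is an assumed input:

import org.apache.spark.sql.functions.{col, when}

val updated = df.withColumn("Column1",
  when(col("Column1") === "Value", "NewValue").otherwise(col("Column1")))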

RE: rdd and dataframe columns dtype

2016-10-17 Thread 박경희
Do you need this one? http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types Best Regards - Original Message - Sender : muhammet pakyürek Date : 2016-10-17 20:52 (GMT+9) Title : rdd and dataframe columns dtype how

Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Thakrar, Jayesh
Yes, iterating over a dataframe and making changes is not uncommon. Of course RDDs, dataframes and datasets are immutable, but there is some optimization in the optimizer that can potentially help to dampen the effect/impact of creating a new rdd, df or ds. Also, the use-case you cited is

Help in generating unique Id in spark row

2016-10-17 Thread Saurav Sinha
Hi, I am in a situation where I want to generate a unique Id for each row. I have used monotonicallyIncreasingId but it gives increasing values and starts generating from the start if it fails. I have two questions here: Q1. Does this method give me a unique id even in a failure situation? Because I want to
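A minimal sketch of one alternative: zip the rows with ids from zipWithUniqueId, which does not embed the partition id the way monotonicallyIncreasingId does. Note that neither approach guarantees the same ids are re-assigned after a failure and rerun; for that you generally derive the id from the data itself (e.g. a hash of a natural key). df is an assumed input DataFrame:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val withId = spark.createDataFrame(
  df.rdd.zipWithUniqueId().map { case (row, id) => Row.fromSeq(row.toSeq :+ id) },
  StructType(df.schema.fields :+ StructField("uid", LongType, nullable = false)))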

OutputMetrics with data frames (spark-avro)

2016-10-17 Thread Tim Moran
Hi, I'm using the Databricks spark-avro library to save some DataFrames out as Avro (with Spark 1.6.1). When I do this however, I lose the information in the spark events about the number of records and size of data written to HDFS for each partition that's available if I save an RDD out as a

Re: Did anybody come across this random-forest issue with spark 2.0.1.

2016-10-17 Thread 市场部
Hi Xi Shen Not yet. For the moment my JDK for Spark is still v7. Thanks for your reminder, I will try it out by upgrading Java. From: Xi Shen Date: Monday, 17 October 2016, 8:00 PM To: zhangjianxin

Re: Question about the offiicial binary Spark 2 package

2016-10-17 Thread Xi Shen
Okay, thank you. On Mon, Oct 17, 2016 at 5:53 PM Sean Owen wrote: > You can take the "with user-provided Hadoop" binary from the download > page, and yes that should mean it does not drag in a Hive dependency of its > own. > > On Mon, Oct 17, 2016 at 7:08 AM Xi Shen

rdd and dataframe columns dtype

2016-10-17 Thread muhammet pakyürek
How can I set the column dtypes of an RDD?

Re: Problems with new experimental Kafka Consumer for 0.10

2016-10-17 Thread Matthias Niehoff
heartbeat.interval.ms: default; group.max.session.timeout.ms: default; session.timeout.ms: 6; default values as of Kafka 0.10. Complete Kafka params: val kafkaParams = Map[String, String]( "bootstrap.servers" -> kafkaBrokers, "auto.offset.reset" -> "latest", "enable.auto.commit" -> "false",
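A minimal sketch of where these consumer settings plug into the 0.10 direct stream; the broker, topic, group id, and the timeout values shown are illustrative assumptions:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val ssc = new StreamingContext(sc, Seconds(10))
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean),
  "session.timeout.ms" -> "30000",       // must stay within the broker's group.max.session.timeout.ms
  "heartbeat.interval.ms" -> "10000")    // typically about a third of session.timeout.ms
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("prices"), kafkaParams))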

Re: Couchbase-Spark 2.0.0

2016-10-17 Thread Sean Owen
You're now asking about couchbase code, so this isn't the best place to ask. Head to couchbase forums. On Mon, Oct 17, 2016 at 10:14 AM Devi P.V wrote: > Hi, > I tried with the following code > > import com.couchbase.spark._ > val conf = new SparkConf() >

Driver storage memory getting waste

2016-10-17 Thread Sushrut Ikhar
Hi, Is there any config to change the storage memory fraction for the driver? I'm not caching anything in the driver, and by default it is picking up spark.memory.fraction (0.9) and spark.memory.storageFraction (0.6), whose values I've set as per my executor usage. Regards, Sushrut Ikhar

Re: Question about the offiicial binary Spark 2 package

2016-10-17 Thread Sean Owen
You can take the "with user-provided Hadoop" binary from the download page, and yes that should mean it does not drag in a Hive dependency of its own. On Mon, Oct 17, 2016 at 7:08 AM Xi Shen wrote: > Hi, > > I want to configure my Hive to use Spark 2 as its engine.

Re: Possible memory leak after closing spark context in v2.0.1

2016-10-17 Thread Sean Owen
Did you unpersist the broadcast objects? On Mon, Oct 17, 2016 at 10:02 AM lev wrote: > Hello, > > I'm in the process of migrating my application to spark 2.0.1, > And I think there is some memory leaks related to Broadcast joins. > > the application has many unit tests, > and

Re: Couchbase-Spark 2.0.0

2016-10-17 Thread Devi P.V
Hi, I tried with the following code import com.couchbase.spark._ val conf = new SparkConf() .setAppName(this.getClass.getName) .setMaster("local[*]") .set("com.couchbase.bucket.bucketName","password") .set("com.couchbase.nodes", "node") .set

Possible memory leak after closing spark context in v2.0.1

2016-10-17 Thread lev
Hello, I'm in the process of migrating my application to spark 2.0.1, and I think there are some memory leaks related to broadcast joins. The application has many unit tests, and each individual test suite passes, but when running all together, it fails on OOM errors. In the beginning of each

Re: Resizing Image with Scrimage in Spark

2016-10-17 Thread Sean Owen
It pretty much means what it says. Objects you send across machines must be serializable, and the object from the library is not. You can write a wrapper object that is serializable and knows how to serialize it. Or ask the library dev to consider making this object serializable. On Mon, Oct 17,
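A minimal sketch of the wrapper idea, assuming the library object (called LibImage here, a hypothetical stand-in for the Scrimage type) can be round-tripped through bytes:

class SerializableImage(@transient var image: LibImage) extends Serializable {
  private def writeObject(out: java.io.ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    val bytes = image.toBytes            // hypothetical: serialize via the library's own byte form
    out.writeInt(bytes.length)
    out.write(bytes)
  }
  private def readObject(in: java.io.ObjectInputStream): Unit = {
    in.defaultReadObject()
    val bytes = new Array[Byte](in.readInt())
    in.readFully(bytes)
    image = LibImage.fromBytes(bytes)    // hypothetical: rebuild the non-serializable object
  }
}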

Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Xi Shen
I think most of the "big data" tools, like Spark and Hive, are not designed to edit data. They are only designed to query data. I wonder in what scenario you need to update a large volume of data repeatedly. On Mon, Oct 17, 2016 at 2:00 PM Divya Gehlot wrote: > If my

Re: NoClassDefFoundError: org/apache/spark/Logging in SparkSession.getOrCreate

2016-10-17 Thread Saisai Shao
Not sure why your code searches for the Logging class under org/apache/spark; this should be “org/apache/spark/internal/Logging”, and it changed a long time ago. On Sun, Oct 16, 2016 at 3:25 AM, Brad Cox wrote: > I'm experimenting with Spark 2.0.1 for the first time and hitting a

Resizing Image with Scrimage in Spark

2016-10-17 Thread Adline Dsilva
Hi All, I have a Hive table which contains around 500 million photos (profile pictures of users) stored as hex strings, and the total size of the table is 5 TB. I'm trying to build a solution where images can be retrieved in real time. Current solution: resize the images, index them along with the user

Question about the offiicial binary Spark 2 package

2016-10-17 Thread Xi Shen
Hi, I want to configure my Hive to use Spark 2 as its engine. According to Hive's instructions, Spark should be built without Hadoop or Hive. I could build my own, but for some reason I hope I could use an official binary build. So I want to ask if the official Spark binary build labeled

Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Divya Gehlot
If my understanding is correct about your query: in Spark, DataFrames are immutable; you can't update a DataFrame. You have to create a new DataFrame to update the current one. Thanks, Divya On 17 October 2016 at 09:50, Mungeol Heo wrote: > Hello, everyone. > > As