Re: Accessing Hbase tables through Spark, this seems to work

2016-10-17 Thread Mich Talebzadeh
yes Hive external table is partitioned on a daily basis (datestamp below)

CREATE EXTERNAL TABLE IF NOT EXISTS ${DATABASE}.externalMarketData (
 KEY string
   , SECURITY string
   , TIMECREATED string
   , PRICE float
)
COMMENT 'From prices Kafka delivered by Flume, location by day'
ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION 'hdfs://rhes564:9000/data/prices/'
--TBLPROPERTIES ("skip.header.line.count"="1")
;
ALTER TABLE ${DATABASE}.externalMarketData set location
'hdfs://rhes564:9000/data/prices/${TODAY}';

and there is an insert/overwrite into the managed table every 15 minutes.

INSERT OVERWRITE TABLE ${DATABASE}.marketData PARTITION (DateStamp =
"${TODAY}")
SELECT
  KEY
, SECURITY
, TIMECREATED
, PRICE
, 1
, CAST(from_unixtime(unix_timestamp()) AS timestamp)
FROM ${DATABASE}.externalMarketData

That works fine. However, HBase is much faster for data retrieval with
Phoenix.

When we get Hive with LLAP, I gather Hive will replace HBase.

So in summary we have


   1. raw data delivered to HDFS
   2. data ingested into Hbase via cron
   3. HDFS directory is mapped to Hive external table
   4. There is Hive managed table with added optimisation/indexing (ORC)
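A minimal Spark-side sketch of reading the managed table from step 4 above. The database name "test" and the DateStamp literal are placeholders, and it assumes a SparkSession built with Hive support enabled:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("read-marketData")
  .enableHiveSupport()
  .getOrCreate()

// Query today's partition of the managed ORC table.
val todays = spark.sql(
  """SELECT key, security, timecreated, price
    |FROM test.marketData
    |WHERE DateStamp = '2016-10-17'""".stripMargin)

todays.show(5, truncate = false)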


There are a number of ways of doing it as usual.

Thanks



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 18 October 2016 at 00:48, ayan guha  wrote:

> I do not see a rationale to have HBase in this scheme of things... maybe
> I am missing something?
>
> If data is delivered in HDFS, why not just add partition to an existing
> Hive table?
>
> On Tue, Oct 18, 2016 at 8:23 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Thanks Mike,
>>
>> My test csv data comes as
>>
>> UUID, ticker,  timecreated,
>> price
>> a2c844ed-137f-4820-aa6e-c49739e46fa6, S01, 2016-10-17T22:02:09,
>> 53.36665625650533484995
>> a912b65e-b6bc-41d4-9e10-d6a44ea1a2b0, S02, 2016-10-17T22:02:09,
>> 86.31917515824627016510
>> 5f4e3a9d-05cc-41a2-98b3-40810685641e, S03, 2016-10-17T22:02:09,
>> 95.48298277703729129559
>>
>>
>> And this is my Hbase table with one column family
>>
>> create 'marketDataHbase', 'price_info'
>>
>> It is populated every 15 minutes from test.csv files delivered via Kafka
>> and Flume to HDFS
>>
>>
>>1. Create a fat csv file based on all small csv files for today -->
>>prices/2016-10-17
>>2. Populate data into Hbase table using 
>> org.apache.hadoop.hbase.mapreduce.ImportTsv
>>
>>3. This is pretty quick using MapReduce
>>
>>
>> That importTsv only appends new rows to Hbase table as the choice of UUID
>> as rowKey avoids any issues.
>>
>> So I only have 15 minutes lag in my batch Hbase table.
>>
>> I have both Hive ORC tables and Phoenix views on top of this Hbase
>> tables.
>>
>>
>>1. Phoenix offers the fastest response if used on top of Hbase.
>>unfortunately, Spark 2 with Phoenix is broken
>>2. Spark on Hive on Hbase looks OK. This works fine with Spark 2
>>3. Spark on Hbase tables directly using key, value DFs for each
>>column. Not as fast as 2 but works. I don't think a DF is a good choice 
>> for
>>a key, value pair?
>>
>> Now if I use Zeppelin to read from Hbase
>>
>>
>>1. I can use Phoenix JDBC. That looks very fast
>>2. I can use Spark csv directly on HDFS csv files.
>>3. I can use Spark on Hive tables
>>
>>
>> If I use Tableau on Hbase data then, only sql like code is useful.
>> Phoenix or Hive
>>
>> I don't want to change the design now. But admittedly Hive is the best
>> SQL on top of Hbase. Next release of Hive is going to have in-memory
>> database (LLAP) so we can cache Hive tables in memory. That will be faster.
>> Many people underestimate Hive but I still believe it has a lot to offer
>> besides serious ANSI compliant SQL.
>>
>> Regards
>>
>>  Mich
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be 

Re: Help in generating unique Id in spark row

2016-10-17 Thread Saurav Sinha
Can anyone help me out?

On Mon, Oct 17, 2016 at 7:27 PM, Saurav Sinha 
wrote:

> Hi,
>
> I am in a situation where I want to generate a unique Id for each row.
>
> I have used monotonicallyIncreasingId, but it gives increasing values
> and starts generating from the start again if it fails.
>
> I have two questions here:
>
> Q1. Does this method give me a unique id even in a failure situation, because I
> want to use that ID as my Solr id.
>
> Q2. If the answer to the previous question is NO, then is there a way to generate
> a UUID for each row which is unique and not updatable?
>
> As I have come across a situation where the UUID is updated:
>
>
> val idUDF = udf(() => UUID.randomUUID().toString)
> val a = rawDataDf.withColumn("alarmUUID", lit(idUDF()))
> a.persist(StorageLevel.MEMORY_AND_DISK)
> rawDataDf.registerTempTable("rawAlarms")
>
> ///
> /// I do some joins
>
> but as I reach further below
>
> I do something like
> (b is a transformation of a)
> sqlContext.sql("""Select a.alarmUUID,b.alarmUUID
>   from a right outer join b on a.alarmUUID =
> b.alarmUUID""")
>
> it gives output such as
>
> +++
>
> |   alarmUUID|   alarmUUID|
> +++
> |7d33a516-5532-410...|null|
> |null|2439d6db-16a2-44b...|
> +++
>
>
>
> --
> Thanks and Regards,
>
> Saurav Sinha
>
> Contact: 9742879062
>
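One hedged sketch of keeping the generated id stable is to derive it from zipWithUniqueId rather than a random UDF, so the value is a function of partition and position rather than a fresh random on every re-evaluation of the plan (this holds as long as the input itself is deterministic). The column name alarmId and the toy stand-in for rawDataDf are made up; a Spark 2.x SparkSession named spark is assumed (use sqlContext.createDataFrame on 1.x):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val rawDataDf = spark.range(0, 3).toDF("value")   // toy stand-in for the real frame

val withId = {
  // Attach a stable Long id to every row.
  val rdd = rawDataDf.rdd.zipWithUniqueId().map { case (row, id) =>
    Row.fromSeq(row.toSeq :+ id)
  }
  val schema = StructType(rawDataDf.schema.fields :+
    StructField("alarmId", LongType, nullable = false))
  spark.createDataFrame(rdd, schema)
}

Because the id is computed from the data layout rather than drawn at random, both sides of a later self-join should see the same values.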



-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


Re: mllib model in production web API

2016-10-17 Thread Aseem Bansal
@Nicolas

No, ours is different. We required predictions within a 10 ms time frame, so we
needed much lower latency than that.

Every algorithm has some parameters, correct? We took the parameters from
the mllib model and used them to create the ml package's model. The ml package
model's prediction time was much faster compared to the mllib package's
transformation. So essentially we use Spark's distributed machine learning
library to train the model, save it to S3, load it from S3 in a different
system, and then convert it into the Vector-based API model for actual
predictions.

There were obviously some transformations involved but we didn't use
Pipeline for those transformations. Instead, we re-wrote them for the
Vector based API. I know it's not perfect but if we had used the
transformations within the pipeline that would make us dependent on spark's
distributed API and we didn't see how we will really reach our latency
requirements. Would have been much simpler and more DRY if the
PipelineModel had a predict method based on vectors and was not distributed.

As you can guess, it is very much model-specific and more work. If we decide
to use another type of model we will have to add conversion/transformation
code for that as well. If only Spark exposed a prediction method as fast as
the old machine learning package's.
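For what it's worth, here is a rough sketch (not necessarily how Aseem's team did it) of pulling the learned coefficients out of a fitted Spark 2.x ml LogisticRegressionModel and scoring single vectors locally, with no Spark job per request. The class name LocalLrScorer is made up:

import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.linalg.Vector

// Wraps just the numbers needed for scoring; cheap to keep in a web container.
class LocalLrScorer(model: LogisticRegressionModel) extends Serializable {
  private val weights = model.coefficients.toArray
  private val intercept = model.intercept

  // Plain logistic function over the dot product -- no RDDs or DataFrames involved.
  def predictProbability(features: Vector): Double = {
    val margin = intercept +
      features.toArray.zip(weights).map { case (x, w) => x * w }.sum
    1.0 / (1.0 + math.exp(-margin))
  }
}

Any feature transformations done in the training pipeline still have to be reproduced on the serving side, which is the part that remains model-specific.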

On Sat, Oct 15, 2016 at 8:42 PM, Nicolas Long  wrote:

> Hi Sean and Aseem,
>
> thanks both. A simple thing which sped things up greatly was simply to
> load our sql (for one record effectively) directly and then convert to a
> dataframe, rather than using Spark to load it. Sounds stupid, but this took
> us from > 5 seconds to ~1 second on a very small instance.
>
> Aseem: can you explain your solution a bit more? I'm not sure I understand
> it. At the moment we load our models from S3 (
> RandomForestClassificationModel.load(..) ) and then store that in an
> object property so that it persists across requests - this is in Scala. Is
> this essentially what you mean?
>
>
>
>
>
>
> On 12 October 2016 at 10:52, Aseem Bansal  wrote:
>
>> Hi
>>
>> Faced a similar issue. Our solution was to load the model, cache it after
>> converting it to a model from mllib and then use that instead of ml model.
>>
>> On Tue, Oct 11, 2016 at 10:22 PM, Sean Owen  wrote:
>>
>>> I don't believe it will ever scale to spin up a whole distributed job to
>>> serve one request. You can look possibly at the bits in mllib-local. You
>>> might do well to export as something like PMML either with Spark's export
>>> or JPMML and then load it into a web container and score it, without Spark
>>> (possibly also with JPMML, OpenScoring)
>>>
>>>
>>> On Tue, Oct 11, 2016, 17:53 Nicolas Long  wrote:
>>>
 Hi all,

 so I have a model which has been stored in S3. And I have a Scala
 webapp which for certain requests loads the model and transforms submitted
 data against it.

 I'm not sure how to run this quickly on a single instance though. At
 the moment Spark is being bundled up with the web app in an uberjar (sbt
 assembly).

 But the process is quite slow. I'm aiming for responses < 1 sec so that
 the webapp can respond quickly to requests. When I look the Spark UI I see:

 Summary Metrics for 1 Completed Tasks

 MetricMin25th percentileMedian75th percentileMax
 Duration94 ms94 ms94 ms94 ms94 ms
 Scheduler Delay0 ms0 ms0 ms0 ms0 ms
 Task Deserialization Time3 s3 s3 s3 s3 s
 GC Time2 s2 s2 s2 s2 s
 Result Serialization Time0 ms0 ms0 ms0 ms0 ms
 Getting Result Time0 ms0 ms0 ms0 ms0 ms
 Peak Execution Memory0.0 B0.0 B0.0 B0.0 B0.0 B

 I don't really understand why deserialization and GC should take so
 long when the models are already loaded. Is this evidence I am doing
 something wrong? And where can I get a better understanding on how Spark
 works under the hood here, and how best to do a standalone/bundled jar
 deployment?

 Thanks!

 Nic

>>>
>>
>


Re: Did anybody come across this random-forest issue with spark 2.0.1.

2016-10-17 Thread Yanbo Liang
Please increase the value of "maxMemoryInMB" of your
RandomForestClassifier or RandomForestRegressor.
It's a warning which will not affect the result, but it may make your training
slower.

Thanks
Yanbo
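For reference, a minimal sketch of where that parameter is set; the column names and values below are placeholders:

import org.apache.spark.ml.classification.RandomForestClassifier

// Raise the per-iteration memory budget above the 256 MB default so node
// splitting is not throttled and the warning goes away.
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(100)
  .setMaxMemoryInMB(1024)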

On Mon, Oct 17, 2016 at 8:21 PM, 张建鑫(市场部) 
wrote:

> Hi Xi Shen
>
> The warning message wasn’t  removed after I had upgraded my java to V8,
> but  anyway I appreciate your kind help.
>
> Since it’s just a WARN, I suppose I can bear with it and nothing bad would
> really happen. Am I right?
>
>
> 16/10/18 11:12:42 WARN RandomForest: Tree learning is using approximately
> 268437864 bytes per iteration, which exceeds requested limit
> maxMemoryUsage=268435456. This allows splitting 80088 nodes in this
> iteration.
> 16/10/18 11:13:07 WARN RandomForest: Tree learning is using approximately
> 268436304 bytes per iteration, which exceeds requested limit
> maxMemoryUsage=268435456. This allows splitting 80132 nodes in this
> iteration.
> 16/10/18 11:13:32 WARN RandomForest: Tree learning is using approximately
> 268437816 bytes per iteration, which exceeds requested limit
> maxMemoryUsage=268435456. This allows splitting 80082 nodes in this
> iteration.
>
>
>
> From: zhangjianxin 
> Date: Monday, 17 October 2016, 8:16 PM
> To: Xi Shen 
> Cc: "user@spark.apache.org" 
> Subject: Re: Did anybody come across this random-forest issue with spark 2.0.1.
>
> Hi Xi Shen
>
> Not yet. For the moment my JDK for Spark is still v7. Thanks for your
> reminder; I will try it out by upgrading Java.
>
> From: Xi Shen 
> Date: Monday, 17 October 2016, 8:00 PM
> To: zhangjianxin , "user@spark.apache.org" <
> user@spark.apache.org>
> Subject: Re: Did anybody come across this random-forest issue with spark 2.0.1.
>
> Did you also upgrade to Java from v7 to v8?
>
> On Mon, Oct 17, 2016 at 7:19 PM 张建鑫(市场部) 
> wrote:
>
>>
>> Did anybody encounter this problem before and why it happens , how to
>> solve it?  The same training data and same source code work in 1.6.1,
>> however become lousy in 2.0.1
>>
>> --
>
>
> Thanks,
> David S.
>


Re: [Spark 2.0.0] error when unioning to an empty dataset

2016-10-17 Thread Efe Selcuk
Bump!

On Thu, Oct 13, 2016 at 8:25 PM Efe Selcuk  wrote:

> I have a use case where I want to build a dataset based off of
> conditionally available data. I thought I'd do something like this:
>
> case class SomeData( ... ) // parameters are basic encodable types like
> strings and BigDecimals
>
> var data = spark.emptyDataset[SomeData]
>
> // loop, determining what data to ingest and process into datasets
>   data = data.union(someCode.thatReturnsADataset)
> // end loop
>
> However I get a runtime exception:
>
> Exception in thread "main" org.apache.spark.sql.AnalysisException:
> unresolved operator 'Union;
> at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
> at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:361)
> at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
> at
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
> at
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:161)
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
> at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
> at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
> at org.apache.spark.sql.Dataset.union(Dataset.scala:1459)
>
> Granted, I'm new at Spark so this might be an anti-pattern, so I'm open to
> suggestions. However it doesn't seem like I'm doing anything incorrect
> here, the types are correct. Searching for this error online returns
> results seemingly about working in dataframes and having mismatching
> schemas or a different order of fields, and it seems like bugfixes have
> gone into place for those cases.
>
> Thanks in advance.
> Efe
>
>
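One workaround sketch, assuming the empty seed from spark.emptyDataset is what trips the analyzer: collect the conditionally built datasets first and only union concrete data, falling back to an explicitly typed empty dataset when nothing qualifies. The case class fields and toy data are placeholders:

import org.apache.spark.sql.{Dataset, SparkSession}

case class SomeData(key: String, amount: BigDecimal)

val spark = SparkSession.builder().appName("union-sketch").getOrCreate()
import spark.implicits._

// Datasets produced conditionally inside the loop (toy stand-ins here).
val pieces: Seq[Dataset[SomeData]] =
  Seq(Seq(SomeData("a", BigDecimal(1))).toDS())

// Union only real datasets; no union ever involves the empty seed.
val data: Dataset[SomeData] =
  if (pieces.isEmpty) spark.createDataset(Seq.empty[SomeData])
  else pieces.reduce(_ union _)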


Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Mike Metzger
I've not done this in Scala yet, but in PySpark I've run into a similar
issue where having too many dataframes cached does cause memory issues.
Unpersist by itself did not clear the memory usage, but rather setting the
variable equal to None allowed all the references to be cleared and the
memory issues went away.

I do not fully understand Scala yet, but you may be able to set one of your
dataframe variables to null to accomplish the same.

Mike
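A rough Scala sketch of the same idea, combined with an occasional checkpoint so the lineage (and the ever-growing list of skipped stages) does not accumulate. The transform function, checkpoint directory, iteration cap and convergence test are placeholders, and Dataset.checkpoint assumes Spark 2.1+ (on older versions the equivalent is writing the intermediate result out and reading it back):

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("iterative-update").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  // placeholder path

// Placeholder for whatever "modifies" the frame on each pass.
def transform(df: DataFrame): DataFrame = df

var dfA: DataFrame = spark.range(0, 1000).toDF("value").persist()
var changed = true
var iteration = 0

while (changed && iteration < 100) {
  var next = transform(dfA).persist()
  next.count()                           // materialize before dropping the old frame
  dfA.unpersist()
  if (iteration % 10 == 0) next = next.checkpoint()  // truncate lineage periodically
  changed = false                        // placeholder: compare next with dfA here
  dfA = next                             // old reference dropped, eligible for GC
  iteration += 1
}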


On Mon, Oct 17, 2016 at 8:38 PM, Mungeol Heo  wrote:

> First of all, thank you for your comments.
> Actually, what I mean by "update" is generating a new data frame with modified
> data.
> The more detailed while loop will be something like below.
>
> var continue = 1
> var dfA = "a data frame"
> dfA.persist
>
> while (continue > 0) {
>   val temp = "modified dfA"
>   temp.persist
>   temp.count
>   dfA.unpersist
>
>   dfA = "modified temp"
>   dfA.persist
>   dfA.count
>   temp.unpersist
>
>   if ("dfA is not modified") {
> continue = 0
>   }
> }
>
> The problem is it will cause OOM finally.
> And, the number of skipped stages will increase every time, even though
> I am not sure whether this is the reason causing the OOM.
> Maybe, I need to check the source code of one of the spark ML algorithms.
> Again, thank you all.
>
>
> On Mon, Oct 17, 2016 at 10:54 PM, Thakrar, Jayesh
>  wrote:
> > Yes, iterating over a dataframe and making changes is not uncommon.
> >
> > Of course RDDs, dataframes and datasets are immutable, but there is some
> > optimization in the optimizer that can potentially help to dampen the
> > effect/impact of creating a new rdd, df or ds.
> >
> > Also, the use-case you cited is similar to what is done in regression,
> > clustering and other algorithms.
> >
> > I.e. you iterate making a change to a dataframe/dataset until the desired
> > condition.
> >
> > E.g. see this -
> > https://spark.apache.org/docs/1.6.1/ml-classification-
> regression.html#linear-regression
> > and the setting of the iteration ceiling
> >
> >
> >
> > // instantiate the base classifier
> >
> > val classifier = new LogisticRegression()
> >
> >   .setMaxIter(params.maxIter)
> >
> >   .setTol(params.tol)
> >
> >   .setFitIntercept(params.fitIntercept)
> >
> >
> >
> > Now the impact of that depends on a variety of things.
> >
> > E.g. if the data is completely contained in memory and there is no spill
> > over to disk, it might not be a big issue (of course there will still be
> > memory, CPU and network overhead/latency).
> >
> > If you are looking at storing the data on disk (e.g. as part of a
> checkpoint
> > or explicit storage), then there can be substantial I/O activity.
> >
> >
> >
> >
> >
> >
> >
> > From: Xi Shen 
> > Date: Monday, October 17, 2016 at 2:54 AM
> > To: Divya Gehlot , Mungeol Heo
> > 
> > Cc: "user @spark" 
> > Subject: Re: Is spark a right tool for updating a dataframe repeatedly
> >
> >
> >
> > I think most of the "big data" tools, like Spark and Hive, are not
> designed
> > to edit data. They are only designed to query data. I wonder in what
> > scenario you need to update large volume of data repetitively.
> >
> >
> >
> >
> >
> > On Mon, Oct 17, 2016 at 2:00 PM Divya Gehlot 
> > wrote:
> >
> > If my understanding of your query is correct:
> >
> > In Spark, DataFrames are immutable; you can't update a dataframe.
> >
> > You have to create a new dataframe to update the current one.
> >
> >
> >
> >
> >
> > Thanks,
> >
> > Divya
> >
> >
> >
> >
> >
> > On 17 October 2016 at 09:50, Mungeol Heo  wrote:
> >
> > Hello, everyone.
> >
> > As I mentioned in the title, I wonder whether Spark is the right tool for
> > updating a data frame repeatedly until there is no more data to
> > update.
> >
> > For example.
> >
> > while (if there was a updating) {
> > update a data frame A
> > }
> >
> > If it is the right tool, then what is the best practice for this kind of
> > work?
> > Thank you.
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
> >
> >
> > --
> >
> >
> > Thanks,
> > David S.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Did anybody come across this random-forest issue with spark 2.0.1.

2016-10-17 Thread 市场部
Hi Xi Shen

The warning message wasn't removed after I had upgraded my Java to v8,
but anyway I appreciate your kind help.

Since it’s just a WARN, I suppose I can bear with it and nothing bad would 
really happen. Am I right?


16/10/18 11:12:42 WARN RandomForest: Tree learning is using approximately
268437864 bytes per iteration, which exceeds requested limit 
maxMemoryUsage=268435456. This allows splitting 80088 nodes in this iteration.
16/10/18 11:13:07 WARN RandomForest: Tree learning is using approximately 
268436304 bytes per iteration, which exceeds requested limit 
maxMemoryUsage=268435456. This allows splitting 80132 nodes in this iteration.
16/10/18 11:13:32 WARN RandomForest: Tree learning is using approximately 
268437816 bytes per iteration, which exceeds requested limit 
maxMemoryUsage=268435456. This allows splitting 80082 nodes in this iteration.



From: zhangjianxin 
>
Date: Monday, 17 October 2016, 8:16 PM
To: Xi Shen >
Cc: "user@spark.apache.org" 
>
Subject: Re: Did anybody come across this random-forest issue with spark 2.0.1.

Hi Xi Shen

Not yet. For the moment my JDK for Spark is still v7. Thanks for your
reminder; I will try it out by upgrading Java.

From: Xi Shen >
Date: Monday, 17 October 2016, 8:00 PM
To: zhangjianxin 
>, 
"user@spark.apache.org" 
>
Subject: Re: Did anybody come across this random-forest issue with spark 2.0.1.

Did you also upgrade to Java from v7 to v8?

On Mon, Oct 17, 2016 at 7:19 PM 张建鑫(市场部) 
> wrote:

Did anybody encounter this problem before and why it happens , how to solve it? 
 The same training data and same source code work in 1.6.1, however become 
lousy in 2.0.1

--

Thanks,
David S.


Fwd: jdbcRDD for data ingestion from RDBMS

2016-10-17 Thread Ninad Shringarpure
Hi Team,

One of my client teams is trying to see if they can use Spark to source
data from an RDBMS instead of Sqoop. The data would be substantially large,
on the order of billions of records.

Reading the documentation, I am not sure whether JdbcRDD by design will be
able to scale well for this amount of data. Plus, some built-in features
provided in Sqoop, like --direct, might give better performance than
straight-up JDBC.

My primary question to this group is whether it is advisable to use JdbcRDD
for data sourcing and whether we can expect it to scale. Also, performance-wise,
how would it compare to Sqoop?

Please let me know your thoughts and any pointers if anyone in the group
has already implemented it.

Thanks,
Ninad
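For what it's worth, a hedged sketch of the DataFrame JDBC source (which largely supersedes the low-level JdbcRDD for this kind of extract); it parallelizes the read by splitting a numeric column into ranges. The URL, table, credentials and bounds are invented placeholders, and the source database has to be able to sustain numPartitions concurrent range queries:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-extract").getOrCreate()

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/sales")   // placeholder URL
  .option("dbtable", "transactions")                      // placeholder table
  .option("user", "etl")
  .option("password", "secret")
  .option("partitionColumn", "id")      // numeric, ideally indexed column to split on
  .option("lowerBound", "1")
  .option("upperBound", "2000000000")
  .option("numPartitions", "64")        // 64 parallel range queries
  .load()

df.write.parquet("hdfs:///staging/transactions")          // placeholder target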


Re: previous stage results are not saved?

2016-10-17 Thread Mark Hamstra
There is no need to do that if 1) the stage that you are concerned with
either made use of or produced MapOutputs/shuffle files; 2) reuse of those
shuffle files (which may very well be in the OS buffer cache of the worker
nodes) is sufficient for your needs; 3) the relevant Stage objects haven't
gone out of scope, which would allow the shuffle files to be removed; 4)
you reuse the exact same Stage objects that were used previously.  If all
of that is true, then Spark will re-use the prior stage with performance
very similar to if you had explicitly cached an equivalent RDD.
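When those conditions do not hold, the explicit route is simply to persist the intermediate result before running further actions; a minimal sketch, with the pipeline below standing in for the real ten-stage job:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("reuse-sketch").getOrCreate()

// Stand-in for the expensive multi-stage computation.
val intermediate = spark.range(0, 1000000)
  .selectExpr("id % 100 as k", "id as v")
  .groupBy("k").count()
  .persist(StorageLevel.MEMORY_AND_DISK)

intermediate.count()   // materialize once
intermediate.show(5)   // later actions reuse the cached partitions instead of rerunning earlier stages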

On Mon, Oct 17, 2016 at 4:53 PM, ayan guha  wrote:

> You can use cache or persist.
>
> On Tue, Oct 18, 2016 at 10:11 AM, Yang  wrote:
>
>> I'm trying out 2.0, and ran a long job with 10 stages, in spark-shell
>>
>> it seems that after all 10 finished successfully, if I run the last, or
>> the 9th again,
>> spark reruns all the previous stages from scratch, instead of utilizing
>> the partial results.
>>
>> this is quite serious since I can't experiment while making small changes
>> to the code.
>>
>> any idea what part of the spark framework might have caused this ?
>>
>> thanks
>> Yang
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Mungeol Heo
First of all, thank you for your comments.
Actually, what I mean by "update" is generating a new data frame with modified data.
The more detailed while loop will be something like the one below.

var continue = 1
var dfA = "a data frame"
dfA.persist

while (continue > 0) {
  val temp = "modified dfA"
  temp.persist
  temp.count
  dfA.unpersist

  dfA = "modified temp"
  dfA.persist
  dfA.count
  temp.unpersist

  if ("dfA is not modified") {
continue = 0
  }
}

The problem is it will cause OOM finally.
And, the number of skipped stages will increase every time, even though
I am not sure whether this is the reason causing the OOM.
Maybe, I need to check the source code of one of the spark ML algorithms.
Again, thank you all.


On Mon, Oct 17, 2016 at 10:54 PM, Thakrar, Jayesh
 wrote:
> Yes, iterating over a dataframe and making changes is not uncommon.
>
> Of course RDDs, dataframes and datasets are immutable, but there is some
> optimization in the optimizer that can potentially help to dampen the
> effect/impact of creating a new rdd, df or ds.
>
> Also, the use-case you cited is similar to what is done in regression,
> clustering and other algorithms.
>
> I.e. you iterate making a change to a dataframe/dataset until the desired
> condition.
>
> E.g. see this -
> https://spark.apache.org/docs/1.6.1/ml-classification-regression.html#linear-regression
> and the setting of the iteration ceiling
>
>
>
> // instantiate the base classifier
>
> val classifier = new LogisticRegression()
>
>   .setMaxIter(params.maxIter)
>
>   .setTol(params.tol)
>
>   .setFitIntercept(params.fitIntercept)
>
>
>
> Now the impact of that depends on a variety of things.
>
> E.g. if the data is completely contained in memory and there is no spill
> over to disk, it might not be a big issue (of course there will still be
> memory, CPU and network overhead/latency).
>
> If you are looking at storing the data on disk (e.g. as part of a checkpoint
> or explicit storage), then there can be substantial I/O activity.
>
>
>
>
>
>
>
> From: Xi Shen 
> Date: Monday, October 17, 2016 at 2:54 AM
> To: Divya Gehlot , Mungeol Heo
> 
> Cc: "user @spark" 
> Subject: Re: Is spark a right tool for updating a dataframe repeatedly
>
>
>
> I think most of the "big data" tools, like Spark and Hive, are not designed
> to edit data. They are only designed to query data. I wonder in what
> scenario you need to update large volume of data repetitively.
>
>
>
>
>
> On Mon, Oct 17, 2016 at 2:00 PM Divya Gehlot 
> wrote:
>
> If my understanding of your query is correct:
>
> In Spark, DataFrames are immutable; you can't update a dataframe.
>
> You have to create a new dataframe to update the current one.
>
>
>
>
>
> Thanks,
>
> Divya
>
>
>
>
>
> On 17 October 2016 at 09:50, Mungeol Heo  wrote:
>
> Hello, everyone.
>
> As I mentioned in the title, I wonder whether Spark is the right tool for
> updating a data frame repeatedly until there is no more data to
> update.
>
> For example.
>
> while (if there was a updating) {
> update a data frame A
> }
>
> If it is the right tool, then what is the best practice for this kind of
> work?
> Thank you.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
> --
>
>
> Thanks,
> David S.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: previous stage results are not saved?

2016-10-17 Thread ayan guha
You can use cache or persist.

On Tue, Oct 18, 2016 at 10:11 AM, Yang  wrote:

> I'm trying out 2.0, and ran a long job with 10 stages, in spark-shell
>
> it seems that after all 10 finished successfully, if I run the last, or
> the 9th again,
> spark reruns all the previous stages from scratch, instead of utilizing
> the partial results.
>
> this is quite serious since I can't experiment while making small changes
> to the code.
>
> any idea what part of the spark framework might have caused this ?
>
> thanks
> Yang
>



-- 
Best Regards,
Ayan Guha


Re: Accessing Hbase tables through Spark, this seems to work

2016-10-17 Thread ayan guha
I do not see a rationale to have HBase in this scheme of things... maybe I
am missing something?

If data is delivered in HDFS, why not just add partition to an existing
Hive table?

On Tue, Oct 18, 2016 at 8:23 AM, Mich Talebzadeh 
wrote:

> Thanks Mike,
>
> My test csv data comes as
>
> UUID, ticker,  timecreated,  price
> a2c844ed-137f-4820-aa6e-c49739e46fa6, S01, 2016-10-17T22:02:09,
> 53.36665625650533484995
> a912b65e-b6bc-41d4-9e10-d6a44ea1a2b0, S02, 2016-10-17T22:02:09,
> 86.31917515824627016510
> 5f4e3a9d-05cc-41a2-98b3-40810685641e, S03, 2016-10-17T22:02:09,
> 95.48298277703729129559
>
>
> And this is my Hbase table with one column family
>
> create 'marketDataHbase', 'price_info'
>
> It is populated every 15 minutes from test.csv files delivered via Kafka
> and Flume to HDFS
>
>
>1. Create a fat csv file based on all small csv files for today -->
>prices/2016-10-17
>2. Populate data into Hbase table using 
> org.apache.hadoop.hbase.mapreduce.ImportTsv
>
>3. This is pretty quick using MapReduce
>
>
> That importTsv only appends new rows to Hbase table as the choice of UUID
> as rowKey avoids any issues.
>
> So I only have 15 minutes lag in my batch Hbase table.
>
> I have both Hive ORC tables and Phoenix views on top of this Hbase tables.
>
>
>1. Phoenix offers the fastest response if used on top of Hbase.
>unfortunately, Spark 2 with Phoenix is broken
>2. Spark on Hive on Hbase looks OK. This works fine with Spark 2
>3. Spark on Hbase tables directly using key, value DFs for each
>column. Not as fast as 2 but works. I don't think a DF is a good choice for
>a key, value pair?
>
> Now if I use Zeppelin to read from Hbase
>
>
>1. I can use Phoenix JDBC. That looks very fast
>2. I can use Spark csv directly on HDFS csv files.
>3. I can use Spark on Hive tables
>
>
> If I use Tableau on Hbase data then, only sql like code is useful. Phoenix
> or Hive
>
> I don't want to change the design now. But admittedly Hive is the best SQL
> on top of Hbase. Next release of Hive is going to have in-memory database
> (LLAP) so we can cache Hive tables in memory. That will be faster. Many
> people underestimate Hive but I still believe it has a lot to offer besides
> serious ANSI compliant SQL.
>
> Regards
>
>  Mich
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 17 October 2016 at 21:54, Michael Segel 
> wrote:
>
>> Mitch,
>>
>> Short answer… no, it doesn’t scale.
>>
>> Longer answer…
>>
>> You are using an UUID as the row key?  Why?  (My guess is that you want
>> to avoid hot spotting)
>>
>> So you’re going to have to pull in all of the data… meaning a full table
>> scan… and then perform a sort order transformation, dropping the UUID in
>> the process.
>>
>> You would be better off not using HBase and storing the data in Parquet
>> files in a directory partitioned on date.  Or rather the rowkey would be
>> the max_ts - TS so that your data is in LIFO.
>> Note: I’ve used the term epoch to describe the max value of a long (8
>> bytes of ‘FF’ ) for the max_ts. This isn’t a good use of the term epoch,
>> but if anyone has a better term, please let me know.
>>
>>
>>
>> Having said that… if you want to use HBase, you could do the same thing.
>> If you want to avoid hot spotting, you could load the day’s transactions
>> using a bulk loader so that you don’t have to worry about splits.
>>
>> But that’s just my $0.02 cents worth.
>>
>> HTH
>>
>> -Mike
>>
>> PS. If you wanted to capture the transactions… you could do the following
>> schemea:
>>
>> 1) Rowkey = max_ts - TS
>> 2) Rows contain the following:
>> CUSIP (Transaction ID)
>> Party 1 (Seller)
>> Party 2 (Buyer)
>> Symbol
>> Qty
>> Price
>>
>> This is a trade ticket.
>>
>>
>>
>> On Oct 16, 2016, at 1:37 PM, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> I have trade data stored in Hbase table. Data arrives in csv format to
>> HDFS and then loaded into Hbase via periodic load with
>> org.apache.hadoop.hbase.mapreduce.ImportTsv.
>>
>> The Hbase table has one Column family "trade_info" and three columns:
>> ticker, timecreated, price.
>>
>> The RowKey is UUID. So each row has UUID, ticker, timecreated and price
>> in the csv file
>>
>> Each row in Hbase is a key, value map. In my case, I have one 

Re: K-Mean retrieving Cluster Members

2016-10-17 Thread Reth RM
I think I got it

  parsedData.foreach(
      new VoidFunction<Vector>() {
          @Override
          public void call(Vector vector) throws Exception {
              System.out.println(clusters.predict(vector));
          }
      }
  );
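If the goal is the members of each centre rather than just the printed predictions, a small Scala sketch of the same idea is to key every point by its predicted cluster and group. The toy data is a stand-in for the real parsedData, and sc is the spark-shell SparkContext:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Toy stand-in for parsedData; in the real job it comes from the parsed input.
val parsedData = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

val clusters = KMeans.train(parsedData, 2, 20)

// Pair each point with its predicted cluster, then group to get the members.
val membersByCluster = parsedData
  .map(v => (clusters.predict(v), v))
  .groupByKey()

membersByCluster.collect().foreach { case (clusterId, members) =>
  println(s"cluster $clusterId -> ${members.mkString(", ")}")
}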


On Mon, Oct 17, 2016 at 10:56 AM, Reth RM  wrote:

> Could you please point me to sample code to retrieve the cluster members
> of K-means?
>
> The below code prints cluster centers. I
> need the cluster members belonging to each center.
>
>
>
> val clusters = KMeans.train(parsedData, numClusters, numIterations) 
> clusters.clusterCenters.foreach(println)
>
>


Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
It's a quasi-columnar store.
Sort of a hybrid approach.


On Oct 17, 2016, at 4:30 PM, Mich Talebzadeh 
> wrote:

I assume that HBase is more of a columnar data store by virtue of it storing
column data together.

Many interpretations of this are floating around. However, it is not columnar in
the sense of a column-based (as opposed to row-based) implementation of the
relational model.



Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 17 October 2016 at 22:14, Jörn Franke 
> wrote:
An OLTP use-case scenario does not necessarily mean traditional OLTP. See also
Apache HAWQ etc.; they can indeed fit some use cases well, others less so.

On 17 Oct 2016, at 23:02, Michael Segel 
> wrote:

You really don’t want to do OLTP on a distributed NoSQL engine.
Remember Big Data isn’t relational its more of a hierarchy model or record 
model. Think IMS or Pick (Dick Pick’s revelation, U2, Universe, etc …)


On Oct 17, 2016, at 3:45 PM, Jörn Franke 
> wrote:

It has some implication because it imposes the SQL model on Hbase. Internally 
it translates the SQL queries into custom Hbase processors. Keep also in mind 
for what Hbase need a proper key design and how Phoenix designs those keys to 
get the best performance out of it. I think for oltp it is a workable model and 
I think they plan to offer Phoenix as a default interface as part of Hbase 
anyway.
For OLAP it depends.


On 17 Oct 2016, at 22:34, ayan guha 
> wrote:


Hi

Any reason not to recommend Phoneix? I haven't used it myself so curious about 
pro's and cons about the use of it.

On 18 Oct 2016 03:17, "Michael Segel" 
> wrote:
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, its probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, a 
Relational Engine… its not.

There are some considerations which make HBase not ideal for all use cases and 
you may find better performance with Parquet files.

One thing missing is the use of secondary indexing and query optimizations that 
you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your 
performance will vary.

With respect to Tableau… their entire interface in to the big data world 
revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
part of your solution, you’re DOA w respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim 
> wrote:

Thanks for all the suggestions. It would seem you guys are right about the 
Tableau side of things. The reports don’t need to be real-time, and they won’t 
be directly feeding off of the main DMP HBase data. Instead, it’ll be batched 
to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase 
for ID generation, but after further investigation into the use-case and 
architecture, the ID generation needs to happen local to the Ad Servers where 
we generate a unique ID and store it in a ID linking table. Even better, many 
of the 3rd party services supply this ID. So, data only needs to flow in one 
direction. We will use Kafka as the bus for this. No JDBC required. This is 
also goes for the REST Endpoints. 3rd party services will hit ours to update 
our data with no need to read from our data. And, when we want to update their 
data, we will hit theirs to update their data using a triggered job.

This al boils down to just integrating with Kafka.

Once again, thanks for all the help.

Cheers,
Ben


On Oct 9, 2016, at 3:16 AM, Jörn Franke 
> wrote:

please keep also in mind that Tableau Server has the capabilities to store data 
in-memory and refresh only when needed the in-memory data. This means you can 
import it from any source and let your users work only on the in-memory data in 
Tableau Server.

On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke 

previous stage results are not saved?

2016-10-17 Thread Yang
I'm trying out 2.0, and ran a long job with 10 stages, in spark-shell

it seems that after all 10 finished successfully, if I run the last, or the
9th again,
spark reruns all the previous stages from scratch, instead of utilizing the
partial results.

this is quite serious since I can't experiment while making small changes
to the code.

any idea what part of the spark framework might have caused this ?

thanks
Yang


Re: Consuming parquet files built with version 1.8.1

2016-10-17 Thread Cheng Lian

Hi Dinesh,

Thanks for reporting. This is kind of weird and I can't reproduce it.
Were you doing the experiments using a cleanly compiled Spark master branch?
Also, I don't think you have to use parquet-mr 1.8.1 to read Parquet files
generated using parquet-mr 1.8.1 unless you are using something not
implemented in 1.7.


Cheng


On 9/6/16 12:34 AM, Dinesh Narayanan wrote:

Hello,
I have some Parquet files generated with 1.8.1 through an MR job that
I need to consume. I see that master is built with Parquet 1.8.1, but I
get this error (with the master branch):


java.lang.NoSuchMethodError: 
org.apache.parquet.schema.Types$MessageTypeBuilder.addFields([Lorg/apache/parquet/schema/Type;)Lorg/apache/parquet/schema/Types$GroupBuilder;
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetSchema(ParquetReadSupport.scala:114)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport.init(ParquetReadSupport.scala:67)
at 
org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:136)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:360)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:339)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown 
Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown 
Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

Do you think I am missing something here, or is this a potential
bug? Any workarounds to use Parquet files with version PARQUET_2_0?


Thanks
Dinesh




Broadcasting Complex Custom Objects

2016-10-17 Thread Pedro Tuero
Hi guys,

I'm trying to do a job with Spark, using Java.

The thing is, I need to have an index of words of about 3 GB on each
machine, so I'm trying to broadcast custom objects that represent the index
and the interface to it.
I'm using Java standard serialization, so I tried to implement the Serializable
interface in each class involved, but some objects come from libraries, so I
can't go any further.

Is there another way to make it work?
Should I try Kryo?
Is there a way to work with non-serializable objects?

I use a fat jar, so the code is really available on all workers. I think there
should be a way to use it instead of serializing and deserializing
everything.

Thanks,
Pedro
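A hedged Scala sketch of the Kryo route (the Java API is analogous). WordIndex is a made-up placeholder for the custom index class, and any library types it holds must still be handled by Kryo even if they are not java.io.Serializable:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder wrapper for the ~3 GB word index.
class WordIndex(val postings: Map[String, Array[Int]]) extends Serializable

val conf = new SparkConf()
  .setAppName("index-job")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[WordIndex]))

val sc = new SparkContext(conf)

// Broadcast once; every executor keeps a single deserialized copy.
val indexBc = sc.broadcast(new WordIndex(Map.empty))   // load the real index here

// Tasks then read indexBc.value instead of capturing the index in closures.

Since the fat jar is already on every worker, another option is to skip broadcasting entirely and build the index lazily inside a singleton object on each executor, so nothing has to be serialized at all.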


Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Mich Talebzadeh
I assume that HBase is more of a columnar data store by virtue of it storing
column data together.

Many interpretations of this are floating around. However, it is not columnar
in the sense of a column-based (as opposed to row-based) implementation of the
relational model.



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 17 October 2016 at 22:14, Jörn Franke  wrote:

> Oltp use case scenario does not mean necessarily the traditional oltp. See
> also apache hawk etc. they can fit indeed to some use cases to some other
> less.
>
> On 17 Oct 2016, at 23:02, Michael Segel  wrote:
>
> You really don’t want to do OLTP on a distributed NoSQL engine.
> Remember Big Data isn’t relational its more of a hierarchy model or record
> model. Think IMS or Pick (Dick Pick’s revelation, U2, Universe, etc …)
>
>
>
> On Oct 17, 2016, at 3:45 PM, Jörn Franke  wrote:
>
> It has some implication because it imposes the SQL model on Hbase.
> Internally it translates the SQL queries into custom Hbase processors. Keep
> also in mind for what Hbase need a proper key design and how Phoenix
> designs those keys to get the best performance out of it. I think for oltp
> it is a workable model and I think they plan to offer Phoenix as a default
> interface as part of Hbase anyway.
> For OLAP it depends.
>
>
> On 17 Oct 2016, at 22:34, ayan guha  wrote:
>
> Hi
>
> Any reason not to recommend Phoneix? I haven't used it myself so curious
> about pro's and cons about the use of it.
> On 18 Oct 2016 03:17, "Michael Segel"  wrote:
>
>> Guys,
>> Sorry for jumping in late to the game…
>>
>> If memory serves (which may not be a good thing…) :
>>
>> You can use HiveServer2 as a connection point to HBase.
>> While this doesn’t perform well, its probably the cleanest solution.
>> I’m not keen on Phoenix… wouldn’t recommend it….
>>
>>
>> The issue is that you’re trying to make HBase, a key/value object store,
>> a Relational Engine… its not.
>>
>> There are some considerations which make HBase not ideal for all use
>> cases and you may find better performance with Parquet files.
>>
>> One thing missing is the use of secondary indexing and query
>> optimizations that you have in RDBMSs and are lacking in HBase / MapRDB /
>> etc …  so your performance will vary.
>>
>> With respect to Tableau… their entire interface in to the big data world
>> revolves around the JDBC/ODBC interface. So if you don’t have that piece as
>> part of your solution, you’re DOA w respect to Tableau.
>>
>> Have you considered Drill as your JDBC connection point?  (YAAP: Yet
>> another Apache project)
>>
>>
>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim  wrote:
>>
>> Thanks for all the suggestions. It would seem you guys are right about
>> the Tableau side of things. The reports don’t need to be real-time, and
>> they won’t be directly feeding off of the main DMP HBase data. Instead,
>> it’ll be batched to Parquet or Kudu/Impala or even PostgreSQL.
>>
>> I originally thought that we needed two-way data retrieval from the DMP
>> HBase for ID generation, but after further investigation into the use-case
>> and architecture, the ID generation needs to happen local to the Ad Servers
>> where we generate a unique ID and store it in a ID linking table. Even
>> better, many of the 3rd party services supply this ID. So, data only needs
>> to flow in one direction. We will use Kafka as the bus for this. No JDBC
>> required. This is also goes for the REST Endpoints. 3rd party services will
>> hit ours to update our data with no need to read from our data. And, when
>> we want to update their data, we will hit theirs to update their data using
>> a triggered job.
>>
>> This al boils down to just integrating with Kafka.
>>
>> Once again, thanks for all the help.
>>
>> Cheers,
>> Ben
>>
>>
>> On Oct 9, 2016, at 3:16 AM, Jörn Franke  wrote:
>>
>> please keep also in mind that Tableau Server has the capabilities to
>> store data in-memory and refresh only when needed the in-memory data. This
>> means you can import it from any source and let your users work only on the
>> in-memory data in Tableau Server.
>>
>> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke  wrote:
>>
>>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich
>>> provided already a good alternative. However, you should check if it

Re: Accessing Hbase tables through Spark, this seems to work

2016-10-17 Thread Mich Talebzadeh
Thanks Mike,

My test csv data comes as

UUID, ticker,  timecreated,  price
a2c844ed-137f-4820-aa6e-c49739e46fa6, S01, 2016-10-17T22:02:09,
53.36665625650533484995
a912b65e-b6bc-41d4-9e10-d6a44ea1a2b0, S02, 2016-10-17T22:02:09,
86.31917515824627016510
5f4e3a9d-05cc-41a2-98b3-40810685641e, S03, 2016-10-17T22:02:09,
95.48298277703729129559


And this is my Hbase table with one column family

create 'marketDataHbase', 'price_info'

It is populated every 15 minutes from test.csv files delivered via Kafka
and Flume to HDFS


   1. Create a fat csv file based on all small csv files for today -->
   prices/2016-10-17
   2. Populate data into Hbase table using
   org.apache.hadoop.hbase.mapreduce.ImportTsv
   3. This is pretty quick using MapReduce


That ImportTsv only appends new rows to the HBase table, as the choice of UUID
as rowKey avoids any issues.

So I only have 15 minutes lag in my batch Hbase table.

I have both Hive ORC tables and Phoenix views on top of these HBase tables.


   1. Phoenix offers the fastest response if used on top of Hbase.
   unfortunately, Spark 2 with Phoenix is broken
   2. Spark on Hive on Hbase looks OK. This works fine with Spark 2
   3. Spark on Hbase tables directly using key, value DFs for each column.
   Not as fast as 2 but works. I don't think a DF is a good choice for a key,
   value pair?

Now if I use Zeppelin to read from Hbase


   1. I can use Phoenix JDBC. That looks very fast
   2. I can use Spark csv directly on HDFS csv files.
   3. I can use Spark on Hive tables


If I use Tableau on HBase data, then only SQL-like code is useful: Phoenix
or Hive.

I don't want to change the design now. But admittedly Hive is the best SQL
on top of HBase. The next release of Hive is going to have an in-memory
database (LLAP) so we can cache Hive tables in memory. That will be faster.
Many people underestimate Hive, but I still believe it has a lot to offer
besides serious ANSI-compliant SQL.

Regards

 Mich


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 17 October 2016 at 21:54, Michael Segel 
wrote:

> Mitch,
>
> Short answer… no, it doesn’t scale.
>
> Longer answer…
>
> You are using an UUID as the row key?  Why?  (My guess is that you want to
> avoid hot spotting)
>
> So you’re going to have to pull in all of the data… meaning a full table
> scan… and then perform a sort order transformation, dropping the UUID in
> the process.
>
> You would be better off not using HBase and storing the data in Parquet
> files in a directory partitioned on date.  Or rather the rowkey would be
> the max_ts - TS so that your data is in LIFO.
> Note: I’ve used the term epoch to describe the max value of a long (8
> bytes of ‘FF’ ) for the max_ts. This isn’t a good use of the term epoch,
> but if anyone has a better term, please let me know.
>
>
>
> Having said that… if you want to use HBase, you could do the same thing.
> If you want to avoid hot spotting, you could load the day’s transactions
> using a bulk loader so that you don’t have to worry about splits.
>
> But that’s just my $0.02 cents worth.
>
> HTH
>
> -Mike
>
> PS. If you wanted to capture the transactions… you could do the following
> schema:
>
> 1) Rowkey = max_ts - TS
> 2) Rows contain the following:
> CUSIP (Transaction ID)
> Party 1 (Seller)
> Party 2 (Buyer)
> Symbol
> Qty
> Price
>
> This is a trade ticket.
>
>
>
> On Oct 16, 2016, at 1:37 PM, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> I have trade data stored in Hbase table. Data arrives in csv format to
> HDFS and then loaded into Hbase via periodic load with
> org.apache.hadoop.hbase.mapreduce.ImportTsv.
>
> The Hbase table has one Column family "trade_info" and three columns:
> ticker, timecreated, price.
>
> The RowKey is UUID. So each row has UUID, ticker, timecreated and price in
> the csv file
>
> Each row in Hbase is a key, value map. In my case, I have one Column
> Family and three columns. Without going into semantics I see Hbase as a
> column oriented database where column data stay together.
>
> So I thought of this way of accessing the data.
>
> I define an RDD for each column in the column family as below. In this
> case column trade_info:ticker
>
> //create rdd
> val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
> classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
> classOf[org.apache.hadoop.hbase.client.Result])
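A slightly fuller, hedged sketch of that approach for one column. The table name is a placeholder, `spark` is a Spark 2 SparkSession, and the HBase client jars are assumed to be on the classpath:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "tradeData")            // placeholder table name
conf.set(TableInputFormat.SCAN_COLUMNS, "trade_info:ticker")   // one column of the family

val hBaseRDD = spark.sparkContext.newAPIHadoopRDD(conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// Turn the (rowkey, Result) pairs into a simple key/value DataFrame.
import spark.implicits._
val tickerDF = hBaseRDD.map { case (key, result) =>
  (Bytes.toString(key.get()),
   Bytes.toString(result.getValue(Bytes.toBytes("trade_info"), Bytes.toBytes("ticker"))))
}.toDF("rowkey", "ticker")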

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Jörn Franke
An OLTP use-case scenario does not necessarily mean traditional OLTP. See also
Apache HAWQ etc.; they can indeed fit some use cases well, others less so.

> On 17 Oct 2016, at 23:02, Michael Segel  wrote:
> 
> You really don’t want to do OLTP on a distributed NoSQL engine. 
> Remember Big Data isn’t relational its more of a hierarchy model or record 
> model. Think IMS or Pick (Dick Pick’s revelation, U2, Universe, etc …) 
> 
>  
>> On Oct 17, 2016, at 3:45 PM, Jörn Franke  wrote:
>> 
>> It has some implication because it imposes the SQL model on Hbase. 
>> Internally it translates the SQL queries into custom Hbase processors. Keep 
>> also in mind for what Hbase need a proper key design and how Phoenix designs 
>> those keys to get the best performance out of it. I think for oltp it is a 
>> workable model and I think they plan to offer Phoenix as a default interface 
>> as part of Hbase anyway.
>> For OLAP it depends. 
>> 
>> 
>> On 17 Oct 2016, at 22:34, ayan guha  wrote:
>> 
>>> Hi
>>> 
>>> Any reason not to recommend Phoneix? I haven't used it myself so curious 
>>> about pro's and cons about the use of it.
>>> 
 On 18 Oct 2016 03:17, "Michael Segel"  wrote:
 Guys, 
 Sorry for jumping in late to the game… 
 
 If memory serves (which may not be a good thing…) :
 
 You can use HiveServer2 as a connection point to HBase.  
 While this doesn’t perform well, its probably the cleanest solution. 
 I’m not keen on Phoenix… wouldn’t recommend it…. 
 
 
 The issue is that you’re trying to make HBase, a key/value object store, a 
 Relational Engine… its not. 
 
 There are some considerations which make HBase not ideal for all use cases 
 and you may find better performance with Parquet files. 
 
 One thing missing is the use of secondary indexing and query optimizations 
 that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your 
 performance will vary. 
 
 With respect to Tableau… their entire interface in to the big data world 
 revolves around the JDBC/ODBC interface. So if you don’t have that piece 
 as part of your solution, you’re DOA w respect to Tableau. 
 
 Have you considered Drill as your JDBC connection point?  (YAAP: Yet 
 another Apache project) 
 
 
> On Oct 9, 2016, at 12:23 PM, Benjamin Kim  wrote:
> 
> Thanks for all the suggestions. It would seem you guys are right about 
> the Tableau side of things. The reports don’t need to be real-time, and 
> they won’t be directly feeding off of the main DMP HBase data. Instead, 
> it’ll be batched to Parquet or Kudu/Impala or even PostgreSQL.
> 
> I originally thought that we needed two-way data retrieval from the DMP 
> HBase for ID generation, but after further investigation into the 
> use-case and architecture, the ID generation needs to happen local to the 
> Ad Servers where we generate a unique ID and store it in a ID linking 
> table. Even better, many of the 3rd party services supply this ID. So, 
> data only needs to flow in one direction. We will use Kafka as the bus 
> for this. No JDBC required. This is also goes for the REST Endpoints. 3rd 
> party services will hit ours to update our data with no need to read from 
> our data. And, when we want to update their data, we will hit theirs to 
> update their data using a triggered job.
> 
> This al boils down to just integrating with Kafka.
> 
> Once again, thanks for all the help.
> 
> Cheers,
> Ben
> 
> 
>> On Oct 9, 2016, at 3:16 AM, Jörn Franke  wrote:
>> 
>> please keep also in mind that Tableau Server has the capabilities to 
>> store data in-memory and refresh only when needed the in-memory data. 
>> This means you can import it from any source and let your users work 
>> only on the in-memory data in Tableau Server.
>> 
>>> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke  
>>> wrote:
>>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich 
>>> provided already a good alternative. However, you should check if it 
>>> contains a recent version of Hbase and Phoenix. That being said, I just 
>>> wonder what is the dataflow, data model and the analysis you plan to 
>>> do. Maybe there are completely different solutions possible. Especially 
>>> these single inserts, upserts etc. should be avoided as much as 
>>> possible in the Big Data (analysis) world with any technology, because 
>>> they do not perform well. 
>>> 
>>> Hive with Llap will provide an in-memory cache for interactive 
>>> analytics. You can put full tables in-memory with Hive using Ignite 
>>> HDFS in-memory solution. All this does only 

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
You really don't want to do OLTP on a distributed NoSQL engine.
Remember Big Data isn't relational; it's more of a hierarchy or record
model. Think IMS or Pick (Dick Pick's Revelation, U2, Universe, etc.).


On Oct 17, 2016, at 3:45 PM, Jörn Franke 
> wrote:

It has some implication because it imposes the SQL model on Hbase. Internally 
it translates the SQL queries into custom Hbase processors. Keep also in mind 
for what Hbase need a proper key design and how Phoenix designs those keys to 
get the best performance out of it. I think for oltp it is a workable model and 
I think they plan to offer Phoenix as a default interface as part of Hbase 
anyway.
For OLAP it depends.


On 17 Oct 2016, at 22:34, ayan guha 
> wrote:


Hi

Any reason not to recommend Phoneix? I haven't used it myself so curious about 
pro's and cons about the use of it.

On 18 Oct 2016 03:17, "Michael Segel" 
> wrote:
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, its probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, a 
Relational Engine… its not.

There are some considerations which make HBase not ideal for all use cases and 
you may find better performance with Parquet files.

One thing missing is the use of secondary indexing and query optimizations that 
you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your 
performance will vary.

With respect to Tableau… their entire interface into the big data world 
revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
part of your solution, you’re DOA with respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim 
> wrote:

Thanks for all the suggestions. It would seem you guys are right about the 
Tableau side of things. The reports don’t need to be real-time, and they won’t 
be directly feeding off of the main DMP HBase data. Instead, it’ll be batched 
to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase 
for ID generation, but after further investigation into the use-case and 
architecture, the ID generation needs to happen local to the Ad Servers where 
we generate a unique ID and store it in an ID linking table. Even better, many 
of the 3rd party services supply this ID. So, data only needs to flow in one 
direction. We will use Kafka as the bus for this. No JDBC required. This 
also goes for the REST Endpoints. 3rd party services will hit ours to update 
our data with no need to read from our data. And, when we want to update their 
data, we will hit theirs to update their data using a triggered job.

This all boils down to just integrating with Kafka.

Once again, thanks for all the help.

Cheers,
Ben


On Oct 9, 2016, at 3:16 AM, Jörn Franke 
> wrote:

please keep also in mind that Tableau Server has the capability to store data 
in-memory and to refresh the in-memory data only when needed. This means you can 
import it from any source and let your users work only on the in-memory data in 
Tableau Server.

On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke 
> wrote:
Cloudera 5.8 has a very old version of Hive without Tez, but Mich provided 
already a good alternative. However, you should check if it contains a recent 
version of Hbase and Phoenix. That being said, I just wonder what is the 
dataflow, data model and the analysis you plan to do. Maybe there are 
completely different solutions possible. Especially these single inserts, 
upserts etc. should be avoided as much as possible in the Big Data (analysis) 
world with any technology, because they do not perform well.

Hive with Llap will provide an in-memory cache for interactive analytics. You 
can put full tables in-memory with Hive using Ignite HDFS in-memory solution. 
All this does only make sense if you do not use MR as an engine, the right 
input format (ORC, parquet) and a recent Hive version.

On 8 Oct 2016, at 21:55, Benjamin Kim 
> wrote:

Mich,

Unfortunately, we are moving away from Hive and unifying on Spark using CDH 5.8 
as our distro. And Tableau released a Spark ODBC/JDBC driver too. I will 
either try Phoenix JDBC Server for HBase or push to move faster to Kudu with 
Impala. We will use Impala as the JDBC in-between until the Kudu team completes 
Spark SQL support for JDBC.

Thanks for the advice.

Cheers,
Ben


On 

PostgresSql queries vs spark sql

2016-10-17 Thread Selvam Raman
Hi,

Please share me some idea if you work on this earlier.
How can i develop postgres CROSSTAB function in spark.

Postgres Example

Example 1:

SELECT mthreport.*
FROM
*crosstab*('SELECT i.item_name::text As row_name,
to_char(if.action_date, ''mon'')::text As bucket,
SUM(if.num_used)::integer As bucketvalue
FROM inventory As i INNER JOIN inventory_flow As if
ON i.item_id = if.item_id
  AND action_date BETWEEN date ''2007-01-01'' and date ''2007-12-31 23:59''
GROUP BY i.item_name, to_char(if.action_date, ''mon''),
date_part(''month'', if.action_date)
ORDER BY i.item_name',
'SELECT to_char(date ''2007-01-01'' + (n || '' month'')::interval,
''mon'') As short_mname
FROM generate_series(0,11) n')
As mthreport(item_name text, jan integer, feb integer, mar integer,
apr integer, may integer, jun integer, jul integer,
aug integer, sep integer, oct integer, nov integer,
dec integer)

The output of the above crosstab looks as follows:
[image: crosstab source_sql cat_sql example]

Example 2:

CREATE TABLE ct(id SERIAL, rowid TEXT, attribute TEXT, value TEXT);
INSERT INTO ct(rowid, attribute, value) VALUES('test1','att1','val1');
INSERT INTO ct(rowid, attribute, value) VALUES('test1','att2','val2');
INSERT INTO ct(rowid, attribute, value) VALUES('test1','att3','val3');
INSERT INTO ct(rowid, attribute, value) VALUES('test1','att4','val4');
INSERT INTO ct(rowid, attribute, value) VALUES('test2','att1','val5');
INSERT INTO ct(rowid, attribute, value) VALUES('test2','att2','val6');
INSERT INTO ct(rowid, attribute, value) VALUES('test2','att3','val7');
INSERT INTO ct(rowid, attribute, value) VALUES('test2','att4','val8');

SELECT *
FROM crosstab(
  'select rowid, attribute, value
   from ct
   where attribute = ''att2'' or attribute = ''att3''
   order by 1,2')
AS ct(row_name text, category_1 text, category_2 text, category_3 text);

 row_name | category_1 | category_2 | category_3
----------+------------+------------+------------
 test1    | val2       | val3       |
 test2    | val6       | val7       |
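
For comparison, a minimal Spark sketch of the same crosstab idea, assuming a
SparkSession named spark (e.g. the spark-shell); the DataFrame mirrors the "ct"
table from Example 2, and Spark's groupBy().pivot().agg() plays the role of
crosstab():

import spark.implicits._
import org.apache.spark.sql.functions.first

// Same data as the Postgres "ct" table in Example 2.
val ct = Seq(
  ("test1", "att1", "val1"), ("test1", "att2", "val2"),
  ("test1", "att3", "val3"), ("test1", "att4", "val4"),
  ("test2", "att1", "val5"), ("test2", "att2", "val6"),
  ("test2", "att3", "val7"), ("test2", "att4", "val8")
).toDF("rowid", "attribute", "value")

// Pivot attribute values into columns; listing the values avoids an extra pass.
val pivoted = ct
  .filter($"attribute".isin("att2", "att3"))
  .groupBy($"rowid")
  .pivot("attribute", Seq("att2", "att3"))
  .agg(first($"value"))

pivoted.show()
// +-----+----+----+
// |rowid|att2|att3|
// +-----+----+----+
// |test1|val2|val3|
// |test2|val6|val7|
// +-----+----+----+

Example 1 follows the same shape: group by item_name, pivot on the month, and
aggregate with sum(num_used).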


-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
Skip Phoenix

On Oct 17, 2016, at 2:20 PM, Thakrar, Jayesh 
> wrote:

Ben,

Also look at Phoenix (Apache project) which provides a better (one of the best) 
SQL/JDBC layer on top of HBase.
http://phoenix.apache.org/

Cheers,
Jayesh


From: vincent gromakowski 
>
Date: Monday, October 17, 2016 at 1:53 PM
To: Benjamin Kim >
Cc: Michael Segel 
>, Jörn Franke 
>, Mich Talebzadeh 
>, Felix Cheung 
>, 
"user@spark.apache.org" 
>
Subject: Re: Spark SQL Thriftserver with HBase

Instead of (or in addition to) saving results somewhere, you just start a 
thriftserver that exposes the Spark tables of the SQLContext (or SparkSession 
now). That means you can implement any logic (and maybe use structured 
streaming) to expose your data. Today, using the thriftserver means reading data 
from the persistent store on every query, so if the data modeling doesn't fit 
the query it can take quite a long time. What you generally do in a common Spark 
job is load the data and cache the Spark table in an in-memory columnar table, 
which is quite efficient for any kind of query; the counterpart is that the 
cache isn't updated, so you have to implement a reload mechanism, and this 
solution isn't available using the thriftserver.
What I propose is to mix the two worlds: periodically/delta load data into the 
Spark table cache and expose it through the thriftserver. But you have to 
implement the loading logic, which can be very simple or very complex depending 
on your needs.


2016-10-17 19:48 GMT+02:00 Benjamin Kim 
>:
Is this technique similar to what Kinesis is offering or what Structured 
Streaming is going to have eventually?

Just curious.

Cheers,
Ben


On Oct 17, 2016, at 10:14 AM, vincent gromakowski 
> wrote:

I would suggest coding your own Spark thriftserver, which seems to be very easy.
http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server

I am starting to test it. The big advantage is that you can implement any logic, 
because it's a Spark job, and then start a thrift server on a temporary table. For 
example you can query a micro-batch RDD from a Kafka stream, or pre-load some 
tables and implement a rolling cache to periodically update the Spark in-memory 
tables from the persistent store...
It's not part of the public API and I don't know yet what the issues of doing 
this are, but I think the Spark community should look at this path: making the 
thriftserver instantiable in any Spark job.
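
For what it's worth, a minimal sketch of that idea, assuming Spark 2.x with the
Hive thriftserver module on the classpath; the table name, path and refresh
interval below are made up:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object CachedThriftServer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("thriftserver-on-cached-tables")
      .enableHiveSupport()
      .getOrCreate()

    // Expose this job's tables over JDBC/ODBC (default port 10000).
    HiveThriftServer2.startWithContext(spark.sqlContext)

    // Rolling cache: periodically load a fresh snapshot, cache it, and swap the
    // view that JDBC clients query. Path and table names are placeholders.
    var current: Option[DataFrame] = None
    while (true) {
      val df = spark.read.parquet("hdfs:///data/prices/latest").cache()
      df.count()                            // materialise the new cache
      df.createOrReplaceTempView("prices")  // clients now see the fresh data
      current.foreach(_.unpersist())        // drop the previous generation
      current = Some(df)
      Thread.sleep(15 * 60 * 1000)          // e.g. refresh every 15 minutes
    }
  }
}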

2016-10-17 18:17 GMT+02:00 Michael Segel 
>:
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, it’s probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, a 
Relational Engine… it’s not.

There are some considerations which make HBase not ideal for all use cases and 
you may find better performance with Parquet files.

One thing missing is the use of secondary indexing and query optimizations that 
you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your 
performance will vary.

With respect to Tableau… their entire interface into the big data world 
revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
part of your solution, you’re DOA with respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim 
> wrote:

Thanks for all the suggestions. It would seem you guys are right about the 
Tableau side of things. The reports don’t need to be real-time, and they won’t 
be directly feeding off of the main DMP HBase data. Instead, it’ll be batched 
to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase 
for ID generation, but after further investigation into the use-case and 
architecture, the ID generation needs to happen local to the Ad Servers where 
we generate a unique ID and store it in an ID linking table. Even better, many 
of the 3rd party services supply this ID. So, data only needs to flow in one 
direction. We will use Kafka as the bus for this. No JDBC required. This 
also goes for 

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
@Mitch

You don’t have a schema in HBase other than the table name and the list of 
associated column families.

So you can’t really infer a schema easily…


On Oct 17, 2016, at 2:17 PM, Mich Talebzadeh 
> wrote:

How about this method of creating Data Frames on Hbase tables directly.

I define an RDD for each column in the column family as below. In this case 
column trade_info:ticker

//create rdd
val hBaseRDD = sc.newAPIHadoopRDD(conf, 
classOf[TableInputFormat],classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],classOf[org.apache.hadoop.hbase.client.Result])
val rdd1 = hBaseRDD.map(tuple => tuple._2).map(result => (result.getRow, 
result.getColumn("price_info".getBytes(), "ticker".getBytes()))).map(row => {
(
  row._1.map(_.toChar).mkString,
  row._2.asScala.reduceLeft {
(a, b) => if (a.getTimestamp > b.getTimestamp) a else b
  }.getValue.map(_.toChar).mkString
)
})
case class columns (key: String, ticker: String)
val dfticker = rdd1.toDF.map(p => columns(p(0).toString,p(1).toString))

Note that the end result is a DataFrame with the RowKey -> key and column -> 
ticker

I use the same approach to create two other DataFrames, namely dftimecreated 
and dfprice for the two other columns.

Note that if I don't need a column, then I do not create a DF for it. So a DF 
with each column I use. I am not sure how this compares if I read the full row 
through other methods if any.

Anyway all I need to do after creating a DataFrame for each column is to join 
them through RowKey to slice and dice data. Like below.

Get me the latest prices ordered by timecreated and ticker (ticker is stock)

val rs = 
dfticker.join(dftimecreated,"key").join(dfprice,"key").orderBy('timecreated 
desc, 'price desc).select('timecreated, 'ticker, 
'price.cast("Float").as("Latest price"))
rs.show(10)

+-------------------+------+------------+
|        timecreated|ticker|Latest price|
+-------------------+------+------------+
|2016-10-16T18:44:57|   S16|   97.631966|
|2016-10-16T18:44:57|   S13|    92.11406|
|2016-10-16T18:44:57|   S19|    85.93021|
|2016-10-16T18:44:57|   S09|   85.714645|
|2016-10-16T18:44:57|   S15|    82.38932|
|2016-10-16T18:44:57|   S17|    80.77747|
|2016-10-16T18:44:57|   S06|    79.81854|
|2016-10-16T18:44:57|   S18|    74.10128|
|2016-10-16T18:44:57|   S07|    66.13622|
|2016-10-16T18:44:57|   S20|    60.35727|
+-------------------+------+------------+
only showing top 10 rows

Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 17 October 2016 at 19:53, vincent gromakowski 
> wrote:
Instead of (or additionally to) saving results somewhere, you just start a 
thriftserver that expose the Spark tables of the SQLContext (or SparkSession 
now). That means you can implement any logic (and maybe use structured 
streaming) to expose your data. Today using the thriftserver means reading data 
from the persistent store every query, so if the data modeling doesn't fit the 
query it can be quite long.  What you generally do in a common spark job is to 
load the data and cache spark table in a in-memory columnar table which is 
quite efficient for any kind of query, the counterpart is that the cache isn't 
updated you have to implement a reload mechanism, and this solution isn't 
available using the thriftserver.
What I propose is to mix the two world: periodically/delta load data in spark 
table cache and expose it through the thriftserver. But you have to implement 
the loading logic, it can be very simple to very complex depending on your 
needs.


2016-10-17 19:48 GMT+02:00 Benjamin Kim 
>:
Is this technique similar to what Kinesis is offering or what Structured 
Streaming is going to have eventually?

Just curious.

Cheers,
Ben


On Oct 17, 2016, at 10:14 AM, vincent gromakowski 
> wrote:

I would suggest to code your own Spark thriftserver which seems to be very easy.
http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server

I am starting to test it. The big advantage is that you can implement any logic 
because it's a spark job and then start a thrift server on temporary table. For 
example you can query a micro batch rdd from a kafka stream, or pre load some 
tables and implement a rolling cache to periodically update the spark in 

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
You forgot to mention that if you roll your own… you can toss your own level of 
security on top of it.

For most, that’s not important.
For those working with PII type of information… kinda important, especially 
when the rules can get convoluted.


On Oct 17, 2016, at 12:14 PM, vincent gromakowski 
> wrote:

I would suggest to code your own Spark thriftserver which seems to be very easy.
http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server

I am starting to test it. The big advantage is that you can implement any logic 
because it's a spark job and then start a thrift server on temporary table. For 
example you can query a micro batch rdd from a kafka stream, or pre load some 
tables and implement a rolling cache to periodically update the spark in memory 
tables with persistent store...
It's not part of the public API and I don't know yet what are the issues doing 
this but I think Spark community should look at this path: making the 
thriftserver be instantiable in any spark job.

2016-10-17 18:17 GMT+02:00 Michael Segel 
>:
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, it’s probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, a 
Relational Engine… it’s not.

There are some considerations which make HBase not ideal for all use cases and 
you may find better performance with Parquet files.

One thing missing is the use of secondary indexing and query optimizations that 
you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your 
performance will vary.

With respect to Tableau… their entire interface into the big data world 
revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
part of your solution, you’re DOA with respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim 
> wrote:

Thanks for all the suggestions. It would seem you guys are right about the 
Tableau side of things. The reports don’t need to be real-time, and they won’t 
be directly feeding off of the main DMP HBase data. Instead, it’ll be batched 
to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase 
for ID generation, but after further investigation into the use-case and 
architecture, the ID generation needs to happen local to the Ad Servers where 
we generate a unique ID and store it in an ID linking table. Even better, many 
of the 3rd party services supply this ID. So, data only needs to flow in one 
direction. We will use Kafka as the bus for this. No JDBC required. This 
also goes for the REST Endpoints. 3rd party services will hit ours to update 
our data with no need to read from our data. And, when we want to update their 
data, we will hit theirs to update their data using a triggered job.

This all boils down to just integrating with Kafka.

Once again, thanks for all the help.

Cheers,
Ben


On Oct 9, 2016, at 3:16 AM, Jörn Franke 
> wrote:

please keep also in mind that Tableau Server has the capabilities to store data 
in-memory and refresh only when needed the in-memory data. This means you can 
import it from any source and let your users work only on the in-memory data in 
Tableau Server.

On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke 
> wrote:
Cloudera 5.8 has a very old version of Hive without Tez, but Mich provided 
already a good alternative. However, you should check if it contains a recent 
version of Hbase and Phoenix. That being said, I just wonder what is the 
dataflow, data model and the analysis you plan to do. Maybe there are 
completely different solutions possible. Especially these single inserts, 
upserts etc. should be avoided as much as possible in the Big Data (analysis) 
world with any technology, because they do not perform well.

Hive with Llap will provide an in-memory cache for interactive analytics. You 
can put full tables in-memory with Hive using Ignite HDFS in-memory solution. 
All this does only make sense if you do not use MR as an engine, the right 
input format (ORC, parquet) and a recent Hive version.

On 8 Oct 2016, at 21:55, Benjamin Kim 
> wrote:

Mich,

Unfortunately, we are moving away from Hive and unifying on Spark using CDH 5.8 
as our distro. And Tableau released a Spark ODBC/JDBC driver too. I will 
either try Phoenix JDBC Server for HBase or push to move faster to Kudu 

Re: Accessing Hbase tables through Spark, this seems to work

2016-10-17 Thread Michael Segel
Mitch,

Short answer… no, it doesn’t scale.

Longer answer…

You are using a UUID as the row key?  Why?  (My guess is that you want to 
avoid hot spotting)

So you’re going to have to pull in all of the data… meaning a full table scan… 
and then perform a sort order transformation, dropping the UUID in the process.

You would be better off not using HBase and instead storing the data in Parquet 
files in a directory partitioned on date. Alternatively, if you stay with HBase, 
make the rowkey max_ts - TS so that your data is stored in LIFO order.
Note: I’ve used the term epoch to describe the max value of a long (8 bytes of 
‘FF’ ) for the max_ts. This isn’t a good use of the term epoch, but if anyone 
has a better term, please let me know.
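
A minimal sketch of that Parquet route, assuming a SparkSession named spark and
a DataFrame called trades with the columns discussed in this thread (key,
ticker, timecreated, price); the datestamp column and the output path are made
up for illustration:

import org.apache.spark.sql.functions.{col, substring}

// Derive a partition column from the ISO timestamp string (first 10 chars = yyyy-MM-dd).
val withDate = trades.withColumn("datestamp", substring(col("timecreated"), 1, 10))

// One directory per day; a date filter then prunes partitions instead of scanning everything.
withDate.write
  .mode("append")
  .partitionBy("datestamp")
  .parquet("hdfs:///data/prices_parquet")

// Reading back with a date predicate only touches the matching partition(s).
val oneDay = spark.read.parquet("hdfs:///data/prices_parquet")
  .filter(col("datestamp") === "2016-10-16")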



Having said that… if you want to use HBase, you could do the same thing.  If 
you want to avoid hot spotting, you could load the day’s transactions using a 
bulk loader so that you don’t have to worry about splits.

But that’s just my $0.02 cents worth.

HTH

-Mike

PS. If you wanted to capture the transactions… you could do the following 
schema:

1) Rowkey = max_ts - TS
2) Rows contain the following:
CUSIP (Transaction ID)
Party 1 (Seller)
Party 2 (Buyer)
Symbol
Qty
Price

This is a trade ticket.
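
A hedged sketch of that rowkey scheme (Long.MaxValue minus the event timestamp,
so the newest trades sort first); the column family name "ticket" and the HBase
1.x client calls are assumptions for illustration:

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.util.Bytes

// Reverse-timestamp rowkey: newest rows sort first, giving a LIFO scan order.
def tradeRowKey(eventTimeMillis: Long): Array[Byte] =
  Bytes.toBytes(Long.MaxValue - eventTimeMillis)

// Illustrative trade-ticket Put using the fields listed above.
def tradePut(ts: Long, cusip: String, seller: String, buyer: String,
             symbol: String, qty: Long, price: Double): Put = {
  val p  = new Put(tradeRowKey(ts))
  val cf = Bytes.toBytes("ticket")
  p.addColumn(cf, Bytes.toBytes("cusip"),  Bytes.toBytes(cusip))
  p.addColumn(cf, Bytes.toBytes("seller"), Bytes.toBytes(seller))
  p.addColumn(cf, Bytes.toBytes("buyer"),  Bytes.toBytes(buyer))
  p.addColumn(cf, Bytes.toBytes("symbol"), Bytes.toBytes(symbol))
  p.addColumn(cf, Bytes.toBytes("qty"),    Bytes.toBytes(qty))
  p.addColumn(cf, Bytes.toBytes("price"),  Bytes.toBytes(price))
  p
}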



On Oct 16, 2016, at 1:37 PM, Mich Talebzadeh 
> wrote:

Hi,

I have trade data stored in Hbase table. Data arrives in csv format to HDFS and 
then loaded into Hbase via periodic load with 
org.apache.hadoop.hbase.mapreduce.ImportTsv.

The Hbase table has one Column family "trade_info" and three columns: ticker, 
timecreated, price.

The RowKey is UUID. So each row has UUID, ticker, timecreated and price in the 
csv file

Each row in Hbase is a key, value map. In my case, I have one Column Family and 
three columns. Without going into semantics I see Hbase as a column oriented 
database where column data stay together.

So I thought of this way of accessing the data.

I define an RDD for each column in the column family as below. In this case 
column trade_info:ticker

//create rdd
val hBaseRDD = sc.newAPIHadoopRDD(conf, 
classOf[TableInputFormat],classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],classOf[org.apache.hadoop.hbase.client.Result])
val rdd1 = hBaseRDD.map(tuple => tuple._2).map(result => (result.getRow, 
result.getColumn("price_info".getBytes(), "ticker".getBytes()))).map(row => {
(
  row._1.map(_.toChar).mkString,
  row._2.asScala.reduceLeft {
(a, b) => if (a.getTimestamp > b.getTimestamp) a else b
  }.getValue.map(_.toChar).mkString
)
})
case class columns (key: String, ticker: String)
val dfticker = rdd1.toDF.map(p => columns(p(0).toString,p(1).toString))

Note that the end result is a DataFrame with the RowKey -> key and column -> 
ticker

I use the same approach to create two other DataFrames, namely dftimecreated 
and dfprice for the two other columns.

Note that if I don't need a column, then I do not create a DF for it. So a DF 
with each column I use. I am not sure how this compares if I read the full row 
through other methods if any.

Anyway all I need to do after creating a DataFrame for each column is to join 
them through RowKey to slice and dice data. Like below.

Get me the latest prices ordered by timecreated and ticker (ticker is stock)

val rs = 
dfticker.join(dftimecreated,"key").join(dfprice,"key").orderBy('timecreated 
desc, 'price desc).select('timecreated, 'ticker, 
'price.cast("Float").as("Latest price"))
rs.show(10)

+-------------------+------+------------+
|        timecreated|ticker|Latest price|
+-------------------+------+------------+
|2016-10-16T18:44:57|   S16|   97.631966|
|2016-10-16T18:44:57|   S13|    92.11406|
|2016-10-16T18:44:57|   S19|    85.93021|
|2016-10-16T18:44:57|   S09|   85.714645|
|2016-10-16T18:44:57|   S15|    82.38932|
|2016-10-16T18:44:57|   S17|    80.77747|
|2016-10-16T18:44:57|   S06|    79.81854|
|2016-10-16T18:44:57|   S18|    74.10128|
|2016-10-16T18:44:57|   S07|    66.13622|
|2016-10-16T18:44:57|   S20|    60.35727|
+-------------------+------+------------+
only showing top 10 rows

Is this a workable solution?


Thanks


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.





Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Jörn Franke
It has some implications because it imposes the SQL model on HBase. Internally 
it translates the SQL queries into custom HBase processors. Keep in mind also 
that HBase needs a proper key design, and look at how Phoenix designs those keys 
to get the best performance out of it. I think for OLTP it is a workable model, 
and I think they plan to offer Phoenix as a default interface as part of HBase 
anyway.
For OLAP it depends. 


> On 17 Oct 2016, at 22:34, ayan guha  wrote:
> 
> Hi
> 
> Any reason not to recommend Phoenix? I haven't used it myself, so I'm curious 
> about the pros and cons of using it.
> 
>> On 18 Oct 2016 03:17, "Michael Segel"  wrote:
>> Guys, 
>> Sorry for jumping in late to the game… 
>> 
>> If memory serves (which may not be a good thing…) :
>> 
>> You can use HiveServer2 as a connection point to HBase.  
>> While this doesn’t perform well, its probably the cleanest solution. 
>> I’m not keen on Phoenix… wouldn’t recommend it…. 
>> 
>> 
>> The issue is that you’re trying to make HBase, a key/value object store, a 
>> Relational Engine… its not. 
>> 
>> There are some considerations which make HBase not ideal for all use cases 
>> and you may find better performance with Parquet files. 
>> 
>> One thing missing is the use of secondary indexing and query optimizations 
>> that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your 
>> performance will vary. 
>> 
>> With respect to Tableau… their entire interface in to the big data world 
>> revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
>> part of your solution, you’re DOA w respect to Tableau. 
>> 
>> Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
>> Apache project) 
>> 
>> 
>>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim  wrote:
>>> 
>>> Thanks for all the suggestions. It would seem you guys are right about the 
>>> Tableau side of things. The reports don’t need to be real-time, and they 
>>> won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be 
>>> batched to Parquet or Kudu/Impala or even PostgreSQL.
>>> 
>>> I originally thought that we needed two-way data retrieval from the DMP 
>>> HBase for ID generation, but after further investigation into the use-case 
>>> and architecture, the ID generation needs to happen local to the Ad Servers 
>>> where we generate a unique ID and store it in a ID linking table. Even 
>>> better, many of the 3rd party services supply this ID. So, data only needs 
>>> to flow in one direction. We will use Kafka as the bus for this. No JDBC 
>>> required. This is also goes for the REST Endpoints. 3rd party services will 
>>> hit ours to update our data with no need to read from our data. And, when 
>>> we want to update their data, we will hit theirs to update their data using 
>>> a triggered job.
>>> 
>>> This al boils down to just integrating with Kafka.
>>> 
>>> Once again, thanks for all the help.
>>> 
>>> Cheers,
>>> Ben
>>> 
>>> 
 On Oct 9, 2016, at 3:16 AM, Jörn Franke  wrote:
 
 please keep also in mind that Tableau Server has the capabilities to store 
 data in-memory and refresh only when needed the in-memory data. This means 
 you can import it from any source and let your users work only on the 
 in-memory data in Tableau Server.
 
> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke  wrote:
> Cloudera 5.8 has a very old version of Hive without Tez, but Mich 
> provided already a good alternative. However, you should check if it 
> contains a recent version of Hbase and Phoenix. That being said, I just 
> wonder what is the dataflow, data model and the analysis you plan to do. 
> Maybe there are completely different solutions possible. Especially these 
> single inserts, upserts etc. should be avoided as much as possible in the 
> Big Data (analysis) world with any technology, because they do not 
> perform well. 
> 
> Hive with Llap will provide an in-memory cache for interactive analytics. 
> You can put full tables in-memory with Hive using Ignite HDFS in-memory 
> solution. All this does only make sense if you do not use MR as an 
> engine, the right input format (ORC, parquet) and a recent Hive version.
> 
> On 8 Oct 2016, at 21:55, Benjamin Kim  wrote:
> 
>> Mich,
>> 
>> Unfortunately, we are moving away from Hive and unifying on Spark using 
>> CDH 5.8 as our distro. And, the Tableau released a Spark ODBC/JDBC 
>> driver too. I will either try Phoenix JDBC Server for HBase or push to 
>> move faster to Kudu with Impala. We will use Impala as the JDBC 
>> in-between until the Kudu team completes Spark SQL support for JDBC.
>> 
>> Thanks for the advice.
>> 
>> Cheers,
>> Ben
>> 
>> 
>>> On Oct 8, 2016, at 12:35 PM, Mich 

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Benjamin Kim
This will give me an opportunity to start using Structured Streaming. Then, I 
can try adding more functionality. If all goes well, then we could transition 
off of HBase to a more in-memory data solution that can “spill-over” data for 
us.

> On Oct 17, 2016, at 11:53 AM, vincent gromakowski 
>  wrote:
> 
> Instead of (or additionally to) saving results somewhere, you just start a 
> thriftserver that expose the Spark tables of the SQLContext (or SparkSession 
> now). That means you can implement any logic (and maybe use structured 
> streaming) to expose your data. Today using the thriftserver means reading 
> data from the persistent store every query, so if the data modeling doesn't 
> fit the query it can be quite long.  What you generally do in a common spark 
> job is to load the data and cache spark table in a in-memory columnar table 
> which is quite efficient for any kind of query, the counterpart is that the 
> cache isn't updated you have to implement a reload mechanism, and this 
> solution isn't available using the thriftserver.
> What I propose is to mix the two world: periodically/delta load data in spark 
> table cache and expose it through the thriftserver. But you have to implement 
> the loading logic, it can be very simple to very complex depending on your 
> needs.
> 
> 
> 2016-10-17 19:48 GMT+02:00 Benjamin Kim  >:
> Is this technique similar to what Kinesis is offering or what Structured 
> Streaming is going to have eventually?
> 
> Just curious.
> 
> Cheers,
> Ben
> 
>  
>> On Oct 17, 2016, at 10:14 AM, vincent gromakowski 
>> > wrote:
>> 
>> I would suggest to code your own Spark thriftserver which seems to be very 
>> easy.
>> http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server
>>  
>> 
>> 
>> I am starting to test it. The big advantage is that you can implement any 
>> logic because it's a spark job and then start a thrift server on temporary 
>> table. For example you can query a micro batch rdd from a kafka stream, or 
>> pre load some tables and implement a rolling cache to periodically update 
>> the spark in memory tables with persistent store...
>> It's not part of the public API and I don't know yet what are the issues 
>> doing this but I think Spark community should look at this path: making the 
>> thriftserver be instantiable in any spark job.
>> 
>> 2016-10-17 18:17 GMT+02:00 Michael Segel > >:
>> Guys, 
>> Sorry for jumping in late to the game… 
>> 
>> If memory serves (which may not be a good thing…) :
>> 
>> You can use HiveServer2 as a connection point to HBase.  
>> While this doesn’t perform well, its probably the cleanest solution. 
>> I’m not keen on Phoenix… wouldn’t recommend it…. 
>> 
>> 
>> The issue is that you’re trying to make HBase, a key/value object store, a 
>> Relational Engine… its not. 
>> 
>> There are some considerations which make HBase not ideal for all use cases 
>> and you may find better performance with Parquet files. 
>> 
>> One thing missing is the use of secondary indexing and query optimizations 
>> that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your 
>> performance will vary. 
>> 
>> With respect to Tableau… their entire interface in to the big data world 
>> revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
>> part of your solution, you’re DOA w respect to Tableau. 
>> 
>> Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
>> Apache project) 
>> 
>> 
>>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim >> > wrote:
>>> 
>>> Thanks for all the suggestions. It would seem you guys are right about the 
>>> Tableau side of things. The reports don’t need to be real-time, and they 
>>> won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be 
>>> batched to Parquet or Kudu/Impala or even PostgreSQL.
>>> 
>>> I originally thought that we needed two-way data retrieval from the DMP 
>>> HBase for ID generation, but after further investigation into the use-case 
>>> and architecture, the ID generation needs to happen local to the Ad Servers 
>>> where we generate a unique ID and store it in a ID linking table. Even 
>>> better, many of the 3rd party services supply this ID. So, data only needs 
>>> to flow in one direction. We will use Kafka as the bus for this. No JDBC 
>>> required. This is also goes for the REST Endpoints. 3rd party services will 
>>> hit ours to update our data with no need to read from our data. And, when 
>>> we want to update their data, we will hit theirs to update their data using 
>>> a triggered job.
>>> 
>>> This al 

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread ayan guha
Hi

Any reason not to recommend Phoenix? I haven't used it myself, so I'm curious
about the pros and cons of using it.
On 18 Oct 2016 03:17, "Michael Segel"  wrote:

> Guys,
> Sorry for jumping in late to the game…
>
> If memory serves (which may not be a good thing…) :
>
> You can use HiveServer2 as a connection point to HBase.
> While this doesn’t perform well, its probably the cleanest solution.
> I’m not keen on Phoenix… wouldn’t recommend it….
>
>
> The issue is that you’re trying to make HBase, a key/value object store, a
> Relational Engine… its not.
>
> There are some considerations which make HBase not ideal for all use cases
> and you may find better performance with Parquet files.
>
> One thing missing is the use of secondary indexing and query optimizations
> that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your
> performance will vary.
>
> With respect to Tableau… their entire interface in to the big data world
> revolves around the JDBC/ODBC interface. So if you don’t have that piece as
> part of your solution, you’re DOA w respect to Tableau.
>
> Have you considered Drill as your JDBC connection point?  (YAAP: Yet
> another Apache project)
>
>
> On Oct 9, 2016, at 12:23 PM, Benjamin Kim  wrote:
>
> Thanks for all the suggestions. It would seem you guys are right about the
> Tableau side of things. The reports don’t need to be real-time, and they
> won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be
> batched to Parquet or Kudu/Impala or even PostgreSQL.
>
> I originally thought that we needed two-way data retrieval from the DMP
> HBase for ID generation, but after further investigation into the use-case
> and architecture, the ID generation needs to happen local to the Ad Servers
> where we generate a unique ID and store it in an ID linking table. Even
> better, many of the 3rd party services supply this ID. So, data only needs
> to flow in one direction. We will use Kafka as the bus for this. No JDBC
> required. This also goes for the REST Endpoints. 3rd party services will
> hit ours to update our data with no need to read from our data. And, when
> we want to update their data, we will hit theirs to update their data using
> a triggered job.
>
> This all boils down to just integrating with Kafka.
>
> Once again, thanks for all the help.
>
> Cheers,
> Ben
>
>
> On Oct 9, 2016, at 3:16 AM, Jörn Franke  wrote:
>
> please keep also in mind that Tableau Server has the capabilities to store
> data in-memory and refresh only when needed the in-memory data. This means
> you can import it from any source and let your users work only on the
> in-memory data in Tableau Server.
>
> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke  wrote:
>
>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich
>> provided already a good alternative. However, you should check if it
>> contains a recent version of Hbase and Phoenix. That being said, I just
>> wonder what is the dataflow, data model and the analysis you plan to do.
>> Maybe there are completely different solutions possible. Especially these
>> single inserts, upserts etc. should be avoided as much as possible in the
>> Big Data (analysis) world with any technology, because they do not perform
>> well.
>>
>> Hive with Llap will provide an in-memory cache for interactive analytics.
>> You can put full tables in-memory with Hive using Ignite HDFS in-memory
>> solution. All this does only make sense if you do not use MR as an engine,
>> the right input format (ORC, parquet) and a recent Hive version.
>>
>> On 8 Oct 2016, at 21:55, Benjamin Kim  wrote:
>>
>> Mich,
>>
>> Unfortunately, we are moving away from Hive and unifying on Spark using
>> CDH 5.8 as our distro. And, the Tableau released a Spark ODBC/JDBC driver
>> too. I will either try Phoenix JDBC Server for HBase or push to move faster
>> to Kudu with Impala. We will use Impala as the JDBC in-between until the
>> Kudu team completes Spark SQL support for JDBC.
>>
>> Thanks for the advice.
>>
>> Cheers,
>> Ben
>>
>>
>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh 
>> wrote:
>>
>> Sure. But essentially you are looking at batch data for analytics for
>> your Tableau users, so Hive may be a better choice with its rich SQL and
>> ODBC/JDBC connection to Tableau already.
>>
>> I would go for Hive especially the new release will have an in-memory
>> offering as well for frequently accessed data :)
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise 

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Mich Talebzadeh
Ben,



*Also look at Phoenix (Apache project) which provides a better (one of the
best) SQL/JDBC layer on top of HBase.*

*http://phoenix.apache.org/ *


I am afraid this does not work with Spark 2!

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 17 October 2016 at 20:20, Thakrar, Jayesh 
wrote:

> Ben,
>
>
>
> Also look at Phoenix (Apache project) which provides a better (one of the
> best) SQL/JDBC layer on top of HBase.
>
> http://phoenix.apache.org/
>
>
>
> Cheers,
>
> Jayesh
>
>
>
>
>
> *From: *vincent gromakowski 
> *Date: *Monday, October 17, 2016 at 1:53 PM
> *To: *Benjamin Kim 
> *Cc: *Michael Segel , Jörn Franke <
> jornfra...@gmail.com>, Mich Talebzadeh , Felix
> Cheung , "user@spark.apache.org" <
> user@spark.apache.org>
>
> *Subject: *Re: Spark SQL Thriftserver with HBase
>
>
>
> Instead of (or additionally to) saving results somewhere, you just start a
> thriftserver that expose the Spark tables of the SQLContext (or
> SparkSession now). That means you can implement any logic (and maybe use
> structured streaming) to expose your data. Today using the thriftserver
> means reading data from the persistent store every query, so if the data
> modeling doesn't fit the query it can be quite long.  What you generally do
> in a common spark job is to load the data and cache spark table in a
> in-memory columnar table which is quite efficient for any kind of query,
> the counterpart is that the cache isn't updated you have to implement a
> reload mechanism, and this solution isn't available using the thriftserver.
>
> What I propose is to mix the two world: periodically/delta load data in
> spark table cache and expose it through the thriftserver. But you have to
> implement the loading logic, it can be very simple to very complex
> depending on your needs.
>
>
>
>
>
> 2016-10-17 19:48 GMT+02:00 Benjamin Kim :
>
> Is this technique similar to what Kinesis is offering or what Structured
> Streaming is going to have eventually?
>
>
>
> Just curious.
>
>
>
> Cheers,
>
> Ben
>
>
>
>
>
> On Oct 17, 2016, at 10:14 AM, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>
>
> I would suggest to code your own Spark thriftserver which seems to be very
> easy.
> http://stackoverflow.com/questions/27108863/accessing-
> spark-sql-rdd-tables-through-the-thrift-server
>
> I am starting to test it. The big advantage is that you can implement any
> logic because it's a spark job and then start a thrift server on temporary
> table. For example you can query a micro batch rdd from a kafka stream, or
> pre load some tables and implement a rolling cache to periodically update
> the spark in memory tables with persistent store...
>
> It's not part of the public API and I don't know yet what are the issues
> doing this but I think Spark community should look at this path: making the
> thriftserver be instantiable in any spark job.
>
>
>
> 2016-10-17 18:17 GMT+02:00 Michael Segel :
>
> Guys,
>
> Sorry for jumping in late to the game…
>
>
>
> If memory serves (which may not be a good thing…) :
>
>
>
> You can use HiveServer2 as a connection point to HBase.
>
> While this doesn’t perform well, its probably the cleanest solution.
>
> I’m not keen on Phoenix… wouldn’t recommend it….
>
>
>
>
>
> The issue is that you’re trying to make HBase, a key/value object store, a
> Relational Engine… its not.
>
>
>
> There are some considerations which make HBase not ideal for all use cases
> and you may find better performance with Parquet files.
>
>
>
> One thing missing is the use of secondary indexing and query optimizations
> that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your
> performance will vary.
>
>
>
> With respect to Tableau… their entire interface in to the big data world
> revolves around the JDBC/ODBC interface. So if you don’t have that piece as
> part of your solution, you’re DOA w respect to Tableau.
>
>
>
> Have you considered Drill as your JDBC connection point?  (YAAP: Yet
> another Apache project)
>
>
>
>
>
> On Oct 9, 2016, at 12:23 PM, Benjamin Kim  wrote:
>
>
>
> Thanks for all the suggestions. It would seem you guys are right about the
> Tableau side of things. The reports don’t need to be real-time, and 

Re: Indexing w spark joins?

2016-10-17 Thread Mich Talebzadeh
Hi Michael,

Just to clarify: are you referring to inverted indexes here?

Predicate push down is supported by Hive ORC tables that Spark can operate
on.
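
As a small illustration (Spark 2.x; the table and column names are made up),
enabling ORC filter pushdown lets the where-clause predicate be evaluated inside
the ORC reader, so stripes that cannot match are skipped:

import org.apache.spark.sql.functions.col

spark.conf.set("spark.sql.orc.filterPushdown", "true")  // off by default

val volvoFrontEnd = spark.table("accidents")            // illustrative Hive ORC table
  .filter(col("manufacturer") === "Volvo" && col("model") === "S80")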

With regard to your point

"Break down the number and types of accidents by car manufacturer , model
and color"

How about using some analytic and windowing functions here? Spark supports
all sorts of analytic functions.
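
For example (column names are made up to match the query above), a
groupBy/count plus a window function ranking colours within each manufacturer:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

// accidents is an illustrative DataFrame with manufacturer, model and colour columns.
val byGroup = accidents.groupBy(col("manufacturer"), col("model"), col("colour")).count()

// Rank colours within each manufacturer by accident count.
val w = Window.partitionBy(col("manufacturer")).orderBy(col("count").desc)
val ranked = byGroup.withColumn("colour_rank", rank().over(w))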

HTH




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 17 October 2016 at 17:49, Michael Segel 
wrote:

> Hi,
>
> Apologies if I’ve asked this question before but I didn’t see it in the
> list and I’m certain that my last surviving brain cell has gone on strike
> over my attempt to reduce my caffeine intake…
>
> Posting this to both user and dev because I think the question / topic
> jumps in to both camps.
>
>
> Again since I’m a relative newbie on spark… I may be missing something so
> apologies up front…
>
>
> With respect to Spark SQL,  in pre 2.0.x,  there were only hash joins?  In
> post 2.0.x you have hash, semi-hash , and sorted list merge.
>
> For the sake of simplicity… lets forget about cross product joins…
>
> Has anyone looked at how we could use inverted tables to improve query
> performance?
>
> The issue is that when you have a data sewer (lake), what happens when
> your use case query is orthogonal to how your data is stored? This means
> full table scans.
> By using secondary indexes, we can reduce this albeit at a cost of
> increasing your storage footprint by the size of the index.
>
> Are there any JIRAs open that discuss this?
>
> Indexes to assist in terms of ‘predicate push downs’ (using the index when
> a field in a where clause is indexed) rather than performing a full table
> scan.
> Indexes to assist in the actual join if the join column is on an indexed
> column?
>
> In the first, using an inverted table to produce a sort ordered set of row
> keys that you would then use in the join process (same as if you produced
> the subset based on the filter.)
>
> To put this in perspective… here’s a dummy use case…
>
>
> CCCis (CCC) is the middle man in the insurance industry. They have a piece
> of software that sits in the repair shop (e.g Joe’s Auto Body) and works
> with multiple insurance carriers.
> The primary key in their data is going to be Insurance Company | Claim
> ID.  This makes it very easy to find a specific claim for further
> processing.
>
> Now lets say I want to do some analysis on determining the average cost of
> repairing a front end collision of a Volvo S80?
> Or
> Break down the number and types of accidents by car manufacturer , model
> and color.  (Then see if there is any correlation between car color and #
> and type of accidents)
>
>
> As you can see, all of these queries are orthogonal to my storage.  So I
> need to create secondary indexes to help sift thru the data efficiently.
>
> Does this make sense?
>
> Please Note: I did some work for CCC back in the late 90’s. Any
> resemblance to their big data efforts is purely coincidence  and you can
> replace CCC with Allstate, Progressive, StateFarm or some other auto
> insurance company …
>
> Thx
>
> -Mike
>
>
>


Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Thakrar, Jayesh
Ben,

Also look at Phoenix (Apache project) which provides a better (one of the best) 
SQL/JDBC layer on top of HBase.
http://phoenix.apache.org/

Cheers,
Jayesh


From: vincent gromakowski 
Date: Monday, October 17, 2016 at 1:53 PM
To: Benjamin Kim 
Cc: Michael Segel , Jörn Franke 
, Mich Talebzadeh , Felix 
Cheung , "user@spark.apache.org" 

Subject: Re: Spark SQL Thriftserver with HBase

Instead of (or additionally to) saving results somewhere, you just start a 
thriftserver that expose the Spark tables of the SQLContext (or SparkSession 
now). That means you can implement any logic (and maybe use structured 
streaming) to expose your data. Today using the thriftserver means reading data 
from the persistent store every query, so if the data modeling doesn't fit the 
query it can be quite long.  What you generally do in a common spark job is to 
load the data and cache spark table in a in-memory columnar table which is 
quite efficient for any kind of query, the counterpart is that the cache isn't 
updated you have to implement a reload mechanism, and this solution isn't 
available using the thriftserver.
What I propose is to mix the two world: periodically/delta load data in spark 
table cache and expose it through the thriftserver. But you have to implement 
the loading logic, it can be very simple to very complex depending on your 
needs.


2016-10-17 19:48 GMT+02:00 Benjamin Kim 
>:
Is this technique similar to what Kinesis is offering or what Structured 
Streaming is going to have eventually?

Just curious.

Cheers,
Ben


On Oct 17, 2016, at 10:14 AM, vincent gromakowski 
> wrote:

I would suggest to code your own Spark thriftserver which seems to be very easy.
http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server

I am starting to test it. The big advantage is that you can implement any logic 
because it's a spark job and then start a thrift server on temporary table. For 
example you can query a micro batch rdd from a kafka stream, or pre load some 
tables and implement a rolling cache to periodically update the spark in memory 
tables with persistent store...
It's not part of the public API and I don't know yet what are the issues doing 
this but I think Spark community should look at this path: making the 
thriftserver be instantiable in any spark job.

2016-10-17 18:17 GMT+02:00 Michael Segel 
>:
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, it’s probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, a 
Relational Engine… it’s not.

There are some considerations which make HBase not ideal for all use cases and 
you may find better performance with Parquet files.

One thing missing is the use of secondary indexing and query optimizations that 
you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your 
performance will vary.

With respect to Tableau… their entire interface into the big data world 
revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
part of your solution, you’re DOA with respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim 
> wrote:

Thanks for all the suggestions. It would seem you guys are right about the 
Tableau side of things. The reports don’t need to be real-time, and they won’t 
be directly feeding off of the main DMP HBase data. Instead, it’ll be batched 
to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase 
for ID generation, but after further investigation into the use-case and 
architecture, the ID generation needs to happen local to the Ad Servers where 
we generate a unique ID and store it in a ID linking table. Even better, many 
of the 3rd party services supply this ID. So, data only needs to flow in one 
direction. We will use Kafka as the bus for this. No JDBC required. This is 
also goes for the REST Endpoints. 3rd party services will hit ours to update 
our data with no need to read from our data. And, when we want to update their 
data, we will hit theirs to update their data using a triggered job.

This al boils down to just integrating with Kafka.

Once again, thanks for all the help.

Cheers,
Ben


On Oct 9, 2016, at 3:16 AM, Jörn Franke 

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Mich Talebzadeh
How about this method of creating Data Frames on Hbase tables directly.

I define an RDD for each column in the column family as below. In this case
column trade_info:ticker

//create rdd
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
val rdd1 = hBaseRDD.map(tuple => tuple._2).map(result => (result.getRow,
result.getColumn("price_info".getBytes(), "ticker".getBytes()))).map(row =>
{
(
  row._1.map(_.toChar).mkString,
  row._2.asScala.reduceLeft {
(a, b) => if (a.getTimestamp > b.getTimestamp) a else b
  }.getValue.map(_.toChar).mkString
)
})
case class columns (key: String, ticker: String)
val dfticker = rdd1.toDF.map(p => columns(p(0).toString,p(1).toString))

Note that the end result is a DataFrame with the RowKey -> key and column
-> ticker

I use the same approach to create two other DataFrames, namely dftimecreated
and dfprice for the two other columns.

Note that if I don't need a column, then I do not create a DF for it. So a
DF with each column I use. I am not sure how this compares if I read the
full row through other methods if any.

Anyway all I need to do after creating a DataFrame for each column is to
join them through RowKey to slice and dice data. Like below.

Get me the latest prices ordered by timecreated and ticker (ticker is stock)

val rs = 
dfticker.join(dftimecreated,"key").join(dfprice,"key").orderBy('timecreated
desc, 'price desc).select('timecreated, 'ticker,
'price.cast("Float").as("Latest
price"))
rs.show(10)

+-------------------+------+------------+
|        timecreated|ticker|Latest price|
+-------------------+------+------------+
|2016-10-16T18:44:57|   S16|   97.631966|
|2016-10-16T18:44:57|   S13|    92.11406|
|2016-10-16T18:44:57|   S19|    85.93021|
|2016-10-16T18:44:57|   S09|   85.714645|
|2016-10-16T18:44:57|   S15|    82.38932|
|2016-10-16T18:44:57|   S17|    80.77747|
|2016-10-16T18:44:57|   S06|    79.81854|
|2016-10-16T18:44:57|   S18|    74.10128|
|2016-10-16T18:44:57|   S07|    66.13622|
|2016-10-16T18:44:57|   S20|    60.35727|
+-------------------+------+------------+
only showing top 10 rows

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 17 October 2016 at 19:53, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> Instead of (or additionally to) saving results somewhere, you just start a
> thriftserver that expose the Spark tables of the SQLContext (or
> SparkSession now). That means you can implement any logic (and maybe use
> structured streaming) to expose your data. Today using the thriftserver
> means reading data from the persistent store every query, so if the data
> modeling doesn't fit the query it can be quite long.  What you generally do
> in a common spark job is to load the data and cache spark table in a
> in-memory columnar table which is quite efficient for any kind of query,
> the counterpart is that the cache isn't updated you have to implement a
> reload mechanism, and this solution isn't available using the thriftserver.
> What I propose is to mix the two world: periodically/delta load data in
> spark table cache and expose it through the thriftserver. But you have to
> implement the loading logic, it can be very simple to very complex
> depending on your needs.
>
>
> 2016-10-17 19:48 GMT+02:00 Benjamin Kim :
>
>> Is this technique similar to what Kinesis is offering or what Structured
>> Streaming is going to have eventually?
>>
>> Just curious.
>>
>> Cheers,
>> Ben
>>
>>
>>
>> On Oct 17, 2016, at 10:14 AM, vincent gromakowski <
>> vincent.gromakow...@gmail.com> wrote:
>>
>> I would suggest to code your own Spark thriftserver which seems to be
>> very easy.
>> http://stackoverflow.com/questions/27108863/accessing-spark-
>> sql-rdd-tables-through-the-thrift-server
>>
>> I am starting to test it. The big advantage is that you can implement any
>> logic because it's a spark job and then start a thrift server on temporary
>> table. For example you can query a micro batch rdd from a kafka stream, or
>> pre load some tables and implement a rolling cache to periodically update
>> the spark in memory tables with persistent store...
>> It's not part of the public API and I don't know yet what are the issues
>> doing this but I think Spark community should look at this path: making the
>> thriftserver be instantiable in any spark job.
>>
>> 2016-10-17 18:17 GMT+02:00 

Re: Aggregate UDF (UDAF) in Python

2016-10-17 Thread Tobi Bosede
Thanks Assaf. Yes please provide an example of how to wrap code for python.
I am leaning towards scala.
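
Since Scala seems to be the direction anyway, here is a hedged skeleton of a
Spark UDAF (the UserDefinedAggregateFunction API). It computes a simple mean
rather than an IQR, which would need the values or an approximate-quantile
sketch kept in the buffer, but the wrapping shape is the same; registering it by
name makes it callable from SQL, and hence from PySpark via spark.sql:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Skeleton UDAF: a simple mean. Keeps (sum, count) in the aggregation buffer.
class MeanUDAF extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType =
    StructType(StructField("sum", DoubleType) :: StructField("count", LongType) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0.0
    buffer(1) = 0L
  }
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getDouble(0) + input.getDouble(0)
      buffer(1) = buffer.getLong(1) + 1L
    }
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  def evaluate(buffer: Row): Any =
    if (buffer.getLong(1) == 0L) null else buffer.getDouble(0) / buffer.getLong(1)
}

// Register so it can be used from SQL (and therefore from PySpark):
// spark.udf.register("my_mean", new MeanUDAF)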

On Mon, Oct 17, 2016 at 1:50 PM, Mendelson, Assaf 
wrote:

> A possible (bad) workaround would be to use the collect_list function.
> This will give you all the values in an array (list) and you can then
> create a UDF to do the aggregation yourself. This would be very slow and
> cost a lot of memory but it would work if your cluster can handle it.
>
> This is the only workaround I can think of, otherwise you  will need to
> write the UDAF in java/scala and wrap it for python use. If you need an
> example on how to do so I can provide one.
>
> Assaf.
>
>
>
> *From:* Tobi Bosede [mailto:ani.to...@gmail.com]
> *Sent:* Sunday, October 16, 2016 7:49 PM
> *To:* Holden Karau
> *Cc:* user
> *Subject:* Re: Aggregate UDF (UDAF) in Python
>
>
>
> OK, I misread the year on the dev list. Can you comment on work arounds?
> (I.e. question about if scala/java are the only option.)
>
>
>
> On Sun, Oct 16, 2016 at 12:09 PM, Holden Karau 
> wrote:
>
> The comment on the developer list is from earlier this week. I'm not sure
> why UDAF support hasn't made the hop to Python - while I work a fair amount
> on PySpark it's mostly in core & ML and not a lot with SQL so there could
> be good reasons I'm just not familiar with. We can try pinging Davies or
> Michael on the JIRA to see what their thoughts are.
>
>
> On Sunday, October 16, 2016, Tobi Bosede  wrote:
>
> Thanks for the info Holden.
>
>
>
> So it seems both the jira and the comment on the developer list are over a
> year old. More surprising, the jira has no assignee. Any particular reason
> for the lack of activity in this area?
>
>
>
> Is writing scala/java the only work around for this? I hear a lot of
> people say python is the gateway language to scala. It is because of issues
> like this that people use scala for Spark rather than python or eventually
> abandon python for scala. It just takes too long for features to get ported
> over from scala/java.
>
>
>
>
>
> On Sun, Oct 16, 2016 at 8:42 AM, Holden Karau 
> wrote:
>
> I don't believe UDAFs are available in PySpark as this came up on the
> developer list while I was asking for what features people were missing in
> PySpark - see http://apache-spark-developers-list.1001551.n3.
> nabble.com/Python-Spark-Improvements-forked-from-
> Spark-Improvement-Proposals-td19422.html . The JIRA for tracking this
> issue is at https://issues.apache.org/jira/browse/SPARK-10915
>
>
>
> On Sat, Oct 15, 2016 at 7:20 PM, Tobi Bosede  wrote:
>
> Hello,
>
>
>
> I am trying to use a UDF that calculates inter-quartile (IQR) range for
> pivot() and SQL in pyspark and got the error that my function wasn't an
> aggregate function in both scenarios. Does anyone know if UDAF
> functionality is available in python? If not, what can I do as a work
> around?
>
>
>
> Thanks,
>
> Tobi
>
>
>
>
>
> --
>
> Cell : 425-233-8271
>
> Twitter: https://twitter.com/holdenkarau
>
>
>
>
>
> --
>
> Cell : 425-233-8271
>
> Twitter: https://twitter.com/holdenkarau
>
>
>
>
>


Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread vincent gromakowski
Instead of (or additionally to) saving results somewhere, you just start a
thriftserver that expose the Spark tables of the SQLContext (or
SparkSession now). That means you can implement any logic (and maybe use
structured streaming) to expose your data. Today using the thriftserver
means reading data from the persistent store every query, so if the data
modeling doesn't fit the query it can be quite long.  What you generally do
in a common spark job is to load the data and cache spark table in a
in-memory columnar table which is quite efficient for any kind of query,
the counterpart is that the cache isn't updated you have to implement a
reload mechanism, and this solution isn't available using the thriftserver.
What I propose is to mix the two world: periodically/delta load data in
spark table cache and expose it through the thriftserver. But you have to
implement the loading logic, it can be very simple to very complex
depending on your needs.
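
For what it's worth, a hedged sketch of that mix (paths, table names and the
crude refresh loop below are placeholders, not a real implementation; it needs
the spark-hive-thriftserver module on the classpath):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object CachedThriftServer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cached-thriftserver")
      .enableHiveSupport()
      .getOrCreate()

    // (Re)load the data and cache it as an in-memory columnar table.
    def reload(): Unit = {
      val df = spark.read.parquet("/data/prices")     // illustrative source
      df.createOrReplaceTempView("prices")
      spark.catalog.cacheTable("prices")
    }

    reload()
    // Expose the cached table over JDBC from inside this job.
    HiveThriftServer2.startWithContext(spark.sqlContext)

    // Very naive periodic refresh; real code would use a proper scheduler
    // or a streaming source.
    while (true) {
      Thread.sleep(15 * 60 * 1000)
      spark.catalog.uncacheTable("prices")
      reload()
    }
  }
}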


2016-10-17 19:48 GMT+02:00 Benjamin Kim :

> Is this technique similar to what Kinesis is offering or what Structured
> Streaming is going to have eventually?
>
> Just curious.
>
> Cheers,
> Ben
>
>
>
> On Oct 17, 2016, at 10:14 AM, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
> I would suggest to code your own Spark thriftserver which seems to be very
> easy.
> http://stackoverflow.com/questions/27108863/accessing-
> spark-sql-rdd-tables-through-the-thrift-server
>
> I am starting to test it. The big advantage is that you can implement any
> logic because it's a spark job and then start a thrift server on temporary
> table. For example you can query a micro batch rdd from a kafka stream, or
> pre load some tables and implement a rolling cache to periodically update
> the spark in memory tables with persistent store...
> It's not part of the public API and I don't know yet what are the issues
> doing this but I think Spark community should look at this path: making the
> thriftserver be instantiable in any spark job.
>
> 2016-10-17 18:17 GMT+02:00 Michael Segel :
>
>> Guys,
>> Sorry for jumping in late to the game…
>>
>> If memory serves (which may not be a good thing…) :
>>
>> You can use HiveServer2 as a connection point to HBase.
>> While this doesn’t perform well, its probably the cleanest solution.
>> I’m not keen on Phoenix… wouldn’t recommend it….
>>
>>
>> The issue is that you’re trying to make HBase, a key/value object store,
>> a Relational Engine… its not.
>>
>> There are some considerations which make HBase not ideal for all use
>> cases and you may find better performance with Parquet files.
>>
>> One thing missing is the use of secondary indexing and query
>> optimizations that you have in RDBMSs and are lacking in HBase / MapRDB /
>> etc …  so your performance will vary.
>>
>> With respect to Tableau… their entire interface in to the big data world
>> revolves around the JDBC/ODBC interface. So if you don’t have that piece as
>> part of your solution, you’re DOA w respect to Tableau.
>>
>> Have you considered Drill as your JDBC connection point?  (YAAP: Yet
>> another Apache project)
>>
>>
>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim  wrote:
>>
>> Thanks for all the suggestions. It would seem you guys are right about
>> the Tableau side of things. The reports don’t need to be real-time, and
>> they won’t be directly feeding off of the main DMP HBase data. Instead,
>> it’ll be batched to Parquet or Kudu/Impala or even PostgreSQL.
>>
>> I originally thought that we needed two-way data retrieval from the DMP
>> HBase for ID generation, but after further investigation into the use-case
>> and architecture, the ID generation needs to happen local to the Ad Servers
>> where we generate a unique ID and store it in a ID linking table. Even
>> better, many of the 3rd party services supply this ID. So, data only needs
>> to flow in one direction. We will use Kafka as the bus for this. No JDBC
>> required. This is also goes for the REST Endpoints. 3rd party services will
>> hit ours to update our data with no need to read from our data. And, when
>> we want to update their data, we will hit theirs to update their data using
>> a triggered job.
>>
>> This all boils down to just integrating with Kafka.
>>
>> Once again, thanks for all the help.
>>
>> Cheers,
>> Ben
>>
>>
>> On Oct 9, 2016, at 3:16 AM, Jörn Franke  wrote:
>>
>> please keep also in mind that Tableau Server has the capabilities to
>> store data in-memory and refresh only when needed the in-memory data. This
>> means you can import it from any source and let your users work only on the
>> in-memory data in Tableau Server.
>>
>> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke  wrote:
>>
>>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich
>>> provided already a good alternative. However, you should check if it
>>> contains a recent version of Hbase and 

K-Mean retrieving Cluster Members

2016-10-17 Thread Reth RM
Could you please point me to sample code to retrieve the cluster members of
K mean?

The code below prints the cluster centers. I need the cluster members belonging
to each center.



val clusters = KMeans.train(parsedData, numClusters, numIterations)
clusters.clusterCenters.foreach(println)
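
Building on that snippet, one way to get the members rather than the centers is
to assign each point with predict() and group by the resulting cluster id (a
sketch, reusing parsedData and clusters from above):

val membersByCluster = parsedData
  .map(point => (clusters.predict(point), point))   // (clusterId, point)
  .groupByKey()

membersByCluster.collect().foreach { case (clusterId, points) =>
  println(s"cluster $clusterId -> ${points.size} members")
}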


RE: Aggregate UDF (UDAF) in Python

2016-10-17 Thread Mendelson, Assaf
A possible (bad) workaround would be to use the collect_list function. This 
will give you all the values in an array (list) and you can then create a UDF 
to do the aggregation yourself. This would be very slow and cost a lot of 
memory but it would work if your cluster can handle it.
This is the only workaround I can think of, otherwise you  will need to write 
the UDAF in java/scala and wrap it for python use. If you need an example on 
how to do so I can provide one.
Assaf.
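
To make the collect_list workaround concrete, a small sketch in Scala (the same
shape works from PySpark; data, names and the naive IQR are made up for
illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, udf}

val spark = SparkSession.builder()
  .appName("collect-list-workaround").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1.0), ("a", 2.0), ("a", 10.0), ("b", 4.0), ("b", 8.0))
  .toDF("key", "value")

// Aggregate over the collected array; every value of a key must fit in memory.
val iqr = udf { (xs: Seq[Double]) =>
  val s = xs.sorted
  def q(p: Double): Double = s((p * (s.size - 1)).round.toInt)
  q(0.75) - q(0.25)
}

val result = df
  .groupBy("key")
  .agg(collect_list(col("value")).as("values"))
  .withColumn("iqr", iqr(col("values")))

result.show()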

From: Tobi Bosede [mailto:ani.to...@gmail.com]
Sent: Sunday, October 16, 2016 7:49 PM
To: Holden Karau
Cc: user
Subject: Re: Aggregate UDF (UDAF) in Python

OK, I misread the year on the dev list. Can you comment on work arounds? (I.e. 
question about if scala/java are the only option.)

On Sun, Oct 16, 2016 at 12:09 PM, Holden Karau 
> wrote:
The comment on the developer list is from earlier this week. I'm not sure why 
UDAF support hasn't made the hop to Python - while I work a fair amount on 
PySpark it's mostly in core & ML and not a lot with SQL so there could be good 
reasons I'm just not familiar with. We can try pinging Davies or Michael on the 
JIRA to see what their thoughts are.

On Sunday, October 16, 2016, Tobi Bosede 
> wrote:
Thanks for the info Holden.

So it seems both the jira and the comment on the developer list are over a year 
old. More surprising, the jira has no assignee. Any particular reason for the 
lack of activity in this area?

Is writing scala/java the only work around for this? I hear a lot of people say 
python is the gateway language to scala. It is because of issues like this that 
people use scala for Spark rather than python or eventually abandon python for 
scala. It just takes too long for features to get ported over from scala/java.


On Sun, Oct 16, 2016 at 8:42 AM, Holden Karau 
> wrote:
I don't believe UDAFs are available in PySpark as this came up on the developer 
list while I was asking for what features people were missing in PySpark - see 
http://apache-spark-developers-list.1001551.n3.nabble.com/Python-Spark-Improvements-forked-from-Spark-Improvement-Proposals-td19422.html
 . The JIRA for tracking this issue is at 
https://issues.apache.org/jira/browse/SPARK-10915

On Sat, Oct 15, 2016 at 7:20 PM, Tobi Bosede 
> wrote:
Hello,

I am trying to use a UDF that calculates inter-quartile (IQR) range for pivot() 
and SQL in pyspark and got the error that my function wasn't an aggregate 
function in both scenarios. Does anyone know if UDAF functionality is available 
in python? If not, what can I do as a work around?

Thanks,
Tobi



--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau



--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau




Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Benjamin Kim
Is this technique similar to what Kinesis is offering or what Structured 
Streaming is going to have eventually?

Just curious.

Cheers,
Ben

 
> On Oct 17, 2016, at 10:14 AM, vincent gromakowski 
>  wrote:
> 
> I would suggest to code your own Spark thriftserver which seems to be very 
> easy.
> http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server
>  
> 
> 
> I am starting to test it. The big advantage is that you can implement any 
> logic because it's a spark job and then start a thrift server on temporary 
> table. For example you can query a micro batch rdd from a kafka stream, or 
> pre load some tables and implement a rolling cache to periodically update the 
> spark in memory tables with persistent store...
> It's not part of the public API and I don't know yet what are the issues 
> doing this but I think Spark community should look at this path: making the 
> thriftserver be instantiable in any spark job.
> 
> 2016-10-17 18:17 GMT+02:00 Michael Segel  >:
> Guys, 
> Sorry for jumping in late to the game… 
> 
> If memory serves (which may not be a good thing…) :
> 
> You can use HiveServer2 as a connection point to HBase.  
> While this doesn’t perform well, its probably the cleanest solution. 
> I’m not keen on Phoenix… wouldn’t recommend it…. 
> 
> 
> The issue is that you’re trying to make HBase, a key/value object store, a 
> Relational Engine… its not. 
> 
> There are some considerations which make HBase not ideal for all use cases 
> and you may find better performance with Parquet files. 
> 
> One thing missing is the use of secondary indexing and query optimizations 
> that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your 
> performance will vary. 
> 
> With respect to Tableau… their entire interface in to the big data world 
> revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
> part of your solution, you’re DOA w respect to Tableau. 
> 
> Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
> Apache project) 
> 
> 
>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim > > wrote:
>> 
>> Thanks for all the suggestions. It would seem you guys are right about the 
>> Tableau side of things. The reports don’t need to be real-time, and they 
>> won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be 
>> batched to Parquet or Kudu/Impala or even PostgreSQL.
>> 
>> I originally thought that we needed two-way data retrieval from the DMP 
>> HBase for ID generation, but after further investigation into the use-case 
>> and architecture, the ID generation needs to happen local to the Ad Servers 
>> where we generate a unique ID and store it in a ID linking table. Even 
>> better, many of the 3rd party services supply this ID. So, data only needs 
>> to flow in one direction. We will use Kafka as the bus for this. No JDBC 
>> required. This is also goes for the REST Endpoints. 3rd party services will 
>> hit ours to update our data with no need to read from our data. And, when we 
>> want to update their data, we will hit theirs to update their data using a 
>> triggered job.
>> 
>> This all boils down to just integrating with Kafka.
>> 
>> Once again, thanks for all the help.
>> 
>> Cheers,
>> Ben
>> 
>> 
>>> On Oct 9, 2016, at 3:16 AM, Jörn Franke >> > wrote:
>>> 
>>> please keep also in mind that Tableau Server has the capabilities to store 
>>> data in-memory and refresh only when needed the in-memory data. This means 
>>> you can import it from any source and let your users work only on the 
>>> in-memory data in Tableau Server.
>>> 
>>> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke >> > wrote:
>>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich provided 
>>> already a good alternative. However, you should check if it contains a 
>>> recent version of Hbase and Phoenix. That being said, I just wonder what is 
>>> the dataflow, data model and the analysis you plan to do. Maybe there are 
>>> completely different solutions possible. Especially these single inserts, 
>>> upserts etc. should be avoided as much as possible in the Big Data 
>>> (analysis) world with any technology, because they do not perform well. 
>>> 
>>> Hive with Llap will provide an in-memory cache for interactive analytics. 
>>> You can put full tables in-memory with Hive using Ignite HDFS in-memory 
>>> solution. All this does only make sense if you do not use MR as an engine, 
>>> the right input format (ORC, parquet) and a recent Hive version.
>>> 
>>> On 8 Oct 2016, at 21:55, Benjamin Kim >> 

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread vincent gromakowski
I would suggest to code your own Spark thriftserver which seems to be very
easy.
http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server

I am starting to test it. The big advantage is that you can implement any
logic because it's a spark job and then start a thrift server on temporary
table. For example you can query a micro batch rdd from a kafka stream, or
pre load some tables and implement a rolling cache to periodically update
the spark in memory tables with persistent store...
It's not part of the public API and I don't know yet what are the issues
doing this but I think Spark community should look at this path: making the
thriftserver be instantiable in any spark job.

2016-10-17 18:17 GMT+02:00 Michael Segel :

> Guys,
> Sorry for jumping in late to the game…
>
> If memory serves (which may not be a good thing…) :
>
> You can use HiveServer2 as a connection point to HBase.
> While this doesn’t perform well, its probably the cleanest solution.
> I’m not keen on Phoenix… wouldn’t recommend it….
>
>
> The issue is that you’re trying to make HBase, a key/value object store, a
> Relational Engine… its not.
>
> There are some considerations which make HBase not ideal for all use cases
> and you may find better performance with Parquet files.
>
> One thing missing is the use of secondary indexing and query optimizations
> that you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your
> performance will vary.
>
> With respect to Tableau… their entire interface in to the big data world
> revolves around the JDBC/ODBC interface. So if you don’t have that piece as
> part of your solution, you’re DOA w respect to Tableau.
>
> Have you considered Drill as your JDBC connection point?  (YAAP: Yet
> another Apache project)
>
>
> On Oct 9, 2016, at 12:23 PM, Benjamin Kim  wrote:
>
> Thanks for all the suggestions. It would seem you guys are right about the
> Tableau side of things. The reports don’t need to be real-time, and they
> won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be
> batched to Parquet or Kudu/Impala or even PostgreSQL.
>
> I originally thought that we needed two-way data retrieval from the DMP
> HBase for ID generation, but after further investigation into the use-case
> and architecture, the ID generation needs to happen local to the Ad Servers
> where we generate a unique ID and store it in a ID linking table. Even
> better, many of the 3rd party services supply this ID. So, data only needs
> to flow in one direction. We will use Kafka as the bus for this. No JDBC
> required. This is also goes for the REST Endpoints. 3rd party services will
> hit ours to update our data with no need to read from our data. And, when
> we want to update their data, we will hit theirs to update their data using
> a triggered job.
>
> This all boils down to just integrating with Kafka.
>
> Once again, thanks for all the help.
>
> Cheers,
> Ben
>
>
> On Oct 9, 2016, at 3:16 AM, Jörn Franke  wrote:
>
> please keep also in mind that Tableau Server has the capabilities to store
> data in-memory and refresh only when needed the in-memory data. This means
> you can import it from any source and let your users work only on the
> in-memory data in Tableau Server.
>
> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke  wrote:
>
>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich
>> provided already a good alternative. However, you should check if it
>> contains a recent version of Hbase and Phoenix. That being said, I just
>> wonder what is the dataflow, data model and the analysis you plan to do.
>> Maybe there are completely different solutions possible. Especially these
>> single inserts, upserts etc. should be avoided as much as possible in the
>> Big Data (analysis) world with any technology, because they do not perform
>> well.
>>
>> Hive with Llap will provide an in-memory cache for interactive analytics.
>> You can put full tables in-memory with Hive using Ignite HDFS in-memory
>> solution. All this does only make sense if you do not use MR as an engine,
>> the right input format (ORC, parquet) and a recent Hive version.
>>
>> On 8 Oct 2016, at 21:55, Benjamin Kim  wrote:
>>
>> Mich,
>>
>> Unfortunately, we are moving away from Hive and unifying on Spark using
>> CDH 5.8 as our distro. And, the Tableau released a Spark ODBC/JDBC driver
>> too. I will either try Phoenix JDBC Server for HBase or push to move faster
>> to Kudu with Impala. We will use Impala as the JDBC in-between until the
>> Kudu team completes Spark SQL support for JDBC.
>>
>> Thanks for the advice.
>>
>> Cheers,
>> Ben
>>
>>
>> On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh 
>> wrote:
>>
>> Sure. But essentially you are looking at batch data for analytics for
>> your tableau users so Hive may be a better choice with its rich 

question on the structured DataSet API join

2016-10-17 Thread Yang
I'm trying to use the joinWith() method instead of join() since the former
provides type checked result while the latter is a straight DataFrame.


the signature is DataSet[(T,U)] joinWith(other:DataSet[U], col:Column)



here the second arg, col:Column is normally provided by
other.col("col_name"). again once we use a string to specify the column,
you can't do compile time type checks (on the validity of the join
condition, for example you could end up specifying
other.col("a_string_col") === this_ds.col("a_double_col") )
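
To restate the situation with a small, self-contained example (case classes and
column names are made up): joinWith gives back a typed Dataset[(Click, User)],
but the condition itself is still an untyped Column, so a mistake in it only
surfaces at analysis/run time.

import org.apache.spark.sql.{Dataset, SparkSession}

case class Click(userId: Long, url: String)
case class User(userId: Long, name: String)

val spark = SparkSession.builder()
  .appName("joinWith-example").master("local[*]").getOrCreate()
import spark.implicits._

val clicks: Dataset[Click] = Seq(Click(1L, "/home"), Click(2L, "/cart")).toDS()
val users: Dataset[User]   = Seq(User(1L, "alice"), User(2L, "bob")).toDS()

// Typed result, untyped condition.
val joined: Dataset[(Click, User)] =
  clicks.joinWith(users, clicks("userId") === users("userId"), "inner")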

I checked the DataSet API doc, seems there is only this col() method
producing a Column, no other ways.

so is there a type-checked way to provide the join condition?


thanks


Indexing w spark joins?

2016-10-17 Thread Michael Segel
Hi,

Apologies if I’ve asked this question before but I didn’t see it in the list 
and I’m certain that my last surviving brain cell has gone on strike over my 
attempt to reduce my caffeine intake…

Posting this to both user and dev because I think the question / topic jumps in 
to both camps.


Again since I’m a relative newbie on spark… I may be missing something so 
apologies up front…


With respect to Spark SQL,  in pre 2.0.x,  there were only hash joins?  In post 
2.0.x you have hash, semi-hash, and sorted list merge.

For the sake of simplicity… lets forget about cross product joins…

Has anyone looked at how we could use inverted tables to improve query 
performance?

The issue is that when you have a data sewer (lake) , what happens when your 
use case query is orthogonal to how your data is stored? This means full table 
scans.
By using secondary indexes, we can reduce this albeit at a cost of increasing 
your storage footprint by the size of the index.

Are there any JIRAs open that discuss this?

Indexes to assist in terms of ‘predicate push downs’ (using the index when a 
field in a where clause is indexed) rather than performing a full table scan.
Indexes to assist in the actual join if the join column is on an indexed column?

In the first, using an inverted table to produce a sort ordered set of row keys 
that you would then use in the join process (same as if you produced the subset 
based on the filter.)

To put this in perspective… here’s a dummy use case…

CCCis (CCC) is the middle man in the insurance industry. They have a piece of 
software that sits in the repair shop (e.g Joe’s Auto Body) and works with 
multiple insurance carriers.
The primary key in their data is going to be Insurance Company | Claim ID.  
This makes it very easy to find a specific claim for further processing.

Now lets say I want to do some analysis on determining the average cost of 
repairing a front end collision of a Volvo S80?
Or
Break down the number and types of accidents by car manufacturer , model and 
color.  (Then see if there is any correlation between car color and # and type 
of accidents)


As you can see, all of these queries are orthogonal to my storage.  So I need 
to create secondary indexes to help sift thru the data efficiently.

Does this make sense?

Please Note: I did some work for CCC back in the late 90’s. Any resemblance to 
their big data efforts is purely coincidence  and you can replace CCC with 
Allstate, Progressive, StateFarm or some other auto insurance company …

Thx

-Mike




Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Michael Segel
Guys,
Sorry for jumping in late to the game…

If memory serves (which may not be a good thing…) :

You can use HiveServer2 as a connection point to HBase.
While this doesn’t perform well, its probably the cleanest solution.
I’m not keen on Phoenix… wouldn’t recommend it….


The issue is that you’re trying to make HBase, a key/value object store, a 
Relational Engine… its not.

There are some considerations which make HBase not ideal for all use cases and 
you may find better performance with Parquet files.

One thing missing is the use of secondary indexing and query optimizations that 
you have in RDBMSs and are lacking in HBase / MapRDB / etc …  so your 
performance will vary.

With respect to Tableau… their entire interface in to the big data world 
revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
part of your solution, you’re DOA w respect to Tableau.

Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
Apache project)


On Oct 9, 2016, at 12:23 PM, Benjamin Kim 
> wrote:

Thanks for all the suggestions. It would seem you guys are right about the 
Tableau side of things. The reports don’t need to be real-time, and they won’t 
be directly feeding off of the main DMP HBase data. Instead, it’ll be batched 
to Parquet or Kudu/Impala or even PostgreSQL.

I originally thought that we needed two-way data retrieval from the DMP HBase 
for ID generation, but after further investigation into the use-case and 
architecture, the ID generation needs to happen local to the Ad Servers where 
we generate a unique ID and store it in a ID linking table. Even better, many 
of the 3rd party services supply this ID. So, data only needs to flow in one 
direction. We will use Kafka as the bus for this. No JDBC required. This is 
also goes for the REST Endpoints. 3rd party services will hit ours to update 
our data with no need to read from our data. And, when we want to update their 
data, we will hit theirs to update their data using a triggered job.

This all boils down to just integrating with Kafka.

Once again, thanks for all the help.

Cheers,
Ben


On Oct 9, 2016, at 3:16 AM, Jörn Franke 
> wrote:

please keep also in mind that Tableau Server has the capabilities to store data 
in-memory and refresh only when needed the in-memory data. This means you can 
import it from any source and let your users work only on the in-memory data in 
Tableau Server.

On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke 
> wrote:
Cloudera 5.8 has a very old version of Hive without Tez, but Mich provided 
already a good alternative. However, you should check if it contains a recent 
version of Hbase and Phoenix. That being said, I just wonder what is the 
dataflow, data model and the analysis you plan to do. Maybe there are 
completely different solutions possible. Especially these single inserts, 
upserts etc. should be avoided as much as possible in the Big Data (analysis) 
world with any technology, because they do not perform well.

Hive with Llap will provide an in-memory cache for interactive analytics. You 
can put full tables in-memory with Hive using Ignite HDFS in-memory solution. 
All this does only make sense if you do not use MR as an engine, the right 
input format (ORC, parquet) and a recent Hive version.

On 8 Oct 2016, at 21:55, Benjamin Kim 
> wrote:

Mich,

Unfortunately, we are moving away from Hive and unifying on Spark using CDH 5.8 
as our distro. And, the Tableau released a Spark ODBC/JDBC driver too. I will 
either try Phoenix JDBC Server for HBase or push to move faster to Kudu with 
Impala. We will use Impala as the JDBC in-between until the Kudu team completes 
Spark SQL support for JDBC.

Thanks for the advice.

Cheers,
Ben


On Oct 8, 2016, at 12:35 PM, Mich Talebzadeh 
> wrote:

Sure. But essentially you are looking at batch data for analytics for your 
tableau users so Hive may be a better choice with its rich SQL and ODBC.JDBC 
connection to Tableau already.

I would go for Hive especially the new release will have an in-memory offering 
as well for frequently accessed data :)


Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 8 October 2016 at 20:15, Benjamin Kim 
> wrote:
Mich,

First and 

Re: Possible memory leak after closing spark context in v2.0.1

2016-10-17 Thread Lev Katzav
I don't have in my code any object broadcasting.
I do have broadcast join hints (df1.join(broadcast(df2)))

I tried starting and stopping the Spark context for every test (and not
once per suite), and it did stop the OOM errors, so I guess there is no
leakage after the context is stopped.
Also, removing the broadcast hint stopped the errors.

So perhaps DataFrames that were broadcast are never released?
I would assume that, just as cached RDDs are evicted when there is no
more free memory, broadcasts would behave the same way. Is that incorrect?
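
For reference, a sketch of the two mechanisms being discussed (data is
illustrative): the broadcast join *hint* gives you no handle to unpersist, Spark
manages that broadcast internally, whereas an explicit broadcast *variable* can
be released by hand.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("broadcast-example").master("local[*]").getOrCreate()
import spark.implicits._

val big   = Seq((1, "x"), (2, "y")).toDF("id", "payload")
val small = Seq((1, "a"), (2, "b")).toDF("id", "label")

// 1) Broadcast hint on a join: no user-visible Broadcast object exists here.
val joined = big.join(broadcast(small), Seq("id"))
joined.count()

// 2) Explicit broadcast variable: this one can (and should) be released.
val lookup = spark.sparkContext.broadcast(Map(1 -> "a", 2 -> "b"))
val labelled = Seq(1, 2, 3).toDS().map(id => lookup.value.getOrElse(id, "?"))
labelled.collect()   // do the work while the broadcast is still valid
lookup.unpersist()   // or lookup.destroy() once nothing references it anymore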

Thanks


On Mon, Oct 17, 2016 at 4:52 PM, Sean Owen  wrote:

> Did you unpersist the broadcast objects?
>
> On Mon, Oct 17, 2016 at 10:02 AM lev  wrote:
>
>> Hello,
>>
>> I'm in the process of migrating my application to spark 2.0.1,
>> And I think there is some memory leaks related to Broadcast joins.
>>
>> the application has many unit tests,
>> and each individual test suite passes, but when running all together, it
>> fails on OOM errors.
>>
>> In the begging of each suite I create a new spark session with the session
>> builder:
>> /val spark = sessionBuilder.getOrCreate()
>> /
>> and in the end of each suite, I call the stop method:
>> /spark.stop()/
>>
>> I added a profiler to the application, and looks like broadcast objects
>> are
>> taking most of the memory:
>> > file/n27910/profiler.png>
>>
>> Since each test suite passes when running by itself,
>> I think that the broadcasts are leaking between the tests suites.
>>
>> Any suggestions on how to resolve this?
>>
>> thanks
>>
>>
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Possible-memory-leak-after-
>> closing-spark-context-in-v2-0-1-tp27910.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Substitute Certain Rows a data Frame using SparkR

2016-10-17 Thread shilp
I have a SparkR data frame and I want to replace certain rows of a column
which satisfy a certain condition with some value. If it were a simple R data
frame, I would do something as follows:

df$Column1[df$Column1 == "Value"] = "NewValue"

How would I perform a similar operation on a SparkR data frame?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Substitute-Certain-Rows-a-data-Frame-using-SparkR-tp27912.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

RE: rdd and dataframe columns dtype

2016-10-17 Thread 박경희
Do you need this one?
http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types

Best Regards

- Original Message -
Sender : muhammet pakyürek 
Date : 2016-10-17 20:52 (GMT+9)
Title : rdd and dataframe columns dtype

How can I set the column dtypes of an RDD?

Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Thakrar, Jayesh
Yes, iterating over a dataframe and making changes is not uncommon.
Of course RDDs, dataframes and datasets are immutable, but there is some
optimization in the optimizer that can potentially help to dampen the
impact of creating a new RDD, dataframe or dataset.
Also, the use-case you cited is similar to what is done in regression,
clustering and other algorithms, i.e. you iterate, making a change to a
dataframe/dataset until the desired condition is met.
E.g. see this - 
https://spark.apache.org/docs/1.6.1/ml-classification-regression.html#linear-regression
 and the setting of the iteration ceiling

// instantiate the base classifier
val classifier = new LogisticRegression()
  .setMaxIter(params.maxIter)
  .setTol(params.tol)
  .setFitIntercept(params.fitIntercept)

Now the impact of that depends on a variety of things.
E.g. if the data is completely contained in memory and there is no spill over 
to disk, it might not be a big issue (of course there will still be memory, CPU 
and network overhead/latency).
If you are looking at storing the data on disk (e.g. as part of a checkpoint or 
explicit storage), then there can be substantial I/O activity.
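
A minimal sketch of that iterate-until-stable pattern, assuming a SparkSession
named spark and a made-up update rule (increment negative values until none are
left); each pass builds a new DataFrame and the loop stops when nothing changed
or a ceiling is hit:

import org.apache.spark.sql.functions.{col, when}
import spark.implicits._

var current = Seq(-2, -1, 0, 3).toDF("value")
var changed = true
var iteration = 0

while (changed && iteration < 100) {               // hard ceiling, like setMaxIter
  val updated = current
    .withColumn("value", when(col("value") < 0, col("value") + 1).otherwise(col("value")))
    .cache()
  changed = updated.except(current).count() > 0    // did any row actually change?
  current.unpersist()
  current = updated
  iteration += 1
}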



From: Xi Shen 
Date: Monday, October 17, 2016 at 2:54 AM
To: Divya Gehlot , Mungeol Heo 
Cc: "user @spark" 
Subject: Re: Is spark a right tool for updating a dataframe repeatedly

I think most of the "big data" tools, like Spark and Hive, are not designed to 
edit data. They are only designed to query data. I wonder in what scenario you 
need to update a large volume of data repeatedly.


On Mon, Oct 17, 2016 at 2:00 PM Divya Gehlot 
> wrote:
If  my understanding is correct about your query
In spark Dataframes are immutable , cant update the dataframe.
you have to create a new dataframe to update the current dataframe .


Thanks,
Divya


On 17 October 2016 at 09:50, Mungeol Heo 
> wrote:
Hello, everyone.

As I mentioned in the title, I wonder whether Spark is the right tool for
updating a data frame repeatedly until there is no more data to
update.

For example.

while (if there was a updating) {
update a data frame A
}

If it is the right tool, then what is the best practice for this kind of work?
Thank you.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org

--

Thanks,
David S.


Help in generating unique Id in spark row

2016-10-17 Thread Saurav Sinha
Hi,

I am in situation where I want to generate unique Id for each row.

I have used monotonicallyIncreasingId, but it gives increasing values and
starts generating from the beginning again if the job fails.

I have two questions here:

Q1. Does this method give me a unique ID even in a failure situation? I
want to use that ID as my Solr id.

Q2. If the answer to the previous question is NO, then is there a way to
generate a UUID for each row which is unique and not updatable?

As I have come across a situation where the UUID gets updated:


val idUDF = udf(() => UUID.randomUUID().toString)
val a = rawDataDf.withColumn("alarmUUID", lit(idUDF()))
a.persist(StorageLevel.MEMORY_AND_DISK)
rawDataDf.registerTempTable("rawAlarms")

///
/// I do some joins

but as I reach further below

I do something like (b is a transformation of a):
sqlContext.sql("""Select a.alarmUUID,b.alarmUUID
  from a right outer join b on a.alarmUUID =
b.alarmUUID""")

it gives output like:

+--------------------+--------------------+
|           alarmUUID|           alarmUUID|
+--------------------+--------------------+
|7d33a516-5532-410...|                null|
|                null|2439d6db-16a2-44b...|
+--------------------+--------------------+
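
One hedged way around this, building on the snippet above (path is
illustrative): a UUID-generating UDF is non-deterministic, so it can be
re-evaluated when the plan is recomputed and each side of a later join may see
different values. Materialising the column once and reading that copy back pins
the values down:

val withIds = rawDataDf.withColumn("alarmUUID", idUDF())
withIds.write.mode("overwrite").parquet("/tmp/alarms_with_ids")

// All later joins read the frozen copy, so alarmUUID can no longer change.
val stable = sqlContext.read.parquet("/tmp/alarms_with_ids")
stable.registerTempTable("rawAlarms")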



-- 
Thanks and Regards,

Saurav Sinha

Contact: 9742879062


OutputMetrics with data frames (spark-avro)

2016-10-17 Thread Tim Moran
Hi,

I'm using the Databricks spark-avro library to save some DataFrames out as
Avro (with Spark 1.6.1). When I do this however, I lose the information in
the spark events about the number of records and size of data written to
HDFS for each partition that's available if I save an RDD out as a text
file.

Is this just a limitation of data frames, or is there a way of making this
information available? It's really useful for performance monitoring.

Thanks,

Tim.



Re: Did anybody come across this random-forest issue with spark 2.0.1.

2016-10-17 Thread 市场部
Hi Xi Shen

Not yet. For the moment my JDK for Spark is still v7. Thanks for the
reminder, I will try it out by upgrading Java.

From: Xi Shen
Date: Monday, 17 October 2016, 8:00 PM
To: zhangjianxin, "user@spark.apache.org"
Subject: Re: Did anybody come across this random-forest issue with spark 2.0.1.

Did you also upgrade to Java from v7 to v8?

On Mon, Oct 17, 2016 at 7:19 PM 张建鑫(市场部) 
> wrote:

Did anybody encounter this problem before? Why does it happen, and how can it be
solved? The same training data and same source code work in 1.6.1, but become
lousy in 2.0.1.

[inline screenshot attachment]
--

Thanks,
David S.


Re: Question about the official binary Spark 2 package

2016-10-17 Thread Xi Shen
Okay, thank you.

On Mon, Oct 17, 2016 at 5:53 PM Sean Owen  wrote:

> You can take the "with user-provided Hadoop" binary from the download
> page, and yes that should mean it does not drag in a Hive dependency of its
> own.
>
> On Mon, Oct 17, 2016 at 7:08 AM Xi Shen  wrote:
>
> Hi,
>
> I want to configure my Hive to use Spark 2 as its engine. According to
> Hive's instructions, Spark should be built *without* Hadoop or Hive. I
> could build my own, but for some reason I hope I can use an official
> binary build.
>
> So I want to ask if the official Spark binary build labeled "with
> user-provided Hadoop" also implies "user-provided Hive".
>
> --
>
>
> Thanks,
> David S.
>
> --


Thanks,
David S.


rdd and dataframe columns dtype

2016-10-17 Thread muhammet pakyürek
How can I set the column dtypes of an RDD?




Re: Problems with new experimental Kafka Consumer for 0.10

2016-10-17 Thread Matthias Niehoff
heartbeat.interval.ms default
group.max.session.timeout.ms default
session.timeout.ms 6

default values as of kafka 0.10.

complete Kafka params:

val kafkaParams = Map[String, String](
  "bootstrap.servers" -> kafkaBrokers,
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> "false",
  "key.deserializer" -> classOf[StringDeserializer].getName,
  "value.deserializer" -> classOf[BytesDeserializer].getName,
  "session.timeout.ms" -> s"${1 * 60 * 1000}",
  "request.timeout.ms" -> s"${2 * 60 * 1000}",
  "max.poll.records" -> "1000"
)


As pointed out, when using different groups for each DirectStream
everything is fine.
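
For anyone hitting the same behaviour, a sketch of the working setup with one
group.id per DirectStream (broker address, topics and group names are
illustrative, and values are simplified to String):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val sparkConf = new SparkConf().setAppName("kafka-two-streams").setMaster("local[*]")
val ssc = new StreamingContext(sparkConf, Seconds(30))

// Same connection settings, but a distinct group.id per stream.
def paramsFor(groupId: String): Map[String, Object] = Map[String, Object](
  "bootstrap.servers"  -> "broker1:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> groupId,
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val adRequestStream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent,
  Subscribe[String, String](Seq("adrequest"), paramsFor("app-adrequest")))

val viewStream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent,
  Subscribe[String, String](Seq("view"), paramsFor("app-view")))

adRequestStream.map(_.value).print(10)
viewStream.map(_.value).print(10)

ssc.start()
ssc.awaitTermination()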

2016-10-15 2:42 GMT+02:00 Cody Koeninger :

> For you or anyone else having issues with consumer rebalance, what are
> your settings for
>
> heartbeat.interval.ms
> session.timeout.ms
> group.max.session.timeout.ms
>
> relative to your batch time?
>
> On Tue, Oct 11, 2016 at 10:19 AM, static-max 
> wrote:
> > Hi,
> >
> > I run into the same exception
> > (org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot
> be
> > completed since the group has already rebalanced ...), but I only use one
> > stream.
> > I get the exceptions when trying to manually commit the offset to Kafka:
> >
> > OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
> > CanCommitOffsets cco = (CanCommitOffsets) directKafkaStream.dstream();
> > cco.commitAsync(offsets);
> >
> > I tried setting "max.poll.records" to 1000 but this did not help.
> >
> > Any idea what could be wrong?
> >
> > 2016-10-11 15:36 GMT+02:00 Cody Koeninger :
> >>
> >> The new underlying kafka consumer prefetches data and is generally
> heavier
> >> weight, so it is cached on executors.  Group id is part of the cache
> key. I
> >> assumed kafka users would use different group ids for consumers they
> wanted
> >> to be distinct, since otherwise would cause problems even with the
> normal
> >> kafka consumer,  but that appears to be a poor assumption.
> >>
> >> I'll figure out a way to make this more obvious.
> >>
> >>
> >> On Oct 11, 2016 8:19 AM, "Matthias Niehoff"
> >>  wrote:
> >>
> >> good point, I changed the group id to be unique for the separate streams
> >> and now it works. Thanks!
> >>
> >> Although changing this is ok for us, i am interested in the why :-) With
> >> the old connector this was not a problem nor is it afaik with the pure
> kafka
> >> consumer api
> >>
> >> 2016-10-11 14:30 GMT+02:00 Cody Koeninger :
> >>>
> >>> Just out of curiosity, have you tried using separate group ids for the
> >>> separate streams?
> >>>
> >>>
> >>> On Oct 11, 2016 4:46 AM, "Matthias Niehoff"
> >>>  wrote:
> 
>  I stripped down the job to just consume the stream and print it,
> without
>  avro deserialization. When I only consume one topic, everything is
> fine. As
>  soon as I add a second stream it stucks after about 5 minutes. So I
>  basically bails down to:
> 
> 
>    val kafkaParams = Map[String, String](
>  "bootstrap.servers" -> kafkaBrokers,
>  "group.id" -> group,
>  "key.deserializer" -> classOf[StringDeserializer].getName,
>  "value.deserializer" -> classOf[BytesDeserializer].getName,
>  "session.timeout.ms" -> s"${1 * 60 * 1000}",
>  "request.timeout.ms" -> s"${2 * 60 * 1000}",
>  "auto.offset.reset" -> "latest",
>  "enable.auto.commit" -> "false"
>    )
> 
>    def main(args: Array[String]) {
> 
>  def createStreamingContext(): StreamingContext = {
> 
>    val sparkConf = new SparkConf(true)
>  .setAppName("Kafka Consumer Test")
>    sparkConf.setMaster("local[*]")
> 
>    val ssc = new StreamingContext(sparkConf,
>  Seconds(streaming_interval_seconds))
> 
>    // AD REQUESTS
>    // ===
>    val serializedAdRequestStream = createStream(ssc,
> topic_adrequest)
>    serializedAdRequestStream.map(record =>
>  record.value().get()).print(10)
> 
>    // VIEWS
>    // ==
>    val serializedViewStream = createStream(ssc, topic_view)
>    serializedViewStream.map(record => record.value().get()).print(
> 10)
> 
>  //  // CLICKS
>  //  // ==
>  //  val serializedClickStream = createStream(ssc, topic_click)
>  //  serializedClickStream.map(record =>
>  record.value().get()).print(10)
> 
>    ssc
>  }
> 
>  val streamingContext = createStreamingContext
>  streamingContext.start()
>  streamingContext.awaitTermination()
>    }
> 
> 
>  And in the logs you see:
> 
> 
>  16/10/10 14:02:26 INFO JobScheduler: Finished job streaming job
>  1476100944000 ms.2 from job set of time 

Re: Couchbase-Spark 2.0.0

2016-10-17 Thread Sean Owen
You're now asking about couchbase code, so this isn't the best place to
ask. Head to couchbase forums.

On Mon, Oct 17, 2016 at 10:14 AM Devi P.V  wrote:

> Hi,
> I tried with the following code
>
> import com.couchbase.spark._
> val conf = new SparkConf()
>   .setAppName(this.getClass.getName)
>   .setMaster("local[*]")
>   .set("com.couchbase.bucket.bucketName","password")
>   .set("com.couchbase.nodes", "node")
> .set ("com.couchbase.queryEnabled", "true")
> val sc = new SparkContext(conf)
>
> I need full document from bucket,so i gave query like this,
>
> val query = "SELECT META(`bucketName`).id as id FROM `bucketName` "
>  sc
>   .couchbaseQuery(Query.simple(query))
>   .map(_.value.getString("id"))
>   .couchbaseGet[JsonDocument]()
>   .collect()
>   .foreach(println)
>
> But it can't take Query.simple(query)
>
> I used libraryDependencies += "com.couchbase.client" %
> "spark-connector_2.11" % "1.2.1" in built.sbt.
> Is my query wrong or anything else needed to import?
>
>
> Please help.
>
> On Sun, Oct 16, 2016 at 8:23 PM, Rodrick Brown <
> rodr...@orchardplatform.com> wrote:
>
>
>
> On Sun, Oct 16, 2016 at 10:51 AM, Devi P.V  wrote:
>
> Hi all,
> I am trying to read data from couchbase using spark 2.0.0.I need to fetch
> complete data from a bucket as  Rdd.How can I solve this?Does spark 2.0.0
> support couchbase?Please help.
>
> Thanks
>
> https://github.com/couchbase/couchbase-spark-connector
>
>
> --
>
> [image: Orchard Platform] 
>
> *Rodrick Brown */ *DevOPs*
>
> 9174456839 / rodr...@orchardplatform.com
>
> Orchard Platform
> 101 5th Avenue, 4th Floor, New York, NY
>
>
>
>


Driver storage memory getting wasted

2016-10-17 Thread Sushrut Ikhar
Hi,
Is there any config to change the storage memory fraction for the driver? I'm
not caching anything in the driver, and by default it picks up
spark.memory.fraction (0.9) and
spark.memory.storageFraction (0.6),
whose values I've set as per my executor usage.
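
For reference, both of those fractions are plain application-level settings that
can be put on the SparkConf (or passed with --conf on spark-submit); as far as I
know there is no separate driver-only storage-fraction knob. The values below
are purely illustrative, not recommendations:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("memory-fraction-example")
  .set("spark.memory.fraction", "0.6")
  .set("spark.memory.storageFraction", "0.5")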

Regards,

Sushrut Ikhar
[image: https://]about.me/sushrutikhar



Re: Question about the official binary Spark 2 package

2016-10-17 Thread Sean Owen
You can take the "with user-provided Hadoop" binary from the download page,
and yes that should mean it does not drag in a Hive dependency of its own.

On Mon, Oct 17, 2016 at 7:08 AM Xi Shen  wrote:

> Hi,
>
> I want to configure my Hive to use Spark 2 as its engine. According to
> Hive's instructions, Spark should be built *without* Hadoop or Hive. I
> could build my own, but for some reason I hope I can use an official
> binary build.
>
> So I want to ask if the official Spark binary build labeled "with
> user-provided Hadoop" also implies "user-provided Hive".
>
> --
>
>
> Thanks,
> David S.
>


Re: Possible memory leak after closing spark context in v2.0.1

2016-10-17 Thread Sean Owen
Did you unpersist the broadcast objects?

On Mon, Oct 17, 2016 at 10:02 AM lev  wrote:

> Hello,
>
> I'm in the process of migrating my application to spark 2.0.1,
> And I think there is some memory leaks related to Broadcast joins.
>
> the application has many unit tests,
> and each individual test suite passes, but when running all together, it
> fails on OOM errors.
>
> In the begging of each suite I create a new spark session with the session
> builder:
> /val spark = sessionBuilder.getOrCreate()
> /
> and in the end of each suite, I call the stop method:
> /spark.stop()/
>
> I added a profiler to the application, and looks like broadcast objects are
> taking most of the memory:
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n27910/profiler.png
> >
>
> Since each test suite passes when running by itself,
> I think that the broadcasts are leaking between the tests suites.
>
> Any suggestions on how to resolve this?
>
> thanks
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Possible-memory-leak-after-closing-spark-context-in-v2-0-1-tp27910.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Couchbase-Spark 2.0.0

2016-10-17 Thread Devi P.V
Hi,
I tried with the following code

import com.couchbase.spark._
val conf = new SparkConf()
  .setAppName(this.getClass.getName)
  .setMaster("local[*]")
  .set("com.couchbase.bucket.bucketName","password")
  .set("com.couchbase.nodes", "node")
.set ("com.couchbase.queryEnabled", "true")
val sc = new SparkContext(conf)

I need the full documents from the bucket, so I gave a query like this:

val query = "SELECT META(`bucketName`).id as id FROM `bucketName` "
 sc
  .couchbaseQuery(Query.simple(query))
  .map(_.value.getString("id"))
  .couchbaseGet[JsonDocument]()
  .collect()
  .foreach(println)

But it can't take Query.simple(query)

I used libraryDependencies += "com.couchbase.client" %
"spark-connector_2.11" % "1.2.1" in built.sbt.
Is my query wrong or anything else needed to import?


Please help.

On Sun, Oct 16, 2016 at 8:23 PM, Rodrick Brown 
wrote:

>
>
> On Sun, Oct 16, 2016 at 10:51 AM, Devi P.V  wrote:
>
>> Hi all,
>> I am trying to read data from couchbase using spark 2.0.0.I need to fetch
>> complete data from a bucket as  Rdd.How can I solve this?Does spark 2.0.0
>> support couchbase?Please help.
>>
>> Thanks
>>
> https://github.com/couchbase/couchbase-spark-connector
>
>
> --
>
> [image: Orchard Platform] 
>
> *Rodrick Brown */ *DevOPs*
>
> 9174456839 / rodr...@orchardplatform.com
>
> Orchard Platform
> 101 5th Avenue, 4th Floor, New York, NY
>
>


Possible memory leak after closing spark context in v2.0.1

2016-10-17 Thread lev
Hello,

I'm in the process of migrating my application to spark 2.0.1,
And I think there is some memory leaks related to Broadcast joins.

the application has many unit tests,
and each individual test suite passes, but when running all together, it
fails on OOM errors.

In the begging of each suite I create a new spark session with the session
builder:
/val spark = sessionBuilder.getOrCreate()
/
and in the end of each suite, I call the stop method:
/spark.stop()/

I added a profiler to the application, and looks like broadcast objects are
taking most of the memory:
 

Since each test suite passes when running by itself,
I think that the broadcasts are leaking between the tests suites.

Any suggestions on how to resolve this?

thanks 





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Possible-memory-leak-after-closing-spark-context-in-v2-0-1-tp27910.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Resizing Image with Scrimage in Spark

2016-10-17 Thread Sean Owen
It pretty much means what it says. Objects you send across machines must be
serializable, and the object from the library is not.
You can write a wrapper object that is serializable and knows how to
serialize it. Or ask the library dev to consider making this object
serializable.
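
A hedged sketch of that idea: keep only serializable values (the hex string and
byte arrays) in whatever crosses machine boundaries, and build the scrimage
Image inside the executor-side function so it never has to be serialized.
resizeBytes below is a hypothetical stand-in for the actual scrimage resize
calls.

import javax.xml.bind.DatatypeConverter
import org.apache.spark.sql.functions.udf

// Placeholder so the sketch runs without scrimage on the classpath; the real
// body would be something like Image(bytes).scaleTo(w, h) rendered back to bytes.
def resizeBytes(bytes: Array[Byte]): Array[Byte] = bytes

val resizeUdf = udf { (hex: String) =>
  val original = DatatypeConverter.parseHexBinary(hex)   // hex column -> bytes
  val resized  = resizeBytes(original)                   // Image exists only here
  DatatypeConverter.printHexBinary(resized)              // back to a serializable type
}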

On Mon, Oct 17, 2016 at 8:04 AM Adline Dsilva 
wrote:

> Hi All,
>
>I have a Hive Table which contains around 500 million photos(Profile
> picture of Users) stored as hex string and total size of the table is 5TB.
> I'm trying to make a solution where images can be retrieved in real-time.
>
> Current Solution,  Resize the images, index it along the user profile to
> solr. For Resizing, Im using a scala library called scrimage
> 
>
> While running the udf function im getting below error.
> Serialization stack:
> - *object not serializable* (class: com.sksamuel.scrimage.Image,
> value: Image [width=767, height=1024, type=2])
> - field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: imgR,
> type: class com.sksamuel.scrimage.Image)
>
> Can anyone suggest method to overcome the above error.
>
> Regards,
> Adline
>
> --
>


Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Xi Shen
I think most of the "big data" tools, like Spark and Hive, are not designed
to edit data. They are only designed to query data. I wonder in what
scenario you need to update a large volume of data repeatedly.


On Mon, Oct 17, 2016 at 2:00 PM Divya Gehlot 
wrote:

> If  my understanding is correct about your query
> In spark Dataframes are immutable , cant update the dataframe.
> you have to create a new dataframe to update the current dataframe .
>
>
> Thanks,
> Divya
>
>
> On 17 October 2016 at 09:50, Mungeol Heo  wrote:
>
> Hello, everyone.
>
> As I mentioned in the title, I wonder whether Spark is the right tool for
> updating a data frame repeatedly until there is no more data to
> update.
>
> For example.
>
> while (if there was a updating) {
> update a data frame A
> }
>
> If it is the right tool, then what is the best practice for this kind of
> work?
> Thank you.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
> --


Thanks,
David S.


Re: NoClassDefFoundError: org/apache/spark/Logging in SparkSession.getOrCreate

2016-10-17 Thread Saisai Shao
Not sure why your code looks for the Logging class under org/apache/spark;
it should be “org/apache/spark/internal/Logging”, and it changed a long
time ago.


On Sun, Oct 16, 2016 at 3:25 AM, Brad Cox  wrote:

> I'm experimenting with Spark 2.0.1 for the first time and hitting a
> problem right out of the gate.
>
> My main routine starts with this which I think is the standard idiom.
>
> SparkSession sparkSession = SparkSession
> .builder()
> .master("local")
> .appName("DecisionTreeExample")
> .getOrCreate();
>
> Running this in the eclipse debugger, execution fails in getOrCreate()
> with this exception
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/spark/Logging
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
> at java.security.SecureClassLoader.defineClass(
> SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at org.apache.spark.sql.SparkSession.(
> SparkSession.scala:122)
> at org.apache.spark.sql.SparkSession.(SparkSession.scala:77)
> at org.apache.spark.sql.SparkSession$Builder.
> getOrCreate(SparkSession.scala:840)
> at titanic.DecisionTreeExample.main(DecisionTreeExample.java:54)
>
> java.lang.NoClassDefFoundError means a class is not found at run time that
> was present at
> compile time. I've googled everything I can think of and found no
> solutions. Can someone
> help? Thanks!
>
> These are my spark-relevant dependencies:
>
> 
> org.apache.spark
> spark-core_2.11
> 2.0.1
> 
> 
> org.apache.spark
> spark-mllib_2.11
> 2.0.1
> 
> 
> org.apache.spark
> spark-sql_2.11
> 2.0.1
> 
>
>
>
> Dr. Brad J. CoxCell: 703-594-1883 Skype: dr.brad.cox
>
>
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Resizing Image with Scrimage in Spark

2016-10-17 Thread Adline Dsilva
Hi All,

   I have a Hive table which contains around 500 million photos (profile pictures
of users) stored as hex strings, and the total size of the table is 5TB. I'm trying
to build a solution where images can be retrieved in real-time.

Current solution: resize the images and index them along with the user profile into
Solr. For resizing, I'm using a Scala library called
scrimage

While running the UDF I'm getting the below error.
Serialization stack:
- object not serializable (class: com.sksamuel.scrimage.Image, value: Image 
[width=767, height=1024, type=2])
- field (class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC, name: imgR, type: 
class com.sksamuel.scrimage.Image)

Can anyone suggest method to overcome the above error.

Regards,
Adline




Question about the official binary Spark 2 package

2016-10-17 Thread Xi Shen
Hi,

I want to configure my Hive to use Spark 2 as its engine. According to
Hive's instructions, Spark should be built *without* Hadoop or Hive. I
could build my own, but for some reason I hope I can use an official
binary build.

So I want to ask if the official Spark binary build labeled "with
user-provided Hadoop" also implies "user-provided Hive".

-- 


Thanks,
David S.


Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Divya Gehlot
If my understanding of your query is correct:
In Spark, DataFrames are immutable; you can't update a DataFrame in place.
You have to create a new DataFrame to apply the update.


Thanks,
Divya


On 17 October 2016 at 09:50, Mungeol Heo  wrote:

> Hello, everyone.
>
> As I mentioned in the title, I wonder whether Spark is the right tool for
> updating a data frame repeatedly until there is no more data to
> update.
>
> For example.
>
> while (if there was a updating) {
> update a data frame A
> }
>
> If it is the right tool, then what is the best practice for this kind of
> work?
> Thank you.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>