Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Aseem Bansal
@Debasish

I see that the spark version being used in the project that you mentioned
is 1.6.0. I would suggest that you take a look at some blogs related to
Spark 2.0 Pipelines and Models in the new ml package. As of the latest Spark
2.1.0 release, the new ml package's API has no way to call predict on a single
vector; no such API is exposed. It is work in progress but not yet released.

On Sat, Feb 4, 2017 at 11:07 PM, Debasish Das 
wrote:

> If we expose an API to access the raw models out of PipelineModel can't we
> call predict directly on it from an API ? Is there a task open to expose
> the model out of PipelineModel so that predict can be called on it...there
> is no dependency of spark context in ml model...
> On Feb 4, 2017 9:11 AM, "Aseem Bansal"  wrote:
>
>>
>>- In Spark 2.0 there is a class called PipelineModel. I know that the
>>title says pipeline but it is actually talking about PipelineModel trained
>>via using a Pipeline.
>>- Why PipelineModel instead of pipeline? Because usually there is a
>>series of stuff that needs to be done when doing ML which warrants an
>>ordered sequence of operations. Read the new spark ml docs or one of the
>>databricks blogs related to spark pipelines. If you have used python's
>>sklearn library the concept is inspired from there.
>>- "once model is deserialized as ml model from the store of choice
>>within ms" - The timing of loading the model was not what I was
>>referring to when I was talking about timing.
>>- "it can be used on incoming features to score through
>>spark.ml.Model predict API". The predict API is in the old mllib package
>>not the new ml package.
>>- "why r we using dataframe and not the ML model directly from API" -
>>Because as of now the new ml package does not have the direct API.
>>
>>
>> On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das 
>> wrote:
>>
>>> I am not sure why I will use pipeline to do scoring...idea is to build a
>>> model, use model ser/deser feature to put it in the row or column store of
>>> choice and provide a api access to the model...we support these primitives
>>> in github.com/Verizon/trapezium...the api has access to spark context
>>> in local or distributed mode...once model is deserialized as ml model from
>>> the store of choice within ms, it can be used on incoming features to score
>>> through spark.ml.Model predict API...I am not clear on 2200x speedup...why
>>> r we using dataframe and not the ML model directly from API ?
>>> On Feb 4, 2017 7:52 AM, "Aseem Bansal"  wrote:
>>>
 Does this support Java 7?
 What is your timezone in case someone wanted to talk?

 On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins 
 wrote:

> Hey Aseem,
>
> We have built pipelines that execute several string indexers, one hot
> encoders, scaling, and a random forest or linear regression at the end.
> Execution time for the linear regression was on the order of 11
> microseconds, a bit longer for random forest. If your pipeline is simple, this
> can be further optimized to around 2-3 microseconds by using row-based
> transformations. The pipeline operated on roughly 12 input features, and by
> the time all the processing was done, we had somewhere around 1000 
> features
> or so going into the linear regression after one hot encoding and
> everything else.
>
> Hope this helps,
> Hollin
>
> On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal 
> wrote:
>
>> Does this support Java 7?
>>
>> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal 
>> wrote:
>>
>>> Is computational time for predictions on the order of few
>>> milliseconds (< 10 ms) like the old mllib library?
>>>
>>> On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins 
>>> wrote:
>>>
 Hey everyone,


 Some of you may have seen Mikhail and I talk at Spark/Hadoop
 Summits about MLeap and how you can use it to build production services
 from your Spark-trained ML pipelines. MLeap is an open-source 
 technology
 that allows Data Scientists and Engineers to deploy Spark-trained ML
 Pipelines and Models to a scoring engine instantly. The MLeap execution
 engine has no dependencies on a Spark context and the serialization 
 format
 is entirely based on Protobuf 3 and JSON.


 The recent 0.5.0 release provides serialization and inference
 support for close to 100% of Spark transformers (we don’t yet support 
 ALS
 and LDA).


 MLeap is open-source, take a look at our Github page:

 https://github.com/combust/mleap


 Or join the conversation on Gitter: https://gitter.im/combust/mleap

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Chris Fregly
to date, i haven't seen very good performance coming from mleap. i believe ram 
from databricks keeps getting you guys on stage at the spark summits, but i've 
been unimpressed with the performance numbers - as well as your choice to 
reimplement your own non-standard "pmml-like" mechanism, which incurs heavy technical 
debt on the development side.

creating technical debt is a very databricks-like thing as seen in their own 
product - so it's no surprise that databricks supports and encourages this type 
of engineering effort.

@hollin: please correct me if i'm wrong, but the numbers you guys have quoted 
in the past are at very low scale. at one point you were quoting 40-50ms which 
is pretty bad. 11ms is better, but these are all at low scale which is not good.

i'm not sure where the 2-3ms numbers are coming from, but even that is not 
realistic in most real-world scenarios at scale.

check out our 100% open source solution to this exact problem starting at 
http://pipeline.io. you'll find links to the github repo, youtube demos, 
slideshare conference talks, online training, and lots more.

our entire focus at PipelineIO is optimizing, deploying, a/b + bandit testing, 
and scaling Scikit-Learn + Spark ML + Tensorflow AI models for high-performance 
predictions.

this focus on performance and scale is an extension of our team's long history 
of building highly scalable, highly available, and highly performant 
distributed ML and AI systems at netflix, twitter, mesosphere - and even 
databricks. :)

reminder that everything here is 100% open source. no product pitches here. we 
work for you guys/gals - aka the community!

please contact me directly if you're looking to solve this problem the best way 
possible.

we can get you up and running in your own cloud-based or on-premise environment 
in minutes. we support aws, google cloud, and azure - basically anywhere that 
runs docker.

any time zone works. we're completely global with free 24x7 support for 
everyone in the community.

thanks! hope this is useful.

Chris Fregly
Research Scientist @ PipelineIO
Founder @ Advanced Spark and TensorFlow Meetup
San Francisco - Chicago - Washington DC - London

On Feb 4, 2017, 12:06 PM -0600, Debasish Das , wrote:
>
> Except of course lda, als and neural net models...for them the model needs to
> be either prescored and cached on a kv store or the matrices / graph should
> be kept on a kv store to access them using a REST API to serve the output...for
> neural net its more fun since its a distributed or local graph over which
> tensorflow compute needs to run...
>
>
> In trapezium we support writing these models to store like cassandra and 
> lucene for example and then provide config driven akka-http based API to add 
> the business logic to access these model from a store and expose the model 
> serving as REST endpoint
>
>
> Matrix, graph and kernel models we use a lot and for them turned out that 
> mllib style model predict were useful if we change the underlying store...
>
> On Feb 4, 2017 9:37 AM, "Debasish Das"  (mailto:debasish.da...@gmail.com)> wrote:
> >
> > If we expose an API to access the raw models out of PipelineModel can't we 
> > call predict directly on it from an API ? Is there a task open to expose 
> > the model out of PipelineModel so that predict can be called on it...there
> > is no dependency of spark context in ml model...
> >
> > On Feb 4, 2017 9:11 AM, "Aseem Bansal"  > (mailto:asmbans...@gmail.com)> wrote:
> > > In Spark 2.0 there is a class called PipelineModel. I know that the title 
> > > says pipeline but it is actually talking about PipelineModel trained via 
> > > using a Pipeline.
> > > Why PipelineModel instead of pipeline? Because usually there is a series 
> > > of stuff that needs to be done when doing ML which warrants an ordered 
> > > sequence of operations. Read the new spark ml docs or one of the 
> > > databricks blogs related to spark pipelines. If you have used python's 
> > > sklearn library the concept is inspired from there.
> > > "once model is deserialized as ml model from the store of choice within 
> > > ms" - The timing of loading the model was not what I was referring to 
> > > when I was talking about timing.
> > > "it can be used on incoming features to score through spark.ml.Model 
> > > predict API". The predict API is in the old mllib package not the new ml 
> > > package.
> > > "why r we using dataframe and not the ML model directly from API" - 
> > > Because as of now the new ml package does not have the direct API.
> > >
> > >
> > >
> > > On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das  > > (mailto:debasish.da...@gmail.com)> wrote:
> > > >
> > > > I am not sure why I will use pipeline to do scoring...idea is to build 
> > > > a model, use model ser/deser feature to put it in the row or column 
> > > > store of choice and provide a api access to 

Turning rows into columns

2017-02-04 Thread Paul Tremblay
I am using pyspark 2.1 and am wondering how to convert a flat file, with
one record per row, into a columnar format.

Here is an example of the data:

u'WARC/1.0',
 u'WARC-Type: warcinfo',
 u'WARC-Date: 2016-12-08T13:00:23Z',
 u'WARC-Record-ID: ',
 u'Content-Length: 344',
 u'Content-Type: application/warc-fields',
 u'WARC-Filename:
CC-MAIN-20161202170900-0-ip-10-31-129-80.ec2.internal.warc.gz',
 u'',
 u'robots: classic',
 u'hostname: ip-10-31-129-80.ec2.internal',
 u'software: Nutch 1.6 (CC)/CC WarcExport 1.0',
 u'isPartOf: CC-MAIN-2016-50',
 u'operator: CommonCrawl Admin',
 u'description: Wide crawl of the web for November 2016',
 u'publisher: CommonCrawl',
 u'format: WARC File Format 1.0',
 u'conformsTo:
http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf',
 u'',
 u'',
 u'WARC/1.0',
 u'WARC-Type: request',
 u'WARC-Date: 2016-12-02T17:54:09Z',
 u'WARC-Record-ID: ',
 u'Content-Length: 220',
 u'Content-Type: application/http; msgtype=request',
 u'WARC-Warcinfo-ID: ',
 u'WARC-IP-Address: 217.197.115.133',
 u'WARC-Target-URI: http://1018201.vkrugudruzei.ru/blog/',
 u'',
 u'GET /blog/ HTTP/1.0',
 u'Host: 1018201.vkrugudruzei.ru',
 u'Accept-Encoding: x-gzip, gzip, deflate',
 u'User-Agent: CCBot/2.0 (http://commoncrawl.org/faq/)',
 u'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
 u'',
 u'',
 u'',
 u'WARC/1.0',
 u'WARC-Type: response',
 u'WARC-Date: 2016-12-02T17:54:09Z',
 u'WARC-Record-ID: ',
 u'Content-Length: 577',
 u'Content-Type: application/http; msgtype=response',
 u'WARC-Warcinfo-ID: ',
 u'WARC-Concurrent-To: ',
 u'WARC-IP-Address: 217.197.115.133',
 u'WARC-Target-URI: http://1018201.vkrugudruzei.ru/blog/',
 u'WARC-Payload-Digest: sha1:Y4TZFLB6UTXHU4HUVONBXC5NZQW2LYMM',
 u'WARC-Block-Digest: sha1:3J7HHBMWTSC7W53DDB7BHTUVPM26QS4B',
 u'']

I want to convert it to something like:
{warc-type='request', warc-date='2016-12-02',
 warc-record-id='
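
One common approach here is to make each WARC header block a single record by
overriding the text input record delimiter, and then parse the "Key: Value"
lines into columns. Below is a rough sketch (in Scala; the question uses
pyspark 2.1, but the same calls exist there), where the input path, the
"WARC/1.0" delimiter choice and the selected columns are assumptions rather
than anything stated in the thread.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Split the input on "WARC/1.0" headers instead of newlines, so that each WARC
// record arrives as one multi-line string.
val hadoopConf = new Configuration(spark.sparkContext.hadoopConfiguration)
hadoopConf.set("textinputformat.record.delimiter", "WARC/1.0")

val records = spark.sparkContext
  .newAPIHadoopFile("/path/to/warc-headers.txt", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], hadoopConf)
  .map { case (_, text) => text.toString }
  .filter(_.trim.nonEmpty)

// Parse each record's "Key: Value" lines into a Map keyed by lower-cased name.
val parsed = records.map { rec =>
  rec.split("\n").map(_.trim).filter(_.contains(": ")).map { line =>
    val Array(k, v) = line.split(": ", 2)
    k.toLowerCase -> v
  }.toMap
}

// Pick the fields of interest and turn them into columns.
import spark.implicits._
val df = parsed.map { m =>
  (m.getOrElse("warc-type", null),
   m.getOrElse("warc-date", null),
   m.getOrElse("warc-record-id", null))
}.toDF("warc_type", "warc_date", "warc_record_id")

df.show(5, truncate = false)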

Re: Spark streaming: Could not initialize class kafka.consumer.FetchRequestAndResponseStatsRegistry$

2017-02-04 Thread Marco Mistroni
Hi
 not sure if this will help at all, and please take it with a pinch of salt as
I don't have your setup and I am not running on a cluster.

 I have tried to run a kafka example which was originally working on spark
1.6.1 on spark 2.
These are the jars i am using

spark-streaming-kafka-0-10_2.11_2.0.1.jar
kafka_2.11-0.10.1.1


And here's the code up to the creation of the Direct Stream. Apparently,
with the new version of the kafka libs some properties have to be specified


import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.storage.StorageLevel

import java.util.regex.Pattern
import java.util.regex.Matcher

import Utilities._

import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
// Note: kafka.serializer.StringDecoder belongs to the old 0-8 API and is not needed with kafka010

/** Working example of listening for log data from Kafka's testLogs topic
on port 9092. */
object KafkaExample {

  def main(args: Array[String]) {

// Create the context with a 1 second batch size
val ssc = new StreamingContext("local[*]", "KafkaExample", Seconds(1))

setupLogging()

    // Construct a regular expression (regex) to extract fields from raw Apache log lines
    val pattern = apacheLogPattern()

    // The 0-10 integration needs bootstrap servers and key/value deserializers spelled out.
    val kafkaParams = Map(
      "metadata.broker.list" -> "localhost:9092",
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
      "group.id" -> "group1")

    val topics = List("testLogs").toSet
    val lines = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    ).map(cr => cr.value())
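
For completeness (this part is not in the snippet above and is only a sketch of
how it might continue), the stream still has to be consumed and the streaming
context started:

    lines.print()   // or apply the apacheLogPattern() parsing per record here
    ssc.start()
    ssc.awaitTermination()
  }
}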

hth

 marco












On Sat, Feb 4, 2017 at 8:33 PM, Mich Talebzadeh 
wrote:

> I am getting this error with Spark 2, whereas it works with CDH 5.5.1 (Spark
> 1.5).
>
> Admittedly I am messing around with Spark-shell. However, I am surprised
> that this does not work with Spark 2 and is OK with CDH 5.5.1 (Spark 1.5)
>
> scala> val dstream = KafkaUtils.createDirectStream[String, String,
> StringDecoder, StringDecoder](streamingContext, kafkaParams, topics)
>
> java.lang.NoClassDefFoundError: Could not initialize class kafka.consumer.
> FetchRequestAndResponseStatsRegistry$
>   at kafka.consumer.SimpleConsumer.(SimpleConsumer.scala:39)
>   at org.apache.spark.streaming.kafka.KafkaCluster.connect(
> KafkaCluster.scala:52)
>   at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$
> org$apache$spark$streaming$kafka$KafkaCluster$$withBrokers$1.apply(
> KafkaCluster.scala:345)
>   at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$
> org$apache$spark$streaming$kafka$KafkaCluster$$withBrokers$1.apply(
> KafkaCluster.scala:342)
>   at scala.collection.IndexedSeqOptimized$class.
> foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at org.apache.spark.streaming.kafka.KafkaCluster.org$apache$
> spark$streaming$kafka$KafkaCluster$$withBrokers(KafkaCluster.scala:342)
>   at org.apache.spark.streaming.kafka.KafkaCluster.getPartitionMetadata(
> KafkaCluster.scala:125)
>   at org.apache.spark.streaming.kafka.KafkaCluster.
> getPartitions(KafkaCluster.scala:112)
>   at org.apache.spark.streaming.kafka.KafkaUtils$.
> getFromOffsets(KafkaUtils.scala:211)
>   at org.apache.spark.streaming.kafka.KafkaUtils$.
> createDirectStream(KafkaUtils.scala:484)
>   ... 74 elided
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Mismatched datatype in Case statement

2017-02-04 Thread Aviral Agarwal
Hi,
I was trying Spark version 1.6.0 when I ran into the error mentioned in the
following Hive JIRA.
https://issues.apache.org/jira/browse/HIVE-5825
This error was there in both cases : either using SQLContext or HiveContext.

Any indication if this has been fixed in a higher spark version ? If yes,
which version ?

Thanks and Regards,
Aviral Agarwal


Spark streaming: Could not initialize class kafka.consumer.FetchRequestAndResponseStatsRegistry$

2017-02-04 Thread Mich Talebzadeh
I am getting this error with Spark 2, whereas it works with CDH 5.5.1 (Spark
1.5).

Admittedly I am messing around with Spark-shell. However, I am surprised
that this does not work with Spark 2 and is OK with CDH 5.5.1 (Spark 1.5)

scala> val dstream = KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](streamingContext, kafkaParams, topics)

java.lang.NoClassDefFoundError: Could not initialize class
kafka.consumer.FetchRequestAndResponseStatsRegistry$
  at kafka.consumer.SimpleConsumer.(SimpleConsumer.scala:39)
  at
org.apache.spark.streaming.kafka.KafkaCluster.connect(KafkaCluster.scala:52)
  at
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$org$apache$spark$streaming$kafka$KafkaCluster$$withBrokers$1.apply(KafkaCluster.scala:345)
  at
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$org$apache$spark$streaming$kafka$KafkaCluster$$withBrokers$1.apply(KafkaCluster.scala:342)
  at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
  at org.apache.spark.streaming.kafka.KafkaCluster.org
$apache$spark$streaming$kafka$KafkaCluster$$withBrokers(KafkaCluster.scala:342)
  at
org.apache.spark.streaming.kafka.KafkaCluster.getPartitionMetadata(KafkaCluster.scala:125)
  at
org.apache.spark.streaming.kafka.KafkaCluster.getPartitions(KafkaCluster.scala:112)
  at
org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:211)
  at
org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
  ... 74 elided


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Debasish Das
Except of course lda, als and neural net models...for them the model needs to
be either prescored and cached on a kv store or the matrices / graph should
be kept on a kv store to access them using a REST API to serve the
output...for neural net its more fun since its a distributed or local graph
over which tensorflow compute needs to run...

In trapezium we support writing these models to stores like cassandra and
lucene, for example, and then provide a config-driven akka-http based API to
add the business logic to access these models from a store and expose the
model serving as a REST endpoint

Matrix, graph and kernel models we use a lot, and for them it turned out that
mllib-style model predict was useful if we change the underlying store...
On Feb 4, 2017 9:37 AM, "Debasish Das"  wrote:

> If we expose an API to access the raw models out of PipelineModel can't we
> call predict directly on it from an API ? Is there a task open to expose
> the model out of PipelineModel so that predict can be called on it...there
> is no dependency of spark context in ml model...
> On Feb 4, 2017 9:11 AM, "Aseem Bansal"  wrote:
>
>>
>>- In Spark 2.0 there is a class called PipelineModel. I know that the
>>title says pipeline but it is actually talking about PipelineModel trained
>>via using a Pipeline.
>>- Why PipelineModel instead of pipeline? Because usually there is a
>>series of stuff that needs to be done when doing ML which warrants an
>>ordered sequence of operations. Read the new spark ml docs or one of the
>>databricks blogs related to spark pipelines. If you have used python's
>>sklearn library the concept is inspired from there.
>>- "once model is deserialized as ml model from the store of choice
>>within ms" - The timing of loading the model was not what I was
>>referring to when I was talking about timing.
>>- "it can be used on incoming features to score through
>>spark.ml.Model predict API". The predict API is in the old mllib package
>>not the new ml package.
>>- "why r we using dataframe and not the ML model directly from API" -
>>Because as of now the new ml package does not have the direct API.
>>
>>
>> On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das 
>> wrote:
>>
>>> I am not sure why I will use pipeline to do scoring...idea is to build a
>>> model, use model ser/deser feature to put it in the row or column store of
>>> choice and provide a api access to the model...we support these primitives
>>> in github.com/Verizon/trapezium...the api has access to spark context
>>> in local or distributed mode...once model is deserialized as ml model from
>>> the store of choice within ms, it can be used on incoming features to score
>>> through spark.ml.Model predict API...I am not clear on 2200x speedup...why
>>> r we using dataframe and not the ML model directly from API ?
>>> On Feb 4, 2017 7:52 AM, "Aseem Bansal"  wrote:
>>>
 Does this support Java 7?
 What is your timezone in case someone wanted to talk?

 On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins 
 wrote:

> Hey Aseem,
>
> We have built pipelines that execute several string indexers, one hot
> encoders, scaling, and a random forest or linear regression at the end.
> Execution time for the linear regression was on the order of 11
> microseconds, a bit longer for random forest. This can be further 
> optimized
> by using row-based transformations if your pipeline is simple to around 
> 2-3
> microseconds. The pipeline operated on roughly 12 input features, and by
> the time all the processing was done, we had somewhere around 1000 
> features
> or so going into the linear regression after one hot encoding and
> everything else.
>
> Hope this helps,
> Hollin
>
> On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal 
> wrote:
>
>> Does this support Java 7?
>>
>> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal 
>> wrote:
>>
>>> Is computational time for predictions on the order of few
>>> milliseconds (< 10 ms) like the old mllib library?
>>>
>>> On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins 
>>> wrote:
>>>
 Hey everyone,


 Some of you may have seen Mikhail and I talk at Spark/Hadoop
 Summits about MLeap and how you can use it to build production services
 from your Spark-trained ML pipelines. MLeap is an open-source 
 technology
 that allows Data Scientists and Engineers to deploy Spark-trained ML
 Pipelines and Models to a scoring engine instantly. The MLeap execution
 engine has no dependencies on a Spark context and the serialization 
 format
 is entirely based on Protobuf 3 and JSON.


Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Debasish Das
If we expose an API to access the raw models out of PipelineModel, can't we
call predict directly on it from an API? Is there a task open to expose
the model out of PipelineModel so that predict can be called on it...there
is no dependency on the spark context in an ml model...
On Feb 4, 2017 9:11 AM, "Aseem Bansal"  wrote:

>
>- In Spark 2.0 there is a class called PipelineModel. I know that the
>title says pipeline but it is actually talking about PipelineModel trained
>via using a Pipeline.
>- Why PipelineModel instead of pipeline? Because usually there is a
>series of stuff that needs to be done when doing ML which warrants an
>ordered sequence of operations. Read the new spark ml docs or one of the
>databricks blogs related to spark pipelines. If you have used python's
>sklearn library the concept is inspired from there.
>- "once model is deserialized as ml model from the store of choice
>within ms" - The timing of loading the model was not what I was
>referring to when I was talking about timing.
>- "it can be used on incoming features to score through spark.ml.Model
>predict API". The predict API is in the old mllib package not the new ml
>package.
>- "why r we using dataframe and not the ML model directly from API" -
>Because as of now the new ml package does not have the direct API.
>
>
> On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das 
> wrote:
>
>> I am not sure why I will use pipeline to do scoring...idea is to build a
>> model, use model ser/deser feature to put it in the row or column store of
>> choice and provide a api access to the model...we support these primitives
>> in github.com/Verizon/trapezium...the api has access to spark context in
>> local or distributed mode...once model is deserialized as ml model from the
>> store of choice within ms, it can be used on incoming features to score
>> through spark.ml.Model predict API...I am not clear on 2200x speedup...why
>> r we using dataframe and not the ML model directly from API ?
>> On Feb 4, 2017 7:52 AM, "Aseem Bansal"  wrote:
>>
>>> Does this support Java 7?
>>> What is your timezone in case someone wanted to talk?
>>>
>>> On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins 
>>> wrote:
>>>
 Hey Aseem,

 We have built pipelines that execute several string indexers, one hot
 encoders, scaling, and a random forest or linear regression at the end.
 Execution time for the linear regression was on the order of 11
 microseconds, a bit longer for random forest. This can be further optimized
 by using row-based transformations if your pipeline is simple to around 2-3
 microseconds. The pipeline operated on roughly 12 input features, and by
 the time all the processing was done, we had somewhere around 1000 features
 or so going into the linear regression after one hot encoding and
 everything else.

 Hope this helps,
 Hollin

 On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal 
 wrote:

> Does this support Java 7?
>
> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal 
> wrote:
>
>> Is computational time for predictions on the order of few
>> milliseconds (< 10 ms) like the old mllib library?
>>
>> On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins 
>> wrote:
>>
>>> Hey everyone,
>>>
>>>
>>> Some of you may have seen Mikhail and I talk at Spark/Hadoop Summits
>>> about MLeap and how you can use it to build production services from 
>>> your
>>> Spark-trained ML pipelines. MLeap is an open-source technology that 
>>> allows
>>> Data Scientists and Engineers to deploy Spark-trained ML Pipelines and
>>> Models to a scoring engine instantly. The MLeap execution engine has no
>>> dependencies on a Spark context and the serialization format is entirely
>>> based on Protobuf 3 and JSON.
>>>
>>>
>>> The recent 0.5.0 release provides serialization and inference
>>> support for close to 100% of Spark transformers (we don’t yet support 
>>> ALS
>>> and LDA).
>>>
>>>
>>> MLeap is open-source, take a look at our Github page:
>>>
>>> https://github.com/combust/mleap
>>>
>>>
>>> Or join the conversation on Gitter:
>>>
>>> https://gitter.im/combust/mleap
>>>
>>>
>>> We have a set of documentation to help get you started here:
>>>
>>> http://mleap-docs.combust.ml/
>>>
>>>
>>> We even have a set of demos, for training ML Pipelines and linear,
>>> logistic and random forest models:
>>>
>>> https://github.com/combust/mleap-demo
>>>
>>>
>>> Check out our latest MLeap-serving Docker image, which allows you to
>>> expose a REST interface to your Spark ML pipeline models:
>>>
>>> 

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Aseem Bansal
   - In Spark 2.0 there is a class called PipelineModel. I know that the
   title says pipeline, but it is actually talking about a PipelineModel trained
   via a Pipeline.
   - Why PipelineModel instead of Pipeline? Because usually there is a
   series of steps that need to be done when doing ML, which warrants an
   ordered sequence of operations. Read the new spark ml docs or one of the
   databricks blogs related to spark pipelines. If you have used python's
   sklearn library, the concept is inspired by it.
   - "once model is deserialized as ml model from the store of choice
   within ms" - The timing of loading the model was not what I was
   referring to when I was talking about timing.
   - "it can be used on incoming features to score through spark.ml.Model
   predict API". The predict API is in the old mllib package, not the new ml
   package (see the sketch below).
   - "why r we using dataframe and not the ML model directly from API" -
   Because as of now the new ml package does not have the direct API.
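
For concreteness, here is a minimal sketch of that distinction (toy columns and
data, nothing from this thread, and a spark-shell style SparkSession named
spark is assumed): fit() on a Pipeline returns a PipelineModel, and with the
new ml package the only way to score, even for a single record, is transform()
on a DataFrame.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Fitting a Pipeline yields a PipelineModel (the stages here are illustrative).
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")
val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(assembler, lr))

val training = spark.createDataFrame(Seq(
  (1.0, 0.5, 1.2),
  (0.0, 1.5, 0.3)
)).toDF("label", "f1", "f2")

val pipelineModel = pipeline.fit(training)   // org.apache.spark.ml.PipelineModel

// No single-vector predict() exists in the new ml package, so scoring one
// record means wrapping it in a one-row DataFrame and calling transform().
val oneRow = spark.createDataFrame(Seq((0.7, 0.9))).toDF("f1", "f2")
pipelineModel.transform(oneRow).select("prediction").show()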


On Sat, Feb 4, 2017 at 10:24 PM, Debasish Das 
wrote:

> I am not sure why I will use pipeline to do scoring...idea is to build a
> model, use model ser/deser feature to put it in the row or column store of
> choice and provide a api access to the model...we support these primitives
> in github.com/Verizon/trapezium...the api has access to spark context in
> local or distributed mode...once model is deserialized as ml model from the
> store of choice within ms, it can be used on incoming features to score
> through spark.ml.Model predict API...I am not clear on 2200x speedup...why
> r we using dataframe and not the ML model directly from API ?
> On Feb 4, 2017 7:52 AM, "Aseem Bansal"  wrote:
>
>> Does this support Java 7?
>> What is your timezone in case someone wanted to talk?
>>
>> On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins 
>> wrote:
>>
>>> Hey Aseem,
>>>
>>> We have built pipelines that execute several string indexers, one hot
>>> encoders, scaling, and a random forest or linear regression at the end.
>>> Execution time for the linear regression was on the order of 11
>>> microseconds, a bit longer for random forest. This can be further optimized
>>> by using row-based transformations if your pipeline is simple to around 2-3
>>> microseconds. The pipeline operated on roughly 12 input features, and by
>>> the time all the processing was done, we had somewhere around 1000 features
>>> or so going into the linear regression after one hot encoding and
>>> everything else.
>>>
>>> Hope this helps,
>>> Hollin
>>>
>>> On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal 
>>> wrote:
>>>
 Does this support Java 7?

 On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal 
 wrote:

> Is computational time for predictions on the order of few milliseconds
> (< 10 ms) like the old mllib library?
>
> On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins 
> wrote:
>
>> Hey everyone,
>>
>>
>> Some of you may have seen Mikhail and I talk at Spark/Hadoop Summits
>> about MLeap and how you can use it to build production services from your
>> Spark-trained ML pipelines. MLeap is an open-source technology that 
>> allows
>> Data Scientists and Engineers to deploy Spark-trained ML Pipelines and
>> Models to a scoring engine instantly. The MLeap execution engine has no
>> dependencies on a Spark context and the serialization format is entirely
>> based on Protobuf 3 and JSON.
>>
>>
>> The recent 0.5.0 release provides serialization and inference support
>> for close to 100% of Spark transformers (we don’t yet support ALS and 
>> LDA).
>>
>>
>> MLeap is open-source, take a look at our Github page:
>>
>> https://github.com/combust/mleap
>>
>>
>> Or join the conversation on Gitter:
>>
>> https://gitter.im/combust/mleap
>>
>>
>> We have a set of documentation to help get you started here:
>>
>> http://mleap-docs.combust.ml/
>>
>>
>> We even have a set of demos, for training ML Pipelines and linear,
>> logistic and random forest models:
>>
>> https://github.com/combust/mleap-demo
>>
>>
>> Check out our latest MLeap-serving Docker image, which allows you to
>> expose a REST interface to your Spark ML pipeline models:
>>
>> http://mleap-docs.combust.ml/mleap-serving/
>>
>>
>> Several companies are using MLeap in production and even more are
>> currently evaluating it. Take a look and tell us what you think! We hope 
>> to
>> talk with you soon and welcome feedback/suggestions!
>>
>>
>> Sincerely,
>>
>> Hollin and Mikhail
>>
>
>

>>>
>>


Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Debasish Das
I am not sure why I would use a pipeline to do scoring...the idea is to build a
model, use the model ser/deser feature to put it in the row or column store of
choice and provide API access to the model...we support these primitives
in github.com/Verizon/trapezium...the api has access to the spark context in
local or distributed mode...once the model is deserialized as an ml model from the
store of choice within ms, it can be used on incoming features to score
through the spark.ml.Model predict API...I am not clear on the 2200x speedup...why
are we using a dataframe and not the ML model directly from the API?
On Feb 4, 2017 7:52 AM, "Aseem Bansal"  wrote:

> Does this support Java 7?
> What is your timezone in case someone wanted to talk?
>
> On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins  wrote:
>
>> Hey Aseem,
>>
>> We have built pipelines that execute several string indexers, one hot
>> encoders, scaling, and a random forest or linear regression at the end.
>> Execution time for the linear regression was on the order of 11
>> microseconds, a bit longer for random forest. This can be further optimized
>> by using row-based transformations if your pipeline is simple to around 2-3
>> microseconds. The pipeline operated on roughly 12 input features, and by
>> the time all the processing was done, we had somewhere around 1000 features
>> or so going into the linear regression after one hot encoding and
>> everything else.
>>
>> Hope this helps,
>> Hollin
>>
>> On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal 
>> wrote:
>>
>>> Does this support Java 7?
>>>
>>> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal 
>>> wrote:
>>>
 Is computational time for predictions on the order of few milliseconds
 (< 10 ms) like the old mllib library?

 On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins 
 wrote:

> Hey everyone,
>
>
> Some of you may have seen Mikhail and I talk at Spark/Hadoop Summits
> about MLeap and how you can use it to build production services from your
> Spark-trained ML pipelines. MLeap is an open-source technology that allows
> Data Scientists and Engineers to deploy Spark-trained ML Pipelines and
> Models to a scoring engine instantly. The MLeap execution engine has no
> dependencies on a Spark context and the serialization format is entirely
> based on Protobuf 3 and JSON.
>
>
> The recent 0.5.0 release provides serialization and inference support
> for close to 100% of Spark transformers (we don’t yet support ALS and 
> LDA).
>
>
> MLeap is open-source, take a look at our Github page:
>
> https://github.com/combust/mleap
>
>
> Or join the conversation on Gitter:
>
> https://gitter.im/combust/mleap
>
>
> We have a set of documentation to help get you started here:
>
> http://mleap-docs.combust.ml/
>
>
> We even have a set of demos, for training ML Pipelines and linear,
> logistic and random forest models:
>
> https://github.com/combust/mleap-demo
>
>
> Check out our latest MLeap-serving Docker image, which allows you to
> expose a REST interface to your Spark ML pipeline models:
>
> http://mleap-docs.combust.ml/mleap-serving/
>
>
> Several companies are using MLeap in production and even more are
> currently evaluating it. Take a look and tell us what you think! We hope 
> to
> talk with you soon and welcome feedback/suggestions!
>
>
> Sincerely,
>
> Hollin and Mikhail
>


>>>
>>
>


Re: How to checkpoint and RDD after a stage and before reaching an action?

2017-02-04 Thread Koert Kuipers
this is a general problem with checkpoint, one of the least understood
operations i think.

checkpoint is lazy (meaning it doesn't start until there is an action) and
asynchronous (meaning when it does start it is its own computation). so
basically with a checkpoint the rdd always gets computed twice.

i think the only useful pattern for checkpoint is to always persist/cache
right before the checkpoint. so:
rdd.persist(...).checkpoint()
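
as a concrete sketch of that pattern applied to the kind of job quoted below
(toy data and paths are made up, and the explicit count() is just one way to
force the checkpoint to run before the expensive join):

import org.apache.spark.storage.StorageLevel

spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")   // hypothetical directory

// toy stand-ins for the loadData1 / loadData2 steps in the quoted pseudo-code
val rdd1 = spark.sparkContext.parallelize(Seq((1, "a"), (2, "b"), (1, "c")))
  .groupByKey()
  .filter { case (_, vs) => vs.nonEmpty }
  .persist(StorageLevel.MEMORY_AND_DISK)   // cache right before checkpointing

rdd1.checkpoint()   // lazy: only marks rdd1 for checkpointing
rdd1.count()        // first action: fills the cache and writes the checkpoint
                    // files now, reading from the cache instead of recomputing

val rdd2 = spark.sparkContext.parallelize(Seq((1, 10), (2, 20)))
rdd1.join(rdd2).saveAsObjectFile("/tmp/join-output")   // hypothetical output path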

On Sat, Feb 4, 2017 at 4:11 AM, leo9r  wrote:

> Hi,
>
> I have a 1-action job (saveAsObjectFile at the end), that includes several
> stages. One of those stages is an expensive join "rdd1.join(rdd2)". I would
> like to checkpoint rdd1 right before the join to improve the stability of
> the job. However, what I'm seeing is that the job gets executed all the way
> to the end (saveAsObjectFile) without doing any checkpointing, and then
> re-running the computation to checkpoint rdd1 (when I see the files saved to
> the checkpoint directory). I have no issue with recomputing, given that I'm
> not caching rdd1, but the fact that the checkpointing of rdd1 happens after
> the join brings no benefit because the whole DAG is executed in one piece
> and the job fails. If that is actually what is happening, what would be the
> best approach to solve this?
> What I'm currently doing is to manually save rdd1 to HDFS right after the
> filter in line (4) and then load it back right before the join in line
> (11).
> That prevents the job from failing by splitting it into 2 jobs (ie. 2
> actions). My expectation was that rdd1.checkpoint in line (8) was going to
> have the same effect but without the hassle of manually saving and loading
> intermediate files.
>
> ///
>
> (1)   val rdd1 = loadData1
> (2) .map
> (3) .groupByKey
> (4) .filter
> (5)
> (6)   val rdd2 = loadData2
> (7)
> (8)   rdd1.checkpoint()
> (9)
> (10)  rdd1
> (11).join(rdd2)
> (12).saveAsObjectFile(...)
>
> /
>
> Thanks in advance,
> Leo
>
>
>
> --
> View this message in context: http://apache-spark-
> developers-list.1001551.n3.nabble.com/How-to-checkpoint-
> and-RDD-after-a-stage-and-before-reaching-an-action-tp20852.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-04 Thread Aseem Bansal
Does this support Java 7?
What is your timezone in case someone wanted to talk?

On Fri, Feb 3, 2017 at 10:23 PM, Hollin Wilkins  wrote:

> Hey Aseem,
>
> We have built pipelines that execute several string indexers, one hot
> encoders, scaling, and a random forest or linear regression at the end.
> Execution time for the linear regression was on the order of 11
> microseconds, a bit longer for random forest. If your pipeline is simple, this
> can be further optimized to around 2-3 microseconds by using row-based
> transformations. The pipeline operated on roughly 12 input features, and by
> the time all the processing was done, we had somewhere around 1000 features
> or so going into the linear regression after one hot encoding and
> everything else.
>
> Hope this helps,
> Hollin
>
> On Fri, Feb 3, 2017 at 4:05 AM, Aseem Bansal  wrote:
>
>> Does this support Java 7?
>>
>> On Fri, Feb 3, 2017 at 5:30 PM, Aseem Bansal 
>> wrote:
>>
>>> Is computational time for predictions on the order of few milliseconds
>>> (< 10 ms) like the old mllib library?
>>>
>>> On Thu, Feb 2, 2017 at 10:12 PM, Hollin Wilkins 
>>> wrote:
>>>
 Hey everyone,


 Some of you may have seen Mikhail and I talk at Spark/Hadoop Summits
 about MLeap and how you can use it to build production services from your
 Spark-trained ML pipelines. MLeap is an open-source technology that allows
 Data Scientists and Engineers to deploy Spark-trained ML Pipelines and
 Models to a scoring engine instantly. The MLeap execution engine has no
 dependencies on a Spark context and the serialization format is entirely
 based on Protobuf 3 and JSON.


 The recent 0.5.0 release provides serialization and inference support
 for close to 100% of Spark transformers (we don’t yet support ALS and LDA).


 MLeap is open-source, take a look at our Github page:

 https://github.com/combust/mleap


 Or join the conversation on Gitter:

 https://gitter.im/combust/mleap


 We have a set of documentation to help get you started here:

 http://mleap-docs.combust.ml/


 We even have a set of demos, for training ML Pipelines and linear,
 logistic and random forest models:

 https://github.com/combust/mleap-demo


 Check out our latest MLeap-serving Docker image, which allows you to
 expose a REST interface to your Spark ML pipeline models:

 http://mleap-docs.combust.ml/mleap-serving/


 Several companies are using MLeap in production and even more are
 currently evaluating it. Take a look and tell us what you think! We hope to
 talk with you soon and welcome feedback/suggestions!


 Sincerely,

 Hollin and Mikhail

>>>
>>>
>>
>


Re: specifying schema on dataframe

2017-02-04 Thread Sam Elamin
Hi Direceu

Thanks, you're right! That did work.


But now I'm facing an even bigger problem, since I don't have access to change
the underlying data; I just want to apply a schema over something that was
written via sparkContext.newAPIHadoopRDD.

Basically I am reading in an RDD[JsonObject] and would like to convert it
into a dataframe which I pass the schema into.

What's the best way to do this?

I doubt removing all the quotes in the JSON is the best solution, is it?
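
One thing I am considering (a sketch only: jsonRdd stands in for the
RDD[JsonObject], and it assumes the objects render back to JSON text via
toString or the library's own serializer) is to let Spark infer the schema
as-is and cast afterwards, instead of pushing DoubleType into the reader:

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// jsonRdd is assumed: the RDD[JsonObject] produced by newAPIHadoopRDD
val jsonStrings = jsonRdd.map(_.toString)

// Infer everything as it appears (customerid comes back as a string), then
// cast the quoted number to the type we actually want.
val raw = spark.read.json(jsonStrings)
val typed = raw.withColumn("customerid", col("customerid").cast(DoubleType))

typed.printSchema()
typed.show(1)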

Regards
Sam

On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
dirceu.semigh...@gmail.com> wrote:

> Hi Sam
> Remove the " from the number that it will work
>
> Em 4 de fev de 2017 11:46 AM, "Sam Elamin" 
> escreveu:
>
>> Hi All
>>
>> I would like to specify a schema when reading from a json but when trying
>> to map a number to a Double it fails, I tried FloatType and IntType with no
>> joy!
>>
>>
>> When inferring the schema customer id is set to String, and I would like
>> to cast it as Double
>>
>> so df1 is corrupted while df2 shows
>>
>>
>> Also FYI I need this to be generic as I would like to apply it to any
>> json, I specified the below schema as an example of the issue I am facing
>>
>> import org.apache.spark.sql.types.{BinaryType, StringType, StructField, 
>> DoubleType,FloatType, StructType, LongType,DecimalType}
>> val testSchema = StructType(Array(StructField("customerid",DoubleType)))
>> val df1 = 
>> spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>> val df2 = 
>> spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>> df1.show(1)
>> df2.show(1)
>>
>>
>> Any help would be appreciated, I am sure I am missing something obvious
>> but for the life of me I cant tell what it is!
>>
>>
>> Kind Regards
>> Sam
>>
>


Re: specifying schema on dataframe

2017-02-04 Thread Dirceu Semighini Filho
Hi Sam
Remove the " from the number that it will work

Em 4 de fev de 2017 11:46 AM, "Sam Elamin" 
escreveu:

> Hi All
>
> I would like to specify a schema when reading from a json but when trying
> to map a number to a Double it fails, I tried FloatType and IntType with no
> joy!
>
>
> When inferring the schema customer id is set to String, and I would like
> to cast it as Double
>
> so df1 is corrupted while df2 shows
>
>
> Also FYI I need this to be generic as I would like to apply it to any
> json, I specified the below schema as an example of the issue I am facing
>
> import org.apache.spark.sql.types.{BinaryType, StringType, StructField, 
> DoubleType,FloatType, StructType, LongType,DecimalType}
> val testSchema = StructType(Array(StructField("customerid",DoubleType)))
> val df1 = 
> spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
> val df2 = 
> spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
> df1.show(1)
> df2.show(1)
>
>
> Any help would be appreciated, I am sure I am missing something obvious
> but for the life of me I cant tell what it is!
>
>
> Kind Regards
> Sam
>


specifying schema on dataframe

2017-02-04 Thread Sam Elamin
Hi All

I would like to specify a schema when reading from a json but when trying
to map a number to a Double it fails, I tried FloatType and IntType with no
joy!


When inferring the schema customer id is set to String, and I would like to
cast it as Double

so df1 is corrupted while df2 shows the data


Also FYI I need this to be generic as I would like to apply it to any json,
I specified the below schema as an example of the issue I am facing

import org.apache.spark.sql.types.{BinaryType, StringType,
StructField, DoubleType,FloatType, StructType, LongType,DecimalType}
val testSchema = StructType(Array(StructField("customerid",DoubleType)))
val df1 = 
spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
val df2 = spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
df1.show(1)
df2.show(1)


Any help would be appreciated, I am sure I am missing something obvious but
for the life of me I cant tell what it is!


Kind Regards
Sam


Re: NoNodeAvailableException (None of the configured nodes are available) error when trying to push data to Elastic from a Spark job

2017-02-04 Thread Jacek Laskowski
Hi,

I'd say the error says it all :

Caused by: NoNodeAvailableException[None of the configured nodes are
available: [{#transport#-1}{XX.XXX.XXX.XX}{XX.XXX.XXX.XX:9300}]]

Jacek

On 3 Feb 2017 7:58 p.m., "Anastasios Zouzias"  wrote:

Hi there,

Are you sure that the cluster nodes where the executors run have network
connectivity to the elastic cluster?

Speaking of which, why don't you use: https://github.com/
elastic/elasticsearch-hadoop#apache-spark ?

Cheers,
Anastasios
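
For reference, the native integration that link describes looks roughly like
the sketch below (host, index name and documents are made up, and the
elasticsearch-spark artifact matching your Spark/Scala version has to be on
the classpath):

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._   // adds saveToEs() to RDDs

val conf = new SparkConf()
  .setAppName("es-write-example")
  .set("es.nodes", "es-host-1")   // hypothetical ES host reachable from the executors
  .set("es.port", "9200")         // the connector talks REST on 9200, not the 9300 transport port
val sc = new SparkContext(conf)

val docs = sc.parallelize(Seq(
  Map("id" -> 1, "body" -> "first document"),
  Map("id" -> 2, "body" -> "second document")
))
docs.saveToEs("myindex/mytype")   // hypothetical index/type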

On Fri, Feb 3, 2017 at 7:10 PM, Dmitry Goldenberg 
wrote:

> Hi,
>
> Any reason why we might be getting this error?  The code seems to work
> fine in the non-distributed mode but the same code when run from a Spark
> job is not able to get to Elastic.
>
> Spark version: 2.0.1 built for Hadoop 2.4, Scala 2.11
> Elastic version: 2.3.1
>
> I've verified the Elastic hosts and the cluster name.
>
> The spot in the code where this happens is:
>
>  ClusterHealthResponse clusterHealthResponse = client.admin().cluster()
>
>   .prepareHealth()
>
>   .setWaitForGreenStatus()
>
>   .setTimeout(TimeValue.*timeValueSeconds*(10))
>
>   .get();
>
>
> Stack trace:
>
>
> Driver stacktrace:
>
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$
> scheduler$DAGScheduler$$failJobAndIndependentStages(DAGSched
> uler.scala:1454)
>
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$
> 1.apply(DAGScheduler.scala:1442)
>
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$
> 1.apply(DAGScheduler.scala:1441)
>
> at scala.collection.mutable.ResizableArray$class.foreach(Resiza
> bleArray.scala:59)
>
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.sca
> la:48)
>
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGSchedu
> ler.scala:1441)
>
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskS
> etFailed$1.apply(DAGScheduler.scala:811)
>
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskS
> etFailed$1.apply(DAGScheduler.scala:811)
>
> at scala.Option.foreach(Option.scala:257)
>
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(
> DAGScheduler.scala:811)
>
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOn
> Receive(DAGScheduler.scala:1667)
>
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onRe
> ceive(DAGScheduler.scala:1622)
>
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onRe
> ceive(DAGScheduler.scala:1611)
>
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.
> scala:632)
>
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1890)
>
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903)
>
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1916)
>
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1930)
>
> at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(R
> DD.scala:902)
>
> at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(R
> DD.scala:900)
>
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperati
> onScope.scala:151)
>
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperati
> onScope.scala:112)
>
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
>
> at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:900)
>
> at org.apache.spark.api.java.JavaRDDLike$class.foreachPartition
> (JavaRDDLike.scala:218)
>
> at org.apache.spark.api.java.AbstractJavaRDDLike.foreachPartiti
> on(JavaRDDLike.scala:45)
>
> at com.myco.MyDriver$3.call(com.myco.MyDriver.java:214)
>
> at com.myco.MyDriver$3.call(KafkaSparkStreamingDriver.java:201)
>
> at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun
> $foreachRDD$1.apply(JavaDStreamLike.scala:272)
>
> at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun
> $foreachRDD$1.apply(JavaDStreamLike.scala:272)
>
> at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachR
> DD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
>
> at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachR
> DD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
>
> at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1
> $$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
>
> at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1
> $$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
>
> at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1
> $$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
>
> at org.apache.spark.streaming.dstream.DStream.createRDDWithLoca
> 

NullPointerException while joining two avro Hive tables

2017-02-04 Thread Понькин Алексей

Hi,

I have a table in Hive(data is stored as avro files).
Using python spark shell I am trying to join two datasets

events = spark.sql('select * from mydb.events')

intersect = events.where('attr2 in (5,6,7) and attr1 in (1,2,3)')
intersect.count()

But I am constantly receiving the following

java.lang.NullPointerException
at 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories(AvroObjectInspectorGenerator.java:142)
at 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:91)
at 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:104)
at 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspector(AvroObjectInspectorGenerator.java:83)
at 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.(AvroObjectInspectorGenerator.java:56)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:124)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$10.apply(TableReader.scala:251)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$10.apply(TableReader.scala:239)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:103)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Using Spark 2.0.0.2.5.0.0-1245

Any help will be appreciated


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: java.lang.NoSuchMethodError: scala.runtime.ObjectRef.zero()Lscala/runtime/ObjectRef

2017-02-04 Thread Sam Elamin
Hi sathyanarayanan


zero() on scala.runtime.VolatileObjectRef was introduced in Scala 2.11.
You probably have a library compiled against Scala 2.11 and running on a
Scala 2.10 runtime.

See

v2.10:
https://github.com/scala/scala/blob/2.10.x/src/library/scala/runtime/VolatileObjectRef.java
v2.11:
https://github.com/scala/scala/blob/2.11.x/src/library/scala/runtime/VolatileObjectRef.java
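
A minimal build.sbt sketch of keeping everything on one Scala line (the
versions are illustrative; pick the Spark and connector releases that match
whichever Scala version you standardize on):

// %% appends the Scala binary suffix (_2.10 or _2.11) automatically, so the
// runtime and every dependency end up compiled against the same Scala version.
scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-core"                % "1.6.3",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.0"
)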

Regards
Sam

On Sat, 4 Feb 2017 at 09:24, sathyanarayanan mudhaliyar <
sathyanarayananmudhali...@gmail.com> wrote:

> Hi ,
> I got the error below when executed
>
> Exception in thread "main" java.lang.NoSuchMethodError:
> scala.runtime.ObjectRef.zero()Lscala/runtime/ObjectRef;
>
> error in detail:
>
> Exception in thread "main" java.lang.NoSuchMethodError:
> scala.runtime.ObjectRef.zero()Lscala/runtime/ObjectRef;
> at
> com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala)
> at
> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:149)
> at
> com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:149)
> at
> com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:31)
> at
> com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:56)
> at
> com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:82)
> at com.nwf.Consumer.main(Consumer.java:63)
>
> code :
>
> Consumer consumer = new Consumer();
> SparkConf conf = new
> SparkConf().setAppName("kafka-sandbox").setMaster("local[2]");
> conf.set("spark.cassandra.connection.host", "localhost"); //connection
> for cassandra database
> JavaSparkContext sc = new JavaSparkContext(conf);
> CassandraConnector connector = CassandraConnector.apply(sc.getConf());
> final Session session = connector.openSession();
> final PreparedStatement prepared = session.prepare("INSERT INTO
> spark_test5.messages JSON?");
>
>
> The error is in the line which is in green color.
> Thank you guys.
>
>


java.lang.NoSuchMethodError: scala.runtime.ObjectRef.zero()Lscala/runtime/ObjectRef

2017-02-04 Thread sathyanarayanan mudhaliyar
Hi,
I got the error below when I executed the code:

Exception in thread "main" java.lang.NoSuchMethodError:
scala.runtime.ObjectRef.zero()Lscala/runtime/ObjectRef;

error in detail:

Exception in thread "main" java.lang.NoSuchMethodError:
scala.runtime.ObjectRef.zero()Lscala/runtime/ObjectRef;
at
com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala)
at
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:149)
at
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$3.apply(CassandraConnector.scala:149)
at
com.datastax.spark.connector.cql.RefCountedCache.createNewValueAndKeys(RefCountedCache.scala:31)
at
com.datastax.spark.connector.cql.RefCountedCache.acquire(RefCountedCache.scala:56)
at
com.datastax.spark.connector.cql.CassandraConnector.openSession(CassandraConnector.scala:82)
at com.nwf.Consumer.main(Consumer.java:63)

code :

Consumer consumer = new Consumer();
SparkConf conf = new
SparkConf().setAppName("kafka-sandbox").setMaster("local[2]");
conf.set("spark.cassandra.connection.host", "localhost"); //connection
for cassandra database
JavaSparkContext sc = new JavaSparkContext(conf);
CassandraConnector connector = CassandraConnector.apply(sc.getConf());
final Session session = connector.openSession();
final PreparedStatement prepared = session.prepare("INSERT INTO
spark_test5.messages JSON?");


The error is in the line which is in green color.
Thank you guys.


Re: Is DoubleWritable and DoubleObjectInspector doing the same thing in Hive UDF?

2017-02-04 Thread Alex
Hi,

Please Reply?

On Fri, Feb 3, 2017 at 8:19 PM, Alex  wrote:

> Hi,
>
> can you guys tell me if the below two pieces of code are returning the same
> thing?
>
> (((DoubleObjectInspector) ins2).get(obj)); and ((DoubleWritable)obj).get();
> from the below two codes
>
>
> code 1)
>
> public Object get(Object name) {
>   int pos = getPos((String)name);
>   if(pos<0) return null;
>   String f = "string";
>   Object obj= list.get(pos);
>   if(obj==null) return null;
>   ObjectInspector ins = ((StructField)colnames.get(pos
> )).getFieldObjectInspector();
>   if(ins!=null) f = ins.getTypeName();
>   switch (f) {
> case "double" :  return ((DoubleWritable)obj).get();
> case "bigint" :  return ((LongWritable)obj).get();
> case "string" :  return ((Text)obj).toString();
> default  :  return obj;
>   }
> }
>
>
> Code 2)
>
> public Object get(Object name) {
>     int pos = getPos((String) name);
>     if (pos < 0)
>         return null;
>     String f = "string";
>     String f1 = "string";
>     Object obj = list.get(pos);
>     if (obj == null)
>         return null;
>     ObjectInspector ins = ((StructField) colnames.get(pos)).getFieldObjectInspector();
>     if (ins != null)
>         f = ins.getTypeName();
>     PrimitiveObjectInspector ins2 = (PrimitiveObjectInspector) ins;
>     f1 = ins2.getPrimitiveCategory().name();
>     switch (ins2.getPrimitiveCategory()) {
>     case DOUBLE:
>         return (((DoubleObjectInspector) ins2).get(obj));
>     case LONG:
>         return (((LongObjectInspector) ins2).get(obj));
>     case STRING:
>         return (((StringObjectInspector) ins2).getPrimitiveJavaObject(obj)).toString();
>     default:
>         return obj;
>     }
> }
>


Re: spark architecture question -- Pleas Read

2017-02-04 Thread Mich Talebzadeh
Ingesting from Hive tables back into Oracle: what mechanisms are in place
to ensure that data ends up consistently in the Oracle table and that Spark is
notified when Oracle has issues with the ingested data (say, a rollback)?
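
For context, the write itself would look something like the sketch below
(connection details, table names and credentials are placeholders); since
Spark's JDBC writer commits each partition separately, a failed job can leave
partial rows behind, which is what motivates the consistency and rollback
question, and loading a staging table that Oracle then promotes is one common
way around it.

import java.util.Properties
import org.apache.spark.sql.SaveMode

// Placeholder connection details; substitute your own host/service/credentials.
val url = "jdbc:oracle:thin:@//oracle-host:1521/ORCLPDB"
val props = new Properties()
props.setProperty("user", "app_user")
props.setProperty("password", "app_password")
props.setProperty("driver", "oracle.jdbc.OracleDriver")

// Read the Hive table through the SparkSession, then load a staging table in
// Oracle; the database side can swap the staging table in atomically.
val df = spark.table("mydb.my_hive_table")
df.write.mode(SaveMode.Overwrite).jdbc(url, "STAGING_MY_TABLE", props)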

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 29 January 2017 at 22:22, Jörn Franke  wrote:

> You can use HDFS, S3, Azure, glusterfs, Ceph, ignite (in-memory) ... a
> Spark cluster itself does not store anything, it just processes.
>
> On 29 Jan 2017, at 15:37, Alex  wrote:
>
> But for persistence after intermediate processing can I use the spark cluster
> itself or do I have to use a hadoop cluster?!
>
> On Jan 29, 2017 7:36 PM, "Deepak Sharma"  wrote:
>
> The better way is to read the data directly into spark using spark sql
> read jdbc .
> Apply the udf's locally .
> Then save the data frame back to Oracle using dataframe's write jdbc.
>
> Thanks
> Deepak
>
> On Jan 29, 2017 7:15 PM, "Jörn Franke"  wrote:
>
>> One alternative could be the oracle Hadoop loader and other Oracle
>> products, but you have to invest some money and probably buy their Hadoop
>> Appliance, which you have to evaluate if it make sense (can get expensive
>> with large clusters etc).
>>
>> Another alternative would be to get rid of Oracle alltogether and use
>> other databases.
>>
>> However, can you elaborate a little bit on your use case and the business
>> logic as well as SLA requires. Otherwise all recommendations are right
>> because the requirements you presented are very generic.
>>
>> About get rid of Hadoop - this depends! You will need some resource
>> manager (yarn, mesos, kubernetes etc) and most likely also a distributed
>> file system. Spark supports through the Hadoop apis a wide range of file
>> systems, but does not need HDFS for persistence. You can have local
>> filesystem (ie any file system mounted to a node, so also distributed ones,
>> such as zfs), cloud file systems (s3, azure blob etc).
>>
>>
>>
>> On 29 Jan 2017, at 11:18, Alex  wrote:
>>
>> Hi All,
>>
>> Thanks for your response .. Please find below flow diagram
>>
>> Please help me out simplifying this architecture using Spark
>>
>> 1) Can I skip step 1 to step 4 and directly store it in spark?
>> If I am storing it in spark, where actually is it getting stored?
>> Do I need to retain Hadoop to store data,
>> or can I directly store it in spark and remove hadoop also?
>>
>> I want to remove informatica for preprocessing and directly load the
>> files' data coming from the server to Hadoop/Spark.
>>
>> So my question is: can I directly load the files' data to spark? Then where
>> exactly will the data get stored? Do I need to have Spark installed on top
>> of HDFS?
>>
>> 2) If I am retaining the below architecture, can I store the output from
>> spark directly back to oracle from step 5 to step 7,
>>
>> and will spark's way of storing it back to oracle be better than using
>> sqoop performance-wise?
>> 3) Can I use Spark scala UDFs to process data from hive and retain the entire
>> architecture?
>>
>> which among the above would be optimal
>>
>> [image: Inline image 1]
>>
>> On Sat, Jan 28, 2017 at 10:38 PM, Sachin Naik 
>> wrote:
>>
>>> I strongly agree with Jorn and Russell. There are different solutions
>>> for data movement depending upon your needs frequency, bi-directional
>>> drivers. workflow, handling duplicate records. This is a space is known as
>>> " Change Data Capture - CDC" for short. If you need more information, I
>>> would be happy to chat with you.  I built some products in this space that
>>> extensively used connection pooling over ODBC/JDBC.
>>>
>>> Happy to chat if you need more information.
>>>
>>> -Sachin Naik
>>>
>>> >> Hard to tell. Can you give more insights on what you try to achieve
>>> >> and what the data is about?
>>> >>For example, depending on your use case sqoop can make sense or not.
>>> Sent from my iPhone
>>>
>>> On Jan 27, 2017, at 11:22 PM, Russell Spitzer 
>>> wrote:
>>>
>>> You can treat Oracle as a JDBC source (http://spark.apache.org/docs/
>>> latest/sql-programming-guide.html#jdbc-to-other-databases) and skip
>>> Sqoop, HiveTables and go straight to Queries. Then you can skip hive on the
>>> way back out (see the same link) and write directly to Oracle. I'll leave
>>> the performance questions for someone else.
>>>
>>> On Fri, Jan 27, 2017 at 11:06 PM Sirisha