Re: [Arrow][Dremio]

2018-05-14 Thread Pierce Lamb
Hi Xavier,

Along the lines of connecting to multiple sources of data and replacing ETL
tools, you may want to check out Confluent's blog on building a real-time
streaming ETL pipeline on Kafka, as well as SnappyData's blog on Real-Time
Streaming ETL with SnappyData, where Spark is central to connecting to
multiple data sources, executing SQL on streams, etc. These should provide
nice comparisons to your ideas about Dremio + Spark as ETL tools.

Disclaimer: I am a SnappyData employee

Hope this helps,

Pierce

On Mon, May 14, 2018 at 2:24 AM, xmehaut  wrote:

> Hi Michaël,
>
> I'm not an expert on Dremio; I'm just trying to evaluate the potential of
> this technology, what impact it could have on Spark, how the two can work
> together, and how Spark could make further use of Arrow internally alongside
> its existing algorithms.
>
> Dremio already has a fairly rich API set that gives access to, for instance,
> metadata and SQL queries, and even lets you create virtual datasets
> programmatically. It also has a lot of predefined functions, and I imagine
> there will be more and more functions in the future, e.g. machine learning
> functions like the ones we find in Azure SQL Server, which let you mix SQL
> and ML functions. Access to Dremio is made through JDBC, and we can imagine
> accessing virtual datasets through Spark and dynamically creating new
> datasets from the API, connected to Parquet files stored dynamically by
> Spark on HDFS, Azure Data Lake or S3... Of course a tighter integration
> between the two would be better, with a Spark read/write connector to
> Dremio :)
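For illustration, reading a Dremio virtual dataset into Spark over plain JDBC
might look roughly like the sketch below. The driver class, connection URL and
dataset path are assumptions rather than anything confirmed in this thread, so
treat them as placeholders.

import org.apache.spark.sql.SparkSession

object DremioJdbcSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dremio-jdbc-read").getOrCreate()

    // Read a Dremio virtual dataset as a plain JDBC source.
    val dremioDf = spark.read
      .format("jdbc")
      .option("driver", "com.dremio.jdbc.Driver")            // assumed driver class
      .option("url", "jdbc:dremio:direct=dremio-host:31010") // assumed URL form
      .option("dbtable", "myspace.\"my_virtual_dataset\"")   // assumed dataset path
      .option("user", "dremio_user")
      .option("password", "dremio_password")
      .load()

    // From here it is a normal DataFrame: join it with other sources, or write
    // it back out as Parquet on HDFS / Azure Data Lake / S3 as described above.
    dremioDf.write.mode("overwrite").parquet("hdfs:///tmp/dremio_export")

    spark.stop()
  }
}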
>
> regards
> xavier
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Streaming Analytics/BI tool to connect Spark SQL

2017-12-07 Thread Pierce Lamb
Hi Umar,

While this answer is a bit dated, you may find it useful in evaluating a
store for Spark SQL tables:

https://stackoverflow.com/a/39753976/3723346

I don't know much about Pentaho or Arcadia, but I assume many of the listed
options have a JDBC or ODBC client.
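
If the tables live in a long-running Spark application, one common pattern is
to expose them over JDBC/ODBC with the Spark Thrift Server so a BI tool can
query them directly. A rough sketch (table names and paths are made up for
illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object BiJdbcEndpoint {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bi-jdbc-endpoint")
      .enableHiveSupport()
      .getOrCreate()

    // Register whatever results the dashboard should see.
    spark.read.parquet("hdfs:///data/events")
      .createOrReplaceTempView("events_latest")

    // Start an embedded HiveServer2-compatible endpoint inside this application;
    // BI tools then connect with a Hive/Spark JDBC or ODBC driver, e.g.
    // jdbc:hive2://<driver-host>:10000
    HiveThriftServer2.startWithContext(spark.sqlContext)
  }
}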

Hope this helps,

Pierce

On Thu, Dec 7, 2017 at 10:27 AM, umargeek 
wrote:

> Hi All,
>
> We are currently looking at real-time streaming analytics of data stored as
> Spark SQL tables. Is there any external connectivity available to connect
> with BI tools (Pentaho/Arcadia)?
>
> Currently we are storing data in Hive tables, but the response time on the
> Arcadia dashboard is slow.
>
> Looking for suggestions on whether to move away from Hive, whether there is
> connectivity for Spark SQL, or whether to move to Ignite.
>
> Thanks,
> Umar
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Update MySQL table via Spark/SparkR?

2017-08-22 Thread Pierce Lamb
Hi Jake,

There is another option among the 3rd-party projects in the Spark database
ecosystem that have combined Spark with a DBMS in such a way that the
DataFrame API has been extended to include UPDATE operations.
However, in your case you would have to move away from MySQL in order to
use this API.

Best,

Pierce

On Tue, Aug 22, 2017 at 7:54 AM, Jake Russ 
wrote:

> Hi Mich,
>
>
>
> Thank you for the explanation, that makes sense, and is helpful for me to
> understand the bigger picture between Spark/RDBMS.
>
>
>
> Happy to know I’m already following best practice.
>
>
>
> Cheers,
>
>
>
> Jake
>
>
>
> *From: *Mich Talebzadeh 
> *Date: *Monday, August 21, 2017 at 6:44 PM
> *To: *Jake Russ 
> *Cc: *"user@spark.apache.org" 
> *Subject: *Re: Update MySQL table via Spark/SparkR?
>
>
>
> Hi Jake,
>
> This is an issue across all RDBMSs, including Oracle etc. When you are
> updating you have to commit or roll back in the RDBMS itself, and I am not
> aware of Spark doing that.
>
> The staging table is a safer method as it follows an ETL-type approach. You
> create new data in the staging table in the RDBMS and do the DML in the
> RDBMS itself, where you can control commit or rollback. That is the way I
> would do it. A simple shell script can do both.
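
A minimal sketch of that staging-table pattern from the Spark side is below.
The connection details, table and column names are illustrative assumptions;
the point is that Spark only appends to the staging table, while the UPDATE
itself runs (and commits or rolls back) inside MySQL.

import java.sql.DriverManager
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

object StagingTableUpdate {
  val url = "jdbc:mysql://db-host:3306/mydb"
  val props = new Properties()
  props.setProperty("user", "spark_user")
  props.setProperty("password", "secret")

  def writeAndUpdate(results: DataFrame): Unit = {
    // 1. Spark writes the computed results into a staging table (append/overwrite only).
    results.write.mode(SaveMode.Overwrite).jdbc(url, "results_staging", props)

    // 2. The UPDATE runs on the MySQL side, where commit/rollback is controlled.
    val conn = DriverManager.getConnection(url, props)
    try {
      conn.setAutoCommit(false)
      val stmt = conn.createStatement()
      stmt.executeUpdate(
        """UPDATE target t
          |JOIN results_staging s ON t.id = s.id
          |SET t.score = s.score""".stripMargin)
      conn.commit()
    } catch {
      case e: Exception => conn.rollback(); throw e
    } finally {
      conn.close()
    }
  }
}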
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn  
> *https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
> On 21 August 2017 at 15:50, Jake Russ  wrote:
>
> Hi everyone,
>
>
>
> I’m currently using SparkR to read data from a MySQL database, perform
> some calculations, and then write the results back to MySQL. Is it still
> true that Spark does not support UPDATE queries via JDBC? I’ve seen many
> posts on the internet saying that Spark's DataFrameWriter does not support
> UPDATE queries via JDBC; it will only
> "append" or "overwrite" to existing tables. The best advice I've found so
> far, for performing this update, is to write to a staging table in MySQL and
> then perform the UPDATE query on the MySQL side.
>
>
>
> Ideally, I’d like to handle the update during the write operation. Has
> anyone else encountered this limitation and have a better solution?
>
>
>
> Thank you,
>
>
>
> Jake
>
>
>


Re: using Kudu with Spark

2017-07-24 Thread Pierce Lamb
Hi Mich,

I tried to compile a list of datastores that connect to Spark and provide a
bit of context. The list may help you in your research:

https://stackoverflow.com/a/39753976/3723346

I'm going to add Kudu, Druid and Ampool from this thread.

I'd like to point out SnappyData as an option you should try.
SnappyData provides many of the features you've discussed (columnar
storage, replication, in-place updates, etc.) while also integrating the
datastore with Spark directly. That is, there is no "connector" to go through
for database operations; Spark and the datastore share the same JVM and
block manager. Thus, if performance is one of your concerns, this should
give you some of the best performance in this area.

Hope this helps,

Pierce

On Mon, Jul 24, 2017 at 10:02 AM, Mich Talebzadeh  wrote:

> now they are bringing up Ampool with spark for real time analytics
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 24 July 2017 at 11:15, Mich Talebzadeh 
> wrote:
>
>> sounds like Druid can do the same?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 24 July 2017 at 08:38, Mich Talebzadeh 
>> wrote:
>>
>>> Yes, this storage layer is something I have been investigating in my own
>>> lab for mixed loads such as a Lambda Architecture.
>>>
>>>
>>> It offers the convenience of a columnar RDBMS (much like Sybase IQ). Kudu
>>> tables look like those in SQL relational databases, each with a primary key
>>> made up of one or more columns that enforces uniqueness and acts as an index
>>> for efficient updates and deletes. Data is partitioned into what are known
>>> as tablets, which make up tables. Kudu replicates these tablets to other
>>> nodes for redundancy.
>>>
>>> As you said, there are a number of options. Kudu also claims in-place
>>> updates, which need to be tried for consistency.
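
For reference, reading a Kudu table from Spark with the kudu-spark integration
looks roughly like this (the master address and table name are placeholders,
and the package coordinates depend on your Spark/Scala versions):

import org.apache.kudu.spark.kudu._
import org.apache.spark.sql.SparkSession

object KuduReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kudu-read").getOrCreate()

    // Point the reader at the Kudu master and the table to scan.
    val df = spark.read
      .options(Map(
        "kudu.master" -> "kudu-master:7051",
        "kudu.table"  -> "impala::default.my_table"))
      .format("org.apache.kudu.spark.kudu")
      .load()

    df.createOrReplaceTempView("my_table")
    spark.sql("SELECT key, count(*) AS cnt FROM my_table GROUP BY key").show()
  }
}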
>>>
>>> Cheers
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 24 July 2017 at 08:30, Jörn Franke  wrote:
>>>
 I guess you have to find out yourself with experiments. Cloudera has
 some benchmarks, but it always depends on what you test, your data volume and
 what is meant by "fast". It is also more than a file format, with servers
 that communicate with each other etc. - more complexity.
 Of course there are alternatives that you could benchmark against, such
 as Apache HAWQ (which is basically Postgres on Hadoop), Apache Ignite or,
 depending on your analysis, even Flink or Spark Streaming.

 On 24. Jul 2017, at 09:25, Mich Talebzadeh 
 wrote:

 hi,

 Has anyone had experience of using Kudu for faster analytics with Spark?

 How efficient is it compared to using HBase and other traditional
 storage for fast-changing data, please?

 Any insight will be appreciated.

 Thanks

 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility f

Re: "Sharing" dataframes...

2017-06-21 Thread Pierce Lamb
Hi Jean,

Since many in this thread have mentioned datastores from what I would call
the "Spark datastore ecosystem" I thought I would link you to a
StackOverflow answer I posted awhile back that tried to capture the
majority of this ecosystem. Most would claim to allow you to do something
like you're describing in your original email once connected to Spark:

https://stackoverflow.com/questions/39650298/how-to-save-insert-each-dstream-into-a-permanent-table/39753976#39753976

Regarding Rick Moritz's reply, SnappyData, a member of this ecosystem,
avoids the latency-intensive serialization steps he describes by
integrating the database and Spark such that they use the same JVM/block
manager (you can think of it as an in-memory SQL database replacing Spark's
native cache).

Hope this helps,

Pierce

On Wed, Jun 21, 2017 at 8:29 AM, Gene Pang  wrote:

> Hi Jean,
>
> As others have mentioned, you can use Alluxio with Spark DataFrames to
> keep the data in memory, and for other jobs to read them from memory again.
>
> Hope this helps,
> Gene
>
> On Wed, Jun 21, 2017 at 8:08 AM, Jean Georges Perrin  wrote:
>
>> I have looked at Livy in the (very recent) past and it will not do
>> the trick for me. It seems pretty greedy in terms of resources (or at least
>> that was our experience). I will investigate how job-server could do the
>> trick.
>>
>> (on a side note I tried to find a paper on memory lifecycle within Spark
>> but was not very successful, maybe someone has a link to spare.)
>>
>> My need is to keep one/several dataframes in memory (well, within Spark)
>> so it/they can be reused at a later time, without persisting it/them to
>> disk (unless Spark wants to, of course).
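
Within a single long-running application/SparkContext, that can be as simple
as caching plus (global) temp views, e.g. a rough sketch with made-up paths
and names:

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("shared-dataframes").getOrCreate()

val dfA = spark.read.parquet("hdfs:///data/setA")
val dfB = spark.read.parquet("hdfs:///data/setB")

// Pin the results in memory (spilling to disk only if Spark must).
dfA.persist(StorageLevel.MEMORY_AND_DISK)
dfB.persist(StorageLevel.MEMORY_AND_DISK)

// Make them addressable by name from any later job in the same application.
dfA.createGlobalTempView("result_a")
dfB.createGlobalTempView("result_b")

// A later "program C" submitted to the same context combines them:
val combined = spark.sql(
  "SELECT * FROM global_temp.result_a a JOIN global_temp.result_b b ON a.id = b.id")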
>>
>>
>>
>> On Jun 21, 2017, at 10:47 AM, Michael Mior  wrote:
>>
>> This is a puzzling suggestion to me. It's unclear what features the OP
>> needs, so it's really hard to say whether Livy or job-server aren't
>> sufficient. It's true that neither are particularly mature, but they're
>> much more mature than a homemade project which hasn't started yet.
>>
>> That said, I'm not very familiar with either project, so perhaps there
>> are some big concerns I'm not aware of.
>>
>> --
>> Michael Mior
>> mm...@apache.org
>>
>> 2017-06-21 3:19 GMT-04:00 Rick Moritz :
>>
>>> Keeping it inside the same program/SparkContext is the most performant
>>> solution, since you can avoid serialization and deserialization.
>>> In-memory persistence between jobs involves a memcopy, uses a lot of RAM
>>> and invokes serialization and deserialization. Technologies that can help
>>> you do that easily are Ignite (as mentioned) but also Alluxio, Cassandra
>>> with in-memory tables and a memory-backed HDFS directory (see tiered
>>> storage).
>>> Although Livy and job-server provide the functionality of exposing a
>>> single SparkContext to multiple programs, I would recommend you build your
>>> own framework for integrating different jobs, since many features you may
>>> need aren't present yet, while others may cause issues due to lack of
>>> maturity. Artificially splitting jobs is in general a bad idea, since it
>>> breaks the DAG and thus prevents some potential push-down optimizations.
>>>
>>> On Tue, Jun 20, 2017 at 10:17 PM, Jean Georges Perrin 
>>> wrote:
>>>
 Thanks Vadim & Jörn... I will look into those.

 jg

 On Jun 20, 2017, at 2:12 PM, Vadim Semenov 
 wrote:

 You can launch one permanent spark context and then execute your jobs
 within the context. And since they'll be running in the same context, they
 can share data easily.

 These two projects provide the functionality that you need:
 https://github.com/spark-jobserver/spark-jobserver#persistent-context-mode---faster--required-for-related-jobs
 https://github.com/cloudera/livy#post-sessions

 On Tue, Jun 20, 2017 at 1:46 PM, Jean Georges Perrin 
 wrote:

> Hey,
>
> Here is my need: program A does something on a set of data and
> produces results, program B does that on another set, and finally, program
> C combines the data of A and B. Of course, the easy way is to dump all on
> disk after A and B are done, but I wanted to avoid this.
>
> I was thinking of creating a temp view, but I do not really like the
> temp aspect of it ;). Any ideas? (They are all worth sharing.)
>
> jg
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


>>>
>>
>>
>


Re: [Spark Streaming] Dynamic Broadcast Variable Update

2017-05-05 Thread Pierce Lamb
Hi Nipun,

To expand a bit, you might find this stackoverflow answer useful:

http://stackoverflow.com/a/39753976/3723346

Most spark + database combinations can handle a use case like this.

Hope this helps,

Pierce

On Thu, May 4, 2017 at 9:18 AM, Gene Pang  wrote:

> As Tim pointed out, Alluxio (renamed from Tachyon) may be able to help
> you. Here is some documentation on how to run Alluxio and Spark together,
> and here is a blog post on a Spark Streaming + Alluxio use case.
>
> Hope that helps,
> Gene
>
> On Tue, May 2, 2017 at 11:56 AM, Nipun Arora 
> wrote:
>
>> Hi All,
>>
>> To support our Spark Streaming based anomaly detection tool, we have made
>> a patch in Spark 1.6.2 to dynamically update broadcast variables.
>>
>> I'll first explain our use case, which I believe should be common to
>> several people using Spark Streaming applications. Broadcast variables are
>> often used to store values such as machine learning models, which can then
>> be used on streaming data to "test" and get the desired results (in our
>> case, anomalies). Unfortunately, in current Spark, broadcast variables are
>> final and can only be initialized once, before the initialization of the
>> streaming context. Hence, if a new model is learned, the streaming system
>> cannot be updated without shutting down the application, broadcasting
>> again, and restarting the application. Our goal was to re-broadcast
>> variables without requiring any downtime of the streaming service.
>>
>> The key to this implementation is a live re-broadcastVariable()
>> interface, which can be triggered in between micro-batch executions,
>> without any reboot required for the streaming application. At a high level,
>> the task is done by re-fetching broadcast variable information from the
>> Spark driver and then re-distributing it to the workers. The micro-batch
>> execution is blocked while the update is made, by taking a lock on the
>> execution. We have already tested this in our prototype deployment of our
>> anomaly detection service and can successfully re-broadcast the broadcast
>> variables with no downtime.
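
For comparison, a common workaround on unmodified Spark (not the patch
described above) is to wrap the broadcast in a driver-side singleton that is
unpersisted and re-broadcast between micro-batches; a minimal sketch, with the
model type chosen arbitrarily:

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

object ModelBroadcast {
  @volatile private var instance: Broadcast[Map[String, Double]] = _

  // Lazily create the broadcast the first time it is needed.
  def get(sc: SparkContext, load: () => Map[String, Double]): Broadcast[Map[String, Double]] = {
    if (instance == null) synchronized {
      if (instance == null) instance = sc.broadcast(load())
    }
    instance
  }

  // Call from the driver between micro-batches (e.g. inside foreachRDD) when a
  // new model is available; subsequent batches see the new value.
  def update(sc: SparkContext, newModel: Map[String, Double]): Unit = synchronized {
    if (instance != null) instance.unpersist(blocking = false)
    instance = sc.broadcast(newModel)
  }
}

Unlike the patch described above, this relies on batch boundaries for
consistency rather than blocking the micro-batch during the update.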
>>
>> We would like to integrate these changes into Spark. Can anyone please let
>> me know the process for submitting patches/new features to Spark? Also, I
>> understand that the current version of Spark is 2.1. However, our changes
>> have been done and tested on Spark 1.6.2; will this be a problem?
>>
>> Thanks
>> Nipun
>>
>
>


Re: Spark Streaming. Real-time save data and visualize on dashboard

2017-04-11 Thread Pierce Lamb
Hi,

It is possible to use Mongo or Cassandra to persist results from Spark. In
fact, a wide variety of data stores are available to use with Spark and
many are aimed at serving queries for dashboard visualizations. I cannot
comment on which work well with Grafana or Kibana; however, I've listed
(with links) a majority of the data stores that have an existing connector
or integration with Spark here:

http://stackoverflow.com/a/39753976/3723346
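
As one concrete example, persisting each micro-batch to Cassandra with the
DataStax spark-cassandra-connector looks roughly like the sketch below
(keyspace, table, host names and the input parsing are placeholders); the
dashboard then reads from Cassandra rather than from Spark itself.

import com.datastax.spark.connector._
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SensorToCassandra {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("sensor-to-cassandra")
      .set("spark.cassandra.connection.host", "cassandra-host")
    val ssc = new StreamingContext(conf, Seconds(5))

    // The real input is JSON over TCP; a CSV-style parse keeps the sketch short.
    val readings = ssc.socketTextStream("sensor-host", 9999)
      .map { line =>
        val Array(flightId, ts, acc) = line.split(",")
        (flightId, ts.toLong, acc.toDouble)
      }

    // Write each micro-batch to a Cassandra table for the dashboard to query.
    readings.foreachRDD { rdd =>
      rdd.saveToCassandra("telemetry", "accelerations",
        SomeColumns("flight_id", "ts", "acceleration"))
    }

    ssc.start()
    ssc.awaitTermination()
  }
}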

Hope this helps,

Pierce

On Tue, Apr 11, 2017 at 7:35 AM, tencas  wrote:

> I've developed an application using Apache Spark Streaming that reads
> simple info from plane sensors, like acceleration, via TCP sockets in JSON
> format, and analyses it.
>
> I'd like to be able to persist this info from each "flight" in real time,
> while it is shown on a responsive dashboard.
>
> I just don't know whether it is possible to use a NoSQL database like Mongo
> or Cassandra in combination with monitoring tools like Grafana or Kibana,
> and whether it makes sense.
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Spark-Streaming-Real-time-save-data-
> and-visualize-on-dashboard-tp28587.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Apache Drill vs Spark SQL

2017-04-07 Thread Pierce Lamb
Hi Kant,

If you are interested in using Spark alongside a database to serve real
time queries, there are many options. Almost every popular database has
built some sort of connector to Spark. I've listed a majority of them and
tried to delineate them in some way in this StackOverflow answer:

http://stackoverflow.com/a/39753976/3723346

As an employee of SnappyData,
I'm biased toward its solution, in which Spark and the database are deeply
integrated and run on the same JVM. But there are many options depending on
your needs.

I'm not sure if the above link also answers your second question, but there
are two graph databases listed that connect to Spark as well.

Hope this helps,

Pierce

On Thu, Apr 6, 2017 at 10:34 PM, kant kodali  wrote:

> Hi All,
>
> I am very impressed with the work done on Spark SQL however when I have to
> pick something to serve real time queries I am in a dilemma for the
> following reasons.
>
> 1. Even though Spark SQL has logical plans, physical plans, runtime
> code generation and all that, it still doesn't look like the tool to serve
> real-time queries the way we normally do from a database. I tend to think
> this is because the queries have to go through job submission first. I don't
> want to call this overhead or anything, but this is what it seems to do. By
> comparison, on the other hand, we have the data that we want to serve
> sitting in a database where we simply issue a SQL query and get the
> response back. So for this use case, what would be an appropriate tool? I
> tend to think it's Drill, but would like to hear if there are any
> interesting arguments.
>
> 2. I can see a case for Spark SQL, such as queries that need to be
> expressed in an iterative fashion, for example doing a graph traversal such
> as BFS or DFS, or even simple pre-order, in-order, or post-order traversals
> on a BST. All of this would be very hard to express in a declarative syntax
> like SQL. I also tend to think ad-hoc distributed joins (by ad-hoc I mean
> one is not certain about one's query patterns) are also better off
> expressed in map-reduce style than in SQL, unless one knows one's query
> patterns well enough ahead of time that the possibility of queries requiring
> redistribution is very low. I am also sure there are plenty of other cases
> where Spark SQL will excel, but I wanted to see what is a good choice to
> simply serve the data?
>
> Any suggestions are appreciated.
>
> Thanks!
>
>


Re: How best we can store streaming data on dashboards for real time user experience?

2017-03-30 Thread Pierce Lamb
SnappyData should work well for what you want; it deeply integrates an
in-memory database with Spark and supports ingesting streaming data and
concurrently querying it from a dashboard. SnappyData currently has an
integration with Apache Zeppelin (notebook visualization) and will soon
have one with Tableau. Finally, the database and Spark executors share the
same JVM in SnappyData, so you will get better performance than with any
database that works with Spark over a connector.

Repo: https://github.com/SnappyDataInc/snappydata

Using it with Zeppelin:
https://github.com/SnappyDataInc/snappydata/blob/master/docs/aqp_aws.md#using-apache-zeppelin

Hope this helps

On Thu, Mar 30, 2017 at 7:56 AM, Alonso Isidoro Roman 
wrote:

> You can check out this link if you want:
>
>
> Elastic, Kibana and Spark working together.
>
> Alonso Isidoro Roman
> [image: https://]about.me/alonso.isidoro.roman
>
> 
>
> 2017-03-30 16:25 GMT+02:00 Miel Hostens :
>
>> We're doing exactly same thing over here!
>>
>> Spark + ELK
>>
>> With kind regards,
>>
>> *Miel Hostens*,
>>
>> *B**uitenpraktijk+*
>> Department of Reproduction, Obstetrics and Herd Health
>> 
>>
>> Ambulatory Clinic
>> Faculty of Veterinary Medicine 
>> Ghent University  
>> Salisburylaan 133
>> 9820 Merelbeke
>> Belgium
>>
>>
>>
>> Tel:   +32 9 264 75 28 <+32%209%20264%2075%2028>
>>
>> Fax:  +32 9 264 77 97 <+32%209%20264%2077%2097>
>>
>> Mob: +32 478 59 37 03 <+32%20478%2059%2037%2003>
>>
>>  e-mail: miel.host...@ugent.be
>>
>> On 30 March 2017 at 15:01, Szuromi Tamás  wrote:
>>
>>> For us, after some Spark Streaming transformation, Elasticsearch +
>>> Kibana is a great combination to store and visualize data.
>>> An alternative solution that we use is Spark Streaming put some data
>>> back to Kafka and we consume it with nodejs.
>>>
>>> Cheers,
>>> Tamas
>>>
>>> 2017-03-30 9:25 GMT+02:00 Alonso Isidoro Roman :
>>>
 Read this first:

 http://www.oreilly.com/data/free/big-data-analytics-emerging
 -architecture.csp

 https://www.ijircce.com/upload/2015/august/97_A%20Study.pdf

 http://www.pentaho.com/assets/pdf/CqPxTROXtCpfoLrUi4Bj.pdf

 http://www.gartner.com/smarterwithgartner/six-best-practices
 -for-real-time-analytics/

 https://speakerdeck.com/elasticsearch/using-elasticsearch-lo
 gstash-and-kibana-to-create-realtime-dashboards

 https://www.youtube.com/watch?v=PuvHINcU9DI

 then take a look to

 https://kudu.apache.org/

 Tell us later what you think.




 Alonso Isidoro Roman
 [image: https://]about.me/alonso.isidoro.roman

 

 2017-03-30 7:14 GMT+02:00 Gaurav Pandya :

> Hi Noorul,
>
> Thanks for the reply.
> But then how to build the dashboard report? Don't we need to store
> data anywhere?
> Please suggest.
>
> Thanks.
> Gaurav
>
> On Thu, Mar 30, 2017 at 10:32 AM, Noorul Islam Kamal Malmiyoda <
> noo...@noorul.com> wrote:
>
>> I think a better place would be an in-memory cache for real time.
>>
>> Regards,
>> Noorul
>>
>> On Thu, Mar 30, 2017 at 10:31 AM, Gaurav1809 
>> wrote:
>> > I am getting streaming data and want to show it on dashboards in real
>> > time.
>> > May I know how best we can handle this streaming data? Where should we
>> > store it (DB or HDFS or ???)?
>> > I want to give users a real-time analytics experience.
>> >
>> > Please suggest possible ways. Thanks.
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/How-best-we-can-store-streaming-data-o
>> n-dashboards-for-real-time-user-experience-tp28548.html
>> > Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>> >
>> > 
>> -
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >
>>
>
>

>>>
>>
>


Re: Appropriate Apache Users List Uses

2016-02-09 Thread Pierce Lamb
I sent this mail. It was not automated or part of a mass email.

My apologies for misuse.

Pierce

On Tue, Feb 9, 2016 at 12:02 PM, u...@moosheimer.com 
wrote:

> I wouldn't expect this either.
> Very disappointing...
>
> -Kay-Uwe Moosheimer
>
> Am 09.02.2016 um 20:53 schrieb Ryan Victory :
>
> Yeah, a little disappointed with this, I wouldn't expect to be sent
> unsolicited mail based on my membership to this list.
>
> -Ryan Victory
>
> On Tue, Feb 9, 2016 at 1:36 PM, John Omernik  wrote:
>
>> All, I received this today, is this appropriate list use? Note: This was
>> unsolicited.
>>
>> Thanks
>> John
>>
>>
>>
>> From: Pierce Lamb 
>> 11:57 AM (1 hour ago)
>> to me
>>
>> Hi John,
>>
>> I saw you on the Spark Mailing List and noticed you worked for * and
>> wanted to reach out. My company, SnappyData, just launched an open source
>> OLTP + OLAP Database built on Spark. Our lead investor is Pivotal, whose
>> largest owner is EMC which makes * like a father figure :)
>>
>> SnappyData’s goal is twofold: operationalize Spark and deliver truly
>> interactive queries. To do this, we first integrated Spark with an
>> in-memory database with a pedigree of production customer deployments:
>> GemFireXD (GemXD).
>>
>> GemXD operationalized Spark via:
>>
>> -- True high availability
>>
>> -- A highly concurrent environment
>>
>> -- An OLTP engine that can process transactions (mutable state)
>>
>> With GemXD as a storage engine, we packaged SnappyData with Approximate
>> Query Processing (AQP) technology. AQP enables interactive response times
>> even when data volumes are huge because it allows the developer to trade
>> latency for accuracy. AQP queries (SQL queries with a specified error rate)
>> execute on sample tables -- tables that have taken a stratified sample of
>> the full dataset. As such, AQP queries enable much faster decisions when
>> 100% accuracy isn’t needed and sample tables require far fewer resources to
>> manage.
>>
>> If that sounds interesting to you, please check out our Github repo (our
>> release is hosted there under “releases”):
>>
>> https://github.com/SnappyDataInc/snappydata
>>
>> We also have a technical paper that dives into the architecture:
>> http://www.snappydata.io/snappy-industrial
>>
>> Are you currently using Spark at ? I’d love to set up a call with you
>> and hear about how you’re using it and see if SnappyData could be a fit.
>>
>> In addition to replying to this email, there are many ways to chat with
>> us: https://github.com/SnappyDataInc/snappydata#community-support
>>
>> Hope to hear from you,
>>
>> Pierce
>>
>> pl...@snappydata.io
>>
>> http://www.twitter.com/snappydata
>>
>
>


MLlib/kmeans newbie question(s)

2015-03-07 Thread Pierce Lamb
Hi all,

I'm very new to machine learning algorithms and Spark. I'm following the
Twitter Streaming Language Classifier found here:

http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/README.html

Specifically this code:

http://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/scala/src/main/scala/com/databricks/apps/twitter_classifier/ExamineAndTrain.scala

Except I'm trying to run it in batch mode on some tweets it pulls out
of Cassandra, in this case 200 total tweets.

As the example shows, I am using this object for "vectorizing" a set of tweets:

object Utils {
  val numFeatures = 1000
  val tf = new HashingTF(numFeatures)

  /**
   * Create feature vectors by turning each tweet into bigrams of
   * characters (an n-gram model) and then hashing those to a
   * length-1000 feature vector that we can pass to MLlib.
   * This is a common way to decrease the number of features in a
   * model while still getting excellent accuracy (otherwise every
   * pair of Unicode characters would potentially be a feature).
   */
  def featurize(s: String): Vector = {
    tf.transform(s.sliding(2).toSeq)
  }
}

Here is my code, which is modified from ExamineAndTrain.scala:

val noSets = rawTweets.map(set => set.mkString("\n"))

val vectors = noSets.map(Utils.featurize).cache()
vectors.count()

val numClusters = 5
val numIterations = 30

val model = KMeans.train(vectors, numClusters, numIterations)

for (i <- 0 until numClusters) {
  println(s"\nCLUSTER $i")
  noSets.foreach { t =>
    if (model.predict(Utils.featurize(t)) == 1) {
      println(t)
    }
  }
}

This code runs and each cluster header prints ("CLUSTER 0", "CLUSTER 1", etc.)
with nothing printing beneath. If I flip

model.predict(Utils.featurize(t)) == 1 to
model.predict(Utils.featurize(t)) == 0

the same thing happens, except every tweet is printed beneath every cluster.

Here is what I intuitively think is happening (please correct my
thinking if it's wrong): this code turns each tweet into a vector,
randomly picks some clusters, then runs kmeans to group the tweets (at
a really high level, the clusters, I assume, would be common
"topics"). As such, when it checks each tweet to see if model.predict
== 1, different sets of tweets should appear under each cluster (and
because it's checking the training set against itself, every tweet
should be in a cluster). Why isn't it doing this? Either my
understanding of what kmeans does is wrong, my training set is too
small, or I'm missing a step.
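
(One thing worth checking is the hard-coded 1 in the predicate; a per-cluster
filter would normally compare against the loop's cluster index, roughly like
this sketch.)

for (i <- 0 until numClusters) {
  println(s"\nCLUSTER $i")
  // Keep only the tweets whose predicted cluster is the one being printed.
  noSets.filter(t => model.predict(Utils.featurize(t)) == i)
        .foreach(println)
}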

Any help is greatly appreciated

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Help with updateStateByKey

2014-12-18 Thread Pierce Lamb
This produces the expected output, thank you!
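
For anyone skimming the thread, a minimal self-contained sketch of the update
function with the getOrElse fix applied (names and the tuple type follow the
snippets quoted below):

import org.apache.spark.streaming.dstream.DStream

object Sessionize {
  type Visit = (String, Long, Long) // (requestPage, startTime, endTime)

  // Append each batch's events to the accumulated Seq for the key, starting
  // from an empty Seq the first time the key is seen.
  def updateGroupByKey(newValues: Seq[Visit],
                       currentValue: Option[Seq[Visit]]): Option[Seq[Visit]] = {
    Some(currentValue.getOrElse(Seq.empty[Visit]) ++ newValues)
  }

  // ipTimeStamp is keyed by IP address; updateStateByKey also requires
  // ssc.checkpoint(...) to be set on the StreamingContext.
  def sessionize(ipTimeStamp: DStream[(String, Visit)]): DStream[(String, Seq[Visit])] =
    ipTimeStamp.updateStateByKey(updateGroupByKey)
}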

On Thu, Dec 18, 2014 at 12:11 PM, Silvio Fiorito
 wrote:
> Ok, I have a better idea of what you’re trying to do now.
>
> I think the prob might be the map. The first time the function runs,
> currentValue will be None. Using map on None returns None.
>
> Instead, try:
>
> Some(currentValue.getOrElse(Seq.empty) ++ newValues)
>
> I think that should give you the expected result.
>
>
> From: Pierce Lamb 
> Date: Thursday, December 18, 2014 at 2:31 PM
> To: Silvio Fiorito 
> Cc: "user@spark.apache.org" 
> Subject: Re: Help with updateStateByKey
>
> Hi Silvio,
>
> This is a great suggestion (I wanted to get rid of groupByKey), I have been
> trying to implement it this morning, but having some trouble. I would love
> to see your code for the function that goes inside updateStateByKey
>
> Here is my current code:
>
>  def updateGroupByKey( newValues: Seq[(String, Long, Long)],
>   currentValue: Option[Seq[(String, Long, Long)]]
>   ): Option[Seq[(String, Long, Long)]] = {
>
>   currentValue.map{ case (v) => v ++ newValues
>   }
> }
>
> val grouped = ipTimeStamp.updateStateByKey(updateGroupByKey)
>
>
> However, when I run it the grouped DStream doesn't get populated with
> anything. The issue is probably that currentValue is not actually an
> Option[Seq[triple]] but rather an Option[triple]. However if I change it to
> an Option[triple] then I have to also return an Option[triple] for
> updateStateByKey to compile, but I want that return value to be an
> Option[Seq[triple]] because ultimately i want the data to look like
> (IPaddress, Seq[(pageRequested, startTime, EndTime), (pageRequested,
> startTime, EndTime)...]) and have that Seq build over time
>
> Am I thinking about this wrong?
>
> Thank you
>
> On Thu, Dec 18, 2014 at 6:05 AM, Silvio Fiorito
>  wrote:
>>
>> Hi Pierce,
>>
>> You shouldn’t have to use groupByKey because updateStateByKey will get a
>> Seq of all the values for that key already.
>>
>> I used that for realtime sessionization as well. What I did was key my
>> incoming events, then send them to udpateStateByKey. The updateStateByKey
>> function then received a Seq of the events and the Option of the previous
>> state for that key. The sessionization code then did its thing to check if
>> the incoming events were part of the same session, based on a configured
>> timeout. If a session already was active (from the previous state) and it
>> hadn’t exceeded the timeout, it used that value. Otherwise it generated a
>> new session id. Then the return value for the updateStateByKey function
>> was a Tuple of session id and last timestamp.
>>
>> Then I joined the DStream with the session ids, which were both keyed off
>> the same id and continued my processing. Your requirements may be
>> different, but that’s what worked well for me.
>>
>> Another thing to consider is cleaning up old sessions by returning None in
>> the updateStateByKey function. This will help with long running apps and
>> minimize memory usage (and checkpoint size).
>>
>> I was using something similar to the method above on a live production
>> stream with very little CPU and memory footprint, running for weeks at a
>> time, processing up to 15M events per day with fluctuating traffic.
>>
>> Thanks,
>> Silvio
>>
>>
>>
>> On 12/17/14, 10:07 PM, "Pierce Lamb" 
>> wrote:
>>
>> >I am trying to run stateful Spark Streaming computations over (fake)
>> >apache web server logs read from Kafka. The goal is to "sessionize"
>> >the web traffic similar to this blog post:
>>
>> > >http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionizat
>> >ion-with-spark-streaming-and-apache-hadoop/
>> >
>> >The only difference is that I want to "sessionize" each page the IP
>> >hits, instead of the entire session. I was able to do this reading
>> >from a file of fake web traffic using Spark in batch mode, but now I
>> >want to do it in a streaming context.
>> >
>> >Log files are read from Kafka and parsed into K/V pairs of
>> >
>> >(String, (String, Long, Long)) or
>> >
>> >(IP, (requestPage, time, time))
>> >
>> >I then call "groupByKey()" on this K/V pair. In batch mode, this would
>> >produce a:
>> >
>> >(String, CollectionBuffer((String, Long, Long), ...) or
>> >
>> >(IP, CollectionBuffer((requ

Re: Help with updateStateByKey

2014-12-18 Thread Pierce Lamb
Hi Silvio,

This is a great suggestion (I wanted to get rid of groupByKey), I have been
trying to implement it this morning, but having some trouble. I would love
to see your code for the function that goes inside updateStateByKey

Here is my current code:

def updateGroupByKey(newValues: Seq[(String, Long, Long)],
                     currentValue: Option[Seq[(String, Long, Long)]]
                    ): Option[Seq[(String, Long, Long)]] = {
  currentValue.map { case (v) => v ++ newValues }
}

val grouped = ipTimeStamp.updateStateByKey(updateGroupByKey)


However, when I run it the grouped DStream doesn't get populated with
anything. The issue is probably that currentValue is not actually an
Option[Seq[triple]] but rather an Option[triple]. However if I change it to
an Option[triple] then I have to also return an Option[triple] for
updateStateByKey to compile, but I want that return value to be an
Option[Seq[triple]] because ultimately I want the data to look like
(IPaddress, Seq[(pageRequested, startTime, EndTime), (pageRequested,
startTime, EndTime)...]) and have that Seq build over time

Am I thinking about this wrong?

Thank you

On Thu, Dec 18, 2014 at 6:05 AM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:
>
> Hi Pierce,
>
> You shouldn’t have to use groupByKey because updateStateByKey will get a
> Seq of all the values for that key already.
>
> I used that for realtime sessionization as well. What I did was key my
> incoming events, then send them to udpateStateByKey. The updateStateByKey
> function then received a Seq of the events and the Option of the previous
> state for that key. The sessionization code then did its thing to check if
> the incoming events were part of the same session, based on a configured
> timeout. If a session already was active (from the previous state) and it
> hadn’t exceeded the timeout, it used that value. Otherwise it generated a
> new session id. Then the return value for the updateStateByKey function
> was a Tuple of session id and last timestamp.
>
> Then I joined the DStream with the session ids, which were both keyed off
> the same id and continued my processing. Your requirements may be
> different, but that’s what worked well for me.
>
> Another thing to consider is cleaning up old sessions by returning None in
> the updateStateByKey function. This will help with long running apps and
> minimize memory usage (and checkpoint size).
>
> I was using something similar to the method above on a live production
> stream with very little CPU and memory footprint, running for weeks at a
> time, processing up to 15M events per day with fluctuating traffic.
>
> Thanks,
> Silvio
>
>
>
> On 12/17/14, 10:07 PM, "Pierce Lamb" 
> wrote:
>
> >I am trying to run stateful Spark Streaming computations over (fake)
> >apache web server logs read from Kafka. The goal is to "sessionize"
> >the web traffic similar to this blog post:
> >
> http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionizat
> >ion-with-spark-streaming-and-apache-hadoop/
> >
> >The only difference is that I want to "sessionize" each page the IP
> >hits, instead of the entire session. I was able to do this reading
> >from a file of fake web traffic using Spark in batch mode, but now I
> >want to do it in a streaming context.
> >
> >Log files are read from Kafka and parsed into K/V pairs of
> >
> >(String, (String, Long, Long)) or
> >
> >(IP, (requestPage, time, time))
> >
> >I then call "groupByKey()" on this K/V pair. In batch mode, this would
> >produce a:
> >
> >(String, CollectionBuffer((String, Long, Long), ...) or
> >
> >(IP, CollectionBuffer((requestPage, time, time), ...)
> >
> >In a StreamingContext, it produces a:
> >
> >(String, ArrayBuffer((String, Long, Long), ...) like so:
> >
> >(183.196.254.131,ArrayBuffer((/test.php,1418849762000,1418849762000)))
> >
> >However, as the next microbatch (DStream) arrives, this information is
> >discarded. Ultimately what I want is for that ArrayBuffer to fill up
> >over time as a given IP continues to interact and to run some
> >computations on its data to "sessionize" the page time. I believe the
> >operator to make that happen is "updateStateByKey." I'm having some
> >trouble with this operator (I'm new to both Spark & Scala); any help
> >is appreciated.
> >
> >Thus far:
> >
> >val grouped =
> >ipTimeStamp.groupByKey().updateStateByKey(updateGroupByKey)
> >
> >
> >def updateGroupByKey(
> >  a: Seq[(String, ArrayBuffer[(String,
> >Long, Long)])],
> >  b: Option[(String, ArrayBuffer[(String,
> >Long, Long)])]
> >  ): Option[(String, ArrayBuffer[(String,
> >Long, Long)])] = {
> >
> >  }
> >
> >-
> >To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >For additional commands, e-mail: user-h...@spark.apache.org
> >
>


Help with updateStateByKey

2014-12-17 Thread Pierce Lamb
I am trying to run stateful Spark Streaming computations over (fake)
apache web server logs read from Kafka. The goal is to "sessionize"
the web traffic similar to this blog post:
http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/

The only difference is that I want to "sessionize" each page the IP
hits, instead of the entire session. I was able to do this reading
from a file of fake web traffic using Spark in batch mode, but now I
want to do it in a streaming context.

Log files are read from Kafka and parsed into K/V pairs of

(String, (String, Long, Long)) or

(IP, (requestPage, time, time))

I then call "groupByKey()" on this K/V pair. In batch mode, this would
produce a:

(String, CollectionBuffer((String, Long, Long), ...) or

(IP, CollectionBuffer((requestPage, time, time), ...)

In a StreamingContext, it produces a:

(String, ArrayBuffer((String, Long, Long), ...) like so:

(183.196.254.131,ArrayBuffer((/test.php,1418849762000,1418849762000)))

However, as the next microbatch (DStream) arrives, this information is
discarded. Ultimately what I want is for that ArrayBuffer to fill up
over time as a given IP continues to interact and to run some
computations on its data to "sessionize" the page time. I believe the
operator to make that happen is "updateStateByKey." I'm having some
trouble with this operator (I'm new to both Spark & Scala); any help
is appreciated.

Thus far:

val grouped = ipTimeStamp.groupByKey().updateStateByKey(updateGroupByKey)


def updateGroupByKey(
    a: Seq[(String, ArrayBuffer[(String, Long, Long)])],
    b: Option[(String, ArrayBuffer[(String, Long, Long)])]
  ): Option[(String, ArrayBuffer[(String, Long, Long)])] = {

}

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org