Re: Structured Streaming & Query Planning

2019-03-18 Thread Paolo Platter
I can understand that if you involve columns with variable distribution in join 
operations, it may change your execution plan, but most of the time this is not 
going to happen. In streaming, the most used operations are map, filter, 
grouping and stateful operations, and in all these cases I can't see how dynamic 
query planning could help.

It could be useful to have a parameter to force a streaming query to calculate 
the query plan just once.
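
For reference, the planning cost under discussion shows up in the streaming progress 
metrics. A minimal sketch of how to read it, assuming a running StreamingQuery named 
query (the variable name is hypothetical); durationMs reports the per-trigger timings 
Spark collected for the last trigger, including the queryPlanning entry:

// Assumes `query` is the StreamingQuery returned by writeStream...start().
val progress = query.lastProgress
if (progress != null) {
  val planningMs = progress.durationMs.get("queryPlanning")
  val triggerMs = progress.durationMs.get("triggerExecution")
  println(s"queryPlanning: $planningMs ms out of $triggerMs ms for the last trigger")
}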

Paolo



Get Outlook for Android<https://aka.ms/ghei36>


From: Alessandro Solimando 
Sent: Thursday, March 14, 2019 6:59:50 PM
To: Paolo Platter
Cc: user@spark.apache.org
Subject: Re: Structured Streaming & Query Planning

Hello Paolo,
generally speaking, query planning is mostly based on statistics and 
distributions of data values for the involved columns, which might 
significantly change over time in a streaming context, so for me it makes a lot 
of sense that it is run at every schedule, even though I understand your 
concern.

For the second question I don't know how to (or if you even can) cache the 
computed query plan.

If possible, would you mind sharing your findings afterwards? (Query planning 
on streaming is a very interesting and not yet sufficiently explored topic IMO.)

Best regards,
Alessandro

On Thu, 14 Mar 2019 at 16:51, Paolo Platter 
<paolo.plat...@agilelab.it> wrote:
Hi All,

I would like to understand why in a streaming query (which should not be able 
to change its behaviour across iterations) there is a queryPlanning-Duration 
effort (in my case it is 33% of the trigger interval) at every schedule. I don't 
understand why this is needed and whether it is possible to disable or cache it.

Thanks



Paolo Platter
CTO
E-mail:paolo.plat...@agilelab.it<mailto:paolo.plat...@agilelab.it>
Web Site:   www.agilelab.it<http://www.agilelab.it/>





Structured Streaming & Query Planning

2019-03-14 Thread Paolo Platter
Hi All,

I would like to understand why in a streaming query (which should not be able 
to change its behaviour across iterations) there is a queryPlanning-Duration 
effort (in my case it is 33% of the trigger interval) at every schedule. I don't 
understand why this is needed and whether it is possible to disable or cache it.

Thanks



Paolo Platter
CTO
E-mail:paolo.plat...@agilelab.it<mailto:paolo.plat...@agilelab.it>
Web Site:   www.agilelab.it<http://www.agilelab.it/>





R: How to reissue a delegated token after max lifetime passes for a spark streaming application on a Kerberized cluster

2019-01-03 Thread Paolo Platter
Hi,



Spark's default behaviour is to request a brand new token every 24 hours; it is 
not going to renew delegation tokens. This is the better approach for 
long-running applications like streaming ones.



In our use case, using keytab and principal is working fine with the 
hdfs_delegation_token but is NOT working with "kms-dt".



Does anyone know why this is happening? Any suggestions to make it work with 
KMS?



Thanks









Paolo Platter

CTO

E-mail:paolo.plat...@agilelab.it<mailto:paolo.plat...@agilelab.it>

Web Site:   www.agilelab.it<http://www.agilelab.it/>







From: Marcelo Vanzin 
Sent: Thursday, January 3, 2019 7:03:22 PM
To: alinazem...@gmail.com
Cc: user
Subject: Re: How to reissue a delegated token after max lifetime passes for a 
spark streaming application on a Kerberized cluster

If you are using the principal / keytab params, Spark should create
tokens as needed. If it's not, something else is going wrong, and only
looking at full logs for the app would help.
On Wed, Jan 2, 2019 at 5:09 PM Ali Nazemian  wrote:
>
> Hi,
>
> We are using a headless keytab to run our long-running spark streaming 
> application. The token is renewed automatically every 1 day until it hits the 
> max life limit. The problem is token is expired after max life (7 days) and 
> we need to restart the job. Is there any way we can re-issue the token and 
> pass it to a job that is already running? It doesn't feel right at all to 
> restart the job every 7 days only due to the token issue.
>
> P.S: We use  "--keytab /path/to/the/headless-keytab", "--principal 
> principalNameAsPerTheKeytab" and "--conf 
> spark.hadoop.fs.hdfs.impl.disable.cache=true" as the arguments for 
> spark-submit command.
>
> Thanks,
> Ali



--
Marcelo




R: Tungsten and Spark Streaming

2015-09-10 Thread Paolo Platter
Do you plan to modify the DStream interface in order to work with DataFrames? It 
would be nice to handle DStreams without generics.

Paolo

Sent from my Windows Phone

From: Tathagata Das
Sent: 10/09/2015 07:42
To: N B
Cc: user
Subject: Re: Tungsten and Spark Streaming

Rewriting is necessary. You will have to convert RDD/DStream operations to 
DataFrame operations. So get the RDDs in DStream, using transform/foreachRDD, 
convert to DataFrames and then do DataFrame operations.
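
A minimal sketch of the approach described above, assuming Spark 1.5 with an input 
DStream[String] named lines (the variable name is hypothetical):

import org.apache.spark.sql.SQLContext

// For each micro-batch, turn the RDD into a DataFrame and use the DataFrame
// API so the work goes through the execution path that benefits from Tungsten.
lines.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._

  val words = rdd.flatMap(_.split(" ")).map(w => Tuple1(w)).toDF("word")
  words.groupBy("word").count().show()
}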

On Wed, Sep 9, 2015 at 9:23 PM, N B wrote:
Hello,

How can we start taking advantage of the performance gains made under Project 
Tungsten in Spark 1.5 for a Spark Streaming program?

From what I understand, this is available by default for Dataframes. But for a 
program written using Spark Streaming, would we see any potential gains "out 
of the box" in 1.5 or will we have to rewrite some portions of the application 
code to realize that benefit?

Any insight/documentation links etc in this regard will be appreciated.

Thanks
Nikunj




R: Spark + Druid

2015-09-02 Thread Paolo Platter
Fantastic!!! I will look into that and I hope to contribute

Paolo

Sent from my Windows Phone

From: Harish Butani
Sent: 02/09/2015 06:04
To: user
Subject: Spark + Druid

Hi,

I am working on the Spark Druid Package: 
https://github.com/SparklineData/spark-druid-olap.
For scenarios where a 'raw event' dataset is being indexed in Druid, it enables 
you to write your Logical Plans (queries/dataflows) against the 'raw event' 
dataset and it rewrites parts of the plan to execute as a Druid Query. In Spark, 
the configuration of a Druid DataSource is somewhat like configuring an OLAP 
index in a traditional DB. Early results show significant speedup from pushing 
slice-and-dice queries to Druid.

It comprises a Druid DataSource that wraps the 'raw event' dataset and has 
knowledge of the Druid Index, and a DruidPlanner, which is a set of plan rewrite 
strategies to convert aggregation queries into a plan having a DruidRDD.

Here is a detailed design document, which also describes a benchmark of 
representative queries on the TPCH dataset.

Looking for folks who would be willing to try this out and/or contribute.

regards,
Harish Butani.


R: Is SPARK is the right choice for traditional OLAP query processing?

2015-07-29 Thread Paolo Platter
Take a look at Zoomdata. It is Spark-based and offers BI features 
with good performance.

Paolo

Sent from my Windows Phone

From: Ruslan Dautkhanov<mailto:dautkha...@gmail.com>
Sent: 29/07/2015 06:18
To: renga.kannan<mailto:renga.kan...@gmail.com>
Cc: user<mailto:user@spark.apache.org>
Subject: Re: Is SPARK is the right choice for traditional OLAP query processing?

 We want these user actions to respond within 2 to 5 seconds.

I think this goal is a stretch for Spark. Some queries may run faster than that 
on a large dataset, but in general you can't put an SLA like this on it. For 
example, if you have to join some huge datasets, you'll likely be well over 
that. Spark is great for huge jobs and it'll be much faster than MR.
I don't think Spark was designed with interactive queries in mind. For example, 
although Spark is in-memory, its in-memory state is only for a job. It's not 
like in traditional RDBMS systems where you have a persistent buffer cache or 
in-memory columnar storage (both are Oracle terms).
If you have multiple users running interactive BI queries, results that were 
cached for the first user wouldn't be used by the second user, unless you invent 
something that keeps a persistent Spark context, serves users' requests and 
decides which RDDs to cache, when and how.
At least that's my understanding of how Spark works. If I'm wrong, I will be 
glad to hear it, as we ran into the same questions.

As we use Cloudera's CDH, I'm not sure where Hortonworks are with their Tez 
project, but Tez has components that resemble more closely the buffer cache or 
in-memory columnar storage caching of traditional RDBMS systems, and it may get 
better and/or more predictable performance on BI queries.



--
Ruslan Dautkhanov

On Mon, Jul 20, 2015 at 6:04 PM, renga.kannan 
renga.kan...@gmail.commailto:renga.kan...@gmail.com wrote:
All,
I really appreciate anyone's input on this. We are having a very simple
traditional OLAP query processing use case. Our use case is as follows.


1. We have customer sales order table data coming from an RDBMS table.
2. There are many dimension columns in the sales order table. For each of
those dimensions, we have individual dimension tables that stores the
dimension record sets.
3. We also have some BI like hierarchies that is defined for dimension data
set.

What we want for business users is as follows:

1. We wanted to show some aggregated values from sales Order transaction
table columns.
2. User would like to filter these with specific dimension values from
dimension table.
3. User should be able to drill down from higher level to lower level by
traversing hierarchy on dimension


We want these user actions to respond within 2 to 5 seconds.


We are thinking about using SPARK as our backend engine to serve data to
these front-end applications.


Has anyone tried using SPARK for these kinds of use cases? These are all
traditional use cases in the BI space. If so, can SPARK respond to these queries
within 2 to 5 seconds for large data sets?

Thanks,
Renga








R: Spark is much slower than direct access MySQL

2015-07-26 Thread Paolo Platter
If you want a performance boost, you need to load the full table in memory 
using caching and then execute your query directly on the cached DataFrame. 
Otherwise you use Spark only as a bridge and you don't leverage its distributed 
in-memory engine.
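
A minimal sketch of this approach, assuming Spark 1.4+ in the spark-shell (where 
sqlContext is available) and a hypothetical MySQL database, table, column names and 
credentials:

import java.util.Properties

// Load the whole table once, cache it, and materialize the cache with count().
// Later queries then hit Spark's in-memory copy instead of going back to MySQL.
val props = new Properties()
props.setProperty("user", "dbuser")      // hypothetical credentials
props.setProperty("password", "dbpass")

val df = sqlContext.read.jdbc("jdbc:mysql://dbhost:3306/mydb", "mytable", props)
df.cache()
df.count()                               // forces the full load into memory

df.registerTempTable("mytable_cached")
sqlContext.sql("SELECT count(*) FROM mytable_cached WHERE amount > 100").show()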

Paolo

Sent from my Windows Phone

From: Louis Hust<mailto:louis.h...@gmail.com>
Sent: 26/07/2015 10:28
To: Shixiong Zhu<mailto:zsxw...@gmail.com>
Cc: Jerrick Hoang<mailto:jerrickho...@gmail.com>; 
user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Spark is much slower than direct access MySQL

Thanks for your explanation.

2015-07-26 16:22 GMT+08:00 Shixiong Zhu 
zsxw...@gmail.commailto:zsxw...@gmail.com:
Oh, I see. That's the total time of executing a query in Spark. Then the 
difference is reasonable, considering Spark has much more work to do, e.g., 
launching tasks in executors.


Best Regards,

Shixiong Zhu

2015-07-26 16:16 GMT+08:00 Louis Hust 
louis.h...@gmail.commailto:louis.h...@gmail.com:
Look at the given url:

Code can be found at:

https://github.com/louishust/sparkDemo/blob/master/src/main/java/DirectQueryTest.java

2015-07-26 16:14 GMT+08:00 Shixiong Zhu 
zsxw...@gmail.commailto:zsxw...@gmail.com:
Could you clarify how you measure the Spark time cost? Is it the total time of 
running the query? If so, it's possible because the overhead of Spark dominates 
for small queries.


Best Regards,

Shixiong Zhu

2015-07-26 15:56 GMT+08:00 Jerrick Hoang 
jerrickho...@gmail.commailto:jerrickho...@gmail.com:
how big is the dataset? how complicated is the query?

On Sun, Jul 26, 2015 at 12:47 AM Louis Hust 
louis.h...@gmail.commailto:louis.h...@gmail.com wrote:
Hi, all,

I am using a Spark DataFrame to fetch a small table from MySQL,
and I found it costs much more than directly accessing MySQL using JDBC.

The time cost for Spark is about 2033 ms, and direct access is at about 16 ms.

Code can be found at:

https://github.com/louishust/sparkDemo/blob/master/src/main/java/DirectQueryTest.java

So is my configuration for Spark wrong? How can I optimise Spark to achieve 
performance similar to direct access?

Any idea will be appreciated!







R: Is spark suitable for real time query

2015-07-22 Thread Paolo Platter
Are you using jdbc server?

Paolo

Sent from my Windows Phone

From: Louis Hust<mailto:louis.h...@gmail.com>
Sent: 22/07/2015 13:47
To: Robin East<mailto:robin.e...@xense.co.uk>
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: Is spark suitable for real time query

I did a simple test using Spark in standalone mode (not cluster),
and found a simple action takes a few seconds; the data size is small, just a 
few rows.
So does each Spark job cost some time for init or prepare work, no matter what 
the job is?
I mean, will the basic framework of a Spark job cost seconds?

2015-07-22 19:17 GMT+08:00 Robin East 
robin.e...@xense.co.ukmailto:robin.e...@xense.co.uk:
Real-time is, of course, relative but you’ve mentioned microsecond level. Spark 
is designed to process large amounts of data in a distributed fashion. No 
distributed system I know of could give any kind of guarantees at the 
microsecond level.

Robin

 On 22 Jul 2015, at 11:14, Louis Hust 
 louis.h...@gmail.commailto:louis.h...@gmail.com wrote:

 Hi, all

 I am using a Spark jar in standalone mode to fetch data from different MySQL 
 instances and do some actions, but I found the time is at the second level.

 So I want to know whether a Spark job is suitable for real-time queries at the 
 microsecond level?




Scripting with groovy

2015-06-02 Thread Paolo Platter
Hi all,

Has anyone tried to add scripting capabilities to Spark Streaming using Groovy?
I would like to stop the streaming context, update a transformation function 
written in Groovy (for example to manipulate JSON), restart the streaming 
context and obtain a new behaviour without re-submitting the application.

Is it possible? Do you think it makes sense, or is there a smarter way to 
accomplish that?

Thanks
Paolo

Sent from my Windows Phone


Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Paolo Platter
Nice job!

We are developing something very similar... I will contact you to understand if 
we can contribute some pieces to you!

Best

Paolo

From: Evo Eftimov<mailto:evo.efti...@isecc.com>
Sent: Thursday, May 14, 2015 17:21
To: 'David Morales'<mailto:dmora...@stratio.com>, Matei 
Zaharia<mailto:matei.zaha...@gmail.com>
Cc: user@spark.apache.org<mailto:user@spark.apache.org>

That has been a really rapid evaluation of the work and its direction

From: David Morales [mailto:dmora...@stratio.com]
Sent: Thursday, May 14, 2015 4:12 PM
To: Matei Zaharia
Cc: user@spark.apache.org
Subject: Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

Thanks for your kind words Matei, happy to see that our work is in the right 
way.




2015-05-14 17:10 GMT+02:00 Matei Zaharia 
matei.zaha...@gmail.commailto:matei.zaha...@gmail.com:
(Sorry, for non-English people: that means it's a good thing.)

Matei

 On May 14, 2015, at 10:53 AM, Matei Zaharia 
 matei.zaha...@gmail.commailto:matei.zaha...@gmail.com wrote:

 ...This is madness!

 On May 14, 2015, at 9:31 AM, dmoralesdf 
 dmora...@stratio.commailto:dmora...@stratio.com wrote:

 Hi there,

 We have released our real-time aggregation engine based on Spark Streaming.

 SPARKTA is fully open source (Apache2)


 You can check out the slides shown at Strata last week:

 http://www.slideshare.net/Stratio/strata-sparkta

 Source code:

 https://github.com/Stratio/sparkta

 And documentation

 http://docs.stratio.com/modules/sparkta/development/


 We are open to your ideas and contributors are welcomed.


 Regards.









--
David Morales de Frías  ::  +34 607 010 411 :: 
@dmoralesdf<https://twitter.com/dmoralesdf>

[http://www.stratio.com/wp-content/uploads/2014/05/stratio_logo_2014.png]<http://www.stratio.com/>
Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 828 6473 // www.stratio.com<http://www.stratio.com> // 
@stratiobd<https://twitter.com/StratioBD>


SparkSQL + Parquet performance

2015-04-06 Thread Paolo Platter
Hi all,

Is there anyone using SparkSQL + Parquet who has made a benchmark about 
storing Parquet files on HDFS versus CFS (Cassandra File System)?
Which storage can improve the performance of SparkSQL + Parquet?

Thanks

Paolo



Spark Druid integration

2015-04-06 Thread Paolo Platter
Hi,

Do you think it is possible to build an integration between Druid and Spark, 
using the Data Source API?
Is anyone investigating this kind of solution?
I think that Spark SQL could fill the gap left by Druid's lack of a complete SQL 
layer. It could be a great OLAP solution.
WDYT?

Paolo Platter
AgileLab CTO



R: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-19 Thread Paolo Platter
Yes, I would suggest spark-notebook too.
It's very simple to set up and it's growing pretty fast.

Paolo

Sent from my Windows Phone

From: Irfan Ahmad<mailto:ir...@cloudphysics.com>
Sent: 19/03/2015 04:05
To: davidh<mailto:dav...@annaisystems.com>
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: iPython Notebook + Spark + Accumulo -- best practice?

I forgot to mention that there are also Zeppelin and jove-notebook, but I 
haven't got any experience with those yet.


Irfan Ahmad
CTO | Co-Founder | CloudPhysicshttp://www.cloudphysics.com
Best of VMworld Finalist
Best Cloud Management Award
NetworkWorld 10 Startups to Watch
EMA Most Notable Vendor

On Wed, Mar 18, 2015 at 8:01 PM, Irfan Ahmad 
ir...@cloudphysics.commailto:ir...@cloudphysics.com wrote:
Hi David,

W00t indeed and great questions. On the notebook front, there are two options 
depending on what you are looking for. You can either go with iPython 3 with 
Spark-kernel as a backend or you can use spark-notebook. Both have interesting 
tradeoffs.

If you are looking for a single notebook platform for your data scientists 
that has R, Python as well as a Spark shell, you'll likely want to go with 
iPython + Spark-kernel. Downsides with the spark-kernel project are that data 
visualization isn't quite there yet, and it's early days for documentation, 
blogs, etc. The upside is that R and Python work beautifully and that the 
ipython committers are super-helpful.

If you are OK with a primarily Spark/Scala experience, then I suggest you go 
with spark-notebook. Upsides are that the project is a little further along, 
visualization support is better than spark-kernel's (though not as good as 
iPython with Python) and the committer is awesome with help. The downside is 
that you won't get R and Python.

FWIW: I'm using both at the moment!

Hope that helps.


Irfan Ahmad
CTO | Co-Founder | CloudPhysicshttp://www.cloudphysics.com
Best of VMworld Finalist
Best Cloud Management Award
NetworkWorld 10 Startups to Watch
EMA Most Notable Vendor

On Wed, Mar 18, 2015 at 5:45 PM, davidh 
dav...@annaisystems.commailto:dav...@annaisystems.com wrote:
hi all, I've been DDGing, Stack Overflowing, Twittering, RTFMing, and
scanning through this archive with only moderate success. in other words --
my way of saying sorry if this is answered somewhere obvious and I missed it
:-)

i've been tasked with figuring out how to connect Notebook, Spark, and
Accumulo together. The end user will do her work via notebook. thus far,
I've successfully setup a Vagrant image containing Spark, Accumulo, and
Hadoop. I was able to use some of the Accumulo example code to create a
table populated with data, create a simple program in scala that, when fired
off to Spark via spark-submit, connects to accumulo and prints the first ten
rows of data in the table. so w00t on that - but now I'm left with more
questions:

1) I'm still stuck on what's considered 'best practice' in terms of hooking
all this together. Let's say Sally, a  user, wants to do some analytic work
on her data. She pecks the appropriate commands into notebook and fires them
off. how does this get wired together on the back end? Do I, from notebook,
use spark-submit to send a job to spark and let spark worry about hooking
into accumulo or is it preferable to create some kind of open stream between
the two?

2) if I want to extend spark's api, do I need to first submit an endless job
via spark-submit that does something like what this gentleman describes
http://blog.madhukaraphatak.com/extending-spark-api  ? is there an
alternative (other than refactoring spark's source) that doesn't involve
extending the api via a job submission?

ultimately what I'm looking for help locating docs, blogs, etc that may shed
some light on this.

t/y in advance!

d









Re: Spark SQL Where IN support

2015-02-23 Thread Paolo Platter
I was speaking about version 1.2 of Spark.

Paolo

From: Paolo Platter<mailto:paolo.plat...@agilelab.it>
Sent: Monday, February 23, 2015 10:41
To: user@spark.apache.org<mailto:user@spark.apache.org>

Hi guys,

Is the IN operator supported in Spark SQL over the Hive Metastore?

Thanks


Paolo



Spark SQL Where IN support

2015-02-23 Thread Paolo Platter
Hi guys,

Is the "IN" operator supported in Spark SQL over the Hive Metastore?
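
For context, a minimal sketch of the kind of query in question, assuming a 
HiveContext named hiveContext and a hypothetical Hive table and columns:

// IN predicate against a Hive-metastore-backed table (names are illustrative).
val result = hiveContext.sql(
  "SELECT order_id, country FROM sales WHERE country IN ('IT', 'DE', 'FR')")
result.collect().foreach(println)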

Thanks


Paolo



SparkSQL and star schema

2015-02-13 Thread Paolo Platter
Hi,

Is SparkSQL + Parquet suitable for replicating a star schema?

Paolo Platter
AgileLab CTO



R: Datastore HDFS vs Cassandra

2015-02-10 Thread Paolo Platter
Hi Mike,

I developed a solution with Cassandra and Spark, using DSE.
The main difficulty is with Cassandra: you need to understand its data model and 
its query patterns very well.
Cassandra has better performance than HDFS, and it has DR and stronger 
availability.
HDFS is a filesystem; Cassandra is a DBMS.
Cassandra supports full CRUD without ACID.
HDFS is more flexible than Cassandra.

In my opinion, if you have a real time series, go with Cassandra, paying 
attention to your reporting data access patterns.

Paolo

Sent from my Windows Phone

From: Mike Trienis<mailto:mike.trie...@orcsol.com>
Sent: 11/02/2015 05:59
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Datastore HDFS vs Cassandra

Hi,

I am considering implementing Apache Spark on top of the Cassandra database 
after listening to a related talk and reading through the slides from DataStax. 
It seems to fit well with our time-series data and reporting requirements.

http://www.slideshare.net/patrickmcfadin/apache-cassandra-apache-spark-for-time-series-data

Does anyone have any experiences using Apache Spark and Cassandra, including
limitations (and or) technical difficulties? How does Cassandra compare with
HDFS and what use cases would make HDFS more suitable?

Thanks, Mike.







R: spark 1.2 writing on parquet after a join never ends - GC problems

2015-02-08 Thread Paolo Platter
Could anyone figure out what is going on in my Spark cluster?

Thanks in advance

Paolo

Sent from my Windows Phone

From: Paolo Platter<mailto:paolo.plat...@agilelab.it>
Sent: 06/02/2015 10:48
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: spark 1.2 writing on parquet after a join never ends - GC problems

Hi all,

I’m experiencing a strange behaviour of spark 1.2.

I’ve a 3 node cluster + the master.

each node has:
1 HDD 7200 rpm 1 TB
16 GB RAM
8 core

I configured executors with 6 cores and 10 GB each (  
spark.storage.memoryFraction = 0.6 )

My job is pretty simple:


val file1 = sc.parquetFile("path1")  // 19M rows
val file2 = sc.textFile("path2")     // 12K rows

val join = file1.as('f1).join(file2.as('f2), LeftOuter, 
Some("f1.field".attr === "f2.field".attr))

join.map( _.toCaseClass() ).saveAsParquetFile("path3")


When I run this job in the spark-shell without writing the Parquet file, 
but performing a final count to execute the pipeline, it's pretty fast.
When I submit the application to the cluster with the saveAsParquetFile 
instruction, task execution slows progressively and it never ends.
I debugged this behaviour and I found that the cause is the executors' 
disconnection due to missing heartbeats. The missing heartbeat, in my opinion, 
is related to GC (I report a piece of the GC log from one of the executors):

484.861: [GC [PSYoungGen: 2053788K-718157K(2561024K)] 
7421222K-6240219K(9551872K), 2.6802130 secs] [Times: user=1.94 sys=0.60, 
real=2.68 secs]
497.751: [GC [PSYoungGen: 2560845K-782081K(2359808K)] 
8082907K-6984335K(9350656K), 4.8611660 secs] [Times: user=3.66 sys=1.55, 
real=4.86 secs]
510.654: [GC [PSYoungGen: 2227457K-625664K(2071552K)] 
8429711K-7611342K(9062400K), 22.5727850 secs] [Times: user=3.34 sys=2.43, 
real=22.57 secs]
533.745: [Full GC [PSYoungGen: 625664K-0K(2071552K)] [ParOldGen: 
6985678K-2723917K(6990848K)] 7611342K-2723917K(9062400K) [PSPermGen: 
62290K-6
K(124928K)], 56.9075910 secs] [Times: user=65.28 sys=5.91, real=56.90 secs]
667.637: [GC [PSYoungGen: 1445376K-623184K(2404352K)] 
4169293K-3347101K(9395200K), 11.7959290 secs] [Times: user=1.58 sys=0.60, 
real=11.79 secs]
690.936: [GC [PSYoungGen: 1973328K-584256K(2422784K)] 
4697245K-3932841K(9413632K), 39.3594850 secs] [Times: user=2.88 sys=0.96, 
real=39.36 secs]
789.891: [GC [PSYoungGen: 1934400K-585552K(2434048K)] 
5282985K-4519857K(9424896K), 17.4456720 secs] [Times: user=2.65 sys=1.36, 
real=17.44 secs]
814.697: [GC [PSYoungGen: 1951056K-330109K(2426880K)] 
5885361K-4851426K(9417728K), 20.9578300 secs] [Times: user=1.64 sys=0.81, 
real=20.96 secs]
842.968: [GC [PSYoungGen: 1695613K-180290K(2489344K)] 
6216930K-4888775K(9480192K), 3.2760780 secs] [Times: user=0.40 sys=0.30, 
real=3.28 secs]
886.660: [GC [PSYoungGen: 1649218K-427552K(2475008K)] 
6357703K-5239028K(9465856K), 5.4738210 secs] [Times: user=1.47 sys=0.25, 
real=5.48 secs]
897.979: [GC [PSYoungGen: 1896480K-634144K(2487808K)] 
6707956K-5874208K(9478656K), 23.6440110 secs] [Times: user=2.63 sys=1.11, 
real=23.64 secs]
929.706: [GC [PSYoungGen: 2169632K-663200K(2199040K)] 
7409696K-6538992K(9189888K), 39.3632270 secs] [Times: user=3.36 sys=1.71, 
real=39.36 secs]
1006.206: [GC [PSYoungGen: 2198688K-655584K(2449920K)] 
8074480K-7196224K(9440768K), 98.5040880 secs] [Times: user=161.53 sys=6.71, 
real=98.49 secs]
1104.790: [Full GC [PSYoungGen: 655584K-0K(2449920K)] [ParOldGen: 
6540640K-6290292K(6990848K)] 7196224K-6290292K(9440768K) [PSPermGen: 
62247K-6224
7K(131072K)], 610.0023700 secs] [Times: user=1630.17 sys=27.80, real=609.93 
secs]
1841.916: [Full GC [PSYoungGen: 1440256K-0K(2449920K)] [ParOldGen: 
6290292K-6891868K(6990848K)] 7730548K-6891868K(9440768K) [PSPermGen: 
62266K-622
66K(131072K)], 637.4852230 secs] [Times: user=2035.09 sys=36.09, real=637.40 
secs]
2572.012: [Full GC [PSYoungGen: 1440256K-509513K(2449920K)] [ParOldGen: 
6891868K-6990703K(6990848K)] 8332124K-7500217K(9440768K) [PSPermGen: 62275K
-62275K(129024K)], 698.2497860 secs] [Times: user=2261.54 sys=37.63, 
real=698.26 secs]
3326.711: [Full GC


It might seem that the file-writing operation is too slow and is a bottleneck, 
but then I tried to change my algorithm in the following way:


val file1 = sc.parquetFile("path1")  // 19M rows
val file2 = sc.textFile("path2")     // 12K rows

val bFile2 = sc.broadcast( file2.collect.groupBy( f2 => f2.field ) )   
// broadcast of the smaller file as a Map()


file1.map( f1 => ( f1, bFile2.value( f1.field ).head ) )  // manual join
.map( _.toCaseClass() )
.saveAsParquetFile("path3")


in this way the task is fast and ends without problems, so now I’m pretty 
confused.


  *   Join works well if I use count as final action
  *   Parquet write is working well without previous join operation
  *   Parquet write after join never ends and I detected GC problems

Can anyone figure out what is happening?

Thanks


Paolo



spark 1.2 writing on parquet after a join never ends - GC problems

2015-02-06 Thread Paolo Platter
Hi all,

I’m experiencing a strange behaviour of spark 1.2.

I’ve a 3 node cluster + the master.

each node has:
1 HDD 7200 rpm 1 TB
16 GB RAM
8 core

I configured executors with 6 cores and 10 GB each (  
spark.storage.memoryFraction = 0.6 )

My job is pretty simple:


val file1 = sc.parquetFile("path1")  // 19M rows
val file2 = sc.textFile("path2")     // 12K rows

val join = file1.as('f1).join(file2.as('f2), LeftOuter, 
Some("f1.field".attr === "f2.field".attr))

join.map( _.toCaseClass() ).saveAsParquetFile("path3")


When I run this job in the spark-shell without writing the Parquet file, 
but performing a final count to execute the pipeline, it's pretty fast.
When I submit the application to the cluster with the saveAsParquetFile 
instruction, task execution slows progressively and it never ends.
I debugged this behaviour and I found that the cause is the executors' 
disconnection due to missing heartbeats. The missing heartbeat, in my opinion, 
is related to GC (I report a piece of the GC log from one of the executors):

484.861: [GC [PSYoungGen: 2053788K-718157K(2561024K)] 
7421222K-6240219K(9551872K), 2.6802130 secs] [Times: user=1.94 sys=0.60, 
real=2.68 secs]
497.751: [GC [PSYoungGen: 2560845K-782081K(2359808K)] 
8082907K-6984335K(9350656K), 4.8611660 secs] [Times: user=3.66 sys=1.55, 
real=4.86 secs]
510.654: [GC [PSYoungGen: 2227457K-625664K(2071552K)] 
8429711K-7611342K(9062400K), 22.5727850 secs] [Times: user=3.34 sys=2.43, 
real=22.57 secs]
533.745: [Full GC [PSYoungGen: 625664K-0K(2071552K)] [ParOldGen: 
6985678K-2723917K(6990848K)] 7611342K-2723917K(9062400K) [PSPermGen: 
62290K-6
K(124928K)], 56.9075910 secs] [Times: user=65.28 sys=5.91, real=56.90 secs]
667.637: [GC [PSYoungGen: 1445376K-623184K(2404352K)] 
4169293K-3347101K(9395200K), 11.7959290 secs] [Times: user=1.58 sys=0.60, 
real=11.79 secs]
690.936: [GC [PSYoungGen: 1973328K-584256K(2422784K)] 
4697245K-3932841K(9413632K), 39.3594850 secs] [Times: user=2.88 sys=0.96, 
real=39.36 secs]
789.891: [GC [PSYoungGen: 1934400K-585552K(2434048K)] 
5282985K-4519857K(9424896K), 17.4456720 secs] [Times: user=2.65 sys=1.36, 
real=17.44 secs]
814.697: [GC [PSYoungGen: 1951056K-330109K(2426880K)] 
5885361K-4851426K(9417728K), 20.9578300 secs] [Times: user=1.64 sys=0.81, 
real=20.96 secs]
842.968: [GC [PSYoungGen: 1695613K-180290K(2489344K)] 
6216930K-4888775K(9480192K), 3.2760780 secs] [Times: user=0.40 sys=0.30, 
real=3.28 secs]
886.660: [GC [PSYoungGen: 1649218K-427552K(2475008K)] 
6357703K-5239028K(9465856K), 5.4738210 secs] [Times: user=1.47 sys=0.25, 
real=5.48 secs]
897.979: [GC [PSYoungGen: 1896480K-634144K(2487808K)] 
6707956K-5874208K(9478656K), 23.6440110 secs] [Times: user=2.63 sys=1.11, 
real=23.64 secs]
929.706: [GC [PSYoungGen: 2169632K-663200K(2199040K)] 
7409696K-6538992K(9189888K), 39.3632270 secs] [Times: user=3.36 sys=1.71, 
real=39.36 secs]
1006.206: [GC [PSYoungGen: 2198688K-655584K(2449920K)] 
8074480K-7196224K(9440768K), 98.5040880 secs] [Times: user=161.53 sys=6.71, 
real=98.49 secs]
1104.790: [Full GC [PSYoungGen: 655584K-0K(2449920K)] [ParOldGen: 
6540640K-6290292K(6990848K)] 7196224K-6290292K(9440768K) [PSPermGen: 
62247K-6224
7K(131072K)], 610.0023700 secs] [Times: user=1630.17 sys=27.80, real=609.93 
secs]
1841.916: [Full GC [PSYoungGen: 1440256K-0K(2449920K)] [ParOldGen: 
6290292K-6891868K(6990848K)] 7730548K-6891868K(9440768K) [PSPermGen: 
62266K-622
66K(131072K)], 637.4852230 secs] [Times: user=2035.09 sys=36.09, real=637.40 
secs]
2572.012: [Full GC [PSYoungGen: 1440256K-509513K(2449920K)] [ParOldGen: 
6891868K-6990703K(6990848K)] 8332124K-7500217K(9440768K) [PSPermGen: 62275K
-62275K(129024K)], 698.2497860 secs] [Times: user=2261.54 sys=37.63, 
real=698.26 secs]
3326.711: [Full GC


It might seem that the file-writing operation is too slow and is a bottleneck, 
but then I tried to change my algorithm in the following way:


val file1 = sc.parquetFile("path1")  // 19M rows
val file2 = sc.textFile("path2")     // 12K rows

val bFile2 = sc.broadcast( file2.collect.groupBy( f2 => f2.field ) )   
// broadcast of the smaller file as a Map()


file1.map( f1 => ( f1, bFile2.value( f1.field ).head ) )  // manual join
.map( _.toCaseClass() )
.saveAsParquetFile("path3")


in this way the task is fast and ends without problems, so now I’m pretty 
confused.


  *   Join works well if I use count as final action
  *   Parquet write is working well without previous join operation
  *   Parquet write after join never ends and I detected GC problems

Can anyone figure out what is happening?

Thanks


Paolo



R: Broadcast variables: when should I use them?

2015-01-26 Thread Paolo Platter
Hi,

Yes, if they are not big, it's a good practice to broadcast them to avoid 
serializing them each time you use a closure.
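
A minimal sketch of the pattern, with hypothetical names; the lookup Map is built on 
the driver and shipped once per executor instead of being serialized with every 
task's closure:

// Static lookup data built on the driver (hypothetical contents).
val countryNames = Map(1 -> "Italy", 2 -> "Spain", 3 -> "France")

// Broadcast it once; executors fetch a single read-only copy.
val bCountryNames = sc.broadcast(countryNames)

// Reference .value inside closures instead of capturing the map directly.
// someRdd is a hypothetical RDD[(Int, Double)].
val labelled = someRdd.map { case (countryId, amount) =>
  (bCountryNames.value.getOrElse(countryId, "unknown"), amount)
}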

Paolo

Sent from my Windows Phone

From: frodo777<mailto:roberto.vaquer...@bitmonlab.com>
Sent: 26/01/2015 14:34
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Broadcast variables: when should I use them?

Hello.

I have a number of static Arrays and Maps in my Spark Streaming driver
program.
They are simple collections, initialized with integer values and strings
directly in the code. There is no RDD/DStream involvement here.
I do not expect them to contain more than 100 entries, each.
They are used in several subsequent parallel operations.

The question is:
Should I convert them into broadcast variables?

Thanks and regards.
-Bob







R: RDD Moving Average

2015-01-06 Thread Paolo Platter
In my opinion you should use the fold pattern, obviously after a sortBy 
transformation.

Paolo

Sent from my Windows Phone

From: Asim Jalis<mailto:asimja...@gmail.com>
Sent: 06/01/2015 23:11
To: Sean Owen<mailto:so...@cloudera.com>
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: RDD Moving Average

One problem with this is that we are creating a lot of iterables containing a 
lot of repeated data. Is there a way to do this so that we can calculate a 
moving average incrementally?

On Tue, Jan 6, 2015 at 4:44 PM, Sean Owen 
so...@cloudera.commailto:so...@cloudera.com wrote:
Yes, if you break it down to...

tickerRDD.map(ticker =>
  (ticker.timestamp, ticker)
).map { case (ts, ticker) =>
  ((ts / 6) * 6, ticker)
}.groupByKey

... as Michael alluded to, then it more naturally extends to the sliding 
window, since you can flatMap one Ticker to many (bucket, ticker) pairs, then 
group. I think this would implement 1 minute buckets, sliding by 10 seconds:

tickerRDD.flatMap(ticker =>
  (ticker.timestamp - 6 to ticker.timestamp by 15000).map(ts => (ts, 
ticker))
).map { case (ts, ticker) =>
  ((ts / 6) * 6, ticker)
}.groupByKey

On Tue, Jan 6, 2015 at 8:47 PM, Asim Jalis 
asimja...@gmail.commailto:asimja...@gmail.com wrote:
I guess I can use a similar groupBy approach. Map each event to all the windows 
that it can belong to. Then do a groupBy, etc. I was wondering if there was a 
more elegant approach.

On Tue, Jan 6, 2015 at 3:45 PM, Asim Jalis 
asimja...@gmail.commailto:asimja...@gmail.com wrote:
Except I want it to be a sliding window. So the same record could be in 
multiple buckets.




R: Clarifications on Spark

2014-12-05 Thread Paolo Platter
Hi,

1) Yes, you can. Spark supports a lot of file formats on hdfs/s3, and it 
supports Cassandra and JDBC in general.

2) Yes. Spark has a JDBC Thrift server where you can attach BI tools. I suggest 
you pay attention to your query response time requirements.

3) No, you can go with Cassandra. If you are looking at MongoDB, you should give 
the Stratio platform a try.

4) Yes. Using JdbcRDD you can leverage an RDBMS too (see the sketch below).

5) I suggest using Spark as a computation engine: build your pre-aggregated 
views and persist them on a data store like Cassandra, then attach the BI tools 
to the aggregated views directly.
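
A minimal sketch of the JdbcRDD mentioned in point 4, with a hypothetical MySQL URL, 
table, columns and credentials; the SQL must contain exactly two ? placeholders for 
the partition bounds, and the JDBC driver must be on the classpath:

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Pull rows from a relational table into an RDD, partitioned by id range.
val ordersRdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:mysql://dbhost:3306/shop", "user", "pass"),
  "SELECT id, amount FROM orders WHERE id >= ? AND id <= ?",
  1L,        // lower bound of the partition column
  1000000L,  // upper bound
  10,        // number of partitions
  (r: ResultSet) => (r.getLong("id"), r.getDouble("amount"))
)

println(ordersRdd.count())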

Paolo

Sent from my Windows Phone

From: Ajay<mailto:ajay.ga...@gmail.com>
Sent: 05/12/2014 07:25
To: u...@spark.incubator.apache.org<mailto:u...@spark.incubator.apache.org>
Subject: Clarifications on Spark

Hello,

I work for an eCommerce company. Currently we are looking at building a Data
warehouse platform as described below:

DW as a Service
|
REST API
|
SQL On No SQL (Drill/Pig/Hive/Spark SQL)
|
No SQL databases (One or more. May be RDBMS directly too)
| (Bulk load)
My SQL Database

I wish to get a few clarifications on Apache Drill as follows:

1) Can we use Spark for SQL on NoSQL, or do we need to mix it with
Pig/Hive or anything else for any reason?
2) Can Spark SQL be used as a query interface for Business Intelligence,
Analytics and Reporting?
3) Does Spark support only Hadoop and HBase? We may use
Cassandra/MongoDB/CouchBase as well.
4) Does Spark support RDBMS too? Can we have a single interface to pull out
data from multiple data sources?
5) Any recommendations (not limited to the usage of Spark) for our specific
requirement described above.

Thanks
Ajay

Note : I have posted a similar post on the Drill User list as well as I am
not sure which one best fits for our usecase.







R: Optimized spark configuration

2014-12-05 Thread Paolo Platter
What kind of query are you performing?
You should set something like 2 partitions per core; that would be about 400 MB 
per partition.
As you have a lot of RAM, I suggest caching the whole table; performance will 
increase a lot.
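
A minimal sketch of the caching suggestion, assuming the data is registered in the 
metastore under a hypothetical table name mytable; the equivalent CACHE TABLE mytable 
statement can also be issued through the JDBC Thrift server:

// Cache the whole table in memory, then query the cached copy.
sqlContext.cacheTable("mytable")
sqlContext.sql("SELECT count(*) FROM mytable").collect()  // materializes the cache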

Paolo

Sent from my Windows Phone

From: vdiwakar.malladi<mailto:vdiwakar.mall...@gmail.com>
Sent: 05/12/2014 18:52
To: u...@spark.incubator.apache.org<mailto:u...@spark.incubator.apache.org>
Subject: Optimized spark configuration

Hi

Could anyone help with what would be a better / optimized configuration (driver
memory, worker memory, number of parallelism, etc.) when we are running
1 master node (itself also acting as a slave node) and 1 slave node?
Both have 32 GB RAM with 4 cores.

On this, I loaded approx. 17M rows of data (3.2 GB) to hive store and when I
try to execute a query on this from jdbc thrift server, it is taking about
10-12 sec to retrieve the data which I think is too much.

Or please guide me to any tutorial which explains these
optimized configurations.

Thanks in advance.







R: map function

2014-12-04 Thread Paolo Platter
Hi,

rdd.flatMap( e => e._2.map( i => (i, e._1) ) )

Should work, but I didn't test it, so maybe I'm missing something.
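
A minimal runnable sketch of the same idea, using the sample data from the question 
(the inner collection type is an assumption):

// (key, values) pairs as in the question.
val rdd = sc.parallelize(Seq(
  (1, Seq(10, 20)),
  (2, Seq(30, 40, 10)),
  (3, Seq(30))
))

// Invert: emit one (value, key) pair per element of the inner collection.
val inverted = rdd.flatMap { case (k, vs) => vs.map(v => (v, k)) }

inverted.collect().foreach(println)
// (10,1), (20,1), (30,2), (40,2), (10,2), (30,3)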

Paolo

Sent from my Windows Phone

From: Yifan LI<mailto:iamyifa...@gmail.com>
Sent: 04/12/2014 09:27
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: map function

Hi,

I have a RDD like below:
(1, (10, 20))
(2, (30, 40, 10))
(3, (30))
…

Is there any way to map it to this:
(10,1)
(20,1)
(30,2)
(40,2)
(10,2)
(30,3)
…

generally, for each element, it might be mapped to multiple.

Thanks in advance!


Best,
Yifan LI







Re: How to enforce RDD to be cached?

2014-12-03 Thread Paolo Platter
Yes,

otherwise you can try:

rdd.cache().count()

and then run your benchmark
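
A minimal sketch of separating the two measurements, with a hypothetical timed helper:

// Hypothetical helper that times a block in milliseconds.
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"$label took ${(System.nanoTime() - start) / 1e6} ms")
  result
}

timed("materialize cache") { rdd.cache().count() }           // caching cost paid here
timed("benchmarked action") { rdd.map(_.toString).count() }  // runs on cached data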

Paolo

From: Daniel Darabos<mailto:daniel.dara...@lynxanalytics.com>
Sent: Wednesday, December 3, 2014 12:28
To: shahab<mailto:shahab.mok...@gmail.com>
Cc: user@spark.apache.org<mailto:user@spark.apache.org>



On Wed, Dec 3, 2014 at 10:52 AM, shahab 
shahab.mok...@gmail.commailto:shahab.mok...@gmail.com wrote:
Hi,

I noticed that rdd.cache() does not happen immediately; rather, due to the lazy 
nature of Spark, it happens at the moment you perform some map/reduce actions. 
Is this true?

Yes, this is correct.

If this is the case, how can I force Spark to cache immediately at its 
cache() statement? I need this to perform some benchmarking, and I need to 
separate RDD caching time from RDD transformation/action processing time.

The typical solution I think is to run rdd.foreach(_ => ()) to trigger a 
calculation.


Spark Shell strange worker Exception

2014-10-27 Thread Paolo Platter
Hi all,

I'm submitting a simple task using the spark-shell against a CassandraRDD 
(DataStax environment).
I'm getting the following exception from one of the workers:


INFO 2014-10-27 14:08:03 akka.event.slf4j.Slf4jLogger: Slf4jLogger started
INFO 2014-10-27 14:08:03 Remoting: Starting remoting
INFO 2014-10-27 14:08:03 Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkExecutor@10.105.111.130:50234]
INFO 2014-10-27 14:08:03 Remoting: Remoting now listens on addresses: 
[akka.tcp://sparkExecutor@10.105.111.130:50234]
INFO 2014-10-27 14:08:03 
org.apache.spark.executor.CoarseGrainedExecutorBackend: Connecting to driver: 
akka.tcp://sp...@srv02.pocbgsia.ats-online.it:39797/user/CoarseGrainedScheduler
INFO 2014-10-27 14:08:03 org.apache.spark.deploy.worker.WorkerWatcher: 
Connecting to worker akka.tcp://sparkWorker@10.105.111.130:34467/user/Worker
INFO 2014-10-27 14:08:04 org.apache.spark.deploy.worker.WorkerWatcher: 
Successfully connected to 
akka.tcp://sparkWorker@10.105.111.130:34467/user/Worker
INFO 2014-10-27 14:08:04 
org.apache.spark.executor.CoarseGrainedExecutorBackend: Successfully registered 
with driver
INFO 2014-10-27 14:08:04 org.apache.spark.executor.Executor: Using REPL class 
URI: http://159.8.18.11:51705
INFO 2014-10-27 14:08:04 akka.event.slf4j.Slf4jLogger: Slf4jLogger started
INFO 2014-10-27 14:08:04 Remoting: Starting remoting
INFO 2014-10-27 14:08:04 Remoting: Remoting started; listening on addresses 
:[akka.tcp://spark@10.105.111.130:49243]
INFO 2014-10-27 14:08:04 Remoting: Remoting now listens on addresses: 
[akka.tcp://spark@10.105.111.130:49243]
INFO 2014-10-27 14:08:04 org.apache.spark.SparkEnv: Connecting to 
BlockManagerMaster: 
akka.tcp://sp...@srv02.pocbgsia.ats-online.it:39797/user/BlockManagerMaster
INFO 2014-10-27 14:08:04 org.apache.spark.storage.DiskBlockManager: Created 
local directory at 
/usr/share/dse/spark/tmp/executor/spark-local-20141027140804-4d84
INFO 2014-10-27 14:08:04 org.apache.spark.storage.MemoryStore: MemoryStore 
started with capacity 23.0 GB.
INFO 2014-10-27 14:08:04 org.apache.spark.network.ConnectionManager: Bound 
socket to port 50542 with id = ConnectionManagerId(10.105.111.130,50542)
INFO 2014-10-27 14:08:04 org.apache.spark.storage.BlockManagerMaster: Trying to 
register BlockManager
INFO 2014-10-27 14:08:04 org.apache.spark.storage.BlockManagerMaster: 
Registered BlockManager
INFO 2014-10-27 14:08:04 org.apache.spark.SparkEnv: Connecting to 
MapOutputTracker: 
akka.tcp://sp...@srv02.pocbgsia.ats-online.it:39797/user/MapOutputTracker
INFO 2014-10-27 14:08:04 org.apache.spark.HttpFileServer: HTTP File server 
directory is 
/usr/share/dse/spark/tmp/executor/spark-a23656dc-efce-494b-875a-a1cf092c3230
INFO 2014-10-27 14:08:04 org.apache.spark.HttpServer: Starting HTTP Server
INFO 2014-10-27 14:08:27 
org.apache.spark.executor.CoarseGrainedExecutorBackend: Got assigned task 0
INFO 2014-10-27 14:08:28 org.apache.spark.executor.Executor: Running task ID 0
ERROR 2014-10-27 14:08:28 org.apache.spark.executor.Executor: Exception in task 
ID 0
java.lang.ClassNotFoundException: com.datastax.bdp.spark.CassandraRDD
at 
org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:49)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Unknown Source)
at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:37)
at java.io.ObjectInputStream.readNonProxyDesc(Unknown Source)
at java.io.ObjectInputStream.readClassDesc(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.readObject(Unknown Source)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
at 
org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:63)
at 
org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:139)
at java.io.ObjectInputStream.readExternalData(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.readObject(Unknown Source)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:40)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:62)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:193)
at 
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:45)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 

R: Spark as a Library

2014-09-16 Thread Paolo Platter
Hi,

Spark Job Server by Ooyala is the right tool for the job. It exposes a REST API, 
so calling it from a web app is suitable.
It is open source; you can find it on GitHub.

Best

Paolo Platter

From: Ruebenacker, Oliver A<mailto:oliver.ruebenac...@altisource.com>
Sent: 16/09/2014 21.18
To: Matei Zaharia<mailto:matei.zaha...@gmail.com>; 
user@spark.apache.org<mailto:user@spark.apache.org>
Subject: RE: Spark as a Library


 Hello,

  Thanks for the response and great to hear it is possible. But how do I 
connect to Spark without using the submit script?

  I know how to start up a master and some workers and then connect to the 
master by packaging the app that contains the SparkContext and then submitting 
the package with the spark-submit script in standalone-mode. But I don’t want 
to submit the app that contains the SparkContext via the script, because I want 
that app to be running on a web server. So, what are other ways to connect to 
Spark? I can’t find in the docs anything other than using the script. Thanks!

 Best, Oliver

From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: Tuesday, September 16, 2014 1:31 PM
To: Ruebenacker, Oliver A; user@spark.apache.org
Subject: Re: Spark as a Library

If you want to run the computation on just one machine (using Spark's local 
mode), it can probably run in a container. Otherwise you can create a 
SparkContext there and connect it to a cluster outside. Note that I haven't 
tried this though, so the security policies of the container might be too 
restrictive. In that case you'd have to run the app outside and expose an RPC 
interface between them.
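
A minimal sketch of the second option (creating the SparkContext inside the web 
application and pointing it at an external standalone cluster); the master URL and 
jar path are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

// Created inside the web application (e.g. a Tomcat-hosted service),
// connecting to a standalone cluster that is already running elsewhere.
val conf = new SparkConf()
  .setAppName("webapp-embedded-spark")
  .setMaster("spark://spark-master-host:7077")        // hypothetical master URL
  .setJars(Seq("/path/to/application-assembly.jar"))  // code the executors need

val sc = new SparkContext(conf)

// Use sc for computations triggered by web requests, e.g.:
println(sc.parallelize(1 to 1000).sum())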

Matei


On September 16, 2014 at 8:17:08 AM, Ruebenacker, Oliver A 
(oliver.ruebenac...@altisource.commailto:oliver.ruebenac...@altisource.com) 
wrote:

 Hello,

  Suppose I want to use Spark from an application that I already submit to run 
in another container (e.g. Tomcat). Is this at all possible? Or do I have to 
split the app into two components, and submit one to Spark and one to the other 
container? In that case, what is the preferred way for the two components to 
communicate with each other? Thanks!

 Best, Oliver

Oliver Ruebenacker | Solutions Architect

Altisource™
290 Congress St, 7th Floor | Boston, Massachusetts 02210
P: (617) 728-5582 | ext: 275585
oliver.ruebenac...@altisource.commailto:oliver.ruebenac...@altisource.com | 
www.Altisource.com



Spark NLP

2014-09-10 Thread Paolo Platter
Hi all,

What is your preferred Scala NLP lib? Why?
Are there any items on Spark's roadmap to integrate NLP features?

I basically need to perform NER line by line, so I don't need a deep 
integration with the distributed engine.
I only want simple dependencies and the chance to build a dictionary for the 
Italian language.

Any suggestions ?

Thanks

Paolo Platter



RE: Spark and Shark

2014-09-01 Thread Paolo Platter
We tried to connect the old Simba Shark ODBC driver to the Thrift JDBC Server 
with Spark 1.1 RC2 and it works fine.



Best



Paolo



Paolo Platter
Agile Lab CTO

From: Michael Armbrust mich...@databricks.com
Sent: Monday, September 1, 2014 19:43
To: arthur.hk.c...@gmail.com
Cc: user@spark.apache.org
Subject: Re: Spark and Shark

I don't believe that Shark works with Spark > 1.0. Have you considered trying 
Spark SQL?


On Mon, Sep 1, 2014 at 8:21 AM, 
arthur.hk.c...@gmail.commailto:arthur.hk.c...@gmail.com 
arthur.hk.c...@gmail.commailto:arthur.hk.c...@gmail.com wrote:
Hi,

I have installed Spark 1.0.2 and Shark 0.9.2 on Hadoop 2.4.1 (by compiling from 
source).

spark: 1.0.2
shark: 0.9.2
hadoop: 2.4.1
java: java version 1.7.0_67
protobuf: 2.5.0


I have tried the smoke test in Shark but got a java.util.NoSuchElementException 
error; can you please advise how to fix this?

shark> create table x1 (a INT);
FAILED: Hive Internal Error: java.util.NoSuchElementException(null)
14/09/01 23:04:24 [main]: ERROR shark.SharkDriver: FAILED: Hive Internal Error: 
java.util.NoSuchElementException(null)
java.util.NoSuchElementException
at java.util.HashMap$HashIterator.nextEntry(HashMap.java:925)
at java.util.HashMap$ValueIterator.next(HashMap.java:950)
at 
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:8117)
at 
shark.parse.SharkSemanticAnalyzer.analyzeInternal(SharkSemanticAnalyzer.scala:150)
at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:284)
at shark.SharkDriver.compile(SharkDriver.scala:215)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:342)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:977)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
at shark.SharkCliDriver.processCmd(SharkCliDriver.scala:340)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
at shark.SharkCliDriver$.main(SharkCliDriver.scala:237)
at shark.SharkCliDriver.main(SharkCliDriver.scala)


spark-env.sh
#!/usr/bin/env bash
export CLASSPATH=$HBASE_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar
export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/mysql-connector-java-5.1.31-bin.jar
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native/Linux-amd64-64
export HADOOP_CONF_DIR=${HADOOP_CONF_DIR:-/etc/hadoop}
export 
SPARK_CLASSPATH=$SPARK_HOME/lib_managed/jars/mysql-connector-java-5.1.31-bin.jar
export SPARK_WORKER_MEMORY=2g
export HADOOP_HEAPSIZE=2000

spark-defaults.conf
spark.executor.memory   2048m
spark.shuffle.spill.compressfalse

shark-env.sh
#!/usr/bin/env bash
export SPARK_MEM=2g
export SHARK_MASTER_MEM=2g
SPARK_JAVA_OPTS="-Dspark.local.dir=/tmp "
SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 "
SPARK_JAVA_OPTS+="-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps "
export SPARK_JAVA_OPTS
export SHARK_EXEC_MODE=yarn
export 
SPARK_ASSEMBLY_JAR=$SCALA_HOME/assembly/target/scala-2.10/spark-assembly-1.0.2-hadoop2.4.1.jar
export SHARK_ASSEMBLY_JAR=target/scala-2.10/shark_2.10-0.9.2.jar
export HIVE_CONF_DIR=$HIVE_HOME/conf
export SPARK_LIBPATH=$HADOOP_HOME/lib/native/
export SPARK_LIBRARY_PATH=$HADOOP_HOME/lib/native/
export 
SPARK_CLASSPATH=$SHARK_HOME/lib/hadoop-snappy-0.0.1-SNAPSHOT.jar:$SHARK_HOME/lib/protobuf-java-2.5.0.jar


Regards
Arthur