Re: Why Apache Spark doesn't use Calcite?

2020-01-13 Thread Matei Zaharia
I’m pretty sure that Catalyst was built before Calcite, or at least in 
parallel. Calcite 1.0 was only released in 2015. From a technical standpoint, 
building Catalyst in Scala also made it more concise and easier to extend than 
an optimizer written in Java (you can find various presentations about how 
Catalyst works).
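
For a flavor of what "easier to extend" means in practice, a Catalyst rule is essentially a Scala pattern match over the plan tree. A simplified, illustrative sketch (not actual Spark source; Spark's built-in BooleanSimplification rule already covers cases like this) might look like:

    import org.apache.spark.sql.catalyst.expressions.Not
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Rewrite NOT(NOT(x)) to x anywhere in a plan's expressions.
    object SimplifyDoubleNegation extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
        case Not(Not(child)) => child
      }
    }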

Matei

> On Jan 13, 2020, at 8:41 AM, Michael Mior  wrote:
> 
> It's fairly common for adapters (Calcite's abstraction of a data
> source) to push down predicates. However, the API certainly looks a
> lot different than Catalyst's.
> --
> Michael Mior
> mm...@apache.org
> 
> Le lun. 13 janv. 2020 à 09:45, Jason Nerothin
>  a écrit :
>> 
>> The implementation they chose supports push down predicates, Datasets and 
>> other features that are not available in Calcite:
>> 
>> https://databricks.com/glossary/catalyst-optimizer
>> 
>> On Mon, Jan 13, 2020 at 8:24 AM newroyker  wrote:
>>> 
>>> Was there a qualitative or quantitative benchmark done before a design
>>> decision was made not to use Calcite?
>>> 
>>> Are there limitations (for heuristic-based, cost-based, or *-aware optimizers)
>>> in Calcite, or in frameworks built on top of Calcite, in the context of big
>>> data / TPC-H benchmarks?
>>> 
>>> I was unable to dig up anything concrete from the user group / Jira. I'd appreciate
>>> it if any Catalyst veteran here could give me pointers. I'm trying to defend
>>> Spark/Catalyst.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>> 
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>> 
>> 
>> 
>> --
>> Thanks,
>> Jason
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark 2.4.0 artifact in Maven repository

2018-11-06 Thread Matei Zaharia
Hi Bartosz,

This is because the vote on 2.4 has passed (you can see the vote thread on the 
dev mailing list) and we are just working to get the release into various 
channels (Maven, PyPI, etc), which can take some time. Expect to see an 
announcement soon once that’s done.

Matei

> On Nov 4, 2018, at 7:14 AM, Bartosz Konieczny  wrote:
> 
> Hi,
> 
> Today I wanted to set up a development environment for GraphX and when I 
> visited Maven central repository 
> (https://mvnrepository.com/artifact/org.apache.spark/spark-graphx) I saw that 
> it was already available in version 2.4.0. Does this mean that the new version 
> of Apache Spark has been released? It seems quite surprising to me because I 
> didn't find any release information, and the 2.4 artifact was deployed on 
> 29/10/2018. Maybe somebody here has an explanation for that?
> 
> Best regards,
> Bartosz Konieczny.


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Matei Zaharia
That’s a good point — I’d say there’s just a risk of creating a perception 
issue. First, some users might feel that this means they have to migrate now, 
which is before Python itself drops support; they might also be surprised that 
we did this in a minor release (e.g. might we drop Python 2 altogether in a 
Spark 2.5 if that later comes out?). Second, contributors might feel that this 
means new features no longer have to work with Python 2, which would be 
confusing. Maybe it’s OK on both fronts, but it just seems scarier for users to 
do this now if we do plan to have Spark 3.0 in the next 6 months anyway.

Matei

> On Sep 17, 2018, at 1:04 PM, Mark Hamstra  wrote:
> 
> What is the disadvantage to deprecating now in 2.4.0? I mean, it doesn't 
> change the code at all; it's just a notification that we will eventually 
> cease supporting Py2. Wouldn't users prefer to get that notification sooner 
> rather than later?
> 
> On Mon, Sep 17, 2018 at 12:58 PM Matei Zaharia  
> wrote:
> I’d like to understand the maintenance burden of Python 2 before deprecating 
> it. Since it is not EOL yet, it might make sense to only deprecate it once 
> it’s EOL (which is still over a year from now). Supporting Python 2+3 seems 
> less burdensome than supporting, say, multiple Scala versions in the same 
> codebase, so what are we losing out on?
> 
> The other thing is that even though Python core devs might not support 2.x 
> later, it’s quite possible that various Linux distros will if moving from 2 
> to 3 remains painful. In that case, we may want Apache Spark to continue 
> releasing for it despite the Python core devs not supporting it.
> 
> Basically, I’d suggest to deprecate this in Spark 3.0 and then remove it 
> later in 3.x instead of deprecating it in 2.4. I’d also consider looking at 
> what other data science tools are doing before fully removing it: for 
> example, if Pandas and TensorFlow no longer support Python 2 past some point, 
> that might be a good point to remove it.
> 
> Matei
> 
> > On Sep 17, 2018, at 11:01 AM, Mark Hamstra  wrote:
> > 
> > If we're going to do that, then we need to do it right now, since 2.4.0 is 
> > already in release candidates.
> > 
> > On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson  wrote:
> > I like Mark’s concept for deprecating Py2 starting with 2.4: It may seem 
> > like a ways off but even now there may be some spark versions supporting 
> > Py2 past the point where Py2 is no longer receiving security patches 
> > 
> > 
> > On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra  
> > wrote:
> > We could also deprecate Py2 already in the 2.4.0 release.
> > 
> > On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson  wrote:
> > In case this didn't make it onto this thread:
> > 
> > There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove 
> > it entirely on a later 3.x release.
> > 
> > On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson  
> > wrote:
> > On a separate dev@spark thread, I raised a question of whether or not to 
> > support python 2 in Apache Spark, going forward into Spark 3.0.
> > 
> > Python-2 is going EOL at the end of 2019. The upcoming release of Spark 3.0 
> > is an opportunity to make breaking changes to Spark's APIs, and so it is a 
> > good time to consider support for Python-2 on PySpark.
> > 
> > Key advantages to dropping Python 2 are:
> >   • Support for PySpark becomes significantly easier.
> >   • Avoid having to support Python 2 until Spark 4.0, which is likely 
> > to imply supporting Python 2 for some time after it goes EOL.
> > (Note that supporting python 2 after EOL means, among other things, that 
> > PySpark would be supporting a version of python that was no longer 
> > receiving security patches)
> > 
> > The main disadvantage is that PySpark users who have legacy python-2 code 
> > would have to migrate their code to python 3 to take advantage of Spark 3.0
> > 
> > This decision obviously has large implications for the Apache Spark 
> > community and we want to solicit community feedback.
> > 
> > 
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Matei Zaharia
I’d like to understand the maintenance burden of Python 2 before deprecating 
it. Since it is not EOL yet, it might make sense to only deprecate it once it’s 
EOL (which is still over a year from now). Supporting Python 2+3 seems less 
burdensome than supporting, say, multiple Scala versions in the same codebase, 
so what are we losing out on?

The other thing is that even though Python core devs might not support 2.x 
later, it’s quite possible that various Linux distros will if moving from 2 to 
3 remains painful. In that case, we may want Apache Spark to continue releasing 
for it despite the Python core devs not supporting it.

Basically, I’d suggest to deprecate this in Spark 3.0 and then remove it later 
in 3.x instead of deprecating it in 2.4. I’d also consider looking at what 
other data science tools are doing before fully removing it: for example, if 
Pandas and TensorFlow no longer support Python 2 past some point, that might be 
a good point to remove it.

Matei

> On Sep 17, 2018, at 11:01 AM, Mark Hamstra  wrote:
> 
> If we're going to do that, then we need to do it right now, since 2.4.0 is 
> already in release candidates.
> 
> On Mon, Sep 17, 2018 at 10:57 AM Erik Erlandson  wrote:
> I like Mark’s concept for deprecating Py2 starting with 2.4: It may seem like 
> a ways off but even now there may be some spark versions supporting Py2 past 
> the point where Py2 is no longer receiving security patches 
> 
> 
> On Sun, Sep 16, 2018 at 12:26 PM Mark Hamstra  wrote:
> We could also deprecate Py2 already in the 2.4.0 release.
> 
> On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson  wrote:
> In case this didn't make it onto this thread:
> 
> There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove it 
> entirely on a later 3.x release.
> 
> On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson  wrote:
> On a separate dev@spark thread, I raised a question of whether or not to 
> support python 2 in Apache Spark, going forward into Spark 3.0.
> 
> Python-2 is going EOL at the end of 2019. The upcoming release of Spark 3.0 
> is an opportunity to make breaking changes to Spark's APIs, and so it is a 
> good time to consider support for Python-2 on PySpark.
> 
> Key advantages to dropping Python 2 are:
>   • Support for PySpark becomes significantly easier.
>   • Avoid having to support Python 2 until Spark 4.0, which is likely to 
> imply supporting Python 2 for some time after it goes EOL.
> (Note that supporting python 2 after EOL means, among other things, that 
> PySpark would be supporting a version of python that was no longer receiving 
> security patches)
> 
> The main disadvantage is that PySpark users who have legacy python-2 code 
> would have to migrate their code to python 3 to take advantage of Spark 3.0
> 
> This decision obviously has large implications for the Apache Spark community 
> and we want to solicit community feedback.
> 
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is there any open source framework that converts Cypher to SparkSQL?

2018-09-16 Thread Matei Zaharia
GraphFrames (https://graphframes.github.io) offers a Cypher-like syntax that 
then executes on Spark SQL.
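
As a quick illustration of the motif-finding API (a rough sketch; it assumes the graphframes package is on the classpath, and the vertex/edge data is made up):

    import org.graphframes.GraphFrame

    // Vertices need an "id" column; edges need "src" and "dst" columns.
    val vertices = spark.createDataFrame(Seq(
      ("a", "Alice"), ("b", "Bob"), ("c", "Carol"))).toDF("id", "name")
    val edges = spark.createDataFrame(Seq(
      ("a", "b", "follows"), ("b", "c", "follows"))).toDF("src", "dst", "relationship")

    val g = GraphFrame(vertices, edges)

    // Cypher-like pattern: pairs of vertices connected by an edge.
    val motifs = g.find("(x)-[e]->(y)")
    motifs.filter("e.relationship = 'follows'").show()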

> On Sep 14, 2018, at 2:42 AM, kant kodali  wrote:
> 
> Hi All,
> 
> Is there any open source framework that converts Cypher to SparkSQL?
> 
> Thanks!


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed

2018-06-17 Thread Matei Zaharia
Maybe your application is overriding the master variable when it creates its 
SparkContext. I see you are still passing “yarn-client” as an argument to it later 
in your command.
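
If that's the case, the fix is to not hard-code the master in the app and let spark-submit's --master flag decide. A rough, untested sketch (the app name comes from your command; setIfMissing is optional):

    import org.apache.spark.{SparkConf, SparkContext}

    // No setMaster() here: a hard-coded setMaster("yarn-client") in the code
    // would override the --master local[*] passed to spark-submit.
    val conf = new SparkConf().setAppName("GetRevenuePerOrder")
    // conf.setIfMissing("spark.master", "local[*]")  // optional local default
    val sc = new SparkContext(conf)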

> On Jun 17, 2018, at 11:53 AM, Raymond Xie  wrote:
> 
> Thank you Subhash.
> 
> Here is the new command:
> spark-submit --master local[*] --class retail_db.GetRevenuePerOrder --conf 
> spark.ui.port=12678 spark2practice_2.11-0.1.jar yarn-client 
> /public/retail_db/order_items /home/rxie/output/revenueperorder
> 
> Still seeing the same issue here.
> 2018-06-17 11:51:25 INFO  RMProxy:98 - Connecting to ResourceManager at /0.0.0.0:8032
> 2018-06-17 11:51:27 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:28 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:29 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:30 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:31 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:32 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:33 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:34 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:35 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 2018-06-17 11:51:36 INFO  Client:871 - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
> 
> 
> 
> 
> Sincerely yours,
> 
> 
> Raymond
> 
> On Sun, Jun 17, 2018 at 2:36 PM, Subhash Sriram  
> wrote:
> Hi Raymond,
> 
> If you set your master to local[*] instead of yarn-client, it should run on 
> your local machine.
> 
> Thanks,
> Subhash 
> 
> Sent from my iPhone
> 
> On Jun 17, 2018, at 2:32 PM, Raymond Xie  wrote:
> 
>> Hello,
>> 
>> I am wondering how can I run spark job in my environment which is a single 
>> Ubuntu host with no hadoop installed? if I run my job like below, I will end 
>> up with infinite loop at the end. Thank you very much.
>> 
>> rxie@ubuntu:~/data$ spark-submit --class retail_db.GetRevenuePerOrder --conf 
>> spark.ui.port=12678 spark2practice_2.11-0.1.jar yarn-client 
>> /public/retail_db/order_items /home/rxie/output/revenueperorder
>> 2018-06-17 11:19:36 WARN  Utils:66 - Your hostname, ubuntu resolves to a 
>> loopback address: 127.0.1.1; using 192.168.112.141 instead (on interface 
>> ens33)
>> 2018-06-17 11:19:36 WARN  Utils:66 - Set SPARK_LOCAL_IP 

Re: Spark 1.x - End of life

2017-10-19 Thread Matei Zaharia
Hi Ismael,

It depends on what you mean by “support”. In general, there won’t be new 
feature releases for 1.X (e.g. Spark 1.7) because all the new features are 
being added to the master branch. However, there is always room for bug fix 
releases if there is a catastrophic bug, and committers can make those at any 
time. In general though, I’d recommend moving workloads to Spark 2.x. We tried 
to make the migration as easy as possible (a few APIs changed, but not many), 
and 2.x has been out for a long time now and is widely used.

We should perhaps write a more explicit maintenance policy, but all of this is 
run based on what committers want to work on; if someone thinks that there’s a 
serious enough issue in 1.6 to update it, they can put together a new release. 
It does help to hear from users about this though, e.g. if you think there’s a 
significant issue that people are missing.

Matei

> On Oct 19, 2017, at 5:20 AM, Ismaël Mejía  wrote:
> 
> Hello,
> 
> I noticed that some of the (big data / cloud-managed) Hadoop
> distributions are starting to phase out / deprecate Spark 1.x, and I
> was wondering if the Spark community has already decided when it will
> end support for Spark 1.x. I ask this also because the
> latest release in the series is already almost one year old. Any idea
> on this?
> 
> Thanks,
> Ismaël
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Kill Spark Streaming JOB from Spark UI or Yarn

2017-08-27 Thread Matei Zaharia
The batches should all have the same application ID, so use that one. You can 
also find the application in the YARN UI to terminate it from there.
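
If you want to script it, one option (a rough, untested sketch; ssc here is your StreamingContext) is to log the application ID from inside the app, so you can later pass it to "yarn application -kill", or stop the context gracefully yourself:

    // Log the YARN application ID so the job can be killed later from the YARN CLI,
    // or stop the streaming context gracefully from within the application.
    println(s"Application ID: ${ssc.sparkContext.applicationId}")
    ssc.stop(stopSparkContext = true, stopGracefully = true)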

Matei

> On Aug 27, 2017, at 10:27 AM, KhajaAsmath Mohammed  
> wrote:
> 
> Hi,
> 
> I am new to Spark Streaming and am not able to find an option to kill it after 
> starting the streaming context.
> 
> The Streaming tab doesn't have an option to kill it.
> 
> The Jobs tab doesn't have an option to kill it either.
> 
> If it is scheduled on YARN, how do I kill it when spark-submit is running in the 
> background and I don't have a way to find the YARN application ID? Do the 
> batches have separate YARN application IDs or the same one?
> 
> Thanks,
> Asmath


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: real world spark code

2017-07-25 Thread Matei Zaharia
You can also find a lot of GitHub repos for external packages here: 
http://spark.apache.org/third-party-projects.html

Matei

> On Jul 25, 2017, at 5:30 PM, Frank Austin Nothaft  
> wrote:
> 
> There’s a number of real-world open source Spark applications in the sciences:
> 
> genomics:
> 
> github.com/bigdatagenomics/adam <— core is scala, has py/r wrappers
> https://github.com/broadinstitute/gatk <— core is java
> https://github.com/hail-is/hail <— core is scala, mostly used through python 
> wrappers
> 
> neuroscience:
> 
> https://github.com/thunder-project/thunder#using-with-spark <— pyspark
> 
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466
> 
>> On Jul 25, 2017, at 8:09 AM, Jörn Franke  wrote:
>> 
>> Continuous integration (Travis, jenkins) and reporting on unit tests, 
>> integration tests etc for each source code version.
>> 
>> On 25. Jul 2017, at 16:58, Adaryl Wakefield  
>> wrote:
>> 
>>> ci+reporting? I’ve never heard of that term before. What is that?
>>>  
>>> Adaryl "Bob" Wakefield, MBA
>>> Principal
>>> Mass Street Analytics, LLC
>>> 913.938.6685
>>> www.massstreet.net
>>> www.linkedin.com/in/bobwakefieldmba
>>> Twitter: @BobLovesData
>>>  
>>>  
>>> From: Jörn Franke [mailto:jornfra...@gmail.com] 
>>> Sent: Tuesday, July 25, 2017 8:31 AM
>>> To: Adaryl Wakefield 
>>> Cc: user@spark.apache.org
>>> Subject: Re: real world spark code
>>>  
>>> Look for the ones that have unit and integration tests as well as a 
>>> ci+reporting on code quality.
>>>  
>>> All the others are just toy examples. Well should be :)
>>> 
>>> On 25. Jul 2017, at 01:08, Adaryl Wakefield  
>>> wrote:
>>> 
>>> Anybody know of publicly available GitHub repos of real world Spark 
>>> applications written in scala?
>>>  
>>> Adaryl "Bob" Wakefield, MBA
>>> Principal
>>> Mass Street Analytics, LLC
>>> 913.938.6685
>>> www.massstreet.net
>>> www.linkedin.com/in/bobwakefieldmba
>>> Twitter: @BobLovesData
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Structured Streaming with Kafka Source, does it work??

2016-11-06 Thread Matei Zaharia
The Kafka source will only appear in 2.0.2 -- see this thread for the current 
release candidate: 
https://lists.apache.org/thread.html/597d630135e9eb3ede54bb0cc0b61a2b57b189588f269a64b58c9243@%3Cdev.spark.apache.org%3E
 . You can try that right now if you want from the staging Maven repo shown 
there. The vote looks likely to pass so an actual release should hopefully also 
be out soon.
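
Once you have that build (or the 2.0.2 release), a minimal sketch looks like the following; it assumes the spark-sql-kafka-0-10 package is on the classpath, and the broker and topic names are placeholders:

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "topic1")
      .load()

    // Kafka keys/values arrive as binary; cast them to strings before use.
    val values = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    val query = values.writeStream
      .format("console")
      .outputMode("append")
      .start()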

Matei

> On Nov 6, 2016, at 5:25 PM, shyla deshpande  wrote:
> 
> Hi Jaya!
> 
> Thanks for the reply. Structured Streaming works fine for me with the socket text 
> stream. I think Structured Streaming with a Kafka source is not yet supported.
> 
> If anyone has got it working with a Kafka source, please provide me with some 
> sample code or direction.
> 
> Thanks
> 
> 
> On Sun, Nov 6, 2016 at 5:17 PM, Jayaradha Natarajan  > wrote:
> Shyla!
> 
> Check
> https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
>  
> 
> 
> Thanks,
> Jayaradha
> 
> On Sun, Nov 6, 2016 at 5:13 PM, shyla  > wrote:
> I am trying to do Structured Streaming with Kafka Source. Please let me know
> where I can find some sample code for this. Thanks
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Structured-Streaming-with-Kafka-Source-does-it-work-tp19748.html
>  
> 
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> 
> 
> 
> 



Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Matei Zaharia
I think people explained this pretty well, but in practice, this distinction is 
also somewhat of a marketing term, because every system will perform some kind 
of batching. For example, every time you use TCP, the OS and network stack may 
buffer multiple messages together and send them at once; and likewise, 
virtually all streaming engines can batch data internally to achieve higher 
throughput. Furthermore, in all APIs, you can see individual records and 
respond to them one by one. The main question is just what overall performance 
you get (throughput and latency).

Matei

> On Aug 23, 2016, at 4:08 PM, Aseem Bansal  wrote:
> 
> Thanks everyone for clarifying.
> 
> On Tue, Aug 23, 2016 at 9:11 PM, Aseem Bansal  > wrote:
> I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/ 
> and it mentioned that Spark Streaming is actually mini-batch, not actual 
> streaming. 
> 
> I have not used streaming and I am not sure what the difference between the 
> two terms is. Hence I could not make a judgement myself.
> 



Re: unsubscribe

2016-08-10 Thread Matei Zaharia
To unsubscribe, please send an email to user-unsubscr...@spark.apache.org from 
the address you're subscribed from.

Matei

> On Aug 10, 2016, at 12:48 PM, Sohil Jain  wrote:
> 
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Dropping late date in Structured Streaming

2016-08-06 Thread Matei Zaharia
Yes, a built-in mechanism is planned in future releases. You can also drop it 
using a filter for now but the stateful operators will still keep state for old 
windows.
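
As a rough, untested sketch of the filter approach (events is an assumed streaming DataFrame with an eventTime timestamp column, and the 10-minute cutoff is arbitrary):

    import org.apache.spark.sql.functions.{col, window}

    // Drop records whose event time is more than 10 minutes behind the clock,
    // then aggregate only the recent ones.
    val recent = events.where("eventTime > current_timestamp() - interval 10 minutes")
    val counts = recent.groupBy(window(col("eventTime"), "5 minutes")).count()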

Matei

> On Aug 6, 2016, at 9:40 AM, Amit Sela  wrote:
> 
> I've noticed that when using Structured Streaming with event-time windows 
> (fixed/sliding), all windows are retained. This is clearly how "late" data is 
> handled, but I was wondering if there is some pruning mechanism that I might 
> have missed ? or is this planned in future releases ?
> 
> Thanks,
> Amit


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: The Future Of DStream

2016-07-27 Thread Matei Zaharia
Yup, they will definitely coexist. Structured Streaming is currently alpha and 
will probably be complete in the next few releases, but Spark Streaming will 
continue to exist, because it gives the user more low-level control. It's 
similar to DataFrames vs RDDs (RDDs are the lower-level API for when you want 
control, while DataFrames do more optimizations automatically by restricting 
the computation model).

Matei

> On Jul 27, 2016, at 12:03 AM, Ofir Manor  wrote:
> 
> Structured Streaming in 2.0 is declared as alpha - plenty of bits still 
> missing:
>  
> http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
>  
> 
> I assume that it will be declared stable / GA in a future 2.x release, and 
> then it will co-exist with DStream for quite a while before someone will 
> suggest to start a deprecation process that will eventually lead to its 
> removal...
> As a user, I guess we will need to apply judgement about when to switch to 
> Structured Streaming - each of us have a different risk/value tradeoff, based 
> on our specific situation...
> 
> Ofir Manor
> 
> Co-Founder & CTO | Equalum
> 
> 
> Mobile: +972-54-7801286  | Email: 
> ofir.ma...@equalum.io 
> On Wed, Jul 27, 2016 at 8:02 AM, Chang Chen  > wrote:
> Hi guys
> 
> Structured Streaming is coming with Spark 2.0, but I noticed that DStream is 
> still here.
> 
> What's the future of DStream? Will it be deprecated and removed 
> eventually, or will it co-exist with Structured Streaming forever?
> 
> Thanks
> Chang
> 
> 



Updated Spark logo

2016-06-10 Thread Matei Zaharia
Hi all, FYI, we've recently updated the Spark logo at https://spark.apache.org/ 
to say "Apache Spark" instead of just "Spark". Many ASF projects have been 
doing this recently to make it clearer that they are associated with the ASF, 
and indeed the ASF's branding guidelines generally require that projects be 
referred to as "Apache X" in various settings, especially in related commercial 
or open source products (https://www.apache.org/foundation/marks/). If you have 
any kind of site or product that uses Spark logo, it would be great to update 
to this full one.

There are EPS versions of the logo available at 
https://spark.apache.org/images/spark-logo.eps and 
https://spark.apache.org/images/spark-logo-reverse.eps; before using these also 
check https://www.apache.org/foundation/marks/.

Matei
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Apache Spark Slack

2016-05-16 Thread Matei Zaharia
I don't think any of the developers use this as an official channel, but all 
the ASF IRC channels are indeed on FreeNode. If there's demand for it, we can 
document this on the website and say that it's mostly for users to find other 
users. Development discussions should happen on the dev mailing list and JIRA 
so that they can easily be archived and found afterward.

Matei

> On May 16, 2016, at 1:06 PM, Dood@ODDO  wrote:
> 
> On 5/16/2016 9:52 AM, Xinh Huynh wrote:
>> I just went to IRC. It looks like the correct channel is #apache-spark.
>> So, is this an "official" chat room for Spark?
>> 
> 
> Ah yes, my apologies, it is #apache-spark indeed. Not sure if there is an 
> official channel on IRC for spark :-)
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Matei Zaharia
This sounds good to me as well. The one thing we should pay attention to is how 
we update the docs so that people know to start with the spark.ml classes. 
Right now the docs list spark.mllib first and also seem more comprehensive in 
that area than in spark.ml, so maybe people naturally move towards that.
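
For the docs, even a small front-and-center example of the DataFrame-based API would help set the default, something like this (a rough sketch; the input path is a placeholder):

    import org.apache.spark.ml.classification.LogisticRegression

    // Assumes a DataFrame with "label" and "features" columns,
    // e.g. produced by the spark.ml feature transformers.
    val training = sqlContext.read.parquet("/data/training")
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val model = lr.fit(training)
    val predictions = model.transform(training)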

Matei

> On Apr 5, 2016, at 4:44 PM, Xiangrui Meng  wrote:
> 
> Yes, DB (cc'ed) is working on porting the local linear algebra library over 
> (SPARK-13944). There are also frequent pattern mining algorithms we need to 
> port over in order to reach feature parity. -Xiangrui
> 
> On Tue, Apr 5, 2016 at 12:08 PM Shivaram Venkataraman 
> > wrote:
> Overall this sounds good to me. One question I have is that in
> addition to the ML algorithms we have a number of linear algebra
> (various distributed matrices) and statistical methods in the
> spark.mllib package. Is the plan to port or move these to the spark.ml 
> 
> namespace in the 2.x series ?
> 
> Thanks
> Shivaram
> 
> On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen  > wrote:
> > FWIW, all of that sounds like a good plan to me. Developing one API is
> > certainly better than two.
> >
> > On Tue, Apr 5, 2016 at 7:01 PM, Xiangrui Meng  > > wrote:
> >> Hi all,
> >>
> >> More than a year ago, in Spark 1.2 we introduced the ML pipeline API built
> >> on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API 
> >> has
> >> been developed under the spark.ml  package, while the 
> >> old RDD-based API has
> >> been developed in parallel under the spark.mllib package. While it was
> >> easier to implement and experiment with new APIs under a new package, it
> >> became harder and harder to maintain as both packages grew bigger and
> >> bigger. And new users are often confused by having two sets of APIs with
> >> overlapped functions.
> >>
> >> We started to recommend the DataFrame-based API over the RDD-based API in
> >> Spark 1.5 for its versatility and flexibility, and we saw the development
> >> and the usage gradually shifting to the DataFrame-based API. Just counting
> >> the lines of Scala code, from 1.5 to the current master we added ~1
> >> lines to the DataFrame-based API while ~700 to the RDD-based API. So, to
> >> gather more resources on the development of the DataFrame-based API and to
> >> help users migrate over sooner, I want to propose switching RDD-based MLlib
> >> APIs to maintenance mode in Spark 2.0. What does it mean exactly?
> >>
> >> * We do not accept new features in the RDD-based spark.mllib package, 
> >> unless
> >> they block implementing new features in the DataFrame-based spark.ml 
> >> 
> >> package.
> >> * We still accept bug fixes in the RDD-based API.
> >> * We will add more features to the DataFrame-based API in the 2.x series to
> >> reach feature parity with the RDD-based API.
> >> * Once we reach feature parity (possibly in Spark 2.2), we will deprecate
> >> the RDD-based API.
> >> * We will remove the RDD-based API from the main Spark repo in Spark 3.0.
> >>
> >> Though the RDD-based API is already in de facto maintenance mode, this
> >> announcement will make it clear and hence important to both MLlib 
> >> developers
> >> and users. So we’d greatly appreciate your feedback!
> >>
> >> (As a side note, people sometimes use “Spark ML” to refer to the
> >> DataFrame-based API or even the entire MLlib component. This also causes
> >> confusion. To be clear, “Spark ML” is not an official name and there are no
> >> plans to rename MLlib to “Spark ML” at this time.)
> >>
> >> Best,
> >> Xiangrui
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> > 
> > For additional commands, e-mail: user-h...@spark.apache.org 
> > 
> >



Re: simultaneous actions

2016-01-17 Thread Matei Zaharia
They'll be able to run concurrently and share workers / data. Take a look at 
http://spark.apache.org/docs/latest/job-scheduling.html for how scheduling 
happens across multiple running jobs in the same SparkContext.
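
A rough sketch of submitting two actions on the same cached RDD from separate threads (the path is a placeholder, untested):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    val rdd = sc.textFile("hdfs:///data/input").cache()

    // Each Future submits its job from its own thread; the scheduler
    // interleaves their tasks within the one SparkContext.
    val countF  = Future { rdd.count() }
    val lengthF = Future { rdd.map(_.length.toLong).reduce(_ + _) }

    val count = Await.result(countF, Duration.Inf)
    val totalLength = Await.result(lengthF, Duration.Inf)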

Matei

> On Jan 17, 2016, at 8:06 AM, Koert Kuipers <ko...@tresata.com> wrote:
> 
> Same rdd means same sparkcontext means same workers
> 
> Cache/persist the rdd to avoid repeated jobs
> 
> On Jan 17, 2016 5:21 AM, "Mennour Rostom" <mennou...@gmail.com 
> <mailto:mennou...@gmail.com>> wrote:
> Hi,
> 
> Thank you all for your answers,
> 
> If I correctly understand, actions (in my case foreach) can be run 
> concurrently and simultaneously on the SAME rdd, (which is logical because 
> they are read only object). however, I want to know if the same workers are 
> used for the concurrent analysis ?
> 
> Thank you
> 
> 2016-01-15 21:11 GMT+01:00 Jakob Odersky <joder...@gmail.com 
> <mailto:joder...@gmail.com>>:
> I stand corrected. How considerable are the benefits though? Will the 
> scheduler be able to dispatch jobs from both actions simultaneously (or on a 
> when-workers-become-available basis)?
> 
> On 15 January 2016 at 11:44, Koert Kuipers <ko...@tresata.com 
> <mailto:ko...@tresata.com>> wrote:
> we run multiple actions on the same (cached) rdd all the time, i guess in 
> different threads indeed (its in akka)
> 
> On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <matei.zaha...@gmail.com 
> <mailto:matei.zaha...@gmail.com>> wrote:
> RDDs actually are thread-safe, and quite a few applications use them this 
> way, e.g. the JDBC server.
> 
> Matei
> 
>> On Jan 15, 2016, at 2:10 PM, Jakob Odersky <joder...@gmail.com 
>> <mailto:joder...@gmail.com>> wrote:
>> 
>> I don't think RDDs are threadsafe.
>> More fundamentally however, why would you want to run RDD actions in 
>> parallel? The idea behind RDDs is to provide you with an abstraction for 
>> computing parallel operations on distributed data. Even if you were to call 
>> actions from several threads at once, the individual executors of your spark 
>> environment would still have to perform operations sequentially.
>> 
>> As an alternative, I would suggest to restructure your RDD transformations 
>> to compute the required results in one single operation.
>> 
>> On 15 January 2016 at 06:18, Jonathan Coveney <jcove...@gmail.com 
>> <mailto:jcove...@gmail.com>> wrote:
>> Threads
>> 
>> 
>> El viernes, 15 de enero de 2016, Kira <mennou...@gmail.com 
>> <mailto:mennou...@gmail.com>> escribió:
>> Hi,
>> 
>> Can we run *simultaneous* actions on the *same RDD* ?; if yes how can this
>> be done ?
>> 
>> Thank you,
>> Regards
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
>>  
>> <http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html>
>> Sent from the Apache Spark User List mailing list archive at Nabble.com 
>> <http://nabble.com/>.
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org <>
>> For additional commands, e-mail: user-h...@spark.apache.org <>
>> 
>> 
> 
> 
> 
> 



Re: simultaneous actions

2016-01-15 Thread Matei Zaharia
RDDs actually are thread-safe, and quite a few applications use them this way, 
e.g. the JDBC server.

Matei

> On Jan 15, 2016, at 2:10 PM, Jakob Odersky  wrote:
> 
> I don't think RDDs are threadsafe.
> More fundamentally however, why would you want to run RDD actions in 
> parallel? The idea behind RDDs is to provide you with an abstraction for 
> computing parallel operations on distributed data. Even if you were to call 
> actions from several threads at once, the individual executors of your spark 
> environment would still have to perform operations sequentially.
> 
> As an alternative, I would suggest to restructure your RDD transformations to 
> compute the required results in one single operation.
> 
> On 15 January 2016 at 06:18, Jonathan Coveney  > wrote:
> Threads
> 
> 
> El viernes, 15 de enero de 2016, Kira  > escribió:
> Hi,
> 
> Can we run *simultaneous* actions on the *same RDD* ?; if yes how can this
> be done ?
> 
> Thank you,
> Regards
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
>  
> 
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org <>
> For additional commands, e-mail: user-h...@spark.apache.org <>
> 
> 



Re: Compiling only MLlib?

2016-01-15 Thread Matei Zaharia
Have you tried just downloading a pre-built package, or linking to Spark 
through Maven? You don't need to build it unless you are changing code inside 
it. Check out 
http://spark.apache.org/docs/latest/quick-start.html#self-contained-applications
 for how to link to it.
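
For example, an sbt build that just links against the published 1.6.0 artifacts for Scala 2.10 (a minimal sketch) would be:

    // build.sbt
    scalaVersion := "2.10.6"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"  % "1.6.0",
      "org.apache.spark" %% "spark-mllib" % "1.6.0"
    )

With that you can experiment with MLlib using a local[*] master, with no Hadoop installation or Spark build required.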

Matei

> On Jan 15, 2016, at 6:13 PM, Colin Woodbury  wrote:
> 
> Hi, I'm very much interested in using Spark's MLlib in standalone programs. 
> I've never used Hadoop, and don't intend to deploy on massive clusters. 
> Building Spark has been an honest nightmare, and I've been on and off it for 
> weeks.
> 
> The build always runs out of RAM on my laptop (4g of RAM, Arch Linux) when I 
> try to build with Scala 2.11 support. No matter how I tweak JVM flags to 
> reduce maximum RAM use, the build always crashes.
> 
> When trying to build Spark 1.6.0 for Scala 2.10 just now, the build had 
> compilation errors. Here is one, as a sample. I've saved the rest:
> 
> [error] 
> /home/colin/building/apache-spark/spark-1.6.0/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkJLineReader.scala:16:
>  object jline is not a member of package tools
> [error] import scala.tools.jline.console.completer._
> 
> It informs me:
> 
> [ERROR] After correcting the problems, you can resume the build with the 
> command
> [ERROR]   mvn  -rf :spark-repl_2.10
> 
> I don't feel safe doing that, given that I don't know what my "" are. 
> 
> I've noticed that the build is compiling a lot of things I have no interest 
> in. Is it possible to just compile the Spark core, its tools, and MLlib? I 
> just want to experiment, and this is causing me a  lot of stress.
> 
> Thank you kindly,
> Colin



Re: Read from AWS s3 with out having to hard-code sensitive keys

2016-01-11 Thread Matei Zaharia
In production, I'd recommend using IAM roles to avoid having keys altogether. 
Take a look at 
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html.
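
With a role attached to the instances, the read side needs no keys at all. A rough sketch (the bucket/path are placeholders, and the URI scheme depends on which S3 connector your Hadoop build provides):

    // No credentials in code or config: the AWS default credential chain
    // picks up the instance profile (IAM role) at runtime.
    val data = sc.textFile("s3a://my-bucket/test/testdata")
    data.take(5).foreach(println)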

Matei

> On Jan 11, 2016, at 11:32 AM, Sabarish Sasidharan 
>  wrote:
> 
> If you are on EMR, these can go into your hdfs site config. And will work 
> with Spark on YARN by default.
> 
> Regards
> Sab
> 
> On 11-Jan-2016 5:16 pm, "Krishna Rao"  > wrote:
> Hi all,
> 
> Is there a method for reading from s3 without having to hard-code keys? The 
> only 2 ways I've found both require this:
> 
> 1. Set conf in code e.g.: 
> sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "")
> sc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "") 
> 
> 2. Set keys in URL, e.g.:
> sc.textFile("s3n://@/bucket/test/testdata")
> 
> 
> Both of which I'm reluctant to do within production code!
> 
> 
> Cheers



Re: How to compile Spark with customized Hadoop?

2015-10-09 Thread Matei Zaharia
You can publish your version of Hadoop to your local Maven cache with mvn install 
(just give it a different version number, e.g. 2.7.0a) and then pass that as 
the Hadoop version to Spark's build (see 
http://spark.apache.org/docs/latest/building-spark.html).

Matei

> On Oct 9, 2015, at 3:10 PM, Dogtail L  wrote:
> 
> Hi all,
> 
> I have modified Hadoop source code, and I want to compile Spark with my 
> modified Hadoop. Do you know how to do that? Great thanks!



Re: Ranger-like Security on Spark

2015-09-03 Thread Matei Zaharia
If you run on YARN, you can use Kerberos, be authenticated as the right user, 
etc in the same way as MapReduce jobs.

Matei

> On Sep 3, 2015, at 1:37 PM, Daniel Schulz  
> wrote:
> 
> Hi,
> 
> I really enjoy using Spark. An obstacle to selling it to our clients currently 
> is the missing Kerberos-like security on a Hadoop cluster with simple authentication. 
> Are there plans, a proposal, or a project to deliver a Ranger plugin or 
> something similar for Spark? The goal is to differentiate users and their 
> privileges when reading and writing data to HDFS. Is Kerberos my only option 
> then?
> 
> Kind regards, Daniel.
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Ranger-like Security on Spark

2015-09-03 Thread Matei Zaharia
Even simple Spark-on-YARN should run as the user that submitted the job, yes, 
so HDFS ACLs should be enforced. Not sure how it plays with the rest of Ranger.

Matei

> On Sep 3, 2015, at 4:57 PM, Jörn Franke <jornfra...@gmail.com> wrote:
> 
> Well, if it needs to read from HDFS then it will adhere to the permissions 
> defined there and/or in Ranger. However, I am not aware that you can protect 
> DataFrames, tables or streams in general in Spark.
> 
> Le jeu. 3 sept. 2015 à 21:47, Daniel Schulz <danielschulz2...@hotmail.com 
> <mailto:danielschulz2...@hotmail.com>> a écrit :
> Hi Matei,
> 
> Thanks for your answer.
> 
> My question is regarding simple authenticated Spark-on-YARN only, without 
> Kerberos. So when I run Spark on YARN and HDFS, Spark will pass through my 
> HDFS user and only be able to access files I am entitled to read/write? Will 
> it enforce HDFS ACLs and Ranger policies as well?
> 
> Best regards, Daniel.
> 
> > On 03 Sep 2015, at 21:16, Matei Zaharia <matei.zaha...@gmail.com 
> > <mailto:matei.zaha...@gmail.com>> wrote:
> >
> > If you run on YARN, you can use Kerberos, be authenticated as the right 
> > user, etc in the same way as MapReduce jobs.
> >
> > Matei
> >
> >> On Sep 3, 2015, at 1:37 PM, Daniel Schulz <danielschulz2...@hotmail.com 
> >> <mailto:danielschulz2...@hotmail.com>> wrote:
> >>
> >> Hi,
> >>
> >> I really enjoy using Spark. An obstacle to sell it to our clients 
> >> currently is the missing Kerberos-like security on a Hadoop with simple 
> >> authentication. Are there plans, a proposal, or a project to deliver a 
> >> Ranger plugin or something similar to Spark. The target is to 
> >> differentiate users and their privileges when reading and writing data to 
> >> HDFS? Is Kerberos my only option then?
> >>
> >> Kind regards, Daniel.
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> >> <mailto:user-unsubscr...@spark.apache.org>
> >> For additional commands, e-mail: user-h...@spark.apache.org 
> >> <mailto:user-h...@spark.apache.org>
> >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> > <mailto:user-unsubscr...@spark.apache.org>
> > For additional commands, e-mail: user-h...@spark.apache.org 
> > <mailto:user-h...@spark.apache.org>
> >
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> For additional commands, e-mail: user-h...@spark.apache.org 
> <mailto:user-h...@spark.apache.org>
> 



Re: work around Size exceeds Integer.MAX_VALUE

2015-07-09 Thread Matei Zaharia
This means that one of your cached RDD partitions is bigger than 2 GB of data. 
You can fix it by having more partitions. If you read data from a file system 
like HDFS or S3, set the number of partitions higher in the sc.textFile, 
hadoopFile, etc methods (it's an optional second parameter to those methods). 
If you create it through parallelize or if this particular RDD comes from a 
shuffle, use more tasks in the parallelize or shuffle.
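
A rough sketch of both options (the path and partition counts are placeholders):

    import org.apache.spark.storage.StorageLevel

    // Second argument = minimum number of partitions when reading.
    val rdd = sc.textFile("hdfs:///data/input", 2000)

    // For an existing RDD (e.g. from parallelize or a shuffle), raise the
    // partition count before caching so each partition stays well under 2 GB.
    val repartitioned = rdd.repartition(2000).persist(StorageLevel.MEMORY_AND_DISK)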

Matei

 On Jul 9, 2015, at 3:35 PM, Michal Čizmazia mici...@gmail.com wrote:
 
 Spark version 1.4.0 in the Standalone mode
 
 2015-07-09 20:12:02 INFO  (sparkDriver-akka.actor.default-dispatcher-3) 
 BlockManagerInfo:59 - Added rdd_0_0 on disk on localhost:51132 (size: 29.8 GB)
 2015-07-09 20:12:02 ERROR (Executor task launch worker-0) Executor:96 - 
 Exception in task 0.0 in stage 0.0 (TID 0)
 java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
 at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
 at 
 org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
 at 
 org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
 at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127)
 at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134)
 at 
 org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:509)
 at 
 org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:427)
 at org.apache.spark.storage.BlockManager.get(BlockManager.scala:615)
 at 
 org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:154)
 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
 
 
 On 9 July 2015 at 18:11, Ted Yu yuzhih...@gmail.com 
 mailto:yuzhih...@gmail.com wrote:
 Which release of Spark are you using ?
 
 Can you show the complete stack trace ?
 
 getBytes() could be called from:
 getBytes(file, 0, file.length)
 or:
 getBytes(segment.file, segment.offset, segment.length)
 
 Cheers
 
 On Thu, Jul 9, 2015 at 2:50 PM, Michal Čizmazia mici...@gmail.com 
 mailto:mici...@gmail.com wrote:
 Please could anyone give me pointers for appropriate SparkConf to work around 
 Size exceeds Integer.MAX_VALUE?
 
 Stacktrace:
 
 2015-07-09 20:12:02 INFO  (sparkDriver-akka.actor.default-dispatcher-3) 
 BlockManagerInfo:59 - Added rdd_0_0 on disk on localhost:51132 (size: 29.8 GB)
 2015-07-09 20:12:02 ERROR (Executor task launch worker-0) Executor:96 - 
 Exception in task 0.0 in stage 0.0 (TID 0)
 java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
 at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836)
 at 
 org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125)
 ...
 
 
 



Re: Spark or Storm

2015-06-17 Thread Matei Zaharia
This documentation is only for writes to an external system, but all the 
counting you do within your streaming app (e.g. if you use reduceByKeyAndWindow 
to keep track of a running count) is exactly-once. When you write to a storage 
system, no matter which streaming framework you use, you'll have to make sure 
the writes are idempotent, because the storage system can't know whether you 
meant to write the same data again or not. But the place where Spark Streaming 
helps over Storm, etc is for tracking state within your computation. Without 
that facility, you'd not only have to make sure that writes are idempotent, but 
you'd have to make sure that updates to your own internal state (e.g. 
reduceByKeyAndWindow) are exactly-once too.
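
As a rough, untested sketch: the windowed count below is tracked exactly-once by Spark Streaming itself, and the write is made safe by upserting on a (word, batch time) key. Here pairs is an assumed DStream[(String, Long)] and upsert() is a hypothetical idempotent sink:

    import org.apache.spark.streaming.Seconds

    val counts = pairs.reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))

    counts.foreachRDD { (rdd, time) =>
      rdd.foreachPartition { iter =>
        iter.foreach { case (word, count) =>
          // Keyed on (word, batch time): a replayed batch overwrites the same
          // row instead of double-counting.
          upsert(word, time.milliseconds, count)
        }
      }
    }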

Matei

 On Jun 17, 2015, at 8:26 AM, Enno Shioji eshi...@gmail.com wrote:
 
 The thing is, even with that improvement, you still have to make updates 
 idempotent or transactional yourself. If you read 
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics
  
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics
 
 that refers to the latest version, it says:
 
 Semantics of output operations
 Output operations (like foreachRDD) have at-least once semantics, that is, 
 the transformed data may get written to an external entity more than once in 
 the event of a worker failure. While this is acceptable for saving to file 
 systems using the saveAs***Files operations (as the file will simply get 
 overwritten with the same data), additional effort may be necessary to 
 achieve exactly-once semantics. There are two approaches.
 
 Idempotent updates: Multiple attempts always write the same data. For 
 example, saveAs***Files always writes the same data to the generated files.
 
 Transactional updates: All updates are made transactionally so that updates 
 are made exactly once atomically. One way to do this would be the following.
 
 Use the batch time (available in foreachRDD) and the partition index of the 
 transformed RDD to create an identifier. This identifier uniquely identifies 
 a blob data in the streaming application.
 Update external system with this blob transactionally (that is, exactly once, 
 atomically) using the identifier. That is, if the identifier is not already 
 committed, commit the partition data and the identifier atomically. Else if 
 this was already committed, skip the update.
 
 So either you make the update idempotent, or you have to make it 
 transactional yourself, and the suggested mechanism is very similar to what 
 Storm does.
 
 
 
 
 On Wed, Jun 17, 2015 at 3:51 PM, Ashish Soni asoni.le...@gmail.com 
 mailto:asoni.le...@gmail.com wrote:
 @Enno 
 As per the latest version and documentation, Spark Streaming does offer 
 exactly-once semantics using the improved Kafka integration, but I have not 
 tested it yet.
 
 Any feedback will be helpful if anyone is tried the same.
 
 http://koeninger.github.io/kafka-exactly-once/#7 
 http://koeninger.github.io/kafka-exactly-once/#7
 
 https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
  
 https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
 
 
 
 On Wed, Jun 17, 2015 at 10:33 AM, Enno Shioji eshi...@gmail.com 
 mailto:eshi...@gmail.com wrote:
 AFAIK KCL is *supposed* to provide fault tolerance and load balancing (plus 
 additionally, elastic scaling unlike Storm), Kinesis providing the 
 coordination. My understanding is that it's like a naked Storm worker process 
 that can consequently only do map.
 
 I haven't really used it tho, so can't really comment how it compares to 
 Spark/Storm. Maybe somebody else will be able to comment.
 
 
 
 On Wed, Jun 17, 2015 at 3:13 PM, ayan guha guha.a...@gmail.com 
 mailto:guha.a...@gmail.com wrote:
 Thanks for this. It's a KCL-based Kinesis application. But because it's just a 
 Java application, we are thinking of using Spark on EMR or Storm for fault 
 tolerance and load balancing. Is that a correct approach?
 
 On 17 Jun 2015 23:07, Enno Shioji eshi...@gmail.com 
 mailto:eshi...@gmail.com wrote:
 Hi Ayan,
 
 Admittedly I haven't done much with Kinesis, but if I'm not mistaken you 
 should be able to use their processor interface for that. In this example, 
 it's incrementing a counter: 
 https://github.com/awslabs/amazon-kinesis-data-visualization-sample/blob/master/src/main/java/com/amazonaws/services/kinesis/samples/datavis/kcl/CountingRecordProcessor.java
  
 https://github.com/awslabs/amazon-kinesis-data-visualization-sample/blob/master/src/main/java/com/amazonaws/services/kinesis/samples/datavis/kcl/CountingRecordProcessor.java
 
 Instead of incrementing a counter, you could do your transformation and send 
 it to HBase.
 
 
 
 
 
 
 On Wed, Jun 17, 2015 at 1:40 PM, ayan guha guha.a...@gmail.com 
 mailto:guha.a...@gmail.com wrote:
 Great discussion!!
 
 One qs about some comment: Also, you can do 

Re: Spark or Storm

2015-06-17 Thread Matei Zaharia
The major difference is that in Spark Streaming, there's no *need* for a 
TridentState for state inside your computation. All the stateful operations 
(reduceByWindow, updateStateByKey, etc) automatically handle exactly-once 
processing, keeping updates in order, etc. Also, you don't need to run a 
separate transactional system (e.g. MySQL) to store the state.

After your computation runs, if you want to write the final results (e.g. the 
counts you've been tracking) to a storage system, you use one of the output 
operations (saveAsFiles, foreach, etc). Those actually will run in order, but 
some might run multiple times if nodes fail, thus attempting to write the same 
state again.

You can read about how it works in this research paper: 
http://people.csail.mit.edu/matei/papers/2013/sosp_spark_streaming.pdf.

Matei

 On Jun 17, 2015, at 11:49 AM, Enno Shioji eshi...@gmail.com wrote:
 
 Hi Matei,
 
 
 Ah, can't get more accurate than from the horse's mouth... If you don't mind 
 helping me understand it correctly..
 
 From what I understand, Storm Trident does the following (when used with 
 Kafka):
 1) Sit on Kafka Spout and create batches
 2) Assign global sequential ID to the batches
 3) Make sure that all result of processed batches are written once to 
 TridentState, in order (for example, by skipping batches that were already 
 applied once, ultimately by using Zookeeper)
 
 TridentState is an interface that you have to implement, and the underlying 
 storage has to be transactional for this to work. The necessary skipping etc. 
 is handled by Storm.
 
 In case of Spark Streaming, I understand that
 1) There is no global ordering; e.g. an output operation for batch consisting 
 of offset [4,5,6] can be invoked before the operation for offset [1,2,3]
 2) If you wanted to achieve something similar to what TridentState does, 
 you'll have to do it yourself (for example using Zookeeper)
 
 Is this a correct understanding?
 
 
 
 
 On Wed, Jun 17, 2015 at 7:14 PM, Matei Zaharia matei.zaha...@gmail.com 
 mailto:matei.zaha...@gmail.com wrote:
 This documentation is only for writes to an external system, but all the 
 counting you do within your streaming app (e.g. if you use 
 reduceByKeyAndWindow to keep track of a running count) is exactly-once. When 
 you write to a storage system, no matter which streaming framework you use, 
 you'll have to make sure the writes are idempotent, because the storage 
 system can't know whether you meant to write the same data again or not. But 
 the place where Spark Streaming helps over Storm, etc is for tracking state 
 within your computation. Without that facility, you'd not only have to make 
 sure that writes are idempotent, but you'd have to make sure that updates to 
 your own internal state (e.g. reduceByKeyAndWindow) are exactly-once too.
 
 Matei
 
 
 On Jun 17, 2015, at 8:26 AM, Enno Shioji eshi...@gmail.com 
 mailto:eshi...@gmail.com wrote:
 
 The thing is, even with that improvement, you still have to make updates 
 idempotent or transactional yourself. If you read 
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics
  
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#fault-tolerance-semantics
 
 that refers to the latest version, it says:
 
 Semantics of output operations
 Output operations (like foreachRDD) have at-least once semantics, that is, 
 the transformed data may get written to an external entity more than once in 
 the event of a worker failure. While this is acceptable for saving to file 
 systems using the saveAs***Files operations (as the file will simply get 
 overwritten with the same data), additional effort may be necessary to 
 achieve exactly-once semantics. There are two approaches.
 
 Idempotent updates: Multiple attempts always write the same data. For 
 example, saveAs***Files always writes the same data to the generated files.
 
 Transactional updates: All updates are made transactionally so that updates 
 are made exactly once atomically. One way to do this would be the following.
 
 Use the batch time (available in foreachRDD) and the partition index of the 
 transformed RDD to create an identifier. This identifier uniquely identifies 
 a blob data in the streaming application.
 Update external system with this blob transactionally (that is, exactly 
 once, atomically) using the identifier. That is, if the identifier is not 
 already committed, commit the partition data and the identifier atomically. 
 Else if this was already committed, skip the update.
 
 So either you make the update idempotent, or you have to make it 
 transactional yourself, and the suggested mechanism is very similar to what 
 Storm does.
 
 
 
 
 On Wed, Jun 17, 2015 at 3:51 PM, Ashish Soni asoni.le...@gmail.com 
 mailto:asoni.le...@gmail.com wrote:
 @Enno 
 As per the latest version and documentation Spark Streaming

Re: Equivalent to Storm's 'field grouping' in Spark.

2015-06-03 Thread Matei Zaharia
This happens automatically when you use the byKey operations, e.g. reduceByKey, 
updateStateByKey, etc. Spark Streaming keeps the state for a given set of keys 
on a specific node and sends new tuples with that key to that node.
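
A rough sketch (words is an assumed DStream[String], pairsRdd an assumed key/value RDD; updateStateByKey also needs a checkpoint directory set):

    // All updates for a given word land on the node holding that word's state.
    val counts = words.map(w => (w, 1L)).updateStateByKey[Long] {
      (newValues: Seq[Long], state: Option[Long]) =>
        Some(state.getOrElse(0L) + newValues.sum)
    }

    // For plain RDDs, partitionBy co-locates equal keys the same way.
    val partitioned = pairsRdd.partitionBy(new org.apache.spark.HashPartitioner(16))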

Matei

 On Jun 3, 2015, at 6:31 AM, allonsy luke1...@gmail.com wrote:
 
 Hi everybody,
 is there anything in Spark that shares the philosophy of Storm's field grouping?
 
 I'd like to manage data partitioning across the workers by sending tuples
 sharing the same key to the very same worker in the cluster, but I did not
 find any method to do that.
 
 Suggestions?
 
 :)
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Equivalent-to-Storm-s-field-grouping-in-Spark-tp23135.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: map - reduce only with disk

2015-06-02 Thread Matei Zaharia
You shouldn't have to persist the RDD at all, just call flatMap and reduce on 
it directly. If you try to persist it, that will try to load the original dat 
into memory, but here you are only scanning through it once.
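
A sketch of the disk-to-disk version (paths are placeholders, and outputPairsOfWords is reconstructed here only as an illustrative stand-in for the pair-generating function in the thread):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // reduceByKey on pre-1.3 Spark

// Illustrative stand-in: emit ((word1, word2), 1) for each co-occurring pair in a line.
def outputPairsOfWords(line: String): Seq[((String, String), Int)] = {
  val words = line.split("\\s+").toSeq
  for (a <- words; b <- words if a < b) yield ((a, b), 1)
}

def countPairs(sc: SparkContext): Unit = {
  // No cache()/persist(): the input is streamed from disk once and only the
  // reduced pair counts are kept around.
  sc.textFile("hdfs:///input/docs")                  // placeholder path
    .flatMap(line => outputPairsOfWords(line))
    .reduceByKey(_ + _)
    .filter { case (_, count) => count >= 5 }
    .saveAsTextFile("hdfs:///output/pair-counts")    // placeholder path
}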

Matei

 On Jun 2, 2015, at 2:09 AM, Octavian Ganea octavian.ga...@inf.ethz.ch wrote:
 
 Thanks,
 
 I was actually using reduceByKey, not groupByKey. 
 
 I also tried this: rdd.persist(MEMORY_AND_DISK).flatMap(...).reduceByKey . 
 However, I got the same error as before, namely the error described here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/flatMap-output-on-disk-flatMap-memory-overhead-td23098.html
  
 http://apache-spark-user-list.1001560.n3.nabble.com/flatMap-output-on-disk-flatMap-memory-overhead-td23098.html
 
 My task is to count the frequencies of pairs of words that occur in a set of 
 documents at least 5 times. I know that this final output is sparse and 
 should comfortably fit in memory. However, the intermediate pairs that are 
 spilled by flatMap might need to be stored on the disk, but I don't 
 understand why the persist option does not work and my job fails.
 
 My code:
 
rdd.persist(StorageLevel.MEMORY_AND_DISK)
  .flatMap(x => outputPairsOfWords(x)) // outputs pairs of type ((word1, word2), 1)
  .reduceByKey((a, b) => (a + b).toShort)
  .filter({ case ((x, y), count) => count >= 5 })
  
 
 My cluster has 8 nodes, each with 129 GB of RAM and 16 cores per node. One 
 node I keep for the master, 7 nodes for the workers.
 
 my conf:
 
conf.set("spark.cores.max", "128")
conf.set("spark.akka.frameSize", "1024")
conf.set("spark.executor.memory", "115g")
conf.set("spark.shuffle.file.buffer.kb", "1000")
 
 my spark-env.sh:
  ulimit -n 20
  SPARK_JAVA_OPTS=-Xss1g -Xmx129g -d64 -XX:-UseGCOverheadLimit 
 -XX:-UseCompressedOops
  SPARK_DRIVER_MEMORY=129G
 
 spark version: 1.1.1
 
 Thank you a lot for your help!
 
 
 2015-06-02 4:40 GMT+02:00 Matei Zaharia matei.zaha...@gmail.com 
 mailto:matei.zaha...@gmail.com:
 As long as you don't use cache(), these operations will go from disk to disk, 
 and will only use a fixed amount of memory to build some intermediate 
 results. However, note that because you're using groupByKey, that needs the 
 values for each key to all fit in memory at once. In this case, if you're 
 going to reduce right after, you should use reduceByKey, which will be more 
 efficient.
 
 Matei
 
  On Jun 1, 2015, at 2:21 PM, octavian.ganea octavian.ga...@inf.ethz.ch 
  mailto:octavian.ga...@inf.ethz.ch wrote:
 
  Dear all,
 
  Does anyone know how can I force Spark to use only the disk when doing a
  simple flatMap(..).groupByKey.reduce(_ + _) ? Thank you!
 
 
 
  --
  View this message in context: 
  http://apache-spark-user-list.1001560.n3.nabble.com/map-reduce-only-with-disk-tp23102.html
   
  http://apache-spark-user-list.1001560.n3.nabble.com/map-reduce-only-with-disk-tp23102.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
  mailto:user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org 
  mailto:user-h...@spark.apache.org
 
 
 
 
 
 -- 
 Octavian Ganea
 
 Research assistant at ETH Zurich
 octavian.ga...@inf.ethz.ch mailto:octavian.ga...@inf.ethz.ch
 http://da.inf.ethz.ch/people/OctavianGanea/ 
 http://da.inf.ethz.ch/people/OctavianGanea/



Re: map - reduce only with disk

2015-06-02 Thread Matei Zaharia
Yup, exactly.

All the workers will use local disk in addition to RAM, but maybe one thing you 
need to configure is the directory to use for that. It should be set through 
spark.local.dir. By default it's /tmp, which on some machines is also in RAM, 
so that could be a problem. You should set it to a folder on a real disk, or 
even better, a comma-separated list of disks (e.g. /mnt1,/mnt2) if you have 
multiple disks.
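
For example, a sketch of pointing it at real disks (the mount points are assumptions; use whatever actually exists on the workers):

import org.apache.spark.SparkConf

// Put shuffle/spill scratch space on real disks rather than a RAM-backed /tmp.
val conf = new SparkConf()
  .setAppName("pair-counts")
  .set("spark.local.dir", "/mnt1/spark-local,/mnt2/spark-local")   // assumed mount points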

Matei

 On Jun 2, 2015, at 1:03 PM, Octavian Ganea octavian.ga...@inf.ethz.ch wrote:
 
 Thanks a lot! 
 
 So my understanding is that you call persist only if you need to use the rdd 
 at least twice to compute different things. So, if I just need the RDD for a 
 single scan , like in a simple flatMap(..).reduceByKey(..).saveAsTextFile(..) 
 how do I force the slaves to use the hard-disk (in addition to the RAM) when 
 the RAM is full and not to fail like they do now?
 
 Thank you! 
 
 2015-06-02 21:25 GMT+02:00 Matei Zaharia matei.zaha...@gmail.com 
 mailto:matei.zaha...@gmail.com:
 You shouldn't have to persist the RDD at all, just call flatMap and reduce on 
it directly. If you try to persist it, that will try to load the original data 
 into memory, but here you are only scanning through it once.
 
 Matei
 
 On Jun 2, 2015, at 2:09 AM, Octavian Ganea octavian.ga...@inf.ethz.ch 
 mailto:octavian.ga...@inf.ethz.ch wrote:
 
 Thanks,
 
 I was actually using reduceByKey, not groupByKey. 
 
 I also tried this: rdd.persist(MEMORY_AND_DISK).flatMap(...).reduceByKey . 
 However, I got the same error as before, namely the error described here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/flatMap-output-on-disk-flatMap-memory-overhead-td23098.html
  
 http://apache-spark-user-list.1001560.n3.nabble.com/flatMap-output-on-disk-flatMap-memory-overhead-td23098.html
 
 My task is to count the frequencies of pairs of words that occur in a set of 
 documents at least 5 times. I know that this final output is sparse and 
 should comfortably fit in memory. However, the intermediate pairs that are 
 spilled by flatMap might need to be stored on the disk, but I don't 
 understand why the persist option does not work and my job fails.
 
 My code:
 
rdd.persist(StorageLevel.MEMORY_AND_DISK)
  .flatMap(x => outputPairsOfWords(x)) // outputs pairs of type ((word1, word2), 1)
  .reduceByKey((a, b) => (a + b).toShort)
  .filter({ case ((x, y), count) => count >= 5 })
  
 
 My cluster has 8 nodes, each with 129 GB of RAM and 16 cores per node. One 
 node I keep for the master, 7 nodes for the workers.
 
 my conf:
 
conf.set("spark.cores.max", "128")
conf.set("spark.akka.frameSize", "1024")
conf.set("spark.executor.memory", "115g")
conf.set("spark.shuffle.file.buffer.kb", "1000")
 
 my spark-env.sh:
  ulimit -n 20
  SPARK_JAVA_OPTS=-Xss1g -Xmx129g -d64 -XX:-UseGCOverheadLimit 
 -XX:-UseCompressedOops
  SPARK_DRIVER_MEMORY=129G
 
 spark version: 1.1.1
 
 Thank you a lot for your help!
 
 
 2015-06-02 4:40 GMT+02:00 Matei Zaharia matei.zaha...@gmail.com 
 mailto:matei.zaha...@gmail.com:
 As long as you don't use cache(), these operations will go from disk to 
 disk, and will only use a fixed amount of memory to build some intermediate 
 results. However, note that because you're using groupByKey, that needs the 
 values for each key to all fit in memory at once. In this case, if you're 
 going to reduce right after, you should use reduceByKey, which will be more 
 efficient.
 
 Matei
 
  On Jun 1, 2015, at 2:21 PM, octavian.ganea octavian.ga...@inf.ethz.ch 
  mailto:octavian.ga...@inf.ethz.ch wrote:
 
  Dear all,
 
  Does anyone know how can I force Spark to use only the disk when doing a
  simple flatMap(..).groupByKey.reduce(_ + _) ? Thank you!
 
 
 
  --
  View this message in context: 
  http://apache-spark-user-list.1001560.n3.nabble.com/map-reduce-only-with-disk-tp23102.html
   
  http://apache-spark-user-list.1001560.n3.nabble.com/map-reduce-only-with-disk-tp23102.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com 
  http://nabble.com/.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
  mailto:user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org 
  mailto:user-h...@spark.apache.org
 
 
 
 
 
 -- 
 Octavian Ganea
 
 Research assistant at ETH Zurich
 octavian.ga...@inf.ethz.ch mailto:octavian.ga...@inf.ethz.ch
 http://da.inf.ethz.ch/people/OctavianGanea/ 
 http://da.inf.ethz.ch/people/OctavianGanea/
 
 
 
 
 -- 
 Octavian Ganea
 
 Research assistant at ETH Zurich
 octavian.ga...@inf.ethz.ch mailto:octavian.ga...@inf.ethz.ch
 http://da.inf.ethz.ch/people/OctavianGanea/ 
 http://da.inf.ethz.ch/people/OctavianGanea/



Re: Spark logo license

2015-05-19 Thread Matei Zaharia
Check out Apache's trademark guidelines here: http://www.apache.org/foundation/marks/

Matei

 On May 20, 2015, at 12:02 AM, Justin Pihony justin.pih...@gmail.com wrote:
 
 What is the license on using the spark logo. Is it free to be used for
 displaying commercially?
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-logo-license-tp22952.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 



Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Matei Zaharia
Hey Tom,

Are you using the fine-grained or coarse-grained scheduler? For the 
coarse-grained scheduler, there is a spark.cores.max config setting that will 
limit the total # of cores it grabs. This was there in earlier versions too.
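
For example, with the coarse-grained scheduler (the master URL and the cap are illustrative only):

import org.apache.spark.SparkConf

// Cap the total number of cores this application will grab from the cluster.
val conf = new SparkConf()
  .setMaster("mesos://zk://host1:2181/mesos")   // placeholder Mesos master URL
  .set("spark.cores.max", "48")                 // illustrative cap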

Matei

 On May 19, 2015, at 12:39 PM, Thomas Dudziak tom...@gmail.com wrote:
 
 I read the other day that there will be a fair number of improvements in 1.4 
 for Mesos. Could I ask for one more (if it isn't already in there): a 
 configurable limit for the number of tasks for jobs run on Mesos ? This would 
 be a very simple yet effective way to prevent a job dominating the cluster.
 
 cheers,
 Tom
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Matei Zaharia
Yeah, this definitely seems useful there. There might also be some ways to cap 
the application in Mesos, but I'm not sure.

Matei

 On May 19, 2015, at 1:11 PM, Thomas Dudziak tom...@gmail.com wrote:
 
 I'm using fine-grained for a multi-tenant environment which is why I would 
 welcome the limit of tasks per job :)
 
 cheers,
 Tom
 
 On Tue, May 19, 2015 at 10:05 AM, Matei Zaharia matei.zaha...@gmail.com 
 mailto:matei.zaha...@gmail.com wrote:
 Hey Tom,
 
 Are you using the fine-grained or coarse-grained scheduler? For the 
 coarse-grained scheduler, there is a spark.cores.max config setting that will 
 limit the total # of cores it grabs. This was there in earlier versions too.
 
 Matei
 
  On May 19, 2015, at 12:39 PM, Thomas Dudziak tom...@gmail.com 
  mailto:tom...@gmail.com wrote:
 
  I read the other day that there will be a fair number of improvements in 
  1.4 for Mesos. Could I ask for one more (if it isn't already in there): a 
  configurable limit for the number of tasks for jobs run on Mesos ? This 
  would be a very simple yet effective way to prevent a job dominating the 
  cluster.
 
  cheers,
  Tom
 
 
 



Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Matei Zaharia
...This is madness!

 On May 14, 2015, at 9:31 AM, dmoralesdf dmora...@stratio.com wrote:
 
 Hi there,
 
 We have released our real-time aggregation engine based on Spark Streaming.
 
 SPARKTA is fully open source (Apache2)
 
 
 You can checkout the slides showed up at the Strata past week:
 
 http://www.slideshare.net/Stratio/strata-sparkta
 
 Source code:
 
 https://github.com/Stratio/sparkta
 
 And documentation
 
 http://docs.stratio.com/modules/sparkta/development/
 
 
 We are open to your ideas and contributors are welcomed.
 
 
 Regards.
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/SPARKTA-a-real-time-aggregation-engine-based-on-Spark-Streaming-tp22883.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Matei Zaharia
(Sorry, for non-English people: that means it's a good thing.)

Matei

 On May 14, 2015, at 10:53 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
 
 ...This is madness!
 
 On May 14, 2015, at 9:31 AM, dmoralesdf dmora...@stratio.com wrote:
 
 Hi there,
 
 We have released our real-time aggregation engine based on Spark Streaming.
 
 SPARKTA is fully open source (Apache2)
 
 
 You can checkout the slides showed up at the Strata past week:
 
 http://www.slideshare.net/Stratio/strata-sparkta
 
 Source code:
 
 https://github.com/Stratio/sparkta
 
 And documentation
 
 http://docs.stratio.com/modules/sparkta/development/
 
 
 We are open to your ideas and contributors are welcomed.
 
 
 Regards.
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/SPARKTA-a-real-time-aggregation-engine-based-on-Spark-Streaming-tp22883.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-12 Thread Matei Zaharia
It could also be that your hash function is expensive. What is the key class 
you have for the reduceByKey / groupByKey?
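
One way an expensive hash function shows up is a key class that rehashes a large field on every probe; a sketch of a possible mitigation (the key class itself is hypothetical) is to compute the hash once:

// Hypothetical composite key: hashing the whole array on every probe would be costly.
class FeatureKey(val userId: Long, val features: Array[Double]) extends Serializable {
  // Compute the hash a single time and reuse it for every probe in the hash map.
  override val hashCode: Int =
    41 * userId.hashCode + java.util.Arrays.hashCode(features)
  override def equals(other: Any): Boolean = other match {
    case that: FeatureKey =>
      userId == that.userId && java.util.Arrays.equals(features, that.features)
    case _ => false
  }
}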

Matei

 On May 12, 2015, at 10:08 AM, Night Wolf nightwolf...@gmail.com wrote:
 
 I'm seeing a similar thing with a slightly different stack trace. Ideas?
 
 org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:150)
 org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:205)
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 org.apache.spark.scheduler.Task.run(Task.scala:64)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 
 On Tue, May 12, 2015 at 5:55 AM, Reynold Xin r...@databricks.com 
 mailto:r...@databricks.com wrote:
 Looks like it is spending a lot of time doing hash probing. It could be a
 number of the following:
 
 1. hash probing itself is inherently expensive compared with rest of your
 workload
 
 2. murmur3 doesn't work well with this key distribution
 
 3. quadratic probing (triangular sequence) with a power-of-2 hash table
 works really badly for this workload.
 
 One way to test this is to instrument changeValue function to store the
 number of probes in total, and then log it. We added this probing
 capability to the new Bytes2Bytes hash map we built. We should consider
 just having it being reported as some built-in metrics to facilitate
 debugging.
 
 https://github.com/apache/spark/blob/b83091ae4589feea78b056827bc3b7659d271e41/unsafe/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java#L214
  
 https://github.com/apache/spark/blob/b83091ae4589feea78b056827bc3b7659d271e41/unsafe/src/main/java/org/apache/spark/unsafe/map/BytesToBytesMap.java#L214
 
 
 
 
 
 
 On Mon, May 11, 2015 at 4:21 AM, Michal Haris michal.ha...@visualdna.com 
 mailto:michal.ha...@visualdna.com
 wrote:
 
  This is the stack trace of the worker thread:
 
 
  org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:150)
 
  org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
 
  org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:130)
  org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:60)
 
  org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
  org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
  org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
  org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
  org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
  org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
  org.apache.spark.scheduler.Task.run(Task.scala:64)
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
 
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  java.lang.Thread.run(Thread.java:745)
 
  On 8 May 2015 at 22:12, Josh Rosen rosenvi...@gmail.com 
  mailto:rosenvi...@gmail.com wrote:
 
  Do you have any more specific profiling data that you can share?  I'm
  curious to know where AppendOnlyMap.changeValue is being called from.
 
  On Fri, May 8, 2015 at 1:26 PM, Michal Haris michal.ha...@visualdna.com 
  mailto:michal.ha...@visualdna.com
  wrote:
 
  +dev
  On 6 May 2015 10:45, Michal Haris michal.ha...@visualdna.com 
  mailto:michal.ha...@visualdna.com wrote:
 
   Just wanted to check if somebody has seen similar behaviour or knows
  what
   we might be doing wrong. We have a relatively complex spark application
   which processes half a terabyte of data at various stages. We have
  profiled
   it in several ways and everything seems to point to one place where
  90% of
   the time is spent:  AppendOnlyMap.changeValue. The job scales and is
   relatively faster than its map-reduce alternative but it still 

Re: Spark on Windows

2015-04-16 Thread Matei Zaharia
You could build Spark with Scala 2.11 on Mac / Linux and transfer it over to 
Windows. AFAIK it should build on Windows too; the only problem is that Maven 
might take a long time to download dependencies. What errors are you seeing?

Matei

 On Apr 16, 2015, at 9:23 AM, Arun Lists lists.a...@gmail.com wrote:
 
 We run Spark on Mac and Linux but also need to run it on Windows 8.1 and  
 Windows Server. We ran into problems with the Scala 2.10 binary bundle for 
 Spark 1.3.0 but managed to get it working. However, on Mac/Linux, we are on 
 Scala 2.11.6 (we built Spark from the sources). On Windows, however despite 
 our best efforts we cannot get Spark 1.3.0 as built from sources working for 
 Scala 2.11.6. Spark has too many moving parts and dependencies!
 
 When can we expect to see a binary bundle for Spark 1.3.0 that is built for 
 Scala 2.11.6?  I read somewhere that the only reason that Spark 1.3.0 is 
 still built for Scala 2.10 is because Kafka is still on Scala 2.10. For those 
 of us who don't use Kafka, can we have a Scala 2.10 bundle.
 
 If there isn't an official bundle arriving any time soon, can someone who has 
 built it for Windows 8.1 successfully please share with the group?
 
 Thanks,
 arun
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Dataset announcement

2015-04-15 Thread Matei Zaharia
Very neat, Olivier; thanks for sharing this.

Matei

 On Apr 15, 2015, at 5:58 PM, Olivier Chapelle oliv...@chapelle.cc wrote:
 
 Dear Spark users,
 
 I would like to draw your attention to a dataset that we recently released,
 which is as of now the largest machine learning dataset ever released; see
 the following blog announcements:
 - http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/
 -
 http://blogs.technet.com/b/machinelearning/archive/2015/04/01/now-available-on-azure-ml-criteo-39-s-1tb-click-prediction-dataset.aspx
 
 The characteristics of this dataset are:
 - 1 TB of data
 - binary classification
 - 13 integer features
 - 26 categorical features, some of them taking millions of values.
 - 4B rows
 
 Hopefully this dataset will be useful to assess and push further the
 scalability of Spark and MLlib.
 
 Cheers,
 Olivier
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Dataset-announcement-tp22507.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: IPyhon notebook command for spark need to be updated?

2015-03-20 Thread Matei Zaharia
Feel free to send a pull request to fix the doc (or say which versions it's 
needed in).

Matei

 On Mar 20, 2015, at 6:49 PM, Krishna Sankar ksanka...@gmail.com wrote:
 
 Yep the command-option is gone. No big deal, just add the '%pylab inline' 
 command as part of your notebook.
 Cheers
 k/
 
 On Fri, Mar 20, 2015 at 3:45 PM, cong yue yuecong1...@gmail.com 
 mailto:yuecong1...@gmail.com wrote:
 Hello :
 
 I tried ipython notebook with the following command in my enviroment.
 
 PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS=notebook
 --pylab inline ./bin/pyspark
 
 But it shows  --pylab inline support is removed from ipython newest version.
 the log is as :
 ---
 $ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS=notebook
 --pylab inline ./bin/pyspark
 [E 15:29:43.076 NotebookApp] Support for specifying --pylab on the
 command line has been removed.
 [E 15:29:43.077 NotebookApp] Please use `%pylab inline` or
 `%matplotlib inline` in the notebook itself.
 --
 I am using IPython 3.0.0. and only IPython works in my enviroment.
 --
 $ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS=notebook
 --pylab inline ./bin/pyspark
 --
 
 Does somebody have the same issue as mine? How do you solve it?
 
 Thanks,
 Cong
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 mailto:user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org 
 mailto:user-h...@spark.apache.org
 
 



Re: Querying JSON in Spark SQL

2015-03-16 Thread Matei Zaharia
The programming guide has a short example: 
http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets. 
Note that once you infer a schema for a JSON dataset, you can also use nested 
path notation (e.g. select user.name from users) in the same way as in Hive.
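
A minimal sketch against the 1.2-era API (the file path and field names are assumptions; the JSON records are assumed to contain a nested "user" struct):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)                // assumes an existing SparkContext `sc`
val users = sqlContext.jsonFile("users.json")      // schema is inferred from the JSON
users.printSchema()
users.registerTempTable("users")
// Nested path notation, same as in Hive:
sqlContext.sql("SELECT user.name FROM users").collect().foreach(println)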

Matei

 On Mar 16, 2015, at 4:47 PM, Fatma Ozcan fatma@gmail.com wrote:
 
 Is there any documentation that explains how to query JSON documents using 
 SparkSQL? 
 
 Thanks,
 Fatma



Re: Berlin Apache Spark Meetup

2015-02-17 Thread Matei Zaharia
Thanks! I've added you.

Matei

 On Feb 17, 2015, at 4:06 PM, Ralph Bergmann | the4thFloor.eu 
 ra...@the4thfloor.eu wrote:
 
 Hi,
 
 
 there is a small Spark Meetup group in Berlin, Germany :-)
 http://www.meetup.com/Berlin-Apache-Spark-Meetup/
 
 Plaes add this group to the Meetups list at
 https://spark.apache.org/community.html
 
 
 Ralph
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Beginner in Spark

2015-02-06 Thread Matei Zaharia
You don't need HDFS or virtual machines to run Spark. You can just download it, 
unzip it and run it on your laptop. See http://spark.apache.org/docs/latest/index.html.

Matei


 On Feb 6, 2015, at 2:58 PM, David Fallside falls...@us.ibm.com wrote:
 
 King, consider trying the Spark Kernel 
 (https://github.com/ibm-et/spark-kernel 
 https://github.com/ibm-et/spark-kernel) which will install Spark etc and 
 provide you with a Spark/Scala Notebook in which you can develop your 
 algorithm. The Vagrant installation described in 
 https://github.com/ibm-et/spark-kernel/wiki/Vagrant-Development-Environment 
 https://github.com/ibm-et/spark-kernel/wiki/Vagrant-Development-Environment 
 will have you quickly up and running on a single machine without having to 
 manage the details of the system installations. There is a Docker version, 
 https://github.com/ibm-et/spark-kernel/wiki/Using-the-Docker-Container-for-the-Spark-Kernel
  
 https://github.com/ibm-et/spark-kernel/wiki/Using-the-Docker-Container-for-the-Spark-Kernel,
  if you prefer Docker.
 Regards,
 David
 
 
 King sami kgsam...@gmail.com wrote on 02/06/2015 08:09:39 AM:
 
  From: King sami kgsam...@gmail.com
  To: user@spark.apache.org
  Date: 02/06/2015 08:11 AM
  Subject: Beginner in Spark
  
  Hi,
  
  I'm new in Spark, I'd like to install Spark with Scala. The aim is 
  to build a data processing system foor door events. 
  
  the first step is install spark, scala, hdfs and other required tools.
  the second is build the algorithm programm in Scala which can treat 
  a file of my data logs (events).
  
  Could you please help me to install the required tools: Spark, 
  Scala, HDF and tell me how can I execute my programm treating the entry 
  file.
  
  Best regards,
 



Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-01-27 Thread Matei Zaharia
I believe this is needed for driver recovery in Spark Streaming. If your Spark 
driver program crashes, Spark Streaming can recover the application by reading 
the set of DStreams and output operations from a checkpoint file (see 
https://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing).
 But to do that, it needs to remember all the operations you're running 
periodically, including those in foreachRDD.
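
This is also why checkpointed streaming programs are structured around StreamingContext.getOrCreate; a minimal sketch (the checkpoint directory, app name and batch interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-app"   // placeholder

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("my-streaming-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // All DStream setup, including foreachRDD calls, goes here so that it can be
  // rebuilt from the checkpoint after a driver failure -- which is why the
  // functions passed to foreachRDD have to be serializable.
  ssc
}

// On restart this reconstructs the DStream graph (and the foreachRDD functions)
// from the checkpoint instead of calling createContext again.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()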

Matei

 On Jan 27, 2015, at 6:15 PM, Tobias Pfeiffer t...@preferred.jp wrote:
 
 Hi,
 
 I want to do something like
 
dstream.foreachRDD(rdd => if (someCondition) ssc.stop())
 
 so in particular the function does not touch any element in the RDD and runs 
 completely within the driver. However, this fails with a 
 NotSerializableException because $outer is not serializable etc. The DStream 
 code says:
 
  def foreachRDD(foreachFunc: (RDD[T], Time) => Unit) {
 // because the DStream is reachable from the outer object here, and 
 because 
 // DStreams can't be serialized with closures, we can't proactively check 
 // it for serializability and so we pass the optional false to 
 SparkContext.clean
 new ForEachDStream(this, context.sparkContext.clean(foreachFunc, 
 false)).register()
   }
 
 To be honest, I don't understand the comment. Why must that function be 
 serializable even when there is no RDD action involved?
 
 Thanks
 Tobias


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark UI and Spark Version on Google Compute Engine

2015-01-17 Thread Matei Zaharia
Unfortunately we don't have anything to do with Spark on GCE, so I'd suggest 
asking in the GCE support forum. You could also try to launch a Spark cluster 
by hand on nodes in there. Sigmoid Analytics published a package for this here: 
http://spark-packages.org/package/9

Matei

 On Jan 17, 2015, at 4:47 PM, Soumya Simanta soumya.sima...@gmail.com wrote:
 
 I'm deploying Spark using the Click to Deploy Hadoop - Install Apache 
 Spark on Google Compute Engine.
 
 I can run Spark jobs on the REPL and read data from Google storage. However, 
 I'm not sure how to access the Spark UI in this deployment. Can anyone help? 
 
 Also, it deploys Spark 1.1. It there an easy way to bump it to Spark 1.2 ? 
 
 Thanks
 -Soumya
 
 
 image.png
 
 image.png


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark 1.2 compatibility

2015-01-16 Thread Matei Zaharia
The Apache Spark project should work with it, but I'm not sure you can get 
support from HDP (if you have that).

Matei

 On Jan 16, 2015, at 5:36 PM, Judy Nash judyn...@exchange.microsoft.com 
 wrote:
 
 Should clarify on this. I personally have used HDP 2.1 + Spark 1.2 and have 
 not seen a problem. 
 
 However officially HDP 2.1 + Spark 1.2 is not a supported scenario. 
 
 -Original Message-
 From: Judy Nash 
 Sent: Friday, January 16, 2015 5:35 PM
 To: 'bhavyateja'; user@spark.apache.org
 Subject: RE: spark 1.2 compatibility
 
 Yes. It's compatible with HDP 2.1 
 
 -Original Message-
 From: bhavyateja [mailto:bhavyateja.potin...@gmail.com] 
 Sent: Friday, January 16, 2015 3:17 PM
 To: user@spark.apache.org
 Subject: spark 1.2 compatibility
 
 Is spark 1.2 is compatibly with HDP 2.1
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-2-compatibility-tp21197.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
 commands, e-mail: user-h...@spark.apache.org
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Pattern Matching / Equals on Case Classes in Spark Not Working

2015-01-12 Thread Matei Zaharia
Is this in the Spark shell? Case classes don't work correctly in the Spark 
shell unfortunately (though they do work in the Scala shell) because we change 
the way lines of code compile to allow shipping functions across the network. 
The best way to get case classes in there is to compile them into a JAR and 
then add that to your spark-shell's classpath with --jars.
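
A sketch of that workflow (the package, jar name and parsing helper are illustrative):

// In a separate project, compiled into e.g. customer-model.jar:
package com.example.model

case class Customer(id: String, age: Option[Int], entryDate: Option[java.util.Date])

// Then start the shell with the jar on its classpath and use the class normally:
//   ./bin/spark-shell --jars customer-model.jar
// and inside the shell:
//   import com.example.model.Customer
//   val customers = sc.textFile("customers.csv").map(parseCustomer)   // parseCustomer is hypothetical
//   customers.distinct().count()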

Matei

 On Jan 12, 2015, at 10:04 AM, Rosner, Frank (Allianz SE) 
 frank.ros...@allianz.com wrote:
 
 Dear Spark Users,
  
 I googled the web for several hours now but I don't find a solution for my 
 problem. So maybe someone from this list can help.
  
 I have an RDD of case classes, generated from CSV files with Spark. When I 
 used the distinct operator, there were still duplicates. So I investigated 
 and found out that the equals returns false although the two objects were 
 equal (so were their individual fields as well as toStrings).
  
 After googling it I found that the case class equals might break in case the 
 two objects are created by different class loaders. So I implemented my own 
 equals method using mattern matching (code example below). It still didn't 
 work. Some debugging revealed that the problem lies in the pattern matching. 
 Depending on the objects I compare (and maybe the split / classloader they 
are generated in?) the pattern matching works / doesn't:
  
case class Customer(id: String, age: Option[Int], entryDate: Option[java.util.Date]) {
  def equals(that: Any): Boolean = that match {
    case Customer(id, age, entryDate) => {
      println("Pattern matching worked!")
      this.id == id && this.age == age && this.entryDate == entryDate
    }
    case _ => false
  }
}
  
 //val x: Array[Customer]
 // ... some spark code to filter original data and collect x
  
scala> x(0)
Customer(a, Some(5), Some(Fri Sep 23 00:00:00 CEST 1994))
scala> x(1)
Customer(a, None, None)
scala> x(2)
Customer(a, None, None)
scala> x(3)
Customer(a, None, None)

scala> x(0) == x(0) // should be true and works
Pattern matching works!
res0: Boolean = true
scala> x(0) == x(1) // should be false and works
Pattern matching works!
res1: Boolean = false
scala> x(1) == x(2) // should be true, does not work
res2: Boolean = false
scala> x(2) == x(3) // should be true, does not work
Pattern matching works!
res3: Boolean = true
scala> x(0) == x(3) // should be false, does not work
res4: Boolean = false
  
 Why is the pattern matching not working? It seems that there are two kinds of 
 Customers: 0,1 and 2,3 which don't match somehow. Is this related to some 
 classloaders? Is there a way around this other than using instanceof and 
 defining a custom equals operation for every case class I write?
  
 Thanks for the help!
 Frank



Fwd: ApacheCon North America 2015 Call For Papers

2015-01-05 Thread Matei Zaharia
FYI, ApacheCon North America call for papers is up.

Matei

 Begin forwarded message:
 
 Date: January 5, 2015 at 9:40:41 AM PST
 From: Rich Bowen rbo...@rcbowen.com
 Reply-To: dev d...@community.apache.org
 To: dev d...@community.apache.org
 Subject: ApacheCon North America 2015 Call For Papers
 
 Fellow ASF enthusiasts,
 
 We now have less than a month remaining in the Call For Papers for ApacheCon 
 North America 2015, and so far the submissions are on the paltry side. Please 
 consider submitting papers for consideration for this event.
 
 Details about the event are available at 
 http://events.linuxfoundation.org/events/apachecon-north-america
 
 The call for papers is at 
 http://events.linuxfoundation.org//events/apachecon-north-america/program/cfp
 
 Please help us out by getting this message out to your user@ and dev@ 
 community on the projects that you're involved in, so that these projects can 
 be represented in Austin.
 
 If you are interested in chairing a content track, and taking on the task of 
 wrangling your community together to create a compelling story about your 
 technology space, please join the comdev mailing list - 
 dev-subscr...@community.apache.org - and speak up there.
 
 (Message is Bcc'ed committers@, and Reply-to set to dev@community, if you 
 want to discuss this topic further there.)
 
 Thanks!
 
 -- 
 Rich Bowen - rbo...@rcbowen.com - @rbowen
 http://apachecon.com/ - @apachecon



Re: JetS3T settings spark

2014-12-30 Thread Matei Zaharia
This file needs to be on your CLASSPATH actually, not just in a directory. The 
best way to pass it in is probably to package it into your application JAR. You 
can put it in src/main/resources in a Maven or SBT project, and check that it 
makes it into the JAR using jar tf yourfile.jar.

Matei

 On Dec 30, 2014, at 4:21 PM, durga durgak...@gmail.com wrote:
 
 I am not sure , the way I can pass jets3t.properties file for spark-submit.
 --file option seems not working.
 can some one please help me. My production spark jobs get hung up when
 reading s3 file sporadically.
 
 Thanks,
 -D 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/JetS3T-settings-spark-tp20916.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: action progress in ipython notebook?

2014-12-29 Thread Matei Zaharia
Hey Eric, sounds like you are running into several issues, but thanks for 
reporting them. Just to comment on a few of these:

 I'm not seeing RDDs or SRDDs cached in the Spark UI. That page remains empty 
 despite my calling cache(). 

This is expected until you compute the RDDs the first time and they actually 
get cached, though I can see why it would be confusing. They don't get 
registered with the UI until this happens.

 I think that attempts to access a directory of parquet files still requires 
 reading the schema from the footer of every file. Painfully slow for 
 terabytes of data. 

This is also expected unfortunately due to the way Parquet works, but if you 
use Spark SQL to read these, the metadata gets cached after the first time you 
access a particular file.
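
A sketch of the Spark SQL route (the directory is a placeholder):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)                          // assumes an existing `sc`
val events = sqlContext.parquetFile("hdfs:///data/events")   // placeholder Parquet directory
events.registerTempTable("events")
// The footer/schema metadata read on first access is cached, so repeated
// queries over the same files avoid re-reading every footer.
sqlContext.sql("SELECT COUNT(*) FROM events").collect()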

 Exceptions are still often reflective of a symptom rather than a root cause. 
 For example, I had a join that was blowing up, but it was variously reported 
as insufficient kryo buffers and even an AST error in the SQL parser. 
 
 Saving an SRDD to a table in Hive doesn't work. I had to sneak it in by 
 saving to a file and the creating an external table. 

Yeah, unfortunately this is a bug in 1.2.0 
(https://issues.apache.org/jira/browse/SPARK-4825). It should be fixed for 
1.2.1.

 In interactive work, it would be nice if I could interrupt the current job 
 without killing the whole session.

You can actually do this with the kill button for that stage on the 
application web UI. It's a bit non-obvious but it does work.

Anyway, thanks for reporting this stuff. Don't be afraid to open JIRAs for such 
issues and for usability suggestions, especially if you have a way to reproduce 
them. Some of the usability things are obvious to people who know the UI 
inside-out but not to anyone else.

Matei


 The lower latency potential of Sparrow is also very intriguing. 
 
 Getting GraphX for PySpark would be very welcome. 
 
 It's easy to find fault, of course. I do want to say again how grateful I am 
 to have a usable release in 1.2 and look forward to 1.3 and beyond with real 
 excitement. 
 
 
 Eric Friedman
 
 On Dec 28, 2014, at 5:40 PM, Patrick Wendell pwend...@gmail.com wrote:
 
 Hey Eric,
 
 I'm just curious - which specific features in 1.2 do you find most
 help with usability? This is a theme we're focusing on for 1.3 as
 well, so it's helpful to hear what makes a difference.
 
 - Patrick
 
 On Sun, Dec 28, 2014 at 1:36 AM, Eric Friedman
 eric.d.fried...@gmail.com wrote:
 Hi Josh,
 
 Thanks for the informative answer. Sounds like one should await your changes
 in 1.3. As information, I found the following set of options for doing the
 visual in a notebook.
 
 http://nbviewer.ipython.org/github/ipython/ipython/blob/3607712653c66d63e0d7f13f073bde8c0f209ba8/docs/examples/notebooks/Animations_and_Progress.ipynb
 
 
 On Dec 27, 2014, at 4:07 PM, Josh Rosen rosenvi...@gmail.com wrote:
 
 The console progress bars are implemented on top of a new stable status
 API that was added in Spark 1.2.  It's possible to query job progress using
 this interface (in older versions of Spark, you could implement a custom
 SparkListener and maintain the counts of completed / running / failed tasks
 / stages yourself).
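
A sketch of polling that API from a driver-side thread (method names are from the 1.2 status API; treat the details as an illustration):

// Assumes an existing SparkContext `sc`.
val tracker = sc.statusTracker
for (jobId <- tracker.getActiveJobIds()) {
  tracker.getJobInfo(jobId).foreach { job =>
    for (stageId <- job.stageIds(); stage <- tracker.getStageInfo(stageId)) {
      println(s"stage $stageId: ${stage.numCompletedTasks()}/${stage.numTasks()} tasks done")
    }
  }
}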
 
 There are actually several subtleties involved in implementing job-level
 progress bars which behave in an intuitive way; there's a pretty extensive
 discussion of the challenges at https://github.com/apache/spark/pull/3009.
 Also, check out the pull request for the console progress bars for an
 interesting design discussion around how they handle parallel stages:
 https://github.com/apache/spark/pull/3029.
 
 I'm not sure about the plumbing that would be necessary to display live
 progress updates in the IPython notebook UI, though.  The general pattern
 would probably involve a mapping to relate notebook cells to Spark jobs (you
 can do this with job groups, I think), plus some periodic timer that polls
 the driver for the status of the current job in order to update the progress
 bar.
 
 For Spark 1.3, I'm working on designing a REST interface to accesses this
 type of job / stage / task progress information, as well as expanding the
 types of information exposed through the stable status API interface.
 
 - Josh
 
 On Thu, Dec 25, 2014 at 10:01 AM, Eric Friedman eric.d.fried...@gmail.com
 wrote:
 
 Spark 1.2.0 is SO much more usable than previous releases -- many thanks
 to the team for this release.
 
 A question about progress of actions.  I can see how things are
 progressing using the Spark UI.  I can also see the nice ASCII art 
 animation
 on the spark driver console.
 
 Has anyone come up with a way to accomplish something similar in an
 iPython notebook using pyspark?
 
 Thanks
 Eric
 
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 



Re: When will spark 1.2 released?

2014-12-18 Thread Matei Zaharia
Yup, as he posted before, An Apache infrastructure issue prevented me from 
pushing this last night. The issue was resolved today and I should be able to 
push the final release artifacts tonight.

 On Dec 18, 2014, at 10:14 PM, Andrew Ash and...@andrewash.com wrote:
 
 Patrick is working on the release as we speak -- I expect it'll be out later 
 tonight (US west coast) or tomorrow at the latest.
 
 On Fri, Dec 19, 2014 at 1:09 AM, Ted Yu yuzhih...@gmail.com 
 mailto:yuzhih...@gmail.com wrote:
 Interesting, the maven artifacts were dated Dec 10th. 
 However vote for RC2 closed recently:
 http://search-hadoop.com/m/JW1q5K8onk2/Patrick+spark+1.2.0subj=Re+VOTE+Release+Apache+Spark+1+2+0+RC2+
  
 http://search-hadoop.com/m/JW1q5K8onk2/Patrick+spark+1.2.0subj=Re+VOTE+Release+Apache+Spark+1+2+0+RC2+
 
 Cheers
 
 On Dec 18, 2014, at 10:02 PM, madhu phatak phatak@gmail.com 
 mailto:phatak@gmail.com wrote:
 
 It’s on Maven Central already http://search.maven.org/#browse%7C717101892 
 http://search.maven.org/#browse%7C717101892
 
 On Fri, Dec 19, 2014 at 11:17 AM, vboylin1...@gmail.com 
 mailto:vboylin1...@gmail.com vboylin1...@gmail.com 
 mailto:vboylin1...@gmail.com wrote:
 Hi, 
Dose any know when will spark 1.2 released? 1.2 has many great feature 
 that we can't wait now ,-)
 
 Sincely
 Lin wukang
 
 
 发自网易邮箱大师
 
 
 -- 
 Regards,
 Madhukara Phatak
 http://www.madhukaraphatak.com http://www.madhukaraphatak.com/



Re: wordcount job slow while input from NFS mount

2014-12-17 Thread Matei Zaharia
The problem is very likely NFS, not Spark. What kind of network is it mounted 
over? You can also test the performance of your NFS by copying a file from it 
to a local disk or to /dev/null and seeing how many bytes per second it can 
copy.

Matei

 On Dec 17, 2014, at 9:38 AM, Larryliu larryli...@gmail.com wrote:
 
 A wordcounting job for about 1G text file takes 1 hour while input from a NFS
 mount. The same job took 30 seconds while input from local file system.
 
 Is there any tuning required for a NFS mount input?
 
 Thanks
 
 Larry
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/wordcount-job-slow-while-input-from-NFS-mount-tp20747.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: wordcount job slow while input from NFS mount

2014-12-17 Thread Matei Zaharia
I see, you may have something else configured weirdly then. You should look at 
CPU and disk utilization while your Spark job is reading from NFS and, if you 
see high CPU use, run jstack to see where the process is spending time. Also 
make sure Spark's local work directories (spark.local.dir) are not on NFS. They 
shouldn't be by default, though, since that defaults to /tmp.

Matei

 On Dec 17, 2014, at 11:56 AM, Larry Liu larryli...@gmail.com wrote:
 
 Hi, Matei
 
 Thanks for your response.
 
 I tried to copy the file (1G) from NFS and took 10 seconds. The NFS mount is 
 a LAN environment and the NFS server is running on the same server that Spark 
 is running on. So basically I mount the NFS on the same bare metal machine.
 
 Larry
 
 On Wed, Dec 17, 2014 at 11:42 AM, Matei Zaharia matei.zaha...@gmail.com 
 mailto:matei.zaha...@gmail.com wrote:
 The problem is very likely NFS, not Spark. What kind of network is it mounted 
 over? You can also test the performance of your NFS by copying a file from it 
 to a local disk or to /dev/null and seeing how many bytes per second it can 
 copy.
 
 Matei
 
  On Dec 17, 2014, at 9:38 AM, Larryliu larryli...@gmail.com 
  mailto:larryli...@gmail.com wrote:
 
  A wordcounting job for about 1G text file takes 1 hour while input from a 
  NFS
  mount. The same job took 30 seconds while input from local file system.
 
  Is there any tuning required for a NFS mount input?
 
  Thanks
 
  Larry
 
 
 
  --
  View this message in context: 
  http://apache-spark-user-list.1001560.n3.nabble.com/wordcount-job-slow-while-input-from-NFS-mount-tp20747.html
   
  http://apache-spark-user-list.1001560.n3.nabble.com/wordcount-job-slow-while-input-from-NFS-mount-tp20747.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
  mailto:user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org 
  mailto:user-h...@spark.apache.org
 
 



Re: Spark SQL Roadmap?

2014-12-13 Thread Matei Zaharia
Spark SQL is already available, the reason for the alpha component label is 
that we are still tweaking some of the APIs so we have not yet guaranteed API 
stability for it. However, that is likely to happen soon (possibly 1.3). One of 
the major things added in Spark 1.2 was an external data sources API 
(https://github.com/apache/spark/pull/2475), so we wanted to get a bit of 
feedback on that to provide a stable API for those as well.

Matei

 On Dec 13, 2014, at 5:26 PM, Xiaoyong Zhu xiaoy...@microsoft.com wrote:
 
 Thanks Denny for your information!
 For #1, what I meant is the Spark SQL beta/official release date (as today it 
 is still in alpha phase)… thought today I see it has most basic 
 functionalities,  I don’t know when will the next milestone happen? i.e. Beta?
 For #2, thanks for the information! I read it and it’s really useful! My take 
 is that, Hive on Spark is still Hive (thus having all the metastore 
 information and Hive interfaces such as the REST APIs), while Spark SQL is 
 the expansion of Spark and use several interfaces (HiveContext for example) 
 to support run Hive queries. Is this correct?
  
 Then a following question would be, does Spark SQL has some REST APIs, just 
 as what WebHCat exposes, to help users to submit queries remotely, other than 
 logging into a cluster and execute the command in spark-sql command line? 
  
 Xiaoyong
   
 From: Denny Lee [mailto:denny.g@gmail.com mailto:denny.g@gmail.com] 
 Sent: Saturday, December 13, 2014 10:59 PM
 To: Xiaoyong Zhu; user@spark.apache.org mailto:user@spark.apache.org
 Subject: Re: Spark SQL Roadmap?
  
 Hi Xiaoyong,
 
 SparkSQL has already been released and has been part of the Spark code-base 
 since Spark 1.0.  The latest stable release is Spark 1.1 (here's the Spark 
 SQL Programming Guide 
 http://spark.apache.org/docs/1.1.0/sql-programming-guide.html) and we're 
 currently voting on Spark 1.2.
  
 Hive on Spark is an initiative by Cloudera to help folks whom are already 
 using Hive but instead of using traditional MR it will utilize Spark.  For 
 more information, check 
 outhttp://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
  
 http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/.
  
 For anyone who is building new projects in Spark, IMHO I would suggest 
 jumping to SparkSQL first.
  
 HTH!
 Denny
  
  
 On Sat Dec 13 2014 at 5:00:56 AM Xiaoyong Zhu xiaoy...@microsoft.com 
 mailto:xiaoy...@microsoft.com wrote:
 Dear spark experts, I am very interested in Spark SQL availability in the 
 future – could someone share with me the information about the following 
 questions?
 1.   Is there some ETAs for the Spark SQL release?
 
 2.   I heard there is a Hive on Spark program also – what’s the 
 difference between Spark SQL and Hive on Spark?
 
  
 Thanks!
 Xiaoyong



Re: what is the best way to implement mini batches?

2014-12-11 Thread Matei Zaharia
You can just do mapPartitions on the whole RDD, and then call sliding() on 
the iterator in each one to get a sliding window. One problem is that you will 
not be able to slide forward into the next partition at partition boundaries. 
If this matters to you, you need to do something more complicated to get those, 
such as the repartition that you said (where you map each record to the 
partition it should be in).
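
A minimal sketch of the per-partition approach (the element type and batch size are illustrative; windows never cross partition boundaries):

import org.apache.spark.rdd.RDD

def miniBatches(rdd: RDD[Double]): RDD[Seq[Double]] =
  rdd.mapPartitions { iter =>
    // Sliding windows of 100 elements within each partition. For
    // non-overlapping mini-batches, use iter.grouped(100) instead.
    iter.sliding(100)
  }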

Matei

 On Dec 11, 2014, at 10:16 AM, ll duy.huynh@gmail.com wrote:
 
 any advice/comment on this would be much appreciated.  
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/what-is-the-best-way-to-implement-mini-batches-tp20264p20635.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: dockerized spark executor on mesos?

2014-12-03 Thread Matei Zaharia
I'd suggest asking about this on the Mesos list (CCed). As far as I know, there 
was actually some ongoing work for this.

Matei

 On Dec 3, 2014, at 9:46 AM, Dick Davies d...@hellooperator.net wrote:
 
 Just wondered if anyone had managed to start spark
 jobs on mesos wrapped in a docker container?
 
 At present (i.e. very early testing) I'm able to submit executors
 to mesos via spark-submit easily enough, but they fall over
 as we don't have a JVM on our slaves out of the box.
 
 I can push one out via our CM system if push comes to shove,
 but it'd be nice to have that as part of the job (I'm thinking it might
 be a way to get some of the dependencies deployed too).
 
 bear in mind I'm a total clueless newbie at this so please be gentle
 if I'm doing this completely wrong.
 
 Thanks!
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: configure to run multiple tasks on a core

2014-11-26 Thread Matei Zaharia
Instead of SPARK_WORKER_INSTANCES you can also set SPARK_WORKER_CORES, to have 
one worker that thinks it has more cores.

Matei

 On Nov 26, 2014, at 5:01 PM, Yotto Koga yotto.k...@autodesk.com wrote:
 
 Thanks Sean. That worked out well.
 
 For anyone who happens onto this post and wants to do the same, these are the 
 steps I took to do as Sean suggested...
 
 (Note this is for a stand alone cluster)
 
 login to the master
 
 ~/spark/sbin/stop-all.sh
 
 edit ~/spark/conf/spark-env.sh
 
 modify the line
 export SPARK_WORKER_INSTANCES=1
 to the multiple you want to set (e.g 2)
 
 I also added
 export SPARK_WORKER_MEMORY=some reasonable value so that the total number of 
 workers on a node is within the available memory available on the node (e.g. 
 2g)
 
 ~/spark-ec2/copy-dir /root/spark/conf
 
 ~/spark/sbin/start-all.sh
 
 
 
 From: Sean Owen [so...@cloudera.com]
 Sent: Wednesday, November 26, 2014 12:14 AM
 To: Yotto Koga
 Cc: user@spark.apache.org
 Subject: Re: configure to run multiple tasks on a core
 
 What about running, say, 2 executors per machine, each of which thinks
 it should use all cores?
 
 You can also multi-thread your map function manually, directly, within
 your code, with careful use of a java.util.concurrent.Executor
 
 On Wed, Nov 26, 2014 at 6:57 AM, yotto yotto.k...@autodesk.com wrote:
 I'm running a spark-ec2 cluster.
 
 I have a map task that calls a specialized C++ external app. The app doesn't
 fully utilize the core as it needs to download/upload data as part of the
 task. Looking at the worker nodes, it appears that there is one task with my
 app running per core.
 
 I'd like to better utilize the cpu resources with the hope of increasing
 throughput by running multiple tasks (with my app) per core in parallel.
 
 I see there is a spark.task.cpus config setting with a default value of 1.
 It appears though that this is used to go the other way than what I am
 looking for.
 
 Is there a way where I can specify multiple tasks per core rather than
 multiple cores per task?
 
 thanks for any help.
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/configure-to-run-multiple-tasks-on-a-core-tp19834.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL - Any time line to move beyond Alpha version ?

2014-11-25 Thread Matei Zaharia
The main reason for the alpha tag is actually that APIs might still be 
evolving, but we'd like to freeze the API as soon as possible. Hopefully it 
will happen in one of 1.3 or 1.4. In Spark 1.2, we're adding an external data 
source API that we'd like to get experience with before freezing it.

Matei

 On Nov 24, 2014, at 2:53 PM, Manoj Samel manojsamelt...@gmail.com wrote:
 
 Is there any timeline where Spark SQL goes beyond alpha version?
 
 Thanks,


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Configuring custom input format

2014-11-25 Thread Matei Zaharia
How are you creating the object in your Scala shell? Maybe you can write a 
function that directly returns the RDD, without assigning the object to a 
temporary variable.
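
For example, a sketch of wrapping the Job-based setup in a function so the shell never tries to print the Job object (the input format and key/value types here are placeholders for the custom format in question):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Returning the RDD directly avoids binding the Job to a shell variable, whose
// toString() is what triggers the illegal-state exception.
def loadRecords(sc: SparkContext, path: String): RDD[(LongWritable, Text)] = {
  val job = Job.getInstance(sc.hadoopConfiguration)   // wraps the SparkContext's Hadoop conf
  FileInputFormat.addInputPath(job, new Path(path))
  sc.newAPIHadoopRDD(job.getConfiguration,
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
}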

Matei

 On Nov 5, 2014, at 2:54 PM, Corey Nolet cjno...@gmail.com wrote:
 
 The closer I look @ the stack trace in the Scala shell, it appears to be the 
 call to toString() that is causing the construction of the Job object to 
 fail. Is there a ways to suppress this output since it appears to be 
 hindering my ability to new up this object?
 
 On Wed, Nov 5, 2014 at 5:49 PM, Corey Nolet cjno...@gmail.com 
 mailto:cjno...@gmail.com wrote:
 I'm trying to use a custom input format with SparkContext.newAPIHadoopRDD. 
 Creating the new RDD works fine but setting up the configuration file via the 
 static methods on input formats that require a Hadoop Job object is proving 
 to be difficult. 
 
 Trying to new up my own Job object with the SparkContext.hadoopConfiguration 
 is throwing the exception on line 283 of this grepcode:
 
 http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-core/2.5.0/org/apache/hadoop/mapreduce/Job.java#Job
  
 http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-core/2.5.0/org/apache/hadoop/mapreduce/Job.java#Job
 
 Looking in the SparkContext code, I'm seeing that it's newing up Job objects 
 just fine using nothing but the configuraiton. Using SparkContext.textFile() 
 appears to be working for me. Any ideas? Has anyone else run into this as 
 well? Is it possible to have a method like SparkContext.getJob() or something 
 similar?
 
 Thanks.
 
 



Re: Configuring custom input format

2014-11-25 Thread Matei Zaharia
Yeah, unfortunately that will be up to them to fix, though it wouldn't hurt to 
send them a JIRA mentioning this.

Matei

 On Nov 25, 2014, at 2:58 PM, Corey Nolet cjno...@gmail.com wrote:
 
 I was wiring up my job in the shell while i was learning Spark/Scala. I'm 
 getting more comfortable with them both now so I've been mostly testing 
 through Intellij with mock data as inputs.
 
 I think the problem lies more on Hadoop than Spark as the Job object seems to 
 check it's state and throw an exception when the toString() method is called 
 before the Job has physically been submitted.
 
 On Tue, Nov 25, 2014 at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com 
 mailto:matei.zaha...@gmail.com wrote:
 How are you creating the object in your Scala shell? Maybe you can write a 
 function that directly returns the RDD, without assigning the object to a 
 temporary variable.
 
 Matei
 
 On Nov 5, 2014, at 2:54 PM, Corey Nolet cjno...@gmail.com 
 mailto:cjno...@gmail.com wrote:
 
 The closer I look @ the stack trace in the Scala shell, it appears to be the 
 call to toString() that is causing the construction of the Job object to 
 fail. Is there a ways to suppress this output since it appears to be 
 hindering my ability to new up this object?
 
 On Wed, Nov 5, 2014 at 5:49 PM, Corey Nolet cjno...@gmail.com 
 mailto:cjno...@gmail.com wrote:
 I'm trying to use a custom input format with SparkContext.newAPIHadoopRDD. 
 Creating the new RDD works fine but setting up the configuration file via 
 the static methods on input formats that require a Hadoop Job object is 
 proving to be difficult. 
 
 Trying to new up my own Job object with the SparkContext.hadoopConfiguration 
 is throwing the exception on line 283 of this grepcode:
 
  http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hadoop/hadoop-mapreduce-client-core/2.5.0/org/apache/hadoop/mapreduce/Job.java#Job
 
 Looking in the SparkContext code, I'm seeing that it's newing up Job objects 
  just fine using nothing but the configuration. Using SparkContext.textFile() 
 appears to be working for me. Any ideas? Has anyone else run into this as 
 well? Is it possible to have a method like SparkContext.getJob() or 
 something similar?
 
 Thanks.
 
 
 
 



Re: do not assemble the spark example jar

2014-11-25 Thread Matei Zaharia
You can do sbt/sbt assembly/assembly to assemble only the main package.

Matei

 On Nov 25, 2014, at 7:50 PM, lihu lihu...@gmail.com wrote:
 
 Hi,
  The Spark assembly is time costly. I only need 
  spark-assembly-1.1.0-hadoop2.3.0.jar, and do not need 
  spark-examples-1.1.0-hadoop2.3.0.jar. How can I configure Spark to avoid 
  assembling the examples jar? I know the export SPARK_PREPEND_CLASSES=true method can 
  reduce the assembly time, but I do not
  develop locally. Any advice?
 
 -- 
 Best Wishes!
 
 



Re: do not assemble the spark example jar

2014-11-25 Thread Matei Zaharia
BTW as another tip, it helps to keep the SBT console open as you make source 
changes (by just running sbt/sbt with no args). It's a lot faster the second 
time it builds something.

Matei

 On Nov 25, 2014, at 8:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 
 You can do sbt/sbt assembly/assembly to assemble only the main package.
 
 Matei
 
 On Nov 25, 2014, at 7:50 PM, lihu lihu...@gmail.com 
 mailto:lihu...@gmail.com wrote:
 
 Hi,
  The Spark assembly is time costly. I only need 
  spark-assembly-1.1.0-hadoop2.3.0.jar, and do not need 
  spark-examples-1.1.0-hadoop2.3.0.jar. How can I configure Spark to avoid 
  assembling the examples jar? I know the export SPARK_PREPEND_CLASSES=true method 
  can reduce the assembly time, but I do not
  develop locally. Any advice?
 
 -- 
 Best Wishes!
 
 
 



Re: rack-topology.sh no such file or directory

2014-11-19 Thread Matei Zaharia
Your Hadoop configuration is set to look for this file to determine racks. Is 
the file present on cluster nodes? If not, look at your hdfs-site.xml and 
remove the setting for a rack topology script there (or it might be in 
core-site.xml).

Matei

 On Nov 19, 2014, at 12:13 PM, Arun Luthra arun.lut...@gmail.com wrote:
 
 I'm trying to run Spark on Yarn on a hortonworks 2.1.5 cluster. I'm getting 
 this error:
 
 14/11/19 13:46:34 INFO cluster.YarnClientSchedulerBackend: Registered 
 executor: 
  Actor[akka.tcp://sparkExecutor@#/user/Executor#-2027837001] with ID 42
 14/11/19 13:46:34 WARN net.ScriptBasedMapping: Exception running 
 /etc/hadoop/conf/rack-topology.sh 10.0.28.130 
 java.io.IOException: Cannot run program /etc/hadoop/conf/rack-topology.sh 
 (in directory ###): error=2, No such file or directory
 
  The rack-topology script is not on the system (find / 2>/dev/null -name 
  rack-topology).
  
  Any possible solution?
 
 Arun Luthra



Re: Kafka version dependency in Spark 1.2

2014-11-10 Thread Matei Zaharia
Just curious, what are the pros and cons of this? Can the 0.8.1.1 client still 
talk to 0.8.0 versions of Kafka, or do you need it to match your Kafka version 
exactly?

Matei

 On Nov 10, 2014, at 9:48 AM, Bhaskar Dutta bhas...@gmail.com wrote:
 
 Hi,
 
 Is there any plan to bump the Kafka version dependency in Spark 1.2 from 
 0.8.0 to 0.8.1.1?
 
 Current dependency is still on Kafka 0.8.0
  https://github.com/apache/spark/blob/branch-1.2/external/kafka/pom.xml
 
 Thanks
 Bhaskie



Re: closure serialization behavior driving me crazy

2014-11-10 Thread Matei Zaharia
Hey Sandy,

Try using the -Dsun.io.serialization.extendedDebugInfo=true flag on the JVM to 
print the contents of the objects. In addition, something else that helps is to 
do the following:

{
  val  _arr = arr
  models.map(... _arr ...)
}

Basically, copy the global variable into a local one. Then the field access 
from outside (from the interpreter-generated object that contains the line 
initializing arr) is no longer required, and the closure no longer has a 
reference to that.

I'm really confused as to why Array.ofDim would be so large by the way, but are 
you sure you haven't flipped around the dimensions (e.g. it should be 5 x 
1800)? A 5-double array will consume more than 5*8 bytes (probably something 
like 60 at least), and an array of those will still have a pointer to each one, 
so I'd expect that many of them to be more than 80 KB (which is very close to 
1867*5*8).

Matei

 On Nov 10, 2014, at 1:01 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
 
 I'm experiencing some strange behavior with closure serialization that is 
 totally mind-boggling to me.  It appears that two arrays of equal size take 
 up vastly different amount of space inside closures if they're generated in 
 different ways.
 
 The basic flow of my app is to run a bunch of tiny regressions using Commons 
 Math's OLSMultipleLinearRegression and then reference a 2D array of the 
 results from a transformation.  I was running into OOME's and 
 NotSerializableExceptions and tried to get closer to the root issue by 
 calling the closure serializer directly.
   scala val arr = models.map(_.estimateRegressionParameters()).toArray
 
 The result array is 1867 x 5. Serialized, it is 80k bytes, which seems about 
 right:
   scala SparkEnv.get.closureSerializer.newInstance().serialize(arr)
   res17: java.nio.ByteBuffer = java.nio.HeapByteBuffer[pos=0 lim=80027 
 cap=80027]
 
 If I reference it from a simple function:
   scala def func(x: Long) = arr.length
   scala SparkEnv.get.closureSerializer.newInstance().serialize(func)
 I get a NotSerializableException.
 
 If I take pains to create the array using a loop:
   scala val arr = Array.ofDim[Double](1867, 5)
   scala for (s <- 0 until models.length) {
   | arr(s) = models(s).estimateRegressionParameters()
   | }
 Serialization works, but the serialized closure for the function is a 
 whopping 400MB.
 
 If I pass in an array of the same length that was created in a different way, 
 the size of the serialized closure is only about 90K, which seems about right.
 
 Naively, it seems like somehow the history of how the array was created is 
 having an effect on what happens to it inside a closure.
 
 Is this expected behavior?  Can anybody explain what's going on?
 
 any insight very appreciated,
 Sandy


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Why does this siimple spark program uses only one core?

2014-11-09 Thread Matei Zaharia
Call getNumPartitions() on your RDD to make sure it has the right number of 
partitions. You can also specify it when doing parallelize, e.g.

rdd = sc.parallelize(xrange(1000), 10)

This should run in parallel if you have multiple partitions and cores, but it 
might be that during part of the process only one node (e.g. the master 
process) is doing anything.

Matei


 On Nov 9, 2014, at 9:27 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
 
 You can set the following entry inside the conf/spark-defaults.conf file 
 
 spark.cores.max 16
 
 If you want to read the default value, then you can use the following api call
 
 sc.defaultParallelism
 
 where sc is your SparkContext object.
 
 Thanks
 Best Regards
 
 On Sun, Nov 9, 2014 at 6:48 PM, ReticulatedPython person.of.b...@gmail.com 
 mailto:person.of.b...@gmail.com wrote:
 So, I'm running this simple program on a 16 core multicore system. I run it
 by issuing the following.
 
 spark-submit --master local[*] pi.py
 
 And the code of that program is the following. When I use top to see CPU
 consumption, only 1 core is being utilized. Why is that? Secondly, the Spark
 documentation says that the default parallelism is contained in the property
 spark.default.parallelism. How can I read this property from within my
 python program?
 
 #pi.py
 from pyspark import SparkContext
 import random
 
 NUM_SAMPLES = 1250
 
 def sample(p):
     x, y = random.random(), random.random()
     return 1 if x*x + y*y < 1 else 0
 
 sc = SparkContext("local", "Test App")
 count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a,
 b: a + b)
 print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)
 
 
 
 --
 View this message in context: 
  http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-this-siimple-spark-program-uses-only-one-core-tp18434.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 mailto:user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org 
 mailto:user-h...@spark.apache.org
 
 



Re: wierd caching

2014-11-08 Thread Matei Zaharia
It might mean that some partition was computed on two nodes, because a task for 
it wasn't able to be scheduled locally on the first node. Did the RDD really 
have 426 partitions total? You can click on it and see where there are copies 
of each one.

Matei

 On Nov 8, 2014, at 10:16 PM, Nathan Kronenfeld nkronenf...@oculusinfo.com 
 wrote:
 
  RDD Name: 8 (http://hadoop-s1.oculus.guest:4042/storage/rdd?id=8)
  Storage Level: Memory Deserialized 1x Replicated
  Cached Partitions: 426
  Fraction Cached: 107%
  Size in Memory: 59.7 GB
  Size in Tachyon: 0.0 B
  Size on Disk: 0.0 B
 Anyone understand what it means to have more than 100% of an rdd cached?
 
 Thanks,
 -Nathan
 



Re: Any Replicated RDD in Spark?

2014-11-05 Thread Matei Zaharia
If you start with an RDD, you do have to collect to the driver and broadcast to 
do this. Between the two options you listed, I think this one is simpler to 
implement, and there won't be a huge difference in performance, so you can go 
for it. Opening InputStreams to a distributed file system by hand can be a lot 
of code.
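
As a rough sketch of the collect-then-broadcast pattern (the key and value types below are just placeholders for your own):

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair RDD functions (collectAsMap)
import org.apache.spark.rdd.RDD

def mapSideJoin(sc: SparkContext,
                bigRdd: RDD[(String, Int)],
                smallRdd: RDD[(String, String)]): RDD[(String, (Int, String))] = {
  // Pull the small relation to the driver and ship it to every node once.
  val lookup = sc.broadcast(smallRdd.collectAsMap())
  bigRdd.mapPartitions { iter =>
    iter.flatMap { case (k, v) =>
      lookup.value.get(k).map(small => (k, (v, small)))  // inner-join semantics
    }
  }
}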

Matei

 On Nov 5, 2014, at 12:37 PM, Shuai Zheng szheng.c...@gmail.com wrote:
 
 And another similar case:
 
 If I have an RDD from a previous step, but the next step should be a map
 side join (so I need to broadcast this RDD to every node), what is the best
 way for me to do that? Collect the RDD in the driver first and create a broadcast? Or
 is there any shortcut in Spark for this?
 
 Thanks!
 
 -Original Message-
 From: Shuai Zheng [mailto:szheng.c...@gmail.com] 
 Sent: Wednesday, November 05, 2014 3:32 PM
 To: 'Matei Zaharia'
 Cc: 'user@spark.apache.org'
 Subject: RE: Any Replicated RDD in Spark?
 
 Nice.
 
 Then I have another question: if I have a file (or a set of files: part-0,
 part-1, perhaps a few hundred MB of CSV up to 1-2 GB, created by another program)
 and need to create a hashtable from it, then broadcast it to each node to allow
 lookups (map side join), I have two options to do it:
 
 1. I can just load the file with general code (open an InputStream, etc.),
 parse the content and then create the broadcast from it. 
 2. I can also use the standard way to create an RDD from these files, run a
 map to parse it, then collect it as a map, and wrap the result as a broadcast to
 push to all nodes again.
 
 I think option 2 might be more consistent with Spark's concepts (and less
 code?). But how about the performance? The gain is that we can load and
 parse the data in parallel; the penalty is that after loading we need to collect and
 broadcast the result again. Please share your opinion. I am not sure what the best
 practice is here (in theory, either way works, but in the real world, which one is
 better?). 
 
 Regards,
 
 Shuai
 
 -Original Message-
 From: Matei Zaharia [mailto:matei.zaha...@gmail.com] 
 Sent: Monday, November 03, 2014 4:15 PM
 To: Shuai Zheng
 Cc: user@spark.apache.org
 Subject: Re: Any Replicated RDD in Spark?
 
 You need to use broadcast followed by flatMap or mapPartitions to do
 map-side joins (in your map function, you can look at the hash table you
 broadcast and see what records match it). Spark SQL also does it by default
 for tables smaller than the spark.sql.autoBroadcastJoinThreshold setting (by
 default 10 KB, which is really small, but you can bump this up with set
 spark.sql.autoBroadcastJoinThreshold=100 for example).
 
 Matei
 
 On Nov 3, 2014, at 1:03 PM, Shuai Zheng szheng.c...@gmail.com wrote:
 
 Hi All,
 
 I have spent the last two years on Hadoop but am new to Spark.
 I am planning to move one of my existing systems to Spark to get some
 enhanced features.
 
 My question is:
 
 If I try to do a map side join (something similar to the Replicated keyword
 in Pig), how can I do it? Is there any way to declare an RDD as replicated
 (meaning distribute it to all nodes so each node will have a full copy)?
 
 I know I can use an accumulator to get this feature, but I am not sure what
 is the best practice. And if I use an accumulator to broadcast the data set, can
 I then (after broadcast) convert it into an RDD and do the join?
 
 Regards,
 
 Shuai
 
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Breaking the previous large-scale sort record with Spark

2014-11-05 Thread Matei Zaharia
Congrats to everyone who helped make this happen. And if anyone has even more 
machines they'd like us to run on next year, let us know :).

Matei

 On Nov 5, 2014, at 3:11 PM, Reynold Xin r...@databricks.com wrote:
 
 Hi all,
 
 We are excited to announce that the benchmark entry has been reviewed by
 the Sort Benchmark committee and Spark has officially won the Daytona
 GraySort contest in sorting 100TB of data.
 
 Our entry tied with a UCSD research team building high performance systems
 and we jointly set a new world record. This is an important milestone for
 the project, as it validates the amount of engineering work put into Spark
 by the community.
 
 As Matei said, "For an engine to scale from these multi-hour petabyte batch
 jobs down to 100-millisecond streaming and interactive queries is quite
 uncommon, and it's thanks to all of you folks that we are able to make this
 happen."
 
 Updated blog post:
 http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
 
 
 
 
 On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia matei.zaha...@gmail.com
 wrote:
 
 Hi folks,
 
 I interrupt your regularly scheduled user / dev list to bring you some
 pretty cool news for the project, which is that we've been able to use
 Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x
 faster on 10x fewer nodes. There's a detailed writeup at
 http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
 Summary: while Hadoop MapReduce held last year's 100 TB world record by
 sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on
 206 nodes; and we also scaled up to sort 1 PB in 234 minutes.
 
 I want to thank Reynold Xin for leading this effort over the past few
 weeks, along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali
 Ghodsi. In addition, we'd really like to thank Amazon's EC2 team for
 providing the machines to make this possible. Finally, this result would of
 course not be possible without the many many other contributions, testing
 and feature requests from throughout the community.
 
 For an engine to scale from these multi-hour petabyte batch jobs down to
 100-millisecond streaming and interactive queries is quite uncommon, and
 it's thanks to all of you folks that we are able to make this happen.
 
 Matei
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark v Redshift

2014-11-04 Thread Matei Zaharia
Is this about Spark SQL vs Redshift, or Spark in general? Spark in general 
provides a broader set of capabilities than Redshift because it has APIs in 
general-purpose languages (Java, Scala, Python) and libraries for things like 
machine learning and graph processing. For example, you might use Spark to do 
the ETL that will put data into a database such as Redshift, or you might pull 
data out of Redshift into Spark for machine learning. On the other hand, if 
*all* you want to do is SQL and you are okay with the set of data formats and 
features in Redshift (i.e. you can express everything using its UDFs and you 
have a way to get data in), then Redshift is a complete service which will do 
more management out of the box.

Matei

 On Nov 4, 2014, at 3:11 PM, agfung agf...@gmail.com wrote:
 
 I'm in the midst of a heated debate about the use of Redshift v Spark with a
 colleague.  We keep trading anecdotes and links back and forth (eg airbnb
 post from 2013 or amplab benchmarks), and we don't seem to be getting
 anywhere. 
 
 So before we start down the prototype /benchmark road, and in desperation 
 of finding *some* kind of objective third party perspective, I was wondering
 if anyone who has used both in 2014 would care to provide commentary about
 the sweet spot use cases / gotchas for non trivial use (eg a simple filter
 scan isn't really interesting).  Soft issues like operational maintenance
 and time spent developing v out of the box are interesting too... 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-v-Redshift-tp18112.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark v Redshift

2014-11-04 Thread Matei Zaharia
BTW while I haven't actually used Redshift, I've seen many companies that use 
both, usually using Spark for ETL and advanced analytics and Redshift for SQL 
on the cleaned / summarized data. Xiangrui Meng also wrote 
https://github.com/mengxr/redshift-input-format to make it easy to read data 
exported from Redshift into Spark or Hadoop.

Matei

 On Nov 4, 2014, at 3:51 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 
 Is this about Spark SQL vs Redshift, or Spark in general? Spark in general 
 provides a broader set of capabilities than Redshift because it has APIs in 
 general-purpose languages (Java, Scala, Python) and libraries for things like 
 machine learning and graph processing. For example, you might use Spark to do 
 the ETL that will put data into a database such as Redshift, or you might 
 pull data out of Redshift into Spark for machine learning. On the other hand, 
 if *all* you want to do is SQL and you are okay with the set of data formats 
 and features in Redshift (i.e. you can express everything using its UDFs and 
 you have a way to get data in), then Redshift is a complete service which 
 will do more management out of the box.
 
 Matei
 
 On Nov 4, 2014, at 3:11 PM, agfung agf...@gmail.com wrote:
 
 I'm in the midst of a heated debate about the use of Redshift v Spark with a
 colleague.  We keep trading anecdotes and links back and forth (eg airbnb
 post from 2013 or amplab benchmarks), and we don't seem to be getting
 anywhere. 
 
 So before we start down the prototype /benchmark road, and in desperation 
 of finding *some* kind of objective third party perspective, I was wondering
 if anyone who has used both in 2014 would care to provide commentary about
 the sweet spot use cases / gotchas for non trivial use (eg a simple filter
 scan isn't really interesting).  Soft issues like operational maintenance
 and time spent developing v out of the box are interesting too... 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-v-Redshift-tp18112.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Any Replicated RDD in Spark?

2014-11-03 Thread Matei Zaharia
You need to use broadcast followed by flatMap or mapPartitions to do map-side 
joins (in your map function, you can look at the hash table you broadcast and 
see what records match it). Spark SQL also does it by default for tables 
smaller than the spark.sql.autoBroadcastJoinThreshold setting (by default 10 
KB, which is really small, but you can bump this up with set 
spark.sql.autoBroadcastJoinThreshold=100 for example).

Matei

 On Nov 3, 2014, at 1:03 PM, Shuai Zheng szheng.c...@gmail.com wrote:
 
 Hi All,
 
 I have spent the last two years on Hadoop but am new to Spark.
 I am planning to move one of my existing systems to Spark to get some enhanced 
 features.
 
 My question is:
 
 If I try to do a map side join (something similar to the Replicated keyword in 
 Pig), how can I do it? Is there any way to declare an RDD as replicated (meaning 
 distribute it to all nodes so each node will have a full copy)?
 
 I know I can use an accumulator to get this feature, but I am not sure what is 
 the best practice. And if I use an accumulator to broadcast the data set, can I then 
 (after broadcast) convert it into an RDD and do the join?
 
 Regards,
 
 Shuai


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: union of SchemaRDDs

2014-11-01 Thread Matei Zaharia
Try unionAll, which is a special method on SchemaRDDs that keeps the schema on 
the results.
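
For example (fileA and fileB are placeholders for your own paths):

val a = sqx.parquetFile("fileA")
val b = sqx.parquetFile("fileB")
val both = a.unionAll(b)   // still a SchemaRDD, with the schema preserved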

Matei

 On Nov 1, 2014, at 3:57 PM, Daniel Mahler dmah...@gmail.com wrote:
 
 I would like to combine 2 parquet tables I have create.
 I tried:
 
   sc.union(sqx.parquetFile(fileA), sqx.parquetFile(fileB))
 
 but that just returns RDD[Row].
 How do I combine them to get a SchemaRDD[Row]?
 
 thanks
 Daniel


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: union of SchemaRDDs

2014-11-01 Thread Matei Zaharia
It does generalize types, but only on the intersection of the columns it seems. 
There might be a way to get the union of the columns too using HiveQL. Types 
generalize up with string being the most general.

Matei

 On Nov 1, 2014, at 6:22 PM, Daniel Mahler dmah...@gmail.com wrote:
 
 Thanks Matei. What does unionAll do if the input RDD schemas are not 100% 
 compatible. Does it take the union of the columns and generalize the types?
 
 thanks
 Daniel
 
 On Sat, Nov 1, 2014 at 6:08 PM, Matei Zaharia matei.zaha...@gmail.com 
 mailto:matei.zaha...@gmail.com wrote:
 Try unionAll, which is a special method on SchemaRDDs that keeps the schema 
 on the results.
 
 Matei
 
  On Nov 1, 2014, at 3:57 PM, Daniel Mahler dmah...@gmail.com 
  mailto:dmah...@gmail.com wrote:
 
  I would like to combine 2 parquet tables I have create.
  I tried:
 
sc.union(sqx.parquetFile(fileA), sqx.parquetFile(fileB))
 
  but that just returns RDD[Row].
  How do I combine them to get a SchemaRDD[Row]?
 
  thanks
  Daniel
 
 



Re: SparkContext.stop() ?

2014-10-31 Thread Matei Zaharia
You don't have to call it if you just exit your application, but it's useful 
for example in unit tests if you want to create and shut down a separate 
SparkContext for each test.
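
For example, the bare-bones pattern looks something like this (test framework boilerplate omitted):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("test"))
try {
  assert(sc.parallelize(1 to 10).count() == 10)
} finally {
  sc.stop()   // release the local context so the next test can create its own
}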

Matei

 On Oct 31, 2014, at 10:39 AM, Evan R. Sparks evan.spa...@gmail.com wrote:
 
 In cluster settings if you don't explicitly call sc.stop() your application 
 may hang. Like closing files, network connections, etc, when you're done with 
 them, it's a good idea to call sc.stop(), which lets the spark master know 
 that your application is finished consuming resources.
 
 On Fri, Oct 31, 2014 at 10:13 AM, Daniel Siegmann daniel.siegm...@velos.io 
 mailto:daniel.siegm...@velos.io wrote:
 It is used to shut down the context when you're done with it, but if you're 
 using a context for the lifetime of your application I don't think it matters.
 
 I use this in my unit tests, because they start up local contexts and you 
 can't have multiple local contexts open so each test must stop its context 
 when it's done.
 
 On Fri, Oct 31, 2014 at 11:12 AM, ll duy.huynh@gmail.com 
 mailto:duy.huynh@gmail.com wrote:
 what is it for?  when do we call it?
 
 thanks!
 
 
 
 --
 View this message in context: 
  http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-stop-tp17826.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 mailto:user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org 
 mailto:user-h...@spark.apache.org
 
 
 
 
 -- 
 Daniel Siegmann, Software Developer
 Velos
 Accelerating Machine Learning
 
 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
 E: daniel.siegm...@velos.io mailto:daniel.siegm...@velos.io W: www.velos.io 
 http://www.velos.io/



Re: Confused about class paths in spark 1.1.0

2014-10-30 Thread Matei Zaharia
Try using --jars instead of the driver-only options; they should work with 
spark-shell too but they may be less tested.

Unfortunately, you do have to specify each JAR separately; you can maybe use a 
shell script to list a directory and get a big list, or set up a project that 
builds all of the dependencies into one assembly JAR.

Matei

 On Oct 30, 2014, at 5:24 PM, Shay Seng s...@urbanengines.com wrote:
 
 Hi,
 
 I've been trying to move up from spark 0.9.2 to 1.1.0. 
 I'm getting a little confused with the setup for a few different use cases, 
 grateful for any pointers...
 
 (1) spark-shell + with jars that are only required by the driver
 (1a) 
 I added spark.driver.extraClassPath  /mypath/to.jar to my 
 spark-defaults.conf
 I launched spark-shell with:  ./spark-shell
 
 Here I see on the WebUI that spark.driver.extraClassPath has been set, but I 
 am NOT able to access any methods in the jar.
 
 (1b)
 I removed spark.driver.extraClassPath from my spark-default.conf
 I launched spark-shell with  .//spark-shell --driver.class.path /mypath/to.jar
 
 Again I see that the WebUI spark.driver.extraClassPath has been set. 
 But this time I am able to access the methods in the jar. 
 
  Q: Is spark-shell not considered the driver in this case? Why does using 
  --driver.class.path on the command line have different behavior from setting 
  it in spark-defaults.conf?
  
 
 (2) Rather than adding each jar individually, is there a way to use 
 wildcards? Previously with SPARK_CLASS_PATH I was able to use mypath/*  but 
 with --driver.class.path it seems to require individual files.
 
 tks
 Shay


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Confused about class paths in spark 1.1.0

2014-10-30 Thread Matei Zaharia
Yeah, I think you should file this as a bug. The problem is that JARs need to 
also be added into the Scala compiler and REPL class loader, and we probably 
don't do this for the ones in this driver config property.

Matei

 On Oct 30, 2014, at 6:07 PM, Shay Seng s...@urbanengines.com wrote:
 
 -- jars does indeed work but this causes the jars to also get shipped to 
 the workers -- which I don't want to do for efficiency reasons.
 
 I think you are saying that setting spark.driver.extraClassPath in 
 spark-default.conf  ought to have the same behavior as providing 
 --driver.class.apth  to spark-shell. Correct? If so I will file a bug 
 report since this is definitely not the case.
 
 
 On Thu, Oct 30, 2014 at 5:39 PM, Matei Zaharia matei.zaha...@gmail.com 
 mailto:matei.zaha...@gmail.com wrote:
 Try using --jars instead of the driver-only options; they should work with 
 spark-shell too but they may be less tested.
 
 Unfortunately, you do have to specify each JAR separately; you can maybe use 
 a shell script to list a directory and get a big list, or set up a project 
 that builds all of the dependencies into one assembly JAR.
 
 Matei
 
  On Oct 30, 2014, at 5:24 PM, Shay Seng s...@urbanengines.com 
  mailto:s...@urbanengines.com wrote:
 
  Hi,
 
  I've been trying to move up from spark 0.9.2 to 1.1.0.
  I'm getting a little confused with the setup for a few different use cases, 
  grateful for any pointers...
 
  (1) spark-shell + with jars that are only required by the driver
  (1a)
  I added spark.driver.extraClassPath  /mypath/to.jar to my 
  spark-defaults.conf
  I launched spark-shell with:  ./spark-shell
 
  Here I see on the WebUI that spark.driver.extraClassPath has been set, but 
  I am NOT able to access any methods in the jar.
 
  (1b)
  I removed spark.driver.extraClassPath from my spark-default.conf
  I launched spark-shell with  .//spark-shell --driver.class.path 
  /mypath/to.jar
 
  Again I see that the WebUI spark.driver.extraClassPath has been set.
  But this time I am able to access the methods in the jar.
 
  Q: Is spark-shell not considered the driver in this case?  why does using 
  --driver.class.path on the command line have a different behavior to 
  setting it in spark-defaults.conf ?
 
 
  (2) Rather than adding each jar individually, is there a way to use 
  wildcards? Previously with SPARK_CLASS_PATH I was able to use mypath/*  
  but with --driver.class.path it seems to require individual files.
 
  tks
  Shay
 
 



Re: BUG: when running as extends App, closures don't capture variables

2014-10-29 Thread Matei Zaharia
Good catch! If you'd like, you can send a pull request changing the files in 
docs/ to do this (see 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark), 
otherwise maybe open an issue on https://issues.apache.org/jira/browse/SPARK 
so we can track it.

Matei

 On Oct 29, 2014, at 3:16 PM, Michael Albert m_albert...@yahoo.com.INVALID 
 wrote:
 
 Greetings!
 
 This might be a documentation issue as opposed to a coding issue, in that 
 perhaps the correct answer is don't do that, but as this is not obvious, I 
 am writing.
 
 The following code produces output most would not expect:
 
 package misc
 
 import org.apache.spark.SparkConf
 import org.apache.spark.SparkContext
 import org.apache.spark.SparkContext._
 
 object DemoBug extends App {
     val conf = new SparkConf()
     val sc = new SparkContext(conf)
 
     val rdd = sc.parallelize(List("A", "B", "C", "D"))
     val str1 = "A"
 
     val rslt1 = rdd.filter(x => { x != "A" }).count
     val rslt2 = rdd.filter(x => { str1 != null && x != "A" }).count
 
     println("DemoBug: rslt1 = " + rslt1 + " rslt2 = " + rslt2)
 }
 
 This produces the output:
 DemoBug: rslt1 = 3 rslt2 = 0
 
 Compiled with sbt:
 libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.1.0"
 Run on an EC2 EMR instance with a recent image (hadoop 2.4.0, spark 1.1.0)
 
 If instead there is a proper main(), it works as expected.
 
 Thank you.
 
 Sincerely,
  Mike



Re: Is Spark in Java a bad idea?

2014-10-28 Thread Matei Zaharia
A pretty large fraction of users use Java, but a few features are still not 
available in it. JdbcRDD is one of them -- this functionality will likely be 
superseded by Spark SQL when we add JDBC as a data source. In the meantime, to 
use it, I'd recommend writing a class in Scala that has Java-friendly methods 
and getting an RDD to it from that. Basically the two parameters that weren't 
friendly there were the ClassTag and the getConnection and mapRow functions.

Subclassing RDD in Java is also not really supported, because that's an 
internal API. We don't expect users to be defining their own RDDs.

Matei

 On Oct 28, 2014, at 11:47 AM, critikaled isasmani@gmail.com wrote:
 
 Hi Ron,
 Whatever API you have in Scala, you can most likely use it from Java. Scala is
 interoperable with Java and vice versa. Scala being both object oriented
 and functional will make your job easier on the JVM, and it is more concise than
 Java. Take it as an opportunity and start learning Scala ;).
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-in-Java-a-bad-idea-tp17534p17538.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Is Spark in Java a bad idea?

2014-10-28 Thread Matei Zaharia
The overridable methods of RDD are marked as @DeveloperApi, which means that 
these are internal APIs used by people that might want to extend Spark, but are 
not guaranteed to remain stable across Spark versions (unlike Spark's public 
APIs).

BTW, if you want a way to do this that does not involve JdbcRDD or internal 
APIs, you can use SparkContext.parallelize followed by mapPartitions to read a 
subset of the data in each of your tasks. That can be done purely in Java. 
You'd probably parallelize a collection that contains ranges of the table that 
you want to scan, then open a connection to the DB in each task (in 
mapPartitions) and read the records from that range.
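
A rough sketch of that pattern (the JDBC URL, table and column names are placeholders, and the JDBC driver jar is assumed to be on the executor classpath):

import java.sql.DriverManager
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def loadTable(sc: SparkContext, url: String, numParts: Int, maxId: Long): RDD[(Long, String)] = {
  // One (lo, hi) id range per task.
  val step = maxId / numParts + 1
  val ranges = (0 until numParts).map(i => (i * step, (i + 1) * step - 1))
  sc.parallelize(ranges, numParts).mapPartitions { iter =>
    iter.flatMap { case (lo, hi) =>
      // Open a connection inside the task and read only this task's slice.
      val conn = DriverManager.getConnection(url)
      val rs = conn.createStatement().executeQuery(
        "SELECT id, name FROM my_table WHERE id BETWEEN " + lo + " AND " + hi)
      val rows = Iterator.continually(rs).takeWhile(_.next())
        .map(r => (r.getLong("id"), r.getString("name"))).toList
      conn.close()
      rows
    }
  }
}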

Matei

 On Oct 28, 2014, at 12:15 PM, Ron Ayoub ronalday...@live.com wrote:
 
 I interpret this to mean you have to learn Scala in order to work with Spark 
 in Scala (goes without saying) and also to work with Spark in Java (since you 
 have to jump through some hoops for basic functionality).
 
 The best path here is to take this as a learning opportunity and sit down and 
 learn Scala. 
 
 Regarding RDD being an internal API, it has two methods that clearly allow 
  you to override them, which the JdbcRDD does, and it looks close to trivial - 
  if I only knew Scala. Once I learn Scala, I would say the first thing I plan 
 on doing is writing my own OracleRDD with my own flavor of Jdbc code. Why 
 would this not be advisable?
  
 
  Subject: Re: Is Spark in Java a bad idea?
  From: matei.zaha...@gmail.com mailto:matei.zaha...@gmail.com
  Date: Tue, 28 Oct 2014 11:56:39 -0700
  CC: u...@spark.incubator.apache.org mailto:u...@spark.incubator.apache.org
  To: isasmani@gmail.com mailto:isasmani@gmail.com
  
  A pretty large fraction of users use Java, but a few features are still not 
  available in it. JdbcRDD is one of them -- this functionality will likely 
  be superseded by Spark SQL when we add JDBC as a data source. In the 
  meantime, to use it, I'd recommend writing a class in Scala that has 
  Java-friendly methods and getting an RDD to it from that. Basically the two 
  parameters that weren't friendly there were the ClassTag and the 
  getConnection and mapRow functions.
  
  Subclassing RDD in Java is also not really supported, because that's an 
  internal API. We don't expect users to be defining their own RDDs.
  
  Matei
  
   On Oct 28, 2014, at 11:47 AM, critikaled isasmani@gmail.com 
   mailto:isasmani@gmail.com wrote:
   
   Hi Ron,
    Whatever API you have in Scala, you can most likely use it from Java. Scala
    is interoperable with Java and vice versa. Scala being both object oriented
    and functional will make your job easier on the JVM, and it is more concise
    than Java. Take it as an opportunity and start learning Scala ;).
   
   
   
   --
   View this message in context: 
   http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-in-Java-a-bad-idea-tp17534p17538.html
   Sent from the Apache Spark User List mailing list archive at Nabble.com 
   http://nabble.com/.
   
   -
   To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
   For additional commands, e-mail: user-h...@spark.apache.org 
   mailto:user-h...@spark.apache.org
   
  
  
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
  mailto:user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org 
  mailto:user-h...@spark.apache.org
  



Re: Primitive arrays in Spark

2014-10-21 Thread Matei Zaharia
It seems that ++ does the right thing on arrays of longs, and gives you another 
one:

scala val a = Array[Long](1,2,3)
a: Array[Long] = Array(1, 2, 3)

scala val b = Array[Long](1,2,3)
b: Array[Long] = Array(1, 2, 3)

scala a ++ b
res0: Array[Long] = Array(1, 2, 3, 1, 2, 3)

scala res0.getClass
res1: Class[_ <: Array[Long]] = class [J

The problem might be that lots of intermediate space is allocated as you merge 
values two by two. In particular, if a key has N arrays mapping to it, your 
code will allocate O(N^2) space because it builds first an array of size 1, 
then 2, then 3, etc. You can make this faster by using aggregateByKey instead, 
and using an intermediate data structure other than an Array to do the merging 
(ideally you'd find a growable ArrayBuffer-like class specialized for Longs, 
but you can also just try ArrayBuffer).
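
For example, a rough sketch of the aggregateByKey version:

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.SparkContext._   // pair RDD functions
import org.apache.spark.rdd.RDD

// rdd1: RDD[(Long, Array[Long])] with duplicate keys, as in the original question.
def collapse(rdd1: RDD[(Long, Array[Long])]): RDD[(Long, Array[Long])] = {
  rdd1.aggregateByKey(new ArrayBuffer[Long]())(
    (buf, arr) => buf ++= arr,    // grow one buffer per key within a partition
    (b1, b2) => b1 ++= b2         // merge buffers across partitions
  ).mapValues(_.toArray)
}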

Matei



 On Oct 21, 2014, at 1:08 PM, Akshat Aranya aara...@gmail.com wrote:
 
 This is as much of a Scala question as a Spark question
 
 I have an RDD:
 
 val rdd1: RDD[(Long, Array[Long])]
 
 This RDD has duplicate keys that I can collapse such
 
 val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a, b) => a ++ b)
 
 If I start with an Array of primitive longs in rdd1, will rdd2 also have 
 Arrays of primitive longs?  I suspect, based on my memory usage, that this is 
 not the case.
 
 Also, would it be more efficient to do this:
 
 val rdd1: RDD[(Long, ArrayBuffer[Long])]
 
 and then
 
 val rdd2: RDD[(Long, Array[Long])] = rdd1.reduceByKey((a, b) => 
 a ++ b).mapValues(_.toArray)
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Submissions open for Spark Summit East 2015

2014-10-19 Thread Matei Zaharia
BTW several people asked about registration and student passes. Registration 
will open in a few weeks, and like in previous Spark Summits, I expect there to 
be a special pass for students.

Matei

 On Oct 18, 2014, at 9:52 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
 
 After successful events in the past two years, the Spark Summit conference 
 has expanded for 2015, offering both an event in New York on March 18-19 and 
 one in San Francisco on June 15-17. The conference is a great chance to meet 
 people from throughout the Spark community and see the latest news, tips and 
 use cases.
 
 Submissions are now open for Spark Summit East 2015, to be held in New York 
 on March 18-19. If you’d like to give a talk on use cases, neat applications, 
 or ongoing Spark development, submit your talk online today at 
 http://prevalentdesignevents.com/sparksummit2015/east/speaker/. Submissions 
 will be open until December 6th, 2014.
 
 If you missed this year’s Spark Summit, you can still find videos from all 
 talks online at http://spark-summit.org/2014.
 
 Hope to see you there,
 
 Matei


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: mllib.linalg.Vectors vs Breeze?

2014-10-18 Thread Matei Zaharia
toBreeze is private within Spark, it should not be accessible to users. If you 
want to make a Breeze vector from an MLlib one, it's pretty straightforward, 
and you can make your own utility function for it.
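
For example, a small converter along these lines (just a sketch):

import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

// User-side conversion from an MLlib vector to a Breeze vector.
def toBreeze(v: Vector): BV[Double] = v match {
  case dv: DenseVector  => new BDV[Double](dv.values)
  case sv: SparseVector => new BSV[Double](sv.indices, sv.values, sv.size)
}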

Matei


 On Oct 17, 2014, at 5:09 PM, Sean Owen so...@cloudera.com wrote:
 
 Yes, I think that's the logic, but then what does toBreezeVector return
 if it is not based on Breeze? And this is called a lot by client code
 since you often have to do something nontrivial to the vector. I
 suppose you can still have that thing return a Breeze vector and use
 it for no other purpose. I think this abstraction leaks though.
 
 On Fri, Oct 17, 2014 at 7:48 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
 I don't know the answer for sure, but just from an API perspective I'd guess
 that the Spark authors don't want to tie their API to Breeze. If at a future
 point they swap out a different implementation for Breeze, they don't have
 to change their public interface. MLlib's interface remains consistent while
 the internals are free to evolve.
 
 Nick
 
 
 On Friday, October 17, 2014, llduy.huynh@gmail.com wrote:
 
 Hello... I'm looking at the source code for mllib.linalg.Vectors and it
 looks
 like it's a wrapper around Breeze with very small changes (mostly changing
 the names).
 
 I don't have any problem with using the Spark wrapper around Breeze or Breeze
 directly.  I'm just curious to understand why this wrapper was created vs.
 pointing everyone to Breeze directly?
 
 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
 
 
 
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/mllib-linalg-Vectors-vs-Breeze-tp16722.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Submissions open for Spark Summit East 2015

2014-10-18 Thread Matei Zaharia
After successful events in the past two years, the Spark Summit conference has 
expanded for 2015, offering both an event in New York on March 18-19 and one in 
San Francisco on June 15-17. The conference is a great chance to meet people 
from throughout the Spark community and see the latest news, tips and use cases.

Submissions are now open for Spark Summit East 2015, to be held in New York on 
March 18-19. If you’d like to give a talk on use cases, neat applications, or 
ongoing Spark development, submit your talk online today at 
http://prevalentdesignevents.com/sparksummit2015/east/speaker/. Submissions 
will be open until December 6th, 2014.

If you missed this year’s Spark Summit, you can still find videos from all 
talks online at http://spark-summit.org/2014.

Hope to see you there,

Matei
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Matei Zaharia
The biggest scaling issue was supporting a large number of reduce tasks 
efficiently, which the JIRAs in that post handle. In particular, our current 
default shuffle (the hash-based one) has each map task open a separate file 
output stream for each reduce task, which wastes a lot of memory (since each 
stream has its own buffer).

A second thing that helped efficiency tremendously was Reynold's new network 
module (https://issues.apache.org/jira/browse/SPARK-2468). Doing I/O on 32 
cores, 10 Gbps Ethernet and 8+ disks efficiently is not easy, as can be seen 
when you try to scale up other software.

Finally, with 30,000 tasks even sending info about every map's output size to 
each reducer was a problem, so Reynold has a patch that avoids that if the 
number of tasks is large.

Matei

On Oct 10, 2014, at 10:09 PM, Ilya Ganelin ilgan...@gmail.com wrote:

 Hi Matei - I read your post with great interest. Could you possibly comment 
 in more depth on some of the issues you guys saw when scaling up spark and 
 how you resolved them? I am interested specifically in spark-related 
 problems. I'm working on scaling up spark to very large datasets and have 
 been running into a variety of issues. Thanks in advance!
 
 On Oct 10, 2014 10:54 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
 Hi folks,
 
 I interrupt your regularly scheduled user / dev list to bring you some pretty 
 cool news for the project, which is that we've been able to use Spark to 
 break MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x 
 fewer nodes. There's a detailed writeup at 
 http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
  Summary: while Hadoop MapReduce held last year's 100 TB world record by 
 sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 206 
 nodes; and we also scaled up to sort 1 PB in 234 minutes.
 
 I want to thank Reynold Xin for leading this effort over the past few weeks, 
 along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali Ghodsi. In 
 addition, we'd really like to thank Amazon's EC2 team for providing the 
 machines to make this possible. Finally, this result would of course not be 
 possible without the many many other contributions, testing and feature 
 requests from throughout the community.
 
 For an engine to scale from these multi-hour petabyte batch jobs down to 
 100-millisecond streaming and interactive queries is quite uncommon, and it's 
 thanks to all of you folks that we are able to make this happen.
 
 Matei
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Blog post: An Absolutely Unofficial Way to Connect Tableau to SparkSQL (Spark 1.1)

2014-10-11 Thread Matei Zaharia
Very cool Denny, thanks for sharing this!

Matei

On Oct 11, 2014, at 9:46 AM, Denny Lee denny.g@gmail.com wrote:

 https://www.concur.com/blog/en-us/connect-tableau-to-sparksql
 
 If you're wondering how to connect Tableau to SparkSQL - here are the steps 
 to connect Tableau to SparkSQL.  
 
 
 Enjoy!
 



Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Matei Zaharia
Hi folks,

I interrupt your regularly scheduled user / dev list to bring you some pretty 
cool news for the project, which is that we've been able to use Spark to break 
MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x fewer 
nodes. There's a detailed writeup at 
http://databricks.com/blog/2014/10/10/spark-breaks-previous-large-scale-sort-record.html.
 Summary: while Hadoop MapReduce held last year's 100 TB world record by 
sorting 100 TB in 72 minutes on 2100 nodes, we sorted it in 23 minutes on 206 
nodes; and we also scaled up to sort 1 PB in 234 minutes.

I want to thank Reynold Xin for leading this effort over the past few weeks, 
along with Parviz Deyhim, Xiangrui Meng, Aaron Davidson and Ali Ghodsi. In 
addition, we'd really like to thank Amazon's EC2 team for providing the 
machines to make this possible. Finally, this result would of course not be 
possible without the many many other contributions, testing and feature 
requests from throughout the community.

For an engine to scale from these multi-hour petabyte batch jobs down to 
100-millisecond streaming and interactive queries is quite uncommon, and it's 
thanks to all of you folks that we are able to make this happen.

Matei
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: add Boulder-Denver Spark meetup to list on website

2014-10-10 Thread Matei Zaharia
Added you, thanks! (You may have to shift-refresh the page to see it updated).

Matei

On Oct 10, 2014, at 1:52 PM, Michael Oczkowski michael.oczkow...@seeq.com 
wrote:

 Please add the Boulder-Denver Spark meetup group to the list on the website.
 http://www.meetup.com/Boulder-Denver-Spark-Meetup/
  
  
 Michael Oczkowski, Ph.D.
 Big Data Architect  Chief Data Scientist 
 Seeq Corporation
 206-801-9339 x704
 Transforming industrial process data into actionable intelligence
 Connect: LinkedIn | Facebook | Twitter



Re: Convert a org.apache.spark.sql.SchemaRDD[Row] to a RDD of Strings

2014-10-09 Thread Matei Zaharia
A SchemaRDD is still an RDD, so you can just do rdd.map(row => row.toString). 
Or if you want to get a particular field of the row, you can do rdd.map(row => 
row(3).toString).

Matei

On Oct 9, 2014, at 1:22 PM, Soumya Simanta soumya.sima...@gmail.com wrote:

 I've a SchemaRDD that I want to convert to a RDD that contains String. How do 
 I convert the Row inside the SchemaRDD to String? 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Matei Zaharia
I'm pretty sure inner joins on Spark SQL already build only one of the sides. 
Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators. Only outer 
joins do both, and it seems like we could optimize it for those that are not 
full.

Matei


On Oct 7, 2014, at 11:04 PM, Haopu Wang hw...@qilinsoft.com wrote:

 Liquan, yes, for full outer join, one hash table on both sides is more 
 efficient.
  
 For the left/right outer join, it looks like one hash table should be enough.
  
 From: Liquan Pei [mailto:liquan...@gmail.com] 
 Sent: September 30, 2014, 18:34
 To: Haopu Wang
 Cc: d...@spark.apache.org; user
 Subject: Re: Spark SQL question: why build hashtable for both sides in 
 HashOuterJoin?
  
 Hi Haopu,
  
 How about full outer join? One hash table may not be efficient for this case. 
  
 Liquan
  
 On Mon, Sep 29, 2014 at 11:47 PM, Haopu Wang hw...@qilinsoft.com wrote:
 Hi, Liquan, thanks for the response.
  
 In your example, I think the hash table should be built on the right side, 
 so Spark can iterate through the left side and find matches in the right side 
 from the hash table efficiently. Please comment and suggest, thanks again!
  
 From: Liquan Pei [mailto:liquan...@gmail.com] 
 Sent: September 30, 2014, 12:31
 To: Haopu Wang
 Cc: d...@spark.apache.org; user
 Subject: Re: Spark SQL question: why build hashtable for both sides in 
 HashOuterJoin?
  
 Hi Haopu,
  
 My understanding is that the hashtable on both the left and right side is used 
 for including null values in the result in an efficient manner. If the hash table is 
 only built on one side, let's say the left side, and we perform a left outer join, 
 then for each row on the left side a scan over the right side is needed to make sure 
 there are no matching tuples for that row on the left side. 
  
 Hope this helps!
 Liquan
  
 On Mon, Sep 29, 2014 at 8:36 PM, Haopu Wang hw...@qilinsoft.com wrote:
 I take a look at HashOuterJoin and it's building a Hashtable for both
 sides.
 
 This consumes quite a lot of memory when the partition is big. And it
 doesn't reduce the iteration on streamed relation, right?
 
 Thanks!
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 
 
  
 -- 
 Liquan Pei 
 Department of Physics 
 University of Massachusetts Amherst
 
 
  
 -- 
 Liquan Pei 
 Department of Physics 
 University of Massachusetts Amherst



Re: Spark SQL -- more than two tables for join

2014-10-07 Thread Matei Zaharia
The issue is that you're using SQLContext instead of HiveContext. SQLContext 
implements a smaller subset of the SQL language and so you're getting a SQL 
parse error because it doesn't support the syntax you have. Look at how you'd 
write this in HiveQL, and then try doing that with HiveContext.
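
For example, in Scala it looks roughly like this (the PySpark HiveContext works the same way); register your tables as before and the multi-way join then parses fine:

import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
// Register or point at your tables as before, then:
val joined = hiveCtx.sql(
  "SELECT * FROM sales " +
  "INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY " +
  "INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY AND magasin.FORM_KEY = eans.FORM_KEY)")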

On Oct 7, 2014, at 7:20 AM, Gen gen.tan...@gmail.com wrote:

 Hi, in fact, the same problem happens when I try several joins together:
 
 SELECT * 
 FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY 
 INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY =
 eans.FORM_KEY)
 
 py4j.protocol.Py4JJavaError: An error occurred while calling o1229.sql.
 : java.lang.RuntimeException: [1.269] failure: ``UNION'' expected but
 `INNER' found
 
 SELECT sales.Date AS Date, sales.ID_FOYER AS ID_FOYER, Sales.STO_KEY AS
 STO_KEY,sales.Quantite AS Quantite, sales.Prix AS Prix, sales.Total AS
 Total, magasin.FORM_KEY AS FORM_KEY, eans.UB_KEY AS UB_KEY FROM sales INNER
 JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON
 (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY)
 
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.SqlParser.apply(SqlParser.scala:60)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:73)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:260)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at
 py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at
 py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
 
 I use Spark 1.1.0, so I have the impression that Spark SQL doesn't support
 several joins together. 
 
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-more-than-two-tables-for-join-tp13865p15848.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: run scalding on spark

2014-10-01 Thread Matei Zaharia
Pretty cool, thanks for sharing this! I've added a link to it on the wiki: 
https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects.

Matei

On Oct 1, 2014, at 1:41 PM, Koert Kuipers ko...@tresata.com wrote:

 well, sort of! we make input/output formats (cascading taps, scalding 
 sources) available in spark, and we ported the scalding fields api to spark. 
 so it's for those of us that have a serious investment in cascading/scalding 
 and want to leverage that in spark.
 
 blog is here:
 http://tresata.com/tresata-open-sources-spark-on-scalding
 



Re: Spark And Mapr

2014-10-01 Thread Matei Zaharia
It should just work in PySpark, the same way it does in Java / Scala apps.

Matei

On Oct 1, 2014, at 4:12 PM, Sungwook Yoon sy...@maprtech.com wrote:

 
 Yes.. you should use maprfs://
 
 I personally haven't used pyspark, I just used scala shell or standalone with 
 MapR.
 
 I think you need to set the classpath right, adding a jar like
  /opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar to the 
 classpath.
 
 Sungwook
 
 On Wed, Oct 1, 2014 at 4:09 PM, Addanki, Santosh Kumar 
 santosh.kumar.adda...@sap.com wrote:
 Hi
 
  
 
 We would like to do this in PySpark Environment
 
  
 
 i.e something like
 
  
 
 test = sc.textFile("maprfs:///user/root/test") or
 
 test = sc.textFile("hdfs:///user/root/test") or
 
  
 
  
 
 Currently when we try
 
 test = sc.textFile("maprfs:///user/root/test")
 
  
 
 It throws the error
 
 No File-System for scheme: maprfs
 
  
 
  
 
 Best Regards
 
 Santosh
 
  
 
  
 
  
 
 From: Vladimir Rodionov [mailto:vrodio...@splicemachine.com] 
 Sent: Wednesday, October 01, 2014 3:59 PM
 To: Addanki, Santosh Kumar
 Cc: user@spark.apache.org
 Subject: Re: Spark And Mapr
 
  
 
 There is doc on MapR:
 
  
 
 http://doc.mapr.com/display/MapR/Accessing+MapR-FS+in+Java+Applications
 
  
 
 -Vladimir Rodionov
 
  
 
 On Wed, Oct 1, 2014 at 3:00 PM, Addanki, Santosh Kumar 
 santosh.kumar.adda...@sap.com wrote:
 
 Hi
 
  
 
 We were using Horton 2.4.1 as our Hadoop distribution and now switched to MapR
 
  
 
 Previously to read a text file  we would use :
 
  
 
 test = sc.textFile("hdfs://10.48.101.111:8020/user/hdfs/test")
 
  
 
  
 
 What would be the equivalent of the same for Mapr.
 
  
 
 Best Regards
 
 Santosh
 
  
 
 



Re: Multiple spark shell sessions

2014-10-01 Thread Matei Zaharia
You need to set --total-executor-cores to limit how many total cores it grabs 
on the cluster. --executor-cores is just for each individual executor, but it 
will try to launch many of them.

Matei

On Oct 1, 2014, at 4:29 PM, Sanjay Subramanian 
sanjaysubraman...@yahoo.com.INVALID wrote:

 hey guys
 
 I am using  spark 1.0.0+cdh5.1.0+41
 When two users try to run spark-shell, the first user's spark-shell shows
 active in the 18080 Web UI, but the second user's shows WAITING; the shell
 prints a bunch of errors but still reaches the spark-shell prompt, and
 sc.master seems to point to the correct master.
 
 I tried controlling the number of cores with the spark-shell option
 --executor-cores 8, but it does not work.
 
 thanks
 
 sanjay 
 
 



Re: Spark Code to read RCFiles

2014-09-23 Thread Matei Zaharia
Is your file managed by Hive (and thus present in a Hive metastore)? In that 
case, Spark SQL 
(https://spark.apache.org/docs/latest/sql-programming-guide.html) is the 
easiest way.
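
For instance, a minimal sketch, assuming the RCFile data is registered as a Hive
table named my_rcfile_table (the name is illustrative) and Spark 1.1+, where
HiveContext.sql accepts HiveQL (on 1.0.x the equivalent call is hql):

  import org.apache.spark.sql.hive.HiveContext

  val hiveCtx = new HiveContext(sc)
  val rows = hiveCtx.sql("SELECT * FROM my_rcfile_table")
  rows.take(10).foreach(println)

Spark SQL then drives the RCFile SerDe for you, so you never have to touch
BytesRefArrayWritable directly.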

Matei

On September 23, 2014 at 2:26:10 PM, Pramod Biligiri (pramodbilig...@gmail.com) 
wrote:

Hi,
I'm trying to read some data in RCFiles using Spark, but can't seem to find a 
suitable example anywhere. Currently I've written the following bit of code 
that lets me count() the no. of records, but when I try to do a collect() or a 
map(), it fails with a ConcurrentModificationException. I'm running Spark 1.0.1 
on a Hadoop YARN cluster:

import org.apache.hadoop.hive.ql.io.RCFileInputFormat;
val file = sc.hadoopFile("/hdfs/path/to/file",
classOf[org.apache.hadoop.hive.ql.io.RCFileInputFormat[org.apache.hadoop.io.LongWritable,org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable]],
classOf[org.apache.hadoop.io.LongWritable], 
classOf[org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable]
)
file.collect()

org.apache.spark.SparkException: Job aborted due to stage failure: Task 10.0:6 
failed 4 times, most recent failure: Exception failure in TID 395 on host 
(redacted): com.esotericsoftware.kryo.KryoException: 
java.util.ConcurrentModificationException
Serialization trace:
classes (sun.misc.Launcher$AppClassLoader)
parent (org.apache.spark.repl.ExecutorClassLoader)
classLoader (org.apache.hadoop.mapred.JobConf)
conf (org.apache.hadoop.io.compress.GzipCodec)
codec (org.apache.hadoop.hive.ql.io.RCFile$ValueBuffer)
this$0 
(org.apache.hadoop.hive.ql.io.RCFile$ValueBuffer$LazyDecompressionCallbackImpl)
lazyDecompressObj (org.apache.hadoop.hive.serde2.columnar.BytesRefWritable)
bytesRefWritables (org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable)
        
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:585)
        
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
        com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
        
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
        
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
        com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
        
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
        
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
        com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
        
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
        
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
        com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
        
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
        
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
        com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
        
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
        
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
        com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
        
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
        
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
        com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
        
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)
        
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
        com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
        
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
        
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
        com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
        com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:38)
        com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:34)
        com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
        
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)
        
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
        com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
        
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:141)
        

Re: Possibly a dumb question: differences between saveAsNewAPIHadoopFile and saveAsNewAPIHadoopDataset?

2014-09-22 Thread Matei Zaharia
File takes a filename to write to, while Dataset takes only a Hadoop job
configuration. This means that Dataset is more general (it can also save to
storage systems that are not file systems, such as key-value stores), but is
more annoying to use if you actually have a file.
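
A rough sketch of the two call shapes (paths and types are illustrative; this
assumes Hadoop 2.x for Job.getInstance and Spark 1.x, where the pair-RDD methods
come in via SparkContext._):

  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.io.Text
  import org.apache.hadoop.mapreduce.Job
  import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}
  import org.apache.spark.SparkContext._  // pair-RDD save methods on Spark 1.x

  val pairs = sc.parallelize(Seq(("a", "1"), ("b", "2")))
    .map { case (k, v) => (new Text(k), new Text(v)) }

  // File variant: the output path is an explicit argument
  pairs.saveAsNewAPIHadoopFile[TextOutputFormat[Text, Text]]("hdfs:///tmp/out-file")

  // Dataset variant: the output target lives entirely in the job configuration,
  // so it could just as well describe a key-value store instead of a file system
  val job = Job.getInstance(sc.hadoopConfiguration)
  job.setOutputKeyClass(classOf[Text])
  job.setOutputValueClass(classOf[Text])
  job.setOutputFormatClass(classOf[TextOutputFormat[Text, Text]])
  FileOutputFormat.setOutputPath(job, new Path("hdfs:///tmp/out-dataset"))
  pairs.saveAsNewAPIHadoopDataset(job.getConfiguration)

Internally, the File variant essentially builds such a configuration from the
path you give it and then goes through the same write path as Dataset.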

Matei

On September 21, 2014 at 11:24:35 PM, innowireless TaeYun Kim 
(taeyun@innowireless.co.kr) wrote:

Hi,

 

I’m confused with saveAsNewAPIHadoopFile and saveAsNewAPIHadoopDataset.

What’s the difference between the two?

What’s the individual use cases of the two APIs?

Could you describe the internal flows of the two APIs briefly?

 

I’ve used Spark for several months, but I have no experience with MapReduce
programming.

(I’ve read a few book chapters on MapReduce, but haven't actually written any
code myself.)

So maybe this confusion comes from my lack of experience with MapReduce
programming.

(I hoped it wouldn’t be necessary, since I could use Spark…)

 

Thanks.

 

Re: paging through an RDD that's too large to collect() all at once

2014-09-18 Thread Matei Zaharia
Hey Dave, try out RDD.toLocalIterator -- it gives you an iterator that reads 
one RDD partition at a time. Scala iterators also have methods like grouped() 
that let you get fixed-size groups.
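
A minimal sketch (assuming rdd is the RDD in question; the batch size, sleep,
and pushDownstream are illustrative stand-ins for your own throttling and
downstream client):

  // Pull the RDD back one partition at a time, then push downstream in batches
  val localIter = rdd.toLocalIterator
  localIter.grouped(1000).foreach { batch =>
    batch.foreach(record => pushDownstream(record)) // hypothetical sink call
    Thread.sleep(100)                               // crude throttling between batches
  }

Note that toLocalIterator runs one job per partition, so you trade some extra
scheduling overhead for bounded driver memory.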

Matei

On September 18, 2014 at 7:58:34 PM, dave-anderson (david.ander...@pobox.com) 
wrote:

I have an RDD on the cluster that I'd like to iterate over and perform some 
operations on each element (push data from each element to another 
downstream system outside of Spark). I'd like to do this at the driver so I 
can throttle the rate that I push to the downstream system (as opposed to 
submitting a job to the Spark cluster and parallelizing the work - and 
likely flooding the downstream system). 

The RDD is too big to collect() all at once back into the memory space of 
the driver. Ideally I'd like to be able to page through the dataset, 
iterating through a chunk of n RDD elements at a time back at the driver. 
It doesn't have to be _exactly_ n elements, just a reasonably small set of 
elements at a time. 

Is there a simple way to do this? 

It looks like I could use RDD.filter() or RDD.collect[U](f: 
PartialFunction[T, U]). Either of those techniques requires defining a 
function that will filter the RDD. But the shape of the data in the RDD 
could be such that, for a given function (say splitting by timestamp by hour 
of day), it won't reliably split up into reasonably sized pages. Also, it 
requires doing some analysis to determine boundaries on the data for 
filtering. Point is, it's extra logic 

Any thoughts on if there's a simpler way to page through RDD elements back 
at the driver? 




-- 
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/paging-through-an-RDD-that-s-too-large-to-collect-all-at-once-tp14638.html
 
Sent from the Apache Spark User List mailing list archive at Nabble.com. 

- 
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
For additional commands, e-mail: user-h...@spark.apache.org 



Re: Short Circuit Local Reads

2014-09-17 Thread Matei Zaharia
I'm pretty sure it does help, though I don't have any numbers for it. In any 
case, Spark will automatically benefit from this if you link it to a version of 
HDFS that contains this.
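
For reference, enabling it is a client/datanode change in hdfs-site.xml (a
sketch; the socket path is illustrative and distro-dependent):

  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/var/run/hadoop-hdfs/dn_socket</value>
  </property>

Spark then benefits transparently whenever an executor reads a block stored on
the same node.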

Matei

On September 17, 2014 at 5:15:47 AM, Gary Malouf (malouf.g...@gmail.com) wrote:

Cloudera had a blog post about this in August 2013: 
http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-reads-bring-better-performance-and-security-to-hadoop/

Has anyone been using this in production? Curious whether it made a significant
difference from a Spark perspective.

Re: Spark as a Library

2014-09-16 Thread Matei Zaharia
If you want to run the computation on just one machine (using Spark's local 
mode), it can probably run in a container. Otherwise you can create a 
SparkContext there and connect it to a cluster outside. Note that I haven't 
tried this though, so the security policies of the container might be too 
restrictive. In that case you'd have to run the app outside and expose an RPC 
interface between them.
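
If you do go the embedded route, the setup is roughly this (a sketch; the master
URL and jar path are illustrative, and the cluster must be reachable from inside
the container):

  import org.apache.spark.{SparkConf, SparkContext}

  // e.g. in a servlet's init() or an application-scoped singleton
  val conf = new SparkConf()
    .setAppName("embedded-spark-app")
    .setMaster("spark://cluster-master:7077") // or "local[*]" to stay in-process
    .setJars(Seq("/path/to/your-app.jar"))    // so executors can load your classes
  val sc = new SparkContext(conf)

Keep a single SparkContext per JVM and reuse it across requests rather than
creating one per request.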

Matei

On September 16, 2014 at 8:17:08 AM, Ruebenacker, Oliver A 
(oliver.ruebenac...@altisource.com) wrote:

 

 Hello,

 

  Suppose I want to use Spark from an application that I already submit to run 
in another container (e.g. Tomcat). Is this at all possible? Or do I have to 
split the app into two components, and submit one to Spark and one to the other 
container? In that case, what is the preferred way for the two components to 
communicate with each other? Thanks!

 

 Best, Oliver

 

Oliver Ruebenacker | Solutions Architect

 

Altisource™

290 Congress St, 7th Floor | Boston, Massachusetts 02210

P: (617) 728-5582 | ext: 275585

oliver.ruebenac...@altisource.com | www.Altisource.com

 


Re: scala 2.11?

2014-09-15 Thread Matei Zaharia
Scala 2.11 work is under way in open pull requests though, so hopefully it will 
be in soon.

Matei

On September 15, 2014 at 9:48:42 AM, Mohit Jaggi (mohitja...@gmail.com) wrote:

ah...thanks!

On Mon, Sep 15, 2014 at 9:47 AM, Mark Hamstra m...@clearstorydata.com wrote:
No, not yet.  Spark SQL is using org.scalamacros:quasiquotes_2.10.

On Mon, Sep 15, 2014 at 9:28 AM, Mohit Jaggi mohitja...@gmail.com wrote:
Folks,
I understand Spark SQL uses quasiquotes. Does that mean Spark has now moved to 
Scala 2.11?

Mohit.




Re: scala 2.11?

2014-09-15 Thread Matei Zaharia
I think the current plan is to put it in 1.2.0, so that's what I meant by
"soon". It might be possible to backport it too, but I'd be hesitant to do that
as a maintenance release on 1.1.x and 1.0.x since it would require nontrivial
changes to the build that could break things on Scala 2.10.

Matei

On September 15, 2014 at 12:19:04 PM, Mark Hamstra (m...@clearstorydata.com) 
wrote:

Are we going to put 2.11 support into 1.1 or 1.0? Otherwise "will be in soon"
applies to the master development branch, but actually being in the Spark 1.2.0
release won't occur until the second half of November at the earliest.

On Mon, Sep 15, 2014 at 12:11 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Scala 2.11 work is under way in open pull requests though, so hopefully it will 
be in soon.

Matei

On September 15, 2014 at 9:48:42 AM, Mohit Jaggi (mohitja...@gmail.com) wrote:

ah...thanks!

On Mon, Sep 15, 2014 at 9:47 AM, Mark Hamstra m...@clearstorydata.com wrote:
No, not yet.  Spark SQL is using org.scalamacros:quasiquotes_2.10.

On Mon, Sep 15, 2014 at 9:28 AM, Mohit Jaggi mohitja...@gmail.com wrote:
Folks,
I understand Spark SQL uses quasiquotes. Does that mean Spark has now moved to 
Scala 2.11?

Mohit.




