Re: ORC v/s Parquet for Spark 2.0

2016-07-28 Thread Alexander Pivovarov
Found 0 matching posts for *ORC v/s Parquet for Spark 2.0* in Apache Spark
User List 
http://apache-spark-user-list.1001560.n3.nabble.com/

Anyone have a link to this discussion? Want to share it with my colleagues.

On Thu, Jul 28, 2016 at 2:35 PM, Mich Talebzadeh 
wrote:

> As far as I know, Spark still lacks the ability to handle updates or
> deletes vis-à-vis ORC transactional tables. As you may know, in Hive an ORC
> transactional table can handle updates and deletes; transactional support
> was added to Hive for ORC tables. There is no transactional support with
> Spark SQL on ORC tables yet. As for locking and concurrency (as used by
> Hive) with a Spark app running a Hive context, I am not convinced this
> actually works. Case in point: you can test it for yourself in Spark and
> see whether locks are applied in the Hive metastore. In my opinion Spark's
> value comes as a query tool for faster query processing (DAG + in-memory
> capability).
>
> HTH
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 28 July 2016 at 18:46, Ofir Manor  wrote:
>
>> BTW - this thread has many anecdotes on Apache ORC vs. Apache Parquet (I
>> personally think both are great at this point).
>> But the original question was about Spark 2.0. Does anyone have insights
>> about Parquet-specific optimizations / limitations vs. ORC-specific
>> optimizations / limitations in pre-2.0 vs. 2.0? I've put one in the
>> beginning of the thread regarding Structured Streaming, but there was a
>> general claim that pre-2.0 Spark was missing many ORC optimizations, and
>> that some (all?) were added in 2.0.
>> I saw that a lot of related tickets were closed in 2.0, but it would be
>> great if someone close to the details could explain.
>>
>> Ofir Manor
>>
>> Co-Founder & CTO | Equalum
>>
>> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>>
>> On Thu, Jul 28, 2016 at 6:49 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Like anything else your mileage varies.
>>>
>>> ORC with Vectorised Query Execution is the nearest one can get to a proper
>>> data warehouse like SAP IQ or Teradata with columnar indexes. To me that is
>>> cool. Parquet has been around and has its use case as well.
>>>
>>> I guess there is no hard and fast rule on which one to use all the time.
>>> Use the one that provides the best fit for the conditions.
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>> On 28 July 2016 at 09:18, Jörn Franke  wrote:
>>>
 I see it more as a process of innovation, and thus competition is good.
 Companies just should not follow these religious arguments but try for
 themselves what suits them. There is more than software when using software
 ;)

 On 28 Jul 2016, at 01:44, Mich Talebzadeh 
 wrote:

 And frankly this is becoming some sort of religious argument now



 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.



 On 28 July 2016 at 00:01, Sudhir Babu Pothineni 
 wrote:

> It depends 

Re: Is there a way to merge parquet small files?

2016-05-19 Thread Alexander Pivovarov
Try to use the hadoop setting mapreduce.input.fileinputformat.split.maxsize to
control RDD partition size.
I heard that a DataFrame can read several files in one task.
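
For what it's worth, a rough sketch of both approaches on Spark 1.x (the paths,
split size and target partition count below are just placeholders):

  // hint a larger max input split size, as suggested above
  sc.hadoopConfiguration.set(
    "mapreduce.input.fileinputformat.split.maxsize", (256 * 1024 * 1024).toString)

  // or compact the existing small files into fewer, larger ones
  val df = sqlContext.read.parquet("hdfs:///logs/parquet/2016-05-19")
  df.coalesce(16).write.parquet("hdfs:///logs/parquet-merged/2016-05-19")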


On Thu, May 19, 2016 at 8:50 PM, 王晓龙/0515 
wrote:

> I’m using a Spark Streaming program to store log messages into parquet files
> every 10 mins.
> Now, when I query the parquet, it usually takes hundreds of thousands of
> stages to compute a single count.
> I looked into the parquet files’ path and found a great number of small
> files.
>
> Did the small files cause the problem? Can I merge them, or is there a
> better way to solve it?
>
> Lots of thanks.
>
> 
>
> The content of this email represents only the personal views and opinions of
> the sender and is unrelated to the views and opinions of China Merchants Bank
> Co., Ltd. and its branch offices; China Merchants Bank Co., Ltd. and its
> branch offices accept no responsibility for the content of this email. This
> email is intended only for the recipient; if you received it in error, please
> delete it immediately.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark on AWS

2016-04-28 Thread Alexander Pivovarov
Fatima, the easiest way to create a Spark cluster on AWS is to create an EMR
cluster and select the Spark application (the latest EMR includes Spark 1.6.1).

Spark works well with S3 (read and write). However, it's recommended to
set spark.speculation to true (it's expected that some tasks fail if you read
a large S3 folder, so speculation should help).
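
One way to turn that on is in spark-defaults.conf:

  spark.speculation  true

or, when creating the cluster, as an EMR configuration JSON (a sketch; the
"spark-defaults" classification is the standard one):

  [
    {
      "Classification": "spark-defaults",
      "Properties": { "spark.speculation": "true" }
    }
  ]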



On Thu, Apr 28, 2016 at 2:39 PM, Fatma Ozcan  wrote:

> What is your experience using Spark on AWS? Are you setting up your own
> Spark cluster, and using HDFS? Or are you using Spark as a service from
> AWS? In the latter case, what is your experience of using S3 directly,
> without having HDFS in between?
>
> Thanks,
> Fatma
>


Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Alexander Pivovarov
Spark on Yarn supports dynamic resource allocation

So you can run several spark-shells / spark-submits / spark-jobserver /
zeppelin instances on one cluster without defining upfront how many executors
or how much memory you want to allocate to each app.

Great feature for regular users who just want to run Spark / Spark SQL
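
A sketch of the usual settings for that in spark-defaults.conf (the min/max
values here are placeholders; dynamic allocation also needs the external
shuffle service registered as a YARN aux-service on each NodeManager):

  spark.dynamicAllocation.enabled        true
  spark.shuffle.service.enabled          true
  spark.dynamicAllocation.minExecutors   1
  spark.dynamicAllocation.maxExecutors   50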


On Thu, Apr 14, 2016 at 12:05 PM, Sean Owen <so...@cloudera.com> wrote:

> I don't think usage is the differentiating factor. YARN and standalone
> are pretty well supported. If you are only running a Spark cluster by
> itself with nothing else, standalone is probably simpler than setting
> up YARN just for Spark. However if you're running on a cluster that
> will host other applications, you'll need to integrate with a shared
> resource manager and its security model, and for anything
> Hadoop-related that's YARN. Standalone wouldn't make as much sense.
>
> On Thu, Apr 14, 2016 at 6:46 PM, Alexander Pivovarov
> <apivova...@gmail.com> wrote:
> > AWS EMR includes Spark on Yarn
> > Hortonworks and Cloudera platforms include Spark on Yarn as well
> >
> >
> > On Thu, Apr 14, 2016 at 7:29 AM, Arkadiusz Bicz <
> arkadiusz.b...@gmail.com>
> > wrote:
> >>
> >> Hello,
> >>
> >> Is there any statistics regarding YARN vs Standalone Spark Usage in
> >> production ?
> >>
> >> I would like to choose most supported and used technology in
> >> production for our project.
> >>
> >>
> >> BR,
> >>
> >> Arkadiusz Bicz
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
>


Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Alexander Pivovarov
AWS EMR includes Spark on Yarn
Hortonworks and Cloudera platforms include Spark on Yarn as well


On Thu, Apr 14, 2016 at 7:29 AM, Arkadiusz Bicz 
wrote:

> Hello,
>
> Is there any statistics regarding YARN vs Standalone Spark Usage in
> production ?
>
> I would like to choose most supported and used technology in
> production for our project.
>
>
> BR,
>
> Arkadiusz Bicz
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark and N-tier architecture

2016-03-29 Thread Alexander Pivovarov
Here are step-by-step instructions on how to run it on EMR (or other)
clusters:
https://github.com/spark-jobserver/spark-jobserver/blob/master/doc/EMR.md

On Tue, Mar 29, 2016 at 4:44 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:

> It is a separate project based on my understanding.  I am currently
> evaluating it right now.
>
>
> On Mar 29, 2016, at 16:17, Michael Segel <msegel_had...@hotmail.com>
> wrote:
>
>
>
> Begin forwarded message:
>
> *From: *Michael Segel <mse...@segel.com>
> *Subject: **Re: Spark and N-tier architecture*
> *Date: *March 29, 2016 at 4:16:44 PM MST
> *To: *Alexander Pivovarov <apivova...@gmail.com>
> *Cc: *Mich Talebzadeh <mich.talebza...@gmail.com>, Ashok Kumar <
> ashok34...@yahoo.com>, User <user@spark.apache.org>
>
> So…
>
> Is spark-jobserver an official part of spark or something else?
>
> From what I can find via a quick Google … this isn’t part of the core
> spark distribution.
>
> On Mar 29, 2016, at 3:50 PM, Alexander Pivovarov <apivova...@gmail.com>
> wrote:
>
> https://github.com/spark-jobserver/spark-jobserver
>
>
>
>


Re: Spark and N-tier architecture

2016-03-29 Thread Alexander Pivovarov
Spark-jobserver was originally created by Ooyala
Now it's an open-source, Apache-licensed project.


Re: Spark and N-tier architecture

2016-03-29 Thread Alexander Pivovarov
Spark is a distributed data processing engine plus distributed in-memory /
disk data cache

spark-jobserver provides a REST API to your Spark applications. It allows you
to submit jobs to Spark and get results in sync or async mode.

It can also create a long-running Spark context to cache RDDs in memory under
some name (namedRDD) and then use them to serve requests from multiple users.
Because the RDD is in memory, responses should be super fast (seconds).

https://github.com/spark-jobserver/spark-jobserver
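
A sketch of what those REST calls typically look like against a job server on
its default port 8090 (the jar name, app name, context name and job class
below are all placeholders):

  # upload the application jar
  curl --data-binary @my-jobs.jar localhost:8090/jars/my-app

  # create a long-running context that can hold named RDDs
  curl -d "" "localhost:8090/contexts/shared-ctx?num-cpu-cores=4&memory-per-node=4g"

  # run a job synchronously in that context and get the result back as JSON
  curl -d "input.path = s3://my-bucket/data" \
    "localhost:8090/jobs?appName=my-app&classPath=com.example.MyJob&context=shared-ctx&sync=true"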


On Tue, Mar 29, 2016 at 2:50 PM, Mich Talebzadeh 
wrote:

> Interesting question.
>
> The most widely used application of N-tier is the traditional three-tier
> architecture that has been the backbone of Client-server architecture by
> having presentation layer, application layer and data layer. This is
> primarily for performance, scalability and maintenance. The most profound
> changes that Big data space has introduced to N-tier architecture is the
> concept of horizontal scaling as opposed to the previous tiers that relied
> on vertical scaling. HDFS is an example of horizontal scaling at the data
> tier by adding more JBODS to storage. Similarly adding more nodes to Spark
> cluster should result in better performance.
>
> Bear in mind that these tiers are at logical levels, which means that there
> may or may not be as many physical layers. For example, multiple virtual
> servers can be hosted on the same physical server.
>
> With regard to Spark, it is effectively a powerful query tool that sits
> between the presentation layer (say Tableau) and HDFS or Hive, as you
> alluded. In that sense you can think of Spark as part of the application
> layer that communicates with the backend via a number of protocols,
> including standard JDBC. The picture is rather blurred here as to whether
> Spark is a database or a query tool. IMO it is a query tool in the sense that
> Spark by itself does not have its own storage concept or metastore. Thus it
> relies on others to provide that service.
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 29 March 2016 at 22:07, Ashok Kumar 
> wrote:
>
>> Experts,
>>
>> One of the terms I hear used is N-tier architecture within Big Data, used
>> for availability, performance, etc. I also hear that Spark, by means of its
>> query engine and in-memory caching, fits into the middle tier (application
>> layer), with HDFS and Hive perhaps providing the data tier. Can someone
>> elaborate on the role of Spark here? For example, a Scala program that we
>> write uses JDBC to talk to databases, so in that sense is Spark a middle-tier
>> application?
>>
>> I hope that someone can clarify this and, if so, what would be the best
>> practice in using Spark as a middle tier within Big Data.
>>
>> Thanks
>>
>>
>


Re: Running Spark on Yarn

2016-03-29 Thread Alexander Pivovarov
OK, start an EMR-4.3.0 or 4.2.0 cluster and look at how Spark on YARN is
configured properly there.


Re: Running Spark on Yarn

2016-03-29 Thread Alexander Pivovarov
For a small cluster, set the following settings:

yarn-site.xml

<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>32</value>
</property>

capacity-scheduler.xml

<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
  <description>
    Maximum percent of resources in the cluster which can be used to run
    application masters i.e. controls number of concurrent running
    applications.
  </description>
</property>


Probably YARN cannot allocate memory for the AM container. The default value
is 0.1, and the Spark AM needs 896 MB (0.1 gives just 393 MB, which is not
enough).

On Tue, Mar 29, 2016 at 2:35 PM, Vineet Mishra <clearmido...@gmail.com>
wrote:

> Yarn seems to be running fine, I have successful MR jobs completed on the
> same,
>
> *Cluster Metrics*
> *Apps Submitted Apps Pending Apps Running Apps Completed Containers
> Running Memory Used Memory Total Memory Reserved VCores Used VCores Total
> VCores Reserved Active Nodes Decommissioned Nodes Lost Nodes Unhealthy
> Nodes Rebooted Nodes*
> *1 0 0 1 0 0 B 8 GB 0 B 0 8 0 1 0 0 0 0*
> *User Metrics for dr.who*
> *Apps Submitted Apps Pending Apps Running Apps Completed Containers
> Running Containers Pending Containers Reserved Memory Used Memory Pending
> Memory Reserved VCores Used VCores Pending VCores Reserved*
> *0 0 0 1 0 0 0 0 B 0 B 0 B 0 0 0*
> *Show  entriesSearch: *
> *ID*
> *User*
> *Name*
> *Application Type*
> *Queue*
> *StartTime*
> *FinishTime*
> *State*
> *FinalStatus*
> *Progress*
> *Tracking UI*
> *application_1459287061048_0001 myhost word count MAPREDUCE root.myhost
> Tue, 29 Mar 2016 21:31:39 GMT Tue, 29 Mar 2016 21:31:59 GMT FINISHED
> SUCCEEDED *
> *History*
>
> On Wed, Mar 30, 2016 at 2:52 AM, Alexander Pivovarov <apivova...@gmail.com
> > wrote:
>
>> check resource manager and node manager logs.
>> Maybe you find smth explaining why 1 app is pending
>>
>> do you have any app run successfully? *Apps Completed is 0 on the UI*
>>
>>
>> On Tue, Mar 29, 2016 at 2:13 PM, Vineet Mishra <clearmido...@gmail.com>
>> wrote:
>>
>>> Hi Alex/Surendra,
>>>
>>> Hadoop is up and running fine and I am able to run example on the same.
>>>
>>> *Cluster Metrics*
>>> *Apps Submitted Apps Pending Apps Running Apps Completed Containers
>>> Running Memory Used Memory Total Memory Reserved VCores Used VCores Total
>>> VCores Reserved Active Nodes Decommissioned Nodes Lost Nodes Unhealthy
>>> Nodes Rebooted Nodes*
>>> *1 1 0 0 0 0 B 3.93 GB 0 B 0 4 0 1 0 0 0 0*
>>> *User Metrics for dr.who*
>>> *Apps Submitted Apps Pending Apps Running Apps Completed Containers
>>> Running Containers Pending Containers Reserved Memory Used Memory Pending
>>> Memory Reserved VCores Used VCores Pending VCores Reserved*
>>> *0 1 0 0 0 0 0 0 B 0 B 0 B 0 0 0*
>>>
>>> Any Other trace?
>>>
>>> On Wed, Mar 30, 2016 at 2:31 AM, Alexander Pivovarov <
>>> apivova...@gmail.com> wrote:
>>>
>>>> check 8088 ui
>>>> - how many cores and memory available
>>>> - how many slaves are active
>>>>
>>>> run teragen or pi from hadoop examples to make sure that yarn works
>>>>
>>>> On Tue, Mar 29, 2016 at 1:25 PM, Surendra , Manchikanti <
>>>> surendra.manchika...@gmail.com> wrote:
>>>>
>>>>> Hi Vineeth,
>>>>>
>>>>> Can you please check resource(RAM,Cores) availability in your local
>>>>> cluster, And change accordingly.
>>>>>
>>>>> Regards,
>>>>> Surendra M
>>>>>
>>>>> -- Surendra Manchikanti
>>>>>
>>>>> On Tue, Mar 29, 2016 at 1:15 PM, Vineet Mishra <clearmido...@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> While starting Spark on Yarn on local cluster(Single Node Hadoop 2.6
>>>>>> yarn) I am facing some issues.
>>>>>>
>>>>>> As I try to start the Spark Shell it keeps on iterating in an endless
>>>>>> loop while initiating,
>>>>>>
>>>>>> *6/03/30 01:32:38 DEBUG ipc.Client: IPC Client (1782965120)
>>>>>> connection to myhost/192.168.1.108:8032 <http://192.168.1.108:8032> from
>>>>>> myhost sending #11971*
>>>>>> *16/03/30 01:32:38 DEBUG ipc.Client: IPC Client (1782965120)
>>>>>> connection to myhost/192.168.1.108:8032 <http://192.168.1.108:8032> from
>>>>>> myhost got value #11971*

Re: Running Spark on Yarn

2016-03-29 Thread Alexander Pivovarov
Check the resource manager and node manager logs.
Maybe you'll find something explaining why 1 app is pending.

Do you have any app that ran successfully? *Apps Completed is 0 on the UI*


On Tue, Mar 29, 2016 at 2:13 PM, Vineet Mishra <clearmido...@gmail.com>
wrote:

> Hi Alex/Surendra,
>
> Hadoop is up and running fine and I am able to run example on the same.
>
> *Cluster Metrics*
> *Apps Submitted Apps Pending Apps Running Apps Completed Containers
> Running Memory Used Memory Total Memory Reserved VCores Used VCores Total
> VCores Reserved Active Nodes Decommissioned Nodes Lost Nodes Unhealthy
> Nodes Rebooted Nodes*
> *1 1 0 0 0 0 B 3.93 GB 0 B 0 4 0 1 0 0 0 0*
> *User Metrics for dr.who*
> *Apps Submitted Apps Pending Apps Running Apps Completed Containers
> Running Containers Pending Containers Reserved Memory Used Memory Pending
> Memory Reserved VCores Used VCores Pending VCores Reserved*
> *0 1 0 0 0 0 0 0 B 0 B 0 B 0 0 0*
>
> Any Other trace?
>
> On Wed, Mar 30, 2016 at 2:31 AM, Alexander Pivovarov <apivova...@gmail.com
> > wrote:
>
>> check 8088 ui
>> - how many cores and memory available
>> - how many slaves are active
>>
>> run teragen or pi from hadoop examples to make sure that yarn works
>>
>> On Tue, Mar 29, 2016 at 1:25 PM, Surendra , Manchikanti <
>> surendra.manchika...@gmail.com> wrote:
>>
>>> Hi Vineeth,
>>>
>>> Can you please check resource(RAM,Cores) availability in your local
>>> cluster, And change accordingly.
>>>
>>> Regards,
>>> Surendra M
>>>
>>> -- Surendra Manchikanti
>>>
>>> On Tue, Mar 29, 2016 at 1:15 PM, Vineet Mishra <clearmido...@gmail.com>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> While starting Spark on Yarn on local cluster(Single Node Hadoop 2.6
>>>> yarn) I am facing some issues.
>>>>
>>>> As I try to start the Spark Shell it keeps on iterating in an endless
>>>> loop while initiating,
>>>>
>>>> *6/03/30 01:32:38 DEBUG ipc.Client: IPC Client (1782965120) connection
>>>> to myhost/192.168.1.108:8032 <http://192.168.1.108:8032> from myhost
>>>> sending #11971*
>>>> *16/03/30 01:32:38 DEBUG ipc.Client: IPC Client (1782965120) connection
>>>> to myhost/192.168.1.108:8032 <http://192.168.1.108:8032> from myhost got
>>>> value #11971*
>>>> *16/03/30 01:32:38 DEBUG ipc.ProtobufRpcEngine: Call:
>>>> getApplicationReport took 1ms*
>>>> *16/03/30 01:32:38 INFO yarn.Client: Application report for
>>>> application_1459260674306_0003 (state: ACCEPTED)*
>>>> *16/03/30 01:32:38 DEBUG yarn.Client: *
>>>> * client token: N/A*
>>>> * diagnostics: N/A*
>>>> * ApplicationMaster host: N/A*
>>>> * ApplicationMaster RPC port: -1*
>>>> * queue: root.thequeue*
>>>> * start time: 1459269797431*
>>>> * final status: UNDEFINED*
>>>> * tracking URL:
>>>> http://myhost:8088/proxy/application_1459260674306_0003/
>>>> <http://myhost:8088/proxy/application_1459260674306_0003/>*
>>>> * user: myhost*
>>>>
>>>> *16/03/30 01:45:07 DEBUG ipc.Client: IPC Client (101088744) connection
>>>> to myhost/192.168.1.108:8032 <http://192.168.1.108:8032> from myhost
>>>> sending #338*
>>>> *16/03/30 01:45:07 DEBUG ipc.Client: IPC Client (101088744) connection
>>>> to myhost/192.168.1.108:8032 <http://192.168.1.108:8032> from myhost got
>>>> value #338*
>>>> *16/03/30 01:45:07 DEBUG ipc.ProtobufRpcEngine: Call:
>>>> getApplicationReport took 2ms*
>>>> *16/03/30 01:45:08 DEBUG ipc.Client: IPC Client (101088744) connection
>>>> to myhost/192.168.1.108:8032 <http://192.168.1.108:8032> from myhost
>>>> sending #339*
>>>> *16/03/30 01:45:08 DEBUG ipc.Client: IPC Client (101088744) connection
>>>> to myhost/192.168.1.108:8032 <http://192.168.1.108:8032> from myhost got
>>>> value #339*
>>>> *16/03/30 01:45:08 DEBUG ipc.ProtobufRpcEngine: Call:
>>>> getApplicationReport took 2ms*
>>>> *16/03/30 01:45:09 DEBUG ipc.Client: IPC Client (101088744) connection
>>>> to myhost/192.168.1.108:8032 <http://192.168.1.108:8032> from myhost
>>>> sending #340*
>>>> *16/03/30 01:45:09 DEBUG ipc.Client: IPC Client (101088744) connection
>>>> to myhost/192.168.1.108:8032 <http://192.168.1.108:8032> from myhost got
>>>> value #340*
>>>> *16/03/30 01:45:09 DEBUG ipc.ProtobufRpcEngine: Call:
>>>> getApplicationReport took 2ms*
>>>> *16/03/30 01:45:10 DEBUG ipc.Client: IPC Client (101088744) connection
>>>> to myhost/192.168.1.108:8032 <http://192.168.1.108:8032> from myhost
>>>> sending #341*
>>>> *16/03/30 01:45:10 DEBUG ipc.Client: IPC Client (101088744) connection
>>>> to myhost/192.168.1.108:8032 <http://192.168.1.108:8032> from myhost got
>>>> value #341*
>>>> *16/03/30 01:45:10 DEBUG ipc.ProtobufRpcEngine: Call:
>>>> getApplicationReport took 1ms*
>>>>
>>>> Any leads would be appreciated.
>>>>
>>>> Thanks!
>>>>
>>>
>>>
>>
>


Re: Running Spark on Yarn

2016-03-29 Thread Alexander Pivovarov
check 8088 ui
- how many cores and memory available
- how many slaves are active

run teragen or pi from hadoop examples to make sure that yarn works

On Tue, Mar 29, 2016 at 1:25 PM, Surendra , Manchikanti <
surendra.manchika...@gmail.com> wrote:

> Hi Vineeth,
>
> Can you please check resource(RAM,Cores) availability in your local
> cluster, And change accordingly.
>
> Regards,
> Surendra M
>
> -- Surendra Manchikanti
>
> On Tue, Mar 29, 2016 at 1:15 PM, Vineet Mishra 
> wrote:
>
>> Hi All,
>>
>> While starting Spark on Yarn on local cluster(Single Node Hadoop 2.6
>> yarn) I am facing some issues.
>>
>> As I try to start the Spark Shell it keeps on iterating in an endless loop
>> while initiating,
>>
>> *6/03/30 01:32:38 DEBUG ipc.Client: IPC Client (1782965120) connection to
>> myhost/192.168.1.108:8032  from myhost sending
>> #11971*
>> *16/03/30 01:32:38 DEBUG ipc.Client: IPC Client (1782965120) connection
>> to myhost/192.168.1.108:8032  from myhost got
>> value #11971*
>> *16/03/30 01:32:38 DEBUG ipc.ProtobufRpcEngine: Call:
>> getApplicationReport took 1ms*
>> *16/03/30 01:32:38 INFO yarn.Client: Application report for
>> application_1459260674306_0003 (state: ACCEPTED)*
>> *16/03/30 01:32:38 DEBUG yarn.Client: *
>> * client token: N/A*
>> * diagnostics: N/A*
>> * ApplicationMaster host: N/A*
>> * ApplicationMaster RPC port: -1*
>> * queue: root.thequeue*
>> * start time: 1459269797431*
>> * final status: UNDEFINED*
>> * tracking URL: http://myhost:8088/proxy/application_1459260674306_0003/
>> *
>> * user: myhost*
>>
>> *16/03/30 01:45:07 DEBUG ipc.Client: IPC Client (101088744) connection to
>> myhost/192.168.1.108:8032  from myhost sending
>> #338*
>> *16/03/30 01:45:07 DEBUG ipc.Client: IPC Client (101088744) connection to
>> myhost/192.168.1.108:8032  from myhost got value
>> #338*
>> *16/03/30 01:45:07 DEBUG ipc.ProtobufRpcEngine: Call:
>> getApplicationReport took 2ms*
>> *16/03/30 01:45:08 DEBUG ipc.Client: IPC Client (101088744) connection to
>> myhost/192.168.1.108:8032  from myhost sending
>> #339*
>> *16/03/30 01:45:08 DEBUG ipc.Client: IPC Client (101088744) connection to
>> myhost/192.168.1.108:8032  from myhost got value
>> #339*
>> *16/03/30 01:45:08 DEBUG ipc.ProtobufRpcEngine: Call:
>> getApplicationReport took 2ms*
>> *16/03/30 01:45:09 DEBUG ipc.Client: IPC Client (101088744) connection to
>> myhost/192.168.1.108:8032  from myhost sending
>> #340*
>> *16/03/30 01:45:09 DEBUG ipc.Client: IPC Client (101088744) connection to
>> myhost/192.168.1.108:8032  from myhost got value
>> #340*
>> *16/03/30 01:45:09 DEBUG ipc.ProtobufRpcEngine: Call:
>> getApplicationReport took 2ms*
>> *16/03/30 01:45:10 DEBUG ipc.Client: IPC Client (101088744) connection to
>> myhost/192.168.1.108:8032  from myhost sending
>> #341*
>> *16/03/30 01:45:10 DEBUG ipc.Client: IPC Client (101088744) connection to
>> myhost/192.168.1.108:8032  from myhost got value
>> #341*
>> *16/03/30 01:45:10 DEBUG ipc.ProtobufRpcEngine: Call:
>> getApplicationReport took 1ms*
>>
>> Any leads would be appreciated.
>>
>> Thanks!
>>
>
>


Re: Testing spark with AWS spot instances

2016-03-27 Thread Alexander Pivovarov
I use spot instances for a 100-slave cluster (r3.2xlarge in us-west-1).
The jobs I run usually take about 15 hours - the cluster is stable and fast.
1-2 computers might be terminated, but that is a very rare event and Spark can
handle it.

On Fri, Mar 25, 2016 at 6:28 PM, Sven Krasser  wrote:

> When a spot instance terminates, you lose all data (RDD partitions) stored
> in the executors that ran on that instance. Spark can recreate the
> partitions from input data, but if that requires going through multiple
> preceding shuffles a good chunk of the job will need to be redone.
> -Sven
>
> On Thu, Mar 24, 2016 at 10:15 PM, Dillian Murphey  > wrote:
>
>> I'm very new to apache spark. I'm just a user not a developer.
>>
>> I'm running a cluster with many spot instances. Am I correct in
>> understanding that spark can handle an unlimited number of spot instance
>> failures and restarts?  Sometimes all the spot instances will dissapear
>> without warning, and then they come back.  Can I trust spark to pickup all
>> jobs where it left off?
>>
>> I'm noticing some instability with my system. I'm suspecting it could be
>> disk or RAM issues.  When I add a lot of slaves I run low on RAM on my
>> master.  Maybe that's part of the problem. But jut want to confirm my
>> understanding.
>>
>
>
>
> --
> www.skrasser.com 
>


Re: YARN process with Spark

2016-03-14 Thread Alexander Pivovarov
In Hadoop 2.5.1 on MapR 4.1.0, the virtual memory checker is disabled while
the physical memory checker is enabled by default.

Since CentOS/RHEL 6 allocates virtual memory aggressively due to OS behavior,
you should disable the virtual memory checker or increase
yarn.nodemanager.vmem-pmem-ratio to a relatively larger value.

https://www.mapr.com/blog/best-practices-yarn-resource-management
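
A sketch of the two yarn-site.xml alternatives described there (the ratio
value below is only an example of a "relatively larger value"):

  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>

  <!-- or keep the check and raise the ratio instead -->
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>5</value>
  </property>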

On Mon, Mar 14, 2016 at 3:36 AM, Steve Loughran <ste...@hortonworks.com>
wrote:

>
> On 11 Mar 2016, at 23:01, Alexander Pivovarov <apivova...@gmail.com>
> wrote:
>
> Forgot to mention. To avoid unnecessary container termination add the
> following setting to yarn
>
> yarn.nodemanager.vmem-check-enabled = false
>
>
> That can kill performance on a shared cluster: if your container code
> starts to swap, performance of everything suffers. A good ops team will
> decline such a request in a multi-tenant cluster.
>
> In such a cluster: ask for the amount of memory you think you actually
> need, and let the scheduler find space for it. This not only stops you
> killing cluster performance, it means that on a busy cluster, you get the
> same memory and CPU as you would on an idle one: so more consistent
> workloads. (and nobody else swapping your code out)
>
> regarding the numbers, people need to remember that if they are running
> python work in the cluster, they need to include more headroom.
>
> if you are going to turn off memory monitoring, have a play
> with yarn.nodemanager.pmem-check-enabled=false too
>


Re: Spark with Yarn Client

2016-03-11 Thread Alexander Pivovarov
Check doc - http://spark.apache.org/docs/latest/running-on-yarn.html

also you can start EMR-4.2.0 or 4.3.0 cluster with Spark app and see how
it's configured

On Fri, Mar 11, 2016 at 7:50 PM, Divya Gehlot 
wrote:

> Hi,
> I am trying to understand the behaviour/configuration of Spark with YARN
> client on a Hadoop cluster.
> Can somebody help me, or point me to documents/blogs/books which give a
> deeper understanding of the above two?
> Thanks,
> Divya
>


Re: YARN process with Spark

2016-03-11 Thread Alexander Pivovarov
you need to set

yarn.scheduler.minimum-allocation-mb=32

otherwise the Spark AM container will run on a dedicated box instead of
running together with an executor container on one of the boxes.

For slaves I use Amazon EC2 r3.2xlarge boxes (61GB / 8 cores) - cost ~$0.10 /
hour (spot instance).



On Fri, Mar 11, 2016 at 3:17 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Thanks Koert and Alexander
>
> I think the YARN configuration parameters in yarn-site.xml are important.
> For those I have
>
>
> <property>
>   <name>yarn.nodemanager.resource.memory-mb</name>
>   <description>Amount of max physical memory, in MB, that can be allocated
> for YARN containers.</description>
>   <value>8192</value>
> </property>
> <property>
>   <name>yarn.nodemanager.vmem-pmem-ratio</name>
>   <description>Ratio between virtual memory to physical memory when
> setting memory limits for containers</description>
>   <value>2.1</value>
> </property>
> <property>
>   <name>yarn.scheduler.maximum-allocation-mb</name>
>   <description>Maximum memory for each container</description>
>   <value>8192</value>
> </property>
> <property>
>   <name>yarn.scheduler.minimum-allocation-mb</name>
>   <description>Minimum memory for each container</description>
>   <value>2048</value>
> </property>
>
> However, I noticed that you Alexander have the following settings
>
> yarn.nodemanager.resource.memory-mb = 54272
> yarn.scheduler.maximum-allocation-mb = 54272
>
> With 8 Spark executor cores that gives you 6GB of memory per core. As a
> matter of interest how much memory and how many cores do you have for each
> node?
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 11 March 2016 at 23:01, Alexander Pivovarov <apivova...@gmail.com>
> wrote:
>
>> Forgot to mention. To avoid unnecessary container termination add the
>> following setting to yarn
>>
>> yarn.nodemanager.vmem-check-enabled = false
>>
>>
>


Re: YARN process with Spark

2016-03-11 Thread Alexander Pivovarov
Forgot to mention. To avoid unnecessary container termination add the
following setting to yarn

yarn.nodemanager.vmem-check-enabled = false


Re: YARN process with Spark

2016-03-11 Thread Alexander Pivovarov
YARN cores are virtual cores which are used just to calculate available
resources. But usually memory is used to manage yarn resources (not cores)

spark executor memory should be ~90% of  yarn.scheduler.maximum-allocation-mb
(which should be the same as yarn.nodemanager.resource.memory-mb)
~10% should go to executor memory overhead

e.g. for r3.2xlarge slave (61GB / 8 cores) the setting are the following
(on EMR-4.2.0)

yarn settings:
yarn.nodemanager.resource.memory-mb = 54272
yarn.scheduler.maximum-allocation-mb = 54272

spark settings:
spark.executor.memory = 47924M
spark.yarn.executor.memoryOverhead = 5324
spark.executor.cores = 8// cores available on each slave

1024M of YARN memory is reserved on each box to run Spark AM container(s) -
Spark AM container uses 896 MB of yarn memory (AM used in both client and
cluster mode)
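
A sketch of how those numbers map onto a spark-submit invocation (the class
and jar names are placeholders; on EMR the same values can also go into
spark-defaults.conf):

  spark-submit \
    --master yarn \
    --executor-memory 47924m \
    --executor-cores 8 \
    --conf spark.yarn.executor.memoryOverhead=5324 \
    --class com.example.MyApp my-app.jar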



On Fri, Mar 11, 2016 at 2:08 PM, Koert Kuipers  wrote:

> you get a spark executor per yarn container. the spark executor can have
> multiple cores, yes. this is configurable. so the number of partitions that
> can be processed in parallel is num-executors * executor-cores. and for
> processing a partition the available memory is executor-memory /
> executor-cores (roughly, cores can of course borrow memory from each other
> within executor).
>
> the relevant setting for spark-submit are:
>  --executor-memory
>  --executor-cores
>  --num-executors
>
> On Fri, Mar 11, 2016 at 4:58 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> Can these be clarified please
>>
>>
>>1. Can a YARN container use more than one core, and is this
>>configurable?
>>2. A YARN container is constrained to 8GB by
>>"yarn.scheduler.maximum-allocation-mb". If a YARN container is a Spark
>>process, will that limit also include the memory Spark is going to be using?
>>
>> Thanks,
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>
>


Re: Is there Graph Partitioning impl for Scala/Spark?

2016-03-11 Thread Alexander Pivovarov
JUNG library has 4 Community Detection (Community Structure) algorithms
implemented including Girvan–Newman algorithm
(EdgeBetweennessClusterer.java)

https://github.com/jrtom/jung/tree/master/jung-algorithms/src/main/java/edu/uci/ics/jung/algorithms/cluster

Girvan–Newman algorithm paper
http://www.santafe.edu/media/workingpapers/01-12-077.pdf
---

iGraph has 7 algorithms implemented, including InfoMap and Louvain, but the
lib is written in C/C++:
http://www.r-bloggers.com/summary-of-community-detection-algorithms-in-igraph-0-6/

On Wed, Mar 9, 2016 at 10:40 PM, Alexander Pivovarov <apivova...@gmail.com>
wrote:

> Is there Graph Partitioning impl (e.g. Spectral ) which can be used in
> Spark?
> I guess it should be at least java/scala lib
> Maybe even tuned to work with GraphX
>


Re: Graphx

2016-03-11 Thread Alexander Pivovarov
we use it in prod

70 boxes, 61GB RAM each

GraphX Connected Components works fine on 250M Vertices and 1B Edges (takes
about 5-10 min)

Spark likes memory, so use r3.2xlarge boxes (61GB)
For example 10 x r3.2xlarge (61GB) work much faster than 20 x r3.xlarge
(30.5 GB) (especially if you have skewed data)

Also, use checkpoints before and after Connected Components to reduce DAG
delays

You can also try to enable Kryo and register classes used in RDD


On Fri, Mar 11, 2016 at 8:07 AM, John Lilley 
wrote:

> I suppose for a 2.6bn case we’d need Long:
>
>
>
> public class GenCCInput {
>   public static void main(String[] args) {
>     if (args.length != 2) {
>       System.err.println("Usage: \njava GenCCInput <edges> <groupSize>");
>       System.exit(-1);
>     }
>     long edges = Long.parseLong(args[0]);
>     long groupSize = Long.parseLong(args[1]);
>     long currentEdge = 1;
>     long currentGroupSize = 0;
>     for (long i = 0; i < edges; i++) {
>       System.out.println(currentEdge + " " + (currentEdge + 1));
>       if (currentGroupSize == 0) {
>         currentGroupSize = 2;
>       } else {
>         currentGroupSize++;
>       }
>       if (currentGroupSize >= groupSize) {
>         currentGroupSize = 0;
>         currentEdge += 2;
>       } else {
>         currentEdge++;
>       }
>     }
>   }
> }
>
>
>
> *John Lilley*
>
> Chief Architect, RedPoint Global Inc.
>
> T: +1 303 541 1516  *| *M: +1 720 938 5761 *|* F: +1 781-705-2077
>
> Skype: jlilley.redpoint *|* john.lil...@redpoint.net *|* www.redpoint.net
>
>
>
> *From:* John Lilley [mailto:john.lil...@redpoint.net]
> *Sent:* Friday, March 11, 2016 8:46 AM
> *To:* Ovidiu-Cristian MARCU 
> *Cc:* lihu ; Andrew A ;
> u...@spark.incubator.apache.org; Geoff Thompson <
> geoff.thomp...@redpoint.net>
> *Subject:* RE: Graphx
>
>
>
> Ovidiu,
>
>
>
> IMHO, this is one of the biggest issues facing GraphX and Spark.  There
> are a lot of knobs and levers to pull to affect performance, with very
> little guidance about which settings work in general.  We cannot ship
> software that requires end-user tuning; it just has to work.  Unfortunately
> GraphX seems very sensitive to working set size relative to available RAM
> and fails catastrophically as opposed to gracefully when working set is too
> large.  It is also very sensitive to the nature of the data.  For example,
> if we build a test file with input-edge representation like:
>
> 1 2
>
> 2 3
>
> 3 4
>
> 5 6
>
> 6 7
>
> 7 8
>
> …
>
> this represents a graph with connected components in groups of four.  We
> found experimentally that when this data is input in clustered order, the
> required memory is lower and runtime is much faster than when data is input
> in random order.  This makes intuitive sense because of the additional
> communication required for the random order.
>
>
>
> Our 1bn-edge test case was of this same form, input in clustered order,
> with groups of 10 vertices per component.  It failed at 8 x 60GB.  This is
> the kind of data that our application processes, so it is a realistic test
> for us.  I’ve found that social media test data sets tend to follow
> power-law distributions, and that GraphX has much less problem with them.
>
>
>
> A comparable test scaled to your cluster (16 x 80GB) would be 2.6bn edges
> in 10-vertex components using the synthetic test input I describe above.  I
> would be curious to know if this works and what settings you use to
> succeed, and if it continues to succeed for random input order.
>
>
>
> As for the C++ algorithm, it scales multi-core.  It exhibits O(N^2)
> behavior for large data sets, but it processes the 1bn-edge case on a
> single 60GB node in about 20 minutes.  It degrades gracefully along the
> O(N^2) curve and additional memory reduces time.
>
>
>
> *John Lilley*
>
>
>
> *From:* Ovidiu-Cristian MARCU [mailto:ovidiu-cristian.ma...@inria.fr
> ]
> *Sent:* Friday, March 11, 2016 8:14 AM
> *To:* John Lilley 
> *Cc:* lihu ; Andrew A ;
> u...@spark.incubator.apache.org
> *Subject:* Re: Graphx
>
>
>
> Hi,
>
>
>
> I wonder what version of Spark and different parameter configuration you
> used.
>
> I was able to run CC for 1.8bn edges in about 8 minutes (23 iterations)
> using 16 nodes with around 80GB RAM each (Spark 1.5, default parameters)
>
> John: I suppose your C++ app (algorithm) does not scale if you used only
> one node.
>
> I don’t understand how RDD’s serialization is taking excessive time,
> compared to the total time or other expected time?
>
>
>
> For the different RDD times you have events, the UI console, and a bunch of
> papers describing how to measure different things. lihu: did you use some
> incomplete tool, or what are you looking for?
>
>
>
> Best,
>
> Ovidiu
>
>
>
> On 11 Mar 2016, at 16:02, John 

Re: EMR 4.3.0 spark 1.6 shell problem

2016-03-01 Thread Alexander Pivovarov
EMR-4.3.0 and Spark-1.6.0 work fine for me.
I use r3.2xlarge boxes (spot); even 3 slave boxes work fine.

I use the following settings (in json)

[
  {
"Classification": "spark-defaults",
"Properties": {
  "spark.driver.extraJavaOptions": "-Dfile.encoding=UTF-8",
  "spark.executor.extraJavaOptions": "-Dfile.encoding=UTF-8"
}
  },
  {
"Classification": "spark",
"Properties": {
  "maximizeResourceAllocation": "true"
}
  },
  {
"Classification": "spark-log4j",
"Properties": {
  "log4j.logger.com.amazon": "WARN",
  "log4j.logger.com.amazonaws": "WARN",
  "log4j.logger.amazon.emr": "WARN",
  "log4j.logger.akka": "WARN"
}
  }
]


BTW, Oleg, you do not need to cd to /usr/bin.

Just ssh to the master box as hadoop and type
$ spark-shell

You can watch how Spark works in the web UI on port 8088.

On Tue, Mar 1, 2016 at 10:38 AM, Daniel Siegmann <
daniel.siegm...@teamaol.com> wrote:

> How many core nodes does your cluster have?
>
> On Tue, Mar 1, 2016 at 4:15 AM, Oleg Ruchovets 
> wrote:
>
>> Hi, I installed EMR 4.3.0 with Spark. I tried to enter the spark shell,
>> but it looks like it doesn't work and throws exceptions.
>> Please advise:
>>
>> [hadoop@ip-172-31-39-37 conf]$ cd  /usr/bin/
>> [hadoop@ip-172-31-39-37 bin]$ ./spark-shell
>> OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512M;
>> support was removed in 8.0
>> 16/03/01 09:11:48 INFO SecurityManager: Changing view acls to: hadoop
>> 16/03/01 09:11:48 INFO SecurityManager: Changing modify acls to: hadoop
>> 16/03/01 09:11:48 INFO SecurityManager: SecurityManager: authentication
>> disabled; ui acls disabled; users with view permissions: Set(hadoop); users
>> with modify permissions: Set(hadoop)
>> 16/03/01 09:11:49 INFO HttpServer: Starting HTTP Server
>> 16/03/01 09:11:49 INFO Utils: Successfully started service 'HTTP class
>> server' on port 47223.
>> Welcome to
>>       ____              __
>>      / __/__  ___ _____/ /__
>>     _\ \/ _ \/ _ `/ __/  '_/
>>    /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
>>       /_/
>>
>> Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.8.0_71)
>> Type in expressions to have them evaluated.
>> Type :help for more information.
>> 16/03/01 09:11:53 INFO SparkContext: Running Spark version 1.6.0
>> 16/03/01 09:11:53 INFO SecurityManager: Changing view acls to: hadoop
>> 16/03/01 09:11:53 INFO SecurityManager: Changing modify acls to: hadoop
>> 16/03/01 09:11:53 INFO SecurityManager: SecurityManager: authentication
>> disabled; ui acls disabled; users with view permissions: Set(hadoop); users
>> with modify permissions: Set(hadoop)
>> 16/03/01 09:11:54 INFO Utils: Successfully started service 'sparkDriver'
>> on port 52143.
>> 16/03/01 09:11:54 INFO Slf4jLogger: Slf4jLogger started
>> 16/03/01 09:11:54 INFO Remoting: Starting remoting
>> 16/03/01 09:11:54 INFO Remoting: Remoting started; listening on addresses
>> :[akka.tcp://sparkDriverActorSystem@172.31.39.37:42989]
>> 16/03/01 09:11:54 INFO Utils: Successfully started service
>> 'sparkDriverActorSystem' on port 42989.
>> 16/03/01 09:11:54 INFO SparkEnv: Registering MapOutputTracker
>> 16/03/01 09:11:54 INFO SparkEnv: Registering BlockManagerMaster
>> 16/03/01 09:11:54 INFO DiskBlockManager: Created local directory at
>> /mnt/tmp/blockmgr-afaf0e7f-086e-49f1-946d-798e605a3fdc
>> 16/03/01 09:11:54 INFO MemoryStore: MemoryStore started with capacity
>> 518.1 MB
>> 16/03/01 09:11:55 INFO SparkEnv: Registering OutputCommitCoordinator
>> 16/03/01 09:11:55 INFO Utils: Successfully started service 'SparkUI' on
>> port 4040.
>> 16/03/01 09:11:55 INFO SparkUI: Started SparkUI at
>> http://172.31.39.37:4040
>> 16/03/01 09:11:55 INFO RMProxy: Connecting to ResourceManager at /
>> 172.31.39.37:8032
>> 16/03/01 09:11:55 INFO Client: Requesting a new application from cluster
>> with 2 NodeManagers
>> 16/03/01 09:11:55 INFO Client: Verifying our application has not
>> requested more than the maximum memory capability of the cluster (11520 MB
>> per container)
>> 16/03/01 09:11:55 INFO Client: Will allocate AM container, with 896 MB
>> memory including 384 MB overhead
>> 16/03/01 09:11:55 INFO Client: Setting up container launch context for
>> our AM
>> 16/03/01 09:11:55 INFO Client: Setting up the launch environment for our
>> AM container
>> 16/03/01 09:11:55 INFO Client: Preparing resources for our AM container
>> 16/03/01 09:11:56 INFO Client: Uploading resource
>> file:/usr/lib/spark/lib/spark-assembly-1.6.0-hadoop2.7.1-amzn-0.jar ->
>> hdfs://
>> 172.31.39.37:8020/user/hadoop/.sparkStaging/application_1456818849676_0005/spark-assembly-1.6.0-hadoop2.7.1-amzn-0.jar
>> 16/03/01 09:11:56 INFO MetricsSaver: MetricsConfigRecord
>> disabledInCluster: false instanceEngineCycleSec: 60 clusterEngineCycleSec:
>> 60 disableClusterEngine: false maxMemoryMb: 3072 maxInstanceCount: 500
>> lastModified: 1456818856695
>> 16/03/01 09:11:56 INFO MetricsSaver: Created MetricsSaver
>> 

Re: Spark Integration Patterns

2016-02-29 Thread Alexander Pivovarov
There is a spark-jobserver (SJS) which is REST interface for spark and
spark-sql
you can deploy your jar file with Jobs impl to spark-jobserver
and use rest API to submit jobs in synch or async mode
in sync mode you need to poll SJS to get job result
job result might be actual data in json or path on s3 / hdfs with the data

There is an instruction on how to start job-server on AWS EMR and submit
simple workdcount job using culr
https://github.com/spark-jobserver/spark-jobserver/blob/master/doc/EMR.md
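
For reference, a job implementation for the classic (pre-0.7) job-server API
looks roughly like the word-count example in its docs; treat this as a sketch
of the shape of a job rather than an exact, version-accurate listing:

  import com.typesafe.config.Config
  import org.apache.spark.SparkContext
  import spark.jobserver.{SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}
  import scala.util.Try

  object WordCountJob extends SparkJob {
    // reject the request early if the expected config key is missing
    override def validate(sc: SparkContext, config: Config): SparkJobValidation =
      Try(config.getString("input.string"))
        .map(_ => SparkJobValid)
        .getOrElse(SparkJobInvalid("No input.string config param"))

    // the return value is what the REST caller gets back (sync) or polls for (async)
    override def runJob(sc: SparkContext, config: Config): Any =
      sc.parallelize(config.getString("input.string").split(" ").toSeq).countByValue()
  }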

On Mon, Feb 29, 2016 at 12:54 PM, skaarthik oss 
wrote:

> Check out http://toree.incubator.apache.org/. It might help with your
> need.
>
>
>
> *From:* moshir mikael [mailto:moshir.mik...@gmail.com]
> *Sent:* Monday, February 29, 2016 5:58 AM
> *To:* Alex Dzhagriev 
> *Cc:* user 
> *Subject:* Re: Spark Integration Patterns
>
>
>
> Thanks, will check too, however : just want to use Spark core RDD and
> standard data sources.
>
>
>
> Le lun. 29 févr. 2016 à 14:54, Alex Dzhagriev  a écrit :
>
> Hi Moshir,
>
>
>
> Regarding the streaming, you can take a look at the spark streaming, the
> micro-batching framework. If it satisfies your needs it has a bunch of
> integrations. Thus, the source for the jobs could be Kafka, Flume or Akka.
>
>
>
> Cheers, Alex.
>
>
>
> On Mon, Feb 29, 2016 at 2:48 PM, moshir mikael 
> wrote:
>
> Hi Alex,
>
> thanks for the link. Will check it.
>
> Does someone know of a more streamlined approach ?
>
>
>
>
>
>
>
> Le lun. 29 févr. 2016 à 10:28, Alex Dzhagriev  a écrit :
>
> Hi Moshir,
>
>
>
> I think you can use the rest api provided with Spark:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/RestSubmissionServer.scala
>
>
>
> Unfortunately, I haven't find any documentation, but it looks fine.
>
> Thanks, Alex.
>
>
>
> On Sun, Feb 28, 2016 at 3:25 PM, mms  wrote:
>
> Hi, I cannot find a simple example showing how a typical application can
> 'connect' to a remote spark cluster and interact with it. Let's say I have
> a Python web application hosted somewhere *outside *a spark cluster, with
> just python installed on it. How can I talk to Spark without using a
> notebook, or using ssh to connect to a cluster master node ? I know of
> spark-submit and spark-shell, however forking a process on a remote host to
> execute a shell script seems like a lot of effort What are the recommended
> ways to connect and query Spark from a remote client ? Thanks Thx !
> --
>
> View this message in context: Spark Integration Patterns
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>
>
>
>
>
>


Re: DirectFileOutputCommiter

2016-02-26 Thread Alexander Pivovarov
DirectOutputCommitter doc says:
The FileOutputCommitter is required for HDFS + speculation, which allows
only one writer at
 a time for a file (so two people racing to write the same file would not
work). However, S3
 supports multiple writers outputting to the same file, where visibility is
guaranteed to be
 atomic. This is a monotonic operation: all writers should be writing the
same data, so which
 one wins is immaterial.

The AWS impl is better because it uses the DirectFileOutputCommitter only for
s3n:// files:
https://gist.github.com/apivovarov/bb215f08318318570567

But for some reason it does not work for me.
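
For context, such a committer is wired in through the old-API Hadoop job conf.
A sketch of the spark-defaults.conf entry, assuming a committer class like the
ones in the gists above is on the driver and executor classpath (the class
name below follows the fork location mentioned later in this thread):

  spark.hadoop.mapred.output.committer.class  org.apache.spark.mapred.DirectOutputCommitter

  # Spark 1.x has a separate option for parquet output,
  # spark.sql.parquet.output.committer.class (its committer class lives in a
  # package that differs across 1.x versions)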

On Fri, Feb 26, 2016 at 11:50 AM, Reynold Xin  wrote:

> It could lose data in speculation mode, or if any job fails.
>
> On Fri, Feb 26, 2016 at 3:45 AM, Igor Berman 
> wrote:
>
>> Takeshi, do you know the reason why they wanted to remove this commiter
>> in SPARK-10063?
>> the jira has no info inside
>> as far as I understand the direct committer can't be used when either of
>> two is true
>> 1. speculation mode
>> 2. append mode(ie. not creating new version of data but appending to
>> existing data)
>>
>> On 26 February 2016 at 08:24, Takeshi Yamamuro 
>> wrote:
>>
>>> Hi,
>>>
>>> Great work!
>>> What is the concrete performance gain of the committer on s3?
>>> I'd like to know.
>>>
>>> I think there is no direct committer for files because these kinds of
>>> committer has risks
>>> to loss data (See: SPARK-10063).
>>> Until this resolved, ISTM files cannot support direct commits.
>>>
>>> thanks,
>>>
>>>
>>>
>>> On Fri, Feb 26, 2016 at 8:39 AM, Teng Qiu  wrote:
>>>
 yes, should be this one
 https://gist.github.com/aarondav/c513916e72101bbe14ec

 then need to set it in spark-defaults.conf :
 https://github.com/zalando/spark/commit/3473f3f1ef27830813c1e0b3686e96a55f49269c#diff-f7a46208be9e80252614369be6617d65R13

 Am Freitag, 26. Februar 2016 schrieb Yin Yang :
 > The header of DirectOutputCommitter.scala says Databricks.
 > Did you get it from Databricks ?
 > On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu  wrote:
 >>
 >> interesting in this topic as well, why the DirectFileOutputCommitter
 not included?
 >> we added it in our fork,
 under 
 core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala
 >> moreover, this DirectFileOutputCommitter is not working for the
 insert operations in HiveContext, since the Committer is called by hive
 (means uses dependencies in hive package)
 >> we made some hack to fix this, you can take a look:
 >>
 https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
 >>
 >> may bring some ideas to other spark contributors to find a better
 way to use s3.
 >>
 >> 2016-02-22 23:18 GMT+01:00 igor.berman :
 >>>
 >>> Hi,
 >>> Wanted to understand if anybody uses DirectFileOutputCommitter or
 alikes
 >>> especially when working with s3?
 >>> I know that there is one impl in spark distro for parquet format,
 but not
 >>> for files -  why?
 >>>
 >>> Imho, it can bring huge performance boost.
 >>> Using default FileOutputCommiter with s3 has big overhead at commit
 stage
 >>> when all parts are copied one-by-one to destination dir from
 _temporary,
 >>> which is bottleneck when number of partitions is high.
 >>>
 >>> Also, wanted to know if there are some problems when using
 >>> DirectFileOutputCommitter?
 >>> If writing one partition directly will fail in the middle is spark
 will
 >>> notice this and will fail job(say after all retries)?
 >>>
 >>> thanks in advance
 >>>
 >>>
 >>>
 >>>
 >>> --
 >>> View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/DirectFileOutputCommiter-tp26296.html
 >>> Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
 >>>
 >>>
 -
 >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 >>> For additional commands, e-mail: user-h...@spark.apache.org
 >>>
 >>
 >
 >

>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>


Re: DirectFileOutputCommiter

2016-02-26 Thread Alexander Pivovarov
Amazon uses the following impl
https://gist.github.com/apivovarov/bb215f08318318570567
But for some reason Spark shows an error at the end of the job:

16/02/26 08:16:54 INFO scheduler.DAGScheduler: ResultStage 0
(saveAsTextFile at :28) finished in 14.305 s
16/02/26 08:16:54 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose
tasks have all completed, from pool
16/02/26 08:16:54 INFO scheduler.DAGScheduler: Job 0 finished:
saveAsTextFile at :28, took 14.467271 s
java.io.FileNotFoundException: File s3n://my-backup/test/test1/_temporary/0
does not exist.
at
org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:564)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1485)
at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1525)
at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.getAllCommittedTaskPaths(FileOutputCommitter.java:269)
at
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:309)
at
org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
at org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:112)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1214)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1156)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1156)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)


Another implementation works fine
https://gist.github.com/aarondav/c513916e72101bbe14ec

On Thu, Feb 25, 2016 at 10:24 PM, Takeshi Yamamuro 
wrote:

> Hi,
>
> Great work!
> What is the concrete performance gain of the committer on s3?
> I'd like to know.
>
> I think there is no direct committer for files because these kinds of
> committer has risks
> to loss data (See: SPARK-10063).
> Until this resolved, ISTM files cannot support direct commits.
>
> thanks,
>
>
>
> On Fri, Feb 26, 2016 at 8:39 AM, Teng Qiu  wrote:
>
>> yes, should be this one
>> https://gist.github.com/aarondav/c513916e72101bbe14ec
>>
>> then need to set it in spark-defaults.conf :
>> https://github.com/zalando/spark/commit/3473f3f1ef27830813c1e0b3686e96a55f49269c#diff-f7a46208be9e80252614369be6617d65R13
>>
>> Am Freitag, 26. Februar 2016 schrieb Yin Yang :
>> > The header of DirectOutputCommitter.scala says Databricks.
>> > Did you get it from Databricks ?
>> > On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu  wrote:
>> >>
>> >> interesting in this topic as well, why the DirectFileOutputCommitter
>> not included?
>> >> we added it in our fork,
>> under core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala
>> >> moreover, this DirectFileOutputCommitter is not working for the insert
>> operations in HiveContext, since the Committer is called by hive (means
>> uses dependencies in hive package)
>> >> we made some hack to fix this, you can take a look:
>> >>
>> https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
>> >>
>> >> may bring some ideas to other spark contributors to find a better way
>> to use s3.
>> >>
>> >> 2016-02-22 23:18 GMT+01:00 igor.berman :
>> >>>
>> >>> Hi,
>> >>> Wanted to understand if anybody uses DirectFileOutputCommitter or
>> alikes
>> >>> especially when working with s3?
>> >>> I know that there is one impl in spark distro for parquet format, but
>> not
>> >>> for files -  why?
>> >>>
>> >>> Imho, it can bring huge performance boost.
>> >>> Using default FileOutputCommiter with s3 has big overhead at commit
>> stage
>> >>> when all parts are copied one-by-one to destination dir from
>> _temporary,
>> >>> which is bottleneck when number of partitions is high.
>> >>>
>> >>> Also, wanted to know if there are some problems when using
>> >>> DirectFileOutputCommitter?
>> >>> If writing one partition directly will fail in the middle is spark
>> will
>> >>> notice this and will fail job(say after all retries)?
>> >>>
>> >>> thanks in advance
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/DirectFileOutputCommiter-tp26296.html
>> >>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>> >>>
>> >>> -
>> >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >>> For additional commands, e-mail: user-h...@spark.apache.org
>> >>>
>> >>
>> >
>> >
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
>


Re: Difference between spark-shell and spark-submit.Which one to use when ?

2016-02-14 Thread Alexander Pivovarov
Consider streaming for real time cases
http://zdatainc.com/2014/08/real-time-streaming-apache-spark-streaming/

On Sun, Feb 14, 2016 at 7:28 PM, Divya Gehlot 
wrote:

> Hi,
> I would like to know the difference between spark-shell and spark-submit in
> terms of real-time scenarios.
>
> I am using Hadoop cluster with Spark on EC2.
>
>
> Thanks,
> Divya
>


Re: AM creation in yarn-client mode

2016-02-09 Thread Alexander Pivovarov
Here are some pictures to illustrate it:
http://www.cloudera.com/documentation/enterprise/5-4-x/topics/cdh_ig_running_spark_on_yarn.html
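
A quick sketch of the two submission modes discussed in the reply quoted below
(the class and jar names are placeholders):

  # yarn-client: the driver runs in your local JVM; the AM is a separate YARN container
  spark-submit --master yarn-client --class com.example.MyApp my-app.jar

  # yarn-cluster: the driver runs inside the AM container in YARN
  spark-submit --master yarn-cluster --class com.example.MyApp my-app.jar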

On Tue, Feb 9, 2016 at 10:18 PM, Jonathan Kelly 
wrote:

> In yarn-client mode, the driver is separate from the AM. The AM is created
> in YARN, and YARN controls where it goes (though you can somewhat control
> it using YARN node labels--I just learned earlier today in a different
> thread on this list that this can be controlled by
> spark.yarn.am.labelExpression). Then what I understand is that the driver
> talks to the AM in order to request additional YARN containers in which to
> run executors.
>
> In yarn-cluster mode, the SparkSubmit process outside of the cluster
> creates the AM in YARN, and then what I understand is that the AM *becomes*
> the driver (by invoking the driver's main method), and then it requests the
> executor containers.
>
> So yes, one difference between yarn-client and yarn-cluster mode is that
> in yarn-client mode the driver and AM are separate, whereas they are the
> same in yarn-cluster.
>
> ~ Jonathan
>
> On Tue, Feb 9, 2016 at 9:57 PM praveen S  wrote:
>
>> Can you explain what happens in yarn client mode?
>>
>> Regards,
>> Praveen
>> On 10 Feb 2016 10:55, "ayan guha"  wrote:
>>
>>> It depends on yarn-cluster and yarn-client mode.
>>>
>>> On Wed, Feb 10, 2016 at 3:42 PM, praveen S  wrote:
>>>
 Hi,

 I have 2 questions when running the spark jobs on yarn in client mode :

 1) Where is the AM(application master) created :

 A) is it created on the client where the job was submitted? i.e driver
 and AM on the same client?
 Or
 B) yarn decides where the the AM should be created?

 2) Driver and AM run in different processes : is my assumption correct?

 Regards,
 Praveen

>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>


Re: Is spark-ec2 going away?

2016-01-27 Thread Alexander Pivovarov
you can use EMR-4.3.0 running on spot instances to control the price

yes, you can add/remove instances to the cluster on the fly (CORE instances
support add only, TASK instances - add and remove)
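
For example, resizing a running cluster can be done from the CLI roughly like
this (the instance group id below is a placeholder; you can look it up with
aws emr describe-cluster):

aws emr modify-instance-groups \
  --instance-groups InstanceGroupId=ig-XXXXXXXXXXXX,InstanceCount=5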



On Wed, Jan 27, 2016 at 2:07 PM, Sung Hwan Chung 
wrote:

> I noticed that in the main branch, the ec2 directory along with the
> spark-ec2 script is no longer present.
>
> Is spark-ec2 going away in the next release? If so, what would be the best
> alternative at that time?
>
> A couple more additional questions:
> 1. Is there any way to add/remove additional workers while the cluster is
> running without stopping/starting the EC2 cluster?
> 2. For 1, if no such capability is provided with the current script., do
> we have to write it ourselves? Or is there any plan in the future to add
> such functions?
> 2. In PySpark, is it possible to dynamically change driver/executor
> memory, number of cores per executor without having to restart it? (e.g.
> via changing sc configuration or recreating sc?)
>
> Our ideal scenario is to keep running PySpark (in our case, as a notebook)
> and connect/disconnect to any spark clusters on demand.
>


save rdd with gzip compression but without .gz extension?

2016-01-26 Thread Alexander Pivovarov
Question #1
When Spark saves an RDD using the Gzip codec it generates files with a .gz extension.
Is it possible to ask Spark not to add the .gz extension to the file names and keep
file names like part-x?

I want to compress existing text files to gzip and want to keep the original
file names (and context)


Question #2.
Is it possible to not produce crc files? e.g. .part-x.gz.crc and
._SUCCESS.crc
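
For question #1, one commonly suggested approach (a sketch only, not verified
here; the codec class name is made up) is to save with a codec subclass that
reports an empty extension, since the Hadoop text output format takes the file
suffix from the codec:

import org.apache.hadoop.io.compress.GzipCodec

// hypothetical codec: same gzip compression, but no ".gz" appended to part files
class GzipNoExtensionCodec extends GzipCodec {
  override def getDefaultExtension: String = ""
}

rdd.saveAsTextFile("/out/path", classOf[GzipNoExtensionCodec])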


Re: [Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Alexander Pivovarov
Look at
to_utc_timestamp

from_utc_timestamp
On Jan 18, 2016 9:39 AM, "Jerry Lam"  wrote:

> Hi spark users and developers,
>
> what do you do if you want the from_unixtime function in spark sql to
> return the timezone you want instead of the system timezone?
>
> Best Regards,
>
> Jerry
>


Re: [Spark-SQL] from_unixtime with user-specified timezone

2016-01-18 Thread Alexander Pivovarov
If you can find a function in Oracle or MySQL or Postgres which works
better, then we can create a similar one.

Timezone conversion is tricky because of daylight saving time,
so it's better to use UTC without DST in the database/DW
On Jan 18, 2016 1:24 PM, "Jerry Lam" <chiling...@gmail.com> wrote:

> Thanks Alex:
>
> So you suggested something like:
> from_utc_timestamp(to_utc_timestamp(from_unixtime(1389802875),'America/Montreal'),
> 'America/Los_Angeles')?
>
> This is a lot of conversion :)
>
> Is there a particular reason not to have from_unixtime to take timezone
> information?
>
> I think I will make a UDF if this is the only way out of the box.
>
> Thanks!
>
> Jerry
>
> On Mon, Jan 18, 2016 at 2:32 PM, Alexander Pivovarov <apivova...@gmail.com
> > wrote:
>
>> Look at
>> to_utc_timestamp
>>
>> from_utc_timestamp
>> On Jan 18, 2016 9:39 AM, "Jerry Lam" <chiling...@gmail.com> wrote:
>>
>>> Hi spark users and developers,
>>>
>>> what do you do if you want the from_unixtime function in spark sql to
>>> return the timezone you want instead of the system timezone?
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>
>
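
If a UDF ends up being the way to go, a minimal sketch might look like this
(the name from_unixtime_tz is arbitrary and timestamps are assumed to be epoch
seconds):

import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

sqlContext.udf.register("from_unixtime_tz", (sec: Long, tz: String) => {
  // format the epoch seconds in the caller-supplied timezone
  val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  fmt.setTimeZone(TimeZone.getTimeZone(tz))
  fmt.format(new Date(sec * 1000L))
})

// usage: select from_unixtime_tz(1389802875, 'America/Los_Angeles') from t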


automatically unpersist RDDs which are not used for 24 hours?

2016-01-13 Thread Alexander Pivovarov
Is it possible to automatically unpersist RDDs which are not used for 24
hours?


Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Alexander Pivovarov
try coalesce(1, true).
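
i.e. something along these lines (paths are placeholders); for the DataFrame
in the question, repartition(1) is the equivalent shuffle-based variant:

rdd.coalesce(1, shuffle = true).saveAsTextFile("/out/path")

// or, on the DataFrame itself
sourceFrame.repartition(1).write.format("com.databricks.spark.csv").save("/out/path")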

On Tue, Jan 5, 2016 at 11:58 AM, unk1102  wrote:

> Hi, I am trying to save many partitions of a Dataframe into one CSV file
> and it takes forever for large data sets of around 5-6 GB.
>
>
> sourceFrame.coalesce(1).write().format("com.databricks.spark.csv").option("gzip").save("/path/hadoop")
>
> For small data the above code works well, but for large data it hangs
> forever and does not move on, because only one partition has to shuffle
> GBs of data. Please help me.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/coalesce-1-saveAsTextfile-takes-forever-tp25886.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: combining multiple JSON files to one DataFrame

2015-12-20 Thread Alexander Pivovarov
Just point the loader to the folder. You do not need the *
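
i.e. with the path from the original mail, something like:

val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample")
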
On Dec 19, 2015 11:21 PM, "Eran Witkon"  wrote:

> Hi,
> Can I combine multiple JSON files to one DataFrame?
>
> I tried
> val df = sqlContext.read.json("/home/eranw/Workspace/JSON/sample/*")
> but I get an empty DF
> Eran
>


Re: How to do map join in Spark SQL

2015-12-20 Thread Alexander Pivovarov
The spark.sql.autoBroadcastJoinThreshold default value in 1.5.2 is 10MB.
According to the console output Spark is doing the broadcast, but a query
which looks like the following does not perform well

select
 big_t.*,
 small_t.name range_name
from big_t
join small_t on (1=1)
where small_t.min <= big_t.v and big_t.v < small_t.max

Instead of it I registered a UDF which returns the range_name

val ranges = sqlContext.sql("select min_v, max_v, name from small_t").
  collect().map(r => (r.getLong(0), r.getLong(1), r.getString(2))).sortBy(_._1)

sqlContext.udf.register("findRangeName", (v: java.lang.Long) =>
RangeUDF.findName(v, ranges))

// RangeUDF.findName

def findName(vObj: java.lang.Long, ranges: Array[(Long, Long, String)]): String = {
  val v = if (vObj == null) -1L else vObj.longValue()
  ranges.find(x => x._1 <= v && v < x._2).map(_._3).getOrElse("")
}

// Now I can use udf to get rangeName
select
 big_t.*,
 findRangeName(v) range_name
from big_t
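
Another option worth trying (not benchmarked here) is the explicit broadcast
hint on the DataFrame API, which should turn the non-equi join into a
broadcast nested loop join; bigDf/smallDf and the min_v/max_v column names are
assumed from the snippets above:

import org.apache.spark.sql.functions.broadcast

val result = bigDf.join(broadcast(smallDf),
  bigDf("v") >= smallDf("min_v") && bigDf("v") < smallDf("max_v"))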

On Sun, Dec 20, 2015 at 9:16 AM, Chris Fregly <ch...@fregly.com> wrote:

> this type of broadcast should be handled by Spark SQL/DataFrames
> automatically.
>
> this is the primary cost-based, physical-plan query optimization that the
> Spark SQL Catalyst optimizer supports.
>
> in Spark 1.5 and before, you can trigger this optimization by properly
> setting the spark.sql.autoBroadcastJoinThreshold to a value that is *above* the
> size of your smaller table when fully bloated in JVM memory (not the
> serialized size of the data on disk - very common mistake).
>
> in Spark 1.6+, there are heuristics to make this decision dynamically -
> and even allow hybrid execution where certain keys - within the same Spark
> job - will be broadcast and others won't depending on their relative "
> "hotness" for that particular job.
>
> common theme of Spark 1.6 and beyond will be adaptive physical plan
> execution, adaptive memory allocation to RDD Cache vs Spark Execution
> Engine, adaptive cluster resource allocation, etc.
>
> the goal being to minimize manual configuration and enable many diff types
> of workloads to run efficiently on the same Spark cluster.
>
> On Dec 19, 2015, at 12:10 PM, Alexander Pivovarov <apivova...@gmail.com>
> wrote:
>
> I collected small DF to array of tuple3
> Then I registered UDF with function which is doing lookup in the array
> Then I just run select which uses the UDF.
> On Dec 18, 2015 1:06 AM, "Akhil Das" <ak...@sigmoidanalytics.com> wrote:
>
>> You can broadcast your json data and then do a map side join. This
>> article is a good start
>> http://dmtolpeko.com/2015/02/20/map-side-join-in-spark/
>>
>> Thanks
>> Best Regards
>>
>> On Wed, Dec 16, 2015 at 2:51 AM, Alexander Pivovarov <
>> apivova...@gmail.com> wrote:
>>
>>> I have big folder having ORC files. Files have duration field (e.g.
>>> 3,12,26, etc)
>>> Also I have small json file  (just 8 rows) with ranges definition (min,
>>> max , name)
>>> 0, 10, A
>>> 10, 20, B
>>> 20, 30, C
>>> etc
>>>
>>> Because I can not do equi-join btw duration and range min/max I need to
>>> do cross join and apply WHERE condition to take records which belong to the
>>> range
>>> Cross join is an expensive operation I think that it's better if this
>>> particular join done using Map Join
>>>
>>> How to do Map join in Spark Sql?
>>>
>>
>>


Re: How to do map join in Spark SQL

2015-12-19 Thread Alexander Pivovarov
I collected small DF to array of tuple3
Then I registered UDF with function which is doing lookup in the array
Then I just run select which uses the UDF.
On Dec 18, 2015 1:06 AM, "Akhil Das" <ak...@sigmoidanalytics.com> wrote:

> You can broadcast your json data and then do a map side join. This article
> is a good start http://dmtolpeko.com/2015/02/20/map-side-join-in-spark/
>
> Thanks
> Best Regards
>
> On Wed, Dec 16, 2015 at 2:51 AM, Alexander Pivovarov <apivova...@gmail.com
> > wrote:
>
>> I have big folder having ORC files. Files have duration field (e.g.
>> 3,12,26, etc)
>> Also I have small json file  (just 8 rows) with ranges definition (min,
>> max , name)
>> 0, 10, A
>> 10, 20, B
>> 20, 30, C
>> etc
>>
>> Because I can not do equi-join btw duration and range min/max I need to
>> do cross join and apply WHERE condition to take records which belong to the
>> range
>> Cross join is an expensive operation I think that it's better if this
>> particular join done using Map Join
>>
>> How to do Map join in Spark Sql?
>>
>
>


Re: which aws instance type for shuffle performance

2015-12-18 Thread Alexander Pivovarov
Andrew, it's going to be 4 executor JVMs on each r3.8xlarge.

Rastan, you can run a quick test using an EMR Spark cluster on spot instances
and see which configuration works better. Without the tests it is all
speculation.
On Dec 18, 2015 1:53 PM, "Andrew Or"  wrote:

> Hi Rastan,
>
> Unless you're using off-heap memory or starting multiple executors per
> machine, I would recommend the r3.2xlarge option, since you don't actually
> want gigantic heaps (100GB is more than enough). I've personally run Spark
> on a very large scale with r3.8xlarge instances, but I've been using
> off-heap, so much of the memory was actually not used.
>
> Yes, if a shuffle file exists locally Spark just reads from disk.
>
> -Andrew
>
> 2015-12-15 23:11 GMT-08:00 Rastan Boroujerdi :
>
>> I'm trying to determine whether I should be using 10 r3.8xlarge or 40
>> r3.2xlarge. I'm mostly concerned with shuffle performance of the
>> application.
>>
>> If I go with r3.8xlarge I will need to configure 4 worker instances per
>> machine to keep the JVM size down. The worker instances will likely contend
>> with each other for network and disk I/O if they are on the same machine.
>> If I go with 40 r3.2xlarge I will be able to allocate a single worker
>> instance per box, allowing each worker instance to have its own dedicated
>> network and disk I/O.
>>
>> Since shuffle performance is heavily impacted by disk and network
>> throughput, it seems like going with 40 r3.2xlarge would be the better
>> configuration between the two. Is my analysis correct? Are there other
>> tradeoffs that I'm not taking into account? Does spark bypass the network
>> transfer and read straight from disk if worker instances are on the same
>> machine?
>>
>> Thanks,
>>
>> Rastan
>>
>
>


Re: Can't run spark on yarn

2015-12-17 Thread Alexander Pivovarov
Try to start AWS EMR 4.2.0 with the Hadoop and Spark applications on spot
instances. Then look at how Hadoop and Spark are configured. Try to configure
your Hadoop and Spark in a similar way.
On Dec 17, 2015 6:09 PM, "Saisai Shao"  wrote:

> Please check the Yarn AM log to see why AM is failed to start. That's the
> reason why using `sc` will get such complaint.
>
> On Fri, Dec 18, 2015 at 4:25 AM, Eran Witkon  wrote:
>
>> Hi,
>> I am trying to install spark 1.5.2 on Apache hadoop 2.6 and Hive and yarn
>>
>> spark-env.sh
>> export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
>>
>> bash_profile
>> #HADOOP VARIABLES START
>> export JAVA_HOME=/usr/lib/jvm/java-8-oracle/
>> export HADOOP_INSTALL=/usr/local/hadoop
>> export PATH=$PATH:$HADOOP_INSTALL/bin
>> export PATH=$PATH:$HADOOP_INSTALL/sbin
>> export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
>> export HADOOP_COMMON_HOME=$HADOOP_INSTALL
>> export HADOOP_HDFS_HOME=$HADOOP_INSTALL
>> export YARN_HOME=$HADOOP_INSTALL
>> export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
>> export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
>> export HADOOP_USER_CLASSPATH_FIRST=true
>> export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
>> export YARN_CONF_DIR=/usr/local/hadoop/etc/hadoop
>> #HADOOP VARIABLES END
>>
>> export SPARK_HOME=/usr/local/spark
>> export HIVE_HOME=/usr/local/hive
>> export PATH=$PATH:$HIVE_HOME/bin
>>
>>
>> When I run spark-shell
>> ./bin/spark-shell --master yarn-client
>>
>> Output:
>> 15/12/17 22:22:07 WARN util.NativeCodeLoader: Unable to load
>> native-hadoop library for your platform... using builtin-java classes where
>> applicable
>> 15/12/17 22:22:07 INFO spark.SecurityManager: Changing view acls to:
>> hduser
>> 15/12/17 22:22:07 INFO spark.SecurityManager: Changing modify acls to:
>> hduser
>> 15/12/17 22:22:07 INFO spark.SecurityManager: SecurityManager:
>> authentication disabled; ui acls disabled; users with view permissions:
>> Set(hduser); users with modify permissions: Set(hduser)
>> 15/12/17 22:22:07 INFO spark.HttpServer: Starting HTTP Server
>> 15/12/17 22:22:07 INFO server.Server: jetty-8.y.z-SNAPSHOT
>> 15/12/17 22:22:08 INFO server.AbstractConnector: Started
>> SocketConnector@0.0.0.0:38389
>> 15/12/17 22:22:08 INFO util.Utils: Successfully started service 'HTTP
>> class server' on port 38389.
>> Welcome to
>>     __
>>  / __/__  ___ _/ /__
>> _\ \/ _ \/ _ `/ __/  '_/
>>/___/ .__/\_,_/_/ /_/\_\   version 1.5.2
>>   /_/
>>
>> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
>> 1.8.0_66)
>> Type in expressions to have them evaluated.
>> Type :help for more information.
>> 15/12/17 22:22:11 WARN util.Utils: Your hostname, eranw-Lenovo-Yoga-2-Pro
>> resolves to a loopback address: 127.0.1.1; using 10.0.0.1 instead (on
>> interface wlp1s0)
>> 15/12/17 22:22:11 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind
>> to another address
>> 15/12/17 22:22:11 INFO spark.SparkContext: Running Spark version 1.5.2
>> 15/12/17 22:22:11 INFO spark.SecurityManager: Changing view acls to:
>> hduser
>> 15/12/17 22:22:11 INFO spark.SecurityManager: Changing modify acls to:
>> hduser
>> 15/12/17 22:22:11 INFO spark.SecurityManager: SecurityManager:
>> authentication disabled; ui acls disabled; users with view permissions:
>> Set(hduser); users with modify permissions: Set(hduser)
>> 15/12/17 22:22:11 INFO slf4j.Slf4jLogger: Slf4jLogger started
>> 15/12/17 22:22:11 INFO Remoting: Starting remoting
>> 15/12/17 22:22:12 INFO util.Utils: Successfully started service
>> 'sparkDriver' on port 36381.
>> 15/12/17 22:22:12 INFO Remoting: Remoting started; listening on addresses
>> :[akka.tcp://sparkDriver@10.0.0.1:36381]
>> 15/12/17 22:22:12 INFO spark.SparkEnv: Registering MapOutputTracker
>> 15/12/17 22:22:12 INFO spark.SparkEnv: Registering BlockManagerMaster
>> 15/12/17 22:22:12 INFO storage.DiskBlockManager: Created local directory
>> at /tmp/blockmgr-139fac31-5f21-4c61-9575-3110d5205f7d
>> 15/12/17 22:22:12 INFO storage.MemoryStore: MemoryStore started with
>> capacity 530.0 MB
>> 15/12/17 22:22:12 INFO spark.HttpFileServer: HTTP File server directory
>> is
>> /tmp/spark-955ef002-a802-49c6-b440-0656861f737c/httpd-2127cbe1-97d7-40a5-a96f-75216f115f00
>> 15/12/17 22:22:12 INFO spark.HttpServer: Starting HTTP Server
>> 15/12/17 22:22:12 INFO server.Server: jetty-8.y.z-SNAPSHOT
>> 15/12/17 22:22:12 INFO server.AbstractConnector: Started
>> SocketConnector@0.0.0.0:36760
>> 15/12/17 22:22:12 INFO util.Utils: Successfully started service 'HTTP
>> file server' on port 36760.
>> 15/12/17 22:22:12 INFO spark.SparkEnv: Registering OutputCommitCoordinator
>> 15/12/17 22:22:12 INFO server.Server: jetty-8.y.z-SNAPSHOT
>> 15/12/17 22:22:12 INFO server.AbstractConnector: Started
>> SelectChannelConnector@0.0.0.0:4040
>> 15/12/17 22:22:12 INFO util.Utils: Successfully started service 'SparkUI'
>> on port 4040.
>> 15/12/17 22:22:12 INFO ui.SparkUI: Started 

RE: How to submit spark job to YARN from scala code

2015-12-17 Thread Alexander Pivovarov
spark-submit --master yarn-cluster

Look at the docs for more details
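
e.g. (class and jar names are placeholders):

spark-submit --master yarn-cluster --class com.example.MyApp \
  --num-executors 10 --executor-memory 4g my-app.jar arg1 arg2
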
On Dec 17, 2015 5:00 PM, "Forest Fang"  wrote:

> Maybe I'm not understanding your question correctly but would it be
> possible for you to piece up your job submission information as if you are
> operating spark-submit? If so, you could just call
>  org.apache.spark.deploy.SparkSubmit and pass your regular spark-submit
> arguments.
>
> This is how I do it with my sbt plugin which allows you to codify a
> spark-submit command in sbt build so the JAR gets automatically rebuilt and
> potentially redeployed every time you submit a Spark job using a custom sbt
> task:
> https://github.com/saurfang/sbt-spark-submit/blob/master/src/main/scala/sbtsparksubmit/SparkSubmitPlugin.scala#L85
>
>
> --
> Subject: Re: How to submit spark job to YARN from scala code
> From: ste...@hortonworks.com
> CC: user@spark.apache.org
> Date: Thu, 17 Dec 2015 19:45:16 +
>
>
> On 17 Dec 2015, at 16:50, Saiph Kappa  wrote:
>
> Hi,
>
> Since it is not currently possible to submit a spark job to a spark
> cluster running in standalone mode (cluster mode - it's not currently
> possible to specify this deploy mode within the code), can I do it with
> YARN?
>
> I tried to do something like this (but in scala):
>
> «
>
> ... // Client object - main method
> System.setProperty("SPARK_YARN_MODE", "true")
> val sparkConf = new SparkConf()
> try {
>   val args = new ClientArguments(argStrings, sparkConf)
>   new Client(args, sparkConf).run()
> } catch {
>   case e: Exception => {
>     Console.err.println(e.getMessage)
>     System.exit(1)
>   }
> }
> System.exit(0)
>
> » in http://blog.sequenceiq.com/blog/2014/08/22/spark-submit-in-java/
>
> However it is not possible to create a new instance of Client since import 
> org.apache.spark.deploy.yarn.Client is private
>
>
> the standard way to work around a problem like this is to place your code
> in a package which has access. File a JIRA asking for a public API too —one
> that doesn't require you to set system properties as a way of passing
> parameters down
>
>
> Is there any way I can submit spark jobs from the code in cluster mode and 
> not using the spark-submit script?
>
> Thanks.
>
>
>


How to do map join in Spark SQL

2015-12-15 Thread Alexander Pivovarov
I have big folder having ORC files. Files have duration field (e.g.
3,12,26, etc)
Also I have small json file  (just 8 rows) with ranges definition (min, max
, name)
0, 10, A
10, 20, B
20, 30, C
etc

Because I can not do equi-join btw duration and range min/max I need to do
cross join and apply WHERE condition to take records which belong to the
range
Cross join is an expensive operation I think that it's better if this
particular join done using Map Join

How to do Map join in Spark Sql?


Spark does not clean garbage in blockmgr folders on slaves if long running spark-shell is used

2015-12-12 Thread Alexander Pivovarov
Recently I faced an issue with Spark 1.5.2 standalone. Spark does not clean
garbage in blockmgr folders on slaves until I exit from spark-shell.
I opened spark-shell and run my spark program for several input folders.
Then I noticed that Spark uses several GBs of disk space on all slaves in
blockmgr folder, e.g.
spark/spark-xxx/executor-yyy/blockmgr-zzz

Yes, I have several RDDs in memory but according to Spark UI all RDDs use
only Memory (but not disk).
RDDs are cached at the beginning of data processing and at that time
blockmgr folders are almost empty.

So, it looks like the jobs which I run in the shell produced some garbage in
blockmgr folders and Spark did not clean the folders after the jobs were done.
If I exit from spark-shell then blockmgr folders are instantly cleaned.

How to force Spark to clean blockmgr folders without exiting from the shell?
Should I use spark.cleaner.ttl setting?


Workflow manager for Spark and Spark SQL

2015-12-10 Thread Alexander Pivovarov
Hi Everyone

I'm curious what people usually use to build ETL workflows based on
DataFrames and Spark API?

In Hadoop/Hive world people usually use Oozie. Is it different in Spark
world?


Re: spark-ec2 vs. EMR

2015-12-02 Thread Alexander Pivovarov
Do you think it's a security issue if EMR is started in a VPC with a subnet
having Auto-assign Public IP: Yes?

You can remove all Inbound rules having a 0.0.0.0/0 Source in the master and
slave Security Groups, so the master and slave boxes will be accessible only
to users who are on the VPN.




On Wed, Dec 2, 2015 at 9:44 AM, Dana Powers <dana.pow...@gmail.com> wrote:

> EMR was a pain to configure on a private VPC last I tried. Has anyone had
> success with that? I found spark-ec2 easier to use w private networking,
> but also agree that I would use for prod.
>
> -Dana
> On Dec 1, 2015 12:29 PM, "Alexander Pivovarov" <apivova...@gmail.com>
> wrote:
>
>> 1. Emr 4.2.0 has Zeppelin as an alternative to DataBricks Notebooks
>>
>> 2. Emr has Ganglia 3.6.0
>>
>> 3. Emr has hadoop fs settings to make s3 work fast (direct.EmrFileSystem)
>>
>> 4. EMR has s3 keys in hadoop configs
>>
>> 5. EMR allows to resize cluster on fly.
>>
>> 6. EMR has aws sdk in spark classpath. Helps to reduce app assembly jar
>> size
>>
>> 7. ec2 script installs all in /root, EMR has dedicated users: hadoop,
>> zeppelin, etc. EMR is similar to Cloudera or Hortonworks
>>
>> 8. There are at least 3 spark-ec2 projects. (in apache/spark, in mesos,
>> in amplab). Master branch in spark has outdated ec2 script. Other projects
>> have broken links in readme. WHAT A MESS!
>>
>> 9. ec2 script has bad documentation and non informative error messages.
>> e.g. readme does not say anything about --private-ips option. If you did
>> not add the flag it will connect to empty string host (localhost) instead
>> of master. Fixed only last week. Not sure if fixed in all branches
>>
>> 10. I think Amazon will include spark-jobserver to EMR soon.
>>
>> 11. You do not need to be an AWS expert to start an EMR cluster. Users can
>> use the EMR web UI to start a cluster to run some jobs or work in Zeppelin
>> during the day
>>
>> 12. EMR cluster starts in about 8 min. The ec2 script works longer and you
>> need to be online.
>> On Dec 1, 2015 9:22 AM, "Jerry Lam" <chiling...@gmail.com> wrote:
>>
>>> Simply put:
>>>
>>> EMR = Hadoop Ecosystem (Yarn, HDFS, etc) + Spark + EMRFS + Amazon EMR
>>> API + Selected Instance Types + Amazon EC2 Friendly (bootstrapping)
>>> spark-ec2 = HDFS + Yarn (Optional) + Spark (Standalone Default) + Any
>>> Instance Type
>>>
>>> I use spark-ec2 for prototyping and I have never use it for production.
>>>
>>> just my $0.02
>>>
>>>
>>>
>>> On Dec 1, 2015, at 11:15 AM, Nick Chammas <nicholas.cham...@gmail.com>
>>> wrote:
>>>
>>> Pinging this thread in case anyone has thoughts on the matter they want
>>> to share.
>>>
>>> On Sat, Nov 21, 2015 at 11:32 AM Nicholas Chammas <[hidden email]>
>>> wrote:
>>>
>>>> Spark has come bundled with spark-ec2
>>>> <http://spark.apache.org/docs/latest/ec2-scripts.html> for many years.
>>>> At the same time, EMR has been capable of running Spark for a while, and
>>>> earlier this year it added "official" support
>>>> <https://aws.amazon.com/blogs/aws/new-apache-spark-on-amazon-emr/>.
>>>>
>>>> If you're looking for a way to provision Spark clusters, there are some
>>>> clear differences between these 2 options. I think the biggest one would be
>>>> that EMR is a "production" solution backed by a company, whereas spark-ec2
>>>> is not really intended for production use (as far as I know).
>>>>
>>>> That particular difference in intended use may or may not matter to
>>>> you, but I'm curious:
>>>>
>>>> What are some of the other differences between the 2 that do matter to
>>>> you? If you were considering these 2 solutions for your use case at one
>>>> point recently, why did you choose one over the other?
>>>>
>>>> I'd be especially interested in hearing about why people might choose
>>>> spark-ec2 over EMR, since the latter option seems to have shaped up nicely
>>>> this year.
>>>>
>>>> Nick
>>>>
>>>>
>>> --
>>> View this message in context: Re: spark-ec2 vs. EMR
>>> <http://apache-spark-user-list.1001560.n3.nabble.com/Re-spark-ec2-vs-EMR-tp25538.html>
>>> Sent from the Apache Spark User List mailing list archive
>>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>>
>>>
>>>


Re: spark-ec2 vs. EMR

2015-12-01 Thread Alexander Pivovarov
1. Emr 4.2.0 has Zeppelin as an alternative to DataBricks Notebooks

2. Emr has Ganglia 3.6.0

3. Emr has hadoop fs settings to make s3 work fast (direct.EmrFileSystem)

4. EMR has s3 keys in hadoop configs

5. EMR allows to resize cluster on fly.

6. EMR has aws sdk in spark classpath. Helps to reduce app assembly jar size

7. ec2 script installs all in /root, EMR has dedicated users: hadoop,
zeppelin, etc. EMR is similar to Cloudera or Hortonworks

8. There are at least 3 spark-ec2 projects. (in apache/spark, in mesos, in
amplab). Master branch in spark has outdated ec2 script. Other projects
have broken links in readme. WHAT A MESS!

9. ec2 script has bad documentation and non informative error messages.
e.g. readme does not say anything about --private-ips option. If you did
not add the flag it will connect to empty string host (localhost) instead
of master. Fixed only last week. Not sure if fixed in all branches

10. I think Amazon will include spark-jobserver to EMR soon.

11. You do not need to be an AWS expert to start an EMR cluster. Users can use
the EMR web UI to start a cluster to run some jobs or work in Zeppelin during
the day

12. EMR cluster starts in about 8 min. The ec2 script works longer and you need
to be online.
On Dec 1, 2015 9:22 AM, "Jerry Lam"  wrote:

> Simply put:
>
> EMR = Hadoop Ecosystem (Yarn, HDFS, etc) + Spark + EMRFS + Amazon EMR API
> + Selected Instance Types + Amazon EC2 Friendly (bootstrapping)
> spark-ec2 = HDFS + Yarn (Optional) + Spark (Standalone Default) + Any
> Instance Type
>
> I use spark-ec2 for prototyping and I have never use it for production.
>
> just my $0.02
>
>
>
> On Dec 1, 2015, at 11:15 AM, Nick Chammas 
> wrote:
>
> Pinging this thread in case anyone has thoughts on the matter they want to
> share.
>
> On Sat, Nov 21, 2015 at 11:32 AM Nicholas Chammas <[hidden email]> wrote:
>
>> Spark has come bundled with spark-ec2
>>  for many years.
>> At the same time, EMR has been capable of running Spark for a while, and
>> earlier this year it added "official" support
>> .
>>
>> If you're looking for a way to provision Spark clusters, there are some
>> clear differences between these 2 options. I think the biggest one would be
>> that EMR is a "production" solution backed by a company, whereas spark-ec2
>> is not really intended for production use (as far as I know).
>>
>> That particular difference in intended use may or may not matter to you,
>> but I'm curious:
>>
>> What are some of the other differences between the 2 that do matter to
>> you? If you were considering these 2 solutions for your use case at one
>> point recently, why did you choose one over the other?
>>
>> I'd be especially interested in hearing about why people might choose
>> spark-ec2 over EMR, since the latter option seems to have shaped up nicely
>> this year.
>>
>> Nick
>>
>>
> --
> View this message in context: Re: spark-ec2 vs. EMR
> 
> Sent from the Apache Spark User List mailing list archive
>  at Nabble.com.
>
>
>


Re: Spark Expand Cluster

2015-12-01 Thread Alexander Pivovarov
Try to run spark-shell with the correct number of executors.

e.g. for a 10-box cluster running on r3.2xlarge (61 GB RAM, 8 cores) you can
use the following

spark-shell \
--num-executors 20 \
--driver-memory 2g \
--executor-memory 24g \
--executor-cores 4


you might also want to set spark.yarn.executor.memoryOverhead to 2662
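
e.g. by adding it to the same invocation:

spark-shell \
--num-executors 20 \
--driver-memory 2g \
--executor-memory 24g \
--executor-cores 4 \
--conf spark.yarn.executor.memoryOverhead=2662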





On Tue, Nov 24, 2015 at 2:07 AM, Dinesh Ranganathan <
dineshranganat...@gmail.com> wrote:

> Thanks Christopher, I will try that.
>
> Dan
>
> On 20 November 2015 at 21:41, Bozeman, Christopher 
> wrote:
>
>> Dan,
>>
>>
>>
>> Even though you may be adding more nodes to the cluster, the Spark
>> application has to be requesting additional executors in order to thus use
>> the added resources.  Or the Spark application can be using Dynamic
>> Resource Allocation (
>> http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation)
>> [which may use the resources based on application need and availability].
>> For example, in EMR release 4.x (
>> http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html#spark-dynamic-allocation)
>> you can request Spark Dynamic Resource Allocation as the default
>> configuration at cluster creation.
>>
>>
>>
>> Best regards,
>>
>> Christopher
>>
>>
>>
>>
>>
>> *From:* Dinesh Ranganathan [mailto:dineshranganat...@gmail.com]
>> *Sent:* Monday, November 16, 2015 4:57 AM
>> *To:* Sabarish Sasidharan
>> *Cc:* user
>> *Subject:* Re: Spark Expand Cluster
>>
>>
>>
>> Hi Sab,
>>
>>
>>
>> I did not specify number of executors when I submitted the spark
>> application. I was in the impression spark looks at the cluster and figures
>> out the number of executors it can use based on the cluster size
>> automatically, is this what you call dynamic allocation?. I am spark
>> newbie, so apologies if I am missing the obvious. While the application was
>> running I added more core nodes by resizing my EMR instance and I can see
>> the new nodes on the resource manager but my running application did not
>> pick up those machines I've just added.   Let me know If i am missing a
>> step here.
>>
>>
>>
>> Thanks,
>>
>> Dan
>>
>>
>>
>> On 16 November 2015 at 12:38, Sabarish Sasidharan <
>> sabarish.sasidha...@manthan.com> wrote:
>>
>> Spark will use the number of executors you specify in spark-submit. Are
>> you saying that Spark is not able to use more executors after you modify it
>> in spark-submit? Are you using dynamic allocation?
>>
>>
>>
>> Regards
>>
>> Sab
>>
>>
>>
>> On Mon, Nov 16, 2015 at 5:54 PM, dineshranganathan <
>> dineshranganat...@gmail.com> wrote:
>>
>> I have my Spark application deployed on AWS EMR on yarn cluster mode.
>> When I
>> increase the capacity of my cluster by adding more Core instances on AWS,
>> I
>> don't see Spark picking up the new instances dynamically. Is there
>> anything
>> I can do to tell Spark to pick up the newly added boxes??
>>
>> Dan
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Expand-Cluster-tp25393.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>>
>>
>>
>> --
>>
>>
>>
>> Architect - Big Data
>>
>> Ph: +91 99805 99458
>>
>>
>>
>> Manthan Systems | *Company of the year - Analytics (2014 Frost and
>> Sullivan India ICT)*
>>
>> +++
>>
>>
>>
>>
>>
>> --
>>
>> Dinesh Ranganathan
>>
>
>
>
> --
> Dinesh Ranganathan
>


Join and HashPartitioner question

2015-11-13 Thread Alexander Pivovarov
Hi Everyone

Is there any difference in performance btw the following two joins?


val r1: RDD[(String, String)] = ???
val r2: RDD[(String, String)] = ???

val partNum = 80
val partitioner = new HashPartitioner(partNum)

// Join 1
val res1 = r1.partitionBy(partitioner).join(r2.partitionBy(partitioner))

// Join 2
val res2 = r1.join(r2, partNum)


Re: spark ec2 script does not install necessary files to launch spark

2015-11-06 Thread Alexander Pivovarov
Try to use EMR-4.1.0. It has Spark 1.5.0 running on YARN.

Replace subnet-xxx with the correct one.

$ aws emr create-cluster --name emr41_3 --release-label emr-4.1.0
--instance-groups
InstanceCount=1,Name=sparkMaster,InstanceGroupType=MASTER,InstanceType=r3.2xlarge
InstanceCount=3,BidPrice=2.99,Name=sparkSlave,InstanceGroupType=CORE,InstanceType=r3.2xlarge
--applications Name=Spark --ec2-attributes
KeyName=spark,SubnetId=subnet-xxx --region us-east-1 --tags Name=emr41_3
 --use-default-roles --configurations file:///tmp/emr41.json

/tmp/emr41.json

[
  {
"Classification": "spark-defaults",
"Properties": {
  "spark.driver.extraJavaOptions": "-Dfile.encoding=UTF-8",
  "spark.executor.extraJavaOptions": "-Dfile.encoding=UTF-8"
}
  },
  {
"Classification": "spark",
"Properties": {
  "maximizeResourceAllocation": "true"
}
  },
  {
"Classification": "spark-log4j",
"Properties": {
  "log4j.logger.com.amazon": "WARN",
  "log4j.logger.com.amazonaws": "WARN",
  "log4j.logger.amazon.emr": "WARN",
  "log4j.logger.akka": "WARN"
}
  },
  {
"Classification": "yarn-site",
"Properties": {
  "yarn.nodemanager.pmem-check-enabled": "false",
  "yarn.nodemanager.vmem-check-enabled": "false"
}
  }
]



On Fri, Nov 6, 2015 at 3:30 PM, Emaasit  wrote:

> Hello,
> I followed the instructions for launching Spark 1.5.1 on my AWS EC2 but the
> script is not installing all the folders/files required to initialize
> Spark.
> Since the log message is long, I have created a gist here:
> https://gist.github.com/Emaasit/696145959bbbd989bfe1
>
> Please help. I have been going at this for more than 6 hours now to no
> success.
>
>
>
> -
> Daniel Emaasit,
> Ph.D. Research Assistant
> Transportation Research Center (TRC)
> University of Nevada, Las Vegas
> Las Vegas, NV 89154-4015
> Cell: 615-649-2489
> www.danielemaasit.com
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-script-doest-not-install-necessary-files-to-launch-spark-tp25311.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Generated ORC files cause NPE in Hive

2015-10-13 Thread Alexander Pivovarov
Daniel,

Looks like we already have Jira for that error
https://issues.apache.org/jira/browse/HIVE-11431

Could you put details on how to reproduce the issue to the ticket?

Thank you
Alex

On Tue, Oct 13, 2015 at 11:14 AM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:

> Hi,
> We are inserting streaming data into a hive orc table via a simple insert
> statement passed to HiveContext.
> When trying to read the files generated using Hive 1.2.1 we are getting
> NPE:
> at
> org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
> at
> org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:68)
> at
> org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:290)
> at
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:148)
> ... 14 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime
> Error while processing row
> at
> org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:52)
> at
> org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:83)
> ... 17 more
> Caused by: java.lang.NullPointerException
> at java.lang.System.arraycopy(Native Method)
> at org.apache.hadoop.io.Text.set(Text.java:225)
> at
> org.apache.hadoop.hive.ql.exec.vector.VectorExtractRow$StringExtractorByValue.extract(VectorExtractRow.java:472)
> at
> org.apache.hadoop.hive.ql.exec.vector.VectorExtractRow.extractRow(VectorExtractRow.java:732)
> at
> org.apache.hadoop.hive.ql.exec.vector.VectorReduceSinkOperator.process(VectorReduceSinkOperator.java:102)
> at
> org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
> at
> org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:138)
> at
> org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:837)
> at
> org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:97)
> at
> org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:162)
> at
> org.apache.hadoop.hive.ql.exec.vector.VectorMapOperator.process(VectorMapOperator.java:45)
> ... 18 more
>
> Is this a known issue ?
>
>


OutOfMemoryError OOM ByteArrayOutputStream.hugeCapacity

2015-10-12 Thread Alexander Pivovarov
I have one job which fails if I enable KryoSerializer

I use spark 1.5.0 on emr-4.1.0

Settings:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max  1024m
spark.executor.memory47924M
spark.yarn.executor.memoryOverhead 5324


The job works fine if I keep the default spark.serializer BUT it fails if I use
KryoSerializer. I tried to increase the kryoserializer buffer max to 1024m and
am still getting the OOM error.


java.lang.OutOfMemoryError
at
java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at
org.xerial.snappy.SnappyOutputStream.dumpOutput(SnappyOutputStream.java:294)
at
org.xerial.snappy.SnappyOutputStream.compressInput(SnappyOutputStream.java:306)
at
org.xerial.snappy.SnappyOutputStream.rawWrite(SnappyOutputStream.java:245)
at
org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:107)
at
org.apache.spark.io.SnappyOutputStreamWrapper.write(CompressionCodec.scala:189)
at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:477)
at com.esotericsoftware.kryo.io.Output.writeDouble(Output.java:596)
at
com.esotericsoftware.kryo.serializers.DefaultSerializers$DoubleSerializer.write(DefaultSerializers.java:137)
at
com.esotericsoftware.kryo.serializers.DefaultSerializers$DoubleSerializer.write(DefaultSerializers.java:131)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:576)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:21)
at com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:19)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:21)
at com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:19)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:21)
at com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:19)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:21)
at com.twitter.chill.SomeSerializer.write(SomeSerializer.scala:19)


Re: How can I disable logging when running local[*]?

2015-10-06 Thread Alexander Pivovarov
The easiest way to control logging in spark shell is to run Logger.setLevel
commands at the beginning of your program
e.g.

org.apache.log4j.Logger.getLogger("com.amazon").setLevel(org.apache.log4j.Level.WARN)
org.apache.log4j.Logger.getLogger("com.amazonaws").setLevel(org.apache.log4j.Level.WARN)
org.apache.log4j.Logger.getLogger("amazon.emr").setLevel(org.apache.log4j.Level.WARN)
org.apache.log4j.Logger.getLogger("akka").setLevel(org.apache.log4j.Level.WARN)

On Tue, Oct 6, 2015 at 10:50 AM, Alex Kozlov  wrote:

> Try
>
> JAVA_OPTS='-Dlog4j.configuration=file:/'
>
> Internally, this is just spark.driver.extraJavaOptions, which you should
> be able to set in conf/spark-defaults.conf
>
> Can you provide more details how you invoke the driver?
>
> On Tue, Oct 6, 2015 at 9:48 AM, Jeff Jones 
> wrote:
>
>> Thanks. Any chance you know how to pass this to a Scala app that is run
>> via TypeSafe activator?
>>
>> I tried putting it $JAVA_OPTS but I get:
>>
>> Unrecognized option: --driver-java-options
>>
>> Error: Could not create the Java Virtual Machine.
>>
>> Error: A fatal exception has occurred. Program will exit.
>>
>>
>> I tried a bunch of different quoting but nothing produced a good result.
>> I also tried passing it directly to activator using –jvm but it still
>> produces the same results with verbose logging. Is there a way I can tell
>> if it’s picking up my file?
>>
>>
>>
>> From: Alex Kozlov
>> Date: Monday, October 5, 2015 at 8:34 PM
>> To: Jeff Jones
>> Cc: "user@spark.apache.org"
>> Subject: Re: How can I disable logging when running local[*]?
>>
>> Did you try “--driver-java-options
>> '-Dlog4j.configuration=file:/'” and setting the
>> log4j.rootLogger=FATAL,console?
>>
>> On Mon, Oct 5, 2015 at 8:19 PM, Jeff Jones 
>> wrote:
>>
>>> I’ve written an application that hosts the Spark driver in-process using
>>> “local[*]”. I’ve turned off logging in my conf/log4j.properties file. I’ve
>>> also tried putting the following code prior to creating my SparkContext.
>>> These were coupled together from various posts I’ve. None of these steps
>>> have worked. I’m still getting a ton of logging to the console. Anything
>>> else I can try?
>>>
>>> Thanks,
>>> Jeff
>>>
>>> private def disableLogging(): Unit = {
>>>   import org.apache.log4j.PropertyConfigurator
>>>
>>>   PropertyConfigurator.configure("conf/log4j.properties")
>>>   Logger.getRootLogger().setLevel(Level.OFF)
>>>   Logger.getLogger("org").setLevel(Level.OFF)
>>>   Logger.getLogger("akka").setLevel(Level.OFF)
>>> }
>>>
>>>
>>>
>>> This message (and any attachments) is intended only for the designated
>>> recipient(s). It
>>> may contain confidential or proprietary information, or have other
>>> limitations on use as
>>> indicated by the sender. If you are not a designated recipient, you may
>>> not review, use,
>>> copy or distribute this message. If you received this in error, please
>>> notify the sender by
>>> reply e-mail and delete this message.
>>>
>>
>>
>>
>> --
>> Alex Kozlov
>> (408) 507-4987
>> (408) 830-9982 fax
>> (650) 887-2135 efax
>> ale...@gmail.com
>>
>>
>> This message (and any attachments) is intended only for the designated
>> recipient(s). It
>> may contain confidential or proprietary information, or have other
>> limitations on use as
>> indicated by the sender. If you are not a designated recipient, you may
>> not review, use,
>> copy or distribute this message. If you received this in error, please
>> notify the sender by
>> reply e-mail and delete this message.
>>
>
>


Does YARN start new executor in place of the failed one?

2015-09-28 Thread Alexander Pivovarov
Hello Everyone

I use Spark on YARN on EMR-4

The Spark program which I run has several jobs/stages and runs for about 10
hours.
During the execution some executors might fail for some reason,
BUT I do not see that new executors are started in place of the failed ones.

So, what I see in the Spark UI is that at the beginning of my program I have
100 executors but after 10 hours I see only 67 executors.

I remember that in Standalone mode the Spark Worker starts a new executor in
place of a failed one automatically.

How to activate the same behavior on YARN?

The only non-default YARN setting I use are the following:
yarn.nodemanager.pmem-check-enabled=false
yarn.nodemanager.vmem-check-enabled=false

Thank you
Alex


Re: Spark on Yarn vs Standalone

2015-09-21 Thread Alexander Pivovarov
I repartitioned input RDD from 4,800 to 24,000 partitions
After that the stage (24000 tasks) was done in 22 min on 100 boxes
Shuffle read/write: 905 GB / 710 GB

Task Metrics (Dur/GC/Read/Write)
Min: 7s/1s/38MB/30MB
Med: 22s/9s/38MB/30MB
Max:1.8min/1.6min/38MB/30MB

On Mon, Sep 21, 2015 at 5:55 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> The warning you're seeing in Spark is no issue.  The scratch space lives
> inside the heap, so it'll never result in YARN killing the container by
> itself.  The issue is that Spark is using some off-heap space on top of
> that.
>
> You'll need to bump the spark.yarn.executor.memoryOverhead property to
> give the executors some additional headroom above the heap space.
>
> -Sandy
>
> On Mon, Sep 21, 2015 at 5:43 PM, Saisai Shao <sai.sai.s...@gmail.com>
> wrote:
>
>> I think you need to increase the memory size of executor through command
>> arguments "--executor-memory", or configuration "spark.executor.memory".
>>
>> Also yarn.scheduler.maximum-allocation-mb in Yarn side if necessary.
>>
>> Thanks
>> Saisai
>>
>>
>> On Mon, Sep 21, 2015 at 5:13 PM, Alexander Pivovarov <
>> apivova...@gmail.com> wrote:
>>
>>> I noticed that some executors have issue with scratch space.
>>> I see the following in yarn app container stderr around the time when
>>> yarn killed the executor because it uses too much memory.
>>>
>>> -- App container stderr --
>>> 15/09/21 21:43:22 WARN storage.MemoryStore: Not enough space to cache
>>> rdd_6_346 in memory! (computed 3.0 GB so far)
>>> 15/09/21 21:43:22 INFO storage.MemoryStore: Memory use = 477.6 KB
>>> (blocks) + 24.4 GB (scratch space shared across 8 thread(s)) = 24.4 GB.
>>> Storage limit = 25.2 GB.
>>> 15/09/21 21:43:22 WARN spark.CacheManager: Persisting partition
>>> rdd_6_346 to disk instead.
>>> 15/09/21 21:43:22 WARN storage.MemoryStore: Not enough space to cache
>>> rdd_6_49 in memory! (computed 3.1 GB so far)
>>> 15/09/21 21:43:22 INFO storage.MemoryStore: Memory use = 477.6 KB
>>> (blocks) + 24.4 GB (scratch space shared across 8 thread(s)) = 24.4 GB.
>>> Storage limit = 25.2 GB.
>>> 15/09/21 21:43:22 WARN spark.CacheManager: Persisting partition rdd_6_49
>>> to disk instead.
>>>
>>> -- Yarn Nodemanager log --
>>> 2015-09-21 21:44:05,716 WARN
>>> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
>>> (Container Monitor): Container
>>> [pid=5114,containerID=container_1442869100946_0001_01_0
>>> 00056] is running beyond physical memory limits. Current usage: 52.2 GB
>>> of 52 GB physical memory used; 53.0 GB of 260 GB virtual memory used.
>>> Killing container.
>>> Dump of the process-tree for container_1442869100946_0001_01_56 :
>>> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
>>> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>>> |- 5117 5114 5114 5114 (java) 1322810 27563 56916316160 13682772
>>> /usr/lib/jvm/java-openjdk/bin/java -server -XX:OnOutOfMemoryError=kill %p
>>> -Xms47924m -Xmx47924m -verbose:gc -XX:+PrintGCDetails
>>> -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
>>> -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
>>> -XX:+CMSClassUnloadingEnabled
>>> -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1442869100946_0001/container_1442869100946_0001_01_56/tmp
>>> -Dspark.akka.failure-detector.threshold=3000.0
>>> -Dspark.akka.heartbeat.interval=1s -Dspark.akka.threads=4
>>> -Dspark.history.ui.port=18080 -Dspark.akka.heartbeat.pauses=6s
>>> -Dspark.akka.timeout=1000s -Dspark.akka.frameSize=50
>>> -Dspark.driver.port=52690
>>> -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1442869100946_0001/container_1442869100946_0001_01_56
>>> org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
>>> akka.tcp://sparkDriver@10.0.24.153:52690/user/CoarseGrainedScheduler
>>> --executor-id 55 --hostname ip-10-0-28-96.ec2.internal --cores 8 --app-id
>>> application_1442869100946_0001 --user-class-path
>>> file:/mnt/yarn/usercache/hadoop/appcache/application_1442869100946_0001/container_1442869100946_0001_01_56/__app__.jar
>>> |- 5114 5112 5114 5114 (bash) 0 0 9658368 291 /bin/bash -c
>>> /usr/lib/jvm/java-openjdk/bin/java -server -XX:OnOutOfMemoryError='kill %p'
>>> -Xms47924m -Xmx47924m '-verbose:gc' '-XX:+PrintGCDetails'
>>> 

Re: Spark on Yarn vs Standalone

2015-09-21 Thread Alexander Pivovarov
I noticed that some executors have an issue with scratch space.
I see the following in the yarn app container stderr around the time when yarn
killed the executor because it used too much memory.

-- App container stderr --
15/09/21 21:43:22 WARN storage.MemoryStore: Not enough space to cache
rdd_6_346 in memory! (computed 3.0 GB so far)
15/09/21 21:43:22 INFO storage.MemoryStore: Memory use = 477.6 KB (blocks)
+ 24.4 GB (scratch space shared across 8 thread(s)) = 24.4 GB. Storage
limit = 25.2 GB.
15/09/21 21:43:22 WARN spark.CacheManager: Persisting partition rdd_6_346
to disk instead.
15/09/21 21:43:22 WARN storage.MemoryStore: Not enough space to cache
rdd_6_49 in memory! (computed 3.1 GB so far)
15/09/21 21:43:22 INFO storage.MemoryStore: Memory use = 477.6 KB (blocks)
+ 24.4 GB (scratch space shared across 8 thread(s)) = 24.4 GB. Storage
limit = 25.2 GB.
15/09/21 21:43:22 WARN spark.CacheManager: Persisting partition rdd_6_49 to
disk instead.

-- Yarn Nodemanager log --
2015-09-21 21:44:05,716 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Container
[pid=5114,containerID=container_1442869100946_0001_01_0
00056] is running beyond physical memory limits. Current usage: 52.2 GB of
52 GB physical memory used; 53.0 GB of 260 GB virtual memory used. Killing
container.
Dump of the process-tree for container_1442869100946_0001_01_56 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 5117 5114 5114 5114 (java) 1322810 27563 56916316160 13682772
/usr/lib/jvm/java-openjdk/bin/java -server -XX:OnOutOfMemoryError=kill %p
-Xms47924m -Xmx47924m -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
-XX:+CMSClassUnloadingEnabled
-Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1442869100946_0001/container_1442869100946_0001_01_56/tmp
-Dspark.akka.failure-detector.threshold=3000.0
-Dspark.akka.heartbeat.interval=1s -Dspark.akka.threads=4
-Dspark.history.ui.port=18080 -Dspark.akka.heartbeat.pauses=6s
-Dspark.akka.timeout=1000s -Dspark.akka.frameSize=50
-Dspark.driver.port=52690
-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1442869100946_0001/container_1442869100946_0001_01_56
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
akka.tcp://sparkDriver@10.0.24.153:52690/user/CoarseGrainedScheduler
--executor-id 55 --hostname ip-10-0-28-96.ec2.internal --cores 8 --app-id
application_1442869100946_0001 --user-class-path
file:/mnt/yarn/usercache/hadoop/appcache/application_1442869100946_0001/container_1442869100946_0001_01_56/__app__.jar
|- 5114 5112 5114 5114 (bash) 0 0 9658368 291 /bin/bash -c
/usr/lib/jvm/java-openjdk/bin/java -server -XX:OnOutOfMemoryError='kill %p'
-Xms47924m -Xmx47924m '-verbose:gc' '-XX:+PrintGCDetails'
'-XX:+PrintGCDateStamps' '-XX:+UseConcMarkSweepGC'
'-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70'
'-XX:+CMSClassUnloadingEnabled'
-Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1442869100946_0001/container_1442869100946_0001_01_56/tmp
'-Dspark.akka.failure-detector.threshold=3000.0'
'-Dspark.akka.heartbeat.interval=1s' '-Dspark.akka.threads=4'
'-Dspark.history.ui.port=18080' '-Dspark.akka.heartbeat.pauses=6s'
'-Dspark.akka.timeout=1000s' '-Dspark.akka.frameSize=50'
'-Dspark.driver.port=52690'
-Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1442869100946_0001/container_1442869100946_0001_01_56
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url
akka.tcp://sparkDriver@10.0.24.153:52690/user/CoarseGrainedScheduler
--executor-id 55 --hostname ip-10-0-28-96.ec2.internal --cores 8 --app-id
application_1442869100946_0001 --user-class-path
file:/mnt/yarn/usercache/hadoop/appcache/application_1442869100946_0001/container_1442869100946_0001_01_56/__app__.jar
1>
/var/log/hadoop-yarn/containers/application_1442869100946_0001/container_1442869100946_0001_01_56/stdout
2>
/var/log/hadoop-yarn/containers/application_1442869100946_0001/container_1442869100946_0001_01_56/stderr



Is it possible to find out what the scratch space is used for?

What Spark setting should I try to adjust to solve the issue?

On Thu, Sep 10, 2015 at 2:52 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> YARN will never kill processes for being unresponsive.
>
> It may kill processes for occupying more memory than it allows.  To get
> around this, you can either bump spark.yarn.executor.memoryOverhead or turn
> off the memory checks entirely with yarn.nodemanager.pmem-check-enabled.
>
> -Sandy
>
> On Tue, Sep 8, 2015 at 10:48 PM, Alexander Pivovarov <apivova...@gmail.com
> > wrote:
>
>> The problem which we have now is skew data (2360 tasks 

Re: Spark on Yarn vs Standalone

2015-09-08 Thread Alexander Pivovarov
The problem which we have now is skewed data (2360 tasks done in 5 min, 3
tasks in 40 min and 1 task in 2 hours).

Some people from the team worry that the executor which runs the longest
task can be killed by YARN (because the executor might be unresponsive due
to GC, or it might occupy more memory than YARN allows).



On Tue, Sep 8, 2015 at 3:02 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> Those settings seem reasonable to me.
>
> Are you observing performance that's worse than you would expect?
>
> -Sandy
>
> On Mon, Sep 7, 2015 at 11:22 AM, Alexander Pivovarov <apivova...@gmail.com
> > wrote:
>
>> Hi Sandy
>>
>> Thank you for your reply
>> Currently we use r3.2xlarge boxes (vCPU: 8, Mem: 61 GiB)
>> with emr setting for Spark "maximizeResourceAllocation": "true"
>>
>> It is automatically converted to Spark settings
>> spark.executor.memory47924M
>> spark.yarn.executor.memoryOverhead 5324
>>
>> we also set spark.default.parallelism = slave_count * 16
>>
>> Does it look good for you? (we run single heavy job on cluster)
>>
>> Alex
>>
>> On Mon, Sep 7, 2015 at 11:03 AM, Sandy Ryza <sandy.r...@cloudera.com>
>> wrote:
>>
>>> Hi Alex,
>>>
>>> If they're both configured correctly, there's no reason that Spark
>>> Standalone should provide performance or memory improvement over Spark on
>>> YARN.
>>>
>>> -Sandy
>>>
>>> On Fri, Sep 4, 2015 at 1:24 PM, Alexander Pivovarov <
>>> apivova...@gmail.com> wrote:
>>>
>>>> Hi Everyone
>>>>
>>>> We are trying the latest aws emr-4.0.0 and Spark and my question is
>>>> about YARN vs Standalone mode.
>>>> Our usecase is
>>>> - start 100-150 nodes cluster every week,
>>>> - run one heavy spark job (5-6 hours)
>>>> - save data to s3
>>>> - stop cluster
>>>>
>>>> Officially aws emr-4.0.0 comes with Spark on Yarn
>>>> It's probably possible to hack emr by creating bootstrap script which
>>>> stops yarn and starts master and slaves on each computer  (to start Spark
>>>> in standalone mode)
>>>>
>>>> My questions are
>>>> - Does Spark standalone provides significant performance / memory
>>>> improvement in comparison to YARN mode?
>>>> - Does it worth hacking official emr Spark on Yarn and switch Spark to
>>>> Standalone mode?
>>>>
>>>>
>>>> I already created comparison table and want you to check if my
>>>> understanding is correct
>>>>
>>>> Lets say r3.2xlarge computer has 52GB ram available for Spark Executor
>>>> JVMs
>>>>
>>>> standalone to yarn comparison
>>>>
>>>>
>>>>                                                                  STDLN | YARN
>>>> can executor allocate up to 52GB ram                           -  yes  |  yes
>>>> will executor be unresponsive after using all 52GB ram
>>>> because of GC                                                  -  yes  |  yes
>>>> additional JVMs on slave except of spark executor              - workr | node mngr
>>>> are additional JVMs lightweight                                -  yes  |  yes
>>>>
>>>>
>>>> Thank you
>>>>
>>>> Alex
>>>>
>>>
>>>
>>
>


Re: Spark on Yarn vs Standalone

2015-09-07 Thread Alexander Pivovarov
Hi Sandy

Thank you for your reply.
Currently we use r3.2xlarge boxes (vCPU: 8, Mem: 61 GiB)
with the EMR Spark setting "maximizeResourceAllocation": "true"

It is automatically converted to the Spark settings
spark.executor.memory              47924M
spark.yarn.executor.memoryOverhead 5324

We also set spark.default.parallelism = slave_count * 16

Does this look good to you? (we run a single heavy job on the cluster)
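
For reference, a minimal sketch of what those values look like when set explicitly
on a SparkConf (numbers copied from above; slaveCount is a placeholder):

import org.apache.spark.SparkConf

// Illustrative only: mirrors the values maximizeResourceAllocation produced above.
val slaveCount = 100  // placeholder; use the real number of slave nodes
val conf = new SparkConf()
  .set("spark.executor.memory", "47924m")
  .set("spark.yarn.executor.memoryOverhead", "5324")
  .set("spark.default.parallelism", (slaveCount * 16).toString)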

Alex

On Mon, Sep 7, 2015 at 11:03 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> Hi Alex,
>
> If they're both configured correctly, there's no reason that Spark
> Standalone should provide performance or memory improvement over Spark on
> YARN.
>
> -Sandy
>
> On Fri, Sep 4, 2015 at 1:24 PM, Alexander Pivovarov <apivova...@gmail.com>
> wrote:
>
>> Hi Everyone
>>
>> We are trying the latest aws emr-4.0.0 and Spark and my question is about
>> YARN vs Standalone mode.
>> Our usecase is
>> - start 100-150 nodes cluster every week,
>> - run one heavy spark job (5-6 hours)
>> - save data to s3
>> - stop cluster
>>
>> Officially aws emr-4.0.0 comes with Spark on Yarn
>> It's probably possible to hack emr by creating bootstrap script which
>> stops yarn and starts master and slaves on each computer  (to start Spark
>> in standalone mode)
>>
>> My questions are
>> - Does Spark standalone provides significant performance / memory
>> improvement in comparison to YARN mode?
>> - Does it worth hacking official emr Spark on Yarn and switch Spark to
>> Standalone mode?
>>
>>
>> I already created comparison table and want you to check if my
>> understanding is correct
>>
>> Lets say r3.2xlarge computer has 52GB ram available for Spark Executor
>> JVMs
>>
>> Standalone vs YARN comparison
>>
>>                                                          STANDALONE | YARN
>> Can executor allocate up to 52GB RAM?                    yes        | yes
>> Will executor be unresponsive after using all 52GB RAM
>>   because of GC?                                         yes        | yes
>> Additional JVM on slave besides the Spark executor       Worker     | NodeManager
>> Are the additional JVMs lightweight?                     yes        | yes
>>
>>
>> Thank you
>>
>> Alex
>>
>
>


Spark on Yarn vs Standalone

2015-09-04 Thread Alexander Pivovarov
Hi Everyone

We are trying the latest AWS emr-4.0.0 with Spark, and my question is about
YARN vs standalone mode.
Our use case is:
- start a 100-150 node cluster every week,
- run one heavy Spark job (5-6 hours),
- save the data to S3,
- stop the cluster.

Officially, AWS emr-4.0.0 ships Spark on YARN.
It is probably possible to hack EMR with a bootstrap script that stops
YARN and starts a master and slaves on each machine (to run Spark in
standalone mode).

My questions are:
- Does Spark standalone provide a significant performance / memory
improvement compared to YARN mode?
- Is it worth hacking the official EMR Spark on YARN setup to switch Spark to
standalone mode?


I have already created a comparison table and would like you to check whether my
understanding is correct.

Let's say an r3.2xlarge machine has 52GB of RAM available for Spark executor JVMs.

Standalone vs YARN comparison

                                                         STANDALONE | YARN
Can executor allocate up to 52GB RAM?                    yes        | yes
Will executor be unresponsive after using all 52GB RAM
  because of GC?                                         yes        | yes
Additional JVM on slave besides the Spark executor       Worker     | NodeManager
Are the additional JVMs lightweight?                     yes        | yes


Thank you

Alex


spark-shell does not see conf folder content on emr-4

2015-09-03 Thread Alexander Pivovarov
Hi Everyone

My question is specific to running spark-1.4.1 on emr-4.0.0

spark installed to /usr/lib/spark
conf folder linked to /etc/spark/conf
spark-shell location /usr/bin/spark-shell

I noticed that when I run spark-shell it does not read the /etc/spark/conf folder
files (e.g. spark-env.sh and the log4j configuration).

To work around the problem I have to add /etc/spark/conf to SPARK_CLASSPATH:
export SPARK_CLASSPATH=/etc/spark/conf

How can I configure Spark on emr-4 to avoid the manual step of adding /etc/spark/conf
to SPARK_CLASSPATH?
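
As a quick sanity check from inside spark-shell, you can print what was actually
picked up; a small diagnostic sketch (not EMR-specific):

// Run inside spark-shell: shows the relevant environment variables
// and every Spark property the shell actually loaded.
println(sys.env.get("SPARK_CONF_DIR"))
println(sys.env.get("SPARK_CLASSPATH"))
sc.getConf.getAll.sortBy(_._1).foreach { case (k, v) => println(s"$k = $v") }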

Alex


Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-02 Thread Alexander Pivovarov
Hi Neil

Yes, it helps! I no longer see _temporary in the console output, and
saveAsTextFile is fast now.

2015-09-02 23:07:00,022 INFO  [task-result-getter-0]
scheduler.TaskSetManager (Logging.scala:logInfo(59)) - Finished task 18.0
in stage 0.0 (TID 18) in 4398 ms on ip-10-0-24-103.ec2.internal (1/24)

2015-09-02 23:07:01,887 INFO  [task-result-getter-2]
scheduler.TaskSetManager (Logging.scala:logInfo(59)) - Finished task 5.0 in
stage 0.0 (TID 5) in 6282 ms on ip-10-0-26-14.ec2.internal (24/24)

2015-09-02 23:07:01,888 INFO  [dag-scheduler-event-loop]
scheduler.DAGScheduler (Logging.scala:logInfo(59)) - ResultStage 0
(saveAsTextFile at :22) finished in 6.319 s

2015-09-02 23:07:02,123 INFO  [main] s3n.Jets3tNativeFileSystemStore
(Jets3tNativeFileSystemStore.java:storeFile(141)) - s3.putObject foo-bar
tmp/test40_141_24_406/_SUCCESS 0


Thank you!
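
For reference, the properties Neil suggests below can also be set on the running
job's Hadoop configuration; a minimal sketch (the cluster-level classification
quoted below is still the cleaner fix):

// Runtime equivalent of the mapred-site.xml entries quoted below.
sc.hadoopConfiguration.set("mapred.output.direct.EmrFileSystem", "true")
sc.hadoopConfiguration.set("mapred.output.direct.NativeS3FileSystem", "true")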

On Wed, Sep 2, 2015 at 12:54 AM, Neil Jonkers <neilod...@gmail.com> wrote:

> Hi,
>
> Can you set the following parameters in your mapred-site.xml file please:
>
>
> <property><name>mapred.output.direct.EmrFileSystem</name><value>true</value></property>
>
> <property><name>mapred.output.direct.NativeS3FileSystem</name><value>true</value></property>
>
> You can also config this at cluster launch time with the following
> Classification via EMR console:
>
>
> classification=mapred-site,properties=[mapred.output.direct.EmrFileSystem=true,mapred.output.direct.NativeS3FileSystem=true]
>
>
> Thank you
>
> On Wed, Sep 2, 2015 at 6:02 AM, Alexander Pivovarov <apivova...@gmail.com>
> wrote:
>
>> I checked previous emr config (emr-3.8)
>> mapred-site.xml has the following setting:
>> <property>
>>   <name>mapred.output.committer.class</name>
>>   <value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value>
>> </property>
>>
>>
>> On Tue, Sep 1, 2015 at 7:33 PM, Alexander Pivovarov <apivova...@gmail.com
>> > wrote:
>>
>>> Should I use DirectOutputCommitter?
>>> spark.hadoop.mapred.output.committer.class
>>>  com.appsflyer.spark.DirectOutputCommitter
>>>
>>>
>>>
>>> On Tue, Sep 1, 2015 at 4:01 PM, Alexander Pivovarov <
>>> apivova...@gmail.com> wrote:
>>>
>>>> I run spark 1.4.1 in amazom aws emr 4.0.0
>>>>
>>>> For some reason spark saveAsTextFile is very slow on emr 4.0.0 in
>>>> comparison to emr 3.8  (was 5 sec, now 95 sec)
>>>>
>>>> Actually saveAsTextFile says that it's done in 4.356 sec but after that
>>>> I see lots of INFO messages with 404 error from com.amazonaws.latency
>>>> logger for next 90 sec
>>>>
>>>> spark> sc.parallelize(List.range(0, 160),160).map(x => x + "\t" +
>>>> "A"*100).saveAsTextFile("s3n://foo-bar/tmp/test40_20")
>>>>
>>>> 2015-09-01 21:16:17,637 INFO  [dag-scheduler-event-loop]
>>>> scheduler.DAGScheduler (Logging.scala:logInfo(59)) - ResultStage 5
>>>> (saveAsTextFile at :22) finished in 4.356 s
>>>> 2015-09-01 21:16:17,637 INFO  [task-result-getter-2]
>>>> cluster.YarnScheduler (Logging.scala:logInfo(59)) - Removed TaskSet 5.0,
>>>> whose tasks have all completed, from pool
>>>> 2015-09-01 21:16:17,637 INFO  [main] scheduler.DAGScheduler
>>>> (Logging.scala:logInfo(59)) - Job 5 finished: saveAsTextFile at
>>>> :22, took 4.547829 s
>>>> 2015-09-01 21:16:17,638 INFO  [main] s3n.S3NativeFileSystem
>>>> (S3NativeFileSystem.java:listStatus(896)) - listStatus
>>>> s3n://foo-bar/tmp/test40_20/_temporary/0 with recursive false
>>>> 2015-09-01 21:16:17,651 INFO  [main] amazonaws.latency
>>>> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
>>>> Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
>>>> (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
>>>> ID: 3B2F06FD11682D22), S3 Extended Request ID:
>>>> C8T3rXVSEIk3swlwkUWJJX3gWuQx3QKC3Yyfxuhs7y0HXn3sEI9+c1a0f7/QK8BZ],
>>>> ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
>>>> AWSRequestID=[3B2F06FD11682D22], ServiceEndpoint=[
>>>> https://foo-bar.s3.amazonaws.com], Exception=1,
>>>> HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
>>>> HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.923],
>>>> HttpRequestTime=[11.388], HttpClientReceiveResponseTime=[9.544],
>>>> RequestSigningTime=[0.274], HttpClientSendRequestTime=[0.129],
>>>> 2015-09-01 21:16:17,723 INFO  [main] amazonaws.latency
>>>> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200],
>>>> ServiceName=[Amazon S3], AWS

Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-01 Thread Alexander Pivovarov
Should I use DirectOutputCommitter?
spark.hadoop.mapred.output.committer.class  com.appsflyer.spark.DirectOutputCommitter
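
If you do try that route, a minimal sketch of wiring it in; com.appsflyer.spark.DirectOutputCommitter
is the third-party class named above and would have to be on the driver and executor classpath:

import org.apache.spark.SparkConf

// Illustrative only: points the old mapred output committer at a direct
// committer so nothing is written under a _temporary prefix on S3.
val conf = new SparkConf()
  .set("spark.hadoop.mapred.output.committer.class",
       "com.appsflyer.spark.DirectOutputCommitter")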



On Tue, Sep 1, 2015 at 4:01 PM, Alexander Pivovarov <apivova...@gmail.com>
wrote:

> I run spark 1.4.1 in amazom aws emr 4.0.0
>
> For some reason spark saveAsTextFile is very slow on emr 4.0.0 in
> comparison to emr 3.8  (was 5 sec, now 95 sec)
>
> Actually saveAsTextFile says that it's done in 4.356 sec but after that I
> see lots of INFO messages with 404 error from com.amazonaws.latency logger
> for next 90 sec
>
> spark> sc.parallelize(List.range(0, 160),160).map(x => x + "\t" +
> "A"*100).saveAsTextFile("s3n://foo-bar/tmp/test40_20")
>
> 2015-09-01 21:16:17,637 INFO  [dag-scheduler-event-loop]
> scheduler.DAGScheduler (Logging.scala:logInfo(59)) - ResultStage 5
> (saveAsTextFile at :22) finished in 4.356 s
> 2015-09-01 21:16:17,637 INFO  [task-result-getter-2] cluster.YarnScheduler
> (Logging.scala:logInfo(59)) - Removed TaskSet 5.0, whose tasks have all
> completed, from pool
> 2015-09-01 21:16:17,637 INFO  [main] scheduler.DAGScheduler
> (Logging.scala:logInfo(59)) - Job 5 finished: saveAsTextFile at
> :22, took 4.547829 s
> 2015-09-01 21:16:17,638 INFO  [main] s3n.S3NativeFileSystem
> (S3NativeFileSystem.java:listStatus(896)) - listStatus
> s3n://foo-bar/tmp/test40_20/_temporary/0 with recursive false
> 2015-09-01 21:16:17,651 INFO  [main] amazonaws.latency
> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
> Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
> (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
> ID: 3B2F06FD11682D22), S3 Extended Request ID:
> C8T3rXVSEIk3swlwkUWJJX3gWuQx3QKC3Yyfxuhs7y0HXn3sEI9+c1a0f7/QK8BZ],
> ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
> AWSRequestID=[3B2F06FD11682D22], ServiceEndpoint=[
> https://foo-bar.s3.amazonaws.com], Exception=1,
> HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
> HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.923],
> HttpRequestTime=[11.388], HttpClientReceiveResponseTime=[9.544],
> RequestSigningTime=[0.274], HttpClientSendRequestTime=[0.129],
> 2015-09-01 21:16:17,723 INFO  [main] amazonaws.latency
> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200],
> ServiceName=[Amazon S3], AWSRequestID=[E5D513E52B20FF17], ServiceEndpoint=[
> https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0,
> RequestCount=1, HttpClientPoolPendingCount=0,
> HttpClientPoolAvailableCount=1, ClientExecuteTime=[71.927],
> HttpRequestTime=[53.517], HttpClientReceiveResponseTime=[51.81],
> RequestSigningTime=[0.209], ResponseProcessingTime=[17.97],
> HttpClientSendRequestTime=[0.089],
> 2015-09-01 21:16:17,756 INFO  [main] amazonaws.latency
> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
> Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
> (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
> ID: 62C6B413965447FD), S3 Extended Request ID:
> 4w5rKMWCt9EdeEKzKBXZgWpTcBZCfDikzuRrRrBxmtHYxkZyS4GxQVyADdLkgtZf],
> ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
> AWSRequestID=[62C6B413965447FD], ServiceEndpoint=[
> https://foo-bar.s3.amazonaws.com], Exception=1,
> HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
> HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.044],
> HttpRequestTime=[10.543], HttpClientReceiveResponseTime=[8.743],
> RequestSigningTime=[0.271], HttpClientSendRequestTime=[0.138],
> 2015-09-01 21:16:17,774 INFO  [main] amazonaws.latency
> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200],
> ServiceName=[Amazon S3], AWSRequestID=[F62B991825042889], ServiceEndpoint=[
> https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0,
> RequestCount=1, HttpClientPoolPendingCount=0,
> HttpClientPoolAvailableCount=1, ClientExecuteTime=[16.724],
> HttpRequestTime=[16.292], HttpClientReceiveResponseTime=[14.728],
> RequestSigningTime=[0.148], ResponseProcessingTime=[0.155],
> HttpClientSendRequestTime=[0.068],
> 2015-09-01 21:16:17,786 INFO  [main] amazonaws.latency
> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
> Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
> (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
> ID: 4846575A1C373BB9), S3 Extended Request ID:
> aw/MMKxKPmuDuxTj4GKyDbp8hgpQbTjipJBzdjdTgbwPgt5NsZS4z+tRf2bk3I2E],
> ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
> AWSRequestID=[4846575A1C373BB9], ServiceEndpoint=[
> https://foo-bar.s3.amazonaws.com], Exception=1,
> HttpClientPoolLeasedCount=0, RequestCount=1, HttpClie

spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-01 Thread Alexander Pivovarov
I run Spark 1.4.1 on Amazon AWS EMR 4.0.0.

For some reason Spark saveAsTextFile is very slow on emr-4.0.0 in
comparison to emr-3.8 (it was 5 sec, now it is 95 sec).

saveAsTextFile actually reports that it finished in 4.356 sec, but after that I
see lots of INFO messages with 404 errors from the com.amazonaws.latency logger
for the next 90 sec.

spark> sc.parallelize(List.range(0, 160),160).map(x => x + "\t" +
"A"*100).saveAsTextFile("s3n://foo-bar/tmp/test40_20")

2015-09-01 21:16:17,637 INFO  [dag-scheduler-event-loop]
scheduler.DAGScheduler (Logging.scala:logInfo(59)) - ResultStage 5
(saveAsTextFile at :22) finished in 4.356 s
2015-09-01 21:16:17,637 INFO  [task-result-getter-2] cluster.YarnScheduler
(Logging.scala:logInfo(59)) - Removed TaskSet 5.0, whose tasks have all
completed, from pool
2015-09-01 21:16:17,637 INFO  [main] scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Job 5 finished: saveAsTextFile at
:22, took 4.547829 s
2015-09-01 21:16:17,638 INFO  [main] s3n.S3NativeFileSystem
(S3NativeFileSystem.java:listStatus(896)) - listStatus
s3n://foo-bar/tmp/test40_20/_temporary/0 with recursive false
2015-09-01 21:16:17,651 INFO  [main] amazonaws.latency
(AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
(Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
ID: 3B2F06FD11682D22), S3 Extended Request ID:
C8T3rXVSEIk3swlwkUWJJX3gWuQx3QKC3Yyfxuhs7y0HXn3sEI9+c1a0f7/QK8BZ],
ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
AWSRequestID=[3B2F06FD11682D22], ServiceEndpoint=[
https://foo-bar.s3.amazonaws.com], Exception=1,
HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.923],
HttpRequestTime=[11.388], HttpClientReceiveResponseTime=[9.544],
RequestSigningTime=[0.274], HttpClientSendRequestTime=[0.129],
2015-09-01 21:16:17,723 INFO  [main] amazonaws.latency
(AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200],
ServiceName=[Amazon S3], AWSRequestID=[E5D513E52B20FF17], ServiceEndpoint=[
https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0,
RequestCount=1, HttpClientPoolPendingCount=0,
HttpClientPoolAvailableCount=1, ClientExecuteTime=[71.927],
HttpRequestTime=[53.517], HttpClientReceiveResponseTime=[51.81],
RequestSigningTime=[0.209], ResponseProcessingTime=[17.97],
HttpClientSendRequestTime=[0.089],
2015-09-01 21:16:17,756 INFO  [main] amazonaws.latency
(AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
(Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
ID: 62C6B413965447FD), S3 Extended Request ID:
4w5rKMWCt9EdeEKzKBXZgWpTcBZCfDikzuRrRrBxmtHYxkZyS4GxQVyADdLkgtZf],
ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
AWSRequestID=[62C6B413965447FD], ServiceEndpoint=[
https://foo-bar.s3.amazonaws.com], Exception=1,
HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.044],
HttpRequestTime=[10.543], HttpClientReceiveResponseTime=[8.743],
RequestSigningTime=[0.271], HttpClientSendRequestTime=[0.138],
2015-09-01 21:16:17,774 INFO  [main] amazonaws.latency
(AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200],
ServiceName=[Amazon S3], AWSRequestID=[F62B991825042889], ServiceEndpoint=[
https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0,
RequestCount=1, HttpClientPoolPendingCount=0,
HttpClientPoolAvailableCount=1, ClientExecuteTime=[16.724],
HttpRequestTime=[16.292], HttpClientReceiveResponseTime=[14.728],
RequestSigningTime=[0.148], ResponseProcessingTime=[0.155],
HttpClientSendRequestTime=[0.068],
2015-09-01 21:16:17,786 INFO  [main] amazonaws.latency
(AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
(Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
ID: 4846575A1C373BB9), S3 Extended Request ID:
aw/MMKxKPmuDuxTj4GKyDbp8hgpQbTjipJBzdjdTgbwPgt5NsZS4z+tRf2bk3I2E],
ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
AWSRequestID=[4846575A1C373BB9], ServiceEndpoint=[
https://foo-bar.s3.amazonaws.com], Exception=1,
HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.531],
HttpRequestTime=[11.134], HttpClientReceiveResponseTime=[9.434],
RequestSigningTime=[0.206], HttpClientSendRequestTime=[0.13],
2015-09-01 21:16:17,786 INFO  [main] s3n.S3NativeFileSystem
(S3NativeFileSystem.java:listStatus(896)) - listStatus
s3n://foo-bar/tmp/test40_20/_temporary/0/task_201509012116_0005_m_00
with recursive false
2015-09-01 21:16:17,798 INFO  [main] amazonaws.latency
(AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
(Service: Amazon S3; Status Code: 404; Error 

Re: spark 1.4.1 saveAsTextFile is slow on emr-4.0.0

2015-09-01 Thread Alexander Pivovarov
I checked previous emr config (emr-3.8)
mapred-site.xml has the following setting:
<property>
  <name>mapred.output.committer.class</name>
  <value>org.apache.hadoop.mapred.DirectFileOutputCommitter</value>
</property>


On Tue, Sep 1, 2015 at 7:33 PM, Alexander Pivovarov <apivova...@gmail.com>
wrote:

> Should I use DirectOutputCommitter?
> spark.hadoop.mapred.output.committer.class
>  com.appsflyer.spark.DirectOutputCommitter
>
>
>
> On Tue, Sep 1, 2015 at 4:01 PM, Alexander Pivovarov <apivova...@gmail.com>
> wrote:
>
>> I run spark 1.4.1 in amazom aws emr 4.0.0
>>
>> For some reason spark saveAsTextFile is very slow on emr 4.0.0 in
>> comparison to emr 3.8  (was 5 sec, now 95 sec)
>>
>> Actually saveAsTextFile says that it's done in 4.356 sec but after that I
>> see lots of INFO messages with 404 error from com.amazonaws.latency logger
>> for next 90 sec
>>
>> spark> sc.parallelize(List.range(0, 160),160).map(x => x + "\t" +
>> "A"*100).saveAsTextFile("s3n://foo-bar/tmp/test40_20")
>>
>> 2015-09-01 21:16:17,637 INFO  [dag-scheduler-event-loop]
>> scheduler.DAGScheduler (Logging.scala:logInfo(59)) - ResultStage 5
>> (saveAsTextFile at :22) finished in 4.356 s
>> 2015-09-01 21:16:17,637 INFO  [task-result-getter-2]
>> cluster.YarnScheduler (Logging.scala:logInfo(59)) - Removed TaskSet 5.0,
>> whose tasks have all completed, from pool
>> 2015-09-01 21:16:17,637 INFO  [main] scheduler.DAGScheduler
>> (Logging.scala:logInfo(59)) - Job 5 finished: saveAsTextFile at
>> :22, took 4.547829 s
>> 2015-09-01 21:16:17,638 INFO  [main] s3n.S3NativeFileSystem
>> (S3NativeFileSystem.java:listStatus(896)) - listStatus
>> s3n://foo-bar/tmp/test40_20/_temporary/0 with recursive false
>> 2015-09-01 21:16:17,651 INFO  [main] amazonaws.latency
>> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
>> Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
>> (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
>> ID: 3B2F06FD11682D22), S3 Extended Request ID:
>> C8T3rXVSEIk3swlwkUWJJX3gWuQx3QKC3Yyfxuhs7y0HXn3sEI9+c1a0f7/QK8BZ],
>> ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
>> AWSRequestID=[3B2F06FD11682D22], ServiceEndpoint=[
>> https://foo-bar.s3.amazonaws.com], Exception=1,
>> HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
>> HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.923],
>> HttpRequestTime=[11.388], HttpClientReceiveResponseTime=[9.544],
>> RequestSigningTime=[0.274], HttpClientSendRequestTime=[0.129],
>> 2015-09-01 21:16:17,723 INFO  [main] amazonaws.latency
>> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200],
>> ServiceName=[Amazon S3], AWSRequestID=[E5D513E52B20FF17], ServiceEndpoint=[
>> https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0,
>> RequestCount=1, HttpClientPoolPendingCount=0,
>> HttpClientPoolAvailableCount=1, ClientExecuteTime=[71.927],
>> HttpRequestTime=[53.517], HttpClientReceiveResponseTime=[51.81],
>> RequestSigningTime=[0.209], ResponseProcessingTime=[17.97],
>> HttpClientSendRequestTime=[0.089],
>> 2015-09-01 21:16:17,756 INFO  [main] amazonaws.latency
>> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[404],
>> Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found
>> (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request
>> ID: 62C6B413965447FD), S3 Extended Request ID:
>> 4w5rKMWCt9EdeEKzKBXZgWpTcBZCfDikzuRrRrBxmtHYxkZyS4GxQVyADdLkgtZf],
>> ServiceName=[Amazon S3], AWSErrorCode=[404 Not Found],
>> AWSRequestID=[62C6B413965447FD], ServiceEndpoint=[
>> https://foo-bar.s3.amazonaws.com], Exception=1,
>> HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0,
>> HttpClientPoolAvailableCount=1, ClientExecuteTime=[11.044],
>> HttpRequestTime=[10.543], HttpClientReceiveResponseTime=[8.743],
>> RequestSigningTime=[0.271], HttpClientSendRequestTime=[0.138],
>> 2015-09-01 21:16:17,774 INFO  [main] amazonaws.latency
>> (AWSRequestMetricsFullSupport.java:log(203)) - StatusCode=[200],
>> ServiceName=[Amazon S3], AWSRequestID=[F62B991825042889], ServiceEndpoint=[
>> https://foo-bar.s3.amazonaws.com], HttpClientPoolLeasedCount=0,
>> RequestCount=1, HttpClientPoolPendingCount=0,
>> HttpClientPoolAvailableCount=1, ClientExecuteTime=[16.724],
>> HttpRequestTime=[16.292], HttpClientReceiveResponseTime=[14.728],
>> RequestSigningTime=[0.148], ResponseProcessingTime=[0.155],
>> HttpClientSendRequestTime=[0.068],
>> 2015-09-01 21:16:17,786 INFO  [main] amazonaws.latency
>> (AW

Re: TimeoutException on start-slave spark 1.4.0

2015-08-28 Thread Alexander Pivovarov
I have a workaround for the issue.

As you can see from the log, it is about 15 sec between worker start and
shutdown.

A workaround is to sleep 30 sec, check whether the worker is running and, if
not, try start-slave again.

Part of the EMR Spark bootstrap Python script:

import subprocess
import time

spark_master = "spark://...:7077"  # master host elided here
# ...
curl_worker_cmd = ("curl -o /dev/null --silent --head "
                   "--write-out %{http_code} localhost:8081")
while True:
    # (re)start the worker, give it time to come up, then probe the worker web UI
    subprocess.call(["/home/hadoop/spark/sbin/start-slave.sh", spark_master])
    time.sleep(30)
    http_code = subprocess.Popen(curl_worker_cmd.split(),
                                 stdout=subprocess.PIPE).communicate()[0]
    if http_code.decode().strip() == "200":
        break

On Thu, Aug 27, 2015 at 3:07 PM, Alexander Pivovarov apivova...@gmail.com
wrote:

 I see the following error time to time when try to start slaves on spark
 1.4.0


 [hadoop@ip-10-0-27-240 apps]$ pwd
 /mnt/var/log/apps

 [hadoop@ip-10-0-27-240 apps]$ cat
 spark-hadoop-org.apache.spark.deploy.worker.Worker-1-ip-10-0-27-240.ec2.internal.out
 Spark Command: /usr/java/latest/bin/java -cp
 /home/hadoop/spark/conf/:/home/hadoop/conf/:/home/hadoop/spark/classpath/distsupplied/*:/home/hadoop/spark/classpath/emr/*:/home/hadoop/spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar:/usr/share/aws/emr/auxlib/*:/home/hadoop/.versions/spark-1.4.0.b/sbin/../conf/:/home/hadoop/.versions/spark-1.4.0.b/lib/spark-assembly-1.4.0-hadoop2.4.0.jar:/home/hadoop/.versions/spark-1.4.0.b/lib/datanucleus-core-3.2.10.jar:/home/hadoop/.versions/spark-1.4.0.b/lib/datanucleus-rdbms-3.2.9.jar:/home/hadoop/.versions/spark-1.4.0.b/lib/datanucleus-api-jdo-3.2.6.jar:/home/hadoop/conf/:/home/hadoop/conf/
 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
 -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70
 -XX:MaxHeapFreeRatio=70 -Xms2048m -Xmx2048m -XX:MaxPermSize=128m
 org.apache.spark.deploy.worker.Worker --webui-port 8081
 spark://ip-10-0-27-185.ec2.internal:7077
 
 15/08/27 21:10:25 INFO Worker: Registered signal handlers for [TERM, HUP,
 INT]
 15/08/27 21:10:26 INFO SecurityManager: Changing view acls to: hadoop
 15/08/27 21:10:26 INFO SecurityManager: Changing modify acls to: hadoop
 15/08/27 21:10:26 INFO SecurityManager: SecurityManager: authentication
 disabled; ui acls disabled; users with view permissions: Set(hadoop); users
 with modify permissions: Set(hadoop)
 15/08/27 21:10:26 INFO Slf4jLogger: Slf4jLogger started
 15/08/27 21:10:26 INFO Remoting: Starting remoting
 Exception in thread main java.util.concurrent.TimeoutException: Futures
 timed out after [1 milliseconds]
 at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at akka.remote.Remoting.start(Remoting.scala:180)
 at
 akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
 at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:618)
 at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:615)
 at akka.actor.ActorSystemImpl._start(ActorSystem.scala:615)
 at akka.actor.ActorSystemImpl.start(ActorSystem.scala:632)
 at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)
 at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)
 at
 org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:122)
 at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
 at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
 at
 org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1991)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1982)
 at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
 at
 org.apache.spark.deploy.worker.Worker$.startSystemAndActor(Worker.scala:553)
 at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:533)
 at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
 15/08/27 21:10:39 INFO Utils: Shutdown hook called
 Heap
  par new generation   total 613440K, used 338393K [0x00077800,
 0x0007a199, 0x0007a199)
   eden space 545344K,  62% used [0x00077800, 0x00078ca765b0,
 0x00079949)
   from space 68096K,   0% used [0x00079949, 0x00079949,
 0x00079d71)
   to   space 68096K,   0% used [0x00079d71, 0x00079d71,
 0x0007a199)
  concurrent mark-sweep generation total 1415616K, used 0K
 [0x0007a199, 0x0007f800, 0x0007f800)
  concurrent-mark-sweep perm gen total 21248K, used 19285K
 [0x0007f800, 0x0007f94c, 0x0008)



TimeoutException on start-slave spark 1.4.0

2015-08-27 Thread Alexander Pivovarov
I see the following error from time to time when trying to start slaves on Spark
1.4.0.


[hadoop@ip-10-0-27-240 apps]$ pwd
/mnt/var/log/apps

[hadoop@ip-10-0-27-240 apps]$ cat
spark-hadoop-org.apache.spark.deploy.worker.Worker-1-ip-10-0-27-240.ec2.internal.out
Spark Command: /usr/java/latest/bin/java -cp
/home/hadoop/spark/conf/:/home/hadoop/conf/:/home/hadoop/spark/classpath/distsupplied/*:/home/hadoop/spark/classpath/emr/*:/home/hadoop/spark/classpath/emrfs/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/common/lib/hadoop-lzo.jar:/usr/share/aws/emr/auxlib/*:/home/hadoop/.versions/spark-1.4.0.b/sbin/../conf/:/home/hadoop/.versions/spark-1.4.0.b/lib/spark-assembly-1.4.0-hadoop2.4.0.jar:/home/hadoop/.versions/spark-1.4.0.b/lib/datanucleus-core-3.2.10.jar:/home/hadoop/.versions/spark-1.4.0.b/lib/datanucleus-rdbms-3.2.9.jar:/home/hadoop/.versions/spark-1.4.0.b/lib/datanucleus-api-jdo-3.2.6.jar:/home/hadoop/conf/:/home/hadoop/conf/
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70
-XX:MaxHeapFreeRatio=70 -Xms2048m -Xmx2048m -XX:MaxPermSize=128m
org.apache.spark.deploy.worker.Worker --webui-port 8081
spark://ip-10-0-27-185.ec2.internal:7077

15/08/27 21:10:25 INFO Worker: Registered signal handlers for [TERM, HUP,
INT]
15/08/27 21:10:26 INFO SecurityManager: Changing view acls to: hadoop
15/08/27 21:10:26 INFO SecurityManager: Changing modify acls to: hadoop
15/08/27 21:10:26 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(hadoop); users
with modify permissions: Set(hadoop)
15/08/27 21:10:26 INFO Slf4jLogger: Slf4jLogger started
15/08/27 21:10:26 INFO Remoting: Starting remoting
Exception in thread main java.util.concurrent.TimeoutException: Futures
timed out after [1 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at akka.remote.Remoting.start(Remoting.scala:180)
at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:618)
at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:615)
at akka.actor.ActorSystemImpl._start(ActorSystem.scala:615)
at akka.actor.ActorSystemImpl.start(ActorSystem.scala:632)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)
at
org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:122)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
at
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1991)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1982)
at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
at
org.apache.spark.deploy.worker.Worker$.startSystemAndActor(Worker.scala:553)
at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:533)
at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
15/08/27 21:10:39 INFO Utils: Shutdown hook called
Heap
 par new generation   total 613440K, used 338393K [0x00077800,
0x0007a199, 0x0007a199)
  eden space 545344K,  62% used [0x00077800, 0x00078ca765b0,
0x00079949)
  from space 68096K,   0% used [0x00079949, 0x00079949,
0x00079d71)
  to   space 68096K,   0% used [0x00079d71, 0x00079d71,
0x0007a199)
 concurrent mark-sweep generation total 1415616K, used 0K
[0x0007a199, 0x0007f800, 0x0007f800)
 concurrent-mark-sweep perm gen total 21248K, used 19285K
[0x0007f800, 0x0007f94c, 0x0008)


Reduce number of partitions before saving to file. coalesce or repartition?

2015-08-13 Thread Alexander Pivovarov
Hi Everyone

Which one should be faster (coalesce or repartition) if I need to reduce the
number of partitions from 5000 to 3 before saving an RDD with saveAsTextFile?

Total data size is about 400MB on disk in text format

Thank you
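
A minimal sketch of the two options (paths are placeholders). coalesce(3) avoids a
shuffle but collapses the whole preceding stage to 3 tasks, while repartition(3)
pays a full shuffle and keeps the upstream parallelism; for ~400MB of text either
is usually fine, and coalesce tends to be faster here.

// Placeholder paths; illustrative only.
val rdd = sc.textFile("s3n://bucket/input")          // ~5000 partitions

// No shuffle: the read and any pipelined work also run in just 3 tasks.
rdd.coalesce(3).saveAsTextFile("s3n://bucket/out-coalesce")

// Shuffle: upstream stages keep their parallelism; only the write uses 3 tasks.
rdd.repartition(3).saveAsTextFile("s3n://bucket/out-repartition")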