Re: Remove dependence on HDFS

2017-02-13 Thread Calvin Jia
Hi Ben,

You can replace HDFS with a number of storage systems since Spark is
compatible with other storage like S3. This would allow you to scale your
compute nodes solely for the purpose of adding compute power and not disk
space. You can deploy Alluxio on your compute nodes to offset the
performance impact of decoupling your compute and storage, as well as unify
multiple storage spaces if you would like to still use HDFS, S3, and/or
other storage solutions in tandem. Here is an article which describes a
similar architecture.
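
To make the storage side concrete, here is a minimal Spark (Scala) sketch of
the idea, assuming the hadoop-aws (S3A) connector and the Alluxio client jar
are on the classpath; the bucket, host, and path names are hypothetical:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("decoupled-storage").getOrCreate()

    // The same DataFrame code runs against different FileSystem backends;
    // only the URI scheme changes (19998 is Alluxio's default master port).
    val events = spark.read.parquet("s3a://my-bucket/events/")
    events.write.parquet("alluxio://alluxio-master:19998/events/")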

Hope this helps,
Calvin

On Mon, Feb 13, 2017 at 12:46 AM, Saisai Shao 
wrote:

> IIUC Spark doesn't strongly bind to HDFS; it uses a common FileSystem
> layer which supports different FS implementations, and HDFS is just one
> option. You could also use S3 as a backend FS; from Spark's point of view
> it is transparent to the different FS implementations.
>
>
>
> On Sun, Feb 12, 2017 at 5:32 PM, ayan guha  wrote:
>
>> How about adding more NFS storage?
>>
>> On Sun, 12 Feb 2017 at 8:14 pm, Sean Owen  wrote:
>>
>>> Data has to live somewhere -- how do you not add storage but store more
>>> data?  Alluxio is not persistent storage, and S3 isn't on your premises.
>>>
>>> On Sun, Feb 12, 2017 at 4:29 AM Benjamin Kim  wrote:
>>>
>>> Has anyone got some advice on how to remove the reliance on HDFS for
>>> storing persistent data? We have an on-premise Spark cluster. It seems
>>> like a waste of resources to keep adding nodes only because of a lack of
>>> storage space. I would rather add more powerful nodes at a less frequent
>>> rate to address the lack of processing power than add less powerful nodes
>>> at a more frequent rate just to handle the ever-growing data. Can anyone
>>> point me in the right direction? Is Alluxio a good solution? S3? I would
>>> like to hear your thoughts.
>>>
>>> Cheers,
>>> Ben
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


Re: Question about Spark and filesystems

2016-12-19 Thread Calvin Jia
Hi,

If you are concerned with the performance of the alternative filesystems
(i.e. needing a caching client), you can use Alluxio on top of NFS, Ceph,
GlusterFS, or other/multiple storage systems. Especially since your working
sets will not be huge, you will most likely be able to store all the
relevant data within Alluxio during computation, giving you the flexibility
to store your data in your preferred storage without performance penalties.
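
As a rough sketch of what this looks like from Spark once a store such as
NFS is mounted into the Alluxio namespace (host and paths are hypothetical;
assumes an existing SparkContext sc and the Alluxio client jar on Spark's
classpath):

    // Spark addresses the mounted store through a single alluxio:// URI;
    // Alluxio's memory tier acts as the caching layer in front of it.
    val logs = sc.textFile("alluxio://alluxio-master:19998/nfs/logs/input.txt")
    val counts = logs.flatMap(_.split("\\s+")).map(w => (w, 1)).reduceByKey(_ + _)
    counts.saveAsTextFile("alluxio://alluxio-master:19998/nfs/logs/wordcounts")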

Hope this helps,
Calvin

On Sun, Dec 18, 2016 at 11:23 PM, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> I am using Gluster and I have decent performance with basic maintenance
> effort. Advantage of Gluster: you can plug Alluxio on top to improve perf,
> but I still need to validate...
>
> Le 18 déc. 2016 8:50 PM,  a écrit :
>
>> Hello,
>>
>> We are trying out Spark for some file processing tasks.
>>
>> Since each Spark worker node needs to access the same files, we have
>> tried using Hdfs. This worked, but there were some oddities making me a
>> bit uneasy. For dependency hell reasons I compiled a modified Spark, and
>> this version exhibited the odd behaviour with Hdfs. The problem might
>> have nothing to do with Hdfs, but the situation made me curious about
>> the alternatives.
>>
>> Now I'm wondering what kind of file system would be suitable for our
>> deployment.
>>
>> - There won't be a great number of nodes. Maybe 10 or so.
>>
>> - The datasets won't be big by big-data standards (maybe a couple of
>>   hundred GB)
>>
>> So maybe I could just use a NFS server, with a caching client?
>> Or should I try Ceph, or Glusterfs?
>>
>> Does anyone have any experiences to share?
>>
>> --
>> Joakim Verona
>> joa...@verona.se
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: About Spark Multiple Shared Context with Spark 2.0

2016-12-13 Thread Calvin Jia
Hi,

Alluxio will allow you to share or cache data in-memory between different
Spark contexts by storing RDDs or DataFrames as files in the Alluxio
system. The files can then be accessed by any Spark job like a file in any
other distributed storage system.

Two blogs do a good job of summarizing the end-to-end workflow of using
Alluxio to share RDDs or DataFrames between Spark jobs.
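
A minimal sketch of the pattern (host and paths are hypothetical): one
application writes a DataFrame into Alluxio, and a second application with
its own SparkSession reads it back:

    import org.apache.spark.sql.SparkSession

    // Application 1: persist a DataFrame as Parquet into Alluxio.
    val writer = SparkSession.builder().appName("writer").getOrCreate()
    val df = writer.range(0, 1000).toDF("id") // stand-in for a real dataset
    df.write.parquet("alluxio://alluxio-master:19998/shared/ids")

    // Application 2, in a separate JVM with its own SparkSession/context:
    val reader = SparkSession.builder().appName("reader").getOrCreate()
    val shared = reader.read.parquet("alluxio://alluxio-master:19998/shared/ids")
    shared.createOrReplaceTempView("ids")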

Hope this helps,
Calvin

On Tue, Dec 13, 2016 at 3:42 AM, Chetan Khatri 
wrote:

> Hello Guys,
>
> What would be the approach to accomplish a Spark multiple shared context,
> both without Alluxio and with Alluxio, and what would be the best practice
> to achieve parallelism and concurrency for Spark jobs?
>
> Thanks.
>
> --
> Yours Aye,
> Chetan Khatri.
> M. +91 7 80574
> Data Science Researcher
> INDIA
>
>


Re: sandboxing spark executors

2016-11-04 Thread Calvin Jia
Hi,

If you are using the latest Alluxio release (1.3.0), authorization is
enabled, preventing users from accessing data they do not have permission
to access. For older versions, you will need to enable the security flag.
The documentation on security has more details.
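
For illustration only, enabling the flag from a JVM client could look like
the sketch below; the property names follow my reading of the Alluxio 1.x
security docs and should be treated as assumptions to verify for your
release:

    // Assumed Alluxio 1.x property names; verify against the security docs.
    // Set these before the Alluxio client classes are first loaded.
    sys.props("alluxio.security.authentication.type") = "SIMPLE"
    sys.props("alluxio.security.authorization.permission.enabled") = "true"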

Hope this helps,
Calvin

On Fri, Nov 4, 2016 at 6:31 AM, Andrew Holway <
andrew.hol...@otternetworks.de> wrote:

> I think running it on a Mesos cluster could give you better control over
> this kinda stuff.
>
>
> On Fri, Nov 4, 2016 at 7:41 AM, blazespinnaker 
> wrote:
>
>> Is there a good method / discussion / documentation on how to sandbox a
>> Spark executor? Assume the code is untrusted and you don't want it to be
>> able to make unvalidated network connections or do unvalidated
>> alluxio/hdfs/file IO.
>>
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sanboxing-spark-executors-tp28014.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Otter Networks UG
> http://otternetworks.de
> Gotenstraße 17
> 10829 Berlin
>


Re: feasibility of ignite and alluxio for interfacing MPI and Spark

2016-09-19 Thread Calvin Jia
Hi,

Alluxio allows for data sharing between applications through a File System
API (Native Java Alluxio client, Hadoop FileSystem, or POSIX through fuse).
If your MPI applications can use any of these interfaces, you should be
able to use Alluxio for data sharing out of the box.

In terms of duplicating in-memory data, you should only need one copy in
Alluxio if you are able to stream your dataset. As for the performance of
using Alluxio to back your data compared to using Spark's native in-memory
representation, here is a blog which details
the pros and cons of the two approaches. At a high level, Alluxio
performs better with larger datasets or if you plan to use your dataset in
more than one Spark job.
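
As a sketch of the Hadoop FileSystem route (host and path are hypothetical;
assumes the Alluxio client jar is on the classpath), a non-Spark JVM process
could stream a dataset that a Spark job wrote:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Read through Alluxio's Hadoop-compatible FileSystem API, streaming
    // the data rather than materializing a second in-memory copy.
    val fs = FileSystem.get(new URI("alluxio://alluxio-master:19998/"), new Configuration())
    val in = fs.open(new Path("/shared/dataset.bin"))
    try {
      val buf = new Array[Byte](8192)
      var n = in.read(buf)
      while (n != -1) {
        // hand buf(0 until n) to the consumer (e.g. the MPI side)
        n = in.read(buf)
      }
    } finally in.close()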

Hope this helps,
Calvin


Re: TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-28 Thread Calvin Jia
Hi,

Thanks for the detailed information. How large is the dataset you are
running against? Also, did you change any Tachyon configurations?

Thanks,
Calvin

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: spark 1.6.0 on ec2 doesn't work

2016-01-19 Thread Calvin Jia
Hi Oleg,

The Tachyon related issue should be fixed.

Hope this helps,
Calvin

On Mon, Jan 18, 2016 at 2:51 AM, Oleg Ruchovets 
wrote:

> Hi,
> I tried to follow the Spark 1.6.0 guide to install Spark on EC2.
>
> It doesn't work properly: I got exceptions, and at the end a standalone
> Spark cluster was installed.
> Here is the log information:
>
> Any suggestions?
>
> Thanks
> Oleg.
>
> oleg@robinhood:~/install/spark-1.6.0-bin-hadoop2.6/ec2$ ./spark-ec2
> --key-pair=CC-ES-Demo
>  
> --identity-file=/home/oleg/work/entity_extraction_framework/ec2_pem_key/CC-ES-Demo.pem
> --region=us-east-1 --zone=us-east-1a --spot-price=0.05 -s 5
> --spark-version=1.6.0 launch entity-extraction-spark-cluster
> Setting up security groups...
> Searching for existing cluster entity-extraction-spark-cluster in region
> us-east-1...
> Spark AMI: ami-5bb18832
> Launching instances...
> Requesting 5 slaves as spot instances with price $0.050
> Waiting for spot instances to be granted...
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> 0 of 5 slaves granted, waiting longer
> All 5 slaves granted
> Launched master in us-east-1a, regid = r-9384033f
> Waiting for AWS to propagate instance metadata...
> Waiting for cluster to enter 'ssh-ready' state..
>
> Warning: SSH connection error. (This could be temporary.)
> Host: ec2-52-90-186-83.compute-1.amazonaws.com
> SSH return code: 255
> SSH output: ssh: connect to host ec2-52-90-186-83.compute-1.amazonaws.com
> port 22: Connection refused
>
> .
>
> Warning: SSH connection error. (This could be temporary.)
> Host: ec2-52-90-186-83.compute-1.amazonaws.com
> SSH return code: 255
> SSH output: ssh: connect to host ec2-52-90-186-83.compute-1.amazonaws.com
> port 22: Connection refused
>
> .
>
> Warning: SSH connection error. (This could be temporary.)
> Host: ec2-52-90-186-83.compute-1.amazonaws.com
> SSH return code: 255
> SSH output: ssh: connect to host ec2-52-90-186-83.compute-1.amazonaws.com
> port 22: Connection refused
>
> .
> Cluster is now in 'ssh-ready' state. Waited 442 seconds.
> Generating cluster's SSH key on master...
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> Connection to ec2-52-90-186-83.compute-1.amazonaws.com closed.
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> Transferring cluster's SSH key to slaves...
> ec2-54-165-243-74.compute-1.amazonaws.com
> Warning: Permanently added 
> 'ec2-54-165-243-74.compute-1.amazonaws.com,54.165.243.74'
> (ECDSA) to the list of known hosts.
> ec2-54-88-245-107.compute-1.amazonaws.com
> Warning: Permanently added 
> 'ec2-54-88-245-107.compute-1.amazonaws.com,54.88.245.107'
> (ECDSA) to the list of known hosts.
> ec2-54-172-29-47.compute-1.amazonaws.com
> Warning: Permanently added 
> 'ec2-54-172-29-47.compute-1.amazonaws.com,54.172.29.47'
> (ECDSA) to the list of known hosts.
> ec2-54-165-131-210.compute-1.amazonaws.com
> Warning: Permanently added 
> 'ec2-54-165-131-210.compute-1.amazonaws.com,54.165.131.210'
> (ECDSA) to the list of known hosts.
> ec2-54-172-46-184.compute-1.amazonaws.com
> Warning: Permanently added 
> 'ec2-54-172-46-184.compute-1.amazonaws.com,54.172.46.184'
> (ECDSA) to the list of known hosts.
> Cloning spark-ec2 scripts from
> https://github.com/amplab/spark-ec2/tree/branch-1.5 on master...
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> Cloning into 'spark-ec2'...
> remote: Counting objects: 2068, done.
> remote: Total 2068 (delta 0), reused 0 (delta 0), pack-reused 2068
> Receiving objects: 100% (2068/2068), 349.76 KiB, done.
> Resolving deltas: 100% (796/796), done.
> Connection to ec2-52-90-186-83.compute-1.amazonaws.com closed.
> Deploying files to master...
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> sending incremental file list
> root/spark-ec2/ec2-variables.sh
>
> sent 1,835 bytes  received 40 bytes  416.67 bytes/sec
> total size is 1,684  speedup is 0.90
> Running setup on master...
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> Connection to ec2-52-90-186-83.compute-1.amazonaws.com closed.
> Warning: Permanently added 
> 'ec2-52-90-186-83.compute-1.amazonaws.com,52.90.186.83'
> (ECDSA) to the list of known hosts.
> Setting up Spark on ip-172-31-24-124.ec2.internal...
> Setting executable permissions on scripts...
> RSYNC'ing /root/spark-ec2 to other cluster nodes...
> 

Re: Saving RDDs in Tachyon

2015-12-09 Thread Calvin Jia
Hi Mark,

Were you able to successfully store the RDD with Akhil's method? When you
read it back as an objectFile, you will also need to specify the correct
type.

You can find more information about integrating Spark and Tachyon on this
page: http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html
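
For reference, a small sketch of the round trip Akhil described below, with
the explicit element type on the read side (host and paths are hypothetical;
assumes an existing SparkContext sc):

    case class Record(id: Long, payload: String)

    val records = sc.parallelize(Seq(Record(1L, "a"), Record(2L, "b")))
    records.saveAsObjectFile("tachyon://tachyon-master:19998/rdds/records")

    // Note the explicit type parameter; without it the element type is
    // inferred as Nothing and the read-back is unusable.
    val restored = sc.objectFile[Record]("tachyon://tachyon-master:19998/rdds/records")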

Hope this helps,
Calvin

On Fri, Oct 30, 2015 at 7:04 AM, Akhil Das 
wrote:

> I guess you can do a .saveAsObjectFile and read it back with sc.objectFile
>
> Thanks
> Best Regards
>
> On Fri, Oct 23, 2015 at 7:57 AM, mark  wrote:
>
>> I have Avro records stored in Parquet files in HDFS. I want to read these
>> out as an RDD and save that RDD in Tachyon for any spark job that wants the
>> data.
>>
>> How do I save the RDD in Tachyon? What format do I use? Which RDD
>> 'saveAs...' method do I want?
>>
>> Thanks
>>
>
>


Re: Re: Spark RDD cache persistence

2015-12-09 Thread Calvin Jia
Hi Deepak,

For persistence across Spark jobs, you can store and access the RDDs in
Tachyon. Tachyon works with ramdisk, which would give you in-memory
performance similar to what you would have within a Spark job.

For more information, you can take a look at the docs on Tachyon-Spark
integration:
http://tachyon-project.org/documentation/Running-Spark-on-Tachyon.html
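
A sketch of the pattern (hypothetical host and paths; assumes an existing
SparkContext sc in each job): the first job writes to Tachyon before it
exits, and a later job reads the data back instead of recomputing it:

    // Job A: write the computed RDD to Tachyon before the context shuts down.
    val result = sc.textFile("hdfs://namenode:8020/raw/").map(_.toUpperCase)
    result.saveAsTextFile("tachyon://tachyon-master:19998/cache/daily")

    // Job B, a new SparkContext possibly hours later: read it back from
    // Tachyon's ramdisk-backed storage instead of recomputing.
    val cached = sc.textFile("tachyon://tachyon-master:19998/cache/daily")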

Hope this helps,
Calvin

On Thu, Nov 5, 2015 at 10:29 PM, Deenar Toraskar 
wrote:

> You can have a long-running Spark context in several fashions. This will
> ensure your data stays cached in memory. Clients can access the RDD
> through a REST API that you can expose. See the Spark Job Server; it does
> something similar. It has something called Named RDDs:
>
> Using Named RDDs
>
> Named RDDs are a way to easily share RDDs among jobs. Using this facility,
> computed RDDs can be cached with a given name and later on retrieved. To
> use this feature, the SparkJob needs to mix in NamedRddSupport.
>
> Alternatively if you use the Spark Thrift Server, any cached
> dataframes/RDDs will be available to all clients of Spark via the Thrift
> Server until it is shutdown.
>
> If you want to support key-value lookups, you might want to use IndexedRDD.
>
> Finally, though not the same as sharing RDDs, Tachyon can cache the
> underlying HDFS blocks.
>
> Deenar
>
> *Think Reactive Ltd*
> deenar.toras...@thinkreactive.co.uk
> 07714140812
>
>
>
> On 6 November 2015 at 05:56, r7raul1...@163.com 
> wrote:
>
>> You can try
>> http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html#Archival_Storage_SSD__Memory
>> Hive tmp tables use this function to speed up jobs:
>> https://issues.apache.org/jira/browse/HIVE-7313
>>
>> --
>> r7raul1...@163.com
>>
>>
>> *From:* Christian 
>> *Date:* 2015-11-06 13:50
>> *To:* Deepak Sharma 
>> *CC:* user 
>> *Subject:* Re: Spark RDD cache persistence
>> I've never had this need and I've never done it. There are options that
>> allow this. For example, I know there are web apps out there that work
>> like the Spark REPL; one of these, I think, is called Zeppelin. I've never
>> used them, but I've seen them demoed. There is also Tachyon, which Spark
>> supports. Hopefully that gives you a place to start.
>> On Thu, Nov 5, 2015 at 9:21 PM Deepak Sharma 
>> wrote:
>>
>>> Thanks Christian.
>>> So is there any inbuilt mechanism in Spark, or API integration with other
>>> in-memory cache products such as Redis, to load the RDD to these systems
>>> upon program exit?
>>> What's the best approach to have a long-lived RDD cache?
>>> Thanks
>>>
>>>
>>> Deepak
>>> On 6 Nov 2015 8:34 am, "Christian"  wrote:
>>>
 The cache gets cleared out when the job finishes. I am not aware of a
 way to keep the cache around between jobs. You could save it as an object
 file to disk and load it as an object file on your next job for speed.
 On Thu, Nov 5, 2015 at 6:17 PM Deepak Sharma 
 wrote:

> Hi All,
> I am confused about RDD persistence in cache.
> If I cache an RDD, is it going to stay in memory even after the Spark
> program which created it completes execution?
> If not, how can I guarantee that the RDD is persisted in cache even after
> the program finishes execution?
>
> Thanks
>
>
> Deepak
>

>


Re: How does Spark coordinate with Tachyon wrt data locality

2015-10-23 Thread Calvin Jia
Hi Shane,

Tachyon provides an api to get the block locations of the file which Spark
uses when scheduling tasks.
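
Mechanically, this is the standard Hadoop FileSystem locality call; here is
a sketch of what the scheduler consults (hypothetical host and path):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Spark's scheduler asks the FileSystem for block locations and prefers
    // launching a task on one of the reported hosts; Tachyon's
    // Hadoop-compatible client answers with the workers holding each block.
    val fs = FileSystem.get(new URI("tachyon://tachyon-master:19998/"), new Configuration())
    val status = fs.getFileStatus(new Path("/data/events.parquet"))
    val locations = fs.getFileBlockLocations(status, 0, status.getLen)
    locations.foreach(loc => println(loc.getHosts.mkString(", ")))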

Hope this helps,
Calvin

On Fri, Oct 23, 2015 at 8:15 AM, Kinsella, Shane 
wrote:

> Hi all,
>
>
>
> I am looking into how Spark handles data locality wrt Tachyon. My main
> concern is how this is coordinated. Will it send a task based on a file
> loaded from Tachyon to a node that it knows has that file locally, and how
> does it know which nodes have what?
>
>
>
> Kind regards,
>
> Shane
>


Re: TTL for saveAsObjectFile()

2015-10-14 Thread Calvin Jia
Hi Antonio,

I don't think Spark provides a way to pass down params with
saveAsObjectFile. One way could be to pass a default TTL in the
configuration, but the approach doesn't make much sense since TTL is not
necessarily uniform.

Baidu will be talking about their use of TTL in Tachyon with Spark in this
meetup, which may be helpful for understanding different ways to integrate.

Hope this helps,
Calvin

On Tue, Oct 13, 2015 at 1:07 PM, antoniosi  wrote:

> Hi,
>
> I am using RDD.saveAsObjectFile() to save the RDD dataset to Tachyon. In
> version 0.8, Tachyon will support TTL for saved files. Is that supported
> from Spark as well? Is there a way I could specify a TTL for a saved
> object file?
>
> Thanks.
>
> Antonio.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/TTL-for-saveAsObjectFile-tp25051.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark is in-memory processing, how then can Tachyon make Spark faster?

2015-08-07 Thread Calvin Jia
Hi,

Tachyon (http://tachyon-project.org) manages memory off-heap, which can
help prevent long GC pauses. Also, using Tachyon will allow the data to be
shared between Spark jobs if they use the same dataset.

Here's a production use case
(http://www.meetup.com/Tachyon/events/222485713/) where Baidu runs Tachyon
to get a 30x performance improvement in their Spark SQL workload.
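
As a sketch of the off-heap angle (StorageLevel.OFF_HEAP is real Spark 1.x
API; the Tachyon store URL property name varied across 1.x releases, so
treat that configuration detail as an assumption to check for your version;
assumes an existing SparkContext sc and hypothetical paths):

    import org.apache.spark.storage.StorageLevel

    // In the Spark 1.x era, OFF_HEAP persistence placed RDD blocks in
    // Tachyon, keeping them outside the JVM heap and out of the GC's reach.
    val data = sc.textFile("hdfs://namenode:8020/logs/")
    data.persist(StorageLevel.OFF_HEAP)
    data.count() // materializes the blocks off-heap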

Hope this helps,
Calvin

On Fri, Aug 7, 2015 at 9:42 AM, Muler mulugeta.abe...@gmail.com wrote:

 Spark is an in-memory engine and attempts to do computation in-memory.
 Tachyon is memory-centric distributed storage, OK, but how would that help
 run Spark faster?



Re: Spark SQL 1.3.1 saveAsParquetFile will output tachyon file with different block size

2015-04-28 Thread Calvin Jia
Hi,

You can apply this patch https://github.com/apache/spark/pull/5354 and
recompile.

Hope this helps,
Calvin

On Tue, Apr 28, 2015 at 1:19 PM, sara mustafa eng.sara.must...@gmail.com
wrote:

 Hi Zhang,

 How did you compile Spark 1.3.1 with Tachyon? When I changed the Tachyon
 version to 0.6.3 in core/pom.xml and make-distribution.sh and tried to
 compile again, many compilation errors were raised.

 Thanks,




 --
 View this message in context:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-1-3-1-saveAsParquetFile-will-output-tachyon-file-with-different-block-size-tp11561p11870.html
 Sent from the Apache Spark Developers List mailing list archive at
 Nabble.com.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org