Hi, you can take a look at this project: it is a distributed HA Spark
cluster for an AWS environment using Docker. We put the Spark EC2
instances behind an ELB, and use this code snippet to get the instance
IPs:
https://github.com/zalando-incubator/spark-appliance/blob/master/utils.py#L49-L56
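A rough boto3 equivalent of what that snippet does (the original uses boto; the ELB name and region below are made-up placeholders):

```python
def instance_ids_of_elb(elb_name="spark-elb", region="eu-west-1"):
    """Ask AWS which instances are registered with a classic ELB."""
    import boto3  # imported lazily so the pure helper below works without AWS
    elb = boto3.client("elb", region_name=region)
    resp = elb.describe_load_balancers(LoadBalancerNames=[elb_name])
    return extract_instance_ids(resp)

def extract_instance_ids(describe_response):
    """Pure helper: pull instance IDs out of a DescribeLoadBalancers response."""
    ids = []
    for lb in describe_response["LoadBalancerDescriptions"]:
        ids.extend(inst["InstanceId"] for inst in lb["Instances"])
    return ids
```

The IDs can then be resolved to private IPs with a second call such as `ec2.describe_instances(InstanceIds=ids)`.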
Hi Ninad, I believe the purpose of JdbcRDD is to use an RDBMS as an additional
data source during data processing; the main goal of Spark is still
analyzing data from HDFS-like file systems.
Using Spark as a data integration tool to transfer billions of records
from an RDBMS to HDFS etc. could work, but
Hmm, I do not believe a security group can control S3 bucket access... is
this something new? Or do you mean an IAM role?
@Daniel, using Spark on EMR, you should be able to use an IAM role to access
AWS resources; you do not need to specify fs.s3a.access.key or
fs.s3a.secret.key at all. S3A is able to use
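For reference, a spark-defaults.conf sketch of that IAM-role setup (assuming hadoop-aws is on the classpath; the explicit provider line is optional, since S3A's default credential chain already falls back to the EC2 instance profile):

```
spark.hadoop.fs.s3a.impl  org.apache.hadoop.fs.s3a.S3AFileSystem
# optional: pin S3A to the instance-profile credentials instead of the full chain
spark.hadoop.fs.s3a.aws.credentials.provider  com.amazonaws.auth.InstanceProfileCredentialsProvider
```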
There are several reasons that AWS users do not (or cannot) use EMR. One
point for us is a security compliance problem: EMR is not fully
open sourced, so we cannot use it in a production system. Second,
EMR does not support HA yet.
But to the original question from @Everett:
-> Credentials and
> really beyond the design goals for standalone mode. If you want
> more in the way of multi-Application resource scheduling, then you should be
> looking at Yarn or Mesos. Is there some reason why neither of those options
> can work for you?
>
> On Fri, Jul 15, 2016 at 9:15 AM, Teng
Hi,
http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/spark-standalone.html#resource-scheduling
"The standalone cluster mode currently only supports a simple FIFO
scheduler across applications."
Is this sentence still true? Any progress on this? It would be really
helpful. some
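As far as the standalone docs go, the main knob for sharing a cluster across applications is capping each app's core usage so the FIFO queue can admit several apps at once; a spark-defaults.conf sketch (the core counts are placeholders):

```
# cap each application so it does not grab the whole cluster
spark.cores.max  8
# or set a cluster-wide default (on the master) for apps that do not set spark.cores.max
spark.deploy.defaultCores  8
```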
Yes, only for the engine, but maybe newer versions have more optimizations
from the Tungsten project? At least since Spark 1.6?
> -- Forwarded message --
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> Date: 27 May 2016 at 17:09
> Subject: Re: Pros and Cons
Tried the Spark 2.0.0 preview, but there is no assembly jar there... then I just gave up... :p
2016-05-27 17:39 GMT+02:00 Ted Yu <yuzhih...@gmail.com>:
> Teng:
> Why not try out the 2.0 SNAPSHOT build?
>
> Thanks
>
>> On May 27, 2016, at 7:44 AM, Teng Qiu <teng...@gmail.com>
vendor's product here?
>
> thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 27 May 2016 at 13:05, Teng Qiu <te
I agree with Koert and Reynold, Spark works well with large datasets now.
Back to the original discussion: comparing SparkSQL vs Hive-in-Spark vs the Spark API.
For SparkSQL vs the Spark API, you can simply imagine you are in the RDBMS
world: SparkSQL is pure SQL, and the Spark API is the language for writing stored
procedures.
Hi, here we made several optimizations for accessing S3 from Spark:
https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando
such as:
https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando#diff-d579db9a8f27e0bbef37720ab14ec3f6R133
you can deploy
d me figure out the missing
> dependency.
> Was missing hbase-hadoop-compat.jar
>
> Thanks a lot,
>
> Divya
>
> On 2 March 2016 at 17:05, Teng Qiu <teng...@gmail.com> wrote:
>>
>> Hi, maybe the dependencies described in
>> http://www.ab
st and plots that list.
>
> What I am specifically asking is: is there a convenient way to plot a
> DataFrame object directly? (like pandas DataFrame objects)
>
>
> On Wed, Mar 23, 2016 at 11:47 AM Teng Qiu <teng...@gmail.com> wrote:
Not sure about a 3D plot, but there is a nice example:
https://github.com/zalando/spark-appliance/blob/master/examples/notebooks/PySpark_sklearn_matplotlib.ipynb
for plotting an RDD or DataFrame using matplotlib.
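The pattern in that notebook is essentially: collect a small sample to the driver, then hand plain Python lists to matplotlib. A minimal sketch (the (x, y) row layout and the sample size are assumptions for illustration):

```python
def to_xy(rows):
    """Pure helper: split collected (x, y) rows into two lists for matplotlib."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    return xs, ys

def plot_sample(df_or_rdd, n=1000):
    """Take a small sample of a DataFrame/RDD and scatter-plot it."""
    import matplotlib.pyplot as plt  # lazy import: only needed when plotting
    rows = df_or_rdd.take(n)  # keep the sample small -- it lands on the driver
    xs, ys = to_xy(rows)
    plt.scatter(xs, ys)
    plt.show()
```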
On Wednesday, 23 March 2016, Yavuz Nuzumlalı wrote:
> Hi all,
> I'm trying to plot the
This really depends on how you define "hot" :) and on the use cases; Spark is
definitely not one-size-fits-all. At least not yet. Especially for heavy
joins and full scans.
Maybe Spark alone fits your production workload and analytical
requirements, but in general I agree with Chris, for high
Also make sure that hbase-site.xml is on your classpath on all
nodes: master, workers, and the client as well.
Normally I put it into $SPARK_HOME/conf/, then the Spark cluster will
be started with this conf file.
Btw. @Ted, did you try inserting into an HBase table with Spark's
HiveContext?
As Gourav said, the application UI on port 4040 will no longer be available
after your Spark app has finished. You should go to the Spark master's UI
(port 8080) and take a look at "Completed Applications"...
Refer to the doc: http://spark.apache.org/docs/latest/monitoring.html
and read the first "note that" :)
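That "note that" boils down to event logging: enable it before running the app, and the finished app stays browsable afterwards (via the master UI, or the history server on port 18080). A spark-defaults.conf sketch, with a placeholder log directory:

```
spark.eventLog.enabled  true
spark.eventLog.dir      hdfs:///spark-events
# the history server (sbin/start-history-server.sh) reads the same directory
spark.history.fs.logDirectory  hdfs:///spark-events
```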
: SPARK-10063).
> Until this is resolved, ISTM files cannot support direct commits.
>
> thanks,
>
>
>
> On Fri, Feb 26, 2016 at 8:39 AM, Teng Qiu <teng...@gmail.com> wrote:
>
>> yes, should be this one
>> https://gist.github.com/aarondav/c513916e72101bbe14ec
> The header of DirectOutputCommitter.scala says Databricks.
> Did you get it from Databricks ?
> On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu <teng...@gmail.com> wrote:
Interested in this topic as well: why is the DirectFileOutputCommitter not
included?
We added it in our fork, under
core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala
Moreover, this DirectFileOutputCommitter is not working for insert
operations in HiveContext, since the
Wow, great post, very detailed. The question is: what kind of "web logs"
do they have? If those logs are application logs, like Apache httpd
logs or Oracle logs, then, sure, this is a typical use case for Spark or,
generally, for the Hadoop tech stack.
But if Philippe is talking about network
@Daniel, there are at least 3 things that EMR cannot solve yet:
- HA support
- AWS provides an auto-scaling feature, but scaling EMR up/down needs manual
operations
- security concerns in a public VPC
EMR is basically designed for short-term use cases with some
pre-defined bootstrap actions.
EMR is great, but I'm curious how you are dealing with security settings
with EMR; only whitelisting some IP range with a security group setting is
really too weak.
Are there really many production systems using EMR? For me, I feel using
EMR means everyone in my IP range (for some ISPs it may
Download the right version of this jar
http://mvnrepository.com/artifact/com.databricks/spark-csv_2.10 (or 2.11)
and append it to SPARK_CLASSPATH.
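An alternative sketch that avoids editing SPARK_CLASSPATH by hand: let pyspark resolve the package from Maven (the version 1.5.0 below is an example; pick whatever matches your Scala build):

```
# launch pyspark with the csv data source on the classpath
pyspark --packages com.databricks:spark-csv_2.10:1.5.0
```

Inside the shell the source is then available via `sqlContext.read.format("com.databricks.spark.csv")`.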
2016-02-18 11:05 GMT+01:00 Devesh Raj Singh :
> Hi,
>
> I want to read CSV file in pyspark
>
> I am running pyspark on pycharm
executors for any larger reduce tasks.
>
> Thanks,
>
> Dillon Dukek
> Software Engineer, Product Realization
> Data Products & Intelligence
> T-Mobile
> Cell: 360-316-9309
> Email: dillon.du...@t-mobi
I believe this is a known issue when using Spark/Hive with files on S3; this
huge delay on the driver side is caused by partition listing and split
computation, and it is more of a Hive issue. Since you are using the
Thrift server, the SQL queries are running in HiveContext.
Qubole made some
Hi Gourav, hi Alex, you can try this:
https://github.com/zalando/spark-appliance
This Docker image (registry.opensource.zalan.do/bi/spark:1.6.0-1) is
integrated with a Jupyter notebook; plugins (kernels) for Spark and R are
installed, and some Python libs like NumPy, SciPy, and matplotlib are already