Re: Spark in docker over EC2

2017-01-13 Thread Teng Qiu
Hi, you can take a look at this project: it is a distributed HA Spark cluster for AWS environments using Docker. We put the Spark EC2 instances behind an ELB, and use this code snippet to get the instance IPs: https://github.com/zalando-incubator/spark-appliance/blob/master/utils.py#L49-L56
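A hypothetical sketch of what the linked utils.py does: resolve the private IPs of the EC2 instances registered with a classic ELB. The client objects are expected to behave like boto3's "elb" and "ec2" clients (boto3.client("elb") / boto3.client("ec2")); the function name and ELB name are illustrative, not taken from the project.

```python
def get_instance_ips(elb_client, ec2_client, elb_name):
    """Return the private IPs of the instances behind a classic ELB.

    elb_client/ec2_client are assumed to follow the boto3 client API
    (describe_load_balancers / describe_instances).
    """
    desc = elb_client.describe_load_balancers(LoadBalancerNames=[elb_name])
    instance_ids = [i["InstanceId"]
                    for i in desc["LoadBalancerDescriptions"][0]["Instances"]]
    if not instance_ids:
        return []
    res = ec2_client.describe_instances(InstanceIds=instance_ids)
    return [inst["PrivateIpAddress"]
            for r in res["Reservations"]
            for inst in r["Instances"]]
```

With real boto3 this would be called as `get_instance_ips(boto3.client("elb"), boto3.client("ec2"), "my-spark-elb")`.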

Re: jdbcRDD for data ingestion from RDBMS

2016-10-18 Thread Teng Qiu
Hi Ninad, I believe the purpose of JdbcRDD is to use an RDBMS as an additional data source during data processing; the main goal of Spark is still analyzing data from HDFS-like file systems. Using Spark as a data integration tool to transfer billions of records from an RDBMS to HDFS etc. could work, but
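A rough sketch of the range partitioning behind JdbcRDD-style ingestion: the inclusive key range is split into contiguous chunks, and each chunk becomes one parallel `SELECT ... WHERE id BETWEEN lower AND upper` query. The helper below is illustrative, not Spark's actual implementation; table and column names in the comment are made up.

```python
def partition_bounds(lower, upper, num_partitions):
    """Split the inclusive key range [lower, upper] into contiguous chunks,
    one per partition, distributing any remainder over the first chunks."""
    total = upper - lower + 1
    base, extra = divmod(total, num_partitions)
    bounds, start = [], lower
    for i in range(num_partitions):
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size - 1))
        start += size
    return bounds

# In Spark itself you would let the DataFrame reader do this, roughly:
# df = spark.read.jdbc(url, "records", column="id",
#                      lowerBound=1, upperBound=10000000, numPartitions=100,
#                      properties={"user": "...", "password": "..."})
```

Each tuple maps to one task, so the partition column should be reasonably evenly distributed, otherwise a few tasks end up doing most of the work.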

Re: Access S3 buckets in multiple accounts

2016-09-28 Thread Teng Qiu
hmm, I do not believe a security group can control S3 bucket access... is this something new? Or do you mean an IAM role? @Daniel, using Spark on EMR, you should be able to use an IAM role to access AWS resources; you do not need to specify fs.s3a.access.key or fs.s3a.secret.key at all. S3A is able to use
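A sketch (untested) of how S3A can be pointed at the instance profile instead of static keys; the exact provider class and property names depend on your hadoop-aws version:

```shell
# no fs.s3a.access.key / fs.s3a.secret.key anywhere -- credentials come
# from the EC2 instance profile attached to the node
spark-submit \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider \
  your_app.py
```

For cross-account buckets, the IAM role on the instances additionally needs to be granted access in the other account's bucket policy.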

Re: Role-based S3 access outside of EMR

2016-07-21 Thread Teng Qiu
there are several reasons why AWS users do not (or cannot) use EMR. One point for us is the security compliance problem: EMR is not open source at all, so we cannot use it in our production system. Second, EMR does not support HA yet. But to the original question from @Everett: -> Credentials and

Re: standalone mode only supports FIFO scheduler across applications ? still in spark 2.0 time ?

2016-07-16 Thread Teng Qiu
> really beyond the design goals for standalone mode. If you want more in the way of multi-application resource scheduling, then you should be looking at YARN or Mesos. Is there some reason why neither of those options can work for you?

standalone mode only supports FIFO scheduler across applications ? still in spark 2.0 time ?

2016-07-15 Thread Teng Qiu
Hi, http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/spark-standalone.html#resource-scheduling says: "The standalone cluster mode currently only supports a simple FIFO scheduler across applications." Is this sentence still true? Any progress on this? It would be really helpful. some
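For what it's worth, the standalone docs describe a workaround: by default an application grabs all available cores, which is what forces the one-at-a-time FIFO behaviour; capping cores per application lets several applications run side by side (admission is still FIFO, it is not fair scheduling). A sketch, with illustrative values:

```shell
# in each application's conf/spark-defaults.conf:
spark.cores.max  8

# or set a cluster-wide default on the master instead:
# spark.deploy.defaultCores  8
```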

Re: Pros and Cons

2016-05-27 Thread Teng Qiu
yes, only for the engine, but maybe a newer version has more optimizations from the Tungsten project? at least since Spark 1.6?

Re: Pros and Cons

2016-05-27 Thread Teng Qiu
tried the Spark 2.0.0 preview, but there was no assembly jar there... then just gave up... :p 2016-05-27 17:39 GMT+02:00 Ted Yu <yuzhih...@gmail.com>: > Teng: > Why not try out the 2.0 SNAPSHOT build? > Thanks

Re: Pros and Cons

2016-05-27 Thread Teng Qiu
> vendor's product here?
> thanks, Dr Mich Talebzadeh

Re: Pros and Cons

2016-05-27 Thread Teng Qiu
I agree with Koert and Reynold: Spark works well with large datasets now. Back to the original discussion, comparing SparkSQL vs. Hive-on-Spark vs. the Spark API. For SparkSQL vs. the Spark API, you can simply imagine you are in the RDBMS world: SparkSQL is pure SQL, and the Spark API is the language for writing stored

Re: Spark on AWS

2016-05-01 Thread Teng Qiu
Hi, here we made several optimizations for accessing S3 from Spark: https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando such as: https://github.com/apache/spark/compare/branch-1.6...zalando:branch-1.6-zalando#diff-d579db9a8f27e0bbef37720ab14ec3f6R133 You can deploy

Re: [ERROR]: Spark 1.5.2 + Hbase 1.1 + Hive 1.2 + HbaseIntegration

2016-04-14 Thread Teng Qiu
> ...helped me figure out the missing dependency: was missing hbase-hadoop-compat.jar. Thanks a lot, Divya
> On 2 March 2016 at 17:05, Teng Qiu <teng...@gmail.com> wrote:
>> Hi, maybe the dependencies described in http://www.ab

Re: Plot DataFrame with matplotlib

2016-03-23 Thread Teng Qiu
> ...list and plots that list. What I am specifically asking is: is there a convenient way to plot a DataFrame object directly? (like pandas DataFrame objects)

Re: Plot DataFrame with matplotlib

2016-03-23 Thread Teng Qiu
not sure about 3d plots, but there is a nice example: https://github.com/zalando/spark-appliance/blob/master/examples/notebooks/PySpark_sklearn_matplotlib.ipynb for plotting an RDD or DataFrame using matplotlib. On Wednesday, 23 March 2016, Yavuz Nuzumlalı wrote: > Hi all, > I'm trying to plot the
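The usual pattern in such notebooks is to pull a small aggregated result to the driver and hand it to matplotlib; there is no direct `df.plot()` on a Spark DataFrame. A minimal sketch (the helper and column names are illustrative, not from the linked notebook):

```python
def columns_for_plot(rows, xcol, ycol):
    """Reshape collected row dicts into two parallel lists for plotting."""
    xs = [r[xcol] for r in rows]
    ys = [r[ycol] for r in rows]
    return xs, ys

# With Spark + matplotlib this would be used roughly as:
# rows = df.groupBy("day").count().collect()   # keep the result small!
# xs, ys = columns_for_plot([r.asDict() for r in rows], "day", "count")
# import matplotlib.pyplot as plt
# plt.plot(xs, ys); plt.show()
```

The important part is aggregating in Spark first, so that only a plottable amount of data ever reaches the driver.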

Re: Can we use spark inside a web service?

2016-03-10 Thread Teng Qiu
This really depends on how you define "hot" :) and on your use cases; Spark is definitely not one-size-fits-all, at least not yet, especially for heavy joins and full scans. Maybe Spark alone fits your production workload and analytical requirements, but in general I agree with Chris: for high

Re: [ERROR]: Spark 1.5.2 + Hbase 1.1 + Hive 1.2 + HbaseIntegration

2016-03-01 Thread Teng Qiu
and also make sure that hbase-site.xml is on your classpath on all nodes, both master and workers, and also on the client. Normally I put it into $SPARK_HOME/conf/, then the Spark cluster will be started with this conf file. btw. @Ted, did you try inserting into an HBase table with Spark's HiveContext?
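A sketch of the setup step described above (paths are examples; adjust to your layout, and repeat on every node):

```shell
# make hbase-site.xml visible to Spark
cp /etc/hbase/conf/hbase-site.xml $SPARK_HOME/conf/

# restart the standalone cluster so the conf is picked up
$SPARK_HOME/sbin/stop-all.sh && $SPARK_HOME/sbin/start-all.sh
```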

Re: Spark UI standalone "crashes" after an application finishes

2016-03-01 Thread Teng Qiu
as Gourav said, the application UI on port 4040 is no longer available after your Spark app finishes. You should go to the Spark master's UI (port 8080) and take a look at "completed applications"... refer to the docs: http://spark.apache.org/docs/latest/monitoring.html and read the first "note that" :)
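If you need the full per-application UI after an app finishes (not just the "completed applications" summary), the monitoring docs describe enabling event logging plus the history server. A sketch, with an illustrative log directory:

```shell
# conf/spark-defaults.conf:
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-events
spark.history.fs.logDirectory    hdfs:///spark-events

# then start the history server (serves finished apps on port 18080):
# $SPARK_HOME/sbin/start-history-server.sh
```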

Re: DirectFileOutputCommiter

2016-02-26 Thread Teng Qiu
> (SPARK-10063). Until this is resolved, ISTM files cannot support direct commits. Thanks
> On Fri, Feb 26, 2016 at 8:39 AM, Teng Qiu <teng...@gmail.com> wrote:
>> yes, should be this one: https://gist.github.com/aarondav/c513916e72101bbe14ec

Re: DirectFileOutputCommiter

2016-02-25 Thread Teng Qiu
> The header of DirectOutputCommitter.scala says Databricks. Did you get it from Databricks?
> On Thu, Feb 25, 2016 at 3:01 PM, Teng Qiu <teng...@gmail.com> wrote:
>> interesting in this topic as well, why is the DirectFileOutputCommitter not included? We added it

Re: DirectFileOutputCommiter

2016-02-25 Thread Teng Qiu
interesting in this topic as well: why is the DirectFileOutputCommitter not included? We added it in our fork, under core/src/main/scala/org/apache/spark/mapred/DirectOutputCommitter.scala. Moreover, this DirectFileOutputCommitter is not working for insert operations in HiveContext, since the
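A sketch of how such a committer is typically wired in for the old mapred API; the class name matches the fork's path mentioned above, but property names may differ per Spark version. Note that direct-output committers are unsafe with speculative execution, since two attempts may write the same file.

```shell
# conf/spark-defaults.conf (illustrative):
spark.hadoop.mapred.output.committer.class  org.apache.spark.mapred.DirectOutputCommitter
spark.speculation                           false
```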

Re: Newbie questions regarding log processing

2016-02-22 Thread Teng Qiu
woow, great post, very detailed. The question is what kind of "web logs" they have: if those logs are application logs, like Apache httpd logs or Oracle logs, then sure, this is a typical use case for Spark, or more generally, for the Hadoop tech stack. But if Philippe is talking about network

Is this likely to cause any problems?

2016-02-20 Thread Teng Qiu
@Daniel, there are at least 3 things that EMR cannot solve yet:
- no HA support
- AWS provides an auto-scaling feature, but scaling EMR up or down needs manual operations
- security concerns in a public VPC
EMR is basically designed for short-term-running use cases with some pre-defined bootstrap actions

Re: Is this likely to cause any problems?

2016-02-18 Thread Teng Qiu
EMR is great, but I'm curious how you are dealing with security settings with EMR; only whitelisting some IP range in a security group setting is really too weak. Are there really many production systems using EMR? For me, using EMR means everyone in my IP range (for some ISPs it may

Re: Reading CSV file using pyspark

2016-02-18 Thread Teng Qiu
download the right version of this jar http://mvnrepository.com/artifact/com.databricks/spark-csv_2.10 (or 2.11) and append it to SPARK_CLASSPATH. 2016-02-18 11:05 GMT+01:00 Devesh Raj Singh: > Hi, > I want to read a CSV file in pyspark > I am running pyspark on pycharm
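Instead of hand-managing the jar, the package can usually be pulled in at launch time; a sketch (the version number is illustrative, and the file path is made up):

```shell
pyspark --packages com.databricks:spark-csv_2.10:1.5.0

# then in the pyspark shell:
# df = sqlContext.read.format("com.databricks.spark.csv") \
#                .options(header="true", inferSchema="true") \
#                .load("data.csv")
```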

Re: Spark SQL step with many tasks takes a long time to begin processing

2016-02-17 Thread Teng Qiu
> executors for any larger reduce tasks.
> Thanks, Dillon Dukek

Re: Spark SQL step with many tasks takes a long time to begin processing

2016-02-16 Thread Teng Qiu
I believe this is a known issue when using Spark/Hive with files on S3; this huge delay on the driver side is caused by partition listing and split computation, and it is more of a Hive issue: since you are using the Thrift server, the SQL queries are running in HiveContext. Qubole made some
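One common mitigation (hedged; the effect depends on your Hive/Hadoop version) is to parallelize the input listing, and to avoid very deep partition trees on S3 in the first place:

```shell
# conf/spark-defaults.conf (illustrative value):
spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads  20
```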

Re: Scala from Jupyter

2016-02-16 Thread Teng Qiu
Hi Gourav, hi Alex, you can try this: https://github.com/zalando/spark-appliance. This Docker image (registry.opensource.zalan.do/bi/spark:1.6.0-1) is integrated with a Jupyter notebook; plugins (kernels) for Spark and R are installed, and some Python libs, like NumPy, SciPy and matplotlib, are already