Spark 1.x - End of life

2017-10-19 Thread Ismaël Mejía
Hello, I noticed that some of the (Big Data / Cloud Managed) Hadoop distributions are starting to (phase out / deprecate) Spark 1.x, and I was wondering if the Spark community has already decided when it will end support for Spark 1.x. I ask this also considering that the latest release in the

Metadata Management

2017-10-19 Thread Vasu Gourabathina
All: This may be off topic for Spark, but I'm sure several of you might have used some form of this as part of your Big Data implementations, so I wanted to reach out. As part of the Data Lake and Data Processing (by Spark, as an example), we might end up with different form factors for the files (via,

Re: Spark 1.x - End of life

2017-10-19 Thread Matei Zaharia
Hi Ismael, It depends on what you mean by “support”. In general, there won’t be new feature releases for 1.X (e.g. Spark 1.7) because all the new features are being added to the master branch. However, there is always room for bug fix releases if there is a catastrophic bug, and committers can

Spark Inner Join on pivoted datasets results in empty dataset

2017-10-19 Thread Anil Langote
Hi All, I have a requirement to pivot on multiple columns rather than a single column; the pivot API doesn't support doing that, hence I have been doing the pivot for the two columns separately and then trying to merge the datasets, but the result is an empty dataset. Below is the pseudo code. Main dataset => 33 columns (30
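
As a hedged illustration of the pattern being described, here is a minimal PySpark sketch; all names (key, cat1, cat2, val) are hypothetical stand-ins for the 33-column dataset in the thread:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical stand-in for the wide input dataset in the thread.
    df = spark.createDataFrame(
        [("k1", "a", "x", 1.0), ("k1", "b", "y", 2.0)],
        ["key", "cat1", "cat2", "val"])

    # Pivot on each column separately, then merge on the grouping key.
    p1 = df.groupBy("key").pivot("cat1").agg(F.sum("val"))
    p2 = df.groupBy("key").pivot("cat2").agg(F.sum("val"))

    # An inner join that comes back empty usually means the join keys do
    # not match exactly (type mismatches, nulls, or stray whitespace).
    merged = p1.join(p2, on="key", how="inner")
    merged.show()

An alternative worth checking is pivoting once on a synthetic column, e.g. F.concat_ws("_", "cat1", "cat2"), which sidesteps the join entirely.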

Re: Spark Inner Join on pivoted datasets results in empty dataset

2017-10-19 Thread Anil Langote
Is there any limit on the number of columns used in an inner join? Thank you, Anil Langote. Sent from my iPhone _ From: Anil Langote > Sent: Thursday, October 19, 2017 5:01 PM Subject: Spark Inner Join on pivoted

Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-19 Thread lucas.g...@gmail.com
I.e.: if my JDBC table has an index on it, will the optimizer consider that when pushing predicates down? I noticed this in a query like: df = spark.hiveContext.read.jdbc( url=jdbc_url, table="schema.table", column="id", lowerBound=lower_bound_id, upperBound=upper_bound_id,
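
For reference, a hedged sketch of the partitioned JDBC read being discussed; the connection details and bounds below are placeholders, not the poster's values:

    # Spark splits the scan into numPartitions parallel range queries on
    # the partition column; this controls read parallelism only and does
    # not feed index or statistics information to the optimizer.
    df = spark.read.jdbc(
        url="jdbc:mysql://db-host:3306/schema",
        table="schema.table",
        column="id",
        lowerBound=1,
        upperBound=1000000,
        numPartitions=8,
        properties={"user": "...", "password": "..."})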

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-19 Thread lucas.g...@gmail.com
OK, so when Spark is forming queries it's ignorant of the underlying storage layer's indexes. If there is an index on a table, Spark doesn't take that into account when doing the predicate pushdown in optimization. In that case, why does Spark push 2 of my conditions (where fieldx = 'action') to the
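
One way to check what actually gets pushed down is to inspect the physical plan: filters that reach the JDBC source appear under PushedFilters. The output below is illustrative (fieldx is the column from the thread; exact formatting varies by Spark version):

    df.filter(df.fieldx == "action").explain()
    # == Physical Plan ==
    # ... PushedFilters: [IsNotNull(fieldx), EqualTo(fieldx,action)] ...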

Re: jar file problem

2017-10-19 Thread Imran Rajjad
A simple way is to have a network volume mounted with the same name on every node, to make things easy. On Thu, 19 Oct 2017 at 8:24 PM Uğur Sopaoğlu wrote: > Hello, > > I have a very simple problem. Whenever I run a Spark job, I must copy the jar > file to all worker nodes. Is there a simpler way to do this?

jar file problem

2017-10-19 Thread Uğur Sopaoğlu
Hello, I have a very simple problem. Whenever I run a Spark job, I must copy the jar file to all worker nodes. Is there a simpler way to do this? -- Uğur Sopaoğlu

Re: jar file problem

2017-10-19 Thread Riccardo Ferrari
This is a good place to start from: https://spark.apache.org/docs/latest/submitting-applications.html Best, On Thu, Oct 19, 2017 at 5:24 PM, Uğur Sopaoğlu wrote: > Hello, > > I have a very simple problem. Whenever I run a Spark job, I must copy the jar file to > all worker nodes. Is

Re: jar file problem

2017-10-19 Thread Weichen Xu
Use the `bin/spark-submit --jars` option. On Thu, Oct 19, 2017 at 11:54 PM, 郭鹏飞 wrote: > You can use the bin/spark-submit tool to submit your jar to the cluster. > > > On Oct 19, 2017, at 11:24 PM, Uğur Sopaoğlu wrote: > > > > Hello, > > > > I have a very simple
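
For completeness, a sketch of the suggested invocation; the class name, master URL, and paths are hypothetical. Jars passed via --jars (a comma-separated list) are shipped to the cluster with the application, so they do not need to be copied to each worker by hand:

    bin/spark-submit \
      --class com.example.Main \
      --master spark://master-host:7077 \
      --jars /path/to/dep1.jar,/path/to/dep2.jar \
      /path/to/app.jar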

Re: jar file problem

2017-10-19 Thread 郭鹏飞
You can use the bin/spark-submit tool to submit your jar to the cluster. > On Oct 19, 2017, at 11:24 PM, Uğur Sopaoğlu wrote: > > Hello, > > I have a very simple problem. Whenever I run a Spark job, I must copy the jar > file to all worker nodes. Is there a simpler way to do this? > > -- > Uğur

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-19 Thread Mich Talebzadeh
Sorry, what do you mean by "my JDBC table has an index on it"? Where are you reading the table's data from? I assume you are referring to the "id" column of the table that you are reading through the JDBC connection. Then you are creating a temp table called "df". That temp table is created in temporary

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-19 Thread lucas.g...@gmail.com
If the underlying table(s) have indexes on them, does Spark use those indexes to optimize the query? I.e., if a table in my JDBC data source (MySQL in this case) had several indexes and my query was filtering on one of the indexed fields, would Spark know to push that predicate to the

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

2017-10-19 Thread Mich Talebzadeh
Remember, your indexes live in the RDBMS, in this case MySQL. When you are reading from that table you have an 'id' column, which I assume is an integer, and you are making parallel threads through the JDBC connection to that table. You can see the threads in MySQL if you query it. You can see multiple
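
As an illustration of those parallel threads: with column/lowerBound/upperBound/numPartitions set (say lowerBound=0, upperBound=100, numPartitions=4), Spark issues one range query per partition, roughly of the shape below. The exact SQL varies by Spark version, but these are the sessions one would see in MySQL, e.g. via SHOW PROCESSLIST, while the read runs:

    SELECT ... FROM schema.table WHERE id < 25 OR id IS NULL
    SELECT ... FROM schema.table WHERE id >= 25 AND id < 50
    SELECT ... FROM schema.table WHERE id >= 50 AND id < 75
    SELECT ... FROM schema.table WHERE id >= 75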