Re: Spark job is failing with kerberos error while creating hive context in yarn-cluster mode (through spark-submit)

2016-05-23 Thread Chandraprakash Bhagtani
Thanks, It worked !!!

On Tue, May 24, 2016 at 1:14 AM, Marcelo Vanzin  wrote:

> On Mon, May 23, 2016 at 4:41 AM, Chandraprakash Bhagtani
>  wrote:
> > I am passing hive-site.xml through --files option.
>
> You need hive-site.xml in Spark's classpath too. The easiest way is to
> copy / symlink hive-site.xml in your Spark's conf directory.
>
> --
> Marcelo
>



-- 
Thanks & Regards,
Chandra Prakash Bhagtani


Re: why spark 1.6 use Netty instead of Akka?

2016-05-23 Thread Chaoqiang
Spark actually used to depend on Akka. Unfortunately this brought in
all of Akka's dependencies (in addition to Spark's already quite
complex dependency graph) and, as Todd mentioned, led to conflicts
with projects using both Spark and Akka.

It would probably be possible to use Akka and shade it to avoid
conflicts (some additional classloading tricks may be required).
However, considering that only a small portion of Akka's features was
used and scoped quite narrowly across Spark, it isn't worth the extra
maintenance burden. Furthermore, akka-remote uses Netty internally, so
reducing dependencies to core functionality is a good thing IMO

On Mon, May 23, 2016 at 6:35 AM, Todd  wrote:
> As far as I know, there would be an Akka version conflict issue when using
> Akka as a Spark Streaming source.
>
>
>
>
>
>
> At 2016-05-23 21:19:08, "Chaoqiang"  wrote:
>>I want to know why spark 1.6 use Netty instead of Akka? Is there some
>>difficult problems which Akka can not solve, but using Netty can solve
>>easily?
>>If not, can you give me some references about this changing?
>>Thank you.
>>
>>
>>
>>--
>>View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/why-spark-1-6-use-Netty-instead-of-Akka-tp27004.html
>>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>-
>>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>For additional commands, e-mail: user-h...@spark.apache.org
>>



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/why-spark-1-6-use-Netty-instead-of-Akka-tp27005p27010.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread ayan guha
Hi

Thanks for very useful stats.

Do you have any benchmarks for using Spark as the backend engine for Hive vs.
using the Spark Thrift Server (and running Spark code for Hive queries)? We are
using the latter, but it would be very useful to remove the Thrift Server if we can.

On Tue, May 24, 2016 at 9:51 AM, Jörn Franke  wrote:

>
> Hi Mich,
>
> I think these comparisons are useful. One interesting aspect could be
> hardware scalability in this context, as well as different types of
> computations. Furthermore, one could compare Spark and Tez+LLAP as
> execution engines. I have the gut feeling that each one can be justified
> by different use cases.
> Nevertheless, there should always be a disclaimer for such comparisons,
> because Spark and Hive are not good for a lot of concurrent lookups of
> single rows. They are also not good for frequently writing small amounts of data
> (e.g. sensor data). Here HBase could be more interesting. Other use cases can
> justify graph databases, such as Titan, or text analytics/ data matching
> using Solr on Hadoop.
> Finally, even if you have a lot of data you need to think if you always
> have to process everything. For instance, I have found valid use cases in
> practice where we decided to evaluate 10 machine learning models in
> parallel on only a sample of the data and then evaluate only the "winning" model on
> the full data.
>
> As always it depends :)
>
> Best regards
>
> P.s.: at least Hortonworks has in their distribution spark 1.5 with hive
> 1.2 and spark 1.6 with hive 1.2. Maybe they have somewhere described how to
> manage bringing both together. You may check also Apache Bigtop (vendor
> neutral distribution) on how they managed to bring both together.
>
> On 23 May 2016, at 01:42, Mich Talebzadeh 
> wrote:
>
> Hi,
>
>
>
> I have done a number of extensive tests using Spark-shell with Hive DB and
> ORC tables.
>
>
>
> Now, one issue that we typically face is, and I quote:
>
>
>
> Spark is fast as it uses Memory and DAG. Great but when we save data it is
> not fast enough
>
> OK, but there is a solution now. If you use Spark with Hive and you are on
> a decent version of Hive >= 0.14, then you can also deploy Spark as the
> execution engine for Hive. That will make your application run pretty fast,
> as you no longer rely on the old MapReduce engine for Hive. In a nutshell,
> you gain speed in both querying and storage.
>
>
>
> I have made some comparisons on this set-up and I am sure some of you will
> find it useful.
>
>
>
> The version of Spark I use for Spark queries (Spark as query tool) is 1.6.
>
> The version of Hive I use is Hive 2.
>
> The version of Spark I use as the Hive execution engine is 1.3.1. It works, and
> frankly Spark 1.3.1 as an execution engine is adequate (until we sort out
> the Hadoop libraries mismatch).
>
>
>
> An example: I am using the Hive on Spark engine to find the min and max of IDs
> for a table with 1 billion rows:
>
>
>
> 0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id),
> stddev(id) from oraclehadoop.dummy;
>
> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>
>
>
>
>
> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>
>
>
> INFO  : Completed compiling
> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006);
> Time taken: 1.911 seconds
>
> INFO  : Executing
> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006):
> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>
> INFO  : Query ID =
> hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>
> INFO  : Total jobs = 1
>
> INFO  : Launching Job 1 out of 1
>
> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>
>
>
> Query Hive on Spark job[0] stages:
>
> 0
>
> 1
>
> Status: Running (Hive on Spark job[0])
>
> Job Progress Format
>
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
>
> 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>
> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
>
> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
>
> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22Stage-1_0: 0/1
>
> INFO  :
>
> Query Hive on Spark job[0] stages:
>
> INFO  : 0
>
> INFO  : 1
>
> INFO  :
>
> Status: Running (Hive on Spark job[0])
>
> INFO  : Job Progress Format
>
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
>
> INFO  : 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>
> INFO  : 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
>
> INFO  : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
>
> INFO  : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22Stage-1_0: 0/1
>
> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished   Stage-1_0: 0(+1)/1
>
> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished   Stage-1_0: 1/1
> 

Re: how to config spark thrift jdbc server high available

2016-05-23 Thread 7??
Dear Todd,
  Thank you for your reply. I don't know how to reply to your message from
Nabble.com.
Can you tell me which JIRA works on Spark Thrift Server HA, and how to reply
from Nabble.com?


Thanks a lot




-- Original --
From:  "Todd";;
Date:  Mon, May 23, 2016 09:37 PM
To:  "7??"<578967...@qq.com>; 
Cc:  "user"; 
Subject:  Re:how to config spark thrift jdbc server high available





There is a JIRA that works on Spark Thrift Server HA; the patch works, but it still 
hasn't been merged into the master branch.






At 2016-05-23 20:10:26, "qmzhang" <578967...@qq.com> wrote:
>Dear guys, please help...
>
>In hive,we can enable hiveserver2 high available by using dynamic service
>discovery for HiveServer2. But how to enable spark thriftserver high
>available?
>
>Thank you for your help
>
>--
>View this message in context: 
>http://apache-spark-user-list.1001560.n3.nabble.com/how-to-config-spark-thrift-jdbc-server-high-available-tp27003.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org

Re: how to config spark thrift jdbc server high available

2016-05-23 Thread 7??
I already found JIRA SPARK-11100.
Can you tell me how to reply from Nabble.com?


thanks a lot 


-- Original --
From:  "7??";<578967...@qq.com>;
Date:  Tue, May 24, 2016 09:59 AM
To:  "Todd"; 
Cc:  "user"; 
Subject:  Re: how to config spark thrift jdbc server high available



Dear Todd,
  Thank you for your reply. I don't know how to reply to your message from
Nabble.com.
Can you tell me which JIRA works on Spark Thrift Server HA, and how to reply
from Nabble.com?


Thanks a lot




-- Original --
From:  "Todd";;
Date:  Mon, May 23, 2016 09:37 PM
To:  "7??"<578967...@qq.com>; 
Cc:  "user"; 
Subject:  Re:how to config spark thrift jdbc server high available





There is a JIRA that works on Spark Thrift Server HA; the patch works, but it still 
hasn't been merged into the master branch.






At 2016-05-23 20:10:26, "qmzhang" <578967...@qq.com> wrote:
>Dear guys, please help...
>
>In hive,we can enable hiveserver2 high available by using dynamic service
>discovery for HiveServer2. But how to enable spark thriftserver high
>available?
>
>Thank you for your help
>
>--
>View this message in context: 
>http://apache-spark-user-list.1001560.n3.nabble.com/how-to-config-spark-thrift-jdbc-server-high-available-tp27003.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread Jörn Franke

Hi Mich,

I think these comparisons are useful. One interesting aspect could be hardware 
scalability in this context, as well as different types of computations. 
Furthermore, one could compare Spark and Tez+LLAP as execution engines. I have 
the gut feeling that each one can be justified by different use cases.
Nevertheless, there should always be a disclaimer for such comparisons, because 
Spark and Hive are not good for a lot of concurrent lookups of single rows. 
They are also not good for frequently writing small amounts of data (e.g. sensor data). 
Here HBase could be more interesting. Other use cases can justify graph 
databases, such as Titan, or text analytics/ data matching using Solr on Hadoop.
Finally, even if you have a lot of data you need to think if you always have to 
process everything. For instance, I have found valid use cases in practice 
where we decided to evaluate 10 machine learning models in parallel on only a 
sample of the data and then evaluate only the "winning" model on the full data.

As always it depends :) 

Best regards

P.s.: at least Hortonworks has in their distribution spark 1.5 with hive 1.2 
and spark 1.6 with hive 1.2. Maybe they have somewhere described how to manage 
bringing both together. You may check also Apache Bigtop (vendor neutral 
distribution) on how they managed to bring both together.

> On 23 May 2016, at 01:42, Mich Talebzadeh  wrote:
> 
> Hi,
>  
> I have done a number of extensive tests using Spark-shell with Hive DB and 
> ORC tables.
>  
> Now, one issue that we typically face is, and I quote:
>  
> Spark is fast as it uses Memory and DAG. Great but when we save data it is 
> not fast enough
> 
> OK, but there is a solution now. If you use Spark with Hive and you are on a 
> decent version of Hive >= 0.14, then you can also deploy Spark as the execution 
> engine for Hive. That will make your application run pretty fast, as you no 
> longer rely on the old MapReduce engine for Hive. In a nutshell, you gain 
> speed in both querying and storage.
>  
> I have made some comparisons on this set-up and I am sure some of you will 
> find it useful.
>  
> The version of Spark I use for Spark queries (Spark as query tool) is 1.6.
> The version of Hive I use is Hive 2.
> The version of Spark I use as the Hive execution engine is 1.3.1. It works, and 
> frankly Spark 1.3.1 as an execution engine is adequate (until we sort out the 
> Hadoop libraries mismatch).
>  
> An example: I am using the Hive on Spark engine to find the min and max of IDs for 
> a table with 1 billion rows:
>  
> 0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id), 
> stddev(id) from oraclehadoop.dummy;
> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>  
>  
> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>  
> INFO  : Completed compiling 
> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); 
> Time taken: 1.911 seconds
> INFO  : Executing 
> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006): 
> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
> INFO  : Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
> INFO  : Total jobs = 1
> INFO  : Launching Job 1 out of 1
> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>  
> Query Hive on Spark job[0] stages:
> 0
> 1
> Status: Running (Hive on Spark job[0])
> Job Progress Format
> CurrentTime StageId_StageAttemptId: 
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
> [StageCost]
> 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22Stage-1_0: 0/1
> INFO  :
> Query Hive on Spark job[0] stages:
> INFO  : 0
> INFO  : 1
> INFO  :
> Status: Running (Hive on Spark job[0])
> INFO  : Job Progress Format
> CurrentTime StageId_StageAttemptId: 
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount 
> [StageCost]
> INFO  : 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
> INFO  : 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
> INFO  : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
> INFO  : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22Stage-1_0: 0/1
> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished   Stage-1_0: 0(+1)/1
> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished   Stage-1_0: 1/1 
> Finished
> Status: Finished successfully in 53.25 seconds
> OK
> INFO  : 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished   Stage-1_0: 
> 0(+1)/1
> INFO  : 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished   Stage-1_0: 
> 1/1 Finished
> INFO  : Status: Finished successfully in 53.25 seconds
> INFO  : Completed executing 
> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006); 
> Time taken: 56.337 

RE: Timed aggregation in Spark

2016-05-23 Thread Ewan Leith
Rather than open a connection per record, if you do a DStream foreachRDD at the 
end of a 5 minute batch window

http://spark.apache.org/docs/latest/streaming-programming-guide.html#output-operations-on-dstreams

then you can do an rdd.foreachPartition to get the RDD partitions. Open a 
connection to Vertica (or a pool of them) inside that foreachPartition, then do a 
partition.foreach to write each element from that partition to Vertica, before 
finally closing the pool of connections, as in the sketch below.
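
A rough Scala sketch of that pattern (the Vertica JDBC URL, credentials, table name
and the (key, total) shape of the stream are placeholder assumptions, not taken
from this thread):

import java.sql.DriverManager
import org.apache.spark.streaming.dstream.DStream

// Placeholder connection details and table; adjust for your Vertica setup.
def writeToVertica(aggregated: DStream[(String, Long)]): Unit = {
  aggregated.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      // one connection per partition, not one per record
      val conn = DriverManager.getConnection(
        "jdbc:vertica://vertica-host:5433/mydb", "dbuser", "dbpass")
      val stmt = conn.prepareStatement(
        "INSERT INTO aggregates (k, total) VALUES (?, ?)")
      try {
        partition.foreach { case (k, total) =>
          stmt.setString(1, k)
          stmt.setLong(2, total)
          stmt.executeUpdate()
        }
      } finally {
        stmt.close()
        conn.close()
      }
    }
  }
}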

Hope this helps,
Ewan

From: Nikhil Goyal [mailto:nownik...@gmail.com]
Sent: 23 May 2016 21:55
To: Ofir Kerker 
Cc: user@spark.apache.org
Subject: Re: Timed aggregation in Spark

I don't think this is solving the problem. So here are the issues:
1) How do we push entire data to vertica. Opening a connection per record will 
be too costly
2) If a key doesn't come again, how do we push this to vertica
3) How do we schedule the dumping of data to avoid loading too much data in 
state.



On Mon, May 23, 2016 at 1:33 PM, Ofir Kerker 
> wrote:
Yes, check out mapWithState:
https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-apache-spark-streaming.html

_
From: Nikhil Goyal >
Sent: Monday, May 23, 2016 23:28
Subject: Timed aggregation in Spark
To: >


Hi all,

I want to aggregate my data for 5-10 min and then flush the aggregated data to 
some database like vertica. updateStateByKey is not exactly helpful in this 
scenario as I can't flush all the records at once, neither can I clear the 
state. I wanted to know if anyone else has faced a similar issue and how did 
they handle it.

Thanks
Nikhil




Re: Hive_context

2016-05-23 Thread Arun Natva
Can you try a Hive JDBC Java client from Eclipse and query a Hive table
successfully?

That way we can narrow down where the issue is.
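
For example, a minimal standalone check along these lines (host, port and user are
placeholders for your cluster; a kerberized cluster also needs the principal appended
to the JDBC URL):

import java.sql.DriverManager

object HiveJdbcCheck {
  def main(args: Array[String]): Unit = {
    // Hive JDBC driver must be on the classpath
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection(
      "jdbc:hive2://hiveserver2-host:10000/default", "user", "")
    val rs = conn.createStatement().executeQuery("show tables")
    while (rs.next()) println(rs.getString(1))
    conn.close()
  }
}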


Sent from my iPhone

> On May 23, 2016, at 5:26 PM, Ajay Chander  wrote:
> 
> I downloaded the Spark 1.5 utilities and exported SPARK_HOME pointing to it. 
> I copied all the cluster configuration files(hive-site.xml, hdfs-site.xml etc 
> files) inside the ${SPARK_HOME}/conf/ . My application looks like below,
> 
> 
> public class SparkSqlTest {
> 
> public static void main(String[] args) {
> 
> 
> 
> SparkConf sc = new SparkConf().setAppName("SQL_Test").setMaster("local");
> 
> JavaSparkContext jsc = new JavaSparkContext(sc);
> 
> HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(jsc.sc());
> 
> DataFrame sampleDataFrame = hiveContext.sql("show tables");
> 
> sampleDataFrame.show();
> 
> 
> 
> }
> 
> }
> 
> 
> 
> I am expecting my application to return all the tables from the default 
> database. But somehow it returns empty list. I am just wondering if I need to 
> add anything to my code to point it to hive metastore. Thanks for your time. 
> Any pointers are appreciated.
> 
> 
> 
> Regards,
> 
> Aj
> 
> 
> 
>> On Monday, May 23, 2016, Ajay Chander  wrote:
>> Hi Everyone,
>> 
>> I am building a Java Spark application in eclipse IDE. From my application I 
>> want to use hiveContext to read tables from the remote Hive(Hadoop cluster). 
>> On my machine I have exported $HADOOP_CONF_DIR = {$HOME}/hadoop/conf/. This 
>> path has all the remote cluster conf details like hive-site.xml, 
>> hdfs-site.xml ... Somehow I am not able to communicate to remote cluster 
>> from my app. Is there any additional configuration work that I am supposed 
>> to do to get it work? I specified master as 'local' in the code. Thank you.
>> 
>> Regards,
>> Aj


Re: Hive_context

2016-05-23 Thread Ajay Chander
I downloaded the Spark 1.5 utilities and exported SPARK_HOME pointing to
it. I copied all the cluster configuration files(hive-site.xml,
hdfs-site.xml etc files) inside the ${SPARK_HOME}/conf/ . My application
looks like below,


public class SparkSqlTest {

public static void main(String[] args) {


SparkConf sc = new SparkConf().setAppName("SQL_Test").setMaster("local");

JavaSparkContext jsc = new JavaSparkContext(sc);

HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(jsc
.sc());

DataFrame sampleDataFrame = hiveContext.sql("show tables");

sampleDataFrame.show();


}

}


I am expecting my application to return all the tables from the default
database. But somehow it returns empty list. I am just wondering if I need
to add anything to my code to point it to hive metastore. Thanks for your
time. Any pointers are appreciated.


Regards,

Aj


On Monday, May 23, 2016, Ajay Chander  wrote:

> Hi Everyone,
>
> I am building a Java Spark application in eclipse IDE. From my application
> I want to use hiveContext to read tables from the remote Hive(Hadoop
> cluster). On my machine I have exported $HADOOP_CONF_DIR =
> {$HOME}/hadoop/conf/. This path has all the remote cluster conf details
> like hive-site.xml, hdfs-site.xml ... Somehow I am not able to communicate
> to remote cluster from my app. Is there any additional configuration work
> that I am supposed to do to get it work? I specified master as 'local' in
> the code. Thank you.
>
> Regards,
> Aj
>


Re: Timed aggregation in Spark

2016-05-23 Thread Nikhil Goyal
I don't think this is solving the problem. So here are the issues:
1) How do we push entire data to vertica. Opening a connection per record
will be too costly
2) If a key doesn't come again, how do we push this to vertica
3) How do we schedule the dumping of data to avoid loading too much data in
state.



On Mon, May 23, 2016 at 1:33 PM, Ofir Kerker  wrote:

> Yes, check out mapWithState:
>
> https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-apache-spark-streaming.html
>
> _
> From: Nikhil Goyal 
> Sent: Monday, May 23, 2016 23:28
> Subject: Timed aggregation in Spark
> To: 
>
>
>
> Hi all,
>
> I want to aggregate my data for 5-10 min and then flush the aggregated
> data to some database like vertica. updateStateByKey is not exactly helpful
> in this scenario as I can't flush all the records at once, neither can I
> clear the state. I wanted to know if anyone else has faced a similar issue
> and how did they handle it.
>
> Thanks
> Nikhil
>
>
>


Re: Timed aggregation in Spark

2016-05-23 Thread Ofir Kerker
Yes, check out mapWithState:
https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-apache-spark-streaming.html
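
A rough Scala sketch of that approach (Spark 1.6 API; the socket source, batch
interval, checkpoint path and 10-minute timeout are placeholder assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, State, StateSpec, StreamingContext}

object TimedAggregation {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("timed-agg").setMaster("local[2]"), Seconds(10))
    ssc.checkpoint("/tmp/timed-agg-checkpoint")  // required by mapWithState

    val pairs = ssc.socketTextStream("localhost", 9999).map(line => (line, 1L))

    // Keep a running sum per key; keys idle for 10 minutes time out and are
    // emitted one last time, which is the hook for flushing them downstream.
    val spec = StateSpec.function(
      (key: String, value: Option[Long], state: State[Long]) => {
        val sum = value.getOrElse(0L) + state.getOption.getOrElse(0L)
        if (!state.isTimingOut()) state.update(sum)
        (key, sum)
      }).timeout(Minutes(10))

    val aggregated = pairs.mapWithState(spec)

    // stateSnapshots() exposes the full (key, sum) state each batch; it can be
    // written out with foreachRDD / foreachPartition instead of print().
    aggregated.stateSnapshots().print()

    ssc.start()
    ssc.awaitTermination()
  }
}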

_
From: Nikhil Goyal 
Sent: Monday, May 23, 2016 23:28
Subject: Timed aggregation in Spark
To:  


Hi all,
I want to aggregate my data for 5-10 min and then flush the aggregated data to 
some database like vertica. updateStateByKey is not exactly helpful in this 
scenario as I can't flush all the records at once, neither can I clear the 
state. I wanted to know if anyone else has faced a similar issue and how did 
they handle it.
Thanks
Nikhil


  

Timed aggregation in Spark

2016-05-23 Thread Nikhil Goyal
Hi all,

I want to aggregate my data for 5-10 min and then flush the aggregated data
to some database like vertica. updateStateByKey is not exactly helpful in
this scenario as I can't flush all the records at once, neither can I clear
the state. I wanted to know if anyone else has faced a similar issue and
how did they handle it.

Thanks
Nikhil


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread Ashok Kumar
Hi Dr Mich,
This is very good news. I will be interested to know how Hive engages with 
Spark as an engine. What Spark processes are used to make this work? 
Thanking you 

On Monday, 23 May 2016, 19:01, Mich Talebzadeh  
wrote:
 

Have a look at this thread.

Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
On 23 May 2016 at 09:10, Mich Talebzadeh  wrote:

Hi Timur and everyone.
I will answer your first question as it is very relevant
1) How to make 2 versions of Spark live together on the same cluster (libraries 
clash, paths, etc.) ? 
Most of the Spark users perform ETL, ML operations on Spark as well. So, we may 
have 3 Spark installations simultaneously

There are two distinct points here.

Using Spark as a query engine. That is BAU and most forum members use it
every day. You run Spark with either Standalone, Yarn or Mesos as the cluster
manager. You start the master that does the management of resources, and you
start slaves to create workers.

You deploy Spark either by Spark-shell, Spark-sql or by submitting jobs through
spark-submit etc. You may or may not use Hive as your database. You may use
Hbase via Phoenix etc. If you choose to use Hive as your database, on every host
of the cluster including your master host, you ensure that the Hive APIs are
installed (meaning Hive is installed). In $SPARK_HOME/conf, you create a soft link:

cd $SPARK_HOME/conf
hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6/conf> ltr hive-site.xml
lrwxrwxrwx 1 hduser hadoop 32 May  3 17:48 hive-site.xml -> /usr/lib/hive/conf/hive-site.xml

Now in hive-site.xml you can define all the parameters needed for Spark
connectivity. Remember we are making Hive use the Spark 1.3.1 engine. WE ARE NOT
RUNNING SPARK 1.3.1 AS A QUERY TOOL. We do not need to start a master or workers
for Spark 1.3.1! It is just an execution engine, like mr etc.

Let us look at how we do that in hive-site.xml, noting the settings for
hive.execution.engine=spark and spark.home=/usr/lib/spark-1.3.1-bin-hadoop2
below. That tells Hive to use Spark 1.3.1 as the execution engine. You just
install Spark 1.3.1 on the host (just the binary download); it is in
/usr/lib/spark-1.3.1-bin-hadoop2.6.
In hive-site.xml, you set the properties:

  <property>
    <name>hive.execution.engine</name>
    <value>spark</value>
    <description>
      Expects one of [mr, tez, spark].
      Chooses execution engine. Options are: mr (Map reduce, default), tez, spark. While MR
      remains the default engine for historical reasons, it is itself a historical engine
      and is deprecated in Hive 2 line. It may be removed without further warning.
    </description>
  </property>

  <property>
    <name>spark.home</name>
    <value>/usr/lib/spark-1.3.1-bin-hadoop2</value>
    <description>something</description>
  </property>

  <property>
    <name>hive.merge.sparkfiles</name>
    <value>false</value>
    <description>Merge small files at the end of a Spark DAG Transformation</description>
  </property>

  <property>
    <name>hive.spark.client.future.timeout</name>
    <value>60s</value>
    <description>
      Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec),
      which is sec if not specified.
      Timeout for requests from Hive client to remote Spark driver.
    </description>
  </property>

  <property>
    <name>hive.spark.job.monitor.timeout</name>
    <value>60s</value>
    <description>
      Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec),
      which is sec if not specified.
      Timeout for job monitor to get Spark job state.
    </description>
  </property>

  <property>
    <name>hive.spark.client.connect.timeout</name>
    <value>1000ms</value>
    <description>
      Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec),
      which is msec if not specified.
      Timeout for remote Spark driver in connecting back to Hive client.
    </description>
  </property>

  <property>
    <name>hive.spark.client.server.connect.timeout</name>
    <value>9ms</value>
    <description>
      Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec),
      which is msec if not specified.
      Timeout for handshake between Hive client and remote Spark driver. Checked by both processes.
    </description>
  </property>

  <property>
    <name>hive.spark.client.secret.bits</name>
    <value>256</value>
    <description>Number of bits of randomness in the generated secret for communication
      between Hive client and remote Spark driver. Rounded down to the nearest multiple of 8.</description>
  </property>

  <property>
    <name>hive.spark.client.rpc.threads</name>
    <value>8</value>
    <description>Maximum number of threads for remote Spark driver's RPC event loop.</description>
  </property>
And other settings as well
That was the Hive stuff for your Spark BAU. So there are two distinct things. 
Now going to Hive itself, you will need to add the correct assembly jar file 
for Hadoop. These are called 
spark-assembly-x.y.z-hadoop2.4.0.jar 
Where x.y.z in this case is 1.3.1 
The assembly file is
spark-assembly-1.3.1-hadoop2.4.0.jar
So you add that spark-assembly-1.3.1-hadoop2.4.0.jar to $HIVE_HOME/lib
ls $HIVE_HOME/lib/spark-assembly-1.3.1-hadoop2.4.0.jar
/usr/lib/hive/lib/spark-assembly-1.3.1-hadoop2.4.0.jar
And you need to compile spark from source excluding Hadoop dependencies 
./make-distribution.sh --name "hadoop2-without-hive" --tgz 
"-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"

So 

Hive_context

2016-05-23 Thread Ajay Chander
Hi Everyone,

I am building a Java Spark application in eclipse IDE. From my application
I want to use hiveContext to read tables from the remote Hive(Hadoop
cluster). On my machine I have exported $HADOOP_CONF_DIR =
{$HOME}/hadoop/conf/. This path has all the remote cluster conf details
like hive-site.xml, hdfs-site.xml ... Somehow I am not able to communicate
to remote cluster from my app. Is there any additional configuration work
that I am supposed to do to get it work? I specified master as 'local' in
the code. Thank you.

Regards,
Aj


Re: Dataset kryo encoder fails on Collections$UnmodifiableCollection

2016-05-23 Thread Amit Sela
See SPARK-15489 
I'll try to figure this one out as well. Any leads? "Immediate suspects"?
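
For reference, registering those serializers with Spark's Kryo normally looks like
the sketch below (class and property names are from the kryo-serializers project
and standard Spark config); as noted above, in this Dataset / Encoders.kryo case it
did not help:

import com.esotericsoftware.kryo.Kryo
import de.javakaffee.kryoserializers.UnmodifiableCollectionsSerializer
import de.javakaffee.kryoserializers.guava.ImmutableListSerializer
import org.apache.spark.serializer.KryoRegistrator

// Registered via spark.kryo.registrator so Spark's own Kryo instances pick it up.
class CollectionsRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    UnmodifiableCollectionsSerializer.registerSerializers(kryo)
    ImmutableListSerializer.registerSerializers(kryo)
  }
}

// Usage:
//   new SparkConf()
//     .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//     .set("spark.kryo.registrator", classOf[CollectionsRegistrator].getName)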

Thanks,
Amit

On Mon, May 23, 2016 at 10:27 PM Michael Armbrust 
wrote:

> Can you open a JIRA?
>
> On Sun, May 22, 2016 at 2:50 PM, Amit Sela  wrote:
>
>> I've been using Encoders with Kryo to support encoding of generically
>> typed Java classes, mostly with success, in the following manner:
>>
>> public static  Encoder encoder() {
>>   return Encoders.kryo((Class) Object.class);
>> }
>>
>> But at some point I got a decoding exception "Caused by:
>> java.lang.UnsupportedOperationException
>> at java.util.Collections$UnmodifiableCollection.add..."
>>
>> This seems to be because of Guava's `ImmutableList`.
>>
>> I tried registering `UnmodifiableCollectionsSerializer` and `
>> ImmutableListSerializer` from: https://github.com/magro/kryo-serializers
>> but it didn't help.
>>
>> Ideas ?
>>
>> Thanks,
>> Amit
>>
>
>


Re: Spark job is failing with kerberos error while creating hive context in yarn-cluster mode (through spark-submit)

2016-05-23 Thread Marcelo Vanzin
On Mon, May 23, 2016 at 4:41 AM, Chandraprakash Bhagtani
 wrote:
> I am passing hive-site.xml through --files option.

You need hive-site.xml in Spark's classpath too. The easiest way is to
copy / symlink hive-site.xml in your Spark's conf directory.

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark job is failing with kerberos error while creating hive context in yarn-cluster mode (through spark-submit)

2016-05-23 Thread Chandraprakash Bhagtani
Thanks Doug,

I have all the 4 configs (mentioned by you) already in my hive-site.xml. Do
I need to create a hive-site.xml in spark conf directory (it is not there
by default in 1.6.1)? Please suggest.


On Mon, May 23, 2016 at 9:53 PM, Doug Balog 
wrote:

> I have a custom hive-site.xml for Spark in Spark's conf directory.
> These properties are the minimal ones that you need for Spark, I believe.
>
> hive.metastore.kerberos.principal = copy from your hive-site.xml,  i.e.
> "hive/_h...@foo.com"
> hive.metastore.uris = copy from your hive-site.xml,  i.e. thrift://
> ms1.foo.com:9083
> hive.metastore.sasl.enabled = true
> hive.security.authorization.enabled = false
>
> Cheers,
>
> Doug
>
>
>
> > On May 23, 2016, at 7:41 AM, Chandraprakash Bhagtani <
> cpbhagt...@gmail.com> wrote:
> >
> > Hi,
> >
> > My Spark job is failing with kerberos issues while creating hive context
> in yarn-cluster mode. However it is running with yarn-client mode. My spark
> version is 1.6.1
> >
> > I am passing hive-site.xml through --files option.
> >
> > I tried searching online and found that the same issue is fixed with the
> following jira SPARK-6207. it is fixed in spark 1.4, but I am running 1.6.1
> >
> > Am i missing any configuration here?
> >
> >
> > --
> > Thanks & Regards,
> > Chandra Prakash Bhagtani
>
>


-- 
Thanks & Regards,
Chandra Prakash Bhagtani


Re: Dataset kryo encoder fails on Collections$UnmodifiableCollection

2016-05-23 Thread Michael Armbrust
Can you open a JIRA?

On Sun, May 22, 2016 at 2:50 PM, Amit Sela  wrote:

> I've been using Encoders with Kryo to support encoding of generically
> typed Java classes, mostly with success, in the following manner:
>
> public static  Encoder encoder() {
>   return Encoders.kryo((Class) Object.class);
> }
>
> But at some point I got a decoding exception "Caused by:
> java.lang.UnsupportedOperationException
> at java.util.Collections$UnmodifiableCollection.add..."
>
> This seems to be because of Guava's `ImmutableList`.
>
> I tried registering `UnmodifiableCollectionsSerializer` and `
> ImmutableListSerializer` from: https://github.com/magro/kryo-serializers
> but it didn't help.
>
> Ideas ?
>
> Thanks,
> Amit
>


Re: Dataset API and avro type

2016-05-23 Thread Michael Armbrust
If you are using the kryo encoder, you can only use it to map to/from
kryo-encoded binary data. This is because Spark does not understand kryo's
encoding; it's just using it as an opaque blob of bytes.
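
A small Scala sketch of what that means in practice (the Payload class and names
here are illustrative only, not from this thread):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Encoder, Encoders, SQLContext}

// A type Spark's built-in encoders don't handle; kryo stores it as one binary column.
case class Payload(id: String, tags: java.util.List[String])

object KryoBlobExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("kryo-blob"))
    val sqlContext = new SQLContext(sc)

    val payloadEnc: Encoder[Payload] = Encoders.kryo[Payload]
    val ds = sqlContext.createDataset(Seq(
      Payload("a", java.util.Arrays.asList("x", "y"))))(payloadEnc)

    // OK: Spark hands the opaque kryo bytes back, deserializes to Payload, and we
    // map to a type it does understand (String has a built-in encoder).
    ds.map(_.id)(Encoders.STRING).collect().foreach(println)

    // Not OK: the physical schema is a single binary column, so projecting named
    // fields or casting to another typed schema cannot resolve, e.g.:
    //   ds.toDF().select("id")   // fails: no such column, schema is [value: binary]
  }
}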

On Mon, May 23, 2016 at 1:28 AM, Han JU  wrote:

> Just one more question: is Dataset supposed to be able to cast data to an
> avro type? For a very simple format (a string and a long), I can cast it to
> a tuple or case class, but not an avro type (also contains only a string
> and a long).
>
> The error is like this for this very simple type:
>
> === Result of Batch Resolution ===
> !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0,
> string])) null else input[0, string].toString, if (isnull(input[1,
> bigint])) null else input[1, bigint],
> StructField(auctionId,StringType,true), StructField(ts,LongType,true)),
> auctionId#0, ts#1L) AS #2]   Project [createexternalrow(if
> (isnull(auctionId#0)) null else auctionId#0.toString, if (isnull(ts#1L))
> null else ts#1L, StructField(auctionId,StringType,true),
> StructField(ts,LongType,true)) AS #2]
>  +- LocalRelation [auctionId#0,ts#1L]
>
>
> +- LocalRelation
> [auctionId#0,ts#1L]
>
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Try to
> map struct to Tuple1, but failed as the number
> of fields does not line up.
>  - Input schema: struct
>  - Target schema: struct;
> at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.org
> $apache$spark$sql$catalyst$encoders$ExpressionEncoder$$fail$1(ExpressionEncoder.scala:267)
> at
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.validate(ExpressionEncoder.scala:281)
> at org.apache.spark.sql.Dataset.(Dataset.scala:201)
> at org.apache.spark.sql.Dataset.(Dataset.scala:168)
> at org.apache.spark.sql.Dataset$.apply(Dataset.scala:57)
> at org.apache.spark.sql.Dataset.as(Dataset.scala:366)
> at Datasets$.delayedEndpoint$Datasets$1(Datasets.scala:35)
> at Datasets$delayedInit$body.apply(Datasets.scala:23)
> at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
> at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
> at scala.App$$anonfun$main$1.apply(App.scala:76)
> at scala.App$$anonfun$main$1.apply(App.scala:76)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
> at scala.App$class.main(App.scala:76)
> at Datasets$.main(Datasets.scala:23)
> at Datasets.main(Datasets.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
>
> 2016-05-22 22:02 GMT+02:00 Michael Armbrust :
>
>> That's definitely a bug.  If you can come up with a small reproduction it
>> would be great if you could open a JIRA.
>> On May 22, 2016 12:21 PM, "Han JU"  wrote:
>>
>>> Hi Michael,
>>>
>>> The error is like this under 2.0.0-preview. In 1.6.1 the error is very
>>> similar if not exactly the same.
>>> The file is a parquet file containing avro objects.
>>>
>>> Thanks!
>>>
>>> Caused by: java.util.concurrent.ExecutionException: java.lang.Exception:
>>> failed to compile: org.codehaus.commons.compiler.CompileException: File
>>> 'generated.java', Line 25, Column 160: No applicable constructor/method
>>> found for actual parameters "org.apache.spark.sql.catalyst.InternalRow";
>>> candidates are: "public static java.nio.ByteBuffer
>>> java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer
>>> java.nio.ByteBuffer.wrap(byte[], int, int)"
>>> /* 001 */
>>> /* 002 */ public java.lang.Object generate(Object[] references) {
>>> /* 003 */   return new SpecificSafeProjection(references);
>>> /* 004 */ }
>>> /* 005 */
>>> /* 006 */ class SpecificSafeProjection extends
>>> org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
>>> /* 007 */
>>> /* 008 */   private Object[] references;
>>> /* 009 */   private MutableRow mutableRow;
>>> /* 010 */   private org.apache.spark.serializer.KryoSerializerInstance
>>> serializer;
>>> /* 011 */
>>> /* 012 */
>>> /* 013 */   public SpecificSafeProjection(Object[] references) {
>>> /* 014 */ this.references = references;
>>> /* 015 */ mutableRow = (MutableRow) references[references.length -
>>> 1];
>>> /* 016 */ serializer =
>>> (org.apache.spark.serializer.KryoSerializerInstance) new
>>> org.apache.spark.serializer.KryoSerializer(new
>>> org.apache.spark.SparkConf()).newInstance();
>>> /* 017 */   }
>>> /* 018 */
>>> /* 019 */   public java.lang.Object apply(java.lang.Object _i) {
>>> /* 020 */ 

Re: why spark 1.6 use Netty instead of Akka?

2016-05-23 Thread Jakob Odersky
Spark actually used to depend on Akka. Unfortunately this brought in
all of Akka's dependencies (in addition to Spark's already quite
complex dependency graph) and, as Todd mentioned, led to conflicts
with projects using both Spark and Akka.

It would probably be possible to use Akka and shade it to avoid
conflicts (some additional classloading tricks may be required).
However, considering that only a small portion of Akka's features was
used and scoped quite narrowly across Spark, it isn't worth the extra
maintenance burden. Furthermore, akka-remote uses Netty internally, so
reducing dependencies to core functionality is a good thing IMO

On Mon, May 23, 2016 at 6:35 AM, Todd  wrote:
> As far as I know, there would be an Akka version conflict issue when using
> Akka as a Spark Streaming source.
>
>
>
>
>
>
> At 2016-05-23 21:19:08, "Chaoqiang"  wrote:
>>I want to know why spark 1.6 use Netty instead of Akka? Is there some
>>difficult problems which Akka can not solve, but using Netty can solve
>>easily?
>>If not, can you give me some references about this changing?
>>Thank you.
>>
>>
>>
>>--
>>View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/why-spark-1-6-use-Netty-instead-of-Akka-tp27004.html
>>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>-
>>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>For additional commands, e-mail: user-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread Mich Talebzadeh
Have a look at this thread

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 23 May 2016 at 09:10, Mich Talebzadeh  wrote:

> Hi Timur and everyone.
>
> I will answer your first question as it is very relevant
>
> 1) How to make 2 versions of Spark live together on the same cluster
> (libraries clash, paths, etc.) ?
> Most of the Spark users perform ETL, ML operations on Spark as well. So,
> we may have 3 Spark installations simultaneously
>
> There are two distinct points here.
>
> Using Spark as a  query engine. That is BAU and most forum members use it
> everyday. You run Spark with either Standalone, Yarn or Mesos as Cluster
> managers. You start master that does the management of resources and you
> start slaves to create workers.
>
>  You deploy Spark either by Spark-shell, Spark-sql or submit jobs through
> spark-submit etc. You may or may not use Hive as your database. You may use
> Hbase via Phoenix etc
> If you choose to use Hive as your database, on every host of cluster
> including your master host, you ensure that Hive APIs are installed
> (meaning Hive installed). In $SPARK_HOME/conf, you create a soft link to
> cd $SPARK_HOME/conf
> hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6/conf> ltr hive-site.xml
> lrwxrwxrwx 1 hduser hadoop 32 May  3 17:48 hive-site.xml ->
> /usr/lib/hive/conf/hive-site.xml
> Now in hive-site.xml you can define all the parameters needed for Spark
> connectivity. Remember we are making Hive use spark1.3.1  engine. WE ARE
> NOT RUNNING SPARK 1.3.1 AS A QUERY TOOL. We do not need to start master or
> workers for Spark 1.3.1! It is just an execution engine like mr etc.
>
> Let us look at how we do that in hive-site.xml, noting the settings for
> hive.execution.engine=spark and spark.home=/usr/lib/spark-1.3.1-bin-hadoop2
> below. That tells Hive to use spark 1.3.1 as the execution engine. You just
> install spark 1.3.1 on the host just the binary download it is
> /usr/lib/spark-1.3.1-bin-hadoop2.6
>
> In hive-site.xml, you set the properties:
>
>   <property>
>     <name>hive.execution.engine</name>
>     <value>spark</value>
>     <description>
>       Expects one of [mr, tez, spark].
>       Chooses execution engine. Options are: mr (Map reduce, default), tez, spark. While MR
>       remains the default engine for historical reasons, it is itself a historical engine
>       and is deprecated in Hive 2 line. It may be removed without further warning.
>     </description>
>   </property>
>
>   <property>
>     <name>spark.home</name>
>     <value>/usr/lib/spark-1.3.1-bin-hadoop2</value>
>     <description>something</description>
>   </property>
>
>   <property>
>     <name>hive.merge.sparkfiles</name>
>     <value>false</value>
>     <description>Merge small files at the end of a Spark DAG Transformation</description>
>   </property>
>
>   <property>
>     <name>hive.spark.client.future.timeout</name>
>     <value>60s</value>
>     <description>
>       Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec),
>       which is sec if not specified.
>       Timeout for requests from Hive client to remote Spark driver.
>     </description>
>   </property>
>
>   <property>
>     <name>hive.spark.job.monitor.timeout</name>
>     <value>60s</value>
>     <description>
>       Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec),
>       which is sec if not specified.
>       Timeout for job monitor to get Spark job state.
>     </description>
>   </property>
>
>   <property>
>     <name>hive.spark.client.connect.timeout</name>
>     <value>1000ms</value>
>     <description>
>       Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec),
>       which is msec if not specified.
>       Timeout for remote Spark driver in connecting back to Hive client.
>     </description>
>   </property>
>
>   <property>
>     <name>hive.spark.client.server.connect.timeout</name>
>     <value>9ms</value>
>     <description>
>       Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec),
>       which is msec if not specified.
>       Timeout for handshake between Hive client and remote Spark driver. Checked by both processes.
>     </description>
>   </property>
>
>   <property>
>     <name>hive.spark.client.secret.bits</name>
>     <value>256</value>
>     <description>Number of bits of randomness in the generated secret for communication
>       between Hive client and remote Spark driver. Rounded down to the nearest multiple of 8.</description>
>   </property>
>
>   <property>
>     <name>hive.spark.client.rpc.threads</name>
>     <value>8</value>
>     <description>Maximum number of threads for remote Spark driver's RPC event loop.</description>
>   </property>
>
> And other settings as well
>
> That was the Hive stuff for your Spark BAU. So there are two distinct
> things. Now going to Hive itself, you will need to add the correct assembly
> jar file for Hadoop. These are called
>
> spark-assembly-x.y.z-hadoop2.4.0.jar
>
> Where x.y.z in this case is 1.3.1
>
> The assembly file is
>
> spark-assembly-1.3.1-hadoop2.4.0.jar
>
> So you add that spark-assembly-1.3.1-hadoop2.4.0.jar to $HIVE_HOME/lib
>
> ls $HIVE_HOME/lib/spark-assembly-1.3.1-hadoop2.4.0.jar
> /usr/lib/hive/lib/spark-assembly-1.3.1-hadoop2.4.0.jar
>
> And you need to compile spark from source excluding Hadoop dependencies
>
>
> ./make-distribution.sh --name "hadoop2-without-hive" --tgz "-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"

Re: How to set the degree of parallelism in Spark SQL?

2016-05-23 Thread Xinh Huynh
To the original question of parallelism and executors: you can have a
parallelism of 200, even with 2 executors. In the Spark UI, you should see
that the number of _tasks_ is 200 when your job involves shuffling.
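
For example, a small Scala sketch (Spark 1.x API; note the "=" that was missing in
the original statement, and that executor count is a separate submit-time setting):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ShufflePartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-partitions").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Controls how many shuffle tasks a stage uses, not how many executors run them.
    sqlContext.sql("SET spark.sql.shuffle.partitions=200")
    // equivalently:
    sqlContext.setConf("spark.sql.shuffle.partitions", "200")

    // Executor count is set at submit time, e.g.: spark-submit --num-executors 10 ...
  }
}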

Executors vs. tasks:
http://spark.apache.org/docs/latest/cluster-overview.html

Xinh

On Mon, May 23, 2016 at 5:48 AM, Mathieu Longtin 
wrote:

> Since the default is 200, I would guess you're only running 2 executors.
> Try to verify how many executor you are actually running with the web
> interface (port 8080 where the master is running).
>
> On Sat, May 21, 2016 at 11:42 PM Ted Yu  wrote:
>
>> Looks like an equal sign is missing between partitions and 200.
>>
>> On Sat, May 21, 2016 at 8:31 PM, SRK  wrote:
>>
>>> Hi,
>>>
>>> How to set the degree of parallelism in Spark SQL? I am using the
>>> following
>>> but it somehow seems to allocate only two executors at a time.
>>>
>>>  sqlContext.sql(" set spark.sql.shuffle.partitions  200  ")
>>>
>>> Thanks,
>>> Swetha
>>>
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-the-degree-of-parallelism-in-Spark-SQL-tp26996.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>> --
> Mathieu Longtin
> 1-514-803-8977
>


Re: Spark job is failing with kerberos error while creating hive context in yarn-cluster mode (through spark-submit)

2016-05-23 Thread Doug Balog
I have a custom hive-site.xml for Spark in Spark's conf directory.
These properties are the minimal ones that you need for Spark, I believe.

hive.metastore.kerberos.principal = copy from your hive-site.xml,  i.e.  
"hive/_h...@foo.com"
hive.metastore.uris = copy from your hive-site.xml,  i.e. 
thrift://ms1.foo.com:9083
hive.metastore.sasl.enabled = true
hive.security.authorization.enabled = false
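
Laid out as hive-site.xml entries, that minimal file would look roughly like this
(the principal and metastore URI below are placeholders to be copied from your
cluster's hive-site.xml):

<configuration>
  <property>
    <name>hive.metastore.kerberos.principal</name>
    <value>hive/_HOST@EXAMPLE.COM</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore-host:9083</value>
  </property>
  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>hive.security.authorization.enabled</name>
    <value>false</value>
  </property>
</configuration>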

Cheers,

Doug



> On May 23, 2016, at 7:41 AM, Chandraprakash Bhagtani  
> wrote:
> 
> Hi,
> 
> My Spark job is failing with kerberos issues while creating hive context in 
> yarn-cluster mode. However it is running with yarn-client mode. My spark 
> version is 1.6.1
> 
> I am passing hive-site.xml through --files option. 
> 
> I tried searching online and found that the same issue is fixed with the 
> following jira SPARK-6207. it is fixed in spark 1.4, but I am running 1.6.1
> 
> Am i missing any configuration here?
> 
> 
> -- 
> Thanks & Regards,
> Chandra Prakash Bhagtani


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: What is the minimum value allowed for StreamingContext's Seconds parameter?

2016-05-23 Thread Mich Talebzadeh
depends on what you are using it for. Three parameters are important:


   1. Batch interval
   2. WindowsDuration
   3. SlideDuration

Batch interval is the basic interval at which the system will receive the
data in batches. This is the interval set when creating a StreamingContext.
For example, if you set the batch interval as 2 second, then any input
DStream will generate RDDs of received data at 2 second intervals.
A window operator is defined by two parameters -
WindowDuration / WindowsLength - the length of the window
SlideDuration / SlidingInterval - the interval at which the window will
slide or move forward

Generally speaking, the larger the batch window, the better the overall
performance, but the streaming data output will be updated less
frequently. You will likely run into problems setting your batch window <
0.5 sec, and/or when the batch window < the amount of time it takes to run
the task.
Beyond that, the window length and sliding interval need to be multiples of
the batch window, but will depend entirely on your reporting requirements.

Consider
batch window = 10 secs
window length = 300 seconds
sliding interval = 60 seconds

In this scenario, you will be creating an output every 60 seconds,
aggregating data that you were collecting every 10 seconds from the source
over a previous 300 seconds

If you were trying to create continuously streaming output as fast as
possible (for example for complex event processing, see below), then you
would probably (almost always) be setting your sliding interval = batch
window and then shrinking the batch window as short as possible.

Example

val sparkConf = new SparkConf().
 setAppName("CEP_streaming").
 set("spark.driver.allowMultipleContexts", "true").
 set("spark.hadoop.validateOutputSpecs", "false")
val ssc = new StreamingContext(sparkConf, Seconds(2))

// window length - the duration of the window; must be a multiple of the batch
// interval n in StreamingContext(sparkConf, Seconds(n))
val windowLength = 4
// sliding interval - the interval at which the window operation is performed;
// in other words, data is collected within this "previous interval"
val slidingInterval = 2  // keep this the same as the batch window for continuous streaming.
// You are aggregating data that you are collecting over the batch window.

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 23 May 2016 at 16:32, nsalian  wrote:

> Thanks for the question.
> What kind of data rate are you expecting to receive?
>
>
>
>
> -
> Neelesh S. Salian
> Cloudera
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-minimum-value-allowed-for-StreamingContext-s-Seconds-parameter-tp27007p27008.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: What is the minimum value allowed for StreamingContext's Seconds parameter?

2016-05-23 Thread nsalian
Thanks for the question.
What kind of data rate are you expecting to receive?




-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-minimum-value-allowed-for-StreamingContext-s-Seconds-parameter-tp27007p27008.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



What is the minimum value allowed for StreamingContext's Seconds parameter?

2016-05-23 Thread YaoPau
Just wondering how small the microbatches can be, and any best practices on
the smallest value that should be used in production.  For example, any
issue with running it at 0.01 seconds?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-minimum-value-allowed-for-StreamingContext-s-Seconds-parameter-tp27007.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to map values read from test file to 2 different RDDs

2016-05-23 Thread Deepak Sharma
Hi
I am reading a text file with 16 fields.
All the placeholders for the values of this text file have been defined in
say 2 different case classes:
Case1 and Case2

How do I map the values read from the text file, so my function in Scala is
able to return 2 different RDDs, with each RDD of one of these 2 different
case class types?
E.g. the first 11 fields mapped to Case1 while the rest 6 fields mapped to Case2.
Any pointer here or code snippet would be really helpful.


-- 
Thanks
Deepak


TFIDF question

2016-05-23 Thread Pasquinell Urbani
Hi all,

I'm following a TF-IDF example but I'm having some issues that I'm not
sure how to fix.

The input is the following

val test = sc.textFile("s3n://.../test_tfidf_products.txt")
test.collect.mkString("\n")

which prints

test: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[370] at textFile
at :121 res241: String = a a b c d e b c d d

After that, I compute the ratings by doing

val test2 = test.map(_.split(" ").toSeq)
val hashingTF2 = new HashingTF()
val tf2: RDD[Vector] = hashingTF2.transform(test2)
tf2.cache()
val idf2 = new IDF().fit(tf2)
val tfidf2: RDD[Vector] = idf2.transform(tf2)
val expandedText = idfModel.transform(tf)
tfidf2.collect.mkString("\n")

which prints

(1048576,[97,98,99,100,101],[0.8109302162163288,0.0,0.0,0.0,0.4054651081081644])
(1048576,[98,99,100],[0.0,0.0,0.0])

The numbers [97,98,99,100,101] are indexes of the vector tfidf2.

I need to access the rating, for example for item “a”, but the only way I
have been able to do this is by using the method indexOf() of the hashingTF
object.

hashingTF2.indexOf("a")

res236: Int = 97


Is there a better way to perform this?


Thank you all.


sqlContext.read.format("libsvm") not working with spark 1.6+

2016-05-23 Thread dbspace
I have downloaded the precompiled version of Apache Spark and tried to use
sqlContext.read.format("libsvm"), but I get
java.lang.ClassNotFoundException: Failed to load class for data source:
libsvm. I have also posted this in the Stack Overflow forum ("libsvm stackoverflow"),
but I have not received any response yet. Does anyone know how to fix this?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/sqlContext-read-format-libsvm-not-working-with-spark-1-6-tp27006.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Error making REST call from streaming app

2016-05-23 Thread Afshartous, Nick

Hi,


We got the following exception trying to initiate a REST call from the Spark 
app.


This is running Spark 1.5.2 in AWS / YARN. It's only happened one time during
the course of a streaming app that has been running for months.


Just curious if anyone could shed some more light on root cause.


Thanks,

--

Nick


> User class threw exception: org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 323 in stage 15154.0 failed 4 times, most recent 
> failure: Lost task 323.3 in stage 15154.0 (TID 2010826, 
> ip-10-247-128-182.ec2.internal): 
> com.sun.jersey.spi.service.ServiceConfigurationError: 
> com.sun.jersey.spi.inject.InjectableProvider: : 
> java.io.FileNotFoundException: 
> /mnt/yarn/usercache/hadoop/appcache/application_1452625196513_0026/container_1452625196513_0026_02_03/__app__.jar
>  (No such file or directory)
> at com.sun.jersey.spi.service.ServiceFinder.fail(ServiceFinder.java:610)
> at com.sun.jersey.spi.service.ServiceFinder.parse(ServiceFinder.java:682)
> at com.sun.jersey.spi.service.ServiceFinder.access$500(ServiceFinder.java:159)
> at 
> com.sun.jersey.spi.service.ServiceFinder$AbstractLazyIterator.hasNext(ServiceFinder.java:739)
> at 
> com.sun.jersey.spi.service.ServiceFinder.toClassArray(ServiceFinder.java:595)
> at 
> com.sun.jersey.core.spi.component.ProviderServices.getServiceClasses(ProviderServices.java:318)
> at 
> com.sun.jersey.core.spi.component.ProviderServices.getProviderAndServiceClasses(ProviderServices.java:297)
> at 
> com.sun.jersey.core.spi.component.ProviderServices.getProvidersAndServices(ProviderServices.java:204)
> at 
> com.sun.jersey.core.spi.factory.InjectableProviderFactory.configure(InjectableProviderFactory.java:106)
> at com.sun.jersey.api.client.Client.init(Client.java:263)
> at com.sun.jersey.api.client.Client.access$000(Client.java:118)
> at com.sun.jersey.api.client.Client$1.f(Client.java:191)
> at com.sun.jersey.api.client.Client$1.f(Client.java:187)
> at com.sun.jersey.spi.inject.Errors.processWithErrors(Errors.java:193)
> at com.sun.jersey.api.client.Client.(Client.java:187)
> at com.sun.jersey.api.client.Client.(Client.java:159)
> at com.sun.jersey.api.client.Client.create(Client.java:669)
> at 
> com.wb.analytics.schemaservice.fingerprint.FingerPrintRestClient.getSchema(FingerPrintRestClient.java:48)
> at 
> com.wb.analytics.schemaservice.fingerprint.FingerPrintService.getSchemaFromService(FingerPrintService.java:80)




Re: Spark job is failing with kerberos error while creating hive context in yarn-cluster mode (through spark-submit)

2016-05-23 Thread Ted Yu
Can you describe the kerberos issues in more detail ?

Which release of YARN are you using ?

Cheers

On Mon, May 23, 2016 at 4:41 AM, Chandraprakash Bhagtani <
cpbhagt...@gmail.com> wrote:

> Hi,
>
> My Spark job is failing with kerberos issues while creating hive context
> in yarn-cluster mode. However it is running with yarn-client mode. My spark
> version is 1.6.1
>
> I am passing hive-site.xml through --files option.
>
> I tried searching online and found that the same issue is fixed with the
> following jira SPARK-6207. it is fixed in spark 1.4, but I am running 1.6.1
>
> Am i missing any configuration here?
>
>
> --
> Thanks & Regards,
> Chandra Prakash Bhagtani
>


Re:how to config spark thrift jdbc server high available

2016-05-23 Thread Todd


There is a JIRA that works on Spark Thrift Server HA; the patch works, but it still 
hasn't been merged into the master branch.






At 2016-05-23 20:10:26, "qmzhang" <578967...@qq.com> wrote:
>Dear guys, please help...
>
>In hive,we can enable hiveserver2 high available by using dynamic service
>discovery for HiveServer2. But how to enable spark thriftserver high
>available?
>
>Thank you for your help
>
>
>
>
>--
>View this message in context: 
>http://apache-spark-user-list.1001560.n3.nabble.com/how-to-config-spark-thrift-jdbc-server-high-available-tp27003.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>


Re:why spark 1.6 use Netty instead of Akka?

2016-05-23 Thread Todd
As far as I know, there would be Akka version conflict issues when using
Akka as a Spark Streaming source.








At 2016-05-23 21:19:08, "Chaoqiang"  wrote:
>I want to know why spark 1.6 use Netty instead of Akka? Is there some
>difficult problems which Akka can not solve, but using Netty can solve
>easily?
>If not, can you give me some references about this changing?
>Thank you.
>
>
>
>--
>View this message in context: 
>http://apache-spark-user-list.1001560.n3.nabble.com/why-spark-1-6-use-Netty-instead-of-Akka-tp27004.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>-
>To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>For additional commands, e-mail: user-h...@spark.apache.org
>



why spark 1.6 use Netty instead of Akka?

2016-05-23 Thread Chaoqiang
I want to know why Spark 1.6 uses Netty instead of Akka. Are there some
difficult problems which Akka cannot solve, but which Netty can solve
easily?
If not, can you give me some references about this change?
Thank you.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/why-spark-1-6-use-Netty-instead-of-Akka-tp27004.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to set the degree of parallelism in Spark SQL?

2016-05-23 Thread Mathieu Longtin
Since the default is 200, I would guess you're only running 2 executors.
Try to verify how many executors you are actually running with the web
interface (port 8080 where the master is running).

On Sat, May 21, 2016 at 11:42 PM Ted Yu  wrote:

> Looks like an equal sign is missing between partitions and 200.
>
> On Sat, May 21, 2016 at 8:31 PM, SRK  wrote:
>
>> Hi,
>>
>> How to set the degree of parallelism in Spark SQL? I am using the
>> following
>> but it somehow seems to allocate only two executors at a time.
>>
>>  sqlContext.sql(" set spark.sql.shuffle.partitions  200  ")
>>
>> Thanks,
>> Swetha
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-the-degree-of-parallelism-in-Spark-SQL-tp26996.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
> --
Mathieu Longtin
1-514-803-8977
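
(As Ted Yu points out above, the statement is missing an equals sign. A minimal sketch of the corrected call, assuming the Scala shell:

  sqlContext.sql("SET spark.sql.shuffle.partitions=200")
  // or equivalently, without going through SQL:
  sqlContext.setConf("spark.sql.shuffle.partitions", "200")

Note that this only controls the number of partitions used for shuffles in Spark SQL; how many executors you actually get is governed by the resources you request, e.g. --num-executors on YARN.)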


how to config spark thrift jdbc server high available

2016-05-23 Thread qmzhang
Dear guys, please help...

In Hive, we can enable HiveServer2 high availability by using dynamic service
discovery for HiveServer2. But how do we enable Spark Thrift Server high
availability?

Thank you for your help




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/how-to-config-spark-thrift-jdbc-server-high-available-tp27003.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



odd python.PythonRunner Times values?

2016-05-23 Thread Adrian Bridgett

I'm seeing output like this on our mesos spark slaves:

16/05/23 11:44:04 INFO python.PythonRunner: Times: total = 1137, boot = 
-590, init = 593, finish = 1134
16/05/23 11:44:04 INFO python.PythonRunner: Times: total = 1652, boot = 
-446, init = 481, finish = 1617


This seems to be coming from pyspark/worker.py; however, it looks like these values
should be printed as milliseconds (as longs) - and they certainly
don't look that way in the output above!
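
(If I remember correctly, the executor side just prints differences of millisecond timestamps reported by the Python worker, so the units are indeed milliseconds held in longs. An approximate reconstruction, not the actual source:

  // Scala sketch: all four inputs are assumed to be epoch milliseconds
  def formatTimes(startTime: Long, bootTime: Long, initTime: Long, finishTime: Long): String = {
    val boot   = bootTime - startTime     // goes negative if the reported boot time predates startTime
    val init   = initTime - bootTime
    val finish = finishTime - initTime
    val total  = finishTime - startTime
    "Times: total = %s, boot = %s, init = %s, finish = %s".format(total, boot, init, finish)
  }

On that reading, a negative boot value just means the worker reported a boot timestamp earlier than the JVM-side start time, for example with reused Python workers, rather than a formatting bug.)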




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark streaming: issue with logging with separate log4j properties files for driver and executor

2016-05-23 Thread chandan prakash
Hi,
I am able to do logging for the driver but not for the executors.

I am running Spark Streaming under Mesos.
I want to configure log4j logging separately for the driver and the executors.

I used the below options in the spark-submit command:
--driver-java-options
"-Dlog4j.configuration=file:/usr/local/spark-1.5.1-bin-hadoop2.6/conf/log4j_RequestLogDriver.properties"
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:
/usr/local/spark-1.5.1-bin-hadoop2.6/conf/log4j_RequestLogExecutor.properties
"

Logging for the driver at the path configured in
log4j_RequestLogDriver.properties (/tmp/requestLogDriver.log) is happening
fine.
But for the executors there is no logging happening (it should appear at
/tmp/requestLogExecutor.log, as configured in log4j_RequestLogExecutor.properties
on the executor machines).

*Any suggestions on how to get logging enabled for the executors?*

TIA,
Chandan

-- 
Chandan Prakash
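
(A commonly suggested approach for the executor side, sketched here with the file names from the thread; shipping the file with --files and referring to it by bare name is an assumption, not something verified in this thread:

--files /usr/local/spark-1.5.1-bin-hadoop2.6/conf/log4j_RequestLogExecutor.properties
--driver-java-options "-Dlog4j.configuration=file:/usr/local/spark-1.5.1-bin-hadoop2.6/conf/log4j_RequestLogDriver.properties"
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j_RequestLogExecutor.properties"

The idea is that --files ships the properties file into each executor's working directory, so the executor-side -Dlog4j.configuration can refer to it by file name instead of an absolute path that may not exist on the worker hosts; some setups also need the file: prefix in front of the bare name. It is also worth checking that /tmp/requestLogExecutor.log is writable on the executor machines.)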


Spark job is failing with kerberos error while creating hive context in yarn-cluster mode (through spark-submit)

2016-05-23 Thread Chandraprakash Bhagtani
Hi,

My Spark job is failing with Kerberos issues while creating a Hive context in
yarn-cluster mode. However, it runs fine in yarn-client mode. My Spark
version is 1.6.1.

I am passing hive-site.xml through the --files option.

I tried searching online and found that the same issue is fixed by the
following JIRA, SPARK-6207. It is fixed in Spark 1.4, but I am running 1.6.1.

Am I missing any configuration here?


-- 
Thanks & Regards,
Chandra Prakash Bhagtani


Re: How spark depends on Guava

2016-05-23 Thread Jacek Laskowski
Hi Todd,

It's used heavily for thread pool executors for one. Don't know about other
uses.
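Roughly the kind of usage involved, as an illustrative sketch rather than the actual Spark code:

  import java.util.concurrent.{ExecutorService, Executors}
  import com.google.common.util.concurrent.ThreadFactoryBuilder

  // Guava's ThreadFactoryBuilder builds named daemon thread factories for pools
  val factory = new ThreadFactoryBuilder()
    .setDaemon(true)
    .setNameFormat("my-pool-%d")   // the name pattern here is just an example
    .build()
  val pool: ExecutorService = Executors.newFixedThreadPool(4, factory)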

Jacek
On 23 May 2016 5:49 a.m., "Todd"  wrote:

> Hi,
> In the spark code, guava maven dependency scope is provided, my question
> is, how spark depends on guava during runtime? I looked into the
> spark-assembly-1.6.1-hadoop2.6.1.jar,and didn't find class entries like
> com.google.common.base.Preconditions etc...
>


Re:Re: How spark depends on Guava

2016-05-23 Thread Todd
Thanks Mat.
When you look at the Spark assembly jar (such as
spark-assembly-1.6.0-hadoop2.6.1.jar), you will find that there are very few
classes belonging to the Guava library.
So I am wondering where the Guava library comes into play at run-time.






At 2016-05-23 15:42:51, "Mat Schaffer"  wrote:

I got curious so I tried sbt dependencyTree. Looks like Guava comes into spark 
core from a couple places.




-Mat


matschaffer.com


On Mon, May 23, 2016 at 2:32 PM, Todd  wrote:



Can someone please take a look at my question? I am running spark-shell in local mode and
yarn-client mode. Spark code uses the Guava library, so Spark should have Guava in place
at run time.


Thanks.





At 2016-05-23 11:48:58, "Todd"  wrote:

Hi,
In the spark code, guava maven dependency scope is provided, my question is, 
how spark depends on guava during runtime? I looked into the 
spark-assembly-1.6.1-hadoop2.6.1.jar,and didn't find class entries like 
com.google.common.base.Preconditions etc...




Re: Spark for offline log processing/querying

2016-05-23 Thread Renato Marroquín Mogrovejo
We also did some benchmarking using analytical queries similar to TPC-H
with both Spark and Presto, and our conclusion was that Spark is a great
general solution, but for analytical SQL queries it is still not there yet.
I mean, for 10 or 100 GB of data you will get your results back, but
Presto was way faster and more predictable. But of course, if you are planning to
do machine learning or ad-hoc data processing, then Spark is the right
solution.


Renato M.

2016-05-23 9:38 GMT+02:00 Mat Schaffer :

> It's only really mildly interactive. When I used presto+hive in the past
> (just a consumer not an admin) it seemed to be able to provide answers
> within ~2m even for fairly large data sets. Hoping I can get a similar
> level of responsiveness with spark.
>
> Thanks, Sonal! I'll take a look at the example log processor and see what
> I can come up with.
>
>
> -Mat
>
> matschaffer.com
>
> On Mon, May 23, 2016 at 3:08 PM, Jörn Franke  wrote:
>
>> Do you want to replace ELK by Spark? Depending on your queries you could
>> do as you proposed. However, many of the text analytics queries will
>> probably be much faster on ELK. If your queries are more interactive and
>> not about batch processing then it does not make so much sense. I am not
>> sure why you plan to use Presto.
>>
>> On 23 May 2016, at 07:28, Mat Schaffer  wrote:
>>
>> I'm curious about trying to use spark as a cheap/slow ELK
>> (ElasticSearch,Logstash,Kibana) system. Thinking something like:
>>
>> - instances rotate local logs
>> - copy rotated logs to s3
>> (s3://logs/region/grouping/instance/service/*.logs)
>> - spark to convert from raw text logs to parquet
>> - maybe presto to query the parquet?
>>
>> I'm still new on Spark though, so thought I'd ask if anyone was familiar
>> with this sort of thing and if there are maybe some articles or documents I
>> should be looking at in order to learn how to build such a thing. Or if
>> such a thing even made sense.
>>
>> Thanks in advance, and apologies if this has already been asked and I
>> missed it!
>>
>> -Mat
>>
>> matschaffer.com
>>
>>
>


Re: How to integrate Spark with OpenCV?

2016-05-23 Thread Jishnu Prathap
Hi Purbanir


Integrating Spark with OpenCV was pretty straightforward. Only 2 things to keep
in mind: OpenCV should be installed on each worker, and System.loadLibrary()
should be written in the program such that it is invoked once per worker.
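
A minimal sketch of that pattern in Scala, assuming the OpenCV Java bindings and native library are present on every worker; rdd and the per-image processing are placeholders:

  import org.opencv.core.Core

  object OpenCVLoader {
    // Touching this lazy val loads the native library at most once per JVM, i.e. once per executor
    lazy val loaded: Unit = System.loadLibrary(Core.NATIVE_LIBRARY_NAME)
  }

  val processed = rdd.mapPartitions { images =>
    OpenCVLoader.loaded          // make sure the library is loaded in this executor
    images.map { img =>
      // ... OpenCV processing on img goes here ...
      img
    }
  }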


Thanks & Regards

 Jishnu Prathap

 jishnuprathap...@gmail.com



From: purbanir [via Apache Spark User List] 

Sent: Wednesday, May 18, 2016 9:43:40 PM
To: Jishnu Menath Prathap (CTO Office)
Subject: Re: How to integrate Spark with OpenCV?



Hi

I will need to implement the same thing in a few months. It seems that nobody has
done this before. Nevertheless, there is some information we can use:
http://personals.ac.upc.edu/rtous/howto_spark_opencv.xhtml
or this
http://samos-it.com/posts/computer-vision-opencv-sift-surf-kmeans-on-spark.html
Let me know how you are doing with this problem.

Cheers







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-integrate-Spark-with-OpenCV-tp21133p27000.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Spark Streaming - Exception thrown while writing record: BlockAdditionEvent

2016-05-23 Thread Ewan Leith
As we increase the throughput on our Spark Streaming application, we're hitting
errors with the WriteAheadLog, like this:

16/05/21 20:42:21 WARN scheduler.ReceivedBlockTracker: Exception thrown while 
writing record: 
BlockAdditionEvent(ReceivedBlockInfo(0,Some(10),None,WriteAheadLogBasedStoreResult(input-0-1463850002991,Some(10),FileBasedWriteAheadLogSegment(hdfs://x.x.x.x:8020/checkpoint/receivedData/0/log-1463863286930-1463863346930,625283,39790
 to the WriteAheadLog.
java.util.concurrent.TimeoutException: Futures timed out after [5000 
milliseconds]
 at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at 
org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:81)
 at 
org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:232)
 at 
org.apache.spark.streaming.scheduler.ReceivedBlockTracker.addBlock(ReceivedBlockTracker.scala:87)
 at 
org.apache.spark.streaming.scheduler.ReceiverTracker.org$apache$spark$streaming$scheduler$ReceiverTracker$$addBlock(ReceiverTracker.scala:321)
 at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1$$anonfun$run$1.apply$mcV$sp(ReceiverTracker.scala:500)
 at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1229)
 at 
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1.run(ReceiverTracker.scala:498)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
16/05/21 20:42:26 WARN scheduler.ReceivedBlockTracker: Exception thrown while 
writing record: 
BlockAdditionEvent(ReceivedBlockInfo(1,Some(10),None,WriteAheadLogBasedStoreResult(input-1-1462971836350,Some(10),FileBasedWriteAheadLogSegment(hdfs://x.x.x.x:8020/checkpoint/receivedData/1/log-1463863313080-1463863373080,455191,60798
 to the WriteAheadLog.

I've found someone else on StackOverflow with the same issue, who's suggested 
increasing the spark.streaming.driver.writeAheadLog.batchingTimeout setting, 
but we're not actually seeing significant performance issues on HDFS when the 
issue occurs.

http://stackoverflow.com/questions/34879092/reliability-issues-with-checkpointing-wal-in-spark-streaming-1-6-0
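
For reference, that setting goes on the driver's SparkConf. The value below is just an example chosen to be larger than the 5000 ms default visible in the timeout above, and is assumed to be interpreted as milliseconds:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.streaming.driver.writeAheadLog.batchingTimeout", "15000")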

Has anyone else come across this, and any suggested areas we can look at?

Thanks,
Ewan


Re: Dataset API and avro type

2016-05-23 Thread Han JU
Just one more question: is Dataset supposed to be able to cast data to an
Avro type? For a very simple format (a string and a long), I can cast it to
a tuple or a case class, but not to an Avro type (which also contains only a
string and a long).
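
For context, the case-class route that does work here looks roughly like this; the field names follow the schema in the error below, the path is a placeholder, and sqlContext is assumed to be a SQLContext as in the shell:

  case class Auction(auctionId: String, ts: Long)

  import sqlContext.implicits._
  val ds = sqlContext.read.parquet("/path/to/data.parquet").as[Auction]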

The error is like this for this very simple type:

=== Result of Batch Resolution ===
!'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0,
string])) null else input[0, string].toString, if (isnull(input[1,
bigint])) null else input[1, bigint],
StructField(auctionId,StringType,true), StructField(ts,LongType,true)),
auctionId#0, ts#1L) AS #2]   Project [createexternalrow(if
(isnull(auctionId#0)) null else auctionId#0.toString, if (isnull(ts#1L))
null else ts#1L, StructField(auctionId,StringType,true),
StructField(ts,LongType,true)) AS #2]
 +- LocalRelation [auctionId#0,ts#1L]


  +- LocalRelation
[auctionId#0,ts#1L]

Exception in thread "main" org.apache.spark.sql.AnalysisException: Try to
map struct to Tuple1, but failed as the number
of fields does not line up.
 - Input schema: struct
 - Target schema: struct;
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.org
$apache$spark$sql$catalyst$encoders$ExpressionEncoder$$fail$1(ExpressionEncoder.scala:267)
at
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.validate(ExpressionEncoder.scala:281)
at org.apache.spark.sql.Dataset.(Dataset.scala:201)
at org.apache.spark.sql.Dataset.(Dataset.scala:168)
at org.apache.spark.sql.Dataset$.apply(Dataset.scala:57)
at org.apache.spark.sql.Dataset.as(Dataset.scala:366)
at Datasets$.delayedEndpoint$Datasets$1(Datasets.scala:35)
at Datasets$delayedInit$body.apply(Datasets.scala:23)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at Datasets$.main(Datasets.scala:23)
at Datasets.main(Datasets.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)

2016-05-22 22:02 GMT+02:00 Michael Armbrust :

> That's definitely a bug.  If you can come up with a small reproduction it
> would be great if you could open a JIRA.
> On May 22, 2016 12:21 PM, "Han JU"  wrote:
>
>> Hi Michael,
>>
>> The error is like this under 2.0.0-preview. In 1.6.1 the error is very
>> similar if not exactly the same.
>> The file is a parquet file containing avro objects.
>>
>> Thanks!
>>
>> Caused by: java.util.concurrent.ExecutionException: java.lang.Exception:
>> failed to compile: org.codehaus.commons.compiler.CompileException: File
>> 'generated.java', Line 25, Column 160: No applicable constructor/method
>> found for actual parameters "org.apache.spark.sql.catalyst.InternalRow";
>> candidates are: "public static java.nio.ByteBuffer
>> java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer
>> java.nio.ByteBuffer.wrap(byte[], int, int)"
>> /* 001 */
>> /* 002 */ public java.lang.Object generate(Object[] references) {
>> /* 003 */   return new SpecificSafeProjection(references);
>> /* 004 */ }
>> /* 005 */
>> /* 006 */ class SpecificSafeProjection extends
>> org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
>> /* 007 */
>> /* 008 */   private Object[] references;
>> /* 009 */   private MutableRow mutableRow;
>> /* 010 */   private org.apache.spark.serializer.KryoSerializerInstance
>> serializer;
>> /* 011 */
>> /* 012 */
>> /* 013 */   public SpecificSafeProjection(Object[] references) {
>> /* 014 */ this.references = references;
>> /* 015 */ mutableRow = (MutableRow) references[references.length - 1];
>> /* 016 */ serializer =
>> (org.apache.spark.serializer.KryoSerializerInstance) new
>> org.apache.spark.serializer.KryoSerializer(new
>> org.apache.spark.SparkConf()).newInstance();
>> /* 017 */   }
>> /* 018 */
>> /* 019 */   public java.lang.Object apply(java.lang.Object _i) {
>> /* 020 */ InternalRow i = (InternalRow) _i;
>> /* 021 */ /* decodeusingserializer(input[0,
>> struct> */
>> /* 022 */ /* input[0,
>> struct> */
>> /* 023 */ boolean isNull1 = i.isNullAt(0);
>> /* 024 */ InternalRow 

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-23 Thread Mich Talebzadeh
Hi Timur and everyone.

I will answer your first question as it is very relevant

1) How to make 2 versions of Spark live together on the same cluster
(libraries clash, paths, etc.) ?
Most of the Spark users perform ETL, ML operations on Spark as well. So, we
may have 3 Spark installations simultaneously

There are two distinct points here.

Using Spark as a query engine. That is BAU, and most forum members use it
every day. You run Spark with either Standalone, YARN or Mesos as the cluster
manager. You start the master, which manages the resources, and you
start slaves to create workers.

You deploy Spark either via spark-shell, spark-sql, or by submitting jobs through
spark-submit, etc. You may or may not use Hive as your database. You may use
HBase via Phoenix, etc.
If you choose to use Hive as your database, then on every host of the cluster,
including your master host, you ensure that the Hive APIs are installed
(meaning Hive is installed). In $SPARK_HOME/conf, you create a soft link to Hive's hive-site.xml:
cd $SPARK_HOME/conf
hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6/conf> ltr hive-site.xml
lrwxrwxrwx 1 hduser hadoop 32 May  3 17:48 *hive-site.xml ->
/usr/lib/hive/conf/hive-site.xml*
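
For completeness, that soft link would be created with something like:

cd $SPARK_HOME/conf
ln -s /usr/lib/hive/conf/hive-site.xml hive-site.xml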
Now in hive-site.xml you can define all the parameters needed for Spark
connectivity. Remember, we are making Hive use the Spark 1.3.1 engine. WE ARE
NOT RUNNING SPARK 1.3.1 AS A QUERY TOOL. We do not need to start a master or
workers for Spark 1.3.1! It is just an execution engine, like mr etc.

Let us look at how we do that in hive-site.xml. Note the settings for
hive.execution.engine=spark and spark.home=/usr/lib/spark-1.3.1-bin-hadoop2
below. That tells Hive to use Spark 1.3.1 as the execution engine. You just
install Spark 1.3.1 on the host (just the binary download); here it is under
/usr/lib/spark-1.3.1-bin-hadoop2.6

In hive-site.xml, you set the properties.

  
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
  <description>
    Expects one of [mr, tez, spark].
    Chooses execution engine. Options are: mr (Map reduce, default), tez, spark. While MR
    remains the default engine for historical reasons, it is itself a historical engine
    and is deprecated in Hive 2 line. It may be removed without further warning.
  </description>
</property>

<property>
  <name>spark.home</name>
  <value>/usr/lib/spark-1.3.1-bin-hadoop2</value>
  <description>something</description>
</property>

<property>
  <name>hive.merge.sparkfiles</name>
  <value>false</value>
  <description>Merge small files at the end of a Spark DAG Transformation</description>
</property>

<property>
  <name>hive.spark.client.future.timeout</name>
  <value>60s</value>
  <description>
    Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is sec if not specified.
    Timeout for requests from Hive client to remote Spark driver.
  </description>
</property>

<property>
  <name>hive.spark.job.monitor.timeout</name>
  <value>60s</value>
  <description>
    Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is sec if not specified.
    Timeout for job monitor to get Spark job state.
  </description>
</property>

<property>
  <name>hive.spark.client.connect.timeout</name>
  <value>1000ms</value>
  <description>
    Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is msec if not specified.
    Timeout for remote Spark driver in connecting back to Hive client.
  </description>
</property>

<property>
  <name>hive.spark.client.server.connect.timeout</name>
  <value>9ms</value>
  <description>
    Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is msec if not specified.
    Timeout for handshake between Hive client and remote Spark driver. Checked by both processes.
  </description>
</property>

<property>
  <name>hive.spark.client.secret.bits</name>
  <value>256</value>
  <description>Number of bits of randomness in the generated secret for communication between Hive client and remote Spark driver. Rounded down to the nearest multiple of 8.</description>
</property>

<property>
  <name>hive.spark.client.rpc.threads</name>
  <value>8</value>
  <description>Maximum number of threads for remote Spark driver's RPC event loop.</description>
</property>

And other settings as well

That was the Hive stuff for your Spark BAU. So there are two distinct
things. Now going to Hive itself, you will need to add the correct assembly
jar file for Hadoop. These are called

spark-assembly-x.y.z-hadoop2.4.0.jar

Where x.y.z in this case is 1.3.1

The assembly file is

spark-assembly-1.3.1-hadoop2.4.0.jar

So you add that spark-assembly-1.3.1-hadoop2.4.0.jar to $HIVE_HOME/lib

ls $HIVE_HOME/lib/spark-assembly-1.3.1-hadoop2.4.0.jar
/usr/lib/hive/lib/spark-assembly-1.3.1-hadoop2.4.0.jar

And you need to compile that Spark from source without the Hive jars, and with Hadoop provided:


./make-distribution.sh --name "hadoop2-without-hive" --tgz
"-Pyarn,hadoop-provided,hadoop-2.4,parquet-provided"


So Hive uses the Spark engine by default.

If you want to use mr in Hive, you just do:

0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
No rows affected (0.007 seconds)
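
And to switch back to the Spark engine for the session, the same mechanism applies:

0: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=spark;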

With regard to the second question

*2) How stable such construction is on INSERT / UPDATE / CTAS operations?
Any problems with writing into specific tables / directories, 

Re: Spark for offline log processing/querying

2016-05-23 Thread Mat Schaffer
It's only really mildly interactive. When I used presto+hive in the past
(just a consumer not an admin) it seemed to be able to provide answers
within ~2m even for fairly large data sets. Hoping I can get a similar
level of responsiveness with spark.

Thanks, Sonal! I'll take a look at the example log processor and see what I
can come up with.


-Mat

matschaffer.com

On Mon, May 23, 2016 at 3:08 PM, Jörn Franke  wrote:

> Do you want to replace ELK by Spark? Depending on your queries you could
> do as you proposed. However, many of the text analytics queries will
> probably be much faster on ELK. If your queries are more interactive and
> not about batch processing then it does not make so much sense. I am not
> sure why you plan to use Presto.
>
> On 23 May 2016, at 07:28, Mat Schaffer  wrote:
>
> I'm curious about trying to use spark as a cheap/slow ELK
> (ElasticSearch,Logstash,Kibana) system. Thinking something like:
>
> - instances rotate local logs
> - copy rotated logs to s3
> (s3://logs/region/grouping/instance/service/*.logs)
> - spark to convert from raw text logs to parquet
> - maybe presto to query the parquet?
>
> I'm still new on Spark though, so thought I'd ask if anyone was familiar
> with this sort of thing and if there are maybe some articles or documents I
> should be looking at in order to learn how to build such a thing. Or if
> such a thing even made sense.
>
> Thanks in advance, and apologies if this has already been asked and I
> missed it!
>
> -Mat
>
> matschaffer.com
>
>


Re: Does DataFrame has something like set hive.groupby.skewindata=true;

2016-05-23 Thread Virgil Palanciuc
It doesn't.
However, if you have a very large number of keys, with a small number of
very large keys, you can do one of the following:
A. Use a custom partitioner that counts the number of items in a key and
avoids putting large keys together; alternatively, if feasible (and
needed), include part of the value together with the key so that you can
split very large keys across multiple partitions (but that will likely alter
the way you need to do your computations).
B. Count by key, and collect to the driver the keys with lots of values. Then
broadcast the set of "large keys", and:
  i. filter out the large keys, do your regular processing for the many
small keys
  ii. filter out the small keys, do special processing for the very large
keys, using the fact that you can probably store all of them in memory at
any given point (e.g. you can then key by value and do random access to
retrieve "key data" for any given key in the mapPartitions() code)

There is no universal solution for this problem, but these are good general
solutions that should hopefully set you on the right track.
(note: solution A works for RDDs only; solution B can work with DataFrame
too)
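
A rough sketch of option B on RDDs (untested; the threshold and names are placeholders, assuming rdd is an RDD[(K, V)] and sc is the SparkContext):

  val threshold = 1000000L                                          // "large key" cutoff, to be tuned
  val counts = rdd.mapValues(_ => 1L).reduceByKey(_ + _)            // count values per key
  val largeKeys = counts.filter { case (_, n) => n > threshold }.keys.collect().toSet
  val largeKeysBc = sc.broadcast(largeKeys)

  val smallKeyPart = rdd.filter { case (k, _) => !largeKeysBc.value.contains(k) }  // i. regular processing
  val largeKeyPart = rdd.filter { case (k, _) => largeKeysBc.value.contains(k) }   // ii. special handling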

Regards,
Virgil.


On Sat, May 21, 2016 at 11:48 PM, unk1102  wrote:

> Hi I am having DataFrame with huge skew data in terms of TB and I am doing
> groupby on 8 fields which I cant avoid unfortunately. I am looking to
> optimize this I have found hive has
>
> set hive.groupby.skewindata=true;
>
> I dont use Hive I have Spark DataFrame can we achieve above Spark? Please
> guide. Thanks in advance.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Does-DataFrame-has-something-like-set-hive-groupby-skewindata-true-tp26995.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark for offline log processing/querying

2016-05-23 Thread Jörn Franke
Do you want to replace ELK by Spark? Depending on your queries you could do as 
you proposed. However, many of the text analytics queries will probably be much 
faster on ELK. If your queries are more interactive and not about batch 
processing then it does not make so much sense. I am not sure why you plan to 
use Presto.

> On 23 May 2016, at 07:28, Mat Schaffer  wrote:
> 
> I'm curious about trying to use spark as a cheap/slow ELK 
> (ElasticSearch,Logstash,Kibana) system. Thinking something like:
> 
> - instances rotate local logs
> - copy rotated logs to s3 (s3://logs/region/grouping/instance/service/*.logs)
> - spark to convert from raw text logs to parquet
> - maybe presto to query the parquet?
> 
> I'm still new on Spark though, so thought I'd ask if anyone was familiar with 
> this sort of thing and if there are maybe some articles or documents I should 
> be looking at in order to learn how to build such a thing. Or if such a thing 
> even made sense.
> 
> Thanks in advance, and apologies if this has already been asked and I missed 
> it!
> 
> -Mat
> 
> matschaffer.com