Structured Streaming for tweets

2016-05-21 Thread singinpirate
Hi all,

The co-founder of Databricks just demo'ed that we can stream tweets with
structured streaming: https://youtu.be/9xSz0ppBtFg?t=16m42s but he didn't
show how he did it - does anyone know how to provide credentials to
structured streaming?
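
For what it's worth, the demo used a custom source that is not part of Spark itself. A minimal sketch of how such a source would typically take credentials (the "twitter" format name and the option keys below are hypothetical placeholders for whatever custom source implementation is used) might look like this:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tweets").getOrCreate()

// "twitter" and the option keys are placeholders; they are not built into Spark 2.0.
val tweets = spark.readStream
  .format("twitter")
  .option("consumerKey", sys.env("TWITTER_CONSUMER_KEY"))
  .option("consumerSecret", sys.env("TWITTER_CONSUMER_SECRET"))
  .option("accessToken", sys.env("TWITTER_ACCESS_TOKEN"))
  .option("accessTokenSecret", sys.env("TWITTER_ACCESS_TOKEN_SECRET"))
  .load()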


Re: How to set the degree of parallelism in Spark SQL?

2016-05-21 Thread Ted Yu
Looks like an equal sign is missing between partitions and 200.
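
For reference, the corrected statement, plus sqlContext.setConf as an equivalent way to set the same property:

// Corrected: note the '=' between the property name and its value.
sqlContext.sql("SET spark.sql.shuffle.partitions=200")

// Equivalent, without going through SQL:
sqlContext.setConf("spark.sql.shuffle.partitions", "200")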

On Sat, May 21, 2016 at 8:31 PM, SRK  wrote:

> Hi,
>
> How to set the degree of parallelism in Spark SQL? I am using the following
> but it somehow seems to allocate only two executors at a time.
>
>  sqlContext.sql(" set spark.sql.shuffle.partitions  200  ")
>
> Thanks,
> Swetha
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-the-degree-of-parallelism-in-Spark-SQL-tp26996.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


How to set the degree of parallelism in Spark SQL?

2016-05-21 Thread SRK
Hi,

How to set the degree of parallelism in Spark SQL? I am using the following
but it somehow seems to allocate only two executors at a time.

 sqlContext.sql(" set spark.sql.shuffle.partitions  200  ")

Thanks,
Swetha





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-set-the-degree-of-parallelism-in-Spark-SQL-tp26996.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: What / Where / When / How questions in Spark 2.0 ?

2016-05-21 Thread Ovidiu-Cristian MARCU
Thank you, Amit! I was looking for this kind of information.

I did not fully read your paper, but I see in it a TODO with basically the same 
question(s) [1]; maybe someone from the Spark team (including Databricks) will be 
so kind as to send some feedback.

Best,
Ovidiu

[1] Integrate “Structured Streaming”: //TODO - What (and how) will Spark 2.0 
support (out-of-order, event-time windows, watermarks, triggers, accumulation 
modes) - how straightforward will it be to integrate with the Beam Model?


> On 21 May 2016, at 23:00, Sela, Amit  wrote:
> 
> It seems I forgot to add the link to the “Technical Vision” paper so there it 
> is - 
> https://docs.google.com/document/d/1y4qlQinjjrusGWlgq-mYmbxRW2z7-_X5Xax-GG0YsC0/edit?usp=sharing
> 
> From: "Sela, Amit" >
> Date: Saturday, May 21, 2016 at 11:52 PM
> To: Ovidiu-Cristian MARCU  >, "user @spark" 
> >
> Cc: Ovidiu Cristian Marcu  >
> Subject: Re: What / Where / When / How questions in Spark 2.0 ?
> 
> This is a “Technical Vision” paper for the Spark runner, which provides 
> general guidelines to the future development of Spark’s Beam support as part 
> of the Apache Beam (incubating) project.
> This is our JIRA - 
> https://issues.apache.org/jira/browse/BEAM/component/12328915/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel
>  
> 
> 
> Generally, I’m currently working on Datasets integration for Batch (to 
> replace RDD) against Spark 1.6, and going towards enhancing Stream processing 
> capabilities with Structured Streaming (2.0).
> 
> And you’re welcome to ask those questions at the Apache Beam (incubating) 
> mailing list as well ;)
> http://beam.incubator.apache.org/mailing_lists/ 
> 
> 
> Thanks,
> Amit
> 
> From: Ovidiu-Cristian MARCU
> Date: Tuesday, May 17, 2016 at 12:11 AM
> To: "user @spark"
> Cc: Ovidiu Cristian Marcu
> Subject: Re: What / Where / When / How questions in Spark 2.0 ?
> 
> Could you please consider a short answer regarding the Apache Beam Capability 
> Matrix todo’s for future Spark 2.0 release [4]? (some related references 
> below [5][6])
> 
> Thanks
> 
> [4] http://beam.incubator.apache.org/capability-matrix/#cap-full-what 
> 
> [5] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 
> 
> [6] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102 
> 
> 
>> On 16 May 2016, at 14:18, Ovidiu-Cristian MARCU wrote:
>> 
>> Hi,
>> 
>> We can see in [2] many interesting (and expected!) improvements (promises) 
>> like extended SQL support, unified API (DataFrames, DataSets), improved 
>> engine (Tungsten relates to ideas from modern compilers and MPP databases - 
>> similar to Flink [3]), structured streaming, etc. It seems we are somehow 
>> witnessing a smart unification of Big Data analytics (Spark, Flink - best 
>> of two worlds)!
>> 
>> How does Spark respond to the missing What/Where/When/How questions 
>> (capabilities) highlighted in the unified model Beam [1] ?
>> 
>> Best,
>> Ovidiu
>> 
>> [1] 
>> https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
>>  
>> 
>> [2] 
>> https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
>>  
>> 
>> [3] http://stratosphere.eu/project/publications/ 
>> 
>> 
>> 
> 



Re: Wide Datasets (v1.6.1)

2016-05-21 Thread Don Drake
I was able to verify that similar exceptions occur in Spark 2.0.0-preview.
I have created this JIRA: https://issues.apache.org/jira/browse/SPARK-15467

You mentioned using beans instead of case classes, do you have an example
(or test case) that I can see?

-Don

On Fri, May 20, 2016 at 3:49 PM, Michael Armbrust 
wrote:

> I can provide an example/open a Jira if there is a chance this will be
>> fixed.
>>
>
> Please do!  Ping me on it.
>
> Michael
>



-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake 
800-733-2143


Hive 2.0 on Spark 1.6.1 Engine

2016-05-21 Thread Mich Talebzadeh
Hi,

I usually run Hive 2 on the Spark 1.3.1 engine (as opposed to using the
default MR or Tez). I tried to make Hive 2 work with Tez 0.82, but that did
not do much.

Anyway I will try to make it work.

Today I compiled Spark 1.6.1 from source, excluding the Hadoop libraries. I
did this before for the Spark 1.3.1 engine.

I created the spark-assembly-1.6.1-hadoop2.4.0.jar file and followed the
same process that works for Spark 1.3.1.

This is an example with Hive 2 on Spark 1.3.1:

Starting Spark Job = 0
Query Hive on Spark job[0] stages:
0
1
Status: Running (Hive on Spark job[0])
Job Progress Format
CurrentTime StageId_StageAttemptId:
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
[StageCost]
2016-05-21 22:53:45,512 Stage-0_0: 1(+1)/22 Stage-1_0: 0/1
2016-05-21 22:53:47,517 Stage-0_0: 2(+1)/22 Stage-1_0: 0/1


However, when I use the Spark 1.6.1 assembly file, I get the following error:

hive> select count(1) from sales_staging;
Query ID = hduser_20160521224219_dc9aae02-92bd-4279-87e2-98a6458db783
Total jobs = 1
Launching Job 1 out of 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=
In order to set a constant number of reducers:
  set mapreduce.job.reduces=
java.lang.NoClassDefFoundError: Could not initialize class
org.apache.spark.rdd.RDDOperationScope$
at org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
at org.apache.spark.SparkContext.hadoopRDD(SparkContext.scala:991)
at
org.apache.spark.api.java.JavaSparkContext.hadoopRDD(JavaSparkContext.scala:419)
at
org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateMapInput(SparkPlanGenerator.java:205)
at
org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generateParentTran(SparkPlanGenerator.java:145)
at
org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:117)
at
org.apache.hadoop.hive.ql.exec.spark.LocalHiveSparkClient.execute(LocalHiveSparkClient.java:130)
at
org.apache.hadoop.hive.ql.exec.spark.session.SparkSessionImpl.submit(SparkSessionImpl.java:64)
at
org.apache.hadoop.hive.ql.exec.spark.SparkTask.execute(SparkTask.java:112)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:158)
at
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:101)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1840)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1584)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1361)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1184)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1172)
at
org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:233)
at
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:184)
at
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:400)
at
org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:778)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:717)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:645)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
FAILED: Execution Error, return code -101 from
org.apache.hadoop.hive.ql.exec.spark.SparkTask. Could not initialize class
org.apache.spark.rdd.RDDOperationScope$

I am not sure if anyone has tried this?

Dr Mich Talebzadeh



LinkedIn: 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Re: What / Where / When / How questions in Spark 2.0 ?

2016-05-21 Thread Sela, Amit
It seems I forgot to add the link to the “Technical Vision” paper so there it 
is - 
https://docs.google.com/document/d/1y4qlQinjjrusGWlgq-mYmbxRW2z7-_X5Xax-GG0YsC0/edit?usp=sharing

From: "Sela, Amit" >
Date: Saturday, May 21, 2016 at 11:52 PM
To: Ovidiu-Cristian MARCU 
>, "user 
@spark" >
Cc: Ovidiu Cristian Marcu 
>
Subject: Re: What / Where / When / How questions in Spark 2.0 ?

This is a “Technical Vision” paper for the Spark runner, which provides general 
guidelines to the future development of Spark’s Beam support as part of the 
Apache Beam (incubating) project.
This is our JIRA - 
https://issues.apache.org/jira/browse/BEAM/component/12328915/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel

Generally, I’m currently working on Datasets integration for Batch (to replace 
RDD) against Spark 1.6, and going towards enhancing Stream processing 
capabilities with Structured Streaming (2.0).

And you’re welcome to ask those questions at the Apache Beam (incubating) 
mailing list as well ;)
http://beam.incubator.apache.org/mailing_lists/

Thanks,
Amit

From: Ovidiu-Cristian MARCU
Date: Tuesday, May 17, 2016 at 12:11 AM
To: "user @spark"
Cc: Ovidiu Cristian Marcu
Subject: Re: What / Where / When / How questions in Spark 2.0 ?

Could you please consider a short answer regarding the Apache Beam Capability 
Matrix todo’s for future Spark 2.0 release [4]? (some related references below 
[5][6])

Thanks

[4] http://beam.incubator.apache.org/capability-matrix/#cap-full-what
[5] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
[6] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

On 16 May 2016, at 14:18, Ovidiu-Cristian MARCU wrote:

Hi,

We can see in [2] many interesting (and expected!) improvements (promises) like 
extended SQL support, unified API (DataFrames, DataSets), improved engine 
(Tungsten relates to ideas from modern compilers and MPP databases - similar to 
Flink [3]), structured streaming, etc. It seems we are somehow witnessing a smart 
unification of Big Data analytics (Spark, Flink - best of two worlds)!

How does Spark respond to the missing What/Where/When/How questions 
(capabilities) highlighted in the unified model Beam [1] ?

Best,
Ovidiu

[1] 
https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
[2] 
https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
[3] http://stratosphere.eu/project/publications/





Re: What / Where / When / How questions in Spark 2.0 ?

2016-05-21 Thread Sela, Amit
This is a “Technical Vision” paper for the Spark runner, which provides general 
guidelines to the future development of Spark’s Beam support as part of the 
Apache Beam (incubating) project.
This is our JIRA - 
https://issues.apache.org/jira/browse/BEAM/component/12328915/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel

Generally, I’m currently working on Datasets integration for Batch (to replace 
RDD) against Spark 1.6, and going towards enhancing Stream processing 
capabilities with Structured Streaming (2.0).

And you’re welcome to ask those questions at the Apache Beam (incubating) 
mailing list as well ;)
http://beam.incubator.apache.org/mailing_lists/

Thanks,
Amit

From: Ovidiu-Cristian MARCU
Date: Tuesday, May 17, 2016 at 12:11 AM
To: "user @spark"
Cc: Ovidiu Cristian Marcu
Subject: Re: What / Where / When / How questions in Spark 2.0 ?

Could you please consider a short answer regarding the Apache Beam Capability 
Matrix todo’s for future Spark 2.0 release [4]? (some related references below 
[5][6])

Thanks

[4] http://beam.incubator.apache.org/capability-matrix/#cap-full-what
[5] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
[6] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

On 16 May 2016, at 14:18, Ovidiu-Cristian MARCU wrote:

Hi,

We can see in [2] many interesting (and expected!) improvements (promises) like 
extended SQL support, unified API (DataFrames, DataSets), improved engine 
(Tungsten relates to ideas from modern compilers and MPP databases - similar to 
Flink [3]), structured streaming, etc. It seems we are somehow witnessing a smart 
unification of Big Data analytics (Spark, Flink - best of two worlds)!

How does Spark respond to the missing What/Where/When/How questions 
(capabilities) highlighted in the unified model Beam [1] ?

Best,
Ovidiu

[1] 
https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
[2] 
https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
[3] http://stratosphere.eu/project/publications/





Does DataFrame has something like set hive.groupby.skewindata=true;

2016-05-21 Thread unk1102
Hi, I have a DataFrame with heavily skewed data (on the order of TBs), and I am
doing a groupBy on 8 fields, which I unfortunately cannot avoid. I am looking to
optimize this, and I have found that Hive has

set hive.groupby.skewindata=true;

I don't use Hive; I have a Spark DataFrame. Can we achieve the above in Spark?
Please guide. Thanks in advance.
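
As far as I know there is no direct DataFrame equivalent of that Hive flag. One commonly used workaround is a two-stage ("salted") aggregation; the sketch below assumes a count-style aggregate and uses illustrative parameter names:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// df, the key columns and the aggregate are placeholders for your own job.
def saltedCount(df: DataFrame, keys: Seq[String], buckets: Int = 32): DataFrame = {
  // Stage 1: spread each hot key over `buckets` partial groups.
  val salted = df.withColumn("salt", (rand() * buckets).cast("int"))
  val partial = salted
    .groupBy((keys :+ "salt").map(col): _*)
    .agg(count(lit(1)).as("partial_cnt"))
  // Stage 2: drop the salt and combine the partial results.
  partial
    .groupBy(keys.map(col): _*)
    .agg(sum("partial_cnt").as("cnt"))
}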



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Does-DataFrame-has-something-like-set-hive-groupby-skewindata-true-tp26995.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 2.0 - SQL Subqueries.

2016-05-21 Thread Reynold Xin
https://issues.apache.org/jira/browse/SPARK-15078 was just a bunch of test-harness
changes and added no new functionality. To reduce confusion, I just
backported it into branch-2.0 so SPARK-15078 is now in 2.0 too.

Can you paste a query you were testing?


On Sat, May 21, 2016 at 10:49 AM, Kamalesh Nair 
wrote:

> Hi,
>
> From the Spark 2.0 release webinar, what I understood is that the newer version
> has significantly expanded the SQL capabilities of Spark, with the
> introduction of a new ANSI SQL parser and support for subqueries.  It also
> says Spark 2.0 can run all 99 TPC-DS queries, which require many of the
> SQL:2003 features.
>
> I was testing out the Spark 2.0 (apache/branch-2.0 preview) Technical Preview
> provided by Databricks and observed that Spark SQL subqueries are now
> supported in the WHERE clause, which is great, but they are not supported in
> the SELECT clause itself. The error message displayed is listed below.
>
> "Error in SQL statement: AnalysisException: Predicate sub-queries can only
> be used in a Filter"
>
> Is this work in progress planned for the final release, or a miss?
>
> Also, I could find a related JIRA, SPARK-15078, being resolved in the Spark
> 2.1.0 version. If this is what addresses this issue, then kindly let me know
> the tentative release date for Spark 2.1.0. Instead of Spark 2.0 in mid
> June, would it be Spark 2.1.0 that is released during the same time?
>
> Thanks in advance for your prompt reply!
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-2-0-SQL-Subqueries-tp26993.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


How to carry data streams over multiple batch intervals in Spark Streaming

2016-05-21 Thread Marco1982
Hi experts,
I'm using Apache Spark Streaming 1.6.1 to write a Java application that
joins two Key/Value data streams and writes the output to HDFS. The two data
streams contain K/V strings and are periodically ingested in Spark from HDFS
by using textFileStream().
The two data streams aren't synchronized, which means that some keys that
are in stream1 at time t0 may appear in stream2 at time t1, or vice
versa. Hence, my goal is to join the two streams and compute "leftover"
keys, which should be considered for the join operation in the next batch
intervals.
To better clarify this, look at the following algorithm:

variables:
stream1 =  input stream at time t1
stream2 =  input stream at time t1
left_keys_s1 =  records of stream1 that didn't appear in the
join at time t0
left_keys_s2 =  records of stream2 that didn't appear in the
join at time t0

operations at time t1:
out_stream = (stream1 + left_keys_s1) join (stream2 + left_keys_s2)
write out_stream to HDFS
left_keys_s1 = left_keys_s1 + records of stream1 not in out_stream (should
be used at time t2)
left_keys_s2 = left_keys_s2 + records of stream2 not in out_stream (should
be used at time t2)

I've tried to implement this algorithm with Spark Streaming unsuccessfully.
Initially, I create two empty streams for leftover keys in this way (this is
only one stream, but the code to generate the second stream is similar):

JavaRDD<String> empty_rdd = sc.emptyRDD(); // sc = Java Spark Context
Queue<JavaRDD<String>> q = new LinkedList<JavaRDD<String>>();
q.add(empty_rdd);
JavaDStream<String> empty_dstream = jssc.queueStream(q);
JavaPairDStream<String, String> k1 = empty_dstream.mapToPair(
    new PairFunction<String, String, String>() {
      @Override
      public scala.Tuple2<String, String> call(String s) {
        return new scala.Tuple2<String, String>(s, s);
      }
    });

Later on, this empty stream is unified (i.e., union()) with stream1 and
finally, after the join, I add the leftover keys from stream1 and call
window(). The same happens with stream2.
The problem is that the operations that generate left_keys_s1 and
left_keys_s2 are transformations without actions, which means that Spark
doesn't create any RDD flow graph and, hence, they are never executed. What
I get right now is a join that outputs only the records whose keys are in
stream1 and stream2 in the same time interval.
Do you guys have any suggestion to implement this correctly with Spark?

Thanks, 
Marco



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-carry-data-streams-over-multiple-batch-intervals-in-Spark-Streaming-tp26994.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark 2.0 - SQL Subqueries.

2016-05-21 Thread Kamalesh Nair
Hi, 

From the Spark 2.0 release webinar, what I understood is that the newer version
has significantly expanded the SQL capabilities of Spark, with the
introduction of a new ANSI SQL parser and support for subqueries.  It also
says Spark 2.0 can run all 99 TPC-DS queries, which require many of the
SQL:2003 features.

I was testing out the Spark 2.0 (apache/branch-2.0 preview) Technical Preview
provided by Databricks and observed that Spark SQL subqueries are now
supported in the WHERE clause, which is great, but they are not supported in
the SELECT clause itself. The error message displayed is listed below.

"Error in SQL statement: AnalysisException: Predicate sub-queries can only
be used in a Filter"
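
To illustrate, with hypothetical tables t1 and t2: an IN subquery placed in the WHERE clause is accepted, while the same predicate placed in the SELECT list produces the error above.

// Works in the 2.0 preview: predicate subquery used as a filter.
sqlContext.sql("SELECT id, name FROM t1 WHERE id IN (SELECT id FROM t2)")

// Fails with the AnalysisException quoted above: the IN predicate
// appears in the SELECT list rather than in the WHERE clause.
sqlContext.sql("SELECT id, id IN (SELECT id FROM t2) AS in_t2 FROM t1")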

Is this work in progress planned for the final release, or a miss?

Also, I could find a related JIRA, SPARK-15078, being resolved in the Spark
2.1.0 version. If this is what addresses this issue, then kindly let me know
the tentative release date for Spark 2.1.0. Instead of Spark 2.0 in mid
June, would it be Spark 2.1.0 that is released during the same time?

Thanks in advance for your prompt reply!





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-2-0-SQL-Subqueries-tp26993.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to carry data streams over multiple batch intervals in Spark Streaming

2016-05-21 Thread Marco Platania
Hi experts,

I'm using Apache Spark Streaming 1.6.1 to write a Java application
that joins two Key/Value data streams and writes the output to HDFS. The two
data streams contain K/V strings and are periodically ingested in Spark from
HDFS by using textFileStream().

The two data streams aren't synchronized, which means that some keys that are
in stream1 at time t0 may appear in stream2 at time t1, or vice versa.
Hence, my goal is to join the two streams and compute "leftover" keys, which
should be considered for the join operation in the next batch intervals.

To better clarify this, look at the following algorithm:

variables:
stream1 = input stream at time t1
stream2 = input stream at time t1
left_keys_s1 = records of stream1 that didn't appear in the join at time t0
left_keys_s2 = records of stream2 that didn't appear in the join at time t0

operations at time t1:
out_stream = (stream1 + left_keys_s1) join (stream2 + left_keys_s2)
write out_stream to HDFS
left_keys_s1 = left_keys_s1 + records of stream1 not in out_stream (should be
used at time t2)
left_keys_s2 = left_keys_s2 + records of stream2 not in out_stream (should be
used at time t2)

I've tried to implement this algorithm with Spark Streaming unsuccessfully.
Initially, I create two empty streams for leftover keys in this way (this is
only one stream, but the code to generate the second stream is similar):

JavaRDD<String> empty_rdd = sc.emptyRDD(); // sc = Java Spark Context
Queue<JavaRDD<String>> q = new LinkedList<JavaRDD<String>>();
q.add(empty_rdd);
JavaDStream<String> empty_dstream = jssc.queueStream(q);
JavaPairDStream<String, String> k1 = empty_dstream.mapToPair(
    new PairFunction<String, String, String>() {
      @Override
      public scala.Tuple2<String, String> call(String s) {
        return new scala.Tuple2<String, String>(s, s);
      }
    });

Later on, this empty stream is unified (i.e., union()) with stream1 and
finally, after the join, I add the leftover keys from stream1 and call
window(). The same happens with stream2.

The problem is that the operations that generate left_keys_s1 and
left_keys_s2 are transformations without actions, which means that Spark
doesn't create any RDD flow graph and, hence, they are never executed. What
I get right now is a join that outputs only the records whose keys are in
stream1 and stream2 in the same time interval.

Do you guys have any suggestion to implement this correctly with Spark?

Thanks,
Marco

Re: Spark Streaming S3 Error

2016-05-21 Thread Benjamin Kim
I got my answer.

The way to access S3 has changed.

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", accessKey)
hadoopConf.set("fs.s3a.secret.key", secretKey)

val lines = ssc.textFileStream("s3a://amg-events-out/")

This worked.

Cheers,
Ben


> On May 21, 2016, at 4:18 AM, Ted Yu  wrote:
> 
> Maybe more than one version of jets3t-xx.jar was on the classpath.
> 
> FYI
> 
> On Fri, May 20, 2016 at 8:31 PM, Benjamin Kim  > wrote:
> I am trying to stream files from an S3 bucket using CDH 5.7.0’s version of 
> Spark 1.6.0. It seems not to work. I keep getting this error.
> 
> Exception in thread "JobGenerator" java.lang.VerifyError: Bad type on operand 
> stack
> Exception Details:
>   Location:
> 
> org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.copy(Ljava/lang/String;Ljava/lang/String;)V
>  @155: invokevirtual
>   Reason:
> Type 'org/jets3t/service/model/S3Object' (current frame, stack[4]) is not 
> assignable to 'org/jets3t/service/model/StorageObject'
>   Current Frame:
> bci: @155
> flags: { }
> locals: { 'org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore', 
> 'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object' }
> stack: { 'org/jets3t/service/S3Service', 'java/lang/String', 
> 'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object', 
> integer }
>   Bytecode:
> 0x000: b200 fcb9 0190 0100 9900 39b2 00fc bb01
> 0x010: 5659 b701 5713 0192 b601 5b2b b601 5b13
> 0x020: 0194 b601 5b2c b601 5b13 0196 b601 5b2a
> 0x030: b400 7db6 00e7 b601 5bb6 015e b901 9802
> 0x040: 002a b400 5799 0030 2ab4 0047 2ab4 007d
> 0x050: 2b01 0101 01b6 019b 4e2a b400 6b09 949e
> 0x060: 0016 2db6 019c 2ab4 006b 949e 000a 2a2d
> 0x070: 2cb6 01a0 b1bb 00a0 592c b700 a14e 2d2a
> 0x080: b400 73b6 00b0 2ab4 0047 2ab4 007d b600
> 0x090: e72b 2ab4 007d b600 e72d 03b6 01a4 57a7
> 0x0a0: 000a 4e2a 2d2b b700 c7b1
>   Exception Handler Table:
> bci [0, 116] => handler: 162
> bci [117, 159] => handler: 162
>   Stackmap Table:
> same_frame_extended(@65)
> same_frame(@117)
> same_locals_1_stack_item_frame(@162,Object[#139])
> same_frame(@169)
> 
> at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:338)
> at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:328)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2696)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2733)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2715)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:382)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
> at org.apache.spark.streaming.dstream.FileInputDStream.org 
> $apache$spark$streaming$dstream$FileInputDStream$$fs(FileInputDStream.scala:297)
> at 
> org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:198)
> at 
> org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:149)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
> at scala.Option.orElse(Option.scala:257)
> at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at 
> 

Re: set spark 1.6 with Hive 0.14 ?

2016-05-21 Thread Sri
Thanks Ted, I know that works in spark-shell; can we set the same in the
spark-sql shell?

If I don't set a Hive context, from my understanding Spark is using its own SQL 
and date functions, right? Like, for example, interval?
Thanks
Sri


Sent from my iPhone

> On 21 May 2016, at 08:19, Ted Yu  wrote:
> 
> In spark-shell:
> 
> scala> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.hive.HiveContext
> 
> scala> var hc: HiveContext = new HiveContext(sc)
> 
> FYI
> 
>> On Sat, May 21, 2016 at 8:11 AM, Sri  wrote:
>> Hi ,
>> 
>> You mean the hive-site.xml file, right? I did place the hive-site.xml in the 
>> spark conf, but I am not sure how certain Spark date functions like interval 
>> are still working.
>> Hive 0.14 doesn't have the interval function, so how is Spark managing to do that?
>> Does Spark have its own date functions? I am using the spark-sql shell, for your 
>> information.
>> 
>> Can I set hiveContext.sql in the spark-sql shell, as we do in a traditional 
>> Spark Scala application?
>> 
>> Thanks
>> Sri
>> 
>> Sent from my iPhone
>> 
>>> On 21 May 2016, at 02:24, Mich Talebzadeh  wrote:
>>> 
>>> So you want to use Hive version 0.14 when using Spark 1.6?
>>> 
>>> Go to directory $SPARK_HOME/conf and create a softlink to the hive-site.xml file
>>> 
>>> cd $SPARK_HOME
>>> hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6> cd conf
>>> hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6/conf> ls -ltr
>>> 
>>> lrwxrwxrwx  1 hduser hadoop   32 May  3 17:48 hive-site.xml -> 
>>> /usr/lib/hive/conf/hive-site.xml
>>> -
>>> 
>>> You can see the softlink in mine. Just create one as below
>>> 
>>> ln -s /usr/lib/hive/conf/hive-site.xml hive-site.xml
>>> 
>>> 
>>> That should work
>>> 
>>> HTH
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>>  
>>> 
 On 21 May 2016 at 00:57, kali.tumm...@gmail.com  
 wrote:
 Hi All ,
 
 Is there a way to ask spark and spark-sql to use Hive 0.14 version instead
 of inbuilt hive 1.2.1.
 
 I am testing spark-sql locally by downloading spark 1.6 from internet , I
 want to execute my hive queries in spark sql using hive version 0.14 can I
 go back to previous version just for a simple test.
 
 Please share out the steps involved.
 
 
 Thanks
 Sri
 
 
 
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/set-spark-1-6-with-Hive-0-14-tp26989.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
>>> 
> 


Re: set spark 1.6 with Hive 0.14 ?

2016-05-21 Thread Ted Yu
In spark-shell:

scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext

scala> var hc: HiveContext = new HiveContext(sc)

FYI

On Sat, May 21, 2016 at 8:11 AM, Sri  wrote:

> Hi ,
>
> You mean the hive-site.xml file, right? I did place the hive-site.xml in the
> spark conf, but I am not sure how certain Spark date functions like interval
> are still working.
> Hive 0.14 doesn't have the interval function, so how is Spark managing to do
> that?
> Does Spark have its own date functions? I am using the spark-sql shell, for
> your information.
>
> Can I set hiveContext.sql in the spark-sql shell, as we do in a traditional
> Spark Scala application?
>
> Thanks
> Sri
>
> Sent from my iPhone
>
> On 21 May 2016, at 02:24, Mich Talebzadeh 
> wrote:
>
> So you want to use Hive version 0.14 when using Spark 1.6?
>
> Go to directory $SPARK_HOME/conf and create a softlink to the hive-site.xml
> file
>
> cd $SPARK_HOME
> hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6> cd conf
> hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6/conf> ls -ltr
>
> lrwxrwxrwx  1 hduser hadoop   32 May  3 17:48 hive-site.xml ->
> /usr/lib/hive/conf/hive-site.xml
> -
>
> You can see the softlink in mine. Just create one as below
>
> ln -s /usr/lib/hive/conf/hive-site.xml hive-site.xml
>
>
> That should work
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 21 May 2016 at 00:57, kali.tumm...@gmail.com 
> wrote:
>
>> Hi All ,
>>
>> Is there a way to ask spark and spark-sql to use Hive 0.14 version instead
>> of inbuilt hive 1.2.1.
>>
>> I am testing spark-sql locally by downloading spark 1.6 from internet , I
>> want to execute my hive queries in spark sql using hive version 0.14 can I
>> go back to previous version just for a simple test.
>>
>> Please share out the steps involved.
>>
>>
>> Thanks
>> Sri
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/set-spark-1-6-with-Hive-0-14-tp26989.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com
>> .
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: set spark 1.6 with Hive 0.14 ?

2016-05-21 Thread Sri
Hi ,

You mean the hive-site.xml file, right? I did place the hive-site.xml in the spark 
conf, but I am not sure how certain Spark date functions like interval are still 
working.
Hive 0.14 doesn't have the interval function, so how is Spark managing to do that?
Does Spark have its own date functions? I am using the spark-sql shell, for your 
information.

Can I set hiveContext.sql in the spark-sql shell, as we do in a traditional Spark 
Scala application?

Thanks
Sri

Sent from my iPhone

> On 21 May 2016, at 02:24, Mich Talebzadeh  wrote:
> 
> So you want to use Hive version 0.14 when using Spark 1.6?
> 
> Go to directory $SPARK_HOME/conf and create a softlink to the hive-site.xml file
> 
> cd $SPARK_HOME
> hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6> cd conf
> hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6/conf> ls -ltr
> 
> lrwxrwxrwx  1 hduser hadoop   32 May  3 17:48 hive-site.xml -> 
> /usr/lib/hive/conf/hive-site.xml
> -
> 
> You can see the softlink in mine. Just create one as below
> 
> ln -s /usr/lib/hive/conf/hive-site.xml hive-site.xml
> 
> 
> That should work
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 21 May 2016 at 00:57, kali.tumm...@gmail.com  
>> wrote:
>> Hi All ,
>> 
>> Is there a way to ask spark and spark-sql to use Hive 0.14 version instead
>> of inbuilt hive 1.2.1.
>> 
>> I am testing spark-sql locally by downloading spark 1.6 from internet , I
>> want to execute my hive queries in spark sql using hive version 0.14 can I
>> go back to previous version just for a simple test.
>> 
>> Please share out the steps involved.
>> 
>> 
>> Thanks
>> Sri
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/set-spark-1-6-with-Hive-0-14-tp26989.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
> 


Re: Unit testing framework for Spark Jobs?

2016-05-21 Thread Lars Albertsson
Not that I can share, unfortunately. It is on my backlog to create a
repository with examples, but I am currently a bit overloaded, so don't
hold your breath. :-/

If you want to be notified when it happens, please follow me on Twitter or
Google+. See web site below for links.

Regards,

Lars Albertsson
Data engineering consultant
www.mapflat.com
+46 70 7687109
On May 18, 2016 20:14, "swetha kasireddy"  wrote:

> Hi Lars,
>
> Do you have any examples for the methods that you described for Spark
> batch and Streaming?
>
> Thanks!
>
> On Wed, Mar 30, 2016 at 2:41 AM, Lars Albertsson 
> wrote:
>
>> Thanks!
>>
>> It is on my backlog to write a couple of blog posts on the topic, and
>> eventually some example code, but I am currently busy with clients.
>>
>> Thanks for the pointer to Eventually - I was unaware. Fast exit on
>> exception would be a useful addition, indeed.
>>
>> Lars Albertsson
>> Data engineering consultant
>> www.mapflat.com
>> +46 70 7687109
>>
>> On Mon, Mar 28, 2016 at 2:00 PM, Steve Loughran 
>> wrote:
>> > this is a good summary - have you thought of publishing it at the end of
>> a URL for others to refer to?
>> >
>> >> On 18 Mar 2016, at 07:05, Lars Albertsson  wrote:
>> >>
>> >> I would recommend against writing unit tests for Spark programs, and
>> >> instead focus on integration tests of jobs or pipelines of several
>> >> jobs. You can still use a unit test framework to execute them. Perhaps
>> >> this is what you meant.
>> >>
>> >> You can use any of the popular unit test frameworks to drive your
>> >> tests, e.g. JUnit, Scalatest, Specs2. I prefer Scalatest, since it
>> >> gives you choice of TDD vs BDD, and it is also well integrated with
>> >> IntelliJ.
>> >>
>> >> I would also recommend against using testing frameworks tied to a
>> >> processing technology, such as Spark Testing Base. Although it does
>> >> seem well crafted, and makes it easy to get started with testing,
>> >> there are drawbacks:
>> >>
>> >> 1. I/O routines are not tested. Bundled test frameworks typically do
>> >> not materialise datasets on storage, but pass them directly in memory.
>> >> (I have not verified this for Spark Testing Base, but it looks so.)
>> >> I/O routines are therefore not exercised, and they often hide bugs,
>> >> e.g. related to serialisation.
>> >>
>> >> 2. You create a strong coupling between processing technology and your
>> >> tests. If you decide to change processing technology (which can happen
>> >> soon in this fast paced world...), you need to rewrite your tests.
>> >> Therefore, during a migration process, the tests cannot detect bugs
>> >> introduced in migration, and help you migrate fast.
>> >>
>> >> I recommend that you instead materialise input datasets on local disk,
>> >> run your Spark job, which writes output datasets to local disk, read
>> >> output from disk, and verify the results. You can still use Spark
>> >> routines to read and write input and output datasets. A Spark context
>> >> is expensive to create, so for speed, I would recommend reusing the
>> >> Spark context between input generation, running the job, and reading
>> >> output.
>> >>
>> >> This is easy to set up, so you don't need a dedicated framework for
>> >> it. Just put your common boilerplate in a shared test trait or base
>> >> class.
>> >>
>> >> In the future, when you want to replace your Spark job with something
>> >> shinier, you can still use the old tests, and only replace the part
>> >> that runs your job, giving you some protection from regression bugs.
>> >>
>> >>
>> >> Testing Spark Streaming applications is a different beast, and you can
>> >> probably not reuse much from your batch testing.
>> >>
>> >> For testing streaming applications, I recommend that you run your
>> >> application inside a unit test framework, e.g, Scalatest, and have the
>> >> test setup create a fixture that includes your input and output
>> >> components. For example, if your streaming application consumes from
>> >> Kafka and updates tables in Cassandra, spin up single node instances
>> >> of Kafka and Cassandra on your local machine, and connect your
>> >> application to them. Then feed input to a Kafka topic, and wait for
>> >> the result to appear in Cassandra.
>> >>
>> >> With this setup, your application still runs in Scalatest, the tests
>> >> run without custom setup in maven/sbt/gradle, and you can easily run
>> >> and debug inside IntelliJ.
>> >>
>> >> Docker is suitable for spinning up external components. If you use
>> >> Kafka, the Docker image spotify/kafka is useful, since it bundles
>> >> Zookeeper.
>> >>
>> >> When waiting for output to appear, don't sleep for a long time and
>> >> then check, since it will slow down your tests. Instead enter a loop
>> >> where you poll for the results and sleep for a few milliseconds in
>> >> between, with a long timeout (~30s) before the test fails with a
>> >> timeout.
>> 
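
A minimal sketch of that polling pattern with ScalaTest's Eventually, using an in-memory stand-in for the external store the job would write to:

import java.util.concurrent.atomic.AtomicReference

import org.scalatest.FunSuite
import org.scalatest.concurrent.Eventually
import org.scalatest.time.{Millis, Seconds, Span}

class PollingOutputSpec extends FunSuite with Eventually {

  test("output eventually appears, without long fixed sleeps") {
    // Stand-in for the external store (Cassandra, HDFS, ...) the job writes to.
    val stored = new AtomicReference[Option[String]](None)

    // Simulate the streaming job producing its result after a short delay.
    new Thread(new Runnable {
      override def run(): Unit = { Thread.sleep(500); stored.set(Some("done")) }
    }).start()

    // Poll every 50 ms and only fail after a generous ~30 s timeout.
    eventually(timeout(Span(30, Seconds)), interval(Span(50, Millis))) {
      assert(stored.get.contains("done"))
    }
  }
}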

Re: Spark Streaming S3 Error

2016-05-21 Thread Benjamin Kim
Ted,

I only see 1 jets3t-0.9.0 jar in the classpath after running this to list the 
jars.

val cl = ClassLoader.getSystemClassLoader
cl.asInstanceOf[java.net.URLClassLoader].getURLs.foreach(println)

/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/jars/jets3t-0.9.0.jar

I don’t know what else could be wrong.

Thanks,
Ben

> On May 21, 2016, at 4:18 AM, Ted Yu  wrote:
> 
> Maybe more than one version of jets3t-xx.jar was on the classpath.
> 
> FYI
> 
> On Fri, May 20, 2016 at 8:31 PM, Benjamin Kim  > wrote:
> I am trying to stream files from an S3 bucket using CDH 5.7.0’s version of 
> Spark 1.6.0. It seems not to work. I keep getting this error.
> 
> Exception in thread "JobGenerator" java.lang.VerifyError: Bad type on operand 
> stack
> Exception Details:
>   Location:
> 
> org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.copy(Ljava/lang/String;Ljava/lang/String;)V
>  @155: invokevirtual
>   Reason:
> Type 'org/jets3t/service/model/S3Object' (current frame, stack[4]) is not 
> assignable to 'org/jets3t/service/model/StorageObject'
>   Current Frame:
> bci: @155
> flags: { }
> locals: { 'org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore', 
> 'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object' }
> stack: { 'org/jets3t/service/S3Service', 'java/lang/String', 
> 'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object', 
> integer }
>   Bytecode:
> 0x000: b200 fcb9 0190 0100 9900 39b2 00fc bb01
> 0x010: 5659 b701 5713 0192 b601 5b2b b601 5b13
> 0x020: 0194 b601 5b2c b601 5b13 0196 b601 5b2a
> 0x030: b400 7db6 00e7 b601 5bb6 015e b901 9802
> 0x040: 002a b400 5799 0030 2ab4 0047 2ab4 007d
> 0x050: 2b01 0101 01b6 019b 4e2a b400 6b09 949e
> 0x060: 0016 2db6 019c 2ab4 006b 949e 000a 2a2d
> 0x070: 2cb6 01a0 b1bb 00a0 592c b700 a14e 2d2a
> 0x080: b400 73b6 00b0 2ab4 0047 2ab4 007d b600
> 0x090: e72b 2ab4 007d b600 e72d 03b6 01a4 57a7
> 0x0a0: 000a 4e2a 2d2b b700 c7b1
>   Exception Handler Table:
> bci [0, 116] => handler: 162
> bci [117, 159] => handler: 162
>   Stackmap Table:
> same_frame_extended(@65)
> same_frame(@117)
> same_locals_1_stack_item_frame(@162,Object[#139])
> same_frame(@169)
> 
> at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:338)
> at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:328)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2696)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2733)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2715)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:382)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
> at org.apache.spark.streaming.dstream.FileInputDStream.org 
> $apache$spark$streaming$dstream$FileInputDStream$$fs(FileInputDStream.scala:297)
> at 
> org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:198)
> at 
> org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:149)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
> at scala.Option.orElse(Option.scala:257)
> at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
> at 

Re: spark on yarn

2016-05-21 Thread Shushant Arora
3. And is the same behavior applied to streaming applications also?

On Sat, May 21, 2016 at 7:44 PM, Shushant Arora 
wrote:

> And will it allocate the rest of the executors when other containers, which
> were occupied by other Hadoop jobs/Spark applications, get freed?
>
> And is there any minimum (% of executors demanded vs. available) number of
> executors it waits for to be freed, or does it just start with even 1?
>
> Thanks!
>
> On Thu, Apr 21, 2016 at 8:39 PM, Steve Loughran 
> wrote:
>
>> If there isn't enough space in your cluster for all the executors you
>> asked for to be created, Spark will only get the ones which can be
>> allocated. It will start work without waiting for the others to arrive.
>>
>> Make sure you ask for enough memory: YARN is a lot more unforgiving about
>> memory use than it is about CPU
>>
>> > On 20 Apr 2016, at 16:21, Shushant Arora 
>> wrote:
>> >
>> > I am running a spark application on yarn cluster.
>> >
>> > say I have available vcores in the cluster as 100. And I start the spark
>> application with --num-executors 200 --num-cores 2 (so I need a total of
>> 200*2=400 vcores), but in my cluster only 100 are available.
>> >
>> > What will happen? Will the job abort, or will it be submitted
>> successfully with 100 vcores allocated to 50 executors, and the rest of the
>> executors started as soon as vcores are available?
>> >
>> > Please note dynamic allocation is not enabled in cluster. I have old
>> version 1.2.
>> >
>> > Thanks
>> >
>>
>>
>


Re: spark on yarn

2016-05-21 Thread Shushant Arora
And will it allocate the rest of the executors when other containers, which
were occupied by other Hadoop jobs/Spark applications, get freed?

And is there any minimum (% of executors demanded vs. available) number of
executors it waits for to be freed, or does it just start with even 1?

Thanks!

On Thu, Apr 21, 2016 at 8:39 PM, Steve Loughran 
wrote:

> If there isn't enough space in your cluster for all the executors you
> asked for to be created, Spark will only get the ones which can be
> allocated. It will start work without waiting for the others to arrive.
>
> Make sure you ask for enough memory: YARN is a lot more unforgiving about
> memory use than it is about CPU
>
> > On 20 Apr 2016, at 16:21, Shushant Arora 
> wrote:
> >
> > I am running a spark application on yarn cluster.
> >
> > say I have available vcores in the cluster as 100. And I start the spark
> application with --num-executors 200 --num-cores 2 (so I need a total of
> 200*2=400 vcores), but in my cluster only 100 are available.
> >
> > What will happen? Will the job abort, or will it be submitted
> successfully with 100 vcores allocated to 50 executors, and the rest of the
> executors started as soon as vcores are available?
> >
> > Please note dynamic allocation is not enabled in cluster. I have old
> version 1.2.
> >
> > Thanks
> >
>
>


Re: Spark Streaming S3 Error

2016-05-21 Thread Ted Yu
Maybe more than one version of jets3t-xx.jar was on the classpath.

FYI

On Fri, May 20, 2016 at 8:31 PM, Benjamin Kim  wrote:

> I am trying to stream files from an S3 bucket using CDH 5.7.0’s version of
> Spark 1.6.0. It seems not to work. I keep getting this error.
>
> Exception in thread "JobGenerator" java.lang.VerifyError: Bad type on
> operand stack
> Exception Details:
>   Location:
>
> org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.copy(Ljava/lang/String;Ljava/lang/String;)V
> @155: invokevirtual
>   Reason:
> Type 'org/jets3t/service/model/S3Object' (current frame, stack[4]) is
> not assignable to 'org/jets3t/service/model/StorageObject'
>   Current Frame:
> bci: @155
> flags: { }
> locals: { 'org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore',
> 'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object'
> }
> stack: { 'org/jets3t/service/S3Service', 'java/lang/String',
> 'java/lang/String', 'java/lang/String',
> 'org/jets3t/service/model/S3Object', integer }
>   Bytecode:
> 0x000: b200 fcb9 0190 0100 9900 39b2 00fc bb01
> 0x010: 5659 b701 5713 0192 b601 5b2b b601 5b13
> 0x020: 0194 b601 5b2c b601 5b13 0196 b601 5b2a
> 0x030: b400 7db6 00e7 b601 5bb6 015e b901 9802
> 0x040: 002a b400 5799 0030 2ab4 0047 2ab4 007d
> 0x050: 2b01 0101 01b6 019b 4e2a b400 6b09 949e
> 0x060: 0016 2db6 019c 2ab4 006b 949e 000a 2a2d
> 0x070: 2cb6 01a0 b1bb 00a0 592c b700 a14e 2d2a
> 0x080: b400 73b6 00b0 2ab4 0047 2ab4 007d b600
> 0x090: e72b 2ab4 007d b600 e72d 03b6 01a4 57a7
> 0x0a0: 000a 4e2a 2d2b b700 c7b1
>   Exception Handler Table:
> bci [0, 116] => handler: 162
> bci [117, 159] => handler: 162
>   Stackmap Table:
> same_frame_extended(@65)
> same_frame(@117)
> same_locals_1_stack_item_frame(@162,Object[#139])
> same_frame(@169)
>
> at
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:338)
> at
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:328)
> at
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2696)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2733)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2715)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:382)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
> at org.apache.spark.streaming.dstream.FileInputDStream.org
> $apache$spark$streaming$dstream$FileInputDStream$$fs(FileInputDStream.scala:297)
> at
> org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:198)
> at
> org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:149)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
> at
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
> at scala.Option.orElse(Option.scala:257)
> at
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:341)
> at
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
> at
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
> at
> 

Re: set spark 1.6 with Hive 0.14 ?

2016-05-21 Thread Jörn Franke
What is the motivation to use such an old version of Hive? This will lead to 
worse performance and other risks.

> On 21 May 2016, at 01:57, "kali.tumm...@gmail.com"  
> wrote:
> 
> Hi All , 
> 
> Is there a way to ask spark and spark-sql to use Hive 0.14 version instead
> of inbuilt hive 1.2.1.
> 
> I am testing spark-sql locally by downloading spark 1.6 from internet , I
> want to execute my hive queries in spark sql using hive version 0.14 can I
> go back to previous version just for a simple test.
> 
> Please share out the steps involved.
> 
> 
> Thanks
> Sri
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/set-spark-1-6-with-Hive-0-14-tp26989.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: set spark 1.6 with Hive 0.14 ?

2016-05-21 Thread Mich Talebzadeh
So you want to use Hive version 0.14 when using Spark 1.6?

Go to directory $SPARK_HOME/conf and create a softlink to the hive-site.xml file

cd $SPARK_HOME
hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6> cd conf
hduser@rhes564: /usr/lib/spark-1.6.1-bin-hadoop2.6/conf> ls -ltr

lrwxrwxrwx  1 hduser hadoop   32 May  3 17:48 hive-site.xml ->
/usr/lib/hive/conf/hive-site.xml
-

You can see the softlink in mine. Just create one as below

ln -s /usr/lib/hive/conf/hive-site.xml hive-site.xml


That should work

HTH

Dr Mich Talebzadeh



LinkedIn: 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 21 May 2016 at 00:57, kali.tumm...@gmail.com 
wrote:

> Hi All ,
>
> Is there a way to ask spark and spark-sql to use Hive 0.14 version instead
> of inbuilt hive 1.2.1.
>
> I am testing spark-sql locally by downloading spark 1.6 from internet , I
> want to execute my hive queries in spark sql using hive version 0.14 can I
> go back to previous version just for a simple test.
>
> Please share out the steps involved.
>
>
> Thanks
> Sri
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/set-spark-1-6-with-Hive-0-14-tp26989.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


How to avoid empty unavoidable group by keys in DataFrame?

2016-05-21 Thread unk1102
Hi, I have a Spark job which does a group by, and I can't avoid it because of my
use case. I have a large dataset, around 1 TB, which I need to process/update in
a DataFrame. My job shuffles a huge amount of data and slows down because of the
shuffling and group by. One reason I see is that my data is skewed: some of my
group-by keys are empty. How do I avoid empty group-by keys in a DataFrame? Does
DataFrame avoid empty group-by keys? I have around 8 keys on which I group by.

sourceFrame.select("blabla").groupBy("col1", "col2", "col3", ..., "col8").agg("blabla");



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-avoid-empty-unavoidable-group-by-keys-in-DataFrame-tp26992.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org