Re: Can a Spark Driver Program be a REST Service by itself?

2015-07-01 Thread Arush Kharbanda
You can try using Spark Jobserver

https://github.com/spark-jobserver/spark-jobserver
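If you do want the driver itself to expose an HTTP endpoint for config pushes, here is a
minimal, untested Scala sketch using the JDK's built-in HttpServer; the port, path and
the AtomicReference holder are made-up illustrations, not part of any Spark API:

import java.net.InetSocketAddress
import java.util.concurrent.atomic.AtomicReference
import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import org.apache.spark.{SparkConf, SparkContext}

object DriverWithRestEndpoint {
  // Hypothetical holder for the dynamically pushed action configuration.
  val currentAction = new AtomicReference[String]("default-action")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("driver-as-service"))

    // Tiny HTTP endpoint living in the same JVM as the driver.
    val server = HttpServer.create(new InetSocketAddress(8090), 0)
    server.createContext("/config", new HttpHandler {
      override def handle(exchange: HttpExchange): Unit = {
        val body = scala.io.Source.fromInputStream(exchange.getRequestBody).mkString
        currentAction.set(body.trim)           // the "config push"
        exchange.sendResponseHeaders(200, -1)  // no response body
        exchange.close()
      }
    })
    server.start()

    // ... build the event-stream aggregation here, reading currentAction.get()
    // when deciding how to act on each batch ...
  }
}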

On Wed, Jul 1, 2015 at 4:32 PM, Spark Enthusiast sparkenthusi...@yahoo.in
wrote:

 Folks,

 My Use case is as follows:

 My Driver program will be aggregating a bunch of Event Streams and acting
 on it. The Action on the aggregated events is configurable and can change
 dynamically.

 One way I can think of is to run the Spark Driver as a Service where a
 config push can be caught via an API that the Driver exports.
 Can I have a Spark Driver Program run as a REST Service by itself? Is this
 a common use case?
 Is there a better way to solve my problem?

 Thanks






Re: How many executors can I acquire in standalone mode ?

2015-05-26 Thread Arush Kharbanda
I believe you would be restricted by the number of cores you have in your
cluster. Having a worker running without a core is useless.
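To make that concrete, a hedged sketch of the standalone-mode settings involved (the
values are only illustrative): by default an application greedily grabs every core the
cluster offers, and spark.cores.max caps that.

// Standalone mode is greedy by default: the app takes every core it can get.
// Capping spark.cores.max bounds how much of the cluster one app occupies.
val conf = new org.apache.spark.SparkConf()
  .setAppName("bounded-app")
  .set("spark.cores.max", "16")        // cap on total cores taken across the cluster
  .set("spark.executor.memory", "4g")  // memory granted to each executor
val sc = new org.apache.spark.SparkContext(conf)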

On Tue, May 26, 2015 at 3:04 PM, canan chen ccn...@gmail.com wrote:

 In Spark standalone mode, there will be one executor per worker. I am
 wondering how many executors I can acquire when I submit an app. Is it greedy
 mode (as many as I can acquire)?






Re: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread Arush Kharbanda
Hi Evo,

The worker is the JVM, and an executor runs on that JVM. After Spark 1.4 you
will be able to run multiple executors on the same JVM/worker.

https://issues.apache.org/jira/browse/SPARK-1706.

Thanks
Arush

On Tue, May 26, 2015 at 2:54 PM, canan chen ccn...@gmail.com wrote:

 I think the concept of task in spark should be on the same level of task
 in MR. Usually in MR, we need to specify the memory the each mapper/reducer
 task. And I believe executor is not a user-facing concept, it's a spark
 internal concept. For spark users they don't need to know the concept of
 executor, but need to know the concept of task.

 On Tue, May 26, 2015 at 5:09 PM, Evo Eftimov evo.efti...@isecc.com
 wrote:

 This is the first time I hear that “one can specify the RAM per task” –
 the RAM is granted per Executor (JVM). On the other hand each Task operates
 on ONE RDD Partition – so you can say that this is “the RAM allocated to
 the Task to process” – but it is still within the boundaries allocated to
 the Executor (JVM) within which the Task is running. Also while running,
 any Task like any JVM Thread can request as much additional RAM e.g. for
 new Object instances  as there is available in the Executor aka JVM Heap



 *From:* canan chen [mailto:ccn...@gmail.com]
 *Sent:* Tuesday, May 26, 2015 9:30 AM
 *To:* Evo Eftimov
 *Cc:* user@spark.apache.org
 *Subject:* Re: How does spark manage the memory of executor with
 multiple tasks



 Yes, I know that one task represents a JVM thread. This is what
 confuses me. Usually users want to specify memory at the task level, so how
 can I do that if a task is thread-level and multiple tasks run in the same
 executor? I don't even know how many threads there will be. Besides
 that, if one task causes an OOM, it would cause other tasks in the same
 executor to fail too. There's no isolation between tasks.



 On Tue, May 26, 2015 at 4:15 PM, Evo Eftimov evo.efti...@isecc.com
 wrote:

 An Executor is a JVM instance spawned and running on a Cluster Node
 (Server machine). Task is essentially a JVM Thread – you can have as many
 Threads as you want per JVM. You will also hear about “Executor Slots” –
 these are essentially the CPU Cores available on the machine and granted
 for use to the Executor



 Ps: what creates ongoing confusion here is that the Spark folks have
 “invented” their own terms to describe the design of what is
 essentially a Distributed OO Framework facilitating Parallel Programming
 and Data Management in a Distributed Environment, BUT have not provided
 clear dictionary/explanations linking these “inventions” with standard
 concepts familiar to every Java, Scala etc developer



 *From:* canan chen [mailto:ccn...@gmail.com]
 *Sent:* Tuesday, May 26, 2015 9:02 AM
 *To:* user@spark.apache.org
 *Subject:* How does spark manage the memory of executor with multiple
 tasks



 Since Spark can run multiple tasks in one executor, I am curious to
 know how Spark manages memory across these tasks. Say one executor
 takes 1 GB of memory; if this executor can run 10 tasks simultaneously,
 then each task can consume 100 MB on average. Do I understand it correctly?
 It doesn't make sense to me that Spark runs multiple tasks in one executor.









Re: Websphere MQ as a data source for Apache Spark Streaming

2015-05-25 Thread Arush Kharbanda
Hi Umesh,

You can connect to Spark Streaming with MQTT; refer to the example:

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/MQTTWordCount.scala
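For reference, a condensed, untested sketch of that example. It needs the
spark-streaming-mqtt artifact on the classpath and assumes the MQ installation exposes
an MQTT endpoint; the broker URL and topic below are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.mqtt.MQTTUtils

val conf = new SparkConf().setAppName("mq-mqtt-wordcount")
val ssc  = new StreamingContext(conf, Seconds(10))

// Point these at the MQTT/telemetry endpoint exposed for the MQ queues.
val lines = MQTTUtils.createStream(ssc, "tcp://mq-host:1883", "some/topic")
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

ssc.start()
ssc.awaitTermination()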



Thanks
Arush



On Mon, May 25, 2015 at 3:43 PM, umesh9794 umesh.chaudh...@searshc.com
wrote:

 I was digging into the possibilities for WebSphere MQ as a data source for
 spark-streaming because it is needed in one of our use cases. I got to know
 that MQTT http://mqtt.org/ is the protocol that supports the
 communication from MQ data structures, but since I am a newbie to Spark
 Streaming I need some working examples for the same. Did anyone try to
 connect MQ with Spark Streaming? Please advise on the best way of doing
 so.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Websphere-MQ-as-a-data-source-for-Apache-Spark-Streaming-tp23013.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Spark SQL performance issue.

2015-04-23 Thread Arush Kharbanda
Hi

Can you share your Web UI view depicting the task-level breakup? I can see many
things that can be improved:

1. JavaRDD<Person> rdds = ...; rdds.cache(); - this caching is not needed, as
you are not reading the rdd for any action.

2. Instead of collecting the result as a list, save it as a text file if you can;
that would avoid moving the results to the driver.
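A hedged Scala rendering of those two changes, assuming the sc and sqlContext from the
original snippet (the input path, output path and salary bounds are placeholders):

case class Person(id: Int, name: String, salary: Double)

// Cache only the registered table; the raw RDD is not read by anything else.
val people = sc.textFile("hdfs:///data/people.csv")
  .map(_.split(","))
  .map(p => Person(p(0).toInt, p(1), p(2).toDouble))

import sqlContext.implicits._
people.toDF().registerTempTable("person")
sqlContext.cacheTable("person")

// Write the result out instead of collectAsList(), so rows never travel to the driver.
sqlContext
  .sql("SELECT id, name, salary FROM person WHERE salary >= 1000 AND salary <= 5000")
  .rdd
  .saveAsTextFile("hdfs:///results/salaries")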

Thanks
Arush

On Thu, Apr 23, 2015 at 2:47 PM, Nikolay Tikhonov tikhonovnico...@gmail.com
 wrote:

  why are you caching both the rdd and the table?
 I try to cache all the data to avoid bad performance for the first
 query. Is that right?

  Which stage of the job is slow?
 The query is run many times on one sqlContext and each query execution
 takes 1 second.

 2015-04-23 11:33 GMT+03:00 ayan guha guha.a...@gmail.com:

 Quick questions: why are you caching both the rdd and the table?
 Which stage of the job is slow?
 On 23 Apr 2015 17:12, Nikolay Tikhonov tikhonovnico...@gmail.com
 wrote:

 Hi,
 I have a Spark SQL performance issue. My code contains a simple JavaBean:

 public class Person implements Externalizable {
     private int id;
     private String name;
     private double salary;
     ...
 }

 Apply a schema to an RDD and register the table:

 JavaRDD<Person> rdds = ...
 rdds.cache();

 DataFrame dataFrame = sqlContext.createDataFrame(rdds, Person.class);
 dataFrame.registerTempTable("person");

 sqlContext.cacheTable("person");

 Run the SQL query:

 sqlContext.sql("SELECT id, name, salary FROM person WHERE salary >= YYY
 AND salary <= XXX").collectAsList()

 I launch a standalone cluster which contains 4 workers. Each node runs on a
 machine with 8 CPUs and 15 GB memory. When I run the query in this environment
 over an RDD which contains 1 million persons, it takes 1 minute. Can somebody
 tell me how to tune the performance?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-performance-issue-tp22627.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org







Re: problem with spark thrift server

2015-04-23 Thread Arush Kharbanda
Hi

What do you mean by "disable the driver"? What are you trying to achieve?

Thanks
Arush

On Thu, Apr 23, 2015 at 12:29 PM, guoqing0...@yahoo.com.hk 
guoqing0...@yahoo.com.hk wrote:

 Hi ,
 I have a question about the Spark Thrift server. I deployed Spark on YARN
 and found that if the Spark driver is disabled, the Spark application
 crashes on YARN. I would appreciate any suggestions and ideas.

 Thank you!






[SQL] DROP TABLE should also uncache table

2015-04-16 Thread Arush Kharbanda
Hi

As per JIRA this issue is resolved, but I am still facing it.

SPARK-2734 - DROP TABLE should also uncache table




Re: [SQL] DROP TABLE should also uncache table

2015-04-16 Thread Arush Kharbanda
Yes, I am able to reproduce the problem. Do you need the scripts to create
the tables?

On Thu, Apr 16, 2015 at 10:50 PM, Yin Huai yh...@databricks.com wrote:

 Can you share the code that reproduces the problem?

 On Thu, Apr 16, 2015 at 5:42 AM, Arush Kharbanda 
 ar...@sigmoidanalytics.com wrote:

 Hi

 As per JIRA this issue is resolved, but i am still facing this issue.

 SPARK-2734 - DROP TABLE should also uncache table









Re: Can spark sql read existing tables created in hive

2015-03-27 Thread Arush Kharbanda
It seems Spark SQL accesses some additional columns apart from those created by Hive.

You can always recreate the tables (you would need to execute the table
creation scripts), but it would be good to avoid recreation.

On Fri, Mar 27, 2015 at 3:20 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 I did copy hive-conf.xml from the Hive installation into spark-home/conf. It
 does have all the metastore connection details: host, username, password,
 driver and others.



 Snippet
 ==


 <configuration>

 <property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:mysql://host.vip.company.com:3306/HDB</value>
 </property>

 <property>
   <name>javax.jdo.option.ConnectionDriverName</name>
   <value>com.mysql.jdbc.Driver</value>
   <description>Driver class name for a JDBC metastore</description>
 </property>

 <property>
   <name>javax.jdo.option.ConnectionUserName</name>
   <value>hiveuser</value>
   <description>username to use against metastore database</description>
 </property>

 <property>
   <name>javax.jdo.option.ConnectionPassword</name>
   <value>some-password</value>
   <description>password to use against metastore database</description>
 </property>

 <property>
   <name>hive.metastore.local</name>
   <value>false</value>
   <description>controls whether to connect to remote metastore server or
 open a new metastore server in Hive Client JVM</description>
 </property>

 <property>
   <name>hive.metastore.warehouse.dir</name>
   <value>/user/hive/warehouse</value>
   <description>location of default database for the warehouse</description>
 </property>

 ..



 When I attempt to read a Hive table, it does not work: dw_bid does not
 exist.

 I am sure there is a way to read tables stored in HDFS (Hive) from Spark
 SQL. Otherwise, how would anyone do analytics, since the source tables are
 always either persisted directly on HDFS or through Hive?


 On Fri, Mar 27, 2015 at 1:15 PM, Arush Kharbanda 
 ar...@sigmoidanalytics.com wrote:

 Hive and Spark SQL internally use HDFS and the Hive metastore; the only
 thing you want to change is the processing engine. You can try to bring
 your hive-site.xml to %SPARK_HOME%/conf/hive-site.xml (ensure that the
 hive-site.xml captures the metastore connection details).

 It's a bit of a hack and I haven't tried it, but I have played around with
 the metastore and it should work.

 On Fri, Mar 27, 2015 at 12:04 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
 wrote:

 I have a few tables that are created in Hive. I want to transform data
 stored in these Hive tables using Spark SQL. Is this even possible?

 So far I have seen that I can create new tables using the Spark SQL dialect.
 However, when I run show tables or do desc hive_table, it says table not
 found.

 I am now wondering is this support present or not in Spark SQL ?

 --
 Deepak








 --
 Deepak






Re: Can spark sql read existing tables created in hive

2015-03-27 Thread Arush Kharbanda
Hive and Spark SQL internally use HDFS and the Hive metastore; the only
thing you want to change is the processing engine. You can try to bring
your hive-site.xml to %SPARK_HOME%/conf/hive-site.xml (ensure that the
hive-site.xml captures the metastore connection details).

It's a bit of a hack and I haven't tried it, but I have played around with the
metastore and it should work.

On Fri, Mar 27, 2015 at 12:04 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:

 I have a few tables that are created in Hive. I want to transform data stored
 in these Hive tables using Spark SQL. Is this even possible?

 So far I have seen that I can create new tables using the Spark SQL dialect.
 However, when I run show tables or do desc hive_table, it says table not
 found.

 I am now wondering is this support present or not in Spark SQL ?

 --
 Deepak






Re: Checking Data Integrity in Spark

2015-03-27 Thread Arush Kharbanda
It's not possible to configure Spark to do checks based on XML files; you would
need to write jobs to do the validations you need.
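A minimal sketch of such a job (the paths and the two example rules are made up; the
rules sequence could itself be built by parsing an XML/JSON config, which gets close to
the configuration-driven setup asked about):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("integrity-check"))

// Each rule is a predicate over the parsed record.
val rules: Seq[Array[String] => Boolean] = Seq(
  cols => cols(0).nonEmpty,                         // NOT NULL on column 0
  cols => cols(3).matches("""\d{4}-\d{2}-\d{2}""")  // date format on column 3
)

val records  = sc.textFile("hdfs:///input/bigfile.csv").map(_.split(",", -1))
val valid    = records.filter(r => rules.forall(rule => rule(r)))
val rejected = records.filter(r => !rules.forall(rule => rule(r)))

valid.map(_.mkString(",")).saveAsTextFile("hdfs:///output/valid")
rejected.map(_.mkString(",")).saveAsTextFile("hdfs:///output/rejected")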

On Fri, Mar 27, 2015 at 5:13 PM, Sathish Kumaran Vairavelu 
vsathishkuma...@gmail.com wrote:

 Hello,

 I want to check if there is any way to verify the data integrity of the
 data files. The use case is to perform data integrity checks on large files with
 100+ columns and reject records (write them to another file) that do not meet
 criteria (such as NOT NULL, date format, etc.). Since there are a lot of
 columns/integrity rules, we should be able to drive the data integrity check through
 configuration (like XML, JSON, etc.). Please share your thoughts.


 Thanks

 Sathish






Re: Windowing and Analytics Functions in Spark SQL

2015-03-26 Thread Arush Kharbanda
You can look at the Spark SQL programming guide.
http://spark.apache.org/docs/1.3.0/sql-programming-guide.html

and the Spark API.
http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.package

On Thu, Mar 26, 2015 at 5:21 PM, Masf masfwo...@gmail.com wrote:

 Ok,

 Thanks. Some web resource where I could check the functionality supported
 by Spark SQL?

 Thanks!!!

 Regards.
 Miguel Ángel.

 On Thu, Mar 26, 2015 at 12:31 PM, Cheng Lian lian.cs@gmail.com
 wrote:

  We're working together with AsiaInfo on this. Possibly will deliver an
 initial version of window function support in 1.4.0. But it's not a promise
 yet.

 Cheng

 On 3/26/15 7:27 PM, Arush Kharbanda wrote:

 Its not yet implemented.

  https://issues.apache.org/jira/browse/SPARK-1442

 On Thu, Mar 26, 2015 at 4:39 PM, Masf masfwo...@gmail.com wrote:

 Hi.

  Are the Windowing and Analytics functions supported in Spark SQL (with
 HiveContext or not)? For example in Hive is supported
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics


  Some tutorial or documentation where I can see all features supported
 by Spark SQL?


  Thanks!!!
 --


 Regards.
 Miguel Ángel









 --


 Saludos.
 Miguel Ángel






Re: Windowing and Analytics Functions in Spark SQL

2015-03-26 Thread Arush Kharbanda
Its not yet implemented.

https://issues.apache.org/jira/browse/SPARK-1442

On Thu, Mar 26, 2015 at 4:39 PM, Masf masfwo...@gmail.com wrote:

 Hi.

 Are the Windowing and Analytics functions supported in Spark SQL (with
 HiveContext or not)? For example in Hive is supported
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics


 Some tutorial or documentation where I can see all features supported by
 Spark SQL?


 Thanks!!!
 --


 Regards.
 Miguel Ángel






Re: JavaKinesisWordCountASLYARN Example not working on EMR

2015-03-25 Thread Arush Kharbanda
 with 265.4 MB RAM,
 BlockManagerId(, ip-10-80-175-92.ec2.internal, 49982)
 15/03/25 11:52:51 INFO storage.BlockManagerMaster: Registered BlockManager
 *Exception in thread main java.lang.NullPointerException*
 *at

 org.apache.spark.deploy.yarn.ApplicationMaster$.sparkContextInitialized(ApplicationMaster.scala:581)*
 at

 org.apache.spark.scheduler.cluster.YarnClusterScheduler.postStartHook(YarnClusterScheduler.scala:32)
 at org.apache.spark.SparkContext.(SparkContext.scala:541)
 at

 org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:642)
 at org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:75)
 at

 org.apache.spark.streaming.api.java.JavaStreamingContext.(JavaStreamingContext.scala:132)
 at

 org.apache.spark.examples.streaming.JavaKinesisWordCountASLYARN.main(JavaKinesisWordCountASLYARN.java:127)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at

 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
 at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/JavaKinesisWordCountASLYARN-Example-not-working-on-EMR-tp6.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Spark SQL: Day of month from Timestamp

2015-03-24 Thread Arush Kharbanda
Hi

You can use functions like year(date),month(date)
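For example, with a HiveContext (so the Hive date UDFs, including day(), are available)
and assuming the existing SparkContext sc, a hedged sketch; the events table and ts
column are placeholders:

val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
hiveCtx.sql("SELECT year(ts) AS y, month(ts) AS m, day(ts) AS d FROM events")
  .collect()
  .foreach(println)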

Thanks
Arush

On Tue, Mar 24, 2015 at 12:46 PM, Harut Martirosyan 
harut.martiros...@gmail.com wrote:

 Hi guys.

 Basically, we had to define a UDF that does that, is there a built in
 function that we can use for it?

 --
 RGRDZ Harut






Re: Spark sql thrift server slower than hive

2015-03-23 Thread Arush Kharbanda
A basic change needed by Spark is setting the executor memory, which defaults
to 512 MB.

On Mon, Mar 23, 2015 at 10:16 AM, Denny Lee denny.g@gmail.com wrote:

 How are you running your spark instance out of curiosity?  Via YARN or
 standalone mode?  When connecting Spark thriftserver to the Spark service,
 have you allocated enough memory and CPU when executing with spark?

 On Sun, Mar 22, 2015 at 3:39 AM fanooos dev.fano...@gmail.com wrote:

 We have cloudera CDH 5.3 installed on one machine.

 We are trying to use spark sql thrift server to execute some analysis
 queries against hive table.

 Without any changes in the configurations, we run the following query on
 both hive and spark sql thrift server

 *select * from tableName;*

 The time taken by Spark is larger than the time taken by Hive, which is not
 supposed to be like that.

 The hive table is mapped to json files stored on HDFS directory and we are
 using *org.openx.data.jsonserde.JsonSerDe* for
 serialization/deserialization.

 Why does Spark take much more time to execute the query than Hive?



 --
 View this message in context: http://apache-spark-user-list.
 1001560.n3.nabble.com/Spark-sql-thrift-server-slower-than-
 hive-tp22177.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Idempotent count

2015-03-18 Thread Arush Kharbanda
Hi Binh,

It stores the state as well as the unprocessed data. The state is a subset of the
records that you have aggregated so far.

This provides a good reference for checkpointing.

http://spark.apache.org/docs/1.2.1/streaming-programming-guide.html#checkpointing
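As a hedged illustration (the source, checkpoint path and batch interval are
placeholders), a stateful count whose per-key running totals are exactly the state the
checkpoint carries, along with streaming metadata and not-yet-processed batches:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("stateful-count"), Seconds(5))
ssc.checkpoint("hdfs:///checkpoints/stateful-count")

// Stand-in source; this would be the Kafka stream in the real application.
val events = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1L))

val runningCounts = events.updateStateByKey[Long] { (newValues: Seq[Long], state: Option[Long]) =>
  Some(state.getOrElse(0L) + newValues.sum)
}

// Persist each batch to MySQL/VoltDB/etc. here instead of printing.
runningCounts.foreachRDD(rdd => rdd.take(10).foreach(println))

ssc.start()
ssc.awaitTermination()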


On Wed, Mar 18, 2015 at 12:52 PM, Binh Nguyen Van binhn...@gmail.com
wrote:

 Hi Arush,

 Thank you for answering!
 When you say checkpoints hold metadata and Data, what is the Data? Is it
 the Data that is pulled from input source or is it the state?
 If it is state then is it the same number of records that I aggregated
 since beginning or only a subset of it? How can I limit the size of
 state that is kept in checkpoint?

 Thank you
 -Binh

 On Tue, Mar 17, 2015 at 11:47 PM, Arush Kharbanda 
 ar...@sigmoidanalytics.com wrote:

 Hi

 Yes, Spark Streaming is capable of stateful stream processing; with or
 without state is a way of classifying stream processing.
 Checkpoints hold metadata and data.

 Thanks


 On Wed, Mar 18, 2015 at 4:00 AM, Binh Nguyen Van binhn...@gmail.com
 wrote:

 Hi all,

 I am new to Spark, so please forgive me if my questions are stupid.
 I am trying to use Spark Streaming in an application that reads data
 from a queue (Kafka), does some aggregation (sum, count, ...) and
 then persists the result to an external storage system (MySQL, VoltDB, ...).

 From my understanding of Spark Streaming, I can have two ways
 of doing aggregation:

- Stateless: I don't have to keep state and just apply new delta
values to the external system. From my understanding, doing it this way I
may end up with over-counting when there is failure and replay.
- Stateful: Use a checkpoint to keep state and blindly save the new state
to the external system. Doing it this way I get a correct aggregation result,
but I have to keep data in two places (state and external system).

 My questions are:

- Is my understanding of stateless and stateful aggregation
correct? If not, please correct me!
- For the stateful aggregation, what does Spark Streaming keep when
it saves a checkpoint?

 Please kindly help!

 Thanks
 -Binh











Re: sparksql native jdbc driver

2015-03-18 Thread Arush Kharbanda
Yes, I have been using Spark SQL from the onset. Haven't found any other
Server for Spark SQL for JDBC connectivity.

On Wed, Mar 18, 2015 at 5:50 PM, sequoiadb mailing-list-r...@sequoiadb.com
wrote:

 hey guys,

 In my understanding SparkSQL only supports JDBC connection through hive
 thrift server, is this correct?

 Thanks

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Idempotent count

2015-03-18 Thread Arush Kharbanda
Hi

Yes, Spark Streaming is capable of stateful stream processing; with or
without state is a way of classifying stream processing.
Checkpoints hold metadata and data.

Thanks


On Wed, Mar 18, 2015 at 4:00 AM, Binh Nguyen Van binhn...@gmail.com wrote:

 Hi all,

 I am new to Spark, so please forgive me if my questions are stupid.
 I am trying to use Spark Streaming in an application that reads data
 from a queue (Kafka), does some aggregation (sum, count, ...) and
 then persists the result to an external storage system (MySQL, VoltDB, ...).

 From my understanding of Spark Streaming, I can have two ways
 of doing aggregation:

- Stateless: I don't have to keep state and just apply new delta
values to the external system. From my understanding, doing it this way I
may end up with over-counting when there is failure and replay.
- Stateful: Use a checkpoint to keep state and blindly save the new state
to the external system. Doing it this way I get a correct aggregation result, but
I have to keep data in two places (state and external system).

 My questions are:

- Is my understanding of stateless and stateful aggregation correct?
If not, please correct me!
- For the stateful aggregation, what does Spark Streaming keep when
it saves a checkpoint?

 Please kindly help!

 Thanks
 -Binh






Re: Hive on Spark with Spark as a service on CDH5.2

2015-03-17 Thread Arush Kharbanda
Hive on Spark and accessing a HiveContext from the shell are separate things.

Hive on Spark -
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

To access Hive from Spark you need to build with -Phive:

http://spark.apache.org/docs/1.2.1/building-spark.html#building-with-hive-and-jdbc-support

On Tue, Mar 17, 2015 at 11:35 AM, anu anamika.guo...@gmail.com wrote:

 *I am not clear whether Spark SQL supports Hive on Spark when Spark is run as a
 service in CDH 5.2.*

 Can someone please clarify this? If this is possible, what configuration
 changes do I have to make to import a HiveContext in spark-shell, as well as to
 be able to do a spark-submit for the job to be run on the entire cluster?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Hive-on-Spark-with-Spark-as-a-service-on-CDH5-2-tp22091.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Does spark-1.3.0 support the analytic functions defined in Hive, such as row_number, rank

2015-03-16 Thread Arush Kharbanda
You can track the issue here.

https://issues.apache.org/jira/browse/SPARK-1442

It's currently not supported; I guess the test cases are a work in progress.


On Mon, Mar 16, 2015 at 12:44 PM, hseagle hsxup...@gmail.com wrote:

 Hi all,

  I'm wondering whether the latest spark-1.3.0 supports the windowing and
 analytic functions in Hive, such as row_number, rank, etc.

  Indeed, I've done some testing using spark-shell and found that
 row_number is not supported yet.

  But I still found that there were some test cases related to row_number
 and other analytic functions. These test cases are defined in

 sql/hive/target/scala-2.10/test-classes/ql/src/test/queries/clientpositive/windowing_multipartitioning.q

 So my question is divided into two parts: one is whether the analytic
 functions are supported or not; the other is, if they are not supported,
 why are there still some test cases?

 hseagle



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Does-spark-1-3-0-support-the-analytic-functions-defined-in-Hive-such-as-row-number-rank-tp22072.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: can not submit job to spark in windows

2015-03-12 Thread Arush Kharbanda
-56b32155-2779-4345-9597-2bfa6a87a51d\pi.py
 Traceback (most recent call last):
   File C:/spark-1.2.1-bin-hadoop2.4/bin/pi.py, line 29, in module
 sc = SparkContext(appName=PythonPi)
   File C:\spark-1.2.1-bin-hadoop2.4\python\pyspark\context.py, line 105,
 in __
 init__
 conf, jsc)
   File C:\spark-1.2.1-bin-hadoop2.4\python\pyspark\context.py, line 153,
 in _d
 o_init
 self._jsc = jsc or self._initialize_context(self._conf._jconf)
   File C:\spark-1.2.1-bin-hadoop2.4\python\pyspark\context.py, line 202,
 in _i
 nitialize_context
 return self._jvm.JavaSparkContext(jconf)
   File
 C:\spark-1.2.1-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\java_g
 ateway.py, line 701, in __call__
   File
 C:\spark-1.2.1-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\protoc
 ol.py, line 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling
 None.org.apache.spa
 rk.api.java.JavaSparkContext.
 : java.lang.NullPointerException
 at java.lang.ProcessBuilder.start(Unknown Source)
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:445)
 at org.apache.hadoop.util.Shell.run(Shell.java:418)
 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
 650)
 at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
 at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
 at org.apache.spark.util.Utils$.fetchFile(Utils.scala:445)
 at org.apache.spark.SparkContext.addFile(SparkContext.scala:1004)
 at
 org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:28
 8)
 at
 org.apache.spark.SparkContext$$anonfun$12.apply(SparkContext.scala:28
 8)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at org.apache.spark.SparkContext.init(SparkContext.scala:288)
 at
 org.apache.spark.api.java.JavaSparkContext.init(JavaSparkContext.sc
 ala:61)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)

 at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown
 Source)

 at
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
 Sou
 rce)
 at java.lang.reflect.Constructor.newInstance(Unknown Source)
 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
 at
 py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
 at py4j.Gateway.invoke(Gateway.java:214)
 at
 py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand
 .java:79)
 at
 py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
 at py4j.GatewayConnection.run(GatewayConnection.java:207)
 at java.lang.Thread.run(Unknown Source)

 What is wrong on my side?

 Should I run some scripts before spark-submit.cmd?

 Regards,
 Sergey.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/can-not-submit-job-to-spark-in-windows-tp21824.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: How to read from hdfs using spark-shell in Intel hadoop?

2015-03-11 Thread Arush Kharbanda
You can add resolvers on SBT using

resolvers +=
  "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
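A hypothetical build.sbt fragment along the same lines; the internal repository URL and
the coordinates below are placeholders for wherever the Intel hadoop-core artifacts
actually live:

resolvers += "Internal Intel Hadoop Repo" at "http://repo.example.internal/maven"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1"
libraryDependencies += "org.apache.hadoop" % "hadoop-core" % "1.0.3"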


On Thu, Feb 26, 2015 at 4:09 PM, MEETHU MATHEW meethu2...@yahoo.co.in
wrote:

 Hi,

 I am not able to read from HDFS(Intel distribution hadoop,Hadoop version
 is 1.0.3) from spark-shell(spark version is 1.2.1). I built spark using the
 command
 mvn -Dhadoop.version=1.0.3 clean package and started  spark-shell and read
 a HDFS file using sc.textFile() and the exception is

  WARN hdfs.DFSClient: Failed to connect to /10.88.6.133:50010, add to
 deadNodes and continuejava.net.SocketTimeoutException: 12 millis
 timeout while waiting for channel to be ready for read. ch :
 java.nio.channels.SocketChannel[connected local=/10.88.6.131:44264
 remote=/10.88.6.133:50010]

 The same problem is asked in this mail.
  RE: Spark is unable to read from HDFS
 http://mail-archives.us.apache.org/mod_mbox/spark-user/201309.mbox/%3cf97adee4fba8f6478453e148fe9e2e8d3cca3...@hasmsx106.ger.corp.intel.com%3E








 As suggested in the above mail,
 *In addition to specifying HADOOP_VERSION=1.0.3 in the
 ./project/SparkBuild.scala file, you will need to specify the
 libraryDependencies and name spark-core  resolvers. Otherwise, sbt will
 fetch version 1.0.3 of hadoop-core from apache instead of Intel. You can
 set up your own local or remote repository that you specify *

 Now HADOOP_VERSION is deprecated and -Dhadoop.version should be used. Can
 anybody please elaborate on how to specify that SBT should fetch hadoop-core
 from Intel, which is in our internal repository?

 Thanks  Regards,
 Meethu M






Re: Connecting a PHP/Java applications to Spark SQL Thrift Server

2015-03-04 Thread Arush Kharbanda
For Java, you can use the hive-jdbc connectivity jars to connect to Spark SQL.

The driver is inside the hive-jdbc jar.

*http://hive.apache.org/javadocs/r0.11.0/api/org/apache/hadoop/hive/jdbc/HiveDriver.html
http://hive.apache.org/javadocs/r0.11.0/api/org/apache/hadoop/hive/jdbc/HiveDriver.html*
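For the Spark SQL Thrift server specifically (which speaks the HiveServer2 protocol), a
minimal Scala sketch might look like the following; the same calls work from plain Java,
and the host, port, credentials and table name are placeholders:

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "user", "")
val stmt = conn.createStatement()
val rs   = stmt.executeQuery("SELECT * FROM some_table LIMIT 10")
while (rs.next()) {
  println(rs.getString(1))
}
rs.close(); stmt.close(); conn.close()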




On Wed, Mar 4, 2015 at 1:26 PM, n...@reactor8.com wrote:

 SparkSQL supports JDBC/ODBC connectivity, so if that's the route you
 needed/wanted to connect through you could do so via java/php apps.  Havent
 used either so cant speak to the developer experience, assume its pretty
 good as would be preferred method for lots of third party enterprise
 apps/tooling

 If you prefer using the thrift server/interface, if they don't exist
 already
 in open source land you can use thrift definitions to generate client libs
 in any supported thrift language and use that for connectivity.  Seems one
 issue with thrift-server is when running in cluster mode.  Seems like it
 still exists but UX of error has been cleaned up in 1.3:

 https://issues.apache.org/jira/browse/SPARK-5176



 -Original Message-
 From: fanooos [mailto:dev.fano...@gmail.com]
 Sent: Tuesday, March 3, 2015 11:15 PM
 To: user@spark.apache.org
 Subject: Connecting a PHP/Java applications to Spark SQL Thrift Server

 We have installed hadoop cluster with hive and spark and the spark sql
 thrift server is up and running without any problem.

 Now we have set of applications need to use spark sql thrift server to
 query
 some data.

 Some of these applications are java applications and the others are PHP
 applications.

 As I am an old-fashioned Java developer, I used to connect Java applications
 to DB servers like MySQL using a JDBC driver. Is there a corresponding
 driver for connecting with Spark Sql Thrift server ? Or what is the library
 I need to use to connect to it?


 For PHP, what are the ways we can use to connect PHP applications to Spark
 Sql Thrift Server?





 --
 View this message in context:

 http://apache-spark-user-list.1001560.n3.nabble.com/Connecting-a-PHP-Java-ap
 plications-to-Spark-SQL-Thrift-Server-tp21902.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
 commands, e-mail: user-h...@spark.apache.org



 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: TreeNodeException: Unresolved attributes

2015-03-04 Thread Arush Kharbanda
Why don't you build the query string first (by appending strings) and then pass it
to the hql function? Also, the hql function is deprecated; you should use sql instead.

http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
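A hedged sketch of that suggestion, reusing the table and column from the original
query (the SparkContext sc is assumed and the search string is a placeholder):

val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)

val search = "SomeString"
val query  = s"SELECT * FROM TableName WHERE column LIKE '%$search%'"
val restaurants = hiveCtx.sql(query)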

On Wed, Mar 4, 2015 at 6:15 AM, Anusha Shamanur anushas...@gmail.com
wrote:

 Hi,


 I am trying to run a simple select query on a table.


 val restaurants = hiveCtx.hql("select * from TableName where column like
 '%SomeString%'")

 This gives an error as below:

 org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
 attributes: *, tree:

 How do I solve this?


 --
 Regards,
 Anusha






Re: Unable to submit spark job to mesos cluster

2015-03-04 Thread Arush Kharbanda
)
 at
 com.algofusion.reconciliation.execution.ReconExecutionController.main(ReconExecutionController.java:105)
 Caused by: java.util.concurrent.TimeoutException: Futures timed out after
 [1 milliseconds]
 at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at akka.remote.Remoting.start(Remoting.scala:173)
 ... 18 more
 ]
 Exception in thread main java.lang.ExceptionInInitializerError
 at
 com.algofusion.reconciliation.execution.ReconExecutionController.initialize(ReconExecutionController.java:257)
 at
 com.algofusion.reconciliation.execution.ReconExecutionController.main(ReconExecutionController.java:105)
 Caused by: java.util.concurrent.TimeoutException: Futures timed out after
 [1 milliseconds]
 at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at akka.remote.Remoting.start(Remoting.scala:173)
 at
 akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
 at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579)
 at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577)
 at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588)
 at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
 at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
 at
 org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
 at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
 at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
 at
 org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1454)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1450)
 at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:156)
 at org.apache.spark.SparkContext.init(SparkContext.scala:203)
 at
 com.algofusion.reconciliation.execution.utils.ExecutionUtils.clinit(ExecutionUtils.java:130)
 ... 2 more

 Regards,
 Sarath.







Re: job keeps failing with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1

2015-02-27 Thread Arush Kharbanda
Can you share the error you are getting when the job fails?

On Thu, Feb 26, 2015 at 4:32 AM, Darin McBeath ddmcbe...@yahoo.com.invalid
wrote:

 I'm using Spark 1.2 with a stand-alone cluster on EC2. I have a cluster of 8
 r3.8xlarge machines but limit the job to only 128 cores.  I have also tried
 other things such as setting 4 workers per r3.8xlarge and 67 GB each, but
 this made no difference.

 The job frequently fails at the end in this step (saveAsHadoopFile).   It
 will sometimes work.

 finalNewBaselinePairRDD is hash-partitioned with 1024 partitions and a
 total size around 1 TB.  There are about 13.5M records in
 finalNewBaselinePairRDD.  finalNewBaselinePairRDD is <String,String>.

 JavaPairRDD<Text, Text> finalBaselineRDDWritable =
 finalNewBaselinePairRDD.mapToPair(new
 ConvertToWritableTypes()).persist(StorageLevel.MEMORY_AND_DISK_SER());

 // Save to hdfs (gzip)
 finalBaselineRDDWritable.saveAsHadoopFile("hdfs:///sparksync/",
 Text.class, Text.class,
 SequenceFileOutputFormat.class, org.apache.hadoop.io.compress.GzipCodec.class);


 If anyone has any tips for what I should look into it would be appreciated.

 Thanks.

 Darin.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: How to pass a org.apache.spark.rdd.RDD in a recursive function

2015-02-27 Thread Arush Kharbanda
Passing RDDs around like this is not a good idea: RDDs are immutable and can't be
changed inside functions. Have you considered taking a different approach?
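One possible alternative, sketched roughly (the step logic and stopping condition are
placeholders): drive the iteration from the driver with a loop that rebinds a new,
immutable RDD at each step instead of recursing with the RDD as a parameter.

var current   = sc.parallelize(1 to 1000)
var iteration = 0
while (iteration < 10) {               // replace with a real convergence test
  current = current.map(_ * 2).cache() // each step derives a brand-new RDD
  iteration += 1
}
println(current.take(5).mkString(", "))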

On Thu, Feb 26, 2015 at 3:42 AM, dritanbleco dritan.bl...@gmail.com wrote:

 Hello

 I am trying to pass an org.apache.spark.rdd.RDD table as a parameter to a
 recursive function. This table should be changed at every step of the
 recursion and cannot just be a global var.

 need help :)

 Thank you



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-pass-a-org-apache-spark-rdd-RDD-in-a-recursive-function-tp21805.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: updateStateByKey and invFunction

2015-02-24 Thread Arush Kharbanda
You can use a reduceByKeyAndWindow with your specific time window. You can
specify the inverse function in reduceByKeyAndWindow.
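A hedged sketch for the page-visit example from the question (the source, checkpoint
path and intervals are placeholders); the inverse function subtracts the batches that
slide out of the 2-hour window:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("page-visits"), Minutes(5))
ssc.checkpoint("hdfs:///checkpoints/page-visits")  // required for the inverse-function variant

// Stand-in source: each line is "page user".
val visits = ssc.socketTextStream("localhost", 9999)
  .map(_.split(" "))
  .map(a => ((a(0), a(1)), 1L))                    // ((page, user), 1)

val windowedCounts = visits.reduceByKeyAndWindow(
  (a: Long, b: Long) => a + b,   // add batches entering the window
  (a: Long, b: Long) => a - b,   // subtract batches leaving the window
  Minutes(120),                  // window length: 2 hours
  Minutes(5)                     // slide interval: 5 minutes
)

windowedCounts.print()
ssc.start()
ssc.awaitTermination()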

On Tue, Feb 24, 2015 at 1:36 PM, Ashish Sharma ashishonl...@gmail.com
wrote:

 So say I want to calculate top K users visiting a page in the past 2 hours
 updated every 5 mins.

 so here I want to maintain something like this

 Page_01 = {user_01:32, user_02:3, user_03:7...}
 ...

 Basically a count of number of times a user visited a page. Here my key is
 page name/id and state is the hashmap.

 Now in updateStateByKey I get the previous state and new events coming
 *in* the window. Is there a way to also get the events going *out* of the
 window? This way I can incrementally update the state over a rolling window.

 What is the efficient way to do it in spark streaming?

 Thanks
 Ashish






Re: On app upgrade, restore sliding window data.

2015-02-24 Thread Arush Kharbanda
I think this could be of some help to you.

https://issues.apache.org/jira/browse/SPARK-3660



On Tue, Feb 24, 2015 at 2:18 AM, Matus Faro matus.f...@kik.com wrote:

 Hi,

 Our application is being designed to operate at all times on a large
 sliding window (day+) of data. The operations performed on the window
 of data will change fairly frequently and I need a way to save and
 restore the sliding window after an app upgrade without having to wait
 the duration of the sliding window to warm up. Because it's an app
 upgrade, checkpointing will not work unfortunately.

 I can potentially dump the window to an outside storage periodically
 or on app shutdown, but I don't have an ideal way of restoring it.

 I thought about two non-ideal solutions:
 1. Load the previous data all at once into the sliding window on app
 startup. The problem is, at one point I will have double the data in
 the sliding window until the initial batch of data goes out of scope.
 2. Broadcast the previous state of the window separately from the
 window. Perform the operations on both sets of data until it comes out
 of scope. The problem is, the data will not fit into memory.

 Solutions that would solve my problem:
 1. Ability to pre-populate sliding window.
 2. Have control over batch slicing. It would be nice for a Receiver to
 dictate the current batch timestamp in order to slow down or fast
 forward time.

 Any feedback would be greatly appreciated!

 Thank you,
 Matus

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: Cannot access Spark web UI

2015-02-18 Thread Arush Kharbanda
)
 at
 org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
 at
 org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:767)
 at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
 at
 org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
 at
 org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
 at org.mortbay.jetty.Server.handle(Server.java:326)
 at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
 at
 org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
 at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
 at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
 at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
 at
 org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410)
 at
 org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)

 Powered by Jetty://

 --
 Thanks  Regards,

 *Mukesh Jha me.mukesh@gmail.com*







Re: Spark and Spark Streaming code sharing best practice.

2015-02-18 Thread Arush Kharbanda
I find monoids pretty useful in this respect, basically separating out the
logic in a monoid and then applying the logic to either a stream or a
batch. A list of such practices could be really useful.
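A tiny sketch of what that separation can look like in practice: keep the core logic as
a plain RDD-to-RDD function and apply it either to a batch RDD directly or to every
micro-batch of a DStream via transform.

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// The reusable piece: pure logic over an RDD, with no batch/streaming specifics.
def wordCounts(lines: RDD[String]): RDD[(String, Int)] =
  lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

// Batch layer: apply it to an RDD.
def runBatch(lines: RDD[String]): RDD[(String, Int)] = wordCounts(lines)

// Streaming layer: apply the same function to each micro-batch.
def runStreaming(lines: DStream[String]): DStream[(String, Int)] =
  lines.transform(wordCounts _)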

On Thu, Feb 19, 2015 at 12:26 AM, Jean-Pascal Billaud j...@tellapart.com
wrote:

 Hey,

 It seems pretty clear that one of the strengths of Spark is being able to
 share your code between your batch and streaming layers. Though, given that
 Spark Streaming uses a DStream, which is a set of RDDs, while batch Spark uses a
 single RDD, there might be some complexity associated with it.

 Of course, since a DStream is a superset of RDDs, one can just run the same
 code at the RDD granularity using DStream::foreachRDD. While this should
 work for map, I am not sure how that can work when it comes to the reduce phase,
 given that a group of keys spans multiple RDDs.

 One of the options is to change the dataset object a job works on.
 For example, instead of passing an RDD to a class method, one passes a higher-level
 object (MetaRDD) that wraps around an RDD or DStream depending on the context. At
 that point the job calls its regular maps, reduces and so on, and the
 MetaRDD wrapper delegates accordingly.

 I would just like to know the official best practice from the Spark
 community though.

 Thanks,






Re: Spark and Spark Streaming code sharing best practice.

2015-02-18 Thread Arush Kharbanda
Monoids are useful in aggregations. Also, try to avoid anonymous functions;
factoring functions out of the Spark code allows them to be reused
(possibly between Spark and Spark Streaming).

On Thu, Feb 19, 2015 at 6:56 AM, Jean-Pascal Billaud j...@tellapart.com
wrote:

 Thanks Arush. I will check that out.

 On Wed, Feb 18, 2015 at 11:06 AM, Arush Kharbanda 
 ar...@sigmoidanalytics.com wrote:

 I find monoids pretty useful in this respect, basically separating out
 the logic in a monoid and then applying the logic to either a stream or a
 batch. A list of such practices could be really useful.

 On Thu, Feb 19, 2015 at 12:26 AM, Jean-Pascal Billaud j...@tellapart.com
 wrote:

 Hey,

 It seems pretty clear that one of the strength of Spark is to be able to
 share your code between your batch and streaming layer. Though, given that
 Spark streaming uses DStream being a set of RDDs and Spark uses a single
 RDD there might some complexity associated with it.

 Of course since DStream is a superset of RDDs, one can just run the same
 code at the RDD granularity using DStream::forEachRDD. While this should
 work for map, I am not sure how that can work when it comes to reduce phase
 given that a group of keys spans across multiple RDDs.

 One of the option is to change the dataset object on which a job works
 on. For example of passing an RDD to a class method, one passes a higher
 level object (MetaRDD) that wraps around RDD or DStream depending the
 context. At this point the job calls its regular maps, reduces and so on
 and the MetaRDD wrapper would delegate accordingly.

 Just would like to know the official best practice from the spark
 community though.

 Thanks,











Re: How to integrate hive on spark

2015-02-18 Thread Arush Kharbanda
Hi

Did you try these steps.


https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started

Thanks
Arush

On Wed, Feb 18, 2015 at 7:20 PM, sandeepvura sandeepv...@gmail.com wrote:

 Hi ,

 I am new to Spark. I have installed Spark on a 3-node cluster. I would like to
 integrate Hive on Spark.

 Can anyone please help me with this?

 Regards,
 Sandeep.v



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-integrate-hive-on-spark-tp21702.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org






Re: How to connect a mobile app (Android/iOS) with a Spark backend?

2015-02-18 Thread Arush Kharbanda
I am running Spark jobs behind Tomcat. We didn't face any issues, but for us
the user base is very small.

The possible blockers could be:
1. If there are many users of the system, jobs might have to wait; you
might want to think about the kind of scheduling you want to do.
2. Again, if the number of users is a bit high, Tomcat doesn't scale really
well (not sure how much of a blocker that is).

Thanks
Arush

On Wed, Feb 18, 2015 at 6:41 PM, Ralph Bergmann | the4thFloor.eu 
ra...@the4thfloor.eu wrote:

 Hi,


 I have dependency problems to use spark-core inside of a HttpServlet
 (see other mail from me).

 Maybe I'm wrong?!

 What I want to do: I develop a mobile app (Android and iOS) and want to
 connect them with Spark on backend side.

 To do this I want to use Tomcat. The app uses https to ask Tomcat for
 the needed data and Tomcat asks Spark.

 Is this the right way? Or is there a better way to connect my mobile
 apps with the Spark backend?

 I hope that I'm not the first one who want to do this.



 Ralph






Re: JsonRDD to parquet -- data loss

2015-02-17 Thread Arush Kharbanda
I am not sure if this is the easiest way to solve your problem, but you can
connect to the Hive metastore (through Derby) and find the HDFS path from
there.

On Wed, Feb 18, 2015 at 9:31 AM, Vasu C vasuc.bigd...@gmail.com wrote:

 Hi,

 I am running spark batch processing job using spark-submit command. And
 below is my code snippet.  Basically converting JsonRDD to parquet and
 storing it in HDFS location.

 The problem I am facing is that if multiple jobs are triggered in parallel,
 even though each job executes properly (as I can see in the Spark web UI), there is
 no parquet file created in the HDFS path. If 5 jobs are executed in parallel, then
 only 3 parquet files get created.

 Is this a data loss scenario? Or am I missing something here? Please
 help me with this.

 Here tableName is unique with timestamp appended to it.


 val sqlContext = new org.apache.spark.sql.SQLContext(sc)

 val jsonRdd  = sqlContext.jsonRDD(results)

 val parquetTable = sqlContext.parquetFile(parquetFilePath)

 parquetTable.registerTempTable(tableName)

 jsonRdd.insertInto(tableName)


 Regards,

   Vasu C






Re: spark-core in a servlet

2015-02-17 Thread Arush Kharbanda
I am not sure if this could be causing the issue, but Spark is built against
Scala 2.10 by default. Instead of spark-core_2.11 you might want to try
spark-core_2.10.
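
That is, something like this, keeping the rest of your dependency (including
the exclusions) unchanged:

   dependency
  groupIdorg.apache.spark/groupId
  artifactIdspark-core_2.10/artifactId
  version1.2.1/version
   /dependency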

On Wed, Feb 18, 2015 at 5:44 AM, Ralph Bergmann | the4thFloor.eu 
ra...@the4thfloor.eu wrote:

 Hi,


 I want to use spark-core inside of a HttpServlet. I use Maven for the
 build task but I have a dependency problem :-(

 I get this error message:

 ClassCastException:

 com.sun.jersey.server.impl.container.servlet.JerseyServletContainerInitializer
 cannot be cast to javax.servlet.ServletContainerInitializer

 When I add this exclusions it builds but than there are other classes
 not found at runtime:

   dependency
  groupIdorg.apache.spark/groupId
  artifactIdspark-core_2.11/artifactId
  version1.2.1/version
  exclusions
 exclusion
groupIdorg.apache.hadoop/groupId
artifactIdhadoop-client/artifactId
 /exclusion
 exclusion
groupIdorg.eclipse.jetty/groupId
artifactId*/artifactId
 /exclusion
  /exclusions
   /dependency


 What can I do?


 Thanks a lot!,

 Ralph

 --

 Ralph Bergmann

 iOS and Android app developer


 www  http://www.the4thFloor.eu

 mail ra...@the4thfloor.eu
 skypedasralph

 google+  https://plus.google.com/+RalphBergmann
 xing https://www.xing.com/profile/Ralph_Bergmann3
 linkedin https://www.linkedin.com/in/ralphbergmann
 gulp https://www.gulp.de/Profil/RalphBergmann.html
 github   https://github.com/the4thfloor


 pgp key id   0x421F9B78
 pgp fingerprint  CEE3 7AE9 07BE 98DF CD5A E69C F131 4A8E 421F 9B78




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Class loading issue, spark.files.userClassPathFirst doesn't seem to be working

2015-02-17 Thread Arush Kharbanda
Hi

Did you try making Maven pick the latest version via dependency management?

http://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Management

That way SolrJ won't cause any issue. You can try this and then check whether
the part of your code that accesses HDFS works fine.
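
A sketch of what that could look like in your pom, pinning HttpClient to the
4.3.1 your message mentions (worth verifying against SolrJ 4.10.3's own pom):

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.httpcomponents</groupId>
      <artifactId>httpclient</artifactId>
      <version>4.3.1</version>
    </dependency>
  </dependencies>
</dependencyManagement>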



On Wed, Feb 18, 2015 at 10:23 AM, dgoldenberg dgoldenberg...@gmail.com
wrote:

 I'm getting the below error when running spark-submit on my class. This
 class
 has a transitive dependency on HttpClient v.4.3.1 since I'm calling SolrJ
 4.10.3 from within the class.

 This is in conflict with the older version, HttpClient 3.1 that's a
 dependency of Hadoop 2.4 (I'm running Spark 1.2.1 built for Hadoop 2.4).

 I've tried setting spark.files.userClassPathFirst to true in SparkConf in
 my
 program, also setting it to true in  $SPARK-HOME/conf/spark-defaults.conf
 as

 spark.files.userClassPathFirst true

 No go, I'm still getting the error, as below. Is there anything else I can
 try? Are there any plans in Spark to support multiple class loaders?

 Exception in thread main java.lang.NoSuchMethodError:

 org.apache.http.impl.conn.SchemeRegistryFactory.createSystemDefault()Lorg/apache/http/conn/scheme/SchemeRegistry;
 at

 org.apache.http.impl.client.SystemDefaultHttpClient.createClientConnectionManager(SystemDefaultHttpClient.java:121)
 at

 org.apache.http.impl.client.AbstractHttpClient.getConnectionManager(AbstractHttpClient.java:445)
 at

 org.apache.solr.client.solrj.impl.HttpClientUtil.setMaxConnections(HttpClientUtil.java:206)
 at

 org.apache.solr.client.solrj.impl.HttpClientConfigurer.configure(HttpClientConfigurer.java:35)
 at

 org.apache.solr.client.solrj.impl.HttpClientUtil.configureClient(HttpClientUtil.java:142)
 at

 org.apache.solr.client.solrj.impl.HttpClientUtil.createClient(HttpClientUtil.java:118)
 at

 org.apache.solr.client.solrj.impl.HttpSolrServer.init(HttpSolrServer.java:168)
 at

 org.apache.solr.client.solrj.impl.HttpSolrServer.init(HttpSolrServer.java:141)
 ...





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Class-loading-issue-spark-files-userClassPathFirst-doesn-t-seem-to-be-working-tp21693.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Configuration Problem? (need help to get Spark job executed)

2015-02-17 Thread Arush Kharbanda
.devpeng.xx:38773) with 8 cores
 15/02/14 10:30:13 INFO SparkDeploySchedulerBackend: Granted executor ID
 app-20150214103013-0001/1 on hostPort
 devpeng-db-cassandra-3.devpeng.xe:38773 with 8 cores, 512.0 MB RAM
 15/02/14 10:30:13 INFO AppClient$ClientActor: Executor updated:
 app-20150214103013-0001/0 is now LOADING
 15/02/14 10:30:13 INFO AppClient$ClientActor: Executor updated:
 app-20150214103013-0001/1 is now LOADING
 15/02/14 10:30:13 INFO AppClient$ClientActor: Executor updated:
 app-20150214103013-0001/0 is now RUNNING
 15/02/14 10:30:13 INFO AppClient$ClientActor: Executor updated:
 app-20150214103013-0001/1 is now RUNNING
 15/02/14 10:30:13 INFO NettyBlockTransferService: Server created on 58200
 15/02/14 10:30:13 INFO BlockManagerMaster: Trying to register BlockManager
 15/02/14 10:30:13 INFO BlockManagerMasterActor: Registering block manager
 192.168.2.103:58200 with 530.3 MB RAM, BlockManagerId(driver, ,
 58200)
 15/02/14 10:30:13 INFO BlockManagerMaster: Registered BlockManager
 15/02/14 10:30:13 INFO SparkDeploySchedulerBackend: SchedulerBackend is
 ready for scheduling beginning after reached minRegisteredResourcesRatio:
 0.0
 15/02/14 10:30:14 INFO Cluster: New Cassandra host
 devpeng-db-cassandra-1.devpeng.gkh-setu.de/:9042 added
 15/02/14 10:30:14 INFO Cluster: New Cassandra host /xxx:9042 added
 15/02/14 10:30:14 INFO Cluster: New Cassandra host :9042 added
 15/02/14 10:30:14 INFO CassandraConnector: Connected to Cassandra cluster:
 GKHDevPeng
 15/02/14 10:30:14 INFO LocalNodeFirstLoadBalancingPolicy: Adding host xxx
 (DC1)
 15/02/14 10:30:14 INFO LocalNodeFirstLoadBalancingPolicy: Adding host xxx
 (DC1)
 15/02/14 10:30:14 INFO LocalNodeFirstLoadBalancingPolicy: Adding host 
 (DC1)
 15/02/14 10:30:14 INFO LocalNodeFirstLoadBalancingPolicy: Adding host
 x (DC1)
 15/02/14 10:30:14 INFO LocalNodeFirstLoadBalancingPolicy: Adding host
 x (DC1)
 15/02/14 10:30:14 INFO LocalNodeFirstLoadBalancingPolicy: Adding host
 x (DC1)
 15/02/14 10:30:15 INFO CassandraConnector: Disconnected from Cassandra
 cluster: GKHDevPeng
 15/02/14 10:30:16 INFO SparkContext: Starting job: count at
 TestApp.scala:23
 15/02/14 10:30:16 INFO DAGScheduler: Got job 0 (count at TestApp.scala:23)
 with 3 output partitions (allowLocal=false)
 15/02/14 10:30:16 INFO DAGScheduler: Final stage: Stage 0(count at
 TestApp.scala:23)
 15/02/14 10:30:16 INFO DAGScheduler: Parents of final stage: List()
 15/02/14 10:30:16 INFO DAGScheduler: Missing parents: List()
 15/02/14 10:30:16 INFO DAGScheduler: Submitting Stage 0 (CassandraRDD[0]
 at RDD at CassandraRDD.scala:49), which has no missing parents
 15/02/14 10:30:16 INFO MemoryStore: ensureFreeSpace(4472) called with
 curMem=0, maxMem=556038881
 15/02/14 10:30:16 INFO MemoryStore: Block broadcast_0 stored as values in
 memory (estimated size 4.4 KB, free 530.3 MB)
 15/02/14 10:30:16 INFO MemoryStore: ensureFreeSpace(3082) called with
 curMem=4472, maxMem=556038881
 15/02/14 10:30:16 INFO MemoryStore: Block broadcast_0_piece0 stored as
 bytes in memory (estimated size 3.0 KB, free 530.3 MB)
 15/02/14 10:30:16 INFO BlockManagerInfo: Added broadcast_0_piece0 in
 memory on x  (size: 3.0 KB, free: 530.3 MB)
 15/02/14 10:30:16 INFO BlockManagerMaster: Updated info of block
 broadcast_0_piece0
 15/02/14 10:30:16 INFO SparkContext: Created broadcast 0 from broadcast at
 DAGScheduler.scala:838
 15/02/14 10:30:16 INFO DAGScheduler: Submitting 3 missing tasks from Stage
 0 (CassandraRDD[0] at RDD at CassandraRDD.scala:49)
 15/02/14 10:30:16 INFO TaskSchedulerImpl: Adding task set 0.0 with 3 tasks
 15/02/14 10:30:31 WARN TaskSchedulerImpl: Initial job has not accepted any
 resources; check your cluster UI to ensure that workers are registered and
 have sufficient memory







 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Identify the performance bottleneck from hardware prospective

2015-02-17 Thread Arush Kharbanda
Hi

How big is your dataset?

Thanks
Arush

On Tue, Feb 17, 2015 at 4:06 PM, Julaiti Alafate jalaf...@eng.ucsd.edu
wrote:

 Thank you very much for your reply!

 My task is to count the number of word pairs in a document. If w1 and w2
 occur together in one sentence, the number of occurrence of word pair (w1,
 w2) adds 1. So the computational part of this algorithm is simply a
 two-level for-loop.

 Since the cluster is monitored by Ganglia, I can easily see that neither
 CPU or network IO is under pressure. The only parameter left is memory. In
 the executor tab of Spark Web UI, I can see a column named memory used.
 It showed that only 6GB of 20GB memory is used. I understand this is
 measuring the size of RDD that persist in memory. So can I at least assume
 the data/object I used in my program is not exceeding memory limit?

 My confusion here is, why can't my program run faster while there is still
 efficient memory, CPU time and network bandwidth it can utilize?

 Best regards,
 Julaiti


 On Tue, Feb 17, 2015 at 12:53 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 What application are you running? Here's a few things:

 - You will hit bottleneck on CPU if you are doing some complex
 computation (like parsing a json etc.)
 - You will hit bottleneck on Memory if your data/objects used in the
 program is large (like defining playing with HashMaps etc inside your map*
 operations), Here you can set spark.executor.memory to a higher number and
 also you can change the spark.storage.memoryFraction whose default value is
 0.6 of your executor memory.
 - Network will be a bottleneck if data is not available locally on one of
 the worker and hence it has to collect it from others, which is a lot of
 Serialization and data transfer across your cluster.

 Thanks
 Best Regards

 On Tue, Feb 17, 2015 at 11:20 AM, Julaiti Alafate jalaf...@eng.ucsd.edu
 wrote:

 Hi there,

 I am trying to scale up the data size that my application is handling.
 This application is running on a cluster with 16 slave nodes. Each slave
 node has 60GB memory. It is running in standalone mode. The data is coming
 from HDFS that also in same local network.

 In order to have an understanding on how my program is running, I also
 had a Ganglia installed on the cluster. From previous run, I know the stage
 that taking longest time to run is counting word pairs (my RDD consists of
 sentences from a corpus). My goal is to identify the bottleneck of my
 application, then modify my program or hardware configurations according to
 that.

 Unfortunately, I didn't find too much information on Spark monitoring
 and optimization topics. Reynold Xin gave a great talk on Spark Summit 2014
 for application tuning from tasks perspective. Basically, his focus is on
 tasks that oddly slower than the average. However, it didn't solve my
 problem because there is no such tasks that run way slow than others in my
 case.

 So I tried to identify the bottleneck from hardware prospective. I want
 to know what the limitation of the cluster is. I think if the executers are
 running hard, either CPU, memory or network bandwidth (or maybe the
 combinations) is hitting the roof. But Ganglia reports the CPU utilization
 of cluster is no more than 50%, network utilization is high for several
 seconds at the beginning, then drop close to 0. From Spark UI, I can see
 the nodes with maximum memory usage is consuming around 6GB, while
 spark.executor.memory is set to be 20GB.

 I am very confused that the program is not running fast enough, while
 hardware resources are not in shortage. Could you please give me some hints
 about what decides the performance of a Spark application from hardware
 perspective?

 Thanks!

 Julaiti






-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Question about spark streaming+Flume

2015-02-16 Thread Arush Kharbanda
Hi

Can you try this:

val lines = FlumeUtils.createStream(ssc, "localhost", port)  // port: the port your Flume sink sends to
// Print out the count of events received from this server in each batch
lines.count().map(cnt => "Received " + cnt + " flume events. at " + System.currentTimeMillis())

lines.foreachRDD(_.foreach(println))

Thanks
Arush

On Tue, Feb 17, 2015 at 11:17 AM, bit1...@163.com bit1...@163.com wrote:

 Hi,
 I am trying Spark Streaming + Flume example:

 1. Code
 object SparkFlumeNGExample {
def main(args : Array[String]) {
val conf = new SparkConf().setAppName(SparkFlumeNGExample)
val ssc = new StreamingContext(conf, Seconds(10))

val lines = FlumeUtils.createStream(ssc,localhost,)
 // Print out the count of events received from this server in each
 batch
lines.count().map(cnt = Received  + cnt +  flume events. at  +
 System.currentTimeMillis() ).print()
ssc.start()
ssc.awaitTermination();
 }
 }
 2. I submit the application with following sh:
 ./spark-submit --deploy-mode client --name SparkFlumeEventCount --master
 spark://hadoop.master:7077 --executor-memory 512M --total-executor-cores 2
 --class spark.examples.streaming.SparkFlumeNGWordCount
 spark-streaming-flume.jar


 When I write data to flume, I only notice the following console
 information that input is added.
 storage.BlockManagerInfo: Added input-0-1424151807400 in memory on
 localhost:39338 (size: 1095.0 B, free: 267.2 MB)
 15/02/17 00:43:30 INFO scheduler.JobScheduler: Added jobs for time
 142415181 ms
 15/02/17 00:43:40 INFO scheduler.JobScheduler: Added jobs for time
 142415182 ms
 
 15/02/17 00:43:40 INFO scheduler.JobScheduler: Added jobs for time
 142415187 ms

 But I didn't the output from the code: Received X flumes events

 I am no idea where the problem is, any idea? Thanks


 --




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: hive-thriftserver maven artifact

2015-02-16 Thread Arush Kharbanda
You can build your own Spark with the -Phive-thriftserver option and publish
the jars to your local repository. I hope that solves your problem.
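
For example, something along these lines (adjust the Hadoop profile/version
to your environment; mvn install publishes the built artifacts, including the
thrift server module, into your local Maven repository):

mvn -Pyarn -Phadoop-2.4 -Phive -Phive-thriftserver -DskipTests clean install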

On Mon, Feb 16, 2015 at 8:54 PM, Marco marco@gmail.com wrote:

 Ok, so will it be only available for the next version (1.30)?

 2015-02-16 15:24 GMT+01:00 Ted Yu yuzhih...@gmail.com:

 I searched for 'spark-hive-thriftserver_2.10' on this page:
 http://mvnrepository.com/artifact/org.apache.spark

 Looks like it is not published.

 On Mon, Feb 16, 2015 at 5:44 AM, Marco marco@gmail.com wrote:

 Hi,

 I am referring to https://issues.apache.org/jira/browse/SPARK-4925
 (Hive Thriftserver Maven Artifact). Can somebody point me (URL) to the
 artifact in a public repository ? I have not found it @Maven Central.

 Thanks,
 Marco





 --
 Viele Grüße,
 Marco




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Master dies after program finishes normally

2015-02-12 Thread Arush Kharbanda
How many nodes do you have in your cluster, how many cores, and how much
memory?

On Fri, Feb 13, 2015 at 12:42 AM, Manas Kar manasdebashis...@gmail.com
wrote:

 Hi Arush,
  Mine is a CDH5.3 with Spark 1.2.
 The only change to my spark programs are
 -Dspark.driver.maxResultSize=3g -Dspark.akka.frameSize=1000.

 ..Manas

 On Thu, Feb 12, 2015 at 2:05 PM, Arush Kharbanda 
 ar...@sigmoidanalytics.com wrote:

 What is your cluster configuration? Did you try looking at the Web UI?
 There are many tips here

 http://spark.apache.org/docs/1.2.0/tuning.html

 Did you try these?

 On Fri, Feb 13, 2015 at 12:09 AM, Manas Kar manasdebashis...@gmail.com
 wrote:

 Hi,
  I have a Hidden Markov Model running with 200MB data.
  Once the program finishes (i.e. all stages/jobs are done) the program
 hangs for 20 minutes or so before killing master.

 In the spark master the following log appears.

 2015-02-12 13:00:05,035 ERROR akka.actor.ActorSystemImpl: Uncaught fatal
 error from thread [sparkMaster-akka.actor.default-dispatcher-31] shutting
 down ActorSystem [sparkMaster]
 java.lang.OutOfMemoryError: GC overhead limit exceeded
 at scala.collection.immutable.List$.newBuilder(List.scala:396)
 at
 scala.collection.generic.GenericTraversableTemplate$class.genericBuilder(GenericTraversableTemplate.scala:69)
 at
 scala.collection.AbstractTraversable.genericBuilder(Traversable.scala:105)
 at
 scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:58)
 at
 scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:53)
 at
 scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239)
 at
 scala.collection.TraversableLike$class.map(TraversableLike.scala:243)
 at
 scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at
 org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:26)
 at
 org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:22)
 at
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
 at
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
 at
 scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
 at org.json4s.MonadicJValue.org
 $json4s$MonadicJValue$$findDirectByName(MonadicJValue.scala:22)
 at org.json4s.MonadicJValue.$bslash(MonadicJValue.scala:16)
 at
 org.apache.spark.util.JsonProtocol$.taskStartFromJson(JsonProtocol.scala:450)
 at
 org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:423)
 at
 org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:71)
 at
 org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:69)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at
 org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:69)
 at
 org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:55)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at
 scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
 at
 org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55)
 at
 org.apache.spark.deploy.master.Master.rebuildSparkUI(Master.scala:726)
 at
 org.apache.spark.deploy.master.Master.removeApplication(Master.scala:675)
 at
 org.apache.spark.deploy.master.Master.finishApplication(Master.scala:653)
 at
 org.apache.spark.deploy.master.Master$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$29.apply(Master.scala:399)

 Can anyone help?

 ..Manas




 --

 [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

 *Arush Kharbanda* || Technical Teamlead

 ar...@sigmoidanalytics.com || www.sigmoidanalytics.com





-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Master dies after program finishes normally

2015-02-12 Thread Arush Kharbanda
What is your cluster configuration? Did you try looking at the Web UI?
There are many tips here

http://spark.apache.org/docs/1.2.0/tuning.html

Did you try these?

On Fri, Feb 13, 2015 at 12:09 AM, Manas Kar manasdebashis...@gmail.com
wrote:

 Hi,
  I have a Hidden Markov Model running with 200MB data.
  Once the program finishes (i.e. all stages/jobs are done) the program
 hangs for 20 minutes or so before killing master.

 In the spark master the following log appears.

 2015-02-12 13:00:05,035 ERROR akka.actor.ActorSystemImpl: Uncaught fatal
 error from thread [sparkMaster-akka.actor.default-dispatcher-31] shutting
 down ActorSystem [sparkMaster]
 java.lang.OutOfMemoryError: GC overhead limit exceeded
 at scala.collection.immutable.List$.newBuilder(List.scala:396)
 at
 scala.collection.generic.GenericTraversableTemplate$class.genericBuilder(GenericTraversableTemplate.scala:69)
 at
 scala.collection.AbstractTraversable.genericBuilder(Traversable.scala:105)
 at
 scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:58)
 at
 scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:53)
 at
 scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:239)
 at
 scala.collection.TraversableLike$class.map(TraversableLike.scala:243)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at
 org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:26)
 at
 org.json4s.MonadicJValue$$anonfun$org$json4s$MonadicJValue$$findDirectByName$1.apply(MonadicJValue.scala:22)
 at
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
 at
 scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at
 scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
 at
 scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
 at org.json4s.MonadicJValue.org
 $json4s$MonadicJValue$$findDirectByName(MonadicJValue.scala:22)
 at org.json4s.MonadicJValue.$bslash(MonadicJValue.scala:16)
 at
 org.apache.spark.util.JsonProtocol$.taskStartFromJson(JsonProtocol.scala:450)
 at
 org.apache.spark.util.JsonProtocol$.sparkEventFromJson(JsonProtocol.scala:423)
 at
 org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:71)
 at
 org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2$$anonfun$apply$1.apply(ReplayListenerBus.scala:69)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at
 org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:69)
 at
 org.apache.spark.scheduler.ReplayListenerBus$$anonfun$replay$2.apply(ReplayListenerBus.scala:55)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at
 scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
 at
 org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:55)
 at
 org.apache.spark.deploy.master.Master.rebuildSparkUI(Master.scala:726)
 at
 org.apache.spark.deploy.master.Master.removeApplication(Master.scala:675)
 at
 org.apache.spark.deploy.master.Master.finishApplication(Master.scala:653)
 at
 org.apache.spark.deploy.master.Master$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$29.apply(Master.scala:399)

 Can anyone help?

 ..Manas




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Spark Streaming distributed batch locking

2015-02-12 Thread Arush Kharbanda
* We have an inbound stream of sensor data for millions of devices (which
have unique identifiers). - Spark Streaming can handle events in the ballpark
of 100-500K records/sec/node, so you need to size your cluster accordingly,
and it scales out.

* We need to perform aggregation of this stream on a per device level. The
aggregation will read data that has already been processed (and persisted) in
previous batches. - You need stateful stream processing, which Spark
Streaming supports through updateStateByKey (a rough sketch follows below
these points):
http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html

* Key point: When we process data for a particular device we need to ensure
that no other processes are processing data for that particular device. This
is because the outcome of our processing will affect the downstream
processing for that device. Effectively we need a distributed lock. - You can
make the device identifier the key and then updateStateByKey on that key, so
all of a device's data in a batch is handled in one state update.

* In addition the event device data needs to be processed in the order that
the events occurred. - You would need to implement this in your code,
carrying a timestamp as a data item; Spark Streaming doesn't ensure in-order
delivery of your events.
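
A rough sketch of that per-device state update (the Event and DeviceState
types, the field names and the aggregation are placeholders for illustration;
updateStateByKey needs checkpointing enabled on the StreamingContext):

import org.apache.spark.streaming.StreamingContext._   // pair-DStream operations in Spark 1.2
import org.apache.spark.streaming.dstream.DStream

case class Event(deviceId: String, timestamp: Long, payload: String)
case class DeviceState(lastTimestamp: Long, aggregate: Double)

// stream: DStream[Event], e.g. parsed from the inbound sensor feed
def aggregatePerDevice(stream: DStream[Event]): DStream[(String, DeviceState)] = {
  val byDevice = stream.map(e => (e.deviceId, e))
  byDevice.updateStateByKey[DeviceState] {
    (newEvents: Seq[Event], previous: Option[DeviceState]) =>
      // All of one device's events in a batch are folded here, in one place,
      // which gives the "only one process touches a device" behaviour.
      val ordered = newEvents.sortBy(_.timestamp)  // enforce event-time order within the batch
      val start = previous.getOrElse(DeviceState(0L, 0.0))
      Some(ordered.foldLeft(start) { (state, e) =>
        DeviceState(e.timestamp, state.aggregate + e.payload.length)  // toy aggregation
      })
  }
}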

On Thu, Feb 12, 2015 at 4:51 PM, Legg John john.l...@axonvibe.com wrote:

 Hi

 After doing lots of reading and building a POC for our use case we are
 still unsure as to whether Spark Streaming can handle our use case:

 * We have an inbound stream of sensor data for millions of devices (which
 have unique identifiers).
 * We need to perform aggregation of this stream on a per device level.
 The aggregation will read data that has already been processed (and
 persisted) in previous batches.
 * Key point:  When we process data for a particular device we need to
 ensure that no other processes are processing data for that particular
 device.  This is because the outcome of our processing will affect the
 downstream processing for that device.  Effectively we need a distributed
 lock.
 * In addition the event device data needs to be processed in the order
 that the events occurred.

 Essentially we can¹t have two batches for the same device being processed
 at the same time.

 Can Spark handle our use case?

 Any advice appreciated.

 Regards
 John


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: 8080 port password protection

2015-02-12 Thread Arush Kharbanda
You could apply a password using a servlet filter on the server. Though this
doesn't look like the right group for the question, the same can be done for
Spark, for the Spark UI.
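
A sketch of that for the Spark web UIs, using Spark's spark.ui.filters
setting (which takes the fully-qualified name of a standard
javax.servlet.Filter); the class below is only illustrative, with a
hard-coded credential you would obviously replace:

package com.example

import javax.servlet._
import javax.servlet.http.{HttpServletRequest, HttpServletResponse}
import javax.xml.bind.DatatypeConverter

class BasicAuthFilter extends Filter {
  // "admin:secret" is a placeholder credential
  private val expected =
    "Basic " + DatatypeConverter.printBase64Binary("admin:secret".getBytes("UTF-8"))

  override def init(config: FilterConfig): Unit = {}
  override def destroy(): Unit = {}

  override def doFilter(req: ServletRequest, res: ServletResponse, chain: FilterChain): Unit = {
    val request  = req.asInstanceOf[HttpServletRequest]
    val response = res.asInstanceOf[HttpServletResponse]
    if (expected == request.getHeader("Authorization")) {
      chain.doFilter(req, res)
    } else {
      response.setHeader("WWW-Authenticate", "Basic realm=\"spark\"")
      response.sendError(HttpServletResponse.SC_UNAUTHORIZED)
    }
  }
}

Put the jar with that class on the classpath and set
spark.ui.filters=com.example.BasicAuthFilter - for an application UI via
spark-defaults.conf or --conf, and for the standalone master's 8080 UI by
passing it as -Dspark.ui.filters=... through SPARK_MASTER_OPTS in
spark-env.sh (worth verifying against your Spark version).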

On Thu, Feb 12, 2015 at 10:19 PM, MASTER_ZION (Jairo Linux) 
master.z...@gmail.com wrote:

 Hi everyone,

 Im creating a development machine in AWS and i would like to protect the
 port 8080 using a password.

 Is it possible?


 Best Regards

 *Jairo Moreno*




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: SparkSQL + Tableau Connector

2015-02-10 Thread Arush Kharbanda
I am a little confused here: why do you want to create the tables in Hive?
You want to create the tables in Spark SQL, right?

If you are not able to find the same tables through Tableau, then the thrift
server is connecting to a different metastore than your spark-shell.

One way to point the thrift server at a specific metastore is to provide the
path to hive-site.xml when starting it, using --files hive-site.xml.

Similarly, you can point spark-submit or spark-shell at the same metastore
using the same option.
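
For example (host and path are placeholders):

./sbin/start-thriftserver.sh --master spark://master-host:7077 --files /path/to/hive-site.xml

The same --files /path/to/hive-site.xml flag can be passed to spark-submit or
spark-shell.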



On Wed, Feb 11, 2015 at 5:23 AM, Todd Nist tsind...@gmail.com wrote:

 Arush,

 As for #2 do you mean something like this from the docs:

 // sc is an existing SparkContext.val sqlContext = new 
 org.apache.spark.sql.hive.HiveContext(sc)
 sqlContext.sql(CREATE TABLE IF NOT EXISTS src (key INT, value 
 STRING))sqlContext.sql(LOAD DATA LOCAL INPATH 
 'examples/src/main/resources/kv1.txt' INTO TABLE src)
 // Queries are expressed in HiveQLsqlContext.sql(FROM src SELECT key, 
 value).collect().foreach(println)

 Or did you have something else in mind?

 -Todd


 On Tue, Feb 10, 2015 at 6:35 PM, Todd Nist tsind...@gmail.com wrote:

 Arush,

 Thank you will take a look at that approach in the morning.  I sort of
 figured the answer to #1 was NO and that I would need to do 2 and 3 thanks
 for clarifying it for me.

 -Todd

 On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda 
 ar...@sigmoidanalytics.com wrote:

 1.  Can the connector fetch or query schemaRDD's saved to Parquet or
 JSON files? NO
 2.  Do I need to do something to expose these via hive / metastore other
 than creating a table in hive? Create a table in spark sql to expose via
 spark sql
 3.  Does the thriftserver need to be configured to expose these in some
 fashion, sort of related to question 2 you would need to configure thrift
 to read from the metastore you expect it read from - by default it reads
 from metastore_db directory present in the directory used to launch the
 thrift server.
  On 11 Feb 2015 01:35, Todd Nist tsind...@gmail.com wrote:

 Hi,

 I'm trying to understand how and what the Tableau connector to SparkSQL
 is able to access.  My understanding is it needs to connect to the
 thriftserver and I am not sure how or if it exposes parquet, json,
 schemaRDDs, or does it only expose schemas defined in the metastore / hive.


 For example, I do the following from the spark-shell which generates a
 schemaRDD from a csv file and saves it as a JSON file as well as a parquet
 file.

 import *org.apache.sql.SQLContext
 *import com.databricks.spark.csv._
 val sqlContext = new SQLContext(sc)
 val test = 
 sqlContext.csfFile(/data/test.csv)test.toJSON().saveAsTextFile(/data/out)
 test.saveAsParquetFile(/data/out)

 When I connect from Tableau, the only thing I see is the default
 schema and nothing in the tables section.

 So my questions are:

 1.  Can the connector fetch or query schemaRDD's saved to Parquet or
 JSON files?
 2.  Do I need to do something to expose these via hive / metastore
 other than creating a table in hive?
 3.  Does the thriftserver need to be configured to expose these in some
 fashion, sort of related to question 2.

 TIA for the assistance.

 -Todd






-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Error while querying hive table from spark shell

2015-02-10 Thread Arush Kharbanda
(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at
 org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1410)
 ... 93 more
 Caused by: java.lang.NoClassDefFoundError:
 org/datanucleus/exceptions/NucleusException
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:274)
 at javax.jdo.JDOHelper$18.run(JDOHelper.java:2018)
 at javax.jdo.JDOHelper$18.run(JDOHelper.java:2016)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.jdo.JDOHelper.forName(JDOHelper.java:2015)
 at
 javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1162)
 at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
 at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
 at
 org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:310)
 at
 org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:339)
 at
 org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:248)
 at
 org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:223)
 at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
 at
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
 at
 org.apache.hadoop.hive.metastore.RawStoreProxy.init(RawStoreProxy.java:58)
 at
 org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:67)
 at
 org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:497)
 at
 org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:475)
 at
 org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:523)
 at
 org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:397)
 at
 org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:356)
 at
 org.apache.hadoop.hive.metastore.RetryingHMSHandler.init(RetryingHMSHandler.java:54)
 at
 org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:59)
 at
 org.apache.hadoop.hive.metastore.HiveMetaStore.newHMSHandler(HiveMetaStore.java:4944)
 at
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.init(HiveMetaStoreClient.java:171)
 ... 98 more
 Caused by: java.lang.ClassNotFoundException:
 org.datanucleus.exceptions.NucleusException
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 ... 124 more



 Regards,
 Kundan




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: SparkSQL + Tableau Connector

2015-02-10 Thread Arush Kharbanda
BTW, which Tableau connector are you using?

On Wed, Feb 11, 2015 at 12:55 PM, Arush Kharbanda 
ar...@sigmoidanalytics.com wrote:

  I am a little confused here, why do you want to create the tables in
 hive. You want to create the tables in spark-sql, right?

 If you are not able to find the same tables through tableau then thrift is
 connecting to a diffrent metastore than your spark-shell.

 One way to specify a metstore to thrift is to provide the path to
 hive-site.xml while starting thrift using --files hive-site.xml.

 similarly you can specify the same metastore to your spark-submit or
 sharp-shell using the same option.



 On Wed, Feb 11, 2015 at 5:23 AM, Todd Nist tsind...@gmail.com wrote:

 Arush,

 As for #2 do you mean something like this from the docs:

 // sc is an existing SparkContext.val sqlContext = new 
 org.apache.spark.sql.hive.HiveContext(sc)
 sqlContext.sql(CREATE TABLE IF NOT EXISTS src (key INT, value 
 STRING))sqlContext.sql(LOAD DATA LOCAL INPATH 
 'examples/src/main/resources/kv1.txt' INTO TABLE src)
 // Queries are expressed in HiveQLsqlContext.sql(FROM src SELECT key, 
 value).collect().foreach(println)

 Or did you have something else in mind?

 -Todd


 On Tue, Feb 10, 2015 at 6:35 PM, Todd Nist tsind...@gmail.com wrote:

 Arush,

 Thank you will take a look at that approach in the morning.  I sort of
 figured the answer to #1 was NO and that I would need to do 2 and 3 thanks
 for clarifying it for me.

 -Todd

 On Tue, Feb 10, 2015 at 5:24 PM, Arush Kharbanda 
 ar...@sigmoidanalytics.com wrote:

 1.  Can the connector fetch or query schemaRDD's saved to Parquet or
 JSON files? NO
 2.  Do I need to do something to expose these via hive / metastore
 other than creating a table in hive? Create a table in spark sql to expose
 via spark sql
 3.  Does the thriftserver need to be configured to expose these in some
 fashion, sort of related to question 2 you would need to configure thrift
 to read from the metastore you expect it read from - by default it reads
 from metastore_db directory present in the directory used to launch the
 thrift server.
  On 11 Feb 2015 01:35, Todd Nist tsind...@gmail.com wrote:

 Hi,

 I'm trying to understand how and what the Tableau connector to
 SparkSQL is able to access.  My understanding is it needs to connect to 
 the
 thriftserver and I am not sure how or if it exposes parquet, json,
 schemaRDDs, or does it only expose schemas defined in the metastore / 
 hive.


 For example, I do the following from the spark-shell which generates a
 schemaRDD from a csv file and saves it as a JSON file as well as a parquet
 file.

 import *org.apache.sql.SQLContext
 *import com.databricks.spark.csv._
 val sqlContext = new SQLContext(sc)
 val test = 
 sqlContext.csfFile(/data/test.csv)test.toJSON().saveAsTextFile(/data/out)
 test.saveAsParquetFile(/data/out)

 When I connect from Tableau, the only thing I see is the default
 schema and nothing in the tables section.

 So my questions are:

 1.  Can the connector fetch or query schemaRDD's saved to Parquet or
 JSON files?
 2.  Do I need to do something to expose these via hive / metastore
 other than creating a table in hive?
 3.  Does the thriftserver need to be configured to expose these in
 some fashion, sort of related to question 2.

 TIA for the assistance.

 -Todd






 --

 [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

 *Arush Kharbanda* || Technical Teamlead

 ar...@sigmoidanalytics.com || www.sigmoidanalytics.com




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: ZeroMQ and pyspark.streaming

2015-02-10 Thread Arush Kharbanda
No, the ZeroMQ API is not supported in Python as of now.
On 5 Feb 2015 21:27, Sasha Kacanski skacan...@gmail.com wrote:

 Does pyspark supports zeroMQ?
 I see that java does it, but I am not sure for Python?
 regards

 --
 Aleksandar Kacanski



Re: SparkSQL + Tableau Connector

2015-02-10 Thread Arush Kharbanda
1.  Can the connector fetch or query schemaRDDs saved to Parquet or JSON
files? No.
2.  Do I need to do something to expose these via hive / metastore other
than creating a table in hive? Create a table in Spark SQL to expose it via
Spark SQL.
3.  Does the thriftserver need to be configured to expose these in some
fashion (related to question 2)? You would need to configure the thrift
server to read from the metastore you expect it to read from - by default it
reads from the metastore_db directory present in the directory used to launch
the thrift server.
 On 11 Feb 2015 01:35, Todd Nist tsind...@gmail.com wrote:

 Hi,

 I'm trying to understand how and what the Tableau connector to SparkSQL is
 able to access.  My understanding is it needs to connect to the
 thriftserver and I am not sure how or if it exposes parquet, json,
 schemaRDDs, or does it only expose schemas defined in the metastore / hive.


 For example, I do the following from the spark-shell which generates a
 schemaRDD from a csv file and saves it as a JSON file as well as a parquet
 file.

 import *org.apache.sql.SQLContext
 *import com.databricks.spark.csv._
 val sqlContext = new SQLContext(sc)
 val test = 
 sqlContext.csfFile(/data/test.csv)test.toJSON().saveAsTextFile(/data/out)
 test.saveAsParquetFile(/data/out)

 When I connect from Tableau, the only thing I see is the default schema
 and nothing in the tables section.

 So my questions are:

 1.  Can the connector fetch or query schemaRDD's saved to Parquet or JSON
 files?
 2.  Do I need to do something to expose these via hive / metastore other
 than creating a table in hive?
 3.  Does the thriftserver need to be configured to expose these in some
 fashion, sort of related to question 2.

 TIA for the assistance.

 -Todd



Re: Spark 1.2.x Yarn Auxiliary Shuffle Service

2015-02-09 Thread Arush Kharbanda
Is this what you are looking for?


   1. Build Spark with the YARN profile
   http://spark.apache.org/docs/1.2.0/building-spark.html. Skip this step
   if you are using a pre-packaged distribution.
   2. Locate the spark-<version>-yarn-shuffle.jar. This should be under
   $SPARK_HOME/network/yarn/target/scala-<version> if you are building
   Spark yourself, and under lib if you are using a distribution.
   3. Add this jar to the classpath of all NodeManagers in your cluster.
   4. In the yarn-site.xml on each node, add spark_shuffle to
   yarn.nodemanager.aux-services, then set
   yarn.nodemanager.aux-services.spark_shuffle.class to
   org.apache.spark.network.yarn.YarnShuffleService. Additionally, set all
   relevant spark.shuffle.service.* configurations
   http://spark.apache.org/docs/1.2.0/configuration.html.
   5. Restart all NodeManagers in your cluster.


On Wed, Jan 28, 2015 at 1:30 AM, Corey Nolet cjno...@gmail.com wrote:

 I've read that this is supposed to be a rather significant optimization to
 the shuffle system in 1.1.0 but I'm not seeing much documentation on
 enabling this in Yarn. I see github classes for it in 1.2.0 and a property
 spark.shuffle.service.enabled in the spark-defaults.conf.

 The code mentions that this is supposed to be run inside the Nodemanager
 so I'm assuming it needs to be wired up in the yarn-site.xml under the
 yarn.nodemanager.aux-services property?







-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: checking

2015-02-06 Thread Arush Kharbanda
Yes they are.

On Fri, Feb 6, 2015 at 5:06 PM, Mohit Durgapal durgapalmo...@gmail.com
wrote:

 Just wanted to know If my emails are reaching the user list.


 Regards
 Mohit




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Use Spark as multi-threading library and deprecate web UI

2015-02-05 Thread Arush Kharbanda
You can use Akka, which is the underlying multi-threading library Spark uses.
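
As Sean's reply quoted below notes, disabling the UI is just a configuration
flag, so a minimal sketch of the "Spark as a local library" setup is:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[*]")                 // use all local cores
  .setAppName("spark-as-a-library")
  .set("spark.ui.enabled", "false")      // don't start the web UI on 4040
val sc = new SparkContext(conf)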

On Thu, Feb 5, 2015 at 9:56 PM, Shuai Zheng szheng.c...@gmail.com wrote:

 Nice. I just try and it works. Thanks very much!

 And I notice there is below in the log:

 15/02/05 11:19:09 INFO Remoting: Remoting started; listening on addresses
 :[akka.tcp://sparkDriver@NY02913D.global.local:8162]
 15/02/05 11:19:10 INFO AkkaUtils: Connecting to HeartbeatReceiver:
 akka.tcp://sparkDriver@NY02913D.global.local:8162/user/HeartbeatReceiver

 As I understand. The local mode will have driver and executors in the same
 java process. So is there any way for me to also disable above two
 listeners? Or they are not optional even in local mode?

 Regards,

 Shuai



 -Original Message-
 From: Sean Owen [mailto:so...@cloudera.com]
 Sent: Thursday, February 05, 2015 10:53 AM
 To: Shuai Zheng
 Cc: user@spark.apache.org
 Subject: Re: Use Spark as multi-threading library and deprecate web UI

 Do you mean disable the web UI? spark.ui.enabled=false

 Sure, it's useful with master = local[*] too.

 On Thu, Feb 5, 2015 at 9:30 AM, Shuai Zheng szheng.c...@gmail.com wrote:
  Hi All,
 
 
 
  It might sounds weird, but I think spark is perfect to be used as a
  multi-threading library in some cases. The local mode will naturally
  boost multiple thread when required. Because it is more restrict and
  less chance to have potential bug in the code (because it is more data
  oriental, not thread oriental). Of course, it cannot be used for all
  cases, but in most of my applications, it is enough (90%).
 
 
 
  I want to hear other people’s idea about this.
 
 
 
  BTW: if I run spark in local mode, how to deprecate the web UI
  (default listen on 4040), because I don’t want to start the UI every
  time if I use spark as a local library.
 
 
 
  Regards,
 
 
 
  Shuai


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: how to specify hive connection options for HiveContext

2015-02-05 Thread Arush Kharbanda
Hi

Are you trying to run a Spark job from inside Eclipse and want the job to
access Hive configuration options, i.e. to access Hive tables?

Thanks
Arush

On Tue, Feb 3, 2015 at 7:24 AM, guxiaobo1982 guxiaobo1...@qq.com wrote:

 Hi,

 I know two options, one for spark_submit, the other one for spark-shell,
 but how to set for programs running inside eclipse?

 Regards,




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Errors in the workers machines

2015-02-05 Thread Arush Kharbanda
1. For what reasons is Spark using the above ports? What internal component
is triggering them? - Akka (guessing from the error log) is used to schedule
tasks and to notify executors; the ports used are random by default.
2. How can I get rid of these errors? - Probably the ports are not open on
your server. You can pin them to specific ports and open those using
spark.driver.port and spark.executor.port (see the sketch after this list),
or you can open all ports between the masters and slaves. For a cluster on
EC2, the ec2 script takes care of what is required.
3. Why is the application still finishing with success? - Do you have more
workers in the cluster which are able to connect?
4. Why is it trying more ports? - Not sure; it's picking the ports
randomly.
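
A sketch of pinning those two ports from the application side (the port
numbers are arbitrary examples; whatever you choose must be reachable between
the driver, master and workers, and the same values can go into
spark-defaults.conf instead):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("fixed-ports")
  .set("spark.driver.port", "50001")    // port the driver listens on
  .set("spark.executor.port", "50002")  // port executors listen on (Spark 1.x, Akka-based)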

On Thu, Feb 5, 2015 at 2:30 PM, Spico Florin spicoflo...@gmail.com wrote:

 Hello!
  I received the following errors in the workerLog.log files:

 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@stream4:33660]
 - [akka.tcp://sparkExecutor@stream4:47929]: Error [Association failed
 with [akka.tcp://sparkExecutor@stream4:47929]] [
 akka.remote.EndpointAssociationException: Association failed with
 [akka.tcp://sparkExecutor@stream4:47929]
 Caused by:
 akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
 Connection refused: stream4/x.x.x.x:47929
 ]
 (For security reason  have masked the IP with x.x.x.x). The same errors
 occurs for different ports
 (42395,39761).
 Even though I have these errors the application is finished with success.
 I have the following questions:
 1. For what reasons is using Spark the above ports? What internal
 component is triggering them?
 2. How I can get rid of these errors?
 3. Why the application is still finished with success?
 4. Why is trying with more ports?

 I look forward for your answers.
   Regards.
  Florin





-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

Arush Kharbanda || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Union in Spark

2015-02-01 Thread Arush Kharbanda
Hi Deep,

What is your configuration and what is the size of the 2 data sets?

Thanks
Arush

On Mon, Feb 2, 2015 at 11:56 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 I did not check the console because once the job starts I cannot run
 anything else and have to force shutdown the system. I commented parts of
 codes and I tested. I doubt it is because of union. So, I want to change it
 to something else and see if the problem persists.

 Thank you

 On Mon, Feb 2, 2015 at 11:53 AM, Jerry Lam chiling...@gmail.com wrote:

 Hi Deep,

 How do you know the cluster is not responsive because of Union?
 Did you check the spark web console?

 Best Regards,

 Jerry


 On Mon, Feb 2, 2015 at 1:21 AM, Deep Pradhan pradhandeep1...@gmail.com
 wrote:

 The cluster hangs.

 On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam chiling...@gmail.com wrote:

 Hi Deep,

 what do you mean by stuck?

 Jerry


 On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan 
 pradhandeep1...@gmail.com wrote:

 Hi,
 Is there any better operation than Union. I am using union and the
 cluster is getting stuck with a large data set.

 Thank you








-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Error when running spark in debug mode

2015-01-31 Thread Arush Kharbanda
Hi Ankur,

It's running fine for me on Spark 1.1 with changes to the log4j properties
file.
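
For reference, turning on debug output in conf/log4j.properties (starting
from the bundled log4j.properties.template) is typically just a change to the
root category, e.g.:

# conf/log4j.properties
log4j.rootCategory=DEBUG, console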

Thanks
Arush

On Fri, Jan 30, 2015 at 9:49 PM, Ankur Srivastava 
ankur.srivast...@gmail.com wrote:

 Hi Arush

 I have configured log4j by updating the file log4j.properties in
 SPARK_HOME/conf folder.

 If it was a log4j defect we would get error in debug mode in all apps.

 Thanks
 Ankur
  Hi Ankur,

 How are you enabling the debug level of logs. It should be a log4j
 configuration. Even if there would be some issue it would be in log4j and
 not in spark.

 Thanks
 Arush

 On Fri, Jan 30, 2015 at 4:24 AM, Ankur Srivastava 
 ankur.srivast...@gmail.com wrote:

 Hi,

 When ever I enable DEBUG level logs for my spark cluster, on running a
 job all the executors die with the below exception. On disabling the DEBUG
 logs my jobs move to the next step.


 I am on spark-1.1.0

 Is this a known issue with spark?

 Thanks
 Ankur

 2015-01-29 22:27:42,467 [main] INFO  org.apache.spark.SecurityManager -
 SecurityManager: authentication disabled; ui acls disabled; users with view
 permissions: Set(ubuntu); users with modify permissions: Set(ubuntu)

 2015-01-29 22:27:42,478 [main] DEBUG org.apache.spark.util.AkkaUtils - In
 createActorSystem, requireCookie is: off

 2015-01-29 22:27:42,871
 [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO
 akka.event.slf4j.Slf4jLogger - Slf4jLogger started

 2015-01-29 22:27:42,912
 [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
 Starting remoting

 2015-01-29 22:27:43,057
 [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
 Remoting started; listening on addresses :[akka.tcp://
 driverPropsFetcher@10.77.9.155:36035]

 2015-01-29 22:27:43,060
 [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
 Remoting now listens on addresses: [akka.tcp://
 driverPropsFetcher@10.77.9.155:36035]

 2015-01-29 22:27:43,067 [main] INFO  org.apache.spark.util.Utils -
 Successfully started service 'driverPropsFetcher' on port 36035.

 2015-01-29 22:28:13,077 [main] ERROR
 org.apache.hadoop.security.UserGroupInformation -
 PriviledgedActionException as:ubuntu
 cause:java.util.concurrent.TimeoutException: Futures timed out after [30
 seconds]

 Exception in thread main
 java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs

 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)

 at
 org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)

 at
 org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)

 at
 org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)

 at
 org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)

 Caused by: java.security.PrivilegedActionException:
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]

 at java.security.AccessController.doPrivileged(Native Method)

 at javax.security.auth.Subject.doAs(Subject.java:415)

 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)

 ... 4 more

 Caused by: java.util.concurrent.TimeoutException: Futures timed out after
 [30 seconds]

 at
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)

 at
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)

 at
 scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)

 at
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)

 at scala.concurrent.Await$.result(package.scala:107)

 at
 org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:125)

 at
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)

 at
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)

 ... 7 more




 --

 [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

 *Arush Kharbanda* || Technical Teamlead

 ar...@sigmoidanalytics.com || www.sigmoidanalytics.com




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Error when running spark in debug mode

2015-01-31 Thread Arush Kharbanda
Can you share your log4j file?

On Sat, Jan 31, 2015 at 1:35 PM, Arush Kharbanda ar...@sigmoidanalytics.com
 wrote:

 Hi Ankur,

 Its running fine for me for spark 1.1 and changes to log4j properties
 file.

 Thanks
 Arush

 On Fri, Jan 30, 2015 at 9:49 PM, Ankur Srivastava 
 ankur.srivast...@gmail.com wrote:

 Hi Arush

 I have configured log4j by updating the file log4j.properties in
 SPARK_HOME/conf folder.

 If it was a log4j defect we would get error in debug mode in all apps.

 Thanks
 Ankur
  Hi Ankur,

 How are you enabling the debug level of logs. It should be a log4j
 configuration. Even if there would be some issue it would be in log4j and
 not in spark.

 Thanks
 Arush

 On Fri, Jan 30, 2015 at 4:24 AM, Ankur Srivastava 
 ankur.srivast...@gmail.com wrote:

 Hi,

 When ever I enable DEBUG level logs for my spark cluster, on running a
 job all the executors die with the below exception. On disabling the DEBUG
 logs my jobs move to the next step.


 I am on spark-1.1.0

 Is this a known issue with spark?

 Thanks
 Ankur

 2015-01-29 22:27:42,467 [main] INFO  org.apache.spark.SecurityManager -
 SecurityManager: authentication disabled; ui acls disabled; users with view
 permissions: Set(ubuntu); users with modify permissions: Set(ubuntu)

 2015-01-29 22:27:42,478 [main] DEBUG org.apache.spark.util.AkkaUtils -
 In createActorSystem, requireCookie is: off

 2015-01-29 22:27:42,871
 [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO
 akka.event.slf4j.Slf4jLogger - Slf4jLogger started

 2015-01-29 22:27:42,912
 [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
 Starting remoting

 2015-01-29 22:27:43,057
 [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
 Remoting started; listening on addresses :[akka.tcp://
 driverPropsFetcher@10.77.9.155:36035]

 2015-01-29 22:27:43,060
 [driverPropsFetcher-akka.actor.default-dispatcher-4] INFO  Remoting -
 Remoting now listens on addresses: [akka.tcp://
 driverPropsFetcher@10.77.9.155:36035]

 2015-01-29 22:27:43,067 [main] INFO  org.apache.spark.util.Utils -
 Successfully started service 'driverPropsFetcher' on port 36035.

 2015-01-29 22:28:13,077 [main] ERROR
 org.apache.hadoop.security.UserGroupInformation -
 PriviledgedActionException as:ubuntu
 cause:java.util.concurrent.TimeoutException: Futures timed out after [30
 seconds]

 Exception in thread main
 java.lang.reflect.UndeclaredThrowableException: Unknown exception in doAs

 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)

 at
 org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)

 at
 org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:113)

 at
 org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:156)

 at
 org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)

 Caused by: java.security.PrivilegedActionException:
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]

 at java.security.AccessController.doPrivileged(Native Method)

 at javax.security.auth.Subject.doAs(Subject.java:415)

 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)

 ... 4 more

 Caused by: java.util.concurrent.TimeoutException: Futures timed out
 after [30 seconds]

 at
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)

 at
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)

 at
 scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)

 at
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)

 at scala.concurrent.Await$.result(package.scala:107)

 at
 org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:125)

 at
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)

 at
 org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)

 ... 7 more





-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Spark streaming - tracking/deleting processed files

2015-01-31 Thread Arush Kharbanda
Hi Ganterm,

That's expected behaviour if you look at the documentation for textFileStream:

"Create an input stream that monitors a Hadoop-compatible filesystem for new
files and reads them as text files (using key as LongWritable, value as
Text and input format as TextInputFormat). Files must be written to the
monitored directory by moving them from another location within the same
file system. File names starting with . are ignored."

So the files have to be moved into the monitored directory while the streaming
job is up, and you can manage that with a small script: move the pending files
in, and move the processed ones out with another script.
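
For example, here is a small Scala sketch of that idea (the directory paths and
the object name are made up; the same thing can be done with a couple of mv
commands in a shell script). A rename within one filesystem is atomic, which is
why Spark asks for files to be moved rather than copied into the directory.

import java.io.File
import java.nio.file.{Files, StandardCopyOption}

object MoveIntoWatchedDir {
  def main(args: Array[String]): Unit = {
    val staging = new File("/data/staging")          // assumed staging location
    val watched = new File("/data/incoming").toPath  // directory given to textFileStream

    // Rename every staged file into the watched directory while the job is up,
    // so the running textFileStream picks it up as a new, fully written file.
    Option(staging.listFiles()).getOrElse(Array.empty[File])
      .filterNot(_.getName.startsWith("."))          // dot-files are ignored by Spark anyway
      .foreach { f =>
        Files.move(f.toPath, watched.resolve(f.getName), StandardCopyOption.ATOMIC_MOVE)
      }
  }
}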

On Fri, Jan 30, 2015 at 11:37 PM, ganterm gant...@gmail.com wrote:

 We are running a Spark streaming job that retrieves files from a directory
 (using textFileStream).
 One concern we are having is the case where the job is down but files are
 still being added to the directory.
 Once the job starts up again, those files are not being picked up (since
 they are not new or changed while the job is running) but we would like
 them
 to be processed.
 Is there a solution for that? Is there a way to keep track what files have
 been processed and can we force older files to be picked up? Is there a
 way to delete the processed files?

 Thanks!
 Markus



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-tracking-deleting-processed-files-tp21444.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: groupByKey is not working

2015-01-30 Thread Arush Kharbanda
Hi Amit,

What error does it throw?
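
One thing worth checking while you gather that (just a guess on my side):
groupByKey is a lazy transformation, so on its own it never executes anything.
Here is a minimal, self-contained sketch in local mode that forces the job with
an action and so surfaces any real error:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD functions such as groupByKey in Spark 1.1

object GroupByKeyCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GroupByKeyCheck").setMaster("local"))
    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
    val grouped = pairs.groupByKey()   // lazy: nothing has run yet
    // Only an action such as collect or count triggers the computation.
    grouped.collect().foreach { case (k, vs) => println(s"$k -> ${vs.mkString(",")}") }
    sc.stop()
  }
}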

Thanks
Arush

On Sat, Jan 31, 2015 at 1:50 AM, Amit Behera amit.bd...@gmail.com wrote:

 hi all,

 my sbt file is like this:

 name := "Spark"

 version := "1.0"

 scalaVersion := "2.10.4"

 libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"

 libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.3"


 *code:*

 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.SparkContext._
 import au.com.bytecode.opencsv.CSVParser

 object SparkJob
 {

   def pLines(lines: Iterator[String]) = {
     val parser = new CSVParser()
     lines.map { l =>
       val vs = parser.parseLine(l)
       (vs(0), vs(1).toInt)
     }
   }

   def main(args: Array[String]) {
     val conf = new SparkConf().setAppName("Spark Job").setMaster("local")
     val sc = new SparkContext(conf)
     val data = sc.textFile("/home/amit/testData.csv").cache()
     val result = data.mapPartitions(pLines).groupByKey
     //val list = result.filter(x => (x._1).contains("24050881"))
   }

 }


 Here groupByKey is not working, but the same code works from *spark-shell*.

 Please help me


 Thanks

 Amit




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Building Spark behind a proxy

2015-01-30 Thread Arush Kharbanda
Hi Soumya,

What I meant was: is there any difference in the error message between
building with JAVA_OPTS configured and without it?

Are you facing the same issue when you build with Maven?

Thanks
Arush

On Thu, Jan 29, 2015 at 10:22 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:

 I can do a
 wget
 http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom
 and get the file successfully on a shell.



 On Thu, Jan 29, 2015 at 11:51 AM, Boromir Widas vcsub...@gmail.com
 wrote:

 At least a part of it is due to connection refused, can you check if curl
 can reach the URL with proxies -
 [FATAL] Non-resolvable parent POM: Could not transfer artifact
 org.apache:apache:pom:14 from/to central (
 http://repo.maven.apache.org/maven2): Error transferring file:
 Connection refused from
 http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom

 On Thu, Jan 29, 2015 at 11:35 AM, Soumya Simanta 
 soumya.sima...@gmail.com wrote:



 On Thu, Jan 29, 2015 at 11:05 AM, Arush Kharbanda 
 ar...@sigmoidanalytics.com wrote:

 Does  the error change on build with and without the built options?

 What do you mean by build options? I'm just doing ./sbt/sbt assembly
 from $SPARK_HOME


 Did you try using maven? and doing the proxy settings there.


  No I've not tried maven yet. However, I did set proxy settings inside
 my .m2/setting.xml, but it didn't make any difference.







-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Dependency unresolved hadoop-yarn-common 1.0.4 when running quickstart example

2015-01-29 Thread Arush Kharbanda
Hi Sarwar,

For a quick fix you can exclude the YARN dependencies (you won't be needing
them if you are running locally). sbt's exclude syntax looks like this:

libraryDependencies +=
  "log4j" % "log4j" % "1.2.15" exclude("javax.jms", "jms")
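
Applied to the quick-start build from your mail, that could look roughly like
the sketch below. The artifact names come from the warnings you posted; treat
this as an untested starting point rather than a verified fix.

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

// Pull in spark-core but leave out the yarn artifacts that fail to resolve;
// they should not be needed for a local quick-start run.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" excludeAll(
  ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-yarn-common"),
  ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-yarn-client"),
  ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-yarn-api")
)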


You can also analyze your dependencies using this plugin

https://github.com/jrudolph/sbt-dependency-graph


Thanks
Arush

On Thu, Jan 29, 2015 at 4:20 AM, Sarwar Bhuiyan sarwar.bhui...@gmail.com
wrote:

 Hello all,

 I'm trying to build the sample application on the spark 1.2.0 quickstart
 page (https://spark.apache.org/docs/latest/quick-start.html) using the
 following build.sbt file:

 name := "Simple Project"

 version := "1.0"

 scalaVersion := "2.10.4"

 libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"

 Upon calling sbt package, it downloaded a lot of dependencies but
 eventually failed with some warnings and errors. Here's the snippet:

 [warn]  ::
 [warn]  ::  UNRESOLVED DEPENDENCIES ::
 [warn]  ::
 [warn]  :: org.apache.hadoop#hadoop-yarn-common;1.0.4: not found
 [warn]  :: org.apache.hadoop#hadoop-yarn-client;1.0.4: not found
 [warn]  :: org.apache.hadoop#hadoop-yarn-api;1.0.4: not found
 [warn]  ::
 [warn]  ::
 [warn]  ::  FAILED DOWNLOADS::
 [warn]  :: ^ see resolution messages for details  ^ ::
 [warn]  ::
 [warn]  ::
 org.eclipse.jetty.orbit#javax.transaction;1.1.1.v201105210645!javax.transaction.orbit
 [warn]  ::
 org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016!javax.servlet.orbit
 [warn]  ::
 org.eclipse.jetty.orbit#javax.mail.glassfish;1.4.1.v201005082020!javax.mail.glassfish.orbit
 [warn]  ::
 org.eclipse.jetty.orbit#javax.activation;1.1.0.v201105071233!javax.activation.orbit
 [warn]  ::



 Upon checking the maven repositories there doesn't seem to be any
 hadoop-yarn-common 1.0.4. I've tried explicitly setting a dependency to
 hadoop-yarn-common 2.4.0 for example but to no avail. I've also tried
 setting a number of different repositories to see if maybe one of them
 might have that dependency. Still no dice.

 What's the best way to resolve this for a quickstart situation? Do I have
 to set some sort of profile or environment variable which doesn't try to
 bring the 1.0.4 yarn version?

 Any help would be greatly appreciated.

 Sarwar




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: unknown issue in submitting a spark job

2015-01-29 Thread Arush Kharbanda
)
 at
 akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363)
 at java.lang.Thread.run(Thread.java:745)
 15/01/29 08:54:33 WARN storage.BlockManagerMasterActor: Removing
 BlockManager BlockManagerId(0, ip-10-10-8-191.us-west-2.compute.internal,
 47722, 0) with no recent heart beats: 82575ms exceeds 45000ms
 15/01/29 08:54:33 INFO spark.ContextCleaner: Cleaned RDD 1
 15/01/29 08:54:33 WARN util.AkkaUtils: Error sending message in 1 attempts
 akka.pattern.AskTimeoutException:
 Recipient[Actor[akka://sparkDriver/user/BlockManagerMaster#-538003375]] had
 already been terminated.
 at
 akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
 at
 org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:175)
 at

 org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:218)
 at

 org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:126)





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/unknown-issue-in-submitting-a-spark-job-tp21418.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: is there a master for spark cluster in ec2

2015-01-29 Thread Arush Kharbanda
Hi Mohit,

You can set the master instance type with -m.

To set up a cluster you need to use the ec2/spark-ec2 script.

You need to create an AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in your AWS
web console under Security Credentials and pass them on to the script above.
Once you do that you should be able to set up your cluster using the spark-ec2
options.

Thanks
Arush



On Thu, Jan 29, 2015 at 6:41 AM, Mohit Singh mohit1...@gmail.com wrote:

 Hi,
   Probably a naive question.. But I am creating a spark cluster on ec2
 using the ec2 scripts in there..
 But is there a master param I need to set..
 ./bin/pyspark --master [ ] ??
 I don't yet fully understand the ec2 concepts so just wanted to confirm
 this??
 Thanks

 --
 Mohit

 When you want success as badly as you want the air, then you will get it.
 There is no other secret of success.
 -Socrates




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Building Spark behind a proxy

2015-01-29 Thread Arush Kharbanda
)
 at xsbt.boot.Launch$.launch(Launch.scala:117)
 at xsbt.boot.Launch$.apply(Launch.scala:19)
 at xsbt.boot.Boot$.runImpl(Boot.scala:44)
 at xsbt.boot.Boot$.main(Boot.scala:20)
 at xsbt.boot.Boot.main(Boot.scala)
 [error] org.apache.maven.model.building.ModelBuildingException: 1 problem
 was encountered while building the effective model for
 org.apache.spark:spark-parent:1.1.1
 [error] [FATAL] Non-resolvable parent POM: Could not transfer artifact
 org.apache:apache:pom:14 from/to central (
 http://repo.maven.apache.org/maven2): Error transferring file: Connection
 refused from
 http://repo.maven.apache.org/maven2/org/apache/apache/14/apache-14.pom
 and 'parent.relativePath' points at wrong local POM @ line 21, column 11
 [error] Use 'last' for the full log.
 Project loading failed: (r)etry, (q)uit, (l)ast, or (i)gnore?




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Hive on Spark vs. SparkSQL using Hive ?

2015-01-28 Thread Arush Kharbanda
Spark SQL on Hive

1. The purpose of Spark SQL is to allow Spark users to selectively use SQL
expressions (with a fairly limited set of functions currently supported) when
writing Spark jobs; a minimal example of this option is sketched below.
2. Already available.

Hive on Spark

1. Spark users will automatically get the whole set of Hive's rich features,
including any new features that Hive might introduce in the future.
2. Still under development.
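
A minimal sketch of the first option, adapted from the Spark SQL programming
guide (it assumes Spark is built with Hive support, a hive-site.xml on the
classpath, and a Hive table named src):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object SparkSqlOnHive {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkSqlOnHive"))
    val hiveContext = new HiveContext(sc)

    // HiveQL runs through Spark SQL's engine and comes back as a SchemaRDD of
    // Rows (a DataFrame in later releases).
    hiveContext.sql("SELECT key, value FROM src LIMIT 10")
      .collect()
      .foreach(println)

    sc.stop()
  }
}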

On Thu, Jan 29, 2015 at 4:54 AM, ogoh oke...@gmail.com wrote:


 Hello,
 probably this question was already asked but still I'd like to confirm from
 Spark users.

 This following blog shows 'hive on spark' :

 http://blog.cloudera.com/blog/2014/12/hands-on-hive-on-spark-in-the-aws-cloud/
 .
 How is it different from using hive as data storage of SparkSQL
 (http://spark.apache.org/docs/latest/sql-programming-guide.html)?
 Also, is there any update about SparkSQL's next release (current one is
 still alpha)?

 Thanks,
 OGoh




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Hive-on-Spark-vs-SparkSQL-using-Hive-tp21412.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com


Re: Spark serialization issues with third-party libraries

2014-11-24 Thread Arush Kharbanda
Hi,

You can see my code here; it's a POC that implements UIMA on Spark:

https://bitbucket.org/SigmoidDev/uimaspark

https://bitbucket.org/SigmoidDev/uimaspark/src/8476fdf16d84d0f517cce45a8bc1bd3410927464/UIMASpark/src/main/scala/UIMAProcessor.scala?at=master

UIMAProcessor.scala is the class where the major part of the integration happens.
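
The workaround discussed in this thread (keeping non-serializable members
transient, or building them inside each partition) looks roughly like the
sketch below. HeavyAnalyzer is a hypothetical stand-in for a non-serializable
third-party engine such as a UIMA analysis engine; this is not the actual
UIMAProcessor code.

import org.apache.spark.rdd.RDD

class HeavyAnalyzer {                    // pretend this comes from a library and is not Serializable
  def analyze(text: String): Int = text.length
}

class Processor extends Serializable {
  // Option 1: keep the engine @transient and lazy so it never travels with the
  // closure; each executor rebuilds it on first use.
  @transient lazy val analyzer = new HeavyAnalyzer

  def run(lines: RDD[String]): RDD[Int] =
    lines.map(s => analyzer.analyze(s))

  // Option 2: build the engine once per partition, entirely on the worker side.
  def runPerPartition(lines: RDD[String]): RDD[Int] =
    lines.mapPartitions { it =>
      val local = new HeavyAnalyzer
      it.map(s => local.analyze(s))
    }
}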

Thanks
Arush

On Sun, Nov 23, 2014 at 7:52 PM, jatinpreet jatinpr...@gmail.com wrote:

 Thanks Sean, I was actually using instances created elsewhere inside my RDD
 transformations, which as I understand is against the Spark programming model.
 I was referred to a talk about UIMA and Spark integration from this year's
 Spark Summit, which had a workaround for this problem. I just had to make some
 class members transient.

 http://spark-summit.org/2014/talk/leveraging-uima-in-spark

 Thanks



 -
 Novice Big Data Programmer
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-serialization-issues-with-third-party-libraries-tp19454p19589.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 

[image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

*Arush Kharbanda* || Technical Teamlead

ar...@sigmoidanalytics.com || www.sigmoidanalytics.com