Re: Having problem with Spark streaming with Kinesis

2014-12-13 Thread A.K.M. Ashrafuzzaman
Thanks Aniket,
The trick is to have #workers >= #shards + 1, but I don't know why that is.
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html

Here in the figure [Spark Streaming Kinesis architecture], it seems like one
node should be able to take on more than one shard.
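For reference, a minimal sketch (not from this thread) of the pattern in the
integration guide: one Kinesis receiver per shard, unioned into a single
DStream, with at least #shards + 1 cores so that one core is left free for
processing. The stream name, endpoint, and core counts below are placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

object KinesisShardsSketch {
  def main(args: Array[String]): Unit = {
    val numShards = 2  // assumed shard count of the Kinesis stream
    // Reserve one core per receiver plus at least one core for processing.
    val conf = new SparkConf()
      .setAppName("KinesisShardsSketch")
      .setMaster(s"local[${numShards + 1}]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // One receiver (and hence one worker core) per shard.
    val shardStreams = (1 to numShards).map { _ =>
      KinesisUtils.createStream(
        ssc,
        "myKinesisStream",                              // placeholder stream name
        "https://kinesis.us-east-1.amazonaws.com",      // placeholder endpoint
        Seconds(10),                                    // checkpoint interval
        InitialPositionInStream.LATEST,
        StorageLevel.MEMORY_AND_DISK_2)
    }

    // Union the per-shard streams into one DStream for processing.
    val records = ssc.union(shardStreams).map(bytes => new String(bytes, "UTF-8"))
    records.print()

    ssc.start()
    ssc.awaitTermination()
  }
}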


A.K.M. Ashrafuzzaman
Lead Software Engineer
NewsCred

(M) 880-175-5592433
Twitter | Blog | Facebook

Check out The Academy, your #1 source
for free content marketing resources

On Nov 26, 2014, at 6:23 PM, A.K.M. Ashrafuzzaman  
wrote:

> Hi guys,
> When we are using Kinesis with 1 shard it works fine. But when we use more
> than 1, it falls into an infinite loop and no data is processed by Spark
> Streaming. In the Kinesis DynamoDB table, I can see that it keeps increasing
> the leaseCounter, but it does not start processing.
> 
> I am using,
> scala: 2.10.4
> java version: 1.8.0_25
> Spark: 1.1.0
> spark-streaming-kinesis-asl: 1.1.0
> 
> A.K.M. Ashrafuzzaman
> Lead Software Engineer
> NewsCred
> 
> (M) 880-175-5592433
> Twitter | Blog | Facebook
> 
> Check out The Academy, your #1 source
> for free content marketing resources
> 



Re: Calling ALS-MlLib from desktop application/ Training ALS

2014-12-13 Thread Krishna Sankar
a) There is no absolute RMSE - it depends on the domain. Also, RMSE is the
error based on what you have seen so far, a snapshot of a slice of the
domain.
b) My suggestion is to put the system in place, see what happens when users
interact with it, and then think about reducing the RMSE as needed. For all
you know, RMSE could go up with another set of data (see the sketch below for
how RMSE on a held-out test set is typically computed).
c) I would prefer Scala, but Java would work as well.
d) For a desktop app, you have two ways to go.
Either run Spark on the local machine and build an app, or
have Spark run on a server/cluster and build a browser app. This
depends on the data size and scaling requirements.
e) I haven't seen any C# interfaces. Might be a good request candidate.
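A minimal sketch of that computation (not from this thread), assuming Spark /
MLlib 1.1 and a hypothetical ratings.csv of "user,product,rating" lines; the
rank/lambda/iteration values simply mirror the ones quoted below:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object AlsRmseSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ALS RMSE sketch"))

    // Parse "user,product,rating" lines into MLlib Ratings.
    val ratings = sc.textFile("ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(',')
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    // Hold out 20% of the data to measure generalization error.
    val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2), seed = 42L)

    // rank = 20, iterations = 10, lambda = 5.0 (the values mentioned below).
    val model = ALS.train(training, 20, 10, 5.0)

    // Predict ratings for the held-out (user, product) pairs and compute RMSE.
    val predictions = model
      .predict(test.map(r => (r.user, r.product)))
      .map(p => ((p.user, p.product), p.rating))
    val ratesAndPreds = test.map(r => ((r.user, r.product), r.rating)).join(predictions)
    val rmse = math.sqrt(ratesAndPreds.map { case (_, (r, p)) => (r - p) * (r - p) }.mean())
    println(s"Test RMSE = $rmse")

    sc.stop()
  }
}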
Cheers


On Sat, Dec 13, 2014 at 7:17 PM, Saurabh Agrawal  wrote:
>
>
> Requesting guidance on my queries in the email below.
>
>
>
> -Original Message-
> *From: *Saurabh Agrawal
> *Sent: *Saturday, December 13, 2014 07:06 PM GMT Standard Time
> *To: *user@spark.apache.org
> *Subject: *Building Desktop application for ALS-MlLib/ Training ALS
>
>
>
> Hi,
>
>
>
> I am a newbie in the Spark and Scala world.
>
>
>
> I have been trying to implement Collaborative filtering using MlLib
> supplied out of the box with Spark and Scala
>
>
>
> I have 2 problems
>
>
>
> 1.   The best model was trained with rank = 20 and lambda = 5.0, and
> numIter = 10, and its RMSE on the test set is 25.718710831912485. The best
> model improves the baseline by 18.29%. Is there a scientific way in which
> RMSE could be brought down? What is a decent acceptable value for RMSE?
>
> 2.   I picked up the Collaborative filtering algorithm from
> http://ampcamp.berkeley.edu/5/exercises/movie-recommendation-with-mllib.html
> and executed the given code with my dataset. Now, I want to build a
> desktop application around it.
>
> a.   What is the best language to do this, Java or Scala? Any
> possibility to do this using C#?
>
> b.  Can somebody please share any relevant documents/ source or any
> helper links to help me get started on this?
>
>
>
> Your help is greatly appreciated
>
>
>
> Thanks!!
>
>
>
> Regards,
>
> Saurabh Agrawal
>


Calling ALS-MlLib from desktop application/ Training ALS

2014-12-13 Thread Saurabh Agrawal

Requesting guidance on my queries in the email below.



-Original Message-
From: Saurabh Agrawal
Sent: Saturday, December 13, 2014 07:06 PM GMT Standard Time
To: user@spark.apache.org
Subject: Building Desktop application for ALS-MlLib/ Training ALS




Hi,



I am a newbie in the Spark and Scala world.



I have been trying to implement Collaborative filtering using MlLib supplied 
out of the box with Spark and Scala



I have 2 problems



1.   The best model was trained with rank = 20 and lambda = 5.0, and 
numIter = 10, and its RMSE on the test set is 25.718710831912485. The best 
model improves the baseline by 18.29%. Is there a scientific way in which RMSE 
could be brought down? What is a decent acceptable value for RMSE?

2.   I picked up the Collaborative filtering algorithm from 
http://ampcamp.berkeley.edu/5/exercises/movie-recommendation-with-mllib.html 
and executed the given code with my dataset. Now, I want to build a desktop 
application around it.

a.   What is the best language to do this, Java or Scala? Any possibility to 
do this using C#?

b.  Can somebody please share any relevant documents/ source or any helper 
links to help me get started on this?



Your help is greatly appreciated



Thanks!!



Regards,

Saurabh Agrawal




Re: Nabble mailing list mirror errors: "This post has NOT been accepted by the mailing list yet"

2014-12-13 Thread Yana Kadiyska
Since you mentioned this, I had a related quandary recently -- it also says
that the forum archives u...@spark.incubator.apache.org /
d...@spark.incubator.apache.org respectively, yet the "Community" page
clearly says to email the @spark.apache.org lists (though the Nabble archive is
linked right there too). IMO, even putting a clear explanation at the top, like

"Posting here requires that you create an account via the UI. Your message
will be sent to both spark.incubator.apache.org and spark.apache.org (if
that is the case; I'm not sure which alias Nabble posts get sent to)"

would make things a lot clearer.

On Sat, Dec 13, 2014 at 5:05 PM, Josh Rosen  wrote:
>
> I've noticed that several users are attempting to post messages to Spark's
> user / dev mailing lists using the Nabble web UI (
> http://apache-spark-user-list.1001560.n3.nabble.com/).  However, there
> are many posts in Nabble that are not posted to the Apache lists and are
> flagged with "This post has NOT been accepted by the mailing list yet."
> errors.
>
> I suspect that the issue is that users are not completing the sign-up
> confirmation process (
> http://apache-spark-user-list.1001560.n3.nabble.com/mailing_list/MailingListOptions.jtp?forum=1),
> which is preventing their emails from being accepted by the mailing list.
>
> I wanted to mention this issue to the Spark community to see whether there
> are any good solutions to address this.  I have spoken to users who think
> that our mailing list is unresponsive / inactive because their un-posted
> messages haven't received any replies.
>
> - Josh
>


Re: unread block data when reading from NFS

2014-12-13 Thread Yana
Someone just posted a very similar question:
http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-IllegalStateException-unread-block-data-tt20668.html

I ran into this a few weeks back -- I can't remember if my jar was built
against a different version of Spark or if I had accidentally included a
Hadoop jar on the path. It's a pretty obscure error, but it did seem to be
related to mismatched versions of jars. I would carefully look at the fat jar
you're submitting and at the executor classpath.
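As an illustration only (not from the original thread): one common way to keep
mismatched Spark or Hadoop classes out of the fat jar is to mark them as
"provided" in the build, so the executor's own jars are the only ones on the
classpath at runtime. A minimal sbt sketch, assuming Spark 1.1.0 and sbt 0.13;
the project name and versions are placeholders to match to your cluster:

name := "my-spark-app"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0" % "provided"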



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/unread-block-data-when-reading-from-NFS-tp20672p20673.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL Roadmap?

2014-12-13 Thread Matei Zaharia
Spark SQL is already available; the reason for the "alpha component" label is 
that we are still tweaking some of the APIs, so we have not yet guaranteed API 
stability for it. However, that is likely to happen soon (possibly 1.3). One of 
the major things added in Spark 1.2 was an external data sources API 
(https://github.com/apache/spark/pull/2475), so we wanted to get a bit of 
feedback on that before providing a stable API for those as well.

Matei

> On Dec 13, 2014, at 5:26 PM, Xiaoyong Zhu  wrote:
> 
> Thanks Denny for your information!
> For #1, what I meant is the Spark SQL beta/official release date (as today it 
> is still in the alpha phase)… though today I see it has most of the basic 
> functionality, I don’t know when the next milestone (i.e. beta) will happen.
> For #2, thanks for the information! I read it and it’s really useful! My take 
> is that Hive on Spark is still Hive (thus having all the metastore 
> information and Hive interfaces such as the REST APIs), while Spark SQL is 
> an extension of Spark that uses several interfaces (HiveContext, for example) 
> to support running Hive queries. Is this correct?
>  
> A follow-up question would be: does Spark SQL have REST APIs, like 
> those WebHCat exposes, to let users submit queries remotely, other than 
> logging into a cluster and executing commands in the spark-sql command line? 
>  
> Xiaoyong
> From: Denny Lee [mailto:denny.g@gmail.com ] 
> Sent: Saturday, December 13, 2014 10:59 PM
> To: Xiaoyong Zhu; user@spark.apache.org 
> Subject: Re: Spark SQL Roadmap?
>  
> Hi Xiaoyong,
> 
> SparkSQL has already been released and has been part of the Spark code-base 
> since Spark 1.0.  The latest stable release is Spark 1.1 (here's the Spark 
> SQL Programming Guide) and we're 
> currently voting on Spark 1.2.
>  
> Hive on Spark is an initiative by Cloudera to help folks who are already 
> using Hive, but instead of using traditional MR it will utilize Spark.  For 
> more information, check out 
> http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/.
>  
> For anyone who is building new projects in Spark, IMHO I would suggest 
> jumping to SparkSQL first.
>  
> HTH!
> Denny
>  
>  
> On Sat Dec 13 2014 at 5:00:56 AM Xiaoyong Zhu  > wrote:
> Dear spark experts, I am very interested in Spark SQL availability in the 
> future – could someone share with me the information about the following 
> questions?
> 1.   Is there some ETAs for the Spark SQL release?
> 
> 2.   I heard there is a Hive on Spark program also – what’s the 
> difference between Spark SQL and Hive on Spark?
> 
>  
> Thanks!
> Xiaoyong



RE: Spark SQL Roadmap?

2014-12-13 Thread Xiaoyong Zhu
Thanks Denny for your information!
For #1, what I meant is the Spark SQL beta/official release date (as today it 
is still in the alpha phase)… though today I see it has most of the basic 
functionality, I don’t know when the next milestone (i.e. beta) will happen.
For #2, thanks for the information! I read it and it’s really useful! My take 
is that Hive on Spark is still Hive (thus having all the metastore information 
and Hive interfaces such as the REST APIs), while Spark SQL is an extension of 
Spark that uses several interfaces (HiveContext, for example) to support 
running Hive queries. Is this correct?
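A minimal sketch (not from this thread) of the HiveContext interface mentioned
above, assuming Spark 1.1, a reachable Hive metastore, and a hypothetical Hive
table named "src":

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveQuerySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveContext sketch"))
    val hiveContext = new HiveContext(sc)

    // With a HiveContext, sql() parses HiveQL against the existing Hive
    // metastore; "src" is a placeholder table name.
    hiveContext.sql("SELECT key, value FROM src LIMIT 10").collect().foreach(println)

    sc.stop()
  }
}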

A follow-up question would be: does Spark SQL have REST APIs, like those 
WebHCat exposes, to let users submit queries remotely, other than logging into 
a cluster and executing commands in the spark-sql command line?

Xiaoyong

From: Denny Lee [mailto:denny.g@gmail.com]
Sent: Saturday, December 13, 2014 10:59 PM
To: Xiaoyong Zhu; user@spark.apache.org
Subject: Re: Spark SQL Roadmap?

Hi Xiaoyong,

SparkSQL has already been released and has been part of the Spark code-base 
since Spark 1.0.  The latest stable release is Spark 1.1 (here's the Spark SQL 
Programming 
Guide) and we're 
currently voting on Spark 1.2.

Hive on Spark is an initiative by Cloudera to help folks who are already using 
Hive, but instead of using traditional MR it will utilize Spark.  For more 
information, check out 
http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/.

For anyone who is building new projects in Spark, IMHO I would suggest jumping 
to SparkSQL first.

HTH!
Denny


On Sat Dec 13 2014 at 5:00:56 AM Xiaoyong Zhu <xiaoy...@microsoft.com> wrote:
Dear spark experts, I am very interested in Spark SQL availability in the 
future – could someone share with me the information about the following 
questions?

1.   Is there an ETA for the Spark SQL release?

2.   I heard there is also a Hive on Spark program – what’s the difference 
between Spark SQL and Hive on Spark?

Thanks!
Xiaoyong


Nabble mailing list mirror errors: "This post has NOT been accepted by the mailing list yet"

2014-12-13 Thread Josh Rosen
I've noticed that several users are attempting to post messages to Spark's
user / dev mailing lists using the Nabble web UI (
http://apache-spark-user-list.1001560.n3.nabble.com/).  However, there are
many posts in Nabble that are not posted to the Apache lists and are
flagged with "This post has NOT been accepted by the mailing list yet."
errors.

I suspect that the issue is that users are not completing the sign-up
confirmation process (
http://apache-spark-user-list.1001560.n3.nabble.com/mailing_list/MailingListOptions.jtp?forum=1),
which is preventing their emails from being accepted by the mailing list.

I wanted to mention this issue to the Spark community to see whether there
are any good solutions to address this.  I have spoken to users who think
that our mailing list is unresponsive / inactive because their un-posted
messages haven't received any replies.

- Josh


Building Desktop application for ALS-MlLib/ Training ALS

2014-12-13 Thread Saurabh Agrawal


Hi,



I am a newbie in the Spark and Scala world.



I have been trying to implement Collaborative filtering using MlLib supplied 
out of the box with Spark and Scala



I have 2 problems



1.   The best model was trained with rank = 20 and lambda = 5.0, and 
numIter = 10, and its RMSE on the test set is 25.718710831912485. The best 
model improves the baseline by 18.29%. Is there a scientific way in which RMSE 
could be brought down? What is a decent acceptable value for RMSE?

2.   I picked up the Collaborative filtering algorithm from 
http://ampcamp.berkeley.edu/5/exercises/movie-recommendation-with-mllib.html 
and executed the given code with my dataset. Now, I want to build a desktop 
application around it.

a.   What is the best language to do this, Java or Scala? Any possibility to 
do this using C#?

b.  Can somebody please share any relevant documents/ source or any helper 
links to help me get started on this?



Your help is greatly appreciated



Thanks!!



Regards,

Saurabh Agrawal






Re: JSON Input files

2014-12-13 Thread Helena Edelson
One solution can be found here: 
https://spark.apache.org/docs/1.1.0/sql-programming-guide.html#json-datasets
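A minimal sketch of that approach (Spark 1.1), assuming the file lives in HDFS 
at a placeholder path and contains one JSON object per line, which is what 
jsonFile expects:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JSON sketch"))
    val sqlContext = new SQLContext(sc)

    // Read and parse: the schema is inferred from the JSON records.
    val records = sqlContext.jsonFile("hdfs:///data/records.json")
    records.printSchema()

    // Query the parsed data with SQL.
    records.registerTempTable("records")
    sqlContext.sql("SELECT * FROM records LIMIT 10").collect().foreach(println)

    sc.stop()
  }
}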

- Helena
@helenaedelson

On Dec 13, 2014, at 11:18 AM, Madabhattula Rajesh Kumar  
wrote:

> Hi Team,
> 
> I have a large JSON file in Hadoop. Could you please let me know 
> 
> 1. How to read the JSON file
> 2. How to parse the JSON file
> 
> Please share any example program based on Scala
> 
> Regards,
> Rajesh



Re: Error: Spark-streaming to Cassandra

2014-12-13 Thread Helena Edelson
I am curious why you use the 1.0.4 java artifact with the latest 1.1.0? This 
might be your compilation problem - the older java artifact version.

<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector_2.10</artifactId>
    <version>1.1.0</version>
</dependency>
<dependency>
    <groupId>com.datastax.spark</groupId>
    <artifactId>spark-cassandra-connector-java_2.10</artifactId>
    <version>1.0.4</version>
</dependency>

See:
-  doc 
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/7_java_api.md
-  mvn repo 
http://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector-java_2.10/1.1.0
  
- Helena
@helenaedelson


On Dec 8, 2014, at 12:47 PM, m.sar...@accenture.com wrote:

> Hi,
> 
> I am intending to save the streaming data from Kafka into Cassandra
> using spark-streaming, but there seems to be a problem with the line
> javaFunctions(data).writerBuilder("testkeyspace", "test_table", 
> mapToRow(TestTable.class)).saveToCassandra();
> I am getting 2 errors.
> The code, the error log, and the POM.xml dependencies are listed below.
> Please help me find the reason why this is happening.
> 
> 
> public class SparkStream {
>static int key=0;
>public static void main(String args[]) throws Exception
>{
>if(args.length != 3)
>{
>System.out.println("SparkStream <zookeeper-quorum> <consumer-group> <topics>");
>System.exit(1);
>}
> 
>Logger.getLogger("org").setLevel(Level.OFF);
>Logger.getLogger("akka").setLevel(Level.OFF);
>Map<String, Integer> topicMap = new HashMap<String, Integer>();
>String[] topic = args[2].split(",");
>for(String t: topic)
>{
>topicMap.put(t, new Integer(3));
>}
> 
>/* Connection to Spark */
>SparkConf conf = new SparkConf();
>JavaSparkContext sc = new JavaSparkContext("local[4]", 
> "SparkStream",conf);
>JavaStreamingContext jssc = new JavaStreamingContext(sc, new 
> Duration(3000));
> 
> 
>  /* connection to cassandra */
> /*conf.set("spark.cassandra.connection.host", "127.0.0.1:9042");
>CassandraConnector connector = CassandraConnector.apply(sc.getConf());
>Session session = connector.openSession();
>session.execute("CREATE TABLE IF NOT EXISTS testkeyspace.test_table 
> (key INT PRIMARY KEY, value TEXT)");
> */
> 
>/* Receive Kafka streaming inputs */
>JavaPairReceiverInputDStream<String, String> messages = 
> KafkaUtils.createStream(jssc, args[0], args[1], topicMap );
> 
> 
>/* Create DStream */
>JavaDStream<TestTable> data = messages.map(new 
> Function<Tuple2<String, String>, TestTable>()
>{
>public TestTable call(Tuple2<String, String> message)
>{
>return new TestTable(new Integer(++key), message._2() );
>}
>}
>);
> 
> 
>/* Write to cassandra */
>javaFunctions(data).writerBuilder("testkeyspace", "test_table", 
> mapToRow(TestTable.class)).saveToCassandra();
>//  data.print();
> 
> 
>jssc.start();
>jssc.awaitTermination();
> 
>}
> }
> 
> class TestTable implements Serializable
> {
>Integer key;
>String value;
> 
>public TestTable() {}
> 
>public TestTable(Integer k, String v)
>{
>key=k;
>value=v;
>}
> 
>public Integer getKey(){
>return key;
>}
> 
>public void setKey(Integer k){
>key=k;
>}
> 
>public String getValue(){
>return value;
>}
> 
>public void setValue(String v){
>value=v;
>}
> 
>public String toString(){
>return MessageFormat.format("TestTable'{'key={0},
> value={1}'}'", key, value);
>}
> }
> 
> The output log is:
> 
> [INFO] Compiling 1 source file to
> /root/Documents/SparkStreamSample/target/classes
> [INFO] 2 errors
> [INFO] -
> [ERROR] COMPILATION ERROR :
> [INFO] -
> [ERROR] 
> /root/Documents/SparkStreamSample/src/main/java/com/spark/SparkStream.java:[76,81]
> cannot find symbol
>  symbol:   method mapToRow(java.lang.Class)
>  location: class com.spark.SparkStream
> [ERROR] 
> /root/Documents/SparkStreamSample/src/main/java/com/spark/SparkStream.java:[76,17]
> no suitable method found for
> javaFunctions(org.apache.spark.streaming.api.java.JavaDStream)
>method 
> com.datastax.spark.connector.CassandraJavaUtil.javaFunctions(org.apache.spark.streaming.api.java.JavaDStream,java.lang.Class)
> is not applicable
>  (cannot infer type-variable(s) T
>(actual and formal argument lists differ in length))
>method 
> com.datastax.spark.connector.CassandraJavaUtil.javaFunctions(org.apache.spark.streaming.dstream.DStream,java.lang.Class)
> is not applicable
>  (cannot infer type-variable(s) T
>(actual and formal argument lists differ in length))
>method 
> com.datastax.spark.connector.CassandraJavaUtil.javaFunctions(org.apache.spark.api.java.JavaRDD,java.lang.Class)
> is not applicable
>  (cannot infer type-variable(s) T
>(actual and formal argument lists differ in length))
>method 
> com.datastax.spark.connecto

JSON Input files

2014-12-13 Thread Madabhattula Rajesh Kumar
Hi Team,

I have a large JSON file in Hadoop. Could you please let me know

1. How to read the JSON file
2. How to parse the JSON file

Please share any example program based on Scala

Regards,
Rajesh


Re: Including data nucleus tools

2014-12-13 Thread spark.dubovsky.jakub
So, to answer my own question: it is a bug, and there is an unmerged PR for it 
already.

https://issues.apache.org/jira/browse/SPARK-2624
https://github.com/apache/spark/pull/3238

Jakub


-- Original message --
From: spark.dubovsky.ja...@seznam.cz
To: spark.dubovsky.ja...@seznam.cz
Date: 12. 12. 2014 15:26:35
Subject: Re: Including data nucleus tools

Hi,

  I had time to try it again. I submitted my app with the same command plus 
these additional options:

  --jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-core-3.2.10.jar,
lib/datanucleus-rdbms-3.2.9.jar

  Now the app successfully creates a hive context. So my question remains: are 
the "classpath entries" shown in the Spark UI the same classpath mentioned in 
the submit script message?

"Spark assembly has been built with Hive, including Datanucleus jars on 
classpath"

  If so, then why does the script fail to actually include the datanucleus 
jars on the classpath? I found no bug about this in JIRA. Or is there a way 
that particular yarn/os settings on our cluster override this?

  Thanks in advance

  Jakub


-- Original message --
From: spark.dubovsky.ja...@seznam.cz
To: Michael Armbrust 
Date: 7. 12. 2014 3:02:33
Subject: Re: Including data nucleus tools

Next try: I copied the whole dist directory created by the make-distribution 
script to the cluster, not just the assembly jar. Then I used

./bin/spark-submit --num-executors 200 --master yarn-cluster --class org.
apache.spark.mllib.CreateGuidDomainDictionary ../spark/root-0.1.jar ${args}

 ...to run the app again. The startup scripts printed this message:

"Spark assembly has been built with Hive, including Datanucleus jars on 
classpath"

  ...so I thought I was finally there. But the job started and failed with the 
same ClassNotFound exception as before. Is the "classpath" in the script 
message just the driver's classpath? Or is it the same classpath affected by 
the --jars option? I tried to find out from the scripts but was not able to 
find where the --jars option is processed.

  thanks


-- Original message --
From: Michael Armbrust 
To: spark.dubovsky.ja...@seznam.cz
Date: 6. 12. 2014 20:39:13
Subject: Re: Including data nucleus tools

On Sat, Dec 6, 2014 at 5:53 AM, spark.dubovsky.ja...@seznam.cz wrote:
Bonus question: Should the class 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory be part of the assembly? 
Because it is not in the jar now.




No, these jars cannot be put into the assembly because they have extra 
metadata files that live in the same location (so if you put them all in an 
assembly they overwrite each other).  This metadata is used in discovery.  
Instead they must be manually put on the classpath in their original form 
(usually using --jars). 



 
"
"
"

Re: Spark SQL Roadmap?

2014-12-13 Thread Denny Lee
Hi Xiaoyong,

SparkSQL has already been released and has been part of the Spark code-base
since Spark 1.0.  The latest stable release is Spark 1.1 (here's the Spark
SQL Programming Guide) and we're
currently voting on Spark 1.2.

Hive on Spark is an initiative by Cloudera to help folks who are already
using Hive, but instead of using traditional MR it will utilize Spark.  For
more information, check out
http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
.

For anyone who is building new projects in Spark, IMHO I would suggest
jumping to SparkSQL first.

HTH!
Denny


On Sat Dec 13 2014 at 5:00:56 AM Xiaoyong Zhu 
wrote:

>  Dear spark experts, I am very interested in Spark SQL availability in
> the future – could someone share with me the information about the
> following questions?
>
> 1.   Is there some ETAs for the Spark SQL release?
>
> 2.   I heard there is a Hive on Spark program also – what’s the
> difference between Spark SQL and Hive on Spark?
>
>
>
> Thanks!
>
> Xiaoyong
>


unread block data when reading from NFS

2014-12-13 Thread gtinside
Hi ,

I am trying to read a csv file in the following way :
val csvData = sc.textFile("file:///tmp/sample.csv") 
csvData.collect().length

This works fine in spark-shell, but when I try to spark-submit the jar,
I get the following exception:
java.lang.IllegalStateException: unread block data
    java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2420)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1380)
    java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1989)
    java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1913)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1348)
    java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
    org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:724)

Can you please help me here ?

Regards,
Gaurav




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/unread-block-data-when-reading-from-NFS-tp20672.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark SQL Roadmap?

2014-12-13 Thread Xiaoyong Zhu
Dear spark experts, I am very interested in Spark SQL availability in the 
future - could someone share with me the information about the following 
questions?

1.   Is there an ETA for the Spark SQL release?

2.   I heard there is also a Hive on Spark program - what's the difference 
between Spark SQL and Hive on Spark?

Thanks!
Xiaoyong


Re: Read data from SparkStreaming from Java socket.

2014-12-13 Thread Guillermo Ortiz
I got it, thanks. A silly question: why, if I do
out.write("hello " + System.currentTimeMillis() + "\n"); it doesn't
detect anything, but if I do
out.println("hello " + System.currentTimeMillis()); it works?

On the Spark side I'm doing
val errorLines = lines.filter(_.contains("hello"))
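For reference, a minimal sketch (not from this thread) of a line-oriented test 
server that socketTextStream can read from. socketTextStream splits incoming 
bytes on newlines, so each record must be a '\n'-terminated line that is 
actually flushed to the socket; a PrintWriter constructed with autoFlush = true 
flushes on every println, whereas write() may leave data sitting in the buffer. 
The port 9999 is a placeholder.

import java.io.PrintWriter
import java.net.ServerSocket

object TestSocketServer {
  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(9999)
    while (true) {
      val socket = server.accept()
      // autoFlush = true: each println is pushed to the client immediately.
      val out = new PrintWriter(socket.getOutputStream, true)
      try {
        (1 to 100).foreach { _ =>
          out.println(s"hello ${System.currentTimeMillis()}")  // newline-terminated record
          Thread.sleep(1000)
        }
      } finally {
        out.close()
        socket.close()
      }
    }
  }
}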


2014-12-13 8:12 GMT+01:00 Tathagata Das :
> Yes, socketTextStream starts a TCP client that tries to connect to a
> TCP server (localhost: in your case). If there is a server running
> on that port that can send data to connected TCP connections, then you
> will receive data in the stream.
>
> Did you check out the quick example in the streaming programming guide?
> http://spark.apache.org/docs/latest/streaming-programming-guide.html
> That has instructions to start a netcat server on port  and send
> data to spark streaming through that.
>
> TD
>
> On Fri, Dec 12, 2014 at 9:54 PM, Akhil Das  wrote:
>> socketTextStream is a socket client which will read from a TCP ServerSocket.
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Dec 12, 2014 at 7:21 PM, Guillermo Ortiz 
>> wrote:
>>>
>>> I don't understand what spark streaming socketTextStream is waiting
>>> for... is it like a server, so you just have to send data from a client? Or
>>> what is it expecting?
>>>
>>> 2014-12-12 14:19 GMT+01:00 Akhil Das :
>>> > I have created a Serversocket program which you can find over here
>>> > https://gist.github.com/akhld/4286df9ab0677a555087 It simply listens to
>>> > the
>>> > given port and when the client connects, it will send the contents of
>>> > the
>>> > given file. I'm attaching the executable jar also, you can run the jar
>>> > as:
>>> >
>>> > java -jar SocketBenchmark.jar student 12345 io
>>> >
>>> > Here student is the file which will be sent to the client whoever
>>> > connects
>>> > on 12345, i have it tested and is working with SparkStreaming
>>> > (socketTextStream).
>>> >
>>> >
>>> > Thanks
>>> > Best Regards
>>> >
>>> > On Fri, Dec 12, 2014 at 6:25 PM, Guillermo Ortiz 
>>> > wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> I'm a newbie with Spark. I'm just trying to use SparkStreaming and
>>> >> filter some data sent with a Java socket, but it's not working... it
>>> >> works when I use ncat.
>>> >>
>>> >> Why is it not working??
>>> >>
>>> >> My sparkcode is just this:
>>> >> val sparkConf = new
>>> >> SparkConf().setMaster("local[2]").setAppName("Test")
>>> >> val ssc = new StreamingContext(sparkConf, Seconds(5))
>>> >> val lines = ssc.socketTextStream("localhost", )
>>> >> val errorLines = lines.filter(_.contains("hello"))
>>> >> errorLines.print()
>>> >>
>>> >> I created a client socket which sends data to that port, but it could
>>> >> connect to any address. I guess that Spark doesn't work like a
>>> >> ServerSocket... what's the way to send data from a socket with Java so
>>> >> that it can be read from socketTextStream?
>>> >>
>>> >> -
>>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> >> For additional commands, e-mail: user-h...@spark.apache.org
>>> >>
>>> >

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: "Session" for connections?

2014-12-13 Thread Ashic Mahtab
Thanks for the response. The fact that they'll get killed when the sc is closed 
is quite useful in this case. I'm looking at a cluster of four workers trying 
to send messages to RabbitMQ, which can have many sessions open without much 
penalty. For other stores (like, say, SQL) and larger clusters, the idle 
connections would be a bigger issue. One last question: if I do leave them open 
till the end of the job, does that mean one per worker or one per RDD 
partition? I'd imagine the former, but wanted to confirm.

Regards,
Ashic.
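For reference, a minimal sketch (not from this thread) of the lazily 
initialized singleton TD describes, assuming the standard RabbitMQ Java client 
and a hypothetical durable queue named "events"; the host and queue names are 
placeholders. A Scala object is initialized once per executor JVM, so every 
partition and batch handled by that executor reuses the same channel (i.e. one 
per worker JVM, not one per partition):

import com.rabbitmq.client.{Channel, Connection, ConnectionFactory}
import org.apache.spark.streaming.dstream.DStream

object RabbitPublisher {
  // Initialized lazily, once per executor JVM, on first use.
  lazy val channel: Channel = {
    val factory = new ConnectionFactory()
    factory.setHost("rabbit-host")                         // placeholder host
    val connection: Connection = factory.newConnection()
    val ch = connection.createChannel()
    ch.queueDeclare("events", true, false, false, null)    // durable queue
    // Close the connection when the executor JVM shuts down (e.g. when the
    // SparkContext is stopped), rather than once per partition.
    sys.addShutdownHook { ch.close(); connection.close() }
    ch
  }
}

object RabbitSink {
  def publishAll(dstream: DStream[String]): Unit = {
    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        val ch = RabbitPublisher.channel                   // reused across partitions
        partition.foreach { msg =>
          ch.basicPublish("", "events", null, msg.getBytes("UTF-8"))
        }
      }
    }
  }
}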

> From: tathagata.das1...@gmail.com
> Date: Sat, 13 Dec 2014 15:16:46 +0800
> Subject: Re: "Session" for connections?
> To: as...@live.com
> CC: user@spark.apache.org
> 
> That is your call. If you think it is not a problem to have large
> number of open but idle connections to your data store, then it is
> probably okay to let them hang around until the executor is killed
> (when the sparkContext is closed).
> 
> TD
> 
> On Fri, Dec 12, 2014 at 11:51 PM, Ashic Mahtab  wrote:
> > Looks like the way to go.
> >
> > Quick question regarding the connection pool approach - if I have a
> > connection that gets lazily instantiated, will it automatically die if I
> > kill the driver application? In my scenario, I can keep a connection open
> > for the duration of the app, and aren't that concerned about having idle
> > connections as long as the app is running. For this specific scenario, do I
> > still need to think of the timeout, or would it be shut down when the driver
> > stops? (Using a stand alone cluster btw).
> >
> > Regards,
> > Ashic.
> >
> >> From: tathagata.das1...@gmail.com
> >> Date: Thu, 11 Dec 2014 06:33:49 -0800
> >
> >> Subject: Re: "Session" for connections?
> >> To: as...@live.com
> >> CC: user@spark.apache.org
> >>
> >> Also, this is covered in the streaming programming guide in bits and
> >> pieces.
> >>
> >> http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
> >>
> >> On Thu, Dec 11, 2014 at 4:55 AM, Ashic Mahtab  wrote:
> >> > That makes sense. I'll try that.
> >> >
> >> > Thanks :)
> >> >
> >> >> From: tathagata.das1...@gmail.com
> >> >> Date: Thu, 11 Dec 2014 04:53:01 -0800
> >> >> Subject: Re: "Session" for connections?
> >> >> To: as...@live.com
> >> >> CC: user@spark.apache.org
> >> >
> >> >>
> >> >> You could create a lazily initialized singleton factory and connection
> >> >> pool. Whenever an executor starts running the firt task that needs to
> >> >> push out data, it will create the connection pool as a singleton. And
> >> >> subsequent tasks running on the executor is going to use the
> >> >> connection pool. You will also have to intelligently shutdown the
> >> >> connections because there is not a obvious way to shut them down. You
> >> >> could have a usage timeout - shutdown connection after not being used
> >> >> for 10 x batch interval.
> >> >>
> >> >> TD
> >> >>
> >> >> On Thu, Dec 11, 2014 at 4:28 AM, Ashic Mahtab  wrote:
> >> >> > Hi,
> >> >> > I was wondering if there's any way of having long running session
> >> >> > type
> >> >> > behaviour in spark. For example, let's say we're using Spark
> >> >> > Streaming
> >> >> > to
> >> >> > listen to a stream of events. Upon receiving an event, we process it,
> >> >> > and if
> >> >> > certain conditions are met, we wish to send a message to rabbitmq.
> >> >> > Now,
> >> >> > rabbit clients have the concept of a connection factory, from which
> >> >> > you
> >> >> > create a connection, from which you create a channel. You use the
> >> >> > channel to
> >> >> > get a queue, and finally the queue is what you publish messages on.
> >> >> >
> >> >> > Currently, what I'm doing can be summarised as :
> >> >> >
> >> >> > dstream.foreachRDD(x => x.forEachPartition(y => {
> >> >> > val factory = ..
> >> >> > val connection = ...
> >> >> > val channel = ...
> >> >> > val queue = channel.declareQueue(...);
> >> >> >
> >> >> > y.foreach(z => Processor.Process(z, queue));
> >> >> >
> >> >> > cleanup the queue stuff.
> >> >> > }));
> >> >> >
> >> >> > I'm doing the same thing for using Cassandra, etc. Now in these
> >> >> > cases,
> >> >> > the
> >> >> > session initiation is expensive, so doing it per message is not a
> >> >> > good
> >> >> > idea.
> >> >> > However, I can't find a way to say "hey...do this per worker once and
> >> >> > only
> >> >> > once".
> >> >> >
> >> >> > Is there a better pattern to do this?
> >> >> >
> >> >> > Regards,
> >> >> > Ashic.
> >> >>
> >> >> -
> >> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> >> For additional commands, e-mail: user-h...@spark.apache.org
> >> >>
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> 
> --